<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#SparkSession" data-toc-modified-id="SparkSession-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>SparkSession</a></span></li><li><span><a href="#Chapter2_종합예제" data-toc-modified-id="Chapter2_종합예제-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Chapter2_End-to-End Example</a></span><ul class="toc-item"><li><span><a href="#SQL-구문-vs-DataFrame-구문" data-toc-modified-id="SQL-구문-vs-DataFrame-구문-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>SQL Syntax vs DataFrame Syntax</a></span></li></ul></li></ul></div>

# SparkSession

In [1]:
spark

In [2]:
myRange = spark.range(1000).toDF('number')
myRange

DataFrame[number: bigint]

# Chapter2_End-to-End Example

In [3]:
# Preview the CSV file that will be loaded
!head /Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2015-summary.csv

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
Egypt,United States,15
United States,India,62
United States,Singapore,1
United States,Grenada,62
Costa Rica,United States,588
Senegal,United States,40


In [6]:
# Load the CSV file
flightData2015 = spark\
                 .read\
                 .option('inferSchema', 'true')\
                 .option('header', 'true')\
                 .csv('/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2015-summary.csv')

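In [None]:
# Load the CSV file (same read as above, with each option annotated).
# A comment cannot follow a backslash line continuation, so the chain is wrapped in parentheses.
flightData2015 = (
    spark.read
    .option('inferSchema', 'true')  # infer the column types from the data
    .option('header', 'true')       # treat the first row as the header
    .csv('/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2015-summary.csv')
)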

In [8]:
print(flightData2015)
type(flightData2015)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]


pyspark.sql.dataframe.DataFrame

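- Why the row count of the loaded DF is not known: reading the data is a lazily evaluated transformation. Spark reads only as much of the data as it needs for **schema inference** and **header detection**; it does not count the rows.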
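The lazy behaviour described above can be modelled in plain Python with a generator. This is only an analogy, not Spark code: `lazy_read` and `read_log` are hypothetical names, and the three sample lines are taken from the CSV preview.

```python
import itertools

def lazy_read(lines, log):
    """Yield parsed rows one at a time, recording every line that is actually read."""
    for line in lines:
        log.append(line)
        yield line.split(',')

source = [
    'United States,Romania,15',
    'United States,Croatia,1',
    'Egypt,United States,15',
]
read_log = []

# "Transformation": building the generator reads nothing yet.
rows = lazy_read(source, read_log)
lines_read_before_action = len(read_log)

# "Action": asking for two rows pulls exactly two lines from the source.
first_two = list(itertools.islice(rows, 2))
lines_read_after_action = len(read_log)
```

Nothing is read until a result is requested, and only the requested rows are pulled, which is why no row count is available right after the read is declared.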

In [8]:
flightData2015

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]

- ``take()``: like ``head``, previews the data and returns the rows as an array or list

In [12]:
# Each Row is stored like a named tuple, i.e. it is immutable
flightData2015.take(5)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62)]
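To make the "named tuple, therefore immutable" remark concrete, here is a small sketch that uses the standard library's `namedtuple` as a stand-in for a Spark `Row`. The `FlightRow` type and its shortened field names are hypothetical, chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; field names are shortened for the example.
FlightRow = namedtuple('FlightRow', ['dest', 'origin', 'flights'])
row = FlightRow('United States', 'Romania', 15)

# Assigning to a field fails because tuples cannot be modified in place.
try:
    row.flights = 99
    mutated = True
except AttributeError:
    mutated = False

# The way to "change" a value is to build a new tuple from the old one.
updated = row._replace(flights=99)
```

The original `row` still holds 15 after `_replace`, which mirrors how DataFrame operations return new objects instead of editing existing ones.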

- ``sort()``, a wide-dependency method that requires a shuffle, does not modify the DF; it creates a new DF. Calling ``sort()`` therefore leaves the original DF unchanged.
- ``explain()``: shows the DF's lineage and the Spark query execution plan

In [13]:
# Specify the column to sort by
flightData2015.sort('count').explain()

== Physical Plan ==
*(1) Sort [count#22 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#22 ASC NULLS FIRST, 200), true, [id=#50]
   +- FileScan csv [DEST_COUNTRY_NAME#20,ORIGIN_COUNTRY_NAME#21,count#22] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>




- Read the execution plan from ``top -> bottom``
- The final result is at the very top; the bottom is the data source
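The plan above has an `Exchange rangepartitioning` step followed by a `Sort`. A plain-Python model of that idea, under the assumption that each partition receives one contiguous range of values, is sketched below. `range_partition` and the `boundaries` values are hypothetical; Spark picks its own split points by sampling the data.

```python
import bisect

def range_partition(values, boundaries):
    """Place each value in the partition whose range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for value in values:
        partitions[bisect.bisect_right(boundaries, value)].append(value)
    return partitions

# The count column from the CSV preview.
counts = [15, 1, 344, 15, 62, 1, 62, 588, 40]
boundaries = [15, 100]  # hypothetical split points for three partitions

# "Exchange": move every value to the partition that owns its range.
partitions = range_partition(counts, boundaries)

# "Sort": each partition is sorted independently.
sorted_partitions = [sorted(p) for p in partitions]

# Because the ranges do not overlap, concatenating the partitions is globally sorted.
globally_sorted = [v for p in sorted_partitions for v in p]
```

This is why the shuffle comes first in the plan (lower line) and the sort sits above it: once rows are range-partitioned, sorting within each partition is enough.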

In [14]:
# sort() performs a shuffle, which by default produces 200 shuffle partitions
# This default can be changed beforehand
spark.conf.set('spark.sql.shuffle.partitions', '5')

flightData2015.sort('count').take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

In [15]:
# Changing the number of shuffle partitions changes the runtime
import time

spark.conf.set('spark.sql.shuffle.partitions', '1000')
start = time.time()

flightData2015.sort('count').take(5)

print('Runtime:', time.time() - start)

Runtime: 0.10617208480834961


In [16]:
spark.conf.set('spark.sql.shuffle.partitions', '5')
start = time.time()

flightData2015.sort('count').take(5)

print('Runtime:', time.time() - start)

Runtime: 0.09792113304138184
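A single wall-clock measurement of roughly 0.1 seconds is noisy, so a comparison like the one above is easier to trust when each setting is timed several times. The helper below is a minimal sketch of that idea. `time_action` is a hypothetical name, and the sorted-list lambda only stands in for a Spark action such as `lambda: flightData2015.sort('count').take(5)`.

```python
import time

def time_action(action, repeats=5):
    """Run `action` several times; return (best elapsed seconds, last result)."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; with Spark, pass a lambda that triggers an action instead.
best, result = time_action(lambda: sorted([15, 1, 344, 15, 62])[:2])
print('Best runtime:', best)
```

Taking the minimum over several runs reduces the influence of one-off effects such as warm-up on the first call.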


## SQL Syntax vs DataFrame Syntax

- There is no performance difference, because both compile to the same execution plan.
- To use Spark SQL, **first register the DataFrame as a table or a view (temporary table)**

In [17]:
# Register a temporary view so the DataFrame can be queried with Spark SQL
flightData2015.createOrReplaceTempView('flight_data_2015')

In [18]:
flightData2015.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [19]:
# SQL syntax
# Note: the spark variable is the SparkSession instance
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME""")

sqlWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#20, 5), true, [id=#103]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[partial_count(1)])
      +- FileScan csv [DEST_COUNTRY_NAME#20] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>




In [20]:
# DataFrame syntax
dataFrameWay = flightData2015\
               .groupBy("DEST_COUNTRY_NAME")\
               .count()

dataFrameWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#20, 5), true, [id=#122]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[partial_count(1)])
      +- FileScan csv [DEST_COUNTRY_NAME#20] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
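Both plans above show the same two-stage shape: a `partial_count` inside each partition, an `Exchange` that groups rows by key, and a final `count` that merges the partial results. The following plain-Python model sketches that shape. The function names and the two sample partitions are hypothetical, and this is not how Spark implements it internally.

```python
from collections import Counter

def partial_count(partition):
    """Stage 1: count keys within one partition, with no data movement."""
    return Counter(partition)

def final_count(partials):
    """Stage 2: after the shuffle, merge the partial counts for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two hypothetical partitions of the DEST_COUNTRY_NAME column.
partitions = [
    ['United States', 'United States', 'Egypt'],
    ['United States', 'Costa Rica', 'Egypt'],
]

totals = final_count(partial_count(p) for p in partitions)
```

Counting locally first means only one small record per key and partition has to cross the shuffle, instead of every row.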




- Next, a query that finds *the maximum number of flights to and from any given location*

In [21]:
# SQL syntax
spark.sql("SELECT max(count) FROM flight_data_2015").take(1)

[Row(max(count)=370002)]

In [22]:
# DataFrame syntax
# Import the module under an alias so that Python's built-in max() is not shadowed
from pyspark.sql import functions as F

flightData2015.select(F.max('count')).take(1)

[Row(max(count)=370002)]

- A query that finds *the top five destination countries*

In [23]:
# SQL syntax
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5""")

maxSql.show(5)

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [24]:
maxSql.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[aggOrder#92L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#20,destination_total#90L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[sum(cast(count#22 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#20, 5), true, [id=#231]
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[partial_sum(cast(count#22 as bigint))])
         +- FileScan csv [DEST_COUNTRY_NAME#20,count#22] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>




In [25]:
# DataFrame syntax
from pyspark.sql.functions import desc

dfWay = flightData2015\
        .groupBy('DEST_COUNTRY_NAME')\
        .sum('count')\
        .withColumnRenamed('sum(count)', 'destination_total')\
        .sort(desc('destination_total'))\
        .limit(5)
# show() returns None, so it is called separately to keep dfWay a DataFrame
dfWay.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [26]:
dfWay = flightData2015\
        .groupBy('DEST_COUNTRY_NAME')\
        .sum('count')\
        .withColumnRenamed('sum(count)', 'destination_total')\
        .sort(desc('destination_total'))\
        .limit(5)
dfWay.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#139L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#20,destination_total#139L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[sum(cast(count#22 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#20, 5), true, [id=#293]
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#20], functions=[partial_sum(cast(count#22 as bigint))])
         +- FileScan csv [DEST_COUNTRY_NAME#20,count#22] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/younghun/Desktop/gitrepo/data/spark_perfect_guide/flight-data/csv/2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
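In both plans the `sort` plus `limit` pair appears as a single `TakeOrderedAndProject` step instead of a full sort. One way to picture this is a top-k selection: each partition keeps only its own best k rows, and the small leftovers are merged. The sketch below models that with `heapq.nlargest`. The `top_k` helper and the split into two partitions are assumptions for illustration; the totals are taken from the output table above.

```python
import heapq

def top_k(rows, k):
    """Keep the k rows with the largest totals without sorting everything."""
    return heapq.nlargest(k, rows, key=lambda row: row[1])

# Hypothetical partitioning of the aggregated (country, destination_total) rows.
partitions = [
    [('Japan', 1548), ('United States', 411352), ('Mexico', 7140)],
    [('United Kingdom', 2025), ('Canada', 8399)],
]

# Step 1: each partition keeps only its local top 2.
local_tops = [top_k(p, 2) for p in partitions]

# Step 2: merge the local winners and take the global top 2.
merged = top_k([row for top in local_tops for row in top], 2)
```

The global top k must be among the local top k of some partition, so discarding everything else early is safe and avoids sorting the whole dataset.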


