## Starting Spark

This notebook contains practice code from Chapters 1 and 2 of *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia.

In [16]:
from pyspark.sql import SparkSession

In [17]:
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("MyFirstSparkNotebook") \
    .master("local[*]") \
    .getOrCreate()

In [18]:
spark

In [19]:
# This creates a DataFrame with one column containing 1,000 rows with values from 0 to 999
myRange = spark.range(1000).toDF("number")

## TRANSFORMATIONS
Let's perform a simple transformation to find all even numbers in our current DataFrame.

In [20]:
divisby2 = myRange.where("number % 2 = 0")

## ACTIONS
To trigger the computation, we run an action.
The simplest action is count, which gives us the total number of records in the DataFrame:

In [21]:
divisby2.count()

500
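
The split between defining a transformation and triggering it with an action can be sketched in plain Python. This is only an analogy, not Spark itself: a generator stands in for the lazy DataFrame, and the helper `is_even` and the `evaluated` list are hypothetical names added here to record when work actually happens.

```python
# Plain-Python analogy only: a generator stands in for a lazy Spark transformation.
evaluated = []  # records every value the filter actually inspected

def is_even(n):
    evaluated.append(n)
    return n % 2 == 0

numbers = range(1000)                        # plays the role of spark.range(1000)
evens = (n for n in numbers if is_even(n))   # "transformation": nothing is filtered yet
checked_before_action = len(evaluated)       # still 0 at this point

even_count = sum(1 for _ in evens)           # "action": forces the filter to run
checked_after_action = len(evaluated)        # now every value has been inspected

print(checked_before_action, even_count, checked_after_action)  # 0 500 1000
```

The filter runs only when the count is requested, which mirrors why `where` returns immediately while `count()` triggers the computation.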

## An End-to-End Example:
We’ll use Spark to analyze some flight data from the United States Bureau of
Transportation Statistics.

Reading with Spark

- spark.read (a property, not a method call) returns the DataFrameReader
- .option("inferSchema", "true") tells Spark to infer each column's data type from the data, which requires an extra pass over the file
- .option("header", "true") tells Spark that the first row contains column names
- .csv() specifies the format and the path of the file

In [28]:
flightData2015 = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("data/flight-data/csv/2015-summary.csv")


In [29]:
# take(3) is an action: it executes the read,
# pulls the first 3 rows from the distributed DataFrame to the driver,
# and returns them as a local Python list of Row objects
flightData2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]
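
To make the header and inferSchema options more concrete, here is a rough plain-Python sketch of the idea, using the three rows returned by take(3) above. It is not how Spark implements inference: the helper `infer_type` is a hypothetical name and only distinguishes integers from strings, whereas Spark recognizes many more types.

```python
import csv
import io

# The three rows shown in the take(3) output, written back out as CSV text.
sample_csv = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
"""

def infer_type(values):
    """Return 'int' if every value parses as an integer, otherwise 'string'."""
    try:
        for value in values:
            int(value)
        return "int"
    except ValueError:
        return "string"

rows = list(csv.reader(io.StringIO(sample_csv)))
header, data = rows[0], rows[1:]   # header=true: the first row holds column names
columns = list(zip(*data))         # column-wise view of the data rows
schema = {name: infer_type(col) for name, col in zip(header, columns)}

print(schema)
# {'DEST_COUNTRY_NAME': 'string', 'ORIGIN_COUNTRY_NAME': 'string', 'count': 'int'}
```

Deciding a column's type requires looking at its values, which is why schema inference costs an extra read of the data.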

## Sorting
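sort() is a transformation, so on its own it does **not** execute anything. It only defines a step:

- sort the DataFrame by the count column

Calling explain() prints the physical plan Spark would run, without running the sort itself.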

In [32]:
flightData2015.sort("count").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#93 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(count#93 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=168]
      +- FileScan csv [DEST_COUNTRY_NAME#91,ORIGIN_COUNTRY_NAME#92,count#93] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/cs..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>




## Configure Shuffle Output
This sets the **number of output partitions** when a shuffle (like a sort) occurs. The default is 200, which is why the Exchange step in the plan above shows 200 partitions.

In [34]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

## Trigger the Job with .take()

- .take(2) is an action which:
    - Sorts the DataFrame
    - Shuffles data
    - Takes the first 2 rows


In [35]:
flightData2015.sort("count").take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

## What createOrReplaceTempView() Does
This line **does not write anything to disk**. Instead, it:

- Registers the flightData2015 **DataFrame** as a **temporary SQL view** named "flight_data_2015"
- Makes the DataFrame **accessible via SQL syntax** in the same Spark session
- Allows us to **run SQL queries** on it just like we would on a SQL table

Spark doesn’t distinguish between SQL and DataFrame code internally.
Whether you use .filter() and .groupBy() in Python or SELECT ... WHERE ... in SQL,
Spark compiles equivalent queries into the same underlying plan.

In [48]:
flightData2015.createOrReplaceTempView("flight_data_2015")

- Now we can query our data in SQL.
- To do so, we'll use the spark.sql function

In [42]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

dataFrameWay = flightData2015\
    .groupby("DEST_COUNTRY_NAME")\
    .count()

sqlWay.explain()
dataFrameWay.explain()


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#91], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#91, 5), ENSURE_REQUIREMENTS, [plan_id=190]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#91], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#91] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/cs..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#91], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#91, 5), ENSURE_REQUIREMENTS, [plan_id=203]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#91], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#91] Batched: false, DataFilters: [], Format: CSV, Location: InMe
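
Reading the plans bottom-up, both show the same three stages: a partial count per input partition, an Exchange that hash-partitions by DEST_COUNTRY_NAME into 5 partitions, and a final count. The sketch below walks through that shape in plain Python. It is an illustration under stated assumptions, not Spark's implementation: the input rows are hypothetical, and `zlib.crc32` is a stand-in for Spark's own hash function.

```python
from collections import Counter
import zlib

NUM_SHUFFLE_PARTITIONS = 5  # mirrors the spark.sql.shuffle.partitions value set earlier

# Hypothetical input partitions; each value is one row's DEST_COUNTRY_NAME.
input_partitions = [
    ["United States", "Canada", "United States"],
    ["Mexico", "United States", "Canada"],
]

# Stage 1 (partial_count): count rows per key inside each input partition.
partial_counts = [Counter(part) for part in input_partitions]

def target_partition(key):
    # Stand-in hash (assumption): crc32 is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHUFFLE_PARTITIONS

# Stage 2 (Exchange hashpartitioning): route partial results by key hash,
# so all partial counts for one key land in the same shuffle partition.
shuffled = [[] for _ in range(NUM_SHUFFLE_PARTITIONS)]
for partial in partial_counts:
    for key, n in partial.items():
        shuffled[target_partition(key)].append((key, n))

# Stage 3 (final count): add up the partial counts within each shuffle partition.
final_counts = Counter()
for bucket in shuffled:
    for key, n in bucket:
        final_counts[key] += n

print(dict(final_counts))
```

Counting locally first means only one small record per key and input partition crosses the shuffle, rather than every row.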

## Option 1: SQL Query Interface

We can write pure SQL using spark.sql() if we prefer a traditional, declarative style:

In [43]:
spark.sql("SELECT max(count) FROM flight_data_2015").take(1)

[Row(max(count)=370002)]

## Option 2: DataFrame API
We can use Spark's functional API to build transformations fluently:

In [44]:
from pyspark.sql import functions as F

# Use F.max so that Python's built-in max() is not shadowed
flightData2015.select(F.max("count")).take(1)

[Row(max(count)=370002)]

Both methods compile to the **same logical plan** internally.
That means:

- Same optimization engine
- Same performance
- Same execution DAG
    
**So it's really a matter of style and use case!**

## A Slightly More Advanced Query: Find the Top Five Destinations in the Data
This is our first multi-transformation query.


## Option 1: SQL Query Syntax

In [50]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) AS destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



## Option 2: DataFrame Syntax

In [53]:
from pyspark.sql.functions import desc

flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .sum("count")\
    .withColumnRenamed("sum(count)", "destination_total")\
    .sort(desc("destination_total"))\
    .limit(5)\
    .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [54]:
flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .sum("count")\
    .withColumnRenamed("sum(count)", "destination_total")\
    .sort(desc("destination_total"))\
    .limit(5)\
    .explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- TakeOrderedAndProject(limit=5, orderBy=[destination_total#181L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#91,destination_total#181L])
   +- HashAggregate(keys=[DEST_COUNTRY_NAME#91], functions=[sum(count#93)])
      +- Exchange hashpartitioning(DEST_COUNTRY_NAME#91, 5), ENSURE_REQUIREMENTS, [plan_id=380]
         +- HashAggregate(keys=[DEST_COUNTRY_NAME#91], functions=[partial_sum(count#93)])
            +- FileScan csv [DEST_COUNTRY_NAME#91,count#93] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/satkarkarki/spark_the_definitive_guide/data/flight-data/cs..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
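
Note that the plan has no separate Sort step: the sort and the limit are fused into a single TakeOrderedAndProject operator. A rough way to picture why a full sort is unnecessary is that each partition only needs to keep its own top five rows before the results are merged. The plain-Python sketch below illustrates that idea with hypothetical data and names. It is an analogy, not Spark's actual code.

```python
import heapq

LIMIT = 5

# Hypothetical (destination, total) rows spread over two partitions.
partitions = [
    [("A", 40), ("B", 7), ("C", 95), ("D", 12), ("E", 63), ("F", 3)],
    [("G", 88), ("H", 51), ("I", 29), ("J", 77), ("K", 5), ("L", 18)],
]

def by_total(row):
    return row[1]

# Each partition keeps only its own top LIMIT rows...
local_tops = [heapq.nlargest(LIMIT, part, key=by_total) for part in partitions]

# ...and the small per-partition results are merged into the final top LIMIT.
candidates = [row for top in local_tops for row in top]
top_five = heapq.nlargest(LIMIT, candidates, key=by_total)

print(top_five)
# [('C', 95), ('G', 88), ('J', 77), ('E', 63), ('H', 51)]
```

Any row outside its partition's top five cannot be in the global top five, so dropping it early is safe.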


