### grp

# Spark: The Definitive Guide

## PART 1: Gentle Overview of Big Data and Spark

## dataPaths

In [1]:
flightData2015 = '/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv'
retailDataDay = '/Users/grp/sparkTheDefinitiveGuide/data/retail-data/by-day/'

## _Chapter #1 - What is Apache Spark?_

-  Unified computing engine for parallel data processing distributed across clusters (machine nodes)   
<br>
    -  **Structured APIs**:
        -  Datasets
        -  DataFrames
        -  SQL   
        <br>
    -  **Unstructured APIs**:
        -  RDDs   
        <br>
    -  **Libraries**:
        -  Structured Streaming
        -  Machine Learning
        -  Graph   
        <br>
    -  **Resource Manager**:
        -  Local
        -  Standalone (Cluster)
        -  YARN (Cluster)
        -  Mesos (Cluster)   
        <br>
    -  **Language APIs**:
        -  Scala
        -  Java
        -  Python
        -  SQL
        -  R

## _Chapter #2 - A Gentle Introduction to Spark_

-  **Spark Applications**:
    -  **Driver** (heart of Spark Application during application's lifecycle):
        -  maintains information about Spark Application
        -  responds to user's program / input
        -  distributes and schedules work across executors   
        <br>
    -  **Executors**:
        -  execute work (code) assigned by the driver
        -  report the state of that work back to the driver node   
        <br>
    -  **SparkSession**:
        -  entry point that manages Spark Application via driver process   
        <br>
    -  **DataFrames**:
        -  represent a table of data with rows and columns
        -  described by a schema that defines the column names and data types   
        <br>
    -  **Partitions**:
        -  chunks of data distributed across cluster for parallel execution
        -  in addition, a collection of rows sitting on one physical machine in cluster
        -  effective parallelism = min(partitions, executors) (ex: 1 partition / 1,000 executors = parallelism of 1; 1,000 partitions / 1 executor = parallelism of 1)   
        <br>
    -  **Lazy Evaluation**:
        -  records transformations on the source data as a plan (DAG) and runs that DAG only when an action is called
        -  compiles the logical plan into a physical plan that will run across the cluster   
        <br>
    -  **Transformations**:
         -  data manipulations and modifications   
            <br>
             -  **Narrow Transformations** (1 to 1):
                 -  each input partition will contribute to only one output partition
                 -  narrow dependency => 1 parent partition w/ 1 child partition
                 -  ex: filter, map   
                <br>
             -  **Wide Transformations** (1 to N):
                  -  a.k.a. shuffle
                  -  wide dependency => 1 parent partition w/ many child partitions
                  -  each input partition will contribute to many output partitions across the cluster
                  -  **when a shuffle occurs Spark writes the results to disk**; the number of output partitions is set by spark.sql.shuffle.partitions
                  -  ex: aggregations, joins, groupings   
       <br>           
    -  **Actions**:
        -  trigger execution of the accumulated transformations as a Spark job
            -  types:
                -  view data in the console
                -  collect data
                -  write to output data sources
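
The partition and transformation notes above can be sketched in plain Python. This is not Spark code: the list-of-lists `partitions`, the toy byte-sum hash, and the country codes are all assumptions made for illustration. It only shows why parallelism is capped by the smaller of partitions and executors, and how a narrow step keeps rows in place while a wide step moves them by key.

```python
from collections import defaultdict

def effective_parallelism(num_partitions, num_executor_slots):
    """Tasks that can run at once: one task per partition, one slot per task."""
    return min(num_partitions, num_executor_slots)

def narrow_filter(partitions, predicate):
    """Narrow: every output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in partitions]

def wide_group_sum(partitions, num_output_partitions):
    """Wide: rows are redistributed by key, so one input partition feeds many outputs."""
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            # Deterministic toy hash (an assumption of this sketch, not Spark's hash).
            target = sum(key.encode()) % num_output_partitions
            shuffled[target][key] += value
    return [dict(p) for p in shuffled]

# Hypothetical (key, count) rows split across two input partitions.
partitions = [[("US", 15), ("CA", 1)], [("US", 344), ("MX", 7)]]
filtered = narrow_filter(partitions, lambda row: row[1] > 5)
grouped = wide_group_sum(partitions, 2)
```

Note that the `"US"` rows start in two different input partitions and end up summed in a single output partition, which is the data movement a shuffle pays for.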

## **Additional Definitions**:
   -  Spark Job (the set of transformations triggered by an individual action)
   -  Schema Inference (Spark makes a best guess at the schema of the data) ***triggers a Spark Job, because Spark must scan the data***
   -  Spark-Submit (launches application code on a cluster)
   -  Catalyst (Spark SQL's engine for planning and optimizing work)
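
To see why schema inference has to read the data, here is a minimal plain-Python sketch. It is not Spark's inference logic; `infer_type` and `infer_schema` are hypothetical helpers, and the three sample rows are the ones returned by `take(3)` later in these notes. The point is only that a column's type cannot be decided without looking at its values.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return type_name
        except ValueError:
            continue
    return "string"

def infer_schema(csv_text):
    """Read every row, then decide one type per column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)
    return {name: infer_type(col) for name, col in zip(header, columns)}

sample = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
"""
schema = infer_schema(sample)
```

Supplying an explicit schema avoids this extra pass, which is the trade-off the note above hints at.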

### _Spark runs separate Python and R processes; hence, when using Spark from the Python or R language API, the Python or R code is translated into code that Spark can run on the executor JVMs_

### _Chapter #2 - Exercises_

In [2]:
flightDataDF2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv(flightData2015)

In [3]:
flightDataDF2015.rdd.getNumPartitions()

1

In [4]:
flightDataDF2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

-  **Explain Plan**:
    -  displays the DataFrame's lineage (how Spark will execute the query)

In [5]:
flightDataDF2015.sort("count").explain()

== Physical Plan ==
*Sort [count#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#14 ASC NULLS FIRST, 200)
   +- *FileScan csv [DEST_COUNTRY_NAME#12,ORIGIN_COUNTRY_NAME#13,count#14] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
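
The call above prints a plan without sorting anything, which is lazy evaluation at work. A toy version of the idea, in plain Python rather than Spark, might look like the following. `LazyFrame` is a hypothetical class invented for this sketch: transformations only append to a recorded plan, and nothing runs until the `collect` action.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations only extend a plan."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (label, function) steps

    def filter(self, predicate, label="filter"):
        step = (label, lambda rows: [r for r in rows if predicate(r)])
        return LazyFrame(self._rows, self._plan + (step,))

    def sort(self, key, label="sort"):
        step = (label, lambda rows: sorted(rows, key=key))
        return LazyFrame(self._rows, self._plan + (step,))

    def explain(self):
        """Describe the plan without touching the data."""
        return " -> ".join(["scan"] + [label for label, _ in self._plan])

    def collect(self):
        """Action: only now are the recorded steps executed, in order."""
        rows = list(self._rows)
        for _, fn in self._plan:
            rows = fn(rows)
        return rows

counts = LazyFrame([15, 1, 344])
plan = counts.filter(lambda c: c > 1).sort(key=lambda c: c)
description = plan.explain()
result = plan.collect()
```

Because the whole plan is known before anything executes, an engine has the chance to reorder or combine steps, which is what Spark's planner does with the real DAG.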


-  **spark.sql.shuffle.partitions**:
    -  by default, a shuffle outputs 200 partitions

In [6]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [7]:
flightDataDF2015.sort("count").explain()

== Physical Plan ==
*Sort [count#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#14 ASC NULLS FIRST, 5)
   +- *FileScan csv [DEST_COUNTRY_NAME#12,ORIGIN_COUNTRY_NAME#13,count#14] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
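
The `Exchange rangepartitioning(..., 5)` step in the plan above splits rows into 5 key ranges so that each partition can be sorted on its own. A rough plain-Python sketch of that idea follows. It is an illustration under assumptions: the `counts` values are hypothetical, and Spark's own sampling and boundary selection are more involved than this.

```python
import bisect

def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sorted sample of the sort key."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def range_partition(values, boundaries):
    """Send each value to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        partitions[bisect.bisect_right(boundaries, v)].append(v)
    return partitions

# Hypothetical values for the "count" column.
counts = [15, 1, 344, 62, 588, 2, 20, 1048, 7, 130]
boundaries = range_boundaries(counts, 5)
partitions = range_partition(counts, boundaries)

# Sorting each partition locally and reading them in order gives a global sort.
ordered = [v for part in partitions for v in sorted(part)]
```

With `spark.sql.shuffle.partitions` at 200 the same data would be cut into 200 ranges, most of them tiny for a file this small, which is why the notes lower it to 5.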


In [8]:
flightDataDF2015.createOrReplaceTempView("flight_data_2015")

In [9]:
sqlWay = spark\
.sql("""
select dest_country_name, count(1)
from flight_data_2015
group by dest_country_name
""")

In [10]:
dataFrameWay = flightDataDF2015\
.groupBy("dest_country_name")\
.count()

In [11]:
sqlWay.explain()
dataFrameWay.explain()

== Physical Plan ==
*HashAggregate(keys=[dest_country_name#12], functions=[count(1)])
+- Exchange hashpartitioning(dest_country_name#12, 5)
   +- *HashAggregate(keys=[dest_country_name#12], functions=[partial_count(1)])
      +- *FileScan csv [DEST_COUNTRY_NAME#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
== Physical Plan ==
*HashAggregate(keys=[dest_country_name#12], functions=[count(1)])
+- Exchange hashpartitioning(dest_country_name#12, 5)
   +- *HashAggregate(keys=[dest_country_name#12], functions=[partial_count(1)])
      +- *FileScan csv [DEST_COUNTRY_NAME#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
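
Both plans above are identical, and both show a two-phase aggregation: `partial_count` inside each input partition, an `Exchange hashpartitioning` shuffle, then the final `count`. The sketch below mimics those three steps in plain Python. The input rows and the byte-sum hash are assumptions made for illustration, not what Spark uses internally.

```python
from collections import Counter

def partial_count(partition):
    """Phase 1 (partial_count): count keys inside one partition, no data movement."""
    return Counter(partition)

def exchange(partials, num_shuffle_partitions):
    """Phase 2 (Exchange hashpartitioning): route each key's partial count by hash."""
    buckets = [[] for _ in range(num_shuffle_partitions)]
    for partial in partials:
        for key, n in partial.items():
            # Toy hash: the same key always lands in the same bucket.
            buckets[sum(key.encode()) % num_shuffle_partitions].append((key, n))
    return buckets

def final_count(bucket):
    """Phase 3 (final count): add up the partial counts for each key."""
    total = Counter()
    for key, n in bucket:
        total[key] += n
    return total

# Hypothetical dest_country_name values in two input partitions.
input_partitions = [
    ["United States", "Canada", "United States"],
    ["Mexico", "United States", "Canada"],
]
partials = [partial_count(p) for p in input_partitions]
buckets = exchange(partials, 5)
result = Counter()
for bucket in buckets:
    result.update(final_count(bucket))
```

Counting before the shuffle means only one small record per key and partition crosses the network, instead of every row.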


In [12]:
from pyspark.sql.functions import desc

In [13]:
flightDataDF2015\
.groupBy("dest_country_name")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#56L DESC NULLS LAST], output=[dest_country_name#12,destination_total#56L])
+- *HashAggregate(keys=[dest_country_name#12], functions=[sum(cast(count#14 as bigint))])
   +- Exchange hashpartitioning(dest_country_name#12, 5)
      +- *HashAggregate(keys=[dest_country_name#12], functions=[partial_sum(cast(count#14 as bigint))])
         +- *FileScan csv [DEST_COUNTRY_NAME#12,count#14] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


In [14]:
flightDataDF2015\
.groupBy("dest_country_name")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(3)\
.show()

+-----------------+-----------------+
|dest_country_name|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
+-----------------+-----------------+



## _Chapter #3 - A Tour of Spark's Toolset_

### _Chapter #3 - Exercise (Structured Streaming)_

-  **Structured Streaming**:
    -  read streams
    -  window functions
    -  triggers
    -  write streams

In [15]:
staticDataFrame = spark\
.read\
.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load(retailDataDay)

In [16]:
staticDataFrame.createOrReplaceTempView("retail_data")
staticSchema = staticDataFrame.schema

In [17]:
from pyspark.sql.functions import window, column, desc, col

In [18]:
staticDataFrame\
.selectExpr(\
           "CustomerId",
           "(UnitPrice * Quantity) as total_cost",\
           "InvoiceDate")\
.groupBy(\
        col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")\
.show(3, truncate=False)

+----------+---------------------------------------------+------------------+
|CustomerId|window                                       |sum(total_cost)   |
+----------+---------------------------------------------+------------------+
|14075.0   |[2011-12-04 18:00:00.0,2011-12-05 18:00:00.0]|316.78000000000003|
|18180.0   |[2011-12-04 18:00:00.0,2011-12-05 18:00:00.0]|310.73            |
|15358.0   |[2011-12-04 18:00:00.0,2011-12-05 18:00:00.0]|830.0600000000003 |
+----------+---------------------------------------------+------------------+
only showing top 3 rows
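
The window bounds in the output above start at 18:00 rather than midnight. A plausible reading is that 1-day tumbling windows are aligned to the Unix epoch in UTC and then displayed in the session's local time zone. The sketch below reproduces that arithmetic in plain Python; the event timestamp and the UTC-6 offset are assumptions chosen to match the output, not values taken from the data.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_window(ts, size):
    """Return the [start, end) window of length `size` that contains `ts`."""
    elapsed = ts - EPOCH
    start = EPOCH + (elapsed // size) * size
    return start, start + size

# Hypothetical event time.
event = datetime(2011, 12, 5, 10, 30, tzinfo=timezone.utc)
start, end = tumbling_window(event, timedelta(days=1))

# The same instant rendered in a UTC-6 zone falls at 18:00 on the previous day.
local_start = start.astimezone(timezone(timedelta(hours=-6)))
```

Every event in the same UTC day maps to the same `(start, end)` pair, which is what lets `groupBy` treat the window as an ordinary grouping key.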



In [19]:
streamingDataFrame = spark\
.readStream\
.schema(staticSchema)\
.option("maxFilesPerTrigger", 1)\
.format("csv")\
.option("header", "true")\
.load(retailDataDay)

In [20]:
print(streamingDataFrame.isStreaming)

True


In [21]:
purchaseByCustomerPerHour = streamingDataFrame\
.selectExpr(\
           "CustomerId",
           "(UnitPrice * Quantity) as total_cost",\
           "InvoiceDate")\
.groupBy(\
        col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")

In [22]:
purchaseByCustomerPerHour.writeStream\
.format("memory")\
.queryName("customer_purchases")\
.outputMode("complete")\
.start()


<pyspark.sql.streaming.StreamingQuery at 0x107c1d3c8>

In [23]:
from time import sleep
sleep(10)
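
While the notebook sleeps, the streaming query keeps folding new files into its result table. A minimal plain-Python sketch of that behavior follows, assuming one file per micro-batch (as `maxFilesPerTrigger` is set to 1 above) and "complete" output mode. `CompleteModeQuery` and the sample rows are hypothetical; this is not the Structured Streaming API.

```python
from collections import defaultdict

class CompleteModeQuery:
    """Toy running aggregation: state persists across micro-batches."""

    def __init__(self):
        self._totals = defaultdict(float)

    def process_batch(self, rows):
        """Fold one micro-batch (e.g. one file) into the running totals."""
        for customer_id, total_cost in rows:
            self._totals[customer_id] += total_cost
        # "complete" mode: emit the entire result table, not just changed rows.
        return dict(self._totals)

query = CompleteModeQuery()
# Hypothetical (customer_id, total_cost) rows, one list per input file.
files = [
    [(1, 10.0), (2, 5.0)],
    [(1, 2.5)],
]
snapshots = [query.process_batch(rows) for rows in files]
```

This is why querying the in-memory table at different moments can return different totals: each snapshot reflects only the files processed so far.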

In [24]:
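# NOTE: 'sum(total_cost)' in single quotes is a string literal, so this ORDER BY
# sorts on a constant and the rows below come back unordered. To sort by the
# column, quote it with backticks instead: order by `sum(total_cost)` desc
spark.sql("""
select *
from customer_purchases
order by 'sum(total_cost)' desc""")\
.show(3, truncate=False)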

+----------+---------------------------------------------+------------------+
|CustomerId|window                                       |sum(total_cost)   |
+----------+---------------------------------------------+------------------+
|15909.0   |[2011-06-18 19:00:00.0,2011-06-19 19:00:00.0]|191.93999999999997|
|15805.0   |[2011-11-15 18:00:00.0,2011-11-16 18:00:00.0]|370.4             |
|17858.0   |[2011-03-15 19:00:00.0,2011-03-16 19:00:00.0]|432.20000000000005|
+----------+---------------------------------------------+------------------+
only showing top 3 rows



### _Chapter #3 - Exercise (Machine Learning)_

-  **Machine Learning**:
    -  numerical representation
    -  data cleansing
    -  train / test split
    -  feature engineering (index, encode, vector assemble)

In [25]:
from pyspark.sql.functions import date_format, col

In [26]:
preppedDataFrame = staticDataFrame\
.na.fill(0)\
.withColumn("day_of_week", date_format(col("InvoiceDate"), "EEEE"))\
.coalesce(5)

In [27]:
trainDF = preppedDataFrame\
.where("InvoiceDate < '2011-07-01'")
testDF = preppedDataFrame\
.where("InvoiceDate >= '2011-07-01'")

In [28]:
print(trainDF.count())
print(testDF.count())

245903
296006


In [29]:
for i in trainDF.take(3): print(i)

Row(InvoiceNo='537226', StockCode='22811', Description='SET OF 6 T-LIGHTS CACTI ', Quantity=6, InvoiceDate=datetime.datetime(2010, 12, 6, 8, 34), UnitPrice=2.95, CustomerID=15987.0, Country='United Kingdom', day_of_week='Monday')
Row(InvoiceNo='537226', StockCode='21713', Description='CITRONELLA CANDLE FLOWERPOT', Quantity=8, InvoiceDate=datetime.datetime(2010, 12, 6, 8, 34), UnitPrice=2.1, CustomerID=15987.0, Country='United Kingdom', day_of_week='Monday')
Row(InvoiceNo='537226', StockCode='22927', Description='GREEN GIANT GARDEN THERMOMETER', Quantity=2, InvoiceDate=datetime.datetime(2010, 12, 6, 8, 34), UnitPrice=5.95, CustomerID=15987.0, Country='United Kingdom', day_of_week='Monday')


In [30]:
from pyspark.ml.feature import StringIndexer

In [31]:
indexer = StringIndexer()\
.setInputCol("day_of_week")\
.setOutputCol("day_of_week_index")

In [32]:
indexer.fit(trainDF).transform(testDF).select("day_of_week", "day_of_week_index").show(3, truncate=False)

+-----------+-----------------+
|day_of_week|day_of_week_index|
+-----------+-----------------+
|Monday     |2.0              |
|Monday     |2.0              |
|Monday     |2.0              |
+-----------+-----------------+
only showing top 3 rows
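
`StringIndexer` is fitted on the training data and, by default, gives the most frequent label index 0.0, the next most frequent 1.0, and so on. The plain-Python sketch below imitates that ordering. `fit_string_indexer` is a hypothetical helper, the labels are made up rather than taken from the retail data, and the alphabetical tie-break is a choice made for this sketch.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; this is a choice made for the sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical training labels, not the retail data.
train_labels = ["Thursday", "Thursday", "Thursday", "Wednesday", "Wednesday", "Monday"]
index = fit_string_indexer(train_labels)

# "transform" is then a plain lookup against the fitted mapping.
encoded = [index[label] for label in ["Monday", "Thursday"]]
```

Fitting on the training set and reusing the mapping on the test set, as the cell above does, keeps the indices consistent between the two.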



In [33]:
from pyspark.ml.feature import OneHotEncoder

In [34]:
encoder = OneHotEncoder()\
.setInputCol("day_of_week_index")\
.setOutputCol("day_of_week_encoded")

In [35]:
from pyspark.ml.feature import VectorAssembler

In [36]:
vectorAssembler = VectorAssembler()\
.setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
.setOutputCol("features")

In [37]:
from pyspark.ml import Pipeline

In [38]:
transformationPipeline = Pipeline()\
.setStages([indexer, encoder, vectorAssembler]) # feature engineering to prepare for learning algorithm

In [39]:
fittedPipeline = transformationPipeline.fit(trainDF)

In [40]:
transformedTraining = fittedPipeline.transform(trainDF)

In [41]:
transformedTraining.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = false)
 |-- CustomerID: double (nullable = false)
 |-- Country: string (nullable = true)
 |-- day_of_week: string (nullable = true)
 |-- day_of_week_index: double (nullable = true)
 |-- day_of_week_encoded: vector (nullable = true)
 |-- features: vector (nullable = true)



In [42]:
transformedTraining.select("UnitPrice", "Quantity", \
                      "day_of_week", "day_of_week_index", \
                      "day_of_week_encoded", "features")\
.show(3, truncate=False)

+---------+--------+-----------+-----------------+-------------------+--------------------------+
|UnitPrice|Quantity|day_of_week|day_of_week_index|day_of_week_encoded|features                  |
+---------+--------+-----------+-----------------+-------------------+--------------------------+
|2.95     |6       |Monday     |2.0              |(5,[2],[1.0])      |(7,[0,1,4],[2.95,6.0,1.0])|
|2.1      |8       |Monday     |2.0              |(5,[2],[1.0])      |(7,[0,1,4],[2.1,8.0,1.0]) |
|5.95     |2       |Monday     |2.0              |(5,[2],[1.0])      |(7,[0,1,4],[5.95,2.0,1.0])|
+---------+--------+-----------+-----------------+-------------------+--------------------------+
only showing top 3 rows
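
The vector columns above are printed in sparse form: `(size, [indices], [values])`. The plain-Python sketch below rebuilds the first row under two assumptions: that there are 6 distinct `day_of_week` values, and that the encoder drops the last category, leaving 5 slots. The helper names are hypothetical, and unlike a real assembler this sketch always keeps the scalar entries even when they are zero.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a one-hot encoded category index."""
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return size, [index], [1.0]
    return size, [], []  # the dropped last category is the all-zeros vector

def assemble(scalars, sparse):
    """Concatenate scalar columns and one sparse vector into one sparse vector."""
    size, indices, values = sparse
    offset = len(scalars)
    return (size + offset,
            list(range(offset)) + [i + offset for i in indices],
            list(scalars) + values)

def to_dense(sparse):
    """Expand (size, indices, values) into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

day = one_hot(2, 6)  # assumes 6 distinct day_of_week values
features = assemble([2.95, 6.0], day)
dense = to_dense(features)
```

Here `features` comes out as `(7, [0, 1, 4], [2.95, 6.0, 1.0])`: UnitPrice at position 0, Quantity at position 1, and the day-of-week slot shifted by 2 to position 4.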



In [43]:
transformedTraining.cache()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string, day_of_week: string, day_of_week_index: double, day_of_week_encoded: vector, features: vector]

In [44]:
print(transformedTraining.count())

245903


-  **kMeans**:
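    -  "k" initial centers (centroids) are placed among the data points
    -  each point is assigned to its nearest centroid, then every centroid is recomputed as the mean of its points; this repeats until the assignments stabilize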
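
One assign-and-update round of k-means, plus the cost it tries to reduce (the sum of squared distances from each point to its nearest centroid), can be sketched in plain Python. This is a teaching sketch, not Spark's implementation: the four points and two starting centers are hypothetical, and `update` assumes every cluster keeps at least one member.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def update(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    centroids = []
    for c in range(k):
        # Assumes the cluster is non-empty; an empty one would divide by zero.
        members = [p for p, label in zip(points, labels) if label == c]
        centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return centroids

def cost(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical starting centers
cost_before = cost(points, centroids)
labels = assign(points, centroids)
centroids = update(points, labels, k=2)
cost_after = cost(points, centroids)
```

In this toy run the cost drops from 8.0 to 4.0 after a single round, which is the same kind of quantity the cost cells below report for the fitted model.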

In [45]:
from pyspark.ml.clustering import KMeans

In [46]:
kmeans = KMeans()\
.setK(20)\
.setSeed(1)

In [47]:
kmModel = kmeans.fit(transformedTraining)

In [48]:
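# Sum of squared distances from each point to its nearest centroid.
# Spark 2.x API: computeCost was deprecated in 2.4 and removed in 3.0.
kmModel.computeCost(transformedTraining)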

84553739.96537484

In [49]:
transformedTest = fittedPipeline.transform(testDF)

In [50]:
kmModel.computeCost(transformedTest)

517507094.72221166
