## Spark: The Definitive Guide


Notebooks here are created from the book's [Code](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/code) and [Data](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data).

After cloning the [Git repo](https://github.com/databricks/Spark-The-Definitive-Guide) locally, set the OS environment variable `SPARK_BOOK_DATA_PATH` to that folder.

### How to Get Started

#### Databricks Cloud Sandbox

Use a free Spark cluster at:

https://community.cloud.databricks.com/

#### How to run the code examples

https://github.com/databricks/Spark-The-Definitive-Guide

The dataset can be found at:
```
%fs ls /databricks-datasets/definitive-guide/data
```

The Python code is imported into the Databricks workspace under the `HOME > Spark_Guide > py` folder.

#### Local Installation

To install pyspark:
```
$ pip install pyspark
```

To start a Jupyter notebook:
```
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```

To use other pyspark packages, add `--packages <pkg-name>`, e.g.
```
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --packages graphframes:graphframes:0.8.0-spark3.0-s_2.12
```
The package jar is at `$SPARK_HOME/jars/graphframes-0.8.0-spark3.0-s_2.12.jar`.

Launch Scala console

```
cmd> $SPARK_HOME/bin/spark-shell
```

Launch pyspark console

```
cmd> $SPARK_HOME/bin/pyspark
```

Launch SQL console

```
cmd> $SPARK_HOME/bin/spark-sql
```

Submit Spark app

```
cmd> $SPARK_HOME/bin/spark-submit \
    --master local[*] \
    --packages 'com.somesparkjar.dependency:1.0.0' \
    --py-files packages.zip \
    --files configs/etl_config.json \
    jobs/etl_job.py  
```



#### How to upload files to Databricks cloud

https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html


Clone the repo: `git clone https://github.com/databricks/Spark-The-Definitive-Guide`

Replace the data path inside the `code` folder with `/databricks-datasets/definitive-guide/data` globally.
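The global replacement can be scripted rather than done by hand. The sketch below is one possible approach, not the author's method. It assumes the book's code refers to data through double-quoted string literals that start with `/data/`; the helper name `rewrite_data_paths` and the list of file suffixes are hypothetical. It edits files in place, so run it on a copy.

```python
from pathlib import Path

# Assumption: data paths in the code folder look like "/data/flight-data/...".
OLD_PREFIX = "/data/"
NEW_PREFIX = "/databricks-datasets/definitive-guide/data/"


def rewrite_data_paths(code_dir, old=OLD_PREFIX, new=NEW_PREFIX,
                       suffixes=(".py", ".scala", ".sql", ".r")):
    """Replace the data path prefix in every matching source file.

    Only double-quoted literals that begin with `old` are rewritten.
    Returns the list of files that were modified.
    """
    changed = []
    for path in sorted(Path(code_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8")
        updated = text.replace('"' + old, '"' + new)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `rewrite_data_paths("Spark-The-Definitive-Guide/code")` returns the files that were changed. A second run returns an empty list, because the rewritten literals no longer start with `"/data/`.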

```
$ zip -r spark_guide_code.dbc spark_guide_code/
```

Log in to https://community.cloud.databricks.com/

Create a cluster.

Go to Workspace > Import File.

This does not work because the file formats are different.

### Use Spark

Add `export SPARK_BOOK_DATA_PATH=~/spark_data/` to your shell profile (e.g. `~/.bashrc` or `~/.bash_profile`).

In [1]:
import os
SPARK_BOOK_DATA_PATH = os.environ['SPARK_BOOK_DATA_PATH']

In [2]:
SPARK_BOOK_DATA_PATH

'/home/wengong/spark_data/'
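The value above ends with a slash, and later cells append strings that start with `/data/...`, so the resulting paths contain a double slash. That is harmless on Linux, but `os.path.join` avoids it. A minimal sketch; the helper name `book_data_file` is hypothetical and not part of the book's code:

```python
import os


def book_data_file(*parts, base=None):
    """Join path components under the book's data root.

    `base` defaults to the SPARK_BOOK_DATA_PATH environment variable.
    """
    if base is None:
        base = os.environ.get("SPARK_BOOK_DATA_PATH")
    if not base:
        raise RuntimeError("SPARK_BOOK_DATA_PATH is not set")
    # expanduser turns a leading "~" into the home directory.
    return os.path.join(os.path.expanduser(base), *parts)


# Example with an explicit base, so it runs even when the variable is unset.
csv_path = book_data_file(
    "data", "flight-data", "csv", "2015-summary.csv", base="~/spark_data/"
)
print(csv_path)
```

Passing the components separately keeps the result free of doubled separators whether or not the base ends with a slash.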

In [3]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

spark = SparkSession\
    .builder\
    .appName("chapter-02-intro")\
    .getOrCreate()

In [4]:
spark

In [5]:
sc = spark.sparkContext

In [6]:
sc

#### simple dataframe of numbers

In [7]:
# spark.range(10) returns a DataFrame with a single column named "id"; toDF("number") renames that column
myRange = spark.range(10).toDF("number")
myRange.show()

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
+------+



In [8]:
# filter
divisBy2 = myRange.where("number % 2 = 0")
evens = divisBy2.collect()     # bring the rows to the driver as a Python list of Row objects

In [9]:
evens, evens[0].number

([Row(number=0), Row(number=2), Row(number=4), Row(number=6), Row(number=8)],
 0)

In [10]:
# convert collection to RDD
rdd = sc.parallelize(range(10))

nums = rdd.collect()

In [11]:
nums

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [12]:
from pyspark.sql import Row

test_rdd = sc.parallelize([Row(1), Row(2), Row(3)])

In [13]:
test_rdd.collect()

[<Row(1)>, <Row(2)>, <Row(3)>]

In [14]:
type(test_rdd)

pyspark.rdd.RDD

In [15]:
type(spark.range(10))

pyspark.sql.dataframe.DataFrame

In [16]:
test_df = test_rdd.toDF()

In [17]:
type(test_df)

pyspark.sql.dataframe.DataFrame

In [18]:
test_df.show()

+---+
| _1|
+---+
|  1|
|  2|
|  3|
+---+



In [19]:
test_df.toDF("id").show()   # name column "id"

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+



#### work with data in csv

In [20]:
# read csv file
file_path = SPARK_BOOK_DATA_PATH + "/data/flight-data/csv/2015-summary.csv"
flightData2015 = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv(file_path)

Short form:

`flightData2015 = spark.read.csv(file_path, header=True, inferSchema=True)`
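With `inferSchema` enabled, Spark makes an extra pass over the file to guess the column types. When the types are already known, a schema can be supplied instead. The sketch below assumes the three columns and types that `printSchema()` reports in this notebook and passes them as a DDL-formatted string; the names `FLIGHT_SCHEMA_DDL` and `read_flights` are hypothetical.

```python
# Column names and types copied from the printSchema() output in this notebook.
FLIGHT_SCHEMA_DDL = (
    "DEST_COUNTRY_NAME STRING, "
    "ORIGIN_COUNTRY_NAME STRING, "
    "`count` INT"
)


def read_flights(spark, path):
    """Read the flight summary CSV with an explicit schema instead of inference.

    `spark` is an existing SparkSession; `path` points at the CSV file.
    """
    return spark.read.csv(path, header=True, schema=FLIGHT_SCHEMA_DDL)
```

In the notebook this would be used as `flightData2015 = read_flights(spark, file_path)`.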

In [21]:
flightData2015.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [22]:
flightData2015.schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,IntegerType,true)))

In [23]:
flightData2015.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

In [24]:
flightData2015.count()

256

In [25]:
flightData2015.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [26]:
# save to parquet file
file_path = SPARK_BOOK_DATA_PATH + "/data/flight-data/parquet/2015-summary.parquet"
flightData2015.write\
    .format("parquet")\
    .mode("overwrite")\
    .save(file_path)

# read it back
flightData2015_2 = spark\
    .read\
    .format("parquet")\
    .load(file_path)

flightData2015_2.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



#### Spark SQL

In [45]:
# register the DataFrame as a temporary view
flightData2015.createOrReplaceTempView("flight_data_2015")

In [46]:
# run SQL directly against the temp view
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(*)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
--having count(1) > 4
""")

In [47]:
sqlWay.show(5)

+-----------------+--------+
|DEST_COUNTRY_NAME|count(1)|
+-----------------+--------+
|         Anguilla|       1|
|           Russia|       1|
|         Paraguay|       1|
|          Senegal|       1|
|           Sweden|       1|
+-----------------+--------+
only showing top 5 rows



In [30]:
dataFrameWay = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .count()

In [31]:
dataFrameWay.show(5)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|         Anguilla|    1|
|           Russia|    1|
|         Paraguay|    1|
|          Senegal|    1|
|           Sweden|    1|
+-----------------+-----+
only showing top 5 rows



Spark's Catalyst optimizer turns logical plans into an optimized physical plan.

In [32]:
sqlWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#41, 200), true, [id=#187]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[partial_count(1)])
      +- FileScan csv [DEST_COUNTRY_NAME#41] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/wengong/spark_data/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>




In [33]:
dataFrameWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#41, 200), true, [id=#206]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[partial_count(1)])
      +- FileScan csv [DEST_COUNTRY_NAME#41] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/wengong/spark_data/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>




The underlying physical plans are the same.
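The two plans printed above differ only in generated ids, for example `[id=#187]` versus `[id=#206]`. One way to check this mechanically is to strip those ids from the printed text and compare what remains. A minimal sketch using the first three plan lines shown above; `normalize_plan` is a hypothetical helper, not a Spark API:

```python
import re


def normalize_plan(plan_text):
    """Remove generated ids such as #41, #169L or #187 from plan text."""
    return re.sub(r"#\d+L?", "#", plan_text)


# Text copied from the output of sqlWay.explain()
sql_plan = """\
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#41, 200), true, [id=#187]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[partial_count(1)])"""

# Text copied from the output of dataFrameWay.explain()
df_plan = """\
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#41, 200), true, [id=#206]
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[partial_count(1)])"""

plans_match = normalize_plan(sql_plan) == normalize_plan(df_plan)
print(plans_match)  # True
```

The raw strings are not equal because of the exchange id, so the comparison only succeeds after normalization.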

In [34]:
# Spark SQL functions; use the F alias so Python's built-in max is not shadowed

flightData2015.select(F.max("count")).take(1)

[Row(max(count)=370002)]

In [35]:
max_count = spark.sql("""
SELECT max(count) as max_count
FROM flight_data_2015
""")

In [36]:
type(max_count)

pyspark.sql.dataframe.DataFrame

In [37]:
max_count.collect()[0]

Row(max_count=370002)

In [38]:
max_count.collect()[0].max_count

370002

In [39]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [40]:
# from pyspark.sql.functions import desc

top5_destDF = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(F.desc("destination_total"))\
  .limit(5)

top5_destDF.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [41]:
top5_destDF.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#169L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#41,destination_total#169L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[sum(cast(count#43 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#41, 200), true, [id=#348]
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[partial_sum(cast(count#43 as bigint))])
         +- FileScan csv [DEST_COUNTRY_NAME#41,count#43] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/wengong/spark_data/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>




In [42]:
maxSql.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[aggOrder#145L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#41,destination_total#143L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[sum(cast(count#43 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#41, 200), true, [id=#372]
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#41], functions=[partial_sum(cast(count#43 as bigint))])
         +- FileScan csv [DEST_COUNTRY_NAME#41,count#43] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/wengong/spark_data/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>




##### Run SQL on files directly

https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options

In [43]:
file_path = SPARK_BOOK_DATA_PATH + "/data/flight-data/parquet/2015-summary.parquet"
df = spark.sql(f"SELECT * FROM parquet.`{file_path}`")

In [44]:
df.show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri