## Spark: The Definitive Guide


The notebooks here are created from the book's [Code](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/code) and [Data](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data).

After cloning the [Git repo](https://github.com/databricks/Spark-The-Definitive-Guide) locally, set the OS environment variable `SPARK_BOOK_DATA_PATH` to the cloned folder.

### How to Get Started

#### Databricks Cloud Sandbox

Use a free Spark cluster at:

https://community.cloud.databricks.com/

#### How to run the code examples

https://github.com/databricks/Spark-The-Definitive-Guide

The dataset can be found at:
```
%fs ls /databricks-datasets/definitive-guide/data
```

The `py` code has been imported into the Databricks workspace under the `HOME > Spark_Guide > py` folder.

#### Local Installation

To install PySpark:
```
$ pip install pyspark
```

To start a Jupyter notebook with PySpark as the driver:
```
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```

To load additional Spark packages, add `--packages <pkg-name>`, e.g.
```
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
```

Launch the Scala console:

```
cmd> $SPARK_HOME/bin/spark-shell
```

Launch the PySpark console:

```
cmd> $SPARK_HOME/bin/pyspark
```

Launch the SQL console:

```
cmd> $SPARK_HOME/bin/spark-sql
```

Submit a Spark application:

```
cmd> $SPARK_HOME/bin/spark-submit \
    --master local[*] \
    --packages 'com.somesparkjar.dependency:1.0.0' \
    --py-files packages.zip \
    --files configs/etl_config.json \
    jobs/etl_job.py  
```



#### How to upload a file to Databricks

https://forums.databricks.com/questions/10648/upload-local-files-into-dbfs-1.html

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

spark = SparkSession\
    .builder\
    .appName("chapter-02-intro")\
    .getOrCreate()

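Add `export SPARK_BOOK_DATA_PATH=~/spark_data/` to your shell startup file (e.g. `~/.bash_profile`) so the variable is set in new sessions. The folder must contain the repo's `data/` directory, because the cells below append `/data/flight-data/...` to it.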

In [2]:
import os
SPARK_BOOK_DATA_PATH = os.environ['SPARK_BOOK_DATA_PATH']
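
Note that `os.environ['SPARK_BOOK_DATA_PATH']` raises `KeyError` when the variable is not set. The following is a minimal sketch of a more forgiving lookup. The helper name `resolve_data_path` is hypothetical (not part of the book's code), and the fallback directory simply reuses the `~/spark_data` location mentioned above as an assumption.

```python
import os


def resolve_data_path(env_var="SPARK_BOOK_DATA_PATH", default="~/spark_data"):
    """Return the data root from env_var, or the assumed default if it is unset."""
    raw = os.environ.get(env_var) or default
    # Expand "~" and drop any trailing separator so later joins stay clean.
    return os.path.normpath(os.path.expanduser(raw))


data_root = resolve_data_path()
# Build the CSV path with os.path.join instead of string concatenation.
csv_path = os.path.join(data_root, "data", "flight-data", "csv", "2015-summary.csv")
print(csv_path)
```

Using `os.path.join` avoids doubled or missing slashes when the variable is set with or without a trailing `/`.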

In [3]:
sc = spark.sparkContext

In [4]:
spark

In [5]:
sc

In [6]:
# spark.range(10) returns a DataFrame with a single column named "id"; toDF("number") renames that column
myRange = spark.range(10).toDF("number")
myRange.show()

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
+------+



In [7]:
divisBy2 = myRange.where("number % 2 = 0")
divisBy2.collect()

[Row(number=0), Row(number=2), Row(number=4), Row(number=6), Row(number=8)]

In [8]:
# convert a local Python collection to an RDD
rdd = sc.parallelize(range(10))

rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [9]:
from pyspark.sql import Row

test_rdd = spark.sparkContext.parallelize([Row(1), Row(2), Row(3)])

In [10]:
type(test_rdd)

pyspark.rdd.RDD

In [11]:
type(spark.range(10))

pyspark.sql.dataframe.DataFrame

In [12]:
test_df = test_rdd.toDF()

In [13]:
type(test_df)

pyspark.sql.dataframe.DataFrame

In [14]:
test_df.toDF("id").show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+



In [16]:
test_df.show()

+---+
| _1|
+---+
|  1|
|  2|
|  3|
+---+



In [12]:
# read from file
file_path = SPARK_BOOK_DATA_PATH + "/data/flight-data/csv/2015-summary.csv"
flightData2015 = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv(file_path)

Short form, passing the same options as keyword arguments:

`flightData2015 = spark.read.csv(file_path, header=True, inferSchema=True)`

In [18]:
# write to parquet file
file_path = SPARK_BOOK_DATA_PATH + "/data/flight-data/parquet/2015-summary.parquet"
flightData2015.write\
    .format("parquet")\
    .mode("overwrite")\
    .save(file_path)

In [22]:
flightData2015.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [20]:
# read it back
flightData2015_2 = spark\
    .read\
    .format("parquet")\
    .load(file_path)

In [21]:
flightData2015_2.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [13]:
flightData2015.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [14]:
flightData2015.schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,IntegerType,true)))

In [15]:
flightData2015.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

In [16]:
flightData2015.count()

256

In [17]:
flightData2015.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [17]:
# register the DataFrame as a temporary view so it can be queried with SQL
flightData2015.createOrReplaceTempView("flight_data_2015")

In [26]:
# run SQL directly against the temporary view
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
--having count(1) > 4
""")

In [27]:
sqlWay.show(5)

+-----------------+--------+
|DEST_COUNTRY_NAME|count(1)|
+-----------------+--------+
|         Anguilla|       1|
|           Russia|       1|
|         Paraguay|       1|
|          Senegal|       1|
|           Sweden|       1|
+-----------------+--------+
only showing top 5 rows



In [22]:
dataFrameWay = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .count()

In [25]:
dataFrameWay.show(5)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|         Anguilla|    1|
|           Russia|    1|
|         Paraguay|    1|
|          Senegal|    1|
|           Sweden|    1|
+-----------------+-----+
only showing top 5 rows



Spark's Catalyst optimizer turns logical plans into an optimized physical plan. The SQL query and the DataFrame query below compile to the same physical plan.

In [30]:
sqlWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#26], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#26, 200)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#26], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#26] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/gong/spark/books/Spark-The-Definitive-Guide/data/flight-data/csv/201..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


In [31]:
dataFrameWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#26], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#26, 200)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#26], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#26] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/gong/spark/books/Spark-The-Definitive-Guide/data/flight-data/csv/201..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


In [32]:
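# Spark SQL functions
# Note: this import shadows Python's built-in max() for the rest of the session;
# F.max (via `import pyspark.sql.functions as F` in the first cell) avoids that.
from pyspark.sql.functions import max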

flightData2015.select(max("count")).take(1)

[Row(max(count)=370002)]

In [41]:
max_count = spark.sql("""
SELECT max(count) as max_count
FROM flight_data_2015
""")

In [42]:
type(max_count)

pyspark.sql.dataframe.DataFrame

In [43]:
max_count.collect()[0]

Row(max_count=370002)

In [44]:
max_count.collect()[0].max_count

370002

In [45]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [46]:
from pyspark.sql.functions import desc

top5_destDF = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)

top5_destDF.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [47]:
top5_destDF.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#171L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#26,destination_total#171L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#26], functions=[sum(cast(count#28 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#26, 200)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#26], functions=[partial_sum(cast(count#28 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#26,count#28] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/gong/spark/books/Spark-The-Definitive-Guide/data/flight-data/csv/201..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
