# Hello Spark

A walkthrough of chapter 1 of O'Reilly's [Spark: The Definitive Guide](https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/).

To install dependencies for this notebook locally, assuming the full Spark install is in `/usr/local/spark`:

```
$ virtualenv --python=python3 spark
$ cd spark
$ source bin/activate
$ pip install pyspark jupyter
$ SPARK_HOME=/usr/local/spark PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
```

Sample data can be downloaded from: http://cdn.oreillystatic.com/books/0636920034957/data.zip

In [1]:
spark

# Simple range DataFrame

In [2]:
my_range = spark.range(1000).toDF("number")

In [3]:
my_range.describe()

DataFrame[summary: string, number: string]

In [4]:
my_range.count()

1000

In [5]:
my_range.take(10)

[Row(number=0),
 Row(number=1),
 Row(number=2),
 Row(number=3),
 Row(number=4),
 Row(number=5),
 Row(number=6),
 Row(number=7),
 Row(number=8),
 Row(number=9)]

In [6]:
evens = my_range.where("number % 2 = 0")

In [7]:
evens.take(10)

[Row(number=0),
 Row(number=2),
 Row(number=4),
 Row(number=6),
 Row(number=8),
 Row(number=10),
 Row(number=12),
 Row(number=14),
 Row(number=16),
 Row(number=18)]

# Read sample data in JSON format

In [8]:
!ls data/flight-data/json/2015-summary.json

data/flight-data/json/2015-summary.json


In [9]:
flight_data_json = spark.read \
    .json('data/flight-data/json/2015-summary.json')

In [10]:
flight_data_json

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [11]:
flight_data_json.explain()

== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#91,ORIGIN_COUNTRY_NAME#92,count#93L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/jeffrey.sternberg/Code/spark/hello-spark/data/flight-data/json/2015..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>


In [12]:
flight_data_json.take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1)]

In [13]:
flight_data_json.explain()

== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#91,ORIGIN_COUNTRY_NAME#92,count#93L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/jeffrey.sternberg/Code/spark/hello-spark/data/flight-data/json/2015..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>


In [14]:
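# NOTE: sort() only recognizes the `ascending` keyword; `descending=True` is silently
# ignored, which is why the plan below sorts ASC. Use ascending=False for a descending sort.
sorted_flight_data_json = flight_data_json.sort("count", descending=True)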

In [15]:
sorted_flight_data_json.explain()

== Physical Plan ==
*Sort [count#93L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#93L ASC NULLS FIRST, 200)
   +- *FileScan json [DEST_COUNTRY_NAME#91,ORIGIN_COUNTRY_NAME#92,count#93L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/jeffrey.sternberg/Code/spark/hello-spark/data/flight-data/json/2015..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>


In [16]:
sorted_flight_data_json.take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

# Read sample data in CSV format

In [17]:
flight_data_csv = spark.read \
    .option('inferSchema', 'true')\
    .option('header', 'true')\
    .csv('data/flight-data/csv/2015-summary.csv')

In [18]:
flight_data_csv

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]

In [19]:
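# Use the JSON DataFrame's schema to read the CSV version without inferring.
# NOTE: no `header` option is set here, so the CSV header line is read as a data row
# (it appears as 'DEST_COUNTRY_NAME' in later results); add .option('header', 'true') to skip it.
flight_data_csv = spark.read \
    .schema(flight_data_json.schema) \
    .csv('data/flight-data/csv/2015-summary.csv')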

In [20]:
flight_data_csv

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [21]:
# keep this one
flight_data = flight_data_csv

# Spark SQL

In [22]:
# convert dataframe we read above to Spark SQL table/view
flight_data.createOrReplaceTempView("flight_data")

In [23]:
via_sql = spark.sql("""
select dest_country_name, count(*)
from flight_data
group by dest_country_name
""")

In [24]:
via_df = flight_data \
    .groupBy('dest_country_name') \
    .count()

In [25]:
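# NOTE: explain() prints the plan and returns None, so this is None == None and always True;
# compare the two printed plans by eye.
via_sql.explain() == via_df.explain()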

== Physical Plan ==
*HashAggregate(keys=[dest_country_name#126], functions=[count(1)])
+- Exchange hashpartitioning(dest_country_name#126, 200)
   +- *HashAggregate(keys=[dest_country_name#126], functions=[partial_count(1)])
      +- *FileScan csv [DEST_COUNTRY_NAME#126] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/jeffrey.sternberg/Code/spark/hello-spark/data/flight-data/csv/2015-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
== Physical Plan ==
*HashAggregate(keys=[dest_country_name#126], functions=[count(1)])
+- Exchange hashpartitioning(dest_country_name#126, 200)
   +- *HashAggregate(keys=[dest_country_name#126], functions=[partial_count(1)])
      +- *FileScan csv [DEST_COUNTRY_NAME#126] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/jeffrey.sternberg/Code/spark/hello-spark/data/flight-data/csv/2015-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


True

In [26]:
via_sql.take(5)

[Row(dest_country_name='Anguilla', count(1)=1),
 Row(dest_country_name='Russia', count(1)=1),
 Row(dest_country_name='Paraguay', count(1)=1),
 Row(dest_country_name='DEST_COUNTRY_NAME', count(1)=1),
 Row(dest_country_name='Senegal', count(1)=1)]

In [27]:
spark.sql("SELECT max(count) from flight_data").take(1)

[Row(max(count)=370002)]

In [28]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in max

flight_data.select(F.max('count')).take(1)

[Row(max(count)=370002)]

In [29]:
# What are the top five destination countries in the data set?
top_5_sql = spark.sql("""
select dest_country_name, sum(count) as dest_total
from flight_data
group by dest_country_name
order by sum(count) desc
limit 5
""")
top_5_sql.collect()

[Row(dest_country_name='United States', dest_total=411352),
 Row(dest_country_name='Canada', dest_total=8399),
 Row(dest_country_name='Mexico', dest_total=7140),
 Row(dest_country_name='United Kingdom', dest_total=2025),
 Row(dest_country_name='Japan', dest_total=1548)]

In [30]:
# same thing, via dataframe
from pyspark.sql.functions import desc

flight_data \
    .groupBy("dest_country_name") \
    .sum("count") \
    .withColumnRenamed("sum(count)", "dest_total") \
    .sort(desc("dest_total")) \
    .limit(5) \
    .collect()

[Row(dest_country_name='United States', dest_total=411352),
 Row(dest_country_name='Canada', dest_total=8399),
 Row(dest_country_name='Mexico', dest_total=7140),
 Row(dest_country_name='United Kingdom', dest_total=2025),
 Row(dest_country_name='Japan', dest_total=1548)]