# Spark Play 01

## Install and setup

Setup instructions for running pyspark in jupyter notebooks (using conda environment, such as "my-env"):
```
java -version # Should be 1.8.0_241 ... i.e. Final Java 8 release 
conda create -n my-env python=3.7 jupyter
conda activate my-env
conda install -c conda-forge pyspark
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=my-env
jupyter notebook
```
Open new notebook file by selecting New > my-env

In [1]:
import numpy as np
import pandas as pd


from pyspark import SparkContext
from pyspark.sql import SparkSession

import pyspark.sql.functions as psf

First create a `SparkContext` which connects the Spark application to the cluster (in this case local).

In [2]:
sc = SparkContext('local[*]') # Interacts with RDDs API
print(sc)
print(sc.version) # Shows version of Spark being used
print(sc.pythonVer) # Shows version of Python being used
print(sc.master) # Shows 'name' of master node

<SparkContext master=local[*] appName=pyspark-shell>
2.4.5
3.7
local[*]


Then get or create a new `SparkSession` which we'll use to control the Spark driver.

In [3]:
spark = SparkSession.builder.getOrCreate() # Interacts with DataFrames API, see later
print(spark)

<pyspark.sql.session.SparkSession object at 0x00000221591CED48>


List the contents of the Spark session using its `catalog` attribute, which has methods such as `listTables()`. (Currently nothing in the session.)

In [4]:
spark.catalog.listTables()

[]

Other useful `catalog` methods might include:
- `cacheTable(tableName)`/`uncacheTable(tableName)`/`isCached(tableName)`
- `createTable()`
- `listColumns(tableName)`
- `listFunctions(dbName=None)` -- functions registered in a specified database.

-------------------------------------

## Spark context

We can load simple objects into pyspark by, for example, passing a number range into the SparkContext `sc` using its `parallelize` method:

In [5]:
first_range = sc.parallelize(np.arange(1, 101))
print(first_range)
print(type(first_range))

ParallelCollectionRDD[4] at parallelize at PythonRDD.scala:195
<class 'pyspark.rdd.RDD'>


`sc.parallelize()` creates an RDD (Resilient Distributed Dataset). We express our computations through Spark's distributed data structures, and Spark handles the parallelisation automatically.
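
As a rough mental model only (plain Python, not Spark's actual implementation), the sketch below splits a collection into contiguous partitions, computes a partial result per partition, and combines the partial results. The `split_into_partitions` helper and the partition count of 4 are assumptions made for illustration.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    n = len(items)
    return [
        items[(i * n) // num_partitions : ((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

numbers = list(range(1, 101))              # same values as first_range
partitions = split_into_partitions(numbers, 4)

# Each partition could be summed by a different worker...
partial_sums = [sum(p) for p in partitions]
# ...and only the small partial results need to be combined at the end.
total = sum(partial_sums)

print([len(p) for p in partitions])
print(total)
```

The point is that the per-partition work is independent, which is what lets Spark spread it across cores or machines.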

We can also load local files into Spark's context.

In [6]:
first_txt = sc.textFile('README.md')
print(first_txt)
print(type(first_txt))

README.md MapPartitionsRDD[6] at textFile at <unknown>:0
<class 'pyspark.rdd.RDD'>


RDD objects have a few methods we can use to explore how the data is partitioned.

In [7]:
print(first_txt.getNumPartitions())

2


We can specify the minimum number of partitions we want when we load the data:

In [8]:
first_txt = sc.textFile('README.md', minPartitions=5)
print(first_txt.getNumPartitions())

5


## Using spark driver for first time

We can start off generating sequences to play with using the session:

In [9]:
spark.range(1000) # Returns Dataframe

DataFrame[id: bigint]

In [10]:
spark.range(1000).toDF("number") # Give it column header 'number'

DataFrame[number: bigint]

We could assign that dataframe to a variable and work on that. Or we can keep chaining functions, using functional programming style, to minimise use of intermediate global objects and variables.

In [11]:
spark.range(1000)\
    .toDF("number")\
    .where("number % 2 = 0")\
    .count()

500

The executions we've run above are saved in the Spark application as previous "jobs". You can view the job history, among other things, by looking at the Spark UI, on port 4040 of the driver node, which in our local case is http://localhost:4040.

### How jobs work: Transformations and actions

Each executed operation in Spark, or job, consists of a series of transformations on the data structure, ending with an action which executes the full operation and (typically) loads the output into memory. Spark uses lazy evaluation to avoid loading the data until the very end, when large datasets can be reduced to filtered/aggregated sets.
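
Lazy evaluation can be illustrated with plain Python generators. This is only an analogy, not Spark code: the "transformations" build a pipeline without touching any data, and the "action" at the end pulls everything through. The `lazy_range` helper and the `produced` list are hypothetical names used to observe when elements are actually generated.

```python
produced = []   # records every element the source actually generates

def lazy_range(n):
    """Yield 0..n-1, logging each element as it is produced."""
    for i in range(n):
        produced.append(i)
        yield i

# "Transformations": describe the pipeline, nothing runs yet.
numbers = lazy_range(1000)
evens = (x for x in numbers if x % 2 == 0)
touched_before_action = len(produced)

# "Action": consuming the pipeline triggers the whole computation.
even_count = sum(1 for _ in evens)
touched_after_action = len(produced)

print(touched_before_action, even_count, touched_after_action)
```

No element is generated until the final count is requested, which mirrors how Spark defers work until an action is called.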

Transformation examples:
- `.map(fun)`
- `.filter(fun)`
- `.reduceByKey(fun)` (pair RDDs only)
- `.sortBy(fun)` (RDDs) / `.sort(col)` (DataFrames)
- `.sortByKey()` (pair RDDs only)

Action examples:
- `.collect()`
- `.collectAsMap()` (returns pair RDD as dict)
- `.take(n)`
- `.count()`
- `.countByKey()` (pair RDDs only, see later)
- `.reduce(fun)` (returns a single value to the driver)
- `.toPandas()` (DataFrames only)


The `.saveAsTextFile("tempFile")` action avoids loading into memory by instead saving to disk. The path you specify is a directory, and by default the data are written to multiple files inside it, one for each partition. `.coalesce(1).saveAsTextFile("tempFile")` reduces the data to a single partition first, so the directory contains just one data file.
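
The one-file-per-partition layout can be mimicked in plain Python. This is a simplified model, not what Spark does internally: the `save_as_text_files` helper and the three-partition toy dataset are assumptions, and only the `part-0000N` naming is borrowed from Spark's output.

```python
import os
import tempfile

def save_as_text_files(partitions, out_dir):
    """Write each partition to its own part-0000N file inside out_dir."""
    os.makedirs(out_dir)
    for i, partition in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{i:05d}")
        with open(path, "w") as f:
            for record in partition:
                f.write(f"{record}\n")

partitions = [[1, 2, 3], [4, 5], [6]]      # hypothetical 3-partition dataset

with tempfile.TemporaryDirectory() as tmp:
    many_dir = os.path.join(tmp, "many")
    save_as_text_files(partitions, many_dir)
    many_files = sorted(os.listdir(many_dir))

    # Merging into one partition first gives a single data file.
    one_dir = os.path.join(tmp, "one")
    merged = [[x for p in partitions for x in p]]
    save_as_text_files(merged, one_dir)
    one_files = sorted(os.listdir(one_dir))

print(many_files)
print(one_files)
```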

-------------------------------------------

## Spark's RDDs

We've seen Spark's RDDs above, like `first_txt`:

In [12]:
first_txt.take(5)

['# spark-play',
 'Getting to know Apache Spark with pyspark',
 '---------------------------------------------',
 '',
 'This code is intended to document my learning of Spark and pyspark, so I can return to this repository to remind myself who to use it.']

This RDD is basically a distributed collection of values. But we can use pair RDDs to have _distributed collections of key-value pairs_. This extra level of structure goes some way towards the DataFrames that we'll see later. Pair RDDs allow transformations by key, such as `reduceByKey()`:

In [13]:
Rdd = sc.parallelize([(1,2), (3,4), (3,6), (4,5)])
Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x + y)

for num in Rdd_Reduced.collect(): 
  print(f"Key {num[0]} has {num[1]} Counts")

Key 1 has 2 Counts
Key 3 has 10 Counts
Key 4 has 5 Counts


Note: the above output is not reproducible because the key ordering can differ, due to the distributed nature of its input. But you can use `sortByKey()` to make it reproducible.

In [14]:
for num in Rdd_Reduced.sortByKey(ascending=False).collect():
  print(f"Key {num[0]} has {num[1]} Counts")

Key 4 has 5 Counts
Key 3 has 10 Counts
Key 1 has 2 Counts
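
To make the semantics concrete, here is a plain-Python model of what `reduceByKey()` followed by a descending sort by key computes. It is a single-machine sketch, not Spark's distributed implementation, and `reduce_by_key` is a hypothetical helper name.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, reduce(func, values)) for key, values in groups.items()]

pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]
reduced = reduce_by_key(pairs, lambda x, y: x + y)

# Sorting by key makes the order deterministic, like sortByKey(ascending=False).
reduced_desc = sorted(reduced, key=lambda kv: kv[0], reverse=True)
print(reduced_desc)
```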


In [15]:
print(Rdd.take(1))
print(Rdd.countByKey())

[(1, 2)]
defaultdict(<class 'int'>, {1: 1, 3: 2, 4: 1})


In [16]:
for k, v in Rdd.countByKey().items():
    print("key", k, "has", v, "counts")

key 1 has 1 counts
key 3 has 2 counts
key 4 has 1 counts

## Example using RDD with complete works of Shakespeare

What are the most commonly used words in the complete works of William Shakespeare?

In [17]:
bill = sc.textFile("./data/shakespeare/complete_shakespeare.txt", minPartitions=6)

Shakespeare's works are currently split into lines. Split into words instead.

In [18]:
bill_words = bill.flatMap(lambda x: x.split()) # flatMap does map() and flattens the results.
print(f"Total words: {bill_words.count()}")

Total words: 128576
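
The difference between `map()` and `flatMap()` is easy to see on a tiny input. The following is a plain-Python analogy using a hypothetical two-line text, not Spark code:

```python
from itertools import chain

lines = ["to be or not", "to be"]          # hypothetical two-line input

# Like map(): one output element (a list of words) per input line.
mapped = [line.split() for line in lines]

# Like flatMap(): the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```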


In [19]:
# import nltk
# nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

Code below will filter out stop words. Then it will create a pair RDD, where each word is a key and its value starts as 1 (its starting count). Then we'll reduce by key (unique word), counting the occurrences of each word.

In [20]:
word_counts = bill_words\
    .filter(lambda x: x.lower() not in stop_words)\
    .map(lambda w: (w, 1))\
    .reduceByKey(lambda x, y: x + y)

In [21]:
word_counts.take(10)

[('Project', 9),
 ('anywhere', 1),
 ('whatsoever.', 1),
 ('away', 30),
 ('copyright', 6),
 ('guidelines', 1),
 ('file.', 1),
 ('2011', 1),
 ('Language:', 1),
 ('GUTENBERG', 28)]

We'd like to know the 10 most common words, so we need to sort by values. One way to do this is to actually swap the keys and values around, then sort by key.

In [22]:
word_counts\
    .map(lambda x: (x[1], x[0]))\
    .sortByKey(ascending=False)\
    .take(10)

[(650, 'thou'),
 (574, 'thy'),
 (393, 'shall'),
 (311, 'would'),
 (295, 'good'),
 (286, 'thee'),
 (273, 'love'),
 (269, 'Enter'),
 (254, "th'"),
 (225, 'make')]
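
The same filter, count, and sort-by-value steps can be sketched in plain Python on a toy sentence. The sentence and the three-word `stop_words` set are hypothetical stand-ins for the Shakespeare text and the NLTK stop-word list.

```python
from collections import Counter

words = "Thou art good and thou art kind and thy love is good".split()
stop_words = {"and", "is", "art"}          # hypothetical stand-in for the NLTK list

# Filter on the lowercased word, but count the word as written.
counts = Counter(w for w in words if w.lower() not in stop_words)

# Sort by count directly instead of swapping keys and values.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```

Note that `Thou` and `thou` are counted separately here, because only the stop-word test lowercases the word; the RDD pipeline above behaves the same way. On the RDD itself, `sortBy(lambda kv: kv[1], ascending=False)` or `takeOrdered(10, key=lambda kv: -kv[1])` would avoid the key-value swap.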

-----------------------

## Running first queries on a Dataset 

Rather than using SparkContext, called `sc` in spark-shell, as an entry point for interacting with RDDs, we now use the SparkSession, called `spark` in spark-shell, as the point of entry for interacting with Spark DataFrames.

Spark DataFrames are data structures that use RDDs at the core, but provide SQL/dataframe functionality and create optimised queries.

**It's recommended to interact with Spark using this newer DataFrames API rather than the older RDD API, so we'll almost always be using `spark` instead of `sc`.**

We can run SQL queries or dataframe manipulations through the `SparkSession`. But first we need to read some data into the session, using its `read` methods.

In [23]:
flights = spark.read.csv("./data/kaggle/flights.csv", inferSchema=True, header=True) # We could pass the schema using schema arg

If we needed to load a .txt or .json file instead, we'd use `spark.read.text()` or `spark.read.json()` respectively.

What are `flights`'s dimensions?

In [24]:
print(f"Rows: {flights.count()}\nColumns: {len(flights.columns)}")

Rows: 5819079
Columns: 31


In [25]:
flights.columns

['YEAR',
 'MONTH',
 'DAY',
 'DAY_OF_WEEK',
 'AIRLINE',
 'FLIGHT_NUMBER',
 'TAIL_NUMBER',
 'ORIGIN_AIRPORT',
 'DESTINATION_AIRPORT',
 'SCHEDULED_DEPARTURE',
 'DEPARTURE_TIME',
 'DEPARTURE_DELAY',
 'TAXI_OUT',
 'WHEELS_OFF',
 'SCHEDULED_TIME',
 'ELAPSED_TIME',
 'AIR_TIME',
 'DISTANCE',
 'WHEELS_ON',
 'TAXI_IN',
 'SCHEDULED_ARRIVAL',
 'ARRIVAL_TIME',
 'ARRIVAL_DELAY',
 'DIVERTED',
 'CANCELLED',
 'CANCELLATION_REASON',
 'AIR_SYSTEM_DELAY',
 'SECURITY_DELAY',
 'AIRLINE_DELAY',
 'LATE_AIRCRAFT_DELAY',
 'WEATHER_DELAY']

Better than just looking at column names, we can print the entire schema for the DataFrame like so:

In [26]:
flights.printSchema()

root
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- ARRIVAL_DELAY: integer (null

What proportion of `flights`'s rows are unique?

In [27]:
flights.dropDuplicates().count() / flights.count()

1.0

All of them.

Here's a dataframe manipulation chain, finding the number of flights within each origin-destination group.

In [28]:
flights\
    .select('FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 
        'SCHEDULED_DEPARTURE', 'SCHEDULED_ARRIVAL',
        'DEPARTURE_TIME', 'ARRIVAL_TIME', 'ARRIVAL_DELAY')\
    .groupBy('ORIGIN_AIRPORT', 'DESTINATION_AIRPORT')\
    .count()\
    .withColumnRenamed('count', 'total_flights')\
    .sort(psf.desc('total_flights'))\
    .take(10) # For comparison with sql code, could be written as .limit(10)/.collect()

[Row(ORIGIN_AIRPORT='SFO', DESTINATION_AIRPORT='LAX', total_flights=13744),
 Row(ORIGIN_AIRPORT='LAX', DESTINATION_AIRPORT='SFO', total_flights=13457),
 Row(ORIGIN_AIRPORT='JFK', DESTINATION_AIRPORT='LAX', total_flights=12016),
 Row(ORIGIN_AIRPORT='LAX', DESTINATION_AIRPORT='JFK', total_flights=12015),
 Row(ORIGIN_AIRPORT='LAS', DESTINATION_AIRPORT='LAX', total_flights=9715),
 Row(ORIGIN_AIRPORT='LGA', DESTINATION_AIRPORT='ORD', total_flights=9639),
 Row(ORIGIN_AIRPORT='LAX', DESTINATION_AIRPORT='LAS', total_flights=9594),
 Row(ORIGIN_AIRPORT='ORD', DESTINATION_AIRPORT='LGA', total_flights=9575),
 Row(ORIGIN_AIRPORT='SFO', DESTINATION_AIRPORT='JFK', total_flights=8440),
 Row(ORIGIN_AIRPORT='JFK', DESTINATION_AIRPORT='SFO', total_flights=8437)]

The equivalent can be written as a SQL query. Spark passes both styles through the same query optimiser, so from its point of view they are the same computation. To run SQL queries we must first register the DataFrame as a temporary view. We then query the view by passing SQL code to the session.

In [29]:
flights.createOrReplaceTempView("flights")

In [30]:
spark.sql("""
    SELECT origin_airport, destination_airport, count(*) AS total_flights
    FROM flights
    GROUP BY origin_airport, destination_airport
    ORDER BY total_flights DESC
    LIMIT 10
    """).collect()

[Row(origin_airport='SFO', destination_airport='LAX', total_flights=13744),
 Row(origin_airport='LAX', destination_airport='SFO', total_flights=13457),
 Row(origin_airport='JFK', destination_airport='LAX', total_flights=12016),
 Row(origin_airport='LAX', destination_airport='JFK', total_flights=12015),
 Row(origin_airport='LAS', destination_airport='LAX', total_flights=9715),
 Row(origin_airport='LGA', destination_airport='ORD', total_flights=9639),
 Row(origin_airport='LAX', destination_airport='LAS', total_flights=9594),
 Row(origin_airport='ORD', destination_airport='LGA', total_flights=9575),
 Row(origin_airport='SFO', destination_airport='JFK', total_flights=8440),
 Row(origin_airport='JFK', destination_airport='SFO', total_flights=8437)]

Note: You don't have to cap off SQL queries with `;`.

**Note also:** The `take()` and `collect()` methods are actions at the end of spark queries which will **load the output data into the driver's memory**.
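
To see what the grouped count query computes without a Spark session, the same SQL shape can be run on a toy table with Python's built-in `sqlite3` module. SQLite is only a stand-in for Spark SQL here, and the four rows are made up for illustration.

```python
import sqlite3

# Hypothetical (origin, destination) rows.
rows = [("SFO", "LAX"), ("SFO", "LAX"), ("LAX", "SFO"), ("JFK", "LAX")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin_airport TEXT, destination_airport TEXT)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)

# Count rows per origin-destination pair and keep the busiest route.
result = conn.execute("""
    SELECT origin_airport, destination_airport, COUNT(*) AS total_flights
    FROM flights
    GROUP BY origin_airport, destination_airport
    ORDER BY total_flights DESC
    LIMIT 1
""").fetchall()
conn.close()

print(result)
```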

We can query the data in either the SQL or the DataFrame style and then create a local `pandas` copy of the result using `.toPandas()`:

In [31]:
flights\
    .select('FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 
        'SCHEDULED_DEPARTURE', 'SCHEDULED_ARRIVAL',
        'DEPARTURE_TIME', 'ARRIVAL_TIME', 'ARRIVAL_DELAY')\
    .groupBy('ORIGIN_AIRPORT', 'DESTINATION_AIRPORT')\
    .avg('ARRIVAL_DELAY')\
    .withColumnRenamed('avg(ARRIVAL_DELAY)', 'avg_delay')\
    .sort(psf.desc('avg_delay'))\
    .limit(20)\
    .toPandas()

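|    | ORIGIN_AIRPORT | DESTINATION_AIRPORT | avg_delay |
|---:|:---------------|:--------------------|----------:|
| 0 | IAD | TTN | 381.0 |
| 1 | SWF | PBI | 260.5 |
| 2 | RIC | CAE | 228.0 |
| 3 | RDU | IND | 208.0 |
| 4 | 10581 | 11618 | 163.0 |
| 5 | FCA | MSO | 148.0 |
| 6 | SWF | RSW | 140.0 |
| 7 | 10581 | 12953 | 138.333333 |
| 8 | 14843 | 12264 | 122.0 |
| 9 | OAK | FLL | 106.0 |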


Let's do one more for luck...

In [32]:
spark.sql("""
SELECT origin_airport, destination_airport, airline, ROUND(AVG(air_time) / 60, 2) AS duration_hrs
FROM flights
GROUP BY origin_airport, destination_airport, airline
ORDER BY duration_hrs DESC
""").toPandas()

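|    | origin_airport | destination_airport | airline | duration_hrs |
|---:|:---------------|:--------------------|:--------|-------------:|
| 0 | JFK | HNL | DL | 10.74 |
| 1 | JFK | HNL | HA | 10.58 |
| 2 | 12478 | 12173 | HA | 10.46 |
| 3 | EWR | HNL | UA | 10.21 |
| 4 | 11618 | 12173 | UA | 10.02 |
| ... | ... | ... | ... | ... |
| 14735 | CWA | ATW | EV | |
| 14736 | MCO | HPN | EV | |
| 14737 | GCC | RAP | OO | |
| 14738 | FSD | LNK | EV | |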


-------------------------------------

## Manipulating Spark DataFrames

### Note on immutability in Spark

Spark DataFrames are immutable, because you're supposed to leave the data as-is and do all the transformations in your code. If you wanted to update your DataFrame, you'd have to re-assign it.

For example, let's say we needed to add an extra column. We could use the Spark DataFrame's `withColumn()` method like so:

In [33]:
flights = flights.withColumn('duration_hrs', flights.AIR_TIME / 60)
'duration_hrs' in flights.columns

True
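
The re-assignment pattern follows from immutability: a method that "adds a column" has to return a new object. Below is a toy plain-Python model of that idea; the `Frame` class is hypothetical and tracks only column names, nothing like a real Spark DataFrame.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """Toy immutable table: just a tuple of column names."""
    columns: tuple

    def with_column(self, name):
        # Return a NEW Frame; self is left untouched.
        return Frame(self.columns + (name,))

flights_v1 = Frame(("AIR_TIME", "DISTANCE"))
flights_v2 = flights_v1.with_column("duration_hrs")

print(flights_v1.columns)   # the original is unchanged
print(flights_v2.columns)   # the new object has the extra column
```

Re-binding the same name, as in `flights = flights.withColumn(...)`, simply points the variable at the new object.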

### Filter

The `filter()` method takes one of two options:
- A string, which represents what would follow a WHERE clause in SQL, or
- A Spark Column of boolean (True/False) values

In other words, the following two lines are equivalent.

In [34]:
print(flights.filter("air_time > 120"), "\n")
print(flights.filter(flights.AIR_TIME > 120))

DataFrame[YEAR: int, MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, DIVERTED: int, CANCELLED: int, CANCELLATION_REASON: string, AIR_SYSTEM_DELAY: int, SECURITY_DELAY: int, AIRLINE_DELAY: int, LATE_AIRCRAFT_DELAY: int, WEATHER_DELAY: int, duration_hrs: double] 

DataFrame[YEAR: int, MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: in

**Note:** When using the string approach, the column names are **not case sensitive**, just like with SQL queries. But when accessing columns as attributes in Python code (e.g. `flights.AIR_TIME`), they are.

### Select

The `select()` method takes multiple arguments that define which columns to select, either as strings of column names or the column objects themselves.

In [35]:
flights.select("arrival_time", flights.DEPARTURE_TIME)

DataFrame[arrival_time: int, DEPARTURE_TIME: int]

**Note again:** When selecting columns using strings, the names are not case sensitive, just like in SQL. But passing a column object is case sensitive.

As we saw just a moment above, you can perform operations on your columns to transform them using the `withColumn()` method, which returns all columns in the outputted DataFrame.

But you can also do this inside `select()`, one of two ways:
- Pass it a column object which we transform, with the option of renaming using the `alias()` method.
- Pass it strings, which represent SQL expressions we'd pass to `SELECT`, with the option of renaming using `AS`. **But to do it this way, you have to use `selectExpr()` instead of `select()`.**

In [36]:
flights.select("arrival_time", flights.DEPARTURE_TIME, (flights.AIR_TIME / 60).alias('duration_hrs'))

DataFrame[arrival_time: int, DEPARTURE_TIME: int, duration_hrs: double]

In [37]:
flights.selectExpr("arrival_time", "departure_time", "air_time / 60 AS duration_hrs")

DataFrame[arrival_time: int, departure_time: int, duration_hrs: double]

Another example:

In [38]:
flights.select("origin_airport", "destination_airport", "tail_number",
               (flights.DISTANCE/(flights.AIR_TIME/60)).alias("avg_speed"))

DataFrame[origin_airport: string, destination_airport: string, tail_number: string, avg_speed: double]

In [39]:
flights.selectExpr("origin_airport", "destination_airport", "tail_number",
                   "distance/(air_time/60) as avg_speed")

DataFrame[origin_airport: string, destination_airport: string, tail_number: string, avg_speed: double]

### GroupBy Aggregate

We can group our DataFrame using `groupBy()`. Once we've created a grouped DataFrame (GroupedData), we can use the GroupedData methods for aggregation, like `min()`, `max()`, `count()`, `avg()`, and so on. We can aggregate the whole data by leaving `groupBy()` empty.

Average flight duration:

In [40]:
flights\
    .select((flights.AIR_TIME / 60).alias('duration_hrs'))\
    .groupBy()\
    .avg('duration_hrs')\
    .show()

+------------------+
| avg(duration_hrs)|
+------------------+
|1.8918604681687405|
+------------------+



Average flight duration by airline carrier, sorted in descending order:

In [41]:
flights\
    .selectExpr("airline", "air_time / 60 AS duration_hrs")\
    .groupBy('airline')\
    .avg('duration_hrs')\
    .sort(psf.desc('avg(duration_hrs)'))\
    .show(5)

# ^^ Note: See agg() below for renaming var at same time

+-------+------------------+
|airline| avg(duration_hrs)|
+-------+------------------+
|     VX| 3.043846601793797|
|     UA|2.7472647618372408|
|     AS|2.6307464073713316|
|     B6| 2.397641535835225|
|     AA|2.3301034689931734|
+-------+------------------+
only showing top 5 rows



Total number of flights taken, by airline and origin airport, sorted in descending order:

In [42]:
flights\
    .groupBy('origin_airport', 'airline')\
    .count()\
    .sort(psf.desc('count'))\
    .show(5)

+--------------+-------+------+
|origin_airport|airline| count|
+--------------+-------+------+
|           ATL|     DL|221705|
|           DFW|     AA|134270|
|           MDW|     WN| 76350|
|           LAS|     WN| 68520|
|           BWI|     WN| 64063|
+--------------+-------+------+
only showing top 5 rows



#### `agg()` method

As well as the basic aggregation functions, there is also the generic `agg()` method, which lets us pass any of the aggregation functions from the `pyspark.sql.functions` submodule. (I loaded this submodule earlier as `psf`.)

The aggregate functions in the submodule, such as `psf.stddev()`, take a column name or `Column` object as their argument.

Here are some examples:

In [43]:
flights\
    .groupBy('month', 'destination_airport')\
    .agg((psf.stddev('departure_delay')).alias('dep_delay_stdev'))\
    .show(5)

+-----+-------------------+------------------+
|month|destination_airport|   dep_delay_stdev|
+-----+-------------------+------------------+
|    1|                ACY| 32.81265545745788|
|    1|                EYW| 63.65052137951282|
|    1|                OME|19.546923118225806|
|    1|                RDM|34.260918828983726|
|    1|                TWF| 7.985009333135624|
+-----+-------------------+------------------+
only showing top 5 rows



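Note: We can use the `alias()` method with aggregation as well as when selecting columns, as we saw earlier. Aliasing doesn't work with the basic aggregators such as `avg()`, because they return a DataFrame rather than a column expression; wrap the aggregation inside `agg()` instead, or rename afterwards with `withColumnRenamed()`.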
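
What a grouped aggregation with named output columns computes can be modelled in plain Python. The rows below are hypothetical, and the sketch uses `statistics.stdev`, which is the sample standard deviation, the same quantity Spark's `stddev` computes.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (airline, departure_delay) rows.
rows = [("AA", 10), ("AA", 20), ("AA", 30), ("UA", 5), ("UA", 15)]

# Step 1: group the values by key, like groupBy('airline').
groups = defaultdict(list)
for airline, delay in rows:
    groups[airline].append(delay)

# Step 2: one output row per group, with named aggregate columns, like agg(...).
summary = {
    airline: {"avg_delay": mean(delays), "dep_delay_stdev": stdev(delays)}
    for airline, delays in groups.items()
}
print(summary)
```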

### Joins

In [45]:
airlines = spark.read.csv("./data/kaggle/airlines.csv", inferSchema=True, header=True)
airlines.show(5)

+---------+--------------------+
|IATA_CODE|             AIRLINE|
+---------+--------------------+
|       UA|United Air Lines ...|
|       AA|American Airlines...|
|       US|     US Airways Inc.|
|       F9|Frontier Airlines...|
|       B6|     JetBlue Airways|
+---------+--------------------+
only showing top 5 rows



In [46]:
airports = spark.read.csv("./data/kaggle/airports.csv", inferSchema=True, header=True)
airports.show(5)

+---------+--------------------+-----------+-----+-------+--------+----------+
|IATA_CODE|             AIRPORT|       CITY|STATE|COUNTRY|LATITUDE| LONGITUDE|
+---------+--------------------+-----------+-----+-------+--------+----------+
|      ABE|Lehigh Valley Int...|  Allentown|   PA|    USA|40.65236|  -75.4404|
|      ABI|Abilene Regional ...|    Abilene|   TX|    USA|32.41132|  -99.6819|
|      ABQ|Albuquerque Inter...|Albuquerque|   NM|    USA|35.04022|-106.60919|
|      ABR|Aberdeen Regional...|   Aberdeen|   SD|    USA|45.44906| -98.42183|
|      ABY|Southwest Georgia...|     Albany|   GA|    USA|31.53552| -84.19447|
+---------+--------------------+-----------+-----+-------+--------+----------+
only showing top 5 rows



The `join()` method lets us do all kinds of joins. First arg is the dataset we want to use in our join, second arg `on` specifies the keys to join on, and third arg `how` specifies which kind of join to perform.

In [47]:
flights\
    .withColumnRenamed("destination_airport", "iata_code")\
    .join(airports, on="iata_code", how="leftouter")\
    .show(5)

+---------+----+-----+---+-----------+-------+-------------+-----------+--------------+-------------------+--------------+---------------+--------+----------+--------------+------------+--------+--------+---------+-------+-----------------+------------+-------------+--------+---------+-------------------+----------------+--------------+-------------+-------------------+-------------+-----------------+--------------------+---------------+-----+-------+--------+----------+
|iata_code|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|TAIL_NUMBER|ORIGIN_AIRPORT|SCHEDULED_DEPARTURE|DEPARTURE_TIME|DEPARTURE_DELAY|TAXI_OUT|WHEELS_OFF|SCHEDULED_TIME|ELAPSED_TIME|AIR_TIME|DISTANCE|WHEELS_ON|TAXI_IN|SCHEDULED_ARRIVAL|ARRIVAL_TIME|ARRIVAL_DELAY|DIVERTED|CANCELLED|CANCELLATION_REASON|AIR_SYSTEM_DELAY|SECURITY_DELAY|AIRLINE_DELAY|LATE_AIRCRAFT_DELAY|WEATHER_DELAY|     duration_hrs|             AIRPORT|           CITY|STATE|COUNTRY|LATITUDE| LONGITUDE|
+---------+----+-----+---+-----------+-------+--
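
The behaviour of a left outer join can be sketched in plain Python: every row from the left side is kept, and rows without a match get empty values for the right-hand columns. The rows are hypothetical, and the model assumes at most one airport per code.

```python
# Left side: hypothetical flight rows keyed by destination code.
flights_rows = [
    {"iata_code": "LAX", "airline": "AA"},
    {"iata_code": "SFO", "airline": "UA"},
    {"iata_code": "XXX", "airline": "DL"},   # hypothetical code with no match
]

# Right side: lookup from code to airport attributes.
airports_by_code = {
    "LAX": {"city": "Los Angeles"},
    "SFO": {"city": "San Francisco"},
}

joined = []
for row in flights_rows:
    # Unmatched keys keep the left row and fill right-hand columns with None.
    match = airports_by_code.get(row["iata_code"], {"city": None})
    joined.append({**row, **match})

for row in joined:
    print(row)
```

With `how="inner"` the unmatched third row would be dropped instead of padded.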

## Visualising and summarising data with pyspark

We can't use traditional plotting packages directly with Spark DataFrames. There are three methods to plot data from Spark DataFrames:
1. pyspark_dist_explore methods
2. toPandas() method to create a pandas dataframe
3. HandySpark library

There are methods to compute basic summaries of Spark DataFrame columns.

In [48]:
flights.select('arrival_time').summary().show()

+-------+------------------+
|summary|      arrival_time|
+-------+------------------+
|  count|           5726566|
|   mean|1476.4911879126164|
| stddev|  526.319737212267|
|    min|                 1|
|    25%|              1059|
|    50%|              1513|
|    75%|              1917|
|    max|              2400|
+-------+------------------+



Or the slightly less detailed...

In [49]:
flights.select('arrival_time').describe().show()

+-------+------------------+
|summary|      arrival_time|
+-------+------------------+
|  count|           5726566|
|   mean|1476.4911879126164|
| stddev|  526.319737212267|
|    min|                 1|
|    max|              2400|
+-------+------------------+
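
The `count` of 5726566 is lower than the 5819079 rows in `flights` because missing values are skipped. A plain-Python sketch of the same five statistics, on a hypothetical column with one missing value:

```python
from statistics import mean, stdev

arrival_times = [905, 1230, None, 1745, 2210]   # hypothetical values; None = missing

# Summaries are computed over the non-missing values only.
present = [t for t in arrival_times if t is not None]
stats = {
    "count": len(present),
    "mean": mean(present),
    "stddev": stdev(present),    # sample standard deviation
    "min": min(present),
    "max": max(present),
}
print(stats)
```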



NOTE TO SELF: Add sections on pyspark_dist_explore and HandySpark