# Getting To Know PySpark

In [None]:
# Install Spark by following these instructions:
# https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec

## What is Spark?

Spark is a platform for cluster computing that splits calculations across multiple nodes. This simplifies working with big data, since each node handles only a subset of the data. It can greatly speed up analysis, provided the communication overhead between nodes does not outweigh the computational speed-up, so it only pays off when the data is large enough.
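
As a rough analogy for this split-and-combine idea, the sketch below divides a list into chunks, lets each worker sum its own chunk, and then merges the partial sums. It uses only Python's standard library (threads on a single machine), not Spark, and the helper `split_into_chunks` is a hypothetical name introduced here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_parts):
    """Cut data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


data = list(range(1, 101))
chunks = split_into_chunks(data, 4)

# Each worker sums only its own chunk of the data
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# The coordinating side merges the partial results
total = sum(partial_sums)
print(partial_sums)
print(total)
```

With data this small, coordinating the workers costs more than it saves, which is the overhead trade-off mentioned above.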

## Using Spark in Python

To use Spark, we need to connect to a cluster. Spark uses a *master-worker architecture*: a master node distributes the calculations across worker nodes, which then return their results to the master.

A cluster can be run locally (where the cluster is simulated, usually for testing) or on a remote machine.

Connecting starts by creating an instance of the `SparkContext` class.

## Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD); it is how Spark distributes data and computation across nodes. Since the RDD is a low-level object, in practice the DataFrame abstraction is used more often.

To start working with a Spark DataFrame, we:
1. Create a `SparkContext`
2. Create a `SparkSession` from our `SparkContext`

In [None]:
# check https://www.freecodecamp.org/news/installing-scala-and-apache-spark-on-mac-os-837ae57d283f/
# for how to install apache spark

# !pip install pyspark

In [1]:
import pyspark

sc = pyspark.SparkContext()

In [2]:
# Verify SparkContext
print(sc)

# Print Spark version
print(sc.version)

<SparkContext master=local[*] appName=pyspark-shell>
2.4.4


In [3]:
# Create a SparkSession (this reuses the existing SparkContext)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [4]:
# spark is an existing SparkSession
df = spark.read.csv("../datasets/flights.csv", header=True)

In [5]:
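# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("flights")

# Select the first 10 rows
flights10 = spark.sql("SELECT * FROM flights LIMIT 10")

flights10.show()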

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

## Convert from Spark DataFrame to Pandas

We can convert a Spark DataFrame to pandas if we need to. Usually, this happens when we have a subset of our query which we want to analyze locally.

In [6]:
import pandas as pd

# Count flights for each origin/destination pair
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"

# Run the query
flight_counts = spark.sql(query)

# Convert the result to a pandas DataFrame
flight_counts.toPandas()

  origin dest    N
0    SEA  RNO    8
1    SEA  DTW   98
2    SEA  CLE    2
3    SEA  LAX  450
4    PDX  SEA  144
5    SEA  BLI    5
6    PDX  IAH   57
7    PDX  PHX  209
8    SEA  SLC  225
9    SEA  SBA   23


## Convert from pandas to Spark

We can also convert a pandas DataFrame into a Spark DataFrame with `spark.createDataFrame()`. A few things to keep in mind:
- The resulting Spark DataFrame is stored locally and is not registered in the `SparkSession` catalog, so we can't refer to it by name from other contexts.
    - For example, a `.sql()` query that references it will throw an error.
- This can be avoided by registering the DataFrame as a `temporary table` with `createOrReplaceTempView()`.
- The temporary table is still only accessible from the specific `SparkSession` that created it.

In [7]:
import numpy as np

pd_temp = pd.DataFrame(np.random.random(10))

In [8]:
spark_temp = spark.createDataFrame(pd_temp)

In [9]:
# notice how the temporary dataframe is not listed
spark.catalog.listTables()

[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [10]:
# add the temporary table to our session
spark_temp.createOrReplaceTempView("temp")

In [11]:
# now the temporary dataframe **is** listed
spark.catalog.listTables()

[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

## Reading straight into Spark

Going through pandas is not required: `spark.read.csv()` loads a file directly into a Spark DataFrame.



In [12]:
# spark is an existing SparkSession
df = spark.read.csv("../datasets/flights.csv", header=True)

In [14]:
df.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

# Manipulating Data

