# Spark 101

In [16]:
import pyspark

import pandas as pd
import numpy as np

from seaborn import load_dataset
from pydataset import data
from vega_datasets import data as vega


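# Explicit imports instead of a wildcard; note that sum, min, and max
# imported here shadow the Python built-ins of the same name.
from pyspark.sql.functions import concat, lit, sum, avg, min, max, count, mean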

In [2]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/19 10:21:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 1. Create & Explore a DF
Create a Spark DataFrame that contains your favorite programming languages.

- The name of the column should be language
- View the schema of the dataframe
- Output the shape of the dataframe
- Show the first 5 records in the dataframe

---

#### Create DF
The name of the column should be language

In [3]:
# Create a pandas DataFrame with a single column named language
pandas_df = pd.DataFrame(
    data=['JavaScript', 'Java', 'C#', 'C', 'C++',
          'Go', 'PHP', 'Python', 'R', 'Swift'],
    columns=['language'],
)

In [4]:
# convert the pandas DF to a Spark dataframe
df = spark.createDataFrame(pandas_df)
df

DataFrame[language: string]

In [5]:
df.show(5)

+----------+
|  language|
+----------+
|JavaScript|
|      Java|
|        C#|
|         C|
|       C++|
+----------+
only showing top 5 rows

In [6]:
df.describe().show()

+-------+--------+
|summary|language|
+-------+--------+
|  count|      10|
|   mean|    null|
| stddev|    null|
|    min|       C|
|    max|   Swift|
+-------+--------+



---

#### View Schema
View the schema of the dataframe

In [7]:
# View the schema of the dataframe
df.printSchema()

root
 |-- language: string (nullable = true)



---

#### Shape
Output the shape of the dataframe

In [8]:
# Output the shape of the dataframe
print(df.count(), "rows", len(df.columns), "columns")

10 rows 1 columns


---

#### First 5 Records
Show the first 5 records in the dataframe

In [9]:
# Show the first 5 records in the dataframe (head returns a list of Row objects)
df.head(5)

[Row(language='JavaScript'),
 Row(language='Java'),
 Row(language='C#'),
 Row(language='C'),
 Row(language='C++')]

## 2. MPG
Load the mpg dataset as a Spark DataFrame.

In [10]:
from pydataset import data

In [11]:
mpg = spark.createDataFrame(data('mpg'))
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



### 2a.
Create one column of output that contains, for each vehicle, a message like the one below:

The 1999 audi a4 has a 4 cylinder engine.

In [30]:
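# Build the message as a standalone column expression rather than
# setting a new attribute on the DataFrame object.
message = concat(lit('The '),
                 mpg.year,
                 lit(' '),
                 mpg.manufacturer,
                 lit(' '),
                 mpg.model,
                 lit(' has a '),
                 mpg.cyl,
                 lit(' cylinder engine.'))
mpg.select(message).show(10, truncate=False)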

+------------------------------------------------------------------------------+
|concat(The , year,  , manufacturer,  , model,  has a , cyl,  cylinder engine.)|
+------------------------------------------------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.                                     |
|The 1999 audi a4 has a 4 cylinder engine.                                     |
|The 2008 audi a4 has a 4 cylinder engine.                                     |
|The 2008 audi a4 has a 4 cylinder engine.                                     |
|The 1999 audi a4 has a 6 cylinder engine.                                     |
|The 1999 audi a4 has a 6 cylinder engine.                                     |
|The 2008 audi a4 has a 6 cylinder engine.                                     |
|The 1999 audi a4 quattro has a 4 cylinder engine.                             |
|The 1999 audi a4 quattro has a 4 cylinder engine.                             |
|The 2008 audi a4 quattro ha

### 2b.
Transform the trans column so that it contains only manual or auto.

## 3. TIPS
Load the tips dataset as a Spark DataFrame.

In [12]:
tips = spark.createDataFrame(data('tips'))
tips.show(5)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows



- 3a. What percentage of observations are smokers?

- 3b. Create a column that contains the tip percentage.

- 3c. Calculate the average tip percentage for each combination of sex and smoker.

## 4. Seattle Weather
Use the Seattle weather dataset referenced in the lesson to answer the questions below.

- Convert the temperatures to Fahrenheit.

- Which month has the most rain, on average?

- Which year was the windiest?

- What is the most frequent type of weather in January?

- What is the average high and low temperature on sunny days in July in 2013 and 2014?

- What percentage of days were rainy in Q3 of 2015?

- For each year, find what percentage of days it rained (had non-zero precipitation).