# Spark Exercises

### 1.) Create a spark data frame that contains your favorite programming languages.

- The name of the column should be language
- View the schema of the dataframe
- Output the shape of the dataframe
- Show the first 5 records in the dataframe

In [3]:
import pyspark

import pandas as pd
import numpy as np

In [4]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [14]:
np.random.seed(456)

pandas_dataframe = pd.DataFrame(
    dict(language=np.random.choice(("Java","Python", "HTML", "C++",".Net"), 20))
)
pandas_dataframe

,language
0,C++
1,C++
2,Python
3,HTML
4,.Net
5,HTML
6,.Net
7,.Net
8,.Net
9,Python


In [15]:
df = spark.createDataFrame(pandas_dataframe)
df

DataFrame[language: string]

In [16]:
df.show(5)

+--------+
|language|
+--------+
|     C++|
|     C++|
|  Python|
|    HTML|
|    .Net|
+--------+
only showing top 5 rows



In [17]:
df.describe()

DataFrame[summary: string, language: string]

In [18]:
df.describe().show()

+-------+--------+
|summary|language|
+-------+--------+
|  count|      20|
|   mean|    null|
| stddev|    null|
|    min|    .Net|
|    max|  Python|
+-------+--------+



In [19]:
len(df.columns), df.count()

(1, 20)

### 2.) Load the mpg dataset as a spark dataframe.

- Create 1 column of output that contains a message like the one below for each vehicle:
    - The 1999 audi a4 has a 4 cylinder engine.

- Transform the trans column so that it only contains either manual or auto.

In [20]:
from pydataset import data

mpg = spark.createDataFrame(data("mpg"))
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



In [29]:
from pyspark.sql.functions import *

In [30]:
mpg.select(
    concat(
        lit("The "), mpg.year, lit(" "), mpg.manufacturer, lit(" "), mpg.model,
        lit(" has a "), mpg.cyl, lit(" cylinder engine."),
    ).alias("Description")
).show(5)

+--------------------+
|         Description|
+--------------------+
|The 1999 audi a4 ...|
|The 1999 audi a4 ...|
|The 2008 audi a4 ...|
|The 2008 audi a4 ...|
|The 1999 audi a4 ...|
+--------------------+
only showing top 5 rows



In [39]:
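mpg.select(
    'trans',
    # Strip the parenthesised suffix, e.g. "auto(l5)" -> "auto",
    # regardless of how many characters it contains
    regexp_replace('trans', r'\(.*\)$', '').alias('trans')
).show(10)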

+----------+------+
|     trans| trans|
+----------+------+
|  auto(l5)|  auto|
|manual(m5)|manual|
|manual(m6)|manual|
|  auto(av)|  auto|
|  auto(l5)|  auto|
|manual(m5)|manual|
|  auto(av)|  auto|
|manual(m5)|manual|
|  auto(l5)|  auto|
|manual(m6)|manual|
+----------+------+
only showing top 10 rows



### 3.) Load the tips dataset as a spark dataframe.

- What percentage of observations are smokers?
- Create a column that contains the tip percentage
- Calculate the average tip percentage for each combination of sex and smoker.

In [42]:
import pydataset
tips = pydataset.data('tips')
tips.head()

,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [43]:
tips = spark.createDataFrame(tips) # any pandas dataframe

In [48]:
smokers = tips.where(tips.smoker == 'Yes')

In [55]:
# Count the smoker rows first; a DataFrame itself cannot be divided by an int
pct_smokers = tips.where(tips.smoker == 'Yes').count() / tips.count()

In [50]:
# Calculate tip percentage (expressed as a fraction of the total bill)
tip_percentage = tips.tip / tips.total_bill

In [51]:
tip_percentage.alias('tip_perc')

Column<'(tip / total_bill) AS tip_perc'>

In [53]:
# Create column tip_perc
tips.select('*', tip_percentage.alias('tip_perc')).show()

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|           tip_perc|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|0.18623962040332148|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|0.22805017103762829|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|0.11607142857142858|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|0.13031914893617022|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2| 0.2185385656292287|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2| 0.1665043816942551|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|0