# Handling Missing Data in PySpark HW Solutions

In this HW assignment you will strengthen your skills in dealing with missing data.
 
**Review:** you have 2 basic options for handling missing data (you will have to decide which approach is right for your data):

1. Drop the missing data points (including the entire row)
2. Fill them in with some other value.

Let's practice some examples of each of these methods!


#### But first!

Start your Spark session

In [2]:
import pyspark
from pyspark.sql import SparkSession

## Read in the dataset for this Notebook

Use the Weather.csv file attached to this lecture.

In [3]:
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('Weather.csv', header=True)

## About this dataset

**New York City Taxi Trip - Hourly Weather Data**

Here is some detailed weather data for the New York City Taxi Trips.

**Source:** https://www.kaggle.com/meinertsen/new-york-city-taxi-trip-hourly-weather-data

### Print a view of the first several lines of the dataframe to see what our data looks like

In [4]:
df.toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10476,2016-12-31 19:51:00,6.1,43.0,-4.4,24.1,47.0,7.4,4.6,,,...,,,Overcast,cloudy,0,0,0,0,0,0
10477,2016-12-31 20:51:00,6.1,43.0,-4.4,24.1,47.0,13.0,8.1,38.9,24.2,...,,,Overcast,cloudy,0,0,0,0,0,0
10478,2016-12-31 21:51:00,6.1,43.0,-5.0,23.0,45.0,9.3,5.8,29.6,18.4,...,,,Overcast,cloudy,0,0,0,0,0,0
10479,2016-12-31 22:51:00,6.7,44.1,-5.0,23.0,43.0,14.8,9.2,,,...,,,Overcast,cloudy,0,0,0,0,0,0


### Print the schema 

So that we can see if we need to make any corrections to the data types.

In [5]:
df.printSchema()

root
 |-- pickup_datetime: string (nullable = true)
 |-- tempm: string (nullable = true)
 |-- tempi: string (nullable = true)
 |-- dewptm: string (nullable = true)
 |-- dewpti: string (nullable = true)
 |-- hum: string (nullable = true)
 |-- wspdm: string (nullable = true)
 |-- wspdi: string (nullable = true)
 |-- wgustm: string (nullable = true)
 |-- wgusti: string (nullable = true)
 |-- wdird: string (nullable = true)
 |-- wdire: string (nullable = true)
 |-- vism: string (nullable = true)
 |-- visi: string (nullable = true)
 |-- pressurem: string (nullable = true)
 |-- pressurei: string (nullable = true)
 |-- windchillm: string (nullable = true)
 |-- windchilli: string (nullable = true)
 |-- heatindexm: string (nullable = true)
 |-- heatindexi: string (nullable = true)
 |-- precipm: string (nullable = true)
 |-- precipi: string (nullable = true)
 |-- conds: string (nullable = true)
 |-- icon: string (nullable = true)
 |-- fog: string (nullable = true)
 |-- rain: string (nullable = t

In [6]:
from pyspark.sql.functions import col, isnan, when, count

In [7]:
# For every column, count the values that are NaN or null
missing_count = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])

## 1. How much missing data are we working with?

Get a count and percentage of each variable in the dataset to answer this question.

In [8]:
missing_count.toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,0,5,5,5,5,5,737,737,8605,8605,...,8775,8775,0,0,0,0,0,0,0,0


## 2. How many rows contain at least one null value?

We want to know how many rows we will lose if we use the df.na.drop option.

In [13]:
df_pandas = df.toPandas()

In [16]:
df_pandas.isna().any(axis=1).sum()

10481

## 3. Drop the missing data

Drop any row that contains missing data across the whole dataset

In [17]:
df.na.drop(how='any').show()

+---------------+-----+-----+------+------+---+-----+-----+------+------+-----+-----+----+----+---------+---------+----------+----------+----------+----------+-------+-------+-----+----+---+----+----+----+-------+-------+
|pickup_datetime|tempm|tempi|dewptm|dewpti|hum|wspdm|wspdi|wgustm|wgusti|wdird|wdire|vism|visi|pressurem|pressurei|windchillm|windchilli|heatindexm|heatindexi|precipm|precipi|conds|icon|fog|rain|snow|hail|thunder|tornado|
+---------------+-----+-----+------+------+---+-----+-----+------+------+-----+-----+----+----+---------+---------+----------+----------+----------+----------+-------+-------+-----+----+---+----+----+----+-------+-------+
+---------------+-----+-----+------+------+---+-----+-----+------+------+-----+-----+----+----+---------+---------+----------+----------+----------+----------+-------+-------+-----+----+---+----+----+----+-------+-------+



## 4. Drop with a threshold

Count how many rows would be dropped if we dropped every row that has fewer than 12 non-null values (that is, kept only rows with at least 12 non-null values)

In [21]:
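# Note: thresh=len(df.columns) requires every column to be non-null,
# and count() returns the rows that are kept, not the rows that are dropped
df.na.drop(thresh=len(df.columns)).count()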

0

## 5. Drop rows according to specific column value

Now count how many rows would be dropped if you only drop rows whose values in the tempm column are null/NaN

In [29]:
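# Drop rows whose tempm is missing, then count the rows that remain
df_pandas.dropna(subset=['tempm'], inplace=True)
row_count = df_pandas.shape[0]

In [30]:
# Rows remaining after the drop; 10481 - 10476 = 5 rows were dropped
print(row_count)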

10476


## 6. Drop rows that are null across all columns

Count how many rows would be dropped if you only dropped rows where ALL the values are null

In [31]:
# count() returns the rows that remain; if it equals the total row count, no rows were dropped
df.na.drop(how='all').count()

10481

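## 7. Fill in the missing values in all string columns with the word "N/A"

Make sure you don't edit the df dataframe itself. Create a copy of the df, then edit that one. (Spark DataFrames are immutable, so df2 = df only creates a second reference to the same data, and na.fill returns a new DataFrame rather than changing either one.)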

In [32]:
df2 = df

In [34]:
df2.na.fill('N/A').toPandas()

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10476,2016-12-31 19:51:00,6.1,43.0,-4.4,24.1,47.0,7.4,4.6,,,...,,,Overcast,cloudy,0,0,0,0,0,0
10477,2016-12-31 20:51:00,6.1,43.0,-4.4,24.1,47.0,13.0,8.1,38.9,24.2,...,,,Overcast,cloudy,0,0,0,0,0,0
10478,2016-12-31 21:51:00,6.1,43.0,-5.0,23.0,45.0,9.3,5.8,29.6,18.4,...,,,Overcast,cloudy,0,0,0,0,0,0
10479,2016-12-31 22:51:00,6.7,44.1,-5.0,23.0,43.0,14.8,9.2,,,...,,,Overcast,cloudy,0,0,0,0,0,0


## 8. Fill in NaN values with averages for the tempm and tempi columns

*Note: you will first need to compute the averages for each column and then fill in with the corresponding value.*

In [43]:
from pyspark.sql.functions import mean
avg_tempm = df2.select(mean(df2["tempm"]))
avg_tempi = df2.select(mean(df2["tempi"]))

In [54]:
avg_tempm_val = avg_tempm.collect()[0][0]
avg_tempi_val = avg_tempi.collect()[0][0]

In [57]:
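# fillna() ignores columns whose type does not match the fill value, so cast the string columns to double first
df2 = df2.withColumn('tempm', col('tempm').cast('double')).withColumn('tempi', col('tempi').cast('double'))
df2.fillna({'tempm': avg_tempm_val, 'tempi': avg_tempi_val}).toPandas()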

Unnamed: 0,pickup_datetime,tempm,tempi,dewptm,dewpti,hum,wspdm,wspdi,wgustm,wgusti,...,precipm,precipi,conds,icon,fog,rain,snow,hail,thunder,tornado
0,2015-12-31 00:15:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.5,0.02,Light Rain,rain,0,1,0,0,0,0
1,2015-12-31 00:42:00,7.8,46.0,6.1,43.0,89.0,7.4,4.6,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
2,2015-12-31 00:51:00,7.8,46.0,6.1,43.0,89.0,5.6,3.5,,,...,0.8,0.03,Overcast,cloudy,0,0,0,0,0,0
3,2015-12-31 01:51:00,7.2,45.0,5.6,42.1,90.0,7.4,4.6,,,...,0.3,0.01,Overcast,cloudy,0,0,0,0,0,0
4,2015-12-31 02:51:00,7.2,45.0,5.6,42.1,90.0,0.0,0.0,,,...,,,Overcast,cloudy,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10476,2016-12-31 19:51:00,6.1,43.0,-4.4,24.1,47.0,7.4,4.6,,,...,,,Overcast,cloudy,0,0,0,0,0,0
10477,2016-12-31 20:51:00,6.1,43.0,-4.4,24.1,47.0,13.0,8.1,38.9,24.2,...,,,Overcast,cloudy,0,0,0,0,0,0
10478,2016-12-31 21:51:00,6.1,43.0,-5.0,23.0,45.0,9.3,5.8,29.6,18.4,...,,,Overcast,cloudy,0,0,0,0,0,0
10479,2016-12-31 22:51:00,6.7,44.1,-5.0,23.0,43.0,14.8,9.2,,,...,,,Overcast,cloudy,0,0,0,0,0,0


### That's it! Great Job!