# Spark DataFrame - Data Cleaning
There are three options when dealing with missing data:
1. Leave the missing value as null
2. Drop the data point (or the entire row)
3. Fill it in with a different value

Which option is appropriate depends on your requirements and on how the data will be used downstream.

Objective: Let's explore our options when it comes to cleaning a basic dataset.

In [1]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('missing').getOrCreate()

In [2]:
# Read a CSV file that has a header row. inferSchema=True asks Spark to infer the column types.
df = spark.read.csv('Datasets/contains_null.csv', header=True, inferSchema=True)

# Let's see the data. You'll notice nulls.
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [3]:
# With no arguments, na.drop() removes every row that contains at least one null. Three rows are dropped.
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



In [4]:
# thresh sets the minimum number of non-null values a row needs in order to be kept. The second row (emp2) is dropped, as it has only one non-null value.
df.na.drop(thresh=2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [5]:
# how="all" drops a row only if every value in it is null. Zero rows are dropped, since Id is never null.
df.na.drop(how="all").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



In [6]:
# Drops a row if the value in a particular column (here, Sales) is missing. Two rows are dropped.
df.na.drop(subset="Sales").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
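
The drop rules used in cells 3 to 6 can be summarized in a small plain-Python model. This is only a sketch of the rule as it appears in the outputs above, not Spark code or Spark's implementation: `None` stands in for null, and the names `COLUMNS`, `rows`, `keep_row` and `drop` are made up for illustration.

```python
# Plain-Python model of the row-dropping rule (illustrative, not Spark's code).
# None stands in for null.

COLUMNS = ("Id", "Name", "Sales")
rows = [
    ("emp1", "John", None),
    ("emp2", None, None),
    ("emp3", None, 345.0),
    ("emp4", "Cindy", 456.0),
]


def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if the row survives the drop."""
    # Only the columns named in subset are inspected (all columns by default).
    names = COLUMNS if subset is None else subset
    values = [row[COLUMNS.index(name)] for name in names]
    non_null = sum(value is not None for value in values)

    if thresh is not None:
        # thresh takes precedence over how: keep rows that have
        # at least thresh non-null values.
        return non_null >= thresh
    if how == "any":
        # Drop if any inspected value is null, so keep only complete rows.
        return non_null == len(values)
    if how == "all":
        # Drop only if every inspected value is null.
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


def drop(rows, **kwargs):
    """Apply keep_row to every row and return the survivors."""
    return [row for row in rows if keep_row(row, **kwargs)]


print([r[0] for r in drop(rows)])                    # ['emp4']
print([r[0] for r in drop(rows, thresh=2)])          # ['emp1', 'emp3', 'emp4']
print([r[0] for r in drop(rows, how="all")])         # ['emp1', 'emp2', 'emp3', 'emp4']
print([r[0] for r in drop(rows, subset=["Sales"])])  # ['emp3', 'emp4']
```

The four printed lists match the four Spark outputs above, which is a quick way to check your reading of `how`, `thresh` and `subset` before running a drop on real data.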



In [7]:
# Instead of dropping the row, this fills nulls in string columns with FILL VALUE.
df.na.fill("FILL VALUE").show()

# Spark only fills columns whose type matches the fill value: numbers go into numeric columns, strings into string columns.
df.na.fill(0).show()

# However, it's good practice to specify the column(s) you want to fill using subset.
df.na.fill('FILL NAME', subset=['Name']).show()

+----+----------+-----+
|  Id|      Name|Sales|
+----+----------+-----+
|emp1|      John| null|
|emp2|FILL VALUE| null|
|emp3|FILL VALUE|345.0|
|emp4|     Cindy|456.0|
+----+----------+-----+

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|  0.0|
|emp2| null|  0.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+

+----+---------+-----+
|  Id|     Name|Sales|
+----+---------+-----+
|emp1|     John| null|
|emp2|FILL NAME| null|
|emp3|FILL NAME|345.0|
|emp4|    Cindy|456.0|
+----+---------+-----+
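
The type-matching behaviour seen in the three outputs above can be mimicked in plain Python. This is a simplified model, not Spark's implementation: it assumes only `string` and `double` columns exist, uses `None` for null, and the names `schema`, `rows` and `fill` are hypothetical.

```python
# Plain-Python model of type-matched filling (illustrative, not Spark's code).
# Assumption: the only column types are "string" and "double".

schema = {"Id": "string", "Name": "string", "Sales": "double"}
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def fill(rows, value, subset=None):
    """Fill nulls with value, but only in columns whose type matches it."""
    # A string value targets string columns; a number targets double columns.
    wanted = "string" if isinstance(value, str) else "double"
    candidates = schema if subset is None else subset
    targets = {col for col in candidates if schema[col] == wanted}
    # Numbers are stored as floats in a double column (0 becomes 0.0).
    cast = str if wanted == "string" else float
    return [
        {
            col: cast(value) if col in targets and val is None else val
            for col, val in row.items()
        }
        for row in rows
    ]


print(fill(rows, "FILL VALUE")[1])                  # Name filled, Sales still None
print(fill(rows, 0)[1])                             # Sales filled with 0.0, Name still None
print(fill(rows, "FILL NAME", subset=["Name"])[1])  # only Name is touched
```

Passing `subset` makes the intent explicit and guards against a fill value silently landing in a column you did not mean to change.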



In [8]:
# A common choice for a numeric column is to fill missing values with the column mean. Let's do that for Sales.
from pyspark.sql.functions import mean

# Let's collect the average. You'll notice that collect() returns a list of Row objects rather than a bare number.
mean_sales = df.select(mean(df['Sales'])).collect()
mean_sales

[Row(avg(Sales)=400.5)]

In [9]:
# Indexing the list gives us the first (and only) Row, which still wraps the value.
mean_sales[0]


Row(avg(Sales)=400.5)

In [10]:
# We need to go one level deeper: indexing the Row returns the actual value. Let's assign that value to a variable.
mean_sales[0][0]
mean_sales_val = mean_sales[0][0]

In [11]:
# And finally, fill the missing values with the mean.
df.na.fill(mean_sales_val, subset=['Sales']).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
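
It is worth checking where 400.5 comes from: the average is taken over the non-null Sales values only, so the two missing entries do not pull it toward zero. The plain-Python sketch below reproduces the arithmetic; `None` stands in for null and the variable names are illustrative.

```python
from statistics import mean

# Sales column from the dataset above, with None standing in for null.
sales = [None, None, 345.0, 456.0]

# The average is computed over the observed (non-null) values only.
observed = [value for value in sales if value is not None]
mean_sales_val = mean(observed)  # (345.0 + 456.0) / 2

# Replace each missing value with the mean; leave the others unchanged.
filled = [mean_sales_val if value is None else value for value in sales]

print(mean_sales_val)  # 400.5
print(filled)          # [400.5, 400.5, 345.0, 456.0]
```

Note that mean-filling leaves the column average unchanged but shrinks its variance, so it is a convenience rather than a neutral choice.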



Great work! At this stage, we're pretty much done with understanding DataFrames. You can now move on to applying an algorithm. We recommend going through linear regression, then logistic regression and finishing off with tree methods. It's best to start with the documentation example before moving to the advanced example. 