# Filling of Null Values

In this notebook we will go over the basic tools used to fill null values in Spark tables.

We will start by importing the pyspark machinery and a helper function (```create_sp_table4```) that creates the table we will be using.

In [1]:
# Spark related machinery
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if one exists; otherwise create one with Hive support
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

In [2]:
from pyspark_functions import create_sp_table4

In the next lines we will create a table called **data** using the function ```create_sp_table4```. The table has 6 columns:
* participant_id
* name
* last_name
* age
* score_game_1
* score_game_2

All columns except **participant_id** contain ```null``` or ```NaN``` values. A ```null``` value represents **no value**, while a ```NaN``` value stands for "not a number" and can only appear in floating-point columns. Although they are different, in this notebook we won't make a distinction when dealing with them, and from now on we will use the word "null" to refer to both.
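The difference between the two kinds of missing value can be seen in plain Python, where ```None``` plays the role of ```null``` and ```float("nan")``` is a ```NaN```. The sketch below is only an illustration and does not use Spark; the helper ```is_missing``` is a hypothetical name introduced here.

```python
import math

null_value = None          # plays the role of Spark's null: no value at all
nan_value = float("nan")   # a float that is "not a number"

# NaN is still a float, and it never compares equal to anything, itself included.
nan_is_float = isinstance(nan_value, float)
nan_equals_itself = nan_value == nan_value

# Each kind of missing value needs its own check.
null_detected = null_value is None
nan_detected = math.isnan(nan_value)


def is_missing(x):
    """Treat both None and NaN as missing, as this notebook does."""
    return x is None or (isinstance(x, float) and math.isnan(x))


ages = [27.0, float("nan"), 25.0, None]
missing_flags = [is_missing(a) for a in ages]
print(missing_flags)  # [False, True, False, True]
```

Because ```NaN == NaN``` is false, an equality test cannot find ```NaN``` values; a dedicated check such as ```math.isnan``` is needed.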

In [3]:
data = create_sp_table4()
data.show()

+--------------+-------+---------+----+------------+------------+
|participant_id|   name|last_name| age|score_game_1|score_game_2|
+--------------+-------+---------+----+------------+------------+
|             1|   null|     Lara|27.0|        89.0|         NaN|
|             2|   Liam|     null| NaN|        50.0|         NaN|
|             3| Olivia|   Wilson|25.0|         NaN|        70.0|
|             4|Jackson|     null|24.0|        98.0|        90.0|
|             5|   null|    Moore|26.0|       100.0|        89.0|
|             6| Oliver|     Leon| NaN|        65.0|        70.0|
|             7|  Lucas|    Brown|24.0|         NaN|        75.0|
|             8|   null|     null|31.0|        85.0|        79.0|
|             9|   Aria| Robinson| NaN|        80.0|         NaN|
|            10| Amelia|     null|29.0|         NaN|        99.0|
+--------------+-------+---------+----+------------+------------+
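The helper ```create_sp_table4``` is not shown in this notebook. If you do not have it, a table with the same content can be rebuilt from the output above. The sketch below is a hypothetical stand-in, not the original helper: the name ```build_table``` and the column types (an integer id, string names, double-precision numbers) are assumptions inferred from the printed values.

```python
# Hypothetical stand-in for create_sp_table4, rebuilt from the printed output.
NAN = float("nan")  # becomes NaN in a double column; None becomes null

ROWS = [
    (1, None, "Lara", 27.0, 89.0, NAN),
    (2, "Liam", None, NAN, 50.0, NAN),
    (3, "Olivia", "Wilson", 25.0, NAN, 70.0),
    (4, "Jackson", None, 24.0, 98.0, 90.0),
    (5, None, "Moore", 26.0, 100.0, 89.0),
    (6, "Oliver", "Leon", NAN, 65.0, 70.0),
    (7, "Lucas", "Brown", 24.0, NAN, 75.0),
    (8, None, None, 31.0, 85.0, 79.0),
    (9, "Aria", "Robinson", NAN, 80.0, NAN),
    (10, "Amelia", None, 29.0, NAN, 99.0),
]

# Assumed column types, written as a DDL-style schema string.
SCHEMA = (
    "participant_id INT, name STRING, last_name STRING, "
    "age DOUBLE, score_game_1 DOUBLE, score_game_2 DOUBLE"
)


def build_table(spark):
    """Create the DataFrame from ROWS using an existing SparkSession."""
    return spark.createDataFrame(ROWS, schema=SCHEMA)
```

With the session created in the first cell, ```data = build_table(spark)``` should give a table equivalent to the one shown above.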



# Method 1

In our first example we will fill all of the null values using a dictionary and the ```fillna``` method. We will fill the null values according to the mapping shown below:

* name ---------> "JAMES"
* last_name ----> "BOND"
* age ----------> -100
* score_game_1 -> -10
* score_game_2 -> -20

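The dictionary maps each column name to the value used to fill the null values in that column. We then pass this dictionary to the ```fillna``` method, as in the block of code below. Note in the output that the integer fill values are shown as doubles (for example, -100 appears as -100.0), because they take the type of the column they fill.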

In [4]:
fill_values_dictio = {"name": "JAMES",
                      "last_name": "BOND",
                      "age": -100,
                      "score_game_1": -10,
                      "score_game_2": -20}

example_1 = data.fillna(fill_values_dictio)

example_1.show()

+--------------+-------+---------+------+------------+------------+
|participant_id|   name|last_name|   age|score_game_1|score_game_2|
+--------------+-------+---------+------+------------+------------+
|             1|  JAMES|     Lara|  27.0|        89.0|       -20.0|
|             2|   Liam|     BOND|-100.0|        50.0|       -20.0|
|             3| Olivia|   Wilson|  25.0|       -10.0|        70.0|
|             4|Jackson|     BOND|  24.0|        98.0|        90.0|
|             5|  JAMES|    Moore|  26.0|       100.0|        89.0|
|             6| Oliver|     Leon|-100.0|        65.0|        70.0|
|             7|  Lucas|    Brown|  24.0|       -10.0|        75.0|
|             8|  JAMES|     BOND|  31.0|        85.0|        79.0|
|             9|   Aria| Robinson|-100.0|        80.0|       -20.0|
|            10| Amelia|     BOND|  29.0|       -10.0|        99.0|
+--------------+-------+---------+------+------------+------------+



# Method 2

A second way of filling null values is to pass the ```fillna``` method a single fill value and, through the ```subset``` argument, a list of the columns to fill. In this example we will fill the null values in the columns **age**, **score_game_1**, and **score_game_2** with the value **-100**, as in the block of code below:

In [5]:
example_2 = data.fillna(-100, subset=["age", "score_game_1", "score_game_2"])
example_2.show()

+--------------+-------+---------+------+------------+------------+
|participant_id|   name|last_name|   age|score_game_1|score_game_2|
+--------------+-------+---------+------+------------+------------+
|             1|   null|     Lara|  27.0|        89.0|      -100.0|
|             2|   Liam|     null|-100.0|        50.0|      -100.0|
|             3| Olivia|   Wilson|  25.0|      -100.0|        70.0|
|             4|Jackson|     null|  24.0|        98.0|        90.0|
|             5|   null|    Moore|  26.0|       100.0|        89.0|
|             6| Oliver|     Leon|-100.0|        65.0|        70.0|
|             7|  Lucas|    Brown|  24.0|      -100.0|        75.0|
|             8|   null|     null|  31.0|        85.0|        79.0|
|             9|   Aria| Robinson|-100.0|        80.0|      -100.0|
|            10| Amelia|     null|  29.0|      -100.0|        99.0|
+--------------+-------+---------+------+------------+------------+



# Method 3

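A third way of filling null values in a Spark table is to use the method ```na.fill```, which is an alias of ```fillna```. In this case we pass the value we want to use to fill the null values, but we do not specify the columns where we want the null values to be filled. PySpark will fill the null values only in the columns whose type matches the value that we passed. Since we pass a string here, only the string columns (**name** and **last_name**) are filled, and the ```NaN``` values in the numeric columns are left untouched.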

In the block of code below we will fill null values with the string "NOT_FOUND".

In [6]:
example_3 = data.na.fill("NOT_FOUND")
example_3.show()

+--------------+---------+---------+----+------------+------------+
|participant_id|     name|last_name| age|score_game_1|score_game_2|
+--------------+---------+---------+----+------------+------------+
|             1|NOT_FOUND|     Lara|27.0|        89.0|         NaN|
|             2|     Liam|NOT_FOUND| NaN|        50.0|         NaN|
|             3|   Olivia|   Wilson|25.0|         NaN|        70.0|
|             4|  Jackson|NOT_FOUND|24.0|        98.0|        90.0|
|             5|NOT_FOUND|    Moore|26.0|       100.0|        89.0|
|             6|   Oliver|     Leon| NaN|        65.0|        70.0|
|             7|    Lucas|    Brown|24.0|         NaN|        75.0|
|             8|NOT_FOUND|NOT_FOUND|31.0|        85.0|        79.0|
|             9|     Aria| Robinson| NaN|        80.0|         NaN|
|            10|   Amelia|NOT_FOUND|29.0|         NaN|        99.0|
+--------------+---------+---------+----+------------+------------+
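The type-matching behaviour seen in Methods 2 and 3 can be modelled in plain Python. The sketch below is a simplified toy model, not Spark's implementation: ```fill_scalar```, ```is_missing```, and the type labels ```"string"``` and ```"double"``` are hypothetical names, and only string and numeric fills are covered.

```python
import math


def is_missing(x):
    """True for None (null) and for float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fill_scalar(rows, col_types, value, subset=None):
    """Toy model of a scalar fill: only columns whose type matches `value` change.

    rows      : list of dicts, one per row
    col_types : dict mapping column name -> "string" or "double"
    subset    : optional list of columns to consider (default: all columns)
    """
    # bool is excluded so that True/False are not treated as numbers here
    if isinstance(value, str):
        target_type = "string"
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        target_type = "double"
    else:
        raise TypeError("this sketch only models string and numeric fills")

    columns = subset if subset is not None else list(col_types)
    to_fill = [c for c in columns if col_types[c] == target_type]

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        for c in to_fill:
            if is_missing(new_row[c]):
                # numeric fills take the column type, so -100 becomes -100.0
                new_row[c] = float(value) if target_type == "double" else value
        filled.append(new_row)
    return filled


# The first two rows of the table, restricted to four columns.
rows = [
    {"name": None, "last_name": "Lara", "age": 27.0, "score_game_2": float("nan")},
    {"name": "Liam", "last_name": None, "age": float("nan"), "score_game_2": float("nan")},
]
col_types = {"name": "string", "last_name": "string", "age": "double", "score_game_2": "double"}

as_string = fill_scalar(rows, col_types, "NOT_FOUND")
as_number = fill_scalar(rows, col_types, -100, subset=["age", "score_game_2"])
```

In this model, ```as_string``` has "NOT_FOUND" in the string columns while the ```NaN``` values remain, and ```as_number``` has -100.0 in the numeric columns while the missing names remain, which mirrors the outputs of Methods 3 and 2 respectively.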



# Final Words

We went through the basic concepts of how to fill null values using Pyspark. Now it is your turn to start coding. Try the following:
* In method 1, try changing the dictionary used to fill the null values.
* In method 2, try changing the value used to fill the null values and the list of columns where you want the null values to be filled.
* In method 3, try passing a number instead of a string.