# PySpark 01 Handling Missing Values

**Summary** 
- Dropping columns
- Dropping rows
- Parameters of drop function
- Handling Missing values by Mean, Median, and Mode 

### Setup

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('practice').getOrCreate()
spark

### Load

In [68]:
# read the dataset
FILE_PATH = "C:\\PySpark\\food-consumption_missing.csv"
df = spark.read.csv(FILE_PATH, header=True, inferSchema=True)
df

DataFrame[Country: string, Real coffee: int, Instant coffee: int, Tea: int, Sweetener: int, Biscuits: int, Powder soup: int, Tin soup: int, Potatoes: int, Frozen fish: int, Frozen veggies: int, Apples: int, Oranges: int, Tinned fruit: int, Jam: int, Garlic: int, Butter: int, Margarine: int, Olive oil: int, Yoghurt: int, Crisp bread: int]

In [69]:
df.columns[:5]

['Country', 'Real coffee', 'Instant coffee', 'Tea', 'Sweetener']

## 1. Drop columns and rows

**Syntax**  

- drop columns:
  - `df.drop('col1', 'col2', ...)`
- drop rows that contain nulls:
  - `df.na.drop()`
  - `df.na.drop(how='any', thresh=None, subset=None)`
    - `how='any'`: drop a row if it contains any nulls.
    - `how='all'`: drop a row only if all its values are null.
    - `thresh`: the minimum number of non-null values a row must have to be kept; when set, it overrides `how`.
    - `subset`: the column name, or list of column names, to check for nulls; all other columns are ignored.
- `df.na` treats nulls through three methods:
  - `drop`
  - `fill`
  - `replace`

In [73]:
# keep only Country plus seven numeric columns; drop the rest
df = df.drop('Real coffee', 'Instant coffee', 'Tea', 'Frozen veggies', 'Potatoes',
             'Biscuits', 'Yoghurt', 'Crisp bread', 'Frozen fish', 'Apples',
             'Oranges', 'Garlic', 'Olive oil')
df.show()

+-----------+---------+-----------+--------+------------+----+------+---------+
|    Country|Sweetener|Powder soup|Tin soup|Tinned fruit| Jam|Butter|Margarine|
+-----------+---------+-----------+--------+------------+----+------+---------+
|    Germany|       19|         51|      19|          44|  71|    91|       85|
|      Italy|        2|       null|       3|           9|null|    66|     null|
|     France|     null|         53|      11|        null|null|    94|       47|
|    Holland|       32|       null|      43|          61|  81|    31|     null|
|    Belgium|       11|         37|      23|        null|null|    84|     null|
| Luxembourg|     null|         73|      12|          83|  20|    94|     null|
|    England|       22|       null|      76|        null|  91|    95|       94|
|   Portugal|     null|       null|       1|        null|  16|    65|       78|
|    Austria|       15|         33|       1|          14|  41|    51|       72|
|Switzerland|     null|       null|     

In [74]:
# how='any': drop every row that contains at least one null
df.na.drop(how='any').show()

+-------+---------+-----------+--------+------------+---+------+---------+
|Country|Sweetener|Powder soup|Tin soup|Tinned fruit|Jam|Butter|Margarine|
+-------+---------+-----------+--------+------------+---+------+---------+
|Germany|       19|         51|      19|          44| 71|    91|       85|
|Austria|       15|         33|       1|          14| 41|    51|       72|
| Sweden|       31|         43|      43|          53| 75|    68|       32|
| Norway|       13|         51|       4|          34| 51|    63|       94|
|Finland|       20|         27|      10|          22| 37|    96|       94|
|Ireland|       11|         75|      18|          46| 89|    97|       25|
+-------+---------+-----------+--------+------------+---+------+---------+



In [91]:
# thresh=2: keep rows with at least 2 non-null values (thresh overrides how)
df.na.drop(how='any', thresh=2).show()

+-----------+---------+-----------+--------+------------+----+------+---------+
|    Country|Sweetener|Powder soup|Tin soup|Tinned fruit| Jam|Butter|Margarine|
+-----------+---------+-----------+--------+------------+----+------+---------+
|    Germany|       19|         51|      19|          44|  71|    91|       85|
|      Italy|        2|       null|       3|           9|null|    66|     null|
|     France|     null|         53|      11|        null|null|    94|       47|
|    Holland|       32|       null|      43|          61|  81|    31|     null|
|    Belgium|       11|         37|      23|        null|null|    84|     null|
| Luxembourg|     null|         73|      12|          83|  20|    94|     null|
|    England|       22|       null|      76|        null|  91|    95|       94|
|   Portugal|     null|       null|       1|        null|  16|    65|       78|
|    Austria|       15|         33|       1|          14|  41|    51|       72|
|Switzerland|     null|       null|     

In [85]:
# subset: only check the 'Sweetener' column for nulls
df.na.drop(how='any', subset='Sweetener').show()

+-------+---------+-----------+--------+------------+----+------+---------+
|Country|Sweetener|Powder soup|Tin soup|Tinned fruit| Jam|Butter|Margarine|
+-------+---------+-----------+--------+------------+----+------+---------+
|Germany|       19|         51|      19|          44|  71|    91|       85|
|  Italy|        2|       null|       3|           9|null|    66|     null|
|Holland|       32|       null|      43|          61|  81|    31|     null|
|Belgium|       11|         37|      23|        null|null|    84|     null|
|England|       22|       null|      76|        null|  91|    95|       94|
|Austria|       15|         33|       1|          14|  41|    51|       72|
| Sweden|       31|         43|      43|          53|  75|    68|       32|
| Norway|       13|         51|       4|          34|  51|    63|       94|
|Finland|       20|         27|      10|          22|  37|    96|       94|
|Ireland|       11|         75|      18|          46|  89|    97|       25|
+-------+---

## 2. Fill missing values

**Syntax**

- `df.na.fill(value, subset=None)`
- `df.na.fill(value, subset='col1')`
- `df.na.fill(value, subset=['col1', 'col2', ... ])`
  - The fill value's type must match the column's data type; columns whose type does not match are left unchanged.

Fill with a column statistic (mean, median, or mode) - `Imputer`
- `from pyspark.ml.feature import Imputer`
- `imputer = Imputer(inputCols=['col1', 'col2', ...], outputCols=["{}_imputed".format(c) for c in ['col1', 'col2', ...]]).setStrategy('mean')`

In [97]:
# fill the missing values in every numeric column with the given value
df.na.fill(10000).show()

+-----------+---------+-----------+--------+------------+-----+------+---------+
|    Country|Sweetener|Powder soup|Tin soup|Tinned fruit|  Jam|Butter|Margarine|
+-----------+---------+-----------+--------+------------+-----+------+---------+
|    Germany|       19|         51|      19|          44|   71|    91|       85|
|      Italy|        2|      10000|       3|           9|10000|    66|    10000|
|     France|    10000|         53|      11|       10000|10000|    94|       47|
|    Holland|       32|      10000|      43|          61|   81|    31|    10000|
|    Belgium|       11|         37|      23|       10000|10000|    84|    10000|
| Luxembourg|    10000|         73|      12|          83|   20|    94|    10000|
|    England|       22|      10000|      76|       10000|   91|    95|       94|
|   Portugal|    10000|      10000|       1|       10000|   16|    65|       78|
|    Austria|       15|         33|       1|          14|   41|    51|       72|
|Switzerland|    10000|     

In [98]:
df.na.fill(100000, subset='Sweetener').show()

+-----------+---------+-----------+--------+------------+----+------+---------+
|    Country|Sweetener|Powder soup|Tin soup|Tinned fruit| Jam|Butter|Margarine|
+-----------+---------+-----------+--------+------------+----+------+---------+
|    Germany|       19|         51|      19|          44|  71|    91|       85|
|      Italy|        2|       null|       3|           9|null|    66|     null|
|     France|   100000|         53|      11|        null|null|    94|       47|
|    Holland|       32|       null|      43|          61|  81|    31|     null|
|    Belgium|       11|         37|      23|        null|null|    84|     null|
| Luxembourg|   100000|         73|      12|          83|  20|    94|     null|
|    England|       22|       null|      76|        null|  91|    95|       94|
|   Portugal|   100000|       null|       1|        null|  16|    65|       78|
|    Austria|       15|         33|       1|          14|  41|    51|       72|
|Switzerland|   100000|       null|     

In [100]:
df.na.fill(100000, subset=['Margarine', 'Jam']).show()

+-----------+---------+-----------+--------+------------+------+------+---------+
|    Country|Sweetener|Powder soup|Tin soup|Tinned fruit|   Jam|Butter|Margarine|
+-----------+---------+-----------+--------+------------+------+------+---------+
|    Germany|       19|         51|      19|          44|    71|    91|       85|
|      Italy|        2|       null|       3|           9|100000|    66|   100000|
|     France|     null|         53|      11|        null|100000|    94|       47|
|    Holland|       32|       null|      43|          61|    81|    31|   100000|
|    Belgium|       11|         37|      23|        null|100000|    84|   100000|
| Luxembourg|     null|         73|      12|          83|    20|    94|   100000|
|    England|       22|       null|      76|        null|    91|    95|       94|
|   Portugal|     null|       null|       1|        null|    16|    65|       78|
|    Austria|       15|         33|       1|          14|    41|    51|       72|
|Switzerland|   

### Imputer

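Impute by mean

The ten non-null `Sweetener` values sum to 176, so their mean is 17.6. The imputed column shows 17, which is consistent with the result being cast back to the column's integer type.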

In [114]:
from pyspark.ml.feature import Imputer

cols = ['Sweetener', 'Powder soup']

imputer = Imputer(inputCols=cols, 
                  outputCols=["{}_imputed".format(col) for col in cols]).setStrategy('mean')

In [118]:
df_imputed = imputer.fit(df).transform(df)
df_imputed.select('Country', 'Sweetener', 'Sweetener_imputed').show()

+-----------+---------+-----------------+
|    Country|Sweetener|Sweetener_imputed|
+-----------+---------+-----------------+
|    Germany|       19|               19|
|      Italy|        2|                2|
|     France|     null|               17|
|    Holland|       32|               32|
|    Belgium|       11|               11|
| Luxembourg|     null|               17|
|    England|       22|               22|
|   Portugal|     null|               17|
|    Austria|       15|               15|
|Switzerland|     null|               17|
|     Sweden|       31|               31|
|    Denmark|     null|               17|
|     Norway|       13|               13|
|    Finland|       20|               20|
|      Spain|     null|               17|
|    Ireland|       11|               11|
+-----------+---------+-----------------+



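Impute by median

Spark computes the median as an approximate quantile. For the ten non-null `Sweetener` values the two middle values are 15 and 19, and the imputed value is 15, an actual data value rather than their average.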

In [120]:
from pyspark.ml.feature import Imputer

cols = ['Sweetener', 'Powder soup']

imputer = Imputer(inputCols=cols, 
                  outputCols=["{}_imputed".format(col) for col in cols]).setStrategy('median')
df_imputed = imputer.fit(df).transform(df)
df_imputed.select('Country', 'Sweetener', 'Sweetener_imputed').show()

+-----------+---------+-----------------+
|    Country|Sweetener|Sweetener_imputed|
+-----------+---------+-----------------+
|    Germany|       19|               19|
|      Italy|        2|                2|
|     France|     null|               15|
|    Holland|       32|               32|
|    Belgium|       11|               11|
| Luxembourg|     null|               15|
|    England|       22|               22|
|   Portugal|     null|               15|
|    Austria|       15|               15|
|Switzerland|     null|               15|
|     Sweden|       31|               31|
|    Denmark|     null|               15|
|     Norway|       13|               13|
|    Finland|       20|               20|
|      Spain|     null|               15|
|    Ireland|       11|               11|
+-----------+---------+-----------------+

