### Handling Missing Values
<ul>
    <li>Dropping Columns</li>
    <li>Dropping Rows</li>
    <li>Parameters of the Drop Functions (how, thresh, subset)</li>
    <li>Handling Missing Values with Mean, Median and Mode</li>
</ul>

In [14]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[1]').appName('SparkPractice').getOrCreate()

In [15]:
data = spark.read.csv('test2.csv', header=True, inferSchema=True)

In [16]:
data.show(n=10)

+--------+----+-------+-------+
|    Name| Age|Address| Salary|
+--------+----+-------+-------+
|     Jay|  23|  Ajmer|  32000|
|  Manish|  19| Jaipur|  36450|
|    Oman|  11|  Ajmer|  61700|
|    Yuzi|  30|  Noida|  78200|
|Virendra|null|  Ajmer|  83000|
|   Viraj|  25| Mumbai|1012000|
|Shrijesh|  22|   null|  43230|
|  Ganesh|null|   null|  21000|
|Amrindra|  32| Panjab|   null|
|  Amisha|  23| Jaipur|  45010|
+--------+----+-------+-------+



In [17]:
# Drop a column
data.drop('Address').show(n=3)

+------+---+------+
|  Name|Age|Salary|
+------+---+------+
|   Jay| 23| 32000|
|Manish| 19| 36450|
|  Oman| 11| 61700|
+------+---+------+
only showing top 3 rows



In [20]:
# Drop every row that contains at least one null value
data.dropna(how='any').show()

+------+---+-------+-------+
|  Name|Age|Address| Salary|
+------+---+-------+-------+
|   Jay| 23|  Ajmer|  32000|
|Manish| 19| Jaipur|  36450|
|  Oman| 11|  Ajmer|  61700|
|  Yuzi| 30|  Noida|  78200|
| Viraj| 25| Mumbai|1012000|
|Amisha| 23| Jaipur|  45010|
+------+---+-------+-------+



In [21]:
# Drop a row only if all of its values are null (no such row here, so nothing is dropped)
data.dropna(how='all').show()

+--------+----+-------+-------+
|    Name| Age|Address| Salary|
+--------+----+-------+-------+
|     Jay|  23|  Ajmer|  32000|
|  Manish|  19| Jaipur|  36450|
|    Oman|  11|  Ajmer|  61700|
|    Yuzi|  30|  Noida|  78200|
|Virendra|null|  Ajmer|  83000|
|   Viraj|  25| Mumbai|1012000|
|Shrijesh|  22|   null|  43230|
|  Ganesh|null|   null|  21000|
|Amrindra|  32| Panjab|   null|
|  Amisha|  23| Jaipur|  45010|
+--------+----+-------+-------+



In [23]:
# thresh=3 keeps a row only if it has at least 3 non-null values (thresh overrides how)
data.dropna(thresh=3).show()

+--------+----+-------+-------+
|    Name| Age|Address| Salary|
+--------+----+-------+-------+
|     Jay|  23|  Ajmer|  32000|
|  Manish|  19| Jaipur|  36450|
|    Oman|  11|  Ajmer|  61700|
|    Yuzi|  30|  Noida|  78200|
|Virendra|null|  Ajmer|  83000|
|   Viraj|  25| Mumbai|1012000|
|Shrijesh|  22|   null|  43230|
|Amrindra|  32| Panjab|   null|
|  Amisha|  23| Jaipur|  45010|
+--------+----+-------+-------+



In [24]:
# subset: check for nulls only in the listed columns
data.dropna(how='any', subset=['Age']).show()

+--------+---+-------+-------+
|    Name|Age|Address| Salary|
+--------+---+-------+-------+
|     Jay| 23|  Ajmer|  32000|
|  Manish| 19| Jaipur|  36450|
|    Oman| 11|  Ajmer|  61700|
|    Yuzi| 30|  Noida|  78200|
|   Viraj| 25| Mumbai|1012000|
|Shrijesh| 22|   null|  43230|
|Amrindra| 32| Panjab|   null|
|  Amisha| 23| Jaipur|  45010|
+--------+---+-------+-------+
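
The `how`, `thresh` and `subset` parameters used above all answer one question: does this row stay? The sketch below restates that rule in plain Python so it can be checked without a Spark session. It is an illustration of the documented behaviour, not Spark's implementation; the helper name `keeps_row` and the dict-based rows are assumptions made for this example.

```python
# Pure-Python sketch of the row-keeping rule behind dropna (hypothetical helper).
def keeps_row(row, how="any", thresh=None, subset=None):
    """Return True if a dropna-style filter would keep this row (a dict)."""
    cols = subset if subset is not None else list(row)
    non_null = sum(row[c] is not None for c in cols)
    if thresh is not None:
        # When thresh is given it takes precedence over how:
        # keep rows with at least `thresh` non-null values.
        return non_null >= thresh
    if how == "any":
        return non_null == len(cols)  # dropped if any checked value is null
    if how == "all":
        return non_null > 0           # dropped only if every checked value is null
    raise ValueError("how must be 'any' or 'all'")


# Three rows copied from the sample data, with null written as None.
ganesh = {"Name": "Ganesh", "Age": None, "Address": None, "Salary": 21000}
virendra = {"Name": "Virendra", "Age": None, "Address": "Ajmer", "Salary": 83000}
shrijesh = {"Name": "Shrijesh", "Age": 22, "Address": None, "Salary": 43230}

print(keeps_row(ganesh, how="any"))                     # False
print(keeps_row(ganesh, how="all"))                     # True
print(keeps_row(ganesh, thresh=3))                      # False (only 2 non-null values)
print(keeps_row(virendra, thresh=3))                    # True  (exactly 3 non-null values)
print(keeps_row(shrijesh, how="any", subset=["Age"]))   # True  (Age is present)
print(keeps_row(virendra, how="any", subset=["Age"]))   # False (Age is null)
```

These results match the outputs above: Ganesh is the only row removed by `thresh=3`, and `subset=['Age']` removes Virendra and Ganesh while keeping Shrijesh and Amrindra.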



In [29]:
# Fill missing values with a constant chosen per column
data.fillna({'Age': 0, 'Address': 'Unknown', 'Salary': -1}).show()

+--------+---+-------+-------+
|    Name|Age|Address| Salary|
+--------+---+-------+-------+
|     Jay| 23|  Ajmer|  32000|
|  Manish| 19| Jaipur|  36450|
|    Oman| 11|  Ajmer|  61700|
|    Yuzi| 30|  Noida|  78200|
|Virendra|  0|  Ajmer|  83000|
|   Viraj| 25| Mumbai|1012000|
|Shrijesh| 22|Unknown|  43230|
|  Ganesh|  0|Unknown|  21000|
|Amrindra| 32| Panjab|     -1|
|  Amisha| 23| Jaipur|  45010|
+--------+---+-------+-------+



In [31]:
# Fill missing values with a statistical measure (the column mean here)
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['Age', 'Salary'],
    outputCols=["{}_imputed".format(c) for c in ['Age', 'Salary']]
).setStrategy("mean")

In [32]:
# Fit the imputer and append the imputed columns to the DataFrame
imputer.fit(data).transform(data).show()

+--------+----+-------+-------+-----------+--------------+
|    Name| Age|Address| Salary|Age_imputed|Salary_imputed|
+--------+----+-------+-------+-----------+--------------+
|     Jay|  23|  Ajmer|  32000|         23|         32000|
|  Manish|  19| Jaipur|  36450|         19|         36450|
|    Oman|  11|  Ajmer|  61700|         11|         61700|
|    Yuzi|  30|  Noida|  78200|         30|         78200|
|Virendra|null|  Ajmer|  83000|         23|         83000|
|   Viraj|  25| Mumbai|1012000|         25|       1012000|
|Shrijesh|  22|   null|  43230|         22|         43230|
|  Ganesh|null|   null|  21000|         23|         21000|
|Amrindra|  32| Panjab|   null|         32|        156954|
|  Amisha|  23| Jaipur|  45010|         23|         45010|
+--------+----+-------+-------+-----------+--------------+
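
The imputed salary of 156954 is far above every salary except Viraj's 1012000, because a single outlier pulls the mean up. The sketch below recomputes the fill values in plain Python from the non-null entries of the table, and compares the mean with the median. The variable names are assumptions for this example, and the exact median is used here, whereas Spark's `Imputer` may compute an approximate one.

```python
from statistics import mean, median

# Non-null values copied from the Age and Salary columns of the sample data.
ages = [23, 19, 11, 30, 25, 22, 32, 23]
salaries = [32000, 36450, 61700, 78200, 83000, 1012000, 43230, 21000, 45010]

age_mean = mean(ages)            # 23.125
salary_mean = mean(salaries)     # about 156954.44
age_median = median(ages)        # 23.0
salary_median = median(salaries) # 45010

# The integer parts match Age_imputed and Salary_imputed in the output above.
print(int(age_mean), int(salary_mean))   # 23 156954
print(age_median, salary_median)         # 23.0 45010
```

For Age the two measures nearly agree, but for Salary the median (45010) is much closer to a typical row than the mean. Passing `"median"` to `setStrategy` instead of `"mean"` would be the way to try this with the `Imputer`; recent Spark versions also accept `"mode"`.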

