# Missing Data in Spark DataFrames

## Table of Contents

<ol style = "type:1">
    <li><a href = "#naanddrop"><code>na</code> and <code>drop</code></a></li>
    <li><a href = "#fill">Filling Missing Data with <code>fill</code></a></li>
    <li><a href = "#ref">References</a></li>
</ol>

## <a name = "naanddrop">`na` and `drop`</a>

In [2]:
import findspark

In [3]:
findspark.init("/home/virchan/spark-3.3.1-bin-hadoop3")

In [4]:
from pyspark.sql import SparkSession

In [6]:
# Output hidden
spark = SparkSession.builder.appName("miss").getOrCreate()

In [8]:
df = spark.read.csv("ContainsNull.csv", 
               header = True, 
               inferSchema = True
              )

In [9]:
df.show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



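Every dataframe exposes an `.na` attribute (a property, not a method call) whose methods, such as `drop` and `fill`, handle missing data. Called with no arguments, `drop()` removes every row that contains at least one null value.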

In [10]:
df.na.drop().show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



We can also specify a threshold argument. For example, passing `thresh = 2` keeps only the rows that have at least two non-null values. (I.e., with our three columns, we get rows with at most one null value.)

In [11]:
df.na.drop(thresh = 2).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
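
The `thresh` rule amounts to counting the non-null values in each row. As a rough illustration, the sketch below models that count in plain Python on the four sample rows. It is not Spark code, and the names `rows` and `non_null_count` are made up for this example.

```python
# Plain-Python model of the thresh rule (illustrative only, not Spark code).
# The rows mirror the sample table, with None standing in for null.
rows = [
    ("emp1", "John", None),
    ("emp2", None, None),
    ("emp3", None, 345.0),
    ("emp4", "Cindy", 456.0),
]


def non_null_count(row):
    """Count the values in a row that are not None."""
    return sum(value is not None for value in row)


thresh = 2
# Keep a row only if it has at least `thresh` non-null values.
kept = [row for row in rows if non_null_count(row) >= thresh]

for row in kept:
    print(row)
```

Only `emp2` falls below the threshold, since `Id` is its single non-null value.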



There is another argument, `how`. Passing `how = 'any'`, which is the default, drops rows with <i>any</i> null values. (I.e., we keep only the fully populated rows of the dataframe.)

In [12]:
df.na.drop(how = "any").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+



Passing `how = 'all'` drops only the rows in which <i>all</i> values are null. (I.e., we are removing empty rows.) Here every row has an `Id`, so no row is removed.

In [13]:
df.na.drop(how = "all").show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



To consider only certain columns when checking for missing data, pass them to the `subset` argument.

In [14]:
df.na.drop(subset = ["Sales"]).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
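
The `how` and `subset` arguments can be combined: `subset` picks the columns to inspect, and `how` decides whether one null or all nulls among them triggers the drop. The sketch below models that interaction in plain Python. It is an illustration of the rule rather than Spark's implementation, and `keep_row` and `surviving_ids` are hypothetical helper names.

```python
# Plain-Python model of how `how` and `subset` interact (illustrative only,
# not Spark code). None stands in for null.
rows = [
    {"Id": "emp1", "Name": "John", "Sales": None},
    {"Id": "emp2", "Name": None, "Sales": None},
    {"Id": "emp3", "Name": None, "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]


def keep_row(row, how="any", subset=None):
    """Return True if the row survives a drop with the given settings."""
    columns = subset if subset is not None else list(row)
    is_null = [row[column] is None for column in columns]
    if how == "any":
        return not any(is_null)  # drop when any inspected column is null
    if how == "all":
        return not all(is_null)  # drop only when every inspected column is null
    raise ValueError("how must be 'any' or 'all'")


def surviving_ids(how, subset=None):
    """List the Id of every row that is kept."""
    return [row["Id"] for row in rows if keep_row(row, how, subset)]


print(surviving_ids("all"))                     # ['emp1', 'emp2', 'emp3', 'emp4']
print(surviving_ids("all", ["Name", "Sales"]))  # ['emp1', 'emp3', 'emp4']
print(surviving_ids("any", ["Sales"]))          # ['emp3', 'emp4']
```

Restricting `how = "all"` to `Name` and `Sales` removes `emp2`, which the unrestricted call keeps because its `Id` is not null.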



## <a name = "fill">Filling Missing Data with `fill`</a>

We need to pay attention to data types when filling missing data. In other words, it helps to check the schema with `.printSchema()` first.

In [15]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sales: double (nullable = true)



We have two string columns and one double column. Say we try the following code:

In [16]:
df.na.fill("FILL VALUE").show()

+----+----------+-----+
|  Id|      Name|Sales|
+----+----------+-----+
|emp1|      John| null|
|emp2|FILL VALUE| null|
|emp3|FILL VALUE|345.0|
|emp4|     Cindy|456.0|
+----+----------+-----+



This fills `FILL VALUE` into the null entries of every string column. The double column `Sales` is left untouched because the fill value's type does not match. If we pass a number instead, we get

In [17]:
df.na.fill(0).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|  0.0|
|emp2| null|  0.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



All null values in numeric columns are replaced with 0, while the string column `Name` keeps its nulls. In practice, we usually specify which columns to fill with the `subset` argument.

In [18]:
df.na.fill("No Name", subset = ["Name"]).show()

+----+-------+-----+
|  Id|   Name|Sales|
+----+-------+-----+
|emp1|   John| null|
|emp2|No Name| null|
|emp3|No Name|345.0|
|emp4|  Cindy|456.0|
+----+-------+-----+



A common practice is to fill missing numerical values with the column average.

In [19]:
from pyspark.sql.functions import mean

In [20]:
mean_val = df.select(mean(df["Sales"])).collect()

Let's unpack this object.

In [21]:
mean_val

[Row(avg(Sales)=400.5)]

In [22]:
mean_val[0]

Row(avg(Sales)=400.5)

In [23]:
mean_val[0][0]

400.5

Alternatively, we can convert the row to a dictionary and look the value up by key.

In [25]:
mean_val[0].asDict()

{'avg(Sales)': 400.5}

In [26]:
mean_val[0].asDict()["avg(Sales)"]

400.5

To replace the missing values with the mean, we store it in a variable and pass it to `fill`:

In [27]:
mean_sales = mean_val[0][0]

In [28]:
df.na.fill(mean_sales, ["Sales"]).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+



Alternatively, the whole thing can be written as a single expression:

In [29]:
df.na.fill(df.select(mean(df["Sales"])).collect()[0][0], ["Sales"]).show()

+----+-----+-----+
|  Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
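
The value 400.5 comes from the two non-null sales figures only, because the average skips nulls instead of counting them as zero. The arithmetic can be checked with a small plain-Python sketch. This is not Spark code, and the variable names are made up for this example.

```python
from statistics import mean

# Sales column from the sample table, with None standing in for null.
sales = [None, None, 345.0, 456.0]

# The average is taken over the non-null values only.
observed = [value for value in sales if value is not None]
mean_sales = mean(observed)
print(mean_sales)  # 400.5

# Replace each missing entry with that average.
filled = [mean_sales if value is None else value for value in sales]
print(filled)  # [400.5, 400.5, 345.0, 456.0]
```

Had the nulls been counted as zeros, the average would have been 200.25 instead.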



## <a name = "ref">References</a>

<ol style = "type:1">
    <li>Jose Portilla. Spark and Python for Big Data with PySpark.</li>
    <li>Apache Spark. <a href = "https://spark.apache.org/docs/latest/api/python/">https://spark.apache.org/docs/latest/api/python/</a>.</li>
</ol>