## Handling NULL Values in PySpark

Data cleaning is a crucial step in any data processing pipeline, and dealing with `NULL` values is a common challenge.

PySpark offers several ways to handle `NULL` values:
- Dropping rows that contain nulls with `dropna()`.
- Filling nulls with `fillna()` (or its alias `na.fill()`) to substitute missing values.
- Supplying defaults with `coalesce()`.
- Applying conditional logic with `when()` and `otherwise()`.
- Filtering with `isNull()` and `isNotNull()`.

In [0]:
# sample data
data = [
    ("Rohish", 30),
    ("Ajit", None),
    ("Rajani", 25),
    (None, 35),
    ("Eve", None)
]

columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

+------+----+
|  Name| Age|
+------+----+
|Rohish|  30|
|  Ajit|null|
|Rajani|  25|
|  null|  35|
|   Eve|null|
+------+----+



**Dropping Null Values - `dropna()`**:

Remove rows that have null values in any column or in specific columns.

In [0]:
# Drops rows with any null values
cleaned_df = df.dropna()
cleaned_df.show()

+------+---+
|  Name|Age|
+------+---+
|Rohish| 30|
|Rajani| 25|
+------+---+



In [0]:
# Drops rows only if all values are null
df.dropna(how="all").show()

+------+----+
|  Name| Age|
+------+----+
|Rohish|  30|
|  Ajit|null|
|Rajani|  25|
|  null|  35|
|   Eve|null|
+------+----+



In [0]:
# Drops rows where Name is null; subset accepts more than one column
df.dropna(subset=["Name"]).show()

+------+----+
|  Name| Age|
+------+----+
|Rohish|  30|
|  Ajit|null|
|Rajani|  25|
|   Eve|null|
+------+----+



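**Filling Null Values (`fillna()`):**
Replace null values with a specific value. The fill is applied only to columns whose type matches the fill value, so `fillna(0)` leaves the string column `Name` untouched.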

In [0]:
filled_df = df.fillna(0)
filled_df.show()

+------+---+
|  Name|Age|
+------+---+
|Rohish| 30|
|  Ajit|  0|
|Rajani| 25|
|  null| 35|
|   Eve|  0|
+------+---+



In [0]:
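# Name is a string column, so the numeric fill value is ignored for it
# even though it is listed in subset
filled_df2 = df.fillna(0, subset=["Name", "Age"])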
filled_df2.show()

+------+---+
|  Name|Age|
+------+---+
|Rohish| 30|
|  Ajit|  0|
|Rajani| 25|
|  null| 35|
|   Eve|  0|
+------+---+



**Filling Null Values with `na.fill()`:**
`df.na.fill()` is an alias for `df.fillna()` and behaves the same way.

In [0]:
df.na.fill("Unknown").show()

+-------+----+
|   Name| Age|
+-------+----+
| Rohish|  30|
|   Ajit|null|
| Rajani|  25|
|Unknown|  35|
|    Eve|null|
+-------+----+



**Note: `df.na.fill("Unknown")` fills only string columns because the fill value is a string; columns whose type does not match the value are skipped. `subset=None` is already the default, so passing it explicitly does not change this. To fill columns of different types, pass a dict that maps column names to fill values, or cast the column first.**

In [0]:
# subset=None is the default, so the result is the same as the previous cell
df.na.fill("Unknown", subset=None).show()

+-------+----+
|   Name| Age|
+-------+----+
| Rohish|  30|
|   Ajit|null|
| Rajani|  25|
|Unknown|  35|
|    Eve|null|
+-------+----+



**Using `coalesce()`:** Return the first non-null value among its arguments, which makes it a convenient way to supply a default when a column has nulls.

In [0]:
from pyspark.sql.functions import coalesce, lit

# Note: giving the integer column Age a string default turns Age into a string column
df_1 = df.withColumn("Name", coalesce(df["Name"], lit("Unknown"))) \
    .withColumn("Age", coalesce(df["Age"], lit("Unknown")))

df_1.show()

+-------+-------+
|   Name|    Age|
+-------+-------+
| Rohish|     30|
|   Ajit|Unknown|
| Rajani|     25|
|Unknown|     35|
|    Eve|Unknown|
+-------+-------+



**Filtering Out Nulls (filter() or where()):**

In [0]:
from pyspark.sql.functions import col

df.filter(col("Name").isNotNull()).show()
df.where(col("Name").isNotNull()).show()

+------+----+
|  Name| Age|
+------+----+
|Rohish|  30|
|  Ajit|null|
|Rajani|  25|
|   Eve|null|
+------+----+

+------+----+
|  Name| Age|
+------+----+
|Rohish|  30|
|  Ajit|null|
|Rajani|  25|
|   Eve|null|
+------+----+



In [0]:
df.filter(col("Name").isNull()).show()

+----+---+
|Name|Age|
+----+---+
|null| 35|
+----+---+



**Using Conditional Logic (when and otherwise):** Apply conditional logic to replace NULL values.

In [0]:
from pyspark.sql.functions import when, col

df.withColumn("Name", when(col("Name").isNull(), "Unknown").otherwise(col("Name"))) \
    .withColumn("Age", when(col("Age").isNull(), "Unknown").otherwise(col("Age"))).show()

+-------+-------+
|   Name|    Age|
+-------+-------+
| Rohish|     30|
|   Ajit|Unknown|
| Rajani|     25|
|Unknown|     35|
|    Eve|Unknown|
+-------+-------+



### Potential Interview questions

**How do you handle NULL values in a PySpark DataFrame?**

**Answer:** You can handle NULL values using various functions:
 - `df.fillna(value)` to replace NULL values with a specified value.
 - `df.dropna()` to drop rows with NULL values.
 - `df.replace(to_replace, value)` to swap specific non-null values (for example, a sentinel such as `"N/A"`) for another value; it is not the tool for filling NULLs themselves.
 - `df.na.fill(value)` as an alias for `df.fillna(value)`.
 - `df.na.drop()` as an alias for `df.dropna()`.
 - `coalesce()` to take the first non-null value from a list of columns.
 - `when()` and `otherwise()` to apply conditional logic that replaces NULL values.

**How can you filter out rows with NULL values in a specific column?**

**Answer:**
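- We can use the `filter()` or `where()` method with `isNotNull()` to exclude rows where the column is NULL, or with `isNull()` to select only those rows, as in the example below.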


In [0]:
df.filter(col("Age").isNull()).show()

+----+----+
|Name| Age|
+----+----+
|Ajit|null|
| Eve|null|
+----+----+



**How do you handle NULL values when performing aggregations such as SUM, AVG, or COUNT?**

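**Answer:** Aggregate functions such as `sum()`, `avg()`, and `count()` applied to a column skip NULL values automatically; only `count("*")` counts every row. Fill the NULLs before aggregating if they should contribute a value.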

In [0]:
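# For example:
# Alias sum so it does not shadow Python's built-in sum()
from pyspark.sql.functions import avg, sum as spark_sum

# NULL ages are skipped: only 30, 25 and 35 take part in the aggregation
avg_value = df.select(avg('Age')).collect()[0][0]
sum_value = df.select(spark_sum('Age')).collect()[0][0]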

print("avg_value:", avg_value)
print("sum_value:", sum_value)

avg_value: 30.0
sum_value: 90


**How can you create a custom function to handle NULL values in PySpark?**

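**Answer:** We can wrap the null-handling logic in a UDF (user-defined function) and apply it with `withColumn()`. Built-in functions such as `fillna()` or `coalesce()` are usually faster, so a UDF is best kept for logic they cannot express.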

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Generic template: 'column_name' is a placeholder for an integer column
def replace_null(value):
    return 0 if value is None else value

replace_null_udf = udf(replace_null, IntegerType())
df_filled = df.withColumn('column_name', replace_null_udf(df['column_name']))

In [0]:
# for example

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def replace_null(value):
    # Return a string so the result matches the declared StringType
    return "0" if value is None else value

# Wrapping the function as a UDF that returns a string
replace_null_udf = udf(replace_null, StringType())

filled_df = df.withColumn("Name", replace_null_udf(df["Name"]))

filled_df.show()

+------+----+
|  Name| Age|
+------+----+
|Rohish|  30|
|  Ajit|null|
|Rajani|  25|
|     0|  35|
|   Eve|null|
+------+----+

