## PySpark SQL isNull() and isNotNull() Functions: Handling Null Values in Data

## Introduction to `isNull()` and `isNotNull()` Functions

`isNull()` and `isNotNull()` are methods on a PySpark `Column`. Each returns a boolean column expression that is true where the value is null or not null, respectively. Null values in a dataset represent missing or undefined data, and handling them properly is crucial in data analysis and transformations.


## Basic Syntax:

```
DataFrame.filter(column.isNull())
DataFrame.filter(column.isNotNull())
```

### Functions:

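- **`isNull()`**: Evaluates to true where the column value is null; inside `filter()`, it keeps the rows where the column is null.
- **`isNotNull()`**: Evaluates to true where the column value is not null; inside `filter()`, it keeps the rows where the column has a value.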


## Why Use `isNull()` and `isNotNull()`?

- Null values often indicate missing data that needs special treatment, whether through filtering, filling missing values, or applying specific rules.
- `isNull()` and `isNotNull()` identify those values explicitly, so you can filter them out, replace them, or branch on them in a transformation.


## Practical Examples

### 1. Filtering Rows with Null Values

**Scenario**: You have a DataFrame with product sales, and some sales values are null. You want to filter rows where sales are null.

**Code Example**:

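```
from pyspark.sql import SparkSession

# Reuse the active session (notebooks usually predefine `spark`), or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("Product A", 500),
    ("Product B", None),
    ("Product C", 700)
], ["product_name", "sales"])

# Filter rows where sales are null
df.filter(df.sales.isNull()).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product B| null|
+------------+-----+
```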



### 2. Filtering Rows with Non-Null Values

**Scenario**: You want to filter out rows where the sales value is null, keeping only rows with valid sales values.

**Code Example**:

```
# Filter rows where sales are not null
df.filter(df.sales.isNotNull()).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
|   Product C|  700|
+------------+-----+
```



### 3. Using `isNull()` for Data Cleaning

**Scenario**: You want to clean your data by replacing null values in the sales column with a default value, such as 0.

**Code Example**:

```
from pyspark.sql.functions import when

# Replace null sales with 0
df_cleaned = df.withColumn("sales", when(df.sales.isNull(), 0).otherwise(df.sales))
df_cleaned.show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
|   Product B|    0|
|   Product C|  700|
+------------+-----+
```



### 4. Combining `isNull()` and `isNotNull()` with Other Filters

**Scenario**: You want to filter rows where the sales are greater than 400 and not null.

**Code Example**:

```
# Filter rows where sales are greater than 400 and not null
df.filter(df.sales.isNotNull() & (df.sales > 400)).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
|   Product C|  700|
+------------+-----+
```



### 5. Counting Null and Non-Null Values in a Column

**Scenario**: You want to count how many rows have null values in the sales column and how many are not null.

**Code Example**:

```
# Count rows with null and non-null sales
null_count = df.filter(df.sales.isNull()).count()
not_null_count = df.filter(df.sales.isNotNull()).count()

print(f"Null sales count: {null_count}")
print(f"Non-null sales count: {not_null_count}")
```

```
Null sales count: 1
Non-null sales count: 2
```


### 6. Using `isNull()` and `isNotNull()` with Multiple Columns

**Scenario**: You have a DataFrame with multiple columns, and you want to keep only the rows where both the `product_name` and `sales` columns are not null.

**Code Example**:

```
# Filter rows where both product_name and sales are not null
df.filter(df.product_name.isNotNull() & df.sales.isNotNull()).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
|   Product C|  700|
+------------+-----+
```

