Problem Statement:

You are working with a PySpark DataFrame that has several columns, some of which contain null values. Your goal is to count the null values in each column of the DataFrame. Specifically, you want to:

1. Count the number of null values in each column.
2. Display the result as a new DataFrame with a single row, where each column holds the null count for the corresponding column in the original DataFrame.

This task mirrors a SQL query that counts null values per column by wrapping a CASE expression in SUM().

In [0]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

# Sample data
data = [(1, None, "ab"), (2, 10, None), (None, None, "cd")]
columns = ["col1", "col2", "col3"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame (display() is available in Databricks notebooks; use df.show() elsewhere)
df.display()

col1,col2,col3
1.0,,ab
2.0,10.0,
,,cd


In [0]:
df.createOrReplaceTempView("emp")

In [0]:
%sql
select
  sum(
    case
      when col1 is null then 1
      else 0
    end
  ) as col1,
  sum(
    case
      when col2 is null then 1
      else 0
    end
  ) as col2,
  sum(
    case
      when col3 is null then 1
      else 0
    end
  ) as col3
from
  emp;

col1,col2,col3
1,2,1
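
To make the arithmetic behind SUM(CASE WHEN ... IS NULL THEN 1 ELSE 0 END) concrete, here is a minimal plain-Python sketch over the same three sample rows. It is only an illustration of the logic, not how Spark executes the query; the names `rows` and `null_counts` are introduced here for the example, and `None` stands in for SQL NULL.

```python
# Same sample rows as the DataFrame above; None plays the role of SQL NULL.
rows = [(1, None, "ab"), (2, 10, None), (None, None, "cd")]
columns = ["col1", "col2", "col3"]

# For each column: emit 1 when the value is null, 0 otherwise, then sum the flags.
null_counts = {
    name: sum(1 if row[i] is None else 0 for row in rows)
    for i, name in enumerate(columns)
}

print(null_counts)  # {'col1': 1, 'col2': 2, 'col3': 1}
```

The per-column totals match the query result: one null in col1, two in col2, one in col3.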


In [0]:
from pyspark.sql import functions as F

# Count the number of nulls in each column
df_null_count = df.select(
    F.sum(F.when(F.col("col1").isNull(), 1).otherwise(0)).alias("col1_null_count"),
    F.sum(F.when(F.col("col2").isNull(), 1).otherwise(0)).alias("col2_null_count"),
    F.sum(F.when(F.col("col3").isNull(), 1).otherwise(0)).alias("col3_null_count")
)

# Show the result
df_null_count.display()


col1_null_count,col2_null_count,col3_null_count
1,2,1


Explanation:

F.when(F.col("col1").isNull(), 1).otherwise(0): This builds a column expression that evaluates to 1 when col1 is null and to 0 otherwise.
F.sum(...): This adds up those 1s and 0s across all rows, which gives the number of null values in the column.
.alias("col1_null_count"): This names the resulting column so the output shows which original column each count belongs to.