Problem Statement:

          You are given two datasets: one containing customer information and the other containing order information.
          Your task is to identify customers who have never placed an order.
          The Customers dataset contains customer details such as Id and NameCust, while the Orders
          dataset contains the Id of each order and the corresponding CustomerId.

Dataset Details:

Customers:

          This table contains customer information:
          Id: Unique identifier for each customer.
          NameCust: Name of the customer.

Orders: 

          This table contains order information:
          Id: Unique identifier for each order.
          CustomerId: Foreign key that references the customer who placed the order.

In [0]:
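from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Create a Spark session (on Databricks this returns the notebook's existing `spark`)
spark = SparkSession.builder.getOrCreate()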

# Define schema for Customers table
schema_customers = StructType(
    [
        StructField("Id", IntegerType(), True),
        StructField("NameCust", StringType(), True),
    ]
)

# Define schema for Orders table
schema_orders = StructType(
    [
        StructField("Id", IntegerType(), True),
        StructField("CustomerId", IntegerType(), True),
    ]
)

# Create DataFrame for Customers
customers_data = [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")]
df_customers = spark.createDataFrame(customers_data, schema_customers)

# Create DataFrame for Orders
orders_data = [(1, 3), (2, 1)]
df_orders = spark.createDataFrame(orders_data, schema_orders)

# Display the DataFrames
print("Customers DataFrame:")
df_customers.display()

print("Orders DataFrame:")
df_orders.display()

Customers DataFrame:


Id,NameCust
1,Joe
2,Henry
3,Sam
4,Max


Orders DataFrame:


Id,CustomerId
1,3
2,1


In [0]:
df_customers.createOrReplaceTempView("Customers")
df_orders.createOrReplaceTempView("Orders")

In [0]:
%sql
WITH OrderedCustomers AS (
  SELECT
    CustomerId
  FROM
    Orders
)
SELECT
  id,
  NameCust
FROM
  Customers
WHERE
  Id NOT IN (
    SELECT
      CustomerId
    FROM
      OrderedCustomers
  );

id,NameCust
2,Henry
4,Max


In [0]:
# Collect the distinct CustomerIds of customers who have placed an order.
# Note: collect() brings these ids to the driver, so this suits small Orders tables only.
ordered_ids = [
    row["CustomerId"] for row in df_orders.select("CustomerId").distinct().collect()
]

# Keep only the customers whose Id is not among the collected ids
customers_never_ordered = df_customers.filter(~df_customers["Id"].isin(ordered_ids))

customers_never_ordered.display()

Id,NameCust
2,Henry
4,Max


Summary of Results:

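Using NOT IN: The SQL query filters with a NOT IN subquery; the PySpark version collects the ordered CustomerIds into a Python list (via a list comprehension) and negates isin().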

Using LEFT JOIN + IS NULL: Left joins Customers to Orders and keeps only the rows where the Orders columns are NULL.

Using LEFT ANTI JOIN: Usually the most efficient approach to find records that don't exist in another table.

Each approach works in PySpark similarly to its SQL counterpart, and you can choose based on your specific needs.