Problem Statement:

You are given a dataset of customer orders with two columns: order_id (an integer order number) and item (a string naming the ordered item). Your task is to modify the order_id column according to the following rules:

If the order_id is odd and not the last order, increase the order_id by 1.
If the order_id is odd and is the last order, leave the order_id unchanged.
If the order_id is even, decrease the order_id by 1.
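
As a reference point before the Spark versions, the three rules can be sketched in plain Python. The function name correct_order_id and its last_order_id parameter are illustrative assumptions, not part of the original notebook.

```python
def correct_order_id(order_id: int, last_order_id: int) -> int:
    """Return the corrected ID for one order, following the three rules."""
    if order_id % 2 == 0:
        return order_id - 1   # even: decrease by 1
    if order_id == last_order_id:
        return order_id       # odd and last: leave unchanged
    return order_id + 1       # odd and not last: increase by 1


# With 15 orders, order 15 is the odd last order and keeps its ID.
print([correct_order_id(i, 15) for i in (1, 2, 14, 15)])  # [2, 1, 13, 15]
```

In effect, each odd/even pair of neighbouring orders swaps IDs, and an unpaired final order stays where it is.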

In [0]:
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Define the schema for the DataFrame
schema = StructType(
    [
        StructField("order_id", IntegerType(), True),
        StructField("item", StringType(), True),
    ]
)

# Sample data
data = [
    (1, "Chow Mein"),
    (2, "Pizza"),
    (3, "Pad Thai"),
    (4, "Butter Chicken"),
    (5, "Eggrolls"),
    (6, "Burger"),
    (7, "Tandoori Chicken"),
    (8, "Sushi"),
    (9, "Tacos"),
    (10, "Ramen"),
    (11, "Burrito"),
    (12, "Lasagna"),
    (13, "Salad"),
    (14, "Steak"),
    (15, "Spaghetti"),
]

# Create DataFrame
order_df = spark.createDataFrame(data, schema)

# display the DataFrame
order_df.display()

order_id,item
1,Chow Mein
2,Pizza
3,Pad Thai
4,Butter Chicken
5,Eggrolls
6,Burger
7,Tandoori Chicken
8,Sushi
9,Tacos
10,Ramen
11,Burrito
12,Lasagna
13,Salad
14,Steak
15,Spaghetti


In [0]:
order_df.createOrReplaceTempView("orders")

In [0]:
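# Run the SQL query in Spark.
# Note: COUNT(order_id) OVER () is the total number of orders. It matches the
# last order_id only because the IDs here run from 1 to N without gaps.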
corrected_orders_df = spark.sql(
    """
  SELECT
    CASE
      WHEN order_id % 2 != 0 AND order_id != COUNT(order_id) OVER () THEN order_id + 1
      WHEN order_id % 2 != 0 AND order_id = COUNT(order_id) OVER () THEN order_id
      ELSE order_id - 1
    END AS corrected_order_id,
    item
  FROM orders
  ORDER BY corrected_order_id
"""
)

# display the result
corrected_orders_df.display()

corrected_order_id,item
1,Pizza
2,Chow Mein
3,Butter Chicken
4,Pad Thai
5,Burger
6,Eggrolls
7,Sushi
8,Tandoori Chicken
9,Ramen
10,Tacos
11,Lasagna
12,Burrito
13,Steak
14,Salad
15,Spaghetti


In [0]:
from pyspark.sql import Window
from pyspark.sql.functions import col, count, when

# Assuming the order_df DataFrame is already created

# Define a window to calculate the total number of orders
window_spec = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# Add a column for the total count of orders
order_df_with_count = order_df.withColumn(
    "total_orders", count("order_id").over(window_spec)
)

# Apply the same logic using PySpark's API
corrected_orders_df = order_df_with_count.withColumn(
    "corrected_order_id",
    when(
        (col("order_id") % 2 != 0) & (col("order_id") != col("total_orders")),
        col("order_id") + 1,
    )
    .when(
        (col("order_id") % 2 != 0) & (col("order_id") == col("total_orders")),
        col("order_id"),
    )
    .otherwise(col("order_id") - 1),
)

# Select and order the result
result_df = corrected_orders_df.select("corrected_order_id", "item").orderBy(
    "corrected_order_id"
)

# display the result
result_df.display()

corrected_order_id,item
1,Pizza
2,Chow Mein
3,Butter Chicken
4,Pad Thai
5,Burger
6,Eggrolls
7,Sushi
8,Tandoori Chicken
9,Ramen
10,Tacos
11,Lasagna
12,Burrito
13,Steak
14,Salad
15,Spaghetti


Explanation:

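Window Function: Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) defines an unpartitioned window that spans every row, so count("order_id") over it returns the total number of orders. This is the DataFrame equivalent of COUNT(order_id) OVER () in the SQL version. Comparing order_id with this count identifies the last order only because the sample IDs run from 1 to N without gaps. Because the window has no partitioning, Spark moves all rows to a single partition to evaluate it, which is fine for a small table like this one.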
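
To make the window's effect concrete, here is a minimal plain-Python sketch (no Spark involved) of what a count over an unbounded window produces: the same grand total attached to every row. The helper with_total and the gapped sample are hypothetical and only illustrate why the count stands in for the last order_id when the IDs are contiguous.

```python
def with_total(rows):
    """Mimic COUNT(order_id) OVER (): every row gets the same grand total."""
    # Like SQL COUNT(column), rows whose order_id is None are not counted.
    total = sum(1 for order_id, _ in rows if order_id is not None)
    return [(order_id, item, total) for order_id, item in rows]


rows = [(1, "Chow Mein"), (2, "Pizza"), (3, "Pad Thai")]
print(with_total(rows))
# [(1, 'Chow Mein', 3), (2, 'Pizza', 3), (3, 'Pad Thai', 3)]

# With a gap in the IDs, the count no longer equals the highest order_id.
gapped = [(1, "Chow Mein"), (2, "Pizza"), (7, "Tandoori Chicken")]
count_based_last = with_total(gapped)[0][2]
max_based_last = max(order_id for order_id, _ in gapped)
print(count_based_last, max_based_last)  # 3 7
```

If the IDs could have gaps, comparing against the maximum order_id rather than the count would be the safer way to detect the last order.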

withColumn: We add a column total_orders that carries this total on every row, so each row's order_id can be compared against it.

when and otherwise: We express the SQL CASE with chained when calls and a final otherwise. As in CASE, the conditions are checked in order and the first one that matches determines the result.
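
The first-match behaviour of a when/otherwise chain can be modelled in plain Python as follows. This is only a sketch of the evaluation order, not PySpark's implementation; case_when, is_odd, and is_last are hypothetical helper names.

```python
def case_when(branches, default):
    """Build a CASE-like function: conditions are tried in order, first match wins."""
    def evaluate(row):
        for condition, result in branches:
            if condition(row):
                return result(row)
        return default(row)  # plays the role of otherwise(...)
    return evaluate


def is_odd(row):
    return row["order_id"] % 2 != 0


def is_last(row):
    return row["order_id"] == row["total_orders"]


corrected_order_id = case_when(
    branches=[
        (lambda r: is_odd(r) and not is_last(r), lambda r: r["order_id"] + 1),
        (lambda r: is_odd(r) and is_last(r), lambda r: r["order_id"]),
    ],
    default=lambda r: r["order_id"] - 1,
)

for order_id in (9, 10, 15):
    row = {"order_id": order_id, "total_orders": 15}
    print(order_id, "->", corrected_order_id(row))
# 9 -> 10
# 10 -> 9
# 15 -> 15
```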

Ordering: Finally, we select the necessary columns and order by corrected_order_id.

This approach uses only PySpark's DataFrame API; no SQL string or temporary view is needed.
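
As a sanity check, the expected result for the full 15-row sample can be computed without Spark. The sketch below rebuilds the sample data and applies the rules in plain Python; the names orders, corrected, and expected are illustrative assumptions.

```python
items = [
    "Chow Mein", "Pizza", "Pad Thai", "Butter Chicken", "Eggrolls",
    "Burger", "Tandoori Chicken", "Sushi", "Tacos", "Ramen",
    "Burrito", "Lasagna", "Salad", "Steak", "Spaghetti",
]
orders = list(enumerate(items, start=1))  # (order_id, item) pairs, IDs 1..15
last_id = max(order_id for order_id, _ in orders)


def corrected(order_id):
    """Apply the three rules to a single order_id."""
    if order_id % 2 == 0:
        return order_id - 1
    return order_id if order_id == last_id else order_id + 1


# Sort by corrected ID, mirroring the ORDER BY / orderBy step.
expected = sorted((corrected(order_id), item) for order_id, item in orders)

# The corrected IDs are still exactly 1..15: the rules only swap neighbours.
assert [order_id for order_id, _ in expected] == list(range(1, 16))

for row in expected[:2] + expected[-3:]:
    print(row)
# (1, 'Pizza')
# (2, 'Chow Mein')
# (13, 'Steak')
# (14, 'Salad')
# (15, 'Spaghetti')
```

The pairs in expected could then be compared with the rows returned by result_df.collect() to confirm that the Spark output agrees.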