PySpark Coding Challenge: Analyzing Online Store Orders

Task: You have a dataset containing information about orders from an online store. Your task is to use PySpark to analyze the data and answer a few questions using aggregate functions.

Dataset: The dataset is in CSV format and contains the following columns: order_id, customer_id, order_date, total_amount.

Questions:

1. Calculate the total revenue generated from all orders.

2. Find the average order amount.

3. Identify the highest total order amount and its corresponding customer.

4. Calculate the total number of orders for each customer.

from pyspark.sql import SparkSession
# Note: importing sum by name shadows Python's built-in sum in this cell
from pyspark.sql.functions import col, sum, avg, count

# Create a Spark session
spark = SparkSession.builder.appName("OnlineStoreAnalysis").getOrCreate()

# Build a small in-memory sample with the same columns as the CSV dataset
data = [
    (1, "C101", "2023-07-01", 150),
    (2, "C102", "2023-07-02", 200),
    (3, "C101", "2023-07-02", 100),
    (4, "C103", "2023-07-03", 300),
    (5, "C102", "2023-07-04", 250),
    (6, "C101", "2023-07-05", 120)
]
columns = ["order_id", "customer_id", "order_date", "total_amount"]
df = spark.createDataFrame(data, columns)

# Question 1: Calculate total revenue
total_revenue = df.select(sum("total_amount")).collect()[0][0]
print("Total Revenue:", total_revenue)

# Question 2: Average order amount
average_order_amount = df.agg(avg("total_amount")).collect()[0][0]
print("Average Order Amount:", average_order_amount)

# Question 3: Highest total order amount and corresponding customer
# first() already returns a single row; if several orders tie, only one is returned
highest_order = df.orderBy(col("total_amount").desc()).first()
print("Highest Order Amount:", highest_order["total_amount"])
print("Customer ID:", highest_order["customer_id"])

# Question 4: Total number of orders per customer
total_orders_per_customer = df.groupBy("customer_id").agg(count("order_id").alias("total_orders"))
# groupBy does not guarantee row order, so sort before displaying
total_orders_per_customer.orderBy("customer_id").show()


Total Revenue: 1120
Average Order Amount: 186.66666666666666
Highest Order Amount: 300
Customer ID: C103
+-----------+------------+
|customer_id|total_orders|
+-----------+------------+
|       C101|           3|
|       C102|           2|
|       C103|           1|
+-----------+------------+

