Problem Statement:

Flatten the transactions: Each user_id carries a list of transactions. Extract the amount from every transaction so that the result has one row per transaction, each labeled with its user_id.

Aggregate the total amount: After flattening the transactions, calculate the total amount for each user_id by summing up the individual transaction amounts.

Implement the solution using Spark SQL: Use Spark SQL with a Common Table Expression (CTE) to simplify the process of flattening and aggregating the transaction data.
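
Before turning to Spark, the first two steps can be sketched in plain Python to make the expected result concrete. This is only an illustrative model, not part of the Spark solution: the helper names flatten_transactions and total_per_user are hypothetical, and the input has the same shape as the notebook's sample data.

```python
from collections import defaultdict

# Same shape as the notebook's sample data: (user_id, list of transactions)
data = [
    (1, [{"amount": 45}, {"amount": 60}]),
    (2, [{"amount": 30}, {"amount": 20}]),
    (3, [{"amount": 120}, {"amount": 80}]),
]


def flatten_transactions(rows):
    """Step 1 (hypothetical helper): one (user_id, amount) pair per transaction."""
    return [(user_id, tx["amount"]) for user_id, txs in rows for tx in txs]


def total_per_user(flat_rows):
    """Step 2 (hypothetical helper): sum the amounts for each user_id."""
    totals = defaultdict(int)
    for user_id, amount in flat_rows:
        totals[user_id] += amount
    return dict(totals)


flat = flatten_transactions(data)
totals = total_per_user(flat)
print(flat)
print(totals)
```

The Spark versions follow the same two steps: explode plays the role of flatten_transactions, and groupBy with sum plays the role of total_per_user.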

In [0]:
# Note: this import shadows Python's built-in sum for the rest of the notebook
from pyspark.sql.functions import explode, col, sum

# Sample data: each transaction is a Python dict, which Spark infers as a map column
data = [
    (1, [{"amount": 45}, {"amount": 60}]),
    (2, [{"amount": 30}, {"amount": 20}]),
    (3, [{"amount": 120}, {"amount": 80}]),
]

# Create DataFrame
df = spark.createDataFrame(data, ["user_id", "transactions"])

In [0]:
# Explode the transactions column to get one row per transaction
df_exploded = df.withColumn("transaction", explode(col("transactions")))

# Select user_id and amount from the exploded transaction data
df_final = df_exploded.select("user_id", col("transaction.amount").alias("amount"))

# Show the final DataFrame
df_final.display()

user_id,amount
1,45
1,60
2,30
2,20
3,120
3,80


In [0]:
# Sum the amounts for each user_id
df_summed = df_final.groupBy("user_id").agg(sum("amount").alias("total_amount"))

# Show the final result
df_summed.display()

user_id,total_amount
1,105
2,50
3,200


In [0]:
# Register the DataFrame as a temporary SQL view
df.createOrReplaceTempView("transactions_table")

In [0]:
spark.sql("select * from transactions_table").display()

user_id,transactions
1,"List(Map(amount -> 45), Map(amount -> 60))"
2,"List(Map(amount -> 30), Map(amount -> 20))"
3,"List(Map(amount -> 120), Map(amount -> 80))"


In [0]:
# SQL query to flatten the transactions: LATERAL VIEW explode() emits one row per array element
result = spark.sql(
    """
    SELECT
        user_id,
        transaction.amount AS amount
    FROM
        transactions_table
    LATERAL VIEW explode(transactions) AS transaction
"""
)

# SQL query to sum amounts per user
sum_result = spark.sql(
    """
    SELECT
        user_id,
        SUM(transaction.amount) AS total_amount
    FROM
        transactions_table
    LATERAL VIEW explode(transactions) AS transaction
    GROUP BY
        user_id
"""
)

# Show the summed amounts
sum_result.show()

+-------+------------+
|user_id|total_amount|
+-------+------------+
|      1|         105|
|      2|          50|
|      3|         200|
+-------+------------+



In [0]:
# Spark SQL query with CTE
result = spark.sql("""
    WITH flattened_transactions AS (
        SELECT user_id, transaction.amount AS amount
        FROM transactions_table
        LATERAL VIEW EXPLODE(transactions) AS transaction
    )
    SELECT user_id, SUM(amount) AS total_amount
    FROM flattened_transactions
    GROUP BY user_id
""")

# Display the result
result.display()

user_id,total_amount
1,105
2,50
3,200


Explanation:

CTE flattened_transactions: The CTE flattens the transactions array with LATERAL VIEW EXPLODE, producing one row per transaction, and extracts the amount from each one.
Main Query: The outer query reads from flattened_transactions and computes the total amount for each user_id with SUM and GROUP BY.
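
One consequence of flattening before aggregating is worth keeping in mind: explode emits no row for an empty or null array, so a user without transactions would be absent from the totals. The plain-Python model below illustrates this under stated assumptions. User 4 and the helper names explode_rows and sum_by_user are hypothetical and do not appear in the sample data, and the keep_empty flag only mimics what LATERAL VIEW OUTER combined with COALESCE(SUM(amount), 0) would do in Spark SQL.

```python
# Hypothetical input: user 4 has an empty transaction list
rows = [
    (1, [{"amount": 45}, {"amount": 60}]),
    (4, []),
]


def explode_rows(rows, keep_empty=False):
    """Model of explode: one output row per list element.

    With keep_empty=True, an empty list yields a single row whose amount is
    None, mimicking an outer explode.
    """
    out = []
    for user_id, txs in rows:
        if not txs and keep_empty:
            out.append((user_id, None))
        for tx in txs:
            out.append((user_id, tx["amount"]))
    return out


def sum_by_user(flat_rows):
    """Sum amounts per user, treating None as 0 (like COALESCE(SUM(...), 0))."""
    totals = {}
    for user_id, amount in flat_rows:
        totals[user_id] = totals.get(user_id, 0) + (amount or 0)
    return totals


inner_totals = sum_by_user(explode_rows(rows))
outer_totals = sum_by_user(explode_rows(rows, keep_empty=True))
print(inner_totals)  # user 4 is missing
print(outer_totals)  # user 4 appears with a total of 0
```

Whether such users should appear with a total of 0 or be omitted depends on the report being built; the sample data here has no empty lists, so both variants give the same totals.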