Problem Statement:

The task is to transform and summarize a dataset that contains information about individuals, items, and their associated weights. The goal is to group the data by the name column, sum the weights for each item per name, and present the results in a compact, readable format.

Input Data:

A dataset (data_table) with the following columns:

name: The name of the individual or entity.

item: The item associated with the name.

weight: The numeric weight associated with the name-item pair.

In [0]:
data = [
    ("john", "tomato", 2),
    ("bill", "apple", 2),
    ("john", "banana", 2),
    ("john", "tomato", 3),
    ("bill", "taco", 2),
    ("bill", "apple", 2),
]
schema = "name string,item string,weight int"
df = spark.createDataFrame(data, schema)
df.display()

name,item,weight
john,tomato,2
bill,apple,2
john,banana,2
john,tomato,3
bill,taco,2
bill,apple,2


In [0]:
from pyspark.sql import functions as F

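result_df = (
    # Step 1: total weight per (name, item) pair
    df.groupBy("name", "item")
      .agg(F.sum("weight").alias("weight"))
      # Step 2: render each pair as the string "(item,weight)"
      .withColumn(
          "tuple",
          F.concat(F.lit("("), F.col("item"), F.lit(","), F.col("weight"), F.lit(")")),
      )
      # Step 3: join the pairs per name; collect_list does not guarantee order
      .groupBy("name")
      .agg(F.concat_ws(",", F.collect_list("tuple")).alias("tuple"))
)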

result_df.display()

name,tuple
bill,"(apple,4),(taco,2)"
john,"(tomato,5),(banana,2)"


In [0]:
df.createOrReplaceTempView("data_table")

In [0]:
# Execute Spark SQL Query
result_df = spark.sql("""
WITH SummedWeights AS (
    SELECT 
        name, 
        item, 
        SUM(weight) AS weight
    FROM data_table
    GROUP BY name, item
),
Tuples AS (
    SELECT 
        name, 
        CONCAT('(', item, ',', weight, ')') AS tuple
    FROM SummedWeights
),
GroupedTuples AS (
    SELECT 
        name, 
        CONCAT_WS(',', COLLECT_LIST(tuple)) AS tuple
    FROM Tuples
    GROUP BY name
)
SELECT * 
FROM GroupedTuples
""")
result_df.display()


name,tuple
bill,"(apple,4),(taco,2)"
john,"(tomato,5),(banana,2)"


Explanation:

The result of the query summarizes and transforms data in the following way:

Grouping by Name: Each unique name forms a single row in the output.

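Aggregated Tuples: Weights are first summed per (name, item) pair, so john's two tomato rows (2 and 3) become (tomato,5). For each name, the resulting (item, weight) pairs are then joined into a comma-separated list of tuples.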

Final Structure: The result is a table with two columns:

name: The name of the person/entity.

tuple: A single string containing all (item, weight) pairs associated with that name, separated by commas.
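
As a cross-check on the explanation above, the same logic can be sketched in plain Python without Spark. This is only an illustrative sketch: the helper name summarize is hypothetical and not part of the notebook. It lists items in first-seen order, whereas Spark's collect_list does not guarantee any particular order within a group.

```python
from collections import defaultdict

# Same rows as the `data` list in the notebook.
rows = [
    ("john", "tomato", 2),
    ("bill", "apple", 2),
    ("john", "banana", 2),
    ("john", "tomato", 3),
    ("bill", "taco", 2),
    ("bill", "apple", 2),
]


def summarize(rows):
    """Hypothetical helper: sum weights per (name, item), then join pairs per name."""
    totals = defaultdict(dict)  # name -> {item: summed weight}, in first-seen order
    for name, item, weight in rows:
        totals[name][item] = totals[name].get(item, 0) + weight
    # Render each (item, weight) pair as "(item,weight)" and join with commas.
    return {
        name: ",".join(f"({item},{weight})" for item, weight in items.items())
        for name, items in totals.items()
    }


expected = summarize(rows)
for name, tuples in sorted(expected.items()):
    print(name, tuples)
```

Comparing this dictionary with the rows of result_df (for example after result_df.collect()) is one way to confirm the Spark pipeline, provided the comparison ignores the order of pairs inside each string.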