Problem Statement: Analyzing Post Comments Hierarchy

We have a table named Submissions containing two columns:

sub_id: The identifier of a post or comment. The table may contain duplicate rows, so the same sub_id can appear more than once.
parent_id: The sub_id of the post that this entry replies to. For a top-level post, parent_id is NULL.

Your task is to:

Identify all distinct top-level posts (entries where parent_id is NULL).
For each top-level post, count the number of distinct direct comments (i.e., entries where the parent_id matches the sub_id of the post).
Ensure that posts with no comments are assigned a count of 0.

The output should display:

post_id: The sub_id of the top-level post.
No_of_cmnt: The number of distinct comments each post received.

The result should be sorted by post_id in ascending order.
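
As a sanity check on the requirements above, the expected result can be sketched in plain Python without Spark. The function name count_direct_comments and the sample_rows variable are hypothetical, introduced only for illustration; the rows mirror the notebook's sample data.

```python
def count_direct_comments(rows):
    """Return (post_id, No_of_cmnt) pairs for top-level posts, sorted by post_id.

    rows is a list of (sub_id, parent_id) tuples; parent_id is None for posts.
    """
    # Distinct top-level posts: entries whose parent_id is NULL (None here)
    posts = {sub_id for sub_id, parent_id in rows if parent_id is None}

    # One set per post so duplicate comment rows are counted once
    comments = {post: set() for post in posts}
    for sub_id, parent_id in rows:
        if parent_id in comments:  # ignores replies to non-posts, e.g. (6, 7)
            comments[parent_id].add(sub_id)

    # Posts with no comments keep an empty set, which yields a count of 0
    return [(post, len(comments[post])) for post in sorted(posts)]


# Same sample rows as the notebook (hypothetical variable name)
sample_rows = [
    (1, None), (2, None), (1, None), (12, None),
    (3, 1), (5, 2), (3, 1), (4, 1), (9, 1), (10, 2), (6, 7),
]

print(count_direct_comments(sample_rows))
# prints [(1, 3), (2, 2), (12, 0)]
```

Using a set per post handles both kinds of duplication in the data: repeated post rows such as (1, NULL) and repeated comment rows such as (3, 1).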

In [0]:
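# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import coalesce, col, countDistinct, lit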

# Define the data as a list of tuples
data = [
    (1, None),
    (2, None),
    (1, None),
    (12, None),
    (3, 1),
    (5, 2),
    (3, 1),
    (4, 1),
    (9, 1),
    (10, 2),
    (6, 7),
]

# Define schema for DataFrame
schema = ["sub_id", "parent_id"]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# display the DataFrame
df.display()

sub_id,parent_id
1,
2,
1,
12,
3,1.0
5,2.0
3,1.0
4,1.0
9,1.0
10,2.0
6,7.0


In [0]:
# Step 1: Build the equivalent of the CTE: distinct top-level posts (parent_id is NULL)
cte_df = df.filter(col("parent_id").isNull()).select("sub_id").distinct()

# Step 2: LEFT JOIN the posts to the original DataFrame so posts without comments are kept
joined_df = cte_df.alias("c").join(
    df.alias("s"), col("c.sub_id") == col("s.parent_id"), "left"
)

# Step 3: Group by post and count the distinct direct comments.
# countDistinct ignores NULLs, so unmatched posts already get 0; coalesce is only a defensive guard.
result_df = (
    joined_df.groupBy(col("c.sub_id").alias("post_id"))
    .agg(coalesce(countDistinct("s.sub_id"), lit(0)).alias("No_of_cmnt"))
    .orderBy("post_id")
)

# Display the result
result_df.display()

post_id,No_of_cmnt
1,3
2,2
12,0


In [0]:
# Register DataFrame as a temporary SQL view
df.createOrReplaceTempView("submissions")

In [0]:
# Spark SQL query with CTE logic
query = """
WITH cte AS (
    SELECT DISTINCT sub_id AS post 
    FROM submissions 
    WHERE parent_id IS NULL
)
SELECT 
    c.post AS post_id, 
    COALESCE(COUNT(DISTINCT s.sub_id), 0) AS No_of_cmnt
FROM cte c
LEFT JOIN submissions s 
    ON c.post = s.parent_id
GROUP BY c.post
ORDER BY post_id
"""

# Execute the query using Spark SQL
result_df = spark.sql(query)

# Show the result
result_df.show()

+-------+----------+
|post_id|No_of_cmnt|
+-------+----------+
|      1|         3|
|      2|         2|
|     12|         0|
+-------+----------+



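Explanation

Create a temporary SQL view:

We register the DataFrame as a temporary view using createOrReplaceTempView(). This lets us query it with SQL syntax.

Write the SQL query with CTE:

The WITH cte clause selects the distinct sub_id values where parent_id is NULL, i.e. the top-level posts.
The main query LEFT JOINs cte to the submissions view on c.post = s.parent_id, so posts with no comments are kept, with NULLs on the comment side.
COUNT(DISTINCT s.sub_id) ignores those NULLs and already returns 0 for such posts; COALESCE is kept only as an explicit safeguard.
The row (6, 7) does not contribute to any count because its parent_id, 7, does not match a top-level post.

Execute the SQL query:

Use spark.sql() to run the query. It returns a DataFrame, which show() prints.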
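
The CTE and LEFT JOIN logic can also be tried outside Spark. The following is a minimal sketch that swaps in Python's built-in sqlite3 module in place of Spark SQL, purely for illustration; the variable names are hypothetical. It runs the same query without COALESCE to show that COUNT(DISTINCT ...) by itself yields 0 for a post with no comments.

```python
import sqlite3

# Same sample rows as the notebook; None becomes SQL NULL
submission_rows = [
    (1, None), (2, None), (1, None), (12, None),
    (3, 1), (5, 2), (3, 1), (4, 1), (9, 1), (10, 2), (6, 7),
]

# In-memory SQLite database standing in for the Spark temporary view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (sub_id INTEGER, parent_id INTEGER)")
conn.executemany("INSERT INTO submissions VALUES (?, ?)", submission_rows)

# Same shape as the Spark SQL query, but without COALESCE
query = """
WITH cte AS (
    SELECT DISTINCT sub_id AS post
    FROM submissions
    WHERE parent_id IS NULL
)
SELECT c.post AS post_id, COUNT(DISTINCT s.sub_id) AS No_of_cmnt
FROM cte c
LEFT JOIN submissions s
    ON c.post = s.parent_id
GROUP BY c.post
ORDER BY post_id
"""

sqlite_result = conn.execute(query).fetchall()
conn.close()

print(sqlite_result)
# prints [(1, 3), (2, 2), (12, 0)]
```

Post 12 still gets a count of 0 here, because COUNT skips the NULL sub_id produced by the unmatched LEFT JOIN row.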