Problem Statement:

Given a table of Facebook posts, write a query that, for each user who posted at least twice in 2021, finds the number of days between that user's first and last post of 2021. Output the user and the number of days between those two posts.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
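# Import only the functions used below; note that max and min shadow the Python built-ins
from pyspark.sql.functions import col, count, datediff, max, min, to_timestamp, year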

# Define schema for the DataFrame, use StringType for the date initially
schema = StructType(
    [
        StructField("user_id", IntegerType(), True),
        StructField("post_id", IntegerType(), True),
        StructField("post_content", StringType(), True),
        StructField("post_date", StringType(), True),  # Keep as StringType initially
    ]
)

# Sample data
data = [
    (151652, 111766, "it's always winter, but never Christmas.", "12/01/2021 11:00:00"),
    (
        661093,
        442560,
        "Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then class 6-10. Another day that's gonna fly by. I miss my girlfriend",
        "09/08/2021 10:00:00",
    ),
    (661093, 624356, "Happy valentines!", "02/14/2021 00:00:00"),
    (151652, 599415, "Need a hug", "01/28/2021 00:00:00"),
    (
        178425,
        157336,
        "I'm so done with these restrictions - I want to travel!!!",
        "03/24/2021 11:00:00",
    ),
    (
        423967,
        784254,
        "Just going to cry myself to sleep after watching Marley and Me.",
        "05/05/2021 00:00:00",
    ),
    (151325, 613451, "Happy new year all my friends!", "01/01/2022 11:00:00"),
    (
        151325,
        987562,
        "The global surface temperature for June 2022 was the sixth-highest in the 143-year record. This is definitely global warming happening.",
        "07/01/2022 10:00:00",
    ),
    (
        661093,
        674356,
        "Can't wait to start my freshman year - super excited!",
        "08/18/2021 10:00:00",
    ),
    (
        151325,
        451464,
        "Garage sale this Saturday 1 PM. All welcome to check out!",
        "10/25/2021 10:00:00",
    ),
    (
        151652,
        994156,
        "Does anyone have an extra iPhone charger to sell?",
        "04/01/2021 10:00:00",
    ),
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Convert 'post_date' from StringType to TimestampType
df = df.withColumn("post_date", to_timestamp("post_date", "MM/dd/yyyy HH:mm:ss"))
df.display()

user_id,post_id,post_content,post_date
151652,111766,"it's always winter, but never Christmas.",2021-12-01T11:00:00.000+0000
661093,442560,Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then class 6-10. Another day that's gonna fly by. I miss my girlfriend,2021-09-08T10:00:00.000+0000
661093,624356,Happy valentines!,2021-02-14T00:00:00.000+0000
151652,599415,Need a hug,2021-01-28T00:00:00.000+0000
178425,157336,I'm so done with these restrictions - I want to travel!!!,2021-03-24T11:00:00.000+0000
423967,784254,Just going to cry myself to sleep after watching Marley and Me.,2021-05-05T00:00:00.000+0000
151325,613451,Happy new year all my friends!,2022-01-01T11:00:00.000+0000
151325,987562,The global surface temperature for June 2022 was the sixth-highest in the 143-year record. This is definitely global warming happening.,2022-07-01T10:00:00.000+0000
661093,674356,Can't wait to start my freshman year - super excited!,2021-08-18T10:00:00.000+0000
151325,451464,Garage sale this Saturday 1 PM. All welcome to check out!,2021-10-25T10:00:00.000+0000
151652,994156,Does anyone have an extra iPhone charger to sell?,2021-04-01T10:00:00.000+0000


In [0]:
# Calculate days between for users with more than one post in 2021
result = (
    df.filter(year(col("post_date")) == 2021)  # keep only posts made in 2021
    .groupBy("user_id")
    .agg(
        datediff(max("post_date"), min("post_date")).alias("days_between"),
        count("*").alias("post_count"),
    )
    .filter(col("post_count") > 1)  # at least two posts in 2021
    .select("user_id", "days_between")
)

# Show the result
result.display()

user_id,days_between
151652,307
661093,206
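
As a sanity check that does not depend on Spark, the expected numbers can be reproduced with the Python standard library. This is only a sketch: the helper name `days_between_first_and_last` is hypothetical, the `(user_id, post_date)` pairs are copied from the sample data above, and the block is meant to run as a standalone script, because the notebook's import from `pyspark.sql.functions` replaces the built-in `max` and `min`.

```python
from datetime import datetime

# (user_id, post_date) pairs copied from the sample data above
posts = [
    (151652, "12/01/2021 11:00:00"),
    (661093, "09/08/2021 10:00:00"),
    (661093, "02/14/2021 00:00:00"),
    (151652, "01/28/2021 00:00:00"),
    (178425, "03/24/2021 11:00:00"),
    (423967, "05/05/2021 00:00:00"),
    (151325, "01/01/2022 11:00:00"),
    (151325, "07/01/2022 10:00:00"),
    (661093, "08/18/2021 10:00:00"),
    (151325, "10/25/2021 10:00:00"),
    (151652, "04/01/2021 10:00:00"),
]


def days_between_first_and_last(rows, target_year=2021):
    """Return {user_id: days} for users with at least two posts in target_year."""
    dates_by_user = {}
    for user_id, raw in rows:
        # Keep only the calendar date, mirroring how datediff treats timestamps
        posted = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S").date()
        if posted.year == target_year:
            dates_by_user.setdefault(user_id, []).append(posted)
    return {
        user_id: (max(dates) - min(dates)).days
        for user_id, dates in dates_by_user.items()
        if len(dates) > 1  # drop users with a single post in the year
    }


expected = days_between_first_and_last(posts)
print(expected)  # {151652: 307, 661093: 206}
```

User 151325 drops out because only one of their three posts falls in 2021, and users 178425 and 423967 drop out because they posted once.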


In [0]:
# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("posts")

# SQL equivalent; DATEDIFF returns the gap as an integer number of days
result = spark.sql(
    """
    SELECT
        user_id,
        DATEDIFF(MAX(post_date), MIN(post_date)) AS days_between
    FROM posts
    WHERE YEAR(post_date) = 2021
    GROUP BY user_id
    HAVING COUNT(post_id) > 1
"""
)

# Show the result
result.show()

+-------+------------+
|user_id|days_between|
+-------+------------+
| 151652|         307|
| 661093|         206|
+-------+------------+
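
Both the DataFrame and the SQL versions compare calendar dates rather than elapsed time, since Spark's datediff works on dates and reduces timestamp inputs to their date part. The plain-Python sketch below illustrates the difference with two hypothetical timestamps (they are not part of the sample data) that are two hours apart but fall on different days.

```python
from datetime import datetime

# Hypothetical timestamps, chosen only to straddle midnight
first = datetime(2021, 3, 1, 23, 0, 0)
last = datetime(2021, 3, 2, 1, 0, 0)

# Whole 24-hour periods that elapsed between the two timestamps
elapsed_days = (last - first).days

# Difference between the two calendar dates, ignoring time of day
calendar_days = (last.date() - first.date()).days

print(elapsed_days, calendar_days)  # 0 1
```

For the sample data the two notions agree, but a date-based count is the one the problem statement asks for.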



Explanation:

Filtering the Result:
After days_between and the per-user post count are computed for 2021, the result keeps only users with more than one post in that year. In the sample data those are users 151652 and 661093, which matches the expected output.