Problem Statement: 

Session Duration Analysis

Objective: 

Analyze session durations from a dataset to classify them into predefined bins.

Tasks:

Load session data into a DataFrame (duration is in seconds, since the queries divide it by 60 to get minutes).

Create bins for session durations categorized as:

[0-5]: at least 0 and less than 5 minutes

[5-10]: at least 5 and less than 10 minutes

[10-15]: at least 10 and less than 15 minutes

[15 or more]: 15 minutes or more

Count the number of sessions that fall into each bin and present the results.
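
The binning rule can be sanity-checked without Spark. The following is a minimal plain-Python sketch, not part of the notebook's solution. It assumes durations are in seconds and that each bin includes its lower bound and excludes its upper bound. The helper name duration_to_bin and the constant BIN_LABELS are hypothetical.

```python
from collections import Counter

# Bin labels as listed in the tasks above (hypothetical constant name)
BIN_LABELS = ["[0-5]", "[5-10]", "[10-15]", "[15 or more]"]


def duration_to_bin(duration_seconds):
    """Map a duration in seconds to a bin label (hypothetical helper)."""
    minutes = duration_seconds / 60
    # Checks run in order, so each bin starts where the previous one ends
    if minutes < 5:
        return "[0-5]"
    if minutes < 10:
        return "[5-10]"
    if minutes < 15:
        return "[10-15]"
    return "[15 or more]"


# Sample durations used in the cells below
durations = [30, 199, 299, 580, 1000, 1150, 30]
counts = Counter(duration_to_bin(d) for d in durations)

# Report every bin, including bins with no sessions
summary = {label: counts.get(label, 0) for label in BIN_LABELS}
print(summary)
```

Building the summary from the full list of labels, rather than from the counter alone, is what keeps empty bins visible with a count of 0.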

In [0]:
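from pyspark.sql.types import IntegerType, StructField, StructType

# Define the schema: one row per session, with duration in seconds
schema = StructType(
    [
        StructField("session_id", IntegerType(), True),
        StructField("duration", IntegerType(), True),
    ]
)

# Sample rows as (session_id, duration)
data = [(1, 30), (2, 199), (3, 299), (4, 580), (5, 1000), (6, 1150), (7, 30)]

# Create the DataFrame (`spark` is the SparkSession provided by the notebook)
sessions_df = spark.createDataFrame(data, schema)

# Display the DataFrame (display() is available in Databricks notebooks)
sessions_df.display()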

session_id,duration
1,30
2,199
3,299
4,580
5,1000
6,1150
7,30


In [0]:
sessions_df.createOrReplaceTempView("session")

In [0]:
%sql
WITH cte1 AS (
  SELECT
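    CASE
      -- Branches are checked in order, so each bin starts where the previous one ends.
      -- Comparing unrounded minutes keeps the boundaries at exactly 5, 10 and 15.
      WHEN duration / 60 < 5 THEN '[0-5]>'
      WHEN duration / 60 < 10 THEN '[5-10]>'
      WHEN duration / 60 < 15 THEN '[10-15]>'
      ELSE '[15 or more]'
    END AS bins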
  FROM
    session
),
cte2 AS (
  SELECT
    '[0-5]>' as bins
  UNION
  SELECT
    '[5-10]>' as bins
  UNION
  SELECT
    '[10-15]>' as bins
  UNION
  SELECT
    '[15 or more]' as bins
)
SELECT
  c2.bins,
  count(c1.bins) AS total -- count skips the NULLs from unmatched bins, so empty bins yield 0
FROM
  cte2 as c2
  LEFT JOIN cte1 as c1 ON c2.bins = c1.bins
GROUP BY
  c2.bins;

bins,total
[0-5]>,4
[5-10]>,1
[10-15]>,0
[15 or more],2


In [0]:
from pyspark.sql import functions as F

# First CTE: assign each session to a bin based on its duration in minutes.
# Conditions are evaluated in order, so each bin starts where the previous one ends.
minutes = F.col("duration") / 60
cte1 = sessions_df.select(
    F.when(minutes < 5, "[0-5]>")
    .when(minutes < 10, "[5-10]>")
    .when(minutes < 15, "[10-15]>")
    .otherwise("[15 or more]")
    .alias("bins")
)

# Second CTE: define all possible bins so that empty bins still appear
cte2 = spark.createDataFrame(
    [("[0-5]>",), ("[5-10]>",), ("[10-15]>",), ("[15 or more]",)], ["bins"]
)

# Left join from the full bin list, then count matching sessions per bin
result_df = (
    cte2.join(cte1, cte2.bins == cte1.bins, "left")
    .groupBy(cte2.bins)
    .agg(F.count(cte1.bins).alias("total"))
    .fillna(0)  # safeguard only: count() already returns 0 for unmatched bins
)

# Display the result
result_df.display()

bins,total
[0-5]>,4
[5-10]>,1
[10-15]>,0
[15 or more],2


Explanation:

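Sessions DataFrame: Created from the sample session data.
Binning Logic: Uses chained when(...).otherwise(...) conditions from pyspark.sql.functions to categorize the durations, converted from seconds to minutes, into the specified bins.
Defining Bins: cte2 is built from a list of tuples and holds every bin label, so bins with no sessions still appear in the result.
Left Join: Joins cte2 to cte1 on the bins column, then groups by bin and counts the matching sessions.
Fill Nulls: count(cte1.bins) ignores the nulls that the left join produces for unmatched bins, so empty bins already get a count of 0; fillna(0) is only a safeguard.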
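
To make the left join and count steps concrete, here is a small plain-Python emulation. It is an illustrative sketch rather than how Spark executes the query, and the variable names (all_bins, session_bins, joined, totals) are hypothetical. It shows why a bin with no sessions ends up with a total of 0: the join keeps the bin with a null on the right-hand side, and the count skips nulls.

```python
# Every possible bin label (the role of cte2)
all_bins = ["[0-5]>", "[5-10]>", "[10-15]>", "[15 or more]"]

# Bin label of each of the seven sample sessions (the role of cte1)
session_bins = [
    "[0-5]>", "[0-5]>", "[0-5]>", "[5-10]>",
    "[15 or more]", "[15 or more]", "[0-5]>",
]

# Emulate the left join: every bin is kept, and a bin without
# matching sessions is paired with None on the right-hand side
joined = []
for bin_label in all_bins:
    matches = [s for s in session_bins if s == bin_label]
    if matches:
        joined.extend((bin_label, s) for s in matches)
    else:
        joined.append((bin_label, None))

# Emulate count(cte1.bins): only non-null right-hand values are counted
totals = {bin_label: 0 for bin_label in all_bins}
for bin_label, right_value in joined:
    if right_value is not None:
        totals[bin_label] += 1

print(totals)
```

Counting the left-hand column instead would report 1 for the empty bin, because the unmatched row itself is still present after the join.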

Expected Output:

This produces a DataFrame with one row per bin and its session count, matching the output of the SQL query above. Row order is not guaranteed, because neither query sorts the result.