# **Spotify Sessions**

## **Problem Statement**
You are given the following two tables:

### **Table: Playback**
| Column Name | Type |
|------------|------|
| `session_id`  | int  |
| `customer_id` | int  |
| `start_time`  | int  |
| `end_time`    | int  |

- `session_id` is the primary key.
- Each row represents a customer's playback session, from `start_time` to `end_time` (inclusive).
- Sessions for the same customer do not overlap.

### **Table: Ads**
| Column Name | Type |
|------------|------|
| `ad_id`       | int  |
| `customer_id` | int  |
| `timestamp`   | int  |

- `ad_id` is the primary key.
- Each row represents an ad shown to a customer at a specific `timestamp`.

### **Objective**
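Write a query to find **all session IDs that did not get any ads**. A session gets an ad when an ad is shown to the same customer at a `timestamp` between the session's `start_time` and `end_time`, inclusive.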
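
To make the matching rule concrete, here is a minimal plain-Python sketch of the condition that decides whether a single ad belongs to a single session. The `Session` and `Ad` tuples and the helper `ad_hits_session` are hypothetical names introduced only for illustration; they mirror the two table schemas above and assume integer timestamps with both interval ends included.

```python
from typing import NamedTuple


class Session(NamedTuple):
    session_id: int
    customer_id: int
    start_time: int
    end_time: int


class Ad(NamedTuple):
    ad_id: int
    customer_id: int
    timestamp: int


def ad_hits_session(ad: Ad, session: Session) -> bool:
    """True when the ad was shown to the session's customer during the session."""
    return (
        ad.customer_id == session.customer_id
        # Both boundaries count, matching the "inclusive" wording of the Playback table.
        and session.start_time <= ad.timestamp <= session.end_time
    )


session = Session(session_id=1, customer_id=1, start_time=1, end_time=5)
print(ad_hits_session(Ad(1, 1, 5), session))  # True: an ad exactly at end_time still counts
print(ad_hits_session(Ad(2, 1, 6), session))  # False: one tick after the session ends
print(ad_hits_session(Ad(3, 2, 3), session))  # False: right time, different customer
```

A session belongs in the answer when this check is false for every ad in the `Ads` table.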

---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrames for `Playback` and `Ads` Tables**
3. **Perform a Left Anti Join to Find Sessions Without Ads**
4. **Display the Result**
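
Before the Spark version, it may help to see what a left anti join computes. The sketch below is a plain-Python illustration, not Spark's implementation: `left_anti_join` and `ad_in_session` are hypothetical helpers, and the rows are the same sample tuples used in the code for this approach. Each left row is kept only if no right row matches it, so a session with several matching ads (session 4 here) is simply dropped once rather than duplicated.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[int, ...]


def left_anti_join(
    left: Iterable[Row],
    right: Iterable[Row],
    matches: Callable[[Row, Row], bool],
) -> List[Row]:
    """Keep each left row for which no right row satisfies `matches`."""
    right_rows = list(right)  # materialise once so it can be scanned for every left row
    return [l for l in left if not any(matches(l, r) for r in right_rows)]


# (session_id, customer_id, start_time, end_time)
playback = [(1, 1, 1, 5), (2, 1, 15, 23), (3, 2, 10, 12), (4, 2, 17, 28), (5, 2, 2, 8)]
# (ad_id, customer_id, timestamp)
ads = [(1, 1, 5), (2, 2, 17), (3, 2, 20)]


def ad_in_session(session: Row, ad: Row) -> bool:
    """Same customer, and the ad timestamp lies inside the inclusive session window."""
    _, session_customer, start, end = session
    _, ad_customer, ts = ad
    return session_customer == ad_customer and start <= ts <= end


no_ads = left_anti_join(playback, ads, ad_in_session)
print([row[0] for row in no_ads])  # [2, 3, 5]
```

The nested scan is quadratic and only meant to show the semantics; Spark chooses its own join strategy for the real query.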

### **Code**

```python
from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("PlaybackWithoutAds").getOrCreate()

# Step 2: Create DataFrames for Playback and Ads Tables
playback_data = [
    (1, 1, 1, 5), (2, 1, 15, 23), (3, 2, 10, 12), (4, 2, 17, 28), (5, 2, 2, 8)
]
ads_data = [
    (1, 1, 5), (2, 2, 17), (3, 2, 20)
]

playback_columns = ["session_id", "customer_id", "start_time", "end_time"]
ads_columns = ["ad_id", "customer_id", "timestamp"]

playback_df = spark.createDataFrame(playback_data, playback_columns)
ads_df = spark.createDataFrame(ads_data, ads_columns)

# Step 3: Perform Left Anti Join (keep sessions with no matching ad).
# A session matches an ad when the customer is the same and the ad's
# timestamp lies within [start_time, end_time]; between() is inclusive.
no_ads_sessions = playback_df.join(
    ads_df,
    (playback_df.customer_id == ads_df.customer_id) &
    (ads_df.timestamp.between(playback_df.start_time, playback_df.end_time)),
    "leftanti"
).select("session_id")

# Step 4: Display Result
no_ads_sessions.show()
```

Output:

```text
+----------+
|session_id|
+----------+
|         2|
|         3|
|         5|
+----------+
```



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Create Spark Session** (reused from Approach 1)
2. **Create DataFrames for `Playback` and `Ads` Tables** (reused from Approach 1)
3. **Register Them as SQL Views**
4. **Write and Execute SQL Query**
5. **Display the Output**

### **Code**

```python
# Steps 1-2: reuse `spark`, `playback_df`, and `ads_df` from Approach 1

# Step 3: Register DataFrames as SQL Views
playback_df.createOrReplaceTempView("Playback")
ads_df.createOrReplaceTempView("Ads")

# Step 4: Run SQL Query
# The subquery lists every session that received at least one ad;
# the outer query keeps the sessions that are not in that list.
sql_query = """
SELECT session_id
FROM Playback
WHERE session_id NOT IN (
    SELECT DISTINCT p.session_id
    FROM Playback p
    JOIN Ads a
    ON p.customer_id = a.customer_id
    WHERE a.timestamp BETWEEN p.start_time AND p.end_time
)
"""

result_sql = spark.sql(sql_query)

# Step 5: Display Output
result_sql.show()
```

Output:

```text
+----------+
|session_id|
+----------+
|         2|
|         3|
|         5|
+----------+
```
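
A common alternative to `NOT IN` is a correlated `NOT EXISTS`, which avoids the usual `NOT IN` pitfall where a `NULL` in the subquery result makes the outer filter return no rows. That pitfall does not apply here because `session_id` is a primary key, so this is a stylistic variant rather than a fix. The sketch below runs the query with Python's built-in `sqlite3` purely so it can be tried without a Spark cluster; the assumption, not verified here, is that the same query text can be passed to `spark.sql(...)` against the `Playback` and `Ads` views. The `ORDER BY` is added only to make the printed order deterministic.

```python
import sqlite3

# In-memory database holding the same sample rows as the Spark DataFrames.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Playback (session_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       start_time INTEGER, end_time INTEGER);
CREATE TABLE Ads (ad_id INTEGER PRIMARY KEY, customer_id INTEGER, timestamp INTEGER);
""")
conn.executemany(
    "INSERT INTO Playback VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 5), (2, 1, 15, 23), (3, 2, 10, 12), (4, 2, 17, 28), (5, 2, 2, 8)],
)
conn.executemany("INSERT INTO Ads VALUES (?, ?, ?)", [(1, 1, 5), (2, 2, 17), (3, 2, 20)])

# Keep a session only if no ad exists for the same customer inside its window.
not_exists_query = """
SELECT p.session_id
FROM Playback p
WHERE NOT EXISTS (
    SELECT 1
    FROM Ads a
    WHERE a.customer_id = p.customer_id
      AND a.timestamp BETWEEN p.start_time AND p.end_time
)
ORDER BY p.session_id
"""

session_ids = [row[0] for row in conn.execute(not_exists_query)]
print(session_ids)  # [2, 3, 5]
conn.close()
```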



---

## **Summary**
| Approach       | Method                | Key Technique             |
|----------------|-----------------------|---------------------------|
| **Approach 1** | PySpark DataFrame API | Uses `leftanti` join      |
| **Approach 2** | SQL Query in PySpark  | Uses `NOT IN` with `JOIN` |
