# **Ad-Free Sessions**

## **Problem Statement**
We have two tables: **Playback** and **Ads**.

### **Table: Playback**
This table records the playback sessions of customers.

| Column Name | Type |
|-------------|------|
| session_id  | int  |
| customer_id | int  |
| start_time  | int  |
| end_time    | int  |

- `session_id` is the **primary key**.
- `customer_id` represents the **ID of the customer** watching this session.
- The session runs during the **inclusive interval** between `start_time` and `end_time`.
- **No two sessions for the same customer overlap**.

### **Table: Ads**
This table records the ads shown to customers.

| Column Name | Type |
|-------------|------|
| ad_id       | int  |
| customer_id | int  |
| timestamp   | int  |

- `ad_id` is the **primary key**.
- `customer_id` is the **ID of the customer** viewing this ad.
- `timestamp` represents the exact **time** the ad was shown.

---

## **Objective**
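Find the `session_id` of every session during which **no ads** were shown, that is, every session for which no ad with the same `customer_id` has a `timestamp` between `start_time` and `end_time` (inclusive).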

### **Example**
#### **Input: Playback Table**
| session_id | customer_id | start_time | end_time |
|------------|------------|------------|----------|
| 1          | 1          | 1          | 5        |
| 2          | 1          | 15         | 23       |
| 3          | 2          | 10         | 12       |
| 4          | 2          | 17         | 28       |
| 5          | 2          | 2          | 8        |

#### **Input: Ads Table**
| ad_id | customer_id | timestamp |
|-------|------------|-----------|
| 1     | 1          | 5         |
| 2     | 2          | 17        |
| 3     | 2          | 20        |

#### **Expected Output**
| session_id |
|------------|
| 2          |
| 3          |
| 5          |

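- **Session 1** (customer 1, interval 1 to 5) had an ad at time **5**; the interval is inclusive, so the ad counts and the session is excluded.
- **Session 4** (customer 2, interval 17 to 28) had ads at times **17 and 20**, so it's excluded.
- **Sessions 2, 3, and 5** contain no ad timestamp for their customer, so they are included in the result.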
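
Before moving to Spark, the rule above can be checked by hand with a small plain-Python sketch. It is only a brute-force reference for the example data, not part of either approach below, and the function name `ad_free_sessions` is a hypothetical helper introduced for illustration.

```python
# Example rows copied from the tables above.
playback = [
    # (session_id, customer_id, start_time, end_time)
    (1, 1, 1, 5),
    (2, 1, 15, 23),
    (3, 2, 10, 12),
    (4, 2, 17, 28),
    (5, 2, 2, 8),
]
ads = [
    # (ad_id, customer_id, timestamp)
    (1, 1, 5),
    (2, 2, 17),
    (3, 2, 20),
]


def ad_free_sessions(playback_rows, ad_rows):
    """Return the sorted session_ids that contain no ad for the same customer."""
    result = []
    for session_id, customer_id, start_time, end_time in playback_rows:
        # An ad counts only if it belongs to the same customer and its
        # timestamp lies in the inclusive interval [start_time, end_time].
        has_ad = any(
            ad_customer == customer_id and start_time <= timestamp <= end_time
            for _ad_id, ad_customer, timestamp in ad_rows
        )
        if not has_ad:
            result.append(session_id)
    return sorted(result)


print(ad_free_sessions(playback, ads))  # [2, 3, 5]
```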

---


## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrames for Playback & Ads**
3. **Perform Left Anti-Join on Playback and Ads (Same Customer, Ad Timestamp Within the Session Interval)**
4. **Select Only `session_id`**
5. **Display the Result**

### **Code**

```python
from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("SessionsWithoutAds").getOrCreate()

# Step 2: Create DataFrames
playback_data = [
    (1, 1, 1, 5),
    (2, 1, 15, 23),
    (3, 2, 10, 12),
    (4, 2, 17, 28),
    (5, 2, 2, 8),
]
playback_columns = ["session_id", "customer_id", "start_time", "end_time"]

ads_data = [
    (1, 1, 5),
    (2, 2, 17),
    (3, 2, 20),
]
ads_columns = ["ad_id", "customer_id", "timestamp"]

playback_df = spark.createDataFrame(playback_data, playback_columns)
ads_df = spark.createDataFrame(ads_data, ads_columns)

# Step 3: Perform Left Anti-Join (Exclude Matching Sessions)
# A session matches when an ad for the same customer falls inside
# its inclusive [start_time, end_time] interval.
joined_df = playback_df.join(
    ads_df,
    (playback_df.customer_id == ads_df.customer_id) &
    (ads_df.timestamp >= playback_df.start_time) &
    (ads_df.timestamp <= playback_df.end_time),
    "left_anti"
)

# Step 4: Select Only session_id
result_df = joined_df.select("session_id")

# Step 5: Display Output
result_df.show()
```


```
+----------+
|session_id|
+----------+
|         2|
|         3|
|         5|
+----------+
```
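
The join condition above combines an equality on `customer_id` with a range check on `timestamp`. As a rough illustration of what that condition asks for each session (and not a description of how Spark executes the join), the following plain-Python sketch groups ad timestamps by customer, sorts them, and uses binary search to test whether any timestamp lies in the inclusive interval. The helper names `build_ad_index` and `session_has_ad` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_ad_index(ad_rows):
    """Map customer_id -> sorted list of ad timestamps."""
    index = defaultdict(list)
    for _ad_id, customer_id, timestamp in ad_rows:
        index[customer_id].append(timestamp)
    for timestamps in index.values():
        timestamps.sort()
    return index


def session_has_ad(index, customer_id, start_time, end_time):
    """True if some ad for this customer has start_time <= timestamp <= end_time."""
    timestamps = index.get(customer_id, [])
    # Position of the first timestamp that is >= start_time.
    pos = bisect_left(timestamps, start_time)
    return pos < len(timestamps) and timestamps[pos] <= end_time


playback = [(1, 1, 1, 5), (2, 1, 15, 23), (3, 2, 10, 12), (4, 2, 17, 28), (5, 2, 2, 8)]
ads = [(1, 1, 5), (2, 2, 17), (3, 2, 20)]

index = build_ad_index(ads)
# Keep only the sessions with no matching ad, mirroring the anti-join.
ad_free = [
    sid for sid, cid, start, end in playback
    if not session_has_ad(index, cid, start, end)
]
print(ad_free)  # [2, 3, 5]
```

Only the first timestamp at or after `start_time` needs checking: if it is already past `end_time`, every later timestamp is too.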



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Register the Playback & Ads DataFrames (from Approach 1) as SQL Views**
2. **Run a `NOT EXISTS` Query to Find Sessions Without Ads**
3. **Display the Result**

### **Code**

```python
# Step 1: Register DataFrames as SQL Views
playback_df.createOrReplaceTempView("Playback")
ads_df.createOrReplaceTempView("Ads")

# Step 2: Run SQL Query
# The correlated subquery looks for an ad belonging to the same customer
# whose timestamp lies inside the session's inclusive interval.
sql_query = """
SELECT p.session_id
FROM Playback p
WHERE NOT EXISTS (
    SELECT 1
    FROM Ads a
    WHERE p.customer_id = a.customer_id
    AND a.timestamp BETWEEN p.start_time AND p.end_time
);
"""

result_sql = spark.sql(sql_query)

# Step 3: Display Output
result_sql.show()
```


```
+----------+
|session_id|
+----------+
|         2|
|         3|
|         5|
+----------+
```
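
The logic of the query can also be sanity-checked without a Spark cluster. The sketch below is an assumption-laden stand-in: it loads the example rows into an in-memory SQLite database through Python's standard `sqlite3` module and runs the same `NOT EXISTS` query, plus a `LEFT JOIN ... IS NULL` variant that should return the same rows. SQLite's dialect is not Spark SQL, so this checks the query logic only; the `ORDER BY` is added here just to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Playback (session_id INTEGER PRIMARY KEY, customer_id INTEGER,
                           start_time INTEGER, end_time INTEGER);
    CREATE TABLE Ads (ad_id INTEGER PRIMARY KEY, customer_id INTEGER,
                      timestamp INTEGER);
    INSERT INTO Playback VALUES (1, 1, 1, 5), (2, 1, 15, 23), (3, 2, 10, 12),
                                (4, 2, 17, 28), (5, 2, 2, 8);
    INSERT INTO Ads VALUES (1, 1, 5), (2, 2, 17), (3, 2, 20);
    """
)

# Same query as above, with an ORDER BY for stable output.
not_exists_query = """
SELECT p.session_id
FROM Playback p
WHERE NOT EXISTS (
    SELECT 1
    FROM Ads a
    WHERE p.customer_id = a.customer_id
      AND a.timestamp BETWEEN p.start_time AND p.end_time
)
ORDER BY p.session_id
"""

# Alternative formulation: keep Playback rows whose LEFT JOIN found no ad.
left_join_query = """
SELECT p.session_id
FROM Playback p
LEFT JOIN Ads a
  ON p.customer_id = a.customer_id
 AND a.timestamp BETWEEN p.start_time AND p.end_time
WHERE a.ad_id IS NULL
ORDER BY p.session_id
"""

not_exists_ids = [row[0] for row in conn.execute(not_exists_query)]
left_join_ids = [row[0] for row in conn.execute(left_join_query)]
print(not_exists_ids, left_join_ids)  # [2, 3, 5] [2, 3, 5]
conn.close()
```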



---

## **Summary**
| Approach       | Method                | Key Idea                                                   |
|----------------|-----------------------|------------------------------------------------------------|
| **Approach 1** | PySpark DataFrame API | Uses **Left Anti-Join** to filter out sessions with ads    |
| **Approach 2** | SQL Query in PySpark  | Uses **NOT EXISTS** to check for ads in session intervals  |

