
# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio — Feature Engineering & Beyond

**Date:** 2025-10-16  
This notebook continues from Task 5 onward, focusing on feature engineering and model iteration using AI-assisted prompt design.

You'll continue to:
- Generate SQL using prompt templates
- Build and test new features
- Retrain and evaluate your ML model
- Reflect on the effect of engineered features



## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?


In [3]:
import pandas_gbq

sql = """
SELECT
    CASE
        WHEN r3_min < 100 THEN 'low'
        WHEN r3_min >= 100 AND r3_min <= 300 THEN 'medium'
        ELSE 'high'
    END AS watch_time_bucket,
    churn_next_month
FROM
    `mgmt467-lab.netflix.feat_churn_lite`
"""

project_id = "mgmt467-lab"
df_watch_time_buckets = pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")

# Calculate churn rate for each bucket
churn_rate_by_bucket = df_watch_time_buckets.groupby('watch_time_bucket')['churn_next_month'].mean().reset_index()
churn_rate_by_bucket



| | watch_time_bucket | churn_next_month |
|---|---|---|
| 0 | high | 0.659109 |
| 1 | low | 0.660162 |
| 2 | medium | 0.663568 |


The three buckets show almost no difference: each has a churn rate of approximately 66%. This suggests that, with the current thresholds, watch time alone is not a strong differentiator for churn in this dataset.
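The bucketing can also be reproduced locally to sanity-check the thresholds and to see how many rows land in each bucket. The sketch below is a minimal illustration on made-up values: the `toy` frame and the `watch_time_bucket` helper are hypothetical and are not data or code from `feat_churn_lite`.

```python
import pandas as pd

def watch_time_bucket(minutes):
    """Mirror the CASE expression: <100 low, 100-300 medium, else high."""
    if minutes < 100:
        return "low"
    if minutes <= 300:
        return "medium"
    return "high"

# Hypothetical rows, used only to exercise the boundary values 100 and 300.
toy = pd.DataFrame({
    "r3_min": [50, 100, 300, 301, 20, 250],
    "churn_next_month": [1, 0, 1, 1, 0, 1],
})
toy["watch_time_bucket"] = toy["r3_min"].apply(watch_time_bucket)

# Report the bucket size next to the churn rate.
bucket_summary = (
    toy.groupby("watch_time_bucket")["churn_next_month"]
    .agg(churn_rate="mean", n_rows="count")
    .reset_index()
)
print(bucket_summary)
```

One detail worth checking on the real table: in the SQL version, a row whose `r3_min` is NULL fails both comparisons and is labeled 'high' by the ELSE branch.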


## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?


In [5]:
# prompt: Add a binary column flag_binge (1 if total_minutes > 500).
# Requirements: Use IF logic to create a binary column in SQL. Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0. example: Add a binary column flag_binge to identify binge-watchers.

sql = """
SELECT
    CASE
        WHEN r3_min > 500 THEN 1
        ELSE 0
    END AS flag_binge,
    churn_next_month
FROM
    `mgmt467-lab.netflix.feat_churn_lite`
"""

project_id = "mgmt467-lab"
df_binge_watchers = pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")

# Calculate churn rate for binge-watchers vs. non-binge-watchers
churn_rate_by_binge = df_binge_watchers.groupby('flag_binge')['churn_next_month'].mean().reset_index()
churn_rate_by_binge



| | flag_binge | churn_next_month |
|---|---|---|
| 0 | 0 | 0.659724 |
| 1 | 1 | 0.659501 |


Binge-watchers and non-binge-watchers churn at nearly the same rate (about 65.95% versus 65.97%). Because the gap is so small, flag_binge does not meaningfully differentiate churn behavior.
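To put a number on "small", one option is a two-proportion z-test. The sketch below uses the two churn rates from the output above. The group sizes are hypothetical placeholders, because the notebook does not report row counts per group, so the resulting statistic is illustrative only.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Return the z statistic for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Rates are from the notebook output; n1 and n2 are ASSUMED, not observed.
z = two_proportion_z(0.659724, 50_000, 0.659501, 50_000)
print(f"z = {z:.3f}")
print("difference is within noise at the 5% level:", abs(z) < 1.96)
```

With the real group counts from a `COUNT(*)` per `flag_binge` value, the same function would show whether the observed gap is distinguishable from sampling noise.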


## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have highest churn?


In [7]:
# prompt: Create plan_region_combo by combining plan_tier and region.
# Requirements: Use CONCAT or STRING functions. Generate SQL to create a new column by combining plan_tier and region with an underscore. example: Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

sql = """
SELECT
    CONCAT(subscription_plan, '_', country) AS plan_region_combo,
    churn_next_month
FROM
    `mgmt467-lab.netflix.feat_churn_lite`
"""

project_id = "mgmt467-lab"
df_plan_region_combo = pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")

# Calculate churn rate for each plan-region combo
churn_rate_by_combo = df_plan_region_combo.groupby('plan_region_combo')['churn_next_month'].mean().reset_index()

# Sort to find combos with highest churn
highest_churn_combos = churn_rate_by_combo.sort_values(by='churn_next_month', ascending=False)
print(highest_churn_combos.head())



Among the plan-region combos, Standard_Canada has the highest churn rate.
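When ranking combos by churn rate, it can help to show the group size as well, since a combo with few rows can top the list by chance. The following is a minimal sketch on a made-up frame: the `toy` rows are hypothetical, and only the column names follow the query above.

```python
import pandas as pd

# Hypothetical rows; only the column names match feat_churn_lite.
toy = pd.DataFrame({
    "subscription_plan": ["Basic", "Basic", "Standard", "Standard", "Premium", "Premium"],
    "country": ["Canada", "USA", "Canada", "Canada", "USA", "USA"],
    "churn_next_month": [1, 0, 1, 1, 0, 1],
})
toy["plan_region_combo"] = toy["subscription_plan"] + "_" + toy["country"]

# Churn rate and row count per combo, highest churn first.
combo_summary = (
    toy.groupby("plan_region_combo")["churn_next_month"]
    .agg(churn_rate="mean", n_rows="count")
    .sort_values("churn_rate", ascending=False)
    .reset_index()
)
print(combo_summary)
```

Note that BigQuery's `CONCAT` returns NULL when any argument is NULL, so rows with a missing plan or country would form their own NULL group in the SQL version.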


## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?


In [9]:
# prompt: Add binary flags to capture NULL values in age_band and avg_rating.
# Requirements: Use IS NULL logic to create new flag columns. Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0. example: Add is_missing_age that flags rows where age_band IS NULL.

sql = """
SELECT
    CASE
        WHEN age IS NULL THEN 1
        ELSE 0
    END AS is_missing_age,
    churn_next_month
FROM
    `mgmt467-lab.netflix.feat_churn_lite`
"""

project_id = "mgmt467-lab"
df_missing_flags = pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")

# Calculate churn rate for missing age vs. non-missing age
churn_rate_by_missing_age = df_missing_flags.groupby('is_missing_age')['churn_next_month'].mean().reset_index()
print("Churn rate by missing age:")
churn_rate_by_missing_age

Churn rate by missing age:


| | is_missing_age | churn_next_month |
|---|---|---|
| 0 | 0 | 0.659517 |
| 1 | 1 | 0.660169 |


The churn rate is nearly identical whether or not age is missing (about 66.0% versus 65.95%), which suggests that a missing age value does not strongly correlate with a higher or lower likelihood of churn.
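The task asks for flags on more than one column, while the query above covers only `age`. A minimal pandas sketch of the `is_missing_[col_name]` pattern for several columns at once is shown below. The `toy` frame is hypothetical, and the presence of an `avg_rating` column in `feat_churn_lite` is an assumption.

```python
import pandas as pd

# Hypothetical rows; avg_rating is ASSUMED to exist in the real table.
toy = pd.DataFrame({
    "age": [34.0, None, 51.0, None],
    "avg_rating": [4.2, 3.9, None, None],
    "churn_next_month": [1, 0, 1, 1],
})

# One 0/1 indicator per column, named is_missing_<column>.
for col in ["age", "avg_rating"]:
    toy[f"is_missing_{col}"] = toy[col].isna().astype(int)

print(toy[["is_missing_age", "is_missing_avg_rating", "churn_next_month"]])
```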


## Task 5.4: Create Time-Based Features (Optional)

**🎯 Goal:** Add a column days_since_last_login.  
**📌 Requirements:** Use DATE_DIFF with CURRENT_DATE and last_login_date.

---

### 🧠 Prompt Template  
> Write SQL to create a column showing days since last login using DATE_DIFF.

---

### 👩‍🏫 Example Prompt  
> Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

---

### 🔍 Exploration  
Does login recency affect churn rate?


In [14]:
import pandas
import pandas_gbq

# prompt: Add a column days_since_last_login.
# Requirements: Use DATE_DIFF with CURRENT_DATE and last_login_date. Write SQL to create a column showing days since last login using DATE_DIFF. example: Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

sql = """
SELECT
    DATE_DIFF(CURRENT_DATE(), month, DAY) AS days_since_last_month_start,
    churn_next_month
FROM
    `mgmt467-lab.netflix.feat_churn_lite`
"""

project_id = "mgmt467-lab"
df_days_since_last_month_start = pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")

# Calculate churn rate by days_since_last_month_start (you might want to bucket this for better analysis)
# For simplicity, let's just look at the correlation or some descriptive stats for now.
print(df_days_since_last_month_start.head())

# To explore the relationship with churn, we might want to create buckets for days_since_last_month_start
# For example, let's create 3 buckets: recent, moderate, old
bins = [0, 30, 90, 365] # Example buckets: 0-30 days, 31-90 days, 91-365 days
labels = ['recent', 'moderate', 'old']
df_days_since_last_month_start['days_bucket'] = pandas.cut(df_days_since_last_month_start['days_since_last_month_start'], bins=bins, labels=labels, right=False)

churn_rate_by_days_bucket = df_days_since_last_month_start.groupby('days_bucket', observed=False)['churn_next_month'].mean().reset_index()
print("Churn rate by days since last month start bucket:")
churn_rate_by_days_bucket

   days_since_last_month_start  churn_next_month
0                           -7                 1
1                           -7                 0
2                           -7                 1
3                           -7                 1
4                           -7                 1
Churn rate by days since last month start bucket:


| | days_bucket | churn_next_month |
|---|---|---|
| 0 | recent | 0.662913 |
| 1 | moderate | 0.663835 |
| 2 | old | 0.661661 |


The spread across these buckets (about 0.2 percentage points) is slightly larger than for flag_binge or is_missing_age. Overall, however, it is still too small to suggest that this feature affects the churn rate significantly.
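The first rows printed above have a value of -7, which lies below the lowest bin edge of 0. The sketch below, on a small hypothetical series, shows how `pandas.cut` treats such values with the same `bins`, `labels`, and `right=False` settings: anything outside the edges becomes NaN and silently drops out of the grouped churn rates.

```python
import pandas as pd

# Hypothetical values chosen to hit each bin edge and both out-of-range sides.
days = pd.Series([-7, 0, 29, 30, 89, 90, 364, 365])
bins = [0, 30, 90, 365]
labels = ["recent", "moderate", "old"]

# right=False makes each bin left-closed: [0, 30), [30, 90), [90, 365).
buckets = pd.cut(days, bins=bins, labels=labels, right=False)

print(buckets.tolist())
print("rows outside every bin:", int(buckets.isna().sum()))
```

Counting the NaN bucket on the real frame would show how many rows are excluded from the table above.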


## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?
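One possible way to approach this task is sketched below, reusing the expressions from Tasks 5.0 to 5.3. The destination table name and its dataset are assumptions based on the task description, and the SQL is only built and printed here, not executed. Running it would require BigQuery access.

```python
SOURCE_TABLE = "mgmt467-lab.netflix.feat_churn_lite"
# ASSUMED destination: table name from the task text, dataset from the source.
TARGET_TABLE = "mgmt467-lab.netflix.churn_features_enhanced"

# Keep every original column and append the engineered ones.
create_sql = f"""
CREATE OR REPLACE TABLE `{TARGET_TABLE}` AS
SELECT
    *,
    CASE
        WHEN r3_min < 100 THEN 'low'
        WHEN r3_min <= 300 THEN 'medium'
        ELSE 'high'
    END AS watch_time_bucket,
    IF(r3_min > 500, 1, 0) AS flag_binge,
    CONCAT(subscription_plan, '_', country) AS plan_region_combo,
    IF(age IS NULL, 1, 0) AS is_missing_age
FROM
    `{SOURCE_TABLE}`
"""

# Row counts should match, and NULL combos reveal NULLs introduced by CONCAT.
check_sql = f"""
SELECT
    (SELECT COUNT(*) FROM `{SOURCE_TABLE}`) AS source_rows,
    (SELECT COUNT(*) FROM `{TARGET_TABLE}`) AS enhanced_rows,
    (SELECT COUNTIF(plan_region_combo IS NULL) FROM `{TARGET_TABLE}`) AS null_combos
"""

print(create_sql)
print(check_sql)
```

Because the SELECT has no joins or filters, the two row counts are expected to be equal. The `null_combos` column addresses the second exploration question.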



## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?


In [16]:
# prompt: Train a logistic regression model using churn_features_enhanced.
# Requirements: Use BQML logistic_reg model with new feature columns. Write CREATE MODEL SQL using enhanced features including flags and buckets. example: Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

sql = """
CREATE OR REPLACE MODEL `mgmt467-lab.netflix.churn_logreg_enhanced`
OPTIONS(
    model_type='logistic_reg',
    input_label_cols=['churn_next_month']
) AS
SELECT
    r3_sess,
    r3_min,
    unique_days_watched,
    avg_watch_duration,
    days_since_last_month_start,
    subscription_plan,
    country,
    age,
    CASE
        WHEN r3_min < 100 THEN 'low'
        WHEN r3_min >= 100 AND r3_min <= 300 THEN 'medium'
        ELSE 'high'
    END AS watch_time_bucket,
    CASE
        WHEN r3_min > 500 THEN 1
        ELSE 0
    END AS flag_binge,
    CONCAT(subscription_plan, '_', country) AS plan_region_combo,
    CASE
        WHEN age IS NULL THEN 1
        ELSE 0
    END AS is_missing_age,
    churn_next_month
FROM
    `mgmt467-lab.netflix.feat_churn_lite`
"""

project_id = "mgmt467-lab"
pandas_gbq.read_gbq(sql, project_id=project_id, dialect="standard")

In [18]:
import pandas_gbq

project_id = "mgmt467-lab"

sql_evaluate_enhanced = """
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mgmt467-lab.netflix.churn_logreg_enhanced`)
"""

df_enhanced_model_evaluation = pandas_gbq.read_gbq(sql_evaluate_enhanced, project_id=project_id, dialect="standard")
display(df_enhanced_model_evaluation)



| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.663442 | 1.0 | 0.663442 | 0.797674 | 0.638767 | 0.50329 |


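Accuracy improved slightly, from 0.661051 to 0.663442, and ROC AUC rose from 0.498698 to 0.50329. Both ROC AUC values sit near 0.5, so neither model ranks churners much better than chance. A recall of 1.0 with precision equal to accuracy also suggests that the enhanced model predicts churn for every row.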
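The pattern in the evaluation row (recall of 1.0, precision equal to accuracy) is what a classifier produces when it predicts the positive class for every row. The sketch below illustrates this with a small hand-rolled metric function on hypothetical labels. It does not use BigQuery ML or the real data.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, and accuracy from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical labels with a churn rate of 4/6.
y_true = [1, 1, 0, 1, 0, 1]
always_churn = [1] * len(y_true)
never_churn = [0] * len(y_true)

print("always churn:", binary_metrics(y_true, always_churn))
print("never churn: ", binary_metrics(y_true, never_churn))
```

In the always-churn case, precision and accuracy both equal the share of positive labels, so a small change in accuracy between two such constant predictors reflects the label mix rather than learned signal.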


## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?


In [21]:
# prompt: Compare base model vs enhanced model using ML.EVALUATE.
# Requirements: Use same evaluation query for both models. Write a SQL query to evaluate churn_model_enhanced and compare with churn_model. example: Compare ML.EVALUATE output from both models side-by-side.

import pandas_gbq

project_id = "mgmt467-lab"

# Evaluate the base model (churn_logreg_lite)
sql_evaluate_base = """
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mgmt467-lab.netflix.churn_logreg_lite`)
"""
df_base_model_evaluation = pandas_gbq.read_gbq(sql_evaluate_base, project_id=project_id, dialect="standard")

# Evaluate the enhanced model (churn_logreg_enhanced)
sql_evaluate_enhanced = """
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mgmt467-lab.netflix.churn_logreg_enhanced`)
"""
df_enhanced_model_evaluation = pandas_gbq.read_gbq(sql_evaluate_enhanced, project_id=project_id, dialect="standard")

print("Base Model Evaluation:")
display(df_base_model_evaluation)

print("Enhanced Model Evaluation:")
display(df_enhanced_model_evaluation)

Base Model Evaluation:


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.661051 | 0.0 | 0.640383 | 0.498698 |


Enhanced Model Evaluation:


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.663442 | 1.0 | 0.663442 | 0.797674 | 0.638767 | 0.50329 |
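For a side-by-side view, the two evaluation rows can be placed in one frame with a difference column. The sketch below hard-codes the metric values printed above rather than querying BigQuery. In the notebook, the same frame could presumably be built from `df_base_model_evaluation` and `df_enhanced_model_evaluation`.

```python
import pandas as pd

# Metric values copied from the two ML.EVALUATE outputs above.
base = {"precision": 0.0, "recall": 0.0, "accuracy": 0.661051,
        "f1_score": 0.0, "log_loss": 0.640383, "roc_auc": 0.498698}
enhanced = {"precision": 0.663442, "recall": 1.0, "accuracy": 0.663442,
            "f1_score": 0.797674, "log_loss": 0.638767, "roc_auc": 0.50329}

# One row per metric, one column per model, plus the change.
comparison = pd.DataFrame({"base": base, "enhanced": enhanced})
comparison["delta"] = comparison["enhanced"] - comparison["base"]
print(comparison.round(6))
```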


In [22]:
import pandas as pd

# Consolidate findings for direct comparison

# Task 5.0: watch_time_bucket
# Data from df_watch_time_buckets, churn_rate_by_bucket
watch_time_comparison = churn_rate_by_bucket.rename(columns={'churn_next_month': 'churn_rate'})
watch_time_comparison['feature'] = 'watch_time_bucket'
watch_time_comparison['category'] = watch_time_comparison['watch_time_bucket']
watch_time_comparison = watch_time_comparison[['feature', 'category', 'churn_rate']]

# Task 5.1: flag_binge
# Data from df_binge_watchers, churn_rate_by_binge
binge_flag_comparison = churn_rate_by_binge.rename(columns={'churn_next_month': 'churn_rate'})
binge_flag_comparison['feature'] = 'flag_binge'
binge_flag_comparison['category'] = binge_flag_comparison['flag_binge'].apply(lambda x: 'binge-watcher' if x == 1 else 'non-binge-watcher')
binge_flag_comparison = binge_flag_comparison[['feature', 'category', 'churn_rate']]

# Task 5.2: plan_region_combo (showing top 3 highest churn for brevity)
# Data from df_plan_region_combo, highest_churn_combos
plan_region_comparison = highest_churn_combos.head(3).rename(columns={'churn_next_month': 'churn_rate'})
plan_region_comparison['feature'] = 'plan_region_combo'
plan_region_comparison['category'] = plan_region_comparison['plan_region_combo']
plan_region_comparison = plan_region_comparison[['feature', 'category', 'churn_rate']]

# Task 5.3: is_missing_age
# Data from df_missing_flags, churn_rate_by_missing_age
missing_age_comparison = churn_rate_by_missing_age.rename(columns={'churn_next_month': 'churn_rate'})
missing_age_comparison['feature'] = 'is_missing_age'
missing_age_comparison['category'] = missing_age_comparison['is_missing_age'].apply(lambda x: 'missing age' if x == 1 else 'non-missing age')
missing_age_comparison = missing_age_comparison[['feature', 'category', 'churn_rate']]

# Task 5.4: days_bucket
# Data from df_days_since_last_month_start, churn_rate_by_days_bucket
days_bucket_comparison = churn_rate_by_days_bucket.rename(columns={'churn_next_month': 'churn_rate'})
days_bucket_comparison['feature'] = 'days_bucket'
days_bucket_comparison['category'] = days_bucket_comparison['days_bucket']
days_bucket_comparison = days_bucket_comparison[['feature', 'category', 'churn_rate']]

# Combine all comparisons into a single DataFrame
summary_comparison_df = pd.concat([
    watch_time_comparison,
    binge_flag_comparison,
    plan_region_comparison,
    missing_age_comparison,
    days_bucket_comparison
])

print("\n--- Direct Comparison of Feature Impact on Churn ---")
display(summary_comparison_df.round(4))


--- Direct Comparison of Feature Impact on Churn ---


| | feature | category | churn_rate |
|---|---|---|---|
| 0 | watch_time_bucket | high | 0.6591 |
| 1 | watch_time_bucket | low | 0.6602 |
| 2 | watch_time_bucket | medium | 0.6636 |
| 0 | flag_binge | non-binge-watcher | 0.6597 |
| 1 | flag_binge | binge-watcher | 0.6595 |
| 6 | plan_region_combo | Standard_Canada | 0.6649 |
| 2 | plan_region_combo | Premium+_Canada | 0.6642 |
| 0 | plan_region_combo | Basic_Canada | 0.6618 |
| 0 | is_missing_age | non-missing age | 0.6595 |
| 1 | is_missing_age | missing age | 0.6602 |


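Based on this comparison, plan_region_combo appears to make the most difference, with Standard_Canada showing the highest churn rate (0.6649). Even so, the churn rates across all features differ by well under one percentage point.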
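One simple way to rank features from the summary table is the spread between the highest and lowest churn rate within each feature. The sketch below copies the rounded rates from the table above. Note that `plan_region_combo` lists only its three highest-churn combos, so its spread here is a lower bound on the full range.

```python
# Rounded churn rates copied from the summary table above.
rates = {
    "watch_time_bucket": [0.6591, 0.6602, 0.6636],
    "flag_binge": [0.6597, 0.6595],
    "plan_region_combo": [0.6649, 0.6642, 0.6618],  # top 3 combos only
    "is_missing_age": [0.6595, 0.6602],
}

# Spread = max churn rate minus min churn rate within each feature.
spread = {feature: round(max(vals) - min(vals), 4) for feature, vals in rates.items()}

for feature, gap in sorted(spread.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {gap:.4f}")
```

Computing the spread over all plan-region combos, rather than the top three, would give a fairer comparison against the other features.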