<a href="https://colab.research.google.com/github/surmehta1/mgmt467-analytics-portfolio/blob/main/Unit2_Lab2_Churn_Modeling_FeatureEngineering_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [23]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "mgmt467-471119"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [34]:
# ✅ Verify BigQuery access
%%bigquery --project $project_id
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user

| | today | user |
|---|---|---|
| 0 | 2025-10-25 | mehtasur27@gmail.com |


In [38]:
%%bigquery --project $project_id
SELECT table_name
FROM `mgmt467-471119.netflix.INFORMATION_SCHEMA.TABLES`;

| | table_name |
|---|---|
| 0 | users |
| 1 | user_month_grid |
| 2 | reviews |
| 3 | calendar_months |
| 4 | watch_history |
| 5 | activity_roll3 |
| 6 | search_logs |
| 7 | month_bounds |
| 8 | recommendation_logs |
| 9 | activity_filled |


In [28]:
# ✅ Prepare base churn features
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `your_dataset.churn_features` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `your_dataset.cleaned_features`
WHERE churn_label IS NOT NULL;

Executing query with job ID: e9417edd-1dac-45a3-a928-e9d62532f23d
Query executing: 0.35s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: e9417edd-1dac-45a3-a928-e9d62532f23d
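
The 404 above comes from the template placeholder `your_dataset` being sent to BigQuery unchanged. A minimal sketch of one way to catch this before a query runs, assuming a small helper (the name `qualified_table` is hypothetical, not part of any library) that builds fully qualified table names and rejects placeholders:

```python
def qualified_table(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted `project.dataset.table` identifier."""
    # Refuse template placeholders such as "your_dataset" before any query is sent.
    for part in (project, dataset, table):
        if not part or part.startswith("your_"):
            raise ValueError(f"placeholder or empty identifier: {part!r}")
    return f"`{project}.{dataset}.{table}`"


project_id = "mgmt467-471119"  # project used throughout this notebook
dataset_id = "netflix"         # dataset queried via INFORMATION_SCHEMA.TABLES earlier

features_table = qualified_table(project_id, dataset_id, "churn_features")
print(features_table)
```

The returned string can be pasted into a `%%bigquery` cell in place of `your_dataset.churn_features`. The helper only checks the name; it does not confirm that the table exists.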



In [45]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `mgmt467-471119.netflix.churn_features_final` AS
SELECT
  u.user_id,
  u.region,
  u.plan_tier,
  u.age_band,
  l.churn_label
FROM `mgmt467-471119.netflix.users` AS u
LEFT JOIN `mgmt467-471119.netflix.labels_next_month` AS l
  USING (user_id)
WHERE l.churn_label IS NOT NULL;

Executing query with job ID: c43154a4-7ae0-47f8-b5d6-2770178891eb
Query executing: 0.51s


ERROR:
 400 Name churn_label not found inside l at [11:9]; reason: invalidQuery, location: query, message: Name churn_label not found inside l at [11:9]

Location: US
Job ID: c43154a4-7ae0-47f8-b5d6-2770178891eb
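
A "Name ... not found" error like the one above usually means the column does not exist under that name in the joined table. One way to check before writing the join is to list each table's columns from `INFORMATION_SCHEMA.COLUMNS`. The sketch below only builds the SQL text; the helper name `columns_query` is an assumption for illustration, and the resulting string would still need to be run in a `%%bigquery` cell.

```python
def columns_query(project: str, dataset: str, tables: list[str]) -> str:
    """Build a query that lists column names and types for the given tables."""
    quoted = ", ".join(f"'{t}'" for t in tables)
    return (
        "SELECT table_name, column_name, data_type\n"
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS`\n"
        f"WHERE table_name IN ({quoted})\n"
        "ORDER BY table_name, ordinal_position"
    )


# Tables referenced by the failing join above.
sql = columns_query("mgmt467-471119", "netflix", ["users", "labels_next_month"])
print(sql)
```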



In [44]:
%%bigquery --project $project_id
SELECT *
FROM `mgmt467-471119.netflix.labels_next_month`
LIMIT 5;

| | user_id | month | active_next_month |
|---|---|---|---|
| 0 | user_02198 | 2025-12-01 | |
| 1 | user_01378 | 2025-12-01 | |
| 2 | user_05915 | 2025-12-01 | |
| 3 | user_09919 | 2025-12-01 | |
| 4 | user_01820 | 2025-12-01 | |


In [46]:
# ✅ Train base logistic regression model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `your_dataset.churn_features`;

Executing query with job ID: 6ac8daa8-4569-4127-8320-473204ebca00
Query executing: 0.33s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: 6ac8daa8-4569-4127-8320-473204ebca00



In [47]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `mgmt467-471119.netflix.churn_features_final` AS
SELECT
  u.user_id,
  u.region,
  u.plan_tier,
  u.age_band,
  a.total_minutes,
  a.num_sessions,
  CASE
    WHEN l.active_next_month = 0 THEN 1
    WHEN l.active_next_month = 1 THEN 0
    ELSE NULL
  END AS churn_label
FROM `mgmt467-471119.netflix.users` AS u
LEFT JOIN `mgmt467-471119.netflix.activity_monthly` AS a
  USING (user_id)
LEFT JOIN `mgmt467-471119.netflix.labels_next_month` AS l
  USING (user_id)
WHERE l.active_next_month IS NOT NULL;

Executing query with job ID: 3efb6cb6-bb9b-4cc5-add9-28ea69eb852f
Query executing: 0.79s


ERROR:
 400 Name region not found inside u at [4:5]; reason: invalidQuery, location: query, message: Name region not found inside u at [4:5]

Location: US
Job ID: 3efb6cb6-bb9b-4cc5-add9-28ea69eb852f
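
The `CASE` expression in the cell above inverts `active_next_month` into a churn label. The same mapping can be written in plain Python to make the handling of missing values explicit. This is a sketch with made-up input values, not output from the real tables:

```python
from typing import Optional


def churn_label(active_next_month: Optional[int]) -> Optional[int]:
    """Mirror the SQL CASE: inactive next month means churned (1)."""
    if active_next_month == 0:
        return 1
    if active_next_month == 1:
        return 0
    return None  # missing or unexpected values stay unlabeled


values = [0, 1, None]  # hypothetical active_next_month values
labels = [churn_label(v) for v in values]
print(labels)  # [1, 0, None]
```

Rows whose `active_next_month` is empty, like the five sample rows for 2025-12-01 shown earlier, get no label and are dropped by the `WHERE l.active_next_month IS NOT NULL` filter. Because `labels_next_month` also has a `month` column, joining on `user_id` alone may return more than one row per user.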



In [48]:
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `mgmt467-471119.netflix.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  avg_progress,
  num_sessions,
  churn_label
FROM `mgmt467-471119.netflix.churn_features`;

Executing query with job ID: 211618a8-2039-4ff2-8f64-ace0d8534438
Query executing: 0.33s


ERROR:
 404 Not found: Table mgmt467-471119:netflix.churn_features was not found in location US; reason: notFound, message: Not found: Table mgmt467-471119:netflix.churn_features was not found in location US

Location: US
Job ID: 211618a8-2039-4ff2-8f64-ace0d8534438



In [41]:
# ✅ Evaluate base model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `your_dataset.churn_model`);

Executing query with job ID: 7f53869b-2ffc-4750-8f8c-0a3033792fdf
Query executing: 0.38s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: 7f53869b-2ffc-4750-8f8c-0a3033792fdf



In [31]:
# ✅ Predict churn with base model
%%bigquery --project $project_id
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `your_dataset.churn_model`,
                (SELECT * FROM `your_dataset.churn_features`));

Executing query with job ID: 5ad2b6ee-ca7d-4702-b408-34c201df0820
Query executing: 0.34s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: 5ad2b6ee-ca7d-4702-b408-34c201df0820
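
Once the prediction query succeeds, `predicted_churn_label_probs` should hold one entry per class, each with a `label` and a `prob` field. The sketch below assumes each row reaches Python as a list of dicts with those two keys and pulls out the probability of churn (label 1). The sample row is hypothetical, not real model output.

```python
def churn_probability(probs, positive_label=1):
    """Return the probability attached to the positive (churn) class, if present."""
    for entry in probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    return None


# Hypothetical row shaped like one ML.PREDICT result.
row = {
    "user_id": "user_00001",
    "predicted_churn_label": 0,
    "predicted_churn_label_probs": [
        {"label": 1, "prob": 0.25},
        {"label": 0, "prob": 0.75},
    ],
}

p_churn = churn_probability(row["predicted_churn_label_probs"])
print(row["user_id"], p_churn)
```

Sorting users by this probability, rather than relying only on the predicted label, allows a decision threshold other than the default to be chosen later.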




## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags
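
The three transformations listed above can be mirrored in plain Python to check edge cases before running the SQL. This is a sketch, assuming the same thresholds as the SQL cell that follows (100 and 300 minutes for the buckets, 500 minutes for the flag); the function name and the sample row are made up for illustration.

```python
def engineer_features(row):
    """Mirror the CASE / CONCAT / IF expressions used for the enhanced features."""
    minutes = row.get("total_minutes")
    plan, region = row.get("plan_tier"), row.get("region")

    # Bucket: BETWEEN is inclusive, so 100 and 300 both count as "medium".
    if minutes is not None and minutes < 100:
        bucket = "low"
    elif minutes is not None and 100 <= minutes <= 300:
        bucket = "medium"
    else:
        bucket = "high"  # a missing value also lands here, like the SQL ELSE branch

    # Interaction: CONCAT yields NULL when any argument is NULL.
    combo = None if plan is None or region is None else f"{plan}_{region}"

    # Flag: a comparison against a missing value is not true, so the flag is 0.
    flag_binge = 1 if minutes is not None and minutes > 500 else 0

    return {
        "watch_time_bucket": bucket,
        "plan_region_combo": combo,
        "flag_binge": flag_binge,
    }


sample = {"plan_tier": "basic", "region": "US", "total_minutes": 420}  # hypothetical
print(engineer_features(sample))
```

Two details are worth noticing: a user with missing `total_minutes` is bucketed as `'high'`, and users between 300 and 500 minutes are `'high'` without being flagged as binge viewers.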


In [32]:

# ✅ Create enhanced feature set
%%bigquery --project $project_id
CREATE OR REPLACE TABLE `your_dataset.churn_features_enhanced` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket,
  avg_progress,
  num_sessions,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  IF(total_minutes > 500, 1, 0) AS flag_binge,
  churn_label
FROM `your_dataset.churn_features`;


Executing query with job ID: 01ef1c73-4e9c-425f-be95-517dbbedb4a5
Query executing: 0.25s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: 01ef1c73-4e9c-425f-be95-517dbbedb4a5



In [33]:

# ✅ Train enhanced model
%%bigquery --project $project_id
CREATE OR REPLACE MODEL `your_dataset.churn_model_enhanced`
OPTIONS(model_type='logistic_reg') AS
SELECT
  region,
  plan_tier,
  age_band,
  watch_time_bucket,
  avg_rating,
  avg_progress,
  num_sessions,
  plan_region_combo,
  flag_binge,
  churn_label
FROM `your_dataset.churn_features_enhanced`;


Executing query with job ID: 3512dfb0-22f6-48bc-b498-27d3be9f5b71
Query executing: 0.24s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: 3512dfb0-22f6-48bc-b498-27d3be9f5b71



In [49]:

# ✅ Evaluate enhanced model
%%bigquery --project $project_id
SELECT *
FROM ML.EVALUATE(MODEL `your_dataset.churn_model_enhanced`);


Executing query with job ID: 16310725-e27c-47de-b672-b64f3c3cdff1
Query executing: 0.34s


ERROR:
 404 Not found: Dataset mgmt467-471119:your_dataset was not found in location US; reason: notFound, message: Not found: Dataset mgmt467-471119:your_dataset was not found in location US

Location: US
Job ID: 16310725-e27c-47de-b672-b64f3c3cdff1




## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_region_combo`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.


### 1. Why bucket continuous values like watch time?
Bucketing a continuous value like watch time simplifies its relationship with the target variable (churn). Grouping users into "low" (under 100 minutes), "medium" (100 to 300 minutes), and "high" (over 300 minutes) watch time makes it easy to see whether churn rates differ among these groups. It also makes the model more interpretable and less sensitive to small variations in watch time, at the cost of discarding differences within a bucket.
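
To see what "clear differences in churn rates" would look like, the sketch below computes the churn rate per bucket. The (bucket, churn_label) pairs are invented for illustration and are not drawn from the lab data.

```python
from collections import defaultdict

# Hypothetical (watch_time_bucket, churn_label) pairs.
observations = [
    ("low", 1), ("low", 1), ("low", 0),
    ("medium", 0), ("medium", 1),
    ("high", 0), ("high", 0),
]

totals = defaultdict(int)
churned = defaultdict(int)
for bucket, label in observations:
    totals[bucket] += 1
    churned[bucket] += label

# Share of churned users within each bucket.
churn_rate = {bucket: churned[bucket] / totals[bucket] for bucket in totals}
for bucket, rate in churn_rate.items():
    print(f"{bucket}: {rate:.2f}")
```

If the rates separate cleanly across buckets in the real data, the bucketed feature is likely to be useful; if they are nearly equal, the thresholds may need adjusting.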

### 2. What value do interaction terms (e.g., `plan_region_combo`) add?
Interaction terms capture how the effect of one variable depends on the value of another. For example, the churn rate for a given plan tier might differ from one region to another. An interaction term like `plan_region_combo` lets the model fit a separate effect for each plan-region pair, which it cannot do when plan tier and region enter only as separate, additive features.
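
A small numeric check makes the idea concrete. In a logistic regression with only separate plan and region features, the regional difference in log-odds is the same for every plan, so the "difference of differences" is zero. The sketch below computes that contrast for hypothetical churn rates; a value far from zero suggests an interaction term has something to capture.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


# Hypothetical churn rates by (plan_tier, region).
rates = {
    ("basic", "US"): 0.20,
    ("basic", "EU"): 0.20,
    ("premium", "US"): 0.10,
    ("premium", "EU"): 0.30,
}

# Regional effect within each plan, on the log-odds scale.
premium_gap = logit(rates[("premium", "EU")]) - logit(rates[("premium", "US")])
basic_gap = logit(rates[("basic", "EU")]) - logit(rates[("basic", "US")])

# An additive model forces this contrast to be zero.
interaction_contrast = premium_gap - basic_gap
print(f"interaction contrast: {interaction_contrast:.3f}")
```

Here the region matters for the premium plan but not for the basic plan, which an additive model cannot represent exactly.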

### 3. What’s the purpose of binary flags like `flag_binge`?
Binary flags highlight specific, potentially important behaviors that continuous or categorical variables may not capture on their own. As defined in this lab, `flag_binge` marks users with more than 500 total minutes of watch time; it measures volume, not how quickly the content was consumed, so it is only a rough proxy for binge behavior. Heavy viewing could indicate either high engagement or a risk of burnout and subsequent churn. Because the flag switches on at a threshold, it adds a non-linear (step) effect to an otherwise linear model and helps isolate user segments with distinct churn patterns.

### 4. After evaluating the enhanced model:
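*(Note: Answer this after the enhanced model evaluation cell runs successfully. In this run it returned a 404 for `your_dataset`, so no metrics are available yet.)*
`ML.EVALUATE` reports metrics for the model as a whole (for logistic regression: precision, recall, accuracy, F1 score, log loss, and ROC AUC), not per-feature importance. To judge which engineered features (watch time bucket, interaction term, binary flag) helped most, compare these metrics between the base and enhanced models and inspect the learned weights, for example with `ML.WEIGHTS`.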

You might be surprised by:
- The significance of an interaction term you didn't expect to be important.
- A binary flag having a larger impact on churn than a continuous variable.
- The model performing better or worse than anticipated with the new features.

Analyzing these results helps refine the feature engineering process and gain deeper insights into the drivers of churn.
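
For reference, several of the evaluation metrics mentioned above follow directly from a confusion matrix. The sketch below computes them from hypothetical counts; the numbers are placeholders for illustration, not results from this lab's models.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: churners are the positive class.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced churn data, accuracy can look high even when recall on churners is poor, so it helps to compare the base and enhanced models on precision and recall as well.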