<a href="https://colab.research.google.com/github/raleight1/mgmt467-analytics-portfolio/blob/main/Labs/Labs%204-6/Unit2_Lab1_PromptPlusExamples_Colab_Kaggle_GCS_BQ_DQ_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [None]:
# Authenticate Colab to Google Cloud
from google.colab import auth
auth.authenticate_user()

import os

# Prompt for Project ID and set Region
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "US"  # Editable: Set your desired region here

# Export PROJECT_ID as GOOGLE_CLOUD_PROJECT environment variable
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

# Set the active project for gcloud and BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT

# Print the set values for verification
print("Project:", PROJECT_ID, "| Region:", REGION)

# Done: Auth + Project/Region set

Enter your GCP Project ID: unit2-mgmt467labs
Updated property [core/project].
Project: unit2-mgmt467labs | Region: US


In [None]:
# # EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
# # from google.colab import auth
# # auth.authenticate_user()
# #
# # import os
# # PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# # REGION = "us-central1"  # keep consistent; change if instructed
# # os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
# # print("Project:", PROJECT_ID, "| Region:", REGION)
# #
# # # Set active project for gcloud/BigQuery CLI
# # !gcloud config set project $GOOGLE_CLOUD_PROJECT
# # !gcloud config get-value project
# # # Done: Auth + Project/Region set

### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


In [None]:
# Verify active project and region
!gcloud config get-value project
# REGION is a Python variable set in the setup cell, not an environment variable,
# so print it directly instead of looking it up in os.environ.
print("REGION:", REGION)

unit2-mgmt467labs
REGION: US


**Reflection: Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?**

Setting the `PROJECT_ID` and `REGION` at the top of the notebook ensures consistency and avoids potential errors when interacting with Google Cloud services.

*   **Consistency:** By defining these values once at the beginning, all subsequent commands and operations will use the same project and region. This is crucial for managing resources, costs, and data locality.
*   **Avoiding Errors:** Many `gcloud`, `bq`, and client library commands require a project ID and sometimes a region. If these are not explicitly set, the commands might fail, use an unintended default, or prompt for input repeatedly, disrupting the workflow. Not setting the region can also lead to data being stored in unintended locations, potentially impacting latency and costs.
*   **Reproducibility:** Explicitly setting these parameters makes the notebook more reproducible. Anyone running the notebook will clearly see which project and region are being used, and the code will behave consistently regardless of their local gcloud configuration.
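
The failure modes above can be caught early with a small guard that stops the notebook as soon as a setting is missing. The sketch below is illustrative only: `require_setting` is a hypothetical helper (not part of any Google Cloud library), and it assumes settings are shared through environment variables.

```python
import os


def require_setting(name: str, env=None) -> str:
    """Return a non-empty setting or fail fast with a clear message.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; run the setup cell first.")
    return value


# Usage with a stand-in mapping instead of the real environment:
demo_env = {"GOOGLE_CLOUD_PROJECT": "my-demo-project", "REGION": ""}
print(require_setting("GOOGLE_CLOUD_PROJECT", demo_env))
```

Calling `require_setting("REGION", demo_env)` would raise a `RuntimeError` here, which is the point: a missing value surfaces at the top of the notebook rather than as a confusing CLI error several cells later.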

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [None]:
# Prompt user to upload kaggle.json
# This file contains your Kaggle API credentials.
# Keeping it secure is crucial for protecting your account.
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
# Create the .kaggle directory if it doesn't exist
# This is where the Kaggle CLI expects to find the credentials.
os.makedirs('/root/.kaggle', exist_ok=True)

# Save the uploaded file to the correct location
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])

# Set file permissions to owner-only read/write (0600)
# This ensures only the owner can access the credentials, enhancing security.
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Verify Kaggle CLI installation and version
# This confirms the setup was successful and the CLI is ready to use.
!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


In [None]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
# # from google.colab import files
# # print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
# # uploaded = files.upload()
# #
# # import os
# # os.makedirs('/root/.kaggle', exist_ok=True)
# # with open('/root/.kaggle/kaggle.json', 'wb') as f:
# #     f.write(uploaded[list(uploaded.keys())[0]])
# # os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only
# #
# # !kaggle --version

### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [None]:
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection: Why require strict `0600` permissions on API tokens? What risks are we avoiding?**

Requiring strict `0600` permissions on API tokens (like `kaggle.json`) is a critical security measure. The `0600` permission setting means that only the owner of the file has read and write access, and no other user or group has any access.

The risks we are avoiding by enforcing these strict permissions include:

*   **Unauthorized Access:** Without `0600` permissions, other users on the same system (if applicable) or processes could potentially read your API token. This token grants programmatic access to your account and data on platforms like Kaggle.
*   **Data Breach:** If an attacker gains access to the system, weak file permissions make it easy for them to steal your API token and access your private datasets, kernels, or even submit on your behalf.
*   **Account Compromise:** With your API token, an attacker could potentially take actions within your Kaggle account, such as deleting data, creating spam, or misusing resources, leading to reputational damage or policy violations.
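*   **Supply Chain Attacks:** If a script or dependency running under a different user account is compromised, strict permissions keep it from reading the credentials file. Code running as the file's owner can still read it, so `0600` complements, rather than replaces, keeping tokens out of notebooks and repositories.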

In essence, `0600` permissions are a fundamental layer of defense to ensure that only the intended user can access the sensitive API token, significantly reducing the attack surface and protecting your account and data.
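
A quick way to see what `0600` means in practice is to inspect the mode bits directly. The following is a minimal sketch that assumes a POSIX file system (as in Colab); `is_owner_only` is a hypothetical helper, and the file it checks is a throwaway stand-in, not a real credential.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """True if group and others have no permissions on the file (POSIX modes)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a throwaway file rather than a real credential.
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "fake_token.json")
    with open(token_path, "w") as f:
        f.write("{}")
    os.chmod(token_path, 0o644)  # readable by group and others: too permissive
    before = is_owner_only(token_path)
    os.chmod(token_path, 0o600)  # owner read/write only
    after = is_owner_only(token_path)

print("owner-only before chmod:", before)
print("owner-only after chmod: ", after)
```

The same check could be pointed at `~/.kaggle/kaggle.json` after the upload cell to confirm the permissions were applied.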

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [None]:
# Create the directory for raw data
# This ensures a clean and predictable location for the downloaded files.
!mkdir -p /content/data/raw

# Download the dataset using Kaggle CLI
# The -d flag specifies the dataset, and -p specifies the download path.
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded dataset into the raw data directory
# The -o flag allows overwriting existing files, and -d specifies the destination.
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
# This provides a clear inventory of the downloaded and unzipped files.
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 497MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

In [None]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
# # !mkdir -p /content/data/raw
# # !kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
# # !unzip -o /content/data/*.zip -d /content/data/raw
# # # List CSV inventory
# # !ls -lh /content/data/raw/*.csv

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [None]:
import glob
import os

csv_files = glob.glob('/content/data/raw/*.csv')
num_csv_files = len(csv_files)

# Assert that there are exactly six CSV files
assert num_csv_files == 6, f"Expected 6 CSV files, but found {num_csv_files}"

print(f"Found exactly {num_csv_files} CSV files:")
for csv_file in csv_files:
    print(os.path.basename(csv_file))

Found exactly 6 CSV files:
movies.csv
recommendation_logs.csv
search_logs.csv
users.csv
watch_history.csv
reviews.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

Keeping a clean file inventory with names and sizes is useful downstream for several reasons:

*   **Auditing and Reproducibility:** A clear inventory allows you to easily verify that you have downloaded the correct files and that they haven't been corrupted or altered. This is crucial for auditing your data pipeline and ensuring reproducibility of your results.
*   **Data Validation:** Knowing the expected file names and sizes helps in automating data validation steps. You can write scripts to check if all expected files are present and if their sizes fall within a reasonable range, catching potential issues early in the process.
*   **Troubleshooting:** If there are errors or unexpected results downstream, having a file inventory can help in pinpointing the source of the problem. You can quickly check if the correct input files were used and if they were the expected size.
*   **Documentation:** The file inventory serves as documentation for your data source. It provides a clear record of the files used, which is helpful for collaboration and for anyone else who needs to understand or replicate your work.
*   **Resource Management:** Knowing the size of your files helps in estimating storage requirements and managing resources effectively, especially when dealing with large datasets.
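
The inventory idea can be made concrete by recording a checksum next to each name and size, so a later run can detect altered files. This is a sketch under stated assumptions: `build_inventory` is a hypothetical helper, and the demo files are made up rather than taken from the lab data.

```python
import hashlib
import os
import tempfile


def build_inventory(directory: str, suffix: str = ".csv") -> list[dict]:
    """Return name, size in bytes, and SHA-256 digest for each matching file."""
    inventory = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(suffix):
            continue
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        inventory.append(
            {"name": name, "bytes": os.path.getsize(path), "sha256": digest}
        )
    return inventory


# Demo on a temporary directory with made-up files (not the lab data).
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [("a.csv", b"id\n1\n"), ("b.csv", b"id\n1\n2\n"), ("notes.md", b"x")]
    for name, body in demo_files:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(body)
    inventory = build_inventory(tmp)

for row in inventory:
    print(f"{row['name']:<10} {row['bytes']:>6} {row['sha256'][:12]}")
```

In the lab, the same function could be called on `/content/data/raw` and its result saved alongside the data as an audit record.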

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [None]:
import uuid
import os

# Create a unique bucket name with a random suffix
bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# Save the bucket name to an environment variable
os.environ["BUCKET_NAME"] = bucket_name

# Create the GCS bucket in the specified region
# The --location flag sets the bucket's geographical location.
!gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION

# Upload all CSV files from the raw data directory to the bucket
# `gcloud storage cp` parallelizes transfers by default, so no extra flags are needed.
!gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/

# Print the bucket name for verification
print("Created and uploaded data to GCS bucket:", bucket_name)

# Explain the benefits of staging data in GCS
print("\nBenefits of staging data in GCS:")
print("- **Centralized storage:** Provides a single, accessible location for your data.")
print("- **Version control:** GCS supports object versioning, allowing you to track changes.")
print("- **Integration with GCP services:** Seamlessly integrates with services like BigQuery for data loading.")
print("- **Scalability and durability:** Offers highly scalable and durable storage.")

Creating gs://mgmt467-netflix-61682576/...
ERROR: (gcloud.storage.buckets.create) HTTPError 400: The specified location constraint is not valid.
ERROR: (gcloud.storage.cp) gs://mgmt467-netflix-61682576 not found: 404.
Created and uploaded data to GCS bucket: mgmt467-netflix-61682576

Benefits of staging data in GCS:
- **Centralized storage:** Provides a single, accessible location for your data.
- **Version control:** GCS supports object versioning, allowing you to track changes.
- **Integration with GCP services:** Seamlessly integrates with services like BigQuery for data loading.
- **Scalability and durability:** Offers highly scalable and durable storage.


In [None]:
# # EXAMPLE (from LLM) — GCS staging (commented)
# # import uuid, os
# # bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# # os.environ["BUCKET_NAME"] = bucket_name
# # !gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
# # !gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
# # print("Bucket:", bucket_name)
# # # Verify contents
# # !gcloud storage ls gs://$BUCKET_NAME/netflix/

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [None]:
import os

# Get the bucket name from the environment variable
bucket_name = os.environ.get("BUCKET_NAME")

if bucket_name:
  # List objects in the netflix/ prefix with sizes
  !gcloud storage ls -l gs://$BUCKET_NAME/netflix/
else:
  print("BUCKET_NAME environment variable is not set.")

ERROR: (gcloud.storage.ls) gs://mgmt467-netflix-61682576 not found: 404.


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

Two benefits of staging data in GCS versus loading directly from local Colab are:

1.  **Scalability and Durability:** GCS is a highly scalable and durable object storage service. Staging data in GCS ensures that your data is stored reliably and can be easily accessed by various Google Cloud services, regardless of the size of your dataset. Loading directly from local Colab is limited by Colab's temporary storage and is not suitable for large datasets or long-term storage.
2.  **Integration with GCP Services:** GCS seamlessly integrates with other Google Cloud services like BigQuery, Dataproc, and Vertex AI. Staging data in GCS makes it easier and more efficient to load data into BigQuery for analysis, process it with Dataproc, or use it for machine learning with Vertex AI. Loading from local Colab would require manual transfers or workarounds for each service.
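
One practical note on the staging cell: in the run above, the success message was printed even though both `gcloud` commands failed, because `!` shell lines do not raise Python exceptions. A hedged way to make a step stop on failure is to run commands through `subprocess.run(..., check=True)`. In this sketch `run_step` is a hypothetical helper, and the Python interpreter stands in for a CLI such as `gcloud` so the example runs anywhere.

```python
import subprocess
import sys


def run_step(cmd: list[str]) -> str:
    """Run a command, raise CalledProcessError on a non-zero exit, return stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Stand-in commands: the Python interpreter plays the role of a CLI such as gcloud.
ok_cmd = [sys.executable, "-c", "print('bucket created')"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(run_step(ok_cmd))
try:
    run_step(bad_cmd)
    print("uploaded")  # never reached when the step fails
except subprocess.CalledProcessError as err:
    print("step failed with exit code", err.returncode)
```

With this pattern, a failed bucket creation would halt the cell before any "created and uploaded" message is shown.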

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [None]:
# Cell A: Create BigQuery dataset (idempotent)
DATASET="netflix"
# Attempt to create the dataset in the US multi-region.
# The --location flag specifies the dataset's geographical location.
# The `|| echo ...` fallback keeps the cell from failing on re-runs when the dataset already exists.
!bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist or you may not have permissions to create it."

Dataset 'unit2-mgmt467labs:netflix' successfully created.


In [None]:
# Cell B: Load tables from GCS
tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}

import os
DATASET="netflix" # Ensure DATASET is defined

for tbl, fname in tables.items():
  src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
  print(f"Loading {tbl} from {src}")
  # Load data into BigQuery table
  # --skip_leading_rows=1 skips the header row.
  # --autodetect infers the schema and data types.
  # --source_format=CSV specifies the format of the source data.
  !bq load --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}

# Row counts for verification
print("\nVerifying row counts:")
for tbl in tables.keys():
  # Build the query in Python so the shell never sees backticks or a stray .format() call.
  # --nouse_legacy_sql ensures standard SQL is used.
  # The unqualified {DATASET}.{tbl} reference resolves against the active gcloud project.
  query = f"SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM {DATASET}.{tbl}"
  !bq query --nouse_legacy_sql "{query}"

Loading users from gs://mgmt467-netflix-8f1668b4/netflix/users.csv
Waiting on bqjob_r194585669814451b_00000199ca82dd5f_1 ... (1s) Current status: DONE   
Loading movies from gs://mgmt467-netflix-8f1668b4/netflix/movies.csv
Waiting on bqjob_r27233dde93474003_00000199ca8300d8_1 ... (1s) Current status: DONE   
Loading watch_history from gs://mgmt467-netflix-8f1668b4/netflix/watch_history.csv
Waiting on bqjob_r7d852cd3bd526ee5_00000199ca8325fa_1 ... (2s) Current status: DONE   
Loading recommendation_logs from gs://mgmt467-netflix-8f1668b4/netflix/recommendation_logs.csv
Waiting on bqjob_r389ecaddc7e676de_00000199ca835030_1 ... (1s) Current status: DONE   
Loading search_logs from gs://mgmt467-netflix-8f1668b4/netflix/search_logs.csv
Waiting on bqjob_r68ed9fb2607962b_00000199ca837186_1 ... (1s) Current status: DONE   
Loading reviews from gs://mgmt467-netflix-8f1668b4/netflix/reviews.csv
Waiting on bqjob_r4e28df7e984b2313_00000199ca839359_1 ... (1s) Current status: DONE   

Verifying row 

In [None]:
# # EXAMPLE (from LLM) — BigQuery dataset (commented)
# # DATASET="netflix"
# # # Attempt to create; ignore if exists
# # !bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [None]:
# # EXAMPLE (from LLM) — Load tables (commented)
# # tables = {
# #   "users": "users.csv",
# #   "movies": "movies.csv",
# #   "watch_history": "watch_history.csv",
# #   "recommendation_logs": "recommendation_logs.csv",
# #   "search_logs": "search_logs.csv",
# #   "reviews": "reviews.csv",
# # }
# # import os
# # for tbl, fname in tables.items():
# #   src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
# #   print("Loading", tbl, "from", src)
# #   !bq load --skip_leading_rows=1 --autodetect --source_format=CSV $DATASET.$tbl $src
# #
# # # Row counts
# # for tbl in tables.keys():
# #   query = f"SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM netflix.{tbl}"
# #   !bq query --nouse_legacy_sql "{query}"

### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.


In [None]:
# Single query to get row counts for all tables
import os

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
dataset_id = "netflix"

# Build the query as a single-line string; __TABLES__ exposes per-table metadata, including row_count
query = f"SELECT table_id AS table_name, row_count FROM `{project_id}.{dataset_id}.__TABLES__` WHERE table_id IN ('users', 'movies', 'watch_history', 'recommendation_logs', 'search_logs', 'reviews')"

# Execute the bq query command
!bq query --nouse_legacy_sql "{query}"

Waiting on bqjob_r39ae41a9e0005a95_00000199ca880c48_1 ... (0s) Current status: DONE   
+---------------------+-----------+
|     table_name      | row_count |
+---------------------+-----------+
| movies              |      1040 |
| recommendation_logs |     52000 |
| reviews             |     15450 |
| search_logs         |     26500 |
| users               |     10300 |
| watch_history       |    105000 |
+---------------------+-----------+


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

`autodetect` is acceptable for initial data exploration, or when the schema is expected to be simple and consistent (like the CSVs in this lab). It is quick and convenient.

However, you should enforce explicit schemas when:

*   **Data types are critical:** You need specific data types (e.g., DATE, TIMESTAMP, NUMERIC) that `autodetect` might misinterpret (e.g., loading a date as a string).
*   **Data quality needs strict enforcement:** Explicit schemas act as a contract, failing loads if the data doesn't match, preventing corrupted data from entering your warehouse.
*   **Performance optimization:** Specifying appropriate data types can sometimes lead to better query performance and storage efficiency.
*   **Complex or nested data:** For formats like JSON or Avro, explicit schemas are often necessary to correctly define nested structures and arrays.
*   **Idempotency and reproducibility:** Explicit schemas make your data loading process more deterministic and easier to reproduce.
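
To illustrate what an explicit schema looks like, the sketch below writes a JSON schema file of the kind `bq load` accepts in place of `--autodetect`. The field list is copied from the `bq show --schema` output for `watch_history` in section 5.2 of this notebook; `write_schema` and the output path are hypothetical, and the types shown are simply what autodetect inferred, not a reviewed contract.

```python
import json
import os
import tempfile

# Field list copied from the `bq show --schema` output for watch_history (section 5.2).
WATCH_HISTORY_SCHEMA = [
    {"name": "session_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "user_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "movie_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "watch_date", "type": "DATE", "mode": "NULLABLE"},
    {"name": "device_type", "type": "STRING", "mode": "NULLABLE"},
    {"name": "watch_duration_minutes", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "progress_percentage", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "action", "type": "STRING", "mode": "NULLABLE"},
    {"name": "quality", "type": "STRING", "mode": "NULLABLE"},
    {"name": "location_country", "type": "STRING", "mode": "NULLABLE"},
    {"name": "is_download", "type": "BOOLEAN", "mode": "NULLABLE"},
    {"name": "user_rating", "type": "INTEGER", "mode": "NULLABLE"},
]


def write_schema(fields: list[dict], path: str) -> str:
    """Write a BigQuery JSON schema file and return its path."""
    with open(path, "w") as f:
        json.dump(fields, f, indent=2)
    return path


schema_path = write_schema(
    WATCH_HISTORY_SCHEMA,
    os.path.join(tempfile.gettempdir(), "watch_history_schema.json"),
)
print("Schema written to", schema_path)

# In Colab, the file could then be passed as the last argument of `bq load`, e.g.:
# !bq load --skip_leading_rows=1 --source_format=CSV netflix.watch_history gs://$BUCKET_NAME/netflix/watch_history.csv {schema_path}
```

Keeping the schema file under version control makes type changes visible in review instead of silently appearing after a reload.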

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.
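
As a small illustration of the flag-first approach, the sketch below computes IQR fences and adds `is_missing_*` and `is_outlier_*` indicators without dropping or altering any rows. The data is made up, `iqr_bounds` is a hypothetical helper, and the conventional multiplier of 1.5 is an assumption that may need tuning for real engagement data.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up watch durations in minutes; None marks a missing value.
durations = [42.0, 55.0, None, 61.0, 48.0, 50.0, 900.0, 53.0, None, 47.0]

# Fences are computed on observed values only.
observed = [d for d in durations if d is not None]
low, high = iqr_bounds(observed)

# Flag rather than delete: every original row is kept, with two indicators added.
rows = [
    {
        "duration": d,
        "is_missing_duration": d is None,
        "is_outlier_duration": d is not None and not (low <= d <= high),
    }
    for d in durations
]
for row in rows:
    print(row)
```

Whether flagged rows are later capped, dropped, or left alone is a separate decision that should be justified against the business or ML use case.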


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [None]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']

client = bigquery.Client(project=project_id)

query = f"""
-- Users: % missing per column
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(country IS NULL) miss_country,
         COUNTIF(subscription_plan IS NULL) miss_plan,
         COUNTIF(age IS NULL) miss_age
  FROM `{project_id}.netflix.users`
)
SELECT n,
       ROUND(100*miss_country/n,2) AS pct_missing_country,
       ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
       ROUND(100*miss_age/n,2)    AS pct_missing_age
FROM base;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
for row in results:
    print(row)

Row((10300, 0.0, 0.0, 11.93), {'n': 0, 'pct_missing_country': 1, 'pct_missing_subscription_plan': 2, 'pct_missing_age': 3})


In [None]:
import os
GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Cell 1: Total rows and %% missing for specific columns in the users table
-- This query calculates the total number of rows and the percentage of missing
-- values for 'country', 'subscription_plan', and 'age' columns in the users table.
WITH base AS (
  SELECT
    COUNT(*) AS total_rows,
    COUNTIF(country IS NULL) AS missing_country,
    COUNTIF(subscription_plan IS NULL) AS missing_subscription_plan,
    COUNTIF(age IS NULL) AS missing_age
  FROM
    `%s.netflix.users`
)
SELECT
  total_rows,
  ROUND(100 * missing_country / total_rows, 2) AS pct_missing_country,
  ROUND(100 * missing_subscription_plan / total_rows, 2) AS pct_missing_subscription_plan,
  ROUND(100 * missing_age / total_rows, 2) AS pct_missing_age
FROM
  base;
"""
from google.colab.sql import bigquery as _bqsqlcell
df = _bqsqlcell.run(_sql % GOOGLE_CLOUD_PROJECT)
df

TableWidget(page_size=10, row_count=1, table_html='<table border="1" class="dataframe table table-striped tabl…

In [None]:
# # EXAMPLE (from LLM) — Missingness profile (commented)
# # -- Users: % missing per column
# # WITH base AS (
# #   SELECT COUNT(*) n,
# #          COUNTIF(region IS NULL) miss_region,
# #          COUNTIF(plan_tier IS NULL) miss_plan,
# #          COUNTIF(age_band IS NULL) miss_age
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # )
# # SELECT n,
# #        ROUND(100*miss_region/n,2) AS pct_missing_region,
# #        ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
# #        ROUND(100*miss_age/n,2)    AS pct_missing_age_band
# # FROM base;

In [None]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Verification query: Print the three missingness percentages
-- This query retrieves the calculated percentage of missing values
-- for country, subscription_plan, and age from the previous step.
WITH base AS (
  SELECT
    COUNT(*) AS total_rows,
    COUNTIF(country IS NULL) AS missing_country,
    COUNTIF(subscription_plan IS NULL) AS missing_subscription_plan,
    COUNTIF(age IS NULL) AS missing_age
  FROM
    `%s.netflix.users`
)
SELECT
  ROUND(100 * missing_country / total_rows, 2) AS pct_missing_country,
  ROUND(100 * missing_subscription_plan / total_rows, 2) AS pct_missing_subscription_plan,
  ROUND(100 * missing_age / total_rows, 2) AS pct_missing_age
FROM
  base;
"""
df_missing_percentages = _bqsqlcell.run(_sql % GOOGLE_CLOUD_PROJECT)
display(df_missing_percentages)

TableWidget(page_size=10, row_count=1, table_html='<table border="1" class="dataframe table table-striped tabl…

**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

Based on the `df_missing_percentages` DataFrame:

*   **Most Missing Column:** `age` is the most missing column with **11.93%** of its values missing. `country` and `subscription_plan` have 0.0% missing.

**Hypothesizing MCAR/MAR/MNAR for the `age` column:**

For the `age` column, I would hypothesize that the missingness is likely **MNAR (Missing Not At Random)**.

*   **Why MNAR?** The decision to provide one's age, especially in a dataset that might be used for behavioral analysis, can often be related to the age itself. For instance, very young users (e.g., under 18) or very old users (e.g., over 80) might be more reluctant to disclose their age due to privacy concerns, perceived relevance, or simply not fitting into typical demographic buckets. If individuals at the extremes of the age spectrum are systematically choosing not to provide their age, then the missingness is directly correlated with the unobserved value of `age` itself, even if we account for other factors. This makes it MNAR, as the missing data cannot be reliably inferred from other observed data alone without understanding the underlying reasons for non-response related to age.
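
Part of this hypothesis can be probed with the observed data: if the share of missing `age` values differs sharply across another column such as `country`, that points toward MAR with respect to that column. Similar rates across groups would not rule out MNAR, which cannot be confirmed from observed values alone. The sketch below only builds the query text; `missingness_by_group_sql` is a hypothetical helper, `my-project` is a placeholder, and identifiers are interpolated without sanitization, so it is suitable only for trusted column names.

```python
def missingness_by_group_sql(table: str, target: str, group_col: str) -> str:
    """Build a query reporting the % of NULLs in `target` for each value of `group_col`."""
    return f"""
SELECT {group_col},
       COUNT(*) AS n,
       ROUND(100 * COUNTIF({target} IS NULL) / COUNT(*), 2) AS pct_missing_{target}
FROM `{table}`
GROUP BY {group_col}
ORDER BY pct_missing_{target} DESC
""".strip()


# "my-project" is a placeholder; substitute the real project ID before running.
sql = missingness_by_group_sql("my-project.netflix.users", "age", "country")
print(sql)
```

The resulting string could be run with the same BigQuery client or SQL cell helper used earlier in this section.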

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).
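
The "keep one best per group" policy can be prototyped on a few rows before it is written in SQL. This is a sketch on made-up records using column names from the `watch_history` schema; `dedup_keep_best` is a hypothetical helper. Ties on every ranking column keep the first row seen, so a unique column would be needed as a final tiebreaker for a fully deterministic result.

```python
def dedup_keep_best(rows, key_cols, rank_cols):
    """Keep one row per key, preferring larger values in rank_cols (None ranks lowest)."""

    def rank(row):
        return tuple(float("-inf") if row[c] is None else row[c] for c in rank_cols)

    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in best or rank(row) > rank(best[key]):
            best[key] = row
    return list(best.values())


# Made-up records: s1 and s2 share the same (user, movie, date, device) key.
rows = [
    {"session_id": "s1", "user_id": "u1", "movie_id": "m1", "watch_date": "2025-01-01",
     "device_type": "TV", "progress_percentage": 40.0, "watch_duration_minutes": 30.0},
    {"session_id": "s2", "user_id": "u1", "movie_id": "m1", "watch_date": "2025-01-01",
     "device_type": "TV", "progress_percentage": 90.0, "watch_duration_minutes": 75.0},
    {"session_id": "s3", "user_id": "u2", "movie_id": "m1", "watch_date": "2025-01-01",
     "device_type": "Mobile", "progress_percentage": None, "watch_duration_minutes": 12.0},
]

deduped = dedup_keep_best(
    rows,
    key_cols=["user_id", "movie_id", "watch_date", "device_type"],
    rank_cols=["progress_percentage", "watch_duration_minutes"],
)
for row in deduped:
    print(row["session_id"], row["progress_percentage"], row["watch_duration_minutes"])
```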

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [None]:
import os

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
dataset_id = 'netflix'
table_id = 'watch_history'

print(f"Fetching schema for {project_id}.{dataset_id}.{table_id}")
!bq show --schema --format=json {project_id}:{dataset_id}.{table_id}

Fetching schema for unit2-mgmt467labs.netflix.watch_history
[{"name":"session_id","type":"STRING","mode":"NULLABLE"},{"name":"user_id","type":"STRING","mode":"NULLABLE"},{"name":"movie_id","type":"STRING","mode":"NULLABLE"},{"name":"watch_date","type":"DATE","mode":"NULLABLE"},{"name":"device_type","type":"STRING","mode":"NULLABLE"},{"name":"watch_duration_minutes","type":"FLOAT","mode":"NULLABLE"},{"name":"progress_percentage","type":"FLOAT","mode":"NULLABLE"},{"name":"action","type":"STRING","mode":"NULLABLE"},{"name":"quality","type":"STRING","mode":"NULLABLE"},{"name":"location_country","type":"STRING","mode":"NULLABLE"},{"name":"is_download","type":"BOOLEAN","mode":"NULLABLE"},{"name":"user_rating","type":"INTEGER","mode":"NULLABLE"}]


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Report duplicate groups on (user_id, movie_id, watch_date, device_type) with counts (top 20)
SELECT user_id, movie_id, watch_date, device_type, COUNT(*) AS dup_count
FROM `%s.netflix.watch_history`
GROUP BY user_id, movie_id, watch_date, device_type
HAVING dup_count > 1
ORDER BY dup_count DESC
LIMIT 20;
"""
df_duplicates = _bqsqlcell.run(_sql % GOOGLE_CLOUD_PROJECT)
display(df_duplicates)


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Create table watch_history_dedup that keeps one row per group.
-- This query uses a window function to assign a rank to duplicate rows.
-- The ranking prioritizes higher progress_percentage and then watch_duration_minutes
-- (this table's equivalents of progress_ratio and minutes_watched in the prompt)
-- to select one row from each duplicate group.
CREATE OR REPLACE TABLE `%s.netflix.watch_history_dedup` AS
SELECT * EXCEPT(rk) FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, movie_id, watch_date, device_type
           ORDER BY progress_percentage DESC, watch_duration_minutes DESC, session_id  -- session_id breaks exact ties
         ) AS rk
  FROM `%s.netflix.watch_history` h
)
WHERE rk = 1;
"""
# We need to pass GOOGLE_CLOUD_PROJECT twice because there are two %s placeholders
_bqsqlcell.run(_sql % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
print("Table `watch_history_dedup` created successfully.")

Table `watch_history_dedup` created successfully.


In [None]:
# # EXAMPLE (from LLM) — Detect duplicate groups (commented)
# # SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history`
# # GROUP BY user_id, movie_id, event_ts, device_type
# # HAVING dup_count > 1
# # ORDER BY dup_count DESC
# # LIMIT 20;

In [None]:
# # EXAMPLE (from LLM) — Keep-one policy (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` AS
# # SELECT * EXCEPT(rk) FROM (
# #   SELECT h.*,
# #          ROW_NUMBER() OVER (
# #            PARTITION BY user_id, movie_id, event_ts, device_type
# #            ORDER BY progress_ratio DESC, minutes_watched DESC
# #          ) AS rk
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history` h
# # )
# # WHERE rk = 1;

### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Compare row counts before and after deduplication
SELECT
  'raw_watch_history' AS table_name,
  COUNT(*) AS row_count
FROM `%s.netflix.watch_history`
UNION ALL
SELECT
  'dedup_watch_history' AS table_name,
  COUNT(*) AS row_count
FROM `%s.netflix.watch_history_dedup`;
"""
df_counts = _bqsqlcell.run(_sql % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
display(df_counts)


**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?


**Duplicates arise for various reasons, broadly categorized as natural or system-generated:**

*   **Natural Duplicates:** These occur when genuinely distinct real-world events share the same core identifiers. For instance, a customer might click on an ad multiple times, or a product might be returned and then repurchased. Each record is a valid event, but if the grouping key (e.g., user_id, product_id, event_type) omits the field that distinguishes them, such as a precise timestamp, they look like duplicates in analysis.

*   **System-Generated Duplicates:** These are typically errors introduced by data collection, processing, or storage systems. Common causes include:
    *   **Data Ingestion Errors:** A bug in an ETL (Extract, Transform, Load) pipeline might cause the same data batch to be processed and inserted multiple times.
    *   **Faulty Merges/Joins:** When combining data from different sources, an incorrect join key or logic can lead to records being duplicated.
    *   **Retry Mechanisms:** If a system fails to receive a confirmation for a data submission, it might retry the submission, leading to multiple identical entries.
    *   **Lack of Unique Constraints:** Databases without proper primary keys or unique constraints on relevant columns can easily accumulate duplicate records.
    *   **User Error:** Manual data entry might lead to accidental duplication.

**How Duplicates Corrupt Labels and KPIs:**

Duplicates can severely distort analytical results, KPIs (Key Performance Indicators), and machine learning models:

1.  **Inflated Counts and Totals:** Duplicates artificially inflate metrics like total users, total sales, total watch hours, or total engagement events. If a watch event is duplicated 5 times, your reported 'total watch hours' for that movie will be 5x higher than reality, leading to incorrect revenue projections or content performance assessments.

2.  **Skewed Averages and Proportions:** Averages (e.g., average time watched, average clicks per user) will be inaccurate. Proportions (e.g., conversion rates, market share) will also be misleading if the numerator or denominator (or both) are affected by duplicates. For example, if active users are counted with duplicates, your 'active user percentage' will be erroneously high.

3.  **Biased Machine Learning Models:**
    *   **Overfitting:** Models trained on duplicated data might overfit to these specific instances, making them seem more important than they are. This leads to models that perform well on the training data but generalize poorly to new, unseen (and non-duplicated) data.
    *   **Incorrect Feature Importance:** Features associated with duplicated records will appear to have stronger correlations or predictive power than they actually do.
    *   **Imbalanced Datasets:** If certain types of events or entities are duplicated more often than others, it can create an artificial class imbalance, leading the model to prioritize the over-represented class.

4.  **Flawed Business Decisions:** Decision-making based on corrupted KPIs (e.g., investing more in a feature or content type that appears popular due to inflated engagement) can lead to wasted resources, missed opportunities, and poor strategic direction.
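
The inflation described in points 1 and 2 can be checked with a small arithmetic sketch. The two events and the five-fold duplication are hypothetical numbers chosen for illustration.

```python
# Two hypothetical watch events: (user_id, movie_id, minutes_watched).
events = [("u1", "m1", 60.0), ("u2", "m1", 30.0)]

# Simulate a retry bug that records the first event 5 times in total.
with_duplicates = events + [events[0]] * 4

def total_minutes(rows):
    return sum(minutes for _, _, minutes in rows)

def avg_minutes(rows):
    return total_minutes(rows) / len(rows)

total_true, total_dup = total_minutes(events), total_minutes(with_duplicates)
avg_true, avg_dup = avg_minutes(events), avg_minutes(with_duplicates)

print(f"total minutes: true={total_true}, with duplicates={total_dup}")
print(f"avg minutes:   true={avg_true}, with duplicates={avg_dup}")
```

Both the total and the per-event average move, even though nothing about real viewing behavior changed.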

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.
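
A minimal sketch of both steps on a hypothetical list of durations, using only the Python standard library. The data are invented, and `statistics.quantiles` computes exact quantiles rather than the approximate ones BigQuery returns, so treat it as an illustration of the logic only.

```python
from statistics import quantiles

# Hypothetical watch durations in minutes; 600 is the deliberately extreme value.
minutes = [5, 20, 30, 40, 45, 50, 55, 60, 70, 600]

# IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] counts as an outlier.
q1, _, q3 = quantiles(minutes, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [m for m in minutes if m < lo or m > hi]
pct_outliers = 100 * len(outliers) / len(minutes)

# Winsorize: clamp to the 1st and 99th percentiles instead of dropping rows.
# quantiles(n=100) returns 99 cut points, so index 0 is P01 and index 98 is P99.
cuts = quantiles(minutes, n=100, method="inclusive")
p01, p99 = cuts[0], cuts[98]
capped = [max(p01, min(p99, m)) for m in minutes]
is_extreme = [m < p01 or m > p99 for m in minutes]  # keep a flag alongside the cap

print(f"IQR bounds: [{lo}, {hi}], outliers: {outliers} ({pct_outliers}%)")
print(f"P01={p01}, P99={p99}, max before={max(minutes)}, max after={max(capped)}")
```

Note the indexing difference: BigQuery's `APPROX_QUANTILES(x, 100)` returns 101 elements (minimum at offset 0, maximum at offset 100), so P99 there is `OFFSET(99)`, whereas Python's 99 cut points put P99 at index 98.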

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Compute IQR bounds for watch_duration_minutes and report %% outliers.
WITH dist AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3
  FROM `%s.netflix.watch_history_dedup`
),
bounds AS (
  SELECT q1, q3, (q3-q1) AS iqr,
         q1 - 1.5*(q3-q1) AS lo,
         q3 + 1.5*(q3-q1) AS hi
  FROM dist
)
SELECT
  COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi) AS outliers,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi)/COUNT(*),2) AS pct_outliers
FROM `%s.netflix.watch_history_dedup` h
CROSS JOIN bounds b;
"""
df_iqr_outliers = _bqsqlcell.run(_sql % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
display(df_iqr_outliers)


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql_create_table = """
-- Create watch_history_robust with minutes_watched_capped capped at P01/P99.
CREATE OR REPLACE TABLE `%s.netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01,
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(99)] AS p99  -- 101-element array: OFFSET(99) is P99
  FROM `%s.netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS minutes_watched_capped
FROM `%s.netflix.watch_history_dedup` h, q;
"""
_bqsqlcell.run(_sql_create_table % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
print("Table `watch_history_robust` created successfully.")

_sql_quantile_summary = """
-- Return quantile summaries before/after capping.
WITH before AS (
  SELECT 'before' AS which, APPROX_QUANTILES(watch_duration_minutes, 5) AS q
  FROM `%s.netflix.watch_history_dedup`
),
after AS (
  SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
  FROM `%s.netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;
"""
df_quantile_summary = _bqsqlcell.run(_sql_quantile_summary % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
display(df_quantile_summary)

Table `watch_history_robust` created successfully.



In [None]:
# # EXAMPLE (from LLM) — IQR outlier rate (commented)
# # WITH dist AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # bounds AS (
# #   SELECT q1, q3, (q3-q1) AS iqr,
# #          q1 - 1.5*(q3-q1) AS lo,
# #          q3 + 1.5*(q3-q1) AS hi
# #   FROM dist
# # )
# # SELECT
# #   COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h
# # CROSS JOIN bounds b;

In [None]:
# # EXAMPLE (from LLM) — Winsorize + quantiles (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust` AS
# # WITH q AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # )
# # SELECT
# #   h.*,
# #   GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h, q;
# #
# # -- Quantiles before vs after
# # WITH before AS (
# #   SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # after AS (
# #   SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`
# # )
# # SELECT * FROM before UNION ALL SELECT * FROM after;

### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Verification: Min, Median, Max before vs. after capping
WITH before_capping AS (
  SELECT
    'Before Capping' AS metric_type,
    MIN(watch_duration_minutes) AS min_duration,
    APPROX_QUANTILES(watch_duration_minutes, 2)[OFFSET(1)] AS median_duration,
    MAX(watch_duration_minutes) AS max_duration
  FROM `%s.netflix.watch_history_dedup`
),
after_capping AS (
  SELECT
    'After Capping' AS metric_type,
    MIN(minutes_watched_capped) AS min_duration,
    APPROX_QUANTILES(minutes_watched_capped, 2)[OFFSET(1)] AS median_duration,
    MAX(minutes_watched_capped) AS max_duration
  FROM `%s.netflix.watch_history_robust`
)
SELECT * FROM before_capping
UNION ALL
SELECT * FROM after_capping;
"""
df_min_median_max = _bqsqlcell.run(_sql % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
display(df_min_median_max)


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

**When capping might be harmful:**

Capping, while useful for outlier treatment, can be harmful when:

*   **Outliers represent valuable information:** In some domains, extreme values are not errors but critical data points (e.g., fraud detection, rare disease outbreaks, peak loads in systems). Capping these can lead to loss of vital information and missed detection of important events.
*   **Distributional assumptions are violated:** Capping can distort the underlying distribution of the data, which might negatively impact models that rely on specific distributional properties (e.g., normality).
*   **Interpretability is paramount:** Capped values no longer reflect the true observed range, making the interpretation of coefficients or feature importance in some models less straightforward.
*   **Future unseen data might naturally exceed caps:** If new data legitimately contains values beyond the current P01/P99 thresholds, applying the old caps would silently truncate valid observations.

**Model type less sensitive to outliers and why:**

**Tree-based models, such as Decision Trees, Random Forests, and Gradient Boosting Machines (e.g., XGBoost, LightGBM), are generally less sensitive to outliers.**

*   **Why?** These models split each feature at a threshold, so what matters is the rank order of the values, not their magnitude. An extreme feature value lands on one side of a split just as a moderately large value would, so it cannot pull the whole model's parameters in one direction (as it would in, say, a linear regression model). The prediction for that row is the average or majority class of its leaf node, which limits the outlier's influence without modifying the original data. Trees also don't rely on distance metrics or on the feature's mean or variance, which are heavily affected by outliers. This robustness applies to outliers in the input features; extreme values in a regression target can still shift leaf averages.
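
A minimal sketch of that argument: a hand-written one-split "stump" (not a library model) picks the same threshold whether or not the largest feature value is inflated, while the feature's mean, which least-squares fits depend on, moves a lot. The data are hypothetical.

```python
from statistics import mean

def best_stump_threshold(xs, ys):
    """Return the midpoint threshold with the fewest errors for the rule: predict 1 if x > t."""
    pairs = sorted(zip(xs, ys))
    best_t, best_err = None, None
    # Candidate thresholds are midpoints between neighbouring sorted values.
    for (x_lo, _), (x_hi, _) in zip(pairs, pairs[1:]):
        t = (x_lo + x_hi) / 2
        err = sum((x > t) != bool(y) for x, y in pairs)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

# Hypothetical feature values and binary labels.
xs = [10, 20, 30, 40, 50, 60, 70, 80]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# Same data, but the largest value is replaced by an extreme outlier.
xs_outlier = xs[:-1] + [8000]

t_clean = best_stump_threshold(xs, ys)
t_outlier = best_stump_threshold(xs_outlier, ys)

print(f"split threshold: clean={t_clean}, with outlier={t_outlier}")
print(f"feature mean:    clean={mean(xs)}, with outlier={mean(xs_outlier)}")
```

The split only depends on where the labels change in sorted order, which the outlier does not alter.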

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).
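
A minimal sketch of one such flag, `flag_age_extreme`, for the case where age has to be parsed from a band string. The `age_band` values below are hypothetical, and taking the first integer in the string is an assumption about the band format.

```python
import re

def parse_age(age_band):
    """Return the first integer found in the band string, or None if there is none."""
    match = re.search(r"\d+", age_band or "")
    return int(match.group()) if match else None

def flag_age_extreme(age_band):
    """True when the parsed age is below 10 or above 100; unparseable bands are not flagged."""
    age = parse_age(age_band)
    return age is not None and (age < 10 or age > 100)

# Hypothetical band values, including one that cannot be parsed.
age_bands = ["18-24", "65+", "5-9", "unknown", "101+"]

flags = [flag_age_extreme(b) for b in age_bands]
# Denominator is all rows, matching COUNT(*) in the SQL versions.
pct_extreme = round(100 * sum(flags) / len(age_bands), 2)

print(list(zip(age_bands, flags)))
print(f"flag_age_extreme: {pct_extreme}% of rows")
```

Keeping the flag as a plain boolean per row makes it easy to both summarize (count and percentage) and reuse later as a model feature.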

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Compute and summarize flag_binge for sessions > 8 hours (480 minutes) in watch_history_robust.
-- Caveat: this tests the capped column, so the flag can only fire if the P99 cap exceeds 480;
-- compare against watch_duration_minutes to flag on raw values.
SELECT
  COUNTIF(minutes_watched_capped > 8*60) AS binge_sessions_count,
  COUNT(*) AS total_sessions,
  ROUND(100*COUNTIF(minutes_watched_capped > 8*60)/COUNT(*),2) AS pct_binge_sessions
FROM `%s.netflix.watch_history_robust`;
"""
df_binge = _bqsqlcell.run(_sql % GOOGLE_CLOUD_PROJECT)
display(df_binge)


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Compute and summarize flag_age_extreme if age is <10 or >100 in the users table.
-- Extreme ages often indicate data entry errors or require special handling.
SELECT
  COUNTIF(age < 10 OR age > 100) AS extreme_age_users_count,
  COUNT(*) AS total_users,
  ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct_extreme_age_users
FROM `%s.netflix.users`;
"""
df_age_extreme = _bqsqlcell.run(_sql % GOOGLE_CLOUD_PROJECT)
display(df_age_extreme)


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Compute and summarize flag_duration_anomaly where duration_minutes < 15 or > 480 in the movies table.
-- Movies with extremely short or long durations might be anomalies or special content.
SELECT
  COUNTIF(duration_minutes < 15 OR duration_minutes > 480) AS duration_anomaly_movies_count,
  COUNT(*) AS total_movies,
  ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 480)/COUNT(*),2) AS pct_duration_anomaly_movies
FROM `%s.netflix.movies`;
"""
df_duration_anomaly = _bqsqlcell.run(_sql % GOOGLE_CLOUD_PROJECT)
display(df_duration_anomaly)


In [None]:
# # EXAMPLE (from LLM) — flag_binge (commented)
# # SELECT
# #   COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(minutes_watched > 8*60)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`;

In [None]:
# # EXAMPLE (from LLM) — flag_age_extreme (commented)
# # SELECT
# #   COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #           CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #                     CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`;

In [None]:
# # EXAMPLE (from LLM) — flag_duration_anomaly (commented)
# # SELECT
# #   COUNTIF(duration_min < 15) AS titles_under_15m,
# #   COUNTIF(duration_min > 8*60) AS titles_over_8h,
# #   COUNT(*) AS total
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.movies`;

### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


In [None]:
import os
from google.colab.sql import bigquery as _bqsqlcell

GOOGLE_CLOUD_PROJECT = os.environ['GOOGLE_CLOUD_PROJECT']

_sql = """
-- Single compact summary query for all anomaly flags: flag_name, pct_of_rows
SELECT 'flag_binge' AS flag_name,
       ROUND(100*COUNTIF(minutes_watched_capped > 8*60)/COUNT(*),2) AS pct_of_rows
FROM `%s.netflix.watch_history_robust`
UNION ALL
SELECT 'flag_age_extreme' AS flag_name,
       ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct_of_rows
FROM `%s.netflix.users`
UNION ALL
SELECT 'flag_duration_anomaly' AS flag_name,
       ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 480)/COUNT(*),2) AS pct_of_rows
FROM `%s.netflix.movies`;
"""

df_anomaly_summary = _bqsqlcell.run(_sql % (GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_PROJECT))
display(df_anomaly_summary)


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

Based on the execution results:

*   `flag_binge`: 0.0%
*   `flag_age_extreme`: 1.74%
*   `flag_duration_anomaly`: 2.21%

*   **Most Common Flag:** The `flag_duration_anomaly` (2.21%) is the most common anomaly flag in this dataset.

*   **Which to keep as a feature and why:** I would likely keep `flag_duration_anomaly` as a feature. While `flag_age_extreme` is also present, `flag_duration_anomaly` highlights unusual movie lengths (very short or very long). This could be a valuable feature for recommendation engines or content analysis. For example, users might have different preferences for short-form vs. long-form content, and flagging these could help models understand and predict viewing habits more accurately. It directly speaks to a characteristic of the *content* itself, which often drives user interaction. `flag_age_extreme`, on the other hand, might represent data quality issues more than inherent user behavior, and while important for cleaning, might not be as directly useful as a predictive feature.
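
The 0.0% for `flag_binge` is worth a second look, because the queries in this section test `minutes_watched_capped`. Whether that explains the result here has not been verified, but the sketch below shows the mechanism with hypothetical durations and a hypothetical cap of 300 minutes: once values are capped below the threshold, the flag can never fire.

```python
BINGE_THRESHOLD_MIN = 8 * 60  # 480 minutes, as in the flag definition

# Hypothetical raw session durations and a hypothetical P99 cap below the threshold.
raw_minutes = [30.0, 95.0, 120.0, 510.0, 700.0]
p99_cap = 300.0

capped_minutes = [min(m, p99_cap) for m in raw_minutes]

def pct_flagged(values, threshold):
    """Percentage of values strictly above the threshold."""
    return round(100 * sum(v > threshold for v in values) / len(values), 2)

pct_raw = pct_flagged(raw_minutes, BINGE_THRESHOLD_MIN)
pct_capped = pct_flagged(capped_minutes, BINGE_THRESHOLD_MIN)

print(f"flag_binge on raw values:    {pct_raw}%")
print(f"flag_binge on capped values: {pct_capped}%")
```

If the goal is to flag extremes while also capping them, computing the flag from the raw duration column and keeping the capped column for modeling preserves both pieces of information.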

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.
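
One possible way to do the `.sql` export step from the notebook itself, sketched under assumptions: the helper name `export_queries`, the file name `dq_queries.sql`, and the project `my-project` are all hypothetical, and the example writes to a temporary directory rather than a real repo.

```python
import tempfile
from pathlib import Path

def export_queries(queries, path):
    """Write named SQL strings to one .sql file, each preceded by a comment header."""
    parts = [f"-- {name}\n{sql.strip()}\n" for name, sql in queries.items()]
    Path(path).write_text("\n".join(parts), encoding="utf-8")
    return Path(path)

# Hypothetical query collection; replace with the SQL strings from your own cells.
dq_queries = {
    "dedup_row_counts": "SELECT COUNT(*) AS row_count FROM `my-project.netflix.watch_history_dedup`;",
}

# Written to a temporary directory here; point this at your repo folder instead.
out_file = export_queries(dq_queries, Path(tempfile.mkdtemp()) / "dq_queries.sql")
print(out_file.read_text(encoding="utf-8"))
```

If your SQL strings still contain Python `%s` placeholders, fill them in before exporting so the saved file runs as-is.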


## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
