<a href="https://colab.research.google.com/github/surmehta1/mgmt467-analytics-portfolio/blob/main/Unit_2_Lab_1_Prompt_Plus_Examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once, at the top, so every later command targets the same project and location; mixing regions can add cost and latency.

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [None]:
# # EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
# # from google.colab import auth
# # auth.authenticate_user()
# #
# # import os
# # PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# # REGION = "us-central1"  # keep consistent; change if instructed
# # os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
# # print("Project:", PROJECT_ID, "| Region:", REGION)
# #
# # # Set active project for gcloud/BigQuery CLI
# # !gcloud config set project $GOOGLE_CLOUD_PROJECT
# # !gcloud config get-value project
# # # Done: Auth + Project/Region set

In [None]:
# Authenticate to Google Cloud in Colab
from google.colab import auth
auth.authenticate_user()

import os
# Prompt for PROJECT_ID and set REGION
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Set active project for gcloud/BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project
# Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt467-471119
Project: mgmt467-471119 | Region: us-central1
Updated property [core/project].
mgmt467-471119


### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.
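
To make the `0600` requirement concrete, here is a minimal sketch of a permission check. It assumes a POSIX filesystem (as in Colab); the helper name `is_owner_only` and the throwaway token content are hypothetical and only serve the illustration.

```python
import os
import stat
import tempfile

def is_owner_only(path: str) -> bool:
    """Return True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Demo on a throwaway file standing in for kaggle.json (fake content, not a real token)
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "kaggle.json")
    with open(token_path, "w") as f:
        f.write('{"username": "demo", "key": "not-a-real-key"}')

    os.chmod(token_path, 0o644)          # readable by group and others
    loose = is_owner_only(token_path)

    os.chmod(token_path, 0o600)          # owner read/write only
    strict = is_owner_only(token_path)

print("0644 is owner-only?", loose)
print("0600 is owner-only?", strict)
```

In Colab the same helper could be pointed at `~/.kaggle/kaggle.json` (via `os.path.expanduser`) right after the upload step.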

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [None]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
# # from google.colab import files
# # print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
# # uploaded = files.upload()
# #
# # import os
# # os.makedirs('/root/.kaggle', exist_ok=True)
# # with open('/root/.kaggle/kaggle.json', 'wb') as f:
# #     f.write(uploaded[list(uploaded.keys())[0]])
# # os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only
# #
# # !kaggle --version

In [None]:
# Set up Kaggle API
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only

!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle (1).json
Kaggle API 1.7.4.5


In [None]:
# Verification: print active project and region
!gcloud config get-value project
# REGION is a Python variable (it was never exported), so read it directly rather than via os.environ
print("REGION:", REGION)


### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [None]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
# # !mkdir -p /content/data/raw
# # !kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
# # !unzip -o /content/data/*.zip -d /content/data/raw
# # # List CSV inventory
# # !ls -lh /content/data/raw/*.csv

In [None]:
# Download and unzip dataset
!mkdir -p /content/data/raw
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
!unzip -o /content/data/*.zip -d /content/data/raw
# List CSV inventory
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
netflix-2025user-behavior-dataset-210k-records.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root root 1.6M Aug  2 1

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.
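
One way such a check might look is sketched below. It runs against a temporary directory filled with placeholder files so that it is self-contained; in Colab you would pass `/content/data/raw` instead. The helper names `csv_inventory` and `check_inventory` are hypothetical, and the six file names are taken from the unzip output above.

```python
import tempfile
from pathlib import Path

EXPECTED_TABLES = [
    "movies", "recommendation_logs", "reviews",
    "search_logs", "users", "watch_history",
]

def csv_inventory(raw_dir):
    """Return a sorted list of (file name, size in bytes) for every CSV in raw_dir."""
    return sorted((p.name, p.stat().st_size) for p in Path(raw_dir).glob("*.csv"))

def check_inventory(raw_dir, expected=6):
    """Fail loudly if the number of CSV files is not what the pipeline expects."""
    inventory = csv_inventory(raw_dir)
    assert len(inventory) == expected, f"expected {expected} CSVs, found {len(inventory)}"
    return inventory

# Demo: placeholder files stand in for the real dataset
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_TABLES:
        (Path(tmp) / f"{name}.csv").write_text("id\n")
    (Path(tmp) / "README.md").write_text("not a CSV\n")  # must be ignored
    inventory = check_inventory(tmp)

for name, size in inventory:
    print(f"{name:<28}{size:>8} bytes")
```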


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.
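
Because bucket names are global, a random suffix plus a quick format check helps avoid avoidable failures. The sketch below is an illustration only: the regular expression covers a simplified subset of the GCS naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, dots; starts and ends with a letter or digit), and `make_bucket_name` and the project ID `demo-project-123` are hypothetical.

```python
import re
import uuid

# Simplified subset of the GCS bucket naming rules (an assumption, not the full spec)
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")

def make_bucket_name(project_id, suffix=None):
    """Build '<project>-netflix-<suffix>' and validate it before calling gcloud."""
    suffix = suffix or uuid.uuid4().hex[:8]   # random 8-hex-char suffix for uniqueness
    name = f"{project_id}-netflix-{suffix}".lower()
    if not _BUCKET_RE.match(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name

print(make_bucket_name("demo-project-123", "abc12345"))
print(make_bucket_name("demo-project-123"))  # suffix differs on every call
```

Note that a fresh random suffix creates a new bucket on every run; for an idempotent pipeline you may prefer to generate the name once and reuse it.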

In [None]:
# Create BigQuery dataset (idempotent)
DATASET="netflix"
# Attempt to create; ignore if exists
!bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

BigQuery error in mk operation: Dataset 'mgmt467-471119:netflix' already exists.
Dataset may already exist.


In [None]:
# Load tables from GCS into BigQuery
tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}

import os
# Ensure BUCKET_NAME is set before proceeding
if "BUCKET_NAME" not in os.environ:
  print("Error: BUCKET_NAME environment variable is not set.")
  print("Please run the cell to create and upload data to the GCS bucket first.")
else:
  for tbl, fname in tables.items():
    src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
    print("Loading", tbl, "from", src)
    # --replace overwrites the table, so re-running the cell does not append duplicate rows
    !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}

  # Row counts
  for tbl in tables.keys():
    # No backticks in the double-quoted string: bash would run them as command substitution
    !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM {DATASET}.{tbl}"

Loading users from gs://mgmt467-471119-netflix-60e311fb/netflix/users.csv
Waiting on bqjob_rece09e24026369c_0000019a178b0c19_1 ... (1s) Current status: DONE   
Loading movies from gs://mgmt467-471119-netflix-60e311fb/netflix/movies.csv
Waiting on bqjob_raebe44d7a21e8b3_0000019a178b236d_1 ... (1s) Current status: DONE   
Loading watch_history from gs://mgmt467-471119-netflix-60e311fb/netflix/watch_history.csv
Waiting on bqjob_r5bf1abe80b1a0f5c_0000019a178b3a2a_1 ... (3s) Current status: DONE   
Loading recommendation_logs from gs://mgmt467-471119-netflix-60e311fb/netflix/recommendation_logs.csv
Waiting on bqjob_r10a121573a1d6727_0000019a178b5a14_1 ... (1s) Current status: DONE   
Loading search_logs from gs://mgmt467-471119-netflix-60e311fb/netflix/search_logs.csv
Waiting on bqjob_r50ce06c29bfc3e54_0000019a178b731b_1 ... (1s) Current status: DONE   
Loading reviews from gs://mgmt467-471119-netflix-60e311fb/netflix/reviews.csv
Waiting on bqjob_r52fe8837d31cb4d1_0000019a178b898a_1 ... (1s

In [None]:
# Create a unique bucket in the specified region and upload data
import uuid
import os

bucket_name = f"{PROJECT_ID}-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

!gcloud storage buckets create gs://{bucket_name} --location={REGION}
!gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/

print(f"Created bucket: {bucket_name}")
print("Dataset files uploaded to GCS. Staging data in GCS provides a centralized, versionable source for loading into BigQuery and other services.")

# Verify contents (optional)
# !gcloud storage ls gs://$BUCKET_NAME/netflix/

Creating gs://mgmt467-471119-netflix-a67fb9b7/...
Copying file:///content/data/raw/movies.csv to gs://mgmt467-471119-netflix-a67fb9b7/netflix/movies.csv
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-471119-netflix-a67fb9b7/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-471119-netflix-a67fb9b7/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-471119-netflix-a67fb9b7/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-471119-netflix-a67fb9b7/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-471119-netflix-a67fb9b7/netflix/watch_history.csv

Average throughput: 60.3MiB/s
Created bucket: mgmt467-471119-netflix-a67fb9b7
Dataset files uploaded to GCS. Staging data in GCS provides a centralized, versionable source for loading into BigQuery and other services.


In [None]:
# Verification: List contents of the GCS bucket
import os
bucket_name = os.environ.get("BUCKET_NAME")
if bucket_name:
  !gcloud storage ls gs://$BUCKET_NAME/netflix/ --recursive --readable-sizes
else:
  print("BUCKET_NAME environment variable not set.")

gs://mgmt467-471119-netflix-a67fb9b7/netflix/:
gs://mgmt467-471119-netflix-a67fb9b7/netflix/movies.csv
gs://mgmt467-471119-netflix-a67fb9b7/netflix/recommendation_logs.csv
gs://mgmt467-471119-netflix-a67fb9b7/netflix/reviews.csv
gs://mgmt467-471119-netflix-a67fb9b7/netflix/search_logs.csv
gs://mgmt467-471119-netflix-a67fb9b7/netflix/users.csv
gs://mgmt467-471119-netflix-a67fb9b7/netflix/watch_history.csv


### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [None]:
# # EXAMPLE (from LLM) — GCS staging (commented)
# # import uuid, os
# # bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# # os.environ["BUCKET_NAME"] = bucket_name
# # !gcloud storage buckets create gs://{bucket_name} --location={REGION}
# # !gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
# # print("Bucket:", bucket_name)
# # # Verify contents
# # !gcloud storage ls gs://$BUCKET_NAME/netflix/

In [None]:
# Load tables from GCS into BigQuery
tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}

import os
for tbl, fname in tables.items():
  src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
  print("Loading", tbl, "from", src)
  # --replace overwrites the table, so re-running the cell does not append duplicate rows
  !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}

# Row counts
for tbl in tables.keys():
  # No backticks in the double-quoted string: bash would run them as command substitution
  !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM {DATASET}.{tbl}"

Loading users from gs://mgmt467-471119-netflix-a67fb9b7/netflix/users.csv
Waiting on bqjob_r2bc07cacddee0307_0000019a178c19c1_1 ... (3s) Current status: DONE   
Loading movies from gs://mgmt467-471119-netflix-a67fb9b7/netflix/movies.csv
Waiting on bqjob_r4ec27650eb49a0ad_0000019a178c3a1f_1 ... (1s) Current status: DONE   
Loading watch_history from gs://mgmt467-471119-netflix-a67fb9b7/netflix/watch_history.csv
Waiting on bqjob_r40a2f0cebcde4d3d_0000019a178c5058_1 ... (3s) Current status: DONE   
Loading recommendation_logs from gs://mgmt467-471119-netflix-a67fb9b7/netflix/recommendation_logs.csv
Waiting on bqjob_r2f4b14c367c42c38_0000019a178c6ea8_1 ... (2s) Current status: DONE   
Loading search_logs from gs://mgmt467-471119-netflix-a67fb9b7/netflix/search_logs.csv
Waiting on bqjob_r6a4cf061de9c7759_0000019a178c8ac7_1 ... (1s) Current status: DONE   
Loading reviews from gs://mgmt467-471119-netflix-a67fb9b7/netflix/reviews.csv
Waiting on bqjob_r5fbe36a7561e1688_0000019a178ca1c4_1 ... (

In [None]:
# Create a unique bucket in the specified region and upload data
import uuid
import os

bucket_name = f"{PROJECT_ID}-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

!gcloud storage buckets create gs://{bucket_name} --location={REGION}
!gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/

print(f"Created bucket: {bucket_name}")
print("Dataset files uploaded to GCS. Staging data in GCS provides a centralized, versionable source for loading into BigQuery and other services.")

# Verify contents (optional)
# !gcloud storage ls gs://$BUCKET_NAME/netflix/

Creating gs://mgmt467-471119-netflix-ce0628a4/...
Copying file:///content/data/raw/movies.csv to gs://mgmt467-471119-netflix-ce0628a4/netflix/movies.csv
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-471119-netflix-ce0628a4/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-471119-netflix-ce0628a4/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-471119-netflix-ce0628a4/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-471119-netflix-ce0628a4/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-471119-netflix-ce0628a4/netflix/watch_history.csv

Average throughput: 59.7MiB/s
Created bucket: mgmt467-471119-netflix-ce0628a4
Dataset files uploaded to GCS. Staging data in GCS provides a centralized, versionable source for loading into BigQuery and other services.


### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.
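
As a rough sketch of Cell B, the load commands can be assembled in plain Python before they are run. The helper `bq_load_command` and the bucket name `example-bucket` are hypothetical. The `--replace` flag is an addition to the prompt's flag list: a `bq load` without it appends by default, so re-running the cell would typically duplicate rows.

```python
import shlex

TABLES = ["users", "movies", "watch_history",
          "recommendation_logs", "search_logs", "reviews"]

def bq_load_command(dataset, table, bucket, replace=True):
    """Return the argv list for loading gs://<bucket>/netflix/<table>.csv into <dataset>.<table>."""
    args = ["bq", "load"]
    if replace:
        args.append("--replace")  # overwrite instead of append, so re-runs stay idempotent
    args += [
        "--skip_leading_rows=1",
        "--autodetect",
        "--source_format=CSV",
        f"{dataset}.{table}",
        f"gs://{bucket}/netflix/{table}.csv",
    ]
    return args

# "example-bucket" is a placeholder; in the lab this would be os.environ["BUCKET_NAME"]
commands = [bq_load_command("netflix", t, "example-bucket") for t in TABLES]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could then be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems altogether.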


In [None]:
# # EXAMPLE (from LLM) — BigQuery dataset (commented)
# # DATASET="netflix"
# # # Attempt to create; ignore if exists
# # !bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [None]:
# Verification: Get row counts for all tables
# __TABLES__ exposes table_id (not table_name); single quotes keep bash from running the backticks as a command
!bq query --nouse_legacy_sql 'SELECT table_id AS table_name, row_count FROM `{PROJECT_ID}.netflix.__TABLES__`'


In [None]:
# # EXAMPLE (from LLM) — Load tables (commented)
# # tables = {
# #   "users": "users.csv",
# #   "movies": "movies.csv",
# #   "watch_history": "watch_history.csv",
# #   "recommendation_logs": "recommendation_logs.csv",
# #   "search_logs": "search_logs.csv",
# #   "reviews": "reviews.csv",
# # }
# # import os
# # for tbl, fname in tables.items():
# #   src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
# #   print("Loading", tbl, "from", src)
# #   !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}
# #
# # # Row counts
# # for tbl in tables.keys():
# #   !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM {DATASET}.{tbl}"

In [None]:
%%bigquery
-- Users: % missing per column
-- The magic does not expand @GOOGLE_CLOUD_PROJECT, so the table is referenced as dataset.table,
-- which resolves against the session's default project.
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(region IS NULL) miss_region,
         COUNTIF(plan_tier IS NULL) miss_plan,
         COUNTIF(age_band IS NULL) miss_age
  FROM netflix.users
)
SELECT n,
       ROUND(SAFE_DIVIDE(100*miss_region, n),2) AS pct_missing_region,
       ROUND(SAFE_DIVIDE(100*miss_plan, n),2)   AS pct_missing_plan_tier,
       ROUND(SAFE_DIVIDE(100*miss_age, n),2)    AS pct_missing_age_band
FROM base;



In [None]:
%%bigquery
-- % plan_tier missing by region (MAR analysis)
-- The magic does not expand @GOOGLE_CLOUD_PROJECT, so the table is referenced as dataset.table,
-- which resolves against the session's default project.
SELECT region,
       COUNT(*) AS n,
       ROUND(SAFE_DIVIDE(100*COUNTIF(plan_tier IS NULL), COUNT(*)),2) AS pct_missing_plan_tier
FROM netflix.users
GROUP BY region
ORDER BY pct_missing_plan_tier DESC;



### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.
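
A minimal sketch of how such a query could be assembled: build one `SELECT` per table and join them with `UNION ALL`. The function name `row_count_query` and the project ID `demo-project-123` are hypothetical; the resulting string would still need to be submitted to BigQuery (for example through the Python client), which is not shown here.

```python
TABLES = ["users", "movies", "watch_history",
          "recommendation_logs", "search_logs", "reviews"]

def row_count_query(project_id, dataset, tables):
    """Return one SQL statement yielding (table_name, row_count) for each table."""
    parts = [
        f"SELECT '{t}' AS table_name, COUNT(*) AS row_count "
        f"FROM `{project_id}.{dataset}.{t}`"
        for t in tables
    ]
    # ORDER BY after the last branch sorts the combined result
    return "\nUNION ALL\n".join(parts) + "\nORDER BY table_name"

sql = row_count_query("demo-project-123", "netflix", TABLES)
print(sql)
```

Building the string in Python sidesteps the shell entirely, so the backticks around the fully qualified table name are safe.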


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.
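
The MAR check can be illustrated locally on a few made-up rows (toy data, not taken from the Netflix dataset). If the missing rate of `plan_tier` differs sharply between regions, missingness depends on an observed variable, which is the signature of MAR rather than MCAR. The helper name `pct_missing_by_group` is hypothetical.

```python
from collections import defaultdict

# Toy rows, invented for illustration only
users = [
    {"region": "NA", "plan_tier": "basic"},
    {"region": "NA", "plan_tier": None},
    {"region": "NA", "plan_tier": "premium"},
    {"region": "NA", "plan_tier": "basic"},
    {"region": "EU", "plan_tier": None},
    {"region": "EU", "plan_tier": None},
    {"region": "EU", "plan_tier": "basic"},
    {"region": "EU", "plan_tier": None},
]

def pct_missing_by_group(rows, group_col, target_col):
    """Return {group: percent of rows where target_col is None}, rounded to 2 decimals."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += 1
        if row[target_col] is None:
            missing[row[group_col]] += 1
    return {g: round(100 * missing[g] / totals[g], 2) for g in totals}

rates = pct_missing_by_group(users, "region", "plan_tier")
for region, pct in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region}: {pct}% plan_tier missing")
```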

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [None]:
# # EXAMPLE (from LLM) — Missingness profile (commented)
# # -- Users: % missing per column
# # WITH base AS (
# #   SELECT COUNT(*) n,
# #          COUNTIF(region IS NULL) miss_region,
# #          COUNTIF(plan_tier IS NULL) miss_plan,
# #          COUNTIF(age_band IS NULL) miss_age
# #   FROM netflix.users
# # )
# # SELECT n,
# #        ROUND(100*miss_region/n,2) AS pct_missing_region,
# #        ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
# #        ROUND(100*miss_age/n,2)    AS pct_missing_age_band
# # FROM base;

In [None]:
%%bigquery
-- Users: % missing per column
-- Referenced as dataset.table, which resolves against the session's default project
-- (no placeholder project ID to replace).
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(region IS NULL) miss_region,
         COUNTIF(plan_tier IS NULL) miss_plan,
         COUNTIF(age_band IS NULL) miss_age
  FROM netflix.users
)
SELECT n,
       ROUND(SAFE_DIVIDE(100*miss_region, n),2) AS pct_missing_region,
       ROUND(SAFE_DIVIDE(100*miss_plan, n),2)   AS pct_missing_plan_tier,
       ROUND(SAFE_DIVIDE(100*miss_age, n),2)    AS pct_missing_age_band
FROM base;



In [None]:
%%bigquery
-- % plan_tier missing by region (MAR analysis)
-- Referenced as dataset.table, which resolves against the session's default project
-- (no placeholder project ID to replace).
SELECT region,
       COUNT(*) AS n,
       ROUND(SAFE_DIVIDE(100*COUNTIF(plan_tier IS NULL), COUNT(*)),2) AS pct_missing_plan_tier
FROM netflix.users
GROUP BY region
ORDER BY pct_missing_plan_tier DESC;



In [None]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM netflix.users
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).
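
The keep-one policy can be sketched in plain Python on toy rows (invented for illustration). The column names follow the build prompt below; the function name `dedup_keep_best` is hypothetical, and the sketch assumes no NULLs in the ranking columns. Ties are resolved in favor of the first row seen, which keeps the result deterministic for a fixed input order.

```python
KEY_COLS = ("user_id", "movie_id", "event_ts", "device_type")

def dedup_keep_best(rows):
    """Keep one row per key: highest progress_ratio, then highest minutes_watched."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in KEY_COLS)
        rank = (row["progress_ratio"], row["minutes_watched"])
        current = best.get(key)
        # Strict '>' means the first row wins on an exact tie
        if current is None or rank > (current["progress_ratio"], current["minutes_watched"]):
            best[key] = row
    return list(best.values())

# Toy interaction records: the first three share the same key
rows = [
    {"user_id": "u1", "movie_id": "m1", "event_ts": "t1", "device_type": "tv",
     "progress_ratio": 0.4, "minutes_watched": 30},
    {"user_id": "u1", "movie_id": "m1", "event_ts": "t1", "device_type": "tv",
     "progress_ratio": 0.9, "minutes_watched": 55},
    {"user_id": "u1", "movie_id": "m1", "event_ts": "t1", "device_type": "tv",
     "progress_ratio": 0.9, "minutes_watched": 50},
    {"user_id": "u2", "movie_id": "m1", "event_ts": "t1", "device_type": "mobile",
     "progress_ratio": 0.2, "minutes_watched": 10},
]

deduped = dedup_keep_best(rows)
print(f"before: {len(rows)} rows, after: {len(deduped)} rows")
```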

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [None]:
# # EXAMPLE (from LLM) — Detect duplicate groups (commented)
# # SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
# # FROM netflix.watch_history
# # GROUP BY user_id, movie_id, event_ts, device_type
# # HAVING dup_count > 1
# # ORDER BY dup_count DESC
# # LIMIT 20;

In [None]:
%%bigquery
-- Report duplicate groups on (user_id, movie_id, event_ts, device_type) with counts (top 20)
-- The magic does not expand ${GOOGLE_CLOUD_PROJECT}, so the table is referenced as dataset.table,
-- which resolves against the session's default project.
SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
FROM netflix.watch_history
GROUP BY user_id, movie_id, event_ts, device_type
HAVING dup_count > 1
ORDER BY dup_count DESC
LIMIT 20;



In [None]:
%%bigquery
-- Create table watch_history_dedup that keeps one row per group (prefer higher progress_ratio, then minutes_watched)
-- The magic does not expand ${GOOGLE_CLOUD_PROJECT}, so tables are referenced as dataset.table,
-- which resolves against the session's default project.
CREATE OR REPLACE TABLE netflix.watch_history_dedup AS
SELECT * EXCEPT(rk) FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, movie_id, event_ts, device_type
           ORDER BY progress_ratio DESC, minutes_watched DESC
         ) AS rk
  FROM netflix.watch_history h
)
WHERE rk = 1;



In [None]:
# # EXAMPLE (from LLM) — Keep-one policy (commented)
# # CREATE OR REPLACE TABLE netflix.watch_history_dedup AS
# # SELECT * EXCEPT(rk) FROM (
# #   SELECT h.*,
# #          ROW_NUMBER() OVER (
# #            PARTITION BY user_id, movie_id, event_ts, device_type
# #            ORDER BY progress_ratio DESC, minutes_watched DESC
# #          ) AS rk
# #   FROM netflix.watch_history h
# # )
# # WHERE rk = 1;

### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.
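
Both steps can be tried locally with the standard library before writing the SQL. The sketch below uses made-up values and Python's `statistics.quantiles`, whose interpolation differs from BigQuery's `APPROX_QUANTILES`, so the numbers will not match the warehouse exactly. The helper names `iqr_bounds` and `winsorize` are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lo, hi) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(values, lo, hi):
    """Cap every value into the closed interval [lo, hi]."""
    return [min(hi, max(lo, v)) for v in values]

# Toy minutes_watched values with one extreme session (invented for illustration)
minutes = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35, 600]

# Step 1: flag outliers with the IQR rule
lo, hi = iqr_bounds(minutes)
outliers = [v for v in minutes if v < lo or v > hi]
pct_outliers = round(100 * len(outliers) / len(minutes), 2)

# Step 2: cap at the 1st and 99th percentiles (99 cut points: index 0 is P01, index 98 is P99)
pcts = statistics.quantiles(minutes, n=100, method="inclusive")
p01, p99 = pcts[0], pcts[98]
capped = winsorize(minutes, p01, p99)

print("outliers:", outliers, f"({pct_outliers}%)")
print("max before:", max(minutes), "| max after:", max(capped))
```

Keeping the flag (`outliers`) alongside the capped values preserves the information that a value was extreme, even after capping changes it.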

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [None]:
# # EXAMPLE (from LLM) — IQR outlier rate (commented)
# # WITH dist AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
# #   FROM netflix.watch_history_dedup
# # ),
# # bounds AS (
# #   SELECT q1, q3, (q3-q1) AS iqr,
# #          q1 - 1.5*(q3-q1) AS lo,
# #          q3 + 1.5*(q3-q1) AS hi
# #   FROM dist
# # )
# # SELECT
# #   COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
# # FROM netflix.watch_history_dedup h
# # CROSS JOIN bounds b;

In [None]:
%%bigquery
-- Compute IQR bounds for minutes_watched on watch_history_dedup and report % outliers.
-- %%bigquery does not expand ${GOOGLE_CLOUD_PROJECT}, so tables are referenced as
-- dataset.table and resolve against the session's default project.
WITH dist AS (
  SELECT
    APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
  FROM `netflix.watch_history_dedup`
),
bounds AS (
  SELECT q1, q3, (q3-q1) AS iqr,
         q1 - 1.5*(q3-q1) AS lo,
         q3 + 1.5*(q3-q1) AS hi
  FROM dist
)
SELECT
  COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
FROM `netflix.watch_history_dedup` h
CROSS JOIN bounds b;
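
The `%%bigquery` magic appears to send the cell text to BigQuery unchanged, so a shell-style `${GOOGLE_CLOUD_PROJECT}` placeholder in a table name is not expanded. When a fully qualified `project.dataset.table` name is needed, one option is to render the SQL in Python first. The sketch below is a minimal illustration, not the lab's required method; `render_sql` and the fallback ID `my-project-123` are hypothetical names.

```python
import os
from string import Template


def render_sql(sql_template: str, project_id: str) -> str:
    """Replace ${GOOGLE_CLOUD_PROJECT} in a SQL string with a concrete project ID."""
    # string.Template uses the same ${NAME} syntax as the placeholder.
    # Any other literal "$" in the SQL would need to be written as "$$".
    return Template(sql_template).substitute(GOOGLE_CLOUD_PROJECT=project_id)


sql_template = """
SELECT COUNT(*) AS total
FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
"""

# Assumption: the environment variable was exported during setup;
# "my-project-123" is only a placeholder fallback for illustration.
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project-123")
sql = render_sql(sql_template, project_id)
print(sql)

# Running the rendered query needs authentication, so it is left commented:
# from google.cloud import bigquery
# df = bigquery.Client(project=project_id).query(sql).to_dataframe()
```

The rendered string can then be passed to the BigQuery client instead of relying on the cell magic to fill in the project.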



In [None]:
%%bigquery
-- Create watch_history_robust with minutes_watched_capped at P01/P99; return quantile summaries before/after.
-- Tables are referenced as dataset.table (default project); %%bigquery does not expand ${GOOGLE_CLOUD_PROJECT}.
CREATE OR REPLACE TABLE `netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
    -- APPROX_QUANTILES(x, 100) returns 101 boundaries, so P99 is OFFSET(99), not OFFSET(98).
    APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
  FROM `netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
FROM `netflix.watch_history_dedup` h, q;

-- Quantiles before vs after
WITH before AS (
  SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
  FROM `netflix.watch_history_dedup`
),
after AS (
  SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
  FROM `netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;



In [None]:
# # EXAMPLE (from LLM) — Winsorize + quantiles (commented)
# # CREATE OR REPLACE TABLE `netflix.watch_history_robust` AS
# # WITH q AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
# #     -- 101 boundaries are returned, so P99 is OFFSET(99)
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
# #   FROM `netflix.watch_history_dedup`
# # )
# # SELECT
# #   h.*,
# #   GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
# # FROM `netflix.watch_history_dedup` h, q;
# #
# # -- Quantiles before vs after
# # WITH before AS (
# #   SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
# #   FROM `netflix.watch_history_dedup`
# # ),
# # after AS (
# #   SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
# #   FROM `netflix.watch_history_robust`
# # )
# # SELECT * FROM before UNION ALL SELECT * FROM after;

### Verification Prompt
Generate a query that shows min/median/max before vs after capping.
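
To see what this verification should reveal, the sketch below applies the same clamp as `GREATEST(p01, LEAST(p99, x))` to a made-up list and compares min/median/max before and after. It is a toy illustration only: `cap`, the `minutes` values, and the bounds `30` and `120` are hypothetical stand-ins for the real P01/P99.

```python
from statistics import median


def cap(value, lo, hi):
    """Clamp value into [lo, hi]; mirrors GREATEST(lo, LEAST(hi, value))."""
    return max(lo, min(hi, value))


# Made-up session lengths in minutes, with one extreme value at each end.
minutes = [0, 35, 42, 50, 58, 61, 75, 90, 1440]

# Hypothetical bounds standing in for the P01 and P99 computed in SQL.
lo, hi = 30, 120
capped = [cap(m, lo, hi) for m in minutes]

summary = {
    "before": (min(minutes), median(minutes), max(minutes)),
    "after": (min(capped), median(capped), max(capped)),
}
for which, (mn, med, mx) in summary.items():
    print(f"{which}: min={mn}, median={med}, max={mx}")
```

In this toy case the min and max move to the bounds while the median is unchanged, which is the pattern to look for in the real before/after query.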


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

### 5.4 Business anomaly flags — What & Why
Human-readable flags support both product decisions and ML features (e.g., a flag for binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if the column exists).
Each cell should output count and percentage and include 1–2 comments.
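
The three flag rules above can be sketched as small Python predicates, which makes the thresholds and the `age_band` parsing easy to check before writing SQL. This is an illustration under assumptions: the function names are hypothetical, and `age_band` is assumed to be a string such as `"18-24"` whose first run of digits is taken as the age.

```python
import re


def parse_age(age_band):
    """Return the first run of digits in age_band as an int, or None."""
    match = re.search(r"\d+", age_band or "")
    return int(match.group()) if match else None


def flag_binge(minutes_watched):
    """True for sessions longer than 8 hours (480 minutes)."""
    return minutes_watched is not None and minutes_watched > 8 * 60


def flag_age_extreme(age_band):
    """True when the parsed age is below 10 or above 100."""
    age = parse_age(age_band)
    return age is not None and (age < 10 or age > 100)


def flag_duration_anomaly(duration_min):
    """True for titles shorter than 15 minutes or longer than 480 minutes."""
    return duration_min is not None and (duration_min < 15 or duration_min > 480)


# Missing or unparseable values are not flagged, matching how COUNTIF
# skips rows where the condition evaluates to NULL.
print(flag_binge(481), flag_binge(480))
print(flag_age_extreme("5-9"), flag_age_extreme("18-24"), flag_age_extreme("unknown"))
print(flag_duration_anomaly(10), flag_duration_anomaly(95), flag_duration_anomaly(None))
```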


In [None]:
# # EXAMPLE (from LLM) — flag_binge (commented)
# # SELECT
# #   COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(minutes_watched > 8*60)/COUNT(*),2) AS pct
# # FROM `netflix.watch_history_robust`;

In [None]:
%%bigquery
-- Summarize flag_binge for sessions > 8 hours (8*60 = 480 minutes).
-- Uses the raw minutes_watched column, since the capped column could hide very long sessions.
-- Tables are referenced as dataset.table (default project); %%bigquery does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT
  COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
  COUNT(*) AS total,
  ROUND(SAFE_DIVIDE(100*COUNTIF(minutes_watched > 8*60), COUNT(*)),2) AS pct
FROM `netflix.watch_history_robust`;



In [None]:
%%bigquery
-- Summarize flag_age_extreme if age can be parsed from age_band (<10 or >100).
-- REGEXP_EXTRACT takes the first run of digits; SAFE_CAST yields NULL (not flagged) when nothing parses.
-- Tables are referenced as dataset.table (default project); %%bigquery does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT
  COUNTIF(SAFE_CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
          SAFE_CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
  COUNT(*) AS total,
  ROUND(SAFE_DIVIDE(100*COUNTIF(SAFE_CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
                                SAFE_CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100), COUNT(*)),2) AS pct
FROM `netflix.users`;



In [None]:
%%bigquery
-- Summarize flag_duration_anomaly where duration_min < 15 or > 480.
-- Tables are referenced as dataset.table (default project); %%bigquery does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT
  COUNTIF(duration_min < 15) AS titles_under_15m,
  COUNTIF(duration_min > 480) AS titles_over_8h, -- 480 minutes = 8 hours
  COUNT(*) AS total,
  -- Percentage of titles with either anomaly, as the Build Prompt asks for count and percentage.
  ROUND(SAFE_DIVIDE(100*COUNTIF(duration_min < 15 OR duration_min > 480), COUNT(*)),2) AS pct
FROM `netflix.movies`;



In [None]:
# # EXAMPLE (from LLM) — flag_age_extreme (commented)
# # SELECT
# #   COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #           CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #                     CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100)/COUNT(*),2) AS pct
# # FROM `netflix.users`;

In [None]:
# # EXAMPLE (from LLM) — flag_duration_anomaly (commented)
# # SELECT
# #   COUNTIF(duration_min < 15) AS titles_under_15m,
# #   COUNTIF(duration_min > 8*60) AS titles_over_8h,
# #   COUNT(*) AS total
# # FROM `netflix.movies`;

### Verification Prompt
Generate a single compact summary query that returns one row per flag with two columns: `flag_name, pct_of_rows`.


In [None]:
%%bigquery
-- Compact summary of anomaly flags: one row per flag.
-- Tables are referenced as dataset.table (default project); %%bigquery does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT 'flag_binge' AS flag_name,
       ROUND(SAFE_DIVIDE(100*COUNTIF(minutes_watched > 8*60), COUNT(*)),2) AS pct_of_rows
FROM `netflix.watch_history_robust`
UNION ALL
SELECT 'flag_age_extreme' AS flag_name,
       ROUND(SAFE_DIVIDE(100*COUNTIF(SAFE_CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
                                     SAFE_CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100), COUNT(*)),2) AS pct_of_rows
FROM `netflix.users`
UNION ALL
SELECT 'flag_duration_anomaly_under_15m' AS flag_name,
       ROUND(SAFE_DIVIDE(100*COUNTIF(duration_min < 15), COUNT(*)),2) AS pct_of_rows
FROM `netflix.movies`
UNION ALL
SELECT 'flag_duration_anomaly_over_8h' AS flag_name,
       ROUND(SAFE_DIVIDE(100*COUNTIF(duration_min > 480), COUNT(*)),2) AS pct_of_rows -- 480 minutes = 8 hours
FROM `netflix.movies`;



**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

No flag results are recorded in this notebook yet, so the most common flag cannot be named here. Once the flag queries above run successfully, compare `pct_of_rows` across flags: the flag with the highest percentage is the most common.

Which flag to keep as a feature depends on the business or ML problem. Keep the one most likely to reflect real user behavior relevant to the target. For example, `flag_binge` may help identify highly engaged users or predict churn, while `flag_age_extreme` more likely indicates data entry errors or a niche user segment, and `flag_duration_anomaly` in `movies` may point to data issues or unusual content.
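
Once real percentages are available, ranking the flags is a one-liner. The sketch below shows the comparison on made-up counts (they are not results from the Netflix dataset); `safe_pct` and `toy_counts` are hypothetical, and `safe_pct` returns `None` for an empty table in the spirit of `SAFE_DIVIDE`.

```python
def safe_pct(flagged, total):
    """Percentage of flagged rows rounded to 2 places; None when total is 0."""
    if total == 0:
        return None
    return round(100 * flagged / total, 2)


# Made-up (flagged_rows, total_rows) pairs for illustration only.
toy_counts = {
    "flag_binge": (12, 1000),
    "flag_age_extreme": (3, 500),
    "flag_duration_anomaly": (0, 0),  # empty table: percentage is undefined
}

pcts = {name: safe_pct(flagged, total) for name, (flagged, total) in toy_counts.items()}

# Rank only the flags with a defined percentage, highest first.
ranked = sorted(
    ((name, pct) for name, pct in pcts.items() if pct is not None),
    key=lambda item: item[1],
    reverse=True,
)
most_common = ranked[0][0]
print(pcts, most_common)
```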

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.


## Checklist for Saving and Submitting

*   [ ] Save this notebook to the team Drive.
*   [ ] Export a `.sql` file with your DQ queries and save to repo.
*   [ ] Push notebook + SQL to the team GitHub with a descriptive commit.
*   [ ] Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.
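
For the `.sql` export step, one possible approach is to keep the DQ queries in a Python dictionary and write them to a single file. This is a minimal sketch, not a required workflow: `export_queries`, the file name `dq_queries.sql`, and the temporary output directory are hypothetical choices, and the sample query assumes the `netflix.watch_history_robust` table from section 5.

```python
import tempfile
from pathlib import Path


def export_queries(queries: dict, path: Path) -> Path:
    """Write named SQL queries to one file, each preceded by a comment header."""
    parts = []
    for name, sql in queries.items():
        # Normalize so every statement ends with exactly one semicolon.
        parts.append(f"-- {name}\n{sql.strip().rstrip(';')};\n")
    path.write_text("\n".join(parts), encoding="utf-8")
    return path


dq_queries = {
    "pct_binge_sessions": """
        SELECT ROUND(SAFE_DIVIDE(100*COUNTIF(minutes_watched > 8*60), COUNT(*)), 2) AS pct
        FROM `netflix.watch_history_robust`
    """,
}

# Written to a temp directory here; in practice, point this at the repo folder.
out_path = export_queries(dq_queries, Path(tempfile.gettempdir()) / "dq_queries.sql")
print(out_path.read_text(encoding="utf-8"))
```

Keeping the queries in one named structure makes it easier to commit the same SQL that was actually run in the notebook.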

## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
