# Notebook 1: Setup & First Query

## Welcome to Analytics Intelligence!

**👋 New to Colab?** No problem! This is a cloud-based coding environment. You'll:
- **Read** text cells like this one (markdown)
- **Run** code cells by clicking the ▶️ play button
- **See** results appear below each code cell

**👀 Just watching?** That's totally fine! You can follow along without running anything.

---

### What We're Building
A system that watches your analytics data 24/7 and alerts you to:
- 🚨 **Problems**: Tracking breaks, PII leaks, data quality issues
- 🎉 **Opportunities**: Traffic spikes, conversion improvements, new patterns
- 📊 **Insights**: Behavior shifts, emerging trends

### Workshop Structure
1. **This notebook (20 min)**: Connect to BigQuery, explore data, find first issue
2. **Notebook 2 (20 min)**: Use AI to generate SQL checks
3. **Notebook 3 (25 min)**: Build alerts and deploy

### The Data
We're using 7 days of sample news analytics (3.5M events, GA4-style) with **planted problems** and **opportunities** for you to discover.

Let's go! 🚀

---

## Step 1: Install Dependencies

**What this does:** Downloads the Python libraries we need (BigQuery connector, Pandas for data).

**▶️ Click the play button** to run this cell. Takes ~10 seconds.

In [None]:
# Install required packages
# (db-dtypes is needed by to_dataframe() in recent google-cloud-bigquery versions)
!pip install -q google-cloud-bigquery pandas db-dtypes

print("✓ Packages installed successfully!")

---

## Step 2: Import Libraries

**What this does:** "Opens the toolbox" - makes the libraries available to use.

In [None]:
from google.cloud import bigquery
from google.colab import auth
import pandas as pd
from datetime import datetime, timedelta

# Display settings (makes tables easier to read)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully!")

---

## Step 3: Authenticate with Google Cloud

**What this does:** Logs you into Google Cloud so you can access BigQuery.

**⚠️ You'll see a popup** asking to authorize. Click "Allow".

**Note:** Bryan Davis (AP) has already set up a shared project for this workshop - you're getting read-only access to the sample data.

In [None]:
# Authenticate with Google Cloud
auth.authenticate_user()

print("✓ Authentication successful!")

---

## Step 4: Configure BigQuery Connection

**What this does:** Points to the sample data project.

**✅ Already configured** - The project ID below points to Bryan's shared BigQuery dataset.

In [None]:
# Configuration (already set for the workshop; change only if you use your own project)
PROJECT_ID = "npa-workshop-2025"  # ⬅️ Workshop project ID
DATASET_ID = "npa_workshop"
TABLE_ID = "news_events"

# Build full table reference
TABLE_REF = f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}"

print("✓ Configuration set!")
print(f"  Using table: {TABLE_REF}")

---

## Step 5: Connect to BigQuery

**What this does:** Opens the connection to the database.

In [None]:
# Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID)

# Helper function to run queries easily
def run_query(sql):
    """Execute SQL and return results as pandas DataFrame."""
    query_job = client.query(sql)
    df = query_job.to_dataframe()
    return df

print("✓ Connected to BigQuery!")
print(f"  Project: {client.project}")

---

# Part 1: Explore the Data

Before we can detect anomalies, we need to understand what's "normal" in our data.

Think of this like getting to know a new friend - we're learning the baseline patterns.

---

## Table Schema

**What this does:** Shows us what columns exist and what type of data they hold.

This is like opening a CSV and looking at the headers before analyzing.

In [None]:
# Get table metadata
table = client.get_table(TABLE_REF)

print(f"Table: {table.table_id}")
print(f"Total Rows: {table.num_rows:,}")
print(f"Size: {table.num_bytes / 1024 / 1024:.1f} MB")
print(f"\nSchema ({len(table.schema)} columns):")
print("-" * 50)
for field in table.schema:
    print(f"  {field.name:25} {field.field_type:10}")

---

## Sample Rows

**What this does:** Shows us a few actual events to understand the data structure.

**📊 Look for:**
- `event_name`: What happened (page_view, scroll_depth, etc.)
- `platform`: Where it happened (web, ios, android)
- `consent_state`: GDPR flag (important!)
- `engagement_time_msec`: How long the user was engaged, in milliseconds

In [None]:
# Get 5 sample rows
sql = f"""
SELECT *
FROM `{TABLE_REF}`
LIMIT 5
"""

df_sample = run_query(sql)
df_sample

---

## Date Range

**What this does:** Shows us what time period the data covers.

**🎯 Expected:** 7 days, ~3.5M events

In [None]:
sql = f"""
SELECT
  MIN(event_date) as first_date,
  MAX(event_date) as last_date,
  COUNT(DISTINCT event_date) as total_days,
  COUNT(*) as total_events,
  COUNT(DISTINCT user_pseudo_id) as unique_users
FROM `{TABLE_REF}`
"""

df_range = run_query(sql)

print("\n📅 Data Summary:")
print("=" * 50)
print(f"  Date range: {df_range['first_date'][0]} → {df_range['last_date'][0]}")
print(f"  Days: {df_range['total_days'][0]}")
print(f"  Total events: {df_range['total_events'][0]:,}")
print(f"  Unique users: {df_range['unique_users'][0]:,}")
print(f"  Avg events/day: {df_range['total_events'][0] / df_range['total_days'][0]:,.0f}")
print("=" * 50)

---
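
**Optional sketch:** The "expected" values above (7 days, ~3.5M events) can also be checked in code rather than by eye. The helper below is a minimal sketch, not part of the workshop code: the name `check_expectations`, the 10% tolerance, and the sample numbers are all assumptions made for illustration.

```python
def check_expectations(actual, expected, tolerance=0.10):
    """Return a list of readable mismatches between actual and expected values.

    A value passes when it is within `tolerance` (a fraction) of its target.
    """
    problems = []
    for name, target in expected.items():
        value = actual.get(name)
        if value is None:
            problems.append(f"{name}: missing from results")
            continue
        if abs(value - target) > tolerance * target:
            problems.append(f"{name}: expected ~{target:,}, got {value:,}")
    return problems


# Illustrative numbers only, not real query output
expected = {"total_days": 7, "total_events": 3_500_000}
actual = {"total_days": 7, "total_events": 3_480_112}

for line in check_expectations(actual, expected) or ["✓ Data matches expectations"]:
    print(line)
```

With this notebook's results, `actual` could be built from `df_range`, for example `{"total_days": int(df_range['total_days'][0]), "total_events": int(df_range['total_events'][0])}`.

---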

# Part 2: Establish Baselines

To detect anomalies, we need to know what "normal" looks like.

Let's look at typical patterns...

---

## Event Distribution

**What this does:** Shows us what types of events we're tracking.

**🎯 This is our baseline:** When we look for problems later, we're looking for deviations from these percentages.

In [None]:
sql = f"""
SELECT
  event_name,
  COUNT(*) as event_count,
  ROUND(COUNT(*) / (SELECT COUNT(*) FROM `{TABLE_REF}`) * 100, 1) as pct
FROM `{TABLE_REF}`
GROUP BY event_name
ORDER BY event_count DESC
"""

df_events = run_query(sql)

print("\n📊 Event Type Distribution:")
print("=" * 60)
print(df_events.to_string(index=False))
print("=" * 60)
print("\n💡 Expect page views to dominate (typical for news sites); check the table above")

---
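
**Optional sketch:** "Deviations from these percentages" can be made concrete by comparing a baseline distribution with a later one. This is a minimal sketch under stated assumptions: `find_share_shifts`, the 5-point threshold, and the sample percentages are hypothetical, and `click` is a placeholder event name that may not match the dataset.

```python
def find_share_shifts(baseline_pct, current_pct, max_shift_pts=5.0):
    """Compare two {event_name: percentage} mappings.

    Returns {event_name: shift} for every event whose share moved by more
    than `max_shift_pts` percentage points. Events absent from one side
    are treated as 0%.
    """
    shifts = {}
    for event in sorted(set(baseline_pct) | set(current_pct)):
        shift = current_pct.get(event, 0.0) - baseline_pct.get(event, 0.0)
        if abs(shift) > max_shift_pts:
            shifts[event] = round(shift, 1)
    return shifts


# Made-up percentages for illustration ("click" is a placeholder event name)
baseline = {"page_view": 60.0, "scroll_depth": 20.0, "click": 10.0}
today = {"page_view": 71.5, "scroll_depth": 8.0, "click": 10.5}

print(find_share_shifts(baseline, today))
```

Here the scroll share falling while page views rise would be worth a look, since it could point to a broken scroll tracker rather than a real behavior change.

---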

## Platform & Device Distribution

**What this does:** Shows how traffic splits across platforms (web, iOS, Android) and device categories.

In [None]:
sql = f"""
SELECT
  platform,
  device_category,
  COUNT(*) as event_count,
  ROUND(COUNT(*) / (SELECT COUNT(*) FROM `{TABLE_REF}`) * 100, 1) as pct
FROM `{TABLE_REF}`
GROUP BY platform, device_category
ORDER BY event_count DESC
"""

df_platform = run_query(sql)

print("\n📱 Platform & Device Distribution:")
print("=" * 60)
print(df_platform.to_string(index=False))
print("=" * 60)

---

## Daily Event Volume

**What this does:** Shows us how consistent daily traffic is.

**🎯 We're looking for:** Consistent volume day-to-day (sudden drops = problem)

In [None]:
sql = f"""
SELECT
  event_date,
  COUNT(*) as event_count,
  COUNT(DISTINCT user_pseudo_id) as unique_users,
  ROUND(AVG(engagement_time_msec) / 1000, 1) as avg_engagement_sec
FROM `{TABLE_REF}`
GROUP BY event_date
ORDER BY event_date
"""

df_daily = run_query(sql)

print("\n📈 Daily Volume:")
print("=" * 70)
print(df_daily.to_string(index=False))
print("=" * 70)

# Calculate stats
avg_daily = df_daily['event_count'].mean()
std_daily = df_daily['event_count'].std()

print(f"\n📊 Daily Statistics:")
print(f"  Average: {avg_daily:,.0f} events/day")
print(f"  Std Dev: {std_daily:,.0f}")
print(f"  Range: {df_daily['event_count'].min():,} to {df_daily['event_count'].max():,}")

---
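
**Optional sketch:** One simple way to turn "sudden drops = problem" into a check is to compare each day with the median day. Note this deliberately uses the median instead of the mean and standard deviation printed above, because with only 7 data points a single bad day distorts both. The function name `find_volume_drops`, the 30% cutoff, and the sample counts are assumptions for illustration.

```python
from statistics import median


def find_volume_drops(daily_counts, max_drop=0.30):
    """Flag days whose event count falls more than `max_drop` below the median.

    `daily_counts` maps a date label to that day's event count.
    Returns a list of (date, count, fraction_below_median) tuples.
    """
    if not daily_counts:
        return []
    typical = median(daily_counts.values())
    if typical <= 0:
        return []
    drops = []
    for day, count in daily_counts.items():
        shortfall = (typical - count) / typical
        if shortfall > max_drop:
            drops.append((day, count, round(shortfall, 2)))
    return drops


# Made-up counts for illustration, not the workshop data
counts = {
    "day 1": 500_000, "day 2": 510_000, "day 3": 490_000,
    "day 4": 505_000, "day 5": 200_000, "day 6": 495_000, "day 7": 500_000,
}

print(find_volume_drops(counts))
```

With the notebook's results, the input could be built as `dict(zip(df_daily['event_date'], df_daily['event_count']))`.

---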

## Hourly Patterns

**What this does:** Shows us when people read news throughout the day.

**🎯 Expected pattern:** Low overnight, peaks at morning/lunch/evening

In [None]:
sql = f"""
SELECT
  EXTRACT(HOUR FROM event_datetime) as hour,
  COUNT(*) as event_count,
  ROUND(AVG(engagement_time_msec) / 1000, 1) as avg_engagement_sec
FROM `{TABLE_REF}`
WHERE event_name = 'page_view'
GROUP BY hour
ORDER BY hour
"""

df_hourly = run_query(sql)

print("\n⏰ Hourly Traffic Patterns:")
print("=" * 70)

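# Text-based visualization (one █ per 10,000 events)
for _, row in df_hourly.iterrows():
    # iterrows() does not preserve dtypes, so cast counts back to int before formatting
    count = int(row['event_count'])
    bar = '█' * (count // 10000)
    print(f"{int(row['hour']):02d}:00 | {bar} {count:>8,} events")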

print("=" * 70)
print("\n💡 Insight: Clear morning/lunch/evening peaks (typical news consumption)")

---
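
**Optional sketch:** The text chart above uses a fixed scale of one block per 10,000 events, which suits this dataset's size. For other datasets, scaling the bars to the busiest hour keeps the chart readable. The helper name `render_bars`, the `width` parameter, and the sample counts are assumptions for illustration.

```python
def render_bars(hourly_counts, width=40):
    """Return one text line per hour, with bars scaled to the busiest hour.

    `hourly_counts` maps an hour (0-23) to an event count.
    """
    peak = max(hourly_counts.values(), default=0)
    lines = []
    for hour in sorted(hourly_counts):
        count = int(hourly_counts[hour])
        # The busiest hour gets `width` blocks; other hours scale proportionally
        bar_len = round(count * width / peak) if peak else 0
        lines.append(f"{hour:02d}:00 | {'█' * bar_len} {count:>8,} events")
    return lines


# Made-up counts for illustration
sample = {3: 12_000, 8: 96_000, 12: 120_000, 19: 108_000}

for line in render_bars(sample, width=20):
    print(line)
```

---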

# Part 3: First Investigation

Now that we know what "normal" looks like, let's check data quality.

**🔍 Specifically:** Are we missing critical fields?

---

## Missing Values Check

**What this does:** Counts how many events are missing important fields.

**⚠️ Pay attention to `consent_state`** - events without a recorded consent state are a potential GDPR compliance risk!

In [None]:
sql = f"""
SELECT
  COUNT(*) as total_events,
  
  -- Missing counts
  COUNTIF(consent_state IS NULL) as missing_consent,
  COUNTIF(referrer IS NULL) as missing_referrer,
  COUNTIF(article_id IS NULL) as missing_article,
  COUNTIF(section IS NULL) as missing_section,
  
  -- Percentages
  ROUND(COUNTIF(consent_state IS NULL) / COUNT(*) * 100, 1) as pct_missing_consent,
  ROUND(COUNTIF(referrer IS NULL) / COUNT(*) * 100, 1) as pct_missing_referrer,
  ROUND(COUNTIF(article_id IS NULL) / COUNT(*) * 100, 1) as pct_missing_article
FROM `{TABLE_REF}`
"""

df_quality = run_query(sql)

print("\n🔍 Data Quality Check:")
print("=" * 70)
print(df_quality.to_string(index=False))
print("=" * 70)

# Check for issues
CONSENT_ALERT_THRESHOLD_PCT = 10  # alert when more than 10% of events lack consent_state
missing_consent_pct = df_quality['pct_missing_consent'][0]

print("\n🚨 Issue Detection:")
if missing_consent_pct > CONSENT_ALERT_THRESHOLD_PCT:
    print(f"\n  ⚠️  WARNING: {missing_consent_pct}% of events missing consent_state!")
    print("     This is a potential GDPR compliance problem.")
    print("     Legal/compliance teams should investigate immediately.")
    # Only claim a finding when the threshold was actually exceeded
    print("\n💡 Insight: You just found your first data quality issue!")
else:
    print("\n  ✓ Consent tracking looks healthy")

---
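
**Optional sketch:** The missing-values query repeats the same `COUNTIF(... IS NULL)` pattern for every field, so it can be generated from a list of column names. This is a sketch only: `build_missing_values_sql` is a hypothetical helper, and its column aliases (`missing_consent_state`, `pct_missing_consent_state`, ...) differ from the shorter aliases used in the cell above.

```python
def build_missing_values_sql(table_ref, columns):
    """Build a BigQuery query reporting NULL counts and percentages per column."""
    if not columns:
        raise ValueError("columns must not be empty")
    select_parts = ["COUNT(*) AS total_events"]
    for col in columns:
        # Accept plain identifiers only, since names are interpolated into SQL
        if not col.isidentifier():
            raise ValueError(f"Unexpected column name: {col!r}")
        select_parts.append(f"COUNTIF({col} IS NULL) AS missing_{col}")
        select_parts.append(
            f"ROUND(COUNTIF({col} IS NULL) / COUNT(*) * 100, 1) AS pct_missing_{col}"
        )
    select_list = ",\n  ".join(select_parts)
    return f"SELECT\n  {select_list}\nFROM `{table_ref}`"


sql = build_missing_values_sql(
    "npa-workshop-2025.npa_workshop.news_events",
    ["consent_state", "referrer", "article_id", "section"],
)
print(sql)
```

The generated string can be passed to `run_query(sql)` like any other query in this notebook. Adding a field to the check then means adding one name to the list.

---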

## Traffic Sources

**What this does:** Shows us where traffic is coming from.

**🎯 This becomes important later** when we look for new referral sources (opportunities!)

In [None]:
sql = f"""
SELECT
  referrer,
  COUNT(*) as event_count,
  COUNT(DISTINCT user_pseudo_id) as unique_users,
  ROUND(AVG(engagement_time_msec) / 1000, 1) as avg_engagement_sec
FROM `{TABLE_REF}`
WHERE referrer IS NOT NULL
GROUP BY referrer
ORDER BY event_count DESC
LIMIT 10
"""

df_referrers = run_query(sql)

print("\n🌐 Top Traffic Sources:")
print("=" * 80)
print(df_referrers.to_string(index=False))
print("=" * 80)

---
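
**Optional sketch:** Spotting a new referral source comes down to comparing today's referrers against the ones already seen. The helper below is a minimal sketch: `find_new_referrers`, the `min_events` cutoff, and every domain in the sample data are made up for illustration.

```python
def find_new_referrers(baseline_referrers, current_counts, min_events=100):
    """Return (referrer, events) pairs not seen in the baseline, largest first.

    Referrers sending fewer than `min_events` events are ignored as noise.
    """
    known = set(baseline_referrers)
    new = {
        ref: n
        for ref, n in current_counts.items()
        if ref not in known and n >= min_events
    }
    return sorted(new.items(), key=lambda item: item[1], reverse=True)


# Made-up referrers and counts for illustration
baseline = ["google.com", "facebook.com", "news.google.com"]
today = {
    "google.com": 41_000,
    "facebook.com": 9_500,
    "example-aggregator.com": 2_300,
    "tiny-blog.example": 12,
}

print(find_new_referrers(baseline, today))
```

The query above returns only the top 10 referrers, so a real baseline would need the full list (the same query without `LIMIT 10`).

---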

# 🎯 Optional Exercise

**For those following along:** Try writing your own query!

**Goal:** Find the top 10 articles by page views.

**Include:**
- Article ID
- Section
- Number of page views
- Number of unique users

**Hint:** Filter to `event_name = 'page_view'` and use `GROUP BY`

**Don't worry if you can't figure it out** - the solution is in the next cell.

In [None]:
# Solution (replace with your own query if you'd like to try first)
sql = f"""
SELECT
  article_id,
  section,
  COUNT(*) as page_views,
  COUNT(DISTINCT user_pseudo_id) as unique_users
FROM `{TABLE_REF}`
WHERE event_name = 'page_view'
  AND article_id IS NOT NULL
GROUP BY article_id, section
ORDER BY page_views DESC
LIMIT 10
"""

df_top_articles = run_query(sql)

print("\n🏆 Top 10 Articles:")
print("=" * 80)
print(df_top_articles.to_string(index=False))
print("=" * 80)

---

# 🎉 Notebook 1 Complete!

## What You Accomplished

- ✅ Connected to BigQuery (no local installation needed!)
- ✅ Explored 3.5M events across 7 days
- ✅ Established baselines for "normal" patterns
- ✅ Ran investigative SQL queries
- ✅ **Found your first issue**: 15% missing `consent_state`

---

## What You Discovered

📊 **Data Profile:**
- ~500K events/day over 7 days
- 60% page views, 20% scroll events, 10% clicks
- 50% web, 30% iOS, 20% Android
- Clear hourly patterns (morning/lunch/evening peaks)

🚨 **Potential Problem:**
- High percentage of missing `consent_state` values
- This could be a GDPR compliance issue
- Would you have caught this without automated checks?

---

## Next Steps

In **Notebook 2**, you'll:
- ✨ Use **AI to automatically generate SQL** (no SQL expertise needed!)
- 🔍 Detect **specific problems** (tracking breaks, PII leaks, duplicates)
- 🎉 Find **opportunities** (traffic spikes, new referrers)
- 🧠 Learn **prompt engineering** for SQL generation

**Ready to level up?** Open `COLAB_02_ai_generated_sql_checks.ipynb`

---

### Questions?

Ask Bryan or reach out: **brdavis@ap.org**

---

**Presented by:**
Bryan Davis  
Director of Product, Data & Analytics  
The Associated Press

---

**🌟 Pro Tip:** Bookmark this notebook - it's a great template for exploring any analytics dataset!