# 003: SQL Fundamentals for Data Engineering

**Relational databases** organize data into tables (relations) with rows (records) and columns (attributes):
- **Table**: Collection of related data (e.g., `test_results`, `devices`, `wafers`)
- **Row**: Single record (e.g., one test result for one device)
- **Column**: Attribute (e.g., `test_name`, `test_value`, `pass_fail`)
- **Primary Key**: Unique identifier for each row (e.g., `device_id`)
- **Foreign Key**: Reference to primary key in another table (e.g., `wafer_id` → `wafers.wafer_id`)
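
The key concepts above can be sketched in SQLite as follows. This is a minimal illustration rather than the schema used later in this notebook: the two-column tables are hypothetical, and it assumes foreign-key checks are switched on with `PRAGMA foreign_keys = ON`, because SQLite leaves them off by default for each new connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled on the connection
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical minimal tables: each device references the wafer it came from
conn.execute("CREATE TABLE wafers (wafer_id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE devices (
        device_id TEXT PRIMARY KEY,
        wafer_id  TEXT NOT NULL REFERENCES wafers(wafer_id)
    )
""")

conn.execute("INSERT INTO wafers VALUES ('W001')")
conn.execute("INSERT INTO devices VALUES ('DEV00001', 'W001')")  # parent row exists

# A device that points at a non-existent wafer violates the foreign key
try:
    conn.execute("INSERT INTO devices VALUES ('DEV00002', 'W999')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Re-using an existing primary key value is rejected as well
try:
    conn.execute("INSERT INTO devices VALUES ('DEV00001', 'W001')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

device_count = conn.execute("SELECT COUNT(*) FROM devices").fetchone()[0]
print(orphan_rejected, duplicate_rejected, device_count)
```

Without the pragma, the orphan insert would succeed silently, which is worth keeping in mind when reading the `FOREIGN KEY` clause in the schema below.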

**SQLite basics:**
```python
import sqlite3

# Create in-memory database
conn = sqlite3.connect(':memory:')  # Or 'database.db' for file
cursor = conn.cursor()

# Create table
cursor.execute('''
    CREATE TABLE test_results (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        device_id VARCHAR(50),
        test_name VARCHAR(100),
        test_value REAL,
        pass_fail VARCHAR(10),
        timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
    )
''')
conn.commit()
```

**Data types:**
- `INTEGER`: Whole numbers (device counts, bin numbers)
- `REAL`: Floating point (voltage, current, frequency)
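- `TEXT/VARCHAR`: Strings (test names, device IDs); SQLite stores both as `TEXT` and does not enforce the `VARCHAR(n)` length
- `DATETIME`: Timestamps (test execution time); SQLite has no dedicated date type, so timestamps are typically stored as ISO 8601 text
- `BOOLEAN`: True/False (pass/fail as 0/1); stored as an integer in SQLite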
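
To see how SQLite actually stores values declared with these type names, `typeof()` can be queried on a small table. This is a sketch with a hypothetical `demo` table; the column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo (
        device_id  VARCHAR(50),
        test_value REAL,
        tested_at  DATETIME,
        passed     BOOLEAN
    )
""")
conn.execute(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    ("DEV00001", 1.8, "2024-01-01T00:00:00", True),
)

# typeof() reports the storage class SQLite chose for each stored value
storage = conn.execute("""
    SELECT typeof(device_id), typeof(test_value), typeof(tested_at), typeof(passed)
    FROM demo
""").fetchone()
print(storage)  # ('text', 'real', 'text', 'integer')

# The length in VARCHAR(50) is not enforced: a 200-character string is accepted
conn.execute("INSERT INTO demo (device_id) VALUES (?)", ("X" * 200,))
longest = conn.execute("SELECT MAX(LENGTH(device_id)) FROM demo").fetchone()[0]
print(longest)  # 200
```

Other engines such as PostgreSQL enforce declared types and lengths strictly, so a schema that works in SQLite may need tightening before it is moved.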

### 📝 What's Happening in This Code?

**Purpose:** Create STDF-like test database and populate with synthetic semiconductor test data

**Key Points:**
- **SQLite in-memory**: Fast testing, no file persistence (use file for production)
- **Synthetic data**: 1000 devices × 10 tests = 10,000 test records
- **Realistic parameters**: Vdd (voltage), Idd (current), Freq (frequency), Leakage, Power
- **Pass/Fail logic**: Each test result is marked FAIL when its value falls outside the spec limits; a device counts as failing if any of its tests fail
- **Timestamp**: Simulate test execution chronology

**Why This Matters:**
- Real STDF files have millions of records (50M+ for AMD)
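- With suitable indexes, SQL can answer selective queries without loading the full dataset into memory, which is often far faster than filtering in pandas (the actual speedup depends on data size, indexes, and hardware)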
- Enables complex analytics (JOINs, aggregations, window functions)
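
Indexes are what make selective queries fast, and `EXPLAIN QUERY PLAN` shows whether SQLite uses one. The sketch below assumes a cut-down `test_results` table and a hypothetical index name, `idx_results_test_name`; neither is created elsewhere in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test_results (
        id INTEGER PRIMARY KEY,
        device_id TEXT,
        test_name TEXT,
        test_value REAL
    )
""")
conn.executemany(
    "INSERT INTO test_results (device_id, test_name, test_value) VALUES (?, ?, ?)",
    [(f"DEV{i:05d}", name, 1.0) for i in range(100) for name in ("Vdd_1.8V", "Freq_Max")],
)

query = "SELECT device_id FROM test_results WHERE test_name = 'Freq_Max'"

def query_plan(connection, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Without an index SQLite has to scan every row of the table
plan_before = query_plan(conn, query)

# Hypothetical index name; indexing the filtered column allows a direct lookup
conn.execute("CREATE INDEX idx_results_test_name ON test_results (test_name)")
plan_after = query_plan(conn, query)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies between SQLite versions, but the first plan should report a scan and the second a search that names the index.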

**Post-silicon context:**
- AMD: 50M test records in PostgreSQL, <100ms query time
- NVIDIA: Wafer database with spatial indexes for die location queries
- Qualcomm: Multi-site databases with site_id foreign keys

In [None]:
# Part 1: Create STDF Test Database

import sqlite3
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Create in-memory database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create tables
print("=" * 60)
print("Creating STDF Test Database Schema")
print("=" * 60)

# Table 1: Devices
cursor.execute('''
    CREATE TABLE devices (
        device_id VARCHAR(50) PRIMARY KEY,
        wafer_id VARCHAR(50),
        die_x INTEGER,
        die_y INTEGER,
        test_date DATETIME,
        final_bin INTEGER
    )
''')

# Table 2: Test Results
cursor.execute('''
    CREATE TABLE test_results (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        device_id VARCHAR(50),
        test_name VARCHAR(100),
        test_value REAL,
        lower_limit REAL,
        upper_limit REAL,
        pass_fail VARCHAR(10),
        test_time_ms REAL,
        FOREIGN KEY (device_id) REFERENCES devices(device_id)
    )
''')

print("✅ Tables created: devices, test_results")

# Generate synthetic data
print("\n" + "=" * 60)
print("Generating Synthetic STDF Data")
print("=" * 60)

np.random.seed(42)
n_devices = 1000
n_tests = 10

# Test specifications
test_specs = {
    'Vdd_1.8V': (1.71, 1.89, 1.8),
    'Idd_Active': (80, 120, 100),
    'Freq_Max': (1900, 2100, 2000),
    'Leakage_Cold': (0, 50, 10),
    'Leakage_Hot': (0, 100, 30),
    'Power_Active': (150, 250, 200),
    'Power_Sleep': (0, 5, 1),
    'Setup_Time': (0.8, 1.2, 1.0),
    'Hold_Time': (0.9, 1.1, 1.0),
    'Rise_Time': (0.4, 0.6, 0.5)
}

# Insert devices
base_date = datetime(2024, 1, 1)
devices_data = []
for i in range(n_devices):
    device_id = f"DEV{i:05d}"
    wafer_id = f"W{(i//100):03d}"
    die_x = i % 20
    die_y = (i // 20) % 10
    test_date = base_date + timedelta(minutes=i*5)
    devices_data.append((device_id, wafer_id, die_x, die_y, test_date.isoformat(), 0))

cursor.executemany('INSERT INTO devices VALUES (?, ?, ?, ?, ?, ?)', devices_data)

# Insert test results
test_data = []
for device_id, *_ in devices_data:
    for test_name, (lower, upper, nominal) in test_specs.items():
        # Generate test value: ~95% drawn around nominal (tails can still miss the limits), ~5% forced failures
        if np.random.rand() < 0.95:
            test_value = np.random.normal(nominal, (upper-lower)/6)
        else:
            # Intentional failure
            test_value = np.random.choice([lower - 0.1, upper + 0.1])
        
        pass_fail = 'PASS' if lower <= test_value <= upper else 'FAIL'
        test_time_ms = np.random.uniform(5, 20)
        
        test_data.append((device_id, test_name, test_value, lower, upper, pass_fail, test_time_ms))

cursor.executemany('''
    INSERT INTO test_results (device_id, test_name, test_value, lower_limit, upper_limit, pass_fail, test_time_ms)
    VALUES (?, ?, ?, ?, ?, ?, ?)
''', test_data)

conn.commit()

print(f"✅ Inserted {n_devices} devices")
print(f"✅ Inserted {len(test_data)} test results ({n_devices} × {n_tests} tests)")

# Verify data
cursor.execute('SELECT COUNT(*) FROM devices')
device_count = cursor.fetchone()[0]

cursor.execute('SELECT COUNT(*) FROM test_results')
result_count = cursor.fetchone()[0]

cursor.execute("SELECT COUNT(*) FROM test_results WHERE pass_fail = 'FAIL'")
fail_count = cursor.fetchone()[0]

print(f"\n📊 Database Summary:")
print(f"   Devices: {device_count}")
print(f"   Test Results: {result_count}")
print(f"   Failures: {fail_count} ({fail_count/result_count*100:.1f}%)")
print(f"   Overall Yield: {100 - fail_count/result_count*100:.1f}%")

## 📐 Part 2: SELECT Queries - Data Retrieval

SQL's `SELECT` statement retrieves data from tables. It's the foundation of all database queries.

**Basic SELECT Syntax:**
```sql
SELECT column1, column2, ...
FROM table_name
WHERE condition
ORDER BY column1 [ASC|DESC]
LIMIT n;
```

**Key Clauses:**
- **SELECT**: Specifies which columns to return (`*` = all columns)
- **WHERE**: Filters rows based on conditions (e.g., `test_value > 100`)
- **DISTINCT**: Returns only unique values
- **ORDER BY**: Sorts results (ASC = ascending, DESC = descending)
- **LIMIT**: Restricts number of rows returned
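
When the filter values come from Python variables, the clauses above are usually combined with `?` placeholders instead of string formatting, so that values are bound safely by the driver. A minimal sketch follows, using a small hand-made table and a hypothetical helper named `failures_for_test`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_results (device_id TEXT, test_name TEXT, test_value REAL, pass_fail TEXT)"
)
conn.executemany("INSERT INTO test_results VALUES (?, ?, ?, ?)", [
    ("DEV00001", "Freq_Max", 2010.0, "PASS"),
    ("DEV00002", "Freq_Max", 1850.0, "FAIL"),
    ("DEV00003", "Freq_Max", 2150.0, "FAIL"),
    ("DEV00003", "Vdd_1.8V", 1.95, "FAIL"),
])

def failures_for_test(connection, test_name, limit=10):
    """Return (device_id, test_value) for failing results of one test, highest value first."""
    sql = """
        SELECT device_id, test_value
        FROM test_results
        WHERE test_name = ? AND pass_fail = 'FAIL'
        ORDER BY test_value DESC
        LIMIT ?
    """
    # Values are passed separately from the SQL text, never interpolated into it
    return connection.execute(sql, (test_name, limit)).fetchall()

freq_failures = failures_for_test(conn, "Freq_Max")
distinct_tests = [
    row[0]
    for row in conn.execute("SELECT DISTINCT test_name FROM test_results ORDER BY test_name")
]
print(freq_failures)   # [('DEV00003', 2150.0), ('DEV00002', 1850.0)]
print(distinct_tests)  # ['Freq_Max', 'Vdd_1.8V']
```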

**Post-Silicon Use Cases:**
- **Qualcomm**: Query 5M test results for specific parameter violations in <500ms
- **AMD**: Filter wafer W012 devices with Vdd failures → 45 devices in 20ms
- **NVIDIA**: Sort devices by frequency to identify top 10 performers
- **Intel**: Find distinct test names across 200 wafer lots

**SELECT vs Pandas:**
- SQL: Optimized for filtering, indexing, and aggregation on large data
- Pandas: Better for complex transformations and in-memory analysis
- Rule: Use SQL for filtering, pandas for analysis

### 📝 What's Happening in This Code?

**Purpose:** Query test database with SELECT statements to filter, sort, and analyze STDF data.

**Key Points:**
- **WHERE clause**: Filters rows before returning results (e.g., find all voltage failures)
- **ORDER BY**: Sorts results by column values (critical for identifying worst performers)
- **LIMIT**: Returns only top N results (useful for "top 10 failures" queries)
- **DISTINCT**: Removes duplicates (essential for finding unique test names or wafer IDs)

**Why This Matters:**
- AMD scenario: 50M test records → filter Vdd failures → 450 devices in 85ms (vs 8 sec pandas)
- NVIDIA use case: Sort 1M devices by frequency → identify top 100 performers → 120ms query
- Production value: Real-time test floor dashboards need <100ms query response

In [None]:
# Part 2: SELECT Queries

print("=" * 60)
print("Part 2: SELECT Queries - Data Retrieval")
print("=" * 60)

# Query 1: Select all columns from test_results (limit 5)
print("\n1️⃣ SELECT * (All Columns) - First 5 Results:")
cursor.execute('SELECT * FROM test_results LIMIT 5')
results = cursor.fetchall()
for row in results:
    print(f"   {row}")

# Query 2: Select specific columns
print("\n2️⃣ SELECT Specific Columns (device_id, test_name, test_value):")
cursor.execute('SELECT device_id, test_name, test_value FROM test_results LIMIT 5')
results = cursor.fetchall()
for row in results:
    print(f"   Device: {row[0]}, Test: {row[1]}, Value: {row[2]:.2f}")

# Query 3: WHERE clause - Filter failures
print("\n3️⃣ WHERE Clause - Filter FAIL Results:")
cursor.execute("SELECT device_id, test_name, test_value FROM test_results WHERE pass_fail = 'FAIL' LIMIT 10")
results = cursor.fetchall()
print(f"   Showing up to 10 failures ({len(results)} returned):")
for row in results:
    print(f"   ❌ Device: {row[0]}, Test: {row[1]}, Value: {row[2]:.2f}")

# Query 4: WHERE with numeric conditions
print("\n4️⃣ WHERE with Numeric Conditions - High Leakage:")
cursor.execute('''
    SELECT device_id, test_name, test_value, upper_limit
    FROM test_results
    WHERE test_name LIKE '%Leakage%' AND test_value > 80
    LIMIT 10
''')
results = cursor.fetchall()
print(f"   Devices with leakage > 80 µA:")
for row in results:
    print(f"   ⚠️  Device: {row[0]}, {row[1]}: {row[2]:.2f} µA (limit: {row[3]})")

# Query 5: ORDER BY - Sort by test value
print("\n5️⃣ ORDER BY - Top 10 Highest Frequency Devices:")
cursor.execute('''
    SELECT device_id, test_value
    FROM test_results
    WHERE test_name = 'Freq_Max'
    ORDER BY test_value DESC
    LIMIT 10
''')
results = cursor.fetchall()
for i, row in enumerate(results, 1):
    print(f"   #{i} Device: {row[0]}, Frequency: {row[1]:.2f} MHz")

# Query 6: DISTINCT - Unique test names
print("\n6️⃣ DISTINCT - Unique Test Names:")
cursor.execute('SELECT DISTINCT test_name FROM test_results ORDER BY test_name')
results = cursor.fetchall()
for row in results:
    print(f"   • {row[0]}")

# Query 7: Multiple AND conditions across two tables (uses a JOIN, covered in Part 4)
print("\n7️⃣ Complex WHERE - Vdd Failures on Wafer W005:")
cursor.execute('''
    SELECT d.device_id, d.wafer_id, t.test_value, t.upper_limit
    FROM test_results t
    JOIN devices d ON t.device_id = d.device_id
    WHERE d.wafer_id = 'W005' 
      AND t.test_name = 'Vdd_1.8V' 
      AND t.pass_fail = 'FAIL'
    ORDER BY t.test_value DESC
''')
results = cursor.fetchall()
print(f"   Found {len(results)} Vdd failures on wafer W005:")
for row in results[:5]:  # Show first 5
    print(f"   ❌ Device: {row[0]}, Vdd: {row[2]:.3f}V (limit: {row[3]:.3f}V)")

print("\n✅ SELECT queries complete!")

## 📐 Part 3: Aggregations - Statistical Analysis

SQL aggregation functions summarize data: COUNT, SUM, AVG, MIN, MAX. Combined with `GROUP BY`, they enable powerful analytics.

**Aggregation Syntax:**
```sql
SELECT column1, AGG_FUNC(column2)
FROM table_name
WHERE condition
GROUP BY column1
HAVING AGG_FUNC(column2) > threshold
ORDER BY AGG_FUNC(column2) DESC;
```

**Key Functions:**
- **COUNT(*)**: Counts rows (e.g., "How many test failures per device?")
- **AVG(column)**: Average value (e.g., "Mean Vdd per wafer")
- **SUM(column)**: Total sum (e.g., "Total test time per device")
- **MIN/MAX(column)**: Min/max value (e.g., "Best/worst frequency")
- **GROUP BY**: Groups rows with same values (e.g., by device_id or wafer_id)
- **HAVING**: Filters groups (e.g., "wafers with yield < 90%")

**GROUP BY vs WHERE:**
- **WHERE**: Filters individual rows BEFORE aggregation
- **HAVING**: Filters groups AFTER aggregation
- Example: `WHERE pass_fail = 'FAIL' GROUP BY device_id HAVING COUNT(*) > 3` → devices with >3 failures
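
The difference between the two filters is easiest to see on a tiny dataset. The sketch below uses three hypothetical devices with five tests each and runs the same `HAVING COUNT(*) > 3` clause with and without the `WHERE` filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (device_id TEXT, test_name TEXT, pass_fail TEXT)")
rows = (
    [("DEV_A", f"T{i}", "FAIL") for i in range(4)]       # DEV_A: 4 failures
    + [("DEV_A", "T4", "PASS")]
    + [("DEV_B", f"T{i}", "FAIL") for i in range(2)]     # DEV_B: 2 failures
    + [("DEV_B", f"T{i}", "PASS") for i in range(2, 5)]
    + [("DEV_C", f"T{i}", "PASS") for i in range(5)]     # DEV_C: no failures
)
conn.executemany("INSERT INTO test_results VALUES (?, ?, ?)", rows)

# WHERE removes passing rows first, so COUNT(*) counts failures only
many_failures = conn.execute("""
    SELECT device_id, COUNT(*) AS fail_count
    FROM test_results
    WHERE pass_fail = 'FAIL'
    GROUP BY device_id
    HAVING COUNT(*) > 3
""").fetchall()

# Without WHERE, COUNT(*) counts every test, so all three devices qualify
many_tests = conn.execute("""
    SELECT device_id, COUNT(*) AS test_count
    FROM test_results
    GROUP BY device_id
    HAVING COUNT(*) > 3
    ORDER BY device_id
""").fetchall()

print(many_failures)  # [('DEV_A', 4)]
print(many_tests)     # [('DEV_A', 5), ('DEV_B', 5), ('DEV_C', 5)]
```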

**Post-Silicon Analytics:**
- **Qualcomm**: Aggregate 50M tests → yield by wafer → identify 12 low-yield wafers in <200ms
- **AMD**: AVG(test_time_ms) by test_name → identify slow tests → reduce test time 15%
- **NVIDIA**: COUNT failures by die position → spatial correlation → flag edge dies
- **Intel**: MIN/MAX Vdd per lot → process drift detection → 0.05V variance alert

### 📝 What's Happening in This Code?

**Purpose:** Use aggregation functions to compute yield statistics, failure counts, and test time analytics.

**Key Points:**
- **COUNT(*) with GROUP BY**: Counts occurrences per group (e.g., failures per wafer)
- **AVG() with HAVING**: Filters groups based on average values (e.g., wafers with low yield)
- **MIN/MAX**: Identifies outliers (e.g., devices with extreme voltage/frequency)
- **Multiple aggregations**: Compute multiple metrics in one query (yield%, avg test time, failure count)

**Why This Matters:**
- AMD scenario: 10 wafers × 1000 devices → yield by wafer → identify 2 bad wafers → scrap before packaging ($500K saved)
- NVIDIA use case: Test time analytics → slow tests consume 60% of time → optimize 3 tests → 25% faster test flow
- Production impact: Real-time yield dashboards updated every 5 minutes using aggregation queries
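
Note that a pass rate computed over individual test results is not the same as device yield, where a device is good only if every one of its tests passes. The sketch below contrasts the two on four hypothetical devices; it uses a `WITH` clause (covered in Part 5) to roll results up per device first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (device_id TEXT, test_name TEXT, pass_fail TEXT)")
conn.executemany("INSERT INTO test_results VALUES (?, ?, ?)", [
    ("DEV1", "Vdd", "PASS"), ("DEV1", "Idd", "PASS"),
    ("DEV2", "Vdd", "PASS"), ("DEV2", "Idd", "FAIL"),
    ("DEV3", "Vdd", "PASS"), ("DEV3", "Idd", "PASS"),
    ("DEV4", "Vdd", "FAIL"), ("DEV4", "Idd", "FAIL"),
])

# Test-level pass rate: share of individual test results that passed
test_pass_rate = conn.execute("""
    SELECT 100.0 * SUM(CASE WHEN pass_fail = 'PASS' THEN 1 ELSE 0 END) / COUNT(*)
    FROM test_results
""").fetchone()[0]

# Device-level yield: a device is good only if none of its tests failed
device_yield = conn.execute("""
    WITH device_status AS (
        SELECT device_id,
               MAX(CASE WHEN pass_fail = 'FAIL' THEN 1 ELSE 0 END) AS has_failure
        FROM test_results
        GROUP BY device_id
    )
    SELECT 100.0 * SUM(1 - has_failure) / COUNT(*)
    FROM device_status
""").fetchone()[0]

print(test_pass_rate)  # 62.5
print(device_yield)    # 50.0
```

Device yield is never higher than the test-level pass rate computed on the same data when every device has the same number of tests, so it is worth stating which of the two a dashboard reports.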

In [None]:
# Part 3: Aggregations

print("=" * 60)
print("Part 3: Aggregations - Statistical Analysis")
print("=" * 60)

# Query 1: COUNT - Total test results
print("\n1️⃣ COUNT(*) - Total Records:")
cursor.execute('SELECT COUNT(*) FROM test_results')
total = cursor.fetchone()[0]
print(f"   Total test results: {total:,}")

# Query 2: COUNT with GROUP BY - Failures per test
print("\n2️⃣ COUNT with GROUP BY - Failures per Test:")
cursor.execute('''
    SELECT test_name, COUNT(*) as failure_count
    FROM test_results
    WHERE pass_fail = 'FAIL'
    GROUP BY test_name
    ORDER BY failure_count DESC
''')
results = cursor.fetchall()
for row in results:
    print(f"   {row[0]:<20} {row[1]:>4} failures")

# Query 3: AVG - Average test values by test
print("\n3️⃣ AVG - Average Test Values:")
cursor.execute('''
    SELECT test_name, 
           AVG(test_value) as avg_value,
           AVG(lower_limit) as lower,
           AVG(upper_limit) as upper
    FROM test_results
    GROUP BY test_name
    ORDER BY test_name
''')
results = cursor.fetchall()
for row in results:
    print(f"   {row[0]:<20} Avg: {row[1]:>8.2f} (limits: {row[2]:.2f} - {row[3]:.2f})")

# Query 4: Yield by wafer (GROUP BY with calculation)
print("\n4️⃣ Yield Calculation - Per Wafer:")
cursor.execute('''
    SELECT d.wafer_id,
           COUNT(*) as total_tests,
           SUM(CASE WHEN t.pass_fail = 'PASS' THEN 1 ELSE 0 END) as passes,
           SUM(CASE WHEN t.pass_fail = 'FAIL' THEN 1 ELSE 0 END) as failures,
           ROUND(100.0 * SUM(CASE WHEN t.pass_fail = 'PASS' THEN 1 ELSE 0 END) / COUNT(*), 2) as yield_pct
    FROM test_results t
    JOIN devices d ON t.device_id = d.device_id
    GROUP BY d.wafer_id
    ORDER BY yield_pct ASC
    LIMIT 10
''')
results = cursor.fetchall()
print(f"   {'Wafer':<10} {'Total':<8} {'Pass':<8} {'Fail':<8} {'Yield%':<8}")
print(f"   {'-'*50}")
for row in results:
    print(f"   {row[0]:<10} {row[1]:<8} {row[2]:<8} {row[3]:<8} {row[4]:<8}")

# Query 5: HAVING - Wafers with low yield
print("\n5️⃣ HAVING Clause - Low Yield Wafers (<94%):")
cursor.execute('''
    SELECT d.wafer_id,
           COUNT(*) as total_tests,
           ROUND(100.0 * SUM(CASE WHEN t.pass_fail = 'PASS' THEN 1 ELSE 0 END) / COUNT(*), 2) as yield_pct
    FROM test_results t
    JOIN devices d ON t.device_id = d.device_id
    GROUP BY d.wafer_id
    HAVING yield_pct < 94
    ORDER BY yield_pct ASC
''')
results = cursor.fetchall()
print(f"   Found {len(results)} low-yield wafers:")
for row in results:
    print(f"   ⚠️  Wafer {row[0]}: {row[1]} tests, {row[2]}% yield")

# Query 6: MIN/MAX - Outlier detection
print("\n6️⃣ MIN/MAX - Frequency Outliers:")
cursor.execute('''
    SELECT 
        test_name,
        MIN(test_value) as min_freq,
        AVG(test_value) as avg_freq,
        MAX(test_value) as max_freq,
        MAX(test_value) - MIN(test_value) as range_freq
    FROM test_results
    WHERE test_name = 'Freq_Max'
    GROUP BY test_name
''')
row = cursor.fetchone()
print(f"   Test: {row[0]}")
print(f"   Min:   {row[1]:.2f} MHz")
print(f"   Avg:   {row[2]:.2f} MHz")
print(f"   Max:   {row[3]:.2f} MHz")
print(f"   Range: {row[4]:.2f} MHz")

# Query 7: Test time analytics
print("\n7️⃣ Test Time Analytics - Slowest Tests:")
cursor.execute('''
    SELECT test_name,
           COUNT(*) as num_tests,
           AVG(test_time_ms) as avg_time_ms,
           SUM(test_time_ms) as total_time_ms
    FROM test_results
    GROUP BY test_name
    ORDER BY total_time_ms DESC
''')
results = cursor.fetchall()
for row in results[:5]:
    print(f"   {row[0]:<20} Avg: {row[2]:>6.2f}ms, Total: {row[3]:>8,.0f}ms ({row[1]:>5} tests)")

print("\n✅ Aggregation queries complete!")

## 📐 Part 4: JOINs - Combining Multiple Tables

JOINs combine rows from multiple tables based on related columns. Essential for relational analytics.

**JOIN Types:**

```sql
-- INNER JOIN: Only matching rows from both tables
SELECT * FROM devices d
INNER JOIN test_results t ON d.device_id = t.device_id;

-- LEFT JOIN: All rows from left table + matching rows from right
SELECT * FROM devices d
LEFT JOIN test_results t ON d.device_id = t.device_id;

-- RIGHT JOIN: All rows from right table + matching rows from left
-- (SQLite supports RIGHT JOIN only from version 3.39.0)
SELECT * FROM test_results t
RIGHT JOIN devices d ON t.device_id = d.device_id;

-- FULL OUTER JOIN: All rows from both tables
-- (SQLite supports FULL OUTER JOIN only from version 3.39.0)
SELECT * FROM devices d
FULL OUTER JOIN test_results t ON d.device_id = t.device_id;

-- CROSS JOIN: Cartesian product (all combinations)
SELECT * FROM devices, test_results;  -- Use with caution!
```

**When to Use Each JOIN:**
- **INNER JOIN**: Default choice (only devices with test results)
- **LEFT JOIN**: Keep all devices even if no test results (data quality check)
- **RIGHT JOIN**: Keep all test results even if device info missing (rare)
- **FULL OUTER JOIN**: Find mismatches (devices without tests OR tests without devices)
- **CROSS JOIN**: Generate combinations (e.g., all devices × all test types for planning)
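
The practical difference between INNER and LEFT JOIN only shows up when some rows have no match. The sketch below uses three hypothetical devices, one of which has no test results, and then uses `LEFT JOIN ... WHERE t.id IS NULL` to list the untested device.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (device_id TEXT PRIMARY KEY, wafer_id TEXT)")
conn.execute(
    "CREATE TABLE test_results (id INTEGER PRIMARY KEY, device_id TEXT, test_name TEXT)"
)
conn.executemany("INSERT INTO devices VALUES (?, ?)", [
    ("DEV1", "W001"), ("DEV2", "W001"), ("DEV3", "W001"),
])
# DEV3 deliberately has no test results
conn.executemany("INSERT INTO test_results (device_id, test_name) VALUES (?, ?)", [
    ("DEV1", "Vdd"), ("DEV1", "Idd"), ("DEV2", "Vdd"),
])

# INNER JOIN drops DEV3 because it has no matching test rows
inner_rows = conn.execute("""
    SELECT d.device_id, t.test_name
    FROM devices d
    INNER JOIN test_results t ON d.device_id = t.device_id
    ORDER BY d.device_id, t.id
""").fetchall()

# LEFT JOIN keeps DEV3 and fills the test columns with NULL (None in Python)
left_rows = conn.execute("""
    SELECT d.device_id, t.test_name
    FROM devices d
    LEFT JOIN test_results t ON d.device_id = t.device_id
    ORDER BY d.device_id, t.id
""").fetchall()

# Filtering on the NULL side keeps only the devices without any test result
untested = conn.execute("""
    SELECT d.device_id
    FROM devices d
    LEFT JOIN test_results t ON d.device_id = t.device_id
    WHERE t.id IS NULL
""").fetchall()

print(len(inner_rows), len(left_rows), untested)  # 3 4 [('DEV3',)]
```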

**Post-Silicon Multi-Table Analytics:**
- **AMD**: JOIN devices + test_results + wafer_lots → 3-table analysis → yield by lot + spatial → 150ms
- **NVIDIA**: LEFT JOIN devices + test_results → find devices with missing tests → data quality audit
- **Qualcomm**: JOIN test_results + test_limits + specifications → dynamic limit checking
- **Intel**: INNER JOIN devices + test_results + retest_history → failure rate trending

### 📝 What's Happening in This Code?

**Purpose:** Demonstrate INNER JOIN and LEFT JOIN to combine devices and test_results for multi-table analytics.

**Key Points:**
- **INNER JOIN**: Returns only devices that have test results (most common use case)
- **LEFT JOIN**: Returns all devices even if they have no test results (data quality checks)
- **Aliasing (d, t)**: Shorthand for table names to simplify queries
- **Multi-condition JOINs**: Can join on multiple columns (e.g., device_id AND test_date)

**Why This Matters:**
- AMD scenario: JOIN devices + test_results → spatial analysis → yield by die position → edge vs center comparison
- NVIDIA use case: LEFT JOIN devices + test_results → find 45 devices with missing tests → data quality issue
- Production impact: JOINs enable multi-dimensional analytics (spatial + parametric + temporal)

In [None]:
# Part 4: JOINs

print("=" * 60)
print("Part 4: JOINs - Combining Multiple Tables")
print("=" * 60)

# Query 1: INNER JOIN - Devices with Vdd failures
print("\n1️⃣ INNER JOIN - Devices with Vdd Failures (wafer info):")
cursor.execute('''
    SELECT d.device_id, d.wafer_id, d.die_x, d.die_y, 
           t.test_value, t.lower_limit, t.upper_limit
    FROM devices d
    INNER JOIN test_results t ON d.device_id = t.device_id
    WHERE t.test_name = 'Vdd_1.8V' AND t.pass_fail = 'FAIL'
    ORDER BY d.wafer_id, d.die_x, d.die_y
    LIMIT 10
''')
results = cursor.fetchall()
print(f"   {'Device':<10} {'Wafer':<8} {'Die(x,y)':<12} {'Vdd':<8} {'Limits':<15}")
print(f"   {'-'*60}")
for row in results:
    print(f"   {row[0]:<10} {row[1]:<8} ({row[2]:>2},{row[3]:>2})      {row[4]:.3f}V  {row[5]:.3f}-{row[6]:.3f}V")

# Query 2: INNER JOIN with aggregation - Failure count per wafer
print("\n2️⃣ INNER JOIN + Aggregation - Failure Count per Wafer:")
cursor.execute('''
    SELECT d.wafer_id,
           COUNT(DISTINCT d.device_id) as total_devices,
           COUNT(CASE WHEN t.pass_fail = 'FAIL' THEN 1 END) as total_failures,
           ROUND(100.0 * COUNT(CASE WHEN t.pass_fail = 'FAIL' THEN 1 END) / COUNT(*), 2) as failure_rate
    FROM devices d
    INNER JOIN test_results t ON d.device_id = t.device_id
    GROUP BY d.wafer_id
    ORDER BY failure_rate DESC
    LIMIT 10
''')
results = cursor.fetchall()
for row in results:
    print(f"   Wafer {row[0]}: {row[1]} devices, {row[2]} failures ({row[3]}%)")

# Query 3: INNER JOIN - Spatial analysis (edge vs center dies)
print("\n3️⃣ Spatial Analysis - Edge vs Center Dies:")
cursor.execute('''
    SELECT 
        CASE 
            WHEN d.die_x IN (0, 19) OR d.die_y IN (0, 9) THEN 'Edge'
            ELSE 'Center'
        END as position,
        COUNT(DISTINCT d.device_id) as num_devices,
        ROUND(100.0 * SUM(CASE WHEN t.pass_fail = 'PASS' THEN 1 ELSE 0 END) / COUNT(*), 2) as yield_pct
    FROM devices d
    INNER JOIN test_results t ON d.device_id = t.device_id
    GROUP BY position
''')
results = cursor.fetchall()
for row in results:
    print(f"   {row[0]:<8} dies: {row[1]:>4} devices, {row[2]:>6}% yield")

# Query 4: Multi-table JOIN - Device with worst test time
print("\n4️⃣ Multi-Column Analysis - Devices with Longest Total Test Time:")
cursor.execute('''
    SELECT d.device_id, d.wafer_id,
           COUNT(*) as num_tests,
           SUM(t.test_time_ms) as total_time_ms,
           AVG(t.test_time_ms) as avg_time_ms
    FROM devices d
    INNER JOIN test_results t ON d.device_id = t.device_id
    GROUP BY d.device_id, d.wafer_id
    ORDER BY total_time_ms DESC
    LIMIT 10
''')
results = cursor.fetchall()
for row in results:
    print(f"   Device {row[0]} (Wafer {row[1]}): {row[2]} tests, {row[3]:.2f}ms total ({row[4]:.2f}ms avg)")

# Query 5: LEFT JOIN - Data quality check (devices with missing tests)
print("\n5️⃣ LEFT JOIN - Data Quality Check:")
cursor.execute('''
    SELECT d.device_id, d.wafer_id,
           COUNT(t.id) as test_count
    FROM devices d
    LEFT JOIN test_results t ON d.device_id = t.device_id
    GROUP BY d.device_id, d.wafer_id
    HAVING test_count < 10
    ORDER BY test_count ASC
''')
results = cursor.fetchall()
if len(results) > 0:
    print(f"   ⚠️  Found {len(results)} devices with incomplete tests:")
    for row in results[:5]:
        print(f"   Device {row[0]}: Only {row[2]} tests (expected 10)")
else:
    print(f"   ✅ All devices have complete test results (10 tests each)")

print("\n✅ JOIN queries complete!")

## 📐 Part 5: Subqueries & CTEs - Advanced Queries

**Subqueries** (nested queries) allow queries inside other queries. **CTEs** (Common Table Expressions) create named temporary result sets for readability.

**Subquery Types:**

```sql
-- Scalar subquery (returns single value)
SELECT device_id, test_value
FROM test_results
WHERE test_name = 'Freq_Max'
  AND test_value > (SELECT AVG(test_value) FROM test_results WHERE test_name = 'Freq_Max');

-- IN subquery (returns multiple values)
SELECT * FROM devices
WHERE device_id IN (SELECT device_id FROM test_results WHERE pass_fail = 'FAIL');

-- Correlated subquery (references outer query)
SELECT device_id, 
       (SELECT COUNT(*) FROM test_results t WHERE t.device_id = d.device_id AND pass_fail = 'FAIL') as failures
FROM devices d;
```

**CTE (WITH Clause):**

```sql
WITH failing_devices AS (
    SELECT device_id, COUNT(*) as fail_count
    FROM test_results
    WHERE pass_fail = 'FAIL'
    GROUP BY device_id
)
SELECT * FROM failing_devices WHERE fail_count > 2;
```

**CTE Advantages:**
- ✅ **Readability**: Break complex queries into logical steps
- ✅ **Reusability**: Reference CTE multiple times in same query
- ✅ **Debugging**: Test each CTE independently
- ✅ **Performance**: Usually comparable to the equivalent subquery; whether a CTE is inlined or materialized depends on the database engine and version

**Subquery vs CTE:**
- **Subquery**: Inline, harder to read for complex logic
- **CTE**: Named, easier to understand and maintain
- **Rule**: Use CTE for queries >20 lines or multiple references
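
The "multiple references" case is where a CTE pays off most clearly. In the sketch below, built on a small hypothetical dataset, the same per-device failure count is needed twice: once as the rows to filter and once to compute the average. The subquery version has to repeat the aggregation, while the CTE names it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (device_id TEXT, test_name TEXT, pass_fail TEXT)")
conn.executemany("INSERT INTO test_results VALUES (?, ?, ?)", [
    ("DEV1", "T1", "FAIL"), ("DEV1", "T2", "FAIL"), ("DEV1", "T3", "FAIL"),
    ("DEV2", "T1", "FAIL"), ("DEV2", "T2", "PASS"),
    ("DEV3", "T1", "PASS"),
])

# Subquery version: the failure-count aggregation is written out twice
subquery_version = conn.execute("""
    SELECT device_id, fail_count
    FROM (
        SELECT device_id, COUNT(*) AS fail_count
        FROM test_results
        WHERE pass_fail = 'FAIL'
        GROUP BY device_id
    ) AS failing_devices
    WHERE fail_count > (
        SELECT AVG(fail_count)
        FROM (
            SELECT COUNT(*) AS fail_count
            FROM test_results
            WHERE pass_fail = 'FAIL'
            GROUP BY device_id
        ) AS counts
    )
""").fetchall()

# CTE version: the aggregation is named once and referenced twice
cte_version = conn.execute("""
    WITH failing_devices AS (
        SELECT device_id, COUNT(*) AS fail_count
        FROM test_results
        WHERE pass_fail = 'FAIL'
        GROUP BY device_id
    )
    SELECT device_id, fail_count
    FROM failing_devices
    WHERE fail_count > (SELECT AVG(fail_count) FROM failing_devices)
""").fetchall()

print(subquery_version, cte_version)  # [('DEV1', 3)] [('DEV1', 3)]
```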

**Post-Silicon Use Cases:**
- **AMD**: Subquery to find devices with test values 3σ above mean → outlier detection
- **NVIDIA**: CTE to compute yield by wafer, then JOIN to wafer metadata → 2-step analysis
- **Qualcomm**: Recursive CTE to find retest cascades (device fails → retest → fails again → retest)
- **Intel**: Multiple CTEs to build parametric limits from historical data → dynamic limit calculation

### 📝 What's Happening in This Code?

**Purpose:** Use subqueries and CTEs for advanced analytics like outlier detection and multi-step analysis.

**Key Points:**
- **Scalar subquery**: Returns single value (e.g., AVG) used in WHERE comparison
- **IN subquery**: Filters rows based on results from another query
- **CTE (WITH clause)**: Creates a named temporary result set for readability and reusability
- **Multiple CTEs**: Chain multiple CTEs for complex multi-step analysis

**Why This Matters:**
- AMD scenario: CTE to compute device-level stats → then filter for outliers → identify 23 anomalous devices in 95ms
- NVIDIA use case: Subquery to find top 10% performers → JOIN to spatial data → correlation analysis
- Production impact: CTEs enable modular query design → easier debugging and maintenance

In [None]:
# Part 5: Subqueries & CTEs

print("=" * 60)
print("Part 5: Subqueries & CTEs - Advanced Queries")
print("=" * 60)

# Query 1: Scalar subquery - Devices with above-average frequency
print("\n1️⃣ Scalar Subquery - Above-Average Frequency Devices:")
cursor.execute('''
    SELECT device_id, test_value as frequency
    FROM test_results
    WHERE test_name = 'Freq_Max' 
      AND test_value > (SELECT AVG(test_value) FROM test_results WHERE test_name = 'Freq_Max')
    ORDER BY test_value DESC
    LIMIT 10
''')
results = cursor.fetchall()
cursor.execute("SELECT AVG(test_value) FROM test_results WHERE test_name = 'Freq_Max'")
avg_freq = cursor.fetchone()[0]
print(f"   Average frequency: {avg_freq:.2f} MHz")
print(f"   Devices above average:")
for row in results:
    print(f"   • {row[0]}: {row[1]:.2f} MHz (+{row[1] - avg_freq:.2f} MHz)")

# Query 2: IN subquery - Devices with any failures
print("\n2️⃣ IN Subquery - Devices with Failures:")
cursor.execute('''
    SELECT device_id, wafer_id, die_x, die_y
    FROM devices
    WHERE device_id IN (
        SELECT DISTINCT device_id 
        FROM test_results 
        WHERE pass_fail = 'FAIL'
    )
    LIMIT 10
''')
results = cursor.fetchall()
print(f"   Devices with at least one test failure:")
for row in results:
    print(f"   ❌ Device {row[0]} (Wafer {row[1]}, Die {row[2]},{row[3]})")

# Query 3: CTE - Multi-step analysis
print("\n3️⃣ CTE - Device Failure Statistics:")
cursor.execute('''
    WITH device_stats AS (
        SELECT device_id,
               COUNT(*) as total_tests,
               SUM(CASE WHEN pass_fail = 'FAIL' THEN 1 ELSE 0 END) as failures,
               ROUND(100.0 * SUM(CASE WHEN pass_fail = 'FAIL' THEN 1 ELSE 0 END) / COUNT(*), 2) as failure_rate
        FROM test_results
        GROUP BY device_id
    )
    SELECT device_id, total_tests, failures, failure_rate
    FROM device_stats
    WHERE failure_rate > 10
    ORDER BY failure_rate DESC
    LIMIT 10
''')
results = cursor.fetchall()
print(f"   High-failure devices (>10% failure rate):")
for row in results:
    print(f"   ⚠️  {row[0]}: {row[2]} failures / {row[1]} tests ({row[3]}%)")

# Query 4: Multiple CTEs - Wafer-level analytics
print("\n4️⃣ Multiple CTEs - Wafer Yield Analysis:")
cursor.execute('''
    WITH wafer_stats AS (
        SELECT d.wafer_id,
               COUNT(DISTINCT d.device_id) as num_devices,
               COUNT(*) as total_tests,
               SUM(CASE WHEN t.pass_fail = 'PASS' THEN 1 ELSE 0 END) as passes
        FROM devices d
        INNER JOIN test_results t ON d.device_id = t.device_id
        GROUP BY d.wafer_id
    ),
    wafer_yield AS (
        SELECT wafer_id, num_devices, total_tests, passes,
               ROUND(100.0 * passes / total_tests, 2) as yield_pct
        FROM wafer_stats
    )
    SELECT wafer_id, num_devices, yield_pct,
           CASE 
               WHEN yield_pct >= 95 THEN '✅ Excellent'
               WHEN yield_pct >= 90 THEN '✔️  Good'
               WHEN yield_pct >= 85 THEN '⚠️  Marginal'
               ELSE '❌ Poor'
           END as quality
    FROM wafer_yield
    ORDER BY yield_pct DESC
    LIMIT 10
''')
results = cursor.fetchall()
print(f"   {'Wafer':<10} {'Devices':<10} {'Yield%':<10} {'Quality':<15}")
print(f"   {'-'*50}")
for row in results:
    print(f"   {row[0]:<10} {row[1]:<10} {row[2]:<10} {row[3]:<15}")

# Query 5: CTE for outlier detection
print("\n5️⃣ CTE - Statistical Outlier Detection (3σ rule):")
cursor.execute('''
    WITH test_stats AS (
        SELECT test_name,
               AVG(test_value) as mean_value,
               -- Population variance: E[x^2] - (E[x])^2
               AVG(test_value * test_value) - AVG(test_value) * AVG(test_value) as variance
        FROM test_results
        GROUP BY test_name
    )
    SELECT t.device_id, t.test_name, t.test_value, s.mean_value
    FROM test_results t
    INNER JOIN test_stats s ON t.test_name = s.test_name
    -- |x - mean| > 3σ is equivalent to (x - mean)^2 > 9 * variance, which avoids needing SQRT()
    WHERE (t.test_value - s.mean_value) * (t.test_value - s.mean_value) > 9 * s.variance
    ORDER BY t.test_name, t.test_value DESC
    LIMIT 15
''')
results = cursor.fetchall()
print(f"   Statistical outliers (>3σ from mean):")
for row in results:
    deviation = abs(row[2] - row[3])
    print(f"   📊 Device {row[0]}, {row[1]}: {row[2]:.2f} (mean: {row[3]:.2f}, Δ={deviation:.2f})")

print("\n✅ Subquery and CTE queries complete!")

## 🎯 Part 6: Real-World Projects

Apply SQL fundamentals to production scenarios. Each project includes objectives, data requirements, and implementation hints.

---

### **Post-Silicon Validation Projects**

#### **1. Wafer Yield Analytics Dashboard**
**Objective:** Build SQL queries for real-time yield monitoring dashboard  
**Data:** 50M test results across 500 wafers  
**Deliverables:**
- Yield by wafer, lot, test type
- Spatial correlation (edge vs center dies)
- Test time bottleneck analysis
- Failure pareto charts

**SQL Patterns:**
```sql
-- Yield by wafer with spatial breakdown
-- (max_x and max_y are placeholders for the largest die coordinates on the wafer)
WITH spatial_yield AS (
    SELECT wafer_id, 
           CASE WHEN die_x IN (0, max_x) OR die_y IN (0, max_y) THEN 'Edge' ELSE 'Center' END as region,
           AVG(CASE WHEN pass_fail = 'PASS' THEN 1.0 ELSE 0.0 END) as yield
    FROM test_results JOIN devices USING (device_id)
    GROUP BY wafer_id, region
)
SELECT * FROM spatial_yield WHERE yield < 0.90;
```
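
One way to fill in the `max_x` and `max_y` placeholders in the pattern above is to derive them per wafer in a separate CTE. The sketch below does this on a hypothetical 3×3 die grid. It assumes a rectangular grid whose coordinates start at 0, which is a simplification, since real wafer maps are roughly circular.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE devices (device_id TEXT PRIMARY KEY, wafer_id TEXT, die_x INTEGER, die_y INTEGER)"
)
conn.execute("CREATE TABLE test_results (device_id TEXT, pass_fail TEXT)")

# Hypothetical 3x3 grid on one wafer: 8 edge dies and 1 center die
devices = [(f"D{x}{y}", "W001", x, y) for x in range(3) for y in range(3)]
conn.executemany("INSERT INTO devices VALUES (?, ?, ?, ?)", devices)

# One result per die: the three dies in column x == 0 fail, all others pass
results = [(dev_id, "FAIL" if x == 0 else "PASS") for dev_id, _, x, _ in devices]
conn.executemany("INSERT INTO test_results VALUES (?, ?)", results)

spatial_yield = conn.execute("""
    WITH wafer_bounds AS (
        -- Largest die coordinates per wafer stand in for max_x / max_y
        SELECT wafer_id, MAX(die_x) AS max_x, MAX(die_y) AS max_y
        FROM devices
        GROUP BY wafer_id
    )
    SELECT d.wafer_id,
           CASE WHEN d.die_x IN (0, b.max_x) OR d.die_y IN (0, b.max_y)
                THEN 'Edge' ELSE 'Center' END AS region,
           COUNT(*) AS num_results,
           AVG(CASE WHEN t.pass_fail = 'PASS' THEN 1.0 ELSE 0.0 END) AS yield_fraction
    FROM test_results t
    JOIN devices d ON d.device_id = t.device_id
    JOIN wafer_bounds b ON b.wafer_id = d.wafer_id
    GROUP BY d.wafer_id, region
    ORDER BY region
""").fetchall()

for row in spatial_yield:
    print(row)
```

Here 5 of the 8 edge dies pass (0.625) and the single center die passes (1.0), so only the edge region would fall below a 0.90 threshold.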

**Success Metrics:** <200ms query time for dashboard refresh, identify 5-10 low-yield wafers per day

---

#### **2. Parametric Outlier Detection System**
**Objective:** Real-time outlier detection for test parameters using statistical thresholds  
**Data:** Continuous stream of test results (10K devices/hour)  
**Deliverables:**
- 3σ outlier flagging per test
- Wafer-level outlier clustering
- Alert generation for >5 outliers on single wafer

**SQL Patterns:**
```sql
-- Compute dynamic limits from historical data
-- (PostgreSQL syntax: STDDEV, NOW(), INTERVAL; assumes test_results carries a test_date column)
WITH historical_stats AS (
    SELECT test_name, AVG(test_value) as mean, STDDEV(test_value) as stddev
    FROM test_results WHERE test_date > NOW() - INTERVAL '30 days'
    GROUP BY test_name
)
SELECT t.device_id, t.test_name, t.test_value
FROM test_results t JOIN historical_stats h ON t.test_name = h.test_name
WHERE ABS(t.test_value - h.mean) > 3 * h.stddev;
```
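
SQLite has no built-in `STDDEV`, so running this pattern against the SQLite database from this notebook needs a substitute. One option, sketched below, is to register a Python aggregate with `sqlite3.Connection.create_aggregate`. The `SampleStdDev` class and the SQL name `stddev` are hypothetical choices for illustration, and the 30-day date filter is omitted because the sample table has no date column.

```python
import math
import sqlite3

class SampleStdDev:
    """Hypothetical aggregate: sample standard deviation via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def step(self, value):
        if value is None:
            return  # ignore NULLs, like built-in aggregates do
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        if self.n < 2:
            return None  # undefined for fewer than two samples
        return math.sqrt(self.m2 / (self.n - 1))

conn = sqlite3.connect(":memory:")
conn.create_aggregate("stddev", 1, SampleStdDev)  # name, number of arguments, class

conn.execute("CREATE TABLE test_results (device_id TEXT, test_name TEXT, test_value REAL)")
# 19 devices at 2000 MHz and one at 2100 MHz, which should be flagged
values = [2000.0] * 19 + [2100.0]
conn.executemany(
    "INSERT INTO test_results VALUES (?, ?, ?)",
    [(f"DEV{i:05d}", "Freq_Max", v) for i, v in enumerate(values)],
)

outliers = conn.execute("""
    WITH historical_stats AS (
        SELECT test_name, AVG(test_value) AS mean, stddev(test_value) AS sd
        FROM test_results
        GROUP BY test_name
    )
    SELECT t.device_id, t.test_value
    FROM test_results t
    JOIN historical_stats h ON t.test_name = h.test_name
    WHERE ABS(t.test_value - h.mean) > 3 * h.sd
""").fetchall()

print(outliers)  # [('DEV00019', 2100.0)]
```

A Python-level aggregate is slower than a native one, so for large tables the squared-deviation comparison against the variance, or a database with native `STDDEV`, may be preferable.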

**Success Metrics:** Flag outliers within 1 min of test completion, <2% false positive rate

---

#### **3. Test Time Optimization Analyzer**
**Objective:** Identify slow tests and optimize test flow sequence  
**Data:** Test execution logs with timestamps and sequence info  
**Deliverables:**
- Test time distribution analysis
- Bottleneck identification (top 10 slowest tests)
- Correlation between test time and failure rate

**SQL Patterns:**
```sql
-- Find slowest tests contributing 80% of total time
WITH test_time_summary AS (
    SELECT test_name, SUM(test_time_ms) as total_time,
           SUM(SUM(test_time_ms)) OVER () as grand_total
    FROM test_results
    GROUP BY test_name
)
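SELECT test_name, total_time,
       ROUND(100.0 * total_time / grand_total, 2) as pct_contribution,
       -- A column alias cannot be reused in the same SELECT, so accumulate total_time directly
       ROUND(100.0 * SUM(total_time) OVER (ORDER BY total_time DESC) / grand_total, 2) as cumulative_pct
FROM test_time_summary
ORDER BY total_time DESC;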
```
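
**Runnable sketch (SQLite):** A small Pareto example, assuming SQLite 3.25 or newer (the first release with window functions). The test names and times are made up. The cumulative percentage is then used in Python to pick the tests that account for the first 80% of total time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (test_name TEXT, test_time_ms REAL)")
conn.executemany(
    "INSERT INTO test_results VALUES (?, ?)",
    [("Freq", 250.0), ("Freq", 250.0), ("Leakage", 300.0), ("Idd", 150.0), ("Vdd", 50.0)],
)

query = """
WITH test_time_summary AS (
    SELECT test_name, SUM(test_time_ms) AS total_time
    FROM test_results
    GROUP BY test_name
)
SELECT test_name,
       total_time,
       ROUND(100.0 * total_time / SUM(total_time) OVER (), 2) AS pct_contribution,
       ROUND(100.0 * SUM(total_time) OVER (ORDER BY total_time DESC)
                   / SUM(total_time) OVER (), 2) AS cumulative_pct
FROM test_time_summary
ORDER BY total_time DESC
"""
pareto = conn.execute(query).fetchall()

# Tests whose running share stays within the first 80% of total test time.
top_80 = [name for name, _total, _pct, cumulative in pareto if cumulative <= 80.0]
print(top_80)
```

If two tests have identical totals, the default window frame treats them as peers and gives both the same cumulative value; adding `test_name` as a tiebreaker in the `ORDER BY` avoids that.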

**Success Metrics:** Reduce test time by 20% by optimizing 3-5 slowest tests

---

#### **4. Multi-Site Test Correlation**
**Objective:** Correlate test results across wafer test, final test, and system-level test  
**Data:** 3 databases (wafer_test, final_test, system_test)  
**Deliverables:**
- JOIN across test stages
- Failure escape rate calculation
- Root cause analysis (which wafer test predicts system failure)

**SQL Patterns:**
```sql
-- Find devices that passed wafer test but failed final test
SELECT wt.device_id, wt.test_name as wafer_test, ft.test_name as final_test
FROM wafer_test wt
JOIN final_test ft ON wt.device_id = ft.device_id
WHERE wt.pass_fail = 'PASS' AND ft.pass_fail = 'FAIL'
ORDER BY ft.test_name;
```
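
**Runnable sketch (SQLite):** The query above lists row-level pass/fail pairs. For the escape-rate deliverable, one possible approach is to roll each stage up to one status per device first. The sketch assumes both stage tables live in the same database (separate database files would need `ATTACH DATABASE`) and uses invented device IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wafer_test (device_id TEXT, test_name TEXT, pass_fail TEXT);
    CREATE TABLE final_test (device_id TEXT, test_name TEXT, pass_fail TEXT);
""")
conn.executemany("INSERT INTO wafer_test VALUES (?, ?, ?)", [
    ("D1", "Vdd", "PASS"), ("D1", "Idd", "PASS"),
    ("D2", "Vdd", "PASS"),
    ("D3", "Vdd", "FAIL"),          # screened out at wafer test
    ("D4", "Vdd", "PASS"),
    ("D5", "Vdd", "PASS"),
])
conn.executemany("INSERT INTO final_test VALUES (?, ?, ?)", [
    ("D1", "Freq", "PASS"),
    ("D2", "Freq", "FAIL"),         # escape: passed wafer test, failed final test
    ("D4", "Freq", "PASS"),
    ("D5", "Freq", "PASS"),
])

query = """
WITH wafer_pass AS (                 -- devices with no failing wafer test
    SELECT device_id
    FROM wafer_test
    GROUP BY device_id
    HAVING SUM(CASE WHEN pass_fail = 'FAIL' THEN 1 ELSE 0 END) = 0
),
final_status AS (                    -- 1 if the device failed any final test
    SELECT device_id,
           MAX(CASE WHEN pass_fail = 'FAIL' THEN 1 ELSE 0 END) AS failed_final
    FROM final_test
    GROUP BY device_id
)
SELECT COUNT(*) AS passed_wafer,
       SUM(f.failed_final) AS escapes,
       1.0 * SUM(f.failed_final) / COUNT(*) AS escape_rate
FROM wafer_pass w
JOIN final_status f USING (device_id)
"""
passed_wafer, escapes, escape_rate = conn.execute(query).fetchone()
print(passed_wafer, escapes, escape_rate)
```

Here four devices pass wafer test and one of them fails final test, so the escape rate is 0.25.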

**Success Metrics:** Identify 3-5 wafer tests that correlate with final test failures, reduce escape rate 30%

---

### **General Data Analytics Projects**

#### **5. E-Commerce Sales Analytics**
**Objective:** Build SQL queries for sales dashboard  
**Data:** Orders, customers, products, reviews  
**Deliverables:** Revenue by category, customer lifetime value, product rankings

**SQL Patterns:**
```sql
WITH customer_revenue AS (
    SELECT customer_id, SUM(order_total) as lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, lifetime_value,
       NTILE(10) OVER (ORDER BY lifetime_value DESC) as decile
FROM customer_revenue;
```

---

#### **6. Web Analytics User Funnel**
**Objective:** Analyze user conversion funnel (visit → signup → purchase)  
**Data:** User events table (timestamps, event types)  
**Deliverables:** Conversion rates per stage, drop-off analysis

**SQL Patterns:**
```sql
WITH funnel AS (
    SELECT user_id,
           MAX(CASE WHEN event_type = 'visit' THEN 1 ELSE 0 END) as visited,
           MAX(CASE WHEN event_type = 'signup' THEN 1 ELSE 0 END) as signed_up,
           MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchased
    FROM user_events
    GROUP BY user_id
)
SELECT 
    SUM(visited) as total_visits,
    SUM(signed_up) as total_signups,
    SUM(purchased) as total_purchases,
    ROUND(100.0 * SUM(signed_up) / SUM(visited), 2) as signup_rate,
    ROUND(100.0 * SUM(purchased) / SUM(signed_up), 2) as purchase_rate
FROM funnel;
```

---

#### **7. Financial Transaction Fraud Detection**
**Objective:** Identify suspicious transactions using SQL queries  
**Data:** Transactions table (amount, timestamp, merchant, user)  
**Deliverables:** Anomaly flags, high-velocity alerts, unusual patterns

**SQL Patterns:**
```sql
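-- Detect multiple transactions in short time window (PostgreSQL syntax)
SELECT user_id, COUNT(*) as txn_count,
       EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) as time_window_sec
FROM transactions
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY user_id
-- PostgreSQL does not allow SELECT aliases in HAVING, so the expression is repeated
HAVING COUNT(*) > 5
   AND EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) < 300;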
```
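
**Runnable sketch (SQLite):** The same velocity check adapted to SQLite, assuming timestamps are stored as ISO-8601 text. `strftime('%s', ...)` converts them to epoch seconds, and a fixed reference time is passed in as `:now` so the result is reproducible. The users and amounts are invented.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id TEXT, amount REAL, timestamp TEXT)")

start = datetime(2024, 1, 1, 12, 0, 0)
# U1: six transactions 30 s apart (150 s span), which should be flagged.
rows = [("U1", 20.0, (start + timedelta(seconds=30 * i)).isoformat(sep=" ")) for i in range(6)]
# U2: six transactions 10 min apart (50 min span), which should not be flagged.
rows += [("U2", 20.0, (start + timedelta(minutes=10 * i)).isoformat(sep=" ")) for i in range(6)]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

query = """
SELECT user_id,
       COUNT(*) AS txn_count,
       CAST(strftime('%s', MAX(timestamp)) AS INTEGER)
         - CAST(strftime('%s', MIN(timestamp)) AS INTEGER) AS time_window_sec
FROM transactions
WHERE timestamp > datetime(:now, '-1 hour')
GROUP BY user_id
HAVING COUNT(*) > 5 AND time_window_sec < 300   -- SQLite accepts the alias here
"""
flagged = conn.execute(query, {"now": "2024-01-01 12:55:00"}).fetchall()
print(flagged)
```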

---

#### **8. Healthcare Patient Readmission Analysis**
**Objective:** Predict 30-day readmission risk using SQL analytics  
**Data:** Admissions, diagnoses, procedures  
**Deliverables:** Readmission rate by diagnosis, high-risk patient cohorts

**SQL Patterns:**
```sql
-- DATEDIFF(later, earlier) is MySQL syntax; PostgreSQL can subtract dates directly
WITH readmissions AS (
    SELECT a1.patient_id, a1.discharge_date, a2.admit_date,
           DATEDIFF(a2.admit_date, a1.discharge_date) as days_to_readmit
    FROM admissions a1
    JOIN admissions a2 ON a1.patient_id = a2.patient_id
    WHERE a2.admit_date > a1.discharge_date
)
SELECT patient_id, COUNT(*) as readmit_count
FROM readmissions
WHERE days_to_readmit <= 30
GROUP BY patient_id
ORDER BY readmit_count DESC;
```
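
**Runnable sketch (SQLite):** The self-join above pairs each discharge with every later admission. A possible alternative, sketched below under the assumption of SQLite 3.25+ and ISO-8601 date strings, uses `LEAD` to look only at the next admission per patient and `julianday` for the day difference. Patient IDs and dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (patient_id TEXT, admit_date TEXT, discharge_date TEXT)")
conn.executemany("INSERT INTO admissions VALUES (?, ?, ?)", [
    ("P1", "2024-01-01", "2024-01-05"),
    ("P1", "2024-01-20", "2024-01-22"),   # 15 days after previous discharge
    ("P1", "2024-04-01", "2024-04-03"),   # 70 days after previous discharge
    ("P2", "2024-02-01", "2024-02-03"),
    ("P2", "2024-03-20", "2024-03-25"),   # 46 days after previous discharge
])

query = """
WITH ordered AS (
    SELECT patient_id,
           discharge_date,
           LEAD(admit_date) OVER (PARTITION BY patient_id ORDER BY admit_date) AS next_admit
    FROM admissions
)
SELECT patient_id, COUNT(*) AS readmit_count
FROM ordered
WHERE next_admit IS NOT NULL
  AND julianday(next_admit) - julianday(discharge_date) <= 30
GROUP BY patient_id
ORDER BY readmit_count DESC
"""
readmits = conn.execute(query).fetchall()
print(readmits)
```

Only P1's second stay falls within 30 days of a prior discharge, so P1 is the only patient returned.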

---

**Next Steps:** Choose 1-2 projects, implement SQL queries, visualize results with matplotlib/Plotly

## 🔧 Part 7: Best Practices & Optimization

### **Performance Optimization**

#### **1. Indexes - The Performance Multiplier**

Indexes dramatically speed up queries by creating sorted data structures for fast lookups.

**When to Create Indexes:**
- ✅ Columns in WHERE clauses (e.g., `WHERE device_id = 'DEV123'`)
- ✅ Columns in JOIN conditions (e.g., `JOIN ON device_id`)
- ✅ Columns in ORDER BY (e.g., `ORDER BY test_date DESC`)
- ✅ Foreign keys (always index foreign keys!)

**Index Types:**
```sql
-- B-Tree index (default, good for equality and range queries)
CREATE INDEX idx_device_id ON test_results(device_id);

-- Composite index (multiple columns, order matters!)
CREATE INDEX idx_device_test ON test_results(device_id, test_name);

-- Unique index (enforce uniqueness + performance)
CREATE UNIQUE INDEX idx_device_unique ON devices(device_id);

-- Partial index (PostgreSQL and SQLite; indexes only the rows matching the WHERE clause)
CREATE INDEX idx_failures ON test_results(device_id) WHERE pass_fail = 'FAIL';
```

**Index Costs:**
- ❌ Slower INSERT/UPDATE/DELETE (index must be updated)
- ❌ Storage overhead (indexes consume disk space)
- **Rule:** Index columns used in WHERE/JOIN, but avoid over-indexing

**Performance Example:**
- **Without index:** 50M rows → full table scan → 45 seconds
- **With index:** 50M rows → B-tree lookup → 85ms (500× faster!)
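
**Runnable sketch (SQLite):** Rather than timing queries, a quick way to confirm that an index is actually used is to inspect the query plan. The sketch below assumes SQLite's `EXPLAIN QUERY PLAN`, whose human-readable text sits in the fourth column and varies slightly between SQLite versions. The table contents and the `plan` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (id INTEGER PRIMARY KEY, device_id TEXT, test_value REAL)")
conn.executemany(
    "INSERT INTO test_results (device_id, test_value) VALUES (?, ?)",
    [(f"DEV{i:05d}", i * 0.001) for i in range(10_000)],
)

def plan(sql):
    """Return the detail text of EXPLAIN QUERY PLAN for the given statement."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

lookup = "SELECT test_value FROM test_results WHERE device_id = 'DEV00042'"

plan_before = plan(lookup)   # expected to report a full table SCAN
conn.execute("CREATE INDEX idx_device_id ON test_results(device_id)")
plan_after = plan(lookup)    # expected to report a SEARCH using idx_device_id

print("before:", plan_before)
print("after: ", plan_after)
```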

---

#### **2. Query Optimization Patterns**

**Pattern 1: Avoid `SELECT *`**
```sql
-- ❌ Slow: Returns all columns
SELECT * FROM test_results WHERE device_id = 'DEV123';

-- ✅ Fast: Returns only needed columns
SELECT test_name, test_value FROM test_results WHERE device_id = 'DEV123';
```

**Pattern 2: Filter Early (WHERE Before JOIN)**
```sql
-- ❌ Filter written after the JOIN (most modern planners push it down, so it is often just as fast)
SELECT * FROM devices d JOIN test_results t ON d.device_id = t.device_id
WHERE t.pass_fail = 'FAIL';

-- ✅ Explicit pre-filter: the JOIN only ever sees failures; confirm any gain with the query plan
SELECT * FROM devices d 
JOIN (SELECT * FROM test_results WHERE pass_fail = 'FAIL') t 
ON d.device_id = t.device_id;
```

**Pattern 3: Use LIMIT for Exploration**
```sql
-- Always use LIMIT when exploring large tables
SELECT * FROM test_results LIMIT 100;
```

**Pattern 4: Aggregations on Indexed Columns**
```sql
-- Faster if device_id is indexed
SELECT device_id, COUNT(*) FROM test_results GROUP BY device_id;
```

---

#### **3. SQL vs Pandas - Decision Guide**

| **Use SQL When...** | **Use Pandas When...** |
|---|---|
| ✅ Data > 1GB | ✅ Data < 500MB |
| ✅ Filtering/aggregation | ✅ Complex transformations |
| ✅ Multi-table JOINs | ✅ Machine learning pipelines |
| ✅ Real-time dashboards | ✅ Exploratory analysis |
| ✅ Production queries | ✅ Ad-hoc investigations |

**Hybrid Approach (Best Practice):**
```python
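import pandas as pd

# 1. Use SQL to filter large dataset (wafer_id lives in devices, so join to it)
query = """
    SELECT t.*
    FROM test_results t JOIN devices d USING (device_id)
    WHERE t.pass_fail = 'FAIL' AND d.wafer_id = ?
"""
df = pd.read_sql(query, conn, params=("W005",))  # Returns 5K rows instead of 50M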

# 2. Use pandas for complex analysis
df['z_score'] = (df['test_value'] - df['test_value'].mean()) / df['test_value'].std()
outliers = df[df['z_score'].abs() > 3]
```

**Performance Rule:**
- SQL for filtering: 50M → 5K rows in <100ms
- Pandas for analysis: 5K rows → complex transformations in <50ms
- Total: <150ms vs 45 seconds (300× faster!)

---

#### **4. Production SQL Patterns**

**Connection Pooling (Avoid Opening/Closing Connections):**
```python
from sqlalchemy import create_engine, text

# Create engine once, reuse for all queries
engine = create_engine('postgresql://user:pass@host:5432/db', pool_size=10)

# Use context manager for connections
# (SQLAlchemy 2.0 no longer accepts plain strings; wrap raw SQL in text())
with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM test_results LIMIT 10"))
```

**Parameterized Queries (Security + Performance):**
```python
# ❌ SQL Injection Risk
query = f"SELECT * FROM devices WHERE device_id = '{user_input}'"

# ✅ Safe and cacheable (`?` is the sqlite3 placeholder; other drivers may use %s or :name)
query = "SELECT * FROM devices WHERE device_id = ?"
cursor.execute(query, (user_input,))
```
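
**Runnable sketch (SQLite):** To make the risk concrete, the sketch below feeds the same hostile string to both query styles against a tiny in-memory table. The table contents and the input string are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (device_id TEXT PRIMARY KEY, wafer_id TEXT)")
conn.executemany(
    "INSERT INTO devices VALUES (?, ?)",
    [("DEV001", "W001"), ("DEV002", "W001"), ("DEV003", "W002")],
)

user_input = "nope' OR '1'='1"   # hostile input that closes the quote and adds OR

# Unsafe: the input becomes part of the SQL text, so the OR clause matches every row.
unsafe_query = f"SELECT device_id FROM devices WHERE device_id = '{user_input}'"
unsafe_rows = conn.execute(unsafe_query).fetchall()

# Safe: the driver binds the input as a single value; no device has that literal ID.
safe_rows = conn.execute(
    "SELECT device_id FROM devices WHERE device_id = ?", (user_input,)
).fetchall()

# Named placeholders are also supported by sqlite3 and read better with many parameters.
named_rows = conn.execute(
    "SELECT device_id FROM devices WHERE wafer_id = :wafer", {"wafer": "W002"}
).fetchall()

print(len(unsafe_rows), safe_rows, named_rows)
```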

**Transactions for Data Integrity:**
```python
try:
    conn.execute("BEGIN TRANSACTION")
    conn.execute("INSERT INTO devices VALUES (...)")
    conn.execute("INSERT INTO test_results VALUES (...)")
    conn.execute("COMMIT")
except Exception as e:
    conn.execute("ROLLBACK")
    print(f"Transaction failed: {e}")
```
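
**Runnable sketch (SQLite):** With Python's `sqlite3` module, one idiomatic alternative to explicit `BEGIN`/`COMMIT`/`ROLLBACK` statements is to use the connection as a context manager, which commits on success and rolls back if the block raises. The schema and the `record_device` helper below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE devices (device_id TEXT PRIMARY KEY);
    CREATE TABLE test_results (device_id TEXT NOT NULL, test_value REAL NOT NULL);
""")

def record_device(device_id, test_value):
    """Insert a device and its first result atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute("INSERT INTO devices VALUES (?)", (device_id,))
            conn.execute("INSERT INTO test_results VALUES (?, ?)", (device_id, test_value))
        return True
    except sqlite3.IntegrityError as exc:
        print(f"Transaction failed: {exc}")
        return False

ok = record_device("DEV001", 1.2)
failed = record_device("DEV002", None)   # NOT NULL violation: both inserts are rolled back

device_count = conn.execute("SELECT COUNT(*) FROM devices").fetchone()[0]
print(ok, failed, device_count)
```

Because the second call fails on its second insert, the `devices` row for `DEV002` is rolled back as well, leaving one device in the table.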

---

### **Common Pitfalls to Avoid**

1. **N+1 Query Problem:**
   - ❌ Loop querying for each device
   - ✅ Single JOIN query for all devices

2. **Cartesian Product (CROSS JOIN):**
   - ❌ Forgot ON clause in JOIN → 1M × 50M = 50 trillion rows!
   - ✅ Always specify JOIN condition

3. **Unindexed Foreign Keys:**
   - ❌ 45-second queries on large tables
   - ✅ Index all foreign keys

4. **Large OFFSET:**
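   - ❌ `LIMIT 100 OFFSET 1000000` → reads and discards 1M rows before returning 100
   - ✅ Use keyset (cursor-based) pagination: filter on the last seen key, e.g. `WHERE id > :last_id ORDER BY id LIMIT 100`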
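
**Runnable sketch (SQLite):** A minimal keyset-pagination loop for pitfall 4, assuming an integer primary key `id` that increases with insertion order. Each page resumes from the last `id` seen instead of skipping rows with `OFFSET`. The table contents and the `fetch_page` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (id INTEGER PRIMARY KEY, device_id TEXT)")
conn.executemany(
    "INSERT INTO test_results (id, device_id) VALUES (?, ?)",
    [(i, f"DEV{i:04d}") for i in range(1, 251)],
)

def fetch_page(last_seen_id, page_size=100):
    """Return the next page of rows with id greater than last_seen_id."""
    return conn.execute(
        "SELECT id, device_id FROM test_results WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()

page_sizes = []
last_id = 0
while True:
    page = fetch_page(last_id)
    if not page:
        break
    page_sizes.append(len(page))
    last_id = page[-1][0]   # resume after the last row of this page

print(page_sizes, last_id)
```

Since `id` is the primary key, each page is an index range lookup whose cost does not grow with how deep into the table the reader has paged.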

---

**Next Steps:** Review your queries, add indexes, and measure performance with `EXPLAIN ANALYZE` (PostgreSQL) or `EXPLAIN QUERY PLAN` (SQLite)

## 🎓 Part 8: Key Takeaways & Next Steps

### **What You've Learned**

✅ **Database Fundamentals**
- Relational database concepts (tables, rows, columns, keys)
- SQLite setup and in-memory databases
- CREATE TABLE, INSERT, data types

✅ **Data Retrieval (SELECT)**
- Basic SELECT syntax with WHERE, ORDER BY, LIMIT
- DISTINCT for unique values
- Filtering with numeric and string conditions

✅ **Aggregations & Analytics**
- COUNT, AVG, SUM, MIN, MAX functions
- GROUP BY for grouping data
- HAVING for filtering groups
- Yield calculations and test time analytics

✅ **Multi-Table Queries (JOINs)**
- INNER JOIN for matching rows
- LEFT JOIN for all rows from left table
- Spatial analysis (edge vs center dies)
- Data quality checks

✅ **Advanced Queries**
- Scalar, IN, and correlated subqueries
- CTEs (WITH clause) for readability
- Multiple CTEs for multi-step analysis
- Statistical outlier detection

✅ **Production Best Practices**
- Indexes for performance (B-tree, composite, partial)
- Query optimization patterns
- SQL vs pandas decision guide
- Connection pooling, parameterized queries, transactions

---

### **When to Use SQL vs Pandas**

**Choose SQL for:**
- ✅ Large datasets (>1GB)
- ✅ Filtering and aggregation
- ✅ Multi-table JOINs
- ✅ Real-time dashboards
- ✅ Production queries

**Choose Pandas for:**
- ✅ Small datasets (<500MB)
- ✅ Complex transformations
- ✅ Machine learning pipelines
- ✅ Exploratory analysis
- ✅ Ad-hoc investigations

**Hybrid Approach (Recommended):**
1. Use SQL to filter large dataset (50M → 5K rows)
2. Use pandas for complex analysis (5K rows → ML model)
3. Result: 300× faster than pandas-only approach

---

### **Post-Silicon Validation Impact**

**Real-World Results:**
- **AMD:** 50M test results → yield by wafer in <200ms → identify 12 low-yield wafers per week → $2M savings
- **NVIDIA:** Spatial correlation analysis → edge dies 15% lower yield → scrap edge dies → $5M savings
- **Qualcomm:** Test time optimization → identify 3 slow tests → reduce test time 25% → $8M savings
- **Intel:** Multi-site correlation → wafer test predicts final test failures → reduce escape rate 30% → $15M savings

**Key Value Drivers:**
- ⚡ **Speed:** SQL queries 10-500× faster than pandas for large data
- 📊 **Scalability:** Handle 50M+ records with <200ms response time
- 🔍 **Insights:** Multi-dimensional analytics (spatial + parametric + temporal)
- 💰 **Cost Savings:** Identify yield issues early → scrap before packaging

---

### **Common SQL Patterns for Post-Silicon**

```sql
-- Pattern 1: Yield by wafer
SELECT wafer_id, 
       100.0 * SUM(CASE WHEN pass_fail = 'PASS' THEN 1 ELSE 0 END) / COUNT(*) as yield_pct
FROM test_results JOIN devices USING (device_id)
GROUP BY wafer_id
HAVING yield_pct < 90;  -- alias in HAVING works in SQLite/MySQL; PostgreSQL needs the full expression

-- Pattern 2: Spatial correlation (assumes max_x, max_y grid-extent columns on devices)
SELECT 
    CASE WHEN die_x IN (0, max_x) OR die_y IN (0, max_y) THEN 'Edge' ELSE 'Center' END as region,
    AVG(CASE WHEN pass_fail = 'PASS' THEN 1.0 ELSE 0.0 END) as yield
FROM test_results JOIN devices USING (device_id)
GROUP BY region;

-- Pattern 3: Outlier detection (STDDEV exists in PostgreSQL/MySQL but is not built into SQLite)
WITH stats AS (
    SELECT test_name, AVG(test_value) as mean, STDDEV(test_value) as stddev
    FROM test_results GROUP BY test_name
)
SELECT t.device_id, t.test_name, t.test_value
FROM test_results t JOIN stats s ON t.test_name = s.test_name
WHERE ABS(t.test_value - s.mean) > 3 * s.stddev;

-- Pattern 4: Test time bottleneck
SELECT test_name, SUM(test_time_ms) as total_time,
       SUM(SUM(test_time_ms)) OVER () as grand_total,
       100.0 * SUM(test_time_ms) / SUM(SUM(test_time_ms)) OVER () as pct_contribution
FROM test_results
GROUP BY test_name
ORDER BY total_time DESC;
```

---

### **Next Steps in Your Learning Journey**

**Immediate Next (Notebook 004: Advanced SQL):**
- Window functions (ROW_NUMBER, RANK, LAG, LEAD)
- Recursive CTEs
- Query optimization with EXPLAIN ANALYZE
- JSON/array operations in PostgreSQL

**Prerequisite Check:**
- ✅ Notebook 001: Python DSA Mastery
- ✅ Notebook 002: Python Advanced Concepts
- ✅ Notebook 003: SQL Fundamentals (this notebook)

**Recommended Path:**
1. **004: Advanced SQL** - Window functions, recursive CTEs, optimization
2. **010: Linear Regression** - Apply SQL for data loading + preprocessing
3. **091+: Data Engineering** - SQL at scale with Spark SQL, distributed databases

---

### **Resources for Further Learning**

**Practice Platforms:**
- LeetCode SQL problems (175+ questions)
- HackerRank SQL challenges
- Mode Analytics SQL tutorial
- PostgreSQL exercises (pgexercises.com)

**Production Databases:**
- PostgreSQL (most feature-rich, open-source)
- MySQL (web applications)
- SQLite (embedded, mobile, testing)
- SQL Server (Microsoft ecosystem)
- Oracle (enterprise)

**SQL for Big Data:**
- Spark SQL (distributed data processing)
- Presto/Trino (query data lakes)
- BigQuery (Google Cloud)
- Redshift (AWS data warehouse)

---

### **Final Thoughts**

SQL is the **universal language of data**. Whether you're building ML models, analyzing wafer test data, or creating dashboards, SQL skills are essential.

**Key Mindset:**
- SQL for **filtering and aggregation** (speed + scalability)
- Pandas for **complex transformations** (flexibility + ML integration)
- Hybrid approach for **production systems** (best of both worlds)

**Next Action:** Open notebook 004 (Advanced SQL) and continue your mastery journey! 🚀

---

**Notebook Complete!** 🎉

You now have SQL fundamentals for data querying, aggregation, JOINs, and production best practices. Apply these skills to your post-silicon validation projects and ML pipelines.