# Data Processing Pipeline Demo

This notebook demonstrates the complete data processing pipeline for the Tech Career Intelligence Platform.

## What This Notebook Shows

1. **Raw Data Loading**: Load the original Lightcast job postings CSV
2. **Data Inspection**: View the first 5 rows of raw data
3. **Processed Data Loading**: Load the clean, processed Parquet data
4. **Data Comparison**: Compare raw vs processed data transformations
5. **Pipeline Summary**: Understand the complete transformation flow

## Pipeline Overview

```
Raw CSV (UPPERCASE columns)
    ↓
Standardize columns (UPPERCASE → snake_case)
    ↓
Decode locations (base64 → plain text)
    ↓
Compute salary_avg (from min/max with imputation)
    ↓
Validate & clean data
    ↓
Processed Parquet (snake_case columns, clean data)
```
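
The first three steps of this flow can be sketched on a single toy record. This is a minimal illustration, not the project's implementation: the helper names (`to_snake_case`, `decode_location`, `compute_salary_avg`), the raw column names `SALARY_FROM` and `SALARY_TO`, and the imputation rule (fall back to whichever bound is present) are all assumptions made for the example.

```python
import base64
import binascii
import re
from typing import Optional


def to_snake_case(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def decode_location(value):
    """Decode a base64 string to text; return anything else unchanged."""
    if not isinstance(value, str):
        return value  # missing values (None, NaN) pass through
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return value  # not valid base64 text, keep the original


def compute_salary_avg(low: Optional[float], high: Optional[float]) -> Optional[float]:
    """Average both bounds; if one is missing, use the other (assumed rule)."""
    if low is not None and high is not None:
        return (low + high) / 2
    return low if low is not None else high


# Hypothetical raw record with UPPERCASE column names
raw_row = {
    "CITY_NAME": base64.b64encode(b"Austin, TX").decode("ascii"),
    "SALARY_FROM": 90000.0,
    "SALARY_TO": None,
}

row = {to_snake_case(key): value for key, value in raw_row.items()}
row["city_name"] = decode_location(row["city_name"])
row["salary_avg"] = compute_salary_avg(row["salary_from"], row["salary_to"])
print(row)
```

With pandas, the same helpers could be applied through `df.rename(columns=to_snake_case)` and `df["city_name"].map(decode_location)`.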


In [None]:
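# Import necessary libraries
import sys
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display  # explicit import so display() is defined outside Jupyter too

# Silence library warnings to keep the demo output readable
warnings.filterwarnings('ignore')

# Add project root to path so we can import from src/
# (assumes the notebook is run from a folder directly under the project root)
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print("[OK] Libraries imported successfully")
print(f"   Project root: {project_root}")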

## Step 1: Load and Inspect Raw Data

Let's start by loading the raw CSV data and examining its structure.

In [None]:
import pandas as pd
from pathlib import Path

# Load raw data
raw_data_path = Path('../data/raw/lightcast_job_postings.csv')

if raw_data_path.exists():
    print("Loading raw data...")
    df_raw = pd.read_csv(raw_data_path)
    print(f"[OK] Loaded {len(df_raw):,} records with {len(df_raw.columns)} columns")

    # Display first 5 rows
    print("\n" + "="*80)
    print("RAW DATA - First 5 Rows")
    print("="*80)
    display(df_raw.head())

    # Show column names
    print("\n" + "="*80)
    print(f"RAW DATA - Column Names ({len(df_raw.columns)} total)")
    print("="*80)
    print("\n".join([f"{i+1:3d}. {col}" for i, col in enumerate(df_raw.columns)]))
else:
    print(f"[ERROR] Raw data not found at {raw_data_path}")
    df_raw = None

## Step 2: Load and Inspect Processed Data

Now let's load the processed Parquet data to see the transformation results.
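
The cell below reads `salary_coverage` and `salary_range['median']` from the summary dictionary returned by the loader. How the loader computes them is not shown in this notebook, so here is a hedged sketch of one way such values could be derived. The function name `summarize_salaries` is hypothetical, and missing salaries are modeled as `None` rather than pandas `NaN`.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_salaries(salaries: Sequence[Optional[float]]) -> dict:
    """Return coverage (percent of non-missing values) and the median of present values."""
    present = [s for s in salaries if s is not None]
    coverage = 100 * len(present) / len(salaries) if salaries else 0.0
    return {
        "salary_coverage": coverage,
        "salary_range": {"median": median(present) if present else None},
    }


# Toy values, not taken from the dataset
toy_summary = summarize_salaries([90000.0, None, 110000.0, 70000.0])
print(f"Salary coverage: {toy_summary['salary_coverage']:.1f}%")
print(f"Median salary: ${toy_summary['salary_range']['median']:,.0f}")
```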

In [None]:
# Load processed data using the centralized pipeline
# This follows the "Process Once, Use Many Times" architecture from DESIGN.md

from src.data.website_processor import load_and_process_data

print("Loading processed data...")
df_processed, summary = load_and_process_data()

print(f"[OK] Loaded {len(df_processed):,} records with {len(df_processed.columns)} columns")
print(f"   Salary coverage: {summary['salary_coverage']:.1f}%")
print(f"   Median salary: ${summary['salary_range']['median']:,.0f}")

# Display first 5 rows
print("\n" + "="*80)
print("PROCESSED DATA - First 5 Rows")
print("="*80)
display(df_processed.head())

# Show column names (snake_case)
print("\n" + "="*80)
print(f"PROCESSED DATA - Column Names ({len(df_processed.columns)} total, all snake_case)")
print("="*80)
print("\n".join([f"{i+1:3d}. {col}" for i, col in enumerate(df_processed.columns)]))

## Step 3: Compare Raw vs Processed Data

Let's see what transformations were applied by the data pipeline.
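
Before running the comparison, it may help to see the idea on a small example. The sketch below classifies columns by comparing lowercased raw names with processed names. The function name `compare_columns` and the column lists are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def compare_columns(raw_cols: Iterable[str], processed_cols: Iterable[str]) -> dict:
    """Classify columns by comparing lowercased raw names with processed names."""
    raw_lower = {c.lower() for c in raw_cols}
    processed = list(processed_cols)
    return {
        "carried_over": [c for c in processed if c in raw_lower],
        "new_or_renamed": [c for c in processed if c not in raw_lower],
        "dropped": sorted(raw_lower - set(processed)),
    }


# Hypothetical column names for illustration
result = compare_columns(
    ["ID", "TITLE_NAME", "CITY_NAME", "SALARY_FROM", "SALARY_TO"],
    ["id", "title", "city_name", "salary_avg"],
)
print(result)
```

A comparison by name alone cannot tell a renamed column from a derived one: in this toy input, `title` appears under `new_or_renamed` while `title_name` appears under `dropped`.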

In [None]:
# Data Transformation Summary

if df_raw is not None:
    print("="*80)
    print("RAW vs PROCESSED DATA COMPARISON")
    print("="*80)

    print(f"\n[DATA] Record Count:")
    print(f"   Raw:       {len(df_raw):,}")
    print(f"   Processed: {len(df_processed):,}")
    print(f"   Retained:  {len(df_processed)/len(df_raw)*100:.1f}%")

    print("\n[DATA] Column Count:")
    print(f"   Raw:       {len(df_raw.columns)}")
    print(f"   Processed: {len(df_processed.columns)}")
    print(f"   Change:    {len(df_processed.columns) - len(df_raw.columns):+d}")

    print(f"\n🏷️  Column Naming:")
    print(f"   Raw:       UPPERCASE (e.g., {df_raw.columns[0]})")
    print(f"   Processed: snake_case (e.g., {df_processed.columns[0]})")

    # Show key transformations
    print(f"\n🔄 Key Transformations:")

    # Report a mapping only when the source column exists in the raw data
    # and the target column exists in the processed data
    if 'SALARY_AVG' in df_raw.columns and 'salary_avg' in df_processed.columns:
        print("   [OK] Salary: SALARY_AVG → salary_avg (computed & validated)")

    if 'CITY_NAME' in df_raw.columns and 'city_name' in df_processed.columns:
        print("   [OK] Location: CITY_NAME → city_name (decoded & cleaned)")

    if 'TITLE_NAME' in df_raw.columns and 'title' in df_processed.columns:
        print("   [OK] Title: TITLE_NAME → title (standardized)")

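    print("\n✨ Derived Columns Created:")
    # Processed columns with no case-insensitive name match in the raw data;
    # renamed columns (e.g., TITLE_NAME → title) are listed here too
    raw_cols_lower = {c.lower() for c in df_raw.columns}
    derived_cols = [col for col in df_processed.columns if col not in raw_cols_lower]
    if derived_cols:
        for col in derived_cols[:10]:  # Show first 10
            print(f"   • {col}")
        if len(derived_cols) > 10:
            print(f"   ... and {len(derived_cols) - 10} more")
    else:
        print("   (none)")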

    print("\n" + "="*80)
else:
    print("[WARNING] Raw data not available for comparison")

## Summary

This notebook demonstrates the data processing pipeline that transforms raw Lightcast job postings into clean, analysis-ready data.

### Key Takeaways

1. **Column Standardization**: All UPPERCASE columns converted to snake_case
2. **Data Quality**: Salary validation, location decoding, and null handling
3. **Derived Columns**: New features created for analysis (experience_avg, etc.)
4. **Efficient Storage**: Parquet format for fast loading (117.8 MB compressed)
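
The first two takeaways can be spot-checked with a couple of small helpers. The sketch below is an assumption-laden illustration, not part of the pipeline: `non_snake_case` and `null_rates` are hypothetical names, the snake_case pattern is one reasonable definition among several, and rows are plain dictionaries with `None` for missing values.

```python
import re
from typing import Dict, List

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case(columns: List[str]) -> List[str]:
    """Return the column names that do not match the snake_case pattern."""
    return [c for c in columns if not SNAKE_CASE.match(c)]


def null_rates(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Return the percentage of missing (None) values per column."""
    if not rows:
        return {}
    return {
        col: 100 * sum(row.get(col) is None for row in rows) / len(rows)
        for col in rows[0].keys()
    }


# Toy rows, not taken from the dataset
rows = [
    {"title": "Data Engineer", "city_name": "Austin, TX", "salary_avg": 95000.0},
    {"title": "ML Engineer", "city_name": None, "salary_avg": None},
]
bad_names = non_snake_case(list(rows[0].keys()) + ["SALARY_AVG"])
rates = null_rates(rows)
print(bad_names)
print(rates)
```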

### Next Steps

- Explore `ml_feature_engineering_lab.ipynb` for machine learning models
- Check `job_market_skill_analysis.ipynb` for NLP and skills analysis
- View the Quarto website for interactive dashboards

---

**For more details, see:**
- `DESIGN.md` - Technical design and pipeline architecture
- `ARCHITECTURE.md` - System architecture with Mermaid diagrams
- `README.md` - Project overview and setup instructions
