# Data Processing Pipeline Demo

This notebook demonstrates the complete data processing pipeline for the Tech Career Intelligence Platform.

## What This Notebook Shows

1. **Raw Data Loading with PySpark**: Load the original Lightcast job postings CSV (72K rows)
2. **Data Inspection**: View schema and sample data from PySpark DataFrame
3. **Processed Data Loading with Pandas**: Load the clean, processed Parquet data
4. **Data Comparison**: Compare the raw and processed data to see which transformations were applied
5. **Pipeline Summary**: Understand the complete transformation flow

## Technology Stack

| Stage | Technology | Reason |
|-------|-----------|--------|
| **Raw Data ETL** | PySpark | Scalable for large CSV files (millions of rows) |
| **Storage** | Parquet | Compressed, columnar, fast reads |
| **Analysis** | Pandas | Rich API, fast for processed data (<100K rows) |

## Pipeline Overview

```
Raw CSV (UPPERCASE columns, 72K rows)
 │
 ├─> PySpark ETL Pipeline:
 │   ├── Load with multiLine=True (robust CSV parsing)
 │   ├── Standardize columns (UPPERCASE → snake_case)
 │   ├── Decode locations (base64 → plain text)
 │   ├── Compute salary_avg (from salary_from/salary_to)
 │   ├── Clean employment/remote type data
 │   └── Validate & filter data
 │
 ├─> Save to Parquet (compressed, fast)
 │
 └─> Pandas Analysis (load processed Parquet)
```
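
Two of the steps above, column standardization and location decoding, can be sketched in plain Python. This is a minimal illustration of the logic only, not the project's PySpark code: `to_snake_case` and `decode_location` are hypothetical helper names, and falling back to the original value when decoding fails is an assumption.

```python
import base64
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def decode_location(value):
    """Return the decoded text if value is valid base64-encoded UTF-8.

    Assumption: values that fail to decode are already plain text and are
    returned unchanged.
    """
    if value is None:
        return None
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses
        return value


encoded = base64.b64encode("Austin, TX".encode("utf-8")).decode("ascii")
print(to_snake_case("SALARY_FROM"))   # salary_from
print(decode_location(encoded))       # Austin, TX
print(decode_location("Austin, TX"))  # Austin, TX (not valid base64, kept as-is)
```

Note that a short plain-text value can happen to be valid base64, so a real pipeline would likely need a stricter test for "is this encoded" than a decode attempt alone.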

## Key Architecture Decision

**Why not use Pandas for raw data?**
- Pandas loads the entire CSV into memory at once (72K rows × 131 columns, roughly 9.4 million cells)
- PySpark uses lazy evaluation (processes data on-demand)
- PySpark scales to millions of rows without code changes
- Same pipeline works locally or on a cluster
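
To put the memory point in numbers, here is a back-of-the-envelope estimate. The 8 bytes per cell is an assumption (a 64-bit numeric value); string columns typically cost more, so treat the result as a rough lower bound rather than a measurement.

```python
# Rough lower-bound estimate of the in-memory size of the raw table.
rows = 72_000        # approximate row count ("72K") from this notebook
columns = 131        # column count from this notebook
bytes_per_cell = 8   # assumption: every cell is a 64-bit number

cells = rows * columns
estimated_mb = cells * bytes_per_cell / 1_000_000

print(f"{cells:,} cells, at least ~{estimated_mb:.0f} MB")
# 9,432,000 cells, at least ~75 MB
```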


In [None]:
# Import necessary libraries
from pathlib import Path
import sys
import warnings

# Suppress all warnings for cleaner demo output (TODO: revisit them after the project)
warnings.filterwarnings('ignore')

# PySpark imports (for large-scale ETL)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Pandas imports (for final processed data analysis only)
import pandas as pd
import numpy as np

# Resolve the project root (the parent of the current working directory)
project_root = Path().resolve().parent

# Add the project root to sys.path so we can import from src/
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"[OK] Libraries imported successfully")
print(f"   Project root: {project_root}")
print(f"   PySpark: For raw data ETL (72K rows)")
print(f"   Pandas: For processed data analysis")

## Step 1: Load and Inspect Raw Data with PySpark

Let's start by loading the raw CSV data using **PySpark** for scalable data processing.

**Why PySpark for Raw Data?**
- **Large Dataset**: 72K rows × 131 columns (can scale to millions)
- **Memory Efficient**: Lazy evaluation, distributed processing
- **Production Ready**: Same code works on single machine or cluster
- **Robust Parsing**: `multiLine=True` handles complex CSV formatting

**Architecture Pattern**: 
```
Raw CSV → PySpark ETL → Processed Parquet → Pandas Analysis
```
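
To see why multi-line parsing matters, consider a quoted field that contains a line break. The sketch below uses Python's standard `csv` module rather than Spark, and the sample rows are invented for illustration. It shows that splitting on newlines miscounts records, while a quote-aware parser keeps each record intact.

```python
import csv
import io

# Invented sample: the BODY of record 1 spans two physical lines and
# uses doubled quotes ("") as an escaped quote character.
raw = (
    'ID,TITLE,BODY\n'
    '1,"Data Engineer","Line one\nLine two with ""quotes"""\n'
    '2,"Analyst","Single line"\n'
)

naive_lines = raw.strip().split("\n")             # counts physical lines
parsed_rows = list(csv.reader(io.StringIO(raw)))  # quote-aware parsing

print(len(naive_lines))  # 4 (header + 3 physical lines)
print(len(parsed_rows))  # 3 (header + 2 records)
print(parsed_rows[1][2])
```

Records of this shape are the reason the load cell in this step passes `multiLine=True` and `escape="\""` to `spark.read.csv`.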

In [None]:
# Step 1: Load raw data with PySpark

# Build an absolute path from project_root (defined in the previous cell)
raw_data_path = project_root / 'data' / 'raw' / 'lightcast_job_postings.csv'

print(f"[DATA] Raw data path: {raw_data_path}")
print(f"[DATA] File exists: {raw_data_path.exists()}")

if raw_data_path.exists():
    # Initialize a local Spark session (local[*] uses all available cores)
    print("\nInitializing Spark session...")
    spark = SparkSession.builder \
        .appName("RawDataInspection") \
        .master("local[*]") \
        .config("spark.driver.memory", "2g") \
        .getOrCreate()

    print("[PYSPARK] Loading raw CSV with PySpark...")
    print("          Options: multiLine=True, escape='\"', header=True, inferSchema=True")

    # Load with PySpark (distributed, lazy evaluation)
    spark_df_raw = spark.read.csv(
        str(raw_data_path),
        multiLine=True,
        escape="\"",
        header=True,
        inferSchema=True
    )

    print(f"\n[OK] Loaded {spark_df_raw.count():,} records with {len(spark_df_raw.columns)} columns")
    print(f"     Memory: PySpark uses lazy evaluation - data processed on-demand")
    print(f"     Technology: Distributed DataFrame (not Pandas)")

    # Display first 5 rows (PySpark DataFrame format)
    print("\n" + "="*80)
    print("RAW DATA - First 5 Rows (PySpark DataFrame)")
    print("="*80)
    spark_df_raw.show(5, truncate=50, vertical=False)

    # Show column names (UPPERCASE from raw CSV)
    print("\n" + "="*80)
    print(f"RAW DATA - Column Names ({len(spark_df_raw.columns)} total, UPPERCASE)")
    print("="*80)
    # Use `name` as the loop variable so it is not confused with pyspark's col()
    print("\n".join([f"{i+1:3d}. {name}" for i, name in enumerate(spark_df_raw.columns[:20])]))
    print(f"     ... and {len(spark_df_raw.columns) - 20} more columns")

    # Show schema (data types inferred by PySpark)
    print("\n" + "="*80)
    print("RAW DATA - PySpark Schema (first 10 columns)")
    print("="*80)
    for field in spark_df_raw.schema.fields[:10]:
        print(f"  {field.name:30s} {field.dataType}")
    print(f"  ... and {len(spark_df_raw.columns) - 10} more columns")

    # Show data quality stats using PySpark
    print("\n" + "="*80)
    print("RAW DATA - PySpark Data Quality Check")
    print("="*80)
    total_rows = spark_df_raw.count()
    salary_rows = spark_df_raw.filter(col("SALARY_FROM").isNotNull()).count()
    # Guard against an empty DataFrame to avoid division by zero
    salary_pct = salary_rows / total_rows * 100 if total_rows else 0.0
    print(f"  Total rows: {total_rows:,}")
    print(f"  Rows with salary: {salary_rows:,} ({salary_pct:.1f}%)")

else:
    print(f"[ERROR] Raw data not found at {raw_data_path}")
    spark_df_raw = None
    spark = None

## Step 2: Load and Inspect Processed Data

Now let's load the processed Parquet data to see the transformation results.
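
The `summary` returned alongside the DataFrame reports salary coverage and a median salary. How `load_and_process_data` computes them is not shown in this notebook. The sketch below is one plausible way to derive the same two figures, using a hypothetical `summarize_salaries` helper and invented sample values.

```python
from statistics import median


def summarize_salaries(salaries):
    """Build a summary dict shaped like the one read in this step.

    Assumption: coverage is the share of non-missing salaries in percent,
    and the median ignores missing values.
    """
    present = [s for s in salaries if s is not None]
    coverage = len(present) / len(salaries) * 100 if salaries else 0.0
    return {
        "salary_coverage": coverage,
        "salary_range": {"median": median(present) if present else None},
    }


sample = [80_000, None, 100_000, 120_000]  # invented values
summary_sketch = summarize_salaries(sample)
print(f"Salary coverage: {summary_sketch['salary_coverage']:.1f}%")        # 75.0%
print(f"Median salary: ${summary_sketch['salary_range']['median']:,.0f}")  # $100,000
```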

In [None]:
# Load processed data using the centralized pipeline
# This follows the "Process Once, Use Many Times" architecture from DESIGN.md

from src.data.website_processor import load_and_process_data

print("Loading processed data...")

df_processed, summary = load_and_process_data()

print(f"[OK] Loaded {len(df_processed):,} records with {len(df_processed.columns)} columns")
print(f"     Salary coverage: {summary['salary_coverage']:.1f}%")
print(f"     Median salary: ${summary['salary_range']['median']:,.0f}")

# Display first 5 rows
print("\n" + "="*80)
print("PROCESSED DATA - First 5 Rows")
print("="*80)
display(df_processed.head())

# Show column names (snake_case)
print("\n" + "="*80)
print(f"PROCESSED DATA - Column Names ({len(df_processed.columns)} total, all snake_case)")
print("="*80)
print("\n".join([f"{i+1:3d}. {name}" for i, name in enumerate(df_processed.columns)]))

## Step 3: Compare Raw vs Processed Data

Let's see what transformations were applied by the data pipeline.
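
Two of the transformations reported by the comparison cell in this step can be sketched in plain Python. These are illustrations under stated assumptions, not the pipeline's code: `compute_salary_avg` and `clean_category` are hypothetical helpers, and which characters count as "special" is a guess.

```python
import re


def compute_salary_avg(salary_from, salary_to):
    """Midpoint of the salary range; missing inputs are not imputed."""
    if salary_from is None or salary_to is None:
        return None
    return (salary_from + salary_to) / 2


def clean_category(value, default="Undefined"):
    """Strip special characters and replace missing values with a default.

    Assumption: anything other than letters, digits, spaces, '/' and '-'
    counts as a special character.
    """
    if value is None:
        return default
    cleaned = re.sub(r"[^A-Za-z0-9 /-]+", " ", value)
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return cleaned or default


print(compute_salary_avg(80_000, 120_000))       # 100000.0
print(compute_salary_avg(None, 120_000))         # None
print(clean_category("Full-time (> 32 hours)"))  # Full-time 32 hours
print(clean_category(None))                      # Undefined
```

Returning `None` when either bound is missing mirrors the "NOT imputed" behavior: a posting without a full salary range simply has no `salary_avg`.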

In [None]:
# Step 3: Compare Raw (PySpark) vs Processed (Pandas) Data

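# Check that the required variables exist and hold data, then run the comparison
try:
    # Referencing the names raises NameError if an earlier cell was skipped
    _ = (spark_df_raw, df_processed, spark)
    # Step 1 sets spark_df_raw and spark to None when the raw CSV is missing
    if spark_df_raw is None or spark is None:
        raise NameError("spark_df_raw is None (raw CSV was not loaded in Step 1)")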

    print("="*80)
    print("RAW vs PROCESSED DATA COMPARISON")
    print("="*80)

    # Get counts from PySpark DataFrame
    raw_count = spark_df_raw.count()
    raw_columns = len(spark_df_raw.columns)

    print(f"\n[DATA] Record Count:")
    print(f"  Raw (PySpark): {raw_count:,}")
    print(f"  Processed (Pandas): {len(df_processed):,}")
    print(f"  Retained: {len(df_processed)/raw_count*100:.1f}%")

    print(f"\n[COLUMNS] Column Count:")
    print(f"  Raw: {raw_columns}")
    print(f"  Processed: {len(df_processed.columns)}")
    print(f"  Change: {len(df_processed.columns) - raw_columns:+d} (net change; derived columns such as salary_avg are added)")

    print(f"\n[NAMING] Column Naming Convention:")
    print(f"  Raw: UPPERCASE (e.g., {spark_df_raw.columns[0]})")
    print(f"  Processed: snake_case (e.g., {df_processed.columns[0]})")

    # Show key transformations
    print(f"\n[TRANSFORMS] Key Transformations:")

    # Check for salary column
    if 'salary_avg' in df_processed.columns:
        print(f"  [OK] salary_avg: Computed from SALARY_FROM and SALARY_TO")
        print(f"       Formula: (SALARY_FROM + SALARY_TO) / 2")
        print(f"       Missing values: NOT imputed (maintains data integrity)")

    # Check for location decoding
    if 'city_name' in df_processed.columns:
        print(f"  [OK] city_name: Decoded from base64 (if encoded)")

    # Check for employment/remote cleaning
    if 'employment_type_name' in df_processed.columns:
        print(f"  [OK] employment_type_name: Special characters removed, nulls → 'Undefined'")

    if 'remote_type_name' in df_processed.columns:
        print(f"  [OK] remote_type_name: Missing values filled with 'Undefined'")

    # Show technology stack difference
    print(f"\n[TECH] Technology Stack:")
    print(f"  Raw data: PySpark DataFrame (distributed, lazy evaluation)")
    print(f"  Processed data: Pandas DataFrame (in-memory, rich API)")
    print(f"  Storage: Parquet (columnar, compressed)")

    print("\n" + "="*80)

    # Clean up Spark session
    print("\n[CLEANUP] Stopping Spark session to free memory...")
    spark.stop()
    print("[OK] Spark session stopped")

except NameError as e:
    print(f"[ERROR] Missing required variable: {e}")
    print("[INFO] Please ensure you run cells in order:")
    print("       Step 1: Load raw data with PySpark")
    print("       Step 2: Load processed data with Pandas")
    print("       Step 3: Compare both datasets")

## Summary

This notebook demonstrates the data processing pipeline that transforms raw Lightcast job postings into clean, analysis-ready data.

### Key Takeaways

1. **Column Standardization**: All UPPERCASE columns converted to snake_case
2. **Data Quality**: Salary validation, location decoding, and null handling
3. **Derived Columns**: New features created for analysis (salary_avg, experience_avg, etc.)
4. **Efficient Storage**: Parquet format for fast loading (117.8 MB compressed)

### Next Steps

- Explore `ml_feature_engineering_lab.ipynb` for machine learning models
- Check `job_market_skill_analysis.ipynb` for NLP and skills analysis
- View the Quarto website for interactive dashboards

---

**For more details, see:**
- `DESIGN.md` - Technical design and pipeline architecture
- `ARCHITECTURE.md` - System architecture with Mermaid diagrams
- `README.md` - Project overview and setup instructions
