# Data Processing Pipeline Demo

This notebook demonstrates the complete data processing pipeline for the Tech Career Intelligence Platform.

## What This Notebook Shows

1. **Raw Data Loading**: Load the original Lightcast job postings CSV
2. **Data Inspection**: View the first 5 rows of raw data
3. **Processed Data Loading**: Load the clean, processed Parquet data
4. **Data Comparison**: Compare raw vs processed data transformations
5. **Pipeline Summary**: Understand the complete transformation flow

## Pipeline Overview

```
Raw CSV (UPPERCASE columns)
    ↓
Standardize columns (UPPERCASE → snake_case)
    ↓
Decode locations (base64 → plain text)
    ↓
Compute salary_avg (from min/max with imputation)
    ↓
Validate & clean data
    ↓
Processed Parquet (snake_case columns, clean data)
```
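The flow above can be sketched in pandas on a toy frame. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names (`to_snake_case`, `decode_location`, `add_salary_avg`), the toy values, the imputation rule (use whichever salary bound is present when the other is missing), and the validation rule (keep rows with a positive `salary_avg`) are all hypothetical. Only the column names `LOCATION`, `SALARY_FROM`, and `SALARY_TO` come from the raw data.

```python
import base64
import binascii

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase an UPPERCASE column name (e.g. SALARY_FROM -> salary_from)."""
    return name.strip().lower().replace(" ", "_")


def decode_location(value):
    """Decode a base64 string to text; return the input unchanged if decoding fails."""
    if not isinstance(value, str):
        return value
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return value


def add_salary_avg(df: pd.DataFrame) -> pd.DataFrame:
    """Average salary_from/salary_to, falling back to whichever bound is present."""
    out = df.copy()
    # mean(axis=1) skips NaN by default, so a single known bound is used as-is
    out["salary_avg"] = out[["salary_from", "salary_to"]].mean(axis=1)
    return out


# Toy raw data (hypothetical values) with UPPERCASE columns and base64 locations
raw = pd.DataFrame(
    {
        "ID": ["a", "b", "c"],
        "LOCATION": [
            base64.b64encode(b"Dallas, TX").decode("ascii"),
            base64.b64encode(b"Augusta, ME").decode("ascii"),
            None,
        ],
        "SALARY_FROM": [80000.0, None, None],
        "SALARY_TO": [100000.0, 95000.0, None],
    }
)

processed = raw.rename(columns=to_snake_case)
processed["location"] = processed["location"].map(decode_location)
processed = add_salary_avg(processed)
# Validation step (assumed rule): keep only rows with a positive salary_avg
processed = processed[processed["salary_avg"] > 0].reset_index(drop=True)

print(processed)
```

Row `c` has no salary bounds at all, so it is dropped by the assumed validation rule. A rule of this kind would be consistent with a processed set that is smaller than the raw one, though the notebook does not state the actual filter.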


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import warnings
warnings.filterwarnings('ignore')

# Add the project root to sys.path so we can import from src/
# (assumes the notebook runs from a directory one level below the project root)
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print("✅ Libraries imported successfully")
print(f"   Project root: {project_root}")

✅ Libraries imported successfully
   Project root: /Users/ss670121/sourcebox/github.com/ad688-scratch


## Step 1: Load and Inspect Raw Data

Let's start by loading the raw CSV data and examining its structure.

In [2]:
import pandas as pd
from pathlib import Path

# Load raw data
raw_data_path = Path('../data/raw/lightcast_job_postings.csv')

if raw_data_path.exists():
    print("Loading raw data...")
    df_raw = pd.read_csv(raw_data_path)
    print(f"✅ Loaded {len(df_raw):,} records with {len(df_raw.columns)} columns")

    # Display first 5 rows
    print("\n" + "="*80)
    print("RAW DATA - First 5 Rows")
    print("="*80)
    display(df_raw.head())

    # Show column names
    print("\n" + "="*80)
    print(f"RAW DATA - Column Names ({len(df_raw.columns)} total)")
    print("="*80)
    print("\n".join(f"{i:3d}. {col}" for i, col in enumerate(df_raw.columns, start=1)))
else:
    print(f"❌ Raw data not found at {raw_data_path}")
    df_raw = None

Loading raw data...
✅ Loaded 72,498 records with 131 columns

RAW DATA - First 5 Rows


|   | ID | LAST_UPDATED_DATE | LAST_UPDATED_TIMESTAMP | DUPLICATES | POSTED | EXPIRED | DURATION | SOURCE_TYPES | SOURCES | URL | ... | NAICS_2022_2 | NAICS_2022_2_NAME | NAICS_2022_3 | NAICS_2022_3_NAME | NAICS_2022_4 | NAICS_2022_4_NAME | NAICS_2022_5 | NAICS_2022_5_NAME | NAICS_2022_6 | NAICS_2022_6_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1f57d95acf4dc67ed2819eb12f049f6a5c11782c | 9/6/2024 | 2024-09-06 20:32:57.352 Z | 0.0 | 6/2/2024 | 6/8/2024 | 6.0 | `[\n "Company"\n]` | `[\n "brassring.com"\n]` | `[\n "https://sjobs.brassring.com/TGnewUI/Sear...` | ... | 44.0 | Retail Trade | 441.0 | Motor Vehicle and Parts Dealers | 4413.0 | Automotive Parts, Accessories, and Tire Retailers | 44133.0 | Automotive Parts and Accessories Retailers | 441330.0 | Automotive Parts and Accessories Retailers |
| 1 | 0cb072af26757b6c4ea9464472a50a443af681ac | 8/2/2024 | 2024-08-02 17:08:58.838 Z | 0.0 | 6/2/2024 | 8/1/2024 |  | `[\n "Job Board"\n]` | `[\n "maine.gov"\n]` | `[\n "https://joblink.maine.gov/jobs/1085740"\n]` | ... | 56.0 | Administrative and Support and Waste Managemen... | 561.0 | Administrative and Support Services | 5613.0 | Employment Services | 56132.0 | Temporary Help Services | 561320.0 | Temporary Help Services |
| 2 | 85318b12b3331fa490d32ad014379df01855c557 | 9/6/2024 | 2024-09-06 20:32:57.352 Z | 1.0 | 6/2/2024 | 7/7/2024 | 35.0 | `[\n "Job Board"\n]` | `[\n "dejobs.org"\n]` | `[\n "https://dejobs.org/dallas-tx/data-analys...` | ... | 52.0 | Finance and Insurance | 524.0 | Insurance Carriers and Related Activities | 5242.0 | Agencies, Brokerages, and Other Insurance Rela... | 52429.0 | Other Insurance Related Activities | 524291.0 | Claims Adjusting |
| 3 | 1b5c3941e54a1889ef4f8ae55b401a550708a310 | 9/6/2024 | 2024-09-06 20:32:57.352 Z | 1.0 | 6/2/2024 | 7/20/2024 | 48.0 | `[\n "Job Board"\n]` | `[\n "disabledperson.com",\n "dejobs.org"\n]` | `[\n "https://www.disabledperson.com/jobs/5948...` | ... | 52.0 | Finance and Insurance | 522.0 | Credit Intermediation and Related Activities | 5221.0 | Depository Credit Intermediation | 52211.0 | Commercial Banking | 522110.0 | Commercial Banking |
| 4 | cb5ca25f02bdf25c13edfede7931508bfd9e858f | 6/19/2024 | 2024-06-19 07:00:00.000 Z | 0.0 | 6/2/2024 | 6/17/2024 | 15.0 | `[\n "FreeJobBoard"\n]` | `[\n "craigslist.org"\n]` | `[\n "https://modesto.craigslist.org/sls/77475...` | ... | 99.0 | Unclassified Industry | 999.0 | Unclassified Industry | 9999.0 | Unclassified Industry | 99999.0 | Unclassified Industry | 999999.0 | Unclassified Industry |



RAW DATA - Column Names (131 total)
  1. ID
  2. LAST_UPDATED_DATE
  3. LAST_UPDATED_TIMESTAMP
  4. DUPLICATES
  5. POSTED
  6. EXPIRED
  7. DURATION
  8. SOURCE_TYPES
  9. SOURCES
 10. URL
 11. ACTIVE_URLS
 12. ACTIVE_SOURCES_INFO
 13. TITLE_RAW
 14. BODY
 15. MODELED_EXPIRED
 16. MODELED_DURATION
 17. COMPANY
 18. COMPANY_NAME
 19. COMPANY_RAW
 20. COMPANY_IS_STAFFING
 21. EDUCATION_LEVELS
 22. EDUCATION_LEVELS_NAME
 23. MIN_EDULEVELS
 24. MIN_EDULEVELS_NAME
 25. MAX_EDULEVELS
 26. MAX_EDULEVELS_NAME
 27. EMPLOYMENT_TYPE
 28. EMPLOYMENT_TYPE_NAME
 29. MIN_YEARS_EXPERIENCE
 30. MAX_YEARS_EXPERIENCE
 31. IS_INTERNSHIP
 32. SALARY
 33. REMOTE_TYPE
 34. REMOTE_TYPE_NAME
 35. ORIGINAL_PAY_PERIOD
 36. SALARY_TO
 37. SALARY_FROM
 38. LOCATION
 39. CITY
 40. CITY_NAME
 41. COUNTY
 42. COUNTY_NAME
 43. MSA
 44. MSA_NAME
 45. STATE
 46. STATE_NAME
 47. COUNTY_OUTGOING
 48. COUNTY_NAME_OUTGOING
 49. COUNTY_INCOMING
 50. COUNTY_NAME_INCOMING
 51. MSA_OUTGOING
 52. MSA_NAME_OUTGOING
 53. MSA_INC

## Step 2: Load and Inspect Processed Data

Now let's load the processed Parquet data to see the transformation results.

In [3]:
# Load processed data using the centralized pipeline
# This follows the "Process Once, Use Many Times" architecture from DESIGN.md

from src.data.website_processor import load_and_process_data

print("Loading processed data...")
df_processed, summary = load_and_process_data()

print(f"✅ Loaded {len(df_processed):,} records with {len(df_processed.columns)} columns")
print(f"   Salary coverage: {summary['salary_coverage']:.1f}%")
print(f"   Median salary: ${summary['salary_range']['median']:,.0f}")

# Display first 5 rows
print("\n" + "="*80)
print("PROCESSED DATA - First 5 Rows")
print("="*80)
display(df_processed.head())

# Show column names (snake_case)
print("\n" + "="*80)
print(f"PROCESSED DATA - Column Names ({len(df_processed.columns)} total, all snake_case)")
print("="*80)
print("\n".join(f"{i:3d}. {col}" for i, col in enumerate(df_processed.columns, start=1)))

Loading processed data...
📊 Loading job market data...
  ✅ Loading processed Parquet (/Users/ss670121/sourcebox/github.com/ad688-scratch/data/processed/job_market_processed.parquet)...
  ✅ Loaded 32,364 records (already standardized, no processing needed)
✅ Loaded 32,364 records with 132 columns
   Salary coverage: 100.0%
   Median salary: $113,522

PROCESSED DATA - First 5 Rows


|   | id | last_updated_date | last_updated_timestamp | duplicates | posted | expired | duration | source_types | sources | url | ... | naics_2022_2_name | naics_2022_3 | naics_2022_3_name | naics_2022_4 | naics_2022_4_name | naics_2022_5 | naics_2022_5_name | naics_2022_6 | naics_2022_6_name | salary_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cb5ca25f02bdf25c13edfede7931508bfd9e858f | 6/19/2024 | 2024-06-19 07:00:00.000 Z | 0.0 | 6/2/2024 | 6/17/2024 | 15.0 | `[\n "FreeJobBoard"\n]` | `[\n "craigslist.org"\n]` | `[\n "https://modesto.craigslist.org/sls/77475...` | ... | Unclassified Industry | 999.0 | Unclassified Industry | 9999.0 | Unclassified Industry | 99999.0 | Unclassified Industry | 999999.0 | Unclassified Industry | 92500.0 |
| 1 | 35a6cd2183d9fb270e3f504b270f36d43cb759a6 | 9/6/2024 | 2024-09-06 20:32:57.352 Z | 0.0 | 6/2/2024 | 6/12/2024 | 10.0 | `[\n "Job Board"\n]` | `[\n "dejobs.org"\n]` | `[\n "https://dejobs.org/little-rock-ar/sr-lea...` | ... | Information | 517.0 | Telecommunications | 5178.0 | All Other Telecommunications | 51781.0 | All Other Telecommunications | 517810.0 | All Other Telecommunications | 110155.0 |
| 2 | 229620073766234e814e8add21db7dfaef69b3bd | 10/9/2024 | 2024-10-09 18:07:44.758 Z | 0.0 | 6/2/2024 | 8/1/2024 |  | `[\n "Company"\n]` | `[\n "3ds.com"\n]` | `[\n "https://www.3ds.com/careers/jobs/sr-mark...` | ... | Professional, Scientific, and Technical Services | 541.0 | Professional, Scientific, and Technical Services | 5415.0 | Computer Systems Design and Related Services | 54151.0 | Computer Systems Design and Related Services | 541511.0 | Custom Computer Programming Services | 92962.0 |
| 3 | b7aa80a24c82f080cca31a8b5b720824eb2b71f3 | 9/28/2024 | 2024-09-28 14:06:14.129 Z | 8.0 | 6/2/2024 | 9/27/2024 |  | `[\n "Government",\n "Company",\n "Job Board...` | `[\n "dcscorp.com",\n "latpro.com",\n "ca.go...` | `[\n "https://www.latpro.com/career/8048496/Da...` | ... | Wholesale Trade | 423.0 | Merchant Wholesalers, Durable Goods | 4238.0 | Machinery, Equipment, and Supplies Merchant Wh... | 42383.0 | Industrial Machinery and Equipment Merchant Wh... | 423830.0 | Industrial Machinery and Equipment Merchant Wh... | 107645.5 |
| 4 | 57b527ea0f91db5bb17f82ff3d34dcdb7afe5c13 | 9/6/2024 | 2024-09-06 20:32:57.352 Z | 0.0 | 6/2/2024 | 7/27/2024 | 55.0 | `[\n "Job Board"\n]` | `[\n "simplyhired.com"\n]` | `[\n "https://www.simplyhired.com/job/InNNBIUX...` | ... | Professional, Scientific, and Technical Services | 541.0 | Professional, Scientific, and Technical Services | 5416.0 | Management, Scientific, and Technical Consulti... | 54161.0 | Management Consulting Services | 541611.0 | Administrative Management and General Manageme... | 192800.0 |



PROCESSED DATA - Column Names (132 total, all snake_case)
  1. id
  2. last_updated_date
  3. last_updated_timestamp
  4. duplicates
  5. posted
  6. expired
  7. duration
  8. source_types
  9. sources
 10. url
 11. active_urls
 12. active_sources_info
 13. title_raw
 14. body
 15. modeled_expired
 16. modeled_duration
 17. company
 18. company_name
 19. company_raw
 20. company_is_staffing
 21. education_levels
 22. education_levels_name
 23. min_edulevels
 24. min_edulevels_name
 25. max_edulevels
 26. max_edulevels_name
 27. employment_type
 28. employment_type_name
 29. min_years_experience
 30. max_years_experience
 31. is_internship
 32. salary
 33. remote_type
 34. remote_type_name
 35. original_pay_period
 36. salary_to
 37. salary_from
 38. location
 39. city
 40. city_name
 41. county
 42. county_name
 43. msa
 44. msa_name
 45. state
 46. state_name
 47. county_outgoing
 48. county_name_outgoing
 49. county_incoming
 50. county_name_incoming
 51. msa_outgoing
 52. msa_name
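
The cell above reads `salary_coverage` and `salary_range['median']` from the returned `summary` dictionary. How `load_and_process_data` computes these values is not shown in this notebook. The sketch below is one plausible way to derive them with pandas: the helper name `summarize_salary` is hypothetical, and the toy frame reuses the five `salary_avg` values from the preview rows.

```python
import pandas as pd


def summarize_salary(df: pd.DataFrame, column: str = "salary_avg") -> dict:
    """Return coverage (% of non-null rows) and basic range stats for a salary column."""
    salaries = df[column].dropna()
    return {
        "salary_coverage": 100.0 * len(salaries) / len(df) if len(df) else 0.0,
        "salary_range": {
            "min": float(salaries.min()),
            "median": float(salaries.median()),
            "max": float(salaries.max()),
        },
    }


# The five salary_avg values shown in the processed preview
toy = pd.DataFrame({"salary_avg": [92500.0, 110155.0, 92962.0, 107645.5, 192800.0]})
toy_summary = summarize_salary(toy)

print(f"Salary coverage: {toy_summary['salary_coverage']:.1f}%")
print(f"Median salary: ${toy_summary['salary_range']['median']:,.0f}")
```

Under this definition, a coverage of 100.0% means that every retained row has a non-null `salary_avg`.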

## Step 3: Compare Raw vs Processed Data

Let's see what transformations were applied by the data pipeline.

In [4]:
# Data Transformation Summary

if df_raw is not None:
    print("="*80)
    print("RAW vs PROCESSED DATA COMPARISON")
    print("="*80)

    print(f"\n📊 Record Count:")
    print(f"   Raw:       {len(df_raw):,}")
    print(f"   Processed: {len(df_processed):,}")
    print(f"   Retained:  {len(df_processed)/len(df_raw)*100:.1f}%")

    print(f"\n📋 Column Count:")
    print(f"   Raw:       {len(df_raw.columns)}")
    print(f"   Processed: {len(df_processed.columns)}")
    print(f"   Change:    {len(df_processed.columns) - len(df_raw.columns):+d}")

    print(f"\n🏷️  Column Naming:")
    print(f"   Raw:       UPPERCASE (e.g., {df_raw.columns[0]})")
    print(f"   Processed: snake_case (e.g., {df_processed.columns[0]})")

    # Show key transformations
    print(f"\n🔄 Key Transformations:")

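    # Check for common columns.
    # Note: each check uses `or`, so a line is printed when either the raw or the
    # processed name exists. The arrow is descriptive, not proof of a rename
    # (salary_avg, for example, is derived by the pipeline rather than renamed).
    if 'SALARY_AVG' in df_raw.columns or 'salary_avg' in df_processed.columns:
        print(f"   ✅ Salary: SALARY_AVG → salary_avg (computed & validated)")

    if 'CITY_NAME' in df_raw.columns or 'city_name' in df_processed.columns:
        print(f"   ✅ Location: CITY_NAME → city_name (decoded & cleaned)")

    if 'TITLE_NAME' in df_raw.columns or 'title' in df_processed.columns:
        print(f"   ✅ Title: TITLE_NAME → title (standardized)")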

    print(f"\n✨ Derived Columns Created:")
    # Build the lowercase raw-name set once instead of once per processed column
    raw_lower = {c.lower() for c in df_raw.columns}
    derived_cols = [col for col in df_processed.columns if col not in raw_lower]
    if derived_cols:
        for col in derived_cols[:10]:  # Show first 10
            print(f"   • {col}")
        if len(derived_cols) > 10:
            print(f"   ... and {len(derived_cols) - 10} more")

    print("\n" + "="*80)
else:
    print("⚠️  Raw data not available for comparison")

RAW vs PROCESSED DATA COMPARISON

📊 Record Count:
   Raw:       72,498
   Processed: 32,364
   Retained:  44.6%

📋 Column Count:
   Raw:       131
   Processed: 132
   Change:    +1

🏷️  Column Naming:
   Raw:       UPPERCASE (e.g., ID)
   Processed: snake_case (e.g., id)

🔄 Key Transformations:
   ✅ Salary: SALARY_AVG → salary_avg (computed & validated)
   ✅ Location: CITY_NAME → city_name (decoded & cleaned)
   ✅ Title: TITLE_NAME → title (standardized)

✨ Derived Columns Created:
   • salary_avg



## Summary

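This notebook demonstrates the data processing pipeline that transforms raw Lightcast job postings into clean, analysis-ready data. In the run shown here, the pipeline retained 32,364 of 72,498 raw records (44.6%) and added one derived column, `salary_avg`.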

### Key Takeaways

1. **Column Standardization**: All UPPERCASE columns converted to snake_case
2. **Data Quality**: Salary validation, location decoding, and null handling
3. **Derived Columns**: New features created for analysis (`salary_avg` is the only one in this run)
4. **Efficient Storage**: Parquet format for fast loading (117.8 MB compressed)
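
The first takeaway can be checked mechanically. The sketch below flags any column name that is not snake_case. The regular expression is one reasonable definition of snake_case (lowercase letters and digits, separated by single underscores), and `non_snake_case` is a hypothetical helper rather than part of the project.

```python
import re

# Lowercase start, then lowercase letters/digits, with single underscores between groups
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def non_snake_case(columns):
    """Return the column names that do not match the snake_case pattern."""
    return [name for name in columns if not SNAKE_CASE.fullmatch(name)]


# Names taken from this notebook, plus one deliberately unconverted name
columns = ["id", "last_updated_date", "naics_2022_6_name", "salary_avg", "TITLE_RAW"]
offenders = non_snake_case(columns)
print(offenders)
```

Running `non_snake_case(df_processed.columns)` in the notebook should return an empty list if standardization covered every column.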

### Next Steps

- Explore `ml_feature_engineering_lab.ipynb` for machine learning models
- Check `job_market_skill_analysis.ipynb` for NLP and skills analysis
- View the Quarto website for interactive dashboards

---

**For more details, see:**
- `DESIGN.md` - Technical design and pipeline architecture
- `ARCHITECTURE.md` - System architecture with Mermaid diagrams
- `README.md` - Project overview and setup instructions
