# Prepare Training Data Variants

## Purpose
Prepare CSV training data for SageMaker XGBoost experiments with different feature subsets.
This notebook creates two additional variants beyond the 20-feature "engineered-baseline":
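1. **Raw baseline (6 features)**: A minimal feature set for the benchmark model
2. **Engineered without target encoding (16 features)**: Drops the four target-encoded delay rates (`AIRLINE_DELAY_RATE`, `ORIGIN_DELAY_RATE`, `DEST_DELAY_RATE`, `ROUTE_DELAY_RATE`) to avoid temporal leakage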

## Prerequisites
- Feature parquet files are available in the public S3 bucket (`s3://sagemaker-us-east-1-425709451100/aai540-group1/features/`)
- Run this notebook after `02_prepare_sagemaker_data.ipynb`, which creates the 20-feature variant

## Output Locations
- `s3://{PERSONAL_BUCKET}/aai540-group1/training/raw-baseline/`
- `s3://{PERSONAL_BUCKET}/aai540-group1/training/engineered-no-target-encoding/`
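
Each prefix above holds one CSV per split, stored as `{split}/{split}.csv`. As a minimal sketch of that layout, the helper below builds the full object URI for a variant and split. The function name `training_csv_uri` and the bucket name `my-bucket` are hypothetical and are not part of the notebook.

```python
# Hypothetical helper (not part of the notebook): builds the URI of one training CSV.
BASE_PREFIX = "aai540-group1/training"

# The feature parquet files name the validation split 'val';
# the CSV layout uses 'validation' instead.
SPLIT_MAPPING = {"train": "train", "val": "validation", "test": "test"}


def training_csv_uri(bucket: str, variant: str, parquet_split: str) -> str:
    """Return s3://{bucket}/{BASE_PREFIX}/{variant}/{split}/{split}.csv."""
    s3_split = SPLIT_MAPPING[parquet_split]  # raises KeyError for an unknown split
    return f"s3://{bucket}/{BASE_PREFIX}/{variant}/{s3_split}/{s3_split}.csv"


# 'my-bucket' is a placeholder; the notebook uses the SageMaker default bucket.
print(training_csv_uri("my-bucket", "raw-baseline", "val"))
```

Keeping the `val` to `validation` rename in one mapping means the read side (parquet) and the write side (CSV) cannot drift apart.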

In [1]:
import boto3
import pandas as pd
import sagemaker
import os
from botocore.exceptions import ClientError

# SageMaker session
sess = sagemaker.Session()
s3_client = boto3.client('s3')

# S3 buckets
PUBLIC_BUCKET = 'sagemaker-us-east-1-425709451100'
PERSONAL_BUCKET = sess.default_bucket()
PARQUET_PREFIX = 'aai540-group1/features'

# Target and feature definitions
TARGET = 'DELAYED'

# 6 raw features (for benchmark model)
RAW_FEATURES = [
    'MONTH', 'DAY_OF_WEEK', 'DEP_HOUR', 
    'DISTANCE', 'SCHEDULED_TIME', 'IS_WEEKEND'
]

# 16 engineered features WITHOUT target encoding
# Excludes: AIRLINE_DELAY_RATE, ORIGIN_DELAY_RATE, DEST_DELAY_RATE, ROUTE_DELAY_RATE
ENGINEERED_NO_TARGET_ENCODING = [
    # Temporal (9)
    'MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE',
    'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND',
    # Distance (4)
    'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET',
    # Volume (3)
    'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS'
]

# Feature set configurations
FEATURE_SETS = {
    'raw-baseline': {
        'features': RAW_FEATURES,
        'prefix': 'aai540-group1/training/raw-baseline',
        'description': '6 raw features for benchmark model'
    },
    'engineered-no-target-encoding': {
        'features': ENGINEERED_NO_TARGET_ENCODING,
        'prefix': 'aai540-group1/training/engineered-no-target-encoding',
        'description': '16 features excluding target-encoded rates'
    }
}

# Split name mapping (parquet uses 'val', SageMaker convention is 'validation')
SPLIT_MAPPING = {
    'train': 'train',
    'val': 'validation',
    'test': 'test'
}

print("Configuration:")
print(f"  Public bucket (source): s3://{PUBLIC_BUCKET}/{PARQUET_PREFIX}/")
print(f"  Personal bucket (dest):  s3://{PERSONAL_BUCKET}/")
print("\nFeature sets:")
for name, config in FEATURE_SETS.items():
    print(f"  {name:35s} - {len(config['features']):2d} features: {config['description']}")
print(f"\nTarget: {TARGET}")

Configuration:
  Public bucket (source): s3://sagemaker-us-east-1-425709451100/aai540-group1/features/
  Personal bucket (dest):  s3://sagemaker-us-east-1-786869526001/

Feature sets:
  raw-baseline                        -  6 features: 6 raw features for benchmark model
  engineered-no-target-encoding       - 16 features: 16 features excluding target-encoded rates

Target: DELAYED
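
The feature lists in the configuration cell are maintained by hand, so a quick consistency check can catch a miscounted or leaked column before any data is written. The sketch below is one possible check, not part of the notebook: `check_feature_set` and `TARGET_ENCODED` are hypothetical names, and the lists are copied from the cell above so the block runs on its own.

```python
# Lists copied from the configuration cell so this sketch is self-contained.
RAW_FEATURES = [
    "MONTH", "DAY_OF_WEEK", "DEP_HOUR",
    "DISTANCE", "SCHEDULED_TIME", "IS_WEEKEND",
]
ENGINEERED_NO_TARGET_ENCODING = [
    "MONTH", "DAY", "DAY_OF_WEEK", "DEP_HOUR", "SCHEDULED_DEPARTURE",
    "HOUR_SIN", "HOUR_COS", "IS_PEAK_HOUR", "IS_WEEKEND",
    "DISTANCE", "SCHEDULED_TIME", "IS_LONG_HAUL", "DISTANCE_BUCKET",
    "ORIGIN_FLIGHTS", "DEST_FLIGHTS", "ROUTE_FLIGHTS",
]

# Columns named in the notebook's "Excludes:" comment; the set name is hypothetical.
TARGET_ENCODED = {
    "AIRLINE_DELAY_RATE", "ORIGIN_DELAY_RATE", "DEST_DELAY_RATE", "ROUTE_DELAY_RATE",
}


def check_feature_set(features, expected_count, forbidden=frozenset(), target="DELAYED"):
    """Raise ValueError on duplicates, a wrong count, the target, or forbidden columns."""
    if len(set(features)) != len(features):
        raise ValueError("duplicate feature names")
    if len(features) != expected_count:
        raise ValueError(f"expected {expected_count} features, got {len(features)}")
    if target in features:
        raise ValueError("target column listed as a feature")
    leaked = sorted(set(features) & set(forbidden))
    if leaked:
        raise ValueError(f"forbidden columns present: {leaked}")
    return True


check_feature_set(RAW_FEATURES, 6)
check_feature_set(ENGINEERED_NO_TARGET_ENCODING, 16, forbidden=TARGET_ENCODED)
```

Both calls pass silently for the lists as defined; adding one of the delay-rate columns to the 16-feature list would raise a `ValueError`.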


In [2]:
# Check existing data in S3
print("Checking S3 for existing files...\n")

upload_status = {}

for feature_set_name, config in FEATURE_SETS.items():
    upload_status[feature_set_name] = {}
    
    for parquet_split, s3_split in SPLIT_MAPPING.items():
        s3_key = f"{config['prefix']}/{s3_split}/{s3_split}.csv"
        
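        try:
            s3_client.head_object(Bucket=PERSONAL_BUCKET, Key=s3_key)
            upload_status[feature_set_name][parquet_split] = 'exists'
        except ClientError as err:
            # Only a 404 means the object is absent; re-raise permission or other errors
            if err.response['Error']['Code'] not in ('404', 'NoSuchKey', 'NotFound'):
                raise
            upload_status[feature_set_name][parquet_split] = 'missing'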

# Print status table
print(f"{'Feature Set':<40s} {'Train':<10s} {'Validation':<12s} {'Test':<10s}")
print("-" * 75)

for feature_set_name in FEATURE_SETS.keys():
    statuses = upload_status[feature_set_name]
    train_status = '✓' if statuses['train'] == 'exists' else 'missing'
    val_status = '✓' if statuses['val'] == 'exists' else 'missing'
    test_status = '✓' if statuses['test'] == 'exists' else 'missing'
    
    print(f"{feature_set_name:<40s} {train_status:<10s} {val_status:<12s} {test_status:<10s}")

# Determine what needs processing
needs_processing = {}
for feature_set_name, statuses in upload_status.items():
    if any(status == 'missing' for status in statuses.values()):
        needs_processing[feature_set_name] = FEATURE_SETS[feature_set_name]

if needs_processing:
    print(f"\nProcessing needed for: {', '.join(needs_processing.keys())}")
else:
    print("\n✓ All files already exist in S3")

Checking S3 for existing files...

Feature Set                              Train      Validation   Test      
---------------------------------------------------------------------------
raw-baseline                             ✓          ✓            ✓         
engineered-no-target-encoding            missing    missing      missing   

Processing needed for: engineered-no-target-encoding


In [3]:
# Load parquet and prepare CSVs
if not needs_processing:
    print("No processing needed - all files exist in S3")
else:
    print("Loading parquet files and preparing CSVs...\n")
    
    # Local temp directory
    local_dir = '/tmp/training_variants'
    os.makedirs(local_dir, exist_ok=True)
    
    # Track prepared files
    prepared_files = {}
    
    for feature_set_name, config in needs_processing.items():
        print(f"Processing {feature_set_name} ({len(config['features'])} features)...")
        prepared_files[feature_set_name] = {}
        
        for parquet_split, s3_split in SPLIT_MAPPING.items():
            # Skip if already exists
            if upload_status[feature_set_name][parquet_split] == 'exists':
                print(f"  {s3_split:12s}: Already exists, skipping")
                continue
            
            # Load parquet with columnar projection (only needed columns)
            s3_path = f"s3://{PUBLIC_BUCKET}/{PARQUET_PREFIX}/{parquet_split}_features.parquet"
            columns_to_read = [TARGET] + config['features']
            df = pd.read_parquet(s3_path, columns=columns_to_read)
            
            # Reorder: target first, then features
            df_train = df[[TARGET] + config['features']].copy()
            
            # Save locally as CSV (no headers, no index)
            local_path = f"{local_dir}/{feature_set_name}_{s3_split}.csv"
            df_train.to_csv(local_path, header=False, index=False)
            
            prepared_files[feature_set_name][parquet_split] = {
                'local_path': local_path,
                's3_split': s3_split,
                'rows': len(df_train)
            }
            
            print(f"  {s3_split:12s}: {len(df_train):>7,} rows prepared")
        
        print()
    
    print(f"✓ CSV files prepared in {local_dir}")

Loading parquet files and preparing CSVs...

Processing engineered-no-target-encoding (16 features)...
  train       : 4,299,046 rows prepared
  validation  : 482,878 rows prepared
  test        : 462,367 rows prepared

✓ CSV files prepared in /tmp/training_variants


In [4]:
# Upload to S3
if not needs_processing:
    print("No uploads needed - all files already exist in S3")
else:
    print("Uploading to S3...\n")
    
    uploaded_paths = []
    
    for feature_set_name, files in prepared_files.items():
        config = FEATURE_SETS[feature_set_name]
        print(f"{feature_set_name}:")
        
        for parquet_split, file_info in files.items():
            s3_split = file_info['s3_split']
            s3_key = f"{config['prefix']}/{s3_split}/{s3_split}.csv"
            s3_path = f"s3://{PERSONAL_BUCKET}/{s3_key}"
            
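            # Re-check existence so a re-run never overwrites an existing object
            try:
                s3_client.head_object(Bucket=PERSONAL_BUCKET, Key=s3_key)
                print(f"  {s3_split:12s}: Already exists at {s3_path}")
            except ClientError as err:
                # Upload only when the object is absent (404); re-raise other errors
                if err.response['Error']['Code'] not in ('404', 'NoSuchKey', 'NotFound'):
                    raise
                s3_client.upload_file(file_info['local_path'], PERSONAL_BUCKET, s3_key)
                print(f"  {s3_split:12s}: Uploaded to {s3_path}")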
            
            uploaded_paths.append(s3_path)
        
        print()
    
    print("✓ Upload complete")
    
    # Final summary
    print("\nAll S3 paths:")
    for feature_set_name, config in FEATURE_SETS.items():
        print(f"\n{feature_set_name}:")
        for s3_split in ['train', 'validation', 'test']:
            s3_path = f"s3://{PERSONAL_BUCKET}/{config['prefix']}/{s3_split}/{s3_split}.csv"
            print(f"  {s3_path}")

Uploading to S3...

engineered-no-target-encoding:
  train       : Uploaded to s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/train/train.csv
  validation  : Uploaded to s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/validation/validation.csv
  test        : Uploaded to s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/test/test.csv

✓ Upload complete

All S3 paths:

raw-baseline:
  s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/train/train.csv
  s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/validation/validation.csv
  s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/test/test.csv

engineered-no-target-encoding:
  s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/train/train.csv
  s3://sagemaker-us-east-1-786869526001/aai540-group1/training

## Data Ready

Training data variants prepared for SageMaker XGBoost:

| Variant | Features | S3 Location |
|---------|----------|-------------|
| raw-baseline | 6 | `s3://{bucket}/aai540-group1/training/raw-baseline/` |
| engineered-no-target-encoding | 16 | `s3://{bucket}/aai540-group1/training/engineered-no-target-encoding/` |
| engineered-baseline | 20 | `s3://{bucket}/aai540-group1/training/engineered-baseline/` (from 02_prepare_sagemaker_data.ipynb) |

Each location contains:
- `train/train.csv`
- `validation/validation.csv`
- `test/test.csv`

Format: headerless CSV with the target column (`DELAYED`) first, which is the layout the SageMaker XGBoost algorithm expects for CSV input.
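
As a rough way to confirm that a file follows this format before launching a training job, the sketch below checks the column count and the first column of each row using only the standard library. It is not part of the notebook: `validate_xgboost_csv` is a hypothetical name, the sample rows are made up, and the check assumes `DELAYED` is a binary 0/1 label.

```python
import csv
import io


def validate_xgboost_csv(text: str, n_features: int) -> int:
    """Check a headerless, target-first CSV and return the number of rows.

    Assumes a binary 0/1 target in the first column (an assumption about DELAYED).
    """
    rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != n_features + 1:
            raise ValueError(
                f"line {line_no}: expected {n_features + 1} columns, got {len(row)}"
            )
        try:
            label = float(row[0])
        except ValueError:
            # A non-numeric first cell usually means a header row was written.
            raise ValueError(f"line {line_no}: non-numeric label {row[0]!r}") from None
        if label not in (0.0, 1.0):
            raise ValueError(f"line {line_no}: label {row[0]!r} is not 0 or 1")
        rows += 1
    return rows


# Made-up sample with 3 feature columns: target first, no header.
sample = "1,3,14,733.0\n0,7,9,2475.0\n"
print(validate_xgboost_csv(sample, n_features=3))
```

For the real files, `n_features` would be 6 or 16 depending on the variant, and a sample of lines from the downloaded CSV would be passed in place of `sample`.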