# Exploring Feature Data for Full Model Training

**Goal:** Verify all 20 engineered features are available and prepare for training with the complete feature set.

## What We're Testing:
1. Load Parquet features from S3
2. Verify all 20 feature columns exist
3. Check data quality
4. Prepare parameters for pipeline execution

In [1]:
import boto3
import pandas as pd
import sagemaker

# Initialize SageMaker session
sess = sagemaker.Session()
bucket = sess.default_bucket()
region = sess.boto_region_name

print(f"Region: {region}")
print(f"Default Bucket: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Region: us-east-1
Default Bucket: sagemaker-us-east-1-786869526001


## 1. Load Training Features from S3

In [2]:
# Public bucket where features are stored
PUBLIC_BUCKET = 'sagemaker-us-east-1-425709451100'
FEATURES_PREFIX = 'aai540-group1/features'

# Load training features
train_path = f"s3://{PUBLIC_BUCKET}/{FEATURES_PREFIX}/train_features.parquet"
print(f"Loading: {train_path}")

df_train = pd.read_parquet(train_path)

print(f"\nLoaded {len(df_train):,} rows × {df_train.shape[1]} columns")
print(f"\nColumns: {list(df_train.columns)}")

Loading: s3://sagemaker-us-east-1-425709451100/aai540-group1/features/train_features.parquet

Loaded 4,299,046 rows × 27 columns

Columns: ['MONTH', 'DAY', 'DAY_OF_WEEK', 'SCHEDULED_DEPARTURE', 'AIRLINE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'DISTANCE', 'SCHEDULED_TIME', 'DELAYED', 'DEP_HOUR', 'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND', 'IS_LONG_HAUL', 'DISTANCE_BUCKET', 'ROUTE', 'AIRLINE_DELAY_RATE', 'ORIGIN_DELAY_RATE', 'DEST_DELAY_RATE', 'ROUTE_DELAY_RATE', 'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS', 'flight_id', 'event_time']


## 2. Verify All 20 Feature Columns

In [3]:
# Define all 20 engineered features (including target-encoded rates)
ALL_FEATURES = [
    # Temporal (9)
    'MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE',
    'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND',
    
    # Distance (4)
    'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET',
    
    # Target-encoded (4) - THESE WERE EXCLUDED IN ENGINEERED BASELINE
    'AIRLINE_DELAY_RATE', 'ORIGIN_DELAY_RATE', 'DEST_DELAY_RATE', 'ROUTE_DELAY_RATE',
    
    # Volume (3)
    'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS'
]

TARGET = 'DELAYED'

print(f"Expected features: {len(ALL_FEATURES)}")
print(f"\nChecking for all features...")

missing_features = [f for f in ALL_FEATURES if f not in df_train.columns]
if missing_features:
    print(f"❌ Missing features: {missing_features}")
else:
    print(f"✅ All {len(ALL_FEATURES)} features are present!")

# Check target column
if TARGET in df_train.columns:
    print(f"✅ Target column '{TARGET}' found")
else:
    print(f"❌ Target column '{TARGET}' missing")

Expected features: 20

Checking for all features...
✅ All 20 features are present!
✅ Target column 'DELAYED' found
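
The check above reports only missing columns. The file was loaded with 27 columns, while the model uses 20 features plus the target, so it can also help to list the columns that are present but unused. The following is a minimal sketch; `compare_columns` is a hypothetical helper, not part of the notebook, and the column lists are copied from the outputs above.

```python
def compare_columns(expected, actual):
    """Return (missing, extra).

    missing: expected names that do not appear in `actual`
    extra:   names in `actual` that were not expected
    Both lists keep the order of their source list.
    """
    expected_set, actual_set = set(expected), set(actual)
    missing = [c for c in expected if c not in actual_set]
    extra = [c for c in actual if c not in expected_set]
    return missing, extra


# Columns as printed when the Parquet file was loaded.
file_columns = [
    'MONTH', 'DAY', 'DAY_OF_WEEK', 'SCHEDULED_DEPARTURE', 'AIRLINE',
    'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'DISTANCE', 'SCHEDULED_TIME',
    'DELAYED', 'DEP_HOUR', 'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR',
    'IS_WEEKEND', 'IS_LONG_HAUL', 'DISTANCE_BUCKET', 'ROUTE',
    'AIRLINE_DELAY_RATE', 'ORIGIN_DELAY_RATE', 'DEST_DELAY_RATE',
    'ROUTE_DELAY_RATE', 'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS',
    'flight_id', 'event_time',
]

# The 20 model features plus the target.
expected_columns = [
    'MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE',
    'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND',
    'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET',
    'AIRLINE_DELAY_RATE', 'ORIGIN_DELAY_RATE', 'DEST_DELAY_RATE',
    'ROUTE_DELAY_RATE', 'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS',
    'DELAYED',
]

missing, extra = compare_columns(expected_columns, file_columns)
print("Missing:", missing)
print("Present but not used for modeling:", extra)
```

The unused columns are the raw categorical identifiers and the record metadata, which accounts for the difference between 27 and 21 columns.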


## 3. Inspect Data Quality

In [4]:
# Select only the features and target we need
df_model = df_train[ALL_FEATURES + [TARGET]].copy()

print(f"Model dataset: {df_model.shape[0]:,} rows × {df_model.shape[1]} columns")
print(f"\nData types:")
print(df_model.dtypes)

print(f"\nMissing values:")
missing = df_model.isnull().sum()
if missing.sum() > 0:
    print(missing[missing > 0])
else:
    print("No missing values ✅")

print(f"\nTarget distribution:")
delay_rate = df_model[TARGET].mean() * 100
print(f"Delayed: {(df_model[TARGET]==1).sum():,} ({delay_rate:.2f}%)")
print(f"On-time: {(df_model[TARGET]==0).sum():,} ({100-delay_rate:.2f}%)")

Model dataset: 4,299,046 rows × 21 columns

Data types:
MONTH                    int64
DAY                      int64
DAY_OF_WEEK              int64
DEP_HOUR                 int64
SCHEDULED_DEPARTURE      int64
HOUR_SIN               float64
HOUR_COS               float64
IS_PEAK_HOUR             int64
IS_WEEKEND               int64
DISTANCE                 int64
SCHEDULED_TIME         float64
IS_LONG_HAUL             int64
DISTANCE_BUCKET          int64
AIRLINE_DELAY_RATE     float64
ORIGIN_DELAY_RATE      float64
DEST_DELAY_RATE        float64
ROUTE_DELAY_RATE       float64
ORIGIN_FLIGHTS         float64
DEST_FLIGHTS           float64
ROUTE_FLIGHTS          float64
DELAYED                  int64
dtype: object

Missing values:
No missing values ✅

Target distribution:
Delayed: 805,372 (18.73%)
On-time: 3,493,674 (81.27%)
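
The class balance above is what the `ScalePosWeight` hyperparameter used later in this notebook compensates for. A common starting point for XGBoost's `scale_pos_weight` is the ratio of negative to positive examples. The sketch below computes that ratio from the printed counts; `scale_pos_weight` is a hypothetical helper name, and the notebook does not state how its own value was derived.

```python
def scale_pos_weight(n_negative, n_positive):
    """Ratio of negative to positive examples.

    A commonly used starting value for XGBoost's scale_pos_weight
    on imbalanced binary targets.
    """
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


# Counts taken from the target distribution printed above.
ratio = scale_pos_weight(n_negative=3_493_674, n_positive=805_372)
print(f"negative/positive ratio: {ratio:.2f}")
```

On this training set the ratio comes out a little below the 4.58 carried over from the baseline notebook. One possible explanation, which is only a guess, is that the baseline value was computed on a different split or sample.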


## 4. Sample Data Preview

In [5]:
print("First 5 rows (all 20 features + target):")
df_model.head()

First 5 rows (all 20 features + target):


| | MONTH | DAY | DAY_OF_WEEK | DEP_HOUR | SCHEDULED_DEPARTURE | HOUR_SIN | HOUR_COS | IS_PEAK_HOUR | IS_WEEKEND | DISTANCE | ... | IS_LONG_HAUL | DISTANCE_BUCKET | AIRLINE_DELAY_RATE | ORIGIN_DELAY_RATE | DEST_DELAY_RATE | ROUTE_DELAY_RATE | ORIGIN_FLIGHTS | DEST_FLIGHTS | ROUTE_FLIGHTS | DELAYED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | 0 | 5 | 0.0 | 1.0 | 0 | 0 | 1448 | ... | 0 | 1 | 0.12349 | 0.113604 | 0.153551 | 0.090405 | 9.504203 | 11.41035 | 8.596004 | 0 |
| 1 | 1 | 1 | 4 | 0 | 10 | 0.0 | 1.0 | 0 | 0 | 2330 | ... | 1 | 2 | 0.186115 | 0.204192 | 0.223629 | 0.285714 | 11.975848 | 9.785605 | 4.356709 | 0 |
| 2 | 1 | 1 | 4 | 0 | 20 | 0.0 | 1.0 | 0 | 0 | 2296 | ... | 1 | 2 | 0.179845 | 0.193393 | 0.153058 | 0.116247 | 11.684026 | 11.296174 | 7.294377 | 0 |
| 3 | 1 | 1 | 4 | 0 | 20 | 0.0 | 1.0 | 0 | 0 | 2342 | ... | 1 | 2 | 0.186115 | 0.204192 | 0.203246 | 0.211837 | 11.975848 | 10.930944 | 7.804251 | 0 |
| 4 | 1 | 1 | 4 | 0 | 25 | 0.0 | 1.0 | 0 | 0 | 1448 | ... | 0 | 1 | 0.12349 | 0.155027 | 0.168331 | 0.165713 | 11.410195 | 9.504576 | 8.597851 | 0 |
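
In the preview, `DEP_HOUR` 0 maps to `HOUR_SIN` 0.0 and `HOUR_COS` 1.0, which matches a standard cyclical encoding over a 24-hour period. The feature-engineering code is not shown in this notebook, so the sketch below is an assumption about how such columns are typically built, with `encode_hour` as a hypothetical helper.

```python
import math


def encode_hour(hour, period=24):
    """Map an hour of day onto the unit circle.

    Returns (sin, cos) so that hours on either side of midnight
    (for example 23 and 0) end up close together.
    """
    angle = 2 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)


for h in (0, 6, 12, 23):
    s, c = encode_hour(h)
    print(f"hour={h:2d}  sin={s:+.3f}  cos={c:+.3f}")
```

Using both sine and cosine is what makes the encoding unambiguous: sine alone would give 0 and 12 the same value.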


## 5. Compare with Engineered Baseline (16 Features)

The engineered baseline excluded these 4 target-encoded features:

In [6]:
# Features that were EXCLUDED in engineered baseline
EXCLUDED_IN_BASELINE = [
    'AIRLINE_DELAY_RATE', 
    'ORIGIN_DELAY_RATE', 
    'DEST_DELAY_RATE', 
    'ROUTE_DELAY_RATE'
]

print("Target-encoded features (excluded in baseline, included now):")
print(df_model[EXCLUDED_IN_BASELINE].describe())

print("\nüìä These 4 features capture historical delay patterns:")
print("  - AIRLINE_DELAY_RATE: Historical delay rate for this airline")
print("  - ORIGIN_DELAY_RATE: Historical delay rate at origin airport")
print("  - DEST_DELAY_RATE: Historical delay rate at destination airport")
print("  - ROUTE_DELAY_RATE: Historical delay rate for this specific route")

Target-encoded features (excluded in baseline, included now):


       AIRLINE_DELAY_RATE  ORIGIN_DELAY_RATE  DEST_DELAY_RATE  \
count        4.299046e+06       4.299046e+06     4.299046e+06   
mean         1.873374e-01       1.873374e-01     1.873374e-01   
std          3.455165e-02       3.252787e-02     2.896137e-02   
min          1.174150e-01       1.951220e-02     3.375527e-02   
25%          1.810031e-01       1.604845e-01     1.683308e-01   
50%          1.905687e-01       1.882270e-01     1.897910e-01   
75%          1.972027e-01       2.131789e-01     2.085066e-01   
max          3.047932e-01       4.342105e-01     3.896104e-01   

       ROUTE_DELAY_RATE  
count      4.299046e+06  
mean       1.873374e-01  
std        5.398142e-02  
min        0.000000e+00  
25%        1.515152e-01  
50%        1.857143e-01  
75%        2.238232e-01  
max        1.000000e+00  

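📊 These 4 features capture historical delay patterns:
  - AIRLINE_DELAY_RATE: Historical delay rate for this airline
  - ORIGIN_DELAY_RATE: Historical delay rate at origin airport
  - DEST_DELAY_RATE: Historical delay rate at destination airport
  - ROUTE_DELAY_RATE: Historical delay rate for this specific route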
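
This notebook does not show how the delay rates were computed. The summary statistics are consistent with a plain per-group mean of `DELAYED` over the training rows: all four columns have a mean of about 0.1873, which equals the overall delay rate of 18.73%, and `ROUTE_DELAY_RATE` reaches 0.0 and 1.0, as unsmoothed rates do for very small groups. The sketch below illustrates that assumption on toy data; `group_delay_rates` and the sample values are hypothetical.

```python
from collections import defaultdict


def group_delay_rates(keys, targets):
    """Mean of the binary target per group key (unsmoothed target encoding)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [delayed count, row count]
    for key, y in zip(keys, targets):
        totals[key][0] += y
        totals[key][1] += 1
    return {key: delayed / n for key, (delayed, n) in totals.items()}


# Toy data: six flights across three airlines.
airlines = ['AA', 'AA', 'AA', 'DL', 'DL', 'UA']
delayed = [1, 0, 0, 1, 1, 0]

rates = group_delay_rates(airlines, delayed)
encoded = [rates[a] for a in airlines]  # one encoded value per row

global_rate = sum(delayed) / len(delayed)
encoded_mean = sum(encoded) / len(encoded)
print("Per-airline rates:", rates)
print("Global delay rate:", global_rate)
print("Mean of encoded column:", encoded_mean)
```

When rates are computed and applied on the same rows, the mean of the encoded column always equals the global rate. Because these features are derived from the target, the mapping should be fitted on training data only and then applied to validation data, so that validation labels do not leak into the features.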

## 6. Pipeline Parameters for Full Model Training

The hyperparameters below are carried over from `02_engineered_baseline.ipynb`:

In [7]:
# Hyperparameters from engineered baseline
HYPERPARAMETERS = {
    'Objective': 'binary:logistic',
    'EvalMetric': 'auc',
    'MaxDepth': 8,
    'Eta': 0.05,
    'NumRound': 200,
    'ScalePosWeight': 4.58,
    'Subsample': 0.8,
    'ColsampleByTree': 0.8,
    'MinChildWeight': 1
}

# Data paths (Parquet format)
DATA_PATHS = {
    'TrainingDataUrl': f's3://{PUBLIC_BUCKET}/{FEATURES_PREFIX}/train/',
    'ValidationDataUrl': f's3://{PUBLIC_BUCKET}/{FEATURES_PREFIX}/validation/',
    'InputContentType': 'application/x-parquet'
}

print("üöÄ Pipeline Parameters for Full Model (20 features):")
print("\nHyperparameters:")
for k, v in HYPERPARAMETERS.items():
    print(f"  --{k} {v}")

print("\nData Configuration:")
for k, v in DATA_PATHS.items():
    print(f"  --{k} {v}")

🚀 Pipeline Parameters for Full Model (20 features):

Hyperparameters:
  --Objective binary:logistic
  --EvalMetric auc
  --MaxDepth 8
  --Eta 0.05
  --NumRound 200
  --ScalePosWeight 4.58
  --Subsample 0.8
  --ColsampleByTree 0.8
  --MinChildWeight 1

Data Configuration:
  --TrainingDataUrl s3://sagemaker-us-east-1-425709451100/aai540-group1/features/train/
  --ValidationDataUrl s3://sagemaker-us-east-1-425709451100/aai540-group1/features/validation/
  --InputContentType application/x-parquet
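
The pipeline runner used later in this notebook takes its arguments as a flat list of strings, in the same `--Name value` form printed above. A small helper can build that list from a dictionary so that the printed parameters and the submitted ones cannot drift apart. This is a sketch only; `to_cli_args` is a hypothetical name and is not part of `run_experiment_v2`.

```python
def to_cli_args(params):
    """Flatten {'Name': value} into ['--Name', 'value', ...].

    Values are converted with str(), since command-line style
    argument lists are made of strings.
    """
    args = []
    for name, value in params.items():
        args.extend([f"--{name}", str(value)])
    return args


example = {'MaxDepth': 8, 'Eta': 0.05, 'InputContentType': 'text/csv'}
print(to_cli_args(example))
```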


## 7. Check for CSV Training Data

In [8]:
# Check if CSV training data exists in this account's default bucket
from botocore.exceptions import ClientError

s3_client = boto3.client('s3')

csv_keys = [
    'aai540-group1/training/engineered-baseline/train/train.csv',
    'aai540-group1/training/engineered-baseline/val/val.csv'
]

print("Checking for prepared CSV training data...")
for key in csv_keys:
    try:
        # head_object raises ClientError if the object is missing or not accessible
        s3_client.head_object(Bucket=bucket, Key=key)
        print(f"  ✅ Found: s3://{bucket}/{key}")
    except ClientError:
        print(f"  ❌ Missing: s3://{bucket}/{key}")
        print(f"     Run: notebooks/02_feature_engineering/02_prepare_sagemaker_data.ipynb")

Checking for prepared CSV training data...
  ✅ Found: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-baseline/train/train.csv
  ✅ Found: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-baseline/val/val.csv
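
The existence check confirms that the files are there, not what they contain. Because the prefix is named `engineered-baseline`, it may be worth confirming that the CSVs carry 21 columns (the target plus 20 features) before training. The sketch below counts columns in the first few rows of a CSV stream. It assumes the headerless, label-first CSV layout that SageMaker's built-in XGBoost expects; whether this pipeline follows that convention is not stated in the notebook, and `csv_column_counts` is a hypothetical helper.

```python
import csv
import io


def csv_column_counts(text_stream, max_rows=5):
    """Return the set of column counts seen in the first `max_rows` rows."""
    counts = set()
    for i, row in enumerate(csv.reader(text_stream)):
        if i >= max_rows:
            break
        counts.add(len(row))
    return counts


# Stand-in for a few lines of train.csv: 1 label + 20 feature values per row.
sample_row = ",".join(["0"] * 21)
sample = io.StringIO(sample_row + "\n" + sample_row + "\n")

counts = csv_column_counts(sample)
print("Column counts found:", counts)
print("Matches target + 20 features:", counts == {21})
```

For the real data, the same function can be given an open file handle for a locally downloaded sample of `train.csv`.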


## 8. Run the Pipeline Experiment

**IMPORTANT:** Make sure the pipeline exists first!

If the pipeline doesn't exist, run this first:
```bash
cd /home/sagemaker-user/aai540_group1/tmp/scripts_v2
python pipeline_definition_v2.py
```
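
Whether the pipeline exists can also be checked from Python before launching anything. The sketch below wraps the SageMaker `describe_pipeline` call; `pipeline_exists` is a hypothetical helper, and the pipeline name in the usage comment is taken from the run log further down in this notebook.

```python
def pipeline_exists(sm_client, pipeline_name):
    """Return True if SageMaker knows a pipeline with this name.

    `sm_client` is expected to be a boto3 SageMaker client.
    describe_pipeline raises ResourceNotFound for unknown pipelines.
    """
    try:
        sm_client.describe_pipeline(PipelineName=pipeline_name)
        return True
    except sm_client.exceptions.ResourceNotFound:
        return False


# Usage in the notebook (requires AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker")
#   print(pipeline_exists(sm, "FlightDelayTrainingPipeline"))
```

Other failures, such as missing permissions, are deliberately not caught, so they surface as errors instead of being reported as a missing pipeline.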

In [9]:
# Import the run_experiment module
import sys
sys.path.insert(0, '/home/sagemaker-user/aai540_group1/tmp/scripts_v2')

from run_experiment_v2 import main

# Prepare arguments for the pipeline
args = [
    '--TrainingDataUrl', f's3://{bucket}/aai540-group1/training/engineered-baseline/train/',
    '--ValidationDataUrl', f's3://{bucket}/aai540-group1/training/engineered-baseline/val/',
    '--InputContentType', 'text/csv',
    '--MaxDepth', '8',
    '--Eta', '0.05',
    '--NumRound', '50', # Using 50 rounds for quicker testing
    '--ScalePosWeight', '4.58',
    '--Subsample', '0.8',
    '--ColsampleByTree', '0.8',
    '--MinChildWeight', '1',
    '--display-name', 'full-engineered-20-features'
]

print("üöÄ Starting pipeline execution with all 20 features...")
print(f"\nTraining Data: s3://{bucket}/aai540-group1/training/engineered-baseline/train/")
print(f"Validation Data: s3://{bucket}/aai540-group1/training/engineered-baseline/val/")
print(f"\nHyperparameters:")
print(f"  MaxDepth: 8")
print(f"  Eta: 0.05")
print(f"  NumRound: 50") # Using 50 rounds for quicker testing
print(f"  ScalePosWeight: 4.58")
print(f"\n" + "="*70)

# Run the pipeline
result = main(args)

if result == 0:
    print("\n" + "="*70)
    print("✅ Pipeline execution started successfully!")
    print("\n📊 Monitor progress in SageMaker Studio → Pipelines")
    print("="*70)
elif result == 2:
    print("\n" + "="*70)
    print("❌ Pipeline not found!")
    print("\n📝 Create the pipeline first:")
    print("   cd /home/sagemaker-user/aai540_group1/tmp/scripts_v2")
    print("   python pipeline_definition_v2.py")
    print("="*70)
else:
    print("\n❌ Pipeline execution failed. Check the output above for errors.")

2026-02-07 07:51:47,578 [INFO] run_experiment_v2 - Starting experiment: pipeline=FlightDelayTrainingPipeline, overrides={'TrainingDataUrl': 's3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-baseline/train/', 'ValidationDataUrl': 's3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-baseline/val/', 'InputContentType': 'text/csv', 'MaxDepth': 8, 'Eta': 0.05, 'NumRound': 50, 'Subsample': 0.8, 'ColsampleByTree': 0.8, 'ScalePosWeight': 4.58, 'MinChildWeight': 1}


🚀 Starting pipeline execution with all 20 features...

Training Data: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-baseline/train/
Validation Data: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-baseline/val/

Hyperparameters:
  MaxDepth: 8
  Eta: 0.05
  NumRound: 50
  ScalePosWeight: 4.58



2026-02-07 07:51:48,300 [INFO] run_experiment_v2 - Pipeline execution started. ARN: arn:aws:sagemaker:us-east-1:786869526001:pipeline/FlightDelayTrainingPipeline/execution/ryzu5n07lt5l



✅ Pipeline execution started successfully!

📊 Monitor progress in SageMaker Studio → Pipelines
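
Besides the Studio UI, progress can be followed from the notebook by polling the execution ARN printed in the log. The sketch below uses the SageMaker `describe_pipeline_execution` call; `wait_for_execution`, the polling interval, and the poll limit are hypothetical choices rather than part of `run_experiment_v2`.

```python
import time

# Statuses after which a pipeline execution no longer changes.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_execution(sm_client, execution_arn, poll_seconds=30, max_polls=120):
    """Poll a pipeline execution until it reaches a terminal status.

    `sm_client` is expected to be a boto3 SageMaker client.
    Returns the final status string, or raises TimeoutError if the
    execution is still running after `max_polls` checks.
    """
    for _ in range(max_polls):
        response = sm_client.describe_pipeline_execution(
            PipelineExecutionArn=execution_arn
        )
        status = response["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"execution still running after {max_polls} polls")


# Usage in the notebook (requires AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker")
#   print(wait_for_execution(sm, "<execution ARN from the log above>"))
```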
