# Engineered Baseline Model (No Target Encoding)

## Purpose
This notebook trains an XGBoost model on **16 engineered features**, excluding target-encoded rates to avoid temporal leakage. The model uses temporal transformations, distance features, and volume metrics while maintaining proper train/test separation.
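
As an illustration of the temporal transformations, here is a minimal sketch of one common way to build cyclical hour features such as `HOUR_SIN` and `HOUR_COS`. The formula used by the feature-engineering notebook is not shown here, so the 24-hour sine/cosine mapping and the helper name `encode_hour` are assumptions. The mapping is at least consistent with the sample rows printed later in this notebook, where `DEP_HOUR = 0` gives `HOUR_SIN = 0.0` and `HOUR_COS = 1.0`.

```python
import math
from typing import Tuple


def encode_hour(dep_hour: int) -> Tuple[float, float]:
    """Map an hour (0-23) onto the unit circle so 23:00 and 00:00 land close together.

    Assumed formula; the real pipeline may differ.
    """
    angle = 2 * math.pi * (dep_hour % 24) / 24
    return math.sin(angle), math.cos(angle)


hour_sin, hour_cos = encode_hour(0)
print(hour_sin, hour_cos)  # 0.0 1.0
```

Encoding the hour as a point on a circle lets a tree split treat late-night and early-morning departures as neighbors, which a raw 0-23 integer does not.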

## Prerequisites
- Model registry (`flight-delay-models` group) must exist
- Training data prepared by running `02_feature_engineering/03_prepare_training_variants.ipynb`
- Data location: `s3://{PERSONAL_BUCKET}/aai540-group1/training/engineered-no-target-encoding/`
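
The first prerequisite can be checked before training starts. The sketch below is one possible approach, not part of the original notebook: the helper name `model_package_group_exists` is hypothetical, and it relies on the SageMaker `list_model_package_groups` call with its `NameContains` filter. The client is passed in so the function can be exercised without AWS credentials.

```python
def model_package_group_exists(sm_client, group_name):
    """Return True if a model package group with exactly this name exists.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    kwargs = {"NameContains": group_name}
    while True:
        page = sm_client.list_model_package_groups(**kwargs)
        for summary in page.get("ModelPackageGroupSummaryList", []):
            # NameContains is a substring filter, so compare the full name.
            if summary["ModelPackageGroupName"] == group_name:
                return True
        token = page.get("NextToken")
        if not token:
            return False
        kwargs["NextToken"] = token


# Usage (requires AWS credentials), left as a comment to avoid side effects:
# import boto3
# assert model_package_group_exists(boto3.client("sagemaker"), "flight-delay-models")
```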

## Comparison Intent
This model serves as a middle ground between:
- **Raw baseline** (6 features): Basic temporal and flight characteristics
- **Full engineered** (20 features): Includes target-encoded historical delay rates

By excluding target-encoded features (AIRLINE_DELAY_RATE, ORIGIN_DELAY_RATE, DEST_DELAY_RATE, ROUTE_DELAY_RATE), we evaluate performance without potential information leakage while still leveraging feature engineering.
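
The exclusion can be enforced with a small guard that fails fast if a target-encoded column slips into the feature list. This is a sketch rather than code from the notebook; `find_leaky_features` is a hypothetical helper, while the column names are the ones listed above.

```python
# The four target-encoded rates this model must not see.
TARGET_ENCODED_FEATURES = {
    "AIRLINE_DELAY_RATE",
    "ORIGIN_DELAY_RATE",
    "DEST_DELAY_RATE",
    "ROUTE_DELAY_RATE",
}


def find_leaky_features(feature_names):
    """Return the target-encoded columns present in a feature list, sorted."""
    return sorted(TARGET_ENCODED_FEATURES.intersection(feature_names))


engineered = [
    "MONTH", "DAY", "DAY_OF_WEEK", "DEP_HOUR", "SCHEDULED_DEPARTURE",
    "HOUR_SIN", "HOUR_COS", "IS_PEAK_HOUR", "IS_WEEKEND",
    "DISTANCE", "SCHEDULED_TIME", "IS_LONG_HAUL", "DISTANCE_BUCKET",
    "ORIGIN_FLIGHTS", "DEST_FLIGHTS", "ROUTE_FLIGHTS",
]

assert len(engineered) == 16
assert find_leaky_features(engineered) == []
print(find_leaky_features(engineered + ["ROUTE_DELAY_RATE"]))  # ['ROUTE_DELAY_RATE']
```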

In [6]:
import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker.inputs import TrainingInput

print("Imports loaded successfully")
print(f"SageMaker version: {sagemaker.__version__}")

Imports loaded successfully
SageMaker version: 2.245.0


In [7]:
# SageMaker session and role
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

# S3 buckets
PUBLIC_BUCKET = 'sagemaker-us-east-1-425709451100'
PERSONAL_BUCKET = sess.default_bucket()

# Engineered features (16 total - excluding target-encoded rates)
ENGINEERED_FEATURES = [
    # Temporal (9)
    'MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE',
    'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND',
    # Distance (4)
    'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET',
    # Volume (3)
    'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS'
]
TARGET = 'DELAYED'

# S3 paths
INPUT_PREFIX = 'aai540-group1/features'
OUTPUT_PREFIX = 'aai540-group1/training/engineered-no-target-encoding'

print(f"Region: {region}")
print(f"Role: {role.split('/')[-1]}")
print(f"Public bucket (features): s3://{PUBLIC_BUCKET}/{INPUT_PREFIX}/")
print(f"Personal bucket (training): s3://{PERSONAL_BUCKET}/{OUTPUT_PREFIX}/")
print(f"\nEngineered features ({len(ENGINEERED_FEATURES)}): {ENGINEERED_FEATURES}")
print(f"Target: {TARGET}")

Region: us-east-1
Role: LabRole
Public bucket (features): s3://sagemaker-us-east-1-425709451100/aai540-group1/features/
Personal bucket (training): s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/

Engineered features (16): ['MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE', 'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND', 'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET', 'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS']
Target: DELAYED


In [8]:
# Load feature data from public bucket
s3_client = boto3.client('s3')

print("Loading feature data from public bucket...")

datasets = {}
for split in ['train', 'val']:
    s3_path = f"s3://{PUBLIC_BUCKET}/{INPUT_PREFIX}/{split}_features.parquet"
    
    # Read only needed columns from parquet (efficient columnar read)
    # Parquet's columnar format allows reading specific columns without loading entire file
    columns_to_read = ENGINEERED_FEATURES + [TARGET]
    df_engineered = pd.read_parquet(s3_path, columns=columns_to_read)
    
    datasets[split] = df_engineered
    
    print(f"{split:5s}: {df_engineered.shape[0]:>7,} rows × {df_engineered.shape[1]} columns")

# Unpack datasets
train_df = datasets['train']
val_df = datasets['val']

print(f"\nTotal samples: {sum(df.shape[0] for df in datasets.values()):,}")
print(f"\nFeatures: {ENGINEERED_FEATURES}")
print(f"\nSample from training set:")
print(train_df.head(3))

Loading feature data from public bucket...
train: 4,299,046 rows × 17 columns
val  : 482,878 rows × 17 columns

Total samples: 4,781,924

Features: ['MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE', 'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND', 'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET', 'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS']

Sample from training set:
   MONTH  DAY  DAY_OF_WEEK  DEP_HOUR  SCHEDULED_DEPARTURE  HOUR_SIN  HOUR_COS  \
0      1    1            4         0                    5       0.0       1.0   
1      1    1            4         0                   10       0.0       1.0   
2      1    1            4         0                   20       0.0       1.0   

   IS_PEAK_HOUR  IS_WEEKEND  DISTANCE  SCHEDULED_TIME  IS_LONG_HAUL  \
0             0           0      1448           205.0             0   
1             0           0      2330           280.0             1   
2             0           0      2296           28

In [9]:
# Define XGBoost Estimator
from sagemaker.image_uris import retrieve

# Get XGBoost container image
container = retrieve('xgboost', region, version='1.5-1')

# Output path for model artifacts
output_path = f"s3://{PERSONAL_BUCKET}/aai540-group1/models/engineered-no-target-encoding"

# Hyperparameters for engineered baseline model
hyperparameters = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 8,
    'eta': 0.05,
    'num_round': 1000,
    'scale_pos_weight': 4.58,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'early_stopping_rounds': 50
}

# Create estimator
xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_path,
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)

print("XGBoost Estimator configured")
print(f"\nContainer: {container}")
print(f"Instance type: ml.m5.xlarge")
print(f"Output path: {output_path}")
print(f"\nHyperparameters:")
for key, value in hyperparameters.items():
    print(f"  {key:25s} = {value}")

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


XGBoost Estimator configured

Container: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1
Instance type: ml.m5.xlarge
Output path: s3://sagemaker-us-east-1-786869526001/aai540-group1/models/engineered-no-target-encoding

Hyperparameters:
  objective                 = binary:logistic
  eval_metric               = auc
  max_depth                 = 8
  eta                       = 0.05
  num_round                 = 1000
  scale_pos_weight          = 4.58
  subsample                 = 0.8
  colsample_bytree          = 0.8
  early_stopping_rounds     = 50


In [10]:
# Use pre-prepared training data (from 03_prepare_training_variants.ipynb)
from botocore.exceptions import ClientError

train_path = f"s3://{PERSONAL_BUCKET}/{OUTPUT_PREFIX}/train/train.csv"
val_path = f"s3://{PERSONAL_BUCKET}/{OUTPUT_PREFIX}/validation/validation.csv"

print("Using pre-prepared training data:")
print(f"  Train:      {train_path}")
print(f"  Validation: {val_path}")

# Verify data exists
for name, path in [('Train', train_path), ('Validation', val_path)]:
    key = path.replace(f"s3://{PERSONAL_BUCKET}/", "")
    try:
        s3_client.head_object(Bucket=PERSONAL_BUCKET, Key=key)
        print(f"  ✓ {name} data verified")
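    except ClientError as err:
        # Only a missing object means the data was never prepared;
        # re-raise permission or throttling errors unchanged.
        if err.response['Error']['Code'] not in ('404', 'NoSuchKey', 'NotFound'):
            raise
        raise FileNotFoundError(
            f"{name} data not found at {path}\n"
            "Run 02_feature_engineering/03_prepare_training_variants.ipynb first."
        ) from err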

# Create TrainingInput channels
train_input = TrainingInput(train_path, content_type='text/csv')
validation_input = TrainingInput(val_path, content_type='text/csv')

Using pre-prepared training data:
  Train:      s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/train/train.csv
  Validation: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/engineered-no-target-encoding/validation/validation.csv
  ✓ Train data verified
  ✓ Validation data verified


In [11]:
print("\nStarting training job...")

# Train the model
xgb_estimator.fit({
    'train': train_input,
    'validation': validation_input
}, wait=True)

print("\nTraining complete!")
print(f"Model artifacts: {xgb_estimator.model_data}")

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2026-02-02-06-15-34-211



Starting training job...
2026-02-02 06:15:34 Starting - Starting the training job...
2026-02-02 06:16:00 Starting - Preparing the instances for training...
2026-02-02 06:16:25 Downloading - Downloading input data...
  from pandas import MultiIndex, Int64Index
[2026-02-02 06:17:50.153 ip-10-0-244-215.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2026-02-02 06:17:50.176 ip-10-0-244-215.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2026-02-02:06:17:50:INFO] Imported framework sagemaker_xgboost_container.training
[2026-02-02:06:17:50:INFO] Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
[2026-02-02:06:17:50:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2026-02-02:06:17:50:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-02:06:17:50:INFO] Running XGBoost Sagemaker in algorithm mode
[2026-02-02:06:17:50:INFO] Determined 0

In [12]:
# Register model to SageMaker Model Registry using boto3 client

print("Registering model to Model Registry...")

sm_client = boto3.client('sagemaker')
model_package_group_name = "flight-delay-models"

# Get model artifacts path from estimator
try:
    model_data = xgb_estimator.model_data
except Exception:
    # If estimator state lost, find the latest model from S3
    print("Fetching latest model artifact from S3...")
    response = s3_client.list_objects_v2(
        Bucket=PERSONAL_BUCKET,
        Prefix='aai540-group1/models/engineered-no-target-encoding/'
    )
    folders = set()
    for obj in response.get('Contents', []):
        parts = obj['Key'].split('/')
        if len(parts) >= 4:
            folders.add(parts[3])
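    if not folders:
        raise FileNotFoundError("No model artifacts found under the model output prefix.")
    # Default job names end in a timestamp, so the last name in sorted order is the latest.
    latest_job = sorted(folders)[-1]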
    model_data = f"s3://{PERSONAL_BUCKET}/aai540-group1/models/engineered-no-target-encoding/{latest_job}/output/model.tar.gz"

print(f"Model artifacts: {model_data}")
print(f"Container image: {container}")

# Create inference specification
inference_spec = {
    "Containers": [
        {
            "Image": container,
            "ModelDataUrl": model_data
        }
    ],
    "SupportedContentTypes": ["text/csv"],
    "SupportedResponseMIMETypes": ["text/csv"],
    "SupportedRealtimeInferenceInstanceTypes": ["ml.t2.medium", "ml.m5.large"],
    "SupportedTransformInstanceTypes": ["ml.m5.large"]
}

# Build CustomerMetadataProperties
customer_metadata = {
    "model_type": "engineered-no-target-encoding",
    "num_features": str(len(ENGINEERED_FEATURES)),
    "training_samples": str(len(train_df)),
    "validation_samples": str(len(val_df))
}

print(f"Model metadata: {customer_metadata}")

# Create model package
response = sm_client.create_model_package(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageDescription=f"Engineered baseline - {len(ENGINEERED_FEATURES)} features excluding target-encoded rates. Features: {' '.join(ENGINEERED_FEATURES)}",
    InferenceSpecification=inference_spec,
    ModelApprovalStatus='PendingManualApproval',
    CustomerMetadataProperties=customer_metadata
)

model_package_arn = response['ModelPackageArn']

print(f"\n✓ Model registered successfully!")
print(f"Model Package ARN: {model_package_arn}")
print(f"Model Package Group: {model_package_group_name}")
print(f"Approval Status: PendingManualApproval")
print(f"\nMetadata:")
print(f"  Model Type: engineered-no-target-encoding")
print(f"  Features: {ENGINEERED_FEATURES}")
print(f"  Training Samples: {len(train_df):,}")
print(f"  Validation Samples: {len(val_df):,}")

Registering model to Model Registry...
Model artifacts: s3://sagemaker-us-east-1-786869526001/aai540-group1/models/engineered-no-target-encoding/sagemaker-xgboost-2026-02-02-06-15-34-211/output/model.tar.gz
Container image: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1
Model metadata: {'model_type': 'engineered-no-target-encoding', 'num_features': '16', 'training_samples': '4299046', 'validation_samples': '482878'}

✓ Model registered successfully!
Model Package ARN: arn:aws:sagemaker:us-east-1:786869526001:model-package/flight-delay-models/2
Model Package Group: flight-delay-models
Approval Status: PendingManualApproval

Metadata:
  Model Type: engineered-no-target-encoding
  Features: ['MONTH', 'DAY', 'DAY_OF_WEEK', 'DEP_HOUR', 'SCHEDULED_DEPARTURE', 'HOUR_SIN', 'HOUR_COS', 'IS_PEAK_HOUR', 'IS_WEEKEND', 'DISTANCE', 'SCHEDULED_TIME', 'IS_LONG_HAUL', 'DISTANCE_BUCKET', 'ORIGIN_FLIGHTS', 'DEST_FLIGHTS', 'ROUTE_FLIGHTS']
  Training Samples: 4,299,046
  Validation S

## Results Summary

### Model Details
- **Model Type**: Engineered Baseline (No Target Encoding)
- **Features Used**: 16 engineered features
  - Temporal (9): `MONTH`, `DAY`, `DAY_OF_WEEK`, `DEP_HOUR`, `SCHEDULED_DEPARTURE`, `HOUR_SIN`, `HOUR_COS`, `IS_PEAK_HOUR`, `IS_WEEKEND`
  - Distance (4): `DISTANCE`, `SCHEDULED_TIME`, `IS_LONG_HAUL`, `DISTANCE_BUCKET`
  - Volume (3): `ORIGIN_FLIGHTS`, `DEST_FLIGHTS`, `ROUTE_FLIGHTS`
- **Algorithm**: XGBoost (binary:logistic)
- **Training Instance**: ml.m5.xlarge
- **Registry Status**: Registered to `flight-delay-models` group (PendingManualApproval)

### Hyperparameter Changes from Raw Baseline
- **Learning rate (eta)**: 0.05 (was 0.1) - slower, more careful learning
- **Iterations (num_round)**: 1000 (was 500) - allow longer training
- **Early stopping**: 50 rounds (was 30) - more patience before stopping
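
The training job also sets `scale_pos_weight` to 4.58, and the notebook does not state how that value was derived. A common heuristic, suggested in the XGBoost parameter documentation, is the ratio of negative to positive examples. The sketch below shows that calculation on toy labels; the helper name `suggested_scale_pos_weight` is hypothetical, and whether 4.58 came from this formula is an assumption.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) labels."""
    labels = list(labels)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("No positive examples; the ratio is undefined.")
    return negatives / positives


# Toy labels for illustration only: 8 on-time flights, 2 delayed.
print(suggested_scale_pos_weight([0] * 8 + [1] * 2))  # 4.0
```

If 4.58 was computed this way, it would correspond to roughly 18% of training flights being labeled delayed, since 1 / (1 + 4.58) ≈ 0.18.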

### Comparison Notes
- **vs Raw Baseline (6 features)**: Adds 10 engineered features covering temporal, distance, and volume signals
- **vs Full Engineered (20 features)**: Excludes 4 target-encoded delay rates to avoid potential leakage
- Model artifacts: `s3://sagemaker-us-east-1-786869526001/aai540-group1/models/engineered-no-target-encoding/`
- Training logs record the validation AUC per boosting round, which is the metric early stopping monitors
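
To pull the validation AUC out of saved training logs, a small parser is enough. The sketch below assumes per-round lines of the form `[12]#011train-auc:...#011validation-auc:...` (or tab-separated); the log captured in this notebook is cut off before any such lines appear, so that format, the helper name `best_validation_auc`, and the sample values are all assumptions.

```python
import re

# Matches "[<round>] ... validation-auc:<value>"; timestamped lines such as
# "[2026-02-02:06:17:50:INFO]" do not match because of the non-digit characters.
_ROUND_RE = re.compile(r"\[(\d+)\].*?validation-auc:([0-9]*\.?[0-9]+)")


def best_validation_auc(log_lines):
    """Return (round, auc) for the highest validation AUC, or None if no line matches."""
    best = None
    for line in log_lines:
        match = _ROUND_RE.search(line)
        if match:
            round_idx, auc = int(match.group(1)), float(match.group(2))
            if best is None or auc > best[1]:
                best = (round_idx, auc)
    return best


# Made-up log lines for illustration; these are not results from this training job.
sample = [
    "[0]#011train-auc:0.70#011validation-auc:0.69",
    "[1]\ttrain-auc:0.72\tvalidation-auc:0.71",
    "[2026-02-02:06:17:50:INFO] No GPUs detected (normal if no gpus installed)",
]
print(best_validation_auc(sample))  # (1, 0.71)
```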

### Next Steps
1. Evaluate on test set and compare metrics with raw baseline
2. Optionally train full 20-feature model with target encoding for comparison
3. Approve best model in registry for deployment
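
For step 3, approval can be done from code as well as from the console. The following is a minimal sketch, assuming the SageMaker `update_model_package` call; the helper name `approve_model_package` and the example reason string are hypothetical. The client is injected so the function can be tested without touching the registry.

```python
def approve_model_package(sm_client, model_package_arn, reason):
    """Mark a registered model package as Approved and return the API response.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    return sm_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
        ApprovalDescription=reason,
    )


# Usage (requires AWS credentials), left as a comment to avoid side effects:
# import boto3
# approve_model_package(
#     boto3.client("sagemaker"),
#     model_package_arn,  # the ARN returned by create_model_package
#     "Best test-set metrics among compared variants",
# )
```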