# Benchmark Model: Raw Baseline

## Purpose
This notebook establishes a performance baseline using a simple XGBoost model trained on only **6 raw features** with minimal preprocessing. This satisfies the assignment requirement for a "simple heuristic or a model with just a couple of features" to serve as a comparison point.

## Prerequisites
- The model registry group `flight-delay-models` must exist (it is created in the model registry setup notebook)
- Feature data must be available in the public S3 bucket: `s3://sagemaker-us-east-1-425709451100/aai540-group1/features/`

## Benchmark Strategy
We use only the most basic temporal and flight characteristics:
- `MONTH`, `DAY_OF_WEEK`, `DEP_HOUR` (temporal)
- `DISTANCE`, `SCHEDULED_TIME` (flight characteristics)
- `IS_WEEKEND` (derived temporal flag)

This model will be registered to the model group for comparison with more sophisticated approaches in subsequent notebooks.
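
`IS_WEEKEND` is the only feature in the list described as derived. The feature-engineering step that produces it is not part of this notebook, so the following is only a sketch of one way such a flag could be computed. It assumes `DAY_OF_WEEK` is encoded 1 (Monday) through 7 (Sunday); that encoding and the helper name `is_weekend` are assumptions, not something this notebook confirms.

```python
def is_weekend(day_of_week: int) -> int:
    """Return 1 for Saturday or Sunday, otherwise 0.

    ASSUMPTION: day_of_week uses 1 = Monday ... 7 = Sunday.
    """
    if not 1 <= day_of_week <= 7:
        raise ValueError(f"day_of_week out of range: {day_of_week}")
    # Under the assumed encoding, 6 and 7 are Saturday and Sunday.
    return int(day_of_week >= 6)
```

If the upstream data uses a different encoding (for example 0 = Monday), the threshold would need to change accordingly.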

In [1]:
import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.inputs import TrainingInput

print("Imports loaded successfully")
print(f"SageMaker version: {sagemaker.__version__}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Imports loaded successfully
SageMaker version: 2.245.0


In [2]:
# SageMaker session and role
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

# S3 buckets
PUBLIC_BUCKET = 'sagemaker-us-east-1-425709451100'
PERSONAL_BUCKET = sess.default_bucket()

# Benchmark features (6 only: a minimal baseline, as required by this week's Team Project Steps)
BENCHMARK_FEATURES = [
    'MONTH', 
    'DAY_OF_WEEK', 
    'DEP_HOUR', 
    'DISTANCE', 
    'SCHEDULED_TIME', 
    'IS_WEEKEND'
]
TARGET = 'DELAYED'

# S3 paths
INPUT_PREFIX = 'aai540-group1/features'
OUTPUT_PREFIX = 'aai540-group1/training/raw-baseline'

print(f"Region: {region}")
print(f"Role: {role.split('/')[-1]}")
print(f"Public bucket (features): s3://{PUBLIC_BUCKET}/{INPUT_PREFIX}/")
print(f"Personal bucket (training): s3://{PERSONAL_BUCKET}/{OUTPUT_PREFIX}/")
print(f"\nBenchmark features ({len(BENCHMARK_FEATURES)}): {BENCHMARK_FEATURES}")
print(f"Target: {TARGET}")

Region: us-east-1
Role: LabRole
Public bucket (features): s3://sagemaker-us-east-1-425709451100/aai540-group1/features/
Personal bucket (training): s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/

Benchmark features (6): ['MONTH', 'DAY_OF_WEEK', 'DEP_HOUR', 'DISTANCE', 'SCHEDULED_TIME', 'IS_WEEKEND']
Target: DELAYED


In [11]:
# Load feature data from public bucket
s3_client = boto3.client('s3')

print("Loading feature data from public bucket...")

datasets = {}
for split in ['train', 'val', 'test']:
    s3_path = f"s3://{PUBLIC_BUCKET}/{INPUT_PREFIX}/{split}_features.parquet"
    
    # Read only needed columns from parquet (efficient columnar read)
    # Parquet's columnar format allows reading specific columns without loading entire file
    columns_to_read = BENCHMARK_FEATURES + [TARGET]
    df_benchmark = pd.read_parquet(s3_path, columns=columns_to_read)
    
    datasets[split] = df_benchmark
    
    print(f"{split:5s}: {df_benchmark.shape[0]:>7,} rows × {df_benchmark.shape[1]} columns")

# Unpack datasets
train_df = datasets['train']
val_df = datasets['val']
test_df = datasets['test']

print(f"\nTotal samples: {sum(df.shape[0] for df in datasets.values()):,}")
print(f"\nFeatures: {BENCHMARK_FEATURES}")
print(f"\nSample from training set:")
print(train_df.head(3))

Loading feature data from public bucket...
train: 4,299,046 rows × 7 columns
val  : 482,878 rows × 7 columns
test : 462,367 rows × 7 columns

Total samples: 5,244,291

Features: ['MONTH', 'DAY_OF_WEEK', 'DEP_HOUR', 'DISTANCE', 'SCHEDULED_TIME', 'IS_WEEKEND']

Sample from training set:
   MONTH  DAY_OF_WEEK  DEP_HOUR  DISTANCE  SCHEDULED_TIME  IS_WEEKEND  DELAYED
0      1            4         0      1448           205.0           0        0
1      1            4         0      2330           280.0           0        0
2      1            4         0      2296           286.0           0        0


In [12]:
# Define XGBoost Estimator
from sagemaker.image_uris import retrieve

# Get XGBoost container image
container = retrieve('xgboost', region, version='1.5-1')

# Output path for model artifacts
output_path = f"s3://{PERSONAL_BUCKET}/aai540-group1/models/raw-baseline"

# Hyperparameters for benchmark model
hyperparameters = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 8,
    'eta': 0.1,
    'num_round': 500,
    'scale_pos_weight': 4.58,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'early_stopping_rounds': 30
}

# Create estimator
xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_path,
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)

print("XGBoost Estimator configured")
print(f"\nContainer: {container}")
print(f"Instance type: ml.m5.xlarge")
print(f"Output path: {output_path}")
print(f"\nHyperparameters:")
for key, value in hyperparameters.items():
    print(f"  {key:25s} = {value}")

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


XGBoost Estimator configured

Container: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1
Instance type: ml.m5.xlarge
Output path: s3://sagemaker-us-east-1-786869526001/aai540-group1/models/raw-baseline

Hyperparameters:
  objective                 = binary:logistic
  eval_metric               = auc
  max_depth                 = 8
  eta                       = 0.1
  num_round                 = 500
  scale_pos_weight          = 4.58
  subsample                 = 0.8
  colsample_bytree          = 0.8
  early_stopping_rounds     = 30
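
The notebook does not show how `scale_pos_weight = 4.58` was chosen. A common convention for imbalanced binary targets is the ratio of negative to positive training labels, and the sketch below computes that ratio. Whether 4.58 was derived this way is an assumption, and `neg_pos_ratio` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter


def neg_pos_ratio(labels):
    """Return (# of 0 labels) / (# of 1 labels) for a binary target.

    ASSUMPTION: this is one conventional way to pick scale_pos_weight;
    the notebook does not state how its value was obtained.
    """
    counts = Counter(int(y) for y in labels)
    positives, negatives = counts[1], counts[0]
    if positives == 0:
        raise ValueError("no positive labels; ratio is undefined")
    return negatives / positives


# Toy labels for illustration only (not taken from the flight data).
toy_labels = [0] * 9 + [1] * 2
print(neg_pos_ratio(toy_labels))
```

In this notebook the call would look like `neg_pos_ratio(train_df[TARGET])`, computed on the training split only so that validation and test labels do not influence the hyperparameter.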


In [13]:
# Prepare data for SageMaker and train model
import os
from botocore.exceptions import ClientError

# Local temp directory
local_dir = '/tmp/benchmark_train'
os.makedirs(local_dir, exist_ok=True)

print("Preparing training data (CSV format: target first, no headers)...")

# S3 training data location
s3_train_prefix = f"{OUTPUT_PREFIX}"

# Save and upload each dataset
s3_inputs = {}
for split, df in [('train', train_df), ('validation', val_df)]:
    # Reorder: target first, then features
    df_train = df[[TARGET] + BENCHMARK_FEATURES].copy()
    
    # Save locally as CSV (no headers)
    local_path = f"{local_dir}/{split}.csv"
    df_train.to_csv(local_path, header=False, index=False)
    
    # Upload to S3
    s3_key = f"{s3_train_prefix}/{split}/{split}.csv"
    s3_path = f"s3://{PERSONAL_BUCKET}/{s3_key}"
    
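    # Skip the upload only when the object is confirmed to exist.
    # Note: an existing object is not refreshed; delete it first if the features changed.
    try:
        s3_client.head_object(Bucket=PERSONAL_BUCKET, Key=s3_key)
        print(f"{split:12s}: Already exists at {s3_path}")
    except ClientError as err:
        # Only a "not found" error means the object is missing; re-raise anything else (e.g. 403)
        if err.response['Error']['Code'] not in ('404', 'NoSuchKey', 'NotFound'):
            raise
        s3_client.upload_file(local_path, PERSONAL_BUCKET, s3_key)
        print(f"{split:12s}: Uploaded to {s3_path}")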
    
    s3_inputs[split] = s3_path

# Create TrainingInput channels
train_input = TrainingInput(s3_inputs['train'], content_type='text/csv')
validation_input = TrainingInput(s3_inputs['validation'], content_type='text/csv')

Preparing training data (CSV format: target first, no headers)...
train       : Already exists at s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/train/train.csv
validation  : Already exists at s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/validation/validation.csv
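
The built-in XGBoost container expects CSV input with no header row and the label in the first column, which is why the cell above reorders the columns and writes with `header=False`. As a hedged sanity check, the sketch below validates the first rows of such a file before training. `check_xgboost_csv` is a hypothetical helper, and it treats empty cells as errors, which may be stricter than the container itself.

```python
import csv


def check_xgboost_csv(lines, n_features, max_rows=1000):
    """Validate up to max_rows rows of headerless CSV: label first, then features.

    `lines` is any iterable of text lines (an open file works).
    Returns the number of rows checked; raises ValueError on a bad row.
    """
    checked = 0
    for row in csv.reader(lines):
        if checked >= max_rows:
            break
        if len(row) != n_features + 1:
            raise ValueError(f"row {checked}: expected {n_features + 1} columns, got {len(row)}")
        # float() raises ValueError on a header row or any non-numeric cell.
        values = [float(cell) for cell in row]
        if values[0] not in (0.0, 1.0):
            raise ValueError(f"row {checked}: label must be 0 or 1, got {row[0]}")
        checked += 1
    return checked


# Two rows in the layout written above: DELAYED first, then the 6 features.
sample = ["0,1,4,0,1448,205.0,0", "0,1,4,0,2330,280.0,0"]
print(check_xgboost_csv(sample, n_features=6))
```

In the notebook this could be run as `check_xgboost_csv(open(local_path), len(BENCHMARK_FEATURES))` right after `to_csv`, so a malformed file is caught before a training instance is launched.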


In [14]:

print(f"\nStarting training job...")
print(f"Train data: {s3_inputs['train']}")
print(f"Validation data: {s3_inputs['validation']}")

# Train the model
xgb_estimator.fit({
    'train': train_input,
    'validation': validation_input
}, wait=True)

print("\nTraining complete!")
print(f"Model artifacts: {xgb_estimator.model_data}")

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2026-02-02-03-41-23-733



Starting training job...
Train data: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/train/train.csv
Validation data: s3://sagemaker-us-east-1-786869526001/aai540-group1/training/raw-baseline/validation/validation.csv
2026-02-02 03:41:24 Starting - Starting the training job...
2026-02-02 03:41:39 Starting - Preparing the instances for training...
2026-02-02 03:42:26 Downloading - Downloading the training image......
  from pandas import MultiIndex, Int64Index
[2026-02-02 03:43:19.385 ip-10-2-71-72.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2026-02-02 03:43:19.413 ip-10-2-71-72.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2026-02-02:03:43:19:INFO] Imported framework sagemaker_xgboost_container.training
[2026-02-02:03:43:19:INFO] Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
[2026-02-02:03:43:19:INFO] Failed to parse hyperparameter objective value binary:log

In [20]:
# Register model to SageMaker Model Registry using boto3 client
# Pattern from homework 4.1

print("Registering model to Model Registry...")

sm_client = boto3.client('sagemaker')
model_package_group_name = "flight-delay-models"

# Get model artifacts path from estimator (or construct from known location)
try:
    model_data = xgb_estimator.model_data
except Exception:  # e.g. NameError after a kernel restart; avoids swallowing KeyboardInterrupt
    # If estimator state lost, find the latest model from S3
    print("Fetching latest model artifact from S3...")
    response = s3_client.list_objects_v2(
        Bucket=PERSONAL_BUCKET,
        Prefix='aai540-group1/models/raw-baseline/'
    )
    folders = set()
    for obj in response.get('Contents', []):
        parts = obj['Key'].split('/')
        if len(parts) >= 4:
            folders.add(parts[3])
    latest_job = sorted(folders)[-1]
    model_data = f"s3://{PERSONAL_BUCKET}/aai540-group1/models/raw-baseline/{latest_job}/output/model.tar.gz"

print(f"Model artifacts: {model_data}")
print(f"Container image: {container}")

# Create inference specification (matching homework 4.1 pattern)
inference_spec = {
    "Containers": [
        {
            "Image": container,
            "ModelDataUrl": model_data
        }
    ],
    "SupportedContentTypes": ["text/csv"],
    "SupportedResponseMIMETypes": ["text/csv"],
    "SupportedRealtimeInferenceInstanceTypes": ["ml.t2.medium", "ml.m5.large"],
    "SupportedTransformInstanceTypes": ["ml.m5.large"]
}

# Build CustomerMetadataProperties (feature list is in notebook code, not metadata)
customer_metadata = {
    "model_type": "raw-baseline",
    "num_features": str(len(BENCHMARK_FEATURES)),
    "training_samples": str(len(train_df)),
    "validation_samples": str(len(val_df))
}

print(f"Model metadata: {customer_metadata}")

# Create model package
response = sm_client.create_model_package(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageDescription=f"Benchmark baseline model - XGBoost trained on {len(BENCHMARK_FEATURES)} raw features only. Features: MONTH, DAY_OF_WEEK, DEP_HOUR, DISTANCE, SCHEDULED_TIME, IS_WEEKEND",
    InferenceSpecification=inference_spec,
    ModelApprovalStatus='PendingManualApproval',
    CustomerMetadataProperties=customer_metadata
)

model_package_arn = response['ModelPackageArn']

print(f"\n✓ Model registered successfully!")
print(f"Model Package ARN: {model_package_arn}")
print(f"Model Package Group: {model_package_group_name}")
print(f"Approval Status: PendingManualApproval")
print(f"\nMetadata:")
print(f"  Model Type: raw-baseline")
print(f"  Features: {BENCHMARK_FEATURES}")
print(f"  Training Samples: {len(train_df):,}")
print(f"  Validation Samples: {len(val_df):,}")

Registering model to Model Registry...
Model artifacts: s3://sagemaker-us-east-1-786869526001/aai540-group1/models/raw-baseline/sagemaker-xgboost-2026-02-02-03-41-23-733/output/model.tar.gz
Container image: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1
Model metadata: {'model_type': 'raw-baseline', 'num_features': '6', 'training_samples': '4299046', 'validation_samples': '482878'}

✓ Model registered successfully!
Model Package ARN: arn:aws:sagemaker:us-east-1:786869526001:model-package/flight-delay-models/1
Model Package Group: flight-delay-models
Approval Status: PendingManualApproval

Metadata:
  Model Type: raw-baseline
  Features: ['MONTH', 'DAY_OF_WEEK', 'DEP_HOUR', 'DISTANCE', 'SCHEDULED_TIME', 'IS_WEEKEND']
  Training Samples: 4,299,046
  Validation Samples: 482,878
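
For later notebooks that need this package without hard-coding the ARN, the sketch below looks up the most recently created package in a group. It uses the SageMaker `list_model_packages` API; the helper name `latest_model_package_arn` is hypothetical, and the function takes the client as an argument so it can be exercised without AWS credentials.

```python
def latest_model_package_arn(sm_client, group_name):
    """Return the ARN of the newest model package in a group, or None if empty.

    `sm_client` is expected to behave like boto3.client('sagemaker').
    """
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    return summaries[0]["ModelPackageArn"] if summaries else None


# Usage inside the notebook (requires AWS credentials; not run here):
# arn = latest_model_package_arn(sm_client, model_package_group_name)
```

Because the package is registered as `PendingManualApproval`, a reviewer would still need to approve it, for example with `sm_client.update_model_package(ModelPackageArn=arn, ModelApprovalStatus='Approved')`, before approval-gated deployment steps pick it up.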


## Results Summary

### Model Details
- **Model Type**: Raw Baseline (Benchmark)
- **Features Used**: 6 raw features only
  - Temporal: `MONTH`, `DAY_OF_WEEK`, `DEP_HOUR`, `IS_WEEKEND`
  - Flight: `DISTANCE`, `SCHEDULED_TIME`
- **Algorithm**: XGBoost (binary:logistic)
- **Training Instance**: ml.m5.xlarge
- **Registry Status**: Registered to `flight-delay-models` group (PendingManualApproval)

### Notes
- This model establishes the minimum performance baseline required by the assignment
- Model artifacts: `s3://sagemaker-us-east-1-786869526001/aai540-group1/models/raw-baseline/`
- The training logs report validation AUC for each boosting round; early stopping (30 rounds) monitors that metric
- The model can be retrieved from the registry by its package ARN for deployment or evaluation
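
Since the validation AUC is only mentioned as being in the training logs, one way to read the final value programmatically is from the `FinalMetricDataList` field of a `describe_training_job` response. The sketch below is a small helper over that response dict. `final_metrics` is a hypothetical name, and the metric key `validation:auc` is an assumption about what the built-in XGBoost container reports.

```python
def final_metrics(job_description):
    """Map metric name -> final value from a DescribeTrainingJob response dict."""
    return {
        metric["MetricName"]: metric["Value"]
        for metric in job_description.get("FinalMetricDataList", [])
    }


# Usage inside the notebook (requires AWS credentials; not run here):
# desc = sm_client.describe_training_job(
#     TrainingJobName="sagemaker-xgboost-2026-02-02-03-41-23-733"
# )
# print(final_metrics(desc).get("validation:auc"))  # ASSUMED metric name
```

Recording this value in `CustomerMetadataProperties` at registration time would make the baseline directly comparable with later models in the registry.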