# S3 Datalake Setup and Data Ingestion

**AAI-540 Group 1 - Flight Delay Prediction Project**

## Objective
This notebook performs the initial data ingestion for our MLOps project:
1. Download the 2015 Flight Delays dataset from Kaggle
2. Verify data integrity and file sizes
3. Set up S3 bucket structure following MLOps best practices
4. Upload raw data files to S3 datalake

## Dataset
- **Name:** 2015 Flight Delays and Cancellations
- **Source:** U.S. DOT via Kaggle
- **Files:** flights.csv (~5.8M rows), airlines.csv, airports.csv
- **Size:** ~191 MB compressed download, ~565 MB extracted (almost all of it in flights.csv)

---

## 1. Set Up Environment

Import required libraries and load project configuration.

In [4]:
# Install project dependencies from requirements.txt
%pip install -q -r ../../requirements.txt

print("✓ Dependencies installed successfully")

Note: you may need to restart the kernel to use updated packages.
✓ Dependencies installed successfully


In [5]:
import sys
import os
from pathlib import Path
import boto3
import sagemaker
import pandas as pd
from datetime import datetime

# Add project root to Python path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Import project configuration
from config import settings

# Display configuration
settings.print_config()

print(f"\nNotebook executed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

AAI-540 Group 1 Project Configuration
Region: us-east-1
S3 Bucket: sagemaker-us-east-1-786869526001
Project Prefix: aai540-group1/
Athena Database: aai540_group1_db

S3 Base URI: s3://sagemaker-us-east-1-786869526001/aai540-group1/

Notebook executed at: 2026-01-23 07:48:41


## 2. Download Dataset from Kaggle

This step uses `kagglehub` to download the 2015 Flight Delays dataset.

**Requirements:**
- Kaggle API credentials configured (`~/.kaggle/kaggle.json`)
- Or Kaggle authentication via environment variables

The dataset will be downloaded to a local cache managed by kagglehub.
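
The two credential sources listed above can be checked before starting the download. The following is a minimal sketch, not part of the original notebook: the helper name `kaggle_credentials_available` is hypothetical, and it assumes the environment variables are named `KAGGLE_USERNAME` and `KAGGLE_KEY`.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, env=None):
    """Return True if either credential source is present (hypothetical helper)."""
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    # Option 1: API token file at ~/.kaggle/kaggle.json
    has_json = (home / ".kaggle" / "kaggle.json").is_file()
    # Option 2: environment variables (names are an assumption)
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return has_json or has_env


if not kaggle_credentials_available():
    print("No Kaggle credentials found; the download may fail or ask for a login.")
```

The check only reports whether credentials appear to be configured; it does not validate them against Kaggle.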

In [6]:
import kagglehub

# Download latest version of the flight delays dataset
print(f"Downloading dataset: {settings.DATASET_KAGGLE_ID}")
print("This may take several minutes (~191 MB compressed)...\n")

dataset_path = kagglehub.dataset_download(settings.DATASET_KAGGLE_ID)

print("\n✓ Dataset downloaded successfully!")
print(f"Path to dataset files: {dataset_path}")

# Store path for later use
%store dataset_path

Downloading dataset: usdot/flight-delays
This may take several minutes (~191 MB compressed)...

Downloading to /home/sagemaker-user/.cache/kagglehub/datasets/usdot/flight-delays/1.archive...


100%|██████████| 191M/191M [00:01<00:00, 110MB/s]  

Extracting files...






✓ Dataset downloaded successfully!
Path to dataset files: /home/sagemaker-user/.cache/kagglehub/datasets/usdot/flight-delays/versions/1
Stored 'dataset_path' (str)


## 3. Verify Downloaded Files

Check that all expected data files are present and examine their sizes.

In [7]:
dataset_dir = Path(dataset_path)

print("Dataset files found:")
print("=" * 70)

file_info = []
for expected_file in settings.DATA_FILES.values():
    file_path = dataset_dir / expected_file
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024 * 1024)
        file_info.append({
            'File': expected_file,
            'Size (MB)': f'{size_mb:.2f}',
            'Status': '✓'
        })
        print(f"✓ {expected_file:20s} - {size_mb:8.2f} MB")
    else:
        file_info.append({
            'File': expected_file,
            'Size (MB)': 'N/A',
            'Status': '✗ MISSING'
        })
        print(f"✗ {expected_file:20s} - MISSING")

print("=" * 70)

# Create DataFrame for summary
df_files = pd.DataFrame(file_info)
print(f"\nTotal files found: {df_files[df_files['Status'] == '✓'].shape[0]} / {len(settings.DATA_FILES)}")

Dataset files found:
✓ flights.csv          -   564.96 MB
✓ airlines.csv         -     0.00 MB
✓ airports.csv         -     0.02 MB

Total files found: 3 / 3
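
The check above confirms that each file exists and reports its size, but it does not fingerprint the contents. One way to record a content checksum alongside the sizes is sketched below; the helper name `sha256_of_file` is hypothetical and the notebook itself does not compute checksums.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Possible usage in this notebook (assumes dataset_dir and settings from earlier cells):
# checksums = {name: sha256_of_file(dataset_dir / name)
#              for name in settings.DATA_FILES.values()}
```

Storing these digests with the run makes it possible to tell later whether a re-downloaded copy of the dataset differs from the one uploaded here.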


### Quick Data Preview

Load the first five rows of flights.csv and the full airlines.csv and airports.csv to confirm that each file parses correctly.

In [8]:
# Preview flights.csv (large file - just first 5 rows)
flights_path = dataset_dir / settings.DATA_FILES['flights']
print("FLIGHTS.CSV Preview:")
print("-" * 70)
df_flights_preview = pd.read_csv(flights_path, nrows=5)
print(f"Columns: {df_flights_preview.shape[1]}")
print(df_flights_preview.head())

print("\n" + "=" * 70 + "\n")

# Preview airlines.csv (small file)
airlines_path = dataset_dir / settings.DATA_FILES['airlines']
print("AIRLINES.CSV Preview:")
print("-" * 70)
df_airlines = pd.read_csv(airlines_path)
print(f"Total airlines: {df_airlines.shape[0]}")
print(df_airlines.head())

print("\n" + "=" * 70 + "\n")

# Preview airports.csv
airports_path = dataset_dir / settings.DATA_FILES['airports']
print("AIRPORTS.CSV Preview:")
print("-" * 70)
df_airports = pd.read_csv(airports_path)
print(f"Total airports: {df_airports.shape[0]}")
print(df_airports.head())

FLIGHTS.CSV Preview:
----------------------------------------------------------------------
Columns: 31
   YEAR  MONTH  DAY  DAY_OF_WEEK AIRLINE  FLIGHT_NUMBER TAIL_NUMBER  \
0  2015      1    1            4      AS             98      N407AS   
1  2015      1    1            4      AA           2336      N3KUAA   
2  2015      1    1            4      US            840      N171US   
3  2015      1    1            4      AA            258      N3HYAA   
4  2015      1    1            4      AS            135      N527AS   

  ORIGIN_AIRPORT DESTINATION_AIRPORT  SCHEDULED_DEPARTURE  ...  ARRIVAL_TIME  \
0            ANC                 SEA                    5  ...           408   
1            LAX                 PBI                   10  ...           741   
2            SFO                 CLT                   20  ...           811   
3            LAX                 MIA                   20  ...           756   
4            SEA                 ANC                   25  ...       
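
The preview reads only five rows of flights.csv, so it does not confirm the ~5.8M row count quoted in the dataset description. A minimal sketch of a streaming row count follows; the helper name `count_csv_rows` is hypothetical, and it uses the standard-library `csv` module rather than pandas to keep memory use flat.

```python
import csv


def count_csv_rows(path, encoding="utf-8"):
    """Count data rows (excluding the header) by streaming the file."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if the file is not empty
        return sum(1 for _ in reader)


# Possible usage (assumes flights_path from the preview cell):
# print(f"flights.csv rows: {count_csv_rows(flights_path):,}")
```

On a file of this size a full pass takes noticeably longer than the five-row preview, so it is best run once and recorded.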

## 4. Set Up S3 Bucket Structure

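Initialize the S3 client and verify bucket access. The bucket layout follows common MLOps practice: each pipeline stage gets its own key prefix (raw, processed, and Parquet data; features; Athena staging; training input and output; evaluation; batch inference; monitoring). S3 has no real directories, so nothing needs to be created up front; a prefix exists as soon as an object is written under it.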

In [9]:
# Initialize S3 client
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

# Get bucket name from settings
bucket_name = settings.DEFAULT_BUCKET

print(f"S3 Bucket: {bucket_name}")
print(f"Region: {settings.REGION}")
print(f"\nS3 Structure:")
print("=" * 70)

for key, path in settings.S3_PATHS.items():
    print(f"  {key:20s}: {path}")

print("=" * 70)

# Verify bucket exists and is accessible
try:
    s3_client.head_bucket(Bucket=bucket_name)
    print(f"\n✓ Bucket '{bucket_name}' is accessible")
except Exception as e:
    print(f"\n✗ Error accessing bucket: {e}")
    raise

S3 Bucket: sagemaker-us-east-1-786869526001
Region: us-east-1

S3 Structure:
  raw_data            : s3://sagemaker-us-east-1-786869526001/aai540-group1/data/raw/
  processed_data      : s3://sagemaker-us-east-1-786869526001/aai540-group1/data/processed/
  parquet_data        : s3://sagemaker-us-east-1-786869526001/aai540-group1/data/parquet/
  features            : s3://sagemaker-us-east-1-786869526001/aai540-group1/features/
  athena_staging      : s3://sagemaker-us-east-1-786869526001/aai540-group1/athena/staging/
  training_input      : s3://sagemaker-us-east-1-786869526001/aai540-group1/training/input/
  training_output     : s3://sagemaker-us-east-1-786869526001/aai540-group1/training/output/
  evaluation          : s3://sagemaker-us-east-1-786869526001/aai540-group1/evaluation/
  batch_inference     : s3://sagemaker-us-east-1-786869526001/aai540-group1/inference/batch/
  monitoring          : s3://sagemaker-us-east-1-786869526001/aai540-group1/monitoring/

✓ Bucket 'sagemaker-us
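
The `config` module that defines `settings.S3_PATHS` is not shown in this notebook. The sketch below illustrates how a layout like the one printed above could be derived from a bucket name and a project prefix; `build_s3_paths` is a hypothetical helper, not the project's actual implementation.

```python
def build_s3_paths(bucket, project_prefix):
    """Build the stage-to-URI mapping from a bucket name and project prefix."""
    base = f"s3://{bucket}/{project_prefix.strip('/')}/"
    # Suffixes mirror the structure printed by the cell above
    suffixes = {
        "raw_data": "data/raw/",
        "processed_data": "data/processed/",
        "parquet_data": "data/parquet/",
        "features": "features/",
        "athena_staging": "athena/staging/",
        "training_input": "training/input/",
        "training_output": "training/output/",
        "evaluation": "evaluation/",
        "batch_inference": "inference/batch/",
        "monitoring": "monitoring/",
    }
    return {name: base + suffix for name, suffix in suffixes.items()}


paths = build_s3_paths("example-bucket", "aai540-group1/")
print(paths["raw_data"])
```

Keeping the suffixes in one mapping means every notebook in the project resolves the same locations from a single definition.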

## 5. Upload Raw Data to S3

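Upload all three CSV files to the S3 raw data prefix. The upload uses boto3's `upload_file`, which switches to multipart transfers for large files, and reports progress through a `tqdm` bar driven by the transfer callback.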

In [10]:
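import threading
from tqdm import tqdm

class S3ProgressCallback:
    """Callback class for tracking S3 upload progress."""
    
    def __init__(self, filename, filesize):
        self._filename = filename
        self._size = filesize
        self._seen_so_far = 0
        # boto3 may invoke the callback from several transfer threads
        self._lock = threading.Lock()
        self._pbar = tqdm(total=filesize, unit='B', unit_scale=True, desc=filename)
    
    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            self._pbar.update(bytes_amount)
            # Close the bar once the last byte is reported instead of relying on
            # garbage collection, so it does not bleed into the next file's output
            if self._seen_so_far >= self._size:
                self._pbar.close()
    
    def __del__(self):
        # Fallback for uploads that fail before completion
        self._pbar.close()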


# Get S3 prefix for raw data (remove s3:// and bucket name)
raw_data_s3_uri = settings.S3_PATHS['raw_data']
s3_prefix = raw_data_s3_uri.replace(f's3://{bucket_name}/', '')

print(f"Uploading files to: {raw_data_s3_uri}")
print("=" * 70)

upload_results = []

for file_key, filename in settings.DATA_FILES.items():
    local_file_path = dataset_dir / filename
    s3_key = f"{s3_prefix}{filename}"
    
    if not local_file_path.exists():
        print(f"✗ Skipping {filename} - file not found locally")
        continue
    
    file_size = local_file_path.stat().st_size
    
    print(f"\nUploading: {filename}")
    print(f"  Size: {file_size / (1024**2):.2f} MB")
    print(f"  S3 Key: {s3_key}")
    
    try:
        # Upload with progress tracking
        callback = S3ProgressCallback(filename, file_size)
        s3_client.upload_file(
            str(local_file_path),
            bucket_name,
            s3_key,
            Callback=callback
        )
        
        upload_results.append({
            'File': filename,
            'Size (MB)': f'{file_size / (1024**2):.2f}',
            'S3 URI': f's3://{bucket_name}/{s3_key}',
            'Status': '✓ Success'
        })
        
        print(f"  ✓ Upload complete")
        
    except Exception as e:
        print(f"  ✗ Upload failed: {e}")
        upload_results.append({
            'File': filename,
            'Size (MB)': f'{file_size / (1024**2):.2f}',
            'S3 URI': f's3://{bucket_name}/{s3_key}',
            'Status': f'✗ Failed: {str(e)[:50]}'
        })

print("\n" + "=" * 70)
print("\nUpload Summary:")
df_uploads = pd.DataFrame(upload_results)
print(df_uploads.to_string(index=False))

Uploading files to: s3://sagemaker-us-east-1-786869526001/aai540-group1/data/raw/

Uploading: flights.csv
  Size: 564.96 MB
  S3 Key: aai540-group1/data/raw/flights.csv


flights.csv:  99%|█████████▉| 588M/592M [00:01<00:00, 478MB/s]  

  ✓ Upload complete

Uploading: airlines.csv
  Size: 0.00 MB
  S3 Key: aai540-group1/data/raw/airlines.csv


flights.csv: 100%|██████████| 592M/592M [00:01<00:00, 330MB/s]


  ✓ Upload complete

Uploading: airports.csv
  Size: 0.02 MB
  S3 Key: aai540-group1/data/raw/airports.csv


airlines.csv: 100%|██████████| 359/359 [00:00<00:00, 11.3kB/s]

  ✓ Upload complete


Upload Summary:
        File Size (MB)                                                                    S3 URI    Status
 flights.csv    564.96  s3://sagemaker-us-east-1-786869526001/aai540-group1/data/raw/flights.csv ✓ Success
airlines.csv      0.00 s3://sagemaker-us-east-1-786869526001/aai540-group1/data/raw/airlines.csv ✓ Success
airports.csv      0.02 s3://sagemaker-us-east-1-786869526001/aai540-group1/data/raw/airports.csv ✓ Success
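
The upload cell derives the key prefix by stripping `s3://{bucket_name}/` from the URI with `str.replace`, which silently returns the URI unchanged if the bucket name does not match. A slightly stricter alternative is sketched below; `split_s3_uri` is a hypothetical helper, not part of boto3 or the project configuration.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into (bucket, key_or_prefix)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key or prefix
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key


bucket, prefix = split_s3_uri("s3://example-bucket/aai540-group1/data/raw/")
print(bucket, prefix)
```

Raising on a malformed URI surfaces configuration mistakes before any object is written to an unintended key.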





## 6. Verify S3 Uploads

List and verify all uploaded files in the S3 raw data location.

In [11]:
print(f"Listing objects in: {raw_data_s3_uri}")
print("=" * 70)

# List objects in the raw data prefix
response = s3_client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=s3_prefix
)

if 'Contents' in response:
    s3_files = []
    total_size = 0
    
    for obj in response['Contents']:
        size_mb = obj['Size'] / (1024 * 1024)
        total_size += obj['Size']
        s3_files.append({
            'File': obj['Key'].split('/')[-1],
            'Size (MB)': f'{size_mb:.2f}',
            'Last Modified': obj['LastModified'].strftime('%Y-%m-%d %H:%M:%S'),
            'S3 Key': obj['Key']
        })
    
    df_s3_files = pd.DataFrame(s3_files)
    print(df_s3_files.to_string(index=False))
    
    print("\n" + "=" * 70)
    print(f"Total files in S3: {len(s3_files)}")
    print(f"Total size: {total_size / (1024**2):.2f} MB")
    
    # Verify all expected files are present
    uploaded_filenames = {f['File'] for f in s3_files}
    expected_filenames = set(settings.DATA_FILES.values())
    
    missing_files = expected_filenames - uploaded_filenames
    
    if missing_files:
        print(f"\n⚠ Warning: Missing files: {missing_files}")
    else:
        print(f"\n✓ All expected files are present in S3!")
        
else:
    print("✗ No objects found in the specified S3 location")

Listing objects in: s3://sagemaker-us-east-1-786869526001/aai540-group1/data/raw/
        File Size (MB)       Last Modified                              S3 Key
airlines.csv      0.00 2026-01-23 07:57:56 aai540-group1/data/raw/airlines.csv
airports.csv      0.02 2026-01-23 07:57:56 aai540-group1/data/raw/airports.csv
 flights.csv    564.96 2026-01-23 07:57:54  aai540-group1/data/raw/flights.csv

Total files in S3: 3
Total size: 564.99 MB

✓ All expected files are present in S3!
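
The listing above checks that the expected file names are present. Comparing byte sizes against the local copies is a stronger check, sketched below under the assumption that the listing entries have the `Key` and `Size` fields shown in the cell above; `find_size_mismatches` is a hypothetical helper.

```python
def find_size_mismatches(local_sizes, s3_objects, prefix):
    """Compare local byte sizes against S3 listing entries under a prefix.

    local_sizes: {filename: size_in_bytes}
    s3_objects:  list of dicts with 'Key' and 'Size' fields
    Returns {filename: (local_size, remote_size_or_None)} for every mismatch.
    """
    remote = {
        obj["Key"][len(prefix):]: obj["Size"]
        for obj in s3_objects
        if obj["Key"].startswith(prefix)
    }
    return {
        name: (size, remote.get(name))
        for name, size in local_sizes.items()
        if remote.get(name) != size
    }


# Possible usage (assumes dataset_dir, settings, response, and s3_prefix from earlier cells):
# local = {f: (dataset_dir / f).stat().st_size for f in settings.DATA_FILES.values()}
# print(find_size_mismatches(local, response.get("Contents", []), s3_prefix))
```

An empty result means every local file has a same-sized object in S3. A single `list_objects_v2` call returns at most 1,000 keys, which is enough for three files but would need pagination for larger prefixes.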


## Summary

**S3 Datalake Setup Complete!**

✓ Downloaded 2015 Flight Delays dataset from Kaggle  
✓ Verified data integrity and file sizes  
✓ Uploaded raw data to S3 datalake  
✓ Confirmed all files are accessible in S3

**Next Steps:**
1. Set up Athena database and tables (next notebook)
2. Perform exploratory data analysis
3. Feature engineering and Feature Store setup

**S3 Data Location:**
```
s3://{bucket}/aai540-group1/data/raw/
├── flights.csv   (~565 MB, ~5.8M rows)
├── airlines.csv  (359 B, 14 airlines)
└── airports.csv  (~20 KB, 322 airports)
```