# 02 - Clean and Normalize

## Overview
This notebook standardizes tyre compound labels, derives stints, attaches tyre age to each lap, and removes outlier laps.

## Inputs
- data/raw/*_laps.parquet

## Outputs
- data/interim/laps_interim.parquet
- data/interim/stints_interim.parquet

In [6]:
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

import pandas as pd
from f1ts import config, io_flat, clean, validation

config.ensure_dirs()

## Load Raw Laps

In [7]:
raw_dir = config.paths()['data_raw']
laps_files = list(raw_dir.glob('*_laps.parquet'))

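print(f"Found {len(laps_files)} lap files")

# pd.concat raises an opaque ValueError on an empty list, so fail early
if not laps_files:
    raise FileNotFoundError(f"No *_laps.parquet files found in {raw_dir}")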

all_laps = []
for laps_file in laps_files:
    laps = pd.read_parquet(laps_file)
    all_laps.append(laps)
    print(f"  Loaded {laps_file.name}: {len(laps):,} laps")

laps_raw = pd.concat(all_laps, ignore_index=True)
print(f"\nTotal laps: {len(laps_raw):,}")

Found 3 lap files
  Loaded 2023_3_R_laps.parquet: 882 laps
  Loaded 2023_1_R_laps.parquet: 1,035 laps
  Loaded 2023_2_R_laps.parquet: 904 laps

Total laps: 2,821


## Transform: Clean Pipeline

In [8]:
laps_clean, stints = clean.clean_pipeline(laps_raw)

print(f"\nCleaned laps: {len(laps_clean):,}")
print(f"Stints derived: {len(stints):,}")

Starting cleaning pipeline...
✓ Standardized compounds: 2,821 laps
✓ Derived 224 stints
✓ Attached tyre age
✓ Fixed data types
Removing 288 outlier laps (10.2%)
✓ Removed outliers: 2,533 laps remaining

Cleaned laps: 2,533
Stints derived: 224
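
The stint and tyre-age steps reported above can be illustrated with a minimal sketch. It is not the code inside `clean.clean_pipeline`; it only shows one way to start a new stint after a pit stop or a compound change. The function name `derive_stints_sketch` and the boolean column `is_pit_lap` are hypothetical, and tyre age is assumed to start at 1 on the first lap of every stint (used tyre sets would need an offset).

```python
import pandas as pd


def derive_stints_sketch(laps: pd.DataFrame) -> pd.DataFrame:
    """Add stint_id and tyre_age_laps per (session_key, driver).

    Assumes a hypothetical boolean column `is_pit_lap` that marks the lap
    on which the car pitted; the real pipeline may detect stops differently.
    """
    out = laps.sort_values(["session_key", "driver", "lap"]).reset_index(drop=True)
    keys = [out["session_key"], out["driver"]]

    # Did the previous lap of the same driver end in the pits?
    prev_pit = out["is_pit_lap"].astype(int).groupby(keys).shift(1, fill_value=0).eq(1)

    # Did the compound change relative to the previous lap of the same driver?
    prev_compound = out["compound"].groupby(keys).shift(1)
    compound_changed = prev_compound.notna() & (out["compound"] != prev_compound)

    # Each stint start increments a running counter; ids are 1-based.
    new_stint = prev_pit | compound_changed
    out["stint_id"] = new_stint.astype(int).groupby(keys).cumsum() + 1

    # Tyre age is the lap count within the stint (assumes fresh tyres).
    out["tyre_age_laps"] = (
        out.groupby(["session_key", "driver", "stint_id"]).cumcount() + 1
    )
    return out
```

Flagging the stint start on the lap after the pit lap keeps the in-lap with the old tyre set, which is a design choice rather than something the source specifies.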


## Validate

In [9]:
# Validate schema
required_lap_cols = ['session_key', 'driver', 'lap', 'compound', 'stint_id', 'tyre_age_laps']
validation.validate_schema(laps_clean, required_lap_cols, name='laps_interim')

# Validate no NAs in key columns
validation.assert_no_na(laps_clean, ['session_key', 'driver', 'lap', 'compound'], name='laps_interim')

# Check compounds
validation.validate_categorical(laps_clean, 'compound', set(config.VALID_COMPOUNDS), name='laps_interim')

print('\n✓ All validations passed')

✓ Schema validation passed for laps_interim
✓ No NA values in required columns for laps_interim
✓ Categorical validation passed for laps_interim.compound

✓ All validations passed
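
For readers without access to the `f1ts.validation` module, the three checks above can be approximated as follows. This is a sketch under assumptions: the helper names `check_schema`, `check_no_na`, and `check_categorical` are hypothetical, and the real `validate_schema`, `assert_no_na`, and `validate_categorical` functions may behave differently (for example, in what they raise or print).

```python
import pandas as pd


def check_schema(df: pd.DataFrame, required_cols, name: str = "df") -> None:
    """Raise if any required column is absent."""
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"{name}: missing columns {missing}")


def check_no_na(df: pd.DataFrame, cols, name: str = "df") -> None:
    """Raise if any of the given columns contains a missing value."""
    na_counts = df[list(cols)].isna().sum()
    bad = na_counts[na_counts > 0]
    if not bad.empty:
        raise ValueError(f"{name}: NA values found: {bad.to_dict()}")


def check_categorical(df: pd.DataFrame, col: str, allowed, name: str = "df") -> None:
    """Raise if the column holds a non-null value outside the allowed set."""
    unexpected = set(df[col].dropna().unique()) - set(allowed)
    if unexpected:
        raise ValueError(f"{name}.{col}: unexpected values {sorted(unexpected)}")
```

Raising on the first failed check stops the notebook before any invalid data reaches the save step.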


## Save

In [10]:
interim_dir = config.paths()['data_interim']

io_flat.write_parquet(laps_clean, interim_dir / 'laps_interim.parquet')
io_flat.write_parquet(stints, interim_dir / 'stints_interim.parquet')

print('\n✓ Saved interim data')

✓ Saved laps_interim.parquet: 2,533 rows, 13 cols
✓ Saved stints_interim.parquet: 224 rows, 7 cols

✓ Saved interim data


## Repro Notes

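- Standardized compound labels across all 2,821 raw laps
- Derived 224 stints based on pit stops and compound changes
- Attached tyre age (`tyre_age_laps`) to each lap
- Removed 288 outlier laps (10.2%) using the median absolute deviation (MAD), leaving 2,533 laps
- All validations passed (schema, no NAs in key columns, valid compound values)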
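
The MAD-based outlier removal noted above can be sketched as a modified z-score filter. The notebook does not state which column, grouping, or threshold `clean.clean_pipeline` uses, so everything here is an assumption: the column name `lap_time_s`, grouping by `session_key`, the conventional 0.6745 scaling constant, and the commonly used cutoff of 3.5.

```python
import numpy as np
import pandas as pd


def mad_outlier_mask(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Return True where a value's modified z-score exceeds the threshold."""
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0 or np.isnan(mad):
        # No spread (or no data): flag nothing rather than divide by zero.
        return pd.Series(False, index=values.index)
    modified_z = 0.6745 * (values - median) / mad
    return modified_z.abs() > threshold


def drop_mad_outliers(laps: pd.DataFrame, col: str = "lap_time_s",
                      by: str = "session_key") -> pd.DataFrame:
    """Drop rows flagged as outliers within each group (hypothetical helper)."""
    mask = laps.groupby(by)[col].transform(mad_outlier_mask).astype(bool)
    return laps[~mask]
```

MAD is robust to the very slow laps it is meant to catch, which is why it is often preferred over a mean and standard deviation rule for lap times.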