# Data Validation

Quick check to make sure:
1. FastF1 works for 2024 testing
2. I can load race weekends (all sessions)
3. Telemetry data is actually there
4. Data quality is good enough to work with

In [1]:
import fastf1
import pandas as pd
import numpy as np
from pathlib import Path
import logging
logging.getLogger("fastf1").setLevel(logging.ERROR)

# Cache directory so I don't re-download everything
cache_dir = Path('../data/raw/.fastf1_cache')
cache_dir.mkdir(parents=True, exist_ok=True)
fastf1.Cache.enable_cache(str(cache_dir))

print(f"FastF1 version: {fastf1.__version__}")

FastF1 version: 3.7.0
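
Since later cells lean on newer API names such as `pick_drivers`, a small version guard could sit next to the version print. This is a minimal sketch: `parse_version`, `check_version` and the `(3, 1)` floor are my own assumptions, not something FastF1 provides or requires.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '3.7.0' into (3, 7, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # e.g. '0rc1' ends the numeric prefix
        parts.append(int(piece))
    return tuple(parts)


# Assumed floor for this notebook (hypothetical, adjust as needed)
MIN_VERSION = (3, 1)


def check_version(installed: str, minimum: tuple[int, ...] = MIN_VERSION) -> bool:
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= minimum


print(check_version("3.7.0"))  # True
```

In the notebook this would be called as `check_version(fastf1.__version__)`.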


## Test 1: Load 2024 Pre-Season Testing

Testing data is critical for 2026 - it's the first real data we'll see before the season starts.

In [2]:
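# 2024 Bahrain testing - 3 days in late February
# Caveat: FastF1's docs point to fastf1.get_testing_session(year, test_number, session_number)
# for pre-season tests; get_session() fuzzy-matches the name against events, so confirm
# session.event['EventName'] is the test before trusting these counts.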
testing_sessions = []

for day in range(1, 4):
    session = fastf1.get_session(2024, 'Testing', day)
    session.load()
    testing_sessions.append(session)
    
    print(f"Day {day}: {len(session.laps)} laps, {session.laps.Driver.nunique()} drivers")

print(f"\n🟢 Testing data works - {sum(len(s.laps) for s in testing_sessions)} total laps")

Day 1: 485 laps, 20 drivers
Day 2: 206 laps, 20 drivers
Day 3: 380 laps, 20 drivers

🟢 Testing data works - 1071 total laps
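
To be sure a loaded session really is a test day, the loader can go through `fastf1.get_testing_session` and check the event name afterwards. A minimal sketch, assuming testing events carry "test" in their `EventName`; `is_testing_event` and `load_testing_day` are hypothetical helper names of mine.

```python
def is_testing_event(event_name: str) -> bool:
    """Heuristic (assumption): testing events have 'test' in their name."""
    return "test" in event_name.lower()


def load_testing_day(year: int, day: int, test_number: int = 1):
    """Load one day of a pre-season test and verify what came back."""
    import fastf1  # imported lazily so the helper above works without FastF1

    session = fastf1.get_testing_session(year, test_number, day)
    session.load()

    name = session.event["EventName"]
    if not is_testing_event(name):
        raise ValueError(f"Expected a testing event, got {name!r}")
    return session


print(is_testing_event("Pre-Season Testing"))   # True
print(is_testing_event("Bahrain Grand Prix"))   # False
```

Usage would be `load_testing_day(2024, day)` for `day` in 1 to 3. The lap counts it returns are not guaranteed to match the ones printed above.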


## Test 2: Load a full race weekend

Need all sessions: FP1, FP2, FP3, Qualifying, Race

In [3]:
# Bahrain 2024 - first race of the season
event = fastf1.get_event(2024, 'Bahrain')
print(f"Event: {event['EventName']} at {event['Location']}")
print(f"Date: {event['EventDate']}\n")

sessions = {}
for session_name in ['FP1', 'FP2', 'FP3', 'Q', 'R']:
    session = fastf1.get_session(2024, 'Bahrain', session_name)
    session.load()
    sessions[session_name] = session
    print(f"{session_name:3s}: {len(session.laps):4d} laps")

print("\n🟢 All sessions loaded")

Event: Bahrain Grand Prix at Sakhir
Date: 2024-03-02 00:00:00

FP1:  449 laps
FP2:  511 laps
FP3:  311 laps
Q  :  267 laps
R  : 1129 laps

🟢 All sessions loaded
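
"Loaded" and "usable" are not quite the same thing, so a lap-count floor per session is a cheap extra check. A minimal sketch: `REQUIRED`, `MIN_LAPS` and `missing_or_thin` are hypothetical names, and the floor of 100 laps is an arbitrary assumption.

```python
REQUIRED = ("FP1", "FP2", "FP3", "Q", "R")
MIN_LAPS = 100  # hypothetical floor, tune per session type


def missing_or_thin(lap_counts, required=REQUIRED, min_laps=MIN_LAPS):
    """Return {session: problem} for sessions that are absent or too short."""
    problems = {}
    for name in required:
        count = lap_counts.get(name)
        if count is None:
            problems[name] = "missing"
        elif count < min_laps:
            problems[name] = f"only {count} laps"
    return problems


# Lap counts as printed for Bahrain 2024 above
bahrain = {"FP1": 449, "FP2": 511, "FP3": 311, "Q": 267, "R": 1129}
print(missing_or_thin(bahrain))  # {}
```

In the notebook the input would be `{name: len(s.laps) for name, s in sessions.items()}`. Sprint weekends replace FP2 and FP3 with sprint sessions, so `required` would need to change for those.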


## Test 3: Check telemetry quality

Make sure I actually have telemetry data (speed, throttle, brake, etc.)

In [4]:
# Pick Verstappen's fastest lap in FP1
fp1 = sessions['FP1']
ver_laps = fp1.laps.pick_drivers('VER')
fastest = ver_laps.pick_fastest()

# Get telemetry
telemetry = fastest.get_telemetry()

print(f"Fastest lap: {fastest['LapTime']}")
print(f"Telemetry points: {len(telemetry)}")
print("\nAvailable data:")
print(telemetry.columns.tolist())

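# Quick sanity checks
print(f"\nMax speed: {telemetry['Speed'].max():.1f} km/h")
print(f"Full throttle: {(telemetry['Throttle'] == 100).sum() / len(telemetry) * 100:.1f}%")
# Caveat: DRS > 0 counts every non-zero status code. FastF1's docs list 10/12/14 as
# flap open and 8 as detected/eligible, so read this as an upper bound, not time open.
print(f"DRS active: {(telemetry['DRS'] > 0).sum() / len(telemetry) * 100:.1f}%")

print("\n🟢 Telemetry looks good")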

Fastest lap: 0 days 00:01:33.238000
Telemetry points: 709

Available data:
['Date', 'SessionTime', 'DriverAhead', 'DistanceToDriverAhead', 'Time', 'RPM', 'Speed', 'nGear', 'Throttle', 'Brake', 'DRS', 'Source', 'Distance', 'RelativeDistance', 'Status', 'X', 'Y', 'Z']

Max speed: 315.0 km/h
Full throttle: 58.3%
DRS active: 100.0%

🟢 Telemetry looks good
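
A DRS share of 100% on a single lap is a hint that the `DRS` channel holds status codes rather than a simple on/off flag. A minimal sketch of a stricter count, assuming the codes 10, 12 and 14 mean "flap open" as FastF1's car-data notes describe; `DRS_OPEN_CODES` and `drs_open_share` are hypothetical names, and the sample values are made up for illustration.

```python
# Assumed "open" codes, taken from FastF1's car-data notes
DRS_OPEN_CODES = {10, 12, 14}


def drs_open_share(drs_values) -> float:
    """Percentage of samples whose DRS code is one of the 'open' codes."""
    values = list(drs_values)
    if not values:
        return 0.0
    open_samples = sum(1 for v in values if v in DRS_OPEN_CODES)
    return 100.0 * open_samples / len(values)


# Made-up sample: 8 = detected/eligible, 0 and 1 = off
sample = [8, 8, 12, 12, 0, 1, 14, 0]
print(drs_open_share(sample))                               # 37.5
print(100.0 * sum(1 for v in sample if v > 0) / len(sample))  # 75.0
```

In the notebook this would be `drs_open_share(telemetry['DRS'])`. The gap between the two printed numbers shows how much a plain `> 0` test can overstate DRS use.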


## Test 4: Check race results format

Need to validate our predictions against actual results, so make sure the format is clean.

In [5]:
race = sessions['R']
results = race.results

print("Race results columns:")
print(results.columns.tolist())

# Show top 5
print("\nTop 5:")
for _, row in results.head(5).iterrows():
    print(f"  P{row['Position']:.0f}: {row['Abbreviation']} ({row['TeamName']})")

print("\n🟢 Results format is clean")

Race results columns:
['DriverNumber', 'BroadcastName', 'Abbreviation', 'DriverId', 'TeamName', 'TeamColor', 'TeamId', 'FirstName', 'LastName', 'FullName', 'HeadshotUrl', 'CountryCode', 'Position', 'ClassifiedPosition', 'GridPosition', 'Q1', 'Q2', 'Q3', 'Time', 'Status', 'Points', 'Laps']

Top 5:
  P1: VER (Red Bull Racing)
  P2: PER (Red Bull Racing)
  P3: SAI (Ferrari)
  P4: LEC (Ferrari)
  P5: RUS (Mercedes)

🟢 Results format is clean
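
With `Abbreviation` and `Position` available, scoring a predicted order against the real one takes only a few lines. A minimal sketch, assuming mean absolute position error is an acceptable metric (it is a placeholder I picked, not something this notebook has settled on); `mean_position_error` and the `predicted` list are hypothetical.

```python
def mean_position_error(predicted, actual):
    """Mean absolute difference between predicted and actual finishing positions.

    Both arguments are lists of driver abbreviations in finishing order.
    Drivers missing from `actual` are skipped.
    """
    actual_pos = {drv: pos for pos, drv in enumerate(actual, start=1)}
    errors = [
        abs(pos - actual_pos[drv])
        for pos, drv in enumerate(predicted, start=1)
        if drv in actual_pos
    ]
    if not errors:
        raise ValueError("No overlapping drivers to compare")
    return sum(errors) / len(errors)


actual = ["VER", "PER", "SAI", "LEC", "RUS"]      # top 5 printed above
predicted = ["VER", "LEC", "SAI", "PER", "RUS"]   # hypothetical prediction
print(mean_position_error(predicted, actual))     # 0.8
```

In the notebook, `actual` could come from `results['Abbreviation'].tolist()`, since the results are already ordered by finishing position.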


## Test 5: Does 2025 data exist?

Quick check to see if I can validate on 2025 before deploying to 2026.

In [6]:
try:
    schedule_2025 = fastf1.get_event_schedule(2025)
    print(f"🟢 2025 schedule available - {len(schedule_2025)} events")

    # Try loading the Round 1 race (2025 opened in Australia, not Bahrain)
    round1_2025 = fastf1.get_session(2025, 1, 'R')
    round1_2025.load()
    winner = round1_2025.results.iloc[0]
    print(f"🟢 2025 race data works - Winner R1: {winner['Abbreviation']}")

except Exception as e:
    print(f"🔴 2025 data not available yet: {e}")
    print("  (This is fine - I'll use 2024 for now)")

🟢 2025 schedule available - 25 events
🟢 2025 race data works - Winner R1: NOR
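
A schedule being available does not mean every event in it has data yet, so it helps to split the calendar into events that have already run and events still to come. A minimal sketch on plain `(name, date)` pairs; `completed_events` is a hypothetical helper, and the event names and dates below are placeholders, not the real calendar.

```python
from datetime import datetime


def completed_events(events, now):
    """Return the names of events whose date is strictly before `now`.

    `events` is an iterable of (name, event_date) pairs.
    """
    return [name for name, event_date in events if event_date < now]


# Placeholder calendar for illustration only
calendar = [
    ("Round A", datetime(2025, 3, 1)),
    ("Round B", datetime(2025, 6, 1)),
    ("Round C", datetime(2025, 9, 1)),
]
print(completed_events(calendar, now=datetime(2025, 7, 1)))  # ['Round A', 'Round B']
```

With the real schedule the pairs could be built as `zip(schedule_2025['EventName'], schedule_2025['EventDate'])`, assuming the dates are timezone-naive like the one printed for Bahrain earlier.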
