# 🏎️ F1 Data Exploration & Analysis

## Overview
This notebook explores historical Formula 1 data collected from the Ergast API (2017-2025).

### Data Sources
- **Ergast API**: Race results, qualifying times, driver/constructor standings
- **Time Period**: 2017-2025 (covering major regulation changes)
- **Training Data**: 2017-2023
- **Test Data**: 2024 (for accuracy validation)
- **Future Prediction**: 2025

### Regulation Era Changes
- **2017**: Wider cars, new tires, increased downforce
- **2022**: Ground effect, simplified aerodynamics, bigger wheels


In [20]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set up pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Data paths
DATA_DIR = Path('../data/raw')
print(f"📁 Data directory: {DATA_DIR}")

# Plotly theme
import plotly.io as pio
pio.templates.default = "plotly_white"


📁 Data directory: ../data/raw


## 📊 Data Loading & Basic Information


In [21]:
# Load main datasets
races = pd.read_csv(DATA_DIR / 'ergast_races_2017_2025.csv')
results = pd.read_csv(DATA_DIR / 'ergast_results_2017_2025.csv')
qualifying = pd.read_csv(DATA_DIR / 'ergast_qualifying_2017_2025.csv')
driver_standings = pd.read_csv(DATA_DIR / 'ergast_driver_standings_2017_2025.csv')
constructor_standings = pd.read_csv(DATA_DIR / 'ergast_constructor_standings_2017_2025.csv')

print("🏁 Dataset Shapes:")
print(f"Races: {races.shape}")
print(f"Results: {results.shape}")
print(f"Qualifying: {qualifying.shape}")
print(f"Driver Standings: {driver_standings.shape}")
print(f"Constructor Standings: {constructor_standings.shape}")

print("\n📈 Data Coverage:")
print(f"Years: {results['year'].min()} - {results['year'].max()}")
print(f"Total unique drivers: {results['driver_id'].nunique()}")
print(f"Total unique constructors: {results['constructor_id'].nunique()}")
print(f"Total unique circuits: {races['circuit_id'].nunique()}")


🏁 Dataset Shapes:
Races: (193, 12)
Results: (3718, 25)
Qualifying: (3714, 16)
Driver Standings: (198, 12)
Constructor Standings: (90, 8)

📈 Data Coverage:
Years: 2017 - 2025
Total unique drivers: 48
Total unique constructors: 16
Total unique circuits: 32


In [22]:
# Races per year analysis
races_per_year = races.groupby('year').size().reset_index(name='race_count')

# Create interactive visualization
fig = px.bar(races_per_year, 
             x='year', 
             y='race_count',
             title='📅 Races per Season (2017-2025)',
             labels={'year': 'Year', 'race_count': 'Number of Races'},
             color='race_count',
             color_continuous_scale='viridis')

# Add COVID annotation
fig.add_annotation(
    x=2020, y=races_per_year[races_per_year['year']==2020]['race_count'].iloc[0],
    text="COVID-19<br>Reduced Calendar",
    showarrow=True,
    arrowhead=2,
    arrowcolor="red",
    bgcolor="white",
    bordercolor="red"
)

fig.update_layout(
    showlegend=False,
    height=500,
    xaxis_title="Season Year",
    yaxis_title="Number of Races"
)

fig.show()

# Print detailed breakdown
print("📊 Detailed Breakdown:")
for _, row in races_per_year.iterrows():
    year = int(row['year'])
    count = int(row['race_count'])
    covid_note = " (COVID-19 reduced calendar)" if year == 2020 else ""
    print(f"  {year}: {count} races{covid_note}")


📊 Detailed Breakdown:
  2017: 20 races
  2018: 21 races
  2019: 21 races
  2020: 17 races (COVID-19 reduced calendar)
  2021: 22 races
  2022: 22 races
  2023: 22 races
  2024: 24 races
  2025: 24 races


## 🏆 Driver & Constructor Performance Analysis


In [23]:
# Top drivers analysis
top_drivers = results.groupby(['driver_id', 'driver_first_name', 'driver_last_name']).agg({
    'points': 'sum',
    'race_name': 'count',
    'year': ['min', 'max'],
    'position': lambda x: (x == 1).sum()  # Wins
}).round(1)

top_drivers.columns = ['total_points', 'total_races', 'first_year', 'last_year', 'wins']
top_drivers['driver_name'] = top_drivers.index.get_level_values(1) + ' ' + top_drivers.index.get_level_values(2)
top_drivers = top_drivers.sort_values('total_points', ascending=False).head(15)

# Create interactive bar chart for top drivers
fig = px.bar(top_drivers.reset_index(), 
             x='driver_name', 
             y='total_points',
             color='wins',
             title='🏆 Top 15 Drivers by Total Points (2017-2025)',
             labels={'driver_name': 'Driver', 'total_points': 'Total Points', 'wins': 'Race Wins'},
             hover_data=['total_races', 'first_year', 'last_year'],
             color_continuous_scale='Reds')

fig.update_layout(
    xaxis_tickangle=-45,
    height=600,
    xaxis_title="Driver",
    yaxis_title="Total Championship Points"
)

fig.show()

# Print top 10 summary
print("🏁 Top 10 Drivers Summary:")
print(top_drivers.head(10)[['total_points', 'total_races', 'wins', 'first_year', 'last_year']])


🏁 Top 10 Drivers Summary:
                                                   total_points  total_races  \
driver_id      driver_first_name driver_last_name                              
max_verstappen Max               Verstappen              2900.5          186   
hamilton       Lewis             Hamilton                2680.5          185   
leclerc        Charles           Leclerc                 1519.0          166   
bottas         Valtteri          Bottas                  1377.0          169   
norris         Lando             Norris                  1234.0          145   
perez          Sergio            Pérez                   1218.0          167   
sainz          Carlos            Sainz                   1167.5          185   
vettel         Sebastian         Vettel                   990.0          121   
russell        George            Russell                  866.0          145   
ricciardo      Daniel            Ricciardo                704.0          148   

             

In [24]:
# Constructor performance analysis
top_constructors = results.groupby(['constructor_id', 'constructor_name']).agg({
    'points': 'sum',
    'race_name': 'count',  # Counts result rows (one per car entry), so two per race for a two-car team
    'year': ['min', 'max'],
    'position': lambda x: (x == 1).sum()  # Wins
}).round(1)

top_constructors.columns = ['total_points', 'total_races', 'first_year', 'last_year', 'wins']
top_constructors = top_constructors.sort_values('total_points', ascending=False).head(10)

# Create interactive bar chart for constructors (replacing treemap due to compatibility)
top_constructors_df = top_constructors.reset_index()
fig = px.bar(top_constructors_df, 
             x='constructor_name', 
             y='total_points',
             color='wins',
             title='🏗️ Constructor Performance (by Total Points)',
             labels={'constructor_name': 'Constructor', 'total_points': 'Total Points', 'wins': 'Race Wins'},
             hover_data=['total_races', 'first_year', 'last_year'],
             color_continuous_scale='Blues')

fig.update_layout(
    height=500,
    xaxis_tickangle=-45,
    xaxis_title="Constructor",
    yaxis_title="Total Championship Points"
)
fig.show()

# Constructor points over time
constructor_yearly = results.groupby(['year', 'constructor_name'])['points'].sum().reset_index()
top_constructor_names = top_constructors.head(6).index.get_level_values(1)
constructor_yearly_filtered = constructor_yearly[constructor_yearly['constructor_name'].isin(top_constructor_names)]

fig = px.line(constructor_yearly_filtered, 
              x='year', 
              y='points', 
              color='constructor_name',
              title='📈 Constructor Points Evolution (Top 6 Teams)',
              labels={'year': 'Season', 'points': 'Total Points', 'constructor_name': 'Constructor'},
              markers=True)

fig.update_layout(
    height=500,
    xaxis_title="Season Year",
    yaxis_title="Championship Points"
)

fig.show()

print("🏗️ Top 8 Constructors Summary:")
print(top_constructors.head(8)[['total_points', 'total_races', 'wins', 'first_year', 'last_year']])


🏗️ Top 8 Constructors Summary:
                                 total_points  total_races  wins  first_year  \
constructor_id constructor_name                                                
mercedes       Mercedes                4817.5          372    66        2017   
red_bull       Red Bull                4407.5          372    74        2017   
ferrari        Ferrari                 3790.5          372    24        2017   
mclaren        McLaren                 2325.0          372    19        2017   
aston_martin   Aston Martin             550.0          213     0        2021   
alpine         Alpine F1 Team           517.0          214     1        2021   
renault        Renault                  451.0          158     0        2017   
alphatauri     AlphaTauri               306.0          166     1        2020   

                                 last_year  
constructor_id constructor_name             
mercedes       Mercedes               2025  
red_bull       Red Bull          

## ⚠️ Critical Modeling Considerations

### Data Split Strategy
- **Training**: 2017-2023 (chronological training)
- **Validation**: Rolling time-series CV (validation folds: 2021, 2022, 2023)
- **Hold-out Test**: 2024 (final model evaluation)
- **Live Testing**: 2025 (real-world performance monitoring)
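
The split above can be sketched as a small fold generator. This is a minimal illustration, not the notebook's actual implementation: it assumes an expanding training window (every season before the validation year), and `rolling_season_splits` is a hypothetical helper name.

```python
import pandas as pd

def rolling_season_splits(df, val_years=(2021, 2022, 2023), year_col="year"):
    """Yield (val_year, train_idx, val_idx), training only on earlier seasons."""
    for val_year in val_years:
        train_idx = df.index[df[year_col] < val_year]   # strictly before the fold
        val_idx = df.index[df[year_col] == val_year]    # the validation season
        yield val_year, train_idx, val_idx

# Toy frame with one row per season, just to show the fold boundaries
toy = pd.DataFrame({"year": list(range(2017, 2026))})
folds = list(rolling_season_splits(toy))

for val_year, train_idx, val_idx in folds:
    train_years = toy.loc[train_idx, "year"]
    print(f"validate {val_year}: train {train_years.min()}-{train_years.max()}")
```

Seasons 2024 and 2025 never appear in any fold, so the hold-out and live-testing years stay untouched during model selection.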

### Key Challenges & Solutions

#### 1. **Prediction Timing** (Defines Feature Set)
- **Pre-Qualifying Model**: No grid position known → use historical averages, driver/constructor form
- **Post-Qualifying Model**: Grid position available → use qualifying results for better accuracy
- **Decision**: Build both models for comparison; the post-qualifying model is typically more accurate

#### 2. **Target Leakage Prevention** 
**⚠️ CRITICAL: Avoid future information**
- ❌ **Don't use**: In-race or post-race data from the race being predicted (pit stops, safety cars, DNF reasons)
- ✅ **Use instead**: Historical averages, pre-race statistics
- ✅ **Example**: "Average pit stops at this track" not "actual pit stops in this race"
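
One way to build such a historical average without leaking the current race is to shift each group by one row before aggregating. The sketch below uses a made-up table; `pit_stops` and `avg_pit_stops_before` are illustrative column names, not columns from the Ergast CSVs loaded earlier.

```python
import pandas as pd

# Hypothetical per-race table; the numbers are invented for illustration
races_toy = pd.DataFrame({
    "year":       [2021, 2022, 2023, 2021, 2022],
    "circuit_id": ["monza", "monza", "monza", "spa", "spa"],
    "pit_stops":  [20, 30, 40, 25, 35],
}).sort_values(["circuit_id", "year"])

# shift(1) removes the current race from its own average, so each row
# only sees races held earlier at the same circuit
races_toy["avg_pit_stops_before"] = (
    races_toy.groupby("circuit_id")["pit_stops"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(races_toy)
```

The first race at each circuit gets `NaN` because no history exists yet; how to fill that gap (global mean, era mean, or a missing-value indicator) is a separate modeling choice.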

#### 3. **Regulation Era Handling**
```python
# Per-race boolean flags; the True values below are placeholders, not computed values
era_flags = {
    'era_2017_plus': True,  # Wider cars, new aero
    'era_2022_plus': True,  # Ground effect regulations  
    'has_sprint_weekend': True,  # Sprint format races
    'covid_season': True    # 2020 anomaly handling
}
```
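
A rough sketch of how these flags could be derived per row follows. It assumes `year` and `round` columns as in the results data; `add_era_flags` and the `sprint_rounds` argument are hypothetical, and the caller would need to supply the real list of sprint weekends. Note that for a dataset starting in 2017, `era_2017_plus` is True on every row, so it only carries signal if earlier seasons are added.

```python
import pandas as pd

def add_era_flags(df, sprint_rounds=frozenset()):
    """Add boolean era columns derived from `year` (and `round` for sprints)."""
    out = df.copy()
    out["era_2017_plus"] = out["year"] >= 2017
    out["era_2022_plus"] = out["year"] >= 2022
    out["covid_season"] = out["year"] == 2020
    # sprint_rounds: caller-supplied set of (year, round) pairs (an assumption)
    out["has_sprint_weekend"] = [
        (int(y), int(r)) in sprint_rounds
        for y, r in zip(out["year"], out["round"])
    ]
    return out

toy = pd.DataFrame({"year": [2019, 2020, 2022], "round": [1, 1, 4]})
flagged = add_era_flags(toy, sprint_rounds={(2022, 4)})  # illustrative pair
print(flagged)
```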

#### 4. **Domain Shift Considerations**
- **2020 COVID Season**: Different tracks, double headers → special handling
- **Driver/Team Changes**: 2023→2024 transfers need rolling performance metrics
- **Solution**: Use rolling windows (last 3-5 races) for form indicators
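
A rolling form indicator along these lines might look like the sketch below. The data is invented, and the window of 3 with `min_periods=1` is an assumption. Grouping by `driver_id` alone lets the window carry across season boundaries and team changes.

```python
import pandas as pd

form_toy = pd.DataFrame({
    "driver_id": ["a"] * 5 + ["b"] * 3,
    "year":      [2023, 2023, 2023, 2024, 2024, 2024, 2024, 2024],
    "round":     [20, 21, 22, 1, 2, 1, 2, 3],
    "position":  [1, 3, 5, 2, 4, 10, 8, 6],
}).sort_values(["driver_id", "year", "round"])

# shift(1) keeps the current race out of its own window; rolling(3) then
# averages up to the three most recent earlier races for that driver
form_toy["last_3_races_avg_position"] = (
    form_toy.groupby("driver_id")["position"]
    .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)
print(form_toy)
```

The same pattern applied to `constructor_id` would give a team-level form indicator, which can help when a driver moves to a car with very different pace.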

#### 5. **Target & Evaluation Strategy**
**🎯 Ranking Problem, Not Regression**
- **Objective**: XGBoost `rank:pairwise` or LightGBM `lambdarank`
- **Grouping**: Each race = one group (drivers ranked within same race)
- **Metrics**:
  - **Spearman ρ**: Rank correlation (primary metric)
  - **NDCG@k**: Normalized discounted cumulative gain
  - **Top-3/Top-5 Hit Rate**: Podium and top-five prediction accuracy
  - **MAE/RMSE**: Position difference (secondary)
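
The per-race metrics listed above can be sketched in plain Python. This is an illustration under stated assumptions: finishing orders contain no ties, and the NDCG relevance is taken as `n - true_position + 1` (championship points would be another reasonable gain). The function names are hypothetical.

```python
import math

def spearman_rho(pred_pos, true_pos):
    """Spearman rank correlation for two tie-free orders of equal length."""
    n = len(true_pos)
    d2 = sum((p - t) ** 2 for p, t in zip(pred_pos, true_pos))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def top_k_hit_rate(pred_pos, true_pos, k=3):
    """Share of the true top-k finishers that were also predicted in the top k."""
    pred_top = {i for i, p in enumerate(pred_pos) if p <= k}
    true_top = {i for i, t in enumerate(true_pos) if t <= k}
    return len(pred_top & true_top) / k

def ndcg_at_k(pred_pos, true_pos, k=3):
    """NDCG@k with an assumed relevance of n - true_position + 1."""
    n = len(true_pos)
    rel = [n - t + 1 for t in true_pos]
    order = sorted(range(n), key=lambda i: pred_pos[i])  # predicted order
    dcg = sum(rel[i] / math.log2(rank + 2) for rank, i in enumerate(order[:k]))
    ideal = sorted(rel, reverse=True)[:k]
    idcg = sum(r / math.log2(rank + 2) for rank, r in enumerate(ideal))
    return dcg / idcg

# One race, five drivers: predicted vs actual finishing positions (1 = winner)
pred = [1, 2, 3, 4, 5]
true = [2, 1, 3, 5, 4]

rho = spearman_rho(pred, true)
hit3 = top_k_hit_rate(pred, true, k=3)
ndcg3 = ndcg_at_k(pred, true, k=3)
print(f"rho={rho:.3f}, top-3 hit rate={hit3:.2f}, NDCG@3={ndcg3:.3f}")
```

Each metric is computed within a single race and then averaged across races, which matches the "each race = one group" framing used for the ranking objective.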


## 🔍 Next Steps: Feature Engineering Framework

### Planned Feature Categories
```python
feature_categories = {
    'driver_form': [
        'last_3_races_avg_position',
        'last_5_races_points', 
        'last_3_races_dnf_rate',
        'season_championship_position'
    ],
    'constructor_form': [
        'team_last_3_races_points',
        'team_historical_pit_error_rate',
        'constructor_championship_position'
    ],
    'track_history': [
        'driver_avg_position_this_track',
        'constructor_avg_position_this_track',
        'track_specific_performance_rating'
    ],
    'qualifying_strength': [
        'grid_position',  # Post-qual model
        'season_avg_qual_position',  # Pre-qual model
        'qualifying_improvement_trend'
    ],
    'environmental': [
        'track_rain_probability_historical',
        'temperature_category',
        'track_characteristics_cluster'
    ],
    'era_flags': [
        'is_2017_plus_era',
        'is_2022_plus_era', 
        'has_sprint_format',
        'is_covid_season_2020'
    ]
}
```
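
The comments in `qualifying_strength` imply that the pre-qualifying and post-qualifying models differ in which columns they may see. A minimal sketch of that selection, repeating a subset of the planned names so it runs on its own; `POST_QUAL_ONLY` is a hypothetical name, and treating `grid_position` as the only qualifying-dependent feature is an assumption.

```python
# Subset of the planned categories, repeated so this snippet is self-contained
qualifying_strength = [
    "grid_position",                 # only known after qualifying
    "season_avg_qual_position",
    "qualifying_improvement_trend",
]
driver_form = ["last_3_races_avg_position", "last_5_races_points"]

# Assumption: grid_position is the only feature unavailable before qualifying
POST_QUAL_ONLY = {"grid_position"}

post_qual_features = driver_form + qualifying_strength
pre_qual_features = [f for f in post_qual_features if f not in POST_QUAL_ONLY]

print("pre-qual: ", pre_qual_features)
print("post-qual:", post_qual_features)
```

Keeping the exclusion set explicit makes it harder to train the pre-qualifying model on a column that would not exist at prediction time.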

### Implementation Roadmap
1. **V1 (MVP)**: Train 2017-2023, time-series CV, hold-out 2024 test
2. **V2 (Enhanced)**: Pre-qual vs Post-qual model comparison
3. **V3 (Production)**: Live 2025 predictions with weekly performance monitoring


## 🎯 Final Feature-Engineered Dataset

Now let's examine our processed dataset with all features ready for machine learning:


In [26]:
# Load the final feature-engineered dataset
PROCESSED_DATA_DIR = Path('../data/processed')
final_df = pd.read_csv(PROCESSED_DATA_DIR / 'f1_race_prediction_dataset.csv')

print("🏆 FINAL DATASET OVERVIEW")
print("="*50)
print(f"Shape: {final_df.shape}")
print(f"Years: {final_df['year'].min()} - {final_df['year'].max()}")
print(f"Unique drivers: {final_df['driver_id'].nunique()}")
print(f"Unique constructors: {final_df['constructor_id'].nunique()}")
print(f"Unique circuits: {final_df['circuit_id'].nunique()}")
print(f"Missing values: {final_df.isnull().sum().sum()}")

print("\n📊 Feature Categories:")
# Note: this counts every column, including identifiers and the target
print(f"Total features: {final_df.shape[1]}")

# Show feature breakdown
feature_categories = {
    'Basic Info': ['year', 'round', 'circuit_id', 'circuit_name', 'country', 'date'],
    'Driver Info': ['driver_id', 'driver_first_name', 'driver_last_name', 'driver_nationality'],
    'Constructor Info': ['constructor_id', 'constructor_name', 'constructor_nationality'],
    'Era Flags': [col for col in final_df.columns if 'era' in col or 'covid' in col or 'sprint' in col],
    'Driver Historical': [col for col in final_df.columns if col.startswith('driver_career') or col.startswith('driver_win') or col.startswith('driver_podium') or col.startswith('driver_points')],
    'Constructor Historical': [col for col in final_df.columns if col.startswith('constructor_career') or col.startswith('constructor_win') or col.startswith('constructor_podium') or col.startswith('constructor_points')],
    'Track Specific': [col for col in final_df.columns if 'track' in col],
    'Grid Position': ['grid_position'],
    'Target': ['position']
}

for category, cols in feature_categories.items():
    available_cols = [col for col in cols if col in final_df.columns]
    if available_cols:
        print(f"  {category}: {len(available_cols)} features")

print("\n" + "="*50)


🏆 FINAL DATASET OVERVIEW
Shape: (3718, 43)
Years: 2017 - 2025
Unique drivers: 48
Unique constructors: 16
Unique circuits: 32
Missing values: 0

📊 Feature Categories:
Total features: 43
  Basic Info: 6 features
  Driver Info: 4 features
  Constructor Info: 3 features
  Era Flags: 4 features
  Driver Historical: 9 features
  Constructor Historical: 9 features
  Track Specific: 6 features
  Grid Position: 1 features
  Target: 1 features



In [27]:
# Show sample of the final dataset
print("🔍 SAMPLE OF FINAL DATASET (df.head())")
print("="*60)

# Display first 5 rows with key columns
key_cols = [
    'year', 'round', 'driver_first_name', 'driver_last_name', 'constructor_name', 
    'circuit_name', 'grid_position', 'position',
    'driver_career_wins', 'driver_win_rate', 'constructor_win_rate',
    'is_2022_plus_era'
]

sample_df = final_df[key_cols].head()
print(sample_df.to_string(index=False))

print(f"\n📋 Complete Column List ({len(final_df.columns)} total):")
print("="*60)
for i, col in enumerate(final_df.columns, 1):
    print(f"{i:2d}. {col}")
    if i % 3 == 0:  # Print line break every 3 columns for readability
        print()


🔍 SAMPLE OF FINAL DATASET (df.head())
 year  round driver_first_name driver_last_name constructor_name                   circuit_name  grid_position  position  driver_career_wins  driver_win_rate  constructor_win_rate  is_2022_plus_era
 2017      1         Sebastian           Vettel          Ferrari Albert Park Grand Prix Circuit            2.0         1                  11            0.091                 0.065                 0
 2017      1             Lewis         Hamilton         Mercedes Albert Park Grand Prix Circuit            1.0         2                  52            0.281                 0.177                 0
 2017      1          Valtteri           Bottas         Mercedes Albert Park Grand Prix Circuit            3.0         3                  10            0.059                 0.177                 0
 2017      1              Kimi        Räikkönen          Ferrari Albert Park Grand Prix Circuit            4.0         4                   1            0.010             