# 🏎️ FastF1 Data Exploration & Telemetry Analysis

## Overview
This notebook explores advanced F1 telemetry and strategy data using the FastF1 API to enrich our existing Ergast dataset (2017-2025).

### Data Enhancement Strategy
- **Ergast Coverage**: 2017-2025 (complete historical data)
- **FastF1 Coverage**: 2018-2025 (telemetry & advanced features)
- **2017 Data**: Will rely solely on Ergast features
- **2018-2025**: Enhanced with FastF1 telemetry, tire strategy, and weather data
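
The split above amounts to a left join from the Ergast rows onto the FastF1 features. Here is a minimal sketch of that idea, assuming both tables share `year`, `round` and `driver_id` keys; the frames, column names and values below are illustrative stand-ins, not our real data.

```python
import pandas as pd

# Toy stand-ins for the two sources; all names and values are illustrative.
ergast = pd.DataFrame({
    "year": [2017, 2018, 2018],
    "round": [1, 1, 1],
    "driver_id": ["hamilton", "hamilton", "vettel"],
    "grid": [1, 1, 3],
})
fastf1_features = pd.DataFrame({
    "year": [2018, 2018],
    "round": [1, 1],
    "driver_id": ["hamilton", "vettel"],
    "avg_sector1_time": [28.9, 29.1],
})

# A left join keeps every Ergast row. The 2017 row has no FastF1 match,
# so its FastF1 columns are NaN and the indicator reads "left_only".
merged = ergast.merge(
    fastf1_features,
    on=["year", "round", "driver_id"],
    how="left",
    indicator=True,
)
print(merged[["year", "driver_id", "avg_sector1_time", "_merge"]])
```

The `_merge` column makes it easy to see which rows were actually enhanced before deciding how to handle the missing values.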

### Target Features to Add
1. **Telemetry**: Sector times, speed data, throttle/brake patterns
2. **Strategy**: Tire compounds, pit stop timing, fuel strategy
3. **Weather**: Track conditions, air and track temperature, rainfall
4. **Track**: Session-specific performance metrics


In [19]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set up pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Data paths
RAW_DATA_DIR = Path('../data/raw')
PROCESSED_DATA_DIR = Path('../data/processed')
print(f"📁 Raw data directory: {RAW_DATA_DIR}")
print(f"📁 Processed data directory: {PROCESSED_DATA_DIR}")

# Plotly theme
import plotly.io as pio
pio.templates.default = "plotly_white"


📁 Raw data directory: ../data/raw
📁 Processed data directory: ../data/processed


## 📦 FastF1 Installation & Setup


In [20]:
# Install fastf1 if not already installed
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

try:
    import fastf1
    print(f"✅ FastF1 already installed - Version: {fastf1.__version__}")
except ImportError:
    print("📦 Installing FastF1...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "fastf1"])
    import fastf1
    print(f"✅ FastF1 installed successfully - Version: {fastf1.__version__}")

# Enable cache for better performance
cache_dir = Path('../data/cache')
cache_dir.mkdir(parents=True, exist_ok=True)  # also create ../data if it is missing
fastf1.Cache.enable_cache(str(cache_dir))
print(f"🗂️ Cache enabled at: {cache_dir}")


✅ FastF1 already installed - Version: 3.6.1
🗂️ Cache enabled at: ../data/cache


## 🔍 FastF1 Coverage Analysis (2017-2025)

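Let's check which season schedules FastF1 returns for our target years. A season appearing here only means its schedule is listed; per the strategy above, we expect telemetry from 2018 onwards.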
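
One thing to watch when counting events: a schedule can include pre-season testing, which inflates a naive `len(schedule)`. Below is a small sketch of filtering those out, assuming the schedule has an `EventFormat` column that marks testing events as `"testing"`. The frame is a hand-made stand-in so the example runs offline; with the real API, passing `include_testing=False` to `fastf1.get_event_schedule` should have a similar effect.

```python
import pandas as pd

# Hand-made stand-in for a season schedule; values are illustrative only.
schedule = pd.DataFrame({
    "RoundNumber": [0, 1, 2],
    "EventName": [
        "Pre-Season Testing",
        "Bahrain Grand Prix",
        "Saudi Arabian Grand Prix",
    ],
    "EventFormat": ["testing", "conventional", "conventional"],
})

# Drop testing events before counting, so the total reflects race weekends only.
race_events = schedule[schedule["EventFormat"] != "testing"]
race_count = len(race_events)
first_race = race_events.iloc[0]["EventName"] if race_count > 0 else "N/A"
print(f"{race_count} races (First: {first_race})")
```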


In [21]:
# Import fastf1 first (make sure you ran the installation cell above!)
import fastf1
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Test FastF1 data availability for our target years
test_years = [2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]
coverage_results = []

print("🔍 Testing FastF1 data coverage for 2017-2025:")
print("="*60)

for year in test_years:
    try:
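        # Try to get the season schedule. By default it also lists pre-season
        # testing events, so this count can be higher than the number of races.
        schedule = fastf1.get_event_schedule(year)
        race_count = len(schedule)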
        coverage_results.append({
            'year': year,
            'available': True,
            'race_count': race_count,
            'first_race': schedule.iloc[0]['EventName'] if race_count > 0 else 'N/A',
            'status': '✅ Available'
        })
        print(f"{year}: ✅ Available - {race_count} races (First: {schedule.iloc[0]['EventName']})")
    except Exception as e:
        coverage_results.append({
            'year': year,
            'available': False,
            'race_count': 0,
            'first_race': 'N/A',
            'status': f'❌ Not Available: {str(e)[:50]}...'
        })
        print(f"{year}: ❌ Not Available - {str(e)[:50]}...")

# Convert to DataFrame for analysis
coverage_df = pd.DataFrame(coverage_results)
print(f"\n📊 Coverage Summary:")
print(f"Available years: {coverage_df[coverage_df['available']]['year'].tolist()}")
print(f"Total races available: {coverage_df[coverage_df['available']]['race_count'].sum()}")
print(f"Coverage: {len(coverage_df[coverage_df['available']])}/{len(test_years)} years ({len(coverage_df[coverage_df['available']])/len(test_years)*100:.1f}%)")


🔍 Testing FastF1 data coverage for 2017-2025:
2017: ✅ Available - 20 races (First: Australian Grand Prix)
2018: ✅ Available - 21 races (First: Australian Grand Prix)
2019: ✅ Available - 21 races (First: Australian Grand Prix)
2020: ✅ Available - 19 races (First: Pre-Season Test 1)
2021: ✅ Available - 23 races (First: Pre-Season Test)
2022: ✅ Available - 24 races (First: Pre-Season Track Session)
2023: ✅ Available - 23 races (First: Pre-Season Testing)
2024: ✅ Available - 25 races (First: Pre-Season Testing)
2025: ✅ Available - 25 races (First: Pre-Season Testing)

📊 Coverage Summary:
Available years: [2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]
Total races available: 201
Coverage: 9/9 years (100.0%)


## 🏁 Sample Race Data Exploration

Let's explore what data we can get from a recent race to understand the available features.
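
Since a session download can fail, it helps to try a short list of candidate races in order rather than nesting `try` blocks. The helper below is a hypothetical sketch (`load_first_available` and `fake_loader` are our own names, not part of FastF1); the loader is injected so the example runs without network access.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def load_first_available(
    loader: Callable[[int, str], T],
    candidates: Iterable[Tuple[int, str]],
) -> Optional[T]:
    """Return the first session that loads, or None if every candidate fails."""
    for year, event in candidates:
        try:
            return loader(year, event)
        except Exception as exc:  # broad on purpose: this is exploratory code
            print(f"Could not load {year} {event}: {exc}")
    return None


# Stand-in loader so the sketch runs offline. A real loader would call
# fastf1.get_session(year, event, 'R'), then session.load(), and return the session.
def fake_loader(year: int, event: str) -> str:
    if year == 2024:
        raise RuntimeError("simulated download failure")
    return f"{year} {event} (race)"


session_label = load_first_available(
    fake_loader, [(2024, "Bahrain"), (2023, "Bahrain")]
)
print(session_label)
```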


In [22]:
# Load a sample race to explore available data
# Let's use 2024 Bahrain Grand Prix as an example
try:
    print("🔄 Loading 2024 Bahrain Grand Prix data...")
    print("This may take a few minutes for first load...")
    
    # Load race session
    session = fastf1.get_session(2024, 'Bahrain', 'R')  # 'R' for Race
    session.load()
    
    print(f"✅ Session loaded successfully:")
    print(f"   📅 Date: {session.date}")
    print(f"   🏁 Event: {session.event['EventName']}")
    print(f"   📍 Location: {session.event['Location']}")
    
    # Weather info with proper handling
    if len(session.weather_data) > 0:
        air_temp = session.weather_data['AirTemp'].iloc[-1]
        rainfall = session.weather_data['Rainfall'].any()
        print(f"   🌡️ Air Temp: {air_temp}°C")
        print(f"   🌧️ Rainfall: {'Yes' if rainfall else 'No'}")
    else:
        print("   🌡️ Air Temp: N/A")
        print("   🌧️ Rainfall: N/A")
    
except Exception as e:
    print(f"❌ Error loading race data: {e}")
    print("Trying alternative race...")
    
    try:
        # Fallback to 2023 data
        session = fastf1.get_session(2023, 'Bahrain', 'R')
        session.load()
        print(f"✅ Fallback session loaded: {session.event['EventName']} 2023")
    except Exception as e2:
        print(f"❌ Fallback also failed: {e2}")
        session = None


core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.6.1]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


🔄 Loading 2024 Bahrain Grand Prix data...
This may take a few minutes for first load...


req            INFO 	Using cached data for car_data
req            INFO 	Using cached data for position_data
req            INFO 	Using cached data for weather_data
req            INFO 	Using cached data for race_control_messages
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '55', '16', '63', '4', '44', '81', '14', '18', '24', '20', '3', '22', '23', '27', '31', '10', '77', '2']


✅ Session loaded successfully:
   📅 Date: 2024-03-02 15:00:00
   🏁 Event: Bahrain Grand Prix
   📍 Location: Sakhir
   🌡️ Air Temp: 17.7°C
   🌧️ Rainfall: No


In [23]:
# Explore available data structures if session loaded successfully
if session is not None:
    print("🔍 AVAILABLE DATA STRUCTURES:")
    print("="*50)
    
    # 1. Results data
    if hasattr(session, 'results') and len(session.results) > 0:
        print(f"\n📊 RACE RESULTS:")
        print(f"   Shape: {session.results.shape}")
        print(f"   Columns: {list(session.results.columns)}")
        print(f"   Sample (Top 5):")
        display_cols = ['Position', 'Abbreviation', 'DriverNumber', 'TeamName', 'Time', 'Status']
        available_display_cols = [col for col in display_cols if col in session.results.columns]
        print(session.results[available_display_cols].head())
    
    # 2. Laps data
    if hasattr(session, 'laps') and len(session.laps) > 0:
        print(f"\n⏱️ LAP DATA:")
        print(f"   Shape: {session.laps.shape}")
        print(f"   Total laps: {session.laps['LapNumber'].max()}")
        print(f"   Columns: {list(session.laps.columns)[:10]}...")  # Show first 10 columns
        
        # Sample fastest lap
        fastest_lap = session.laps.pick_fastest()
        if fastest_lap is not None:
            print(f"   🏆 Fastest Lap: {fastest_lap['Driver']} - {fastest_lap['LapTime']}")
    
    # 3. Weather data
    if hasattr(session, 'weather_data') and len(session.weather_data) > 0:
        print(f"\n🌤️ WEATHER DATA:")
        print(f"   Shape: {session.weather_data.shape}")
        print(f"   Columns: {list(session.weather_data.columns)}")
        print(f"   Temperature range: {session.weather_data['AirTemp'].min():.1f}°C - {session.weather_data['AirTemp'].max():.1f}°C")
        print(f"   Track temp range: {session.weather_data['TrackTemp'].min():.1f}°C - {session.weather_data['TrackTemp'].max():.1f}°C")
    
    # 4. Track status
    if hasattr(session, 'track_status') and len(session.track_status) > 0:
        print(f"\n🚦 TRACK STATUS:")
        print(f"   Shape: {session.track_status.shape}")
        print(f"   Distinct statuses: {session.track_status['Status'].nunique()}")
        print(f"   Unique statuses: {session.track_status['Status'].unique()}")
    
else:
    print("⚠️ No session data available for detailed exploration")


🔍 AVAILABLE DATA STRUCTURES:

📊 RACE RESULTS:
   Shape: (20, 22)
   Columns: ['DriverNumber', 'BroadcastName', 'Abbreviation', 'DriverId', 'TeamName', 'TeamColor', 'TeamId', 'FirstName', 'LastName', 'FullName', 'HeadshotUrl', 'CountryCode', 'Position', 'ClassifiedPosition', 'GridPosition', 'Q1', 'Q2', 'Q3', 'Time', 'Status', 'Points', 'Laps']
   Sample (Top 5):
    Position Abbreviation DriverNumber         TeamName  \
1        1.0          VER            1  Red Bull Racing   
11       2.0          PER           11  Red Bull Racing   
55       3.0          SAI           55          Ferrari   
16       4.0          LEC           16          Ferrari   
63       5.0          RUS           63         Mercedes   

                     Time    Status  
1  0 days 01:31:44.742000  Finished  
11 0 days 00:00:22.457000  Finished  
55 0 days 00:00:25.110000  Finished  
16 0 days 00:00:39.669000  Finished  
63 0 days 00:00:46.788000  Finished  

⏱️ LAP DATA:
   Shape: (1129, 31)
   Total laps: 57.

## 🏎️ Driver Telemetry Analysis

Let's dive deeper into telemetry data for individual drivers.
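
Lap and sector times come back as timedeltas, which print as `0 days 00:01:32.608000`. For model features, plain seconds are usually easier to work with. A minimal sketch, using standard-library `timedelta` values (pandas `Timedelta` objects expose the same `total_seconds()` method); `to_seconds` is a helper name introduced here for illustration.

```python
from datetime import timedelta


def to_seconds(lap_time: timedelta) -> float:
    """Convert a lap or sector time to float seconds."""
    return lap_time.total_seconds()


# Two example lap times, written out by hand.
fastest = timedelta(minutes=1, seconds=32, milliseconds=608)
slowest = timedelta(minutes=1, seconds=57, milliseconds=854)

print(f"Fastest: {to_seconds(fastest):.3f}s")
print(f"Spread:  {to_seconds(slowest - fastest):.3f}s")
```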


In [24]:
# Explore telemetry data for specific drivers
if session is not None and hasattr(session, 'laps') and len(session.laps) > 0:
    print("🔍 DRIVER TELEMETRY EXPLORATION:")
    print("="*50)
    
    # Get unique drivers
    drivers = session.laps['Driver'].unique()
    print(f"📋 Available drivers: {list(drivers[:10])}{' ...' if len(drivers) > 10 else ''}")
    
    # Focus on a specific driver (e.g., VER for Verstappen)
    if 'VER' in drivers:
        target_driver = 'VER'
    elif len(drivers) > 0:
        target_driver = drivers[0]
    else:
        target_driver = None
    
    if target_driver:
        print(f"\n🎯 Analyzing driver: {target_driver}")
        
        # Get driver's laps
        driver_laps = session.laps.pick_drivers(target_driver)  # pick_driver is deprecated in FastF1 3.x
        print(f"   Total laps: {len(driver_laps)}")
        
        if len(driver_laps) > 0:
            # Lap time analysis
            print(f"   🏁 Lap Times:")
            valid_laps = driver_laps.dropna(subset=['LapTime'])
            if len(valid_laps) > 0:
                print(f"      Fastest: {valid_laps['LapTime'].min()}")
                print(f"      Average: {valid_laps['LapTime'].mean()}")
                print(f"      Slowest: {valid_laps['LapTime'].max()}")
            
            # Sector times analysis
            sector_cols = ['Sector1Time', 'Sector2Time', 'Sector3Time']
            available_sectors = [col for col in sector_cols if col in driver_laps.columns]
            
            if available_sectors:
                print(f"   🏁 Sector Times (Available: {available_sectors}):")
                for sector in available_sectors:
                    sector_data = driver_laps[sector].dropna()
                    if len(sector_data) > 0:
                        print(f"      {sector}: {sector_data.min()} (best) - {sector_data.mean()} (avg)")
            
            # Tire compound usage
            if 'Compound' in driver_laps.columns:
                compounds = driver_laps['Compound'].value_counts()
                print(f"   🏎️ Tire Compounds Used:")
                for compound, count in compounds.items():
                    print(f"      {compound}: {count} laps")
else:
    print("⚠️ No lap data available for telemetry analysis")


🔍 DRIVER TELEMETRY EXPLORATION:
📋 Available drivers: ['VER', 'PER', 'SAI', 'LEC', 'RUS', 'NOR', 'HAM', 'PIA', 'ALO', 'STR'] ...

🎯 Analyzing driver: VER
   Total laps: 57
   🏁 Lap Times:
      Fastest: 0 days 00:01:32.608000
      Average: 0 days 00:01:36.574421052
      Slowest: 0 days 00:01:57.854000
   🏁 Sector Times (Available: ['Sector1Time', 'Sector2Time', 'Sector3Time']):
      Sector1Time: 0 days 00:00:29.741000 (best) - 0 days 00:00:31.408017857 (avg)
      Sector2Time: 0 days 00:00:39.916000 (best) - 0 days 00:00:41.414842105 (avg)
      Sector3Time: 0 days 00:00:22.951000 (best) - 0 days 00:00:23.734122807 (avg)
   🏎️ Tire Compounds Used:
      SOFT: 37 laps
      HARD: 20 laps


## 🔍 FastF1 Extracted Data Analysis

Let's analyze the FastF1 features we've extracted for 2018-2025 to understand data quality and coverage.
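
Two quick checks are useful here: how many races each season actually contains, and how complete each feature is. The sketch below assumes the extract has `year` and `round` columns with one row per driver per race; the tiny frame is illustrative only. Counting distinct rounds avoids having to assume a fixed grid size.

```python
import pandas as pd

# Tiny illustrative frame; column names mirror the extract, values are made up.
features = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019],
    "round": [1, 1, 2, 1, 1],
    "driver_abbreviation": ["HAM", "VET", "HAM", "HAM", "BOT"],
    "avg_sector1_time": [28.9, None, 31.2, 29.4, 29.6],
})

# Exact race count per season: number of distinct rounds.
races_per_year = features.groupby("year")["round"].nunique()

# Share of non-missing values per column, as a percentage.
completeness = features.notna().mean().mul(100).round(1)

print(races_per_year)
print(completeness)
```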


In [29]:
# Load and analyze the complete FastF1 dataset
print("📊 FASTF1 COMPLETE DATASET ANALYSIS:")
print("="*60)

# Load the complete FastF1 features (NEW COMPLETE DATA - 2018-2025!)
fastf1_complete = pd.read_csv(RAW_DATA_DIR / 'fastf1_features_2018_2025_complete.csv')

print(f"📁 Dataset Overview:")
print(f"   Shape: {fastf1_complete.shape}")
print(f"   Columns: {len(fastf1_complete.columns)}")
print(f"   Records: {len(fastf1_complete)} driver-race entries")

# Analyze year coverage
year_analysis = fastf1_complete['year'].value_counts().sort_index()
print(f"\n📅 Year Coverage:")
for year, count in year_analysis.items():
    races_approx = count // 20  # Assuming ~20 drivers per race
    print(f"   {year}: {count} records (~{races_approx} races)")

print(f"\n🏎️ Driver Coverage:")
driver_counts = fastf1_complete['driver_abbreviation'].value_counts()
print(f"   Total unique drivers: {len(driver_counts)}")
print(f"   Top drivers by records:")
for driver, count in driver_counts.head(10).items():
    print(f"      {driver}: {count} races")

# Analyze feature completeness
print(f"\n🎯 Feature Completeness Analysis:")
feature_cols = [col for col in fastf1_complete.columns if col not in 
               ['year', 'race_name', 'date', 'circuit_name', 'event_name', 'driver_abbreviation', 'round']]

missing_analysis = {}
for col in feature_cols[:15]:  # Analyze first 15 features
    missing_count = fastf1_complete[col].isna().sum()
    completeness = ((len(fastf1_complete) - missing_count) / len(fastf1_complete)) * 100
    missing_analysis[col] = {
        'missing': missing_count,
        'completeness': completeness
    }

print(f"   Feature completeness rates:")
for feature, stats in missing_analysis.items():
    print(f"      {feature}: {stats['completeness']:.1f}% complete ({stats['missing']} missing)")

print(f"\n🔍 Sample Data Quality Check:")
# Check a sample record with data
sample_with_data = fastf1_complete[fastf1_complete['avg_sector1_time'].notna()]
if len(sample_with_data) > 0:
    sample = sample_with_data.iloc[0]
    print(f"   Sample record: {sample['year']} {sample['race_name']} - {sample['driver_abbreviation']}")
    print(f"   Sector 1: {sample['avg_sector1_time']:.3f}s")
    print(f"   Main Compound: {sample['main_compound']}")
    print(f"   Air Temp: {sample['air_temp']:.1f}°C")
    print(f"   Track Temp: {sample['track_temp']:.1f}°C")
    print(f"   Max Speed: {sample['max_speed']:.1f} km/h")
else:
    print("   ⚠️ No records with sector time data found")


📊 FASTF1 COMPLETE DATASET ANALYSIS:
📁 Dataset Overview:
   Shape: (2415, 29)
   Columns: 29
   Records: 2415 driver-race entries

📅 Year Coverage:
   2018: 420 records (~21 races)
   2019: 340 records (~17 races)
   2020: 340 records (~17 races)
   2021: 379 records (~18 races)
   2022: 198 records (~9 races)
   2023: 299 records (~14 races)
   2024: 359 records (~17 races)
   2025: 80 records (~4 races)

🏎️ Driver Coverage:
   Total unique drivers: 43
   Top drivers by records:
      VER: 121 races
      GAS: 121 races
      LEC: 121 races
      HAM: 120 races
      SAI: 120 races
      STR: 119 races
      BOT: 117 races
      PER: 115 races
      RIC: 103 races
      OCO: 103 races

🎯 Feature Completeness Analysis:
   Feature completeness rates:
      driver_number: 100.0% complete (0 missing)
      position: 100.0% complete (0 missing)
      grid_position: 100.0% complete (0 missing)
      points: 100.0% complete (0 missing)
      status: 100.0% complete (0 missing)
      fastest_l

In [31]:
# Data Quality Issues Analysis
print("⚠️ DATA QUALITY ISSUES ANALYSIS:")
print("="*50)

# Check for missing sector times (critical features)
sector_missing = fastf1_complete[['avg_sector1_time', 'avg_sector2_time', 'avg_sector3_time']].isna().all(axis=1).sum()
print(f"📊 Critical Missing Data:")
print(f"   Records with NO sector time data: {sector_missing}/{len(fastf1_complete)} ({sector_missing/len(fastf1_complete)*100:.1f}%)")

# Analyze by year - which years have more missing data?
print(f"\n📅 Missing Data by Year:")
for year in sorted(fastf1_complete['year'].unique()):
    year_data = fastf1_complete[fastf1_complete['year'] == year]
    year_missing = year_data[['avg_sector1_time', 'avg_sector2_time', 'avg_sector3_time']].isna().all(axis=1).sum()
    print(f"   {year}: {year_missing}/{len(year_data)} missing ({year_missing/len(year_data)*100:.1f}%)")

# Analyze which features have the most issues
print(f"\n🎯 Most Problematic Features:")
problematic_features = []
for feature, stats in missing_analysis.items():
    if stats['completeness'] < 50:  # Less than 50% complete
        problematic_features.append((feature, stats['completeness']))

if problematic_features:
    problematic_features.sort(key=lambda x: x[1])
    for feature, completeness in problematic_features:
        print(f"   {feature}: {completeness:.1f}% complete")
else:
    print("   ✅ No severely problematic features found")

# Check data consistency
print(f"\n🔍 Data Consistency Checks:")

# Check if we have realistic sector times (sector 1 only, using a 20-60 s window)
if len(sample_with_data) > 0:
    realistic_sectors = sample_with_data[
        (sample_with_data['avg_sector1_time'] > 20) & 
        (sample_with_data['avg_sector1_time'] < 60)
    ]
    print(f"   Realistic sector times: {len(realistic_sectors)}/{len(sample_with_data)} records")
    
    # Check tire compound variety
    compounds = fastf1_complete['main_compound'].dropna().unique()
    print(f"   Tire compound variety: {list(compounds)}")
    
    # Check temperature ranges
    if 'air_temp' in fastf1_complete.columns:
        temp_data = fastf1_complete['air_temp'].dropna()
        if len(temp_data) > 0:
            print(f"   Air temperature range: {temp_data.min():.1f}°C - {temp_data.max():.1f}°C")
    
    if 'track_temp' in fastf1_complete.columns:
        track_temp_data = fastf1_complete['track_temp'].dropna()
        if len(track_temp_data) > 0:
            print(f"   Track temperature range: {track_temp_data.min():.1f}°C - {track_temp_data.max():.1f}°C")

print(f"\n📝 DATA QUALITY SUMMARY:")
print(f"   • Total records: {len(fastf1_complete)}")
print(f"   • Years covered: {fastf1_complete['year'].min()}-{fastf1_complete['year'].max()}")
print(f"   • Unique drivers: {fastf1_complete['driver_abbreviation'].nunique()}")
# Note: this counts records with at least one sector time, not fully complete rows
print(f"   • Complete telemetry records: {len(fastf1_complete) - sector_missing}")
print(f"   • Data completeness: {((len(fastf1_complete) - sector_missing)/len(fastf1_complete)*100):.1f}%")


⚠️ DATA QUALITY ISSUES ANALYSIS:
📊 Critical Missing Data:
   Records with NO sector time data: 57/2415 (2.4%)

📅 Missing Data by Year:
   2018: 16/420 missing (3.8%)
   2019: 2/340 missing (0.6%)
   2020: 9/340 missing (2.6%)
   2021: 10/379 missing (2.6%)
   2022: 3/198 missing (1.5%)
   2023: 3/299 missing (1.0%)
   2024: 10/359 missing (2.8%)
   2025: 4/80 missing (5.0%)

🎯 Most Problematic Features:
   ✅ No severely problematic features found

🔍 Data Consistency Checks:
   Realistic sector times: 2087/2343 records
   Tire compound variety: ['SOFT', 'SUPERSOFT', 'ULTRASOFT', 'MEDIUM', 'HYPERSOFT', 'HARD', 'INTERMEDIATE', 'WET', 'UNKNOWN']

📝 DATA QUALITY SUMMARY:
   • Total records: 2415
   • Years covered: 2018-2025
   • Unique drivers: 43
   • Complete telemetry records: 2358
   • Data completeness: 97.6%


In [32]:
# Integration Compatibility Analysis with Ergast Data
print("🔗 INTEGRATION COMPATIBILITY ANALYSIS:")
print("="*55)

# Load existing Ergast dataset for comparison
ergast_df = pd.read_csv(PROCESSED_DATA_DIR / 'f1_race_prediction_dataset.csv')

print(f"📊 Dataset Comparison:")
print(f"   Ergast records: {len(ergast_df)}")
print(f"   FastF1 records: {len(fastf1_complete)}")
# Upper bound only: the true overlap depends on matching year, round and driver
print(f"   Overlap potential: {min(len(ergast_df), len(fastf1_complete))} records")

# Check year overlap
ergast_years = set(ergast_df['year'].unique())
fastf1_years = set(fastf1_complete['year'].unique())
overlap_years = ergast_years.intersection(fastf1_years)

print(f"\n📅 Year Overlap Analysis:")
print(f"   Ergast years: {sorted(ergast_years)}")
print(f"   FastF1 years: {sorted(fastf1_years)}")
print(f"   Overlapping years: {sorted(overlap_years)} ({len(overlap_years)} years)")

# Driver mapping analysis
print(f"\n🏎️ Driver Mapping Analysis:")

# Create basic driver mapping
driver_mapping = {
    'VER': 'max_verstappen', 'HAM': 'hamilton', 'BOT': 'bottas', 'RUS': 'russell',
    'NOR': 'norris', 'PER': 'perez', 'SAI': 'sainz', 'LEC': 'leclerc',
    'ALO': 'alonso', 'STR': 'stroll', 'VET': 'vettel', 'RAI': 'raikkonen',
    'RIC': 'ricciardo', 'OCO': 'ocon', 'GAS': 'gasly', 'TSU': 'tsunoda',
    'ALB': 'albon', 'LAT': 'latifi', 'MSC': 'mick_schumacher', 'MAZ': 'mazepin',
    'HUL': 'hulkenberg', 'MAG': 'kevin_magnussen', 'ZHO': 'zhou', 'PIA': 'piastri'
}

fastf1_drivers = set(fastf1_complete['driver_abbreviation'].unique())
ergast_drivers = set(ergast_df['driver_id'].unique())
mappable_drivers = set(driver_mapping.keys()).intersection(fastf1_drivers)

print(f"   FastF1 drivers: {len(fastf1_drivers)} unique")
print(f"   Ergast drivers: {len(ergast_drivers)} unique") 
print(f"   Mappable drivers: {len(mappable_drivers)}/{len(fastf1_drivers)} ({len(mappable_drivers)/len(fastf1_drivers)*100:.1f}%)")

unmappable = fastf1_drivers - set(driver_mapping.keys())
if unmappable:
    print(f"   Unmappable drivers: {sorted(unmappable)}")

# Calculate potential merged dataset size
fastf1_with_mapping = fastf1_complete[fastf1_complete['driver_abbreviation'].isin(mappable_drivers)]
print(f"\n📈 Integration Potential:")
print(f"   FastF1 records with mappable drivers: {len(fastf1_with_mapping)}")
print(f"   Potential enhanced records: {len(fastf1_with_mapping)} with FastF1 features")
print(f"   Original Ergast features: {len(ergast_df.columns)}")
print(f"   New FastF1 features: {len(feature_cols)}")
print(f"   Total enhanced features: {len(ergast_df.columns) + len(feature_cols)}")

print(f"\n🎯 INTEGRATION RECOMMENDATION:")
integration_success_rate = len(fastf1_with_mapping) / len(ergast_df) * 100
if integration_success_rate > 50:
    print(f"   ✅ GOOD - {integration_success_rate:.1f}% of Ergast data can be enhanced")
    print(f"   📊 Recommended: Proceed with integration")
elif integration_success_rate > 25:
    print(f"   ⚠️ MODERATE - {integration_success_rate:.1f}% of Ergast data can be enhanced")
    print(f"   📊 Recommended: Improve driver mapping and proceed selectively")
else:
    print(f"   ❌ POOR - Only {integration_success_rate:.1f}% of Ergast data can be enhanced")
    print(f"   📊 Recommended: Significant improvements needed before integration")

print(f"\n📋 Next Steps for Integration:")
print(f"   1. Improve driver abbreviation mapping (add missing drivers)")
print(f"   2. Handle missing FastF1 data (imputation or exclusion strategy)")
print(f"   3. Align race/round matching for precise merging")
print(f"   4. Decide on feature selection (which FastF1 features to keep)")
print(f"   5. Create final integrated dataset for model training")


🔗 INTEGRATION COMPATIBILITY ANALYSIS:
📊 Dataset Comparison:
   Ergast records: 3718
   FastF1 records: 2415
   Overlap potential: 2415 records

📅 Year Overlap Analysis:
   Ergast years: [np.int64(2017), np.int64(2018), np.int64(2019), np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024), np.int64(2025)]
   FastF1 years: [np.int64(2018), np.int64(2019), np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024), np.int64(2025)]
   Overlapping years: [np.int64(2018), np.int64(2019), np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024), np.int64(2025)] (8 years)

🏎️ Driver Mapping Analysis:
   FastF1 drivers: 43 unique
   Ergast drivers: 48 unique
   Mappable drivers: 24/43 (55.8%)
   Unmappable drivers: ['AIT', 'ANT', 'BEA', 'BOR', 'COL', 'DEV', 'DOO', 'ERI', 'FIT', 'GIO', 'GRO', 'HAD', 'HAR', 'KUB', 'KVY', 'LAW', 'SAR', 'SIR', 'VAN']

📈 Integration Potential:
   FastF1 records with mappable drivers: 2097
   Potentia