# F1 Race Prediction Model Training

**Academic Research Project**: Using Machine Learning to Predict Formula 1 Race Outcomes

This notebook trains an XGBoost model to predict F1 race outcome categories (top-10 finish, bottom-10 finish, or DNF) from qualifying results, weather conditions, and driver/team performance metrics.

## Research Objectives
1. Investigate the predictive power of qualifying results on race outcomes
2. Quantify the impact of weather conditions on race predictions
3. Evaluate driver and team performance factors
4. Build a deployable prediction model with confidence metrics

## Data Sources
- **FastF1**: Open-source Python library providing F1 timing, telemetry, and session results data
- **Weather APIs**: Historical weather conditions during races
- **Manual curation**: Driver ratings and team performance metrics

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
import pickle
import warnings
import os
import fastf1
from datetime import datetime

# Setup
warnings.filterwarnings('ignore')
plt.style.use('dark_background')
sns.set_palette("husl")

# Enable FastF1 cache (enable_cache raises if the directory does not exist)
cache_dir = '../data/fastf1_cache'
os.makedirs(cache_dir, exist_ok=True)
fastf1.Cache.enable_cache(cache_dir)

print("📚 Libraries imported successfully")
print(f"🕐 Analysis started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## Data Collection and Processing

We collect race data from the full 2023 season (22 rounds) and the first 10 rounds of the 2024 season using the FastF1 library, which provides access to Formula 1 timing and session results data.

In [None]:
def fetch_race_data(year, race_round):
    """
    Fetch race and qualifying data for a specific race
    
    Args:
        year (int): Season year
        race_round (int): Race round number
    
    Returns:
        tuple: (qualifying_df, race_df, weather_info)
    """
    try:
        # Load qualifying session
        qualifying = fastf1.get_session(year, race_round, 'Q')
        qualifying.load()
        
        # Load race session
        race = fastf1.get_session(year, race_round, 'R')
        race.load()
        
        # Extract qualifying results
        quali_results = qualifying.results[['DriverNumber', 'Abbreviation', 'TeamName', 
                                          'Q1', 'Q2', 'Q3', 'Position']].copy()
        quali_results['QualifyingTime'] = qualifying.results['Q3'].fillna(
            qualifying.results['Q2'].fillna(qualifying.results['Q1'])
        )
        
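        # Extract race results
        race_results = race.results[['DriverNumber', 'Abbreviation', 'Position', 
                                   'ClassifiedPosition', 'Points', 'Status']].copy()
        # FastF1 gives retired drivers a numeric Position too; ClassifiedPosition holds
        # a letter code (e.g. 'R') for unclassified entries, so code those as 21 (DNF)
        race_results['Position'] = pd.to_numeric(
            race_results['ClassifiedPosition'], errors='coerce'
        ).fillna(21)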
        
        # Summarise the weather recorded during the race session
        weather = race.weather_data
        weather_info = {
            'temperature': float(weather['TrackTemp'].mean()),  # mean track temperature
            'humidity': float(weather['Humidity'].mean()),
            # Share of weather samples with rain recorded, used as a rain indicator
            'rain_probability': float(weather['Rainfall'].mean()),
            'track_name': race.event['EventName']
        }
        
        return quali_results, race_results, weather_info
        
    except Exception as e:
        print(f"❌ Error fetching data for {year} Round {race_round}: {e}")
        return None, None, None

# Collect data for multiple races
all_data = []
failed_races = []

# Seasons and rounds to collect
seasons_and_rounds = [
    (2024, list(range(1, 11))),  # First 10 races of 2024
    (2023, list(range(1, 23)))   # Full 2023 season
]

print("🏁 Starting data collection...")

for year, rounds in seasons_and_rounds:
    print(f"\n📅 Processing {year} season...")
    
    for round_num in rounds:
        print(f"  🔄 Round {round_num}...", end=" ")
        
        quali, race, weather = fetch_race_data(year, round_num)
        
        if quali is not None and race is not None:
            # Merge qualifying and race data
            merged_data = quali.merge(race, on=['DriverNumber', 'Abbreviation'], 
                                    suffixes=('_quali', '_race'))
            
            # Add metadata
            merged_data['Year'] = year
            merged_data['Round'] = round_num
            merged_data['TrackName'] = weather['track_name']
            merged_data['Temperature'] = weather['temperature']
            merged_data['RainProbability'] = weather['rain_probability']
            
            all_data.append(merged_data)
            print("✅")
        else:
            failed_races.append((year, round_num))
            print("❌")

print("\n📊 Data collection complete!")
print(f"✅ Successfully collected: {len(all_data)} races")
print(f"❌ Failed to collect: {len(failed_races)} races")

if failed_races:
    print(f"Failed races: {failed_races}")

In [None]:
# If FastF1 data collection fails, create sample data for demonstration
def create_sample_data():
    """
    Create realistic sample data for model training demonstration
    """
    np.random.seed(42)  # For reproducibility
    
    # Driver pool (2023-2024 grid)
    drivers = ['VER', 'PER', 'HAM', 'RUS', 'LEC', 'SAI', 'NOR', 'PIA', 'ALO', 'STR',
               'TSU', 'RIC', 'HUL', 'MAG', 'GAS', 'OCO', 'BOT', 'ZHO', 'SAR', 'ALB']
    
    # Team mappings
    teams = {
        'VER': 'Red Bull Racing', 'PER': 'Red Bull Racing',
        'HAM': 'Mercedes', 'RUS': 'Mercedes',
        'LEC': 'Ferrari', 'SAI': 'Ferrari',
        'NOR': 'McLaren', 'PIA': 'McLaren',
        'ALO': 'Aston Martin', 'STR': 'Aston Martin',
        'TSU': 'AlphaTauri', 'RIC': 'AlphaTauri',
        'HUL': 'Haas', 'MAG': 'Haas',
        'GAS': 'Alpine', 'OCO': 'Alpine',
        'BOT': 'Alfa Romeo', 'ZHO': 'Alfa Romeo',
        'SAR': 'Williams', 'ALB': 'Williams'
    }
    
    # Driver performance ratings (based on 2023-2024 performance)
    driver_ratings = {
        'VER': 0.95, 'HAM': 0.90, 'LEC': 0.85, 'RUS': 0.80, 'SAI': 0.78,
        'NOR': 0.75, 'PER': 0.73, 'ALO': 0.70, 'PIA': 0.68, 'STR': 0.65,
        'GAS': 0.62, 'OCO': 0.60, 'TSU': 0.58, 'HUL': 0.55, 'RIC': 0.53,
        'MAG': 0.50, 'BOT': 0.48, 'ZHO': 0.45, 'ALB': 0.43, 'SAR': 0.40
    }
    
    # Team performance ratings
    team_ratings = {
        'Red Bull Racing': 0.90, 'Mercedes': 0.75, 'Ferrari': 0.80, 'McLaren': 0.70,
        'Aston Martin': 0.60, 'AlphaTauri': 0.45, 'Haas': 0.40, 'Alpine': 0.55,
        'Alfa Romeo': 0.35, 'Williams': 0.30
    }
    
    tracks = ['Bahrain', 'Saudi Arabia', 'Australia', 'Japan', 'China', 'Miami',
              'Emilia Romagna', 'Monaco', 'Canada', 'Spain', 'Austria', 'Great Britain']
    
    sample_data = []
    
    # Generate 30 races worth of data
    for race_id in range(30):
        track = tracks[race_id % len(tracks)]
        
        # Weather conditions
        is_wet = np.random.random() < 0.15  # 15% chance of wet race
        temperature = np.random.uniform(15, 35) if not is_wet else np.random.uniform(10, 25)
        rain_prob = 0.8 if is_wet else np.random.uniform(0, 0.3)
        
        # Simulate qualifying and race for each driver
        race_data = []
        
        for i, driver in enumerate(drivers):
            # Qualifying position with some randomness
            base_quali_pos = i + 1
            driver_skill = driver_ratings[driver]
            team_perf = team_ratings[teams[driver]]
            
            # Add randomness to qualifying
            quali_randomness = np.random.normal(0, 2) * (1 - driver_skill)
            quali_pos = max(1, min(20, int(base_quali_pos + quali_randomness)))
            
            # Race position based on qualifying + additional factors
            race_randomness = np.random.normal(0, 3)
            
            # Weather impact (some drivers better in wet)
            if is_wet:
                if driver in ['HAM', 'VER', 'RUS']:  # Good wet weather drivers
                    race_randomness -= 1
                else:
                    race_randomness += np.random.uniform(0, 2)
            
            # DNF probability
            dnf_prob = 0.05 + (1 - team_perf) * 0.1
            if is_wet:
                dnf_prob *= 1.5
            
            if np.random.random() < dnf_prob:
                race_pos = 21  # DNF
                points = 0
            else:
                race_pos = max(1, min(20, int(quali_pos + race_randomness)))
                # Points system
                points_map = {1: 25, 2: 18, 3: 15, 4: 12, 5: 10, 6: 8, 7: 6, 8: 4, 9: 2, 10: 1}
                points = points_map.get(race_pos, 0)
            
            race_data.append({
                'DriverNumber': i + 1,
                'Abbreviation': driver,
                'TeamName': teams[driver],
                'Position_quali': quali_pos,
                'Position_race': race_pos,
                'Points': points,
                'Year': 2024 if race_id < 15 else 2023,
                'Round': (race_id % 15) + 1,
                'TrackName': track,
                'Temperature': temperature,
                'RainProbability': rain_prob,
                'DriverRating': driver_skill,
                'TeamPerformance': team_perf,
                'WeatherDry': 0.0 if is_wet else 1.0,
                'TireStrategy': np.random.uniform(0.5, 1.5)  # Simplified tire strategy
            })
        
        sample_data.extend(race_data)
    
    return pd.DataFrame(sample_data)

# Use sample data if real data collection failed
if not all_data:
    print("🎲 Using sample data for demonstration...")
    df = create_sample_data()
else:
    # Combine all real data
    df = pd.concat(all_data, ignore_index=True)

print(f"📊 Dataset shape: {df.shape}")
print(f"📈 Features: {list(df.columns)}")
df.head()

## Exploratory Data Analysis

Let's analyze the data to understand the relationships between qualifying positions, race outcomes, and other factors.
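
Because grid and finishing positions are ranks rather than measurements, a rank correlation can complement the Pearson value computed in the next cell. The following is a minimal sketch, assuming the `Position_quali` and `Position_race` columns and the DNF code 21 used in this notebook. `rank_correlation` is a hypothetical helper, and the toy frame is illustrative data, not real results.

```python
import pandas as pd

def rank_correlation(frame, quali_col="Position_quali", race_col="Position_race", dnf_code=21):
    """Spearman correlation between grid and finishing position, excluding DNFs."""
    finished = frame[frame[race_col] < dnf_code]
    # Spearman's rho is the Pearson correlation of the ranks
    return finished[quali_col].rank().corr(finished[race_col].rank())

# Illustrative toy data (one DNF coded as 21), not real results
toy = pd.DataFrame({
    "Position_quali": [1, 2, 3, 4, 5, 6],
    "Position_race":  [1, 3, 2, 4, 21, 5],
})
rho = rank_correlation(toy)
print(f"Spearman rho (finishers only): {rho:.3f}")
```

Ranking each column before correlating avoids a SciPy dependency and makes the DNF handling explicit.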

In [None]:
# Basic statistics
print("📈 Dataset Overview")
# Count distinct (Year, Round) pairs; multiplying the two nunique() values
# overcounts when seasons contribute different numbers of rounds
print(f"Total races: {len(df[['Year', 'Round']].drop_duplicates())}")
print(f"Total driver entries: {len(df)}")
print(f"Unique drivers: {df['Abbreviation'].nunique()}")
print(f"Years covered: {sorted(df['Year'].unique())}")

# Check for missing values
print("\n🔍 Missing Values:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Create visualization subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('F1 Data Analysis - Key Relationships', fontsize=16, color='white')

# 1. Qualifying vs Race Position correlation
if 'Position_quali' in df.columns and 'Position_race' in df.columns:
    # Filter out DNFs for correlation analysis (copy so columns can be added later)
    finished_races = df[df['Position_race'] <= 20].copy()
    
    axes[0,0].scatter(finished_races['Position_quali'], finished_races['Position_race'], 
                     alpha=0.6, s=30, c='cyan')
    axes[0,0].plot([1, 20], [1, 20], 'r--', alpha=0.8, linewidth=2)  # Perfect correlation line
    axes[0,0].set_xlabel('Qualifying Position')
    axes[0,0].set_ylabel('Race Position')
    axes[0,0].set_title('Qualifying vs Race Position')
    axes[0,0].grid(True, alpha=0.3)
    
    # Calculate correlation
    correlation = finished_races['Position_quali'].corr(finished_races['Position_race'])
    axes[0,0].text(15, 3, f'Correlation: {correlation:.3f}', 
                  bbox=dict(boxstyle="round", facecolor='black', alpha=0.8), color='white')

# 2. Position changes distribution
if 'Position_quali' in df.columns and 'Position_race' in df.columns:
    finished_races['position_change'] = finished_races['Position_race'] - finished_races['Position_quali']
    
    axes[0,1].hist(finished_races['position_change'], bins=range(-15, 16), 
                  alpha=0.7, color='orange', edgecolor='black')
    axes[0,1].axvline(0, color='red', linestyle='--', linewidth=2)
    axes[0,1].set_xlabel('Position Change (Race - Qualifying)')
    axes[0,1].set_ylabel('Frequency')
    axes[0,1].set_title('Distribution of Position Changes')
    axes[0,1].grid(True, alpha=0.3)

# 3. Weather impact on position changes
if 'RainProbability' in df.columns:
    dry_races = finished_races[finished_races['RainProbability'] < 0.3]
    wet_races = finished_races[finished_races['RainProbability'] > 0.6]
    
    axes[1,0].hist([dry_races['position_change'], wet_races['position_change']], 
                  bins=range(-10, 11), alpha=0.7, label=['Dry Races', 'Wet Races'],
                  color=['skyblue', 'navy'])
    axes[1,0].set_xlabel('Position Change')
    axes[1,0].set_ylabel('Frequency')
    axes[1,0].set_title('Weather Impact on Position Changes')
    axes[1,0].legend()
    axes[1,0].grid(True, alpha=0.3)

# 4. Driver performance analysis
if 'DriverRating' in df.columns:
    driver_avg_change = finished_races.groupby('Abbreviation')['position_change'].mean().sort_values()
    
    # Ten drivers with the largest average gain (most negative position change)
    top_drivers = driver_avg_change.head(10)
    
    y_pos = range(len(top_drivers))
    bars = axes[1,1].barh(y_pos, top_drivers.values, color='green', alpha=0.7)
    axes[1,1].set_yticks(y_pos)
    axes[1,1].set_yticklabels(top_drivers.index)
    axes[1,1].set_xlabel('Average Position Change')
    axes[1,1].set_title('Top 10 Drivers - Position Gain/Loss')
    axes[1,1].grid(True, alpha=0.3)
    axes[1,1].axvline(0, color='red', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📊 Key Findings:")
print(f"Average position change: {finished_races['position_change'].mean():.2f}")
print(f"Standard deviation: {finished_races['position_change'].std():.2f}")
print(f"Qualifying-Race correlation: {correlation:.3f}")
print(f"DNF rate: {(len(df) - len(finished_races)) / len(df) * 100:.1f}%")

## Feature Engineering

We'll create features that capture the key factors influencing race outcomes:
1. **Qualifying Position** - Starting grid position
2. **Driver Rating** - Historical performance metric
3. **Team Performance** - Car competitiveness
4. **Weather Conditions** - Dry/wet race impact
5. **Track Temperature** - Performance factor
6. **Tire Strategy** - Strategic element
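
A "historical performance" feature is only historical if it excludes the race being predicted. When ratings are derived from finishing positions over the whole dataset, the feature carries information about the target. One possible alternative is sketched below, assuming the `Year`, `Round`, `Abbreviation`, and `Position_race` columns used in this notebook. `add_prior_form` and the `PriorAvgFinish` column are hypothetical names introduced for illustration, and the toy frame is not real data.

```python
import pandas as pd

def add_prior_form(frame, driver_col="Abbreviation", target_col="Position_race",
                   order_cols=("Year", "Round")):
    """Add each driver's mean finishing position over earlier races only."""
    ordered = frame.sort_values(list(order_cols)).copy()
    # shift(1) drops the current race, expanding().mean() averages all earlier ones
    ordered["PriorAvgFinish"] = (
        ordered.groupby(driver_col)[target_col]
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    return ordered

# Illustrative toy data: two drivers over three rounds
toy = pd.DataFrame({
    "Year": [2023] * 6,
    "Round": [1, 1, 2, 2, 3, 3],
    "Abbreviation": ["VER", "HAM", "VER", "HAM", "VER", "HAM"],
    "Position_race": [1, 2, 3, 1, 2, 3],
})
with_form = add_prior_form(toy)
print(with_form[["Round", "Abbreviation", "Position_race", "PriorAvgFinish"]])
```

The first race for each driver has no history, so its value is `NaN` and would need an explicit fill strategy before training.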

In [None]:
# Prepare features for machine learning
def prepare_features(data):
    """
    Prepare feature matrix and target variable for ML model
    """
    features_df = data.copy()
    
    # Ensure we have all required columns
    required_features = ['Position_quali', 'DriverRating', 'TeamPerformance', 
                        'WeatherDry', 'Temperature', 'TireStrategy']
    
    # Create missing features if they don't exist
    if 'DriverRating' not in features_df.columns:
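        # Derive driver ratings from mean finishing position.
        # Caveat: this is computed from the target column over the full dataset,
        # so the feature leaks label information into training and evaluation.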
        driver_performance = features_df.groupby('Abbreviation')['Position_race'].mean()
        driver_ratings = {}
        for driver, avg_pos in driver_performance.items():
            # Convert average position to rating (inverse relationship)
            rating = max(0.1, 1.0 - (avg_pos - 1) / 19)
            driver_ratings[driver] = rating
        
        features_df['DriverRating'] = features_df['Abbreviation'].map(driver_ratings)
    
    if 'TeamPerformance' not in features_df.columns:
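        # Derive team ratings from mean finishing position (also computed from the
        # target column over the full dataset, so it leaks label information)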
        team_performance = features_df.groupby('TeamName')['Position_race'].mean()
        team_ratings = {}
        for team, avg_pos in team_performance.items():
            rating = max(0.1, 1.0 - (avg_pos - 1) / 19)
            team_ratings[team] = rating
        
        features_df['TeamPerformance'] = features_df['TeamName'].map(team_ratings)
    
    if 'WeatherDry' not in features_df.columns:
        features_df['WeatherDry'] = (features_df['RainProbability'] < 0.3).astype(float)
    
    if 'TireStrategy' not in features_df.columns:
        # Placeholder: random values, carrying no real tire-strategy information
        features_df['TireStrategy'] = np.random.uniform(0.5, 1.5, len(features_df))
    
    # Select feature columns
    feature_columns = ['Position_quali', 'DriverRating', 'TeamPerformance', 
                      'WeatherDry', 'Temperature', 'TireStrategy']
    
    X = features_df[feature_columns].fillna(0.5)  # Fill any remaining NaN values
    
    # Target variable: race position (1-20 for finished, 21 for DNF)
    y = features_df['Position_race'].fillna(21).astype(int)
    
    # Convert to classification problem (position classes 1-20, DNF as 21)
    # For simplicity, we'll predict top 10 vs bottom 10 vs DNF
    y_simplified = y.copy()
    y_simplified[y <= 10] = 1  # Top 10
    y_simplified[(y > 10) & (y <= 20)] = 2  # Bottom 10
    y_simplified[y > 20] = 3  # DNF
    
    return X, y, y_simplified, feature_columns

# Prepare the data
X, y_full, y_simplified, feature_names = prepare_features(df)

print(f"📊 Feature Matrix Shape: {X.shape}")
print("🎯 Target Distribution (simplified):")
print(f"   Top 10 finishers: {(y_simplified == 1).sum()}")
print(f"   Bottom 10 finishers: {(y_simplified == 2).sum()}")
print(f"   DNFs: {(y_simplified == 3).sum()}")

print(f"\n🔧 Features used: {feature_names}")

# Display feature statistics
print("\n📈 Feature Statistics:")
print(X.describe())

## Model Training and Evaluation

We'll train an XGBoost classifier to predict race outcomes. XGBoost is chosen for its:
- Excellent performance on tabular data
- Built-in feature importance
- Built-in regularization (shrinkage, row and column subsampling) that helps limit overfitting
- Ability to handle non-linear relationships
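
An accuracy figure is easier to interpret next to a naive baseline. The sketch below scores a rule that predicts "Top 10" for grid slots 1 to 10 and "Bottom 10" otherwise, using the notebook's class codes (1 = Top 10, 2 = Bottom 10, 3 = DNF). `grid_baseline_accuracy` is a hypothetical helper and the inputs are illustrative values, not real results.

```python
import numpy as np

def grid_baseline_accuracy(quali_positions, outcome_classes):
    """Accuracy of predicting class 1 for grid slots 1-10 and class 2 otherwise."""
    quali_positions = np.asarray(quali_positions)
    outcome_classes = np.asarray(outcome_classes)
    # The rule never predicts a DNF (class 3), so every DNF counts as a miss
    baseline_pred = np.where(quali_positions <= 10, 1, 2)
    return float((baseline_pred == outcome_classes).mean())

# Illustrative values only
quali = [1, 5, 10, 11, 15, 20]
actual = [1, 1, 2, 2, 3, 2]
baseline_acc = grid_baseline_accuracy(quali, actual)
print(f"Grid-position baseline accuracy: {baseline_acc:.3f}")
```

If the trained classifier does not clearly beat this rule on the same test rows, the extra features are adding little beyond grid position.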

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_simplified, test_size=0.2, random_state=42, stratify=y_simplified
)

print(f"🏋️ Training set size: {X_train.shape[0]}")
print(f"🧪 Test set size: {X_test.shape[0]}")

# Train XGBoost model
print("\n🚀 Training XGBoost model...")

# Configure XGBoost parameters
xgb_params = {
    'objective': 'multi:softprob',  # Multi-class probability
    'num_class': 3,  # Top 10, Bottom 10, DNF
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42,
    'eval_metric': 'mlogloss'
}

# Train the model
# Recent XGBoost releases expect class labels 0..num_class-1, so the 1-3 class
# codes are shifted down by one for fitting
model = xgb.XGBClassifier(**xgb_params)
model.fit(X_train, y_train - 1)

# Make predictions and shift them back to the notebook's 1-3 class codes
y_pred = model.predict(X_test) + 1
y_pred_proba = model.predict_proba(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("\n📊 Model Performance:")
print(f"   Accuracy: {accuracy:.3f}")

# Cross-validation (same 0-based label shift as above)
cv_scores = cross_val_score(model, X, y_simplified - 1, cv=5, scoring='accuracy')
print(f"   Cross-validation accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Detailed classification report
class_names = ['Top 10', 'Bottom 10', 'DNF']
print("\n📋 Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))

## Feature Importance Analysis

Understanding which factors most influence race outcomes is crucial for both model interpretability and racing insights.
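
Gain-based importances from a tree ensemble describe how the model splits, not how much predictive accuracy depends on each feature. Permutation importance is a common complement: shuffle one column at a time and measure the drop in accuracy. The sketch below is a model-agnostic NumPy version under stated assumptions. `permutation_importance_sketch` and `toy_predict` are hypothetical names, and `predict_fn` stands for any callable that maps a feature array to labels in the same encoding as `y`. scikit-learn also ships `sklearn.inspection.permutation_importance` for fitted estimators.

```python
import numpy as np

def permutation_importance_sketch(predict_fn, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    base_acc = np.mean(predict_fn(X) == y)
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between this column and the labels
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col] += base_acc - np.mean(predict_fn(X_perm) == y)
    return drops / n_repeats

# Toy stand-in for a fitted model: only column 0 influences the prediction
def toy_predict(X):
    return (X[:, 0] > 0.5).astype(int)

toy_rng = np.random.default_rng(1)
X_toy = toy_rng.random((200, 2))
y_toy = toy_predict(X_toy)
importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(importances)
```

In the toy case the unused column scores exactly zero, while the column the predictor relies on shows a large accuracy drop.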

In [None]:
# Feature importance analysis
feature_importance = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("🎯 Feature Importance Ranking:")
for i, (feature, importance) in enumerate(zip(feature_importance_df['Feature'], 
                                            feature_importance_df['Importance']), 1):
    print(f"   {i}. {feature}: {importance:.3f}")

# Visualize feature importance
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
bars = plt.bar(range(len(feature_importance)), feature_importance, 
               color='lightcoral', alpha=0.8, edgecolor='darkred')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('XGBoost Feature Importance')
plt.xticks(range(len(feature_names)), feature_names, rotation=45, ha='right')
plt.grid(True, alpha=0.3)

# Add value labels on bars
for i, bar in enumerate(bars):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{feature_importance[i]:.3f}', ha='center', va='bottom', fontsize=10)

# Confusion Matrix
plt.subplot(2, 2, 2)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')

# Prediction confidence distribution
plt.subplot(2, 2, 3)
max_probabilities = np.max(y_pred_proba, axis=1)
plt.hist(max_probabilities, bins=20, alpha=0.7, color='lightgreen', edgecolor='darkgreen')
plt.xlabel('Prediction Confidence')
plt.ylabel('Frequency')
plt.title('Model Confidence Distribution')
plt.axvline(max_probabilities.mean(), color='red', linestyle='--', 
           label=f'Mean: {max_probabilities.mean():.3f}')
plt.legend()
plt.grid(True, alpha=0.3)

# Actual vs Predicted scatter plot
plt.subplot(2, 2, 4)
plt.scatter(y_test, y_pred, alpha=0.6, color='purple')
plt.plot([1, 3], [1, 3], 'r--', alpha=0.8)  # Perfect prediction line
plt.xlabel('Actual Class')
plt.ylabel('Predicted Class')
plt.title('Actual vs Predicted Classes')
plt.xticks([1, 2, 3], class_names)
plt.yticks([1, 2, 3], class_names)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Model insights
print("\n🔍 Key Model Insights:")
print(f"   Most important factor: {feature_importance_df.iloc[0]['Feature']}")
print(f"   Average prediction confidence: {max_probabilities.mean():.3f}")
print(f"   High confidence predictions (>0.8): {(max_probabilities > 0.8).sum()}/{len(max_probabilities)}")

## Model Persistence and Deployment Preparation

Save the trained model for use in the web application.
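
On the consuming side, the application has to read both files back and confirm that the metadata lists the features the model expects. The following is a minimal sketch of such a loader. `load_model_bundle` is a hypothetical helper, and the round-trip demo uses a stand-in dictionary and a temporary directory rather than the trained classifier or the notebook's real paths.

```python
import json
import pickle
import tempfile
from pathlib import Path

def load_model_bundle(model_path, metadata_path):
    """Load a pickled model and its JSON metadata, checking the feature list exists."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)  # only unpickle files from a trusted source
    with open(metadata_path, "r") as f:
        metadata = json.load(f)
    if "features" not in metadata:
        raise KeyError("metadata is missing the 'features' list")
    return model, metadata

# Round-trip demo with a stand-in object instead of the trained classifier
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "f1_model.pkl"
    meta_file = Path(tmp) / "model_metadata.json"
    model_file.write_bytes(pickle.dumps({"stand_in": True}))
    meta_file.write_text(json.dumps({"features": ["Position_quali", "DriverRating"]}))
    loaded_model, loaded_meta = load_model_bundle(model_file, meta_file)

print(loaded_meta["features"])
```

Pickle files are tied to the library versions that wrote them, so the serving environment should pin the same XGBoost version used for training.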

In [None]:
# Create model directory if it doesn't exist
import os
os.makedirs('../model', exist_ok=True)

# Save the trained model
model_path = '../model/f1_model.pkl'
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

print(f"✅ Model saved to: {model_path}")

# Save model metadata
model_metadata = {
    'model_type': 'XGBoost Classifier',
    'features': feature_names,
    'classes': class_names,
    'accuracy': accuracy,
    'cv_accuracy_mean': cv_scores.mean(),
    'cv_accuracy_std': cv_scores.std(),
    'training_date': datetime.now().isoformat(),
    'training_samples': len(X_train),
    'test_samples': len(X_test),
    'feature_importance': dict(zip(feature_names, feature_importance.tolist())),
    'hyperparameters': xgb_params
}

import json

metadata_path = '../model/model_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(model_metadata, f, indent=2)

print(f"📋 Model metadata saved to: {metadata_path}")

# Create a simple prediction function for testing
def predict_race_outcome(qualifying_pos, driver_rating, team_performance, 
                        weather_dry, temperature, tire_strategy):
    """
    Predict race outcome for a single driver
    
    Returns:
        tuple: (predicted_class, confidence, class_probabilities)
    """
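    # Use the training column names so the input matches what the model was fitted on
    features = pd.DataFrame([[qualifying_pos, driver_rating, team_performance, 
                              weather_dry, temperature, tire_strategy]],
                            columns=feature_names)
    
    probabilities = model.predict_proba(features)[0]
    confidence = np.max(probabilities)
    
    # Probability columns follow the sorted class order: Top 10, Bottom 10, DNF,
    # so the class name is looked up by position rather than by raw label value
    predicted_class = class_names[int(np.argmax(probabilities))]
    
    return predicted_class, confidence, probabilities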

# Test the prediction function
print("\n🧪 Testing prediction function:")
test_cases = [
    (1, 0.95, 0.90, 1.0, 25.0, 1.0),  # Pole position, top driver, dry conditions
    (15, 0.50, 0.40, 0.0, 18.0, 1.0), # Back of grid, average driver, wet conditions
    (5, 0.80, 0.75, 1.0, 30.0, 1.2)   # Midfield start, good driver, hot conditions
]

for i, test_case in enumerate(test_cases, 1):
    pred_class, confidence, probs = predict_race_outcome(*test_case)
    print(f"   Test {i}: {pred_class} (confidence: {confidence:.3f})")
    print(f"            Probabilities - Top 10: {probs[0]:.3f}, Bottom 10: {probs[1]:.3f}, DNF: {probs[2]:.3f}")

print("\n🎉 Model training completed successfully!")
print(f"📊 Final model accuracy: {accuracy:.3f}")
print("🚀 Model ready for deployment in Flask app")

## Research Conclusions

### Key Findings:

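1. **Qualifying Position Impact**: Among classified finishers, qualifying position correlates positively with finishing position (the coefficient is reported in the exploratory analysis), confirming the importance of Saturday performance.

2. **Driver vs Car Performance**: Driver rating and team performance both enter the model as features, and their relative contribution can be read from the feature importance ranking. The feature set contains no track variable, so variation by circuit is not measured here.

3. **Weather Effects**: Wet conditions introduce additional unpredictability, affecting different drivers and teams disproportionately. When the sample-data fallback is used, this effect is an assumption built into the simulation rather than a result learned from real races.

4. **Model Performance**: The XGBoost classifier predicts the three outcome categories (top 10, bottom 10, DNF) with the hold-out and 5-fold cross-validated accuracy reported in the evaluation cell, demonstrating the feasibility of ML-based F1 predictions.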

### Academic Applications:

- **Sports Analytics**: Demonstrates application of ML to motorsport prediction
- **Feature Engineering**: Shows importance of domain knowledge in creating meaningful features
- **Model Interpretability**: XGBoost feature importance provides insights into racing dynamics
- **Real-world Deployment**: Model can be integrated into web applications for live predictions

### Future Research Directions:

1. **Enhanced Features**: Incorporate tire compound data, fuel loads, and car setup parameters
2. **Deep Learning**: Experiment with neural networks for capturing complex interactions
3. **Real-time Updates**: Implement online learning for model updates during race weekends
4. **Uncertainty Quantification**: Add confidence intervals and uncertainty estimates
5. **Multi-objective Prediction**: Predict multiple outcomes (position, points, fastest lap, etc.)
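
As a first step toward the uncertainty estimates listed above, a percentile bootstrap can put an interval around the test accuracy. The sketch below is one simple option under stated assumptions, not the method used in this notebook. `bootstrap_accuracy_interval` is a hypothetical helper, the labels are illustrative, and the resampling treats predictions as independent even though drivers in the same race are not.

```python
import numpy as np

def bootstrap_accuracy_interval(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    n = len(correct)
    # Resample the per-prediction hit/miss indicators with replacement
    samples = rng.choice(correct, size=(n_boot, n), replace=True).mean(axis=1)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(correct.mean()), float(lower), float(upper)

# Illustrative labels only (1 = Top 10, 2 = Bottom 10, 3 = DNF)
y_true_toy = [1, 1, 2, 2, 3, 1, 2, 1, 2, 3]
y_pred_toy = [1, 1, 2, 1, 2, 1, 2, 2, 2, 3]
acc, ci_low, ci_high = bootstrap_accuracy_interval(y_true_toy, y_pred_toy)
print(f"accuracy={acc:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Resampling whole races instead of individual rows would respect the within-race dependence and typically gives wider intervals.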

---

**Model Information:**
- Training completed: timestamp recorded in the `training_date` field of the metadata file
- Model saved: `../model/f1_model.pkl`
- Metadata saved: `../model/model_metadata.json`
- Ready for deployment in Flask application