# AI Top Goal Scorer: Live Season Predictor

**Project:** Infosys Springboard AI Internship
**Author:** Arvind K N

This notebook trains an XGBoost regression model to predict a player's final goal tally from **in-season statistics**. The model is designed as a 'live' predictor: it estimates end-of-season goals from data available partway through a season.

### Step 1: Setup and Imports
We import all necessary libraries for data manipulation, modeling, and evaluation.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import joblib

print("--- Initiating Live Goal Scorer Model Training --- ")

--- Initiating Live Goal Scorer Model Training --- 


### Step 2: Data Loading and Feature Selection
We load the dataset and keep only the columns the 'live' model uses, including in-season metrics such as `Goals_per_90` and `Appearances`. Rows with missing values in any of these columns are dropped.

In [3]:
print("\n[STEP 1/4] Loading and selecting features...")
try:
    df = pd.read_csv("Top Goals.csv")
except FileNotFoundError:
    print("Error: 'Top Goals.csv' not found.")
    raise  # re-raise so later cells do not run without a DataFrame

# Select all relevant columns for the 'live' model
df = df[['Player', 'Club', 'Position', 'Age', 'Goals_prev_season', 'Goals_per_90', 'Appearances', 'Assists', 'Goals']].copy()
df.dropna(inplace=True)
print("Feature selection complete.")


[STEP 1/4] Loading and selecting features...
Feature selection complete.
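
The selection step above can fail with a bare `KeyError` if the CSV lacks a column, and `dropna` silently discards rows. The sketch below shows one way to make both visible. The helper name `select_features`, the constant `REQUIRED_COLUMNS`, and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column list copied from the notebook; the constant name is an assumption.
REQUIRED_COLUMNS = ['Player', 'Club', 'Position', 'Age', 'Goals_prev_season',
                    'Goals_per_90', 'Appearances', 'Assists', 'Goals']


def select_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Return the required columns with incomplete rows dropped."""
    missing = [col for col in REQUIRED_COLUMNS if col not in raw.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    selected = raw[REQUIRED_COLUMNS].dropna()
    print(f"Kept {len(selected)} of {len(raw)} rows after dropping missing values.")
    return selected


# Made-up rows purely for illustration; they do not come from 'Top Goals.csv'.
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Club': ['X', 'Y', 'Z'],
    'Position': ['FW', 'MF', 'FW'],
    'Age': [24, 29, None],            # one missing value
    'Goals_prev_season': [10, 4, 7],
    'Goals_per_90': [0.6, 0.2, 0.4],
    'Appearances': [20, 18, 15],
    'Assists': [3, 5, 1],
    'Goals': [14, 5, 8],
    'Nationality': ['n/a', 'n/a', 'n/a'],  # extra column, ignored
})
clean = select_features(toy)
```

Reporting the row count makes it easier to notice when a single sparse column removes a large share of the data.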


### Step 3: Define Features and Build the Pipeline
We define the feature set (`X`) and target (`y`), then build a `Pipeline` that bundles the preprocessor with an `XGBRegressor`. The categorical columns are one-hot encoded, the numeric columns pass through unchanged, and the whole chain can be fitted, evaluated, and saved as a single object.

In [4]:
print("\n[STEP 2/4] Building the XGBoost pipeline...")

# Features: player identity, club, position, age, and in-season stats
X = df[['Player', 'Club', 'Position', 'Age', 'Goals_prev_season', 'Goals_per_90', 'Appearances', 'Assists']]
y = df['Goals']

categorical_features = ['Player', 'Club', 'Position']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', xgb.XGBRegressor(
        objective='reg:squarederror',
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42,
    )),
])

print("XGBoost Pipeline created successfully.")


[STEP 2/4] Building the XGBoost pipeline...
XGBoost Pipeline created successfully.


### Step 4: Evaluate Model Performance
We hold out 20% of the rows as a test set, fit the pipeline on the remaining 80%, and report R² and RMSE on the held-out rows. Because in-season features such as `Goals_per_90` and `Appearances` are closely related to the goals a player has already scored, we expect a high R² score.

In [6]:
print("\n[STEP 3/4] Evaluating the live model's performance...")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("--- Live Model Evaluation Report ---")
print(f"R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print("----------------------------------")
print(f"\nInterpretation: The model's high R² score of {r2:.0%} demonstrates its strong ability to estimate final goals using in-season data.")


[STEP 3/4] Evaluating the live model's performance...
--- Live Model Evaluation Report ---
R-squared (R²): 0.92
Root Mean Squared Error (RMSE): 1.49
----------------------------------

Interpretation: The model's high R² score of 92% demonstrates its strong ability to estimate final goals using in-season data.
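
For readers less familiar with the two metrics in the report, the following sketch computes them by hand with NumPy on four made-up values. It also adds the RMSE of always predicting the mean, a simple baseline the notebook does not report. All numbers and variable names are illustrative assumptions.

```python
import numpy as np

# Made-up actual and predicted goal tallies, for illustration only.
y_true = np.array([10.0, 5.0, 20.0, 9.0])
y_hat = np.array([12.0, 4.0, 18.0, 9.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in units of goals.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: one minus the ratio of residual variation to total variation
# around the mean of the actual values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Baseline: the RMSE obtained by always predicting the mean of y_true.
rmse_baseline = np.sqrt(ss_tot / len(y_true))

print(f"RMSE: {rmse_manual:.2f}")
print(f"R²: {r2_manual:.2f}")
print(f"Mean-predictor RMSE: {rmse_baseline:.2f}")
```

Comparing the model's RMSE with the mean-predictor RMSE gives a sense of how much error the model removes, which R² expresses as a proportion.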


### Step 5: Train and Save the Final Production Model
Finally, we refit the pipeline on the entire dataset and save it with `joblib` so that our Streamlit app can load it.

In [7]:
print("\n[STEP 4/4] Training and saving the final production model...")
model_pipeline.fit(X, y)
file_path = 'top_goal_scorer_model.pkl'
joblib.dump(model_pipeline, file_path)

print(f"\n✅ Success! The final Top Goal Scorer model has been saved to '{file_path}'.")


[STEP 4/4] Training and saving the final production model...

✅ Success! The final Top Goal Scorer model has been saved to 'top_goal_scorer_model.pkl'.
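
A saved pipeline is loaded back with `joblib.load` and expects a DataFrame with the same feature columns it was trained on. The sketch below shows that round trip. To stay self-contained it fits a small stand-in pipeline (a `LinearRegression` in place of the notebook's `XGBRegressor`) on made-up rows and writes it to a temporary file; the file name `toy_model.pkl`, the toy data, and the new player's values are all hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

FEATURES = ['Player', 'Club', 'Position', 'Age', 'Goals_prev_season',
            'Goals_per_90', 'Appearances', 'Assists']
CATEGORICAL = ['Player', 'Club', 'Position']

# Made-up rows; not taken from 'Top Goals.csv'.
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'Club': ['X', 'Y', 'X', 'Z'],
    'Position': ['FW', 'MF', 'FW', 'DF'],
    'Age': [24, 29, 31, 22],
    'Goals_prev_season': [10, 4, 7, 1],
    'Goals_per_90': [0.6, 0.2, 0.4, 0.05],
    'Appearances': [20, 18, 15, 19],
    'Assists': [3, 5, 1, 0],
    'Goals': [14, 5, 8, 1],
})

# Stand-in for the notebook's pipeline: same preprocessing, simpler regressor.
stand_in = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(
        transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), CATEGORICAL)],
        remainder='passthrough',
        sparse_threshold=0,  # dense output keeps the stand-in simple
    )),
    ('regressor', LinearRegression()),
])
stand_in.fit(toy[FEATURES], toy['Goals'])

# Save and reload, as the Streamlit app would do with the real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(stand_in, path)
    loaded = joblib.load(path)

# One new player, passed as a single-row DataFrame with the training columns.
new_player = pd.DataFrame([{
    'Player': 'E', 'Club': 'Y', 'Position': 'FW', 'Age': 26,
    'Goals_prev_season': 9, 'Goals_per_90': 0.5,
    'Appearances': 17, 'Assists': 2,
}])[FEATURES]

prediction = float(loaded.predict(new_player)[0])
print(f"Predicted final goals: {prediction:.1f}")
```

Because the preprocessor is stored inside the pipeline, the app would not need to repeat the encoding step. It only has to supply the raw columns in a DataFrame.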
