# Retraining Regression Models on Transfermarkt Dataset (No Data Leakage)

This notebook retrains regression models (Linear Regression, Random Forest, XGBoost) to predict player market value, after removing the `value_log` feature to eliminate data leakage. The goal is to obtain realistic model performance and ensure better generalization to unseen data.


Now that `value_log` has been removed from the input features, the next step is to retrain all three models (Linear Regression, Random Forest, and XGBoost) using only the remaining attributes. This gives a more realistic measure of model performance, free of information leakage.

In [1]:
#import necessary libraries for data handling, visualization, modeling, and evaluation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [2]:
#load the cleaned and feature-enhanced dataset into a pandas dataframe
df = pd.read_csv("/content/cleaned_tm_players_dataset_v3_with_features.csv")

We now define the input features that the model will use to learn from and set the target variable we want to predict. Importantly, we exclude `value_log` to prevent the model from having prior knowledge of the answer.

In [3]:
#define the features to be used for prediction and set the target variable
features = ['age', 'height_in_cm', 'position_encoded', 'highest_market_value_in_eur']
target = 'market_value_in_eur'

X = df[features]
y = df[target]
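
To see why `value_log` had to be excluded, note that a log transform of the target is invertible, so any model given that column can reconstruct the target almost exactly. The sketch below assumes `value_log` was computed as `log1p(market_value_in_eur)`; the notebook does not show how the column was built, and the sample values are made up for illustration.

```python
import math

# Hypothetical market values in EUR (illustrative, not taken from the dataset)
market_values = [50_000.0, 400_000.0, 2_500_000.0, 30_000_000.0]

# Assumption: value_log was derived from the target via log1p
value_log = [math.log1p(v) for v in market_values]

# Inverting the transform recovers the target, so the feature leaks the answer
recovered = [math.expm1(v) for v in value_log]

# Relative reconstruction error is at floating-point noise level
max_rel_error = max(abs(r - v) / v for r, v in zip(recovered, market_values))
print(f"largest relative error after inverting: {max_rel_error:.2e}")
```

A feature like this lets a flexible model learn the inverse transform instead of anything about the player, which is consistent with the near-perfect scores seen before the column was removed.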


In [4]:
#split the dataset into training and testing sets with an 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [5]:
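#standardize the input features for linear regression, fitting the scaler on the training set only
#(for plain least squares this puts coefficients on a comparable scale; it does not change the predictions)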
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
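
The scaling step above learns its statistics from the training split and reuses them on the test split. As a rough illustration of what that means for a single column, the sketch below reimplements the idea with the standard library; `fit_standardizer` and `standardize` are hypothetical helper names and the ages are invented. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import statistics

def fit_standardizer(train_column):
    """Learn the mean and population std from the training values only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population std, i.e. ddof=0
    return mean, std

def standardize(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(x - mean) / std for x in column]

# Hypothetical ages (illustrative, not taken from the dataset)
train_age = [18.0, 22.0, 26.0, 30.0, 34.0]
test_age = [20.0, 40.0]

mean, std = fit_standardizer(train_age)
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)  # reuses the training statistics

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The test values are not forced to zero mean or unit variance, and that is intended: computing the statistics from the test split would let information about the test data flow into preprocessing.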


We initialize and train three different machine learning models: Linear Regression, Random Forest, and XGBoost. Each model will learn the relationship between features and the target market value in its own way.

In [6]:
#initialize and train three different models: linear regression, random forest, and xgboost
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

xgb = XGBRegressor(n_estimators=100, random_state=42)
xgb.fit(X_train, y_train)


In [7]:
#generate predictions from each model on the testing dataset
y_pred_lr = lr.predict(X_test_scaled)
y_pred_rf = rf.predict(X_test)
y_pred_xgb = xgb.predict(X_test)


We define a reusable function that will calculate three important evaluation metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² Score. These metrics will help us objectively compare the models.

In [8]:
#define a function to calculate mae, rmse, and r² score for model evaluation
def evaluate_model(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"📊 {model_name} Performance")
    print(f"MAE: {mae:,.2f}")
    print(f"RMSE: {rmse:,.2f}")
    print(f"R² Score: {r2:.4f}")
    print("-" * 40)
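
To make the three metrics concrete, here is a small worked example that computes them from their definitions without scikit-learn. The function name `regression_metrics` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² directly from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute residual
    mae = sum(abs(r) for r in residuals) / n

    # RMSE: square root of the average squared residual
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # R²: 1 minus residual sum of squares over total sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy values chosen so the arithmetic is easy to follow by hand
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R²: {r2:.3f}")
```

For this toy input the residuals are 1, 0, and -2, giving an MAE of 1.0, an RMSE of about 1.291, and an R² of 0.375. Because RMSE squares the residuals, a few large misses (such as very expensive players) raise it much more than they raise MAE, which helps explain why RMSE sits well above MAE in the results that follow.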


In [9]:
#evaluate and compare the performance of all three models
evaluate_model(y_test, y_pred_lr, "Linear Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest")
evaluate_model(y_test, y_pred_xgb, "XGBoost")


📊 Linear Regression Performance
MAE: 1,547,023.14
RMSE: 3,703,430.58
R² Score: 0.6348
----------------------------------------
📊 Random Forest Performance
MAE: 621,122.36
RMSE: 2,608,521.26
R² Score: 0.8188
----------------------------------------
📊 XGBoost Performance
MAE: 627,643.68
RMSE: 2,871,394.42
R² Score: 0.7804
----------------------------------------


### **Analysis of These Results:**

- **Performance dropped compared to previous results** (especially for XGBoost), but **this is a good thing**.

- This confirms that the earlier extreme scores (R² ≈ 0.9996) were **too optimistic** because of **data leakage** from `value_log`.

- Now we have **realistic model behavior**:

  - Random Forest is the **best performing model** with R² ≈ **0.82**, which is **very respectable** for real-world financial data.

  - Linear Regression again struggles (R² ≈ 0.63), most likely because the relationship between the features and market value is **nonlinear**.

  - XGBoost remains competitive but trails Random Forest on R² (0.78 vs. 0.82) and RMSE, while its MAE is nearly identical (about 628K vs. 621K).
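
Since the gap between Random Forest and XGBoost is small, it may be worth checking whether it exceeds sampling noise in the test set. One possible approach, not used in the original notebook, is a paired bootstrap over test samples. The sketch below is a minimal version under that assumption; `bootstrap_mae_gap` is a hypothetical helper, and the data is synthetic. In the notebook, `y_test`, `y_pred_rf`, and `y_pred_xgb` could be passed in its place.

```python
import random

def bootstrap_mae_gap(y_true, pred_a, pred_b, n_resamples=2000, seed=42):
    """Approximate 95% percentile interval for MAE(pred_b) - MAE(pred_a)."""
    rng = random.Random(seed)
    n = len(y_true)
    # Per-sample difference in absolute error; its mean equals the MAE gap
    gaps = [abs(t - b) - abs(t - a) for t, a, b in zip(y_true, pred_a, pred_b)]
    means = []
    for _ in range(n_resamples):
        # Resample test rows with replacement, keeping the two models paired
        sample = [gaps[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper

# Synthetic stand-in data: model b is given slightly noisier predictions
rng = random.Random(0)
y_true = [rng.uniform(0.1, 20.0) for _ in range(300)]
pred_a = [t + rng.gauss(0.0, 1.0) for t in y_true]
pred_b = [t + rng.gauss(0.0, 1.2) for t in y_true]

lower, upper = bootstrap_mae_gap(y_true, pred_a, pred_b)
print(f"95% interval for MAE(b) - MAE(a): [{lower:.3f}, {upper:.3f}]")
```

If the interval contains zero, the ranking of the two models on MAE should be treated as tentative rather than settled.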



### 🎯 **Conclusion**:

> "By removing the leakage, we have made the models more trustworthy, realistic, and ready for generalization to unseen players."

