# Contents
1. Introduction <br>
    1.1. Import Libraries and Data<br>
    
    
2. Pipelines <br> 
    2.1. Cleaning Data Pipeline - Classes <br>
    2.2. Cleaning Data Pipeline - Application <br>
    2.3. Pre-processing Data and Model Pipeline <br>
    2.4. Model Prediction <br> 
    
    
3. Goal Correlation (xG vs Model) <br>
    3.1. Goal Correlation with xG <br>
    3.2. Goal Correlation with Model <br>
    3.3. Summary <br> 
 
 
4. XGB Regression Hyperparameter Optimization <br>
    4.1. Before Optimization <br>
    4.2. Optimization <br>
    4.3. After Optimization

# 1. Introduction 

In this kernel we build an XGBoost regressor (`XGBRegressor`) on a set of independent features drawn from our main kernel, with the aim of predicting goals scored better than xG does.

We will show that our model has a higher correlation with goals than xG alone, measured here by RMSE and R2.

## 1.1. Import Libraries and Data

In [1446]:
from IPython.core.interactiveshell import InteractiveShell
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, TimeSeriesSplit
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin
import xgboost as xgb
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')

In [1328]:
df=pd.read_csv('dfFinal.csv', encoding='latin1')

# 2. Pipelines

## 2.1. Cleaning Data Pipeline - Classes

In [1329]:
#Sort by date - useful when splitting df (time-series). 
class sort_date(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.sort_values(by=['date'])

#Drop columns. 
class drop_col(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.drop(['player', 'Unnamed: 0', '#', 'game_id', 'Cmp%', 'date', 'Opposition', 'team', 'Nation'], axis=1)

#Age column clean up. 
class age_dtype(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X['Age']=X['Age'].replace(np.nan, 0)
        X['Age']=(X['Age'].astype(str).str[:2]).astype(float)
        X['Age']=X['Age'].replace(0, np.nan)
        return X 

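#One-hot encode the position column separately, because a single cell can hold several comma-separated positions.
#Not ideal: the columns come from whatever data is passed in, so novel categories in new data change the output columns.
#Change this after the model is trained.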
class position_ohe(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X=pd.concat([X, X['Pos'].str.get_dummies(',')], axis=1)
        X.drop(['Pos'], axis=1, inplace=True)
        return X 

#Remove NaN's. 
class remove_nan(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X.dropna(inplace=True)
        return X 
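
The comment on `position_ohe` notes that novel categories in new data are a problem. A minimal sketch of one possible fix follows: learn the position labels in `fit` and align to them in `transform`. The class name `PositionEncoder`, its parameters, and the toy rows are hypothetical and not part of the original pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PositionEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical fit-aware variant of position_ohe (illustration only)."""

    def __init__(self, column="Pos", sep=","):
        self.column = column
        self.sep = sep

    def fit(self, X, y=None):
        # Remember the position labels seen in the training data.
        dummies = X[self.column].str.get_dummies(self.sep)
        self.categories_ = list(dummies.columns)
        return self

    def transform(self, X):
        dummies = X[self.column].str.get_dummies(self.sep)
        # Align to the training labels: unseen labels are dropped,
        # labels missing from this batch become all-zero columns.
        dummies = dummies.reindex(columns=self.categories_, fill_value=0)
        return pd.concat([X.drop(columns=[self.column]), dummies], axis=1)


# Made-up rows: "GK" never appears in the training frame.
train = pd.DataFrame({"Pos": ["FW", "FW,MF", "DF"], "Gls": [1, 0, 0]})
new = pd.DataFrame({"Pos": ["GK", "MF"], "Gls": [0, 1]})

encoder = PositionEncoder().fit(train)
encoded_new = encoder.transform(new)
print(list(encoded_new.columns))
```

With this design the model always receives the same columns, at the cost of silently ignoring positions it has never seen.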

## 2.2. Cleaning Data Pipeline - Application

In [1411]:
clean_pipe = Pipeline([
    ("sort_date", sort_date()),
    ("drop_columns", drop_col()),
    ("age_datatype", age_dtype()),
    ("encoding_position_column", position_ohe()),
    ("remove_nan", remove_nan())
])

cleaned_df = clean_pipe.fit_transform(df)
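
The order of the steps in `clean_pipe` matters: `sort_date` has to run before `drop_col`, because `drop_col` removes the `date` column. A small sketch of this on a toy frame, using `FunctionTransformer` in place of the custom classes (the values are made up):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy frame; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "date": ["2021-03-01", "2021-01-15", "2021-02-10"],
    "Gls": [2, 0, 1],
})

toy_pipe = Pipeline([
    # Sort first, while the 'date' column still exists...
    ("sort_date", FunctionTransformer(lambda X: X.sort_values(by=["date"]))),
    # ...then drop it. Swapping the two steps would raise a KeyError.
    ("drop_date", FunctionTransformer(lambda X: X.drop(["date"], axis=1))),
])

toy_clean = toy_pipe.fit_transform(toy)
print(toy_clean["Gls"].tolist())  # goals in chronological order
```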

## 2.3. Pre-processing Data and Model Pipeline

### Pre-Processing Pipeline

In [1440]:
standard_scaler = Pipeline(steps=[
        ("standard_scaler", StandardScaler())
])

robust_scaler = Pipeline(steps=[
        ("robust_scaler", RobustScaler())
])

categorical_transformer = Pipeline(steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("stand_scaler", standard_scaler, ["Touches", "Age"]),
        ("robust_scaler", robust_scaler, ["Att", "Cmp", "Mins"]),
        ("cat", categorical_transformer, ["Location"])
        ], remainder = 'passthrough'
)

lin = xgb.XGBRegressor(
        min_child_weight= 1,
        max_depth= 4,
        learning_rate= 0.15,
        gamma=3,
        colsample_bytree=0.75
)

xgb_model = Pipeline(steps= [
    ("pre", preprocessor),
    ("XGB Model", lin)
])
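
To see what the `ColumnTransformer` does with `remainder='passthrough'` and `handle_unknown="ignore"`, here is a reduced sketch on made-up rows. Only the column names are taken from the notebook. `sparse_threshold=0` is added purely so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; only the column names are taken from the notebook.
toy_train = pd.DataFrame({
    "Touches": [10.0, 20.0, 30.0],
    "Location": ["Home", "Away", "Home"],
    "xG": [0.1, 0.4, 0.2],
})
toy_new = pd.DataFrame({
    "Touches": [20.0],
    "Location": ["Neutral"],  # category never seen during fit
    "xG": [0.3],
})

toy_pre = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["Touches"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # force a dense array so the result is easy to print
)

toy_pre.fit(toy_train)
row = toy_pre.transform(toy_new)
# Column order: scaled Touches, one column per known Location, then xG.
print(row)
```

The unseen `Location` is encoded as all zeros rather than raising an error, and the untouched `xG` column is appended after the transformed ones.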

## 2.4. Model Prediction

In [1441]:
split_idx = int(.8 * len(cleaned_df)) #chronological 80/20 split - change when needed
train_df, test_df = cleaned_df.iloc[:split_idx], cleaned_df.iloc[split_idx:]

X_train = train_df.loc[:, train_df.columns != 'Gls']
X_test = test_df.loc[:, test_df.columns != 'Gls']

y_train = train_df.loc[:, train_df.columns == 'Gls']
y_test = test_df.loc[:, test_df.columns == 'Gls']

xgb_model_fit=xgb_model.fit(X_train, y_train)
xgb_model_prediction=xgb_model_fit.predict(X_test)

Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('stand_scaler',
                                                  Pipeline(steps=[('standard_scaler',
                                                                   StandardScaler())]),
                                                  ['Touches', 'Age']),
                                                 ('robust_scaler',
                                                  Pipeline(steps=[('robust_scaler',
                                                                   RobustScaler())]),
                                                  ['Att', 'Cmp', 'Mins']),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                

# 3. Goal Correlation (xG vs Model)

## 3.1. Goal Correlation with xG 

###  Time Series Cross Validation on Training Data 

In [1442]:
split_idx = int(.8 * len(cleaned_df))
train_df, test_df = cleaned_df.iloc[:split_idx], cleaned_df.iloc[split_idx:]

X_train = train_df.loc[:, train_df.columns == 'xG']
y_train = train_df.loc[:, train_df.columns == 'Gls']

X_test = test_df.loc[:, test_df.columns == 'xG']
y_test = test_df.loc[:, test_df.columns == 'Gls']

tscv = TimeSeriesSplit(n_splits=3)

rmse_lin_xG = np.sqrt(-cross_val_score(lin, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error'))
R2_lin_xG = cross_val_score(lin, X_train, y_train, cv=tscv, scoring='r2')

print(f"RMSE: {rmse_lin_xG.mean()} +/- {rmse_lin_xG.std()} - Training")
print(f"R2: {R2_lin_xG.mean()} +/- {R2_lin_xG.std()} - Training")

RMSE: 0.2606475401563377 +/- 0.009597410120866906 - Training
R2: 0.3679386630057572 +/- 0.0073866866216153 - Training
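
`TimeSeriesSplit` always validates on rows that come after the rows it trains on, which is why the data is sorted by date first. A short sketch of the folds that `n_splits=3` produces, using 8 placeholder rows instead of the real data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rows = np.arange(8)  # stand-in for 8 date-sorted rows
folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(rows)
]
for train_idx, val_idx in folds:
    print(f"train={train_idx} validate={val_idx}")
# train=[0, 1] validate=[2, 3]
# train=[0, 1, 2, 3] validate=[4, 5]
# train=[0, 1, 2, 3, 4, 5] validate=[6, 7]
```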


### Score on Test Data 

In [1443]:
lin_xG=lin.fit(X_train, y_train)
lin_xG_Test=lin_xG.score(X_test, y_test)
print(f"\nR2: {lin_xG_Test} - Test") 


R2: 0.3470078004517928 - Test
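
For a regressor, `.score` returns R2, and the RMSE above is the square root of the negated `neg_mean_squared_error`. Both can be computed by hand, as sketched here on made-up numbers:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0, 2.0])  # made-up goal counts
y_pred = np.array([0.1, 0.8, 0.2, 1.5])  # made-up predictions

residual_ss = np.sum((y_true - y_pred) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - residual_ss / total_ss  # the quantity .score() reports
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")
```

An R2 of 0 corresponds to always predicting the mean of `y_true`, so both scores in this kernel are improvements over that baseline.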


## 3.2. Goal Correlation with Model 

###  Time Series Cross Validation on Training Data 

In [1444]:
X_train = train_df.loc[:, train_df.columns != 'Gls']
X_test = test_df.loc[:, test_df.columns != 'Gls']

X_train = pd.DataFrame(preprocessor.fit_transform(X_train))
X_test = pd.DataFrame(preprocessor.transform(X_test))

rmse_lin_model = np.sqrt(-cross_val_score(lin, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error'))
R2_lin_model = cross_val_score(lin, X_train, y_train, cv=tscv, scoring='r2')

print(f"\nRMSE: {rmse_lin_model.mean()} +/- {rmse_lin_model.std()} - Training")
print(f"R2: {R2_lin_model.mean()} +/- {R2_lin_model.std()} - Training")


RMSE: 0.23361071084689464 +/- 0.003978111207638588 - Training
R2: 0.4910979242300389 +/- 0.025929636890849646 - Training
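
Note that `preprocessor.fit_transform` is applied to all of `X_train` before cross-validation here, so the scalers also see the validation rows of each fold. One possible alternative is to pass the whole pipeline to `cross_val_score`, so that preprocessing is re-fit per fold. A sketch on synthetic data, with `Ridge` as a lightweight stand-in for the XGB model (both are assumptions, not the notebook's setup):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # synthetic features
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# Ridge is only a lightweight stand-in for the notebook's XGBRegressor.
demo_model = Pipeline(steps=[
    ("pre", StandardScaler()),
    ("model", Ridge()),
])

# The scaler is re-fit inside each fold, on that fold's training rows only.
demo_scores = cross_val_score(
    demo_model, X_demo, y_demo, cv=TimeSeriesSplit(n_splits=3), scoring="r2"
)
print(f"R2: {demo_scores.mean():.3f} +/- {demo_scores.std():.3f}")
```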


### Score on Test Data 

In [1445]:
lin_model=lin.fit(X_train, y_train)
lin_model_Test=lin_model.score(X_test, y_test)
print(f"\nR2: {lin_model_Test} - Test") 


R2: 0.4613344812088226 - Test


## 3.3. Summary

### Training Data Comparison

In [1389]:
print("xG Prediction Scores:")
print(f"RMSE: {rmse_lin_xG.mean()} +/- {rmse_lin_xG.std()}")
print(f"R2: {R2_lin_xG.mean()} +/- {R2_lin_xG.std()}")

print("\nModel Prediction Scores:")
print(f"RMSE: {rmse_lin_model.mean()} +/- {rmse_lin_model.std()}")
print(f"R2: {R2_lin_model.mean()} +/- {R2_lin_model.std()}")

xG Prediction Scores:
RMSE: 0.2595489369583568 +/- 0.007691932737236629
R2: 0.3727887512807027 +/- 0.01631460160980845

Model Prediction Scores:
RMSE: 0.23301401532671442 +/- 0.004093134248596038
R2: 0.49369896271862374 +/- 0.025975128826630754


On the training folds, the model has a lower RMSE (0.233 vs 0.260) and a higher R2 (0.494 vs 0.373) than xG alone.

### Test Data Comparison

In [1394]:
print("xG Score:")
print(f"R2: {lin_xG_Test}") 

print("\nModel Score:")
print(f"R2: {lin_model_Test}") 

xG Score:
R2: 0.35162915399111705

Model Score:
R2: 0.4613344812088226


Our results from the training data carry over to the test data: the model reaches an R2 of 0.461 against 0.352 for xG alone, so it has a higher correlation with goals than xG.
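
The scores above are R2 values, which are related to correlation but not the same thing. R2 also penalizes predictions that are on the wrong scale. A small sketch on made-up numbers, where the predictions are perfectly correlated with the target but badly calibrated:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0, 2.0, 1.0])  # made-up goal counts
y_pred = 0.5 * y_true + 1.0  # perfectly correlated but badly calibrated

pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"Pearson r: {pearson_r:.3f}, R2: {r2:.3f}")
```

Here the Pearson correlation is 1 while R2 stays low, so a higher R2 is the stricter of the two claims.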

# 4. XGB Regression Hyperparameter Optimization

## 4.1. Before Optimization

In [1422]:
lin_basic = xgb.XGBRegressor()
lin_basic_fit=lin_basic.fit(X_train, y_train)

print("Test Data: ")
print(f"R2: {lin_basic_fit.score(X_test, y_test)}")

Test Data: 
R2: 0.40177871349744954


## 4.2. Optimization

XGB hyperparameters to be optimized.

In [1427]:
params = {
    'learning_rate' : [.05,.1,.15, .2, .25, .3],
    'max_depth' : range(3, 10, 1),
    'min_child_weight' : [1, 3, 5],
    'gamma' : [0, 1, 2, 3],
    'colsample_bytree' : [0.5, 0.7, 0.9]
}
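
To get a feel for the size of this search space, the grid can be counted with `ParameterGrid` and sampled with `ParameterSampler`. This is a sketch only. The `random_state` is an added assumption for reproducibility, and `range` is written out as a list.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {
    "learning_rate": [.05, .1, .15, .2, .25, .3],
    "max_depth": list(range(3, 10, 1)),
    "min_child_weight": [1, 3, 5],
    "gamma": [0, 1, 2, 3],
    "colsample_bytree": [0.5, 0.7, 0.9],
}

n_combinations = len(ParameterGrid(params))
print(n_combinations)  # 6 * 7 * 3 * 4 * 3 = 1512

# Ten random draws, the same budget as n_iter=10 in a randomized search.
candidates = list(ParameterSampler(params, n_iter=10, random_state=0))
print(candidates[0])
```

Ten draws cover well under 1% of the 1512 combinations, so a larger `n_iter` may find better settings.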

Find the best parameters via randomized search (`RandomizedSearchCV`), which samples `n_iter` combinations from the grid above.

In [None]:
tscv = TimeSeriesSplit(n_splits=3)

lin_params=RandomizedSearchCV(lin_basic, param_distributions=params, n_iter=10, scoring='r2', cv=tscv, verbose=True)
lin_opt_fit=lin_params.fit(X_train, y_train) 
lin_opt_model=lin_opt_fit.best_estimator_ #model with best parameters

Best parameters found by the randomized search:

In [1438]:
lin_opt_fit.best_params_ 

{'min_child_weight': 1,
 'max_depth': 4,
 'learning_rate': 0.15,
 'gamma': 3,
 'colsample_bytree': 0.5}
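
For reference, a self-contained sketch of a randomized search with time-series folds. `RandomizedSearchCV` takes its search space through `param_distributions`. The data is synthetic and `GradientBoostingRegressor` stands in for `xgb.XGBRegressor`, so the grid and the scores are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))  # synthetic features
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=120)  # synthetic target

# GradientBoostingRegressor stands in for xgb.XGBRegressor; its parameter
# names differ, so this grid is illustrative rather than the notebook's.
demo_params = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_distributions=demo_params,  # note: not param_grid
    n_iter=5,
    scoring="r2",
    cv=TimeSeriesSplit(n_splits=3),
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(f"Mean CV R2 of best candidate: {search.best_score_:.3f}")
```

With the default `refit=True`, `best_estimator_` is already fitted on the full training data once the search finishes.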

## 4.3. After Optimization

In [1451]:
lin_opt_model_fit=lin_opt_model.fit(X_train, y_train)

print("Default XGB Regressor on Test Data: ")
print(f"R2: {lin_basic_fit.score(X_test, y_test)}")

print("\nOptimized XGB Regressor on Test Data: ")
print(f"R2: {lin_opt_model_fit.score(X_test, y_test)}")

Default XGB Regressor on Test Data: 
R2: 0.40177871349744954

Optimized XGB Regressor on Test Data: 
R2: 0.4531630601803899


Hyperparameter optimization gives a notable improvement: test R2 rises from 0.402 to 0.453.