### Get all of our data for this project
We obtained data from Transfermarkt, Understat, and Sofascore. All three datasets are available on Kaggle; we scraped the Understat and Sofascore data ourselves and uploaded it there.

- [transfermarkt](https://www.kaggle.com/datasets/davidcariboo/player-scores)
- [understat](https://www.kaggle.com/datasets/codytipton/player-stats-per-game-understat)
- [sofascore](https://www.kaggle.com/datasets/rafaelmiksianmagaldi/sofascore-data)

The function below downloads all of the Kaggle data, merges it, and splits it into train and test sets. Once the download finishes, you can also access each dataset in its raw form.

In [1]:
import pandas as pd
import get_data.get_all_data as gad 
gad.get_data_merge_split()

Start Kaggle data downloads
Dataset URL: https://www.kaggle.com/datasets/davidcariboo/player-scores
Dataset URL: https://www.kaggle.com/datasets/codytipton/player-stats-per-game-understat
Dataset URL: https://www.kaggle.com/datasets/rafaelmiksianmagaldi/sofascore-data
Done with kaggle data
 Start Merges
Start Merging of the Data
Done Merging Transfermarkt and Sofascore
Done merging Transfermarkt and Understat
Done with Merging Transfermarkt x sofascore and Transfermarkt x understat
Done with merging of the data
Start making database
Create train test splits


### Train and Test Data
We have two sets of train and test data: one with a cutoff on minutes played (the cutoff_1000 files) and one without.

In [2]:
# Get the data
train = pd.read_csv('data/main_data/train/train.csv')
train_cutoff = pd.read_csv('data/main_data/train/train_cutoff_1000.csv')

test = pd.read_csv('data/main_data/test/test.csv')
test_cutoff = pd.read_csv('data/main_data/test/test_cutoff_1000.csv')
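
As a rough illustration of what a minutes-played cutoff could look like, here is a minimal sketch on a toy frame. The column name `minutes`, the inclusive threshold, and the player rows are all assumptions made for the example; the actual cutoff files may have been built differently.

```python
import pandas as pd

# Toy frame; "minutes" is a hypothetical column name for total minutes played.
players = pd.DataFrame({
    "name": ["player a", "player b", "player c"],
    "minutes": [450, 1000, 2700],
})

# Assumed threshold, taken from the "cutoff_1000" suffix in the file names.
MINUTES_CUTOFF = 1000

# Keep only players who reach the cutoff (inclusive here, which is an assumption).
players_cutoff = players[players["minutes"] >= MINUTES_CUTOFF].reset_index(drop=True)
print(players_cutoff)
```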


In [3]:
#Example of our data
train.head()

| Unnamed: 0 | name | dob | pos | height | foot | date | market_value | adjusted_market_value | team | league | ... | red_card | rating | accuratePass | accurateLongBalls | accurateCross | accurateKeeperSweeper | expectedAssists | expectedGoals | xGChain | xGBuildup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | maximilian wittek | 1995-08-20 | M | 173.0 | left | 2024-10-19 | 2000000.0 | 2000000 | VfL Bochum 1848 | Bundesliga | ... | 0.0 | 7.07479 | 26.529412 | 1.966387 | 1.756303 | 0.0 | 0.038472 | 0.014907 | 0.028277 | 0.014541 |
| 1 | jovan milosevic | 2005-07-30 | F | 190.0 | right | 2024-01-27 | 500000.0 | 500000 | VfB Stuttgart | Bundesliga | ... | 0.0 | 3.94 | 1.4 | 0.0 | 0.0 | 0.0 | 0.001434 | 0.0 | 0.0 | 0.0 |
| 2 | tika de jonge | 2003-03-10 | M | 173.0 | right | 2024-10-20 | 500000.0 | 500000 | FC Groningen | Eredivisie | ... | 0.0 | 7.12 | 42.4 | 3.2 | 0.2 | 0.0 | 0.056765 | 0.03602 | 0.0 | 0.0 |
| 3 | cas odenthal | 2000-09-25 | D | 190.0 | right | 2022-04-03 | 650000.0 | 755054 | NEC Nijmegen | Eredivisie | ... | 0.0 | 6.87 | 41.366667 | 2.866667 | 0.066667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | miguel baeza | 2000-03-26 | M | 177.0 | left | 2024-09-29 | 400000.0 | 400000 | CD Nacional | Liga Portugal Betclic | ... | 0.0 | 6.356 | 10.38 | 0.54 | 0.38 | 0.0 | 0.013463 | 0.048262 | 0.069041 | 0.017071 |


### Our Generalized model class
To make hyperparameter tuning and model building low-effort, we wrote a generalized regression class that takes the model type and any parameters that model needs.

The ensamble_model file provides several classes:
- general_Regression
    - Builds any of the supported models on our data using all of the features.
- G_Pos, D_Pos, M_Pos, F_Pos
    - Subclasses of general_Regression that fit a model only on players in the corresponding position.
- ensamble_model
    - An ensemble of G_Pos, D_Pos, M_Pos, and F_Pos, with parameters that can be set separately for each position. It supports both fitting and prediction.
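
The idea of routing each player to a model for their position can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in the ensamble_model file: `MeanModel` and `PositionEnsemble` are hypothetical names, the stand-in model simply predicts the training mean, and the position codes G, D, M, F are assumed to match the `pos` column.

```python
from statistics import mean


class MeanModel:
    """Hypothetical stand-in for a per-position regressor: predicts the training mean."""

    def fit(self, targets):
        self.value_ = mean(targets)
        return self

    def predict_one(self, row):
        return self.value_


class PositionEnsemble:
    """Keeps one model per position and routes each row to the matching model."""

    def __init__(self, positions=("G", "D", "M", "F")):
        self.models = {pos: MeanModel() for pos in positions}

    def fit(self, rows):
        # Fit each position's model only on the rows for that position.
        for pos, model in self.models.items():
            targets = [row["market_value"] for row in rows if row["pos"] == pos]
            model.fit(targets)
        return self

    def predict(self, rows):
        # Dispatch every row to the model of its own position.
        return [self.models[row["pos"]].predict_one(row) for row in rows]


# Toy rows, invented for the example.
train_rows = [
    {"pos": "G", "market_value": 1.0},
    {"pos": "G", "market_value": 3.0},
    {"pos": "D", "market_value": 4.0},
    {"pos": "M", "market_value": 6.0},
    {"pos": "F", "market_value": 10.0},
    {"pos": "F", "market_value": 20.0},
]
ensemble = PositionEnsemble().fit(train_rows)
preds = ensemble.predict([{"pos": "G"}, {"pos": "F"}])
print(preds)
```

Keeping the models in a dictionary keyed by position means each position can use a different model type and parameters without changing the fit and predict logic.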

Finally, there are hyperparameter tuning classes that take these models and run a randomized search over the parameters to find good settings. These classes also take a beta parameter that adds a penalty for overfitting during tuning. Specifically, the tuning score is
    $$ Score = Score_{Test}  + \beta |Score_{Test} - Score_{Train}| $$

The gap between the test and train scores grows when a model overfits, so a larger beta pushes the search away from such models.
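
The penalty can be written as a one-line function. The sketch below assumes the score is an error metric where lower is better; `penalized_score` and the example numbers are hypothetical and only illustrate the formula above.

```python
def penalized_score(score_train: float, score_test: float, beta: float = 1.0) -> float:
    """Return Score_Test + beta * |Score_Test - Score_Train|."""
    return score_test + beta * abs(score_test - score_train)


# Two made-up candidates with the same test error but different train errors.
well_balanced = penalized_score(score_train=0.95, score_test=1.00, beta=1.0)
overfit = penalized_score(score_train=0.50, score_test=1.00, beta=1.0)

# The candidate with the larger train/test gap receives the worse (higher) score.
print(well_balanced, overfit)
```

With beta set to 0 the penalty vanishes and the tuning score reduces to the plain test score.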

In [4]:
import models.main_dataset.ensamble_model as em



In [5]:
# Example of a general_Regression model
# This gives us a linear regression model
ex_LR = em.general_Regression(train, type='LR')

# This gives us a linear regression with L2 regularization and a regularization factor of 4
ex_RIDGE = em.general_Regression(train, type='RIDGE', alpha=4)

# This gives us a random forest regression model with the given parameters
ex_RFR = em.general_Regression(train, type='RFR', scale='log', max_depth=4, n_estimators=20, min_sample_leaf=2, bootstrap=True)

# This gives us a gradient boosting regression model with the given parameters
ex_GBR = em.general_Regression(train, type='GBR', scale='log', max_depth=4, n_estimators=20, min_sample_leaf=2, bootstrap=True)

In [6]:
# In each of these, you can perform a cross-validation
ex_LR.perform_CV()

MSE for train: mean: 54376834367966.21 std: 1221604229958.603
MSE for test:  mean: 57432157959300.96  std: 11366455370460.338

RMSE for train: mean: 7373601.336737566 std: 82690.35510984856
RMSE for test: mean: 7537878.749909605 std: 782650.5675984453

R^2 for train: mean: 0.41485441920051275 std: 0.004654714636380806
R^2 for test: mean: 0.3822597123751495 std: 0.0462619185995481

MAE for train: mean: 3719634.1434434974 std: 34014.544478594355
MAE for test: mean: 3766498.001722875 std: 181933.56922456858

MAPE for train: mean: 4.139587126374417 std: 0.05677865191594086
MAPE for test: mean: 4.170831255759044 std: 0.3607062701507288
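
For readers unfamiliar with the reported metrics, the sketch below computes them with their conventional definitions on a toy example. It is an assumption that perform_CV uses these same definitions; `regression_metrics` is a hypothetical helper and the numbers are invented. Under these definitions MAPE is a fraction, so 0.06 would mean 6%.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE (as a fraction), and R^2 for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides each absolute error by the true value (true values must be nonzero).
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    total_sum_of_squares = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - squared_error_sum / total_sum_of_squares
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "R^2": r2}


metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```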



In [7]:
ex_RIDGE.perform_CV()

MSE for train: mean: 54507743089940.49 std: 1220501912464.027
MSE for test:  mean: 57502096779659.27  std: 11424550588100.363

RMSE for train: mean: 7382474.783125895 std: 82517.67356602143
RMSE for test: mean: 7542120.34204383 std: 786458.8519356432

R^2 for train: mean: 0.41344462495224 std: 0.004681733710239646
R^2 for test: mean: 0.38156671274494314 std: 0.04724482372464625

MAE for train: mean: 3715501.0927937226 std: 33766.96771544349
MAE for test: mean: 3760391.087914384 std: 181415.2433834562

MAPE for train: mean: 4.140033624748517 std: 0.05671594459387419
MAPE for test: mean: 4.170232494477886 std: 0.362505515947886



In [8]:
ex_RFR.perform_CV()

MSE for train: mean: 1.2448488409340344 std: 0.02488143613043517
MSE for test:  mean: 1.2710089897318997  std: 0.06527437250164449

RMSE for train: mean: 1.1156724379004952 std: 0.011128892271941427
RMSE for test: mean: 1.1270307054866382 std: 0.028474174653358546

R^2 for train: mean: 0.36307950859921334 std: 0.012798513391890573
R^2 for test: mean: 0.3494440497312323 std: 0.01350034066792082

MAE for train: mean: 0.8845335183106119 std: 0.00875528347034923
MAE for test: mean: 0.8934218445317335 std: 0.024553343993494785

MAPE for train: mean: 0.0629103073147064 std: 0.0006231763881676812
MAPE for test: mean: 0.06353471322828155 std: 0.0016758981716299688



In [13]:
ex_GBR.perform_CV()

MSE for train: mean: 0.5283064512626419 std: 0.008571365662415679
MSE for test:  mean: 0.782212511152801  std: 0.054260167671448194

RMSE for train: mean: 0.7268230538443737 std: 0.005890641983721369
RMSE for test: mean: 0.8839069210075863 std: 0.030349071611655808

R^2 for train: mean: 0.7297053670628653 std: 0.0037183401775102917
R^2 for test: mean: 0.5989917277083427 std: 0.030790894628957625

MAE for train: mean: 0.5621622660689025 std: 0.005593141957076557
MAE for test: mean: 0.6806895551772036 std: 0.01935101198233509

MAPE for train: mean: 0.04037138692408798 std: 0.00039818309746465964
MAPE for test: mean: 0.048711960195761815 std: 0.001417104442500487



In [9]:
# One can perform hyperparameter tuning on this generalized regression class. You can specify type="something"
# and it will only use parameters for that type of model, otherwise, it will randomly go through different models 
# and their corresponding parameters.

ex_hp = em.hyperparameter_tuning_general(train,n_iter=10,cv=3,model=em.general_Regression,scale='log',beta=1,type=None)

# Perform the tuning
ex_hp.perform_tuning()

# Outputs the best parameters it found and the score.
print(ex_hp.best_params)
print(ex_hp.best_score)

# The best model found. It is a general_Regression instance if model=em.general_Regression;
# otherwise it is an instance of whichever general_Regression subclass was passed as model.
ex_hp.best_model

{'model': 'LR', 'param': {}}
1.0245811562932952


<models.main_dataset.ensamble_model.general_Regression at 0x7102d256d9a0>
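
To make the search procedure concrete, here is a minimal sketch of a randomized search that first picks a model type and then samples that model's parameters, returning configurations in the same `{'model': ..., 'param': ...}` shape as the output above. The search space, `sample_config`, `random_search`, and `toy_evaluate` are all hypothetical; the real tuning class may sample and score differently.

```python
import random

# Hypothetical search space: model type -> candidate values for each parameter.
SEARCH_SPACE = {
    "LR": {},
    "RIDGE": {"alpha": [0.1, 1.0, 4.0, 10.0]},
    "RFR": {"max_depth": [3, 4, 5], "n_estimators": [10, 20, 50]},
}


def sample_config(rng, model_type=None):
    """Pick a model type (unless one is fixed) and one value per parameter."""
    model = model_type if model_type is not None else rng.choice(sorted(SEARCH_SPACE))
    params = {name: rng.choice(values) for name, values in SEARCH_SPACE[model].items()}
    return {"model": model, "param": params}


def random_search(evaluate, n_iter=10, beta=1.0, model_type=None, seed=0):
    """Keep the configuration with the lowest overfitting-penalized score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_iter):
        config = sample_config(rng, model_type)
        train_score, test_score = evaluate(config)
        score = test_score + beta * abs(test_score - train_score)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Toy stand-in for cross-validation; returns made-up (train, test) errors."""
    if config["model"] == "LR":
        return 1.00, 1.02
    return 0.60, 1.00


best_config, best_score = random_search(toy_evaluate, n_iter=10, beta=1.0)
print(best_config, best_score)
```

In this toy setup the linear model has the slightly higher test error, yet it wins whenever it is sampled because its train and test errors are close.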

In [None]:
# Build models for each position

# Goalkeeper position model
g_pos = em.G_Pos(train, type='LR', scale='log')

# Defender position model
d_pos = em.D_Pos(train, type='RIDGE', scale='log', alpha=4)

# Midfielder position model
m_pos = em.M_Pos(train, type='RFR', scale='log', max_depth=4, n_estimators=20, min_sample_leaf=2, bootstrap=True)

# Forward position model
f_pos = em.F_Pos(train, type='GBR', scale='log', max_depth=4, n_estimators=20, min_sample_leaf=2, bootstrap=True)

# Since these classes inherit from general_Regression, they have the same methods

In [11]:
# The ensemble model combines one model per position, like the ones above

en_model = em.ensamble_model(scale='log')

# Set the parameters for each position
en_model.G_parameters(type='LR')
en_model.D_parameters(type='RIDGE', alpha=4)
en_model.M_parameters(type='RFR', max_depth=4, n_estimators=20, min_sample_leaf=2, bootstrap=True)
en_model.F_parameters(type='GBR', max_depth=4, n_estimators=20, min_sample_leaf=2, bootstrap=True)

# Perform cross-validation
en_model.perform_CV(train, n_splits=5)

# Fit the model
en_model.fit(train)

# Make predictions on the transformed scale (not scaled back)
predictions = en_model.predict(test)

# Make predictions scaled back to the original scale (before the ln(1+x) transform)
predictions_scaled_back = en_model.predict_scaled(test)


MSE for train: mean: 0.8515214668419132 std: 0.012996531669017411
MSE for test:  mean: 1.0980076650651118  std: 0.03560443116746653

RMSE for train: mean: 0.9227523605542437 std: 0.007039029299860132
RMSE for test: mean: 1.0477215352064129 std: 0.016948443286301674

R^2 for train: mean: 0.5643353543880675 std: 0.005804220110990079
R^2 for test: mean: 0.43820396111581 std: 0.011712225418961034

MAE for train: mean: 0.7037536273794515 std: 0.005842218384848044
MAE for test: mean: 0.820043095535732 std: 0.010444179726335347

MAPE for train: mean: 0.05045368600354032 std: 0.00040881041433830256
MAPE for test: mean: 0.058654417427999414 std: 0.0008553844776853359
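
The difference between the two prediction methods comes down to the ln(1+x) transform mentioned in the comment above. The sketch below shows that transform and its inverse using only the standard library. The helper names `to_log_scale` and `from_log_scale` are hypothetical, and how the model applies the transform internally is an assumption.

```python
import math


def to_log_scale(value: float) -> float:
    """Apply the ln(1 + x) transform."""
    return math.log1p(value)


def from_log_scale(value: float) -> float:
    """Invert ln(1 + x), returning a value on the original scale."""
    return math.expm1(value)


market_value = 2_000_000.0
scaled = to_log_scale(market_value)
restored = from_log_scale(scaled)
print(scaled, restored)
```

Errors on the transformed scale are on the order of 1, which presumably explains why the metrics for models built with scale='log' are so much smaller than those for the models trained without it.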



In [12]:
# Hyperparameter tuning for the ensemble model uses a dedicated class. Note that beta is the penalty constant for overfitting.

en_hp = em.hyperparameter_tuning(train,n_iter=10,cv=3,scale='log',beta=1)

# Perform the tuning
en_hp.perform_tuning()

# Outputs the best parameters it found and the score
print(en_hp.best_params)
print(en_hp.best_score)

# The best model found. It is an ensamble_model instance and has all the usual methods of that class.
en_hp.best_model


{'G': {'model': 'LR', 'param': {}}, 'D': {'model': 'RFR', 'param': {'max_depth': 3, 'n_estimators': 10, 'max_features': 'sqrt', 'min_samples_split': 10, 'min_samples_leaf': 8, 'bootstrap': False}}, 'M': {'model': 'RIDGE', 'param': {'alpha': np.float64(3.9897959183673466)}}, 'F': {'model': 'LR', 'param': {}}}
1.0593461921099772


<models.main_dataset.ensamble_model.ensamble_model at 0x710302232000>