# Predict players' overall scores using features

This notebook uses linear regression, lasso regression, cross-validation, and AIC/BIC to select the model that most accurately predicts FIFA players' overall scores from various features.


## Load packages & data

In [0]:
import pandas as pd
import numpy as np
# plotting
import seaborn as sns
# standardization
from sklearn.preprocessing import StandardScaler
# modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from statsmodels.tools.eval_measures import aicc

In [3]:
fifa = pd.read_csv("https://raw.githubusercontent.com/vanessaaleung/rawdata/master/FIFA19data.csv", encoding = "ISO-8859-1")
fifa.head()

Unnamed: 0,ID,Name,Age,Nationality,Overall,Potential,Club,Value,Wage,International Reputation,Weak Foot,Skill Moves,Work Rate,Body Type,Position,Contract Valid Until,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
0,158023,L. Messi,31,Argentina,94,94,FC Barcelona,110.5M,565K,5.0,4.0,4.0,Medium/ Medium,Messi,RF,2021,84.0,95.0,70.0,90.0,86.0,97.0,93.0,94.0,87.0,96.0,91.0,86.0,91.0,95.0,95.0,85.0,68.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0
1,20801,Cristiano Ronaldo,33,Portugal,94,94,Juventus,77M,405K,5.0,4.0,5.0,High/ Low,C. Ronaldo,ST,2022,84.0,94.0,89.0,81.0,87.0,88.0,81.0,76.0,77.0,94.0,89.0,91.0,87.0,96.0,70.0,95.0,95.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0
2,190871,Neymar Jr,26,Brazil,92,93,Paris Saint-Germain,118.5M,290K,5.0,5.0,5.0,High/ Medium,Neymar,LW,2022,79.0,87.0,62.0,84.0,84.0,96.0,88.0,87.0,78.0,95.0,94.0,90.0,96.0,94.0,84.0,80.0,61.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0
3,193080,De Gea,27,Spain,91,93,Manchester United,72M,260K,4.0,3.0,1.0,Medium/ Medium,Lean,GK,2020,17.0,13.0,21.0,50.0,13.0,18.0,21.0,19.0,51.0,42.0,57.0,58.0,60.0,90.0,43.0,31.0,67.0,43.0,64.0,12.0,38.0,30.0,12.0,68.0,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0
4,192985,K. De Bruyne,27,Belgium,91,92,Manchester City,102M,355K,4.0,5.0,4.0,High/ High,Normal,RCM,2023,93.0,82.0,55.0,92.0,82.0,86.0,85.0,83.0,91.0,91.0,78.0,76.0,79.0,91.0,77.0,91.0,63.0,90.0,75.0,91.0,76.0,61.0,87.0,94.0,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0


## Data Wrangling

In [0]:
# drop unnecessary columns
fifa = fifa.drop(columns = ["ID", "Name", "Nationality", "Club", "Value", "Wage", "Body Type", "Potential"])

In [0]:
# fill missing values with each column's most frequent value (mode)
for col in fifa.columns:
    fifa[col] = fifa[col].fillna(fifa[col].mode()[0])

In [0]:
# convert categorical columns into dummies
factors = ['International Reputation', 'Weak Foot', 'Skill Moves', 'Work Rate', 'Position', 'Contract Valid Until']

for var in factors:
    # one indicator column per category, named "<var>_<category>"
    dummies = pd.get_dummies(fifa[var], prefix=var)
    fifa = pd.concat([fifa, dummies], axis=1)
    # drop the original categorical column once it has been encoded
    fifa = fifa.drop(columns=var)
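
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` produces on a tiny made-up frame. The three rows, their values, and the names `toy` and `toy_encoded` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up).
toy = pd.DataFrame({"Overall": [94, 91, 88],
                    "Position": ["ST", "GK", "ST"]})

# One indicator column per category, named "<prefix>_<category>".
dummies = pd.get_dummies(toy["Position"], prefix="Position")

# Append the indicators and drop the original categorical column.
toy_encoded = pd.concat([toy, dummies], axis=1).drop(columns="Position")

print(list(toy_encoded.columns))
```

The single `Position` column becomes `Position_GK` and `Position_ST`, which is why the feature count grows well beyond the number of original columns after encoding.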

## Simple Linear Regression
1. Fit a simple linear regression model to predict the overall score of a player and test your model against the test set. Calculate the R^2 for the predictions you made on the test set. How many features are used in this model?
  ```
  The R^2 is 0.8905 with 122 features being used.
  ```
2. Using the same training and test sets, fit a simple regression model but with 5-fold cross validation and predict the overall scores of players in the test set. Calculate R^2 for the predictions and compare with the R^2 from question 1. 
  ```
  The R^2 is 0.8955, slightly higher than the R^2 from question 1.
  ```
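
As a reminder of what the R^2 values in these answers measure, the sketch below computes it by hand as 1 - SS_res / SS_tot, the same definition `r2_score` uses for a single target. The helper `r_squared` and the four data points are hypothetical and only serve to illustrate the formula.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2_example = r_squared(y_true, y_pred)
print(r2_example)
```

Here SS_res is 0.75 and SS_tot is 20, so R^2 is about 0.9625: the predictions explain roughly 96% of the variance around the mean.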



In [0]:
# initialize X (features) and Y (target) for modelling
X = fifa.drop(columns='Overall')
Y = fifa['Overall']

In [0]:
# split into training/testing data (test_size=0.9 holds out 90% of the rows for testing)
X_train,X_test,y_train,y_test=train_test_split(X,Y, test_size=0.9, random_state=31)

In [26]:
# Calculate the R^2 of simple linear regression model
lm = LinearRegression()
lm.fit(X_train, y_train)
lm_predictions = lm.predict(X_test)
lm_r2 = r2_score(y_test, lm_predictions)
print(lm_r2)

0.8904970737556854


In [27]:
# number of features used
len(X_train.columns)

122

In [11]:
# with 5-fold cross validation (note: the folds are drawn from the test set itself)
cv_predictions = cross_val_predict(lm, X_test, y_test, cv=5)
cv_r2 = r2_score(y_test, cv_predictions)
# mean of the five per-fold R^2 scores
cv_scores = cross_val_score(lm, X_test, y_test, cv=5)
print(f"{cv_scores.mean():.4f}")
print(cv_r2)

0.8955
0.895506109826533
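
The cell above cross-validates on the held-out rows themselves. One common alternative, sketched here on synthetic data rather than the FIFA set, is to cross-validate on the training split and keep the test split for a single final check. The data from `make_regression` and every name with the `demo_` prefix are assumptions for illustration; this is not the procedure the notebook ran.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the FIFA dataset).
demo_X, demo_y = make_regression(n_samples=300, n_features=10,
                                 noise=5.0, random_state=0)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0)

demo_lm = LinearRegression()

# Five R^2 scores, one per fold, computed on the training split only.
demo_cv_scores = cross_val_score(demo_lm, demo_X_train, demo_y_train, cv=5)
print(demo_cv_scores.mean())

# Final check: fit on the full training split, score once on the untouched test split.
demo_test_r2 = demo_lm.fit(demo_X_train, demo_y_train).score(demo_X_test, demo_y_test)
print(demo_test_r2)
```

With this layout the cross-validated score estimates generalization without touching the test rows, so the final test R^2 remains an independent check.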


## Lasso Regression
3. Fit a lasso regression to predict the overall scores of players in the test set. Use the default value of alpha. Calculate the R^2 for the predictions you made on the test set and compare with the R^2 from question 1. 
  ```
  The R^2 is 0.7409 with only 4 features retained, lower than the 0.8905 of the simple linear regression model.
  ```
4. Do you expect your answer to question 3 to change if you use ridge or log penalties instead of the lasso penalty?
    ```
    Yes. 
    Ridge regression penalizes the sum of squared coefficients (L2 penalty), whereas lasso penalizes the sum of their absolute values (L1 penalty). 
    As a result, for high values of λ, many coefficients are set exactly to zero under lasso, which is never the case in ridge regression.
    A logarithmic penalty results in a discontinuous thresholding rule, whereas lasso results in a continuous thresholding rule
    (i.e., it shrinks coefficients toward zero in a continuous way).
    ```



5. Fit a lasso regression to predict the overall scores of players with an ideal value for alpha. Your code should try to test different values of alpha and use the ideal one. What, according to your code, is the ideal value of alpha? How many features are being used by the model? Calculate the R^2 for the predictions you made on the test set and compare with the R^2 from question 1. 
  ```
  The ideal value of alpha is 0.01, and 83 features are being used by the model. The R^2 is 0.8911, slightly higher than the R^2 from question 1.
  ```
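
The contrast drawn in question 4 between L1 and L2 penalties can be checked with a small experiment. This is a sketch on synthetic data, assuming 20 features of which only 5 are informative; the `sim_` names and the choice of `alpha=1.0` for both models are illustrative, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually drive the target.
sim_X, sim_y = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=1.0, random_state=0)

sim_lasso = Lasso(alpha=1.0).fit(sim_X, sim_y)
sim_ridge = Ridge(alpha=1.0).fit(sim_X, sim_y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(sim_lasso.coef_ == 0))
n_zero_ridge = int(np.sum(sim_ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```

Lasso drops uninformative features by setting their coefficients exactly to zero, while ridge only shrinks them, which is why counting `coef_ != 0` is a meaningful feature count for lasso but not for ridge.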



In [0]:
# standardize features (the scaler is fit on the training data only)
sc_x = StandardScaler() 
X_train = sc_x.fit_transform(X_train)  
X_test = sc_x.transform(X_test)

In [0]:
# fitting lasso model
lasso = Lasso()
lasso.fit(X_train,y_train)
lasso_predictions = lasso.predict(X_test)

In [14]:
# number of features used
coeff_used = np.sum(lasso.coef_!=0)
print(coeff_used)


4


In [15]:
# calculate the R^2
lasso_r2 = r2_score(y_test, lasso_predictions)
print(lasso_r2)

0.7408777940191329


In [16]:
# try out different alpha values
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]}
lasso_regressor = GridSearchCV(lasso, parameters, cv = 5)
lasso_regressor.fit(X_train, y_train)


GridSearchCV(cv=5, error_score=nan,
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=None,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': [1e-15, 1e-10, 1e-08, 0.0001, 0.001, 0.01, 1,
                                   5, 10, 20]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [17]:
# best alpha
lasso_regressor.best_params_

{'alpha': 0.01}

In [18]:
# number of features used
coeff_used = np.sum(lasso_regressor.best_estimator_.coef_!=0)
print(coeff_used)

83


In [19]:
# calculate r^2
lasso2_predictions = lasso_regressor.predict(X_test)
lasso_r2 = r2_score(y_test, lasso2_predictions)
print(lasso_r2)

0.8911481681827964
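
A variant worth considering is to put the scaler and the lasso into one pipeline, so that each cross-validation fold fits its own scaler instead of reusing one fit on all training rows. The sketch below uses synthetic data, a shorter alpha grid without the extremely small values, and a raised `max_iter`; all of these, along with the `pipe_` names, are assumptions for illustration rather than the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the FIFA dataset).
pipe_X, pipe_y = make_regression(n_samples=300, n_features=15, n_informative=5,
                                 noise=10.0, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))

# make_pipeline names each step after its lowercased class, hence "lasso__alpha".
alpha_grid = {"lasso__alpha": [1e-3, 1e-2, 1e-1, 1, 10]}
pipe_search = GridSearchCV(pipe, alpha_grid, cv=5)
pipe_search.fit(pipe_X, pipe_y)

best_alpha = pipe_search.best_params_["lasso__alpha"]
# Number of features the selected model keeps (non-zero coefficients).
n_used = int((pipe_search.best_estimator_.named_steps["lasso"].coef_ != 0).sum())
print(best_alpha, n_used)
```

Because `refit=True` by default, `pipe_search.predict` applies the scaler and the best lasso in one call, so raw (unscaled) test features can be passed in directly.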


## AIC & BIC
6. Calculate AIC and BIC for the models you built in question 1 and question 5. According to each of the measures, which is the better model? Is BIC always greater than AIC? Please explain. Compare the AICs with the corresponding corrected AICs.
  ```
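  * The lower the AIC and BIC, the better the model. Both measures are lower for the lasso model in question 5, so it is the better model.
  * BIC is greater than both AIC and AICc for these two models. This is not always the case: BIC penalizes each parameter by ln(n) and AIC by 2, so BIC exceeds AIC only when ln(n) > 2, i.e. when n is at least 8.
  * AICc is used to address AIC's tendency to overfit on small samples. 
  Since n/df is large for both models here, their corrected AICs (AICc) are close to their AICs. 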
  ```
7. ICs are alternatives to CVs. Do you trust them equally? Please explain.
  ```
  No. AIC and BIC are only estimates. AIC approximates the likelihood penalized by the degrees of freedom, and BIC approximates the posterior probability that the model is true. 
  AIC assumes that more parameters lead to a higher risk of overfitting. 
  Cross-validation, by contrast, just looks at the model's performance on held-out data, with no further assumptions.
  ```
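
The penalty terms behind the answer to question 6 can be compared directly. This sketch isolates the parts of AIC, BIC and AICc that depend on the parameter count `k` and sample size `n`, using the same formulas as the functions defined in this notebook; the helper names and the example values of `n` and `k` are hypothetical.

```python
import math

def aic_penalty(k):
    # AIC charges 2 per estimated parameter.
    return 2 * k

def bic_penalty(n, k):
    # BIC charges ln(n) per estimated parameter.
    return math.log(n) * k

def aicc_correction(n, k):
    # Extra term AICc adds on top of AIC; shrinks as n grows relative to k.
    return (2 * k ** 2 + 2 * k) / (n - k - 1)

k = 10
for n in (7, 8, 1000):
    print(n, aic_penalty(k), round(bic_penalty(n, k), 2))

for n in (50, 5000):
    print(n, round(aicc_correction(n, k), 4))
```

The BIC penalty overtakes the AIC penalty between n = 7 and n = 8 (since e^2 is about 7.39), and the AICc correction becomes negligible once n is large relative to k.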





In [0]:
# define IC functions (Gaussian-error form, up to an additive constant);
# coeff_used is the number of estimated parameters, including the intercept
def AIC(y_true, y_hat, coeff_used):
    resid = y_true - y_hat
    sse = np.sum(resid**2)
    n = len(y_hat)
    return n*np.log(sse/n) + 2*coeff_used

def BIC(y_true, y_hat, coeff_used):
    resid = y_true - y_hat
    sse = np.sum(resid**2)
    n = len(y_hat)
    return n*np.log(sse/n) + np.log(n)*coeff_used

def AICc(y_true, y_hat, coeff_used):
    resid = y_true - y_hat
    sse = np.sum(resid**2)
    n = len(y_hat)
    # AIC plus the small-sample correction term
    return n*np.log(sse/n) + 2 * coeff_used + ((2 * (coeff_used ** 2) + 2 * coeff_used) / (n - coeff_used - 1))

In [28]:
# AIC, BIC and AICc of question 1 (simple model); +1 counts the intercept
# X_test.shape[1] works whether X_test is a DataFrame or a scaled NumPy array
n_params_lm = X_test.shape[1] + 1
aic_lm1 = AIC(y_test, lm_predictions, n_params_lm)
print(aic_lm1)
bic_lm1 = BIC(y_test, lm_predictions, n_params_lm)
print(bic_lm1)
aicc_lm1 = AICc(y_test, lm_predictions, n_params_lm)
print(aicc_lm1)
print(aicc_lm1)

27300.326786055306
28247.948750890493
27302.202454751117


In [23]:
#aic, bic and aicc of question 5(lasso model)
aic_lasso2 = AIC(y_test, lasso2_predictions, (coeff_used+1))
print(aic_lasso2)
bic_lasso2 = BIC(y_test, lasso2_predictions, (coeff_used+1))
print(bic_lasso2)
aicc_lasso2 = AICc(y_test, lasso2_predictions, (coeff_used+1))
print(aicc_lasso2)


27124.600351388897
27771.75681517878
27125.47631752802
