## Day 25 Lecture 2 Assignment

In this assignment, we will extend a previous binary model to a multinomial case with three classes. We will use the FIFA soccer ratings dataset loaded below and analyze the model generated for this dataset.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.discrete.discrete_model import MNLogit
from sklearn.linear_model import LogisticRegression



In [0]:
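def remove_correlated_features(dataset, threshold):
    """Drop, in place, each column whose correlation with an earlier kept column is >= threshold.

    Only positive correlations are checked; the modified DataFrame is also returned.
    """
    col_corr = set()  # names of columns already removed
    # Restrict to numeric columns so text fields such as 'Name' are ignored.
    corr_matrix = dataset.select_dtypes(include='number').corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
                if colname in dataset.columns:
                    print(f'Deleted {colname} from dataset.')
                    del dataset[colname]

    return dataset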

In [0]:
soccer_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/fifa_ratings.csv')

In [4]:
soccer_data.head()

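| | ID | Name | Overall | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | L. Messi | 94 | 84 | 95 | 70 | 90 | 86 | 97 | 93 | 94 | 87 | 96 | 91 | 86 | 91 | 95 | 95 | 85 | 68 | 72 | 59 | 94 | 48 | 22 | 94 | 94 | 75 | 96 | 33 | 28 | 26 |
| 1 | 20801 | Cristiano Ronaldo | 94 | 84 | 94 | 89 | 81 | 87 | 88 | 81 | 76 | 77 | 94 | 89 | 91 | 87 | 96 | 70 | 95 | 95 | 88 | 79 | 93 | 63 | 29 | 95 | 82 | 85 | 95 | 28 | 31 | 23 |
| 2 | 190871 | Neymar Jr | 92 | 79 | 87 | 62 | 84 | 84 | 96 | 88 | 87 | 78 | 95 | 94 | 90 | 96 | 94 | 84 | 80 | 61 | 81 | 49 | 82 | 56 | 36 | 89 | 87 | 81 | 94 | 27 | 24 | 33 |
| 3 | 192985 | K. De Bruyne | 91 | 93 | 82 | 55 | 92 | 82 | 86 | 85 | 83 | 91 | 91 | 78 | 76 | 79 | 91 | 77 | 91 | 63 | 90 | 75 | 91 | 76 | 61 | 87 | 94 | 79 | 88 | 68 | 58 | 51 |
| 4 | 183277 | E. Hazard | 91 | 81 | 84 | 61 | 89 | 80 | 95 | 83 | 79 | 83 | 94 | 94 | 88 | 95 | 90 | 94 | 82 | 56 | 83 | 66 | 80 | 54 | 41 | 87 | 89 | 86 | 91 | 34 | 27 | 22 |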


Our response for our logistic regression model is going to be a new column, "RankingTier", that contains three categories:

- High: Overall score > 75
- Middle: Overall score between 65 and 75
- Low: Overall score < 65
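One way to express the three tiers above is with `pd.cut`. The following is only a sketch on a made-up `toy` frame (the scores are hypothetical, not taken from the FIFA data), and it assumes integer scores so that a bin edge at 64 separates Low from Middle.

```python
import pandas as pd

# Hypothetical toy scores, not taken from the FIFA dataset.
toy = pd.DataFrame({"Overall": [94, 76, 75, 70, 65, 64, 50]})

# Bins are right-closed: (-inf, 64] -> Low, (64, 75] -> Middle, (75, inf) -> High.
# With integer scores this puts both 65 and 75 in the Middle tier.
toy["RankingTier"] = pd.cut(
    toy["Overall"],
    bins=[-float("inf"), 64, 75, float("inf")],
    labels=["Low", "Middle", "High"],
)

# Per-class counts are a more informative sanity check than a sum of codes.
tier_counts = toy["RankingTier"].value_counts()
print(tier_counts)
```

Named labels keep the tiers readable in later output; integer codes such as 0, 1, 2 work just as well for model fitting.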

In [16]:
# answer goes here

soccer_data['RankingTier'] = 0
soccer_data.loc[(soccer_data['Overall'] >= 65) & (soccer_data['Overall'] <= 75), 'RankingTier'] = 1
soccer_data.loc[(soccer_data['Overall'] > 75), 'RankingTier'] = 2
soccer_data['RankingTier'].sum()



11370

The next few steps until model training are the same as before: identify and remove highly correlated features, and split the data into a training set (80%) and a test set (20%).
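As a rough illustration of these two steps, the sketch below filters correlated columns without mutating the input and then makes a seeded, stratified 80/20 split. It is an alternative to the helper defined earlier, not a copy of it: it uses absolute correlations, and the `toy` data, column names, and `random_state` values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: "a_copy" nearly duplicates "a", "b" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})
y = pd.Series(rng.integers(0, 3, size=200), name="tier")

# Keep only the upper triangle of |corr| so each pair is checked once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)

# Stratifying keeps the class proportions similar in both splits;
# a fixed random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    reduced, y, test_size=0.2, random_state=42, stratify=y
)
print(to_drop, X_tr.shape, X_te.shape)
```

Stratification matters most when one tier is rare, since a purely random split can leave very few of its rows in the test set.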

In [18]:
# answer goes here

remove_correlated_features(soccer_data, 0.9)
print(soccer_data.info())
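# Note: 'Overall' stays in X here, but RankingTier is derived directly from it,
# so the model sees its own target; drop 'Overall' as well for a fair evaluation.
X = soccer_data.drop(['RankingTier', 'ID', 'Name'], axis = 1)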
Y = soccer_data['RankingTier']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16122 entries, 0 to 16121
Data columns (total 31 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               16122 non-null  int64 
 1   Name             16122 non-null  object
 2   Overall          16122 non-null  int64 
 3   Crossing         16122 non-null  int64 
 4   Finishing        16122 non-null  int64 
 5   HeadingAccuracy  16122 non-null  int64 
 6   ShortPassing     16122 non-null  int64 
 7   Volleys          16122 non-null  int64 
 8   Dribbling        16122 non-null  int64 
 9   Curve            16122 non-null  int64 
 10  FKAccuracy       16122 non-null  int64 
 11  LongPassing      16122 non-null  int64 
 12  BallControl      16122 non-null  int64 
 13  Acceleration     16122 non-null  int64 
 14  SprintSpeed      16122 non-null  int64 
 15  Agility          16122 non-null  int64 
 16  Reactions        16122 non-null  int64 
 17  Balance          16122 non-null

Fit a multinomial logistic regression model using the statsmodels package and print out the coefficient summary. What is the "reference" tier chosen by the model? How do we interpret the coefficients? For example, how does the interpretation of the "Reactions" coefficient for RankingTier=Low differ from that of the "Reactions" coefficient for RankingTier=Middle?

In [17]:
# answer goes here

model = MNLogit(Y_train, X_train)
results = model.fit()

results.summary()



Optimization terminated successfully.
         Current function value: 0.628339
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | RankingTier | No. Observations: | 12897 |
| Model: | MNLogit | Df Residuals: | 12841 |
| Method: | MLE | Df Model: | 54 |
| Date: | Sat, 23 May 2020 | Pseudo R-squ.: | 0.3157 |
| Time: | 03:51:17 | Log-Likelihood: | -8103.7 |
| converged: | True | LL-Null: | -11842.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| RankingTier=1 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Overall | 0.1674 | 0.010 | 17.568 | 0.000 | 0.149 | 0.186 |
| Crossing | 0.0277 | 0.003 | 9.191 | 0.000 | 0.022 | 0.034 |
| Finishing | 0.0013 | 0.004 | 0.326 | 0.744 | -0.006 | 0.009 |
| HeadingAccuracy | -0.0234 | 0.004 | -6.629 | 0.000 | -0.030 | -0.016 |
| ShortPassing | -0.0358 | 0.006 | -5.944 | 0.000 | -0.048 | -0.024 |
| Volleys | 0.0010 | 0.003 | 0.298 | 0.766 | -0.006 | 0.008 |
| Dribbling | -0.0018 | 0.005 | -0.364 | 0.716 | -0.011 | 0.008 |
| Curve | 0.0147 | 0.003 | 4.495 | 0.000 | 0.008 | 0.021 |
| FKAccuracy | -0.0051 | 0.003 | -1.715 | 0.086 | -0.011 | 0.001 |
| LongPassing | 0.0118 | 0.004 | 2.840 | 0.005 | 0.004 | 0.020 |
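Each coefficient in the table above is a change in log-odds relative to the reference tier. As far as I know, MNLogit takes the lowest category as the reference, which here would be RankingTier=0; the summary listing its first block as RankingTier=1 is consistent with that. Exponentiating a coefficient turns it into an odds ratio. The sketch below does this for three values copied from the table; the dictionary name is just for illustration.

```python
import numpy as np

# Coefficients copied from the RankingTier=1 block of the summary table.
coef_tier1 = {"Overall": 0.1674, "Crossing": 0.0277, "HeadingAccuracy": -0.0234}

# exp(coef) is the multiplicative change in the odds of tier 1 versus the
# reference tier for a one-point increase in the feature, others held fixed.
odds_ratios = {name: float(np.exp(beta)) for name, beta in coef_tier1.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

A ratio above 1 means higher values of the feature favor tier 1 over the reference tier, and a ratio below 1 means the opposite. The same feature has a separate coefficient in each non-reference block, and each one is read against the same reference tier.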


To evaluate test error using cross-validation, we will switch back to scikit-learn. Estimate the test error of this multinomial logistic regression model using 10-fold CV.

Note: scikit-learn's LogisticRegression class handles both binary and multinomial regression, and it determines which is appropriate from the y_train array passed to fit. You should be able to reuse previous code with minimal changes.
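A minimal sketch of the 10-fold estimate, assuming `cross_val_score` with a stratified splitter, might look like the following. The data are synthetic (`make_classification`) and stand in for `X_train` and `Y_train`; the fold seed and sample sizes are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for X_train / Y_train.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

clf = LogisticRegression(max_iter=2000)

# Ten stratified folds; each fold is held out once while the rest train the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

# Estimated test error is one minus the mean held-out accuracy.
cv_error = 1 - accuracy.mean()
print(f"10-fold CV error: {cv_error:.3f} (std of accuracy {accuracy.std():.3f})")
```

On the real data, passing `X_train` and `Y_train` in place of the synthetic arrays should be the only change needed.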

In [0]:
# answer goes here

# Multinomial (softmax) alternative, left disabled:
#model_multi = LogisticRegression(penalty='l2', solver='lbfgs', multi_class = 'multinomial', max_iter=2000)
#results_multi = model_multi.fit(X_train, Y_train)


# One-vs-rest fit: one binary classifier per tier, not a true multinomial model.
model_ovr = LogisticRegression(penalty='l2', solver='liblinear', multi_class = 'ovr', max_iter=2000)
results_ovr = model_ovr.fit(X_train, Y_train)


In [9]:
from sklearn.metrics import log_loss

train_probs = model_ovr.predict_proba(X_train)
test_probs = model_ovr.predict_proba(X_test)
train_scores = model_ovr.score(X_train, Y_train)  # mean accuracy
test_scores = model_ovr.score(X_test, Y_test)
train_loss = log_loss(Y_train, train_probs)
test_loss = log_loss(Y_test, test_probs)
# Order: train accuracy, test accuracy, train log loss, test log loss
score_list = [train_scores, test_scores, train_loss, test_loss]

score_list

[0.8950918818329844,
 0.8992248062015504,
 0.34135673164040253,
 0.34060853350486825]

As we did in the previous exercise, train a multinomial logistic regression on the training data, make predictions on the 20% holdout test data, then:

- Determine the precision, recall, and F1-score of our model using a cutoff/threshold of 0.5 (hint: scikit-learn's *classification_report* function may be helpful)
- Plot or otherwise generate a confusion matrix
- Plot the ROC curve for our logistic regression model

Comment on the performance of the model.

In [0]:
# answer goes here

from sklearn.metrics import classification_report

report = classification_report(Y_test, results_ovr.predict(X_test))


In [11]:
print(report)

              precision    recall  f1-score   support

           0       0.96      0.94      0.95      1225
           1       0.86      0.97      0.91      1735
           2       0.95      0.23      0.37       265

    accuracy                           0.90      3225
   macro avg       0.92      0.71      0.74      3225
weighted avg       0.91      0.90      0.88      3225
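The report above shows recall of only 0.23 for tier 2, so a confusion matrix helps show where those players end up. The sketch below builds a confusion matrix and one-vs-rest ROC curves on synthetic three-class data; the data, variable names, and seeds are assumptions for illustration, and the real `X_test`, `Y_test`, and fitted model would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class data standing in for the soccer features and tiers.
X_demo, y_demo = make_classification(
    n_samples=900, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])
print(cm)

# One-vs-rest ROC: binarize the labels, then score each class against the
# rest using that class's predicted probability column.
y_bin = label_binarize(y_te, classes=[0, 1, 2])
probs = clf.predict_proba(X_te)
roc_auc = {}
for k in range(3):
    fpr, tpr, _ = roc_curve(y_bin[:, k], probs[:, k])
    roc_auc[k] = auc(fpr, tpr)
    print(f"class {k}: AUC = {roc_auc[k]:.3f}")
```

To draw the curves, each `fpr`, `tpr` pair can be passed to `plt.plot` inside the loop, with a diagonal from (0, 0) to (1, 1) as the chance baseline.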

