<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Regression with scikit-learn<br><br> using Soccer Dataset

<br><br></p>


We will again be using the open dataset from the popular site <a href="https://www.kaggle.com">Kaggle</a> that we used in Week 1 for our example. 

Recall that this <a href="https://www.kaggle.com/hugomathien/soccer">European Soccer Database</a> has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016. 

**Note:** Please download the file *database.sqlite* if you don't yet have it in your *Week-7-MachineLearning* folder.

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Import Libraries<br><br></p>


In [1]:
import sqlite3
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Read Data from the Database into pandas
<br><br></p>


In [2]:
# Create connection and execute SQL statement
# (raw string so the backslashes in the Windows path are not treated as escape sequences)
cnx = sqlite3.connect(r'C:\ml\database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)

In [3]:
df.head()

Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,1,218353,505942,2016-02-18 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,2,218353,505942,2015-11-19 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,3,218353,505942,2015-09-21 00:00:00,62.0,66.0,right,medium,medium,49.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,4,218353,505942,2015-03-20 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,5,218353,505942,2007-02-22 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0


In [4]:
df.shape

(183978, 42)

In [5]:
df.columns

Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Declare the Columns You Want to Use as Features For Predicting Overall Rating
<br><br></p>

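Use all the numeric attributes except the identifiers, the date, and the overall rating (our target). The categorical columns *preferred_foot*, *attacking_work_rate*, and *defensive_work_rate* are also left out.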

In [6]:
features = ['potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Specify the Prediction Target (Outcome)
<br><br></p>


In [7]:
target = ['overall_rating']

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Clean the Data<br><br></p>


In [8]:
# drop all records that contain an NA/NaN/None (instead of imputation)
df = df.dropna()

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Extract Features and Target ('overall_rating') Values into Separate Dataframes
<br><br></p>


In [9]:
# select the feature columns as input (x) and the target column as output (y)
x = df[features]
y = df[target]

In [10]:
x.head()

Unnamed: 0,potential,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,long_passing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,71.0,49.0,44.0,71.0,61.0,44.0,51.0,45.0,39.0,64.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,71.0,49.0,44.0,71.0,61.0,44.0,51.0,45.0,39.0,64.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,66.0,49.0,44.0,71.0,61.0,44.0,51.0,45.0,39.0,64.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,65.0,48.0,43.0,70.0,60.0,43.0,50.0,44.0,38.0,63.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,65.0,48.0,43.0,70.0,60.0,43.0,50.0,44.0,38.0,63.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0


In [11]:
y.head()

Unnamed: 0,overall_rating
0,67.0
1,67.0
2,62.0
3,61.0
4,61.0


Let us look at a typical row from our features: 

In [13]:
# check feature values for 1 record
x.iloc[2]

potential             66.0
crossing              49.0
finishing             44.0
heading_accuracy      71.0
short_passing         61.0
volleys               44.0
dribbling             51.0
curve                 45.0
free_kick_accuracy    39.0
long_passing          64.0
ball_control          49.0
acceleration          60.0
sprint_speed          64.0
agility               59.0
reactions             47.0
balance               65.0
shot_power            55.0
jumping               58.0
stamina               54.0
strength              76.0
long_shots            35.0
aggression            63.0
interceptions         41.0
positioning           45.0
vision                54.0
penalties             48.0
marking               65.0
standing_tackle       66.0
sliding_tackle        69.0
gk_diving              6.0
gk_handling           11.0
gk_kicking            10.0
gk_positioning         8.0
gk_reflexes            8.0
Name: 2, dtype: float64

Let us also display the range of our target values: 

In [17]:
# check range of overall player rating values
print(y.min(),'-',y.max())

overall_rating    33.0
dtype: float64 - overall_rating    94.0
dtype: float64


In [20]:
# check the collinearity of the inputs
x.corr()

Unnamed: 0,potential,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,long_passing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
potential,1.0,0.277284,0.287838,0.206063,0.382538,0.301678,0.339978,0.29605,0.262842,0.343133,...,0.379278,0.315207,0.054094,0.082073,0.063284,-0.012283,0.005865,0.092299,0.004472,0.004936
crossing,0.277284,1.0,0.576896,0.368956,0.790323,0.637527,0.809747,0.788924,0.708763,0.685649,...,0.693978,0.574208,0.234886,0.285018,0.274673,-0.604567,-0.595646,-0.356728,-0.597742,-0.601696
finishing,0.287838,0.576896,1.0,0.373459,0.580245,0.851482,0.784988,0.691082,0.633274,0.341121,...,0.652376,0.726234,-0.285416,-0.230453,-0.262144,-0.47937,-0.465135,-0.292349,-0.470758,-0.473302
heading_accuracy,0.206063,0.368956,0.373459,1.0,0.548435,0.391129,0.400803,0.320384,0.306013,0.362741,...,0.336472,0.431291,0.460831,0.480054,0.441134,-0.6656,-0.649145,-0.402865,-0.648981,-0.652494
short_passing,0.382538,0.790323,0.580245,0.548435,1.0,0.639995,0.788935,0.731948,0.69349,0.803073,...,0.766401,0.612511,0.349578,0.415427,0.380148,-0.694111,-0.689874,-0.422659,-0.69103,-0.69326
volleys,0.301678,0.637527,0.851482,0.391129,0.639995,1.0,0.784247,0.75241,0.682909,0.41452,...,0.690716,0.713116,-0.170094,-0.108062,-0.12781,-0.508029,-0.486178,-0.279492,-0.490148,-0.492267
dribbling,0.339978,0.809747,0.784988,0.400803,0.788935,0.784247,1.0,0.810353,0.707322,0.579201,...,0.734119,0.66342,0.004345,0.067306,0.044988,-0.654097,-0.650645,-0.432452,-0.65356,-0.656195
curve,0.29605,0.788924,0.691082,0.320384,0.731948,0.75241,0.810353,1.0,0.797842,0.586313,...,0.728198,0.649737,0.032956,0.094466,0.08011,-0.556625,-0.54494,-0.333784,-0.54987,-0.551574
free_kick_accuracy,0.262842,0.708763,0.633274,0.306013,0.69349,0.682909,0.707322,0.797842,1.0,0.603286,...,0.697943,0.669018,0.072918,0.133147,0.105894,-0.498347,-0.491631,-0.279713,-0.494253,-0.495868
long_passing,0.343133,0.685649,0.341121,0.362741,0.803073,0.41452,0.579201,0.586313,0.603286,1.0,...,0.670151,0.47675,0.441837,0.496679,0.462544,-0.464221,-0.466906,-0.261361,-0.468453,-0.469598
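
Each entry of `x.corr()` is, by default, a Pearson correlation coefficient between two columns. As a minimal sketch of what that number measures, the snippet below computes it from scratch for a few short lists. The values and the helper name `pearson` are illustrative assumptions and are not taken from the dataset.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences (hypothetical helper)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

# made-up skill ratings for five imaginary players
finishing = [44.0, 60.0, 75.0, 30.0, 82.0]
volleys = [40.0, 58.0, 70.0, 35.0, 80.0]
gk_diving = [10.0, 8.0, 6.0, 70.0, 5.0]

print(pearson(finishing, volleys))    # close to +1: the two move together
print(pearson(finishing, gk_diving))  # negative: one is high when the other is low
```

Values near +1 or -1 between two input columns suggest that they carry largely overlapping information.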


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Split the Dataset into Training and Test Datasets
<br><br></p>


In [19]:
# use train_test_split(input, output, test_size, random_state) from sklearn.model_selection
#    - takes in 2 DataFrames and returns 4 DataFrames
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 324)
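
Conceptually, `train_test_split` shuffles the rows and then sets aside a fraction of them for testing. A rough pure-Python sketch of that idea follows. The helper `simple_split` is hypothetical and is not how scikit-learn implements it, so it will not select the same rows for the same seed.

```python
import random

def simple_split(rows, test_size=0.33, seed=324):
    """Shuffle row indices and hold out a fraction for testing (illustrative only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))                       # stand-in for 100 player records
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))        # 67 33
```

Fixing the seed (the role played by `random_state` above) makes the split reproducible, so results can be compared across runs.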

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

(1) Linear Regression: Fit a model to the training set
<br><br></p>


In [21]:
# create a regressor (model) object with LinearRegression() from sklearn.linear_model
regressor_lm = LinearRegression()
type(regressor_lm)

sklearn.linear_model.base.LinearRegression

In [23]:
# fit our linear regression model to the training features (x_train) and training target (y_train)
#    - .fit() estimates the model's coefficients and intercept so that its predictions match y_train as closely as possible
regressor_lm.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
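
For intuition, ordinary least squares with a single feature has a closed-form solution. The sketch below fits `rating = slope * potential + intercept` on a handful of made-up points. The data and the helper `fit_line` are assumptions for illustration only; scikit-learn solves the multi-feature version of the same problem with a linear algebra routine rather than this formula.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# made-up (potential, overall_rating) pairs lying exactly on y = 0.9 * x + 2
potential = [60.0, 70.0, 80.0, 90.0]
rating = [56.0, 65.0, 74.0, 83.0]

slope, intercept = fit_line(potential, rating)
print(slope, intercept)   # approximately 0.9 and 2.0
```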

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Perform Prediction using Linear Regression Model
<br><br></p>


In [24]:
# get outcome predictions for the test set (NEW DATA) from the linear regression model created above using .predict()
y_test_predictions = regressor_lm.predict(x_test)
y_test_predictions

array([[ 66.51284879],
       [ 79.77234615],
       [ 66.57371825],
       ..., 
       [ 69.23780133],
       [ 64.58351696],
       [ 73.6881185 ]])

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

What is the mean of the expected target values in the test set?
<br><br></p>


In [25]:
y_test.describe()

Unnamed: 0,overall_rating
count,59517.0
mean,68.635818
std,7.041297
min,33.0
25%,64.0
50%,69.0
75%,73.0
max,94.0


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Evaluate Linear Regression Accuracy using Root Mean Square Error

<br><br></p>


In [28]:
# use mean_squared_error from sklearn metrics to get the MSE from our test actuals and test predictions
# then take square root of this result to get RMSE
RMSE_lm = sqrt(mean_squared_error(y_true = y_test, y_pred = y_test_predictions))
print(RMSE_lm)

2.805303046855208


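RMSE measures the typical size of the difference between the predicted and observed values, in the same units as the target (here, rating points). Lower is better: an RMSE of 0 would mean no errors at all, which essentially never happens in practice. When comparing models on the same test set, the one with the smaller RMSE is "better", since this indicates smaller differences between observed and predicted values for y. Here, an RMSE of about 2.8 is small relative to the test-set mean of about 68.6 and standard deviation of about 7.0.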
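
As a small worked example of the RMSE calculation, the snippet below computes it by hand. The four ratings are made up, and `rmse` is a hypothetical helper rather than a scikit-learn function.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [67.0, 62.0, 61.0, 70.0]
predicted = [65.0, 64.0, 61.0, 74.0]   # errors: 2, -2, 0, -4

print(rmse(actual, predicted))          # sqrt((4 + 4 + 0 + 16) / 4) = sqrt(6) ≈ 2.449
```

Because the errors are squared before averaging, the single error of 4 contributes more to the result than the two errors of 2 combined.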

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

(2) Decision Tree Regressor: Fit a new regression model to the training set
<br><br></p>


In [30]:
# create a decision tree regressor (model) object with DecisionTreeRegressor() from sklearn.tree
#    - this builds a model by splitting data on an attribute in a top-down manner 
#    - at each split, the algorithm chooses the attribute and threshold that give the maximum
#      reduction in the variance (mean squared error) of the target
regressor_t = DecisionTreeRegressor(max_depth=20)
regressor_t.fit(x_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=20, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
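
To make the splitting rule concrete, here is a minimal sketch of how a single split on a single feature could be chosen: try each candidate threshold and keep the one that leaves the lowest weighted variance in the two resulting groups. The data and the helper `best_split` are made up for illustration. scikit-learn's actual implementation is far more optimized and considers every feature at every node.

```python
def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(feature, target):
    """Return (threshold, weighted_variance) of the best single split (illustrative)."""
    n = len(target)
    best = (None, variance(target))          # baseline: no split at all
    values = sorted(set(feature))
    # candidate thresholds: midpoints between consecutive distinct feature values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# made-up data: ratings jump once 'reactions' exceeds roughly 60
reactions = [45.0, 50.0, 55.0, 65.0, 70.0, 75.0]
overall = [60.0, 61.0, 62.0, 75.0, 76.0, 77.0]

threshold, remaining_variance = best_split(reactions, overall)
print(threshold, remaining_variance)
```

A full tree repeats this search recursively inside each resulting group until a stopping rule such as `max_depth` is reached.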

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Perform Prediction using Decision Tree Regressor
<br><br></p>


In [36]:
y_test_predictions_tree = regressor_t.predict(x_test)
y_test_predictions_tree

array([ 62.        ,  84.        ,  62.38666667, ...,  71.        ,
        62.        ,  73.        ])

Note that these predictions differ from those of the linear regression model.

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

What is the mean of the expected target values in the test set?
<br><br></p>


In [34]:
y_test.describe()

Unnamed: 0,overall_rating
count,59517.0
mean,68.635818
std,7.041297
min,33.0
25%,64.0
50%,69.0
75%,73.0
max,94.0


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Evaluate Decision Tree Regression Accuracy using Root Mean Square Error

<br><br></p>


In [37]:
# use mean_squared_error from sklearn.metrics to get the MSE from our test actuals and test predictions
# then take the square root of this result to get the RMSE
RMSE_tree = sqrt(mean_squared_error(y_true = y_test, y_pred = y_test_predictions_tree))
print(RMSE_tree)

1.4583407002598117


So the Decision Tree model has a smaller RMSE (about 1.46) on the test set than the Linear Regression model (about 2.81).