<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
Regression with scikit-learn<br><br> using Soccer Dataset
<br><br></p>


We will again be using the open dataset from the popular site <a href="https://www.kaggle.com">Kaggle</a> that we used in Week 1 for our example. 

Recall that this <a href="https://www.kaggle.com/hugomathien/soccer">European Soccer Database</a> has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016. 

**Note:** Please download the file *database.sqlite* if you don't yet have it in your *Week-7-MachineLearning* folder.

In [4]:
# Widen the Jupyter notebook so that the full dataset, with all variables, can be seen without needing to use the scrollbar.

from IPython.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Import Libraries<br><br></p>


In [1]:
import sqlite3
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Read Data from the Database into pandas
<br><br></p>


The pandas function `pd.read_sql_query(sql, con, ...)` runs a SQL query against a database connection and returns the result set of the query string as a DataFrame.

In [2]:
# Create your connection.
cnx = sqlite3.connect('../Week-5-Exercises/soccer_database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
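As a minimal sketch of how `read_sql_query` behaves, the cell below builds a small in-memory SQLite database instead of opening the soccer file. The table contents, the names `demo_cnx` and `demo_df`, and the `potential >= 66` filter are made up for illustration; only the `sql`, `con` and `params` arguments come from pandas itself.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for the soccer SQLite file.
demo_cnx = sqlite3.connect(":memory:")
demo_cnx.execute(
    "CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL, potential REAL)"
)
demo_cnx.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?, ?)",
    [(1, 67.0, 71.0), (2, 62.0, 66.0), (3, None, 65.0)],
)
demo_cnx.commit()

# Select only the columns we need and pass the threshold as a bound parameter
# rather than pasting it into the SQL string.
demo_df = pd.read_sql_query(
    "SELECT id, overall_rating, potential FROM Player_Attributes WHERE potential >= ?",
    demo_cnx,
    params=(66,),
)
demo_cnx.close()

print(demo_df)
```

Selecting named columns in the query, rather than `SELECT *`, is one way to keep the DataFrame small when only a few attributes are needed.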

In [3]:
df.head()

Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,1,218353,505942,2016-02-18 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,2,218353,505942,2015-11-19 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,3,218353,505942,2015-09-21 00:00:00,62.0,66.0,right,medium,medium,49.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,4,218353,505942,2015-03-20 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,5,218353,505942,2007-02-22 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0


In [5]:
df.shape

(183978, 42)

Using the dataframe `.columns` attribute we can see all of the variables per sample. Most of these are features, but some are identifiers, such as the various ids and the date.

In [7]:
df.columns

Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Declare the Columns You Want to Use as Features
<br><br></p>

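For our regression models, we only want to include variables that can be used for **predicting the overall rating (our objective)**. We leave out the identifiers and the date, as well as the non-numeric columns (preferred_foot, attacking_work_rate, defensive_work_rate). We don't need to use all of the remaining numeric attributes to make the prediction, but for now we will. We store these variables in a list named `features`.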

Based on the input data from these features we will predict a numeric overall rating value of a player.

In [10]:
features = [
       'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Specify the Prediction Target
<br><br></p>

We do not include overall_rating as a feature because it's our prediction target.

In [9]:
target = ['overall_rating']

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Clean the Data<br><br></p>

Unfortunately there is a fair amount of missing data. Of our 183,978 initial samples, 3,624 samples contain at least one null value. Let's simply drop these using the pandas `.dropna()` method, which leaves 180,354 samples.

In [23]:
df_nulls = df[df.isna().any(axis=1)]
df_nulls.shape

(3624, 42)

In [25]:
df = df.dropna()
df.shape

(180354, 42)
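Note that `.dropna()` with no arguments drops a row if any of its columns is null, including columns we never use as features. The sketch below, on a tiny made-up dataframe (`demo` and its values are hypothetical), shows how to count nulls per column and how the `subset` argument restricts the check to chosen columns. Whether that is preferable for the soccer data is a judgment call this notebook does not make.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the player table with a few missing values.
demo = pd.DataFrame({
    "overall_rating": [67.0, np.nan, 61.0, 74.0],
    "potential": [71.0, 66.0, np.nan, 76.0],
    "preferred_foot": ["right", "left", "right", None],
})

# Number of missing values in each column.
null_counts = demo.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_nulls = int(demo.isna().any(axis=1).sum())

# Drop a row if ANY column is null (what the notebook does).
cleaned_all = demo.dropna()

# Drop a row only if one of the listed columns is null.
cleaned_subset = demo.dropna(subset=["overall_rating", "potential"])

print(null_counts)
print(rows_with_nulls, cleaned_all.shape, cleaned_subset.shape)
```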

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Extract Features and Target ('overall_rating') Values into Separate Dataframes
<br><br></p>

We create two dataframes, separating our inputs from our output. We name these dataframes X for the inputs and y for the output. When doing this, keep in mind the analogy `y = f(X)`.

For X we index our original data by the features we chose as our inputs, and for y we index by our target (what we're trying to predict, in this case overall_rating).

In [29]:
X = df[features]
X.head()

Unnamed: 0,potential,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,long_passing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,71.0,49.0,44.0,71.0,61.0,44.0,51.0,45.0,39.0,64.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,71.0,49.0,44.0,71.0,61.0,44.0,51.0,45.0,39.0,64.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,66.0,49.0,44.0,71.0,61.0,44.0,51.0,45.0,39.0,64.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,65.0,48.0,43.0,70.0,60.0,43.0,50.0,44.0,38.0,63.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,65.0,48.0,43.0,70.0,60.0,43.0,50.0,44.0,38.0,63.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0


In [32]:
y = df[target]

Let us look at a typical row from our features: 

In [34]:
# DataFrame.iloc[] selects rows by integer position (0, 1, 2, ...), regardless of the index labels.
# This is useful when the labels are not a plain 0..n-1 sequence, e.g. after rows have been dropped.

X.iloc[2]

potential             66.0
crossing              49.0
finishing             44.0
heading_accuracy      71.0
short_passing         61.0
volleys               44.0
dribbling             51.0
curve                 45.0
free_kick_accuracy    39.0
long_passing          64.0
ball_control          49.0
acceleration          60.0
sprint_speed          64.0
agility               59.0
reactions             47.0
balance               65.0
shot_power            55.0
jumping               58.0
stamina               54.0
strength              76.0
long_shots            35.0
aggression            63.0
interceptions         41.0
positioning           45.0
vision                54.0
penalties             48.0
marking               65.0
standing_tackle       66.0
sliding_tackle        69.0
gk_diving              6.0
gk_handling           11.0
gk_kicking            10.0
gk_positioning         8.0
gk_reflexes            8.0
Name: 2, dtype: float64
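The difference between position and label only shows up once the index has gaps, which is the case after `.dropna()`. Here is a small sketch with a made-up dataframe (`demo` is hypothetical) in which the row labelled 1 has been dropped:

```python
import pandas as pd

# Hypothetical frame; the row labelled 1 has a missing value and is dropped.
demo = pd.DataFrame(
    {"potential": [71.0, None, 66.0, 65.0]},
    index=[0, 1, 2, 3],
).dropna()

# The remaining index labels are 0, 2, 3.
remaining_labels = list(demo.index)

# .iloc uses positions: position 1 is the second remaining row (label 2).
by_position = demo["potential"].iloc[1]

# .loc uses labels: label 2 is that same row.
by_label = demo["potential"].loc[2]

# Position 2 is the third remaining row (label 3), not the row labelled 2.
third_by_position = demo["potential"].iloc[2]

print(remaining_labels, by_position, by_label, third_by_position)
```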

Let us also display our target values: 

In [33]:
y

Unnamed: 0,overall_rating
0,67.0
1,67.0
2,62.0
3,61.0
4,61.0
5,74.0
6,74.0
7,73.0
8,73.0
9,73.0


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Split the Dataset into Training and Test Datasets
<br><br></p>

From sklearn.model_selection, we imported the `train_test_split()` function. This function is used to split our data into training and test datasets. It takes the input and output dataframes (X and y) and the train-test split ratio (test_size), among other options, and returns the training and test portions of each. Setting random_state makes the shuffle, and therefore the split, reproducible.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
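To see what the four returned objects look like, here is a minimal sketch on a made-up dataset of 100 rows (`demo_X`, `demo_y` and the column values are hypothetical). It checks that the inputs and targets stay aligned row by row, and that reusing the same random_state reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical inputs and target with 100 samples.
demo_X = pd.DataFrame({
    "potential": np.arange(100.0),
    "reactions": np.arange(100.0) * 2,
})
demo_y = pd.DataFrame({"overall_rating": np.arange(100.0) / 2})

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.33, random_state=324
)

# Roughly one third of the rows end up in the test set.
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Rows of X and y are shuffled together, so their index labels still match.
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)

# The same random_state gives the same split again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    demo_X, demo_y, test_size=0.33, random_state=324
)
reproducible = X_te.index.equals(X_te2.index)
print(aligned, reproducible)
```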

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
(1) Linear Regression: Fit a model to the training set
<br><br></p>

We first model our predictions using a Linear Regression model. We imported `LinearRegression` from sklearn.linear_model earlier; here we create a LinearRegression object and assign it to the variable 'regressor'.

We then use the `.fit()` method of the LinearRegression object, passing in our training data and training target values, to fit a linear model. `.fit()` returns the fitted estimator itself (our 'regressor' variable).

In [76]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Using the .coef_ attribute of our LinearRegression object, we can see the
# weights for each input variable once the regressor has been trained.
regressor.coef_

array([[ 0.37672555,  0.02167533,  0.01131009,  0.06938888,  0.05116648,
         0.00483757, -0.01350703,  0.01114517,  0.01294573,  0.00594126,
         0.13353741,  0.00619644,  0.00921171, -0.00751459,  0.21092139,
         0.00821475,  0.01725006,  0.01478119, -0.00612045,  0.061079  ,
        -0.01362171,  0.0210621 ,  0.01142752, -0.01002039, -0.00124983,
         0.01382425,  0.03319181,  0.00366161, -0.02838212,  0.16135841,
         0.03268736, -0.03354147,  0.0567999 ,  0.02530991]])
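The raw coefficient array is hard to read without the feature names next to it. The sketch below shows one way to pair them up, using a made-up two-feature dataset whose target is an exact linear function of the inputs (the weights 0.4 and 0.2, the intercept 5.0, and the names `demo_X`, `demo_y`, `demo_model` are all assumptions for illustration). Because the target is passed as a one-column dataframe, `coef_` has shape `(1, n_features)`, as in the output above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical features drawn at random.
demo_X = pd.DataFrame({
    "potential": rng.uniform(40, 95, size=200),
    "reactions": rng.uniform(30, 90, size=200),
})

# Assumed ground truth: an exact linear rule with no noise.
demo_y = pd.DataFrame({
    "overall_rating": 0.4 * demo_X["potential"] + 0.2 * demo_X["reactions"] + 5.0
})

demo_model = LinearRegression().fit(demo_X, demo_y)

# A one-column target gives coef_ of shape (1, n_features); take the first row
# and label each weight with its feature name.
weights = pd.Series(demo_model.coef_[0], index=demo_X.columns)

# Order the features by the absolute size of their weight.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)

print(ranked)
print(demo_model.intercept_)
```

Applied to the notebook's model, the same idea would be `pd.Series(regressor.coef_[0], index=features)`. Keep in mind that a weight's size depends on the scale of its feature, so this ordering is only a rough guide to importance.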

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Perform Prediction using Linear Regression Model
<br><br></p>

Once we have fitted the linear model and attained our weights (coefficients) for each variable, we can use the `.predict()` method of our LinearRegression object to predict using the linear model. 

We pass in our X_test data and get back an array of predicted values, which we store in the variable 'y_prediction'. Remember, the model hasn't seen any data from the X_test dataset we pass in. It's predicting on data that is new to it.

In [77]:
y_prediction = regressor.predict(X_test)
y_prediction

array([[66.51284879],
       [79.77234615],
       [66.57371825],
       ...,
       [69.23780133],
       [64.58351696],
       [73.6881185 ]])

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
What is the mean of the expected target value in the test set?
<br><br></p>


In [78]:
y_test.describe()

Unnamed: 0,overall_rating
count,59517.0
mean,68.635818
std,7.041297
min,33.0
25%,64.0
50%,69.0
75%,73.0
max,94.0


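We can compare this to the summary statistics of our predicted overall_rating values in y_prediction. The predicted mean (68.64) is almost identical to the actual mean (68.64), and the quartiles are close. The predictions match less well on the more volatile measures, e.g. min, max, and std: the model's predictions span a narrower range (39.5 to 91.8) than the actual ratings (33 to 94).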

In [79]:
print(type(y_prediction))
y_prediction_df = pd.DataFrame(y_prediction)
y_prediction_df.describe()

<class 'numpy.ndarray'>


Unnamed: 0,0
count,59517.0
mean,68.638762
std,6.414983
min,39.485815
25%,64.440971
50%,68.679438
75%,73.07368
max,91.778819


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Evaluate Linear Regression Accuracy using Root Mean Square Error
<br><br></p>

Rather than compare basic statistical measures of our predicted values vs the observed values as above, we can use **Root Mean Square Error (RMSE)** to measure the prediction accuracy of our regressor.

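**RMSE is the square root of the average squared difference between each predicted value and its actual value.** It is expressed in the same units as the target, here rating points.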

An RMSE score of zero means perfect prediction with no errors. When comparing two regression models, the one with the smaller RMSE is generally preferred as its predictions will have smaller difference from the observed values.

To get RMSE, we take the square root (using the sqrt function) of mean squared error. 

From sklearn.metrics we import the `mean_squared_error()` function which takes in arguments for y_true (ground truth (correct) target values), and y_pred (estimated target values). The returned value is a non-negative floating point value representing the **mean squared error regression loss**.

In [80]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

In [81]:
# Below we see our RMSE for our linear regression model. To think about how
# good or bad this is, remember that overall_rating in the test set ranges from 33 to 94,
# and the mean was approx. 68.6.

print(RMSE)

2.805303046855208
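To make the calculation concrete, here is a small worked sketch of $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2}$ on four made-up ratings (the arrays `actual` and `predicted` are hypothetical), computed once by hand with NumPy and once with `mean_squared_error`, as in the cell above.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted ratings for four players.
actual = np.array([67.0, 62.0, 74.0, 61.0])
predicted = np.array([66.0, 64.0, 74.0, 58.0])

# Per-sample errors: [-1, 2, 0, -3]
errors = predicted - actual

# Square, average, then take the square root: sqrt((1 + 4 + 0 + 9) / 4) = sqrt(3.5)
rmse_by_hand = sqrt(np.mean(errors ** 2))

# Same quantity using scikit-learn's mean squared error.
rmse_sklearn = sqrt(mean_squared_error(y_true=actual, y_pred=predicted))

print(rmse_by_hand, rmse_sklearn)
```

Because the errors are squared before averaging, the single 3-point miss contributes more than the other three predictions combined.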


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
(2) Decision Tree Regressor: Fit a new regression model to the training set
<br><br></p>

Let's now see if we can improve our prediction accuracy (based on the RMSE measure) using a slightly more complex model - a decision tree regressor.

From sklearn.tree we imported DecisionTreeRegressor; here we create a DecisionTreeRegressor object, named 'regressor' as before. The max_depth argument specifies the maximum depth of the decision tree, i.e. the largest number of splits allowed on any path from the root to a leaf.

Using the `.fit()` method we build a decision tree regressor from the training set (X_train, y_train).

As with our Linear Regression model, `.fit()` returns the fitted estimator itself (our 'regressor' variable). The instance has been fit to our training data and can now be applied to our test data.

In [82]:
regressor = DecisionTreeRegressor(max_depth=20)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=20, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
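The choice of max_depth trades off flexibility against overfitting. The sketch below illustrates this on a made-up one-feature dataset (a noisy sine curve; `demo_X`, `demo_y`, the depths 2, 5 and 20, and the `results` dictionary are assumptions for illustration, not values from the soccer data). It fits one tree per depth and records the RMSE on both the training and the test portion.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: one input feature, target is a sine curve plus noise.
demo_X = rng.uniform(0, 10, size=(500, 1))
demo_y = np.sin(demo_X[:, 0]) + rng.normal(0, 0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.33, random_state=324
)

results = {}
for depth in (2, 5, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_rmse = sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    test_rmse = sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    # get_depth() reports the depth the fitted tree actually reached.
    results[depth] = (tree.get_depth(), train_rmse, test_rmse)
    print(depth, tree.get_depth(), round(train_rmse, 3), round(test_rmse, 3))
```

The training RMSE can only go down as the tree is allowed to grow deeper, so the test RMSE is the number to watch. A large gap between the two suggests the tree is fitting noise in the training data.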

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Perform Prediction using Decision Tree Regressor
<br><br></p>

Again, we use the `.predict()` method, and pass in our input test data. We assign this to variable 'y_prediction' which is an ndarray.

We can already see that our predicted overall_rating values using the Decision Tree Regressor are different from our predicted values using Linear Regression.

In [91]:
y_prediction = regressor.predict(X_test)
y_prediction

array([62.        , 84.        , 62.38666667, ..., 71.        ,
       62.        , 72.        ])

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
For comparison: What is the mean of the expected target value in the test set?
<br><br></p>

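Comparing the summary statistics of the actual test overall_rating data with those of our predictions, we can see that the decision tree regressor's predictions match the test data more closely than the linear regression model's did, including on std, min, and max. Summary statistics alone don't show how close each individual prediction is, so we check this with RMSE next.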

In [92]:
y_test.describe()

Unnamed: 0,overall_rating
count,59517.0
mean,68.635818
std,7.041297
min,33.0
25%,64.0
50%,69.0
75%,73.0
max,94.0


In [93]:
y_prediction_df = pd.DataFrame(y_prediction)
y_prediction_df.describe()

Unnamed: 0,0
count,59517.0
mean,68.629643
std,7.018099
min,33.0
25%,64.0
50%,69.0
75%,73.205128
max,94.0


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Evaluate Decision Tree Regression Accuracy using Root Mean Square Error
<br><br></p>

Let's compute Root Mean Squared Error for the decision tree regressor model to check whether it's more accurate than our linear regression model.

In [94]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

In [96]:
print(RMSE)

1.456915235849845


In terms of prediction accuracy, we confirm that our decision tree regressor model is better than our linear regression model (RMSE of 1.457 vs. 2.805). The predicted overall_rating values are closer to their true values as confirmed by the RMSE measure.

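To evaluate the model further, we can compare the RMSE of 1.457 to the mean of 68.6: the typical error is about 2% of the average rating, and well below the test set's standard deviation of 7.04. This is quite a good score.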
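Another way to put an RMSE in context is to compare it with a baseline that ignores the features and always predicts the mean of the training targets. The RMSE of such a baseline is roughly the standard deviation of the target, which is why the 7.04 above is a useful yardstick. The sketch below demonstrates this with scikit-learn's `DummyRegressor` on made-up ratings (the normal distribution with mean 68.6 and spread 7.0, and all `*_demo` names, are assumptions for illustration).

```python
from math import sqrt

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical ratings with roughly the mean and spread seen in the notebook.
y_train_demo = rng.normal(68.6, 7.0, size=2000)
y_test_demo = rng.normal(68.6, 7.0, size=1000)

# The baseline ignores its inputs, so placeholder features are enough.
X_train_demo = np.zeros((2000, 1))
X_test_demo = np.zeros((1000, 1))

# Always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)

baseline_rmse = sqrt(mean_squared_error(y_test_demo, baseline_pred))
test_std = y_test_demo.std()

print(baseline_rmse, test_std)
```

A model is only useful if its RMSE is clearly below this baseline.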