<p style="font-family: Arial; font-size:3.75vw;color:purple; font-style:bold"><br>
Regression Exercise Notebook
</p><br>

# Exercise Notebook Instructions

### 1. Important: Only modify the cells which instruct you to modify them - leave "do not modify" cells alone.  

The code which tests your responses assumes you have run the startup/read-only code exactly.

### 2. Work through the notebook in order.

Some steps depend on earlier ones, so move through the notebook in order.

### 3. It is okay to use numpy libraries.

You may find some of these questions are fairly straightforward to answer using built-in numpy functions.  That's totally okay - part of the point of these exercises is to familiarize you with the commonly used numpy functions.

### 4. Seek help if stuck

If you get stuck, don't worry!  You can either review the videos/notebooks from this week, ask in the course forums, or look to the solutions for the correct answer.  BUT, be careful about looking to the solutions too quickly.  Struggling to get the right answer is an important part of the learning process.

In [1]:
# Below can be used to change the width of the Jupyter notebook. I did this so that the full dataset with all variables can be seen without needing to use the scrollbar

from IPython.display import display, HTML
display(HTML("<style>.container { width:65% !important; }</style>"))

In [2]:
# DO NOT MODIFY

# import appropriate libraries
import sqlite3
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np

In [12]:
# DO NOT MODIFY

# We will use the European Soccer dataset for this exercise.

def get_data():
    cnx = sqlite3.connect('../Week-5-Exercises/soccer_database.sqlite')
    # read the result of the SQL query into a pandas DataFrame
    df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
    return df

df = get_data()
df.head()

Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,1,218353,505942,2016-02-18 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,2,218353,505942,2015-11-19 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,3,218353,505942,2015-09-21 00:00:00,62.0,66.0,right,medium,medium,49.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,4,218353,505942,2015-03-20 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,5,218353,505942,2007-02-22 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
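
The connection opened in `get_data` is never closed. A minimal sketch of the same kind of read with explicit cleanup is shown below. It uses an in-memory SQLite database and a made-up two-row table in place of `soccer_database.sqlite`; the table contents and the `load_table` helper are assumptions for illustration only.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_table(cnx, table_name):
    """Read a whole table into a DataFrame (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so only pass trusted names.
    return pd.read_sql_query(f"SELECT * FROM {table_name}", cnx)


# In-memory stand-in for the real database file (toy data, not the real dataset).
with closing(sqlite3.connect(":memory:")) as cnx:
    cnx.execute("CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL)")
    cnx.executemany(
        "INSERT INTO Player_Attributes VALUES (?, ?)",
        [(1, 67.0), (2, None)],
    )
    toy_df = load_table(cnx, "Player_Attributes")

print(toy_df.shape)  # (2, 2)
print(toy_df["overall_rating"].isna().sum())  # 1
```

SQL `NULL` values arrive in pandas as missing values, which is what Exercise 1 asks you to remove.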


<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>
Exercise 1: Drop NULLs in the Data<br><br></p>


In the cell below, modify the function `preparation`. The `preparation` function takes three arguments:
* a dataframe, 
* a list of features, and 
* the name of the regression target column.

The function should do the following: 
- take the input data frame and remove all rows containing NULLs. 
- RETURN two data frames, one containing the feature columns and the other the target column.

In [9]:
# modify this cell

# The function first copies the dataframe so the original is left untouched and applies .dropna() to remove
# rows containing NULLs. It then indexes by the features and the target to return two separate dataframes.

def preparation(df, features, target):
    ### BEGIN SOLUTION
    df_copy = df.copy().dropna()
    feat_df = df_copy[features]
    targ_df = df_copy[target]
    return feat_df, targ_df
    ### END SOLUTION

Note: When a function returns two values, assign its result to two variables when calling it (e.g. the X and y variables below).
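
As a small illustration of the note above and of what `.dropna()` removes, the sketch below runs the same logic as the solution on a made-up four-row frame (the toy data and the `prepare_toy` name are assumptions, not part of the exercise). Because `.dropna()` is called on the whole frame, a NULL in a column that is neither a feature nor the target still removes the row; passing `subset=` restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 has a NULL in a feature, row 2 in an unused column.
toy = pd.DataFrame({
    "potential":      [71.0, np.nan, 66.0, 65.0],
    "reactions":      [47.0, 47.0, 47.0, 46.0],
    "overall_rating": [67.0, 67.0, 62.0, 61.0],
    "unused_column":  [1.0, 2.0, np.nan, 4.0],
})


def prepare_toy(df, features, target):
    """Same logic as the exercise solution: drop NULL rows, then split columns."""
    clean = df.dropna()
    return clean[features], clean[target]


# Two return values, so unpack into two names.
X_toy, y_toy = prepare_toy(toy, ["potential", "reactions"], ["overall_rating"])
print(X_toy.shape, y_toy.shape)  # (2, 2) (2, 1)

# Restricting the NULL check to the columns actually used keeps row 2.
used = ["potential", "reactions", "overall_rating"]
print(toy.dropna(subset=used).shape)  # (3, 4)
```

The row count the exercise checks for (180354) corresponds to the whole-frame version.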

In [13]:
features = ['potential', 'reactions', 'vision']

target = ['overall_rating']
X, y = preparation(df, features, target)
X.shape, y.shape

((180354, 3), (180354, 1))

In [17]:
# DO NOT MODIFY
ans1 = ['potential', 'reactions', 'vision']
ans2 = ['overall_rating']

try: 
    features = ans1
    target   = ans2
    X, y = preparation(df, features, target)
    X.columns, y.columns
    
    # The pandas attribute .columns returns the column names as a pandas Index object. It can be converted to a Python list using the .tolist() method.
    # np.all is used because np.alltrue was removed in NumPy 2.0.
    assert np.all(X.columns.tolist() == ans1)
    assert np.all(y.columns.tolist() == ans2)
    assert np.all(X.shape[0] == 180354)
    assert np.all(y.shape[0] == 180354)

except AssertionError as e: print("Try again, your output did not match the expected answer above")

<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>
Exercise 2: Perform Splitting<br><br></p>


In the cell below, modify the function so that it takes the features (X) and target (y) dataframes and 
splits them into 70% training data and 30% test data, using random_state = rstate.

The function should return X_train, X_test, y_train, and y_test

In [18]:
# modify this cell

def clean_data(X, y, rstate):
    ### BEGIN SOLUTION
    xtrain, xtest, ytrain, ytest  = train_test_split(X, y, random_state=rstate, train_size = 0.70)
    return xtrain, xtest, ytrain, ytest
    ### END SOLUTION

Again, because the function returns multiple values, remember to assign its result to the matching number of variables when calling it. For this function, that is 4 variables.

In [21]:
X_train, X_test, y_train, y_test = clean_data(X, y, 9000)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(126247, 3) (126247, 1)
(54107, 3) (54107, 1)
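
The sizes above are consistent with scikit-learn rounding the training share down: 0.70 × 180354 = 126247.8, which gives 126247 training rows and leaves 54107 for testing. A minimal sketch on 15 made-up rows (toy data, not the soccer dataset) shows the same rounding, and that a fixed `random_state` reproduces the split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 15 made-up rows, so that 70% (10.5 rows) is not a whole number.
X_toy = pd.DataFrame({"feature": np.arange(15)})
y_toy = pd.DataFrame({"target": np.arange(15) * 2})

split_a = train_test_split(X_toy, y_toy, train_size=0.70, random_state=9000)
split_b = train_test_split(X_toy, y_toy, train_size=0.70, random_state=9000)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (10, 1) (5, 1)

# Same random_state, same rows in the same order.
print(X_tr.index.equals(split_b[0].index))  # True

# Features and targets stay aligned row by row.
print((y_tr["target"] == X_tr["feature"] * 2).all())  # True
```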


In [22]:
# DO NOT MODIFY

try: 
    X_train, X_test, y_train, y_test = clean_data(X, y, 9000)

    # np.all is used because np.alltrue was removed in NumPy 2.0.
    assert np.all(X_train.shape == (126247, 3))
    assert np.all(y_train.shape == (126247, 1))
    assert np.all(X_test.shape == (54107, 3))
    assert np.all(y_test.shape == (54107, 1))

except AssertionError as e: print("Try again")

Using the pandas method `.round()`, we can limit the output of `.describe()` to two decimal places so that it does not show more digits than necessary.

In [36]:
y.describe().round(2)

Unnamed: 0,overall_rating
count,180354.0
mean,68.64
std,7.03
min,33.0
25%,64.0
50%,69.0
75%,73.0
max,94.0


<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>
Exercise 3: Build a Regressor<br><br></p>

In the cell below, modify the function to take only X_train and y_train and RETURN a regressor
that predicts y_train from the columns in X_train. You can pick any regressor model.

The function should RETURN a trained model. We will test your regressor on X_test and y_test.

In [39]:
# modify this cell

def train_regressor(X_train, y_train):
    ### BEGIN SOLUTION
    regressor = LinearRegression().fit(X_train, y_train)
    return regressor
    ### END SOLUTION

In [41]:
# DO NOT MODIFY

threshold = 4.5

try: 
    model = train_regressor(X_train, y_train)
    y_prediction = model.predict(X_test)
    rmse = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
    print(rmse)
    assert np.all(rmse < threshold)  # np.alltrue was removed in NumPy 2.0
except AssertionError as e: print("Keep trying - can you get an RMSE < %f" % threshold)

3.5458963821765392


An RMSE of 3.546 seems quite reasonable given our mean rating of 68.6. It is higher (worse), however, than in our `European Soccer Regression Analysis using scikit-learn.ipynb` notebook, which used all features for X to predict y and had an RMSE of 2.81.
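
To put an RMSE in context, it helps to compare it with the spread of the target rather than its mean: a model that always predicts the mean rating has an RMSE equal to the (population) standard deviation of the target, which for this dataset is close to the 7.03 reported by `.describe()` earlier. The sketch below computes RMSE by hand on made-up numbers (the ratings and predictions are illustrative assumptions, not notebook output):

```python
from math import sqrt
from statistics import mean, pstdev


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings and predictions, for illustration only.
ratings = [60.0, 64.0, 68.0, 72.0, 76.0]
predictions = [61.0, 63.0, 69.0, 71.0, 77.0]

# Baseline: always predict the mean rating.
baseline = [mean(ratings)] * len(ratings)

model_rmse = rmse(ratings, predictions)
baseline_rmse = rmse(ratings, baseline)

print(model_rmse)     # 1.0
print(baseline_rmse)  # equals pstdev(ratings)
```

By that yardstick, an RMSE of 3.546 is roughly half of the target's standard deviation.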

Below we try a Decision Tree Regressor instead of a Linear Regression model. We can then see which returns the lower RMSE, and is thus the more accurate predictor of overall rating given the potential, reactions and vision features.

In [44]:
# modify this cell

def train_regressor_DT(X_train, y_train):
    ### BEGIN SOLUTION
    regressor = DecisionTreeRegressor().fit(X_train, y_train)
    return regressor
    ### END SOLUTION

In [45]:
# DO NOT MODIFY

threshold = 4.5

try: 
    model = train_regressor_DT(X_train, y_train)
    y_prediction = model.predict(X_test)
    rmse = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
    print(rmse)
    assert np.all(rmse < threshold)  # np.alltrue was removed in NumPy 2.0
except AssertionError as e: print("Keep trying - can you get an RMSE < %f" % threshold)

3.133593163464608


An RMSE of 3.13 for the Decision Tree regressor is rather good and beats the 3.55 of the Linear Regression model on the same three features. It is still higher (worse), however, than the RMSE of 1.45 that the Decision Tree regressor achieved in `European Soccer Regression Analysis using scikit-learn.ipynb` using all features for X.
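
Since both models go through the same evaluation steps, the comparison can be wrapped in one helper. Below is a minimal sketch on synthetic data; the `evaluate` helper, the generated data, and the `max_depth=5` setting are assumptions for illustration, and the exercise itself uses the default tree.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return its RMSE on the held-out test set."""
    model.fit(X_train, y_train)
    return sqrt(mean_squared_error(y_test, model.predict(X_test)))


# Synthetic stand-in for the soccer features: a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(40, 90, size=(500, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 2, size=500)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
splits = train_test_split(X, y, train_size=0.70, random_state=9000)

models = {
    "linear": LinearRegression(),
    # random_state makes the tree reproducible; max_depth caps its complexity.
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}
results = {name: evaluate(m, *splits) for name, m in models.items()}

for name, score in results.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Which model wins depends on the data: the synthetic target here is linear by construction, so the ranking need not match the one seen on the soccer features.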