# Part 2: Linear Regression


In this part, we will be working with a dataset scraped by [Shubham Maurya](https://www.kaggle.com/mauryashubham/linear-regression-to-predict-market-value/data), which collects facts about players in the English Premier League as of 2017. His original goal was to establish whether there was a relationship between a player's popularity and his market value, as estimated by transfermarkt.com.

**Your goal is to fit a model able to predict a player's market value.**

## The dataset

The dataset contains the following information:

| **Field** | **Description** |
|-----------|-----------------|
| name | Name of the player |
| club | Club of the player |
| age | Age of the player |
| position | The usual position on the pitch |
| position_cat | 1 for attackers, 2 for midfielders, 3 for defenders, 4 for goalkeepers |
| market_value | As listed on transfermarkt.com on July 20th, 2017 |
| page_views | Average daily Wikipedia page views from September 1, 2016 to May 1, 2017 |
| fpl_value | Value in Fantasy Premier League as of July 20th, 2017 |
| fpl_sel | % of FPL players who have selected that player in their team |
| fpl_points | FPL points accumulated over the previous season |
| region | 1 for England, 2 for EU, 3 for Americas, 4 for Rest of World |
| nationality | Player's nationality |
| new_foreign | Whether a new signing from a different league, for 2017/18 (until July 20th) |
| age_cat | A categorical version of the age feature |
| club_id | A numerical version of the club feature |
| big_club | Whether one of the Top 6 clubs |
| new_signing | Whether a new signing for 2017/18 (until July 20th) |

## Exercise 1: Exploring the data
The first step is to explore your data.

We will start with the necessary imports. In this exercise, we will be working with the library `pandas`. If you are not familiar with it, we recommend following the introductory exercises in the course's GitHub repository.

In [29]:
import numpy as np
import pandas as pd

We will now proceed to read the dataset:

In [30]:
league_df = pd.read_csv('data/football_data.csv') #Reads a CSV file

### Task 1.1: Using pandas for data exploration
Use the method `name_dataframe.head(N)` (N is the number of entries) to look at the first instances of the dataframe. 

Then, use the method `name_dataframe.describe(include='all')` to generate descriptive statistics that summarize each field of the dataframe. 

Finally, print the result of `name_dataframe.dtypes` to see the data type associated with each field of the table.

In [31]:
#Your code for head
league_df.head(10)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3.0,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2.0,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2.0,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1.0,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2.0,France,0,4,1,1,0
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2.0,Spain,0,2,1,1,0
6,Olivier Giroud,Arsenal,30,CF,1,22.0,2230,8.5,2.50%,116,2.0,France,0,4,1,1,0
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2.0,Spain,0,4,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2.0,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4.0,Nigeria,0,1,1,1,0


In [32]:
#Your code for describe
league_df.describe(include='all')

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
count,461,461,461.0,461,461.0,461.0,461.0,461.0,461,461.0,460.0,461,461.0,461.0,461.0,461.0,461.0
unique,461,20,,13,,,,,113,,,61,,,,,
top,Alexis Sanchez,Arsenal,,CB,,,,,0.10%,,,England,,,,,
freq,1,28,,85,,,,,64,,,156,,,,,
mean,,,26.804772,,2.180043,11.012039,763.776573,5.447939,,57.314534,1.993478,,0.034707,3.206074,10.334056,0.303688,0.145336
std,,,3.961892,,1.000061,12.257403,931.805757,1.346695,,53.113811,0.957689,,0.183236,1.279795,5.726475,0.460349,0.352822
min,,,17.0,,1.0,0.05,3.0,4.0,,0.0,1.0,,0.0,1.0,1.0,0.0,0.0
25%,,,24.0,,1.0,3.0,220.0,4.5,,5.0,1.0,,0.0,2.0,6.0,0.0,0.0
50%,,,27.0,,2.0,7.0,460.0,5.0,,51.0,2.0,,0.0,3.0,10.0,0.0,0.0
75%,,,30.0,,3.0,15.0,896.0,5.5,,94.0,2.0,,0.0,4.0,15.0,1.0,0.0


In [33]:
#Your code for d_type
league_df.dtypes

name             object
club             object
age               int64
position         object
position_cat      int64
market_value    float64
page_views        int64
fpl_value       float64
fpl_sel          object
fpl_points        int64
region          float64
nationality      object
new_foreign       int64
age_cat           int64
club_id           int64
big_club          int64
new_signing       int64
dtype: object

### Question set 1.1: About the data
1. What is the name of the player appearing in the 7th record of the dataset?
2. What is the mean age in the English Premier League (in 2017)? 
3. What fields store a continuous value?

Your answers here:
1. Olivier Giroud
2. 26.804772
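3. `market_value` and `fpl_value`, both stored as `float64`. `fpl_sel` is also continuous in nature, but it is stored as a string with a percent sign. `region` most likely appears as `float64` only because of a missing value (its count is 460 instead of 461); it is a categorical field.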
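
The `dtypes` output lists `fpl_sel` as `object` because its values are strings such as `17.10%`. A minimal sketch of how such a column could be converted to a numeric one, assuming a toy DataFrame (the frame `toy` and the column name `fpl_sel_pct` are made up for illustration and are not part of the exercise):

```python
import pandas as pd

# Toy data mimicking the format of the fpl_sel column (values taken from the head() output)
toy = pd.DataFrame({
    "name": ["Alexis Sanchez", "Mesut Ozil", "Laurent Koscielny"],
    "fpl_sel": ["17.10%", "5.60%", "0.70%"],
})

# Strip the trailing percent sign and cast the remaining text to float
toy["fpl_sel_pct"] = toy["fpl_sel"].str.rstrip("%").astype(float)

print(toy.dtypes)
print(toy)
```

After the conversion the column can be used like any other numeric feature, for example in `describe()` or as a model input.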

## Exercise 2: Data splits, data preparation and training
Before starting the training procedure, we need to split the data into the training, validation and test sets.

In this exercise, the data has already been split for you.

In [34]:
#Loading the splits
df_train = pd.read_csv('data/league_train.csv')
df_val = pd.read_csv('data/league_val.csv')
df_test = pd.read_csv('data/league_test.csv')

Alternatively, for the type of data used in this exercise, the library `scikit-learn` provides the function `train_test_split`, which splits the data automatically.

### Question set 2.1 Train_test_split
Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of the `train_test_split` function:
1. What parameters does it receive as input? Provide examples to illustrate.
2. What is the role of the parameter shuffle?
3. What is the role of the parameter test_size?
4. The function does not generate a validation set. What would you do to obtain the desired data splits (train, validation and test)? Answer using pseudo-code (Bonus: Write the code for it so that it can run using some dummy generated data). 

Your answers here:
1. `*arrays` (e.g. the dataset containing name, club, age..., and the array of labels), `test_size` (e.g. 0.3 or 50), `train_size` (e.g. 0.7), `random_state` (e.g. 42), `shuffle` (`True` or `False`) and `stratify` (e.g. an array with the class labels).
2. It is a boolean that controls whether the data is shuffled before splitting. Shuffling should be avoided when there is a temporal dependency between the samples; that is not the case here, so we can shuffle.
3. If it is a float, it represents the proportion of the dataset to include in the test split. If it is an int, it represents the absolute number of test samples.
4. I would call the function twice. The first call separates the test set and leaves a larger temporary training set, which the second call splits into training and validation sets:
   X_trainTemp, X_test, y_trainTemp, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
   X_train, X_val, y_train, y_val = train_test_split(X_trainTemp, y_trainTemp, test_size=0.35, random_state=42)

In [35]:
from sklearn.model_selection import train_test_split
X = np.random.randint(1, 10, size=(1000, 4))
y = np.random.randint(0, 2, size=(1000, 1))
X_trainTemp, X_testT, y_trainTemp, y_testT = train_test_split(X, y, test_size=0.3, random_state=42)
X_trainT, X_valT, y_trainT, y_valT = train_test_split(X_trainTemp, y_trainTemp, test_size=0.35, random_state=42)
print("Train length: " + str(len(X_trainT)))
print("Validation length: " + str(len(X_valT)))
print("Test length: " + str(len(X_testT)))

Train length: 455
Validation length: 245
Test length: 300
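
Note that the second `test_size` is relative to the temporary training set, not to the full dataset, which is why 0.35 yields 245 validation samples (35% of 700). If target proportions for the whole dataset are wanted, one option is to pass absolute sample counts instead. A sketch, assuming a hypothetical helper `three_way_split` and a 60/20/20 split chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=42):
    """Hypothetical helper: split into train/val/test using absolute counts."""
    n = len(X)
    n_test = int(round(test_frac * n))
    n_val = int(round(val_frac * n))
    # First split off the test set, then carve the validation set out of the rest
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Dummy data, as in the cell above
rng = np.random.default_rng(0)
X_dummy = rng.integers(1, 10, size=(1000, 4))
y_dummy = rng.integers(0, 2, size=1000)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_dummy, y_dummy)
print(len(X_tr), len(X_va), len(X_te))
```

Passing integer counts to `test_size` avoids having to work out the relative fraction for the second call by hand.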


The dataset contains a lot of features that can be used to build the model. We will start by using `age, fpl_value, big_club` and `page_views`.

$$\hat{y} = w_0 + w_1 x_{age} + w_2 x_{fplvalue} + w_3 x_{bigclub} + w_4(x_{pageviews})^{1/2}$$

Before training the model, we need to prepare the data so that it can be used for training, validation and testing. The following steps need to be executed to prepare the data:

1. Apply `np.sqrt()` to the values of `page_views`.
2. Transform our variable into a NumPy array with `np.array(variable)`.
3. Add a column of ones to the matrix $\mathbf{X}$ so it can handle the parameter $w_0$.

### Task 2.1 Prepare data
Complete the function `prepare_data(DataFrame)` where indicated so that all the steps listed above are performed.

In [36]:
from sklearn.preprocessing import PolynomialFeatures

def prepare_data(df):
    '''
        INPUT :
        - df : a pandas DataFrame

         OUTPUT :
        - variable_array : The processed array
    ''' 
    # Work on a copy of the relevant fields so that the original DataFrame is not modified.
    # Notice that page_views is not copied here: its square root is added in Step 1.
    variable = df[['age', 'fpl_value', 'big_club']].copy()
    
    # Step 1. Apply np.sqrt() to the values of page_views
    variable['sqrt_page_views'] = np.sqrt(df['page_views']) #YOUR CODE HERE
    
    # Step 2. Transform our variable into a NumPy array with np.array(variable)
    variable_array = np.array(variable) #YOUR CODE HERE
    
    # Step 3. Add a column of ones to the matrix 𝐗 so it can handle the parameter 𝑤0.
    # For this purpose we will use PolynomialFeatures from scikit-learn
    variable_array = PolynomialFeatures(1).fit_transform(variable_array)

    return variable_array

### Question set 2.2 PolynomialFeatures function
Investigate the role of the [Polynomial features function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) from scikit-learn. 
1. Why was the degree of the polynomial set to one in the `prepare_data` function?
2. Given two features $x_1, x_2$, write down the expression that you would obtain by using the function by setting `degree=2`

Your answer here: 
1. To add the column of ones to the matrix $\mathbf{X}$ so it can handle the parameter $w_0$. With `degree=1` the output is the input matrix plus a leading column of ones: no powers or products of the features are created, so the model stays linear in the chosen features.
2. We would obtain: $[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$
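
To check both answers numerically, the sketch below applies `PolynomialFeatures` to a single made-up sample with $x_1 = 2$ and $x_2 = 3$ (the values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.array([[2.0, 3.0]])  # one sample with x1 = 2, x2 = 3

# degree=1: the original features plus a leading column of ones
deg1 = PolynomialFeatures(degree=1).fit_transform(X_toy)
# degree=2: additionally x1^2, x1*x2 and x2^2
deg2 = PolynomialFeatures(degree=2).fit_transform(X_toy)

print(deg1)  # [[1. 2. 3.]]
print(deg2)  # [[1. 2. 3. 4. 6. 9.]]
```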

Now, we execute the function to prepare the data.

In [37]:
#We copy the output label
output_df_train=df_train['market_value'].copy()
#We remove the output label from X
input_df_train=df_train.drop(['market_value'],axis=1)

#process is repeated for test and validation
output_df_val=df_val['market_value'].copy()
input_df_val=df_val.drop(['market_value'],axis=1)

output_df_test=df_test['market_value'].copy()
input_df_test=df_test.drop(['market_value'],axis=1)

#We call prepare_data
X_train = prepare_data(input_df_train)
X_val = prepare_data(input_df_val)
X_test = prepare_data(input_df_test)
y_train = np.array(output_df_train)
y_val = np.array(output_df_val)
y_test = np.array(output_df_test)

We will now proceed to train our first model. In this case, we will use a "home made" implementation of linear regression. When dealing with more complex (and real) applications it is best to use the implementation that can be found in scikit-learn. 

We will define a class called `my_linear_regression` with four methods:
1. `__init__(self)` : Constructor that assigns the object its properties.
2. `fit(self, X, y)` : Learning step of linear regression.
3. `predict(self, X)` : Predicts new labels $\hat{y}$ given an input X.
4. `MSE(self, y_pred, y_test)` : Estimates the mean sum of squared errors between a set of predictions and the ground truth.


### Task 2.2 Mean sum of squared errors
Implement the MSE function in the class below: 

In [38]:
class my_linear_regression:
    def __init__(self):  # constructor: initializes the attributes of the object
        self.X_train = []
        self.y_train = []
        self.weights = []
        
    def fit(self, X, y):
        self.X_train = X
        self.y_train = y
        # Solve the normal equations (X^T X) w = X^T y for the weights w
        self.weights = np.linalg.solve(X.T @ X, X.T @ y)
    
    def predict(self, x_test):  # returns the predictions for the rows of x_test
        # Row-wise weighted sum of the features, equivalent to x_test @ self.weights
        self.y_hat = np.sum(x_test * self.weights, axis=1)
        
        return self.y_hat
    
    def MSE(self,y_pred, y_test) :
        #YOUR CODE HERE
        MSE = 0
        for i in range(len(y_pred)):
            MSE += (y_pred[i]-y_test[i])**2
        MSE = MSE/len(y_pred)
        #YOUR CODE ENDS HERE
        return MSE
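
As a sanity check on the class above, the normal-equation solution and the loop-based MSE can be compared with scikit-learn's `LinearRegression` and `mean_squared_error`. The sketch below uses synthetic data with made-up weights (`w_true`), not the football dataset; `fit_intercept=False` is set because the column of ones is already part of the matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic design matrix: a column of ones plus two random features
X_syn = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
w_true = np.array([1.0, 2.0, -3.0])  # made-up weights
y_syn = X_syn @ w_true + rng.normal(scale=0.1, size=100)

# Normal equations, as in my_linear_regression.fit
w_normal = np.linalg.solve(X_syn.T @ X_syn, X_syn.T @ y_syn)

# scikit-learn equivalent; no extra intercept since X_syn already has the ones column
sk_model = LinearRegression(fit_intercept=False).fit(X_syn, y_syn)

y_hat = X_syn @ w_normal
mse_vectorized = np.mean((y_hat - y_syn) ** 2)  # same quantity as the loop in MSE
mse_sklearn = mean_squared_error(y_syn, y_hat)

print(w_normal, sk_model.coef_)
print(mse_vectorized, mse_sklearn)
```

Both weight vectors and both MSE values should agree up to floating-point precision, and the vectorized expression is a compact alternative to the explicit loop.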

Now we can train our first model. 

In [39]:
model_1=my_linear_regression()
model_1.fit(X_train,y_train)

print(f'The learned model has parameters:\n{model_1.weights}\n')

The learned model has parameters:
[-15.66271385  -0.16641898   4.45892732   6.28285382   0.18420319]



### Question set 2.3: Interpreting the weights
The estimated weights $\mathbf{w}$ (excluding $w_0$) are associated with 'age', 'fpl_value', 'big_club' and 'page_views' (square root), in that order. 
1. How do you interpret the values of each of these parameters? Based on this information, what can you say about the effect on a player's market value of his age, his number of page views and his FPL value?
2. Which of these features seems to have the largest effect on a player's value? 
3. How do you interpret the value obtained for $w_0$?

Your answers here:
1. Each weight (except $w_0$) gives the change in the predicted market value when the corresponding feature increases by one unit and the others are held fixed. The predicted value increases with `big_club` (about 6.28 for playing at a top-6 club), `fpl_value` (about 4.46 per unit of FPL value) and the square root of `page_views` (about 0.18 per unit), and it decreases slightly with `age` (about 0.17 per year).
2. Judging by the size of the weights, `big_club` has the largest effect, followed by `fpl_value`. This comparison is only approximate because the features are on different scales: the square root of `page_views` spans a much wider range than the binary `big_club`, so its small weight can still produce a large change in the prediction.
3. $w_0$ is the predicted value when every feature is zero. It is negative and larger in magnitude than the other weights, but a player with age 0 and an FPL value of 0 lies far outside the data, so the intercept has no direct practical meaning. It does imply that some combinations of age, club and so on can yield a negative predicted value, which is not meaningful if we assume that every player has a positive market value.
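
Raw weights are only comparable when the features share a scale. One rough way to compare them, sketched below on made-up numbers, is to multiply each weight by the standard deviation of its feature, which gives the change in the prediction for a typical variation of that feature. The helper name `effect_per_std` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def effect_per_std(weights, X):
    """Hypothetical helper: |w_j| * std(x_j) for every column except the bias."""
    return np.abs(weights[1:]) * X[:, 1:].std(axis=0)

# Toy design matrix: bias column, a binary feature, and a wide-ranging feature
X_toy = np.array([
    [1.0, 0.0,  0.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 40.0],
    [1.0, 1.0, 60.0],
])
w_toy = np.array([0.0, 6.0, 0.2])  # large weight on the binary feature, small on the other

effects = effect_per_std(w_toy, X_toy)
print(effects)
```

In this toy case the feature with the small weight ends up with the larger typical effect, which is the situation the square root of `page_views` may be in.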

## Exercise 3: Adding categorical features
It is well known that the position where a football player plays has an impact on his market value. Midfielders and strikers tend to be more expensive. Your goal now is to include this information in the model.

As seen from the description, the player position is encoded as a numeric variable (1, 2, 3, 4). However, these numbers represent categories, not quantities. Categorical variables are commonly encoded with a scheme called 1-of-K encoding, which converts a variable representing K different categories into K different binary variables. Example:

| **attacker**   |  **midfielder**      |  **defender** | **goalkeeper** |
|-------------|-------------|-------------|-------------|
| 1 | 0 | 0 | 0|
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 

### Question 3.1: Adding the position to the model
Write down the expression of the model if you consider the position of the player using 1-of-K encoding.

Your answer here:
$$\hat{y} = w_0 + w_1 x_{age} + w_2 x_{fplvalue} + w_3 x_{bigclub} + w_4(x_{pageviews})^{1/2} + w_5 x_{attacker}  + w_6 x_{midfielder}  + w_7 x_{defender} + w_8 x_{goalkeeper}$$

where $x_{attacker}, x_{midfielder}, x_{defender}$ and $x_{goalkeeper}$ are 0 or 1 depending on the position of the player, and exactly one of them is 1 for each player.

### Task 3.1 Preparing data with position features
We need to modify the data preparation function so that it now includes the categorical features. For this purpose, we have implemented the function `prepare_data_with_position(df)`. It has the same functionality as `prepare_data(df)` and additionally generates the 1-of-K encoding. 

Complete the missing code in the function.

In [40]:
def prepare_data_with_position(df):
    variable = df[['age', 'fpl_value', 'big_club']].copy()
    variable['sqrt_page_views'] = np.sqrt(df['page_views']) #YOUR CODE HERE

    variable=variable.join(pd.get_dummies(df.position_cat, prefix='pos')) # get_dummies to create 1-of-K encoding, join to add the new columns
    variable_array = np.array(variable) # YOUR CODE HERE
    variable_array = PolynomialFeatures(1).fit_transform(variable_array)
    
    return variable_array

### Question 3.2 The get_dummies function
Explain what the following line of code is doing:

`variable=variable.join(pd.get_dummies(df.position_cat, prefix='pos'))`

Your answer here:
It builds a matrix that, for each player, has a 1 in the column matching the player's position and a 0 in every other column, and then joins these new columns to `variable`. Here `pos_1` is attacker, `pos_2` midfielder, `pos_3` defender and `pos_4` goalkeeper. For example:

| | **pos_1**   |  **pos_2**      |  **pos_3** | **pos_4** |
|-|-------------|-------------|-------------|-------------|
|player1| 1 | 0 | 0 | 0|
|player2| 0 | 0 | 1 | 0 |
|player3| 0 | 1 | 0 | 0 |
|player4| 0 | 0 | 0 | 1 | 
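
A small runnable illustration of that line, using four made-up players (the `toy` DataFrame is hypothetical). `dtype=int` is passed so that the dummies print as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [28, 31, 25, 35],
    "position_cat": [1, 3, 2, 4],  # attacker, defender, midfielder, goalkeeper
})

variable = toy[["age"]].copy()
# One binary column per category, named pos_<category>
dummies = pd.get_dummies(toy.position_cat, prefix="pos", dtype=int)
# join aligns the new columns with the existing rows through the index
variable = variable.join(dummies)

print(variable)
```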


### Task 3.2 Train the new model
Your task now is to train the new model. For this you will need to execute the following steps: 
1. Prepare all your data (train, validation and testing). 
2. Create a new `my_linear_regression` object and store it in a variable named `model_2`
3. Run the learning process
4. For inspection purposes, print out the obtained weights.

**Important:** While preparing the data, make sure you do not overwrite the data previously used for `model_1`.

In [41]:
#Your code here

#We call prepare_data_with_position
X_train2 = prepare_data_with_position(input_df_train)
X_val2 = prepare_data_with_position(input_df_val)
X_test2 = prepare_data_with_position(input_df_test)
y_train2 = np.array(output_df_train)
y_val2 = np.array(output_df_val)
y_test2 = np.array(output_df_test)

#Create a new my_linear_regression object and store it in a variable named model_2
model_2=my_linear_regression()

#Run the learning process
model_2.fit(X_train2,y_train2)

#For inspection purposes, print out the obtained weights.
print(f'The learned model has parameters:\n{model_2.weights}\n')

The learned model has parameters:
[ 2.30116953e+02 -2.32675167e-01  5.55326834e+00  4.67947211e+00
  1.83379660e-01 -2.52639097e+02 -2.47779278e+02 -2.47446714e+02
 -2.48586366e+02]



### Question 3.3 Value of the position
Based on the obtained weights, does it seem as if the position of the player has an important role in his market value?

Your answer here:

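The individual position weights (around -250) should not be read on their own: together with the large intercept (about 230) they mostly cancel out, so only the differences between them are informative. These differences reach about 5.2 in absolute value (attacker versus defender), which is much larger than the weights of the age and of the square root of the page views, and of the same order as the weights of the FPL value (5.55) and of the club (4.68). Taking this into account, I would say that the position is important, but not clearly more important than the FPL value or the club where the player plays.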
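
The very large intercept and position weights are a hint that the design matrix is rank deficient: the four position columns always add up to the column of ones, so the model has one more parameter than the data can identify. The sketch below checks this on a toy matrix and shows one common remedy, dropping one dummy column with `drop_first=True`; whether that is the approach intended for this exercise is an assumption.

```python
import numpy as np
import pandas as pd

positions = pd.Series([1, 2, 3, 4, 1, 2, 3, 4], name="position_cat")
ones = np.ones((len(positions), 1))

# Bias column plus all four dummies: 5 columns
full = np.hstack([
    ones,
    pd.get_dummies(positions, prefix="pos", dtype=float).to_numpy(),
])
# Bias column plus three dummies (the first category becomes the reference): 4 columns
reduced = np.hstack([
    ones,
    pd.get_dummies(positions, prefix="pos", drop_first=True, dtype=float).to_numpy(),
])

print(full.shape[1], np.linalg.matrix_rank(full))        # 5 columns, rank 4
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 4 columns, rank 4
```

With the reduced encoding, each remaining position weight reads directly as the difference with respect to the reference category.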

## Exercise 4: Choosing a model
We will now use the validation set to choose between the two models we have built so far. 

### Task 4.1 MSE estimation
Using the validation data, estimate the MSE for each of the two models that you have built so far. For this you will need to: 
1. Predict labels for the validation set using each of the trained models.
2. Call the MSE function from either of the two models (they are equivalent).

In [42]:
#------------YOUR CODE HERE ------------

labels1 = model_1.predict(X_val)
labels2 = model_2.predict(X_val2)
mse_1 = model_1.MSE(labels1, y_val)
mse_2 = model_2.MSE(labels2, y_val2)

#------------ YOUR CODE ENDS HERE ---------

print(f'MSE model 1 :\n{mse_1}\n')
print(f'MSE model 2 :\n{mse_2}\n')

MSE model 1 :
71.48818413976552

MSE model 2 :
61.93147668530671



### Question set 4.1 Analysis
1. Based on the obtained results, which model would you choose?
2. Is the position feature useful to improve the model? 

Your answer here:
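1. I would choose the second model because its MSE on the validation set is smaller than that of the first model (61.93 versus 71.49).
2. Yes, it is useful because it reduces the validation MSE, which means that the predictions are more accurate on data that was not used for training.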

## Exercise 5: Model testing
Use the test dataset to evaluate the generalization capabilities of the **model you chose** in the previous step. For this you need to:
1. Predict the labels of the test set
2. Estimate the MSE. Please note that other metrics, such as the RSS, could be used as well.

In [43]:
#------------YOUR CODE HERE ------------

labels3 = model_2.predict(X_test2)
mse = model_2.MSE(labels3, y_test2)

#------------ YOUR CODE ENDS HERE ---------

print(f'MSE test:\n{mse}\n')

MSE test:
34.443759078098324



### Question 5.1 Analysis
Based on the previous result, what can you say about your model? Do you consider it makes sufficiently accurate predictions? Feel free to implement other metrics if you consider you need further information. Examples: RSS, Root Mean Squared Error or Mean Absolute Error. 

Your answer here: 

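The MSE is expressed in squared units, so it is easier to judge through its square root: an MSE of 34.443759078 corresponds to an RMSE of about 5.87. Taking into account that the mean market value in the full dataset is 11.012039 and its standard deviation is 12.257403, a typical error of that size is about half of the mean value, which is still quite large. The model needs to improve because its accuracy is not good enough. This could be achieved by taking more variables into account or by using another model. 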
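
Metrics expressed in the units of the target are often easier to judge than the MSE. A minimal sketch of RMSE and MAE on made-up numbers, followed by the RMSE implied by the test MSE reported above; the function names `rmse` and `mae` are illustrative choices:

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def mae(y_pred, y_true):
    """Mean absolute error, less sensitive to a few large errors than RMSE."""
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

# Made-up predictions with a single large error
y_true_toy = np.array([1.0, 2.0, 3.0])
y_pred_toy = np.array([1.0, 2.0, 6.0])
print(rmse(y_pred_toy, y_true_toy), mae(y_pred_toy, y_true_toy))

# RMSE implied by the test MSE printed above
test_rmse = np.sqrt(34.443759078098324)
print(round(test_rmse, 2))
```

On the real data, the same functions could be called with `labels3` and `y_test2` from the previous cell.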