# Part 2: Linear Regression


In this part, we will work with a dataset scraped by [Shubham Maurya](https://www.kaggle.com/mauryashubham/linear-regression-to-predict-market-value/data), which collects facts about players in the English Premier League as of 2017. His original goal was to establish whether there was a relationship between a player's popularity and his market value, as estimated by transfermarkt.com.

**Your goal is to fit a model able to predict a player's market value.**

## The dataset

The dataset contains the following information:

| **Field** | **Description** |
|-----------|-----------------|
| name | Name of the player |
| club | Club of the player |
| age | Age of the player |
| position | The usual position on the pitch |
| position_cat | 1 for attackers, 2 for midfielders, 3 for defenders, 4 for goalkeepers |
| market_value | As on transfermarkt.com on July 20th, 2017 |
| page_views | Average daily Wikipedia page views from September 1, 2016 to May 1, 2017 |
| fpl_value | Value in Fantasy Premier League as of July 20th, 2017 |
| fpl_sel | % of FPL players who have selected that player in their team |
| fpl_points | FPL points accumulated over the previous season |
| region | 1 for England, 2 for EU, 3 for Americas, 4 for Rest of World |
| nationality | Player's nationality |
| new_foreign | Whether a new signing from a different league, for 2017/18 (till 20th July) |
| age_cat | A categorical version of the age feature |
| club_id | A numerical version of the club feature |
| big_club | Whether one of the Top 6 clubs |
| new_signing | Whether a new signing for 2017/18 (till 20th July) |

## Exercise 1: Exploring the data
The first step is to explore your data.

We will start with the necessary imports. In this exercise, we will be working with the library `pandas`. If you are not familiar with it, we recommend that you follow the introductory exercises in the course's GitHub repository.

In [None]:
import numpy as np
import pandas as pd

We will now proceed to read the dataset:

In [None]:
league_df = pd.read_csv('data/football_data.csv') #Reads a CSV file

### Task 1.1: Using pandas for data exploration
Use the method `name_dataframe.head(N)`, where `N` is the number of rows to show, to look at the first entries of the dataframe.

Then, use the method `name_dataframe.describe(include='all')` to generate descriptive statistics that summarize each field of the dataframe.

Finally, print `name_dataframe.dtypes` to see the data type associated with each field in the table.
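
As a rough illustration of these three calls, here is a minimal sketch on a made-up dataframe. The name `toy_df` and its columns are hypothetical and unrelated to the exercise dataset; the task itself should be solved on `league_df`.

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the three calls
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [21, 25, 30, 28],
    "value": [1.5, 12.0, 7.25, 3.0],
})

first_rows = toy_df.head(2)               # first 2 rows of the dataframe
summary = toy_df.describe(include="all")  # statistics for numeric and non-numeric columns
column_types = toy_df.dtypes              # one data type per column

print(first_rows)
print(summary)
print(column_types)
```

With `include='all'`, statistics that do not apply to a column (for example the mean of a text column) are reported as `NaN`.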

In [None]:
#Your code for head


In [None]:
#Your code for describe


In [None]:
#Your code for dtypes


### Question set 1.1: About the data
1. What is the name of the player appearing in the 7th record of the dataset?
2. What is the mean age in the English Premier League (in 2017)? 
3. What fields store a continuous value?

Your answers here:

## Exercise 2: Data splits, data preparation and training
Before starting the training procedure, we need to split the data into the training, validation and test sets.

In this exercise, the data has already been split for you.

In [None]:
#Loading the splits
df_train = pd.read_csv('data/league_train.csv')
df_val = pd.read_csv('data/league_val.csv')
df_test = pd.read_csv('data/league_test.csv')

Alternatively, for the type of data used in this exercise, the library `scikit-learn` provides the function `train_test_split`, which splits the data automatically.
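
For orientation, a minimal sketch of a single call to `train_test_split` on dummy arrays could look as follows. The arrays `X_dummy` and `y_dummy` are made up for illustration, and the values chosen for `test_size` and `random_state` are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 10 samples with 2 features each, plus one label per sample
X_dummy = np.arange(20).reshape(10, 2)
y_dummy = np.arange(10)

# Hold out 20% of the samples; random_state makes the shuffling reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
```

The function returns the pieces in the order train inputs, test inputs, train labels, test labels, and keeps each input row paired with its label.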

### Question set 2.1 Train_test_split
Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of the `train_test_split` function:
1. What parameters does it receive as input? Provide illustrative examples.
2. What is the role of the parameter shuffle?
3. What is the role of the parameter test_size?
4. The function does not generate a validation set. What would you do to obtain the desired data splits (train, validation and test)? Answer using pseudo-code (Bonus: Write the code for it so that it can run using some dummy generated data). 

Your answers here:

The dataset contains a lot of features that can be used to build the model. We will start by using `age, fpl_value, big_club` and `page_views`.

$$\hat{y} = w_0 + w_1 x_{age} + w_2 x_{fplvalue} + w_3 x_{bigclub} + w_4(x_{pageviews})^{1/2}$$

Before training the model, we need to prepare the data so that it can be used for training, validation and testing. The following steps need to be executed to prepare the data:

1. Apply `np.sqrt()` to the values of `page_views`.
2. Convert the variable into a NumPy array with `np.array(variable)`.
3. Add a column of ones to the matrix $\mathbf{X}$ so it can handle the parameter $w_0$.
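
Step 3 can be illustrated in isolation. The sketch below uses a made-up array `toy_X` (not the exercise data) to show that `PolynomialFeatures(1)` prepends a column of ones to the features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up feature matrix with 2 samples and 2 features
toy_X = np.array([[2.0, 3.0],
                  [4.0, 5.0]])

# Degree 1 with the default include_bias=True adds a leading column of ones
toy_X_bias = PolynomialFeatures(1).fit_transform(toy_X)

print(toy_X_bias)
# [[1. 2. 3.]
#  [1. 4. 5.]]
```

Multiplying the first column by $w_0$ adds the same constant to every prediction, which is how the intercept enters the model.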

### Task 2.1 Prepare data
Complete the function `prepare_data(DataFrame)` where indicated so that all the steps listed above are performed.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

def prepare_data(df):
    '''
        INPUT :
        - df : a pandas DataFrame

         OUTPUT :
        - variable_array : The processed array
    ''' 
    # We copy the relevant fields from the DataFrame so that the original DataFrame is not modified.
    # Notice that the page_views field is not copied here; its square root is added in Step 1.
    variable = df[['age', 'fpl_value', 'big_club']].copy()
    
    #Step 1.  Apply the np.sqrt( ) on the values of page_views
    variable['sqrt_page_views'] = # YOUR CODE HERE
    
    # Step 2. Convert variable into a NumPy array with np.array(variable)
    variable_array = #YOUR CODE HERE
    
    # Step 3. Add a column of ones to the matrix 𝐗 so it can handle the parameter 𝑤0.
    # For this purpose we will use the function PolynomialFeatures from scikit-learn
    variable_array = PolynomialFeatures(1).fit_transform(variable_array)

    return variable_array

### Question set 2.2 PolynomialFeatures function
Investigate the role of the [Polynomial features function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) from scikit-learn. 
1. Why was the degree of the polynomial set to one in the `prepare_data` function?
2. Given two features $x_1, x_2$, write down the expression that you would obtain by using the function by setting `degree=2`

Your answer here: 

Now, we execute the function to prepare the data.

In [None]:
#We copy the output label
output_df_train=df_train['market_value'].copy()
#We remove the output label from X
input_df_train=df_train.drop(['market_value'],axis=1)

#process is repeated for test and validation
output_df_val=df_val['market_value'].copy()
input_df_val=df_val.drop(['market_value'],axis=1)

output_df_test=df_test['market_value'].copy()
input_df_test=df_test.drop(['market_value'],axis=1)

#We call prepare_data
X_train = prepare_data(input_df_train)
X_val = prepare_data(input_df_val)
X_test = prepare_data(input_df_test)
y_train = np.array(output_df_train)
y_val = np.array(output_df_val)
y_test = np.array(output_df_test)

We will now train our first model. In this case, we will use a "homemade" implementation of linear regression. For more complex (and real) applications, it is best to use the implementation available in scikit-learn.

We will define a class called my_linear_regression with four methods:
1. `__init__(self)` : Constructor for the object to assign the object its properties
2. `fit(self, X, y)` : Learning step of linear regression.
3. `predict(self, X)` : predicts new labels $\hat{y}$ given an input X
4. `MSE(self,y_pred, y_test)` : Computes the mean of the squared errors between a set of predictions and the ground truth.
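
The `fit` method in the class below obtains the weights by solving the linear system $\mathbf{X}^T\mathbf{X}\,\mathbf{w} = \mathbf{X}^T\mathbf{y}$ (the normal equations). As a sanity check, the following sketch applies the same `np.linalg.solve` call to synthetic, noise-free data. All names (`X_toy`, `y_toy`, `true_w`) and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50

# Synthetic design matrix: a column of ones followed by two random features
features = rng.normal(size=(n_samples, 2))
X_toy = np.hstack([np.ones((n_samples, 1)), features])

# Labels generated from known weights, without noise
true_w = np.array([1.0, 2.0, -3.0])
y_toy = X_toy @ true_w

# Same computation as in fit: solve (X^T X) w = X^T y
w_hat = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Predictions are a weighted sum of the columns of X
y_hat = X_toy @ w_hat

print(w_hat)
```

Because the labels contain no noise, the recovered weights should match `true_w` up to floating-point error. Note that `X_toy @ w_hat` gives the same result as the expression `np.sum(X_toy * w_hat, axis=1)` used in `predict`.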


### Task 2.2 Mean squared error
Implement the MSE function in the class below: 

In [None]:
class my_linear_regression:
    def __init__(self) : # initialize constructor for the object to assign the object its properties
        self.X_train = []
        self.y_train = []
        self.weights = []
        
    def fit(self, X, y) :
        self.X_train = X
        self.y_train = y
        self.weights = np.linalg.solve(X.T@X,X.T@y)
    
    def predict(self,x_test) : # computes predictions as the weighted sum of the features in each row of x_test
        self.y_hat=np.sum(x_test*self.weights,axis=1)
        
        return self.y_hat
    
    def MSE(self,y_pred, y_test) :
        #YOUR CODE HERE
        
        #YOUR CODE ENDS HERE
        return MSE

Now we can train our first model. 

In [None]:
model_1=my_linear_regression()
model_1.fit(X_train,y_train)

print(f'The learned model has parameters:\n{model_1.weights}\n')

### Question set 2.3: Interpreting the weights
The estimated weights $\mathbf{w}$ (excluding $w_0$) are associated with 'age', 'fpl_value', 'big_club' and the square root of 'page_views', in that order. 
1. How do you interpret the values of each of these parameters? Based on this information, what can you say about the effect on a player's market value of his age? His number of page views? His FPL value?
2. Which of these features seems to have the largest effect on a player's value? 
3. How do you interpret the value obtained for $w_0$?

## Exercise 3: Adding categorical features
It is well known that the position in which a football player plays has an impact on his market value. Midfielders and strikers tend to be more expensive. Your goal now is to include this information in the model.

As seen from the description, the player position is encoded as a numeric variable (1, 2, 3, 4). However, these numbers represent categories, not quantities. Categorical variables are commonly encoded with a scheme known as 1-of-K (or one-hot) encoding, which converts a variable representing K different categories into K binary variables. Example:

| **attacker**   |  **midfielder**      |  **defender** | **goalkeeper** |
|-------------|-------------|-------------|-------------|
| 1 | 0 | 0 | 0|
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 
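
A table like the one above can be produced with `pd.get_dummies`. The sketch below uses a made-up series `toy_positions`; passing `dtype=int` is a choice made here so that the output prints as 0/1 instead of booleans.

```python
import pandas as pd

# Made-up position categories for five players (1 to 4, as in position_cat)
toy_positions = pd.Series([1, 3, 2, 4, 1], name="position_cat")

# One binary column per category; prefix controls the column names
one_hot = pd.get_dummies(toy_positions, prefix="pos", dtype=int)

print(one_hot)
```

Each row contains exactly one 1, located in the column of the player's category.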

### Question 3.1: Adding the position to the model
Write down the expression of the model if you consider the position of the player using 1-of-K encoding.

Your answer here:

### Task 3.1 Preparing data with position features
We need to modify the data preparation function so that it now includes the categorical features. For this purpose, we have implemented the function `prepare_data_with_position(df)`. It provides the same functionality as `prepare_data(df)` and additionally generates the 1-of-K encoding. 

Complete the missing code in the function.

In [None]:
def prepare_data_with_position(df):
    variable = df[['age', 'fpl_value', 'big_club']].copy()
    variable['sqrt_page_views'] =  #YOUR CODE HERE

    variable=variable.join(pd.get_dummies(df.position_cat, prefix='pos')) # get_dummies to create 1-of-K encoding, join to add the new columns
    variable_array = # YOUR CODE HERE
    variable_array = PolynomialFeatures(1).fit_transform(variable_array)
    
    return variable_array

### Question 3.2 The get_dummies function
Explain what the following line of code is doing:

`variable=variable.join(pd.get_dummies(df.position_cat, prefix='pos'))`

### Task 3.2 Train the new model
Your task now is to train the new model. For this you will need to execute the following steps: 
1. Prepare all your data (train, validation and testing). 
2. Create a new `my_linear_regression` object and store it in a variable named `model_2`
3. Run the learning process
4. For inspection purposes, print out the obtained weights.

**Important:** While preparing the data, make sure you do not overwrite the data previously used for model_1.

In [None]:
#Your code here


### Question 3.3 Value of the position
Based on the obtained weights, does it seem as if the position of the player has an important role in his market value?

Your answer here:

## Exercise 4: Choosing a model
We will now use the validation set to choose between the two models we have built so far. 

### Task 4.1 MSE estimation
Using the validation data, estimate the MSE for each of the two models that you have built so far. For this you will need to: 
1. Predict labels for the validation set using each of the trained models.
2. Call the MSE function from either of the two models (the result is the same).

In [None]:
#------------YOUR CODE HERE ------------

#------------ YOUR CODE ENDS HERE ---------

print(f'MSE model 1 :\n{mse_1}\n')
print(f'MSE model 2 :\n{mse_2}\n')

### Question set 4.1 Analysis
1. Based on the obtained results, which model would you choose?
2. Is the position feature useful to improve the model? 

## Exercise 5: Model testing
Use the test dataset to evaluate the generalization capabilities of the **model you chose** in the previous step. For this you need to:
1. Predict the labels of the test set
2. Estimate the MSE. Please note that other metrics, such as the RSS, could be used as well.

In [None]:
#------------YOUR CODE HERE ------------

#------------ YOUR CODE ENDS HERE ---------

print(f'MSE test:\n{mse}\n')

### Question 5.1 Analysis
Based on the previous result, what can you say about your model? Do you consider it makes sufficiently accurate predictions? Feel free to implement other metrics if you consider you need further information. Examples: RSS, Root Mean Squared Error or Mean Absolute Error. 
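
As a starting point for such additional metrics, here is a minimal sketch of the residual sum of squares (RSS) and the mean absolute error (MAE) on made-up arrays. The names `y_true_toy` and `y_pred_toy` are hypothetical; in the exercise you would use your test labels and predictions instead.

```python
import numpy as np

# Made-up ground truth and predictions
y_true_toy = np.array([10.0, 20.0, 30.0])
y_pred_toy = np.array([12.0, 18.0, 33.0])

residuals = y_true_toy - y_pred_toy   # [-2, 2, -3]

rss = np.sum(residuals ** 2)          # 4 + 4 + 9 = 17
mae = np.mean(np.abs(residuals))      # (2 + 2 + 3) / 3

print(f'RSS: {rss}')
print(f'MAE: {mae}')
```

The MAE is expressed in the same units as the market value, which can make it easier to judge than a squared quantity.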

Your answer here: 