# Homework: Multiple regression and exploring the Football data

Let's move on to a truly interesting dataset. The data imported below records various facts about players in the English Premier League. Our goal will be to fit models that predict the players' market value (what the player could earn when hired by a new team).<br>

**name**: Name of the player<br>
**club**: Club of the player<br>
**age**: Age of the player<br>
**position**: The usual position on the pitch<br>
**position_cat**: 1 for attackers, 2 for midfielders, 3 for defenders, 4 for goalkeepers<br>
**market_value**: Value on transfermarkt.com as of July 20th, 2017<br>
**page_views**: Average daily Wikipedia page views from September 1, 2016 to May 1, 2017<br>
**fpl_value**: Value in Fantasy Premier League as of July 20th, 2017<br>
**fpl_sel**: % of FPL players who have selected that player in their team<br>
**fpl_points**: FPL points accumulated over the previous season<br>
**region**: 1 for England, 2 for EU, 3 for Americas, 4 for Rest of World<br>
**nationality**: Player's nationality<br>
**new_foreign**: Whether a new signing from a different league, for 2017/18 (until July 20th)<br>
**age_cat**: A categorical version of the age feature<br>
**club_id**: A numerical version of the club feature<br>
**big_club**: Whether one of the Top 6 clubs<br>
**new_signing**: Whether a new signing for 2017/18 (until July 20th)<br>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from prettytable import PrettyTable

In [None]:
league_df = pd.read_csv("https://raw.githubusercontent.com/univai-ghf/ghfmedia/main/data/Regression/league_data.txt")
print(league_df.dtypes)
league_df.head()

In [None]:
league_df.shape

In [None]:
league_df.describe()

Check for NULL/empty values

In [None]:
# your code here


If there are NULL values, deal with them. <br>
If the NULL value is in a numeric column, try replacing it with the mean or median of that column/predictor. <br>
If the NULL value is in a categorical column, eliminate the entire row. <br>
**Hint:** To eliminate the entire row, you can use the pandas DataFrame dropna() method with the subset argument to select the appropriate column. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html <br>
Be sure to check again that the NULL values have been dealt with appropriately before moving forward.
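
As a rough illustration of the two strategies described above, here is a minimal sketch on a made-up DataFrame. The column names `height` and `team` are hypothetical and are not part of the league data; which columns in `league_df` actually contain NULLs is for you to find out.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (NOT the league data).
toy = pd.DataFrame({
    "height": [180.0, np.nan, 175.0, 190.0],  # numeric, one missing value
    "team": ["red", "blue", None, "red"],     # categorical, one missing value
})

# Count the missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Numeric column: replace the missing value with the median of the observed values.
toy["height"] = toy["height"].fillna(toy["height"].median())

# Categorical column: drop every row where that column is missing.
toy = toy.dropna(subset=["team"])

# Re-check: the total number of remaining missing values.
print(toy.isnull().sum().sum())
```

The median is used here rather than the mean only because it is less sensitive to extreme values; either choice matches the instructions above.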

In [None]:
# your code here


We want to fit the following model on the football data.

market_value $\approx$ $\beta_{0}$ + $\beta_{1}$ fpl_points + $\beta_{2}$ age + $\beta_{3}$ age<sup>2</sup> + $\beta_{4}$ $\log_2$(page_views) + $\beta_{5}$ new_signing + $\beta_{6}$ big_club + $\beta_{7}$ position_cat

We're including a 2nd degree polynomial in age because we expect pay to increase as a player gains experience, but then decrease as they continue aging. We're taking the log of page views because they have such a large, skewed range and the transformed variable will have fewer outliers that could bias the line. We choose the base of the log to be 2 just to make interpretation cleaner.
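
To make the effect of these two transformations concrete, here is a small sketch on invented numbers (the ages, page views, and coefficient values are all hypothetical, not taken from the league data). It shows that each doubling of page views adds exactly 1 to the base-2 log, and that a quadratic in age with a positive linear term and a negative squared term peaks at $-\beta_2 / (2\beta_3)$.

```python
import numpy as np
import pandas as pd

# Toy data with invented values (NOT the league data).
toy = pd.DataFrame({
    "age": [18, 24, 30, 36],
    "page_views": [250, 500, 1000, 8000],
})

# Squared term and base-2 log, built as new columns.
toy["age_squared"] = toy["age"] ** 2
toy["log_views"] = np.log2(toy["page_views"])
print(toy)

# Doubling page views (250 -> 500 -> 1000) adds 1 to log_views;
# an 8x jump (1000 -> 8000) adds 3.
steps = toy["log_views"].diff().dropna().to_numpy()
print(steps)

# Hypothetical coefficients for age and age^2, chosen only for illustration.
b_age, b_age_squared = 1.2, -0.02
# Vertex of the parabola: the age at which the fitted age effect is largest.
peak_age = -b_age / (2 * b_age_squared)
print(peak_age)
```

With a base-2 log, the coefficient on the transformed column can be read as the change in predicted market value associated with a doubling of page views.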



In the league_df, we need to add 2 more columns: age_squared and log_views<br>


In [None]:
# your code here


Splitting the Data using train_test_split.<br>
We want to make sure that the training and test data have appropriate representation of each region; it would be bad for the training data to entirely miss a region. This is especially important because some regions are rather rare.<br>
Thus we make use of stratified sampling based on the region column.
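
A small sketch of what stratification does, using a made-up frame in which one group is rare (the column names `x` and `group` are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: group "b" is rare (10 of 100 rows). NOT the league data.
toy = pd.DataFrame({
    "x": range(100),
    "group": ["a"] * 90 + ["b"] * 10,
})

# Stratifying on "group" keeps the 90/10 proportion in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["group"], random_state=0
)

print(toy_train["group"].value_counts())
print(toy_test["group"].value_counts())
```

Without `stratify`, a random 20-row test set could by chance contain no rows from the rare group, or most of them.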

In [None]:
train_data, test_data = train_test_split(league_df, test_size = 0.2,stratify=league_df['region'],random_state=42)

In [None]:
cols=['fpl_points','age','age_squared','log_views','new_signing','big_club','position_cat']

Now, using train_data and test_data, create X_train, y_train, X_test and y_test. Use the cols list to select the columns for X_train and X_test; the response for y_train and y_test is market_value.

In [None]:
# your code here


Print the shapes of X_train, X_test, y_train and y_test. <br>
**Hint**: X_train should be of shape (368,7), X_test should be of shape (92,7), y_train should be of shape (368,) and y_test should be of shape (92,). <br> If the shapes do not match, there is an error in one of the steps above; go back and check.

In [None]:
# your code here


Print the head of X_train to check that you have selected the correct predictors.

In [None]:
# your code here


Fit the Linear Regression Model on X_train and y_train

In [None]:
# your code here


Print the intercept and coefficients of the model. (Try using PrettyTable to print them as a table for better readability.)

In [None]:
# your code here


Print the R<sup>2</sup> score for the training and test data
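
For reference, here is a minimal sketch of computing R<sup>2</sup> on both splits, using synthetic data rather than the league data. The variable names (`X_toy`, `toy_model`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a known linear signal plus noise (NOT the league data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_model = LinearRegression().fit(X_tr, y_tr)

# r2_score takes the true values first, then the predictions.
train_r2 = r2_score(y_tr, toy_model.predict(X_tr))
test_r2 = r2_score(y_te, toy_model.predict(X_te))
print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")

# LinearRegression.score computes the same quantity directly.
print(toy_model.score(X_te, y_te))
```

A test score far below the training score is a sign of overfitting; scores that are close suggest the model generalizes about as well as it fits.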

In [None]:
# your code here


We have an error in how we've included player position. Even though the variable is numeric (1, 2, 3, 4) and the model runs without issue, the coefficient we get back is meaningless. The interpretation, such as it is, is that there is an equal effect of moving from position category 1 to 2, from 2 to 3, and from 3 to 4, and that this effect is about -0.73.

In reality, we don't expect moving from one position category to another to be equivalent, nor for a move from category 1 to category 3 to be twice as important as a move from category 1 to category 2. We need to introduce better features to model this variable.

We can use pandas' get_dummies to one-hot encode position_cat. Remember to do it for both X_train and X_test.
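
Here is a minimal sketch of one-hot encoding a coded category on toy train and test frames. The column names `pts` and `pos` are hypothetical stand-ins, and using `drop_first=True` (which treats the first category as the baseline absorbed by the intercept) is one common choice, not necessarily the one this homework expects.

```python
import pandas as pd

# Toy frames (NOT the league data); "pos" is a coded category 1-4.
toy_train = pd.DataFrame({"pts": [10, 20, 30, 40], "pos": [1, 2, 3, 4]})
toy_test = pd.DataFrame({"pts": [15, 25], "pos": [2, 1]})  # categories 3 and 4 are absent

# Encode the training frame; drop_first removes pos_1, making it the baseline.
train_enc = pd.get_dummies(toy_train, columns=["pos"], drop_first=True, dtype=int)

# Encode the test frame, then align its columns to the training columns.
# Indicator columns missing from the test split are added and filled with 0.
test_enc = pd.get_dummies(toy_test, columns=["pos"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc)
print(test_enc)
```

Each indicator now gets its own coefficient, so the model no longer forces the step from category 1 to 2 to equal the step from 2 to 3. The `reindex` step matters whenever a split happens not to contain every category, since the fitted model expects the same columns in the same order at prediction time.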

In [None]:
# your code here


Print out the DataFrames after one-hot encoding

In [None]:
# your code here


In [None]:
# your code here


Fit a Linear Regression Model

In [None]:
# your code here


Print the intercept and coefficients of the model. (Try using PrettyTable to print them as a table for better readability.)

In [None]:
# your code here


Print the R<sup>2</sup> score for the training and test data

In [None]:
# your code here
