# Assignment 4: Constructing Confidence Intervals

Once you are finished, make sure to complete the following steps.
1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.
2.  Fix any errors which result from this.
3.  Repeat steps 1. and 2. until your notebook runs without errors.
4.  Submit your completed notebook to OWL by the deadline.

## Global Toolbox

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as ss
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy.stats import t
import matplotlib.pyplot as plt
seed=106
np.random.seed(seed)

## Question 1 - <span style="color:green">[100]</span>
You are going to work on a dataset that lists certain attributes of the soccer players who participated in the soccer World Cup. We want to predict a player's value and quantify the uncertainty of our prediction using what we learned in week 4. The dataset has the following attributes:
- `Age`: Player age in years
- `Overall`: Player overall performance score (higher better)
- `Potential`: Player potential score (higher better)
- `Value`: Player value, *i.e.*, the amount of money in euros a club should pay in order to purchase the player (higher better)
- `Wage`: Player stipend in euros (higher better)
- `Preferred Foot`: Player preferred foot to play
- `International Reputation`: Player international fame (higher better)
- `Week Foot`: Performance score of player weak foot (higher better)
- `Skill Moves`: Player move skill score (higher better)
- `Body Type`: Player body type
- `Position`: Position player holds on the pitch
- `Height`: Player height in centimeters
- `Weight`: Player weight in kilograms

### Q 1.1 - <span style="color:red">[1]</span> - Load `data.csv` and show the first 5 rows. What is the target attribute?

In [None]:
# Load the data; the target attribute is `Value`, the quantity we want to predict
df = pd.read_csv("A4_data.csv")
print(df.head())

### Q 1.2 - <span style="color:red">[3]</span> - Use a relevant pandas method to reveal the `Dtype` of the features and indicate whether the data set has any `Null` values. Also, do you see any categorical attributes? Please name them.

In [None]:
# df.info() prints its report directly and returns None, so it should not be wrapped in print()
df.info()

#### *Answer*


There are no null values. The categorical attributes are `Preferred Foot`, `Body Type`, and `Position`.

### Q 1.3 - <span style="color:red">[3]</span> - Use a relevant `pandas` method to get summary statistics of the data all in one tabular output and inspect it. Which features have the lowest and highest standard deviation respectively? What was the age of the youngest player in the World Cup?

In [None]:
# Summary statistics (count, mean, std, min, quartiles, max) for every numerical feature
print(df.describe())

#### *Answer*


- `Wage` has the highest standard deviation: 20502.356045
- `International Reputation` has the lowest standard deviation: 0.400888
- The youngest player in the World Cup was 15 years old.

### Q 1.4 - <span style="color:red">[4]</span> - Use a relevant `pandas` method to see the distributions of the numerical features all in one plot window. Which ones look Gaussian?

In [None]:
# Histograms of all numerical features in a single figure
df.hist(figsize=(15, 12), bins=50)
plt.tight_layout()
plt.show()

#### *Answer*


`Overall`, `Potential`, `Height`, and `Weight` look approximately Gaussian.

### Q 1.5 - <span style="color:red">[2]</span> - Perform one hot encoding on the dataframe to prepare the categorical values for linear regression.

This can be done in different ways, two common methods are [this](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [this](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

Note that, in one hot encoding, a categorical attribute with $n$ distinct entries gets replaced with $n-1$ columns with entries of 0 or 1.
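
As a small illustration of the $n-1$ rule, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are hypothetical and are not taken from the assignment data. With `pandas.get_dummies`, passing `drop_first=True` drops one level per categorical attribute.

```python
import pandas as pd

# Hypothetical toy data, not the assignment data set.
toy = pd.DataFrame({
    "Preferred Foot": ["Left", "Right", "Right"],   # 2 distinct entries
    "Position": ["GK", "ST", "CB"],                 # 3 distinct entries
})

# drop_first=True keeps n-1 indicator columns per categorical attribute.
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_encoded)
```

Here the two-level attribute becomes one column and the three-level attribute becomes two, which avoids perfectly collinear indicator columns in a linear regression with an intercept.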

In [None]:
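# One-hot encode the categorical attributes; drop_first=True keeps n-1 columns per attribute
df_encoded = pd.get_dummies(df, columns=['Preferred Foot', 'Body Type', 'Position'], drop_first=True)
print(df_encoded)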

### Q 1.6 - <span style="color:red">[4]</span> - Use `seaborn.jointplot` to plot marginal histograms to investigate the relationship between `Overall` and `Value` as well as `Wage` and `Value`.

In [None]:
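# Overall vs Value, with marginal histograms
g = sns.jointplot(x='Overall', y='Value', data=df, kind='scatter', marginal_kws=dict(bins=30, fill=False))
# jointplot returns a JointGrid, so the title is set on its figure instead of with plt.title
g.fig.suptitle('Relationship between Overall and Value', y=1.02)
plt.show()
# Wage vs Value, with marginal histograms
g = sns.jointplot(x='Wage', y='Value', data=df, kind='scatter', marginal_kws=dict(bins=30, fill=False))
g.fig.suptitle('Relationship between Wage and Value', y=1.02)
plt.show()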

### Q 1.7 - <span style="color:red">[12]</span> - Determine which one(s) of the attributes `Overall`, `Wage`, and `Value` should be $log$ transformed and apply the transformation. Now, repeat what you did in "Q 1.6" but this time use the transformed version of the attribute(s) where applicable. Make sure to concatenate your original dataframe with the transformed versions of the attributes using different names to avoid overwriting the original attributes.


Hint: For example, you can see that "Value" is highly skewed to the right, therefore, you need to use the transformation for it.

Hint: $log$ transform is often used to normalize skewed distributions.
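
To make the second hint concrete, the sketch below compares the sample skewness of a right-skewed sample before and after a log transform. The data are synthetic (a log-normal sample standing in for an attribute such as `Value`), so the numbers say nothing about the assignment data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample; not the assignment data.
values = rng.lognormal(mean=13.0, sigma=1.0, size=5000)

skew_raw = skew(values)            # strongly positive for right-skewed data
skew_log = skew(np.log1p(values))  # close to zero if the log scale is roughly symmetric
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness near zero after the transform suggests the attribute benefits from it. An attribute that is already roughly symmetric can be left as is.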

In [None]:
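# Inspect the three distributions to decide which ones are skewed
df[['Overall', 'Wage', 'Value']].hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.show()
# Log-transform the right-skewed attributes into new columns so the originals are kept (log1p also handles zeros)
df['log_Wage'] = np.log1p(df['Wage'])
df['log_Value'] = np.log1p(df['Value'])
# jointplot returns a JointGrid, so the title is set on its figure instead of with plt.title
g = sns.jointplot(x='Overall', y='log_Value', data=df, kind='scatter', marginal_kws=dict(bins=30, fill=False))
g.fig.suptitle('Relationship between Overall and log(Value)', y=1.02)
plt.show()
g = sns.jointplot(x='log_Wage', y='log_Value', data=df, kind='scatter', marginal_kws=dict(bins=30, fill=False))
g.fig.suptitle('Relationship between log(Wage) and log(Value)', y=1.02)
plt.show()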

### Q 1.8 - <span style="color:red">[4]</span> - Use `pandas.corr()` to output

a) the pairwise correlations between every attribute and the original target (*i.e.*, before transformation), and

b) the pairwise correlations between every attribute and the $log$-transformed target.

For each part, the output of your code should be a table with two columns, one listing the attributes excluding the target (or transformed target), and the other column being correlation values in an ascending order.

Once you have the tables, use the mean of the absolute values of the correlations (per table) as a basis to judge whether it is best to use "LogValue" or "Value" as target.

In [None]:
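# Encode the categorical attributes so that they can enter the correlation matrix
df_encoded = pd.get_dummies(df, columns=['Preferred Foot', 'Body Type', 'Position'], dtype=float)
corr = df_encoded.corr()
targets = ['Value', 'log_Value']
# part a) correlations with the original target, excluding both targets, in ascending order
corr_value = corr['Value'].drop(targets).sort_values().rename('Correlation with Value')
print(corr_value.to_frame())
print(f"Mean absolute correlation (Value): {corr_value.abs().mean():.4f}")
# part b) correlations with the log-transformed target, in ascending order
corr_log_value = corr['log_Value'].drop(targets).sort_values().rename('Correlation with log_Value')
print(corr_log_value.to_frame())
print(f"Mean absolute correlation (log_Value): {corr_log_value.abs().mean():.4f}")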


#### *Answer*


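`LogValue` is the better target, because the mean absolute correlation between the attributes and `LogValue` is higher than the one for `Value`.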

### Q 1.9 - <span style="color:red">[4]</span> - What were the most positively and negatively correlated features in each table in Q 1.8? How do you interpret the positive and negative correlations?

#### *Answer*


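`Overall` and `Value`. A positive correlation means that the target tends to increase as the attribute increases; a negative correlation means that the target tends to decrease as the attribute increases.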

### Q 1.10 - <span style="color:red">[15]</span> - Let's train a model to predict the target (*i.e.*, the one you chose in your answer to Q 1.8):
1. Use `mean_squared_error` to calculate Root Mean Squared Error (RMSE) as your model scorer
2. Split the data into train and test with `test_size=0.2, random_state=seed`
3. Pick `LinearRegression()` from sklearn as your model
4. Report both prediction (*i.e.*, on training set) and generalization (*i.e.*, on test set) RMSE scores of your model

In [None]:
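# Encode the categorical attributes; drop_first=True avoids perfectly collinear dummy columns
df_encoded = pd.get_dummies(df, columns=['Preferred Foot', 'Body Type', 'Position'], drop_first=True)
# The target is the log-transformed value chosen in Q 1.8, so both versions of the target leave the features
X = df_encoded.drop(columns=['Value', 'log_Value'])
y = df_encoded['log_Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# RMSE is the square root of the mean squared error (here in log-euro units)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Training RMSE: {rmse_train}")
print(f"Test RMSE: {rmse_test}")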

### Q 1.11 - <span style="color:red">[8]</span> - Scatter plot `Overall` vs true `LogValue` as well as `Overall` vs predicted `LogValue` in the same graph window over the test set. 

In [None]:
plt.scatter(X_test['Overall'], y_test, color='blue', label='True LogValue', alpha=0.6)
plt.scatter(X_test['Overall'], y_test_pred, color='red', label='Predicted LogValue', alpha=0.6, marker='x')
plt.title('Overall vs LogValue')
plt.xlabel('Overall')
plt.ylabel('LogValue')
plt.legend()
plt.grid(True)
plt.show()

### Q 1.12 - <span style="color:red">[15]</span> - Calculate confidence interval (based on 99% confidence level) for the mean of target by bootstrapping. For this purpose, code a bootstrap function that in each bootstrap iteration, samples from the training set to fit the linear regression model and uses the test set to make predictions - therefore your bootstrap statistic is the average of the predictions over the test set. Your function must take as input arguments: your model, Xtrain, ytrain, Xtest, and numboot=100. The function must return only one object that is the array of recorded values for the bootstrap statistic in $\mathrm{euros}$ - and not $log(\mathrm{euros})$. Also, the unit of the confidence interval must be $\mathrm{euros}$ - and not $log(\mathrm{euros})$.

In [None]:
#
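
The bootstrap procedure described in this question can be sketched as follows. This is a minimal sketch, not a reference solution. The function name `bootstrap_mean_prediction` and the extra `seed` argument are assumptions added for reproducibility. The data are synthetic, the target is assumed to have been transformed with `np.log1p` (so `np.expm1` maps predictions back to euros), and the interval is a simple percentile interval.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bootstrap_mean_prediction(model, Xtrain, ytrain, Xtest, numboot=100, seed=106):
    """Return `numboot` bootstrap means of the test-set predictions, in euros."""
    rng = np.random.default_rng(seed)  # `seed` is an added, hypothetical argument
    Xtrain, ytrain, Xtest = np.asarray(Xtrain), np.asarray(ytrain), np.asarray(Xtest)
    n = len(ytrain)
    stats = np.empty(numboot)
    for b in range(numboot):
        idx = rng.integers(0, n, size=n)        # resample training rows with replacement
        model.fit(Xtrain[idx], ytrain[idx])     # refit on the bootstrap sample
        pred_log = model.predict(Xtest)         # predictions in log(euros)
        stats[b] = np.expm1(pred_log).mean()    # undo log1p, then average over the test set
    return stats

# Synthetic stand-in data (hypothetical), with a log1p-scale target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y_log = 10.0 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=500)
Xtrain, Xtest, ytrain = X[:400], X[400:], y_log[:400]

boot_stats = bootstrap_mean_prediction(LinearRegression(), Xtrain, ytrain, Xtest, numboot=100)
ci_low, ci_high = np.percentile(boot_stats, [0.5, 99.5])  # 99% percentile interval
print(f"99% bootstrap CI (euros): [{ci_low:.0f}, {ci_high:.0f}]")
```

Converting each prediction to euros before averaging is one reading of the task. Averaging in log space and converting afterwards would give a different statistic, closer to a geometric mean.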

### Q 1.13 - <span style="color:red">[6]</span> - Construct a 99% confidence interval using the Central Limit Theorem (again in $\mathrm{euros}$). 

In [None]:
#
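
A CLT-based interval for a mean has the form $\bar{x} \pm z_{0.995}\, s/\sqrt{n}$. The sketch below implements that formula for a generic sample. The helper name `clt_confidence_interval` and the synthetic `sample_euros` array are hypothetical stand-ins for whatever sample statistic (in euros) is used in the assignment.

```python
import numpy as np
from scipy.stats import norm

def clt_confidence_interval(sample, confidence=0.99):
    """Normal-approximation CI for the mean: mean +/- z * s / sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    z = norm.ppf(0.5 + confidence / 2.0)    # about 2.576 for 99%
    return mean - z * sem, mean + z * sem

rng = np.random.default_rng(0)
sample_euros = rng.lognormal(mean=13.0, sigma=1.0, size=2000)  # hypothetical values in euros
low, high = clt_confidence_interval(sample_euros)
print(f"99% CLT CI (euros): [{low:.0f}, {high:.0f}]")
```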

### Q 1.14 - <span style="color:red">[10]</span> - We want to see the effect of sample size ($n$) on the CI calculated from CLT. Write a `for` loop which in each iteration randomly samples from your "sample statistic" and calculates and stores the width of the corresponding CI in an array. Obviously, you should start from a small $n$ and increase it per iteration. After the loop, plot sample size (*i.e.*, $n$) against CI width and report your observation in one sentence.

In [None]:
#
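
One way to structure the loop is sketched below, again on a synthetic array (`sample_euros` is hypothetical). The range of sample sizes is an arbitrary choice. The width of the CLT interval is $2\, z_{0.995}\, s/\sqrt{n}$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample_euros = rng.lognormal(mean=13.0, sigma=1.0, size=5000)  # hypothetical values in euros

z = norm.ppf(0.995)                  # two-sided 99% critical value
sizes = np.arange(10, 2001, 10)      # sample sizes n, from small to large
widths = np.empty(len(sizes))
for i, n in enumerate(sizes):
    sub = rng.choice(sample_euros, size=n, replace=False)  # random subsample of size n
    widths[i] = 2 * z * sub.std(ddof=1) / np.sqrt(n)       # CI width for this n
print(f"width at n={sizes[0]}: {widths[0]:.0f}, at n={sizes[-1]}: {widths[-1]:.0f}")
```

Plotting `sizes` against `widths` (for example with `plt.plot(sizes, widths)`) should show the width shrinking roughly like $1/\sqrt{n}$, with more noise at small $n$.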

#### *Answer*


### Q 1.15 - <span style="color:red">[9]</span> - Randomly subsample your "sample statistic" with $n=30$ and calculate a $t$-based 99% CI (in $\mathrm{euros}$). Is it a good idea to calculate the CI for this data set this way? Why?

Hint: It would be a good idea to run your code for this part a few times prior to answering the question.
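
A $t$-based interval replaces the normal critical value with $t_{0.995,\,n-1}$. The sketch below shows the calculation for $n=30$ on a synthetic, right-skewed array (`sample_euros` is hypothetical and only stands in for the sample statistic in euros).

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
sample_euros = rng.lognormal(mean=13.0, sigma=1.0, size=5000)  # hypothetical values in euros

n = 30
sub = rng.choice(sample_euros, size=n, replace=False)  # random subsample of size 30
mean = sub.mean()
sem = sub.std(ddof=1) / np.sqrt(n)                     # standard error of the mean
t_crit = t.ppf(0.995, df=n - 1)                        # two-sided 99% critical value
low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"99% t-based CI (euros): [{low:.0f}, {high:.0f}]")
```

The $t$ interval assumes the sample mean is approximately normal. With strongly skewed data and only 30 points, the interval can move noticeably from one subsample to the next, which is worth checking by changing the seed and re-running.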

#### *Answer*
