# Linear Regression

In [26]:
# !poetry add seaborn
# !pip install seaborn

In [2]:
# Notebook setup
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In this exercise, you will model an NBA player's win rating (`win_rating`) as a function of their game statistics (such as minutes played, `mp`).

Load the `NBA.csv` dataset into this notebook as a pandas dataframe, and display its first 5 rows.

Load the data:

In [23]:
# CODE HERE

See the shape of the dataset:

In [None]:
# CODE HERE

See first 5 rows of the dataset:


In [24]:
# CODE HERE

## 1. Define the feature set and target

The first objective is to model a player's overall performance rating compared to peers, called *Wins Above Replacement* (`win_rating`), against the minutes they have played (`mp`).

Assign those two variables to `X` and `y`. Remember that `X` is the feature(s), and `y` the target.
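As a minimal sketch of the shapes involved, the snippet below builds a tiny made-up DataFrame (the name `toy` and all values are invented for illustration and are not taken from `NBA.csv`). It shows that double brackets return a 2-D DataFrame, which scikit-learn expects for features, while single brackets return a 1-D Series, which suits the target.

```python
import pandas as pd

# Toy data reusing the exercise's column names (values are invented)
toy = pd.DataFrame({
    "mp": [500, 1200, 2400, 3100],
    "win_rating": [0.4, 1.9, 4.8, 7.2],
})

X_toy = toy[["mp"]]        # double brackets -> DataFrame, shape (n_samples, 1)
y_toy = toy["win_rating"]  # single brackets -> Series, shape (n_samples,)

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```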

In [None]:
# CODE HERE

In a scatter plot ([doc](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)), visualize the relationship between the rating and the minutes played.

In [None]:
plt.scatter(X, y)

plt.title('Minutes Played vs Win Rating')
plt.xlabel('Minutes Played')
plt.ylabel('Win Rating')

plt.tight_layout()

The scatter plot should hint at a somewhat linear relationship.

## 2. Cross-validation

Using scikit-learn's `cross_validate` ([doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)), run a 5-fold cross-validation on a `LinearRegression` ([doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)) model predicting the player performance rating from minutes played. Save the raw output of the cross-validation under a new variable `cv_results`.
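To get a feel for what `cross_validate` returns, here is a small sketch on synthetic data. The names `X_demo`, `y_demo` and `demo_cv`, and the generated values, are assumptions made for illustration only; they stand in for `mp` and `win_rating` and do not reproduce the NBA dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic, roughly linear data (a stand-in for `mp` and `win_rating`)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 3000, size=(200, 1))
y_demo = 0.002 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)

# 5-fold cross-validation; the default score for a regressor is R^2
demo_cv = cross_validate(LinearRegression(), X_demo, y_demo, cv=5)

print(sorted(demo_cv.keys()))  # fit_time, score_time, test_score
print(demo_cv["test_score"])   # one R^2 value per fold
```

The result is a dictionary of NumPy arrays with one entry per fold, so aggregations such as `.min()`, `.max()` and `.mean()` can be applied directly to `demo_cv["test_score"]`.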

In [29]:
# CODE HERE
# Instantiate the model

# Cross validate the model and store the results in `cv_results`

❓ What is the lowest score of the cross validation? Compute your answer and save the value under new variable `min_score`.

In [31]:
# CODE HERE

❓ What is the highest score of the cross validation?  Compute your answer and save the value under new variable `max_score`.

In [32]:
# CODE HERE

❓ What is the mean score of the cross validation? Compute your answer and save the value under new variable `mean_score`.

In [33]:
# CODE HERE

**When running a cross-validation, we always look at the mean score as the most robust and representative evaluation of the model's performance.**

Plot the evolution of **<u>total</u> computational time (fitting and scoring)** and **<u>mean</u> score** as the number of folds K increases from 2 to 20.

In [10]:
# create a range of k values (2 to 20 inclusive)
ks = np.arange(2, 21)

# run the cross-validation once per k and keep the raw results
results_per_k = [cross_validate(linreg_model, X, y, cv=k) for k in ks]

# calculate the total computation time (fitting + scoring) for each k
total_comp_time = np.array(
    [res['fit_time'].sum() + res['score_time'].sum() for res in results_per_k]
)

# calculate the mean score for each k
mean_score_per_k = np.array(
    [res['test_score'].mean() for res in results_per_k]
)

In [None]:
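# Plotting total_comp_time and mean_score_per_k vs ks
plt.plot(ks, total_comp_time, label='Total computation time (s)')
plt.plot(ks, mean_score_per_k, label='Mean $R^2$ score')
plt.xlabel('Number of folds (K)')
plt.legend();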

You should see that the $R^2$ score stays stable, which is a good sign that the model performs equally well on smaller and larger test sizes. Meanwhile, the computational time keeps increasing. For that reason, as a rule of thumb, we do not exceed K = 10.

## 3. Train the model

Cross-validation does not train a model; it evaluates a hypothetical model on the dataset. If you want to use the model, for example to make predictions, you will need to train it outside of the cross-validation.
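The point can be checked with a short sketch on synthetic data (the names `estimator`, `X_demo` and `y_demo` are illustrative assumptions). `cross_validate` fits clones of the estimator it receives, so the original object has no learned attributes such as `coef_` until `fit` is called on it explicitly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data, invented for this illustration
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 3000, size=(100, 1))
y_demo = 0.002 * X_demo[:, 0] + rng.normal(0, 0.5, size=100)

estimator = LinearRegression()
cross_validate(estimator, X_demo, y_demo, cv=5)

# cross_validate worked on clones, so `estimator` itself is still unfitted
fitted_after_cv = hasattr(estimator, "coef_")

# An explicit fit is what actually trains the model
estimator.fit(X_demo, y_demo)
fitted_after_fit = hasattr(estimator, "coef_")

print(fitted_after_cv, fitted_after_fit)
```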

Go ahead and train the model on the full `X` and `y` (as we've already validated the model's score, and now will use it to predict!). Save the trained model under the variable `model`.

In [34]:
# CODE HERE
# Retrieve the previous model linreg_model and save it to model variable as requested

# Fitting the model

What is the slope of your trained model? It can be accessed via the model's attributes. Save the slope under variable name `slope`.

In [35]:
# CODE HERE
# Viewing the slope of the model

What is the intercept of your trained model? It can be accessed via the model's attributes. Save the intercept under variable name `intercept`.

In [36]:
# CODE HERE
# Viewing the intercept of the model

Make sure you understand how to interpret these coefficients before moving on.
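As a hedged illustration of how to read the two coefficients, the sketch below fits a model on noise-free toy data generated from $y = 2 + 0.5x$ (all names and values here are invented for the example). The slope is the change in the prediction for a one-unit increase in the feature, and the intercept is the prediction when the feature equals 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data generated from y = 2 + 0.5 * x
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 + 0.5 * x_demo[:, 0]

demo_model = LinearRegression().fit(x_demo, y_demo)
demo_slope = demo_model.coef_[0]        # change in y per one-unit increase in x
demo_intercept = demo_model.intercept_  # predicted y when x is 0

# A prediction is simply intercept + slope * x
manual = demo_intercept + demo_slope * 4.0
from_model = demo_model.predict(np.array([[4.0]]))[0]

print(demo_slope, demo_intercept, manual, from_model)
```

In the exercise's terms, the slope would read as the expected change in `win_rating` per additional minute played.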

## 4. Predict *(manually)*

With matplotlib: 
- Define the line of best fit equation (using the slope and intercept values)
- Plot it in a graph over the scattered data points

In [None]:
X_min, X_max = np.array(X.min())[0], np.array(X.max())[0]
X_min, X_max

In [None]:
# Creating line with np.linspace
xx = np.linspace(X_min, X_max, 2000)
print(xx)

yy = np.array(intercept + slope * xx)
print(yy)

In [None]:
plt.scatter(X, y, alpha = .2)
plt.plot(xx, yy, c = 'red')

plt.tight_layout()

## 5. Predict (with scikit-learn)

Use your trained model to predict the **Win Rating** of a player that played **1500 minutes**. Save the predicted rating as variable name `prediction`.
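A minimal sketch of the call pattern, assuming a model fitted on a DataFrame with a single `mp` column. The training values below are invented (they follow `win_rating = 0.002 * mp` exactly) and the names `toy`, `toy_model`, `new_player` and `toy_prediction` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented training data reusing the exercise's column names
toy = pd.DataFrame({
    "mp": [500, 1000, 2000, 3000],
    "win_rating": [1.0, 2.0, 4.0, 6.0],
})
toy_model = LinearRegression().fit(toy[["mp"]], toy["win_rating"])

# predict expects a 2-D input: one row per player, one column per feature
new_player = pd.DataFrame({"mp": [1500]})
toy_prediction = toy_model.predict(new_player)

print(toy_prediction.shape)  # (1,)
print(toy_prediction[0])
```

Passing a DataFrame with the same column name used at fit time keeps the feature names consistent; `predict` returns an array with one value per input row, so the single prediction is its first element.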

In [None]:
# CODE HERE

## 6. Improving the model with more features

Your friend who enjoys NBA fantasy league comes to you with some insights.

They say that when evaluating a player's Wins Above Replacement rating, they would typically also look at the number of ball possessions (`poss`), their defense/offense ratio and their pacing.

Visualize the correlation between these new features and the `win_rating`. You can use `matplotlib` or `seaborn`. Which **one** of the above features would you consider adding to your model?

In [None]:
X = nba[['mp', 'poss', 'do_ratio', 'pacing']]
y = nba['win_rating']

X.shape, y.shape

display(X.head(), y.head())

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Combine X and y into a single DataFrame for correlation
data = pd.concat([X, y], axis=1)

# Compute the correlation matrix
corr_matrix = data.corr()

# Plot the heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

Now let's see if the model with two features (the new one you picked and `mp`) is better at predicting a player's rating than our first model.

Create a new set of features - it should be a DataFrame with two columns.

In [None]:
X = nba[['mp', 'poss', 'do_ratio', 'pacing']]
display(X.head())

y = nba['win_rating']
display(y.head())

Now cross-validate a new linear regression model and save the **mean** score to `mean_multi_feat_score`.
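The comparison can be sketched on synthetic data where the target truly depends on two features. Everything below (`f1`, `f2`, `target`, the coefficients and the noise level) is an assumption made for illustration and says nothing about the actual NBA scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data where the target depends on two features
rng = np.random.default_rng(42)
f1 = rng.normal(size=300)
f2 = rng.normal(size=300)
target = 3.0 * f1 + 2.0 * f2 + rng.normal(scale=0.5, size=300)

X_one = f1.reshape(-1, 1)          # single-feature matrix
X_two = np.column_stack([f1, f2])  # two-feature matrix

# Mean 5-fold R^2 for each feature set
score_one = cross_validate(LinearRegression(), X_one, target, cv=5)["test_score"].mean()
score_two = cross_validate(LinearRegression(), X_two, target, cv=5)["test_score"].mean()

print(f"one feature : {score_one:.3f}")
print(f"two features: {score_two:.3f}")
```

Comparing mean cross-validated scores, rather than scores on the training data, is what makes the comparison between the two feature sets fair.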

In [37]:
# CODE HERE

You should see a **percentage increase** in your $R^2$!

You just performed your first manual *feature selection*. Congratulations! 🎉