# Assess the accuracy of a model with R2 and adjusted R2, and understand the difference between them
We will predict Balance in the Credit dataset, which is a continuous variable. We will use linear regression and then look at how R2 and adjusted R2 differ. This will show why adjusted R2 is better suited than R2 for selecting a model from candidates with different sets of features.

### R2:
* On the training data, R2 never decreases as more variables are added to a least squares model.
* The model containing all of the predictors will therefore always have the largest R2, since this quantity is tied to the training error. Instead, we wish to choose a model with a low test error, so R2 is not suitable for selecting the best model among a collection of models with different numbers of predictors.

R2 = 1-RSS/TSS

where RSS = sum((yi − yhat_i)^2) is the residual sum of squares,
and TSS = sum((yi − ybar)^2) is the total sum of squares for the response.
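
As a minimal sketch of the formula R2 = 1 - RSS/TSS, with RSS the residual sum of squares and TSS the total sum of squares, the helper below computes R2 in plain Python. The function name `r_squared` and the toy values are illustrative assumptions, not part of the notebook's pipeline.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - RSS/TSS for paired sequences of observed and predicted values."""
    y_bar = sum(y_true) / len(y_true)
    # Residual sum of squares: squared distance between observations and predictions
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    # Total sum of squares: squared distance between observations and their mean
    tss = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - rss / tss


y_obs = [1.0, 2.0, 3.0, 4.0]                    # toy values, for illustration only
print(r_squared(y_obs, y_obs))                  # perfect predictions, so RSS = 0
print(r_squared(y_obs, [2.5, 2.5, 2.5, 2.5]))   # predicting the mean, so RSS = TSS
```

The two calls show the reference points of the scale: a perfect fit gives 1, and a model that only predicts the mean gives 0.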

### Adjusted R2:
* The intuition behind the adjusted R2 is that once all of the correct variables have been included in the model, adding additional noise variables leads to only a very small decrease in RSS. Because each added variable also increases p, the factor (n-1)/(n-p-1) grows and the adjusted R2 goes down.
* A large value of adjusted R2 indicates a model with a small test error. In theory, the model with the largest adjusted R2 will have only correct variables and no noise variables. Unlike the R2 statistic, the adjusted R2 statistic pays a price for the inclusion of unnecessary variables in the model.

Adj R2 = 1-(1-R2)*(n-1)/(n-p-1)

where n is the number of observations and p is the number of predictors.
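
To make the penalty concrete, here is a small sketch on synthetic data (not the Credit dataset). It fits ordinary least squares with NumPy on one useful predictor plus up to five pure-noise predictors, and reports training R2 and adjusted R2 for each model size. The helper names `adjusted_r2` and `training_r2` and the simulated data are assumptions made for illustration.

```python
import numpy as np


def adjusted_r2(r2, n, p):
    """Adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def training_r2(X, y):
    """Training R2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss


rng = np.random.default_rng(0)
n = 100
signal = rng.normal(size=(n, 1))        # one genuinely useful predictor
noise_cols = rng.normal(size=(n, 5))    # five pure-noise predictors
y = 3 * signal[:, 0] + rng.normal(size=n)
X_all = np.hstack([signal, noise_cols])

results = []
for p in range(1, X_all.shape[1] + 1):
    r2 = training_r2(X_all[:, :p], y)   # model using the first p columns
    adj = adjusted_r2(r2, n, p)
    results.append((p, r2, adj))
    print(f"p={p}  R2={r2:.4f}  Adj R2={adj:.4f}")
```

Training R2 cannot go down as columns are added because the models are nested, while adjusted R2 is always at most R2 and is free to fall when a new column explains too little.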

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

In [None]:
df = pd.read_csv('../input/ISLR-Auto/Credit.csv')
df.head()

In [None]:
df = df.drop(columns='Unnamed: 0')

## Exploratory Data Analysis (EDA)

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
sns.pairplot(df)

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', ax=ax)

### Interpretation
* There are several categorical features: Gender, Student, Married and Ethnicity.
* Balance is correlated with Limit, Rating and Income.
* There is also multicollinearity. For example, Limit and Rating are correlated.
* Balance is not normally distributed; it is right-skewed, with most values concentrated at the low end.
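
One common way to quantify multicollinearity, which this notebook does not otherwise use, is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R2). The sketch below is a NumPy-only illustration on synthetic columns; the function name `vif` and the simulated data are assumptions.

```python
import numpy as np


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R2_j), where R2_j comes
    from regressing column j on all other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        rss = np.sum((target - others @ coef) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        factors.append(tss / rss)   # algebraically equal to 1 / (1 - R2_j)
    return np.array(factors)


rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)   # nearly a copy of a
c = rng.normal(size=500)              # independent of both
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

On the real data, something like `vif(df[['Income', 'Limit', 'Rating']].to_numpy())` would apply the same check. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.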

## Data Preprocessing

In [None]:
X = df.loc[:, 'Income':'Ethnicity']
y = df.loc[:, 'Balance']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
NUM_FEATURES = ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education']
CAT_FEATURES = ['Gender', 'Student', 'Married', 'Ethnicity']

num_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),   
])
cat_pipe = Pipeline(steps=[    
    ('encode', OneHotEncoder(drop='first')),   
    ('scale', StandardScaler(with_mean=False)),
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipe, NUM_FEATURES),
    ('cat', cat_pipe, CAT_FEATURES),
], remainder='drop')

## Feature or subset selection
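We will use sklearn's SelectKBest to keep the k highest-scoring features, for k from 1 to 9, and use 10-fold cross validation to compute the R2 and adjusted R2 scores for each k. Note that SelectKBest works on the preprocessed columns, so each one-hot encoded level counts as a separate feature.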
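
As a small sketch of what SelectKBest does with `f_regression`, the example below uses synthetic data in which only two of five columns drive the target. The names `X_demo`, `y_demo` and `selector` are illustrative. `f_regression` scores each feature on its own (a univariate F-test), so two correlated features can both rank highly even if the second adds little once the first is in the model.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the target; the rest are noise.
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected = selector.get_support(indices=True)   # indices of the columns kept
print(selected)
print(selector.scores_)                         # univariate F-statistic per column
```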

In [None]:
n = len(X_train)  # number of training observations
r2scores = []
adjustedr2 = []
feature_names = []
for i in range(1, 10):
    reduce_dim_pipe = Pipeline(steps=[
        ('preprocess', preprocessor),
        ('reduce_dim', SelectKBest(k=i, score_func=f_regression)),
    ])

    pipeline = Pipeline(steps=[
        ('reduce_dim_pipe', reduce_dim_pipe),
        ('regress', LinearRegression())
    ])

    # Cross-validated R2 (mean over 10 folds)
    R2 = cross_val_score(pipeline, X=X_train, y=y_train, cv=10, scoring='r2').mean()
    r2scores.append(R2)

    # Adjusted R2 = 1-(1-R2)*(n-1)/(n-p-1), with p = number of selected features
    p = i
    adj_R2 = 1 - ((1 - R2) * (n - 1) / (n - p - 1))
    adjustedr2.append(adj_R2)

    # Fit on the full training set to see which features are selected.
    # SelectKBest sees the preprocessed (one-hot encoded) columns, so take the
    # names from the pipeline instead of indexing X_train by position.
    # get_feature_names_out requires scikit-learn >= 1.0.
    reduce_dim_pipe.fit(X=X_train, y=y_train)
    # Strip the 'num__' / 'cat__' prefixes added by ColumnTransformer
    best_features = [name.split('__')[-1] for name in reduce_dim_pipe.get_feature_names_out()]
    feature_names.append(best_features)

In [None]:
scoring_df = pd.DataFrame(np.column_stack((r2scores, adjustedr2)), columns=['R2', 'Adj_R2'])
scoring_df['feature_names'] = feature_names
scoring_df['features'] = range(1, 10)
scoring_df

### Interpretation
* R2 almost always increases as new features are added. Adj R2, on the other hand, shows no substantial increase once the number of features reaches 4.
* When Limit is added as the second selected feature, the increase in both R2 and Adj R2 is small because Limit adds little value to the model. The reason is that Limit and Rating are correlated.
* When the number of features goes from 4 to 5, Adj R2 increases less than R2. Beyond that, Adj R2 stays roughly constant or decreases with each new feature.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
#convert data frame from wide format to long format so that we can pass into seaborn line plot function to draw multiple line plots in same figure
# https://stackoverflow.com/questions/52308749/how-do-i-create-a-multiline-plot-using-seaborn
long_format_df = pd.melt(scoring_df.loc[:, ['features','R2', 'Adj_R2']], ['features'])
sns.lineplot(x='features', y='value', hue='variable', data=long_format_df, ax=ax)
ax.set_xlabel('No of features')
ax.set_ylabel('Cross validated R2 and Adj R2 scores')
ax.set_title('Plot between number of features and R2/Adj R2 scores')

### Interpretation
* The most important feature is 'rating'
* After 4-5 features, there is no improvement in the R2 scores.
* We should not add more features to the model beyond this point, as they can add noise and cause overfitting.
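
One way to turn this reading of the plot into a rule is to pick the smallest number of features whose adjusted R2 is within a small tolerance of the best value. The sketch below is a plain-Python illustration; the function name `smallest_k_near_best`, the tolerance, and the scores are assumptions, and the scores are placeholders rather than the notebook's results.

```python
def smallest_k_near_best(adj_r2_by_k, tolerance=0.005):
    """Return the smallest feature count whose adjusted R2 is within
    `tolerance` of the best adjusted R2 in the mapping."""
    best = max(adj_r2_by_k.values())
    return min(k for k, score in adj_r2_by_k.items() if score >= best - tolerance)


# Placeholder scores for illustration only; these are NOT the notebook's results.
example_scores = {1: 0.70, 2: 0.72, 3: 0.90, 4: 0.940, 5: 0.942, 6: 0.941}
print(smallest_k_near_best(example_scores))
```

With the notebook's table, the input could be built as `dict(zip(scoring_df['features'], scoring_df['Adj_R2']))`. Preferring the smallest model among near-ties is a parsimony choice, not the only reasonable one.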

### References
* https://datascience.stackexchange.com/questions/14693/what-is-the-difference-of-r-squared-and-adjusted-r-squared
* https://www.listendata.com/2014/08/adjusted-r-squared.html
* https://thestatsgeek.com/2013/10/28/r-squared-and-adjusted-r-squared/
