

# Regression - R-squared and Adjusted-R-squared

## Randomly create some data for regression analysis.

Let's create a random sample of 20 observations with 6 features/predictors.


In [0]:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples = 20, n_features = 6, random_state = 2, noise = 0.5)

print(X.shape)

print(y.shape)

## Scatter plot of some individual features vs. output

Visually check for a linear relationship between each individual feature in X and the output y with a scatter plot.


In [0]:
import matplotlib.pyplot as plt

# Draw one scatter plot per feature against the output
for i in range(X.shape[1]):
    plt.scatter(X[:, i], y, alpha=0.5)
    plt.title(f'Feature-{i}')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

# Note

From the above plots of the 6 features, it seems that feature 2 and feature 4 have some sort of linear relationship with the output variable.

Other features don't show such a linear relationship.

Therefore, if we fit a Linear Regression model with only feature 2 (model2) or only feature 4 (model4), it should perform better than a model fitted with a feature that shows no such relationship, say feature 5 (model5).

Let's validate this assumption.
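The visual impression can be backed up with a number. The sketch below is an addition, not part of the original analysis: it computes the Pearson correlation of each feature with y using `numpy.corrcoef`, where a larger absolute value indicates a stronger linear relationship. It repeats the `make_regression` call so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same synthetic data as above (repeated so the snippet is self-contained)
X, y = make_regression(n_samples=20, n_features=6, random_state=2, noise=0.5)

# Pearson correlation between each feature column and the output
correlations = np.array(
    [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]
)

for i, r in enumerate(correlations):
    print(f'Feature-{i}: r = {r:+.3f}')
```

Correlation only measures the pairwise linear association, so it complements the scatter plots rather than replacing them.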


## With Feature 2

#### Note the R-squared and the Adjusted-R-squared scores of the fitted model.

In [0]:
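import statsmodels.api as sm
# Note: sm.OLS does not add an intercept automatically, so this fit has no
# constant term and the summary reports the uncentered R-squared.
model2 = sm.OLS(y, X[:, 2]).fit()
model2.summary()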

## With Feature 4

#### Note the R-squared and the Adjusted-R-squared scores of the fitted model.

In [0]:
import statsmodels.api as sm
model4 = sm.OLS(y, X[:, 4]).fit()
model4.summary()

## With Feature 5

#### Note the R-squared and the Adjusted-R-squared scores of the fitted model.


In [0]:

#import statsmodels.api as sm
model5 = sm.OLS(y, X[:, 5]).fit()
model5.summary()

# Experiment with more than one feature

Let's take feature 4 (which is a good predictor) together with two less important predictors (features 3 and 5) and build a model.

The model fitted with only feature 4 has an R-squared of .443 and an Adjusted-R-squared of .414.

The new model fitted with the three features (3, 4, 5) has an R-squared of .471 but an Adjusted-R-squared of .377.

## Comparing the model fitted with only feature 4 with the new model fitted with features 3, 4 and 5.

The slightly higher R-squared shows that the new model explains slightly more of the variance in Y using these 3 features. Note that R-squared never decreases when predictors are added to a least-squares model, so a small increase on its own says little. 
However, the Adjusted-R-squared, which penalizes the number of predictors, dips from .414 to .377. 
This suggests that the new features (3 and 5) are not important for the model.

A good feature/predictor explains more of the variance in Y and therefore improves the R-squared by a margin large enough to outweigh that penalty. 
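The arithmetic behind this comparison can be sketched as follows. The helper name `adjusted_r2` and its `has_intercept` flag are illustrative and not part of any library. The no-intercept variant, `1 - (1 - r2) * n / (n - k)`, is assumed here because the models above are fitted without a constant term. Plugging in the rounded R-squared values gives roughly .414 and .378, which agrees with the reported .414 and .377 up to rounding of the inputs.

```python
def adjusted_r2(r2, n, k, has_intercept=True):
    """Adjusted R-squared for n observations and k predictors.

    With an intercept:    1 - (1 - r2) * (n - 1) / (n - k - 1)
    Without an intercept: 1 - (1 - r2) * n / (n - k)
    """
    if has_intercept:
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * n / (n - k)


n = 20  # observations in the sample

# Feature 4 only (k = 1), then features 3, 4 and 5 (k = 3), both without intercept
adj_one = adjusted_r2(0.443, n, k=1, has_intercept=False)
adj_three = adjusted_r2(0.471, n, k=3, has_intercept=False)

print(round(adj_one, 3), round(adj_three, 3))
```

The small gain in R-squared (.443 to .471) is multiplied by a larger penalty factor (20/17 instead of 20/19), which is why the adjusted score drops.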




In [0]:
#import statsmodels.api as sm
# Fit on features 3, 4 and 5 (a separate name keeps the feature-5 model intact)
model345 = sm.OLS(y, X[:, [3,4,5]]).fit()
model345.summary()

## See the Effect of using a Good Feature in Model Building

Now, a model is fitted with features 2 and 4 (both are important features).

Check the values of R-squared and Adjusted-R-squared.


In [0]:
#import statsmodels.api as sm
# Fit on features 2 and 4
model24 = sm.OLS(y, X[:, [2,4]]).fit()
model24.summary()

# Case-study


### Predict Car-mileage from different features of a car.


### Read data from a CSV

In [0]:
# import dataset
import pandas as pd
data = pd.read_csv('mtcars.csv')

print(data.head())

### Preprocess Data

In [0]:
# remove string and categorical variables
cat_var = ['model', 'cyl', 'vs', 'am', 'gear', 'carb']
data = data.drop(cat_var, axis = 1)
print(data.head())

### Normalize Data

In [0]:
# scale the variables to prevent coefficients from becoming too large or too small
# (fit_transform returns a NumPy array, so column names are dropped from here on)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = scaler.fit_transform(data)
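With its default settings, `MinMaxScaler` maps each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below reproduces that formula in plain NumPy on a small made-up array (not the mtcars data) to show what the transformation does.

```python
import numpy as np

# Small made-up array: two columns on very different scales
raw = np.array([[10.0, 200.0],
                [20.0, 400.0],
                [40.0, 300.0]])

# Min-max scaling per column: (x - min) / (max - min)
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, every column has a minimum of 0 and a maximum of 1, so coefficient magnitudes become comparable across features.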


### Create the LR Model

In [0]:
# fit the linear regression model to predict mpg as a function of other variables
from sklearn.linear_model import LinearRegression
reg = LinearRegression()


### Separate out Training Data and Label


### Experiment with different feature combinations in fitting the model.


In [0]:
# column 0 is the output labels
y=data[:, 0]  


# Comment/un-comment different lines to fit the model on different features

# All 5 features
#X=data[:, 1:6]      # R-squared =  0.8220253997643747    Adjusted-R-squared =  0.7877995151036776

# First 4 features
#X=data[:, 1:5]      # R-squared =  0.8061538383178671    Adjusted-R-squared =  0.7774358884390327

# First 3 features
#X=data[:, 1:4]      # R-squared =  0.7096991776694601    Adjusted-R-squared =  0.6785955181340451

# First 2 features
#X=data[:, 1:3]      # R-squared =  0.663530743237355     Adjusted-R-squared =  0.6403259669088966

# First feature
#X=data[:, 1:2]      # R-squared =  0.6079080244298629    Adjusted-R-squared =  0.5948382919108584


# Only two features (columns 4 and 5)
#X=data[:, [4,5]]     # R-squared =  0.7899558504269999   Adjusted-R-squared =  0.7754700470081723

# Only three features (columns 3, 4 and 5): these three features are almost as good as all the features; compare the scores
X=data[:, [3, 4, 5]]  # R-squared =  0.8052874391519524   Adjusted-R-squared =  0.7844253790610902


print(X.shape)
print(y.shape)

print(X[1:5,:])

### Train the LR Model

In [0]:
reg.fit(X,y)
print('\n Model trained......')

### Compute Score - R-squared and Adjusted-R-squared

In [0]:
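# calculate r2 score
from sklearn.metrics import r2_score

# r2_score expects the true values first and the predictions second.
# The scores recorded in the comments of the feature-selection cell above were
# obtained with the two arguments swapped, so this call returns different
# (higher) values for the same features.
r2 = r2_score(y, reg.predict(X))

# adjusted r2 using formula adj_r2 = 1 - (1- r2) * (n-1) / (n - k - 1)

# Number of observations in the sample/training data
n = X.shape[0]

# k = number of predictors 
k = X.shape[1]

adj_r2 = 1 - ((1-r2)*(n - 1) / (n-k- 1))

print('\n R-squared = ', r2)
print('\n Adjusted-R-squared = ', adj_r2)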
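`r2_score` is not symmetric in its arguments: its signature is `r2_score(y_true, y_pred)`, and reversing the two gives a different number. The sketch below illustrates this on made-up data (not the mtcars data). The `_demo` names are illustrative. For a least-squares fit with an intercept evaluated on its own training data, the two values are related by `r2_correct = 1 / (2 - r2_swapped)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data: a noisy linear relationship (not the mtcars data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=32)

reg_demo = LinearRegression().fit(X_demo, y_demo)
pred = reg_demo.predict(X_demo)

r2_correct = r2_score(y_demo, pred)  # true values first
r2_swapped = r2_score(pred, y_demo)  # arguments reversed

print('correct order :', r2_correct)
print('swapped order :', r2_swapped)
print('reg.score     :', reg_demo.score(X_demo, y_demo))
```

`LinearRegression.score` returns the R-squared in the correct order, so it is a convenient cross-check.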


### Fit OLS Model from Statsmodels

In [51]:
#import statsmodels.api as sm
model_ols = sm.OLS(y, X).fit()
model_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared (uncentered): | 0.909 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.900 |
| Method: | Least Squares | F-statistic: | 96.72 |
| Date: | Thu, 29 Aug 2019 | Prob (F-statistic): | 3.32e-15 |
| Time: | 11:30:31 | Log-Likelihood: | 16.224 |
| No. Observations: | 32 | AIC: | -26.45 |
| Df Residuals: | 29 | BIC: | -22.05 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 0.6490 | 0.094 | 6.939 | 0.000 | 0.458 | 0.840 |
| x2 | -0.2031 | 0.079 | -2.562 | 0.016 | -0.365 | -0.041 |
| x3 | 0.5781 | 0.116 | 4.996 | 0.000 | 0.341 | 0.815 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.488 | Durbin-Watson: | 2.111 |
| Prob(Omnibus): | 0.784 | Jarque-Bera (JB): | 0.050 |
| Skew: | 0.074 | Prob(JB): | 0.975 |
| Kurtosis: | 3.126 | Cond. No. | 3.76 |
