# LINEAR REGRESSION MODELS

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Importing the dataset

In [None]:
cars = pd.read_csv('../input/vehicle-dataset-from-cardekho/car data.csv')

This is the Vehicle dataset from CarDekho. It contains information about used cars listed on the website cardekho.com. We will use it to predict the selling price with regression models.

The dataset consists of the following columns (Selling_Price is the target; the rest are candidate independent variables):

Car_Name : The name of the car.

Year : The year in which the car was bought.

Selling_Price : The price at which the owner wants to sell the car.

Present_Price : The current ex-showroom price of the car.

Kms_Driven : The distance the car has been driven, in km.

Fuel_Type : The fuel type of the car, i.e. Diesel, Petrol or CNG.

Seller_Type : Defines whether the seller is a dealer or an individual.

Transmission : Defines whether the car is manual or automatic.

Owner : Defines the number of owners the car has previously had.

In [None]:
cars.head()

In [None]:
cars.shape

In [None]:
cars.describe()

In [None]:
cars.info()

None of the columns contains null values.

In this regression model the dependent variable is 'Selling_Price'; all the other variables are treated as independent variables.

To build a linear regression model we need numerical variables only, so the features with object datatype are either converted or dropped.

In [None]:
#Car_Name

cars.Car_Name.value_counts()

Car_Name contains 98 different values, which is too many to encode as dummy variables, so it is better to drop this column.

In [None]:
cars.drop(['Car_Name'],axis=1,inplace = True)

In [None]:
cars.head()

### EDA

#### UNIVARIATE ANALYSIS

In [None]:
# Boxplots of all numerical columns

plt.figure(figsize = (15,5))
sns.boxplot(data=cars)
plt.show()

In [None]:
# From the boxplot we can see that Kms_Driven has outliers

In [None]:
q1 = cars['Kms_Driven'].quantile(0.25)
q3 = cars['Kms_Driven'].quantile(0.75)
iqr = q3-q1

UL = q3 + (1.5 * iqr)
LL = q1 - (1.5 * iqr)
print(iqr,UL,LL)

In [None]:
cars[cars['Kms_Driven']>UL]

In [None]:
cars[cars['Kms_Driven']>UL].count()['Kms_Driven']

These 8 values are greater than the upper limit of 99417.5, so we remove them.
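The IQR rule applied to Kms_Driven can be checked on a small array first. The sketch below uses made-up odometer readings (not values from the dataset) and NumPy's default linear interpolation for the quartiles, which is assumed to match the pandas `quantile` default.

```python
import numpy as np

# Hypothetical odometer readings in km; the last value is an obvious outlier.
kms = np.array([12000, 25000, 31000, 40000, 45000, 52000, 60000, 500000])

# Quartiles and interquartile range
q1, q3 = np.percentile(kms, [25, 75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

outlier_mask = (kms > upper_limit) | (kms < lower_limit)
outliers = kms[outlier_mask]
kept = kms[~outlier_mask]

print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
```

Only the 500000 km reading falls outside the limits here; the remaining seven values are kept.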

In [None]:
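# Outlier removal from Kms_Driven: keep the rows at or below the upper limit

df = cars[cars['Kms_Driven'] <= UL].copy()  # copy to avoid chained-assignment warnings later
cars = df
cars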

In [None]:
sns.histplot(df['Year'], kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()

In [None]:
# The Year variable is left-skewed

In [None]:
sns.histplot(df['Selling_Price'], kde=True)
plt.show()

In [None]:
# The selling price is right-skewed

In [None]:
sns.histplot(df['Present_Price'], kde=True)
plt.show()

In [None]:
# The present price is right-skewed

In [None]:
sns.histplot(df['Kms_Driven'], kde=True)
plt.show()

In [None]:
# After removing the outliers, Kms_Driven is almost normally distributed;
# most values lie between 20000 and 50000 km

In [None]:
sns.countplot(x='Fuel_Type', data=cars)
plt.show()

In [None]:
# The bar plot shows three categories of Fuel_Type
# Petrol cars are the most numerous and CNG cars the least

In [None]:
sns.countplot(x='Seller_Type', data=cars)
plt.show()

In [None]:
# There are two types of sellers: Individual and Dealer
# There are more dealers than individual sellers

In [None]:
sns.countplot(x='Transmission', data=cars)
plt.show()

In [None]:
# The Transmission feature has 2 categories:
# Manual and Automatic

In [None]:
sns.countplot(x='Owner', data=cars)
plt.show()

In [None]:
# There are more cars with 0 previous owners than cars with one previous owner.

#### Bivariate Analysis

In [None]:
fig, (ax1, ax2,ax3) = plt.subplots(1,3,figsize = (15,5))

#scatter plot 1
ax1.scatter(x=cars['Year'],y= cars['Selling_Price'])
ax1.set_title('Years v/s Selling_Price')

#scatter plot 2
ax2.scatter(x=cars['Present_Price'], y=cars['Selling_Price']) 
ax2.set_title('Present_Price v/s Selling_Price')

#scatter plot 3
ax3.scatter(x=cars['Kms_Driven'],y=cars['Selling_Price'])
ax3.set_title('Kms_Driven v/s Selling_Price')

plt.show()


1. The older the car, the lower the selling price.
2. Cars with a higher current ex-showroom price (Present_Price) sell for more, i.e. Present_Price and Selling_Price are positively related.
3. As Kms_Driven increases, the Selling_Price of the car decreases.

In [None]:
fig,axes = plt.subplots(2,2,figsize=(20,12))

sns.boxplot(x=cars.Fuel_Type,y=cars.Selling_Price,ax=axes[0][0])
axes[0][0].set_title('Fuel_Type v/s Selling_Price')

sns.boxplot(x=cars.Transmission,y=cars.Selling_Price,ax=axes[0][1])
axes[0][1].set_title('Transmission v/s Selling_Price')

sns.boxplot(x=cars.Owner,y=cars.Selling_Price,ax=axes[1][0])
axes[1][0].set_title('Owner v/s Selling_Price')

sns.boxplot(x=cars.Seller_Type,y=cars.Selling_Price,ax=axes[1][1])
axes[1][1].set_title('Seller_Type v/s Selling_Price')

1. Diesel cars have the highest selling price and also the largest number of outliers.
   Diesel > CNG > Petrol in terms of selling price.
2. Automatic cars are more expensive than manual cars.
3. Cars with no previous owner are more expensive than cars with a previous owner.
4. Individuals sell their cars at a lower price than dealers do.

#### Multivariate Analysis

In [None]:
sns.lmplot(x='Kms_Driven',y='Selling_Price',data=cars,fit_reg=False,col='Transmission',row='Seller_Type')   
plt.show()

In [None]:
sns.lmplot(x='Present_Price',y='Selling_Price',data=cars,fit_reg=False,col='Transmission',row='Seller_Type',hue='Fuel_Type')   
plt.show()

1. Individual sellers list only petrol cars.
2. Dealers selling manual-transmission cars offer all 3 fuel types, the most expensive being the diesel cars.

#### Converting categorical variables to dummy variables

In [None]:
#Fuel_Type

cars.Fuel_Type.value_counts()

In [None]:
cars.Seller_Type.value_counts()

In [None]:
cars.Transmission.value_counts()

In [None]:
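# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool), which statsmodels needs later
cars = pd.get_dummies(cars,columns=['Fuel_Type','Seller_Type','Transmission'],drop_first=True,dtype=int)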
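
To see what `drop_first=True` does, here is a small sketch on a made-up frame (the `toy` data is illustrative only). With three fuel categories, two dummy columns remain; the dropped category (the first in sorted order, CNG) becomes the baseline that the other two are compared against.

```python
import pandas as pd

# Hypothetical mini-frame with one categorical column
toy = pd.DataFrame({
    'Fuel_Type': ['Petrol', 'Diesel', 'CNG', 'Petrol'],
    'Price': [3.5, 7.2, 2.9, 4.1],
})

# drop_first=True removes the first category (CNG) to avoid the dummy-variable trap
encoded = pd.get_dummies(toy, columns=['Fuel_Type'], drop_first=True, dtype=int)

print(encoded)
```

A row with zeros in both `Fuel_Type_Diesel` and `Fuel_Type_Petrol` is a CNG car.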

In [None]:
cars.info()

In [None]:
cars.shape

In [None]:
cars.head()

In [None]:
#Heatmap to show the correlation between various variables of the dataset

plt.figure(figsize=(10, 8))
cor = cars.corr()
ax = sns.heatmap(cor,annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

The target variable Selling Price is highly correlated with:
1. Present Price
2. Fuel Type
3. Seller Type

### Linear Regression Model

The simplest form of regression is the linear regression, which assumes that the predictors have a linear relationship with the target variable.

The linear regression equation can be expressed in the following form:

y = a1x1 + a2x2 + a3x3 + ..... + anxn + b

* y is the target variable.
* x1, x2, x3,...xn are the features.
* a1, a2, a3,..., an are the coefficients.
* b is the intercept (bias term) of the model.
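
As a minimal illustration of this equation, the sketch below recovers the coefficients and the intercept from noise-free, made-up data with NumPy's least-squares solver. The data and variable names are assumptions for the example, not part of the car dataset.

```python
import numpy as np

# Made-up data generated exactly from y = 2*x1 - 1*x2 + 3
X_demo = np.array([[1.0, 2.0],
                   [2.0, 0.0],
                   [3.0, 1.0],
                   [4.0, 5.0],
                   [5.0, 3.0]])
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 3.0

# Append a column of ones so that the last parameter plays the role of b
design = np.column_stack([X_demo, np.ones(len(X_demo))])

# Least-squares solution of design @ params = y_demo
params, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
coefs, intercept = params[:-1], params[-1]

print("coefficients:", coefs)
print("intercept:", intercept)
```

Because the data contain no noise, the fit returns a1 = 2, a2 = -1 and b = 3 up to floating-point error.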

In [None]:
y = cars['Selling_Price']
X = cars.drop(['Selling_Price'],axis=1)

In [None]:
#Splitting the data into train and test

from sklearn.model_selection import train_test_split

X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.30 , random_state = 1)

print(X_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
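# Standardization of the data
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Fit the scaler on the training data only
X_train = sc.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=X.columns)

# Reuse the training mean and standard deviation on the test data to avoid leakage
X_test = sc.transform(X_test)
X_test = pd.DataFrame(X_test, columns=X.columns)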
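
The scaler should learn its mean and standard deviation from the training split and then apply those same numbers to the test split. A small NumPy sketch of that idea on made-up arrays (it uses the population standard deviation, `ddof=0`, as `StandardScaler` does):

```python
import numpy as np

# Hypothetical train and test splits with two features
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same mean and std as the training data

print(train_scaled)
print(test_scaled)
```

Refitting on the test split instead would give it a different mean and scale from the one the model was trained on; in this one-row example the test row would even be centred to zero.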

In [None]:
#Building the model using sklearn (LinearRegression solves ordinary least squares directly, not by gradient descent)

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train) # training the algorithm

# Getting the coefficients and intercept

print('coefficients:\n', lin_reg.coef_)
print('\n intercept:', lin_reg.intercept_)
#coeff_df = pd.DataFrame(lin_reg.coef_, X.columns, columns=['Coefficient'])  
#print(coeff_df)

#Now predicting on the test data

y_pred = lin_reg.predict(X_test)

In [None]:
# compare the actual output values for X_test with the predicted values

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.reset_index(inplace=True,drop=True)
df

In [None]:
#Showing the difference between the actual and predicted value

df1 = df.head(25)
df1.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [None]:
#Calculating the error metrics and the R2 score

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

print('r2_score:', metrics.r2_score(y_test,y_pred))

#or
#print('rsquare_Train', lin_reg.score(X_train, y_train))
#print('rsquare_Test', lin_reg.score(X_test, y_test)) 
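
The four metrics printed above can be reproduced by hand. The sketch below does so with NumPy on made-up actual and predicted values (the numbers are illustrative, not results from the car model).

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the units of y

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same units as the target, while R2 is unitless and compares the model against always predicting the mean.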

In [None]:
# Building a linear Regression model using statsmodels (OLS)

In [None]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm

y = cars['Selling_Price']
X = cars.drop(['Selling_Price'],axis=1)
X_constant = sm.add_constant(X)
model = sm.OLS(y, X_constant).fit()
predictions = model.predict(X_constant)
print(model.summary())

### Assumptions

For linear regression, we need to check whether the 5 major assumptions hold.

1. No autocorrelation
2. Normality of error terms
3. Linearity of variables
4. No heteroscedasticity
5. No strong multicollinearity

#### Assumption 1- No autocorrelation


In [None]:
# 1. Durbin Watson Test

#Ho: Linear Regression Residuals are not correlated
#H1: Errors are correlated.

from statsmodels.stats.api import durbin_watson
durbin_watson(model.resid)

The summary also reports the Durbin-Watson value; it is very close to 2, which indicates no autocorrelation.
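
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal NumPy sketch on simulated series (the helper `durbin_watson_stat` and the data are illustrative assumptions):

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences over the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
white_noise = rng.normal(size=2000)   # uncorrelated errors
random_walk = np.cumsum(white_noise)  # strongly positively autocorrelated

dw_white = durbin_watson_stat(white_noise)
dw_walk = durbin_watson_stat(random_walk)

print("white noise:", dw_white)
print("random walk:", dw_walk)
```

Values near 2 suggest no first-order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation.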

In [None]:
#2. Autocorrelation (ACF) plot of the residuals

import statsmodels.tsa.api as smt  # tsa: time series analysis

acf = smt.graphics.plot_acf(model.resid, lags=40, alpha=0.05)  # model.resid comes from the statsmodels fit
plt.show()

# The graph shows no pattern in the residuals, which indicates no autocorrelation


#### Assumption 2- Normality of Residuals

In [None]:
#1. Jarque-Bera test

from scipy import stats
print(stats.jarque_bera(model.resid))

#H0: the errors are normally distributed
#H1: the errors are not normally distributed

The p-value (approximately 0) is less than alpha (0.05), so we reject the null hypothesis: the errors are not normally distributed.
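
The Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4). A sketch with NumPy on simulated samples (the helper `jarque_bera_stat` and the data are illustrative, not the model residuals):

```python
import numpy as np

def jarque_bera_stat(x):
    """JB = n/6 * (skewness^2 + (kurtosis - 3)^2 / 4), using biased moments."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    skewness = np.mean(centered ** 3) / m2 ** 1.5
    kurtosis = np.mean(centered ** 4) / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(0)
jb_normal = jarque_bera_stat(rng.normal(size=5000))       # close to normal
jb_skewed = jarque_bera_stat(rng.exponential(size=5000))  # heavily right-skewed

print("normal sample:", jb_normal)
print("skewed sample:", jb_skewed)
```

Under the null hypothesis the statistic is approximately chi-square distributed with 2 degrees of freedom, so values far above about 5.99 lead to rejection at the 5% level.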

In [None]:
#2. Histogram

sns.histplot(model.resid, kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()

In [None]:
#3. QQ plot

stats.probplot(model.resid, dist='norm', plot=plt)
plt.show()

In [None]:
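#4. Shapiro-Wilk test

# H0: the errors are normally distributed
# H1: the errors are not normally distributed

from scipy.stats import shapiro

teststats, pvalue = shapiro(model.resid)
print(pvalue)
# Decide from the p-value instead of printing a fixed conclusion
if pvalue < 0.05:
    print("Reject the null hypothesis: the errors are not normally distributed")
else:
    print("Fail to reject the null hypothesis")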

#### Assumption 3 - Linearity

In [None]:
#1. Visual representation

%matplotlib inline
%config InlineBackend.figure_format ='retina'
import statsmodels.stats.api as sms
sns.set_style('darkgrid')
sns.mpl.rcParams['figure.figsize'] = (15.0, 9.0)

def linearity_test(model, y):
    '''
    Function for visually inspecting the assumption of linearity in a linear regression model.
    It plots observed vs. predicted values and residuals vs. predicted values.
    
    Args:
    * model - fitted OLS model from statsmodels
    * y - observed values
    '''
    fitted_vals = model.predict()
    resids = model.resid

    fig, ax = plt.subplots(1,2)
    
    sns.regplot(x=fitted_vals, y=y, lowess=True, ax=ax[0], line_kws={'color': 'red'})
    ax[0].set_title('Observed vs. Predicted Values', fontsize=16)
    ax[0].set(xlabel='Predicted', ylabel='Observed')
    
    #LOWESS (Locally Weighted Scatterplot Smoothing) is a popular tool used in regression analysis that creates a smooth line 
    #through a timeplot or scatter plot to help you to see relationship between variables and foresee trends.

    sns.regplot(x=fitted_vals, y=resids, lowess=True, ax=ax[1], line_kws={'color': 'red'})
    ax[1].set_title('Residuals vs. Predicted Values', fontsize=16)
    ax[1].set(xlabel='Predicted', ylabel='Residuals')
    
linearity_test(model, y)

In [None]:
#2. Rainbow test

# H0: the relationship is linear
# H1: the relationship is not linear

import statsmodels.api as sm
sm.stats.linear_rainbow(res=model, frac=0.5)
# frac: fraction of the data used for the central subsample whose fit is compared with the full-sample fit

#### Assumption 4 - Homoscedasticity

In [None]:
from statsmodels.stats.api import het_goldfeldquandt
from statsmodels.compat import lzip


In [None]:
#1. Goldfeld Quandt Test:

# Ho: The residuals are not heteroscedastic / same variance / homoscedastic
# H1: The residuals are Heteroscedastic / unequal variance

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(model.resid, model.model.exog)
lzip(name, test)

#exog - the X variables, endog - the y variable

In [None]:
#2. Visual representation

fitted_vals = model.predict()
resids = model.resid
resids_standardized = model.get_influence().resid_studentized_internal
fig, ax = plt.subplots(1,2,figsize=(20,12))

sns.regplot(x=fitted_vals, y=resids, lowess=True, ax=ax[0], line_kws={'color': 'red'})
ax[0].set_title('Residuals vs Fitted', fontsize=16)
ax[0].set(xlabel='Fitted Values', ylabel='Residuals')

sns.regplot(x=fitted_vals, y=np.sqrt(np.abs(resids_standardized)), lowess=True, ax=ax[1], line_kws={'color': 'red'})
ax[1].set_title('Scale-Location', fontsize=16)
ax[1].set(xlabel='Fitted Values', ylabel='sqrt(abs(Residuals))')

plt.show()

#### Assumption 5 - No Multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
df = pd.DataFrame({'vif': vif[1:]}, index=X.columns)
df

In [None]:
df[df.vif > 5].index
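
The variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. A minimal NumPy sketch on simulated data (the helper `vif_by_hand` and the arrays are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def vif_by_hand(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # unrelated to a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a
X_demo = np.column_stack([a, b, c])

vifs = [vif_by_hand(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

The two nearly identical columns get very large VIFs, while the independent column stays close to 1, which is why a threshold such as 5 is used to flag problematic features.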

In [None]:
## After removing the multicollinear feature 'Fuel_Type_Diesel' (stored in cars1)
# Drop on a new DataFrame so that the original cars DataFrame stays unchanged
cars1 = cars.drop(['Fuel_Type_Diesel'],axis=1)

X_vif = cars1.drop(['Selling_Price'],axis=1)
y_vif = cars1['Selling_Price']
from sklearn.linear_model import LinearRegression

lin_reg_vif = LinearRegression()
# Fit on the reduced feature set, not on the earlier X that still contains Fuel_Type_Diesel
lin_reg_vif.fit(X_vif, y_vif)

print(f'Coefficients: {lin_reg_vif.coef_}')
print(f'Intercept: {lin_reg_vif.intercept_}')
print(f'R^2 score: {lin_reg_vif.score(X_vif, y_vif)}')

In [None]:
## After removing multicollinear feature 'Fuel_Type_Diesel'

import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm

X = cars1.drop(['Selling_Price'],axis=1)
y = cars1['Selling_Price']

X_constant = sm.add_constant(X)
model = sm.OLS(y,X_constant).fit()
predictions = model.predict(X_constant)
model.summary()

In [None]:
vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
pd.DataFrame({'vif': vif[1:]}, index=X.columns).T

In [None]:
# After checking the assumptions, we found that the normality criterion is not met

# We will apply a transformation to the data so that it better meets this assumption

In [None]:
# Residual plot

sns.set(style = 'whitegrid')

cars1['predictions'] = model.predict(X_constant)
residuals = model.resid

ax = sns.residplot(x=cars1.predictions, y=residuals, lowess = True, color = 'g')  # keyword arguments are required by recent seaborn
ax.set(xlabel = 'Fitted value', ylabel = 'Residuals', title = 'Residual vs Fitted Plot \n')
plt.show()

In [None]:
## Square-root transformation of every column

final_df = cars1.transform(lambda x: x**0.5)
final_df.head()

In [None]:
X_final = final_df.drop(['Selling_Price','predictions'],axis=1)
y_final = final_df.Selling_Price
X_constant_final = sm.add_constant(X_final)
model_final = sm.OLS(y_final, X_constant_final).fit()
predictions_final = model_final.predict(X_constant_final)
model_final.summary()

In [None]:
#After transforming the data, the R2 score of the model improved.
#Note that this R2 is measured on the square-root scale of Selling_Price.

#We can look further into regularization techniques with different values of alpha and build more models.

#Among the OLS models tried here, the square-root specification gives the best R2 score.

### Regularized Regression 

#### 1. Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty parameter that is equivalent to the square of the magnitude of the coefficients.

Loss function = OLS + alpha * summation (squared coefficient values)

In the above loss function, alpha is the parameter we need to select. A low alpha value can lead to over-fitting, whereas a high alpha value can lead to under-fitting.
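
The effect of alpha can be seen from the closed-form ridge solution, w = (X^T X + alpha I)^(-1) X^T y. The sketch below applies it to simulated, centred data (the helper `ridge_closed_form` and the data are assumptions for illustration; centring stands in for the intercept).

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; assumes X and y are centred."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
y_demo -= y_demo.mean()

# Coefficient norm for increasing penalty strength
norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, w)
```

With alpha = 0 the result equals ordinary least squares; as alpha grows, the coefficients shrink towards zero but do not become exactly zero.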

Instead of choosing the alpha value arbitrarily, it is better to use cross-validation to choose the tuning parameter alpha. We can do this using the cross-validated ridge regression function, RidgeCV().

In [None]:
from sklearn.linear_model import RidgeCV,Ridge

alphas = 10**np.linspace(10,-2,100)*0.5

# The normalize argument was removed in scikit-learn 1.2; X_train is already standardized above
ridgecv = RidgeCV(alphas = alphas)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_

In the original run, the value of alpha that resulted in the smallest cross-validation error was 0.0814.

In [None]:
rr = Ridge(alpha = ridgecv.alpha_)  # X_train is already standardized; normalize was removed in scikit-learn 1.2
rr.fit(X_train, y_train)

In [None]:
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rr.predict(X_test))))

print('r2_score:', metrics.r2_score(y_test, rr.predict(X_test)))

#### 2. Lasso Regression

Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In Lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the l1-norm).

The loss function for Lasso Regression can be expressed as below:

Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)
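
One way to see why the l1 penalty produces exact zeros is the soft-thresholding operator, which gives the lasso solution in the special case of an orthonormal design. The sketch below compares it with proportional, ridge-style shrinkage on made-up coefficients; the threshold 0.5 and the coefficient values are arbitrary illustrations.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each value towards zero by `threshold`, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Hypothetical least-squares coefficients
ols_coefs = np.array([3.0, -0.4, 0.05, -2.0])

lasso_like = soft_threshold(ols_coefs, 0.5)  # small coefficients become exactly 0
ridge_like = ols_coefs / (1.0 + 0.5)         # every coefficient shrinks, none vanish

print("lasso-style:", lasso_like)
print("ridge-style:", ridge_like)
```

This is the sense in which lasso performs variable selection while ridge only shrinks.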

We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we'll use the Lasso() function; however, this time we'll need to include the argument max_iter = 10000. Other than that change, we proceed just as we did in fitting a ridge model:

In [None]:
from sklearn.linear_model import LassoCV,Lasso

lasso = Lasso(max_iter = 10000)  # X_train is already standardized; normalize was removed in scikit-learn 1.2
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(X_train, y_train)
    coefs.append(lasso.coef_)

We now perform 10-fold cross-validation to choose the best alpha, refit the model, and compute the associated score:

In [None]:
# The alpha grid is left at its default; normalize was removed in scikit-learn 1.2
lassocv = LassoCV(cv = 10, max_iter = 100000)
lassocv.fit(X_train, y_train)

lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)

In the original run, the value of alpha that resulted in the smallest cross-validation error was 0.000332.

In [None]:
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, lasso.predict(X_test))))

print('r2_score:', metrics.r2_score(y_test, lasso.predict(X_test)))

In [None]:
# Plot the coefficients
plt.figure(figsize=(8, 5))

colnames = X_train.columns

plt.plot(range(len(colnames)), lasso.coef_, linestyle='none',marker='*',markersize=5,color='red')
plt.xticks(range(len(colnames)), colnames.values, rotation=60) 
plt.margins(0.02)
plt.show()

We can see that the Ridge model is performing better than the Lasso model.

#### 3. ElasticNet Regression

ElasticNet combines the properties of both Ridge and Lasso regression. It works by penalizing the model using both the l2-norm and the l1-norm.
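
In scikit-learn's parameterization, the ElasticNet penalty is alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The sketch below evaluates it for a made-up coefficient vector (the helper `elastic_net_penalty` is illustrative, not a library function).

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty term added to the least-squares loss in ElasticNet."""
    coefs = np.asarray(coefs, dtype=float)
    l1 = np.sum(np.abs(coefs))   # l1-norm
    l2 = np.sum(coefs ** 2)      # squared l2-norm
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

w = np.array([1.0, -2.0, 0.0])

pure_lasso = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)  # only the l1 part
pure_ridge = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)  # only the l2 part
mixed = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)       # equal mix

print(pure_lasso, pure_ridge, mixed)
```

Setting `l1_ratio` to 1 recovers the lasso penalty and setting it to 0 gives a ridge-style penalty; values in between trade off sparsity against smooth shrinkage.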

In [None]:
# Let's perform a cross-validation to find the best combination of alpha and l1_ratio
from sklearn.linear_model import ElasticNetCV, ElasticNet

# l1_ratio controls how much weight is given to l1 regularization relative to l2
# Arguments left at their scikit-learn defaults are omitted; normalize was removed in scikit-learn 1.2
cv_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, .995, 1], max_iter=2000, cv=5, n_jobs=-1)

In [None]:
cv_model.fit(X_train, y_train)

In [None]:
print('Optimal alpha: %.8f'%cv_model.alpha_)
#The amount of penalization chosen by cross validation

print('Optimal l1_ratio: %.3f'%cv_model.l1_ratio_)
#The compromise between l1 and l2 penalization chosen by cross validation

print('Number of iterations %d'%cv_model.n_iter_)
#number of iterations run by the coordinate descent solver to reach the specified tolerance for the optimal alpha.

In [None]:
# train model with best parameters from CV
# max_iter matches the cross-validation setting so the solver has room to converge
elastic = ElasticNet(l1_ratio=cv_model.l1_ratio_, alpha = cv_model.alpha_, max_iter=2000, fit_intercept=True)
elastic.fit(X_train, y_train)

In [None]:
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, elastic.predict(X_test))))

print('r2_score:', metrics.r2_score(y_test, elastic.predict(X_test)))

Out of the 3 regularization models the Elastic Net Model is performing the best on this dataset.