<table align="center" width=100%>
    <tr>
        <td width="15%">
            <img src="homework.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                    <b> Take-Home <br>(Session 3)
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

### Import the required libraries

In [None]:
# import 'Pandas' 
import pandas as pd 

# import 'Numpy' 
import numpy as np

# 'Statsmodels' is used to build and analyze various statistical models
import statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.tools.eval_measures import rmse

# import 'train_test_split' from 'Scikit-learn' (sklearn) to split the data
from sklearn.model_selection import train_test_split

# to set the digits after decimal place 
pd.options.display.float_format = '{:.5f}'.format

# suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

#### Read the data

Load the csv file and set the first column as index

In [None]:
# read the data
df_car = pd.read_csv("car_data.csv", index_col = 0)

# display the first two rows of the data
df_car.head(2)

Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0


Our objective is to predict the selling price of the cars in this dataset.

**The data definition is as follows:** <br><br>
**Car_Name:** name of the car <br>

**Year:** year in which the car was bought <br>

**Present_Price:** current ex-showroom price of the car (in lakhs)<br>

**Kms_Driven:** distance completed by the car in km <br>

**Fuel_Type:** fuel type of the car <br>

**Seller_Type:** defines whether the seller is a dealer or an individual<br>

**Transmission:** defines whether the car is manual or automatic <br>

**Owner:** defines the number of owners the car has previously had <br>

**Selling_Price:** price the owner wants to sell the car at (in lakhs) (response variable)

### Let's continue with hands-on practice exercises

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>1. Is there multicollinearity present? If yes, which variables are involved in multicollinearity?    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [None]:
# create an empty dataframe to store the VIF for each variable
vif = pd.DataFrame()

# filter the numerical features in the dataset
# select_dtypes: selects the variable having specified datatype
df_numeric = df_car.select_dtypes(include=[np.number])

# create a column of variable names
vif["Features"] = df_numeric.columns

# calculate VIF using list comprehension 
# use for loop to access each variable 
# calculate VIF for each variable and create a column 'VIF' to store the values 
vif["VIF"] = [variance_inflation_factor(df_numeric.values, i) for i in range(df_numeric.shape[1])]

# print the dataframe 
vif

,Features,VIF
0,Year,2.78056
1,Selling_Price,9.35503
2,Present_Price,9.33909
3,Kms_Driven,2.21597
4,Owner,1.07427


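Here we take 5 as the threshold: a variable with VIF > 5 is considered to be involved in multicollinearity. By this rule, `Selling_Price` (VIF = 9.35503) and `Present_Price` (VIF = 9.33909) are involved in multicollinearity, while `Year`, `Kms_Driven` and `Owner` are not.

Note: A threshold of 10 is also commonly used, depending on the dataset.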
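
As a rough illustration of what the VIF measures, the sketch below computes it from its definition, 1 / (1 - R²), where R² comes from regressing one variable on all the others. The data is synthetic and the helper name `vif_by_definition` is hypothetical. The sketch also adds an intercept to each auxiliary regression, so its values need not match those in the cell above, which passes no constant column.

```python
import numpy as np

def vif_by_definition(X):
    """Return 1 / (1 - R^2) for each column, where R^2 comes from regressing
    that column on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        # least-squares fit of the chosen column on the remaining columns
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        centered = target - target.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# synthetic data (an assumption for illustration, not the car dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.2, size=500)   # nearly a copy of 'a'
c = rng.normal(size=500)                  # unrelated to 'a' and 'b'

vifs = vif_by_definition(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly identical columns should show a large VIF, while the unrelated column should stay close to 1.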

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>2. What is the impact of present price of the car and seller type on the selling price?
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [None]:
# consider the independent variables
X = df_car[["Present_Price"]]

# convert the categorical variable to a dummy variable
# get_dummies(): converts a categorical variable to dummy (indicator) variables
# prefix: specifies the prefix added to each level while creating a dummy variable for it
# drop_first=True: indicates n-1 dummy encoding; if set to False, indicates one-hot encoding
dummy_variable = pd.get_dummies(df_car["Seller_Type"], prefix="Seller", drop_first=True)

# concatenate X and the dummy variable
# concat(): concatenates the specified dataframes
# axis: specifies the axis to concatenate along; use 1 to join column-wise and 0 to join row-wise
X = pd.concat([X, dummy_variable],axis=1)

# consider the dependent variable
y = df_car["Selling_Price"]

# fit a model with an intercept using fit()
# add_constant(): adds the intercept term to the model
MLR_full = sm.OLS(y, sm.add_constant(X)).fit()

# print the summary output
print(MLR_full.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.786
Model:                            OLS   Adj. R-squared:                  0.785
Method:                 Least Squares   F-statistic:                     548.4
Date:                Tue, 14 Apr 2020   Prob (F-statistic):          1.34e-100
Time:                        14:20:33   Log-Likelihood:                -683.71
No. Observations:                 301   AIC:                             1373.
Df Residuals:                     298   BIC:                             1385.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 1.5423      0.26

Interpretation of β coefficients:

β<sub>const</sub> = 1.5423 represents the average selling price (in lakhs) of a car sold by a dealer when the present price is 0  <br>

β<sub>Present_Price</sub> = 0.4758 implies that the selling price increases by 0.4758 lakhs on average for a one-unit (one lakh) increase in the present price, keeping the seller type constant  <br>

β<sub>Seller_Individual</sub> = -1.4493 implies that the selling price is 1.4493 lakhs lower on average when the car is sold by an individual rather than a dealer, keeping the present price constant
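
To make the interpretation concrete, the sketch below plugs the three coefficients quoted above into the fitted equation. The function name `predicted_selling_price` is hypothetical, and the present price of 5.59 lakhs is used only as an example input.

```python
# coefficients as quoted in the interpretation above
B_CONST = 1.5423
B_PRESENT_PRICE = 0.4758
B_SELLER_INDIVIDUAL = -1.4493

def predicted_selling_price(present_price, seller_is_individual):
    """Evaluate the fitted equation; the dummy is 1 for an individual seller, 0 for a dealer."""
    dummy = 1 if seller_is_individual else 0
    return B_CONST + B_PRESENT_PRICE * present_price + B_SELLER_INDIVIDUAL * dummy

dealer = predicted_selling_price(5.59, seller_is_individual=False)
individual = predicted_selling_price(5.59, seller_is_individual=True)

# same car, different seller type: the gap is the dummy coefficient
print("dealer - individual:", round(dealer - individual, 4))

# same seller type, present price one lakh higher: the gap is the slope
one_unit_more = predicted_selling_price(6.59, seller_is_individual=False)
print("effect of one extra lakh:", round(one_unit_more - dealer, 4))
```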

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>3. Consider all the numeric features in the data. Do all of them significantly contribute to explaining the variation in the selling price?
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [None]:
# consider the independent variables
# we need to drop the feature 'Selling_Price' from X 
# select_dtypes: selects the variable having specified datatype
# include: includes the variables with specified datatype
# drop(): drops specified column(s)/row(s) from the dataframe
# axis: specifies whether to drop labels from index or columns; use 1 for columns and 0 for index
X = df_car.select_dtypes(include=[np.number]).drop(["Selling_Price"],axis=1)

# consider the dependent variable
y = df_car["Selling_Price"]

# fit a model with an intercept using fit()
# add_constant(): adds the intercept term to the model
LM_model_num = sm.OLS(y, sm.add_constant(X)).fit()

# print the summary output
print(LM_model_num.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.852
Model:                            OLS   Adj. R-squared:                  0.850
Method:                 Least Squares   F-statistic:                     426.6
Date:                Tue, 14 Apr 2020   Prob (F-statistic):          1.66e-121
Time:                        14:20:33   Log-Likelihood:                -628.25
No. Observations:                 301   AIC:                             1267.
Df Residuals:                     296   BIC:                             1285.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          -937.7642     94.392     -9.935

From the p-values of the regression coefficients, we see that the variable `Kms_Driven` does not contribute significantly to explaining the variation in the selling price, since its p-value > 0.05 (level of significance).
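
Each p-value in the summary is derived from a t statistic, which is the coefficient divided by its standard error. As a small check, the sketch below reproduces the t value of the `const` row from the two numbers visible in the output above. The cut-off of roughly 1.97 for a two-sided test at the 5% level with 296 residual degrees of freedom is an approximation, stated here as an assumption.

```python
# values read from the 'const' row of the summary above
coef = -937.7642
std_err = 94.392

# t statistic = coefficient / standard error
t_stat = coef / std_err
print("t statistic:", round(t_stat, 3))

# approximate two-sided 5% critical value for 296 degrees of freedom (assumption)
APPROX_CRITICAL_VALUE = 1.97
print("significant at the 5% level:", abs(t_stat) > APPROX_CRITICAL_VALUE)
```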

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>4. In the model obtained in question 3, consider the interaction effect of the present price of the car and the year in which it was purchased. Compare the resultant model with the model obtained in previous question and give your interpretation 
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [None]:
# consider the independent variables
# we need to drop the feature 'Selling_Price' from X 
# select_dtypes: selects the variable having specified datatype
# include: includes the variables with specified datatype
# drop(): drops specified column(s)/row(s) from the dataframe
# axis: specifies whether to drop labels from index or columns; use 1 for columns and 0 for index
X = df_car.select_dtypes(include=[np.number]).drop(["Selling_Price"],axis=1)

# add the interaction variable
X['Price*Year'] = df_car['Present_Price']*df_car['Year'] 

# consider the dependent variable
y = df_car["Selling_Price"]

# fit a model with an intercept using fit()
# add_constant(): adds the intercept term to the model
LM_model_interaction = sm.OLS(y, sm.add_constant(X)).fit()

# print the summary output
print(LM_model_interaction.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     1546.
Date:                Tue, 14 Apr 2020   Prob (F-statistic):          3.05e-209
Time:                        14:20:33   Log-Likelihood:                -418.79
No. Observations:                 301   AIC:                             849.6
Df Residuals:                     295   BIC:                             871.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const           101.2676     58.597      1.728

We see that the interaction effect is significant. Comparing the two models (with and without the interaction effect), the model with the interaction effect has a higher R-squared (0.963) and a higher adjusted R-squared (0.963) than the model without it, which has an R-squared of 0.852 and an adjusted R-squared of 0.850.
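
Adjusted R-squared penalises R-squared for the number of predictors, which is why it is the fairer figure when one model has an extra term. The sketch below applies the standard formula to the rounded values from the two summaries; the helper name `adjusted_r_squared` is hypothetical, and because the inputs are rounded to three decimals the last digit of the result can differ slightly from the summary.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# values taken from the two summaries (301 observations; 4 and 5 predictors)
adj_without_interaction = adjusted_r_squared(0.852, n_obs=301, n_predictors=4)
adj_with_interaction = adjusted_r_squared(0.963, n_obs=301, n_predictors=5)

print("without interaction:", round(adj_without_interaction, 4))
print("with interaction:   ", round(adj_with_interaction, 4))
```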

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>5. Regress the selling price over the present price. Compare the 99% and 95% confidence interval of present price of a car
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [None]:
# consider the independent variables
X = df_car["Present_Price"]

# consider the dependent variable
y = df_car["Selling_Price"]

# fit a model with an intercept using fit()
# add_constant(): adds the intercept term to the model
LM_model = sm.OLS(y, sm.add_constant(X)).fit()

# 99% confidence interval
# conf_int(0.01): constructs the 99% CI
# use '[1:]' to consider only the 99% CI of present price of the car 
print("The 99% CI is: \n", LM_model.conf_int(0.01)[1:])

print("\n\n")

# 95% confidence interval
# conf_int(0.05): constructs the 95% CI
# use '[1:]' to consider only the 95% CI of present price of the car 
print("The 95% CI is: \n", LM_model.conf_int(0.05)[1:])

The 99% CI is: 
                     0       1
Present_Price 0.47481 0.55889



The 95% CI is: 
                     0       1
Present_Price 0.48494 0.54876


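We see that the 99% confidence interval (0.47481, 0.55889) is wider than the 95% confidence interval (0.48494, 0.54876): a higher confidence level requires a wider range of plausible values for the coefficient.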
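
Both intervals are built around the same point estimate and differ only in their width. The sketch below checks this using the interval limits printed above; the helper names `midpoint` and `width` are hypothetical.

```python
# interval limits copied from the output above
ci_99 = (0.47481, 0.55889)
ci_95 = (0.48494, 0.54876)

def midpoint(interval):
    """Centre of the interval, i.e. the point estimate of the coefficient."""
    return (interval[0] + interval[1]) / 2

def width(interval):
    """Distance between the upper and lower limits."""
    return interval[1] - interval[0]

print("midpoints:", round(midpoint(ci_99), 5), round(midpoint(ci_95), 5))
print("widths:   ", round(width(ci_99), 5), round(width(ci_95), 5))
```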

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                        <b>6. Verify the statement: The sum of the residuals in any regression model that contains an intercept β<sub>0</sub> is always zero
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

        To verify the statement, we will fit a regression model of 'Selling_Price' on 'Present_Price' 

In [None]:
# consider the independent variables
X = df_car["Present_Price"]

# consider the dependent variable
y = df_car["Selling_Price"]

# fit a model with an intercept using fit()
# add_constant(): adds the intercept term to the model
LM_model = sm.OLS(y, sm.add_constant(X)).fit()

# obtain the sum of residuals
# resid gives the residuals of the models
# sum() gives the sum of all residuals
resid_sum = LM_model.resid.sum()

# round off the answer to 10 decimal places
round(resid_sum, 10)

0.0

Thus, the statement is verified.
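
The intercept is what makes the residuals sum to zero. As a minimal sketch on synthetic data (not the car dataset), the code below fits the same line with and without an intercept using NumPy least squares; the helper name `residual_sum` is hypothetical.

```python
import numpy as np

# synthetic data with a non-zero true intercept (an assumption for illustration)
rng = np.random.default_rng(42)
x = rng.uniform(1, 10, size=200)
y = 5.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

def residual_sum(design, response):
    """Fit by least squares and return the sum of the residuals."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return float((response - design @ coef).sum())

# design matrix with a column of ones (intercept) and without it
with_intercept = residual_sum(np.column_stack([np.ones_like(x), x]), y)
without_intercept = residual_sum(x.reshape(-1, 1), y)

print("with intercept:   ", round(with_intercept, 10))
print("without intercept:", round(without_intercept, 4))
```

With the intercept the sum is zero up to floating-point error; without it the sum is generally far from zero.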

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>7. Consider two models as specified below. Compare the performance of the models
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

                First model:
        
        Selling_Price ~ Year + Present_Price + Kms_Driven + Owner + Fuel_Type + Seller_Type + Transmission
        
        
                Second model:
        
        Selling_Price ~ Year + Present_Price + Kms_Driven + Owner 

In [None]:
# consider the independent variables
# select_dtypes: selects the variable having specified datatype
# include: includes the variables with specified datatype
# drop(): drops specified column(s)/row(s) from the dataframe
# axis: specifies whether to drop labels from index or columns; use 1 for columns and 0 for index
df_car_num = df_car.select_dtypes(include=np.number).drop(["Selling_Price"],axis=1)

# consider all the categorical variables in the data
# select_dtypes: selects the variable having specified datatype
# include: includes the variables with specified datatype
df_car_cat = df_car.select_dtypes(include="object")

# convert the categorical variables to dummy variables
# get_dummies(): converts categorical variables to dummy (indicator) variables
# drop_first=True: indicates n-1 dummy encoding; if set to False, indicates one-hot encoding
dummy_variables = pd.get_dummies(df_car_cat, drop_first=True)

# concatenate the numerical and dummy variables
# axis: specifies the axis to concatenate along; use 1 to join column-wise and 0 to join row-wise
X = pd.concat([df_car_num, dummy_variables],axis=1)

# add intercept in X
X.insert(loc = 0, column = 'intercept',value = np.ones(X.shape[0]))

# consider the dependent variable
y = df_car["Selling_Price"]

The train-test split:

In [None]:
# split data into train subset and test subset
# set 'random_state' to generate the same dataset each time you run the code 
# 'test_size' specifies the proportion of data to be included in the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,test_size = 0.3)

# check the dimensions of the train & test subset using 'shape'
# print dimension of train set
print('X_train', X_train.shape)
print('y_train', y_train.shape)

# print dimension of test set
print('X_test', X_test.shape)
print('y_test', y_test.shape)

X_train (210, 9)
y_train (210,)
X_test (91, 9)
y_test (91,)
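
The row counts above follow from the 30% test share applied to 301 observations. The sketch below reproduces them under the assumption that the test share is rounded up to a whole number of rows and the remainder goes to the training set; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(n_samples=301, test_size=0.3)
print("train rows:", n_train)
print("test rows: ", n_test)
```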


### First model:

In [None]:
# fit a full model with an intercept using fit()
MLR_full = sm.OLS(y_train, X_train).fit()

# print the summary output
print(MLR_full.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.879
Method:                 Least Squares   F-statistic:                     191.2
Date:                Tue, 14 Apr 2020   Prob (F-statistic):           1.35e-89
Time:                        14:20:33   Log-Likelihood:                -423.34
No. Observations:                 210   AIC:                             864.7
Df Residuals:                     201   BIC:                             894.8
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
intercept               -844

Predictions:

In [None]:
# predicting the selling price
y_pred = MLR_full.predict(X_test)

In [None]:
# create the result table for all accuracy scores
# accuracy measures considered for model comparison are RMSE, R-squared value and Adjusted R-squared value
# create a list of column names
cols = ['Model', 'R-Squared', 'Adj. R-Squared',  'RMSE']

# create an empty dataframe with these columns
result_tabulation = pd.DataFrame(columns = cols)

# compiling the required information
linreg_full_model = pd.Series({'Model': "Linreg full model",
                           'R-Squared': MLR_full.rsquared,
                      'Adj. R-Squared': MLR_full.rsquared_adj ,
                                'RMSE': rmse(y_test, y_pred)
                   })

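# appending our result table
# DataFrame.append() was removed in pandas 2.0, so add the Series as a new row using concat()
result_tabulation = pd.concat([result_tabulation, linreg_full_model.to_frame().T], ignore_index = True)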

# view the result table
result_tabulation

,Model,R-Squared,Adj. R-Squared,RMSE
0,Linreg full model,0.88385,0.87923,1.66717


### Second model:

In [None]:
# consider the numeric variables
# drop(): drops specified column(s)/row(s) from the dataframe
# axis: specifies whether to drop labels from index or columns; use 1 for columns and 0 for index
X_train = X_train.drop(['Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Seller_Type_Individual','Transmission_Manual'],axis=1)

# fit the numeric model with an intercept using fit()
MLR_num = sm.OLS(y_train, X_train).fit()

# print the summary output
print(MLR_num.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.853
Model:                            OLS   Adj. R-squared:                  0.850
Method:                 Least Squares   F-statistic:                     298.1
Date:                Tue, 14 Apr 2020   Prob (F-statistic):           3.23e-84
Time:                        14:20:33   Log-Likelihood:                -447.86
No. Observations:                 210   AIC:                             905.7
Df Residuals:                     205   BIC:                             922.5
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
intercept      -996.2108    116.383     -8.560

Predictions:

In [None]:
# consider the numeric variables
# drop(): drops specified column(s)/row(s) from the dataframe
# axis: specifies whether to drop labels from index or columns; use 1 for columns and 0 for index
X_test = X_test.drop(['Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Seller_Type_Individual','Transmission_Manual'],axis=1)

# predicting the selling price
y_pred = MLR_num.predict(X_test)

In [None]:
# compiling the required information
linreg_num_model = pd.Series({'Model': "Linreg numeric model",
                          'R-Squared': MLR_num.rsquared,
                     'Adj. R-Squared': MLR_num.rsquared_adj ,
                               'RMSE': rmse(y_test, y_pred)
                   })

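# appending our result table
# DataFrame.append() was removed in pandas 2.0, so add the Series as a new row using concat()
result_tabulation = pd.concat([result_tabulation, linreg_num_model.to_frame().T], ignore_index = True)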

# view the result table
result_tabulation

,Model,R-Squared,Adj. R-Squared,RMSE
0,Linreg full model,0.88385,0.87923,1.66717
1,Linreg numeric model,0.85329,0.85043,1.87571


The full model performs somewhat better: it has a higher R-squared (0.88385 vs 0.85329), a higher adjusted R-squared (0.87923 vs 0.85043) and a lower test RMSE (1.66717 vs 1.87571). However, the difference between the two models is modest, so the categorical variables appear to play only a limited role in explaining the variation in selling price.
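
For reference, the RMSE used in this comparison is the square root of the mean squared difference between actual and predicted values, so a lower value means predictions are closer to the observed prices. The sketch below computes it by hand; the function name `rmse_manual` is hypothetical and the numbers are made up for illustration rather than taken from the car data.

```python
import math

def rmse_manual(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up values for illustration only
actual_prices = [2.0, 4.0, 6.0]
predicted_prices = [2.5, 3.5, 6.0]

error = rmse_manual(actual_prices, predicted_prices)
print("RMSE:", round(error, 4))
```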