## Geely Auto Car Price Case Study

#### Problem Statement:

A Chinese automobile company __Geely Auto__ aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts. 

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, **they want to understand the factors affecting the pricing of cars in the American market**, since those may be very different from the Chinese market.

Essentially, the company wants to know:

- Which variables are significant in predicting the price of a car.

- How well those variables describe the price of a car.

## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the car dataset.

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import pandas as pd

In [3]:
cars = pd.read_csv("CarPrice_Assignment.csv")

FileNotFoundError: File b'CarPrice_Assignment.csv' does not exist

In [None]:
# Check the head of the dataset
cars.head()

Let's inspect the various aspects of the cars dataframe

In [None]:
cars.shape

In [None]:
cars.info()

In [None]:
cars.describe()

## Step 2: Visualising the Data

Let's put effort into **understanding the data**.
- I will try to identify any multicollinearity.
- I'll also identify whether any predictors have a strong direct association with the outcome variable.

I'll visualise the data using `matplotlib` and `seaborn`.
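
Alongside the plots, one way to screen for multicollinearity numerically is to list the predictor pairs whose absolute correlation exceeds some cutoff. The sketch below uses a small made-up frame rather than the car data; the helper name `high_corr_pairs` and the 0.8 cutoff are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(v), 3)) for (a, b), v in strong.items()]


# Toy data (hypothetical values): "length" and "weight" move together.
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.1, 3.9, 6.2, 8.0, 9.9],
    "rpm":    [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data, the same helper could be called on the numeric columns of `cars`; pairs it reports are candidates for closer inspection, not automatic removals.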

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Visualising Numeric Variables

Let's make a pairplot of all the numeric variables.
With `%matplotlib inline`, the pairplot is rendered inside the notebook, where **double-clicking the pairplot** enlarges it so the graph can be viewed comfortably.

In [None]:
#sns.pairplot(cars, kind='reg')
#plt.show()

#### Visualising Categorical Variables

Let's make a boxplot for the categorical variables.

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,5,1)
sns.boxplot(x = 'fueltype', y = 'price', data = cars)
plt.subplot(2,5,2)
sns.boxplot(x = 'aspiration', y = 'price', data = cars)
plt.subplot(2,5,3)
sns.boxplot(x = 'doornumber', y = 'price', data = cars)
plt.subplot(2,5,4)
sns.boxplot(x = 'carbody', y = 'price', data = cars)
plt.subplot(2,5,5)
sns.boxplot(x = 'drivewheel', y = 'price', data = cars)
plt.subplot(2,5,6)
sns.boxplot(x = 'enginelocation', y = 'price', data = cars)
plt.subplot(2,5,7)
sns.boxplot(x = 'enginetype', y = 'price', data = cars)
plt.subplot(2,5,8)
sns.boxplot(x = 'cylindernumber', y = 'price', data = cars)
plt.subplot(2,5,9)
sns.boxplot(x = 'fuelsystem', y = 'price', data = cars)
plt.show()

Let's visualize `price` against `enginelocation`, with `enginetype` as the hue.

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'enginelocation', y = 'price', hue = 'enginetype', data = cars)
plt.show()

Let's also visualize `price` against `fueltype`, with `fuelsystem` as the hue.
I chose these variables because I suspect they are correlated.

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'fueltype', y = 'price', hue = 'fuelsystem', data = cars)
plt.show()

## Step 3: Data Preparation

- As we can see, our dataset has many columns with only two distinct values.

- To fit a regression line, we need numerical values, not strings. Hence, we convert these columns to 1s and 0s, where 1 and 0 each have a specific meaning.

- We will also create dummy variables where needed (for variables with more than two levels).

- `CarName` is the car company followed by the model. We need only the company, so let's create a column `car_company` and drop `CarName`.

In [None]:
# Check the head of the dataset
cars.CarName.head()

In [None]:
# It's clearly visible that the name is CAR_COMPANY + SPACE + CAR_MODEL_NAME.
# We will split CarName on SPACE and take the first element of the resulting list.
cars['car_company'] = cars.CarName.apply(lambda x : x.split(' ')[0].lower())

In [None]:
# let's check the unique car company names
print(cars.car_company.nunique())
cars.car_company.unique()

> `volkswagen`, `vokswagen`, and `vw` all refer to the company Volkswagen.
- Let's fix this.

In [None]:
cars = cars.replace({'car_company': ["vokswagen", "vw"]}, "volkswagen")

In [None]:
# let's check the unique car company names
print(cars.car_company.nunique())
cars.car_company.unique()

> `maxda` and `mazda` both refer to Mazda.
- Let's fix this.

In [None]:
cars = cars.replace({'car_company': "maxda"}, "mazda")

In [None]:
# let's check the unique car company names
print(cars.car_company.nunique())
cars.car_company.unique()

> `porsche` and `porcshce` both refer to Porsche.
- Let's fix this.

In [None]:
cars = cars.replace({'car_company': "porcshce"}, "porsche")

In [None]:
# let's check the unique car company names
print(cars.car_company.nunique())
cars.car_company.unique()

> `toyouta` and `toyota` both refer to Toyota.
- Let's fix this.

In [None]:
cars = cars.replace({'car_company': "toyouta"}, "toyota")

In [None]:
# let's check the unique car company names
print(cars.car_company.nunique())
cars.car_company.unique()
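
The four spelling fixes above could also be applied in one pass with a single mapping. This is a minimal sketch on a few made-up `CarName` values; the dictionary name `spelling_fixes` is an assumption, and its entries simply restate the corrections made in the cells above.

```python
import pandas as pd

# Hypothetical sample of raw names in the "company model" form seen in CarName.
names = pd.Series(["vw rabbit", "vokswagen dasher", "maxda glc",
                   "porcshce panamera", "toyouta tercel", "Nissan latio"])

# Misspelled company -> canonical company, as identified above.
spelling_fixes = {
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
    "maxda": "mazda",
    "porcshce": "porsche",
    "toyouta": "toyota",
}

# First token of the name, lower-cased, then corrected where a fix exists.
company = names.str.split(" ").str[0].str.lower().replace(spelling_fixes)
print(company.tolist())
```

Keeping the corrections in one dictionary makes it easy to see every fix at a glance and to add another if a new misspelling turns up.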

In [None]:
# check the dataset
cars.head()

In [None]:
# Let's drop the carName
cars.drop('CarName', axis=1, inplace=True)

In [None]:
# Let's verify that CarName is dropped
cars.info()

Let's visualize `car_company` against `price`. We could not plot this earlier because the column had not been created yet.

In [None]:
plt.figure(figsize = (20, 10))
sns.boxplot(y = 'car_company', x = 'price', data = cars)
plt.show()

In [None]:
# let's drop car_ID; it is of no use to us here

cars.drop('car_ID', axis=1, inplace=True)

#### Categorical variables with two levels
- fueltype [gas : 1, diesel: 0]
- aspiration [std : 1, turbo: 0]
- doornumber [two : 1, four: 0]
- enginelocation [front : 1, rear: 0]

In [None]:
# converting fuel type to numeric
cars['fueltype'] = cars.fueltype.map({"gas": 1, "diesel": 0})

In [None]:
cars.fueltype.tail(20)

In [None]:
# converting aspiration to numeric
cars['aspiration'] = cars.aspiration.map({"std": 1, "turbo": 0})

In [None]:
# converting doornumber to numeric
cars['doornumber'] = cars.doornumber.map({"two": 1, "four": 0})

In [None]:
# converting enginelocation to numeric
cars['enginelocation'] = cars.enginelocation.map({"front": 1, "rear": 0})

In [None]:
# Let's check the cars dataframe now
cars.head()
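
`Series.map` returns `NaN` for any value missing from the dictionary, so a misspelled or unexpected level would silently become a missing value. A small guard can catch that. The helper name `map_binary` below is hypothetical and the data is made up for illustration.

```python
import pandas as pd


def map_binary(series, mapping):
    """Map a two-level categorical column to 0/1 and fail loudly on unmapped values."""
    mapped = series.map(mapping)
    unmapped = series[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values: {list(unmapped)}")
    return mapped.astype(int)


fuel = pd.Series(["gas", "diesel", "gas"])
fuel_num = map_binary(fuel, {"gas": 1, "diesel": 0})
print(fuel_num.tolist())
```

A call such as `map_binary(cars.fueltype, {"gas": 1, "diesel": 0})` would then raise instead of quietly introducing `NaN` values into the regression inputs.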

#### Categorical variables with more than two levels
- car_company
- carbody
- drivewheel
- enginetype
- cylindernumber
- fuelsystem

In [None]:
# creating dummies for car_company

carcompany = pd.get_dummies(cars['car_company'], drop_first=True)

In [None]:
# Add the results to the original cars dataframe

cars = pd.concat([cars, carcompany], axis = 1)

In [None]:
# Now let's see the head of our dataframe.

cars.head(10)

In [None]:
# Drop 'car_company' as we have created the dummies for it

cars.drop(['car_company'], axis = 1, inplace = True)

In [None]:
cars.head()

In [None]:
# creating dummies for carbody

body = pd.get_dummies(cars['carbody'], drop_first=True)

In [None]:
# Add the results to the original cars dataframe

cars = pd.concat([cars, body], axis = 1)

In [None]:
# Now let's see the head of our dataframe.

cars.head(10)

In [None]:
# Drop 'carbody' as we have created the dummies for it

cars.drop(['carbody'], axis = 1, inplace = True)

In [None]:
cars.head()

In [None]:
# creating dummies for drivewheel
wheel = pd.get_dummies(cars['drivewheel'], drop_first=True)

# Add the results to the original cars dataframe
cars = pd.concat([cars, wheel], axis = 1)

# Drop 'drivewheel' as we have created the dummies for it
cars.drop(['drivewheel'], axis = 1, inplace = True)

In [None]:
cars.head()

In [None]:
# creating dummies for enginetype
engine = pd.get_dummies(cars['enginetype'], drop_first=True)

# Add the results to the original cars dataframe
cars = pd.concat([cars, engine], axis = 1)

# Drop 'enginetype' as we have created the dummies for it
cars.drop(['enginetype'], axis = 1, inplace = True)

In [None]:
# creating dummies for cylindernumber
cylinder = pd.get_dummies(cars['cylindernumber'], drop_first=True)

# Add the results to the original cars dataframe
cars = pd.concat([cars, cylinder], axis = 1)

# Drop 'cylindernumber' as we have created the dummies for it
cars.drop(['cylindernumber'], axis = 1, inplace = True)

In [None]:
# creating dummies for fuelsystem
fuelsys = pd.get_dummies(cars['fuelsystem'], drop_first=True)

# Add the results to the original cars dataframe
cars = pd.concat([cars, fuelsys], axis = 1)

# Drop 'fuelsystem' as we have created the dummies for it
cars.drop(['fuelsystem'], axis = 1, inplace = True)

In [None]:
cars.head()

In [None]:
cars.shape
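
The per-column cells above repeat the same three steps: create dummies, concatenate, drop the source column. `pd.get_dummies` can do all three for several columns at once through its `columns` argument. The sketch below runs on a made-up frame; note that this form prefixes each dummy with its source column (for example `carbody_sedan`), so the resulting names differ from the unprefixed ones used in the rest of this notebook.

```python
import pandas as pd

# Made-up rows; only the column names echo the car dataset.
toy = pd.DataFrame({
    "price": [100, 200, 150, 300],
    "carbody": ["sedan", "hatchback", "sedan", "wagon"],
    "drivewheel": ["fwd", "rwd", "fwd", "4wd"],
})

# Encode both columns, drop the first level of each, and remove the originals.
encoded = pd.get_dummies(toy, columns=["carbody", "drivewheel"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level keeps k-1 dummies for a variable with k levels, which avoids a redundant column once a constant is added to the model.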

## Step 4: Splitting the Data into Training and Testing Sets


In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(cars, train_size = 0.7, test_size = 0.3, random_state = 100)

### Rescaling the Features 

Using MinMax scaling.
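
Min-max scaling maps each column to [0, 1] via (x - min) / (max - min), with min and max learned from the training data. A minimal sketch with made-up numbers shows that the same fitted scaler is then reused on held-out rows with `transform` (not `fit_transform`), so held-out values can fall outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data.
train = np.array([[10.0], [20.0], [30.0]])
held_out = np.array([[40.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)    # learns min=10, max=30
held_out_scaled = toy_scaler.transform(held_out)  # reuses the training min/max

print(train_scaled.ravel())
print(held_out_scaled.ravel())
```

Fitting only on the training rows keeps information about the test set out of the preprocessing step.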

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [None]:
# Apply scaler() to all the columns except the dummy variables

num_vars = ['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight','enginesize','compressionratio','horsepower',
           'peakrpm','citympg','highwaympg','price']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

df_train.head()

In [None]:
df_train.describe()

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

#plt.figure(figsize = (56, 40))
#sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
#plt.show()

### Dividing into X and Y sets for the model building

In [None]:
y_train = df_train.pop('price')
X_train = df_train

### RFE
Recursive feature elimination

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of output variables equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=15)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
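
To make the two attributes listed above easier to read, here is a self-contained sketch of RFE on synthetic data. `support_` is a boolean mask of the kept features, and `ranking_` gives rank 1 to kept features and larger numbers to those eliminated earlier. The data from `make_regression` and the choice of 3 features are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: 8 features, of which 3 carry signal.
X_toy, y_toy = make_regression(n_samples=100, n_features=8, n_informative=3,
                               noise=1.0, random_state=0)

selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X_toy, y_toy)

print(selector.support_)   # True for the 3 retained features
print(selector.ranking_)   # 1 for retained features, >1 for eliminated ones
```

RFE ranks features by the size of the fitted coefficients, which is one reason the predictors are put on a comparable scale before this step.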

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

### Building the model using `statsmodels`, for the detailed statistics

> Model 1

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_lm_1 = sm.add_constant(X_train_rfe)

In [None]:
# Running the linear model
lm_1 = sm.OLS(y_train, X_train_lm_1).fit()

In [None]:
#Let's see the summary of our linear model
print(lm_1.summary())

`stroke` is insignificant in the presence of the other variables, so it can be dropped.

In [None]:
X_train_new = X_train_rfe.drop(['stroke'], axis=1)

> Model 2

Rebuilding the model without `stroke`

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_lm_2 = sm.add_constant(X_train_new)

In [None]:
lm_2 = sm.OLS(y_train,X_train_lm_2).fit()

In [None]:
print(lm_2.summary())

In [None]:
X_train_new.columns

> Model 3

The `twelve` cylinder dummy is insignificant in the presence of the other variables, so it can be dropped.

In [None]:
X_train_new = X_train_new.drop(['twelve'],axis=1)

In [None]:
X_train_new.shape

Rebuilding the model without `twelve` cylinder

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_lm_3 = sm.add_constant(X_train_new)

In [None]:
lm_3 = sm.OLS(y_train, X_train_lm_3).fit()

In [None]:
print(lm_3.summary())

In [None]:
X_train_new.columns

The `five` cylinder dummy is insignificant in the presence of the other variables, so it can be dropped.

In [None]:
X_train_new = X_train_new.drop(['five'],axis=1)

> Model 4

Rebuilding the model without `five` cylinder

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_lm_4 = sm.add_constant(X_train_new)

In [None]:
lm_4 = sm.OLS(y_train, X_train_lm_4).fit()

In [None]:
print(lm_4.summary())

The `four` cylinder dummy is insignificant in the presence of the other variables, so it can be dropped.

In [None]:
X_train_new = X_train_new.drop(['four'],axis=1)

> Model 5

Rebuilding the model without `four` cylinder

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_lm_5 = sm.add_constant(X_train_new)

In [None]:
lm_5 = sm.OLS(y_train, X_train_lm_5).fit()

In [None]:
print(lm_5.summary())

## Checking VIF

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We generally want a VIF that is less than 5. So there are clearly some variables we need to drop.
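
For predictor i, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing that predictor on all the others. The sketch below computes this directly with NumPy on synthetic data so the formula is visible; the helper name `vif_table` is hypothetical. It fits each auxiliary regression with an intercept, whereas `variance_inflation_factor` regresses on exactly the columns it is given, so the two can disagree when no constant column is passed, as in the cells here.

```python
import numpy as np
import pandas as pd


def vif_table(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    rows = []
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        rows.append((col, round(1.0 / (1.0 - r_squared), 2)))
    table = pd.DataFrame(rows, columns=["Features", "VIF"])
    return table.sort_values(by="VIF", ascending=False)


rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.1 * rng.normal(size=200),  # nearly a copy of "a"
    "b": rng.normal(size=200),                   # unrelated to the others
})
vifs = vif_table(toy)
print(vifs)
```

The two nearly identical columns get large VIFs while the unrelated one stays close to 1, which is the pattern the cutoff of 5 is meant to flag.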

### Dropping the variable and updating the model

As you can see from the summary and the VIF dataframe, some variables are still insignificant. One of these is the `two` cylinder dummy.

In [None]:
X_train_new = X_train_new.drop(['two'],axis=1)

> Model 6

Rebuilding the model without `two` cylinder

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_lm_6 = sm.add_constant(X_train_new)

In [None]:
lm_6 = sm.OLS(y_train, X_train_lm_6).fit()

In [None]:
print(lm_6.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
X= X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

As you can see from the summary and the VIF dataframe, some variables are still insignificant. One of these is `const`.
Let's drop this.

In [None]:
# X_train_new = X_train_new.drop(['const'],axis=1)

In [None]:
# X_train_new.columns

> Model 7

Rebuilding the model without `const`.

In [None]:
# Add constant
# import statsmodels.api as sm
# X_train_lm_7 = sm.add_constant(X_train_new)

In [None]:
# X_train_lm.columns

In [None]:
# lm_7 = sm.OLS(y_train, X_train_lm_7).fit()

In [None]:
# print(lm_7.summary())

In [None]:
# Calculate the VIFs again for the new model

#vif = pd.DataFrame()
#X= X_train_new
#vif['Features'] = X.columns
#vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
#vif['VIF'] = round(vif['VIF'], 2)
#vif = vif.sort_values(by = "VIF", ascending = False)
#vif

As you can see from the summary and the VIF dataframe, some variables are still insignificant. One of these is `boreratio`.
Let's drop this.

In [None]:
X_train_new = X_train_new.drop(['boreratio'],axis=1)

> Model 8

Rebuilding the model without `boreratio`.

In [None]:
# Add constant
import statsmodels.api as sm
X_train_lm_8 = sm.add_constant(X_train_new)

In [None]:
lm_8 = sm.OLS(y_train, X_train_lm_8).fit()

In [None]:
print(lm_8.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
X= X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

As you can see from the summary and the VIF dataframe, some variables are still insignificant. One of these is `curbweight`.
Let's drop this.

In [None]:
X_train_new = X_train_new.drop(["curbweight"],axis=1)

> Model 9

Rebuilding the model without `curbweight`.

In [None]:
# Add constant
import statsmodels.api as sm
X_train_lm_9 = sm.add_constant(X_train_new)

In [None]:
lm_9 = sm.OLS(y_train, X_train_lm_9).fit()

In [None]:
print(lm_9.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
X= X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

`peugeot` has a high p-value.
Let's drop it.

In [None]:
X_train_new = X_train_new.drop(['peugeot'],axis=1)

> Model 10

Rebuilding the model without `peugeot`.

In [None]:
# Add constant
import statsmodels.api as sm
X_train_lm_10 = sm.add_constant(X_train_new)

In [None]:
lm_10 = sm.OLS(y_train, X_train_lm_10).fit()

In [None]:
print(lm_10.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
X= X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

As you can see from the summary and the VIF dataframe, some variables are still insignificant. One of these is `carwidth`.
Let's drop this.

In [None]:
X_train_new = X_train_new.drop(["carwidth"],axis=1)

> Model 11

Rebuilding the model without `carwidth`.

In [None]:
# Add constant
import statsmodels.api as sm
X_train_lm_11 = sm.add_constant(X_train_new)

In [None]:
lm_11 = sm.OLS(y_train, X_train_lm_11).fit()

In [None]:
print(lm_11.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
X= X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The `three` cylinder dummy has a high p-value.
Let's drop it.

In [None]:
X_train_new = X_train_new.drop(["three"],axis=1)

> Model 12

Rebuilding the model without `three` cylinder.

In [None]:
# Add constant
import statsmodels.api as sm
X_train_lm_12 = sm.add_constant(X_train_new)

In [None]:
lm_12 = sm.OLS(y_train, X_train_lm_12).fit()

In [None]:
print(lm_12.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
X= X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

As you can see, both the VIFs and the p-values are now within an acceptable range, so we go ahead and make our predictions using this model.

## Step 7: Residual Analysis of the train data

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
# Predict on the training data with the final model (Model 12)
y_train_price = lm_12.predict(X_train_lm_12)
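
A minimal sketch of the check described above, on synthetic data rather than the car model: compute residuals as actual minus predicted, then bin them. With an intercept in the model, OLS residuals average to zero by construction, so what matters is the shape of the histogram (roughly symmetric and bell-shaped around zero). All names here are illustrative.

```python
import numpy as np

# Synthetic data with normally distributed noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=300)
y = 2.0 + 5.0 * x + rng.normal(scale=0.3, size=300)

# Fit y = b0 + b1 * x by least squares.
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Residuals are actual minus predicted values.
residuals = y - design @ coef

# Bin the residuals; plotting these counts gives the histogram.
counts, edges = np.histogram(residuals, bins=20)
print(counts)
```

In this notebook the residuals would be `y_train - y_train_price`, and the binning step would typically be replaced by a plotted histogram, for example with `plt.hist`.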