### Problem Statement
Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

 

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

### Data Set

https://www.kaggle.com/saivivekreddy00/car-price

### Business Goal 

You are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. 


### Model Evaluation

Calculate the R-squared score on the test set.
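
For reference, R-squared compares the model's squared errors with those of a baseline that always predicts the mean. Below is a minimal sketch of that calculation; the function name `r_squared` and the toy numbers are illustrative assumptions and do not come from the car data set.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained by the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, not taken from the car data set
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]
r2_demo = r_squared(y_true_demo, y_pred_demo)
print(r2_demo)
```

For the same inputs this should agree with `sklearn.metrics.r2_score`, which is what the notebook uses on the test set.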


### Steps Followed

##### 1. Read and Understand Data
##### 2. Visualize the Data
##### 3. Data Preparation
##### 4. Building the model
##### 5. Residual Analysis of the train data
##### 6. Making Predictions
##### 7. Conclusion





Let us start by importing pandas and numpy.

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

### 1. Read and Understand Data

In [None]:
car = pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')

Basic Data exploration

In [None]:
car.head()

In [None]:
car.shape

There are 205 data items in the data set

In [None]:
car.info()

#### There are no missing values in the data set

From the dictionary file ( please see "Data Dictionary - carprices.xlsx" ), symboling seems to be an Ordinal categorical type.

In [None]:
print(car['symboling'].value_counts())

In [None]:
# Let us drop the car_ID variable which looks useless to analysis.
car_original = car.copy()
car = car.drop("car_ID", axis=1)

In [None]:
# symboling is an ordinal categorical variable stored as int64.
# The conversion to object type is left disabled, so it stays numeric in this analysis.
#car['symboling'] = car['symboling'].astype('object')

In [None]:
car.describe()

Judging by the mean, median, max, etc., most numerical attributes do not appear at first glance to have severe outliers.

##### Convert text categorical variables to lowercase

In [None]:
car_categorical = car.select_dtypes(include=['object'])
car_categorical.columns

In [None]:
car.update(car.select_dtypes(include='object')\
    .apply(lambda x: x.astype(str).str.lower()))

In [None]:
car.head()

In [None]:
print(car['CarName'].value_counts())

In [None]:
#car['CarName'].loc[car["CarName"].str.startswith('N')]

There is a variable named CarName which is comprised of two parts - 
the first word is the name of 'car company' and the second is the 'car model'.
For example, chevrolet impala has 'chevrolet' as the car company name and 
'impala' as the car model name. 
Let's consider only company name as the independent variable for model building. 

In [None]:
car['CarName'] = car['CarName'].apply(lambda x: x.split()[0])

In [None]:
#car.CarName.value_counts()
sorted(dict(car.CarName.value_counts()).items())

Several car company names are misspelled or abbreviated: mazda as maxda, porsche as porcshce, toyota as toyouta, and volkswagen as vokswagen or vw. We need to correct these.
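
The same clean-up can also be expressed as a single lookup table. The sketch below is one possible way to do it, not the method this notebook uses; the series `names` and the dictionary `corrections` are hypothetical and use made-up sample values in the same "company model" layout as CarName.

```python
import pandas as pd

# Hypothetical sample of raw names in the same "company model" layout as CarName
names = pd.Series([
    "maxda rx3", "toyouta tercel", "vw rabbit",
    "vokswagen rabbit", "porcshce panamera", "bmw x3",
])

# Keep only the first word (the company name)
company = names.str.split().str[0]

# Map each known misspelling or abbreviation to its canonical spelling
corrections = {
    "maxda": "mazda",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
}
company = company.replace(corrections)  # replaces whole values, not substrings
print(company.tolist())
```

Keeping the corrections in one dictionary makes it easy to add a new misspelling without writing another `apply` call.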

In [None]:
car['CarName'] = car['CarName'].apply(lambda x: x.replace('maxda', 'mazda'))
car['CarName'] = car['CarName'].apply(lambda x: x.replace('porcshce', 'porsche'))
car['CarName'] = car['CarName'].apply(lambda x: x.replace('toyouta', 'toyota'))
#car['CarName'] = car['CarName'].apply(lambda x: x.replace('vokswagen', 'volkswagen'))

car.loc[(car['CarName'] == "vw") | (car['CarName'] == "vokswagen"), 'CarName'] = 'volkswagen'

In [None]:
sorted(dict(car.CarName.value_counts()).items())

### 2. Visualize the Data

#### Numerical Data

In [None]:
#%matplotlib inline 
import matplotlib.pyplot as plt

car.hist(bins=50, figsize=(20,15))

#### Observations:
1. enginesize, compressionratio, horsepower, peakrpm and citympg have some extreme outliers.
2. The attributes are on different scales.
3. The target variable price is skewed to the right and has extreme outliers around 40000.
    
    

In [None]:
car.price.quantile([.25, .5, .75, .90, .92, .95, .97, .99]) 

##### Try removing outliers from the dataset with respect to price

In [None]:
len(car[car.price <= car.price.quantile(0.95)])

In [None]:
car = car[car.price <= car.price.quantile(.95)]
car.price.describe()
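
To illustrate what the quantile filter above does, here is a minimal sketch on a toy series. The prices are invented for the example; only the 0.95 cutoff mirrors the notebook.

```python
import pandas as pd

# Hypothetical prices with one extreme value at the top
prices = pd.Series(
    [5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 45000],
    name="price",
)

cutoff = prices.quantile(0.95)        # value below which 95% of the prices fall
trimmed = prices[prices <= cutoff]    # keep only rows at or below the cutoff

print(cutoff, len(prices), len(trimmed))
```

Note that this removes rows purely by rank, so it always discards roughly the top 5% whether or not those rows are true anomalies.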

In [None]:
car_numeric = car.select_dtypes(include=['float64', 'int64'])
car_numeric.columns

In [None]:
car.shape

##### Let us see the pair plots between price (dependent variable) and other numerical features

In [None]:
import seaborn as sns
attributes1 = ['price', 'symboling', 'wheelbase','carlength','carwidth','carheight','curbweight', 'enginesize']
sns.pairplot(car[attributes1])

#### Except for carheight and symboling, all the other attributes seem to have good linear correlation with price.
Let us examine the remaining numerical attributes.

In [None]:
attributes2 = ['price','boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','highwaympg']
sns.pairplot(car[attributes2])

##### price seems to have 
Good positive correlation with horsepower. <br>
Some positive correlation with boreratio. <br>
Good negative correlation with citympg and highwaympg. <br>
Weak correlation with stroke, peakrpm and compressionratio.





##### Attributes that have good correlation with each other.
highwaympg and citympg have good positive correlation <br>
Both highwaympg and citympg have negative correlation with horsepower  

##### From the above plots, we see that the problem is well suited for Linear Regression modelling

Let's also look at the correlation matrix on how other numerical features are correlated with the target variable price.

In [None]:
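# Use only the numeric columns; recent pandas versions raise an error on object columns
corr_matrix = car.select_dtypes(include='number').corr()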
corr_matrix['price'].sort_values(ascending=False)

#### From above, we can see 
the following features have highest positive correlation with price -<br> 
curbweight, carwidth, enginesize, horsepower and carlength. <br><br>

the following features have highest negative correlation with price -<br> 
citympg (City mileage) and highwaympg (Highway mileage) <br>

So prices tend to be higher for cars with lower mileage, and vice versa. This makes sense, because small city cars usually have higher mileage and lower prices.


In [None]:
# Figure size
plt.figure(figsize=(16,8))

# Heatmap
sns.heatmap(corr_matrix, cmap="YlGnBu", annot=True)
plt.show()

##### The heatmap above shows strong correlation among a number of features.
So not all the features will be needed for a successful model, and there is good scope for feature elimination.
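
Reading pairs off a large heatmap is error-prone, so it can help to list the strongly correlated pairs directly. The sketch below is one way to do that; the helper `correlated_pairs`, the 0.8 threshold and the synthetic columns `a`, `b`, `c` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears only once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: b is nearly a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
high = correlated_pairs(demo)
print(high)
```

Applied to the numeric columns of `car`, a list like this would give candidate pairs from which one feature could be dropped.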

#### Now let us look at the categorical variables

In [None]:
car_categorical = car.select_dtypes(include=['object'])
car_categorical.columns

In [None]:
plt.figure(figsize=(20, 12))
#plt.subplot(1,3,1)
#sns.boxplot(x = 'symboling', y = 'price', data = car)
plt.subplot(1,3,1)
sns.boxplot(x = 'CarName', y = 'price', data = car)
plt.subplot(1,3,2)
sns.boxplot(x = 'fueltype', y = 'price', data = car)
plt.subplot(1,3,3)
sns.boxplot(x = 'aspiration', y = 'price', data = car)
plt.show()

Diesel cars generally look costlier than gas cars.

In [None]:
# Let us take a closer look at Car Companies

plt.figure(figsize=(20, 20))
sns.boxplot(x = 'CarName', y = 'price', data = car)

bmw, buick, porsche and volvo cars look costlier than others

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(1,3,1)
sns.boxplot(x = 'doornumber', y = 'price', data = car)
plt.subplot(1,3,2)
sns.boxplot(x = 'carbody', y = 'price', data = car)
plt.subplot(1,3,3)
sns.boxplot(x = 'drivewheel', y = 'price', data = car)
plt.show()

Cars with four doors look costlier, likewise convertible and wagon cars

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(1,4,1)
sns.boxplot(x = 'enginelocation', y = 'price', data = car)
plt.subplot(1,4,2)
sns.boxplot(x = 'enginetype', y = 'price', data = car)
plt.subplot(1,4,3)
sns.boxplot(x = 'cylindernumber', y = 'price', data = car)
plt.subplot(1,4,4)
sns.boxplot(x = 'fuelsystem', y = 'price', data = car)
plt.show()

There seems to be only front enginelocation type in the data. Let's confirm it and drop the feature if it is the case, because it is not going to add value to modelling.

In [None]:
car.enginelocation.value_counts()

In [None]:
car = car.drop('enginelocation', axis=1)

In [None]:
car.shape

### 3. Data Preparation

##### Apply one-hot encoding to the categorical variables 

In [None]:
car_categorical = car.select_dtypes(include=['object'])
car_categorical.columns

In [None]:
car_categorical.head()

In [None]:
#category_list = car_categorical.columns
#print(category_list)

In [None]:
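# drop_first=True removes one level per variable, since it is implied by the others
# dtype=int keeps the dummies as 0/1 integers so that statsmodels can use them later
dummy1 = pd.get_dummies(car_categorical, drop_first=True, dtype=int)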
dummy1                                       
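
To see what `drop_first=True` does, here is a minimal sketch on a tiny frame. The three rows are made up for the example and the frame name `demo` is hypothetical.

```python
import pandas as pd

# Toy frame with two categorical columns
demo = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "wagon", "hatchback"],
})

# One column per category level, minus the first (alphabetical) level of each variable
dummies = pd.get_dummies(demo, drop_first=True, dtype=int)
print(dummies)
```

The dropped levels (`diesel` and `hatchback` here) become the baseline: a row with zeros in all dummies of a variable belongs to that baseline level.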

In [None]:
car.head()

In [None]:
# Add the dummy variables to the car dataframe
car = pd.concat([car, dummy1], axis = 1)
car.head()

In [None]:
# Drop the original categorical variables, as we have created dummies for them
car.drop(car_categorical.columns, axis = 1, inplace = True)
car.head()

#### Splitting the Data into Training and Testing Sets and rescaling

In [None]:
from sklearn.model_selection import train_test_split

np.random.seed(0)

train_set, test_set = train_test_split(car, test_size=0.3, random_state=100)

In [None]:
#print(car.shape)
train_set.shape

In [None]:
train_set.head(10)

#### Rescaling the numerical features

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
numerical_vars = car_numeric.columns

In [None]:
train_set[numerical_vars] = scaler.fit_transform(train_set[numerical_vars])
train_set.head(10)

In [None]:
train_set.describe()

#### Dividing into X and Y sets for the model building

In [None]:
y_train = train_set.pop('price')
X_train = train_set
#X_train.columns

In [None]:

X_train.shape

In [None]:
y_train

### 4. Building the model

##### Let us first use RFE to select the best 13 features ( coarse tuning ) as we saw a significant correlation between a number of features. Then we will go with manual removal of features one by one ( fine tuning ).

In [None]:
from sklearn.feature_selection import RFE 
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

lr.fit(X_train, y_train)

In [None]:
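# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lr, n_features_to_select=13)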
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
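
As an illustration of how RFE behaves, the sketch below runs it on synthetic data where only two of six columns actually drive the target. The data, the coefficients and the names `X_demo`, `y_demo` and `selector` are assumptions made for this example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Only columns 0 and 3 influence the target; the other columns are pure noise
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the model and removes the weakest feature until 2 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.support_)   # indices of the kept columns
print(selected, selector.ranking_)
```

Selected features get rank 1 in `ranking_`; higher ranks mark features that were eliminated earlier.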

In [None]:
cols = X_train.columns[rfe.support_]

###### Features selected by RFE

In [None]:
cols

###### Features NOT selected by RFE

In [None]:
X_train.columns[~rfe.support_]

##### Building the model using statsmodels, for the detailed statistics. We will use only the features picked by RFE.

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[cols]
X_train_rfe.shape

In [None]:
# Adding a constant variable as statsmodels will not add it by default
import statsmodels.api as sm
X_train_lm = sm.add_constant(X_train_rfe)
X_train_lm.shape

In [None]:
### Fit Statsmodels model
lm = sm.OLS(y_train, X_train_lm).fit()

In [None]:
### Print the model summary
print(lm.summary())

#### Observations
R-squared: 0.919 <br>
Adj. R-squared: 0.911 <br>
Prob (F-statistic): 1.39e-59

Both R2 and Adj. R2 look very good. Prob (F-statistic) is very low ( < 0.05 ), which shows that the predictor variables, taken together, are significant.
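
As a rough illustration of how these summary statistics relate, both Adj. R-squared and the F-statistic can be derived from R-squared, the number of observations n and the number of predictors k. The sketch below uses hypothetical values for all three; they are not the values from the summary above.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Penalise R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

def f_statistic(r2, n_obs, n_features):
    """Overall F-statistic of an OLS model that includes an intercept."""
    explained = r2 / n_features
    unexplained = (1.0 - r2) / (n_obs - n_features - 1)
    return explained / unexplained

# Hypothetical values chosen for easy arithmetic
adj_demo = adjusted_r2(r2=0.90, n_obs=101, n_features=10)
f_demo = f_statistic(r2=0.90, n_obs=101, n_features=10)
print(round(adj_demo, 4), round(f_demo, 2))
```

Adj. R-squared is always at most R-squared, and the gap widens as more predictors are added for the same number of observations.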

##### cylindernumber_three seems to have high p-value of 0.191 which says this feature is insignificant.
Let's go and check VIF of the variables.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
X_train_new = X_train_lm.drop('const',axis=1)

In [None]:
X_train_new.columns

In [None]:
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### The VIFs of all the features are in the acceptable range, i.e. less than 5.
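
For intuition, the VIF of a feature is 1 / (1 - R2), where R2 comes from regressing that feature on all the other features. The numpy sketch below implements this textbook definition with an intercept in each auxiliary regression; it is an illustration on synthetic data, not the statsmodels implementation, and the function name `vif_values` is hypothetical. As far as I know, `variance_inflation_factor` regresses on the columns exactly as passed, so its numbers can differ when no constant column is included.

```python
import numpy as np

def vif_values(X):
    """VIF per column: 1 / (1 - R2) of that column regressed on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Auxiliary regression: intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
vifs = vif_values(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

The two nearly identical columns get large VIFs while the independent column stays close to 1, which is the pattern the threshold of 5 is meant to catch.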

##### Let's drop cylindernumber_three which has a high p-value 

In [None]:
X = X.drop('cylindernumber_three',axis=1)
#X = X.drop('enginetype_dohcv', axis=1)

##### Again build a statsmodel fit on the reduced set of features

In [None]:
X_train_lm = sm.add_constant(X)
X_train_lm.columns

In [None]:
lm = sm.OLS(y_train, X_train_lm).fit()

In [None]:
print(lm.summary())

#### Observations
R-squared: 0.918 <br>
Adj. R-squared: 0.910 <br>
Prob (F-statistic): 2.97e-60

R2 and Adj. R2 drop very slightly from the previous model, but both still look very good.
Prob (F-statistic) is also lower than before, which is good.

The p-values of all the features are < 0.05, which is good.

Let us check VIF too

In [None]:
X_train_new = X_train_lm.drop('const',axis=1)

In [None]:
vif = pd.DataFrame()

X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### The VIFs of all the features are in the acceptable range, i.e. less than 5.

#### p-values and VIF are all looking fine within acceptable ranges. So we can try this model on test set.

### 5. Residual Analysis of the train data

##### We now predict on the train set and check whether the error terms are normally distributed, which is one of the assumptions of linear regression.

In [None]:
y_train_price = lm.predict(X_train_lm)

In [None]:
fig = plt.figure()
sns.histplot((y_train - y_train_price), bins = 20, kde = True)  # histplot replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

##### Yes! The error terms are approximately normally distributed, though not perfectly.
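
A visual check can be backed up with two simple numbers: skewness and excess kurtosis, both of which are close to 0 for normally distributed errors. The sketch below computes them on synthetic samples; the helper name `skew_and_excess_kurtosis` and the two samples are assumptions for illustration, not residuals from this model.

```python
import numpy as np

def skew_and_excess_kurtosis(residuals):
    """Sample skewness and excess kurtosis (both near 0 for normal data)."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()          # standardise the residuals
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
normal_like = rng.normal(size=5000)        # behaves like well-behaved errors
right_skewed = rng.exponential(size=5000)  # clearly violates normality

skew_n, kurt_n = skew_and_excess_kurtosis(normal_like)
skew_e, kurt_e = skew_and_excess_kurtosis(right_skewed)
print(round(skew_n, 2), round(kurt_n, 2), round(skew_e, 2), round(kurt_e, 2))
```

The same function could be called on `y_train - y_train_price` to quantify how far the residuals are from the bell shape seen in the plot.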

### 6. Making Predictions

##### Let us predict test set using the model. 

First we have to apply the standard scaler (that we fit to the train set ) to the test data

In [None]:
test_set[numerical_vars] = scaler.transform(test_set[numerical_vars])
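
The reason for calling `transform` rather than `fit_transform` here is that the test data must be scaled with the mean and standard deviation learned from the train data. A minimal numpy sketch of the same idea follows; the toy arrays are made up for the example.

```python
import numpy as np

# Hypothetical single-feature train and test sets
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

# Statistics are learned from the train set only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the train statistics, never refit on test

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled test values need not have zero mean or unit variance, and that is expected: refitting the scaler on the test set would leak information about it into the evaluation.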

In [None]:
test_set.shape

#### Dividing into X_test and y_test

In [None]:
y_test = test_set.pop('price')
X_test = test_set

In [None]:
X_test.shape

In [None]:
y_test

##### Use only the significant features that we narrowed down finally when we built the model above

In [None]:
# Creating X_test_new dataframe by dropping unnecessary features from X_test
X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_test_pred = lm.predict(X_test_new)

In [None]:
# Plotting y_test and y_test_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test, y_test_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_test_pred', fontsize=16)                          # Y-label

In [None]:
from sklearn.metrics import r2_score

r2_score(y_true=y_test, y_pred=y_test_pred)

##### There seems to be a significant gap between the R2 of the train set (0.918) and the test set (0.82). This might indicate overfitting on the train set.
##### Let us see if there is high correlation between the selected features and try to drop few more features.


In [None]:
X_train_new.columns

In [None]:
# Figure size
plt.figure(figsize=(8,5))

# Heatmap
sns.heatmap(car[X_train_new.columns].corr(), cmap="YlGnBu", annot=True)
plt.show()

##### Try dropping more of the predictor variables that show higher correlation. In this case, we will drop drivewheel_rwd, which seems to have high collinearity with curbweight.

In [None]:
# Drop drivewheel_rwd which had  VIF > 2 and also high correlation with curbweight
X = X.drop('drivewheel_rwd',axis=1)

Build model again with reduced feature set

In [None]:
X_train_lm2 = sm.add_constant(X)
X_train_lm2.columns

In [None]:
lm2 = sm.OLS(y_train, X_train_lm2).fit()
print(lm2.summary())

##### Observations

R-squared: 0.895 <br>
Adj. R-squared: 0.886

##### R2 and Adj-R2 have dropped from the previous model, but still look very good.

CarName_saab and carbody_hardtop have high p-values.

Let us check the VIF values

In [None]:
X_train_new = X_train_lm2.drop('const',axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### VIF values are looking fine. Let us drop CarName_saab which has high p-value

In [None]:
X = X.drop('CarName_saab', axis=1)

Build model again with reduced feature set

In [None]:
X_train_lm2 = sm.add_constant(X)
X_train_lm2.columns

In [None]:
lm2 = sm.OLS(y_train, X_train_lm2).fit()

In [None]:
print(lm2.summary())

##### Observations

carbody_hardtop has high p-value of 0.217.

Let us check VIF again

In [None]:
X_train_new = X_train_lm2.drop('const',axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### VIF values are looking fine. Let us drop carbody_hardtop which has high p-value

In [None]:
X = X.drop('carbody_hardtop', axis=1)

Build model again with reduced feature set

In [None]:
X_train_lm2 = sm.add_constant(X)
X_train_lm2.columns

In [None]:
lm2 = sm.OLS(y_train, X_train_lm2).fit()


In [None]:
lm2.summary()

##### Observations

R-squared: 0.893 <br>
Adj. R-squared: 0.885

##### R2 and Adj-R2 look very good and the p-values are all within acceptable limits.
Let us check the VIF values.


In [None]:
X_train_new = X_train_lm2.drop('const',axis=1)


In [None]:
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### The VIF values are also fine ( < 5 ).

##### Let us check the correlation heatmap again for the reduced feature set

In [None]:
# Figure size
plt.figure(figsize=(8,5))

# Heatmap
sns.heatmap(car[X_train_new.columns].corr(), cmap="YlGnBu", annot=True)
plt.show()



#### The correlations are also looking fine with no high correlations. So let us predict test set using this model

In [None]:
# Creating X_test_new dataframe by dropping unnecessary features from X_test
X_test_new = X_test[X_train_new.columns]

In [None]:
# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_test_pred = lm2.predict(X_test_new)

In [None]:
# Plotting y_test and y_test_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test, y_test_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_test_pred', fontsize=16)                          # Y-label

In [None]:
r2_score(y_true=y_test, y_pred=y_test_pred)

### 7. Conclusion

##### Test R2 looks good at 0.87, which is much closer to the train set R2 of 0.89.

##### So, we built a good model which performs very well both on train and test sets, with the following features

In [None]:
X_train_new.columns