In this notebook, I use Linear Regression with RFE to find the factors that contribute to car prices, building successive models and pruning features based on p-value (feature significance) and VIF (multicollinearity).

I will be updating the notebook with additional ML techniques soon (such as adding regularization during model building, or building polynomial features of degree 2 and 3).

Author: Akhil Shukla

If you use parts of this notebook in your own scripts, please give some sort of credit (for example link back to this).

Thanks!

# Problem Statement

A Chinese automobile company <b>Geely Auto</b> aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts. 

They want to specifically understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

1. Which variables are significant in predicting the price of a car
2. How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.

<b>We will follow the steps below to build a model from the significant variables, which Geely Autos should consider when trying to understand car pricing in the American automobile market:

1. `Reading and understanding the data`

2. `Visualizing the data`

3. `Data Preparation`

4. `Training the model`

5. `Residual analysis`

6. `Predicting and evaluation on the test set`</b>

### At the end we will make suggestions and recommendations to Geely Autos to help them enter the US market, utilizing our analysis on factors leading to Car price behavior.

____________________________________________________________________________________________________________________________________________

### IMPORTING LIBRARIES

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Importing statsmodels for the statistical summary and the VIF calculation
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing sklearn methods
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Importing sklearn RFE and LinearRegression methods
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

pd.set_option('display.max_columns', 600)
pd.set_option('display.max_rows', 50)

## STEP 1: Reading and Understanding the data

In [None]:
dfcar = pd.read_csv('../input/car-data/CarPrice_Assignment.csv')
dfcar.head()

    Dataset shape          ---- The data contains 205 rows and 26 columns.

    Null value status      ---- None of the values in the dataset are null.
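
The shape and null-value statements above can be checked directly. A minimal sketch on a small stand-in frame (the `toy` rows are made up for illustration; in the notebook the same calls would be made on `dfcar`):

```python
import pandas as pd

# Toy stand-in for dfcar; the values are illustrative only.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "enginesize": [130, 152, 109],
    "price": [13495.0, 16500.0, 13950.0],
})

n_rows, n_cols = toy.shape            # (number of rows, number of columns)
null_counts = toy.isnull().sum()      # count of nulls in each column
has_nulls = bool(null_counts.any())   # True if any column contains a null

print(f"{n_rows} rows x {n_cols} columns; any nulls: {has_nulls}")
```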

In [None]:
dfcar.info()

In [None]:
dfcar.describe().T

## Step 2: Visualising the Data

Let's visualise the dataset to look for visible clues, and check:
- which features show a strong linear relationship with our target variable 'price'.
- the correlation among the features.

#### Visualising All the Variables

Let's make a pairplot of all the variables

In [None]:
sns.set(font_scale=2)
sns.pairplot(dfcar)

**Observations** from above graphs:

1. When plotted against 'price', most of the independent variables show a roughly linear relationship with it. Hence a linear regression model should be a good choice for the given data.

2. Many features also appear to be correlated with each other (multicollinearity). Let's check that now.

In [None]:
plt.figure(figsize = (20,10))  
sns.set(font_scale=1.5)
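# numeric_only=True skips the text columns, which recent pandas versions no longer ignore silently
sns.heatmap(dfcar.corr(numeric_only=True), annot = True)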

Looking at the above correlation table:
- highwaympg and citympg are very highly correlated, so we can drop one of them. Let's drop highwaympg.

- wheelbase, carlength, carwidth, curbweight and enginesize are highly correlated with each other - **let's plot a correlation map of just these variables for better readability**
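
Reading correlated pairs off a large heatmap is error-prone, so the same information can be listed programmatically. The sketch below is only an illustration: `high_corr_pairs` and the 0.8 threshold are hypothetical choices, not part of the original notebook, and the toy frame stands in for `dfcar`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, corr) for numeric pairs with |corr| >= threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle so that each pair is listed once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items()]

# Toy data: b is an exact multiple of a, while c is only weakly related to both.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 1, 4, 2, 3]})
result = high_corr_pairs(toy)
print(result)
```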

In [None]:
#Lets drop the highwaympg for reasons mentioned above.

dfcar = dfcar.drop(['highwaympg'], axis=1)

In [None]:
plt.figure(figsize = (20,10))
corr1 = dfcar[['wheelbase','carlength','carwidth','curbweight','carheight','enginesize','price']].corr()
corr1.style.background_gradient(cmap='Greens')

#### Observations:

`curbweight`, `carwidth`, `wheelbase` and `carlength` are highly correlated with each other. Let's keep just one of them (`carlength`).

- drop `curbweight` -> it is highly correlated with other features.
- drop `carwidth`   -> it is highly correlated with other features.
- drop `wheelbase`  -> it is highly correlated with other features.
- `carheight` is less correlated with the others, so we'll keep it for now.
- `enginesize` has a very high correlation with 'price', so we'll keep it.

In [None]:
#Lets drop the curbweight, carwidth and wheelbase column for reasons mentioned above.

dfcar = dfcar.drop(['curbweight', 'carwidth', 'wheelbase'], axis=1)

#### Visualising Categorical Variables

As you might have noticed, there are a few categorical variables as well. Let's make a boxplot for some of these variables.

In [None]:
sns.set(font_scale=1)
plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(y = 'fueltype', x = 'price', data = dfcar)
plt.subplot(3,3,2)
sns.boxplot(y = 'aspiration',x = 'price', data = dfcar)
plt.subplot(3,3,3)
sns.boxplot(y = 'doornumber', x = 'price', data = dfcar)
plt.subplot(3,3,4)
sns.boxplot(y = 'carbody', x = 'price', data = dfcar)
plt.subplot(3,3,5)
sns.boxplot(y = 'drivewheel', x = 'price', data = dfcar)
plt.subplot(3,3,6)
sns.boxplot(y = 'enginelocation', x = 'price', data = dfcar)
plt.subplot(3,3,7)
sns.boxplot(y = 'enginetype', x = 'price', data = dfcar)
plt.subplot(3,3,8)
sns.boxplot(y = 'cylindernumber', x = 'price', data = dfcar)
plt.subplot(3,3,9)
sns.boxplot(y = 'fuelsystem', x = 'price', data = dfcar)

plt.show()

#### Observations:

Let's note what is visible in the boxplots. This might come in handy for intuitive modelling later.

1. Fueltype: not much of a difference in price.

2. Aspiration: turbo is on average costlier than std.

3. DoorNumber: no visible effect on price.

4. Carbody: hatchback and wagon are usually cheaper. Hardtop is available across a wide range of prices.

5. Drivewheel: rwd is usually costlier.

6. Enginelocation: rear is clearly expensive.

7. Enginetype: no clear pattern; will check later.

8. Cylindernumber: eight is clearly expensive. Three and four are comparatively cheaper.

9. Fuelsystem: no clear pattern; will check later.

Let's create a derived variable containing just the car company name. We do not need the car model information.

In [None]:
dfcar['CarComp'] = dfcar['CarName'].str.replace('-',' ')
dfcar['CarComp'] = dfcar['CarComp'].str.split(' ', n=1, expand=True)[0]

**Notice** two problems with the derived car company names:

- some car companies are duplicated because of upper/lower case differences. We will fix this by converting all car company names to lowercase.
- some car company names are misspelled, so they are treated as different companies. We need to fix this as well.


In [None]:
#convert all car company names to lowercase
dfcar['CarComp'] = dfcar['CarComp'].str.lower()

# Some car company names are entered with typos or as abbreviations. Let's fix that.
correct_name = {'maxda' : 'mazda', 'porcshce' : 'porsche', 'toyouta' : 'toyota', 'vokswagen' : 'volkswagen', 'vw' : 'volkswagen'}
dfcar['CarComp'] = dfcar['CarComp'].replace(correct_name, regex=True)
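
The whole clean-up (extract the first word, lowercase it, fix known typos) can be tried on a handful of names before it is applied to the full column. This is a sketch only: the `names` series is made up for illustration, and it uses exact-match `replace` (no `regex=True`), which is an assumption about what is sufficient for these particular typos.

```python
import pandas as pd

# Illustrative car names; not taken from the dataset.
names = pd.Series(["Nissan versa", "maxda rx3", "vw rabbit",
                   "alfa-romero giulia", "toyouta tercel"])

fixes = {"maxda": "mazda", "porcshce": "porsche", "toyouta": "toyota",
         "vokswagen": "volkswagen", "vw": "volkswagen"}

company = (names.str.replace("-", " ", regex=False)   # 'alfa-romero' -> 'alfa romero'
                .str.split(" ", n=1).str[0]           # keep the first word only
                .str.lower()                          # 'Nissan' -> 'nissan'
                .replace(fixes))                      # exact-match typo fixes

cleaned = sorted(company.unique())
print(cleaned)
```

Listing the unique values after cleaning, as in the last lines, is a quick way to spot any remaining misspellings.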

In [None]:
#Lets drop the CarName and car_ID column as we dont need it anymore
dfcar = dfcar.drop(['CarName', 'car_ID'], axis=1)

In [None]:
#Lets also visualize the derived feature CarComp

plt.figure(figsize=(15,8))
sns.boxplot(x = 'price', y = 'CarComp', data = dfcar)

- We observe that the car company 'mercury' has just a single entry in the data, so it won't be of any use. We'll discard it later, when we create dummy variables for the car company names.

Now let's apply some encoding to the categorical features.

## Step 3: Data Preparation

Categorical feature encoding
- Let us map the two-level (binary) features to 0/1, and create dummy variables for the remaining `nominal features`, so that they can be used in a regression model.

In [None]:
## Binary (two-level) features:

# dfcar.fueltype.unique()           # 'gas', 'diesel'

dfcar.fueltype = dfcar.fueltype.map({'gas':0, 'diesel':1})

# dfcar.aspiration.unique()         # 'std', 'turbo'

dfcar.aspiration = dfcar.aspiration.map({'std':0, 'turbo':1})

# dfcar.doornumber.unique()         # 'two', 'four'

dfcar.doornumber = dfcar.doornumber.map({'two':0, 'four':1})

# dfcar.enginelocation.unique()     # 'front', 'rear'

dfcar.enginelocation = dfcar.enginelocation.map({'front':0, 'rear':1})

dfcar

In [None]:
## Nominal Features:

# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to bool)
car = pd.get_dummies(dfcar, dtype=int)

# From now on we will use `car` dataset only.
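
To make the two encodings concrete, here is a small sketch on a toy frame (the rows are invented for illustration): a two-level column becomes a single 0/1 indicator, while a multi-level column is expanded into one indicator column per level.

```python
import pandas as pd

toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "wagon", "hatchback"],
    "price": [13495.0, 16500.0, 13950.0],
})

# Two-level column -> a single 0/1 indicator.
toy["fueltype"] = toy["fueltype"].map({"gas": 0, "diesel": 1})

# Remaining text column -> one indicator column per level.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```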

`One unfinished task`: As noted above when plotting the boxplot of CarComp names, the company mercury has just one entry in the data, so we don't need it and will drop the `CarComp_mercury` column.

In [None]:
car = car.drop(['CarComp_mercury'], axis=1)
car

## Step 4: Training the model

Let us start building our LR model. Let's first split the data into train and test sets.

In [None]:
# Let us split the dataset 'car' into train and test data (70:30 ratio) using sklearn train_test_split() method

df_train, df_test = train_test_split(car, train_size = 0.7, test_size = 0.3, random_state = 100)

### Rescaling the Features 

We will use mean normalization, (x - mean) / (max - min), on all the numeric features, which puts them within the range [-1, 1]. We don't need to do anything with the dummy variables, as they are already 0 or 1.

In [None]:
# Create the list num_vars (the numeric variables to scale, including the target 'price'):

num_vars = ['symboling','carlength','carheight','enginesize','boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','price']

# Mean-normalize both df_train and df_test here.
# Note that each frame is scaled with its own mean and range.

df_train[num_vars] = df_train[num_vars].apply(lambda x: (x- np.mean(x))/(x.max() - x.min()))
df_test[num_vars] = df_test[num_vars].apply(lambda x: (x- np.mean(x))/(x.max() - x.min()))
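
The cell above scales `df_test` with its own mean and range. A common alternative is to learn the scaling statistics on the training data only and reuse them on the test data, so that the test set stays unseen. The sketch below illustrates that alternative on toy numbers; `fit_mean_range` and `apply_mean_range` are hypothetical helper names, not part of the notebook, and switching to this approach would change the scaled test values and therefore the scores reported later.

```python
import pandas as pd

def fit_mean_range(train, cols):
    """Learn the per-column mean and range from the training data only."""
    mean = train[cols].mean()
    value_range = train[cols].max() - train[cols].min()
    return mean, value_range

def apply_mean_range(df, cols, mean, value_range):
    """Return a copy of df with cols scaled as (x - mean) / range."""
    out = df.copy()
    out[cols] = (out[cols] - mean) / value_range
    return out

# Toy stand-ins for df_train and df_test.
train = pd.DataFrame({"enginesize": [100.0, 150.0, 200.0]})
test = pd.DataFrame({"enginesize": [125.0, 250.0]})

mean, value_range = fit_mean_range(train, ["enginesize"])
train_scaled = apply_mean_range(train, ["enginesize"], mean, value_range)
test_scaled = apply_mean_range(test, ["enginesize"], mean, value_range)
print(test_scaled["enginesize"].tolist())
```

With train-based statistics, a test value outside the training range (250 here) can fall outside [-1, 1], which is expected.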

In [None]:
df_train.describe()

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated. 

corr = df_train[df_train.columns].corr()
corr.style.background_gradient(cmap='coolwarm')

Price shows a strong relationship with:

>Positive Corr
   - 'enginesize' (0.867), 'horsepower' (0.806), 'drivewheel_rwd' (0.677)
   
>Negative Corr
   - 'citympg' (-0.674), 'drivewheel_fwd' (-0.635)

Before beginning to build a model, let us again check the scatterplot of features vs price to get a better intuitive understanding.

Car Dimension numeric features:

In [None]:
sns.pairplot(df_train, x_vars=['carlength', 'carheight'], y_vars='price',height=4, aspect=1, kind='scatter')
plt.show()

Car engine related numeric features

In [None]:
sns.pairplot(df_train, x_vars=['enginesize', 'horsepower', 'stroke'], y_vars='price',height=4, aspect=1, kind='scatter')
plt.show()

Other numeric features

In [None]:
sns.pairplot(df_train, x_vars=['boreratio', 'compressionratio','peakrpm', 'citympg'], y_vars='price',height=4, aspect=1, kind='scatter')
plt.show()

**OBSERVATIONS:**

`carlength` and `enginesize` show a strong positive linear relationship with car `price`.

They are also important features to have in the model for making business decisions for `Geely Automobiles`.

In [None]:
y_train = df_train.pop('price')
X_train = df_train

### Building a linear model

Fit a regression line through the training data using `statsmodels`.

Since there are a lot of variables, we will first use RFE to select the top 15 features.

In [None]:
# Running RFE with the output number of the variable equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=15)             # running RFE (the feature count must be passed by keyword in recent sklearn)
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col_in = X_train.columns[rfe.support_]
col_in

In [None]:
col_out = X_train.columns[~rfe.support_]
col_out

### Building model using statsmodel, for the detailed statistics

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col_in]

#### Define a function that displays the statistics summary and returns the X_train_lm dataset and the fitted model object

In [None]:
def LRM_Summ(df):
    # Fit an OLS model on the given training features and display its summary
    
    # Adding a constant
    X_train_lm = sm.add_constant(df)
    
    # Fit
    lm = sm.OLS(y_train, X_train_lm).fit()
    
    # Printing the statistics summary
    print(lm.summary())
    
    # Returning the X_train_lm dataset (with the constant) and the fitted model
    return X_train_lm, lm

#### Defining a function that displays the VIF of the features present in the model and their correlation matrix

In [None]:
def LRM_VIF_Corr(df):    
    
    # Calculate the VIFs for the current model
    vif = pd.DataFrame()
    X = df
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    print('\n')
    print(vif)
    
    
    #Printing the correlation matrix of features in the current model along with price
    col_var = list(vif.Features)
    col_var = col_var + ['price']
    corr = car[col_var].corr()
    return(corr.style.background_gradient(cmap='coolwarm'))

**MODEL 1:**

In [None]:
X_train_lmod1, lmod1 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

Model 1: 
 - The model contains 15 features.
 - All the features look significant (as per p-value) but notice the extremely high VIF (infinite)

**MODEL 2:**

`enginetype_rotor` and `cylindernumber_two` have infinite VIF, which indicates perfect collinearity between them (a correlation of 1).

We need to drop one of them. Let's drop `enginetype_rotor`.
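
To see why two identical indicator columns produce an infinite VIF, recall that VIF for feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features. The sketch below computes this by hand with NumPy on toy data. The `vif` helper is a hypothetical illustration that adds an intercept; it is not the statsmodels `variance_inflation_factor` used in the notebook, so its numbers need not match that function's output.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # intercept + remaining columns
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # A perfect fit means the column is fully explained by the others.
    return np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)

# Toy data: the first two columns are identical indicators, the third is numeric.
rotor = np.array([0, 0, 1, 0, 1, 0], dtype=float)
two_cyl = rotor.copy()
size = np.array([1.0, 2.0, 0.5, 3.0, 0.7, 2.5])
X = np.column_stack([rotor, two_cyl, size])

print(vif(X, 0), vif(X, 2))
```

Dropping either of the duplicated columns removes the perfect fit and brings the remaining VIFs back to finite values.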

In [None]:
X_train_rfe = X_train_rfe.drop(['enginetype_rotor'], axis=1)
car = car.drop(['enginetype_rotor'], axis=1)

X_train_lmod2, lmod2 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

**MODEL 3:**

Since the p-value of `enginetype_dohcv` is 0.064 (> 0.05), we'll drop `enginetype_dohcv`.

In [None]:
X_train_rfe = X_train_rfe.drop(['enginetype_dohcv'], axis=1)
car = car.drop(['enginetype_dohcv'], axis=1)

X_train_lmod3, lmod3 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

**MODEL 4:**


Since the p-value of `cylindernumber_eight` is 0.068 (> 0.05), we'll drop `cylindernumber_eight`.

In [None]:
X_train_rfe = X_train_rfe.drop(['cylindernumber_eight'], axis=1)

car = car.drop(['cylindernumber_eight'], axis=1)

X_train_lmod4, lmod4 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

All the features seem to be fairly significant. Let's look at the correlation table above to find any visible multicollinearity.

**MODEL 5:**

Notice that `cylindernumber_four` is highly (negatively) correlated with `enginesize`, hence we'll drop `cylindernumber_four`.

In [None]:
X_train_rfe = X_train_rfe.drop(['cylindernumber_four'], axis=1)

car = car.drop(['cylindernumber_four'], axis=1)

X_train_lmod5, lmod5 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

**MODEL 6:**

`cylindernumber_twelve` has a high p-value of 0.149. Let's drop this feature and check.

In [None]:
X_train_rfe = X_train_rfe.drop(['cylindernumber_twelve'], axis=1)

car = car.drop(['cylindernumber_twelve'], axis=1)

X_train_lmod6, lmod6 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

**MODEL 7:**

Notice `stroke` has a high p-value, and also `stroke` has considerable correlation with `enginesize` and `carlength`. Hence dropping it would be a good idea.

In [None]:
X_train_rfe = X_train_rfe.drop(['stroke'], axis=1)

car = car.drop(['stroke'], axis=1)

X_train_lmod7, lmod7 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

**MODEL 8:**

The feature `boreratio` has a high p-value. Let's drop it and check whether there is any significant change.

In [None]:
X_train_rfe = X_train_rfe.drop(['boreratio'], axis=1)

car = car.drop(['boreratio'], axis=1)

X_train_lmod8, lmod8 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

Dropping `boreratio` caused almost no drop in Adjusted R-square, which is a good thing.

**MODEL 9:**

Now let's drop `cylindernumber_three`, which has a high p-value of 0.07 (> 0.05).

In [None]:
X_train_rfe = X_train_rfe.drop(['cylindernumber_three'], axis=1)

car = car.drop(['cylindernumber_three'], axis=1)

X_train_lmod9, lmod9 = LRM_Summ(X_train_rfe)
LRM_VIF_Corr(X_train_rfe)

#### Our model looks fairly good at this point. All the features in the model are significant, and their VIFs are all below 2.6. We could trim more features, but that might backfire by triggering further rounds of pruning and leaving us with too few features.

#### Let us now do some Residual Analysis, followed by our Model Evaluation

## Step 5: Residual Analysis on the training data

In [None]:
y_train_pred = lmod9.predict(X_train_lmod9)

In [None]:
res = y_train - y_train_pred
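# histplot replaces the deprecated distplot; stat='density' keeps the y-axis as a density
sns.histplot(res, kde=True, stat='density', color = 'blue')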

## Step 6: Prediction and Evaluation on the test set

In [None]:
y_test = df_test.pop('price')
X_test = df_test

In [None]:
# Adding  constant variable to test dataframe
X_test_lmod9 = sm.add_constant(X_test)
X_test_lmod9.head()

In [None]:
# Creating the X_test_lmod9 dataframe by dropping the variables that are not in the final model
X_test_lmod9 = X_test_lmod9.drop(col_out, axis=1)
X_test_lmod9 = X_test_lmod9.drop(['enginetype_dohcv', 'enginetype_rotor', 'cylindernumber_eight',
       'cylindernumber_four', 'cylindernumber_three', 'stroke', 'boreratio', 'cylindernumber_twelve'], axis=1)
X_test_lmod9.head()
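
Dropping columns by hand works, but the list has to be kept in sync with every feature removed during modelling. A less error-prone alternative is to select the test columns by the column index of the final training design matrix. The sketch below shows the idea on toy frames; `X_train_final` and `X_test_full` are hypothetical stand-ins for `X_train_lmod9` and the test frame with its constant added.

```python
import pandas as pd

# Hypothetical final training design matrix (constant + kept features).
X_train_final = pd.DataFrame({"const": [1.0, 1.0],
                              "enginesize": [0.1, -0.2],
                              "carlength": [0.3, 0.0]})

# Test frame that still holds an extra column, in a different order.
X_test_full = pd.DataFrame({"carlength": [0.2],
                            "stroke": [0.5],
                            "const": [1.0],
                            "enginesize": [-0.1]})

# Selecting by the training columns drops extras and fixes the column order.
X_test_final = X_test_full[X_train_final.columns]
print(X_test_final.columns.tolist())
```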

In [None]:
#Predict
y_test_pred = lmod9.predict(X_test_lmod9)

In [None]:
#Evaluate the model
r2_score(y_true = y_test, y_pred = y_test_pred)
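
As a sketch of what the score above measures, R-square can be computed by hand as 1 - SS_res / SS_tot. The helper below also returns RMSE, which is an addition for illustration and is not used elsewhere in this notebook; `r2_and_rmse` is a hypothetical name and the numbers are toy values.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R-square, RMSE) for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return float(r2), float(rmse)

r2, rmse = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(r2, rmse)
```

Because `price` was scaled before modelling, an RMSE computed on `y_test` would be in scaled units rather than in currency.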

In [None]:
# Actual vs Predicted
index = list(range(1, len(y_test) + 1))   # one index per test row
fig = plt.figure()
plt.plot(index,y_test, color="blue", linewidth=3.5, linestyle="-")     #Plotting Actual
plt.plot(index,y_test_pred, color="purple",  linewidth=3.5, linestyle="-")  #Plotting predicted
fig.suptitle('Actual and Predicted')              # Plot heading 
plt.xlabel('Index')                               # X-label
plt.ylabel('Car Price')  

In [None]:
#Plotting y_test against y_pred to see the relationship.
fig = plt.figure()
plt.scatter(y_test,y_test_pred)
fig.suptitle('y_test vs y_test_pred')              # Plot heading 
plt.xlabel('y_test')                          # X-label
plt.ylabel('y_test_pred')     

y_test_pred and y_test show a linear relationship for most of the values, which means the model consistently predicts values close to the actual values in the test data. The isolated points in the plot above are outliers and do not change this overall picture.

### Assessing the Model:
Let's plot the errors and see whether they are random or show some pattern. The desired outcome is no visible pattern, with the errors scattered randomly.

In [None]:
# Error terms
fig = plt.figure(figsize = (16,6))

index = list(range(1, len(y_test) + 1))   # one index per test row

#To see the randomness with dots
plt.subplot(1,2,1)
plt.scatter(index, y_test-y_test_pred, color = 'red')

fig.suptitle('Error Terms')              # Plot heading 
plt.xlabel('Index')                      # X-label
plt.ylabel('y_test - y_test_pred')                # Y-label


# Join the points with a line to see if any pattern emerges
plt.subplot(1,2,2)
plt.plot(index, y_test-y_test_pred, color="red")

fig.suptitle('Error Terms')              # Plot heading 
plt.xlabel('Index')                      # X-label
plt.ylabel('y_test - y_test_pred')                # Y-label

> In the above scatterplot, the error is randomly distributed and it does not follow any pattern. 

In [None]:
# Let us plot the histogram plot of the error terms to see if they follow the Normal Distribution.

fig = plt.figure(figsize = (4,4))

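# histplot replaces the deprecated distplot; stat='density' keeps the y-axis as a density
sns.histplot((y_test - y_test_pred), bins=10, kde=True, stat='density')
fig.suptitle('Error Terms', fontsize=20)                  # Plot heading 
plt.xlabel('y_test-y_pred', fontsize=18)                  # X-label
plt.ylabel('Density', fontsize=16)                        # Y-label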

> Also, (y_test - y_test_pred) approximately follows a Normal distribution with a mean close to zero (~0.00241).

> The above model shows R-square and adjusted R-square values of `0.898` and `0.892` on the training data.
> The predictions on the test data show an R-square of `0.856`, which is pretty close to the value observed on the training data and hence indicates a good model.
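
Adjusted R-square penalises R-square for the number of predictors: 1 - (1 - R^2)(n - 1) / (n - p - 1). The sketch below applies the formula to the training R-square quoted above. The values n = 143 (70% of 205 rows) and p = 7 (the predictors in the final model) are assumptions inferred from this notebook, and `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-square for a model with an intercept and n_features predictors."""
    if n_obs - n_features - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Assumed values: 143 training rows, 7 predictors in the final model.
adj = adjusted_r2(0.898, n_obs=143, n_features=7)
print(round(adj, 3))
```

The gap between R-square and adjusted R-square grows with the number of predictors, which is why a small gap suggests few redundant features.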

(I have tried many other models as well, and they are all comparable. The accuracy of this model is pretty good, and its error distribution is more random than in the other models I tried. There can of course be other models that show similar or slightly better results.)

## Inference
  
1. The R2 score on the test data is 86.1%, which is pretty close to the adjusted R-square on the training data. This indicates that our model is able to predict the target variable pretty well.

2. The R-square and adjusted R-square values on the training data are almost the same, at 86.8% and 86.2% respectively, which indicates that there is almost no redundancy in the variables we have chosen, and that they all hold significance.

3. The scatter plot of the error terms (y_test - y_test_pred) shows that they are randomly distributed and do not follow any visible pattern, which indicates that they are just white noise.

4. From the histogram plot of the error terms in the test data, we can see that the error terms follow a Normal Distribution centered around a mean close to `zero`.

5. The predictor variables that can affect the car price are:

   1. `carlength`
   2. `enginesize`
   3. `cylindernumber_two`
   4. `CarComp_audi`
   5. `CarComp_bmw`
   6. `CarComp_buick`
   7. `CarComp_porsche`

# Recommendations to Geely Autos:
>As per the model, the car price in the American automobile market is largely driven by `carlength` and `enginesize`. The number of cylinders (through the `cylindernumber_two` indicator) is also relevant to some extent in determining the prices.

>The car brands, particularly `Audi`, `BMW`, `Buick` and `Porsche`, also drive the car price in the American automobile market. So Geely Autos can invest time in studying these brands in particular, which will help them make decisions about consumer preferences in the American market while they set up their manufacturing unit in the US.