## Thank you for opening this notebook!!

### In this kernel we will try to predict the price of cars. 

### Let's begin!

### Problem Statement
A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

### Business Goal
We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.


## 1. Import libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style
import matplotlib as mpl
mpl.style.use('ggplot')
sns.set_style('whitegrid')
from sklearn import metrics, preprocessing
from sklearn.base import TransformerMixin
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

## 2. Check out the data

In [None]:
df = pd.read_csv('/kaggle/input/car-price-prediction/CarPrice_Assignment.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
pd.DataFrame(df.isnull().sum())

I'm going to drop the car_ID column from the dataset.

In [None]:
df = df.drop('car_ID', axis =1)

In [None]:
df.head()

In [None]:
df.describe()

First we will look at the distribution of the price which is our dependent variable.

In [None]:
plt.hist(df['price'] , bins = 20 ,color = 'blue')
plt.xlabel('price')
plt.ylabel('no. of cars')
plt.title('Histogram')

According to the distribution of price, the mean price is around 13276.71. The minimum price of a car is 5118, while the maximum price is 45400.

## 3. Cleaning of data

In [None]:
df.dtypes

In [None]:
# Keep only the first token of CarName, which is the manufacturer name
df['CarName'] = df['CarName'].str.split(' ', expand=True)[0]

In [None]:
df['CarName'].unique()

There are typing mistakes in the car company names. Let's fix them.

In [None]:

df['CarName'] = df['CarName'].replace({'maxda': 'mazda', 'nissan': 'Nissan', 'porcshce': 'porsche', 'toyouta': 'toyota', 
                            'vokswagen': 'volkswagen', 'vw': 'volkswagen'})
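
To make the cleanup above concrete, here is a minimal sketch on a small made-up list of car names (not the real dataset). It shows how taking the first token and applying a replacement map collapses misspelled brands into one label. The sample values and the `brand_fixes` name are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample values; the real CarName column has many more entries
names = pd.Series(['maxda rx3', 'mazda glc', 'toyouta tercel',
                   'toyota corolla', 'vw rabbit'])

# Keep only the first token (the manufacturer)
brand = names.str.split(' ').str[0]

# Hypothetical subset of the replacement map used in the notebook
brand_fixes = {'maxda': 'mazda', 'toyouta': 'toyota', 'vw': 'volkswagen'}
brand = brand.replace(brand_fixes)

# Each brand now appears under a single spelling
print(brand.value_counts().sort_index())
```

Checking `value_counts()` before and after the replacement is a quick way to spot spellings that were missed.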

Now let's change the datatype of symboling, since the data dictionary describes it as a categorical variable.

In [None]:
df['symboling'] = df['symboling'].astype(str)

Check for the duplicates.

In [None]:
df.loc[df.duplicated()]

We are good to go! We don't have any duplicate rows.

## 4. Visualization of data

First we'll separate the numerical and categorical variables.

In [None]:
categorical = df.select_dtypes(include=['object']).columns
numerical = df.select_dtypes(exclude=['object']).columns
df_categorical = df[categorical]
df_numerical = df[numerical]

In [None]:
df_numerical.head()

Next I'm going to check the correlation of price with other variables. Then I'll use the variables with highest correlation to fit our multiple regression model.

In [None]:
# Use only the numerical columns; recent pandas versions raise an error on text columns
df_numerical.corr()['price'].sort_values()

* carlength           0.682920
* carwidth            0.759325
* horsepower          0.808139
* curbweight          0.835305
* enginesize          0.874145
* highwaympg         -0.697599
* citympg            -0.685751

Observations:
* carwidth, carlength, curbweight, enginesize and horsepower have a strong positive correlation with price.
* citympg and highwaympg have a significant negative correlation with price.
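
The selection by correlation described above can be sketched as a simple filter on the absolute correlation with price. This is only an illustration on synthetic data: the toy columns, the random seed and the 0.6 threshold are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for the numerical columns
rng = np.random.default_rng(0)
n = 200
size = rng.normal(size=n)
toy = pd.DataFrame({
    'enginesize': size,                                  # drives the price up
    'citympg': -size + rng.normal(scale=0.5, size=n),    # moves against price
    'noise': rng.normal(size=n),                         # unrelated to price
})
toy['price'] = 2.0 * size + rng.normal(scale=0.5, size=n)

# Correlation of every feature with price, excluding price itself
corr_with_price = toy.corr()['price'].drop('price')

# Keep features whose absolute correlation passes an assumed threshold
threshold = 0.6
selected = corr_with_price[corr_with_price.abs() >= threshold].index.tolist()
print(corr_with_price.round(3))
print(selected)
```

Using the absolute value keeps strongly negative predictors such as citympg, which a plain "largest correlation" ranking would push to the bottom.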

In [None]:
df_categorical.head()

Let's draw boxplots for all the categorical variables.

In [None]:
plt.figure(figsize=(20, 15))
plt.subplot(3,3,1)
sns.boxplot(x = 'doornumber', y = 'price', data = df)
plt.subplot(3,3,2)
sns.boxplot(x = 'fueltype', y = 'price', data = df)
plt.subplot(3,3,3)
sns.boxplot(x = 'aspiration', y = 'price', data = df)
plt.subplot(3,3,4)
sns.boxplot(x = 'carbody', y = 'price', data = df)
plt.subplot(3,3,5)
sns.boxplot(x = 'enginelocation', y = 'price', data = df)
plt.subplot(3,3,6)
sns.boxplot(x = 'drivewheel', y = 'price', data = df)
plt.subplot(3,3,7)
sns.boxplot(x = 'enginetype', y = 'price', data = df)
plt.subplot(3,3,8)
sns.boxplot(x = 'cylindernumber', y = 'price', data = df)
plt.subplot(3,3,9)
sns.boxplot(x = 'fuelsystem', y = 'price', data = df)
plt.show()

* Cars with diesel fueltype are comparatively more expensive than cars with gas fueltype.
* Cars with a rear engine location are much more expensive than cars with a front engine location.
* Cars with the ohcv enginetype seem to be expensive and have a wider range of prices.
* The price of a car varies with the number of cylinders.
* Expensive cars seem to have rwd drivewheel.
* The number of doors does not seem to influence the price of a car.
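
The boxplot observations above can also be checked numerically by comparing price summaries per category. The sketch below uses a tiny made-up table, so the numbers are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Made-up prices for two fuel types
toy = pd.DataFrame({
    'fueltype': ['gas', 'gas', 'gas', 'diesel', 'diesel'],
    'price': [7000, 9000, 11000, 13000, 17000],
})

# Median, minimum and maximum price per category, the same
# quantities a boxplot lets you read off visually
summary = toy.groupby('fueltype')['price'].agg(['median', 'min', 'max'])
print(summary)
```

On the real data, the same `groupby` call with any of the categorical columns would give the numbers behind each boxplot.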


So, based on the analysis of both the numerical and the categorical variables, I decided to use the variables below to fit the multiple linear regression model.
* price, carwidth, carlength, curbweight, enginesize, horsepower, citympg, highwaympg, CarName, fueltype, enginelocation, enginetype, cylindernumber, drivewheel

## 5. Data Preparation

First I'm going to drop the variables which are not going to be used for the model.

In [None]:
drop_variables = ['symboling' , 'aspiration' ,'doornumber' ,'carbody' ,'fuelsystem' , 'wheelbase' , 'carheight' ,'boreratio' ,'stroke' , 'compressionratio' , 'peakrpm'  ]

In [None]:
df.drop( drop_variables , axis = 1, inplace = True)
df.shape

In [None]:
df.head()

In [None]:
cat = [ 'CarName' , 'fueltype' , 'enginelocation' , 'enginetype' , 'cylindernumber' , 'drivewheel' ]

First we need to convert the levels of the categorical variables to numeric values. For this we have to use dummy variables.

In [None]:
dummy_variables = pd.get_dummies(df[cat])
dummy_variables.shape

In [None]:
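# drop_first removes one level per variable to avoid redundant columns;
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
dummy_variables = pd.get_dummies(df[cat], drop_first = True, dtype = int)
dummy_variables.shape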
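
To see what `drop_first` changes, here is a small sketch on a made-up two-column frame. With `drop_first=True`, one level per categorical variable is removed, because it can be inferred from the remaining dummies. The toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up frame with two categorical variables (2 and 3 levels)
toy = pd.DataFrame({
    'fueltype': ['gas', 'diesel', 'gas'],
    'drivewheel': ['fwd', 'rwd', '4wd'],
})

full = pd.get_dummies(toy, dtype=int)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # one level dropped per variable

print(full.shape, reduced.shape)
print(reduced.columns.tolist())
```

The dropped level becomes the baseline: a row with zeros in all dummies of a variable belongs to that baseline level.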

Then we will add the dummy variables to the original dataframe.

In [None]:
df = pd.concat([df, dummy_variables], axis = 1)

In [None]:
df.head()

Now we have the dummy variables created for the categorical variables. We can drop the original categorical variables from the dataset.

In [None]:
df.drop( cat, axis = 1, inplace = True)
df.shape

In [None]:
df.head()

## 6. Multiple linear regression model

In [None]:
np.random.seed(0)
train, test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.head()

In [None]:
sc = StandardScaler()

In [None]:
num = ['carlength','carwidth','curbweight','enginesize','horsepower','citympg','highwaympg','price']

Let's standardize the numerical variables and leave the dummy variables as they are.

In [None]:
import warnings
warnings.filterwarnings("ignore")
train[num] = sc.fit_transform(train[num])
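
What the scaler does can be written out by hand: it learns the mean and standard deviation on the training data and reuses those same values on the test data. The sketch below does this with plain NumPy on made-up numbers. The arrays are assumptions, and the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

# Made-up (horsepower, price) rows
train_vals = np.array([[100.0, 5000.0], [150.0, 9000.0], [200.0, 13000.0]])
test_vals = np.array([[150.0, 9000.0], [250.0, 17000.0]])

# "fit": statistics come from the training rows only
mean = train_vals.mean(axis=0)
std = train_vals.std(axis=0)   # ddof=0, as in StandardScaler

# "transform": the same statistics are applied to both sets
train_scaled = (train_vals - mean) / std
test_scaled = (test_vals - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Reusing the training statistics on the test set is the reason the notebook calls `fit_transform` on the training data but only `transform` on the test data.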

In [None]:
train.head()

In [None]:
plt.figure(figsize = (20, 20))
sns.heatmap(train.corr(), cmap="Blues")
plt.show()

In [None]:
y_train = train.pop('price')
X_train = train

In [None]:
X_train.shape

In [None]:
model=LinearRegression()

In [None]:
lr_1 =model.fit(X_train,y_train)

In [None]:
print(model.intercept_)

In [None]:
print(model.coef_)

Now let's use RFE to select the independent variables that best predict the dependent variable, price.

## RFE - Recursive feature elimination 
Let's use recursive feature elimination, since we have too many independent variables.

We run RFE so that it keeps 15 variables.

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=15)   # the number of features must be passed by keyword in recent scikit-learn
rfe = rfe.fit(X_train, y_train)
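
The idea behind the RFE call above can be sketched in a few lines of NumPy: fit a linear model, drop the feature with the weakest coefficient, and repeat until the desired number of features is left. This is a simplified illustration, not scikit-learn's implementation. The function name `rfe_linear` and the synthetic data are assumptions, and the sketch presumes features on comparable scales.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Simplified recursive feature elimination using ordinary least squares."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Fit OLS with an intercept on the columns that are still in play
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Remove the feature with the smallest absolute coefficient (intercept skipped)
        weakest = int(np.argmin(np.abs(coef[1:])))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 3 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

kept = rfe_linear(X, y, n_keep=2)
print(kept)
```

Because the model is refit after every removal, a feature that looked weak only due to a correlated neighbour gets a chance to recover once that neighbour is gone.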

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
# Selecting the variables which are in support

col_sup = X_train.columns[rfe.support_]
col_sup

In [None]:
# Creating X_train dataframe with RFE selected variables

X_train_rfe = X_train[col_sup]

In [None]:
# Adding a constant variable and Build a first fitted model
import statsmodels.api as sm  
X_train_rfec = sm.add_constant(X_train_rfe)
lm_rfe = sm.OLS(y_train,X_train_rfec).fit()

#Summary of linear model
print(lm_rfe.summary())

* Dropping enginetype_rotor because its p-value is 0.264 and we want p-values below 0.05.

Let's rebuild the model.

In [None]:
X_train_rfe1 = X_train_rfe.drop('enginetype_rotor', axis=1)

# Adding a constant variable and Build a second fitted model

X_train_rfe1c = sm.add_constant(X_train_rfe1)
lm_rfe1 = sm.OLS(y_train, X_train_rfe1c).fit()

#Summary of linear model
print(lm_rfe1.summary())

* Dropping enginelocation_rear because its p-value is 0.154 and we want p-values below 0.05.

Let's rebuild the model.

In [None]:
# Dropping the insignificant variable enginelocation_rear

X_train_rfe2 = X_train_rfe1.drop('enginelocation_rear', axis=1)

# Adding a constant variable and Build a third fitted model

X_train_rfe2c = sm.add_constant(X_train_rfe2)
lm_rfe2 = sm.OLS(y_train, X_train_rfe2c).fit()

#Summary of linear model
print(lm_rfe2.summary())

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe2.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe2.values, i) for i in range(X_train_rfe2.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We generally want a VIF below 5, so let's drop fueltype_gas and refit the model.
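
For intuition about the VIF, it can be computed by hand as 1 / (1 - R²), where R² comes from regressing one feature on all the others. The NumPy sketch below is an illustration on synthetic data. The function name `vif` and the toy columns are assumptions, and it adds an intercept to each auxiliary regression, so its numbers can differ from the `variance_inflation_factor` call above, which uses the columns exactly as they are passed.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: b is almost a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

scores = vif(np.column_stack([a, b, c]))
print(scores.round(2))
```

The two nearly identical columns get large VIFs, while the unrelated column stays close to 1, which is the pattern the threshold of 5 is meant to catch.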

In [None]:
# Dropping fueltype_gas based on the VIF check

X_train_rfe3 = X_train_rfe2.drop('fueltype_gas', axis=1)

# Adding a constant variable and Build a fourth fitted model

X_train_rfe3c = sm.add_constant(X_train_rfe3)
lm_rfe3 = sm.OLS(y_train, X_train_rfe3c).fit()

#Summary of linear model
print(lm_rfe3.summary())

* Dropping cylindernumber_six because its p-value is 0.684 and we want p-values below 0.05.

In [None]:
# Dropping the insignificant variable cylindernumber_six

X_train_rfe4 = X_train_rfe3.drop('cylindernumber_six', axis=1)

# Adding a constant variable and Build a fifth fitted model

X_train_rfe4c = sm.add_constant(X_train_rfe4)
lm_rfe4 = sm.OLS(y_train, X_train_rfe4c).fit()

#Summary of linear model
print(lm_rfe4.summary())

* Dropping cylindernumber_five because its p-value is 0.495 and we want p-values below 0.05.

In [None]:
# Dropping the insignificant variable cylindernumber_five

X_train_rfe5 = X_train_rfe4.drop('cylindernumber_five', axis=1)

# Adding a constant variable and Build a sixth fitted model

X_train_rfe5c = sm.add_constant(X_train_rfe5)
lm_rfe5 = sm.OLS(y_train, X_train_rfe5c).fit()

#Summary of linear model
print(lm_rfe5.summary())

* Dropping cylindernumber_four because its p-value is 0.349 and we want p-values below 0.05.

In [None]:
# Dropping the insignificant variable cylindernumber_four

X_train_rfe6 = X_train_rfe5.drop('cylindernumber_four', axis=1)

# Adding a constant variable and Build a seventh fitted model

X_train_rfe6c = sm.add_constant(X_train_rfe6)
lm_rfe6 = sm.OLS(y_train, X_train_rfe6c).fit()

#Summary of linear model
print(lm_rfe6.summary())

The R-squared value of this model is 0.884.

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe6.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe6.values, i) for i in range(X_train_rfe6.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Now both the VIFs and the p-values are within an acceptable range, and the remaining variables are statistically significant.

## 7. Predictions

In [None]:
import warnings
warnings.filterwarnings("ignore")

test[num] = sc.transform(test[num])
test.shape

In [None]:
y_test = test.pop('price')
X_test = test

In [None]:
# Adding constant
X_test_1 = sm.add_constant(X_test)

X_test1 = X_test_1[X_train_rfe6c.columns]

In [None]:
y_predictions = lm_rfe6.predict(X_test1)

In [None]:
# Plotting y_test and y_pred.
fig = plt.figure()
plt.scatter(y_test,y_predictions)
fig.suptitle('y_test vs y_predictions', fontsize=20)
plt.xlabel('y_test ', fontsize=18)                       
plt.ylabel('y_predictions', fontsize=16)    

In [None]:
r2_score(y_test, y_predictions)
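
The score returned by `r2_score` above follows the usual definition, R² = 1 - SS_res / SS_tot. The sketch below writes that formula out on a handful of made-up values. The function name `r_squared` and the numbers are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions; only the last prediction is off
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r_squared(y_true, y_pred)
print(score)
```

A value of 1 means the predictions match the targets exactly, and a value of 0 means the model does no better than always predicting the mean.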

The R-squared value is 0.884 on the training set and 0.87 on the test set.

The final model above will be used to predict the price of cars.

The multiple linear regression model is:

*Price = -0.2346 + 0.557 x horsepower + 0.5820 x CarName_audi + 1.5135 x CarName_bmw + 2.0651 x CarName_buick + 1.8828 x CarName_jaguar + 1.1414 x CarName_porsche + 0.7072 x CarName_volvo - 1.1521 x enginetype_dohcv - 0.9373 x cylindernumber_twelve*

Note that price and horsepower were standardized before fitting, so both are in standardized units in this equation.
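
Since price was standardized together with the other numerical columns before fitting, the model's predictions are in standardized units. Converting them back to price means undoing the scaling: price = z x std + mean. The sketch below shows this with made-up statistics. In the notebook, the real values would come from the fitted scaler (price is the last entry in `num`, so `sc.mean_[-1]` and `sc.scale_[-1]`).

```python
import numpy as np

# Made-up training statistics for price, standing in for sc.mean_[-1] and sc.scale_[-1]
price_mean = 13000.0
price_std = 8000.0

# Made-up standardized predictions
z_pred = np.array([-0.5, 0.0, 1.25])

# Undo the standardization to get prices in the original units
price_pred = z_pred * price_std + price_mean
print(price_pred)
```

A standardized prediction of 0 corresponds to the average training price, and each unit above or below it is one training standard deviation.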

## I hope this kernel is helpful for you. Your UPVOTE means a lot to me!!

## Thank you !! 