<img src="https://thumbs.gfycat.com/LinedSerpentineAnemonecrab-size_restricted.gif" length=1000 width=1000>



<font size="+3" color='#053c96'><b> Problem Statement</b></font>

Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

<font size="+3" color='#053c96'><b>Business Goal</b></font>

We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

<font size="+3" color='#053c96'><b>This Notebook will cover the following - </b></font>
### 1. Data cleaning
### 2. Exploratory Data Analysis
### 3. Feature selection using Recursive Feature Elimination (RFE)
### 4. Data Modelling and evaluation

<font size="+2" color=chocolate ><b>Please Upvote my kernel if you like my work.</b></font>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split,KFold,cross_val_score,GridSearchCV,RandomizedSearchCV
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(rc={'figure.figsize':(8,8)})

<font size="+3" color='#540b11'><b>1. Data cleaning </b></font>

In [None]:
data=pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')

In [None]:
data.head()

## Drop car Id 

In [None]:
data=data.drop(['car_ID'],axis=1)

## Extracting car company from car name

In [None]:
# keep only the first word of the car name, i.e. the company
data['CarName'] = data['CarName'].str.split(' ').str[0]

## Handling duplicate values in car name 

* nissan and Nissan are the same
* toyota and toyouta are the same
* vokswagen, volkswagen and vw are the same
* mazda and maxda are the same
* porcshce and porsche are the same

In [None]:
data['CarName'] = data['CarName'].replace({'maxda': 'mazda', 'Nissan': 'nissan', 'porcshce': 'porsche', 'toyouta': 'toyota', 
                            'vokswagen': 'volkswagen', 'vw': 'volkswagen'})
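
The two cleaning steps above (take the company from the car name, then merge misspellings) can also be sketched as one small helper. This is only an illustration: `clean_brand`, `BRAND_FIXES` and the toy names are hypothetical, and lowercasing everything first is an assumption that is not part of the original cells.

```python
import pandas as pd

# Hypothetical mapping of known misspellings to the intended company name.
BRAND_FIXES = {
    'maxda': 'mazda',
    'porcshce': 'porsche',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

def clean_brand(names: pd.Series) -> pd.Series:
    """Keep the first word of each car name, lowercase it, then fix known misspellings."""
    brand = names.str.split(' ').str[0].str.lower()
    return brand.replace(BRAND_FIXES)

# Toy car names, made up for illustration.
toy = pd.Series(['Nissan clipper', 'nissan latio', 'toyouta tercel', 'vw dasher', 'maxda rx3'])
cleaned = clean_brand(toy)
print(cleaned.tolist())
```

Lowercasing before the lookup means case variants such as `Nissan` need no entry of their own in the mapping.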

In [None]:
data['symboling']=data['symboling'].astype('str')

# Categorical columns

In [None]:
categorical_cols=data.select_dtypes(include=['object']).columns

In [None]:
data[categorical_cols].head(2)

## Numerical columns 

In [None]:
numerical_cols=data.select_dtypes(exclude=['object']).columns

In [None]:
data[numerical_cols].head(2)

<font size="+3" color='#540b11'><b>2. Exploratory Data Analysis </b></font>

In [None]:
data.describe()

# Visualise different car names 

In [None]:
# count cars per company; naming the columns explicitly works across pandas versions
df=data['CarName'].value_counts().rename_axis('car_name').reset_index(name='count')

In [None]:
plot = sns.barplot(y='car_name',x='count',data=df)
plot=plt.setp(plot.get_xticklabels(), rotation=80)

According to the dataset:
* toyota is the most common car company
* mercury is the least common car company

# Fuel type Ratio

In [None]:
df=data['fueltype'].value_counts().rename('fueltype').to_frame()

In [None]:
plot = df.plot.pie(y='fueltype', figsize=(5, 5))

* Most of the cars run on gas

# Price distribution of cars

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(data['price'],kde=True)

* The price distribution is right skewed
* Most cars are priced below 20000
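
The skew seen in the plot can also be put into a number. The snippet below is a rough sketch on made-up prices (not values from the dataset); it compares the sample skewness before and after a log transform, which is one common way to reduce a long right tail.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices, made up for illustration (not taken from the dataset).
prices = pd.Series([5000, 6000, 7000, 8000, 9000, 10000, 12000, 15000, 30000, 45000])

raw_skew = prices.skew()            # a value above 0 means a longer right tail
log_skew = np.log(prices).skew()    # the log transform pulls the right tail in

print(f'skew of price:      {raw_skew:.2f}')
print(f'skew of log(price): {log_skew:.2f}')
```

On the real data the same two calls would be applied to `data['price']`.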

# Price distribution of diesel vs gas car

In [None]:
f= plt.figure(figsize=(12,5))

ax=f.add_subplot(121)
sns.histplot(data[(data.fueltype== 'gas')]["price"],kde=True,color='b',ax=ax)
ax.set_title('Distribution of price of gas vehicles')

ax=f.add_subplot(122)
sns.histplot(data[(data.fueltype == 'diesel')]['price'],kde=True,color='r',ax=ax)
ax.set_title('Distribution of price of diesel vehicles')

In [None]:
sns.boxplot(x = 'fueltype', y = 'price', data = data,palette='Pastel2')

* Diesel cars tend to be priced higher than gas cars; the gas group also has some high-price outliers
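
A box plot comparison like this one can be backed by a small table of group statistics. The helper below is a sketch only: `price_summary` is a hypothetical name and the toy frame is made up, but the same call would work on `data` with any of the categorical columns.

```python
import pandas as pd

def price_summary(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Count, median and mean price for each level of a categorical column."""
    return (df.groupby(by)['price']
              .agg(['count', 'median', 'mean'])
              .sort_values('median', ascending=False))

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'fueltype': ['gas', 'gas', 'gas', 'diesel', 'diesel'],
    'price': [7000, 9000, 30000, 13000, 15000],
})
summary = price_summary(toy, 'fueltype')
print(summary)
```

The median is shown next to the mean because a few expensive outliers can move the mean a lot while leaving the median almost unchanged.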

# Aspiration ratio

In [None]:
df=data['aspiration'].value_counts().rename('aspiration').to_frame()

In [None]:
plot = df.plot.pie(y='aspiration', figsize=(5, 5))

* Most of the cars have standard aspiration

# Price distribution of Std vs Turbo aspiration vehicles

In [None]:
f= plt.figure(figsize=(12,5))

ax=f.add_subplot(121)
plot=sns.histplot(data[(data.aspiration== 'turbo')]["price"],kde=True,color='#ca91eb',ax=ax)
ax.set_title('Price distribution of Turbo aspiration vehicles')

ax=f.add_subplot(122)
plot=sns.histplot(data[(data.aspiration == 'std')]['price'],kde=True,color='#eb6426',ax=ax)
ax.set_title('Price distribution of Std aspiration vehicles')


In [None]:
sns.boxplot(x = 'aspiration', y = 'price', data = data,palette='Pastel1')

* Turbo cars are priced higher than std cars, although the std group has some high-price outliers

# Symboling 

Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the car is risky, -3 that it is probably pretty safe.

In [None]:
df=data['symboling'].value_counts().rename_axis('symboling').reset_index(name='count')

In [None]:
sns.barplot(x='symboling',y='count',data=df)

* The most common symboling value is 0

# Price distribution according to symboling 

In [None]:
sns.boxplot(x = 'symboling', y = 'price', data = data,palette='Pastel1')

* Cars with a symboling of -1 are priced higher compared to the others

# Door number

In [None]:
df=data['doornumber'].value_counts().rename('doornumber').to_frame()

In [None]:
plot = df.plot.pie(y='doornumber', figsize=(5, 5))

* 115 cars have four doors and 90 cars have two doors

# Price distribution according to door number 

In [None]:
f= plt.figure(figsize=(12,5))

ax=f.add_subplot(121)
plot=sns.histplot(data[(data.doornumber== 'two')]["price"],kde=True,color='#ca91eb',ax=ax)
ax.set_title('Price distribution of cars having two doors')

ax=f.add_subplot(122)
plot=sns.histplot(data[(data.doornumber == 'four')]['price'],kde=True,color='#eb6426',ax=ax)
ax.set_title('Price distribution of cars having four doors')


In [None]:
sns.boxplot(x = 'doornumber', y = 'price', data = data,palette='Accent')

* As you can see, there is only a slight difference between the price distributions of two-door and four-door cars

# Carbody

In [None]:
df=data['carbody'].value_counts().rename('carbody').to_frame()

In [None]:
plot = df.plot.pie(y='carbody', figsize=(8, 8))

* The majority of car bodies are sedan and hatchback

# Price distribution according to car body

In [None]:
sns.boxplot(x = 'carbody', y = 'price', data = data,palette='Accent')

* Hardtop prices are very high compared to the others

# Drive wheel

In [None]:
df=data['drivewheel'].value_counts().rename('drivewheel').to_frame()

In [None]:
plot = df.plot.pie(y='drivewheel', figsize=(8, 8))

* Most cars have fwd drive wheels

# Price distribution according to drive wheel 

In [None]:
sns.boxplot(x = 'drivewheel', y = 'price', data = data,palette='Accent')

* The price range of rwd cars is quite high compared to the others

# Engine location

In [None]:
df=data['enginelocation'].value_counts().rename('enginelocation').to_frame()

In [None]:
plot = df.plot.pie(y='enginelocation', figsize=(8, 8))

* Very few cars have a rear engine

# Engine type

In [None]:
df=data['enginetype'].value_counts().rename('enginetype').to_frame()

In [None]:
plot = df.plot.pie(y='enginetype', figsize=(8, 8))

* The most common engine type is 'ohc'

In [None]:
sns.boxplot(x = 'enginetype', y = 'price', data = data,palette='Accent')

* The price range of ohcv engine cars is quite high compared to the others

# Cylinder number

In [None]:
df=data['cylindernumber'].value_counts().rename('cylindernumber').to_frame()

In [None]:
plot = df.plot.pie(y='cylindernumber', figsize=(8, 8))

* Most cars have four cylinders

# Price distribution according to cylinder number

In [None]:
sns.boxplot(x = 'cylindernumber', y = 'price', data = data,palette='Accent')

* There is only one car each with three and twelve cylinders
* Cars with eight cylinders have a higher price range
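
Levels with only one or two cars, such as the three and twelve cylinder cases above, give a box plot and a model very little to work with. A rough way to list such levels is sketched below; `rare_levels`, the `min_count` threshold of 5 and the toy column are all assumptions made for illustration.

```python
import pandas as pd

def rare_levels(col: pd.Series, min_count: int = 5) -> pd.Series:
    """Return the levels of a categorical column that occur fewer than min_count times."""
    counts = col.value_counts()
    return counts[counts < min_count]

# Toy column, made up for illustration.
toy = pd.Series(['four'] * 8 + ['six'] * 5 + ['three', 'twelve'])
rare = rare_levels(toy)
print(rare)
```

On the real data this could be called as `rare_levels(data['cylindernumber'])`, or in a loop over `categorical_cols`.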

# Fuel system 

In [None]:
df=data['fuelsystem'].value_counts().rename_axis('fuelsystem').reset_index(name='count')

In [None]:
sns.barplot(x='fuelsystem',y='count',data=df)

* mpfi is the most common fuel system
* mfi and spfi are the least common fuel systems

# Price distribution according to fuel system

In [None]:
sns.boxplot(x = 'fuelsystem', y = 'price', data = data,palette='gist_rainbow')

* Cars with the idi fuel system have a high price range

# Visualising Numerical features

# Wheel base Vs Price


In [None]:
sns.scatterplot(x="wheelbase", y="price", data=data,color='purple')

In [None]:
g = sns.jointplot(x="wheelbase", y="price", data=data, kind="kde", color="b")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("wheel base", "price");

* Highly scattered points

# Carlength vs Car price

In [None]:
sns.scatterplot(x="carlength", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="carlength", y="price", data=data, kind="kde", color="pink")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("car length", "price");

* Car length is also scattered against price, but less so than wheelbase

# Car width Vs Price

In [None]:
sns.scatterplot(x="carwidth", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="carwidth", y="price", data=data, kind="kde", color="pink")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("car width", "price");

# Car length vs Car width

In [None]:
sns.scatterplot(x="carlength", y="carwidth", data=data,color='b')

In [None]:
g = sns.jointplot(x="carwidth", y="carlength", data=data, kind="kde", color="pink")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("car width", "car length");

* Interesting: there appears to be a strong relationship between car length and car width

# Curbweight vs Price

In [None]:
sns.scatterplot(x="curbweight", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="curbweight", y="price", data=data, kind="kde", color="b")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("curbweight", "price");

* Price rises with curb weight up to about 2900, after which the points become more scattered; the joint plot shows the same thing, with the shading becoming lighter beyond a curb weight of 2900.

# Engine size Vs Price

In [None]:
sns.scatterplot(x="enginesize", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="enginesize", y="price", data=data, kind="kde", color="b")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("enginesize", "price");

* Price rises with engine size up to about 140, after which the points become more scattered; the joint plot shows the same thing, with the shading becoming lighter beyond an engine size of 140.

# Boreratio vs Price

In [None]:
sns.scatterplot(x="boreratio", y="price", data=data,color='b')

# Stroke vs price

In [None]:
sns.scatterplot(x="stroke", y="price", data=data,color='b')

* Very weak correlation between stroke and price

# Compression ratio vs Price

In [None]:
sns.scatterplot(x="compressionratio", y="price", data=data,color='b')

* no relation between compression ratio and price

# Horsepower vs Price

In [None]:
sns.scatterplot(x="horsepower", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="horsepower", y="price", data=data, kind="kde", color="b")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("horsepower", "price");

# Peakrpm vs price

In [None]:
sns.scatterplot(x="peakrpm", y="price", data=data,color='r')

* no correlation between peakrpm and  price

# Citympg vs Price

In [None]:
sns.scatterplot(x="citympg", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="citympg", y="price", data=data, kind="kde", color="b")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("citympg", "price");

* negative correlation is seen between citympg and price

# Highwaympg vs Price

In [None]:
sns.scatterplot(x="highwaympg", y="price", data=data,color='b')

In [None]:
g = sns.jointplot(x="highwaympg", y="price", data=data, kind="kde", color="b")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("highwaympg", "price");

* negative correlation between highwaympg and price

# Pairplot of all numerical features

In [None]:
ax = sns.pairplot(data[numerical_cols])

# Correlation matrix

In [None]:
data[numerical_cols].corr()

In [None]:
sns.heatmap(data[numerical_cols].corr())

* wheelbase has a high positive correlation with carlength, carwidth and curbweight
* carlength has a high positive correlation with curbweight
* carlength has a negative correlation with highwaympg
* carwidth has a high positive correlation with curbweight and enginesize
* enginesize has a high positive correlation with horsepower
* curbweight has a high positive correlation with enginesize and horsepower, and a negative correlation with highwaympg
* horsepower has a negative correlation with citympg and highwaympg
* citympg and highwaympg are highly correlated
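
The pairs listed above were read off the heatmap by eye. As a sketch, the same kind of list can be pulled from the correlation matrix directly. The helper name `top_correlated_pairs`, the 0.8 threshold and the toy frame are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List column pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy numeric data, made up for illustration: a, b and c move together, d does not.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10.5],
    'c': [5, 4, 3, 2, 1],
    'd': [1, -1, 1, -1, 1],
})
strong_pairs = top_correlated_pairs(toy)
print(strong_pairs)
```

On the real data the call would be `top_correlated_pairs(data[numerical_cols])`; both strongly positive and strongly negative pairs are returned because the filter uses the absolute value.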


# Scatter plot of wheelbase, carlength, carwidth and curbweight with price

In [None]:
col=['wheelbase','carlength','carwidth','curbweight','price']

In [None]:
sns.pairplot(data[col])

In [None]:
sns.heatmap(data[col].corr())

# Scatter plot of carlength,curbweight,highwaympg with price

In [None]:
col=['carlength','highwaympg','curbweight','price']

In [None]:
sns.pairplot(data[col])

In [None]:
sns.heatmap(data[col].corr())

# Scatter plot of carwidth,curbweight ,engine size and price

In [None]:
col=['carwidth','curbweight','enginesize','price']

In [None]:
sns.pairplot(data[col])

In [None]:
sns.heatmap(data[col].corr())

# Scatter plot of curbweight ,engine size ,horse power,highwaympg and price

In [None]:
col=['curbweight','enginesize','horsepower','highwaympg','price']

In [None]:
sns.pairplot(data[col])

In [None]:
sns.heatmap(data[col].corr())

# Horsepower,citympg , highway mpg  and price

In [None]:
col=['horsepower','citympg','highwaympg','price']

In [None]:
sns.pairplot(data[col])

In [None]:
sns.heatmap(data[col].corr())

# Horsepower vs price, categorized by car body

In [None]:
sns.pairplot(data[['horsepower','price','carbody']], hue="carbody");

# Fitting all features with price

In [None]:
fig,axes = plt.subplots(4,4,figsize=(18,15))
for seg,col in enumerate(numerical_cols.drop('price')):
    # place each feature in the 4x4 grid: row = seg//4, column = seg%4
    x,y = seg//4,seg%4
    sns.regplot(x=col, y='price' ,data=data,ax=axes[x][y],color='r')


<font size="+3" color='#540b11'><b>3. Feature Selection </b></font>

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.

In [None]:
X=data[numerical_cols].drop('price',axis=1)
y=data['price']

# Recursive feature elimination (RFE) with random forest

In [None]:
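# Label-encode every column. Note: this also replaces the numeric columns by integer codes (their rank order).
X = data.apply(lambda col: preprocessing.LabelEncoder().fit_transform(col))
# CarName is not used as a feature; price is the target
X=X.drop(['CarName','price'],axis=1)
y=data['price']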
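
The cell above label-encodes every column, numeric ones included. An alternative, shown here only as a sketch and not as what this notebook does, is to encode just the non-numeric columns and leave numeric features on their original scale. `encode_categoricals` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode the non-numeric columns and leave numeric columns untouched."""
    out = df.copy()
    for col in out.select_dtypes(exclude='number').columns:
        out[col] = LabelEncoder().fit_transform(out[col])
    return out

# Toy data, made up for illustration.
toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas'],
                    'horsepower': [111, 154, 102]})
encoded = encode_categoricals(toy)
print(encoded)
```

Tree-based models are largely insensitive to the difference, since encoding a numeric column by rank keeps its order, but a linear model sees different inputs, so scores obtained this way would not be directly comparable with the ones reported in this notebook.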

In [None]:

# Create the RFE object and rank each feature
clf_rf_3 = RandomForestRegressor()
rfe = RFE(estimator=clf_rf_3, n_features_to_select=15, step=1)
rfe = rfe.fit(X, y)
print('Best 15 features chosen by RFE:',X.columns[rfe.support_])
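
To see what RFE reports besides the selected columns, here is a small sketch on synthetic data (the data, the names `X_demo`, `y_demo`, `rfe_demo` and the choice of 3 features are all made up for illustration). RFE repeatedly fits the estimator and drops the least important feature until the requested number remains; `ranking_` records the order of elimination.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data, made up for illustration: only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(150, 6)), columns=[f'f{i}' for i in range(6)])
y_demo = 10 * X_demo['f0'] + 5 * X_demo['f1'] + rng.normal(scale=0.5, size=150)

rfe_demo = RFE(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
               n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)

# ranking_ is 1 for every selected feature; larger numbers were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
selected = list(X_demo.columns[rfe_demo.support_])
print(ranking)
print('selected:', selected)
```

Fixing `random_state` on the estimator makes the selection repeatable from run to run, which the unseeded forest in the cell above does not guarantee.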


In [None]:
features=list(X.columns[rfe.support_])

<font size="+3" color='#540b11'><b>4. Data Modelling and Evaluation </b></font>

In [None]:
x = X[features]
y = data.price
x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 0)

# Linear Regression

In [None]:
lreg = linear_model.LinearRegression()
lreg.fit(x_train,y_train)
y_train_pred = lreg.predict(x_train)
y_test_pred = lreg.predict(x_test)
lreg.score(x_test,y_test)

# Decision Tree Regressor

In [None]:
dt_regressor = DecisionTreeRegressor(random_state=0)
dt_regressor.fit(x_train,y_train)
y_train_pred = dt_regressor.predict(x_train)
y_test_pred = dt_regressor.predict(x_test)
dt_regressor.score(x_test,y_test)
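
The cells above compute `y_train_pred` but only report the test score. Comparing train and test metrics side by side is a quick check for overfitting, which an unrestricted decision tree is prone to. The sketch below uses synthetic data and a hypothetical helper `fit_report`; it is not the notebook's own evaluation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def fit_report(model, x_train, x_test, y_train, y_test):
    """Fit the model and return train R2, test R2 and test RMSE."""
    model.fit(x_train, y_train)
    train_pred = model.predict(x_train)
    test_pred = model.predict(x_test)
    return {
        'train_r2': r2_score(y_train, train_pred),
        'test_r2': r2_score(y_test, test_pred),
        'test_rmse': float(np.sqrt(mean_squared_error(y_test, test_pred))),
    }

# Synthetic regression data, made up for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=0)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)
report = fit_report(DecisionTreeRegressor(random_state=0), xtr, xte, ytr, yte)
print(report)
```

A large gap between `train_r2` and `test_r2` suggests the model has memorised the training set. RMSE is in the same units as the target, so on the real data it would read as a typical price error.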

# Random Forest regressor

In [None]:
# criterion is left at its default (squared error); the old name 'mse' was removed in recent scikit-learn versions
Rf = RandomForestRegressor(n_estimators = 15,
                              random_state = 20,
                              n_jobs = -1)
Rf.fit(x_train,y_train)
Rf_train_pred = Rf.predict(x_train)
Rf_test_pred = Rf.predict(x_test)


r2_score(y_test,Rf_test_pred)
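
With only about 200 rows, a single train/test split can give a noisy score. `KFold` and `cross_val_score` are imported at the top of the notebook but never used, so the following is a sketch of how the three models could be compared with 5-fold cross-validation. It runs on synthetic data, and the names `X_demo`, `y_demo`, `models` and `cv_r2` are made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, made up for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Shuffled 5-fold split with a fixed seed so that every model sees the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'Linear regression': LinearRegression(),
    'Decision tree': DecisionTreeRegressor(random_state=0),
    'Random forest': RandomForestRegressor(n_estimators=15, random_state=20),
}

# One R2 score per fold for each model.
cv_r2 = {name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring='r2')
         for name, model in models.items()}

for name, scores in cv_r2.items():
    print(f'{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})')
```

On the real data, `x` and `y` from the modelling cells would take the place of `X_demo` and `y_demo`. The ranking of models on synthetic linear data says nothing about their ranking on the car data.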

# Conclusion

* We applied three models: Linear Regression, Decision Tree Regressor, and Random Forest Regressor
* Random forest performs best, with an R² score of about 0.90 on the test set



<font size="+1" color='#9b24a3'><b>I hope you enjoyed this kernel , Please don't forget to appreciate me with an Upvote.</b></font>

<img src="https://i.pinimg.com/originals/e2/d7/c7/e2d7c71b09ae9041c310cb6b2e2918da.gif">