## Car Price Prediction
The aim of this company is to know:
* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

The solution is divided into the following sections: 
- Data understanding and exploration
- Data cleaning
- Data preparation
- Model building and evaluation


First, let's understand and explore our data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 


In [None]:
carP=pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv")


In [None]:
carP.head()

In [None]:
carP.info()

In [None]:
carP['symboling'].astype('category').value_counts()

In [None]:
carP['aspiration'].astype('category').value_counts()

In [None]:
carP['doornumber'].astype('category').value_counts()

In [None]:
carP['drivewheel'].astype('category').value_counts()

In [None]:
carP['compressionratio'].astype('category').value_counts()

In [None]:
# wheelbase: distance between the centres of the front and rear wheels
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(carP['wheelbase'], kde=True)
plt.show()

In [None]:
sns.histplot(carP['compressionratio'], kde=True)
plt.show()

In [None]:
# target variable: price of car
sns.histplot(carP['price'], kde=True)
plt.show()

In [None]:
#select all numerical variables
numeric_car=carP.select_dtypes(include=['float64','int64'])
numeric_car

In [None]:
numeric_car = numeric_car.drop(['symboling', 'car_ID'], axis=1)
numeric_car.head()

In [None]:
# plotting pairwise
plt.figure(figsize=(20,10))
sns.pairplot(numeric_car)
plt.show()

The pairplot is quite hard to read, so we can instead plot the correlations between variables. A heatmap is a useful way to visualise many correlations in one plot.

In [None]:
cor=numeric_car.corr()
cor

In [None]:
# plotting correlations on a heatmap

# figure size
plt.figure(figsize=(16,8))

# heatmap
sns.heatmap(cor, cmap="YlGnBu", annot=True)
plt.show()

Correlation of price with independent variables:

Price is highly (positively) correlated with wheelbase, carlength, carwidth, curbweight, enginesize and horsepower (notice how all of these variables represent the size, weight or engine power of the car).
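
Reading the price row off a large heatmap can be tedious. As a minimal sketch, the correlations with the target can be pulled out and sorted by magnitude. The frame `toy` and its columns `feature_neg` and `noise` are synthetic stand-ins invented for this illustration, not values from the car dataset; with the real data the same two lines would be applied to `numeric_car`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for numeric_car (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
size = rng.normal(size=200)
toy = pd.DataFrame({
    "enginesize": size + rng.normal(scale=0.3, size=200),
    "feature_neg": -size + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
    "price": 3 * size + rng.normal(scale=0.5, size=200),
})

# Correlation of every column with the target, excluding the target itself.
price_corr = toy.corr()["price"].drop("price")

# Order by absolute correlation so strong negative relations are not hidden.
ranked = price_corr.reindex(price_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strongly negative predictors near the top, which a plain descending sort would push to the bottom.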

In [None]:
# converting symboling to categorical
carP['symboling'] = carP['symboling'].astype('object')
carP.info()

In [None]:
carP['CarName'].head()

Only the first part of CarName is the name of the company, so we extract it.

In [None]:
car_names=carP['CarName'].apply(lambda x: x.split(" ")[0])
car_names.head()

Let's add car_names as a new column.

In [None]:
# New column car_company
carP['car_company'] = car_names

In [None]:
carP.head()

In [None]:
carP=carP.drop(['CarName'], axis=1)

In [None]:
carP.head()

In [None]:
carP['car_company'].astype('category').value_counts()

Some company names are misspelled or written inconsistently (vw, vokswagen, porcshce, toyouta, maxda, Nissan). We need to rewrite them as volkswagen, porsche, toyota, mazda and nissan.


In [None]:
# volkswagen
carP.loc[(carP['car_company'] == "vw") | 
         (carP['car_company'] == "vokswagen")
         , 'car_company'] = 'volkswagen'

# porsche
carP.loc[carP['car_company'] == "porcshce", 'car_company'] = 'porsche'

# toyota
carP.loc[carP['car_company'] == "toyouta", 'car_company'] = 'toyota'

# nissan
carP.loc[carP['car_company'] == "Nissan", 'car_company'] = 'nissan'

# mazda
carP.loc[carP['car_company'] == "maxda", 'car_company'] = 'mazda'
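
The corrections above can also be expressed as a single mapping passed to `Series.replace`, which keeps all fixes in one place. This is only a sketch of an alternative: the series `companies` is a hypothetical sample built for the illustration, and the dictionary simply restates the corrections already made in this notebook.

```python
import pandas as pd

# Hypothetical sample of raw company names, including the misspellings fixed above.
companies = pd.Series(
    ["vw", "vokswagen", "porcshce", "toyouta", "Nissan", "maxda", "bmw"]
)

# One dictionary holds every correction; names not listed pass through unchanged.
corrections = {
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "Nissan": "nissan",
    "maxda": "mazda",
}

cleaned = companies.replace(corrections)
print(cleaned.tolist())
```

On the real data this would amount to `carP['car_company'].replace(corrections)`, assuming the column holds plain strings.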

In [None]:
carP['car_company'].astype('category').value_counts()

In [None]:
carP.info()

In [None]:
carP.describe()

Prepare the data for our model

In [None]:
x=carP.loc[:, carP.columns != 'price']
y=carP['price']

Creating dummy variables for the categorical variables

In [None]:
# subset all categorical variables
cars_categorical = x.select_dtypes(include=['object'])
cars_categorical.head()

In [None]:
# convert into dummies
cars_dummies = pd.get_dummies(cars_categorical, drop_first=True)
cars_dummies.head()
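
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column. The three category values are illustrative assumptions, not a claim about what the dataset contains. With k categories, `drop_first=True` keeps k-1 indicator columns, so the dropped level becomes the baseline and the dummies are not perfectly collinear with the intercept.

```python
import pandas as pd

# Toy categorical column (illustrative values only).
toy = pd.DataFrame({"drivewheel": ["fwd", "rwd", "4wd", "fwd"]})

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy)

# Reduced encoding: the first category (in sorted order) is dropped.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```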

In [None]:
x=x.drop(list(cars_categorical.columns), axis=1)

In [None]:
x=pd.concat([x,cars_dummies],axis=1)

In [None]:
# scaling the features
from sklearn.preprocessing import scale

# storing column names in cols, since column names are (annoyingly) lost after 
# scaling (the df is converted to a numpy array)
cols = x.columns
x = pd.DataFrame(scale(x))
x.columns = cols
x.columns

In [None]:
# split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.7,test_size = 0.3, random_state=100)
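
The cells above scale the full feature matrix before splitting, so the test rows contribute to the means and standard deviations used for scaling. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. The names `X_toy`, `y_toy` and the `_s` suffixed frames are hypothetical and exist only for this example; it is not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(loc=50, scale=10, size=(100, 3)), columns=["a", "b", "c"])
y_toy = pd.Series(rng.normal(size=100))

# Split first, so the test rows play no part in estimating scaling parameters.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.7, random_state=100)

# Fit the scaler on the training rows, then apply the same transform to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_s = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_s.mean().round(6))
```

Wrapping the transformed arrays back into DataFrames with the original columns and index keeps the column names that scaling would otherwise discard.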

**Building the model**

In [None]:
from sklearn.linear_model import LinearRegression
# Building the first model with all the features
Rg = LinearRegression()
# fit
Rg.fit(X_train, y_train)

In [None]:
# print coefficients and intercept
print(Rg.coef_)
print(Rg.intercept_)

In [None]:
# predict 
y_pred = Rg.predict(X_test)

# metrics
from sklearn.metrics import r2_score

print(r2_score(y_true=y_test, y_pred=y_pred))

Not bad: we are getting approximately 82.5% r-squared with all the variables.
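
For reference, r-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The helper `r_squared` below is a hypothetical name written for this sketch; it follows the standard definition, which is also what `r2_score` computes for a single target.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Perfect predictions give 1; always predicting the mean gives 0.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```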


Let's now build a model using recursive feature elimination to select features

In [None]:
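# RFE with 15 features
from sklearn.feature_selection import RFE
# Initializing RFE model
Rg = LinearRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(Rg, n_features_to_select=15)
# Fitting RFE (fit returns the fitted selector itself)
X_rfe = rfe.fit(X_train, y_train)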

print(rfe.support_)
print(rfe.ranking_)
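
The two attributes printed above can be hard to interpret as raw arrays. In this sketch on synthetic data, `support_` is a boolean mask over the columns and `ranking_` assigns 1 to every selected feature, with larger numbers for features eliminated earlier. The data and the column names `f0` to `f4` are made up for the illustration; only `f0` and `f1` actually drive the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only f0 and f1 influence the target (hypothetical example).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y_toy = 5 * X_toy["f0"] - 4 * X_toy["f1"] + rng.normal(scale=0.1, size=200)

# Repeatedly refit and drop the feature with the smallest absolute coefficient.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask; use it to recover the selected column names.
selected = list(X_toy.columns[selector.support_])
print(selected)
print(selector.ranking_)
```

Because RFE with a linear model ranks features by coefficient size, it is only meaningful when the features are on comparable scales, which is why scaling comes first.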

In [None]:
# making predictions using rfe model
y_pred = rfe.predict(X_test)

# r-squared
print(r2_score(y_test, y_pred))

In [None]:
# RFE with 7 features

# Initializing RFE model
Rg = LinearRegression()
rfe_7 = RFE(Rg, n_features_to_select=7)
# Fitting RFE (fit returns the fitted selector itself)
X_rfe7 = rfe_7.fit(X_train, y_train)
# making predictions using rfe model
y_pred = rfe_7.predict(X_test)

# r-squared
print(r2_score(y_test, y_pred))

Note that RFE with 7 features is giving about 88% r-squared, compared to 90% with 15 features. 
Should we then choose more features for slightly better performance?

In [None]:
# import statsmodels
import statsmodels.api as sm  

# subset the features selected by rfe_15
col_15 = X_train.columns[rfe.support_]

# subsetting training data for 15 selected columns
X_train_rfe_15 = X_train[col_15]

# add a constant to the model
X_train_rfe_15 = sm.add_constant(X_train_rfe_15)
X_train_rfe_15.head()

In [None]:
# fitting the model with 15 variables
Rg_15 = sm.OLS(y_train, X_train_rfe_15).fit()   
print(Rg_15.summary())

Note that the model with 15 variables gives about 93.6% r-squared, though that is on training data. The adjusted r-squared is 92.9%.
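
Adjusted r-squared penalises r-squared for the number of predictors, which is why it sits slightly below the plain training value. A small sketch of the standard formula follows; the function name `adjusted_r2_score` and the sample numbers in the call are hypothetical and are not taken from the model summary.

```python
def adjusted_r2_score(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: the penalty grows as features are added for a fixed sample size.
print(adjusted_r2_score(0.90, n_samples=100, n_features=5))
print(adjusted_r2_score(0.90, n_samples=100, n_features=30))
```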

In [None]:
# making predictions using rfe_15 sm model
X_test_rfe_15 = X_test[col_15]

# adding a constant variable
X_test_rfe_15 = sm.add_constant(X_test_rfe_15, has_constant='add')
X_test_rfe_15.info()

# making predictions
y_pred = Rg_15.predict(X_test_rfe_15)

In [None]:
# r-squared
r2_score(y_test, y_pred)

The test r-squared of the model with 15 features is about 90.7%, while the training r-squared is about 93.6%.

Choosing the optimal number of features

In [None]:
n_features_list = list(range(4, 20))
adjusted_r2 = []
r2 = []
test_r2 = []

for n_features in n_features_list:

    # RFE with n features (n_features_to_select must be passed by keyword)
    Rg = LinearRegression()
    rfe_n = RFE(Rg, n_features_to_select=n_features)
    rfe_n.fit(X_train, y_train)

    # subset the training data to the columns selected by RFE
    col_n = X_train.columns[rfe_n.support_]
    X_train_rfe_n = sm.add_constant(X_train[col_n])

    # fit an OLS model on the selected columns and record the training metrics
    Rg_n = sm.OLS(y_train, X_train_rfe_n).fit()
    adjusted_r2.append(Rg_n.rsquared_adj)
    r2.append(Rg_n.rsquared)

    # predict on the test set using the same columns plus a constant
    X_test_rfe_n = sm.add_constant(X_test[col_n], has_constant='add')
    y_pred = Rg_n.predict(X_test_rfe_n)

    test_r2.append(r2_score(y_test, y_pred))


In [None]:
# plotting adjusted_r2 against n_features
plt.figure(figsize=(10, 8))
plt.plot(n_features_list, adjusted_r2, label="adjusted_r2")
plt.plot(n_features_list, r2, label="train_r2")
plt.plot(n_features_list, test_r2, label="test_r2")
plt.legend(loc='upper left')
plt.show()

Based on the plot, we can choose the number of features considering the r2_score we are looking for.

We can choose anything between 4 and 12 features, since beyond 12 the test r2 goes down, and with fewer than 4 features the r2 score is too low.

In fact, the test r2 score doesn't increase much from n=6 to n=12. It is thus wiser to choose a simpler model, so let's choose n=6.
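
The "simplest model that is nearly as good as the best" rule can be written down explicitly. The helper `smallest_n_within` and the scores in the example call are hypothetical; they are made-up numbers, not the values from the plot. This is only a sketch of one possible selection rule, and the tolerance is a judgment call.

```python
def smallest_n_within(n_list, scores, tol=0.01):
    """Return the smallest n whose score is within `tol` of the best score."""
    best = max(scores)
    for n, score in sorted(zip(n_list, scores)):
        if score >= best - tol:
            return n

# Made-up scores for illustration only.
example_n = [4, 5, 6, 7]
example_scores = [0.80, 0.85, 0.895, 0.90]
print(smallest_n_within(example_n, example_scores, tol=0.01))
```

With the lists built in the loop, the call would be `smallest_n_within(n_features_list, test_r2)`. Choosing on test scores reuses the test set for model selection, so a cross-validated score would be a safer input where available.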

In [None]:
# RFE with the chosen number of features
lm = LinearRegression()

n_features = 6

# specify number of features (use the estimator created above)
rfe_n = RFE(lm, n_features_to_select=n_features)

# fit with n features
rfe_n.fit(X_train, y_train)

# subset the training data to the columns selected by RFE
col_n = X_train.columns[rfe_n.support_]
X_train_rfe_n = sm.add_constant(X_train[col_n])

# fit the final OLS model on the 6 selected variables
Rg_n = sm.OLS(y_train, X_train_rfe_n).fit()

# predict on the test set using the same columns plus a constant
X_test_rfe_n = sm.add_constant(X_test[col_n], has_constant='add')
y_pred = Rg_n.predict(X_test_rfe_n)

In [None]:
# summary
Rg_n.summary()

In [None]:
# results 
r2_score(y_test, y_pred)

**Final Model Evaluation**

In [None]:
# Error terms
c = [i for i in range(len(y_pred))]
fig = plt.figure()
plt.plot(c,y_test-y_pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('ytest-ypred', fontsize=16)                # Y-label
plt.show()

In [None]:
# mean
np.mean(y_test-y_pred)


At first glance the mean error does not look like 0, but compared to the scale of 'price', -380 is not a large number (see the distribution below).

In [None]:
sns.histplot(carP['price'], bins=50, kde=True)
plt.show()

In [None]:
# multicollinearity
predictors = ['carwidth', 'curbweight', 'enginesize', 
             'enginelocation_rear', 'car_company_bmw', 'car_company_porsche']

cors = x.loc[:, list(predictors)].corr()
sns.heatmap(cors, annot=True)
plt.show()
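
Pairwise correlations can miss collinearity that involves several predictors at once. A common complement is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below implements this with NumPy on synthetic data; `vif_table` and the columns `a`, `b`, `c` are hypothetical names for this illustration. statsmodels also provides a ready-made `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(df.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so the residuals have zero mean.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")

# Synthetic predictors: b is nearly a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

vif = vif_table(toy)
print(vif)
```

On the real data the call would be `vif_table(x[predictors])`. Values near 1 indicate little collinearity, while large values flag predictors that are largely explained by the others.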