# **Author: Sri Sudheera Chitipolu**

### *Github: https://github.com/sudheera96*

## Acquire the Data

Collected data from kaggle

*https://www.kaggle.com/hellbuoy/car-price-prediction*

## General Description: 
For understanding pricing dynamics of the new market in the different cars for business growth, we will predict the car’s prices depending on different independent variables. 

In [1]:
import pandas as pd # imported pandas as pd
cp_data = pd.read_csv("carPrice_Assignment.csv") # pulled data from csv file to pandas dataframe
print(cp_data.keys()) # display a list of the names of the fields
cp_data.head(10) # table display of the first few lines in the DataFrame

Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')


Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0
5,6,2,audi fox,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250.0
6,7,1,audi 100ls,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710.0
7,8,1,audi 5000,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920.0
8,9,1,audi 4000,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875.0
9,10,0,audi 5000s (diesel),gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.4,7.0,160,5500,16,22,17859.167


## _Notes_
In inital_exploration we have seen there are features with negative correlated with target variable. So we are droping them from data set

In [2]:
df = cp_data.drop(columns=["car_ID","CarName","highwaympg","citympg","peakrpm","symboling"])

In [3]:
df_obj_cols = df.select_dtypes(include = 'object')
df_num_cols = df.select_dtypes(exclude = 'object')
# make object columns to numerical columns using get_dummies
df_obj_dummies = pd.get_dummies(df_obj_cols,drop_first=True)
df_obj_dummies.head()

Unnamed: 0,fueltype_gas,aspiration_turbo,doornumber_two,carbody_hardtop,carbody_hatchback,carbody_sedan,carbody_wagon,drivewheel_fwd,drivewheel_rwd,enginelocation_rear,...,cylindernumber_three,cylindernumber_twelve,cylindernumber_two,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,1,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,1,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,1,0,1,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


## _Notes_

Created dummy variables for categorical without it with column in every respective dummy variable inorder to avoid multicollinarity.

In [4]:
final_cardf = pd.concat([df_num_cols, df_obj_dummies], axis=1)
final_cardf.head()

Unnamed: 0,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,...,cylindernumber_three,cylindernumber_twelve,cylindernumber_two,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,...,0,0,0,0,0,0,0,1,0,0
1,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,...,0,0,0,0,0,0,0,1,0,0
2,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154,...,0,0,0,0,0,0,0,1,0,0
3,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102,...,0,0,0,0,0,0,0,1,0,0
4,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115,...,0,0,0,0,0,0,0,1,0,0


## _Notes_
Similary removed other features which are having negative correlation with price. For more details look at the initial exploration

In [5]:
final_cardf = final_cardf.drop(columns=["cylindernumber_four","drivewheel_fwd","fuelsystem_2bbl","enginetype_ohc",
                                        "carbody_hatchback",
                                        "fueltype_gas","cylindernumber_three","fuelsystem_spdi",
                                        "carbody_wagon","doornumber_two","fuelsystem_spfi","fuelsystem_4bbl",
                                        "enginetype_rotor","cylindernumber_two","fuelsystem_mfi"])

In [6]:
from sklearn.model_selection import train_test_split
X = final_cardf.drop('price',1)
x_train, x_cv, y_train, y_cv = train_test_split(X,final_cardf.price, test_size =0.3)

## Importing Random Forest

In [11]:
# Building  Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train)


RandomForestRegressor()

In [13]:
rfr.score(x_train,y_train)

0.9869471047386335

In [17]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
rf_pred=rfr.predict(x_cv)
print('MSE:',mean_squared_error(rf_pred,y_cv))
print('MAE:',mean_absolute_error(rf_pred,y_cv))
print('r2_score:',r2_score(rf_pred,y_cv))

MSE: 6691987.561795239
MAE: 1713.4695826344084
r2_score: 0.8910310629688138


In [18]:
prediction_rf=pd.DataFrame({'Actual':y_cv,'Predicted':rf_pred})
prediction_rf.head(10)

Unnamed: 0,Actual,Predicted
144,9233.0,8836.85
196,15985.0,14318.43
99,8949.0,9695.92
204,22625.0,17475.0
195,13415.0,15883.52
20,6575.0,8441.44
59,8845.0,10345.093333
182,7775.0,8094.68
191,13295.0,13755.48
30,6479.0,6603.3
