# Problem Statement
The price of a car depends on many factors, such as the brand's reputation, the car's features, its horsepower, and its mileage. Car price prediction is a popular application of machine learning. If you want to learn how to train a car price prediction model, this project is for you.

### *Import Necessary Libraries*

In [1]:
import pandas as pd # Library for data manipulation and analysis

### *Load Dataset*

In [2]:
dataset = pd.read_csv('CarPrice.csv') # Read the dataset
dataset = dataset.drop(['car_ID'],axis=1) # Drop car_ID column not required for prediction

### *Summarize Dataset*

In [3]:
print(dataset.shape) # Print the shape of dataset (rows,columns)
print(dataset.head(5)) # Print first 5 rows of dataset

(205, 25)
   symboling                   CarName fueltype aspiration doornumber  \
0          3        alfa-romero giulia      gas        std        two   
1          3       alfa-romero stelvio      gas        std        two   
2          1  alfa-romero Quadrifoglio      gas        std        two   
3          2               audi 100 ls      gas        std       four   
4          2                audi 100ls      gas        std       four   

       carbody drivewheel enginelocation  wheelbase  carlength  ...  \
0  convertible        rwd          front       88.6      168.8  ...   
1  convertible        rwd          front       88.6      168.8  ...   
2    hatchback        rwd          front       94.5      171.2  ...   
3        sedan        fwd          front       99.8      176.6  ...   
4        sedan        4wd          front       99.4      176.6  ...   

   enginesize  fuelsystem  boreratio stroke compressionratio  horsepower  \
0         130        mpfi       3.47   2.68     

### *Splitting Dataset into X & Y*
### *Xdata contains both numerical and text columns; X keeps only the numerical ones*

In [4]:
Xdata = dataset.drop('price',axis='columns') # Drop price column from dataset as it is the target variable for prediction
numericalCols=Xdata.select_dtypes(exclude=['object']).columns # Select numerical columns from dataset
X=Xdata[numericalCols] # Store numerical columns in X
X # Print X

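| | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 |
| 1 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 |
| 2 | 1 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 |
| 3 | 2 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 |
| 4 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | -1 | 109.1 | 188.8 | 68.9 | 55.5 | 2952 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 |
| 201 | -1 | 109.1 | 188.8 | 68.8 | 55.5 | 3049 | 141 | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 |
| 202 | -1 | 109.1 | 188.8 | 68.9 | 55.5 | 3012 | 173 | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 |
| 203 | -1 | 109.1 | 188.8 | 68.9 | 55.5 | 3217 | 145 | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 |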
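
Selecting only the numerical columns means the text columns (for example `fueltype` and `carbody`) never reach the model. As a minimal sketch of one way they could be included, the cell below one-hot encodes a small toy frame with `pd.get_dummies`. The frame `toy` and its values are made up for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy frame reusing column names from the dataset; the values are illustrative only.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "sedan"],
    "horsepower": [111, 154, 102],
})

# Pick the non-numeric (text) columns and one-hot encode them.
# Numeric columns such as horsepower pass through unchanged.
text_cols = toy.select_dtypes(exclude=["number"]).columns
encoded = pd.get_dummies(toy, columns=list(text_cols))

print(encoded.columns.tolist())  # one indicator column per category value
```

A column with many distinct values, which `CarName` appears to be from the rows shown, would probably need separate handling (for example, extracting only the brand) before encoding.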


In [5]:
Y = dataset['price'] # Store price column in Y
Y # Print Y

0      13495.0
1      16500.0
2      16500.0
3      13950.0
4      17450.0
        ...   
200    16845.0
201    19045.0
202    21485.0
203    22470.0
204    22625.0
Name: price, Length: 205, dtype: float64

### *Scaling the Independent Variables (Features)*

In [6]:
from sklearn.preprocessing import scale # Standardizes each column to zero mean and unit variance
X = pd.DataFrame(scale(X), columns=X.columns, index=X.index) # Scale the features, keeping column names and row index
X # Display the scaled X

| | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.743470 | -1.690772 | -0.426521 | -0.844782 | -2.020417 | -0.014566 | 0.074449 | 0.519071 | -1.839377 | -0.288349 | 0.174483 | -0.262960 | -0.646553 | -0.546059 |
| 1 | 1.743470 | -1.690772 | -0.426521 | -0.844782 | -2.020417 | -0.014566 | 0.074449 | 0.519071 | -1.839377 | -0.288349 | 0.174483 | -0.262960 | -0.646553 | -0.546059 |
| 2 | 0.133509 | -0.708596 | -0.231513 | -0.190566 | -0.543527 | 0.514882 | 0.604046 | -2.404880 | 0.685946 | -0.288349 | 1.264536 | -0.262960 | -0.953012 | -0.691627 |
| 3 | 0.938490 | 0.173698 | 0.207256 | 0.136542 | 0.235942 | -0.420797 | -0.431076 | -0.517266 | 0.462183 | -0.035973 | -0.053668 | 0.787855 | -0.186865 | -0.109354 |
| 4 | 0.938490 | 0.107110 | 0.207256 | 0.230001 | 0.235942 | 0.516807 | 0.218885 | -0.517266 | 0.462183 | -0.540725 | 0.275883 | 0.787855 | -1.106241 | -1.273900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | -1.476452 | 1.721873 | 1.198549 | 1.398245 | 0.728239 | 0.763241 | 0.339248 | 1.666445 | -0.336970 | -0.162161 | 0.250533 | 0.577692 | -0.340094 | -0.400490 |
| 201 | -1.476452 | 1.721873 | 1.198549 | 1.351515 | 0.728239 | 0.949992 | 0.339248 | 1.666445 | -0.336970 | -0.364062 | 1.416637 | 0.367529 | -0.953012 | -0.837195 |
| 202 | -1.476452 | 1.721873 | 1.198549 | 1.398245 | 0.728239 | 0.878757 | 1.109571 | 0.926204 | -1.232021 | -0.338824 | 0.757535 | 0.787855 | -1.106241 | -1.128332 |
| 203 | -1.476452 | 1.721873 | 1.198549 | 1.398245 | 0.728239 | 1.273437 | 0.435538 | -1.183483 | 0.462183 | 3.244916 | 0.047732 | -0.683286 | 0.119594 | -0.546059 |
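
To make the scaling step concrete, the sketch below applies the same standardization by hand: subtract the column mean and divide by the population standard deviation (`ddof=0`), which is what `sklearn.preprocessing.scale` uses. It works on only four horsepower values taken from the rows displayed earlier, so the resulting numbers will not match the table above, which is scaled over all 205 rows.

```python
import numpy as np

# Horsepower values from four of the rows displayed earlier (illustrative subset).
hp = np.array([111.0, 154.0, 102.0, 115.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (hp - hp.mean()) / hp.std(ddof=0)

print(z)
print(z.mean())       # approximately 0 after standardization
print(z.std(ddof=0))  # approximately 1 after standardization
```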


### *Splitting Dataset into Train & Test*

In [7]:
from sklearn.model_selection import train_test_split # Utility for splitting data into train and test sets
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=0) # Hold out 30% of the rows for testing; random_state makes the shuffle repeatable
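
The scaling cell earlier in this notebook standardizes X using all 205 rows before the split, so the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The arrays `X_demo` and `y_demo` are random stand-ins, not the car data, and `StandardScaler` is swapped in for `scale` because it can be fit on one part and reused on the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target: 205 rows, 3 columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(205, 3))
y_demo = rng.normal(loc=13000.0, scale=8000.0, size=205)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With 205 rows and `test_size=0.3`, the split gives 143 training rows and 62 test rows. Tree-based models such as random forests are generally insensitive to feature scaling, so this ordering matters more for models that depend on feature magnitudes.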

### *Training using Random Forest*

In [8]:
from sklearn.ensemble import RandomForestRegressor # Library for Random Forest Regressor
model = RandomForestRegressor() # Create a model with default hyperparameters (no random_state, so results vary slightly between runs)
model.fit(x_train, y_train) # Fit the model
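
Because `RandomForestRegressor()` is created without a `random_state`, each run builds slightly different trees, so the score reported below will not be exactly reproducible. A minimal sketch of fixing the seed is shown here on synthetic data; `X_demo`, `y_demo`, the helper `fit_predict`, and `n_estimators=20` are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

def fit_predict(seed):
    """Fit a small forest with a fixed seed and predict the first five rows."""
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    forest.fit(X_demo, y_demo)
    return forest.predict(X_demo[:5])

# Two fits with the same seed give identical predictions.
same_a = fit_predict(0)
same_b = fit_predict(0)
print(np.allclose(same_a, same_b))
```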

### *Evaluating Model*

In [9]:
ypred = model.predict(x_test) # Predict the values

from sklearn.metrics import r2_score # Library for R2Score
r2score = r2_score(y_test,ypred) # Calculate R2Score
print("R2Score :",r2score*100) # Print R2Score in percentage

R2Score : 91.1503284907459
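
For reference, the R² score compares the model's squared errors with those of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand. The `actual` values are prices from the rows displayed earlier, while the `predicted` values are hypothetical and do not come from the trained model.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [13495.0, 16500.0, 13950.0, 17450.0]     # prices from the rows shown earlier
predicted = [14000.0, 16000.0, 14500.0, 17000.0]  # hypothetical predictions

print(r2_manual(actual, predicted))
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and the score can be negative for a model that does worse than the mean. With only 62 test rows, a single score like the one above can shift noticeably with a different split.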
