# Car Price Prediction 

This dataset consists of 16 features, which we use to predict the price (MSRP). This is a regression problem.
We will observe, analyse, and visualize the data. We will then train several regression algorithms and test which one performs best.

### Outline

- <b> Observation </b>
- <b> Exploratory Data Analysis </b>
    - Fixing missing values
    - Fixing Outliers
    - Data Visualization
- <b> Preparing data </b>
    - Encoding
    - Scaling 
    - Splitting the dataset
    
- <b> Models </b>
    - LightGBM Regressor
    - Random Forest Regressor
    - XGBoost Regressor
    
    

If I have made any mistakes please let me know in the comments.

# Observation

In [None]:
import pandas as pd

In [None]:
path = "../input/cardataset/"

In [None]:
df = pd.read_csv(path + 'data.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.describe()

We have now looked at the data on the surface: what it is about, how large it is, and what we need to predict. Let's dig deeper into <b> Exploratory Data Analysis </b>.

# Exploratory Data Analysis

Before making any changes to the original data, it's always better to make a copy and apply the changes to the copy.

In [None]:
xdf = df.copy()

### Checking missing values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize = (10,8))
sns.heatmap(xdf.isnull(), cbar = False);

In [None]:
xdf.isnull().mean().round(4).mul(100).sort_values(ascending=False)
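
To make the expression above concrete, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are illustrative assumptions, not part of the car dataset. Because `isnull()` returns booleans, the column-wise mean is the fraction of missing values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the car dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Mean of the boolean mask = share of missing values; times 100 = percentage
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.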

We observe that <b> Market Category </b> has the most missing values, followed by <b> Engine HP </b> and <b> Engine Cylinders </b>.

In [None]:
# Observing Market Category

xdf['Market Category'].value_counts()

We see it is not that useful for our model, especially when our data already contains additional details like <b> Model </b>, <b> Vehicle Style </b>, <b> Vehicle Size </b> or <b> Make </b>. Anyone can tell which type of car it is from these features. We will drop the entire <b> Market Category </b> column.

In [None]:
xdf = xdf.drop('Market Category', axis = 1)

In [None]:
df.shape

In [None]:
xdf.shape

In [None]:
## There are still other missing values: Engine HP, Engine Cylinders, Number of Doors and Engine Fuel Type

In [None]:
xdf['Engine HP'].isnull().sum()

It looks like some entries are missing their Engine HP.

In [None]:
null_data = xdf[xdf.isnull().any(axis = 1)]

In [None]:
null_data

We see that the rows with <b> NaN </b> in <b> Engine HP </b> are actually <b> electric vehicles </b>. Technically, an <b> electric vehicle </b> doesn't have an engine, so we will replace the NaN values with 0.

In [None]:
xdf['Engine HP'] = xdf['Engine HP'].fillna(0)
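
Before filling with 0, it may be worth confirming that every row with a missing Engine HP really is electric. The sketch below shows one way to check on made-up data; `toy` and its values are assumptions for illustration. It also shows a conditional fill, which is a slightly different technique from the blanket `fillna(0)` used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one non-electric car also lacks an HP value
toy = pd.DataFrame({
    "Engine Fuel Type": ["electric", "regular unleaded", "electric", "diesel"],
    "Engine HP": [np.nan, 150.0, np.nan, np.nan],
})

# Count the rows with missing HP per fuel type
missing_by_fuel = toy.loc[toy["Engine HP"].isnull(), "Engine Fuel Type"].value_counts()
print(missing_by_fuel)

# Fill with 0 only where the car is electric; other gaps stay NaN for separate handling
is_electric = toy["Engine Fuel Type"] == "electric"
toy.loc[is_electric, "Engine HP"] = toy.loc[is_electric, "Engine HP"].fillna(0)
print(toy)
```

If the count shows only electric cars, the blanket fill and the conditional fill give the same result.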

In [None]:
xdf.isnull().sum()

In [None]:
null_data = xdf[xdf.isnull().any(axis = 1)]

In [None]:
null_data

This is the same case: an <b> electric vehicle </b> doesn't have any <b> Engine Cylinders </b>. So, again, we will replace the missing values with 0.

In [None]:
xdf['Engine Cylinders'] = xdf['Engine Cylinders'].fillna(0)

In [None]:
xdf.isnull().sum()

In [None]:
null_data = xdf[xdf.isnull().any(axis = 1)]

In [None]:
null_data

We could fill in the missing data manually in this case, but that is not the best method, since we usually have to deal with a large amount of missing data. So we will use the <b> mode </b> for <b> Number of Doors </b> and fill <b> Engine Fuel Type </b> with its most common value.

In [None]:
xdf['Engine Fuel Type'].value_counts()

In [None]:
# mode() returns a Series, so take its first value to fill with a scalar
xdf['Number of Doors'] = xdf['Number of Doors'].fillna(xdf['Number of Doors'].mode()[0])
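
One detail worth knowing when filling with the mode: `mode()` returns a Series rather than a single number, and `fillna` aligns a Series argument on the index. The sketch below uses a made-up `doors` Series (an assumption for illustration) to show the difference between passing the Series and passing its first value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps at index 3 and 5
doors = pd.Series([4.0, 2.0, 4.0, np.nan, 4.0, np.nan])

# mode() returns a Series (there can be ties); here it holds one value at index 0
print(doors.mode())

# Passing the Series aligns on index, so only a gap at index 0 could be filled
filled_with_series = doors.fillna(doors.mode())

# Passing the first mode as a scalar fills every gap
filled_with_scalar = doors.fillna(doors.mode()[0])

print(filled_with_series.isnull().sum(), filled_with_scalar.isnull().sum())
```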

In [None]:
xdf['Engine Fuel Type'] = xdf['Engine Fuel Type'].fillna('regular unleaded')

In [None]:
xdf['Engine Fuel Type'].value_counts()

In [None]:
xdf.isnull().sum()

Hence, we have <b> NaN </b>-free data. Moving on to the outliers.

### Fixing Outliers

In [None]:
xdf.describe()

In [None]:
xdf.hist(bins = 50, figsize = (10,10));

In [None]:
## Boxplots

In [None]:
## Engine Cylinders
plt.figure(figsize = (5,8))
sns.boxplot(data = xdf, y = 'Engine Cylinders');

In [None]:
## Engine HP
plt.figure(figsize = (5,8))
sns.boxplot( data = xdf, y = 'Engine HP');


In [None]:
## highway MPG
plt.figure(figsize = (5,8))
sns.boxplot(data = xdf, y = 'highway MPG');

In [None]:
## city mpg
plt.figure(figsize = (5,8))
sns.boxplot(data = xdf, y = 'city mpg');

In [None]:
## popularity

plt.figure(figsize = (5,8))
sns.boxplot(data = xdf, y = 'Popularity');

In [None]:
plt.figure(figsize = (5,8))
sns.boxplot(data = xdf, y = 'MSRP');

In [None]:
out_xdf = xdf.copy()

In [None]:
## Let's clean the dataset

In [None]:
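def removingoutliers(dataframe, column):
    """Return the rows whose value in `column` lies within 1.5 * IQR of the quartiles."""
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)

    IQR = Q3 - Q1
    # Named lower/upper so the built-in min() and max() are not shadowed
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    df_no_outlier = dataframe[(dataframe[column] > lower) & (dataframe[column] < upper)]

    return df_no_outlier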
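
As a worked example of the IQR rule used in the function above, here is a minimal sketch on a made-up Series. The name `prices` and its values are assumptions for illustration. Any value outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR is treated as an outlier.

```python
import pandas as pd

# Hypothetical toy data with one obvious extreme value
prices = pd.Series([10, 12, 13, 15, 16, 18, 20, 200])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside (lower, upper) is dropped
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = prices[(prices > lower) & (prices < upper)]
print(lower, upper)
print(kept.tolist())
```

Note that applying the rule to several columns one after another, as done below, recomputes the quartiles on the already filtered data each time.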

In [None]:
## Removing Outlier for MSRP
out_xdf = removingoutliers(out_xdf, "MSRP")

In [None]:
## Popularity

out_xdf = removingoutliers(out_xdf, "Popularity")

In [None]:
## City Mpg

out_xdf = removingoutliers(out_xdf, "city mpg")

In [None]:
## highway MPG

out_xdf = removingoutliers(out_xdf, "highway MPG")

In [None]:
## Engine HP

out_xdf = removingoutliers(out_xdf, "Engine HP")

In [None]:
## Engine Cylinders

out_xdf = removingoutliers(out_xdf, "Engine Cylinders")

In [None]:

out_xdf.describe()

This seems perfect

### Data Visualization

We already have some knowledge of our data. To gain more insight and check the relationships between the features, we use data visualization.


In [None]:
out_xdf.info()

In [None]:
plt.figure(figsize = (10,10))
sns.countplot(data = out_xdf, y = "Make");

- <b> Chevrolet </b> 
- <b> Volkswagen </b>
- <b> Toyota </b>
- <b> Nissan </b>

These makes have the most entries in the data; they are widely used brands with average prices all over the world. Premium or luxury makes have far fewer entries:
- <b> HUMMER </b>
- <b> Maserati </b>
- <b> Alfa Romeo </b>


Let's observe the same figure with price

In [None]:
plt.figure(figsize = (10,12))
sns.barplot(data = out_xdf, y = "Make", x = "MSRP");


It's clear that <b> Maserati </b>, <b> HUMMER </b>, <b> Land Rover </b> and <b> Alfa Romeo </b> are the expensive makes.

Let's check the Model of the Cars

In [None]:
out_xdf["Model"].value_counts()

There are 703 unique models. This feature is not useful for visualization, as a single brand consists of multiple models.

Let's see <b> Engine Fuel Type </b>

In [None]:
out_xdf['Engine Fuel Type'].value_counts()

In [None]:
plt.figure(figsize = (10,12))
sns.countplot(y = "Engine Fuel Type", data = out_xdf);

In [None]:
## Engine Fuel Type by price

plt.figure(figsize = (12,8))
sns.barplot(y = "Engine Fuel Type", x = "MSRP", data = out_xdf);

In [None]:
## Engine Fuel Type by popularity

plt.figure(figsize = (12,8))
sns.barplot(y = "Engine Fuel Type", x = "Popularity", data = out_xdf);

<b> Engine HP </b>

In [None]:
plt.figure(figsize = (10,12))
sns.scatterplot(data = out_xdf, y = "Engine HP", x = "MSRP");

In [None]:
plt.figure(figsize = (10,12))
sns.lineplot(data = out_xdf, x = "Engine HP", y = "MSRP");

Ignore the 0 HP values; they come from <b> electric cars </b>. The higher the HP, the more expensive the car.

<b> Engine Cylinders </b>

In [None]:
## Checking it by Price

plt.figure(figsize = (10,8))
sns.barplot(x = 'Engine Cylinders', y = 'MSRP', data = out_xdf);

It's clearly visible that cars with more <b> Engine Cylinders </b> tend to be more expensive.

In [None]:
## Checking it by Popularity

plt.figure(figsize = (10,8))
sns.barplot(x = 'Engine Cylinders', y = 'Popularity', data = out_xdf);

<b> Transmission Type </b>

In [None]:
## Checking it by Price
plt.figure(figsize = (10,8))
sns.barplot(x = 'Transmission Type', y = 'MSRP', data = out_xdf);

We see <b> AUTOMATED_MANUAL </b> is the most expensive one.

In [None]:
## Checking it by Popularity 

plt.figure(figsize = (10,8))
sns.barplot(x = 'Transmission Type', y = 'Popularity', data = out_xdf);

<b> Driven_Wheels </b>

In [None]:
## Checking it by Popularity

plt.figure(figsize = (10,8))
sns.barplot(x = 'Driven_Wheels', y = 'Popularity', data = out_xdf);

In [None]:
## Checking it by Price
plt.figure(figsize = (10,8))
sns.barplot(x = 'Driven_Wheels', y = 'MSRP', data = out_xdf);

Based on my research on Google:
- Sports cars are often <b> rear wheel drive </b> only. This is because all wheel drive adds weight and complexity to the car, which reduces its performance.
- However, <b> four wheel drive </b> and <b> all wheel drive </b> are most commonly seen in modern-day <b> SUVs </b> and <b> pickup trucks </b>.

<b> Number of Doors </b>

In [None]:
## By its popularity

plt.figure(figsize = (10,8))
sns.boxplot(x = 'Number of Doors', y = 'Popularity', data = out_xdf);

In [None]:
## By its price

plt.figure(figsize = (10,8))
sns.boxplot(x = 'Number of Doors', y = 'MSRP', data = out_xdf);

2-door vehicles are often <b> sports cars </b>, which are expensive.

<b> Vehicle Style </b>

In [None]:
## Plotting the count

plt.figure(figsize = (8,12))
sns.countplot(y = "Vehicle Style", data = out_xdf);

In [None]:
## Let's compare it with price

plt.figure(figsize = (8,10))
sns.boxplot( x= "MSRP", y = "Vehicle Style", data = out_xdf);

<b> 4 Dr (Door) SUV </b>, <b> Coupe </b>, <b> Convertible </b> and <b> Sedan </b> have higher prices, as expected.

In [None]:
## With its city MPG

plt.figure(figsize = (8,10))
sns.barplot( x= "city mpg", y = "Vehicle Style", data = out_xdf);

<b> highway MPG </b>

In [None]:
plt.figure(figsize = (10,8))
sns.countplot(x = "highway MPG", data = out_xdf);

In [None]:
plt.figure(figsize = (10,8))
sns.lineplot(data = out_xdf, x = "highway MPG", y = "MSRP", hue = "Driven_Wheels");

In [None]:
## With number of Doors 

plt.figure(figsize = (10,8))
sns.lineplot(data = out_xdf, x = "highway MPG", y = "MSRP", hue = "Number of Doors");

In [None]:
# With the Price

plt.figure(figsize = (10,12))
sns.barplot(data = out_xdf, x = "highway MPG", y = "MSRP");

Highway MPG: Highway MPG is the average MPG for your car on the highway. Even though you'll be driving faster and often for longer stretches of time, highways are smoother than city roads and keep your engine running at consistent levels, which requires less gas

In [None]:
plt.figure(figsize = (12,12))
sns.boxplot(data = out_xdf, x = "highway MPG", y = "MSRP");

<b> city mpg </b>

In [None]:
out_xdf['city mpg'].value_counts()

In [None]:
## let's compare it with price
plt.figure(figsize = (12,12))
sns.boxplot(data = out_xdf, x = "city mpg", y = "MSRP");

In [None]:
plt.figure(figsize = (10,8))
sns.lineplot(data = out_xdf, x = "city mpg", y = "MSRP", hue = "Number of Doors");

MPG depends on the vehicle. For a hybrid car, 19 MPG is bad. For a large pickup truck, 19 MPG is good.

<b> Popularity </b>

In [None]:
##
plt.figure(figsize = (12,8))
sns.lineplot(x = "Popularity", y = "MSRP", data = out_xdf);

In [None]:
plt.figure(figsize = (12,8))
sns.lineplot(x = "Popularity", y = "MSRP", data = out_xdf, hue = "Driven_Wheels");

The green line stands out from the rest, i.e. all wheel drive is the most popular one.

In [None]:
## Which vehicle style is popular

plt.figure(figsize = (8,10))
sns.boxplot( x= "Popularity", y = "Vehicle Style", data = out_xdf);

In [None]:
## let's see by Brand

plt.figure(figsize = (10,12))
sns.barplot(data = out_xdf, y = "Make", x = "Popularity");

<b> BMW </b>, <b> Audi </b>, <b> Honda </b>, <b> Nissan </b> and <b> Toyota </b> are the most popular brands.

<b> Year </b>

In [None]:
plt.figure(figsize = (8,8))
sns.countplot(y = "Year", data = out_xdf);

<b> 2015 </b> and <b> 2016 </b> have the most entries.

In [None]:
## Count of each vehicle size

plt.figure(figsize = (12,8))
sns.countplot(y = "Vehicle Size", data = out_xdf);

In [None]:
## with respect to price


plt.figure(figsize = (8,8))
sns.barplot(y = "Vehicle Size", x = "MSRP", data = out_xdf);

In [None]:
## With respect to popularity

plt.figure(figsize = (8,8))
sns.barplot(y = "Vehicle Size", x = "Popularity", data = out_xdf);

In [None]:
plt.figure(figsize = (12,8))
sns.lineplot(x = "Popularity", y = "MSRP", data = out_xdf, hue = "Vehicle Size");

In [None]:
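## Let's see the correlation of all the numeric features

# Keep only numeric columns; corr() in pandas 2.x raises an error on text columns
corr = out_xdf.select_dtypes(include = "number").corr()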

plt.figure(figsize = (10,10))
sns.heatmap(corr, cmap = 'viridis', annot = True, linewidth = 0.1);
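
Besides the heatmap, it can help to list how strongly each numeric feature correlates with the target. The sketch below does this on a made-up DataFrame; `toy` and its values are assumptions for illustration, with column names borrowed from this dataset.

```python
import pandas as pd

# Hypothetical toy data, not the car dataset
toy = pd.DataFrame({
    "Engine HP": [100, 150, 200, 300],
    "city mpg": [30, 25, 20, 15],
    "MSRP": [15000, 22000, 30000, 45000],
    "Make": ["A", "B", "C", "D"],
})

# Text columns such as Make cannot be correlated, so keep numeric columns only
numeric = toy.select_dtypes(include="number")

# Correlation of every numeric feature with the target, strongest positive first
corr_with_price = numeric.corr()["MSRP"].drop("MSRP").sort_values(ascending=False)
print(corr_with_price)
```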

## Prepare the Dataset

In [None]:
xxdf = out_xdf.copy()

In [None]:
# 'Model' has the most categories of any column. The drop below is left commented out, so 'Model' is still encoded.

#xxdf = xxdf.drop(columns = 'Model', axis = 1)

In [None]:
from sklearn.preprocessing import OneHotEncoder

<b> Encoding Categorical Variables </b>

Our dataset contains categorical features, so before splitting the dataset we have to encode them. Since our dataset doesn't contain any <b> Ordinal Data </b>, the categories have no inherent order.

We will be using <b> One Hot Encoding </b>

Read this article to know more about Encoding:
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

In [None]:
xxdf.info()

In [None]:
cat_features = ["Make","Model","Engine Fuel Type","Transmission Type","Driven_Wheels","Vehicle Size","Vehicle Style"]

In [None]:
xxdf = pd.get_dummies(xxdf, columns = cat_features)
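
To show what `pd.get_dummies` produces, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are assumptions for illustration. Each category becomes its own indicator column, and numeric columns are left untouched.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({
    "Driven_Wheels": ["front wheel drive", "rear wheel drive", "front wheel drive"],
    "Year": [2015, 2016, 2017],
})

# One indicator column per category, named <column>_<category>
encoded = pd.get_dummies(toy, columns=["Driven_Wheels"])
print(encoded.columns.tolist())
print(encoded)
```

Since every category adds a column, a feature with many categories, such as Model with its 703 unique values, widens the encoded data considerably.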

In [None]:
xxdf

<b> Splitting the Dataset </b>

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = xxdf.drop(["MSRP"], axis = 1)
y = xxdf["MSRP"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 1)

<b> Scaling the Dataset </b>

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
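
A common convention with `StandardScaler` is to learn the mean and standard deviation from the training set only and apply those same values to the test set, so both sets end up on the same scale. The sketch below illustrates this on made-up arrays; `X_train_toy` and `X_test_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a single feature
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train_toy)  # learns mean and std from the training data
test_scaled = scaler.transform(X_test_toy)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The test value 2.0 equals the training mean, so it maps to 0. Refitting the scaler on the test set would instead centre the test data on its own mean.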

## Modeling and Testing

In [None]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error, make_scorer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import math
import numpy as np

<b> LGBM Regressor </b>

In [None]:
lgm = LGBMRegressor(n_estimators = 40)
model = make_pipeline(lgm)
model.fit(X_train_std, y_train)

print(model)

In [None]:
kfold = KFold(n_splits = 5)
score = cross_val_score(model, X_train_std, y_train, cv = kfold)
print(score)

In [None]:
print(np.mean(score))
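
One possible refinement of k-fold cross-validation is to put the scaler inside the pipeline, so it is refit on the training part of every fold. The sketch below illustrates the idea on synthetic data. `LinearRegression` is only a stand-in model here, and `X_toy`, `y_toy` and the coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The scaler is a pipeline step, so each fold scales with its own training statistics
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For regressors, the default score is R2
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv)
print(scores, scores.mean())
```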

In [None]:
yp = model.predict(X_test_std)
print("R2 Score:", r2_score(y_test, yp))
print("Mean Squared Error:",mean_squared_error(y_test, yp))
print("RMSE:", math.sqrt(mean_squared_error(y_test, yp)))
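
For reference, the three metrics printed above can be computed by hand. The sketch below does so on made-up arrays; `y_true` and `y_pred` are assumptions for illustration. MSE is the mean of the squared errors, RMSE is its square root, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error and its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = math.sqrt(mse)

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

RMSE is in the same unit as the target (here, the price), which makes it easier to interpret than MSE.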

In [None]:
plt.scatter(y_test, yp)
plt.xlabel('True Value')
plt.ylabel("Predicted Value")

In [None]:
sns.displot((y_test - yp), bins = 50)

<b> RandomForestRegressor </b>

In [None]:
rfr = RandomForestRegressor(n_estimators = 100)
rfr_model = make_pipeline(rfr)
print(rfr_model)

In [None]:
rfr_model.fit(X_train_std, y_train)
kfold = KFold(n_splits = 5)

In [None]:
score = cross_val_score(rfr_model, X_train_std, y_train, cv = kfold)
print(score)

In [None]:
rfr_yp = rfr_model.predict(X_test_std)
print("R2 Score:", r2_score(y_test, rfr_yp))
print("Mean Squared Error:", mean_squared_error(y_test, rfr_yp))
print("Root Mean Squared Error:", math.sqrt(mean_squared_error(y_test, rfr_yp)))

<b> XGBRegressor </b>

In [None]:
xgb1 = XGBRegressor()
parameters = {'n_estimators': [500]}

xgb_grid = GridSearchCV(xgb1, parameters, cv = 2 )

xgb_grid.fit(X_train_std, y_train)

In [None]:
print(xgb_grid.best_score_)


In [None]:
xgb_grid.best_params_
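
A grid with a single value has only one candidate, so the search has nothing to compare. The sketch below shows a grid search over several candidates on synthetic data. `Ridge` and its `alpha` parameter are only stand-ins to keep the example light, and `X_toy`, `y_toy` and the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

# Three candidate values; each is scored with 3-fold cross-validation
param_grid = {"alpha": [0.01, 1.0, 100.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=3)
grid.fit(X_toy, y_toy)

print(grid.best_params_, grid.best_score_)
```

The same pattern applies to any estimator: list several values per parameter, and `best_params_` reports the combination with the highest mean cross-validated score.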

In [None]:
xgb_predict = xgb_grid.predict(X_test_std)

In [None]:
print("R2 Score:", r2_score(y_test, xgb_predict))
print("Mean Squared Error:", mean_squared_error(y_test, xgb_predict))
