<a href="https://colab.research.google.com/github/shyamkrishnan1999/ML-project-1/blob/master/EDA_car_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Here we have imported the necessary libraries. Now let us load the dataset.

In [None]:
#Loading the dataset

url="https://raw.githubusercontent.com/shyamkrishnan1999/ML-project-1/master/data.csv"
data=pd.read_csv(url)
data.head()

Let's check a few measures on our data.

In [None]:
#Checking various measures in our data

#A small info about our data 
data.info()

In [None]:
#Shape of your data
data.shape

In [None]:
#Checking for null values
data.isnull().sum()

In [None]:
#Small description about our data
data.describe()

From observing the dataset we can see that 'Number of Doors', 'Engine Fuel Type' and 'Market Category' aren't very relevant, so we drop them.

In [None]:
#Dropping irrelevant columns

data=data.drop(columns=['Number of Doors','Market Category','Engine Fuel Type'])
data.head()

In [None]:
#Dropping rows with null values and removing duplicates

data=data.dropna().drop_duplicates()
data.duplicated(subset=None).unique()

In [9]:
#Renaming columns

data=data.rename(columns={'Engine HP':'HP','Engine Cylinders':'Cylinders',
                     'Transmission Type':'Transmission','Driven_Wheels':'Drive Mode','highway MPG':'MPG-H','city mpg':'MPG-C','MSRP':'Price'})

In [None]:
#Rechecking our data
data.isna().sum()

Let us plot outliers in our data and remove them.

In [None]:
#plotting outliers in data

sns.boxplot(x=data['HP'])

In [None]:
sns.boxplot(x=data['Cylinders'])

In [None]:
sns.boxplot(x=data['MPG-H'])

In [None]:
sns.boxplot(x=data['MPG-C'])

In [None]:
sns.boxplot(x=data['Popularity'])

In [None]:
sns.boxplot(x=data['Price'])

In [None]:
#Removing outliers

#Applying the IQR rule to the numeric columns only
num=data.select_dtypes(include='number')
Q1=num.quantile(0.25)
Q3=num.quantile(0.75)
IQR=Q3-Q1
data=data[~((num<(Q1-1.5*IQR))|(num>(Q3+1.5*IQR))).any(axis=1)]
sns.boxplot(x=data['HP'])  #Checking whether outliers have been removed or not
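
To make the IQR rule used above concrete, here is a minimal sketch on a single toy column. The helper name `iqr_bounds` and the horsepower values are made up for illustration and are not part of the dataset; the fences are the usual Q1 - 1.5*IQR and Q3 + 1.5*IQR.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy horsepower values (hypothetical); 1001 is an obvious outlier.
hp = pd.Series([100, 110, 120, 130, 140, 150, 160, 1001])
lower, upper = iqr_bounds(hp)

# Keep only the values that fall inside the fences.
kept = hp[(hp >= lower) & (hp <= upper)]
print(lower, upper, len(kept))
```

Note that the notebook drops a row if any numeric column falls outside its fences, so columns such as Year and Popularity also take part in the filtering.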

Let's look at which car brands are most represented in the dataset and their average prices.

In [None]:
top_brands=data['Make'].value_counts().head(10)
top_brands

In [None]:
#Keeping only the rows that belong to the top 10 brands
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
top_brands_data=pd.concat([data[data['Make']==i] for i in top_brands.index])

top_brands_data

In [None]:
#Finding average price of top 10 car brands

price_data=top_brands_data[['Make','Price']]

#Grouping by brand, keeping the brands in order of their frequency
top_brands_mean_price=(price_data.groupby('Make')['Price'].mean()
                       .reindex(top_brands.index)
                       .rename_axis('Brands')
                       .reset_index(name='Average price'))

top_brands_mean_price


We plot the top car brands and their average prices on a bar graph.

In [None]:
plt.figure(figsize=(10,8))
plt.barh(top_brands_mean_price['Brands'],top_brands_mean_price['Average price'],)

From the above graph we can see that the average price of a Suzuki car is the lowest, whereas the average price of an Infiniti car is the highest.

Let us perform our Exploratory Data Analysis on the given dataset.

First let us find the correlation matrix.

In [None]:
#Finding correlation matrix and plotting it in a heatmap

corr=data.corr(numeric_only=True)  #text columns cannot be correlated, so they are skipped
plt.figure(figsize=(10,8))
corr_matrix=sns.heatmap(corr,annot=True)
corr_matrix

From the above figure we can see that Year, HP and Cylinders have a high correlation with price.
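
A heatmap can be hard to read at a glance, so one option is to pull out the price column of the correlation matrix and rank the features by absolute correlation. The sketch below uses a small made-up frame (`toy`) rather than the real dataset, so the numbers it produces say nothing about the actual cars.

```python
import pandas as pd

# Hypothetical toy data: price rises with HP and falls with highway MPG.
toy = pd.DataFrame({
    "HP": [100, 150, 200, 250, 300],
    "MPG-H": [40, 35, 30, 24, 20],
    "Price": [15000, 22000, 30000, 41000, 50000],
    "Make": ["A", "B", "C", "D", "E"],  # text column, ignored by corr
})

# Correlation of every numeric feature with Price, strongest first.
price_corr = (
    toy.corr(numeric_only=True)["Price"]
    .drop("Price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

On the notebook's own frame the same expression would be applied to `data` instead of `toy`.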

It would be better to convert our prices into price ranges, for example 10-29k, 30-49k and so on.

In [None]:
def price_range(x):
  if x<10000:
    return "Under 10k"
  elif x<30000:
    return "10-29k"
  elif x<50000:
    return "30-49k"
  elif x<70000:
    return "50-69k"
  else:
    return "Over 70k"

#Applying row by row so that each label stays aligned with its own price
#(sorting the prices first would attach labels to the wrong rows)
data['Price Range']=data['Price'].apply(price_range)
data
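
As an alternative to a hand-written function, the same binning can be sketched with `pd.cut`. The bin edges below mirror the thresholds used above; the sample prices are made up to exercise the boundaries. With `right=False` each bin includes its lower edge and excludes its upper edge, which matches the `x<...` comparisons.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirroring the price_range thresholds.
bins = [-np.inf, 10_000, 30_000, 50_000, 70_000, np.inf]
labels = ["Under 10k", "10-29k", "30-49k", "50-69k", "Over 70k"]

# Hypothetical prices chosen to sit on or next to the bin edges.
prices = pd.Series([9_999, 10_000, 29_999, 30_000, 69_999, 70_000])

# right=False makes each interval closed on the left: [low, high).
price_range_cut = pd.cut(prices, bins=bins, labels=labels, right=False)
print(price_range_cut.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the ranges in price order rather than alphabetical order.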


Let us plot each of our feature variables against price and price range.

In [None]:
#Plotting price Range against year

plt.figure(figsize=(8,8))
sns.swarmplot(x=data['Year'],y=data['Price Range']) #swarmplot takes about 2 minutes to load

From the above swarm plot we can see that fewer cars costing below 10k dollars were manufactured during the period 2000-2005. On the other hand, there has been a huge increase in car models costing 50-69k dollars from the year 2000 and over 70k dollars from the year 2015.

In [None]:
#plotting price against year

plt.figure(figsize=(8,8))
plt.scatter(data['Year'],data['Price'])

From the above scatter plot we can see that after the year 2000 the prices of cars have increased substantially.

In [None]:
#plotting HP against Price range

plt.figure(figsize=(8,8))
sns.stripplot(x=data['HP'],y=data['Price Range'])

In [None]:
#plotting HP against price

plt.figure(figsize=(8,8))
plt.scatter(data['HP'],data['Price'])

From the above plot we can see that as the horsepower (HP) of the engine increases, the price also increases.

In [None]:
#plotting no of cylinders against price range

sns.scatterplot(x=data['Cylinders'],y=data['Price Range'])

In [None]:
#Plotting number of cylinders against price
plt.figure(figsize=(8,8))
plt.scatter(data['Cylinders'],data['Price'])

In [None]:
#plotting highway mileage against price
plt.figure(figsize=(8,8))
plt.scatter(data['MPG-H'],data['Price'])

In [None]:
#plotting highway mileage against price range

plt.figure(figsize=(8,8))
sns.violinplot(x=data['MPG-H'],y=data['Price Range'])

In this violin plot we observe that the median highway mileage decreases as the price increases.

In [None]:
#plotting city mileage against price
plt.figure(figsize=(8,8))
plt.scatter(data['MPG-C'],data['Price'])

In [None]:
#plotting city mileage against price range
plt.figure(figsize=(8,8))
sns.violinplot(x=data['MPG-C'],y=data['Price Range'])

In [None]:
#plotting popularity against prices

plt.figure(figsize=(8,8))
plt.scatter(data['Popularity'],data['Price'])

In [None]:
#plotting popularity against price range

plt.figure(figsize=(8,8))
sns.violinplot(x=data['Popularity'],y=data['Price Range'])

From the above plot we can see that as the price increases, the median popularity decreases.

In [None]:
#plotting transmission mode against price
plt.figure(figsize=(8,8))
plt.scatter(data['Transmission'],data['Price'])

In [None]:
#plotting the price distribution for each transmission mode
plt.figure(figsize=(8,8))
sns.violinplot(x=data['Transmission'],y=data['Price'])

From the above violin plot we can see that the median price of automatic transmission models is slightly higher than that of manual transmission models.

In [None]:
#plotting vehicle size against price
plt.figure(figsize=(8,8))
plt.scatter(data['Vehicle Size'],data['Price'])

In [None]:
#plotting drive mode against price
plt.figure(figsize=(8,8))
plt.scatter(data['Drive Mode'],data['Price'])

In [None]:
#plotting brands against price
plt.figure(figsize=(40,40))
sns.swarmplot(x=data['Make'],y=data['Price'])

In [None]:
#plotting vehicle style against price
plt.figure(figsize=(30,30))
sns.swarmplot(x=data['Vehicle Style'],y=data['Price'])

From all the above plots we conclude that the major determinants of the price of a car are Year, HP, highway mileage and popularity.

Let's do some data preprocessing, such as scaling and splitting into a train set and a test set.

In [None]:
#Taking values
x=data[['Year','HP','MPG-H','Popularity']].values
y=data['Price'].values
x,y

In [None]:
#scaling values
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x=scaler.fit_transform(x)
x

In [55]:
#Splitting into train set and test set
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
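
The cells above scale the full feature matrix before splitting it, which means the scaler also sees the test rows. A common alternative, sketched below under the assumption that this matters for the analysis, is to split first and put the scaler inside a scikit-learn pipeline so it is fitted on the training rows only. The data here is synthetic (`X_demo`, `y_demo` are made up), not the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four features and the price (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on X_tr only and reuses it for X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)  # R2 on the held-out rows
print(round(score, 3))
```

For standard scaling the difference in scores is usually small, but the pipeline form keeps the evaluation honest and carries over unchanged to the other regressors.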

Let us apply various machine learning models and see results.

First let us apply multivariable linear regression model and check the results.

In [56]:
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

Let us check various metrics for this linear regression model: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R2 score.

In [None]:
#checking various parameters
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
mean_price=data['Price'].mean() #mean price
mae=mean_absolute_error(y_test,y_pred) #mean absolute error
mse=mean_squared_error(y_test,y_pred)  #mean squared error
rmse=np.sqrt(mse)  #root mean square error
r2=r2_score(y_test,y_pred)  #r2 score
mae,mse,rmse,mean_price,r2


Here we see that the linear regression model gives us an r2 score of about 0.68 and the root mean square error is more than 10 percent of the mean price. Hence linear regression is not a suitable algorithm in this case.

Let's try out the KNN algorithm.
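
Since the same metrics are computed for every model, they can be wrapped in a small helper that also reports the RMSE as a percentage of the mean, which is the rule of thumb used in this notebook. The function name `regression_report` and the sample numbers are hypothetical; this sketch divides by the mean of the true values passed in, whereas the notebook compares against the mean of the whole `Price` column.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and RMSE as a percentage of the mean target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": rmse,
        "r2": r2_score(y_true, y_pred),
        "rmse_pct_of_mean": 100 * rmse / y_true.mean(),
    }

# Made-up prices and predictions, only to show the output format.
report = regression_report([20000, 30000, 40000], [22000, 29000, 41000])
print(report)
```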

In [117]:
#K-Nearest Neighbors
from sklearn.neighbors import KNeighborsRegressor
model=KNeighborsRegressor(n_neighbors=3)
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

In [None]:
#checking various parameters
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
mae=mean_absolute_error(y_test,y_pred) #mean absolute error
mse=mean_squared_error(y_test,y_pred)  #mean squared error
rmse=np.sqrt(mse)  #root mean square error
r2=r2_score(y_test,y_pred)  #r2 score
mae,mse,rmse,mean_price,r2

The KNN algorithm gives us an r2 score of 0.9, which is fairly good. However, the root mean squared error is still more than 10 percent of the mean.

Next let us try out a decision tree and see.

In [61]:
#Decision tree
from sklearn.tree import DecisionTreeRegressor
model=DecisionTreeRegressor()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

In [None]:
#checking various parameters
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
mae=mean_absolute_error(y_test,y_pred) #mean absolute error
mse=mean_squared_error(y_test,y_pred)  #mean squared error
rmse=np.sqrt(mse)  #root mean square error
r2=r2_score(y_test,y_pred)  #r2 score
mae,mse,rmse,mean_price,r2

We see that the decision tree gives us an r2 score of about 0.92. However, the root mean square error is still a little above 10 percent of the mean, so we could try the Random Forest algorithm.

In [91]:
#Random Forest
from sklearn.ensemble import RandomForestRegressor
model=RandomForestRegressor(n_estimators=100)
model.fit(x_train,y_train)
y_pred=model.predict(x_test)

In [None]:
#checking various parameters
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
mae=mean_absolute_error(y_test,y_pred) #mean absolute error
mse=mean_squared_error(y_test,y_pred)  #mean squared error
rmse=np.sqrt(mse)  #root mean square error
r2=r2_score(y_test,y_pred)  #r2 score
mae,mse,rmse,mean_price,r2

We saw that the random forest algorithm with 100 trees gave us an r2 score of about 0.93. However, the root mean square error is above 10 percent of the mean.

Out of the four algorithms, Random Forest gives the highest r2 score of 0.93 and the lowest mean absolute, mean squared and root mean square errors. Hence Random Forest is the most suitable algorithm in this case.
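
The four fit-and-score cells follow the same pattern, so the comparison can be sketched as a single loop. The example below runs on synthetic data from `make_regression` rather than the car dataset, so its scores are not comparable with the ones reported above; the fixed `random_state` values are an added assumption to make the run repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the four car features.
X_syn, y_syn = make_regression(n_samples=300, n_features=4, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Same four regressors as in the notebook, with the same main settings.
models = {
    "Linear regression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=3),
    "Decision tree": DecisionTreeRegressor(random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training rows and score it on the held-out rows.
results = {}
for name, regressor in models.items():
    regressor.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, regressor.predict(X_te))

for name, score in results.items():
    print(f"{name}: r2 = {score:.3f}")
```

A single split can favour one model by chance, so repeating the comparison with cross-validation would give a steadier ranking.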