# Exploratory Data Analysis of Car Features

# Context
As a data scientist, you will spend most of your time on data pre-processing, i.e.
making sure you have the right data in the right format. Once this is done, you get a
sense of the dataset by applying descriptive statistics, and then you move
on to the exploration stage, where you plot various graphs and mine hidden
insights. In this project, you are expected to perform exploratory data
analysis on how the different features of a car relate to its price. The data comes
from the Kaggle dataset "Car Features and MSRP". It describes almost 12,000 car
models sold in the USA between 1990 and 2017, with the market price (new or used)
and some features.

# Objective
The objective of the project is to do data pre-processing and exploratory data analysis
of the dataset.


# Data Description
| Column | Description |
|---|---|
| Make | Car Make |
| Model | Car Model |
| Year | Car Year (Marketing) |
| Engine Fuel Type | Engine Fuel Type |
| Engine HP | Engine HorsePower (HP) |
| Engine Cylinders | Engine Cylinders |
| Transmission Type | Transmission Type |
| Driven_Wheels | Driven Wheels |
| Number of Doors | Number of Doors |
| Market Category | Market Category |
| Vehicle Size | Size of Vehicle |
| Vehicle Style | Type of Vehicle |
| highway MPG | Highway MPG |
| city mpg | City MPG |
| Popularity | Popularity (Twitter) |
| MSRP | Manufacturer Suggested Retail Price |
 
 
# Steps
1. Import the dataset and the necessary libraries; check the data types, statistical summary,
shape, null values, etc.
2. Are there any columns in the dataset that you think are of less relevance? If so, give
your reasoning and drop them.
3. Rename the columns: "Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission
Type": "Transmission", "Driven_Wheels": "Drive Mode", "highway MPG": "MPG-H", "city
mpg": "MPG-C", "MSRP": "Price".
4. Check for duplicates in the data, check for null values and missing data, and remove
them.
5. Plot graphs of various columns to check for outliers, and remove those data points from the
dataset.
6. Which car brands are the most represented in the dataset? Find the average price among
the top car brands.
7. Plot the correlation matrix and document your insights.
8. Perform EDA, plot different graphs, and document your findings (try to see how other
variables affect the price of the car).
9. (Extra credit) Split the dataset in an 80/20 ratio and build a machine learning model with
Price as the target variable.
10. (Extra credit) Try different algorithms, check their performance on metrics such as
R-squared, RMSE, and MAE, and document your findings.

# 1. Import the dataset and the necessary libraries, check datatype, statistical summary, shape, null values etc.


In [None]:
#Importing the required libraries

import pandas as pd
import numpy as np
import seaborn as sns            #visualization
import matplotlib.pyplot as plt  #visualization
%matplotlib inline

In [None]:
#Reading the dataset

car_df=pd.read_csv("../input/datacsv/data.csv")

In [None]:
#Printing the first five rows using the head() method

car_df.head()

In [None]:
#Printing the last five rows of the dataset using the tail() method

car_df.tail()

In [None]:
#Data Types

car_df.dtypes

In [None]:
#Shape of the dataset as (rows, columns)

car_df.shape

In [None]:
#NULL values

car_df.isnull().sum()

In [None]:
#printing all the information of the dataset to know how it is

car_df.info()

In [None]:
#describe() computes summary statistics such as count, mean, std, min, max and percentiles.
#By default it summarizes only the numeric columns of a mixed-type DataFrame; pass include='all' to cover object columns too.

car_df.describe()

# 2. Are there any columns in the dataset which you think are of less relevance? If so, give your reasoning and drop them.


In [None]:
#Based on the output of car_df.info(), drop the columns judged less relevant for this analysis

car_df=car_df.drop(['Engine Fuel Type','Number of Doors','Market Category'],axis=1)
car_df.head(5)

# 3. Rename the columns "Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode", "highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price"


In [None]:
#Renaming the columns with a single mapping

car_df=car_df.rename(columns={
    'Engine HP': 'HP',
    'Engine Cylinders': 'Cylinders',
    'Transmission Type': 'Transmission',
    'Driven_Wheels': 'Drive Mode',
    'highway MPG': 'MPG-H',
    'city mpg': 'MPG-C',
    'MSRP': 'Price',
})
car_df.info()

In [None]:
car_df.shape

# 4. Check for any duplicates in the data, check for null values and missing data and remove them.
 

In [None]:
duplicated_rows_car_df=car_df[car_df.duplicated()]
print("Number of duplicated rows:",duplicated_rows_car_df.shape[0])

In [None]:
car_df=car_df.drop_duplicates()
car_df.head(5)

In [None]:
#isnull().sum() gives the number of missing values in each column.

print(car_df.isnull().sum())

In [None]:
car_df=car_df.dropna() #dropping the missing values
car_df.count()
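Because `drop_duplicates()` and `dropna()` both remove rows, it helps to record how many rows each step costs. The sketch below illustrates this on a made-up frame (the values are not from the dataset); the same pattern would apply to `car_df`.

```python
import numpy as np
import pandas as pd

# Made-up rows: two exact duplicates and one row with a missing HP value
demo = pd.DataFrame({
    "Make": ["Ford", "Ford", "Audi", "BMW", "BMW"],
    "HP": [210.0, 210.0, np.nan, 300.0, 300.0],
    "Price": [29450, 29450, 40650, 36350, 36350],
})

n_start = len(demo)
deduped = demo.drop_duplicates()  # keeps the first copy of each duplicated row
cleaned = deduped.dropna()        # drops rows with at least one missing value

report = pd.Series({
    "start": n_start,
    "after drop_duplicates": len(deduped),
    "after dropna": len(cleaned),
})
print(report)
```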

# 5. Plot graphs of various columns to check for outliers and remove those data points from the dataset.
 

In [None]:
sns.boxplot(x=car_df['Price'])

In [None]:
sns.boxplot(x=car_df['HP'])

In [None]:
sns.boxplot(x=car_df['Cylinders'])

In [None]:
#quantile(q) returns the value at quantile q (0 <= q <= 1) for each column.
#numeric_only=True restricts the computation to the numeric columns,
#which recent pandas versions require for a mixed-type DataFrame.

Q1=car_df.quantile(0.25,numeric_only=True)
Q3=car_df.quantile(0.75,numeric_only=True)
IQR=Q3-Q1
print(IQR)

In [None]:
num_cols=Q1.index  #compare only the numeric columns against the IQR bounds
car_df=car_df[~((car_df[num_cols]<(Q1-1.5*IQR)) |(car_df[num_cols]>(Q3+1.5*IQR))).any(axis=1)]
car_df.shape
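The filter above applies the usual 1.5 × IQR rule: a value counts as an outlier if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A small worked example on a single made-up series may make the bounds concrete; `iqr_bounds` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Hypothetical helper: lower and upper fences of the k * IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values: eight ordinary points and one obvious outlier
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 100], dtype=float)

low, high = iqr_bounds(values)
kept = values[values.between(low, high)]  # between() is inclusive on both ends

print("fences:", low, high)
print("rows kept:", len(kept), "of", len(values))
```

Here Q1 = 14 and Q3 = 22, so IQR = 8 and the fences are 2 and 34; only the value 100 falls outside.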

# 6. Which car brands are the most represented in the dataset? Find the average price among the top car brands

In [None]:
#Percentage of cars per brand
counts=car_df['Make'].value_counts(normalize=True)*100


#The 10 most represented car brands
popular_labels=counts.index[:10]

#plot
plt.figure(figsize=(10,5))
plt.barh(popular_labels,width=counts[:10])
plt.title('TOP 10 CAR BRANDS')
plt.xlabel('% of cars')
plt.show()

In [None]:
#Average price for the most represented brands
top_brands=['Chevrolet','Ford','Volkswagen','Toyota','Dodge',
            'Nissan','GMC','Honda','Mazda']
prices=car_df.loc[car_df['Make'].isin(top_brands),['Make','Price']].groupby('Make').mean()
print(prices)
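The brand list does not have to be typed by hand: it can be taken from `value_counts()` so that it always matches the data after cleaning. A minimal sketch on made-up rows follows; `top_n` is an illustrative parameter, and in the notebook it would be 10 with `car_df` in place of `demo`.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Make": ["Ford", "Ford", "Ford", "Toyota", "Toyota", "Audi"],
    "Price": [20000, 30000, 40000, 25000, 45000, 50000],
})

top_n = 2  # small value to keep the demo readable
top_brands = demo["Make"].value_counts().head(top_n).index

avg_price = (
    demo[demo["Make"].isin(top_brands)]
    .groupby("Make")["Price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price)
```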

# 7. Plot the correlation matrix and document your insights.


In [None]:
car_df.select_dtypes(include='number').corr()  #correlation is defined only for the numeric columns

High positive correlation: Cylinders & HP, MPG-H & MPG-C

High negative correlation: Cylinders & MPG-H

In [None]:
plt.figure(figsize=(10,5))
c=car_df.select_dtypes(include='number').corr()
sns.heatmap(c,cmap="Blues",annot=True)
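Since Price is the variable of interest, it can be easier to read one sorted column of the matrix than the full heatmap. The sketch below ranks features by the absolute value of their correlation with Price; the data is made up for illustration, and in the notebook `demo` would be `car_df`.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({
    "Make": ["A", "B", "C", "D", "E"],  # non-numeric, excluded below
    "HP": [100, 150, 200, 250, 300],
    "MPG-C": [35, 30, 24, 20, 15],
    "Price": [15000, 22000, 31000, 38000, 50000],
})

corr_with_price = (
    demo.select_dtypes(include="number")
    .corr()["Price"]
    .drop("Price")                          # a variable's correlation with itself is always 1
    .sort_values(key=abs, ascending=False)  # strongest relationships first
)
print(corr_with_price)
```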

# 8. Perform EDA and plot different graphs and document your findings (Try to see how other variables affect the price of the car)

*SCATTER PLOT*

In [None]:
fig, ax=plt.subplots(figsize=(10,6))
ax.scatter(car_df['HP'],car_df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()

In [None]:
#Which vehicle style is the most common in the dataset

car_df['Vehicle Style'].value_counts().plot.bar(figsize=(10,6))
plt.title("CARS BY BODY TYPE")
plt.ylabel('Number of vehicles')
plt.xlabel('Body type');

# **Sedans are the most common body style, followed by 4dr SUVs**

In [None]:
sns.countplot(y='Vehicle Style',data=car_df,hue='Drive Mode')
plt.title("Vehicle Type v/s Drive Mode")
plt.ylabel('Vehicle Type')
plt.xlabel('Count of vehicle')

# making a new group "Price_group"

In [None]:
#creating a new column 'Price_group' and assigning its value based on the price of the car

car_df['Price_group']=pd.cut(car_df['Price'],[0,20000,40000,60000,80000,100000,600000],
                             labels=['<20K','20-39K','40-59K','60-79K','80-99K','>100K'],include_lowest=True)
car_df['Price_group']=car_df['Price_group'].astype(object)

In [None]:
(car_df['Price_group'].value_counts()/len(car_df)*100).plot.bar(figsize=(10,6))
plt.title("Price group bar diagram")
plt.ylabel('% of vehicles')
plt.xlabel('Price group')
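To see how a categorical variable relates to price, a grouped summary is a useful complement to the count plots above. The sketch below uses made-up rows and takes the median as the summary statistic, which is one reasonable choice rather than the notebook's own method; in the notebook `demo` would be `car_df`.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Drive Mode": ["front wheel drive", "front wheel drive",
                   "all wheel drive", "all wheel drive", "rear wheel drive"],
    "Price": [20000, 24000, 38000, 42000, 35000],
})

summary = (
    demo.groupby("Drive Mode")["Price"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The same call with `'Vehicle Style'`, `'Transmission'` or `'Vehicle Size'` as the grouping column would cover the other categorical features.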

# 9. (Extra Credits) Split the dataset into 80 and 20 ratio and build a machine learning model with Price as the target variable

# Implementing machine learning models, starting with linear regression, and predicting the values

In [None]:
X =car_df[['Popularity','Year','HP','Cylinders','MPG-H','MPG-C']].values
y=car_df['Price'].values

In [None]:
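#splitting the dataset into TRAINING set & TESTING set (80/20)
#The split comes first so that the scalers are fitted on the training data only

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)


In [None]:
#Feature scaling: fit on the training set, then apply the same transform to the test set

from sklearn.preprocessing import StandardScaler
sc_X=StandardScaler()
sc_y=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)
y_train=sc_y.fit_transform(y_train.reshape(-1,1))
y_test=sc_y.transform(y_test.reshape(-1,1))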


In [None]:
#Fitting multiple linear regression to the training set

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,y_train)

In [None]:
#here we  are predicting the TEST SET results

y_pred=regressor.predict(X_test)
plt.scatter(y_test,y_pred)

In [None]:
sns.histplot((y_test-y_pred).ravel(),bins=50,kde=True)  #distribution of residuals; distplot is deprecated

In [None]:
#y was standardized, so MAE and RMSE below are in standard-deviation units of Price, not dollars
from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Root mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print('R2 Score:',metrics.r2_score(y_test,y_pred))
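Because the target was passed through `StandardScaler`, error metrics computed on it are expressed in standard deviations of Price. To report them in the original units, predictions can be mapped back with `inverse_transform`. The sketch below shows this on synthetic data (an assumed linear relation between horsepower and price plus noise) and, purely to keep it short, evaluates on the same data it was fitted on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration: price is roughly 100 * HP plus noise
rng = np.random.default_rng(0)
hp = rng.uniform(100, 400, size=(200, 1))
price = 100 * hp + rng.normal(0, 2000, size=(200, 1))

sc_X, sc_y = StandardScaler(), StandardScaler()
X_s = sc_X.fit_transform(hp)
y_s = sc_y.fit_transform(price)

model = LinearRegression().fit(X_s, y_s)
pred_s = model.predict(X_s)  # predictions in standardized units

mae_scaled = mean_absolute_error(y_s, pred_s)
pred_dollars = sc_y.inverse_transform(pred_s)  # back to the original price scale
mae_dollars = mean_absolute_error(price, pred_dollars)

print("MAE (standardized):", mae_scaled)
print("MAE (original units):", mae_dollars)
```

The two values differ exactly by the factor `sc_y.scale_[0]`, the standard deviation the scaler learned for the target. R2 is unaffected by this rescaling.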

# 10. (Extra Credits) Try different algorithms and check their performance over metrics like R square, RMSE, MAE etc and document your findings

In [None]:
#fitting POLYNOMIAL REGRESSION to the training set

from sklearn.preprocessing import PolynomialFeatures
poly_reg=PolynomialFeatures(degree=4)
X_poly=poly_reg.fit_transform(X_train)  #expand the features into polynomial terms
lin_reg_2=LinearRegression()
lin_reg_2.fit(X_poly,y_train)

In [None]:
#predicting the test set results using the polynomial regression
#transform() reuses the expansion fitted on the training set

y_pred=lin_reg_2.predict(poly_reg.transform(X_test))
plt.scatter(y_test,y_pred)

In [None]:
sns.histplot((y_test-y_pred).ravel(),bins=50,kde=True)  #distribution of residuals; distplot is deprecated

In [None]:
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Root mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print('R2 Score:',metrics.r2_score(y_test,y_pred))

In [None]:
#fitting the SVR to the training set
#ravel() flattens the (n,1) target into the 1-D array that SVR expects

from sklearn.svm import SVR
regressor=SVR(kernel='rbf')
regressor.fit(X_train,y_train.ravel())

In [None]:
#predicting new result 

y_pred=regressor.predict(X_test)
plt.scatter(y_test,y_pred)

In [None]:
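#y_test has shape (n,1) and y_pred has shape (n,), so flatten y_test before subtracting
sns.histplot(y_test.ravel()-y_pred,bins=50,kde=True)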

In [None]:
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Root mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print('R2 Score:',metrics.r2_score(y_test,y_pred))

In [None]:
#fitting RANDOM FOREST REGRESSION to the training set
#ravel() flattens the (n,1) target into the 1-D array the regressor expects

from sklearn.ensemble import RandomForestRegressor
regressor=RandomForestRegressor(n_estimators=300,random_state=0)
regressor.fit(X_train,y_train.ravel())

In [None]:
y_pred=regressor.predict(X_test)
plt.scatter(y_test,y_pred)

In [None]:
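#y_test has shape (n,1) and y_pred has shape (n,), so flatten y_test before subtracting
sns.histplot(y_test.ravel()-y_pred,bins=50,kde=True)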

In [None]:
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Root mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print('R2 Score:',metrics.r2_score(y_test,y_pred))
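To document the findings, the metrics of all models can be collected into one table instead of being read off separate cells. The following is a sketch under stated assumptions: the data is synthetic (not the car dataset), the model settings are illustrative, and scaling is placed inside a pipeline so that it is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem, assumed for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "SVR (rbf)": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "Random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2": r2_score(y_test, pred),
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```

With the car data, `X` and `y` would be the feature matrix and Price defined in step 9, and the resulting table could be sorted by RMSE or R2 to compare the models.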