
# Exploratory data analysis in Python.
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# EDA STARTS NOW
Multiple libraries are available to perform basic EDA, but I am going to use pandas and matplotlib for this post: pandas for data manipulation and matplotlib, well, for plotting graphs. I use Jupyter Notebooks to write the code and record the findings. A Jupyter notebook is a kind of diary for data analysts and scientists, a web-based platform where you can mix Python, HTML and Markdown to explain your data insights.

# 1. Import the dataset and the necessary libraries, check datatype, statistical summary, shape, null values etc.
Since the data set was already in CSV format, all I had to do was load it into a pandas data frame. This is done with the pandas function read_csv, which takes the filename as an argument and returns the contents of the CSV file as a neatly organized data frame.
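As a minimal sketch of what read_csv does, the example below parses a few rows from an in-memory string instead of a file. The rows and the variable names `csv_text` and `cars` are made up for illustration and are not taken from the real data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real data.csv file
csv_text = """Make,Year,MSRP
BMW,2011,46135
Audi,2015,31200
Ford,2017,25170
"""

# read_csv accepts a file path or any file-like object
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)  # (number of rows, number of columns)
print(cars.head())
```

The first line of the CSV becomes the column names, and each column gets a data type inferred from its values.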


# 1.1 importing the required libraries for EDA

In [None]:
# 1.1 Importing the necessary libraries for EDA
import pandas as pd
import numpy as np
import seaborn as sns #visualising
import matplotlib.pyplot as plt #visualising 
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
%matplotlib inline 
sns.set(color_codes=True)



# 1.1 Loading the data into a data frame.

In [None]:
#1.2 Reading the data 
df = pd.read_csv("data.csv")
# To display the top 5 rows
df.head(5)


In [None]:
# To display the bottom 5 rows
df.tail(5)

## 1.2 checking the types of data and basics summary stats
We usually check the data types because the MSRP, or price of the car, is sometimes stored as a string. In that case we have to convert the string to a numeric type before we can plot the data. Here the price is already stored as an integer, so there is nothing to worry about.
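If the price had been stored as text, a conversion along the following lines could be used. This is only a sketch: the `$` and `,` formatting of the strings is an assumption made for illustration, not something observed in this dataset.

```python
import pandas as pd

# Hypothetical prices stored as formatted strings
prices = pd.DataFrame({"MSRP": ["$46,135", "$31,200", "$25,170"]})

# Strip the currency symbol and the thousands separator, then parse as numbers
cleaned = (
    prices["MSRP"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
prices["MSRP"] = pd.to_numeric(cleaned)

print(prices.dtypes)
```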

In [None]:
# Checking the data type
df.dtypes

In [None]:
df.describe()

# 2. Dropping irrelevant columns

This step is often needed in EDA because a data set can contain columns that we never use, and dropping them keeps the analysis focused. In this case, columns such as Engine Fuel Type and Number of Doors may not be very relevant. Here they have already been dropped, so there is nothing to do.
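For reference, dropping columns could look like the sketch below. It uses a small made-up data frame, and the exact column names are assumptions based on the names mentioned above.

```python
import pandas as pd

# Made-up rows; only the column names matter for this illustration
cars = pd.DataFrame({
    "Make": ["BMW", "Audi"],
    "Engine Fuel Type": ["premium unleaded", "regular unleaded"],
    "Number of Doors": [2, 4],
    "MSRP": [46135, 31200],
})

unused = ["Engine Fuel Type", "Number of Doors"]

# errors="ignore" skips any name that is already absent instead of raising
cars = cars.drop(columns=unused, errors="ignore")

print(cars.columns.tolist())
```

Passing errors="ignore" makes the cell safe to re-run after the columns are gone.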

# 3. Renaming the columns

Sometimes column names can be confusing or hard to read, so it's good practice to rename them, as it improves the readability of the data set.

In [None]:
#Renaming the Columns
df.rename(columns={'Engine HP':'HP','Engine Cylinders':'Cylinders','Transmission Type':'Transmission','Driven_Wheels':'Drive Mode','highway MPG':'MPG-H','city mpg':'MPG-C','MSRP':'MRP'},inplace = True)

In [None]:
# Total number of rows and columns
df.shape

In [None]:
df.head(5)

# 4. Dropping the duplicate rows
This is often a handy thing to do because a large data set, such as this one with more than 10,000 rows, often contains some duplicate data. Here we remove all the duplicate rows from the data set.

In [None]:
# Rows containing duplicate data
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)
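The difference between flagging and removing duplicates can be seen on a tiny example. The data frame below is invented for illustration: duplicated() marks a row only when every column matches an earlier row, and drop_duplicates() keeps the first occurrence.

```python
import pandas as pd

# Made-up rows: the second row repeats the first, the fourth differs in Year
cars = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Year": [2011, 2011, 2015, 2012],
    "MSRP": [46135, 46135, 31200, 46135],
})

# Number of rows that are exact copies of an earlier row
n_dupes = int(cars.duplicated().sum())

# Keep the first occurrence of each distinct row
deduped = cars.drop_duplicates()

print(n_dupes, len(cars), len(deduped))
```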

In [None]:
# Count the rows before removing the duplicates, drop them, then count again
print(df.count())
df = df.drop_duplicates()
df.count()

In [None]:
# Finding the null values.
print(df.isnull().sum())

# 4.1 Dropping the missing or null values.
This is similar to the previous step, except that here the missing values are detected and then dropped. This is not always the best approach: people often replace the missing values with the mean of the column instead, which keeps more rows, and a model generally performs better the more data it has. But the objective of this project is EDA, and the number of missing values is small compared to the entire dataset, so it is fine to drop them.
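The two options can be compared side by side on a small example. This is a sketch with invented values, not a result from the car dataset: dropna() discards any row with a missing value, while fillna() with the column means keeps every row.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
cars = pd.DataFrame({
    "HP": [300.0, np.nan, 200.0, 250.0],
    "Cylinders": [6.0, 4.0, np.nan, 6.0],
})

# Option 1: drop every row that has at least one missing value
dropped = cars.dropna()

# Option 2: replace each missing value with the mean of its column
imputed = cars.fillna(cars.mean())

print(len(dropped), len(imputed))
print(imputed)
```

Imputation keeps all four rows here, whereas dropping leaves only two, which is why it matters more when many rows are affected.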

In [None]:
# Dropping the missing values.
df = df.dropna() 
df.count()

In [None]:
# After dropping the values
print(df.isnull().sum()) 


# 5. Detecting Outliers
An outlier is a point or set of points that differs markedly from the other points, taking very high or very low values. It is often a good idea to detect and remove outliers because they are one of the primary reasons for a less accurate model. Outliers can often be seen in a box plot. Shown below are the box plots of MRP, HP, Cylinders, MPG-C, MPG-H and Popularity. In each plot, the points that lie outside the whiskers are the outliers.

In [None]:
#Plotting Graphs of Data(Columns)
sns.boxplot(x=df['MRP'])

In [None]:
sns.boxplot(x=df['HP'])

In [None]:
sns.boxplot(x=df['Cylinders'])

In [None]:
sns.boxplot(x=df['MPG-C'])

In [None]:
sns.boxplot(x=df['MPG-H'])

In [None]:
sns.boxplot(x=df['Popularity'])

In [None]:
#Finding IQR
# Quartiles only make sense for the numeric columns
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)
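To make the role of the interquartile range concrete, here is a sketch of the common 1.5 × IQR rule applied to a single column. The horsepower values are invented, and the name `hp` is used only for this example.

```python
import pandas as pd

# Made-up horsepower values with one obviously extreme entry
hp = pd.Series([150, 160, 170, 180, 190, 200, 1000], name="HP")

q1 = hp.quantile(0.25)
q3 = hp.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (hp < lower) | (hp > upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(hp[is_outlier].tolist())
```

These fences are the same limits a box plot uses for its whiskers by default, so the flagged points correspond to the dots drawn beyond the whiskers.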


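In [None]:
# Removing outliers: keep rows whose numeric values all lie within 1.5 * IQR of the quartiles
numeric = df[Q1.index]
df = df[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

# 6. What car brands are the most represented in the dataset, and what is the average price among the top brands?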

In [None]:
# Percentage of cars per brand
counts = df['Make'].value_counts(normalize=True) * 100

# Top 10 car brands
popular_labels = counts.index[:10]
# Plot
plt.figure(figsize=(10,5))
plt.barh(popular_labels, width=counts[:10])
plt.title('Top 10 Car brands')
plt.show()

In [None]:
# Average price for the most represented brands
top_brands = ['Chevrolet', 'Ford', 'Volkswagen', 'Toyota', 'Dodge',
              'Nissan', 'GMC', 'Honda', 'Mazda']
prices = df.loc[df['Make'].isin(top_brands), ['Make', 'MRP']].groupby('Make').mean()
print(prices)

# 7. Correlation matrix

In [None]:
# Correlation is only defined for the numeric columns
df.select_dtypes(include=[np.number]).corr()

## High correlation
Cylinders and HP are highly correlated: the more cylinders there are, the more powerful the car is. Highway MPG and city MPG are also highly correlated with each other.

## High anticorrelation
Cylinders have a strong negative correlation with both highway and city MPG: cars with more cylinders tend to have lower MPG figures, and lower MPG means higher fuel consumption.

In [None]:
#Plotting Correlation Matrix
corrMatrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corrMatrix, annot=True)

From the heatmap plotted above, it can be concluded that:
>>> Price (MRP) is positively correlated with horsepower (HP) and Year

>>> The features HP and Cylinders are positively correlated with each other

>>> MPG-H and MPG-C have a strong negative correlation with Cylinders.
Simply put, as the number of cylinders increases, MPG-H and MPG-C decrease.
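Reading the strongest relationships off a heatmap by eye can be backed up by ranking the pairs numerically. The sketch below does this on a small invented data frame; the values are made up and only the column names echo the ones used in this notebook.

```python
from itertools import combinations

import pandas as pd

# Made-up values chosen so that the relationships are easy to see
cars = pd.DataFrame({
    "Cylinders": [4, 4, 6, 6, 8, 8],
    "HP": [150, 170, 250, 280, 400, 420],
    "MPG-C": [30, 28, 20, 19, 14, 13],
})

corr = cars.corr()

# Collect each pair of columns once and sort by absolute correlation
pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)

for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which is useful because both are informative.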

# 8. Plotting different graphs.

In [None]:
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=df['Cylinders'], y=df['MRP'])

In [None]:
sns.barplot(x=df['HP'], y=df['MRP'])

In [None]:
sns.barplot(x=df['MPG-C'], y=df['MRP'])

In [None]:
sns.barplot(x=df['MPG-H'], y=df['MRP'])

In [None]:
sns.barplot(x=df['Popularity'], y=df['MRP'])

In [None]:
sns.barplot(x=df['Year'], y=df['MRP'])

In [None]:
dcc = df.select_dtypes(exclude=[np.number]).columns
dcc

In [None]:
from sklearn.preprocessing import LabelEncoder
#Creating the object instance
label_enc = LabelEncoder()
for i in dcc:
  df[i] = label_enc.fit_transform(df[i])
print('Label Encoded Data')
df.head()  

In [None]:
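# Setting the target value and the feature matrix
y = df['MRP']
# Leave the target out of the features so the model cannot simply read the price back
X = df.drop(columns=['MRP'])

In [None]:
# create training and testing vars

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)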
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
#Fit in a model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

In [None]:
#Plotting the Predictions
ax = sns.scatterplot(x=y_test, y=predictions)
ax.set(xlabel = "True Values", ylabel = "Predictions")

In [None]:
# score() returns the coefficient of determination (R^2) on the test set
print('Score:', model.score(X_test, y_test))