In [None]:
# importing pandas and numpy

import pandas as pd
import numpy as np

In [None]:
# Loading data 

data = pd.read_csv("/kaggle/input/vehicle-dataset-from-cardekho/Car details v3.csv")
display(data.head())

In [None]:
# Data has 8128 rows and 13 columns
data.shape

In [None]:
from pandas_profiling import ProfileReport

In [None]:
data.profile_report()

In [None]:
# show first 5 rows of data
data.head()

In [None]:
# remove the unnecessary name and torque columns from the dataset
data.drop(["name","torque"],axis=1,inplace=True)
data.head()

## Handling categorical variables (Part 1)

### 1. Label Encoding

One of the simplest and most commonly advertised ways to transform categorical variables is Label Encoding. __It consists of substituting each group with a corresponding number and keeping that numbering consistent throughout the feature__. This makes the models run, and it is one of the approaches most commonly used by aspiring Data Scientists. However, its simplicity comes with many issues.
Numbers hold relationships: encoding groups as 0, 1, 2 tells the model that the groups are ordered and that some are further apart than others, which is often not true of the underlying categories.
This is especially an issue for algorithms, such as K-Means, where a distance measure is calculated when running the model.
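
As a rough illustration of the problem, the sketch below label-encodes a small made-up column (the values are not taken from the dataset) with `pd.factorize` and shows the distances that the resulting codes imply.

```python
import pandas as pd

# Toy column for illustration only, not taken from the dataset
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Diesel", "Petrol"])

# pd.factorize assigns one integer per group, in order of first appearance
codes, groups = pd.factorize(fuel)

print(list(groups))
print(list(codes))

# The codes imply distances that have no meaning for fuel types:
# here "CNG" ends up twice as far from "Petrol" as "Diesel" is.
dist_petrol_diesel = abs(codes[0] - codes[1])
dist_petrol_cng = abs(codes[0] - codes[2])
print(dist_petrol_diesel, dist_petrol_cng)
```

A distance-based model would treat that gap as real information, even though it only reflects the order in which the groups happened to appear.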

### 2. One-Hot Encoding

__One-Hot Encoding__ is the most common, correct way to deal with __non-ordinal categorical data__. It consists of creating an additional feature for each group of the categorical feature and marking whether each observation belongs (__Value=1__) or not (__Value=0__) to that group. This approach encodes categorical features properly, despite some minor drawbacks. Specifically, a high number of binary values is not ideal for distance-based algorithms, such as Clustering models. In addition, __the high number of additionally generated features introduces the curse of dimensionality.__ Because the dataset now has many more dimensions, it becomes much more sparse. In Machine Learning problems you’d need at least a few samples for each feature combination, and increasing the number of features means that we might not have enough observations for each combination.
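
A minimal sketch of the idea on a made-up column (the frame `toy` is only for illustration) using `pd.get_dummies`, with and without `drop_first`:

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Diesel"]})

# One indicator column per group; each row has exactly one indicator set
one_hot = pd.get_dummies(toy["fuel"], prefix="fuel")
print(one_hot)

# drop_first=True removes the first group's column; a row with no
# indicator set then stands for that dropped group
reduced = pd.get_dummies(toy["fuel"], prefix="fuel", drop_first=True)
print(reduced)
```

Dropping one column keeps the same information with one feature fewer, which also avoids perfectly collinear indicators in linear models.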

### 3. Target Encoding

A lesser-known but very effective way of handling categorical variables is __Target Encoding__. It consists of substituting each group in a categorical feature with the __average response__ in the target variable. The process to obtain the Target Encoding is relatively straightforward and can be summarised as:
1. Group the data by category
2. Calculate the average of the target variable per each group
3. Assign the average to each observation belonging to that group
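
The three steps above can be sketched on a small made-up frame as follows. The column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame for illustration only; the prices are invented
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "Diesel", "CNG"],
    "selling_price": [100, 300, 200, 500, 150],
})

# Steps 1 and 2: group by category and average the target per group
group_means = toy.groupby("fuel")["selling_price"].mean()

# Step 3: assign the group average to every observation in that group
toy["fuel_encoded"] = toy["fuel"].map(group_means)
print(toy)
```

In a real modelling workflow the group averages would normally be computed on the training split only, so that the target values of the test rows do not leak into the feature.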

In [None]:
# Checking categorical variables

print(data["fuel"].unique())
print(data["seller_type"].unique())
print(data["transmission"].unique())


An efficient way to handle this type of categorical variable is __One-Hot Encoding__.

In [None]:
# Creating dummy variables for "fuel" and dropping the fuel column.

fuel_dummies = pd.get_dummies(data['fuel'], prefix='fuel_', drop_first = True)
data = pd.concat([data, fuel_dummies], axis = 1)
data.drop(['fuel'], inplace = True, axis = 1)

In [None]:
data.head()

In [None]:
# Creating dummy variables for "seller_type" and dropping the seller_type column.

seller_type_dummies = pd.get_dummies(data['seller_type'], prefix='seller_type', drop_first = True)
data = pd.concat([data, seller_type_dummies], axis = 1)
data.drop(['seller_type'], inplace = True, axis = 1)

In [None]:
data.head()

In [None]:
# Creating dummy variables for "transmission" and dropping the transmission column.

transmission_dummies = pd.get_dummies(data['transmission'], prefix='transmission_', drop_first = True)
data = pd.concat([data, transmission_dummies], axis = 1)
data.drop(['transmission'], inplace = True, axis = 1)

In [None]:
data.head()

In [None]:
# printing Unique values of owner 
print(data["owner"].unique())

Here we have 5 unique values for the owner variable. We can use label encoding, or simply replace the values with the appropriate numbers.

`{'First Owner': 0, 'Second Owner': 1, 'Third Owner': 2, 'Fourth & Above Owner': 3, 'Test Drive Car': 4}`
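
As an alternative sketch, the same mapping can be written as a dictionary and applied with `Series.map`. The series `toy_owner` below is made up for illustration.

```python
import pandas as pd

# Mapping from owner label to an integer code
owner_codes = {
    "First Owner": 0,
    "Second Owner": 1,
    "Third Owner": 2,
    "Fourth & Above Owner": 3,
    "Test Drive Car": 4,
}

# Toy series for illustration only
toy_owner = pd.Series(["First Owner", "Third Owner", "Test Drive Car", "Second Owner"])

# map() looks each label up in the dictionary; a label missing from the
# dictionary would become NaN, which makes unexpected values easy to spot
encoded = toy_owner.map(owner_codes)
print(encoded.tolist())
```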

In [None]:
data["owner"]= data['owner'].replace(['First Owner', 'Second Owner', 'Third Owner','Fourth & Above Owner','Test Drive Car'],[0,1,2,3,4])

In [None]:
data.head()

### Creating a new column with the age of the car in the current year

`data.year` is the year in which the car was sold.
* What we want is how old the car is in the current year.
* To calculate that, we need to subtract the year column from the current year.

`data["current_year"] - data["year"]`

In [None]:
# Creating a new column named current_year
data["current_year"] = 2021

In [None]:
data.head()

In [None]:
# Creating a new column which shows how old the car is.
data["car_age"] = data["current_year"] - data["year"]

In [None]:
# Dropping the year and current_year columns
data.drop(["year","current_year"], axis= 1, inplace=True)

In [None]:
data.head()

In [None]:
data.info()

### Handling Missing values



In [None]:
# Checking for missing values
data.isnull().sum()

Here we have missing values in `mileage`, `engine`, `max_power` and `seats`. Looking a little closer, the missing values fall in the same rows across these columns, which is why we simply remove those rows.
The number of rows with missing values compared to the total number of entries is __221 / 8128__, so removing them should not noticeably affect the model.
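
One way to check such a claim is to compare the rows where any of the columns is missing with the rows where all of them are missing. The sketch below does this on a small made-up frame with invented values; on the real `data` frame the same expressions would be applied to the four columns named above.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented
toy = pd.DataFrame({
    "mileage": ["23.4 kmph", np.nan, "17.7 kmph", np.nan],
    "engine": ["1248 CC", np.nan, "1497 CC", np.nan],
    "max_power": ["74 bhp", np.nan, "78 bhp", np.nan],
    "seats": [5.0, np.nan, 5.0, np.nan],
})
cols = ["mileage", "engine", "max_power", "seats"]

missing = toy[cols].isnull()
rows_any = missing.any(axis=1).sum()  # rows missing at least one value
rows_all = missing.all(axis=1).sum()  # rows missing every value

# If the two counts match, the gaps sit in exactly the same rows
print(rows_any, rows_all)

# Share of rows that dropna() would remove
share = rows_any / len(toy)
print(share)
```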

In [None]:
data.dropna(inplace=True)

In [None]:
# Checking for missing values
data.isnull().sum()

## Handling categorical variables (Part 2)



In [None]:
data.head()

The columns __`mileage`, `engine`__ and __`max_power`__ carry a unit suffix such as __`kmph`, `CC`__ and __`bhp`__ respectively, which has to be stripped before the values can be used as numbers.
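
As a vectorised alternative to looping over the rows, the numeric part can be pulled out with a regular expression and converted in one step. This is only a sketch on a made-up series; the last entry is a hypothetical value that has a suffix but no number.

```python
import pandas as pd

# Toy series for illustration only
toy_power = pd.Series(["74 bhp", "103.52 bhp", "bhp"])

# Capture the leading number (integer or decimal); entries without a
# number produce NaN instead of raising an error
numbers = toy_power.str.extract(r"(\d+(?:\.\d+)?)", expand=False)

# Convert the extracted strings to floats
values = pd.to_numeric(numbers, errors="coerce")
print(values)
```

Converting to a numeric dtype matters for later steps, because string columns cannot take part in correlation calculations.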

In [None]:
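# Strip the unit suffix from the mileage column, keeping only the number

mileage = []
for i in data.mileage:
    m = i.split(" ")      # split the string on the space " "
    mileage.append(m[0])  # we want only the first element (the number)

# Write the cleaned values back as numbers; without this step the
# mileage column would keep its original strings
data["mileage"] = pd.to_numeric(mileage, errors="coerce")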

In [None]:
data.head()

In [None]:
# do the same for the engine and max_power columns

engine = []
for i in data.engine:
    e = i.split(" ")      # split the string on the space " "
    engine.append(e[0])   # we want only the first element (the number)

max_power = []
for i in data.max_power:
    mp = i.split(" ")        # split the string on the space " "
    max_power.append(mp[0])  # we want only the first element (the number)

# Write the cleaned values back to the original columns as numbers
# (entries that cannot be parsed become NaN)
data["engine"] = pd.to_numeric(engine, errors="coerce")
data["max_power"] = pd.to_numeric(max_power, errors="coerce")


In [None]:
data.head()

## Visualization of data

In [None]:
import seaborn as sns
sns.pairplot(data)

In [None]:
import matplotlib.pyplot as plt

# get correlations between the numeric (and boolean dummy) features in the dataset
corrmat = data.select_dtypes(include=["number", "bool"]).corr()
plt.figure(figsize=(20,20))

# plot heat map
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

# End of the EDA Part.....