# Car Price Prediction - EDA


### Exploratory Data Analysis

In this notebook we analyse the car dataset, then clean and process it with ***NumPy*** and ***Pandas*** so that it can later be used to build a model.
We also explore how the independent variables relate to our target variable using the ***seaborn*** and ***matplotlib.pyplot*** libraries.

In [None]:
#importing essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Step 1: Reading and Understanding the Data
 
In this step we try to understand the columns of the dataset and their data types, and divide them into categorical and numerical variables.

In [None]:
cars = pd.read_csv('../input/CarPrice_Assignment.csv')
cars.head()

In [None]:
cars.shape

The dataset has 205 rows and 26 columns.

In [None]:
cars.describe()

In [None]:
cars.info()

It looks like there are no null values in our data, so we can move on to data cleaning and wrangling. There are 8 columns of float data type, 8 columns of int data type and 10 columns of object data type.
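
The same two checks can also be made explicit instead of being read off the `info()` output. The sketch below uses a small hypothetical dataframe (`toy`, with made-up rows) in place of `cars`; the names `null_counts` and `dtype_counts` are ours.

```python
import pandas as pd

# Hypothetical miniature stand-in for the cars dataframe (made-up rows)
toy = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "CarName": ["audi 100 ls", "bmw 320i", "toyota corolla"],
    "wheelbase": [99.8, 101.2, 95.7],
    "price": [13950.0, 16430.0, 6938.0],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Number of columns of each data type
dtype_counts = toy.dtypes.astype(str).value_counts()

print(null_counts)
print(dtype_counts)
```

Running the same two lines on `cars` should give the per-column null counts and the column count per data type directly.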

### Step 2 : Data Cleaning and Wrangling

In [None]:
#Splitting company name from CarName column
CompanyName = cars['CarName'].apply(lambda x : x.split(' ')[0])  #the first word of CarName is the company name
cars.insert(3,"CompanyName",CompanyName)  #inserting a new column named CompanyName into the cars dataframe
cars.drop(['CarName'],axis=1,inplace=True)  #dropping the old column
cars.head()

In [None]:
cars['CompanyName'] = cars['CompanyName'].str.capitalize()
cars['CompanyName'].unique()

Removing inconsistent data entries in the CompanyName column

The following company names have been entered incorrectly:
* mazda as maxda
* porsche as porcshce
* toyota as toyouta
* volkswagen as vokswagen, vw



In [None]:
#Mapping misspelled company names to their correct spelling
name_fixes = {'Maxda':'Mazda', 'Porcshce':'Porsche', 'Toyouta':'Toyota',
              'Vokswagen':'Volkswagen', 'Vw':'Volkswagen'}
cars['CompanyName'] = cars['CompanyName'].replace(name_fixes)  #assigning back avoids in-place replace on an attribute access

cars.CompanyName.unique()

In [None]:
#Checking for duplicates
cars.loc[cars.duplicated()]

### Step 3: Visualizing the data


In [None]:
plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
plt.title('Car Price Distribution Plot')
sns.histplot(cars.price, kde=True)  #distplot is deprecated in recent seaborn releases

plt.subplot(1,2,2)
plt.title('Car Price Spread')
sns.boxplot(y=cars.price)

plt.show()

In [None]:
print(cars.price.describe(percentiles = [0.25,0.50,0.75,0.85,0.90,1]))

#### Inference :

1. The distribution is right-skewed: most prices in the dataset are low (below 15,000).
2. There is a significant difference between the mean and the median of the price distribution.
3. The data points are spread far from the mean, which indicates a high variance in car prices (85% of the prices are below 18,500, whereas the remaining 15% are between 18,500 and 45,400).
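
These observations can be backed with numbers as well as plots. The sketch below uses a hypothetical right-skewed price series (made-up values, not taken from the dataset) to show the quantities involved: mean, median, sample skewness and the 85th percentile.

```python
import pandas as pd

# Hypothetical right-skewed prices (made-up values for illustration)
prices = pd.Series([5500, 6200, 7000, 7800, 8500, 9900, 12000, 16500, 31000, 45000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()       # a positive value indicates a longer right tail
p85 = prices.quantile(0.85)    # 85% of the values lie at or below this price

print(f"mean={mean_price:.0f}, median={median_price:.0f}, "
      f"skew={skewness:.2f}, p85={p85:.0f}")
```

A mean well above the median together with a positive skewness is what we would expect to see when the same calls are applied to `cars.price`.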

#### Step 3.1 : Visualising Categorical Data

    - CompanyName
    - Symboling
    - fueltype
    - enginetype
    - carbody
    - doornumber
    - enginelocation
    - fuelsystem
    - cylindernumber
    - aspiration
    - drivewheel

In [None]:

def price_comp(a):
    fig = plt.figure(figsize=(10,3))
    s1=fig.add_subplot(1,2,1)
    s2=fig.add_subplot(1,2,2)


    df=cars[a].value_counts()
    df.plot(kind='bar',ax=s1)
    s1.set_title(a+' Histogram')
    s1.set_xlabel(a)


    df = pd.DataFrame(cars.groupby([a])['price'].mean().sort_values(ascending = False))
    df.plot(kind='bar',ax=s2)
    s2.set_title(a+' vs Average Price')
    plt.show()

price_comp('CompanyName')
price_comp('fueltype')
price_comp('enginetype')
price_comp('carbody')
price_comp('doornumber')
price_comp('enginelocation')
price_comp('fuelsystem')
price_comp('cylindernumber')
price_comp('drivewheel')


Insight:
* Toyota is the most common car company in the dataset.
* Gas is the most common fuel type, and the average price of gas vehicles is also lower.
* Ohc is the most common engine type, and ohc vehicles have the lowest average price of all engine types.
* Hardtop and convertible vehicles are costlier than the other body types.
* Cars with the engine in the rear cost, on average, more than double the cars with the engine in front.
* Four-cylinder cars are the most common; eight- and twelve-cylinder cars are the costliest.
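
The counts and average prices behind these insights can also be read from a single table rather than from two bar charts. A minimal sketch, assuming a small hypothetical dataframe (`toy`, made-up rows) in place of `cars`:

```python
import pandas as pd

# Hypothetical rows standing in for the cars dataframe
toy = pd.DataFrame({
    "fueltype": ["gas", "gas", "gas", "diesel", "diesel"],
    "price": [7000.0, 9000.0, 11000.0, 15000.0, 17000.0],
})

# One row per category: how many cars it has and their average price
summary = (
    toy.groupby("fueltype")["price"]
       .agg(count="count", avg_price="mean")
       .sort_values("avg_price", ascending=False)
)

print(summary)
```

Replacing `toy` with `cars` and `"fueltype"` with any of the categorical columns gives the numeric counterpart of each `price_comp` plot.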

#### Step 3.2 : Visualising numerical data

In [None]:
def scatter(x,fig):
    plt.subplot(2,2,fig)
    plt.scatter(cars[x],cars['price'])
    plt.title(x+' vs Price')
    plt.ylabel('Price')
    plt.xlabel(x)

plt.figure(figsize=(10,8))

scatter('carlength', 1)
scatter('carwidth', 2)
scatter('carheight', 3)
scatter('curbweight', 4)

plt.tight_layout()

#### Inference :

1. `carwidth`, `carlength` and `curbweight` seem to have a positive correlation with `price`.
2. `carheight` doesn't show any significant trend with price.

In [None]:
def pp(x,y,z):
    sns.pairplot(cars, x_vars=[x,y,z], y_vars='price', height=4, aspect=1, kind='scatter')  #'size' was renamed to 'height' in seaborn
    plt.show()

pp('enginesize', 'boreratio', 'stroke')
pp('compressionratio', 'horsepower', 'peakrpm')
pp('wheelbase', 'citympg', 'highwaympg')

#### Inference :

1. `enginesize`, `boreratio`, `horsepower`, `wheelbase` - seem to have a significant positive correlation with price.
2. `citympg`, `highwaympg` - seem to have a significant negative correlation with price.
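
The sign of each relationship can be checked numerically with Pearson correlation coefficients. The sketch below uses a hypothetical three-column dataframe (`toy`, made-up values) in place of `cars`.

```python
import pandas as pd

# Hypothetical values standing in for three columns of the cars dataframe
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 200],
    "citympg":    [38, 31, 27, 24, 17],
    "price":      [6500, 9000, 12500, 16000, 30000],
})

# Pearson correlation of every feature with price, most negative first
corr_with_price = toy.corr()["price"].drop("price").sort_values()

print(corr_with_price)
```

On the real data the same expression can be applied to the numeric columns only, for example via `cars.select_dtypes('number')`.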

In [None]:
np.corrcoef(cars['carlength'], cars['carwidth'])[0, 1]

### Step 4 : Data Normalisation and Binning

In [None]:
#Binning each car into a price range based on its price.
cars['price'] = cars['price'].astype('int')
bins = [0,10000,20000,np.inf]  #open-ended top bin so prices above 40,000 are not left as NaN
cars_bin=['Budget','Medium','Highend']
cars['carsrange_binned'] = pd.cut(cars['price'],bins,right=False,labels=cars_bin)
cars.head()
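
How `pd.cut` treats the bin edges matters for this step. With `right=False` each bin includes its left edge and excludes its right edge, and any value at or beyond the last edge is assigned `NaN`. The sketch below compares a finite top edge with an open-ended one on a few hypothetical prices.

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the bin edges
prices = pd.Series([9999, 10000, 19999, 20000, 45400])
labels = ["Budget", "Medium", "Highend"]

# Finite top edge: anything >= 40000 falls outside every bin
closed_top = pd.cut(prices, [0, 10000, 20000, 40000], right=False, labels=labels)

# Open-ended top edge: every non-negative price gets a label
open_top = pd.cut(prices, [0, 10000, 20000, float("inf")], right=False, labels=labels)

print(pd.DataFrame({"price": prices, "closed_top": closed_top, "open_top": open_top}))
```

Checking `isna().sum()` on the binned column is a quick way to confirm that no price fell outside the bins.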

In [None]:
def norm_data(var):
    cars[var]=cars[var]/cars[var].max()
norm_data('wheelbase')
norm_data('carlength')
norm_data('carwidth')
norm_data('carheight')
norm_data('curbweight')
norm_data('enginesize')
norm_data('boreratio')
norm_data('stroke')
norm_data('compressionratio')
norm_data('horsepower')
norm_data('peakrpm')
norm_data('citympg')
norm_data('highwaympg')
norm_data('price')
cars.head()
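
Note that `norm_data` divides each column by its maximum, which is not the same as min-max scaling: the largest value becomes 1, but the smallest value does not become 0 unless it was 0 to begin with. A minimal sketch on a hypothetical column to show the difference:

```python
import pandas as pd

# Hypothetical horsepower values
toy = pd.DataFrame({"horsepower": [50.0, 100.0, 200.0]})
col = toy["horsepower"]

# Max scaling, as done by norm_data: x / max(x)
max_scaled = col / col.max()

# Min-max scaling, shown only for comparison: (x - min) / (max - min)
min_max_scaled = (col - col.min()) / (col.max() - col.min())

print(pd.DataFrame({"max_scaled": max_scaled, "min_max_scaled": min_max_scaled}))
```

Either choice keeps the ordering of the values; they differ only in where the smallest value lands.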

### List of significant variables after Visual analysis :

    - Car Range 
    - Engine Type 
    - Fuel type 
    - Car Body 
    - Aspiration 
    - Cylinder Number 
    - Drivewheel 
    - Curbweight 
    - Car Length
    - Car width
    - Engine Size 
    - Boreratio 
    - Horse Power 
    - Wheel base 
    - Fuel Economy 

In [None]:
cars_new = cars[['price', 'fueltype', 'aspiration','carbody', 'drivewheel','wheelbase',
                  'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'boreratio','horsepower', 
                     'carlength','carwidth', 'carsrange_binned']]
cars_new.head()

In [None]:
sns.pairplot(cars_new)
plt.show()

### Step 5 : Dummy Variables

In [None]:
def dummies(x,df):
    temp = pd.get_dummies(df[x], drop_first = True)
    df = pd.concat([df, temp], axis = 1)
    df.drop([x], axis = 1, inplace = True)
    return df

cars_new = dummies('fueltype',cars_new)
cars_new = dummies('aspiration',cars_new)
cars_new = dummies('carbody',cars_new)
cars_new = dummies('drivewheel',cars_new)
cars_new = dummies('enginetype',cars_new)
cars_new = dummies('cylindernumber',cars_new)
cars_new = dummies('carsrange_binned',cars_new)

In [None]:
cars_new.head()
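
To see what the `dummies` helper does to a single column, here is a minimal sketch on a hypothetical dataframe (`toy`, made-up rows). With `drop_first=True` a column with k categories becomes k-1 indicator columns, and the dropped category is represented by all zeros. The `dtype=int` argument is our addition so that the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical rows standing in for cars_new
toy = pd.DataFrame({
    "drivewheel": ["fwd", "rwd", "4wd", "fwd"],
    "price": [0.2, 0.6, 0.3, 0.25],
})

# Three categories (4wd, fwd, rwd) become two indicator columns (fwd, rwd)
indicator_cols = pd.get_dummies(toy["drivewheel"], drop_first=True, dtype=int)
encoded = pd.concat([toy.drop(columns="drivewheel"), indicator_cols], axis=1)

print(encoded)
```

The row with `4wd` has 0 in both `fwd` and `rwd`, which is how the dropped category stays recoverable.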