## Linear Regression to Predict Used Car Prices

#### Background & Context

There is a huge demand for used cars in the Indian market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to find a foothold in this market.

In 2018-19, new car sales were recorded at 3.6 million units, while around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), apart from dealership-level discounts that come into play only in the last stage of the customer journey, used cars are very different beasts, with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme for used cars becomes important for growing in this market.

#### Objective

1. Explore and visualize the dataset.
2. Build a linear regression model to predict the prices of used cars.
3. Generate a set of insights and recommendations that will help the business.


#### Data Dictionary 

    S.No. : Serial number
    Name : Name of the car, which includes the brand name and model name
    Location : The city in which the car is being sold or is available for purchase
    Year : Manufacturing year of the car
    Kilometers_Driven : The total kilometers driven in the car by the previous owner(s), in km
    Fuel_Type : The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
    Transmission : The type of transmission used by the car (Automatic / Manual)
    Owner_Type : Type of ownership
    Mileage : The standard mileage offered by the car company, in kmpl or km/kg
    Engine : The displacement volume of the engine, in CC
    Power : The maximum power of the engine, in bhp
    Seats : The number of seats in the car
    New_Price : The price of a new car of the same model, in INR Lakhs (1 Lakh = 100,000)
    Price : The price of the used car, in INR Lakhs (1 Lakh = 100,000)

#### Problem

    - Do the various predicting factors affect the price of a used car?
    - Which independent variables affect the pricing of used cars?
    - Does the name of a car have any effect on its price?
    - How does the type of transmission affect pricing?
    - Does the location in which the car is sold have any effect on the price?
    - Do Kilometers_Driven and year of manufacture have a negative correlation with the price of the car?
    - Do Mileage, Engine and Power have any impact on the price of the car?
    - How do the number of seats and the fuel type impact pricing?

## 1. Import necessary libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# To check model performance
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Show all columns and up to 200 rows when a dataframe is displayed
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)  # uncomment to remove the row limit entirely
pd.set_option('display.max_rows', 200)

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 2. Load data to pandas dataframe

In [None]:
# Read the data from the csv file into a pandas dataframe
df_main = pd.read_csv('../input/cars4u/used_cars_data.csv')
df_used_cars = df_main.copy()

#### 2.1. Explore the data

In [None]:
df_used_cars.sample(5).T

#### 2.2. Explore spread of the data

In [None]:
df_used_cars.describe().T

Price and Kilometers_Driven are both right-skewed.

#### 2.3. Rows and columns and data types

In [None]:
df_used_cars.shape

In [None]:
df_used_cars.dtypes

#### 2.4. Check null data

In [None]:
df_used_cars.isnull().sum()

#### Observation:
1. Mileage, Engine, Power and Seats have null values.
2. New_Price has a very large number of nulls. It is usually a key factor in determining a used car's price, so its imputation needs careful thought.
3. Price (the dependent variable) also has many null values.

## 3. Feature Engineering

#### 3.1. Extract Brand, Model Name, and Model Description from Name

In [None]:
df_used_cars['Brand'] = df_used_cars['Name'].str.split().str[0].str.upper()

In [None]:
df_used_cars['Brand'].unique()

In [None]:
df_used_cars[df_used_cars['Brand'].isin(['MINI','JEEP','HINDUSTAN', 'LAND'])].sample(25).T

#### Observation:
1. Further refinement required for Hindustan Motors, Land Rover, Jeep Compass, Mini Cooper

In [None]:
df_used_cars['Brand'] = df_used_cars['Brand']\
.str.replace('LAND', 'LAND ROVER')\
.str.replace('MINI', 'MINI COOPER')\
.str.replace('HINDUSTAN', 'HINDUSTAN MOTORS')\
.str.replace('JEEP', 'JEEP COMPASS')

In [None]:
df_used_cars['Model'] = df_used_cars['Name'].str.upper().str.split().str[1:].str.join(' ')\
.str.replace('ROVER ', '')\
.str.replace('COOPER ', '')\
.str.replace('MOTORS', '')\
.str.replace('COMPASS', '')

In [None]:
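# split() without an argument ignores the leading space left behind when 'MOTORS' or 'COMPASS' is removed
df_used_cars['Model Name'] = df_used_cars['Model'].str.split().str[0]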
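
The brand and model extraction above can be sketched on a handful of made-up listing names. This is only an illustration under assumptions: the names, the `two_word_brands` lookup and the token-skipping logic are hypothetical and are not the notebook's own code, and only three of the multi-word brands are shown.

```python
import pandas as pd

# Hypothetical listing names, made up for illustration only
names = pd.Series([
    "Maruti Wagon R LXI CNG",
    "Land Rover Range Rover 2.2L Pure",
    "Mini Cooper Convertible S",
    "Hindustan Motors Contessa 2.0 DSL",
])

# Brands whose name spans two words, keyed by their upper-cased first word
two_word_brands = {
    "LAND": "LAND ROVER",
    "MINI": "MINI COOPER",
    "HINDUSTAN": "HINDUSTAN MOTORS",
}

tokens = names.str.upper().str.split()
first_word = tokens.str[0]

# Exact-match replacement: values that are not keys of the dict stay unchanged
brand = first_word.replace(two_word_brands)

# Skip two tokens for two-word brands and one token otherwise
n_brand_tokens = first_word.isin(list(two_word_brands)).astype(int) + 1
model_name = pd.Series(
    [toks[n] for toks, n in zip(tokens, n_brand_tokens)], index=names.index
)

print(pd.DataFrame({"Brand": brand, "Model Name": model_name}))
```

Working on tokens rather than on substrings avoids leftover whitespace and accidental replacements inside longer words.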

In [None]:
df_used_cars.sample(15).T

In [None]:
df_used_cars['Brand'].unique()

#### 3.2. Generate Brand Class

In [None]:
# Pass a list (not a set) so that the column order of the result is deterministic
df_used_cars.groupby(['Brand'])['Price'].agg(['median','mean','max']).sort_values(by='max', ascending = False)

In [None]:
# Creating brand class looking at the prices and using knowledge (google)

# Above 50 Lakhs
High = ['LAND ROVER',
        'LAMBORGHINI',
        'JAGUAR',
        'BMW',
        'MERCEDES-BENZ',
        'PORSCHE',
        'AUDI',
        'BENTLEY',
        'FORD']

# Above 20 Lakhs to 50 Lakhs
Mid = ['MINI COOPER',
        'TOYOTA',
        'VOLVO',
        'MITSUBISHI',
        'SKODA',
        'VOLKSWAGEN',
        'JEEP COMPASS',
        'HYUNDAI']

# Up to 20 Lakhs
Low = ['ISUZU',
        'TATA',
        'MAHINDRA',
        'HONDA',
        'RENAULT',
        'FORCE',
        'MARUTI',
        'CHEVROLET',
        'NISSAN',
        'FIAT',
        'DATSUN',
        'SMART',
        'AMBASSADOR',
        'HINDUSTAN MOTORS',
        'OPELCORSA']

In [None]:
# Function to create brand class column using the above list
def classify(brand):
    if brand in High:
        return 'High'
    elif brand in Mid:
        return 'Mid'
    elif brand in Low:
        return 'Low'
    else:
        return 'is_missing'

In [None]:
df_used_cars['Brand Class'] = df_used_cars['Brand'].apply(classify)

In [None]:
df_used_cars['Brand Class'].unique()
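
The same tier assignment could also be expressed as a dictionary lookup. The sketch below uses shortened, illustrative tier lists and a deliberately unlisted brand (`TESLA`) to show the fallback. The names `tiers` and `brand_to_tier` are assumptions introduced here, not part of the notebook.

```python
import pandas as pd

# Shortened, illustrative tier lists; the notebook's High/Mid/Low lists are longer
tiers = {
    "High": ["BMW", "AUDI"],
    "Mid": ["TOYOTA", "SKODA"],
    "Low": ["MARUTI", "TATA"],
}

# Invert the tiers into a brand -> tier lookup
brand_to_tier = {brand: tier for tier, brands in tiers.items() for brand in brands}

brands = pd.Series(["BMW", "MARUTI", "SKODA", "TESLA"])  # TESLA is deliberately unlisted

# map() yields NaN for unknown brands; fillna() labels them the way classify() does
brand_class = brands.map(brand_to_tier).fillna("is_missing")
print(brand_class.tolist())
```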

#### 3.3. Convert Mileage, Power and Engine to numeric values by removing units of measurement

First, check which units appear in each of these columns.

In [None]:
df_used_cars['Mileage'].str.split(' ').str[1].unique()

In [None]:
df_used_cars['Power'].str.split(' ').str[1].unique()

In [None]:
df_used_cars['Engine'].str.split(' ').str[1].unique()

In [None]:
df_used_cars[df_used_cars['Mileage'].isnull()]

In [None]:
df_used_cars[df_used_cars['Power'].isnull()]

In [None]:
df_used_cars[df_used_cars['Engine'].isnull()]

#### Observation:
1. Although Mileage has only 2 nulls, some rows have 0.0 in that column, which will most probably need treatment as well.

#### 3.4. Remove units from Mileage, Engine, Power

In [None]:
df_used_cars['Mileage'] = df_used_cars['Mileage'].str.replace(' km/kg', '').str.replace(' kmpl','')
df_used_cars['Power'] = df_used_cars['Power'].str.replace(' bhp', '')
df_used_cars['Engine'] = df_used_cars['Engine'].str.replace(' CC', '')

In [None]:
df_used_cars.isnull().sum()

In [None]:
df_used_cars.sample(5).T

In [None]:
df_used_cars[(df_used_cars['Mileage'].astype(str).str.split('.').str[0] == '0') \
             | (df_used_cars['Mileage'].astype(str) == 'null')]['Mileage'].count()

#### Observation:
1. Mileage has 81 values as 0.0 or null

In [None]:
df_used_cars[(df_used_cars['Engine'].astype(str).str.split('.').str[0] == '0') \
             | (df_used_cars['Engine'].astype(str) == 'null')]['Engine'].count()

In [None]:
df_used_cars[(df_used_cars['Power'].astype(str).str.split('.').str[0] == '0') \
             | (df_used_cars['Power'].astype(str) == 'null')]['Power'].count()

#### Observation:
1. Power has 129 values as 0 or null
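
As an alternative sketch, unit stripping and the handling of `null` strings can be combined with `pd.to_numeric(..., errors="coerce")`. The sample values and the helper name `to_number` below are hypothetical. The idea is that anything that does not parse as a number becomes `NaN`, and zeros are then treated as missing as well.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values in the same shape as the Power and Mileage columns
power = pd.Series(["58.16 bhp", "null bhp", "126.2 bhp", np.nan])
mileage = pd.Series(["26.6 km/kg", "0.0 kmpl", "19.67 kmpl", np.nan])

def to_number(col):
    """Keep the text before the first space and parse it; unparseable text becomes NaN."""
    return pd.to_numeric(col.str.split(" ").str[0], errors="coerce")

power_num = to_number(power)
# A mileage of 0 is not physically meaningful, so treat it as missing too
mileage_num = to_number(mileage).replace(0, np.nan)

print(power_num.tolist())
print(mileage_num.tolist())
```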

#### 3.5. Add Car Age from Year

In [None]:
df_used_cars['Car Age'] = 2021 - df_used_cars['Year']

#### Convert New_Price to a unitless number in Lakhs (1 Cr = 100 Lakh)

In [None]:
df_used_cars['New_Price'].str.split(' ').str[1].unique()

In [None]:
# Function to convert price to Lakhs (1 Cr = 100 Lakh)
def price_point_conv(price):
    if not isinstance(price, str):
        return np.nan
    value, unit = price.split(' ')
    if unit == 'Lakh':
        multiplier = 1
    elif unit == 'Cr':
        multiplier = 100
    else:
        # Unknown unit: treat as missing instead of failing on an undefined multiplier
        return np.nan
    return float(value) * multiplier

In [None]:
df_used_cars['New_Price'] = df_used_cars['New_Price'].apply(price_point_conv)
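
A vectorised variant of the same conversion is sketched below, assuming every non-missing value has the form `<number> Lakh` or `<number> Cr`. The sample values and the regular expression are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical New_Price values
new_price = pd.Series(["8.61 Lakh", "1.5 Cr", np.nan, "21 Lakh"])

# Split each value into its numeric part (column 0) and its unit (column 1)
parts = new_price.str.extract(r"^([\d.]+)\s+(Lakh|Cr)$")
value = parts[0].astype(float)
factor = parts[1].map({"Lakh": 1, "Cr": 100})  # 1 Cr = 100 Lakh

new_price_lakhs = value * factor
print(new_price_lakhs.tolist())
```

Missing inputs stay `NaN` throughout, so no explicit type check is needed.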

In [None]:
df_used_cars.sample(5)

#### 3.6. Dropping unnecessary columns

In [None]:
df_used_cars = df_used_cars.drop(['S.No.','Model'], axis=1)

In [None]:
df_used_cars.info()

In [None]:
df_used_cars.sample(5)

#### 3.7. Data Type conversions

##### 3.7.1. Converting 0.0/null Mileage/Power/Engine to Null to treat them all with FillNa()

In [None]:
df_used_cars.loc[(df_used_cars['Mileage'].astype(str).str.split('.').str[0] == '0') \
             | (df_used_cars['Mileage'].astype(str) == 'null'), 'Mileage'] = np.nan

In [None]:
df_used_cars.loc[(df_used_cars['Power'].astype(str).str.split('.').str[0] == '0') \
             | (df_used_cars['Power'].astype(str) == 'null'), 'Power'] = np.nan

In [None]:
df_used_cars.loc[(df_used_cars['Engine'].astype(str).str.split('.').str[0] == '0') \
             | (df_used_cars['Engine'].astype(str) == 'null'), 'Engine'] = np.nan

##### 3.7.2. Data Type conversions

In [None]:
df_used_cars['Name'] = df_used_cars['Name'].astype('str')
df_used_cars['Location'] = df_used_cars['Location'].astype('category')
df_used_cars['Fuel_Type'] = df_used_cars['Fuel_Type'].astype('category')
df_used_cars['Transmission'] = df_used_cars['Transmission'].astype('category')
df_used_cars['Owner_Type'] = df_used_cars['Owner_Type'].astype('category')
df_used_cars['Mileage'] = df_used_cars['Mileage'].astype('float')
df_used_cars['Power'] = df_used_cars['Power'].astype('float')
df_used_cars['Engine'] = df_used_cars['Engine'].astype('float')
df_used_cars['Brand'] = df_used_cars['Brand'].astype('str')
df_used_cars['Model Name'] = df_used_cars['Model Name'].astype('str')
df_used_cars['Brand Class'] = df_used_cars['Brand Class'].astype('category')

## 4. Exploratory Data Analysis

In [None]:
df_used_cars.describe().T

In [None]:
plt.style.use('ggplot')
numeric_columns = df_used_cars.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,40))

for i, col in enumerate(numeric_columns):
    plt.subplot(10, 3, i+1)
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(df_used_cars[col], kde=True, color='green')
    plt.title(str(i+1) + ': ' + col + ' distribution', color='black')
plt.tight_layout()

#### Observation:
1. Year is slightly left-skewed, so Car Age is slightly right-skewed.
2. Kilometers_Driven is heavily right-skewed. The maximum is 6,500,000 km, which needs investigation. The skewness also needs to be resolved, probably with a log transform.
3. Power and New_Price are highly right-skewed and require transformation or scaling.
4. The maximum Price is 160 Lakhs, which looks quite high and needs investigation.
5. The maximum New_Price is also high at 375 Lakhs and needs investigation; hopefully it is a very high-end car.
6. The maximum Power is 616 bhp, which again looks high and needs investigation.
7. Engine is also right-skewed. The maximum is 5998 CC, which again looks high and needs investigation.
8. The Mileage distribution looks reasonable.
9. Most cars in the analysis have 5 seats.

In [None]:
cat_columns = df_used_cars.select_dtypes(exclude=np.number).columns.tolist()
cat_columns.remove('Model Name')
cat_columns.remove('Name')

plt.figure(figsize=(20,25))

for i, col in enumerate(cat_columns):
    plt.subplot(4,2,i+1)
    order = df_used_cars[col].value_counts(ascending=False).index    
    ax=sns.countplot(x=col, data=df_used_cars , order=order, palette='pastel')
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/len(df_used_cars[col]))
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height() + 20
        plt.annotate(percentage, (x, y), ha='center', color='black')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.title(col, color='black')

#### Observation:
1. Most of the cars are Petrol or Diesel cars
2. Most of the cars have manual transmission
3. Most of the owners are first owners
4. Only 20% cars are high end cars
5. 50% of cars are from Maruti, Hyundai and Honda
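
The percentages annotated on the bars can also be read off numerically. A minimal sketch on a hypothetical sample of ten listings, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical sample of 10 listings
fuel = pd.Series(["Diesel"] * 5 + ["Petrol"] * 4 + ["CNG"])

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
share = fuel.value_counts(normalize=True).mul(100).round(1)
print(share)
```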

In [None]:
plt.style.use('ggplot')
numeric_columns = df_used_cars.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove('Price')
plt.figure(figsize=(20,60))

for i, col in enumerate(numeric_columns):
    plt.subplot(8, 2, i+1)
    sns.scatterplot(data=df_used_cars, x=col, y='Price', color='green', alpha=0.5)
    plt.title(str(i+1) + ': ' + col + ' vs Price Distribution', color='black')
plt.tight_layout()

#### Observation:
1. New_Price and Price appear positively correlated.
2. Car Age has a negative impact on used car price.
3. Power has a positive impact on used car price.
4. Kilometers_Driven has a negative impact on used car price.
5. Engine has some positive correlation with Price.
## 5. Missing Value handling

In [None]:
# counting the number of missing values per row
num_missing = df_used_cars.isnull().sum(axis=1)
num_missing.value_counts()

In [None]:
#Investigating how many missing values per row are there for each variable
for n in num_missing.value_counts().sort_index().index:
    if n > 0:
        print("*" *30,f'\nFor the rows with exactly {n} missing values, NAs are found in:')
        n_miss_per_col = df_used_cars[num_missing == n].isnull().sum()
        print(n_miss_per_col[n_miss_per_col > 0])
        print('\n\n')

#### Observation:
This confirms that certain columns tend to be missing together or present together. The missing values should be imputed before modelling.

#### 5.1. Exploring options to impute Mileage, Power and Engine

##### Level 1: Fill gap by Name and Year

In [None]:
# transform() keeps the original row index, so the result aligns with the dataframe on assignment
df_used_cars['Engine'] = df_used_cars.groupby(['Name','Year'])['Engine'].transform(lambda x: x.fillna(x.median()))
df_used_cars['Power'] = df_used_cars.groupby(['Name','Year'])['Power'].transform(lambda x: x.fillna(x.median()))
df_used_cars['Mileage'] = df_used_cars.groupby(['Name','Year'])['Mileage'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars.isnull().sum()

##### Level 2: Fill gap by Brand and Model name

In [None]:
df_used_cars['Engine'] = df_used_cars.groupby(['Brand','Model Name'])['Engine'].transform(lambda x: x.fillna(x.median()))
df_used_cars['Power'] = df_used_cars.groupby(['Brand','Model Name'])['Power'].transform(lambda x: x.fillna(x.median()))
df_used_cars['Mileage'] = df_used_cars.groupby(['Brand','Model Name'])['Mileage'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars.isnull().sum()

##### Level 3: Fill gap by Brand and Year only

In [None]:
df_used_cars['Power'] = df_used_cars.groupby(['Brand','Year'])['Power'].transform(lambda x: x.fillna(x.median()))
df_used_cars['Mileage'] = df_used_cars.groupby(['Brand','Year'])['Mileage'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars.isnull().sum()

##### Level 4: Fill gap by Brand

In [None]:
df_used_cars['Power'] = df_used_cars.groupby(['Brand'])['Power'].transform(lambda x: x.fillna(x.median()))
df_used_cars['Mileage'] = df_used_cars.groupby(['Brand'])['Mileage'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars.isnull().sum()
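
The level-by-level fallback used above (most specific group first, then progressively coarser groups) can be sketched as one helper. The function name `fill_by_group_median` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fill_by_group_median(df, col, key_levels):
    """Fill NaNs in `col` with group medians, trying each key set in order."""
    filled = df[col].copy()
    for keys in key_levels:
        # Median of the values known so far, broadcast back to every row of the group
        group_median = filled.groupby([df[k] for k in keys]).transform("median")
        filled = filled.fillna(group_median)
    return filled

# Toy data: two brands, with Power missing at different levels of detail
toy = pd.DataFrame({
    "Name": ["A x", "A x", "A y", "B z", "B w"],
    "Brand": ["A", "A", "A", "B", "B"],
    "Power": [80.0, np.nan, np.nan, 120.0, np.nan],
})

# Try the exact name first, then fall back to the brand
toy["Power"] = fill_by_group_median(toy, "Power", [["Name"], ["Brand"]])
print(toy)
```

A group whose values are all missing has a `NaN` median, so its rows are simply left for the next, coarser level.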

In [None]:
df_used_cars[df_used_cars['Mileage'].isnull()]

In [None]:
df_used_cars[df_used_cars['Power'].isnull()]

In [None]:
#df_used_cars[(df_used_cars['Fuel_Type'] == 'Diesel') & \
#            (df_used_cars['Engine'] >= 799.0)]['Mileage'].median()

In [None]:
#df_used_cars['Mileage'] = df_used_cars['Mileage'].fillna(df_used_cars[(df_used_cars['Fuel_Type'] == 'Diesel') & \
#            (df_used_cars['Engine'] >= 799.0)]['Mileage'].median())

In [None]:
#df_used_cars[(df_used_cars['Fuel_Type'] == 'Diesel') & \
#            (df_used_cars['Engine'] >= 700.0)& \
#            (df_used_cars['Engine'] <= 800.0)]['Power'].median()

In [None]:
#df_used_cars[(df_used_cars['Fuel_Type'] == 'Diesel') & \
#            (df_used_cars['Engine'] >= 1900.0)& \
#            (df_used_cars['Engine'] <= 2000.0)]['Power'].median()

In [None]:
#df_used_cars.loc[(df_used_cars['Power'].isnull()) & (df_used_cars['Brand'] == 'SMART'),'Power'] = \
#df_used_cars[(df_used_cars['Fuel_Type'] == 'Diesel') & \
#            (df_used_cars['Engine'] >= 700.0)& \
#            (df_used_cars['Engine'] <= 800.0)]['Power'].median()

In [None]:
#df_used_cars.loc[(df_used_cars['Power'].isnull()) & (df_used_cars['Brand'] == 'HINDUSTAN MOTORS'),'Power'] = \
#df_used_cars[(df_used_cars['Fuel_Type'] == 'Diesel') & \
#            (df_used_cars['Engine'] >= 1900.0)& \
#            (df_used_cars['Engine'] <= 2000.0)]['Power'].median()

#### 5.2. Null handling for seats

In [None]:
sns.boxplot(x=df_used_cars['Seats'])

In [None]:
df_used_cars['Seats'].value_counts().sort_values(ascending=False)

In [None]:
df_used_cars['Seats'] = df_used_cars.groupby(['Name','Year'])['Seats'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars['Seats'] = df_used_cars.groupby(['Name'])['Seats'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars['Seats'] = df_used_cars.groupby(['Brand','Model Name'])['Seats'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars.isnull().sum()

In [None]:
df_used_cars[df_used_cars['Seats'].isnull()]

##### Maruti Estilo LXI is a 5-seater, and most cars in the dataset are 5-seaters, so the remaining gaps are filled with 5.

In [None]:
df_used_cars['Seats'] = df_used_cars['Seats'].fillna(5)

In [None]:
df_used_cars.isnull().sum()

#### 5.3. Filling gaps in New Price

In [None]:
# Filling in based on Median new price by group of Name and Year
df_used_cars['New_Price'] = df_used_cars.groupby(['Name','Year'])['New_Price'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars['New_Price'].isnull().sum()

In [None]:
# Filling in based on Median new price by group of Brand and Model Name
df_used_cars['New_Price'] = df_used_cars.groupby(['Brand','Model Name'])['New_Price'].transform(lambda x: x.fillna(x.median()))

In [None]:
df_used_cars['New_Price'].isnull().sum()

In [None]:
# Filling in based on Median new price by group of Brand
df_used_cars['New_Price'] = df_used_cars.groupby(['Brand'])['New_Price'].transform(lambda x: x.fillna(x.median()))

In [None]:
# Checking how many values are still blank
df_used_cars['New_Price'].isnull().sum()

In [None]:
# Checking if there is a pattern
df_used_cars[df_used_cars['New_Price'].isnull()]['Brand'].unique()

In [None]:
# Checking records where new price is unavailable but price is available
df_used_cars[(df_used_cars['New_Price'].isnull()) & (~df_used_cars['Price'].isnull())]

In [None]:
# This could have been a step to impute the values but it seems way too much assumption
#df_used_cars['New_Price'] = df_used_cars['New_Price'].fillna(df_used_cars['Price'])

In [None]:
df_used_cars.isnull().sum()

The remaining missing values cannot be filled in any further. Rows without a Price add no value either, since Price is the prediction target. Hence, all rows that still have missing data are dropped now.

In [None]:
df_used_cars.dropna(inplace=True, axis=0)

In [None]:
df_used_cars.isnull().sum()

In [None]:
df_used_cars.describe()

## 6. EDA: Bivariate and Multivariate

In [None]:
# Checking the heatmap of correlations to understand important features

plt.figure(figsize=(15,10))
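# Restrict to numeric columns; newer pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(data=df_used_cars.select_dtypes(include=np.number).corr(), annot=True, cmap='YlGnBu');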

In [None]:
# Pair plot of the numeric variables to understand the correlation and importance

sns.pairplot(data=df_used_cars, corner = False, diag_kind='kde');

##### Observation:
1. Price has high correlation with Engine, Power and New Price
2. Age and Mileage have small negative correlation with Price
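
To read the heatmap as a ranked list, the correlations with Price can be sorted. The sketch below uses synthetic data, not the used-car dataset: price is generated to rise with power and fall with age, so the signs are known in advance. The coefficients and column values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
power = rng.uniform(50, 300, n)
age = rng.uniform(1, 15, n)
# Synthetic price: rises with power, falls with age, plus noise
price = 0.1 * power - 0.8 * age + rng.normal(0, 2, n)

toy = pd.DataFrame({
    "Power": power,
    "Car Age": age,
    "Price": price,
    "Fuel_Type": ["Petrol", "Diesel"] * (n // 2),  # non-numeric column, excluded below
})

# Correlation of every numeric feature with Price, from most negative to most positive
corr_with_price = (
    toy.select_dtypes(include=np.number).corr()["Price"].drop("Price").sort_values()
)
print(corr_with_price)
```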

In [None]:
# Draw one box plot per quantitative feature in cat_list, grouped by the qualitative feature x

def bivariate_analysis(df, x, cat_list):
    n_rows = (len(cat_list) + 1) // 2  # two plots per row, rounded up
    # squeeze=False keeps axes two-dimensional even when there is a single row
    fig, axes = plt.subplots(n_rows, 2, figsize=(20, 40), squeeze=False)
    for i, col in enumerate(cat_list):
        # Quantitative feature on the horizontal axis, one box per category of x
        sns.boxplot(data=df, x=col, y=x, showmeans=True, ax=axes[i // 2, i % 2]).set(title=col + ' By ' + x)

In [None]:
df_used_cars.sample(5)

In [None]:
# Box plots of the quantitative features by Location

bivariate_analysis(df_used_cars, x='Location', \
                   cat_list=['Mileage', 'Engine', 'Power', 'New_Price', 'Price', 'Car Age'])

In [None]:
# Box plots of the quantitative features by Fuel Type

bivariate_analysis(df_used_cars, x='Fuel_Type', \
                   cat_list=['Mileage', 'Engine', 'Power', 'New_Price', 'Price', 'Car Age'])

##### Observation:
1. CNG cars provide better mileage.
2. Diesel cars generally have larger engines, more power and higher prices, although many high-end petrol cars with larger engines, more power and higher prices appear among the outliers.

In [None]:
# Box plots of the quantitative features by Transmission

bivariate_analysis(df_used_cars, x='Transmission', \
                   cat_list=['Mileage', 'Engine', 'Power', 'New_Price', 'Price', 'Car Age'])

##### Observation:
1. Automatic-transmission cars are costlier.

In [None]:
# Box plots of the quantitative features by Owner Type

bivariate_analysis(df_used_cars, x='Owner_Type', \
                   cat_list=['Mileage', 'Engine', 'Power', 'New_Price', 'Price', 'Car Age'])

##### Observations:
1. Used car price decreases as the number of previous owners increases.

In [None]:
# Box plots of the quantitative features by Brand Class

bivariate_analysis(df_used_cars, x='Brand Class', \
                   cat_list=['Mileage', 'Engine', 'Power', 'New_Price', 'Price', 'Car Age'])

##### Observations:
1. High-class brands are costlier, as expected.
2. High-class cars have larger engines and are more powerful.
3. Low-class cars give better mileage.

#### EDA on Variables that are correlated with Price variable

In [None]:
# Plot the relationship between two features as a scatter, line or bar chart,
# optionally coloured by a qualitative feature (hue)

def categorical_plots(df, x, y, hue, kind, size):
    '''
    Signature:
    categorical_plots(df, x, y, hue, kind, size):

    Parameters:
    df = pandas dataframe
    x = x axis, quantitative feature for scatter and line kind, qualitative feature for bar kind
    y = y axis, quantitative feature
    hue = hue parameter by categorical/qualitative feature, unused for bar kind
    kind = {scatter|bar|line}
    size = tuple (width, height) format
    '''
    plt.figure(figsize=size)
    if kind == 'scatter':
        plt.title(x + ' vs. ' + y + ' by ' + hue)
        sns.scatterplot(data=df, x=x, y=y, hue=hue); # this will plot scatter charts
    elif kind == 'line':
        plt.title(x + ' vs. ' + y + ' by ' + hue)
        sns.lineplot(data = df, x = x, y = y, hue = hue); # this will plot line charts
    elif kind == 'bar':
        plt.title(x + ' vs. ' + y)
        sns.barplot(data = df, x = x, y = y, hue=None) # this will plot bar charts

##### Price vs. Engine by Transmission

In [None]:
categorical_plots(df_used_cars, 'Price', 'Engine', 'Transmission', 'scatter', (15, 10))

Automatic-transmission cars are costlier and have larger engines.

##### Price vs. Power by Transmission

In [None]:
categorical_plots(df_used_cars, 'Price', 'Power', 'Transmission', 'scatter', (15,10))

Automatic-transmission cars are costlier and have more power.

##### Price vs. Mileage by Transmission

In [None]:
categorical_plots(df_used_cars, 'Price', 'Mileage', 'Transmission', 'scatter', (15,10))

Automatic-transmission cars are costlier, but manual-transmission cars provide better mileage.

#### Year vs. Price by Transmission

In [None]:
categorical_plots(df_used_cars, 'Year', 'Price', 'Transmission', 'line', (15,7))

Automatic-transmission cars appear to be getting more popular and costlier with each model year.

#### Year vs. Price by Fuel Type

In [None]:
categorical_plots(df_used_cars, 'Year', 'Price', 'Fuel_Type', 'line', (15,7))

Diesel cars are getting costlier with each model year, possibly because rising petrol prices are increasing demand for them.

In [None]:
df_used_cars[df_used_cars['Power'].isnull()]

#### Year vs. Price by Owner Type

In [None]:
categorical_plots(df_used_cars, 'Year', 'Price', 'Owner_Type', 'line', (15,7))

First-owner cars generally sell at a better price.

In [None]:
# Plot Yearly Price variation by each type of owner

for val in df_used_cars['Owner_Type'].unique():
    categorical_plots(df_used_cars[df_used_cars['Owner_Type'] == val], 'Year', 'Price', 'Owner_Type', 'scatter', (15,4))
    plt.title('Year vs. Price by Owner Type: '+ val)

#### Seats vs. Price

In [None]:
categorical_plots(df_used_cars, 'Seats', 'Price', hue = 'Seats', kind='bar', size=(15,7))

2-seater cars are costlier.

#### Location vs. Price

In [None]:
categorical_plots(df_used_cars, 'Location', 'Price', hue = 'Location', kind='bar', size=(15,7))

Cars are sold at higher prices in Bangalore, Coimbatore and Kochi.

#### Brand Class vs. Price

In [None]:
categorical_plots(df_used_cars, 'Brand Class', 'Price', hue = 'Brand Class', kind='bar', size=(10,7))

#### Brand vs. Price

In [None]:
categorical_plots(df_used_cars, 'Brand', 'Price', hue = 'Brand', kind='bar',size=(15,7))
plt.xticks(rotation = -90);

In [None]:
# Plot count charts by Brands per Car Brand Class

fig, axes = plt.subplots(3,1,figsize=(20, 20))
i = 0
for val in df_used_cars['Brand Class'].unique():
    order = df_used_cars[df_used_cars['Brand Class']==val]['Brand'].value_counts(ascending=False).index    
    sns.countplot(x='Brand', data=df_used_cars[df_used_cars['Brand Class']==val] , order=order, palette='pastel', ax=axes[i]).set(title='Brand per Class: ' + val)
    
    i+=1

##### Observation:
1. Maruti and Honda cars are very popular in the low-budget class.
2. Hyundai, Toyota and Volkswagen are popular in the mid-budget class.
3. Mercedes, Ford, BMW and Audi cars are very popular among high-budget cars.

#### Age of Car vs. Price

In [None]:
categorical_plots(df_used_cars, 'Car Age', 'Price', hue = 'Car Age', kind='bar',size=(15,4))

The older the car, the lower the price.

In [None]:
px.scatter(data_frame=df_used_cars, x='New_Price', y='Price', color = 'Car Age')

##### Observations from EDA:

1. Expensive cars are found in Coimbatore and Bangalore.
2. 2-seater cars are more expensive.
3. Diesel cars are more expensive than cars with other fuel types.
4. As expected, newer models are costlier.
5. Automatic-transmission vehicles have a higher price than manual-transmission vehicles.
6. Vehicles with more engine capacity have higher prices.
7. Price decreases as the number of owners increases.
8. Automatic-transmission cars tend to have larger engines and more power.
9. Prices for diesel cars have increased with recent models.
10. A higher new-car price means a higher used-car price, although the used-car price still decreases with the age of the car.
11. Engine, Power, Car Age, Mileage, Fuel Type, Location, Transmission and New Price all correlate with the price.

## 7. Log transform skewed columns

Kilometers_Driven, Price and New_Price were highly right-skewed, so a log transform is applied to these columns.

In [None]:
# Check whether the columns have non-positive values; if they do, they cannot be log-transformed

def check_log_transformability(df, cols):
    '''
    Signature:
    check_log_transformability(df, cols): validates if the columns are log transformable
    
    parameters:
    df = pandas dataframe
    cols = list of columns to be performed log transformation on
    '''
    for colname in cols:
        plt.hist(df[colname], bins=50)
        plt.title(colname + ' Distribution')
        plt.show()
        print(colname + ' data less or equal to zero in ' + str(np.sum(df[colname] <= 0)) + ' rows')
        print('Skewness:  {}'.format(df[colname].skew()))
        print('Kurtosis:  {}'.format(df[colname].kurt()))

In [None]:
check_log_transformability(df_used_cars, ['Kilometers_Driven', 'Price', 'New_Price'])

Great! It appears we can perform log transformation on these columns.
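
Why the log transform helps can be sketched on synthetic data. The log-normal sample below is an assumption that merely mimics a right-skewed, strictly positive variable such as price; it is not drawn from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Log-normal draws mimic a right-skewed, strictly positive variable such as price
prices = pd.Series(rng.lognormal(mean=1.5, sigma=0.8, size=5000))

# np.log is only defined for strictly positive values
assert (prices > 0).all()

skew_raw = prices.skew()
skew_log = np.log(prices).skew()
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The raw sample has a long right tail, while its logarithm is approximately symmetric, which suits a linear model better.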

In [None]:
# Function to perform log transformation

def log_transform(df, cols):
    for colname in cols:
        df[colname + '_log'] = np.log(df[colname])
        sns.histplot(data=df, x=colname + '_log', bins=50, kde=True, color='green')
        plt.title(colname + '_log Distribution')
        plt.show()

In [None]:
log_transform(df_used_cars, ['Kilometers_Driven', 'Price', 'New_Price'])

In [None]:
df_used_cars.sample(5)

## 8. Build Model

##### Drop unnecessary features

In [None]:
df_used_cars.drop(['Name', 'Year', 'Brand', 'Model Name'], axis=1, inplace=True)

In [None]:
df_used_cars.sample(5)

##### Encode categorical variables

In [None]:
# Function to one-hot encode categorical variables and drop the first encoded category per column

def encode_cat_vars(df, cols):
    '''
    Signature: encode_cat_vars(df, cols): encodes categorical variables in a dataframe
    
    Parameters:
    df = a pandas dataframe
    cols = columns to encode
    '''
    
    df = pd.get_dummies(
        df,
        columns=cols,
        drop_first=True,
    )
    return df

In [None]:
df_used_cars_encoded = encode_cat_vars(df_used_cars, ['Location', 'Fuel_Type', 'Transmission'])
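
The effect of `drop_first=True` is easiest to see on a tiny frame. In this sketch (the toy values are assumptions) the alphabetically first category, `CNG`, becomes the baseline and gets no column of its own.

```python
import pandas as pd

toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Diesel"],
    "Power": [80.0, 120.0, 60.0, 140.0],
})

# One indicator column per category except the first one, which acts as the baseline
encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True)
print(encoded)
```

A row whose indicators are all zero therefore belongs to the dropped baseline category, which avoids perfectly collinear columns in the regression.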

***Owner Type and Brand Class are label-encoded instead, since their values have a natural order***

In [None]:
# map() on plain strings avoids relying on replace() with a categorical column
df_used_cars_encoded['Owner_Type'] = df_used_cars_encoded['Owner_Type'].astype(str).map({"First":1,"Second":2,"Third": 3,"Fourth & Above":4}).astype(int)

In [None]:
# map() on plain strings avoids relying on replace() with a categorical column
df_used_cars_encoded['Brand Class'] = df_used_cars_encoded['Brand Class'].astype(str).map({"Low":1,"Mid":2,"High":3}).astype(int)

In [None]:
df_used_cars_encoded.sample(5)

##### Dropping the variables which were already transformed to log

In [None]:
df_used_cars_encoded.drop(['Kilometers_Driven', 'Price', 'New_Price'], axis=1, inplace=True)

##### Performance checks (to be used later)

In [None]:
# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    # Take the absolute value of the whole ratio so that negative targets cannot flip the sign
    return np.mean(np.abs((targets - predictions) / targets)) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Adjusted R square
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
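    # The helpers expect (predictions, targets); the order matters for MAPE, which divides by the targets
    rmse_score = rmse(pred, target)  # to compute RMSE
    mae_score = mae(pred, target)  # to compute MAE
    mape_score = mape(pred, target)  # to compute MAPE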

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse_score,
            "MAE": mae_score,
            "MAPE": mape_score,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
        },
        index=[0],
    )

    return df_perf
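
As a quick sanity check on the metric definitions above, the sketch below recomputes them with NumPy on a tiny set of actual and predicted values. The numbers are made up for illustration and are not from the dataset, and `k = 1` predictor is an assumption for the adjusted $R^2$ step. Note that MAPE divides each error by the actual value, so the argument order matters for that metric.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only, not from the dataset)
actual = np.array([2.0, 4.0, 5.0, 10.0])
predicted = np.array([2.5, 3.0, 5.0, 9.0])

errors = actual - predicted

# Root mean squared error, mean absolute error, mean absolute percentage error
rmse_value = np.sqrt(np.mean(errors ** 2))
mae_value = np.mean(np.abs(errors))
mape_value = np.mean(np.abs(errors) / actual) * 100

# R-squared from the residual and total sums of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_value = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors k (assumed to be 1 here)
n, k = len(actual), 1
adj_r2_value = 1 - (1 - r2_value) * (n - 1) / (n - k - 1)

print(rmse_value, mae_value, mape_value, r2_value, adj_r2_value)
```

Adjusted $R^2$ is always at most $R^2$, and the gap widens as more predictors are added relative to the number of rows.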

### Model 1: Dropping the New Price column since it had a large number of null values

##### Splitting the dataset into dependent and independent variables

In [None]:
x = df_used_cars_encoded.drop(['Price_log', 'New_Price_log'], axis=1)

In [None]:
y = df_used_cars_encoded['Price_log']

In [None]:
x.sample(5)

In [None]:
y.sample(5)

##### Splitting the dataset into training and testing sets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

##### Building the regression model

In [None]:
regression_model = LinearRegression()
regression_model.fit(x_train, y_train)

##### Coefficients and intercept in the model

In [None]:
coef_df = pd.DataFrame(
    np.append(regression_model.coef_, regression_model.intercept_),
    index=x_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df.sort_values(by='Coefficients')

In [None]:
model_performance_regression(regression_model, x_train, y_train)

In [None]:
model_performance_regression(regression_model, x_test, y_test)

### Model 2: Keeping the New Price as we were able to impute most of the null values

##### Splitting the dataset into dependent and independent variables

In [None]:
x2 = df_used_cars_encoded.drop(['Price_log'], axis=1)

In [None]:
y2 = df_used_cars_encoded['Price_log']

In [None]:
x2.sample(5)

In [None]:
y2.sample(5)

##### Splitting the dataset into training and testing sets

In [None]:
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=0.3, random_state=1)

##### Building the regression model

In [None]:
regression_model2 = LinearRegression()
regression_model2.fit(x2_train, y2_train)

##### Coefficients and intercept in the model

In [None]:
coef_df2 = pd.DataFrame(
    np.append(regression_model2.coef_, regression_model2.intercept_),
    index=x2_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df2.sort_values(by='Coefficients')

In [None]:
model_performance_regression(regression_model2, x2_train, y2_train)

In [None]:
model_performance_regression(regression_model2, x2_test, y2_test)

##### Observations:
1. The $R^2$ score is ~92% on both the train and test data for the second model.
2. Adjusted $R^2$ is also high (~92%) on both the train and test sets.
3. This means the model explains ~92% of the variance in the log-transformed price.
4. RMSE is comparable between the train and test data, and is lower on the test set.
5. MAE and MAPE are also comparable between the train and test data, and MAPE is lower on the test set.
6. Overall, the model appears to fit well and to generalize to the test set.

## 9. Linear regression assumptions testing

##### 9.1. Mean of residuals

Residuals are the differences between the true values and the predicted values. One of the assumptions of linear regression is that the mean of the residuals should be zero.

In [None]:
def mean_of_residuals(model, predictors, target):
    residuals = target - model.predict(predictors)
    mean_residuals = round(np.mean(residuals), 3)
    print("Mean of Residuals {}".format(mean_residuals))

In [None]:
mean_of_residuals(regression_model2, x2_train, y2_train)

In [None]:
mean_of_residuals(regression_model2, x2_test, y2_test)

***Mean of residuals is close to zero***

##### 9.2. Homoscedasticity of Error/Residual terms

Homoscedasticity means that the residuals have equal or almost equal variance across the regression line. Plotting the residuals against the predicted values should therefore show no pattern.


Homoscedasticity: if the residuals are spread evenly around the regression line across the whole range of predicted values, the data is said to be homoscedastic.

Heteroscedasticity: if the spread of the residuals changes across the range of predicted values, the data is said to be heteroscedastic. In this case the residuals can form a funnel shape or any other uneven shape.

In [None]:
def check_homoscedasticity(model, predictors, target):
    y_pred = model.predict(predictors)
    residuals = target - y_pred
    
    p = plt.figure(figsize=(15,7))
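    # seaborn 0.12+ requires x and y to be passed as keyword arguments
    p = sns.scatterplot(x=y_pred, y=residuals)
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    p = sns.lineplot(x=[-2,5], y=[0,0], color='blue')  # zero-residual reference line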
    p = plt.title('Residuals vs fitted values plot for homoscedasticity check')

In [None]:
check_homoscedasticity(regression_model2, x2_train, y2_train)

In [None]:
check_homoscedasticity(regression_model2, x2_test, y2_test)

##### Goldfeld Quandt Test

We use the Goldfeld-Quandt test to check for heteroscedasticity.

Null Hypothesis: Error terms are homoscedastic.

Alternative Hypothesis: Error terms are heteroscedastic.

At a 5% significance level, we reject the null hypothesis only if the p-value is less than 0.05.

***Goal: Check if we can reject the null hypothesis; if we can't, that would mean error terms are homoscedastic.***
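
To make the idea behind the test concrete, here is a minimal NumPy sketch of the F statistic it is built on: split the observations into two halves, fit an ordinary least squares line on each half, and take the ratio of the two residual variances. The helper name `goldfeld_quandt_f` and the synthetic data are assumptions for illustration only; the sketch omits the p-value that `statsmodels` also reports.

```python
import numpy as np


def goldfeld_quandt_f(y, X):
    """Ratio of residual variances: second half of the sample over the first half."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])  # add intercept
    split = len(y) // 2

    def residual_variance(y_part, X_part):
        # Ordinary least squares fit on this sub-sample
        beta, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
        resid = y_part - X_part @ beta
        dof = len(y_part) - X_part.shape[1]
        return resid @ resid / dof

    return residual_variance(y[split:], X[split:]) / residual_variance(y[:split], X[:split])


# Synthetic data (hypothetical): one predictor, sorted so the halves differ in scale
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 10, size=200))
y_homo = 2 + 3 * x + rng.normal(0, 1, size=200)        # constant noise
y_hetero = 2 + 3 * x + rng.normal(0, 1, size=200) * x  # noise grows with x

f_homo = goldfeld_quandt_f(y_homo, x)
f_hetero = goldfeld_quandt_f(y_hetero, x)
print(f_homo, f_hetero)
```

A ratio near 1 is what homoscedastic errors would produce, while a ratio well above 1 suggests the error variance grows across the sample.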

In [None]:
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

def goldfeld_quandt(model, predictors, target):
    name = ['F statistic', 'p-value']
    y_pred = model.predict(predictors)
    residuals = target - y_pred  # residuals = actual - predicted
    test = sms.het_goldfeldquandt(residuals, predictors)
    print(lzip(name, test))

In [None]:
goldfeld_quandt(regression_model2, x2_train, y2_train)

In [None]:
goldfeld_quandt(regression_model2, x2_test, y2_test)

***Since the p-values are not less than 0.05, we cannot reject the null hypothesis; there is no evidence of heteroscedasticity.***

##### 9.3. Normality of error terms/residuals

In [None]:
def check_residual_normalcy(model, predictors, target):
    y_pred = model.predict(predictors)
    residuals = target - y_pred
    sns.histplot(residuals, kde=True)  # distplot is deprecated in recent seaborn versions
    plt.title('Normality of error terms/residuals')

In [None]:
check_residual_normalcy(regression_model2, x2_test, y2_test)

***The residual terms are pretty much normally distributed for the number of test points we took.***

##### 9.4. No autocorrelation

The error terms should not be correlated with one another, so the residuals should not form any pattern.

In [None]:
y_pred = regression_model2.predict(x2_test)
residuals = y2_test - y_pred

plt.figure(figsize=(15,5))
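# seaborn 0.12+ requires x and y to be passed as keyword arguments
p = sns.lineplot(x=y_pred, y=residuals, marker='o', color='blue')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.ylim(-3,3)
plt.xlim(-1,5)
p = sns.lineplot(x=[-1,5], y=[0,0], color='red')  # zero-residual reference line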
p = plt.title('Residuals vs fitted values plot for autocorrelation check')

To check for autocorrelation, we use the Ljung-Box test.

Null Hypothesis: Autocorrelation is absent.

Alternative Hypothesis: Autocorrelation is present.
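
For intuition, the Ljung-Box statistic can be computed by hand as $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2 / (n-k)$, where $\hat{\rho}_k$ is the sample autocorrelation of the residuals at lag $k$. The sketch below is a minimal NumPy version on synthetic series; the helper name `ljung_box_q` and the data are assumptions for illustration, and the p-value step (a chi-squared tail probability) is left to `statsmodels`.

```python
import numpy as np


def ljung_box_q(residuals, lags):
    """Ljung-Box Q statistic over lags 1..lags, for residuals in the order given."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.sum(e ** 2)
    total = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k
        rho_k = np.sum(e[k:] * e[:-k]) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


# Synthetic series (hypothetical): independent noise vs. a strongly autocorrelated walk
rng = np.random.default_rng(1)
white_noise = rng.normal(size=500)
random_walk = np.cumsum(white_noise)

q_white = ljung_box_q(white_noise, lags=10)
q_walk = ljung_box_q(random_walk, lags=10)
print(q_white, q_walk)
```

A small Q is consistent with no autocorrelation; a very large Q, as for the random walk here, leads to rejecting the null hypothesis.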

In [None]:
from statsmodels.stats import diagnostic as diag
# statsmodels >= 0.13 returns a DataFrame; take the smallest p-value across all lags
diag.acorr_ljungbox(residuals, lags=40)["lb_pvalue"].min()

***Since the p-value is not less than 0.05, we cannot reject the null hypothesis; there is no evidence of autocorrelation in the residuals.***

##### 9.5. No perfect multicollinearity

In regression, multicollinearity refers to the extent to which independent variables are correlated with one another. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics.

We have already checked this in the correlation heatmap.
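
A heatmap only shows pairwise correlations. As a complementary check that is not part of the analysis above, the variance inflation factor (VIF) measures how well each predictor is explained by all the others together: $VIF_j = 1 / (1 - R_j^2)$. The sketch below is a minimal NumPy version on synthetic columns; the helper name and the data are assumptions for illustration.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X, from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)


# Synthetic predictors (hypothetical): c is nearly a copy of a, b is independent
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

A VIF close to 1 indicates little collinearity, while values above roughly 5 to 10 are commonly treated as a warning sign.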

## 10. Observations from the model

1. Our linear regression model satisfies the assumptions of linear regression.
2. With our linear regression model we have been able to capture ~92% of the variation in the log-transformed price.
3. The model indicates that the most significant predictors of price of used cars are:

        a. Age of the car
        b. Number of seats in the car
        c. Power of the engine
        d. Mileage
        e. Kilometers Driven
        f. Location
        g. Fuel_Type
        h. Owner_Type
        i. Transmission - Automatic/Manual
        j. New_Price - Price of new car
    
4. Newer cars sell for higher prices. Because the target is the log of the price, coefficients act multiplicatively: a 1 unit increase in the age of the car multiplies the predicted price by exp(-0.1172) ≈ 0.89, i.e. a decrease of about 11%, when everything else is constant.
5. As the number of seats increases, the price of the car increases: each additional seat multiplies the predicted price by exp(0.039) ≈ 1.04, i.e. an increase of about 4%.
6. Mileage is negatively correlated with Price. Generally, high mileage cars are the lower budget cars.
7. Kilometers Driven has a negative relationship with the price. A car that has been driven more is likely to have more wear and more repairs, and hence sells at a lower price.
8. If the new car price was higher for a car, it is likely to have a higher selling price when it is sold after use.
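
Since the model predicts the log of the price, a coefficient is easier to read once converted to a percentage change. The sketch below assumes a natural-log target and uses the coefficient values quoted above, with the age coefficient taken as negative; the helper name `pct_change_per_unit` is hypothetical.

```python
import math


def pct_change_per_unit(coef):
    """Percent change in price for a one-unit increase in a predictor,
    given its coefficient in a model of log(price)."""
    return (math.exp(coef) - 1) * 100


age_effect = pct_change_per_unit(-0.1172)   # one more year of age
seats_effect = pct_change_per_unit(0.039)   # one more seat

print(round(age_effect, 1), round(seats_effect, 1))  # -11.1 4.0
```

These are relative changes in the predicted price, so the absolute change in Lakhs depends on the price of the car in question.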

## 11. Recommendations

1. Chennai, Coimbatore, Hyderabad and Bangalore tend to have higher used car prices. We can focus more on these cities to grow the business.

2. Jaipur, Mumbai, Delhi, Pune, Kochi cities have relatively riskier markets. It'd be beneficial to do market research to strategize growth in these cities.

3. Kolkata appears to be a very risky market for used cars. Careful investment is advised here.

4. With increasing Petrol prices, Diesel cars are gaining popularity. Electric cars, although new, also have very good scope in the market. We should focus on acquiring more Diesel and Electric cars.

5. Used car prices fall as the number of previous owners rises. Thus, we should not acquire cars that have passed through too many owners. It is best to get cars from the first owner.

6. As we did during pre-processing, the cars are to be categorized in High, Mid and Low brand class cars.
    - Maruti and Honda cars are very popular in the low budget class
    - Hyundai, Toyota and Volkswagen are popular in the mid budget class
    - Mercedes, Ford, BMW and Audi cars are very popular among high budget cars
    
7. Overall, Maruti, Hyundai and Honda cars make up almost 48% of the cars sold. This can also be a focus point.

8. Automatic transmission cars sell for higher prices; we should concentrate on acquiring more automatic transmission cars.

9. The model can provide an estimated price for a newly acquired used car, so that it is never re-sold at less than the predicted price.