# Cars4U Linear Regression Project

#### Description
#### Background & Context

There is a huge demand for used cars in the Indian market today. As sales of new cars have slowed in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to find footholds in this market.

In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts that come into play only in the last stage of the customer journey, used cars are very different beasts with huge uncertainty in both pricing and supply. With this in mind, the pricing scheme for these used cars becomes important for growing in the market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it. 


#### Data Dictionary of Dataset

S.No. : Serial Number

Name : Name of the car which includes Brand name and Model name

Location : The city in which the car is being sold or is available for purchase

Year : Manufacturing year of the car

Kilometers_Driven : The total kilometers driven in the car by the previous owner(s), in km.

Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)

Transmission : The type of transmission used by the car. (Automatic / Manual)

Owner_Type : Type of ownership

Mileage : The standard mileage offered by the car company in kmpl or km/kg

Engine : The displacement volume of the engine in CC.

Power : The maximum power of the engine in bhp.

Seats : The number of seats in the car.

New_Price : The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000)

Price : The price of the used car in INR Lakhs (1 Lakh = 100,000)

### Problem Objective

1. Explore and visualize the dataset.<br>
2. Perform univariate and bivariate analysis of features. Bring out the insights in the data.<br>
3. Preprocess the data: find and treat duplicates, missing values, outliers and bad data.<br>
4. Build a linear regression model to predict the prices of used cars.<br>
5. Evaluate model performance: generate RMSE, MAE and Adjusted R-squared.<br>
6. Generate a set of insights and recommendations that will help the business.<br>
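
The three metrics named in objective 5 can be sketched in plain Python. This is a minimal illustration on made-up prices, not the evaluation code of this notebook; the function name `regression_metrics` and the sample values are assumptions. Adjusted R-squared is computed as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE and adjusted R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalises extra predictors; requires n > n_features + 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, mae, adj_r2

# Hypothetical actual and predicted prices in Lakhs
actual = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [2.5, 3.5, 6.0, 8.5, 9.5]
rmse, mae, adj_r2 = regression_metrics(actual, predicted, n_features=2)
print(rmse, mae, adj_r2)
```

RMSE and MAE are in the same unit as the target (Lakhs here), while adjusted R-squared is unitless.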


### Questions to be Answered
1. What are the features influencing the pricing of the car?<br>
2. Does the Brand or Model of the car have any impact on pricing?<br>
3. Does the location where the car is sold have any impact on pricing?<br>
4. Does the age of the car (i.e. year) have any effect on pricing?<br>
5. Do Kilometers Driven impact pricing? Are Kilometers Driven and Price inversely related?<br>
6. Do Fuel type and Transmission have any impact on car pricing?<br>
7. Is there any relation between the price of a car and Owner type? Are they inversely related, i.e. does a higher number of previous owners decrease the price?<br>
8. Is there any relation between Engine, Power and Seats and the price of a car?<br>
9. How does the standard mileage of a car affect pricing?<br>

# **************************************************************************************
# Import necessary libraries
# **************************************************************************************


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import random 
import matplotlib.pyplot as plt
sns.set(color_codes=True)
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build linear model for prediction
from sklearn.linear_model import LinearRegression
# To check model performance
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

## ******************************************************************************************************
## Loading and exploring the data
## ******************************************************************************************************

In [None]:
#Read the input file
car_data=pd.read_csv('/kaggle/input/cars4u/used_cars_data.csv')

# Retaining the actual file and taking a copy of input as working file
carsdf = car_data.copy()

Let's check the shape and some sample records in the dataset.

In [None]:
np.random.seed(1)  # To get the same random results every time
display(carsdf.sample(8))  # Print sample records
print(f'\nThere are {carsdf.shape[0]} rows and {carsdf.shape[1]} columns.\n')  # Shape of the input file


The dataset has 7253 rows and 14 columns. As you can see, the "S.No." column is not of any interest. Let's drop it.

In [None]:
#dropping S.No column from dataset
carsdf.drop(['S.No.'],axis=1,inplace=True)


#### Let's check the datatypes and size of the file

In [None]:
carsdf.info()

There are 4 numeric columns (2 float and 2 int) and 9 object columns. The memory usage reported by `info()` is 793.2 KB.

In [None]:
#Summary statistics 
carsdf.describe(include='all').T

- Name has 2041 unique values. The most frequent car is the Mahindra XUV500 W8 2WD, which appears 55 times in the dataset.
- Location has 11 unique values. Mumbai is the most frequent city, appearing 948 times.
- Kilometers_Driven ranges from 171 to 6,500,000. The maximum looks unusually high, so there may be a few outliers.
- Year ranges from 1996 through 2019. Most of the vehicles are from the 2014 manufacturing year.
- Fuel_Type has 5 unique values. Most of the vehicles are Diesel.
- Transmission has 2 unique values. Most of the vehicles are Manual.
- Owner_Type has 4 unique values. Most of the vehicles have had only a single owner.
- The most common values for Mileage (as stated by the manufacturer), Engine and Power are 17.0 kmpl, 1197 CC and 74 bhp respectively.
- The majority of cars are 5-seaters.
- Price ranges from 0.44 Lakhs to 160 Lakhs. Most cars are priced around 5.64 Lakhs.
- The most frequent New_Price value is 63.71 Lakh. This looks suspicious: taken at face value, it would suggest that most of the cars are luxury vehicles, which is unlikely. We'll explore the data in the coming sections to find out whether there is any bad or missing data.


#### Let's further explore the dataset statistics.
- How many numerical and categorical columns do we have?
- Shape (remember we have dropped S.No. already)
- Number of variables
- Missing cells
- Duplicate rows

In [None]:
#setting max column width so that long cell values don't get truncated
pd.set_option('display.max_colwidth', 100)   

#creating empty dataframe
df1=pd.DataFrame()       

#Find out numerical, categorical variables, shape of dataset, missing values and its %, dups in dataset and its %
#convert the output of the different stats into List
#fill in the dataframe and then display

#checking number of Numerical Variables using len() & select_dtypes
df1["Number of Numerical variables"] = [len(carsdf.select_dtypes('number').columns)] 

#Numerical Variables in dataset
df1["Numerical Columns"] = [carsdf.select_dtypes('number').columns.tolist()]

#number of Object(Categorical) Variables using len() & select_dtypes
df1["Number of Object variables"] = [len(carsdf.select_dtypes('object').columns)]

#Categorical Variables in dataset
df1["Object Columns"] = [carsdf.select_dtypes('object').columns.tolist()]

#Number of columns in dataset from shape
df1["Total Number of variables"] = [carsdf.shape[1]]

#Number of rows in dataset from shape
df1["Total Number of rows"] = [carsdf.shape[0]]

#Calculating number of missing cells/null values in file
df1["Missing cells"] = [carsdf.isnull().sum().sum()]

#Converting the missing cells numbers into %
df1["Missing cells(%)"] = [(carsdf.isnull().sum().sum())*100/(carsdf.shape[0] * carsdf.shape[1])]

#Calculating duplicate rows in file
df1["Duplicate rows"] = [carsdf.duplicated().sum()]

#Converting the duplicate cells numbers into %
df1["Duplicate rows(%)"] = [(carsdf.duplicated().sum()*100)/carsdf.shape[0]]

#Renaming the column name
display(df1.T.rename(columns={0:'Dataset Statistics'}))


- After dropping "S.No.", there are 4 numerical and 9 categorical (object) features. The variable names are displayed in the table above.<br>
- The dataset has 7253 rows and 13 columns.<br>
- 7628 cells (8.1%) have no value.<br>
- There is 1 duplicate row (after dropping S.No.).

#### Let's check the duplicate record and drop it

In [None]:
#carsdf[carsdf.duplicated()]
carsdf.duplicated().sum()    #after dropping S.No.

In [None]:
print('Shape of dataset before dropping dups:', carsdf.shape)
print('Dropping duplicate and retaining last occurrence')
carsdf.drop_duplicates(keep='last',inplace=True)
print('Dropping complete')
print('Confirming if duplicate still exist in file')
print('Number of duplicates in dataset:',carsdf.duplicated().sum())
print('Shape of dataset after dropping dups:', carsdf.shape)

As you can see above, we have dropped the one duplicate row, and no duplicates remain in the dataset.
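
As a small illustration of what `keep='last'` does, here is a sketch on a made-up three-row frame (the car names and prices are hypothetical). `duplicated()` flags later repeats by default, while `drop_duplicates(keep='last')` keeps the final occurrence and leaves the original row labels untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Car A", "Car B", "Car A"],
    "Price": [3.5, 7.0, 3.5],
})

n_dups = toy.duplicated().sum()              # rows that repeat an earlier row
deduped = toy.drop_duplicates(keep="last")   # keep the last copy of each duplicate
print(n_dups)
print(deduped.index.tolist())                # surviving original row labels
```

Because the row labels are not renumbered, the index has a gap after the drop; `reset_index(drop=True)` can be applied if a contiguous index is needed.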



- There are 9 categorical variables and 4 numeric variables.
- Dependent variable is Price which is of type float

#### Let's check for Unique values in the data.

In [None]:
carsdf.nunique()

#### Let's check the values in columns

In [None]:
cols=['Location','Year','Fuel_Type','Transmission','Owner_Type','Seats']
for i in cols:
    print('Unique values and count of column',i,'are:')
    display(pd.DataFrame(carsdf[i].value_counts(dropna=False,ascending=True)))

As you can see above, the variables have sensible values overall. However, Seats contains NaN and 0.0 values, which we will impute later.

#### Let's check missing values in data

In [None]:
print('Missing values:\n')
print(carsdf.isnull().sum())
print('\nMissing values(%):\n')
print(carsdf.isnull().sum() * 100 / len(carsdf))

As you can see above, New_Price has the highest share of nulls (~86%). Price, the dependent variable, has about 17% nulls. We need to check whether New_Price can be filled with the median or by some other means. If we are not able to fill it, we will drop the field.

Mileage, Engine, Power and Seats have very few missing values, so we should be able to impute them.

# ****************************************************************************************
# Data Preprocessing
# ******************************************************************************************

Let's fix bad data and missing data in all columns

#### Seats variable
Seats contains both 0 and NaN values.

In [None]:
print('Records with Seats 0')
carsdf[carsdf['Seats']==0.0]


In [None]:
print("Let's check whether the dataset has more records for the same car")
carsdf[carsdf['Name']=='Audi A4 3.2 FSI Tiptronic Quattro']

This is the only such car in the dataset. Based on domain knowledge, this car is a 5-seater.
Let's impute the value 5.

In [None]:
carsdf.loc[carsdf['Seats'] == 0.0, 'Seats'] = 5.0
print('\n Lets check if the imputation is complete \n')
display(carsdf[carsdf['Seats']==0.0])
display(carsdf[carsdf['Name']=='Audi A4 3.2 FSI Tiptronic Quattro'])

As you can see, Seats has been updated to 5.0. Now let's fix the NaN values.

In [None]:
#Cars with NaN Seat values
carsdf[carsdf['Seats'].isna()].reset_index()


We have 53 records with NaN in Seats. Let's check whether other records in the dataset with the same car name have seat values.

In [None]:
#Unique Name of the cars having Nan in Seats and converting to list
kj=carsdf[carsdf['Seats'].isna()].Name.unique().tolist()
#All the records in dataset containing Name = Car Names having Nan Seats
carsdf[carsdf['Name'].isin(kj)].reset_index()

As you can see, a couple of cars have Seats = 5.0. Let's check the value counts of all the cars that have NaN in Seats.

In [None]:
carsdf[carsdf['Seats'].isna()].Name.value_counts()

The top 3 cars are well known in the market and have a seating capacity of 5. That leaves only a small set, so it should be acceptable to impute the median value for all the records shown above.
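
Seats is a discrete count, so the statistic used for imputation matters. A minimal sketch on made-up seat values, comparing a mean fill with a median fill:

```python
import numpy as np
import pandas as pd

# Hypothetical seat counts with two missing entries
seats = pd.Series([5.0, 5.0, 5.0, 7.0, 4.0, 5.0, 8.0, np.nan, np.nan])

mean_fill = seats.fillna(seats.mean())       # may produce a fractional seat count
median_fill = seats.fillna(seats.median())   # stays on an observed seat count here
print(mean_fill.iloc[-1], median_fill.iloc[-1])
```

With these values the mean is 39/7 (about 5.57), which is not a valid number of seats, while the median is 5.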

In [None]:
print('Imputing median value for Seats with NaN values:')
# Assign the result back instead of calling inplace=True on a column selection,
# which may not update the DataFrame under pandas copy-on-write
carsdf['Seats'] = carsdf['Seats'].fillna(carsdf['Seats'].median())
print("Imputing done. Let's check if any records still have NaN Seats")
print(carsdf['Seats'].isna().sum())
print('Unique Seat values:')
carsdf['Seats'].value_counts()

As you can see above, all the bad records in Seats are fixed.

In [None]:
#Converting data type
carsdf["Fuel_Type"] = carsdf["Fuel_Type"].astype("category")
carsdf["Transmission"] = carsdf["Transmission"].astype("category")
carsdf["Owner_Type"] = carsdf["Owner_Type"].astype("category")
carsdf["Seats"] = carsdf["Seats"].astype("float")
carsdf["Location"] = carsdf["Location"].astype("category")
carsdf["Year"]=carsdf["Year"].astype("int")

In [None]:
#checking the datatype of columns after conversion
carsdf.info()

In [None]:
carsdf.isnull().sum()

### Mileage, Engine and Power variables

Before handling NaNs, let us convert Mileage, Engine and Power to numeric types. We can remove the units (kmpl or km/kg, CC and bhp respectively) from these fields, convert them to float and then handle the NaNs.
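
An alternative way to do the same conversion in one pass is to extract the leading number and let `pd.to_numeric` coerce anything unparseable (such as `'null bhp'`) to NaN. This is a sketch on made-up rows, not the approach used in the cells below; the helper name `leading_number` is an assumption.

```python
import pandas as pd

# Hypothetical raw values in the same shape as the dataset columns
raw = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None],
    "Engine": ["998 CC", "1582 CC", None],
    "Power": ["58.16 bhp", "null bhp", "126.2 bhp"],
})

def leading_number(col):
    """Pull the leading numeric token out of strings such as '58.16 bhp'."""
    token = col.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(token, errors="coerce")  # no match or missing -> NaN

clean = raw.apply(leading_number)   # applied column by column
print(clean)
```

One design note: this treats every unit the same way, so km/kg and kmpl values end up in a single Mileage column, exactly as in the step-by-step version.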

In [None]:
carsdf[['Mileage','Engine','Power']].sample(5)

In [None]:
carsdf.Power.unique()

In [None]:
#remove bhp
carsdf['Power'] = carsdf['Power'].replace(regex='bhp',value='') 
#check unique values now
carsdf.Power.unique()

Notice that the data contains the string 'null'. Let's change it to np.nan.

In [None]:
#Change 'null' to np.nan
carsdf['Power'] = carsdf['Power'].replace(regex='null',value=np.nan) 
#check unique values now
carsdf.Power.unique()

In [None]:
#Similarly lets update Mileage and Engine too.
carsdf['Mileage'] = carsdf['Mileage'].replace(regex='kmpl',value='') 
carsdf['Mileage'] = carsdf['Mileage'].replace(regex='km/kg',value='') #Mileage has 2 units
carsdf['Engine'] = carsdf['Engine'].replace(regex='CC',value='') 
print('Unique values of Mileage:\n')
print(carsdf.Mileage.unique())
print('Unique values of Engine:\n')
print(carsdf.Engine.unique())

In [None]:
#After removing units from these fields, lets check the values and see if we can convert to float 
carsdf[['Engine','Power','Mileage']].sample(10)

In [None]:
#converting the fields type to float  
carsdf["Mileage"] = carsdf["Mileage"].astype("float")
carsdf["Power"] = carsdf["Power"].astype("float")
carsdf["Engine"] = carsdf["Engine"].astype("float")

#check the type after conversion
carsdf.dtypes

In [None]:
#rechecking missing values
carsdf.isnull().sum()

In [None]:
#For Mileage, Engine and Power, we will impute the median of the respective Brands.
#Creating 2 new variables BRAND and MODEL from Name and drop the Name

carsdf['Brand'] = carsdf['Name'].str.split(' ', n=1, expand=True)[0].astype('category') 
carsdf['Model'] = carsdf['Name'].str.split(' ', n=1, expand=True)[1].astype('category') 
carsdf.drop(['Name'], axis=1, inplace=True)

carsdf.info()  # info() prints its report and returns None, so display() is not needed
display(carsdf.head())

In [None]:
#Let's check if there are any bad data in the dataset for Mileage, Engine and Power
print('Mileage with 0.0:',carsdf[carsdf['Mileage']==0.0]['Mileage'].count())
print('Engine with 0.0:',carsdf[carsdf['Engine']==0.0]['Engine'].count())
print('Power with 0.0:',carsdf[carsdf['Power']==0.0]['Power'].count())

81 cells have Mileage = 0.0. Let's change them to NaN so that we can impute all the NaNs together.

In [None]:
carsdf.loc[carsdf["Mileage"]==0.0,'Mileage']=np.nan

print('Missing values in Mileage, Engine and Power')
carsdf[['Mileage','Engine','Power']].isnull().sum()


In [None]:
print('Imputing median values based on Brand for Mileage, Engine, Power')
# transform returns a result aligned to the original index, so it can be assigned back directly
for col in ['Engine', 'Mileage', 'Power']:
    carsdf[col] = carsdf.groupby('Brand', observed=True)[col].transform(lambda val: val.fillna(val.median()))
print('After Imputing')
carsdf[['Mileage','Engine','Power']].isnull().sum()

Let's check the records that still have NaN in Mileage and Power.
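
Values can survive a brand-wise median fill when a brand has no observed value at all, because the median of an all-NaN group is itself NaN. A minimal sketch on made-up rows (the brand assignments and numbers are hypothetical), written here with `transform`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Smart", "Hindustan"],
    "Power": [74.0, np.nan, 88.5, np.nan, np.nan],
})

# Brand-wise median fill on the toy data
toy["Power_filled"] = toy.groupby("Brand")["Power"].transform(
    lambda s: s.fillna(s.median())
)

# count() ignores NaN, so a count of 0 marks brands with nothing to take a median from
observed_per_brand = toy.groupby("Brand")["Power"].count()
empty_brands = observed_per_brand[observed_per_brand == 0].index.tolist()
print(empty_brands)
print(toy["Power_filled"].isna().sum())
```

Rows belonging to such brands need a different fallback, for example the overall median.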

In [None]:
display(carsdf[carsdf['Mileage'].isnull()])
display(carsdf[carsdf['Power'].isnull()])

In [None]:
print('Imputing median for missing Mileage and Power\n')
carsdf.loc[carsdf['Mileage'].isnull(), 'Mileage'] = carsdf['Mileage'].median()
carsdf.loc[carsdf['Power'].isnull(), 'Power'] = carsdf['Power'].median()
print('After Imputing, Missing values in the variables')
carsdf.isnull().sum()

####  New_Price variable

In [None]:
#Checking unique value counts for New_Price
carsdf.New_Price.value_counts()

In [None]:
#lets check the unit of New_Price
carsdf['New_Price'].str.split(' ', n=1, expand=True)[1].value_counts()

In [None]:
#Records with New_Price unit in Crores
carsdf[carsdf['New_Price'].str.endswith('Cr')==True]

In [None]:
#Converting 'Cr' to 'Lakh' and removing 'Lakh' from the variable

def price_to_num(new_prc):
    """This function takes in a string representing a price in Lakhs or Crores
    and converts it to a number (Unit = Lakh). 1 Cr = 100 Lakh
    If the input is not a string, which probably means it's NaN,
    this function just returns np.nan."""
    if isinstance(new_prc, str):  # checks if `new_prc` is a string
        multiplier = 1  # default: value is already in Lakhs
        if new_prc.endswith('Lakh'):
            multiplier = 1
        elif new_prc.endswith('Cr'):
            multiplier = 100  # 1 Cr = 100 Lakh
        return float(new_prc.replace('Cr', '').replace('Lakh', '')) * multiplier
    else:  # this happens when the new price is np.nan
        return np.nan

carsdf['New_Price'] = carsdf['New_Price'].apply(price_to_num)

#sample converted values
carsdf['New_Price'].value_counts().sample(5) 

In [None]:
# Let's check a sample model whose New_Price was given in Cr before the conversion
carsdf[carsdf['Model']=='RS5 Coupe']


Looking at values, the conversion is complete

In [None]:
#lets check for missing values and impute them

carsdf['New_Price'].isna().sum()

We have a huge number of missing values. Let's create a working copy of the variable, do the imputation there and then
update the New_Price variable.
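
The two-level idea (fill from the same Brand and Model first, then fall back to the Brand) can be sketched on a toy frame. The brand and model labels below are placeholders, and the prices are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["BrandX", "BrandX", "BrandX", "BrandX", "BrandY"],
    "Model": ["M1", "M1", "M2", "M2", "M3"],
    "New_Price": [10.0, np.nan, np.nan, np.nan, np.nan],
})

# Level 1: median of rows sharing both Brand and Model
level1 = toy.groupby(["Brand", "Model"])["New_Price"].transform("median")
step1 = toy["New_Price"].fillna(level1)

# Level 2: median of rows sharing the Brand, computed after the level-1 fill
level2 = step1.groupby(toy["Brand"]).transform("median")
filled = step1.fillna(level2)
print(filled.tolist())
```

BrandY has no observed price at either level, so its row stays NaN; this is why some missing values can remain after both steps.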

In [None]:
#Approach:
#We will first impute based on Brand+Model. Next, for remaining missing ones, we can update based on Brand.
carsdf['new_price_bkp'] = carsdf['New_Price']
#transform keeps the original row index; observed=True skips Brand/Model combinations that never occur
carsdf['new_price_bkp'] = carsdf.groupby(['Brand','Model'], observed=True)['new_price_bkp'].transform(lambda val: val.fillna(val.median()))

In [None]:
print('After imputing the median based on Brand & Model, missing values in New_Price:')
carsdf['new_price_bkp'].isna().sum()

In [None]:
#Imputing based on Brand only
carsdf['new_price_bkp'] = carsdf.groupby(['Brand'], observed=True)['new_price_bkp'].transform(lambda val: val.fillna(val.median()))

In [None]:
print('After imputing the median based on Brand only, missing values in New_Price:')
carsdf['new_price_bkp'].isna().sum()

We were successful in significantly reducing the number of missing values in New_Price.
Let us substitute the New_Price with new_price_bkp and drop new_price_bkp

In [None]:
carsdf['New_Price']=carsdf['new_price_bkp']

#drop 'new_price_bkp'
carsdf.drop(['new_price_bkp'],axis=1,inplace=True)

In [None]:
#checking if drop was successful
carsdf.dtypes

In [None]:
#Sanity check : Unique value count
carsdf.nunique()

Let's check the unique values in the variables. For variables with a large number of unique values, we will sort the values
and look at the 5 smallest and 5 largest to find out whether we have any extreme values.
We won't consider Model, as it has a huge number of unique values (high cardinality). We'll drop the field later.

In [None]:
cols=['Location','Year','Fuel_Type','Transmission','Owner_Type','Seats','Brand']
for i in cols:
    print('\nUnique values of column',i,'are:')
    print(carsdf[i].unique().tolist())

#columns with unique values which are in huge number
cols2=['Kilometers_Driven','Mileage','Engine','Power','New_Price','Price']

for j in cols2:
    print('\n Unique top 5 min',j, ' values:')
    print(carsdf[j].sort_values(ascending=True).unique().tolist()[:5])
    print('\n Unique top 5 max',j,' values:')        
    print(carsdf[j].sort_values(ascending=False).unique().tolist()[:5])

- Brand contains both 'Isuzu' and 'ISUZU'. They are the same brand, so the case needs to be normalized.
- 'Land' is Land Rover and needs to be updated.
- Kilometers_Driven has a few extreme values at the higher end. We'll handle them in outlier treatment later.
- Price has values > 100. We'll check these later.
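
One common way to flag such extreme values is the 1.5 × IQR rule. The sketch below is only an illustration, not necessarily the outlier treatment applied later; the minimum (171) and maximum (6,500,000) come from the summary statistics, and the values in between are made up.

```python
import pandas as pd

km = pd.Series([171, 25000, 40000, 53000, 61000, 73000, 90000, 6500000])

q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr      # fence below the first quartile
upper = q3 + 1.5 * iqr      # fence above the third quartile
outliers = km[(km < lower) | (km > upper)]
print(outliers.tolist())
```

Only the 6,500,000 km reading falls outside the fences here; the very low value of 171 does not, because the lower fence is negative.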

In [None]:
#Updating Brand names

carsdf.loc[carsdf['Brand'] == 'ISUZU','Brand'] = 'Isuzu'
carsdf['Brand']=carsdf['Brand'].cat.remove_categories('ISUZU')  #Remove the value ISUZU

#Land Rover is a new value and hence should be added first
carsdf['Brand']= carsdf['Brand'].cat.add_categories('Land Rover')  
carsdf.loc[carsdf['Brand'] == 'Land','Brand'] = 'Land Rover' 
carsdf['Brand']= carsdf['Brand'].cat.remove_categories('Land') 

print(carsdf['Brand'].unique().tolist())
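
An alternative to editing the categories one by one is to map the labels on plain strings and then rebuild the categorical, which discards stale categories automatically. A sketch on a small made-up series, assuming the column has no missing values (`astype(str)` would turn NaN into the string `'nan'`); the dictionary name `fixes` is hypothetical.

```python
import pandas as pd

brands = pd.Series(["Isuzu", "ISUZU", "Land", "Maruti"], dtype="category")
fixes = {"ISUZU": "Isuzu", "Land": "Land Rover"}

# Work on plain strings, then rebuild the categorical so unused labels vanish
cleaned = brands.astype(str).replace(fixes).astype("category")
print(cleaned.cat.categories.tolist())
```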


In [None]:
#Missing values in New_Price now.
carsdf['New_Price'].isna().sum()

In [None]:
carsdf.isna().sum()

In [None]:
print(carsdf.isna().sum())
print(carsdf.info())

#### We have completed the preprocessing of the columns in dataset. 

## ******************************************************************************************************
## Univariate and Bivariate  analysis
## ******************************************************************************************************

### Univariate analysis

In [None]:
#For Univariate analysis, let's write a function that stacks a boxplot above a histogram
#so that we can visualize outliers & distributions together

def hist_box_plot(feature, figsize=(15, 6), bins=None):
    """Boxplot and histogram combined. Input is a numerical feature
    feature: 1-d feature array
    figsize: size of fig (default (15, 6))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
                                            nrows=2,  # Number of rows of the subplot grid= 2
                                            sharex=True,  # x-axis will be shared among all subplots
                                            gridspec_kw={"height_ratios": (0.25, 0.75)},
                                            figsize=figsize,
                                          )  # creating the 2 subplots
    
    # For boxplot. Marker indicates mean value of column.  
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color="yellow")  
    
    # For histogram (histplot replaces the deprecated distplot)
    sns.histplot(x=feature, kde=False, ax=ax_hist2, bins=bins if bins else "auto", color='violet')
    
    ax_hist2.axvline(feature.mean(), color="green", linestyle="--", label='Mean')  # Add mean to the histogram
    ax_hist2.axvline(feature.median(), color="red", linestyle="-", label='Median')  # Add median to the histogram
    
    plt.legend() #display legend

#### Year Analysis

In [None]:
hist_box_plot(carsdf['Year'])

##### Year Observation:
1. Year is negatively skewed (Mean < Median), with a few outliers on the lower side.
2. The number of cars rises with manufacturing years after 2000 and peaks around 2015.
3. Fewer cars are from manufacturing years after 2015.

#### Kilometers_Driven Analysis

In [None]:
hist_box_plot(carsdf['Kilometers_Driven'])
#hist_box_plot(np.log(carsdf['Kilometers_Driven']))

##### Kilometers_Driven Observation
1. Heavily right-skewed. A log transformation should be applied.
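
The effect of a log transformation on skewness can be checked numerically. A minimal sketch, reusing the minimum and maximum from the summary statistics with made-up values in between:

```python
import numpy as np
import pandas as pd

km = pd.Series([171, 25000, 40000, 53000, 61000, 73000, 90000, 6500000], dtype="float64")
log_km = np.log(km)          # natural log; all values are strictly positive

raw_skew = km.skew()
log_skew = log_km.skew()
print(raw_skew, log_skew)
```

On this toy series the skewness of the logged values is much closer to zero than that of the raw values, which is the behaviour the observation above relies on.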

#### Mileage Analysis

In [None]:
hist_box_plot(carsdf['Mileage'])

##### Mileage Observation:
- Roughly symmetric distribution; the Mean and Median are approximately equal
- Used cars with a Mileage of about 15-20 kmpl are the most common


#### Engine Analysis

In [None]:
hist_box_plot(carsdf['Engine'])

##### Engine Observation:
1. Median < Mean. Positively skewed
2. Few Outliers observed in data having high Engine units
3. Most cars have Engine between 1000 to 1500 CC

#### Power Analysis

In [None]:
hist_box_plot(carsdf['Power'])

##### Power Observation:
1. Median < Mean. Positive skew
2. Few outliers in the data having high values

#### Seats Analysis

In [None]:
hist_box_plot(carsdf['Seats'])

##### Seats Observations:
1. Values are pretty discrete.
2. Most of the cars have 5 Seats

#### New_Price Analysis

In [None]:
hist_box_plot(carsdf['New_Price'])

##### New_Price Observation:
1. Positive skew where median < mean. Log transformation may need to be applied
2. Few outliers in the data.

#### Price Analysis

In [None]:
hist_box_plot(carsdf['Price'])

##### Price Observation:
1. Positive skew (Median < Mean). May require a log transformation. Since Price is the dependent variable, predictions from a model fitted on log(Price) would need to be exponentiated to be read in Lakhs.
2. Few outliers in data. One or more car being sold at 160 Lakhs.
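
If the target is log-transformed, predictions come back on the log scale and have to be exponentiated. A small sketch with NumPy, using three price values quoted in the summary above and a stand-in for model output (the `+ 0.1` offset is purely hypothetical):

```python
import numpy as np

price_lakh = np.array([0.44, 5.64, 160.0])   # prices in Lakhs
log_price = np.log(price_lakh)               # scale the model would be trained on

pred_log = log_price + 0.1                   # hypothetical model output on the log scale
pred_lakh = np.exp(pred_log)                 # back to Lakhs
ratio = pred_lakh / price_lakh
print(ratio)
```

A constant error on the log scale becomes a constant percentage error in Lakhs (a factor of exp(0.1), roughly 10.5% here), which is worth keeping in mind when reading RMSE or MAE computed on log prices.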

In [None]:
#Barplots for categorical variables

def bar_plot(data, z):
    """
    This function draws a bar plot of a categorical feature and adds the % at the top of each bar
    """
    total = len(data[z])  # length of the column
    plt.figure(figsize=(18, 6))   #Plot size
    plt.xticks(rotation=45)     #Rotate the x axis variables to 45 degree
    plt.title( z + ' Plot',fontweight='bold')    #title of the plot
    #plot countplot.  
    ax = sns.countplot(
            x=data[z], 
            palette="Spectral", 
            order=data[z].value_counts(ascending=False).index #Sorting the variable count in descending order
    )   #i.e the variable with highest count appears in the left side of plot, followed by others in descending order
    for p in ax.patches:                       
        percentage = "{:.1f}%".format(100 * p.get_height() / total)  # % of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # horizontal position: near the centre of the bar
        y = p.get_y() + p.get_height()  # vertical position: top of the bar
        ax.annotate(percentage, (x, y), size=10)  # annotate the percentage
    plt.show()  # show the plot

#### Location Analysis

In [None]:
bar_plot(carsdf,"Location")

##### Location Observation:
1. Mumbai has the highest number of cars in the dataset, followed by Hyderabad.
2. Ahmedabad has the fewest cars compared to other cities, followed by Bangalore.

#### Fuel_Type Analysis

In [None]:
bar_plot(carsdf,"Fuel_Type")

##### Fuel_Type Observation
1. Diesel vehicles are the most common, followed by Petrol.
2. Electric vehicle sales are minimal. The company should research the reasons, since climate change is a widely discussed topic. There is high potential to increase sales after conducting this research and incorporating the findings into marketing.

#### Transmission Analysis

In [None]:
bar_plot(carsdf,"Transmission")

##### Transmission Observation:
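1. Manual cars are the most common in India (as expected).
2. One possible explanation, not tested with this data, is a preference for manual cars on uneven roads (most common in India). Note that the automatic cars in this dataset tend to have more power, not less.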

#### Owner_Type Analysis            


In [None]:
bar_plot(carsdf,"Owner_Type")

##### Owner_Type Observation:
Customers mainly look for cars with only one previous owner, which account for 82% of sales.

#### Brand Analysis

In [None]:
bar_plot(carsdf,"Brand")

##### Brand Observation:
1. Top 5 most selling brands - Maruti, Hyundai, Honda, Toyota, Mercedes Benz
2. Least 5 selling brands - Ambassador, Smart, Hindustan, OpelCorsa, Lamborghini

### Bivariate Analysis

##### Understanding correlation between numerical variables based on Pairplot, Correlation Matrix and Heatmap 

In [None]:
#Generating pairplot to show correlation between numeric variables
sns.pairplot(carsdf, hue='Transmission')
plt.show()

In [None]:
#Check the correlation of numeric variables and generate heatmap

plt.figure(figsize = (14,8))
#create your own palette to show positive values in blue and negative values in red
cmap=sns.diverging_palette(5, 260, as_cmap=True)
#correlation is defined only for numeric columns, so select them explicitly
corr_matrix = carsdf.select_dtypes('number').corr()
#plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap=cmap, vmin=-1, vmax=1)
plt.title('Heatmap', fontweight='bold')         #Chart title

#Correlation matrix
corr_matrix

##### Looking at above Pairplot, Correlation Matrix and Heatmap, we can conclude:

High Positive Correlation between

- Power - Engine
- Power - New Price 
- Power - Price
- Price - New Price

High Negative Correlation between

- Engine - Mileage
- Power - Mileage
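
Reading strong pairs off a heatmap by eye can be complemented with a small helper that lists them. This is a sketch on made-up numbers; the function name `strong_pairs` and the 0.7 threshold are assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.7):
    """List variable pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Hypothetical values for four of the numeric columns
toy = pd.DataFrame({
    "Engine": [1000, 1200, 1500, 2000, 3000],
    "Power": [60, 80, 100, 150, 250],
    "Mileage": [24, 21, 18, 14, 10],
    "Seats": [5, 5, 7, 5, 4],
})
result = strong_pairs(toy)
print(result)
```

On the toy data the helper reports Engine and Power as strongly positive and both Engine and Power against Mileage as strongly negative, mirroring the pattern described above.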




##### Understanding Price relationship with Categorical variables

#### Price vs Brand Analysis

In [None]:
#Average price of cars by brand and sorting in descending order
order=carsdf.groupby(['Brand'])['Price'].mean().fillna(0).sort_values(ascending= False).index
#barplot
sns.catplot(x="Brand", y="Price", data=carsdf, kind='bar', height=7, aspect=2, order=order).set(title='Price by Brand') 
plt.xticks(rotation=45);

##### Price vs Brand Observation
- The most expensive brand is Lamborghini, followed by Bentley, Porsche, Land Rover and Jaguar
- The least expensive brand is OpelCorsa, followed by Hindustan, Ambassador, Smart and Chevrolet

#### Price vs Location Analysis

In [None]:
#Average price of cars by Location and sorting in descending order
order=carsdf.groupby(['Location'])['Price'].mean().fillna(0).sort_values(ascending= False).index

sns.catplot(x="Location", y="Price", data=carsdf, kind='bar', height=5, aspect=2, order=order, palette='Spectral').set(title='Price by Location') 
plt.xticks(rotation=45);

##### Price vs Location
Used Cars  are costly in Coimbatore, Bangalore , Kochi.

Used Cars are cheapest in Kolkata, Jaipur, Pune.

#### Price vs Transmission Analysis

In [None]:
sns.catplot(x="Fuel_Type",y="Price",col="Transmission",data=carsdf,kind='bar') 
plt.show()

##### Price vs Transmission Observation:
1. Diesel cars are the costlier cars.
2. Electric cars are costlier in Automatic Transmission compared to Manual Transmission.

#### Price vs Owner_Type Analysis

In [None]:
sns.catplot(x="Owner_Type",y="Price",data=carsdf,kind="bar").set(title='Price by Owner_Type') 
plt.show()

##### Price vs Owner_Type Observation:
Cars with only one previous owner are the costliest.

##### Understanding Price relationship with other Numerical variables

#### Price vs Mileage Analysis

In [None]:
sns.relplot(x='Mileage', 
            y='Price', 
            data=carsdf, 
            s=80,  #Marker Size
            height=5, 
            aspect=1.5
            ).set(title='Price by Mileage')  
plt.show()

##### Price vs Mileage Observation:
Cars with lower Mileage (kmpl) tend to have higher prices.

#### Price vs Engine Analysis

In [None]:
sns.relplot(x='Engine', 
            y='Price', 
            data=carsdf, 
            s=80,  #Marker Size
            height=5, 
            aspect=1.5
            ).set(title='Price by Engine') 
plt.show()

##### Price vs Engine Observation:
Cars with high CC are costly compared to ones having low CC

#### Price vs Power Analysis

In [None]:
sns.relplot(x='Power', 
            y='Price', 
            data=carsdf, 
            s=80,  #Marker Size
            height=5, 
            aspect=1.5
            ).set(title='Price by Power') 
plt.show()

##### Price vs Power Observation
Price of Car increases as Power increases.
Inventory has more of low power cars as they sell the most

#### Price vs Year Analysis

In [None]:
plt.figure(figsize=(20,8))
sns.swarmplot(data=carsdf, x='Year', y='Price',palette='flare').set(title='Price by Year') 
plt.show()

##### Price vs Year Observation:
Newer cars (later manufacturing year) tend to have higher prices.
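
Since Year is the manufacturing year, the same relationship can be expressed through the age of the car. A minimal sketch with made-up prices; the column name `Car_Age` and the choice of the newest model year as the reference point are assumptions.

```python
import pandas as pd

# Hypothetical years and prices (Lakhs)
toy = pd.DataFrame({"Year": [1996, 2010, 2014, 2019], "Price": [0.9, 3.2, 5.6, 12.0]})

reference_year = toy["Year"].max()               # assumption: newest model year stands in for "now"
toy["Car_Age"] = reference_year - toy["Year"]    # age in years
age_price_corr = toy[["Car_Age", "Price"]].corr().loc["Car_Age", "Price"]
print(toy["Car_Age"].tolist(), age_price_corr)
```

Age is a linear function of Year, so its correlation with Price has the same magnitude as Year's but the opposite sign.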

#### Price vs Kilometers_Driven Analysis

In [None]:
sns.relplot(x='Kilometers_Driven', 
            y='Price', 
            data=carsdf, 
            s=80,  #Marker Size
            height=5, 
            aspect=1.5
            ).set(title='Price by Kilometers_Driven') 
plt.show()

##### Price vs Kilometers_Driven Observation
Cars having low Kilometers driven are costly

#### Price vs Seats Analysis

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(data=carsdf, x='Seats', y='Price',palette='Spectral').set(title='Price by Seats') 
plt.show()

##### Price vs Seats Observation:
1. 2-seater cars are costly
2. 5-seater cars have widely varying prices
3. 9- and 10-seater cars are priced lower

#### Car Brand vs Location Sales

In [None]:
#Brand vs Location sales
#crosstab aligns its inputs on the row index and sorts the labels, so no explicit sort is needed
pd.crosstab(index=carsdf.Brand, columns=carsdf.Location, margins=True, margins_name='Total')

The table above shows the number of cars per brand in each location

In [None]:
sns.catplot(data=carsdf, 
              y="Brand", 
              kind='count', 
              height=8, 
              aspect=3 ,
              row="Location",
              palette="flare",
              order = carsdf["Brand"].value_counts(ascending=False).index
           )
plt.show()

#### Brand vs Location Observation
Most selling brands across various locations:
- Ahmedabad: Maruti
- Bangalore: Hyundai
- Chennai: Maruti
- Coimbatore: Hyundai
- Delhi: Maruti
- Hyderabad: Maruti
- Jaipur: Maruti
- Kochi: Hyundai
- Kolkata: Hyundai
- Mumbai: Maruti
- Pune: Maruti
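
The per-location leaders listed above can also be read off a crosstab programmatically rather than by eye. A minimal sketch on a small made-up frame (the rows are illustrative, not the Cars4U data), using `idxmax` to pick the most frequent brand in each location:

```python
import pandas as pd

# Toy stand-in for carsdf; values are invented for illustration only
toy = pd.DataFrame({
    "Location": ["Pune", "Pune", "Pune", "Kochi", "Kochi", "Kochi"],
    "Brand":    ["Maruti", "Maruti", "Honda", "Hyundai", "Hyundai", "Maruti"],
})

# Rows = locations, columns = brands, cells = number of cars
counts = pd.crosstab(index=toy["Location"], columns=toy["Brand"])

# For each location, the brand (column label) with the highest count
top_brand = counts.idxmax(axis=1)
print(top_brand)
```

On ties, `idxmax` returns the first brand in column order, so close counts still deserve a manual look.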

#### Average Price of Car Sold at Various Locations

In [None]:
#Computing Average Sale Price of Car Across various locations
pd.set_option("display.max_rows", None) #to display all rows
pd.DataFrame(carsdf.groupby(['Brand', 'Location'])['Price'].mean().fillna(0))

### Multivariate Analysis

##### Price vs Engine vs Transmission

In [None]:
plt.figure(figsize=(14,6))
plt.title("Price vs Engine vs Transmission")
sns.scatterplot(x='Engine', y='Price', hue='Transmission', data=carsdf, palette='autumn')
plt.show()

Automatic cars tend to have larger engines than manual cars and tend to be pricier

##### Price vs Power vs Transmission

In [None]:
plt.figure(figsize=(14,6))
plt.title("Price vs Power vs Transmission")
sns.scatterplot(x='Power', y='Price', hue='Transmission', data=carsdf, palette='flare')
plt.show()

Automatic cars tend to have more power than manual cars and tend to be costlier

##### Price Vs Mileage Vs Transmission

In [None]:
plt.figure(figsize=(14,6))
plt.title("Price vs Mileage vs Transmission")
sns.scatterplot(x='Mileage', y='Price', hue='Transmission', data=carsdf, palette='prism')
plt.show()

##### Price vs Year vs Transmission

In [None]:
plt.figure(figsize=(14,7))
plt.title("Price vs Year vs Transmission")
sns.lineplot(x='Year', y='Price',hue='Transmission',data=carsdf)
plt.show()

Prices of automatic cars rise more steeply with manufacturing year than prices of manual cars

### Insights Based on EDA:

#### Dataset Statistics:
- Dataset has 7253 rows and 14 columns
- 5 numeric variables and 9 categorical/object variables
- Missing cell % ~ 8%
- Year variable is negatively skewed
- Mileage is uniformly distributed
- Kilometers_Driven, Price and New_Price are heavily skewed. Outliers present
- High Positive Correlation between
> Power - Engine <br>
Power - New Price <br>
Power - Price <br>
Price - New Price <br>
- High Negative Correlation between
> Engine - Mileage<br>
Power - Mileage<br>
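
One way to list such pairs without reading a heatmap is to filter the correlation matrix by a threshold. This is a sketch on synthetic data; the 0.7 cut-off and the helper name `strong_pairs` are assumptions for illustration, not values used in this notebook:

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic example: Power grows with Engine, Mileage falls with Engine
rng = np.random.default_rng(0)
engine = rng.uniform(800, 3000, 200)
toy = pd.DataFrame({
    "Engine": engine,
    "Power": 0.08 * engine + rng.normal(0, 5, 200),
    "Mileage": 30 - 0.006 * engine + rng.normal(0, 1, 200),
    "Seats": rng.integers(2, 8, 200),
})
for a, b, r in strong_pairs(toy):
    print(f"{a} - {b}: {r}")
```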

#### Car Profile:
- Maruti is the top-selling brand, followed by Hyundai, Honda, Toyota and Mercedes-Benz
- Ambassador is the least-selling brand, followed by Smart, Hindustan, OpelCorsa and Lamborghini
- The most expensive brand is Lamborghini, followed by Bentley, Porsche, Land Rover and Jaguar
- The least expensive brand is Hindustan, followed by OpelCorsa, Ambassador, Smart and Chevrolet
- Mumbai has the highest sales, followed by Hyderabad. Ahmedabad has the lowest sales, followed by Bangalore
- Most selling cars across various location:
> Ahmedabad: Maruti<br>
> Bangalore: Hyundai<br>
> Chennai: Maruti<br>
> Coimbatore: Hyundai<br>
> Delhi: Maruti<br>
> Hyderabad: Maruti<br>
> Jaipur: Maruti<br>
> Kochi: Hyundai<br>
> Kolkata: Hyundai<br>
> Mumbai: Maruti<br>
> Pune: Maruti<br>
- Used car sales started picking up after 2000, peaked in 2015 and declined afterwards.
- Used cars are costly in Coimbatore, Bangalore and Kochi, and cheap in Kolkata, Jaipur and Pune.
- Most customers prefer 5-seater cars with only one previous owner.
- 2-seater cars are costly. 5-seater car prices vary significantly.
- Diesel vehicles are the most popular. Electric vehicles are the least popular.
- Diesel cars are the costliest cars. Electric cars are costlier with automatic transmission.
- Cars with an engine of 1000-1500 CC, mileage of 15-20 and power below 100 bhp sell the most.
- Price increases as Power increases. The inventory has more low-power cars, as they sell the most.

##### We have completed analysis of all the variables and relations

# ********************
# Variable Transformation
# ********************



In [None]:
#Dropping NANs rows
print('Shape of file before dropping Nans :',carsdf.shape)
carsdf.dropna(inplace=True,axis=0)
print('Shape of file after dropping Nans :',carsdf.shape)

##### Based on EDA, the data shows skewness and outliers. We may need to apply log transformations and treat outliers before building the model

Kilometers_Driven, New_Price and Price are heavily skewed, so a log transformation may be a good fit.

In [None]:
pd.set_option("display.max_rows", 200) #resetting it back to 200 rows
#Checking to see if there are zeros or negative values (log is undefined for them)
#Summing the boolean mask counts rows; .count().sum() would count cells across all columns
print('Kilometers_Driven with values <=0 : ',(carsdf['Kilometers_Driven']<=0).sum())
print('New_Price with values <=0 : ',(carsdf['New_Price']<=0).sum())
print('Price with values <=0 : ',(carsdf['Price']<=0).sum())

#### Outlier detection using IQR

Outliers in the data can distort predictions. The challenge is to know whether a value is truly an outlier or an interesting finding that should be kept. Let's compute the IQR (the interval from the 1st quartile to the 3rd quartile) and flag entries that lie more than 1.5*IQR away from the median as outliers. The values are displayed in %.

In [None]:
def frac_outside_1pt5_IQR(x):
    length = 1.5 * np.diff(np.quantile(x, [.25, .75]))
    return np.mean(np.abs(x - np.median(x)) > length[0])

num_cols=['Year','Kilometers_Driven','Mileage','Engine','Power','Seats','New_Price','Price']

for i in num_cols:
    print(i+' variable')
    print('Before log ',frac_outside_1pt5_IQR(carsdf[i])*100)
    print('After log ',frac_outside_1pt5_IQR(np.log(carsdf[i]))*100,'\n')
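
`frac_outside_1pt5_IQR` measures distance from the median. A more common convention, Tukey's fences, flags points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The sketch below computes both on a small skewed sample so the two rules can be compared; `frac_outside_tukey` is a hypothetical helper, not part of this notebook:

```python
import numpy as np

def frac_outside_1pt5_IQR(x):
    # Same rule as the notebook: farther than 1.5*IQR from the median
    q1, q3 = np.quantile(x, [0.25, 0.75])
    return np.mean(np.abs(x - np.median(x)) > 1.5 * (q3 - q1))

def frac_outside_tukey(x):
    # Tukey's fences: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return np.mean((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))

sample = np.array([1, 2, 2, 3, 3, 3, 4, 4, 6, 40], dtype=float)
print("median rule :", frac_outside_1pt5_IQR(sample))
print("Tukey fences:", frac_outside_tukey(sample))
```

Because the median-centred band always lies inside the fences, the median rule flags at least as many points as Tukey's rule, which is worth keeping in mind when reading the percentages.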

Let's inspect the outlier values by widening the range to 4*IQR. 'Year' and 'Seats' are ignored, as their values looked fine in the univariate analysis. Also notice that applying log to the variables has increased the percentages above.

In [None]:
#Function to calculate and display outlier values

def outlier_values(df,col):
    ''' 
    Calculate and display outlier values which are more than 4*IQR away from the median
    '''
    print('Determining outlier values for: ',col)
    quartiles = np.quantile(df[col][df[col].notnull()], [.25, .75])
    col_4iqr = 4 * (quartiles[1] - quartiles[0])
    print(f'\nQ1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {col_4iqr}')
    outlier_vals = df.loc[np.abs(df[col] - df[col].median()) > col_4iqr, col]
    print('\nOutlier values:\n',np.sort(outlier_vals.unique()))
    print('---------------------------------------------------------')

cols=['Kilometers_Driven','Mileage','Engine','Power','New_Price','Price']
for i in cols:
    outlier_values(carsdf,i)

- New_Price had a lot of nulls, which we imputed with median values. The data looks continuous, so its values are not examined further
- Price is the dependent variable; values > 90 need a closer look
- Engine, Power and Kilometers_Driven should be checked

In [None]:
#checking Engine & Power values

eng = [4806.0,4951.0,5000.0,5461.0,5998.0]
powr = [362.07,362.9,364.9,367.0,382.0,387.3,394.3,395.0,402.0,421.0,
        444.0,450.0,488.1,500.0,503.0,550.0,552.0]
for x in eng:
    display(carsdf[carsdf['Engine']==x][['Brand','Model','Engine']])

for y in powr:
    display(carsdf[carsdf['Power']==y][['Brand','Model','Power']])   

All of them are luxury cars, which usually have high power. None of the Engine or Power values are treated as outliers.

In [None]:
#Checking Kilometers_Driven

km=[250000,255000,262000,282000,299322,300000,445000,480000,620000,720000,775000,6500000]

for j in km:
    display(carsdf[carsdf['Kilometers_Driven']==j][['Year','Kilometers_Driven','Engine','Owner_Type','New_Price','Price','Brand','Model']])

Assuming Kilometers_Driven <= 300000 is valid if the car is well maintained and serviced on time.

Used cars with Kilometers_Driven > 400000 are practically unheard of in the Indian market. These values look highly suspicious, so they are treated as outliers and replaced with the median value for the respective brand/model.

In [None]:
#Creating _BKP variable. Applying transformation on that and copying it back to original variable

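#KMs to be updated (this list starts at 210000, so readings at or below the 300000 threshold above are replaced as well)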
km=[210000,  215000,  215750,  216000  ,220000,  225000 , 227000,  230000 , 231673,
  234000,  240000,  242000,  248000 , 250000,  255000 , 262000,  282000,  299322,
  300000,  445000,  480000,  620000,  720000 , 775000, 6500000]

#copy variable to _bkp (as float, so that np.nan can be stored)
carsdf['Kilometers_Driven_bkp'] = carsdf['Kilometers_Driven'].astype(float)

#replace the listed values with np.nan
carsdf.loc[carsdf['Kilometers_Driven_bkp'].isin(km), 'Kilometers_Driven_bkp'] = np.nan

print('Number of nulls in Kilometers_Driven_bkp: ',carsdf['Kilometers_Driven_bkp'].isnull().sum())

#Take the median of cars grouped by Brand & Model and update
#transform keeps the original row index, so the result aligns with carsdf
carsdf['Kilometers_Driven_bkp'] = carsdf.groupby(['Brand','Model'])['Kilometers_Driven_bkp'].transform(lambda val:val.fillna(val.median()))

#checking if any more nan's in data
print('Nulls in Kilometers_Driven_bkp after imputation with Median(Brand+Model): ',carsdf['Kilometers_Driven_bkp'].isnull().sum())

#if nan's still exist, then update median() based on brand
carsdf['Kilometers_Driven_bkp'] = carsdf.groupby(['Brand'])['Kilometers_Driven_bkp'].transform(lambda val:val.fillna(val.median()))

#checking if any more nans
print('Nulls in Kilometers_Driven_bkp after imputation with Median(Brand):',carsdf['Kilometers_Driven_bkp'].isnull().sum())

print('copying Kilometers_Driven_bkp to Kilometers_Driven')

#copy data from _bkp to original variable
carsdf['Kilometers_Driven'] = carsdf['Kilometers_Driven_bkp']
print('dropping Kilometers_Driven_bkp ')
#dropping _bkp
carsdf.drop(['Kilometers_Driven_bkp'],axis=1,inplace=True)
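
The two-level fallback above (median by Brand and Model, then median by Brand) can be seen on a toy frame. The data and the column name `km` are invented; `groupby(...).transform` is used in this sketch so the result keeps the original row index:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Maruti", "Honda", "Honda"],
    "Model": ["Swift", "Swift", "Swift", "Alto", "City", "City"],
    "km":    [40000, np.nan, 60000, np.nan, 30000, 50000],
})

# Level 1: fill with the median of the same Brand + Model
toy["km"] = toy.groupby(["Brand", "Model"])["km"].transform(
    lambda s: s.fillna(s.median())
)
# The only Alto row has no other Alto to borrow from, so it is still NaN here
nulls_after_level_1 = toy["km"].isnull().sum()
print("nulls after Brand+Model median:", nulls_after_level_1)

# Level 2: fall back to the Brand-level median
toy["km"] = toy.groupby("Brand")["km"].transform(lambda s: s.fillna(s.median()))
print(toy)
```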

In [None]:
#Checking Prices of Cars > 90 Lakhs
carsdf[carsdf['Price']>90]

Data looks good. All of them are luxury cars. Price also depends on supply and demand. In recent years, the used car market has grown significantly, and dealers may be looking to sell cars for more profit. Prices are also subject to inflation. During the current pandemic, affluent customers who want to invest in cars may buy such used luxury cars rather than new ones. Moreover, the Kilometers_Driven values for the cars above are attractive. Therefore the Price values are retained as is, and none are treated as outliers.

In the univariate analysis, we saw that Kilometers_Driven, Price and New_Price are heavily skewed.
None of them contain zero or negative values, so a log transformation can be applied.

##### Log Transformation 

In [None]:
hist_box_plot(carsdf['Kilometers_Driven'])


In [None]:
#Log Transformation on Kilometers_Driven
plt.figure(figsize = (10,5))
plt.subplot(1, 2, 1)
plt.title('Kilometers_Driven - Before log')
plt.hist(carsdf['Kilometers_Driven'])

plt.subplot(1, 2, 2)
plt.title('Kilometers_Driven - After log')
plt.hist(np.log(carsdf['Kilometers_Driven']))

plt.show()

After fixing the outliers, the data seems to have a decent distribution. A log transformation may not be required for this field.

In [None]:
#Log Transformation on New_Price

plt.figure(figsize = (10,5))
plt.subplot(1, 2, 1)
plt.title('New_Price - Before log')
plt.hist(carsdf['New_Price'],20)

plt.subplot(1, 2, 2)
plt.title('New_Price - After log')
plt.hist(np.log(carsdf['New_Price']),20)

plt.show()

In [None]:
#Log Transformation on Price

plt.figure(figsize = (10,5))
plt.subplot(1, 2, 1)
plt.title('Price - Before log')
plt.hist(carsdf['Price'],20)

plt.subplot(1, 2, 2)
plt.title('Price - After log')
plt.hist(np.log(carsdf['Price']),20)

plt.show()

- As seen above, the log transformation has helped Price and New_Price.
- Kilometers_Driven need not be log transformed. However, we can log transform it anyway and see which version performs better in the model.
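
The visual impression can be checked numerically with the sample skewness, which should move toward 0 when a log transform helps. A sketch on synthetic log-normal data (the numbers are generated, not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Log-normal values mimic a right-skewed, price-like variable
price_like = pd.Series(rng.lognormal(mean=1.5, sigma=0.8, size=5000))

skew_before = price_like.skew()
skew_after = np.log(price_like).skew()
print(f"skew before log: {skew_before:.2f}")
print(f"skew after log:  {skew_after:.2f}")
```

The same two lines applied to `carsdf['Price']` or `carsdf['New_Price']` would give a number to put next to each histogram.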

In [None]:
#Applying Log Transformation

carsdf['Kilometers_Driven_Log'] = np.log(carsdf['Kilometers_Driven'])
carsdf['New_Price_Log'] = np.log(carsdf['New_Price'])
carsdf['Price_Log'] = np.log(carsdf['Price'])
print('After applying log transformation...')
print(carsdf.info())
print(carsdf.isnull().sum())

In [None]:
#Dropping Model column as it has high cardinality
carsdf.drop(['Model'],axis=1,inplace=True)
carsdf.info()

In [None]:
#plotting histograms of the variables
col=['Year','Kilometers_Driven_Log','Kilometers_Driven','Mileage','Engine','Power','Seats','New_Price_Log','Price_Log']
for j in col:
    sns.histplot(carsdf[j],kde=True)    
    plt.title('Histogram of '+j)
    plt.show()

The distributions above look fine, and no entry stands out as an outlier that would impact the model. Seats values are discrete and New_Price_Log is continuous. Let us proceed with model creation and then assess the performance.

# *******************************************************************************
# Model building
# *******************************************************************************

In [None]:
#taking backup of transformed dataset before we proceed with model
carsdf_model=carsdf.copy()

carsdf_model.head()

In [None]:
carsdf_model.columns

### ------------------------------------------------------------------------------------------------------------------------
### Model 1
##### Considering Kilometers_Driven and not Kilometers_Driven_log
##### x =  { 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log'}
##### y = Price
### -----------------------------------------------------------------------------------------------------------------------

In [None]:
#Defining x (independent) and y(dependent) variables
x1 = carsdf_model[['Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log']]
y1 = carsdf_model[['Price']]

display(x1.head())
display(y1.head())

print('Shape of x: ',x1.shape)
print('Shape of y: ',y1.shape)

#Creating Dummy Variables
print('\nCreating dummy variables for categorical features:')
x1 = pd.get_dummies(x1, columns=['Location', 'Fuel_Type','Transmission','Owner_Type','Brand'], drop_first=True)
x1.head()
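
For reference, `drop_first=True` removes one level per categorical column so that the dummies are not perfectly collinear with the intercept; the dropped level becomes the baseline against which the remaining coefficients are measured. A small illustration on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"],
    "Power": [90.0, 74.0, 58.0, 82.0],
})

encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True)
# 'CNG' sorts first alphabetically, so it is the dropped (baseline) level
print(encoded.columns.tolist())
```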

In [None]:
## Evaluating Performance of Model by Generating KPIs - RMSE, MAE, MAPE, R^2, Adj R^2

# Adjusted R^2
def adj_r2(ind_vars, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = ind_vars.shape[0]
    k = ind_vars.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

def mape(targets, predictions):
    return np.mean(np.abs((targets - predictions)) / targets) * 100

# Model performance check
def model_perf(model, inp, out):

    y_pred = model.predict(inp)
    y_act = out.values

    return pd.DataFrame(
        {
            "RMSE": np.sqrt(mean_squared_error(y_act, y_pred)),
            "MAE": mean_absolute_error(y_act, y_pred),
            "MAPE": mape(y_act, y_pred),
            "R^2": r2_score(y_act, y_pred),
            "Adjusted R^2": adj_r2(inp, y_act, y_pred),
        },
        index=[0],
    )
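
As a sanity check of the formulas, here is a hand-sized example. With n observations and k predictors, Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1), and MAPE is the mean of |actual - predicted| / actual, in percent. The arrays are made up, `k = 2` is an assumed predictor count, and R^2 is computed directly with NumPy rather than `r2_score` to keep the sketch dependency-free:

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_pred = np.array([2.5, 3.5, 6.0, 8.5, 9.5])
n, k = len(y_true), 2          # k: assumed number of predictors

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2_value = 1 - (1 - r2) * (n - 1) / (n - k - 1)
mape_value = np.mean(np.abs(y_true - y_pred) / y_true) * 100

print(f"R^2 = {r2:.3f}, Adjusted R^2 = {adj_r2_value:.3f}, MAPE = {mape_value:.2f}%")
```

Adjusted R^2 is always at most R^2 when k >= 1, and the gap widens as predictors are added without improving the fit. That is relevant here because the dummy variables add many columns.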

In [None]:
def generate_lin_model(x,y):
    ''' Function to generate Linear Regression Model and generate the metrics '''
    #Splitting the data into Training & Test datasets
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

    #printing shape of training and test datasets
    print("x_train:",x_train.shape)
    print("x_test:",x_test.shape)
    print("y_train:",y_train.shape)
    print("y_test:",y_test.shape)
    print("\nIndependent variables for model :", x.columns)
    print("\nDependent variable for model :", y.columns)
    print("\nFitting Linear model..........")

    #Fitting Linear model
    lin_reg_model = LinearRegression()
    lin_reg_model.fit(x_train, y_train)

    print("\nLinear model complete..........\n")
    print("Intercept of the linear equation:", lin_reg_model.intercept_) 
    print("\nCoefficients of the equation are:", lin_reg_model.coef_)

    print("Generating Metrics for Model")

    # Checking model performance on train set
    print("\nTraining Performance")
    display(model_perf(lin_reg_model, x_train, y_train))

    # Checking model performance on test set
    print("\nTest Performance")
    display(model_perf(lin_reg_model, x_test, y_test)) 

In [None]:
#Generate Model 1 and Metrics
generate_lin_model(x1,y1)

### ------------------------------------------------------------------------------------------------------------------------
### Model 2
##### Considering Kilometers_Driven_Log and not  Kilometers_Driven
##### x =  { 'Location', 'Year', 'Kilometers_Driven_Log', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log'}
##### y = Price
### -----------------------------------------------------------------------------------------------------------------------

In [None]:
#Defining x (independent) and y(dependent) variables
x2 = carsdf_model[['Location', 'Year', 'Kilometers_Driven_Log', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log']]
y2 = carsdf_model[['Price']]

display(x2.head())
display(y2.head())

print('Shape of x: ',x2.shape)
print('Shape of y: ',y2.shape)

#Creating Dummy Variables
print('\nCreating dummy variables for categorical features:')
x2 = pd.get_dummies(x2, columns=['Location', 'Fuel_Type','Transmission','Owner_Type','Brand'], drop_first=True)
x2.head()

In [None]:
#Generate Model 2 and Metrics
generate_lin_model(x2,y2)

##### As the models above show, taking the log of Kilometers_Driven didn't help much with performance.
##### Model 1 and Model 2 yield almost the same performance

### ------------------------------------------------------------------------------------------------------------------------
### Model 3
##### Considering y = Price_Log
##### x =  { 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log'}
##### y = Price_Log
### -----------------------------------------------------------------------------------------------------------------------

In [None]:
#Defining x (independent) and y(dependent) variables
x3 = carsdf_model[['Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log']]
y3 = carsdf_model[['Price_Log']]

display(x3.head())
display(y3.head())

print('Shape of x: ',x3.shape)
print('Shape of y: ',y3.shape)

#Creating Dummy Variables
print('\nCreating dummy variables for categorical features:')
x3 = pd.get_dummies(x3, columns=['Location', 'Fuel_Type','Transmission','Owner_Type','Brand'], drop_first=True)
x3.head()

In [None]:
# Model performance check
def model_perf_log(model, inp, out):
    '''Function to Generate KPIs if log transformation is applied on dependent variable'''
    y_pred = np.exp(model.predict(inp))   #reversing the log and applying exp function
    y_act = np.exp(out.values)            #log(y) => exp(y)

    return pd.DataFrame(
        {
            "RMSE": np.sqrt(mean_squared_error(y_act, y_pred)),
            "MAE": mean_absolute_error(y_act, y_pred),
            "MAPE": mape(y_act, y_pred),
            "R^2": r2_score(y_act, y_pred),
            "Adjusted R^2": adj_r2(inp, y_act, y_pred),
        },
        index=[0],
    )
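
Because Model 3 is fitted on log(Price), its predictions must be exponentiated before they are compared with actual prices, which is what `model_perf_log` does. The sketch below shows the round trip on synthetic data, using a plain NumPy least-squares fit as a stand-in for `LinearRegression`; all numbers are generated:

```python
import numpy as np

rng = np.random.default_rng(1)
power = rng.uniform(50, 300, 500)
# Synthetic prices whose log is linear in power, plus noise
log_price = 0.5 + 0.01 * power + rng.normal(0, 0.2, 500)
price = np.exp(log_price)

# Least-squares fit of log(price) on [1, power]
X = np.column_stack([np.ones_like(power), power])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)

pred_log = X @ beta
pred_price = np.exp(pred_log)          # back to the original price scale

mae_log_scale = np.mean(np.abs(log_price - pred_log))
mae_price_scale = np.mean(np.abs(price - pred_price))
print(f"MAE on log scale:   {mae_log_scale:.3f}")
print(f"MAE on price scale: {mae_price_scale:.3f}")
```

The two MAE values are in different units and cannot be compared with each other; only the price-scale figure is comparable with the metrics of a model fitted directly on Price.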

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x3, y3, test_size=0.3, random_state=1)

#printing shape of training and test datasets
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)
print("\nIndependent variables for model :", x_train.columns)
print("\nDependent variable for model :", y_train.columns)
print("\nFitting Linear model..........")

#Fitting Linear model
lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

print("\nLinear model complete..........\n")
print("Intercept of the linear equation:", lin_reg_model.intercept_) 
print("\nCoefficients of the equation are:", lin_reg_model.coef_)
print("Generating Metrics for Model")
# Checking model performance on train set
print("\nTraining Performance")
display(model_perf_log(lin_reg_model, x_train, y_train))
# Checking model performance on test set
print("\nTest Performance")
display(model_perf_log(lin_reg_model, x_test, y_test)) 


##### Based on the metrics above, Model 3 performs better than Models 1 and 2

In [None]:
#printing coefficients and Intercept of Model 3
print("Intercept and Coefficients of Final Model (Model 3)")
coef_df = pd.DataFrame(
    np.append(lin_reg_model.coef_.flatten(), lin_reg_model.intercept_),
    index=x_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df
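
With log(Price) as the target, a coefficient b means that a one-unit increase in the feature multiplies the predicted price by exp(b), that is, changes it by (exp(b) - 1) * 100 percent with the other features held fixed. The coefficient values below are placeholders chosen for illustration, not the fitted values from Model 3:

```python
import numpy as np
import pandas as pd

# Placeholder coefficients (NOT the fitted Model 3 values)
example_coefs = pd.Series({
    "Year": 0.10,
    "Transmission_Manual": -0.05,
    "Kilometers_Driven": -0.000001,
})

# Percent change in predicted price per one-unit increase of each feature
pct_change = (np.exp(example_coefs) - 1) * 100
print(pct_change.round(4))
```

Applying the same expression to the `Coefficients` column of `coef_df` (excluding the intercept row) would put the fitted effects on a percentage scale.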

### Model Observations:

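#### Model 1:
------------
##### Training Performance
| RMSE | MAE | MAPE | R^2 | Adjusted R^2 |
|------|-----|------|-----|--------------|
| 5.417044 | 2.994105 | 59.66725 | 0.776664 | 0.773589 |

##### Test Performance
| RMSE | MAE | MAPE | R^2 | Adjusted R^2 |
|------|-----|------|-----|--------------|
| 4.877155 | 2.889871 | 57.37251 | 0.782824 | 0.775712 |

#### Model 2:
------------
##### Training Performance
| RMSE | MAE | MAPE | R^2 | Adjusted R^2 |
|------|-----|------|-----|--------------|
| 5.379086 | 3.004916 | 59.316916 | 0.779783 | 0.77675 |

##### Test Performance
| RMSE | MAE | MAPE | R^2 | Adjusted R^2 |
|------|-----|------|-----|--------------|
| 4.896422 | 2.915821 | 57.24615 | 0.781105 | 0.773937 |

#### Model 3:
-------------

##### Training Performance
| RMSE | MAE | MAPE | R^2 | Adjusted R^2 |
|------|-----|------|-----|--------------|
| 4.183008 | 1.768365 | 18.879325 | 0.866829 | 0.864995 |

##### Test Performance
| RMSE | MAE | MAPE | R^2 | Adjusted R^2 |
|------|-----|------|-----|--------------|
| 3.636954 | 1.698975 | 17.742828 | 0.879232 | 0.875277 |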

##### Observations:

- Model 3 is the best of the three models. It has better values on every metric we computed
- Applying a log transformation to Kilometers_Driven didn't help much, since outliers had already been replaced with the median for the respective vehicles
- Applying log to the dependent variable gave better metrics (Model 3)
- R^2 and Adjusted R^2 are fairly high, indicating a good model. It explains ~87% of the variance
- RMSE values for the test and training sets are comparable, indicating the model is not overfitting
- We are able to predict used car prices with a Mean Absolute Error of ~1.7 on the test set
- MAPE is ~17.7% on the test set
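
The test-set figures reported above can be gathered into one frame to make the comparison explicit. The numbers are copied from the tables in this section; the frame name `test_metrics` is introduced here only for illustration:

```python
import pandas as pd

test_metrics = pd.DataFrame(
    {
        "RMSE": [4.877155, 4.896422, 3.636954],
        "MAE": [2.889871, 2.915821, 1.698975],
        "MAPE": [57.37251, 57.24615, 17.742828],
        "R^2": [0.782824, 0.781105, 0.879232],
        "Adjusted R^2": [0.775712, 0.773937, 0.875277],
    },
    index=["Model 1", "Model 2", "Model 3"],
)

# Lower is better for error metrics, higher is better for R^2-type metrics
best = pd.concat([
    test_metrics[["RMSE", "MAE", "MAPE"]].idxmin(),
    test_metrics[["R^2", "Adjusted R^2"]].idxmax(),
])
print(best)
```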

#### Plotting Predicted vs Actual 

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(y=np.exp(lin_reg_model.predict(x_train)),x=np.exp(y_train.values),s=50,c='purple')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Predicted vs Actual")
plt.show()

### Residual Plot

In [None]:
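plt.figure(figsize=(10,7))
#Model residuals on the price scale (actual - predicted) against the fitted values
fitted = np.exp(lin_reg_model.predict(x_train)).ravel()
sns.scatterplot(x=fitted, y=np.exp(y_train.values).ravel() - fitted, color='purple')
plt.axhline(0, color='black', linestyle='--')
plt.show()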

The final model looks to be a good fit

# Summary:

#### Insights based on Model:
- We have built a Linear Regression model to predict used car prices. The model explains ~87% of the total variation in the training and test sets.
- The model can be used by dealers to predict the price of used cars with a mean absolute error of ~1.7 on test data. Note that the model predicts log(Price), and the error is measured after converting predictions back to the price scale.
- Variables impacting the price of car: 
> Brand <br>
Location <br>
Owner_Type <br>
Fuel_Type <br>
Transmission <br>
Mileage <br>
Kilometers_Driven <br>
Engine <br>
Power <br>
Seats <br>
Year i.e Age of Car <br>

- Mileage and Kilometers_Driven are negatively related to car price. High-mileage cars cost less, and the price decreases as the kilometers driven increase.
- Brands like Bentley, Isuzu, Tata, Porsche and Datsun don't yield a higher price. Bentley and Porsche are luxury brands with too little data in the dataset; additional data may be required to determine their actual effect. They may not be well suited to the Indian market.
- Location Kolkata affects the price negatively.
- Most of the categorical variables have a negative relationship with car price. The values of the categorical variables have a larger effect. Additional features in the dataset would be beneficial.
- Fuel type Electric has a positive effect on car price.
- Luxury brands like Mini, Land Rover, Audi, BMW, Mercedes-Benz and Jaguar have a high positive impact on car price.
- The new-car price has a positive impact on the used car price. If the price of a new car (say a BMW) increases, the used car price also tends to increase.
- Diesel cars impact price positively.
- Locations like Hyderabad, Bangalore, Coimbatore and Chennai impact price positively.
- Mid-class brands like Maruti and Honda also have a positive relationship with Price.

#### Recommendations:
- Electric cars have a positive impact on car price. Dealers can acquire more electric cars. This also helps the climate, as climate change is a pressing topic for every nation.
- Kolkata has a negative impact. Dealers can scale down in Kolkata and open one or more locations in the south, since southern cities like Bangalore, Coimbatore, Hyderabad and Chennai have better sales.
- Dealers can fill up their inventory with Maruti, Honda, Toyota and Hyundai brands.
- Manufacture year has a positive impact: newer cars sell for a higher price. Dealers can offer discounts on older cars and fill the inventory with newer ones.
- Discounts during major festivals like New Year, Diwali, Christmas, Ugadi and Ramzan may help sales. However, data is needed to see how festivals affect sales.
- Considering traffic in India these days, automatic cars may be the best fit for many individuals. Dealers can acquire more automatic cars and market them accordingly to increase sales.
- Sales started plunging after 2015. Dealers should study the market to understand why car sales declined after 2015 and incorporate the lessons learned.
- The model can be improved by reducing the number of categorical levels, for example by grouping cities into North, East, South and West, and car brands into High, Medium and Low categories depending on price.

# -----------------------------
# Model 4 (Add on)
##### Converting brands to High, Mid, Low class cars
##### Converting locations to North, East, South, West
# ---------------------------

In [None]:
#Defining x (independent) and y(dependent) variables
#.copy() so that new columns can be added to x4 without a SettingWithCopyWarning
x4 = carsdf_model[['Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log']].copy()
y4 = carsdf_model[['Price_Log']]

print('Shape of x: ',x4.shape)
print('Shape of y: ',y4.shape)
print(x4.dtypes)


In [None]:
# Creating brand class looking at the prices and using knowledge (google)

# Above 50 Lakhs
High = ['Land Rover',
'Lamborghini',          
'Jaguar',              
'BMW',                
'Audi', 
'Mercedes-Benz',      
'Porsche',             
'Bentley',              
'Ford'  ]

# Above 20 Lakhs to 50 Lakhs
Mid = ['Mini',                
'Toyota',            
'Volvo',               
'Mitsubishi',          
'Skoda',              
'Volkswagen',         
'Jeep',                
'Hyundai']

# Upto 20Lakhs
Low = ['Maruti',           
'Honda',            
'Mahindra',                       
'Tata',               
'Renault',            
'Nissan',              
'Fiat',                
'Datsun',              
'Isuzu',                
'Chevrolet',            
'Smart',                
'Force',                
'OpelCorsa',            
'Hindustan',            
'Ambassador']

def classify(brand):
    if brand in High:
        return 'High'
    elif brand in Mid:
        return 'Mid'
    elif brand in Low:
        return 'Low'
    else:
        return 'is_missing'

In [None]:
x4['Brand Class'] = x4['Brand'].apply(classify)

In [None]:
x4['Brand Class'].value_counts()
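
An equivalent way to express `classify` is a brand-to-class dictionary applied with `Series.map`, which avoids a Python-level `if` chain. This is a sketch with shortened brand lists; `brand_to_class` is a hypothetical name:

```python
import pandas as pd

# Shortened lists for illustration; the notebook's High/Mid/Low lists are longer
classes = {
    "High": ["BMW", "Audi", "Jaguar"],
    "Mid": ["Toyota", "Skoda", "Hyundai"],
    "Low": ["Maruti", "Honda", "Tata"],
}
# Invert to {brand: class}
brand_to_class = {brand: cls for cls, brands in classes.items() for brand in brands}

brands = pd.Series(["BMW", "Maruti", "Hyundai", "UnknownBrand"])
# Brands absent from the dictionary map to NaN, then get the same fallback label
brand_class = brands.map(brand_to_class).fillna("is_missing")
print(brand_class.value_counts())
```

A non-zero `is_missing` count is a quick signal that a brand was left out of all three lists.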

In [None]:
North = ['Delhi']
South = ['Hyderabad','Kochi','Coimbatore','Chennai' ,'Bangalore']
East = ['Kolkata']
West = ['Mumbai','Pune','Jaipur','Ahmedabad']

def classify_loc(place):
    if place in North:
        return 'North'
    elif place in South:
        return 'South'
    elif place in East:
        return 'East'
    else:
        return 'West'

In [None]:
x4['Loc_zone'] = x4['Location'].apply(classify_loc)

In [None]:
x4['Loc_zone'].value_counts()

In [None]:
x4=x4[[ 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission',
       'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
       'New_Price_Log', 'Brand Class', 'Loc_zone']]

In [None]:
x4 = pd.get_dummies(x4, columns=['Fuel_Type','Transmission','Owner_Type','Loc_zone','Brand Class'], drop_first=True)
x4.head()

In [None]:
y4.head()

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x4, y4, test_size=0.3, random_state=1)

#printing shape of training and test datasets
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)
print("\nIndependent variables for model :", x_train.columns)
print("\nDependent variable for model :", y_train.columns)
print("\nFitting Linear model..........")

#Fitting Linear model
lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

print("\nLinear model complete..........\n")
print("Intercept of the linear equation:", lin_reg_model.intercept_) 
print("\nCoefficients of the equation are:", lin_reg_model.coef_)
print("Generating Metrics for Model")
# Checking model performance on train set
print("\nTraining Performance")
display(model_perf_log(lin_reg_model, x_train, y_train))
# Checking model performance on test set
print("\nTest Performance")
display(model_perf_log(lin_reg_model, x_test, y_test)) 

As the metrics show, Model 4 didn't offer much improvement. Model 3 is still preferred.

# -----------------------------
# Model 5 (Add on)
##### retaining Location and dropping Brand. Brand class will be used instead
# ---------------------------

In [None]:
#.copy() so that new columns can be added to x5 without a SettingWithCopyWarning
x5 = carsdf_model[['Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log']].copy()
y5 = carsdf_model[['Price_Log']]

In [None]:
x5['Brand_Class'] = x5['Brand'].apply(classify)

In [None]:
x5 = pd.get_dummies(x5, columns=['Fuel_Type','Transmission','Owner_Type','Location','Brand_Class'], drop_first=True)
x5.head()

In [None]:
x5.drop(columns=['Brand'],inplace=True)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x5, y5, test_size=0.3, random_state=1)

#printing shape of training and test datasets
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)
print("\nIndependent variables for model :", x_train.columns)
print("\nDependent variable for model :", y_train.columns)
print("\nFitting Linear model..........")

#Fitting Linear model
lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

print("\nLinear model complete..........\n")
print("Intercept of the linear equation:", lin_reg_model.intercept_) 
print("\nCoefficients of the equation are:", lin_reg_model.coef_)
print("Generating Metrics for Model")
# Checking model performance on train set
print("\nTraining Performance")
display(model_perf_log(lin_reg_model, x_train, y_train))
# Checking model performance on test set
print("\nTest Performance")
display(model_perf_log(lin_reg_model, x_test, y_test)) 

Model 3 is better in metrics compared to Model 5

# -----------------------------
# Model 6 (Add on)
##### Dropping New_Price, Engine, Mileage, Kilometers_Driven
# ---------------------------

In [None]:
x6 = carsdf_model[['Location', 'Year', 'Fuel_Type', 'Transmission','Owner_Type', 'Power', 'Seats', 'Brand']]
y6 = carsdf_model[['Price_Log']]

In [None]:
display(x6.head())
display(y6.head())

print('Shape of x: ',x6.shape)
print('Shape of y: ',y6.shape)

#Creating Dummy Variables
print('\nCreating dummy variables for categorical features:')
x6 = pd.get_dummies(x6, columns=['Location', 'Fuel_Type','Transmission','Owner_Type','Brand'], drop_first=True)
x6.head()

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x6, y6, test_size=0.3, random_state=1)

#printing shape of training and test datasets
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)
print("\nIndependent variables for model :", x_train.columns)
print("\nDependent variable for model :", y_train.columns)
print("\nFitting Linear model..........")

#Fitting Linear model
lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

print("\nLinear model complete..........\n")
print("Intercept of the linear equation:", lin_reg_model.intercept_) 
print("\nCoefficients of the equation are:", lin_reg_model.coef_)
print("Generating Metrics for Model")
# Checking model performance on train set
print("\nTraining Performance")
display(model_perf_log(lin_reg_model, x_train, y_train))
# Checking model performance on test set
print("\nTest Performance")
display(model_perf_log(lin_reg_model, x_test, y_test)) 

# -----------------------------
# Model 7 (Add on)
##### Converting Brand, Location, Fuel_Type, Transmission, Owner_Type to numerics
# ---------------------------

In [None]:
# .copy() so that the new columns added below do not write into a slice of carsdf_model
x7 = carsdf_model[['Location', 'Year', 'Kilometers_Driven', 'Fuel_Type', 'Transmission','Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Brand', 'New_Price_Log']].copy()
y7 = carsdf_model[['Price_Log']]

In [None]:
x7['Fuel_Type'].unique()

In [None]:
# Build the masks from x7 itself so row alignment does not depend on carsdf's index
x7.loc[x7['Fuel_Type']=='CNG', 'Fuel_Type_new'] = 1
x7.loc[x7['Fuel_Type']=='Diesel', 'Fuel_Type_new'] = 2
x7.loc[x7['Fuel_Type']=='Petrol', 'Fuel_Type_new'] = 3
x7.loc[x7['Fuel_Type']=='LPG', 'Fuel_Type_new'] = 4
x7.loc[x7['Fuel_Type']=='Electric', 'Fuel_Type_new'] = 5

In [None]:
x7['Transmission'].unique()

In [None]:
# Masks are taken from x7 itself rather than from carsdf
x7.loc[x7['Transmission']=='Manual', 'Transmission_new'] = 11
x7.loc[x7['Transmission']=='Automatic', 'Transmission_new'] = 12

In [None]:
x7['Owner_Type'].unique()

In [None]:
# Masks are taken from x7 itself rather than from carsdf
x7.loc[x7['Owner_Type']=='First', 'Owner_Type_new'] = 21
x7.loc[x7['Owner_Type']=='Second', 'Owner_Type_new'] = 22
x7.loc[x7['Owner_Type']=='Third', 'Owner_Type_new'] = 23
x7.loc[x7['Owner_Type']=='Fourth & Above', 'Owner_Type_new'] = 99

In [None]:
x7.drop(columns=['Fuel_Type','Transmission','Owner_Type'], inplace=True)
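
The integer codes used above differ from the dummy variables used in the earlier models: a single coded column asks the linear model to fit one slope across all levels, so the spacing of the codes matters (for instance, 99 sits 76 units away from 23). The sketch below contrasts the two encodings on a toy frame. The frame `toy` and the variable names are assumptions for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy frame reusing the notebook's Owner_Type labels; NOT the real data.
toy = pd.DataFrame(
    {"Owner_Type": ["First", "Second", "Third", "Fourth & Above", "First"]}
)

# Integer codes, as in Model 7: one numeric column whose values
# imply an ordering and a spacing between the levels.
owner_codes = {"First": 21, "Second": 22, "Third": 23, "Fourth & Above": 99}
as_codes = toy["Owner_Type"].map(owner_codes)

# One-hot encoding, as in Models 5 and 6: one indicator column per level,
# minus the baseline level removed by drop_first=True.
as_dummies = pd.get_dummies(toy, columns=["Owner_Type"], drop_first=True)

print(as_codes.tolist())
print(as_dummies.columns.tolist())
```

With dummies each level gets its own coefficient, whereas the coded column shares one coefficient across all levels. That is one possible reason, though not a confirmed one, why the two encodings can give different metrics.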

In [None]:
x7['Brand_Class'] = x7['Brand'].apply(classify)
x7['Loc_zone'] = x7['Location'].apply(classify_loc)

In [None]:
x7.drop(columns=['Brand','Location'], inplace=True)

In [None]:
x7.info()

In [None]:
x7 = pd.get_dummies(x7, columns=['Loc_zone','Brand_Class'], drop_first=True)
x7.head()

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x7, y7, test_size=0.3, random_state=1)

#printing shape of training and test datasets
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)
print("\nIndependent variables for model :", x_train.columns)
print("\nDependent variable for model :", y_train.columns)
print("\nFitting Linear model..........")

#Fitting Linear model
lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

print("\nLinear model complete..........\n")
print("Intercept of the linear equation:", lin_reg_model.intercept_) 
print("\nCoefficients of the equation are:", lin_reg_model.coef_)
print("Generating Metrics for Model")
# Checking model performance on train set
print("\nTraining Performance")
display(model_perf_log(lin_reg_model, x_train, y_train))
# Checking model performance on test set
print("\nTest Performance")
display(model_perf_log(lin_reg_model, x_test, y_test)) 

The metrics did not change much; Model 3 is still better.

In [None]:
# Switch the target from log(Price) to Price
y7 = carsdf_model[['Price']]

In [None]:
generate_lin_model(x7,y7)

With Price as the target, R^2 and adjusted R^2 decreased compared to the log(Price) fit above, while RMSE, MAE, and MAPE increased.
Model 3 remains the best model.
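
A log-target model and a raw-target model are only directly comparable when both are scored on the same scale. The sketch below shows one way to do that: exponentiate the log model's predictions back to price before computing the metrics. It uses synthetic data, and it makes no assumption about whether `model_perf_log` or `generate_lin_model` already back-transform. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data in which price depends multiplicatively on the features.
# This is NOT the Cars4U dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
price = np.exp(1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500))

X_tr, X_te, p_tr, p_te = train_test_split(X, price, test_size=0.3, random_state=1)

# Model A: fit on log(price), then exponentiate predictions back to price.
log_model = LinearRegression().fit(X_tr, np.log(p_tr))
pred_from_log = np.exp(log_model.predict(X_te))

# Model B: fit on price directly.
raw_model = LinearRegression().fit(X_tr, p_tr)
pred_raw = raw_model.predict(X_te)

# Both sets of predictions are now in price units and can be scored alike.
for name, pred in [("log target", pred_from_log), ("raw target", pred_raw)]:
    rmse = np.sqrt(mean_squared_error(p_te, pred))
    print(f"{name}: R2={r2_score(p_te, pred):.3f}, RMSE={rmse:.3f}")
```

Exponentiating a log-scale prediction slightly underestimates the mean price when the log-scale errors are roughly normal, so this sketch ignores that retransformation bias.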