# Python Final Project : Predicting the costs of used cars

## Done By: Sukhmani Kaur Bedi

## Task:

`Explore and visualize the dataset`

`Build a linear regression model to predict the prices of used cars`

`Generate a set of insights and recommendations that will help the business`

## Step1- Pre-processing the data

Before computing descriptive statistics and getting a sense of the distributions and relationships between the variables, the raw data must be cleaned.

#### Conversion required for below features

- Mileage - It is continuous; remove the unit (kmpl or km/kg) and convert to float64
- Engine - It is continuous; remove CC and convert to int64
- Power - It is continuous; remove bhp and convert to float64
- Seats - It is numerical and discrete; convert to int64
- New_Price - It is numerical and continuous; remove Lakh or Cr, convert to float64, and express every value in Lakh (1 Cr = 100 Lakh)

#### Grouping categorical columns 
- Extracted Brand names from the Names column and created a new column Car_Brand 
- Classification of the Car_Brands into Luxury, medium and non Luxury brands on the basis of Premium brand names 
- Created a New column Region: Created a dictionary to map the cities to North/East/West/South.

#### Computing Missing values for:

- Mileage
- Engine 
- Power
- Seats
- New_Price
- Price


## Step 2:- Visualizing the data
- Univariate Analysis
- Bivariate and Multi-Variate Analysis
- Checking for Outliers


## Step 3 :- Transformation
- Transforming highly skewed variables



## Step4 :- One hot encoding
- Creating dummy variables for categorical data 


## Step 5:-  Model Building
- Creating Linear regression model 
- Adding Stats on model
- Checking the accuracy 


#### FEATURES:

Name: The brand and model of the car.

Location: The location in which the car is being sold or is available for purchase.

Year: The year or edition of the model.

Kilometers_Driven: The total kilometres driven in the car by the previous owner(s) in KM.

Fuel_Type: The type of fuel used by the car.

Transmission: The type of transmission used by the car.

Owner_Type: Whether the ownership is Firsthand, Second hand or other.

Mileage: The standard mileage offered by the car company in kmpl or km/kg

Engine: The displacement volume of the engine in cc.

Power: The maximum power of the engine in bhp.

Seats: The number of seats in the car.

New_Price: The price of a new car of the same model in Lakhs

Price: The price of the used car in INR Lakhs.


## Notes:
- There is only one dataset provided. I will split it into Train and Test sets to build the linear regression model that predicts the prices of `used_cars`.
- The `dependent variable` is `Price`, which indicates the price of used cars in INR Lakhs; the rest of the variables are treated as independent variables.
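
As a rough illustration of the train/test idea described in the notes, the sketch below splits a small synthetic table and fits ordinary least squares with NumPy. The data values, the 70/30 ratio, and the helper `design_matrix` are assumptions made for this example. The notebook itself builds its model on the cleaned car data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data; the column names mirror the notebook,
# the values are made up for illustration.
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "power_num": rng.uniform(50, 300, n),
    "Year": rng.integers(2005, 2020, n),
})
toy["Price"] = 0.05 * toy["power_num"] + 0.8 * (toy["Year"] - 2005) + rng.normal(0, 1, n)

# 70/30 split by shuffling the row positions (the ratio is an assumption).
shuffled = rng.permutation(n)
n_train = int(0.7 * n)
train, test = toy.iloc[shuffled[:n_train]], toy.iloc[shuffled[n_train:]]


def design_matrix(frame):
    """Hypothetical helper: predictors plus an intercept column of ones."""
    X = frame[["power_num", "Year"]].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])


# Ordinary least squares fitted on the training rows only.
coef, *_ = np.linalg.lstsq(design_matrix(train), train["Price"].to_numpy(), rcond=None)

# R^2 computed on the held-out rows.
pred = design_matrix(test) @ coef
y_test = test["Price"].to_numpy()
r2_test = 1 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print(f"Test R^2: {r2_test:.3f}")
```

Evaluating on rows the model never saw gives a fairer picture of how well the price predictions generalise than the training fit alone.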

### Import libraries

In [None]:
# Import necessary libraries.
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import pylab
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
# Removes the limit from the number of displayed columns and rows.
# This is so I can see the entire dataframe when I print it
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 300)

# To enable plotting graphs in Jupyter notebook
%matplotlib inline

### Load and explore the data

In [None]:
! ls

In [None]:
data_car = pd.read_csv('../input/cars-data/used_cars_data1.csv') # reading the data 
data= data_car.copy() # copying the data set


### Overview of the data

In [None]:
# Shape of the data 
print(f'There are {data.shape[0]} rows and {data.shape[1]} columns.')  # f-string

# I'm now going to look at 10 random rows
# I'm setting the random seed via np.random.seed so that
# I get the same random results every time
np.random.seed(1)
data.sample(n=10)

Observations:-

1. There are 7253 rows and 14 attributes. 
    
2. The data needs to be cleaned, for example by removing the `CC` suffix in `Engine`, `bhp` in `Power` and `Lakh` in `New_Price`.
    
  

In [None]:
data.info() #checking the info of the data

In [None]:
data.isnull().sum().sort_values(ascending =False) # Summing up the null values and sorting them in descending order

Observations:- 
    
1. There are null values in `Mileage`, `Engine`, `Power`, `Seats`, `New_Price` and `Price`

2. The data types must be fixed. For example `Mileage`, `Engine`, `New_Price` and `Power`. 

In [None]:
#Checking if there are any duplicate values in the data set 
data.duplicated().sum()

There are duplicate values in the data set.
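
Because `duplicated()` compares entire rows, the count depends on whether an identifier column such as `S.No.` takes part in the comparison. The sketch below uses a made-up three-row table to show the difference. It is an illustration, not the result for the actual data set.

```python
import pandas as pd

# Toy rows: the first two describe the same car but carry different serial numbers.
toy = pd.DataFrame({
    "S.No.": [0, 1, 2],
    "Name": ["Maruti Swift", "Maruti Swift", "Honda City"],
    "Year": [2015, 2015, 2012],
})

# Full-row comparison: the serial number makes every row unique.
full_row_dupes = int(toy.duplicated().sum())

# Compare on every column except the identifier.
feature_cols = [c for c in toy.columns if c != "S.No."]
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```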

### Step1- Pre-processing the data

#### 1. Mileage

We have car mileage in two units, kmpl and km/kg.

After a quick research on the internet it is clear that these 2 units are used for cars of 2 different fuel types.

kmpl - kilometers per litre - is used for petrol and diesel cars.
km/kg - kilometers per kg - is used for CNG and LPG based engines.

We have the variable `Fuel_Type` in our data. Let us check whether this observation holds true in our data as well.

In [None]:
# Create 2 new columns after splitting the mileage values.
km_per_unit_fuel = []
mileage_unit = []

for observation in data["Mileage"]:
    if isinstance(observation, str):
        if (
            observation.split(" ")[0]
            .replace(".", "", 1)
            .isdigit()  # first element should be numeric
            and " " in observation  # space between numeric and unit
            and (
                observation.split(" ")[1]
                == "kmpl"  # units are limited to "kmpl" and "km/kg"
                or observation.split(" ")[1] == "km/kg"
            )
        ):
            km_per_unit_fuel.append(float(observation.split(" ")[0]))
            mileage_unit.append(observation.split(" ")[1])
        else:
            # To detect if there are any observations in the column that do not follow
            # the expected format [number + ' ' + 'kmpl' or 'km/kg']
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the mileage column,
        # we add corresponding missing values to the 2 new columns
        km_per_unit_fuel.append(np.nan)
        mileage_unit.append(np.nan)

In [None]:
# No print output from the function above. The values are all in the expected format or NaNs
# Add the new columns to the data

data["km_per_unit_fuel"] = km_per_unit_fuel
data["mileage_unit"] = mileage_unit

# Checking the new dataframe
data.head(5)  # looks good!

In [None]:
# Let us check if the units correspond to the fuel types as expected.
data.groupby(by=["Fuel_Type", "mileage_unit"]).size()

The data type and the units of `Mileage` have been fixed.

As expected, km/kg is for CNG/LPG cars and kmpl is for Petrol and Diesel cars.
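
The same split into number and unit can also be written without an explicit loop. The following is a minimal sketch using `Series.str.extract` on a few made-up values, assuming the only valid units are `kmpl` and `km/kg`. Rows that do not match the pattern, including missing ones, come back as NaN.

```python
import pandas as pd

# Made-up sample values in the same format as the Mileage column.
mileage = pd.Series(["26.6 km/kg", "19.67 kmpl", None, "0.0 kmpl"])

# One capture group for the number, one for the unit.
parts = mileage.str.extract(r"^(\d+(?:\.\d+)?) (kmpl|km/kg)$")
km_per_unit = pd.to_numeric(parts[0])
unit = parts[1]

print(pd.DataFrame({"km_per_unit_fuel": km_per_unit, "mileage_unit": unit}))
```

Unlike the loop, this version keeps the output the same length as the input even when a value is malformed, at the cost of not printing the offending entries.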

#### 2. Engine 

The data dictionary suggests that `Engine` indicates the displacement volume of the engine in CC.
We will make sure that all the observations follow the same format - [numeric + " " + "CC"] and create a new numeric column from this column. 

This time, let's use a regex to make all the necessary checks.

In [None]:
# re module provides support for regular expressions
import re

# Create a new column after splitting the engine values.
engine_num = []

# Regex for numeric + " " + "CC"  format
regex_engine = r"^\d+(\.\d+)? CC$"  # raw string so that \d and \. are not treated as string escapes

for observation in data["Engine"]:
    if isinstance(observation, str):
        if re.match(regex_engine, observation):
            engine_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "CC"]  format
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the engine column, we add missing values to the new column
        engine_num.append(np.nan)

In [None]:
# No print output from the function above. The values are all in the same format - [numeric + " " + "CC"] OR NaNs
# Add the new column to the data

data["engine_num"] = engine_num

# Checking the new dataframe
data.head(5)

#### 3. Power 

The data dictionary suggests that `Power` indicates the maximum power of the engine in bhp.
We will make sure that all the observations follow the same format - [numeric + " " + "bhp"] and create a new numeric column from this column, like we did for `Engine`

In [None]:
# Create a new column after splitting the power values.
power_num = []

# Regex for numeric + " " + "bhp"  format
regex_power = r"^\d+(\.\d+)? bhp$"  # raw string so that \d and \. are not treated as string escapes

for observation in data["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "bhp"]  format
            # that we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)

We can see that some Null values in power column exist as 'null bhp' string.
Let us replace these with NaNs

In [None]:
power_num = []

for observation in data["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            power_num.append(np.nan)
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)

# Add the new column to the data
data["power_num"] = power_num

# Checking the new dataframe
data.head(10)  # looks good now

#### 4. New_Price 

We know that `New_Price` is the price of a new car of the same model in INR Lakhs (1 Lakh = 100,000).

This column clearly has a lot of missing values. We will impute the missing values later. For now we will only extract the numeric values from this column.

In [None]:
# Create a new column after splitting the New_Price values.
new_price_num = []

# Regex for numeric + " " + "Lakh"  format
# (this reuses, and therefore overwrites, the name regex_power from the Power step)
regex_power = r"^\d+(\.\d+)? Lakh$"

for observation in data["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "Lakh"]  format
            # that we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)

Not all values are in Lakhs; a few observations are in Crores as well.

Let us convert these to Lakhs (1 Cr = 100 Lakhs).

In [None]:
new_price_num = []

for observation in data["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # Converting values in Crore to lakhs
            new_price_num.append(float(observation.split(" ")[0]) * 100)
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)

# Add the new column to the data
data["new_price_num"] = new_price_num

# Checking the new dataframe
data.head(5)  # Looks ok
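
The conversion above treats every string that is not in Lakh as a Crore value. A slightly more defensive variant checks the unit explicitly and returns NaN for anything unrecognised. The sketch below assumes the unit tokens are exactly `Lakh` and `Cr`. The names `price_to_lakh`, `UNIT_TO_LAKH` and `PRICE_PATTERN` are hypothetical and used only for this illustration.

```python
import re
import numpy as np

# Multipliers that express every price in lakhs (1 Cr = 100 Lakh).
UNIT_TO_LAKH = {"Lakh": 1.0, "Cr": 100.0}
PRICE_PATTERN = re.compile(r"^(\d+(?:\.\d+)?) (Lakh|Cr)$")


def price_to_lakh(value):
    """Return the price in lakhs, or NaN for missing or unrecognised input."""
    if not isinstance(value, str):
        return np.nan
    match = PRICE_PATTERN.match(value)
    if match is None:
        return np.nan
    number, unit = match.groups()
    return float(number) * UNIT_TO_LAKH[unit]


# Made-up inputs: a Lakh value, a Crore value, a missing value and a malformed string.
examples = ["8.61 Lakh", "1.28 Cr", np.nan, "unknown"]
converted = [price_to_lakh(v) for v in examples]
print(converted)
```

Applied to the real column, this would be `data["New_Price"].apply(price_to_lakh)`, provided the unit spellings match the assumption above.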

### Processing for variable columns

The `Name` column in the current format might not be very useful in our analysis.
Since the name contains both the brand name and the model name of the vehicle, the column would have too many unique values to be useful in prediction.


We therefore create a new column, `Car_Brand`, holding the brand of the car, and a second column, `Model`, holding the model name.

#### 1. Car Brand Name

In [None]:
brand = []
for row, sentence in enumerate(data['Name']):
    if sentence.split(' ')[0] == 'Land':
        brand.append(' '.join(sentence.split(' ')[0:2]))
    elif sentence.split(' ')[0] == 'OpelCorsa':
        brand.append('Opel')
    elif sentence.split(' ')[0] == 'ISUZU':
        brand.append('Isuzu')
    else:
        brand.append(sentence.split(' ')[0])
data['Car_Brand'] = brand 

In [None]:
print('The number of unique values of Car_Brand', data['Car_Brand'].nunique())
print(data['Car_Brand'].value_counts())

In [None]:
plt.figure(figsize=(15, 7))
sns.countplot(y="Car_Brand", data=data, order=data['Car_Brand'].value_counts().index);
plt.title('Car Brand')

#### 2. Car Model Name

In [None]:
# Extract Model Names
data["Model"] = data["Name"].apply(lambda x: x.split(" ")[1].lower())

# Check the data

data["Model"].value_counts()

In [None]:
plt.figure(figsize=(15, 7))
sns.countplot(y="Model", data=data, order=data["Model"].value_counts().index[1:30])

It is clear from the above charts that our dataset contains used cars from luxury as well as budget-friendly brands.

We can create a new variable using this information. We will bin all our cars in 3 categories -

1. Non Luxury
2. Mid Range
3. Luxury Cars
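
One possible way to build such a three-level variable is to bin the mean price per brand with `pd.cut`. The sketch below is only an illustration. The brand prices (apart from the Lamborghini figure quoted in this notebook) and the cut points of 10 and 20 Lakh are assumptions, not values derived from the data.

```python
import pandas as pd

# Mean prices (in lakhs) per brand; made up for illustration.
brand_mean_price = pd.Series(
    {"Maruti": 4.5, "Hyundai": 5.3, "Toyota": 11.6, "Audi": 25.5, "Lamborghini": 120.0}
)

# Hypothetical cut points: up to 10 lakh, 10 to 20 lakh, above 20 lakh.
bins = [0, 10, 20, float("inf")]
labels = ["Non Luxury", "Mid Range", "Luxury"]
brand_category = pd.cut(brand_mean_price, bins=bins, labels=labels)

# Map each car's brand to its category through a plain dictionary.
cars = pd.DataFrame({"Car_Brand": ["Audi", "Maruti", "Toyota"]})
cars["Car_category"] = cars["Car_Brand"].map(brand_category.to_dict())
print(cars)
```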

#### 3. Car_category

In [None]:
data.groupby(["Car_Brand"])["Price"].mean().sort_values(ascending=False)

The output is very close to our expectation (domain knowledge), in terms of brand ordering. Mean price of a used Lamborghini is 120 Lakhs and that of cars from other luxury brands follow in a descending order.

Towards the bottom end we have the more budget friendly brands.

We can see that there is some missingness in our data. Let us come back to creating this variable once we have removed missingness from the data.

### Missing value Treatment

In [None]:
# Basic summary stats - Numeric variables
data.describe().T

**Observations**
1. S.No. clearly has no interpretation here but as discussed earlier let us drop it only after having looked at the initial linear model.
2. Kilometers_Driven values have an incredibly high range. We should check a few of the extreme values to get a sense of the data.
3. Minimum and maximum number of seats in the car also warrant a quick check. On average a car seems to have 5 seats, which is about right.
4. We have used cars being sold at less than a lakh rupees and as high as 160 lakh, as we saw for Lamborghini earlier. We might have to drop some of these outliers to build a robust model.
5. Min Mileage being 0 is also concerning, we'll have to check what is going on.
6. Engine and Power mean and median values are not very different. Only someone with more domain knowledge would be able to comment further on these attributes.
7. New price range seems right. We have both budget friendly Maruti cars and Lamborghinis in our stock. Mean being twice that of the median suggests that there are only a few very high range brands, which again makes sense.

In [None]:
# Check Kilometers_Driven extreme values
data.sort_values(by=["Kilometers_Driven"], ascending=False).head(10)

It looks like the first row here is a data entry error. A car manufactured as recently as 2017 having been driven 6500000 kms is almost impossible.

The other observations that follow are also on the higher end. There is a good chance that these are outliers. We'll look at this further while doing the univariate analysis.
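
A common rule of thumb for flagging such values is the 1.5 × IQR fence. The sketch below applies it to a handful of made-up odometer readings, including the 6,500,000 km entry discussed above. The rule and the sample values are assumptions for illustration, not the treatment chosen later in the notebook.

```python
import pandas as pd

# Made-up odometer readings plus the suspicious 6,500,000 km entry.
km = pd.Series(
    [1000, 35000, 48000, 52000, 60000, 75000, 120000, 6500000],
    name="Kilometers_Driven",
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = km[(km < lower) | (km > upper)]
print(outliers)
```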

In [None]:
# Check Kilometers_Driven Extreme values
data.sort_values(by=["Kilometers_Driven"], ascending=True).head(10)

After looking at the columns - Year, New Price and Price these entries seem feasible.

1000 might be a default value in this case. Quite a few cars having been driven exactly 1000 km is suspicious.

In [None]:
# Check seats extreme values
data.sort_values(by=["Seats"], ascending=True).head(5)

Audi A4 having 0 seats is clearly a data entry error. This column warrants some outlier treatment, or we can treat seats == 0 as a missing value. Overall, there doesn't seem to be much to be concerned about here.

In [None]:
# Let us check if we have a similar car in our dataset.
data[data["Name"].str.startswith("Audi A4")]
# Looks like an Audi A4 typically has 5 seats.

In [None]:
# Let us replace #seats in row index 3999 from 0 to 5

data.loc[3999, "Seats"] = 5.0

In [None]:
# Check seats extreme values
data.sort_values(by=["Seats"], ascending=False).head(5)

Of course, a Toyota Qualis has 10 seats and so does a Tata Sumo. We don't see any data entry error here.

In [None]:
# Check Mileage - km_per_unit_fuel extreme values
data.sort_values(by=["km_per_unit_fuel"], ascending=True).head(10)

We will have to treat Mileage = 0 as missing values
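
A minimal sketch of that replacement on a toy column is shown below. Once the zeros are NaN, they can be filled by the same imputation steps as the other missing values. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Toy column with one zero and one genuinely missing value.
toy = pd.DataFrame({"km_per_unit_fuel": [18.9, 0.0, 24.3, np.nan]})

# A mileage of 0 is not physically meaningful, so mark it as missing.
toy["km_per_unit_fuel"] = toy["km_per_unit_fuel"].replace(0, np.nan)
n_missing = int(toy["km_per_unit_fuel"].isna().sum())
print(n_missing)
```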

In [None]:
# Check Mileage - km_per_unit_fuel extreme values
data.sort_values(by=["km_per_unit_fuel"], ascending=False).head(10)

Maruti Wagon R and Maruti Alto CNG versions are budget friendly cars with high mileage so these data points are fine.

In [None]:
# looking at value counts for non-numeric features

num_to_display = 10  # defining this up here so it's easy to change later
for colname in data.dtypes[data.dtypes == "object"].index:
    val_counts = data[colname].value_counts(dropna=False)  # Will also show the NA counts
    print(val_counts[:num_to_display])
    if len(val_counts) > num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    print("\n\n")  # just for more space in between

Since we haven't dropped the original columns that we processed, some of the output here is redundant.

We had checked cars of different `Fuel_Type` earlier, but we did not encounter the 2 electric cars. Let us check why.

In [None]:
data.loc[data["Fuel_Type"] == "Electric"]

Mileage values for these cars are NaN, that is why we did not encounter these earlier with groupby.

Electric cars are very new in the market and very rare in our dataset. We can consider dropping these two observations if they turn out to be outliers later. There is a good chance that we will not be able to create a good price prediction model for electric cars, with the currently available data.

New Price for 6247 entries is missing. We need to explore if we can impute these or we should drop this column altogether.

### Missing Value Treatment


Before we start looking at the individual distributions and interactions, let's quickly check the missingness in the data.

In [None]:
data.isnull().sum()

* 2 Electric car variants don't have entries for Mileage.
* Engine displacement information of 46 observations is missing and maximum power of 175 entries is missing.
* Information about number of seats is not available for 53 entries.
* New Price as we saw earlier has a huge missing count. We'll have to see if there is a pattern here.
* Price is also missing for 1234 entries. Since price is our response variable that we want to predict, we will have to drop these rows when we actually build a model. These rows will not be able to help us in modelling or model evaluation. But while we are analysing the distributions and doing missing value imputations, we will keep using information from these rows.

In [None]:
# Drop the redundant columns.
data.drop(columns=["Mileage", "mileage_unit", "Engine", "Power", "New_Price"], inplace=True)

In [None]:
# Look at a few rows where #seats is missing
data[data["Seats"].isnull()]

In [None]:
# We'll impute these missing values one by one, by taking median number of seats for the particular car,
# using the Brand and Model name
data.groupby(["Car_Brand", "Model"], as_index=False)["Seats"].median()

In [None]:
# Impute missing Seats
data["Seats"] = data.groupby(["Car_Brand", "Model"])["Seats"].transform(lambda x: x.fillna(x.median()))

In [None]:
# Check 'Seats'
data[data["Seats"].isnull()]
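
Group-wise median imputation can only fill a gap when at least one car in the same brand and model group has a known value. The toy example below, with made-up rows, shows why some rows may still be missing after the `transform` step: a group in which every value is NaN has a NaN median, so nothing is filled.

```python
import numpy as np
import pandas as pd

# Made-up rows: one group with a known value, one group that is entirely missing.
toy = pd.DataFrame({
    "Car_Brand": ["Audi", "Audi", "Audi", "Maruti", "Maruti"],
    "Model": ["a4", "a4", "a4", "estilo", "estilo"],
    "Seats": [5.0, np.nan, 5.0, np.nan, np.nan],
})

# Fill each gap with the median of its own brand and model group.
toy["Seats"] = toy.groupby(["Car_Brand", "Model"])["Seats"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```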

In [None]:
# Maruti Estilo can accommodate 5
data["Seats"] = data["Seats"].fillna(5.0)

In [None]:
# We will use similar methods to fill missing values for engine, power and new price
data["engine_num"] = data.groupby(["Car_Brand", "Model"])["engine_num"].transform(lambda x: x.fillna(x.median()))

data["power_num"] = data.groupby(["Car_Brand", "Model"])["power_num"].transform(lambda x: x.fillna(x.median()))

data["new_price_num"] = data.groupby(["Car_Brand", "Model"])["new_price_num"].transform(lambda x: x.fillna(x.median()))

In [None]:
data.isnull().sum()

In [None]:
# There are still some NAs in power and new_price_num.
# There are a few car brands and models in our dataset that do not contain the new price information at all.
# Now we'll have to estimate the new price using the other features.
# KNN imputation is one of the imputation methods that can be used for this.
# This sklearn method requires us to encode categorical variables, if we are using them for imputation.
# In this case we'll use only a few selected numeric features for imputation

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3, weights="uniform")  # 3 Nearest Neighbours
temp_data_for_imputation = data[["engine_num", "power_num", "Year", "km_per_unit_fuel", "new_price_num", "Seats"]]
temp_data_for_imputation = imputer.fit_transform(temp_data_for_imputation)
temp_data_for_imputation = pd.DataFrame(temp_data_for_imputation, 
                                      columns = ["engine_num", "power_num", "Year", "km_per_unit_fuel", 
                                                 "new_price_num", "Seats"],
)

# Add imputed columns to the original dataset
data["new_price_num"] = temp_data_for_imputation["new_price_num"]
data["power_num"] = temp_data_for_imputation["power_num"]
data["km_per_unit_fuel"] = temp_data_for_imputation["km_per_unit_fuel"]

In [None]:
data.isnull().sum()

In [None]:
# Drop the redundant columns.
data.drop(columns=["Name", "S.No."], inplace=True) 
# Drop the rows where 'Price' == NaN and proceed to modelling
data = data[data["Price"].notna()]

In [None]:
print(f'There are {data.shape[0]} rows and {data.shape[1]} columns after dropping NAN from the target variable')

In [None]:
# Check the value counts and unique number of Car_Brand after dropping the missing values of Price
print('The number of unique values of Car_Brand', data['Car_Brand'].nunique())
print(data['Car_Brand'].value_counts())

The rows for Hindustan and OpelCorsa have now been dropped.


Let us engineer a feature from the Location column.
Since all locations in the data set are cities in India, we can further `group the cities according to their region`, for example `Northern`, `Eastern` and so on.

In [None]:
data['Location'].value_counts()

In [None]:
# Grouping the cities according to their region in India
Nothern = ['Delhi','Jaipur']

Western =['Mumbai', 'Pune','Ahmedabad' ]

Eastern = ['Kolkata' ]

Southern = ['Hyderabad', 'Kochi', 'Coimbatore', 'Chennai', 'Banglore']

In [None]:
### Defining a user function for regions
def Regions(x):
    if x in Nothern:
        return 'North_India'
    elif x in Western:
        return 'West_India'
    elif x in Eastern:
        return 'East_India'
    else:
        return 'South_India'

In [None]:
data['Region'] = data['Location'].apply(Regions)

In [None]:
# let us look at unique regions
data['Region'].unique()
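
The same grouping can be expressed as a dictionary lookup with `Series.map`, which has the side benefit of exposing cities that were not anticipated instead of silently assigning them to the South. This is a sketch. The name `CITY_TO_REGION`, the spelling `Bangalore` and the extra city `Lucknow` are assumptions for illustration, and the keys must match the spellings actually present in `Location`.

```python
import pandas as pd

# City-to-region lookup; the spellings are assumed and must match the data.
CITY_TO_REGION = {
    "Delhi": "North_India", "Jaipur": "North_India",
    "Mumbai": "West_India", "Pune": "West_India", "Ahmedabad": "West_India",
    "Kolkata": "East_India",
    "Hyderabad": "South_India", "Kochi": "South_India", "Coimbatore": "South_India",
    "Chennai": "South_India", "Bangalore": "South_India",
}

# "Lucknow" is a hypothetical city that is missing from the lookup.
locations = pd.Series(["Mumbai", "Kolkata", "Chennai", "Lucknow"])
regions = locations.map(CITY_TO_REGION)  # unknown cities become NaN
unmapped = locations[regions.isna()].unique().tolist()
print(regions.tolist(), unmapped)
```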


### Descriptive Statistics 

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x) 
np.random.seed(1)
data.sample(n=10)

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
data.describe().T # Descriptive statistics for Numerical columns

**Observations :-**
    
1. The `average kilometers driven` by cars is `58738.380 km` and the `maximum kilometers driven` is `6500000 km`. The large gap between the minimum and maximum values suggests outliers or extreme values.

2. The `average mileage` offered by the car company is `18.325` and the `maximum mileage` is `33.54`. The gap between the minimum and maximum mileage is comparatively small.

3. The `average size` of the `engine` is `1621` and the `maximum size` of the `engine` is `5998`. The large gap between the minimum and maximum engine size suggests outliers or extreme values.

4. The `average power` of the `engine` is `112.661` and the `maximum power` of the `engine` is `560`. The large gap between the minimum and maximum power suggests outliers or extreme values.

5. The `average price` of the `new car` is `1,875,734.175` and the `average price` of the `used cars` is `947,946`. Both columns show a large gap between the minimum and maximum values, which suggests outliers or extreme values.

6. The median model year is `2014`, i.e. half of the `cars` in the data set are from 2014 or earlier.

7. The average car in the data set is a 5-seater.

In [None]:
data.describe(exclude = np.number).T # Descriptive statistics for Categorical columns

**Observations :-**


1. Maximum number of cars sold or available for purchase is in Mumbai.

2. There are 5 unique categories of Fuel type out of which Diesel has the most frequency.

3. There are 2 unique categories of Transmission out of which Manual has the maximum frequency.

4. There are 4 unique categories of Owner_Type out of which First hand cars are the most frequent.

5. There are 31 unique categories of Car_Brand out of which Maruti is the most popular brand.

6. There are 4 unique regions (categorized according to the location). Most cars are sold or available for purchase in South_India.

### Step 2

### Before we further process the data, let's have a look at the graphical visualization of the data to understand it in a better way!

### Univariate analysis

In [None]:
# Function to create barplots that indicate percentage for each category.

def perc_on_bar(plot, feature):
    '''
    plot: matplotlib Axes holding the bar plot to annotate
    feature: categorical column that was plotted
    The function won't work if a column is passed in the hue parameter.
    '''
    total = len(feature)  # number of observations in the column
    for p in plot.patches:  # use the Axes passed in, not a global variable
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # share of each category
        x = p.get_x() + p.get_width() / 2 - 0.05  # horizontal position of the label
        y = p.get_y() + p.get_height()            # top of the bar
        plot.annotate(percentage, (x, y), size=10)  # annotate the percentage
    plt.show()  # show the plot

In [None]:
plt.figure(figsize=(5,5))
ax = sns.countplot(x='Owner_Type', data=data, palette='winter')  # pass the column by keyword
plt.xticks(rotation=45)
perc_on_bar(ax,data['Owner_Type'])

- Most cars in the data set have had a single (First hand) owner; very few are Third hand or above.
- There is a substantial drop in count from First hand to Second hand owners, and again to Third hand owners.


In [None]:
plt.figure(figsize=(5,5))
ax = sns.countplot(x='Fuel_Type', data=data, palette='winter')  # pass the column by keyword
plt.xticks(rotation=45)
perc_on_bar(ax,data['Fuel_Type'])

- There are 53.2% Diesel cars, 45.6% Petrol cars and very few CNG and LPG cars. 


In [None]:
plt.figure(figsize=(5,5))
ax = sns.countplot(x='Transmission', data=data, palette='winter')  # pass the column by keyword
plt.xticks(rotation=45)
perc_on_bar(ax,data['Transmission'])

- Most cars have a Manual transmission (71.4%).

In [None]:
plt.figure(figsize=(15,8))
ax = data.Location.value_counts().plot.bar()
plt.xticks(rotation=60)
perc_on_bar(ax,data['Location'])

- Maximum selling and buying of used cars is done in Mumbai and the least is done in Ahmedabad
- Approximately equal split can be seen for Kochi, Coimbatore and Pune, and for Kolkata and Jaipur.
 

In [None]:
plt.figure(figsize=(20,12))
ax = data.Car_Brand.value_counts().sort_values(ascending=False).plot.bar()
plt.xticks(rotation=90)
perc_on_bar(ax,data['Car_Brand'])

- There are significantly more cars for Maruti in comparison with the others.
- There is only a marginal drop in percentage for Hyundai in comparison with Maruti. However, there is a significant drop in percentage for Honda in comparison with Maruti and Hyundai.
- The percentages of cars from Mercedes-Benz, Ford and Volkswagen are approximately the same.


### Lets plot histogram for all numerical columns

In [None]:
# Plot a histogram (with a KDE curve) for every numeric column
from scipy.stats import norm
all_col = data.select_dtypes(include = np.number).columns.tolist()

plt.figure(figsize = (17, 75))

for i in range(len(all_col)):
    plt.subplot(18, 3, i+1)
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(data[all_col[i]], kde=True)
    plt.title(all_col[i],fontsize=20)

# Adjust the spacing once, after all subplots have been drawn
plt.tight_layout()
plt.show()

- Year is left-skewed, which means that there are very few cars from 2005 or earlier.
- The most common model year is 2014. Cars from 2015 and 2016 appear in approximately equal numbers.
- Every other variable is right-skewed, except for `Mileage` and `Seats`.
- Mileage is approximately normally distributed, with an average of about 18.

To reduce the skewness, these variables should be transformed.
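
A log transform is one common choice for right-skewed, strictly positive variables. The sketch below uses synthetic log-normal draws as a stand-in for a column such as `Kilometers_Driven` and compares the skewness before and after `np.log1p`. The synthetic data and the choice of transform are assumptions, not the notebook's final decision.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Log-normal draws stand in for a right-skewed column.
skewed = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=1000))

# log1p computes log(1 + x), which also tolerates zeros.
transformed = np.log1p(skewed)

skew_before, skew_after = skewed.skew(), transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```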

## Bivariate Analysis

### Lets look at correlations

In [None]:
numeric_columns = data.select_dtypes(include = np.number).columns.tolist()

# sorting correlations w.r.t Price  
corr = data[numeric_columns].corr().sort_values(by = ['Price'], ascending = False) 

# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (13, 10))


# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap = 'seismic', annot = True, fmt = ".1f", vmin = -1, vmax = 1, center = 0, square = False,
            linewidths = .7, cbar_kws = {"shrink": .5});
plt.title(' Correlation Matrix', fontstyle = 'italic')
plt.tick_params(axis='both', labelsize = 12, labelrotation = 45)
plt.show()


In [None]:
sns.pairplot(data, corner = True )

- `Price` is highly positively correlated with `Engine (0.7)`, `Power (0.8)` and `New_Price (0.6)`.
- Other highly positively correlated variables are:
    - `New_Price` with `Engine (0.6)` and `New_Price` with `Power (0.7)`
    - A very strong correlation is seen between `Power` and `Engine (0.9)`
- Other negatively correlated variables are:
    - `Mileage` with `Power (-0.5)` and `Mileage` with `Engine (-0.6)`



The correlations among `Power`, `Engine` and `Mileage`, in particular between `Power` and `Engine` (0.9), point to multicollinearity among these predictors.
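
Multicollinearity is often quantified with the variance inflation factor (VIF). For a set of predictors, the VIFs equal the diagonal of the inverse of their correlation matrix. The sketch below demonstrates this on synthetic columns in which `power_num` is constructed to depend on `engine_num`. The data is made up, so the numbers say nothing about the actual cars.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic predictors: power is tied to engine by construction, seats is unrelated.
engine = rng.normal(1600, 500, n)
power = 0.07 * engine + rng.normal(0, 10, n)
seats = rng.normal(5, 1, n)
toy = pd.DataFrame({"engine_num": engine, "power_num": power, "Seats": seats})

# VIFs are the diagonal entries of the inverse correlation matrix.
corr = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns)
print(vif.round(2))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that a predictor is largely explained by the others.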

## Let us look at the graph of those variables that are highly correlated with Price

### Price vs Engine vs Categorical Variables( with hue as Fuel_Type, Owner_Type and Transmission and Region)

In [None]:
import plotly.express as px
title_dict = {'family': 'serif',
        'color':  'darkred',
        'weight': 'bold',
        'size': 16
        }

lab_dict = {'family': 'serif',
              'color': 'black',
              'size': 14
              }

plt.figure(figsize=[20,15])
grid = plt.GridSpec(3, 2, wspace=0.6, hspace=0.4) #Defining the grid space

#Plotting Price vs Engine vs Transmission

ax0 = plt.subplot(grid[0, 0])
sns.scatterplot(y= 'Price', x = 'engine_num', hue = 'Transmission', data = data)
ax0.set_ylabel('Price', fontdict=lab_dict)
ax0.set_xlabel('Engine', fontdict=lab_dict)
plt.title('Price vs Engine vs Transmission', fontdict = title_dict )
plt.legend(bbox_to_anchor=(0.99,1))

#Plotting Price vs Engine vs Fuel_Type
ax1 = plt.subplot(grid[0, 1])
sns.scatterplot(y= 'Price', x = 'engine_num', hue = 'Fuel_Type', data = data)
ax1.set_ylabel('Price', fontdict=lab_dict)
ax1.set_xlabel('Engine', fontdict=lab_dict)
plt.title('Price vs Engine vs Fuel_Type', fontdict = title_dict )
plt.legend(bbox_to_anchor=(0.99,1))

#Plotting Price vs Engine vs Owner_Type
ax2 = plt.subplot(grid[1, 0])
sns.scatterplot(y= 'Price', x = 'engine_num', hue = 'Owner_Type', data = data)
ax2.set_ylabel('Price', fontdict=lab_dict)
ax2.set_xlabel('Engine', fontdict=lab_dict)
plt.title('Price vs Engine vs Owner_Type', fontdict = title_dict )
plt.legend(bbox_to_anchor=(0.99,1))


#Plotting Price vs Engine vs Region
ax3 = plt.subplot(grid[1, 1])
sns.scatterplot(y= 'Price', x = 'engine_num', hue = 'Region', data = data)
ax3.set_ylabel('Price', fontdict=lab_dict)
ax3.set_xlabel('Engine', fontdict=lab_dict)
plt.title('Price vs Engine vs Region', fontdict = title_dict )
plt.legend(bbox_to_anchor=(0.99,1))

- Expensive cars with bigger engines are more often Automatic than Manual.
- There is one Automatic car priced at INR 160 Lakh with an engine size of 3000 CC, and one Automatic car with an engine size of 6000 CC priced above 40 Lakh (examined below). There is also one car with a very small engine size priced below 20 Lakh (examined below).


- Diesel cars with an engine size of up to 3000 CC are more expensive than the others. Beyond 3000 CC, the price of Petrol cars also increases.
- No CNG or LPG cars can be seen.
- One Electric car with a very small engine size and a price below 20 Lakh can be seen (examined below).
- There is one Automatic Diesel car priced at INR 160 Lakh with an engine size of 3000 CC, and one Automatic Petrol car with an engine size of 6000 CC priced above 40 Lakh (examined below).

- First-hand cars form the majority of the dataset. As a car changes hands, its price drops. However, there is one third-hand car priced at INR 120 Lakh with an engine size of 5200 CC (examined below).

- Most car purchases and sales take place in South India, and most cars with bigger engines are bought or sold there as well.

- The extreme cases detected above are examined next.

In [None]:
data[(data['Transmission'] =='Automatic') & (data['Owner_Type'] =='First') & (data['Price'] == 160)]

In [None]:
data[(data['Transmission'] =='Automatic') & (data['Fuel_Type'] == 'Electric') & (data['Region'] == 'South_India')]

In [None]:
data[(data['Owner_Type'] == 'Third') & (data['engine_num']> 5000)]

- Land Rover Range Rover 3.0 Diesel LWB Vogue is the most expensive car. (Will check again after treating outliers)

- Mahindra E Verito D4 is the only car with minimum engine size. (Will check after treating outliers)

- Lamborghini Gallardo Coupe is the only expensive Third hand petrol car. (Will check after treating outliers)

### Price vs Power vs Categorical Variables (with hue as Fuel_Type, Owner_Type, Region and Transmission)

In [None]:

title_dict = {'family': 'serif',
        'color':  'darkred',
        'weight': 'bold',
        'size': 16
        }

lab_dict = {'family': 'serif',
              'color': 'black',
              'size': 14
              }

plt.figure(figsize=[20,15])
grid = plt.GridSpec(3, 2, wspace=0.6, hspace=0.4) #Defining the grid space

#Plotting Price vs Power vs Transmission

ax0 = plt.subplot(grid[0, 0])
sns.scatterplot(y= 'Price',x = 'power_num', hue = 'Transmission', data = data)
ax0.set_ylabel('Price', fontdict=lab_dict)
ax0.set_xlabel('Power', fontdict=lab_dict)
plt.title('Price vs Power vs Transmission', fontdict = title_dict )
plt.legend(bbox_to_anchor=(1,1))

#Plotting Price vs Power vs Fuel_Type
ax1 = plt.subplot(grid[0, 1])
sns.scatterplot(y= 'Price', x = 'power_num', hue = 'Fuel_Type', data = data)
ax1.set_ylabel('Price', fontdict=lab_dict)
ax1.set_xlabel('Power', fontdict=lab_dict)
plt.title('Price vs Power vs Fuel_Type', fontdict = title_dict )
plt.legend(bbox_to_anchor=(1,1))

           
#Plotting Price vs Power vs Owner_Type
ax2 = plt.subplot(grid[1,0])
sns.scatterplot(y= 'Price', x = 'power_num', hue = 'Owner_Type', data = data)
ax2.set_ylabel('Price', fontdict=lab_dict)
ax2.set_xlabel('Power', fontdict=lab_dict)
plt.title('Price vs Power vs Owner_Type', fontdict = title_dict )
plt.legend(bbox_to_anchor=(0.97,1))

#Plotting Price vs Power vs Region
ax3 = plt.subplot(grid[1, 1])
sns.scatterplot(y= 'Price', x = 'power_num', hue = 'Region', data = data)
ax3.set_ylabel('Price', fontdict=lab_dict)
ax3.set_xlabel('Power', fontdict=lab_dict)
plt.title('Price vs Power vs Region', fontdict = title_dict )
plt.legend(bbox_to_anchor=(0.97,1))


- Automatic cars deliver more power than Manual cars and are more expensive.
- Among expensive cars, Diesel cars are more common below a power of 300 bhp and Petrol cars are more common above 300 bhp.
- Cars with higher power are mostly first-hand. However, there is one third-hand car with the maximum power (examined below).
- According to the dataset, South India has a wider range of high-power cars (in terms of `Price`) than the other regions. However, there is one car in North India with the maximum power (examined below).

In [None]:
print(data[(data['Owner_Type'] == 'Third') & (data['power_num'] > 500)])
print('-'*90)
print(data[(data['Region'] == 'North_India') & (data['power_num']> 500)])

This is the same record that was identified and addressed previously.

### Price vs New_Price 

In [None]:
title_dict = {'family': 'serif',
        'color':  'darkred',
        'weight': 'bold',
        'size': 16
        }

lab_dict = {'family': 'serif',
              'color': 'black',
              'size': 14
              }


# lmplot is a figure-level function: it creates its own figure, so the size is set via height/aspect
sns.lmplot(y= 'Price', x = 'new_price_num', data = data, height = 8, aspect = 1.25)
plt.ylabel('Price', fontdict=lab_dict)
plt.xlabel('New_Price', fontdict=lab_dict)
plt.title('Price vs New_Price ', fontdict = title_dict )


- There are a lot of extreme values which must be taken into consideration and treated. 

#### Plotting categorical variables with Price

In [None]:
df_hm =data.pivot_table(index = 'Fuel_Type', columns = 'Transmission', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);
plt.title('Price vs Fuel_Type vs Transmission ', fontdict = title_dict );

- Automatic Diesel used cars have a higher median price than the other combinations.
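To make explicit what each heatmap cell holds, here is a minimal sketch of the same `pivot_table` call on a tiny hand-made frame. The rows are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Diesel", "Diesel", "Diesel", "Diesel",
                  "Petrol", "Petrol", "Petrol"],
    "Transmission": ["Automatic", "Automatic", "Manual", "Manual", "Manual",
                     "Automatic", "Manual", "Manual"],
    "Price": [20.0, 30.0, 5.0, 7.0, 9.0, 10.0, 3.0, 4.0],
})

# Each cell is the median Price of the rows sharing that Fuel_Type/Transmission pair
table = demo.pivot_table(index="Fuel_Type", columns="Transmission",
                         values="Price", aggfunc="median")
top_cell = table.stack().idxmax()  # (Fuel_Type, Transmission) with the highest median
print(table)
print(top_cell)
```

The median is used instead of the mean so that a few very expensive cars do not dominate a cell.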

In [None]:
df_hm =data.pivot_table(index = 'Owner_Type', columns = 'Transmission', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);
plt.title('Price vs Owner_Type vs Transmission ', fontdict = title_dict );

Automatic First hand cars have a higher price in comparison with the others. 


In [None]:
df_hm =data.pivot_table(index = 'Region', columns = 'Fuel_Type', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);
plt.title('Price vs Fuel_Type vs Region ', fontdict = title_dict );

Most expensive electric cars are purchased or sold only in South India and West India. 

The price of Diesel type used cars is maximum in South India



#### Price change with the year of edition

In [None]:
#Price vs Year vs Fuel Type
plt.figure(figsize = (15, 7))
sns.lineplot(x = 'Year', y = 'Price', hue = 'Fuel_Type', ci = 95, data = data);
plt.title('Price vs Year vs Fuel_Type ', fontdict = title_dict );

- For Diesel and Petrol, the price of used cars has shown a steady increase. However, Diesel cars are more costly than Petrol cars.
- CNG cars appear from 2005 onwards and LPG cars from 2007. Not much increase in their price can be observed.
- The price for electric cars is constant across the years. 




In [None]:
#Price vs Year vs Transmission
plt.figure(figsize = (15, 7))
sns.lineplot(x = 'Year', y = 'Price', hue = 'Transmission', ci = 95, data = data);
plt.title('Price vs Year vs Transmission ', fontdict = title_dict );

The price of manual cars has shown a smooth incline over time.

For automatic cars, the increase in price was not steady across the years. Two major drops can be seen, one in 2004 and the other in 2007. From 2008, a steep incline can be seen in the price of automatic cars.

In [None]:
#Price vs Year vs Owner Type
plt.figure(figsize = (15, 7))
sns.lineplot(x = 'Year', y = 'Price', hue = 'Owner_Type', ci = 95, data = data);
plt.title('Price vs Year vs Owner_Type ', fontdict = title_dict );

A steep incline can be seen in the price of first-hand cars over the years, while the price of second-hand cars declines from 2017 and that of third-hand cars from 2013.

In [None]:
#Price vs Year vs Region
plt.figure(figsize = (15, 7))
sns.lineplot(x = 'Year', y = 'Price', hue = 'Region', ci = 95, data = data);
plt.title('Price vs Year vs Region ', fontdict = title_dict );

In North India and West India, a decline in the price of used cars can be seen from 2018. However, no decline in price is observed in South India.

#### Plotting variables with multicollinearity

In [None]:
title_dict = {'family': 'serif',
        'color':  'darkred',
        'weight': 'bold',
        'size': 16
        }

lab_dict = {'family': 'serif',
              'color': 'black',
              'size': 14
              }

plt.figure(figsize=[10,10])
grid = plt.GridSpec(3, 2, wspace=0.5, hspace=0.4) #Defining the grid space

#Plotting Mileage vs Engine 

ax0 = plt.subplot(grid[0, 0])
sns.regplot(x= 'km_per_unit_fuel', y = 'engine_num',  data = data)
ax0.set_xlabel('Mileage', fontdict=lab_dict)
ax0.set_ylabel('Engine', fontdict=lab_dict)
plt.title('Mileage vs Engine',  fontdict = title_dict )


#Plotting Mileage vs Power
ax1 = plt.subplot(grid[0, 1])
sns.regplot(x= 'km_per_unit_fuel', y = 'power_num', data = data)
ax1.set_xlabel('Mileage', fontdict=lab_dict)
ax1.set_ylabel('Power', fontdict=lab_dict)
plt.title('Mileage vs Power', fontdict = title_dict )


           
#Plotting Power vs Engine 

ax2 = plt.subplot(grid[1,0])
sns.regplot(x= 'power_num', y = 'engine_num', data = data)
ax2.set_xlabel('Power', fontdict=lab_dict)
ax2.set_ylabel('Engine', fontdict=lab_dict)
plt.title('Power vs Engine', fontdict = title_dict )
plt.show();




- There is a strong negative correlation between Mileage and Power, and between Mileage and Engine.
- There is a strong positive correlation between Power and Engine.

We will decide which of them to drop on the basis of the Variance Inflation Factor (VIF) while testing for multicollinearity.

#### Price vs Seats 

In [None]:
plt.figure(figsize = (10, 5))
sns.scatterplot(y = 'Price', x = 'Seats',hue = 'Fuel_Type' ,data = data);

There are more 5, 6, 7, 8, 9 and 10 seater Diesel cars in comparison with the other fuel types. However, 2 seater cars are mostly Petrol.

#### Will further classify the Car_Brands into Luxury, Medium and Non Luxury brands on the basis of premium brand names

In [None]:
 #Trifurcating Brand into:-
    
Luxury_Brand = ['Audi', 'BMW', 'Bentley', 'Jaguar', 'Mercedes-Benz', 
                'Mini','Land Rover', 'Mitsubishi', 'Porsche', 'Skoda', 'Volvo', 'Lamborghini']
Medium_Brand = ['Chevrolet', 'Toyota','Volkswagen', 
                'Hyundai', 'Honda'] 
Non_Luxury_Brand = ['Ambassador','Datsun', 'Fiat', 'Force', 
                    'Ford', 'Isuzu', 'Jeep', 'Mahindra', 'Nissan', 'Renault', 'Tata', 'Smart', 'Maruti'] 
 
# Defining a function to map each brand to its tier
def Luxury(x):
    if x in Luxury_Brand:
        return 'Luxury_Brand'
    elif x in Medium_Brand:
        return 'Medium_Brand'  # no trailing space, so the label matches later filters and dummy names
    else:
        return 'Non_Luxury_Brand'

# Applying the function to our data
data['Brand'] = data['Car_Brand'].apply(Luxury)

In [None]:
data['Brand'] = data['Brand'].astype('category')
# Keep value_counts() as the last expression so the notebook displays it
data.Brand.value_counts()

In [None]:
data.info()

A new feature has been created in the data and its data type has been fixed.
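The same kind of brand-to-tier mapping can also be expressed as a dictionary lookup, which avoids a chain of `if`/`elif` branches. The sketch below is only an illustration: the tier lists are abbreviated and the `brands` Series is hypothetical.

```python
import pandas as pd

# Abbreviated, hypothetical tier lists (the full lists are defined in the cell above)
tiers = {
    "Luxury_Brand": ["Audi", "BMW"],
    "Medium_Brand": ["Toyota", "Honda"],
}
# Invert to brand -> tier so it can be used with Series.map
brand_to_tier = {brand: tier for tier, brand_list in tiers.items() for brand in brand_list}

brands = pd.Series(["Audi", "Toyota", "Tata"])
# Brands missing from the dictionary become NaN and fall back to the default tier
mapped = brands.map(brand_to_tier).fillna("Non_Luxury_Brand").astype("category")
print(mapped.value_counts())
```

Keeping the labels in one dictionary also makes stray whitespace in a label easier to spot.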

In [None]:
df_hm =data.pivot_table(index = 'Brand', columns = 'Region', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);

Luxury brand cars have a higher median price in South India than in the other regions.

In [None]:
df_hm =data.pivot_table(index = 'Brand', columns = 'Fuel_Type', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);

Diesel models of luxury brands have a higher price than the other combinations.


In [None]:
df_hm =data.pivot_table(index = 'Brand', columns = 'Owner_Type', values = "Price", aggfunc = np.median)
# Draw a heatmap
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);

The prices of first-hand and second-hand luxury brand cars are relatively higher than those of the other brands.

In [None]:
df_hm =data.pivot_table(index = 'Seats', columns = 'Brand', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);

- The price of a 2 seater luxury brand car is higher than the others.
- Luxury brand cars range from 2 seater to 7 seater models.



In [None]:
df_hm =data.pivot_table(index = 'Brand', columns = 'Transmission', values = "Price", aggfunc = np.median)
# Draw a heatmap 
f, ax = plt.subplots(figsize = (10, 8))
sns.heatmap(df_hm, cmap = 'coolwarm', linewidths = .5, annot = True, ax = ax);

Automatic Luxury brand cars are the most expensive cars.

`Observations`

1. Overall, the demand and the price for Automatic used cars have increased over the years.
2. There is a steep increase in the purchase and sale of used cars in the Southern region of India.
3. The sale and purchase of Luxury brand cars is mostly done in the Southern region of India.
4. There are only two records of Automatic Electric cars, observed in the Southern and Western regions of India.
5. The purchase and sale of first-hand cars has shown a steady increase over the years.
6. Power and Engine of the cars are highly correlated.
7. Mileage is negatively correlated with Power and Engine of the car.


### Transforming numerical columns 

In [None]:
#Transforming the columns 
data['Kilometers_Driven_log'] = np.log(data['Kilometers_Driven'])

data['Price_log'] = np.log(data['Price'])


In [None]:
print('Skewness check for Kilometers Driven')
print(data['Kilometers_Driven'].skew())
print(data['Kilometers_Driven_log'].skew())
print('-'*30)


print('Skewness check for Price')
print(data['Price'].skew())
print(data['Price_log'].skew())
print('-'*30)


The above output shows that the skewness has come down for each column that was transformed, which confirms that the log transformation has reduced the influence of extreme values on the distribution.
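The effect of the log transform on skewness can be reproduced on synthetic data. The sketch below assumes a log-normally distributed, price-like variable (hypothetical values) and uses a plain moment-based skewness, which differs slightly from the bias-corrected estimate returned by pandas `.skew()`.

```python
import numpy as np


def sample_skewness(x):
    """Moment-based skewness: mean of the cubed z-scores (population std)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


# Synthetic right-skewed values standing in for a price-like column (hypothetical data)
rng = np.random.default_rng(42)
raw = rng.lognormal(mean=2.0, sigma=0.8, size=5000)
logged = np.log(raw)  # np.log requires strictly positive values

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Note that `np.log` is undefined at zero, so a column containing zeros would need a different transform such as `np.log1p`.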

In [None]:
# Drop the original column now that its log-transformed version exists
data.drop(['Kilometers_Driven'], axis =1, inplace =True)
data.columns

In [None]:
# Distribution and Q-Q plots of the transformed columns

import scipy.stats as stats
pd.set_option('display.max_rows', 200)  # set pandas to display more rows


fig = plt.figure(figsize=[20,35])
grid = plt.GridSpec(5, 2, wspace=0.5, hspace=0.3)
cols = ['Price_log', 'Kilometers_Driven_log']

# For each column: distribution plot on the left, normal Q-Q plot on the right
for i, col in enumerate(cols):
    ax_hist = plt.subplot(grid[i, 0])
    sns.distplot(data[col], ax=ax_hist)
    ax_hist.set_title(col)
    ax_qq = plt.subplot(grid[i, 1])
    # probplot can draw directly on a matplotlib Axes, so pylab is not needed here
    stats.probplot(data[col], dist="norm", plot=ax_qq)

plt.show()

**Observations:-**
- Skewness has reduced for both variables, and their distributions have moved towards normality.
- `Price_log` and `Kilometers_Driven_log` now have approximately normal distributions.

In [None]:
numeric_columns = data.select_dtypes(include = np.number).columns.tolist()

# sorting correlations w.r.t Price  
corr = data[numeric_columns].corr().sort_values(by = ['Price_log'], ascending = False) 

# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (15, 10))



# Draw the heatmap of the sorted correlations
sns.heatmap(corr, cmap = 'seismic', annot = True, fmt = ".1f", vmin = -1, vmax = 1, center = 0, square = False,
            linewidths = .7, cbar_kws = {"shrink": .5});

plt.title(' Correlation Matrix after transformation', fontstyle = 'italic')
plt.tick_params(axis='both', labelsize = 12, labelrotation = 45)
plt.show()

In [None]:
sns.pairplot(data, corner=True)

Now `Year` is also correlated with `Price`

## Create Dummy Variables

Values like `Mumbai` or `Pune` cannot be read into an equation. Using substitutes like `1 for Mumbai`, `2 for Pune` and so on would imply an ordering between cities that does not exist, which should be avoided. However, `Owner_Type` has a natural order and can be given substitutes such as `1 for First owner`, `2 for Second owner` and so on.

For the rest of the categorical variables, dummy variables are created with `drop_first=True`. If we do not drop one of the dummies, the dummies of a variable add up to the constant column, so they become linearly related and violate model assumptions.
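The difference between the two encodings can be seen on a tiny example. This is a sketch on a hypothetical four-row frame, not the notebook's data: `Owner_Type` is mapped to ordered integers, while `Location` is one-hot encoded with the first level dropped as the baseline.

```python
import pandas as pd

# Tiny hypothetical frame; values are illustrative only
cars = pd.DataFrame({
    "Location": ["Mumbai", "Pune", "Delhi", "Pune"],
    "Owner_Type": ["First", "Second", "Third", "First"],
})

# Ordinal feature: an explicit mapping keeps the natural order
owner_order = {"First": 1, "Second": 2, "Third": 3}
cars["Owner_Type"] = cars["Owner_Type"].map(owner_order)

# Nominal feature: one-hot encode and drop the first level as the baseline
encoded = pd.get_dummies(cars, columns=["Location"], drop_first=True)
print(encoded.columns.tolist())
```

Here `Delhi` becomes the baseline: a row with both `Location_Mumbai` and `Location_Pune` equal to zero is a Delhi car, so no information is lost by dropping its column.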

In [None]:
data.head()

In [None]:
data1 = data.copy()

ind_vars = data1.drop(["Price_log", 'Price'], axis=1)
dep_var = data1[["Price_log"]]

In [None]:
def encode_cat_vars(x):
    x = pd.get_dummies(
        x,
        columns=x.select_dtypes(include=["object", "category"]).columns.tolist(),
        drop_first=True,
    )
    return x


ind_vars_num = encode_cat_vars(ind_vars)
ind_vars_num.head()

# Model Building

In [None]:
#split the data into train and test
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


x_train, x_test, y_train, y_test = train_test_split(ind_vars_num, dep_var, test_size=0.3, random_state=1)

In [None]:
#Size of the training and testing set
print('The size of the training and testing sets are: \n')
print('X train size', x_train.size)
print('-'*30)
print('X test size', x_test.size)
print('-'*30)
print('y train size', y_train.size)
print('-'*30)
print('y test size', y_test.size)

# Shape (rows, columns) of the training and testing sets
print('The shape of the training and testing sets are: \n')
print('X train shape', x_train.shape)
print('-'*30)
print('X test shape', x_test.shape)
print('-'*30)
print('y train shape', y_train.shape)
print('-'*30)
print('y test shape', y_test.shape)

## Choose Model, Train and Evaluate

In [None]:
import statsmodels.api as sm

# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train = sm.add_constant(x_train)
# Add constant to test data
x_test = sm.add_constant(x_test)


def build_ols_model(train):
    # Create the model
    olsmodel = sm.OLS(y_train["Price_log"], train)
    return olsmodel.fit()


olsmodel1 = build_ols_model(x_train)
print(olsmodel1.summary())

**Observations**
- The p-value of a variable indicates whether the variable is significant. At a significance level of 0.05 (5%), any variable with a p-value below 0.05 is considered significant, and the rest of the variables will be dropped one by one.

- A negative coefficient shows that `Price_log` decreases as the variable increases.
- A positive coefficient shows that `Price_log` increases as the variable increases.
- However, these variables might be multicollinear, which affects the p-values, so we first need to deal with multicollinearity and then look at the p-values.


* Both the R-squared and Adjusted R-squared of our model are very high. This indicates that the model explains up to 95% of the variance in the log price of used cars on the training data.
* The model is not underfitting.
* To be able to make statistical inferences from our model, we will have to test that the linear regression assumptions are followed.
* Before we move on to assumption testing, we'll do a quick performance check on the test data.

In [None]:
import math

# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())




# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


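# Model performance on train and test data
# Note: relies on the global y_train / y_test created when the data was split
def model_pref(olsmodel, x_train, x_test):

    # In-sample prediction, converted from log(Price) back to Price
    y_pred_train_pricelog = olsmodel.predict(x_train)
    y_pred_train_Price = y_pred_train_pricelog.apply(math.exp)
    # Actual values must be back-transformed too, so both sides are in Lakhs
    y_train_Price = y_train["Price_log"].apply(math.exp)

    # Prediction on test data
    y_pred_test_pricelog = olsmodel.predict(x_test)
    y_pred_test_Price = y_pred_test_pricelog.apply(math.exp)
    y_test_Price = y_test["Price_log"].apply(math.exp)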

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train_Price, y_train_Price),
                    rmse(y_pred_test_Price, y_test_Price),
                ],
                "MAE": [
                    mae(y_pred_train_Price, y_train_Price),
                    mae(y_pred_test_Price, y_test_Price),
                ], 
                 
                
            }
        )
    )
    

#Checking model performance
model_pref(olsmodel1, x_train, x_test)  # High Overfitting.

* The Root Mean Squared Error of the train and test data is starkly different, indicating that our model is overfitting the train data.
* The Mean Absolute Error indicates that our current model is able to predict used car prices within a mean error of 11.05 Lakhs on the test data.
* The units of both RMSE and MAE are the same (Lakhs in this case), but RMSE is greater than MAE because it penalises outliers more.
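Because the model is fitted on log(Price), both the predictions and the actual values have to be brought back to the Price scale before RMSE and MAE can be read in Lakhs. The sketch below illustrates this with five hypothetical prices; it also shows how a single large miss pulls RMSE well above MAE.

```python
import numpy as np


def rmse(predictions, targets):
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))


def mae(predictions, targets):
    return float(np.mean(np.abs(targets - predictions)))


# Hypothetical actual prices (Lakhs) and model outputs on the log scale
actual_price = np.array([2.5, 4.0, 6.5, 12.0, 45.0])
pred_log = np.log(np.array([2.7, 3.6, 7.0, 11.0, 30.0]))

pred_price = np.exp(pred_log)  # back-transform predictions to Lakhs

rmse_val = rmse(pred_price, actual_price)
mae_val = mae(pred_price, actual_price)
print(f"RMSE: {rmse_val:.2f} Lakhs, MAE: {mae_val:.2f} Lakhs")
```

The last car is under-predicted by 15 Lakhs; squaring that error is what makes RMSE roughly twice the MAE in this toy example.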


### Checking the Linear Regression Assumptions

1. No Multicollinearity
2. Mean of residuals should be 0
3. Linearity of variables
4. No Heteroscedasticity
5. Normality of error terms

#### Let's check Multicollinearity using VIF scores

##### TEST FOR MULTICOLLINEARITY

* Multicollinearity occurs when predictor variables in a regression model are correlated with each other. This is a problem because each coefficient is interpreted as the effect of one predictor while the others are held constant. When the correlation between predictors is high, the coefficient estimates become unstable, so the coefficients that the model suggests are unreliable.

* There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).

* **Variance Inflation Factor**: Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\hat{\beta}_k$ is "inflated" by the existence of correlation among the predictor variables in the model.

* General rule of thumb: If VIF is 1, there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of $\hat{\beta}_k$ is not inflated at all. If VIF exceeds 5 or is close to 5, we say there is moderate multicollinearity, and if it is 10 or more, it shows signs of high multicollinearity.
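The VIF of predictor $k$ is $\mathrm{VIF}_k = 1 / (1 - R_k^2)$, where $R_k^2$ is the R-squared obtained by regressing predictor $k$ on all the other predictors. The sketch below computes it by hand with NumPy on synthetic data (the three columns are hypothetical); it is meant to show where the number comes from, not to replace the statsmodels helper used in the notebook.

```python
import numpy as np


def vif_from_r2(X, k):
    """VIF of column k: regress it on the other columns (plus intercept), then 1 / (1 - R^2)."""
    y = X[:, k]
    others = np.delete(X, k, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


# Synthetic predictors (hypothetical): x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

vifs = [vif_from_r2(X, k) for k in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns get very large VIFs while the independent column stays close to 1, which matches the rule of thumb above.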

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif


print(checking_vif(x_train))

In [None]:
# There are a few variables with high VIF.
# Our current model is extremely complex. Let us first bin the Brand and Model columns.
# This will not necessarily reduce multicollinearity in the data, but it will make the dataset more manageable.

# We will create a new column Car_Category based on the average new price per brand and model:
data.groupby(["Car_Brand", "Model"])['new_price_num'].mean().sort_values(ascending=False)


In [None]:
# Create a new variable Car_Category by binning new_price_num
df1= data.copy()
df1["Car_Category"] = pd.cut(
    x=df1["new_price_num"],
    bins=[0, 15, 30, 50, 200],
    labels=["Budget_Friendly", "Mid-Range", "Luxury_Cars", "Ultra_luxury"],
)
# df1["Car_Category"].value_counts()

# Drop the brand, model and region columns.
df1.drop(columns=["Car_Brand", "Model", 'Brand', 'Region'], inplace=True)
df1.columns

In [None]:
# We will have to create the x and y datasets again
y = df1[['Price_log']]
X = df1.drop(['Price_log', 'Price'], axis=1)


def encode_cat_vars(x):
    x = pd.get_dummies(
        x,
        columns=x.select_dtypes(include=["object", "category"]).columns.tolist(),
        drop_first=True,
    )
    return x


X = encode_cat_vars(X)
X.head()



# Splitting data into train and test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0], "\n\n")

# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train = sm.add_constant(x_train)
# Add constant to test data
x_test = sm.add_constant(x_test)

# Fit linear model on new dataset
olsmodel2 = build_ols_model(x_train)
print(olsmodel2.summary())

The R-squared and Adjusted R-squared values have decreased, but are still quite high, indicating that we have been able to capture most of the information of the previous model even after reducing the number of predictor features.

As we try to decrease overfitting, the R-squared of the model on the train data is expected to decrease.

In [None]:
# Check VIF
print(checking_vif(x_train))

In [None]:
# Checking model performance
model_pref(olsmodel2, x_train, x_test)  # No Overfitting.

* The RMSE on train data has increased but has been reduced for the test data. 
* The RMSE values on both the dataset being close to each other indicate that the model is not overfitting the training data anymore.
* Reducing overfitting has caused the MAE to decrease on testing data


We have managed to control overfitting and reduce the test data error.

Let us now remove multicollinearity from the model.

## Removing Multicollinearity
 * To remove multicollinearity
  1. Drop every column one by one, that has VIF score greater than 5.
  2. Look at the adjusted R square of all these models
  3. Drop the Variable that makes least change in Adjusted-R square
  4. Check the VIF Scores again
  5. Continue till you get all VIF scores under 5
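Step 2 compares models with different numbers of predictors, which is why Adjusted R-squared rather than plain R-squared is used: it penalises every extra predictor. A minimal sketch of the formula follows; the R-squared values, sample size and predictor counts are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared; n_predictors excludes the intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical case: dropping one predictor lowers R-squared from 0.900 to 0.899
full = adjusted_r2(0.900, n_obs=100, n_predictors=10)
reduced = adjusted_r2(0.899, n_obs=100, n_predictors=9)
print(f"full: {full:.4f}, reduced: {reduced:.4f}")
```

In this toy case the reduced model has the higher Adjusted R-squared even though its plain R-squared is lower, which is the situation step 3 looks for.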

In [None]:
# Method to drop all the multicollinear column and choose which one we should drop
def treating_multicollinearity(high_vif_columns, x_train, x_test):
    """
    Drop every column that has VIF score greater than 5, one by one.
    Look at the adjusted R square of all these models
    Look at the RMSE of all these models on test data
    """
    adj_rsq_scores = []
    rmse_test_data = []

    # build ols models by dropping one of these at a time and observe the Adjusted R-squared
    for cols in high_vif_columns:
        train = x_train.loc[:, ~x_train.columns.str.startswith(cols)]
        test = x_test.loc[:, ~x_test.columns.str.startswith(cols)]
        # Create the model
        olsres = build_ols_model(train)
        # Adj R-Sq
        adj_rsq_scores.append(olsres.rsquared_adj)
        # RMSE (Test data), computed on the Price scale
        y_pred_test_pricelog = olsres.predict(test)
        y_pred_test_Price = y_pred_test_pricelog.apply(math.exp)
        # Back-transform the actual values too, so both sides are in Lakhs
        y_test_Price = y_test["Price_log"].apply(math.exp)
        rmse_test_data.append(rmse(y_pred_test_Price, y_test_Price))

    # Add new Adj_Rsq and RMSE after dropping each column
    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj_rsq_after_dropping_col": adj_rsq_scores,
            "Test RMSE": rmse_test_data,
        }
    ).sort_values(by="Adj_rsq_after_dropping_col", ascending=False)

    print(temp)
    print("\n\n")

In [None]:
# Prefixes must match the actual column names (case-sensitive) for startswith() to drop them
high_vif_columns = [
    "engine_num",
    "power_num",
    "new_price_num",
    "Fuel_Type",
    "Car_Category",
]
treating_multicollinearity(high_vif_columns, x_train, x_test)

In [None]:
# Dropping Car_Category would have the maximum impact on the predictive power of the model (amongst the variables being considered)
# We'll drop engine_num and check the VIF again

# Drop 'engine_num' from train and test
col_to_drop = "engine_num"
x_train = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]

# Check VIF now
vif = checking_vif(x_train)
print("VIF after dropping ", col_to_drop)
print(vif)

In [None]:
# Dropping engine_num has brought the VIF of power_num below 5
# new_price_num, Fuel_Type and Car_Category still show some multicollinearity

# Check which one of these we should drop next
high_vif_columns = [
    "new_price_num",
    "Fuel_Type",
    "Car_Category",
]
treating_multicollinearity(high_vif_columns, x_train, x_test)

In [None]:
# Drop 'new_price_num' from train and test since the RMSE and Adj. Rsq is not affected much by this variable
col_to_drop = "new_price_num"
x_train = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]

# Check VIF now
vif = checking_vif(x_train)
print("VIF after dropping ", col_to_drop)
print(vif)

We have now removed multicollinearity from the numeric predictors.

The Fuel_Type dummy variables still show high VIF because most cars are either Diesel or Petrol, so these two dummies are strongly correlated with each other.

We will not drop this variable from the model because it does not affect the interpretation of the other features in the model.

In [None]:
# Fit linear model on new dataset
olsmodel3 = build_ols_model(x_train)
print(olsmodel3.summary())

print("\n\n")

# Checking model performance
model_pref(olsmodel3, x_train, x_test)

The model's R-squared and Adjusted R-squared are the same as for the previous model, olsmodel2.
Removing the multicollinear variables has not caused any information loss in the model.

The RMSE of the model on the train data has increased.
Before we can make inferences from this model, let us ensure that the other model assumptions are followed.

### Checking Assumption 2: Mean of residuals should be 0

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x) 

residuals = olsmodel3.resid
np.mean(residuals)

* The mean of the residuals is very close to 0.

### Checking Assumption 3: Linearity of variables

Predictor variables must have a linear relation with the dependent variable.

To test the assumption, we'll plot the residuals against the fitted values and check that the residuals do not form a strong pattern. They should be randomly scattered around zero along the x-axis.






In [None]:
# predicted values
fitted = olsmodel3.fittedvalues

# sns.set_style("whitegrid")
# Pass x and y as keywords: recent seaborn versions no longer accept them positionally
sns.residplot(x=fitted, y=residuals, color="purple", lowess=True)
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()

### Checking Assumption 4: No Heteroscedasticity


TEST FOR HOMOSCEDASTICITY

* Homoscedasticity - If the variance of the residuals is constant across the fitted values, the data is said to be homoscedastic.

* Heteroscedasticity - If the variance of the residuals is not constant across the fitted values, the data is said to be heteroscedastic. In this case the residuals can form a funnel shape or any other non-symmetrical shape.

We'll use the Goldfeld-Quandt test to test the following hypotheses:

- Null hypothesis: Residuals are homoscedastic
- Alternate hypothesis: Residuals are heteroscedastic

alpha = 0.05
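The idea behind the Goldfeld-Quandt test is to split the ordered sample in two, fit a regression on each part, and compare the residual variances with an F ratio. The NumPy sketch below computes only that ratio on synthetic data (hypothetical values, rows assumed to be sorted by the variable suspected of driving the variance); the statsmodels function used in the notebook additionally reports a p-value.

```python
import numpy as np


def goldfeld_quandt_f(y, X):
    """Residual variance of the second half divided by that of the first half.

    Assumes rows are already ordered and that X contains an intercept column.
    """
    n, p = X.shape
    split = n // 2

    def resid_mse(y_part, X_part):
        coef, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
        resid = y_part - X_part @ coef
        return float(resid @ resid) / (len(y_part) - p)

    return resid_mse(y[split:], X[split:]) / resid_mse(y[:split], X[:split])


rng = np.random.default_rng(7)
x = np.sort(rng.uniform(1, 10, 400))
X = np.column_stack([np.ones_like(x), x])
y_const = 2 + 3 * x + rng.normal(0, 1.0, 400)        # constant noise
y_funnel = 2 + 3 * x + rng.normal(0, 1.0, 400) * x   # noise grows with x (funnel shape)

f_const = goldfeld_quandt_f(y_const, X)
f_funnel = goldfeld_quandt_f(y_funnel, X)
print(f"constant noise: F = {f_const:.2f}, growing noise: F = {f_funnel:.2f}")
```

An F ratio close to 1 is consistent with homoscedastic residuals, while a ratio far above 1 points to variance that grows along the ordering.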

In [None]:
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals, x_train)
lzip(name, test)

Since the p-value > 0.05, we cannot reject the null hypothesis that the residuals are homoscedastic.

Assumption 4 is also satisfied by our olsmodel3.

### Checking Assumption 5: Normality of error terms

The residuals should be normally distributed.

In [None]:
# Plot histogram of residuals
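# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(residuals, kde=True)
plt.show()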

In [None]:
# Plot q-q plot of residuals against a normal distribution
import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

The residuals have a close to normal distribution, so assumption 5 is also satisfied.
We should further investigate the values in the tails, where the model has made large residual errors.

Now that we have seen that olsmodel3 follows all the linear regression assumptions, let us use the model to draw inferences.

In [None]:
print(olsmodel3.summary())

## Observations from the model

1. With our linear regression model we have been able to capture ~90% of the variation in our data.
2. The model indicates that the most significant predictors of price of used cars are - 
    - The year of manufacturing
    - Number of seats in the car
    - Power of the engine
    - Mileage
    - Kilometers Driven
    - Location
    - Fuel_Type
    - Transmission - Automatic/Manual
    - Car Category - budget brand to ultra luxury
    The p-values for these predictors are <0.05 in our final model.
3. Newer cars sell for higher prices. A 1 unit increase in the year of manufacture multiplies the price by exp(0.1170) ≈ 1.124, i.e. an increase of roughly 12.4%, when everything else is constant.
It is important to note here that the predicted values are log(price), and therefore the coefficients have to be converted accordingly to understand their influence on Price. Because of the log scale, the effect is a percentage change in price rather than a fixed amount in Lakhs.
4. As the number of seats increases, the price of the car increases: each additional seat multiplies the price by exp(0.0343) ≈ 1.035, i.e. an increase of roughly 3.5%.
5. Mileage is inversely correlated with Price. Generally, high Mileage cars are the lower budget cars.
It is important to note here that correlation is not equal to causation. That is to say, an increase in Mileage does not by itself lead to a drop in prices. Rather, cars with high mileage tend not to have a high power engine and therefore have low prices.
6. Kilometers Driven has a negative relationship with the price, which is intuitive. A car that has been driven more will have more wear and tear and hence sell at a lower price, everything else being equal.
7. The categorical variables are a little hard to interpret. But it can be seen that all the car_category variables in the model have a positive relationship with the Price, and the magnitude of this positive relationship increases as the brand category moves towards the luxury brands. The dropped car_category variable for budget friendly cars serves as the baseline, so these coefficients should be read as price premiums relative to budget cars; in that sense the budget category sits at the bottom of the price range (the other 3 are increasingly positive).
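
Because the target is log(price), one way to read a coefficient is as a multiplicative effect on price. A minimal sketch of that conversion follows, using the two coefficient values quoted in the observations above. The helper `pct_change` is a hypothetical name introduced for this illustration, and natural logs are assumed.

```python
import numpy as np

# Coefficients on the log(price) scale, as quoted in the observations
coefs = {"Year": 0.1170, "Seats": 0.0343}


def pct_change(beta):
    """Percent change in price for a one-unit increase in the predictor."""
    return (np.exp(beta) - 1) * 100


for name, beta in coefs.items():
    print(f"{name}: price multiplier {np.exp(beta):.3f}, about {pct_change(beta):.1f}% per unit")
```

The multiplier applies to whatever the current price is, so the same coefficient implies a larger change in Lakhs for an expensive car than for a cheap one.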


* Some southern markets tend to have higher prices. It might be a good strategy to plan growth in southern cities using this information. Markets like Kolkata (coeff = -0.2) are very risky and we need to be careful about investments in this area.
* We will have to analyse the cost side of things before we can talk about profitability in the business. We should gather data regarding that.
* The next step after that would be to cluster different sets of data and see if we should build multiple models for different locations/car types.

# Add-on: Analyzing predictions where we were way off the mark

In [None]:
# Extract the rows of the original data frame `data` whose index appears in the training data
original_df = data[data.index.isin(x_train.index.values)].copy()

# Residuals and fitted values from the final model (both on the log(Price) scale)
residuals = olsmodel3.resid
fitted_values = olsmodel3.fittedvalues

# Add new columns for the predicted values
original_df["Predicted price_log "] = fitted_values
original_df["Predicted Price"] = np.exp(fitted_values)  # back-transform from the log scale
original_df["residuals"] = residuals
# Note: despite its name, this column holds exp(residual), i.e. the ratio of
# actual to predicted price (assuming the target is the natural log of Price)
original_df["Abs_residuals"] = np.exp(residuals)
original_df["Difference in Lakhs"] = np.abs(
    original_df["Price"] - original_df["Predicted Price"]
)

# Look at the 100 predictions where the model made the largest estimation errors (on train data)
original_df.sort_values(by=["Difference in Lakhs"], ascending=False).head(100)

* A 2017 Land Rover, whose new model sells at 230 Lakhs and whose used version sold at 160 Lakhs, was predicted to sell at 32 Lakhs. It is not apparent from the numerical predictors why our model predicted such a low value here. This could be because all the other Land Rovers in our data seem to have sold at lower prices.
* The second one in the list is a Porsche Cayenne that was sold at 2 Lakhs, but our model predicted the price as 85.4 Lakhs. This is most likely a data entry error: a Porsche manufactured in 2019 selling for 2 Lakhs is highly unlikely. With all the information we have, the predicted price of about 85 Lakhs seems much more likely. We will be better off dropping this observation from our current model. If possible, the better route would be to gather more information here.
* There are a few instances where the model predicts a price lower than the actual selling price. These could be a cause for concern, since a model that underestimates the potential selling price is not good for business.
* Let us quickly visualise some of these observations.

In [None]:
# Pass x and y as keywords; recent seaborn versions do not accept them positionally
sns.scatterplot(
    x=original_df["Difference in Lakhs"],
    y=original_df["Price"],
    hue=original_df["Fuel_Type"],
)

Most outliers are Petrol cars. Our model predicts that the resale value of diesel cars is higher than that of petrol cars, which is probably the cause of these outliers.
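
A visual impression like this can be cross-checked by counting large errors per fuel type. The sketch below uses a small toy frame as a stand-in for `original_df`. Its values and the `threshold` cut-off are made up for the illustration and are not results from the model.

```python
import pandas as pd

# Toy stand-in for original_df; the values are made up for illustration only
toy = pd.DataFrame(
    {
        "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol", "Diesel"],
        "Difference in Lakhs": [12.0, 1.5, 30.0, 2.0, 0.5, 9.0],
    }
)

threshold = 5.0  # hypothetical cut-off for a "large" error, in Lakhs

# Keep only the rows with a large error and count them per fuel type
large = toy[toy["Difference in Lakhs"] > threshold]
counts = large["Fuel_Type"].value_counts()

# Share of each fuel type's cars that have a large error
share = counts / toy["Fuel_Type"].value_counts()

print(counts)
print(share)
```

Comparing shares rather than raw counts matters when one fuel type is much more common in the data than the other.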