# Car Price Prediction - Linear Regression
In this notebook, we look at a real-life application of linear regression: predicting used car prices. We build a model and train it on the data we have.

In this practical example, we go through each step: data preprocessing, checking the OLS assumptions, creating dummy variables to incorporate categorical data, model training, and model testing.

Let's start the process!

## 1. Importing Libraries

In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

sns.set()

## 2. Loading Raw Data

In [None]:
raw_data = pd.read_csv("../input/used-car-prices/used_cars_prices.csv")
raw_data.head(4)

## 3. Data Preprocessing

In [None]:
raw_data.describe(include = "all")  # summary statistics for every column, numerical and categorical

**Insights:**
- There are 4345 observations.
- Columns 'Price' and 'EngineV' have some null values.
- Some columns have outliers.

### 3.1. Dropping Unwanted Columns

Column 'Model' has 312 unique categorical values in 4345 observations. Creating a dummy variable for each unique value would complicate our model, so for now we drop the column.

In [None]:
data_p1 = raw_data.drop(["Model"], axis = 1)
data_p1.head(1)

### 3.2. Dealing with Missing Values

In [None]:
data_p1.isnull().sum()

There are a little over 300 missing values in the data. A common rule of thumb is that observations with missing values can be dropped when they make up less than 5% of the data.
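
The rule of thumb above can be checked numerically before calling dropna(). The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the car data); it computes the share of missing values per column and the share of rows that dropna() would remove.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data (values are made up for illustration).
toy = pd.DataFrame({
    "Price": [4200.0, np.nan, 13300.0, 23000.0, 18300.0],
    "EngineV": [2.0, 2.9, np.nan, 4.2, 2.0],
    "Mileage": [277, 427, 358, 240, 120],
})

# Share of missing values in each column, in percent.
missing_pct = toy.isnull().mean() * 100

# Share of rows with at least one missing value, i.e. the rows dropna() removes.
rows_lost_pct = toy.isnull().any(axis=1).mean() * 100

print(missing_pct)
print(rows_lost_pct)
```

On the real data, the same two lines applied to data_p1 would show whether the rows lost stay under the 5% threshold.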

In [None]:
data_p2 = data_p1.dropna()
data_p2.describe(include = "all")

### 3.3. Removing of Outliers

This is a very important step because outliers can strongly influence the model. Ideally, the variables should also be close to normally distributed, which is not possible with extreme values at either end.

We will plot the probability distribution function (PDF) of each numerical variable to look for outliers.

In [None]:
sns.histplot(data_p2["Price"], kde=True, stat="density")  # histogram with a kernel density estimate

The graph shows a long right tail: a few cars are priced far above the rest. These are our outliers.

One way to handle high outliers is to drop the top 0.5% or 1% of the observations. We will keep only the observations below the 99th percentile of price.
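
Since the same trimming step is repeated for several columns, it can be wrapped in a small helper. This is only a sketch: the function name drop_above_quantile is hypothetical and not part of the notebook, and the toy data is made up.

```python
import pandas as pd

def drop_above_quantile(df, column, q=0.99):
    """Return the rows of df whose `column` value lies strictly below its q-th quantile."""
    cutoff = df[column].quantile(q)
    return df[df[column] < cutoff]

# Toy data: prices 1..100, so the 99% quantile is 99.01 under linear interpolation.
toy = pd.DataFrame({"Price": list(range(1, 101))})
trimmed = drop_above_quantile(toy, "Price", q=0.99)

print(len(toy), len(trimmed), trimmed["Price"].max())
```

The strict `<` comparison mirrors the filtering used in the cells of this section.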

In [None]:
q = data_p2["Price"].quantile(0.99)
data_p31 = data_p2[data_p2["Price"] < q]
sns.histplot(data_p31["Price"], kde=True, stat="density")
data_p31.describe()

We can see the difference: the maximum price is now much closer to the mean.

We will do the same for the other numerical variables.

In [None]:
sns.histplot(data_p31["Mileage"], kde=True, stat="density")

In [None]:
q = data_p31["Mileage"].quantile(0.99)
data_p32 = data_p31[data_p31["Mileage"] < q]
sns.histplot(data_p32["Mileage"], kde=True, stat="density")
data_p32.describe()

In [None]:
sns.histplot(data_p32["EngineV"], kde=True, stat="density")

The case of engine volume is a little different: here we need some common sense or vehicle knowledge to address outliers. A quick search shows that no cars have engine volumes as high as 30 or 40 litres. Some sports cars have bigger engines, and even those are generally no more than 5-6 litres.

A common bad practice in data entry is to fill in a placeholder such as 99.99 when a value is unknown. That could explain the implausible engine volumes here.

Rather than guess the true values, we will drop all observations with an engine volume of 6.5 litres or more.
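
Before dropping rows on a domain-knowledge cutoff, it can help to look at which values exceed it. The sketch below uses a made-up Series (not the real EngineV column) to show how repeated placeholder values such as 99.99 stand out in a frequency count.

```python
import pandas as pd

# Made-up engine volumes, including placeholder-like entries.
engine_v = pd.Series([1.6, 2.0, 99.99, 3.0, 6.5, 74.0, 99.99], name="EngineV")

cutoff = 6.5  # same threshold as in the text
suspicious = engine_v[engine_v >= cutoff]

# Frequency of each suspicious value; placeholders tend to repeat.
counts = suspicious.value_counts()

print(counts)
print(f"{len(suspicious)} of {len(engine_v)} values would be dropped")
```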

In [None]:
data_p33 = data_p32[data_p32["EngineV"] < 6.5]
sns.histplot(data_p33["EngineV"], kde=True, stat="density")
data_p33.describe()

In [None]:
sns.histplot(data_p33["Year"], kde=True, stat="density")

Year has a left-skewed distribution, so here the outliers are at the low end. We will drop the bottom 1%.

In [None]:
q = data_p33["Year"].quantile(0.01)
data_p34 = data_p33[data_p33["Year"] > q]
sns.histplot(data_p34["Year"], kde=True, stat="density")
data_p34.describe()

**Preprocessing Complete!**

We are done with data preprocessing. We will save the resulting data in the preprocessed_data variable and reset the index.

In [None]:
preprocessed_data = data_p34.reset_index(drop = True)
preprocessed_data.head()

In [None]:
preprocessed_data.describe(include = "all")

## 4. Checking OLS Assumptions
Now that we have clean data, we need to check some assumptions that must hold before applying linear regression.

### 4.1. Linearity
First up, we need to check whether each feature of our future model has a linear relationship with the target.

The best way to do this is to plot a scatter chart of each feature against the target. We could call plt.scatter() for each one separately, but to make the charts easier to compare we will plot all three in one row. Since 'Price' is common to every chart, we will create a figure in which the panels share the y-axis.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))  # sharey=True: all three panels share the Price axis

ax1.scatter(preprocessed_data['Mileage'],preprocessed_data['Price'])
ax1.set_title('Price and Mileage')
ax2.scatter(preprocessed_data['EngineV'],preprocessed_data['Price'])
ax2.set_title('Price and EngineV')
ax3.scatter(preprocessed_data['Year'],preprocessed_data['Price'])
ax3.set_title('Price and Year')

plt.show()

Looking at these charts, none of the regressors seems to have a linear relationship with the target. To fix this, we can transform one of the variables in each chart. Since price is common to all three, let's transform it first by taking its natural logarithm. We will store the result in a new column 'LogPrice', drop the original 'Price' column, and call the new data frame linear_data.

In [None]:
preprocessed_data["LogPrice"] = np.log(preprocessed_data["Price"])
linear_data = preprocessed_data.drop(["Price"], axis = 1)
linear_data.head(3)
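
To see why a log transformation can help, here is a small sketch on synthetic data. It assumes, purely for illustration, that price decays exponentially with mileage; the coefficients and noise level are made up and are not estimates from the car data. Under that assumption the correlation with mileage is noticeably stronger for the log of price than for the raw price.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: log price falls linearly with mileage, plus noise.
mileage = rng.uniform(0, 300, size=2000)
log_price = 10 - 0.02 * mileage + rng.normal(0, 0.2, size=2000)
price = np.exp(log_price)

# Pearson correlation measures the strength of the *linear* relationship.
corr_raw = np.corrcoef(mileage, price)[0, 1]
corr_log = np.corrcoef(mileage, np.log(price))[0, 1]

print(round(corr_raw, 3), round(corr_log, 3))
```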

Prices have been transformed. Let's plot scatter chart again with transformed prices.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))  # sharey=True: all three panels share the LogPrice axis

ax1.scatter(linear_data['Mileage'], linear_data['LogPrice'])
ax1.set_title('LogPrice and Mileage')
ax2.scatter(linear_data['EngineV'], linear_data['LogPrice'])
ax2.set_title('LogPrice and EngineV')
ax3.scatter(linear_data['Year'], linear_data['LogPrice'])
ax3.set_title('LogPrice and Year')

plt.show()

Now we can see a roughly linear relationship between the target and each regressor.

### 4.2. Multicollinearity
To check whether our features have interdependencies, we use the Variance Inflation Factor (VIF), which measures how much a feature depends on the other features.

statsmodels provides a function for this, so let's import it.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

We will only check the numerical variables here; the categorical variables are handled later through dummy variables.

First, we create a DataFrame of the variables to check and an empty DataFrame to hold the VIF values.

In [None]:
check_variables = linear_data[["Mileage", "EngineV", "Year"]]
vif = pd.DataFrame()

**Calculating VIF**

In [None]:
vif["VIF"] = [variance_inflation_factor(check_variables.values, i) for i in range(check_variables.shape[1])]
vif["Features"] = check_variables.columns
vif = vif[["Features", "VIF"]]
vif

**Findings:** We have got VIF for each factor.

VIF ranges from 1 to infinity.
- VIF = 1 means no interdependency with any other feature.
- VIF between 1 and 5 means some interdependency, which is usually fine.
- VIF below 10 is considered acceptable by some practitioners in some scenarios.
- VIF above 10 is troublesome.
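
To make these ranges concrete, the sketch below computes VIF directly from its textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features plus an intercept. The helper name vif_by_definition and the synthetic columns are assumptions for illustration. Note that statsmodels' variance_inflation_factor works on exactly the columns it is given and does not add an intercept itself, so values computed with and without a constant column can differ.

```python
import numpy as np

def vif_by_definition(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)             # independent of a
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a

vifs = vif_by_definition(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this synthetic setup the independent column b stays close to 1, while a and c, which carry almost the same information, land far above 10.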

Year has a VIF above 10. That seems fair, since mileage and year are interdependent: older cars have usually been driven more. One way to fix this is to drop one of the two, so let's drop 'Year'. We will call the result qualified_data.

In [None]:
qualified_data = linear_data.drop(["Year"], axis = 1)
qualified_data.head(2)

Now we have data that is clean and complies with the OLS assumptions we checked. We did not check for autocorrelation because it mostly arises in time series data, which we do not have here. For now, we also assume that no important variable has been left out, so we take there to be no endogeneity. We also take normality and homoscedasticity to hold, so the fifth assumption is considered satisfied as well.

## 5. Dummy Variables for Categorical Data
There is just one thing left before we start the regression analysis: creating dummy variables for the categorical data.

The pandas get_dummies() function takes a DataFrame and returns one in which each categorical column is replaced by dummy variables. With drop_first = True, the first category of each variable is dropped and serves as the baseline. We will call this regression-ready data 'data'.

In [None]:
data = pd.get_dummies(qualified_data, drop_first = True)
data.head(3)
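
A toy example makes the effect of drop_first = True easier to see. The frame below is made up; with three brands, only two dummy columns are created, and the alphabetically first brand (Audi here) becomes the baseline that is represented by zeros in both.

```python
import pandas as pd

# Made-up data with one numerical and one categorical column.
toy = pd.DataFrame({
    "Mileage": [120, 80, 200, 150],
    "Brand": ["Audi", "BMW", "Toyota", "BMW"],
})

# Numerical columns are kept as they are; 'Brand' becomes dummy columns.
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```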

A little rearrangement to make 'LogPrice' the first column.

In [None]:
data.columns

In [None]:
data = data[['LogPrice', 'Mileage', 'EngineV', 'Brand_BMW', 'Brand_Mercedes-Benz',
       'Brand_Mitsubishi', 'Brand_Renault', 'Brand_Toyota', 'Brand_Volkswagen',
       'Body_hatch', 'Body_other', 'Body_sedan', 'Body_vagon', 'Body_van',
       'Engine Type_Gas', 'Engine Type_Other', 'Engine Type_Petrol',
       'Registration_yes']]
data.head(3)
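
Typing out every column name works, but the same reordering can also be built programmatically, which avoids typos when the set of dummy columns changes. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with the target somewhere in the middle.
toy = pd.DataFrame({
    "Mileage": [120, 80],
    "EngineV": [2.0, 1.6],
    "LogPrice": [9.1, 9.6],
    "Brand_BMW": [0, 1],
})

# Put the target first and keep every other column in its existing order.
ordered_cols = ["LogPrice"] + [c for c in toy.columns if c != "LogPrice"]
toy = toy[ordered_cols]

print(toy.columns.tolist())
```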

## 6. Declaring Inputs and Target
Now we are ready to apply linear regression to our data. The first step is to declare the independent variables (inputs) and the dependent variable (target) for the model.

In [None]:
target = data["LogPrice"]
inputs = data.drop(["LogPrice"], axis = 1)

## 7. Scaling Data (Standardization)
Next up is data standardization.

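Standardizing the dummy variables is optional: for a linear regression it does not change the model's predictions, but the dummies lose their 0/1 interpretation once scaled. Here we scale all inputs together.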

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(inputs)

scaled_inputs = scaler.transform(inputs)

## 8. Train - Test Split
To make sure that our model is neither underfitted nor overfitted, we will test it on data with known targets. For this, we split the data into two groups: one for training the model and one for testing it.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, target, test_size = 0.2, random_state = 365)
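
A common variation on the order used here is to split first and fit the scaler on the training rows only, so that no information from the test rows enters the scaling. The sketch below shows that variation on synthetic data; it is an alternative to consider, not the procedure this notebook follows, and all values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(365)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic inputs
y = rng.normal(size=200)                         # synthetic target

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=365
)

# ... then learn the mean and standard deviation from the training rows only
# and apply the same transformation to the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.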

## 9. Training the Model

We are ready to train our model with training set of inputs and target.

In [None]:
reg = LinearRegression()  # create the linear regression object

reg.fit(x_train, y_train)  # fit the model to the training data

y_hat = reg.predict(x_train)  # predict log prices for the training inputs, to compare with the actual values

We have trained the model and made predictions for the same inputs used in training, to compare how close our predicted prices are to the actual prices.

Let's draw a scatter plot of actual against predicted prices (both on the log scale). The more observations lie close to the 45-degree line, the better the model.

In [None]:
plt.scatter(y_train, y_hat)

plt.xlabel("Actual Prices", fontsize = 18)
plt.ylabel("Predicted Prices", fontsize = 18)

plt.xlim(6, 13)  # setting scale on x-axis
plt.ylim(6, 13)  # setting scale on y-axis

plt.show()

Seems like a good model, doesn't it?

Another way to judge the model is to plot the **Probability Distribution Function of Residuals** (Residual = Observation - Prediction).

In [None]:
sns.histplot(y_train - y_hat, kde=True, stat="density")
plt.title("Residuals PDF")
plt.show()

In the best case, the residuals are normally distributed around zero. This graph is mostly okay, but it has more residuals on the negative side. Because a negative residual means the prediction exceeds the observation, this suggests that the model overestimates the price more often than it underestimates it.
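
The visual impression from the residual plot can be backed up with a few numbers. The helper below is a hypothetical sketch (residual_summary is not part of the notebook) that reports the mean residual, the share of negative residuals, and the sample skewness; the input values are made up.

```python
import numpy as np

def residual_summary(actual, predicted):
    """Summarize residuals (actual - predicted): mean, share below zero, skewness."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    centered = residuals - residuals.mean()
    std = residuals.std()
    skewness = (centered ** 3).mean() / std ** 3 if std > 0 else 0.0
    return {
        "mean": residuals.mean(),
        "share_negative": (residuals < 0).mean(),
        "skewness": skewness,
    }

# Made-up log prices: the model overshoots in four of five cases, once by a lot.
actual = [9.0, 9.5, 10.0, 8.0, 9.2]
predicted = [9.1, 9.6, 10.1, 9.5, 9.1]

summary = residual_summary(actual, predicted)
print(summary)
```

A share of negative residuals well above one half, or a clearly negative skewness, points to the kind of overestimation described above.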

Another way to assess the model is to find its R-Squared.

In [None]:
reg.score(x_train, y_train)

It tells us that around 75% of the variability of the target is explained by the features included in our model.
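
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that definition with NumPy on made-up numbers; the function name r_squared is an assumption for illustration.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot

y = np.array([8.0, 9.0, 10.0, 11.0])

perfect = r_squared(y, y)                            # predictions equal the targets
baseline = r_squared(y, np.full_like(y, y.mean()))   # always predict the mean
partial = r_squared(y, [8.5, 9.0, 10.0, 10.5])       # somewhere in between

print(perfect, baseline, partial)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, which gives a sense of scale for the value reported above.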

## 10. Creating a Regression Summary

A regression summary shows the direction and size of each feature's effect on the target.

In [None]:
reg_summary = pd.DataFrame(inputs.columns, columns = ["Features"])
reg_summary["Weight"] = reg.coef_
reg_summary

A negative weight means that the target decreases as the feature increases.

Dummies are a bit different: the dropped category acts as the benchmark, and the weight of each remaining category shows how much more or less it affects the price relative to that benchmark.
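
Because the target is the log of price and the inputs are standardized, a weight can be read as a multiplicative effect: raising a feature by one standard deviation, with the others held fixed, multiplies the predicted price by exp(weight). The sketch below converts a weight into a percentage change; the weight of -0.4 is hypothetical and not taken from the regression summary.

```python
import math

def pct_price_change(weight):
    """Percent change in predicted price for a one-standard-deviation increase
    in a standardized feature, when the model's target is log price."""
    return (math.exp(weight) - 1) * 100

example_weight = -0.4  # hypothetical weight, for illustration only
change = pct_price_change(example_weight)

print(f"{change:.1f}%")
```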

## 11. Model Testing

With the model ready to predict, let's test it on data it has not seen yet. Because we know the actual targets for the test set, we can measure how well the model performs.

In [None]:
y_hat_test = reg.predict(x_test)

Predictions calculated. Now we can plot actual and predicted prices against each other in a scatter plot. For a good model, the points should cluster around the 45-degree line.

In [None]:
plt.scatter(y_test, y_hat_test, alpha = 0.3)  # alpha sets the transparency of the points, so we can see where they are concentrated

plt.xlabel("Actual Prices", fontsize = 18)
plt.ylabel("Predicted Prices", fontsize = 18)

plt.xlim(6, 13)
plt.ylim(6, 13)

plt.show()

Now we can build a DataFrame with the actual and predicted prices of the cars, plus a few more columns to make the comparison clear. Since the model predicts log prices, we apply np.exp() to convert them back to prices.

In [None]:
pd.options.display.max_rows = 999  # show up to 999 rows so that every prediction is printed

predictions = pd.DataFrame(np.exp(y_hat_test), columns = ["Predicted Prices"])
y_test = y_test.reset_index(drop = True)
predictions["Actual Prices"] = np.exp(y_test)
predictions["Residual"] = predictions["Actual Prices"] - predictions["Predicted Prices"]
predictions["Difference %"] = np.absolute(predictions["Residual"]/predictions["Actual Prices"]*100)
predictions = predictions.sort_values(by = ["Difference %"], ascending = False)
predictions
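
A long sorted table is hard to judge at a glance, so a few summary figures on the percentage differences can help. The sketch below repeats the same column calculations on three made-up rows and then reports the median and mean percentage difference and the share of predictions within 15% of the actual price; the 15% threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted prices.
toy = pd.DataFrame({
    "Actual Prices": [10000.0, 20000.0, 5000.0],
    "Predicted Prices": [9000.0, 25000.0, 5000.0],
})
toy["Residual"] = toy["Actual Prices"] - toy["Predicted Prices"]
toy["Difference %"] = np.absolute(toy["Residual"] / toy["Actual Prices"] * 100)

median_pct = toy["Difference %"].median()
mean_pct = toy["Difference %"].mean()
within_15 = (toy["Difference %"] <= 15).mean()  # share of rows within 15%

print(median_pct, round(mean_pct, 2), round(within_15, 2))
```

On the real table, predictions["Difference %"].describe() gives a similar overview in one call.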