# Car price prediction

<img src="https://s1.1zoom.ru/b5050/215/BMW_E46_M3_silver_450821_1366x768.jpg" alt="Drawing" style="width: 900px;">



# Table of contents

- [Imports](#imports)
- [Read the data](#read)
- [EDA](#eda)
  - [Overview](#eda.overview)
  - [Data transformation. Stage 1](#eda.dt1)
  - [Let's take a closer look at the data](#eda.closer)
  - [Data transformation. Stage 2](#eda.dt2)
  - [Deal with NA](#eda.na)
  - [Data transformation. Stage 3](#eda.dt3)
  - [And final pairplot...](#eda.fpp)
  - [Conclusion](#eda.c)
- [Linear Regression](#lr)
  - [Dataset](#lr.ds)
  - [Regression analysis](#lr.ra)
  - [Conclusion](#lr.c)
  - [Ridge regression](#lr.rr)
  - [Conclusion](#lr.rrc)
- [XGBoost](#xgb)
  - [Dataset](#xgb.ds)
  - [Model](#xgb.m)
- [Conclusion](#conclusion)

<a id="imports"></a>
# Imports

In [None]:
import os
import re
import random
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

from sklearn.linear_model import LinearRegression, Lasso, Ridge
import xgboost as xgb

from sklearn.metrics import mean_squared_error, r2_score

import statsmodels.api as sm
import statsmodels.stats.diagnostic as smd
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy.stats import shapiro, boxcox, kstest, probplot
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

In [None]:
random_state=10
#warnings.filterwarnings("error")
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
def validate(y_true, y_pred):
    resid = y_true - y_pred
    
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print("MSE: %s" % mse)
    print("R^2: %s" % r2)
    print("Residuals mean: {0}".format(np.mean(resid)))
    
    fig, ax = plt.subplots(figsize=(19,4), ncols=4)
    ax[0] = sns.scatterplot(x=y_true, y=resid, ax=ax[0])
    ax[1] = sns.scatterplot(x=y_true, y=y_pred, ax=ax[1])
    ax[2] = sns.histplot(resid, ax=ax[2])
    probplot(resid, dist="norm",  plot=ax[3])
    
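    # Compare against a normal distribution with the residuals' own mean and std;
    # kstest(resid, 'norm') alone would test against the standard normal N(0, 1)
    statistic, p_value = kstest(resid, 'norm', args=(np.mean(resid), np.std(resid)))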
    if p_value>0.05:
        print("Distribution is normal. Statistic: {0:.3}, p-value: {1:.4}".format(statistic, p_value))
    else:
        print("Distribution is not normal. Statistic: {0:.3}, p-value: {1:.4}".format(statistic, p_value))

<a id="read"></a>
# Read the data 

In [None]:
data = pd.read_csv("/kaggle/input/vehicle-dataset-from-cardekho/Car details v3.csv")
display(data.head(3))

<a id="eda"></a>
# EDA

<a id="eda.overview"></a>
### Overview

In [None]:
display(data.info())

We see missing values and some data type mismatches.

The most important features are filled in completely, so I'll deal with the missing data later.
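
To put a number on the missing values mentioned above, one option is to count them per column. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `data`); the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; values are invented for illustration
toy = pd.DataFrame({
    "selling_price": [450000, 370000, 158000, 225000],
    "mileage": ["23.4 kmpl", None, "17.7 kmpl", None],
    "seats": [5.0, 5.0, np.nan, 5.0],
})

# Number and share of missing values per column, largest first
na_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(na_summary)
```

Running the same two lines on `data` would show which columns need imputation and how much of each is affected.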

<a id="eda.dt1"></a>
### Data transformation. Stage 1

#### Data types

In [None]:
data["mileage"] = data["mileage"].str.replace(" kmpl", "")
data["mileage"] = data["mileage"].str.replace(" km/kg", "")
data["mileage"] = data["mileage"].astype(float)

data["engine"] = data["engine"].str.replace(" CC", "", regex=False)
data["engine"] = pd.to_numeric(data["engine"], errors="coerce")

data["max_power"] = data["max_power"].str.replace(" bhp", "", regex=False)
# Empty strings (rows that contained only the unit) and other unparsable values become NaN
data["max_power"] = pd.to_numeric(data["max_power"], errors="coerce")

#### owner

In [None]:
remapped = {'First Owner': 1, 'Second Owner': 2, 'Third Owner': 3, 'Fourth & Above Owner': 4, 'Test Drive Car': 0}
data = data.replace({"owner": remapped})

#### date

In [None]:
# Car age in years, relative to the newest model year in the dataset
max_date = data["year"].max()
data["year2"] = max_date - data["year"]

#### torque

In [None]:
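def torque_parser(x):
    """Extract torque in Nm from a free-text value; return NaN if nothing can be parsed."""
    try:
        # Note: only the integer part of the number is captured
        match = re.search(r"(\d+).*(nm|kgm)", x, re.IGNORECASE)
        if match:
            value, unit = float(match.group(1)), match.group(2).lower()
        else:
            # No unit found: take the first number and assume Nm
            value, unit = float(re.findall(r"\d+", x)[0]), "nm"
    except (TypeError, IndexError):
        # Missing value (NaN) or a string without digits
        return np.nan
    # Values below 100 labelled kgm are converted to Nm (1 Nm = 0.10197 kgm);
    # larger ones are assumed to be Nm already, despite the label
    if unit == "kgm" and value < 100:
        return value / 0.10197
    return value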

data["torque2"] = data["torque"].apply(torque_parser)

In [None]:
data = data.drop(["year", "torque"], axis=1)

<a id="eda.closer"></a>
### Let's take a closer look at the data

In [None]:
display(data.describe())
display(data.describe(include=object))

In [None]:
ax = sns.pairplot(data)

- Not all predictors have a linear relationship with the target variable.
- I assume that the brand and model will affect the price.
- Zero mileage and zero max_power look like bad values.
- km_driven above 300k km looks like outliers.
- mileage above 35 looks like outliers.
- A torque of 789 Nm looks like a data error.
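
Before dropping anything, it can help to count how many rows each of these observations would affect. The following is a minimal sketch on an invented frame (`toy` is hypothetical, as are its values); the thresholds are the ones listed above.

```python
import pandas as pd

# Invented rows; each one violates exactly one of the rules below
toy = pd.DataFrame({
    "km_driven": [45000, 350000, 120000, 80000],
    "mileage": [21.0, 18.5, 0.0, 42.0],
    "max_power": [74.0, 0.0, 88.5, 67.1],
})

# One boolean mask per suspicious pattern
rules = {
    "zero mileage": toy["mileage"] <= 0,
    "zero max_power": toy["max_power"] <= 0,
    "km_driven >= 300k": toy["km_driven"] >= 300_000,
    "mileage >= 35": toy["mileage"] >= 35,
}

# Number of rows flagged by each rule
flag_counts = pd.Series({name: int(mask.sum()) for name, mask in rules.items()})
print(flag_counts)
```

If a rule flags a large share of the real data, it is probably describing a genuine subgroup rather than outliers, and the threshold deserves a second look.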

<a id="eda.dt2"></a>
### Data transformation. Stage 2

#### Brand and model

In [None]:
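def brand_parser(x):
    """Return the first two tokens of the name (brand and model)."""
    try:
        brand, model = re.findall(r"^(\S*)\s(\S*)", x)[0]
    except (TypeError, IndexError):
        # Missing value or a name without a second token
        brand, model = "unparsed", "value"
    return brand + " " + model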
    
data["brand_model"] = data["name"].apply(brand_parser)

In [None]:
fig, ax = plt.subplots(figsize=(16,3))

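# Sample up to 40 distinct models (without replacement) to keep the plot readable
vals = data["brand_model"].unique()
rng = np.random.RandomState(random_state)
models = rng.choice(vals, size=min(40, len(vals)), replace=False)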
df = data[data["brand_model"].isin(models)]
ax = sns.boxplot(x=df["brand_model"], y=df["selling_price"], ax=ax)

for tick in ax.get_xticklabels():
    tick.set_rotation(90)

<a id="eda.na"></a>
### Deal with NA

Fill in the missing values with the average for each brand_model. Records whose brand_model has no known value at all stay empty and are dropped afterwards.

In [None]:
# Select the column before transforming, so text columns are never averaged
for col in ["mileage", "engine", "max_power", "torque2"]:
    data[col] = data.groupby("brand_model")[col].transform(lambda x: x.fillna(x.mean()))
data["seats"] = data.groupby("brand_model")["seats"].transform(lambda x: x.fillna(np.round(x.mean())))

In [None]:
na_count = data.isna().any(axis=1).sum()
print("Records with NA values: %s" % na_count)
data = data.dropna()

<a id="eda.dt3"></a>
### Data transformation. Stage 3

#### Drop bad values

In [None]:
data = data[data["mileage"]>0]
data = data[data["max_power"]>0]

#### Target variable

In [None]:
data["selling_price2"] = np.log(data["selling_price"])
cols = data.columns.tolist()
cols = cols[-1:] + cols[:-1]
data = data[cols]

#### km_driven

In [None]:
# km_driven above 300k km looks like outliers. Better safe than sorry...
data = data[data["km_driven"]<300000]

#### mileage

In [None]:
# mileage above 35 looks like outliers. Same treatment.
data = data[data["mileage"]<35]

#### Torque  values

In [None]:
# "Maruti Zen D" torque looks like a mistake: it should be 78 Nm, not 789 Nm. I would have liked 790 Nm, but no.
data.loc[data["name"]=="Maruti Zen D", "torque2"] = 78
# this will make the relationship between torque2 and target variable more linear 
data["torque2"] = np.log(data["torque2"])

#### Other mistakes

In [None]:
# Fix some mistakes
data.loc[data["brand_model"]=="Honda BRV", "brand_model"] = "Honda BR-V"
data.loc[data["brand_model"]=="Ford Ecosport", "brand_model"] = "Ford EcoSport"
data.loc[data["brand_model"]=="Ambassador CLASSIC", "brand_model"] = "Ambassador Classic"

<a id="eda.fpp"></a>
### And final pairplot...

In [None]:
ax = sns.pairplot(data)

In [None]:
fig, ax = plt.subplots(figsize=(15,5))
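# Restrict to numeric columns; recent pandas versions raise on text columns
corr = data.select_dtypes(include="number").corr()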
ax = sns.heatmap(corr, annot=True, ax=ax, cmap="YlGnBu")

<a id="eda.c"></a>
### Conclusion
1. The distribution of the target variable appears to be normal. Linear regression does not assume this, but in this case it improves the result.
2. Explicit outliers were removed and data errors were corrected.
3. The relationship between the predictors and the target variable appears to be linear.
4. The correlation matrix does not show a strong linear relationship between predictors.

In [None]:
data = data.drop(["selling_price"], axis=1)
data_cleared = data.copy()

In [None]:
data = data_cleared.copy()

<a id="lr"></a>
# Linear regression model

<a id="lr.ds"></a>
## Dataset

In [None]:
data_lr = data.copy()

y = data_lr["selling_price2"]
X = data_lr.drop(["name", "selling_price2"], axis=1)

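# dtype=float keeps the design matrix numeric (recent pandas versions return bool dummies by default)
X = pd.get_dummies(X, columns=["fuel", "seller_type", "transmission", "owner", "seats", "brand_model"], dtype=float)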

display(X.shape)
display(X.head(2))

<a id="lr.ra"></a>
## Regression analysis

In [None]:
X_ = sm.add_constant(X)
model_ols = sm.OLS(y, X_).fit()
print(model_ols.summary())

In [None]:
y_pred = model_ols.predict(X_)
validate(y, y_pred)

This model explains 94.6% of the variation in the dependent variable, while the MSE was 0.038.

Diagnosing the model revealed two problems:
 - Non-normal distribution of the residuals
 - Signs of heteroscedasticity
 
Violating the linear regression assumptions can mean that the trained model is not optimal for the given dataset.
Also, if the assumption about the distribution of the residuals is violated, we cannot reliably use statistical tests to determine the significance of the predictors.

Violations of the linear regression assumptions may be caused by outliers, non-linear relationships, or a missing predictor.

Looking ahead, I will say that the transformation of predictors did not lead to an increase in the accuracy of the model.

Let's try to identify and remove outliers.

In [None]:
influence = model_ols.get_influence()
(c, p) = influence.cooks_distance
    
distances = pd.DataFrame(c, index=X.index)
distances = distances.fillna(1)

In [None]:
n_max = 30
max_values = distances.nlargest(n_max, columns=0)[0]

fig, ax = plt.subplots(figsize=(16, 4))
ax.set_yscale("log")
ax = sns.barplot(y=max_values, x=np.arange(n_max), ax=ax, palette="Blues_r")

In [None]:
X = X.drop(max_values.index)
y = y.drop(max_values.index)

In [None]:
X_ = sm.add_constant(X)
model_ols2 = sm.OLS(y, X_).fit()
print(model_ols2.summary())


In [None]:
# Evaluate the refitted model (model_ols2), not the original one
y_pred = model_ols2.predict(X_)
validate(y, y_pred)

<a id="lr.c"></a>
### Conclusion

Removing the most influential points slightly improved performance, but the model has not changed fundamentally.

Let's build a regression model taking into account the identified problems.

I will assume that the heteroscedasticity may be due to a missing predictor, for example the trim level of the car, which affects the price.

<a id="lr.rr"></a>
## Ridge regression

In [None]:
scaler = StandardScaler()
X_sc = pd.DataFrame(scaler.fit_transform(X), index=X.index)

X_train, X_test, y_train, y_test = train_test_split(X_sc, y, test_size=0.33, random_state=random_state)

In [None]:
def hyperopt(X, y, params):
    try:
        # `normalize` is omitted: False was its default, and the argument was removed in scikit-learn 1.2
        model = Ridge(**params)
        score = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
        return -score.mean()
    
    except Exception as ex :
        print(ex)
        return np.inf

def f_model(params):
    global best
    global best_params
    acc = hyperopt(X_train, y_train, params)
    if (acc < best):
        best = acc
        best_params = params
        print("new best: {0:.7} {1}".format(best, params))
    return {'loss': acc, 'status': STATUS_OK}


def model_tune(space, random_state=random_state, iters=10):
    global best
    global best_params
    best, best_params = np.inf, None 
    # Recent hyperopt releases expect a NumPy Generator here; older ones expect np.random.RandomState
    res = fmin(f_model, space, algo=tpe.suggest, max_evals=iters, rstate=np.random.default_rng(random_state))
    # `normalize` is omitted: False was its default, and the argument was removed in scikit-learn 1.2
    model = Ridge(random_state=random_state, **best_params)
    print("\nBest_params: \n", best_params)
    return model


space_l = {
    'alpha': hp.uniform('alpha', 0.00001, 2),
    'tol': hp.uniform('tol', 0.000001, 0.5),
}

In [None]:
model_reg = model_tune(space_l, iters=30)

In [None]:
model_reg = model_reg.fit(X_train, y_train)

In [None]:
y_pred = model_reg.predict(X_test)
validate(y_test, y_pred)

<a id="lr.rrc"></a>
### Conclusion

The model is built, but the previously mentioned problems remain: the residuals are not normally distributed and there are signs of heteroscedasticity.

The MSE is 0.042, and the mean of the residuals is close to zero.
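
Because the target is `log(selling_price)`, the MSE is measured on the log scale and is not directly a price error. One rough way to interpret it is to take the square root and exponentiate, which gives a typical multiplicative error. This is a back-of-the-envelope reading that treats the RMSE as a typical residual; the variable names are illustrative.

```python
import numpy as np

mse_log = 0.042                      # MSE reported above, on log(selling_price)
rmse_log = np.sqrt(mse_log)          # typical residual size on the log scale
typical_factor = np.exp(rmse_log)    # corresponding multiplicative error in price

print("RMSE (log scale): {0:.3f}".format(rmse_log))
print("Typical price error factor: {0:.3f}".format(typical_factor))
```

The RMSE is about 0.205 on the log scale, so a typical prediction is off by a factor of roughly 1.23, that is, about 23% above the true price or 19% below it.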

<a id="xgb"></a>
# XGBoost

For comparison, let's build an XGBoost model.

<a id="xgb.ds"></a>
## Dataset

In [None]:
data_xgb = data.copy()

y = data_xgb["selling_price2"].copy()
X = data_xgb.drop(["name", "selling_price2"], axis=1).copy()

X = pd.get_dummies(X, columns=["fuel", "seller_type", "seats", "transmission", "brand_model"], drop_first=True)

scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index)

display(X.head(3))
display(y.head(3))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

<a id="xgb.m"></a>
## Model

In [None]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', 
                          colsample_bytree = 0.3, 
                          learning_rate = 0.1, 
                          max_depth = 10, 
                          alpha = 1, 
                          n_estimators = 250)

xg_reg = xg_reg.fit(X_train,y_train)


In [None]:
y_pred = xg_reg.predict(X_test)
validate(y_test, y_pred)

<a id="conclusion"></a>
# Conclusion

In this solution the following steps were taken:

1. Data understanding and preparation
    - Removed outliers and erroneous values
    - Parsed text values
    - Filled missing values
    - Transformed features
2. Regression analysis
    - Evaluated whether the linear regression assumptions hold
    - Removed outliers based on Cook's distance
3. Fitted a ridge regression model, with hyperparameters tuned via HyperOpt.
4. Fitted a comparative model based on XGBoost.

The linear regression model showed two problems: the residuals are not normally distributed, and heteroscedasticity is observed.
I assume that the cause is a missing important predictor. When using such a model, keep in mind that it may not be optimal for the given task and dataset.

In addition, the accuracy of the linear regression almost matched that of the XGBoost model. It seems difficult to achieve a better result on the current data.

Thanks for your attention!