# Little introduction

Hi, thanks for checking out my notebook. If you find any mistakes or have suggestions for improvement, please don't hesitate to leave them in the comments. I am new to this field, so any feedback or advice is appreciated. Thanks in advance and good luck!

# Dictionary

* kmpl - Kilometers Per Litre is a term used to express the fuel efficiency of a vehicle. Fuel efficiency is defined as the ratio of distance travelled per unit of fuel consumed.
* 1 km/kg = 1.4 kmpl (approximate conversion factor used in this notebook)
* cc - The term “cc” stands for cubic centimeters (cm³), a metric unit for measuring an engine's capacity, i.e. its volume. A larger capacity usually means more power.
* bhp - Traditionally ‘brake horsepower’ (bhp) has been used as the definitive measurement of engine power. It’s distinct from horsepower because it takes into account power loss due to friction – it’s measured by running an engine up to full revs, then letting it naturally slow down to a dead stop.
* nm at rpm - Torque is rotational force, and since an engine relies on a rotating crank to do its work, torque is the force the engine is able to generate. Modern engines generate different levels of torque at different engine speeds (RPMs, or revolutions per minute that the engine is turning through). It’s expressed in Newton-Metres (Nm), and this is what you actually feel when you’re pushed back into your seat on acceleration. A car brochure will indicate the maximum torque the engine is able to generate, and the specific RPM at which it is generated. For instance, the Maruti Dzire generates 113Nm at 4200RPM (petrol) and 190Nm at 2000RPM (diesel). This means the petrol engine produces less torque at a much higher engine speed than the diesel motor, which produces more at quite a low engine speed. The bottom line: look for good torque (over 110Nm) at a low RPM (4,000 or so).
* 1 kgm = 9.80665 Nm
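
The two conversion factors listed above can be sketched as small helper functions. The function and constant names are hypothetical (they are not used elsewhere in this notebook), and the 1.4 factor is the approximation this notebook assumes for km/kg to kmpl, not an exact physical constant.

```python
# Conversion factors taken from the dictionary above.
KMPL_PER_KM_PER_KG = 1.4   # approximation assumed in this notebook
NM_PER_KGM = 9.80665       # 1 kgm (kilogram-force metre) in newton-metres


def km_per_kg_to_kmpl(value):
    """Convert fuel efficiency from km/kg to kmpl (hypothetical helper)."""
    return value * KMPL_PER_KM_PER_KG


def kgm_to_nm(value):
    """Convert torque from kgm to Nm (hypothetical helper)."""
    return value * NM_PER_KGM


print(km_per_kg_to_kmpl(17.3))
print(kgm_to_nm(12.7))
```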

# Importing the dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from scipy.special import boxcox, boxcox1p
from scipy.stats import skew

import re

In [None]:
orig_data = pd.read_csv('/kaggle/input/vehicle-dataset-from-cardekho/Car details v3.csv')

In [None]:
# read the dataset
data = pd.read_csv('/kaggle/input/vehicle-dataset-from-cardekho/Car details v3.csv')

data.head()

In [None]:
# check the shape of dataset
data.shape

# Data Cleaning

In [None]:
# remove columns with one distinct value to not waste time on them
d_types = dict(data.dtypes)
for name, type_ in d_types.items():
    if data[name].nunique() == 1:
        data = data.drop(name, axis=1)
        print(f'Column "{name}" has been dropped.')

In [None]:
# check if shape changed after removing fields with one distinct value
data.shape

We don't have such columns.
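
For reference, the same check can be written without an explicit loop. This is only a sketch on a made-up toy DataFrame (the column names `a`, `b`, `c` are invented for illustration); it relies on `DataFrame.nunique()` returning one count per column.

```python
import pandas as pd

# Toy frame: column 'a' holds a single distinct value and should be dropped.
toy = pd.DataFrame({
    'a': [1, 1, 1],
    'b': [1, 2, 3],
    'c': ['x', 'x', 'y'],
})

# Boolean Series indexed by column name: True where a column varies.
keep = toy.nunique() > 1

# Select only the columns that have more than one distinct value.
reduced = toy.loc[:, keep]
print(list(reduced.columns))
```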

## Handling the missing values

In [None]:
# check amount of null values in each column and types of columns
data.info()

In [None]:
# suspiciously, the numbers of null values in columns 8-12 are similar;
# if they fall in the same rows, we should consider deleting those rows,
# since that is >40% of the features;
# let's check the number of null values in each row;

data[data.isnull().sum(axis=1)>0].isnull().sum(axis=1).sort_values(ascending=False)

In [None]:
# some rows do have >40% of their features missing;
# let's delete them;

data = data[~(data.isnull().sum(axis=1)>=5)]

In [None]:
data.shape

In [None]:
data[data.isnull().sum(axis=1)>0].isnull().sum(axis=1).sort_values(ascending=False)

In [None]:
# there are only 7 rows left with missing values;
# filling them is not worth it, because it won't
# influence our model much in comparison to the other 7906 rows,
# so to save time let's just drop them

data = data[~(data.isnull().sum(axis=1)>0)]

In [None]:
data.isnull().sum().sort_values(ascending=False)

In [None]:
data

In [None]:
# resetting index
data = data.reset_index(drop=True)
data

## Checking the correctness of types of the columns

In [None]:
# lets take a look again on values in columns of our dataset
data.head()

In [None]:
data.info()

Columns such as 'mileage', 'engine', 'max_power' and 'torque' should be numeric. Let's fix that.

### 'mileage', 'engine' and 'max_power' columns
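
The value/unit split used for these three columns can be illustrated on a toy Series first. The sample strings below are made up for illustration; the sketch assumes every entry has the form `<number> <unit>` separated by whitespace.

```python
import pandas as pd

# Made-up sample values in the '<number> <unit>' shape assumed here.
raw = pd.Series(['23.4 kmpl', '17.3 km/kg', '1248 CC', '74 bhp'])

# Split on whitespace into two columns: 0 = value, 1 = unit.
parts = raw.str.split(expand=True)

values = parts[0].astype(float)  # numeric part
units = parts[1]                 # unit part, kept for a later consistency check

print(values.tolist())
print(units.tolist())
```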

In [None]:
columns = ['mileage', 'engine', 'max_power']
for name in columns:
    data[[name+'_val', name+'_unit']] = data[name].str.split(expand=True)
data[['mileage', 'mileage_val', 'mileage_unit', 'engine', 'engine_val'
     , 'engine_unit', 'max_power', 'max_power_val', 'max_power_unit']].head()

In [None]:
data[['mileage', 'mileage_val', 'mileage_unit', 'engine', 'engine_val'
     , 'engine_unit', 'max_power', 'max_power_val', 'max_power_unit']].info()

In [None]:
#  let's drop columns 'mileage', 'engine', 'max_power' and convert '..._val'
# columns to float for further analysis
for name in columns:
    data = data.drop(name, axis=1)
    data = data.astype({name+'_val':float})
data.info()

In [None]:
# lets check if each column has the same unit for every row
for name in columns:
    print(f"<======= {name+'_unit'} =======>")
    print(data[name+'_unit'].value_counts())

In [None]:
#  look at the first few rows with unit 'km/kg' in 'mileage_unit' to check
# the correctness of the value transformation later
data.loc[data['mileage_unit'] == 'km/kg'].head()

In [None]:
# let's convert values in mileage_val to the 'kmpl' unit (1 km/kg = 1.4 kmpl)
data.loc[data['mileage_unit'] == 'km/kg', 'mileage_val'] = 1.4 * data.loc[data['mileage_unit'] == 'km/kg', 'mileage_val']
data.loc[data['mileage_unit'] == 'km/kg', 'mileage_unit'] = 'kmpl' 
print(f"<======= {'mileage_unit'} =======>")
print(data['mileage_unit'].value_counts())

In [None]:
# check on Row 6 correctness of transformation
data.iloc[6]

In [None]:
# 17.3 - 'mileage_val' before transformation
17.3*1.4

In [None]:
# drop '..._unit' columns and rename '..._val' columns with their units
for name in columns:
    data = data.rename(columns={
        name+'_val':name+'_'+str.lower(data[name+'_unit'][0])
    })
    data = data.drop(name+'_unit', axis=1)
data.info()

### 'torque' column

In [None]:
data['torque'].head(20)

It seems we have the following types of value in 'torque':
* 190Nm@ 2000rpm
* 250Nm@ 1500-2500rpm
* 12.7@ 2,700(kgm@ rpm)
* 22.4 kgm at 1750-2750rpm
* 113.75nm@ 4000rpm

So we will choose the following separators for the split function: 'N', '@', ' ', 'r', '(', ')', 'at', 'n'. We should also remove ',' from the values (see the 3rd example).

Spoiler: the previous list of value formats in 'torque' was far from complete :). The parsing loop returned too many errors. The notebook skips the part where the new formats were found; it basically consisted of looking at the 'torque' values in the rows that produced errors, to understand the missed patterns and complete the previous list of formats. Errors are caught using 'try:'/'except:'. Here is the final list of possible formats:
* 190Nm@ 2000rpm
* 250Nm@ 1500-2500rpm
* 12.7@ 2,700(kgm@ rpm)
* 22.4 kgm at 1750-2750rpm
* 113.75nm@ 4000rpm
* 6.1kgm@ 3000rpm
* 250Nm@ 1500~4500rpm
* 96 Nm at 3000 rpm
* 400Nm
* 135 Nm at 2500  rpm (double space between 2500 and rpm + trailing space)
* 96  Nm at 3000  rpm (same as previous + double space between 96 and Nm)
* 51Nm@ 4000+/-500rpm
* 48@ 3,000+/-500(NM@ rpm)
* 510@ 1600-2400
* 135.4Nm@ 2500
* 210 / 1900
* 400 Nm /2000 rpm
* 380Nm(38.7kgm)@ 2500rpm (occurred in only one row, so '(38.7kgm)@ ' was simply added to the delimiters as an exception)
* 110(11.2)@ 4800 (occurred in only one row, so '(11.2)@ ' was simply added to the delimiters as an exception)
* 215Nm@ 1750-3000

Added delimiters: 'k', '~', '+/-500r', '+/-500(N', ' / ', '(38.7kgm)@' etc. (for the full list, check the code below).
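
As an alternative to a long list of delimiters, the formats listed above can also be handled by extracting the numbers directly. The following is a minimal sketch, not the parser used in this notebook: `parse_torque` is a hypothetical helper, it assumes a missing unit means Nm, and it averages an RPM range into a single value.

```python
import re

NUMBER = re.compile(r'\d+(?:\.\d+)?')


def parse_torque(raw):
    """Return (torque, unit, rpm) from a raw torque string (hypothetical helper)."""
    s = raw.replace(',', '').lower()
    # Drop tolerance suffixes such as '+/-500'.
    s = re.sub(r'\+/-\s*\d+', '', s)
    # Drop parenthesised duplicate values such as '(38.7kgm)' or '(11.2)'.
    s = re.sub(r'\(\s*\d+(?:\.\d+)?\s*(?:kgm)?\s*\)', '', s)
    # Assumption: anything not explicitly marked as kgm is in Nm.
    unit = 'kgm' if 'kgm' in s else 'nm'
    numbers = [float(x) for x in NUMBER.findall(s)]
    if not numbers:
        return None
    torque = numbers[0]
    rpm_values = numbers[1:3]  # one value, or the two ends of a range
    rpm = sum(rpm_values) / len(rpm_values) if rpm_values else None
    return torque, unit, rpm


for sample in ['190Nm@ 2000rpm', '12.7@ 2,700(kgm@ rpm)',
               '250Nm@ 1500~4500rpm', '48@ 3,000+/-500(NM@ rpm)', '400Nm']:
    print(sample, '->', parse_torque(sample))
```

Values that come back with unit 'kgm' would still need converting to Nm afterwards, and rows with `rpm` equal to `None` would still need to be filled in by hand.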

In [None]:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

In [None]:
#  maybe not the best solution for parsing such a column, but still better
# than nothing;
#  final version of the parsing, taking into account all values in the dataset;
#  the algorithm was built on the principle 'find problem - solve problem',
# since I hadn't known all the formats from the beginning, so you may now try to
# create a more rational one by considering the list of formats above;
for i in range(data.shape[0]):
    temp = re.split(
        r'\(11.2\)@ |\(38.7kgm\)@ |\+\/-500r|\+\/-500\(N|  N| N| r|N|@ r|@ |  r|r|\(k|\(| at |n|-|~| k|k| \/ | \/|\s',
        str(data.loc[i, 'torque']).replace(',','').strip(' )')
    )
    if is_number(temp[0]):
        data.loc[i, 'torque_val'] = float(temp[0])
    else:
        print('Error:', i, temp)
    try:
        if is_number(temp[1]):
            data.loc[i, 'torque_eng_sp_val'] = float(temp[1])
            if len(temp) == 2:
                data.loc[i, 'torque_unit'] = 'Nm'
                data.loc[i, 'torque_eng_sp_unit'] = 'rpm'
            elif len(temp) > 2:
                if is_number(temp[2]):
                    data.loc[i, 'torque_eng_sp_val'] = (float(temp[1])+float(temp[2]))/2
                    data.loc[i, 'torque_unit'] = 'Nm'
                    data.loc[i, 'torque_eng_sp_unit'] = 'rpm'
                else:
                    data.loc[i, 'torque_unit'] = temp[2]
                    data.loc[i, 'torque_eng_sp_unit'] = temp[3]
        elif len(temp) == 2:
            data.loc[i, 'torque_unit'] = temp[1]
        elif len(temp) == 3:
            data.loc[i, 'torque_unit'] = temp[1]
            data.loc[i, 'torque_eng_sp_val'] = float(temp[2])
            data.loc[i, 'torque_eng_sp_unit'] = 'rpm'
        elif len(temp) == 4:
            if is_number(temp[3]):
                data.loc[i, 'torque_eng_sp_val'] = (float(temp[2])+float(temp[3]))/2
                data.loc[i, 'torque_eng_sp_unit'] = 'rpm'
                data.loc[i, 'torque_unit'] = temp[1]
            else:
                data.loc[i, 'torque_unit'] = temp[1]
                data.loc[i, 'torque_eng_sp_val'] = float(temp[2])
                data.loc[i, 'torque_eng_sp_unit'] = temp[3]
        elif len(temp) == 5:
            data.loc[i, 'torque_unit'] = temp[1]
            data.loc[i, 'torque_eng_sp_val'] = (float(temp[2])+float(temp[3]))/2
            data.loc[i, 'torque_eng_sp_unit'] = temp[4]
        else:
            print('Error:', i, temp)
    except (ValueError, IndexError):
        print('Error:', i, temp)

In [None]:
data.info()

In [None]:
data[data.isnull().any(axis=1)]['name'].value_counts()

There isn't much variety, so let's look the information up on Google. The results:
* XC40 D4 Inscription BSIV = 1750 rpm (https://www.carwale.com/volvo-cars/xc40/inscription/)
* XC40 D4 R-Design  = 1750 rpm (https://www.carwale.com/volvo-cars/xc40/d4-r-design/)
* XC60 Inscription D5 BSIV = 1750 rpm (https://www.carwale.com/volvo-cars/xc60/d5-inscription/)
* S90 D4 Inscription BSIV = 1750 rpm (https://www.carwale.com/volvo-cars/xc60/d5-inscription/)

In [None]:
# fill our missing values;
data['torque_eng_sp_val'] = data['torque_eng_sp_val'].fillna(1750)
data['torque_eng_sp_unit'] = data['torque_eng_sp_unit'].fillna('rpm')

In [None]:
data.info()

In [None]:
# fix values after parsing
print(f"<======= 'torque_unit' =======>")
print(data['torque_unit'].value_counts())

In [None]:
data['torque_unit'] = data['torque_unit'].replace({
    'm': 'nm',
    'gm': 'kgm',
    'Nm': 'nm',
    'KGM': 'kgm',
    'M': 'nm'
})
print(f"<======= 'torque_unit' =======>")
print(data['torque_unit'].value_counts())

In [None]:
print(f"<======= 'torque_eng_sp_unit' =======>")
print(data['torque_eng_sp_unit'].value_counts())

In [None]:
data['torque_eng_sp_unit'] = data['torque_eng_sp_unit'].replace({
    'pm': 'rpm',
    'RPM': 'rpm'
})
print(f"<======= 'torque_eng_sp_unit' =======>")
print(data['torque_eng_sp_unit'].value_counts())

In [None]:
data.loc[data['torque_unit'] == 'kgm', 'torque_val'].head()

In [None]:
# kgm values to nm (1 kgm = 9.80665 Nm)
data.loc[data['torque_unit'] == 'kgm', 'torque_val'] = 9.80665 * data.loc[data['torque_unit'] == 'kgm', 'torque_val']
data.loc[data['torque_unit'] == 'kgm', 'torque_unit'] = 'nm' 
print(f"<======= 'torque_unit' =======>")
print(data['torque_unit'].value_counts())

In [None]:
# check on Row 2 correctness of transformation
data.iloc[2]

In [None]:
# 12.7 - 'torque_val' before transformation
12.7*9.80665

In [None]:
# drop '..._unit', 'torque' columns and rename '..._val' columns with their units
columns = ['torque', 'torque_eng_sp']
data = data.drop('torque', axis=1)
for name in columns:
    data = data.rename(columns={
        name+'_val':name+'_'+str.lower(data[name+'_unit'][0])
    })
    data = data.drop(name+'_unit', axis=1)
data.info()

## Outliers (spotting and deleting)

In [None]:
fig, axes = plt.subplots(3, 2, sharex=False, figsize=(20, 10))
temp_list = ['km_driven', 'mileage_kmpl', 'engine_cc', 'max_power_bhp', 
             'torque_nm', 'torque_eng_sp_rpm']

for r in range(3):
    for c in range(2):
        axes[r, c].tick_params(labelbottom=True)
        axes[r, c].scatter(y = data['selling_price'], x = data[temp_list[r*2+c]])

Delete the outliers that clearly stand out from the usual records: they may harm the accuracy of our future linear model on more typical cases, and the model would still predict the outliers themselves poorly.
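
The upper bounds applied in the cells that follow (750 for 'torque_nm', 400000 for 'km_driven', 8000000 for 'selling_price') could also be applied in a single pass. This is only a sketch on made-up rows: `drop_above` and `UPPER_LIMITS` are hypothetical names, and the re-plotting after each step is left out.

```python
import pandas as pd

# Upper bounds taken from the outlier-removal cells of this notebook.
UPPER_LIMITS = {
    'torque_nm': 750,
    'km_driven': 400_000,
    'selling_price': 8_000_000,
}


def drop_above(df, upper_limits):
    """Keep only rows that do not exceed any of the given per-column limits."""
    mask = pd.Series(True, index=df.index)
    for column, limit in upper_limits.items():
        mask &= ~(df[column] > limit)
    return df[mask]


# Made-up rows: the second exceeds the torque limit, the third the other two.
toy = pd.DataFrame({
    'torque_nm': [200, 800, 150],
    'km_driven': [50_000, 10_000, 500_000],
    'selling_price': [400_000, 300_000, 9_000_000],
})
filtered = drop_above(toy, UPPER_LIMITS)
print(filtered)
```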

In [None]:
data = data[~(data['torque_nm']>750)]

fig, axes = plt.subplots(3, 2, sharex=False, figsize=(20, 10))
temp_list = ['km_driven', 'mileage_kmpl', 'engine_cc', 'max_power_bhp', 
             'torque_nm', 'torque_eng_sp_rpm']

for r in range(3):
    for c in range(2):
        axes[r, c].tick_params(labelbottom=True)
        axes[r, c].scatter(y = data['selling_price'], x = data[temp_list[r*2+c]])

In [None]:
data = data[~(data['km_driven']>400000)]

fig, axes = plt.subplots(3, 2, sharex=False, figsize=(20, 10))
temp_list = ['km_driven', 'mileage_kmpl', 'engine_cc', 'max_power_bhp', 
             'torque_nm', 'torque_eng_sp_rpm']

for r in range(3):
    for c in range(2):
        axes[r, c].tick_params(labelbottom=True)
        axes[r, c].scatter(y = data['selling_price'], x = data[temp_list[r*2+c]])

In [None]:
data = data[~(data['selling_price']>8000000)]

fig, axes = plt.subplots(3, 2, sharex=False, figsize=(20, 10))
temp_list = ['km_driven', 'mileage_kmpl', 'engine_cc', 'max_power_bhp', 
             'torque_nm', 'torque_eng_sp_rpm']

for r in range(3):
    for c in range(2):
        axes[r, c].tick_params(labelbottom=True)
        axes[r, c].scatter(y = data['selling_price'], x = data[temp_list[r*2+c]])

## Data Encoding

### 'name' column

In [None]:
# lets check the diversity of values in column 'name'
print(f"<======= 'name' =======>")
print(data['name'].value_counts())
print()

In [None]:
# seems like first word in 'name' is a car brand, extract it from this column
# as new column 'car_brand';
data['car_brand'] = data['name'].str.split(expand=True)[0]
data.loc[data['car_brand']=='Land', 'car_brand'] = 'Land Rover'
 
data.head()

In [None]:
# check correctness of values in new column 'car_brand'
data['car_brand'].value_counts()

In [None]:
#  seems like the brands were extracted correctly;
#  drop column 'name' since it has too many distinct values; only the brand
# name is kept as a feature for the model, while the name of the specific
# model is assumed not to add useful information;
data = data.drop('name', axis=1)

# count amount of unique values in new column 'car brand'
data['car_brand'].nunique()

In [None]:
#  let's binary encode the car_brand;
#  import the BinaryEncoder class from category_encoders and encode
# the 'car_brand' column;

from category_encoders import BinaryEncoder

be = BinaryEncoder()
x = be.fit_transform(data['car_brand'])

In [None]:
x.head()

In [None]:
data = pd.concat([data,x],axis=1)
data.head()

In [None]:
# since we encoded column car_brand, we won't need it anymore for fitting model
data = data.drop('car_brand', axis=1)
data.head()
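
To make the idea behind binary encoding more concrete, here is a rough pure-Python sketch: each category gets an integer code, and the code is written out as binary digits, one digit per new column. The helper `binary_encode` is hypothetical, and assigning codes in sorted order starting from 1 is an arbitrary choice made here; the exact codes and column layout produced by `category_encoders` may differ.

```python
def binary_encode(values):
    """Encode categories as lists of binary digits (illustrative sketch only)."""
    categories = sorted(set(values))
    # Integer code per category, starting from 1 (an assumption of this sketch).
    codes = {category: i + 1 for i, category in enumerate(categories)}
    # Number of binary digits needed for the largest code.
    width = max(codes.values()).bit_length()
    return [
        [int(bit) for bit in format(codes[value], f'0{width}b')]
        for value in values
    ]


brands = ['Maruti', 'Hyundai', 'Tata', 'Maruti']
encoded = binary_encode(brands)
print(encoded)
```

The appeal of this scheme is that the number of new columns grows with the logarithm of the number of categories, rather than one column per category as in one-hot encoding.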

### 'year' column

In [None]:
#  finding out unique values of 'year';
data['year'].value_counts()

In [None]:
data['year'].nunique()

In [None]:
#  performing feature mapping since the higher the year, the newer
# the car is;
year_dict = {}
for i in range(27):
    year_dict[1994+i] = i
year_dict

In [None]:
data['year_0'] = data['year'].replace(year_dict)
data['year_0'].value_counts()

In [None]:
data = data.drop('year', axis=1)

In [None]:
data.info()

### 'fuel' column

In [None]:
#  finding out unique values of 'fuel';
data['fuel'].value_counts()

In [None]:
#  binary encode the 'fuel' column;
be = BinaryEncoder()
x = be.fit_transform(data['fuel'])
x.head()

In [None]:
data = pd.concat([data,x],axis=1)
data.head()

In [None]:
# since we encoded column fuel, we won't need it anymore for fitting model
data = data.drop('fuel', axis=1)
data.head()

### 'seller_type' column

In [None]:
#  finding out unique values of column;
data['seller_type'].value_counts()

In [None]:
#  binary encode the 'seller_type' column;
be = BinaryEncoder()
x = be.fit_transform(data['seller_type'])
x.head()

In [None]:
data = pd.concat([data,x],axis=1)
data.head()

In [None]:
# since we encoded column seller_type, we won't need it anymore for fitting model
data = data.drop('seller_type', axis=1)
data.head()

### 'transmission' column

In [None]:
#  finding out unique values of column;
data['transmission'].value_counts()

In [None]:
#  performing feature mapping to (0,1) since column has only two values;
data['transmission_0'] = data['transmission'].replace({'Manual':1, 'Automatic':0})
data['transmission_0'].value_counts()

In [None]:
data = data.drop('transmission', axis=1)

In [None]:
data.head()

### 'owner' column

In [None]:
#  finding out unique values of column;
data['owner'].value_counts()

In [None]:
#  performing feature mapping to reflect how much the car has been used;
#  judging by km_driven, Test Drive Cars seem to have been used the least;
owner_dict={
    'Test Drive Car' : 0,
    'First Owner' : 1,
    'Second Owner' : 2,
    'Third Owner' : 3,
    'Fourth & Above Owner' : 4
}

data['owner_0'] = data['owner'].replace(owner_dict)
data['owner_0'].value_counts()

In [None]:
data = data.drop('owner', axis=1)
data.head()

In [None]:
#  lets check if all data has been converted to numbers;
data.info()

# Check the correlation of data

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

corr = data.corr()
sns.heatmap(corr, annot=True, ax=ax)

In [None]:
#  as we can see, max_power_bhp has a high correlation with a few columns;
#  the other high correlations are between columns that were of "object" type
# at the beginning, so we won't consider them;
data = data.drop('max_power_bhp', axis=1)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

corr = data.corr()
sns.heatmap(corr, annot=True, ax=ax)

# Preparing data for model fitting

In [None]:
#  now we are ready to build our linear model;
Y = data['selling_price'].values
Y

In [None]:
X = data.drop('selling_price', axis=1).values
X

In [None]:
from sklearn.preprocessing import StandardScaler

Scaler = StandardScaler()
X = Scaler.fit_transform(X)

In [None]:
X.mean(), X.var()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, 
                                                    random_state = 5)

In [None]:
X.shape, X_train.shape, X_test.shape, Y.shape, Y_train.shape, Y_test.shape

# Model

In [None]:
from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

In [None]:
Y_pred = lin_model.predict(X_test)
Y_pred

In [None]:
def rmse_score(y_test , y_pred):
    value = (1/len(y_test))*np.sum((y_test - y_pred)**2)
    return np.sqrt(value)

def r2_score(y_test , y_pred):
    ssr = (1/len(y_test))*np.sum((y_test - y_pred)**2)
    sst = (1/len(y_test))*np.sum((y_test - np.mean(y_test))**2)
    return (1 - (ssr/sst))

def mae(y_test , y_pred):
    return (1/len(y_test))*np.sum(np.abs(y_test - y_pred))

def adj_r2_score(y_test , y_pred , n_features):
    numerator = (1-r2_score(y_test , y_pred))*(len(y_test) - 1)
    denominator = len(y_test) - n_features - 1
    return 1 - (numerator/denominator)

In [None]:
print('RMSE = ', rmse_score(Y_test , Y_pred))
print('MAE = ', mae(Y_test , Y_pred))
print('R2 = ', r2_score(Y_test , Y_pred))
print('adj_R2 = ', adj_r2_score(Y_test , Y_pred, X.shape[1]))
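
As a sanity check of the metric definitions above, they can be evaluated on a tiny made-up example where the values are easy to verify by hand. The functions are restated here in a compact form so that the snippet is self-contained; the arrays are invented for illustration and have nothing to do with the dataset.

```python
import numpy as np


def rmse_score(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot


def adj_r2_score(y_true, y_pred, n_features):
    n = len(y_true)
    return 1 - (1 - r2_score(y_true, y_pred)) * (n - 1) / (n - n_features - 1)


# Errors are 1, 0, -1, -1: squared errors sum to 3, absolute errors sum to 3.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

print('RMSE =', rmse_score(y_true, y_pred))
print('MAE =', mae(y_true, y_pred))
print('R2 =', r2_score(y_true, y_pred))
print('adj_R2 =', adj_r2_score(y_true, y_pred, 1))
```

By hand: the mean squared error is 3/4, so RMSE is the square root of 0.75; MAE is 3/4; the total sum of squares around the mean of 6 is 20, so R2 is 1 - 3/20 = 0.85; with one feature and four samples, adjusted R2 is 1 - 0.15 * 3 / 2 = 0.775.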