In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing Dataset

In [2]:
dataset = pd.read_csv('Car-Ads.csv')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 16 columns):
maker                  171819 non-null object
model                  137956 non-null object
mileage                182252 non-null float64
manufacture_year       181223 non-null float64
engine_displacement    159210 non-null float64
engine_power           170868 non-null float64
body_type              136005 non-null object
color_slug             12240 non-null object
stk_year               102359 non-null object
transmission           160413 non-null object
door_count             164243 non-null object
seat_count             156800 non-null object
fuel_type              94681 non-null object
date_created           200000 non-null object
date_last_seen         200000 non-null object
price_eur              200000 non-null float64
dtypes: float64(5), object(11)
memory usage: 24.4+ MB


# Data Wrangling

Selecting features with roughly 80% or more non-null values (engine_displacement and seat_count fall just under that, at about 79.6% and 78.4%).

In [3]:
dataset = dataset.iloc[:, [2,3,4,5,9,10,11,13,14,15]]
dataset

Unnamed: 0,mileage,manufacture_year,engine_displacement,engine_power,transmission,door_count,seat_count,date_created,date_last_seen,price_eur
0,178000.0,2000.0,1390.0,55.0,man,4,5,2016-01-03 19:42:48.205853+00:00,2016-01-07 00:56:35.766128+00:00,2500.30
1,135000.0,2007.0,1149.0,55.0,man,4,5,2015-12-08 08:46:03.020179+00:00,2016-01-18 19:02:24.218185+00:00,2980.24
2,138000.0,2005.0,1984.0,147.0,man,4.0,5.0,2016-03-05 22:09:11.127858+00:00,2016-07-03 17:39:48.838084+00:00,8010.25
3,105000.0,2009.0,,,,,,2015-12-12 19:48:16.546082+00:00,2016-01-02 10:02:05.676711+00:00,2300.26
4,129385.0,2003.0,,,,,,2016-01-01 17:28:46.527414+00:00,2016-01-17 22:49:09.853789+00:00,2800.30
...,...,...,...,...,...,...,...,...,...,...
199995,280000.0,1997.0,,,,,,2015-12-12 17:05:30.968304+00:00,2015-12-15 01:22:14.496198+00:00,1299.15
199996,,2000.0,1900.0,66.0,man,,,2015-11-14 20:36:38.222092+00:00,2016-01-27 20:40:15.463610+00:00,2072.54
199997,150000.0,2006.0,,,man,,,2015-12-28 01:50:23.539210+00:00,2015-12-29 03:51:10.129326+00:00,2290.67
199998,5093.0,2015.0,1395.0,110.0,auto,4,5,2015-12-18 21:13:01.874178+00:00,2016-01-19 06:46:12.073615+00:00,28989.71
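
The column selection above is done by position. As a minimal sketch, the same kind of selection can be derived from the data by computing the share of non-null values per column. The helper name `select_dense_columns`, the `min_fraction` parameter, and the toy frame are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def select_dense_columns(df, min_fraction=0.8):
    """Keep only the columns whose share of non-null values is at least min_fraction."""
    non_null_fraction = df.notna().mean()  # per-column fraction of non-null entries
    return df.loc[:, non_null_fraction >= min_fraction]

# Toy frame standing in for the car ads data (hypothetical values)
toy = pd.DataFrame({
    "mileage": [178000.0, 135000.0, np.nan, 105000.0, 129385.0, 280000.0],
    "color_slug": [np.nan, np.nan, "red", np.nan, np.nan, np.nan],
    "price_eur": [2500.30, 2980.24, 8010.25, 2300.26, 2800.30, 1299.15],
})

selected = select_dense_columns(toy)
print(list(selected.columns))
```

A purely threshold-based rule may keep or drop different columns than a manual choice, so the result is worth checking against the features that are actually wanted.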


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   mileage              182252 non-null  float64
 1   manufacture_year     181223 non-null  float64
 2   engine_displacement  159210 non-null  float64
 3   engine_power         170868 non-null  float64
 4   transmission         160413 non-null  object 
 5   door_count           164243 non-null  object 
 6   seat_count           156800 non-null  object 
 7   date_created         200000 non-null  object 
 8   date_last_seen       200000 non-null  object 
 9   price_eur            200000 non-null  float64
dtypes: float64(5), object(5)
memory usage: 15.3+ MB


Some manufacture_year values are earlier than 1885, before the first cars were built. Assuming these are mistakes made while recording the data, I replace them with null values. These null values will be imputed later.

In [5]:
# Vectorized: set implausible years (before 1885) to NaN in a single step
dataset.loc[dataset.manufacture_year < 1885, 'manufacture_year'] = np.nan


Converting the manufacture year to the car's age. The manufacture year is subtracted from the reference year 2020 to get the number of years since the car was manufactured.

In [6]:
# manufacture_year already holds the year as a float, so the age is a direct subtraction;
# NaN entries stay NaN
reference_year = 2020
dataset['manufacture_year'] = reference_year - dataset['manufacture_year']


Label encoding the transmission type. It will be one-hot encoded later if needed.

In [0]:
from sklearn.preprocessing import LabelEncoder
dataset.transmission = dataset.transmission.astype(str)
labelencoder = LabelEncoder()
dataset.transmission = labelencoder.fit_transform(dataset.transmission)

In [12]:
dataset.transmission.value_counts()

1    117622
0     42791
2     39587
Name: transmission, dtype: int64

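LabelEncoder encoded the null values as well: astype(str) turned them into the string 'nan', which became class 2 (39587 rows, matching the number of missing transmission values). Restoring them to NaN.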

In [13]:
# Class 2 is the string 'nan' produced by astype(str); turn it back into a real NaN
dataset['transmission'] = dataset['transmission'].mask(dataset['transmission'] == 2)


The door count and seat count columns contain the string 'None'. Changing it to NaN.

In [14]:
# Replace the string 'None' with NaN, then convert the counts to floats
dataset['door_count'] = dataset['door_count'].mask(dataset['door_count'] == 'None').astype(float)
dataset['seat_count'] = dataset['seat_count'].mask(dataset['seat_count'] == 'None').astype(float)


In [15]:
dataset

Unnamed: 0,mileage,manufacture_year,engine_displacement,engine_power,transmission,door_count,seat_count,date_created,date_last_seen,price_eur
0,178000.0,20.0,1390.0,55.0,1.0,4.0,5.0,2016-01-03 19:42:48.205853+00:00,2016-01-07 00:56:35.766128+00:00,2500.30
1,135000.0,13.0,1149.0,55.0,1.0,4.0,5.0,2015-12-08 08:46:03.020179+00:00,2016-01-18 19:02:24.218185+00:00,2980.24
2,138000.0,15.0,1984.0,147.0,1.0,4.0,5.0,2016-03-05 22:09:11.127858+00:00,2016-07-03 17:39:48.838084+00:00,8010.25
3,105000.0,11.0,,,,,,2015-12-12 19:48:16.546082+00:00,2016-01-02 10:02:05.676711+00:00,2300.26
4,129385.0,17.0,,,,,,2016-01-01 17:28:46.527414+00:00,2016-01-17 22:49:09.853789+00:00,2800.30
...,...,...,...,...,...,...,...,...,...,...
199995,280000.0,23.0,,,,,,2015-12-12 17:05:30.968304+00:00,2015-12-15 01:22:14.496198+00:00,1299.15
199996,,20.0,1900.0,66.0,1.0,,,2015-11-14 20:36:38.222092+00:00,2016-01-27 20:40:15.463610+00:00,2072.54
199997,150000.0,14.0,,,1.0,,,2015-12-28 01:50:23.539210+00:00,2015-12-29 03:51:10.129326+00:00,2290.67
199998,5093.0,5.0,1395.0,110.0,0.0,4.0,5.0,2015-12-18 21:13:01.874178+00:00,2016-01-19 06:46:12.073615+00:00,28989.71
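
The transmission encoding in this section first encodes the missing values as their own class and then turns that class back into NaN. As a hedged alternative sketch, the encoder can be fitted on the non-null entries only, so the missing values never receive a label. The toy series below is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for dataset.transmission (hypothetical values)
transmission = pd.Series(["man", "auto", np.nan, "man", np.nan])

encoder = LabelEncoder()
mask = transmission.notna()

# Start from an all-NaN float column and fill in labels for the non-null rows only
encoded = pd.Series(np.nan, index=transmission.index, dtype=float)
encoded.loc[mask] = encoder.fit_transform(transmission[mask])

print(list(encoder.classes_))
print(encoded.tolist())
```

With this approach the class list contains only real categories, so no label has to be mapped back to NaN afterwards.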


# Data Preprocessing

Separating feature matrix X and the labels y

In [0]:
X = dataset.iloc[:, :7]
y = dataset.iloc[:, 9]

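Imputing the numerical features of X with the median and the categorical feature with the mode. The numerical features are then scaled using standardization. Note that manufacture_year is left out of the median imputation below, so its missing values are filled with the mode in the final imputation step, together with transmission.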

In [17]:
from sklearn.impute import SimpleImputer

X_num = X.drop(['transmission', 'manufacture_year'], axis = 1)
# X_num = X_num.drop('manufacture_year', axis=1)

imputer = SimpleImputer(strategy= 'median')
X_num = imputer.fit_transform(X_num) 
X_num = pd.DataFrame(X_num)

X_num =pd.concat([X_num, X['manufacture_year']], axis =1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_num = scaler.fit_transform(X_num)
X_num = pd.DataFrame(X_num)

X_cat = X['transmission']
X = pd.concat([X_num, X_cat], axis=1)

imputer = SimpleImputer(strategy = 'most_frequent')

X = imputer.fit_transform(X)

X = pd.DataFrame(X)
X

Unnamed: 0,0,1,2,3,4,5,6
0,0.192611,-0.330904,-0.993936,-0.055463,0.055973,1.108413,1.0
1,0.060979,-0.469818,-0.993936,-0.055463,0.055973,0.145755,1.0
2,0.070162,0.011483,1.350972,-0.055463,0.055973,0.420800,1.0
3,-0.030858,-0.096882,-0.229292,-0.055463,0.055973,-0.129290,1.0
4,0.043790,-0.096882,-0.229292,-0.055463,0.055973,0.695845,1.0
...,...,...,...,...,...,...,...
199995,0.504856,-0.096882,-0.229292,-0.055463,0.055973,1.520981,1.0
199996,-0.079837,-0.036935,-0.713567,-0.055463,0.055973,1.108413,1.0
199997,0.106897,-0.096882,-0.229292,-0.055463,0.055973,0.283277,1.0
199998,-0.336695,-0.328022,0.407911,-0.055463,0.055973,-0.954426,0.0
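
The imputation and scaling steps above can also be expressed as one transformer. The following is a minimal sketch using scikit-learn's `Pipeline` and `ColumnTransformer`, assuming median imputation plus standardization for numerical columns and mode imputation for the categorical column. The toy frame and the variable names are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for X (hypothetical values)
toy = pd.DataFrame({
    "mileage": [178000.0, 135000.0, np.nan, 105000.0],
    "engine_power": [55.0, np.nan, 147.0, 66.0],
    "transmission": [1.0, 1.0, np.nan, 0.0],
})
numeric_cols = ["mileage", "engine_power"]
categorical_cols = ["transmission"]

# Numerical columns: median imputation followed by standardization
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical column: mode imputation only
preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

X_prepared = preprocess.fit_transform(toy)
print(X_prepared.shape)
```

One possible benefit of this layout is that the transformer can be fitted on the training split only and then applied to the test split, which keeps test-set statistics out of the imputation and scaling.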


Splitting the data into a training set and a test set. Stratified sampling is used so that the categorical feature 'transmission' is represented in the same proportions in both sets.

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25, random_state=42, stratify= X[6])
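
As a small sanity check on what `stratify` does, the sketch below splits a synthetic frame and compares the share of one transmission class in both parts. The data is randomly generated for illustration and the class proportions (20% / 80%) are an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 200 rows of class 0.0 and 800 rows of class 1.0
X_demo = pd.DataFrame({
    "mileage": rng.uniform(0, 300000, size=1000),
    "transmission": np.repeat([0.0, 1.0], [200, 800]),
})
y_demo = pd.Series(rng.uniform(500, 30000, size=1000))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42,
    stratify=X_demo["transmission"],
)

# Share of class 1.0 in each part; both should be close to the overall 0.8
train_share = Xd_train["transmission"].mean()
test_share = Xd_test["transmission"].mean()
print(train_share, test_share)
```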

# Model Selection



I tried cross-validating four different regressors to compare their CV scores and select one, but it took too long to be feasible.

In [0]:
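from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# The four candidates; default hyperparameters are assumed, the original definitions are not shown
lin_reg, svr = LinearRegression(), SVR()
dec_tree, random_forest = DecisionTreeRegressor(random_state=42), RandomForestRegressor(random_state=42)
lin_reg_score = cross_val_score(lin_reg, X_train, y_train, cv=5, scoring='r2')
svr_score = cross_val_score(svr, X_train, y_train, cv=5, scoring='r2')
dec_tree_score = cross_val_score(dec_tree, X_train, y_train, cv=5, scoring='r2')
random_forest_score = cross_val_score(random_forest, X_train, y_train, cv=5, scoring='r2')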
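
Since the full cross-validation was too slow, one possible workaround is to compare the candidates on a random subsample of the training data first. The sketch below assumes a subsample of 1000 rows and uses synthetic data from `make_regression` in place of X_train and y_train; the sample size, the reduced number of trees, and the variable names are illustrative choices, not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (7 features, like X)
X_full, y_full = make_regression(n_samples=5000, n_features=7, noise=10.0, random_state=42)

# Draw a random subsample without replacement to keep the comparison cheap
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(X_full), size=1000, replace=False)
X_sub, y_sub = X_full[sample_idx], y_full[sample_idx]

candidates = {
    "lin_reg": LinearRegression(),
    "svr": SVR(),
    "dec_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=30, random_state=42),
}

# 5-fold cross-validated r2 for each candidate on the subsample
cv_scores = {
    name: cross_val_score(model, X_sub, y_sub, cv=5, scoring="r2")
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(name, scores.mean())
```

Scores from a subsample are only a rough guide, so the ranking may change on the full data. Passing `n_jobs=-1` to `cross_val_score` is another way to cut the wall-clock time when several cores are available.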

I will use Random Forest regression because the data contains many outliers. Handling the outliers by deleting, imputing, or discretizing them would cause a loss of information. Random forests and other tree-based algorithms are much less sensitive to outliers in the features, because they split on thresholds and do not depend on a weighted sum of the features.
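
To illustrate the argument about outliers, the sketch below fits a shallow decision tree and a linear regression twice: once on clean data and once after one feature value is replaced by an extreme one that keeps the same rank order. The data is made up. The example only covers outliers in a feature; outliers in the target would still shift the averages predicted in the leaves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# One feature with values 1..10 and an arbitrary target (hypothetical data)
x_clean = np.arange(1.0, 11.0).reshape(-1, 1)
y_demo = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 8.0, 7.0, 9.0, 12.0, 10.0])

# Same data, but the largest feature value becomes an extreme outlier
x_outlier = x_clean.copy()
x_outlier[-1, 0] = 1000.0

def train_predictions(model, features):
    """Fit the model and return its predictions on the training rows."""
    return model.fit(features, y_demo).predict(features)

tree_clean = train_predictions(DecisionTreeRegressor(max_depth=2, random_state=42), x_clean)
tree_outlier = train_predictions(DecisionTreeRegressor(max_depth=2, random_state=42), x_outlier)
lin_clean = train_predictions(LinearRegression(), x_clean)
lin_outlier = train_predictions(LinearRegression(), x_outlier)

tree_unchanged = np.allclose(tree_clean, tree_outlier)
linear_unchanged = np.allclose(lin_clean, lin_outlier)
print("tree predictions unchanged:", tree_unchanged)
print("linear predictions unchanged:", linear_unchanged)
```

The tree only uses the ordering of the feature values to choose its splits, so the rows land in the same leaves in both fits, while the linear fit is pulled towards the extreme point.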

In [0]:
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor(random_state=42)

# Tuning the hyperparameters of the Random forest regressor

In [20]:
from sklearn.model_selection import GridSearchCV

param_grid = [
{'n_estimators': [30, 100, 300], 'max_features': [2, 4, 6]},  # max_features must not exceed the 7 features in X
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

grid_search = GridSearchCV(random_forest, param_grid, cv=5,
scoring='r2')

grid_search.fit(X_train, y_train)

KeyboardInterrupt: ignored
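
The grid search above was interrupted before it finished. A cheaper option, sketched here under stated assumptions, is `RandomizedSearchCV`, which evaluates only a fixed number of sampled parameter combinations. The synthetic data, the reduced parameter lists, `n_iter=4`, and `cv=3` are illustrative choices and not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (7 features, like X)
X_demo, y_demo = make_regression(n_samples=1000, n_features=7, noise=10.0, random_state=42)

param_distributions = {
    "n_estimators": [30, 100],
    "max_features": [2, 4, 6],
    "bootstrap": [True, False],
}

# Only n_iter combinations are sampled and cross-validated, not the full grid
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` holds the model refitted with the best sampled parameters, which could then be evaluated on the test set instead of the untuned regressor.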

# Evaluating the model on test set
 
I use the r2 score as the metric because it is easier to interpret than RMSE: it has no unit, and the closer it is to 1, the better the model fits.

In [26]:
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)


0.768093433775997

In [28]:
print('r2 score:', r2)

r2 score: 0.768093433775997
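
For comparison with the r2 score, the sketch below computes both r2 and RMSE on a tiny made-up example. The arrays are hypothetical and unrelated to the car data. RMSE is expressed in the units of the target (EUR for price_eur), while r2 is unitless.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

# r2 = 1 - SS_res / SS_tot; RMSE = sqrt(mean squared error)
r2_demo = r2_score(y_true, y_hat)
rmse_demo = np.sqrt(mean_squared_error(y_true, y_hat))

print("r2:", r2_demo)
print("rmse:", rmse_demo)
```

Here the squared errors sum to 0.5 and the total sum of squares around the mean is 20, so r2 is 1 - 0.5/20 = 0.975, and RMSE is the square root of 0.5/4.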
