# Used Vehicle Price Prediction: KaggleX Skill Assessment Challenge
This notebook is my entry for the challenge named in the title; the goal is to predict used vehicle prices from the data provided.

## Dataset
We are given train.csv and test.csv. The training file has 13 columns: an id, 11 feature columns, and the target column price. The test file lacks the price column and so has 12 columns.

The test data is unusually large (in my experience), with about 36k rows compared to the 54k rows in the training dataset. This may make prediction hard if the test distribution differs even slightly from the training distribution.

## Methodology
Off the top of my head, I will approach this much like my previous project, following these steps:
1. data exploration: distribution, outliers, data types, correlation...
2. data preprocessing: data cleaning, feature engineering, train-test split
3. baseline modeling: use baseline models like decision trees, random forest & linear regression
4. model 2: build a more sophisticated model to try to beat the baseline
5. model tuning: overfit then prune? hyperparameter-tuning? monitor loss-curve? early stopping?
6. model evaluation?


# 1. Data Preparation 

## 1.1 Data Loading

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/kagglex/sample_submission.csv
/kaggle/input/kagglex/train.csv
/kaggle/input/kagglex/test.csv


In [2]:
#load the train.csv into a dataframe
train_df = pd.read_csv('/kaggle/input/kagglex/train.csv')
test_df = pd.read_csv('/kaggle/input/kagglex/test.csv')

print(train_df.shape)
print(test_df.shape)

(54273, 13)
(36183, 12)


## 1.2 Data Exploration

In [3]:
train_df.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,0,Ford,F-150 Lariat,2018,74349,Gasoline,375.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,10-Speed A/T,Blue,Gray,None reported,Yes,11000
1,1,BMW,335 i,2007,80000,Gasoline,300.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,6-Speed M/T,Black,Black,None reported,Yes,8250
2,2,Jaguar,XF Luxury,2009,91491,Gasoline,300.0HP 4.2L 8 Cylinder Engine Gasoline Fuel,6-Speed A/T,Purple,Beige,None reported,Yes,15000
3,3,BMW,X7 xDrive40i,2022,2437,Hybrid,335.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,Transmission w/Dual Shift Mode,Gray,Brown,None reported,Yes,63500
4,4,Pontiac,Firebird Base,2001,111000,Gasoline,200.0HP 3.8L V6 Cylinder Engine Gasoline Fuel,A/T,White,Black,None reported,Yes,7850


In [4]:
test_df.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title
0,54273,Mercedes-Benz,E-Class E 350,2014,73000,Gasoline,302.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,A/T,White,Beige,None reported,Yes
1,54274,Lexus,RX 350 Base,2015,128032,Gasoline,275.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,8-Speed A/T,Silver,Black,None reported,Yes
2,54275,Mercedes-Benz,C-Class C 300,2015,51983,Gasoline,241.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,7-Speed A/T,Blue,White,None reported,Yes
3,54276,Land,Rover Range Rover 5.0L Supercharged Autobiogra...,2018,29500,Gasoline,518.0HP 5.0L 8 Cylinder Engine Gasoline Fuel,Transmission w/Dual Shift Mode,White,White,At least 1 accident or damage reported,Yes
4,54277,BMW,X6 xDrive40i,2020,90000,Gasoline,335.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,8-Speed A/T,White,Black,At least 1 accident or damage reported,Yes


A quick look at the data suggests some columns are likely more valuable than others:
* brand
* ~model??~
* model_year
* fuel_type
* milage (need transformation?)
* ext_col (need transformation, make it simple)
* accident

engine is a mess (needs transformation); I will leave it out at first since, heuristically, I think it is less important. Color may matter, but I am not sure how much. Year, brand and accident should be the most important.
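
If the engine column is revisited later, horsepower and displacement could probably be pulled out with regular expressions. The following is a minimal sketch, not part of the pipeline in this notebook: the sample strings are copied from the `train_df.head()` output above, while the helper `parse_engine` and the output column names are hypothetical.

```python
import re
import pandas as pd

# Sample strings copied from the train_df.head() output above
engine = pd.Series([
    "375.0HP 3.5L V6 Cylinder Engine Gasoline Fuel",
    "300.0HP 4.2L 8 Cylinder Engine Gasoline Fuel",
    "200.0HP 3.8L V6 Cylinder Engine Gasoline Fuel",
])

def parse_engine(text):
    """Hypothetical helper: extract horsepower and displacement (litres) if present."""
    hp = re.search(r"(\d+(?:\.\d+)?)\s*HP", text)
    litres = re.search(r"(\d+(?:\.\d+)?)\s*L\b", text)
    return pd.Series({
        "horsepower": float(hp.group(1)) if hp else float("nan"),
        "displacement_l": float(litres.group(1)) if litres else float("nan"),
    })

# Applying a function that returns a Series yields one column per key
engine_features = engine.apply(parse_engine)
print(engine_features)
```

Strings that do not follow this pattern would come back as NaN, so the share of unparsed rows would need checking before relying on these features.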

Let's check the distribution of the numerical columns and the unique values of each categorical column to decide further:

In [5]:
# check the distribution of the numerical columns
train_df.describe()

Unnamed: 0,id,model_year,milage,price
count,54273.0,54273.0,54273.0,54273.0
mean,27136.0,2015.091979,72746.175667,39218.44
std,15667.409917,5.588909,50469.490448,72826.34
min,0.0,1974.0,100.0,2000.0
25%,13568.0,2012.0,32268.0,15500.0
50%,27136.0,2016.0,66107.0,28000.0
75%,40704.0,2019.0,102000.0,45000.0
max,54272.0,2024.0,405000.0,2954083.0


In [6]:
# check the distribution of the numerical columns
test_df.describe()

Unnamed: 0,id,model_year,milage
count,36183.0,36183.0,36183.0
mean,72364.0,2015.063953,72479.266755
std,10445.276732,5.589336,50714.968252
min,54273.0,1974.0,100.0
25%,63318.5,2012.0,31681.0
50%,72364.0,2016.0,65680.0
75%,81409.5,2019.0,102000.0
max,90455.0,2024.0,405000.0


The summary statistics of the test data's numerical columns (model_year, milage) fall right in line with those of the training data, which should help the model generalize to the test set.
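
`describe()` only covers the numerical columns, so the categorical columns could be compared separately. Below is a minimal sketch on toy frames (the values are illustrative, not taken from the dataset); `compare_shares` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (values are illustrative, not from the dataset)
toy_train = pd.DataFrame({"fuel_type": ["Gasoline"] * 8 + ["Hybrid", "Diesel"]})
toy_test = pd.DataFrame({"fuel_type": ["Gasoline"] * 7 + ["Hybrid"] * 2 + ["Diesel"]})

def compare_shares(train, test, column):
    """Hypothetical helper: side-by-side category proportions and their absolute gap."""
    shares = pd.concat(
        [
            train[column].value_counts(normalize=True).rename("train_share"),
            test[column].value_counts(normalize=True).rename("test_share"),
        ],
        axis=1,
    ).fillna(0.0)  # a category missing from one side gets a share of 0
    shares["abs_diff"] = (shares["train_share"] - shares["test_share"]).abs()
    return shares.sort_values("abs_diff", ascending=False)

shares = compare_shares(toy_train, toy_test, "fuel_type")
print(shares)
```

Categories with a large `abs_diff` would be the first place to look for a train/test mismatch.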

In [7]:
# check unique values of categorical data
print("columns and respective unique values:")
print("brands:", train_df.brand.unique())
# print("model:", train_df.model.unique())
print("fuel_type:", train_df.fuel_type.unique())
print("ext_col:", train_df.ext_col.unique())
print("clean_title:", train_df.clean_title.unique())
print("accident:", train_df.accident.unique())

columns and respective unique values:
brands: ['Ford' 'BMW' 'Jaguar' 'Pontiac' 'Acura' 'Audi' 'GMC' 'Maserati'
 'Chevrolet' 'Porsche' 'Mercedes-Benz' 'Tesla' 'Lexus' 'Kia' 'Lincoln'
 'Dodge' 'Volkswagen' 'Land' 'Cadillac' 'Mazda' 'RAM' 'Subaru' 'Hyundai'
 'MINI' 'Jeep' 'Honda' 'Hummer' 'Nissan' 'Toyota' 'Volvo' 'Genesis'
 'Mitsubishi' 'Buick' 'INFINITI' 'McLaren' 'Scion' 'Lamborghini' 'Bentley'
 'Suzuki' 'Ferrari' 'Alfa' 'Rolls-Royce' 'Chrysler' 'Aston' 'Rivian'
 'Lotus' 'Saturn' 'Lucid' 'Mercury' 'Maybach' 'FIAT' 'Plymouth' 'Bugatti']
fuel_type: ['Gasoline' 'Hybrid' 'E85 Flex Fuel' 'Diesel' '–' 'Plug-In Hybrid'
 'not supported']
ext_col: ['Blue' 'Black' 'Purple' 'Gray' 'White' 'Red' 'Silver' 'Summit White'
 'Platinum Quartz Metallic' 'Green' 'Orange' 'Lunar Rock'
 'Red Quartz Tintcoat' 'Beige' 'Gold' 'Jet Black Mica'
 'Delmonico Red Pearlcoat' 'Brown' 'Rich Garnet Metallic'
 'Stellar Black Metallic' 'Yellow' 'Deep Black Pearl Effect' 'Metallic'
 'Ice Silver Metallic' 'Agate Black Meta

In [8]:
# check unique values of categorical data
print("columns and respective unique values:")
print("brands:", test_df.brand.unique())
# print("model:", test_df.model.unique())
print("fuel_type:", test_df.fuel_type.unique())
print("ext_col:", test_df.ext_col.unique())
print("clean_title:", test_df.clean_title.unique())
print("accident:", test_df.accident.unique())

columns and respective unique values:
brands: ['Mercedes-Benz' 'Lexus' 'Land' 'BMW' 'Chevrolet' 'Dodge' 'Audi' 'Ford'
 'Kia' 'Toyota' 'Cadillac' 'GMC' 'Jeep' 'Mazda' 'Acura' 'INFINITI'
 'Volkswagen' 'Subaru' 'Hyundai' 'Jaguar' 'Porsche' 'Lincoln' 'Nissan'
 'RAM' 'Buick' 'Honda' 'MINI' 'Rolls-Royce' 'Genesis' 'Bentley' 'Volvo'
 'Saturn' 'Ferrari' 'Bugatti' 'Tesla' 'Pontiac' 'Hummer' 'Mitsubishi'
 'Maserati' 'Alfa' 'Scion' 'Lamborghini' 'Chrysler' 'McLaren' 'Lotus'
 'Rivian' 'Aston' 'FIAT' 'Lucid' 'Mercury' 'Suzuki' 'Saab' 'smart']
fuel_type: ['Gasoline' 'E85 Flex Fuel' 'Diesel' 'Hybrid' '–' 'Plug-In Hybrid'
 'not supported']
ext_col: ['White' 'Silver' 'Blue' 'Red' 'Black' 'Gray' 'Atomic Silver' 'Green'
 'Octane Red Pearlcoat' 'Purple' 'Diamond Black' 'Agate Black Metallic'
 '–' 'Orange' 'Polymetal Gray Metallic' 'Crystal Black Pearl'
 'Snowflake White Pearl' 'Jet Black Mica' 'Black Raven' 'Black Clearcoat'
 'Yellow' 'Metallic' 'Imperial Blue Metallic' 'Phytonic Blue Metallic'
 'Gold' 'B

Let's check for missing data:

In [9]:
# check missing values
print("NaN value in brand:", train_df.brand.isna().sum())
print("NaN value in model:", train_df.model.isna().sum())
print("NaN value in model_year:", train_df.model_year.isna().sum())
print("NaN value in fuel_type:", train_df.fuel_type.isna().sum())
print("'-' or 'not supported' value in fuel_type:", train_df[(train_df.fuel_type == '–') | (train_df.fuel_type == 'not supported')].shape[0])
print("NaN value in milage:", train_df.milage.isna().sum())
print("NaN value in ext_col:", train_df.ext_col.isna().sum())
print("'-' in ext_col:", train_df[train_df.ext_col == '–'].shape[0])
print("NaN value in accident:", train_df.accident.isna().sum())
print("NaN value in price:", train_df.price.isna().sum())
print("0 value in price:", train_df[(train_df.price == 0)].shape[0])

NaN value in brand: 0
NaN value in model: 0
NaN value in model_year: 0
NaN value in fuel_type: 0
'-' or 'not supported' value in fuel_type: 298
NaN value in milage: 0
NaN value in ext_col: 0
'-' in ext_col: 41
NaN value in accident: 0
NaN value in price: 0
0 value in price: 0


In [10]:
# check missing values
print("NaN value in brand:", test_df.brand.isna().sum())
print("NaN value in model:", test_df.model.isna().sum())
print("NaN value in model_year:", test_df.model_year.isna().sum())
print("NaN value in fuel_type:", test_df.fuel_type.isna().sum())
print("'-' or 'not supported' value in fuel_type:", test_df[(test_df.fuel_type == '–') | (test_df.fuel_type == 'not supported')].shape[0])
print("NaN value in milage:", test_df.milage.isna().sum())
print("NaN value in ext_col:", test_df.ext_col.isna().sum())
print("'-' in ext_col:", test_df[test_df.ext_col == '–'].shape[0])
print("NaN value in accident:", test_df.accident.isna().sum())

NaN value in brand: 0
NaN value in model: 0
NaN value in model_year: 0
NaN value in fuel_type: 0
'-' or 'not supported' value in fuel_type: 201
NaN value in milage: 0
NaN value in ext_col: 0
'-' in ext_col: 27
NaN value in accident: 0
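
The per-column prints above could be condensed into a single pass over the frame. A minimal sketch on a toy frame follows; `PLACEHOLDERS` and `missing_report` are hypothetical names, and the placeholder strings are the ones seen in the unique-value output above.

```python
import pandas as pd

# Placeholder strings seen in the unique-value output above
PLACEHOLDERS = ["–", "not supported"]

# Toy frame for illustration only
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", "–", "not supported", "Diesel"],
    "ext_col": ["Blue", "–", "Red", "White"],
    "milage": [74349, 80000, None, 2437],
})

def missing_report(df):
    """Hypothetical helper: count NaN and placeholder values per column."""
    return pd.DataFrame({
        "nan_count": df.isna().sum(),
        "placeholder_count": df.isin(PLACEHOLDERS).sum(),
    })

report = missing_report(toy)
print(report)
```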


Quick thoughts upon inspection:

There are columns that are clearly useful and important:
* *brand*
* *model_year*
* *milage*
* *accident*, which can be converted to a binary flag

There are also columns that need work:
* *fuel_type* has some placeholder values ('–', 'not supported') and might be useful; we will drop the rows with these values and proceed, then change 'E85 Flex Fuel' to 'Gasoline' and 'Plug-In Hybrid' to 'Hybrid'
* *ext_col* may be useful, but there are a few missing values and it needs transformation (try to convert most values to simple colors: white, red, black, etc.)

Finally, there are columns deemed not significant, which we will proceed without for now:
* *model* will not be used for now; it needs a lot of work and seems less significant

It is also noteworthy that the target *price* has no missing or zero values.

# 2. Data Preprocessing
## 2.1 Data Cleaning
Remove rows where *fuel_type* or *ext_col* holds a placeholder value ('–' or 'not supported').

In [11]:
train_df = train_df[train_df['fuel_type'] != '–']
train_df = train_df[train_df['fuel_type'] != 'not supported']
train_df = train_df[train_df['ext_col'] != '–']
print("'-' or 'not supported' value in fuel_type:", train_df[(train_df.fuel_type == '–') | (train_df.fuel_type == 'not supported')].shape[0])
print("'-' in ext_col:", train_df[train_df.ext_col == '–'].shape[0])
train_df.shape

'-' or 'not supported' value in fuel_type: 0
'-' in ext_col: 0


(53935, 13)

In [12]:
test_df = test_df[test_df['fuel_type'] != '–']
test_df = test_df[test_df['fuel_type'] != 'not supported']
test_df = test_df[test_df['ext_col'] != '–']
print("'-' or 'not supported' value in fuel_type:", test_df[(test_df.fuel_type == '–') | (test_df.fuel_type == 'not supported')].shape[0])
print("'-' in ext_col:", test_df[test_df.ext_col == '–'].shape[0])
test_df.shape

'-' or 'not supported' value in fuel_type: 0
'-' in ext_col: 0


(35955, 12)
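
Filtering the test frame shrinks it from 36183 to 35955 rows, so the dropped ids would receive no prediction. If the submission needs one prediction per test id (an assumption here, not something stated in this notebook), an alternative is to keep every row and recode the placeholders as an explicit category. A minimal sketch on a toy frame; the label 'Unknown' is an assumed choice.

```python
import pandas as pd

# Toy test frame for illustration; ids and values are made up
toy_test = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "fuel_type": ["Gasoline", "–", "not supported", "Diesel"],
    "ext_col": ["White", "Blue", "–", "Red"],
})

placeholders = ["–", "not supported"]

# Recode placeholders instead of filtering, so the row count is unchanged
kept = toy_test.copy()
for col in ["fuel_type", "ext_col"]:
    kept[col] = kept[col].replace(placeholders, "Unknown")

print(kept)
print("rows before:", len(toy_test), "rows after:", len(kept))
```

The same recoding would have to be applied to the training data so that the model sees the 'Unknown' category during fitting.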

## 2.2 Data Transformation

### 2.2.1 Replacing values in accident & fuel_type

In [13]:
pd.set_option('future.no_silent_downcasting', True)

In [14]:
# Change accident to boolean (True = at least one accident or damage reported)
train_df['accident'] = train_df['accident'].replace('None reported', False )
train_df['accident'] = train_df['accident'].replace('At least 1 accident or damage reported', True )
print("train accident:", train_df.accident.unique())

test_df['accident'] = test_df['accident'].replace('None reported', False )
test_df['accident'] = test_df['accident'].replace('At least 1 accident or damage reported', True )
print("test accident:", test_df.accident.unique())

train accident: [False True]
test accident: [False True]


In [15]:
# Change fuel_type to narrow down the types
train_df['fuel_type'] = train_df['fuel_type'].replace('E85 Flex Fuel', 'Gasoline')
train_df['fuel_type'] = train_df['fuel_type'].replace('Plug-In Hybrid', 'Hybrid')
print("train fuel_type:", train_df.fuel_type.unique())

test_df['fuel_type'] = test_df['fuel_type'].replace('E85 Flex Fuel', 'Gasoline')
test_df['fuel_type'] = test_df['fuel_type'].replace('Plug-In Hybrid', 'Hybrid')
print("test fuel_type:", test_df.fuel_type.unique())

train fuel_type: ['Gasoline' 'Hybrid' 'Diesel']
test fuel_type: ['Gasoline' 'Diesel' 'Hybrid']


### 2.2.2 Deal with the strings in column ext_col to make them more generic
Turn weird color names into general colors (e.g. white, black, blue...).
Make a new column?
Then maybe remove the weird colors if there are only a few of them? We want to make sure the test data keeps the same distribution, though.


In [16]:
# Define a dictionary mapping generic color names to their potential variations
color_map = {
    'white': ['white', 'snow', 'ivory', 'pearl', 'cream', 'frost', 'glacier', 'ice', 'chalk', 'yulong'],
    'black': ['black', 'ebony', 'onyx', 'jet', 'noir', 'raven', 'nightfall', 'nero', 'noctis', 'moonlight', 'obsidian'],
    'blue': ['blue', 'navy', 'sapphire', 'indigo', 'caelum', 'reflex', 'sea', 'tempest', 'blu'],
    'red': ['red', 'crimson', 'scarlet', 'ruby', 'maroon', 'sangria', 'mars', 'corsa', 'rosso'],
    'green': ['green', 'olive', 'emerald', 'jade', 'lime', 'jungle', 'moss', 'caviar', 'verde'],
    'yellow': ['yellow', 'gold', 'lemon', 'amber', 'hellayella'],
    'silver': ['silver', 'platinum', 'steel', 'zynith', 'radiance', 'metallic', 'magno'], # we mayyy want to remove metallic from silver...
    'purple': ['purple', 'lavender', 'amethyst', 'violet', 'plum'],
    'gray': ['gray', 'grey', 'charcoal', 'slate', 'graphite', 'ash'],
    'orange': ['orange', 'tangerine', 'apricot', 'peach', 'mango'],
    'brown': ['brown', 'tan', 'chocolate', 'camel', 'khaki', 'dune'],
    'beige': ['beige', 'cream', 'vanilla', 'linen', 'isis', 'lunar']
}

def transform_color(color_str):
    color_str = color_str.lower()
    for generic_color, variations in color_map.items():
        for variation in variations:
            if variation in color_str:
                return generic_color.capitalize()
    return color_str

# Apply the transform_color function to the 'ext_col' column
train_df2 = train_df.copy()
train_df2['ext_col'] = train_df2['ext_col'].apply(transform_color)
print("train ext_col:", train_df2.ext_col.unique())

test_df2 = test_df.copy()
test_df2['ext_col'] = test_df2['ext_col'].apply(transform_color)
print("test ext_col:", test_df2.ext_col.unique())

train ext_col: ['Blue' 'Black' 'Purple' 'Gray' 'White' 'Red' 'Silver' 'Green' 'Orange'
 'Beige' 'Yellow' 'Brown' 'c / c' 'pink' 'custom color']
test ext_col: ['White' 'Silver' 'Blue' 'Red' 'Black' 'Gray' 'Green' 'Purple' 'Orange'
 'Yellow' 'Brown' 'Beige' 'c / c' 'pink' 'custom color']
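
One thing to watch with `transform_color`: it returns the first substring hit in dictionary order, so a name such as 'Crystal Black Pearl' (seen in the test output above) matches 'pearl' under white before 'black' is ever checked. The sketch below reproduces that behaviour on a reduced copy of the mapping and compares it with a hypothetical variant, `transform_color_base_first`, that prefers an explicit base color word. It is an illustration only and is not applied to the data in this notebook.

```python
# Reduced copy of the mapping above, kept short for illustration
color_map = {
    "white": ["white", "snow", "pearl", "ice"],
    "black": ["black", "jet", "raven"],
    "red": ["red", "ruby", "rosso"],
    "silver": ["silver", "platinum", "metallic"],
}

def transform_color_first_hit(color_str):
    """Same logic as transform_color above: first substring hit in dict order wins."""
    color_str = color_str.lower()
    for generic_color, variations in color_map.items():
        for variation in variations:
            if variation in color_str:
                return generic_color.capitalize()
    return color_str

def transform_color_base_first(color_str):
    """Hypothetical variant: prefer an explicit base color word over variation hits."""
    words = color_str.lower().split()
    for generic_color in color_map:
        if generic_color in words:
            return generic_color.capitalize()
    # No base color word found, so fall back to the substring search
    return transform_color_first_hit(color_str)

names = ["Crystal Black Pearl", "Ice Silver Metallic", "Octane Red Pearlcoat", "Jet Black Mica"]
for name in names:
    print(name, "->", transform_color_first_hit(name), "|", transform_color_base_first(name))
```

On these four names the first-hit version labels the first three as White, while the base-first variant returns Black, Silver and Red respectively.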


In [17]:
print(" pink in ext_col:", train_df2[train_df2.ext_col == 'pink'].shape[0])
print(" c / c in ext_col:", train_df2[train_df2.ext_col == 'c / c'].shape[0])
print(" custom color in ext_col:", train_df2[train_df2.ext_col == 'custom color'].shape[0])

# Remove rows with edge cases
train_df2 = train_df2[train_df2['ext_col'] != 'pink']
train_df2 = train_df2[train_df2['ext_col'] != "c / c"]
train_df2 = train_df2[train_df2['ext_col'] != 'custom color']
print("ext_col:", train_df2.ext_col.unique())
train_df2.shape

 pink in ext_col: 7
 c / c in ext_col: 14
 custom color in ext_col: 1
ext_col: ['Blue' 'Black' 'Purple' 'Gray' 'White' 'Red' 'Silver' 'Green' 'Orange'
 'Beige' 'Yellow' 'Brown']


(53913, 13)

In [18]:
print(" pink in ext_col:", test_df2[test_df2.ext_col == 'pink'].shape[0])
print(" c / c in ext_col:", test_df2[test_df2.ext_col == 'c / c'].shape[0])
print(" custom color in ext_col:", test_df2[test_df2.ext_col == 'custom color'].shape[0])

# Remove rows with edge cases
test_df2 = test_df2[test_df2['ext_col'] != 'pink']
test_df2 = test_df2[test_df2['ext_col'] != "c / c"]
test_df2 = test_df2[test_df2['ext_col'] != 'custom color']
print("ext_col:", test_df2.ext_col.unique())
test_df2.shape

 pink in ext_col: 4
 c / c in ext_col: 6
 custom color in ext_col: 1
ext_col: ['White' 'Silver' 'Blue' 'Red' 'Black' 'Gray' 'Green' 'Purple' 'Orange'
 'Yellow' 'Brown' 'Beige']


(35944, 12)

In [19]:
# drop some useless columns
drop_col = ['model', 'engine', 'transmission', 'int_col', 'clean_title']
train_df2 = train_df2.drop(drop_col, axis=1)
train_df2.head(10)

Unnamed: 0,id,brand,model_year,milage,fuel_type,ext_col,accident,price
0,0,Ford,2018,74349,Gasoline,Blue,False,11000
1,1,BMW,2007,80000,Gasoline,Black,False,8250
2,2,Jaguar,2009,91491,Gasoline,Purple,False,15000
3,3,BMW,2022,2437,Hybrid,Gray,False,63500
4,4,Pontiac,2001,111000,Gasoline,White,False,7850
5,5,Acura,2003,124756,Gasoline,Red,True,4995
6,6,Audi,2014,107380,Gasoline,Gray,False,26500
7,7,GMC,2019,51300,Gasoline,White,True,25500
8,8,Audi,2016,87842,Gasoline,Silver,False,13999
9,9,Acura,2007,152270,Gasoline,Gray,True,6700


In [20]:
test_df2 = test_df2.drop(drop_col, axis=1)
test_df2.head(10)

Unnamed: 0,id,brand,model_year,milage,fuel_type,ext_col,accident
0,54273,Mercedes-Benz,2014,73000,Gasoline,White,False
1,54274,Lexus,2015,128032,Gasoline,Silver,False
2,54275,Mercedes-Benz,2015,51983,Gasoline,Blue,False
3,54276,Land,2018,29500,Gasoline,White,True
4,54277,BMW,2020,90000,Gasoline,White,True
5,54278,Chevrolet,2018,2894,Gasoline,Silver,False
6,54279,Land,2019,41200,Gasoline,Silver,True
7,54280,Land,2019,58000,Gasoline,White,True
8,54281,Dodge,2013,124705,Gasoline,Red,True
9,54282,Audi,2022,29850,Gasoline,Black,False


### 2.2.3 ~One-hot~ Encoding the categorical data
There are actually quite a few color categories (12) and even more brands, so we need other types of encoding:

* label encoding: use a unique integer value per category (can introduce an unwanted ordinal relationship)
* frequent/infrequent encoding: group the infrequent categories into a single category to prevent high cardinality; let's look into this first

As for fuel_type, we will perform one-hot encoding since there are only 3 unique values.

In [21]:
# see the unique value counts of the high cardinality columns
print(train_df2['brand'].value_counts())
print(train_df2['ext_col'].value_counts())

brand
BMW              7367
Ford             6664
Mercedes-Benz    5074
Chevrolet        4395
Audi             2916
Porsche          2602
Toyota           2298
Lexus            2257
Jeep             2237
Land             1993
Cadillac         1552
Nissan           1229
GMC              1074
RAM               966
INFINITI          957
Dodge             952
Lincoln           766
Subaru            738
Mazda             723
Hyundai           694
Jaguar            654
Volkswagen        628
Honda             615
Acura             575
Kia               526
Volvo             445
MINI              363
Maserati          293
Bentley           269
Genesis           249
Chrysler          248
Buick             228
Mitsubishi        182
Hummer            176
Pontiac           149
Alfa              144
Rolls-Royce       124
Lamborghini       114
Tesla             110
Ferrari            86
Saturn             58
Scion              53
Aston              49
McLaren            43
Rivian             27
FIAT

Seeing how many relatively rare values there are, we can suggest:

* Grouping brands with fewer than 500 rows into 'other', reducing 53 to 26
* Grouping ext_col values with fewer than 1000 rows into 'other', reducing 12 to 7
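
The grouping suggested above could also be derived from the counts rather than typed by hand. A minimal sketch on toy data: `fit_frequent` and `apply_grouping` are hypothetical helpers, and the frequent set is learned from the training column only, so categories that appear only in the test data also fall into 'other'.

```python
import pandas as pd

# Toy brand columns; counts are illustrative only
toy_train_brand = pd.Series(["BMW"] * 5 + ["Ford"] * 4 + ["Lotus"] * 2)
toy_test_brand = pd.Series(["BMW", "Ford", "Saab", "Lotus"])

def fit_frequent(series, min_count):
    """Hypothetical helper: categories seen at least min_count times in training."""
    counts = series.value_counts()
    return list(counts[counts >= min_count].index)

def apply_grouping(series, frequent, other_label="other"):
    """Hypothetical helper: keep frequent categories, map everything else to other_label."""
    return series.where(series.isin(frequent), other_label)

frequent_brands = fit_frequent(toy_train_brand, min_count=3)
train_grouped = apply_grouping(toy_train_brand, frequent_brands)
test_grouped = apply_grouping(toy_test_brand, frequent_brands)

print(train_grouped.value_counts())
print(test_grouped.tolist())
```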

In [22]:
other_color = ['Yellow', 'Green', 'Beige', 'Brown', 'Orange', 'Purple']
other_brand = ['Volvo', 'MINI', 'Maserati', 'Bentley', 'Genesis', 'Chrysler', 'Buick', 'Mitsubishi', 'Hummer', 'Pontiac', 'Alfa', 
               'Rolls-Royce', 'Lamborghini', 'Tesla', 'Ferrari', 'Saturn', 'Scion', 'Aston', 'McLaren', 'Rivian', 'FIAT',
               'Lotus', 'Mercury', 'Suzuki', 'Maybach', 'Lucid', 'Plymouth', 'Bugatti', 'Saab', 'smart']

# Using replace function to group edge cases of brand & ext_col
train_df3 = train_df2.copy()
train_df3['ext_col'] = train_df3['ext_col'].replace(other_color, 'other')
train_df3['brand'] = train_df3['brand'].replace(other_brand, 'other')

test_df3 = test_df2.copy()
test_df3['ext_col'] = test_df3['ext_col'].replace(other_color, 'other')
test_df3['brand'] = test_df3['brand'].replace(other_brand, 'other')

In [23]:
print(train_df3['brand'].value_counts())
print(train_df3['ext_col'].value_counts())

brand
BMW              7367
Ford             6664
Mercedes-Benz    5074
Chevrolet        4395
other            3461
Audi             2916
Porsche          2602
Toyota           2298
Lexus            2257
Jeep             2237
Land             1993
Cadillac         1552
Nissan           1229
GMC              1074
RAM               966
INFINITI          957
Dodge             952
Lincoln           766
Subaru            738
Mazda             723
Hyundai           694
Jaguar            654
Volkswagen        628
Honda             615
Acura             575
Kia               526
Name: count, dtype: int64
ext_col
Black     15678
White     13995
Gray       7902
Silver     5569
Blue       4761
Red        3176
other      2832
Name: count, dtype: int64


In [24]:
print(test_df3['brand'].value_counts())
print(test_df3['ext_col'].value_counts())

brand
BMW              4854
Ford             4362
Mercedes-Benz    3257
Chevrolet        2973
other            2320
Audi             1916
Porsche          1792
Toyota           1562
Lexus            1530
Jeep             1502
Land             1360
Cadillac         1050
Nissan            908
GMC               746
INFINITI          666
Dodge             655
RAM               606
Mazda             512
Lincoln           496
Subaru            492
Jaguar            446
Hyundai           436
Honda             401
Acura             379
Volkswagen        376
Kia               347
Name: count, dtype: int64
ext_col
Black     10397
White      9189
Gray       5377
Silver     3710
Blue       3221
Red        2155
other      1895
Name: count, dtype: int64


In [25]:
# Define the integer encoding dictionaries by hand
ext_col_encoding = {
    'Black': 0,
    'White': 1,
    'Gray': 2,
    'Silver': 3,
    'Blue': 4,
    'Red': 5,
    'other': 6    
}

brand_encoding = {
    'BMW': 0, 
    'Ford': 1,
    'Mercedes-Benz': 2,
    'Chevrolet': 3,
    'Audi': 4,
    'Porsche': 5,
    'Toyota': 6,
    'Lexus': 7,  
    'Jeep': 8,  
    'Land': 9,  
    'Cadillac': 10,  
    'Nissan': 11,  
    'GMC': 12,  
    'INFINITI': 13,  
    'Dodge': 14,  
    'RAM': 15,  
    'Mazda': 16,  
    'Lincoln': 17,  
    'Subaru': 18,  
    'Jaguar': 19,  
    'Hyundai': 20,  
    'Honda': 21,  
    'Acura': 22,  
    'Volkswagen': 23,  
    'Kia': 24,  
    'other': 25
}


# Encode the 'ext_col' and 'brand' columns using the predefined dictionaries
train_df3['ext_col_encoded'] = train_df3['ext_col'].map(ext_col_encoding)
train_df3['brand_encoded'] = train_df3['brand'].map(brand_encoding)

# One-hot encode fuel_type
train_df3 = pd.get_dummies(train_df3, columns=['fuel_type'])

train_df3.head()

Unnamed: 0,id,brand,model_year,milage,ext_col,accident,price,ext_col_encoded,brand_encoded,fuel_type_Diesel,fuel_type_Gasoline,fuel_type_Hybrid
0,0,Ford,2018,74349,Blue,False,11000,4,1,False,True,False
1,1,BMW,2007,80000,Black,False,8250,0,0,False,True,False
2,2,Jaguar,2009,91491,other,False,15000,6,19,False,True,False
3,3,BMW,2022,2437,Gray,False,63500,2,0,False,False,True
4,4,other,2001,111000,White,False,7850,1,25,False,True,False
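
Hand-typed dictionaries can drift out of sync with the data, and `.map` silently turns any category missing from the dictionary into NaN. A minimal sketch, on toy data, of building the mapping from the training frequencies and checking for unmapped values; the category 'Teal' is made up to trigger the check.

```python
import pandas as pd

# Toy columns for illustration; 'Teal' is a made-up category absent from training
toy_train = pd.Series(["Black", "Black", "Black", "White", "White", "other"], name="ext_col")
toy_test = pd.Series(["White", "Black", "Teal"], name="ext_col")

# Integer code by descending training frequency (most frequent category gets 0)
encoding = {category: code for code, category in enumerate(toy_train.value_counts().index)}

train_encoded = toy_train.map(encoding)
test_encoded = toy_test.map(encoding)

# .map leaves categories absent from the dictionary as NaN, so list them explicitly
unmapped = toy_test[test_encoded.isna()].unique().tolist()
print(encoding)
print("unmapped test categories:", unmapped)
```

An empty `unmapped` list on the real frames would confirm that every category has an integer code.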


In [26]:
# Encode the 'ext_col' and 'brand' columns using the predefined dictionaries
test_df3['ext_col_encoded'] = test_df3['ext_col'].map(ext_col_encoding)
test_df3['brand_encoded'] = test_df3['brand'].map(brand_encoding)

# One-hot encode fuel_type
test_df3 = pd.get_dummies(test_df3, columns=['fuel_type'])

test_df3.head()

Unnamed: 0,id,brand,model_year,milage,ext_col,accident,ext_col_encoded,brand_encoded,fuel_type_Diesel,fuel_type_Gasoline,fuel_type_Hybrid
0,54273,Mercedes-Benz,2014,73000,White,False,1,2,False,True,False
1,54274,Lexus,2015,128032,Silver,False,3,7,False,True,False
2,54275,Mercedes-Benz,2015,51983,Blue,False,4,2,False,True,False
3,54276,Land,2018,29500,White,True,1,9,False,True,False
4,54277,BMW,2020,90000,White,True,1,0,False,True,False
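
Calling `pd.get_dummies` separately on train and test works here because both frames contain the same three fuel types. If one frame were missing a category, the dummy columns would no longer line up. A minimal sketch of aligning them with `reindex`, on toy frames:

```python
import pandas as pd

toy_train = pd.DataFrame({"fuel_type": ["Gasoline", "Hybrid", "Diesel"]})
toy_test = pd.DataFrame({"fuel_type": ["Gasoline", "Gasoline", "Hybrid"]})  # no Diesel rows

train_dummies = pd.get_dummies(toy_train, columns=["fuel_type"])
test_dummies = pd.get_dummies(toy_test, columns=["fuel_type"])

# Reindex the test columns to the training layout; missing dummies are filled with False
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=False)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```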


### 2.2.4 Make a new dataframe for the preprocessed data

In [27]:
frain_df = train_df3[['id', 'model_year', 'milage', 'fuel_type_Diesel', 'fuel_type_Gasoline', 'fuel_type_Hybrid', 'accident', 'brand_encoded', 
                      'ext_col_encoded', 'price']].copy()
frain_df.head()

Unnamed: 0,id,model_year,milage,fuel_type_Diesel,fuel_type_Gasoline,fuel_type_Hybrid,accident,brand_encoded,ext_col_encoded,price
0,0,2018,74349,False,True,False,False,1,4,11000
1,1,2007,80000,False,True,False,False,0,0,8250
2,2,2009,91491,False,True,False,False,19,6,15000
3,3,2022,2437,False,False,True,False,0,2,63500
4,4,2001,111000,False,True,False,False,25,1,7850


In [28]:
fest_df = test_df3[['id', 'model_year', 'milage', 'fuel_type_Diesel', 'fuel_type_Gasoline', 'fuel_type_Hybrid', 
                    'accident', 'brand_encoded', 'ext_col_encoded']].copy()
fest_df.head()

Unnamed: 0,id,model_year,milage,fuel_type_Diesel,fuel_type_Gasoline,fuel_type_Hybrid,accident,brand_encoded,ext_col_encoded
0,54273,2014,73000,False,True,False,False,2,1
1,54274,2015,128032,False,True,False,False,7,3
2,54275,2015,51983,False,True,False,False,2,4
3,54276,2018,29500,False,True,False,True,9,1
4,54277,2020,90000,False,True,False,True,0,1


## 2.3 Train-test split
We will split the training data into a training set and a validation set, and keep the test data as the test set. Note that in the code below the validation set is stored in X_test and y_test.

In [29]:
from sklearn.model_selection import train_test_split

# Setting X as features and y as target variable - price
X = frain_df.drop(['price', 'id'], axis=1)  # Features
y = frain_df['price']  # Target variable

# Split the data into training and validation sets (the validation split is named X_test / y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (43130, 8)
X_test shape:  (10783, 8)
y_train shape:  (43130,)
y_test shape:  (10783,)
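
With a hold-out split in place, the baseline step from the methodology could look roughly like the sketch below. It runs on synthetic data rather than the competition features, and using RMSE as the score is an assumption; the notebook does not state the challenge metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features (not the competition data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))
y = 30000 * X[:, 0] - 10000 * X[:, 1] + rng.normal(0, 1000, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

baselines = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    # Root mean squared error on the held-out validation split
    scores[name] = float(np.sqrt(mean_squared_error(y_val, predictions)))

print(scores)
```

Any later model would then be judged against these validation scores.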


# Final Prediction