# Used Vehicle Price Prediction: KaggleX Skill Assessment Challenge
This work is my entry for the challenge named in the title; the aim is to predict used vehicle prices from the data provided.

## Dataset
We are given train.csv and test.csv. The former (as the name suggests) has 12 feature columns (counting `id`) and 1 target column, `price`. The test data lacks the target column, so it has 12 columns.

The test data is unusually large (in my experience): about 36k rows compared to the 54k rows in the training dataset. This may make prediction harder if the test data distribution differs even marginally from the training data.
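One rough way to look at the distribution concern is to compare how often each category appears in the two files. The sketch below is only an illustration: `compare_category_shares` is a hypothetical helper, and the toy series stand in for something like `train_df["brand"]` and `test_df["brand"]`.

```python
import pandas as pd

def compare_category_shares(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Put the relative frequency of each category in train and test side by side."""
    shares = pd.concat(
        [
            train.value_counts(normalize=True).rename("train_share"),
            test.value_counts(normalize=True).rename("test_share"),
        ],
        axis=1,
    ).fillna(0.0)  # a category missing from one side gets a share of 0
    shares["abs_diff"] = (shares["train_share"] - shares["test_share"]).abs()
    return shares.sort_values("abs_diff", ascending=False)

# Toy data, not real rows from the competition files
toy_train = pd.Series(["Ford", "Ford", "BMW", "Tesla"])
toy_test = pd.Series(["Ford", "BMW", "BMW", "Kia"])

report = compare_category_shares(toy_train, toy_test)
print(report)
```

Categories with a large `abs_diff`, or with a share of 0 on one side, would be the first places to look for a mismatch between the two files.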

## Methodology
Off the top of my head, I will approach this much like my previous project, following these steps:
1. data exploration: distribution, outliers, data types, correlation...
2. data preprocessing: data cleaning, feature engineering, train-test split
3. baseline modeling: use baseline models like decision trees, random forest & linear regression
4. second model: build a fancier model to try to beat the baseline
5. model tuning: overfit then prune? hyperparameter tuning? monitor the loss curve? early stopping?
6. model evaluation?


# 1. Data Preparation 

## 1.1 Data Loading

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/kagglex/sample_submission.csv
/kaggle/input/kagglex/train.csv
/kaggle/input/kagglex/test.csv


In [2]:
# load train.csv and test.csv into dataframes
train_df = pd.read_csv('/kaggle/input/kagglex/train.csv')
test_df = pd.read_csv('/kaggle/input/kagglex/test.csv')

print(train_df.shape)
print(test_df.shape)

(54273, 13)
(36183, 12)


## 1.2 Data Exploration

In [3]:
train_df.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,0,Ford,F-150 Lariat,2018,74349,Gasoline,375.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,10-Speed A/T,Blue,Gray,None reported,Yes,11000
1,1,BMW,335 i,2007,80000,Gasoline,300.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,6-Speed M/T,Black,Black,None reported,Yes,8250
2,2,Jaguar,XF Luxury,2009,91491,Gasoline,300.0HP 4.2L 8 Cylinder Engine Gasoline Fuel,6-Speed A/T,Purple,Beige,None reported,Yes,15000
3,3,BMW,X7 xDrive40i,2022,2437,Hybrid,335.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,Transmission w/Dual Shift Mode,Gray,Brown,None reported,Yes,63500
4,4,Pontiac,Firebird Base,2001,111000,Gasoline,200.0HP 3.8L V6 Cylinder Engine Gasoline Fuel,A/T,White,Black,None reported,Yes,7850


In [4]:
test_df.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title
0,54273,Mercedes-Benz,E-Class E 350,2014,73000,Gasoline,302.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,A/T,White,Beige,None reported,Yes
1,54274,Lexus,RX 350 Base,2015,128032,Gasoline,275.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,8-Speed A/T,Silver,Black,None reported,Yes
2,54275,Mercedes-Benz,C-Class C 300,2015,51983,Gasoline,241.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,7-Speed A/T,Blue,White,None reported,Yes
3,54276,Land,Rover Range Rover 5.0L Supercharged Autobiogra...,2018,29500,Gasoline,518.0HP 5.0L 8 Cylinder Engine Gasoline Fuel,Transmission w/Dual Shift Mode,White,White,At least 1 accident or damage reported,Yes
4,54277,BMW,X6 xDrive40i,2020,90000,Gasoline,335.0HP 3.0L Straight 6 Cylinder Engine Gasoli...,8-Speed A/T,White,Black,At least 1 accident or damage reported,Yes


A quick look at the data suggests some columns should be more valuable than others:
* brand
* ~model??~
* model_year
* fuel_type
* milage (need transformation?)
* ext_col (need transformation, make it simple)
* accident

engine is a mess (needs transformation); I will not consider it at first because, heuristically, I think it is less important. Color could be important, but I am not sure it carries enough signal. Year, brand and accident should be the most important.
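For the open question on milage, one common option for a wide-ranging, non-negative column is a log transform. This is only a sketch of that option, not a decision made in this notebook: the `log_milage` column name is hypothetical and the toy values are illustrative, not real rows.

```python
import numpy as np
import pandas as pd

# Toy values spanning a wide range; not real rows
toy = pd.DataFrame({"milage": [100, 32268, 66107, 102000, 405000]})

# log1p computes log(1 + x), so it stays defined even if a milage of 0 shows up
toy["log_milage"] = np.log1p(toy["milage"])

# expm1 is the inverse of log1p and recovers the original values
toy["milage_recovered"] = np.expm1(toy["log_milage"])
print(toy)
```

The transform keeps the ordering of the rows and compresses the large values, which can help linear models in particular; tree-based models are generally insensitive to it.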

Let's check the distribution of the numerical columns and the unique values of each categorical column to decide further:

In [5]:
# check the distribution of the numerical columns
train_df.describe()

Unnamed: 0,id,model_year,milage,price
count,54273.0,54273.0,54273.0,54273.0
mean,27136.0,2015.091979,72746.175667,39218.44
std,15667.409917,5.588909,50469.490448,72826.34
min,0.0,1974.0,100.0,2000.0
25%,13568.0,2012.0,32268.0,15500.0
50%,27136.0,2016.0,66107.0,28000.0
75%,40704.0,2019.0,102000.0,45000.0
max,54272.0,2024.0,405000.0,2954083.0


In [6]:
# check unique values of categorical data
print("columns and respective unique values:")
print("brands:", train_df.brand.unique())
# print("model:", train_df.model.unique())
print("fuel_type:", train_df.fuel_type.unique())
print("ext_col:", train_df.ext_col.unique())
print("clean_title:", train_df.clean_title.unique())
print("accident:", train_df.accident.unique())

columns and respective unique values:
brands: ['Ford' 'BMW' 'Jaguar' 'Pontiac' 'Acura' 'Audi' 'GMC' 'Maserati'
 'Chevrolet' 'Porsche' 'Mercedes-Benz' 'Tesla' 'Lexus' 'Kia' 'Lincoln'
 'Dodge' 'Volkswagen' 'Land' 'Cadillac' 'Mazda' 'RAM' 'Subaru' 'Hyundai'
 'MINI' 'Jeep' 'Honda' 'Hummer' 'Nissan' 'Toyota' 'Volvo' 'Genesis'
 'Mitsubishi' 'Buick' 'INFINITI' 'McLaren' 'Scion' 'Lamborghini' 'Bentley'
 'Suzuki' 'Ferrari' 'Alfa' 'Rolls-Royce' 'Chrysler' 'Aston' 'Rivian'
 'Lotus' 'Saturn' 'Lucid' 'Mercury' 'Maybach' 'FIAT' 'Plymouth' 'Bugatti']
fuel_type: ['Gasoline' 'Hybrid' 'E85 Flex Fuel' 'Diesel' '–' 'Plug-In Hybrid'
 'not supported']
ext_col: ['Blue' 'Black' 'Purple' 'Gray' 'White' 'Red' 'Silver' 'Summit White'
 'Platinum Quartz Metallic' 'Green' 'Orange' 'Lunar Rock'
 'Red Quartz Tintcoat' 'Beige' 'Gold' 'Jet Black Mica'
 'Delmonico Red Pearlcoat' 'Brown' 'Rich Garnet Metallic'
 'Stellar Black Metallic' 'Yellow' 'Deep Black Pearl Effect' 'Metallic'
 'Ice Silver Metallic' 'Agate Black Meta

Let's check for missing data:

In [7]:
# check missing values
print("NaN value in brand:", train_df.brand.isna().sum())
print("NaN value in model:", train_df.model.isna().sum())
print("NaN value in model_year:", train_df.model_year.isna().sum())
print("NaN value in fuel_type:", train_df.fuel_type.isna().sum())
print("'-' or 'not supported' value in fuel_type:", train_df[(train_df.fuel_type == '–') | (train_df.fuel_type == 'not supported')].shape[0])
print("NaN value in milage:", train_df.milage.isna().sum())
print("NaN value in ext_col:", train_df.ext_col.isna().sum())
print("NaN value in accident:", train_df.accident.isna().sum())
print("NaN value in price:", train_df.price.isna().sum())
print("0 value in price:", train_df[(train_df.price == 0)].shape[0])

NaN value in brand: 0
NaN value in model: 0
NaN value in model_year: 0
NaN value in fuel_type: 0
'-' or 'not supported' value in fuel_type: 298
NaN value in milage: 0
NaN value in ext_col: 0
NaN value in accident: 0
NaN value in price: 0
0 value in price: 0


Quick thoughts upon inspection:

There are columns that are clearly useful and important:
* *brand*
* *model_year*
* *milage*
* *accident*

There are also columns that need work:
* *fuel_type* has some missing values (298 rows with `–` or `not supported`) and might be useful; we will drop the rows with missing values and proceed
* *ext_col* may be useful but needs transformation (try to convert most values to simple colors: white, red, black, etc.)

Finally, there are columns deemed not significant, and we will proceed without them for now:
* *model* will not be used for now; it needs a lot of work and seems less significant

It is also noteworthy that the target value *price* has no missing or 0 values.

## 1.3 Data Cleaning
Remove the rows where *fuel_type* is missing (`–` or `not supported`).
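A minimal sketch of this cleaning step, assuming the only placeholders are the two seen in the `unique()` output. `drop_unknown_fuel_type` and `PLACEHOLDER_FUEL_TYPES` are hypothetical names, and the toy frame stands in for `train_df`.

```python
import pandas as pd

# Placeholders seen in the fuel_type column; note the first one is an en dash
PLACEHOLDER_FUEL_TYPES = ["–", "not supported"]

def drop_unknown_fuel_type(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows whose fuel_type is NaN or a placeholder."""
    unknown = df["fuel_type"].isna() | df["fuel_type"].isin(PLACEHOLDER_FUEL_TYPES)
    return df.loc[~unknown].reset_index(drop=True)

# Toy data, not real rows
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", "–", "Hybrid", "not supported", None],
    "price": [11000, 8250, 63500, 15000, 7850],
})

cleaned = drop_unknown_fuel_type(toy)
print(cleaned)
```

This is meant for the training data only. If the submission needs a prediction for every test `id` (which I assume it does), the same placeholder rows in test.csv would need another treatment, such as mapping them to an "unknown" category.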

## 1.4 Data Transformation

### 1.4.1 Deal with the strings in column ext_col to make them more generic
Turn unusual color names into general colors (e.g. white, black, blue...).
Make a new column?
Then maybe remove the unusual colors if there are only a few of them? I want to make sure the test data has the same distribution, though.
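A rough sketch of the color simplification, assuming a simple keyword match is good enough. The `BASIC_COLORS` list, the `simplify_color` helper and the `ext_col_simple` column name are all hypothetical; the sample names come from the `ext_col` output above.

```python
import pandas as pd

# Generic colors to look for; when a name contains several, the earliest in this list wins
BASIC_COLORS = [
    "black", "white", "gray", "silver", "red", "blue", "green",
    "brown", "beige", "gold", "orange", "yellow", "purple",
]

def simplify_color(name) -> str:
    """Map a manufacturer color name to a generic color, or 'other' if none matches."""
    words = str(name).lower().replace("grey", "gray").replace("-", " ").split()
    for color in BASIC_COLORS:
        if color in words:  # whole-word match, so "colored" does not count as "red"
            return color
    return "other"

toy = pd.DataFrame({
    "ext_col": ["Blue", "Summit White", "Jet Black Mica", "Red Quartz Tintcoat", "Lunar Rock"],
})
# Keep the original column and put the simplified value in a new one
toy["ext_col_simple"] = toy["ext_col"].map(simplify_color)
print(toy)
```

Names without a recognizable color word (e.g. "Lunar Rock") fall into `other`. Checking `value_counts()` on the new column in both train and test would show whether that bucket is small enough to ignore.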


### 1.4.2 One-hot encoding the categorical data
Use a library to one-hot encode: brand, fuel_type, the new simplified ext_col, accident.
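One possible way to do this is `pd.get_dummies` from pandas. The sketch below uses toy frames in place of `train_df` and `test_df` and leaves out the simplified color column for brevity. It also assumes the test columns should be aligned to the training columns, since a category may appear in only one of the two files.

```python
import pandas as pd

CATEGORICAL_COLUMNS = ["brand", "fuel_type", "accident"]

# Toy data, not real rows
toy_train = pd.DataFrame({
    "brand": ["Ford", "BMW", "Tesla"],
    "fuel_type": ["Gasoline", "Gasoline", "Hybrid"],
    "accident": ["None reported", "At least 1 accident or damage reported", "None reported"],
    "model_year": [2018, 2007, 2022],
})
toy_test = pd.DataFrame({
    "brand": ["BMW", "Kia"],
    "fuel_type": ["Gasoline", "Diesel"],
    "accident": ["None reported", "None reported"],
    "model_year": [2015, 2020],
})

# One indicator column per category value; non-categorical columns pass through unchanged
train_encoded = pd.get_dummies(toy_train, columns=CATEGORICAL_COLUMNS, dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=CATEGORICAL_COLUMNS, dtype=int)

# Align test to train: categories unseen in training are dropped,
# and categories missing from test are added as all-zero columns
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.columns.tolist())
print(test_encoded)
```

scikit-learn's `OneHotEncoder(handle_unknown="ignore")` is an alternative that handles unseen categories by itself once fitted on the training data.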

### 1.4.3 Make a new dataframe for the preprocessed data
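A small sketch of this step, assuming the feature set is the one shortlisted during exploration. `FEATURE_COLUMNS`, `TARGET_COLUMN` and `prepared` are hypothetical names; the two toy rows reuse values from the `train_df.head()` output.

```python
import pandas as pd

FEATURE_COLUMNS = ["brand", "model_year", "milage", "fuel_type", "ext_col", "accident"]
TARGET_COLUMN = "price"

# Stand-in for train_df, using the first two rows shown by train_df.head()
toy_train = pd.DataFrame({
    "id": [0, 1],
    "brand": ["Ford", "BMW"],
    "model": ["F-150 Lariat", "335 i"],
    "model_year": [2018, 2007],
    "milage": [74349, 80000],
    "fuel_type": ["Gasoline", "Gasoline"],
    "ext_col": ["Blue", "Black"],
    "accident": ["None reported", "None reported"],
    "price": [11000, 8250],
})

# .copy() makes the new frame independent, so later transformations
# do not touch (or warn about) the original dataframe
prepared = toy_train[FEATURE_COLUMNS + [TARGET_COLUMN]].copy()
print(prepared)
```

Keeping the raw dataframe untouched makes it easy to go back and try a different feature set later.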