# Regression on the Used Car Prices Dataset
Input files: downloaded from the [Kaggle Playground Series S4E9 competition](https://www.kaggle.com/competitions/playground-series-s4e9/data).

Original dataset: available on Kaggle [here](https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset).
## Development Notes/Ideas


## Libraries

In [3]:
### libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import boxcox
from sklearn.preprocessing import RobustScaler
from sklearn.impute import KNNImputer

## Load and Preview Data
### Data Description
- Brand & Model (`brand`, `model`): the manufacturer and the specific model of each vehicle.
- Model Year (`model_year`): the year the vehicle was manufactured, relevant to depreciation and technology level.
- Mileage (`milage`, spelled this way in the data): the distance the vehicle has been driven, an indicator of wear and likely maintenance needs.
- Fuel Type (`fuel_type`): the fuel the vehicle runs on, e.g. gasoline, diesel, electric, or hybrid.
- Engine Type (`engine`): a free-text engine specification, which bears on performance and efficiency.
- Transmission (`transmission`): the transmission type, e.g. automatic, manual, or another variant.
- Exterior & Interior Colors (`ext_col`, `int_col`): the exterior and interior colors of the vehicle.
- Accident History (`accident`): whether the vehicle has a reported history of accidents or damage.
- Clean Title (`clean_title`): whether the vehicle has a clean title, which can affect its resale value and legal status.
- Price (`price`): the listed price of each vehicle.

In [4]:
### load data
train_raw = pd.read_csv('01_Data/train.csv')
test_raw = pd.read_csv('01_Data/test.csv')
original_raw = pd.read_csv('01_Data/used_cars.csv')

### data info
train_raw.info()
print("\n")
test_raw.info()
print("\n")
original_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188533 entries, 0 to 188532
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            188533 non-null  int64 
 1   brand         188533 non-null  object
 2   model         188533 non-null  object
 3   model_year    188533 non-null  int64 
 4   milage        188533 non-null  int64 
 5   fuel_type     183450 non-null  object
 6   engine        188533 non-null  object
 7   transmission  188533 non-null  object
 8   ext_col       188533 non-null  object
 9   int_col       188533 non-null  object
 10  accident      186081 non-null  object
 11  clean_title   167114 non-null  object
 12  price         188533 non-null  int64 
dtypes: int64(4), object(9)
memory usage: 18.7+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125690 entries, 0 to 125689
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   -
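
The `train_raw.info()` output above shows three columns with fewer than 188533 non-null entries: `fuel_type` (5083 missing, about 2.7%), `accident` (2452 missing, about 1.3%) and `clean_title` (21419 missing, about 11.4%). A minimal sketch of how such a summary could be computed directly is given below. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame for illustration only (not rows from the competition data).
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", None, "Diesel", "Gasoline"],
    "clean_title": ["Yes", None, None, "Yes"],
    "price": [4200, 4999, 13900, 45000],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(train_raw)`.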

In [5]:
### preview data
train_raw.head(5)

| | id | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MINI | Cooper S Base | 2007 | 213000 | Gasoline | 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel | A/T | Yellow | Gray | None reported | Yes | 4200 |
| 1 | 1 | Lincoln | LS V8 | 2002 | 143250 | Gasoline | 252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Silver | Beige | At least 1 accident or damage reported | Yes | 4999 |
| 2 | 2 | Chevrolet | Silverado 2500 LT | 2002 | 136731 | E85 Flex Fuel | 320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab... | A/T | Blue | Gray | None reported | Yes | 13900 |
| 3 | 3 | Genesis | G90 5.0 Ultimate | 2017 | 19500 | Gasoline | 420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Black | None reported | Yes | 45000 |
| 4 | 4 | Mercedes-Benz | Metris Base | 2021 | 7388 | Gasoline | 208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 7-Speed A/T | Black | Beige | None reported | Yes | 97500 |
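
In the preview above, each `engine` string packs horsepower, displacement and cylinder count into one text field (e.g. `172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel`). One possible way to turn it into numeric features is sketched below with regular expressions. The helper `parse_engine` and its output column names are hypothetical, and the sketch assumes that not every string follows this pattern (the original dataset's most frequent value, `2.0L I4 16V GDI DOHC Turbo`, has no horsepower figure), so unmatched parts become `NaN`.

```python
import pandas as pd


def parse_engine(engine: pd.Series) -> pd.DataFrame:
    """Extract horsepower, displacement (litres) and cylinder count; NaN when absent."""
    number = r"(\d+(?:\.\d+)?)"
    return pd.DataFrame({
        "horsepower": engine.str.extract(number + r"\s*HP", expand=False).astype(float),
        "displacement_l": engine.str.extract(number + r"\s*L\b", expand=False).astype(float),
        "cylinders": engine.str.extract(r"(\d+)\s+Cylinder", expand=False).astype(float),
    })


# Two engine strings that appear in the outputs of this notebook.
sample = pd.Series([
    "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
    "2.0L I4 16V GDI DOHC Turbo",
])
engine_features = parse_engine(sample)
print(engine_features)
```

The returned frame keeps the index of the input, so it could be joined back onto the data with `train_raw.join(parse_engine(train_raw["engine"]))`.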


In [6]:
### summarise data 
train_raw.describe(include='all')

| | id | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 188533.0 | 188533 | 188533 | 188533.0 | 188533.0 | 183450 | 188533 | 188533 | 188533 | 188533 | 186081 | 167114 | 188533.0 |
| unique | | 57 | 1897 | | | 7 | 1117 | 52 | 319 | 156 | 2 | 1 | |
| top | | Ford | F-150 XLT | | | Gasoline | 355.0HP 5.3L 8 Cylinder Engine Gasoline Fuel | A/T | Black | Black | None reported | Yes | |
| freq | | 23088 | 2945 | | | 165940 | 3462 | 49904 | 48658 | 107674 | 144514 | 167114 | |
| mean | 94266.0 | | | 2015.829998 | 65705.295174 | | | | | | | | 43878.02 |
| std | 54424.933488 | | | 5.660967 | 49798.158076 | | | | | | | | 78819.52 |
| min | 0.0 | | | 1974.0 | 100.0 | | | | | | | | 2000.0 |
| 25% | 47133.0 | | | 2013.0 | 24115.0 | | | | | | | | 17000.0 |
| 50% | 94266.0 | | | 2017.0 | 57785.0 | | | | | | | | 30825.0 |
| 75% | 141399.0 | | | 2020.0 | 95400.0 | | | | | | | | 49900.0 |
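
In the summary above, the mean `price` (43878.02) is well above the median (30825.0) and the standard deviation (78819.52) is larger than the mean, which suggests a strongly right-skewed target. A common way to handle this, sketched below, is to work on a log scale. Whether the notebook ends up using this or another transform (such as the imported `boxcox`) is not stated here, so treat this as an assumption. The helper `add_log_price` and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd


def add_log_price(df: pd.DataFrame, col: str = "price") -> pd.DataFrame:
    """Return a copy of df with an extra log1p-transformed column, log_<col>."""
    out = df.copy()
    out[f"log_{col}"] = np.log1p(out[col])
    return out


# Toy prices for illustration only: a few typical values plus one large outlier.
toy = pd.DataFrame({"price": [2000, 17000, 30825, 49900, 500000]})
toy = add_log_price(toy)

skew_before = toy["price"].skew()
skew_after = toy["log_price"].skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# Predictions made on the log scale can be mapped back with np.expm1.
recovered = np.expm1(toy["log_price"])
```

If a model is trained on the log-scale target, its predictions need the inverse transform (`np.expm1`) before they are compared with actual prices.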


In [7]:
### summarise data - original
original_raw.describe(include='all')

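| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4009 | 4009 | 4009.0 | 4009 | 3839 | 4009 | 4009 | 4009 | 4009 | 3896 | 3413 | 4009 |
| unique | 57 | 1898 | | 2818 | 7 | 1146 | 62 | 319 | 156 | 2 | 1 | 1569 |
| top | Ford | M3 Base | | 110,000 mi. | Gasoline | 2.0L I4 16V GDI DOHC Turbo | A/T | Black | Black | None reported | Yes | $15,000 |
| freq | 386 | 30 | | 16 | 3309 | 52 | 1037 | 905 | 2025 | 2910 | 3413 | 39 |
| mean | | | 2015.51559 | | | | | | | | | |
| std | | | 6.104816 | | | | | | | | | |
| min | | | 1974.0 | | | | | | | | | |
| 25% | | | 2012.0 | | | | | | | | | |
| 50% | | | 2017.0 | | | | | | | | | |
| 75% | | | 2020.0 | | | | | | | | | |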
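
In the original dataset, `milage` and `price` are stored as text (e.g. `110,000 mi.` and `$15,000`), which is why `describe` reports only `unique`, `top` and `freq` for them rather than numeric statistics. Before the original data can be compared or combined with the competition data, these columns would need converting to numbers. A minimal sketch follows, assuming the values are whole numbers wrapped in currency symbols, thousands separators and unit suffixes. The helper name `to_number` is hypothetical.

```python
import pandas as pd


def to_number(col: pd.Series) -> pd.Series:
    """Strip every non-digit character and convert to numeric; unparsable values become NaN."""
    digits = col.str.replace(r"\D", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")


# The two most frequent raw values shown in the summary above.
milage_clean = to_number(pd.Series(["110,000 mi."]))
price_clean = to_number(pd.Series(["$15,000"]))
print(milage_clean.iloc[0], price_clean.iloc[0])
```

Applied to the notebook's frame this would look like `original_raw["price"] = to_number(original_raw["price"])`, and likewise for `milage`. Stripping all non-digits would also drop a decimal point, so the pattern would need adjusting if any value carried a fractional part.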
