In [1]:
!python3 --version

Python 3.10.12


#### 🚗 Used Car Price Prediction

Data source: https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data

##### 👋 Introduction

So-called second-hand cars have a huge market. Many people consider buying a used car instead of a new one because it is more affordable and can be a better investment.

The main reason for this huge market is that when you buy a new car and sell it just a day later, without any defect, its price can drop by as much as 30%.

There are also many fraudulent sellers in the market who not only misrepresent the cars they sell but can also mislead buyers about a fair price.




##### 💻 Import Necessary Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

In [2]:
car_df = pd.read_csv("../data/vehicles.csv")

In [9]:
car_df.head(40)

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,7222695916,https://prescott.craigslist.org/cto/d/prescott...,prescott,https://prescott.craigslist.org,6000,,,,,,...,,,,,,,az,,,
1,7218891961,https://fayar.craigslist.org/ctd/d/bentonville...,fayetteville,https://fayar.craigslist.org,11900,,,,,,...,,,,,,,ar,,,
2,7221797935,https://keys.craigslist.org/cto/d/summerland-k...,florida keys,https://keys.craigslist.org,21000,,,,,,...,,,,,,,fl,,,
3,7222270760,https://worcester.craigslist.org/cto/d/west-br...,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,...,,,,,,,ma,,,
4,7210384030,https://greensboro.craigslist.org/cto/d/trinit...,greensboro,https://greensboro.craigslist.org,4900,,,,,,...,,,,,,,nc,,,
5,7222379453,https://hudsonvalley.craigslist.org/cto/d/west...,hudson valley,https://hudsonvalley.craigslist.org,1600,,,,,,...,,,,,,,ny,,,
6,7221952215,https://hudsonvalley.craigslist.org/cto/d/west...,hudson valley,https://hudsonvalley.craigslist.org,1000,,,,,,...,,,,,,,ny,,,
7,7220195662,https://hudsonvalley.craigslist.org/cto/d/poug...,hudson valley,https://hudsonvalley.craigslist.org,15995,,,,,,...,,,,,,,ny,,,
8,7209064557,https://medford.craigslist.org/cto/d/grants-pa...,medford-ashland,https://medford.craigslist.org,5000,,,,,,...,,,,,,,or,,,
9,7219485069,https://erie.craigslist.org/cto/d/erie-2012-su...,erie,https://erie.craigslist.org,3000,,,,,,...,,,,,,,pa,,,


In [4]:
car_df.shape

(426880, 26)

There are roughly 427,000 car listings from Craigslist (426,880 rows), each described by 26 attributes.

In [5]:
car_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  pain

We can learn a lot from this first look at the data.

For instance, many columns, such as *id* and *url*, carry no useful information for predicting price.

##### 🧹 Data Cleaning

Before the data is even analyzed, it needs to be cleaned. This involves handling missing values, outliers, and other data issues.
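
A quick way to see where missing values are concentrated is to compute the share of missing entries per column. The sketch below uses a small toy frame with made-up values (`toy_df` and `missing_share` are illustrative names, not part of the notebook); the same expression could be applied to `car_df`, where the `info()` output above already shows, for example, that `size` has only 120,519 non-null values out of 426,880.

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_df; values are made up for illustration only.
toy_df = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "year": [2013.0, np.nan, 2016.0, 2019.0],
    "size": [np.nan, np.nan, np.nan, "full-size"],
    "county": [np.nan, np.nan, np.nan, np.nan],
})

# isna() gives a boolean frame; the column mean is the share of missing values.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns near the top of this ranking are candidates for dropping, while columns near the bottom are cheap to keep.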

In [12]:
drop_col = ['url', 'region_url', 'title_status', 'VIN', 'size', 'image_url', 'description', 'lat','long']

We have selected a list of columns that we think can be removed.

Although lat and long could carry useful information, we chose to remove them because they are out of scope for the simple machine learning model we are trying to build.

In [13]:
car_df = car_df.drop(columns=drop_col)

In [15]:
car_df = car_df.drop(columns=["id"])
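
Notebook cells are often re-run, and dropping a column that is already gone raises a `KeyError` by default. As a hedged sketch on a toy frame (made-up values, placeholder URLs, illustrative name `toy_df`), passing `errors="ignore"` to `DataFrame.drop` makes the drop step safe to repeat:

```python
import pandas as pd

# Toy stand-in for car_df; values and URLs are made up for illustration only.
toy_df = pd.DataFrame({
    "id": [7222695916, 7218891961],
    "url": ["https://example.org/a", "https://example.org/b"],
    "price": [6000, 11900],
    "state": ["az", "ar"],
})

drop_col = ["id", "url"]

# First run removes the columns.
toy_df = toy_df.drop(columns=drop_col)

# Re-running the same cell would raise KeyError by default;
# errors="ignore" skips labels that are no longer present.
toy_df = toy_df.drop(columns=drop_col, errors="ignore")
print(toy_df.columns.tolist())
```

The trade-off is that a misspelled column name is then silently skipped, so the default behavior is still useful when writing the list for the first time.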

In [16]:
car_df.head()

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,transmission,drive,type,paint_color,county,state,posting_date
0,prescott,6000,,,,,,,,,,,,,az,
1,fayetteville,11900,,,,,,,,,,,,,ar,
2,florida keys,21000,,,,,,,,,,,,,fl,
3,worcester / central MA,1500,,,,,,,,,,,,,ma,
4,greensboro,4900,,,,,,,,,,,,,nc,


The target variable we have to predict is the *price* column.

In [17]:
car_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        426880 non-null  object 
 1   price         426880 non-null  int64  
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   transmission  424324 non-null  object 
 10  drive         296313 non-null  object 
 11  type          334022 non-null  object 
 12  paint_color   296677 non-null  object 
 13  county        0 non-null       float64
 14  state         426880 non-null  object 
 15  posting_date  426812 non-null  object 
dtypes: float64(3), int64(1), object(12)
memory usage: 52.1+ MB
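
The `info()` output above lists `county` with 0 non-null values, so the column carries no information. As a sketch on a toy frame (made-up values; `toy_df`, `all_null_cols` and `cleaned` are illustrative names), such columns can be found and dropped without naming them by hand:

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_df; values are made up for illustration only.
toy_df = pd.DataFrame({
    "price": [6000, 11900, 21000],
    "county": [np.nan, np.nan, np.nan],
    "state": ["az", "ar", "fl"],
})

# Columns in which every value is missing.
all_null_cols = toy_df.columns[toy_df.isna().all()].tolist()
print(all_null_cols)

# Drop them in one step; on this toy frame it matches drop(columns=["county"]).
cleaned = toy_df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```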


In [22]:
car_df = car_df.drop(columns=["county"])

In [24]:
car_df = car_df.dropna()

In [26]:
car_df.shape

(115988, 15)

In [25]:
car_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 115988 entries, 31 to 426878
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        115988 non-null  object 
 1   price         115988 non-null  int64  
 2   year          115988 non-null  float64
 3   manufacturer  115988 non-null  object 
 4   model         115988 non-null  object 
 5   condition     115988 non-null  object 
 6   cylinders     115988 non-null  object 
 7   fuel          115988 non-null  object 
 8   odometer      115988 non-null  float64
 9   transmission  115988 non-null  object 
 10  drive         115988 non-null  object 
 11  type          115988 non-null  object 
 12  paint_color   115988 non-null  object 
 13  state         115988 non-null  object 
 14  posting_date  115988 non-null  object 
dtypes: float64(2), int64(1), object(12)
memory usage: 14.2+ MB


We now have only about a quarter of our original data (115,988 of 426,880 rows, roughly 27%) after dropping every row that contains a NaN.
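
To make the cost of this step concrete, the sketch below computes the retained share from the row counts reported in the outputs above, then contrasts a plain `dropna()` with a more lenient variant on a toy frame. The toy values, the `required` column list and the variable names are assumptions for illustration; the notebook itself uses the plain `dropna()`.

```python
import numpy as np
import pandas as pd

# Row counts taken from the shape/info outputs in this notebook.
rows_before, rows_after = 426880, 115988
retained = rows_after / rows_before
print(f"retained {retained:.1%} of rows")

# Toy stand-in for car_df; values are made up for illustration only.
toy_df = pd.DataFrame({
    "price": [15000, 27990, 34590, 35000],
    "year": [2013.0, 2012.0, np.nan, 2019.0],
    "condition": ["excellent", np.nan, "good", np.nan],
    "paint_color": ["black", np.nan, "silver", "grey"],
})

# What the notebook does: drop any row with at least one missing value.
strict = toy_df.dropna()

# Hypothetical alternative: only require selected columns to be present.
required = ["price", "year"]
lenient = toy_df.dropna(subset=required)
print(len(strict), len(lenient))
```

The lenient variant keeps more rows but leaves NaN values in the remaining columns, which would then need imputation or a model that tolerates missing data.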

In [27]:
car_df.head()

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,transmission,drive,type,paint_color,state,posting_date
31,auburn,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,automatic,rwd,truck,black,al,2021-05-03T14:02:03-0500
32,auburn,27990,2012.0,gmc,sierra 2500 hd extended cab,good,8 cylinders,gas,68696.0,other,4wd,pickup,black,al,2021-05-03T13:41:25-0500
33,auburn,34590,2016.0,chevrolet,silverado 1500 double,good,6 cylinders,gas,29499.0,other,4wd,pickup,silver,al,2021-05-03T12:41:33-0500
34,auburn,35000,2019.0,toyota,tacoma,excellent,6 cylinders,gas,43000.0,automatic,4wd,truck,grey,al,2021-05-03T12:12:59-0500
35,auburn,29990,2016.0,chevrolet,colorado extended cab,good,6 cylinders,gas,17302.0,other,4wd,pickup,red,al,2021-05-03T11:31:14-0500


In [29]:
car_df["state"].value_counts()

state
ca    12743
fl     8145
ny     6506
tx     5716
oh     5477
mi     4387
nc     4353
pa     4230
wi     3766
va     2988
or     2984
ia     2930
ma     2882
nj     2851
tn     2781
mn     2658
il     2640
co     2449
az     2282
ok     2221
in     2119
ks     2008
id     1805
sc     1783
ct     1632
ga     1627
wa     1531
ky     1507
vt     1390
al     1322
mo     1282
nm     1199
mt     1164
md     1063
ar      990
ak      911
me      830
nh      822
ri      813
dc      756
nv      734
la      692
hi      593
sd      404
de      355
ms      344
ne      338
ut      303
wv      267
nd      226
wy      189
Name: count, dtype: int64