### In this project, we explore regression analysis and predict used car prices using an ensemble of regressors. I also hope to deploy the model as a web app for other people to play around with.

**Citations:**

URL: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

---

The following description is from Kaggle:

**Context**

Craigslist is the world's largest collection of used vehicles for sale, yet it's very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist.

**Content**

This data is scraped every few months, it contains most all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, latitude/longitude, and 18 other categories. For ML projects, consider feature engineering on location columns such as long/lat. For previous listings, check older versions of the dataset.

See https://github.com/AustinReese1998/craigslistFilter

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
#data import
df = pd.read_csv('Data/vehicles.csv')
df.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,vin,drive,size,type,paint_color,image_url,description,county,state,lat,long
0,7088746062,https://greensboro.craigslist.org/ctd/d/cary-2...,greensboro,https://greensboro.craigslist.org,10299,2012.0,acura,tl,,,gas,90186.0,clean,automatic,19UUA8F22CA003926,,,other,blue,https://images.craigslist.org/01414_3LIXs9EO33...,2012 Acura TL Base 4dr Sedan Offered by: B...,,nc,35.7636,-78.7443
1,7088745301,https://greensboro.craigslist.org/ctd/d/bmw-3-...,greensboro,https://greensboro.craigslist.org,0,2011.0,bmw,335,,6 cylinders,gas,115120.0,clean,automatic,,rwd,,convertible,blue,https://images.craigslist.org/00S0S_1kTatLGLxB...,BMW 3 Series 335i Convertible Navigation Dakot...,,nc,,
2,7088744126,https://greensboro.craigslist.org/cto/d/greens...,greensboro,https://greensboro.craigslist.org,9500,2011.0,jaguar,xf,excellent,,gas,85000.0,clean,automatic,,,,,blue,https://images.craigslist.org/00505_f22HGItCRp...,2011 jaguar XF premium - estate sale. Retired ...,,nc,36.1032,-79.8794
3,7088743681,https://greensboro.craigslist.org/ctd/d/cary-2...,greensboro,https://greensboro.craigslist.org,3995,2004.0,honda,element,,,gas,212526.0,clean,automatic,5J6YH18314L006498,fwd,,SUV,orange,https://images.craigslist.org/00E0E_eAUnhFF86M...,2004 Honda Element LX 4dr SUV Offered by: ...,,nc,35.7636,-78.7443
4,7074612539,https://lincoln.craigslist.org/ctd/d/gretna-20...,lincoln,https://lincoln.craigslist.org,41988,2016.0,chevrolet,silverado k2500hd,,,gas,,clean,automatic,1GC1KWE85GF266427,,,,,https://images.craigslist.org/00S0S_8msT7RQquO...,"Shop Indoors, Heated Showroom!!!www.gretnaauto...",,ne,41.1345,-96.2458


The data has NULL values in some columns. Let's view the info for df.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            539759 non-null  int64  
 1   url           539759 non-null  object 
 2   region        539759 non-null  object 
 3   region_url    539759 non-null  object 
 4   price         539759 non-null  int64  
 5   year          538772 non-null  float64
 6   manufacturer  516175 non-null  object 
 7   model         531746 non-null  object 
 8   condition     303707 non-null  object 
 9   cylinders     321264 non-null  object 
 10  fuel          536366 non-null  object 
 11  odometer      440783 non-null  float64
 12  title_status  536819 non-null  object 
 13  transmission  535786 non-null  object 
 14  vin           315349 non-null  object 
 15  drive         383987 non-null  object 
 16  size          168550 non-null  object 
 17  type          392290 non-null  object 
 18  pain

#### We start by dropping some of the columns: those that contain too many NULL values to be useful, and those that are hard to make use of in this analysis.

In [5]:
# columns that contain too many NULL values
df.drop(['condition', 'cylinders', 'vin', 'drive', 'size', 'type', 'paint_color', 'county'], axis=1, inplace=True)
# columns that are hard to make use of in this analysis
df.drop(['id', 'url', 'region', 'region_url', 'image_url', 'description', 'lat', 'long'], axis=1, inplace=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         539759 non-null  int64  
 1   year          538772 non-null  float64
 2   manufacturer  516175 non-null  object 
 3   model         531746 non-null  object 
 4   fuel          536366 non-null  object 
 5   odometer      440783 non-null  float64
 6   title_status  536819 non-null  object 
 7   transmission  535786 non-null  object 
 8   state         539759 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 37.1+ MB
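
To judge which columns have "too many" NULL values, the share of missing values per column can be computed directly. The snippet below is a minimal sketch on a made-up toy frame (`toy`, its values, and the 0.4 cutoff are all assumptions for illustration, not the criterion used in this notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real `df`; the values are made up.
toy = pd.DataFrame({
    "price": [10299, 0, 9500, 3995],
    "odometer": [90186.0, 115120.0, np.nan, 212526.0],
    "size": [np.nan, np.nan, np.nan, "full-size"],
})

# Fraction of missing values per column, largest first.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Hypothetical cutoff: flag columns that are more than 40% missing.
threshold = 0.4
too_sparse = null_share[null_share > threshold].index.tolist()
print(too_sparse)
```

Running the same two lines on the real `df` before dropping anything would give a ranked list to pick from; the cutoff itself remains a judgment call.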


#### Some of the remaining columns still contain NULL values, and "odometer" in particular has many. Because I think odometer is an important factor in predicting used car prices, and because we have enough examples to work with, we drop the rows where "odometer" is NULL.

In [7]:
df.drop(df.index[df['odometer'].isna()], inplace = True)
df.reset_index(drop = True, inplace = True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440783 entries, 0 to 440782
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         440783 non-null  int64  
 1   year          439815 non-null  float64
 2   manufacturer  425660 non-null  object 
 3   model         436331 non-null  object 
 4   fuel          437477 non-null  object 
 5   odometer      440783 non-null  float64
 6   title_status  437924 non-null  object 
 7   transmission  437038 non-null  object 
 8   state         440783 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 30.3+ MB
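
The same row drop can also be written with `dropna(subset=...)`, which some readers may find easier to scan. This is only an equivalent sketch on a made-up toy frame (`toy` and `cleaned` are hypothetical names), not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row has no odometer reading.
toy = pd.DataFrame({
    "price": [10299, 9500, 41988],
    "odometer": [90186.0, 85000.0, np.nan],
})

# Drop rows where "odometer" is NULL, then renumber the index from 0.
cleaned = toy.dropna(subset=["odometer"]).reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```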


#### Data Validity Checking

In [9]:
print(f"The max price of a used car is {df['price'].max()}, the min price of a used car is {df['price'].min()}.")

The max price of a used car is 4198286601, the min price of a used car is 0.


In [10]:
# A max price of about 4.2 billion and a min price of 0 are not plausible, so we only keep used cars priced between 1000 and 200000 (inclusive)
drop_idx = df.index[(df['price'] < 1000) | (df['price'] > 200000)]
df.drop(drop_idx, inplace = True)
df.reset_index(drop = True, inplace = True)

In [11]:
print(f"The max price of a used car is {df['price'].max()}, the min price of a used car is {df['price'].min()}.")

The max price of a used car is 200000, the min price of a used car is 1000.
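
As a cross-check of the inclusive bounds, the price filter can be expressed with `Series.between`, which includes both endpoints by default. The sketch below uses a made-up toy frame (`toy` and `kept` are hypothetical names) with values chosen to sit on and around the boundaries.

```python
import pandas as pd

# Hypothetical prices on and around the boundaries, plus the implausible max seen above.
toy = pd.DataFrame({"price": [0, 999, 1000, 15000, 200000, 4198286601]})

# between() is inclusive on both ends by default.
in_range = toy["price"].between(1000, 200000)
kept = toy[in_range].reset_index(drop=True)
print(kept["price"].tolist())
```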


In [12]:
df['year'].value_counts()

2017.0    32434
2016.0    30910
2015.0    30203
2013.0    29213
2014.0    28718
          ...  
1928.0        2
1942.0        2
1922.0        1
1914.0        1
0.0           1
Name: year, Length: 101, dtype: int64

In [13]:
# We are only interested in used cars with a model year between 2000 and 2019.
# Older cars are out of scope here, and they have too few instances to train on, so prediction accuracy would be questionable.
# The dataset is from Jan 2020, so we only include model years up to 2019.
# (Model year 2020 cars should be out already, but there are few instances, and some listings even show model year 2021, which suggests bad data.)
drop_idx = df.index[(df['year'] < 2000) | (df['year'] > 2019)]
df.drop(drop_idx, inplace = True)
df.reset_index(drop = True, inplace = True)

In [15]:
# the 5 most common model years after filtering
df['year'].value_counts().head()

2017.0    32434
2016.0    30910
2015.0    30203
2013.0    29213
2014.0    28718
Name: year, dtype: int64
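
One detail of the year filter in In [13] is worth spelling out: comparisons against NaN evaluate to False, so rows with a missing `year` are not matched by the mask and therefore stay in the data. The sketch below shows this on a made-up toy frame (`toy`, `kept`, and `strict` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Hypothetical model years, including one missing value.
toy = pd.DataFrame({"year": [1995.0, 2005.0, 2019.0, 2021.0, np.nan]})

# Same style of mask as in the notebook: True marks rows to drop.
out_of_range = (toy["year"] < 2000) | (toy["year"] > 2019)
kept = toy[~out_of_range]
print(kept["year"].tolist())  # the NaN row is still present

# By contrast, a positive between() mask is False for NaN and would drop that row.
strict = toy[toy["year"].between(2000, 2019)]
print(strict["year"].tolist())
```

Which behaviour is wanted depends on whether missing years are meant to be imputed later or discarded.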

#### Some values are still missing, but they make up a small percentage of the overall number of records, so we can impute them from the existing data.
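
A minimal sketch of one simple way to impute from existing data: fill a numeric column with its median and a categorical column with its most frequent value. This is an assumption for illustration only; the toy frame, the names `toy` and `imputed`, and the median/mode choice are hypothetical and not necessarily the strategy used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing year and one missing fuel type.
toy = pd.DataFrame({
    "year": [2012.0, np.nan, 2016.0, 2014.0],
    "fuel": ["gas", "gas", None, "diesel"],
})

imputed = toy.copy()
# Numeric column: fill with the median of the observed values.
imputed["year"] = imputed["year"].fillna(imputed["year"].median())
# Categorical column: fill with the most frequent observed value.
imputed["fuel"] = imputed["fuel"].fillna(imputed["fuel"].mode().iloc[0])
print(imputed)
```

Group-wise imputation (for example, the most common manufacturer for a given model) may be more accurate than a global fill, at the cost of extra complexity.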