### In this project, we will explore regression analysis and predict used car prices. After that, we will build a web app for user interaction and live predictions.
---
#### This is PART I of the project where we solely focus on data pre-processing

#### Highlights of this part:
1.  **Thought Processes** on data pre-processing
2.  Text manipulations

**Citations:**

This dataset is from Kaggle - Used Cars Dataset

URL: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

---

The following description is from Kaggle:

**Context**

Craigslist is the world's largest collection of used vehicles for sale, yet it's very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist.

**Content**

This data is scraped every few months, it contains most all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, latitude/longitude, and 18 other categories. For ML projects, consider feature engineering on location columns such as long/lat. For previous listings, check older versions of the dataset.

See https://github.com/AustinReese1998/craigslistFilter

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
df = pd.read_csv('Data/vehicles.csv')
df.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,vin,drive,size,type,paint_color,image_url,description,county,state,lat,long
0,7088746062,https://greensboro.craigslist.org/ctd/d/cary-2...,greensboro,https://greensboro.craigslist.org,10299,2012.0,acura,tl,,,gas,90186.0,clean,automatic,19UUA8F22CA003926,,,other,blue,https://images.craigslist.org/01414_3LIXs9EO33...,2012 Acura TL Base 4dr Sedan Offered by: B...,,nc,35.7636,-78.7443
1,7088745301,https://greensboro.craigslist.org/ctd/d/bmw-3-...,greensboro,https://greensboro.craigslist.org,0,2011.0,bmw,335,,6 cylinders,gas,115120.0,clean,automatic,,rwd,,convertible,blue,https://images.craigslist.org/00S0S_1kTatLGLxB...,BMW 3 Series 335i Convertible Navigation Dakot...,,nc,,
2,7088744126,https://greensboro.craigslist.org/cto/d/greens...,greensboro,https://greensboro.craigslist.org,9500,2011.0,jaguar,xf,excellent,,gas,85000.0,clean,automatic,,,,,blue,https://images.craigslist.org/00505_f22HGItCRp...,2011 jaguar XF premium - estate sale. Retired ...,,nc,36.1032,-79.8794
3,7088743681,https://greensboro.craigslist.org/ctd/d/cary-2...,greensboro,https://greensboro.craigslist.org,3995,2004.0,honda,element,,,gas,212526.0,clean,automatic,5J6YH18314L006498,fwd,,SUV,orange,https://images.craigslist.org/00E0E_eAUnhFF86M...,2004 Honda Element LX 4dr SUV Offered by: ...,,nc,35.7636,-78.7443
4,7074612539,https://lincoln.craigslist.org/ctd/d/gretna-20...,lincoln,https://lincoln.craigslist.org,41988,2016.0,chevrolet,silverado k2500hd,,,gas,,clean,automatic,1GC1KWE85GF266427,,,,,https://images.craigslist.org/00S0S_8msT7RQquO...,"Shop Indoors, Heated Showroom!!!www.gretnaauto...",,ne,41.1345,-96.2458


The data appears to have NULL values in several columns. Let's view the info for our dataframe.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            539759 non-null  int64  
 1   url           539759 non-null  object 
 2   region        539759 non-null  object 
 3   region_url    539759 non-null  object 
 4   price         539759 non-null  int64  
 5   year          538772 non-null  float64
 6   manufacturer  516175 non-null  object 
 7   model         531746 non-null  object 
 8   condition     303707 non-null  object 
 9   cylinders     321264 non-null  object 
 10  fuel          536366 non-null  object 
 11  odometer      440783 non-null  float64
 12  title_status  536819 non-null  object 
 13  transmission  535786 non-null  object 
 14  vin           315349 non-null  object 
 15  drive         383987 non-null  object 
 16  size          168550 non-null  object 
 17  type          392290 non-null  object 
 18  pain

#### We start by dropping some of the columns: those that contain too many NULL values to be useful, and those that are hard to make use of for this analysis.

In [5]:
#columns contain too many NULL values
df.drop(['condition','cylinders', 'vin', 'drive', 'size', 'type', 'paint_color', 'county'], axis = 1, inplace = True)
#columns that are hard to make use of
df.drop(['id', 'url', 'region', 'region_url', 'image_url', 'description', 'lat', 'long'], axis = 1, inplace = True)
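
To make "too many NULL values" concrete, the share of missing values per column can be computed directly. The following is a minimal sketch on a made-up toy frame; the 40% cutoff and the name `threshold` are assumptions for illustration, not the rule used to pick the columns dropped above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the vehicles data (values are illustrative only)
toy = pd.DataFrame({
    "price": [10299, 0, 9500, 3995],
    "size": [np.nan, np.nan, np.nan, "full-size"],
    "odometer": [90186.0, 115120.0, np.nan, 212526.0],
})

# Fraction of missing values per column, largest first
null_fraction = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are more than 40% empty
threshold = 0.4
mostly_empty = null_fraction.index[null_fraction > threshold].tolist()

print(null_fraction)
print(mostly_empty)
```

A table like `null_fraction` makes it easier to justify which columns to drop and which to keep and repair.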

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         539759 non-null  int64  
 1   year          538772 non-null  float64
 2   manufacturer  516175 non-null  object 
 3   model         531746 non-null  object 
 4   fuel          536366 non-null  object 
 5   odometer      440783 non-null  float64
 6   title_status  536819 non-null  object 
 7   transmission  535786 non-null  object 
 8   state         539759 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 37.1+ MB


#### Some of the remaining columns still contain NULL values, and "odometer" in particular still has many. Because I think odometer is an important factor in predicting used car prices, and because we have enough examples to work with, we drop the rows where "odometer" is NULL.

In [7]:
df.drop(df.index[df['odometer'].isna()], inplace = True)
df.reset_index(drop = True, inplace = True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440783 entries, 0 to 440782
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         440783 non-null  int64  
 1   year          439815 non-null  float64
 2   manufacturer  425660 non-null  object 
 3   model         436331 non-null  object 
 4   fuel          437477 non-null  object 
 5   odometer      440783 non-null  float64
 6   title_status  437924 non-null  object 
 7   transmission  437038 non-null  object 
 8   state         440783 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 30.3+ MB


#### Now we still have some missing values but they are a small percentage of the overall number of records. We can impute the missing values from existing data.

In [9]:
#For categorical variables, we impute missing values by random sampling with replacement
#from the observed values, so the imputed values follow the existing class distribution
def CategoricalImputer(df, column_names):
    for column in column_names:
        missing = df[column].isna()
        replacements = np.random.choice(a = df.loc[~missing, column], size = missing.sum(), replace = True)
        replacements = pd.Series(replacements, index = df.index[missing])
        #assign the result back: fillna(inplace = True) on a column selection may not update df in newer pandas versions
        df[column] = df[column].fillna(replacements)

In [10]:
CategoricalImputer(df, ['year', 'manufacturer', 'model', 'fuel', 'title_status', 'transmission'])
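
Because the imputation draws random samples, two runs of the notebook can fill the gaps differently. The following is a minimal sketch of the same idea with a seeded generator; the function name `impute_by_sampling` and the `seed` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_by_sampling(frame, column_names, seed=0):
    """Return a copy where NaNs are filled by sampling observed values with replacement."""
    rng = np.random.default_rng(seed)  # fixed seed (an assumption) makes the result repeatable
    out = frame.copy()
    for column in column_names:
        missing = out[column].isna()
        observed = out.loc[~missing, column].to_numpy()
        # Draw one observed value per missing cell, so frequent classes are drawn more often
        draws = rng.choice(observed, size=int(missing.sum()), replace=True)
        out.loc[missing, column] = draws
    return out

toy = pd.DataFrame({"fuel": ["gas", "gas", None, "diesel", None, "gas"]})
filled = impute_by_sampling(toy, ["fuel"])
print(filled["fuel"].value_counts())
```

Returning a copy rather than modifying the input is a design choice of this sketch; it keeps the original frame available for comparison.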

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440783 entries, 0 to 440782
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         440783 non-null  int64  
 1   year          440783 non-null  float64
 2   manufacturer  440783 non-null  object 
 3   model         440783 non-null  object 
 4   fuel          440783 non-null  object 
 5   odometer      440783 non-null  float64
 6   title_status  440783 non-null  object 
 7   transmission  440783 non-null  object 
 8   state         440783 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 30.3+ MB


#### Great, we no longer have NULL values. Next, we perform some data validity checks.

**Price:**

In [12]:
print(f"The max price of all used cars is {df['price'].max()}, the min price of all used cars is {df['price'].min()}.")

The max price of all used cars is 4198286601, the min price of all used cars is 0.


These values do not make much sense: a price of 0 is not a real sale price, and a maximum above 4 billion is clearly bad data. Let's keep only the used cars priced between 1000 and 200000.

In [13]:
drop_idx = df.index[(df['price'] < 1000) | (df['price'] > 200000)]
df.drop(drop_idx, inplace = True)
df.reset_index(drop = True, inplace = True)

In [14]:
print(f"The max price of all used cars is now {df['price'].max()}, the min price of all used cars is now {df['price'].min()}.")

The max price of all used cars is now 200000, the min price of all used cars is now 1000.
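
The same kind of range filter can also be written as a keep-mask instead of a drop-index. The following is a minimal sketch on made-up prices, assuming the same inclusive bounds of 1000 and 200000.

```python
import pandas as pd

# Toy prices, including the two extremes seen in the real data
toy = pd.DataFrame({"price": [0, 999, 1000, 15000, 200000, 4198286601]})

# Series.between is inclusive on both ends by default
kept = toy[toy["price"].between(1000, 200000)].reset_index(drop=True)

print(kept["price"].tolist())
```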


**Year:**

In [15]:
df['year'].value_counts()

2017.0    32499
2016.0    30981
2015.0    30279
2013.0    29284
2014.0    28794
          ...  
1928.0        2
1942.0        2
1922.0        1
1914.0        1
0.0           1
Name: year, Length: 101, dtype: int64

In [16]:
df['year'].max() #doesn't make sense because the data is from Jan 2020

2021.0

Let's only work with used cars with model years between 2000 and 2019.
Some of the used cars are really old, such as the one from 1914. Although such listings are possible, there are too few of them to train on, so prediction accuracy for them would be questionable.
The dataset is from Jan 2020, so we only include model years up to 2019. Year 2020 models should already be out, but again there are few instances, and some records are simply bad data, such as the used cars with model year 2021 or year 0.

In [17]:
drop_idx = df.index[(df['year'] < 2000) | (df['year'] > 2019)]
df.drop(drop_idx, inplace = True)
df.reset_index(drop = True, inplace = True)

**Manufacturer:**

In [18]:
df['manufacturer'].value_counts()

ford               70487
chevrolet          55134
toyota             30385
nissan             21634
ram                19658
honda              19613
jeep               18593
gmc                17465
dodge              12961
bmw                11489
hyundai             9764
subaru              9553
mercedes-benz       8779
volkswagen          8445
kia                 7493
chrysler            6291
cadillac            5843
buick               5491
lexus               5057
audi                4930
mazda               4798
infiniti            3291
acura               3120
lincoln             2479
volvo               2473
pontiac             2217
mini                2055
mitsubishi          1941
rover               1601
saturn              1188
mercury             1180
jaguar               798
fiat                 786
tesla                198
harley-davidson      163
alfa-romeo            43
ferrari               34
aston-martin          25
land rover            15
porche                 7


In [19]:
#Manufacturer data looks good but again to ensure there is enough data for training,
#we would like to drop instances where count < 1000,
#knowing that we would not be able to predict instances with those manufacturers
counts = df['manufacturer'].value_counts()
values_to_drop = counts.index[counts < 1000]
drop_idx = df.index[df['manufacturer'].isin(values_to_drop)]
df.drop(drop_idx, inplace = True)
df.reset_index(drop = True, inplace = True)
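
An equivalent way to express the rare-category filter is to attach each row's category frequency and keep the rows at or above the cutoff. The following is a minimal sketch on a made-up frame; `min_count = 3` is a hypothetical cutoff for the toy data, whereas the notebook uses 1000.

```python
import pandas as pd

toy = pd.DataFrame({"manufacturer": ["ford"] * 5 + ["toyota"] * 3 + ["ferrari"]})

min_count = 3  # hypothetical cutoff for this toy example

# Map every row to the number of rows that share its manufacturer
row_counts = toy["manufacturer"].map(toy["manufacturer"].value_counts())
kept = toy[row_counts >= min_count].reset_index(drop=True)

print(kept["manufacturer"].value_counts())
```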

**Model column turns out to be a special one...**

In [20]:
df['model'].value_counts()

f-150                          8132
silverado 1500                 5261
1500                           4678
silverado                      3522
wrangler                       2787
                               ... 
yukon xl denali 1500 4x4 mo       1
silverado 3500 high co            1
f 150 xlt 4x4 super cab           1
1500 long horn 4x4 gas            1
Chrysler/ Sebring                 1
Name: model, Length: 25498, dtype: int64

#### Noticed the issue here? Because "model" is likely a free-text field, sellers can enter different variations of the same car model. For example, "silverado 1500", "1500" and "silverado" might represent the same car model. If we use the values as they are, they would be treated as different car models, which could hurt the performance of our model (the regression model, not the car model :)). To mitigate this issue, we take the following steps:
1. Transform all "model" values to lower case for better matching.
2. Split each record's model string into individual words.
3. Find the 150 most frequent words based on weighted frequency. Weights are assigned by word position in a record: in "silverado 1500", "silverado" gets more weight than "1500", because the closer a word is to the front, the more likely it is to represent the actual car model. (Here we assume a car model is a single word rather than a combination of words.)
4. Go through the 150 words, remove the ones that are not car models, and end up with the most common derived car models.
5. Map each record to one of the most common derived "car models", and mark it "Others" if it cannot be mapped.
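
The position-based weighting in step 3 can be sketched on a few made-up strings. This is an illustrative version using `collections.Counter`, and the function name `weighted_word_counts` is hypothetical. In a record with n words, the first word gets weight n and the last word gets weight 1.

```python
from collections import Counter

def weighted_word_counts(model_strings):
    """Count words, weighting each by how close it is to the front of its record."""
    counts = Counter()
    for text in model_strings:
        words = text.lower().split()
        for position, word in enumerate(words):
            # First word gets weight len(words), last word gets weight 1
            counts[word] += len(words) - position
    return counts

toy_models = ["Silverado 1500", "silverado", "1500 crew cab"]
counts = weighted_word_counts(toy_models)
print(counts.most_common(3))
# [('1500', 4), ('silverado', 3), ('crew', 2)]
```

Even in this tiny example a trim word such as "crew" gets a sizeable score, which is why step 4 is a manual clean-up.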

In [21]:
#Transform car models to lower case
df['model'] = df['model'].str.lower()

In [22]:
weighted_words = []
for item in df['model']:
    #split into words
    words = item.split()
    #create weighted words list
    for word, weight in zip(words, (np.arange(len(words)) + 1)[::-1]):
        weighted_words += [word] * weight

In [23]:
#Find most frequent "car models"
top_models = pd.Series(weighted_words).value_counts().head(150)

In [24]:
#Now we would go through all the "top models" and clean the Series up
pd.set_option('display.max_rows', None)
top_models

1500           45138
silverado      43546
sport          28960
cab            27778
sierra         23751
super          22727
f150           22274
crew           21938
grand          21091
4x4            19352
2500           16691
f-150          15386
se             14416
lt             14405
duty           13798
sedan          13618
wrangler       13173
4d             10931
tacoma         10724
xlt            10301
f-250          10241
escape         10161
cherokee        9843
awd             9738
coupe           9671
s               9528
civic           9430
f250            9317
tundra          9172
limited         9116
f-350           9062
utility         8926
accord          8804
mustang         8764
3500            8391
supercrew       7978
slt             7842
equinox         7826
rogue           7487
altima          7458
camry           7425
lx              7084
focus           7035
4dr             6840
explorer        6539
fusion          6425
2500hd          6250
camaro       

In [25]:
top_models.drop(['sport', 'cab', 'super', '4x4', 'se', 'lt', 'sedan',  '4d', 'xlt', 'lifted', 
                 'awd', 's',  'slt', 'lx', '4dr', '4wd', 'duty', 'limited', 'gas', 'unlimited',
                'premium', 'sv',  'ls',  'xl',  'ex', 'utility', 'series', 'double', 'cargo', 
                '2d', 'pickup', '3', 'le', 'sel', 'gt', 'hd', 'ltz', 'fwd', 'ex-l', 'sle', 'hatchback',
                 'sxt', '2.5', 'fe', '300', 'quad', 'diesel', 'sl', 'van', 'rx', '4', '2.0t', '2.5i'], inplace = True)

In [26]:
pd.reset_option('display.max_rows')

In [27]:
top_models = pd.Series(top_models.index)

In [28]:
top_models

0          1500
1     silverado
2        sierra
3          f150
4          crew
        ...    
92      liberty
93       murano
94       denali
95      patriot
96     sportage
Length: 97, dtype: object

The list might not be perfect, but it is as good as we can get. Finally, we map every record to one of the top_models.

In [29]:
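#First, we would like to assign "Others" to records that do not have words in top_models
#re.escape keeps characters such as "." in a model name from being read as regex syntax
import re
df.loc[~df['model'].str.contains('|'.join(map(re.escape, top_models))), 'model'] = 'Others'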

In [30]:
#map to top_models
from tqdm import tqdm
for item in tqdm(top_models):
    df['model'] = df['model'].apply(lambda x: item if item in x else x)

100%|██████████| 97/97 [00:04<00:00, 20.25it/s]
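
The effect of the mapping loop can be illustrated with a small helper that returns the first top model found as a substring, checked in order of weighted frequency. This is a sketch that approximates the loop above rather than reproducing it exactly, and the name `map_to_top_model` is hypothetical.

```python
def map_to_top_model(text, top_models, fallback="Others"):
    """Return the first entry of top_models that occurs in text, else the fallback."""
    for candidate in top_models:
        if candidate in text:  # plain substring test, no regex involved
            return candidate
    return fallback

# Toy priority list, most frequent first
top = ["1500", "silverado", "wrangler"]
examples = ["silverado 1500 lt", "wrangler unlimited", "model s"]
mapped = [map_to_top_model(text, top) for text in examples]
print(mapped)
# ['1500', 'wrangler', 'Others']
```

Note that "silverado 1500 lt" maps to "1500" rather than "silverado", because "1500" comes first in the priority list. The order of the list therefore decides which label a multi-word record receives.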


In [31]:
print(f"We successfully mapped {(df['model'] != 'Others').sum()} records for car models out of total of {len(df)} records.")

We successfully mapped 269258 records for car models out of total of 375408 records.


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375408 entries, 0 to 375407
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         375408 non-null  int64  
 1   year          375408 non-null  float64
 2   manufacturer  375408 non-null  object 
 3   model         375408 non-null  object 
 4   fuel          375408 non-null  object 
 5   odometer      375408 non-null  float64
 6   title_status  375408 non-null  object 
 7   transmission  375408 non-null  object 
 8   state         375408 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 25.8+ MB


**Odometer:**

In [33]:
print(f"The odometer ranges from {df['odometer'].min()} to {df['odometer'].max()}.")

The odometer ranges from 0.0 to 64809218.0.


Odometer values between 10 and 250000 miles are a reasonable range, so we drop records with odometer values < 10 or > 250000.

In [34]:
df.drop(df.index[(df['odometer'] < 10) | (df['odometer'] > 250000)], inplace = True)
df.reset_index(drop = True, inplace = True)

In [35]:
print(f"The odometer now ranges from {df['odometer'].min()} to {df['odometer'].max()}.")

The odometer now ranges from 10.0 to 250000.0.


**Title_status:**

In [36]:
df['title_status'].value_counts()

clean         352850
rebuilt         9608
salvage         3713
lien            2403
missing           70
parts only        34
Name: title_status, dtype: int64

We drop "missing" and "parts only" because these two title statuses have too few entries.

In [37]:
df.drop(df.index[df['title_status'].isin(['missing', 'parts only'])], inplace = True)
df.reset_index(drop = True, inplace = True)

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 368574 entries, 0 to 368573
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   price         368574 non-null  int64  
 1   year          368574 non-null  float64
 2   manufacturer  368574 non-null  object 
 3   model         368574 non-null  object 
 4   fuel          368574 non-null  object 
 5   odometer      368574 non-null  float64
 6   title_status  368574 non-null  object 
 7   transmission  368574 non-null  object 
 8   state         368574 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 25.3+ MB


#### Now that all the data pre-processing is complete, we save the result as a new csv file for Part II of this analysis to use directly

In [39]:
df.to_csv('Data/vehicles_pre-processed.csv', index = False)