# Used Car Auction Sales

Elimelech Berlin  
January 2024

## Context

## Data

## EDA

Imports:

In [1]:
import pandas as pd
from datetime import date, time, datetime

Load the data:
Running `pd.read_csv('data/car_prices.csv')` with default settings raised this error: `ParserError: Error tokenizing data. C error: Expected 16 fields in line 408163, saw 17`. To find the cause, I examined the file in Notepad & found an extra comma in the original csv file, which I edited manually. The call below also passes `on_bad_lines='warn'`, so malformed lines are skipped with a warning rather than raising an error.

In [2]:
# load csv file into a dataframe, raise a warning when rows cannot be loaded
df = pd.read_csv('data/car_prices.csv', on_bad_lines='warn')

Skipping line 408163: expected 16 fields, saw 17
Skipping line 417837: expected 16 fields, saw 17
Skipping line 421291: expected 16 fields, saw 17
Skipping line 424163: expected 16 fields, saw 17

Skipping line 427042: expected 16 fields, saw 17
Skipping line 427045: expected 16 fields, saw 17
Skipping line 434426: expected 16 fields, saw 17
Skipping line 444503: expected 16 fields, saw 17
Skipping line 453796: expected 16 fields, saw 17

Skipping line 461599: expected 16 fields, saw 17
Skipping line 461614: expected 16 fields, saw 17

Skipping line 492486: expected 16 fields, saw 17
Skipping line 497010: expected 16 fields, saw 17
Skipping line 497013: expected 16 fields, saw 17
Skipping line 499085: expected 16 fields, saw 17
Skipping line 501457: expected 16 fields, saw 17
Skipping line 505301: expected 16 fields, saw 17
Skipping line 505308: expected 16 fields, saw 17
Skipping line 520463: expected 16 fields, saw 17

Skipping line 528998: expected 16 fields, saw 17
Skipping line 52

> Roughly 25 malformed rows are skipped while loading the data.
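
Instead of only warning about the skipped lines, they can be captured for later inspection. The following is a minimal sketch on a toy CSV (not the real file). It assumes pandas 1.4 or newer, where `on_bad_lines` accepts a callable when `engine='python'` is used. The names `collect_bad_line` and `bad_rows` are hypothetical.

```python
import io
import pandas as pd

# Toy CSV standing in for car_prices.csv; the third data row has an extra field.
raw = io.StringIO(
    "year,make,sellingprice\n"
    "2015,Kia,21500\n"
    "2015,Volvo,27750\n"
    "2014,BMW,30000,extra\n"
    "2012,Honda,11600\n"
)

bad_rows = []  # hypothetical container for the rejected rows


def collect_bad_line(fields):
    """Record the malformed row and return None so pandas skips it."""
    bad_rows.append(fields)
    return None


# A callable for on_bad_lines is only supported by the python engine.
toy = pd.read_csv(raw, on_bad_lines=collect_bad_line, engine="python")

print(toy.shape)   # shape of the rows that loaded cleanly
print(bad_rows)    # the raw fields of every skipped row
```

With the rejected rows in hand, it becomes possible to decide whether they can be repaired (for example, by merging two fields split by a stray comma) rather than discarded.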

Let's have a look at the shape of the data:

In [3]:
df.shape

(558811, 16)

> There are 550k+ records described by 16 columns/features.

To gain further understanding of the data, view the column names & the first few rows:

In [4]:
df.columns

Index(['year', 'make', 'model', 'trim', 'body', 'transmission', 'vin', 'state',
       'condition', 'odometer', 'color', 'interior', 'seller', 'mmr',
       'sellingprice', 'saledate'],
      dtype='object')

In [5]:
df.head()

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,seller,mmr,sellingprice,saledate
0,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg566472,ca,5.0,16639.0,white,black,"kia motors america, inc",20500,21500,Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
1,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg561319,ca,5.0,9393.0,white,beige,"kia motors america, inc",20800,21500,Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
2,2014,BMW,3 Series,328i SULEV,Sedan,automatic,wba3c1c51ek116351,ca,4.5,1331.0,gray,black,financial services remarketing (lease),31900,30000,Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
3,2015,Volvo,S60,T5,Sedan,automatic,yv1612tb4f1310987,ca,4.1,14282.0,white,black,volvo na rep/world omni,27500,27750,Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
4,2014,BMW,6 Series Gran Coupe,650i,Sedan,automatic,wba6b2c57ed129731,ca,4.3,2641.0,gray,black,financial services remarketing (lease),66000,67000,Thu Dec 18 2014 12:30:00 GMT-0800 (PST)


> Most of the columns are self-explanatory, & nearly all of them describe features that are relevant to our analysis. Even the VIN is relevant, as some of the information encoded in it may affect vehicle price. (That information will be extracted & a new column created for it.) 'seller' may be dropped, as it will be difficult, if not impossible, to extract meaningful insight from it. 'saledate' may prove useful, as seasonality (month), day of week & time of day may play a role in the sale price; however, that column must be divided into several different features.
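
As an illustration of the kind of information encoded in a VIN, the sketch below decodes the model year from the 10th character. It is not the extraction used later in this notebook. It assumes 17-character North American VINs, where the year codes repeat every 30 years and (as an assumption of this sketch) a letter in position 7 marks the 2010-2039 cycle while a digit marks 1980-2009. The function name `vin_model_year` is hypothetical.

```python
# Year codes in order; the letters I, O, Q, U, Z and the digit 0 are not used.
YEAR_CODES = "ABCDEFGHJKLMNPRSTVWXY123456789"


def vin_model_year(vin):
    """Return the model year encoded in a 17-character VIN, or None if undecodable."""
    vin = vin.upper()
    if len(vin) != 17 or vin[9] not in YEAR_CODES:
        return None
    offset = YEAR_CODES.index(vin[9])
    # Assumption: position 7 disambiguates the 30-year cycle.
    base = 2010 if vin[6].isalpha() else 1980
    return base + offset


# VINs taken from the rows shown in this notebook, with the 'year' column for comparison.
for vin, year in [("5xyktca69fg566472", 2015),
                  ("wba3c1c51ek116351", 2014),
                  ("19uua5663ya022038", 2000)]:
    print(vin, vin_model_year(vin), year)
```

On these rows the decoded year agrees with the 'year' column, which suggests the same approach could be used to sanity-check or fill in that column.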

Now, let's drop the 'seller' column:

In [6]:
df.drop(['seller'], axis=1, inplace=True)

With the irrelevant column removed, we can proceed with cleaning & prepping the data. First, check for missing values:

In [7]:
df.isna().sum()

year                0
make            10301
model           10399
trim            10651
body            13195
transmission    65352
vin                 0
state               0
condition       11794
odometer           94
color             749
interior          749
mmr                 0
sellingprice        0
saledate            1
dtype: int64

> Several columns are missing 10k+ values: make, model, trim, body, transmission, condition. Let's investigate further by checking whether these missing values occur in the same rows:

In [8]:
df[df['make'].isna()].isna().sum()

year                0
make            10301
model           10301
trim            10301
body            10301
transmission     1761
vin                 0
state               0
condition          87
odometer            4
color              15
interior           15
mmr                 0
sellingprice        0
saledate            0
dtype: int64

> We see that many rows are missing information in several columns at once: every row that is missing 'make' is also missing 'model', 'trim' & 'body'. The missing information is likely essential to our understanding of what drives value for the vehicles, & we can't learn enough without it. Although some of the missing information can be derived from the VINs of those cars, we will drop those records from the dataframe & exclude them from this analysis.
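
The overlap between missing values can also be counted directly. This is a small sketch on made-up data (the frame `toy` and its values are hypothetical), comparing rows that are missing every listed column with rows that are missing at least one:

```python
import numpy as np
import pandas as pd

# Made-up frame: rows 1 & 3 miss all descriptive columns, row 2 misses only 'model'.
toy = pd.DataFrame({
    "make":  ["Kia", np.nan, "BMW", np.nan],
    "model": ["Sorento", np.nan, np.nan, np.nan],
    "trim":  ["LX", np.nan, "328i", np.nan],
    "vin":   ["a", "b", "c", "d"],
})

cols = ["make", "model", "trim"]
missing = toy[cols].isna()

all_missing = missing.all(axis=1).sum()  # rows missing every listed column
any_missing = missing.any(axis=1).sum()  # rows missing at least one of them

print(all_missing, any_missing)
```

If the two counts are close, the gaps are concentrated in the same records; a large difference would mean the missing values are scattered across different rows.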

In [31]:
# create new dataframe without null values
df2 = df.dropna().reset_index(drop=True)

Let's now have a look at the datatypes present in the dataset:

In [32]:
df2.dtypes

year              int64
make             object
model            object
trim             object
body             object
transmission     object
vin              object
state            object
condition       float64
odometer         object
color            object
interior         object
mmr               int64
sellingprice     object
saledate         object
dtype: object

> Several columns have the wrong data type: odometer, sellingprice & saledate. The first two will be converted to a numeric dtype; saledate will be converted to datetime.

To deal with these columns, let's begin with 'odometer'. Examine individual values to learn whether the entire column is non-numeric (which can be addressed by a simple change of dtype) or whether there are problems with specific values:

In [33]:
type(df2.iloc[22265]['odometer'])

float

> The above output shows that numeric values are present in the column, which suggests that the object dtype results from a few incorrectly saved values. To deal with this, first attempt to simply change the datatype:

In [34]:
df2['odometer'] = pd.to_numeric(df2['odometer'])

In [35]:
df2.odometer.dtypes

dtype('float64')

> That seems to have solved the problem, as everything was converted without any errors.
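
Had the direct conversion failed, the offending values could have been located with `errors='coerce'`, which turns anything unparseable into NaN. A minimal sketch on a made-up series (the values below are not from the dataset):

```python
import pandas as pd

# Made-up object column mixing numeric strings, a float & one unparseable value.
odometer = pd.Series(["16639", 9393.0, "n/a", "1331"], dtype=object)

# Unparseable entries become NaN instead of raising a ValueError.
converted = pd.to_numeric(odometer, errors="coerce")

# Values that were present originally but could not be converted.
problem_values = odometer[converted.isna() & odometer.notna()]

print(converted.dtype)
print(problem_values)
```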

Now, let's attempt the same solution for sellingprice:

In [36]:
df2['sellingprice'] = pd.to_numeric(df2['sellingprice'])

In [37]:
df2.sellingprice.dtypes

dtype('int64')

> With the numeric columns successfully converted, let's transform the 'saledate' column to a datetime-like type.

First, view the dtype of the column, & then view an example:

In [38]:
# view datatypes present in the column
df2.saledate.dtypes

dtype('O')

> The column has dtype 'O' (object), so the dates are not yet stored as datetimes.

Proceed to replace the original column with a transformed version. To write the correct format string for the `datetime.strptime` method, preview the saledate value for one of the rows:

In [39]:
print(df2['saledate'][26846])
type(df2['saledate'][26846])

Thu Jan 22 2015 06:30:00 GMT-0800 (PST)


str

> The above output shows that the data in saledate is a string. Let's proceed to transform it to a datetime object:

In [40]:
df2['saledate'] = [datetime.strptime(d.replace(' (PST)', '').replace(' (PDT)', ''), "%a %b %d %Y %H:%M:%S GMT%z") for d in df2['saledate']]
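
A vectorized alternative to the list comprehension is `pd.to_datetime` with the same format string. The sketch below runs on two sample strings in the dataset's format & is an alternative, not the method used in this notebook. Note one difference: with `utc=True` every timestamp is converted to UTC, so the local hour of the sale is no longer directly visible, whereas the `strptime` approach keeps each row's own offset.

```python
import pandas as pd

# Two sample strings in the same format as the 'saledate' column.
raw = pd.Series([
    "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)",
    "Thu Jun 04 2015 05:30:00 GMT-0700 (PDT)",
])

# Strip the trailing time zone abbreviation, which strptime-style formats cannot use here.
cleaned = raw.str.replace(r" \((?:PST|PDT)\)$", "", regex=True)

# utc=True lets rows with different offsets share one timezone-aware dtype.
parsed = pd.to_datetime(cleaned, format="%a %b %d %Y %H:%M:%S GMT%z", utc=True)

print(parsed)
```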

Now, iterate through every saledate to ensure correct transformation:

In [41]:
for i in df2.index:
    if not isinstance(df2['saledate'][i], datetime):
        print(type(df2['saledate'][i]))
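
As noted earlier, saledate is most useful once it is divided into separate features such as month, day of week & time of day. A minimal sketch of that split on two parsed sample dates follows. The column names `sale_month`, `sale_weekday` & `sale_hour` are hypothetical, & the values are read from each datetime's own local offset.

```python
from datetime import datetime

import pandas as pd

FMT = "%a %b %d %Y %H:%M:%S GMT%z"

# Two sample sale dates, parsed the same way as in the cell above.
toy = pd.DataFrame({"saledate": [
    datetime.strptime("Tue Dec 16 2014 12:30:00 GMT-0800", FMT),
    datetime.strptime("Thu Jun 04 2015 05:30:00 GMT-0700", FMT),
]})

# Hypothetical feature columns derived from each datetime.
toy["sale_month"] = [d.month for d in toy["saledate"]]
toy["sale_weekday"] = [d.weekday() for d in toy["saledate"]]  # Monday is 0
toy["sale_hour"] = [d.hour for d in toy["saledate"]]

print(toy[["sale_month", "sale_weekday", "sale_hour"]])
```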

Let's check for duplicated rows. Although there may be some redundancy in column 'vin', this may result from multiple sales for a single vehicle.

In [42]:
df2.duplicated(subset='vin').sum()

6557

> A number of vehicles are listed more than once. This is not necessarily redundant information, as a single vehicle may have been sold multiple times. To investigate this, let's view some of the duplicates & then check whether all columns are duplicated:

In [43]:
df2[df2.duplicated(subset='vin', keep=False)].sort_values(by = 'vin')

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,mmr,sellingprice,saledate
178324,2000,Acura,TL,3.2,Sedan,automatic,19uua5663ya022038,fl,1.9,105431.0,gold,tan,2325,1000,2015-01-27 10:00:00-08:00
31978,2000,Acura,TL,3.2,Sedan,automatic,19uua5663ya022038,fl,1.9,105420.0,gold,beige,2150,1100,2014-12-23 12:15:00-08:00
128236,2006,Acura,TL,Base,Sedan,manual,19uua65596a059705,nj,2.6,89661.0,white,brown,9025,8200,2015-01-28 01:30:00-08:00
289231,2006,Acura,TL,Base,Sedan,manual,19uua65596a059705,nj,2.5,89741.0,white,black,9100,8500,2015-03-04 01:30:00-08:00
150649,2005,Acura,TL,3.2,Sedan,automatic,19uua66215a070166,ca,3.7,131727.0,silver,gray,6600,6900,2015-01-22 04:00:00-08:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209261,2006,Maserati,Quattroporte,Base,Sedan,automatic,zamce39a460025306,ca,2.9,92655.0,silver,black,17250,15500,2015-02-04 04:30:00-08:00
166118,2007,Maserati,Quattroporte,Executive GT DuoSelect,Sedan,automatic,zamce39a470026893,ca,2.7,46087.0,gray,gray,25100,23000,2015-02-10 04:30:00-08:00
412054,2007,Maserati,Quattroporte,Executive GT DuoSelect,sedan,automatic,zamce39a470026893,ca,3.4,46128.0,gray,gray,26800,23500,2015-06-04 05:30:00-07:00
101745,2014,FIAT,500L,Easy,Wagon,automatic,zfbcfabh4ez025834,fl,4.0,9435.0,red,gray,12600,10200,2015-02-02 04:30:00-08:00


> In the above output we see several examples of rows with identical VINs that are records of different sales.

Let's now check the entire dataframe for rows that are truly duplicated (i.e., multiple records for the same sale):

In [44]:
df2.duplicated(subset = ['vin', 'saledate']).sum()

56

> The above output shows that a small number of rows share both 'vin' & 'saledate'. Let's view a dataframe with those records:

In [45]:
df2[df2.duplicated(subset = ['vin', 'saledate'], keep= False)].sort_values('vin')

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,mmr,sellingprice,saledate
199574,2012,Honda,Civic,EX-L,Sedan,automatic,19xfb2f97ce313922,md,4.4,68059.0,white,beige,11100,11600,2015-02-03 01:30:00-08:00
26380,2012,Honda,Civic,EX-L,Sedan,automatic,19xfb2f97ce313922,md,4.0,1.0,white,beige,14150,3900,2015-02-03 01:30:00-08:00
12625,2007,Dodge,Caliber,SXT,Wagon,automatic,1b3hb48bx7d113596,ga,2.0,1.0,red,gray,5100,2300,2015-01-22 04:30:00-08:00
150018,2007,Dodge,Caliber,SXT,Wagon,automatic,1b3hb48bx7d113596,ga,2.0,326716.0,red,gray,1175,900,2015-01-22 04:30:00-08:00
151411,2001,Dodge,Dakota,Base,Club Cab,automatic,1b7gg22n01s348630,az,1.0,140318.0,black,black,1600,2200,2015-01-22 03:00:00-08:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472314,2008,BMW,1 Series,135i,Convertible,automatic,wbaun93588vf56134,ca,3.3,96344.0,red,black,12500,13000,2015-07-08 09:30:00-07:00
102673,2003,Mercedes-Benz,E-Class,E500,Sedan,automatic,wdbuf70j23a235692,nc,1.3,146913.0,black,tan,4375,2800,2015-01-13 09:15:00-08:00
16429,2003,Mercedes-Benz,E-Class,E500,Sedan,automatic,wdbuf70j23a235692,nc,1.2,1.0,black,tan,7225,2000,2015-01-13 09:15:00-08:00
280622,2007,Mercedes-Benz,CLS-Class,CLS550,Sedan,automatic,wdddj72x27a080493,mo,2.1,79869.0,gray,black,15400,17000,2015-07-08 07:30:00-07:00


> Several unexpected values are present in the above dataframe: some rows describe an identical sale, but with different values in some of the columns. Some of these appear to be errors: one row has '1.0' recorded for the odometer, while another row for the same vehicle/sale has a number better aligned with what one would expect for a vehicle several years old.

Let's proceed by dropping rows with '1.0' as odometer value:

In [46]:
df2[df2.duplicated(subset = ['vin', 'saledate'], keep= False)].index

Index([  3306,  12625,  14712,  16429,  26380,  28926,  30611,  39281,  40540,
        41485,
       ...
       408840, 411514, 411971, 418486, 441853, 452492, 461611, 466361, 471252,
       472314],
      dtype='int64', length=112)

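In [None]:
# assumed completion of the unfinished cell: index of rows that belong to a
# duplicated sale & record 1.0 as the odometer reading
odo_1 = df2[df2.duplicated(subset = ['vin', 'saledate'], keep= False) & (df2['odometer'] == 1.0)].index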

In [30]:
df2.drop(df2[df2.duplicated(subset = ['vin', 'saledate'], keep= False)].index)

Unnamed: 0,index,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,mmr,sellingprice,saledate
0,0,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg566472,ca,5.0,16639.0,white,black,20500,21500,2014-12-16 12:30:00-08:00
1,1,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg561319,ca,5.0,9393.0,white,beige,20800,21500,2014-12-16 12:30:00-08:00
2,2,2014,BMW,3 Series,328i SULEV,Sedan,automatic,wba3c1c51ek116351,ca,4.5,1331.0,gray,black,31900,30000,2015-01-15 04:30:00-08:00
3,3,2015,Volvo,S60,T5,Sedan,automatic,yv1612tb4f1310987,ca,4.1,14282.0,white,black,27500,27750,2015-01-29 04:30:00-08:00
4,4,2014,BMW,6 Series Gran Coupe,650i,Sedan,automatic,wba6b2c57ed129731,ca,4.3,2641.0,gray,black,66000,67000,2014-12-18 12:30:00-08:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472331,558805,2011,BMW,5 Series,528i,Sedan,automatic,wbafr1c53bc744672,fl,3.9,66403.0,white,brown,20300,22800,2015-07-07 06:15:00-07:00
472332,558807,2012,Ram,2500,Power Wagon,Crew Cab,automatic,3c6td5et6cg112407,wa,5.0,54393.0,white,black,30200,30800,2015-07-08 09:30:00-07:00
472333,558808,2012,BMW,X5,xDrive35d,SUV,automatic,5uxzw0c58cl668465,ca,4.8,50561.0,black,black,29800,34000,2015-07-08 09:30:00-07:00
472334,558809,2015,Nissan,Altima,2.5 S,sedan,automatic,1n4al3ap0fc216050,ga,3.8,16658.0,white,black,15100,11100,2015-07-09 06:45:00-07:00
