# eBay Car Sales Data
### A guided project to explore eBay sales data - Dataquest.io

This project explores a sample of 50,000 car sales scraped from eBay. We will use NumPy and pandas to clean and prepare the data for basic analysis.

### Import and view the data

In [2]:
import numpy as np
import pandas as pd

#get the data
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

In [3]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [4]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

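The output shows that many categorical values are in German (for example privat, Angebot, manuell) rather than English. Five columns (vehicleType, gearbox, model, fuelType and notRepairedDamage) contain null values that will need to be handled. Most columns are stored as strings (dtype object); only five are int64.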

### Rename the columns to be more descriptive
Python convention is snake_case (words separated by _) rather than camel case, so the columns are renamed accordingly. A few are also given names that describe their contents better.

In [5]:

#A function that accepts a column name and outputs my preferred name
def cleanColNames(inStr) :
    inStr = inStr.replace("yearOfRegistration", "registration_year")
    inStr = inStr.replace("monthOfRegistration", "registration_month")
    inStr = inStr.replace("notRepairedDamage", "unprepaired_damage")
    inStr = inStr.replace("dateCreated", "ad_created")
    inStr = inStr.replace("dateCrawled", "date_crawled")
    inStr = inStr.replace("offerType", "offer_type")
    inStr = inStr.replace("vehicleType", "vehicle_type")
    inStr = inStr.replace("powerPS", "power_ps")
    inStr = inStr.replace("fuelType", "fuel_type")
    inStr = inStr.replace("nrOfPictures", "num_pictures")
    inStr = inStr.replace("postalCode", "postal_code")
    inStr = inStr.replace("lastSeen", "last_seen")
    return inStr

cleanedCols = []

#for each column in the dataset, get the cleaned name
for column in autos.columns :
    column = cleanColNames(column)
    cleanedCols.append(column)
    
#rename the dataframe columns to the newly created cleaned names
autos.columns = cleanedCols
  

#Print to verify changes
autos.info()
    
    

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price                 50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unprepaired_damage    40171 non-null object
ad_created            50000 non-null object
num_pictures          50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(5)
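
The chain of replace calls can also be written as a generic camelCase-to-snake_case converter plus a small table of hand-picked overrides. The sketch below is an alternative, not the code used in this notebook; `camel_to_snake` and `OVERRIDES` are hypothetical names, and the overrides cover only a few of the custom names chosen above.

```python
import re

# Hand-picked names that a mechanical conversion would not produce
# (hypothetical helper table; extend as needed).
OVERRIDES = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "dateCreated": "ad_created",
    "nrOfPictures": "num_pictures",
}

def camel_to_snake(name):
    """Return the override for `name` if one exists, else a snake_case version."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    # Insert "_" before any capital letter that follows a lowercase letter or digit.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

columns = ["dateCrawled", "offerType", "powerPS", "yearOfRegistration", "brand"]
print([camel_to_snake(c) for c in columns])
```

With a DataFrame, a function like this could be passed directly as `autos.rename(columns=camel_to_snake)`, which avoids building the intermediate list by hand.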

### A first pass at data exploration
This first pass helps identify which columns are useful, which can be dropped, and which need to be converted to a more suitable type.


In [6]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unprepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 16:37:21,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


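The num_pictures column is all zeros and can be dropped. The seller and offer_type columns are also almost constant (49,999 of 50,000 rows share one value).
The price and odometer columns have no numeric summary because they are stored as strings; they need to be converted to integers.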


In [7]:
#remove the num_pictures field (left commented out, so the column is kept)
#autos = autos.drop(columns='num_pictures')


In [8]:
#investigate the price field to see why it isn't showing up
autos["price"].describe()

count     50000
unique     2357
top          $0
freq       1421
Name: price, dtype: object

In [9]:
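#remove the $ and the thousands separators from the price field, then cast to int
#regex=False treats "$" as a literal character rather than a regex end-of-string anchor
autos['price'] = autos['price'].str.replace("$", '', regex=False)
autos['price'] = autos['price'].str.replace(",", '', regex=False)
autos['price'] = autos['price'].astype(int)
autos['price'].describe()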

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
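
As an illustration of the string-to-integer conversion, the sketch below strips the unwanted characters in a single pass with an explicit regular expression. The toy Series are made up for the example (they only mimic the raw format), and `price_int` and `odometer_km` are hypothetical names.

```python
import pandas as pd

# Toy values in the same format as the raw columns.
price = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

# Inside a character class "$" is literal, so [$,] matches "$" or ",".
price_int = price.str.replace(r"[$,]", "", regex=True).astype(int)

# Remove either a comma or the "km" suffix.
odometer_km = odometer.str.replace(r",|km", "", regex=True).astype(int)

print(price_int.tolist())
print(odometer_km.tolist())
```

Passing `regex` explicitly avoids relying on the library default, which has differed between pandas versions.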

In [10]:
#Investigate the odometer field for the same
autos['odometer'].describe()

count         50000
unique           13
top       150,000km
freq          32424
Name: odometer, dtype: object

In [11]:
autos['odometer'] = autos['odometer'].str.replace("km", "")
autos['odometer'] = autos['odometer'].str.replace(",", "")
autos['odometer'] = autos['odometer'].astype(int)
autos.rename(columns = {'odometer' : 'odometer_km'}, inplace = True)
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Take a closer look at the price and odometer fields to see what cleaning needs to be done here. 


In [12]:
autos['price'].value_counts().sort_index(ascending=True)

0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ... 
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678      

In [13]:
#Remove any rows that have an outlier sales price
#between() is inclusive on both ends, so this keeps prices from 0 to 999,000;
#the $0 listings stay in the data (use between(1, 999000) to drop them as well)
autos = autos.loc[autos['price'].between(0, 999000), :]
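
Series.between includes both endpoints by default, which matters for the lower bound of the price filter. A minimal sketch on made-up prices (the values and the names `kept_default` and `kept_positive` are illustrative only):

```python
import pandas as pd

prices = pd.Series([0, 1, 500, 999000, 999990])

# Default bounds are inclusive on both ends, so 0 and 999000 are kept.
kept_default = prices[prices.between(0, 999000)]

# Raising the lower bound to 1 also drops the zero-priced rows.
kept_positive = prices[prices.between(1, 999000)]

print(kept_default.tolist())
print(kept_positive.tolist())
```

Which lower bound is appropriate depends on whether $0 listings are treated as genuine offers or as missing prices.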

In [14]:
autos['odometer_km'].value_counts().sort_index(ascending=True)

5000        966
10000       264
20000       784
30000       789
40000       818
50000      1025
60000      1164
70000      1230
80000      1436
90000      1757
100000     2168
125000     5169
150000    32416
Name: odometer_km, dtype: int64

The odometer values fall into 13 rounded bins from 5,000 to 150,000 km, all of which are plausible. There are no outliers to remove here.


### Fixing the date stamps
Three of the date fields are stored as strings. We trim them to the date portion so they can be compared and counted by day.

In [15]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


We can see that the first ten characters of every value are the date in YYYY-MM-DD form.
By building a distribution of these dates we can look for trends.

In [17]:
#Date Crawled Distribution
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025387
2016-03-06    0.013944
2016-03-07    0.035970
2016-03-08    0.033269
2016-03-09    0.033209
2016-03-10    0.032129
2016-03-11    0.032489
2016-03-12    0.036770
2016-03-13    0.015564
2016-03-14    0.036630
2016-03-15    0.033990
2016-03-16    0.029508
2016-03-17    0.031509
2016-03-18    0.013064
2016-03-19    0.034910
2016-03-20    0.037831
2016-03-21    0.037490
2016-03-22    0.032909
2016-03-23    0.032389
2016-03-24    0.029108
2016-03-25    0.031749
2016-03-26    0.032489
2016-03-27    0.031049
2016-03-28    0.034850
2016-03-29    0.034150
2016-03-30    0.033629
2016-03-31    0.031909
2016-04-01    0.033809
2016-04-02    0.035410
2016-04-03    0.038691
2016-04-04    0.036490
2016-04-05    0.013104
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled, dtype: float64

The crawl ran on consecutive days from 2016-03-05 to 2016-04-07. No single day accounts for more than 4% of the listings.
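
Slicing the first ten characters works because the strings are consistently formatted. An alternative, sketched here on a few sample timestamps copied from the table above, is to parse the strings into real datetimes first; the names `parsed` and `per_day` are hypothetical.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# Parse the strings into datetime64 values using an explicit format.
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Share of rows per calendar day, sorted chronologically.
per_day = (
    parsed.dt.strftime("%Y-%m-%d")
          .value_counts(normalize=True)
          .sort_index()
)
print(per_day)
```

Parsed datetimes also allow date arithmetic, such as the time between ad_created and last_seen, which plain string slices do not.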


In [18]:
#Ad created Distribution
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()


2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
2016-01-03    0.000020
2016-01-07    0.000020
2016-01-10    0.000040
2016-01-13    0.000020
2016-01-14    0.000020
2016-01-16    0.000020
2016-01-22    0.000020
2016-01-27    0.000060
2016-01-29    0.000020
2016-02-01    0.000020
2016-02-02    0.000040
2016-02-05    0.000040
2016-02-07    0.000020
2016-02-08    0.000020
2016-02-09    0.000040
2016-02-11    0.000020
2016-02-12    0.000060
2016-02-14    0.000040
2016-02-16    0.000020
2016-02-17    0.000020
2016-02-18    0.000040
2016-02-19    0.000060
2016-02-20    0.000040
2016-02-21    0.000060
                ...   
2016-03-09    0.033229
2016-03-10    0.031869
2016-03-11    0.032789
2016-03-12    0.036610
2016-03-13    0.016925
2016-03-14    0.035230
2016-03-15    0.033749
2016-03-16    0.030008
2016-03-17    0.031189
2016-03-18    0.013724
2016-03-19    0.033849
2016-03-20    0.037871
2016-03-21 

Most ads were created on the same dates the site was crawled, although some were created before the crawler started collecting data. The earliest ad in the sample was created on 2015-06-11.


In [19]:
#Last seen distribution
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen, dtype: float64

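The last-seen dates span the same range as the crawl dates (2016-03-05 to 2016-04-07). About 48% of listings were last seen in the final three days, which most likely reflects the end of the crawl rather than a jump in sales.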

In [23]:
#Registration Year distribution
autos['registration_year'].describe()

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

We can observe that the mean year of registration for these cars is 2005 (roughly 11 years old at the time of the crawl in 2016). The minimum of 1000 and the maximum of 9999 are clearly data-entry errors.

For further data cleaning, let's exclude registration years of 1900 or earlier (cars were barely around then) and years later than 2016 (after the data was crawled).

In [54]:
#boolean mask for plausible years: after 1900 and up to 2016 (autos itself is not filtered)
registrationYears = (autos['registration_year'] > 1900) & (autos['registration_year'] <= 2016)
#count the listings per year within that range
autos.loc[registrationYears, 'registration_year'].value_counts().sort_index()

1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      33
1961       6
1962       4
1963       9
1964      12
1965      17
1966      22
1967      27
1968      26
        ... 
1987      75
1988     142
1989     181
1990     395
1991     356
1992     390
1993     445
1994     660
1995    1312
1996    1444
1997    2028
1998    2453
1999    2998
2000    3354
2001    2702
2002    2533
2003    2727
2004    2737
2005    3015
2006    2707
2007    2304
2008    2231
2009    2097
2010    1597
2011    1634
2012    1323
2013     806
2014     665
2015     399
2016    1316
Name: registration_year, Length: 78, dtype: int64

There are some outlier ads for cars that are old or very old, but most listings cluster in registration years 1995 to 2010, i.e. cars roughly 6 to 21 years old at the time of the crawl. We can assume that the most common age of a car for sale on eBay falls in this bracket.
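
The year mask can also report how much data falls outside the plausible range and give the distribution as shares instead of raw counts. A minimal sketch on made-up years, using the same condition as the cell above (`plausible`, `share_dropped` and `shares` are hypothetical names):

```python
import pandas as pd

years = pd.Series([1000, 1910, 2003, 2016, 2017, 9999], name="registration_year")

# Same condition as the notebook's mask: strictly after 1900, up to and including 2016.
plausible = (years > 1900) & (years <= 2016)

# Fraction of rows that the filter would discard.
share_dropped = (~plausible).mean()

# Year distribution as shares of the remaining rows.
shares = years[plausible].value_counts(normalize=True).sort_index()

print(share_dropped)
print(shares)
```

To make the filter permanent, the mask would have to be assigned back, for example `autos = autos[plausible_mask]` with the mask computed on the full frame.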

## Parse through brands to look for trends by car manufacturer


In [78]:
#Find the top 25 unique brands
topBrands = autos['brand'].value_counts()[:25]
topBrands

volkswagen        10684
opel               5460
bmw                5428
mercedes_benz      4733
audi               4283
ford               3477
renault            2404
peugeot            1456
fiat               1307
seat                941
skoda               786
mazda               757
nissan              754
smart               701
citroen             700
toyota              617
sonstige_autos      543
hyundai             488
volvo               456
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
Name: brand, dtype: int64

In [77]:
#Explore the prices by top brands
#create a dictionary for each brand/mean price
brandsMeanPrice = {}
for brand in topBrands.index :
    #select all rows with that brand
    selected_rows = autos[autos["brand"] == brand]
    mean = selected_rows["price"].mean()
    brandsMeanPrice[brand] = mean.round()
    
brandsMeanPrice

{'alfa_romeo': 3944.0,
 'audi': 8966.0,
 'bmw': 8027.0,
 'citroen': 3687.0,
 'fiat': 2698.0,
 'ford': 3627.0,
 'honda': 3890.0,
 'hyundai': 5317.0,
 'kia': 5707.0,
 'mazda': 3963.0,
 'mercedes_benz': 8390.0,
 'mini': 10392.0,
 'mitsubishi': 3314.0,
 'nissan': 4589.0,
 'opel': 2846.0,
 'peugeot': 3011.0,
 'porsche': 44538.0,
 'renault': 2351.0,
 'seat': 4219.0,
 'skoda': 6305.0,
 'smart': 3483.0,
 'sonstige_autos': 10538.0,
 'toyota': 5098.0,
 'volkswagen': 5159.0,
 'volvo': 4686.0}

By aggregating the brands and evaluating the mean price for each one, we can observe some trends. Firstly, the brand with the highest average price is Porsche, at $44,538, although there are only 294 Porsches in the dataset.

The five most expensive brands are Porsche, Sonstige Autos (German for "other cars"), Mini, Audi and Mercedes-Benz.

The five least expensive brands are Renault, Fiat, Opel, Peugeot and Mitsubishi.

Volkswagen, which had the most cars in the dataset by far, had a mean price of $5,159.
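
The per-brand loop can be expressed more compactly with groupby. The sketch below uses a made-up toy frame rather than the real data; `toy`, `top` and `mean_price` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat", "fiat", "fiat", "audi"],
    "price": [8000, 9000, 2000, 3000, 2500, 9000],
})

# Brands ordered by listing count; keep the two most common ones.
top = toy["brand"].value_counts().index[:2]

# Mean price per brand in one pass, restricted to the top brands.
mean_price = (
    toy[toy["brand"].isin(top)]
    .groupby("brand")["price"]
    .mean()
    .round()
    .sort_values(ascending=False)
)
print(mean_price)
```

Sorting the result makes the most and least expensive brands easy to read off, which an unsorted dictionary does not.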


## Comparing brands with their mileage to identify mileage v price

In [84]:
#Convert the brandsMeanPrice dict to a series
bmp_series = pd.Series(brandsMeanPrice)
print(bmp_series)


alfa_romeo         3944.0
audi               8966.0
bmw                8027.0
citroen            3687.0
fiat               2698.0
ford               3627.0
honda              3890.0
hyundai            5317.0
kia                5707.0
mazda              3963.0
mercedes_benz      8390.0
mini              10392.0
mitsubishi         3314.0
nissan             4589.0
opel               2846.0
peugeot            3011.0
porsche           44538.0
renault            2351.0
seat               4219.0
skoda              6305.0
smart              3483.0
sonstige_autos    10538.0
toyota             5098.0
volkswagen         5159.0
volvo              4686.0
dtype: float64


In [85]:
#Get the mean odometer reading (in km) for each of the brands
#create a dictionary for each brand/mean km
brandsMeanMileage = {}
for brand in topBrands.index :
    #select all rows with that brand
    selected_rows = autos[autos["brand"] == brand]
    mean = selected_rows["odometer_km"].mean()
    brandsMeanMileage[brand] = mean.round()
    
brandsMeanMileage

{'alfa_romeo': 131109.0,
 'audi': 129644.0,
 'bmw': 132518.0,
 'citroen': 119879.0,
 'fiat': 117012.0,
 'ford': 124153.0,
 'honda': 123709.0,
 'hyundai': 106783.0,
 'kia': 112640.0,
 'mazda': 125132.0,
 'mercedes_benz': 130882.0,
 'mini': 89375.0,
 'mitsubishi': 126293.0,
 'nissan': 118979.0,
 'opel': 129295.0,
 'peugeot': 127352.0,
 'porsche': 97364.0,
 'renault': 128224.0,
 'seat': 122062.0,
 'skoda': 110948.0,
 'smart': 100756.0,
 'sonstige_autos': 87385.0,
 'toyota': 115989.0,
 'volkswagen': 128949.0,
 'volvo': 138607.0}

In [86]:
#convert it to a series
bmodo = pd.Series(brandsMeanMileage)
print(bmodo)

alfa_romeo        131109.0
audi              129644.0
bmw               132518.0
citroen           119879.0
fiat              117012.0
ford              124153.0
honda             123709.0
hyundai           106783.0
kia               112640.0
mazda             125132.0
mercedes_benz     130882.0
mini               89375.0
mitsubishi        126293.0
nissan            118979.0
opel              129295.0
peugeot           127352.0
porsche            97364.0
renault           128224.0
seat              122062.0
skoda             110948.0
smart             100756.0
sonstige_autos     87385.0
toyota            115989.0
volkswagen        128949.0
volvo             138607.0
dtype: float64


In [90]:
#Make them into a dataframe
makeMiles = pd.DataFrame(bmp_series, columns=["mean_price"])
makeMiles.head(5)

Unnamed: 0,mean_price
alfa_romeo,3944.0
audi,8966.0
bmw,8027.0
citroen,3687.0
fiat,2698.0


In [93]:
#Add the odometer to the df
makeMiles.insert(1, 'mean_mileage', bmodo)

makeMiles

Unnamed: 0,mean_price,mean_mileage
alfa_romeo,3944.0,131109.0
audi,8966.0,129644.0
bmw,8027.0,132518.0
citroen,3687.0,119879.0
fiat,2698.0,117012.0
ford,3627.0,124153.0
honda,3890.0,123709.0
hyundai,5317.0,106783.0
kia,5707.0,112640.0
mazda,3963.0,125132.0


In [95]:
#Add a column for average price per km on the odometer
makeMiles['dollars_per_km'] = makeMiles['mean_price'] / makeMiles['mean_mileage']

makeMiles

Unnamed: 0,mean_price,mean_mileage,dollars_per_km
alfa_romeo,3944.0,131109.0,0.030082
audi,8966.0,129644.0,0.069159
bmw,8027.0,132518.0,0.060573
citroen,3687.0,119879.0,0.030756
fiat,2698.0,117012.0,0.023057
ford,3627.0,124153.0,0.029214
honda,3890.0,123709.0,0.031445
hyundai,5317.0,106783.0,0.049793
kia,5707.0,112640.0,0.050666
mazda,3963.0,125132.0,0.031671


We can see that some brands seem to hold their value quite well. Used Porsches sell at a mean of roughly 46 cents per km on the odometer (44,538 / 97,364). Cheaper brands sell for much less: Renault comes to about 1.8 cents per km (2,351 / 128,224).

This analysis suggests that make has a significant impact on the value of a used car. 
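
The price-per-km figures can be checked with a short sketch that rebuilds the summary for three brands from the rounded means printed earlier. Building the frame from a dictionary of Series is one possible construction, not necessarily the one used in the cells above; `summary` is a hypothetical name.

```python
import pandas as pd

# Rounded means copied from the outputs earlier in this notebook.
mean_price = pd.Series({"porsche": 44538.0, "mini": 10392.0, "renault": 2351.0})
mean_mileage = pd.Series({"porsche": 97364.0, "mini": 89375.0, "renault": 128224.0})

# A dict of Series aligns both columns on the shared brand index.
summary = pd.DataFrame({"mean_price": mean_price, "mean_mileage": mean_mileage})
summary["dollars_per_km"] = summary["mean_price"] / summary["mean_mileage"]

print(summary.sort_values("dollars_per_km", ascending=False).round(3))
```

Because these are ratios of brand-level means, they describe the typical listing for a brand rather than how any individual car loses value with distance driven.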