# Exploring eBay Car Sales Data
In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

In [1]:
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

| | dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |


Some observations:
- The dataset contains 20 columns, most of which are strings.
- Some columns have null values, but none have more than ~20% null values.
- The column names use camelCase instead of Python's preferred snake_case, so we can't simply replace spaces with underscores.

## Cleaning Column Names

In [3]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We'll convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.

In [4]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
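
The list above is typed by hand because several names are reworded as well as re-cased. For the purely mechanical part of the conversion, a small regex helper could be used instead. This is a minimal sketch: `camel_to_snake` is a hypothetical helper that is not part of the original notebook, and reworded names such as `registration_year` or `ad_created` would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Put an underscore before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name).lower()

# A few of the original column names from the dataset
original_columns = ['dateCrawled', 'offerType', 'vehicleType', 'powerPS', 'nrOfPictures']
converted = [camel_to_snake(col) for col in original_columns]
print(converted)
```

Names that are already lowercase, such as `abtest`, pass through unchanged.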

In [5]:
autos.head()

| | date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |


## Initial Exploration and Cleaning

In [6]:
autos.describe()

| | registration_year | power_ps | registration_month | nr_of_pictures | postal_code |
|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 2005.07328 | 116.35592 | 5.72336 | 0.0 | 50813.6273 |
| std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
| min | 1000.0 | 0.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1999.0 | 70.0 | 3.0 | 0.0 | 30451.0 |
| 50% | 2003.0 | 105.0 | 6.0 | 0.0 | 49577.0 |
| 75% | 2008.0 | 150.0 | 9.0 | 0.0 | 71540.0 |
| max | 9999.0 | 17700.0 | 12.0 | 0.0 | 99998.0 |


- Let's check whether any columns contain only a single value:

In [7]:
print(autos['seller'].unique())
print(autos['offer_type'].unique())
print(autos['unrepaired_damage'].unique())
print(autos['nr_of_pictures'].unique())


['privat' 'gewerblich']
['Angebot' 'Gesuch']
['nein' nan 'ja']
[0]


We can see that **nr_of_pictures** has only one value (0), so it can be dropped.
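
The notebook does not show the drop itself. One way it could be done is sketched below on a small stand-in dataframe; the `sample` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Small stand-in for the autos dataframe (illustrative values only)
sample = pd.DataFrame({
    'seller': ['privat', 'gewerblich', 'privat'],
    'nr_of_pictures': [0, 0, 0],
})

# Columns with at most one distinct value carry no information for the analysis
constant_columns = [col for col in sample.columns if sample[col].nunique(dropna=False) <= 1]

# drop() returns a new dataframe; the original is left untouched
cleaned = sample.drop(columns=constant_columns)
print(constant_columns)
print(cleaned.columns.tolist())
```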

- **price** and **odometer** are currently stored as text. Let's convert them to numeric data:

In [8]:
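#Clean and convert price column
#regex=False makes sure '$' is treated as a literal character, not as a regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False)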
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(int)
autos.rename({'price': 'price_usd'}, axis=1, inplace=True)

#Clean and convert odometer column
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
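
Both conversions follow the same pattern: strip the non-numeric characters, then cast to integer. A minimal sketch of a shared helper is shown below. `strip_to_int` is a hypothetical name, and the approach assumes the column has no missing values, negative numbers, or decimals, which holds for the `price` and `odometer` samples shown earlier.

```python
import pandas as pd

def strip_to_int(series):
    """Remove every non-digit character from a string column, then cast to int."""
    return series.str.replace(r'\D', '', regex=True).astype(int)

# Sample values in the same format as the raw columns
prices = pd.Series(['$5,000', '$8,500', '$1,350'])
odometer = pd.Series(['150,000km', '70,000km'])

print(strip_to_int(prices).tolist())
print(strip_to_int(odometer).tolist())
```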

## Exploring the Odometer and Price Columns

### Odometer

Summary Statistics:

In [9]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [10]:
autos['odometer_km'].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

### Price

Summary Statistics

In [11]:
autos['price_usd'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [12]:
autos['price_usd'].value_counts().sort_index()

0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ... 
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678      

The maximum value (100,000,000) looks unrealistic, so we'll remove it:

In [13]:
autos = autos[autos['price_usd'] < 99999999]
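
The frequency table above also shows 1,421 listings priced at 0, and a jump from 350,000 straight to 999,990 at the top end. A stricter filter could bound both ends with `Series.between`. The sketch below uses illustrative prices, and the bounds of 1 and 350,000 are assumptions read off that table rather than values the notebook applies; using them would change the results of the later sections.

```python
import pandas as pd

# Illustrative prices only, not rows from the dataset
prices = pd.Series([0, 1, 500, 2950, 350000, 999990, 99999999], name='price_usd')

# Hypothetical plausibility bounds; between() is inclusive on both ends
lower, upper = 1, 350000
plausible = prices[prices.between(lower, upper)]

# Share of rows the filter would remove
removed_share = 1 - len(plausible) / len(prices)
print(plausible.tolist())
```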

## Exploring the Date Columns

The **date_crawled**, **last_seen**, and **ad_created** columns are all identified as string values:

In [14]:
autos[['date_crawled','ad_created','last_seen']][0:5]

| | date_crawled | ad_created | last_seen |
|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |


For each column, we'll extract the first 10 characters (which represent the date) and calculate the distribution of values as proportions of the total.

In [15]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025381
2016-03-06    0.013940
2016-03-07    0.035961
2016-03-08    0.033301
2016-03-09    0.033221
2016-03-10    0.032121
2016-03-11    0.032481
2016-03-12    0.036781
2016-03-13    0.015560
2016-03-14    0.036621
2016-03-15    0.033981
2016-03-16    0.029501
2016-03-17    0.031521
2016-03-18    0.013060
2016-03-19    0.034901
2016-03-20    0.037821
2016-03-21    0.037521
2016-03-22    0.032921
2016-03-23    0.032381
2016-03-24    0.029101
2016-03-25    0.031741
2016-03-26    0.032481
2016-03-27    0.031041
2016-03-28    0.034841
2016-03-29    0.034181
2016-03-30    0.033621
2016-03-31    0.031921
2016-04-01    0.033801
2016-04-02    0.035401
2016-04-03    0.038681
2016-04-04    0.036521
2016-04-05    0.013100
2016-04-06    0.003180
2016-04-07    0.001420
Name: date_crawled, dtype: float64
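
Slicing the first 10 characters works because every timestamp has the same fixed layout. An alternative that does not depend on character positions is to parse the strings with `pd.to_datetime` and group by calendar date. This is a sketch on three sample timestamps, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the preview above.

```python
import pandas as pd

# Three sample timestamps in the same format as the date_crawled column
sample = pd.Series(
    ['2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-26 18:57:24'],
    name='date_crawled',
)

# Parse the strings into datetime64 values using an explicit format
parsed = pd.to_datetime(sample, format='%Y-%m-%d %H:%M:%S')

# Reduce each timestamp to its calendar date, then compute each date's share
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```

Parsed columns also make later date arithmetic possible, for example the time between `ad_created` and `last_seen`.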

In [16]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
2016-01-03    0.000020
2016-01-07    0.000020
2016-01-10    0.000040
2016-01-13    0.000020
2016-01-14    0.000020
2016-01-16    0.000020
2016-01-22    0.000020
2016-01-27    0.000060
2016-01-29    0.000020
2016-02-01    0.000020
2016-02-02    0.000040
2016-02-05    0.000040
2016-02-07    0.000020
2016-02-08    0.000020
2016-02-09    0.000040
2016-02-11    0.000020
2016-02-12    0.000060
2016-02-14    0.000040
2016-02-16    0.000020
2016-02-17    0.000020
2016-02-18    0.000040
2016-02-19    0.000060
2016-02-20    0.000040
2016-02-21    0.000060
                ...   
2016-03-09    0.033241
2016-03-10    0.031861
2016-03-11    0.032781
2016-03-12    0.036621
2016-03-13    0.016920
2016-03-14    0.035221
2016-03-15    0.033741
2016-03-16    0.030001
2016-03-17    0.031201
2016-03-18    0.013720
2016-03-19    0.033841
2016-03-20    0.037861
2016-03-21 

In [17]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001080
2016-03-06    0.004420
2016-03-07    0.005360
2016-03-08    0.007600
2016-03-09    0.009860
2016-03-10    0.010760
2016-03-11    0.012520
2016-03-12    0.023820
2016-03-13    0.008980
2016-03-14    0.012800
2016-03-15    0.015880
2016-03-16    0.016440
2016-03-17    0.027921
2016-03-18    0.007420
2016-03-19    0.015740
2016-03-20    0.020700
2016-03-21    0.020740
2016-03-22    0.021580
2016-03-23    0.018580
2016-03-24    0.019560
2016-03-25    0.019200
2016-03-26    0.016960
2016-03-27    0.016020
2016-03-28    0.020860
2016-03-29    0.022340
2016-03-30    0.024840
2016-03-31    0.023840
2016-04-01    0.023100
2016-04-02    0.024900
2016-04-03    0.025361
2016-04-04    0.024620
2016-04-05    0.124282
2016-04-06    0.220984
2016-04-07    0.130923
Name: last_seen, dtype: float64

The ads were crawled over roughly one month, from 5 March to 7 April 2016. A few ads were created earlier, as far back as June 2015.

Now let's examine the **registration_year** column:

In [18]:
autos['registration_year'].describe()

count    49999.000000
mean      2005.073401
std        105.713866
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [19]:
autos['registration_year'].value_counts().sort_index()

1000       1
1001       1
1111       1
1500       1
1800       2
1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      34
1961       6
1962       4
1963       9
        ... 
2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, Length: 97, dtype: int64

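As shown above, the registration year column contains many unrealistic values: years from 1000 to 1800, before cars existed, and years from 2017 onward, even though the data was collected in 2016. Counting the values listed above, 1,972 of the 49,999 rows (about 4%) fall outside the range 1910 to 2016.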

## Dealing with Incorrect Registration Year Data

In [20]:
#Remove unrealistic values
autos = autos[autos['registration_year'].between(1910, 2016)]
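
Before discarding rows, it can help to measure how much data a filter removes. The sketch below does this on a few illustrative years; the `years` series is a stand-in and not taken from the dataset.

```python
import pandas as pd

# Illustrative registration years, including values outside the plausible range
years = pd.Series([1000, 1910, 1999, 2005, 2016, 2017, 9999], name='registration_year')

# between() is inclusive on both ends, so 1910 and 2016 are kept
valid = years.between(1910, 2016)

# The mean of a boolean mask is the share of True values
share_removed = (~valid).mean()
kept = years[valid]
print(kept.tolist())
```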

In [21]:
autos['registration_year'].value_counts(normalize=True).sort_index()

1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
1937    0.000083
1938    0.000021
1939    0.000021
1941    0.000042
1943    0.000021
1948    0.000021
1950    0.000062
1951    0.000042
1952    0.000021
1953    0.000021
1954    0.000042
1955    0.000042
1956    0.000104
1957    0.000042
1958    0.000083
1959    0.000146
1960    0.000708
1961    0.000125
1962    0.000083
1963    0.000187
1964    0.000250
1965    0.000354
1966    0.000458
1967    0.000562
1968    0.000541
          ...   
1987    0.001562
1988    0.002957
1989    0.003769
1990    0.008225
1991    0.007412
1992    0.008141
1993    0.009266
1994    0.013742
1995    0.027339
1996    0.030066
1997    0.042226
1998    0.051075
1999    0.062444
2000    0.069836
2001    0.056281
2002    0.052741
2003    0.056781
2004    0.056989
2005    0.062777
2006    0.056385
2007    0.047973
2008    0.046453
2009    0.043684
2010    0.033252
2011    0.034023
2012    0.027547
2013    0.016782
2014    0.0138

## Exploring Price by Brand

In [35]:
autos['brand'].value_counts(normalize=True)

volkswagen        0.212131
bmw               0.110021
opel              0.108168
mercedes_benz     0.095342
audi              0.086389
ford              0.069794
renault           0.047348
peugeot           0.029525
fiat              0.025860
seat              0.018177
skoda             0.016033
mazda             0.015137
nissan            0.015096
citroen           0.013930
smart             0.013909
toyota            0.012472
sonstige_autos    0.010952
hyundai           0.009849
volvo             0.009245
mini              0.008641
mitsubishi        0.008141
honda             0.007850
kia               0.007100
alfa_romeo        0.006621
porsche           0.006101
suzuki            0.005913
chevrolet         0.005705
chrysler          0.003665
daihatsu          0.002561
dacia             0.002561
jeep              0.002249
subaru            0.002186
land_rover        0.002041
saab              0.001603
jaguar            0.001582
trabant           0.001562
daewoo            0.001499
r

We'll limit the analysis to popular brands, meaning those that account for more than 5% of all listings:

In [36]:
brand_counts = autos['brand'].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


Calculate the mean price for each of the 6 brands:

In [37]:
brand_mean_price = {}
for brand in common_brands:
    brand_mean_price[brand] = autos.loc[autos['brand'] == brand, 'price_usd'].mean()
print(brand_mean_price)

{'opel': 5252.61655437921, 'mercedes_benz': 8485.239571958942, 'ford': 7263.015811455847, 'bmw': 8334.645155185466, 'volkswagen': 6516.457597173145, 'audi': 9093.65003615329}
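
The loop above can also be written as a single `groupby` aggregation. This is a sketch on a small stand-in dataframe; the `sample` rows and the two-brand `common_brands` list are illustrative assumptions.

```python
import pandas as pd

# Small stand-in for the autos dataframe (illustrative values only)
sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'saab'],
    'price_usd': [8000, 9000, 5000, 5500, 3000],
})
common_brands = ['bmw', 'opel']  # stand-in for the index computed in the previous cell

# Keep only the common brands, then average the price within each brand
mean_price = (
    sample[sample['brand'].isin(common_brands)]
    .groupby('brand')['price_usd']
    .mean()
)
print(mean_price.to_dict())
```

The result is a Series indexed by brand, so it can be used directly as a dataframe column without converting from a dictionary.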


## Exploring Mileage and Storing Aggregate Data

Calculating the mean mileage for each brand:

In [39]:
brand_mean_mileage = {}
for brand in common_brands:
    brand_mean_mileage[brand] = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
print(brand_mean_mileage)

{'opel': 129227.14148219442, 'mercedes_benz': 130856.0821139987, 'ford': 124046.83770883054, 'bmw': 132434.70855412565, 'volkswagen': 128730.36906164115, 'audi': 129287.78018799711}


Creating a dataframe for aggregate data:

In [41]:
#Convert the two dictionaries to series
bmp = pd.Series(brand_mean_price)
bmm = pd.Series(brand_mean_mileage)

#Create dataframe
top_six = pd.DataFrame(bmp, columns = ['mean_price'])
top_six['mean_mileage'] = bmm
top_six

| | mean_price | mean_mileage |
|---|---|---|
| audi | 9093.650036 | 129287.780188 |
| bmw | 8334.645155 | 132434.708554 |
| ford | 7263.015811 | 124046.837709 |
| mercedes_benz | 8485.239572 | 130856.082114 |
| opel | 5252.616554 | 129227.141482 |
| volkswagen | 6516.457597 | 128730.369062 |
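
The same table can be built directly from the two dictionaries, without the intermediate series, and sorted so that the price ranking is easier to read. In this sketch the dictionary values are the outputs printed above, rounded to two decimals.

```python
import pandas as pd

# Values rounded from the outputs printed in the cells above
brand_mean_price = {
    'opel': 5252.62, 'mercedes_benz': 8485.24, 'ford': 7263.02,
    'bmw': 8334.65, 'volkswagen': 6516.46, 'audi': 9093.65,
}
brand_mean_mileage = {
    'opel': 129227.14, 'mercedes_benz': 130856.08, 'ford': 124046.84,
    'bmw': 132434.71, 'volkswagen': 128730.37, 'audi': 129287.78,
}

# A dict of dicts maps outer keys to columns and inner keys to the row index
top_six = pd.DataFrame({'mean_price': brand_mean_price, 'mean_mileage': brand_mean_mileage})

# Sort from most to least expensive brand
top_six = top_six.sort_values('mean_price', ascending=False)
print(top_six)
```

Sorted this way, mean prices range from about 5,253 (opel) to 9,094 (audi), while mean mileage stays between roughly 124,000 and 132,000 km for all six brands.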
