# eBay Car Sales Data

The goal of this project is to analyze data on 50,000 used cars from eBay-Kleinanzeigen (German eBay), crawled in 2016. We will examine the most popular brands listed on German eBay, along with their mileage, age, and price.

The data dictionary provided with the dataset is as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle Type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- odometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler last saw this ad online.

The dataset was downloaded from https://dataquest.io


In [79]:
# Import pandas and NumPy libraries and read the file
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [80]:
# Summary of the data
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [81]:
# First few entries
autos.head(5)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The dataset consists of 20 columns, most of which are stored as strings. Some columns (vehicleType, gearbox, model, etc.) have null values, and the column names use camelCase instead of snake_case.

Let's begin by renaming the columns.

In [82]:
print(autos.columns)

import re # import regex
names = []
for name in autos.columns:
    name = re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower() # change to snakecase
    names.append(name)
print(names)
autos.columns = names

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'year_of_registration', 'gearbox', 'power_p_s', 'model', 'odometer', 'month_of_registration', 'fuel_type', 'brand', 'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code', 'last_seen']
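
The regex used above inserts an underscore before every capital letter, which is why `powerPS` becomes `power_p_s`. As a minimal sketch of one possible alternative, assuming runs of capitals should stay together, the pattern below only splits where a lowercase letter or digit is followed by a capital. The helper name `to_snake_case` is hypothetical and not part of the original notebook.

```python
import re

def to_snake_case(name):
    """Convert camelCase to snake_case, keeping runs of capitals together."""
    # Insert "_" only where a lowercase letter or digit is followed by a capital
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

for original in ['dateCrawled', 'yearOfRegistration', 'powerPS', 'abtest']:
    print(original, '->', to_snake_case(original))
```

This pattern does not separate an acronym from a word that follows it (for example `PSValue`), which does not occur among this dataset's column names. The rest of the notebook keeps the `power_p_s` name produced by the original regex.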


In [83]:
# Rename several columns in a single call
autos.rename(columns={
    "year_of_registration": "registration_year",
    "month_of_registration": "registration_month",
    "not_repaired_damage": "unrepaired_damage",
    "date_created": "ad_created",
}, inplace=True)

print(autos.columns)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


In [84]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Data Cleaning

We are going to do basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
- Text columns where all or almost all values are the same (don't have useful information for analysis).
- Examples of numeric data stored as text which can be cleaned and converted.
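
The first check can also be done programmatically. The sketch below, run on a small made-up DataFrame rather than the real data, computes for each column the share of rows taken by its most frequent value and flags columns above a threshold. The helper name `dominant_share` and the 0.8 threshold are assumptions chosen for illustration.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the share of rows taken by the most frequent value."""
    # value_counts sorts in descending order, so the first entry is the largest share
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy data: 'seller' is almost constant, 'brand' is evenly split
toy = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'brand': ['bmw', 'opel'] * 5,
})

shares = dominant_share(toy)
near_constant = shares[shares > 0.8].index.tolist()
print(shares)
print(near_constant)
```

On the real dataset a much stricter threshold (for example 0.99) would probably be more appropriate.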

In [85]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Several columns have only 2 unique values, so we need to explore them further.

In [86]:
print(autos['seller'].value_counts(dropna=False))
# Almost all ads come from private sellers, the column can be dropped.

privat        49999
gewerblich        1
Name: seller, dtype: int64


In [87]:
print(autos['offer_type'].value_counts(dropna=False))
# Almost all entries are offers from sellers, the column can be dropped.

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64


In [88]:
print(autos['abtest'].value_counts(dropna=False))

test       25756
control    24244
Name: abtest, dtype: int64


In [89]:
print(autos['nr_of_pictures'].describe())
# Every value is 0, so the column carries no information and can be dropped.

count    50000.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
Name: nr_of_pictures, dtype: float64


In [90]:
autos_c_0 = autos.drop(labels=['seller', 'offer_type', 'nr_of_pictures'], axis=1)
autos_c_0

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,35683,2016-04-05 16:45:07


Now let's check if any numeric values are stored as strings.

In [91]:
autos_c_0.dtypes

date_crawled          object
name                  object
price                 object
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_p_s              int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
postal_code            int64
last_seen             object
dtype: object

Columns containing numeric data stored as text (price, odometer) need to be converted to integers.
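
Because both columns need the same kind of cleanup, the conversion could be wrapped in a small helper. The following is a sketch under the assumption that every value contains only digits plus formatting characters such as `$`, `,` and `km`. The function name `to_int` is hypothetical, and the sample values are illustrative.

```python
import pandas as pd

def to_int(series):
    """Strip every non-digit character from a string Series and convert to int."""
    cleaned = series.str.replace(r'[^0-9]', '', regex=True)
    return pd.to_numeric(cleaned).astype(int)

sample = pd.Series(['$5,000', '$0', '150,000km'])
converted = to_int(sample)
print(converted)
```

Note that this pattern would also remove a minus sign or a decimal point, so it is only suitable for non-negative whole numbers like the ones in this dataset.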

In [92]:
print(autos_c_0['price'].value_counts(dropna=False))

$0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ... 
$414          1
$79,933       1
$5,198        1
$18,890       1
$16,995       1
Name: price, Length: 2357, dtype: int64


In [93]:
autos_c_0['price'] = autos_c_0['price'].str.replace("$", "", regex=False) # Remove all symbols
autos_c_0['price'] = autos_c_0['price'].str.replace(",", "", regex=False)
autos_c_0['price'] = autos_c_0['price'].astype(int) # Convert to integer
autos_c_0['price']

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int64

In [94]:
print(autos_c_0['odometer'].value_counts(dropna=False))

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64


In [95]:
autos_c_0['odometer'] = autos_c_0['odometer'].str.replace("km", "", regex=False) # Remove all symbols
autos_c_0['odometer'] = autos_c_0['odometer'].str.replace(",", "", regex=False)
autos_c_0['odometer'] = autos_c_0['odometer'].astype(int) # Convert to integer
autos_c_0['odometer']

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer, Length: 50000, dtype: int64

In [96]:
autos_c_0.rename({"odometer" : "odometer_km"}, axis=1, inplace=True) # Rename odometer column to odometer_km

### Removing outliers
We will explore the odometer and price data, as well as the date columns, using their minimum and maximum values, and look for any values that are unrealistically high or low (outliers) and that we might want to remove.

In [97]:
autos_c_0['price'].unique().shape

(2357,)

In [98]:
autos_c_0['price'].describe()
# A few entries with very high prices inflate the mean and the
# standard deviation and skew the distribution

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
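
To illustrate why a handful of extreme prices matters, here is a small sketch on made-up numbers (not taken from the dataset). A single very large value shifts the mean and the standard deviation by orders of magnitude, while the median barely moves.

```python
import pandas as pd

typical = pd.Series([1000, 2000, 3000, 4000, 5000])
# Add one extreme listing price to the same data
with_outlier = pd.concat([typical, pd.Series([100_000_000])], ignore_index=True)

for label, prices in [('typical', typical), ('with outlier', with_outlier)]:
    print(label, 'mean:', prices.mean(), 'median:', prices.median(), 'std:', prices.std())
```

This is why the median (the 50% row) is a more reliable guide than the mean when reading the `describe()` output above.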

In [99]:
autos_c_0['price'].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

In [100]:
autos_c_0[autos_c_0['price'] > 500000].head(10)
# Most of these prices look like placeholders (e.g. 11111111, 12345678) rather than real asking prices

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
514,2016-03-17 09:53:08,Ford_Focus_Turnier_1.6_16V_Style,999999,test,kombi,2009,manuell,101,focus,125000,4,benzin,ford,nein,2016-03-17 00:00:00,12205,2016-04-06 07:17:35
2897,2016-03-12 21:50:57,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,11111111,test,limousine,1973,manuell,48,escort,50000,3,benzin,ford,nein,2016-03-12 00:00:00,94469,2016-03-12 22:45:27
7814,2016-04-04 11:53:31,Ferrari_F40,1300000,control,coupe,1992,,0,,50000,12,,sonstige_autos,nein,2016-04-04 00:00:00,60598,2016-04-05 11:34:11
11137,2016-03-29 23:52:57,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,10000000,control,coupe,1960,manuell,368,,100000,1,benzin,sonstige_autos,nein,2016-03-29 00:00:00,73033,2016-04-06 21:18:11
22947,2016-03-22 12:54:19,Bmw_530d_zum_ausschlachten,1234566,control,kombi,1999,automatik,190,,150000,2,diesel,bmw,,2016-03-22 00:00:00,17454,2016-04-02 03:17:32
24384,2016-03-21 13:57:51,Schlachte_Golf_3_gt_tdi,11111111,test,,1995,,0,,150000,0,,volkswagen,,2016-03-21 00:00:00,18519,2016-03-21 14:40:18
27371,2016-03-09 15:45:47,Fiat_Punto,12345678,control,,2017,,95,punto,150000,0,,fiat,,2016-03-09 00:00:00,96110,2016-03-09 15:45:47
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,999990,test,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,50997,2016-03-29 11:38:54
39377,2016-03-08 23:53:51,Tausche_volvo_v40_gegen_van,12345678,control,,2018,manuell,95,v40,150000,6,,volvo,nein,2016-03-08 00:00:00,14542,2016-04-06 23:17:31
39705,2016-03-22 14:58:27,Tausch_gegen_gleichwertiges,99999999,control,limousine,1999,automatik,224,s_klasse,150000,9,benzin,mercedes_benz,,2016-03-22 00:00:00,73525,2016-04-06 05:15:30


In [101]:
autos_c_0[autos_c_0['price'] < 100].shape
# Count the entries priced below 100; these will be removed as well

(1762, 17)

In [102]:
autos_c_1 = autos_c_0[autos_c_0['price'].between(100, 500000)] # Remove entries with very high and very low prices

In [103]:
# Explore the cleaned dataset
autos_c_1['price'].describe()

count     48224.000000
mean       5930.371433
std        9078.372762
min         100.000000
25%        1250.000000
50%        3000.000000
75%        7499.000000
max      350000.000000
Name: price, dtype: float64

After removing the price outliers, the distribution looks reasonable.
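
`Series.between` is inclusive at both ends by default, so listings priced at exactly 100 or 500,000 are kept by the filter. A minimal sketch on made-up prices shows which values survive and how many are dropped.

```python
import pandas as pd

prices = pd.Series([0, 1, 99, 100, 5000, 350000, 500000, 999999])

# True for values inside the closed interval [100, 500000]
mask = prices.between(100, 500000)
kept = prices[mask]
dropped = int((~mask).sum())

print(kept.tolist())
print('dropped:', dropped)
```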

In [105]:
autos_c_1['odometer_km'].describe()

count     48224.000000
mean     125919.148142
std       39543.339640
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [106]:
autos_c_1['odometer_km'].value_counts()

150000    31212
125000     5037
100000     2101
90000      1733
80000      1412
70000      1214
60000      1153
50000      1009
40000       814
30000       777
5000        760
20000       757
10000       245
Name: odometer_km, dtype: int64

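The odometer values fall into 13 fixed bands between 5,000 km and 150,000 km, with no outliers. About 65% of the cars are in the top band of 150,000 km.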

Now let's explore the date columns.

In [107]:
autos_c_1[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [108]:
# Extract the first 10 characters of each timestamp (e.g. 2016-03-26),
# and show the relative frequency of each date (normalize=True) in ascending order
autos_c_1['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025361
2016-03-06    0.014039
2016-03-07    0.036061
2016-03-08    0.033179
2016-03-09    0.033013
2016-03-10    0.032287
2016-03-11    0.032598
2016-03-12    0.036911
2016-03-13    0.015677
2016-03-14    0.036662
2016-03-15    0.034319
2016-03-16    0.029467
2016-03-17    0.031499
2016-03-18    0.012898
2016-03-19    0.034734
2016-03-20    0.037803
2016-03-21    0.037201
2016-03-22    0.032888
2016-03-23    0.032287
2016-03-24    0.029446
2016-03-25    0.031499
2016-03-26    0.032308
2016-03-27    0.031126
2016-03-28    0.034962
2016-03-29    0.034112
2016-03-30    0.033738
2016-03-31    0.031851
2016-04-01    0.033697
2016-04-02    0.035605
2016-04-03    0.038611
2016-04-04    0.036538
2016-04-05    0.013064
2016-04-06    0.003173
2016-04-07    0.001389
Name: date_crawled, dtype: float64

The ads were crawled between March 5, 2016 and April 7, 2016.
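
Slicing the first 10 characters works because the timestamps are consistently formatted. An alternative sketch, assuming the same `YYYY-MM-DD HH:MM:SS` format, parses the strings with `pd.to_datetime` so that the first and last dates can be read off directly. The timestamps below are illustrative.

```python
import pandas as pd

stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-03-26 18:57:24',
    '2016-04-04 13:38:56',
    '2016-03-05 09:00:00',
])

# Parse to datetime and drop the time of day
days = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S').dt.normalize()

share_per_day = days.value_counts(normalize=True).sort_index()
first_day, last_day = days.min(), days.max()

print(share_per_day)
print(first_day.date(), 'to', last_day.date())
```

Parsed dates also make it possible to compute differences, such as the number of days between `ad_created` and `last_seen`.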

In [109]:
autos_c_1['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038860
2016-04-04    0.036890
2016-04-05    0.011799
2016-04-06    0.003256
2016-04-07    0.001244
Name: ad_created, Length: 76, dtype: float64

The ads were created between June 11, 2015 and April 7, 2016.

In [110]:
autos_c_1['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001078
2016-03-06    0.004313
2016-03-07    0.005433
2016-03-08    0.007320
2016-03-09    0.009580
2016-03-10    0.010638
2016-03-11    0.012400
2016-03-12    0.023785
2016-03-13    0.008875
2016-03-14    0.012629
2016-03-15    0.015863
2016-03-16    0.016444
2016-03-17    0.028098
2016-03-18    0.007320
2016-03-19    0.015760
2016-03-20    0.020654
2016-03-21    0.020550
2016-03-22    0.021359
2016-03-23    0.018580
2016-03-24    0.019762
2016-03-25    0.019098
2016-03-26    0.016672
2016-03-27    0.015552
2016-03-28    0.020840
2016-03-29    0.022292
2016-03-30    0.024697
2016-03-31    0.023826
2016-04-01    0.022852
2016-04-02    0.024884
2016-04-03    0.025133
2016-04-04    0.024531
2016-04-05    0.125062
2016-04-06    0.221964
2016-04-07    0.132154
Name: last_seen, dtype: float64

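Nearly half of the listings (about 48%) were last seen during the final three days of the crawl, April 5 to 7, 2016. Since `last_seen` records when the crawler last saw the ad online, this spike most likely reflects the end of the crawling period rather than a sudden change in listing activity.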

In [111]:
autos_c_1['registration_year'].describe()

count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [118]:
# Count the listings with a plausible registration year: from 1900 up to 2016, the year the ads were crawled
autos_c_1[autos_c_1['registration_year'].between(1900, 2016)].shape

(46352, 17)

Around 4% of the listings have an implausible registration year (before 1900 or after 2016, the year the data was crawled).

In [113]:
# Clean the dataset
autos_c_2 = autos_c_1[autos_c_1['registration_year'].between(1900, 2016)]
autos_c_2.shape

(46352, 17)

In [114]:
# Look at the distribution of cars' registration years
autos_c_2['registration_year'].value_counts(normalize=True).head(15)

2000    0.066966
2005    0.062802
1999    0.062112
2004    0.058228
2003    0.058099
2006    0.057560
2001    0.056718
2002    0.053439
1998    0.050483
2007    0.049038
2008    0.047679
2009    0.044874
1997    0.041530
2011    0.034907
2010    0.034238
Name: registration_year, dtype: float64

The 15 most common registration years all fall between 1997 and 2011, and together they account for about 78% of the listed cars.

## Exploring Price and Mileage by Brand

We will explore variations across different car brands.

In [115]:
autos_c_2['brand'].unique().shape

(40,)

In [116]:
autos_c_2['brand'].value_counts(dropna=False, normalize=True)

volkswagen        0.211404
bmw               0.110179
opel              0.107245
mercedes_benz     0.096652
audi              0.086771
ford              0.069835
renault           0.047075
peugeot           0.029858
fiat              0.025608
seat              0.018252
skoda             0.016418
nissan            0.015339
mazda             0.015231
smart             0.014196
citroen           0.014045
toyota            0.012793
hyundai           0.010010
sonstige_autos    0.009536
volvo             0.009126
mini              0.008802
mitsubishi        0.008177
honda             0.007875
kia               0.007076
alfa_romeo        0.006666
porsche           0.006019
suzuki            0.005933
chevrolet         0.005674
chrysler          0.003517
dacia             0.002654
daihatsu          0.002503
jeep              0.002287
land_rover        0.002114
subaru            0.002114
saab              0.001661
jaguar            0.001532
daewoo            0.001489
trabant           0.001359
r

Since the data comes from German eBay, the most common brands are German. There are 40 brands overall; we will look at those with a share of more than 5% of the listings (the first six).

In [117]:
selected_brands = autos_c_2['brand'].value_counts(dropna=False, normalize=True).index[:6] # Select first 6 entries

brands_price_mean = {}
brands_mileage_mean = {}

for brand in selected_brands:
    brands_price_mean[brand] = round(autos_c_2.loc[autos_c_2['brand'] == brand, 'price'].mean()) 
    # Aggregate the prices of cars for each brand and compute the mean
    
    brands_mileage_mean[brand] = round(autos_c_2.loc[autos_c_2['brand'] == brand, 'odometer_km'].mean())
    # Aggregate the mileage of cars for each brand and compute the mean

brands_mean = pd.DataFrame(pd.Series(brands_price_mean), columns = ['mean_price'])
# Transform the dictionary into the dataframe

brands_mean['mean_mileage'] = pd.Series(brands_mileage_mean)
# Add mileage as a new column

print(brands_mean)

               mean_price  mean_mileage
volkswagen           5437        128800
bmw                  8382        132695
opel                 3005        129384
mercedes_benz        8673        131026
audi                 9381        129245
ford                 3779        124277


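The most affordable brands are Opel and Ford, with Volkswagen in the middle, and BMW, Mercedes-Benz and Audi the most expensive. Mean mileage varies little across brands (roughly 124,000 to 133,000 km) and shows no clear relationship with price: Ford has the lowest mean mileage and BMW the highest, but Opel's is almost identical to Audi's.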
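
The per-brand loop can also be expressed with `groupby`, which computes both means in one pass. This is a sketch on a small made-up DataFrame that reuses the notebook's column names. It is an alternative formulation, not the code that produced the table above.

```python
import pandas as pd

listings = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen', 'bmw', 'bmw', 'opel'],
    'price': [1000, 2000, 3000, 5000, 7000, 900],
    'odometer_km': [150000, 125000, 100000, 100000, 150000, 150000],
})

# Most frequent brands first; keep the top two for this toy example
top_brands = listings['brand'].value_counts().index[:2]

brand_means = (
    listings[listings['brand'].isin(top_brands)]
    .groupby('brand')[['price', 'odometer_km']]
    .mean()
    .round()
    .astype(int)
    .loc[top_brands]  # restore the order by frequency
    .rename(columns={'price': 'mean_price', 'odometer_km': 'mean_mileage'})
)

print(brand_means)
```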

## Exploring Most Common Car Models

Next we will find the most popular model for three of the top six brands: one from the low price category (Opel), one from the middle (Volkswagen), and one from the expensive category (BMW).

In [148]:
# Opel
autos_c_2.loc[autos_c_2['brand'] == 'opel', 'model'].value_counts().head(3)

# Most popular model is Opel Corsa

corsa     1568
astra     1337
vectra     538
Name: model, dtype: int64

In [164]:
# Selecting data for Opel Corsa
opel_corsa = autos_c_2[(autos_c_2['brand'] == 'opel') & (autos_c_2['model'] == 'corsa')]

# Calculating its average price, mileage, and age (relative to 2016, the year of the crawl)
opel_corsa_price = round(opel_corsa['price'].mean())
opel_corsa_mileage = round(opel_corsa['odometer_km'].mean())
opel_corsa_age = 2016 - round(opel_corsa['registration_year'].mean())

opel_corsa = [opel_corsa_price, opel_corsa_mileage, opel_corsa_age]

In [139]:
# Volkswagen
autos_c_2.loc[autos_c_2['brand'] == 'volkswagen', 'model'].value_counts().head(3)

# Most popular model by far is VW Golf

golf      3684
polo      1592
passat    1345
Name: model, dtype: int64

In [165]:
# Selecting data for Volkswagen Golf
vw_golf = autos_c_2[(autos_c_2['brand'] == 'volkswagen') & (autos_c_2['model'] == 'golf')]

# Calculating its average price, mileage, and age (relative to 2016, the year of the crawl)
vw_golf_price = round(vw_golf['price'].mean())
vw_golf_mileage = round(vw_golf['odometer_km'].mean())
vw_golf_age = 2016 - round(vw_golf['registration_year'].mean())

vw_golf = [vw_golf_price, vw_golf_mileage, vw_golf_age]

In [150]:
# BMW
autos_c_2.loc[autos_c_2['brand'] == 'bmw', 'model'].value_counts().head(3)

# Most popular model is BMW 3 series

3er    2602
5er    1123
1er     521
Name: model, dtype: int64

In [171]:
# Selecting data for BMW 3 series
bmw_3 = autos_c_2[(autos_c_2['brand'] == 'bmw') & (autos_c_2['model'] == '3er')]

# Calculating its average price, mileage, and age (relative to 2016, the year of the crawl)
bmw_3_price = round(bmw_3['price'].mean())
bmw_3_mileage = round(bmw_3['odometer_km'].mean())
bmw_3_age = 2016 - round(bmw_3['registration_year'].mean())

bmw_3 = [bmw_3_price, bmw_3_mileage, bmw_3_age]

In [172]:
average_cars = pd.DataFrame([opel_corsa, vw_golf, bmw_3], 
                            index = ['Opel Corsa', 'VW Golf', 'BMW 3'],
                            columns = ['price', 'mileage', 'age'])

print(average_cars)

            price  mileage  age
Opel Corsa   1905   128406   14
VW Golf      5113   128128   14
BMW 3        6032   137367   14
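
The three model cells repeat the same steps, so they could be folded into one helper. The sketch below runs on a made-up DataFrame with the notebook's column names. The function name `model_profile` and its `ref_year` parameter are hypothetical.

```python
import pandas as pd

def model_profile(df, brand, model, ref_year=2016):
    """Return the rounded mean price, mileage, and age for one brand/model pair."""
    subset = df[(df['brand'] == brand) & (df['model'] == model)]
    return {
        'price': int(round(subset['price'].mean())),
        'mileage': int(round(subset['odometer_km'].mean())),
        # Age is measured from ref_year, the year the ads were crawled
        'age': ref_year - int(round(subset['registration_year'].mean())),
    }

toy = pd.DataFrame({
    'brand': ['opel', 'opel', 'bmw'],
    'model': ['corsa', 'corsa', '3er'],
    'price': [1000, 3000, 6000],
    'odometer_km': [100000, 150000, 125000],
    'registration_year': [2000, 2004, 2002],
})

profiles = pd.DataFrame(
    [model_profile(toy, 'opel', 'corsa'), model_profile(toy, 'bmw', '3er')],
    index=['Opel Corsa', 'BMW 3'],
)
print(profiles)
```

Returning a dictionary also avoids reusing a name such as `opel_corsa` for both the filtered DataFrame and the summary list.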


## Conclusion

We analyzed the used car data from German eBay, removed outliers, and examined the most popular brands listed.

Unsurprisingly, the most common brands are almost all German: VW (21% of listings), BMW and Opel (11% each), Mercedes-Benz (10%), and Audi (9%), followed by Ford (7%).

The most common models for the top three brands are the VW Golf (average price 5k USD, 128k km, 14 years old), the BMW 3 Series (6k USD, 137k km, 14 years old), and the Opel Corsa (1.9k USD, 128k km, 14 years old).