## Exploring eBay Car Sales Data

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka.
The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

This dataset has only 50,000 datapoints sampled from the original, and some errors have been added. After all, this is a guided project to learn data cleaning. The aim of this project is to clean the dataset and perform some initial analysis on it.

Data dictionary for the dataset:

| Index | Column name          | Description                                         |
|-------|----------------------|-----------------------------------------------------|
|   0   | dateCrawled          |  When the ad was first crawled.                     |
|   1   | name                 |  Name of the car.                                   |
|   2   | seller               |  Whether the seller is private or a dealer.         |
|   3   | offerType            |  The type of listing.                               |
|   4   | price                |  The listed selling price of the car.               |
|   5   | abtest               |  Whether the listing is included in an A/B test.    |
|   6   | vehicleType          |  The type of vehicle.                               |
|   7   | yearOfRegistration   |  The year in which the car was first registered.    |
|   8   | gearbox              |  The type of transmission.                          |
|   9   | powerPS              |  The power of the car in PS.                        |
|   10  | model                |  The car model name.                                |
|   11  | odometer             |  How many kilometers the car has driven.            |
|   12  | monthOfRegistration  |  The month in which the car was first registered.   |
|   13  | fuelType             |  What type of fuel the car uses.                    |
|   14  | brand                |  The brand of the car.                              |
|   15  | notRepairedDamage    |  If the car has a damage which is not yet repaired. |
|   16  | dateCreated          |  The date the eBay listing was created.             |
|   17  | nrOfPictures         |  The number of pictures in the ad.                  |
|   18  | postalCode           |  The postal code for the location of the vehicle.   |
|   19  | lastSeen             |  When the crawler last saw this ad online.          |

## Import libraries and load data

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
autos = pd.read_csv("autos.csv", encoding="latin1")

## Explore the data and get further information

In [3]:
autos # This will show the first and last few lines in the dataset.

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [4]:
print(autos.info())
print('\n')
print(autos.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Most columns are complete. Five columns have missing values: `vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage`.
Most of the columns have the dtype `object`; the remaining ones are `int64`.
Several columns need some work: `name` (the carmaker could be extracted), `price` (the `$` symbol and the commas used as thousands separators need to be removed), `odometer` (the `km` suffix and the commas need to be removed), and `powerPS`, which contains values of 0 even though a car cannot have 0 PS. The date columns could also be brought into a shape that is easier to analyse. In summary:

* The dataset contains 20 columns, most of which are strings.
* Some columns have null values, but none have more than ~20% null values.
* The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.

Let's convert the column names from camelcase to snakecase and change some of the column names based on the data dictionary in order to be more descriptive.
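
The share of null values per column can also be checked directly rather than read off the `info()` output. A minimal sketch, assuming a small invented frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration
toy = pd.DataFrame({
    "gearbox": ["manuell", None, "automatik", "manuell"],
    "price": ["$5,000", "$8,500", "$8,990", "$4,350"],
})

# isna() marks missing cells as True; the mean of the booleans is the null share
null_share = toy.isna().mean()
print(null_share)
```

Run on `autos`, the same expression would give one fraction per column, which can be compared against the ~20% figure above.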

## Cleaning Column Names

In [5]:
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


We will use a mapping dictionary to change the column names from camelcase to snakecase.

In [6]:
# Define mapping dictionary
mapping_dictionary = {'dateCrawled': 'date_crawled', 'name': 'name', 'seller': 'seller', 'offerType': 'offer_type', 'price': 'price',
                       'abtest': 'abtest', 'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year', 'gearbox': 'gearbox', 
                       'powerPS': 'power_ps', 'model': 'model', 'odometer': 'odometer', 'monthOfRegistration': 'registration_month', 
                       'fuelType': 'fuel_type', 'brand': 'brand', 'notRepairedDamage': 'unrepaired_damage', 'dateCreated': 'ad_created', 
                       'nrOfPictures': 'nr_of_pictures', 'postalCode': 'postal_code', 'lastSeen': 'last_seen'}

# Rename columns in the DataFrame
autos.rename(columns=mapping_dictionary, inplace=True)

In [7]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


It would also have been possible to write the new column names into a list, `new_columns = ['date_crawled', 'name', ...]`, and assign that list to `autos.columns`. I used a mapping dictionary because it seemed easier.
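
A third option is to derive the snake case names with a regular expression. The helper `camel_to_snake` below is a hypothetical sketch, not part of the original notebook. It only converts the casing, so renames that reorder words (such as `yearOfRegistration` to `registration_year`) would still need a mapping.

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    # Insert an underscore wherever a lowercase letter or digit is
    # directly followed by an uppercase letter, then lowercase everything
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

for original in ["dateCrawled", "powerPS", "nrOfPictures", "abtest"]:
    print(original, "->", camel_to_snake(original))
```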

## Initial Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

* Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
* Examples of numeric data stored as text which can be cleaned and converted.

In [8]:
print(autos.describe(include='all'))

               date_crawled         name  seller offer_type  price abtest  \
count                 50000        50000   50000      50000  50000  50000   
unique                48213        38754       2          2   2357      2   
top     2016-04-02 11:37:04  Ford_Fiesta  privat    Angebot     $0   test   
freq                      3           78   49999      49999   1421  25756   
mean                    NaN          NaN     NaN        NaN    NaN    NaN   
std                     NaN          NaN     NaN        NaN    NaN    NaN   
min                     NaN          NaN     NaN        NaN    NaN    NaN   
25%                     NaN          NaN     NaN        NaN    NaN    NaN   
50%                     NaN          NaN     NaN        NaN    NaN    NaN   
75%                     NaN          NaN     NaN        NaN    NaN    NaN   
max                     NaN          NaN     NaN        NaN    NaN    NaN   

       vehicle_type  registration_year  gearbox      power_ps  model  \
cou

In [9]:
autos["nr_of_pictures"].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

This inspection shows that:

* the column `nr_of_pictures` only contains the value 0 and can be dropped
* `seller` and `offer_type` have 2 unique values each, but one of the values occurs only once, so they can be dropped too
* `price` and `odometer` contain numerical data stored as strings with additional symbols

Dropping the columns `nr_of_pictures`, `seller`, `offer_type`:

In [10]:
autos = autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1)
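
Columns dominated by a single value can also be flagged programmatically instead of by reading the `describe()` output. The following is a sketch on invented data. The frame `toy`, its values, and the 0.85 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame; the values are invented and only mimic the pattern seen above
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "nr_of_pictures": [0] * 10,
    "brand": ["bmw", "audi", "ford", "opel", "bmw",
              "audi", "ford", "opel", "bmw", "audi"],
})

# Share of rows taken up by the most frequent value in each column
top_share = toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Columns where one value covers more than 85% of the rows
near_constant = top_share[top_share > 0.85].index.tolist()
print(near_constant)
```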

Let us remove the non-numeric characters from `price` and `odometer` and convert these columns to int.

In [11]:
autos["price"] = autos["price"].str.replace("$", "", regex=False) # remove the currency symbol (literal match, since "$" is a regex metacharacter)
autos["price"] = autos["price"].str.replace(",", "", regex=False) # remove the thousands separator
autos["price"] = autos["price"].astype(int)  # recast as int
print("`price` dtype:", autos["price"].dtype)

`price` dtype: int64




In [12]:
autos["price"].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [13]:
autos["odometer"] = autos["odometer"].str.replace("km", "") # remove extra character
autos["odometer"] = autos["odometer"].str.replace(",", "") # remove extra character
autos["odometer"] = autos["odometer"].astype(int)  # recast as int
print("`odometer` dtype:", autos["odometer"].dtype)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()

`odometer` dtype: int64


0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64
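
Both conversions follow the same pattern, so they could be wrapped in a small helper. The function `strip_to_int` is a hypothetical name introduced here for illustration, and the sample values are made up in the style of the columns above.

```python
import pandas as pd

def strip_to_int(series: pd.Series, tokens: list) -> pd.Series:
    """Remove each literal token from the strings, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False treats the token as plain text, which matters for "$"
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

prices = strip_to_int(pd.Series(["$5,000", "$8,500", "$0"]), ["$", ","])
odometer = strip_to_int(pd.Series(["150,000km", "5,000km"]), ["km", ","])
print(prices.tolist(), odometer.tolist())
```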

## Exploring the Odometer and Price Columns

Let us look at `odometer_km` and `price` more closely.

In [14]:
print(autos["odometer_km"].unique().shape)

(13,)


In [15]:
print(autos["odometer_km"].describe())

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


In [16]:
print(autos["odometer_km"].value_counts())

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64


`odometer_km` has 13 unique values. It looks as if the distribution is highly skewed towards cars with many kilometers recorded. The median (150000 km) is also the maximum value.

In [17]:
print(autos["price"].unique().shape)
print(autos["price"].describe())
autos["price"].value_counts().head(20)

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
Name: price, dtype: int64

`price` has 2357 unique values. 1421 cars are recorded as costing 0 (either an error or people don't want to pay for the junkyard?). And there are some unrealistically high prices. Let's sort and look at the highest values.

In [18]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

And the lowest values:

In [19]:
autos["price"].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

Considering that customers can bid on prices, it might be that the opening bid is 1, so we keep those listings. The 0 prices make up only about 3% of the whole dataset (1421 of 50000), and we will remove them. Then there are 14 listings priced above 350000, going into the millions. These are not real prices, and we will remove them. Prices up to 350000 could be possible. We will implement this by keeping the listings with prices from 1 to 350000.

In [20]:
autos = autos[autos["price"].between(1, 350000)]  # both bounds are inclusive
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

Now we see that the prices make a lot more sense. The mean price is 5888.94, the median price 3000.
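
`Series.between` includes both endpoints by default, which is why listings priced at exactly 1 and exactly 350000 are kept. A short sketch with invented prices:

```python
import pandas as pd

# Invented prices covering both edges of the accepted range
prices = pd.Series([0, 1, 3000, 350000, 999990])

# between() is inclusive on both sides unless told otherwise
kept = prices[prices.between(1, 350000)]
print(kept.tolist())
```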

## Exploring the Date Columns

Five columns hold date information. Right now, the `date_crawled`, `last_seen`, and `ad_created` columns are all identified as string values by pandas, so we need to convert them into a form that we can analyse quantitatively. The other two date columns, `registration_year` and `registration_month`, are already represented as numeric values.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [21]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


As the first 10 characters represent the date, we will start by extracting those. We will then calculate the distribution of values in `date_crawled`, `ad_created`, and `last_seen` as relative frequencies.

In [22]:
(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

In [25]:
(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-04-07    0.001400
2016-04-06    0.003171
2016-03-18    0.012911
2016-04-05    0.013096
2016-03-06    0.014043
2016-03-13    0.015670
2016-03-05    0.025327
2016-03-24    0.029342
2016-03-16    0.029610
2016-03-27    0.031092
2016-03-25    0.031607
2016-03-17    0.031628
2016-03-31    0.031834
2016-03-10    0.032184
2016-03-26    0.032204
2016-03-23    0.032225
2016-03-11    0.032575
2016-03-22    0.032987
2016-03-09    0.033090
2016-03-08    0.033296
2016-04-01    0.033687
2016-03-30    0.033687
2016-03-29    0.034099
2016-03-15    0.034284
2016-03-19    0.034778
2016-03-28    0.034860
2016-04-02    0.035478
2016-03-07    0.036014
2016-04-04    0.036487
2016-03-14    0.036549
2016-03-12    0.036920
2016-03-21    0.037373
2016-03-20    0.037887
2016-04-03    0.038608
Name: date_crawled, dtype: float64

The crawling period ran from 5 March to 7 April 2016. The listings are fairly evenly distributed over these days, but the last three days (5 to 7 April) and a few single days in March (6, 13, and 18 March) have a lot fewer crawling records.
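
An alternative to slicing the first 10 characters is to parse the strings into real timestamps with `pd.to_datetime` and then take the date part. The sketch below uses three timestamps in the same format as the column. The variable names are illustrative.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse with an explicit format, then keep only the calendar date
days = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S").dt.date

# Relative frequency per day, in chronological order
share = days.value_counts(normalize=True).sort_index()
print(share)
```

Parsing has the advantage that malformed timestamps raise an error instead of silently producing an odd 10-character string.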

In [26]:
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64

In [27]:
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-02-16    0.000021
2016-02-09    0.000021
2015-09-09    0.000021
2016-01-07    0.000021
2016-01-16    0.000021
                ...   
2016-03-12    0.036755
2016-04-04    0.036858
2016-03-21    0.037579
2016-03-20    0.037949
2016-04-03    0.038855
Name: ad_created, Length: 76, dtype: float64

There is a large variety of ad creation dates (76 distinct days). Most fall within 1-2 months of the crawl, but a few are quite old, with the oldest (11 June 2015) created around 9 months before the crawl began.

In [28]:
(autos["last_seen"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

In [30]:
(autos["last_seen"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-18    0.007351
2016-03-08    0.007413
2016-03-13    0.008895
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-14    0.012602
2016-03-27    0.015649
2016-03-19    0.015834
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-26    0.016802
2016-03-23    0.018532
2016-03-25    0.019211
2016-03-24    0.019767
2016-03-21    0.020632
2016-03-20    0.020653
2016-03-28    0.020859
2016-03-22    0.021373
2016-03-29    0.022341
2016-04-01    0.022794
2016-03-31    0.023783
2016-03-12    0.023783
2016-04-04    0.024483
2016-03-30    0.024771
2016-04-02    0.024915
2016-04-03    0.025203
2016-03-17    0.028086
2016-04-05    0.124761
2016-04-07    0.131947
2016-04-06    0.221806
Name: last_seen, dtype: float64

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate share of the 'last seen' values. Given that these shares are roughly 5 to 9 times those of the preceding days, it's unlikely that there was a massive spike in sales. It is more likely that these values reflect the end of the crawling period and don't indicate car sales.


In [31]:
print(autos["registration_year"].describe())

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64


The registration year contains nonsensical values like 1000 and 9999, which we should remove.

In [32]:
autos["registration_year"].value_counts().sort_index(ascending=False).head(20)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
Name: registration_year, dtype: int64

In [36]:
autos["registration_year"].value_counts().sort_index(ascending=True).head(30)

1000     1
1001     1
1111     1
1800     2
1910     5
1927     1
1929     1
1931     1
1934     2
1937     4
1938     1
1939     1
1941     2
1943     1
1948     1
1950     3
1951     2
1952     1
1953     1
1954     2
1955     2
1956     4
1957     2
1958     4
1959     6
1960    23
1961     6
1962     4
1963     8
1964    12
Name: registration_year, dtype: int64

## Dealing with Incorrect Registration Year Data
The registration year strikes me as odd. Would there be cars listed that were registered in 1910? It would be possible if these really were oldtimer cars, which might even fetch a high price. We will retain all values of registration year between 1910 and 2016. Additionally, we will replace the year values 2017-2018 with 2016, since 2016 is the year in which the data was crawled and a car cannot have been first registered after its ad was last seen. Presumably, those later entries are typos.

In [38]:
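# Caution: without a column selector, this writes 2016 into every column of the matching rows, not only registration_year
autos[(autos["registration_year"] >= 2017) & (autos["registration_year"] <= 2018)] = 2016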
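
Assigning a scalar to a boolean-indexed DataFrame, as in cell 38, sets every column of the matching rows. To change only `registration_year`, `.loc` with a column label can be used instead. A minimal sketch on invented data (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "audi", "ford"],
    "registration_year": [2004, 2017, 2018],
})

# Boolean mask for the rows with implausibly late registration years
late = toy["registration_year"].between(2017, 2018)

# Row mask plus column label: only registration_year is overwritten
targeted = toy.copy()
targeted.loc[late, "registration_year"] = 2016
print(targeted)
```

The outputs shown in the rest of this notebook were produced with the assignment exactly as written in cell 38.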

In [39]:
autos = autos[autos["registration_year"].between(1910,2016)]
autos["registration_year"].describe()

count    48543.000000
mean      2003.412830
std          7.480992
min       1910.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

The median registration year is 2004, which tells us that a typical car listed here was about 12 years old when the data was crawled in 2016. The oldest car is an oldtimer from 1910, and the newest cars were registered in 2016. (My procedure differs from the procedure used in the solution, where only the range 1900 - 2016 is kept, without assigning the 2017-2018 cars to 2016.)

In [42]:
autos["registration_year"].value_counts(normalize=True).head(20)

2000    0.065015
2016    0.063490
2005    0.060482
1999    0.059679
2004    0.055683
2003    0.055600
2006    0.055003
2001    0.054302
2002    0.051212
1998    0.048678
2007    0.046907
2008    0.045630
2009    0.042952
1997    0.040191
2011    0.033434
2010    0.032734
1996    0.028284
2012    0.026986
1995    0.025277
2013    0.016542
Name: registration_year, dtype: float64

We can see that most cars were registered within the last 20 years or so.

## Exploring Price by Brand

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column.

In [44]:
print(autos["brand"].unique())  # unique values of brand

['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 2016 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche'
 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi'
 'jaguar' 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai'
 'honda' 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada'
 'rover' 'daihatsu' 'lancia']


In [49]:
value_counts_brand = autos["brand"].value_counts(normalize=True).head(20)
print(value_counts_brand)

volkswagen        0.203160
bmw               0.105824
opel              0.103455
mercedes_benz     0.092763
audi              0.083246
ford              0.067219
renault           0.045341
2016              0.038358
peugeot           0.028696
fiat              0.024659
seat              0.017572
skoda             0.015780
nissan            0.014688
mazda             0.014606
smart             0.013617
citroen           0.013473
toyota            0.012216
hyundai           0.009641
sonstige_autos    0.009435
volvo             0.008796
Name: brand, dtype: float64


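There is the number 2016 where a brand name should be. This is a side effect of the assignment in cell 38, which wrote 2016 into every column of the rows with registration years 2017-2018, not only into `registration_year`. These rows make up about 3.8% of the data. I drop them.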

In [50]:
autos = autos[autos["brand"] != 2016]


In [51]:
value_counts_brand = autos["brand"].value_counts(normalize=True).head(20)
print(value_counts_brand)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
Name: brand, dtype: float64


Based on this, I think it makes sense to aggregate over brands that each have at least ca. 5% of the total listings. Let's be generous and take the top 7 brands, including Renault at 4.7%. So our brands are: Volkswagen, BMW, Opel, Mercedes-Benz, Audi, Ford, and Renault.
Let's explore the average price for those brands by looping over them.

In [62]:
avg_price = {}
top_brands = autos["brand"].value_counts().head(7).index

for brand in top_brands:
    avg_price[brand] = round(autos.loc[autos["brand"] == brand, "price"].mean(), 2)

print(avg_price)
    

{'volkswagen': 5402.41, 'bmw': 8332.82, 'opel': 2975.24, 'mercedes_benz': 8628.45, 'audi': 9336.69, 'ford': 3749.47, 'renault': 2474.86}


We observe that Audi (avg. price 9336.69), Mercedes-Benz (avg. price 8628.45), and BMW (avg. price 8332.82) are the most expensive brands among the top 7 listed brands. Volkswagen is the mid-price brand, and Ford, Opel and Renault are the cheaper brands.
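
The same per-brand averages can be computed without an explicit loop by using `groupby`. This is a sketch on invented data, not the approach used in this notebook. The frame `toy` and its values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "bmw", "bmw", "fiat"],
    "price": [5000, 6000, 7000, 8000, 9000, 1000],
})

# The two most common brands in the toy data
top = toy["brand"].value_counts().head(2).index

# Keep only those brands, then average the price within each brand
mean_price = (
    toy[toy["brand"].isin(top)]
    .groupby("brand")["price"]
    .mean()
    .round(2)
)
print(mean_price)
```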

## Storing Aggregate Data in a DataFrame

For the top 7 brands (in the solution: top 6), let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:

* it's difficult to compare more than two aggregate series objects if we want to extend to more columns
* we can't compare more than a few rows from each series object
* we can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly, using the pandas Series constructor and the pandas DataFrame constructor.
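
The shared index is what makes the combination safe: pandas aligns the values by label, not by position. A small sketch with invented numbers, where the two Series list the brands in a different order:

```python
import pandas as pd

# Invented values; note the different brand order in the two Series
mean_price = pd.Series({"audi": 9000.0, "ford": 3500.0})
mean_mileage = pd.Series({"ford": 124000.0, "audi": 129000.0})

# Building the frame from a dict of Series aligns rows on the index labels
combined = pd.DataFrame({"mean_price": mean_price, "mean_mileage": mean_mileage})
print(combined)
```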

First, we will calculate the average mileage for each of the top brands.


In [63]:
avg_mileage = {}
for brand in top_brands:
    avg_mileage[brand] = round(autos.loc[autos["brand"] == brand, "odometer_km"].mean(), 2)

print(avg_mileage)

{'volkswagen': 128707.16, 'bmw': 132572.51, 'opel': 129310.04, 'mercedes_benz': 130788.36, 'audi': 129157.39, 'ford': 124266.01, 'renault': 128071.33}


Now we convert both dictionaries to Series objects using the Series constructor.

In [64]:
avg_price_series = pd.Series(avg_price)
avg_mileage_series = pd.Series(avg_mileage)
print(avg_price_series)
print(avg_mileage_series)

volkswagen       5402.41
bmw              8332.82
opel             2975.24
mercedes_benz    8628.45
audi             9336.69
ford             3749.47
renault          2474.86
dtype: float64
volkswagen       128707.16
bmw              132572.51
opel             129310.04
mercedes_benz    130788.36
audi             129157.39
ford             124266.01
renault          128071.33
dtype: float64


Now we create a dataframe from the first series object using the dataframe constructor.

In [65]:
avg_price_series_df = pd.DataFrame(avg_price_series, columns=['mean_price'])
avg_price_series_df

Unnamed: 0,mean_price
volkswagen,5402.41
bmw,8332.82
opel,2975.24
mercedes_benz,8628.45
audi,9336.69
ford,3749.47
renault,2474.86


Now we assign the other series as a new column in this dataframe.

In [66]:
avg_price_series_df['mean_mileage'] = avg_mileage_series
avg_price_series_df

Unnamed: 0,mean_price,mean_mileage
volkswagen,5402.41,128707.16
bmw,8332.82,132572.51
opel,2975.24,129310.04
mercedes_benz,8628.45,130788.36
audi,9336.69,129157.39
ford,3749.47,124266.01
renault,2474.86,128071.33


The interesting result here is that two of the high-priced brands, BMW and Mercedes-Benz, also have the highest mean mileage. Both are above 130000 kilometers. Audi, at about 129000 km, is level with Opel and Volkswagen, and Ford has the lowest mean mileage at about 124000 km. It could mean that the high-priced brands are driven for a longer time, since they do not need as many repairs, whereas the c