# Exploring eBay Car Sales Data

## 1. Introduction

In this project, we will analyse a dataset containing information about used cars listed in the classifieds section of the German eBay website, *eBay Kleinanzeigen*.
The original dataset, scraped and uploaded to Kaggle by user orgesleka, is no longer available there, but it can now be found [here](https://data.world/data-society/used-cars-data). This analysis uses a smaller subset of that dataset, which has been further "dirtied" for learning purposes. The subset is available in the project repo under the name `autos.csv`.

The data dictionary for this dataset is the following:
- `dateCrawled`: When this ad was first crawled. All field-values are taken from this date.
- `name`: Name of the car.
- `seller`: Whether the seller is private or a dealer.
- `offerType`: The type of listing.
- `price`: The price on the ad to sell the car.
- `abtest`: Whether the listing is included in an A/B test.
- `vehicleType`: The vehicle type.
- `yearOfRegistration`: The year in which the car was first registered.
- `gearbox`: The transmission type.
- `powerPS`: The power of the car in PS.
- `model`: The car model name.
- `odometer`: How many kilometers the car has driven.
- `monthOfRegistration`: The month in which the car was first registered.
- `fuelType`: What type of fuel the car uses.
- `brand`: The brand of the car.
- `notRepairedDamage`: If the car has a damage which is not yet repaired.
- `dateCreated`: The date on which the eBay listing was created.
- `nrOfPictures`: The number of pictures in the ad.
- `postalCode`: The postal code for the location of the vehicle.
- `lastSeenOnline`: When the crawler saw this ad last online.


## 2. Aim of the project
Clean the data and analyze the used car listings.

In [1]:
# import the pandas and NumPy libraries

import pandas as pd
import numpy as np

# read the autos.csv file into pandas, and assign it to the variable name autos
autos=pd.read_csv('autos.csv',encoding='Latin-1')

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

### Observations

Right away, we can see that the following columns have missing (null) values:
- `vehicleType`
- `gearbox`
- `model`
- `fuelType`
- `notRepairedDamage`
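
The same list can be produced programmatically instead of being read off the `info()` output. The following is a minimal sketch on a small made-up frame (the `demo` data is invented for illustration and is not taken from `autos.csv`); it relies on `DataFrame.isna().sum()` to count missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration.
demo = pd.DataFrame({
    'brand': ['bmw', 'audi', 'ford'],
    'gearbox': ['manuell', np.nan, 'automatik'],
    'model': [np.nan, np.nan, 'focus'],
})

# Count missing values per column and keep only the columns that have any.
null_counts = demo.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Applied to the real dataframe, the same two lines should return the five columns listed above, assuming missing values are encoded as NaN.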

We can also observe that the columns `price` and `odometer` are stored as objects, so we will probably have to convert them to a numeric type.

The column names mix lower- and uppercase letters, with no spaces between words. This is known as [camelcase](https://en.wikipedia.org/wiki/Camel_case#:~:text=Camel%20case%20(sometimes%20stylized%20as,word%20starting%20with%20either%20case.). It will be easier to handle the column names if we convert them to [snakecase](https://en.wikipedia.org/wiki/Snake_case). It will also help to rename some columns so that they describe their content better (e.g. convert `price` into `price_dollars`).
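
For reference, a purely mechanical camelcase-to-snakecase conversion could be sketched as follows. The helper name `camel_to_snake` is hypothetical and not part of this project; it only inserts underscores and lowercases, so it does not produce the more descriptive names (such as `price_dollars`) that a hand-written mapping allows.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each uppercase letter that follows
    a lowercase letter or digit, then lowercase the whole name."""
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()

for original in ['dateCrawled', 'yearOfRegistration', 'powerPS', 'abtest']:
    print(original, '->', camel_to_snake(original))
```

Names without an internal capital letter, such as `abtest`, are left unchanged, which is one reason a manual mapping can still be preferable.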

## 3. Column Renaming

In [4]:
# print an array of the existing column names

autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
# rename columns into snakecase

autos.rename(columns={'dateCrawled':'date_crawled', 
                      'offerType':'offer_type', 
                      'price':'price_dollars', 
                      'odometer':'odometer_km', 
                      'abtest':'ab_test',
                      'vehicleType':'vehicle_type', 
                      'yearOfRegistration':'registration_year',
                      'powerPS':'power_ps', 
                      'monthOfRegistration':'registration_month', 
                      'fuelType':'fuel_type',
                      'notRepairedDamage':'unrepaired_damage', 
                      'dateCreated':'ad_created', 
                      'nrOfPictures':'nr_of_pictures', 
                      'postalCode':'postal_code',
                      'lastSeen':'last_seen'}, inplace=True)
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_dollars',
       'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps',
       'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Now that we have converted the column names into a more manageable and informative format, we will do some basic data exploration to determine what other cleaning tasks need to be done.

We will start by: 
- dropping any text columns where all or almost all values are the same, as they don't have useful information for analysis.
- cleaning and converting numeric data stored as text, such as `price_dollars` and `odometer_km`.

In [6]:
# view min/max/median/mean etc.

autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price_dollars,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 16:37:21,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


From the table above, we can conclude:
- The `seller` and `offer_type` columns have only two unique values. Furthermore, the top value has a frequency of 49999, meaning that all rows but one share this value. Therefore, those columns contain virtually no useful information for analysis.
- The `nr_of_pictures` column is irrelevant to us, as all rows present a value of 0. 
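
Columns dominated by a single value can also be flagged without reading the `describe()` table by eye. The sketch below is one possible approach, not the method used in this notebook: the helper `dominant_share`, the toy data and the 0.7 threshold are all assumptions made for illustration.

```python
import pandas as pd

def dominant_share(df):
    """Return, for each column, the share of rows taken by its most frequent value."""
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Toy frame; the values are invented for illustration.
demo = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'other'],
    'brand': ['bmw', 'audi', 'ford', 'opel'],
    'nr_of_pictures': [0, 0, 0, 0],
})

shares = dominant_share(demo)
# Flag columns where one value covers at least 70% of the rows (arbitrary cutoff).
low_information = shares[shares >= 0.7].index.tolist()
print(low_information)
```

On the full dataset a much stricter threshold (for example 0.99) would be more appropriate, since many legitimate columns have a common top value.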


## 4. Deleting irrelevant columns:
### 4.1 `seller`, `offer_type` and `nr_of_pictures`

In [7]:
# look up the initial number of columns in the dataframe

print('Number of columns:',autos.shape[1])

# drop the columns

autos.drop(columns=['seller','offer_type','nr_of_pictures'], 
           inplace=True)
print('Updated number of columns:',autos.shape[1])

Number of columns: 20
Updated number of columns: 17



## 5. Text to numeric dtype conversion:
### 5.1 `price_dollars`

In [8]:
# view min/max/median/mean etc.

autos['price_dollars'].describe()

count     50000
unique     2357
top          $0
freq       1421
Name: price_dollars, dtype: object

As we can see, the top entry is $0. In order to convert `price_dollars` into a numeric type, we need to remove any currency symbols. 

In [9]:
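# Remove any non-numeric characters.
# regex=False treats ',' and '$' as literal text ('$' is otherwise a regex anchor).

autos['price_dollars'] = autos['price_dollars'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)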
autos['price_dollars'].describe()

count     50000
unique     2357
top           0
freq       1421
Name: price_dollars, dtype: object

In [10]:
# Convert the column to a numeric dtype.

autos['price_dollars']=autos['price_dollars'].astype('int')
autos['price_dollars']

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price_dollars, Length: 50000, dtype: int64

### 5.2 `odometer_km`

In [11]:
# view min/max/median/mean etc.

autos['odometer_km'].describe()

count         50000
unique           13
top       150,000km
freq          32424
Name: odometer_km, dtype: object

The `odometer_km` column also contains non-numeric characters, so we'll remove those as well before converting it to a numeric type.

In [12]:
# Remove any non-numeric characters.

autos['odometer_km'] = autos['odometer_km'].str.replace(',','').str.replace('km','')                                                                                 

In [13]:
# Convert the column to a numeric dtype.

autos['odometer_km'] = autos['odometer_km'].astype('int')
autos['odometer_km']

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int64
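
Since both columns go through the same two steps (strip the non-numeric characters, then cast), the conversion could be wrapped in a small helper. This is a sketch only: the name `to_int` is hypothetical, and it assumes every value is a non-negative whole number, because the pattern `\D` also removes minus signs and decimal points.

```python
import pandas as pd

def to_int(series):
    """Strip every non-digit character from a string column, then cast to int."""
    return series.str.replace(r'\D', '', regex=True).astype(int)

# Sample values in the same format as the raw columns.
prices = pd.Series(['$5,000', '$8,500', '$0'])
odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

print(to_int(prices).tolist())    # [5000, 8500, 0]
print(to_int(odometer).tolist())  # [150000, 70000, 5000]
```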

## 6. Data exploration:

### 6.1 `price_dollars`

In [14]:
# See how many unique values are in the column

autos['price_dollars'].unique().shape

(2357,)

In [15]:
# view min/max/median/mean etc.

autos['price_dollars'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64

### Observations

We can already tell that there are some problems with the values in `price_dollars`:
- The minimum value is 0 dollars, which is unusual.
- The maximum value is about 100 million dollars, which is clearly incorrect.
- The mean is 9840.0 dollars with a standard deviation of 481104.4; both are likely inflated by extreme values such as the maximum discussed above.

In [16]:
# calculate the counts for each value. 

autos['price_dollars'].value_counts()

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price_dollars, Length: 2357, dtype: int64

We can see that there are 1421 entries with a price of 0 dollars.

In [17]:
# calculate the values with their respective counts in ascending order

autos['price_dollars'].value_counts().sort_index(ascending= True)

0           1421
1            156
2              3
3              1
5              2
            ... 
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price_dollars, Length: 2357, dtype: int64

There are an additional 156 entries with a price of 1 dollar. This could be because sellers do not wish to set a price upfront and prefer to negotiate the price of the car in private. We will remove these entries, together with the 0-dollar entries, from our analysis.

Other values that stand out are prices greater than or equal to 999990 dollars. In total, there are 14 entries in this range.
Given that we are working with a dataset of used cars, these prices are unreasonably high and drive our mean and standard deviation up. Therefore, we will consider them **outliers** and remove them from our analysis.

In [18]:
# remove outliers

cleaned_prices_autos=autos[autos["price_dollars"].between(2,999989)]

In [19]:
cleaned_prices_autos.shape

(48409, 17)
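
Note that `Series.between` includes both endpoints by default, which is why the bounds are written as 2 and 999989. A minimal sketch on made-up prices (the `prices` values are illustrative only) shows which rows such a mask keeps and which it drops:

```python
import pandas as pd

# Made-up prices covering both edges of the accepted range.
prices = pd.Series([0, 1, 2, 500, 999989, 999990, 99999999])

# Both bounds are inclusive by default.
mask = prices.between(2, 999989)
kept = prices[mask].tolist()
removed = prices[~mask].tolist()

print(kept)     # [2, 500, 999989]
print(removed)  # [0, 1, 999990, 99999999]
```

Inspecting the removed rows before discarding them is a cheap way to confirm that the bounds do what was intended.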

In [20]:
# view min/max/median/mean etc.

cleaned_prices_autos['price_dollars'].describe()

count     48409.000000
mean       5907.909707
std        9068.263463
min           2.000000
25%        1250.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_dollars, dtype: float64

In [21]:
# calculate the counts for each value

cleaned_prices_autos['price_dollars'].value_counts()

500      781
1500     734
2500     643
1200     639
1000     639
        ... 
173        1
205        1
410        1
4335       1
17799      1
Name: price_dollars, Length: 2345, dtype: int64

After cleaning the `price_dollars` column, we are left with **48409 entries**. Our mean price is now **5907.9 dollars**, which is lower than before, as is the standard deviation. The most frequent price for a used car is **500 dollars**, which appears in **781 entries (1.6%)**.


### 6.2 `odometer_km`

In [22]:
# See how many unique values are in the column

cleaned_prices_autos['odometer_km'].unique().shape

(13,)

In [23]:
# view min/max/median/mean etc.

cleaned_prices_autos['odometer_km'].describe()

count     48409.000000
mean     125788.902890
std       39737.761014
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [24]:
# calculate the counts for each value

cleaned_prices_autos['odometer_km'].value_counts()

150000    31307
125000     5046
100000     2108
90000      1733
80000      1414
70000      1215
60000      1154
50000      1011
40000       815
5000        815
30000       779
20000       762
10000       250
Name: odometer_km, dtype: int64

The `odometer_km` column seems to have no major issues:
- The min value is 5000 km, present in 815 entries.
- The max value is **150000 km**, which is also the most frequent value, occurring in **31307 entries (64.7%)**.
- The mean is **125788.9 km**, with a standard deviation of 39737.8 km.


Most entries have a value greater than 50000 km. Since we are working with used cars, this seems reasonable.
Therefore, we will not remove any entries based on unrealistically high or low `odometer_km` values.

In [25]:
# make a copy of the dataframe under a shorter name

clean_autos = cleaned_prices_autos.copy()

## 6.3 dates and times

There are several columns that represent date values:
- `date_crawled`: When the ad was first crawled. Added by the crawler.
- `last_seen`: When the crawler saw the ad last online. Added by the crawler.
- `ad_created`: The date on which the eBay listing was created. Created by the website.
- `registration_month`: The month in which the car was first registered. Created by the website.
- `registration_year`: The year in which the car was first registered. Created by the website.


In [26]:
clean_autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48409 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        48409 non-null  object
 1   name                48409 non-null  object
 2   price_dollars       48409 non-null  int64 
 3   ab_test             48409 non-null  object
 4   vehicle_type        43891 non-null  object
 5   registration_year   48409 non-null  int64 
 6   gearbox             46113 non-null  object
 7   power_ps            48409 non-null  int64 
 8   model               45978 non-null  object
 9   odometer_km         48409 non-null  int64 
 10  registration_month  48409 non-null  int64 
 11  fuel_type           44438 non-null  object
 12  brand               48409 non-null  object
 13  unrepaired_damage   39405 non-null  object
 14  ad_created          48409 non-null  object
 15  postal_code         48409 non-null  int64 
 16  last_seen           48

### Observations

As of now, the `date_crawled`, `ad_created` and `last_seen` columns are recognized as string types (object). The `registration_year` and `registration_month` columns, on the other hand, are stored as numeric data. 

We will extract the date portion of the strings stored in `date_crawled`, `ad_created` and `last_seen`, so we can examine how the values are distributed over time.

### 6.3.1  `date_crawled`

In [27]:
clean_autos['date_crawled'].describe()

count                   48409
unique                  46739
top       2016-03-30 19:48:02
freq                        3
Name: date_crawled, dtype: object

As we can see above, the first 10 characters correspond to the date (e.g. 2016-03-21). Therefore, we can extract the date values and then generate a distribution using the `Series.value_counts()` command. 

In [28]:
# extract the date from the string

clean_autos['date_crawled']=clean_autos['date_crawled'].str[:10]

In [29]:
clean_autos['date_crawled'].head()

0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: date_crawled, dtype: object

In [30]:
clean_autos['date_crawled'].value_counts()

2016-04-03    1868
2016-03-20    1830
2016-03-21    1806
2016-03-12    1789
2016-03-14    1773
2016-04-04    1766
2016-03-07    1745
2016-04-02    1718
2016-03-28    1687
2016-03-19    1682
2016-03-15    1659
2016-03-29    1652
2016-04-01    1633
2016-03-30    1633
2016-03-08    1611
2016-03-09    1600
2016-03-22    1594
2016-03-11    1578
2016-03-23    1562
2016-03-26    1561
2016-03-10    1559
2016-03-31    1540
2016-03-17    1531
2016-03-25    1528
2016-03-27    1507
2016-03-16    1429
2016-03-24    1423
2016-03-05    1228
2016-03-13     758
2016-03-06     681
2016-04-05     633
2016-03-18     625
2016-04-06     153
2016-04-07      67
Name: date_crawled, dtype: int64

In [31]:
# calculate the distribution of values in percentages instead of counts

clean_autos['date_crawled'].value_counts(normalize=True, dropna=False)

2016-04-03    0.038588
2016-03-20    0.037803
2016-03-21    0.037307
2016-03-12    0.036956
2016-03-14    0.036625
2016-04-04    0.036481
2016-03-07    0.036047
2016-04-02    0.035489
2016-03-28    0.034849
2016-03-19    0.034746
2016-03-15    0.034270
2016-03-29    0.034126
2016-04-01    0.033733
2016-03-30    0.033733
2016-03-08    0.033279
2016-03-09    0.033052
2016-03-22    0.032928
2016-03-11    0.032597
2016-03-23    0.032267
2016-03-26    0.032246
2016-03-10    0.032205
2016-03-31    0.031812
2016-03-17    0.031626
2016-03-25    0.031564
2016-03-27    0.031131
2016-03-16    0.029519
2016-03-24    0.029395
2016-03-05    0.025367
2016-03-13    0.015658
2016-03-06    0.014068
2016-04-05    0.013076
2016-03-18    0.012911
2016-04-06    0.003161
2016-04-07    0.001384
Name: date_crawled, dtype: float64

In [32]:
# calculate the distribution of values in percentages and in ascending order

clean_autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2016-03-05    0.025367
2016-03-06    0.014068
2016-03-07    0.036047
2016-03-08    0.033279
2016-03-09    0.033052
2016-03-10    0.032205
2016-03-11    0.032597
2016-03-12    0.036956
2016-03-13    0.015658
2016-03-14    0.036625
2016-03-15    0.034270
2016-03-16    0.029519
2016-03-17    0.031626
2016-03-18    0.012911
2016-03-19    0.034746
2016-03-20    0.037803
2016-03-21    0.037307
2016-03-22    0.032928
2016-03-23    0.032267
2016-03-24    0.029395
2016-03-25    0.031564
2016-03-26    0.032246
2016-03-27    0.031131
2016-03-28    0.034849
2016-03-29    0.034126
2016-03-30    0.033733
2016-03-31    0.031812
2016-04-01    0.033733
2016-04-02    0.035489
2016-04-03    0.038588
2016-04-04    0.036481
2016-04-05    0.013076
2016-04-06    0.003161
2016-04-07    0.001384
Name: date_crawled, dtype: float64

As we can observe above, there are 48409 entries in `date_crawled`.
The entries are spread fairly evenly across most days in the range, which starts on **2016-03-05** and ends on **2016-04-07**.
According to this column, most ads were crawled on **2016-04-03 (3.86%)**.
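
An alternative to slicing the first 10 characters is to parse the timestamps and truncate them to midnight. The sketch below assumes the `YYYY-MM-DD HH:MM:SS` format seen in the raw data and uses three sample timestamps; it is not the approach taken in this notebook, just a variant that yields real datetime values.

```python
import pandas as pd

# Sample timestamps in the same format as the raw `dateCrawled` column.
raw = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-26 18:57:24'])

# Parse the strings, drop the time of day, then compute the share of rows per day.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S')
share_by_day = parsed.dt.normalize().value_counts(normalize=True).sort_index()
print(share_by_day)
```

Working with datetimes rather than strings makes later arithmetic (such as differences between two date columns) possible without a separate conversion step.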

### 6.3.2 `ad_created`

In [33]:
clean_autos['ad_created'].describe()

count                   48409
unique                     76
top       2016-04-03 00:00:00
freq                     1880
Name: ad_created, dtype: object

In [34]:
# extract the date from the string

clean_autos['ad_created']=clean_autos['ad_created'].str[:10]

In [35]:
# calculate the distribution of values in percentages 

clean_autos['ad_created'].value_counts(normalize=True, dropna=False)

2016-04-03    0.038836
2016-03-20    0.037865
2016-03-21    0.037534
2016-04-04    0.036853
2016-03-12    0.036770
                ...   
2016-01-03    0.000021
2016-02-09    0.000021
2016-01-14    0.000021
2016-01-22    0.000021
2016-01-07    0.000021
Name: ad_created, Length: 76, dtype: float64

In [36]:
# calculate the distribution of values in percentages and in ascending order

clean_autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038836
2016-04-04    0.036853
2016-04-05    0.011795
2016-04-06    0.003243
2016-04-07    0.001239
Name: ad_created, Length: 76, dtype: float64

In the `ad_created` column, there are also 48409 entries. The range of dates in this column starts on **2015-06-11** and ends on **2016-04-07**. According to this column, the most popular day for ad creation was **2016-04-03 (3.88%)**.


### 6.3.3 `last_seen`

In [37]:
clean_autos['last_seen'].describe()

count                   48409
unique                  38377
top       2016-04-07 06:17:27
freq                        8
Name: last_seen, dtype: object

In [38]:
# extract the date from the string

clean_autos['last_seen']=clean_autos['last_seen'].str[:10]

In [39]:
# calculate the distribution of values in percentages

clean_autos['last_seen'].value_counts(normalize=True, dropna=False)

2016-04-06    0.221591
2016-04-07    0.132021
2016-04-05    0.124935
2016-03-17    0.028073
2016-04-03    0.025202
2016-04-02    0.024851
2016-03-30    0.024747
2016-04-04    0.024500
2016-03-31    0.023839
2016-03-12    0.023797
2016-04-01    0.022868
2016-03-29    0.022331
2016-03-22    0.021380
2016-03-28    0.020885
2016-03-20    0.020637
2016-03-21    0.020616
2016-03-24    0.019748
2016-03-25    0.019191
2016-03-23    0.018592
2016-03-26    0.016815
2016-03-16    0.016443
2016-03-15    0.015865
2016-03-19    0.015824
2016-03-27    0.015617
2016-03-14    0.012622
2016-03-11    0.012374
2016-03-10    0.010618
2016-03-09    0.009626
2016-03-13    0.008862
2016-03-08    0.007375
2016-03-18    0.007333
2016-03-07    0.005412
2016-03-06    0.004338
2016-03-05    0.001074
Name: last_seen, dtype: float64

In [40]:
# calculate the distribution of values in percentages and in ascending order

clean_autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2016-03-05    0.001074
2016-03-06    0.004338
2016-03-07    0.005412
2016-03-08    0.007375
2016-03-09    0.009626
2016-03-10    0.010618
2016-03-11    0.012374
2016-03-12    0.023797
2016-03-13    0.008862
2016-03-14    0.012622
2016-03-15    0.015865
2016-03-16    0.016443
2016-03-17    0.028073
2016-03-18    0.007333
2016-03-19    0.015824
2016-03-20    0.020637
2016-03-21    0.020616
2016-03-22    0.021380
2016-03-23    0.018592
2016-03-24    0.019748
2016-03-25    0.019191
2016-03-26    0.016815
2016-03-27    0.015617
2016-03-28    0.020885
2016-03-29    0.022331
2016-03-30    0.024747
2016-03-31    0.023839
2016-04-01    0.022868
2016-04-02    0.024851
2016-04-03    0.025202
2016-04-04    0.024500
2016-04-05    0.124935
2016-04-06    0.221591
2016-04-07    0.132021
Name: last_seen, dtype: float64

In the `last_seen` column, there are again 48409 entries. The range of dates in this column starts on **2016-03-05** and ends on **2016-04-07**. According to this column, the day when most ads were last seen online is **2016-04-06 (22.16%)**.

### Date and time: summary
        

| Column | Range of dates | Most frequent day |
| ----------- | ----------- | ----------- |
| **`date_crawled`** | 2016-03-05 to 2016-04-07 | 2016-04-03 (3.86%) |
| **`ad_created`** | 2015-06-11 to 2016-04-07 | 2016-04-03 (3.88%) |
| **`last_seen`** | 2016-03-05 to 2016-04-07 | 2016-04-06 (22.16%) |

### Main conclusions:

- `date_crawled` displays the date the ads were crawled. Ad crawling started in March of 2016 and ended in April of the same year.
- Ads created as early as June 2015 are included in this dataset, as we can see in the `ad_created` column.
- The `last_seen` column contains the date when the crawler last saw the ad online. The day when most ads were last seen online is 2016-04-06 (22.16%).


## 6.4 `registration_year`

In [41]:
clean_autos['registration_year'].describe()

count    48409.000000
mean      2004.774319
std         88.783278
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

As we can see above, the `registration_year` column contains some errors. The maximum value is 9999 and the minimum is 1000, both clearly incorrect.

Given that `registration_year` corresponds to the year in which the car was first registered, we will assume that it cannot be after 2015, the year the first ads were created.
As for the earliest acceptable year for car registration, we will accept any year from 1900 onwards.

In [42]:
# remove the rows where the 'registration_year' values are outside the range we have defined

clean_autos = clean_autos[clean_autos['registration_year'].between(1900,2015)]

In [43]:
clean_autos['registration_year'].describe()

count    45322.000000
mean      2002.577291
std          6.924648
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2015.000000
Name: registration_year, dtype: float64

In [44]:
clean_autos['registration_year'].value_counts(normalize=True)

2000    0.069017
2005    0.064582
1999    0.063678
2004    0.059574
2003    0.059508
          ...   
1939    0.000022
1938    0.000022
1927    0.000022
1931    0.000022
1952    0.000022
Name: registration_year, Length: 77, dtype: float64

As we can see, the minimum value for this column is 1910, so cars on sale were registered as early as that. 
We can also observe that the year **2000** is the year in which the **highest number of cars were registered (6.9%)**. 

In [45]:
# select the top 15 most frequent values 

clean_autos['registration_year'].value_counts(normalize=True, dropna=False)[0:15]

2000    0.069017
2005    0.064582
1999    0.063678
2004    0.059574
2003    0.059508
2006    0.058912
2001    0.058073
2002    0.054720
1998    0.051895
2007    0.050174
2008    0.048828
2009    0.045960
1997    0.042849
2011    0.035810
2010    0.035060
Name: registration_year, dtype: float64

In [46]:
# add the frequencies of the top 15 most frequent values

sum(clean_autos['registration_year'].value_counts(normalize=True, dropna=False)[0:15])

0.7986408366797583

In addition, almost **80%** of all car registrations occurred between **1997** and **2011**. We can conclude from this analysis that, at the time of crawling, the majority of cars on sale on this platform were at least 5 years old.
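
The share of registrations in a range of years can also be computed directly, without summing the top entries of `value_counts()`. The sketch below uses a handful of made-up years; it relies on the fact that the mean of a boolean mask equals the proportion of `True` values.

```python
import pandas as pd

# Made-up registration years for illustration.
years = pd.Series([1995, 1997, 2000, 2000, 2005, 2011, 2012, 2014])

# The mean of a boolean Series is the fraction of rows that satisfy the condition.
share_1997_2011 = years.between(1997, 2011).mean()
print(share_1997_2011)  # 0.625
```

This form does not depend on the most frequent years happening to form a contiguous range.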


## 7. Determining `time_online`

Using the `last_seen` and `ad_created` columns, we can calculate how long on average an ad remains online until the car is sold. We will name this new variable `time_online`.


In [47]:
# convert 'ad_created' column to datetime

clean_autos['ad_created']=pd.to_datetime(clean_autos['ad_created'])

In [48]:
# convert 'last_seen' column to datetime

clean_autos['last_seen']=pd.to_datetime(clean_autos['last_seen'])

In [49]:
clean_autos['ad_created'].describe(datetime_is_numeric=True)

count                            45322
mean     2016-03-20 19:20:18.322235136
min                2015-06-11 00:00:00
25%                2016-03-13 00:00:00
50%                2016-03-21 00:00:00
75%                2016-03-29 00:00:00
max                2016-04-07 00:00:00
Name: ad_created, dtype: object

In [50]:
clean_autos['last_seen'].describe(datetime_is_numeric=True)

count                            45322
mean     2016-03-29 18:43:23.133189120
min                2016-03-05 00:00:00
25%                2016-03-23 00:00:00
50%                2016-04-04 00:00:00
75%                2016-04-06 00:00:00
max                2016-04-07 00:00:00
Name: last_seen, dtype: object

In [51]:
# create a new column with the time elapsed between `ad_created` and `last_seen`

clean_autos['time_online']=clean_autos['last_seen']-clean_autos['ad_created']

In [52]:
clean_autos['time_online'].shape

(45322,)

In [53]:
# calculate the distribution of values in percentages 

clean_autos['time_online'].value_counts(normalize=True, dropna=False)

0 days      0.137373
2 days      0.103327
4 days      0.074313
1 days      0.056772
6 days      0.055889
              ...   
63 days     0.000022
239 days    0.000022
109 days    0.000022
44 days     0.000022
62 days     0.000022
Name: time_online, Length: 67, dtype: float64

In [54]:
# calculate the distribution of values in percentages and in ascending order

clean_autos['time_online'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

0 days      0.137373
1 days      0.056772
2 days      0.103327
3 days      0.049755
4 days      0.074313
              ...   
109 days    0.000022
149 days    0.000022
209 days    0.000022
239 days    0.000022
300 days    0.000022
Name: time_online, Length: 67, dtype: float64

According to this result, **13.7%** of ads were online for **less than a day**. It is likely that these ads were removed by the poster not because the car was sold, but because of an error in the description, the price or something similar.

In [55]:
# select the 14 most frequent values after the first

clean_autos['time_online'].value_counts(normalize=True, dropna=False)[1:15]

2 days     0.103327
4 days     0.074313
1 days     0.056772
6 days     0.055889
3 days     0.049755
8 days     0.048012
9 days     0.036936
7 days     0.034001
11 days    0.033714
5 days     0.033648
10 days    0.026499
13 days    0.025705
12 days    0.024403
14 days    0.023344
Name: time_online, dtype: float64

In [56]:
# add up the frequencies for 1 to 14 days online (values sorted by index, skipping `0 days`)

sum(clean_autos['time_online'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)[1:15])

0.6263183442919554

### Main conclusions:

Excluding the `0 days` value, **62.6%** of the ads were last seen between 1 and **14 days** after their creation. If we take the removal of an ad as a sign that the car was sold, we can infer that most of the used cars announced on the *eBay Kleinanzeigen* platform are sold within 2 weeks of publishing.
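The same share can be computed without slicing the `value_counts` result by position, by converting the timedeltas to whole days first. The following is a minimal sketch on a made-up frame: the `toy` DataFrame and its dates are hypothetical and only stand in for `clean_autos`.

```python
import pandas as pd

# Hypothetical toy data standing in for clean_autos (not the real listings)
toy = pd.DataFrame({
    'ad_created': pd.to_datetime(['2016-03-01', '2016-03-01', '2016-03-10', '2016-03-20']),
    'last_seen': pd.to_datetime(['2016-03-01', '2016-03-05', '2016-04-05', '2016-03-22']),
})

# Time online as a whole number of days
days_online = (toy['last_seen'] - toy['ad_created']).dt.days

# Share of ads per number of days, ordered by day, then accumulated
share_by_day = days_online.value_counts(normalize=True).sort_index()
cumulative_share = share_by_day.cumsum()

# Share of ads that were online between 1 and 14 days (the 0-day ads are left out)
share_1_to_14 = days_online.between(1, 14).mean()

print(cumulative_share)
print(share_1_to_14)
```

Selecting by value with `between` avoids relying on the position of each entry in the sorted result, which would shift if some day count were missing from the data.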

## 8. Exploring mean price by brand

In [57]:
clean_autos['brand'].describe()

count          45322
unique            40
top       volkswagen
freq            9548
Name: brand, dtype: object

In [58]:
clean_autos['brand'].value_counts()

volkswagen        9548
bmw               5034
opel              4803
mercedes_benz     4411
audi              3956
ford              3163
renault           2109
peugeot           1343
fiat              1151
seat               820
skoda              754
nissan             692
mazda              688
smart              643
citroen            636
toyota             580
hyundai            457
sonstige_autos     440
volvo              419
mini               399
mitsubishi         370
honda              355
kia                327
alfa_romeo         302
porsche            279
suzuki             268
chevrolet          261
chrysler           161
dacia              121
daihatsu           115
jeep               104
subaru              97
land_rover          96
saab                76
jaguar              73
daewoo              69
trabant             65
rover               60
lancia              50
lada                27
Name: brand, dtype: int64

In [59]:
clean_autos['brand'].value_counts(normalize=True)

volkswagen        0.210670
bmw               0.111072
opel              0.105975
mercedes_benz     0.097326
audi              0.087287
ford              0.069790
renault           0.046534
peugeot           0.029632
fiat              0.025396
seat              0.018093
skoda             0.016637
nissan            0.015269
mazda             0.015180
smart             0.014187
citroen           0.014033
toyota            0.012797
hyundai           0.010083
sonstige_autos    0.009708
volvo             0.009245
mini              0.008804
mitsubishi        0.008164
honda             0.007833
kia               0.007215
alfa_romeo        0.006663
porsche           0.006156
suzuki            0.005913
chevrolet         0.005759
chrysler          0.003552
dacia             0.002670
daihatsu          0.002537
jeep              0.002295
subaru            0.002140
land_rover        0.002118
saab              0.001677
jaguar            0.001611
daewoo            0.001522
trabant           0.001434
rover             0.001324
lancia            0.001103
lada              0.000596
Name: brand, dtype: float64

We will start by selecting the top 10 most common brands. 

In [60]:
# select the top 10 brands

top_ten_brands=clean_autos['brand'].value_counts().index[0:10]
print(top_ten_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object')


In [61]:
# create an empty dictionary to hold the mean price for each brand

mean_price_brand={}

# loop over each top 10 brand and retrieve the mean price

for b in top_ten_brands:
    brand_rows=clean_autos[clean_autos['brand']==b]
    mean_price=brand_rows['price_dollars'].mean()
    mean_price_brand[b]=mean_price

In [62]:
mean_price_brand

{'volkswagen': 5493.5633640553,
 'bmw': 8431.807310290027,
 'opel': 3029.383302102852,
 'mercedes_benz': 8739.202448424394,
 'audi': 9373.358442871588,
 'ford': 3824.3386025924756,
 'renault': 2500.814129919393,
 'peugeot': 3137.9471332836933,
 'fiat': 2866.012163336229,
 'seat': 4439.796341463415}

### Main conclusions:

Among the top 10 most common brands, there are 3 **luxury brands**: `audi`, `mercedes_benz` and `bmw`. Used cars from these brands have higher prices, costing on average around **9000 dollars**. 

The remaining brands are more affordable, with mean prices ranging from about 2500 to 5500 dollars. 

`fiat` and `renault` are the **cheapest brands** on average, with mean prices below **3000 dollars.** `volkswagen` and `seat` are **middle-range brands**, with mean prices closer to **5000 dollars**. 
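The loop over the top brands can also be written as a single `groupby` aggregation. This is only a sketch of that alternative, not the approach used in this notebook; the `toy` frame and its prices are invented for illustration, and 3 brands are selected instead of 10.

```python
import pandas as pd

# Hypothetical toy listings (values invented for illustration)
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'bmw', 'fiat', 'fiat', 'opel', 'opel', 'lada'],
    'price_dollars': [8000, 9000, 8500, 2500, 3500, 3000, 3200, 1000],
})

# Most common brands; 3 here, where the notebook uses 10
top_brands = toy['brand'].value_counts().index[:3]

# Mean price per brand, restricted to the top brands and sorted by price
mean_price = (
    toy[toy['brand'].isin(top_brands)]
    .groupby('brand')['price_dollars']
    .mean()
    .sort_values(ascending=False)
)

print(mean_price)
```

The result is already a Series indexed by brand, so no intermediate dictionary is needed.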

## 9. Exploring mean price and average mileage 

In [63]:
# create an empty dictionary to hold the mean mileage for each brand

mean_miles_brand={}

# loop over each top 10 brand and retrieve the mean mileage

for b in top_ten_brands:
    brand_rows=clean_autos[clean_autos['brand']==b]
    mean_miles=brand_rows['odometer_km'].mean()
    mean_miles_brand[b]=mean_miles

In [64]:
mean_miles_brand

{'volkswagen': 128441.55844155845,
 'bmw': 132446.36471990464,
 'opel': 129244.22236102435,
 'mercedes_benz': 130671.04964860575,
 'audi': 129246.71385237614,
 'ford': 123951.94435662345,
 'renault': 128013.27643432906,
 'peugeot': 127081.16157855547,
 'fiat': 116589.92180712424,
 'seat': 120865.85365853658}

In [65]:
# convert the mean prices dictionary to a series object, using the series constructor

price_series=pd.Series(data=mean_price_brand)

In [66]:
# convert the mean mileage dictionary to a series object, using the series constructor

miles_series=pd.Series(data=mean_miles_brand)

In [67]:
price_series

volkswagen       5493.563364
bmw              8431.807310
opel             3029.383302
mercedes_benz    8739.202448
audi             9373.358443
ford             3824.338603
renault          2500.814130
peugeot          3137.947133
fiat             2866.012163
seat             4439.796341
dtype: float64

In [68]:
# create a dataframe from the price series using the dataframe constructor

price_miles_dataframe=pd.DataFrame(price_series, columns=['mean_price'])

In [69]:
price_miles_dataframe

|  | mean_price |
|---|---|
| volkswagen | 5493.563364 |
| bmw | 8431.807310 |
| opel | 3029.383302 |
| mercedes_benz | 8739.202448 |
| audi | 9373.358443 |
| ford | 3824.338603 |
| renault | 2500.814130 |
| peugeot | 3137.947133 |
| fiat | 2866.012163 |
| seat | 4439.796341 |


In [70]:
# add the mileage series as a new column in the new dataframe

price_miles_dataframe['mean_mileage']=miles_series

In [71]:
price_miles_dataframe

|  | mean_price | mean_mileage |
|---|---|---|
| volkswagen | 5493.563364 | 128441.558442 |
| bmw | 8431.807310 | 132446.364720 |
| opel | 3029.383302 | 129244.222361 |
| mercedes_benz | 8739.202448 | 130671.049649 |
| audi | 9373.358443 | 129246.713852 |
| ford | 3824.338603 | 123951.944357 |
| renault | 2500.814130 | 128013.276434 |
| peugeot | 3137.947133 | 127081.161579 |
| fiat | 2866.012163 | 116589.921807 |
| seat | 4439.796341 | 120865.853659 |
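Assigning a Series as a new column works here because pandas matches rows by index label (the brand name), not by position. The sketch below illustrates this with three of the brand means from the table, rounded to one decimal; passing a dictionary of Series to the constructor is an alternative to the two-step approach used in the cells above.

```python
import pandas as pd

# Three of the brand means reported above, rounded to one decimal
price_series = pd.Series({'bmw': 8431.8, 'opel': 3029.4, 'fiat': 2866.0})

# Deliberately listed in a different order to show alignment by index label
miles_series = pd.Series({'fiat': 116589.9, 'bmw': 132446.4, 'opel': 129244.2})

# A dict of Series builds both columns at once; rows are matched by label
price_miles = pd.DataFrame({'mean_price': price_series, 'mean_mileage': miles_series})

print(price_miles)
```

Each brand ends up with its own price and mileage even though the two Series list the brands in a different order.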


In [72]:
price_miles_dataframe['mean_mileage'].describe()

count        10.000000
mean     126655.206686
std        4835.453060
min      116589.921807
25%      124734.248662
50%      128227.417438
75%      129246.090980
max      132446.364720
Name: mean_mileage, dtype: float64

### Main conclusions:

- There are no drastic differences in the average mileage between the top 10 brands. All brands display a **mean mileage over 110000 km**, and the standard deviation of the series is below 5000 km.
- The brand with the highest average mileage is `bmw` (**132446.4 km**) and the one with the lowest is `fiat` (**116589.9 km**). The 15856.5 km difference does not seem to justify the difference in mean prices, considering that `bmw` cars are 194.2% more expensive on average than `fiat` cars despite their higher mileage.
- Similarly, we can see that `opel` cars have almost exactly the same mean mileage as `audi` cars, while their average price is 67.7% lower.
- Therefore, we can conclude that, for the top 10 brands, the **average mileage is not likely to explain the differences in the cars' average price**.
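The comparisons in this list can be reproduced from the brand-level table. The sketch below rebuilds that table from the values shown above, rounded to one decimal; the name `df` and the variable names are chosen here for illustration. It also computes a Pearson correlation as one possible extra check, keeping in mind that it rests on only 10 data points.

```python
import pandas as pd

# Brand means copied from the table above (rounded to one decimal)
df = pd.DataFrame(
    {
        'mean_price': [5493.6, 8431.8, 3029.4, 8739.2, 9373.4,
                       3824.3, 2500.8, 3137.9, 2866.0, 4439.8],
        'mean_mileage': [128441.6, 132446.4, 129244.2, 130671.0, 129246.7,
                         123951.9, 128013.3, 127081.2, 116589.9, 120865.9],
    },
    index=['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi',
           'ford', 'renault', 'peugeot', 'fiat', 'seat'],
)

# Brands with the highest and the lowest mean mileage
hi_brand = df['mean_mileage'].idxmax()
lo_brand = df['mean_mileage'].idxmin()

# Mileage gap in km, and how much more expensive the high-mileage brand is (in %)
mileage_gap = df.loc[hi_brand, 'mean_mileage'] - df.loc[lo_brand, 'mean_mileage']
price_gap_pct = (df.loc[hi_brand, 'mean_price'] / df.loc[lo_brand, 'mean_price'] - 1) * 100

# How much lower the mean price of opel is compared to audi (in %)
opel_vs_audi = (1 - df.loc['opel', 'mean_price'] / df.loc['audi', 'mean_price']) * 100

# Pearson correlation between the two brand-level means
r = df['mean_price'].corr(df['mean_mileage'])

print(hi_brand, lo_brand, round(mileage_gap, 1), round(price_gap_pct, 1))
print(round(opel_vs_audi, 1), round(r, 2))
```

Because each brand contributes a single averaged point, a correlation computed this way says little about individual listings; comparing price and mileage row by row in `clean_autos` would be a more direct test.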

## 10. Final summary

We performed several **data cleaning tasks** before analyzing the dataset, such as:
- removing 3 irrelevant columns (`seller`, `offer_type`, and `nr_of_pictures`) that added no valuable information;
- excluding unreasonable entries in the `odometer_km`, `price_dollars` and `registration_year` columns, which we categorized as outliers.
    

Regarding the analysis of the data, we have determined that:
- at the time of crawling, the majority of cars on sale on the *eBay Kleinanzeigen* platform were **more than 5 years old**;
- most of the ads were removed **within 2 weeks** of publishing, which suggests that most of the announced cars were sold in that time;
- among the top 10 most common brands, there were 3 **luxury brands** (`audi`, `mercedes_benz` and `bmw`, average price ~9000 dollars), 2 **middle-range brands** (`volkswagen` and `seat`, average price ~5000 dollars) and 2 **cheaper brands** (`fiat` and `renault`, average price below 3000 dollars);
- the mean mileage of cars produced by the top 10 brands **does not strongly correlate** with the average price and is unlikely to explain the differences in price between brands.
