# Analysis of used cars listed on eBay Kleinanzeigen
## Introduction
This project aims at analyzing a data set containing information about 50,000 used cars listed on eBay Kleinanzeigen, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data) by Orges Leka. The version we are working with is a sample of 50,000 data points prepared by [Dataquest](https://www.dataquest.io), which also deliberately dirtied the data to simulate a less-cleaned version of the original.

The data dictionary provided with the data is as follows:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle Type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.

In [1]:
import numpy as np
import pandas as pd
import re

pd.set_option('display.max_rows', 100)

# Import the data set from a csv file
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [2]:
# Print a summary of the data set
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [3]:
# List columns with null values
autos.columns[autos.isnull().any().values]

Index(['vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage'], dtype='object')

Our dataset contains 20 columns, most of which are stored as strings. The columns `vehicleType`, `gearbox`, `model`, `fuelType` and `notRepairedDamage` have null values. We'll start by cleaning the column names to make the data easier to work with.

## Clean columns
The column names use camelCase instead of Python's preferred snake_case. We will convert them to snake_case and also rename some columns to more descriptive names.

In [4]:
autos.rename({"yearOfRegistration": "registration_year",
             "monthOfRegistration": "registration_month",
             "notRepairedDamage": "unrepaired_damage",
             "dateCreated": "ad_created",
             "nrOfPictures": "num_photos"},
             axis=1, inplace=True)

autos.columns = [re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower() for col in autos.columns]

autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen'],
      dtype='object')
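The conversion above relies on a regular expression with two zero-width assertions: `(?<!^)` skips the start of the string and `(?=[A-Z])` matches the position just before each uppercase letter. The short sketch below applies the same pattern to a few of the original column names. The helper name `to_snake_case` is ours, introduced only for illustration. It also shows why `powerPS` ends up as `power_p_s`: each consecutive capital letter gets its own underscore.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert "_" before every uppercase letter that is not at the start, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# A few of the original column names from the data set
examples = ["dateCrawled", "offerType", "powerPS", "abtest"]
converted = {name: to_snake_case(name) for name in examples}

for original, snake in converted.items():
    print(f"{original} -> {snake}")
```

If a name such as `power_ps` is preferred, one option is to add `powerPS` to the explicit `rename` mapping before applying the regular expression.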

Now let's check whether there are text columns where all or almost all values are the same.

In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-23 19:38:20,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In the columns `seller` and `offer_type`, all values except one are the same, and so we can remove these columns. The `num_photos` column also looks odd, and so we'll investigate this further.

In [6]:
autos["num_photos"].value_counts()

0    50000
Name: num_photos, dtype: int64

The column `num_photos` will also be removed since all its values are equal to zero.

In [7]:
# Remove the columns seller, offer_type and num_photos
autos.drop(["num_photos", "seller", "offer_type"], axis=1, inplace=True)
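The manual inspection above could also be automated. The following is a minimal sketch, not part of the original analysis: the helper `near_constant_columns` and its `threshold` of 99.9% are assumptions chosen for illustration, and the toy data frame only mimics the structure of `autos`.

```python
import pandas as pd


def near_constant_columns(df: pd.DataFrame, threshold: float = 0.999) -> list:
    """Return the columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the most common value
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged


# Toy data (hypothetical) with two constant columns and one varied column
toy = pd.DataFrame({
    "seller": ["privat"] * 4,
    "num_photos": [0, 0, 0, 0],
    "brand": ["bmw", "audi", "bmw", "opel"],
})

flagged = near_constant_columns(toy)
print(flagged)
```

On the full data set, `seller` and `offer_type` would pass this threshold as well, since 49,999 of 50,000 values are identical.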

The columns `price` and `odometer` are stored as text instead of numeric values. We need to remove the non-numeric characters and convert to a numeric type.

In [8]:
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)  # treat "$" literally, not as a regex anchor
                  .str.replace(",", "")
                  .astype(int)
                 )

autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "")
                     .str.replace(",", "")
                     .astype(int)
                    )
# Note: the column keeps the name "odometer"; the cells below rely on it.
#autos.dtypes
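As an alternative to chaining several literal replacements, all non-digit characters can be stripped in one pass. This is only a sketch on a few sample values taken from the table above, and it assumes the columns contain no negative numbers or decimal fractions, which would be mangled by this approach.

```python
import pandas as pd

# Sample raw values in the same format as the price and odometer columns
raw = pd.Series(["$5,000", "$8,990", "150,000km", "5,000km"])

# \D matches any character that is not a digit
cleaned = raw.str.replace(r"\D", "", regex=True).astype(int)
print(cleaned.tolist())
```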

We will now explore the values in the columns `price` and `odometer`.

In [9]:
autos[["price","odometer"]].describe()

Unnamed: 0,price,odometer
count,50000.0,50000.0
mean,9840.044,125732.7
std,481104.4,40042.211706
min,0.0,5000.0
25%,1100.0,125000.0
50%,2950.0,150000.0
75%,7200.0,150000.0
max,100000000.0,150000.0


The highest price is close to $100 million, which is clearly unrealistic. Let's look at the highest prices in more detail.

In [10]:
autos["price"].nlargest(20)

39705    99999999
42221    27322222
27371    12345678
39377    12345678
47598    12345678
2897     11111111
24384    11111111
11137    10000000
47634     3890000
7814      1300000
22947     1234566
514        999999
43049      999999
37585      999990
36818      350000
14715      345000
34723      299000
35923      295000
12682      265000
47337      259000
Name: price, dtype: int32

The prices seem to gradually increase up to $350,000 and then jump up to less realistic numbers. We can also see that there are zero values.

In [11]:
sum(autos["price"] == 0)

1421

We will remove the cars with a price equal to zero or greater than $350,000.

In [12]:
autos = autos[~((autos["price"] == 0) | (autos["price"] > 350000))]
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
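The same filter can be written with `Series.between`, which is inclusive on both ends by default. The sketch below uses a handful of made-up prices and checks that both masks agree. The equivalence assumes non-negative integer prices, as in this data set.

```python
import pandas as pd

# Hypothetical prices covering the edge cases of the filter
prices = pd.Series([0, 1, 3000, 350000, 999999, 12345678])

# Mask as written in the cell above
mask_original = ~((prices == 0) | (prices > 350000))

# Equivalent mask for non-negative integer prices
mask_between = prices.between(1, 350000)

kept = prices[mask_between].tolist()
print(kept)
```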

## Explore the columns with dates
There are a number of columns with date information:

- `date_crawled`
- `registration_month`
- `registration_year`
- `ad_created`
- `last_seen`

In [13]:
autos[["date_crawled","registration_month","registration_year","ad_created","last_seen"]].head()

Unnamed: 0,date_crawled,registration_month,registration_year,ad_created,last_seen
0,2016-03-26 17:47:46,3,2004,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,6,1997,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,7,2009,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,6,2007,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,7,2003,2016-04-01 00:00:00,2016-04-01 14:38:50


The `date_crawled`, `last_seen`, and `ad_created` columns are stored as strings. The `registration_month` and `registration_year` are stored as numeric values. We'll explore each of these columns to learn more about the listings.

In [14]:
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

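It seems that the site was crawled daily for about a month. On most days the share of crawled listings is between 3% and 4%, with a few days (for example 2016-03-06, 2016-03-13, 2016-03-18 and the last three days) noticeably lower.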
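Slicing the first ten characters works because every timestamp has the same fixed format. A sketch of an alternative is to parse the strings with `pd.to_datetime` and group by calendar date, which also makes later date arithmetic possible. The sample values are taken from the first rows of the data set; the variable names are ours.

```python
import pandas as pd

# Three timestamps in the same format as the date_crawled column
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse the strings into datetime64 values using the known format
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Share of rows per calendar day, in chronological order
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```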

In [15]:
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
2016-02-22    0.000021
2016-02-23    0.000082
2016-02-24    0.000041
2016-02-25    0.000062
2016-02-26    0.000041
2016-02-27    0.000124
2016-02-28    0.000206
2016-02-29    0.000165
2016-03-01    0.000103
2016-03-02    0.000103
2016-03-03    0.000865
2016-03-04    0.001483
2016-03-05    0.022897
2016-03-06 

The ad creation dates span a wide range, with the oldest ad dating back to June 2015. The site crawling started on 2016-03-05, and most ads were created within 1-2 months of the date they were crawled; only a handful are older.

In [16]:
last_seen = autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(last_seen)
print(last_seen[:-3].max())

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64
0.02808607021517554


The last seen date allows us to determine on what day a listing was removed, presumably because the car was sold. Apart from the last three dates, no single day accounts for more than 2.8% of the listings. The last three days, however, contain a disproportionate share of 'last seen' values (roughly 12%, 22% and 13%). This spike most likely reflects the end of the crawling period rather than a sudden increase in sales.

In [17]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year in which a car was first registered is a good indicator of its age. The extremes, 1000 and 9999, do not make sense. To find the latest valid registration year, we can compare with the date the ad was last seen: a car cannot be first registered after its listing was last seen, so any year above 2016 is invalid. Determining the earliest valid year is more complicated. In 1885, Karl Benz developed a petrol (gasoline) powered automobile.[[3](https://en.wikipedia.org/wiki/History_of_the_automobile#cite_note-3)] This is also considered to be the first "production" vehicle, as Benz made several identical copies. In 1908, the Ford Model T, created by the Ford Motor Company, began production and would become the first automobile to be mass-produced on a moving assembly line.[[4](https://en.wikipedia.org/wiki/History_of_the_automobile#cite_note-History.com-4)] Let's count the listings whose registration year falls outside the 1900-2016 period to see whether it is safe to remove those rows entirely or whether we need a more detailed analysis.

In [18]:
is_in_interval = autos["registration_year"].between(1900,2016)
sum(~is_in_interval) / autos.shape[0]

0.038793369710697
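Instead of hard-coding 2016, the upper bound could be derived from the `last_seen` column itself. The following sketch does this on a toy data frame with made-up registration years. The lower bound of 1900 is the same assumption as in the cell above, and the variable names are introduced here for illustration only.

```python
import pandas as pd

# Toy data (hypothetical) with plausible and implausible registration years
toy = pd.DataFrame({
    "registration_year": [1000, 1997, 2009, 2016, 2018, 9999],
    "last_seen": ["2016-04-06 06:45:54"] * 6,
})

# The latest year in which any listing was seen bounds the registration year
latest_year = int(toy["last_seen"].str[:4].astype(int).max())
earliest_year = 1900  # assumption: no listed car predates 1900

is_valid = toy["registration_year"].between(earliest_year, latest_year)
share_removed = (~is_valid).mean()
print(latest_year, share_removed)
```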

Since the listings with implausible registration years represent less than 4% of the data set, we will remove them and then look at the distribution of the remaining registration years.

In [19]:
autos = autos[is_in_interval]

autos["registration_year"].value_counts(normalize=True).head(10)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
Name: registration_year, dtype: float64

## Exploring price and mileage by brand

In [20]:
top_brands_share=autos["brand"].value_counts(normalize=True).head(20)
print(top_brands_share)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
Name: brand, dtype: float64


The top five brands are German and together represent more than 60% of the listings. Many of the remaining brands account for only a small share of the listings, so we will limit our analysis to brands representing more than 5% of the total.
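As a quick arithmetic check of these two statements, the shares printed above can be summed directly. The values below are copied from the output of the previous cell; only the first seven brands are included because the rest are all below 5%.

```python
# Brand shares copied from the value_counts output above
shares = {
    "volkswagen": 0.211264,
    "bmw": 0.110045,
    "opel": 0.107581,
    "mercedes_benz": 0.096463,
    "audi": 0.086566,
    "ford": 0.069900,
    "renault": 0.047150,
}

# Combined share of the five most common brands
top_five_share = sum(sorted(shares.values(), reverse=True)[:5])

# Brands that pass the 5% threshold used in the analysis
above_threshold = [brand for brand, share in shares.items() if share > 0.05]

print(round(top_five_share, 6))
print(above_threshold)
```

The five largest brands add up to about 61% of the listings, and six brands pass the 5% threshold.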

In [21]:
top_brands = top_brands_share[top_brands_share > 0.05].index

price_mileage = pd.DataFrame(index=top_brands, columns=["price","mileage"])

for brand in top_brands:

    price = autos.loc[autos["brand"] == brand,"price"].mean()
    mileage = autos.loc[autos["brand"] == brand,"odometer"].mean()
    
    price_mileage.loc[brand] = [price,mileage]

price_mileage.sort_values(['price'], inplace=True)

print(price_mileage)

                 price mileage
opel           2975.24  129310
ford           3749.47  124266
volkswagen     5402.41  128707
bmw            8332.82  132573
mercedes_benz  8628.45  130788
audi           9336.69  129157
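The loop above can also be expressed with `groupby`, which computes both means in one step and returns numeric columns rather than object columns. This is a sketch on a small made-up data frame, not a replacement for the cell above; the prices and mileages are hypothetical.

```python
import pandas as pd

# Toy listings (hypothetical values) with the same column names as autos
toy = pd.DataFrame({
    "brand": ["opel", "opel", "audi", "audi", "fiat"],
    "price": [2000, 4000, 9000, 11000, 1500],
    "odometer": [150000, 125000, 100000, 150000, 90000],
})
top_brands = ["opel", "audi"]

# Keep only the selected brands, average per brand, then sort by mean price
summary = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")[["price", "odometer"]]
    .mean()
    .rename(columns={"odometer": "mileage"})
    .sort_values("price")
)
print(summary)
```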


On average, Opel and Ford are the most affordable brands, while BMW, Mercedes-Benz and Audi are the most expensive. Volkswagen sits in between, which might partly explain its popularity as a 'best of both worlds' option. The average mileage varies much less than the average price: all six brands fall between roughly 124,000 and 133,000 km. BMW and Mercedes-Benz have slightly higher mileages than the cheaper brands, but the trend is weak.