# eBay Car Sales Data

A dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle.

In [1]:
# Setting up the environment

import re
import numpy as np
import pandas as pd

# Reading in the data

autos = pd.read_csv('autos.csv', encoding='Latin-1')


In [2]:
# Previewing data

print(autos.info())
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Column Observations

Most of the columns were imported as object types. The exceptions, imported as int64, are:
- year of registration
- power (PS)
- month of registration
- number of pictures
- postal code

The number of data points scraped was 50,000. Several columns have fewer than 50,000 non-null values:
- vehicle type
- gearbox
- model
- fuel type
- not repaired damage
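
The gaps listed above can also be counted directly instead of being read off `info()`. The sketch below is a minimal illustration on a made-up three-column frame (`sample` and its values are hypothetical, not taken from `autos.csv`); the same two lines should work on `autos` itself.

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are made up for illustration.
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "brand": ["peugeot", "bmw", "ford", "opel"],
})

# Count missing values per column and express them as a share of all rows.
null_counts = sample.isnull().sum()
null_share = null_counts / len(sample)

summary = pd.DataFrame({"null_count": null_counts, "null_share": null_share})
# Keep only the columns that actually have gaps, worst first.
summary = summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)
print(summary)
```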


In [3]:
# Examining column names
column_name = autos.columns

In [4]:
# Creating a function to clean column names
def clean_col_names(name_string):
    new_string = name_string[0].upper() + name_string[1:]
    name_list = re.findall('[A-Z][^A-Z]*', new_string)
    cleaned_list = [name_list[0].lower()]
    for word in name_list[1:]:
        cleaned_list.append("_" + word.lower())
    return "".join(cleaned_list)
    
# Applying the clean column function to the list of column names
new_col_name = [clean_col_names(x) for x in column_name]

# Getting the column names into a consistent format
autos.columns = new_col_name

# Reassigning confusing column names 
rename_dict = {
    'year_of_registration':'registration_year',
    'month_of_registration':'registration_month',
    "not_repaired_damage":'unrepaired_damage',
    'date_created':'ad_created'
}

autos = autos.rename(rename_dict, axis='columns')

Column names were renamed to provide a consistent syntax in order to reduce confusion and make the columns easier to work with.
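
The function above splits before every capital letter, so `powerPS` becomes `power_p_s`. As an alternative sketch (not what this notebook ran, and `to_snake_case` is a hypothetical name), a single regular expression can insert the underscore only where a lowercase letter or digit is followed by a capital, which keeps runs of capitals together.

```python
import re

def to_snake_case(name):
    """Convert a camelCase name to snake_case, keeping runs of capitals together."""
    # Insert "_" between a lowercase letter or digit and the capital that follows it.
    with_breaks = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return with_breaks.lower()

for original in ["dateCrawled", "yearOfRegistration", "powerPS", "abtest"]:
    print(original, "->", to_snake_case(original))
```

With this variant `powerPS` maps to `power_ps`, so any later reference to `power_p_s` would need to change accordingly.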

## Exploring Data

In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-23 19:38:20,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


## Data Cleaning 

- The seller column is almost entirely made up of the value "privat"
- The offer_type column is also similarly consistent with the value "Angebot"
- Columns that are object type that should be a numeric type:
    - price
    - odometer
    
    

In [6]:
# Looking at the price column 
autos['price'].unique()

array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)

In [7]:
# Removing non-numeric characters (regex=False so "$" is treated as a literal character)
autos['price'] = (autos['price']
                      .str.replace("$", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(int)
                 )

In [8]:
# Looking at the odometer column
autos['odometer'].unique()

array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)

In [9]:
# Removing non-numeric characters 
autos['odometer'] = (autos['odometer']
                         .str.replace("km", "")
                         .str.replace(",", "")
                         .astype(int)
                    )

In [10]:
# Renaming the column so the unit (km) stripped from the values is kept in the name
autos = autos.rename({'odometer':'odometer_km'}, axis="columns")
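
The price and odometer conversions follow the same pattern, so they could be folded into one helper. This is a sketch under assumptions: `to_int_column` is a hypothetical name, the sample strings mimic the formats seen above, and the approach only suits columns of non-negative whole numbers with no missing values (it would also strip a minus sign or decimal point).

```python
import pandas as pd

def to_int_column(series):
    """Strip every non-digit character from a string column and cast to int."""
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Made-up rows in the same format as the raw price and odometer columns.
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

sample["price"] = to_int_column(sample["price"])
# Store the unit in the column name, since it is removed from the values.
sample["odometer_km"] = to_int_column(sample["odometer"])
sample = sample.drop(columns="odometer")
print(sample)
```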

In [11]:
# Checking work
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Descriptive Statistics: Odometer

In [12]:
# Looking at unique values
print(autos['odometer_km'].unique().shape)
# Looking at general statistics for the series
print(autos['odometer_km'].describe())
# Looking at the number of each of the values
print(autos['odometer_km'].value_counts().sort_index(ascending=False))

(13,)
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64


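There do not appear to be any outliers in the odometer column. All 13 distinct values are plausible odometer readings for a car, and they are rounded to fixed steps rather than exact readings. The distribution is heavily skewed toward high mileage: 32,424 of the 50,000 listings report 150,000 km.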

## Descriptive Statistics: Price

In [14]:
# Looking at unique values
print(autos['price'].unique().shape)
# Looking at general statistics for the series
print(autos['price'].describe())
# Looking at the number of each of the values
print(autos['price'].value_counts().sort_index(ascending=False).head(20))
print(autos['price'].value_counts().sort_index(ascending=False).tail(20))

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64
35       1
30       7
29       1
25       5
20       4
18       1
17       3
15       2
14       1
13       2
12       3
11       2
10       7
9        1
8        1
5        2
3        1
2        3
1      156
0     1421
Name: price, dtype: int64


In [18]:
# Removing outliers: keeping prices from 100 to 1,000,000 (between() is inclusive on both ends)
autos = autos[autos["price"].between(100,1000000)]
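
Before applying a filter like this, it can help to see how many rows each bound removes. The sketch below uses a made-up price series (the values are hypothetical) with the same bounds as the cell above.

```python
import pandas as pd

# Made-up prices covering both tails; not taken from the dataset.
prices = pd.Series([0, 1, 99, 100, 2950, 7200, 1000000, 12345678], name="price")

lower, upper = 100, 1000000  # same bounds as the filter above
in_range = prices.between(lower, upper)  # inclusive on both ends by default

kept = int(in_range.sum())
dropped_low = int((prices < lower).sum())
dropped_high = int((prices > upper).sum())
print(f"kept {kept}, dropped {dropped_low} below {lower} and {dropped_high} above {upper}")
```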

## Dates

In [19]:
# Looking at the date values
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [31]:
# Getting just the date
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)

2016-04-03    0.038609
2016-03-20    0.037800
2016-03-21    0.037220
2016-03-12    0.036909
2016-03-14    0.036660
2016-04-04    0.036536
2016-03-07    0.036059
2016-04-02    0.035602
2016-03-28    0.034960
2016-03-19    0.034732
Name: date_crawled, dtype: float64

In [32]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)

2016-04-03    0.038858
2016-03-20    0.037863
2016-03-21    0.037448
2016-04-04    0.036888
2016-03-12    0.036743
2016-03-14    0.035291
2016-04-02    0.035291
2016-03-28    0.035063
2016-03-07    0.034794
2016-03-29    0.034089
Name: ad_created, dtype: float64

In [33]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)

2016-04-06    0.221971
2016-04-07    0.132146
2016-04-05    0.125054
2016-03-17    0.028096
2016-04-03    0.025131
2016-04-02    0.024882
2016-03-30    0.024696
2016-04-04    0.024530
2016-03-31    0.023825
2016-03-12    0.023783
Name: last_seen, dtype: float64

The most frequent crawl dates fall in March and April 2016, and the most frequent ad creation dates match them closely. The "last seen" dates are far more concentrated: April 5, 6, and 7, 2016 together account for nearly half of all listings. Car sites tend to limit how long a posting stays up before it is automatically removed.
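
The cells above show the ten most frequent dates, which hides the overall date range. Sorting by the index instead gives a chronological view. This is a minimal sketch on four made-up timestamps in the same format as `date_crawled`; ISO-formatted date strings sort correctly as plain text, so no datetime conversion is needed here.

```python
import pandas as pd

# Made-up timestamps in the same "YYYY-MM-DD HH:MM:SS" format as the dataset.
sample = pd.DataFrame({
    "date_crawled": [
        "2016-03-26 17:47:46",
        "2016-04-04 13:38:56",
        "2016-03-26 18:57:24",
        "2016-03-12 16:58:10",
    ],
})

# Keep only the date part, then order the shares by date rather than by frequency.
crawl_dates = sample["date_crawled"].str[:10]
share_by_day = crawl_dates.value_counts(normalize=True, dropna=False).sort_index()
print(share_by_day)
print(crawl_dates.min(), "to", crawl_dates.max())
```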

In [35]:
autos['registration_year'].describe()

count    48227.000000
mean      2004.730151
std         87.894768
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

One thing that stands out from the exploration is that the registration_year column contains some odd values:

- The minimum value is 1000, before cars were invented
- The maximum value is 9999, many years into the future


In [37]:
# Looking at irregular registration years
autos[~autos["registration_year"].between(1900,2016)]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
10,2016-03-15 01:41:36,VW_Golf_Tuning_in_siber/grau,privat,Angebot,999,test,,2017,manuell,90,,150000,4,benzin,volkswagen,nein,2016-03-14 00:00:00,0,86157,2016-04-07 03:16:21
65,2016-04-04 19:30:39,Ford_Fiesta_zum_ausschlachten,privat,Angebot,250,control,,2017,manuell,65,fiesta,125000,9,benzin,ford,,2016-04-04 00:00:00,0,65606,2016-04-05 12:22:12
68,2016-04-03 17:36:59,Mini_cooper_s_clubman_/vollausstattung_/_Navi/...,privat,Angebot,10990,test,,2017,manuell,174,clubman,100000,0,,mini,nein,2016-04-03 00:00:00,0,83135,2016-04-05 17:26:26
84,2016-03-27 19:52:54,Renault_twingo,privat,Angebot,900,control,,2018,,60,twingo,150000,0,,renault,,2016-03-27 00:00:00,0,40589,2016-04-05 18:46:49
113,2016-04-03 14:58:29,Golf_4_Anfaenger_auto,privat,Angebot,1200,test,,2017,manuell,75,golf,150000,7,,volkswagen,,2016-04-03 00:00:00,0,97656,2016-04-05 14:15:48
164,2016-03-13 20:39:16,Opel_Meriva__nur_76000_Km__unfallfrei__scheckh...,privat,Angebot,4800,control,,2018,manuell,0,meriva,80000,4,benzin,opel,nein,2016-03-13 00:00:00,0,37627,2016-04-04 16:48:02
197,2016-04-05 10:36:24,VW_Polo_9N_an_Bastler,privat,Angebot,888,control,,2017,manuell,64,polo,20000,7,,volkswagen,ja,2016-04-05 00:00:00,0,58566,2016-04-07 13:16:13
253,2016-03-27 13:25:18,Ford_mondeo_Gas_anlage_mit_TÜV_04.2017,privat,Angebot,2250,test,,2017,manuell,0,mondeo,150000,8,benzin,ford,nein,2016-03-27 00:00:00,0,56575,2016-04-05 15:18:34
348,2016-03-17 20:58:24,VW_Beetle_1.8Turbo_mit_Vollausstattung_und_seh...,privat,Angebot,3750,control,,2017,manuell,150,beetle,150000,7,,volkswagen,nein,2016-03-17 00:00:00,0,45896,2016-03-24 17:17:50
390,2016-03-25 12:59:06,Fiat_Bertone_X_1_9__X_1/9__X19__X_19__X1_9__X_19,privat,Angebot,7750,test,,2018,manuell,76,andere,150000,6,benzin,fiat,nein,2016-03-25 00:00:00,0,78239,2016-03-28 12:16:50


The rows shown all have registration years of 2017 or 2018, later than the 2016 crawl dates, which is not plausible. The same filter also catches years before 1900, such as the minimum of 1000, which predate cars.
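
The cell above only displays the irregular rows. One possible next step, sketched here on made-up data, is to measure what share of listings falls outside the range and then keep only the plausible years. The bounds 1900 and 2016 are the ones used in the inspection cell; whether to drop or correct such rows is a judgment call.

```python
import pandas as pd

# Made-up rows; the years include both implausible extremes.
sample = pd.DataFrame({
    "registration_year": [1000, 1997, 2004, 2016, 2017, 9999],
    "price": [500, 8500, 5000, 12000, 999, 300],
})

# True for rows whose year lies in the plausible range (inclusive).
valid = sample["registration_year"].between(1900, 2016)
share_dropped = 1 - valid.mean()
cleaned = sample[valid]

print(f"share of rows outside 1900-2016: {share_dropped:.2f}")
print(cleaned["registration_year"].value_counts(normalize=True).sort_index())
```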

## Brands

In [41]:
# Getting all car brands, ordered by share of listings
car_brands = autos['brand'].value_counts(normalize=True).index

In [50]:
brand_dict = {}
for car in car_brands:
    brand_subset = autos[autos["brand"]==car]
    brand_dict[car] = brand_subset['price'].mean()
    
brand_dict

{'alfa_romeo': 4054.471875,
 'audi': 9259.510248372317,
 'bmw': 8310.138470341408,
 'chevrolet': 6692.60294117647,
 'chrysler': 3539.9166666666665,
 'citroen': 3783.6788856304984,
 'dacia': 5897.736434108527,
 'daewoo': 1093.6,
 'daihatsu': 1641.2644628099174,
 'fiat': 2815.635782747604,
 'ford': 4053.7575215966635,
 'honda': 4010.4728682170544,
 'hyundai': 5416.23382045929,
 'jaguar': 11844.041666666666,
 'jeep': 11573.638888888889,
 'kia': 5923.288629737609,
 'lada': 2647.7241379310344,
 'lancia': 3240.703703703704,
 'land_rover': 18934.272727272728,
 'mazda': 4075.319293478261,
 'mercedes_benz': 8580.202247191011,
 'mini': 10566.824940047962,
 'mitsubishi': 3414.741116751269,
 'nissan': 4681.94046008119,
 'opel': 2974.764503159104,
 'peugeot': 3086.930281690141,
 'porsche': 46764.2,
 'renault': 2450.9015611448394,
 'rover': 1586.4923076923078,
 'saab': 3183.493670886076,
 'seat': 4348.652792990142,
 'skoda': 6394.309677419355,
 'smart': 3538.344927536232,
 'sonstige_autos': 12575.77

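Among the brands shown above, Porsche has by far the highest mean price (about 46,800), followed by Land Rover (about 18,900), the catch-all sonstige_autos ("other cars") category (about 12,600), and Jaguar and Jeep (both just under 12,000). Seven brands average below 3,000: Daewoo (about 1,100), Rover, Daihatsu, Renault, Lada, Fiat, and Opel.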
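
The loop above builds the dictionary one brand at a time. The same per-brand mean can be computed with `groupby`, which also makes it easy to sort the result. This is a sketch on made-up rows, not the notebook's actual output.

```python
import pandas as pd

# Made-up listings; brands and prices are for illustration only.
sample = pd.DataFrame({
    "brand": ["porsche", "porsche", "opel", "opel", "lada"],
    "price": [40000, 50000, 2000, 4000, 2500],
})

# Mean price per brand, most expensive first.
mean_price_by_brand = (
    sample.groupby("brand")["price"]
          .mean()
          .sort_values(ascending=False)
)
print(mean_price_by_brand)
```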

In [53]:
# Looking at odometer readings compared to price
top_5_car_brands = autos['brand'].value_counts(normalize=True).head(5).index
brand_dict_price = {}
brand_dict_mileage = {}
for car in top_5_car_brands:
    brand_subset = autos[autos["brand"]==car]
    brand_dict_price[car] = brand_subset['price'].mean()
    brand_dict_mileage[car] = brand_subset['odometer_km'].mean()

In [60]:
# Creating a dataframe from the dictionaries
top_5_price = pd.Series(brand_dict_price)
top_5_mileage = pd.Series(brand_dict_mileage)

top_5_df = pd.DataFrame(top_5_price, columns=['mean_price'])
top_5_df['mean_mileage'] = top_5_mileage

In [61]:
top_5_df

Unnamed: 0,mean_price,mean_mileage
audi,9259.510248,129604.533398
bmw,8310.13847,132824.718673
mercedes_benz,8580.202247,131027.441659
opel,2974.764503,129442.848937
volkswagen,5559.672053,129012.946559


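Mean mileage barely varies across the five most common brands (about 129,000 to 133,000 km), while mean price ranges from about 3,000 for Opel to about 9,300 for Audi. At the brand level, then, mileage does not explain the price differences: BMW and Mercedes-Benz have slightly higher mean mileage than Opel yet cost nearly three times as much on average. One possible explanation, not tested here, is ease of repairability.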
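
Comparing five brand averages says little about individual listings. A row-level check would be to compute the correlation between price and odometer reading, and the mean price within each odometer bucket. The sketch below uses made-up rows, so its numbers say nothing about the real dataset; it only shows the calls involved.

```python
import pandas as pd

# Made-up listings; not taken from the dataset.
sample = pd.DataFrame({
    "odometer_km": [5000, 50000, 50000, 100000, 150000, 150000],
    "price": [30000, 15000, 13000, 8000, 3000, 2000],
})

# Pearson correlation between price and mileage across individual listings.
correlation = sample["price"].corr(sample["odometer_km"])

# Mean price within each odometer bucket, ordered by mileage.
mean_price_by_km = sample.groupby("odometer_km")["price"].mean().sort_index()

print(f"correlation: {correlation:.2f}")
print(mean_price_by_km)
```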