# The aim of this project is to clean the data and analyze the included used car listings

We'll work with a dataset of used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website.


##### The data dictionary provided with the data is as follows:

 - dateCrawled - When this ad was first crawled. All field-values are taken from this date.
 - name - Name of the car.
 - seller - Whether the seller is private or a dealer.
 - offerType - The type of listing.
 - price - The price on the ad to sell the car.
 - abtest - Whether the listing is included in an A/B test.
 - vehicleType - The vehicle type.
 - yearOfRegistration - The year in which the car was first registered.
 - gearbox - The transmission type.
 - powerPS - The power of the car in PS.
 - model - The car model name.
 - odometer - How many kilometers the car has driven.
 - monthOfRegistration - The month in which the car was first registered.
 - fuelType - What type of fuel the car uses.
 - brand - The brand of the car.
 - notRepairedDamage - Whether the car has damage that is not yet repaired.
 - dateCreated - The date on which the eBay listing was created.
 - nrOfPictures - The number of pictures in the ad.
 - postalCode - The postal code for the location of the vehicle.
 - lastSeenOnline - When the crawler saw this ad last online.

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv", encoding = "Latin-1")  #reading data

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
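 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object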

### Findings:

- Every column is either int64 or object type
- The dataset contains 50000 rows and 20 columns
- There are NaN values in vehicleType, gearbox, model, fuelType and notRepairedDamage, but no column has more than ~20% null values
- price is object type; we need to convert it to a numeric type
- odometer is object type; we need to convert it to float or int
- We need to convert dateCrawled, dateCreated and lastSeen to datetime objects if we want to work with them as dates
- The column names use camelcase instead of Python's preferred snakecase
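
The missing-value finding can be checked directly rather than read off `info()`. Below is a minimal sketch that computes the share of nulls per column. It uses a tiny made-up frame (`sample` is hypothetical and only stands in for `autos`), so the numbers it prints are not results from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `autos`; values are invented for illustration.
sample = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi", "cabrio", "limousine"],
    "gearbox": ["manuell", "automatik", np.nan, "manuell", "manuell"],
    "brand": ["peugeot", "bmw", "ford", "opel", "audi"],
})

# isnull() yields a boolean frame; the mean of each column is its null fraction.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Running the same two calls on `autos` would show whether any column really stays under the ~20% mark.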

### Convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive

The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
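
As an alternative to writing every name by hand, the mechanical part of the conversion could be done with a regular expression. This is only a sketch: `camel_to_snake` is a hypothetical helper, not part of pandas, and it does not reword anything, so names such as `yearOfRegistration` would still need a manual mapping to become `registration_year`.

```python
import re

def camel_to_snake(name):
    """Insert '_' before an uppercase letter that follows a lowercase
    letter or digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

for original in ["dateCrawled", "yearOfRegistration", "nrOfPictures", "powerPS"]:
    print(original, "->", camel_to_snake(original))
```

Note that this helper lowercases everything, so `powerPS` becomes `power_ps`, and a name with no capitals such as `abtest` is left unchanged.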

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
columns_new = {
    'dateCrawled': 'date_crawled',
    'name' : 'name',
    'seller': 'seller',
    'offerType' : 'offer_type',
    'price' : 'price',
    'abtest' : 'ab_test',
    'vehicleType': 'vehicle_type',
    'yearOfRegistration' : 'registration_year',
    'gearbox':'gearbox',
    'powerPS': 'power_PS',
    'model' : 'model',
    'odometer' : 'odometer',
    'monthOfRegistration' : 'registration_month',
    'fuelType': 'fuel_type',
    'brand' : 'brand',
    'notRepairedDamage' : 'unrepaired_damage',
    'dateCreated' : 'ad_created',
    'nrOfPictures' : 'nr_of_pictures',
    'postalCode' : 'postal_code',
    'lastSeen' : 'last_seen'
}

columns = autos.columns

autos.columns = columns.map(columns_new) #converts column values based on dict in columns_new
    


In [6]:
autos.head(3)  #checking if it worked


Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


In [7]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-25 19:57:10,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Two columns, seller and offer_type, are almost constant: each has only two unique values, and the most frequent one accounts for 49999 of the 50000 rows. They carry no useful information for the analysis, so they can be removed.
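
One way to spot such near-constant columns programmatically is to look at the share of rows taken by the most frequent value. The sketch below assumes a made-up frame and an arbitrary 85% threshold; both are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Invented data: 'seller' is dominated by one value, 'brand' is not.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["dealer"],
    "brand": ["bmw", "audi", "ford", "opel", "bmw",
              "fiat", "seat", "audi", "ford", "opel"],
})

# Share of rows held by the most frequent value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns where one value covers at least 85% of rows (assumed threshold).
near_constant = top_share[top_share >= 0.85].index.tolist()
print(near_constant)
```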

In [8]:
autos.drop(labels = 'seller', axis = 1,inplace=True)
autos.drop(labels = 'offer_type', axis = 1,inplace=True)
autos.head(2)

Unnamed: 0,date_crawled,name,price,ab_test,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08


 - #### Convert price to integer and update it in the DataFrame
 - #### Convert odometer to integer and update the DataFrame
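
The two conversions listed above can be sketched on a few made-up values in the same format as the raw columns. `strip_to_int` is a hypothetical helper; it removes a literal token with `regex=False`, so that `$` is treated as a plain character rather than a regex anchor.

```python
import pandas as pd

# Made-up values in the same string format as the raw columns.
sample = pd.DataFrame({
    "price": ["$5,000", "$0", "$1,350"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

def strip_to_int(series, token):
    """Remove a literal token and thousands separators, then cast to int."""
    cleaned = (series.str.replace(token, "", regex=False)
                     .str.replace(",", "", regex=False))
    return cleaned.astype(int)

sample["price"] = strip_to_int(sample["price"], "$")
sample["odometer_km"] = strip_to_int(sample["odometer"], "km")
sample = sample.drop(columns="odometer")
print(sample.dtypes)
```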


In [9]:
autos['price'] = autos['price'].str.strip('$').str.replace(',','').astype(int)
autos['price']

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int32

In [10]:
autos['odometer'] = autos['odometer'].str.strip('km').str.replace(',','').astype(int)
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
autos['odometer_km']

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int32

### Let's look for data that doesn't look right. We'll start by analyzing the odometer_km and price columns.

We will analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.

In [11]:
autos['price'].value_counts().sort_index(ascending=False).head(12)


99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
Name: price, dtype: int64

In [12]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [13]:
autos['odometer_km'].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In [14]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

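From this we see that price has some unusual and unrealistic values. We should drop values of 999,990 and above, and probably values below 100 as well. Since no listing is priced between 350,000 and 999,990, an upper bound of 500,000 removes exactly those high values.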
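
Before filtering, it can help to count how many rows a candidate range would keep and drop. The sketch below uses a handful of invented prices and assumed bounds of 100 and 500,000; `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

# Invented prices that mimic the pattern seen above.
prices = pd.Series([0, 1, 50, 100, 2950, 7200, 350000, 999990, 12345678])

low, high = 100, 500000  # assumed bounds, chosen for illustration
in_range = prices.between(low, high)  # inclusive on both ends

print("kept:", int(in_range.sum()))
print("dropped:", int((~in_range).sum()))
```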

In [15]:
autos = autos[autos["price"].between(100, 500000)]  # keep prices from 100 to 500,000 (inclusive)

In [16]:
autos['price'].describe()  #checking to see how statistics changed

count     48224.000000
mean       5930.371433
std        9078.372762
min         100.000000
25%        1250.000000
50%        3000.000000
75%        7499.000000
max      350000.000000
Name: price, dtype: float64

### Let's calculate the distribution of values in the date_crawled, ad_created, and last_seen columns (all string columns) as percentages

In [17]:
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index() 
#To include missing values in the distribution and to use percentages instead of counts

2016-03-05 14:06:30    0.000021
2016-03-05 14:06:40    0.000021
2016-03-05 14:07:04    0.000021
2016-03-05 14:07:08    0.000021
2016-03-05 14:07:21    0.000021
                         ...   
2016-04-07 14:30:09    0.000021
2016-04-07 14:30:26    0.000021
2016-04-07 14:36:44    0.000021
2016-04-07 14:36:55    0.000021
2016-04-07 14:36:56    0.000021
Name: date_crawled, Length: 46571, dtype: float64

The listings were crawled between March 5, 2016 and April 7, 2016.
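
Because each value is a full timestamp, nearly every entry is unique and the per-value shares say little. A minimal sketch of grouping by calendar day instead, on made-up strings in the same `YYYY-MM-DD HH:MM:SS` format, keeps only the first 10 characters before counting:

```python
import pandas as pd

# Made-up timestamps in the same string format as date_crawled.
crawled = pd.Series([
    "2016-03-05 14:06:30",
    "2016-03-05 14:07:04",
    "2016-03-12 16:58:10",
    "2016-04-07 14:36:56",
])

# The first 10 characters are the date, so slicing groups rows by day.
by_day = (crawled.str[:10]
                 .value_counts(normalize=True, dropna=False)
                 .sort_index())
print(by_day)
```

Applied to `autos['date_crawled']`, the same slice would give one share per crawl day rather than one per second.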

In [18]:
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11 00:00:00    0.000021
2015-08-10 00:00:00    0.000021
2015-09-09 00:00:00    0.000021
2015-11-10 00:00:00    0.000021
2015-12-05 00:00:00    0.000021
                         ...   
2016-04-03 00:00:00    0.038860
2016-04-04 00:00:00    0.036890
2016-04-05 00:00:00    0.011799
2016-04-06 00:00:00    0.003256
2016-04-07 00:00:00    0.001244
Name: ad_created, Length: 76, dtype: float64

The ads were created between June 11, 2015 and April 7, 2016. Most were created in 2016.
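
The date range and the share of ads per year can also be read off after parsing the strings with `pd.to_datetime`. The sketch below uses three invented values and an explicit format string, which is an assumption based on how the column appears in the output above.

```python
import pandas as pd

# Invented values in the same format as ad_created.
created = pd.Series([
    "2015-06-11 00:00:00",
    "2016-04-03 00:00:00",
    "2016-04-07 00:00:00",
])

# Parse the strings into datetime64 values.
created_dt = pd.to_datetime(created, format="%Y-%m-%d %H:%M:%S")

print(created_dt.min().date(), "to", created_dt.max().date())

# Fraction of ads created in 2016.
share_2016 = (created_dt.dt.year == 2016).mean()
print(share_2016)
```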

In [19]:
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05 14:45:46    0.000021
2016-03-05 14:46:02    0.000021
2016-03-05 14:49:34    0.000021
2016-03-05 15:16:11    0.000021
2016-03-05 15:16:47    0.000021
                         ...   
2016-04-07 14:58:44    0.000062
2016-04-07 14:58:45    0.000021
2016-04-07 14:58:46    0.000021
2016-04-07 14:58:48    0.000062
2016-04-07 14:58:50    0.000062
Name: last_seen, Length: 38232, dtype: float64

#### Now let's find out the distribution of the 'registration_year' column

In [20]:
autos['registration_year'].describe()

count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

One thing that stands out from the exploration is that the registration_year column contains some odd values:

 - The minimum value is 1000, before cars were invented
 - The maximum value is 9999, many years into the future
 
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
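
The reasoning above can be sketched in code: take the upper bound from the latest year in `last_seen`, pick a lower bound, and measure how many rows fall outside. The values below are invented, and the 1900 cutoff is an assumption rather than something the data dictates.

```python
import pandas as pd

# Invented registration years and last_seen strings.
years = pd.Series([1000, 1938, 1999, 2004, 2016, 2017, 9999])
last_seen = pd.Series(["2016-04-06 06:45:54", "2016-03-15 03:16:28"])

# Latest year in which the crawler saw any ad.
upper = int(last_seen.str[:4].astype(int).max())
lower = 1900  # assumed cutoff for the earliest plausible registration

outside = ~years.between(lower, upper)
print(f"{outside.mean():.1%} of rows fall outside {lower}-{upper}")
```

If the share is small on the real data, dropping those rows costs little.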

#### Let's remove the values outside those upper and lower bounds and calculate the distribution of the remaining values

In [21]:
autos = autos[autos["registration_year"].between(1900,2016)]
autos['registration_year'].value_counts(normalize=True)

2000    0.066966
2005    0.062802
1999    0.062112
2004    0.058228
2003    0.058099
          ...   
1938    0.000022
1939    0.000022
1953    0.000022
1943    0.000022
1952    0.000022
Name: registration_year, Length: 78, dtype: float64

## Exploring Price by Brand

In [55]:
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


In [33]:
autos['brand'].value_counts().head(10)  #10 most popular brands

volkswagen       9799
bmw              5107
opel             4971
mercedes_benz    4480
audi             4022
ford             3237
renault          2182
peugeot          1384
fiat             1187
seat              846
Name: brand, dtype: int64


German manufacturers account for all of the top five brands, which together make up a little over 60% of the listings. Volkswagen is by far the most popular brand, with nearly as many cars for sale as the next two brands (BMW and Opel) combined.

There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.

In [57]:
mean_price = {}

for n in common_brands:
    mean_price[n] = round(autos.loc[autos['brand'] == n, 'price'].mean())
    
mean_price  #mean price for each brands


{'volkswagen': 5437,
 'bmw': 8382,
 'opel': 3005,
 'mercedes_benz': 8673,
 'audi': 9381,
 'ford': 3779}

Among the top 6 brands, there is a distinct price gap:

 - Audi, BMW and Mercedes Benz are more expensive
 - Ford and Opel are less expensive
 - Volkswagen is in between, which may explain its popularity: it may be a 'best of both worlds' option

## Exploring Mileage

Let's use aggregation to understand the average mileage for those cars and whether there's any visible link with mean price.
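
The same aggregation can also be expressed with `groupby`, which computes both means in one pass instead of looping over brands. This is a sketch on an invented frame; `common` stands in for `common_brands`, and the numbers are not results from the real data.

```python
import pandas as pd

# Invented listings for illustration.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "fiat"],
    "price": [8000, 9000, 3000, 2000, 1500],
    "odometer_km": [150000, 125000, 150000, 100000, 90000],
})
common = ["bmw", "opel"]  # stand-in for common_brands

# Filter to the common brands, then average both columns per brand.
summary = (sample[sample["brand"].isin(common)]
           .groupby("brand")[["price", "odometer_km"]]
           .mean()
           .round()
           .rename(columns={"price": "mean_price",
                            "odometer_km": "mean_mileage"}))
print(summary)
```

The loop-based version used in this notebook gives the same figures; `groupby` just returns them as a DataFrame directly.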

In [60]:
mean_mileage = {}

for n in common_brands:
    mean_mileage[n] = round(autos.loc[autos['brand'] == n, 'odometer_km'].mean())
    
mean_mileage #calculating mean mileage for 6 top car brands

{'volkswagen': 128800,
 'bmw': 132695,
 'opel': 129384,
 'mercedes_benz': 131026,
 'audi': 129245,
 'ford': 124277}

In [66]:
mp_series = pd.Series(mean_price)  # Series built from the mean price dictionary
Mileage = pd.DataFrame(mp_series, columns=["mean_price"])  # one-column DataFrame indexed by brand
Mileage

Unnamed: 0,mean_price
volkswagen,5437
bmw,8382
opel,3005
mercedes_benz,8673
audi,9381
ford,3779


In [68]:
mm_series = pd.Series(mean_mileage)
Mileage['mean_mileage'] = mm_series

Mileage

Unnamed: 0,mean_price,mean_mileage
volkswagen,5437,128800
bmw,8382,132695
opel,3005,129384
mercedes_benz,8673,131026
audi,9381,129245
ford,3779,124277


Mean mileage varies far less by brand than mean price does: the values for the top brands all fall within 10% of each other. There is a slight tendency for the more expensive brands to have higher mileage and the less expensive ones to have lower mileage.
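
Both observations can be quantified from the summary table. The sketch below re-enters the six rows shown above by hand, then computes the relative spread of mean mileage and the Pearson correlation between the two columns. With only six brands the coefficient is a rough indication at best.

```python
import pandas as pd

# Values copied from the summary table above.
summary = pd.DataFrame(
    {
        "mean_price": [5437, 8382, 3005, 8673, 9381, 3779],
        "mean_mileage": [128800, 132695, 129384, 131026, 129245, 124277],
    },
    index=["volkswagen", "bmw", "opel", "mercedes_benz", "audi", "ford"],
)

# Relative spread: gap between highest and lowest mean mileage, over the lowest.
mileage = summary["mean_mileage"]
spread = (mileage.max() - mileage.min()) / mileage.min()
print(f"mileage spread: {spread:.1%}")

# Pearson correlation between mean price and mean mileage.
r = summary["mean_price"].corr(summary["mean_mileage"])
print(f"correlation: {r:.2f}")
```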