### Project: Exploring eBay Car Sales Data

**Project Description**
This project cleans and analyzes a sample dataset from the classified ads section of the German eBay website. The data was originally scraped and uploaded to Kaggle by user **orgesleka**.
The original dataset is no longer available on Kaggle but can still be accessed [here](https://data.world/data-society/used-cars-data).

The main tools in this project will be pandas and numpy.

In [2]:
import numpy as np
import pandas as pd
import csv


In [57]:
# load the dataset into pandas
# Note: reading the file as UTF-8 does not work for this dataset, so Latin-1 is used
autos = pd.read_csv('autos.csv', encoding='Latin-1')
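
When the encoding of a file is not known in advance, one option is to try a short list of candidates in order. The sketch below is only an illustration on a tiny in-memory sample, not the real `autos.csv`; the helper name `read_csv_with_fallback` and the sample bytes are assumptions made for this example. Note that Latin-1 can decode any byte sequence, so it only makes sense as the last candidate.

```python
import io
import pandas as pd

def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn, return (DataFrame, encoding used)."""
    for enc in encodings:
        try:
            # Build a fresh buffer for every attempt, since a failed read consumes it.
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# Toy data (assumption): 'Ü' encoded as Latin-1 is not valid UTF-8.
sample = "name,price\nFord_Focus_TÜV_neu,1350\n".encode("latin-1")
df, used = read_csv_with_fallback(sample)
print(used)
print(df)
```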

In [58]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [14]:
autos.tail()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07
49999,2016-03-14 00:42:12,Opel_Vectra_1.6_16V,privat,Angebot,"$1,250",control,limousine,1996,manuell,101,vectra,"150,000km",1,benzin,opel,nein,2016-03-13 00:00:00,0,45897,2016-04-06 21:18:48


In [60]:
# make a copy of the dataset
autos_copy = autos.copy()

In [61]:
# check dataset info
autos_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

### Some columns contain NaN values, but none has more than 20% of its values missing
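
The share of missing values can also be computed directly instead of being read off the `info()` output. For instance, `notRepairedDamage` has 40171 non-null values out of 50000, so about 19.7% of its values are missing. A minimal sketch on a toy DataFrame (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (assumption), standing in for the real dataset.
toy = pd.DataFrame({
    "brand": ["bmw", "opel", "audi", "fiat", "ford"],
    "gearbox": ["manuell", None, "automatik", "manuell", "manuell"],
    "model": ["7er", np.nan, np.nan, "500", "focus"],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would be `autos_copy.isnull().mean()`.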

In [73]:
# printing column list
autos_copy.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuet_type', 'brand',
       'not_repaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

### Rename columns, switching from camelCase to snake_case to make them easier to read

In [74]:
# rename columns from camelCase to snake_case (a few get more descriptive names)
autos_copy.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'vehicleType':'vehicle_type',
                   'yearOfRegistration':'registration_year',
                   'monthOfRegistration':'registration_month',
                   'fuelType':'fuet_type','notRepairedDamage':'unrepaired_damage',
                   'dateCreated':'ad_created', 'nrOfPictures':'num_pictures',
                   'postalCode':'postal_code', 'lastSeen':'last_seen'}, axis=1, inplace = True)

In [77]:
# print corrected column list
autos_copy.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuet_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

**Note:** We converted the original camelCase column names to snake_case to make them easier to read and process; for example, 'dateCrawled' became 'date_crawled'. Note that 'fuelType' was renamed to 'fuet_type', a typo for 'fuel_type' that carries through the outputs below. We also gave the following columns new names:

    - yearOfRegistration to registration_year
    - monthOfRegistration to registration_month
    - notRepairedDamage to unrepaired_damage
    - dateCreated to ad_created
    - nrOfPictures to num_pictures
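
For the columns that only change case style, the conversion can be automated rather than typed out by hand. The following is a rough sketch, not the approach used in this notebook; `camel_to_snake` is a hypothetical helper, and columns that get entirely new names would still need an explicit mapping.

```python
import re

def camel_to_snake(name):
    """Hypothetical helper: convert a camelCase identifier to snake_case."""
    # Insert '_' before a capitalized word, e.g. 'dateCrawled' -> 'date_Crawled'.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Insert '_' between a lowercase letter or digit and a following capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()

for col in ["dateCrawled", "offerType", "postalCode", "powerPS", "abtest"]:
    print(col, "->", camel_to_snake(col))
```

With a DataFrame, this could be applied as `df.columns = [camel_to_snake(c) for c in df.columns]`.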



## Initial Data Exploration and Cleaning

In [79]:
autos_copy.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuet_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 15:49:30,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Initial observations:
1. Two text columns have only two unique values, and almost every row shares the same one (freq is 49999 out of 50000):
    - seller
    - offer_type

2. The num_pictures column looks odd: its mean, std, min, and quartiles are all 0.

In [80]:
print(autos_copy['num_pictures'].value_counts().head(20))
print()
print(autos_copy['seller'].value_counts().head(20))
print()
print(autos_copy['offer_type'].value_counts().head(20))


0    50000
Name: num_pictures, dtype: int64

privat        49999
gewerblich        1
Name: seller, dtype: int64

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64


#### Every row in num_pictures is 0, and all but one row in each of seller and offer_type share the same value, so these 3 columns are not useful for our analysis and we will remove them
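
Columns like these can also be found programmatically by checking how much of each column is covered by its most frequent value. This is a sketch under assumptions: the helper name `near_constant_columns`, the 95% threshold, and the toy data are all made up for illustration.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """Hypothetical helper: columns whose top value covers at least `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the most frequent value (NaN counted as a value).
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data (assumption) mimicking the pattern seen above.
toy = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],
    "num_pictures": [0] * 100,
    "brand": ["bmw", "opel"] * 50,
})
print(near_constant_columns(toy))
```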

In [84]:
autos_copy = autos_copy.drop(['seller', 'num_pictures', 'offer_type'], axis=1)

In [85]:
autos_copy.columns

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'powerPS', 'model', 'odometer',
       'registration_month', 'fuet_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

## Explore Price and Odometer Columns

In [87]:
autos_copy.odometer.value_counts()

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

In [89]:
autos_copy.odometer.head(20)

0     150,000km
1     150,000km
2      70,000km
3      70,000km
4     150,000km
5     150,000km
6     150,000km
7     150,000km
8     150,000km
9     150,000km
10    150,000km
11    150,000km
12     50,000km
13    150,000km
14    150,000km
15     80,000km
16    150,000km
17    150,000km
18    150,000km
19    150,000km
Name: odometer, dtype: object

In [90]:
# Remove non-digit chars and convert into int
autos_copy.odometer = (autos_copy['odometer']
                       .str.replace('km','')
                       .str.replace(',','')
                       .astype(int)
                      )


                       
autos_copy.odometer.head(10)

0    150000
1    150000
2     70000
3     70000
4    150000
5    150000
6    150000
7    150000
8    150000
9    150000
Name: odometer, dtype: int64

### Explore the Price Column


In [95]:
autos_copy['price'].head(20)

0      $5,000
1      $8,500
2      $8,990
3      $4,350
4      $1,350
5      $7,900
6        $300
7      $1,990
8        $250
9        $590
10       $999
11       $350
12     $5,299
13     $1,350
14     $3,999
15    $18,900
16       $350
17     $5,500
18       $300
19     $4,150
Name: price, dtype: object

In [96]:
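# Remove non-digit characters and convert to integer type
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
autos_copy['price'] = (autos_copy['price']
                       .str.replace('$','', regex=False)
                       .str.replace(',','', regex=False)
                       .astype(int)
                      )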
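
The price and odometer columns are cleaned the same way, so the steps could be folded into one small helper. This is only a sketch; `to_int` is a hypothetical name, and stripping every non-digit character would also discard minus signs and decimal points, which is acceptable here only because the values are whole, non-negative numbers.

```python
import pandas as pd

def to_int(series):
    """Hypothetical helper: drop every non-digit character, then cast to int."""
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Toy values (assumption) in the same format as the two columns.
prices = pd.Series(["$5,000", "$300", "$18,900"])
odometer = pd.Series(["150,000km", "5,000km"])

print(to_int(prices).tolist())
print(to_int(odometer).tolist())
```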


In [99]:
# Top 20 prices
autos_copy['price'].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [100]:
# Bottom 20 prices
autos_copy['price'].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

**There are 1421 listings with a price of zero and 14 listings priced above 350,000, 11 of them above 1 million. Since an auction price could start at 1 dollar, and any price higher than 350,000 does not seem realistic, we'll keep only prices between 1 and 350,000 (inclusive).**

In [101]:
autos_copy = autos_copy[autos_copy['price'].between(1,350000)]
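
`Series.between` includes both endpoints by default, so listings priced at exactly 1 or exactly 350000 are kept. A small sketch on toy prices (made up for illustration) that also counts how many rows such a filter would drop:

```python
import pandas as pd

# Toy prices (assumption): two values fall outside the accepted range.
prices = pd.Series([0, 1, 500, 350000, 999999])

keep = prices.between(1, 350000)  # both bounds are inclusive by default
print(keep.tolist())
print("rows dropped:", int((~keep).sum()))
```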

In [102]:
# check stats of price col again
autos_copy['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

In [103]:
# Explore columns with date data
autos_copy[['date_crawled', 'ad_created', 'last_seen']].head(5)

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [123]:
# Extract the date from each of the 3 columns to understand the date distributions
# date from date_crawled col
print(autos_copy['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10))
print()
autos_copy['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
Name: date_crawled, dtype: float64



2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

**==> The data appears to have been crawled over a period of just over a month, from early March to early April 2016, and the share of listings crawled per day is roughly uniform apart from the last few days.**
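
Slicing the first 10 characters works because the timestamps have a fixed format. An alternative is to parse them into real datetimes, which also allows date arithmetic. A minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS` format seen above; the three timestamps are toy values chosen to match the first and last crawl dates (2016-03-05 and 2016-04-07):

```python
import pandas as pd

# Toy timestamps (assumption) in the same format as date_crawled.
crawled = pd.Series([
    "2016-03-05 14:06:30",
    "2016-03-26 17:47:46",
    "2016-04-07 06:17:27",
])

parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Distance between the first and last crawl day, ignoring the time of day.
span = parsed.max().normalize() - parsed.min().normalize()
print(span.days)

# Same per-day distribution as the string-slicing approach.
per_day = parsed.dt.date.value_counts(normalize=True).sort_index()
print(per_day)
```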

In [120]:
# data distribution for ad_created col
print(autos_copy['ad_created'].str[:10]
      .value_counts(normalize=True, dropna=False).sort_index()
     )


2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

**==> The listing creation dates vary widely: most listings were created within 1-2 months of the crawl, while a few are much older, with the oldest created in June 2015, about 9 months before the crawl began.**

In [121]:
# data distribution for last_seen col
print(autos_copy['last_seen'].str[:10]
      .value_counts(normalize=True, dropna=False).sort_index()
     )


2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64


**==> The share of last-seen dates spikes in the final 3 days; however, this is probably related to the crawl period ending rather than to a real increase in car sales.**

In [124]:
# Look at registration_year distribution

autos_copy['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

**==> Note that the minimum value for registration_year is 1000, long before cars were invented, and the maximum is 9999, far in the future. Neither makes sense, so we need to clean the data in the registration_year column.**
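
One possible way to gauge the problem is to count how many values fall outside a plausible range before deciding how to handle them. The bounds below are assumptions for illustration: 1900 as a generous lower limit, and 2016 as the upper limit because the listings were crawled in 2016. The years are toy values.

```python
import pandas as pd

# Toy registration years (assumption), including the two extremes seen above.
years = pd.Series([1000, 1999, 2004, 2008, 2016, 9999])

# Assumed plausible range: 1900 through 2016, both inclusive.
plausible = years.between(1900, 2016)

print("implausible rows:", int((~plausible).sum()))
print(years[plausible].describe())
```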