### Project: Exploring eBay Car Sales Data

**Project Description**
This project cleans and analyzes a sample dataset from the classified ads section of a German eBay website. The data was originally scraped and uploaded to Kaggle by user **orgesleka**.
The original dataset is no longer available on Kaggle but can still be accessed [here](https://data.world/data-society/used-cars-data).

The main tools in this project will be pandas and numpy.

In [287]:
import numpy as np
import pandas as pd
import csv


In [288]:
# load the dataset into pandas
# Note: reading this file as UTF-8 does not work, so Latin-1 is used instead
autos = pd.read_csv('autos.csv', encoding='Latin-1')
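
Why UTF-8 fails here can be illustrated without the file itself. The following is a minimal sketch, assuming (as the listing names suggest) that the text contains German characters such as `Ü` stored as single Latin-1 bytes. `decode_with_fallback` and the sample string are hypothetical and are not part of pandas or the dataset.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the raw bytes."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("none of the candidate encodings worked")


# Hypothetical sample: a listing name containing 'Ü', encoded as Latin-1
sample = "Ford_Focus_TÜV_neu".encode("latin-1")
text, used = decode_with_fallback(sample)
print(used, text)
```

Latin-1 maps every possible byte to a character, so it never raises an error. That makes it a convenient last resort, but it can silently produce wrong characters if the file was actually written in some other encoding.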

In [289]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [290]:
autos.tail()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07
49999,2016-03-14 00:42:12,Opel_Vectra_1.6_16V,privat,Angebot,"$1,250",control,limousine,1996,manuell,101,vectra,"150,000km",1,benzin,opel,nein,2016-03-13 00:00:00,0,45897,2016-04-06 21:18:48


In [291]:
# make a copy of the dataset
autos_copy = autos.copy()

In [292]:
# check dataset info
autos_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [293]:
# store the original column names
col_name_origin = autos_copy.columns

In [294]:
col_name_origin

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [295]:
# rename the camelCase columns to snake_case
autos_copy.rename({'dateCrawled':'date_crawled',
       'offerType':'offer_type', 
       'vehicleType':'vehicle_type', 'yearOfRegistration': 'registration_year',
       'powerPS':'power_ps', 
       'monthOfRegistration':'registration_month', 
       'fuelType':'fuel_type', 
       'notRepairedDamage':'not_repaired_damage', 'dateCreated':'ad_created', 
       'nrOfPictures':'num_of_pictures', 'postalCode':'postal_code',
       'lastSeen':'last_seen'},axis=1, inplace = True)

In [296]:
# print corrected column list
autos_copy.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'not_repaired_damage', 'ad_created', 'num_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

**Note** We changed the original column names from camelCase to snake_case to make them easier to read and process; for example, 'notRepairedDamage' became 'not_repaired_damage'. A few columns were also given more descriptive names:

    - yearOfRegistration to registration_year
    - monthOfRegistration to registration_month
    - nrOfPictures to num_of_pictures
    - dateCreated to ad_created
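
The mechanical part of this renaming can also be generated instead of typed by hand. The sketch below is one possible approach, not the method used in this notebook: `camel_to_snake` is a hypothetical helper, and `overrides` holds the columns that received a new name rather than a plain case change.

```python
import re


def camel_to_snake(name):
    """Insert '_' where a lowercase letter or digit is followed by an
    uppercase letter, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Columns that get a new name rather than a plain case conversion
overrides = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "dateCreated": "ad_created",
    "nrOfPictures": "num_of_pictures",
}

# A subset of the original column names, for illustration
original = ["dateCrawled", "offerType", "powerPS", "yearOfRegistration",
            "notRepairedDamage", "nrOfPictures"]

mapping = {col: overrides.get(col, camel_to_snake(col)) for col in original}
print(mapping)
```

A dictionary built this way could be passed to `autos_copy.rename(columns=mapping)`.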



### Initial Data Exploration and Cleaning

In [297]:
# print descriptive stats of the dataset
autos_copy.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,not_repaired_damage,ad_created,num_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 15:49:30,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


#### **Initial observations:** 

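    1. There are text columns where almost all of the values are the same (2 unique values, with the top value appearing in 49,999 of the 50,000 rows):
        - seller 
        - offer_type
        
    2. The num_of_pictures column looks odd: its mean, std, min and quartiles are all 0 (unique, top and freq are NaN only because the column is numeric)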

In [298]:
print(autos_copy['num_of_pictures'].value_counts())
print()
print(autos_copy['seller'].value_counts())
print()
print(autos_copy['offer_type'].value_counts())


0    50000
Name: num_of_pictures, dtype: int64

privat        49999
gewerblich        1
Name: seller, dtype: int64

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64



**==> Since num_of_pictures is 0 for every row, and seller and offer_type each hold the same value in all but one row, we'll drop these 3 columns <==**
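
Columns like these can also be flagged programmatically. The sketch below is illustrative only: `dominant_share` is a hypothetical helper, the 0.85 threshold is an arbitrary assumption, and the toy frame merely mimics the pattern seen above.

```python
import pandas as pd


def dominant_share(df):
    """For each column, return the fraction of rows taken by its most common value."""
    return pd.Series({
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    })


# Toy data: one almost-constant column, one constant column, one varied column
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "num_of_pictures": [0] * 10,
    "brand": ["bmw", "audi", "opel", "ford", "fiat"] * 2,
})

shares = dominant_share(toy)
# Assumed threshold: flag columns where one value covers at least 85% of rows
near_constant = shares[shares >= 0.85].index.tolist()
print(near_constant)
```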

In [299]:
# drop num_of_pictures, seller and offer_type
autos_copy = autos_copy.drop(['num_of_pictures', 'seller', 'offer_type'], axis=1)


**Two columns, price and odometer, hold numeric values that are stored as text.**


In [300]:
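# convert price col into int; regex=False treats '$' as a literal character, not a regex anchor
autos_copy['price'] = autos_copy['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)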

In [301]:
autos_copy.price

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int64

In [312]:
# convert odometer col into int and rename it to make the unit explicit
autos_copy['odometer'] = autos_copy['odometer'].str.replace('km','').str.replace(',','').astype(int)
autos_copy.rename({'odometer':'odometer_km'}, axis=1, inplace = True)
autos_copy.odometer_km

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int64
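
Both conversions follow the same pattern: strip everything that is not a digit, then cast to an integer. A minimal sketch of a shared helper follows; `text_to_int` is a hypothetical name, and the helper assumes the values are non-negative whole numbers, since it would also discard a minus sign or decimal point.

```python
import pandas as pd


def text_to_int(series):
    """Strip every non-digit character and convert the result to integers.

    Entries with no digits at all become missing values instead of raising.
    """
    digits = series.str.replace(r"[^0-9]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


# Sample values in the same format as the price and odometer columns
price = text_to_int(pd.Series(["$5,000", "$8,990"]))
odometer_km = text_to_int(pd.Series(["150,000km", "70,000km"]))
print(price.tolist(), odometer_km.tolist())
```

The nullable `Int64` dtype is used here so that unparseable entries can be kept as missing values rather than forcing the whole column to float.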

====================================================================================== 

### **Explore Price & Odometer_km** 

======================================================================================

In [None]:
# number of unique prices, summary stats, and the highest and lowest price values
print(autos_copy.price.unique().shape, "\n")
print(autos_copy.price.describe())
print()
print('Top 20 prices \n')
print(autos_copy.price.value_counts().sort_index(ascending=False).head(20))
print('\n Bottom 20 prices')
autos_copy.price.value_counts().sort_index(ascending=True).head(20)


**There are 1421 cars with a price of 0. The maximum price is about 100 million, and about 12 cars are priced at 1 million or higher. It's reasonable for an auction to start at 1 dollar. Thus, we'll remove any price lower than 1 dollar or higher than 350,000 dollars (the filter below uses an upper bound of 351,000).**



In [None]:
autos_copy = autos_copy[autos_copy['price'].between(1,351000)]
autos_copy['price'].describe()
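
Before discarding rows, it can help to count how many listings fall outside the chosen bounds. The sketch below uses made-up prices rather than the real column; only the bounds 1 and 351000 come from the filter above. Note that `Series.between` includes both endpoints by default.

```python
import pandas as pd

# Made-up prices covering the low, valid and high cases
prices = pd.Series([0, 1, 500, 350000, 999990, 99999999])

low, high = 1, 351000
in_range = prices.between(low, high)  # inclusive on both ends

kept = prices[in_range]
dropped_low = int((prices < low).sum())
dropped_high = int((prices > high).sum())
print(f"kept {len(kept)}, dropped {dropped_low} below and {dropped_high} above the bounds")
```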

In [None]:
# Check Odometer_km col
print(autos_copy['odometer_km'].value_counts())

### Exploring the date columns

There are a few columns with date info:

    - date_crawled
    - registration_month
    - registration_year
    - ad_created
    - last_seen
    
We'll explore each of these columns to learn more about the listings.

In [314]:
autos_copy[['date_crawled', 'ad_created', 'last_seen']][:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50
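
The three timestamp columns are stored as strings whose first 10 characters are the date. One possible way to explore them, sketched here on the five `date_crawled` values shown above rather than on the full column, is to slice off the date and look at the share of listings per day:

```python
import pandas as pd

# The five date_crawled values displayed above
date_crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
    "2016-04-01 14:38:50",
])

# Keep only the 'YYYY-MM-DD' part, then compute the share of rows per day
day_shares = (
    date_crawled.str[:10]
    .value_counts(normalize=True, dropna=False)
    .sort_index()
)
print(day_shares)
```

Sorting by index puts the days in chronological order, because ISO-formatted date strings sort the same way as the dates they represent.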
