# eBay Car Sales Data
### A guided project to explore eBay sales data - Dataquest.io

This project explores a sample of 50,000 car sales scraped from eBay. We will use NumPy and pandas to clean the data and prepare it for basic analysis.

### Import and view the data

In [1]:
import numpy as np
import pandas as pd

# Load the data, reading the file with Latin-1 encoding
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

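The output shows that many values are in German rather than English (for example `privat`, `Angebot`, and `manuell`). Five columns (`vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage`) contain null values that will need to be handled. Most columns are stored as strings (`object` dtype), including `price` and `odometer`.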
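
To quantify the missing values rather than reading them off the `info()` output, a minimal sketch along these lines may help. The `sample` frame and its values are made up for illustration and stand in for `autos`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration.
sample = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", np.nan, np.nan],
    "powerPS": [158, 286, 102, 71],
})

# Number of missing values per column, largest first.
missing_counts = sample.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# Share of rows that are missing in each column (0.0 to 1.0).
missing_share = sample.isnull().mean()
print(missing_share)
```

Running the same two calls on `autos` should give counts that match the difference between 50,000 and the non-null counts listed above.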

### Rename the columns to be more descriptive
The Python convention is to use snake_case rather than camelCase, so we rename the columns accordingly. Some columns are also renamed to describe their contents better.

In [4]:

# A function that accepts a column name and outputs my preferred name
def cleanColNames(inStr):
    inStr = inStr.replace("yearOfRegistration", "registration_year")
    inStr = inStr.replace("monthOfRegistration", "registration_month")
    inStr = inStr.replace("notRepairedDamage", "unprepaired_damage")
    # "dataCreated" matches no column (the actual name is "dateCreated"),
    # so that column keeps its original name in the output below
    inStr = inStr.replace("dataCreated", "ad_created")
    inStr = inStr.replace("dateCrawled", "date_crawled")
    inStr = inStr.replace("offerType", "offer_type")
    inStr = inStr.replace("vehicleType", "vehicle_type")
    inStr = inStr.replace("powerPS", "power_ps")
    inStr = inStr.replace("fuelType", "fuel_type")
    inStr = inStr.replace("nrOfPictures", "num_pictures")
    inStr = inStr.replace("postalCode", "postal_code")
    inStr = inStr.replace("lastSeen", "last_seen")
    return inStr

cleanedCols = []

# For each column in the dataset, get the cleaned name
for column in autos.columns:
    column = cleanColNames(column)
    cleanedCols.append(column)

# Rename the dataframe columns to the newly created cleaned names
autos.columns = cleanedCols

# Print to verify the changes
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price                 50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unprepaired_damage    40171 non-null object
dateCreated           50000 non-null object
num_pictures          50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(5), object(15)
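
As an alternative to chained string replacements, the renaming can be sketched with a dictionary passed to `DataFrame.rename`. The `column_map` and `sample` names below are hypothetical, and the mapping covers only a few columns. The `dataCreated` key is deliberately left misspelled to show how a simple check surfaces keys that match no column, since `rename` ignores them silently by default.

```python
import pandas as pd

# Hypothetical mapping; only the names listed here are changed.
column_map = {
    "yearOfRegistration": "registration_year",
    "offerType": "offer_type",
    "powerPS": "power_ps",
    "dataCreated": "ad_created",  # misspelled on purpose: the column is "dateCreated"
}

# Empty toy frame with a handful of the original column names.
sample = pd.DataFrame(
    columns=["yearOfRegistration", "offerType", "powerPS", "brand", "dateCreated"]
)

# rename() leaves columns that are not in the mapping untouched.
renamed = sample.rename(columns=column_map)
print(list(renamed.columns))

# Keys that matched no column usually point to a typo in the mapping.
unmatched = [old for old in column_map if old not in sample.columns]
print(unmatched)
```

An explicit mapping keeps each old and new name side by side, which makes a mismatch such as this one easier to spot.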

### A first pass at data exploration
This first pass helps identify which columns are useful for analysis, which can be dropped, and which need to be converted to a more suitable type.


In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unprepaired_damage,dateCreated,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


The `num_pictures` column contains only zeros (its mean, minimum, and quartiles are all 0), so it can be dropped.
The `price` and `odometer` columns show no numeric statistics because they are stored as strings; they need to be cleaned and cast to integers.
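
One way to confirm that a column holds a single value before dropping it is to count its distinct values. The sketch below uses a made-up `sample` frame rather than the real data, and the `constant_cols` name is only illustrative.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
sample = pd.DataFrame({
    "num_pictures": [0, 0, 0, 0],
    "power_ps": [158, 286, 102, 71],
})

# A column with exactly one distinct value (counting NaN) carries no information.
constant_cols = [
    col for col in sample.columns
    if sample[col].nunique(dropna=False) == 1
]
print(constant_cols)

# Drop the constant columns by name.
trimmed = sample.drop(columns=constant_cols)
print(list(trimmed.columns))
```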


In [6]:
# Remove the num_pictures field
#autos = autos.drop(columns='num_pictures')


In [7]:
# Investigate the price field to see why it has no numeric statistics
autos["price"].describe()

count     50000
unique     2357
top          $0
freq       1421
Name: price, dtype: object

In [14]:
# Remove the "$" and "," from the price field, then cast to int.
# regex=False treats "$" as a literal character rather than a regex anchor.
autos['price'] = autos['price'].str.replace("$", '', regex=False)
autos['price'] = autos['price'].str.replace(",", '', regex=False)
autos['price'] = autos['price'].astype(int)
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
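
The same cleaning steps can be tried on a small Series before applying them to the full column. This is a minimal sketch; the `prices` values are taken from the first rows shown earlier plus the `$0` entry, and the chained form is just one way to write it.

```python
import pandas as pd

# A few price strings in the same format as the raw column.
prices = pd.Series(["$5,000", "$8,500", "$0", "$1,350"])

# Strip the currency symbol and thousands separators, then cast to int.
# regex=False makes "$" a literal character instead of an end-of-string anchor.
cleaned = (
    prices
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(cleaned.tolist())
```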

In [8]:
# Investigate the odometer field in the same way
autos['odometer'].describe()

count         50000
unique           13
top       150,000km
freq          32424
Name: odometer, dtype: object

In [9]:
# Strip the "km" suffix and thousands separators, cast to int,
# and record the unit in the column name
autos['odometer'] = autos['odometer'].str.replace("km", "", regex=False)
autos['odometer'] = autos['odometer'].str.replace(",", "", regex=False)
autos['odometer'] = autos['odometer'].astype(int)
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Next, take a closer look at the values in the `price` and `odometer_km` fields to see what further cleaning is needed.


In [17]:
autos['price'].value_counts().sort_index(ascending=True)

0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ... 
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678      