## Guided Project NO. 3
### Exploring eBay Car Sales Data
######  The goal of this project is to clean and analyze the dataset of used car listings obtained from eBay Kleinanzeigen, which is a classifieds section of the German eBay website.

---



In [1]:
import pandas as pan
import numpy as num
autos = pan.read_csv('autos.csv',encoding='Latin-1')

In [2]:
autos.info()
#data dictionary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [3]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


###### Observations.
1. 50,000 rows & 20 columns
2. 15 object fields & 5 integer fields
3. The name column is free text with no consistent format
4. Date & time format is consistent
5. Language is not consistent: values switch between English and German
6. price is in US dollars and stored as a string instead of a numeric value
7. nrOfPictures is 0 in every row shown, so it appears to add nothing to the analysis
8. Column names use camelCase instead of snake_case
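
The row, column, and dtype counts in the first two observations can be checked in code rather than read off the `info()` output. A minimal sketch, assuming a small hypothetical frame (`toy`) in place of `autos`:

```python
import pandas as pan

# Hypothetical three-column stand-in for autos; values copied from the head() output.
toy = pan.DataFrame({
    'price': ['$5,000', '$8,500'],
    'powerPS': [158, 286],
    'brand': ['peugeot', 'bmw'],
})

n_rows, n_cols = toy.shape
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(n_rows, n_cols)
print('numeric:', numeric_cols)
print('non-numeric:', text_cols)
```

Run against the full `autos` frame, the same calls should list the 5 integer columns and the 15 text columns.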

---

### Cleaning Column Names

In [4]:
autos_mapping = {'dateCrawled':'date_crawled',
                 'offerType':'offer_type',
                 'vehicleType':'vehicle_type',
                 'yearOfRegistration':'registration_year',
                 'powerPS':'power_ps',
                 'monthOfRegistration':'registration_month',
                 'fuelType':'fuel_type',
                 'notRepairedDamage':'unrepaired_damage',
                 'dateCreated':'ad_created',
                 'nrOfPictures':'number_of_pictures',
                 'postalCode':'postal_code',
                 'lastSeen':'last_seen',
                'name':'name',
                'seller':'seller',
                'price':'price',
                'abtest':'abtest',
                'gearbox':'gearbox',
                'model':'model',
                'odometer':'odometer',
                'brand':'brand'}
autos.columns = autos.columns.map(autos_mapping)

In [5]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


##### Column Update
- update casing from camelCase to snake_case
- yearOfRegistration to registration_year
- monthOfRegistration to registration_month
- notRepairedDamage to unrepaired_damage
- dateCreated to ad_created

Changes were made to make the column names more readable & descriptive.
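
The mapping in this section was written by hand. As a sketch of an alternative, a small helper could derive most snake_case names automatically, with a dictionary for the columns that were reworded as well as re-cased. `camel_to_snake` and `overrides` are hypothetical names introduced here for illustration.

```python
import re

def camel_to_snake(name):
    """Insert '_' before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# Columns that were reworded, not just re-cased, keep an explicit entry.
overrides = {
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'number_of_pictures',
}

original = ['dateCrawled', 'name', 'offerType', 'powerPS',
            'yearOfRegistration', 'nrOfPictures']
renamed = [overrides.get(col, camel_to_snake(col)) for col in original]
print(renamed)
```

In the notebook, the resulting list could be assigned to `autos.columns` in place of the full dictionary, so unchanged names such as `name` or `brand` need no entry.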

---

### Initial Exploration and Cleaning.

In [6]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-10 15:36:24,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [7]:
autos['number_of_pictures'].value_counts().head()
autos['registration_year'].value_counts().head().sort_index(ascending=True)
autos['registration_year'].unique()

array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010,
       1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011,
       2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988,
       1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969,
       1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971,
       1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959,
       9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941,
       1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888,
       1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001])

##### Potential Candidates to be dropped
- number_of_pictures
- seller
- offer_type

Each of these columns has 2 or fewer unique values: seller and offer_type hold the same value in 49,999 of 50,000 rows, and number_of_pictures is always 0. Columns this uniform are unlikely to contribute to any further analysis.
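
One way to find such columns without reading the `describe()` table by eye is to measure how much of each column is taken up by its most frequent value. The sketch below assumes a hypothetical toy frame and an arbitrary 80% threshold; `dominant_share` is a name made up for this example.

```python
import pandas as pan

# Toy stand-in for autos; the values are illustrative only.
toy = pan.DataFrame({
    'seller': ['privat'] * 9 + ['other'],
    'number_of_pictures': [0] * 10,
    'brand': ['bmw', 'audi', 'opel', 'ford', 'bmw',
              'audi', 'opel', 'ford', 'bmw', 'audi'],
})

def dominant_share(df):
    """Share of rows taken by each column's most frequent value."""
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

shares = dominant_share(toy)
low_information = shares[shares > 0.8].index.tolist()
print(low_information)
```

The threshold is a judgment call: a column dominated by one value can still matter if the rare value is the one of interest.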

In [8]:
autos['seller'].unique()
autos['offer_type'].unique()
autos['number_of_pictures'].unique()
# confirming the unique values in the potential drop columns

array([0])

In [9]:
autos = autos.drop(['seller','offer_type','number_of_pictures'],axis = 1)
autos.columns # confirming removal of columns

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

##### Potential Candidates for further investigation
- price
- registration_year
- power_ps




##### Potential candidates for numeric cleaning
- price
- odometer
- power_ps

---

### Exploring Odometer & Price Columns

In [10]:
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)
autos.rename({'odometer':'odometer_km'},axis=1,inplace=True)

In [11]:
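# regex=False treats '$' as a literal character, not as the regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)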
#autos.rename({'price':'price_us_dollars'},axis=1,inplace=True)
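
The two conversions above follow the same pattern: strip a unit marker, strip the thousands separators, cast to int. A minimal sketch of a shared helper, assuming a hypothetical function name `to_int` and a toy frame built from values in the `head()` output:

```python
import pandas as pan

raw = pan.DataFrame({
    'price': ['$5,000', '$8,500', '$1,350'],
    'odometer': ['150,000km', '70,000km', '150,000km'],
})

def to_int(series, unit):
    """Remove a literal unit marker and thousands separators, then cast to int."""
    return (series.str.replace(unit, '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(int))

raw['price'] = to_int(raw['price'], '$')
raw['odometer_km'] = to_int(raw['odometer'], 'km')
print(raw[['price', 'odometer_km']])
```

Passing `regex=False` matters mostly for `'$'`, which is a special character in regular expressions; with literal matching the result does not depend on the pandas version's default.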

In [12]:
print(autos['price'].unique())
print(autos['price'].unique().shape[0])
print(autos['price'].describe())
autos['price'].value_counts().sort_index(ascending=False).head(15)


[ 5000  8500  8990 ...   385 22200 16995]
2357
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64

In [13]:
autos['price'] = autos[autos['price'].between(1,3.5e5)]['price']
print(autos['price'].unique().shape[0])
print(autos['price'].describe())
autos['price'].value_counts().sort_index(ascending=True).head(10)

2347
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64


1.0     156
2.0       3
3.0       1
5.0       2
8.0       1
9.0       1
10.0      7
11.0      2
12.0      3
13.0      2
Name: price, dtype: int64

##### Outliers observation 'price'
Keep prices from 1 dollar to 350,000 dollars; the cutoff sits at a large jump between adjacent prices, from 350,000 straight to 999,990
- count changed from 50,000 to 48,565
- max changed from about 100M to 350K
- min changed from 0 to 1
- the 25%, 50%, & 75% changes were minor
- largest change was std, going from about 481,000 to about 9,060
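
Note that the filter cell assigns the filtered prices back to the column, so out-of-range rows stay in the frame with `NaN` as their price, which is why the dtype becomes float64. A sketch of an alternative that removes those rows instead, on a hypothetical toy frame:

```python
import pandas as pan

# Toy prices taken from the value_counts() outputs above.
toy = pan.DataFrame({'price': [0, 1, 1350, 5000, 350000, 999990, 99999999]})

in_range = toy['price'].between(1, 350_000)  # inclusive on both ends
kept = toy.loc[in_range].copy()              # drops the rows instead of blanking the price

print(len(toy) - len(kept), 'rows removed')
print(kept['price'].dtype)
```

Dropping rows keeps `price` as an integer column, but it also removes those listings from every later aggregate, including the mileage means; which behaviour is wanted depends on the analysis.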

In [14]:
print(autos['odometer_km'].unique())
print(autos['odometer_km'].unique().shape[0])
autos['odometer_km'].value_counts().sort_index()

[150000  70000  50000  80000  10000  30000 125000  90000  20000  60000
   5000 100000  40000]
13


5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

---

### Exploring the date columns

In [15]:
autos[['date_crawled','ad_created','last_seen']][:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [16]:
print(autos['date_crawled'].unique())
autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

['2016-03-26 17:47:46' '2016-04-04 13:38:56' '2016-03-26 18:57:24' ...
 '2016-03-28 10:50:25' '2016-03-08 19:25:42' '2016-03-14 00:42:12']


2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64

- crawl volume is fairly even, at roughly 3% of listings on most days
- all crawl dates fall between 2016-03-05 and 2016-04-07, i.e. March and early April

In [17]:
autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64

In [18]:
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64

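- last_seen covers the same date range as date_crawled (2016-03-05 to 2016-04-07)
- the last three days hold about 48% of the values, which more likely reflects the end of the crawl than a surge in removed listings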
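
The cells in this section slice the first ten characters of each string to get the date. A sketch of the same distribution computed on parsed timestamps, assuming a toy Series built from the `last_seen` values shown earlier:

```python
import pandas as pan

toy = pan.Series(['2016-03-15 03:16:28', '2016-04-06 06:45:54',
                  '2016-04-06 14:45:08', '2016-04-01 14:38:50'],
                 name='last_seen')

parsed = pan.to_datetime(toy, format='%Y-%m-%d %H:%M:%S')

# normalize() sets the time part to midnight, leaving one value per calendar day.
daily_share = parsed.dt.normalize().value_counts(normalize=True).sort_index()
print(daily_share)
```

Parsing also rejects malformed strings up front and makes later date arithmetic possible, for example the number of days between `ad_created` and `last_seen`.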

---

### Dealing with Incorrect Registration Year Data

In [19]:
autos['registration_year'].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

- max and min year (9999 and 1000) fall outside the range of a plausible registration
- mean is about 2005, but it is distorted by the implausible values; the median is 2003
- registration_year contains improbable values: the first car was built in 1885, yet several listings have years such as 1111 or, at the other end, 4500



In [20]:
autos['registration_year']=autos[autos['registration_year'].between(1959,2016)]['registration_year']

- lowest accepted value is 1959, chosen because of the number of safety features introduced in the 1950s
- highest accepted value is 2016, the year the listings were crawled; a car cannot be first registered after its listing was seen
- inspections are a common part of car registration, so the range is limited to cars with more advanced safety features
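
The upper bound does not have to be hard-coded. As a sketch, it could be read from the crawl dates themselves, while the lower bound stays a judgment call. The toy frame below reuses the `date_crawled` values from the `head()` output; the registration years are picked from the `unique()` output for illustration.

```python
import pandas as pan

toy = pan.DataFrame({
    'date_crawled': ['2016-03-26 17:47:46', '2016-04-04 13:38:56',
                     '2016-03-26 18:57:24', '2016-03-12 16:58:10',
                     '2016-04-01 14:38:50'],
    'registration_year': [2004, 1000, 2017, 1959, 9999],
})

earliest_year = 1959  # the notebook's chosen lower bound
latest_year = pan.to_datetime(toy['date_crawled']).dt.year.max()

plausible = toy['registration_year'].between(earliest_year, latest_year)
print(f'{(~plausible).mean():.0%} of rows fall outside {earliest_year}-{latest_year}')
```

Checking the share of rows outside the range before filtering shows how much data the chosen bounds cost.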

In [21]:
autos['registration_year'].value_counts(normalize =True).sort_index()

1959.0    0.000146
1960.0    0.000709
1961.0    0.000125
1962.0    0.000083
1963.0    0.000188
1964.0    0.000250
1965.0    0.000354
1966.0    0.000459
1967.0    0.000563
1968.0    0.000542
1969.0    0.000396
1970.0    0.000938
1971.0    0.000563
1972.0    0.000729
1973.0    0.000542
1974.0    0.000500
1975.0    0.000396
1976.0    0.000563
1977.0    0.000459
1978.0    0.000980
1979.0    0.000729
1980.0    0.002022
1981.0    0.000646
1982.0    0.000896
1983.0    0.001105
1984.0    0.001105
1985.0    0.002188
1986.0    0.001584
1987.0    0.001563
1988.0    0.002959
1989.0    0.003772
1990.0    0.008232
1991.0    0.007419
1992.0    0.008149
1993.0    0.009274
1994.0    0.013755
1995.0    0.027364
1996.0    0.030095
1997.0    0.042266
1998.0    0.051123
1999.0    0.062523
2000.0    0.069901
2001.0    0.056334
2002.0    0.052791
2003.0    0.056834
2004.0    0.057042
2005.0    0.062836
2006.0    0.056438
2007.0    0.048018
2008.0    0.046497
2009.0    0.043725
2010.0    0.033283
2011.0    0.

- the bulk of registrations falls between 1995 and 2012
- most common registration year is 2000 (about 7.0% of listings)
- least common registration year is 1962 (about 0.008% of listings)

---

### Exploring Price by Brand

In [22]:
autos['brand'].describe()
autos['brand'].unique()

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [23]:
# True for brands that account for more than 5% of all listings
greater_five = (autos['brand'].value_counts(normalize=True)*100)>5
unique_brand = autos['brand'].unique()
price_brand = {}
for brand in unique_brand:
    selected_rows=autos[autos['brand']==brand]
    if greater_five[brand]:
        # mean() skips the prices set to NaN by the outlier filter
        price_brand[brand]=selected_rows['price'].mean()
price_brand

{'bmw': 8261.382442169132,
 'volkswagen': 5332.4784249226,
 'ford': 3728.4121821407452,
 'mercedes_benz': 8536.027085124677,
 'audi': 9212.9306621881,
 'opel': 2944.6075421641085}

- Cars on the cheaper end are Opel (2,944) & Ford (3,728)
- Car in the middle is Volkswagen (5,332)
- The three most expensive brands are Audi, Mercedes-Benz, & BMW

In [24]:
price_brand = pan.DataFrame(pan.Series(price_brand).sort_values(ascending=False), columns=['mean_price'])
price_brand

Unnamed: 0,mean_price
audi,9212.930662
mercedes_benz,8536.027085
bmw,8261.382442
volkswagen,5332.478425
ford,3728.412182
opel,2944.607542


---
### Storing Aggregate Data in a DataFrame

In [25]:
mileage_brand = {}
for brand in unique_brand:
    selected_rows=autos[autos['brand']==brand]
    # reuse the 5% share mask built for the price aggregation
    if greater_five[brand]:
        mileage_brand[brand]=selected_rows['odometer_km'].mean()
mileage_brand

{'bmw': 132521.64302818198,
 'volkswagen': 128955.27276129878,
 'ford': 124131.93446392642,
 'mercedes_benz': 130886.14279678918,
 'audi': 129643.9411627364,
 'opel': 129298.66324848929}

In [26]:
mileage_brand = pan.Series(mileage_brand)
mileage_brand

bmw              132521.643028
volkswagen       128955.272761
ford             124131.934464
mercedes_benz    130886.142797
audi             129643.941163
opel             129298.663248
dtype: float64

In [27]:
price_brand['mean_mileage'] = mileage_brand
price_brand

Unnamed: 0,mean_price,mean_mileage
audi,9212.930662,129643.941163
mercedes_benz,8536.027085,130886.142797
bmw,8261.382442,132521.643028
volkswagen,5332.478425,128955.272761
ford,3728.412182,124131.934464
opel,2944.607542,129298.663248
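
The two loops and the column assignment above could also be expressed as a single `groupby` with named aggregation. This is a sketch on a hypothetical toy frame, not the notebook's method; the 20% threshold replaces the notebook's 5% only because the toy frame is tiny.

```python
import pandas as pan

toy = pan.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'opel', 'lada'],
    'price': [8000.0, 9000.0, 3000.0, None, 2000.0, 1500.0],
    'odometer_km': [150000, 125000, 150000, 100000, 125000, 90000],
})

share = toy['brand'].value_counts(normalize=True)
common = share[share > 0.2].index  # brands above the share threshold

summary = (toy[toy['brand'].isin(common)]
           .groupby('brand')
           .agg(mean_price=('price', 'mean'),
                mean_mileage=('odometer_km', 'mean'))
           .sort_values('mean_price', ascending=False))
print(summary)
```

As in the loops, `mean` skips missing prices, so a row with a `NaN` price still counts toward the mileage average.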


### Conclusion
#### The prices and mileage of popular car brands on German eBay.
- Audi, Mercedes-Benz, and BMW are the most expensive, roughly two to three times pricier than Ford and Opel.
- Volkswagen is in between.
- Mean mileage varies little across these brands (about 124,000 to 133,000 km), so mileage does not explain the price differences.
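
One way to probe the mileage statement is to compute a correlation coefficient and the mileage spread from the aggregate table. The sketch below copies the six brand-level values from that table; `brand_summary` is a name introduced here for illustration.

```python
import pandas as pan

# Values copied from the mean_price / mean_mileage table.
brand_summary = pan.DataFrame(
    {'mean_price': [9212.930662, 8536.027085, 8261.382442,
                    5332.478425, 3728.412182, 2944.607542],
     'mean_mileage': [129643.941163, 130886.142797, 132521.643028,
                      128955.272761, 124131.934464, 129298.663248]},
    index=['audi', 'mercedes_benz', 'bmw', 'volkswagen', 'ford', 'opel'])

# Series.corr uses the Pearson coefficient by default.
r = brand_summary['mean_price'].corr(brand_summary['mean_mileage'])
mileage_spread = brand_summary['mean_mileage'].max() - brand_summary['mean_mileage'].min()
print(f'Pearson r = {r:.2f}, mileage spread = {mileage_spread:,.0f} km')
```

With only six brand-level averages, any coefficient here is weak evidence; a listing-level comparison of price and `odometer_km` would be more informative.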