# pandas notebook for data exploration
## This notebook explores a dataset from https://data.world/data-society/used-cars-data, which contains data scraped from classified adverts on eBay Germany

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('autos.csv', encoding='latin-1')

In [3]:
df.shape

(371537, 20)

In [4]:
df.head(2)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,24/03/2016 11:52,Golf_3_1.6,private,Angebot,480,test,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,A5_Sportback_2.7_Tdi,private,Angebot,18300,test,coupe,2011,manual,190,,125000,5,diesel,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46


#### Changing camelCase column names to snake_case for ease of reading

In [5]:
df.rename(
    columns={
        'dateCrawled': 'date_crawled', 'offerType': 'offer_type',
        'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year',
        'powerPS': 'power_PS', 'kilometer': 'odometer_km',
        'monthOfRegistration': 'registration_month', 'fuelType': 'fuel_type',
        'notRepairedDamage': 'unrepaired_damage', 'dateCreated': 'ad_created',
        'nrOfPictures': 'nr_of_pictures', 'postalCode': 'postal_code',
        'lastSeen': 'last_seen',
    },
    inplace=True,
)
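
The mapping above is written by hand, which also lets it pick clearer names (for example `registration_year`). For columns that only need a mechanical camelCase to snake_case conversion, a small helper is one possible alternative. This is a minimal sketch: `camel_to_snake` is a hypothetical helper, not part of pandas, and the toy frame contains made-up values.

```python
import re

import pandas as pd


def camel_to_snake(name: str) -> str:
    """Hypothetical helper: put '_' between a lowercase letter or digit
    and the capital that follows it, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Toy frame using a few of the original column names (values are made up).
toy = pd.DataFrame(
    {"dateCrawled": ["24/03/2016 11:52"], "powerPS": [190], "brand": ["audi"]}
)

# DataFrame.rename accepts a function and applies it to every column label.
toy = toy.rename(columns=camel_to_snake)
print(list(toy.columns))
```

Note that this lowercases everything, so `powerPS` becomes `power_ps` rather than the `power_PS` used in this notebook.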

#### Summarise the renamed dataframe to decide on the next cleaning steps

In [6]:
df.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,371537,371537,371537,371537,371537.0,371537,333668,371537.0,351328,371537.0,351053,371537.0,371537.0,338151,371537,299477,371537,371537.0,371537.0,371537
unique,15622,233389,2,2,,2,8,,2,,251,,,7,40,2,114,,,18705
top,05/03/2016 14:25,Ford_Fiesta,private,Angebot,,test,sedan,,manual,,golf,,,petrol,volkswagen,no,03/04/2016 00:00,,,07/04/2016 06:45
freq,68,657,371534,371525,,192591,95896,,274219,,30070,,,223863,79640,263189,14451,,,708
mean,,,,,17295.49,,,2004.577883,,115.549151,,125618.161852,5.734473,,,,,0.0,50820.666402,
std,,,,,3587910.0,,,92.865496,,192.137403,,40112.919387,3.712383,,,,,0.0,25799.080292,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


#### Results:
- "nr_of_pictures" is 0 for every entry, so it can be dropped from the dataframe.
- "registration_month" isn't needed; the year of registration is precise enough for this analysis.
- "name" will be dropped. It contains some useful information, but not in a consistent format, so it would be hard to use. Make, model, engine power, and registration year are already available in other columns and cover most of the same information.
- Other columns that are irrelevant and can be dropped:
  - "seller" is "private" for all but 3 entries
  - "offer_type" is "Angebot" (offer) for all but 12 entries
  - "date_crawled"
  - "ad_created"
  - "last_seen"
  - "abtest"
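
One way to spot near-constant columns such as "seller" and "nr_of_pictures" without reading the `describe` output by eye is to compute the share of rows taken by each column's most common value. The sketch below uses a made-up toy frame, and the 0.75 cut-off is an arbitrary assumption chosen only so the toy data shows the effect; on the real data a value much closer to 1 would be more sensible.

```python
import pandas as pd

# Toy stand-in for the renamed frame (values are made up).
toy = pd.DataFrame(
    {
        "seller": ["private", "private", "private", "commercial"],
        "offer_type": ["Angebot", "Angebot", "Angebot", "Angebot"],
        "nr_of_pictures": [0, 0, 0, 0],
        "price": [480, 18300, 9800, 1500],
    }
)

# value_counts sorts by frequency, so the first entry is the most common value;
# normalize=True turns counts into shares of all rows.
top_share = toy.apply(
    lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
)

# Assumed threshold for this toy example only.
near_constant = top_share[top_share >= 0.75].index.tolist()
print(top_share)
print(near_constant)
```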

#### Since the dataset has so many rows, rows with null values can be dropped while still leaving a sufficiently large dataset for analysis

In [7]:
columns_to_drop = ['nr_of_pictures', 'name', 'registration_month', 'seller', 'offer_type', 'date_crawled', 'ad_created', 'last_seen', 'abtest']
df.drop(columns_to_drop, axis=1, inplace=True)
df.dropna(inplace=True)
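
Before a blanket `dropna`, it can be useful to see which columns the nulls come from and how many rows would survive. The following is a sketch on a made-up toy frame that reuses a few of this notebook's column names; it is not run against `autos.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (all values are made up).
toy = pd.DataFrame(
    {
        "vehicle_type": [np.nan, "coupe", "suv", "sedan"],
        "gearbox": ["manual", "manual", "automatic", np.nan],
        "unrepaired_damage": [np.nan, "yes", np.nan, "no"],
        "price": [480, 18300, 9800, 1500],
    }
)

# Fraction of null entries per column, worst column first.
null_share = toy.isna().mean().sort_values(ascending=False)

# Number of rows that have no nulls at all, i.e. what dropna() would keep.
rows_kept = len(toy.dropna())

print(null_share)
print(f"{rows_kept} of {len(toy)} rows have no nulls")
```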

In [8]:
df.describe(include='all')

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_PS,model,odometer_km,fuel_type,brand,unrepaired_damage,postal_code
count,260965.0,260965,260965.0,260965,260965.0,260965,260965.0,260965,260965,260965,260965.0
unique,,8,,2,,250,,7,39,2,
top,,sedan,,manual,,golf,,petrol,volkswagen,no,
freq,,76396,,200064,,20958,,169248,54194,232953,
mean,8209.431,,2003.309382,,126.331531,,124057.977123,,,,51818.674995
std,343303.4,,6.512486,,145.275389,,39848.992196,,,,25843.42422
min,0.0,,1910.0,,0.0,,5000.0,,,,1067.0
25%,1500.0,,1999.0,,78.0,,100000.0,,,,31226.0
50%,3850.0,,2004.0,,116.0,,150000.0,,,,51103.0
75%,8600.0,,2008.0,,150.0,,150000.0,,,,72766.0


#### Overview
Some of these values are clearly incorrect.
- Year of registration: The dataset originates from 2016, so no registration year after that is valid. The earliest is 1910, which is plausible.
- Power PS: Engine power ranges from 0 to 20,000. A value of 0 is impossible, since a vehicle with no power cannot move, and 20,000 is far too high. An upper limit of 1,000 will be applied, which is well beyond most production cars. A very small number of cars exceed this, so little data will be lost, and those cars would be outliers anyway.
- Price: A price of €0, if valid, is probably explained by some detail in the description that has not been scraped (e.g. a swap for something else, or fines tied to the vehicle that are close to its value). Likewise, a price over about €1,000,000 is an extreme case that is likely due to some other factor (e.g. a specific vehicle history, such as use as a movie prop). Anything below €100 or above €1,000,000 will be dropped.
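
To check the claim that little data is lost, each rule can be kept as a separate boolean mask and counted before the masks are combined. This is a sketch on a made-up toy frame; the thresholds are the ones stated above, and `Series.between` is inclusive at both ends by default.

```python
import pandas as pd

# Toy frame: each row except the second breaks exactly one rule (values are made up).
toy = pd.DataFrame(
    {
        "registration_year": [1993, 2011, 2018, 2004, 2007],
        "power_PS": [0, 190, 102, 20000, 75],
        "price": [480, 18300, 3500, 2200, 0],
    }
)

# One mask per rule, True where the row passes.
masks = {
    "registration_year <= 2016": toy["registration_year"] <= 2016,
    "power_PS in [1, 1000]": toy["power_PS"].between(1, 1000),
    "price in [100, 1000000]": toy["price"].between(100, 1_000_000),
}

for label, mask in masks.items():
    print(f"{label}: fails for {(~mask).sum()} rows")

# A row is kept only if it passes every rule.
keep = pd.concat(masks, axis=1).all(axis=1)
filtered = toy[keep]
print(f"{len(filtered)} of {len(toy)} rows kept")
```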



In [9]:
df = df[(df['registration_year'] <= 2016) & (df['power_PS'].between(1, 1000)) & (df['price'].between(100, 1000000))]

In [12]:
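def count_col_vals(df, col):
    """Count how often each value occurs in df[col] and print the 50 most common."""
    # Unique values and their counts
    data_labels, data_counts = np.unique(df[col], return_counts=True)
    d = {'labels': data_labels, 'counts': data_counts}
    # Sort ascending by count, so the most common values end up last
    df_result = pd.DataFrame(data=d).sort_values(by='counts').reset_index(drop=True)
    # Note: 'percentage' holds a fraction between 0 and 1, not a value out of 100
    df_result['percentage'] = df_result['counts'] / df_result['counts'].sum()
    # Print the 50 most common values in descending order, then return the full (ascending) frame
    print()
    print(df_result.tail(50).iloc[::-1])
    return df_result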

count_col_vals(df, "model")


          labels  counts  percentage
249         golf   20096    0.081169
248        other   18937    0.076488
247     3 series   14963    0.060437
246         polo    8238    0.033274
245        corsa    7624    0.030794
244           a4    7594    0.030673
243       passat    7232    0.029211
242        astra    7173    0.028972
241     5 series    6594    0.026634
240      c_class    6594    0.026634
239      e_class    5645    0.022801
238           a3    4829    0.019505
237           a6    4580    0.018499
236        focus    4230    0.017085
235  transporter    3904    0.015769
234       fiesta    3722    0.015033
233     2_series    3689    0.014900
232     1 series    3316    0.013394
231      a_class    2940    0.011875
230       twingo    2770    0.011188
229       fortwo    2763    0.011160
228       vectra    2563    0.010352
227       touran    2528    0.010211
226     3_series    2505    0.010118
225       mondeo    2406    0.009718
224         clio    2208    0.008918


Unnamed: 0,labels,counts,percentage
0,discovery_sport,1,0.000004
1,samara,2,0.000008
2,elefantino,3,0.000012
3,serie_3,3,0.000012
4,rangerover,3,0.000012
...,...,...,...
245,corsa,7624,0.030794
246,polo,8238,0.033274
247,3 series,14963,0.060437
248,other,18937,0.076488
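
The same kind of summary can also be built from pandas' own `Series.value_counts`, which already sorts by frequency. The sketch below is an alternative, not the function used above: `count_col_vals_vc` is a hypothetical name, the toy data is made up, and the result is sorted in descending order, whereas `count_col_vals` returns its frame in ascending order.

```python
import pandas as pd


def count_col_vals_vc(df, col):
    """Hypothetical variant of count_col_vals built on Series.value_counts."""
    counts = df[col].value_counts()  # most common value first
    # Name the index and value columns explicitly so the layout does not
    # depend on the pandas version's default names.
    result = counts.rename_axis("labels").reset_index(name="counts")
    result["percentage"] = result["counts"] / result["counts"].sum()
    return result


# Toy data (made up) to show the shape of the result.
toy = pd.DataFrame({"model": ["golf", "golf", "polo", "other", "golf", "polo"]})
summary = count_col_vals_vc(toy, "model")
print(summary)
```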
