# Exploring eBay Car Sales Data

We'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). We've made a few modifications to the original dataset:

* We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment
* We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

### Importing libraries and reading datasets

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv",encoding = 'Latin-1')

In [2]:
autos
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Most columns have no null values, and the predominant dtype is object, with only a few int64 columns. We can also see that columns that ought to be int or float, such as price and odometer, are stored as objects.

Convert the column names from camelCase to snake_case.

In [3]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [4]:
names ={'dateCrawled':'date_crawled', 'offerType':'offer_type','vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year','powerPS':'power_ps',
       'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
       'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'num_pictures','postalCode':'postal_code',
       'lastSeen':'last_seen'}
autos.rename(columns=names,inplace=True)
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Modifying the column names to snake_case makes them easier to work with in pandas.
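The renaming above uses a hand-written mapping, which allows custom names such as `registration_year`. As a rough sketch of an automated alternative, the hypothetical helper `camel_to_snake` below (not part of the original notebook) converts camelCase names with a regular expression. It would produce `year_of_registration` rather than `registration_year`, so a manual mapping is still needed for custom names.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (illustrative helper)."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    with_underscores = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return with_underscores.lower()

# A few of the original column names from the dataset.
columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "name"]
converted = [camel_to_snake(c) for c in columns]
print(converted)
```

With a DataFrame, the same helper could be applied as `autos.columns = [camel_to_snake(c) for c in autos.columns]`.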

Next, we look at the columns more closely to see whether any need cleaning, that is, numeric data stored as text or columns where all values are the same.

In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [6]:
autos["num_pictures"].value_counts()

0    50000
Name: num_pictures, dtype: int64

In [7]:
autos["num_pictures"].head()

0    0
1    0
2    0
3    0
4    0
Name: num_pictures, dtype: int64

The num_pictures column is 0 for every row, and the seller and offer_type columns each have a single value that accounts for 99.998% of rows (49,999 of 50,000). These three columns carry almost no information, so we can drop them.
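One way to find such near-constant columns without reading the `describe()` table by eye is to compute the share of the most frequent value in each column. The sketch below uses made-up toy data. The helper name `dominant_share` and the 0.85 threshold are assumptions chosen for illustration.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the share of rows taken by the most frequent value."""
    return pd.Series(
        {col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
         for col in df.columns}
    )

# Toy data standing in for the real listings (values are made up).
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "num_pictures": [0] * 10,
    "brand": ["bmw", "audi"] * 5,
})

shares = dominant_share(toy)
# Columns where one value covers at least 85% of rows are drop candidates.
near_constant = shares[shares >= 0.85].index.tolist()
print(near_constant)
```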

In [8]:
autos = autos.drop(["num_pictures", "seller", "offer_type"], axis=1)

As noted earlier, the price and odometer columns hold numeric values stored as text, so we will convert them to numeric types.

In [9]:
autos["price"].value_counts()

$0             1421
$500            781
$1,500          734
$2,500          643
$1,000          639
$1,200          639
$600            531
$800            498
$3,500          498
$2,000          460
$999            434
$750            433
$900            420
$650            419
$850            410
$700            395
$4,500          394
$300            384
$2,200          382
$950            379
$1,100          376
$1,300          371
$3,000          365
$550            356
$1,800          355
$5,500          340
$350            335
$1,250          335
$1,600          327
$1,999          322
               ... 
$26,150           1
$64,990           1
$23,590           1
$4,780            1
$10,949           1
$440              1
$99,999,999       1
$16,699           1
$18,977           1
$35,950           1
$1,498            1
$6,495            1
$1,466            1
$7,430            1
$4,770            1
$15,870           1
$217              1
$69,500           1
$15,413           1


The price column contains dollar signs and commas. We will use the string replace method to remove them.

In [10]:
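# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
autos["price"]=autos["price"].str.replace("$","",regex=False)
autos["price"]=autos["price"].str.replace(",","",regex=False)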
autos["price"]

0         5000
1         8500
2         8990
3         4350
4         1350
5         7900
6          300
7         1990
8          250
9          590
10         999
11         350
12        5299
13        1350
14        3999
15       18900
16         350
17        5500
18         300
19        4150
20        3500
21       41500
22       25450
23        7999
24       48500
25          90
26         777
27           0
28        5250
29        4999
         ...  
49970    15800
49971      950
49972     3300
49973     6000
49974        0
49975     9700
49976     5900
49977     5500
49978      900
49979    11000
49980      400
49981     2000
49982     1950
49983      600
49984        0
49985     1000
49986    15900
49987    21990
49988     9550
49989      150
49990    17500
49991      500
49992     4800
49993     1650
49994     5000
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: object

Converting to integer:

In [11]:
autos["price"]=autos["price"].astype(int)
autos["price"]

0         5000
1         8500
2         8990
3         4350
4         1350
5         7900
6          300
7         1990
8          250
9          590
10         999
11         350
12        5299
13        1350
14        3999
15       18900
16         350
17        5500
18         300
19        4150
20        3500
21       41500
22       25450
23        7999
24       48500
25          90
26         777
27           0
28        5250
29        4999
         ...  
49970    15800
49971      950
49972     3300
49973     6000
49974        0
49975     9700
49976     5900
49977     5500
49978      900
49979    11000
49980      400
49981     2000
49982     1950
49983      600
49984        0
49985     1000
49986    15900
49987    21990
49988     9550
49989      150
49990    17500
49991      500
49992     4800
49993     1650
49994     5000
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int64
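The same strip-and-cast pattern applies to both price and odometer. As a sketch, it could be wrapped in a small helper. The function name `to_int_series` is hypothetical, and the sample strings mirror the formats seen in the dataset.

```python
import pandas as pd

def to_int_series(series, drop_substrings):
    """Strip the given literal substrings from a string Series and cast to int."""
    cleaned = series
    for text in drop_substrings:
        # regex=False makes "$" a literal character instead of a regex anchor.
        cleaned = cleaned.str.replace(text, "", regex=False)
    return cleaned.astype(int)

prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km"])

clean_prices = to_int_series(prices, ["$", ","])
clean_odometer = to_int_series(odometer, ["km", ","])
print(clean_prices.tolist())
print(clean_odometer.tolist())
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions, since the default treatment of the pattern has changed over time.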

In [12]:
autos["odometer"].value_counts()

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

In [13]:
autos["odometer"].describe()

count         50000
unique           13
top       150,000km
freq          32424
Name: odometer, dtype: object

In [14]:
autos["odometer"]=autos["odometer"].str.replace("km","")
autos["odometer"]=autos["odometer"].str.replace(",","").astype(int)
autos["odometer"]

0        150000
1        150000
2         70000
3         70000
4        150000
5        150000
6        150000
7        150000
8        150000
9        150000
10       150000
11       150000
12        50000
13       150000
14       150000
15        80000
16       150000
17       150000
18       150000
19       150000
20       150000
21       150000
22        10000
23       150000
24        30000
25       150000
26       125000
27       150000
28       150000
29       150000
          ...  
49970     60000
49971    150000
49972    150000
49973    150000
49974    150000
49975    100000
49976    150000
49977    150000
49978    150000
49979     70000
49980    125000
49981    150000
49982     90000
49983    150000
49984    150000
49985    150000
49986    125000
49987     50000
49988    150000
49989    150000
49990     30000
49991    150000
49992    125000
49993    150000
49994    150000
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer, Length: 

In [15]:
autos.rename(columns={"odometer":"odometer_km"},inplace=True)

### Exploring the odometer and price columns

In [16]:
price = autos["price"]
price.unique().shape

(2357,)

In [17]:
price.describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [18]:
price.value_counts().sort_index(ascending=False)

99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
3890000        1
1300000        1
1234566        1
999999         2
999990         1
350000         1
345000         1
299000         1
295000         1
265000         1
259000         1
250000         1
220000         1
198000         1
197000         1
194000         1
190000         1
180000         1
175000         1
169999         1
169000         1
163991         1
163500         1
155000         1
151990         1
            ... 
66             1
65             5
60             9
59             1
55             2
50            49
49             4
47             1
45             4
40             6
35             1
30             7
29             1
25             5
20             4
18             1
17             3
15             2
14             1
13             2
12             3
11             2
10             7
9              1
8              1
5              2
3              1
2             

In [19]:
autos=autos[autos["price"].between(1,351000)]
autos

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,7900,test,bus,2006,automatik,150,voyager,150000,4,diesel,chrysler,,2016-03-21 00:00:00,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,300,test,limousine,1995,manuell,90,golf,150000,8,benzin,volkswagen,,2016-03-20 00:00:00,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,1990,control,limousine,1998,manuell,90,golf,150000,12,diesel,volkswagen,nein,2016-03-16 00:00:00,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,250,test,,2000,manuell,0,arosa,150000,10,,seat,nein,2016-03-22 00:00:00,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,590,control,bus,1997,manuell,90,megane,150000,7,benzin,renault,nein,2016-03-16 00:00:00,15749,2016-04-06 10:46:35


Given that the site is an auction site, it is possible for bidding to start at $1, and there appear to be very few cars listed above $350,000. We therefore keep only prices from 1 to 351,000, which removes 1,435 rows (1,421 listings priced at $0 and 14 priced above $350,000).
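The filter relies on `Series.between`, which includes both endpoints by default. A minimal sketch with made-up prices shows which values survive the bounds used above and how many rows are dropped:

```python
import pandas as pd

# Made-up prices, including values on and just outside the bounds.
prices = pd.Series([0, 1, 500, 350000, 351000, 999999])

# between() is inclusive on both ends by default, so 1 and 351000 are kept.
kept = prices[prices.between(1, 351000)]
removed = len(prices) - len(kept)

print(kept.tolist())
print(removed)
```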

In [20]:
odometer = autos["odometer_km"]
odometer.unique().shape

(13,)

In [21]:
odometer.describe()

count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [22]:
odometer.value_counts().sort_index().head()

5000     836
10000    253
20000    762
30000    780
40000    815
Name: odometer_km, dtype: int64

## Exploring the date columns

1. Date_crawled column

In [23]:
crawled = autos["date_crawled"].str[:10]

In [24]:
crawled.value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

It looks like the listings were crawled from the beginning of March 2016 to early April 2016 (2016-03-05 to 2016-04-07).

In [25]:
lastseen = autos["last_seen"].str[:10]
lastseen.value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

The last_seen column covers a similar period to date_crawled: about one month between March and April 2016. The last three values seem out of place, since each is far higher than the share on any earlier day.

In [26]:
created= autos["ad_created"].str[:10]
created.value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

Most ads were created within 1-2 months of the crawl dates; a few are as old as about 9 months.
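The distributions above were built by slicing the first ten characters of each timestamp string. As an alternative sketch, the strings can be parsed with `pd.to_datetime`, which also makes date arithmetic possible. The sample timestamps below are taken from the first rows of the dataset, and the variable names are illustrative.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-03-26 18:57:24",
    "2016-04-04 13:38:56",
    "2016-03-12 16:58:10",
])

parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# normalize() sets the time to midnight, so each timestamp collapses to its day.
daily_share = parsed.dt.normalize().value_counts(normalize=True).sort_index()
print(daily_share)

# With real datetimes, the length of the covered period is a simple subtraction.
span_days = (parsed.max() - parsed.min()).days
print(span_days)
```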

In [27]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The minimum and maximum registration years are odd: year 1000 is long before cars were invented, and year 9999 is far in the future.

## Dealing with incorrect registration year date

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate, and any year below 1900 is unrealistic.

In [28]:
autos["registration_year"].value_counts().sort_index(ascending=False)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
2009    2085
2008    2215
2007    2277
2006    2670
2005    2936
2004    2703
2003    2699
2002    2486
2001    2636
2000    3156
        ... 
1964      12
1963       8
1962       4
1961       6
1960      23
1959       6
1958       4
1957       2
1956       4
1955       2
1954       2
1953       1
1952       1
1951       2
1950       3
1948       1
1943       1
1941       2
1939       1
1938       1
1937       4
1934       2
1931       1
1929       1
1927       1
1910       5
1800       2
1111       1
1001       1
1000       1
Name: registration_year, Length: 95, dtype: int64

The highest acceptable year is 2016, since that is the last year in which listings were crawled according to the date_crawled column. The lowest acceptable year is 1900, because registrations from before the 20th century are unrealistic.
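Before dropping rows, it can be useful to check what share of the data falls outside the accepted range. The sketch below uses made-up years. The mean of a boolean mask gives the fraction of `True` values.

```python
import pandas as pd

# Made-up registration years, including a few impossible ones.
years = pd.Series([1000, 1910, 1999, 2004, 2016, 2017, 9999, 2008])

valid = years.between(1900, 2016)
# The mean of the inverted mask is the fraction of rows that would be removed.
share_invalid = (~valid).mean()
print(share_invalid)
```

If the share is small, removing the rows is unlikely to distort the analysis.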

In [29]:
autos=autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).sort_index()

1910    0.000107
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000043
1937    0.000086
1938    0.000021
1939    0.000021
1941    0.000043
1943    0.000021
1948    0.000021
1950    0.000064
1951    0.000043
1952    0.000021
1953    0.000021
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
1959    0.000129
1960    0.000493
1961    0.000129
1962    0.000086
1963    0.000171
1964    0.000257
1965    0.000364
1966    0.000471
1967    0.000557
1968    0.000557
          ...   
1987    0.001542
1988    0.002892
1989    0.003727
1990    0.007433
1991    0.007262
1992    0.007926
1993    0.009104
1994    0.013474
1995    0.026285
1996    0.029412
1997    0.041794
1998    0.050620
1999    0.062060
2000    0.067608
2001    0.056468
2002    0.053255
2003    0.057818
2004    0.057904
2005    0.062895
2006    0.057197
2007    0.048778
2008    0.047450
2009    0.044665
2010    0.034040
2011    0.034768
2012    0.028063
2013    0.017202
2014    0.0142

Most of the cars were registered in the last 20 years.


## Exploring price by brand.

Exploring the brand column. 

In [30]:
brands = autos["brand"].value_counts()
brands

volkswagen        9862
bmw               5137
opel              5022
mercedes_benz     4503
audi              4041
ford              3263
renault           2201
peugeot           1393
fiat              1197
seat               853
skoda              766
nissan             713
mazda              709
smart              661
citroen            654
toyota             593
hyundai            468
sonstige_autos     458
volvo              427
mini               409
mitsubishi         384
honda              366
kia                330
alfa_romeo         310
porsche            286
suzuki             277
chevrolet          266
chrysler           164
dacia              123
daihatsu           117
jeep               106
subaru             100
land_rover          98
saab                77
jaguar              73
daewoo              70
trabant             65
rover               62
lancia              50
lada                27
Name: brand, dtype: int64

Volkswagen has the highest number of listings and Lada the fewest. We will select the 20 most common brands.

In [31]:
common_brands = brands.head(20).index
common_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'],
      dtype='object')

In [32]:
brand_mean_price = {}

for brand in common_brands:
    brandt = autos[autos["brand"] == brand]
    meanprice = brandt["price"].mean()
    brand_mean_price[brand] = meanprice.round()
print(brand_mean_price)

{'hyundai': 5365.0, 'sonstige_autos': 12339.0, 'fiat': 2814.0, 'bmw': 8333.0, 'toyota': 5167.0, 'opel': 2975.0, 'ford': 3749.0, 'skoda': 6368.0, 'audi': 9337.0, 'smart': 3580.0, 'mini': 10613.0, 'volkswagen': 5402.0, 'citroen': 3779.0, 'volvo': 4947.0, 'nissan': 4743.0, 'mazda': 4113.0, 'seat': 4397.0, 'peugeot': 3094.0, 'renault': 2475.0, 'mercedes_benz': 8628.0}


The most expensive brands on average are sonstige_autos, mini, audi, mercedes_benz, and bmw. Skoda, volkswagen, and hyundai sit in the middle.
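The loop above can also be expressed with `groupby`, which computes the mean per brand in one step. This is a sketch on made-up toy data rather than the notebook's own method. `head(2)` stands in for `brands.head(20)`.

```python
import pandas as pd

# Toy listings (values are made up).
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "audi", "audi", "lada", "bmw"],
    "price": [8000, 9000, 10000, 9000, 1000, 7000],
})

# Keep the two most frequent brands.
top_brands = toy["brand"].value_counts().head(2).index

mean_price = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .round()
    .sort_values(ascending=False)
)
print(mean_price)
```

Sorting the result makes the ranking easier to read than the unordered dictionary printed above.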

## Exploring  mileage 

In [33]:
bmp_series = pd.Series(brand_mean_price)
pd.DataFrame(bmp_series,columns=["mean_price"])


Unnamed: 0,mean_price
audi,9337.0
bmw,8333.0
citroen,3779.0
fiat,2814.0
ford,3749.0
hyundai,5365.0
mazda,4113.0
mercedes_benz,8628.0
mini,10613.0
nissan,4743.0


In [34]:
brand_mean_mileage = {}
for brand in common_brands:
    brandt = autos[autos["brand"] == brand]
    meanmileage = brandt["odometer_km"].mean()
    brand_mean_mileage[brand] = meanmileage.round()
bmp_series = pd.Series(brand_mean_price).sort_values(ascending=False)
bmm_series = pd.Series(brand_mean_mileage).sort_values(ascending=False)

In [35]:
dfm = pd.DataFrame(bmm_series,columns=["mean_mileage"])
dfm

Unnamed: 0,mean_mileage
volvo,138068.0
bmw,132573.0
mercedes_benz,130788.0
opel,129310.0
audi,129157.0
volkswagen,128707.0
renault,128071.0
peugeot,127154.0
mazda,124464.0
ford,124266.0


In [36]:
dfm["mean_price"]= bmp_series

In [37]:
brand_info = dfm
brand_info

Unnamed: 0,mean_mileage,mean_price
volvo,138068.0,4947.0
bmw,132573.0,8333.0
mercedes_benz,130788.0,8628.0
opel,129310.0,2975.0
audi,129157.0,9337.0
volkswagen,128707.0,5402.0
renault,128071.0,2475.0
peugeot,127154.0,3094.0
mazda,124464.0,4113.0
ford,124266.0,3749.0
