# **Kaggle Dataset Analysis: Exploring eBay Car Sales Data**

In this project, I am going to work on a scraped dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset contains a lot of dirty data, so my first goal is to clean it in preparation for further analysis. Then, I will look for patterns among the characteristics of the cars listed in the dataset.

# **Data exploration**

In [1]:
import pandas as pd
import numpy as np

In [2]:
autos = pd.read_csv("/Users/tangigouez/Desktop/DATAQUEST/MyDataSets/Project_3/autos.csv", encoding = "Latin-1")

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

The dataframe contains 50,000 entries and 20 columns. Most columns are stored as objects (15 columns), i.e. strings, and the rest as integers (5 columns, `int64`). Some columns have missing data, as they contain fewer than 50,000 non-null entries: "vehicleType", "gearbox", "model", "fuelType" and "notRepairedDamage".
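To make the missing-value counts explicit rather than reading them off `info()`, a small sketch like the following can help. It runs on a toy DataFrame whose values are invented for illustration, not on the real dataset; the variable names `toy`, `missing` and `missing_share` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", np.nan, np.nan],
    "brand": ["peugeot", "bmw", "ford", "audi"],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

# Share of missing values per column, as a fraction of all rows.
missing_share = missing / len(toy)
print(missing_share)
```

The same two lines applied to `autos` would list the five incomplete columns together with their share of missing rows.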

# **Cleaning column names**

The `DataFrame.info()` output shows that the column names use camelCase rather than Python's preferred snake_case, so we can't simply replace spaces with underscores. Let's convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.

In [4]:
autos_test_columns = autos.columns

In [5]:
autos_test_columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
def change_case(name):
    # Prefix every uppercase letter with "_" and lowercase it, then drop any leading "_".
    return ''.join(['_' + ch.lower() if ch.isupper() else ch for ch in name]).lstrip('_')

In [7]:
def clean_col(col):
    col = col.replace('yearOfRegistration','registration_year')
    col = col.replace('monthOfRegistration','registration_month')
    col = col.replace('notRepairedDamage','unrepaired_damage')
    col = col.replace('dateCreated','ad_created')
    col = change_case(col)
    return col 

In [8]:
new_columns = []
for c in autos_test_columns:
    col = clean_col(c)
    new_columns.append(col)
autos.columns = new_columns

In [9]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
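The output above shows that `powerPS` became `power_p_s`, because every capital letter gets its own underscore. If keeping runs of capitals together is preferred, a regex-based conversion is one possible alternative. This is only a sketch: the function name `camel_to_snake` is hypothetical and is not used elsewhere in this notebook, which keeps `power_p_s`.

```python
import re

def camel_to_snake(name):
    """Convert camelCase to snake_case, keeping runs of capitals together."""
    # Insert "_" before a capital that starts a capitalised word (e.g. "Of").
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Insert "_" between a lowercase letter or digit and a following capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()

for original in ["powerPS", "yearOfRegistration", "nrOfPictures", "abtest"]:
    print(original, "->", camel_to_snake(original))
```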


# **Initial exploration and cleaning**

In this section, I will explore the DataFrame in more detail to determine whether specific cleaning tasks are needed. After the exploration, I will delete or replace data that provides no valuable information for the analysis, or that is wrong and potentially confusing.

In [10]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


At first glance, some columns hold almost a single value: in "seller" and "offer_type" the most frequent value occurs 49,999 times out of 50,000, and "nr_of_pictures" is 0 in every row shown by the summary (its mean, standard deviation, minimum and quartiles are all 0). These columns add nothing to the analysis and are candidates to be dropped. The price column needs more investigation, since "$0" occurs 1,421 times. In the odometer and price columns, data is stored as text; because they hold numeric values, it would be preferable to store them as numbers. Finally, the numeric columns ("registration_year", "power_p_s", "registration_month", "nr_of_pictures" and "postal_code") show NaN for unique, top and freq only because those statistics apply to text columns, not because data is missing. The columns that really have missing values are those with a count below 50,000: "vehicle_type", "gearbox", "model", "fuel_type" and "unrepaired_damage".

First, as the "seller", "offer_type" and "nr_of_pictures" columns don't look valuable, I am going to remove them immediately.

In [11]:
autos = autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1)
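Near-constant columns like these can also be detected programmatically instead of by eye. The sketch below uses a toy DataFrame with invented values; the helper name `top_value_share` and the 0.9 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "seller": ["privat"] * 19 + ["gewerblich"],
    "nr_of_pictures": [0] * 20,
    "brand": ["bmw", "audi", "ford", "opel"] * 5,
})

def top_value_share(column):
    """Fraction of rows taken by the most frequent value (NaN counted too)."""
    return column.value_counts(normalize=True, dropna=False).iloc[0]

shares = toy.apply(top_value_share)
# Hypothetical threshold: flag columns where one value covers at least 90% of rows.
near_constant = shares[shares >= 0.9].index.tolist()
print(shares)
print(near_constant)
```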

Next, I am going to convert the odometer and price columns to numeric values, which will make further analysis easier.

In [12]:
autos["price"] = (autos["price"]
                          .str.replace("$", "", regex=False)  # treat "$" as a literal character
                          .str.replace(",", "", regex=False)
                          .astype(int)
                          )
autos["price"].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [13]:
autos["odometer"] = (autos["odometer"]
                             .str.replace("km","")
                             .str.replace(",","")
                             .astype(int)
                             )
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64
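Both conversions follow the same pattern: strip a few literal tokens, then cast to integer. A small helper can capture that pattern, as sketched below. The name `to_int` is hypothetical. Passing `regex=False` is a precaution, since depending on the pandas version `str.replace` may interpret the pattern as a regular expression, in which `$` is an end-of-string anchor.

```python
import pandas as pd

def to_int(series, tokens):
    """Strip each literal token from a string Series, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False treats "$" as a plain character, not an end-of-string anchor.
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

# Sample values copied from the first rows shown earlier.
prices = pd.Series(["$5,000", "$8,500", "$8,990"])
odometer = pd.Series(["150,000km", "150,000km", "70,000km"])

print(to_int(prices, ["$", ","]).tolist())
print(to_int(odometer, ["km", ","]).tolist())
```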

# **Exploring odometer and price**

Now, I am going to explore these columns in more depth, looking for any data that doesn't look right.

In [14]:
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In the odometer column, all values are rounded, and about 65% of the listings sit at 150,000 km. Nothing here looks wrong.

In [15]:
print(autos["price"].unique().shape)
print(autos["price"].describe())
autos["price"].value_counts().head(20)

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
Name: price, dtype: int64

However, the price column contains some suspicious data that could mislead further analysis. The minimum value is 0, and it occurs 1,421 times in the series. Given that this is under 3% of the cars, we might consider removing these rows.

The maximum price is close to one hundred million dollars, which is implausible for a used car. Let's look at the highest prices more closely.

In [16]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [17]:
autos["price"].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

There are a number of listings with prices below 30 dollars, including 1,421 at 0 dollars. There are also a small number of listings with very high values, including 14 at around or over 1 million dollars.

Given that eBay is an auction site, there could legitimately be items where the opening bid is 1 dollar. We will keep the 1 dollar items, but remove anything above 350,000 dollars, since these prices look less realistic.

In [18]:
autos = autos[autos["price"].between(1,351000)]

In [19]:
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
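Before dropping rows, it can be useful to check how many a filter would remove. The sketch below shows the idea on a toy Series with invented prices; it relies on `Series.between()` being inclusive on both ends by default.

```python
import pandas as pd

# Toy prices, invented for illustration; 0 and 99999999 mimic the outliers seen above.
prices = pd.Series([0, 1, 500, 1500, 350000, 999999, 99999999])

# between() is inclusive on both ends by default, so 1 and 350000 are kept.
keep = prices.between(1, 350000)

print("rows kept:", int(keep.sum()))
print("share removed:", round(1 - keep.mean(), 3))
print(prices[keep].tolist())
```

Because the mask is a boolean Series, its mean is the share of rows kept, which gives the share removed without any extra counting.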

# **Exploring the date columns**

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself:
- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

At the moment the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.

In [20]:
autos[['date_crawled','ad_created','last_seen']][0:5]

| | date_crawled | ad_created | last_seen |
|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |


From this exploration, we can see that the first 10 characters represent the day (e.g. 2016-03-12). To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort the result. The cells below sort by proportion, from the least to the most frequent day.

In [21]:
(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-04-07    0.001400
2016-04-06    0.003171
2016-03-18    0.012911
2016-04-05    0.013096
2016-03-06    0.014043
2016-03-13    0.015670
2016-03-05    0.025327
2016-03-24    0.029342
2016-03-16    0.029610
2016-03-27    0.031092
2016-03-25    0.031607
2016-03-17    0.031628
2016-03-31    0.031834
2016-03-10    0.032184
2016-03-26    0.032204
2016-03-23    0.032225
2016-03-11    0.032575
2016-03-22    0.032987
2016-03-09    0.033090
2016-03-08    0.033296
2016-03-30    0.033687
2016-04-01    0.033687
2016-03-29    0.034099
2016-03-15    0.034284
2016-03-19    0.034778
2016-03-28    0.034860
2016-04-02    0.035478
2016-03-07    0.036014
2016-04-04    0.036487
2016-03-14    0.036549
2016-03-12    0.036920
2016-03-21    0.037373
2016-03-20    0.037887
2016-04-03    0.038608
Name: date_crawled, dtype: float64

From this output, we can see that the data was crawled every day for roughly a month, from 2016-03-05 to 2016-04-07. The share of listings crawled per day is fairly uniform at around 3% to 4%, with lower shares on the first two days, the last three days, and a few days in between such as 2016-03-13 and 2016-03-18.
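Sorting by index instead of by value puts the days in chronological order, which makes the start and end of the crawl easier to read. The following is a minimal sketch on a handful of timestamps chosen for illustration, assuming the same `YYYY-MM-DD HH:MM:SS` format as `date_crawled`.

```python
import pandas as pd

# A few timestamps in the same format as date_crawled (values chosen for illustration).
crawled = pd.Series([
    "2016-04-07 14:36:44",
    "2016-03-05 14:06:30",
    "2016-03-05 20:10:02",
    "2016-03-26 17:47:46",
])

by_day = (crawled
          .str[:10]                      # keep only YYYY-MM-DD
          .value_counts(normalize=True)  # share of rows per day
          .sort_index()                  # ISO dates sort chronologically as text
          )
print(by_day)
print("first day:", by_day.index[0], "last day:", by_day.index[-1])
```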

In [22]:
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-01-29    0.000021
2016-02-16    0.000021
2016-01-13    0.000021
2015-08-10    0.000021
2016-02-17    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-22    0.000021
2015-12-30    0.000021
2016-01-14    0.000021
2015-06-11    0.000021
2016-01-03    0.000021
2015-11-10    0.000021
2016-02-08    0.000021
2016-01-16    0.000021
2016-02-07    0.000021
2015-09-09    0.000021
2016-01-22    0.000021
2016-02-01    0.000021
2015-12-05    0.000021
2016-01-07    0.000021
2016-02-24    0.000041
2016-02-20    0.000041
2016-02-02    0.000041
2016-01-10    0.000041
2016-02-12    0.000041
2016-02-18    0.000041
2016-02-05    0.000041
2016-02-26    0.000041
2016-02-14    0.000041
                ...   
2016-03-06    0.015320
2016-03-13    0.017008
2016-03-05    0.022897
2016-03-24    0.029280
2016-03-16    0.030125
2016-03-27    0.030989
2016-03-17    0.031278
2016-03-25    0.031751
2016-03-31    0.031875
2016-03-10    0.031895
2016-03-23    0.032060
2016-03-26    0.032266
2016-03-22 

There is a wide variety of ad creation dates. Most fall within 1-2 months of the crawl, but a few are much older; the oldest shown here, 2015-06-11, is about 9 months before crawling began.

In [23]:
(autos["last_seen"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values()
        )

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-18    0.007351
2016-03-08    0.007413
2016-03-13    0.008895
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-14    0.012602
2016-03-27    0.015649
2016-03-19    0.015834
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-26    0.016802
2016-03-23    0.018532
2016-03-25    0.019211
2016-03-24    0.019767
2016-03-21    0.020632
2016-03-20    0.020653
2016-03-28    0.020859
2016-03-22    0.021373
2016-03-29    0.022341
2016-04-01    0.022794
2016-03-31    0.023783
2016-03-12    0.023783
2016-04-04    0.024483
2016-03-30    0.024771
2016-04-02    0.024915
2016-04-03    0.025203
2016-03-17    0.028086
2016-04-05    0.124761
2016-04-07    0.131947
2016-04-06    0.221806
Name: last_seen, dtype: float64

The crawler recorded the date it last saw each listing, which lets us determine on what day a listing was removed, presumably because the car was sold.

The last three days contain far more 'last seen' values than any other date: roughly 12% to 22% each, against less than 3% on every earlier day. A sudden spike in sales is unlikely; these values more likely reflect the end of the crawling period than actual car sales.

# **Dealing with Incorrect Registration Year Data**

In [24]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year in which a car was first registered is a good indicator of its age. The series clearly contains wrong data, as the maximum is 9999 and the minimum 1000, so I am going to investigate further. A car cannot have been first registered after the listing was seen, so any registration year above 2016 is definitely wrong. The earliest plausible year is harder to pin down; I will use 1900 as a lower bound, since realistic values start somewhere in the first few decades of the 1900s. Rows outside this range are candidates for removal.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [25]:
(~autos["registration_year"].between(1900,2016)).sum() / autos.shape[0]

0.038793369710697

As we can see above, the wrong data concerns only about 4% of the rows, so we can remove these rows from our selection.

In [26]:
autos = autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).head(10)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
Name: registration_year, dtype: float64

It appears that most of the vehicles were first registered in the past 20 years.

# **Exploring price by brand**

In [27]:
brand_count = autos["brand"].value_counts(normalize = True)

In [28]:
brand_count

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

German manufacturers dominate: all of the top five brands (Volkswagen, BMW, Opel, Mercedes-Benz and Audi) are German, and together they account for about 61% of the listings. Volkswagen is by far the most popular brand, with about as many cars for sale as the next two brands combined (21.1% versus 21.8%).

There are lots of brands that don't have a significant percentage of listings, so I will limit my analysis to brands representing more than 5% of total listings.

In [29]:
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


In [30]:
brand_mean_prices = {}

for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_prices[brand] = int(mean_price)

brand_mean_prices

{'volkswagen': 5402,
 'bmw': 8332,
 'opel': 2975,
 'mercedes_benz': 8628,
 'audi': 9336,
 'ford': 3749}

Among the top 6 brands, there is a distinct price gap:

- Audi, Mercedes Benz and BMW are more expensive
- Ford and Opel are less expensive
- Volkswagen is in between, which may explain its popularity: it may be a 'best of both worlds' option.
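The per-brand loop above can also be written with `groupby`, which computes all the means in one pass. Here is a sketch on a toy DataFrame with invented values; the 0.2 share threshold is used only because the toy frame is tiny, and the names `toy`, `common` and `mean_prices` are assumptions.

```python
import pandas as pd

# Toy listings, invented for illustration.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "ford", "saab"],
    "price": [9000, 10000, 3000, 2000, 4000, 1500],
})

# Keep brands above a share threshold (0.2 here only because the toy frame is tiny).
shares = toy["brand"].value_counts(normalize=True)
common = shares[shares > 0.2].index

mean_prices = (toy[toy["brand"].isin(common)]
               .groupby("brand")["price"]
               .mean()
               .sort_values(ascending=False))
print(mean_prices)
```

The result is already a Series indexed by brand, so no dictionary or manual conversion is needed before sorting.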

# **Exploring Mileage**

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.

In [31]:
brand_mean_mileage = {}

for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)

brand_mean_mileage

{'volkswagen': 128707,
 'bmw': 132572,
 'opel': 129310,
 'mercedes_benz': 130788,
 'audi': 129157,
 'ford': 124266}

In [32]:
bmp_series = pd.Series(brand_mean_prices).sort_values(ascending = False)
print(bmp_series)

audi             9336
mercedes_benz    8628
bmw              8332
volkswagen       5402
ford             3749
opel             2975
dtype: int64


In [33]:
bmm_series = pd.Series(brand_mean_mileage).sort_values(ascending = False)
print(bmm_series)

bmw              132572
mercedes_benz    130788
opel             129310
audi             129157
volkswagen       128707
ford             124266
dtype: int64


In [34]:
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df['mean_mileage'] = bmm_series
df

| | mean_price | mean_mileage |
|---|---|---|
| audi | 9336 | 129157 |
| mercedes_benz | 8628 | 130788 |
| bmw | 8332 | 132572 |
| volkswagen | 5402 | 128707 |
| ford | 3749 | 124266 |
| opel | 2975 | 129310 |


Mean mileage varies far less by brand than mean price does: all six brands fall within 10% of each other. There is a slight tendency for the more expensive brands to have higher mileage, and for the less expensive brands to have lower mileage.
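To put numbers on this comparison, the relative spread of each column and the correlation between them can be computed from the six rows of the table above. This is a rough sketch: with only six points, the correlation is indicative at best, and the definition of spread used here, (max - min) / max, is one choice among several.

```python
import pandas as pd

# Values copied from the table above.
summary = pd.DataFrame(
    {"mean_price": [9336, 8628, 8332, 5402, 3749, 2975],
     "mean_mileage": [129157, 130788, 132572, 128707, 124266, 129310]},
    index=["audi", "mercedes_benz", "bmw", "volkswagen", "ford", "opel"],
)

# Relative spread: (max - min) / max for each column.
spread = (summary.max() - summary.min()) / summary.max()
print(spread.round(3))

# Pearson correlation between the two columns (only six points, so treat with caution).
correlation = summary["mean_price"].corr(summary["mean_mileage"])
print(round(correlation, 2))
```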

# **Further Data Cleaning**

The date columns all hold full timestamp values, whose first 10 characters represent the day (e.g. 2016-03-26). Let's convert these columns to integers of the form YYYYMMDD, so that they are uniform across the dataset and can be compared numerically.

In [38]:
date_columns=["date_crawled","ad_created","last_seen"]

for i in date_columns:
    col = autos[i].str[:10].str.replace('-','').astype(int)
    autos[i] = col
autos.head(5)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,20160326,Peugeot_807_160_NAVTECH_ON_BOARD,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,20160326,79588,20160406
1,20160404,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,20160404,71034,20160406
2,20160326,Volkswagen_Golf_1.6_United,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,20160326,35394,20160406
3,20160312,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,20160312,33729,20160315
4,20160401,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,20160401,39218,20160401
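Integer dates of the form YYYYMMDD sort and compare correctly, but subtracting them does not give a number of days. If date arithmetic is needed later, `pd.to_datetime` is one alternative. The sketch below contrasts the two on a pair of timestamps chosen for illustration; it is not part of the cleaning performed in this notebook.

```python
import pandas as pd

# Two dates one day apart, written in the same format as the original columns.
raw = pd.Series(["2016-03-31 10:00:00", "2016-04-01 09:00:00"])

# Integer form, as in the cell above: handy for comparisons and sorting.
as_int = raw.str[:10].str.replace("-", "", regex=False).astype(int)
print(as_int.iloc[1] - as_int.iloc[0])   # difference of the integers, not a day count

# Datetime form: subtraction yields a real time difference.
as_date = pd.to_datetime(raw.str[:10], format="%Y-%m-%d")
print((as_date.iloc[1] - as_date.iloc[0]).days)
```

Here the integer difference is 70 (20160401 minus 20160331), while the datetime difference is 1 day.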


# **Further Data Analysis**

#### Relation between mileage and average price

In this part, I am going to analyse if there are any patterns in relation to the mean price and the mileage of the car. 

In [44]:
mileage_group = autos["odometer_km"].unique()

In [45]:
mileage_group

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000,  40000, 100000])

In [50]:
mean_price_per_mileage = {}

for mileage in mileage_group:
    mileage_only = autos[autos["odometer_km"] == mileage]
    mean_price = mileage_only["price"].mean()
    mean_price_per_mileage[mileage] = mean_price

mean_price_per_mileage

{150000: 3767.9271065314942,
 70000: 10927.182813816344,
 50000: 13812.173212487412,
 80000: 9721.947636363637,
 10000: 20550.867219917014,
 30000: 16608.836842105262,
 125000: 6214.0220300597075,
 90000: 8465.02510460251,
 20000: 18448.477088948788,
 60000: 12385.004432624113,
 5000: 8873.51592356688,
 40000: 15499.568381430365,
 100000: 8132.697278911564}

In [51]:
ppm_series = pd.Series(mean_price_per_mileage).sort_values(ascending = False)

In [52]:
ppm_series

10000     20550.867220
20000     18448.477089
30000     16608.836842
40000     15499.568381
50000     13812.173212
60000     12385.004433
70000     10927.182814
80000      9721.947636
5000       8873.515924
90000      8465.025105
100000     8132.697279
125000     6214.022030
150000     3767.927107
dtype: float64

A clear trend emerges from this series: the lower the mileage, the higher the mean price, and cars with less than 70,000 km are clearly more valuable than the others. The one exception is the 5,000 km group, whose mean price (about 8,874 dollars) is lower than that of every group from 10,000 to 80,000 km.
The overall trend is no surprise: mileage is an indicator of a car's remaining lifespan and of the chance that it will soon need repairs, which means costs for the new owner and therefore a lower value.
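The trend can be checked mechanically by sorting the series by mileage and testing whether the mean price only ever decreases. The sketch below uses the mean prices from the output above, rounded to two decimals; the variable names `ppm` and `by_mileage` are assumptions.

```python
import pandas as pd

# Mean prices per mileage group, rounded from the output above.
ppm = pd.Series({
    5000: 8873.52, 10000: 20550.87, 20000: 18448.48, 30000: 16608.84,
    40000: 15499.57, 50000: 13812.17, 60000: 12385.00, 70000: 10927.18,
    80000: 9721.95, 90000: 8465.03, 100000: 8132.70, 125000: 6214.02,
    150000: 3767.93,
})

by_mileage = ppm.sort_index()
# True only if the mean price never rises as mileage increases.
print(by_mileage.is_monotonic_decreasing)
# Dropping the 5,000 km group tests whether it is the only break in the trend.
print(by_mileage.drop(5000).is_monotonic_decreasing)
```

The first check fails and the second passes, so the 5,000 km group is the only group that breaks the downward trend.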

#### Price comparison of cars with damage and their non-damaged counterpart

In [53]:
df_by_damage = autos.groupby("unrepaired_damage")

df_by_damage["price"].describe()

| unrepaired_damage | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ja | 4540.0 | 2241.146035 | 3563.276478 | 1.0 | 500.0 | 1000.0 | 2500.0 | 44200.0 |
| nein | 33834.0 | 7164.033103 | 10078.475478 | 1.0 | 1800.0 | 4150.0 | 9000.0 | 350000.0 |


In [54]:
df_by_damage["price"].mean()

unrepaired_damage
ja      2241.146035
nein    7164.033103
Name: price, dtype: float64

Unsurprisingly, cars with unrepaired damage are much cheaper than cars without: their mean price is about 2,241 dollars against 7,164 dollars, a gap of roughly 4,900 dollars.
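Note that `groupby` drops rows whose key is missing, so listings with no value in `unrepaired_damage` are absent from the comparison above. One way to keep them visible is to label them first, as in this sketch on a toy DataFrame with invented values; the label "unknown" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy listings, invented for illustration; NaN means the field was left blank.
toy = pd.DataFrame({
    "unrepaired_damage": ["ja", "nein", np.nan, "nein", np.nan, "ja"],
    "price": [1000, 7000, 3000, 9000, 5000, 2000],
})

# groupby drops NaN keys by default, so label them first to keep those rows visible.
labelled = toy["unrepaired_damage"].fillna("unknown")
summary = toy.groupby(labelled)["price"].agg(["count", "mean", "median"])
print(summary)
```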

#### Common brand/model combination

In [55]:
df_brand = autos.groupby(["brand","model"])
brand_model = df_brand["date_crawled"].count().sort_values(ascending=False)
brand_model

brand          model             
volkswagen     golf                  3707
bmw            3er                   2615
volkswagen     polo                  1609
opel           corsa                 1592
volkswagen     passat                1349
opel           astra                 1348
audi           a4                    1231
mercedes_benz  c_klasse              1136
bmw            5er                   1132
mercedes_benz  e_klasse               958
audi           a3                     825
               a6                     797
ford           focus                  762
               fiesta                 722
volkswagen     transporter            674
renault        twingo                 615
peugeot        2_reihe                600
smart          fortwo                 550
opel           vectra                 544
mercedes_benz  a_klasse               539
bmw            1er                    521
ford           mondeo                 479
renault        clio                   473


Overall, ***Volkswagen Golf*** is the most common brand/model combination with 3,707 listings, followed by ***BMW 3er*** and ***Volkswagen Polo***.
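Counting an arbitrary column such as `date_crawled` works because that column has no missing values, but `size()` expresses the intent more directly. Here is a minimal sketch on a toy DataFrame with invented rows; listings with a missing model are left out because `groupby` drops missing keys by default.

```python
import pandas as pd

# Toy listings, invented for illustration.
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "opel"],
    "model": ["golf", "golf", "golf", "polo", "3er", "3er", None],
})

# size() counts rows per group directly; the opel row has no model and is dropped.
combos = toy.groupby(["brand", "model"]).size().sort_values(ascending=False)
print(combos.head(3))
```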