## Analyzing Used Car Listings on eBay Kleinanzeigen

We will be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. The version we are working with is a sample of 50,000 data points prepared by Dataquest, which deliberately made it messier to simulate a less-cleaned version of the data.

The aim of this project is to clean the data and analyze the included used car listings.

In [488]:
import numpy as np
import pandas as pd

In [489]:
autos = pd.read_csv("autos.csv", encoding = "Latin_1")
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [490]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Our dataset contains 20 columns, most of which are stored as strings. A few columns contain null values (notRepairedDamage, fuelType, vehicleType, ...), but none of them has more than about 20% nulls.
We'll start by cleaning the column names to make the data easier to work with.
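
The 20% figure can be checked from the non-null counts printed by `autos.info()`. The following is only a small sketch that re-enters the counts shown above by hand; on the real DataFrame, `autos.isna().mean()` should give the same shares directly.

```python
import pandas as pd

# Non-null counts copied from the autos.info() output above (50,000 rows in total)
non_null = pd.Series({
    "vehicleType": 44905,
    "gearbox": 47320,
    "model": 47242,
    "fuelType": 45518,
    "notRepairedDamage": 40171,
})

# Share of missing values per column, largest first
null_share = (1 - non_null / 50000).sort_values(ascending=False)
print(null_share.round(4))
```

The column with the most missing values is notRepairedDamage, which stays just under the 20% mark.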


---
## Cleaning Column Names

From the work we did in the last screen, we can make the following observations:

- The dataset contains 20 columns, most of which are strings
- Some columns have null values, but none have more than ~20% null values
- The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores

In [491]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We'll make a few changes here:

- Change the columns from camelcase to snakecase.
- Change a few wordings to more accurately describe the columns.

In [492]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
                        'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
                        'odometer', 'registration_month', 'fuel_type', 'brand',
                        'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code', 'last_seen']

autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
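
The purely mechanical part of this rename could also be automated. As a rough sketch (the helper name `camel_to_snake` is made up for illustration and is not part of the original notebook), a regular expression can insert an underscore at each lowercase-to-uppercase boundary. Rewordings such as yearOfRegistration to registration_year would still need the manual list used above.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase name to snake_case (hypothetical helper)."""
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_underscores.lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```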


---
## Initial Data Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.

In [493]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Let's pick the low-hanging fruit first. Two columns, price and odometer, hold numeric values stored as text with extra characters. We'll strip the non-numeric characters, convert both columns to a numeric dtype, and rename them to include the unit of measure.

In [494]:
autos["price"].head()

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

In [495]:
autos["price"] = (autos["price"]
                          .str.replace("$", "", regex=False)  # "$" is a regex anchor, so match it literally
                          .str.replace(",", "", regex=False)
                          .astype(int)
                          )
autos.rename({"price": "price_usd"}, axis=1, inplace=True)
autos["price_usd"].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price_usd, dtype: int32

Next, the odometer column:

In [496]:
autos['odometer'].head()

0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object

In [497]:
autos['odometer'] = (autos['odometer']
                             .str.replace("km", "")
                             .str.replace(",", "")
                             .astype(int))
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
autos['odometer_km'].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int32
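
Both conversions follow the same pattern: strip everything that is not a digit, then cast. One detail worth knowing is that `$` is a regex metacharacter, and older pandas versions treated the pattern in `Series.str.replace` as a regex by default, so it is safer to state `regex=` explicitly. The sketch below uses made-up sample values, not the real dataset, and shows a literal replacement next to a single regex that removes all non-digits.

```python
import pandas as pd

prices = pd.Series(["$5,000", "$8,500", "$0"])
mileage = pd.Series(["150,000km", "70,000km", "5,000km"])

# Option 1: literal replacements, one per unwanted character
prices_literal = (prices
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))

# Option 2: one regex that drops every non-digit character
prices_regex = prices.str.replace(r"\D", "", regex=True).astype(int)
mileage_regex = mileage.str.replace(r"\D", "", regex=True).astype(int)

print(prices_literal.tolist(), mileage_regex.tolist())
```

The regex variant is shorter, but it would also silently drop a minus sign or a decimal point, so it only suits columns like these where neither is expected.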

In [498]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   seller              50000 non-null  object
 3   offer_type          50000 non-null  object
 4   price_usd           50000 non-null  int32 
 5   abtest              50000 non-null  object
 6   vehicle_type        44905 non-null  object
 7   registration_year   50000 non-null  int64 
 8   gear_box            47320 non-null  object
 9   power_ps            50000 non-null  int64 
 10  model               47242 non-null  object
 11  odometer_km         50000 non-null  int32 
 12  registration_month  50000 non-null  int64 
 13  fuel_type           45518 non-null  object
 14  brand               50000 non-null  object
 15  unrepaired_damage   40171 non-null  object
 16  ad_created          50

Now that the low-hanging fruit has been taken care of, let's explore the data column by column to determine what else needs to be done. We will list all the issues we observe, and then clean the columns step by step.

In [499]:
autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000.0,50000,44905,50000.0,47320,50000.0,47242,50000.0,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,,2,8,,2,,245,,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,,25756,12859,,36993,,4024,,,30107,10687,35232,1946,,,8
mean,,,,,9840.044,,,2005.07328,,116.35592,,125732.7,5.72336,,,,,0.0,50813.6273,
std,,,,,481104.4,,,105.712813,,209.216627,,40042.211706,3.711984,,,,,0.0,25779.747957,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1100.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30451.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49577.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71540.0,


---
## Exploring the seller and offer_type columns

Our initial observations:

- There are a number of text columns where all (or nearly all) of the values are the same:
  - seller
  - offer_type
- The num_pictures column looks odd; we'll need to investigate it further.

In [500]:
autos["seller"].value_counts()

seller
privat        49999
gewerblich        1
Name: count, dtype: int64

We can see that the seller column contains 49999 identical values. We can drop that column since it is not valuable for our analysis.

The same goes for the offer_type and num_pictures columns:

In [501]:
autos["offer_type"].value_counts()

offer_type
Angebot    49999
Gesuch         1
Name: count, dtype: int64

In [502]:
autos["num_pictures"].value_counts()

num_pictures
0    50000
Name: count, dtype: int64

Let's drop those columns using the DataFrame.drop() method:

In [503]:
autos = autos.drop(labels = ["num_pictures", "seller", "offer_type"], axis = 1)

In [504]:
autos.columns

Index(['date_crawled', 'name', 'price_usd', 'abtest', 'vehicle_type',
       'registration_year', 'gear_box', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

In [505]:
autos.shape

(50000, 17)

From the above, we can see that we have successfully dropped 3 unnecessary columns. Our new column count is 17.
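
Instead of checking columns one at a time, near-constant columns could also be flagged in one pass. This is only a sketch: the function name `near_constant_columns`, the 99% threshold, and the toy data are assumptions for illustration, not something the original notebook uses.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True, dropna=False)
        # value_counts sorts by frequency, so the first entry is the top value's share
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(col)
    return flagged

# Toy data: 'seller' is almost constant, 'brand' is not
toy = pd.DataFrame({
    "seller": ["privat"] * 199 + ["gewerblich"],
    "brand": ["bmw", "opel"] * 100,
})
flagged = near_constant_columns(toy)
print(flagged)
```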

---
## Exploring the Odometer and Price Columns

In the last screen, we found text columns where almost all of the values are the same (seller and offer_type) and dropped them. We also converted the price and odometer columns to numeric types and renamed them to price_usd and odometer_km.

OK. So, what's interesting in 'price_usd' and 'odometer_km'?

For sure, I want:
1. A cheap car. Of course.
2. (But I'm curious about the price of the most expensive car too.)
3. A car that has the least mileage. Somehow, I'm under the impression that least mileage = better condition. Yeah, I don't really know much about cars, to be honest.
4. (But I'm also curious about the car with the most mileage.)

And let's see if there are any nonsensical prices or mileages in our data set.

In [506]:
print(autos["odometer_km"].unique().shape)
print(autos["odometer_km"].describe())
autos["odometer_km"].value_counts().sort_index(ascending = False)

(13,)
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


odometer_km
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: count, dtype: int64

Exploring the odometer_km column, we can see that the values are rounded to a small set of fixed steps, which suggests that users had to choose from predefined values. Approximately 65% of vehicles have high mileage (150,000 km), and approximately 80% have a mileage of 100,000 km or higher.
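
These two percentages follow directly from the value counts. As a quick check, the sketch below re-enters the counts printed above by hand; on the real data the same numbers would come from `autos["odometer_km"].value_counts()`.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    150000: 32424, 125000: 5170, 100000: 2169, 90000: 1757, 80000: 1436,
    70000: 1230, 60000: 1164, 50000: 1027, 40000: 819, 30000: 789,
    20000: 784, 10000: 264, 5000: 967,
})

total = counts.sum()
share_150k = counts[150000] / total
share_100k_plus = counts[counts.index >= 100000].sum() / total

print(total, round(share_150k, 3), round(share_100k_plus, 3))
```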

In [507]:
print(autos["price_usd"].unique().shape)
print(autos["price_usd"].describe())
print(autos["price_usd"].value_counts().sort_index(ascending = False).head(10))
autos["price_usd"].value_counts().sort_index().head(10)

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64
price_usd
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: count, dtype: int64


price_usd
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: count, dtype: int64

We can see that there are some unusually high prices: there are 15 vehicles listed with prices above or close to a million. We can drop those rows and retain everything priced at 350,000 or less, since that is much more realistic.

Looking at the prices in ascending order, we can see values that increase incrementally from 1 USD. Since eBay is an auction site, it is possible that these sellers started their auctions at low values. Since that assumption is viable, we will keep these listings and remove only the 1,421 listings priced at 0.

In [508]:
autos = autos[autos["price_usd"].between(1, 350000)]
autos["price_usd"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_usd, dtype: float64

We can see that the outliers have been removed and our prices now range from 1 to 350,000 USD.
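
Note that `Series.between()` includes both endpoints by default, which is why listings at exactly 1 and exactly 350,000 survive the filter. A tiny sketch with made-up prices shows the behavior:

```python
import pandas as pd

prices = pd.Series([0, 1, 500, 350000, 350001, 99999999])

# between() is inclusive on both ends unless told otherwise
mask = prices.between(1, 350000)
kept = prices[mask]

print(kept.tolist())
```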

---
## Exploring and cleaning up registration data
#### Does time matter?

Other than prices, does time play an interesting role here?

Let's explore it and see if we can find something interesting.

In [509]:
autos[["date_crawled", "ad_created", "last_seen"]].head()

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


To extract just the date (the first 10 characters of each timestamp), we can use Series.str[:10]:

In [510]:
print(autos["date_crawled"].str[:10]
      .value_counts(normalize = True, dropna = False)
      .sort_index())

date_crawled
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: proportion, dtype: float64


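I don't see anything very interesting here. It looks like there was a daily crawl from March 5 to April 7, 2016, and that's it. Most days account for roughly 3% of the listings, with a few dips in between and a drop-off on the last three days.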

In [511]:
print(autos["ad_created"].str[:10]
      .value_counts(normalize = True, dropna = False)
      .sort_index())

ad_created
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: proportion, Length: 76, dtype: float64


Okay, there's a huge gap in share between the oldest ads (from 2015, about 0.002% each) and the ads created in early April 2016 (up to almost 4% per day). Interesting!

In [512]:
print(autos["last_seen"].str[:10]
      .value_counts(normalize = True, dropna = False)
      .sort_index())

last_seen
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: proportion, dtype: float64


The percentage spikes on the last three days, but I'm not sure why.
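
Slicing the first 10 characters works because the timestamps are uniformly formatted strings. An alternative, sketched here on a few made-up timestamps rather than the real columns, is to parse them with `pd.to_datetime`, which also opens the door to date arithmetic later on.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-03-26 18:57:24",
    "2016-04-01 14:38:50",
    "2016-04-04 13:38:56",
])

# String slicing, as used above
by_day_str = stamps.str[:10].value_counts(normalize=True).sort_index()

# Parsing to datetimes, then formatting back to a date string
parsed = pd.to_datetime(stamps)
by_day_dt = parsed.dt.strftime("%Y-%m-%d").value_counts(normalize=True).sort_index()

print(by_day_dt)
```

Both approaches give the same distribution here; the parsed version would additionally fail loudly on malformed timestamps instead of silently grouping them.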

### Do these cars have time-traveling features?

In [513]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

One thing that stands out from the exploration we did in the last screen is that the registration_year column contains some odd values:

- The minimum value is 1000, before cars were invented
- The maximum value is 9999, many years into the future

---
## Dealing with Incorrect Registration Year Data

In [514]:
autos["registration_year"].value_counts().sort_index(ascending = False).head(20)

registration_year
9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
Name: count, dtype: int64

In [515]:
autos.describe(include = "all")

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
count,48565,48565,48565.0,48565,43979,48565.0,46222,48565.0,46107,48565.0,48565.0,44535,48565,39464,48565,48565.0,48565
unique,46882,37470,,2,8,,2,,245,,,7,40,2,76,,38474
top,2016-03-21 20:37:19,Ford_Fiesta,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,2016-04-07 06:17:27
freq,3,76,,25019,12598,,36102,,3900,,,29368,10336,34775,1887,,8
mean,,,5888.935591,,,2004.755421,,117.197158,,125770.101925,5.782251,,,,,50975.745207,
std,,,9059.854754,,,88.643887,,200.649618,,39788.636804,3.685595,,,,,25746.968398,
min,,,1.0,,,1000.0,,0.0,,5000.0,0.0,,,,,1067.0,
25%,,,1200.0,,,1999.0,,71.0,,125000.0,3.0,,,,,30657.0,
50%,,,3000.0,,,2004.0,,107.0,,150000.0,6.0,,,,,49716.0,
75%,,,7490.0,,,2008.0,,150.0,,150000.0,9.0,,,,,71665.0,


In [516]:
autos["last_seen"].max()

'2016-04-07 14:58:50'

In [517]:
autos["date_crawled"].min()

'2016-03-05 14:06:30'

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

In addition, we need to combine the year and month of registration so that the registration date does not postdate the crawl, which started in early March 2016. To be on the safe side, any rows with a registration date later than February 2016 should be removed.

In [518]:
autos["registration_year"].value_counts().sort_index().head(20)

registration_year
1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    3
1951    2
1952    1
1953    1
1954    2
Name: count, dtype: int64

For the lower limit of our registration interval, we can see that there are a couple of rows showing a registration year from the beginning of the 20th century. Maybe those are oldtimers, so before we remove these rows entirely, it is worth exploring further.
How about... 1886 for the minimum value? It's around the birth year of the first Benz car according to [Wikipedia](https://en.wikipedia.org/wiki/Car).

In [519]:
autos[autos["registration_year"] < 1911]

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
3679,2016-04-04 00:36:17,Suche_Auto,1,test,,1910,,0,,5000,0,,sonstige_autos,,2016-04-04 00:00:00,40239,2016-04-04 07:49:15
10556,2016-04-01 06:02:10,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-01 00:00:00,63322,2016-04-01 09:42:30
22316,2016-03-29 16:56:41,VW_Kaefer.__Zwei_zum_Preis_von_einem.,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29 00:00:00,48324,2016-03-31 10:15:28
22659,2016-03-14 08:51:18,Opel_Corsa_B,500,test,,1910,,0,corsa,150000,0,,opel,,2016-03-14 00:00:00,52393,2016-04-03 07:53:55
24511,2016-03-17 19:45:11,Trabant__wartburg__Ostalgie,490,control,,1111,,0,,5000,0,,trabant,,2016-03-17 00:00:00,16818,2016-04-07 07:17:29
28693,2016-03-22 17:48:41,Renault_Twingo,599,control,kleinwagen,1910,manuell,0,,5000,0,benzin,renault,,2016-03-22 00:00:00,70376,2016-04-06 09:16:59
30781,2016-03-25 13:47:46,Opel_Calibra_V6_DTM_Bausatz_1:24,30,test,,1910,,0,calibra,100000,0,,opel,,2016-03-25 00:00:00,47638,2016-03-26 23:46:29
32585,2016-04-02 16:56:39,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-02 00:00:00,63322,2016-04-04 14:46:21
45157,2016-03-11 22:37:01,Motorhaube,15,control,,1910,,0,,5000,0,,trabant,,2016-03-11 00:00:00,90491,2016-03-25 11:18:57
49283,2016-03-15 18:38:53,Citroen_HY,7750,control,,1001,,0,andere,5000,0,,citroen,,2016-03-15 00:00:00,66706,2016-04-06 18:47:20


We can see that the results are mixed: there are some oldtimers, but there are also cars which clearly did not exist in that time period, e.g. an Opel Corsa and a Renault Twingo, both listed with a registration year of 1910. To ensure better data quality, we can remove all rows with registration years prior to 1911.

In [520]:
autos[(autos["registration_year"]  >= 2016) & (autos["registration_month"] > 2)]

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
10,2016-03-15 01:41:36,VW_Golf_Tuning_in_siber/grau,999,test,,2017,manuell,90,,150000,4,benzin,volkswagen,nein,2016-03-14 00:00:00,86157,2016-04-07 03:16:21
55,2016-03-07 02:47:54,Mercedes_E320_AMG_zu_Tauschen!,1,test,,2017,automatik,224,e_klasse,125000,7,benzin,mercedes_benz,nein,2016-03-06 00:00:00,22111,2016-03-08 05:45:44
65,2016-04-04 19:30:39,Ford_Fiesta_zum_ausschlachten,250,control,,2017,manuell,65,fiesta,125000,9,benzin,ford,,2016-04-04 00:00:00,65606,2016-04-05 12:22:12
113,2016-04-03 14:58:29,Golf_4_Anfaenger_auto,1200,test,,2017,manuell,75,golf,150000,7,,volkswagen,,2016-04-03 00:00:00,97656,2016-04-05 14:15:48
135,2016-03-12 11:00:10,Opel_Meriva_B_Panoramadach__Sitz__und_Lenkradh...,8500,control,,2016,manuell,81,meriva,90000,8,diesel,opel,,2016-03-12 00:00:00,48147,2016-03-22 14:49:31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49876,2016-03-22 17:57:24,Audi_a5_3.0_tdi_s_line,14700,control,,2016,manuell,0,,150000,10,diesel,audi,,2016-03-22 00:00:00,44227,2016-03-22 18:40:53
49910,2016-04-03 21:39:15,Schoener_fast_neuer_Opel_Mokka_in_Zell_Mosel_m...,22200,test,,9000,automatik,140,andere,10000,3,benzin,opel,,2016-04-03 00:00:00,56856,2016-04-05 22:18:26
49919,2016-03-10 09:49:43,Fiat_Punto,180,test,,2016,manuell,86,punto,150000,8,,fiat,ja,2016-03-10 00:00:00,59558,2016-03-10 10:39:58
49935,2016-04-01 21:48:20,Mercedes_A_klasse_angemeldet_mit_Tuef_und_Auto...,800,test,,2017,automatik,101,a_klasse,150000,9,,mercedes_benz,nein,2016-04-01 00:00:00,39108,2016-04-01 21:48:20


Let's see what share of the data these outliers make up:

In [521]:
((autos["registration_year"] >= 2016) & (autos["registration_month"] > 2)).sum()/autos.shape[0]

0.04167610419026047

In [522]:
(autos["registration_year"] < 1911).sum()/autos.shape[0]

0.00020590960568310512

In [523]:
autos = autos[autos["registration_year"].between(1911,2016)]

In [524]:
autos = autos[~((autos["registration_year"] >= 2016) & (autos["registration_month"] > 2))]

In [525]:
autos["registration_year"].value_counts(normalize = True).head(15)

registration_year
2000    0.068787
2005    0.063992
1999    0.063142
2004    0.058913
2003    0.058826
2006    0.058194
2001    0.057453
2002    0.054184
1998    0.051503
2007    0.049628
2008    0.048277
2009    0.045444
1997    0.042523
2011    0.035374
2010    0.034633
Name: proportion, dtype: float64

In [526]:
autos.shape

(45881, 17)

We have reduced our dataset to 45881 rows of data. The registration year distribution looks good, with the majority of the data falling into the 1997+ year range.
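
The two filtering steps above can also be expressed as a single boolean mask. The sketch below runs on a small made-up DataFrame and assumes the same cutoffs as this notebook (years 1911 to 2016, and nothing later than February 2016); the variable names are illustrative only.

```python
import pandas as pd

cars = pd.DataFrame({
    "registration_year":  [1000, 1910, 1995, 2016, 2016, 2017],
    "registration_month": [0,    0,    5,    2,    3,    1],
})

# Plausible years only (inclusive on both ends)
valid_year = cars["registration_year"].between(1911, 2016)

# Within 2016, anything after February is later than the assumed cutoff
too_recent = (cars["registration_year"] == 2016) & (cars["registration_month"] > 2)

filtered = cars[valid_year & ~too_recent]
print(filtered)
```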


---
## Exploring Price by Brand

First, we will take a look at all the brands in the dataset and select the top brands by percentage.

In [527]:
autos["brand"].unique().shape

(40,)

We can see there are 40 unique brands in the dataset. Let's see which ones they are:

In [528]:
autos["brand"].value_counts(normalize = True, ascending = False)

brand
volkswagen        0.210719
bmw               0.110743
opel              0.106951
mercedes_benz     0.096946
audi              0.086942
ford              0.069767
renault           0.047013
peugeot           0.029685
fiat              0.025544
seat              0.018134
skoda             0.016521
nissan            0.015213
mazda             0.015148
smart             0.014080
citroen           0.013971
toyota            0.012750
hyundai           0.010026
sonstige_autos    0.009808
volvo             0.009220
mini              0.008762
mitsubishi        0.008239
honda             0.007846
kia               0.007149
alfa_romeo        0.006648
porsche           0.006125
suzuki            0.005928
chevrolet         0.005732
chrysler          0.003509
dacia             0.002637
daihatsu          0.002528
jeep              0.002289
subaru            0.002136
land_rover        0.002114
saab              0.001656
jaguar            0.001591
daewoo            0.001526
trabant           0.00

As expected, the majority of the listings are for European brands (over 75%), and German brands dominate the top of the list.

Also, this is the first time I've heard of these car brands: 'skoda', 'dacia', 'saab', 'trabant', 'lancia', and 'lada' ('sonstige_autos', it turns out, is just German for "other cars"). So I decided to google them and... I've been totally missing out. These cars are super cool! Check out this pretty lava blue Skoda Superb!!

![alt text](https://i.ytimg.com/vi/9sbNHoB7JpI/maxresdefault.jpg "Skoda Superb Lava Blue")

Well, we will take the top brands (those that account for more than 1% of listings) for our price analysis:
- volkswagen
- bmw
- opel
- mercedes_benz
- audi
- ford
- renault
- peugeot
- fiat
- seat
- skoda
- nissan
- mazda
- smart
- citroen
- toyota
- hyundai

Create an empty dictionary to hold the price data:

In [529]:
brand_mean_prices = {}

We will assign our normalized value count to a new variable, and then use the .index attribute to access the top market share brands:

In [530]:
brands_count = autos["brand"].value_counts(normalize = True)
brands = brands_count[brands_count > 0.01].index
brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai'],
      dtype='object', name='brand')

We will now loop over each brand and calculate the mean price. Then we will assign the brand as a key in the dictionary, with the calculated mean price as its value (rounded to two decimals for better readability):

In [531]:
for brand in brands:
    selected_row = autos[autos["brand"] == brand]
    mean_price = selected_row["price_usd"].mean()
    brand_mean_prices[brand] = round(mean_price, 2)
    
brand_mean_prices

{'volkswagen': 5453.48,
 'bmw': 8375.17,
 'opel': 2997.11,
 'mercedes_benz': 8682.19,
 'audi': 9357.02,
 'ford': 3797.41,
 'renault': 2493.64,
 'peugeot': 3112.0,
 'fiat': 2834.95,
 'seat': 4441.76,
 'skoda': 6375.35,
 'nissan': 4789.42,
 'mazda': 4164.67,
 'smart': 3603.69,
 'citroen': 3824.04,
 'toyota': 5200.32,
 'hyundai': 5437.15}

We can see that the #1 brand, Volkswagen, has a very attractive mean price: it is much cheaper than BMW, Mercedes or Audi, while more expensive on average than Opel, Renault or Fiat. The attractive price and German origin most likely make it so popular.\
On the other hand, BMW, Mercedes and Audi are the most expensive, but still rank among the top 5.\
Opel, Ford, Peugeot and Renault are less expensive than all the brands mentioned above, so it is plausible that they hold a large share of the market.\
Asian brands such as Nissan, Mazda, Toyota and Hyundai, with mid-range prices, are at the bottom of the list.
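
The loop above can also be expressed as a single `groupby` call. Here is a minimal sketch on a tiny made-up frame: the rows in `toy` and the 0.2 threshold are invented for illustration, and only the column names `brand` and `price_usd` come from this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned `autos` frame
toy = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "opel", "saab"],
    "price_usd": [9000, 11000, 2000, 3000, 4000, 1500],
})

# Keep brands whose share of listings exceeds a threshold
# (0.2 here because the toy frame is tiny; the notebook uses 0.01)
share = toy["brand"].value_counts(normalize=True)
top_brands = share[share > 0.2].index

# Mean price per brand in one step, rounded to two decimals
mean_prices = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")["price_usd"]
    .mean()
    .round(2)
)
print(mean_prices.to_dict())
```

The result is a Series indexed by brand, so it can be sorted or joined to other per-brand statistics without going through a dictionary first.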


---
## Calculating the mean mileage

Using the same principle as above, we will calculate the mean mileage for each of our selected brands:

In [532]:
brand_mean_mileage = {}

In [533]:
for brand in brands:
    selected_rows = autos[autos["brand"] == brand]
    mean_mileage = selected_rows["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)
    
brand_mean_mileage

{'volkswagen': 128526,
 'bmw': 132498,
 'opel': 129242,
 'mercedes_benz': 130683,
 'audi': 129251,
 'ford': 124039,
 'renault': 127950,
 'peugeot': 127063,
 'fiat': 116970,
 'seat': 120907,
 'skoda': 110916,
 'nissan': 118524,
 'mazda': 124079,
 'smart': 98769,
 'citroen': 119329,
 'toyota': 115777,
 'hyundai': 105847}

---
## Storing Aggregate Data in a DataFrame

We will first use pandas series construction to convert both brand_mean_prices and brand_mean_mileage dictionaries to series objects:

In [534]:
mean_prices_series = pd.Series(brand_mean_prices).sort_values(ascending = False)
mean_mileage_series = pd.Series(brand_mean_mileage).sort_values(ascending = False)

Now let's combine the two series, which share the same brand index, into a single dataframe:

In [535]:
mean_price_mileage_df = pd.DataFrame(mean_prices_series, columns = ["mean_prices_series"])
mean_price_mileage_df["mean_mileage_series"] = mean_mileage_series
mean_price_mileage_df

Unnamed: 0,mean_prices_series,mean_mileage_series
audi,9357.02,129251
mercedes_benz,8682.19,130683
bmw,8375.17,132498
skoda,6375.35,110916
volkswagen,5453.48,128526
hyundai,5437.15,105847
toyota,5200.32,115777
nissan,4789.42,118524
seat,4441.76,120907
mazda,4164.67,124079


We have merged both series into one dataframe called mean_price_mileage_df, with rows sorted by mean price in descending order. Now we can easily compare prices and mileage.\
\
We cannot observe a large gap in mileage, but rather a trend that more expensive brands tend to have slightly higher mileage than less expensive brands. The exception is Skoda, which has quite low mileage for its mean price.\
\
Mercedes, BMW and Audi mostly make limousines, which may be why these brands have higher mean mileage: limousines are typically used for long-range travel, while cheaper vehicles are more often used within city limits, for commuting.
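
As an alternative to building two dictionaries and two series, both means can be computed in one pass with named aggregation. This is only a sketch: the rows in `toy` are invented, and `brand_summary`, `mean_price` and `mean_mileage` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook
toy = pd.DataFrame({
    "brand": ["audi", "audi", "skoda", "skoda"],
    "price_usd": [9000, 11000, 6000, 7000],
    "odometer_km": [150000, 125000, 100000, 125000],
})

# Named aggregation: one output column per (source column, function) pair
brand_summary = (
    toy.groupby("brand")
    .agg(mean_price=("price_usd", "mean"), mean_mileage=("odometer_km", "mean"))
    .sort_values("mean_price", ascending=False)
)
print(brand_summary)
```

Because both columns come from the same `groupby`, they are aligned on the brand index by construction.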

### Translating German to English

Since not every reader who is curious about the raw data speaks German, let's translate the German words in this data into English, just to be safe.

In [536]:
autos.head()

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50


In [537]:
autos["vehicle_type"].unique()

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [538]:
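translation = ({"bus": "bus",
                "limousine": "limousine",   # German "Limousine" usually means sedan/saloon
                "kleinwagen": "small_car",
                "kombi": "van",             # note: "Kombi" more precisely means station wagon (estate)
                "coupe": "coupe",
                "suv": "suv",
                "cabrio": "convertible",
                "andere": "other"})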

In [539]:
autos["vehicle_type"] = autos["vehicle_type"].map(translation)
autos["vehicle_type"].value_counts()

vehicle_type
limousine      12591
small_car      10573
van             8925
bus             4031
convertible     3014
coupe           2460
suv             1962
other            390
Name: count, dtype: int64

Next is the fuel_type column:

In [540]:
autos["fuel_type"].unique()

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [541]:
translation = ({'lpg': 'lpg',
                'benzin': 'petrol',
                'diesel': 'diesel',
                'cng': 'cng',
                'hybrid': 'hybrid',
                'elektro': 'electric',
                'andere': 'other'})

In [542]:
autos["fuel_type"] = autos["fuel_type"].map(translation)
autos["fuel_type"].value_counts()

fuel_type
petrol      28172
diesel      13932
lpg           643
cng            70
hybrid         36
electric       17
other          14
Name: count, dtype: int64

Next is the gear_box column:

In [543]:
autos["gear_box"].unique()

array(['manuell', 'automatik', nan], dtype=object)

In [544]:
autos['gear_box'] = autos['gear_box'].str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
autos["gear_box"].value_counts()

gear_box
manual       34081
automatic     9760
Name: count, dtype: int64

In [545]:
autos["unrepaired_damage"].unique()

array(['nein', nan, 'ja'], dtype=object)

In [546]:
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein', 'no')
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('ja', 'yes')
autos["unrepaired_damage"].value_counts()

unrepaired_damage
no     33446
yes     4443
Name: count, dtype: int64
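
One detail worth keeping in mind when translating with a dictionary: `Series.map` turns every value that is missing from the dictionary into `NaN`, while `Series.replace` leaves such values untouched. The sketch below illustrates the difference on a made-up series; the value `"wasserstoff"` is invented and does not occur in the dataset.

```python
import pandas as pd

# Hypothetical column with one value missing from the translation dict
fuel = pd.Series(["benzin", "diesel", "wasserstoff"])
translation = {"benzin": "petrol", "diesel": "diesel"}

mapped = fuel.map(translation)        # unmapped values become NaN
replaced = fuel.replace(translation)  # unmapped values are kept as-is

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, checking `unique()` first (as done above) is what guarantees that no category is silently lost.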

Now I know some words in German!

### TeamCommon or TeamUnique?

I can also see a combination of brand name and model type separated with underscores in the `name` column.

Let's see which brand and model combinations are the most common.

In [547]:
brand_model_combo = autos.groupby(["brand", "model"]).count()
brand_model_combo

Unnamed: 0_level_0,Unnamed: 1_level_0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,odometer_km,registration_month,fuel_type,unrepaired_damage,ad_created,postal_code,last_seen
brand,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
alfa_romeo,145,4,4,4,4,2,4,2,4,4,4,2,2,4,4,4
alfa_romeo,147,78,78,78,78,73,78,78,78,78,78,74,68,78,78,78
alfa_romeo,156,87,87,87,87,84,87,85,87,87,87,83,78,87,87,87
alfa_romeo,159,32,32,32,32,31,32,32,32,32,32,31,28,32,32,32
alfa_romeo,andere,59,59,59,59,56,59,56,59,59,59,54,51,59,59,59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
volvo,v40,86,86,86,86,84,86,85,86,86,86,83,73,86,86,86
volvo,v50,28,28,28,28,28,28,28,28,28,28,28,25,28,28,28
volvo,v60,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
volvo,v70,90,90,90,90,87,90,86,90,90,90,84,75,90,90,90


In [548]:
common_combo = brand_model_combo["name"].sort_values(ascending = False)
common_combo

brand       model     
volkswagen  golf          3622
bmw         3er           2586
volkswagen  polo          1566
opel        corsa         1556
volkswagen  passat        1333
                          ... 
rover       freelander       2
ford        b_max            1
rover       discovery        1
            rangerover       1
audi        200              1
Name: name, Length: 289, dtype: int64

If you are in the mood to join #TeamUnique, you had better avoid the Volkswagen `golf` or BMW `3er` and go for an Audi `200` or a Ford `b_max`.
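
Calling `count()` on the grouped frame returns one count per remaining column, of which only one is used afterwards. A slightly leaner sketch, on invented rows, uses `size()` to get a single count per brand and model pair:

```python
import pandas as pd

# Hypothetical toy data; the last row has no model and is dropped by groupby
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "audi"],
    "model": ["golf", "golf", "polo", "3er", "3er", None],
})

# size() counts rows per group and returns a single Series
combo_counts = (
    toy.groupby(["brand", "model"]).size().sort_values(ascending=False)
)
print(combo_counts)
```

Note that `size()` counts rows, whereas `count()` counts non-null values per column, so the two can differ when a column has missing data.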

---
## Convert the dates to be uniform numeric data

Just for the fun of it, let's also convert the dates to be uniform numeric data, so `"2016-03-21"` becomes the integer `20160321`.

In [549]:
autos["date_crawled"] = autos["date_crawled"].str[:10].str.replace("-", "").astype(int)

In [550]:
autos["ad_created"] = autos["ad_created"].str[:10].str.replace("-", "").astype(int)

In [551]:
autos["last_seen"] = autos["last_seen"].str[:10].str.replace("-", "").astype(int)

In [552]:
autos.head()

Unnamed: 0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,20160326,Peugeot_807_160_NAVTECH_ON_BOARD,5000,control,bus,2004,manual,158,andere,150000,3,lpg,peugeot,no,20160326,79588,20160406
1,20160404,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500,control,limousine,1997,automatic,286,7er,150000,6,petrol,bmw,no,20160404,71034,20160406
2,20160326,Volkswagen_Golf_1.6_United,8990,test,limousine,2009,manual,102,golf,70000,7,petrol,volkswagen,no,20160326,35394,20160406
3,20160312,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350,control,small_car,2007,automatic,71,fortwo,70000,6,petrol,smart,no,20160312,33729,20160315
4,20160401,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350,test,van,2003,manual,0,focus,150000,7,petrol,ford,no,20160401,39218,20160401
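
The string slicing above works because every timestamp has the same layout. If date arithmetic is needed later, an alternative is to parse the strings with `pd.to_datetime` and derive the integer form from the parsed values. A minimal sketch, assuming the same `YYYY-MM-DD HH:MM:SS` layout (the two sample values are copied from the first rows shown earlier):

```python
import pandas as pd

# Two sample timestamps in the layout used by the date columns
raw = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse to real datetimes; this keeps date arithmetic available
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

# Same integer encoding as above, derived from the parsed values
as_int = parsed.dt.strftime("%Y%m%d").astype(int)
print(as_int.tolist())
```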


### Analysis next steps. Mileage group

Since we have concluded that mileages are rounded to a fixed set of values, they lend themselves well to further analysis. We assume vehicles with higher mileage will have lower mean prices, so let's see if that is correct.\
\
We will start by taking a look once again at odometer_km value counts:

In [553]:
autos["odometer_km"].value_counts().sort_index()

odometer_km
5000        742
10000       239
20000       738
30000       756
40000       791
50000       988
60000      1117
70000      1175
80000      1359
90000      1655
100000     2032
125000     4803
150000    29486
Name: count, dtype: int64

We can see that there are 13 mileage categories. Let's group the listings by these categories and compare the mean price of each:

In [554]:
odometer_price = autos.groupby("odometer_km")
odometer_price["price_usd"].mean().sort_values(ascending = False)

odometer_km
10000     20574.305439
20000     18483.537940
30000     16644.611111
40000     15540.653603
50000     13844.735830
60000     12442.254252
70000     10987.248511
80000      9752.215600
90000      8515.038066
100000     8205.128937
5000       7267.716981
125000     6231.157402
150000     3806.961405
Name: price_usd, dtype: float64

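From the above, we can see that our assumption was correct: mean prices drop significantly as mileage increases. The one exception is the 5,000 km category, whose mean price ($7,267.72) is lower than that of every category from 10,000 km to 100,000 km.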
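
To condense the 13 categories into three groups (low, medium and high), `pd.cut` is one option. The sketch below runs on invented rows, and the bin edges at 50,000 km and 100,000 km are an assumption chosen purely for illustration, as is the column name `mileage_group`.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook
toy = pd.DataFrame({
    "odometer_km": [5000, 40000, 70000, 100000, 125000, 150000],
    "price_usd": [7000, 15000, 11000, 8000, 6000, 4000],
})

# Assumed bin edges: low up to 50,000 km, medium up to 100,000 km, high above
toy["mileage_group"] = pd.cut(
    toy["odometer_km"],
    bins=[0, 50000, 100000, 150000],
    labels=["low", "medium", "high"],
)

# Mean price per mileage group
group_prices = toy.groupby("mileage_group", observed=True)["price_usd"].mean()
print(group_prices)
```

By default `pd.cut` uses right-closed intervals, so a car with exactly 100,000 km lands in the medium group.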

### Exploring damage effect on the price

Damaged cars are cheaper than non-damaged cars. That's the norm. But, by how much?

In [555]:
damage_group = autos.groupby("unrepaired_damage").count()
damage_group

Unnamed: 0_level_0,date_crawled,name,price_usd,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,ad_created,postal_code,last_seen
unrepaired_damage,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
no,33446,33446,33446,33446,33046,33446,32804,33446,32480,33446,33446,32372,33446,33446,33446,33446
yes,4443,4443,4443,4443,4244,4443,4290,4443,4170,4443,4443,4065,4443,4443,4443,4443


In [556]:
no_damage = autos[autos["unrepaired_damage"] == "no"]
damage = autos[autos["unrepaired_damage"] == "yes"]

In [557]:
no_damage_avg = no_damage["price_usd"].mean()
damage_avg = damage["price_usd"].mean()
damage_difference = no_damage_avg - damage_avg

print("On average, cars with damage are ${:.2f}".format(damage_difference) +
      " cheaper than their undamaged counterparts.")

On average, cars with damage are $4898.82 cheaper than their undamaged counterparts.


We can see that, as expected, cars without unrepaired damage are priced much higher (by $4898.82 on average) than cars whose damage has been left unrepaired.\
\
But let's take a look at which brands are more or less affected by unrepaired damage:

In [558]:
brand_unrepaired_vc =  damage["brand"].value_counts(normalize = True).sort_values(ascending = False).head(10)
brand_unrepaired = brand_unrepaired_vc.index
brand_unrepaired

Index(['volkswagen', 'opel', 'ford', 'bmw', 'mercedes_benz', 'audi', 'renault',
       'peugeot', 'fiat', 'nissan'],
      dtype='object', name='brand')

In [559]:
unrepaired_brand_price = {}

for brand in brand_unrepaired:
    selected_row = damage[damage["brand"] == brand]
    mean_price = selected_row["price_usd"].mean()
    unrepaired_brand_price[brand] = int(mean_price)
    
unrepaired_brand_price

{'volkswagen': 2196,
 'opel': 1369,
 'ford': 1391,
 'bmw': 3554,
 'mercedes_benz': 4000,
 'audi': 3350,
 'renault': 1167,
 'peugeot': 1366,
 'fiat': 1166,
 'nissan': 1962}

In [560]:
unrepaired_brand_price_series = pd.Series(unrepaired_brand_price).sort_values(ascending = False)

In [561]:
brand_repaired_vc = no_damage["brand"].value_counts(normalize = True).sort_values(ascending = False).head(10)
brand_repaired = brand_repaired_vc.index
brand_repaired

Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object', name='brand')

In [562]:
repaired_brand_price = {}

for brand in brand_repaired:
    selected_row = no_damage[no_damage["brand"] == brand]
    mean_price = selected_row["price_usd"].mean()
    repaired_brand_price[brand] = int(mean_price)
    
repaired_brand_price

{'volkswagen': 6505,
 'bmw': 9467,
 'mercedes_benz': 9834,
 'opel': 3673,
 'audi': 10902,
 'ford': 4695,
 'renault': 3110,
 'peugeot': 3691,
 'fiat': 3452,
 'seat': 5220}

In [563]:
repaired_brand_price_series = pd.Series(repaired_brand_price).sort_values(ascending = False)

In [564]:
damage_price_info = pd.DataFrame(unrepaired_brand_price_series, columns = ["unrepaired_price"])
damage_price_info["repaired_price"] = repaired_brand_price_series

In [565]:
damage_price_info["diff"] = (damage_price_info["unrepaired_price"] - damage_price_info["repaired_price"]).round()
damage_price_info["diff_%"] = (((damage_price_info["unrepaired_price"] - damage_price_info["repaired_price"]) / damage_price_info["repaired_price"]) * 100).round()

In [566]:
damage_price_info.sort_values(by = ["diff_%"])

Unnamed: 0,unrepaired_price,repaired_price,diff,diff_%
ford,1391,4695.0,-3304.0,-70.0
audi,3350,10902.0,-7552.0,-69.0
volkswagen,2196,6505.0,-4309.0,-66.0
fiat,1166,3452.0,-2286.0,-66.0
opel,1369,3673.0,-2304.0,-63.0
peugeot,1366,3691.0,-2325.0,-63.0
bmw,3554,9467.0,-5913.0,-62.0
renault,1167,3110.0,-1943.0,-62.0
mercedes_benz,4000,9834.0,-5834.0,-59.0
nissan,1962,,,


In [567]:
damage_price_info["diff_%"].mean().round(2)

-64.44

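Across the nine brands that appear in both top-10 lists, cars with unrepaired damage are 59% to 70% cheaper than their undamaged counterparts, about 64% on average. Nissan has no comparison value because it is not among the top 10 brands of undamaged listings.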
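
The same comparison can be computed for every brand at once, without maintaining two separate top-10 lists, by grouping on brand and damage status together. This is a sketch on invented rows, not a rerun of the analysis above; `by_damage` is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook
toy = pd.DataFrame({
    "brand": ["ford", "ford", "ford", "audi", "audi", "nissan"],
    "unrepaired_damage": ["no", "no", "yes", "no", "yes", "yes"],
    "price_usd": [4000, 6000, 1500, 10000, 3000, 2000],
})

# Mean price per brand and damage status, one column per status
by_damage = (
    toy.groupby(["brand", "unrepaired_damage"])["price_usd"].mean().unstack()
)

# Relative difference of damaged vs. undamaged mean price, in percent
by_damage["diff_%"] = (
    (by_damage["yes"] - by_damage["no"]) / by_damage["no"] * 100
).round()
print(by_damage.sort_values("diff_%"))
```

A brand that lacks listings in one of the two groups (nissan in this toy frame) still ends up with `NaN`, but only because the data is missing, not because of how the lists were cut.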

# Conclusion

Let me tell you something interesting from the analysis that we just did. Like, totally interesting. Some of it is common sense: Audi is too expensive for someone who is just into #TeamUnique; BMW, Mercedes-Benz and VW are among the top European car brands; damaged cars are cheaper than non-damaged cars; and higher mileage goes along with lower prices. That stuff can be easily deduced even if we don't use this data.

     
![alt text](https://www.motortrend.com/uploads/2023/01/2023-Audi-RS-6-Avant-32.jpg)

2023 Audi `RS6` Model  