## 1. Introduction

In this data cleaning project, we work with a dataset of used car listings scraped from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset contains 50,000 data points. We will first clean up the column labels and data, and then analyse key fields such as the price and mileage of the cars.

The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle Type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeenOnline - When the crawler saw this ad last online.

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

| | dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |


From the dataframe generated above, we can make the following observations:
- The dataset contains 20 columns, most of which are stored as strings.
- Several columns have missing values (NaN entries), but none is more than 20% null. The closest is notRepairedDamage, with almost 20% null.
- The column names use camelcase instead of Python's preferred snakecase, so we cannot simply replace spaces with underscores; each name has to be converted.
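
The share of missing values per column can be checked directly rather than read off the `info()` output. The sketch below uses a small made-up frame (`sample` and its values are illustrative assumptions, not rows from `autos.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for `autos`; the values are made up.
sample = pd.DataFrame({
    "brand": ["bmw", "ford", "opel", "audi", "seat"],
    "vehicleType": ["bus", np.nan, "kombi", "limousine", "kombi"],
    "notRepairedDamage": ["nein", np.nan, np.nan, "ja", "nein"],
})

# isnull() flags missing cells; the column-wise mean is the fraction missing.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Applied to the full dataset, `autos.isnull().mean()` should agree with the `info()` output above, where notRepairedDamage has 40,171 non-null values out of 50,000 (about 19.7% missing).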

## 2. Clean Column Titles

Next, we will clean the column names to make the data easier to work with.

In [2]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We make the following changes:
- Change the column headings from camelcase to snakecase.
- Change a few wordings to more accurately describe the columns.

In [3]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']

autos.head()

| | date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
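
The camelcase-to-snakecase step can also be done mechanically. The sketch below is one possible approach, not the method used in this notebook: `camel_to_snake` is a hypothetical helper, and it only changes the case style, so reworded names such as registration_year or ad_created would still need to be assigned by hand.

```python
import re

def camel_to_snake(name):
    """Put an underscore before each capital that follows a lowercase
    letter or digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "yearOfRegistration",
            "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```

Because the lookbehind requires a lowercase letter or digit, a run of capitals such as the "PS" in powerPS stays together and becomes power_ps.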


## 3. Initial Exploration and Cleaning



In [4]:
autos.describe()

| | registration_year | power_ps | registration_month | num_photos | postal_code |
|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 2005.07328 | 116.35592 | 5.72336 | 0.0 | 50813.6273 |
| std | 105.712813 | 209.216627 | 3.711984 | 0.0 | 25779.747957 |
| min | 1000.0 | 0.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1999.0 | 70.0 | 3.0 | 0.0 | 30451.0 |
| 50% | 2003.0 | 105.0 | 6.0 | 0.0 | 49577.0 |
| 75% | 2008.0 | 150.0 | 9.0 | 0.0 | 71540.0 |
| max | 9999.0 | 17700.0 | 12.0 | 0.0 | 99998.0 |


By default, describe() only summarises numeric columns. The price and odometer columns are absent from the summary because they hold numeric values stored as text. We therefore remove the non-numeric characters and convert both columns to integers:
    


In [5]:
print(autos['price'].dtype)
print(autos['odometer'].dtype)

object
object


We convert price from string/object to integer:

In [6]:
autos['price'] = (autos['price']
                  # regex=False: treat '$' as a literal character, not a regex anchor
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(int)
                 )

print(autos['price'].head())
print(autos.dtypes)

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32
date_crawled          object
name                  object
seller                object
offer_type            object
price                  int32
ab_test               object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_photos             int64
postal_code            int64
last_seen             object
dtype: object


In [7]:
autos['odometer'].dtype

dtype('O')

In [8]:
autos['odometer'] = (autos['odometer']
                     .str.replace('km','')
                     .str.replace(',','')
                     .astype(int)
                    )


We also rename the odometer column to odometer_km.

In [9]:
autos.rename({'odometer' : 'odometer_km'}, axis=1, inplace=True)
print(autos['odometer_km'].head())
print(autos.dtypes)

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int32
date_crawled          object
name                  object
seller                object
offer_type            object
price                  int32
ab_test               object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer_km            int32
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_photos             int64
postal_code            int64
last_seen             object
dtype: object
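
The price and odometer conversions follow the same pattern, so they could be folded into one helper. This is a sketch only: `strip_to_int` is a hypothetical name, and the sample values merely imitate the format of the raw columns.

```python
import pandas as pd

def strip_to_int(series, tokens):
    """Remove each literal token from a string Series, then cast to int."""
    for token in tokens:
        # regex=False treats characters such as "$" literally.
        series = series.str.replace(token, "", regex=False)
    return series.astype(int)

# Illustrative values in the same format as the raw columns.
raw_price = pd.Series(["$5,000", "$8,500", "$1,350"])
raw_odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

price = strip_to_int(raw_price, ["$", ","])
odometer_km = strip_to_int(raw_odometer, ["km", ","])
print(price.tolist(), odometer_km.tolist())
```

Note that `astype(int)` raises an error if any value still contains non-digit characters, which is a useful signal that the column holds a format the token list does not cover.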


In [10]:
autos.head()

| | date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_photos | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |


We also note that, in the rows shown, 'seller' and 'offer_type' each contain a single repeated value, and the describe() output shows that 'num_photos' is 0 for every row. These columns therefore appear to carry little or no information.
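
One way to confirm which columns are constant is to count the distinct values per column. The sketch below runs on a small frame built from the first three rows shown above; dropping the constant columns is an optional extra step, not something this notebook does.

```python
import pandas as pd

# Subset of the first three rows of the table above.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat"],
    "offer_type": ["Angebot", "Angebot", "Angebot"],
    "num_photos": [0, 0, 0],
    "price": [5000, 8500, 8990],
})

# nunique() counts distinct non-null values in each column.
unique_counts = sample.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()

# Optional: drop the columns that never vary.
trimmed = sample.drop(columns=constant_cols)
print(constant_cols)
```

On the full dataset a column may turn out to have a second, very rare value; `autos['seller'].value_counts()` would show that, and it is worth checking before dropping anything.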

## 4. Exploring the Odometer and Price Columns


### 4.1 Odometer Column 

In [11]:
autos['odometer_km'].unique().shape

(13,)

In [12]:
autos['odometer_km'].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000], dtype=int64)

We find that there are only 13 unique values for the odometer readings, all of them round numbers. This suggests that sellers had to choose from pre-set options for this field rather than enter an exact reading.

In [13]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [14]:
autos['odometer_km'].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

Additionally, there are far more high-mileage than low-mileage vehicles: 32,424 of the 50,000 listings (about 65%) are in the top category of 150,000 km.
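
The skew towards high mileage can be quantified from the counts printed above. The sketch below copies those counts into a Series; the variable names are illustrative.

```python
import pandas as pd

# Listing counts per odometer value, copied from the value_counts() output above.
odometer_counts = pd.Series({
    150000: 32424, 125000: 5170, 100000: 2169, 90000: 1757, 80000: 1436,
    70000: 1230, 60000: 1164, 50000: 1027, 5000: 967, 40000: 819,
    30000: 789, 20000: 784, 10000: 264,
})

total = odometer_counts.sum()
share_at_cap = odometer_counts.loc[150000] / total
share_125k_plus = odometer_counts[odometer_counts.index >= 125000].sum() / total

# Every odometer value is a multiple of 5,000 km, consistent with pre-set options.
all_multiples_of_5000 = (odometer_counts.index % 5000 == 0).all()

print(f"{share_at_cap:.1%} at 150,000 km, {share_125k_plus:.1%} at 125,000 km or more")
```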

### 4.2 Price Column

In [16]:
autos['price'].unique().shape

(2357,)

Again, the prices in this column seem rounded: there are only 2,357 unique values among 50,000 listings. This may simply reflect people's tendency to round prices on the site.

In [17]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [18]:
(autos['price']
 .value_counts()
 .sort_index(ascending=True)
 .head(10)
)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64

In [19]:
(autos['price']
 .value_counts()
 .sort_index(ascending=False)
 .head(20)
)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

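The lowest and highest prices look unrealistic: 1,421 listings are priced at $0, and above $350,000 the prices jump to implausible values such as 999,990 and 12,345,678. We therefore keep only the listings priced between $1 and $350,000:

In [20]: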
autos = autos[autos['price'].between(1,351000)]
autos['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

## 5. Exploring the date columns

Three columns in this dataset hold timestamps stored as strings: date_crawled, ad_created and last_seen. We analyse each of them in this section, followed by registration_year.

In [21]:
autos[['date_crawled', 'ad_created', 'last_seen']].head()

| | date_crawled | ad_created | last_seen |
|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |


### 5.1 'date_crawled'

In [22]:
(autos['date_crawled']
 .str[:10]
 .value_counts(dropna=False, normalize=True)
 .sort_index()
)

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

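The site appears to have been crawled daily for roughly one month, from 5 March to 7 April 2016. The listings are spread fairly uniformly across the crawl dates, at around 3% per day, apart from a few quieter days and the final three days.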
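
Slicing the first ten characters works because the timestamps have a fixed format. An alternative, sketched below on three timestamps from the table above, is to parse the strings with `pd.to_datetime`, which also makes date arithmetic possible. The variable names are illustrative.

```python
import pandas as pd

# Three date_crawled values from the table above.
crawled = pd.Series(["2016-03-26 17:47:46",
                     "2016-04-04 13:38:56",
                     "2016-03-12 16:58:10"])

crawled_dt = pd.to_datetime(crawled)

# normalize() sets the time of day to midnight, leaving one value per calendar day.
crawl_days = crawled_dt.dt.normalize()

span_days = (crawl_days.max() - crawl_days.min()).days
daily_share = crawl_days.value_counts(normalize=True).sort_index()
print(span_days)
```

For the full column the first and last crawl dates are 2016-03-05 and 2016-04-07, a span of 33 days.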

### 5.2 'ad_created'

In [23]:
(autos['ad_created']
 .str[:10]
 .value_counts(normalize=True, dropna=False)
 .sort_index(ascending=True)
)

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

In [24]:
autos['ad_created'].dtype

dtype('O')

In [25]:
(autos["ad_created"]
 .str[:10]
 .unique()
 .shape
)

(76,)

The ad creation dates span 76 distinct days. Most ads were created within 1-2 months of the crawl period, but the oldest dates back to June 2015, roughly nine months before the crawl began.

### 5.3 'last_seen'

In [26]:
(autos['last_seen']
 .str[:10]
 .value_counts(normalize=True, dropna=False)
 .sort_index(ascending=True)
)

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

In [27]:
(autos["last_seen"]
 .str[:10]
 .unique()
 .shape
)

(34,)

The crawler recorded the date it last saw each listing, which lets us estimate the day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate number of 'last seen' values. Given that their shares are roughly 5 to 9 times those of the preceding days, it is unlikely that there was a massive spike in sales. More likely, these values reflect the end of the crawling period, when listings that were still online were seen for the last time.
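
If listing duration is of interest, it can be approximated as the gap between ad_created and last_seen. The sketch below uses three rows from the table at the start of this section; `days_visible` is a hypothetical name, and for listings last seen at the very end of the crawl the figure is only a lower bound, since the ad may still have been online.

```python
import pandas as pd

# Rows 0, 3 and 4 of the date table above.
sample = pd.DataFrame({
    "ad_created": ["2016-03-26 00:00:00", "2016-03-12 00:00:00", "2016-04-01 00:00:00"],
    "last_seen": ["2016-04-06 06:45:54", "2016-03-15 03:16:28", "2016-04-01 14:38:50"],
})

ad_created = pd.to_datetime(sample["ad_created"])
last_seen = pd.to_datetime(sample["last_seen"])

# Subtracting datetimes gives timedeltas; .dt.days keeps the whole days.
days_visible = (last_seen - ad_created).dt.days
print(days_visible.tolist())
```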

### 5.4 'registration_year'

In [28]:
(autos['registration_year']
 .describe()
)

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

There are a number of issues with the registration year:
- The standard deviation is 88.6 years, which is not plausible for a registration year.
- The earliest year of registration is 1000 and the latest is 9999, both of which are impossible.

Note that the count of 48,565 (versus 50,000 in the raw dataset) does not indicate missing values: the difference of 1,435 rows was removed by the price filter in the previous section.

A car cannot be first registered after its listing was seen in 2016, so any later year is invalid. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

One option is to remove the listings with these values. Let us determine what percentage of our data has invalid values in this column:


In [29]:
(~autos['registration_year'].between(1900, 2016)).sum() / (autos.shape[0])

0.038793369710697

Given that these registration year outliers represent less than 4% of our data, we will remove these rows, using the Series.between() method to build a boolean mask:

In [30]:
# Boolean mask of plausible years (named to avoid shadowing the built-in bool)
valid_year = autos['registration_year'].between(1900, 2016)
autos = autos[valid_year]
(autos['registration_year']
 .value_counts()
 .sort_index(ascending=False)
 .head(10)
)

2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
2009    2085
2008    2215
2007    2277
Name: registration_year, dtype: int64

Through this process, we find that most of the vehicles were first registered in the 20 years before the 2016 crawl.

## 7. Exploring Price by Brand


In [31]:
print(autos['brand'].unique().shape[0])
print(autos['brand'].unique())

40
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']


We find that there are 40 unique brands in the dataset, with the names all listed as above.

In [32]:
autos['brand'].value_counts(normalize=True).head(10)

volkswagen       0.211264
bmw              0.110045
opel             0.107581
mercedes_benz    0.096463
audi             0.086566
ford             0.069900
renault          0.047150
peugeot          0.029841
fiat             0.025642
seat             0.018273
Name: brand, dtype: float64

We find that all of the top five brands are German manufacturers (Volkswagen, BMW, Opel, Mercedes-Benz and Audi), which together account for about 61% of the listings. Volkswagen is by far the most popular brand, with roughly as many listings as the next two brands (BMW and Opel) combined.

For purposes of our research, we will limit our analysis to brands representing more than 5% of total listings.


In [33]:
brand_counts = autos['brand'].value_counts(normalize=True)
# Keep only the brands with more than 5% of listings
common_brands = brand_counts[brand_counts > 0.05].index
common_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

With this common_brands index, we can use a for loop to compute the mean price of each of the most common brands.

In [34]:
brand_mean_prices = {}

for brand in common_brands:
    brand_only = autos[autos['brand'] == brand]
    mean_price = brand_only['price'].mean()
    brand_mean_prices[brand] = int(mean_price)

brand_mean_prices

{'volkswagen': 5402,
 'bmw': 8332,
 'opel': 2975,
 'mercedes_benz': 8628,
 'audi': 9336,
 'ford': 3749}

From the mean prices of the six most common brands, we can infer the following:
- Audi, BMW and Mercedes-Benz are the premium brands, with mean prices above $8,000.
- Ford and Opel are the mass-market brands, with the lowest mean prices (below $4,000).
- Volkswagen sits in between and represents the mid-market option for car purchasers.
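
The per-brand loop can also be written as a single groupby expression. The sketch below is an alternative, not the approach used in this notebook, and runs on made-up rows; `common` stands in for common_brands.

```python
import pandas as pd

# Made-up listings for illustration.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "smart"],
    "price": [8500, 7500, 1350, 2650, 4350],
})
common = ["bmw", "ford"]  # stand-in for common_brands

mean_price_by_brand = (sample[sample["brand"].isin(common)]  # keep the common brands
                       .groupby("brand")["price"]
                       .mean()
                       .astype(int)                           # truncate, as int() does in the loop
                       .sort_values(ascending=False))
print(mean_price_by_brand.to_dict())
```

The result is a Series indexed by brand rather than a dictionary, which avoids the later conversion step.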

## 8. Exploring Mileage by Brand

In [35]:
print(autos['odometer_km'].unique().shape[0])
print(autos['odometer_km'].unique())

13
[150000  70000  50000  80000  10000  30000 125000  90000  20000  60000
   5000  40000 100000]


We find that car mileage has been grouped into 13 categories, presumably for the sake of convenience.


In [36]:
brand_mean_mileage = {}

for brand in common_brands:
    # filter autos df by common brand:
    brand_only = autos[autos['brand'] == brand]
    
    mean_mileage = brand_only['odometer_km'].mean()
    brand_mean_mileage[brand] = int(mean_mileage)
    
brand_mean_mileage

{'volkswagen': 128707,
 'bmw': 132572,
 'opel': 129310,
 'mercedes_benz': 130788,
 'audi': 129157,
 'ford': 124266}

We note that the mean mileage of the common brands is roughly the same, ranging from about 124,000 km to 133,000 km.

## 9. Storing Aggregate Data in a DataFrame

Using the [pandas Series constructor](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) and the [pandas DataFrame constructor](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), we will convert the mean prices and mean mileage of the common brands from dictionaries into a new pandas dataframe.

First, we start by converting brand_mean_prices and brand_mean_mileage into pandas series:

In [37]:
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)

In [38]:
mean_mileage

bmw              132572
mercedes_benz    130788
opel             129310
audi             129157
volkswagen       128707
ford             124266
dtype: int64

Next, we convert the mean_mileage series into a dataframe:

In [39]:
brand_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
brand_info

| | mean_mileage |
|---|---|
| bmw | 132572 |
| mercedes_benz | 130788 |
| opel | 129310 |
| audi | 129157 |
| volkswagen | 128707 |
| ford | 124266 |


We then add a new column for mean_price:

In [40]:
brand_info['mean_price'] = mean_prices
brand_info

| | mean_mileage | mean_price |
|---|---|---|
| bmw | 132572 | 8332 |
| mercedes_benz | 130788 | 8628 |
| opel | 129310 | 2975 |
| audi | 129157 | 9336 |
| volkswagen | 128707 | 5402 |
| ford | 124266 | 3749 |


We find that while mean_price generally reflects the market positioning of each brand, the mean mileage is fairly similar across brands. There is a slight tendency for the more expensive brands to have higher mean mileage and the less expensive brands lower, which might reflect differences in durability between makes, although the differences are small.
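
One detail behind the table above is that pandas matches values by index label, not by position: the two Series were sorted in different orders, yet each price landed next to the right brand. The sketch below reproduces this with the dictionary values printed earlier, building the dataframe in one step from a dict of Series (an alternative to the two-step construction used above).

```python
import pandas as pd

# Values copied from the brand_mean_mileage and brand_mean_prices outputs above.
brand_mean_mileage = {"volkswagen": 128707, "bmw": 132572, "opel": 129310,
                      "mercedes_benz": 130788, "audi": 129157, "ford": 124266}
brand_mean_prices = {"volkswagen": 5402, "bmw": 8332, "opel": 2975,
                     "mercedes_benz": 8628, "audi": 9336, "ford": 3749}

mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)

# The Series are ordered differently, but they are aligned on their brand labels.
brand_info = pd.DataFrame({"mean_mileage": mean_mileage, "mean_price": mean_prices})
brand_info = brand_info.sort_values("mean_mileage", ascending=False)
print(brand_info)
```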