# Analyzing Used Car Listings on eBay Kleinanzeigen

We will be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. The version we are working with is a sample of 50,000 data points prepared by Dataquest, who also modified it to simulate a less-cleaned version of the data.

The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - Whether the car has damage that has not yet been repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.

In [None]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
autos = pd.read_csv('../input/autosdata/autos.csv',encoding='Latin-1')
autos.head()
autos.info()

Our dataset contains 20 columns, most of which are stored as strings. There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.

We'll start by cleaning the column names to make the data easier to work with.

## Clean Columns

In [None]:
autos.columns

We'll make a few changes here:

- Change some column names from camelCase to snake_case and lowercase the rest.
- Change a few wordings to more accurately describe the columns.

In [None]:
autos.columns = ['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuelType', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postalCode',
       'lastSeen']
autos.columns = autos.columns.str.lower()

In [None]:
autos.columns
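The renaming above is done by hand. As a rough sketch, the camelCase to snake_case step could also be automated with a regular expression. The helper name `camel_to_snake` is hypothetical and not part of the notebook, and it would produce names such as `date_crawled` rather than the `datecrawled` used in the cells below, so it is shown only as an illustration.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Insert an underscore at every boundary where a lowercase letter
    # or digit is followed by an uppercase letter, then lowercase it all.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names from the data dictionary.
original = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures']
converted = [camel_to_snake(c) for c in original]
print(converted)
```

Descriptive renames such as `kilometer` to `odometer` would still need an explicit mapping, since no pattern can infer them.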

## Data Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. 

Initially we will look for: 
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. 
- Examples of numeric data stored as text which can be cleaned and converted.
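
One possible way to check the first point programmatically is to compute, for each column, the share of rows taken by its most common value. The sketch below uses a small made-up frame rather than the real listings; the function name `top_value_share` and the 0.7 cutoff are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for the listings data.
sample = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'privat'],
    'offertype': ['Angebot', 'Angebot', 'Angebot', 'Gesuch'],
    'brand': ['bmw', 'audi', 'opel', 'ford'],
})

def top_value_share(df):
    """Return, per column, the fraction of rows holding the most common value."""
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

shares = top_value_share(sample)
# Columns dominated by a single value are candidates for dropping.
near_constant = shares[shares >= 0.7].index.tolist()
print(shares)
print(near_constant)
```

On the real data the threshold would need to be set by judgment, since a column can be dominated by one value and still carry useful information.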

In [None]:
autos.describe(include = 'all')


Our initial observations:

- There are a number of text columns where all (or nearly all) of the values are the same: seller and offertype.
- The num_photos column looks odd; we'll need to investigate this further.

In [None]:
autos['num_photos'].value_counts()

It looks like the num_photos column has 0 for every row. We'll drop this column, plus the other two we noted as mostly one value.

In [None]:
autos = autos.drop(['num_photos','seller','offertype'],axis = 1)

There are two columns, price and odometer, that hold numeric values stored as text with extra characters. We'll clean and convert these.

In [None]:
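# regex=False makes '$' a literal character rather than a regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$','',regex=False).str.replace(',','',regex=False).astype(int)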
autos['price'].head()

In [None]:
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)

In [None]:
autos.rename({'odometer':'odometer_km'},axis = 1, inplace = True)
autos['odometer_km'].head()
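
As an alternative to chaining one `str.replace` call per unwanted character, both columns could be cleaned with a single helper that strips everything except digits. This is only a sketch: the helper name `to_int_series` is hypothetical, the sample strings are made up to resemble the raw values, and the approach assumes the columns contain no negative or decimal numbers.

```python
import pandas as pd

def to_int_series(raw):
    """Strip every non-digit character, then convert to integers."""
    digits = raw.str.replace(r'[^0-9]', '', regex=True)
    return pd.to_numeric(digits).astype('int64')

# Made-up values in the style of the raw price and odometer columns.
prices = pd.Series(['$5,000', '$8,500', '$0'])
odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

clean_prices = to_int_series(prices)
clean_odometer = to_int_series(odometer)
print(clean_prices.tolist())
print(clean_odometer.tolist())
```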

Let's continue exploring the data, specifically looking for values that don't look right. We'll start with the odometer_km and price columns, using their minimum and maximum values to spot anything unrealistically high or low (outliers) that we might want to remove.

In [None]:
autos['odometer_km'].value_counts()

The values are clearly rounded into fixed buckets, suggesting that sellers had to pick a preset range for their car. Most cars fall in the 150,000 km bucket, which appears to be the highest option, so we have no further information about the distribution of mileage above 150,000 km.

In [None]:
autos['price'].describe()
autos['price'].value_counts().head(15)

In [None]:
autos['price'].value_counts().sort_index(ascending = False).head(15)
autos['price'].value_counts().sort_index(ascending = True).head(15)

There are a number of listings with prices below 30 dollars, including about 1,500 at 0 dollars. There are also a small number of listings with very high values, including 14 at around or over $1 million.

Given that eBay is an auction site, there could legitimately be items where the opening bid is 1 dollar. We will keep the 1 dollar items, but remove the 0 dollar listings and anything above 350,000 dollars, since prices increase steadily up to that number and then jump to less realistic values.

In [None]:
autos = autos[autos['price'].between(1,350000)]
autos.describe()
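
To make the effect of the filter above concrete, the following sketch applies the same `between(1, 350000)` test to a handful of made-up prices. `Series.between` is inclusive on both ends by default, so listings at exactly 1 or 350,000 dollars are kept while 0 dollar listings are dropped.

```python
import pandas as pd

# Made-up prices covering both edges of the accepted range.
prices = pd.Series([0, 1, 500, 12000, 350000, 999990, 12345678])

keep = prices.between(1, 350000)  # both bounds are included by default
kept_count = int(keep.sum())
dropped_count = int((~keep).sum())
print(prices[keep].tolist())
print(kept_count, dropped_count)
```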

Let's now move on to the date columns and understand the date range the data covers.

Right now, the datecrawled, lastseen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two date columns, registration_year and registration_month, are represented as numeric values, so we can understand their distribution without any extra data processing.
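
The cells that follow take the first ten characters of each string to isolate the date. Another option, sketched here on made-up timestamps, is to parse the strings with `pd.to_datetime` and then count listings per calendar day. The `'%Y-%m-%d %H:%M:%S'` layout is an assumption about how the timestamps are stored and should be checked against the actual values.

```python
import pandas as pd

# Made-up timestamps in the assumed 'YYYY-MM-DD HH:MM:SS' layout.
sample = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
])

# Parse the strings into datetime values, then keep only the calendar date.
parsed = pd.to_datetime(sample, format='%Y-%m-%d %H:%M:%S')
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```

Parsed datetimes also make it easy to derive further features later, such as the number of days a listing stayed online.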

In [None]:
autos[['lastseen','ad_created','datecrawled']].head()

In [None]:
autos['lastseen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate number of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales. It is more likely that these values reflect the crawling period ending and don't indicate car sales.

In [None]:
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)

In [None]:
autos['datecrawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)

It looks like the site was crawled daily over roughly a one-month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.

In [None]:
autos['registration_year'].describe()

The year in which the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented, and the maximum is 9999, many years into the future.

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [None]:
autos['registration_year'].between(1900,2016).value_counts(normalize = True)

Given that the rows outside this interval make up less than 4% of our data, we will remove them.

In [None]:
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].value_counts(normalize = True).head(10)

It appears that most of the vehicles were first registered in the past 20 years.

In [None]:
autos['brand'].unique().shape
autos['brand'].value_counts(normalize = True)

German manufacturers represent four out of the top five brands, almost 50% of the overall listings. Volkswagen is by far the most popular brand, with approximately double the cars for sale of the next two brands combined.

There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.

In [None]:
brand_counts = autos['brand'].value_counts(normalize = True)
common_brands = brand_counts[brand_counts>0.05]
common_brands.head()

In [None]:
brands = {}
for c in common_brands.index:
    rows = autos[autos['brand'] == c]
    price_mean = rows['price'].mean()
    brands[c] = int(price_mean)
    
brands

Among the top 6 brands, there is a distinct price gap:

- Audi, BMW and Mercedes Benz are more expensive.
- Ford and Opel are less expensive.
- Volkswagen is in between. This may explain its popularity, as it may be a 'best of both worlds' option.

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and whether there's any visible link with mean price. We can do this by building a second dictionary of mean mileage and combining it with the mean price dictionary in a single DataFrame.

In [None]:
mileage = {}
for c in common_brands.index:
    rows = autos[autos['brand'] == c]
    mile_mean = rows['odometer_km'].mean()
    mileage[c] = int(mile_mean)
    
mileage

In [None]:
mile = pd.Series(mileage)
brand = pd.Series(brands)
final_df = pd.DataFrame(brand,columns = ['price_mean'])
final_df['mile_mean'] = mile
final_df
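
The two loops and the manual DataFrame assembly above could also be expressed as a single `groupby` with named aggregation. The sketch below runs on a small made-up frame, and the 0.2 share cutoff is chosen only so that the toy data keeps two brands; on the real data the 0.05 cutoff used earlier would apply.

```python
import pandas as pd

# Made-up listings standing in for the cleaned autos data.
sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'opel', 'fiat'],
    'price': [9000, 7000, 3000, 2000, 4000, 1500],
    'odometer_km': [150000, 125000, 150000, 100000, 125000, 90000],
})

# Keep only brands whose share of listings exceeds the cutoff.
shares = sample['brand'].value_counts(normalize=True)
common = shares[shares > 0.2].index

# Compute both means per brand in one pass.
summary = (
    sample[sample['brand'].isin(common)]
    .groupby('brand')
    .agg(price_mean=('price', 'mean'), mile_mean=('odometer_km', 'mean'))
    .astype(int)
    .sort_values('price_mean', ascending=False)
)
print(summary)
```

Sorting by mean price puts the more expensive brands first, which makes it easier to compare the two columns side by side.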

Mean mileage varies much less by brand than mean price does, with all of the top brands falling within 10% of each other. There is a slight trend for the more expensive brands to have higher mileage and the less expensive brands to have lower mileage.