# Analyzing eBay Kleinanzeigen Used Car Sales
eBay Kleinanzeigen is the [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website. In this project we explore its used car sales data: we clean the original dataset and then analyze it. Along the way we also become familiar with some of the conveniences Jupyter Notebook provides for pandas.

The dataset was originally scraped and uploaded to Kaggle. It can be found [here](https://data.world/data-society/used-cars-data) along with a summary and data dictionary. We work with a modified, less clean version of it to practice data cleaning.

## Summary of Findings and Results

1. Changed column names from camel case to snake case and renamed a handful to be more descriptive  
2. Used the `info()` and `describe()` methods to summarize the dataframe and decide what to clean  
3. Converted two columns (`price` and `odometer`) from string to numeric and removed a few columns that do not provide useful information  
4. Removed rows that contain outlier data (e.g. invalid registration dates and unrealistic prices)  
5. Explored frequency distributions of various date columns, specifically `date_crawled`, `ad_created`, and `last_seen`  
    * We used the `value_counts()` and `sort_index()` methods to show this information
6. Analyzed average price and average kilometers for the most common used car brands  
    * We created a single pandas dataframe from two pandas series (built from two dictionaries) to easily illustrate this information

In [None]:
# Import NumPy and pandas libraries
import numpy as np
import pandas as pd

# Read CSV dataset into pandas as dataframe
autos = pd.read_csv('_data/autos.csv', encoding='Latin-1')

**Jupyter Notebook Feature**: When a pandas object is the last expression in a cell, Jupyter Notebook renders its first and last few rows, as shown below.

In [None]:
autos

Let's look at summary information about the `autos` dataframe below.

In [None]:
# Print information about the `autos` dataframe
autos.info()

# Optionally preview its first 5 rows
# autos.head()

Based on this output, the `autos` dataframe has 50,000 rows and 20 columns. A quarter of the columns are of `integer` type and the remaining columns are of `object` type. A quarter of the columns contain `null` values, but in each of them fewer than 20% of the rows are null.
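The share of missing values per column can also be computed directly instead of being read off the `info()` output. The sketch below uses a small made-up dataframe (the column names and values are illustrative, not taken from the real dataset) and relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Made-up listings; the values are illustrative only
sample = pd.DataFrame({
    "name": ["Golf", "Polo", "A4", "Corsa", "Astra"],
    "vehicleType": ["limousine", np.nan, "kombi", "kleinwagen", "kombi"],
    "model": ["golf", "polo", np.nan, np.nan, "astra"],
})

# isnull() marks missing cells as True; the column-wise mean of those
# booleans is the fraction of missing rows in each column
null_share = sample.isnull().mean()

# Show only the columns that actually have missing values
print(null_share[null_share > 0])
```

Applied to `autos`, the same expression would list each column's null share, which makes a threshold such as 20% easy to check.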

Also note that the column names use [camel case](https://en.wikipedia.org/wiki/Camel_case) (not [snake case](https://en.wikipedia.org/wiki/Snake_case)).  

## Change Column Names  
In this section, we inspect the column names to see how we can clean them. In particular, we convert the column names from camel case to snake case by changing all characters from uppercase to lowercase and separating words with underscores. We also reword some of the column names to be more descriptive based on the data dictionary.
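The mechanical part of this conversion can be sketched with a regular expression. The helper `camel_to_snake` below is hypothetical (it is not part of the notebook, which assigns the new names by hand), and the example names are illustrative camel-case strings. It only handles the case change; descriptive rewording still has to be done manually.

```python
import re

def camel_to_snake(name):
    """Insert '_' before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Illustrative camel-case names
examples = ["dateCrawled", "vehicleType", "powerPS", "name"]
converted = [camel_to_snake(col) for col in examples]
print(converted)
```

The lookbehind keeps runs of capitals together, so an abbreviation at the end of a name stays one word instead of being split letter by letter.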

In [None]:
autos.columns

In [None]:
# Create list of new (cleaned) column names, then assign them to the `autos` dataframe
new_cols = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']

autos.columns = new_cols

# View the changes made
autos.head()

## Initial Data Exploration and Cleaning
We begin by looking at a summary of each column to see if there are any obvious areas of the data that can be cleaned.

In [None]:
autos.describe(include = 'all')

Based on the above summary table, there are four columns that can likely be changed and one column that needs further investigation:  
* The `price` and `odometer` columns represent numeric values but are stored as text
    - They contain non-numeric characters which can be removed, allowing the two columns to be converted to integer types
* The `seller` and `offer_type` columns have only one value (except for one row) and thus do not provide any information  
    - They will be removed from the data
* The `num_photos` column values look strange and should be investigated further
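As a rough illustration of the text-to-integer conversion described in the first bullet, the sketch below strips every non-digit character with a single regular expression. This is a slightly different technique from removing each character separately, and the helper name `digits_to_int` and the sample values are made up for the example.

```python
import pandas as pd

# Made-up values in the same style as the raw columns
raw_price = pd.Series(["$5,000", "$8,500", "$0"])
raw_odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

def digits_to_int(text_column):
    """Drop every non-digit character, then convert to integers."""
    return text_column.str.replace(r"\D", "", regex=True).astype(int)

price = digits_to_int(raw_price)
odometer_km = digits_to_int(raw_odometer)
print(price.tolist(), odometer_km.tolist())
```

This shortcut is only safe for non-negative whole numbers, since it would also discard minus signs and decimal points.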


In [None]:
autos["price"].unique()

In [None]:
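# Remove '$' and ',' characters from the `price` column and convert to integer type
# (regex=False treats '$' as a literal character rather than a regex anchor)
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))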

autos["price"].head()

In [None]:
autos["odometer"].unique()

In [None]:
# Remove 'km' and ',' characters from the `odometer` column and convert to integer type
autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int))

# Replace 'odometer' column name with 'odometer_km'
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)

autos["odometer_km"].head()

In [None]:
# Check suspicion that all rows of the 'num_photos' column are zeros
autos["num_photos"].value_counts()

The `num_photos` column is all zero and does not provide any information, so it is removed from the dataset.
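Columns that carry almost no information can also be found programmatically by checking how dominant the most frequent value is. The sketch below uses made-up data, and the 85% cutoff is an arbitrary assumption chosen for the example.

```python
import pandas as pd

# Made-up listings: one nearly constant column, one constant, one varied
listings = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "num_photos": [0] * 10,
    "brand": ["bmw", "audi", "opel", "bmw", "ford",
              "audi", "opel", "bmw", "ford", "audi"],
})

# Share of rows taken by the most frequent value in each column
top_share = listings.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Columns where a single value covers at least 85% of rows (assumed cutoff)
near_constant = top_share[top_share >= 0.85].index.tolist()
print(near_constant)
```

Such a list is only a starting point; whether a nearly constant column is worth keeping still depends on the analysis.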

In [None]:
autos = autos.drop(["num_photos", "seller", "offer_type"], axis = 1)

## More Detailed Look at Price and Odometer Values
In the sections below, we analyze the minimum and maximum values of the `price` and `odometer_km` columns as well as other statistics such as their mean, median, and frequency distributions.  We find that there are unrealistic outliers in the `price` column, so they are removed from the dataset altogether (keeping only the realistic values).

In [None]:
# Check the number of unique values of 'odometer_km' column
autos["odometer_km"].unique().shape

In [None]:
# Look at some summary statistics of 'odometer_km' column
autos["odometer_km"].describe()

In [None]:
# Look at frequency distribution of 'odometer_km' column
autos["odometer_km"].value_counts()

The values in the `odometer_km` column seem reasonable. There are 13 unique values (note that they are rounded) and the distribution is skewed left, which makes sense since most used cars have many kilometers on them. Some used cars do have fairly low values, around five thousand, but this is still plausible since some owners sell a car soon after buying it.
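A left skew can be checked numerically as well as visually. The sketch below uses made-up odometer readings concentrated at the high end; a negative `skew()` and a median above the mean are both consistent with a left-skewed distribution.

```python
import pandas as pd

# Made-up readings: most cars at the top value, a few with low mileage
odometer_km = pd.Series([150000] * 6 + [125000, 100000, 50000, 5000])

skewness = odometer_km.skew()   # sample skewness; negative means a longer left tail
mean_km = odometer_km.mean()
median_km = odometer_km.median()

print(skewness, mean_km, median_km)
```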

Next we take a closer look at the `price` column in the sections below.

In [None]:
# Check the number of unique values of 'price' column
autos["price"].unique().shape

In [None]:
# Change values format from scientific to number notation (with commas and four decimal places)
pd.options.display.float_format = '{:,.4f}'.format

# Look at some summary statistics of 'price' column
autos["price"].describe()

In [None]:
# View number of high price values
autos["price"].value_counts().sort_index(ascending=False).head(20)

In [None]:
# View number of low price values
autos["price"].value_counts().sort_index(ascending=True).head(20)

There are many unrealistic values in the `price` column based on the frequency distributions above. In particular, there are 1,421 rows with a price of \$0 and 14 rows with a price above \$350k. We remove these rows from the dataset so that we only retain useful and accurate information.

In [None]:
# Only retain the realistic values of `price` in the dataset
# (`between` includes both endpoints, so this keeps prices from 1 to 350,000)
autos = autos[autos["price"].between(1, 350000)]
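A price filter of this kind relies on `Series.between` including both endpoints by default. The sketch below demonstrates that on a handful of made-up prices and also counts how many rows the filter would drop.

```python
import pandas as pd

# Made-up prices around the two boundaries
prices = pd.Series([0, 1, 500, 350000, 350001, 999990])

# between() is inclusive on both ends by default
keep = prices.between(1, 350000)

kept = prices[keep]
removed = (~keep).sum()   # number of rows the filter would drop
print(kept.tolist(), removed)
```

Counting the removed rows before reassigning the dataframe is a cheap way to confirm that a filter drops only what was intended.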

## Exploring the Date Columns
There are five columns in the dataset that represent date values. Two of them (`date_crawled` and `last_seen`) were created by the crawler and the remaining three (`ad_created`, `registration_month`, and `registration_year`) come from the website itself. Furthermore, the two registration columns store dates as numeric values, while the other three store them as strings. The string columns cannot be summarized quantitatively as they are, so we extract the date portion of each value and look at its relative frequencies.

We first explore the `date_crawled` column below.

In [None]:
# Restrict to dates only (remove times), convert distribution to percentages,
# include missing values, and sort the `date_crawled` dates in ascending order
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

The `date_crawled` dates are quite evenly distributed across the days between March 3, 2016 and April 7, 2016, a span of roughly one month.
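The slicing approach works because the timestamps are fixed-width strings: the first 10 characters are the date and the first 7 are the year and month. A minimal sketch on made-up timestamps (assuming the `YYYY-MM-DD HH:MM:SS` layout):

```python
import pandas as pd

# Made-up timestamps in 'YYYY-MM-DD HH:MM:SS' form
crawled = pd.Series([
    "2016-03-05 14:06:30",
    "2016-03-05 18:12:01",
    "2016-03-06 09:00:00",
    "2016-04-07 10:15:45",
])

# First 10 characters = date; normalize=True gives shares instead of counts;
# sort_index() orders the result chronologically
by_day = crawled.str[:10].value_counts(normalize=True, dropna=False).sort_index()

# First 7 characters = year and month
by_month = crawled.str[:7].value_counts(normalize=True, dropna=False).sort_index()

print(by_day)
print(by_month)
```

Because ISO-style date strings sort alphabetically in the same order as chronologically, `sort_index()` is enough and no datetime conversion is needed here.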

Secondly, we explore the `ad_created` column.

In [None]:
# Restrict to year and month only (first 7 characters), convert distribution to percentages,
# include missing values, and sort the `ad_created` months in ascending order
autos["ad_created"].str[:7].value_counts(normalize=True, dropna=False).sort_index()

We can see from the frequency distribution above (percentages) that nearly all of the used car ads were created in March and April 2016 (84% and 16% respectively). A very small number of ads were created between June 2015 and February 2016.

Next, we explore the `last_seen` column.

In [None]:
# Restrict to dates only (remove times), convert distribution to percentages,
# include missing values, and sort the `last_seen` dates in ascending order
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

The `last_seen` column has a frequency distribution similar to that of the `date_crawled` column. It is fairly uniformly distributed between March 5, 2016 and April 7, 2016.

Finally, we explore the `registration_year` column.

In [None]:
# Look at summary statistics for the `registration_year` column
autos["registration_year"].describe()

The mean and median registration years are both 2004. We can also see that there are very low and very high values (year 1000 and year 9999), which do not make sense.

## Cleaning the Registration Year Data  

Before completely removing the above rows with invalid registration years, we first count the number of listings with cars outside the 1900-2016 interval to give us an idea of how much data/information we would be losing.

In [None]:
(~autos["registration_year"].between(1900,2016)).sum() / autos.shape[0]

Thus, removing rows where the registration year is outside the 1900-2016 interval results in a loss of only *3.9%* of our dataset, which isn't too much. We proceed to do so below and look at the resulting distribution.
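The share of out-of-range rows can be written compactly as the mean of a boolean mask, which is equivalent to dividing the count by the number of rows. A small sketch on made-up registration years:

```python
import pandas as pd

# Made-up registration years, including a few impossible ones
years = pd.Series([1000, 1999, 2004, 2016, 2017, 9999, 2010, 2008])

valid = years.between(1900, 2016)   # inclusive on both ends

# The mean of the negated mask is the fraction of rows outside the interval
share_outside = (~valid).mean()
print(share_outside)
```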

In [None]:
# Retain rows where `registration_year` is within the interval 1900-2016
autos = autos[autos["registration_year"].between(1900,2016)]

# Generate resulting frequency distribution (percentage and top 15 only)
autos["registration_year"].value_counts(normalize=True).head(15)

The majority of used cars were registered between 1997 and 2011, which seems reasonable given that both the mean and median registration years are 2004 (from a previous cell).

## Exploring the Brands Column

Next, we explore the unique car brands values (i.e. the `brand` column) in our used car listings dataset.

In [None]:
# Generate frequency distribution (percentage) of all unique car brands
autos["brand"].value_counts(normalize=True)

We can see that German manufacturers make up more than half of the listings, which makes sense since the data comes from a German used car website. Also note that many brands each make up a very small portion of the data (less than 3%), so statistics computed for them may not be reliable.

For our analysis of brands, we limit ourselves to the brands that account for more than 4% of the listings.
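Selecting brands by share comes down to boolean indexing on the normalized value counts. The sketch below uses a made-up list of 100 brand labels so that the shares are easy to verify by eye.

```python
import pandas as pd

# Made-up brand column with 100 listings
brands = pd.Series(
    ["volkswagen"] * 50 + ["bmw"] * 30 + ["opel"] * 17 + ["lada"] * 2 + ["saab"]
)

# Share of listings per brand, sorted from most to least common
shares = brands.value_counts(normalize=True)

# Keep only the brands with more than 4% of the listings
common_brands = shares[shares > 0.04].index.tolist()
print(common_brands)
```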

In [None]:
# Resulting frequency distribution (percentage) of brands that make up more than 4% of data
autos["brand"].value_counts(normalize=True).loc[lambda x : x > .04]

In [None]:
# Compute each brand's share of the listings
brand_counts = autos["brand"].value_counts(normalize=True)

# Store limited brands/labels in a list/array for access in next cell
brand_limit = brand_counts[brand_counts > .04].index

We now calculate the average price per common brand (and store it) for analysis.

In [None]:
# Create dictionary and loop through common brands to calculate average price, then store back
brand_mean_price = {}

for brand in brand_limit:
    autos_brand = autos[autos["brand"] == brand]
    avg_brand = autos_brand["price"].mean()
    brand_mean_price[brand] = int(avg_brand)

# Show resulting average price by brand
brand_mean_price

Audi is the most expensive used car brand (averaging ~\$9k), followed closely by Mercedes-Benz and BMW (averaging ~\$9k and ~\$8k respectively). Opel and Renault are the cheapest, averaging around \$3k. Volkswagen is in the middle with an average list price of ~\$5k.

To see if there is a potential correlation between price and kilometers logged, we calculate the average kilometers logged for each of the same common brands next.

In [None]:
# Create dictionary and loop through common brands to calculate average kilometers, then store back
brand_mean_km = {}

for brand in brand_limit:
    autos_brand = autos[autos["brand"] == brand]
    avg_brand = autos_brand["odometer_km"].mean()
    brand_mean_km[brand] = int(avg_brand)

# Show resulting average kilometers by brand
brand_mean_km

We convert the two dictionaries (mean price and mean kilometers) to pandas series, combine the series into a single pandas dataframe, and name the columns so that everything displays nicely in one table.
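As an alternative to looping over brands and collecting results in dictionaries, the same kind of table can be produced in one step with `groupby`. This is a sketch on made-up listings, not the approach used in this notebook; note that it returns floating-point means rather than values truncated with `int()`.

```python
import pandas as pd

# Made-up listings
listings = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "audi"],
    "price": [8000, 9000, 3000, 2500, 9500],
    "odometer_km": [150000, 125000, 150000, 100000, 125000],
})

# Hypothetical list of brands to keep
common_brands = ["bmw", "opel"]

brand_means = (
    listings[listings["brand"].isin(common_brands)]
    .groupby("brand")[["price", "odometer_km"]]
    .mean()
    .rename(columns={"price": "mean_price", "odometer_km": "mean_km"})
)
print(brand_means)
```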

In [None]:
# Convert dictionaries (created in prior cells) to pandas series (via constructor)
bmp_series = pd.Series(brand_mean_price)
bmk_series = pd.Series(brand_mean_km)

# Combine series objects into single pandas dataframe (via constructor) for easy display
brand_means = pd.DataFrame(bmp_series, columns=['mean_price'])
brand_means["mean_km"] = bmk_series

brand_means

Based on the table above that shows the average price and average kilometers per common used car brand, we can see that the price does not seem to vary by kilometers logged.  All common used car brands have around the same average kilometers, but the prices vary more significantly.
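One way to put numbers on this kind of observation is to compare the relative spread of the two columns and to compute their correlation. The sketch below uses a made-up table with placeholder brand labels, so its figures say nothing about the real data, and with only a handful of brands any correlation should be read as weak evidence at best.

```python
import pandas as pd

# Made-up brand-level averages with placeholder labels
brand_means = pd.DataFrame(
    {
        "mean_price": [9000, 8500, 8000, 5000, 3500, 3000],
        "mean_km": [129000, 131000, 133000, 129000, 124000, 129000],
    },
    index=["brand_a", "brand_b", "brand_c", "brand_d", "brand_e", "brand_f"],
)

# Standard deviation relative to the mean: a unit-free measure of spread
relative_spread = brand_means.std() / brand_means.mean()

# Pearson correlation between average price and average kilometers
price_km_corr = brand_means["mean_price"].corr(brand_means["mean_km"])

print(relative_spread)
print(price_km_corr)
```

A much larger relative spread in `mean_price` than in `mean_km` would match the reading that kilometers are similar across brands while prices differ.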