## Project Introduction

The dataset covers used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. It is dirty because it was scraped from the site. I will attempt to clean it and perform some analysis.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
autos = pd.read_csv('/kaggle/input/used-cars-database-50000-data-points/autos.csv', encoding='Latin-1')

In [None]:
autos.head()

In [None]:
autos.info()

### We can see from the output from the cell above that: 
* the dataset contains 50,000 entries and 20 columns
* most columns are of string type except for `yearOfRegistration`, `powerPS`, `monthOfRegistration`, `nrOfPictures` and `postalCode`
* some columns contain NULL values: `vehicleType`, `gearBox`, `model`, `fuelType` and `notRepairedDamage`
* column naming is inconsistent
* some dates are stored as strings, others as numeric values
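
The null counts and dtypes listed above can also be pulled out directly instead of being read off the `info()` output. The snippet below is a minimal sketch on a made-up three-column frame (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for the real data: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine", "kombi"],
    "fuelType": ["diesel", "benzin", None, None],
    "powerPS": [158, 286, 102, 71],
})

# Missing values per column, largest first.
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Number of columns per dtype.
print(toy.dtypes.value_counts())
```

Running the same two calls on `autos` should give the per-column picture summarised in the list above.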

### 1. Rename columns

Change all column names to snake_case.

In [None]:
autos.columns

In [None]:
autos.rename({'yearOfRegistration': 'registration_year',
             'monthOfRegistration': 'registration_month',
             'notRepairedDamage': 'unrepaired_damage',
             'dateCreated': 'date_created',
             'dateCrawled': 'date_crawled',
             'offerType': 'offer_type',
             'vehicleType': 'vehicle_type',
             'powerPS': 'power_ps',
             'fuelType': 'fuel_type',
             'nrOfPictures': 'num_pictures',
             'postalCode': 'postal_code',
             'lastSeen': 'last_seen'}, axis=1, inplace=True)
autos.columns
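
For names that only need a mechanical camelCase to snake_case conversion, a small helper could replace part of the manual mapping. This is a sketch under that assumption; `to_snake_case` is a hypothetical helper, not something the notebook defines. It does not reorder words, so renames such as `yearOfRegistration` to `registration_year` still need the explicit dictionary.

```python
import re

def to_snake_case(name: str) -> str:
    """Insert an underscore before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

for original in ["dateCrawled", "offerType", "powerPS", "lastSeen", "brand"]:
    print(original, "->", to_snake_case(original))
```

The helper could be applied with `autos.rename(columns=to_snake_case)`, followed by a short dictionary for the few names that need rewording.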

### 2. Find potentially redundant or mistyped columns 

In [None]:
# numerical columns
autos.describe()

In [None]:
# string columns
autos.describe(include='object')

### Observations from the output:
* two columns hold almost the same value in every row (49,999 out of 50,000): `seller` and `offer_type`
* some string columns can be converted to numeric: `price` and `odometer`
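
Near-constant columns like the ones noted above can also be flagged programmatically. The sketch below computes, for each column, the share of rows taken by its most frequent value. The frame `toy`, its values, and the `THRESHOLD` cut-off are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data: one nearly constant column and one varied column.
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "brand": ["bmw", "ford", "opel", "audi"],
})

# Share of rows covered by the most frequent value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share)

# Columns where a single value covers at least the chosen share of rows.
THRESHOLD = 0.75  # hypothetical cut-off for this toy example
near_constant = top_share[top_share >= THRESHOLD].index.tolist()
print(near_constant)
```

On the real data a much stricter threshold (for example 0.99) would be more appropriate, since the columns of interest are constant in all but one row.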

### Let's change the type of `price` column to float

Taking a look at the format

In [None]:
# or .unique()
autos['price'].value_counts().sort_index()

Remove non-numeric characters

In [None]:
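# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)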

Change type to float

In [None]:
autos['price'] = autos['price'].astype(float)
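
The two steps above can be checked on a handful of made-up price strings before running them on the full column. This is a minimal sketch; the values in `raw` are assumptions that imitate the `$5,000` style format. Passing `regex=False` makes the intent explicit, because `$` is a special character in regular expressions.

```python
import pandas as pd

# Made-up values in the same "$1,234" style as the price column.
raw = pd.Series(["$5,000", "$8,500", "$0"])

# Strip the currency symbol and thousands separator as literal characters, then convert.
cleaned = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)
print(cleaned.tolist())
```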

In [None]:
# numerical columns
autos.describe()

### Do the same for `odometer` column

In [None]:
autos['odometer'].value_counts().sort_index(key=lambda x: x.str.replace('km', '').str.replace(',', '').astype(int))

In [None]:
autos['odometer'] = autos['odometer'].str.replace('km', '').str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)

In [None]:
# add unit to the name of odometer column
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
autos.describe()

### 3. Look for suspicious data 

Do this for `odometer_km` and `price` columns

### Start with `odometer_km`

In [None]:
autos['odometer_km'].unique().shape

In [None]:
autos['odometer_km'].describe()

In [None]:
autos['odometer_km'].value_counts().sort_index()

The values look okay.

### Now for `price` column

In [None]:
autos['price'].unique().shape

In [None]:
autos['price'].describe()

In [None]:
autos['price'].value_counts()

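We'll remove high-priced outliers using the interquartile range (IQR): any price at or above the upper fence, Q3 + 1.5 * IQR, is dropped.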

In [None]:
mid_50 = autos['price'].quantile([.25, .75])
price_25, price_75 = mid_50.iloc[0], mid_50.iloc[1]

iqr = price_75 - price_25
low = price_25 - 1.5 * iqr
high = price_75 + 1.5 * iqr

low, iqr, high

In [None]:
autos = autos.loc[autos['price'] < high]
print('Count: ', autos.shape[0], '\nMin price: ', autos['price'].min(), '\nMax price:', autos['price'].max())

We have removed entries with prices of $16,350 or more.
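
To make the fence arithmetic concrete, here is the same calculation on six made-up prices (the values are assumptions picked so the quartiles are easy to verify by hand). With pandas' default linear interpolation, Q1 is 1250 and Q3 is 3750, so the IQR is 2500 and the fences are -2500 and 7500.

```python
import pandas as pd

# Made-up prices with one obvious outlier.
prices = pd.Series([500.0, 1000.0, 2000.0, 3000.0, 4000.0, 100000.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Tukey fences: values beyond 1.5 * IQR from the quartiles count as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(low, high)

# Keep only prices strictly below the upper fence, mirroring the filter above.
kept = prices[prices < high]
print(kept.tolist())
```

In this toy data the lower fence is negative, so only the upper fence can remove anything. That is one reason a price filter like the one above may use the upper bound alone.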

### 4. Dealing with columns with dates
Here is what we know about the date columns in the dataset:

- `date_crawled`: added by the crawler, identified as string
- `last_seen`: added by the crawler, identified as string
- `date_created`: from the website, identified as string
- `registration_month`: from the website, identified as numeric
- `registration_year`: from the website, identified as numeric

### First take a look at the 'string dates'

In [None]:
string_dates = autos[['date_crawled', 'last_seen', 'date_created']]
string_dates.head()

### Snoop around the dates for a bit, starting with `date_created`

First thing we notice is that only the date is specified (not the time)

In [None]:
date_created = string_dates['date_created'].str[:10]
date_created.value_counts(normalize=True, dropna=False).sort_index()

In [None]:
date_created.describe()

In [None]:
date_created.str[:4].astype(int).describe()

Some observations for `date_created`:

- The ads were created from August 10th, 2015 to April 7th, 2016
- They were created on 73 different days
- The highest number of ads posted in a day during the period is 1797, on April 3rd, 2016
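
As an alternative to slicing the strings, the column could be parsed into real datetimes, which gives the first day, last day and per-day counts directly. The sketch below assumes the timestamps follow a `YYYY-MM-DD HH:MM:SS` layout, as the `[:10]` slicing suggests. The sample values are made up.

```python
import pandas as pd

# Made-up timestamps in the assumed "YYYY-MM-DD HH:MM:SS" layout.
created = pd.Series([
    "2016-03-26 00:00:00",
    "2016-04-04 00:00:00",
    "2016-03-26 00:00:00",
])

# Parse into datetime64 values; the format string is an assumption about the data.
parsed = pd.to_datetime(created, format="%Y-%m-%d %H:%M:%S")

# Earliest and latest day, plus the number of ads per calendar day.
print(parsed.min(), parsed.max())
per_day = parsed.dt.date.value_counts().sort_index()
print(per_day)
```

With parsed values, `parsed.dt.year` replaces the `.str[:4].astype(int)` step used in the cells above.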

### Then move on to `last_seen`

In [None]:
last_seen = string_dates['last_seen'].str[:10]
last_seen.value_counts(normalize=True, dropna=False).sort_index()

In [None]:
last_seen.describe()

In [None]:
last_seen.str[:4].astype(int).describe()

Some observations for `last_seen`:

- The ads were last viewed from March 5th, 2016 to April 7th, 2016
- They were last viewed on 34 different days, roughly within a month before April 7th, 2016 
- Most ads (around 10,000 of them) were last viewed on April 6th, 2016

### Finally we do the same with `date_crawled`

In [None]:
date_crawled = string_dates['date_crawled'].str[:10]
date_crawled.value_counts(normalize=True, dropna=False).sort_index()

In [None]:
date_crawled.describe()

In [None]:
date_crawled.str[:4].astype(int).describe()

Some observations for `date_crawled`:

- The ads were scraped from March 5th, 2016 to April 7th, 2016
- They were scraped on 34 different days, roughly within a month before April 7th, 2016 
- The highest number of ads scraped in a day during the period is 1785, on April 3rd, 2016

### Let's also take a look at one of the columns with numeric dates - `registration_year`

In [None]:
autos['registration_year'].describe()

In [None]:
autos['registration_year'].astype(str).describe()

The first thing we notice is that there seems to be some erroneous data. The registration year can't be `1000` or `9999`. Let's do some cleaning.

In [None]:
print("Number of entries before: ", autos.shape[0])

In [None]:
# Firstly, the registration year can't be later than 2016, the year the data was crawled
year_high = 2016

# Secondly, let's assume the earliest year in which a car can be registered was the year 1885 (when the first car was invented)
year_low = 1885

autos = autos[autos['registration_year'].between(year_low, year_high)]

# Lastly, let's say an ad can't be posted before the car was registered
autos = autos[autos['registration_year'] <= autos['date_created'].str[:4].astype(int)]

print("Number of entries after: ", autos.shape[0])

We've removed around 2000 entries with invalid `registration_year`
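
The two rules used for this cleaning step can be traced on a few made-up rows. This is a sketch only; `toy` and its values are assumptions chosen so that each rule rejects at least one row.

```python
import pandas as pd

# Made-up rows: impossible years, a year after the crawl, and a car
# "registered" after its ad was created.
toy = pd.DataFrame({
    "registration_year": [1000, 1995, 2016, 2017, 9999, 2016],
    "date_created": ["2016-03-26 00:00:00"] * 5 + ["2015-08-10 00:00:00"],
})

year_low, year_high = 1885, 2016
created_year = toy["date_created"].str[:4].astype(int)

# Rule 1: year within [year_low, year_high]; between() is inclusive on both ends.
in_range = toy["registration_year"].between(year_low, year_high)
# Rule 2: the car cannot be registered after the year its ad was created.
not_after_ad = toy["registration_year"] <= created_year

valid = toy[in_range & not_after_ad]
print(valid["registration_year"].tolist())
```

Only the 1995 row and the first 2016 row survive; the last row is dropped because its ad was created in 2015.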

In [None]:
autos['registration_year'].describe()

In [None]:
autos['registration_year'].value_counts(normalize=True, dropna=False).sort_index()

All remaining cars were registered between 1910 and 2016.

### 5. Next, let's find out what kinds of cars were recorded in the dataset

Specifically, let's take a look at the 6 most common `brand`s in the dataset

In [None]:
top_5_brands_list = list(autos['brand'].value_counts(normalize=True).head(6).index)
top_5_brands_list

Get the price and mileage information for each of the 6 brands

In [None]:
top_prices_mileage_info = {}


for brand in top_5_brands_list:
    mean_price = autos.loc[autos['brand'] == brand, 'price'].mean()
    min_price = autos.loc[autos['brand'] == brand, 'price'].min()
    max_price = autos.loc[autos['brand'] == brand, 'price'].max()
    mileage = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
    
    top_prices_mileage_info[brand] = ["$" + str(round(mean_price, 2)),
                                      "$" + str(round(min_price, 2)),
                                      "$" + str(round(max_price, 2)),
                                      round(mileage, 2)]
    

for key, val in top_prices_mileage_info.items():
    print(key, ":")
    print('Average price: ', val[0])
    print('Min price: ', val[1])
    print('Max price: ', val[2])
    print('Average mileage', val[3])
    print('\n')

In [None]:
price_mileage = pd.DataFrame(data=top_prices_mileage_info, index=['mean_price', 'min_price', 'max_price', 'mean_mileage'])
# Transpose so that each brand is a row (DataFrame.swapaxes is deprecated in recent pandas versions)
price_mileage = price_mileage.T
price_mileage

Some observations:
- for each brand, the maximum price is roughly the same, around $16,000; this is expected, because prices of $16,350 or more were removed as outliers earlier
- `ford` and `opel` have lower mean prices and relatively lower mean mileage
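
The per-brand statistics could also be computed in one pass with `groupby` and named aggregation, which keeps the values numeric instead of formatting them as strings. The sketch below uses a made-up frame (`toy` and its numbers are assumptions) and takes the top 2 brands rather than 6 to keep the example small.

```python
import pandas as pd

# Made-up listings for three brands.
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "opel"],
    "price": [9000.0, 11000.0, 2000.0, 4000.0, 2500.0],
    "odometer_km": [150000, 125000, 100000, 150000, 125000],
})

# Most common brands (2 here; the notebook uses 6).
top_brands = toy["brand"].value_counts().head(2).index

# One row per brand, one named column per statistic.
summary = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")
    .agg(
        mean_price=("price", "mean"),
        min_price=("price", "min"),
        max_price=("price", "max"),
        mean_mileage=("odometer_km", "mean"),
    )
)
print(summary)
```

Because the result stays numeric, it can be sorted or plotted directly, for example with `summary.sort_values("mean_price")`.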