# Introduction

## Goal

The goal of this project is to clean and analyze a dataset of used car listings from eBay.

## Dataset

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). However, we will be using a modified version prepared by dataquest.io. This modified version has been cleaned to a limited degree but still needs additional work.


### Quick Description of Columns

- **dateCrawled:** When this ad was first crawled. All field-values are taken from this date.
- **name:** Name of the car.
- **seller:** Whether the seller is private or a dealer.
- **offerType:** The type of listing.
- **price:** The price on the ad to sell the car.
- **abtest:** Whether the listing is included in an A/B test.
- **vehicleType:** The vehicle type.
- **yearOfRegistration:** The year in which the car was first registered.
- **gearbox:** The transmission type.
- **powerPS:** The power of the car in PS.
- **model:** The car model name.
- **kilometer:** How many kilometers the car has driven.
- **monthOfRegistration:** The month in which the car was first registered.
- **fuelType:** What type of fuel the car uses.
- **brand:** The brand of the car.
- **notRepairedDamage:** Whether the car has damage that has not yet been repaired.
- **dateCreated:** The date on which the eBay listing was created.
- **nrOfPictures:** The number of pictures in the ad.
- **postalCode:** The postal code for the location of the vehicle.
- **lastSeenOnline:** When the crawler saw this ad last online.

### Project Goal

The goal of this notebook is to demonstrate Python fundamentals in data cleaning and analysis using pandas and NumPy.

In [None]:
#import needed modules
import numpy as np
import pandas as pd

In [None]:
#Import dataset
autos = pd.read_csv('../input/used-cars-database-50000-data-points/autos.csv', encoding = 'Latin-1')

#Explore dataset
autos.info()
autos.head(3)

In [None]:
#Clean the column names from camelcase to snakecase
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen']

#Check column name update (print is needed because only the last expression in a cell is displayed)
print(autos.columns)
autos.head(2)




##### Changing column names explanation

- Beyond personal preference for how the columns look, changing the column format from camelCase to snake_case makes the names easier to read for anyone who uses the dataset often. Additionally, renaming the original columns yearOfRegistration, monthOfRegistration, notRepairedDamage, and dateCreated clarifies what data those columns hold. In other words, we made the column names more descriptive of the data they represent in the autos dataset.
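
As an aside, the camelCase to snake_case conversion can also be done programmatically. The sketch below is an illustration only, not the approach used in this notebook: `camel_to_snake` is a hypothetical helper, and the sample names are taken from the column list above. Note that an automatic conversion keeps the original word order (for example `year_of_registration`), so the more descriptive names chosen by hand above would still need a manual rename.

```python
import re

def camel_to_snake(name):
    # Insert an underscore wherever a lowercase letter or digit
    # is directly followed by an uppercase letter, then lowercase everything
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names from the dataset description
original = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'nrOfPictures']
converted = [camel_to_snake(c) for c in original]
print(converted)
```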

### Data Exploration and Cleaning

- Let's take a look at all the columns and look for anything that might not make sense.


In [None]:
#Explore dataset to determine what other cleaning tasks are needed
autos.describe(include = 'all')






From the table above we can see that:

- The seller and offer_type columns each hold (almost) the same value in every row
- The number_of_pictures column is full of 0's



In [None]:
#Investigating number of picture columns to see if it has different values
autos['number_of_pictures'].value_counts()

- It looks like our initial thought was correct: the number_of_pictures column is full of 0's.
- We will drop the columns seller, offer_type, and number_of_pictures to clean the dataset further.
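
Columns like these can also be flagged automatically. The following is a minimal sketch on a small made-up dataframe: `near_constant_columns` and its `threshold` parameter are hypothetical names introduced for illustration, and the toy values are not taken from the real dataset.

```python
import pandas as pd

# Made-up data: two columns with a single repeated value, one with variety
toy = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'privat'],
    'number_of_pictures': [0, 0, 0, 0],
    'brand': ['bmw', 'audi', 'ford', 'bmw'],
})

def near_constant_columns(df, threshold=0.99):
    # Return columns whose most common value covers at least `threshold` of the rows
    cols = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            cols.append(col)
    return cols

constant_cols = near_constant_columns(toy)
print(constant_cols)
```

Columns returned by such a check carry little information for analysis, which is the reasoning behind dropping them here.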

In [None]:
#Dropping columns from the autos dataset
autos = autos.drop(['number_of_pictures', 'seller', 'offer_type'], axis = 1)
autos.info()

### Exploring Price and Odometer

- Notice that the price and odometer columns are stored as the object data type. We will correct this by removing the non-numeric characters and converting each column to an integer data type.
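
To make the conversion concrete, here is a small sketch on made-up values (the strings below are illustrative, not rows from the dataset). Passing `regex=False` makes pandas treat `$` as a literal character rather than a regular-expression anchor.

```python
import pandas as pd

# Made-up rows that mimic the text formatting of the two columns
toy = pd.DataFrame({
    'price': ['$5,000', '$8,500'],
    'odometer': ['150,000km', '70,000km'],
})

# Strip the currency symbol, thousands separators, and unit, then cast to int
toy['price'] = (toy['price']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False)
                .astype(int))
toy['odometer'] = (toy['odometer']
                   .str.replace(',', '', regex=False)
                   .str.replace('km', '', regex=False)
                   .astype(int))
print(toy.dtypes)
```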

In [None]:
#Change the Price and odometer columns to numeric datatypes/ remove non-numeric characters
print(autos[['price','odometer']].dtypes)
print(autos[['price', 'odometer']].head(2))
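#regex=False treats '$' as a literal character instead of a regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$','', regex=False).str.replace(',','', regex=False).astype(int)
autos['odometer'] = autos['odometer'].str.replace(',','', regex=False).str.replace('km','', regex=False).astype(int)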
print(autos[['price','odometer']].head(2))
print(autos[['price', 'odometer']].dtypes)



- Since we removed the km from the odometer column, an analyst will no longer know the data's unit. To fix this, we will rename the column to include the unit.

In [None]:
#Rename the odometer column to odometer_km to specify the unit
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)

#Check if name has been changed 
autos.columns

In [None]:
autos['odometer_km'].value_counts()

In [None]:
#Explore the price data to look for outliers
print(autos['price'].value_counts().shape)
print(autos['price'].describe())
print(autos['price'].value_counts().head(20))

In [None]:
print(autos['price'].value_counts().sort_index(ascending = True).head(20))

In [None]:
print(autos['price'].value_counts().sort_index(ascending = False).head(20))

- In the output above, we see that there are 2,357 unique prices, with a minimum price of 0 and a maximum price of 100,000,000 dollars. Obviously a used car will not be free, nor will it cost 100,000,000 dollars. Because eBay is an auction site, a seller may plausibly start the bidding at 1 dollar, so low prices are kept. At the high end, some listings exceed 350,000 dollars, climbing to 1 million dollars and beyond. To make the dataset more reliable, we limit the price to the range 1 to 350,000 dollars.
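
The effect of this range filter can be sketched on a handful of made-up prices (illustrative values only). `Series.between` includes both endpoints by default, so prices of exactly 1 and exactly 350,000 are kept.

```python
import pandas as pd

# Made-up prices covering both extremes discussed above
prices = pd.Series([0, 1, 500, 350000, 999990, 100000000])

# between() is inclusive on both ends by default
kept = prices[prices.between(1, 350000)]
print(kept.tolist())
```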

In [None]:
#Keep prices that fall within 1 to 350,000 dollars (both ends inclusive)
autos = autos[autos['price'].between(1,350000)]
autos['price'].describe()

In [None]:
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

- From the output above, we can tell that the site was crawled daily.
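
The date columns are stored as text, so the first 10 characters of each value hold the date part. A minimal sketch of the technique on made-up timestamps (assumed to follow a `YYYY-MM-DD HH:MM:SS` layout):

```python
import pandas as pd

# Made-up timestamps in an assumed 'YYYY-MM-DD HH:MM:SS' layout
stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-03-26 18:57:24',
    '2016-04-04 13:38:56',
    '2016-03-12 16:58:10',
])

# Keep the date part, compute each date's share of rows, and order chronologically
daily_share = stamps.str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(daily_share)
```

Sorting by index works chronologically here because dates written year-first sort correctly as plain strings.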

In [None]:
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index().tail(40)

- As the dates get closer to 2015, fewer ads were being published. This could be because people did not yet know how to use eBay for used cars back then, or because the marketplace had not been fully adopted yet.

In [None]:
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

- Again, the share of used car ads shrinks as the dates approach 2015, meaning ads were staying up longer even though some were taken down every day. Closer to 2016, more used car ads are taken down each day.

In [None]:
autos['registration_year'].describe()

- The output shows a minimum registration year of 1000 and a maximum of 9999, which is far in the future. This is a clear indication that there are outliers in the registration_year column. Realistically, we would expect the earliest car registration, if any, to be in the 1900s. So how do we fix this problem?

One option is to keep only the rows that fall within the range 1900 to 2016. Let's see how many rows fall outside that range. If a lot of data points do, we might have to consider another option.

In [None]:
#Find the share of registration years that fall outside 1900 to 2016
(~autos['registration_year'].between(1900,2016)).sum() / autos.shape[0]

- Given that less than 4% of the data falls outside the range 1900 to 2016, our option of removing those rows from the dataset is acceptable.
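
The out-of-range share can be worked through on a few made-up years (illustrative values only). Taking the mean of a boolean mask gives the same result as dividing the count of `True` values by the number of rows.

```python
import pandas as pd

# Made-up registration years, three of which are implausible
years = pd.Series([1000, 1985, 2005, 2016, 2019, 9999, 2010, 1999])

# True for rows outside the inclusive range 1900 to 2016
outside = ~years.between(1900, 2016)

# The mean of a boolean mask is the fraction of True values
share_outside = outside.mean()
print(share_outside)
```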

In [None]:
#Keep only the rows registered between 1900 and 2016
autos = autos[autos["registration_year"].between(1900, 2016)]

#Find the share of cars registered in each of the 15 most common years
print(autos["registration_year"].value_counts(normalize = True).head(15))
print("\n")

#Sanity check: the normalized shares of all remaining years should sum to 1
print(autos["registration_year"].value_counts(normalize = True).sum())

After filtering, every remaining registration_year falls within the range 1900 to 2016 (the normalized shares sum to 1, up to floating-point rounding), with most of them being in the 2000s.

In [None]:
#Find the average price of the top 20 brands on eBay
brand_count = autos['brand'].value_counts(normalize = True).head(20)
brand_count_index = brand_count.index

brand_mean_price = {}

for b in brand_count_index:
    selected_rows = autos[autos['brand'] == b]
    mean_price = selected_rows['price'].mean()
    brand_mean_price[b] = round(float(mean_price),2)
    
    
print(brand_count)    
print('\n')
print(brand_mean_price) 


    


##### Explanation

The code above builds a dictionary that maps each of the 20 most-listed brands on eBay to its mean price.
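
For comparison, the same kind of per-brand aggregation can be written with `groupby` instead of a loop. This is an alternative sketch on made-up data, not the method used in this notebook; the toy brands and prices are illustrative only.

```python
import pandas as pd

# Made-up listings
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'ford', 'audi'],
    'price': [8000, 10000, 9000, 3000, 11000],
})

# Pick the 2 most common brands (the notebook uses the top 20)
top_brands = toy['brand'].value_counts().head(2).index

# Mean price per brand, restricted to the selected brands
mean_price = (toy[toy['brand'].isin(top_brands)]
              .groupby('brand')['price']
              .mean()
              .round(2))
print(mean_price)
```

The loop-based version makes each step explicit, while `groupby` performs the selection and averaging in one pass.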


In [None]:
#Take the brand_mean_price dictionary and translate it into a dataframe
bmp_series = pd.Series(brand_mean_price)
print(bmp_series)

df = pd.DataFrame(bmp_series, columns=['mean_price'])
df

In [None]:
#Find the average mileage of the top 20 brands on eBay
brand_mean_mileage = {}

for b in brand_count_index:
    selected_rows = autos[autos['brand'] == b]
    mean_mileage = selected_rows['odometer_km'].mean()
    brand_mean_mileage[b] = int(mean_mileage)
    
brand_mean_mileage



In [None]:
#Convert both dictionaries into sorted series, then build a dataframe from the mileage series
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending = False)
mean_prices = pd.Series(brand_mean_price).sort_values(ascending = False)
print(mean_mileage)
print('\n')
print(mean_prices)



brand_info = pd.DataFrame(mean_mileage, columns = ['mean_mileage'])
brand_info



In [None]:
#Create a dataframe that holds brands, mean_mileage and mean_price
brand_info['mean_price'] = mean_prices
brand_info

##### Explanation

- The goal above was to compare each brand's average mileage with its average price. A dictionary holds only one value per brand, so the two dictionaries cannot be compared side by side directly. We therefore converted them into series and combined them in a dataframe, which lets us compare both measures for every brand at once.
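
The reason this works is that pandas aligns series by their index labels, not by position. A minimal sketch with made-up numbers (the brands and values are illustrative, not results from the dataset):

```python
import pandas as pd

# Two series sorted differently, so their positions do not line up
mean_mileage = pd.Series({'bmw': 132000, 'audi': 129000}).sort_values(ascending=False)
mean_price = pd.Series({'bmw': 8300.0, 'audi': 9300.0}).sort_values(ascending=False)

# Building a dataframe from a dict of series matches rows by brand label
brand_info = pd.DataFrame({'mean_mileage': mean_mileage, 'mean_price': mean_price})
print(brand_info)
```

Each brand's mileage and price end up on the same row even though the two series were in different orders.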