**In this project, we will clean and analyse a car sales company dataset**

This dataset has 20 columns and over 50k rows. The columns are as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle Type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- odometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler saw this ad last online.

First, we will import the pandas and NumPy libraries, which are fundamental tools for analysing data.

In [None]:
import pandas as pd
import numpy as np

In [None]:
autos = pd.read_csv('../input/used-cars-database-50000-data-points/autos.csv', encoding='Latin_1')

In [None]:
autos.info()
autos.head()

After checking the columns and some of their values, we identified what needs to be done to clean this dataset:

- The 'price' column should hold numeric values, but has object (text) values instead. A clearer name would be 'price_euros', since this is a German dataset.
- The 'odometer' column should hold int values and should be renamed to 'odometer_km'.
- The 'dateCreated' column could use datetime values, but has object values instead.

In [None]:
# Rename only the columns; passing index=str would also turn the row labels into strings.
autos = autos.rename(columns={
    'yearOfRegistration': 'registration_year', 'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage', 'dateCreated': 'ad_created',
    'dateCrawled': 'date_crawled', 'offerType': 'offer_type',
    'vehicleType': 'vehicle_type', 'powerPS': 'power_ps',
    'odometer': 'odometer_km', 'fuelType': 'fuel_type',
    'nrOfPictures': 'pictures_num', 'postalCode': 'postal_code',
    'lastSeen': 'last_seen',
})

The first thing we did was fix the column labels, converting them to lower case and using underscores as separators where needed. Next, we will convert to numeric the values that are stored as text (object).
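The renaming above uses a hand-written mapping. As a minimal sketch of an alternative, a small helper can convert camelCase labels to snake_case automatically. The helper name `to_snake` is hypothetical, and the result differs from the hand-picked names (for example it gives `year_of_registration`, not `registration_year`).

```python
import re

def to_snake(name):
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column labels, for illustration.
for col in ['dateCrawled', 'powerPS', 'nrOfPictures', 'price']:
    print(col, '->', to_snake(col))
```

With a DataFrame, the same function could be passed directly as `autos.rename(columns=to_snake)`, since `rename` accepts a callable.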

Below, we convert the price column to the int type, removing any non-numeric characters that could get in the way.

In [None]:
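# Strip letters, thousands separators and the currency symbol, then cast to int.
# regex is set explicitly because its default differs between pandas versions.
price_num = autos['price'].str.replace(r'[a-zA-Z]', '', regex=True)
price_num = price_num.str.replace(',', '', regex=False)
price_num = price_num.str.replace('$', '', regex=False)
price_num = price_num.astype(int)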

In [None]:
autos['price'] = price_num

In [None]:
km = autos['odometer_km'].str.replace(',','')
km = km.str.replace('km', '').astype(int)
autos['odometer_km'] = km
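The two conversions above follow the same pattern: remove everything that is not a digit, then cast to int. A minimal sketch of that pattern as one reusable helper, using made-up sample values and a hypothetical function name `to_int`:

```python
import pandas as pd

# Toy values in the same shape as the raw columns (made up for illustration).
raw_price = pd.Series(['$5,000', '$8,500', '$0'])
raw_km = pd.Series(['150,000km', '70,000km', '5,000km'])

def to_int(series):
    """Drop every character that is not a digit, then cast to int."""
    return series.str.replace(r'\D', '', regex=True).astype(int)

clean_price = to_int(raw_price)
clean_km = to_int(raw_km)
print(clean_price.tolist())  # [5000, 8500, 0]
print(clean_km.tolist())     # [150000, 70000, 5000]
```

This shortcut is only safe for non-negative whole numbers, because a minus sign or a decimal point would be stripped along with the other characters.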

When we checked the price column, we found some anomalous values, such as cars priced at 0 and a car priced at 9999999 EUR. These signs made it clear that we should investigate this column further and see whether more values fail to match reality.

We found other anomalous values, such as cars priced at 45 and 50 EUR, and others priced at 1 million or more. So we checked some car sales websites to get a general idea of the minimum and maximum prices found in that market.

On German, French and Italian used-car websites, we found prices mostly between 100 EUR and 100,000 EUR. We used that information to clean our data, removing prices outside a plausible range. The filter below keeps prices from 100 up to 500,000.

In [None]:
valid_price = autos[autos['price'].between(100,500000)]

In [None]:
autos = valid_price
autos.info()
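Before dropping rows, it can help to see how much of the data a range filter removes. A minimal sketch on made-up prices, using the same bounds as the filter above:

```python
import pandas as pd

# Made-up prices covering both tails of the distribution.
prices = pd.Series([0, 45, 100, 2500, 500000, 9999999])

mask = prices.between(100, 500000)  # inclusive on both ends by default
kept = prices[mask]

print(kept.tolist())                             # [100, 2500, 500000]
print(f'{(~mask).mean():.0%} of rows removed')   # 50% of rows removed
```

Because `between` is inclusive by default, listings priced at exactly 100 or 500000 are kept.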

After filtering the entries by price, we kept 48,224 vehicle records. Time to check the odometer_km column for anomalous values.

In [None]:
autos['odometer_km'].describe()

The odometer column looks fine, as all its values are between 5,000 km and 150,000 km.

Now we will analyse columns with time values.

In [None]:
autos['date_crawled']

In [None]:
date_crawled = autos['date_crawled'].str[:10]

In [None]:
date_crawled.value_counts(normalize=True, dropna=False).sort_index(ascending=False).describe()

In [None]:
ad_created = autos['ad_created'].str[:10]
ad_created.value_counts(normalize=True, dropna=False).sort_index().describe()

In [None]:
last_seen = autos['last_seen'].str[:10]
last_seen.value_counts(normalize=True, dropna=False).sort_index().describe()
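Slicing the first ten characters works because the timestamps are text in a fixed format. As a minimal sketch of an alternative, assuming timestamps in the `YYYY-MM-DD HH:MM:SS` shape and using made-up values, `pd.to_datetime` parses them into real datetimes and `.dt.normalize()` truncates each one to its day:

```python
import pandas as pd

# Made-up timestamps in the assumed format.
stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
])

# Parse the text, then set every time of day to midnight.
days = pd.to_datetime(stamps).dt.normalize()

# Share of rows per day, in chronological order.
share_per_day = days.value_counts(normalize=True).sort_index()
print(share_per_day)
```

Real datetime values also sort and compare chronologically regardless of how the original text was formatted.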

In [None]:
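# .max() on date strings returns the latest date, not the most frequent one,
# so we count the rows per day and take the day with the highest count.
print('Busiest day for date crawled, date of ad creation and last seen')
print(date_crawled.value_counts().idxmax())
print(ad_created.value_counts().idxmax())
print(last_seen.value_counts().idxmax())

In [None]:
# Likewise, the day with the lowest count.
print('Quietest day for the same')
print(date_crawled.value_counts().idxmin())
print(ad_created.value_counts().idxmin())
print(last_seen.value_counts().idxmin())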

In [None]:
registration_year = autos['registration_year']
registration_year.describe()

As we can see, there are entries with a registration year of 9999 and others before 1000, which is suspect: cars were virtually nonexistent before 1900, and the dataset was compiled in 2016. We checked how many entries we would lose by eliminating those.

In [None]:
safe_reg_year_bool = autos['registration_year'].between(1900, 2016)

In [None]:
safe_years_autos = autos[safe_reg_year_bool]
safe_years_autos

After eliminating cars with registration years outside the 1900-2016 range, we still have 46,352 entries, which is plenty for our work. Now, let's remove these values from the original dataset and check the distribution of values by year.

In [None]:
autos = safe_years_autos
autos['registration_year'].value_counts(normalize=True)

We found that the year with the most registered cars in the dataset is 2000, with 6.6% of the total. Most of the vehicles in the dataset were registered between 1997 and 2016, with each year accounting for roughly 2% to 6%.
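Statements like these can be checked directly with boolean masks. A minimal sketch on made-up registration years:

```python
import pandas as pd

# Made-up registration years, already limited to the 1900-2016 range.
years = pd.Series([1985, 1999, 2000, 2000, 2005, 2016])

# Year with the largest share of listings.
shares = years.value_counts(normalize=True)
top_year = shares.idxmax()

# Fraction of listings registered between 1997 and 2016 (inclusive).
share_recent = years.between(1997, 2016).mean()

print(top_year, round(shares.max(), 3))
print(round(share_recent, 3))
```

The mean of a boolean mask is the fraction of `True` values, which is why `.mean()` gives the share directly.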

Now, we will find the most expensive and the cheapest brands (by mean price), as well as the most expensive and the cheapest cars.

In [None]:
cheap_cars = {}
top_car = {}
top_mean_price = {}

unibrand = autos['brand'].unique()

for brand in unibrand:
    selected_rows = autos[autos['brand'] == brand]

    # Sort this brand's listings by price in both directions.
    expensive = selected_rows.sort_values('price', ascending=False)
    cheap = selected_rows.sort_values('price', ascending=True)

    # Most expensive listing of the brand.
    top = expensive[:1]
    top = top[['name', 'price', 'registration_year']]
    top_car[brand] = top

    # Mean price of the brand.
    expensive_mean = expensive['price'].mean()
    top_mean_price[brand] = expensive_mean

    # Cheapest listing of the brand, stored per brand like top_car.
    cheapest = cheap[:1]
    cheapest = cheapest[['name', 'price', 'registration_year']]
    cheap_cars[brand] = cheapest


In [None]:
top_mean = sorted(top_mean_price.items(), key=lambda kv: kv[1], reverse=True)

In [None]:
top_mean
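The loop and the sorted dictionary can also be expressed with a single `groupby`. A minimal sketch on made-up data, not a replacement for the analysis above:

```python
import pandas as pd

# Made-up listings for illustration.
cars = pd.DataFrame({
    'brand': ['porsche', 'porsche', 'fiat', 'fiat', 'bmw'],
    'price': [40000, 60000, 1500, 2500, 9000],
})

# Mean price per brand, most expensive brand first.
mean_price = cars.groupby('brand')['price'].mean().sort_values(ascending=False)
print(mean_price)
```

The result is a Series indexed by brand, so it can be used directly as a DataFrame column later on.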

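Here is the cheapest vehicle in the dataset, a 1999 Lancia Bastler. Note that `cheapest` holds the lowest-priced listing of the last brand processed by the loop above, so other brands may have listings at the same or a lower price.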

In [None]:
cheapest[:1]

And here is the most expensive car, a 2016 Porsche 991.

In [None]:
top_car['porsche']
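To find extreme listings without sorting inside a loop, `idxmax` and `idxmin` return the row labels of the highest and lowest prices. A minimal sketch with made-up listing names and prices:

```python
import pandas as pd

# Made-up listings for illustration.
cars = pd.DataFrame({
    'name': ['Porsche 911', 'Porsche Boxster', 'Fiat Punto', 'Fiat Panda'],
    'brand': ['porsche', 'porsche', 'fiat', 'fiat'],
    'price': [60000, 20000, 900, 1500],
})

# Row label of the most expensive listing within each brand.
top_per_brand = cars.loc[cars.groupby('brand')['price'].idxmax()]

# Single cheapest listing across the whole table.
cheapest_overall = cars.loc[cars['price'].idxmin()]

print(top_per_brand[['brand', 'name', 'price']])
print(cheapest_overall['name'], cheapest_overall['price'])
```

When several listings tie for the extreme price, `idxmax` and `idxmin` return only the first one encountered.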

We will now compare mileage and price by brand, to see whether there is a relation between these two metrics.

Below, we create a dictionary with the mean mileage by brand.

In [None]:
mean_kilometrage = {}

for brand in unibrand:
    selected_rows = autos[autos['brand'] == brand]
    meankm = selected_rows['odometer_km'].mean()
    mean_kilometrage[brand] = meankm
    

In [None]:
mean_km_apres = sorted(mean_kilometrage.items(), key=lambda kv: kv[1], reverse=True)

The list is now sorted by mean mileage per brand, from highest to lowest. Here it is:

In [None]:
mean_km_apres

To extract these values, we had to create dictionaries, which are not an ideal final format for working with data. We will turn these dictionaries into a single DataFrame with the code below.

In [None]:
mean_prices = pd.Series(top_mean_price).astype(float)
mean_km = pd.Series(mean_kilometrage)

In [None]:
km_price_mean = pd.DataFrame()
km_price_mean['mean_price'] = mean_prices

In [None]:
km_price_mean['mean_km'] = mean_km

Let's take a look at the new DataFrame.

In [None]:
km_price_mean
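The same two-column table can be built in one step with named aggregation, skipping the intermediate dictionaries. A minimal sketch on made-up data, assuming the cleaned column names `price` and `odometer_km`:

```python
import pandas as pd

# Made-up listings for illustration.
cars = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'fiat', 'fiat'],
    'price': [9000, 11000, 1500, 2500],
    'odometer_km': [150000, 100000, 125000, 150000],
})

# One row per brand, with one named output column per aggregation.
summary = cars.groupby('brand').agg(
    mean_price=('price', 'mean'),
    mean_km=('odometer_km', 'mean'),
)
print(summary)
```

Each keyword argument names an output column and pairs an input column with the function to apply to it.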

Now that we have a new DataFrame with both columns, operations between them are easier. We will create an index by dividing each brand's mean mileage by its mean price.

Our goal is to find out how many kilometers each brand's cars have run per euro of their value.

In [None]:
index_km_value = km_price_mean['mean_km'] / km_price_mean['mean_price']

And here's the result:

In [None]:
index_km_value.sort_values()

We found that the most expensive brands are in fact the least used ones, those with the lowest mean mileage.
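One way to put a number on this relation is the correlation between the two brand-level columns. A minimal sketch on made-up brand averages, which are not taken from the dataset:

```python
import pandas as pd

# Made-up brand averages for illustration.
summary = pd.DataFrame(
    {
        'mean_price': [50000.0, 10000.0, 2000.0],
        'mean_km': [60000.0, 125000.0, 140000.0],
    },
    index=['porsche', 'bmw', 'fiat'],
)

# Kilometers driven per euro of price, as in the index above.
km_per_euro = summary['mean_km'] / summary['mean_price']

# Pearson correlation; a negative value means pricier brands have lower mileage.
corr = summary['mean_km'].corr(summary['mean_price'])

print(km_per_euro.sort_values())
print(round(corr, 2))
```

A correlation computed over a few dozen brand averages is only a rough indicator, since it ignores how many listings each brand has.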

Now, we will find the most common brand/model combination, and the least common.

In [None]:
brand_ocurrences = {}
brand_total_vehicles = {}

for brand in unibrand:
    selected_rows = autos[autos['brand'] == brand]
    brand_count = selected_rows['model']
    brand_ocurrences[brand] = brand_count
    brand_total_vehicles[brand] = brand_ocurrences[brand].shape[0]

In [None]:
brand_total_vehicles_org = sorted(brand_total_vehicles.items(), key=lambda kv: kv[1], reverse=True)

The list below shows the most common brands in the dataset. Volkswagen, BMW, Opel, Mercedes-Benz and Audi lead it. This makes sense: this is a German dataset, and all the brands with the most vehicles for sale are German too.

In [None]:
brand_total_vehicles_org

In [None]:
volks_models = autos[autos['brand'] == 'volkswagen']

As we can see, Volkswagen is the most common brand. And now, the most common model.

In [None]:
volks_models['model'].value_counts()

We conclude that the Volkswagen Golf is the most common brand/model combination in this dataset.
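The search above looks at the models of the top brand only. As a minimal sketch on made-up data, grouping by both columns counts every brand/model combination at once, which also exposes the least common ones:

```python
import pandas as pd

# Made-up listings for illustration.
cars = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen', 'volkswagen', 'bmw', 'bmw', 'fiat'],
    'model': ['golf', 'golf', 'golf', 'polo', '3er', '3er', 'panda'],
})

# Number of listings per (brand, model) pair, most common first.
combos = cars.groupby(['brand', 'model']).size().sort_values(ascending=False)

print(combos.head(3))
print(combos.idxmax())
```

By default, `groupby` leaves out rows where either key is missing, so listings without a model are not counted.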

We will now check the price difference for vehicles with unrepaired damage.

In [None]:
damage = autos[autos['unrepaired_damage'] == 'ja']

In [None]:
no_damage = autos[autos['unrepaired_damage'] != 'ja']

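As the two means below show, vehicles with unrepaired damage are around three times cheaper than the others on average. Note that `no_damage` also includes listings where the damage field is missing.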

In [None]:
no_damage['price'].mean()

In [None]:
damage['price'].mean()
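The two filtered frames can be replaced by one grouped mean, which also shows listings with a missing damage field as their own group. A minimal sketch on made-up data, where the label `unknown` is an assumption chosen for illustration:

```python
import pandas as pd

# Made-up listings; None stands for a missing damage field.
cars = pd.DataFrame({
    'unrepaired_damage': ['ja', 'nein', 'nein', None, 'ja'],
    'price': [1000, 6000, 8000, 3000, 3000],
})

# Give missing values an explicit label so they form their own group.
cars['unrepaired_damage'] = cars['unrepaired_damage'].fillna('unknown')

mean_by_damage = cars.groupby('unrepaired_damage')['price'].mean()
ratio = mean_by_damage['nein'] / mean_by_damage['ja']

print(mean_by_damage)
print(ratio)  # 3.5
```

Keeping the missing values separate makes it clear whether they pull the "no damage" average up or down.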