# PROJECT OF PANDAS AND NUMPY EVALUATION ON DATAQUEST.IO

The dataset studied here comes from *eBay Kleinanzeigen*, the classifieds section of the German eBay website. It contains listings for used cars.

The dataset was originally downloaded from Kaggle. The Dataquest.io team then made two changes to it:

- They sampled the dataset so that the code runs faster.
- They dirtied the dataset. The version uploaded to [kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data) had already been cleaned, so errors were reintroduced to make it possible to practice the techniques taught in the course.

## 1. Dataset Structure

The data dictionary provided is as follows:

* `dateCrawled` - When the ad was first crawled.
* `name` - Name of the car.
* `seller` - Whether the seller is private or a dealer.
* `offerType` - The type of listing.
* `price` - The price on the ad to sell the car.
* `abtest` - Whether the listing is included in an A/B test.
* `vehicleType` - The vehicle type.
* `yearOfRegistration` - The year in which the car was first registered.
* `gearbox` - The transmission type.
* `powerPS` - The power of the car in PS.
* `model` - The car model name.
* `kilometer` - How many kilometers the car has driven.
* `monthOfRegistration` - The month in which the car was first registered.
* `fuelType` - What type of fuel the car uses.
* `brand` - The brand of the car.
* `notRepairedDamage` - If the car has a damage which is not yet repaired.
* `dateCreated` - The date on which the eBay listing was created.
* `nrOfPictures` - The number of pictures in the ad.
* `postalCode` - The postal code for the location of the vehicle.
* `lastSeenOnline` - When the crawler saw this ad last time.

## 2. Objective of the project

The aim of this project is to clean the data and analyze the included used car listings.

## 3. Process

### 3.1. Import the libraries

First of all, the libraries needed to clean and inspect the dataset are imported, and the CSV file is read into a dataframe.

In [1]:
import numpy as np #Numpy library
import pandas as pd #Pandas library

#Read the csv. As the UTF-8 default encoding does not work, Latin-1
#encoding is used.
autos = pd.read_csv('autos.csv' , encoding = 'Latin-1') 

#Some lines of code for checking the dataset
#shp = autos.shape
#print(autos.head())
#print(shp)
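
The encoding fallback mentioned in the comment above can be sketched as follows. This is a minimal illustration rather than the notebook's own cell: the helper name `read_csv_with_fallback` and the in-memory sample bytes are assumptions, chosen so the example runs without `autos.csv`.

```python
import io

import pandas as pd


def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn, return the first successful parse."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes, try the next one
    raise ValueError("none of the encodings could decode the data")


# Made-up sample: the 'Ü' encoded as Latin-1 is not a valid UTF-8 sequence.
sample_bytes = "name,price\nFord_Focus_TÜV_neu,$1350\n".encode("latin-1")
df, used_encoding = read_csv_with_fallback(sample_bytes)
print(used_encoding)
```

Trying UTF-8 first keeps the common case unchanged and only falls back to Latin-1 when decoding fails.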

### 3.2. Dataset checking

In [2]:
autos #Pandas dataframe of the dataset

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

The previous cell lists the information for each column. A few things stand out:

- The dataset has 20 columns, most of which are strings.

- The column names use camelCase.

- Some columns contain null values, as their number of non-null elements is smaller than 50,000.

- Columns with the `object` dtype hold strings, whereas columns such as `yearOfRegistration` are series of integers.

To clean the data, the null values have to be checked to see whether anything can be done about them, and the string columns have to be checked for data that would be better stored as an integer or a float.
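
As a rough sketch of the null check, the count and share of missing values per column can be computed directly. The miniature frame below is made up for illustration and only stands in for `autos`.

```python
import pandas as pd

# Made-up miniature frame standing in for `autos`
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine"],
    "fuelType": ["lpg", "benzin", None, None],
    "price": ["$5,000", "$8,500", "$300", "$250"],
})

null_counts = sample.isnull().sum()      # missing values per column
null_share = null_counts / len(sample)   # the same, as a fraction of all rows

print(null_counts)
print(null_share)
```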

### 3.3. Edition of the column names

First of all, the column names are printed.

In [4]:
column_names = autos.columns
print(column_names)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


The column names are edited in place with the `DataFrame.rename` method, which looks up each key of the dictionary among the column names and replaces it with the corresponding value.

Running the following cell more than once is harmless. With the default `errors='ignore'`, keys that match no column are skipped, so on a second execution, when the names have already been changed, the call leaves the dataframe as it is and raises no error.
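
For reference, here is a small sketch of how `DataFrame.rename` treats dictionary keys that match no column. The two-column frame is made up for illustration. With the default `errors='ignore'` a repeated call is a no-op, while `errors='raise'` turns unmatched keys into a `KeyError`.

```python
import pandas as pd

# Made-up two-column frame standing in for `autos`
sample = pd.DataFrame({"yearOfRegistration": [2004], "powerPS": [158]})
mapping = {"yearOfRegistration": "registration_year", "powerPS": "power_ps"}

once = sample.rename(mapping, axis=1)
twice = once.rename(mapping, axis=1)  # keys no longer match any column: nothing changes
print(list(twice.columns))

# Opting in to strict behaviour: unmatched keys now raise a KeyError
raised = False
try:
    once.rename(mapping, axis=1, errors="raise")
except KeyError:
    raised = True
print(raised)
```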

In [5]:
#column_names_copy = column_names.copy() #Copy is done

#The names of the columns are renamed
autos.rename(
             {'yearOfRegistration' : 'registration_year' ,
              'monthOfRegistration' : 'registration_month',
              'notRepairedDamage' : 'unrepaired_damage' ,
              'dateCreated' : 'ad_created' ,
              'dateCrawled' : 'date_crawled' ,
              'offerType' : 'offer_type' ,
              'vehicleType' : 'vehicle_type' ,
              'powerPS' : 'power_ps' ,
              'fuelType' : 'fuel_type' ,
              'nrOfPictures' : 'num_pictures' ,
              'postalCode' : 'postal_code' ,
              'lastSeen' : 'last_seen' ,
             } , 
             axis = 1 ,
             inplace = True)


In [6]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


As seen in the table above, the camelCase column names have been changed to snake_case.

### 3.4. Data exploration

Some data exploration is needed to find out whether further cleaning steps are required. Two things to look for:

- Text columns where all or almost all values are the same. These can often be dropped, as they carry no useful information for the analysis.

- Numeric data stored as text, which can be cleaned and converted.
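
One way to spot near-constant columns is to measure how much of each column is taken up by its most frequent value. The sketch below uses a made-up frame, a hypothetical helper `dominant_share`, and an arbitrary 85% threshold; none of these come from the notebook.

```python
import pandas as pd

# Made-up frame: 'seller' is almost constant, 'num_pictures' is constant
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "num_pictures": [0] * 10,
    "brand": ["ford", "bmw", "seat", "ford", "smart",
              "bmw", "ford", "renault", "peugeot", "chrysler"],
})


def dominant_share(frame):
    """Hypothetical helper: share of rows taken by each column's most frequent value."""
    return pd.Series({
        col: frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in frame.columns
    })


shares = dominant_share(sample)
candidates = shares[shares >= 0.85].index.tolist()  # threshold is an arbitrary choice
print(candidates)
```

Columns flagged this way are only candidates for removal; whether the rare values matter still has to be judged column by column.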

In [7]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-09 11:54:38,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


The table above gives some insight:

- The `seller` column has only two distinct values, and all rows but one share the same value (*privat*).
- The same happens with the `offer_type` column, where only one row has a value other than *Angebot*.
- The `num_pictures` column looks odd: its mean, standard deviation and quartiles are all 0. Further investigation is needed.

In [8]:
autos.loc[: , 'num_pictures'].value_counts()

0    50000
Name: num_pictures, dtype: int64

The cell above shows that `num_pictures` is 0 for every ad in the dataset, so this column can be dropped too.

In [9]:
autos.drop(labels = ['seller' , 'offer_type' , 'num_pictures'] ,
           axis = 1)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,15749,2016-04-06 10:46:35
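
Note that `DataFrame.drop` returns a new dataframe and leaves the original untouched unless the result is assigned back (or `inplace=True` is passed). Since the cell above does neither, `autos` itself still contains the three columns, which is why they still appear in the later `dtypes` listings. A minimal sketch on a made-up one-row frame:

```python
import pandas as pd

# Made-up one-row frame standing in for `autos`
sample = pd.DataFrame({
    "seller": ["privat"],
    "offer_type": ["Angebot"],
    "num_pictures": [0],
    "brand": ["ford"],
})

# The result is displayed but discarded, so `sample` keeps all four columns
sample.drop(labels=["seller", "offer_type", "num_pictures"], axis=1)
columns_after_discarded_drop = list(sample.columns)

# Assigning the result back makes the change persistent
sample = sample.drop(columns=["seller", "offer_type", "num_pictures"])
print(columns_after_discarded_drop)
print(list(sample.columns))
```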


The uninformative columns are dropped in the output above. Several of the remaining columns hold useful data but are stored with an unsuitable type:

- `date_crawled` - Stored as a string; it should be stored as a datetime.
- `price` - Stored as a string because of the dollar sign and the thousands separator. Both have to be removed, the column converted to integer, and the column renamed to `price_dollar`.
- `odometer` - The kilometers driven, stored as a string. As with `price`, the thousands separator and the `km` suffix have to be removed and the column renamed to `odometer_km`.
- `registration_month` - Stored as an integer (`int64` according to `autos.info()`; it only appears as a float in the `describe` output). It should be stored as datetime.
- `ad_created` - Stored as a string; it should be stored as a datetime too.
- `last_seen` - Stored as a string; it should be stored as a datetime.
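
The datetime conversions listed above are not carried out in this section. A possible sketch with `pd.to_datetime` follows, using the timestamps of the first two rows shown earlier; the explicit `format` string and the derived `days_online` series are illustrative assumptions.

```python
import pandas as pd

# Timestamps copied from the first two rows of the dataset preview
sample = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-04-04 13:38:56"],
    "ad_created": ["2016-03-26 00:00:00", "2016-04-04 00:00:00"],
    "last_seen": ["2016-04-06 06:45:54", "2016-04-06 14:45:08"],
})

# Parse each string column into a datetime64 column
for col in ["date_crawled", "ad_created", "last_seen"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M:%S")

# Once parsed, date arithmetic becomes possible
days_online = (sample["last_seen"] - sample["ad_created"]).dt.days
print(sample.dtypes)
print(days_online.tolist())
```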

Let's start cleaning the `price` and `odometer` columns.

In [10]:
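#Remove the dollar sign and the thousands separator as literal text (not regex), then cast to integer
autos['price'] = autos['price'].str.replace('$' , '' , regex = False).str.replace(',' , '' , regex = False).astype(int)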

In [12]:
print(autos.dtypes)

date_crawled          object
name                  object
seller                object
offer_type            object
price                  int64
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_pictures           int64
postal_code            int64
last_seen             object
dtype: object


As the listing above shows, `price` is now stored as an integer (`int64`). Next, the column is renamed.

In [13]:
autos.rename({'price' : 'price_dollar'} , axis = 1 , inplace = True)

In [14]:
print(autos.dtypes)

date_crawled          object
name                  object
seller                object
offer_type            object
price_dollar           int64
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_pictures           int64
postal_code            int64
last_seen             object
dtype: object


Let's do the same with the column `odometer`.

In [16]:
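#Remove the thousands separator and the km suffix as literal text (not regex), then cast to integer
autos['odometer'] = autos['odometer'].str.replace(',' , '' , regex = False).str.replace('km' , '' , regex = False).astype(int)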

In [19]:
autos.rename({'odometer' : 'odometer_km'} , axis = 1 , inplace = True)

In [20]:
print(autos.dtypes)

date_crawled          object
name                  object
seller                object
offer_type            object
price_dollar           int64
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer_km            int64
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_pictures           int64
postal_code            int64
last_seen             object
dtype: object
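
The `price` and `odometer` conversions follow the same pattern, so they could be factored into one function. The sketch below is one possible way to do that; the helper name `strip_to_int` is hypothetical and the sample values are taken from the preview rows.

```python
import pandas as pd


def strip_to_int(series, tokens):
    """Hypothetical helper: remove each literal token from the strings, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False treats '$' as a plain character, not as a regex anchor
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)


sample = pd.DataFrame({
    "price": ["$5,000", "$300"],
    "odometer": ["150,000km", "70,000km"],
})

sample["price_dollar"] = strip_to_int(sample["price"], ["$", ","])
sample["odometer_km"] = strip_to_int(sample["odometer"], [",", "km"])
print(sample[["price_dollar", "odometer_km"]])
```

Writing to new columns keeps the original strings available for comparison until the result has been checked.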
