# The First Data Analysis Project using pandas

The dataset was originally scraped and uploaded to Kaggle. The version used here has been modified from that original upload:
- 50,000 data points have been sampled from the full dataset so that the code runs quickly in a hosted environment
- The dataset has been dirtied a bit to more closely resemble a real-world example (the version uploaded to Kaggle was cleaned to be easier to work with)

**The aim of this project is to clean the data and analyze the included used car listings, and to become familiar with some of the unique benefits Jupyter Notebook provides for pandas.**

First, let's import the pandas and NumPy libraries and read the dataset:

In [2]:
import pandas as pd
import numpy as np

In [6]:
autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [58]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 21 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price_dollars         50000 non-null int32
ab_test               0 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gear_box              47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null int32
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
number_of_pictures    50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
odometer              

In [33]:
autos.head(10)

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,private,Angebot,"$5,000",,bus,2004,manual,158,andere,"150,000km",3,lpg,peugeot,no,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,private,Angebot,"$8,500",,limousine,1997,automatic,286,7er,"150,000km",6,gas,bmw,no,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,private,Angebot,"$8,990",,limousine,2009,manual,102,golf,"70,000km",7,gas,volkswagen,no,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,private,Angebot,"$4,350",,supermini,2007,automatic,71,fortwo,"70,000km",6,gas,smart,no,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,private,Angebot,"$1,350",,microbus,2003,manual,0,focus,"150,000km",7,gas,ford,no,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,private,Angebot,"$7,900",,bus,2006,automatic,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,private,Angebot,$300,,limousine,1995,manual,90,golf,"150,000km",8,gas,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,private,Angebot,"$1,990",,limousine,1998,manual,90,golf,"150,000km",12,diesel,volkswagen,no,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,private,Angebot,$250,,,2000,manual,0,arosa,"150,000km",10,,seat,no,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,private,Angebot,$590,,bus,1997,manual,90,megane,"150,000km",7,gas,renault,no,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


From the output above, we can make the following observations:
- The dataset contains 20 columns, most of which are stored as strings
- Some columns have null values, but none has more than ~20% null values
- The column names are written in camelCase, not snake_case
- Most of the values are written in German
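
The null-value observation can be checked directly rather than read off `info()`. The sketch below is illustrative only: `sample` is a small made-up frame standing in for `autos`, and the same expression would apply to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `autos`; the real frame is read from autos.csv.
sample = pd.DataFrame({
    "brand": ["bmw", "ford", "seat", "smart", "audi"],
    "vehicle_type": ["limousine", np.nan, "bus", "coupe", "suv"],
    "unrepaired_damage": ["nein", np.nan, "ja", np.nan, "nein"],
})

# isnull() yields booleans; the column-wise mean is the fraction of nulls.
null_pct = sample.isnull().mean() * 100
print(null_pct.round(1))
```

Sorting the result with `sort_values(ascending=False)` puts the columns with the most missing data first.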

Let's convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.

In [11]:
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen']
autos.columns = new_columns
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
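
For the purely mechanical part of the renaming, a small regex helper can do the camelCase to snake_case conversion. This is a sketch, not the approach used above: `camel_to_snake` and the example names are hypothetical, chosen only to illustrate the pattern.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior capital, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Hypothetical camelCase names, for illustration only.
examples = ["dateCrawled", "offerType", "fuelType", "brand"]
converted = [camel_to_snake(n) for n in examples]
print(converted)
```

Note that this helper splits runs of capitals letter by letter and cannot make a name more descriptive, so rewording columns still has to be done by hand, as in the list above.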

The data dictionary provided with the data is as follows:

| index_number | Label               | Description                                                                |
|--------------|---------------------|----------------------------------------------------------------------------|
| 0            | date_crawled        | When this ad was first crawled. All field-values are taken from this date. |
| 1            | name                | Name of the car.                                                           |
| 2            | seller              | Whether the seller is private or a dealer.                                 |
| 3            | offer_type          | The type of listing                                                        |
| 4            | price               | The price on the ad to sell the car.                                       |
| 5            | ab_test             | Whether the listing is included in an A/B test.                            |
| 6            | vehicle_type        | The type of the vehicle                                                    |
| 7            | registration_year   | The year in which the car was first registered.                            |
| 8            | gear_box            | The transmission type.                                                     |
| 9            | power_ps            | The power of the car in PS.                                                |
| 10           | model               | The car model name.                                                        |
| 11           | odometer            | How many kilometers the car has driven.                                    |
| 12           | registration_month  | The month in which the car was first registered.                           |
| 13           | fuel_type           | What type of fuel the car uses.                                            |
| 14           | brand               | The brand of the car.                                                      |
| 15           | unrepaired_damage   | If the car has a damage which is not yet repaired.                         |
| 16           | ad_created          | The date on which the eBay listing was created.                            |
| 17           | number_of_pictures  | The number of pictures in the ad.                                          |
| 18           | postal_code         | The postal code for the location of the vehicle.                           |
| 19           | last_seen           | When the crawler saw this ad last online.                                  |

Secondly, we need to translate the German values into English:
1. Create a map from German words to their English equivalents
2. Apply it to our dataset

In [25]:
print(autos["seller"].unique())
print(autos["ab_test"].unique())
print(autos["vehicle_type"].unique())
print(autos["gear_box"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())

['privat' 'gewerblich']
['control' 'test']
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']
['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']


In [31]:
english_words = {"privat":"private",
                 "gewerblich":"commercial",
                 "bus":"bus",
                 "kombi":"microbus",
                 "limousine":"limousine",
                 "kleinwagen":"supermini",
                 "coupe":"coupe",
                 "suv":"suv",
                 "cabrio":"convertible",
                 "andere":"other",
                 "manuell":"manual",
                 "automatik":"automatic",
                 "lpg":"lpg",
                 "benzin":"gas",
                 "diesel":"diesel",
                 "cng":"cng",
                 "hybrid":"hybrid",
                 "elektro":"electro",
                 "nein":"no",
                 "ja":"yes"}
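# Run this cell only once: map() turns every value without a key in the
# dictionary into NaN, so re-running it would erase most translated values.
autos["seller"] = autos["seller"].map(english_words)
# "ab_test" is skipped: its values ("control", "test") are already English
# and have no key in the dictionary, so map() would replace them with NaN.
autos["vehicle_type"] = autos["vehicle_type"].map(english_words)
autos["gear_box"] = autos["gear_box"].map(english_words)
autos["fuel_type"] = autos["fuel_type"].map(english_words)
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(english_words)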
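
The reason the translation cell should run only once is that `Series.map` with a dictionary returns NaN for any value that is not a key. The sketch below shows this on a made-up `seller` series (an assumption for illustration, not the real column) and contrasts it with `Series.replace`, which leaves unmatched values untouched.

```python
import pandas as pd

translations = {"privat": "private", "gewerblich": "commercial"}
seller = pd.Series(["privat", "gewerblich", "privat"])

once = seller.map(translations)     # German -> English
twice = once.map(translations)      # English values have no key -> NaN
safe = once.replace(translations)   # replace() keeps unmatched values

print(once.tolist())
print(twice.isnull().all())
print(safe.tolist())
```

Using `replace` instead of `map` would make the cell safe to re-run, at the cost of silently keeping any German word that is missing from the dictionary.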

Thirdly, let's look for three types of columns:
- Any columns that have mostly one value and are therefore candidates to be dropped
- Any columns that need more investigation.

| index_number | Label               | Description                                                                |
|--------------|---------------------|----------------------------------------------------------------------------|
| 0            | date_crawled        | When this ad was first crawled. All field-values are taken from this date. |
| 16           | ad_created          | The date on which the eBay listing was created.                            |
| 19           | last_seen           | When the crawler saw this ad last online.                                  |

- Any examples of numeric data stored as text that needs to be cleaned.

| index_number | Label               | Description                                                                |
|--------------|---------------------|----------------------------------------------------------------------------|
| 4            | price               | The price on the ad to sell the car.                                       |
| 11           | odometer            | How many kilometers the car has driven.                                    |
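
One possible way to begin investigating the three date columns is to look at how the listings are spread over calendar days. The sketch below is an assumption about how that could be done, not a step taken in this notebook; it uses a few `date_crawled` timestamps copied from the preview earlier on.

```python
import pandas as pd

date_crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# The first 10 characters of each timestamp string are the calendar date.
day_share = (date_crawled.str[:10]
             .value_counts(normalize=True, dropna=False)
             .sort_index())
print(day_share)
```

Sorting by index keeps the days in chronological order, which makes gaps or spikes in the crawl period easier to spot.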

1. Remove any non-numeric characters
2. Convert the columns to numeric values
3. Rename the columns accordingly

In [None]:
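# The column is still called "odometer" here; it is renamed in the last line.
# regex=False treats "$" as a literal character rather than a regex anchor.
autos["odometer"] = autos["odometer"].str.strip().str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)
autos["price"] = autos["price"].str.strip().str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
autos.rename(columns={"price":"price_dollars", "odometer":"odometer_km"}, inplace=True)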
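
The three cleaning steps can be tried out on a tiny frame before touching the full dataset. In the sketch below, `raw` holds a few values copied from the preview earlier on, and `to_int` is a hypothetical helper introduced only for this example.

```python
import pandas as pd

raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$300"],
    "odometer": ["150,000km", "70,000km", "150,000km"],
})

def to_int(series, *tokens):
    """Strip whitespace, delete each literal token, and cast to int."""
    cleaned = series.str.strip()
    for token in tokens:
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

raw["price"] = to_int(raw["price"], "$", ",")
raw["odometer"] = to_int(raw["odometer"], "km", ",")
raw = raw.rename(columns={"price": "price_dollars", "odometer": "odometer_km"})
print(raw)
```

Passing `regex=False` matters mainly for `"$"`, which would otherwise be open to interpretation as a regular-expression anchor depending on the pandas version.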

We will drop a few columns (ab_test, seller and offer_type) because they do not provide any useful information for this analysis: their values show almost no variation.

In [64]:
autos = autos.drop(["ab_test", "seller", "offer_type"], axis=1)
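
A quick way to spot columns like these is to measure how much of each column is taken up by its most frequent value. The sketch below runs on a made-up `sample` frame, and the 0.85 threshold is an arbitrary assumption for illustration, not a value used in this project.

```python
import pandas as pd

sample = pd.DataFrame({
    "seller": ["private"] * 9 + ["commercial"],
    "offer_type": ["Angebot"] * 10,
    "brand": ["bmw", "ford", "seat", "smart", "audi"] * 2,
})

# Share of rows taken by the most frequent value in each column.
top_share = sample.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)
low_variation = top_share[top_share >= 0.85].index.tolist()
print(top_share)
print(low_variation)
```

Columns that end up in `low_variation` are only candidates; whether to drop them still depends on what the analysis needs.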

In [65]:
autos.head(10)

Unnamed: 0,date_crawled,name,price_dollars,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000,bus,2004,manual,158,andere,150000,3,lpg,peugeot,no,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500,limousine,1997,automatic,286,7er,150000,6,gas,bmw,no,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990,limousine,2009,manual,102,golf,70000,7,gas,volkswagen,no,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350,supermini,2007,automatic,71,fortwo,70000,6,gas,smart,no,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350,microbus,2003,manual,0,focus,150000,7,gas,ford,no,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,7900,bus,2006,automatic,150,voyager,150000,4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,300,limousine,1995,manual,90,golf,150000,8,gas,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,1990,limousine,1998,manual,90,golf,150000,12,diesel,volkswagen,no,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,250,,2000,manual,0,arosa,150000,10,,seat,no,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,590,bus,1997,manual,90,megane,150000,7,gas,renault,no,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35
