# Data cleaning and feature engineering

This notebook walks through cleaning the scraped auction data, extracting meaningful information from it, and engineering new features.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

%matplotlib inline

data_path = "data/sothebys_scraped.csv"
export_path = "data/sothebys_clean.csv"

Let's read in the data and take a look at our dataset.

In [2]:
auctions = pd.read_csv(data_path)

In [3]:
auctions.head(15)

Unnamed: 0,car_info,price,additional_info,auction_type,auction_location
0,2017 Jeep Wrangler Custom,"Sold For $57,120",,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
1,1966 Austin-Healey 3000 Mk III BJ8,"Sold For $58,240",,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
2,1989 Ferrari Testarossa,Sold After Auction,,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
3,2018 Audi SQ5,"Sold For $42,560",,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
4,1960 Austin-Healey 3000 Mk I BN7,"Sold For $40,320",,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
5,2006 Ford GT,Sold After Auction,,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
6,1967 Austin Mini Moke,"Sold For $50,400",,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
7,2009 Mercedes-Benz SL 65 AMG Black Series,"Sold For $161,000",,RM | SOTHEBY'S,ABU DHABI 2019
8,2011 Porsche 911 Speedster,"$300,000 - $350,000",,RM | SOTHEBY'S,ABU DHABI 2019
9,1973 Ferrari 365 GTB/4 Daytona Berlinetta by S...,"Sold For $484,375",,RM | SOTHEBY'S,ABU DHABI 2019


In [7]:
auctions["auction_location"].value_counts(dropna=False)

LONDON 2019                                  85
HERSHEY 2019                                 67
ABU DHABI 2019                               40
ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019     7
NaN                                           1
Name: auction_location, dtype: int64

## Initial data cleaning

## Feature engineering

In [None]:
# dates = nyc_data["date"].str.split("/", n=2, expand=True)
# nyc_data.insert(2, "month", dates[0])
# nyc_data.insert(3, "day", dates[1])

Breaking an existing column into several more detailed ones is something we'll do many times, so it's worth writing a small helper for it. pandas already handles the splitting itself with `Series.str.split`, so we only need a function that inserts the resulting columns into the dataset.

In [54]:
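def assign_split_data(dataset, split_data, col_list):
    """Insert the columns of split_data into dataset in place, named after col_list."""
    for split in range(split_data.shape[1]):
        # Column i of split_data is inserted at position i under the name col_list[i]
        dataset.insert(split, col_list[split], split_data[split])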
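
To see what the helper does before applying it to the auction data, here is a minimal sketch on a toy DataFrame. The names `toy`, `full_name`, `first_name` and `last_name` are made up for illustration and are not part of the dataset. It shows that the new columns are inserted at positions 0, 1, ... and that the DataFrame is modified in place.

```python
import pandas as pd


def assign_split_data(dataset, split_data, col_list):
    # Same helper as above, repeated so this sketch runs on its own
    for split in range(split_data.shape[1]):
        dataset.insert(split, col_list[split], split_data[split])


# Toy data, invented for this example only
toy = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "born": [1815, 1912],
})

# expand=True returns a DataFrame whose columns are labelled 0, 1, ...
parts = toy["full_name"].str.split(" ", n=1, expand=True)
assign_split_data(toy, parts, ["first_name", "last_name"])

print(toy.columns.tolist())
# → ['first_name', 'last_name', 'full_name', 'born']
```

Note that `DataFrame.insert` raises a `ValueError` if a column with the same name already exists, so running the helper twice on the same DataFrame fails unless the new columns are removed first.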

First, let's break down our data into more columns. We'll start with `car_info`.

In [55]:
auctions["car_info"][:5]

0             2017 Jeep Wrangler Custom 
1    1966 Austin-Healey 3000 Mk III BJ8 
2               1989 Ferrari Testarossa 
3                         2018 Audi SQ5 
4      1960 Austin-Healey 3000 Mk I BN7 
Name: car_info, dtype: object

Looking at the first 5 entries, we can deduce that it's reasonable to split the column into 4 new columns:
- the year the car was made
- the manufacturer (the make of the car)
- the model
- the model's variant

We could be more specific, but that's something we can easily refine later during the initial analysis.
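
As a rough illustration of where a plain space split is less specific than we might like, the sketch below splits three sample strings with `n=3`. The first two come from the rows shown above; the third, `"1964 Aston Martin DB5"`, is a hypothetical entry that is not taken from the dataset and only serves to show how a two-word make would be handled.

```python
import pandas as pd

sample = pd.Series([
    "2017 Jeep Wrangler Custom",
    "1989 Ferrari Testarossa",
    "1964 Aston Martin DB5",  # hypothetical entry, not from the dataset
])

# n=3 allows at most 3 splits, so we get at most 4 columns
parts = sample.str.split(" ", n=3, expand=True)
parts.columns = ["year", "manufacturer", "model", "variant"]

print(parts)
# Row 1 has no fourth token, so its variant is missing.
# Row 2 ends up with manufacturer "Aston" and model "Martin".
```

Multi-word makes are the kind of case that would need a more specific rule later, for example a lookup of known manufacturer names.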

In [56]:
car_cols = ["year", "manufacturer", "model", "variant"]
car_description = auctions["car_info"].str.split(" ", n=3, expand=True)

In [57]:
assign_split_data(auctions, car_description, car_cols)

In [66]:
auctions.head(5)

Unnamed: 0,year,manufacturer,model,variant,car_info,price,auction_type,auction_location
0,2017,Jeep,Wrangler,Custom,2017 Jeep Wrangler Custom,"Sold For $57,120",RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
1,1966,Austin-Healey,3000,Mk III BJ8,1966 Austin-Healey 3000 Mk III BJ8,"Sold For $58,240",RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
2,1989,Ferrari,Testarossa,,1989 Ferrari Testarossa,Sold After Auction,RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
3,2018,Audi,SQ5,,2018 Audi SQ5,"Sold For $42,560",RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
4,1960,Austin-Healey,3000,Mk I BN7,1960 Austin-Healey 3000 Mk I BN7,"Sold For $40,320",RM | ONLINE ONLY,ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019


Let's drop the original columns.
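
A minimal sketch of that step on a toy DataFrame, assuming that `car_info` is the column to drop now that it has been split. Which columns count as redundant in the real dataset is an assumption here.

```python
import pandas as pd

# Toy frame mirroring a few of the columns above
toy = pd.DataFrame({
    "year": ["2017"],
    "manufacturer": ["Jeep"],
    "car_info": ["2017 Jeep Wrangler Custom"],
})

# Assumption: car_info is the only column made redundant by the split
toy = toy.drop(columns=["car_info"])

print(toy.columns.tolist())
# → ['year', 'manufacturer']
```

`drop` returns a new DataFrame by default, so the result has to be assigned back for the change to stick.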

The last step is to export the data for others to use.

In [None]:
# to_csv returns None when given a path, so there is nothing to assign
# auctions.to_csv(export_path, index=False, header=True)
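
The choice of whether to write the index matters for whoever reads the file next. The sketch below uses a toy DataFrame and in-memory buffers instead of real files (both are assumptions made for illustration) to compare the two options: writing the index typically yields an `Unnamed: 0` column when the CSV is read back, while `index=False` avoids it.

```python
import io

import pandas as pd

toy = pd.DataFrame({"year": [2017, 1989], "manufacturer": ["Jeep", "Ferrari"]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # → ['Unnamed: 0', 'year', 'manufacturer']
print(cols_without_index)  # → ['year', 'manufacturer']
```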

## Conclusions