In [2]:
%matplotlib inline

In [3]:
# import dependencies
import pandas as pd
import numpy as np
import os
import csv
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the dataset

In [13]:
data = pd.read_csv('Automobile_price_data.csv')
data

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


### Data preparation

Data preparation is a key step in the machine learning pipeline.

**Goal**: Ensure that machine learning algorithms work as well as they can.
Good data preparation is vital to model performance, and it can allow even simple algorithms to work well.

**Data Preparation Steps**:
- Explore the data to understand its problems.
- Remove duplicates.
- Treat missing values.
- Treat errors and outliers.
- Scale features.
- Split the dataset.
- Visualize to check the results.
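
Three of the steps above (removing duplicates, scaling, and splitting) are not shown later in this notebook, so here is a minimal sketch of what they might look like. The `toy` frame and its values are made up for illustration, and the z-score scaling and 80/20 shuffle split are assumptions, not the method used on the automobile data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the automobile dataset)
toy = pd.DataFrame({
    "engine_size": [130, 130, 152, 109, 136, 141],
    "price": [13495, 13495, 16500, 13950, 17450, 16845],
})

# Remove exact duplicate rows (the second row repeats the first)
deduped = toy.drop_duplicates().reset_index(drop=True)

# Scale the feature to zero mean and unit variance (z-score)
features = deduped[["engine_size"]]
scaled = (features - features.mean()) / features.std(ddof=0)

# Split into train and test sets with a reproducible shuffle
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(len(deduped))
n_test = len(deduped) // 5  # hold out roughly 20% of the rows
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
train, test = deduped.iloc[train_idx], deduped.iloc[test_idx]

print(deduped.shape, train.shape, test.shape)
```

For simplicity the sketch scales the whole frame at once. In practice the scaling statistics are usually computed on the training split only, so that no information from the test set leaks into the model.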

In [14]:
data.describe()

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0


### Recode the column names

In [17]:
# Replace hyphens with underscores; 'col' avoids shadowing the built-in str
data.columns = [col.replace('-', '_') for col in data.columns]

## Exploring The Dataset

In [22]:
data.isnull().any()

symboling            False
normalized_losses    False
make                 False
fuel_type            False
aspiration           False
num_of_doors         False
body_style           False
drive_wheels         False
engine_location      False
wheel_base           False
length               False
width                False
height               False
curb_weight          False
engine_type          False
num_of_cylinders     False
engine_size          False
fuel_system          False
bore                 False
stroke               False
compression_ratio    False
horsepower           False
peak_rpm             False
city_mpg             False
highway_mpg          False
price                False
dtype: bool

In [24]:
# Missing values are coded with '?'
# Use the built-in object type; the np.object alias was removed in NumPy 1.24
(data.astype(object) == '?').any()

symboling            False
normalized_losses     True
make                 False
fuel_type            False
aspiration           False
num_of_doors          True
body_style           False
drive_wheels         False
engine_location      False
wheel_base           False
length               False
width                False
height               False
curb_weight          False
engine_type          False
num_of_cylinders     False
engine_size          False
fuel_system          False
bore                  True
stroke                True
compression_ratio    False
horsepower            True
peak_rpm              True
city_mpg             False
highway_mpg          False
price                 True
dtype: bool

In [25]:
data.dtypes

symboling              int64
normalized_losses     object
make                  object
fuel_type             object
aspiration            object
num_of_doors          object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_of_cylinders      object
engine_size            int64
fuel_system           object
bore                  object
stroke                object
compression_ratio    float64
horsepower            object
peak_rpm              object
city_mpg               int64
highway_mpg            int64
price                 object
dtype: object

In [29]:
# Count the '?' placeholders in each object (string) column
for col in data.columns:
    if data[col].dtype == object:
        print(col + ': ' + str((data[col] == '?').sum()))

normalized_losses: 41
make: 0
fuel_type: 0
aspiration: 0
num_of_doors: 2
body_style: 0
drive_wheels: 0
engine_location: 0
engine_type: 0
num_of_cylinders: 0
fuel_system: 0
bore: 4
stroke: 4
horsepower: 2
peak_rpm: 2
price: 4


In [30]:
# Drop the column with the most missing values
del data["normalized_losses"]

In [39]:
# Removing rows with missing values
cols = ['num_of_doors', 'bore', 'stroke', 'horsepower', 'peak_rpm', 'price']
for column in cols:
    data.loc[data[column]=='?',column] = np.nan
data.dropna(axis = 0, inplace = True)
data.shape

(193, 25)

In [40]:
# Convert some columns to numeric values
cols = ['bore', 'stroke', 'horsepower', 'peak_rpm', 'price']
for column in cols:
    data[column] = pd.to_numeric(data[column])
data[cols].dtypes

bore          float64
stroke        float64
horsepower      int64
peak_rpm        int64
price           int64
dtype: object
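
As an alternative to replacing `'?'` by hand and converting afterwards, `pd.read_csv` accepts an `na_values` argument that marks placeholder strings as missing while the file is parsed. The sketch below assumes a tiny in-memory CSV (`csv_text`, with made-up rows) in place of `Automobile_price_data.csv`, so it runs on its own.

```python
import io
import pandas as pd

# Hypothetical three-row sample shaped like the real file
csv_text = io.StringIO(
    "make,bore,horsepower,price\n"
    "audi,3.19,102,13950\n"
    "mazda,?,101,10945\n"
    "porsche,3.74,?,?\n"
)

# na_values tells the parser to read '?' as NaN in every column
sample = pd.read_csv(csv_text, na_values="?")

# Missing values are now visible to isnull(), and numeric columns stay numeric
missing_per_column = sample.isnull().sum()
complete_rows = sample.dropna(axis=0)

print(missing_per_column)
print(sample.dtypes)
```

With this approach, `isnull()` reports the missing entries directly and no `pd.to_numeric` pass is needed. One difference from the cells above: columns that contained a `'?'` are parsed as `float64` and stay `float64` after `dropna`, rather than becoming `int64`.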

# Overview of Feature Engineering

The general idea is that good features are the key to good machine learning performance. If our features are highly predictive of the label, the model will perform well. If they are not, the features may be mostly noise (garbage in, garbage out) and we will not get anywhere.

**Goal**: Develop highly predictive features.

**Feature Engineering Steps**:
- Exploring to understand data relationships
- Transform features
- Compute interaction terms
- Visualization to check result
- Test with machine learning model

## Transform Features
**Why transform features?**
- Improve distribution properties
- Make the feature covary more strongly with the label

**Common transformations**:
- Log, exponential, square, square root, variance, etc.
- Difference, cumulative sum.
- Nonlinearly transformed features are not collinear with the originals.

**Note**: Keep in mind that a nonlinearly transformed feature is not collinear with the original. It is not an exact linear function of the original, so the two do not necessarily have a high correlation. For example, a numeric feature and its square need not be strongly correlated, although how correlated they are depends on the range of the values.
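
A small numeric check can make both points concrete. The sketch below uses made-up values, not the automobile data. It first shows a log transform reducing the skewness of a right-skewed feature, then compares the correlation between a feature and its square in two cases. How far that correlation drops depends on the range of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: a few values far above the rest
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, 100.0])
log_values = np.log(values)

# Sample skewness before and after the log transform
skew_before = values.skew()
skew_after = log_values.skew()

# A feature symmetric around zero is essentially uncorrelated with its square
x = pd.Series(np.linspace(-3.0, 3.0, 61))
corr_symmetric = x.corr(x ** 2)

# On strictly positive values the square stays strongly correlated
x_pos = pd.Series(np.linspace(1.0, 7.0, 61))
corr_positive = x_pos.corr(x_pos ** 2)

print(skew_before, skew_after)
print(corr_symmetric, corr_positive)
```

In both cases the squared feature is not a linear function of the original, so it carries information a linear model could not get from the original alone.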

## Interaction Terms

Let's start with an example. Say we want to predict the number of people riding a bus route, which might be important for a transit company. Ridership depends on more than one feature in the data: it probably depends on the time of day and on whether the day is a holiday. On a work day, many people need to get downtown at certain times, so buses in that direction carry a higher load. On a Sunday or a national holiday, plenty of people may still go downtown to shop or to visit restaurants or movies, but they travel at different times of day, and the total volume may differ too. In other words, the effect of time of day depends on whether it is a holiday. This is an interaction between time of day and holiday, and the feature that captures it is called an **interaction term**.

**Compute interaction terms**:
- Mean, Median, etc.
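
The bus example can be sketched in pandas as follows. The `rides` frame, its column names, and its values are all hypothetical. The group mean and the product feature are two common ways to express an interaction, shown here as assumptions rather than as the method this notebook goes on to use.

```python
import pandas as pd

# Hypothetical ridership records for a single bus route
rides = pd.DataFrame({
    "hour": [8, 8, 8, 8, 14, 14, 14, 14],
    "is_holiday": [0, 0, 1, 1, 0, 0, 1, 1],
    "riders": [52, 48, 9, 11, 20, 22, 33, 31],
})

# Mean ridership for every (hour, is_holiday) combination
mean_by_combo = rides.groupby(["hour", "is_holiday"])["riders"].mean()

# Attach that group mean to each row as an interaction feature.
# It is computed from the label here purely for illustration; in a real
# model it should be computed from the training data only.
rides["hour_holiday_mean"] = (
    rides.groupby(["hour", "is_holiday"])["riders"].transform("mean")
)

# A simpler numeric interaction: the product of the two features
rides["hour_x_holiday"] = rides["hour"] * rides["is_holiday"]

print(mean_by_combo)
```

In this toy data the morning peak disappears on holidays while the afternoon load rises, so neither `hour` nor `is_holiday` alone explains ridership. That pattern is what an interaction term is meant to capture.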