# Feature Engineering I, 
### a.k.a.
* ### Feature Engineering Intro
* ### Feature Engineering with _pandas_

Two sides to feature engineering:
* Which features help me build a better model?
* How should I preprocess them to include them in the model?

Why do feature engineering:
* Encode our subject matter expertise into the data
* More relevant information gives better performance
* Feed non-numeric data into a model

In [24]:
import pandas as pd
import numpy as np

In [25]:
# Our toy dataset to work with
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1,0,1,0,1,0,1,0]
})

In [26]:
df

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,,0
6,orange,3.0,1
7,melon,5.0,0


### 1. _Imputation_: filling in missing values

Q: What can we do with missing values?

A: 
* Drop:
    * drop whole columns if they have a lot of missing data (and ideally they are not particularly relevant)
    * drop rows with missing data
* Fill in:
    * average/median/mode 
    * average/median/mode of a group
* Use data to predict the missing values:
    * sample replacement values from the distribution of the observed (non-missing) data
    * `KNNImputer`
    * `IterativeImputer`
* For categorical variables, treat missing values as their own category
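
As a rough sketch of the "same probability distribution" option above, one simple approach is to draw each replacement at random from the observed values. The variable names and the fixed seed are assumptions made for illustration; `KNNImputer` and `IterativeImputer` from scikit-learn are the model-based alternatives and are not shown here.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0], name='price')

# Fixed seed (an arbitrary choice) so the sketch is reproducible
rng = np.random.default_rng(seed=42)

observed = prices.dropna()
missing_mask = prices.isna()

# Draw one observed value, with replacement, for every missing entry
sampled = rng.choice(observed.to_numpy(), size=int(missing_mask.sum()))

prices_filled = prices.copy()
prices_filled[missing_mask] = sampled
print(prices_filled)
```

Unlike filling with the mean, this keeps the spread of the column roughly intact, at the cost of adding random noise.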

Useful pandas methods:
* `df.isnull()`, `df.isna()`: check for NaNs; summarize the result with `.sum()` or plot it as a heatmap
* `df.dropna()`: drops rows with NaNs; be careful when dropping all NaNs, and make sure that is really what you want to do
* `df.fillna()`: fills missing values

The examples below return a new DataFrame and leave `df` unchanged. To keep the filled values, assign the result back (e.g. `df = df.fillna(0)`) or pass `inplace=True`.
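
A minimal sketch of checking for missing values and keeping a filled copy, using the same toy dataset. The names `missing_per_column`, `df_filled` and `df_dropped` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# fillna() and dropna() return new DataFrames; df itself is not modified
df_filled = df.fillna(0)
df_dropped = df.dropna()

print(df['price'].isna().sum())         # 2, the original still has its NaNs
print(df_filled['price'].isna().sum())  # 0
print(len(df_dropped))                  # 6, two rows were dropped
```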

In [27]:
df.fillna(0)

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,0.0,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,0.0,0
6,orange,3.0,1
7,melon,5.0,0


In [29]:
df.fillna(df.mean(numeric_only=True))


Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,2.5,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,2.5,0
6,orange,3.0,1
7,melon,5.0,0


In [30]:
df.bfill()

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,2.0,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,3.0,0
6,orange,3.0,1
7,melon,5.0,0


In [31]:
df.interpolate()

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,1.75,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,2.75,0
6,orange,3.0,1
7,melon,5.0,0


In [32]:
df

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,,0
6,orange,3.0,1
7,melon,5.0,0


In [33]:
# gives you as many rows as you have groups
df.groupby('fruit')['price'].mean()

fruit
apple     2.25
banana    1.25
melon     5.00
orange    3.00
Name: price, dtype: float64

In [34]:
# gives you as many rows as you have in your dataset
df.groupby('fruit')['price'].transform('mean')

0    1.25
1    1.25
2    1.25
3    2.25
4    2.25
5    2.25
6    3.00
7    5.00
Name: price, dtype: float64

In [35]:
df['price_filled'] = df['price'].fillna(df.groupby('fruit')['price'].transform('mean'))

In [38]:
df

Unnamed: 0,fruit,price,bio,price_filled
0,banana,1.0,1,1.0
1,banana,1.5,0,1.5
2,banana,,1,1.25
3,apple,2.0,0,2.0
4,apple,2.5,1,2.5
5,apple,,0,2.25
6,orange,3.0,1,3.0
7,melon,5.0,0,5.0
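
One caveat with the group-mean fill: if every price in a group is missing, the group mean is itself NaN and the value stays unfilled. A possible workaround, sketched here with a hypothetical helper `fill_with_group_mean` and a small made-up dataset, is to fall back to the overall mean.

```python
import pandas as pd

def fill_with_group_mean(data, group_col, value_col):
    """Hypothetical helper: fill NaNs with the group mean, then with the overall mean."""
    group_means = data.groupby(group_col)[value_col].transform('mean')
    overall_mean = data[value_col].mean()
    return data[value_col].fillna(group_means).fillna(overall_mean)

# Made-up example: 'kiwi' has no observed price at all
toy = pd.DataFrame({
    'fruit': ['banana', 'banana', 'apple', 'apple', 'kiwi'],
    'price': [1.00, None, 2.00, None, None],
})

toy['price_filled'] = fill_with_group_mean(toy, 'fruit', 'price')
print(toy)
```

Here the banana and apple gaps get their group means (1.0 and 2.0), and the kiwi gets the overall mean of the observed prices (1.5).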


### 2. _One-Hot Encoding_: converting categories into numbers

`pd.factorize()`

* encodes each distinct value as an integer code
* results in a single column
* closest scikit-learn equivalents: `LabelEncoder` (for target labels) and `OrdinalEncoder` (for features)

In [39]:
pd.factorize(df['fruit'])

(array([0, 0, 0, 1, 1, 1, 2, 3]),
 Index(['banana', 'apple', 'orange', 'melon'], dtype='object'))

In [40]:
pd.factorize(df['fruit'])[0]

array([0, 0, 0, 1, 1, 1, 2, 3])

In [41]:
pd.factorize(df['fruit'])[1]

Index(['banana', 'apple', 'orange', 'melon'], dtype='object')

In [42]:
df['fruit_factorized'] = pd.factorize(df['fruit'])[0]

In [43]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,1.25,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,2.25,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3
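
A short sketch of two details that are easy to miss with `pd.factorize()`: codes are assigned in order of first appearance unless `sort=True` is passed, and the second return value lets you map codes back to labels. Keep in mind that integer codes suggest an ordering (melon > orange > apple) that many models will take literally.

```python
import pandas as pd

fruit = pd.Series(['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'])

# Default: codes follow the order of first appearance (banana=0, apple=1, ...)
codes, uniques = pd.factorize(fruit)

# sort=True: codes follow the sorted labels instead (apple=0, banana=1, ...)
sorted_codes, sorted_uniques = pd.factorize(fruit, sort=True)
print(sorted_codes)    # [1 1 1 0 0 0 3 2]
print(sorted_uniques)

# Map the integer codes back to the original labels
recovered = uniques.take(codes)
print(list(recovered))
```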


`pd.get_dummies()`

* turns data into dummy/indicator variables
* returns as many columns as you have categories (one fewer with `drop_first=True`)
* scikit-learn equivalent `OneHotEncoder`

In [44]:
pd.get_dummies(df['fruit'])

Unnamed: 0,apple,banana,melon,orange
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,1,0,0,0
4,1,0,0,0
5,1,0,0,0
6,0,0,0,1
7,0,0,1,0


In [45]:
pd.get_dummies(df['fruit'], drop_first=True)

Unnamed: 0,banana,melon,orange
0,1,0,0
1,1,0,0
2,1,0,0
3,0,0,0
4,0,0,0
5,0,0,0
6,0,0,1
7,0,1,0


In [46]:
df.join(pd.get_dummies(df['fruit'], drop_first=True))

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized,banana,melon,orange
0,banana,1.0,1,1.0,0,1,0,0
1,banana,1.5,0,1.5,0,1,0,0
2,banana,,1,1.25,0,1,0,0
3,apple,2.0,0,2.0,1,0,0,0
4,apple,2.5,1,2.5,1,0,0,0
5,apple,,0,2.25,1,0,0,0
6,orange,3.0,1,3.0,2,0,0,1
7,melon,5.0,0,5.0,3,0,1,0
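
Instead of encoding the column separately and joining it back, `pd.get_dummies()` can also take the whole DataFrame. The sketch below assumes you want the original `fruit` column replaced, prefixed column names, and integer 0/1 values (recent pandas versions return booleans by default).

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# columns=[...] selects what to encode; the original column is dropped from the result
encoded = pd.get_dummies(df, columns=['fruit'], prefix='fruit', drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```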


### 3. _Scaling_: putting our variables on a common scale

_Normalization_:

* output range is [0, 1]
* doesn't deal well with outliers
* scikit-learn equivalent is `MinMaxScaler()`

In [None]:
def normalize(X):
    return (X-X.min())/(X.max()-X.min())
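
To see the outlier problem in practice, here is the `normalize` function applied to the filled prices from above (the values are copied in by hand so the snippet runs on its own).

```python
import pandas as pd

def normalize(X):
    return (X - X.min()) / (X.max() - X.min())

price_filled = pd.Series([1.0, 1.5, 1.25, 2.0, 2.5, 2.25, 3.0, 5.0])

scaled = normalize(price_filled)
print(scaled.tolist())
# [0.0, 0.125, 0.0625, 0.25, 0.375, 0.3125, 0.5, 1.0]
```

The single melon at 5.0 defines the top of the range, so all the other fruits are squeezed into [0, 0.5].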

_Standardization_:

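* output range is not fixed (values are centered around 0 with a standard deviation of 1, and for roughly bell-shaped data rarely go much beyond ±3)
* less sensitive to outliers than normalization, although outliers still shift the mean and standard deviation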
* scikit-learn equivalent is `StandardScaler()`

In [47]:
def standardize(X):
    return (X-X.mean())/X.std()
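
One detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler()` divides by the population standard deviation (`ddof=0`), so the two give slightly different numbers on small datasets. The sketch below uses a hypothetical variant of `standardize` with a `ddof` argument to show the difference.

```python
import pandas as pd

def standardize(X, ddof=1):
    """Hypothetical variant of standardize() with a configurable ddof."""
    return (X - X.mean()) / X.std(ddof=ddof)

price_filled = pd.Series([1.0, 1.5, 1.25, 2.0, 2.5, 2.25, 3.0, 5.0])

z_sample = standardize(price_filled)              # pandas default, ddof=1
z_population = standardize(price_filled, ddof=0)  # same divisor as StandardScaler

print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
```

With only 8 rows the gap is visible; on large datasets it becomes negligible.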

### 4. _Binning_: turning scalars into categories

In [48]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,1.25,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,2.25,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


`pd.cut()`

* if we specify the number of bins, it returns equal-width intervals
* by specifying bin edges, we can have arbitrary bin sizes

In [50]:
pd.cut(df['price_filled'], bins=3, labels=['cheap', 'medium', 'expensive'])

0        cheap
1        cheap
2        cheap
3        cheap
4       medium
5        cheap
6       medium
7    expensive
Name: price_filled, dtype: category
Categories (3, object): ['cheap' < 'medium' < 'expensive']

In [54]:
pd.cut(df['price_filled'], bins=[0, 1.5, 4, 5.5], labels=['new_cheap', 'new_medium', 'new_expensive'])

0        new_cheap
1        new_cheap
2        new_cheap
3       new_medium
4       new_medium
5       new_medium
6       new_medium
7    new_expensive
Name: price_filled, dtype: category
Categories (3, object): ['new_cheap' < 'new_medium' < 'new_expensive']
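
If you let `pd.cut()` choose the edges, `retbins=True` returns them as well, so the same bins can be reused on other data. A minimal sketch, with `new_prices` as made-up values for illustration:

```python
import pandas as pd

price_filled = pd.Series([1.0, 1.5, 1.25, 2.0, 2.5, 2.25, 3.0, 5.0], name='price_filled')
labels = ['cheap', 'medium', 'expensive']

# retbins=True additionally returns the computed bin edges
binned, edges = pd.cut(price_filled, bins=3, labels=labels, retbins=True)
print(edges)

# Reuse the exact same edges on new data
new_prices = pd.Series([1.2, 2.4, 4.9])
new_binned = pd.cut(new_prices, bins=edges, labels=labels)
print(new_binned.tolist())  # ['cheap', 'medium', 'expensive']
```

Values that fall outside the stored edges come back as NaN, so it can be worth widening the outermost edges when you expect new data to exceed the original range.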

`pd.qcut()`

* (approximately) the same number of data points per bin, with bin edges taken from the quantiles of the data

In [52]:
pd.qcut(df['price_filled'], q=4, labels=['cheap', 'medium', 'expensive', 'very expensive'])

0             cheap
1            medium
2             cheap
3            medium
4         expensive
5         expensive
6    very expensive
7    very expensive
Name: price_filled, dtype: category
Categories (4, object): ['cheap' < 'medium' < 'expensive' < 'very expensive']
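
To make the difference between `pd.qcut()` and `pd.cut()` concrete, this sketch counts how many rows land in each bin when both are asked for four bins.

```python
import pandas as pd

price_filled = pd.Series([1.0, 1.5, 1.25, 2.0, 2.5, 2.25, 3.0, 5.0], name='price_filled')

# Quantile-based bins: every bin gets the same number of rows
quartiles = pd.qcut(price_filled, q=4, labels=['cheap', 'medium', 'expensive', 'very expensive'])
quartile_counts = quartiles.value_counts().sort_index()
print(quartile_counts)

# Equal-width bins: the counts depend on how the data is distributed
equal_width = pd.cut(price_filled, bins=4)
width_counts = equal_width.value_counts().sort_index()
print(width_counts)
```

The quartile bins each hold 2 rows, while the equal-width bins are lopsided and one of them is empty.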

### Feature engineering best practices:

#### 1. We should try to split our data set into training and testing sub-samples as early as we can.
   - this is _somewhat_ flexible: e.g. you can drop rows with NaNs from the entire dataset before splitting.
   - still, in the interest of good machine learning habits and avoiding mistakes (e.g. you _shouldn't_ fill NaNs with statistics computed on the full dataset before splitting, because that leaks information from the test data into training), it's smart to _always_ split first!
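
A minimal sketch of "split first, then impute", using only pandas. The 75/25 split and the `random_state` are arbitrary choices for illustration; in practice scikit-learn's `train_test_split` is the usual tool for the split.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Split first: 75% of the rows for training, the rest for testing
train = df.sample(frac=0.75, random_state=0)
test = df.drop(train.index)

# Learn the fill value from the training rows only...
train_mean = train['price'].mean()

# ...and apply that same value to both subsets
train_filled = train.assign(price=train['price'].fillna(train_mean))
test_filled = test.assign(price=test['price'].fillna(train_mean))
```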

#### 2. We need to feature engineer our testing data in the same way that we feature-engineered our training data.
   - otherwise the performance of our model will suffer, if it runs at all.
   - writing a function is a nice way to do this.
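
One way such a function could look is sketched below. `engineer_features`, its parameters, and the hand-picked split are all hypothetical; the point is that everything learned from the training data (the fill value, the list of categories) is passed in explicitly and reused for the test data.

```python
import pandas as pd

def engineer_features(data, price_fill, fruit_categories):
    """Hypothetical helper: apply identical preprocessing to any subset of the data."""
    out = pd.DataFrame(index=data.index)
    out['price'] = data['price'].fillna(price_fill)
    out['bio'] = data['bio']
    # A fixed category list yields the same dummy columns even if a fruit is absent
    fruit = pd.Series(
        pd.Categorical(data['fruit'], categories=fruit_categories),
        index=data.index,
    )
    return out.join(pd.get_dummies(fruit, prefix='fruit', dtype=int))

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Hand-picked split for illustration: the test rows contain only bananas and apples
train = df.iloc[[0, 1, 3, 4, 6, 7]]
test = df.iloc[[2, 5]]

# Everything "learned" comes from the training data
price_fill = train['price'].mean()
fruit_categories = sorted(train['fruit'].unique())

X_train = engineer_features(train, price_fill, fruit_categories)
X_test = engineer_features(test, price_fill, fruit_categories)
print(X_test)
```

Calling `pd.get_dummies()` on the test rows alone would produce only two fruit columns; with the fixed category list, `X_train` and `X_test` have identical columns.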

#### 3. Feature Engineering includes any pre-processing techniques, such as:
   - imputation, dropping missing values
   - converting strings / non-numeric values into numeric values
   - scaling
   - binning
   - combining features
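
Combining features is the only item in this list without an example above, so here is a rough sketch. The two new columns are made up for illustration; which combinations are actually useful depends on the problem.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, 2.00, 2.50, 3.0, 5.0],
    'bio': [1, 0, 0, 1, 1, 0],
})

# Ratio feature: how expensive is this item relative to its fruit's average price?
fruit_mean = df.groupby('fruit')['price'].transform('mean')
df['price_vs_fruit_mean'] = df['price'] / fruit_mean

# Interaction feature: the price for bio items, 0 for everything else
df['bio_price'] = df['price'] * df['bio']

print(df)
```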