# Feature Engineering Overview
In our previous tutorials, we only touched on features and how to handle them. In this overview we'll take a practical approach to learning about feature engineering. The things we will focus on are:
- Develop a baseline model for comparing the performance of models with more/different features.
- Encode categorical features so the model can make better use of the information.
- Generate new features to provide more information for the model.
- Select specific features to reduce overfitting and increase prediction speed.

In the main exercise, we'll be using the 'TalkingDataAdTracking' Kaggle competition dataset. The goal with this dataset is to predict whether a user will download an app after clicking through an ad. For learning purposes we'll drop 99% of negative records (negative meaning the app wasn't downloaded) to make the target more balanced.

## 1. Baseline Model
In this overview we'll be using Kickstarter data.

### Kickstarter Warmup (review)

In [ ]:
import pandas as pd
from termcolor import colored # colored prints
from my_modules import data_imports as data

kst_data = data.import_kickstarter_2018_data()
kst_data.head(10)


Looking at this data, let's try to predict whether or not a Kickstarter project will succeed. To teach our model, we can use the *state* column as our outcome. To predict this outcome, we can use features such as category, currency, funding goal, country, and when the project was launched.

### Preparing target column
First, let's look at project states and convert them into something we can use as targets in a model. Remember that models don't like to work with strings, and our outcome data is categorical.

In [ ]:
# pd.unique(kst_data.state)
# kst_data.groupby('state')['ID'].count()
kst_data.groupby('state')['ID'].nunique()

So we see that our dataset has 6 unique states, with mostly failed and successful outcomes.

Since our priority in this quick review is not data cleaning, we'll just go along with this simple cleansing:
- Drop projects that are "live"
- Count successful projects as ```outcome = 1```
- Combine all other states as ```outcome = 0```

In [ ]:
# Drop live projects
kst_data = kst_data.query('state != "live"')

# Add the 'outcome' column with "successful == 1", everything else 0
kst_data = kst_data.assign(outcome=(kst_data['state'] == 'successful').astype(int))
kst_data.head()

### Converting Timestamps
Now that we have our outcome all set up and ready, it's time to handle dates. Let's convert the *launched* feature into something more categorical that our model can understand. We imported both *deadline* and *launched* as pandas Timestamp values, so we can use the ```.dt``` accessor on the timestamp column to get the individual parts of each time.
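
To make the column-versus-single-value distinction concrete, here is a minimal sketch on toy data. The `toy` frame and its two launch times are made up for illustration and are not taken from the Kickstarter dataset.

```python
import pandas as pd

# Toy data: made-up launch times, only for illustration
toy = pd.DataFrame(
    {"launched": pd.to_datetime(["2015-08-11 12:12:28", "2017-09-02 04:43:57"])}
)

# On a column (Series), the datetime parts are reached through the .dt accessor
toy = toy.assign(
    hour=toy["launched"].dt.hour,
    month=toy["launched"].dt.month,
)

# On a single Timestamp, the same parts are plain attributes (no .dt)
first_launch = toy["launched"].iloc[0]
first_hour = first_launch.hour

print(toy)
print(first_hour)
```
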

In [ ]:
# Note that this below syntax doesn't work on a single Timestamp object. dt must be used on a column
#    kst_data['launched'][0].dt

kst_data = kst_data.assign(
    hour=kst_data.launched.dt.hour,
    day=kst_data.launched.dt.day,
    month=kst_data.launched.dt.month,
    year=kst_data.launched.dt.year
)
kst_data.head()

### Prepping categorical variables
Now that we have both our outcome AND timestamp data set up, it's time to get our other categorical variables in check! For our model, we'll be using *category*, *currency*, and *country*, which all need to be converted into integer representations. We'll use scikit-learn's ```LabelEncoder``` for this.

In [ ]:
# print(kst_data.groupby('category')['ID'].nunique())
# print(kst_data.groupby('currency')['ID'].nunique())
# print(kst_data.groupby('country')['ID'].nunique())

In [ ]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded_kst = kst_data[cat_features].apply(encoder.fit_transform)
encoded_kst.head(10)
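
One thing worth knowing about the cell above: `apply` refits the same encoder on every column, so afterwards the encoder only remembers the classes of the last column. If you want to map integers back to labels later, one option is to keep one encoder per column. Below is a small sketch on toy data; the `toy` frame and the `encoders` dictionary are illustrative names, not part of the tutorial's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the real categorical columns
toy = pd.DataFrame({
    "currency": ["USD", "GBP", "USD", "EUR"],
    "country": ["CA", "GB", "US", "US"],
})

# One fitted encoder per column, so each mapping can be inspected or inverted later
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}

toy_encoded = toy.copy()
for col, enc in encoders.items():
    toy_encoded[col] = enc.transform(toy[col])

print(toy_encoded)
# LabelEncoder assigns integers in sorted order of the labels
print(list(encoders["currency"].classes_))
```

Because the labels are sorted before numbering, the integers carry no meaning beyond identity, which is usually fine for tree-based models.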

Great! Now let's gather all of the columns we're using for this model into a new, clean little dataframe. Because our original dataframe and our encoded dataframe have the same index, we can ```join``` them together easily.

In [ ]:
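# kst_data has our hand-built hour, day, month, year, and outcome columns, while encoded_kst has the label-encoded data.
# They both have the same index, so join join join!
# Note: the name `data` below rebinds the data_imports module we imported as `data` earlier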
base_data = data = kst_data[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded_kst)
data.head()

### Creating training, validation, and test splits
Ain't our data pretty? Now that it's ready to go, it's time to split it into training, validation, and test sets! Since this is just a quick review, let's take a simple approach and just use slices of our data. We'll use 10% of the data for validation, 10% for testing, and 80% for training.

**Note:** For python beginners (like me), there are extra steps/comments below to explain the indexing in the end

In [ ]:
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

print("len of data: {}".format(len(data)))
print("valid size: {}".format(valid_size))
# The below indexing is a little confusing, so let's analyze it
# We need 80% of the set for training
print("80% of the dataset: {}".format(round(len(data) * 0.8)))
# This comes out to 300690 (300689.6 rounded), which is a difference of...
print( "Full data size - 80% data size: {}".format(round(len(data) - (len(data) * 0.8))))
# 75172! ... hmmmmmm now why are we using 2 * valid_size below?
print("valid_size doubled: {}".format(valid_size * 2))
# They're the same!!! valid_size * 2 === the difference from above!
# Oh ya.... valid_fraction = 0.1, so 100% - (10% * 2) = 80% .... I see

# Remember that Python slices are written [start:end], and negative indexes count from the end, so [:1] is everything before index 1 (just the first element), and [:-1] is everything except the last element

# start : end - (valid_size * 2), [0 : 375862 - 75172]
train = data[:-2 * valid_size]
# end - (valid_size * 2) : end - valid_size, [375862 - 75172 : 375862 - 37586]
valid = data[-2 * valid_size:-valid_size]
# end - valid_size : end, [375862 - 37586 : 375862]
test = data[-valid_size:]

print("Length of train/valid/test: {}/{}/{}".format(len(train), len(valid), len(test)))

In general, we want to be careful that each data set has the same proportion of the target classes (keep the split data balanced). Let's print out the fraction of successful outcomes in each split to confirm:

In [ ]:
# In the above block, we used str.format() for string formatting.
# Below, we use f-strings, which were added in Python 3.6!
# The statement below is equivalent to:
#   print("Outcome fraction = {:.4f}".format(each.outcome.mean()))

for each in [train, valid, test]:
    print(f"Outcome fraction = {each.outcome.mean():.4f}")

As we can see, each split has around 35% true outcomes, likely because the data was well randomized beforehand. If this weren't the case, we could have used a helpful scikit-learn class: ```sklearn.model_selection.StratifiedShuffleSplit```.
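
For reference, here is a hedged sketch of what a stratified 80/10/10 split could look like. It runs on a made-up toy frame that is deliberately sorted by outcome, so plain slicing would give unbalanced splits. The `toy_*` names and the two-stage approach (hold out 20%, then halve it) are assumptions for illustration, not something this overview uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame: 1000 rows, 35% positives, sorted so that slicing alone would be unbalanced
toy = pd.DataFrame({
    "goal": np.arange(1000),
    "outcome": np.array([1] * 350 + [0] * 650),
})

# Stage 1: hold out 20% of the rows while preserving the class ratio
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, hold_idx = next(outer.split(toy, toy["outcome"]))
toy_train, toy_holdout = toy.iloc[train_idx], toy.iloc[hold_idx]

# Stage 2: split the holdout in half to get validation and test sets
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
valid_idx, test_idx = next(inner.split(toy_holdout, toy_holdout["outcome"]))
toy_valid, toy_test = toy_holdout.iloc[valid_idx], toy_holdout.iloc[test_idx]

for name, part in [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]:
    print(f"{name}: {len(part)} rows, outcome fraction = {part['outcome'].mean():.4f}")
```
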

### Training a LightGBM model
In previous examples, we used Random Regression Trees and XGBoost. This time around, we'll be using a *LightGBM* model. This is a tree-based model that often performs very well, frequently on par with or better than XGBoost, though the results depend on the dataset and tuning. Our model won't be very optimized (as this is just a quick review), but we'll still see improvement through our feature engineering.

In [ ]:
# if lightgbm can't be found, run the following command (if using conda)
#   conda install -c conda-forge lightgbm
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

# Read the docs on lightgbm for more info on the parameters
dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves' : 64, 'objective':'binary'}
param['metric'] = 'auc'
num_round = 1000
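# LightGBM 4.0 removed the early_stopping_rounds and verbose_eval keyword arguments from lgb.train,
# so early stopping is passed as a callback instead (verbose=False keeps it quiet)
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)])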
print(colored("Good to go!", 'green'))

### Making predictions & evaluating the model
Now that we've got the model all set up and trained, let's make some predictions on the test set and see how the model performs.

In [ ]:
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")

And that's it for the basic baseline! Now we can move on to engineering our features further.

## 2. Categorical Encodings
Now that we have a nice lil baseline model, it's time to engineer our features a little more. In a previous lesson, Intermediate Machine Learning, we learned about one-hot encoding, and in this overview we used basic label encoding above. Now we'll learn about a few more encodings, specifically:
- Count Encoding
- Target Encoding
- CatBoost Encoding

In [ ]:
# Let's define some helper functions for testing our encodings. They're based on the lightgbm training and data prep from above
#  - Helper functions defined in 'feature_engineering.py'
from my_modules import feature_engineering as fe
train, valid, _ = fe.get_kickstarter_splits(base_data)
bst = fe.train_kickstarter_model(train, valid)

### Count Encoding
Count encoding replaces each categorical value with the number of times it appears in the dataset. For this encoding, we'll use the *category_encoders* package, specifically ```CountEncoder```. This encoder and the others in *category_encoders* work like scikit-learn transformers with ```.fit``` and ```.transform``` methods.
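
To see what count encoding computes, here is a minimal plain-pandas sketch on toy data. The `toy` frame is made up, and this is only an illustration of the idea, not the encoder's internals.

```python
import pandas as pd

# Toy data: six made-up rows
toy = pd.DataFrame({"country": ["US", "GB", "US", "CA", "US", "GB"]})

# Count how often each value occurs, then map those counts back onto the rows
counts = toy["country"].value_counts()
toy["country_count"] = toy["country"].map(counts)

print(toy)
```

Rare values end up with similar small counts, which lets the model treat them alike without needing one column per category.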


In [ ]:
# category_encoders conda install:
#   $ conda install -c conda-forge category_encoders
import category_encoders as ce
cat_features = ['category', 'currency', 'country']
count_enc = ce.CountEncoder()
# kst_data from above, after timestamp encoding, pre basic encoding
count_encoded = count_enc.fit_transform(kst_data[cat_features])

category_data = base_data.join(count_encoded.add_suffix("_count"))

# Training and testing
train, valid, _ = fe.get_kickstarter_splits(category_data)
bst = fe.train_kickstarter_model(train, valid)

A slight increase from 0.7467 -> 0.7486

### Target Encoding
Target encoding replaces a categorical value with the average value of the target for that value of the feature. For example, given the country value "CA", we would calculate the average outcome for all the rows with ```country == 'CA'```. This is often blended with the target probability over the entire dataset to reduce the variance of values with few occurrences.

This technique uses the targets to create new features, so including the validation or test data in the target encodings would be a form of target leakage. We should learn the target encodings from the training dataset only and then apply them to the other datasets.
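
Here is a rough sketch of the idea in plain pandas, on made-up data. It blends each category's mean with the overall training mean using simple additive smoothing. The `smoothing` weight and this exact formula are assumptions chosen for readability, and a library encoder may blend differently. Note that the statistics come from the training rows only and are then applied to the validation rows.

```python
import pandas as pd

# Toy training and validation data (made up for illustration)
toy_train = pd.DataFrame({
    "country": ["US", "US", "US", "CA", "CA", "GB"],
    "outcome": [1, 0, 1, 1, 0, 0],
})
toy_valid = pd.DataFrame({"country": ["US", "CA", "DE"]})

smoothing = 2.0                       # hypothetical weight given to the overall mean
prior = toy_train["outcome"].mean()   # target probability over the whole training set

# Per-category sum and count of the target, learned from the training rows only
stats = toy_train.groupby("country")["outcome"].agg(["sum", "count"])
encoding = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)

# Apply the learned encoding; values unseen in training fall back to the prior
toy_train["country_target"] = toy_train["country"].map(encoding)
toy_valid["country_target"] = toy_valid["country"].map(encoding).fillna(prior)

print(encoding)
print(toy_valid)
```

With few rows, a category's encoding stays close to the prior. As the count grows, it approaches that category's own mean.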

Much like ```CountEncoder```, we'll use ```TargetEncoder``` from *category_encoders*.

In [ ]:
cat_features = ['category', 'currency', 'country']

target_enc = ce.TargetEncoder(cols=cat_features)

train, valid, _ = fe.get_kickstarter_splits(category_data)

target_enc.fit(train[cat_features], train['outcome'])

train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

train.head()
bst = fe.train_kickstarter_model(train, valid)

Adding target encoding on top of count encoding has given us another increase, 0.7486 -> 0.7491

### CatBoost Encoding
Finally we'll look at CatBoost encoding. CatBoost encoding is similar to target encoding in that it's based on the target probability for a given value. However, with CatBoost encoding, the target probability for each row is calculated only from the rows before it.
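
The "only from the rows before it" part can be sketched in plain pandas with cumulative sums. This is an illustration of the idea on made-up data. The prior weight `a` and the exact formula are assumptions, and the library's encoder may differ in its details.

```python
import pandas as pd

# Toy data (made up); row order matters for this encoding
toy = pd.DataFrame({
    "country": ["US", "CA", "US", "US", "CA"],
    "outcome": [1, 0, 0, 1, 1],
})

prior = toy["outcome"].mean()  # overall target probability
a = 1.0                        # hypothetical weight given to the prior

grouped = toy.groupby("country")["outcome"]
# Sum of the target over earlier rows with the same value (current row excluded)
prev_sum = grouped.cumsum() - toy["outcome"]
# Number of earlier rows with the same value
prev_count = grouped.cumcount()

toy["country_cb"] = (prev_sum + a * prior) / (prev_count + a)

print(toy)
```

Because a row never sees its own target, the first occurrence of each value is encoded with the prior alone.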

In [ ]:
cat_features = ['category', 'currency', 'country']
cat_boost = ce.CatBoostEncoder(cols=cat_features)

train, valid, _ = fe.get_kickstarter_splits(category_data)
cat_boost.fit(train[cat_features], train['outcome'])

train = train.join(cat_boost.transform(train[cat_features]).add_suffix("_cb"))
valid = valid.join(cat_boost.transform(valid[cat_features]).add_suffix("_cb"))

bst = fe.train_kickstarter_model(train, valid)

With our current model, CatBoost only gave us a 0.0001 improvement over target encoding.

## 3. Feature Generation

## 4. Feature Selection