# Feature Engineering Overview
In our previous tutorials, we only touched on features and how to handle them. In this overview, we'll take a practical approach to learning about feature engineering. We will focus on how to:
- Develop a baseline model for comparing the performance of models with more/different features.
- Encode categorical features so the model can make better use of the information.
- Generate new features to provide more information for the model.
- Select specific features to reduce overfitting and increase prediction speed.

In the main exercise, we'll be using the 'TalkingDataAdTracking' Kaggle competition dataset. The goal of that competition is to predict whether a user will download an app after clicking through an ad. For learning purposes, we'll drop 99% of the negative records (negative meaning the app wasn't downloaded) to make the target more balanced.
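
As a rough illustration of the "drop 99% of negatives" idea, here is a minimal sketch on synthetic data. The frame `clicks` and the column names `click_id` and `is_attributed` are assumptions made for this example and are not loaded from the competition files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the click data. 'is_attributed' is an assumed target
# column for this sketch (1 = app downloaded, 0 = not downloaded).
rng = np.random.default_rng(0)
clicks = pd.DataFrame({
    "click_id": np.arange(10_000),
    "is_attributed": (rng.random(10_000) < 0.01).astype(int),
})

positives = clicks[clicks["is_attributed"] == 1]
negatives = clicks[clicks["is_attributed"] == 0]

# Keep a random 1% of the negatives (drop the other 99%) and all positives.
kept_negatives = negatives.sample(frac=0.01, random_state=0)
balanced = pd.concat([positives, kept_negatives]).sort_index()

print(f"Positive fraction before: {clicks['is_attributed'].mean():.4f}")
print(f"Positive fraction after:  {balanced['is_attributed'].mean():.4f}")
```

Sampling the negatives at random (rather than slicing off the first 1%) keeps the retained rows representative of the whole time range.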

## 1. Baseline Model
In this overview we'll be using Kickstarter data.

### Kickstarter Warmup (review)

In [1]:
import pandas as pd
from termcolor import colored # colored prints
from my_modules import data_imports as data

kst_data = data.import_kickstarter_2018_data()
kst_data.head(10)


Kickstarter 2018 data imported


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0
5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.0,successful,224,US,52375.0,52375.0,50000.0
6,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21,1000.0,2014-12-01 18:30:44,1205.0,successful,16,US,1205.0,1205.0,1000.0
7,1000030581,Chaser Strips. Our Strips make Shots their B*tch!,Drinks,Food,USD,2016-03-17,25000.0,2016-02-01 20:05:12,453.0,failed,40,US,453.0,453.0,25000.0
8,1000034518,SPIN - Premium Retractable In-Ear Headphones w...,Product Design,Design,USD,2014-05-29,125000.0,2014-04-24 18:14:43,8233.0,canceled,58,US,8233.0,8233.0,125000.0
9,100004195,STUDIO IN THE SKY - A Documentary Feature Film...,Documentary,Film & Video,USD,2014-08-10,65000.0,2014-07-11 21:55:48,6240.57,canceled,43,US,6240.57,6240.57,65000.0


Looking at this data, let's try to predict whether or not a Kickstarter project will succeed. To teach our model, we can use the *state* column as our outcome. To predict this outcome, we can use features such as category, currency, funding goal, country, and when the project was launched.

### Preparing target column
First, let's look at project states and convert them into something we can use as targets in a model. Remember that models don't like to work with strings, and our outcome data is categorical.

In [2]:
# pd.unique(kst_data.state)
# kst_data.groupby('state')['ID'].count()
kst_data.groupby('state')['ID'].nunique()

state
canceled       38779
failed        197719
live            2799
successful    133956
suspended       1846
undefined       3562
Name: ID, dtype: int64

So we see that our dataset has 6 unique states, with mostly failed and successful outcomes.

Since our priority in this quick review is not data cleaning, we'll just go along with this simple cleanup:
- Drop projects that are "live"
- Count successful as ```outcome = 1```
- Combine all other states as ```outcome = 0```

In [3]:
# Drop live projects
kst_data = kst_data.query('state != "live"')

# Add the 'outcome' column with "successful == 1", everything else 0
kst_data = kst_data.assign(outcome=(kst_data['state'] == 'successful').astype(int))
kst_data.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,outcome
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95,0
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0,0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0,0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0,0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0,0


### Converting Timestamps
Now that our outcome is set up and ready, it's time to handle dates. Let's convert the *launched* feature into numeric parts that our model can understand. We imported both *deadline* and *launched* as pandas Timestamp values, so we can use the ```.dt``` accessor on the timestamp column to extract the hour, day, month, and year.

In [4]:
# Note that the syntax below doesn't work on a single Timestamp object; .dt must be used on a column
#    kst_data['launched'][0].dt

kst_data = kst_data.assign(
    hour=kst_data.launched.dt.hour,
    day=kst_data.launched.dt.day,
    month=kst_data.launched.dt.month,
    year=kst_data.launched.dt.year
)
kst_data.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,outcome,hour,day,month,year
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95,0,12,11,8,2015
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0,0,4,2,9,2017
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0,0,0,12,1,2013
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0,0,3,17,3,2012
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0,0,8,4,7,2015
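
To make the column-versus-single-value distinction concrete, here is a small sketch using two launch times copied from the preview above. It assumes the strings are parsed with `pd.to_datetime`; how the import helper actually parses dates is not shown in this overview.

```python
import pandas as pd

# Two launch times copied from the preview above.
launched = pd.Series(pd.to_datetime(["2015-08-11 12:12:28", "2017-09-02 04:43:57"]))

# On a column (Series), the datetime fields live behind the .dt accessor.
hours = launched.dt.hour

# A single Timestamp exposes the same fields directly, with no .dt.
first = launched[0]
first_hour = first.hour

print(hours.tolist(), first_hour)
```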


### Prepping categorical variables
Now that both our outcome and timestamp data are set up, it's time to get our other categorical variables in check! For our model, we'll be using *category*, *currency*, and *country*, which all need to be converted into integer representations. We'll use scikit-learn's ```LabelEncoder``` for this.

In [5]:
# print(kst_data.groupby('category')['ID'].nunique())
# print(kst_data.groupby('currency')['ID'].nunique())
# print(kst_data.groupby('country')['ID'].nunique())

In [6]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded_kst = kst_data[cat_features].apply(encoder.fit_transform)
encoded_kst.head(10)

Unnamed: 0,category,currency,country
0,108,5,9
1,93,13,22
2,93,13,22
3,90,13,22
4,55,13,22
5,123,13,22
6,58,13,22
7,41,13,22
8,113,13,22
9,39,13,22
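
To see what the integers mean, here is a minimal sketch on a toy frame (the `toy` data and the `encoders` dictionary are made up for illustration). `LabelEncoder` sorts the unique values and numbers them from 0. The cell above reuses one encoder for every column, which works but only remembers the last column's mapping; fitting one encoder per column, as sketched here, keeps each mapping available.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "currency": ["GBP", "USD", "USD", "EUR"],
    "country": ["GB", "US", "US", "DE"],
})

# Fit one encoder per column so each mapping can be inspected or inverted later.
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
encoded = toy.apply(lambda col: encoders[col.name].transform(col))

# classes_ holds the sorted unique values; a value's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoders["currency"].classes_)}

print(encoded)
print(mapping)
```

Keep in mind that these integers carry no real ordering; tree-based models cope with that reasonably well, which is part of why label encoding is a workable baseline here.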


Great! Now let's gather all of the columns we're using for this model into a new, clean little dataframe. Because our original dataframe and our encoded dataframe have the same index, we can ```join``` them together easily.

In [7]:
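# kst_data has our hand-built hour, day, month, year, and outcome columns, while 'encoded_kst' has the label-encoded data. They both have the same index, so join join join!
# Note: 'data' now refers to this dataframe, shadowing the data_imports module alias from In [1]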
base_data = data = kst_data[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded_kst)
data.head()

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country
0,1000.0,12,11,8,2015,0,108,5,9
1,30000.0,4,2,9,2017,0,93,13,22
2,45000.0,0,12,1,2013,0,93,13,22
3,5000.0,3,17,3,2012,0,90,13,22
4,19500.0,8,4,7,2015,0,55,13,22


### Creating training, validation, and test splits
Ain't our data pretty? Now that it's ready to go, it's time to split our data into training, validation, and test sets! Since this is just a quick review, let's take a simple approach and just use slices of our data. We'll use 10% of the data for validation, 10% for testing, and 80% for training.

**Note:** For Python beginners (like me), there are extra steps/comments below to explain the indexing at the end

In [8]:
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

print("len of data: {}".format(len(data)))
print("valid size: {}".format(valid_size))
# The indexing below is a little confusing, so let's analyze it
# We need 80% of the set for training
print("80% of the dataset: {}".format(round(len(data) * 0.8)))
# This comes out to 300690, which leaves a difference of...
print( "Full data size - 80% data size: {}".format(round(len(data) - (len(data) * 0.8))))
# 75172! ... hmmmmmm now why are we using 2 * valid_size below?
print("valid_size doubled: {}".format(valid_size * 2))
# They're the same!!! valid_size * 2 == the difference from above!
# Oh ya.... valid_fraction = 0.1, so 100% - (10% * 2) = 80% .... I see

# Remember that Python slices are [start:end] with the end excluded, and negative indices count back from the end, so [:1] is just the first element, and [:-1] is everything except the last element

# start : end - (valid_size * 2), [0 : 375862 - 75172]
train = data[:-2 * valid_size]
# end - (valid_size * 2) : end - valid_size, [375862 - 75172 : 375862 - 37586]
valid = data[-2 * valid_size:-valid_size]
# end - valid_size : end, [375862 - 37586 : 375862]
test = data[-valid_size:]

print("Length of train/valid/test: {}/{}/{}".format(len(train), len(valid), len(test)))

len of data: 375862
valid size: 37586
80% of the dataset: 300690
Full data size - 80% data size: 75172
valid_size doubled: 75172
Length of train/valid/test: 300690/37586/37586
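
The same three slices can be checked on a tiny list, where the indices are easy to read by eye. This is only an illustrative sketch; `rows` is a made-up stand-in for the dataframe.

```python
rows = list(range(20))               # stand-in for 20 rows of data
valid_size = int(len(rows) * 0.1)    # 2

train = rows[:-2 * valid_size]             # everything except the last 4 rows
valid = rows[-2 * valid_size:-valid_size]  # the 4th- and 3rd-to-last rows
test = rows[-valid_size:]                  # the last 2 rows

# The three pieces should cover the data exactly once, with no overlap.
covers_everything = (train + valid + test == rows)
print(train, valid, test, covers_everything)
```

One caveat: if `valid_size` ever rounds down to 0 (fewer than 10 rows at a 10% fraction), `rows[:-0]` is an empty slice and `rows[-0:]` is the whole list, so this trick only behaves as intended when `valid_size` is at least 1.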


In general, we want to be careful that each data set has the same proportion of the target classes (that is, keep the splits balanced). Let's print out the fraction of successful outcomes from each dataset to confirm:

In [9]:
# In the above block, we used str.format() string formatting.
# Below, we use f-strings, introduced in Python 3.6!
# The statement below is roughly equivalent to:
#   print("Outcome fraction = {:.4f}".format(each.outcome.mean()))

for each in [train, valid, test]:
    print(f"Outcome fraction = {each.outcome.mean():.4f}")

Outcome fraction = 0.3570
Outcome fraction = 0.3539
Outcome fraction = 0.3542


As we can see, each split has around 35% positive outcomes, likely because the data was well randomized beforehand. If this weren't the case, we could have used a helpful scikit-learn class: ```sklearn.model_selection.StratifiedShuffleSplit```.
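
For reference, a stratified split might look roughly like the sketch below. The toy frame is deliberately sorted by outcome, so a plain slice would be badly unbalanced; the column names mirror ours, but the data is synthetic and the 80/20 split is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "goal": rng.integers(100, 10_000, size=1_000),
    # 350 positives followed by 650 negatives: sorted, so slicing would fail us.
    "outcome": np.repeat([1, 0], [350, 650]),
})

# One shuffled split that preserves the class proportions in both parts.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, holdout_idx = next(splitter.split(toy, toy["outcome"]))
train_part, holdout_part = toy.iloc[train_idx], toy.iloc[holdout_idx]

for name, part in [("train", train_part), ("holdout", holdout_part)]:
    print(f"{name}: {len(part)} rows, outcome fraction = {part.outcome.mean():.4f}")
```

To get separate validation and test sets this way, the holdout part could be split a second time with the same approach.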

### Training a LightGBM model
In previous examples, we used Random Regression Trees and XGBoost. This time around, we'll be using a *LightGBM* model. This is a gradient-boosted tree model that is often competitive with XGBoost on tabular data and is frequently faster to train. Our model won't be very optimized (as this is just a quick review), but we'll still see improvement through our feature engineering.

In [10]:
# if lightgbm can't be found, run the following command (if using conda)
#   conda install -c conda-forge lightgbm
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

# Read the docs on lightgbm for more info on the parameters
dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves' : 64, 'objective':'binary'}
param['metric'] = 'auc'
num_round = 1000
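# Note: early_stopping_rounds and verbose_eval were removed from lgb.train() in LightGBM 4.0;
# on newer versions, pass callbacks=[lgb.early_stopping(10), lgb.log_evaluation(0)] instead
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10, verbose_eval=False)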
print(colored("Good to go!", 'green'))

Good to go!


### Making predictions & evaluating the model
Now that we have the model set up and trained, let's make some predictions on the test set and see how it performs.

In [11]:
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")

Test AUC score: 0.747615303004287


And that's it for the basic baseline! Now we can move on to engineering our features further.

## 2. Categorical Encodings
Now that we have a nice little baseline model, it's time to engineer our features a little more. In a previous lesson, Intermediate Machine Learning, we learned about one-hot encoding, and earlier in this overview we used basic label encoding. Now we'll learn about a few more encodings, specifically:
- Count Encoding
- Target Encoding
- Singular Value Decomposition

In [12]:
# Let's define some helper functions for testing our encodings. They're based on the LightGBM training and data prep from above
#  - Helper functions defined in 'feature_engineering.py'
from my_modules import feature_engineering as fe
train, valid, _ = fe.get_data_splits(base_data)
bst = fe.train_kickstarter_model(train, valid)

Training model...
Validation AUC score: 0.7467
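
The helper module itself isn't shown here, so the following is only a guess at what `get_data_splits` might look like, assuming it wraps the slicing from the baseline section. The `valid_fraction` parameter and its default are assumptions; `train_kickstarter_model` would presumably wrap the LightGBM training cell in a similar way and is not sketched.

```python
import pandas as pd

def get_data_splits(dataframe, valid_fraction=0.1):
    """Hypothetical sketch: slice off the last two chunks as validation and test."""
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-2 * valid_size]
    valid = dataframe[-2 * valid_size:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test

# Usage on a toy frame with the same column names as our baseline data.
toy = pd.DataFrame({"goal": range(100), "outcome": [0, 1] * 50})
train_toy, valid_toy, test_toy = get_data_splits(toy)
print(len(train_toy), len(valid_toy), len(test_toy))
```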


### Count Encoding


## 3. Feature Generation

## 4. Feature Selection