# A Practical Guide to Feature Engineering
This document discusses some common approaches to feature engineering in machine learning, with working code wherever possible. The scikit-learn package is used throughout. Many common feature engineering techniques are already covered on the scikit-learn preprocessing page (http://scikit-learn.org/stable/modules/preprocessing.html). The goal of this document is to describe the techniques I use and my perspective on them.

## Feature Engineering
The goals of feature engineering here include making features more interpretable and improving classification accuracy, among others.

## Setup
In the following experiments, we fix the dataset (iris) and the classifier (Liblinear, through scikit-learn's LinearSVC). For evaluation, accuracy is computed with 5-fold cross-validation. Although our toy dataset may seem small, the code provided should work for larger data with reasonable speed. (For example, I have run the scripts on dense data with 20k samples and 20-dim features, and the scripts finish in one second.) Let's load the data and classifier here:

In [2]:
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# LinearSVC is backed by Liblinear; penalty='l1' requires dual=False.
classifier = svm.LinearSVC(penalty='l1', dual=False)
data = load_iris()  # TODO for readers: feel free to replace this line with any dataset that you have in mind.
print("Number of samples:", len(data.data))
print("Number of features:", len(data.data[0]))
print("Labels:", set(data.target))
labels = data.target
features = data.data

# Returns the mean accuracy over 5-fold cross-validation.
def DoCrossValidation(classifier, features, labels):
    return cross_val_score(classifier, features, labels, cv=5).mean()
print("Cross-validation accuracies:", DoCrossValidation(classifier, features, labels))


Number of samples: 150
Number of features: 4
Labels: set([0, 1, 2])
Cross-validation accuracies: 0.953333333333


## Techniques
### Categorical Feature Processing
As many classifiers, ranging from linear classifiers to neural networks, assign a weight to a feature at some stage of training, it is important that the weight makes sense.
#### Categorical Feature Expansion
A common way to make categorical features meaningful is to give each category its own dimension. For example, suppose we want to predict whether a person has diabetes, using the state they are from as a feature. It makes more sense to have 50 features, each representing one state, than a single feature that encodes all states as one number. That way, the resulting weighted terms will be something like [0.8 * (indicator that this person is in CA), 0.2 * (indicator that this person is in NY), ...] rather than 0.5 * (a number representing the state). (For more information about such classification tasks, see https://github.com/scan33scan33/CoUS.)
When it comes to coding, DictVectorizer is a great tool for categorical feature expansion: it expands string-valued entries into one indicator column per distinct value and passes numeric entries through unchanged. Let's assume dimension zero is categorical. Here is how I would do it:

In [7]:
from sklearn.feature_extraction import DictVectorizer

# Converts each sample into a {dimension index as str: value} dictionary.
# String keys keep the feature names sortable once categorical names such as "0=5.1" are mixed in.
def MakeFeatureDict(data):
    return [{str(dim): value for dim, value in enumerate(sample)} for sample in data]

# Marks a dimension 'dim' in a list of feature dictionaries as str to be considered categorical for DictVectorizer.
def MarkDimAsCategorical(dicted_features, dim):
    key = str(dim)
    dicted_features_ret = []
    for sample in dicted_features:
        new_sample = sample.copy()
        if key in sample:
            new_sample[key] = str(sample[key])
        dicted_features_ret.append(new_sample)
    return dicted_features_ret

dicted_features = MakeFeatureDict(features)
dicted_features = MarkDimAsCategorical(dicted_features, 0)  # TODO for readers: change the feature dimension here.
vectorizer = DictVectorizer(sparse=False)
temp_features = vectorizer.fit_transform(dicted_features)
print("Cross-validation accuracies:", DoCrossValidation(classifier, temp_features, labels))

Cross-validation accuracies: 0.933333333333
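
To connect the code above back to the state example, here is a minimal sketch of what DictVectorizer produces for a string-valued feature. The records, the `state` and `age` keys, and their values are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records: 'state' is categorical (str), 'age' is numeric.
people = [
    {"state": "CA", "age": 30},
    {"state": "NY", "age": 45},
    {"state": "CA", "age": 52},
]

vectorizer = DictVectorizer(sparse=False)
expanded = vectorizer.fit_transform(people)

# One indicator column per observed state, plus the numeric 'age' column.
print(vectorizer.feature_names_)
print(expanded)
```

Each state gets its own column, so a linear model can learn a separate weight per state instead of a single weight on an arbitrary state number.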


### Scaling
Scaling serves two purposes: balancing out the effect of different features, and making the weights on individual features make sense.
#### First Purpose: Balancing out the Effect of Different Features
I usually don't care much about the first purpose because I believe it is something the model should learn. However, when we interpret weights, we have to remember that different features may be on different scales. For example, a weight of 0.2 on a feature ranging over [0, 100] may matter more than a weight of 1 on a feature ranging over [0, 1]. For convenience, I usually just scale all features to [0, 1]. Here is an example of how I would do it.


In [8]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
temp_features = min_max_scaler.fit_transform(features)
print("Cross-validation accuracies:", DoCrossValidation(classifier, temp_features, labels))

Cross-validation accuracies: 0.94
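
To make the weight comparison above concrete, here is a small sketch using the two features from the example in the text. The weights and ranges are the illustrative numbers from that example, not values learned from iris, and the variable names are my own.

```python
import numpy as np

# Hypothetical weights and feature ranges from the example in the text.
weights = np.array([0.2, 1.0])
feature_min = np.array([0.0, 0.0])
feature_max = np.array([100.0, 1.0])

# Largest change each feature can cause in the linear score w . x.
max_effect = np.abs(weights) * (feature_max - feature_min)
print(max_effect.tolist())

# Weights that give the same scores once every feature is min-max scaled to [0, 1]
# (the constant offset w * feature_min is absorbed by the bias term).
scaled_weights = weights * (feature_max - feature_min)
print(scaled_weights.tolist())
```

The first feature can move the score far more than the second despite its smaller weight, which is why weights are easier to compare after all features share the [0, 1] range.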


#### Second Purpose: Making Weights on Individual Features Make Sense
Sometimes the scale of a feature in the data set is not a good scale for the model to learn on. In some applications, we take the log of a feature so that its values behave more sensibly. (TODO(scan33scan33): add example applications.)
If I want to apply log to feature zero, here is how I would do it:

In [6]:
import math

# Returns a copy of 'features' in which 'func' has been applied to dimension 'dim' of every sample.
def ApplyFuncToDim(features, func, dim):
    features_ret = []
    for feature in features:
        feature_ret = feature.copy()
        feature_ret[dim] = func(feature_ret[dim])
        features_ret.append(feature_ret)
    return features_ret

# log(x + 1) keeps the transform defined at x = 0.
temp_features = ApplyFuncToDim(features, lambda x: math.log(x + 1), 0)
print("Cross-validation accuracies:", DoCrossValidation(classifier, temp_features, labels))

Cross-validation accuracies: 0.96
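
For larger data, the same log(x + 1) transform can be applied to a whole column at once with NumPy instead of looping over samples. This is a sketch under the assumption that the features are stored in a 2-D float array; the small array below is hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 dimensions.
features = np.array([[0.0, 3.5],
                     [9.0, 3.0],
                     [99.0, 3.2]])

# Copy first so the original data stays untouched, then transform dimension zero.
log_features = features.copy()
log_features[:, 0] = np.log1p(log_features[:, 0])  # log1p(x) computes log(x + 1)
print(log_features)
```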


### Quantization
#### Just Quantize!
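One simple reading of quantization is to cut a numeric feature into a fixed number of equal-width bins and use the bin index in place of the raw value. The sketch below assumes that reading; the values and the choice of four bins are hypothetical.

```python
import numpy as np

# Hypothetical values of a single numeric feature.
values = np.array([4.3, 5.0, 5.8, 6.4, 7.9])
n_bins = 4

# Equal-width bin edges spanning the observed range.
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Passing only the interior edges makes the bin ids run from 0 to n_bins - 1.
bin_ids = np.digitize(values, edges[1:-1])
print(bin_ids.tolist())
```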
#### Quantized Categorical Features
#### Quantization with Prior Knowledge
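When we already know which thresholds matter, the bin edges can be chosen by hand rather than derived from the data. The sketch below assumes an age feature with cut points at 18 and 65; both the feature and the cut points are hypothetical.

```python
import numpy as np

# Hypothetical ages and hand-picked cut points.
ages = np.array([5, 18, 40, 65, 80])
cut_points = [18, 65]

# Bin 0: age < 18, bin 1: 18 <= age < 65, bin 2: age >= 65.
age_groups = np.digitize(ages, cut_points)
print(age_groups.tolist())
```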
### Dealing with Missing Values
#### Imputation
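A minimal sketch of imputation, assuming missing values are encoded as NaN and that replacing them with the column mean is acceptable. It uses scikit-learn's SimpleImputer; the data is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries encoded as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```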
#### Missing Value Indicator
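Instead of hiding the fact that a value was missing, we can append one indicator column per feature so the model can learn a weight for "missing" itself. The sketch below assumes NaN-encoded missing values and fills them with 0; the data and the fill value are hypothetical.

```python
import numpy as np

# Hypothetical data with missing entries encoded as NaN.
X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [3.0, 5.0]])

missing = np.isnan(X)

# Fill missing entries with 0 and append one 0/1 indicator column per feature.
X_filled = np.where(missing, 0.0, X)
X_with_indicator = np.hstack([X_filled, missing.astype(float)])
print(X_with_indicator)
```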
### Learning Higher-Order Correlations
#### Polynomial of Degree-n Expansion
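A minimal sketch of degree-2 expansion with scikit-learn's PolynomialFeatures, which adds squares and pairwise products so that a linear classifier can pick up interactions between features. The single sample below is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample with two features, x0 = 2 and x1 = 3.
X = np.array([[2.0, 3.0]])

# Degree-2 expansion without the constant column: x0, x1, x0^2, x0*x1, x1^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly)
```

The number of columns grows quickly with the degree and the number of input features, so higher degrees are usually only practical on low-dimensional data.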