# Feature Engineering
* feature extraction
> * Loading features from dicts
> * Feature hashing
> * Text feature extraction
> * Image feature extraction
* feature selection
> * Removing features with low variance
> * Univariate feature selection
> * Recursive feature elimination
> * Feature selection using SelectFromModel
> * Feature selection as part of a pipeline

## I. Feature Extraction

### 1 Loading features from dicts

* The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

In [3]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
print(vec)
print(vec.fit_transform(measurements).toarray())
vec.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)
[[  1.   0.   0.  33.]
 [  0.   1.   0.  12.]
 [  0.   0.   1.  18.]]


['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']

* DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing models that typically work by extracting feature windows around a particular word of interest.

In [6]:
pos_window = [
    {
        'word-2': 'the',
        'pos-2': 'DT',
        'word-1': 'cat',
        'pos-1': 'NN',
        'word+1': 'on',
        'pos+1': 'PP',
    }] # in a real application one would extract many such dictionaries

pos_window

[{'pos+1': 'PP',
  'pos-1': 'NN',
  'pos-2': 'DT',
  'word+1': 'on',
  'word-1': 'cat',
  'word-2': 'the'}]

In [11]:
vec = DictVectorizer()
pos_vectorized = vec.fit_transform(pos_window)
print(pos_vectorized)
print()
print(pos_vectorized.toarray())
vec.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0

[[ 1.  1.  1.  1.  1.  1.]]


['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']

### 2 Feature hashing

In [12]:
def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

In [13]:
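from sklearn.feature_extraction import FeatureHasher

# Toy stand-ins so the cell runs; replace with a real token stream and POS tagger
corpus = ["The", "cat", "sat", "on", "2", "mats"]
pos_tagger = lambda tok: "CD" if tok.isdigit() else "NN"  # placeholder, not a real tagger

raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)  # the raw_X to be fed to FeatureHasher.transform

hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)  # scipy.sparse matrix X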

### 3 Text feature extraction

#### 3.1 The Bag of Words representation

* scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
> * tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
> * counting the occurrences of tokens in each document.
> * normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.

* In this scheme, features and samples are defined as follows:
> * each individual token occurrence frequency (normalized or not) is treated as a feature.
> * the vector of all the token frequencies for a given document is considered a multivariate sample.
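
The three steps above (tokenizing, counting, weighting) can be sketched with `CountVectorizer` and `TfidfTransformer`. The toy corpus and variable names are assumptions made for illustration; they are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (illustrative data only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Steps 1 and 2: tokenize each document and count token occurrences
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)   # sparse matrix, shape (n_docs, n_tokens)

# Step 3: reweight the counts so tokens shared by many documents matter less
tfidf = TfidfTransformer().fit_transform(counts)

vocab = count_vec.vocabulary_            # dict mapping token -> column index
print(sorted(vocab))
print(counts.toarray())
print(tfidf.toarray().round(2))
```

Each row is one sample (document) and each column is one feature (token), which matches the definition of features and samples given above.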

#### 3.2 Sparsity

#### 3.3 Common Vectorizer usage

#### 3.4 Tf–idf term weighting

#### 3.5 Decoding text files

#### 3.6 Applications and examples

#### 3.7 Limitations of the Bag of Words representation

#### 3.8 Vectorizing a large text corpus with the hashing trick

#### 3.9 Performing out-of-core scaling with HashingVectorizer

#### 3.10 Customizing the vectorizer classes

### 4 Image feature extraction

#### 4.1 Patch extraction

In [23]:
import numpy as np
from sklearn.feature_extraction import image

one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
one_image[:, :, 0]  # R channel of a fake RGB picture

array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

In [21]:
# extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis
patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2, random_state=0)
print(patches.shape)
print()
print(patches[:, :, :, 0])
print()

patches = image.extract_patches_2d(one_image, (2, 2))
print(patches.shape)
patches[4, :, :, 0]

(2, 2, 2, 3)

[[[ 0  3]
  [12 15]]

 [[15 18]
  [27 30]]]

(9, 2, 2, 3)


array([[15, 18],
       [27, 30]])

In [15]:
# reconstruct the original image from the patches by averaging on overlapping areas
reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3)) # rebuilding an image from all its patches
np.testing.assert_array_equal(one_image, reconstructed)

In [16]:
five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
patches = image.PatchExtractor(patch_size=(2, 2)).transform(five_images)  # supports multiple images as input; patch_size is keyword-only in recent scikit-learn
patches.shape

(45, 2, 2, 3)

## II. Feature Selection

### 1 Removing features with low variance
* VarianceThreshold is a simple baseline approach to feature selection. 
* It removes all features whose variance doesn’t meet some threshold. 
* By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

In [25]:
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(X)
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)

[[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]


array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

* As expected, VarianceThreshold has removed the first column, which has a probability p = 5/6 > .8 of containing a zero.
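
The threshold `.8 * (1 - .8)` in the cell above comes from treating each Boolean feature as a Bernoulli variable, whose variance is p(1 - p). The following sketch, which reuses the same data, checks that formula against the variances that `VarianceThreshold` stores after fitting; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

p = X.mean(axis=0)              # fraction of ones in each column
bernoulli_var = p * (1 - p)     # variance of a Bernoulli(p) feature
threshold = 0.8 * (1 - 0.8)     # 0.16: drop features constant in more than 80% of samples

sel = VarianceThreshold(threshold=threshold).fit(X)
print(bernoulli_var)            # per-column p(1 - p)
print(sel.variances_)           # variances computed by the selector
print(sel.get_support())        # True for the columns that are kept
```

The first column has p = 1/6, so its variance 5/36 falls below 0.16 and it is the only column removed.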

### 2 Univariate feature selection (statistical tests)
* Univariate feature selection works by selecting the best features based on univariate statistical tests.

* Scikit-learn exposes feature selection routines as objects that implement the transform method:
> * SelectKBest removes all but the k highest scoring features
> * SelectPercentile removes all but a user-specified highest scoring percentage of features
> * SelectFpr, SelectFdr, and SelectFwe apply common univariate statistical tests to each feature, controlling the false positive rate, the false discovery rate, and the family-wise error rate respectively.
> * GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator.

In [26]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # chi-squared test
X_new.shape

(150, 4)


(150, 2)
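
As a hedged companion to the SelectKBest cell, the sketch below shows two of the other selectors listed earlier on the same iris data: `SelectPercentile`, and `GenericUnivariateSelect` configured to do the same thing. The choice of `percentile=50` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, SelectPercentile, chi2

X, y = load_iris(return_X_y=True)

# Keep the top 50% of features ranked by the chi-squared statistic
X_pct = SelectPercentile(chi2, percentile=50).fit_transform(X, y)

# The same selection through the configurable wrapper:
# `mode` picks the strategy and `param` is its argument (here a percentile)
generic = GenericUnivariateSelect(chi2, mode="percentile", param=50)
X_gen = generic.fit_transform(X, y)

print(X_pct.shape, X_gen.shape)
```

Because `mode` and `param` are ordinary constructor parameters, they can be tuned by a hyper-parameter search in the same way as any other estimator setting.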

* These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
> * For regression: f_regression, mutual_info_regression
> * For classification: chi2, f_classif, mutual_info_classif
* The methods based on F-test estimate the degree of linear dependency between two random variables. 
* On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
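
To illustrate the difference between the F-test and mutual information, here is a small sketch on synthetic data (an assumption of this example, not data from the notebook). The target depends linearly on the first feature, quadratically on the second, and not at all on the third.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 3))
# Column 0: linear effect, column 1: quadratic effect, column 2: pure noise
y = X[:, 0] + 3 * X[:, 1] ** 2 + 0.1 * rng.randn(1000)

f_scores, p_values = f_regression(X, y)                    # measures linear dependency only
mi_scores = mutual_info_regression(X, y, random_state=0)   # measures any kind of dependency

print(f_scores.round(1))
print(mi_scores.round(2))
```

The F-test should rank the linear feature far above the quadratic one, because a symmetric quadratic relationship has almost no linear correlation, while mutual information should still score the quadratic feature well above the noise feature.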

### 3 Recursive feature elimination
* Recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features:
> * First, the estimator is trained on the initial set of features and a weight is assigned to each of them.
> * Then, the features with the smallest absolute weights are pruned from the current set of features.
> * The procedure is repeated recursively on the pruned set until the desired number of features is reached.
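
The procedure above can be sketched with `sklearn.feature_selection.RFE`. The estimator (`LogisticRegression`) and the target of two features are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Any estimator exposing coef_ or feature_importances_ can drive the elimination;
# LogisticRegression is an arbitrary choice for this sketch.
estimator = LogisticRegression(max_iter=1000)

# Drop one feature per iteration (step=1) until two remain
rfe = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # 1 = selected; larger ranks were eliminated earlier
X_reduced = rfe.transform(X)
print(X_reduced.shape)
```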

### 4 Feature selection using SelectFromModel
* SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. 
* Features are considered unimportant and removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter.
* Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. 
* Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
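
A minimal sketch of the threshold options described above, assuming a random forest as the underlying estimator (any estimator with `coef_` or `feature_importances_` would do). It fits one selector per threshold setting and records the resolved numeric threshold and the number of features kept.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

results = {}
for threshold in ("mean", "median", "0.1*mean", 0.2):
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=threshold,
    ).fit(X, y)
    # threshold_ is the numeric value the string heuristic resolved to
    results[threshold] = (selector.threshold_, int(selector.get_support().sum()))
    print(threshold, results[threshold])
```

Forest importances sum to one, so with four features the "mean" heuristic resolves to 0.25, and a scaled heuristic such as "0.1*mean" keeps at least as many features as "mean".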

#### 4.1 L1-based feature selection

* Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with feature_selection.SelectFromModel to select the features with non-zero coefficients.
* In particular, sparse estimators useful for this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification:

In [27]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)  # L1 penalty drives some coefficients to exactly zero
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X_new.shape

(150, 4)


(150, 3)

#### 4.2 Randomized sparse models

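* L1-based sparse models tend to pick only one feature out of a group of highly correlated ones. Randomized sparse models mitigate this by refitting the sparse model many times on perturbed or subsampled data and counting how often each feature is selected (stability selection).
* RandomizedLasso implements this strategy for regression settings using the Lasso, while RandomizedLogisticRegression uses logistic regression and is suitable for classification tasks.
* To get a full path of stability scores you can use lasso_stability_path.
* Note that for randomized sparse models to be more powerful than standard F statistics at detecting non-zero features, the ground truth model should be sparse; in other words, only a small fraction of the features should be non-zero.
* RandomizedLasso, RandomizedLogisticRegression, and lasso_stability_path were deprecated in scikit-learn 0.19 and removed in 0.21.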
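
The idea behind randomized sparse models can be sketched by hand with a plain `Lasso` refitted on random subsamples. This is a simplified illustration and not the `RandomizedLasso` implementation; the synthetic data, `alpha=0.2`, and the number of rounds are all assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features = 200, 20
X = rng.randn(n_samples, n_features)

# Sparse ground truth: only the first three features have non-zero weights
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.5 * rng.randn(n_samples)

n_rounds = 100
selected = np.zeros(n_features)
for _ in range(n_rounds):
    # Refit the Lasso on a random half of the samples
    idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
    coef = Lasso(alpha=0.2).fit(X[idx], y[idx]).coef_
    selected += coef != 0          # count features with non-zero coefficients

stability = selected / n_rounds    # fraction of rounds in which each feature was kept
print(stability.round(2))
```

Features with a stability score close to 1 are selected in nearly every round, which is the signal that stability selection relies on.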

#### 4.3 Tree-based feature selection
* Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features when coupled with SelectFromModel.

In [28]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
print()

clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_)

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape 

(150, 4)

[ 0.04807344  0.04272581  0.61528348  0.29391727]


(150, 2)

### 5 Feature selection as part of a pipeline

In [24]:
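from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# SelectFromModel, LinearSVC, X and y come from the cells above
clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),  # the l1 penalty requires dual=False
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)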