<center><img src="img/logo_hse_black.jpg"></center>

<h1><center>Data Analysis</center></h1>
<h2><center>Seminar: Feature Engineering and Feature Selection </center></h2>

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

%matplotlib inline

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 6)

# Feature Selection

<center><img src='img/feature_selection.png' width=700></center>

Feature selection is the process of selecting a subset of the original features with minimum loss of information relevant to the final task (classification, regression, etc.).

## Why feature selection?

* increase predictive accuracy of classifier
* improve optimization stability by removing multicollinearity
* increase computational efficiency
* reduce cost of future data collection
* make classifier more interpretable

**Not always a necessary step**
* some methods have implicit feature selection


## Feature Selection Approaches
* Unsupervised methods
    * don't use target feature
* Filter methods
    * use target feature
    * consider each feature independently
* Wrapper methods
    * use model quality
* Embedded methods
    * embedded inside model

### "Unsupervised" methods

* Determine feature importance regardless of target feature
* Your options?
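
Two common options, sketched here on a made-up toy matrix: dropping (near-)constant features with scikit-learn's `VarianceThreshold`, and flagging pairs of highly correlated features. The matrix `X_toy` and the 0.95 correlation cutoff are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (made up for illustration): column 1 is constant
X_toy = np.array([[0.0, 1.0, 2.1],
                  [1.0, 1.0, 3.9],
                  [2.0, 1.0, 6.2],
                  [3.0, 1.0, 7.8]])

# Option 1: drop features whose variance does not exceed the threshold
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_toy)
kept = selector.get_support()  # boolean mask of retained columns

# Option 2: flag pairs of highly correlated (possibly redundant) features
corr = np.corrcoef(X_reduced, rowvar=False)
redundant_pairs = [(i, j)
                   for i in range(corr.shape[0])
                   for j in range(i + 1, corr.shape[1])
                   if abs(corr[i, j]) > 0.95]
```

Neither step looks at the target, so both can be applied before any labels are available.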


### Filter methods 
* Features are considered independently of each other
* Individual predictive power is measured

**Basically**
* Order features with respect to feature importances $I(f)$:
$$
I(f_{1})\ge I(f_{2})\ge \dots\ge I(f_{D})
$$
* Select top $m$
$$
\hat{F}=\{f_{1},f_{2},\dots,f_{m}\}
$$


* Simple to implement
* Usually quite fast
* When features are correlated, the selected top $m$ will contain many redundant features
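
The ranking scheme above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming the importance $I(f)$ is the absolute Pearson correlation with the target; the names `X_demo`, `y_demo` and the choice of $m = 2$ are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
n, D, m = 500, 6, 2

# Synthetic data (illustrative): only features 0 and 3 drive the target
X_demo = rng.normal(size=(n, D))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=n)

# Importance I(f): absolute Pearson correlation of each feature with the target
importance = np.array([abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1])
                       for j in range(D)])

# Order features by decreasing importance and keep the top m
order = np.argsort(-importance)
top_m = order[:m]
print(top_m)
```

Any other per-feature score (for example mutual information) can be plugged in instead of the correlation without changing the rest of the procedure.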


#### Examples

* Correlation
    * Which kind of relationship does correlation measure?
* Mutual Information
    * Entropy of variable $Y$: $H(Y) = - \sum_y p(y)\ln p(y)$
    * Conditional entropy of $Y$ after observing $X$: $H(Y|X) = - \sum_x p(x) \sum_y p(y|x)\ln p(y|x) $
    * Mutual information: $$MI(Y, X) = \sum_{x,y} p(x,y) \ln\left[\frac{p(x,y)}{p(x)p(y)}\right]$$
        * Mutual information measures how much information $X$ and $Y$ share with each other
        * $MI(Y,X) = H(Y) - H(Y|X)$
    * Normalized mutual information: $NMI(X,Y) = \frac{MI(Y,X)}{H(Y)}$

<center><img src='img/mi.png' width=300></center>
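
As a quick sanity check of the formulas above, the identity $MI(Y,X) = H(Y) - H(Y|X)$ can be verified numerically. The joint distribution `P` below is a hypothetical example, not taken from any dataset.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows are values of X, columns are values of Y
P = np.array([[0.30, 0.10],
              [0.15, 0.45]])

px = P.sum(axis=1)  # marginal p(x)
py = P.sum(axis=0)  # marginal p(y)

# Entropy H(Y) and conditional entropy H(Y|X) = -sum_{x,y} p(x,y) ln p(y|x)
H_y = -np.sum(py * np.log(py))
p_y_given_x = P / px[:, np.newaxis]
H_y_given_x = -np.sum(P * np.log(p_y_given_x))

# Mutual information computed directly from its definition
MI = np.sum(P * np.log(P / np.outer(px, py)))

# Normalized mutual information
NMI = MI / H_y
print(MI, H_y - H_y_given_x, NMI)
```

The first two printed values should agree up to floating point error.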

In [None]:
df_titanic = pd.read_csv('data/titanic.csv')
df_titanic.head()

In [None]:
print(pd.crosstab(df_titanic.Survived, df_titanic.Sex, normalize=True))



In [None]:
P = pd.crosstab(df_titanic.Survived, df_titanic.Sex, normalize=True).values

In [None]:
px = P.sum(axis=1)[:, np.newaxis]

In [None]:
py = P.sum(axis=0)[:, np.newaxis]

In [None]:
px

In [None]:
px.dot(py.T)

In [None]:
def mutual_info(x, y):
    '''
    Take arrays of values x and y and return their mutual information (in nats)
    '''
    Pxy = pd.crosstab(x, y, normalize=True).values
    Px = Pxy.sum(axis=1)[:, np.newaxis]
    Py = Pxy.sum(axis=0)[:, np.newaxis]
    PxPy = Px.dot(Py.T)
    # Skip empty cells: the term 0 * log(0) is taken to be 0 (otherwise the sum is NaN)
    nonzero = Pxy > 0
    MI = (Pxy[nonzero] * np.log(Pxy[nonzero] / PxPy[nonzero])).sum()
    
    return MI

In [None]:
mutual_info(df_titanic.Sex, df_titanic.Survived)

### Wrapper methods
* Search for a subset of features using model quality; the search is usually greedy, so the selected subset is generally suboptimal
* Could be slow
* Examples: 
    * Recursive Feature Elimination
        * Consider full set of features
        * Fit a model, measure feature importance (based on model)
        * Remove least important feature(s)
        * Repeat
    * [Boruta Algorithm](https://www.google.ru/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwif5biy-fTWAhXkYJoKHbdxCLAQFgg2MAE&url=https%3A%2F%2Fwww.jstatsoft.org%2Farticle%2Fview%2Fv036i11%2Fv36i11.pdf&usg=AOvVaw3tyiHN0BCe2fkkAA6xEVDE)
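
The Recursive Feature Elimination steps listed above can be sketched as a plain loop. This is a minimal illustration on synthetic data, assuming the absolute value of a logistic regression coefficient serves as the model-based importance; the names `X_syn`, `y_syn` and the stopping point `n_to_keep` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

remaining = list(range(X_syn.shape[1]))  # start from the full feature set
eliminated = []                          # features in order of removal
n_to_keep = 3

while len(remaining) > n_to_keep:
    # Fit a model on the features that are still in play
    model = LogisticRegression().fit(X_syn[:, remaining], y_syn)
    # Model-based importance: absolute value of the coefficient
    importance = np.abs(model.coef_).ravel()
    # Remove the least important feature and repeat
    worst = remaining[int(np.argmin(importance))]
    remaining.remove(worst)
    eliminated.append(worst)

print('kept:', remaining, 'removed in order:', eliminated)
```

Features are standardized first so that coefficient magnitudes are comparable. In practice `sklearn.feature_selection.RFE` and `RFECV` wrap the same idea, with `RFECV` also choosing the number of features by cross-validation.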

#### Recursive Feature Elimination

In [None]:
def load_otp():
    # Just data load and some preprocessing
    features = pd.read_csv('data/descr.txt', sep='\t', encoding='cp1251', names=['feature', 'descr'])
    
    features = features.iloc[3:]
    feature_names = features.iloc[:, 0].values
    
    df_data_x = pd.read_csv('data/data_x.csv', sep=';', header=None, names=feature_names)
    df_data_x.loc[:, 'PREVIOUS_CARD_NUM_UTILIZED'] = df_data_x.PREVIOUS_CARD_NUM_UTILIZED.fillna(0)
    
    features.loc[:, 'uniq_vals'] = df_data_x.apply(lambda c: c.nunique(), axis=0).values
    
    features = features.reset_index(drop=True)
    
    df_data_y = pd.read_csv('data/data_y.csv', sep=';', names=['active'])
    
    # Columns read as strings use a decimal comma; convert them to floats
    obj_cols = df_data_x.columns[df_data_x.dtypes == 'object']

    for col in obj_cols:
        df_data_x[col] = df_data_x[col].str.replace(',', '.').astype('float')
        
    df_data = df_data_x.join(df_data_y)
    
    return df_data, features

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
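from sklearn.preprocessing import StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer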

In [None]:
df_data, features = load_otp()

In [None]:
features.head()

In [None]:
df_data.head()

In [None]:
X = df_data.iloc[:, :-1].values
y = df_data.iloc[:, -1].values

In [None]:
cv = StratifiedKFold(5, shuffle=True, random_state=123)

In [None]:
pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', RFECV(LogisticRegression(), 
                  verbose=2, cv=cv, scoring='roc_auc', n_jobs=1))
])


In [None]:
pipeline.fit(X, y)

In [None]:
rfe = pipeline.steps[-1][1]

In [None]:
rfe.ranking_

In [None]:
idx = rfe.support_

In [None]:
features.feature.values[idx]

In [None]:
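# Mean CV score per number of selected features (grid_scores_ in older scikit-learn versions)
scores = rfe.cv_results_['mean_test_score'] if hasattr(rfe, 'cv_results_') else rfe.grid_scores_
plt.plot(range(1, len(scores) + 1), scores)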

### Embedded methods
* Feature selection process is included in the model
* Examples:
    * Decision Trees
    * Linear model with L1 regularization
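
A small sketch of the L1 case: with strong enough regularization, many coefficients become exactly zero, so fitting the model performs the selection. The synthetic data and the value `C=0.05` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 3 informative features out of 20
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=3,
                                   n_redundant=0, n_clusters_per_class=1,
                                   random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

# Small C means strong regularization; the liblinear solver supports the L1 penalty
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear')
l1_model.fit(X_syn, y_syn)

# Features with non-zero coefficients are the ones the model "selected"
selected = np.flatnonzero(l1_model.coef_.ravel() != 0)
print(len(selected), 'of', X_syn.shape[1], 'features kept')
```

Increasing `C` weakens the penalty and lets more features back in, so `C` effectively controls the size of the selected subset.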

# Feature Engineering

Usually the dataset is not well formed when the task is given, and you have to
* preprocess initial features
* make features, based on several sources

## Sberbank Data Science Contest

In this data science challenge, the task is to predict a cardholder's gender from their transactional activity

### Let's take a look at the data

Target labels

In [None]:
df_gender = pd.read_csv('data/customers_gender_train.csv')
df_gender.head()

Transactions

In [None]:
df_transactions = pd.read_csv('data/transactions.csv.gz')
df_transactions.head()

[MCC](https://ru.wikipedia.org/wiki/Merchant_Category_Code) codes and transaction type dictionaries

In [None]:
df_tr = pd.read_csv('data/tr_types.csv', sep=';', encoding='utf8')
df_tr.head()

In [None]:
df_mcc = pd.read_csv('data/tr_mcc_codes.csv', sep=';', encoding='utf8')
df_mcc.head()

First, we see strange timestamps and amounts. You can do some analytical exercises to understand the timestamp and amount values.

Some magic operations are executed in the cells below. If you wish, you can try to understand them.

In [None]:
from pandas import Timestamp, DateOffset

In [None]:
def preproc_transactions(df_transactions):
    sec_per_day = 86400
    sec_per_hour = 3600
    
    start_date = 1420070400 - 154 * sec_per_day - 3 * sec_per_hour
    
    df_transactions.loc[:, 'day'] = df_transactions.tr_datetime\
                                               .str.split(' ')\
                                               .str.get(0)\
                                               .astype(int)
    df_transactions.loc[:, 'time_raw'] = df_transactions.tr_datetime\
                                                    .str.split(' ')\
                                                    .str.get(1)

    # set temp dt
    df_transactions.loc[:, 'dt_temp'] = pd.to_datetime(df_transactions.loc[:, 'time_raw'], 
                                                    format='%H:%M:%S')\
                                        + DateOffset(years=115)
    
    df_transactions = df_transactions.assign(dt = lambda x: x.dt_temp.astype(np.int64) // 10**9
                                             + (x.day - 153) * sec_per_day)\
                                     .assign(weekday = lambda x: ((x.day + 4) % 7 + 1))
        
    df_transactions.loc[:, 'datetime'] = pd.to_datetime(df_transactions.dt, unit='s')
    df_transactions.loc[:, 'date'] = df_transactions.loc[:, 'datetime'].dt.strftime('%Y-%m-%d')
    df_transactions.loc[:, 'hour'] = df_transactions.loc[:, 'datetime'].dt.strftime('%H')
    
    df_transactions = df_transactions.drop(['dt_temp', 'time_raw', 'tr_datetime'], axis=1)
    
    df_transactions.loc[:, 'amount'] = np.round(df_transactions.loc[:, 'amount']/(np.pi**np.exp(1)))
            
    return df_transactions

In [None]:
df_transactions = df_transactions.pipe(preproc_transactions)

In [None]:
df_transactions.head()

### Let's make new features

Propose your ideas:
1. Amounts in mcc_codes
2. Timestamp features

And implement them!)

In [None]:
df_mcc =\
df_transactions\
.query('amount >= 0')\
.pivot_table(index='customer_id', fill_value=0.0,
             aggfunc='sum',
             columns='mcc_code', 
             values='amount')\
.rename(columns=lambda x: 'mcc_{}'.format(x))

In [None]:
df_weekday = \
df_transactions.pivot_table(index='customer_id', fill_value=0.0,
                            values='amount',
                            aggfunc='count', 
                            columns='weekday')\
               .rename(columns=lambda x: 'weekday_{}'.format(x))

total = df_weekday.sum(axis=1)

# Convert counts to per-customer fractions of transactions on each weekday
df_weekday = df_weekday.div(total, axis=0)

In [None]:
def gen_features(df_input, df_transactions):
    
    # Mcc features
    df_mcc =\
    df_transactions\
    .query('amount >= 0')\
    .pivot_table(index='customer_id', fill_value=0.0,
                 aggfunc='sum',
                 columns='mcc_code', 
                 values='amount')\
    .rename(columns=lambda x: 'mcc_{}'.format(x))
    
    # Weekday features
    df_weekday = \
    df_transactions.pivot_table(index='customer_id', fill_value=0.0,
                                values='amount',
                                aggfunc='count', 
                                columns='weekday')\
                   .rename(columns=lambda x: 'weekday_{}'.format(x))

    total = df_weekday.sum(axis=1)

    # Convert counts to per-customer fractions of transactions on each weekday
    df_weekday = df_weekday.div(total, axis=0)
    
    df_features = df_input.join(df_mcc, how='left', on='customer_id')\
                       .join(df_weekday, how='left', on='customer_id')\
                       .fillna(0.0)
            
    df_features = df_features.drop(['customer_id'], axis=1)
    
    return df_features

In [None]:
df_features = df_gender.pipe(gen_features, df_transactions)

label = 'gender'
idx_features = df_features.columns != label

X = df_features.loc[:, idx_features].values
y = df_features.loc[:, ~idx_features].values.flatten()

### Simple pipeline with Hyperparameter search

In [None]:
from sklearn.preprocessing import OneHotEncoder, RobustScaler

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import lognorm as sp_lognorm
from sklearn.model_selection import RandomizedSearchCV

In [None]:
RND_SEED = 123

In [None]:
model = Pipeline([
    ('scaler', RobustScaler()),
    ('clf', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

In [None]:
param_grid = {
    'scaler__with_centering': [False, True],
    'clf__penalty': ['l1', 'l2'],
    'clf__random_state': [RND_SEED],
    'clf__C': sp_lognorm(4)
}
cv = StratifiedKFold(5, shuffle=True, random_state=123)
random_searcher = RandomizedSearchCV(model, param_grid, n_iter=100, 
                                     random_state=RND_SEED,
                                     scoring='roc_auc', 
                                     n_jobs=-1, cv=cv, 
                                     verbose=2)

random_searcher.fit(X, y)

## Categorical features

* Label encoding
* One-hot encoding 
* Independent separate dataset
* Conjunction of two (or more) categorical features = new categorical feature
* Target Encoding
    * Consider feature $f$ and category `cat_i` in it. Raw values can be encoded via target feature in the following way:
$$ cat\_i\_meantarget = \frac{nrows\_i\cdot mean\_i(target) + \alpha \cdot global\_mean}{nrows\_i + \alpha} $$

In [None]:
df_data.head()

In [None]:
alpha = 10
cv = StratifiedKFold(5, shuffle=True, random_state=123)

def target_encoding(df, col_name, target_name, cv=StratifiedKFold(), alpha=10):
    '''
    Function takes dataframe, categorical feature name and target feature name 
    and computes mean target encoding for that feature using cross-validation
    '''
    
    ## Your Code Here

What about test set in this approach?
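
One possible way to approach the exercise and the question above, given as a hedged sketch rather than the reference solution: each row is encoded with statistics computed on the other folds only, so its own target value never leaks into its encoding. The helper name `target_encoding_sketch` and the toy frame `df_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def target_encoding_sketch(df, col_name, target_name, cv, alpha=10):
    """Out-of-fold smoothed mean target encoding (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)

    for train_idx, valid_idx in cv.split(df, df[target_name]):
        train = df.iloc[train_idx]
        global_mean = train[target_name].mean()
        stats = train.groupby(col_name)[target_name].agg(['sum', 'count'])
        # (nrows_i * mean_i(target) + alpha * global_mean) / (nrows_i + alpha)
        smoothed = (stats['sum'] + alpha * global_mean) / (stats['count'] + alpha)
        # Categories unseen in the training folds fall back to the global mean
        values = df[col_name].iloc[valid_idx].map(smoothed).fillna(global_mean)
        encoded.iloc[valid_idx] = values.values

    return encoded

# Toy data (made up for illustration)
df_toy = pd.DataFrame({
    'city':   ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
    'target': [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
})
cv_toy = StratifiedKFold(2, shuffle=True, random_state=123)
encoded = target_encoding_sketch(df_toy, 'city', 'target', cv=cv_toy, alpha=10)
print(encoded)
```

For a test set, a common choice (assumed here, not stated in the seminar) is to compute the same smoothed statistics once on the whole training set and map them onto the test categories.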

## Nearest Neighbour Features

Sometimes features based on the nearest neighbours of an object can be helpful.<br/>
So you set various $k$ and calculate features like:
* Fraction of objects of every class (basically kNN prediction)
* Same label streak: the largest number N, such that N nearest neighbors have the same label
* Minimum (normalized) distance to objects of each class
* Mean distance to neighbors of each class
* Mean feature values of neighbours of each class
* ...
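
A few of the features from the list above, sketched with scikit-learn's `NearestNeighbors` on synthetic two-class data. The data, the value `k = 5` and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

# Synthetic two-class data (illustrative only)
X_knn = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(3.0, 1.0, size=(50, 2))])
y_knn = np.array([0] * 50 + [1] * 50)
k = 5

# Ask for k + 1 neighbors: each training point is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_knn)
dist, idx = nn.kneighbors(X_knn)
dist, idx = dist[:, 1:], idx[:, 1:]
neigh_labels = y_knn[idx]

# Fraction of neighbors of each class (basically the kNN prediction)
frac = np.stack([(neigh_labels == c).mean(axis=1) for c in (0, 1)], axis=1)

# Same label streak: number of leading neighbors sharing the nearest neighbor's label
differs = neigh_labels != neigh_labels[:, [0]]
streak = np.where(differs.any(axis=1), differs.argmax(axis=1), k)

# Mean distance to the k nearest neighbors
mean_dist = dist.mean(axis=1)
```

Because these features use the labels of neighbouring training objects, they are usually computed out-of-fold for the training set, in the same spirit as target encoding.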


Find more [here](https://github.com/hse-aml/competitive-data-science/blob/master/Programming%20assignment%2C%20week%204:%20KNN%20features/compute_KNN_features.ipynb)