# Filter-Based Feature Selection: Selecting the Top N Features

In this notebook we use a public dataset to perform filter-based feature selection, keeping the top N features that are most informative about the class vector.

## Data

The website https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ hosts a collection of datasets for classification and regression. We will use some of them to test our feature selection algorithms.

In [1]:
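try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

filename = "german.numer_scale"  # 1000 x 24
url = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/" + filename
f = urlretrieve(url, filename)  # download the file into the working directory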

## Preprocessing of the data

_MLlib_ relies on _LabeledPoint_ as the data structure that stores a numerical vector (dense or sparse) and a numerical label. An RDD of LabeledPoint represents the dataset given as input to train or test supervised machine learning models.

Spark provides a built-in function that transforms a libsvm dataset into an RDD[LabeledPoint].

In [2]:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

rdd = MLUtils.loadLibSVMFile(sc, filename)
ncols = rdd.first().features.size  # number of columns (no class) of the dataset

In [3]:
rdd.first()

LabeledPoint(-1.0, (24,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],[-1.0,-0.941176,1.0,-0.89011,1.0,1.0,0.333333,1.0,-1.0,0.714286,1.0,-0.333333,-1.0,1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,1.0,-1.0,-1.0,1.0]))
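
To make the record printed above less opaque, here is a minimal sketch of how one line of a libsvm file maps to a label plus a sparse vector. The helper `parse_libsvm_line` is hypothetical and written only for illustration; it is not Spark's implementation. It assumes the usual libsvm layout `label index:value ...` with one-based indices, which are shifted to zero-based as in the output above.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, indices, values).

    Hypothetical helper for illustration only. Indices in the file are
    one-based; they are shifted to zero-based here.
    """
    tokens = line.strip().split()
    label = float(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        idx, val = token.split(":")
        indices.append(int(idx) - 1)  # one-based -> zero-based
        values.append(float(val))
    return label, indices, values


# A shortened, made-up line in the same style as the dataset
label, indices, values = parse_libsvm_line("-1 1:-1 2:-0.941176 3:1")
```

The triple `(label, indices, values)` carries the same information as the label and the sparse feature vector of a LabeledPoint.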

## Pearson correlation coefficients

In this notebook we first compute the Pearson correlation coefficients (PCCs) between the class and each feature (_scoreClass_), then the PCCs between every pair of features (_scoreMatrix_). Once these intermediate results are available, we proceed to the feature selection.

Below we define the functions used during the _map_ and _reduce_ stages of the distributed calculations.
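
As a reminder of what a PCC measures, the following sketch computes it by hand for two short vectors: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson` and the toy vectors are assumptions made for this example; the notebook itself relies on `pearsonr` from SciPy.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


labels = [-1.0, -1.0, 1.0, 1.0]   # toy class vector
feature = [0.0, 1.0, 2.0, 3.0]    # toy feature vector
r = pearson(labels, feature)
```

A value close to 1 or -1 indicates a strong linear association, which is why the selection later ranks features by the absolute value of the score.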

In [4]:
from scipy.stats import pearsonr
from sklearn.metrics import normalized_mutual_info_score
from sklearn.feature_selection import mutual_info_regression
import numpy as np

def meltLPclass(lp):
    '''
    This function creates a list of k,v tuples, one per each
    label-feature combination. 'k' corresponds to the index
    of the feature and 'v' corresponds to a tuple of two
    elements: value of the label, value of the feature
    
    Parameters
    ----------
    lp : LabeledPoint
        a point in the feature space with label
    '''
    label = lp.label
    features = lp.features
    r = range(features.size)
    return [(i, (label, features[i])) for i in r]

def corr(x):
    '''
    This function calculates the Pearson correlation coefficient
    between two variables. It returns the index of the feature and
    its correlation coefficient
    
    Parameters
    ----------
    x : tuple
        x[0] is a scalar value (or a tuple), representing the index(es) of the feature(s)
        x[1] is a pyspark.resultiterable.ResultIterable object
    '''
    idx = x[0]
    values = list(x[1])
    
    v1, v2 = zip(*values)
    p = pearsonr(v1, v2)[0]
    
    return (idx, p)

def mi(x):
    '''
    This function calculates the normalized mutual information between two discrete variables.
    It relies on the 'normalized_mutual_info_score' function available in the
    sklearn package.
    
    Doc: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html
    
    Parameters
    ----------
    x : tuple
        x[0] is a scalar value (or a tuple), representing the index(es) of the feature(s)
        x[1] is a pyspark.resultiterable.ResultIterable object
    '''
    idx = x[0]
    values = list(x[1])
    
    v1, v2 = zip(*values)
    res = normalized_mutual_info_score(v1, v2)
    
    return (idx, res)

def miCont(x):
    '''
    This function calculates the Mutual information between two continuous variables.
    It relies on the 'mutual_info_regression' function available in the
    sklearn package.
    
    Doc: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html
    
    Parameters
    ----------
    x : tuple
        x[0] is a scalar value (or a tuple), representing the index(es) of the feature(s)
        x[1] is a pyspark.resultiterable.ResultIterable object
    '''
    idx = x[0]
    values = list(x[1])
    
    v1, v2 = zip(*values)
    V1 = np.array(v1).reshape(len(v1),1)  # 1-column matrix layout transformation
    res = mutual_info_regression(V1, v2, discrete_features=False, random_state=42)[0]  # random_state is set to provide deterministic results
    
    return (idx, res)
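
For intuition about the quantity behind _mi_, the sketch below computes plain (unnormalized) mutual information, in nats, between two discrete variables from their co-occurrence counts. The function `mutual_information` is a hypothetical helper written for this example. It does not apply the entropy-based rescaling that `normalized_mutual_info_score` performs, so its values are not directly comparable with those returned by _mi_.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete sequences."""
    n = float(len(xs))
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    total = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with counts
        total += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return total


mi_same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])   # identical variables
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent variables
```

Two identical binary variables with balanced values share log(2) nats of information, while two independent ones share none.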

Starting from the RDD (_rdd_) that represents our dataset:
- the _flatMap_ operation iterates over every instance and produces intermediate tuples containing the feature index and the label-feature pair of values. The feature index must be the key of the tuple, so that all the tuples of the same feature can be grouped in the Reducers.
- the _groupByKey_ operation gathers the tuples that share the same key in the Reducers (one group per key).
- the _map_ operation for feature _j_ receives all the label-feature data of feature _j_, that is, the vectors of the label and of feature _j_. With these data in one place, the _corr_ function can calculate the Pearson correlation coefficient. You can plug in other (custom) functions to assess the association between the two variables, such as _mi_ and _miCont_.
- the _collect_ operation collects all the resulting data in the Spark driver process.
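
The steps listed above can be emulated locally, without Spark, to see what each stage produces. This is only a sketch under stated assumptions: the toy dataset, the `melt` and `pearson_pairs` helpers, and the plain-Python grouping are all invented for illustration and stand in for _meltLPclass_, _corr_ and the distributed operations.

```python
from collections import defaultdict
from math import sqrt

# Toy dataset: (label, features) pairs standing in for LabeledPoints
toy = [(-1.0, [0.0, 3.0]), (-1.0, [1.0, 2.0]),
       (1.0, [2.0, 1.0]), (1.0, [3.0, 1.0])]


def melt(point):
    """Emit one (feature index, (label, value)) tuple per feature."""
    label, features = point
    return [(i, (label, v)) for i, v in enumerate(features)]


def pearson_pairs(pairs):
    """Pearson correlation of a list of (label, value) pairs."""
    xs, ys = zip(*pairs)
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# flatMap: every instance is melted into one tuple per feature
melted = [kv for point in toy for kv in melt(point)]

# groupByKey: tuples with the same feature index end up together
groups = defaultdict(list)
for key, pair in melted:
    groups[key].append(pair)

# map: one score per feature index
scores = {key: pearson_pairs(pairs) for key, pairs in groups.items()}
```

Note that `melted` holds one tuple per instance and per feature, so the intermediate data grows with the product of the two.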

In [5]:
fscores = rdd.flatMap(meltLPclass).groupByKey().map(corr).collect()

In [6]:
fscores[0:5]

[(0, -0.35084747767184321),
 (2, -0.22878473305454464),
 (4, -0.1789427359379214),
 (6, -0.088184281454553884),
 (8, 0.14261199150183776)]

Because of the distributed computation, the order of the scores in _fscores_ may differ from the order of the features. This is why the _corr_ function returns the feature index along with the Pearson correlation coefficient. We therefore need to sort the data by feature index. The result is the _scoreClass_ vector.

In [7]:
fscoresIdx, fscoresScore = zip(*fscores)
scoreClass = [fscoresScore[fscoresIdx.index(i)] for i in range(ncols)]

In [8]:
fscoresIdx[0:3]

(0, 2, 4)

In [9]:
fscoresScore[0:3]

(-0.35084747767184321, -0.22878473305454464, -0.1789427359379214)

In [10]:
scoreClass[0:3]

[-0.35084747767184321, 0.21492668774990711, -0.22878473305454464]
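
The reordering above looks up each index with `tuple.index`, which scans the tuple once per feature. As a possible alternative, sketched here on a small hypothetical list of (index, score) pairs with rounded values, a dictionary gives the same ordering with a single pass. The names ending in `_toy` are assumptions for this example and are not used elsewhere in the notebook.

```python
# Hypothetical (feature index, score) pairs in arbitrary order
fscores_toy = [(0, -0.35), (2, -0.23), (1, 0.21)]

# Build an index -> score lookup, then read it back in feature order
score_by_idx = dict(fscores_toy)
ncols_toy = len(fscores_toy)
score_class_toy = [score_by_idx[i] for i in range(ncols_toy)]
```

For 24 features the difference is negligible; the dictionary only matters when the number of features becomes large.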

## Feature Selection

### Top n-features

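Given the label-feature scores, we select the top _nfs_ features whose scores have the largest absolute value, that is, the features that correlate most strongly (positively or negatively) with the label. _fsIdx_ stores the indexes of the selected features.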

In [11]:
nfs = 5  # number of features to select

In [12]:
# Pair each absolute score with its feature index and rank in descending order
df = sorted(zip([abs(x) for x in scoreClass], range(len(scoreClass))),
            key=lambda tup: tup[0], reverse=True)
fsIdx = [x[1] for x in df[0:nfs]]

In [13]:
df[0:5]

[(0.35084747767184321, 0),
 (0.22878473305454464, 2),
 (0.21492668774990711, 1),
 (0.1789427359379214, 4),
 (0.15406676409013534, 3)]

In [14]:
fsIdx[0:5]

[0, 2, 1, 4, 3]
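
The ranking step can also be written as a small reusable function. The sketch below is one possible formulation, not the notebook's own code: `top_n_features` and the toy scores are hypothetical, and the function ranks the feature indexes directly by the absolute value of their score.

```python
def top_n_features(scores, n):
    """Return the indexes of the n scores with the largest absolute value."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: abs(scores[i]), reverse=True)
    return ranked[:n]


# Hypothetical scores, one per feature; the sign does not affect the ranking
toy_scores = [-0.35, 0.21, -0.23, 0.15, -0.18]
selected = top_n_features(toy_scores, 3)
```

Because `sorted` is stable, features with exactly the same absolute score keep their original relative order.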

The final step is to reduce the dimensionality of _rdd_ according to the selected features.

In [15]:
from pyspark.mllib.linalg import Vectors

def reduceLP(lp, fsIdx):
    label = lp.label
    features = lp.features
    v = [features[i] for i in fsIdx]
    return LabeledPoint(label, Vectors.dense(v))

rddFS = rdd.map(lambda x: reduceLP(x, fsIdx))

In [16]:
rddFS.first()

LabeledPoint(-1.0, [-1.0,1.0,-0.941176,1.0,-0.89011])

## Food for thought

1. Does this algorithm scale with the number of instances? That is, given 100M instances instead of the current 1K, will this code work or will it crash?
2. What about the same question with respect to the number of features?
3. For this feature selection, do we need to rely on the LabeledPoint data structure? Can we use another (simpler) data structure?
4. Can we calculate the correlation directly in the _map_ phase instead of going through the _map_ and _reduce_ phases? Why?
5. This dataset has 1000 instances, 24 features and 1 class feature. Can you calculate the number of tuples produced by the _flatMap_ operation? Can you estimate the ratio between the size of such intermediate results and the original dataset?