<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Madelon: 4-Stability Selection Pipeline

_Authors: Blake Cannon (DEN)_

---
MADELON is an artificial dataset that was part of the NIPS 2003 feature selection challenge. It is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear. Of the 500 attributes, only 20 are informative; the rest are noise.

### Stability Selection Pipeline

This is the fourth in a series of Jupyter Notebooks. It uses a pipeline to select features.


## Import packages

In [1]:
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

## Import Pickles

In [2]:
# Loading the data saved from the last notebook
X_train = np.load('./_data/X_train.npy')
y_train = np.load('./_data/y_train.npy')
X_val = np.load('./_data/X_val.npy')
y_val = np.load('./_data/y_val.npy')
X_test = np.load('./_data/X_test.npy')

## Stability Selection Pipeline


Above, we used a 

In [3]:
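# Import packages
# Note: RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this cell needs an older scikit-learn release.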
from sklearn.linear_model import RandomizedLogisticRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

In [4]:
# Instantiate and fit the logistic regression model
logr = LogisticRegression()
logr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [5]:
# Manually chosen threshold for testing
threshold = 0.05

In [6]:
stability_selection = RandomizedLogisticRegression(n_resampling=300,
                                                   n_jobs=1,
                                                   random_state=101,
                                                   scaling=0.15,
                                                   sample_fraction=0.50,
                                                   selection_threshold=threshold)
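
For readers on a scikit-learn release that no longer ships `RandomizedLogisticRegression`, the general idea behind stability selection can be sketched by hand: fit an L1-penalized model on many random subsamples, randomly shrink some columns each round, and record how often each feature ends up with a nonzero coefficient. The block below is a simplified illustration and not the library's implementation. The helper name `stability_scores`, its defaults, and the synthetic `X_demo`/`y_demo` data are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def stability_scores(X, y, n_resampling=50, sample_fraction=0.5,
                     scaling=0.5, C=0.1, seed=0):
    """Fraction of resampling rounds in which each feature gets a nonzero L1 coefficient.

    Hypothetical helper: a simplified sketch of stability selection.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # Draw a random subsample of the rows without replacement.
        rows = rng.choice(n_samples, size=n_sub, replace=False)
        # Shrink a random half of the columns, which raises their effective L1 penalty.
        weights = np.where(rng.random(n_features) < 0.5, scaling, 1.0)
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        clf.fit(X[rows] * weights, y[rows])
        # Count the features that survived the L1 penalty in this round.
        counts += (clf.coef_.ravel() != 0)
    return counts / n_resampling


# Synthetic demo data (assumption): only the first two of ten columns carry signal.
demo_rng = np.random.default_rng(101)
X_demo = demo_rng.normal(size=(200, 10))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1, -1)

scores = stability_scores(X_demo, y_demo)
print(np.round(scores, 2))
```

Features whose score meets a chosen threshold would be kept, which mirrors the role of `selection_threshold` above.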



In [7]:
interactions = PolynomialFeatures(degree=4, interaction_only=True)

In [8]:
model = make_pipeline(stability_selection, interactions, logr)

In [9]:
model.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('randomizedlogisticregression', RandomizedLogisticRegression(C=1, fit_intercept=True, memory=None, n_jobs=1,
               n_resampling=300, normalize=True, pre_dispatch='3*n_jobs',
               random_state=101, sample_fraction=0.5, scaling=0.15,
               selection_threshold=0.05, ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [10]:
print('Number of features picked by stability selection: %i' % np.sum(model.steps[0][1].all_scores_ >= threshold))

Number of features picked by stability selection: 13
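
Selecting features before the `PolynomialFeatures` step matters because `interaction_only=True` with `degree=4` creates one column for every product of up to four distinct features, plus a bias column by default. A quick count, using a hypothetical helper `n_interaction_terms` that is not part of the original notebook, shows how the output width grows:

```python
from math import comb


def n_interaction_terms(n_features, degree, include_bias=True):
    """Number of columns for products of up to `degree` distinct features."""
    # One term for every way of choosing k distinct features, for k = 1..degree.
    total = sum(comb(n_features, k) for k in range(1, degree + 1))
    # The bias column is the single degree-0 term.
    return total + (1 if include_bias else 0)


print(n_interaction_terms(13, 4))   # 1093 columns from the 13 selected features
print(n_interaction_terms(500, 4))  # width if all 500 features were expanded
```

With 13 selected features the expansion has 1,093 columns, while expanding all 500 original features would produce over two billion.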


In [11]:
print('Area Under the Curve: %0.5f' % roc_auc_score(y_val, model.predict_proba(X_val)[:,1]))

Area Under the Curve: 0.89703
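
As a reminder of what this number measures, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on a tiny made-up example. The function name `pairwise_auc` and the toy labels and scores are assumptions for illustration, not values from this notebook.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    # A correctly ordered pair scores 1, a tie scores 0.5, a misordered pair scores 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with labels in {-1, +1}.
toy_labels = [1, 1, -1, -1]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

This quadratic-time version is only meant to make the definition concrete; `roc_auc_score` is the practical choice on real data.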


In [12]:
feature_filter = model.steps[0][1].all_scores_ >= threshold

In [13]:
# Indices of the features whose stability score met the threshold
important_features = np.flatnonzero(feature_filter).tolist()
print('Number of important features:', len(important_features))
print('List of important features:', important_features)

Number of important features: 13
List of important features: [48, 64, 105, 128, 241, 323, 336, 338, 378, 442, 453, 472, 475]
