<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Madelon: 07-Run on Database Dataset

_Authors: Blake Cannon (DEN)_

---
MADELON is an artificial dataset that was part of the NIPS 2003 feature selection challenge. It is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear: among 500 attributes, only 20 are informative; the rest are noise.
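
To get a feel for this structure without touching the database, a Madelon-like dataset can be generated with scikit-learn's `make_classification`, whose generator is adapted from the one used to build Madelon. The sketch below is illustrative only. It assumes the 20 informative attributes break down into 5 truly informative features plus 15 linear combinations of them, a breakdown that is not stated in this notebook. The names `X_demo` and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_classification

# Assumed breakdown: 5 informative + 15 redundant features make up the 20
# "useful" columns; the remaining 480 columns are pure noise.
X_demo, y_demo = make_classification(
    n_samples=2000,
    n_features=500,
    n_informative=5,
    n_redundant=15,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=16,  # 2 * 16 clusters fit on the 2**5 hypercube vertices
    random_state=0,
)
print(X_demo.shape, y_demo.shape)
```

Because the useful columns are shuffled in among the noise columns, a feature selection step is needed before any classifier can do well.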

### Notebook 7

This is the seventh Jupyter Notebook in the series. It runs a stability selection pipeline on the full Madelon dataset loaded from the database.

## Import packages

In [None]:
import numpy as np

from sklearn.model_selection import train_test_split
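# Note: RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this import requires an older scikit-learn release.
from sklearn.linear_model import RandomizedLogisticRegression, LogisticRegression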
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

## Load Pickles and Split

In [None]:
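# Load the data saved in the previous notebook (a pickled pandas DataFrame).
# Recent NumPy versions refuse to unpickle unless allow_pickle=True is passed.
X = np.load('./_data/madelon_db.p', allow_pickle=True)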

In [None]:
X.shape

In [None]:
# Column 1001 is used as the target; .values replaces the removed as_matrix()
y = X[1001].values
y

In [None]:
# Keep columns 0 through 1000 as the feature matrix
cols = list(range(1001))
X = X[cols]
# .values replaces as_matrix(), which was removed in pandas 1.0
X = X.values
X.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Run Stability Selection Pipeline

In [None]:
# Instantiate and fit the logistic regression model
logr = LogisticRegression()
logr.fit(X_train,y_train)

In [None]:
# Threshold chosen from earlier testing
threshold = 0.04

In [None]:
stability_selection = RandomizedLogisticRegression(n_resampling=300,
                                                   n_jobs=1,
                                                   random_state=101,
                                                   scaling=0.15,
                                                   sample_fraction=0.50,
                                                   selection_threshold=threshold)
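
Since `RandomizedLogisticRegression` is unavailable in current scikit-learn releases, the core idea of stability selection can be sketched by hand: fit an L1-penalized logistic regression on many random subsamples and record how often each feature receives a nonzero coefficient. The function below is a simplified sketch, not the library's implementation. It omits the random rescaling of per-feature penalties controlled by `scaling`, and the helper name `stability_scores`, the value of `C`, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def stability_scores(X, y, n_resampling=50, sample_fraction=0.5, C=0.1, random_state=0):
    """Return, per feature, the fraction of subsample fits with a nonzero L1 coefficient."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # Draw a subsample of rows without replacement
        idx = rng.choice(n_samples, size=subsample_size, replace=False)
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        clf.fit(X[idx], y[idx])
        # Count the features that survived the L1 penalty in this fit
        counts += np.abs(clf.coef_.ravel()) > 1e-10
    return counts / n_resampling


# Toy data (hypothetical): 20 features, of which 3 are informative
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=3,
                                   n_redundant=0, random_state=0)
scores = stability_scores(X_toy, y_toy)
selected = np.flatnonzero(scores >= 0.04)
print('Selected feature indices:', selected.tolist())
```

A low threshold such as 0.04 keeps any feature that is selected in at least 4% of the resampled fits, so it errs on the side of retaining features.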

In [None]:
interactions = PolynomialFeatures(degree=4, interaction_only=True)
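
With `interaction_only=True`, `PolynomialFeatures` produces only products of distinct features (no squares or higher powers), plus a bias column. The small check below, using a hypothetical 5-feature input, shows how the number of output columns grows: it is the sum of the binomial coefficients C(n, k) for k from 0 to the degree.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical input: 2 rows, 5 features
X_small = np.arange(10, dtype=float).reshape(2, 5)

expanded = PolynomialFeatures(degree=4, interaction_only=True).fit_transform(X_small)

# k = 0 is the bias column, k = 1 the original features, k = 2..4 the interactions
expected_cols = sum(comb(5, k) for k in range(5))
print(expanded.shape[1], expected_cols)
```

This growth is why the interaction step comes after feature selection in the pipeline: the column count rises quickly with the number of features that pass the selection step.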

In [None]:
model = make_pipeline(stability_selection, interactions, logr)

In [None]:
model.fit(X_train, y_train)

In [None]:
print('Number of features picked by stability selection: %i' % np.sum(model.steps[0][1].all_scores_ >= threshold))

In [None]:
# Evaluate on the held-out test split created by train_test_split
print('Area Under the Curve: %0.5f' % roc_auc_score(y_test, model.predict_proba(X_test)[:,1]))
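
`roc_auc_score` takes the predicted probability of the positive class rather than hard labels, which is why the second column of `predict_proba` is passed in. A tiny example with made-up labels and scores shows how the metric behaves: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```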

In [None]:
feature_filter = model.steps[0][1].all_scores_ >= threshold

In [None]:
# Collect the column indices whose stability score meets the threshold
important_features = [idx for idx, keep in enumerate(feature_filter) if np.any(keep)]
print('Number of important features:', len(important_features))
print('List of important features:', important_features)