# Step 1 - Benchmarking

### Domain and Data

Prepared for the Neural Information Processing Systems (NIPS) 2003 Feature Extraction Workshop

http://clopinet.com/isabelle/Projects/NIPS2003

MADELON is an artificial dataset that was part of the NIPS 2003 feature selection challenge. It poses a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

MADELON contains data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features, one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of distractor features called 'probes', which have no predictive power, were also added. The order of the features and patterns was randomized.
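
The construction described above can be approximated with scikit-learn's `make_classification`, whose generator is adapted from the one designed for MADELON. This is only a rough sketch, not the challenge data itself: the sample count and the random seed are arbitrary assumptions, `n_features=500` assumes the usual MADELON width, and the generated labels are 0/1 rather than +1/-1.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumed sizes for illustration: 5 informative features, 15 redundant
# linear combinations of them, and the remaining columns as noise "probes".
X, y = make_classification(
    n_samples=2000,           # arbitrary choice for this sketch
    n_features=500,
    n_informative=5,
    n_redundant=15,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 hypercube vertices
    shuffle=True,             # randomize the order of samples and features
    random_state=0,
)

print(X.shape)
print(np.unique(y).tolist())
```

Because the features are shuffled, the informative columns are not identifiable by position, which is what makes feature selection on this kind of data non-trivial.
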

### Problem Statement

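Our dataset has 500 features to select from. Only 5 of them are truly informative, another 15 are redundant linear combinations of those 5, and the rest are probes. The task is to identify the features that give the best classification performance.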

### Solution Statement

We will fit machine learning models and use them to identify the most informative features.

### Metric

Success is measured by model score: we compare train and test scores across models and aim for the highest test score.

### Benchmark

Our benchmark is a naive Logistic Regression with default settings.

In [1]:
from os import chdir
chdir('./lib')

In [2]:
import pandas as pd
#from sqlalchemy import create_engine
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge, SGDRegressor, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

In [3]:
from project_5 import connect_to_postgres, load_data_from_database, make_data_dict, general_model, general_transformer

In [4]:
df = load_data_from_database('postgresql://dsi:correct horse battery staple@joshuacook.me:5432')

In [5]:
data_dict = make_data_dict(df.drop('label', axis=1), df['label'], 0.25, 51)

In [6]:
general_transformer(StandardScaler(), data_dict)

{'X':        feat_000  feat_001  feat_002  feat_003  feat_004  feat_005  feat_006  \
 index                                                                         
 0           485       477       537       479       452       471       491   
 1           483       458       460       487       587       475       526   
 2           487       542       499       468       448       471       442   
 3           480       491       510       485       495       472       417   
 4           484       502       528       489       466       481       402   
 5           481       496       451       480       516       484       484   
 6           484       533       498       466       577       482       471   
 7           474       468       598       490       530       482       448   
 8           484       498       558       489       508       478       471   
 9           496       448       570       476       477       481       595   
 10          478       446       45

In [7]:
log_model = general_model(LogisticRegression(C=1.0, penalty='l2'), data_dict)

In [38]:
print(log_model['model'])
print(log_model['test_score'])
print(log_model['train_score'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.548484848485
0.813432835821
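
The `project_5` helpers used above are not shown in this notebook, so the following is only a rough, self-contained sketch of an equivalent benchmark in plain scikit-learn. It assumes that the `0.25` and `51` passed to `make_data_dict` are the test fraction and the random seed, and it uses synthetic stand-in data because the database is not available here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the real features come from the database.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5,
                           n_redundant=15, random_state=0)

# Hold out 25% of the rows for testing (assumed meaning of 0.25 and 51).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=51)

# Scale, then fit a default logistic regression (C=1.0 with an L2 penalty).
# The scaler is fit on the training rows only, so no test data leaks in.
benchmark = make_pipeline(StandardScaler(), LogisticRegression())
benchmark.fit(X_train, y_train)

# For classifiers, score() reports mean accuracy.
train_score = benchmark.score(X_train, y_train)
test_score = benchmark.score(X_test, y_test)
print('train accuracy: %.3f' % train_score)
print('test accuracy:  %.3f' % test_score)
```

In the notebook's own results, the train score (about 0.81) is far above the test score (about 0.55), which suggests the default model is overfitting the many uninformative features.
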


In [34]:
log_model_df = pd.DataFrame({'coef': log_model['model'].coef_[0],
                             'features': log_model['X'].columns})

In [35]:
log_model_df.sort_values(by='coef', ascending=False).head(10)

| index | coef | features |
|---|---|---|
| 433 | 0.897586 | feat_433 |
| 472 | 0.66765 | feat_472 |
| 475 | 0.520365 | feat_475 |
| 48 | 0.469166 | feat_048 |
| 493 | 0.408383 | feat_493 |
| 241 | 0.390909 | feat_241 |
| 56 | 0.378339 | feat_056 |
| 46 | 0.376963 | feat_046 |
| 453 | 0.362375 | feat_453 |
| 494 | 0.334256 | feat_494 |
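
Sorting by the raw coefficient only surfaces the largest positive weights. A feature with a large negative coefficient can matter just as much to a linear model, so one possible variation is to rank by absolute value instead. The sketch below uses made-up feature names and coefficients purely for illustration; they are not results from this notebook.

```python
import pandas as pd

# Hypothetical coefficients for illustration only (not the notebook's results).
coef_df = pd.DataFrame({
    'features': ['feat_a', 'feat_b', 'feat_c', 'feat_d'],
    'coef': [0.90, -1.20, 0.05, -0.40],
})

# Rank by magnitude so strongly negative weights are not pushed to the bottom.
coef_df['abs_coef'] = coef_df['coef'].abs()
ranked = coef_df.sort_values(by='abs_coef', ascending=False)

print(ranked['features'].tolist())  # ['feat_b', 'feat_a', 'feat_d', 'feat_c']
```

Applied to `log_model_df`, the same idea would add an `abs_coef` column before calling `sort_values`.
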


<img src="assets/benchmarking.png" width="600px">