# Step 1 - Benchmarking

### Domain and Data

Automatic feature selection needs to be shown to be a feasible and productive option for datasets with a large number of features.

The data for this problem come from the NIPS 2003 feature selection challenge, specifically the dataset called Madelon.  It is a synthetic dataset with 5 informative features, 15 features generated from those first 5, and 480 features of uninformative noise.  There is a single target vector, which takes only the values -1 and 1.  Since the target has only 2 values, a logistic regression is a natural place to start.
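To make the feature layout concrete, the following is a minimal sketch that generates a Madelon-like dataset with scikit-learn's `make_classification`. It is a stand-in and does not reproduce the actual Madelon data. The sample count of 2000 and the names `X_syn` and `y_syn` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Madelon-like layout: 5 informative features, 15 redundant features
# (linear combinations of the informative ones), and 480 noise features.
X_syn, y_syn = make_classification(
    n_samples=2000,      # assumed sample count, for illustration only
    n_features=500,
    n_informative=5,
    n_redundant=15,
    n_repeated=0,
    n_classes=2,
    shuffle=False,       # keep informative and redundant columns first
    random_state=0,
)

# make_classification returns labels 0/1; recode them to -1/1
# to match the labels described above.
y_syn = np.where(y_syn == 0, -1, 1)

print(X_syn.shape)
print(np.unique(y_syn))
```

With `shuffle=False`, the first 20 columns hold the informative and redundant features and the remaining 480 are noise, which mirrors the 5 / 15 / 480 split described above.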



### Problem Statement

A baseline is needed to establish whether automatic feature selection improves the predictive power of the models and the selection of salient features.  The baseline treats all the features as relevant and uses them to predict the target vector.  The results are expected to be poor, because there are 480 features of noise and only 20 features of signal.  Later models will attempt to improve the signal-to-noise ratio.

### Solution Statement

Since this step is intended to establish a baseline, the solution is a score for improved models to be tested against.  The expected accuracy of the prediction is about 50%.


### Metric

The metric used for comparing models is mean accuracy.  It is the default scoring mechanism for scikit-learn's `LogisticRegression`.  If non-binary models are introduced later, this metric will need to be revisited.
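As a small illustration of the metric, mean accuracy is the fraction of predictions that match the true labels. The label arrays below are made up for the example and are not taken from the Madelon data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only.
y_true = np.array([-1, 1, 1, -1, 1])
y_pred = np.array([-1, 1, -1, -1, 1])

# Mean accuracy: the share of positions where prediction equals truth.
accuracy = np.mean(y_true == y_pred)

# scikit-learn's accuracy_score computes the same quantity.
assert accuracy == accuracy_score(y_true, y_pred)
print(accuracy)
```

Four of the five predictions match, so the mean accuracy here is 0.8.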

### Benchmark

The dataset is evenly split between labels 1 and -1.  The minimum baseline of a random guess is 50%.  The baseline model with a logistic regression will be used as the benchmark for the improvement seen with the automatic feature selection.
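The 50% floor can be checked with a trivial classifier. The sketch below uses scikit-learn's `DummyClassifier` on a made-up, perfectly balanced label vector. The names `y_bal`, `X_dummy`, and `floor` are illustrative and the labels are not the Madelon labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical balanced labels: 1000 of each class.
y_bal = np.array([-1, 1] * 1000)

# The dummy classifier ignores the features, so a placeholder column is enough.
X_dummy = np.zeros((y_bal.size, 1))

# Always predict the most frequent class; with balanced labels this
# is right exactly half of the time.
dummy = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_bal)
floor = dummy.score(X_dummy, y_bal)
print(floor)
```

Any model that cannot beat this floor has learned nothing useful from the features.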


## Implementation

There is a library file with some functions to help build the models quickly and consistently.  The functions are in the module `lib/project_5.py`.
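The source of the library is not shown here. As a rough guide to how the helpers are used below, here is a hypothetical sketch of two of them. The names `make_data_dict_sketch` and `general_model_sketch`, the `test_size` default, and the function bodies are all assumptions. Only the dictionary keys (`X_train`, `y_train`, `X_test`, `y_test`, `train_score`, `test_score`) and the list-of-steps pattern are taken from how the notebook uses the real functions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def make_data_dict_sketch(X, y, test_size=0.25, random_state=None):
    """Hypothetical: split the data and return a list holding one step dict."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    return [{'X_train': X_train, 'y_train': y_train,
             'X_test': X_test, 'y_test': y_test}]


def general_model_sketch(model, X_train, y_train, X_test, y_test):
    """Hypothetical: fit the model and record train and test mean accuracy."""
    model.fit(X_train, y_train)
    return {'model': model,
            'X_train': X_train, 'y_train': y_train,
            'X_test': X_test, 'y_test': y_test,
            'train_score': model.score(X_train, y_train),
            'test_score': model.score(X_test, y_test)}


# Usage on small synthetic data (not the Madelon data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)
steps = make_data_dict_sketch(X_demo, y_demo, random_state=43)
last = steps[-1]
steps.append(general_model_sketch(LogisticRegression(),
                                  last['X_train'], last['y_train'],
                                  last['X_test'], last['y_test']))
print(sorted(steps[-1].keys()))
```

Keeping each step as a dictionary appended to a list makes it easy to look up the most recent result with an index of `-1`, which is the pattern the cells below rely on.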

In [11]:
from os import chdir
chdir('../')  # move up one directory so that the lib package can be imported
from lib.project_5 import load_data_from_database, make_data_dict, general_transformer, general_model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [12]:
params = {'user_name' : "dsi_student", 
          'password' : "correct horse battery staple",
          'url': 'joshuacook.me',
          'port' : "5432", 
          'database' : "dsi", 
          'table' : "madelon"}

madelon_df = load_data_from_database(**params)
madelon_df.head()

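|  | index | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | ... | feat_491 | feat_492 | feat_493 | feat_494 | feat_495 | feat_496 | feat_497 | feat_498 | feat_499 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 485 | 477 | 537 | 479 | 452 | 471 | 491 | 476 | 475 | ... | 481 | 477 | 485 | 511 | 485 | 481 | 479 | 475 | 496 | -1 |
| 1 | 1 | 483 | 458 | 460 | 487 | 587 | 475 | 526 | 479 | 485 | ... | 478 | 487 | 338 | 513 | 486 | 483 | 492 | 510 | 517 | -1 |
| 2 | 2 | 487 | 542 | 499 | 468 | 448 | 471 | 442 | 478 | 480 | ... | 481 | 492 | 650 | 506 | 501 | 480 | 489 | 499 | 498 | -1 |
| 3 | 3 | 480 | 491 | 510 | 485 | 495 | 472 | 417 | 474 | 502 | ... | 480 | 474 | 572 | 454 | 469 | 475 | 482 | 494 | 461 | 1 |
| 4 | 4 | 484 | 502 | 528 | 489 | 466 | 481 | 402 | 478 | 487 | ... | 479 | 452 | 435 | 486 | 508 | 481 | 504 | 495 | 511 | 1 |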


In [13]:
madelon_df.drop('index', axis=1, inplace=True)
madelon_df.head()

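|  | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | feat_009 | ... | feat_491 | feat_492 | feat_493 | feat_494 | feat_495 | feat_496 | feat_497 | feat_498 | feat_499 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 485 | 477 | 537 | 479 | 452 | 471 | 491 | 476 | 475 | 473 | ... | 481 | 477 | 485 | 511 | 485 | 481 | 479 | 475 | 496 | -1 |
| 1 | 483 | 458 | 460 | 487 | 587 | 475 | 526 | 479 | 485 | 469 | ... | 478 | 487 | 338 | 513 | 486 | 483 | 492 | 510 | 517 | -1 |
| 2 | 487 | 542 | 499 | 468 | 448 | 471 | 442 | 478 | 480 | 477 | ... | 481 | 492 | 650 | 506 | 501 | 480 | 489 | 499 | 498 | -1 |
| 3 | 480 | 491 | 510 | 485 | 495 | 472 | 417 | 474 | 502 | 476 | ... | 480 | 474 | 572 | 454 | 469 | 475 | 482 | 494 | 461 | 1 |
| 4 | 484 | 502 | 528 | 489 | 466 | 481 | 402 | 478 | 487 | 468 | ... | 479 | 452 | 435 | 486 | 508 | 481 | 504 | 495 | 511 | 1 |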


### Create a local copy of the table - Execute once.

If needed, the second block below will attach the local data file.

In [14]:
# ###Make connection
# params = {'user_name' : "dsi", 'password' : "dsi", 'url': 'localhost',
#          'port' : "5432", 'database' : "dsi", 'table' : "madelon"}

# engine = sqlalchemy.create_engine("postgresql://{}:{}@{}:{}/{}".format(params['user_name'],params['password'], 
#     params['url'], params['port'], params['database']))

# ### Write table to Local
# madelon_df.to_sql(params['table'],engine,if_exists='replace')
# engine.dispose()

# madelon_df = pd.read_sql_table(params['table'], engine)
# engine.dispose()

In [15]:
y = madelon_df['label']
X = madelon_df.drop('label', axis=1)

In [16]:
baseline = make_data_dict(X,y,random_state=43)
baseline[0]['X_train'].head()

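|  | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | feat_009 | ... | feat_490 | feat_491 | feat_492 | feat_493 | feat_494 | feat_495 | feat_496 | feat_497 | feat_498 | feat_499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1679 | 482 | 470 | 480 | 479 | 494 | 479 | 436 | 475 | 481 | 470 | ... | 463 | 471 | 496 | 763 | 478 | 507 | 483 | 488 | 435 | 403 |
| 1445 | 479 | 529 | 499 | 495 | 481 | 481 | 454 | 478 | 486 | 474 | ... | 497 | 474 | 483 | 420 | 480 | 452 | 486 | 475 | 481 | 506 |
| 352 | 485 | 477 | 541 | 484 | 513 | 484 | 442 | 476 | 459 | 477 | ... | 450 | 482 | 501 | 542 | 497 | 518 | 486 | 476 | 522 | 476 |
| 552 | 486 | 501 | 554 | 501 | 514 | 482 | 534 | 477 | 494 | 473 | ... | 455 | 483 | 478 | 397 | 502 | 482 | 480 | 465 | 539 | 499 |
| 338 | 482 | 476 | 476 | 465 | 557 | 488 | 397 | 479 | 508 | 481 | ... | 479 | 480 | 503 | 589 | 485 | 497 | 481 | 468 | 451 | 461 |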


In [17]:
X_train = baseline[-1]['X_train']
y_train = baseline[-1]['y_train']
X_test = baseline[-1]['X_test']
y_test = baseline[-1]['y_test']

scale = StandardScaler()
baseline.append(general_transformer(scale, X_train, y_train, X_test, y_test))

LogReg = LogisticRegression(n_jobs=-1, verbose=2)
baseline.append(general_model(LogReg,X_train, y_train, X_test, y_test))



[LibLinear]
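For readers without access to the library helpers, the same scale-then-fit sequence can be sketched with plain scikit-learn. This is not the helper implementation. It assumes the helpers wrap calls similar to these, and it runs on small synthetic data rather than the Madelon data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=400, n_features=50,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=43)

# The pipeline fits the scaler on the training data only and applies
# the same scaling to the test data before scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

train_score = pipe.score(X_tr, y_tr)
test_score = pipe.score(X_te, y_te)
print("train: {:.2f}%  test: {:.2f}%".format(train_score * 100,
                                             test_score * 100))
```

Chaining the scaler and the model in one pipeline guarantees that the model is trained and scored on scaled features, and that no statistics from the test set leak into the scaling.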

In [18]:
print("The mean accuracy of the training set is {:.2f}%.".format(baseline[-1]['train_score'] * 100))
print("The mean accuracy of the test set is {:.2f}%.".format(baseline[-1]['test_score'] * 100))

The mean accuracy of the training set is 78.93%.
The mean accuracy of the test set is 51.80%.


### Conclusion

This model is not very effective.  Its test accuracy of 51.80% is only slightly better than a random guess, and the gap between the training accuracy (78.93%) and the test accuracy indicates overfitting: the model has fit patterns in the training data that do not generalize.
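The effect of noise features on the train/test gap can be explored on synthetic data where the signal columns are known. The sketch below is an illustration only. It uses `make_classification` rather than Madelon, the helper name `train_test_gap` is hypothetical, and the scores it prints will differ from the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# With shuffle=False the first 20 columns are the 5 informative and
# 15 redundant features; the remaining 480 columns are noise.
X_all, y_all = make_classification(n_samples=600, n_features=500,
                                   n_informative=5, n_redundant=15,
                                   shuffle=False, random_state=0)


def train_test_gap(X, y):
    """Return (train accuracy, test accuracy, train minus test)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=43)
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=2000))
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    return train_acc, test_acc, train_acc - test_acc


gap_all = train_test_gap(X_all, y_all)             # all 500 features
gap_signal = train_test_gap(X_all[:, :20], y_all)  # signal columns only

print("all features:    train={:.3f} test={:.3f} gap={:.3f}".format(*gap_all))
print("signal features: train={:.3f} test={:.3f} gap={:.3f}".format(*gap_signal))
```

Comparing the two gaps gives a rough sense of how much of the training accuracy comes from fitting noise columns.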

### Next steps
Move on to step 2 to select salient features.  As a separate exercise, I want to test whether changing the target (y_train & y_test) to 0 and 1 makes a significant difference.  I doubt it, but I need to verify.

In [19]:

y2 = madelon_df['label']
switch = lambda x: 0 if x < 0 else 1  # recode -1 to 0; 1 stays 1

X2 = madelon_df.drop('label', axis=1)
y2 = y2.apply(switch)

baseline2 = make_data_dict(X2, y2, random_state=43)

X2_train = baseline2[-1]['X_train']
y2_train = baseline2[-1]['y_train']
X2_test = baseline2[-1]['X_test']
y2_test = baseline2[-1]['y_test']

# Use the new scaler, model and results list for the recoded target
scale2 = StandardScaler()
baseline2.append(general_transformer(scale2, X2_train, y2_train, X2_test, y2_test))

LogReg2 = LogisticRegression(n_jobs=-1, verbose=1)
baseline2.append(general_model(LogReg2, X2_train, y2_train, X2_test, y2_test))
print("\n")
print("The mean accuracy of the training set is {:.2f}%.".format(baseline2[-1]['train_score'] * 100))
print("The mean accuracy of the test set is {:.2f}%.".format(baseline2[-1]['test_score'] * 100))

[LibLinear]

The mean accuracy of the training set is 78.93%.
The mean accuracy of the test set is 51.80%.


### Conclusion, part 2.

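Recoding the target from -1/1 to 0/1 does not affect the results.  This is expected, since scikit-learn classifiers treat the target values as class labels rather than as numbers.  Move on to step 2.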
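The same check can be reproduced in isolation. The following sketch fits two logistic regressions on identical synthetic features, one with -1/1 labels and one with 0/1 labels, and compares them. The data and variable names are made up for illustration and the plain scikit-learn calls stand in for the library helpers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with 0/1 labels, plus a -1/1 recoding of the same labels.
X_demo, y01 = make_classification(n_samples=400, n_features=20,
                                  random_state=0)
y_pm = np.where(y01 == 0, -1, 1)

# Split features and both label encodings with the same shuffle.
X_tr, X_te, y01_tr, y01_te, ypm_tr, ypm_te = train_test_split(
    X_demo, y01, y_pm, random_state=43)

model_01 = LogisticRegression(max_iter=1000).fit(X_tr, y01_tr)
model_pm = LogisticRegression(max_iter=1000).fit(X_tr, ypm_tr)

score_01 = model_01.score(X_te, y01_te)
score_pm = model_pm.score(X_te, ypm_te)

print(score_01 == score_pm)
print(np.allclose(model_01.coef_, model_pm.coef_))
```

Both models see the same features and the same class structure, so the fitted coefficients and the test accuracy agree regardless of which pair of values names the two classes.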
