# Step 3 - Build Models

<img src="assets/build_model.png" width="600px">

### Domain and Data

This is the third step in exploring feature selection on a dataset with many features, most of which are not relevant.  The dataset is the synthetic madelon dataset from the previous steps.

### Problem Statement

A simple logistic regression on all features was not effective. Applying an 'l1' penalty to the logistic regression did not improve the model, and very few features were dropped when the penalty was added.  Clearly, other methods of feature selection need to be explored.
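One way to check how many features an 'l1' penalty actually drops is to count the non-zero coefficients after fitting. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in for madelon, and the helper name `count_kept_features` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for madelon (assumption): 4 informative features out of 40.
X, y = make_classification(n_samples=300, n_features=40, n_informative=4,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)


def count_kept_features(C):
    """Fit an l1-penalized logistic regression and count non-zero coefficients."""
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X, y)
    return int(np.sum(model.coef_ != 0))


# Smaller C means a stronger penalty, which tends to zero out more coefficients.
for C in [1.0, 0.1, 0.01]:
    print("C = {}: {} of {} features kept".format(C, count_kept_features(C), X.shape[1]))
```

If the count barely moves as `C` shrinks, the penalty is not doing much feature selection, which matches the observation above.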


### Solution Statement

Better feature selection methods are needed.  The KBest selector and a K nearest neighbors classifier will be used, along with any other models that seem effective.

### Metric

The metric from step 1 will be reused: the mean accuracy of the predictions.  At least two train/test splits will be done, first with the same random state as the previous step and then with a new random state.
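As a minimal sketch of the metric, mean accuracy is the fraction of predictions that match the true labels; scikit-learn classifiers report the same quantity from their `score` method. The helper names `mean_accuracy` and `split_indices` are hypothetical and are not part of `lib/project_5.py`. The seeds 33 and 43 are the ones that appear in this step's code.

```python
import random


def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


def split_indices(n_samples, test_fraction=0.25, random_state=None):
    """Shuffle row indices reproducibly and split them into train/test lists."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Two splits of the same 8 rows, one per random state.
train_a, test_a = split_indices(8, random_state=33)
train_b, test_b = split_indices(8, random_state=43)

# Toy labels and predictions: 3 of the 4 predictions are correct.
score = mean_accuracy([1, -1, 1, -1], [1, -1, -1, -1])
print("The mean accuracy is {:.2f}%.".format(score * 100))
```

Scoring each split separately, rather than pooling them, shows whether a result depends on one lucky partition of the data.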

### Benchmark 

The primary benchmark is improving the predictive power of the model beyond 50% accuracy.  

We know there are 20 salient features.  A secondary benchmark would be selecting fewer than 20 relevant features.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.
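The helpers in `lib/project_5.py` are not shown in this step. Judging only from how they are called below, they might look roughly like the following sketch. The function bodies, the `test_size` default, and the `'transformer'` key are assumptions, and `load_data_from_database` is omitted because it depends on the database.

```python
from sklearn.model_selection import train_test_split


def make_data_dict(X, y, test_size=0.25, random_state=None):
    """Split the data once and return a one-element list of step dictionaries."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    return [{'X_train': X_train, 'y_train': y_train,
             'X_test': X_test, 'y_test': y_test}]


def general_transformer(transformer, X_train, y_train, X_test, y_test):
    """Fit the transformer on the training data only, then transform both sets."""
    transformer.fit(X_train, y_train)
    return {'transformer': transformer,
            'X_train': transformer.transform(X_train), 'y_train': y_train,
            'X_test': transformer.transform(X_test), 'y_test': y_test}


def general_model(model, X_train, y_train, X_test, y_test):
    """Fit the model and record its mean accuracy on both sets."""
    model.fit(X_train, y_train)
    return {'model': model,
            'train_score': model.score(X_train, y_train),
            'test_score': model.score(X_test, y_test),
            'X_train': X_train, 'y_train': y_train,
            'X_test': X_test, 'y_test': y_test}
```

Each call returns a dictionary that can be appended to the list from `make_data_dict`, so the last element always holds the output of the most recent pipeline step.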

Start with kBest selector:

In [1]:
#from os import chdir, getcwd;
#chdir('lib')
from  lib.project_5 import load_data_from_database, make_data_dict, general_transformer, general_model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV


In [2]:
params = {'user_name' : "dsi_student", 
          'password' : "correct horse battery staple",
          'url': 'joshuacook.me',
          'port' : "5432", 
          'database' : "dsi", 
          'table' : "madelon"}

madelon_df = load_data_from_database(**params)
madelon_df.drop('index', axis =1, inplace=True)

y = madelon_df['label']
X = madelon_df.drop('label', axis =1)

baseline = make_data_dict(X,y,random_state=33)
baseline[0]['X_train'].head()

Unnamed: 0,feat_000,feat_001,feat_002,feat_003,feat_004,feat_005,feat_006,feat_007,feat_008,feat_009,...,feat_490,feat_491,feat_492,feat_493,feat_494,feat_495,feat_496,feat_497,feat_498,feat_499
1179,466,467,484,491,549,476,488,478,484,475,...,471,481,509,436,493,661,476,489,514,458
1529,482,494,479,474,489,476,495,477,517,481,...,470,471,519,345,482,441,486,483,571,514
1125,484,460,549,497,461,470,540,477,490,481,...,527,480,491,745,498,520,484,475,531,477
1739,479,472,445,466,476,474,498,475,483,482,...,481,474,513,421,483,509,481,490,537,445
1303,481,475,542,481,435,473,467,478,484,476,...,488,474,477,388,408,494,484,459,474,495


In [3]:
y = madelon_df['label']
X = madelon_df.drop('label', axis =1)

model = make_data_dict(X,y,random_state=43)

X_train = model[-1]['X_train']
y_train = model[-1]['y_train']
X_test = model[-1]['X_test']
y_test = model[-1]['y_test']

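scale = StandardScaler()
# Note: the scaled split is appended to `baseline`, not `model`,
# so `model[-1]` below still appears to hold the unscaled split.
baseline.append(general_transformer(scale, X_train, y_train, X_test, y_test))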

X_train = model[-1]['X_train']
y_train = model[-1]['y_train']
X_test = model[-1]['X_test']
y_test = model[-1]['y_test']

kbest = SelectKBest(k=50)
model.append(general_transformer(kbest, X_train, y_train, X_test, y_test))

X_train = model[-1]['X_train']
y_train = model[-1]['y_train']
X_test = model[-1]['X_test']
y_test = model[-1]['y_test']

# liblinear is named explicitly to match the [LibLinear] output below
LogReg = LogisticRegression(solver='liblinear', n_jobs=-1, verbose=2)
model.append(general_model(LogReg, X_train, y_train, X_test, y_test))
print("\n")
print("The mean accuracy of the training set is {:.2f}%.".format(model[-1]['train_score']*100))
print("The mean accuracy of the test set is {:.2f}%.".format(model[-1]['test_score']*100))

[LibLinear]

The mean accuracy of the training set is 67.67%.
The mean accuracy of the test set is 54.60%.


This is not a significant improvement.  Moving on to grid search with cross-validation.  The model is reaching the max number of iterations, so perhaps max iterations (`max_iter`) needs to be increased as well.
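A grid search with cross-validation can also be sketched as a single scikit-learn `Pipeline`, so that scaling and feature selection are refit inside every fold. This is an alternative arrangement, not the code used in this step. The data comes from `make_classification` as a stand-in for madelon, and the step names (`scale`, `select`, `logreg`) and `cv=3` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for madelon (assumption): few informative features among many.
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=5, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(k=10)),
    # liblinear supports both the 'l1' and 'l2' penalties.
    ('logreg', LogisticRegression(solver='liblinear', max_iter=200)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {
    'logreg__C': [10 ** i for i in range(-3, 3)],
    'logreg__penalty': ['l1', 'l2'],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

train_score = grid.score(X_train, y_train)
test_score = grid.score(X_test, y_test)
print(grid.best_params_)
print("The mean accuracy of the training set is {:.2f}%.".format(train_score * 100))
print("The mean accuracy of the test set is {:.2f}%.".format(test_score * 100))
```

Adding `'select__k'` to `param_grid` would let the same search tune the number of selected features as well.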

In [4]:
y2 = madelon_df['label']
X2 = madelon_df.drop('label', axis =1)

model2 = make_data_dict(X2,y2,random_state=43)

X2_train = model2[-1]['X_train']
y2_train = model2[-1]['y_train']
X2_test = model2[-1]['X_test']
y2_test = model2[-1]['y_test']

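scale = StandardScaler()
# Note: the scaled split is appended to `baseline`, so `model2[-1]` below is still the unscaled split.
baseline.append(general_transformer(scale, X2_train, y2_train, X2_test, y2_test))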

X2_train = model2[-1]['X_train']
y2_train = model2[-1]['y_train']
X2_test = model2[-1]['X_test']
y2_test = model2[-1]['y_test']

kbest = SelectKBest(k=10)
model2.append(general_transformer(kbest, X2_train, y2_train, X2_test, y2_test))

X2_train = model2[-1]['X_train']
y2_train = model2[-1]['y_train']
X2_test = model2[-1]['X_test']
y2_test = model2[-1]['y_test']

# C from 0.001 to 100 with both penalties; liblinear supports 'l1' and 'l2'
param_grid = {'C': [10**i for i in range(-3, 3)], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='liblinear', verbose=2, max_iter=200, n_jobs=-1), param_grid)

model2.append(general_model(grid,X2_train, y2_train, X2_test, y2_test))

print("\n")
print(model2[-1]['model'].best_estimator_)
print("The mean accuracy of the training set is {:.2f}%.".format(model2[-1]['train_score']*100))
print("The mean accuracy of the test set is {:.2f}%.".format(model2[-1]['test_score']*100))

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]



[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='ovr', n_jobs=-1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=2, warm_start=False)
The mean accuracy of the training set is 60.80%.
The mean accuracy of the test set is 61.20%.


With the KBest selector (k=10) and a grid search over the logistic regression, there is a slight improvement in test accuracy (61.20%, up from 54.60%).  The best estimator uses C = 0.01 and the 'l1' penalty.

Perhaps K Nearest Neighbors will be more effective.

In [5]:
for k in range(6, 20, 2):  # k = 6, 8, ..., 18 features kept by SelectKBest
    y3 = madelon_df['label']
    X3 = madelon_df.drop('label', axis =1)

    model3 = make_data_dict(X3,y3,random_state=43)

    X3_train = model3[-1]['X_train']
    y3_train = model3[-1]['y_train']
    X3_test = model3[-1]['X_test']
    y3_test = model3[-1]['y_test']

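    scale = StandardScaler()
    # Note: the scaled split is appended to `baseline`, so `model3[-1]` below is still the unscaled split.
    baseline.append(general_transformer(scale, X3_train, y3_train, X3_test, y3_test))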

    X3_train = model3[-1]['X_train']
    y3_train = model3[-1]['y_train']
    X3_test = model3[-1]['X_test']
    y3_test = model3[-1]['y_test']


    kbest = SelectKBest(k=k)
    model3.append(general_transformer(kbest, X3_train, y3_train, X3_test, y3_test))

    X3_train = model3[-1]['X_train']
    y3_train = model3[-1]['y_train']
    X3_test = model3[-1]['X_test']
    y3_test = model3[-1]['y_test']


    # Odd neighbor counts from 3 to 21; p is the order of the Minkowski metric
    param_grid = {'n_neighbors': list(range(3, 22, 2)), 'p': [2, 3]}
    grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid)

    model3.append(general_model(grid,X3_train, y3_train, X3_test, y3_test))

    print("kbest = {}".format(k))
    print(model3[-1]['model'].best_estimator_)
    print("The mean accuracy of the training set is {:.2f}%.".format(model3[-1]['train_score']*100))
    print("The mean accuracy of the test set is {:.2f}%.".format(model3[-1]['test_score']*100))

model3.append(general_model(grid,X3_train, y3_train, X3_test, y3_test))

kbest = 6
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=9, p=3,
           weights='uniform')
The mean accuracy of the training set is 76.93%.
The mean accuracy of the test set is 71.80%.
kbest = 8
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=3, p=2,
           weights='uniform')
The mean accuracy of the training set is 90.07%.
The mean accuracy of the test set is 81.40%.
kbest = 10
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=3, p=2,
           weights='uniform')
The mean accuracy of the training set is 92.93%.
The mean accuracy of the test set is 88.20%.
kbest = 12
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=7, p=2,
           weights='uniform')
The mean accurac

The best model so far uses the KBest transformer to select the 14 best features, followed by a K nearest neighbors model with n_neighbors = 3 and p = 2.  The 'p' parameter is the power for the Minkowski metric; p = 2 is the same as the Euclidean distance.
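The role of `p` can be checked with a few lines of plain Python. The sketch below computes the Minkowski distance directly from its definition, the p-th root of the summed p-th powers of the absolute coordinate differences, and compares the p = 2 case with the Euclidean distance. The function names are illustrative and this is not how scikit-learn implements the metric internally.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean_distance(a, b):
    """Straight-line distance, written out independently for comparison."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


a = [0.0, 0.0]
b = [3.0, 4.0]

d1 = minkowski_distance(a, b, p=1)  # Manhattan distance: 3 + 4
d2 = minkowski_distance(a, b, p=2)  # same value as the Euclidean distance
d3 = minkowski_distance(a, b, p=3)  # cube root of 27 + 64

print("p=1: {:.3f}, p=2: {:.3f}, p=3: {:.3f}".format(d1, d2, d3))
print("Euclidean: {:.3f}".format(euclidean_distance(a, b)))
```

As `p` grows, the largest coordinate difference dominates the distance, which changes which points count as nearest neighbors.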

Let's rerun the best model.  

In [6]:
y4 = madelon_df['label']
X4 = madelon_df.drop('label', axis =1)

model4 = make_data_dict(X4,y4,random_state=43)

X4_train = model4[-1]['X_train']
y4_train = model4[-1]['y_train']
X4_test = model4[-1]['X_test']
y4_test = model4[-1]['y_test']

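scale = StandardScaler()
# Note: the scaled split is appended to `baseline`, so `model4[-1]` below is still the unscaled split.
baseline.append(general_transformer(scale, X4_train, y4_train, X4_test, y4_test))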

X4_train = model4[-1]['X_train']
y4_train = model4[-1]['y_train']
X4_test = model4[-1]['X_test']
y4_test = model4[-1]['y_test']


kbest = SelectKBest(k=14)
model4.append(general_transformer(kbest, X4_train, y4_train, X4_test, y4_test))

X4_train = model4[-1]['X_train']
y4_train = model4[-1]['y_train']
X4_test = model4[-1]['X_test']
y4_test = model4[-1]['y_test']


knn = KNeighborsClassifier(n_jobs=-1, p=2, n_neighbors=3)

model4.append(general_model(knn,
                            X4_train, 
                            y4_train,
                            X4_test, 
                            y4_test))

print("The mean accuracy of the training set is {:.2f}%.".format(model4[-1]['train_score']*100))
print("The mean accuracy of the test set is {:.2f}%.".format(model4[-1]['test_score']*100))

The mean accuracy of the training set is 92.53%.
The mean accuracy of the test set is 88.20%.


In [7]:
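# Map the boolean support mask from the fitted selector back to column names.
# Descriptive loop names avoid overwriting `y`, since Python 2 list comprehensions leak their variables.
good_feature_from_final_model = [name for keep, name in zip(kbest.get_support(), X4.columns) if keep]
print(good_feature_from_final_model)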

['feat_048', 'feat_064', 'feat_105', 'feat_128', 'feat_241', 'feat_277', 'feat_336', 'feat_338', 'feat_378', 'feat_442', 'feat_453', 'feat_472', 'feat_475', 'feat_493']
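The mask returned by `get_support()` can also be combined with the selector's `scores_` attribute to see which of the selected features score highest. The sketch below uses synthetic data from `make_classification` in place of madelon, with feature names generated to mimic the `feat_000` pattern; with the default `f_classif` score function, `scores_` holds one ANOVA F-value per input column.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest

# Synthetic stand-in for madelon (assumption): 3 informative features out of 20.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ['feat_{:03d}'.format(i) for i in range(X.shape[1])]

selector = SelectKBest(k=5).fit(X, y)

# Boolean array with one entry per input column; True marks a kept feature.
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]

# Order the kept features by their univariate score, highest first.
ranked = sorted(selected,
                key=lambda name: selector.scores_[feature_names.index(name)],
                reverse=True)
print(ranked)
```

A ranking like this helps judge whether a smaller `k` would discard only weakly scoring features.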


### Conclusions

Automatic feature selection is viable for filtering salient features and building a useful model.  The current model still has room for improvement: too many features were selected, and the test accuracy (88.20%) could be higher.  Other models may provide better feature selection.  This model should also be tested against new datasets to see whether it generalizes.