# Step 2 - Identify Salient Features Using an $\ell_1$ Penalty

<img src="/assets/identify_features.png" width="600px">

### Domain and Data

This is the second step in exploring feature selection on a dataset with many features, most of which are not relevant.  The dataset is the synthetic madelon dataset from the previous step.  A simple logistic regression on all features was not effective.  


### Problem Statement

From step 1, the logistic regression is equivalent to guessing.  This step starts to improve on the logistic regression by dropping irrelevant features.  

### Solution Statement

This step explores the solution space.  LASSO, that is, an $\ell_1$ penalty on the coefficients, is a good place to start to see if automated feature selection is feasible.
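
For reference, one common way to write the $\ell_1$-penalized logistic regression objective is shown below. The notation is an assumption added for illustration and is not taken from the project code: labels $y_i \in \{-1, 1\}$, feature vectors $x_i$, weights $w$, intercept $c$, and $C$ as the inverse regularization strength.

$$
\min_{w,\,c} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

The $\lVert w \rVert_1$ term is what can push individual coefficients to exactly zero, which is why the penalty doubles as a feature selector. A smaller $C$ gives the penalty more weight relative to the data term.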

### Metric

The metric from step 1 will be reused: the mean accuracy of the prediction.  At least two train/test splits will be run, first with the same random state as the previous step and then with a new random state.  
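
As a minimal sketch of the metric, mean accuracy is simply the fraction of predictions that match the true labels. The function name `mean_accuracy` and the toy labels are hypothetical and are not part of `lib.project_5`; the sketch assumes Python 3.

```python
def mean_accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels for illustration only: two of four predictions are right.
toy_true = [1, -1, 1, 1]
toy_pred = [1, 1, 1, -1]
print("{:.2f}%".format(mean_accuracy(toy_true, toy_pred) * 100))
```

For a balanced two-class problem, a score near 50% means the model does no better than guessing.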

### Benchmark 

The previous step returned a mean accuracy of 52.0% for the test dataset.  If LASSO is an effective feature selection model, a marked improvement in the metric should be seen.  Also, a smaller spread between the training score and the test score is expected.



### Implementation

A pipeline similar to the one in the previous step will be used.  The logistic regression model will include the parameter `penalty='l1'`.  
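
The helpers in `lib.project_5` are not shown here, so the sketch below swaps in plain scikit-learn calls to illustrate the same scale-then-fit idea. It is an assumption-laden stand-in, not the project's code: the data comes from `make_classification` rather than the madelon table, all variable names are hypothetical, and it targets Python 3 with a recent scikit-learn, where `solver='liblinear'` has to be named explicitly to use an l1 penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the madelon table): many features, few informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=500, n_informative=5, n_redundant=15,
    random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=43)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# l1-penalized logistic regression; liblinear supports the l1 penalty.
model = LogisticRegression(penalty='l1', solver='liblinear')
model.fit(X_tr_s, y_tr)

# score() on a classifier returns the mean accuracy.
train_score = model.score(X_tr_s, y_tr)
test_score = model.score(X_te_s, y_te)
print("train accuracy: {:.2f}%".format(train_score * 100))
print("test accuracy:  {:.2f}%".format(test_score * 100))
```

Fitting the scaler on the training split alone keeps information from the test split out of the model.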



In [1]:
from os import chdir
chdir('../')  # move to the project root so that lib/ is importable
from lib.project_5 import load_data_from_database, make_data_dict, general_transformer, general_model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [2]:
params = {'user_name' : "dsi_student", 
          'password' : "correct horse battery staple",
          'url': 'joshuacook.me',
          'port' : "5432", 
          'database' : "dsi", 
          'table' : "madelon"}

madelon_df = load_data_from_database(**params)
madelon_df.drop('index', axis=1, inplace=True)

y = madelon_df['label']
X = madelon_df.drop('label', axis=1)

baseline = make_data_dict(X, y, random_state=43)
baseline[0]['X_train'].head()

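|  | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | feat_009 | ... | feat_490 | feat_491 | feat_492 | feat_493 | feat_494 | feat_495 | feat_496 | feat_497 | feat_498 | feat_499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1679 | 482 | 470 | 480 | 479 | 494 | 479 | 436 | 475 | 481 | 470 | ... | 463 | 471 | 496 | 763 | 478 | 507 | 483 | 488 | 435 | 403 |
| 1445 | 479 | 529 | 499 | 495 | 481 | 481 | 454 | 478 | 486 | 474 | ... | 497 | 474 | 483 | 420 | 480 | 452 | 486 | 475 | 481 | 506 |
| 352 | 485 | 477 | 541 | 484 | 513 | 484 | 442 | 476 | 459 | 477 | ... | 450 | 482 | 501 | 542 | 497 | 518 | 486 | 476 | 522 | 476 |
| 552 | 486 | 501 | 554 | 501 | 514 | 482 | 534 | 477 | 494 | 473 | ... | 455 | 483 | 478 | 397 | 502 | 482 | 480 | 465 | 539 | 499 |
| 338 | 482 | 476 | 476 | 465 | 557 | 488 | 397 | 479 | 508 | 481 | ... | 479 | 480 | 503 | 589 | 485 | 497 | 481 | 468 | 451 | 461 |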


In [3]:
X_train = baseline[-1]['X_train']
y_train = baseline[-1]['y_train']
X_test = baseline[-1]['X_test']
y_test = baseline[-1]['y_test']

scale = StandardScaler()
baseline.append(general_transformer(scale, X_train, y_train, X_test, y_test))

X_train = baseline[-1]['X_train']
y_train = baseline[-1]['y_train']
X_test = baseline[-1]['X_test']
y_test = baseline[-1]['y_test']

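# liblinear is the solver reported in the output below; naming it keeps penalty='l1' valid on newer scikit-learn
LogReg = LogisticRegression(n_jobs=-1, verbose=2, penalty='l1', solver='liblinear')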
baseline.append(general_model(LogReg,X_train, y_train, X_test, y_test))


[LibLinear]

In [4]:
print("\n")
print("The mean accuracy of the training set is {:.2f}%.".format(baseline[-1]['train_score'] * 100))
print("The mean accuracy of the test set is {:.2f}%.".format(baseline[-1]['test_score'] * 100))



The mean accuracy of the training set is 79.07%.
The mean accuracy of the test set is 50.80%.


In [5]:
len(baseline)

3

In [6]:
# Same random_state (43) as the first run, so this repeats that split and its scores.
baseline2 = make_data_dict(X, y, random_state=43)

X_train2 = baseline2[-1]['X_train']
y_train2 = baseline2[-1]['y_train']
X_test2 = baseline2[-1]['X_test']
y_test2 = baseline2[-1]['y_test']

scale2 = StandardScaler()
baseline2.append(general_transformer(scale2, X_train2, y_train2, X_test2, y_test2))

X_train2 = baseline2[-1]['X_train']
y_train2 = baseline2[-1]['y_train']
X_test2 = baseline2[-1]['X_test']
y_test2 = baseline2[-1]['y_test']

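# liblinear is the solver reported in the output below; naming it keeps penalty='l1' valid on newer scikit-learn
LogReg2 = LogisticRegression(n_jobs=-1, verbose=3, penalty='l1', solver='liblinear')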
baseline2.append(general_model(LogReg2,X_train2, y_train2, X_test2, y_test2))

print("\n")
print("The mean accuracy of the training set is {:.2f}%.".format(baseline2[-1]['train_score'] * 100))
print("The mean accuracy of the test set is {:.2f}%.".format(baseline2[-1]['test_score'] * 100))

[LibLinear]

The mean accuracy of the training set is 79.07%.
The mean accuracy of the test set is 50.80%.


In [7]:
baseline2[-1]['model'].coef_

array([[  8.30168855e-03,   1.64321179e-01,   6.26193677e-02,
         -7.86641056e-02,   1.87052855e-01,  -1.91899021e-01,
         -1.50245561e-01,   6.57424901e-02,   1.06307933e-01,
          0.00000000e+00,  -1.34414093e-01,   1.05004051e-01,
          7.38551237e-02,   0.00000000e+00,  -8.53339723e-02,
          6.60241586e-02,  -1.46308774e-02,  -5.73393270e-02,
          8.36447833e-02,  -7.66348670e-02,  -1.12252437e-01,
          1.00191938e-01,   0.00000000e+00,  -2.06173370e-01,
          4.13895810e-02,   8.23416988e-02,   1.73111099e-01,
          9.45515630e-02,   0.00000000e+00,   3.57863455e-02,
          2.50597300e-02,  -9.36908164e-03,   0.00000000e+00,
          1.03561401e-01,  -1.36318712e-01,   3.29989563e-02,
         -3.12240511e-02,   8.88875700e-03,  -1.30732168e-01,
         -6.54031335e-02,   0.00000000e+00,   4.81145129e-02,
         -1.19081824e-01,  -7.00469889e-02,   1.40300293e-01,
          1.00976840e-02,   2.68510360e-01,   1.09302837e-01,
        

### Conclusion

Adding the `'l1'` penalty did not improve the accuracy.  The accuracy of both the train and test sets is equivalent to the baseline.  Different methods of feature selection will be required.  

Very few features were eliminated with this technique.  Since feature elimination is one of the primary benefits of using the `'l1'` penalty, other steps will be needed to select the salient features.  
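
One way to quantify "very few features were eliminated" is to count the coefficients that are exactly zero. The sketch below is a hypothetical helper (`sparsity_report` is not part of `lib.project_5`) and assumes NumPy under Python 3. It is applied to the first twelve coefficients of the output shown above, rounded to three significant figures; on the real model it would be called on `baseline2[-1]['model'].coef_`.

```python
import numpy as np


def sparsity_report(coef, tol=0.0):
    """Return (n_zero, n_nonzero) for a coefficient array.

    A coefficient counts as zero when its absolute value is <= tol.
    """
    flat = np.asarray(coef, dtype=float).ravel()
    n_zero = int(np.sum(np.abs(flat) <= tol))
    return n_zero, int(flat.size - n_zero)


# First twelve coefficients from the printed array, rounded for brevity.
visible = np.array([[8.30e-03, 1.64e-01, 6.26e-02, -7.87e-02,
                     1.87e-01, -1.92e-01, -1.50e-01, 6.57e-02,
                     1.06e-01, 0.0, -1.34e-01, 1.05e-01]])

n_zero, n_nonzero = sparsity_report(visible)
print("zeroed: {}, kept: {}".format(n_zero, n_nonzero))
```

In the portion of the array printed above, only six of the 48 visible coefficients are exactly zero, which is consistent with the observation that the penalty removed few features at the default regularization strength.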

As an aside, the effect of using the scaler has not been considered.  Below, I'll do a run without applying the scaler.



In [8]:
baseline3 = make_data_dict(X,y,random_state=43)

X_train3 = baseline3[-1]['X_train']
y_train3 = baseline3[-1]['y_train']
X_test3 = baseline3[-1]['X_test']
y_test3 = baseline3[-1]['y_test']

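# liblinear is the solver reported in the output below; naming it keeps penalty='l1' valid on newer scikit-learn
LogReg3 = LogisticRegression(n_jobs=-1, verbose=3, penalty='l1', solver='liblinear')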
baseline3.append(general_model(LogReg3,X_train3, y_train3, X_test3, y_test3))

print("\n")
print("The mean accuracy of the training set is {:.2f}%.".format(baseline3[-1]['train_score'] * 100))
print("The mean accuracy of the test set is {:.2f}%.".format(baseline3[-1]['test_score'] * 100))

[LibLinear]

The mean accuracy of the training set is 79.07%.
The mean accuracy of the test set is 50.00%.




### Conclusion, part 2

Including the scaler transformation reduces the computation time compared to running the regression without scaling.  The results are virtually the same with and without the scaler.  However, the scaler helps the fit converge quickly.
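
The timing difference was observed informally in the runs above. A rough way to measure it is sketched below, under several assumptions: the data is a synthetic stand-in from `make_classification`, shifted and stretched by made-up constants to resemble unscaled features; `timed_fit` is a hypothetical helper; and the code targets Python 3 with a recent scikit-learn. The timings it prints will vary by machine and say nothing definitive about the madelon data.

```python
import time
import warnings

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def timed_fit(X, y):
    """Fit an l1-penalized logistic regression; return (seconds, model)."""
    model = LogisticRegression(penalty='l1', solver='liblinear')
    start = time.perf_counter()
    with warnings.catch_warnings():
        # Unscaled fits may emit convergence warnings; silence them here.
        warnings.simplefilter("ignore")
        model.fit(X, y)
    return time.perf_counter() - start, model


# Synthetic stand-in data, NOT the madelon table.
X_raw, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0)
X_raw = X_raw * 30 + 480  # made-up shift and stretch to mimic unscaled features
X_scaled = StandardScaler().fit_transform(X_raw)

t_raw, model_raw = timed_fit(X_raw, y_demo)
t_scaled, model_scaled = timed_fit(X_scaled, y_demo)
print("unscaled fit: {:.4f} s".format(t_raw))
print("scaled fit:   {:.4f} s".format(t_scaled))
```

A single timed fit is noisy, so repeating each fit several times and comparing the medians would give a more trustworthy picture.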