# Feature Selection

In this notebook, relevant features are selected using forward stepwise logistic regression. The basic idea is to fit the model with one additional predictor at a time, keep the variable that raises AUC the most, and iterate until AUC stops improving.

import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss

import random

import os
import os.path as path
import pickle

In [2]:
train = pd.read_csv('data/train_out.csv')

with open('train/train_na.pkl', 'rb') as read:
    train_na = pickle.load(read)

train.index = train.msno
# I'm only taking training datapoints without NA's
train = train[~train_na]
# to make things faster, I'm only using 10% of the data
train = train.sample(n = 100000) 

train_X = train.drop(['msno', 'is_churn', 'concated'], axis = 1)
train_y = train.is_churn

First, I will fit logistic regression on all the features to get a baseline against which to measure any improvement.

In [3]:
log = LogisticRegression()
log.fit(train_X, train_y)
score = log.score(train_X, train_y)
print('Guessing the mode score:           %f' % (1 - np.mean(train_y)))
print('Logistic Regression score:         %f' % score)

Guessing the mode score:           0.934550
Logistic Regression score:         0.934550


This score is no better than the baseline: it exactly matches the accuracy of guessing 0 for every row. The full-feature model adds nothing over the majority-class guess, which is why we must subset the variables.
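One way to confirm that a model has collapsed to the majority class is to compare its accuracy with the all-zeros baseline and count how many positives it predicts. The sketch below uses made-up labels, not the churn data, and the variable names are illustrative only.

```python
import numpy as np

# made-up labels: 7 positives out of 100 rows (illustrative only)
y_true = np.zeros(100, dtype=int)
y_true[:7] = 1

# a degenerate model that predicts 0 for every row
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_pred == y_true)
baseline = 1 - np.mean(y_true)         # accuracy of always guessing the mode (0)
n_predicted_positive = int(y_pred.sum())

print('Accuracy:             %f' % accuracy)
print('Baseline:             %f' % baseline)
print('Predicted positives:  %d' % n_predicted_positive)
```

In the notebook, the analogous check would be `log.predict(train_X).sum()`; a value of 0 would mean the model never predicts churn.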

### Stepwise Regression

The basic idea of forward variable selection is to add one variable at a time to the model, at each step choosing the variable that improves it the most. We keep iterating only while the model is improving. Otherwise, we stop in order to avoid taking in noisy and irrelevant features.

In [6]:
isImproving = True
features = train_X.columns
ranked_features = []
score = []
logloss = []
AUC = [0]
i = 0

while isImproving:
    top_AUC = AUC[i]
    print('Iteration %s' % str(i+1))
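    # try each feature that has not been selected yet
    for var in train_X.columns.difference(ranked_features):
        cols = [var] + ranked_features
        log = LogisticRegression()
        log.fit(train_X[cols], train_y)
        pred = log.predict(train_X[cols])
        prob = log.predict_proba(train_X[cols])
        # note: AUC is computed from the hard 0/1 predictions here;
        # roc_auc_score(train_y, prob[:, 1]) would give the usual ROC AUC
        curr_AUC = roc_auc_score(train_y, pred)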
        if curr_AUC > top_AUC:
            best_feature = var
            top_AUC = curr_AUC
            best_score = np.mean(pred == train_y)
            top_log_loss = log_loss(train_y, prob)
    if AUC[i] >= top_AUC:
        print('Logfit is not improving...')
        isImproving = False
    else:
        print('Best feature:      %s' % best_feature)
        print('AUC:               %f' % top_AUC)
        print('Score:             %f' % best_score)
        print('Log Loss:          %f' % top_log_loss)
        AUC.append(top_AUC)
        ranked_features.append(best_feature)
        score.append(best_score)
        logloss.append(top_log_loss)
    i += 1

Iteration 1
Best feature:      expiration_date
AUC:               0.709457
Score:             0.961010
Log Loss:          0.188780
Iteration 2
Best feature:      201702
AUC:               0.756242
Score:             0.960950
Log Loss:          0.128497
Iteration 3
Best feature:      is_auto_renew
AUC:               0.807173
Score:             0.965980
Log Loss:          0.098397
Iteration 4
Best feature:      membership_expire_date
AUC:               0.826610
Score:             0.966590
Log Loss:          0.095237
Iteration 5
Best feature:      201503
AUC:               0.829784
Score:             0.966150
Log Loss:          0.094140
Iteration 6
Best feature:      registered_via_7.0
AUC:               0.832299
Score:             0.966600
Log Loss:          0.093630
Iteration 7
Best feature:      is_cancel
AUC:               0.834409
Score:             0.966560
Log Loss:          0.091545
Iteration 8
Best feature:      201609
AUC:               0.835422
Score:             0.966860
Log L
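`roc_auc_score` accepts either hard class labels or predicted probabilities, and the two generally give different values. With hard labels, the score reduces to the average of the true positive and true negative rates. The sketch below uses made-up numbers, not the churn data, to show the difference; the array names are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
prob_positive = np.array([0.05, 0.10, 0.20, 0.40, 0.30, 0.45])

# thresholding at 0.5 (what predict() does) yields all zeros here
hard_pred = (prob_positive >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_pred)
auc_from_probs = roc_auc_score(y_true, prob_positive)

print('AUC from hard labels:    %f' % auc_from_labels)
print('AUC from probabilities:  %f' % auc_from_probs)
```

Here the hard-label AUC is 0.5 because every prediction is 0, while the probability-based AUC is 0.875, since the positives outrank the negatives in 7 of the 8 positive-negative pairs.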

Now let's see how much we improved!

In [7]:
log = LogisticRegression()
log.fit(train_X[ranked_features], train_y)
prob = log.predict_proba(train_X[ranked_features])
predict = log.predict(train_X[ranked_features])
print('Log Loss of Logistic Regression:        %f' % log_loss(train_y, prob))
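# caution: 1 - mean(y - pred) lets false positives and false negatives cancel,
# so it is not a true accuracy; np.mean(predict == train_y) is
print('Accuracy of Logistic Regression:        %f' % (1 - np.mean(train_y - predict)))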

Log Loss of Logistic Regression:        0.090415
Accuracy of Logistic Regression:        0.992570


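The log loss has improved, but the 0.9926 figure overstates accuracy: `1 - np.mean(train_y - predict)` lets false positives and false negatives cancel, so it only measures how close the predicted churn rate is to the true one. The stepwise loop's own score, computed as `np.mean(pred == train_y)`, was about 0.967 on the training data in its last printed iterations. Of course, accuracy will be different on the test data, and we will use other models (random forest, gradient boosting, etc.) to predict.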
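Accuracy is the fraction of predictions that match the labels. The sketch below, on made-up arrays rather than the churn data, compares `np.mean(y_true == y_pred)` with the signed-difference form `1 - np.mean(y_true - y_pred)` used in the cell above.

```python
import numpy as np

# made-up example: one false negative and one false positive
y_true = np.array([1, 0, 0, 0])
y_pred = np.array([0, 1, 0, 0])

accuracy = np.mean(y_true == y_pred)             # fraction of matching predictions
signed_difference = 1 - np.mean(y_true - y_pred)  # the two errors cancel to zero

print('Accuracy:                 %f' % accuracy)
print('1 - mean(y_true - y_pred): %f' % signed_difference)
```

Half of the predictions are wrong, yet the signed-difference form reports 1.0 because the two errors have opposite signs.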

Finally let's save the best features for future use.

In [8]:
with open('train/best_features.pkl', 'wb') as output:
    pickle.dump(ranked_features, output, protocol=pickle.HIGHEST_PROTOCOL)
print('Best features saved!')

Best features saved!
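The saved list can later be read back with `pickle.load`. A minimal round-trip sketch, assuming a temporary directory and a hypothetical two-item list stand in for the notebook's `train/best_features.pkl` and `ranked_features`:

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the ranked_features list built above
example_features = ['is_auto_renew', 'is_cancel']

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'best_features.pkl')
    # write the list the same way the notebook does
    with open(file_path, 'wb') as output:
        pickle.dump(example_features, output, protocol=pickle.HIGHEST_PROTOCOL)
    # read it back, as a later notebook would
    with open(file_path, 'rb') as saved:
        loaded_features = pickle.load(saved)

print(loaded_features == example_features)
```

Because the list keeps its order, a downstream model can select columns with something like `train_X[loaded_features]`.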
