### Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Get Data

In [2]:
X = pd.read_hdf('../wip-data/X_train.h5', key = 'df')
y = pd.read_hdf('../wip-data/y_train.h5', key = 'df')

### Unpenalized Logistic Regression Models
<ol>
    <li> We use the <i>train-validate</i> strategy to estimate the test error: the model is fitted on 80% of the rows and scored on the remaining 20%.
    <li> We begin with an unpenalized logistic regression that includes all features. The estimated test error of this model is the baseline against which we compare the other models we develop.
</ol>
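
The train-validate estimate described above can be sketched on synthetic data. This is only an illustration: `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's data, and the model uses scikit-learn's default L2 penalty rather than an unpenalized fit, purely to keep the sketch portable across library versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=14, random_state=0)

# Hold out 20% of the rows as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Default (L2-penalized) logistic regression, fitted on the training rows only.
model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# score() returns accuracy, so the estimated test error is its complement.
val_error_pct = (1 - model.score(X_va, y_va)) * 100
print(f"Estimated test error: {val_error_pct:.2f} %")
```

Because the validation rows play no part in fitting, the error measured on them approximates the error on unseen data, although a single split can be noisy on a small dataset.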

In [94]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 1970)

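# penalty = None fits an unpenalized model (scikit-learn >= 1.2; older releases used the string 'none')
clf = LogisticRegression(max_iter = 10000, penalty = None).fit(X_train, y_train)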
base_est_test_error = (1 - clf.score(X_val, y_val))*100

print("The estimated test error for the unpenalized logistic regression model = %f" % base_est_test_error,"%")

The estimated test error for the unpenalized logistic regression model = 26.256983 %


#### Recursive Feature Elimination using Cross Validation
<ol>
    <li> Starting from the model developed in the previous step, we recursively eliminate features, using cross-validation, to select the best subset of features.
    <li> We then check whether the reduced feature set produces a model with a lower estimated test error than the baseline.
</ol>
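
As a minimal sketch of how recursive feature elimination with cross-validation behaves, the example below runs `RFECV` on synthetic data. The dataset and the names `X_demo`, `y_demo`, and `selector` are assumptions made for illustration, and the estimator uses the default L2 penalty rather than the notebook's unpenalized model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 carry signal (illustrative assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Drop one feature per round (step=1) and keep the subset size
# with the best 5-fold cross-validated accuracy.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000), step=1, cv=5, scoring="accuracy"
)
selector.fit(X_demo, y_demo)

kept = selector.support_                  # boolean mask of retained features
X_reduced = selector.transform(X_demo)    # keeps only the retained columns

print("Number of features kept:", selector.n_features_)
print("Ranking (1 = kept):", selector.ranking_)
```

Features with rank 1 are retained; higher ranks are roughly the reverse of the order in which features were eliminated.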

In [88]:
from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator = clf, step = 1, cv = 5, scoring = 'accuracy').fit(X_train, y_train)
est_test_error = (1 - rfecv.score(X_val, y_val))*100

print("Of the %d features, the optimal number of features to include in the model is %d" 
      % (len(X.columns.values), rfecv.n_features_))
print("The estimated test error for the new model = %f" % est_test_error,"%")
# ~ inverts the boolean support mask, leaving only the eliminated features
print("The features to be eliminated are %s" % X.columns.values[~rfecv.get_support()] )

Of the 14 features, the optimal number of features to include in the model is 13
The estimated test error for the new model = 26.256983 %
The features to be eliminated are ['Fare']


#### Transforming the <i>train</i> Dataset
The estimated test error of the reduced model is identical to the baseline (26.26%), so we can drop the identified feature (<i>Fare</i>) without any adverse impact on validation accuracy.

In [89]:
X_train = rfecv.transform(X_train)
X_val = rfecv.transform(X_val)

#### Adding Interaction Features
<ol>
    <li> Recall that the learning curves indicate that the quality of a regression model would improve with either additional data or additional features.
    <li> We introduce new features by adding interaction features, built from the original features in the dataset, and use this enhanced feature set to build a new logistic regression model.
    <li> We will ascertain if the model with additional features improves on the baseline estimated test error.
</ol>
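
To make concrete what the interaction features are, the sketch below expands a tiny hypothetical array (`X_small`, not taken from the dataset) with `PolynomialFeatures(interaction_only = True)`. At the default degree of 2, p input features become 1 + p + p(p-1)/2 columns: a bias column, the original features, and every pairwise product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two rows, three features (hypothetical values chosen for easy checking).
X_small = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])

expander = PolynomialFeatures(interaction_only=True)
X_expanded = expander.fit_transform(X_small)

# Column order: bias, x0, x1, x2, x0*x1, x0*x2, x1*x2
print(X_expanded.shape)
print(X_expanded[0])

# The same expansion applied to 13 input features, as retained after elimination.
n_out_13 = PolynomialFeatures(interaction_only=True).fit(np.zeros((1, 13))).n_output_features_
print(n_out_13)
```

For the 13 features retained after elimination this gives 1 + 13 + 78 = 92 columns, so the expanded model has far more parameters to fit on the same number of rows.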

In [99]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

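# Expand the features with all pairwise interaction terms (no squared terms)
add_feat = PolynomialFeatures(interaction_only = True)

# Chain the expansion with the logistic regression; clf is refitted on the expanded features
pipeline = make_pipeline(add_feat, clf)
pipeline.fit(X_train, y_train)
pipeline.score(X_val, y_val)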

In [102]:
pipeline.fit(X_train, y_train)
est_test_error = (1 - pipeline.score(X_val, y_val))*100
print("The estimated test error for the new model = %f" % est_test_error,"%")

The estimated test error for the new model = 17.877095 %


In [104]:
#RFECV(estimator = pipeline, step = 1, cv = 5, scoring = 'accuracy').fit(X_train, y_train)
# The attribute is coef_ (not coeff_) and it belongs to the final LogisticRegression step, not to the Pipeline
pipeline.named_steps['logisticregression'].coef_
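
A `Pipeline` does not expose the fitted coefficients itself. They live on its final step under the attribute name `coef_`. The sketch below shows one way to reach them, assuming a pipeline built with `make_pipeline`, which names each step after its lowercased class name. The data and the name `demo_pipeline` are hypothetical, and the default L2 penalty is used for portability.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic binary classification data with 4 features (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_pipeline = make_pipeline(
    PolynomialFeatures(interaction_only=True),
    LogisticRegression(max_iter=10000),
)
demo_pipeline.fit(X_demo, y_demo)

# make_pipeline names the step "logisticregression"; demo_pipeline[-1] is equivalent.
final_estimator = demo_pipeline.named_steps["logisticregression"]
coefficients = final_estimator.coef_

# One row for a binary target; one column per expanded feature (1 + 4 + 6 = 11).
print(coefficients.shape)
```

The coefficients refer to the expanded feature set, including the bias column and the interaction terms, rather than to the original input columns.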