# Feature Selection - simplistic greedy approach
As we've now seen, it is fairly easy to overfit a model, so we often need to decide which variables to include in the model and which to leave out. A simple way to do this is greedy forward selection: add features one at a time, each time keeping the one that improves the model the most.

## 1. Split the data into a test and train set.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('Swiss_Healthcare_Premium_Prediction.csv.gz', compression='gzip')

df = df.fillna(value=0)
X = df[df.columns[:-1]]
y = df['Premium']
print(len(df))
df.head()

53617


Unnamed: 0.1,Unnamed: 0,ID,CAT_Insurer,CAT_Region_Num,205d_V2_CUR,205d_V3_PRC,212d_V1_CUR,212d_V2_PRC,213d_V2_CUR,213d_V3_PRC,...,KG_SPS_226d_V1_CUR,KG_SPS_227d_V1_PRC,KG_SPS_229d_V1_CUR,KG_SX_226d_V1_CUR,KG_SX_227d_V1_PRC,KG_SX_229d_V1_CUR,KG_TOT_226d_V1_CUR,KG_TOT_227d_V1_PRC,KG_TOT_229d_V1_CUR,Premium
0,0,0.0,0.0,0.0,0.326051,0.215957,0.322181,0.234051,0.377469,0.213745,...,0.275441,0.373868,0.285167,0.149472,0.610291,0.131174,0.326051,0.215957,0.359147,0.409432
1,1,1.9e-05,0.0,0.0,0.326051,0.215957,0.322181,0.234051,0.377469,0.213745,...,0.275441,0.373868,0.285167,0.149472,0.610291,0.131174,0.326051,0.215957,0.359147,0.394941
2,2,3.7e-05,0.0,0.0,0.326051,0.215957,0.322181,0.234051,0.377469,0.213745,...,0.275441,0.373868,0.285167,0.149472,0.610291,0.131174,0.326051,0.215957,0.359147,0.358463
3,3,5.6e-05,0.0,0.0,0.326051,0.215957,0.322181,0.234051,0.377469,0.213745,...,0.275441,0.373868,0.285167,0.149472,0.610291,0.131174,0.326051,0.215957,0.359147,0.321986
4,4,7.5e-05,0.0,0.0,0.326051,0.215957,0.322181,0.234051,0.377469,0.213745,...,0.275441,0.373868,0.285167,0.149472,0.610291,0.131174,0.326051,0.215957,0.359147,0.285634


In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y)
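
Called with no extra arguments, `train_test_split` shuffles differently on every run, so the features selected later can change between runs. A minimal sketch of a reproducible split follows, assuming a small synthetic array in place of the premium data; the names `X_demo` and `y_demo` and the seed value are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# test_size=0.25 spells out the default; random_state pins the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Repeating the call with the same seed reproduces the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

print(len(X_tr), len(X_te))
print(np.array_equal(X_tr, X_tr2))
```

Whether to fix the seed is a judgment call: it makes the notebook repeatable, at the cost of hiding how much the selected features depend on the particular split.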

## 2. Find the [single] best feature to train a regression model on
Loop through all of the X features and train an unpenalized LinearRegression model on each single feature. Find the feature that produces the lowest mean squared error on the test set.

In [4]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

In [15]:
min_mse = float('inf')  # start above any achievable MSE so the first feature always wins
best_feature = None

def mse(residuals):
    """Mean squared error of a collection of residuals."""
    return np.mean(np.square(residuals))

for feat in X.columns:
    linreg = LinearRegression()
    curr_X_train = np.array(X_train[feat]).reshape(-1, 1)
    curr_X_test = np.array(X_test[feat]).reshape(-1,1)
    linreg.fit(curr_X_train,y_train)
    
    y_hat_test = linreg.predict(curr_X_test)
    test_mse = mse(y_test - y_hat_test)
    
    if test_mse < min_mse:
        min_mse = test_mse
        best_feature = feat
        
print('The single best predictor was: {}'.format(best_feature))

The single best predictor was: 213d_V1_CUR
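
One way to gain confidence in a scan like this is to run it on synthetic data where the best predictor is known in advance. The sketch below assumes a made-up frame in which only the column named `signal` is related to the target, and it uses `sklearn.metrics.mean_squared_error` in place of the hand-written `mse` helper. All column names and coefficients are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: only 'signal' is related to the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 4)),
                      columns=['noise_a', 'signal', 'noise_b', 'noise_c'])
y_demo = 5.0 * X_demo['signal'] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Test MSE of a one-feature model for every column.
scores = {}
for feat in X_demo.columns:
    model = LinearRegression().fit(X_tr[[feat]], y_tr)
    scores[feat] = mean_squared_error(y_te, model.predict(X_te[[feat]]))

# Report the tracked minimum, not the loop variable.
best_feature = min(scores, key=scores.get)
print('Best single feature:', best_feature)
```

Placing the informative column somewhere other than last is deliberate: a scan that accidentally reports the final loop variable would name `noise_c` here and be easy to spot.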


## 3. Generalize #2
Write a function that takes a desired number of features, n, and returns a model trained on the top n features (according to test set error). Be sure to do this iteratively. In other words, rather than simply taking the n features that perform best individually, first find the single best feature and train a model on it. Then loop back through the remaining features and select the one that produces the best results in combination with the feature already selected. Continue by finding the best third feature in combination with the previous two, and so on, until you reach the desired number of features (or there are no features left).
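
To see why the iterative approach can differ from ranking features individually, consider a rough sketch on synthetic data with a near-duplicate column. Everything here is an assumption made for illustration: the columns `a`, `a_noisy`, and `b`, the coefficients, and the use of in-sample error to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
cols = {
    'a': a,
    'a_noisy': a + rng.normal(scale=0.3, size=n),  # near-duplicate of a
    'b': b,
}
y = 3.0 * a + 1.0 * b + rng.normal(scale=0.1, size=n)

def fit_mse(names):
    """In-sample MSE of a linear fit on the named columns (kept simple on purpose)."""
    X = np.column_stack([cols[k] for k in names])
    pred = LinearRegression().fit(X, y).predict(X)
    return float(np.mean((y - pred) ** 2))

# Approach 1: rank features by how well each one does alone, keep the top two.
solo = sorted(cols, key=lambda k: fit_mse([k]))
top2_individual = solo[:2]

# Approach 2 (greedy): keep the best single feature, then add whichever helps most.
first = solo[0]
second = min((k for k in cols if k != first), key=lambda k: fit_mse([first, k]))
greedy2 = [first, second]

print(top2_individual, fit_mse(top2_individual))
print(greedy2, fit_mse(greedy2))
```

On this data the individual ranking keeps `a` and its noisy copy, which adds almost nothing new, while the greedy pass picks `a` and then `b` and reaches a much lower error.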

In [27]:
def best_feat(X_train, X_test, y_train, y_test, feat_options, prev_feats=None):
    """Return the feature in feat_options that, added to prev_feats, gives the lowest test MSE."""
    # Use None as the default to avoid a shared mutable default argument
    prev_feats = list(prev_feats) if prev_feats is not None else []
    min_mse = float('inf')
    best_feature = None
    
    for feat in feat_options:
        linreg = LinearRegression()
        # Selecting with a list keeps the data 2-D, even for a single feature
        feats = prev_feats + [feat]
        curr_X_train = X_train[feats]
        curr_X_test = X_test[feats]
            
        linreg.fit(curr_X_train,y_train)
        
        y_hat_test = linreg.predict(curr_X_test)
        test_mse = mse(y_test - y_hat_test)
        
        if test_mse < min_mse:
            min_mse = test_mse
            best_feature = feat
    return best_feature
    

In [28]:
def linreg_greedy_feat(n_feats, X_train, X_test, y_train, y_test):
    """Greedily pick up to n_feats features, then fit a model on the chosen set."""
    curr_model_feats = []
    remaining_feats = list(X_train.columns)  # use the argument rather than the global X
    
    for _ in range(n_feats):
        if not remaining_feats:  # stop early once every feature has been used
            break
        next_feat = best_feat(X_train, X_test, y_train, y_test,
                              feat_options=remaining_feats, prev_feats=curr_model_feats)
        curr_model_feats.append(next_feat)
        remaining_feats.remove(next_feat)
    
    model = LinearRegression()
    model.fit(X_train[curr_model_feats], y_train)
    
    return model, curr_model_feats

In [29]:
best_feat(X_train,X_test,y_train,y_test,feat_options=X.columns)

'213d_V1_CUR'

In [31]:
linreg_greedy_feat(10, X_train, X_test, y_train, y_test)

(LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 ['213d_V1_CUR',
  '212d_V1B_CUR',
  '212d_V1_CUR',
  '224d_V1_CUR',
  '205d_V2_CUR',
  '716d_V1_INT',
  '501d_V11_CUR',
  '215d_V1_CUR_ICEP',
  '708d_V1_INT',
  'CAT_Region_Num'])
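
For comparison, scikit-learn ships a forward selection utility, `SequentialFeatureSelector` (available from version 0.24, which is assumed here). Unlike the hand-written function above, it scores candidates by cross-validation on the data it is given rather than by error on a held-out test set, so its choices may differ. A minimal sketch on synthetic data, where the array shape and coefficients are made up:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Hypothetical data: only columns 1 and 4 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4] + rng.normal(scale=0.1, size=300)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction='forward',               # add features one at a time
    scoring='neg_mean_squared_error',  # higher is better, hence the negation
    cv=5,
)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected = np.flatnonzero(selector.get_support()).tolist()
print(selected)
```

Scoring by cross-validation leaves the test set untouched during selection, which keeps the final test error a fairer estimate of performance on new data.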

# Plotting Learning Curves
Iterate over models with 2 to 20 features. Use the greedy selection function defined above (`linreg_greedy_feat`) to fit linear regression models that incorporate successively more features. Then plot the train and test errors as a function of the number of features in each model.

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

for i in range(2,21):
    print('On iteration: {}'.format(i-1))
    #Train greedy linear regression model with this many features
    #Your code here
    
    #Calculate Training Mean Squared Error
    #Your code here
    
    #Calculate Test Mean Squared Error
    #Your code here
    #Plot Results
    #Your code here
    pass
#Add Legend and Descriptive Title/Axis Labels
#Your code here
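
One possible way to fill in the skeleton is sketched below on synthetic data, so it is an illustration rather than a reference solution. The helper names (`subset_mse`, `greedy_order`, `error_curves`, `plot_error_curves`) and the data are hypothetical. The sketch relies on the fact that the first k features of a longer greedy run are the same as those of a k-feature run, so the selection only needs to run once.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def subset_mse(cols, X_fit, y_fit, X_eval, y_eval):
    """Fit on (X_fit, y_fit) using cols and return the MSE on (X_eval, y_eval)."""
    model = LinearRegression().fit(X_fit[cols], y_fit)
    return float(np.mean((y_eval - model.predict(X_eval[cols])) ** 2))


def greedy_order(n_feats, X_train, X_test, y_train, y_test):
    """Forward selection: repeatedly add the column that gives the lowest test MSE."""
    chosen, remaining = [], list(X_train.columns)
    while remaining and len(chosen) < n_feats:
        best = min(remaining, key=lambda f: subset_mse(
            chosen + [f], X_train, y_train, X_test, y_test))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def error_curves(X_train, X_test, y_train, y_test, min_feats=2, max_feats=20):
    """Train and test MSE of the greedy model at each feature count."""
    order = greedy_order(max_feats, X_train, X_test, y_train, y_test)
    n_values = list(range(min_feats, len(order) + 1))
    train_errs = [subset_mse(order[:k], X_train, y_train, X_train, y_train)
                  for k in n_values]
    test_errs = [subset_mse(order[:k], X_train, y_train, X_test, y_test)
                 for k in n_values]
    return n_values, train_errs, test_errs


def plot_error_curves(n_values, train_errs, test_errs):
    """Draw both curves; matplotlib is imported here so the numeric part runs without it."""
    import matplotlib.pyplot as plt
    plt.plot(n_values, train_errs, label='Train MSE')
    plt.plot(n_values, test_errs, label='Test MSE')
    plt.xlabel('Number of features')
    plt.ylabel('Mean squared error')
    plt.title('Greedy feature selection: error vs. number of features')
    plt.legend()
    plt.show()


# Hypothetical data: 8 columns, of which only the first 3 affect the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 8)),
                      columns=['f{}'.format(i) for i in range(8)])
y_demo = (4 * X_demo['f0'] - 3 * X_demo['f1'] + 2 * X_demo['f2']
          + rng.normal(scale=0.5, size=400))
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

n_values, train_errs, test_errs = error_curves(X_tr, X_te, y_tr, y_te)
```

In the notebook, calling `plot_error_curves(n_values, train_errs, test_errs)` would then draw the two curves. With the real data, the same functions could be called with the notebook's `X_train`, `X_test`, `y_train`, and `y_test`. The training error can only fall or stay flat as features are added, because each model contains the previous one.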