
## Description

UPDATE: This version of the kernel tests the idea of selecting features with LOFO (Leave One Feature Out) importance. For more details about LOFO, please see Ahmet Erdem's kernel, available [at this link](https://www.kaggle.com/divrikwicky/instantgratification-lofo-feature-importance). The feature selection step slows down training, so this version runs longer than 1 minute. If you want to see the original kernel, which runs in less than a minute, please refer to Version 1 of this kernel.

The original kernel scores 0.99610 on the LB. Unfortunately, we cannot use this result as a baseline for comparison because we cannot submit this version to the LB: LOFO requires an external package, `lofo-importance`, and the use of external packages is banned by the competition rules. However, it is possible to compute the cross-validation score for the QDA model without LOFO. As a matter of fact, I have already done it in a different kernel: [link](https://www.kaggle.com/graf10a/tuning-512-separate-qda-models) (see the "Repeat Using the Standard Parameters" section). The result was a CV score of 0.96629. Let's see if selecting features with LOFO can improve this baseline.

SPOILER: The result is inconclusive: the combined AUC went up from 0.96629 to 0.96727, the fold-average AUC went down from 0.96628 to 0.96213, and the standard deviation increased from 9e-05 to 0.0097. It would be nice to submit it to the LB to see how well it performs.
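
The combined AUC and the fold-average AUC quoted above are different statistics, so they do not have to move in the same direction. The following is a minimal sketch on synthetic scores (not competition data; the names `y`, `oof`, and `n_folds` are chosen for illustration) that computes both, together with the fold standard deviation and the standard error of the fold mean.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for out-of-fold predictions (assumption: not real data)
n = 2000
y = rng.integers(0, 2, size=n)
oof = 0.5 + 0.25 * (2 * y - 1) + rng.normal(0, 0.3, size=n)

n_folds = 25
folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)

# One AUC per validation fold
fold_aucs = np.array([
    roc_auc_score(y[val_idx], oof[val_idx])
    for _, val_idx in folds.split(np.zeros(n), y)
])

combined_auc = roc_auc_score(y, oof)      # single AUC over all pooled predictions
fold_mean = fold_aucs.mean()              # average of the per-fold AUCs
fold_std = fold_aucs.std()                # spread of the per-fold AUCs
fold_sem = fold_std / np.sqrt(n_folds)    # standard error of the fold mean

print(f'combined: {combined_auc:.5f}')
print(f'fold mean: {fold_mean:.5f}, std: {fold_std:.5f}, sem: {fold_sem:.5f}')
```

The pooled score ranks all predictions against each other, while the fold average only ranks predictions within each fold, so small folds tend to give noisier per-fold values.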

## Setting things up
### Loading Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np 
import pandas as pd 
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler


from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

### Loading Data

In [None]:
%%time

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

train['wheezy-copper-turtle-magic'] = train['wheezy-copper-turtle-magic'].astype('category')
test['wheezy-copper-turtle-magic'] = test['wheezy-copper-turtle-magic'].astype('category')

### Computing LOFO Importance

Here is the adapted code from [Ahmet's notebook](https://www.kaggle.com/divrikwicky/instantgratification-lofo-feature-importance):

In [None]:
from lofo import LOFOImportance  # only LOFOImportance is used below
from tqdm import tqdm_notebook

def get_model():
    return Pipeline([('scaler', StandardScaler()),
                    ('qda', QuadraticDiscriminantAnalysis(reg_param=0.111))
                   ])

features = [c for c in train.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']]


def get_lofo_importance(wctm_num):
    sub_df = train[train['wheezy-copper-turtle-magic'] == wctm_num]
    sub_features = [f for f in features if sub_df[f].std() > 1.5]
    lofo_imp = LOFOImportance(sub_df, target="target",
                              features=sub_features, 
                              cv=StratifiedKFold(n_splits=4, random_state=42, shuffle=True), scoring="roc_auc",
                              model=get_model(), n_jobs=4)
    return lofo_imp.get_importance()

features_to_remove = []
potential_gain = []

n_models=512
for i in tqdm_notebook(range(n_models)):
    imp = get_lofo_importance(i)
    features_to_remove.append(imp["feature"].values[-1])
    potential_gain.append(-imp["importance_mean"].values[-1])
    
print("Potential gain (AUC):", np.round(np.mean(potential_gain), 5))
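
To make the quantity computed above more concrete, here is a rough sketch of the leave-one-feature-out idea using only scikit-learn. It is not the `lofo-importance` implementation: `manual_lofo` and `make_model` are hypothetical helpers, the data is synthetic, and the output column names simply mirror the `feature` and `importance_mean` columns used in the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_model():
    # Same pipeline shape as get_model() in the cell above
    return Pipeline([('scaler', StandardScaler()),
                     ('qda', QuadraticDiscriminantAnalysis(reg_param=0.111))])


def manual_lofo(X, y, cv):
    """Hypothetical helper: drop in CV AUC when each feature is left out."""
    base = cross_val_score(make_model(), X, y, cv=cv, scoring='roc_auc')
    rows = []
    for col in X.columns:
        reduced = cross_val_score(make_model(), X.drop(columns=col), y,
                                  cv=cv, scoring='roc_auc')
        diff = base - reduced  # positive: the feature helps; negative: it hurts
        rows.append({'feature': col,
                     'importance_mean': diff.mean(),
                     'importance_std': diff.std()})
    return (pd.DataFrame(rows)
            .sort_values('importance_mean', ascending=False)
            .reset_index(drop=True))


# Synthetic data for illustration only
X_arr, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

# A fixed random_state gives the same folds on every call, so scores are paired
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
imp = manual_lofo(X_demo, y_demo, cv)
print(imp)
```

With the rows sorted this way, the last row is the feature whose removal costs the least (or gains the most), which is the role played by `imp["feature"].values[-1]` in the loop above.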

## Building the QDA Classifier with LOFO

### Preparing Things for Cross-Validation

In [None]:
clf_name='QDA'

NFOLDS=25
RS=42

oof=np.zeros(len(train))
preds=np.zeros(len(test))

### Training the Classifiers on All Data

In [None]:
%%time

print(f'Cross-validation for the {clf_name} classifier:')

default_cols = [c for c in train.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']]

# BUILD 512 SEPARATE NON-LINEAR MODELS
for i in range(512):  
    
    # EXTRACT SUBSET OF DATASET WHERE WHEEZY-MAGIC EQUALS i     
    X = train[train['wheezy-copper-turtle-magic']==i].copy()
    Y = X.pop('target').values
    X_test = test[test['wheezy-copper-turtle-magic']==i].copy()
    idx_train = X.index 
    idx_test = X_test.index
    X.reset_index(drop=True,inplace=True)

    #cols = [c for c in X.columns if c not in ['id', 'wheezy-copper-turtle-magic']]
    cols = [c for c in default_cols if c != features_to_remove[i]]
    X = X[cols].values             # numpy.ndarray
    X_test = X_test[cols].values   # numpy.ndarray

    # FEATURE SELECTION (USE APPROX 40 OF 255 FEATURES)
    vt = VarianceThreshold(threshold=1.5).fit(X)
    X = vt.transform(X)            # numpy.ndarray
    X_test = vt.transform(X_test)  # numpy.ndarray   

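    # STRATIFIED K FOLD
    # NOTE: auc_all_folds is re-created for every model, so the fold statistics
    # printed after the loop cover only the last model (i = 511).
    auc_all_folds=np.array([])
    # random_state is omitted: without shuffle=True it has no effect, and recent
    # scikit-learn versions raise a ValueError when it is set.
    folds = StratifiedKFold(n_splits=NFOLDS)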

    for fold_num, (train_index, val_index) in enumerate(folds.split(X, Y), 1):

        X_train, Y_train = X[train_index, :], Y[train_index]
        X_val, Y_val = X[val_index, :], Y[val_index]

        pipe = Pipeline([('scaler', StandardScaler()),
                         (clf_name, QuadraticDiscriminantAnalysis(reg_param=0.111)),
                       ])  

        pipe.fit(X_train, Y_train)

        oof[idx_train[val_index]] = pipe.predict_proba(X_val)[:,1]
        preds[idx_test] += pipe.predict_proba(X_test)[:,1]/NFOLDS

        auc = roc_auc_score(Y_val, oof[idx_train[val_index]])
        auc_all_folds = np.append(auc_all_folds, auc)
            
# PRINT CROSS-VALIDATION AUC FOR THE CLASSIFIER
auc_combo = roc_auc_score(train['target'].values, oof)
auc_folds_average = np.mean(auc_all_folds)
# Fold-to-fold standard deviation divided by sqrt(NFOLDS),
# i.e. the standard error of the fold mean
std = np.std(auc_all_folds)/np.sqrt(NFOLDS)

print(f'The combined CV score is {round(auc_combo,5)}.')
print(f'The folds average CV score is {round(auc_folds_average,5)}.')
print(f'The standard deviation is {round(std, 5)}.')
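
One detail worth noting: the LOFO cell filters columns with a standard-deviation cutoff (`std() > 1.5`), while the training loop above uses `VarianceThreshold(threshold=1.5)`, which compares the variance. The sketch below, on synthetic columns with spreads chosen purely for illustration, shows that the two rules can disagree for a column whose standard deviation lies between roughly 1.22 and 1.5.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n_rows = 5000

# Synthetic columns (assumption: spreads chosen for illustration only)
demo = pd.DataFrame({
    'wide': rng.normal(0, 3.7, n_rows),         # std ~3.7, variance ~13.7
    'narrow': rng.normal(0, 1.0, n_rows),       # std ~1.0, variance ~1.0
    'borderline': rng.normal(0, 1.35, n_rows),  # std ~1.35, variance ~1.82
})

# Rule used in the LOFO cell: standard deviation above 1.5
by_std = [c for c in demo.columns if demo[c].std() > 1.5]

# Rule used in the training loop: variance above 1.5
vt = VarianceThreshold(threshold=1.5).fit(demo.values)
by_var = list(demo.columns[vt.get_support()])

print('kept by std rule:     ', by_std)
print('kept by variance rule:', by_var)
```

When the informative columns have a much larger spread than the rest, both rules keep the same columns, so the difference only matters for borderline cases like the one constructed here.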

## Creating the Submission File

All done! At this point we are ready to make our submission file! (We won't be able to submit it but let's make it anyway.)

In [None]:
sub = pd.read_csv('../input/sample_submission.csv')
sub['target'] = preds
sub.to_csv('submission.csv',index=False)

In [None]:
sub.shape

In [None]:
sub.head()