## Introduction

### The Purpose of this Kernel

It has been firmly established that the competion data set was made with the help of the `make_classification()` utility (see for example [this beautiful kernel](https://www.kaggle.com/mhviraf/synthetic-data-for-next-instant-gratification) by mhviraf. But one question still remains open: what value was used in the data preparation process for the number of clusters per class? The default value of this parameter is 2 but, on the other hand, the QDA algorithm is working very well on this data set and, since QDA assumes only one cluster per class, many of us use it as a standard assumption. But is this assumption correct? To answer this question we will need to take a little bit closer look at the data. 

I hope you find this kernel useful. Don't forget to kindly upvote it and leave great comments!

### The Main Idea

In our analysis, we will be using a simple data augmentation idea that was nicely explained in [this discussion topic](https://www.kaggle.com/c/instant-gratification/discussion/94128#latest-549171) by TripleLift and in [this kernel](https://www.kaggle.com/nroman/augmentation-explained) by Roman. The idea is simple: if all the data in a given class, say 1, belong to a single cluster then we can easily augment our train set by adding to it new points computed from the following formula: $2(X_1)_\text{center} - X_1$, where $X_1$ is the class 1 train set data and $(X_1)_\text{center}$ is the 'centroid', or the mean point of $X_1$ (each coordinate of this point is equal to the average value of the coordinates of $X_1$). Some people tried to apply this idea, some even claimed that they got a moderate boost to their score but it did not work very well for many others, myself included -- I observe a slight decrease in my local CV score when I tried implementing this method. What I realized recently is that this data augmentation idea can be used to detect the presence of multiple clusters for a given class. 

Let me explain how it works. First of all we will need to abandon some widespread misconceptions. The data augmentation method described above is often presented as a mean to improve the LB score. But if you think about it, you realize that this expectation is not justified -- by augmenting your data set in this way, you are not adding any useful information to your QDA algorithm: quick inspection of the mathematical definitions of the mean and covariance shows that both of this quantities are not going to change under the described data augmentation operation (assuming large enough number of points, so that the difference between $N$ and $N-1$ is irrelvant) and for QDA the mean and covariance is all that matters. On the other hand, the performance of QDA is not supposed to suffer either because for the augmented data set it will compute the same mean and covariance. And if the performance suffers, it means that there is something wrong with our understanding of the properties of the data set. This is the main idea of the method implemented below.

## Setting Things Up

### Loading Libraries

In [None]:
import numpy as np 
import pandas as pd 
from tqdm import tqdm
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import VarianceThreshold
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

### Loading Data

In [None]:
%%time

path = Path('../input')

train = pd.read_csv(path/'train.csv')
test = pd.read_csv(path/'test.csv')

### Preprocessing

In this part, we will collect and preprocess data from all 512 model (one model per one value of the `'wheezy-copper-turtle-magic'` categorical variable as was explained by Chris Deotte [here](https://www.kaggle.com/cdeotte/support-vector-machine-0-925)). Later we will be able to load all these data from a single dictionary.

In [None]:
magic_max=train['wheezy-copper-turtle-magic'].max()
magic_min=train['wheezy-copper-turtle-magic'].min()

In [None]:
def preprocess(train=train, test=test):
       
    prepr = {} 
    
    #PREPROCESS 512 SEPARATE MODELS
    for i in range(magic_min, magic_max+1):

        # EXTRACT SUBSET OF DATASET WHERE WHEEZY-MAGIC EQUALS i     
        X = train[train['wheezy-copper-turtle-magic']==i].copy()
        Y = X.pop('target').values
        X_test = test[test['wheezy-copper-turtle-magic']==i].copy()
        idx_train = X.index 
        idx_test = X_test.index
        X.reset_index(drop=True,inplace=True)

        cols = np.array([c for c in X.columns if c not in ['id', 'wheezy-copper-turtle-magic']])

        l=len(X)
        X_all = pd.concat([X[cols], X_test[cols]], ignore_index=True)
        
        sel = VarianceThreshold(threshold=2)
        X_vt = sel.fit_transform(X_all)               # np.ndarray
        
        prepr['vt_' + str(i)] = X_vt
        prepr['n_vt' + str(i)] = X_vt.shape[1]
        prepr['feats_vt' + str(i)] = cols[sel.get_support(indices=True)]        
        prepr['train_size_' + str(i)] = l
        prepr['idx_train_' + str(i)] = idx_train
        prepr['idx_test_' + str(i)] = idx_test
        prepr['target_' + str(i)] = Y
        
    return prepr

In [None]:
%%time

data = preprocess()

And here is a handy function to get data for any value of `i`.

In [None]:
def get_data(i, data):
    
    l = data['train_size_' + str(i)]
    
    X_all = data['vt_' + str(i)]                

    X = X_all[:l, :]
    X_test = X_all[l:, :]

    Y = data['target_' + str(i)]

    idx_train = data['idx_train_' + str(i)]
    idx_test = data['idx_test_' + str(i)]
    
    return X, X_test, Y, idx_train, idx_test

Now, let's define a very useful function initializing storage arrays for our cross-validation results: AUC, out-of-fold predictions, and test set prediction (actually, we won't be producing any submission file in this kernel, so keeping track of test set prediction is a bit redundant but let's keep it for the sake of completeness). 

In [None]:
def initialize_cv():
    auc = np.array([])
    oof = np.zeros(len(train))
    preds = np.zeros(len(test)) 
    return auc, oof, preds

And another useful function to report the result of the cross-validation procedure.

In [None]:
def report_results(oof, auc_all, clf_name='QDA'):
    # PRINT VALIDATION CV AUC FOR THE CLASSFIER
    print(f'The result summary for the {clf_name} classifier:')
    auc_combo = roc_auc_score(train['target'].values, oof)
    auc_av = np.mean(auc_all)
    std = np.std(auc_all)/(np.sqrt(NFOLDS)*np.sqrt(magic_max+1))

    print(f'The combined CV score is {round(auc_combo, 5)}.')    
    print(f'The folds average CV score is {round(auc_av, 5)}.')
    print(f'The standard deviation is {round(std, 5)}.\n')

Define the number of folds and random seed.

In [None]:
NFOLDS = 5
RS = 42

## Baseline: One Cluster per Class

Now, let's establish a baseline by runnig an unaugmented QDA algorithm once. This will give us a good reference score for comparison. Also, the algorithm will compute the mean values of the relevant variables for both the positive and negative classes and for all 512 values of the magic variable `'wheezy-copper-turtle-magic'`. These mean values will be stored in a dictionary called `means`.

In [None]:
auc_all, oof, preds = initialize_cv() 

means = {}

#TRAIN 512 SEPARATE MODELS
for i in tqdm(range(magic_min, magic_max+1)):

    X, X_test, Y, idx_train, idx_test = get_data(i=i, data=data)      

    # STRATIFIED K FOLD    
    folds = StratifiedKFold(n_splits=NFOLDS, random_state=RS)

    auc_folds = np.array([])

    for train_index, val_index in folds.split(X, Y):     

        X_train, Y_train = X[train_index, :], Y[train_index]
        X_val, Y_val = X[val_index, :], Y[val_index]

        #BUILDING THE PIPELINE FOR THE CURRENT CLASSIFIER
        clf = QuadraticDiscriminantAnalysis(reg_param=0.111)
        
        clf.fit(X_train, Y_train)
        
        means[str(i)] = clf.means_

        oof[idx_train[val_index]] = clf.predict_proba(X_val)[:,1]
        preds[idx_test] += clf.predict_proba(X_test)[:,1]/NFOLDS

        auc = roc_auc_score(Y_val, oof[idx_train[val_index]])
        auc_folds = np.append(auc_folds, auc)

    auc_all = np.append(auc_all, np.mean(auc_folds))

report_results(oof, auc_all)

Ok, the combined AUC score is 0.96278. Now, let's see how the same algorithm will do after data augmentation. For coordinates of the clusters' centroids we will be using the mean values computed in the previous step and stored in the `means` dictionary.

In [None]:
auc_all, oof, preds = initialize_cv() 

#TRAIN 512 SEPARATE MODELS
for i in tqdm(range(magic_min, magic_max+1)):

    X, X_test, Y, idx_train, idx_test = get_data(i=i, data=data)      

    # STRATIFIED K FOLD    
    folds = StratifiedKFold(n_splits=NFOLDS, random_state=RS)

    auc_folds = np.array([])

    for train_index, val_index in folds.split(X, Y):     

        X_train, Y_train = X[train_index, :], Y[train_index]
        X_val, Y_val = X[val_index, :], Y[val_index]
        
        X_aug_0 = 2*means[str(i)][0] - X_train[Y_train==0]
        Y_aug_0 = np.zeros(len(X_aug_0)).reshape(-1, 1)

        X_aug_1 = 2*means[str(i)][1] - X_train[Y_train==1]
        Y_aug_1 = np.zeros(len(X_aug_1)).reshape(-1, 1)

        X_aug=np.vstack((X_train, X_aug_0, X_aug_1))
        Y_aug=np.vstack((Y_train.reshape(-1, 1), Y_aug_0, Y_aug_1))

        perms = np.random.permutation(len(X_aug))

        X_aug = X_aug[perms]
        Y_aug = Y_aug[perms]

        #BUILDING THE PIPELINE FOR THE CURRENT CLASSIFIER
        clf = QuadraticDiscriminantAnalysis(reg_param=0.111)
        
        clf.fit(X_aug, Y_aug.ravel())

        oof[idx_train[val_index]] = clf.predict_proba(X_val)[:,1]
        preds[idx_test] += clf.predict_proba(X_test)[:,1]/NFOLDS

        auc = roc_auc_score(Y_val, oof[idx_train[val_index]])
        auc_folds = np.append(auc_folds, auc)

    auc_all = np.append(auc_all, np.mean(auc_folds))

report_results(oof, auc_all)

### What Does It Tell Us?

The new combined AUC score is 0.94964 which is significantly lower than the 0.96278 AUC that we got without augmentation. This is not supposed to happen if our understanding of the data set is correct! The augmentation that we have performed does not change the mean or covariance values, so the performance of QDA is not supposed to suffer. But it does. 

## Two Clusters per Class

One possible explanation of this phenomenon is that we actually have more than one cluster per class in the data set. If this is true, then we need to use the centroids of the clusters rathen than the mean values of positive and negative class as our 'pivoting points' for data augmentation. But what is the number of clusters? The simplest guess would be 2 because this is the default value of `make_classification()`. Let's investigate this possibility. To do that, we will use Gaussian Mixture Model (similar to the one introduced in [the great Dieter's kernel](https://www.kaggle.com/christofhenkel/graphicallasso-gaussianmixture)) to label 2 clusters for class 0 and 2 other clusters for class 1 (4 clusters in total). Then we will augment data using the centroids of the identified clusters as our 'pivoting points'.

In [None]:
auc_all, oof, preds = initialize_cv() 

#TRAIN 512 SEPARATE MODELS
for i in tqdm(range(magic_min, magic_max+1)):

    X, X_test, Y, idx_train, idx_test = get_data(i=i, data=data)      

    # STRATIFIED K FOLD    
    folds = StratifiedKFold(n_splits=NFOLDS, random_state=RS)

    auc_folds = np.array([])

    for train_index, val_index in folds.split(X, Y):     

        X_train, Y_train = X[train_index, :], Y[train_index]
        X_val, Y_val = X[val_index, :], Y[val_index]
        
        X_train_0 = X_train[Y_train==0]
        Y_train_0 = Y_train[Y_train==0].reshape(-1, 1)

        X_train_1 = X_train[Y_train==1]
        Y_train_1 = Y_train[Y_train==1].reshape(-1, 1)


        params={'n_components' : 2,          # 2 clusters per class
                'init_params': 'random', 
                'covariance_type': 'full', 
                'tol':0.001, 
                'reg_covar': 0.001,#0.001, 
                'max_iter': 100, 
                'n_init': 10, 
               }

        clf = GaussianMixture(**params)

        clf.fit(X_train_0)
        labels_0 = clf.predict(X_train_0)
        means_0 = clf.means_

        clf.fit(X_train_1)
        labels_1 = clf.predict(X_train_1)
        means_1 = clf.means_

        X_aug_00 = 2*means_0[0] - X_train_0[labels_0==0]
        Y_aug_00 = np.zeros(len(X_aug_00)).reshape(-1, 1)

        X_aug_01 = 2*means_0[1] - X_train_0[labels_0==1]
        Y_aug_01 = np.zeros(len(X_aug_01)).reshape(-1, 1)

        X_aug_10 = 2*means_1[0] - X_train_1[labels_1==0]
        Y_aug_10 = np.ones(len(X_aug_10)).reshape(-1, 1)

        X_aug_11 = 2*means_1[1] - X_train_1[labels_1==1]
        Y_aug_11 = np.ones(len(X_aug_11)).reshape(-1, 1)

        X_aug=np.vstack((X_train_0, X_train_1, X_aug_00, X_aug_01, X_aug_10, X_aug_11))
        Y_aug=np.vstack((Y_train_0, Y_train_1, Y_aug_00, Y_aug_01, Y_aug_10, Y_aug_11))

        perms = np.random.permutation(len(X_aug))

        X_aug = X_aug[perms]
        Y_aug = Y_aug[perms]

        #BUILDING THE PIPELINE FOR THE CURRENT CLASSIFIER
        clf = QuadraticDiscriminantAnalysis(reg_param=0.111)
        
        clf.fit(X_aug, Y_aug.ravel())

        oof[idx_train[val_index]] = clf.predict_proba(X_val)[:,1]
        preds[idx_test] += clf.predict_proba(X_test)[:,1]/NFOLDS

        auc = roc_auc_score(Y_val, oof[idx_train[val_index]])
        auc_folds = np.append(auc_folds, auc)

    auc_all = np.append(auc_all, np.mean(auc_folds))

report_results(oof, auc_all)

The combined AUC score is 0.96278 which is exactly the same as the score that we got without any data augmentation. This strongly suggests that the actual number of clusters per class in the data set is not one but two.

## Could it Be Three?

It it possible that the nuber of clusters is 3? Let's try it out using the same strategy.

In [None]:
auc_all, oof, preds = initialize_cv() 

#TRAIN 512 SEPARATE MODELS
for i in tqdm(range(magic_min, magic_max+1)):

    X, X_test, Y, idx_train, idx_test = get_data(i=i, data=data)      

    # STRATIFIED K FOLD    
    folds = StratifiedKFold(n_splits=NFOLDS, random_state=RS)

    auc_folds = np.array([])

    for train_index, val_index in folds.split(X, Y):     

        X_train, Y_train = X[train_index, :], Y[train_index]
        X_val, Y_val = X[val_index, :], Y[val_index]
        
        X_train_0 = X_train[Y_train==0]
        Y_train_0 = Y_train[Y_train==0].reshape(-1, 1)

        X_train_1 = X_train[Y_train==1]
        Y_train_1 = Y_train[Y_train==1].reshape(-1, 1)


        params={'n_components' : 3,          # 3 clusters per class
                'init_params': 'random', 
                'covariance_type': 'full', 
                'tol':0.001, 
                'reg_covar': 0.001,#0.001, 
                'max_iter': 100, 
                'n_init': 10, 
               }

        clf = GaussianMixture(**params)

        clf.fit(X_train_0)
        labels_0 = clf.predict(X_train_0)
        means_0 = clf.means_

        clf.fit(X_train_1)
        labels_1 = clf.predict(X_train_1)
        means_1 = clf.means_

        X_aug_00 = 2*means_0[0] - X_train_0[labels_0==0]
        Y_aug_00 = np.zeros(len(X_aug_00)).reshape(-1, 1)

        X_aug_01 = 2*means_0[1] - X_train_0[labels_0==1]
        Y_aug_01 = np.zeros(len(X_aug_01)).reshape(-1, 1)
        
        X_aug_02 = 2*means_0[2] - X_train_0[labels_0==2]
        Y_aug_02 = np.zeros(len(X_aug_02)).reshape(-1, 1)

        X_aug_10 = 2*means_1[0] - X_train_1[labels_1==0]
        Y_aug_10 = np.ones(len(X_aug_10)).reshape(-1, 1)

        X_aug_11 = 2*means_1[1] - X_train_1[labels_1==1]
        Y_aug_11 = np.ones(len(X_aug_11)).reshape(-1, 1)
        
        X_aug_12 = 2*means_1[2] - X_train_1[labels_1==2]
        Y_aug_12 = np.zeros(len(X_aug_12)).reshape(-1, 1)

        X_aug=np.vstack((X_train_0, X_train_1, X_aug_00, X_aug_01, X_aug_02, X_aug_10, X_aug_11, X_aug_12,))
        Y_aug=np.vstack((Y_train_0, Y_train_1, Y_aug_00, Y_aug_01, Y_aug_02, Y_aug_10, Y_aug_11, Y_aug_12))

        perms = np.random.permutation(len(X_aug))

        X_aug = X_aug[perms]
        Y_aug = Y_aug[perms]

        #BUILDING THE PIPELINE FOR THE CURRENT CLASSIFIER
        clf = QuadraticDiscriminantAnalysis(reg_param=0.111)
        
        clf.fit(X_aug, Y_aug.ravel())

        oof[idx_train[val_index]] = clf.predict_proba(X_val)[:,1]
        preds[idx_test] += clf.predict_proba(X_test)[:,1]/NFOLDS

        auc = roc_auc_score(Y_val, oof[idx_train[val_index]])
        auc_folds = np.append(auc_folds, auc)

    auc_all = np.append(auc_all, np.mean(auc_folds))

report_results(oof, auc_all)

Well, the AUC corresponding to 3 clusters per class is 0.95421. This result is better than what we had for 1 cluster per class (0.94964) but worse than for 2 clusters per class (0.96278). It does not look very surprising -- with 3 clusters there is more flexibility (less rigid bias) that with one and it is good for the score. But, if the right number of clusters is 2 than fitting the data with 3 clusters leads to too much flexibility -- our bias is too flexible and it is bad for the score. So, our empirical demostration suggests that the right number of clusters per class should be two. 

## Conclusion

The evidence we collected strongly suggest that the actual number of clusters per class in the data set is not one (or three) but two. And this must be taken into account in the model building process -- in order for your model to be successful it must have the optimal bias which is most consistent with the structure of your data. This will help you to boost you score and move to a higher position on the leaderboard. Good luck!