So I decided to estimate the chance that a row is an outlier (i.e. its label is greater than some high percentile). To do that, I built a classifier based on the naive Bayes principle, stacked on a feature selector: logistic regression with an L1 penalty.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

y = train.pop('y')
ID = train.pop('ID')

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict
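# note: RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this import needs an older scikit-learn release
from sklearn.linear_model import RandomizedLogisticRegression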
from sklearn.naive_bayes import BernoulliNB
from sklearn.base import TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline, make_union

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score, confusion_matrix

In [None]:
ints = train.select_dtypes(['int']).columns.tolist()
objs = train.select_dtypes(['object']).columns.tolist()

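# collect constant columns first; removing from `ints` while iterating over it would skip elements
constant = [col for col in ints if np.var(train[col]) == 0]
train = train.drop(constant, axis=1)
ints = [col for col in ints if col not in constant]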


In [None]:
# let's define outliers as labels greater than 120
# (.values replaces .as_matrix(), which was removed in pandas 1.0)
outs = (y > 120).values.astype(int)
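
The introduction describes outliers as labels above a high percentile, while the cell above hard-codes 120. A minimal sketch of deriving the cut-off from a percentile instead, on synthetic labels (the 95th percentile and the generated data are assumptions for illustration, not values from this notebook):

```python
import numpy as np

# Synthetic stand-in for the target column; not the competition data.
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=100, scale=12, size=1000)

# Assumed choice: treat everything above the 95th percentile as an outlier.
threshold = np.percentile(y_demo, 95)
outs_demo = (y_demo > threshold).astype(int)

# Share of rows flagged as outliers; close to 5% by construction.
print(threshold, outs_demo.mean())
```

A percentile-based cut-off keeps the share of positives fixed, which makes precision and recall easier to compare if the target distribution shifts.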

In [None]:
def evaluate(y_true, pred, thresh=.5):
    print('precision', precision_score(y_true, pred[:, 1]>thresh))
    print('recall', recall_score(y_true, pred[:, 1]>thresh))
    print('roc', roc_auc_score(y_true, pred[:, 1]))
    print('f1', f1_score(y_true, pred[:, 1]>thresh))

In [None]:
cv_preds = cross_val_predict(BernoulliNB(), train[ints], outs, cv=10, method='predict_proba')

In [None]:
evaluate(outs, cv_preds)

Okay, that's quite bad. Let's add a feature selection step to the pipeline.

In [None]:
pip = make_pipeline(RandomizedLogisticRegression(C=5), BernoulliNB())

selection_preds = cross_val_predict(pip, train[ints], outs, cv=10, method='predict_proba')
evaluate(outs, selection_preds)

This takes quite a long time, so I settled on C=5 (intuition, possibly flawed) and did not test any other hyperparameters. We see that feature selection improves ROC AUC but hurts F1.
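
RandomizedLogisticRegression is not available in recent scikit-learn releases. As a rough sketch of the same idea (L1-based feature selection feeding naive Bayes), the following swaps in `SelectFromModel` around a plain L1-penalised `LogisticRegression`. This is a different selector from the one used above, and the synthetic data and `C=0.5` are assumptions for illustration only:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Synthetic binary features: only the first two columns carry signal.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(500, 20))
y_demo = ((X_demo[:, 0] + X_demo[:, 1]) == 2).astype(int)

# The L1 penalty drives weak coefficients to zero;
# SelectFromModel keeps only the columns with non-negligible weights.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
)
pipe_demo = make_pipeline(selector, BernoulliNB())
pipe_demo.fit(X_demo, y_demo)

kept = pipe_demo.named_steps["selectfrommodel"].get_support()
proba_demo = pipe_demo.predict_proba(X_demo)
print(kept.sum(), "of", X_demo.shape[1], "features kept")
```

Unlike the randomized variant, this fits the L1 model once, so it is faster but less stable when features are strongly correlated.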

Nonetheless, it's now time to analyze the non-binary features. I will build a transformer that encodes each level of a feature as promising or not, based on the proportion of outliers associated with that level.

In [None]:
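class OutlierThresholder(TransformerMixin):
    """Binarize each categorical column by how outlier-prone its levels are.

    A level is "useful" (encoded as 0) when the outlier rate among rows with
    that level exceeds `thresh` times the rate among all other rows; every
    other level, including levels unseen during fit, is encoded as 1.
    """

    def __init__(self, thresh=1.5):
        self.th = thresh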
    
    def fit(self, X, y):
        
        X = np.asarray(X)
        maps = []
        for col in range(X.shape[1]):
            
            val = X[:, col].copy()
            useful = []
            not_useful = []
            for u in np.unique(X[:, col]):
                
                o, no = y[val==u].mean(), y[val!=u].mean()
                q = o/no if no else 0
                
                if q > self.th:
                    useful.append(u)
                else:
                    not_useful.append(u)
                    
            col_map = dict(zip(useful+not_useful, [0]*len(useful)+[1]*len(not_useful)))
            maps.append(col_map)
            
        self.maps = maps
        return self
        
    def transform(self, X, y=None):
        
        X = X.copy()
        X = np.asarray(X)
        for col in range(X.shape[1]):
            
            X[:, col] = [self.maps[col][x] if x in self.maps[col] else 1 for x in X[:, col]]
            
        return X
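
The ratio that `fit` computes for each level can be checked by hand. A small sketch on a made-up column (the levels and flags below are invented for illustration):

```python
import numpy as np

# Toy categorical column and outlier flags (made up for illustration).
levels = np.array(["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"])
flags = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])

ratios = {}
for level in np.unique(levels):
    inside = flags[levels == level].mean()   # outlier rate for this level
    outside = flags[levels != level].mean()  # outlier rate for all other rows
    ratios[level] = inside / outside if outside else 0.0

# With the default thresh=1.5, only levels whose ratio exceeds 1.5 count as useful.
useful = [level for level, q in ratios.items() if q > 1.5]
print(ratios, useful)
```

Here level `a` has an outlier rate of 2/3 against 1/7 for the remaining rows, so it is the only level that passes the threshold.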

In [None]:
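def sel_obj(X):
    # the first 8 columns are the categorical (object) features
    return np.asarray(X)[:, :8]

def sel_ints(X):
    # the remaining columns are the binary features
    return np.asarray(X)[:, 8:]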

In [None]:
pip = make_pipeline(OutlierThresholder(), BernoulliNB())

outlier_obj_preds = cross_val_predict(pip, train[objs], outs, method='predict_proba', cv=10)
evaluate(outs, outlier_obj_preds)

Very, very bad. Let's add the binary features.

In [None]:
un = make_union(make_pipeline(FunctionTransformer(sel_obj), OutlierThresholder()), FunctionTransformer(sel_ints))

for col in objs:
    train[col] = pd.factorize(train[col])[0]

binary_with_obj = make_pipeline(un, BernoulliNB())

In [None]:
full_preds = cross_val_predict(binary_with_obj, train, outs, method='predict_proba', cv=10)

In [None]:
evaluate(outs, full_preds)

Comparable with the model based solely on binary features. Not worth the hassle.

In [None]:
upd_binary_with_obj = make_pipeline(un, RandomizedLogisticRegression(C=5), BernoulliNB())

full_upd_preds = cross_val_predict(upd_binary_with_obj, train, outs, method='predict_proba', cv=10)
evaluate(outs, full_upd_preds)

Another bad score. Before I include these in the final model, I would like to plot calibration curves. Intuitively, we would like the model to be right n% of the time on the samples to which it assigns n% confidence. Such a model is called properly calibrated, and in that case the plot follows the straight line y=x.
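
To make the idea concrete, here is a minimal NumPy sketch of how such a curve can be computed by binning predictions. The helper `binned_calibration` is hypothetical and written for illustration, and the data is synthetic and calibrated by construction:

```python
import numpy as np

def binned_calibration(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed positive rate) per non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so a prediction of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

# Synthetic, calibrated predictions: each label is drawn with probability p.
rng = np.random.default_rng(0)
p = rng.uniform(size=20000)
labels = (rng.uniform(size=20000) < p).astype(int)

mean_pred, frac_pos = binned_calibration(labels, p)
print(np.round(mean_pred, 2), np.round(frac_pos, 2))
```

For calibrated predictions the two arrays nearly coincide, which is the y=x line; a miscalibrated model shows a gap between them in some bins.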

In [None]:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

In [None]:
plt.plot(*calibration_curve(outs, full_upd_preds[:, 1], n_bins=5)[::-1])
plt.xlabel('mean predicted probability')
plt.ylabel('fraction of actual outliers')
plt.show()

But we see that this assumption does not hold. Let's check the earlier models.

In [None]:
plt.plot(*calibration_curve(outs, outlier_obj_preds[:, 1], n_bins=5)[::-1])
plt.xlabel('mean predicted probability')
plt.ylabel('fraction of actual outliers')
plt.show()

In [None]:
plt.plot(*calibration_curve(outs, cv_preds[:, 1], n_bins=5)[::-1])
plt.xlabel('mean predicted probability')
plt.ylabel('fraction of actual outliers')
plt.show()

In [None]:
plt.plot(*calibration_curve(outs, selection_preds[:, 1], n_bins=5)[::-1])
plt.xlabel('mean predicted probability')
plt.ylabel('fraction of actual outliers')
plt.show()

Let's build some xgboost models

In [None]:
from xgboost import XGBRegressor
from functools import partial

xgb_params = dict(max_depth=3, learning_rate=0.05, n_estimators=100, subsample=.7, colsample_bytree=.7)
xgbr = XGBRegressor(**xgb_params)
my_cv = partial(cross_val_score, scoring='r2', cv=10)
cv_ordinary = my_cv(xgbr, train, y)
cv_add = my_cv(xgbr, np.hstack([train, cv_preds[:, 1].reshape(-1, 1)]), y)

In [None]:
cv_ordinary.mean(), cv_add.mean()

Doesn't look very helpful. On the other hand, I didn't put a lot of effort into choosing hyperparameters.

(For some reason the two cells above won't run here. On my computer both results are ~0.57, with the second one slightly worse.)