Negative downsampling (CTR prediction)
========

When you have a binary classification problem of predicting clicks, the classes "clicked" and "not clicked" are heavily skewed: there are many more impressions without clicks than impressions with clicks.

It turns out that prediction quality often improves if some negative-class samples are removed. It makes sense to tune the sampling rate as a hyperparameter, but a good starting point is often the sampling rate that brings the class proportions to roughly 1:1.

After downsampling, you need to recalibrate the output of the model, because a model trained on downsampled data overestimates the click probability.

The recalibration formula:

$q = \frac{p}{p+\frac{1-p}{w}}$

where $p$ is the output of your predictor trained on downsampled data, $q$ is the probability you need, and $w$ is the sampling rate (the fraction of negative samples that are kept).
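
One way to see where the formula comes from, assuming negatives are dropped independently of the features and each is kept with probability $w$: downsampling leaves the number of clicks unchanged and multiplies the number of non-clicks by $w$, so the odds of a click grow by a factor of $1/w$:

$\frac{p}{1-p} = \frac{1}{w}\cdot\frac{q}{1-q}$

Solving for $q$ gives

$q = \frac{wp}{wp + 1 - p} = \frac{p}{p+\frac{1-p}{w}}$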

Target sampling rate for a 1:1 proportion of classes:

$w = \frac{ctr}{1-ctr}$  

Note that when ctr is small, $w \approx ctr$.
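
To get a feeling for how good this approximation is, here is a small sketch that compares the exact rate with `ctr` for a few values. The helper name `target_sampling_rate` and the chosen CTR values are illustrative assumptions, not part of the original notebook. By algebra, the relative error of the approximation equals `ctr` itself.

```python
def target_sampling_rate(ctr):
    """Sampling rate for negatives that yields a 1:1 class ratio (hypothetical helper)."""
    return ctr / (1.0 - ctr)

# Illustrative CTR values, chosen only for this comparison.
for ctr in (0.001, 0.012, 0.1, 0.3):
    w = target_sampling_rate(ctr)
    rel_error = (w - ctr) / w  # simplifies to ctr
    print(f"ctr={ctr:<6} w={w:.6f} relative error of approximation={rel_error:.3f}")
```

So for a CTR around 1% the approximation is off by about 1%, while for a CTR of 30% it is no longer usable.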

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Let's generate some data: clicks.

In [68]:
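N = 200000000  # number of simulated impressions (needs a few GB of RAM)
ideal_ctr = 0.012  # true click probability
w = ideal_ctr / (1. - ideal_ctr)  # sampling rate for negatives that gives a 1:1 class ratio
clicks = (np.random.rand(N) < ideal_ctr) * 1  # 1 = click, 0 = no click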

`clicks` is now an array of zeros and ones (a one marks an impression with a click).

In [69]:
clicks

array([0, 0, 0, ..., 0, 0, 0])

In [70]:
def recalibrate(p, w):
    # Map a prediction p from the downsampled model back to the original scale.
    return p / (p + (1. - p) / w)
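
A quick sanity check of this function, as a sketch: with the 1:1 sampling rate, a prediction of 0.5 should map back to the original CTR, and the odds of any prediction should shrink by exactly a factor of `w`. The sample predictions `0.1, 0.5, 0.9` are arbitrary values picked for illustration.

```python
import numpy as np

def recalibrate(p, w):
    # Same formula as in the cell above.
    return p / (p + (1.0 - p) / w)

ctr = 0.012
w = ctr / (1.0 - ctr)

# A prediction of 0.5 on 1:1 downsampled data maps back to the original CTR.
q_half = recalibrate(0.5, w)

# The function also works elementwise on an array of predictions.
p = np.array([0.1, 0.5, 0.9])  # arbitrary example predictions
q = recalibrate(p, w)

# Odds check: (q / (1 - q)) / (p / (1 - p)) should equal w for every element.
odds_ratio = (q / (1 - q)) / (p / (1 - p))
print(q_half, q, odds_ratio)
```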

Original CTR

In [71]:
ctr = clicks.sum()/float(len(clicks))
ctr

0.011986775

Let's prepare our "training" set: we take all clicks, and for impressions without clicks we keep only some of them, with sampling rate w.

In [72]:
# Keep every click; keep each non-click with probability w.
c2 = clicks[(clicks > 0) | ((clicks < 1) & (np.random.rand(len(clicks)) < w))]
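
The same masking step can be wrapped in a small reusable helper. The following is only a sketch: the name `downsample_negatives`, the use of `np.random.default_rng` with a fixed seed, and the smaller dataset size are assumptions of this example, not part of the original notebook. Because a click is always kept, the explicit `clicks<1` test is implied by the `|` and is omitted here.

```python
import numpy as np

def downsample_negatives(labels, w, rng):
    """Keep every positive and each negative independently with probability w."""
    keep = (labels > 0) | (rng.random(len(labels)) < w)
    return labels[keep]

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
ctr = 0.012
w = ctr / (1.0 - ctr)

# A smaller simulated dataset than in the notebook, to keep memory use low.
labels = (rng.random(1_000_000) < ctr).astype(int)

sampled = downsample_negatives(labels, w, rng)
print(sampled.mean())  # share of clicks after downsampling, close to 0.5
```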

CTR in the new set is about 50%, exactly where we want it.

In [73]:
ctr2=c2.sum()/float(len(c2))
ctr2

0.49952242975737682

And the recalibrated CTR is very close to the original one.

In [74]:
recalibrate(ctr2, w)

0.011977372802274792

Our new dataset size is about $2\cdot ctr \cdot original\ dataset\ size$, which is roughly 1/40 of what it was (2.4% for a CTR of 0.012).
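
The size estimate follows from counting what is kept: all positives (a share `ctr` of the data) plus a share `w` of the negatives. A short sketch of that arithmetic, using the CTR of 0.012 from this notebook:

```python
ctr = 0.012
w = ctr / (1.0 - ctr)

# Expected fraction of rows kept: all positives plus a share w of the negatives.
kept_fraction = ctr + (1.0 - ctr) * w  # simplifies to 2 * ctr
reduction = 1.0 / kept_fraction        # how many times smaller the new set is

print(kept_fraction, reduction)
```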