# freq-e tutorial 
This notebook walks through how to run freq-e to obtain prevalence estimates on a test set.

In [2]:
import numpy as np
import freq_e
import tutorial_utils 

## Pre-processing 
freq-e expects data in the following format:
- `X_train` : numpy.ndarray, shape=(number of training examples, number of features) 
- `y_train` : numpy.ndarray (binary 0's and 1's), shape=(number of training examples,) 
- `X_test` : numpy.ndarray, shape=(number of test/inference examples, number of *training* features)

Often you will have *multiple* test groups. In this case you should have an `X_test` for each test group. 
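
Before calling freq-e, it can help to confirm that your arrays match the format above. The helper below is a minimal sketch: `check_freq_e_inputs` and the toy arrays are hypothetical and are not part of the `freq_e` package.

```python
import numpy as np

def check_freq_e_inputs(X_train, y_train, X_tests):
    """Check array types and shapes; X_tests holds one matrix per test group.

    Hypothetical helper for illustration; not part of freq_e.
    """
    if not (isinstance(X_train, np.ndarray) and X_train.ndim == 2):
        raise ValueError("X_train must be a 2-D numpy.ndarray")
    if not (isinstance(y_train, np.ndarray) and y_train.shape == (X_train.shape[0],)):
        raise ValueError("y_train must be a 1-D numpy.ndarray with one label per training row")
    if not np.isin(y_train, [0, 1]).all():
        raise ValueError("y_train must contain only 0s and 1s")
    for i, X_test in enumerate(X_tests):
        if not (isinstance(X_test, np.ndarray) and X_test.ndim == 2):
            raise ValueError(f"test group {i} must be a 2-D numpy.ndarray")
        # Every test group must live in the *training* feature space.
        if X_test.shape[1] != X_train.shape[1]:
            raise ValueError(f"test group {i} must have the training feature count")
    return True

# Toy data: 6 training examples, 4 features, two test groups of different sizes.
rng = np.random.default_rng(0)
X_train_toy = rng.integers(0, 3, size=(6, 4))
y_train_toy = np.array([0, 1, 1, 0, 1, 0])
test_groups_toy = [rng.integers(0, 3, size=(5, 4)), rng.integers(0, 3, size=(2, 4))]

inputs_ok = check_freq_e_inputs(X_train_toy, y_train_toy, test_groups_toy)
print(inputs_ok)
```

Test groups may differ in the number of rows, but never in the number of columns.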

We will use the Yelp academic dataset as an example. The text representation is unigram counts (i.e., "bag-of-words", BOW). We have already calculated the BOW counts and saved them as a .json file. The labels are negative sentiment (y=0) and positive sentiment (y=1).
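
For readers unfamiliar with unigram counts, here is a minimal sketch of how a bag-of-words matrix can be built from raw text. The toy documents and whitespace tokenization are assumptions for illustration; the tutorial's actual preprocessing lives in `tutorial_utils` and may differ.

```python
from collections import Counter
import numpy as np

# Hypothetical toy reviews standing in for the Yelp text.
docs = ["great food great service", "terrible service", "food was great"]

# Count unigrams per document using whitespace tokenization (a simplification).
doc_counts = [Counter(doc.split()) for doc in docs]

# Fix the vocabulary from these documents; sorting gives a stable column order.
vocab = sorted(set(word for counts in doc_counts for word in counts))
col = {word: j for j, word in enumerate(vocab)}

# Dense (n_docs, n_vocab) count matrix.
X_bow = np.zeros((len(docs), len(vocab)), dtype=int)
for i, counts in enumerate(doc_counts):
    for word, n in counts.items():
        X_bow[i, col[word]] = n

print(vocab)
print(X_bow)
```

Test documents must be mapped through the same vocabulary so that their columns line up with the training features.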

In [51]:
#load pre-processed data 
X_train, y_train, dict_vect, vocab_mask = tutorial_utils.get_train_data('../example_data/train_yelp.json')
X_test, y_test = tutorial_utils.get_test_group('../example_data/test_yelp.json', vocab_mask, dict_vect)
assert X_test.shape[1] == X_train.shape[1]

Orig vocab 14791, pruned 3112
Training X shape: (2000, 3112) and Y shape: (2000,)
Testing X shape: (2000, 3112) and Y shape: (2000,)


# Freq-e usage

## Inference
Freq-e inference will return (1) a point estimate of the class frequency/proportions and (2) a confidence interval for the point estimate. 

There are three different ways to obtain estimates: 
1. Create a `FreqEstimator` object and use the built-in training method. 
2. Train a scikit-learn classifier yourself and pass it in to freq-e. The model class is restricted to scikit-learn models that have a `.decision_function()` method.
3. Use the standalone `infer_freq_from_predictions()` function and pass in the predicted positive-class probabilities for the test set. This is useful when your classifier is not built with sklearn (e.g. an LSTM or CNN).

## Method 1 (train internally)

In [4]:
#create a freq-e object 
FreqE = freq_e.FreqEstimator()

### Training 
The `FreqEstimator` object can train a discriminative logistic regression classifier for you. It runs a grid search over L1 penalty strengths and selects the one with the best cross-entropy across 10 cross-validation folds.
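
If you want to see what this kind of model selection looks like in plain scikit-learn, the sketch below runs a comparable search. It is not freq-e's internal code: the synthetic data and the grid of `C` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Candidate inverse regularization strengths (assumed values, not freq-e's grid).
param_grid = {"C": [0.01, 0.1, 0.5, 1.0]}

search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid,
    scoring="neg_log_loss",  # negative cross-entropy, so higher is better
    cv=10,                   # 10 stratified folds
)
search.fit(X_toy, y_toy)

best_C = search.best_params_["C"]
train_acc = search.best_estimator_.score(X_toy, y_toy)
print(best_C, train_acc)
```

Selecting on cross-entropy rather than accuracy favors well-calibrated probabilities, which matter when the probabilities themselves feed a prevalence estimate.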

In [5]:
FreqE.train_cross_val(X_train, y_train)

TRAINING LOGISTIC REGRESSION MODEL
Best model: LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Training mean accuracy= 0.9635


### Inference 
We will get a prevalence point estimate and a 95% confidence interval (the default). You can change the confidence level with the `conf_level` argument.

In [7]:
print('FREQ-E ESTIMATE')
out = FreqE.infer_freq(X_test, conf_level=0.95)
print(out)

FREQ-E ESTIMATE
{'point': 0.77200000000000002, 'conf_interval': (0.748, 0.79500000000000004), 'conf_level': 0.95}


#### Compare this to other methods

In [9]:
# In our example, we know the true class prevalence because we have access to the test labels
# This is not the case if you are doing true inference 
print('True label prevalence:')
print(np.mean(y_test))

True label prevalence:
0.769


In [10]:
#naive method = PCC (probabilistic classify and count)
print('PCC inference of label prevalence:')
trained_model = FreqE.trained_model  # reuse the logistic regression model we already trained
probs = trained_model.predict_proba(X_test)[:, 1]  # soft probabilities of the positive class
print(np.mean(probs)) 

PCC inference of label prevalence:
0.769418519107
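
PCC differs from plain "classify and count" (CC) only in whether the predictions are thresholded before averaging. A minimal sketch on hypothetical probabilities:

```python
import numpy as np

# Hypothetical predicted positive-class probabilities for five test documents.
probs_toy = np.array([0.9, 0.8, 0.6, 0.4, 0.1])

# Classify and count (CC): threshold each prediction, then average the hard labels.
cc_estimate = np.mean(probs_toy >= 0.5)

# Probabilistic classify and count (PCC): average the soft probabilities directly.
pcc_estimate = np.mean(probs_toy)

print(cc_estimate, pcc_estimate)
```

Here three of the five documents clear the threshold, so CC reports 0.6, while PCC reports the mean probability of about 0.56.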


## Method 2 (pre-trained scikit-learn linear model)

You can also train a scikit-learn classifier yourself and pass it in to freq-e.  This may be useful if you want to use different hyperparameters or different models.

In [18]:
from sklearn.svm import LinearSVC
trained_model = LinearSVC()
trained_model.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [50]:
fe2 = freq_e.FreqEstimator()
label_prior = np.mean(y_train) #most often, you want to estimate the label prior from the training labels 
fe2.set_trained_model(trained_model, label_prior)
print('FREQ-E ESTIMATE')
out = fe2.infer_freq(X_test)
print(out)

FREQ-E ESTIMATE
{'point': 0.57400000000000007, 'conf_interval': (0.53800000000000003, 0.60899999999999999), 'conf_level': 0.95}


Here the estimate is much worse (0.574 against a true prevalence of 0.769) because we haven't tuned the hyperparameters for the SVC (and we wouldn't recommend SVC over logistic regression for this task).

## Method 3 (pass in predicted probabilities on the test set)

Finally, you can do both the classifier training and prediction yourself. All you need are the predicted positive-class probabilities on the test set (`test_pred_probs`), which you pass to the standalone `infer_freq_from_predictions` function. We'll compute them with sklearn, but they could come from any model in any software package. We recommend a model with probabilistic predictions (such as a neural network with a logistic output) whose hyperparameters have been tuned with cross-validation or on a development set.

In [23]:
# Let's train a LogisticRegression classifier (without hyperparameter tuning) 
# and get the probabilities for the positive class 
from sklearn.linear_model import LogisticRegression
trained_model2 = LogisticRegression()
trained_model2.fit(X_train, y_train)
test_pred_probs = trained_model2.predict_proba(X_test)[:, 1]  # estimated probabilities for the positive class
print(test_pred_probs[0:5])

[  6.75556119e-01   5.67236618e-05   9.99967121e-01   9.80429471e-01
   7.78366752e-01]


In [24]:
print('FREQ-E ESTIMATE')
label_prior = np.mean(y_train)
out = freq_e.infer_freq_from_predictions(test_pred_probs, label_prior)
print(out)

FREQ-E ESTIMATE
{'point': 0.78900000000000003, 'conf_interval': (0.76700000000000002, 0.81000000000000005), 'conf_level': 0.95}
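
If your model comes from another framework and emits raw logits rather than probabilities, you would first map them to positive-class probabilities. The sketch below assumes a hypothetical 1-D array of logits; the final call is left as a comment because it requires `freq_e` and a real `label_prior`.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function via the identity
    # sigmoid(z) = 0.5 * (1 + tanh(z / 2)).
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

# Hypothetical logits, e.g. the final-layer outputs of an LSTM or CNN on three test documents.
test_logits = np.array([-3.0, 0.0, 2.0])

# One positive-class probability per test document, shape (n_test,).
test_pred_probs_nn = sigmoid(test_logits)
print(test_pred_probs_nn)

# These could then be passed on exactly as in the cell above:
# out = freq_e.infer_freq_from_predictions(test_pred_probs_nn, label_prior)
```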


# Other test groups 
Here we examine two other test groups to illustrate why our method matters when the training class prevalence does not match the true test class prevalence. The empirical results in Keith et al. 2018 show this more rigorously.
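
To build intuition for why a prevalence mismatch hurts PCC, the simulation below uses a toy one-feature world in which the classifier's probabilities are exactly right under a 50/50 training prior. It then compares PCC with a maximum-likelihood prior-shift correction over a grid of candidate prevalences. This is a sketch of the general idea under stated assumptions, not freq-e's actual implementation; every name and number in it is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

# Assumed toy world: x ~ N(+1, 1) for positives and x ~ N(-1, 1) for negatives.
# With a 50/50 training prior, the exact posterior is p(y=1 | x) = sigmoid(2x).
train_prior = 0.5
true_test_prev = 0.8
n_test = 2000

y = rng.random(n_test) < true_test_prev
x = rng.normal(loc=np.where(y, 1.0, -1.0), scale=1.0)
probs = sigmoid(2.0 * x)

# PCC: average the probabilities, which still carry the training prior.
pcc = probs.mean()

# Maximum-likelihood correction: p / prior and (1 - p) / (1 - prior) are
# proportional to the class-conditional likelihoods of each document, so the
# test set's log-likelihood can be scored for every candidate prevalence.
grid = np.linspace(0.001, 0.999, 999)
pos = probs / train_prior
neg = (1.0 - probs) / (1.0 - train_prior)
mixture = grid[:, None] * pos[None, :] + (1.0 - grid[:, None]) * neg[None, :]
log_lik = np.log(mixture).sum(axis=1)
ml_prev = grid[np.argmax(log_lik)]

print(round(float(pcc), 3), round(float(ml_prev), 3))
```

In this setup PCC is pulled from the true 0.8 toward the training prior of 0.5, while the corrected estimate lands close to 0.8.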

#### High prevalence test group  

This test group has *higher* positive prevalence than the training data.  We expect PCC to underestimate the prevalence, while freq-e should be closer to the truth.

In [25]:
X_test2, y_test2 = tutorial_utils.get_test_group('../example_data/high_prev.json', vocab_mask, dict_vect)

Testing X shape: (415, 3112) and Y shape: (415,)


In [26]:
print('True label prevalence:')
print(np.mean(y_test2))

True label prevalence:
0.968674698795


In [28]:
# Note: we don't have to re-train our original frequency estimate object! 
# we can just infer the class proportions on this new test set 
print('FREQ-E ESTIMATE')
out = FreqE.infer_freq(X_test2, conf_level=0.95)
print(out)

FREQ-E ESTIMATE
{'point': 0.97299999999999998, 'conf_interval': (0.94400000000000006, 0.99199999999999999), 'conf_level': 0.95}


In [29]:
print('PCC estimate')
print(np.mean(FreqE.trained_model.predict_proba(X_test2)[:, 1]))

PCC estimate
0.901936558961


#### Low prevalence test group 

This test group has *lower* prevalence than the training data.  We expect PCC to be biased upward, but freq-e to be closer to the truth.

In [30]:
X_test3, y_test3 = tutorial_utils.get_test_group('../example_data/low_prev.json', vocab_mask, dict_vect)

Testing X shape: (825, 3112) and Y shape: (825,)


In [32]:
print('True label prevalence:')
print(np.mean(y_test3))

True label prevalence:
0.13696969697


In [34]:
print('FREQ-E ESTIMATE')
out = FreqE.infer_freq(X_test3, conf_level=0.95)
print(out)

FREQ-E ESTIMATE
{'point': 0.045999999999999999, 'conf_interval': (0.028000000000000001, 0.069000000000000006), 'conf_level': 0.95}


Unfortunately, in this case freq-e's confidence interval fails to cover the true value. As noted in the paper, the coverage rate tends to be lower than desired, so interpret the interval with care.

In [35]:
print('PCC estimate')
print(np.mean(FreqE.trained_model.predict_proba(X_test3)[:, 1]))

PCC estimate
0.311480435401
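
To compare the three test groups side by side, the snippet below collects the numbers printed in the cells above and computes, for each group, whether the freq-e interval covers the truth and the absolute error of each method. The dictionary layout and names are only for illustration.

```python
# Values copied from the cell outputs above (freq-e values as rounded there).
results = {
    "original test set": {"truth": 0.769, "freq_e": 0.772,
                          "ci": (0.748, 0.795), "pcc": 0.769418519107},
    "high prevalence": {"truth": 0.968674698795, "freq_e": 0.973,
                        "ci": (0.944, 0.992), "pcc": 0.901936558961},
    "low prevalence": {"truth": 0.13696969697, "freq_e": 0.046,
                       "ci": (0.028, 0.069), "pcc": 0.311480435401},
}

summary = {}
for name, r in results.items():
    lo, hi = r["ci"]
    summary[name] = {
        "ci_covers_truth": lo <= r["truth"] <= hi,
        "freq_e_abs_err": abs(r["freq_e"] - r["truth"]),
        "pcc_abs_err": abs(r["pcc"] - r["truth"]),
    }
    print(name, summary[name])
```

On the two shifted groups freq-e's point estimate is closer to the truth than PCC's, even though its interval misses the truth on the low-prevalence group. On the original test set, whose prevalence is close to that of the training data, both methods are accurate.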
