# freq-e tutorial 
This notebook walks through how to run freq-e to obtain prevalence estimates on a test set. 

In [1]:
#load dependencies 
import numpy as np 
import json

In [2]:
#load freq_e package and tutorial data-preprocessing utils 
from freq_e import estimate
import tutorial_utils 

## Pre-processing 
freq-e expects data in the following format:
- `X_train` : numpy.ndarray, shape=(number of training examples, number of features) 
- `y_train` : numpy.ndarray (binary 0's and 1's), shape=(number of training examples,) 
- `X_test` : numpy.ndarray, shape=(number of test/inference examples, number of *training* features)

Often you will have *multiple* test groups. In this case you should have an `X_test` for each test group. 
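
The format requirements above can be checked before calling freq-e. The following is a minimal sketch and not part of the freq-e package: `check_freq_e_inputs` and the `test_groups` dictionary are hypothetical names introduced here, and the toy arrays stand in for real data.

```python
import numpy as np

def check_freq_e_inputs(X_train, y_train, test_groups):
    """Raise ValueError if the arrays do not match the format listed above.

    `test_groups` is a hypothetical dict mapping a group name to its X_test array.
    """
    if X_train.ndim != 2:
        raise ValueError("X_train must be 2-D: (n_train, n_features)")
    if y_train.shape != (X_train.shape[0],):
        raise ValueError("y_train must have shape (n_train,)")
    if not np.isin(y_train, [0, 1]).all():
        raise ValueError("y_train must contain only 0's and 1's")
    for name, X_test in test_groups.items():
        # Every test group must use the same feature columns as training.
        if X_test.ndim != 2 or X_test.shape[1] != X_train.shape[1]:
            raise ValueError(f"{name}: X_test must have the training feature count")

# Toy data standing in for real feature matrices.
X_train_toy = np.zeros((4, 3))
y_train_toy = np.array([0, 1, 1, 0])
test_groups_toy = {"group_a": np.zeros((2, 3)), "group_b": np.zeros((5, 3))}
check_freq_e_inputs(X_train_toy, y_train_toy, test_groups_toy)
```

Keeping one array per test group in a dictionary makes it easy to loop over groups at inference time.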

We will use the Yelp academic dataset as an example. The text representation will be unigram counts (i.e. "bag-of-words"). Here, we have already calculated the BOW counts and saved them as a .json file. The y-values are negative sentiment (y=0) and positive sentiment (y=1). 

In [3]:
#load pre-processed data 
X_train, y_train, dict_vect, vocab_mask = tutorial_utils.get_train_data('../example_data/train_yelp.json')
X_test, y_test = tutorial_utils.get_test_group('../example_data/test_yelp.json', vocab_mask, dict_vect)
assert X_test.shape[1] == X_train.shape[1]

Orig vocab 14791, pruned 3112
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(2000, 3112) (2000,)
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(2000, 3112) (2000,)
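
The internals of `tutorial_utils` are not shown in this notebook. As a rough illustration of the idea only, the sketch below builds unigram counts with a pruned vocabulary and maps test documents onto the *training* vocabulary. The function names, the `min_doc_freq` pruning rule, and the toy documents are all assumptions made for this example; the actual pruning behind "Orig vocab 14791, pruned 3112" may differ.

```python
from collections import Counter
import numpy as np

def build_vocab(train_docs, min_doc_freq=2):
    """Keep words that occur in at least `min_doc_freq` training documents."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    kept = sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
    return {word: idx for idx, word in enumerate(kept)}

def to_count_matrix(docs, vocab):
    """Unigram counts: one row per document, one column per vocabulary word."""
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for word, count in Counter(doc.lower().split()).items():
            if word in vocab:  # words outside the training vocabulary are dropped
                X[row, vocab[word]] = count
    return X

train_docs = ["great food great service", "bad food", "great place"]
test_docs = ["great great sushi", "bad service"]

vocab = build_vocab(train_docs)             # built from training data only
X_train_toy = to_count_matrix(train_docs, vocab)
X_test_toy = to_count_matrix(test_docs, vocab)
assert X_test_toy.shape[1] == X_train_toy.shape[1]
```

Because the vocabulary is fixed from the training documents, every test group ends up with the same number of columns as `X_train`, which is what the assertion in the cell above checks.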


# Freq-e usage

## Inference
Freq-e inference will return (1) a point estimate of the class frequency/proportions and (2) a confidence interval for the point estimate. 

There are three different ways to obtain estimates: 
1. Create a `FreqEstimate` object and use the built-in training method. 
2. Use the `infer_freq()` method and pass in a pre-trained scikit-learn linear model (e.g. `LogisticRegression`). Here the model class is restricted to scikit-learn models that have a `.decision_function()` method. 
3. Use the `infer_freq()` method and pass in the predicted probabilities of the positive class of the test set. 

Method 3 may be useful when your classifier is not built with scikit-learn (e.g. an LSTM or CNN). 
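
For Method 3, the only thing freq-e needs from an external classifier is a 1-D array with the predicted probability of the positive class for each test example. The sketch below shows one way to get such an array from raw two-class logits; `positive_class_probs` and the toy logits are hypothetical, and your model may already expose probabilities directly.

```python
import numpy as np

def positive_class_probs(logits):
    """Softmax over two-class logits; return P(y=1) for each example."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 1]

# Toy logits with columns [negative class, positive class].
toy_logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]])
test_pred_probs_toy = positive_class_probs(toy_logits)
print(test_pred_probs_toy.shape)
```

An array of this shape is what gets passed as `test_pred_probs` in the Method 3 section of this tutorial.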

## Method 1 (train internally)

In [4]:
#create a freq-e object 
FreqE = estimate.FreqEstimate()

### Training 
To select the best discriminative classifier, we run a grid search over the L1 penalty strength for logistic regression, evaluating cross-entropy over 10 cross-validation folds. 

In [5]:
FreqE.train_cross_val(X_train, y_train)

TRAINING LOGISTIC REGRESSION MODEL
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Training mean accuracy= 0.9635
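
The source of `train_cross_val` is not shown here. For readers who want to reproduce a similar search by hand, the following sketch uses scikit-learn's `GridSearchCV` with an L1-penalized logistic regression scored by negative log loss (cross-entropy) over 10 folds. The grid of `C` values and the synthetic data are assumptions for illustration, not the values freq-e uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for X_train, y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hypothetical grid; C is the inverse of the L1 regularization strength.
param_grid = {"C": [0.01, 0.1, 0.5, 1.0]}

search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid,
    scoring="neg_log_loss",  # higher is better, i.e. lower cross-entropy
    cv=10,
)
search.fit(X, y)

best_C = search.best_params_["C"]
print("best C:", best_C, "mean CV log loss:", -search.best_score_)
```

Scoring on log loss rather than accuracy favors classifiers whose predicted probabilities are well calibrated, which matters when those probabilities feed a prevalence estimate.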


### Inference 
We will get a prevalence point estimate and its 95% confidence interval. The confidence level can be changed through the `conf_level` argument. 

In [6]:
print('FREQ-E ESTIMATED')
out = FreqE.infer_freq_obj(X_test, conf_level=0.95)
print(out)

FREQ-E ESTIMATED
{'point': 0.772, 'conf_interval_95%': (0.748, 0.795)}


#### Compare this to other methods

In [7]:
# In our example, we know the true class proportion because we have access to the test labels
# This is not the case if you are doing true inference 
print('TRUE')
print(np.mean(y_test))

TRUE
0.769


In [8]:
#naive method = PCC (probabilistic classify and count)
print('PCC')
trained_model = FreqE.trained_model #reuse the logistic regression model we already trained 
probs = trained_model.predict_proba(X_test)[:, 1] #get the soft probabilities 
print(np.mean(probs)) 

PCC
0.7694201610577046
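
PCC averages the predicted probabilities, whereas plain classify and count (CC) averages thresholded labels. A minimal sketch of the two baselines on toy probabilities follows; the function names and the 0.5 threshold are choices made for this example.

```python
import numpy as np

def classify_and_count(probs, threshold=0.5):
    """CC: fraction of examples whose hard prediction is the positive class."""
    return float(np.mean(np.asarray(probs) >= threshold))

def probabilistic_classify_and_count(probs):
    """PCC: mean of the soft positive-class probabilities."""
    return float(np.mean(probs))

toy_probs = np.array([0.9, 0.8, 0.6, 0.4, 0.1])
cc = classify_and_count(toy_probs)                 # 3 of 5 examples are >= 0.5
pcc = probabilistic_classify_and_count(toy_probs)  # (0.9+0.8+0.6+0.4+0.1) / 5
print(cc, pcc)
```

Neither baseline comes with a confidence interval, which is one of the things the freq-e output adds.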


## Method 2 (pre-trained scikit-learn linear model)

In [9]:
# Here we will give an example of passing in a classifier that is not LogisticRegression 
# We will show that we can also use a LinearSVC (linear support vector classifier)
from sklearn.svm import LinearSVC
trained_model = LinearSVC()
trained_model.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [10]:
print('FREQ-E ESTIMATED')
label_prior = np.mean(y_train) #most often, you want to estimate the label prior from training, but you can also pass in other values
out = estimate.infer_freq(X_test, label_prior, conf_level=0.95, trained_model=trained_model)
print(out)

FREQ-E ESTIMATED
{'point': 0.5740000000000001, 'conf_interval_95%': (0.538, 0.609)}


Here the estimate is much worse (0.574 against a true prevalence of 0.769), most likely because we have not tuned the SVC's hyperparameters. With a well-tuned scikit-learn linear model this method should work well. For classifiers built outside scikit-learn (e.g. an LSTM or Transformer), use Method 3. 

## Method 3 (pass in predicted probabilities on the test set)
Train a classifier, get the predicted positive-class probabilities on the test set (`test_pred_probs`), and pass them into the stand-alone `infer_freq` method. 

In [11]:
# Let's train a LogisticRegression classifier (without hyperparameter tuning) 
# and get the probabilities for the positive class 
from sklearn.linear_model import LogisticRegression
trained_model2 = LogisticRegression()
trained_model2.fit(X_train, y_train)
test_pred_probs = trained_model2.predict_proba(X_test)[:, 1] #estimated probabilities for the positive class 
print(test_pred_probs[0:5])

[6.75550754e-01 5.67237177e-05 9.99967122e-01 9.80429407e-01
 7.78366952e-01]


In [12]:
print('FREQ-E ESTIMATED')
label_prior = np.mean(y_train)
out = estimate.infer_freq(X_test, label_prior, conf_level=0.95, trained_model=None, test_pred_probs=test_pred_probs)
print(out)

FREQ-E ESTIMATED
{'point': 0.789, 'conf_interval_95%': (0.767, 0.81)}
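
To build intuition for how predicted probabilities and a training label prior can be combined into a prevalence estimate with an interval, here is an illustrative sketch. It is *not* freq-e's source code: it assumes a mixture likelihood in which each document contributes `theta * p / prior + (1 - theta) * (1 - p) / (1 - prior)`, evaluates that likelihood on a grid of candidate prevalences `theta` under a uniform prior, and reads off the mode and central quantiles. The function name, grid size, and toy probabilities are all assumptions.

```python
import numpy as np

def grid_prevalence_estimate(pred_probs, label_prior, conf_level=0.95, grid_size=1001):
    """Illustrative grid-based prevalence estimate (not the freq-e implementation)."""
    probs = np.clip(np.asarray(pred_probs, dtype=float), 1e-12, 1 - 1e-12)
    thetas = np.linspace(0.0, 1.0, grid_size)

    # Divide out the training prior so each term acts like a class likelihood.
    pos_ratio = probs / label_prior
    neg_ratio = (1.0 - probs) / (1.0 - label_prior)

    # mixture[i, j]: likelihood of document i under candidate prevalence thetas[j]
    mixture = np.outer(pos_ratio, thetas) + np.outer(neg_ratio, 1.0 - thetas)
    log_lik = np.log(mixture).sum(axis=0)

    # Normalize over the grid (uniform prior on theta).
    posterior = np.exp(log_lik - log_lik.max())
    posterior /= posterior.sum()

    point = float(thetas[np.argmax(posterior)])
    cdf = np.cumsum(posterior)
    alpha = 1.0 - conf_level
    lo_idx = min(int(np.searchsorted(cdf, alpha / 2)), grid_size - 1)
    hi_idx = min(int(np.searchsorted(cdf, 1 - alpha / 2)), grid_size - 1)
    return point, (float(thetas[lo_idx]), float(thetas[hi_idx]))

# Toy test set: 80 confidently positive and 20 confidently negative documents.
toy_probs = np.array([0.99] * 80 + [0.01] * 20)
point, interval = grid_prevalence_estimate(toy_probs, label_prior=0.5)
print(point, interval)
```

In this sketch the label prior matters because the classifier's probabilities already have the training prevalence baked in; dividing it out lets the test prevalence be estimated separately.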


# Other test groups 
Here we examine two other test groups as anecdotal illustrations (supported by the empirical results in Keith et al. 2018) of why our method matters when the training class prevalence does not match the true test class prevalence. 
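
A small worked example shows why a naive average of predicted probabilities drifts toward the training prevalence under this kind of mismatch. The setup is hypothetical: a single binary feature with `P(x=1 | y=1) = 0.9` and `P(x=1 | y=0) = 0.1`, and a perfectly calibrated classifier trained with a 0.5 label prior.

```python
def expected_pcc(test_prevalence, train_prior=0.5,
                 p_x1_given_pos=0.9, p_x1_given_neg=0.1):
    """Expected PCC on a test set whose prevalence differs from training."""
    def posterior(px_pos, px_neg):
        # P(y=1 | x) under the *training* prior, via Bayes' rule
        num = train_prior * px_pos
        return num / (num + (1 - train_prior) * px_neg)

    post_x1 = posterior(p_x1_given_pos, p_x1_given_neg)
    post_x0 = posterior(1 - p_x1_given_pos, 1 - p_x1_given_neg)

    # Fraction of test examples with x=1 at the given test prevalence
    frac_x1 = (test_prevalence * p_x1_given_pos
               + (1 - test_prevalence) * p_x1_given_neg)
    return frac_x1 * post_x1 + (1 - frac_x1) * post_x0

for prevalence in (0.1, 0.5, 0.97):
    print(prevalence, round(expected_pcc(prevalence), 4))
```

In this toy setting the expected PCC works out to `0.18 + 0.64 * prevalence`, so it matches the truth only at the training prevalence of 0.5 and is pulled toward 0.5 everywhere else.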

#### High prevalence test group  

In [13]:
X_test2, y_test2 = tutorial_utils.get_test_group('../example_data/high_prev.json', vocab_mask, dict_vect)
print(X_test2.shape, y_test2.shape)

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(415, 3112) (415,)
(415, 3112) (415,)


In [14]:
print('TRUE')
print(np.mean(y_test2))

TRUE
0.9686746987951808


In [15]:
# Note: we don't have to re-train our frequency estimate object. We can just infer the 
# class proportions on this new test set 
print('FREQ-E ESTIMATED')
out = FreqE.infer_freq_obj(X_test2, conf_level=0.95)
print(out)

FREQ-E ESTIMATED
{'point': 0.973, 'conf_interval_95%': (0.9440000000000001, 0.992)}


In [16]:
print('PCC')
print(np.mean(FreqE.trained_model.predict_proba(X_test2)[:, 1]))

PCC
0.9019349059317472


#### Low prevalence test group 

In [17]:
X_test3, y_test3 = tutorial_utils.get_test_group('../example_data/low_prev.json', vocab_mask, dict_vect)
print(X_test3.shape, y_test3.shape)

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(825, 3112) (825,)
(825, 3112) (825,)


In [18]:
print('TRUE')
print(np.mean(y_test3))

TRUE
0.13696969696969696


In [19]:
print('FREQ-E ESTIMATED')
out = FreqE.infer_freq_obj(X_test3, conf_level=0.95)
print(out)

FREQ-E ESTIMATED
{'point': 0.046, 'conf_interval_95%': (0.028, 0.069)}


In [20]:
print('PCC')
print(np.mean(FreqE.trained_model.predict_proba(X_test3)[:, 1]))

PCC
0.3114774782928134
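
To compare the methods side by side, the sketch below collects the numbers printed in the cells above (Method 1 for the freq-e estimates) and computes absolute errors against the true prevalence. The dictionary layout and key names are choices made for this example; the values are copied from the outputs above.

```python
# Values copied from the printed outputs in this notebook.
results = {
    "test": {"true": 0.769, "freq_e": 0.772,
             "ci": (0.748, 0.795), "pcc": 0.7694201610577046},
    "high_prev": {"true": 0.9686746987951808, "freq_e": 0.973,
                  "ci": (0.944, 0.992), "pcc": 0.9019349059317472},
    "low_prev": {"true": 0.13696969696969696, "freq_e": 0.046,
                 "ci": (0.028, 0.069), "pcc": 0.3114774782928134},
}

summary = {}
for group, r in results.items():
    ci_low, ci_high = r["ci"]
    summary[group] = {
        "freq_e_abs_err": abs(r["freq_e"] - r["true"]),
        "pcc_abs_err": abs(r["pcc"] - r["true"]),
        # Does the reported interval cover the true prevalence?
        "true_in_ci": ci_low <= r["true"] <= ci_high,
    }
    print(group, summary[group])
```

The `true_in_ci` flag records, for each group, whether the reported 95% interval contains the true prevalence.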
