# Freq-E tutorial 
This notebook walks through: 

1. Getting text into a format suitable for input to freq-e.

2. Running freq-e to obtain prevalence estimates on the test set. 

In [1]:
#load dependencies 
import numpy as np 
import json
from sklearn.feature_extraction import DictVectorizer

In [2]:
#load freq_e package 
from freq_e import estimate

# (1) Pre-processing 
The goal of pre-processing is to get your data into the format: 
- `X_train` : numpy.ndarray, shape=(number of training examples, number of features) 
- `y_train` : numpy.ndarray (binary 0's and 1's), shape=(number of training examples,) 
- `X_test` : numpy.ndarray, shape=(number of test/inference examples, number of *training* features)

Often you will have *multiple* test groups. In this case you should have an `X_test` for each test group. 
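
The shape requirements above can be checked before anything is passed to freq-e. The following is a minimal sketch on toy arrays; `check_inputs` is a hypothetical helper written for this tutorial and is not part of the freq-e package. It takes a dictionary with one `X_test` per test group.

```python
import numpy as np

def check_inputs(X_train, y_train, X_tests):
    """Hypothetical helper (not part of freq-e): validate array shapes.

    X_tests maps a test-group name to that group's feature matrix.
    """
    assert X_train.ndim == 2, "X_train must be 2-D"
    assert y_train.shape == (X_train.shape[0],), "need one label per training row"
    assert set(np.unique(y_train)) <= {0, 1}, "labels must be binary 0/1"
    for name, X_test in X_tests.items():
        assert X_test.ndim == 2, "%s: X_test must be 2-D" % name
        assert X_test.shape[1] == X_train.shape[1], "%s: feature count differs from training" % name
    return True

# toy data: 4 training examples, 3 features, two test groups
X_train_toy = np.array([[1, 0, 2], [0, 1, 0], [3, 0, 0], [0, 0, 1]])
y_train_toy = np.array([1, 0, 1, 0])
X_tests_toy = {"group_a": np.zeros((2, 3)), "group_b": np.ones((5, 3))}
ok = check_inputs(X_train_toy, y_train_toy, X_tests_toy)
```
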

We will use the Yelp academic dataset as an example. The text representation will be unigram counts (i.e. "bag-of-words"). Here, we have already calculated the BOW counts and saved them as a .json file. The y-values are negative sentiment (y=0) and positive sentiment (y=1). 
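
The loader used in this notebook reads one JSON object per line, with the keys `counts` and `class`. As a rough sketch of how such lines could be produced from raw text (the toy documents and the lowercase whitespace tokenization are assumptions made for illustration, not how the example files were built):

```python
import json
from collections import Counter

# toy documents with labels (1 = positive, 0 = negative); illustrative only
docs = [("great food great service", 1), ("cold food", 0)]

lines = []
for text, label in docs:
    # assumption: lowercase + whitespace tokenization; swap in your own tokenizer
    counts = Counter(text.lower().split())
    lines.append(json.dumps({"counts": dict(counts), "class": label}))

# each element of `lines` corresponds to one line of the .json file
first = json.loads(lines[0])
```
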

You can skip this section if your data is already formatted. Other common feature representations include averaged word embeddings or a FastText vector representation of each document. 
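
As an illustration of the averaged-embedding alternative, here is a minimal sketch. The tiny `embeddings` table is hypothetical; in practice the vectors would come from a pre-trained model.

```python
import numpy as np

# hypothetical 3-dimensional embedding table; real embeddings would be loaded from disk
embeddings = {
    "great": np.array([0.9, 0.1, 0.0]),
    "food": np.array([0.2, 0.8, 0.1]),
    "cold": np.array([-0.7, 0.0, 0.3]),
}
dim = 3

def doc_vector(tokens):
    # average the vectors of known tokens; all-zero vector if none are known
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

docs = [["great", "food"], ["cold", "food", "unknownword"], ["unknownword"]]
X_emb = np.vstack([doc_vector(d) for d in docs])  # shape (3 documents, 3 features)
```

Each row of `X_emb` plays the same role as a row of bag-of-words counts: one fixed-length feature vector per document.
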

In [3]:
def load_x_y_from_json(file_name): 
    #each line of the file is one JSON object with the keys 'counts' and 'class'
    count_dicts = []
    y = []
    with open(file_name) as f: 
        for line in f: 
            dd = json.loads(line)
            count_dicts.append(dd['counts'])
            y.append(dd['class'])
    return count_dicts, np.array(y)  

In [4]:
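def prune_vocab(X_train, dv_vocab): 
    #remove words that occur in <5 docs 
    xx=X_train.copy()
    xx[xx>0]=1  #binarize counts: 1 if the word occurs in the doc
    w_df = np.asarray(xx.sum(0)).flatten()  #document frequency of each word
    new_vocab_mask = w_df >= 5  #boolean mask over the original vocabulary
    print("Orig vocab %d, pruned %d" % (len(w_df), np.sum(new_vocab_mask)))
    X_train = X_train[:,new_vocab_mask]
    dv_vocab = dv_vocab[new_vocab_mask]
    #the mask is returned so the same columns can be selected in the test data
    return X_train, dv_vocab, new_vocab_mask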

In [5]:
#get train data 
dv = DictVectorizer()
train_count_dicts, y_train = load_x_y_from_json('example_data/train_yelp.json')
X_train = dv.fit_transform(train_count_dicts).toarray()
dv_vocab = np.array(dv.feature_names_)
X_train, dv_vocab, new_vocab_mask = prune_vocab(X_train, dv_vocab)
print(type(X_train), type(y_train))
print(X_train.shape, y_train.shape)
assert X_train.shape[0] == y_train.shape[0]

Orig vocab 14791, pruned 3112
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(2000, 3112) (2000,)


In [6]:
def transform_test(test_count_dicts, new_vocab_mask): 
    X_test = dv.transform(test_count_dicts).toarray()
    X_test = X_test[:,new_vocab_mask]
    return X_test
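
The test matrix has to use the *training* features because `DictVectorizer.transform` keeps the column layout learned during `fit` and silently ignores any feature it has not seen. A small toy sketch of this behavior (the words and the `dv_toy` name are made up for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

dv_toy = DictVectorizer()
X_tr = dv_toy.fit_transform([{"good": 2, "food": 1}, {"bad": 1, "food": 1}]).toarray()

# "terrible" never appeared in training, so transform() drops it
X_te = dv_toy.transform([{"terrible": 3, "food": 2}]).toarray()

vocab = list(dv_toy.feature_names_)  # training vocabulary, sorted alphabetically
```
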

In [7]:
# get test data (1 test group) 
# NOTE: the test group is the "inference" group in a real-world setting
# here we have labels on the test set, but in a real-world setting there 
# would most likely not be labels on the test set
test_count_dicts, y_test = load_x_y_from_json('example_data/test_yelp.json')
X_test = transform_test(test_count_dicts, new_vocab_mask)
print(type(X_test), type(y_test))
print(X_test.shape, y_test.shape)
assert X_test.shape[1] == X_train.shape[1]

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(2000, 3112) (2000,)


# (2) Freq-e usage

## Inference
Freq-e inference will return (1) a point estimate of the class frequency/proportions and (2) a confidence interval for the point estimate. 

There are three different ways to obtain estimates: 
1. Create a `FreqEstimate` object and use the built-in training method. 
2. Use the `infer_freq()` function and pass in a pre-trained scikit-learn linear model (e.g. `LogisticRegression`). The model class is restricted to scikit-learn models that have a `.decision_function()` method. 
3. Use the `infer_freq()` function and pass in the predicted positive-class probabilities for the test set. 
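
For option 2, one way to check up front whether a scikit-learn model exposes `.decision_function()` is sketched below. `has_decision_function` is a hypothetical helper, and it only tests for the method, not for whether the model is linear.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def has_decision_function(model):
    # hypothetical helper: True if the model exposes .decision_function()
    return callable(getattr(model, "decision_function", None))

checks = {
    type(m).__name__: has_decision_function(m)
    for m in (LogisticRegression(), LinearSVC(), DecisionTreeClassifier())
}
```
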

Method 3 is useful when your classifier is not built with scikit-learn (e.g. an LSTM or CNN). 
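
Neural classifiers often output raw scores (logits) rather than probabilities. A minimal sketch of converting them into a 1-D array of positive-class probabilities is shown below; the logit values are made up, and which conversion applies depends on how your model's output layer is defined.

```python
import numpy as np

def sigmoid(z):
    # logistic function written with tanh, which avoids overflow for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

def softmax_positive(logits):
    # logits: shape (n_examples, 2); returns P(class 1) for each row
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp[:, 1] / exp.sum(axis=1)

single_logit = np.array([-2.0, 0.0, 3.0])                   # one score per example
two_logits = np.column_stack([np.zeros(3), single_logit])   # class-0 logit fixed at 0
probs_a = sigmoid(single_logit)
probs_b = softmax_positive(two_logits)
```

Both conversions give the same result here, since a two-class softmax with the class-0 logit fixed at zero reduces to the sigmoid.
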

## Method 1 (train internally)

In [8]:
#create a freq-e object 
FreqE = estimate.FreqEstimate()

### Training 
To select the best discriminative classifier, we run a grid search over the L1 penalty strength for logistic regression, evaluating cross-entropy over 10 cross-validation folds. 
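
To give a feel for this kind of search, here is a sketch using scikit-learn's `GridSearchCV` on synthetic data. It is not necessarily how freq-e implements its internal training; the grid of `C` values and the synthetic dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for (X_train, y_train)
X_syn, y_syn = make_classification(n_samples=200, n_features=20, random_state=0)

# assumption: this grid of inverse regularization strengths is illustrative only
param_grid = {"C": [0.01, 0.1, 0.5, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid,
    scoring="neg_log_loss",  # negative cross-entropy, so higher is better
    cv=10,                   # 10 cross-validation folds
)
search.fit(X_syn, y_syn)
best_C = search.best_params_["C"]
```
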

In [9]:
FreqE.fit(X_train, y_train)

TRAINING LOGISTIC REGRESSION MODEL
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Training mean accuracy= 0.9635


### Inference 

In [10]:
print('FREQ-E ESTIMATED')
out = FreqE.infer_freq_obj(X_test, conf_level=0.95)
print(out)

FREQ-E ESTIMATED
{'point': 0.772, 'conf_interval': (0.748, 0.795)}


#### Compare this to other methods

In [11]:
# In our example, we know the true class proportion because we have access to the test labels
# This is not the case if you are doing true inference 
print('TRUE')
print(np.mean(y_test))

TRUE
0.769
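
When labels are available, as they are here, we can check whether the confidence interval covers the true prevalence. The sketch below copies the numbers printed above; `interval_covers` is a hypothetical helper, not part of freq-e.

```python
def interval_covers(out, true_value):
    # out: dict with keys 'point' and 'conf_interval', as printed above
    low, high = out["conf_interval"]
    return low <= true_value <= high

# values copied from the outputs above
out_reported = {"point": 0.772, "conf_interval": (0.748, 0.795)}
true_prevalence = 0.769

covered = interval_covers(out_reported, true_prevalence)
abs_error = abs(out_reported["point"] - true_prevalence)
```
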


In [12]:
#naive method = PCC (probabilistic classify and count)
print('PCC')
print(np.mean(FreqE.trained_model.predict_proba(X_test)[:, 1]))

PCC
0.7694195295507668
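
PCC averages the predicted probabilities directly. For comparison, the sketch below also computes plain classify and count (CC), which thresholds each probability at 0.5 before averaging. CC is not used elsewhere in this notebook, and the probabilities are toy values.

```python
import numpy as np

probs = np.array([0.9, 0.8, 0.6, 0.4, 0.1])  # toy positive-class probabilities

pcc = probs.mean()           # probabilistic classify and count
cc = (probs >= 0.5).mean()   # hard-threshold classify and count
```
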


## Method 2 (pre-trained scikit-learn linear model)

In [13]:
# Here we will give an example of passing in a classifier that is not LogisticRegression 
# We will show that we can also use a LinearSVC (linear support vector classifier)
from sklearn.svm import LinearSVC
trained_model = LinearSVC()
trained_model.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [14]:
print('FREQ-E ESTIMATED')
label_prior = np.mean(y_train) #most often, you want to estimate the label prior from training, but you can also pass in other values
out = estimate.infer_freq(X_test, label_prior, conf_level=0.95, trained_model=trained_model)
print(out)

FREQ-E ESTIMATED
{'point': 0.5740000000000001, 'conf_interval': (0.538, 0.609)}


Here we do a lot worse (0.574 versus the true 0.769), most likely because we have not tuned the hyperparameters of the SVC. With a scikit-learn linear model that has been tuned on the training data, this method should work better. For a classifier built outside scikit-learn (e.g. an LSTM or Transformer), use Method 3 instead. 

## Method 3 (pass in predicted probabilities on the test set)
Train a classifier, get the predicted positive-class probabilities on the test set (`test_pred_probs`), and pass them into the stand-alone `infer_freq` function. 

In [15]:
# Let's train a LogisticRegression classifier (without hyperparameter tuning) 
# and get the probabilities for the positive class 
from sklearn.linear_model import LogisticRegression
trained_model2 = LogisticRegression()
trained_model2.fit(X_train, y_train)
test_pred_probs = trained_model2.predict_proba(X_test)[:, 1]
print(test_pred_probs[0:5])

[6.75550754e-01 5.67237177e-05 9.99967122e-01 9.80429407e-01
 7.78366952e-01]


In [16]:
print('FREQ-E ESTIMATED')
label_prior = np.mean(y_train)
out = estimate.infer_freq(X_test, label_prior, conf_level=0.95, trained_model=None, test_pred_probs=test_pred_probs)
print(out)

FREQ-E ESTIMATED
{'point': 0.789, 'conf_interval': (0.767, 0.81)}


# Other test groups 
Here we examine two other test groups to illustrate (with anecdotes supported by the empirical results in Keith et al. 2018) why our method matters when the training class prevalence does not match the true test class prevalence. 
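
To see why a mismatch in prevalence matters for a naive estimate such as PCC, consider the toy simulation below. It is a sketch under strong assumptions (1-D Gaussian features and a classifier whose probabilities are exact for a 50/50 training set) and does not use the Yelp data or freq-e.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcc_on_group(prevalence, n=20000):
    # draw labels at the given prevalence, then features x ~ N(+1, 1) or N(-1, 1)
    y = rng.random(n) < prevalence
    x = rng.normal(np.where(y, 1.0, -1.0), 1.0)
    # exact posterior P(y=1 | x) when the training prior is 0.5: sigmoid(2x)
    probs = 1.0 / (1.0 + np.exp(-2.0 * x))
    return probs.mean(), y.mean()

pcc_high, true_high = pcc_on_group(0.9)
pcc_low, true_low = pcc_on_group(0.1)
print("high prevalence: true %.3f, PCC %.3f" % (true_high, pcc_high))
print("low prevalence:  true %.3f, PCC %.3f" % (true_low, pcc_low))
```

In both simulated groups the PCC estimate is pulled toward the training prevalence of 0.5, even though the classifier's probabilities are exact for the training distribution.
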

#### High prevalence test group  

In [17]:
test_count_dicts2, y_test2 = load_x_y_from_json('example_data/high_prev.json')
X_test2 = transform_test(test_count_dicts2, new_vocab_mask)
print(X_test2.shape, y_test2.shape)

(415, 3112) (415,)


In [18]:
print('TRUE')
print(np.mean(y_test2))

TRUE
0.9686746987951808


In [19]:
# Note: we don't have to re-train our frequency estimate object! we can just infer the 
# class proportions on this new test set 
print('FREQ-E ESTIMATED')
out = FreqE.infer_freq_obj(X_test2, conf_level=0.95)
print(out)

FREQ-E ESTIMATED
{'point': 0.973, 'conf_interval': (0.9440000000000001, 0.992)}


In [20]:
print('PCC')
print(np.mean(FreqE.trained_model.predict_proba(X_test2)[:, 1]))

PCC
0.9019396973852466


#### Low prevalence test group 

In [21]:
test_count_dicts3, y_test3 = load_x_y_from_json('example_data/low_prev.json')
X_test3 = transform_test(test_count_dicts3, new_vocab_mask)
print(X_test3.shape, y_test3.shape)

(825, 3112) (825,)


In [22]:
print('TRUE')
print(np.mean(y_test3))

TRUE
0.13696969696969696


In [23]:
print('FREQ-E ESTIMATED')
out = FreqE.infer_freq_obj(X_test3, conf_level=0.95)
print(out)

FREQ-E ESTIMATED
{'point': 0.046, 'conf_interval': (0.028, 0.069)}


In [24]:
print('PCC')
print(np.mean(FreqE.trained_model.predict_proba(X_test3)[:, 1]))

PCC
0.3114853708952336
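
The three test groups can be summarized side by side. The sketch below copies the numbers printed in this notebook and computes the absolute error of each estimate.

```python
# values copied from the outputs in this notebook: (true, freq-e point estimate, PCC)
results = {
    "test": (0.769, 0.772, 0.7694195295507668),
    "high_prev": (0.9686746987951808, 0.973, 0.9019396973852466),
    "low_prev": (0.13696969696969696, 0.046, 0.3114853708952336),
}

abs_errors = {}
for group, (true_prev, freq_e_est, pcc_est) in results.items():
    abs_errors[group] = (abs(freq_e_est - true_prev), abs(pcc_est - true_prev))
    print("%-10s freq-e error %.3f  PCC error %.3f" % (group, *abs_errors[group]))
```

On the original test group, whose prevalence is close to what PCC already predicts, PCC is marginally closer. On the high and low prevalence groups, the freq-e point estimate has the smaller absolute error.
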
