# Lab session: Training a text classifier with BERT
***To automatically detect the use of <ins>unnamed sources</ins> in a*** **New York Times** ***corpus***

e.g.

Even <ins>one of Mr. Bush's advisers acknowledged</ins>: "I think he worked hard in New Hampshire, but I think they were all surprised by how hard he had to work."

For their part, <ins>New York officials say</ins> they are not trying to cut New Jersey out of the process.

**Import packages**

In [None]:
import pandas as pd
import numpy as np

In [None]:
from AugmentedSocialScientist import bert

## 1. Tokenization 

Split the corpus into annotation units (e.g. sentences)

In [None]:
text = "In its fumbling attempts to explain the purge of United States attorneys, the Bush administration has argued that the fired prosecutors were not aggressive enough about addressing voter fraud. It is a phony argument; there is no evidence that any of them ignored real instances of voter fraud. But more than that, it is a window on what may be a major reason for some of the firings.   In partisan Republican circles, the pursuit of voter fraud is code for suppressing the votes of minorities and poor people. By resisting pressure to crack down on ''fraud,'' the fired United States attorneys actually appear to have been standing up for the integrity of the election system.   TD John McKay, one of the fired attorneys, says he was pressured by Republicans to bring voter fraud charges after the 2004 Washington governor's race, which a Democrat, Christine Gregoire, won after two recounts. Republicans were trying to overturn an election result they did not like, but Mr. McKay refused to go along. ''There was no evidence,'' he said, ''and I am not going to drag innocent people in front of a grand jury.''   Later, when he interviewed with Harriet Miers, then the White House counsel, for a federal judgeship that he ultimately did not get, he says, he was asked to explain ''criticism that I mishandled the 2004 governor's election.''   Mr. McKay is not the only one of the federal attorneys who may have been brought down for refusing to pursue dubious voter fraud cases. Before David Iglesias of New Mexico was fired, prominent New Mexico Republicans reportedly complained repeatedly to Karl Rove about Mr. Iglesias's failure to indict Democrats for voter fraud. The White House said that last October, just weeks before Mr. McKay and most of the others were fired, President Bush complained that United States attorneys were not pursuing voter fraud aggressively enough.   There is no evidence of rampant voter fraud in this country. Rather, Republicans under Mr. Bush have used such allegations as an excuse to suppress the votes of Democratic-leaning groups. They have intimidated Native American voter registration campaigners in South Dakota with baseless charges of fraud. They have pushed through harsh voter ID bills in states like Georgia and Missouri, both blocked by the courts, that were designed to make it hard for people who lack drivers' licenses -- who are disproportionately poor, elderly or members of minorities -- to vote. Florida passed a law placing such onerous conditions on voter registration drives, which register many members of minorities and poor people, that the League of Women Voters of Florida suspended its registration work in the state.   The claims of vote fraud used to promote these measures usually fall apart on close inspection, as Mr. McKay saw. Missouri Republicans have long charged that St. Louis voters, by which they mean black voters, registered as living on vacant lots. But when The St. Louis Post-Dispatch checked, it found that thousands of people lived in buildings on lots that the city had erroneously classified as vacant.   The United States attorney purge appears to have been prompted by an array of improper political motives. Carol Lam, the San Diego attorney, seems to have been fired to stop her from continuing an investigation that put Republican officials and campaign contributors at risk. These charges, like the accusation that Mr. McKay and other United States attorneys were insufficiently aggressive about voter fraud, are a way of saying, without actually saying, that they would not use their offices to help Republicans win elections. It does not justify their firing; it makes their firing a graver offense.    NS "
text

In [None]:
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

In [None]:
sentences = sent_tokenize(____)

In [None]:
sentences

In [None]:
pd.DataFrame(sentences, columns=['text'])
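
Splitting on punctuation alone is usually not enough for newspaper text, because abbreviations such as "Mr." also end with a period. The following is a minimal standard-library sketch of that failure mode; the `sample` string and the regular expression are illustrative and not part of the lab material. Punkt, the model behind `sent_tokenize`, is designed to handle such abbreviations.

```python
import re

# Two real sentences, the first one starting with an abbreviation.
sample = "Mr. McKay refused to go along. There was no evidence."

# Naive rule: split on whitespace that follows '.', '!' or '?'.
naive = re.split(r"(?<=[.!?])\s+", sample)

# The abbreviation "Mr." is wrongly treated as a sentence of its own.
print(naive)  # ['Mr.', 'McKay refused to go along.', 'There was no evidence.']
```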

## 2. Sampling (of training set and test set)

Given the corpus divided into annotation units 

In [None]:
corpus = pd.read_csv('./data/corpus.csv')

Randomly sample 200 sentences as the training set

In [None]:
training_set = _______

Randomly sample 100 sentences as the test set

⚠️ **The training set and the test set must have no intersection**

In [None]:
test_set = ______

Export the training set and the test set for annotation

In [None]:
training_set.to_csv('training_set.csv', index=False)

In [None]:
test_set.to_csv('test_set.csv', index=False)
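
One possible way to keep the two samples disjoint is to remove the training rows before drawing the test set. The sketch below uses a toy DataFrame in place of `corpus.csv`; the `text` column name, the corpus size and the `toy_` variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy corpus standing in for ./data/corpus.csv (column name is an assumption).
toy_corpus = pd.DataFrame({"text": [f"sentence {i}" for i in range(1000)]})

# Draw the training set first, with a fixed seed for replicability.
toy_train = toy_corpus.sample(n=200, random_state=42)

# Drop the training rows by index before drawing the test set,
# so the two samples cannot overlap.
toy_test = toy_corpus.drop(toy_train.index).sample(n=100, random_state=42)

# Sanity check: the two index sets share no row.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))  # 200 100 0
```

Sampling without replacement is the default for `DataFrame.sample`, so no sentence appears twice within either set.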

## 3. Annotation

Use your favorite annotation tool! (Excel is OK...)

## 4. Training and evaluation

Given the annotated training set and test set

In [None]:
train = pd.read_csv('./data/annotated_training_set.csv')

In [None]:
test = pd.read_csv('./data/annotated_test_set.csv')

**Step 1:** Preprocess the training and test data with `bert.encode()`

In [None]:
train_loader = bert.encode(_____, _____)

In [None]:
test_loader = bert.encode(_____, _____)

**Step 2:** Train a model, evaluate it and save it with `bert.run_training()`

In [None]:
score = bert.run_training(___________,           #encoded training set
                          ___________,           #encoded test set
                          n_epochs=3,            #number of epochs
                          lr=5e-5,               #learning rate
                          random_state=42,       #random state (for replicability)
                          save_model_as='off')   #name of the saved model

In [None]:
score
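
The lab does not spell out what `score` contains. Assuming it reports precision, recall and F1 on the test set (an assumption; check the library documentation), these metrics can be reproduced by hand on toy labels. The sketch below uses made-up `y_true` and `y_pred` arrays and a hypothetical helper `precision_recall_f1`; it does not call the library.

```python
import numpy as np

# Made-up gold labels and predictions for 8 sentences (1 = unnamed source).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class (hypothetical helper)."""
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```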

## 5. Prediction

In [None]:
corpus

**Step 1:** Preprocess the prediction data with `bert.encode()`

In [None]:
pred_dataloader = bert.encode(____)

**Step 2:** Prediction with the saved model using `bert.predict_with_model()`

In [None]:
pred = bert.predict_with_model(____, model_path=____)

The function returns a two-dimensional array: for each sentence, the probability that it belongs to category 0 and the probability that it belongs to category 1.

In [None]:
pred

Store the predicted labels and probabilities into the dataframe

In [None]:
corpus['pred_label'] = np.argmax(pred, axis=1)

In [None]:
corpus['pred_proba'] = np.max(pred, axis=1)
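
To see what the two assignments above compute, here is a small sketch on a made-up probability array with the same shape as `pred` (one row per sentence, one column per category). The values in `toy_pred` are invented for illustration.

```python
import numpy as np

# Made-up output for 3 sentences: columns are P(category 0), P(category 1).
toy_pred = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
])

# Index of the most probable category in each row.
toy_labels = np.argmax(toy_pred, axis=1)

# Probability of that most probable category.
toy_probas = np.max(toy_pred, axis=1)

print(toy_labels)  # [0 1 0]
print(toy_probas)  # [0.9  0.8  0.55]
```

Note that `pred_proba` built this way is the probability of the predicted label, not of category 1; for the latter, `toy_pred[:, 1]` would be the column to keep.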

Inspect the predictions

In [None]:
corpus

You can then run any statistical analysis on the entire corpus, for example the share of sentences predicted as category 1 in each year:

In [None]:
# Sentences predicted as category 1 per year, divided by all sentences of that year
corpus.groupby('year')['pred_label'].sum() / corpus['year'].value_counts()
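
Because `pred_label` only takes the values 0 and 1, the per-year computation above is the same as the mean of `pred_label` within each year. The sketch below checks this on a toy DataFrame; the data and the `toy` name are invented for illustration.

```python
import pandas as pd

# Toy predictions: 3 sentences in 2000 (2 flagged), 2 sentences in 2001 (none flagged).
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "pred_label": [1, 0, 1, 0, 0],
})

# Share of flagged sentences per year, written as a ratio of sum to count.
ratio = toy.groupby("year")["pred_label"].sum() / toy["year"].value_counts()

# Same quantity, written as a group mean.
share = toy.groupby("year")["pred_label"].mean()

print(share)
```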

**To recapitulate:**

3 main functions:

- `bert.encode(texts, labels)` to preprocess the data
   (only `bert.encode(texts)` for prediction data)
   
   
- `bert.run_training(train_loader, test_loader, lr, n_epochs, random_state, save_model_as)` 

    to train, validate and save the model


- `bert.predict_with_model(pred_loader, model_path)` to make predictions

## Exercise: Clickbait detection

Use the following training data and test data to build a model that automatically classifies whether a headline is clickbait.

In [None]:
cb_train = pd.read_csv('../AugmentedSocialScientist/datasets/english/clickbait_train.csv')
cb_test = pd.read_csv('../AugmentedSocialScientist/datasets/english/clickbait_test.csv')

Then use the model to automatically classify the following headlines: 

In [None]:
cb_pred = pd.read_csv('../AugmentedSocialScientist/datasets/english/clickbait_pred.csv')
cb_pred