# Lab session: Training a text classifier with BERT
***To automatically detect the use of <ins>unnamed sources</ins> in a*** **New York Times** ***corpus***

e.g.

Even <ins>one of Mr. Bush's advisers acknowledged</ins>: "I think he worked hard in New Hampshire, but I think they were all surprised by how hard he had to work."

For their part, <ins>New York officials say</ins> they are not trying to cut New Jersey out of the process.

**Import packages**

In [1]:
import pandas as pd
import numpy as np

In [2]:
from AugmentedSocialScientist import bert


There are 1 GPU(s) available.
We will use GPU 0: Quadro P5000



## 1. Tokenization 

Split the corpus into annotation units (e.g. sentences).

In [3]:
text = "In its fumbling attempts to explain the purge of United States attorneys, the Bush administration has argued that the fired prosecutors were not aggressive enough about addressing voter fraud. It is a phony argument; there is no evidence that any of them ignored real instances of voter fraud. But more than that, it is a window on what may be a major reason for some of the firings.   In partisan Republican circles, the pursuit of voter fraud is code for suppressing the votes of minorities and poor people. By resisting pressure to crack down on ''fraud,'' the fired United States attorneys actually appear to have been standing up for the integrity of the election system.   TD John McKay, one of the fired attorneys, says he was pressured by Republicans to bring voter fraud charges after the 2004 Washington governor's race, which a Democrat, Christine Gregoire, won after two recounts. Republicans were trying to overturn an election result they did not like, but Mr. McKay refused to go along. ''There was no evidence,'' he said, ''and I am not going to drag innocent people in front of a grand jury.''   Later, when he interviewed with Harriet Miers, then the White House counsel, for a federal judgeship that he ultimately did not get, he says, he was asked to explain ''criticism that I mishandled the 2004 governor's election.''   Mr. McKay is not the only one of the federal attorneys who may have been brought down for refusing to pursue dubious voter fraud cases. Before David Iglesias of New Mexico was fired, prominent New Mexico Republicans reportedly complained repeatedly to Karl Rove about Mr. Iglesias's failure to indict Democrats for voter fraud. The White House said that last October, just weeks before Mr. McKay and most of the others were fired, President Bush complained that United States attorneys were not pursuing voter fraud aggressively enough.   There is no evidence of rampant voter fraud in this country. Rather, Republicans under Mr. Bush have used such allegations as an excuse to suppress the votes of Democratic-leaning groups. They have intimidated Native American voter registration campaigners in South Dakota with baseless charges of fraud. They have pushed through harsh voter ID bills in states like Georgia and Missouri, both blocked by the courts, that were designed to make it hard for people who lack drivers' licenses -- who are disproportionately poor, elderly or members of minorities -- to vote. Florida passed a law placing such onerous conditions on voter registration drives, which register many members of minorities and poor people, that the League of Women Voters of Florida suspended its registration work in the state.   The claims of vote fraud used to promote these measures usually fall apart on close inspection, as Mr. McKay saw. Missouri Republicans have long charged that St. Louis voters, by which they mean black voters, registered as living on vacant lots. But when The St. Louis Post-Dispatch checked, it found that thousands of people lived in buildings on lots that the city had erroneously classified as vacant.   The United States attorney purge appears to have been prompted by an array of improper political motives. Carol Lam, the San Diego attorney, seems to have been fired to stop her from continuing an investigation that put Republican officials and campaign contributors at risk. These charges, like the accusation that Mr. McKay and other United States attorneys were insufficiently aggressive about voter fraud, are a way of saying, without actually saying, that they would not use their offices to help Republicans win elections. It does not justify their firing; it makes their firing a graver offense.    NS "
text

"In its fumbling attempts to explain the purge of United States attorneys, the Bush administration has argued that the fired prosecutors were not aggressive enough about addressing voter fraud. It is a phony argument; there is no evidence that any of them ignored real instances of voter fraud. But more than that, it is a window on what may be a major reason for some of the firings.   In partisan Republican circles, the pursuit of voter fraud is code for suppressing the votes of minorities and poor people. By resisting pressure to crack down on ''fraud,'' the fired United States attorneys actually appear to have been standing up for the integrity of the election system.   TD John McKay, one of the fired attorneys, says he was pressured by Republicans to bring voter fraud charges after the 2004 Washington governor's race, which a Democrat, Christine Gregoire, won after two recounts. Republicans were trying to overturn an election result they did not like, but Mr. McKay refused to go alon

In [4]:
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
sentences = sent_tokenize(text)

In [6]:
sentences

['In its fumbling attempts to explain the purge of United States attorneys, the Bush administration has argued that the fired prosecutors were not aggressive enough about addressing voter fraud.',
 'It is a phony argument; there is no evidence that any of them ignored real instances of voter fraud.',
 'But more than that, it is a window on what may be a major reason for some of the firings.',
 'In partisan Republican circles, the pursuit of voter fraud is code for suppressing the votes of minorities and poor people.',
 "By resisting pressure to crack down on ''fraud,'' the fired United States attorneys actually appear to have been standing up for the integrity of the election system.",
 "TD John McKay, one of the fired attorneys, says he was pressured by Republicans to bring voter fraud charges after the 2004 Washington governor's race, which a Democrat, Christine Gregoire, won after two recounts.",
 'Republicans were trying to overturn an election result they did not like, but Mr. McK

In [7]:
pd.DataFrame(sentences, columns=['text'])

| | text |
|---|---|
| 0 | In its fumbling attempts to explain the purge ... |
| 1 | It is a phony argument; there is no evidence t... |
| 2 | But more than that, it is a window on what may... |
| 3 | In partisan Republican circles, the pursuit of... |
| 4 | By resisting pressure to crack down on ''fraud... |
| 5 | TD John McKay, one of the fired attorneys, say... |
| 6 | Republicans were trying to overturn an electio... |
| 7 | ''There was no evidence,'' he said, ''and I am... |
| 8 | Later, when he interviewed with Harriet Miers,... |
| 9 | Mr. McKay is not the only one of the federal a... |
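When the corpus contains several articles, the same split can be applied to each one while keeping track of where every sentence came from. The following is a minimal sketch, not part of the original session: `to_annotation_units` and `naive_splitter` are hypothetical names, and the naive splitter is only a stand-in for the demo (it would break on abbreviations such as "Mr."). In practice you would pass `sent_tokenize` as the `splitter` argument.

```python
import pandas as pd

def to_annotation_units(articles, splitter):
    """Split each article into units and return one row per unit."""
    rows = pd.DataFrame({
        "article_id": range(len(articles)),
        "text": [splitter(article) for article in articles],
    })
    # explode() turns each list of units into one row per unit
    return rows.explode("text", ignore_index=True)

# Stand-in splitter for the demo only (assumption); use sent_tokenize on real text
naive_splitter = lambda article: [s.strip() for s in article.split(". ") if s.strip()]

units = to_annotation_units(
    ["First sentence. Second sentence.", "Only one sentence here."],
    naive_splitter,
)
print(units)
```

Keeping an `article_id` column makes it possible to trace each annotated sentence back to its source article later.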


## 2. Sampling (of training set and test set)

Load the corpus, already divided into annotation units:

In [8]:
corpus = pd.read_csv('./data/corpus.csv')

Randomly sample 200 sentences as the training set:

In [9]:
training_set = corpus.sample(200,random_state=42)

Randomly sample 100 sentences as the test set:

⚠️ **The training set and the test set must have no intersection**

In [10]:
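# Exclude every sentence already in the training set, then sample from the rest.
# (`x not in training_set.text` would test the index labels, not the values.)
test_set = corpus[~corpus.text.isin(training_set.text)].sample(100, random_state=42)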
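One subtlety when excluding training sentences: Python's `in` operator applied to a pandas Series tests the index labels, not the values, so a value-based filter needs `Series.isin()`. Here is a minimal sketch on toy data (the names `corpus_demo`, `train_demo` and `test_demo` are illustrative only) that shows the difference and checks that the two samples are disjoint:

```python
import pandas as pd

corpus_demo = pd.DataFrame({"text": [f"sentence {i}" for i in range(10)]})
train_demo = corpus_demo.sample(4, random_state=42)

# Pitfall: `in` on a Series looks at the index labels, not at the values
print("sentence 0" in corpus_demo.text)   # "sentence 0" is a value, not an index label
print(0 in corpus_demo.text)              # 0 is an index label

# Value-based exclusion with isin(), then sampling from the remainder
remaining = corpus_demo[~corpus_demo.text.isin(train_demo.text)]
test_demo = remaining.sample(3, random_state=42)

# Sanity check: no sentence appears in both sets
overlap = set(train_demo.text) & set(test_demo.text)
print(f"overlap: {len(overlap)} sentences")
```

Running the same overlap check on the real `training_set` and `test_set` before exporting them is a cheap safeguard.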

Export the training set and the test set for annotation

In [11]:
training_set.to_csv('training_set.csv', index=False)

In [12]:
test_set.to_csv('test_set.csv', index=False)

## 3. Annotation

Use your favorite annotation tool! (Excel is ok...)
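Whatever tool you use, it can help to check the annotated file before training. The sketch below is one possible check. It assumes the annotated CSV has a `text` column and a `label` column coded 0/1, as in the rest of this session. `check_annotations` is a hypothetical helper, not part of AugmentedSocialScientist.

```python
import pandas as pd

def check_annotations(df, text_col="text", label_col="label", allowed=(0, 1)):
    """Return a list of human-readable problems found in an annotated dataframe."""
    problems = [f"missing column: {col}"
                for col in (text_col, label_col) if col not in df.columns]
    if problems:
        return problems
    if df[label_col].isna().any():
        problems.append("some rows have no label")
    unexpected = set(df[label_col].dropna().unique()) - set(allowed)
    if unexpected:
        problems.append(f"unexpected labels: {sorted(unexpected)}")
    if df[text_col].duplicated().any():
        problems.append("duplicated texts")
    return problems

# Toy examples (made-up data)
good = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})
bad = pd.DataFrame({"text": ["a", "a", "c"], "label": [0, 2, None]})
print(check_annotations(good))
print(check_annotations(bad))
```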

## 4. Training and evaluation

Load the annotated training set and test set:

In [13]:
train = pd.read_csv('./data/annotated_training_set.csv')

In [14]:
test = pd.read_csv('./data/annotated_test_set.csv')

**Step 1:** Preprocess the training and test data with `bert.encode()`

In [15]:
train_loader = bert.encode(train.text.values, train.label.values)

  0%|          | 0/2000 [00:00<?, ?it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]

In [16]:
test_loader = bert.encode(test.text.values, test.label.values)

  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/1000 [00:00<?, ?it/s]

**Step 2:** Train, evaluate and save a model with `bert.run_training()`

In [17]:
score = bert.run_training(train_loader,          #encoded training set
                          test_loader,           #encoded test set
                          n_epochs=3,            #number of epochs
                          lr=5e-5,               #learning rate
                          random_state=42,       #random state (for replicability)
                          save_model_as='off')   #name of the saved model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i


Training...
  Batch    40  of     63.    Elapsed: 0:00:26.

  Average training loss: 0.42
  Training took: 0:00:40

Running Validation...

  Average test loss: 0.16
  Validation took: 0:00:08
              precision    recall  f1-score   support

           0       0.99      0.93      0.96       900
           1       0.60      0.91      0.72       100

    accuracy                           0.93      1000
   macro avg       0.79      0.92      0.84      1000
weighted avg       0.95      0.93      0.94      1000


Training...
  Batch    40  of     63.    Elapsed: 0:00:25.

  Average training loss: 0.15
  Training took: 0:00:39

Running Validation...

  Average test loss: 0.13
  Validation took: 0:00:08
              precision    recall  f1-score   support

           0       0.99      0.96      0.98       900
           1       0.74      0.93      0.83       100

    accuracy                           0.96      1000
   macro avg       0.87      0.95      0.90      1000
weighted avg   

In [18]:
score

(array([0.99422633, 0.70895522]),
 array([0.95666667, 0.95      ]),
 array([0.97508494, 0.81196581]),
 array([900, 100]))
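The four arrays appear to be per-class precision, recall, F1-score and support, in that order: the last array matches the support column of the report, and the layout resembles scikit-learn's `precision_recall_fscore_support`. This ordering is inferred from the output rather than taken from the library documentation. Under that assumption, a short sketch (with the values copied into a hypothetical `score_demo` variable) turns the tuple into a readable table and checks that the third array is the harmonic mean of the first two:

```python
import numpy as np
import pandas as pd

# Values copied from the output of `score` (assumed order: precision, recall, F1, support)
score_demo = (np.array([0.99422633, 0.70895522]),
              np.array([0.95666667, 0.95]),
              np.array([0.97508494, 0.81196581]),
              np.array([900, 100]))

precision, recall, f1, support = score_demo
report = pd.DataFrame(
    {"precision": precision, "recall": recall, "f1": f1, "support": support},
    index=["class 0", "class 1"],
)
print(report.round(3))

# F1 is the harmonic mean of precision and recall
f1_check = 2 * precision * recall / (precision + recall)
print(np.allclose(f1, f1_check, atol=1e-6))
```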

## 5. Prediction

In [19]:
corpus

| | text | year |
|---|---|---|
| 0 | In North Carolina, Mr. McCrory won largely on ... | 2012 |
| 1 | Instead, the focus on Mr. Romney's personal fo... | 2012 |
| 2 | As investors have poured money into the skyroc... | 2015 |
| 3 | In a November 1998 plea agreement, Ronald Latt... | 2001 |
| 4 | "While we preserved, maintained and released a... | 2001 |
| ... | ... | ... |
| 4294 | A veteran researcher said the staff had been t... | 2006 |
| 4295 | Domestic yields rose only 1.9 percent. | 2006 |
| 4296 | The federal government, including the Centers ... | 2020 |
| 4297 | Noting the e-mails, phone records and testimon... | 2011 |


**Step 1:** Preprocess the prediction data with `bert.encode()`

In [20]:
pred_dataloader = bert.encode(corpus.text)

  0%|          | 0/4299 [00:00<?, ?it/s]

  0%|          | 0/4299 [00:00<?, ?it/s]

**Step 2:** Prediction with the saved model using `bert.predict_with_model()`

In [21]:
pred = bert.predict_with_model(pred_dataloader, model_path='./models/off')

  0%|          | 0/135 [00:00<?, ?it/s]

The function returns a two-dimensional array: for each sentence, the probability that it belongs to category 0 and the probability that it belongs to category 1.

In [22]:
pred

array([[0.99820566, 0.00179435],
       [0.9968569 , 0.00314302],
       [0.01327625, 0.9867237 ],
       ...,
       [0.9977915 , 0.00220845],
       [0.9951126 , 0.0048874 ],
       [0.9973519 , 0.00264812]], dtype=float32)

Store the predicted labels and their probabilities in the dataframe (`pred_proba` is the probability of the predicted label, not of category 1):

In [23]:
corpus['pred_label'] = np.argmax(pred, axis=1)

In [24]:
corpus['pred_proba'] = np.max(pred, axis=1)
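To make the two reductions concrete, here is a small sketch on a toy probability array. The first two rows are taken from the output of `pred`; the third row is made up for illustration. It also shows a stricter decision rule based on the probability of category 1. The 0.9 threshold is an arbitrary assumption, not a recommendation from the session.

```python
import numpy as np

pred_demo = np.array([[0.99820566, 0.00179435],
                      [0.01327625, 0.9867237],
                      [0.40, 0.60]],           # made-up borderline case
                     dtype=np.float32)

labels = np.argmax(pred_demo, axis=1)     # index of the most probable category per row
confidence = np.max(pred_demo, axis=1)    # probability of that category
proba_class1 = pred_demo[:, 1]            # probability of category 1 specifically

# Hypothetical stricter rule: only assign category 1 above a 0.9 probability
strict_labels = (proba_class1 >= 0.9).astype(int)

print(labels)
print(strict_labels)
```

With the default `argmax` rule the borderline row is assigned to category 1, while the stricter rule leaves it in category 0.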

Inspect the predictions

In [25]:
corpus

| | text | year | pred_label | pred_proba |
|---|---|---|---|---|
| 0 | In North Carolina, Mr. McCrory won largely on ... | 2012 | 0 | 0.998206 |
| 1 | Instead, the focus on Mr. Romney's personal fo... | 2012 | 0 | 0.996857 |
| 2 | As investors have poured money into the skyroc... | 2015 | 1 | 0.986724 |
| 3 | In a November 1998 plea agreement, Ronald Latt... | 2001 | 0 | 0.997723 |
| 4 | "While we preserved, maintained and released a... | 2001 | 0 | 0.997600 |
| ... | ... | ... | ... | ... |
| 4294 | A veteran researcher said the staff had been t... | 2006 | 1 | 0.985452 |
| 4295 | Domestic yields rose only 1.9 percent. | 2006 | 0 | 0.997453 |
| 4296 | The federal government, including the Centers ... | 2020 | 0 | 0.997792 |
| 4297 | Noting the e-mails, phone records and testimon... | 2011 | 0 | 0.995113 |


You can then run any statistical analysis on the entire corpus, for example the share of sentences predicted as category 1 in each year:

In [26]:
corpus.groupby('year').sum()['pred_label']/corpus.year.value_counts()

year
2000    0.357664
2001    0.458937
2002    0.446927
2003    0.490385
2004    0.364780
2005    0.258503
2006    0.361963
2007    0.373737
2008    0.368421
2009    0.349112
2010    0.342391
2011    0.337963
2012    0.359091
2013    0.372093
2014    0.272727
2015    0.361905
2016    0.327273
2017    0.316964
2018    0.337079
2019    0.341727
2020    0.342020
dtype: float64
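The ratio computed above is the share of rows with `pred_label == 1` in each year. Because the labels are 0/1, the same quantity can be written as a group mean, which avoids summing the other columns. A minimal sketch on made-up data (the `demo` dataframe is illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({
    "year":       [2000, 2000, 2001, 2001, 2001, 2001],
    "pred_label": [1,    0,    1,    0,    0,    0],
})

# Sum of positive predictions divided by the number of sentences per year
ratio = demo.groupby("year")["pred_label"].sum() / demo.year.value_counts()

# Equivalent formulation: the mean of a 0/1 column is the share of ones
share = demo.groupby("year")["pred_label"].mean()

print(share)
```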

**To recapitulate:**

Three main functions:

- `bert.encode(texts, labels)` to preprocess the data
  (only `bert.encode(texts)` for prediction data)

- `bert.run_training(train_loader, test_loader, lr, n_epochs, random_state, save_model_as)` to train, validate and save the model

- `bert.predict_with_model(pred_loader, model_path)` to make predictions

## Exercise: Clickbait detection

Use the following training data and test data to build a model that automatically classifies whether a headline is clickbait.

In [28]:
cb_train = pd.read_csv('../AugmentedSocialScientist/datasets/english/clickbait_train.csv')
cb_test = pd.read_csv('../AugmentedSocialScientist/datasets/english/clickbait_test.csv')

Then use the model to automatically classify the following headlines: 

In [29]:
cb_pred = pd.read_csv('../AugmentedSocialScientist/datasets/english/clickbait_pred.csv')
cb_pred

| | headline |
|---|---|
| 0 | 34 Musical Baby Names That'll Make You Want To... |
| 1 | Senate Approves Tight Regulation Over Cigarettes |
| 2 | Scotland predicted to have worst recession sin... |
| 3 | 17 Times Chloe The Mini Frenchie Won Instagram... |
| 4 | Markets rally as world's central banks infuse ... |
| 5 | 17 Photos Everyone Who Grew Up Eating Pan Dulc... |
| 6 | Zimbabwean opposition leader rejects calls for... |
| 7 | Chief of Swiss Re Steps Down |
| 8 | This Guy's Epic Story Explains Why Every Girl ... |
| 9 | There's A New Trailer For The "Sherlock" Chris... |


## Solution

Training

In [30]:
train_loader = bert.encode(cb_train.headline.values, cb_train.is_clickbait.values)

  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

In [31]:
test_loader = bert.encode(cb_test.headline.values, cb_test.is_clickbait.values)

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

In [32]:
score = bert.run_training(train_loader,
                          test_loader,
                          n_epochs=2,
                          lr=5e-5,
                          random_state=42,
                          save_model_as='clickbait')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i


Training...

  Average training loss: 0.43
  Training took: 0:00:02

Running Validation...

  Average test loss: 0.22
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       104
           1       0.96      0.95      0.95        96

    accuracy                           0.95       200
   macro avg       0.96      0.95      0.95       200
weighted avg       0.96      0.95      0.95       200


Training...

  Average training loss: 0.11
  Training took: 0:00:02

Running Validation...

  Average test loss: 0.19
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       104
           1       0.96      0.98      0.97        96

    accuracy                           0.97       200
   macro avg       0.97      0.97      0.97       200
weighted avg       0.97      0.97      0.97       200


Training complete!


Prediction

In [33]:
pred_loader = bert.encode(cb_pred.headline.values)

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

In [34]:
pred_proba = bert.predict_with_model(pred_loader, model_path='./models/clickbait')

  0%|          | 0/2 [00:00<?, ?it/s]

Inspect the prediction results

In [35]:
cb_pred['pred_label'] = np.argmax(pred_proba, axis=1)
cb_pred['pred_proba'] = np.max(pred_proba, axis=1)

In [36]:
for i in range(len(cb_pred)):
    print(f"{cb_pred.loc[i,'headline']}")
    print(f"Is clickbait: {bool(cb_pred.loc[i,'pred_label'])}, with a probability of {cb_pred.loc[i,'pred_proba']*100:.0f}%")
    print()

34 Musical Baby Names That'll Make You Want To Procreate
Is clickbait: True, with a probability of 98%

Senate Approves Tight Regulation Over Cigarettes
Is clickbait: False, with a probability of 97%

Scotland predicted to have worst recession since 1980, but not as bad as rest of UK
Is clickbait: False, with a probability of 97%

17 Times Chloe The Mini Frenchie Won Instagram In 2015
Is clickbait: True, with a probability of 96%

Markets rally as world's central banks infuse cash
Is clickbait: False, with a probability of 97%

17 Photos Everyone Who Grew Up Eating Pan Dulce Will Relate To
Is clickbait: True, with a probability of 96%

Zimbabwean opposition leader rejects calls for power sharing talks
Is clickbait: False, with a probability of 97%

Chief of Swiss Re Steps Down
Is clickbait: False, with a probability of 97%

This Guy's Epic Story Explains Why Every Girl Has A Trapped In The Closet Moment
Is clickbait: True, with a probability of 95%

There's A New Trailer For The "Sherl