Author Note (Taegyoon Kim, taegyoon@psu.edu)


---


- This notebook implements the fine-tuned BERT classifier introduced in **Taegyoon Kim. Violent Political Rhetoric on Twitter. *Political Science Research and Methods***.
- The input data (i.e., the training set) consists of labeled tweets that contain one or more violent keywords, extracted using the violent keyword extractor (https://github.com/taegyoon-kim/violent_political_rheotric_on_twitter/blob/master/violent_political_rhetoric_violent_keyword_extract.py).
- The classifier is trained on a GPU provided by Google Colaboratory.
- Load the input data (e.g., from Google Drive, as below) and train the classifier.



Mount Google Drive

---







In [None]:
from google.colab import drive
drive.mount("/content/drive")

Packages


---



In [None]:
!pip install torch torchvision
!pip install transformers==2.10.0
!pip install seqeval
!pip install tensorboardx
!pip install simpletransformers==0.9.1 # the classifier is based on simpletransformers package https://github.com/ThilinaRajapakse/simpletransformers

In [None]:
import pandas as pd
import numpy as np

import gc
import requests
import os

from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, confusion_matrix
from scipy.special import softmax

import random

import torch

print("CUDA available" if torch.cuda.is_available() else "CPU")
print("PyTorch version: ", torch.__version__)

Load Data

---



In [None]:
url = '/content/drive/My Drive/diss_detection/diss_detection_training.csv' # location of training set

df = pd.read_csv(url)
df['text'] = df['status_final_text']
df['threat'] = df['final_binary'].astype(float)
df = df[['text','threat']]

print(len(df['threat']))
print(df['threat'].value_counts(normalize = True))

Performance Metrics


---



In [11]:
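def report_results(A, B):
    """Return [precision, recall, F1, accuracy] for predictions A against true labels B."""

    # align predictions and labels, dropping rows where either is missing
    df = pd.DataFrame({'A': A,
                       'B': B})
    df = df.dropna()
    A = df['A']
    B = df['B']

    # scikit-learn metrics take (y_true, y_pred), so the true labels B come first
    prec = precision_score(B, A)
    rec = recall_score(B, A)
    f1 = f1_score(B, A)
    acc = accuracy_score(B, A)

    performance = [prec, rec, f1, acc]

    return performance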
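
To make the four numbers returned by `report_results` concrete, the sketch below recomputes them by hand from a pair of toy label lists. The helper name `binary_metrics` and the toy labels are illustrative assumptions and are not part of the original notebook; it assumes binary labels with 1 as the positive (threat) class.

```python
def binary_metrics(y_true, y_pred):
    """Return [precision, recall, f1, accuracy] for binary labels (positive class = 1)."""
    # count the four cells of the confusion matrix
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)

    return [precision, recall, f1, accuracy]


# toy labels, made up for illustration only
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(toy_true, toy_pred))
```

With these toy labels there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so precision, recall, and F1 are each 2/3 and accuracy is 0.75.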

Define Set Seed Function


---



In [12]:
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

5-fold Cross Validation


---



In [None]:
## hyper-parameters

args = {
   
   'output_dir': 'outputs/',
   'cache_dir': 'cache/',

   'fp16': False,
   'fp16_opt_level': 'O1',
   'max_seq_length': 250,
   'train_batch_size': 8,
   'eval_batch_size': 8,
   'gradient_accumulation_steps': 1,
   'num_train_epochs': 3,
   'weight_decay': 0,
   'learning_rate': 3e-5,
   'adam_epsilon': 1e-8,
   'warmup_ratio': 0.06,
   'warmup_steps': 0,
   'max_grad_norm': 1.0,

   'logging_steps': 50,
   'evaluate_during_training': False,
   'save_steps': 2000,
   'eval_all_checkpoints': True,
   'use_tensorboard': True,

   'overwrite_output_dir': True,
   'reprocess_input_data': True
   
   }


## set seed number

set_seed(777)


## cross validate

kf = KFold(n_splits = 5, random_state = 777, shuffle = True)

for train_index, val_index in kf.split(df):

    # split the dataframe; copy the validation fold so new columns can be added to it safely
    train_df = df.iloc[train_index]
    val_df = df.iloc[val_index].copy()

    # define model
    model = ClassificationModel('bert', 'bert-base-uncased', args = args)

    # train model
    model.train_model(train_df)

    # validate model (the texts are passed as a plain list of strings)
    predictions, raw_outputs = model.predict(val_df['text'].tolist())
    probabilities = softmax(raw_outputs, axis=1)

    # apply different probability thresholds for the positive (threat) class
    val_df['BERT_threat_850'] = np.where(probabilities[:,1] >= 0.85, 1, 0)
    val_df['BERT_threat_875'] = np.where(probabilities[:,1] >= 0.875, 1, 0)
    val_df['BERT_threat_900'] = np.where(probabilities[:,1] >= 0.9, 1, 0)

    # performance for this fold: [precision, recall, F1, accuracy] at each threshold
    performance_850 = report_results(val_df['BERT_threat_850'], val_df['threat'])
    performance_875 = report_results(val_df['BERT_threat_875'], val_df['threat'])
    performance_900 = report_results(val_df['BERT_threat_900'], val_df['threat'])
    print(performance_850)
    print(performance_875)
    print(performance_900)
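
The loop above prints one `[precision, recall, F1, accuracy]` list per fold and threshold. One possible way to summarize them is to collect the per-fold lists and average each metric, as in the minimal sketch below. The helper `average_folds` and the numbers in `fold_performance` are hypothetical placeholders for illustration only; they are not results from the paper or from this notebook.

```python
from statistics import mean

metric_names = ['precision', 'recall', 'f1', 'accuracy']

# PLACEHOLDER values only, standing in for the lists printed in each fold
fold_performance = [
    [0.50, 0.50, 0.50, 0.50],
    [0.50, 0.50, 0.50, 0.50],
    [0.50, 0.50, 0.50, 0.50],
    [0.50, 0.50, 0.50, 0.50],
    [0.50, 0.50, 0.50, 0.50],
]


def average_folds(per_fold):
    """Average each metric across folds; per_fold is a list of equal-length lists."""
    # zip(*per_fold) groups the values of the same metric from every fold
    return [mean(values) for values in zip(*per_fold)]


for name, value in zip(metric_names, average_folds(fold_performance)):
    print(f"{name}: {value:.3f}")
```

In the actual loop, each `performance_850` (or `_875`, `_900`) list could be appended to its own list of folds before averaging, so the thresholds can be compared on the same footing.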

Prediction on New Text


---



In [None]:
## train model

model = ClassificationModel('bert', 'bert-base-uncased', args = args)
model.train_model(df)


## generate predictions

predictions, raw_outputs = model.predict(['your_new_text'])
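
As in the cross-validation step, the raw outputs for new text can be converted to probabilities and compared against a threshold such as 0.9. The sketch below illustrates that step in plain Python on made-up logits; `softmax_row`, `flag_threats`, and `toy_raw_outputs` are hypothetical names introduced here, and the dependency-free softmax is a stand-in for `scipy.special.softmax` used earlier in the notebook.

```python
import math


def softmax_row(logits):
    """Convert one row of raw outputs (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_threats(raw_outputs, threshold=0.9):
    """Return 1 where the probability of class 1 (threat) meets the threshold, else 0."""
    return [1 if softmax_row(row)[1] >= threshold else 0 for row in raw_outputs]


# made-up logits for three texts, in the order [non-threat, threat]
toy_raw_outputs = [[-2.0, 3.0], [1.5, -0.5], [0.0, 0.0]]
print(flag_threats(toy_raw_outputs, threshold=0.9))  # prints [1, 0, 0]
```

Only the first toy row clears the 0.9 threshold (its threat probability is about 0.99); the other two have threat probabilities of about 0.12 and exactly 0.5.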