# AP4: Annotation Analysis
<b>Sarah Barrington, April 2023</b>

This analysis uses the annotated dataset of removed Reddit posts and comments labelled by single thematic categories, which are as follows:

* Racial
* Gendered
* Moderation, censorship, wokeness
* Sexual
* Political/geographical
* Health
* Illegal activities or violence
* Criticism
* Environment/world/population
* Corporations
* Unlabelled

The goal of this analysis is to quantify the link between the input posts and their annotated categories, that is, how well the text of a post predicts its label.

# Set up and imports

In [39]:
import os
import pandas as pd
import csv2tsv

from sklearn.model_selection import train_test_split

try:
    os.chdir('AP4')
except FileNotFoundError:
    # The AP4 folder is not visible from here, so assume we are already inside it
    print('Working directory already set')

Working directory already set


# Divide data into test, training and development

In [40]:
# Import 'adjudicated' data 
df = pd.read_csv('adjudicated.txt', sep='\t', header=None)
df.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 41667 | adjudicated | health | So they want to inject shit into people? Liter... |
| 1 | 135751 | adjudicated | health | Sort of around the time E-cigs started getting... |
| 2 | 81008 | adjudicated | health | A heart attack at 50 cuts off 10-20 years stil... |
| 3 | 24973 | adjudicated | criticism | Damn, we needed a Harvard University study to ... |
| 4 | 156020 | adjudicated | moderation | Great discussion as always on Re- [Removed] |


In [41]:
# Clean up data
X = df.iloc[:, 3] # Input text
y = df.iloc[:, 2] # Annotated labels

In [59]:
# Divide into three groups
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

# Ensure splits have the correct proportions
print(len(X_train)/len(df))
print(len(X_test)/len(df))
print(len(X_dev)/len(df))
print(len(df))

0.6
0.2
0.2
550
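
The 60/20/20 proportions follow from applying the second split to the 80% that remains after the first: 0.25 × 0.8 = 0.2. A minimal sketch of that arithmetic, using only the standard library. The helper name `two_stage_sizes` is hypothetical and the rounding is simplified, so scikit-learn may differ by a row when the products are not whole numbers.

```python
def two_stage_sizes(n_rows, test_size, dev_size_of_remainder):
    """Return (train, dev, test) row counts for two successive hold-out splits."""
    n_test = round(n_rows * test_size)                  # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_dev = round(n_remaining * dev_size_of_remainder)  # second split: hold out the dev set
    n_train = n_remaining - n_dev
    return n_train, n_dev, n_test


n_train, n_dev, n_test = two_stage_sizes(550, test_size=0.2, dev_size_of_remainder=0.25)
print(n_train, n_dev, n_test)                    # 330 110 110
print(n_train / 550, n_dev / 550, n_test / 550)  # 0.6 0.2 0.2
```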


In [60]:
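def write_txt_file(X, y, label):
    """Write one split as a headerless TSV with columns: row index, text, label."""
    os.makedirs('splits', exist_ok=True)  # Create the output folder if it does not exist yet
    pd.DataFrame({'text': X, 'label': y}).to_csv(f'splits/{label}.txt', sep="\t", header=False)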
    
write_txt_file(X_train, y_train, 'train')
write_txt_file(X_test, y_test, 'test')
write_txt_file(X_dev, y_dev, 'dev')
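
Each file written above has three tab-separated columns (row index, text, label) and no header row. A minimal sketch of reading such lines back, using an in-memory string in place of a file. The two sample rows are made up for illustration, and the sketch assumes the text itself contains no tabs or line breaks.

```python
import io

# Two made-up rows in the layout written above: row index, text, label
sample = io.StringIO("17\tan example comment\thealth\n42\tanother comment\tcriticism\n")

rows = []
for line in sample:
    # Strip the line ending first; otherwise it stays attached to the last column
    index, text, label = line.rstrip("\n").split("\t")
    rows.append((int(index), text, label))

print(rows)
# [(17, 'an example comment', 'health'), (42, 'another comment', 'criticism')]

# Without stripping, the label column carries the newline
unstripped_label = "7\ttext\thealth\n".split("\t")[2]
print(unstripped_label == "health\n")  # True
```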

# Build classifier

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# NOTE: MultiLabelBinarizer expects each sample to be a collection of labels.
# Given plain strings, it treats every character as a separate label, so the
# columns produced here are characters rather than categories.
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
y_dev = mlb.transform(y_dev)
y_test = mlb.transform(y_test)

# Vectorize text data
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_dev = vectorizer.transform(X_dev)
X_test = vectorizer.transform(X_test)

clf = MultiOutputClassifier(estimator=LogisticRegression()).fit(X_train, y_train)

# For multi-output targets, score() is exact-match accuracy: every output column must be correct
print('DEV accuracy:', clf.score(X_dev, y_dev))

DEV accuracy: 0.05454545454545454
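
For multi-output targets, the score reported by `MultiOutputClassifier` is an exact-match (subset) accuracy: a row counts as correct only if every output column is predicted correctly, which makes it a strict metric. A minimal pure-Python sketch of that calculation, with made-up indicator rows; the helper name `subset_accuracy` is hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted indicator vector matches the true one exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Made-up indicator rows for illustration (three outputs per row)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]

print(subset_accuracy(y_true, y_pred))  # 0.5
```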


In [54]:
y_train.shape

(330, 21)
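
The 21 columns are more than the eleven categories listed at the top. A likely explanation is that `MultiLabelBinarizer` expects each sample to be a collection of labels, and iterating over a plain string yields its characters. The sketch below reproduces that effect with the standard library only, using three label strings that appear in the data preview. Wrapping each label in a one-element list is one possible fix, shown here as an assumption rather than the method used in this notebook.

```python
labels = ["health", "criticism", "moderation"]

# Iterating over each string yields characters, so the "classes" become letters
char_classes = sorted({ch for label in labels for ch in label})
print(len(char_classes))  # 13

# Wrapping each label in a list makes the whole label the unit
wrapped = [[label] for label in labels]
label_classes = sorted({item for sample in wrapped for item in sample})
print(label_classes)  # ['criticism', 'health', 'moderation']
```

For a single-label task such as this one, scikit-learn classifiers also accept the string labels directly, so binarizing may not be needed at all.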

In [67]:
toast = 'hello'
print("%s/train.txt" % toast)

hello/train.txt


# Implementing base BERT model example 

In [69]:
!pip install transformers
from transformers import BertModel, BertTokenizer
import nltk
import torch
import torch.nn as nn
import numpy as np
import random
from scipy.stats import norm
import math

# If you have your folder of data on your Google drive account, you can connect that here

# Change this to the directory with your data
directory="splits"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))
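
def read_labels(filename):
    """Map each distinct label in the file to an integer id, in order of first appearance."""
    labels = {}
    with open(filename) as file:
        for line in file:
            # Columns: row index, text, label. Strip the newline so it is not part of the label
            cols = line.rstrip("\n").split("\t")
            label = cols[2]
            if label not in labels:
                labels[label] = len(labels)
    return labels


def read_data(filename, labels, max_data_points=1000):
    """Read (text, label id) pairs from a split file, shuffle them, and optionally truncate."""
    data = []
    data_labels = []
    with open(filename) as file:
        for line in file:
            cols = line.rstrip("\n").split("\t")
            label = cols[2]
            text = cols[1]

            data.append(text)
            data_labels.append(labels[label])

    # Shuffle texts and labels together so the pairs stay aligned
    tmp = list(zip(data, data_labels))
    random.shuffle(tmp)
    data, data_labels = zip(*tmp)

    if max_data_points is None:
        return data, data_labels

    return data[:max_data_points], data_labels[:max_data_points]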

labels=read_labels("%s/train.txt" % directory)
train_x, train_y=read_data("%s/train.txt" % directory, labels, max_data_points=None)
dev_x, dev_y=read_data("%s/dev.txt" % directory, labels, max_data_points=None)
test_x, test_y=read_data("%s/test.txt" % directory, labels, max_data_points=None)

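def evaluate(model, batches_x, batches_y):
    """Return (accuracy, number of examples) over pre-batched inputs and labels."""
    model.eval()
    corr = 0.
    total = 0.
    with torch.no_grad():
        # Distinct loop names avoid shadowing the function arguments
        for batch_x, batch_y in zip(batches_x, batches_y):
            y_preds = model.forward(batch_x)
            for idx, y_pred in enumerate(y_preds):
                prediction = torch.argmax(y_pred)
                if prediction == batch_y[idx]:
                    corr += 1.
                total += 1
    return corr / total, total
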
class BERTClassifier(nn.Module):

    def __init__(self, bert_model_name, params):
        super().__init__()
    
        self.model_name=bert_model_name
        self.tokenizer = BertTokenizer.from_pretrained(self.model_name, do_lower_case=params["doLowerCase"], do_basic_tokenize=False)
        self.bert = BertModel.from_pretrained(self.model_name)
        
        self.num_labels = params["label_length"]

        self.fc = nn.Linear(params["embedding_size"], self.num_labels)

    def get_batches(self, all_x, all_y, batch_size=32, max_toks=510):

        """ Get batches for input x, y data, with data tokenized according to the BERT tokenizer
        (and limited to a maximum number of WordPiece tokens) """

        batches_x=[]
        batches_y=[]

        for i in range(0, len(all_x), batch_size):

            # Pass the slice to the tokenizer as a list of strings
            x=list(all_x[i:i+batch_size])

            batch_x = self.tokenizer(x, padding=True, truncation=True, return_tensors="pt", max_length=max_toks)
            batch_y=all_y[i:i+batch_size]

            batches_x.append(batch_x.to(device))
            batches_y.append(torch.LongTensor(batch_y).to(device))

        return batches_x, batches_y
  

    def forward(self, batch_x):

        bert_output = self.bert(input_ids=batch_x["input_ids"],
                         attention_mask=batch_x["attention_mask"],
                         token_type_ids=batch_x["token_type_ids"],
                         output_hidden_states=True)

        # Represent an entire document by its [CLS] embedding (position 0),
        # taken from the *last* layer output (layer -1). Because of this choice,
        # the embedding is optimized for classification during training.

        bert_hidden_states = bert_output['hidden_states']

        out = bert_hidden_states[-1][:,0,:]

        out = self.fc(out)

        # Keep the [batch, num_labels] shape; squeezing would collapse a batch of size 1
        return out
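def confidence_intervals(accuracy, n, confidence_level):
    """Normal-approximation interval for an accuracy estimated from n examples.

    confidence_level is the coverage of the interval, e.g. 0.95 for a 95% interval.
    """
    tail_probability = (1 - confidence_level) / 2
    z_alpha = -1 * norm.ppf(tail_probability)
    se = math.sqrt((accuracy * (1 - accuracy)) / n)
    return accuracy - (se * z_alpha), accuracy + (se * z_alpha)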
def train(bert_model_name, model_filename, train_x, train_y, dev_x, dev_y, labels, embedding_size=768, doLowerCase=None):

    bert_model = BERTClassifier(bert_model_name, params={"label_length": len(labels), "doLowerCase":doLowerCase, "embedding_size":embedding_size})
    bert_model.to(device)

    batch_x, batch_y = bert_model.get_batches(train_x, train_y)
    dev_batch_x, dev_batch_y = bert_model.get_batches(dev_x, dev_y)

    optimizer = torch.optim.Adam(bert_model.parameters(), lr=1e-5)
    cross_entropy=nn.CrossEntropyLoss()

    num_epochs=30
    best_dev_acc = 0.
    patience=5

    best_epoch=0

    for epoch in range(num_epochs):
        bert_model.train()

        # Train
        for x, y in zip(batch_x, batch_y):
            y_pred = bert_model.forward(x)
            loss = cross_entropy(y_pred.view(-1, bert_model.num_labels), y.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Evaluate
        dev_accuracy, _=evaluate(bert_model, dev_batch_x, dev_batch_y)
        if epoch % 1 == 0:
            print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
            if dev_accuracy > best_dev_acc:
                torch.save(bert_model.state_dict(), model_filename)
                best_dev_acc = dev_accuracy
                best_epoch=epoch
        if epoch - best_epoch > patience:
            print("No improvement in dev accuracy over %s epochs; stopping training" % patience)
            break

    bert_model.load_state_dict(torch.load(model_filename))
    print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))
    return bert_model
# small BERT -- can run on laptop
# bert_model_name="google/bert_uncased_L-2_H-128_A-2"
# model_filename="mybert.model"
# embedding_size=128
# doLowerCase=True

# bert-base -- slow on laptop; better on Colab
bert_model_name="bert-base-cased"
model_filename="mybert.model"
embedding_size=768
doLowerCase=False

model=train(bert_model_name, model_filename, train_x, train_y, dev_x, dev_y, labels, embedding_size=embedding_size, doLowerCase=doLowerCase)
test_batch_x, test_batch_y = model.get_batches(test_x, test_y)
accuracy, test_n=evaluate(model, test_batch_x, test_batch_y)

lower, upper=confidence_intervals(accuracy, test_n, .95)
print("Test accuracy for best dev model: %.3f, 95%% CIs: [%.3f %.3f]\n" % (accuracy, lower, upper))

Running on cpu



Epoch 0, dev accuracy: 0.227
Epoch 1, dev accuracy: 0.236
Epoch 2, dev accuracy: 0.255
Epoch 3, dev accuracy: 0.273
Epoch 4, dev accuracy: 0.273
Epoch 5, dev accuracy: 0.300
Epoch 6, dev accuracy: 0.327
Epoch 7, dev accuracy: 0.336
Epoch 8, dev accuracy: 0.373
Epoch 9, dev accuracy: 0.409
Epoch 10, dev accuracy: 0.427
Epoch 11, dev accuracy: 0.436
Epoch 12, dev accuracy: 0.455
Epoch 13, dev accuracy: 0.445
Epoch 14, dev accuracy: 0.427
Epoch 15, dev accuracy: 0.445
Epoch 16, dev accuracy: 0.436
Epoch 17, dev accuracy: 0.418
Epoch 18, dev accuracy: 0.445
No improvement in dev accuracy over 5 epochs; stopping training

Best Performing Model achieves dev accuracy of : 0.455
Test accuracy for best dev model: 0.482, 95% CIs: [0.388 0.575]
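
The interval reported above is a normal-approximation (Wald) interval around the test accuracy. A minimal sketch with the standard library only, assuming 53 correct predictions out of 110 test items. These counts are not printed in the output; they are an assumption consistent with the reported 0.482 and a 20% split of 550 rows. The helper name `wald_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(accuracy, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)  # about 1.96 for 95%
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Assumed counts: 53 correct out of 110 (not stated in the notebook output)
lower, upper = wald_interval(53 / 110, 110)
print("%.3f %.3f" % (lower, upper))  # 0.388 0.575
```

With only about a hundred test items the interval is wide, which is why the bounds span almost 0.2.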



# Reporting accuracy and confidence intervals 
As reported above, the baseline TEST accuracy of the BERT model is <b>0.482</b>, with a 95% confidence interval of <b>[0.388, 0.575]</b>. This is below 0.5, but 0.5 is the chance rate only for a balanced two-class task. With up to eleven categories, uniform guessing would score roughly 1/11 ≈ 0.09, and a fairer floor is the majority-class rate, which has not been computed here. The result therefore suggests that the BERT features capture some, but not a strong, relationship with the human-annotated labels. There is room for improvement on this result.
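
For a single-label task with k categories, two common reference points are uniform guessing (1/k) and always predicting the most frequent label. A minimal sketch of both; the helper names are hypothetical and the label counts are made up for illustration, not taken from the dataset.

```python
from collections import Counter


def uniform_chance(labels):
    """Expected accuracy of guessing uniformly among the distinct labels."""
    return 1 / len(set(labels))


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Made-up labels for illustration only
toy_labels = ["health"] * 5 + ["criticism"] * 3 + ["moderation"] * 2

print(uniform_chance(toy_labels))     # 0.3333333333333333
print(majority_baseline(toy_labels))  # 0.5
```

Running the same two functions on the training labels would give the floor against which the 0.482 test accuracy can be compared.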

# <font color=red>CATHERINE TO START HERE</font>

# Analysis of results

# Tweaking model to improve score 