# Baseline Gender Biased Token Classifiers

### Target: Label Categories

### Word Embeddings: GloVe

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels: Linguistic, Person Name, Contextual
* Word Embeddings
    * GloVe (trained on English Wikipedia dump)

***

### Table of Contents

**[0.](#0) Preprocessing**

**[1.](#1) Baseline Model**

**[2.](#2) Hyperparameter Optimization**

**[3.](#3) Error Analysis**

***

Load necessary libraries:

In [2]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats
from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay#, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [3]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [4]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [5]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [6]:
df_train.shape

(463439, 9)

***

#### Label Categories

Add the annotation label categories as a column of higher-level Inside-Outside-Beginning (IOB) tags so they can be used as targets:

In [7]:
df_train = utils.addCategoryTagColumn(df_train)
# df_train.head(20)

In [12]:
df_dev = utils.addCategoryTagColumn(df_dev)
# df_dev.head()

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [13]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag_cat"]

In [14]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Group the data by sentence, so the token column becomes a list of tokens for each sentence:

In [15]:
df_train_grouped = utils.implodeDataFrame(df_train, ["sentence_id"])
df_dev_grouped = utils.implodeDataFrame(df_dev, ["sentence_id"])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 500, 501, 501, 502, 503, 503, ...","[IN, CD, NNP, NNP, NNP, NNP, VBD, NNP, NNP, NN...","[In, 1941, Tom, Tom, Allan, Allan, married, Ja...","[O, O, B-Person-Name, B-Contextual, I-Person-N..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, B-Person-Name, I-Person-Name, I-Person-..."


Pad the sentences so they all have the same lengths?

In [42]:
# df_train_grouped = utils.addPaddedSentenceColumn(df_train_grouped)
# print(df_train_grouped.sentence.values[0][:20])
# df_dev_grouped = utils.addPaddedSentenceColumn(df_dev_grouped)

['Title', ':', 'Papers', 'of', 'The', 'The', 'Very', 'Very', 'Rev', 'Rev', 'Rev', 'Prof', 'Prof', 'James', 'James', 'Whyte', 'Whyte', '(', '1920-2005', ')']


Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, CATEGORY-TAG)`

In [16]:
dev_sentences = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_cat")
# print(dev_sentences[0])
train_sentences = utils.zipFeaturesAndTarget(df_train_grouped, "tag_cat")
# print(train_sentences[1])

#### Word Embeddings

Use the custom fastText word embeddings, trained on the entire dataset of descriptive metadata from the Archives (harvested in October 2020) using the Continuous Bag-of-Words (CBOW) algorithm.  Subword embeddings (for subwords from 2 to 6 characters long, inclusive) are used to infer the embeddings for out-of-vocabulary (OOV) words.

Use the word embedding model trained on lowercased text to 100 dimensions: 

In [19]:
file_name = get_tmpfile(config.tokc_path+"fasttext100_lowercased.model")
embedding_model = FastText.load(file_name)

In [20]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


Create feature dictionaries:

*References:*
* *https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html*
* *https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training*

In [29]:
# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,    # HOW IS THIS DECIDED? WHAT DOES THIS DO?
        'pos': pos,
        'pos[:2]': pos[:2],
        'token': token
    }
    
    # Add each value in a token's word embedding as a separate feature
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag for token, pos, tag in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag in sentence]

In [31]:
# extractSentenceFeatures(train_sentences[0])[5]

In [32]:
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]

In [33]:
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

<a id="1"></a>
## 1. Baseline Model

* **Features:** part-of-speech tag, first 2 letters of part-of-speech tag abbreviation
* **Target:** label category IOB tags
* **Algorithm:** L2SGD

### Train

Train a Conditional Random Field (CRF) model with the default parameters:

In [34]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
# Available algorithms with sklearn_crfsuite are:
#     'lbfgs' - Gradient descent using the L-BFGS method
#     'l2sgd' - Stochastic Gradient Descent with L2 regularization term
#     'ap' - Averaged Perceptron
#     'pa' - Passive Aggressive (PA)
#     'arow' - Adaptive Regularization Of Weight Vector (AROW)

In [35]:
# clf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100) #iterations unlimited
clf = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.1, max_iterations=100)     # up to 1000 iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[2], max_iterations=100)           # max iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[3], max_iterations=100)           # max iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[4], max_iterations=100)           # max iterations allowed

In [36]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [37]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Person-Name', 'B-Contextual', 'I-Person-Name', 'I-Contextual', 'B-Linguistic', 'I-Linguistic']


### Predict

In [38]:
y_pred = clf.predict(X_dev)

### Evaluate

#### Strict Evaluation

In [50]:
targets_sorted = sorted(targets, key=lambda name: (name[1:], name[0]))
for target in targets_sorted:
    print("Label:", target)
    print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", labels=target))
    print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", labels=target))
    print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", labels=target))
    print()

Label: B-Contextual
  - F1: 0.9126541151660342
  - Prec: 0.9097454884859033
  - Rec 0.9309258721399296

Label: I-Contextual
  - F1: 0.9126541151660342
  - Prec: 0.9097454884859033
  - Rec 0.9309258721399296

Label: B-Linguistic
  - F1: 0.9126541151660342
  - Prec: 0.9097454884859033
  - Rec 0.9309258721399296

Label: I-Linguistic
  - F1: 0.9126541151660342
  - Prec: 0.9097454884859033
  - Rec 0.9309258721399296

Label: B-Person-Name
  - F1: 0.9126541151660342
  - Prec: 0.9097454884859033
  - Rec 0.9309258721399296

Label: I-Person-Name
  - F1: 0.9126541151660342
  - Prec: 0.9097454884859033
  - Rec 0.9309258721399296



The built-in evaluation approach is strict, so unless the model predictions' labels are on text spans that exactly match the development data's test, the predicted labels will be deemed incorrect.

As with the manual annotation evaluation, we want to evaluate the predictions more loosely, considering overlapping text spans in addition to exactly matching text spans.

In [58]:
# dev_sentences[0]
df_dev_grouped = df_dev_grouped.rename(columns={"tag_cat":"tag_cat_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_cat_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat_expected,tag_cat_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 500, 501, 501, 502, 503, 503, ...","[IN, CD, NNP, NNP, NNP, NNP, VBD, NNP, NNP, NN...","[In, 1941, Tom, Tom, Allan, Allan, married, Ja...","[O, O, B-Person-Name, B-Contextual, I-Person-N...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [67]:
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded = df_dev_exploded.rename(columns={"sentence":"token"})
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,token,tag_cat_expected,tag_cat_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,O,O
5,155,PRP$,his,B-Linguistic,B-Linguistic
5,156,NN,ordination,O,O
5,157,PRP,he,B-Linguistic,B-Linguistic
5,158,VBD,spent,O,O


In [65]:
print(df_dev_exploded.shape)
print(df_dev.shape)

(154935, 5)
(154935, 5)


In [68]:
df_dev_grouped.to_csv(config.tokc_path+"model_output/categoryTags_crfL2sgd_POS-CustomFastText_grouped.csv")
df_dev_exploded.to_csv(config.tokc_path+"model_output/categoryTags_crfL2sgd_POS-CustomFastText.csv")

#### Loose Evaluation

Using the model with one of the best performing algorithms, Stochastic Gradient Descent with L2 regularization (`l2sgd`),  conduct a loose evaluation of the model's performance (as the manual annotation were evaluated).

To try to improve the model's performance, use cross-validation and randomized search for choosing regularization parameters:

In [125]:
folds = 3  # 3-fold cross validation

In [126]:
clf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2 = 0.1, max_iterations=100) #, all_possible_transitions=True)