# Baseline Gender Biased Token Classifiers

### Target: Label Categories

### Word Embeddings: Custom fastText

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/`
* Sequence classification
    * 3 categories of labels: Linguistic, Person Name, Contextual
    * 1 model per category
* Word embeddings
    * Custom fastText (word2vec with subwords, trained on Archives' descriptive metadata extracted in October 2020)  

***

### Table of Contents

**[0.](#0) Preprocessing**

**[1.](#1) Baseline Model**

**[2.](#2) Hyperparameter Optimization**

**[3.](#3) Error Analysis**

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats
# For classification
# from sklearn.pipeline import Pipeline, FeatureUnion
# from sklearn.base import BaseEstimator, TransformerMixin
# from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, FunctionTransformer, OneHotEncoder
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.compose import ColumnTransformer
# from sklearn.impute import SimpleImputer
# from sklearn.model_selection import cross_val_score, RandomizedSearchCV
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# LR with OvR provides multilabel model
# from sklearn.multiclass import OneVsRestClassifier
# from sklearn.linear_model import LogisticRegression
# -------------------------------
# Multilabel models in sklearn
# -------------------------------
# from sklearn.ensemble import RandomForestClassifier
#     tree.DecisionTreeClassifier
#     tree.ExtraTreeClassifier
#     ensemble.ExtraTreesClassifier
#     neighbors.KNeighborsClassifier
#     neural_network.MLPClassifier
#     neighbors.RadiusNeighborsClassifier
#     linear_model.RidgeClassifier
#     linear_model.RidgeClassifierCV

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

***

#### Label Categories

Add the annotation label categories as a column of higher-level Inside-Outside-Beginning (IOB) tags so they can be used as targets:

In [6]:
df_train = utils.addCategoryTagColumn(df_train)
# df_train.head(20)

In [7]:
df_dev = utils.addCategoryTagColumn(df_dev)
df_dev.head()

Unnamed: 0,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,subset,tag_cat
172,3,5,154,After,"(907, 912)",IN,O,Biographical / Historical,dev,O
173,3,5,155,his,"(913, 916)",PRP$,B-Gendered-Pronoun,Biographical / Historical,dev,B-Linguistic
174,3,5,156,ordination,"(917, 927)",NN,O,Biographical / Historical,dev,O
175,3,5,157,he,"(928, 930)",PRP,B-Gendered-Pronoun,Biographical / Historical,dev,B-Linguistic
176,3,5,158,spent,"(931, 936)",VBD,O,Biographical / Historical,dev,O


Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [8]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag_cat"]

In [9]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create columns for each category so they can be passed into the models as individual features:

In [13]:
# ling_cat_tags = ["B-Linguistic", "I-Linguistic"]
# df_train_ling = df_train.loc[df_train.tag_cat.isin(ling_cat_tags)]
# df_dev_ling = df_dev.loc[df_dev.tag_cat.isin(ling_cat_tags)]

In [14]:
# pers_cat_tags = ["B-Person-Name", "I-Person-Name"]
# df_train_pers = df_train.loc[df_train.tag_cat.isin(pers_cat_tags)]
# df_dev_pers = df_dev.loc[df_dev.tag_cat.isin(pers_cat_tags)]

In [15]:
# cont_cat_tags = ["B-Contextual", "I-Contextual"]
# df_train_cont = df_train.loc[df_train.tag_cat.isin(cont_cat_tags)]
# df_dev_cont = df_dev.loc[df_dev.tag_cat.isin(cont_cat_tags)]

In [16]:
# df_train = (df_train.drop(columns=["tag_cat"])).drop_duplicates()
# df_dev = (df_dev.drop(columns=["tag_cat"])).drop_duplicates()

In [17]:
# join_cols = ["sentence_id", "token_id", "pos", "token"]

In [18]:
# df_train = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
# df_train = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_personname")
# df_train = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_contextual")
# df_train = df_train.rename(columns={"tag_cat":"tag_cat_linguistic"})
# # df_train.head(30)  # Should have one row per token!

In [19]:
# df_dev = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
# df_dev = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_personname")
# df_dev = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_contextual")
# df_dev = df_dev.rename(columns={"tag_cat":"tag_cat_linguistic"})
# # df_dev.head(30)

In [20]:
# df_train.tail(30)

Unnamed: 0,sentence_id,token_id,pos,token,tag_cat_linguistic,tag_cat_personname,tag_cat_contextual
779229,42027,753891,NN,treatment,,,
779230,42027,753892,IN,of,,,
779231,42027,753893,NN,homosexuality,,,
779232,42027,753894,IN,in,,,
779233,42027,753895,JJ,contemporary,,,
779234,42027,753896,JJ,medical,,,
779235,42027,753897,NNS,journals,,,
779236,42027,753898,CC,and,,,
779237,42027,753899,NNS,books,,,
779238,42027,753900,",",",",,,


**REMEMBER:** check that model input data created on correct subset of files - no stereotypes about homosexuality "offences" or "medical treatment" of homosexuality??? 

Replace the `tag_cat_` columns' `nan` values with `'O'`:

In [40]:
tag_cat_cols = ["tag_cat_linguistic", "tag_cat_personname", "tag_cat_contextual"]
df_train[tag_cat_cols] = df_train[tag_cat_cols].fillna('O')
df_dev[tag_cat_cols] = df_dev[tag_cat_cols].fillna('O')
df_dev.head()

Group the data by sentence, so the token column becomes a list of tokens for each sentence:

In [10]:
df_train_grouped = utils.implodeDataFrame(df_train, ["sentence_id"])
df_dev_grouped = utils.implodeDataFrame(df_dev, ["sentence_id"])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 500, 501, 501, 502, 503, 503, ...","[IN, CD, NNP, NNP, NNP, NNP, VBD, NNP, NNP, NN...","[In, 1941, Tom, Tom, Allan, Allan, married, Ja...","[O, O, B-Person-Name, B-Contextual, I-Person-N..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, B-Person-Name, I-Person-Name, I-Person-..."


Pad the sentences so they all have the same lengths:

In [42]:
# df_train_grouped = utils.addPaddedSentenceColumn(df_train_grouped)
# print(df_train_grouped.sentence.values[0][:20])
# df_dev_grouped = utils.addPaddedSentenceColumn(df_dev_grouped)

['Title', ':', 'Papers', 'of', 'The', 'The', 'Very', 'Very', 'Rev', 'Rev', 'Rev', 'Prof', 'Prof', 'James', 'James', 'Whyte', 'Whyte', '(', '1920-2005', ')']


Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, CATEGORY-TAG)`

In [11]:
dev_sentences = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_cat")
# print(dev_sentences[0])
train_sentences = utils.zipFeaturesAndTarget(df_train_grouped, "tag_cat")
# print(train_sentences[1])

#### Word Embeddings

Get GloVe word embeddings (which were trained on English Wikipedia entries) for the vocabulary of the dataset (the unique tokens in the training set):

In [12]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[0]

In [13]:
glove = utils.getGloveEmbeddings(d)
# print(glove["the"])

In [14]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


In [15]:
word_embeddings = utils.getEmbeddingsForTokens(glove, vocabulary)

In [16]:
assert np.array_equal(word_embeddings[0], glove[vocabulary[0].lower()])

In [17]:
embedding_dict = dict(zip(vocabulary, word_embeddings))

In [18]:
embedding_dict_keys = list(embedding_dict.keys())
for token in vocabulary:
    assert token in embedding_dict_keys

Create feature dictionaries:

*References:*
* *https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html*
* *https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training*

In [19]:
# Get a vector representation of a token from a dictionray of word embeddings
def extractEmbedding(token, embedding_dict=glove, dimensions=int(d)):
    if token.isalpha():
        token = token.lower()
    try:
        embedding = embedding_dict[token]
    except KeyError:
        embedding = np.zeros((dimensions,))
    return embedding.reshape(-1,1)

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,    # HOW IS THIS DECIDED? WHAT DOES THIS DO?
        'pos': pos,
        'pos[:2]': pos[:2],
        'token': token
    }
    
    # Add each value in a token's word embedding as a separate feature
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag for token, pos, tag in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag in sentence]

In [21]:
extractSentenceFeatures(train_sentences[0])[5]

{'bias': 1.0,
 'pos': 'DT',
 'pos[:2]': 'DT',
 'token': 'The',
 'e0': array([0.418], dtype=float32),
 'e1': array([0.24968], dtype=float32),
 'e2': array([-0.41242], dtype=float32),
 'e3': array([0.1217], dtype=float32),
 'e4': array([0.34527], dtype=float32),
 'e5': array([-0.044457], dtype=float32),
 'e6': array([-0.49688], dtype=float32),
 'e7': array([-0.17862], dtype=float32),
 'e8': array([-0.00066023], dtype=float32),
 'e9': array([-0.6566], dtype=float32),
 'e10': array([0.27843], dtype=float32),
 'e11': array([-0.14767], dtype=float32),
 'e12': array([-0.55677], dtype=float32),
 'e13': array([0.14658], dtype=float32),
 'e14': array([-0.0095095], dtype=float32),
 'e15': array([0.011658], dtype=float32),
 'e16': array([0.10204], dtype=float32),
 'e17': array([-0.12792], dtype=float32),
 'e18': array([-0.8443], dtype=float32),
 'e19': array([-0.12181], dtype=float32),
 'e20': array([-0.016801], dtype=float32),
 'e21': array([-0.33279], dtype=float32),
 'e22': array([-0.1552], dty

In [20]:
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]

In [21]:
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

<a id="1"></a>
## 1. Baseline Model

* **Features:** part-of-speech tag, first 2 letters of part-of-speech tag abbreviation, GloVe embeddings
* **Target:** label category IOB tags
* **Algorithm:** L2SGD

### Train

Train a Conditional Random Field (CRF) model with the default parameters:

In [22]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
# Available algorithms with sklearn_crfsuite are:
#     'lbfgs' - Gradient descent using the L-BFGS method
#     'l2sgd' - Stochastic Gradient Descent with L2 regularization term
#     'ap' - Averaged Perceptron
#     'pa' - Passive Aggressive (PA)
#     'arow' - Adaptive Regularization Of Weight Vector (AROW)

In [23]:
# clf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100) #iterations unlimited
clf = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.1, max_iterations=100)     # up to 1000 iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[2], max_iterations=100)           # max iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[3], max_iterations=100)           # max iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[4], max_iterations=100)           # max iterations allowed

In [24]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [25]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Person-Name', 'B-Contextual', 'I-Person-Name', 'I-Contextual', 'B-Linguistic', 'I-Linguistic']


### Predict

In [26]:
y_pred = clf.predict(X_dev)

### Evaluate

#### Strict Evaluation

In [31]:
metrics.flat_f1_score(y_dev, y_pred, average="weighted", labels=targets)

0.369708061511727

In [None]:
# # LBFGS - third-best
# targets_sorted = sorted(targets, key=lambda name: (name[1:], name[0]))
# print(metrics.flat_classification_report(y_dev, y_pred, labels=targets_sorted, digits=5))

In [28]:
# L2SGD - tied for first (about the same as the fourth)
targets_sorted = sorted(targets, key=lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(y_dev, y_pred, labels=targets_sorted, digits=5))

               precision    recall  f1-score   support

 B-Contextual    0.45215   0.20301   0.28021      1862
 I-Contextual    0.72059   0.03346   0.06395      2929
 B-Linguistic    0.70822   0.51477   0.59620      1523
 I-Linguistic    0.63636   0.02536   0.04878       276
B-Person-Name    0.58921   0.40075   0.47704      2670
I-Person-Name    0.67856   0.38080   0.48783      4396

    micro avg    0.62937   0.29372   0.40052     13656
    macro avg    0.63085   0.25969   0.32567     13656
 weighted avg    0.64169   0.29372   0.36971     13656



In [143]:
# # AP - second-best
# targets_sorted = sorted(targets, key=lambda name: (name[1:], name[0]))
# print(metrics.flat_classification_report(y_dev, y_pred, labels=targets_sorted, digits=5))

               precision    recall  f1-score   support

 B-Contextual    0.13750   0.04726   0.07034      1862
 I-Contextual    0.19298   0.00376   0.00737      2929
 B-Linguistic    0.61205   0.54695   0.57767      1523
 I-Linguistic    0.36842   0.02536   0.04746       276
B-Person-Name    0.44488   0.29775   0.35674      2670
I-Person-Name    0.57406   0.25569   0.35379      4396

    micro avg    0.49090   0.20929   0.29346     13656
    macro avg    0.38831   0.19613   0.23556     13656
 weighted avg    0.40762   0.20929   0.26020     13656



In [148]:
# # PA - tied for first (about the same as the second)
# targets_sorted = sorted(targets, key=lambda name: (name[1:], name[0]))
# print(metrics.flat_classification_report(y_dev, y_pred, labels=targets_sorted, digits=5))

               precision    recall  f1-score   support

 B-Contextual    0.24719   0.05908   0.09536      1862
 I-Contextual    0.36977   0.05428   0.09467      2929
 B-Linguistic    0.64818   0.53710   0.58743      1523
 I-Linguistic    0.38462   0.01812   0.03460       276
B-Person-Name    0.47510   0.34307   0.39843      2670
I-Person-Name    0.54005   0.29754   0.38369      4396

    micro avg    0.51015   0.24282   0.32903     13656
    macro avg    0.44415   0.21820   0.26570     13656
 weighted avg    0.45981   0.24282   0.30094     13656



In [153]:
# # AROW - worst-performing
# targets_sorted = sorted(targets, key=lambda name: (name[1:], name[0]))
# print(metrics.flat_classification_report(y_dev, y_pred, labels=targets_sorted, digits=5))

               precision    recall  f1-score   support

 B-Contextual    0.14218   0.11224   0.12545      1862
 I-Contextual    0.07963   0.09116   0.08500      2929
 B-Linguistic    0.46378   0.38674   0.42177      1523
 I-Linguistic    0.00444   0.00362   0.00399       276
B-Person-Name    0.30427   0.40000   0.34563      2670
I-Person-Name    0.27737   0.42357   0.33522      4396

    micro avg    0.24158   0.29262   0.26466     13656
    macro avg    0.21195   0.23622   0.21951     13656
 weighted avg    0.23706   0.29262   0.25795     13656



The built-in evaluation approach is strict, so unless the model predictions' labels are on text spans that exactly match the development data's test, the predicted labels will be deemed incorrect.

As with the manual annotation evaluation, we want to evaluate the predictions more loosely, considering overlapping text spans in addition to exactly matching text spans.

#### Loose Evaluation

Using the model with one of the best performing algorithms, Stochastic Gradient Descent with L2 regularization (`l2sgd`),  conduct a loose evaluation of the model's performance (as the manual annotation were evaluated).

In [179]:
print(y_pred[0])
print(y_dev[0])

['O', 'B-Linguistic', 'B-Contextual', 'B-Linguistic', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'B-Linguistic', 'O', 'B-Linguistic', 'O', 'O', 'O', 'O', 'O', 'B-Contextual', 'I-Contextual', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [180]:
dev_sent0 = X_dev[0]
tokens = [features["token"] for features in dev_sent0]
print(tokens)

['After', 'his', 'ordination', 'he', 'spent', 'three', 'years', 'as', 'an', 'army', 'Chaplain', 'and', 'then', 'in', '1948', 'was', 'inducted', 'to', 'Dunollie', 'Road', 'Church', 'in', 'Oban', '.']


To try to improve the model's performance, use cross-validation and randomized search for choosing regularization parameters:

In [125]:
folds = 3  # 3-fold cross validation

In [126]:
clf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2 = 0.1, max_iterations=100) #, all_possible_transitions=True)