# Baseline Gender Biased Token Classifiers

### Target: Label Categories

### Word Embeddings: fastText (custom)

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/`
* Sequence classification
    * 3 categories of labels: Linguistic, Person Name, Contextual
    * 1 model per category
* Word embeddings
    * Custom fastText (word2vec with subwords, trained on Archives' descriptive metadata extracted in October 2020)  

***

### Table of Contents

**[0.](#0) Preprocessing**

**[1.](#1) Models**

**[2.](#2) Performance Evaluation**

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats
from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay#, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

***

#### Label Categories

Add the annotation label categories as a column of higher-level Inside-Outside-Beginning (IOB) tags so they can be used as targets:

In [6]:
df_train = utils.addCategoryTagColumn(df_train)
# df_train.head(20)

In [7]:
df_dev = utils.addCategoryTagColumn(df_dev)
# df_dev.head()

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [8]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag_cat"]

In [9]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create columns for each category so they can be used as three separate targets:

In [10]:
ling_cat_tags = ["B-Linguistic", "I-Linguistic"]
df_train_ling = df_train.loc[df_train.tag_cat.isin(ling_cat_tags)]
df_dev_ling = df_dev.loc[df_dev.tag_cat.isin(ling_cat_tags)]

In [11]:
pers_cat_tags = ["B-Person-Name", "I-Person-Name"]
df_train_pers = df_train.loc[df_train.tag_cat.isin(pers_cat_tags)]
df_dev_pers = df_dev.loc[df_dev.tag_cat.isin(pers_cat_tags)]

In [12]:
cont_cat_tags = ["B-Contextual", "I-Contextual"]
df_train_cont = df_train.loc[df_train.tag_cat.isin(cont_cat_tags)]
df_dev_cont = df_dev.loc[df_dev.tag_cat.isin(cont_cat_tags)]

In [13]:
df_train = (df_train.drop(columns=["tag_cat"])).drop_duplicates()
df_dev = (df_dev.drop(columns=["tag_cat"])).drop_duplicates()

In [14]:
join_cols = ["sentence_id", "token_id", "pos", "token"]

In [15]:
df_train = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
df_train = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_personname")
df_train = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_contextual")
df_train = df_train.rename(columns={"tag_cat":"tag_cat_linguistic"})
# df_train.head(30)  # Should have one row per token!

In [16]:
df_dev = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
df_dev = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_personname")
df_dev = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_contextual")
df_dev = df_dev.rename(columns={"tag_cat":"tag_cat_linguistic"})
# df_dev.head(30)

In [17]:
# df_train.tail(30)

**REMEMBER:** check that model input data created on correct subset of files - no stereotypes about homosexuality "offences" or "medical treatment" of homosexuality??? 

Replace the `tag_cat_` columns' `nan` values with `'O'`:

In [18]:
tag_cat_cols = ["tag_cat_linguistic", "tag_cat_personname", "tag_cat_contextual"]
df_train[tag_cat_cols] = df_train[tag_cat_cols].fillna('O')
df_dev[tag_cat_cols] = df_dev[tag_cat_cols].fillna('O')
df_dev.head()

Unnamed: 0,sentence_id,token_id,pos,token,tag_cat_linguistic,tag_cat_personname,tag_cat_contextual
172,5,154,IN,After,O,O,O
173,5,155,PRP$,his,B-Linguistic,O,O
174,5,156,NN,ordination,O,O,O
175,5,157,PRP,he,B-Linguistic,O,O
176,5,158,VBD,spent,O,O,O


Group the data by sentence, so the token column becomes a list of tokens for each sentence:

In [19]:
df_train_grouped = utils.implodeDataFrame(df_train, ["sentence_id"])
df_dev_grouped = utils.implodeDataFrame(df_dev, ["sentence_id"])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat_linguistic,tag_cat_personname,tag_cat_contextual
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, B-Contextual, I-Co..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]","[O, O, O]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, O, B-Pers...","[O, O, B-Contextual, I-Contextual, I-Contextua..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, B-Contextual, O..."


Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, CATEGORY-TAG)`

In [20]:
train_sentences_ling = utils.zipFeaturesAndTarget(df_train_grouped, "tag_cat_linguistic")
print(train_sentences_ling[0][:3])
dev_sentences_ling = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_cat_linguistic")
print(dev_sentences_ling[0][:3])
train_sentences_pers = utils.zipFeaturesAndTarget(df_train_grouped, "tag_cat_personname")
dev_sentences_pers = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_cat_personname")
train_sentences_cont = utils.zipFeaturesAndTarget(df_train_grouped, "tag_cat_contextual")
dev_sentences_cont = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_cat_contextual")

[('Title', 'NN', 'O'), (':', ':', 'O'), ('Papers', 'NNS', 'O')]
[('After', 'IN', 'O'), ('his', 'PRP$', 'B-Linguistic'), ('ordination', 'NN', 'O')]


#### Word Embeddings

Use the custom fastText word embeddings, trained on the entire dataset of descriptive metadata from the Archives (harvested in October 2020) using the Continuous Bag-of-Words (CBOW) algorithm.  Subword embeddings (for subwords from 2 to 6 characters long, inclusive) are used to infer the embeddings for out-of-vocabulary (OOV) words.

Use the word embedding model trained on lowercased text to 100 dimensions: 

In [21]:
file_name = get_tmpfile(config.tokc_path+"fasttext100_lowercased.model")
embedding_model = FastText.load(file_name)

In [22]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


Define feature dictionaries:

In [23]:
# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,
        'pos': pos,
        'pos[:2]': pos[:2],
        'token': token
    }
    
    # Add each value in a token's word embedding as a separate feature
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag for token, pos, tag in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag in sentence]

In [24]:
# extractSentenceFeatures(train_sentences[0])[5]

*References:*
* *https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html*
* *https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training*

<a id="1"></a>
## 1. Models

### Linguistic

* **Features:** part-of-speech tag, first 2 letters of part-of-speech tag abbreviation, custom fastText embeddings
* **Target:** Linguistic label category IOB tags
* **Algorithm:** L2SGD

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Linguistic** category of tags:

In [25]:
train_sentences = train_sentences_ling
dev_sentences = dev_sentences_ling

In [26]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

In [27]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
# Available algorithms with sklearn_crfsuite are:
#     'lbfgs' - Gradient descent using the L-BFGS method
#     'l2sgd' - Stochastic Gradient Descent with L2 regularization term
#     'ap' - Averaged Perceptron
#     'pa' - Passive Aggressive (PA)
#     'arow' - Adaptive Regularization Of Weight Vector (AROW)

In [28]:
# clf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100) #iterations unlimited
clf = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.1, max_iterations=100, all_possible_transitions=True)     # up to 1000 iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[2], max_iterations=100)           # max iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[3], max_iterations=100)           # max iterations allowed
# clf = sklearn_crfsuite.CRF(algorithm=algorithms[4], max_iterations=100)           # max iterations allowed

In [29]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [30]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Linguistic', 'I-Linguistic']


#### Predict

In [31]:
y_pred = clf.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [32]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", labels=targets))

  - F1: 0.678230100682343
  - Prec: 0.7553309858774123
  - Rec 0.6287799791449427


Save the prediction data:

In [33]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_cat_linguistic":"tag_cat_linguistic_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_cat_linguistic_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat_linguistic_expected,tag_cat_personname,tag_cat_contextual,tag_cat_linguistic_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, B-Contextual, I-Co...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]","[O, O, O]","[O, O, O]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, O, B-Pers...","[O, O, B-Contextual, I-Contextual, I-Contextua...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, B-Contextual, O...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


### Person Name

* **Features:** part-of-speech tag, first 2 letters of part-of-speech tag abbreviation, custom fastText embeddings
* **Target:** Person-Name label category IOB tags
* **Algorithm:** L2SGD

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags:

In [34]:
train_sentences = train_sentences_pers
dev_sentences = dev_sentences_pers

In [35]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

In [36]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']

In [37]:
clf = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.1, max_iterations=100, all_possible_transitions=True)

In [38]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [39]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Person-Name', 'I-Person-Name']


#### Predict

In [40]:
y_pred = clf.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [41]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", labels=targets))

  - F1: 0.4946309099465871
  - Prec: 0.8145205069894943
  - Rec 0.35538592027141647


Save the prediction data:

In [42]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_cat_personname":"tag_cat_personname_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_cat_personname_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat_linguistic_expected,tag_cat_personname_expected,tag_cat_contextual,tag_cat_linguistic_predicted,tag_cat_personname_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, B-Contextual, I-Co...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]","[O, O, O]","[O, O, O]","[O, O, O]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, O, B-Pers...","[O, O, B-Contextual, I-Contextual, I-Contextua...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-Pers..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, B-Contextual, O...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-..."


<a id="1"></a>
### Contextual

* **Features:** part-of-speech tag, first 2 letters of part-of-speech tag abbreviation, custom fastText embeddings
* **Target:** Contextual label category IOB tags
* **Algorithm:** L2SGD

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Contextual** category of tags:

In [43]:
train_sentences = train_sentences_cont
dev_sentences = dev_sentences_cont

In [44]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

In [45]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']

In [46]:
clf = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.1, max_iterations=100, all_possible_transitions=True)     # up to 1000 iterations allowed

In [47]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [48]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Contextual', 'I-Contextual']


#### Predict

In [49]:
y_pred = clf.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [50]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", labels=targets))

  - F1: 0.2107333889044305
  - Prec: 0.4815860025638492
  - Rec 0.14458081963802788


Save the prediction data:

In [51]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_cat_contextual":"tag_cat_contextual_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_cat_contextual_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat_linguistic_expected,tag_cat_personname_expected,tag_cat_contextual_expected,tag_cat_linguistic_predicted,tag_cat_personname_predicted,tag_cat_contextual_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, B-Contextual, I-Co...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]","[O, O, O]","[O, O, O]","[O, O, O]","[O, O, O]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, O, B-Pers...","[O, O, B-Contextual, I-Contextual, I-Contextua...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-Pers...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, B-Contextual, O...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


<a id="2"></a>
## 2. Performance Evaluation

### Strict Evaluation

The built-in evaluation approach is strict, so unless the model predictions' labels are on text spans that exactly match the development data's test, the predicted labels will be deemed incorrect.

As with the manual annotation evaluation, we want to evaluate the predictions more loosely, considering overlapping text spans in addition to exactly matching text spans.  Save the predictions for each token and then use IntervalTree to evaluate performance considering overlapping labels, rather than only exactly matching labels.

In [52]:
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded = df_dev_exploded.rename(columns={"sentence":"token"})
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,token,tag_cat_linguistic_expected,tag_cat_personname_expected,tag_cat_contextual_expected,tag_cat_linguistic_predicted,tag_cat_personname_predicted,tag_cat_contextual_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
5,154,IN,After,O,O,O,O,O,O
5,155,PRP$,his,B-Linguistic,O,O,B-Linguistic,O,O
5,156,NN,ordination,O,O,O,O,O,O
5,157,PRP,he,B-Linguistic,O,O,B-Linguistic,O,O
5,158,VBD,spent,O,O,O,O,O,O


In [53]:
print(df_dev_exploded.shape)
print(df_dev.shape)

(152711, 9)
(152711, 7)


Save the grouped (one row per sentence) and exploded (one row per token) data:

In [54]:
df_dev_grouped.to_csv(config.tokc_path+"model_output/categoryTags_crfL2sgd_POS-fastText100_bySentence.csv")
df_dev_exploded.to_csv(config.tokc_path+"model_output/categoryTags_crfL2sgd_POS-fastText100_byToken.csv")
# df_dev_exploded = pd.read_csv(config.tokc_path+"model_output/categoryTags_crfL2sgd_POS-CustomFastText_byToken.csv")
# df_dev_exploded.head()

Strictly evaluate the **Linguistic** tags:

In [56]:
df_dev_exploded = df_dev_exploded.reset_index()

In [57]:
category = "linguistic"
tags = ["B-Linguistic", "I-Linguistic"]
subdf_pred_ling = utils.addCatAgreementAndMatchTypeCols(df_dev_exploded, category, tags)
subdf_pred_ling.head()

Unnamed: 0,sentence_id,token_id,token,tag_cat_linguistic_expected,tag_cat_linguistic_predicted,strict_agreement,match_type
0,5,154,After,O,O,TN,exact_match
1,5,155,his,B-Linguistic,B-Linguistic,TP,exact_match
2,5,156,ordination,O,O,TN,exact_match
3,5,157,he,B-Linguistic,B-Linguistic,TP,exact_match
4,5,158,spent,O,O,TN,exact_match


In [58]:
subdf_pred_ling_stats = subdf_pred_ling.groupby("strict_agreement").count()
subdf_pred_ling_stats = subdf_pred_ling_stats[["token_id"]]
subdf_pred_ling_stats = subdf_pred_ling_stats.rename(columns={"token_id":"linguistic_count"})
subdf_pred_ling_stats

Unnamed: 0_level_0,linguistic_count
strict_agreement,Unnamed: 1_level_1
FN,711
FP,359
TN,150435
TP,1206


In [59]:
subdf_pred_ling_stats2 = subdf_pred_ling.groupby("match_type").count()
subdf_pred_ling_stats2 = subdf_pred_ling_stats2[["token_id"]]
subdf_pred_ling_stats2 = subdf_pred_ling_stats2.rename(columns={"token_id":"linguistic_count"})
subdf_pred_ling_stats2

Unnamed: 0_level_0,linguistic_count
match_type,Unnamed: 1_level_1
category_match,1
exact_match,151641
mismatch,1069


Strictly evaluate the **Person Name** tags:

In [60]:
category = "personname"
tags = ["B-Person-Name", "I-Person-Name"]
subdf_pred_pers = utils.addCatAgreementAndMatchTypeCols(df_dev_exploded, category, tags)
subdf_pred_pers.head()

Unnamed: 0,sentence_id,token_id,token,tag_cat_personname_expected,tag_cat_personname_predicted,strict_agreement,match_type
0,5,154,After,O,O,TN,exact_match
1,5,155,his,O,O,TN,exact_match
2,5,156,ordination,O,O,TN,exact_match
3,5,157,he,O,O,TN,exact_match
4,5,158,spent,O,O,TN,exact_match


In [61]:
subdf_pred_pers_stats = subdf_pred_pers.groupby("strict_agreement").count()
subdf_pred_pers_stats = subdf_pred_pers_stats[["token_id"]]
subdf_pred_pers_stats = subdf_pred_pers_stats.rename(columns={"token_id":"personname_count"})
subdf_pred_pers_stats

Unnamed: 0_level_0,personname_count
strict_agreement,Unnamed: 1_level_1
FN,4370
FP,571
TN,145256
TP,2514


In [62]:
subdf_pred_pers_stats2 = subdf_pred_pers.groupby("match_type").count()
subdf_pred_pers_stats2 = subdf_pred_pers_stats2[["token_id"]]
subdf_pred_pers_stats2 = subdf_pred_pers_stats2.rename(columns={"token_id":"personname_count"})
subdf_pred_pers_stats2

Unnamed: 0_level_0,personname_count
match_type,Unnamed: 1_level_1
category_match,190
exact_match,147770
mismatch,4751


Strictly evaluate the **Contextual** tags:

In [63]:
category = "contextual"
tags = ["B-Contextual", "I-Contextual"]
subdf_pred_cont = utils.addCatAgreementAndMatchTypeCols(df_dev_exploded, category, tags)
subdf_pred_cont.head()

Unnamed: 0,sentence_id,token_id,token,tag_cat_contextual_expected,tag_cat_contextual_predicted,strict_agreement,match_type
0,5,154,After,O,O,TN,exact_match
1,5,155,his,O,O,TN,exact_match
2,5,156,ordination,O,O,TN,exact_match
3,5,157,he,O,O,TN,exact_match
4,5,158,spent,O,O,TN,exact_match


In [64]:
subdf_pred_cont_stats = subdf_pred_cont.groupby("strict_agreement").count()
subdf_pred_cont_stats = subdf_pred_cont_stats[["token_id"]]
subdf_pred_cont_stats = subdf_pred_cont_stats.rename(columns={"token_id":"contextual_count"})
subdf_pred_cont_stats

Unnamed: 0_level_0,contextual_count
strict_agreement,Unnamed: 1_level_1
FN,3959
FP,744
TN,147313
TP,695


In [65]:
subdf_pred_cont_stats2 = subdf_pred_cont.groupby("match_type").count()
subdf_pred_cont_stats2 = subdf_pred_cont_stats2[["token_id"]]
subdf_pred_cont_stats2 = subdf_pred_cont_stats2.rename(columns={"token_id":"contextual_count"})
subdf_pred_cont_stats2

Unnamed: 0_level_0,contextual_count
match_type,Unnamed: 1_level_1
category_match,153
exact_match,148008
mismatch,4550


Combine the statistics and calculate precision, recall, and F1 scores for each label:

In [66]:
subdf_pred_stats = pd.concat([subdf_pred_ling_stats.T, subdf_pred_pers_stats.T, subdf_pred_cont_stats.T])
subdf_pred_stats

strict_agreement,FN,FP,TN,TP
linguistic_count,711,359,150435,1206
personname_count,4370,571,145256,2514
contextual_count,3959,744,147313,695


In [67]:
lprec, lrec, lf = utils.precisionRecallF1(subdf_pred_stats.TP.values[0], subdf_pred_stats.FP.values[0], subdf_pred_stats.FN.values[0])
pprec, prec, pf = utils.precisionRecallF1(subdf_pred_stats.TP.values[1], subdf_pred_stats.FP.values[1], subdf_pred_stats.FN.values[1])
cprec, crec, cf = utils.precisionRecallF1(subdf_pred_stats.TP.values[2], subdf_pred_stats.FP.values[2], subdf_pred_stats.FN.values[2])
precision = [lprec, pprec, cprec]
recall = [lrec, prec, crec]
f_1 = [lf, pf, cf]
subdf_pred_stats.insert(len(list(subdf_pred_stats.columns)), "precision", precision)
subdf_pred_stats.insert(len(list(subdf_pred_stats.columns)), "recall", recall)
subdf_pred_stats.insert(len(list(subdf_pred_stats.columns)), "f_1", f_1)
subdf_pred_stats

strict_agreement,FN,FP,TN,TP,precision,recall,f_1
linguistic_count,711,359,150435,1206,0.770607,0.629108,0.692705
personname_count,4370,571,145256,2514,0.814911,0.365195,0.504364
contextual_count,3959,744,147313,695,0.482974,0.149334,0.228131


In [68]:
subdf_pred_stats2 = pd.concat([subdf_pred_ling_stats2.T, subdf_pred_pers_stats2.T, subdf_pred_cont_stats2.T])
subdf_pred_stats2

match_type,category_match,exact_match,mismatch
linguistic_count,1,151641,1069
personname_count,190,147770,4751
contextual_count,153,148008,4550


Save the data:

In [69]:
subdf_pred_stats.to_csv(config.tokc_path+"model_output/strict_agreement_stats_fastText100_crf.csv")
subdf_pred_stats2.to_csv(config.tokc_path+"model_output/match_types_fastText100_crf.csv")

### Loose Evaluation

Conduct a loose evaluation of the model's performance (as the manual annotation were evaluated), considering any overlapping or envelopping expected and predicted annotations to be matches, in addition to exactly-matching expected and predicted annotations.

In [70]:
# Get the start and end offsets of the tokens
tok_df = pd.read_csv(config.tokc_path+"desc_sent_ann_token_tag.csv", usecols=["token_id", "sentence_id", "token_offsets"])
tok_df.head()

Unnamed: 0,token_id,sentence_id,token_offsets
0,0,0,"(0, 10)"
1,1,0,"(10, 11)"
2,2,0,"(12, 15)"
3,3,1,"(17, 22)"
4,4,1,"(22, 23)"


In [71]:
# Join the offsets data to the dev data, keeping only the tokens included in the dev data
df_dev_exploded = df_dev_exploded.reset_index()
join_cols = ["sentence_id", "token_id"]
df_pred = df_dev_exploded.join(tok_df.set_index(join_cols), on=join_cols, how="left")
df_pred = df_pred[["sentence_id", "token_id", "token", "token_offsets", "pos", 
                   "tag_cat_linguistic_expected", "tag_cat_linguistic_predicted",
                   "tag_cat_personname_expected", "tag_cat_personname_predicted",
                   "tag_cat_contextual_expected", "tag_cat_contextual_predicted"
                  ]]
df_pred.head()

Unnamed: 0,sentence_id,token_id,token,token_offsets,pos,tag_cat_linguistic_expected,tag_cat_linguistic_predicted,tag_cat_personname_expected,tag_cat_personname_predicted,tag_cat_contextual_expected,tag_cat_contextual_predicted
0,5,154,After,"(907, 912)",IN,O,O,O,O,O,O
1,5,155,his,"(913, 916)",PRP$,B-Linguistic,B-Linguistic,O,O,O,O
2,5,156,ordination,"(917, 927)",NN,O,O,O,O,O,O
3,5,157,he,"(928, 930)",PRP,B-Linguistic,B-Linguistic,O,O,O,O
4,5,158,spent,"(931, 936)",VBD,O,O,O,O,O,O


In [72]:
assert df_pred.isna().values.any() == False, "There should be no NaN values in any of the prediction DataFrame's columns."
assert df_pred.shape[0] == df_dev_exploded.shape[0], "There should be the same number of rows as the dev DataFrame prior to the join."

In [73]:
# Transform the offsets to tuples of ints
offsets = list(df_pred.token_offsets)
offsets = [tuple(((offset_pair[1:-1]).split(", "))) for offset_pair in offsets]
offsets = [(int(start_offset), int(end_offset)) for start_offset,end_offset in offsets]
# print(type(offsets[0]), type(offsets[0][0]), type(offsets[0][1]))
col_i = list(df_pred.columns).index("token_offsets")
df_pred = df_pred.drop(columns=["token_offsets"])
df_pred.insert((col_i-1), "token_offsets", offsets)
# df_pred.head()

Determine how many expected **Linguistic** tags overlap, envelop/fall within, or exactly match predicted Linguistic tags:

In [74]:
# LINGUISTIC
ling_tags = ["B-Linguistic", "I-Linguistic"]
b_ling_exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_linguistic_expected", ling_tags)
b_ling_pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_linguistic_predicted", ling_tags)
tp, fp, fn = utils.looseAgreement(b_ling_exp_tree, b_ling_pred_tree)
print(ling_tags)
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

['B-Linguistic', 'I-Linguistic']
---------------------------------------
TP: 2730 | FP: 314 | FN: 673
---------------------------------------
Precision: 0.8968462549277266
Recall: 0.8022333235380547
F_1 Score: 0.8469055374592834


In [75]:
# B-LINGUISTIC
b_ling_exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_linguistic_expected", [ling_tags[0]])
b_ling_pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_linguistic_predicted", [ling_tags[0]])
tp, fp, fn = utils.looseAgreement(b_ling_exp_tree, b_ling_pred_tree)
print(ling_tags[0])
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

B-Linguistic
---------------------------------------
TP: 2389 | FP: 274 | FN: 467
---------------------------------------
Precision: 0.8971085242208036
Recall: 0.836484593837535
F_1 Score: 0.8657365464758109


In [76]:
# I-LINGUISTIC
b_ling_exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_linguistic_expected", [ling_tags[1]])
b_ling_pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_linguistic_predicted", [ling_tags[1]])
tp, fp, fn = utils.looseAgreement(b_ling_exp_tree, b_ling_pred_tree)
print(ling_tags[1])
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

I-Linguistic
---------------------------------------
TP: 65 | FP: 44 | FN: 212
---------------------------------------
Precision: 0.5963302752293578
Recall: 0.23465703971119134
F_1 Score: 0.3367875647668394


Determine how many expected **Person Name** tags overlap, envelop/fall within, or exactly match predicted Person Name tags:

In [77]:
# PERSON NAME
tags = ["B-Person-Name", "I-Person-Name"]
exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_personname_expected", tags)
pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_personname_predicted", tags)
tp, fp, fn = utils.looseAgreement(exp_tree, pred_tree)
print(tags)
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

['B-Person-Name', 'I-Person-Name']
---------------------------------------
TP: 21722 | FP: 328 | FN: 3982
---------------------------------------
Precision: 0.9851247165532879
Recall: 0.8450824774354186
F_1 Score: 0.9097457804581816


In [78]:
# B-PERSON-NAME
exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_personname_expected", [tags[0]])
pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_personname_predicted", [tags[0]])
tp, fp, fn = utils.looseAgreement(exp_tree, pred_tree)
print(tags[0])
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

B-Person-Name
---------------------------------------
TP: 4792 | FP: 199 | FN: 1594
---------------------------------------
Precision: 0.9601282308154678
Recall: 0.7503914813654871
F_1 Score: 0.8424013360288302


In [79]:
# I-PERSON-NAME
exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_personname_expected", [tags[1]])
pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_personname_predicted", [tags[1]])
tp, fp, fn = utils.looseAgreement(exp_tree, pred_tree)
print(tags[1])
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

I-Person-Name
---------------------------------------
TP: 8120 | FP: 301 | FN: 2712
---------------------------------------
Precision: 0.9642560266001663
Recall: 0.7496307237813885
F_1 Score: 0.8435049083259752


Determine how many expected **Contextual** tags overlap, envelop/fall within, or exactly match predicted Person Name tags:

In [80]:
# CONTEXTUAL
tags = ["B-Contextual", "I-Contextual"]
exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_contextual_expected", tags)
pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_contextual_predicted", tags)
tp, fp, fn = utils.looseAgreement(exp_tree, pred_tree)
print(tags)
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

['B-Contextual', 'I-Contextual']
---------------------------------------
TP: 6238 | FP: 554 | FN: 3671
---------------------------------------
Precision: 0.9184334511189635
Recall: 0.6295287112725805
F_1 Score: 0.7470211364588947


In [81]:
# B-CONTEXTUAL
exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_contextual_expected", [tags[0]])
pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_contextual_predicted", [tags[0]])
tp, fp, fn = utils.looseAgreement(exp_tree, pred_tree)
print(tags[0])
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

B-Contextual
---------------------------------------
TP: 2585 | FP: 407 | FN: 1341
---------------------------------------
Precision: 0.8639705882352942
Recall: 0.6584309730005095
F_1 Score: 0.7473258167100318


In [82]:
# I-CONTEXTUAL
exp_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_contextual_expected", [tags[1]])
pred_tree = utils.createIntervalTree(df_pred, "token_offsets", "tag_cat_contextual_predicted", [tags[1]])
tp, fp, fn = utils.looseAgreement(exp_tree, pred_tree)
print(tags[1])
print("---------------------------------------")
print("TP:",tp, "| FP:",fp, "| FN:",fn)
print("---------------------------------------")
prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
print("Precision:", prec)
print("Recall:", rec)
print("F_1 Score:", f1)

I-Contextual
---------------------------------------
TP: 1046 | FP: 207 | FN: 2586
---------------------------------------
Precision: 0.8347964884277733
Recall: 0.2879955947136564
F_1 Score: 0.4282497441146366


Export summary stats of loose (where overlap, envelop, and exact match all count as correct) agreement scores: