# Experiment 3, Model 1

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multiclass Person Name + Occupation Sequence Classifier

2. Multilabel Stereotype + Omission Document Classifier

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment3/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)

***

**Table of Contents**

[1.](#1) Person Name + Occupation Sequence Classifier
* [Preprocessing](#prep)
* [Training & Prediction](#tp)
* [Evaluation](#eval)

Load resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For multilabel token classification
import sklearn.metrics
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# For multiclass sequence classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

Define resources for the models:

In [2]:
Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
# Path(config.experiment1_output_path).mkdir(parents=True, exist_ok=True)  # For predictions
# Path(config.experiment1_agmt_path).mkdir(parents=True, exist_ok=True)    # For agreement metrics

predictions_dir = config.experiment3_path+"5fold/output/"     # For predictions
Path(predictions_dir).mkdir(parents=True, exist_ok=True)
agreement_dir = config.experiment3_path+"5fold/agreement/"    # For agreement metrics
Path(agreement_dir).mkdir(parents=True, exist_ok=True)

In [3]:
# Model 2:
pers_o_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]
# Model 2.1:
# pers_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine"]

In [4]:
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
     "Occupation": ["B-Occupation", "I-Occupation"]
    }

In [5]:
d = 100  # dimensions of word embeddings (should match utils1.py)
target_labels = "pers_o"  # for file names
# ---------------------
### Model 2.2: binary classification with Person-Name vs. O
# target_labels = "pers"

<a id="1"></a>
## 1. Person Name + Occupation Labels

Train a multiclass sequence classifier, using Conditional Random Field with Adaptive Regularization of Weight Vectors (AROW), on the Person Name and Occupation labels.

Multiclass is a suitable setup for these labels because they are mutually exclusive: no token should have more than one of them.  In past algorithm experiments with sequence classifiers for the Person Name and Occupation labels, the sequence classifier with AROW performed best.

The devtest data subset from the model in step 1 will be the train data subset in this step, with the predicted Linguistic labels as features passed into this second model.  The train data subset from the first model will be the devtest data subset for this second model.

In [6]:
### For 40-40-20 data split
# train_df = pd.read_csv(config.tokc_path+"experiment_input/token_validate.csv", index_col=0)
# dev_df =  pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", index_col=0)
# perso_train, perso_dev = utils.selectDataForLabels(train_df, dev_df, "tag", pers_o_label_subset)
# print(perso_train.shape, perso_dev.shape)
# ------------------------
### For this experiment, we'll repeatedly train models on different 80% selections of 
### data and predict on the remaining 20% split, for a modified 5-fold cross-validation approach.
perso_df = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
# Make sure only Person Name and Occupation tags are considered
perso_df = utils1.selectDataForLabels(perso_df, "tag", pers_o_label_subset) #pers_label_subset)
perso_df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,fold
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier,split4
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier,split4
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier,split4
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,split2
4,1,1,99999,4,:,"(22, 23)",:,O,Title,split2


In [7]:
print(perso_df.tag.unique())  # Looks good
# -----------------------------
# # Model 2.2: for every Person Name BIO tag, replace it with the category name (simply, Person Name)
# tags = list(perso_df["tag"])
# new_tags = [tag if tag == "O" else "Person Name" for tag in tags]
# perso_df["tag"] = new_tags
# print(perso_df.tag.unique())  # Looks good

['O' 'B-Unknown' 'B-Masculine' 'I-Unknown' 'I-Masculine' 'B-Occupation'
 'I-Occupation' 'B-Feminine' 'I-Feminine']


Get the label associated with each annotation for future evaluation:

In [8]:
df_by_ann = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", usecols=["ann_id", "token_id", "tag"])
df_by_ann = df_by_ann.drop_duplicates()
df_by_ann = utils.implodeDataFrame(df_by_ann, ["ann_id"])
tags_col = list(df_by_ann.tag)
labels = [[tag[2:] if tag != "O" else tag for tag in tags] for tags in tags_col]
labels = [label_list[0] for label_list in labels]
df_by_ann.insert(len(df_by_ann.columns), "expected_label", labels)
perso_labels = list(pers_o_label_tags.keys())
df_by_ann = df_by_ann.loc[df_by_ann.expected_label.isin(perso_labels)]
df_by_ann.head()

ann_id,token_id,tag,expected_label
7,"[58341, 58342, 58343, 58344]","[B-Feminine, I-Feminine, I-Feminine, I-Feminine]",Feminine
14,"[19836, 19837, 19838, 19839]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]",Unknown
15,"[28713, 28714]","[B-Unknown, I-Unknown]",Unknown
16,"[28738, 28739, 28740, 28741]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]",Unknown
17,"[28790, 28791, 28792]","[B-Unknown, I-Unknown, I-Unknown]",Unknown
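
The helper `utils.implodeDataFrame` is not shown in this notebook. Judging only from the output above, it appears to group rows by the given key column(s) and collect the remaining columns into lists. The sketch below is a plain-Python analogue of that idea, not the project's implementation; `implode` is a hypothetical name, and the sample rows reuse the values shown for `ann_id` 15.

```python
from collections import defaultdict

def implode(rows, key):
    """Group row dicts by `key`, collecting every other column into a list."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col, value in row.items():
            if col != key:
                grouped[row[key]][col].append(value)
    return {k: dict(cols) for k, cols in grouped.items()}

rows = [
    {"ann_id": 15, "token_id": 28713, "tag": "B-Unknown"},
    {"ann_id": 15, "token_id": 28714, "tag": "I-Unknown"},
]
by_ann = implode(rows, "ann_id")

# The annotation's label is its first tag with the two-character BIO prefix removed
expected_label = by_ann[15]["tag"][0][2:]
print(by_ann[15], expected_label)
```

Taking the first tag works here because every tag within one annotation shares the same label name.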


Define the five groups of training and test sets:

In [10]:
split_col = "fold"
splits = perso_df[split_col].unique()
splits.sort()
train0, test0 = list(splits[:4]), splits[4]
train1, test1 = list(splits[1:]), splits[0]
train2, test2 = list(splits[2:])+[splits[0]], splits[1]
train3, test3 = list(splits[3:])+list(splits[:2]), splits[2]
train4, test4 = [splits[4]]+list(splits[:3]), splits[3]
runs = [(train0, test0), (train1, test1), (train2, test2), (train3, test3), (train4, test4)]

Looks good!
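
The five train/test groups can also be generated programmatically rather than sliced by hand. The following is a minimal sketch, assuming the fold names `split0` to `split4` seen in the `fold` column; `make_runs` is a hypothetical helper, and the runs come out in a different order than in the cell above (the order of the training folds does not matter, since they are only used for membership checks).

```python
def make_runs(folds):
    """Pair each fold (as the test split) with the remaining folds (as the train splits)."""
    folds = sorted(folds)
    return [([f for f in folds if f != held_out], held_out) for held_out in folds]

folds = ["split0", "split1", "split2", "split3", "split4"]
runs_sketch = make_runs(folds)

for train_splits, test_split in runs_sketch:
    print(test_split, "<-", train_splits)
```

Each fold is held out exactly once, so every token receives a prediction from a model that never saw it during training.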

<a id="prep"></a>
#### Preprocessing

In [11]:
# train_df = perso_train.drop(columns=["description_id", "ann_id", "token_offsets", "field", "subset", "pos"])
# dev_df = perso_dev.drop(columns=["description_id", "ann_id", "token_offsets", "field", "subset", "pos"])
# ------------------------
df = perso_df.drop(columns=["description_id", "ann_id", "token_offsets", "field", "pos"])

In [12]:
# df_train_token_groups = utils.implodeDataFrame(train_df, ['token_id', 'sentence_id', 'token'])
# df_train_token_groups = df_train_token_groups.reset_index()
# # df_train_token_groups.head()
# df_dev_token_groups = utils.implodeDataFrame(dev_df, ['token_id', 'sentence_id', 'token'])
# df_dev_token_groups = df_dev_token_groups.reset_index()
# # df_dev_token_groups.head()
# df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
# df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
# df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_train_grouped.head()
# ------------------------

In [13]:
df_token_groups = utils.implodeDataFrame(df, ['token_id', 'sentence_id', 'token', 'fold'])
df_token_groups = df_token_groups.reset_index()
# df_token_groups.head(20)

Make sure that each row's tags are not duplicated and that a row with a `B-` or `I-` tag (or category name) doesn't also have an `O` tag.  Sort the tags so that any `B-` tag will be selected as the expected tag for training before an `I-` tag.  Additionally, if multiple labels are present and one is `Unknown`, put the `Unknown` tag first so that it will be selected as the expected tag for training: as the data sample output above illustrates, and as Error Analysis of document classifiers for Person Names showed, people's names should have been annotated as `Unknown` more often than they actually were.

In [14]:
tags = list(df_token_groups["tag"])
new_tags = []
for tag_list in tags:
    unique_tags = list(set(tag_list))
    if (len(unique_tags) > 1) and ("O" in unique_tags):
        unique_tags.remove("O")
    unique_tags.sort()
    # Put any Unknown tags at the start of the list, so they'll be selected
    # as a feature for training over Masculine or Feminine tags
    if len(unique_tags) > 1:
        if "I-Unknown" in unique_tags:
            unique_tags.remove("I-Unknown")
            unique_tags = ["I-Unknown"] + unique_tags
        if "B-Unknown" in unique_tags:
            unique_tags.remove("B-Unknown")
            unique_tags = ["B-Unknown"] + unique_tags
    new_tags += [unique_tags]
df_token_groups["tag"] = new_tags
# df_token_groups.head(20)
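
To check the ordering rules on small inputs, the same logic can be wrapped in a function. This is a sketch that mirrors the cell above; `prioritize_tags` is a hypothetical name and the sample tag lists are made up for illustration.

```python
def prioritize_tags(tag_list):
    """Deduplicate a token's tags, drop a redundant 'O', and move Unknown tags to the front."""
    unique_tags = sorted(set(tag_list))  # "B-..." sorts before "I-...", which sorts before "O"
    if len(unique_tags) > 1 and "O" in unique_tags:
        unique_tags.remove("O")
    # Move I-Unknown first, then B-Unknown, so B-Unknown ends up at the very front
    for unknown_tag in ("I-Unknown", "B-Unknown"):
        if len(unique_tags) > 1 and unknown_tag in unique_tags:
            unique_tags.remove(unknown_tag)
            unique_tags.insert(0, unknown_tag)
    return unique_tags

examples = [
    ["B-Masculine", "B-Unknown", "B-Masculine"],
    ["O", "I-Feminine"],
    ["O", "O"],
    ["I-Masculine", "I-Unknown", "B-Unknown"],
]
for tag_list in examples:
    print(tag_list, "->", prioritize_tags(tag_list))
```

The first element of each returned list is the tag that would serve as the expected tag for training.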

In [15]:
df_grouped = utils.implodeDataFrame(df_token_groups, ['sentence_id', 'fold'])
df_grouped = df_grouped.rename(columns={"token":"sentence"})
df_grouped = df_grouped.reset_index()
df_grouped.head()

Unnamed: 0,sentence_id,fold,token_id,sentence,tag
0,0,split4,"[0, 1, 2]","[Identifier, :, AA5]","[[O], [O], [O]]"
1,1,split2,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[[O], [O], [O], [O], [B-Unknown, B-Masculine],..."
2,2,split1,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
3,3,split2,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[[O], [O], [O], [O], [B-Masculine], [I-Masculi..."
4,4,split4,"[134, 135, 136, 137, 138, 139, 140, 141, 142, ...","[He, was, educated, at, Daniel, Stewart, 's, C...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."


Zip the linguistic label and BIO tags together with the tokens so each sentence item is a tuple: `(TOKEN, TAG_LIST)`

In [14]:
# df_train_grouped = df_train_grouped.reset_index()
# df_dev_grouped = df_dev_grouped.reset_index()
# train_sentences_pers = utils1.zip1Feature1AndTarget(df_train_grouped, "tag")  # Dev because not using additional feature col for linguistic labels
# print(train_sentences_pers[2][:3])
# dev_sentences_pers = utils1.zip1FeatureAndTarget(df_dev_grouped, "tag")
# print(dev_sentences_pers[0][:3])

In [15]:
# train_sentences = train_sentences_pers
# dev_sentences = dev_sentences_pers

In [16]:
# # Features
# X_train = [utils1.extractSentenceFeatures(sentence) for sentence in train_sentences]
# X_dev = [utils1.extractSentenceFeatures(sentence) for sentence in dev_sentences]
# # Target
# y_train = [utils1.extractSentenceTargets(sentence) for sentence in train_sentences]
# y_dev = [utils1.extractSentenceTargets(sentence) for sentence in dev_sentences]
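
The `utils1` helpers for zipping and feature extraction are not shown in this notebook. The sketch below illustrates the general shape such helpers might take, given that `sklearn_crfsuite.CRF.fit` accepts each sentence as a list of per-token feature dicts with a parallel list of tag strings. All function names and feature names here are hypothetical, and the project's features presumably also include the fastText embedding values, which this sketch omits.

```python
def zip_tokens_and_tags(tokens, tag_lists):
    """Pair each token with its list of tags: [(TOKEN, TAG_LIST), ...]."""
    return list(zip(tokens, tag_lists))

def token_features(sentence, i):
    """Build an illustrative feature dict for the token at position i."""
    token = sentence[i][0]
    features = {
        "bias": 1.0,
        "token.lower": token.lower(),
        "token.istitle": token.istitle(),
        "token.isdigit": token.isdigit(),
    }
    if i > 0:
        features["prev.lower"] = sentence[i - 1][0].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1][0].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

def sentence_features(sentence):
    """One feature dict per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]

def sentence_targets(sentence):
    """Use the first tag in each token's tag list as the single training target."""
    return [tag_list[0] for _, tag_list in sentence]

sentence = zip_tokens_and_tags(["Identifier", ":", "AA5"], [["O"], ["O"], ["O"]])
X = sentence_features(sentence)
y = sentence_targets(sentence)
print(X[0], y)
```

Selecting the first tag as the target is why the tag lists were sorted during preprocessing.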

<a id="tp"></a>
#### Training & Prediction

Train a Conditional Random Field (CRF) model with the AROW algorithm on the **Person Name** and **Occupation** tags.  We'll set the variance to 0.5 and increase the max iterations to 100 for this model.

In [16]:
a = "arow"

In [18]:
# clf_pers = sklearn_crfsuite.CRF(algorithm=a, variance=0.5, max_iterations=100, all_possible_transitions=True)
# # https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
# try:
#     clf_pers.fit(X_train, y_train)
# except AttributeError:
#     pass

In [18]:
pred_df = pd.DataFrame()

# Specify the run one at a time (with for loop, kernel dies; also, 
# crf_suite for sklearn's models will keep learning from previous runs if not restarted)
run = runs[4]  # 0, 1, 2, 3

# Get the train (80%) and test (20%) subsets of data
train_splits, test_split = run[0], run[1]
print("Training on:", train_splits)
train_df = df_grouped.loc[df_grouped[split_col].isin(train_splits)]
dev_df = df_grouped.loc[df_grouped[split_col] == test_split]

# Zip feature and target columns together so each 
# sentence item is a tuple: `(TOKEN, TAG_LIST)`
train_sentences = utils1.zip1FeatureAndTarget(train_df, "tag")
dev_sentences = utils1.zip1FeatureAndTarget(dev_df, "tag")
# Extract features
X_train = [utils1.extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [utils1.extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Extract targets
y_train = [utils1.extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [utils1.extractSentenceTargets(sentence) for sentence in dev_sentences]

# Train a classification model
clf_pers = sklearn_crfsuite.CRF(algorithm=a, variance=0.5, max_iterations=100, all_possible_transitions=True)
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

# Predict with the trained model
print("Predicting on:", test_split)
predictions = clf_pers.predict(X_dev)
dev_df = dev_df.rename(columns={"tag":"tag_{}_expected".format(target_labels)})
dev_df.insert(len(dev_df.columns), "tag_{}_predicted".format(target_labels), predictions)
dev_df = dev_df.set_index(["sentence_id", "fold"])
dev_df_exploded = dev_df.explode(list(dev_df.columns))

if pred_df.shape[0] > 0:
    pred_df = pd.concat([pred_df, dev_df_exploded])
else:
    pred_df = dev_df_exploded

assert pred_df.loc[pred_df["tag_{}_predicted".format(target_labels)].isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"

filename = "crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_{s}.csv".format(a=a, t=target_labels, d=d, s=test_split)
pred_df.to_csv(predictions_dir+filename)

print("Predictions for {} saved!".format(test_split))

Combine the prediction data:

In [19]:
# Read each split's predictions and concatenate them into one DataFrame
pred_filename = "crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_split{s}.csv"
pred_perso = pd.concat([
    pd.read_csv(predictions_dir+pred_filename.format(a=a, t=target_labels, d=d, s=s), index_col=0)
    for s in range(5)
])
print(pred_perso.shape)

(753521, 5)


In [20]:
pred_perso = pred_perso.reset_index()
pred_perso = utils.getColumnValuesAsLists(pred_perso, "tag_{}_expected".format(target_labels))
pred_perso.head()

Unnamed: 0,sentence_id,fold,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,8,split0,233,James,[B-Masculine],B-Unknown
1,8,split0,234,Whyte,[I-Masculine],I-Unknown
2,8,split0,235,was,[O],O
3,8,split0,236,called,[O],O
4,8,split0,237,upon,[O],O


Remove the `'O'` tag from the list of targets: we are interested in how well the model applies the tags for gendered and gender-biased language, and the `'O'` tags far outnumber them.

In [22]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['B-Unknown', 'I-Unknown', 'B-Masculine', 'I-Masculine', 'B-Occupation', 'I-Occupation', 'B-Feminine', 'I-Feminine']


In [19]:
# y_pred = clf_pers.predict(X_dev)

#### Evaluate: All Labels

In [20]:
# print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
# print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
# print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.42825014312210974
  - Prec: 0.4496106745338277
  - Rec 0.41771448419590135


Save the prediction data:

In [28]:
# df_dev_grouped = df_dev_grouped.rename(columns={"tag":"tag_pers_o_expected"})
# df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_pers_o_predicted", y_pred)
# # df_dev_grouped.head()
# df_dev_grouped = df_dev_grouped.set_index("sentence_id")
# df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
# df_dev_exploded.head()

In [2]:
# filename = "crf_{a}_pers_o_baseline_fastText{d}_nolingfeatures_predictions.csv".format(a=a, d=d)
# df_dev_exploded.to_csv(config.experiment1_output_path+filename)

<a id="eval"></a>
### Evaluation
#### Evaluate: Strict, Each Label

The built-in evaluation approach is strict, so predicted labels are deemed correct only when they fall on text spans that exactly match the labeled text spans in the development data.

In [24]:
# a = "arow"
# category = "pers_o"
# filename = "crf_{a}_{c}_baseline_fastText{d}_nolingfeatures_predictions.csv".format(a=a, c=category, d=d)
# pred_perso = pd.read_csv(config.experiment1_output_path+filename)
# pred_perso = utils.getColumnValuesAsLists(pred_perso, "tag_{}_expected".format(category))
# # pred_pers.head()

Calculate performance metrics for each category of labels:

In [21]:
category = target_labels

In [22]:
pred_perso = utils.isPredictedInExpected(pred_perso, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
pred_perso.head()

Unnamed: 0,sentence_id,fold,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted,_merge
0,8,split0,233,James,[B-Masculine],B-Unknown,false positive
1,8,split0,234,Whyte,[I-Masculine],I-Unknown,false positive
2,8,split0,235,was,[O],O,true negative
3,8,split0,236,called,[O],O,true negative
4,8,split0,237,upon,[O],O,true negative


Save the combined data:

In [25]:
filename = "crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_evaluation.csv".format(a=a, t=target_labels, d=d)
pred_perso.to_csv(predictions_dir+filename)

In [26]:
pred_perso_stats = utils.getScoresByCatTags(
    pred_perso, "_merge", pers_o_label_subset[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(pers_o_label_subset)):
    tag_stats = utils.getScoresByCatTags(
        pred_perso, "_merge", pers_o_label_subset[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_perso_stats = pd.concat([pred_perso_stats, tag_stats])
# ----------------------
# Model 2.2:
# pred_perso_stats = utils.getScoresByCatTags(
#     pred_perso, "_merge", "Person Name", "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
# )
# ----------------------
pred_perso_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,B-Unknown,2485,3361,5446,0.618372,0.686673,0.650735
0,I-Unknown,4215,4940,8649,0.636471,0.672341,0.653914
0,B-Feminine,192,330,664,0.668008,0.775701,0.717838
0,I-Feminine,466,646,1485,0.696856,0.761148,0.727585
0,B-Masculine,892,1181,1648,0.582538,0.648819,0.613895
0,I-Masculine,1400,2000,2076,0.509323,0.597238,0.549788
0,B-Occupation,1274,836,1562,0.651376,0.550776,0.596867
0,I-Occupation,1843,942,1456,0.607173,0.441346,0.511146
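
The precision, recall, and F1 columns follow from the three count columns. The sketch below uses the standard definitions and checks them against the `B-Unknown` row above; `precision_recall_f1` is a hypothetical stand-in, and it is an assumption that `utils.precisionRecallF1` and `utils.getScoresByCatTags` compute the scores the same way.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the B-Unknown row of the table above
p, r, f = precision_recall_f1(tp=5446, fp=3361, fn=2485)
print(round(p, 6), round(r, 6), round(f, 6))
```

The results agree with the `B-Unknown` row to the precision shown. True negatives play no part in any of the three scores.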


Save the statistics:

In [27]:
# pred_perso_stats.to_csv(
#     config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_strict_agmt.csv".format(a=a, c=category, d=d)
# )
pred_perso_stats.to_csv(agreement_dir+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_strict_agmt.csv".format(a=a, c=category, d=d))

### Annotation Agreement

Calculate agreement at the annotation level, so if the model labels any word correctly from a manually annotated text span, that annotation is recorded as being correctly labeled (`true positive`).  Note whether the models' labels are an `exact_match`, `label_match`, `category_match` or `mismatch`.

*Note: `ann_id` of `99999` indicates no annotation*

Group the annotation data by token:

In [36]:
df_by_ann = df_by_ann.explode(["token_id", "tag"])
df_by_ann = df_by_ann.rename(columns={"tag": "expected_tag"})
df_by_ann = df_by_ann.reset_index()
df_by_ann.head()

Unnamed: 0,ann_id,token_id,expected_tag,expected_label
0,7,58341,B-Feminine,Feminine
1,7,58342,I-Feminine,Feminine
2,7,58343,I-Feminine,Feminine
3,7,58344,I-Feminine,Feminine
4,14,19836,B-Unknown,Unknown


Align the columns of the dev and prediction DataFrames:

In [30]:
# Rename `sentence` column `token`
pred_perso = pred_perso.rename(columns={"sentence":"token"})
# pred_perso.head()

Join the data, adding the annotation IDs (`ann_id` column) to the prediction DataFrames:

In [37]:
index_list = ["token_id"]

In [67]:
pred_perso_ann = pred_perso.join(df_by_ann.set_index(index_list), on=index_list, how="left")
pred_perso_ann = pred_perso_ann.drop(columns=["tag_{}_expected".format(category)])  # duplicate of expected_tag
pred_perso_ann = pred_perso_ann.rename(columns={"expected_tag":"tag_{}_expected".format(category)})
pred_perso_ann["ann_id"] = pred_perso_ann["ann_id"].fillna(99999)
pred_perso_ann["tag_pers_o_expected"] = pred_perso_ann["tag_pers_o_expected"].fillna("O")
pred_perso_ann["expected_label"] = pred_perso_ann["expected_label"].fillna("O")
assert pred_perso_ann.loc[pred_perso_ann["token_id"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["ann_id"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["tag_pers_o_predicted"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["tag_pers_o_expected"].isna()].shape[0] == 0
# pred_perso_ann.head()
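
The left join plus `fillna` calls above amount to a lookup with defaults: tokens that belong to an annotation receive its ID, tag, and label, and all other tokens receive the placeholder values. The sketch below is a simplified plain-Python illustration; the names are hypothetical, and it assumes at most one annotation per token, which the real data does not guarantee.

```python
NO_ANNOTATION = 99999  # placeholder ann_id used for unannotated tokens

# token_id -> (ann_id, expected_tag, expected_label), using rows from the annotation table above
annotations = {
    58341: (7, "B-Feminine", "Feminine"),
    58342: (7, "I-Feminine", "Feminine"),
}

def attach_annotation(token_id):
    """Look up a token's annotation, falling back to the 'no annotation' defaults."""
    ann_id, tag, label = annotations.get(token_id, (NO_ANNOTATION, "O", "O"))
    return {"token_id": token_id, "ann_id": ann_id, "tag_expected": tag, "expected_label": label}

joined = [attach_annotation(token_id) for token_id in (58341, 235)]
for row in joined:
    print(row)
```

Token `235` has no annotation, so it falls back to `ann_id` `99999` and the `O` label, as in the cell above.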

Explode the DataFrame:

In [68]:
pred_perso_ann = pred_perso_ann.explode(["tag_pers_o_expected"])

Generalize the predicted BIO tags to label names:

In [69]:
# Get the predicted labels
pred_labels = list(pred_perso_ann["tag_{}_predicted".format(category)])
pred_labels = [label if label == "O" else label[2:] for label in pred_labels]
pred_perso_ann.insert(len(pred_perso_ann.columns), "label_{}_predicted".format(category), pred_labels)
# pred_perso_ann.head()

Group the data by annotation:

In [70]:
pred_perso_ann = pred_perso_ann.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_perso_ann = utils.implodeDataFrame(pred_perso_ann, ["ann_id", "expected_label"])
pred_perso_ann = pred_perso_ann.reset_index()
pred_perso_ann.head()

Unnamed: 0,ann_id,expected_label,sentence_id,fold,token_id,sentence,_merge,label_pers_o_predicted
0,7.0,Feminine,"[2590, 2590, 2590, 2590]","[split2, split2, split2, split2]","[58341, 58342, 58343, 58344]","[Mrs, Norman, Macleod, ,]","[true positive, true positive, true positive, ...","[Feminine, Feminine, Feminine, Feminine]"
1,14.0,Unknown,"[1097, 1097, 1097, 1097]","[split4, split4, split4, split4]","[19836, 19837, 19838, 19839]","[Dr., Nelly, Renee, Deme]","[false positive, false positive, false positiv...","[Feminine, Feminine, Feminine, Feminine]"
2,15.0,Unknown,"[1485, 1485]","[split4, split4]","[28713, 28714]","[Marjory, Kennedy-Fraser]","[true positive, true positive]","[Feminine, Feminine]"
3,16.0,Unknown,"[1486, 1486, 1486, 1486]","[split3, split3, split3, split3]","[28738, 28739, 28740, 28741]","[Marjory, Kennedy, Fraser, ,]","[true positive, true positive, true positive, ...","[Feminine, Feminine, Feminine, Feminine]"
4,17.0,Unknown,"[1487, 1487, 1487]","[split0, split0, split0]","[28790, 28791, 28792]","[Marjory, Kennedy-Fraser, ,]","[true positive, true positive, true positive]","[Feminine, Feminine, Feminine]"


Separate out the expected unannotated data (`ann_id` of `99999`) from the expected annotated data:

In [71]:
pred_perso_ann = pred_perso_ann.rename(columns={"sentence":"token", "expected_label":"label_{}_expected".format(category)})
fp_df = pred_perso_ann.loc[pred_perso_ann.ann_id == 99999]
fp_df = fp_df.explode(["token_id", "sentence_id", "fold", "token", "_merge", "label_{}_predicted".format(category)])
print(fp_df.shape)
fp_df.head()

(712167, 8)


Unnamed: 0,ann_id,label_pers_o_expected,sentence_id,fold,token_id,token,_merge,label_pers_o_predicted
20710,99999.0,O,8,split0,235,was,true negative,O
20710,99999.0,O,8,split0,236,called,true negative,O
20710,99999.0,O,8,split0,237,upon,true negative,O
20710,99999.0,O,8,split0,238,to,true negative,O
20710,99999.0,O,8,split0,239,preach,true negative,O


In [72]:
pred_perso_ann = pred_perso_ann.loc[pred_perso_ann.ann_id != 99999]

In [73]:
eval_by_ann = pd.concat([pred_perso_ann, fp_df])

Get unique lists of predicted labels per row, removing `O` values from lists that also contain a Person Name or Occupation label:

In [74]:
def getUniqueLabels(label_list):
    final_labels = []
    for labels in label_list:
        if (len(labels) > 1) and ("O" in labels):
            labels.remove("O")
        final_labels += [labels]
    assert len(final_labels) == len(label_list), "There should be the same number of sub-lists in final_labels and label_list."
    return final_labels

In [75]:
predicted_labels = list(eval_by_ann["label_{}_predicted".format(category)])
predicted_labels = [list(set(predictions)) for predictions in predicted_labels]
predicted_unique = getUniqueLabels(predicted_labels)
# predicted_unique[:5]  # Looks good
# for pred in predicted_unique:
#     if len(pred) > 1:
#         print("Multi-item lists exist")
#         break
eval_by_ann["label_{}_predicted".format(category)] = predicted_unique
# Reorder columns:
eval_by_ann = eval_by_ann[
    ["ann_id", "sentence_id", "fold", "token_id", "token", "_merge", "label_{}_expected".format(category), "label_{}_predicted".format(category)]
]
eval_by_ann.head()

Unnamed: 0,ann_id,sentence_id,fold,token_id,token,_merge,label_pers_o_expected,label_pers_o_predicted
0,7.0,"[2590, 2590, 2590, 2590]","[split2, split2, split2, split2]","[58341, 58342, 58343, 58344]","[Mrs, Norman, Macleod, ,]","[true positive, true positive, true positive, ...",Feminine,[Feminine]
1,14.0,"[1097, 1097, 1097, 1097]","[split4, split4, split4, split4]","[19836, 19837, 19838, 19839]","[Dr., Nelly, Renee, Deme]","[false positive, false positive, false positiv...",Unknown,[Feminine]
2,15.0,"[1485, 1485]","[split4, split4]","[28713, 28714]","[Marjory, Kennedy-Fraser]","[true positive, true positive]",Unknown,[Feminine]
3,16.0,"[1486, 1486, 1486, 1486]","[split3, split3, split3, split3]","[28738, 28739, 28740, 28741]","[Marjory, Kennedy, Fraser, ,]","[true positive, true positive, true positive, ...",Unknown,[Feminine]
4,17.0,"[1487, 1487, 1487]","[split0, split0, split0]","[28790, 28791, 28792]","[Marjory, Kennedy-Fraser, ,]","[true positive, true positive, true positive]",Unknown,[Feminine]


Record the agreements and disagreements:

In [76]:
expected_labels = list(eval_by_ann["label_{}_expected".format(category)])
predicted_labels = list(eval_by_ann["label_{}_predicted".format(category)])
ann_agmts = []
perso_labels = list(pers_o_label_tags.keys())
for i,exp in enumerate(expected_labels):
    pred = predicted_labels[i]
    if (exp == "O"):
        # If `O` was predicted as expected
        if 'O' in pred:
            ann_agmts += ["true negative"]
        # If a label was predicted when `O` was expected
        else:
            ann_agmts += ["false positive"]
    else:
        # If the correct label was predicted
        if exp in pred:
            ann_agmts += ["true positive"]
        # If there's a label mismatch
        elif (exp != "O") and (not "O" in pred) and (not exp in pred):
            ann_agmts += ["false positive"]
        # If `O` was predicted when there was an expected label
        else:
            ann_agmts += ["false negative"]
assert len(ann_agmts) == eval_by_ann.shape[0]

# Insert the annotation agreement column recording TP, FP, TN, and FN
eval_by_ann.insert(len(eval_by_ann.columns), "annotation_agreement", ann_agmts)
eval_by_ann.head()

Unnamed: 0,ann_id,sentence_id,fold,token_id,token,_merge,label_pers_o_expected,label_pers_o_predicted,annotation_agreement
0,7.0,"[2590, 2590, 2590, 2590]","[split2, split2, split2, split2]","[58341, 58342, 58343, 58344]","[Mrs, Norman, Macleod, ,]","[true positive, true positive, true positive, ...",Feminine,[Feminine],true positive
1,14.0,"[1097, 1097, 1097, 1097]","[split4, split4, split4, split4]","[19836, 19837, 19838, 19839]","[Dr., Nelly, Renee, Deme]","[false positive, false positive, false positiv...",Unknown,[Feminine],false positive
2,15.0,"[1485, 1485]","[split4, split4]","[28713, 28714]","[Marjory, Kennedy-Fraser]","[true positive, true positive]",Unknown,[Feminine],false positive
3,16.0,"[1486, 1486, 1486, 1486]","[split3, split3, split3, split3]","[28738, 28739, 28740, 28741]","[Marjory, Kennedy, Fraser, ,]","[true positive, true positive, true positive, ...",Unknown,[Feminine],false positive
4,17.0,"[1487, 1487, 1487]","[split0, split0, split0]","[28790, 28791, 28792]","[Marjory, Kennedy-Fraser, ,]","[true positive, true positive, true positive]",Unknown,[Feminine],false positive
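
The decision rules in the loop above can be condensed into one function for spot-checking individual rows. This is a sketch with the same branching; `annotation_agreement` is a hypothetical name.

```python
def annotation_agreement(expected, predicted):
    """Classify one row given its expected label and its list of unique predicted labels."""
    if expected == "O":
        # Unannotated text: correct only if `O` was predicted
        return "true negative" if "O" in predicted else "false positive"
    if expected in predicted:
        return "true positive"
    if "O" not in predicted:
        return "false positive"  # a label was predicted, but not the expected one
    return "false negative"      # `O` was predicted where a label was expected

checks = [
    ("Feminine", ["Feminine"]),
    ("Unknown", ["Feminine"]),
    ("Occupation", ["O"]),
    ("O", ["O"]),
    ("O", ["Masculine"]),
]
for expected, predicted in checks:
    print(expected, predicted, "->", annotation_agreement(expected, predicted))
```

Note that a label mismatch, such as `Unknown` expected and `Feminine` predicted, is recorded as a false positive rather than a false negative, which matches rows 1 to 4 in the output above.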


Save the data:

In [77]:
eval_by_ann.to_csv(predictions_dir+"cc-{a}_{c}_baseline_fastText{d}_annot_evaluation.csv".format(a=a, c=category, d=d))

Calculate annotation agreement metrics for each label:

In [78]:
annot_agmt = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [79]:
for label in perso_labels:
    agmt_df = eval_by_ann.loc[eval_by_ann["label_{}_expected".format(category)] == label]
    tp = agmt_df.loc[agmt_df.annotation_agreement == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df.annotation_agreement == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df.annotation_agreement == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    annot_agmt = pd.concat([annot_agmt, label_agmt])
annot_agmt

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Unknown,2345.0,1326.0,6840.0,0.837619,0.744692,0.788427
0,Feminine,161.0,520.0,974.0,0.651941,0.85815,0.740966
0,Masculine,780.0,2805.0,2001.0,0.416355,0.719525,0.527481
0,Occupation,1147.0,73.0,1738.0,0.959691,0.602426,0.740204


Save the metrics and annotation-level data:

In [80]:
# metrics_perso.to_csv(
#     config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_annot_agmt.csv".format(a=a, d=d, c=category)
# )
annot_agmt.to_csv(agreement_dir+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_annot_agmt.csv".format(a=a, d=d, c=category))

### Loose Evaluation

As with the manual annotation evaluation, we want to evaluate the predictions more loosely, considering overlapping text spans in addition to exactly matching text spans.

#### Token Agreement

First, generalize each token's BIO tags to label names, then calculate agreement scores for each label.

In [40]:
pred_perso_labels = pred_perso.copy()
pred_perso_labels = pred_perso_labels.drop(columns=["_merge"])
tag_exp = list(pred_perso_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_perso_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_perso_labels = pred_perso_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_perso_labels.insert(len(pred_perso_labels.columns), "label_{}_expected".format(category), label_exp)
pred_perso_labels.insert(len(pred_perso_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_pers_labels.loc[pred_pers_labels.label_personname_predicted == "Feminine"].head()  # Looks good
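
The difference between strict and loose token agreement can be seen on a single token. The sketch below is illustrative only; `to_label` and `matches` are hypothetical names and not functions from `utils`.

```python
def to_label(tag):
    """Drop the two-character BIO prefix, leaving 'O' unchanged."""
    return tag if tag == "O" else tag[2:]

def matches(expected_tags, predicted_tag, loose=False):
    """Strict: the exact BIO tag must be expected. Loose: only the label name must match."""
    if loose:
        return to_label(predicted_tag) in [to_label(tag) for tag in expected_tags]
    return predicted_tag in expected_tags

# A boundary error: the right label with the wrong B-/I- prefix
print(matches(["B-Masculine"], "I-Masculine"))              # strict comparison
print(matches(["B-Masculine"], "I-Masculine", loose=True))  # loose comparison

# A label error stays an error under both comparisons
print(matches(["B-Masculine"], "B-Unknown", loose=True))
```

Loose agreement forgives errors at span boundaries but not confusion between labels, so a token such as `James` (expected `Masculine`, predicted `Unknown`) still counts against the model.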

Calculate the agreement metrics at the label level for each token:

In [41]:
tags = ['Unknown', 'Feminine', 'Masculine', 'Occupation']
pred_perso_labels = utils.isPredictedInExpected(pred_perso_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')

pred_perso_stats = utils.getScoresByCatTags(
    pred_perso_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_perso_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_perso_stats = pd.concat([pred_perso_stats, tag_stats])
pred_perso_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Unknown,6683,7489,14907,0.66561,0.690459,0.677807
0,Feminine,651,856,2269,0.72608,0.777055,0.750703
0,Masculine,2276,2986,3919,0.56756,0.632607,0.598321
0,Occupation,3117,1598,3198,0.666806,0.506413,0.575646


Combine and save the performance measures:

In [42]:
# pred_perso_stats.to_csv(
#     config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_loose_agmt.csv".format(a=a, d=d, c=category)
# )
pred_perso_stats.to_csv(agreement_dir+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_loose_agmt.csv".format(a=a, d=d, c=category))

In [43]:
pred_perso_labels.head()

Unnamed: 0,sentence_id,fold,token_id,token,label_pers_o_expected,label_pers_o_predicted,_merge
0,8,split0,233,James,[Masculine],Unknown,false positive
1,8,split0,234,Whyte,[Masculine],Unknown,false positive
2,8,split0,235,was,[O],O,true negative
3,8,split0,236,called,[O],O,true negative
4,8,split0,237,upon,[O],O,true negative


Save the loose predictions (with labels instead of BIO tags):

In [44]:
pred_perso_labels.to_csv(predictions_dir+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_loose_evaluation.csv".format(a=a, d=d, c=category))