# Experiment 3, Model 1

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multiclass Person Name + Occupation Sequence Classifier

2. Multilabel Stereotype + Omission Document Classifier

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment3/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)

***

**Table of Contents**

[I.](#1) Person Name + Occupation Sequence Classifier
* [Preprocessing](#prep)
* [Training & Prediction](#tp)
* [Evaluation](#eval)

Load resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For multilabel token classification
import sklearn.metrics
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# For multiclass sequence classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For saving model
from joblib import dump,load

Define resources for the models:

In [2]:
Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
# Path(config.experiment1_output_path).mkdir(parents=True, exist_ok=True)  # For predictions
# Path(config.experiment1_agmt_path).mkdir(parents=True, exist_ok=True)    # For agreement metrics

predictions_dir = config.experiment3_path+"5fold/output/"     # For predictions
Path(predictions_dir).mkdir(parents=True, exist_ok=True)
agreement_dir = config.experiment3_path+"5fold/agreement/"    # For agreement metrics
Path(agreement_dir).mkdir(parents=True, exist_ok=True)

# predictions_dir = config.experiment3_path+"5fold_withdescid/output/"     # For predictions
# Path(predictions_dir).mkdir(parents=True, exist_ok=True)
# agreement_dir = config.experiment3_path+"5fold_withdescid/agreement/"    # For agreement metrics
# Path(agreement_dir).mkdir(parents=True, exist_ok=True)

In [3]:
# Model 2:
pers_o_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]
# Model 2.1:
# pers_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine"]

In [4]:
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
     "Occupation": ["B-Occupation", "I-Occupation"]
    }

In [5]:
d = 100  # dimensions of word embeddings (should match utils1.py)
target_labels = "pers_o" #_withdescid"  # for file names

<a id="1"></a>
## 1. Person Name + Occupation Labels

Train a multiclass sequence classifier, using Conditional Random Field with Adaptive Regularization of Weight Vectors (AROW), on the Person Name and Occupation labels.

Multiclass is a suitable setup for these labels because they are mutually exclusive (no one token should have more than one of these labels).  The sequence classifier with AROW was the highest performing for past algorithm experiments with sequence classifiers for Person Name and Occupation labels.

The devtest data subset from the model in step 1 will be the train data subset in this step, with the predicted Linguistic labels as features passed into this second model.  The train data subset from the first model will be the devtest data subset for this second model.

In [7]:
### For this experiment, we'll repeatedly train models on different 80% selections of 
### data and predict on the remaining 20% split, for a modified 5-fold cross-validation approach.
perso_df = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
# Make sure only Person Name and Occupation tags are considered
perso_df = utils1.selectDataForLabels(perso_df, "tag", pers_o_label_subset) #pers_label_subset)
perso_df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,fold
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier,split4
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier,split4
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier,split4
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,split2
4,1,1,99999,4,:,"(22, 23)",:,O,Title,split2


In [7]:
print(perso_df.tag.unique())  # Looks good

['O' 'B-Unknown' 'B-Masculine' 'I-Unknown' 'I-Masculine' 'B-Occupation'
 'I-Occupation' 'B-Feminine' 'I-Feminine']


Get the label associated with each annotation for future evaluation:

In [25]:
df_by_ann = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", usecols=["ann_id", "token_id", "tag"])
df_by_ann = df_by_ann.drop_duplicates()
df_by_ann = utils.implodeDataFrame(df_by_ann, ["ann_id"])
tags_col = list(df_by_ann.tag)
labels = [[tag[2:] if tag != "O" else tag for tag in tags] for tags in tags_col]
labels = [label_list[0] for label_list in labels]
df_by_ann.insert(len(df_by_ann.columns), "expected_label", labels)
perso_labels = list(pers_o_label_tags.keys())
df_by_ann = df_by_ann.loc[df_by_ann.expected_label.isin(perso_labels)]
df_by_ann.head()

Unnamed: 0_level_0,token_id,tag,expected_label
ann_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
7,"[58341, 58342, 58343, 58344]","[B-Feminine, I-Feminine, I-Feminine, I-Feminine]",Feminine
14,"[19836, 19837, 19838, 19839]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]",Unknown
15,"[28713, 28714]","[B-Unknown, I-Unknown]",Unknown
16,"[28738, 28739, 28740, 28741]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]",Unknown
17,"[28790, 28791, 28792]","[B-Unknown, I-Unknown, I-Unknown]",Unknown


Define the five groups of training and test sets:

In [9]:
split_col = "fold"
splits = perso_df[split_col].unique()
splits.sort()
train0, test0 = list(splits[:4]), splits[4]
train1, test1 = list(splits[1:]), splits[0]
train2, test2 = list(splits[2:])+[splits[0]], splits[1]
train3, test3 = list(splits[3:])+list(splits[:2]), splits[2]
train4, test4 = [splits[4]]+list(splits[:3]), splits[3]
runs = [(train0, test0), (train1, test1), (train2, test2), (train3, test3), (train4, test4)]

Looks good!
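
The rotation above can also be written generically. The following is a minimal sketch, not the notebook's own code: `make_runs` is a hypothetical helper that pairs each fold, as the test split, with the remaining four folds as training splits. It yields the same five train/test combinations as `runs`, though in a different order.

```python
def make_runs(folds):
    """Pair each fold (as the test split) with all remaining folds (as training splits)."""
    folds = sorted(folds)
    runs = []
    for test_fold in folds:
        train_folds = [fold for fold in folds if fold != test_fold]
        runs.append((train_folds, test_fold))
    return runs


example_folds = ["split0", "split1", "split2", "split3", "split4"]
example_runs = make_runs(example_folds)
for train_folds, test_fold in example_runs:
    print("train:", train_folds, "| test:", test_fold)
```

Because every fold is held out exactly once, the concatenated predictions cover each token once.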

<a id="prep"></a>
#### Preprocessing

In [9]:
df = perso_df.drop(columns=["ann_id", "token_offsets", "field", "pos"])

In [10]:
df_token_groups = utils.implodeDataFrame(df, ['token_id', 'description_id', 'sentence_id', 'token', 'fold'])
df_token_groups = df_token_groups.reset_index()
# df_token_groups.head(20)

Make sure that each row's tags are unique and that a row with a `B-` or `I-` tag (or category name) does not also have an `O` tag. Sort the tags so that any `B-` tag comes before any `I-` tag and is therefore selected as the expected tag for training. Additionally, if a row has multiple labels and one is `Unknown`, put the `Unknown` tag first so that it is selected as the expected tag for training. As the data sample output above illustrates, and as error analysis of document classifiers for Person Names showed, people's names should have been annotated as `Unknown` more often than they actually were.

In [11]:
tags = list(df_token_groups["tag"])
new_tags = []
for tag_list in tags:
    unique_tags = list(set(tag_list))
    if (len(unique_tags) > 1) and ("O" in unique_tags):
        unique_tags.remove("O")
    unique_tags.sort()
    # Put any Unknown tags at the start of the list, so they'll be selected
    # as the expected tag for training over Masculine or Feminine tags
    if len(unique_tags) > 1:
        if "I-Unknown" in unique_tags:
            unique_tags.remove("I-Unknown")
            unique_tags = ["I-Unknown"] + unique_tags
        if "B-Unknown" in unique_tags:
            unique_tags.remove("B-Unknown")
            unique_tags = ["B-Unknown"] + unique_tags
    new_tags += [unique_tags]
df_token_groups["tag"] = new_tags
# df_token_groups.head(20)
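
To make the ordering rules concrete, here is a small standalone sketch that applies the same steps as the cell above to single tag lists. The function name `prioritize_tags` and the example inputs are made up for illustration; the first element of each result is the tag that would be used as the training target.

```python
def prioritize_tags(tag_list):
    """Deduplicate and order one token's tags so the preferred tag comes first."""
    unique_tags = sorted(set(tag_list))  # "B-" sorts before "I-", which sorts before "O"
    # Drop "O" when the token also carries a B- or I- tag
    if len(unique_tags) > 1 and "O" in unique_tags:
        unique_tags.remove("O")
    # Move Unknown tags to the front, ending with B-Unknown first if present
    for unknown_tag in ("I-Unknown", "B-Unknown"):
        if len(unique_tags) > 1 and unknown_tag in unique_tags:
            unique_tags.remove(unknown_tag)
            unique_tags.insert(0, unknown_tag)
    return unique_tags


print(prioritize_tags(["O", "B-Masculine", "B-Unknown"]))
print(prioritize_tags(["I-Feminine", "I-Unknown", "B-Unknown"]))
print(prioritize_tags(["O", "O"]))
```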

In [12]:
df_grouped = utils.implodeDataFrame(df_token_groups, ['sentence_id', 'fold'])
df_grouped = df_grouped.rename(columns={"token":"sentence"})
df_grouped = df_grouped.reset_index()
df_grouped.head()

Unnamed: 0,sentence_id,fold,token_id,description_id,sentence,tag
0,0,split4,"[0, 1, 2]","[0, 0, 0]","[Identifier, :, AA5]","[[O], [O], [O]]"
1,1,split2,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[[O], [O], [O], [O], [B-Unknown, B-Masculine],..."
2,2,split1,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
3,3,split2,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[Biographical, /, Historical, :, Professor, Ja...","[[O], [O], [O], [O], [B-Masculine], [I-Masculi..."
4,4,split4,"[134, 135, 136, 137, 138, 139, 140, 141, 142, ...","[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...","[He, was, educated, at, Daniel, Stewart, 's, C...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
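
The grouping step turns one row per token into one row per sentence, with the tokens and tag lists collected in order. The implementation of `utils.implodeDataFrame` is not shown here, so the following plain-Python sketch only illustrates the idea under that assumption; `group_tokens_by_sentence` is a hypothetical name and the rows are taken from the sample output above.

```python
from collections import defaultdict

# One dict per token, mirroring a few columns of df_token_groups
token_rows = [
    {"sentence_id": 0, "fold": "split4", "token": "Identifier", "tag": ["O"]},
    {"sentence_id": 0, "fold": "split4", "token": ":", "tag": ["O"]},
    {"sentence_id": 0, "fold": "split4", "token": "AA5", "tag": ["O"]},
    {"sentence_id": 1, "fold": "split2", "token": "Title", "tag": ["O"]},
    {"sentence_id": 1, "fold": "split2", "token": ":", "tag": ["O"]},
]


def group_tokens_by_sentence(rows):
    """Collect tokens and tag lists per (sentence_id, fold), preserving token order."""
    grouped = defaultdict(lambda: {"sentence": [], "tag": []})
    for row in rows:
        key = (row["sentence_id"], row["fold"])
        grouped[key]["sentence"].append(row["token"])
        grouped[key]["tag"].append(row["tag"])
    return dict(grouped)


sentences = group_tokens_by_sentence(token_rows)
for key, value in sentences.items():
    print(key, value["sentence"], value["tag"])
```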


In [13]:
df_grouped.shape

(42030, 6)

<a id="tp"></a>
#### Training & Prediction

Train a Conditional Random Field (CRF) model with the AROW algorithm on the **Person Name** and **Occupation** tags.  The model is configured with `variance=1`, `max_iterations=100`, and `all_possible_transitions=True`.

In [19]:
a = "arow"

In [21]:
pred_df = pd.DataFrame()

# Specify one run at a time and restart the kernel between runs: looping over all runs
# kills the kernel, and sklearn-crfsuite models keep learning from previous runs if not restarted
run = runs[4]  # Repeat for runs[0], runs[1], runs[2], and runs[3]

# Get the train (80%) and test (20%) subsets of data
train_splits, test_split = run[0], run[1]
print("Training on:", train_splits)
train_df = df_grouped.loc[df_grouped[split_col].isin(train_splits)]
dev_df = df_grouped.loc[df_grouped[split_col] == test_split]

# Zip feature and target columns together so each 
# sentence item is a tuple: `(TOKEN, TAG_LIST)`
train_sentences = utils1.zip1FeatureAndTarget(train_df, "tag")  #utils1.zip2FeaturesAndTarget(train_df, "tag", "sentence", "description_id")
dev_sentences = utils1.zip1FeatureAndTarget(dev_df, "tag")      #utils1.zip2FeaturesAndTarget(dev_df, "tag", "sentence", "description_id")  #
# Extract features
X_train = [utils1.extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [utils1.extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Extract targets
y_train = [utils1.extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [utils1.extractSentenceTargets(sentence) for sentence in dev_sentences]

# Train a classification model
clf_pers = sklearn_crfsuite.CRF(algorithm=a, variance=1, max_iterations=100, all_possible_transitions=True)
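# Known issue: fit() can raise "AttributeError: 'CRF' object has no attribute 'keep_tempfiles'",
# so the error is caught and execution continues (workaround from the thread below)
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles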
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

# Predict with the trained model
print("Predicting on:", test_split)
predictions = clf_pers.predict(X_dev)
dev_df = dev_df.rename(columns={"tag":"tag_{}_expected".format(target_labels)})
dev_df.insert(len(dev_df.columns), "tag_{}_predicted".format(target_labels), predictions)
dev_df = dev_df.set_index(["sentence_id", "fold"])
dev_df_exploded = dev_df.explode(list(dev_df.columns))

if pred_df.shape[0] > 0:
    pred_df = pd.concat([pred_df, dev_df_exploded])
else:
    pred_df = dev_df_exploded

assert pred_df.loc[pred_df["tag_{}_predicted".format(target_labels)].isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"

filename = "crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_{s}.csv".format(a=a, t=target_labels, d=d, s=test_split)
pred_df.to_csv(predictions_dir+filename)

print("Predictions for {} saved!".format(test_split))

Training on: ['split4', 'split0', 'split1', 'split2']
Predicting on: split3
Predictions for split3 saved!
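
For readers unfamiliar with the input format: sklearn-crfsuite expects each sentence as a list of per-token feature dictionaries, with a parallel list of label strings as the target. The helpers in `utils1` are not shown, so the sketch below is an assumption-laden stand-in rather than the features actually used in this experiment (which also involve the fastText embeddings). The names `token_features` and `sentence_to_crf_input` and the specific features are hypothetical.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (illustrative features only)."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "token.lower": token.lower(),
        "token.istitle": token.istitle(),
        "token.isdigit": token.isdigit(),
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_to_crf_input(sentence):
    """Split a list of (token, tag_list) tuples into feature dicts and target tags."""
    tokens = [token for token, _ in sentence]
    X = [token_features(tokens, i) for i in range(len(tokens))]
    y = [tag_list[0] for _, tag_list in sentence]  # the first tag is the training target
    return X, y


example_sentence = [("James", ["B-Masculine"]), ("Whyte", ["I-Masculine"]), ("was", ["O"])]
X_example, y_example = sentence_to_crf_input(example_sentence)
print(y_example)
print(X_example[0])
```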


Combine the prediction data:

In [6]:
a = "arow"
pred_df0 = pd.read_csv(predictions_dir+"crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_split0.csv".format(a=a, t=target_labels, d=d), index_col=0)
pred_df1 = pd.read_csv(predictions_dir+"crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_split1.csv".format(a=a, t=target_labels, d=d), index_col=0)
pred_df2 = pd.read_csv(predictions_dir+"crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_split2.csv".format(a=a, t=target_labels, d=d), index_col=0)
pred_df3 = pd.read_csv(predictions_dir+"crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_split3.csv".format(a=a, t=target_labels, d=d), index_col=0)
pred_df4 = pd.read_csv(predictions_dir+"crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_predictions_split4.csv".format(a=a, t=target_labels, d=d), index_col=0)
pred_perso = pd.concat([pred_df0, pred_df1, pred_df2, pred_df3, pred_df4])
print(pred_perso.shape)

(753521, 6)


In [7]:
pred_perso = pred_perso.reset_index()
pred_perso = utils.getColumnValuesAsLists(pred_perso, "tag_{}_expected".format(target_labels))
pred_perso.head()

Unnamed: 0,sentence_id,fold,token_id,description_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,8,split0,233,3,James,[B-Masculine],B-Masculine
1,8,split0,234,3,Whyte,[I-Masculine],I-Masculine
2,8,split0,235,3,was,[O],O
3,8,split0,236,3,called,[O],O
4,8,split0,237,3,upon,[O],O


Remove `'O'` tags from the targets list: we are interested in how well the model applies the tags for gendered and gender-biased language, and the `'O'` tags far outnumber those tags.

In [9]:
targets = ["B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Unknown", "I-Unknown", "B-Occupation", "I-Occupation"]
# targets = list(clf_pers.classes_)
# targets.remove('O')
print(targets)

['B-Feminine', 'I-Feminine', 'B-Masculine', 'I-Masculine', 'B-Unknown', 'I-Unknown', 'B-Occupation', 'I-Occupation']


Include only the first value from the expected tags column, as that's what was used in training:

In [10]:
exp_col = "tag_pers_o_expected"
pred_col = "tag_pers_o_predicted"

In [11]:
exp_pred_lists = list(pred_perso[exp_col])
new_exp_pred_col = [exp_pred_list[0] for exp_pred_list in exp_pred_lists]
pred_df = pred_perso.drop(columns=[exp_col])
pred_df.insert(len(pred_df.columns)-1, exp_col, new_exp_pred_col)

In [12]:
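# fillna returns a new Series, so assign the result back to the column
pred_df[exp_col] = pred_df[exp_col].fillna("O")
pred_df[pred_col] = pred_df[pred_col].fillna("O")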
pred_df.head()

Unnamed: 0,sentence_id,fold,token_id,description_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,8,split0,233,3,James,B-Masculine,B-Masculine
1,8,split0,234,3,Whyte,I-Masculine,I-Masculine
2,8,split0,235,3,was,O,O
3,8,split0,236,3,called,O,O
4,8,split0,237,3,upon,O,O


Save the concatenated prediction data:

In [13]:
category = target_labels
pred_df.to_csv(predictions_dir+"crf_{a}_{c}_baseline_fastText{d}_predictions.csv".format(a=a, c=category, d=d))

Save the model:

In [1]:
model_dir = "models/experiment3/"
Path(model_dir).mkdir(parents=True, exist_ok=True)
filename = model_dir+"crf_arow_F-fastText100_T-pers-o.joblib"
dump(clf_pers, filename)

<a id="eval"></a>
### Evaluation
#### Evaluate: Strict, Each Label

The built-in evaluation approach is strict: a predicted label is deemed correct only if it is on a text span that exactly matches the labeled text span in the development data.

Calculate performance metrics for each category of labels:

In [14]:
pred_df = pred_df.rename(columns={"sentence":"token"})
df_pred = pred_df.drop(columns=[exp_col])
df_exp = pred_df.drop(columns=[pred_col])

In [16]:
join_on =  ["description_id", "sentence_id", "token_id", "token", "fold"]
df_agmt = utils.makeEvaluationDataFrame(
    df_exp, 
    df_pred, 
    join_on+[exp_col],
    join_on+[pred_col],
    join_on+[exp_col, pred_col, "_merge"], 
    pred_col,
    exp_col,
    "O"
)
df_agmt.head()

Unnamed: 0,description_id,sentence_id,token_id,token,fold,tag_pers_o_expected,tag_pers_o_predicted,_merge
0,3,8,233,James,split0,B-Masculine,B-Masculine,true positive
1,3,8,234,Whyte,split0,I-Masculine,I-Masculine,true positive
2,3,8,235,was,split0,O,O,true negative
3,3,8,236,called,split0,O,O,true negative
4,3,8,237,upon,split0,O,O,true negative
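
The `_merge` values can be read as the outcome of comparing the expected and predicted tag of each token. Since `utils.makeEvaluationDataFrame` is not shown, the following is only a sketch of one plausible reading: it assumes that a token tagged with the wrong label counts as a false negative for the expected tag and as a false positive for the predicted tag. The function name `classify_token` is hypothetical.

```python
def classify_token(expected, predicted, no_tag="O"):
    """Return the strict agreement outcome(s) for one token's expected and predicted tags."""
    if expected == predicted:
        return ["true negative"] if expected == no_tag else ["true positive"]
    outcomes = []
    if expected != no_tag:
        outcomes.append("false negative")  # the expected tag was missed
    if predicted != no_tag:
        outcomes.append("false positive")  # a tag was predicted that wasn't expected
    return outcomes


for pair in [("B-Masculine", "B-Masculine"), ("O", "O"),
             ("B-Unknown", "O"), ("O", "I-Unknown"), ("B-Unknown", "B-Masculine")]:
    print(pair, classify_token(*pair))
```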


In [19]:
print(df_agmt.shape)
print(df_pred.shape)
print(df_exp.shape)

(781922, 8)
(753521, 6)
(753521, 6)


In [20]:
df_agmt._merge.value_counts()

true negative     724565
true positive      21782
false negative     19572
false positive     16003
Name: _merge, dtype: int64

Save the agreement data:

In [21]:
filename = "crf_{a}_{t}_baseline_fastText{d}_nolingfeatures_strict_evaluation.csv".format(a=a, t=target_labels, d=d)
df_agmt.to_csv(predictions_dir+filename)

Calculate the true positives, false positives, false negatives, precision, recall, and F1 metrics for each tag:

In [22]:
labels_grouped = list(pers_o_label_tags.values())
labels = []
for label_group in labels_grouped:
    for tag in label_group:
        labels += [tag] 
print(labels)

['B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine', 'I-Masculine', 'B-Occupation', 'I-Occupation']


In [23]:
agmt_scores = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })
for label in labels:
    agmt_df = pd.concat([df_agmt.loc[df_agmt[exp_col] == label], df_agmt.loc[df_agmt[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,B-Unknown,4057.0,3360.0,5407.0,0.616745,0.571323,0.593165
0,I-Unknown,6509.0,4864.0,8434.0,0.634231,0.564411,0.597288
0,B-Feminine,372.0,485.0,647.0,0.571555,0.634936,0.601581
0,I-Feminine,874.0,825.0,1385.0,0.626697,0.613103,0.619825
0,B-Masculine,1887.0,1548.0,1166.0,0.429624,0.381919,0.40437
0,I-Masculine,2610.0,2744.0,1662.0,0.377213,0.389045,0.383038
0,B-Occupation,1343.0,935.0,1563.0,0.625701,0.537853,0.57846
0,I-Occupation,1920.0,1242.0,1518.0,0.55,0.441536,0.489835
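
The precision, recall, and F1 columns follow from the counts using the standard definitions. `utils.precisionRecallF1` is not shown, so the function below is a sketch of those definitions rather than the project's helper; returning `0.0` when a denominator is zero is an assumption. Applied to the `B-Unknown` counts, it reproduces that row of the table up to rounding.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts for B-Unknown from the table above
precision, recall, f1 = precision_recall_f1(tp=5407, fp=3360, fn=4057)
print(round(precision, 6), round(recall, 6), round(f1, 6))
```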


Save the statistics:

In [24]:
agmt_scores.to_csv(agreement_dir+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_strict_agmt.csv".format(a=a, c=category, d=d))

#### Evaluate: Annotation Agreement

Calculate agreement at the annotation level, so if the model labels any word correctly from a manually annotated text span, that annotation is recorded as being correctly labeled (`true positive`).

In [26]:
# Generalize tags to labels for annotation agreement
pred_col, exp_col = "predicted_label", "expected_label"
category = "pers_o"
pred_labels = list(pred_df["tag_{}_predicted".format(category)])
pred_labels = [label if label == "O" else label[2:] for label in pred_labels]
pred_df.insert(len(pred_df.columns), pred_col, pred_labels)
print(pred_df.shape)
pred_df.head()

(753521, 8)


Unnamed: 0,sentence_id,fold,token_id,description_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label
0,8,split0,233,3,James,B-Masculine,B-Masculine,Masculine
1,8,split0,234,3,Whyte,I-Masculine,I-Masculine,Masculine
2,8,split0,235,3,was,O,O,O
3,8,split0,236,3,called,O,O,O
4,8,split0,237,3,upon,O,O,O
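
Generalizing a tag to a label just means dropping the `B-` or `I-` prefix while leaving `O` untouched. As a standalone sketch of the list comprehension above (`tag_to_label` is a hypothetical name):

```python
def tag_to_label(tag):
    """Strip the two-character BIO prefix ("B-" or "I-") from a tag; keep "O" as is."""
    return tag if tag == "O" else tag[2:]


example_tags = ["B-Masculine", "I-Masculine", "O", "B-Occupation", "I-Unknown"]
example_labels = [tag_to_label(tag) for tag in example_tags]
print(example_labels)
```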


In [27]:
df_by_ann = df_by_ann.drop(columns=["tag"])
df_by_ann = df_by_ann.explode(["token_id"]).reset_index()
df_by_ann.head(2)

Unnamed: 0,ann_id,token_id,expected_label
0,7,58341,Feminine
1,7,58342,Feminine


In [28]:
eval_df_joined = pred_df.join(df_by_ann.set_index("token_id"), on="token_id", how="outer")
print(eval_df_joined.shape)
eval_df_joined["ann_id"] = eval_df_joined["ann_id"].fillna(99999)
eval_df_joined[exp_col] = eval_df_joined[exp_col].fillna("")
eval_df_joined.head()

(763670, 10)


Unnamed: 0,sentence_id,fold,token_id,description_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label,ann_id,expected_label
0,8,split0,233,3,James,B-Masculine,B-Masculine,Masculine,14387.0,Masculine
1,8,split0,234,3,Whyte,I-Masculine,I-Masculine,Masculine,14387.0,Masculine
2,8,split0,235,3,was,O,O,O,99999.0,
3,8,split0,236,3,called,O,O,O,99999.0,
4,8,split0,237,3,upon,O,O,O,99999.0,


In [29]:
eval_by_ann = utils.implodeDataFrame(eval_df_joined, ["description_id", "sentence_id", "ann_id", "fold", "expected_label"]).reset_index()
print(eval_by_ann.shape)
eval_by_ann.head()

(63277, 10)


Unnamed: 0,description_id,sentence_id,ann_id,fold,expected_label,token_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label
0,0,0,99999.0,split4,,"[0, 1, 2]","[Identifier, :, AA5]","[O, O, O]","[O, O, O]","[O, O, O]"
1,1,1,14384.0,split2,Unknown,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[B-Unknown, I-Unknown, B-Unknown, I-Unknown, I...","[O, I-Unknown, I-Unknown, B-Unknown, I-Unknown...","[O, Unknown, Unknown, Unknown, Unknown, Unknown]"
2,1,1,24275.0,split2,Masculine,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[B-Unknown, I-Unknown, B-Unknown, I-Unknown, I...","[O, I-Unknown, I-Unknown, B-Unknown, I-Unknown...","[O, Unknown, Unknown, Unknown, Unknown, Unknown]"
3,1,1,26233.0,split2,Unknown,"[9, 10, 11, 12]","[Rev, Prof, James, Whyte]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]","[I-Unknown, B-Unknown, I-Unknown, I-Unknown]","[Unknown, Unknown, Unknown, Unknown]"
4,1,1,99999.0,split2,,"[3, 4, 5, 6, 13, 14, 15]","[Title, :, Papers, of, (, 1920-2005, )]","[O, O, O, O, O, O, O]","[O, O, O, O, I-Unknown, I-Unknown, O]","[O, O, O, O, Unknown, Unknown, O]"


In [30]:
pred_label_col = list(eval_by_ann[pred_col])
unique_pred_label_col = [list(set(pred_labels)) for pred_labels in pred_label_col]
new_pred_label_col = []
for pred_labels in unique_pred_label_col:
    if "O" in pred_labels:
        pred_labels.remove("O")
    new_pred_label_col += [pred_labels]
eval_by_ann = eval_by_ann.drop(columns=[pred_col])
eval_by_ann.insert(len(eval_by_ann.columns), pred_col, new_pred_label_col)
eval_by_ann = eval_by_ann.explode([pred_col])
eval_by_ann.head()

Unnamed: 0,description_id,sentence_id,ann_id,fold,expected_label,token_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label
0,0,0,99999.0,split4,,"[0, 1, 2]","[Identifier, :, AA5]","[O, O, O]","[O, O, O]",
1,1,1,14384.0,split2,Unknown,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[B-Unknown, I-Unknown, B-Unknown, I-Unknown, I...","[O, I-Unknown, I-Unknown, B-Unknown, I-Unknown...",Unknown
2,1,1,24275.0,split2,Masculine,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[B-Unknown, I-Unknown, B-Unknown, I-Unknown, I...","[O, I-Unknown, I-Unknown, B-Unknown, I-Unknown...",Unknown
3,1,1,26233.0,split2,Unknown,"[9, 10, 11, 12]","[Rev, Prof, James, Whyte]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]","[I-Unknown, B-Unknown, I-Unknown, I-Unknown]",Unknown
4,1,1,99999.0,split2,,"[3, 4, 5, 6, 13, 14, 15]","[Title, :, Papers, of, (, 1920-2005, )]","[O, O, O, O, O, O, O]","[O, O, O, O, I-Unknown, I-Unknown, O]",Unknown


Compare the predicted labels to the expected label for each annotation.  If a label was predicted when none was expected, agreement is a false positive.  If the correct label was predicted (even if only on part of the annotation), agreement is a true positive.  If no label was predicted when a label was expected, agreement is a false negative.

In [31]:
exp_col = "expected_label"
pred_col = "predicted_label"
df_pred = eval_by_ann.drop(columns=[exp_col, "fold", "token_id", "token", "tag_pers_o_expected", "tag_pers_o_predicted"])
df_exp = eval_by_ann.drop(columns=[pred_col, "fold", "token_id", "token", "tag_pers_o_expected", "tag_pers_o_predicted"])

In [33]:
join_on =  ["description_id", "sentence_id", "ann_id"]
eval_df = utils.makeEvaluationDataFrame(
    df_exp, 
    df_pred, 
    join_on+[exp_col], 
    join_on+[pred_col], 
    ["description_id", "sentence_id", "ann_id", "expected_label", "predicted_label", "_merge"], 
    exp_col, 
    pred_col, 
    "O"
)
eval_df = eval_df.fillna("O")
id_col = "ann_id"
eval_df = eval_df.sort_values(by=[id_col, exp_col, pred_col])
eval_df.head()

Unnamed: 0,description_id,sentence_id,ann_id,expected_label,predicted_label,_merge
6848,1082,2590,7.0,Feminine,Feminine,true positive
65634,855,1097,14.0,O,Feminine,false positive
2709,855,1097,14.0,Unknown,O,false negative
66267,1038,1485,15.0,O,Feminine,false positive
3709,1038,1485,15.0,Unknown,O,false negative
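
The annotation-level rules can be sketched for a single annotation as follows. This is an illustration under assumptions, not the logic of `utils.makeEvaluationDataFrame`: `annotation_agreement` is a hypothetical function, the token-level predictions in the examples are made up, and treating every other predicted label as a false positive is inferred from rows such as `ann_id` 14 above.

```python
def annotation_agreement(expected_label, predicted_token_labels, no_label="O"):
    """Return (expected, predicted, outcome) rows for one annotation."""
    predicted = set(predicted_token_labels) - {no_label}
    rows = []
    if expected_label != no_label:
        if expected_label in predicted:
            # Any token with the correct label counts, even on part of the span
            rows.append((expected_label, expected_label, "true positive"))
        else:
            rows.append((expected_label, no_label, "false negative"))
    # Labels that were predicted but not expected
    for label in sorted(predicted - {expected_label}):
        rows.append((no_label, label, "false positive"))
    return rows


print(annotation_agreement("Unknown", ["Unknown", "Unknown", "O"]))
print(annotation_agreement("Unknown", ["O", "Feminine"]))
print(annotation_agreement("O", ["O", "O"]))
```

In the second example the annotation yields both a false negative (the expected `Unknown` was missed) and a false positive (`Feminine` was predicted instead), matching the pattern in the sample output.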


In [35]:
eval_df = eval_df.drop_duplicates()
print(eval_df.shape)
print(df_pred.shape)
print(df_exp.shape)

(115201, 6)
(63829, 4)
(63829, 4)


Save the data:

In [36]:
eval_df.to_csv(predictions_dir+"crf-{a}_{c}_baseline_fastText{d}_annot_evaluation.csv".format(a=a, c=category, d=d))

Calculate annotation agreement metrics for each label:

In [37]:
agmt_scores = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [39]:
labels = ['Feminine', 'Masculine', 'Unknown', 'Occupation']
for label in labels:
    agmt_df = pd.concat([eval_df.loc[eval_df[exp_col] == label], eval_df.loc[eval_df[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Feminine,602.0,627.0,1097.0,0.636311,0.645674,0.640958
0,Masculine,3708.0,1894.0,1941.0,0.506128,0.343601,0.409321
0,Unknown,4062.0,4952.0,7090.0,0.588773,0.63576,0.611365
0,Occupation,1188.0,926.0,1777.0,0.657418,0.599325,0.627029


Save the metrics:

In [40]:
agmt_scores.to_csv(agreement_dir+"crf_{a}_baseline_fastText{d}_{c}_nolingfeatures_annot_agmt.csv".format(a=a, d=d, c=category))