# Experiment 1, Model 2

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multilabel Linguistic Classifier
2. Multiclass Person Name + Occupation Sequence Classifier
3. Multilabel Stereotype + Omission Document Classifier

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
    
***

**Table of Contents**

[I.](#i) Person Name + Occupation Classifier
* [Preprocessing](#prep)
* [Training & Prediction](#tp)
* [Evaluation](#eval)

[II.](#ii) Predict Over All Data

Load programming resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For classification
import sklearn.metrics
from sklearn.preprocessing import MultiLabelBinarizer
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For saving model
from joblib import dump,load

Define resources for the models:

In [2]:
# Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
predictions_dir = config.experiment1_path+"5fold/output/"
Path(predictions_dir).mkdir(parents=True, exist_ok=True)  # For predictions
agreement_dir = config.experiment1_path+"5fold/agreement/"
Path(agreement_dir).mkdir(parents=True, exist_ok=True)    # For agreement metrics

In [3]:
pers_o_label_subset = [
    "B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", 
    "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"
]

In [4]:
ling_label_tags = {
    "Gendered-Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"],
    "Gendered-Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"],
}
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"],
    "Feminine": ["B-Feminine", "I-Feminine"],
    "Masculine": ["B-Masculine", "I-Masculine"],
    "Occupation": ["B-Occupation", "I-Occupation"],
}

In [5]:
d = 100                   # dimensions of word embeddings (should match utils1.py) for file names
target_labels = "pers_o"  # for file names

<a id="i"></a>
## I. Person Name + Occupation Labels

Train a multiclass sequence classifier, a Conditional Random Field (CRF) trained with Adaptive Regularization of Weight Vectors (AROW), on the Person Name and Occupation labels, **passing in the Linguistic labels (the label names, not the specific BIO tags) from the previous model's predictions as features to this model.** The Person Name labels were assigned based on the presence or absence of gendered terminology referring to named people within a description, so the Linguistic labels passed as features are rolled up to the description level.

Multiclass is a suitable setup for these labels because they are mutually exclusive (no token should have more than one of these labels). In past algorithm experiments with sequence classifiers for the Person Name and Occupation labels, the CRF with AROW was the highest performing.
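The description-level roll-up described above can be sketched in plain Python. This is only an illustration, not the notebook's `utils.implodeDataFrame` helper: the function name `roll_up_labels` and the sample rows are hypothetical. It builds new lists rather than calling `list.remove()` inside a comprehension, because `list.remove()` returns `None`.

```python
from collections import defaultdict

def roll_up_labels(token_rows, outside="O"):
    """Collect the unique non-"O" labels predicted for each description."""
    labels_by_desc = defaultdict(set)
    for description_id, label in token_rows:
        labels_by_desc[description_id]  # ensure every description gets an entry
        if label != outside:
            labels_by_desc[description_id].add(label)
    # Descriptions without any linguistic label fall back to ["O"]
    return {
        desc_id: sorted(labels) if labels else [outside]
        for desc_id, labels in labels_by_desc.items()
    }

# Hypothetical (description_id, predicted label) pairs, one per token
rows = [
    (0, "O"), (0, "O"),
    (1, "Gendered-Pronoun"), (1, "O"), (1, "Gendered-Role"), (1, "Gendered-Pronoun"),
]
rolled = roll_up_labels(rows)
print(rolled)
```

Each description ends up with a sorted list of distinct labels, so the same combination of labels always produces the same feature value.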

Load the Linguistic features:

In [6]:
a = "rf"
ling_filename = predictions_dir+"cc-{a}_linglabels_baseline_fastText{d}_strict_evaluation.csv".format(a=a, d=d)
ling_eval_df = pd.read_csv(ling_filename, usecols=["description_id", "token_id", "predicted_label"])
ling_features = ling_eval_df.rename(columns={"predicted_label":"pred_ling_label"})
ling_features = ling_features.fillna("O")

Group Linguistic predictions by description, removing duplicates from the list of Linguistic predictions:

In [7]:
ling_features = utils.implodeDataFrame(ling_features, ["description_id"]).reset_index()
col = "pred_ling_label"
pred_col = list(ling_features[col])
# Remove duplicates
unique_pred_col = [list(set(preds)) for preds in pred_col]
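# Remove "O" values. Note that list.remove() returns None, so every list that
# contained "O" becomes None here and is reset to ["O"] in the loop below.
unique_pred_col = [preds.remove("O") if "O" in preds else preds for preds in unique_pred_col]
# Sort the lists
new_pred_col = []
for preds in unique_pred_col:
    if preds is None: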
        preds = ["O"]
    else:
        preds.sort()
    new_pred_col += [preds]
assert len(pred_col) == len(new_pred_col)
ling_features = ling_features.drop(columns=[col, "token_id"])
ling_features.insert((len(ling_features.columns)-1), col, new_pred_col)
ling_features.head() # Want to keep desc ids in previous model!

Unnamed: 0,pred_ling_label,description_id
0,[O],0
1,[O],1
2,[O],2
3,[O],3
4,[O],4


In [8]:
ling_features[col].value_counts()  # Looks good

[O]    27908
Name: pred_ling_label, dtype: int64

In [9]:
df = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
perso_df = utils1.selectDataForLabels(df, "tag", pers_o_label_subset)
print(df.shape, perso_df.shape)

(779270, 10) (779270, 10)


Get the label associated with each annotation for future evaluation:

In [61]:
df_by_ann = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
df_by_ann = df_by_ann.drop_duplicates()
df_by_ann = utils.implodeDataFrame(df_by_ann, ["ann_id"])
tags_col = list(df_by_ann.tag)
labels = [[tag[2:] if tag != "O" else tag for tag in tags] for tags in tags_col]
labels = [label_list[0] for label_list in labels]
df_by_ann.insert(len(df_by_ann.columns), "expected_label", labels)
perso_labels = list(pers_o_label_tags.keys())
df_by_ann = df_by_ann.loc[df_by_ann.expected_label.isin(perso_labels)]
df_by_ann.head()

ann_id,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,fold,expected_label
7,"[1082, 1082, 1082, 1082]","[2590, 2590, 2590, 2590]","[58341, 58342, 58343, 58344]","[Mrs, Norman, Macleod, ,]","[(36375, 36378), (36379, 36385), (36386, 36393...","[NNP, NNP, NNP, ,]","[B-Feminine, I-Feminine, I-Feminine, I-Feminine]","[Scope and Contents, Scope and Contents, Scope...","[split2, split2, split2, split2]",Feminine
14,"[855, 855, 855, 855]","[1097, 1097, 1097, 1097]","[19836, 19837, 19838, 19839]","[Dr., Nelly, Renee, Deme]","[(40, 43), (44, 49), (50, 55), (56, 60)]","[NNP, NNP, NNP, NNP]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]","[Title, Title, Title, Title]","[split4, split4, split4, split4]",Unknown
15,"[1038, 1038]","[1485, 1485]","[28713, 28714]","[Marjory, Kennedy-Fraser]","[(14570, 14577), (14578, 14592)]","[NNP, NNP]","[B-Unknown, I-Unknown]","[Scope and Contents, Scope and Contents]","[split4, split4]",Unknown
16,"[1038, 1038, 1038, 1038]","[1486, 1486, 1486, 1486]","[28738, 28739, 28740, 28741]","[Marjory, Kennedy, Fraser, ,]","[(14698, 14705), (14706, 14713), (14714, 14720...","[NNP, NNP, NNP, ,]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]","[Scope and Contents, Scope and Contents, Scope...","[split3, split3, split3, split3]",Unknown
17,"[1038, 1038, 1038]","[1487, 1487, 1487]","[28790, 28791, 28792]","[Marjory, Kennedy-Fraser, ,]","[(14924, 14931), (14932, 14946), (14946, 14947)]","[NNP, NNP, ,]","[B-Unknown, I-Unknown, I-Unknown]","[Scope and Contents, Scope and Contents, Scope...","[split0, split0, split0]",Unknown


Join the linguistic labels of descriptions (which will be passed to the model as features) to the model input and evaluation data:

In [11]:
perso_df = perso_df.join(ling_features.set_index("description_id"), on="description_id", how="left")
assert perso_df.loc[perso_df.tag.isna()].shape[0] == 0
perso_df = perso_df.fillna("O")
feature_col = list(perso_df[col])
new_feature_col = []
for preds in feature_col:
    if preds != "O":
        preds.sort()
        preds = ",".join(preds)
    new_feature_col += [preds]
perso_df = perso_df.drop(columns=[col])
perso_df.insert(2, col, new_feature_col)
perso_df[col].value_counts() # Looks good

O    779270
Name: pred_ling_label, dtype: int64
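As a rough illustration of the join above, the sketch below attaches description-level labels to token rows with a dictionary lookup, using `"O"` when a description has no entry. The function name `attach_description_features` and the sample data are hypothetical; the notebook itself uses `DataFrame.join`.

```python
def attach_description_features(token_rows, features_by_desc, default="O"):
    """Left-join description-level labels onto token rows as one string feature."""
    joined = []
    for row in token_rows:
        labels = features_by_desc.get(row["description_id"])
        # Sort before joining so a label combination always yields the same string
        feature = ",".join(sorted(labels)) if labels else default
        joined.append({**row, "pred_ling_label": feature})
    return joined

# Hypothetical token rows and description-level features
tokens = [
    {"description_id": 1, "token_id": 3, "token": "Title"},
    {"description_id": 2, "token_id": 16, "token": "Scope"},
    {"description_id": 9, "token_id": 40, "token": "He"},
]
features = {1: ["O"], 2: ["Gendered-Role", "Gendered-Pronoun"]}
joined_rows = attach_description_features(tokens, features)
for row in joined_rows:
    print(row["token"], row["pred_ling_label"])
```

Every token keeps its row after the join, which mirrors the `how="left"` behavior: tokens whose description is missing from the features simply receive `"O"`.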

<a id="prep"></a>
### Preprocessing

Group data by token and then by sentence, so each sentence is a list of tokens and each token has a list of tags associated with it:

In [12]:
perso_data = perso_df.drop(columns=["ann_id", "token_offsets", "field", "pos"])

In [13]:
perso_token_groups = utils.implodeDataFrame(perso_data, ["description_id", "token_id", "sentence_id", "token", "fold", col]).reset_index()
# perso_token_groups.head()

In [14]:
perso_grouped = utils.implodeDataFrame(perso_token_groups, ["description_id", "sentence_id", "fold", col]).reset_index()
perso_grouped = perso_grouped.rename(columns={"token":"sentence"})
print(perso_grouped.shape)
perso_grouped.head()

(42030, 7)


Unnamed: 0,description_id,sentence_id,fold,pred_ling_label,token_id,sentence,tag
0,0,0,split4,O,"[0, 1, 2]","[Identifier, :, AA5]","[[O], [O], [O]]"
1,1,1,split2,O,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[[O], [O], [O], [O], [O, B-Unknown, B-Masculin..."
2,2,2,split1,O,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
3,3,3,split2,O,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[[O], [O], [O], [O], [B-Masculine], [I-Masculi..."
4,3,4,split4,O,"[134, 135, 136, 137, 138, 139, 140, 141, 142, ...","[He, was, educated, at, Daniel, Stewart, 's, C...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
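The two grouping steps can be sketched without pandas. This is a minimal illustration assuming one input row per (token, tag) pair; `group_tokens_into_sentences` and the sample rows are hypothetical and do not reproduce `utils.implodeDataFrame`.

```python
def group_tokens_into_sentences(token_tag_rows):
    """Group (sentence_id, token_id, token, tag) rows by token, then by sentence."""
    # Step 1: one entry per token, holding every tag assigned to that token
    tokens = {}
    for sentence_id, token_id, token, tag in token_tag_rows:
        entry = tokens.setdefault(
            token_id, {"sentence_id": sentence_id, "token": token, "tags": []}
        )
        entry["tags"].append(tag)
    # Step 2: one entry per sentence, holding its tokens in token_id order
    sentences = {}
    for token_id in sorted(tokens):
        entry = tokens[token_id]
        sent = sentences.setdefault(
            entry["sentence_id"], {"token_id": [], "sentence": [], "tag": []}
        )
        sent["token_id"].append(token_id)
        sent["sentence"].append(entry["token"])
        sent["tag"].append(entry["tags"])
    return sentences

# Hypothetical rows: the token "The" carries three tags from overlapping annotations
rows = [
    (1, 3, "Title", "O"),
    (1, 4, ":", "O"),
    (1, 7, "The", "O"),
    (1, 7, "The", "B-Unknown"),
    (1, 7, "The", "B-Masculine"),
]
grouped = group_tokens_into_sentences(rows)
print(grouped[1]["sentence"], grouped[1]["tag"])
```

The result has the same shape as the table above: each sentence is a list of tokens, and each token has its own list of tags.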


In [15]:
perso_grouped[col].value_counts()  # Looks good!

O    42030
Name: pred_ling_label, dtype: int64

Define the five groups of training and test sets:

In [16]:
split_col = "fold"
splits = perso_grouped[split_col].unique()
splits.sort()
print(splits)

['split0' 'split1' 'split2' 'split3' 'split4']


In [17]:
train0, test0 = list(splits[:4]), splits[4]
train1, test1 = list(splits[1:]), splits[0]
train2, test2 = list(splits[2:])+[splits[0]], splits[1]
train3, test3 = list(splits[3:])+list(splits[:2]), splits[2]
train4, test4 = [splits[4]]+list(splits[:3]), splits[3]
runs = [(train0, test0), (train1, test1), (train2, test2), (train3, test3), (train4, test4)]
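The same rotation can be generated programmatically. The sketch below is an illustrative alternative, with the hypothetical helper name `leave_one_fold_out`; it holds the folds out in sorted order, so the order of the runs differs from the hand-written list above, although each fold is still used as the test set exactly once.

```python
def leave_one_fold_out(folds):
    """Return (train_folds, test_fold) pairs, holding out each fold once."""
    folds = sorted(folds)
    return [
        ([fold for fold in folds if fold != held_out], held_out)
        for held_out in folds
    ]

fold_names = ["split0", "split1", "split2", "split3", "split4"]
generated_runs = leave_one_fold_out(fold_names)
for train_folds, test_fold in generated_runs:
    print("train:", train_folds, "test:", test_fold)
```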

<a id="tp"></a>
### Training & Prediction

Train a Conditional Random Field (CRF) model with the AROW algorithm on the **Person Name** and **Occupation** tags. We'll set the variance to 1, allow all possible transitions, and cap training at 100 iterations.

In [18]:
pred_df = pd.DataFrame()

# Specify the run one at a time (with for loop, model remembers what it learned in previous runs)
run = runs[0]  # 1, 2, 3, 4

# Get the train (80%) and test (20%) subsets of data
# with Person Name tags as targets and Linguistic labels as features
train_splits, test_split = run[0], run[1]
print("Training on:", train_splits)
train_df = perso_grouped.loc[perso_grouped[split_col].isin(train_splits)]
dev_df = perso_grouped.loc[perso_grouped[split_col] == test_split]

# Zip the linguistic label and BIO tags together with the tokens so each 
# sentence item is a tuple: `(TOKEN, LING_LABEL(S), TAG_LIST)`
train_sentences = utils1.zip2FeaturesAndTarget(train_df, "tag", feature_col2=col)
dev_sentences = utils1.zip2FeaturesAndTarget(dev_df, "tag", feature_col2=col)

# Extract features
X_train = [utils1.extractSentenceFeatures(s) for s in train_sentences] 
X_dev = [utils1.extractSentenceFeatures(s) for s in dev_sentences]

# Extract targets
y_train = [utils1.extractSentenceTargets(s) for s in train_sentences]
y_dev = [utils1.extractSentenceTargets(s) for s in dev_sentences]

# Train a classification model
a = "arow"
clf = sklearn_crfsuite.CRF(algorithm=a, variance=1, max_iterations=100, all_possible_transitions=True)
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

# Predict with the trained model
print("Predicting on:", test_split)
predictions = clf.predict(X_dev)

dev_df = dev_df.rename(columns={"tag":"tag_pers_o_expected"})
dev_df.insert(len(dev_df.columns), "tag_pers_o_predicted", predictions)
dev_df = dev_df.set_index(["description_id", "sentence_id", "fold", col])
dev_df_exploded = dev_df.explode(list(dev_df.columns))

if pred_df.shape[0] > 0:
    pred_df = pd.concat([pred_df, dev_df_exploded])
else:
    pred_df = dev_df_exploded

assert pred_df.loc[pred_df["tag_pers_o_expected"].isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"

filename = "crf_{a}_{t}_baseline_fastText{d}_predictions_{s}.csv".format(a=a, t=target_labels, d=d, s=test_split)
pred_df.to_csv(predictions_dir+filename)

print("Predictions for {} saved!".format(test_split))

Training on: ['split0', 'split1', 'split2', 'split3']
Predicting on: split4
Predictions for split4 saved!
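For readers without access to `utils1`, the sketch below shows the general shape of input that a CRF from `sklearn_crfsuite` accepts: one list of feature dictionaries per sentence, and one list of target tags per sentence. The specific features are assumptions made for illustration. The notebook's `utils1.extractSentenceFeatures` is not shown here and presumably also uses the fastText embeddings, which this sketch omits.

```python
def token_features(sentence, i):
    """Build a feature dict for token i of a sentence of (token, ling_label, tags) tuples."""
    token, ling_label = sentence[i][0], sentence[i][1]
    features = {
        "bias": 1.0,
        "token.lower": token.lower(),
        "token.istitle": token.istitle(),
        "token.isdigit": token.isdigit(),
        "ling_label": ling_label,  # description-level linguistic label(s)
    }
    if i > 0:
        features["prev.lower"] = sentence[i - 1][0].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1][0].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

def sentence_targets(sentence):
    # Use only the first tag of each token as the training target
    return [tags[0] for _, _, tags in sentence]

# Hypothetical sentence in the (TOKEN, LING_LABEL(S), TAG_LIST) layout
example = [
    ("James", "O", ["B-Masculine"]),
    ("Whyte", "O", ["I-Masculine"]),
    ("was", "O", ["O"]),
]
example_X = sentence_features(example)
example_y = sentence_targets(example)
print(example_X[0])
print(example_y)
```

Because the linguistic label is rolled up to the description level, every token in a description receives the same `ling_label` value.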


Save the model (the last model run):

In [19]:
model_dir = "models/experiment1/"
Path(model_dir).mkdir(parents=True, exist_ok=True)
filename = model_dir+"crf-{a}_pn_F-fastText{d}Ling_T-PNOcc.joblib".format(a=a, d=d)  # include features (F) and targets (T) in model's file name
dump(clf, filename)

['models/experiment1/crf-arow_pn_F-fastText100Ling_T-PNOcc.joblib']

Combine the prediction data:

In [20]:
pred_perso = pd.concat([
    pd.read_csv(
        predictions_dir+"crf_{a}_{t}_baseline_fastText{d}_predictions_{s}.csv".format(a=a, t=target_labels, d=d, s=s),
        index_col=0
    )
    for s in splits  # split0 through split4
])
print(pred_perso.shape)

(753521, 7)


In [21]:
pred_perso = pred_perso.reset_index()
pred_perso = utils.getColumnValuesAsLists(pred_perso, "tag_pers_o_expected")
pred_perso.head()

Unnamed: 0,description_id,sentence_id,fold,pred_ling_label,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,3,8,split0,O,233,James,[B-Masculine],B-Masculine
1,3,8,split0,O,234,Whyte,[I-Masculine],I-Unknown
2,3,8,split0,O,235,was,[O],O
3,3,8,split0,O,236,called,[O],O
4,3,8,split0,O,237,upon,[O],O


In [22]:
assert pred_perso.shape[0] == len(pred_perso.token_id.unique()), "There should be one row per token."
pred_perso.pred_ling_label.value_counts()  # Looks good

O    753521
Name: pred_ling_label, dtype: int64

Include only the first value from the expected tags column, as that's what was used in training:

In [23]:
exp_col = "tag_pers_o_expected"
pred_col = "tag_pers_o_predicted"

In [24]:
exp_pred_lists = list(pred_perso[exp_col])
new_exp_pred_col = [exp_pred_list[0] for exp_pred_list in exp_pred_lists]
pred_df = pred_perso.drop(columns=[exp_col])
pred_df.insert(len(pred_df.columns)-1, exp_col, new_exp_pred_col)

In [25]:
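# fillna returns a new Series, so assign the result back to the column
pred_df[exp_col] = pred_df[exp_col].fillna("O")
pred_df[pred_col] = pred_df[pred_col].fillna("O")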
pred_df.head()

Unnamed: 0,description_id,sentence_id,fold,pred_ling_label,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,3,8,split0,O,233,James,B-Masculine,B-Masculine
1,3,8,split0,O,234,Whyte,I-Masculine,I-Unknown
2,3,8,split0,O,235,was,O,O
3,3,8,split0,O,236,called,O,O
4,3,8,split0,O,237,upon,O,O


Save the concatenated prediction data:

In [26]:
category = target_labels
a = "arow"
pred_df.to_csv(predictions_dir+"crf_{a}_{c}_baseline_fastText{d}_predictions.csv".format(a=a, c=category, d=d))

<a id="eval"></a>
### Evaluation
#### Evaluate: Strict, Each Label

The built-in evaluation approach is strict, so unless the model's predicted labels are on text spans that exactly match the annotated text spans in the development data, the predicted labels will be deemed incorrect.
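A minimal sketch of a strict, token-level comparison follows. It assumes that a mismatch between two different non-`O` tags counts as both a missed tag (false negative) and a spurious tag (false positive); `strict_agreement` is a hypothetical name, and `utils.makeEvaluationDataFrame` may handle such cases differently.

```python
def strict_agreement(expected, predicted, outside="O"):
    """Return (expected, predicted, agreement) rows for a single token."""
    if expected == predicted:
        agreement = "true negative" if expected == outside else "true positive"
        return [(expected, predicted, agreement)]
    rows = []
    if expected != outside:
        rows.append((expected, outside, "false negative"))  # expected tag was missed
    if predicted != outside:
        rows.append((outside, predicted, "false positive"))  # predicted tag was not expected
    return rows

print(strict_agreement("B-Masculine", "B-Masculine"))
print(strict_agreement("I-Masculine", "I-Unknown"))
print(strict_agreement("O", "O"))
```

Under this assumption a single token can contribute two rows, so an agreement table built this way can have more rows than there are tokens.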

Calculate performance metrics for each category of labels:

In [27]:
pred_df = pred_df.rename(columns={"sentence":"token"})
df_pred = pred_df.drop(columns=[exp_col])
df_exp = pred_df.drop(columns=[pred_col])

In [37]:
join_on =  ["description_id", "sentence_id", "token_id", "token", "fold", "pred_ling_label"]
df_agmt = utils.makeEvaluationDataFrame(
    df_exp, 
    df_pred, 
    join_on+[exp_col],
    join_on+[pred_col],
    join_on+[exp_col, pred_col, "_merge"], 
    pred_col,
    exp_col,
    "O"
)
df_agmt = df_agmt.fillna("O")
df_agmt.head()

Unnamed: 0,description_id,sentence_id,token_id,token,fold,pred_ling_label,tag_pers_o_expected,tag_pers_o_predicted,_merge
0,3,8,233,James,split0,O,B-Masculine,B-Masculine,true positive
1,3,8,234,Whyte,split0,O,I-Masculine,O,false negative
2,3,8,235,was,split0,O,O,O,true negative
3,3,8,236,called,split0,O,O,O,true negative
4,3,8,237,upon,split0,O,O,O,true negative


In [38]:
print(df_agmt.shape)
print(df_exp.shape)
print(df_pred.shape)

(786096, 9)
(753521, 7)
(753521, 7)


Save the agreement data:

In [39]:
df_agmt.to_csv(predictions_dir+"crf-arow_pers_o_baseline_fastText100_strict_evaluation.csv")

Calculate the true positives, false positives, false negatives, precision, recall, and F1 metrics for each tag:

In [40]:
labels = ['B-Feminine', 'I-Feminine', 'B-Masculine', 'I-Masculine', 'B-Unknown', 'I-Unknown', 'B-Occupation', 'I-Occupation']

In [42]:
agmt_scores = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })
for label in labels:
    agmt_df = pd.concat([df_agmt.loc[df_agmt[exp_col] == label], df_agmt.loc[df_agmt[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,B-Feminine,397.0,585.0,453.0,0.436416,0.532941,0.479873
0,I-Feminine,1035.0,1516.0,1138.0,0.428787,0.5237,0.471514
0,B-Masculine,2176.0,1695.0,837.0,0.330569,0.277796,0.301894
0,I-Masculine,3369.0,3812.0,1396.0,0.268049,0.29297,0.279956
0,B-Unknown,4131.0,3183.0,3008.0,0.485867,0.421348,0.451313
0,I-Unknown,7317.0,4876.0,5388.0,0.524942,0.424085,0.469154
0,B-Occupation,1304.0,901.0,1407.0,0.609619,0.518997,0.560669
0,I-Occupation,2054.0,1180.0,1266.0,0.51758,0.381325,0.439126
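The scores in this table follow the standard definitions of precision, recall, and F1. The sketch below is a stand-in for `utils.precisionRecallF1`, whose implementation is not shown in this notebook; returning 0.0 when a denominator is zero is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from agreement counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the B-Feminine row of the table above
prec, rec, f1 = precision_recall_f1(tp=453, fp=585, fn=397)
print(round(prec, 6), round(rec, 6), round(f1, 6))
```

With the B-Feminine counts, the rounded values agree with the precision, recall, and F1 columns of that row.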


Save the data:

In [43]:
agmt_scores.to_csv(agreement_dir+"crf-arow_pers_o_baseline_fastText100_strict_agmt.csv")

#### Evaluate: Annotation Agreement

Calculate agreement at the annotation level, so if the model labels any word correctly from a manually annotated text span, that annotation is recorded as being correctly labeled (`true positive`).

In [44]:
# Generalize tags to labels for annotation agreement
pred_col, exp_col = "predicted_label", "expected_label"
category = "pers_o"
pred_labels = list(pred_df["tag_{}_predicted".format(category)])
pred_labels = [label if label == "O" else label[2:] for label in pred_labels]
pred_df.insert(len(pred_df.columns), pred_col, pred_labels)
print(pred_df.shape)
pred_df.head()

(753521, 9)


Unnamed: 0,description_id,sentence_id,fold,pred_ling_label,token_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label
0,3,8,split0,O,233,James,B-Masculine,B-Masculine,Masculine
1,3,8,split0,O,234,Whyte,I-Masculine,I-Unknown,Unknown
2,3,8,split0,O,235,was,O,O,O
3,3,8,split0,O,236,called,O,O,O
4,3,8,split0,O,237,upon,O,O,O


In [62]:
df_by_ann = df_by_ann.explode(["description_id", "sentence_id", "token_id", "token", "token_offsets", "pos", "tag", "field", "fold"])
df_by_ann = df_by_ann.reset_index()
df_by_ann.head()

Unnamed: 0,ann_id,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,fold,expected_label
0,7,1082,2590,58341,Mrs,"(36375, 36378)",NNP,B-Feminine,Scope and Contents,split2,Feminine
1,7,1082,2590,58342,Norman,"(36379, 36385)",NNP,I-Feminine,Scope and Contents,split2,Feminine
2,7,1082,2590,58343,Macleod,"(36386, 36393)",NNP,I-Feminine,Scope and Contents,split2,Feminine
3,7,1082,2590,58344,",","(36393, 36394)",",",I-Feminine,Scope and Contents,split2,Feminine
4,14,855,1097,19836,Dr.,"(40, 43)",NNP,B-Unknown,Title,split4,Unknown


In [63]:
join_on = ["description_id", "sentence_id", "token_id", "token", "fold"]
eval_df_joined = pred_df.join(df_by_ann.set_index(join_on), on=join_on, how="outer")
print(eval_df_joined.shape)
eval_df_joined["ann_id"] = eval_df_joined["ann_id"].fillna(99999)
eval_df_joined[exp_col] = eval_df_joined[exp_col].fillna("O")
eval_df_joined.head()

(763670, 15)


Unnamed: 0,description_id,sentence_id,fold,pred_ling_label,token_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label,ann_id,token_offsets,pos,tag,field,expected_label
0,3,8,split0,O,233,James,B-Masculine,B-Masculine,Masculine,14387.0,"(1350, 1355)",NNP,B-Masculine,Biographical / Historical,Masculine
1,3,8,split0,O,234,Whyte,I-Masculine,I-Unknown,Unknown,14387.0,"(1356, 1361)",NNP,I-Masculine,Biographical / Historical,Masculine
2,3,8,split0,O,235,was,O,O,O,99999.0,,,,,O
3,3,8,split0,O,236,called,O,O,O,99999.0,,,,,O
4,3,8,split0,O,237,upon,O,O,O,99999.0,,,,,O


In [64]:
eval_by_ann = utils.implodeDataFrame(eval_df_joined, ["description_id", "sentence_id", "ann_id", "fold", "pred_ling_label", "expected_label"]).reset_index()
print(eval_by_ann.shape)
eval_by_ann.head()

(63277, 15)


Unnamed: 0,description_id,sentence_id,ann_id,fold,pred_ling_label,expected_label,token_id,token,tag_pers_o_expected,tag_pers_o_predicted,predicted_label,token_offsets,pos,tag,field
0,0,0,99999.0,split4,O,O,"[0, 1, 2]","[Identifier, :, AA5]","[O, O, O]","[O, O, O]","[O, O, O]","[nan, nan, nan]","[nan, nan, nan]","[nan, nan, nan]","[nan, nan, nan]"
1,1,1,14384.0,split2,O,Unknown,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[O, I-Unknown, I-Unknown, I-Unknown, I-Masculi...","[B-Unknown, I-Unknown, I-Unknown, B-Unknown, I...","[Unknown, Unknown, Unknown, Unknown, Unknown, ...","[(34, 37), (38, 42), (43, 46), (47, 51), (52, ...","[DT, NNP, NNP, NNP, NNP, NNP]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown, I...","[Title, Title, Title, Title, Title, Title]"
2,1,1,24275.0,split2,O,Masculine,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[O, I-Unknown, I-Unknown, I-Unknown, I-Masculi...","[B-Unknown, I-Unknown, I-Unknown, B-Unknown, I...","[Unknown, Unknown, Unknown, Unknown, Unknown, ...","[(34, 37), (38, 42), (43, 46), (47, 51), (52, ...","[DT, NNP, NNP, NNP, NNP, NNP]","[B-Masculine, I-Masculine, I-Masculine, I-Masc...","[Title, Title, Title, Title, Title, Title]"
3,1,1,26233.0,split2,O,Unknown,"[9, 10, 11, 12]","[Rev, Prof, James, Whyte]","[I-Unknown, I-Unknown, I-Masculine, O]","[I-Unknown, B-Unknown, I-Unknown, I-Unknown]","[Unknown, Unknown, Unknown, Unknown]","[(43, 46), (47, 51), (52, 57), (58, 63)]","[NNP, NNP, NNP, NNP]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]","[Title, Title, Title, Title]"
4,1,1,99999.0,split2,O,O,"[3, 4, 5, 6, 13, 14, 15]","[Title, :, Papers, of, (, 1920-2005, )]","[O, O, O, O, O, O, O]","[O, O, O, O, O, O, O]","[O, O, O, O, O, O, O]","[nan, nan, nan, nan, nan, nan, nan]","[nan, nan, nan, nan, nan, nan, nan]","[nan, nan, nan, nan, nan, nan, nan]","[nan, nan, nan, nan, nan, nan, nan]"


In [65]:
pred_label_col = list(eval_by_ann[pred_col])
unique_pred_label_col = [list(set(pred_labels)) for pred_labels in pred_label_col]
new_pred_label_col = []
for pred_labels in unique_pred_label_col:
    if "O" in pred_labels:
        pred_labels.remove("O")
    new_pred_label_col += [pred_labels]
eval_by_ann = eval_by_ann.drop(columns=[pred_col])
eval_by_ann.insert(len(eval_by_ann.columns), pred_col, new_pred_label_col)
eval_by_ann = eval_by_ann.explode([pred_col])
eval_by_ann.head()

Unnamed: 0,description_id,sentence_id,ann_id,fold,pred_ling_label,expected_label,token_id,token,tag_pers_o_expected,tag_pers_o_predicted,token_offsets,pos,tag,field,predicted_label
0,0,0,99999.0,split4,O,O,"[0, 1, 2]","[Identifier, :, AA5]","[O, O, O]","[O, O, O]","[nan, nan, nan]","[nan, nan, nan]","[nan, nan, nan]","[nan, nan, nan]",
1,1,1,14384.0,split2,O,Unknown,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[O, I-Unknown, I-Unknown, I-Unknown, I-Masculi...","[B-Unknown, I-Unknown, I-Unknown, B-Unknown, I...","[(34, 37), (38, 42), (43, 46), (47, 51), (52, ...","[DT, NNP, NNP, NNP, NNP, NNP]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown, I...","[Title, Title, Title, Title, Title, Title]",Unknown
2,1,1,24275.0,split2,O,Masculine,"[7, 8, 9, 10, 11, 12]","[The, Very, Rev, Prof, James, Whyte]","[O, I-Unknown, I-Unknown, I-Unknown, I-Masculi...","[B-Unknown, I-Unknown, I-Unknown, B-Unknown, I...","[(34, 37), (38, 42), (43, 46), (47, 51), (52, ...","[DT, NNP, NNP, NNP, NNP, NNP]","[B-Masculine, I-Masculine, I-Masculine, I-Masc...","[Title, Title, Title, Title, Title, Title]",Unknown
3,1,1,26233.0,split2,O,Unknown,"[9, 10, 11, 12]","[Rev, Prof, James, Whyte]","[I-Unknown, I-Unknown, I-Masculine, O]","[I-Unknown, B-Unknown, I-Unknown, I-Unknown]","[(43, 46), (47, 51), (52, 57), (58, 63)]","[NNP, NNP, NNP, NNP]","[B-Unknown, I-Unknown, I-Unknown, I-Unknown]","[Title, Title, Title, Title]",Unknown
4,1,1,99999.0,split2,O,O,"[3, 4, 5, 6, 13, 14, 15]","[Title, :, Papers, of, (, 1920-2005, )]","[O, O, O, O, O, O, O]","[O, O, O, O, O, O, O]","[nan, nan, nan, nan, nan, nan, nan]","[nan, nan, nan, nan, nan, nan, nan]","[nan, nan, nan, nan, nan, nan, nan]","[nan, nan, nan, nan, nan, nan, nan]",


Compare the predicted labels to the expected label for each annotation. If a label was predicted where none was expected, the agreement is a false positive; if the correct label was predicted (even on only part of the annotation), the agreement is a true positive; and if no label was predicted where one was expected, the agreement is a false negative.
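These rules can be sketched for a single annotation as follows. The function `annotation_agreement` is hypothetical, and counting any additional, unexpected label on an annotated span as a false positive is an assumption about how `utils.makeEvaluationDataFrame` behaves.

```python
def annotation_agreement(expected_label, predicted_labels, outside="O"):
    """Return (expected, predicted, agreement) rows for one annotation."""
    predicted = set(predicted_labels) - {outside}
    rows = []
    if expected_label != outside:
        if expected_label in predicted:
            # Correct label on at least one token of the annotated span
            rows.append((expected_label, expected_label, "true positive"))
        else:
            rows.append((expected_label, outside, "false negative"))
    elif not predicted:
        rows.append((outside, outside, "true negative"))
    # Any other predicted label was not expected for this span
    for label in sorted(predicted - {expected_label}):
        rows.append((outside, label, "false positive"))
    return rows

print(annotation_agreement("Unknown", ["Unknown", "Feminine", "O"]))
print(annotation_agreement("Feminine", ["Unknown"]))
print(annotation_agreement("O", []))
```

Because a partial match is enough for a true positive, this measure is more lenient than the strict token-level evaluation.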

In [66]:
exp_col = "expected_label"
pred_col = "predicted_label"
df_pred = eval_by_ann.drop(columns=[exp_col, "fold", "token_id", "token", "tag_pers_o_expected", "tag_pers_o_predicted"])
df_exp = eval_by_ann.drop(columns=[pred_col, "fold", "token_id", "token", "tag_pers_o_expected", "tag_pers_o_predicted"])

In [68]:
eval_by_ann[exp_col] = eval_by_ann[exp_col].fillna("O")
eval_by_ann[pred_col] = eval_by_ann[pred_col].fillna("O")
assert eval_by_ann.loc[eval_by_ann.expected_label.isna()].shape[0] == 0
assert eval_by_ann.loc[eval_by_ann.predicted_label.isna()].shape[0] == 0

Record the agreement type for each row, either false positive, true positive, false negative, or true negative:

In [69]:
join_on =  ["description_id", "sentence_id", "ann_id", "pred_ling_label"]
eval_df = utils.makeEvaluationDataFrame(
    df_exp, 
    df_pred, 
    join_on+[exp_col], 
    join_on+[pred_col], 
    join_on+[exp_col, pred_col, "_merge"], 
    exp_col, 
    pred_col, 
    "O"
)
id_col = "ann_id"
eval_df = eval_df.sort_values(by=[id_col, exp_col, pred_col])
eval_df.head()

Unnamed: 0,description_id,sentence_id,ann_id,pred_ling_label,expected_label,predicted_label,_merge
6999,1082,2590,7.0,O,Feminine,O,false negative
69139,1082,2590,7.0,O,O,Unknown,false positive
66635,855,1097,14.0,O,O,Feminine,false positive
2826,855,1097,14.0,O,Unknown,Unknown,true positive
2827,855,1097,14.0,O,Unknown,Unknown,true positive


In [73]:
eval_df = eval_df.drop_duplicates()
print(eval_df.shape)
print(df_pred.shape)
print(df_exp.shape)

(79601, 7)
(64652, 9)
(64652, 9)


Save the data:

In [74]:
eval_df.to_csv(predictions_dir+"crf-{a}_{c}_baseline_fastText{d}_annot_evaluation.csv".format(a=a, c=category, d=d))

Calculate annotation agreement metrics for each label:

In [75]:
agmt_scores = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [77]:
labels = ['Feminine', 'Masculine', 'Unknown', 'Occupation']
for label in labels:
    agmt_df = pd.concat([eval_df.loc[eval_df[exp_col] == label], eval_df.loc[eval_df[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Feminine,553.0,1208.0,1146.0,0.486831,0.674514,0.565507
0,Masculine,3402.0,3167.0,2247.0,0.415035,0.39777,0.406219
0,Unknown,5982.0,3581.0,5170.0,0.59079,0.463594,0.51952
0,Occupation,1278.0,901.0,1687.0,0.651855,0.568971,0.607599


Save the metrics:

In [78]:
agmt_scores.to_csv(agreement_dir+"crf-{a}_{c}_baseline_fastText{d}_annot_agmt.csv".format(a=a, d=d, c=category))

***

#### *For train-dev-test (i.e., 40-40-20) approach*

<a id="ii"></a>
## II. Predict Over All Data

In [54]:
test_sentences = utils1.zip2FeaturesAndTarget(df_test_grouped, "tag")
X_test = [utils1.extractSentenceFeatures(sentence) for sentence in test_sentences]  # Features
y_test = [utils1.extractSentenceTargets(sentence) for sentence in test_sentences]   # Target
# Combine all data subsets' features and targets
X_all = X_train+X_dev+X_test
y_all = y_train+y_dev+y_test

In [55]:
all_predictions = clf_perso.predict(X_all)

#### Evaluate: All Labels

In [56]:
print("  - F1:", metrics.flat_f1_score(y_all, all_predictions, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_all, all_predictions, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_all, all_predictions, average="weighted", zero_division=0, labels=targets))

  - F1: 0.5211899648086513
  - Prec: 0.5561442804328083
  - Rec 0.49419097345616575
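With `average="weighted"`, the per-label scores are averaged with weights proportional to each label's support, that is, the number of expected instances of that label. The sketch below illustrates the idea for F1 using hypothetical counts; it is not the `sklearn_crfsuite` implementation and is not expected to reproduce the numbers printed above.

```python
def weighted_f1(counts_by_label):
    """Support-weighted average of per-label F1, given (tp, fp, fn) per label."""
    total_support = 0
    weighted_sum = 0.0
    for tp, fp, fn in counts_by_label.values():
        support = tp + fn  # number of expected instances of the label
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0

# Hypothetical (tp, fp, fn) counts for two labels
example_counts = {"B-Unknown": (8, 2, 2), "B-Feminine": (1, 1, 3)}
example_score = weighted_f1(example_counts)
print(example_score)
```

Frequent labels dominate a weighted average, so it can hide weak performance on rare labels; the per-label tables give the fuller picture.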


Save the prediction data:

In [130]:
df_train_grouped = df_train_grouped.rename(columns={"tag":"tag_pers_o_expected"})
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.drop(columns="tag_{}_predicted".format(target_labels))
df_dev_grouped = df_dev_grouped.reset_index()
df_test_grouped = df_test_grouped.rename(columns={"tag":"tag_pers_o_expected"})
df_all_grouped = pd.concat([df_train_grouped, df_dev_grouped, df_test_grouped])
# df_all_grouped.head()

In [127]:
df_all_grouped.insert(len(df_all_grouped.columns), "tag_pers_o_predicted", all_predictions)
df_all_grouped = df_all_grouped.set_index("sentence_id")
df_all_exploded = df_all_grouped.explode(list(df_all_grouped.columns))
df_all_exploded.head()

sentence_id,token_id,sentence,tag_pers_o_expected,pred_ling_tag,tag_pers_o_predicted
2,16,Scope,[O],[O],O
2,17,and,[O],[O],O
2,18,Contents,[O],[O],O
2,19,:,[O],[O],O
2,20,Sermons,[O],[O],O


In [128]:
df_all_exploded = df_all_exploded.reset_index()

In [129]:
filename = "crf_{a}_{t}_baseline_fastText{d}_predictions_ALLDATA.csv".format(a=a, t=target_labels, d=d)
df_all_exploded.to_csv(config.experiment1_output_path+filename)

#### Evaluate: Each Label

The built-in evaluation approach is strict: a predicted label counts as correct only when it falls on a text span that exactly matches the manual annotation.

Calculate performance metrics for each category of labels:

In [131]:
pred_perso = df_all_exploded.copy()
pred_perso = pred_perso.fillna("O")
pred_perso = utils.isPredictedInExpected(pred_perso, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_perso.head()

In [132]:
pred_perso_stats = utils.getScoresByCatTags(
    pred_perso, "_merge", pers_o_label_subset[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(pers_o_label_subset)):
    tag_stats = utils.getScoresByCatTags(
        pred_perso, "_merge", pers_o_label_subset[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_perso_stats = pd.concat([pred_perso_stats, tag_stats])
pred_perso_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,B-Unknown,2082,2433,4549,0.651533,0.68602,0.668332
0,I-Unknown,3959,3797,7738,0.670828,0.661537,0.66615
0,B-Feminine,88,183,392,0.681739,0.816667,0.743128
0,I-Feminine,337,1035,1620,0.610169,0.827798,0.702515
0,B-Masculine,678,875,1356,0.607799,0.666667,0.635873
0,I-Masculine,1491,1339,2103,0.610982,0.585142,0.597783
0,B-Occupation,896,623,1634,0.72397,0.64585,0.682682
0,I-Occupation,1304,892,1844,0.673977,0.585769,0.626785
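
As a quick check on how the columns in the table above relate, the sketch below recomputes precision, recall, and F1 from the false negative, false positive, and true positive counts reported for `B-Unknown`. The helper name `precision_recall_f1` is hypothetical and is not part of `utils`.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts taken from the B-Unknown row of the table above
b_unknown_prec, b_unknown_rec, b_unknown_f1 = precision_recall_f1(
    true_pos=4549, false_pos=2433, false_neg=2082
)
print(round(b_unknown_prec, 6), round(b_unknown_rec, 6), round(b_unknown_f1, 6))
```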


Save the statistics:

In [133]:
pred_perso_stats.to_csv(
    config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_strict_agmt_ALLDATA.csv".format(a=a, c=category, d=d)
)

#### Evaluate: Annotation Agreement

Calculate agreement at the annotation level: if the model correctly labels any word in a manually annotated text span, that annotation is recorded as correctly labeled (`true positive`). Note whether the model's labels are an `exact_match`, `label_match`, `category_match`, or `mismatch`.
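
The "any word" rule can be sketched as follows. This is a hypothetical stand-in rather than the logic of `utils1.getAnnotationAgreement`, which is not shown here, and it does not attempt to reproduce the `exact_match`, `label_match`, `category_match`, and `mismatch` distinctions. The toy annotations and the name `annotation_is_hit` are assumptions for illustration.

```python
def annotation_is_hit(expected_label, predicted_labels):
    """True if at least one token in the annotated span got the expected label."""
    return expected_label in predicted_labels

# Hypothetical annotations: one expected label per annotated span, plus the
# label the model predicted for each token inside that span
demo_annotations = [
    {"expected": "Masculine", "predicted": ["Masculine", "Masculine"]},  # every token right
    {"expected": "Unknown", "predicted": ["O", "Unknown", "O"]},         # one token right
    {"expected": "Occupation", "predicted": ["O", "O"]},                 # span missed
]

demo_true_positives = sum(
    annotation_is_hit(ann["expected"], ann["predicted"]) for ann in demo_annotations
)
demo_false_negatives = len(demo_annotations) - demo_true_positives
print(demo_true_positives, demo_false_negatives)
```

Under this rule a partially labeled span still counts as a true positive, which is why annotation-level agreement is more lenient than the strict token-level scores.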

Load the annotation data:

*Note: `ann_id` of `99999` indicates no annotation*

Group the annotation data by token:

In [134]:
perso_all = pd.concat([perso_train, perso_dev, perso_test])

In [135]:
df_ann = perso_all[["sentence_id", "ann_id", "token_id", "tag"]]
df_ann = utils.implodeDataFrame(df_ann, ["sentence_id", "ann_id", "token_id"])
df_ann = df_ann.reset_index()
print(df_ann.shape)
df_ann.head()

(778803, 4)


Unnamed: 0,sentence_id,ann_id,token_id,tag
0,0,99999,0,[O]
1,0,99999,1,[O]
2,0,99999,2,[O]
3,1,14384,7,[B-Unknown]
4,1,14384,8,[I-Unknown]


Align the columns of the annotation and prediction DataFrames:

In [138]:
# Rename the `sentence` column to `token` and sort by token ID
pred_perso = pred_perso.rename(columns={"sentence":"token"}).sort_values(by="token_id")
pred_perso.head()

Unnamed: 0,sentence_id,token_id,token,tag_pers_o_expected,pred_ling_tag,tag_pers_o_predicted,_merge
604541,0,0,Identifier,[O],[O],O,true negative
604542,0,1,:,[O],[O],O,true negative
604543,0,2,AA5,[O],[O],O,true negative
298617,1,3,Title,[O],[O],O,true negative
298618,1,4,:,[O],[O],O,true negative


Join the data, adding the annotation IDs (`ann_id` column) to the prediction DataFrames:

In [139]:
index_list = ["sentence_id", "token_id"]

In [140]:
pred_perso_ann = pred_perso.join(df_ann.set_index(index_list), on=index_list, how="left")
pred_perso_ann = pred_perso_ann.drop(columns=["tag"])  # duplicate of tag_expected
assert pred_perso_ann.loc[pred_perso_ann["token_id"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["ann_id"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["tag_pers_o_predicted"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["tag_pers_o_expected"].isna()].shape[0] == 0
# pred_perso_ann.head()

Explode the DataFrame:

In [141]:
pred_perso_ann = pred_perso_ann.explode(["tag_pers_o_expected"])

Generalize the BIO tags to label names:

In [142]:
# Get the predicted labels
pred_labels = list(pred_perso_ann["tag_{}_predicted".format(category)])
pred_labels = [label if label == "O" else label[2:] for label in pred_labels]
pred_perso_ann.insert(len(pred_perso_ann.columns), "label_{}_predicted".format(category), pred_labels)
# Get the lists of expected labels
exp_labels = list(pred_perso_ann["tag_{}_expected".format(category)])
exp_labels = [label if label == "O" else label[2:] for label in exp_labels]
pred_perso_ann.insert(len(pred_perso_ann.columns), "label_{}_expected".format(category), exp_labels)
# pred_perso_ann.head()
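
A minimal sketch of the tag-to-label generalization, applied to the eight BIO tags in `pers_o_label_subset`. The helper name `bio_to_label` is hypothetical, and the sketch assumes every non-`O` tag starts with a two-character `B-` or `I-` prefix, as the slicing in the cell above does.

```python
def bio_to_label(tag):
    """Drop the 'B-' or 'I-' prefix from a BIO tag; leave 'O' unchanged."""
    return tag if tag == "O" else tag[2:]

# The BIO tags this model predicts (same values as pers_o_label_subset)
demo_tags = [
    "B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine",
    "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation",
]

# Deduplicate while preserving order: each B-/I- pair collapses to one label
demo_label_names = list(dict.fromkeys(bio_to_label(tag) for tag in demo_tags))
print(demo_label_names)
```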

Group the data by annotation:

In [143]:
pred_perso_ann = pred_perso_ann.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_perso_ann = utils.implodeDataFrame(pred_perso_ann, ["sentence_id", "ann_id"])
pred_perso_ann = pred_perso_ann.reset_index()
# pred_perso_ann.head()

Record the agreements and disagreements:

In [144]:
agmt_types_perso, agmt_labels_perso = utils1.getAnnotationAgreement(pred_perso_ann, "label_pers_o_predicted", "label_pers_o_expected")
pred_perso_ann.insert(len(pred_perso_ann.columns), "annotation_agreement", agmt_types_perso)
pred_perso_ann.insert(len(pred_perso_ann.columns), "agreement_label", agmt_labels_perso)
# pred_perso_ann.head()

In [145]:
metrics_perso_all = utils1.getAnnotationAgreementMetrics(pred_perso_ann, "all")
metrics_perso_pn = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[~(pred_perso_ann.agreement_label.isin(["Occupation","O"]))], "Person Name")
metrics_perso_unk = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Unknown"], "Unknown")
metrics_perso_fem = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Feminine"], "Feminine")
metrics_perso_mas = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Masculine"], "Masculine")
metrics_perso_occ = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Occupation"], "Occupation")
metrics_perso = pd.concat([metrics_perso_all, metrics_perso_pn, metrics_perso_unk, metrics_perso_fem, metrics_perso_mas, metrics_perso_occ])
metrics_perso

Unnamed: 0,labels,false negative,true positive,false positive,precision,recall,f_1
0,all,10080,15546,5003,0.756533,0.606649,0.673351
0,Person Name,8551,13603,4324,0.7588,0.61402,0.678775
0,Unknown,3749,8755,2649,0.767713,0.700176,0.732391
0,Feminine,817,1636,593,0.733961,0.666938,0.698847
0,Masculine,3985,3212,1082,0.74802,0.446297,0.559046
0,Occupation,1529,1943,679,0.741037,0.55962,0.637676


Save the metrics:

In [146]:
metrics_perso.to_csv(
    config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_annot_agmt.csv".format(a=a, d=d, c=category)
)

#### Evaluate: Loose, Each Label

Generalize the tokens' BIO tags to the labels and calculate agreement scores for each label.

In [147]:
pred_perso_labels = pred_perso.drop(columns=["_merge"])
tag_exp = list(pred_perso_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_perso_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_perso_labels = pred_perso_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_perso_labels.insert(len(pred_perso_labels.columns), "label_{}_expected".format(category), label_exp)
pred_perso_labels.insert(len(pred_perso_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_perso_labels.loc[pred_perso_labels.label_pers_o_predicted == "Feminine"].head()  # Looks good
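
To illustrate the difference between strict and loose scoring, the sketch below checks one token under both schemes. It assumes, as in the cell above, that each token has a list of expected tags and a single predicted tag. The function `is_correct` is a hypothetical simplification and is not the implementation of `utils.isPredictedInExpected`.

```python
def to_label(tag):
    """Drop the 'B-' or 'I-' prefix from a BIO tag; leave 'O' unchanged."""
    return tag if tag == "O" else tag[2:]

def is_correct(expected_tags, predicted_tag, loose=False):
    """Check whether the predicted tag appears among the expected tags.

    With loose=True, BIO prefixes are stripped from both sides first,
    so only the label name has to match.
    """
    if loose:
        expected_tags = [to_label(tag) for tag in expected_tags]
        predicted_tag = to_label(predicted_tag)
    return predicted_tag in expected_tags

# Hypothetical token: annotated as inside an Unknown name, but the model
# predicted the beginning of an Unknown name
demo_expected = ["I-Unknown"]
demo_predicted = "B-Unknown"
strict_result = is_correct(demo_expected, demo_predicted)
loose_result = is_correct(demo_expected, demo_predicted, loose=True)
print(strict_result, loose_result)
```

A `B-`/`I-` confusion like this one is an error under strict scoring but a match under loose scoring, which is consistent with the loose scores being higher than the strict per-tag scores.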

Calculate the agreement metrics at the label level for each token:

In [148]:
tags = ['Unknown', 'Feminine', 'Masculine', 'Occupation']
pred_perso_labels = utils.isPredictedInExpected(pred_perso_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')

pred_perso_stats = utils.getScoresByCatTags(
    pred_perso_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_perso_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_perso_stats = pd.concat([pred_perso_stats, tag_stats])
pred_perso_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Unknown,6017,5487,13030,0.703678,0.684097,0.693749
0,Feminine,422,874,2356,0.729412,0.848092,0.784288
0,Masculine,2159,1942,3731,0.657677,0.633447,0.645334
0,Occupation,2200,1381,3612,0.723413,0.621473,0.668579


Save the combined performance measures:

In [149]:
pred_perso_stats.to_csv(
    config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_loose_agmt.csv".format(a=a, d=d, c=category)
)

*Past error analysis of the Person Name document classifiers suggests that the manual annotations mix up 'Masculine' and 'Unknown': according to the annotation descriptions, many names labeled 'Masculine' should have been labeled 'Unknown.'*