# Experiment 1, Model 1

#### Model Setup

Run the models in the following order, using each model's output labels as features for the next model:

1. Multilabel Linguistic Classifier
2. Multiclass Person Name + Occupation Sequence Classifier
3. Multilabel Stereotype + Omission Document Classifier

Train the first model and then run it over the entire dataset.
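The chaining described above can be sketched as follows. This is a minimal illustration in plain Python, assuming that one model's predicted labels are appended to the next model's features as 0/1 indicator columns. The function name `append_label_features` and the toy values are hypothetical and are not part of `utils` or `utils1`.

```python
def append_label_features(feature_rows, predicted_label_rows):
    """Append one model's 0/1 label predictions to each token's feature vector."""
    if len(feature_rows) != len(predicted_label_rows):
        raise ValueError("Features and predictions must describe the same tokens")
    return [
        list(features) + list(labels)
        for features, labels in zip(feature_rows, predicted_label_rows)
    ]

# Toy data: 3 tokens with 4-dimensional embeddings (hypothetical values)
X_tokens = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
# Hypothetical output of the first model: one column per Linguistic label
ling_predictions = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]

X_next = append_label_features(X_tokens, ling_predictions)
print(len(X_next), len(X_next[0]))  # 3 7
```

The next model in the sequence would then be trained on `X_next` rather than on the embeddings alone.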

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
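As background on the "word2vec with subwords" description above: fastText builds a word's vector from the vectors of its character n-grams, taken from the word wrapped in boundary markers `<` and `>`. The sketch below only extracts those n-grams. The function name `char_ngrams` is hypothetical, and the n-gram range used for the custom embeddings is not stated here, so the range of 3 to 6 is an assumption based on the library defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because rare or unseen words still share n-grams with known words, they can receive a meaningful vector, which is useful for the varied vocabulary of catalog descriptions.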
    
***

**Table of Contents**

[I.](#i) Train the Linguistic Classifier

[II.](#ii) Predict Over All Data

Load programming resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For multilabel token classification
import sklearn.metrics
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier

Define resources for the models:

In [2]:
Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
Path(config.experiment1_output_path).mkdir(parents=True, exist_ok=True)  # For predictions
Path(config.experiment1_agmt_path).mkdir(parents=True, exist_ok=True)    # For agreement metrics

In [3]:
# Model 1:
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# Model 2:
pers_o_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]
# Model 3:
so_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]

In [4]:
ling_label_tags = {
    "Gendered-Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered-Role": ["B-Gendered-Role", "I-Gendered-Role"],"Generalization": ["B-Generalization", "I-Generalization"]
    }
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
     "Occupation": ["B-Occupation", "I-Occupation"]
    }
so_label_tags = {
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"]
             }
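The tag lists and label-to-tag dictionaries above follow the same IOB pattern, so they could also be derived from the label names. A minimal sketch, assuming every label has exactly one `B-` and one `I-` tag; the helper `tags_for_labels` and the `_sketch` variables are hypothetical:

```python
def tags_for_labels(labels):
    """Map each label name to its B- (begin) and I- (inside) tags."""
    return {label: ["B-" + label, "I-" + label] for label in labels}

ling_label_tags_sketch = tags_for_labels(
    ["Gendered-Pronoun", "Gendered-Role", "Generalization"]
)
# Flatten the dictionary values to get the list of tags for the model
ling_label_subset_sketch = [
    tag for tags in ling_label_tags_sketch.values() for tag in tags
]
print(ling_label_tags_sketch["Gendered-Role"])
# ['B-Gendered-Role', 'I-Gendered-Role']
```

Deriving both structures from one list of names keeps them from drifting apart when a label is added or renamed.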

In [5]:
d = 100  # dimensions of word embeddings (should match utils1.py)

<a id="i"></a>
## I. Train the Linguistic Classifier

Run a multilabel classifier on the train set of the data, focusing only on applying the Linguistic category of labels: Gendered Pronoun, Gendered Role, and Generalization.

Use a Classifier Chain with Random Forest, as this was the highest-performing multilabel model setup from previous algorithm experiments for the Linguistic labels.

For this experiment, we'll train the model on 40% of the data, rather than 60%. We'll use fastText embeddings of 100 dimensions, as in that highest-performing setup.

#### Preprocessing

In [6]:
train_df = pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", index_col=0)
dev_df = pd.read_csv(config.tokc_path+"experiment_input/token_validate.csv", index_col=0)
test_df = pd.read_csv(config.tokc_path+"experiment_input/token_test.csv", index_col=0)
ling_train = utils1.selectDataForLabels(train_df, "tag", ling_label_subset)
ling_dev = utils1.selectDataForLabels(dev_df, "tag", ling_label_subset)
ling_test = utils1.selectDataForLabels(test_df, "tag", ling_label_subset)

In [7]:
train_data = utils1.loadData(ling_train)
dev_data = utils1.loadData(ling_dev)
test_data = utils1.loadData(ling_test)
print(train_data.shape, dev_data.shape, test_data.shape)
train_data.head()

(298617, 5) (305924, 5) (148980, 5)


| | sentence_id | token_id | token | pos | tag |
|---|---|---|---|---|---|
| 0 | 2 | 16 | Scope | NN | [O] |
| 1 | 2 | 17 | and | CC | [O] |
| 2 | 2 | 18 | Contents | NNS | [O] |
| 3 | 2 | 19 | : | : | [O] |
| 4 | 2 | 20 | Sermons | NNS | [O] |


In [8]:
assert train_data.shape[0] == len(train_data.token_id.unique())
assert dev_data.shape[0] == len(dev_data.token_id.unique())
assert test_data.shape[0] == len(test_data.token_id.unique())

Combine the train, validation, and test data so that the trained model can later be run over the entire dataset:

In [9]:
all_data = pd.concat([train_data, dev_data, test_data])

Create feature matrices:

In [85]:
train_tokens = utils1.zipTokensFeatures(train_data)
dev_tokens = utils1.zipTokensFeatures(dev_data)
all_tokens = utils1.zipTokensFeatures(all_data)
X_train = utils1.makeFastTextFeatureMatrix(train_tokens)
X_dev = utils1.makeFastTextFeatureMatrix(dev_tokens)
X_all = utils1.makeFastTextFeatureMatrix(all_tokens)

Binarize targets:

In [86]:
mlb, y_train = utils1.binarizeTrainTargets(train_data)
y_dev = utils1.binarizeDevTargets(mlb, dev_data)
y_all = utils1.binarizeDevTargets(mlb, all_data)
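Binarizing turns each token's list of tags into a row of 0/1 indicators, with the columns fixed by the training data and reused for every other split. The sketch below is a hand-rolled stand-in for that idea, not the code in `utils1`. The function names are hypothetical, and treating `O` as "no label" rather than as its own column is an assumption.

```python
def fit_classes(tag_lists, no_tag="O"):
    """Collect the sorted tags seen in training, skipping the no-tag value."""
    return sorted({tag for tags in tag_lists for tag in tags if tag != no_tag})

def binarize(tag_lists, classes):
    """Turn each token's tag list into a row of 0/1 indicators, one per class."""
    return [[1 if cls in tags else 0 for cls in classes] for tags in tag_lists]

# Hypothetical tag lists for three training tokens and three validation tokens
train_tags = [["O"], ["B-Gendered-Pronoun"], ["B-Gendered-Role", "B-Generalization"]]
dev_tags = [["B-Generalization"], ["I-Gendered-Role"], ["O"]]

classes = fit_classes(train_tags)           # columns come from training data only
y_train_sketch = binarize(train_tags, classes)
y_dev_sketch = binarize(dev_tags, classes)  # same columns reused for validation
print(classes)
print(y_dev_sketch)
```

In this sketch a tag that never occurs in training, such as `I-Gendered-Role` here, has no column and is dropped from the validation targets. scikit-learn's `MultiLabelBinarizer` handles unseen classes in a similar way and emits a warning.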

#### Train & Predict

In [87]:
a = "rf"
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)

In [88]:
predictions = clf.predict(X_dev)

#### Evaluate: Strict, All Labels

In [89]:
print("Precision - macro:", sklearn.metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print("Recall - macro:", sklearn.metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print("F1 Score - macro:", sklearn.metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print("Accuracy - normalized:", sklearn.metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", sklearn.metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - macro: 0.46238272522632073
Recall - macro: 0.42298863525280694
F1 Score - macro: 0.42251820432443793
Accuracy - normalized: 0.9925798564349315
Accuracy - unnormalized: 303654
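The gap between the high accuracy and the much lower macro scores is typical when most tokens carry no label. For multilabel targets, scikit-learn's `accuracy_score` counts a token as correct only when its whole label row matches, and rows of all zeros are easy to get right. Macro averaging gives every label column equal weight, so a label that is never predicted contributes a score of 0. The toy example below illustrates this in plain Python; the data and function names are hypothetical.

```python
def column_f1(true_col, pred_col):
    """F1 for one label column, returning 0.0 where a ratio is undefined."""
    tp = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_col, pred_col) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = [
        column_f1([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label row is predicted exactly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Ten tokens, two labels: eight tokens have no label at all
y_true_toy = [[1, 0], [0, 1]] + [[0, 0] for _ in range(8)]
y_pred_toy = [[1, 0], [0, 0]] + [[0, 0] for _ in range(8)]

print(subset_accuracy(y_true_toy, y_pred_toy))  # 0.9
print(macro_f1(y_true_toy, y_pred_toy))         # 0.5
```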


In [90]:
print("Total samples:", X_dev.shape[0])

Total samples: 305924


In [91]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
assert pred_df.loc[pred_df.predicted_tag.isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"
pred_df.head()

| | sentence_id | token_id | token | pos | predicted_tag |
|---|---|---|---|---|---|
| 0 | 1 | 3 | Title | NN | O |
| 1 | 1 | 4 | : | : | O |
| 2 | 1 | 5 | Papers | NNS | O |
| 3 | 1 | 6 | of | IN | O |
| 4 | 1 | 7 | The | DT | O |


In [92]:
print(pred_df.shape[0], len(pred_df.token_id.unique()))

306940 305924


Save the prediction data:

In [93]:
pred_df.to_csv(config.experiment1_output_path+"cc-{a}_ling_baseline_fastText{d}_predictions.csv".format(a=a,d=d))

#### Evaluate: Each Label

There are more predictions than unique tokens, because with multilabel classification, one token can have multiple predicted tags.
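A small sketch of why the row count exceeds the token count: when a token's list of tags is expanded to one row per tag, a token with two tags yields two rows. The helper `explode_tags` and the toy tokens are hypothetical; the helper mirrors what `DataFrame.explode` does for a list-valued column.

```python
def explode_tags(tokens):
    """Return one (token_id, tag) row per tag for each token."""
    return [(token_id, tag) for token_id, tags in tokens for tag in tags]

# Hypothetical tokens: the second one carries two tags
tokens = [(1, ["O"]), (2, ["B-Gendered-Role", "B-Generalization"]), (3, ["O"])]
rows = explode_tags(tokens)
unique_token_ids = {token_id for token_id, _ in rows}
print(len(rows), len(unique_token_ids))  # 4 3
```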

In [119]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
exp_df.head()

| | sentence_id | token_id | token | pos | expected_tag |
|---|---|---|---|---|---|
| 0 | 1 | 3 | Title | NN | O |
| 1 | 1 | 4 | : | : | O |
| 2 | 1 | 5 | Papers | NNS | O |
| 3 | 1 | 6 | of | IN | O |
| 4 | 1 | 7 | The | DT | O |


In [120]:
print(exp_df.shape[0], len(exp_df.token_id.unique()))

307597 305924


In [121]:
exp_pred_df = pd.merge(
    left=exp_df, 
    right=pred_df.loc[pred_df.predicted_tag != "O"], # only include the predictions of Linguistic labels
    how="outer",
    left_on=["sentence_id", "token_id", "token", "pos", "expected_tag"],
    right_on=["sentence_id", "token_id", "token", "pos", "predicted_tag"],
    suffixes=["", "_pred"],
    indicator=True
)
exp_pred_df.shape

(308371, 7)

Record the agreement type for each row, ignoring rows with `'O'` and `NaN` value pairs (the `true negative` agreement type, which doesn't go into the precision, recall, or F1 score calculations).

In [122]:
exp_col = "expected_tag"
pred_col = "predicted_tag"
no_tag_value = "O"
# Find true negatives based on the expected and predicted tags
sub_exp_pred_df = exp_pred_df.loc[exp_pred_df[exp_col] == no_tag_value]
sub_exp_pred_df = sub_exp_pred_df.loc[sub_exp_pred_df[pred_col].isna()]
# sub_exp_pred_df.replace(to_replace="left_only", value="true negative", inplace=True)
tn_tokens = list(sub_exp_pred_df["token_id"])

# Record false negatives, false positives, and true positives based on the merge values
eval_df = exp_pred_df.loc[~exp_pred_df["token_id"].isin(tn_tokens)]
eval_df = eval_df.replace(to_replace="left_only", value="false negative")
eval_df = eval_df.replace(to_replace="right_only", value="false positive")
eval_df = eval_df.replace(to_replace="both", value="true positive")
eval_df = eval_df.sort_index()
# eval_df.head()
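The idea behind the agreement types can be sketched with sets of `(token_id, tag)` pairs instead of a merge. A pair found on both sides corresponds to `both` (true positive), a pair only in the expected data to `left_only` (false negative), and a pair only in the predictions to `right_only` (false positive). This is an illustration under simplifying assumptions, not a reimplementation of the cell above: the function name is hypothetical, and in this sketch a tag predicted for a token whose expected tag is `O` counts as a false positive.

```python
def agreement_types(expected, predicted, no_tag="O"):
    """Classify each (token_id, tag) pair as a true/false positive or false negative."""
    expected_pairs = {(tok, tag) for tok, tag in expected if tag != no_tag}
    predicted_pairs = {(tok, tag) for tok, tag in predicted if tag != no_tag}
    result = {}
    for pair in expected_pairs | predicted_pairs:
        if pair in expected_pairs and pair in predicted_pairs:
            result[pair] = "true positive"
        elif pair in expected_pairs:
            result[pair] = "false negative"
        else:
            result[pair] = "false positive"
    return result

# Hypothetical tokens: token 3 is expected to be a role but predicted as a generalization
expected = [(1, "O"), (2, "B-Gendered-Pronoun"), (3, "B-Gendered-Role"), (4, "O")]
predicted = [(1, "O"), (2, "B-Gendered-Pronoun"), (3, "B-Generalization"), (4, "O")]

for pair, agreement in sorted(agreement_types(expected, predicted).items()):
    print(pair, agreement)
```

Pairs where both sides are `O` never enter either set, which matches the decision to leave true negatives out of the precision, recall, and F1 calculations.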

In [123]:
eval_df.shape

(2177, 7)

Save the data:

In [124]:
eval_df.to_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the true positives, false positives, false negatives, precision, recall, and F1 metrics for all tags and each tag individually.

In [125]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)
for label_tag in ling_label_subset:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

| | tag(s) | false negative | false positive | true positive | precision | recall | f1 |
|---|---|---|---|---|---|---|---|
| 0 | all | 476 | 7 | 1694 | 0.462383 | 0.422989 | 0.422518 |
| 0 | B-Generalization | 124 | 3 | 148 | 0.980132 | 0.544118 | 0.699764 |
| 0 | I-Generalization | 85 | 1 | 0 | 0.0 | 0.0 | 0.0 |
| 0 | B-Gendered-Role | 97 | 1 | 438 | 0.997722 | 0.818692 | 0.899384 |
| 0 | I-Gendered-Role | 132 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 0 | B-Gendered-Pronoun | 15 | 2 | 2802 | 0.999287 | 0.994675 | 0.996976 |
| 0 | I-Gendered-Pronoun | 23 | 0 | 0 | 0.0 | 0.0 | 0.0 |
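The per-tag scores follow directly from the counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below reproduces the `B-Generalization` row from its counts. The function `precision_recall_f1` is a hypothetical stand-in written for this illustration, and returning 0.0 for undefined ratios is an assumption that matches the zero rows for the `I-` tags.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from agreement counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Counts for B-Generalization on the validation data
p, r, f1 = precision_recall_f1(tp=148, fp=3, fn=124)
print(round(p, 6), round(r, 6), round(f1, 6))  # 0.980132 0.544118 0.699764
```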


Save the data:

In [126]:
agmt_stats.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_strict_agmt.csv".format(a=a,d=d))

##### Annotation-level Agreement

Join the manual annotations' offsets to the evaluation data:

In [127]:
annot_df = pd.read_csv(config.agg_path+"aggregated_final.csv")#, usecols=["description_id","agg_ann_id", "ann_offsets"])
# Get only the Linguistic annotations
annot_df = annot_df.loc[annot_df.category == "Linguistic"]
annot_df = annot_df[["agg_ann_id", "ann_offsets", "label"]]
annot_df = annot_df.rename(columns={"agg_ann_id":"ann_id"})
# annot_df.head()

In [128]:
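# Select the annotated tokens that appear in the devtest data
dev_token_ids = list(dev_data.token_id.unique())
ling_dev_subset = ling_dev.loc[ling_dev.token_id.isin(dev_token_ids)]
to_add = ling_dev_subset[["ann_id", "token_id", "token_offsets"]]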
# Only include annotations with Linguistic labels
to_add = to_add.loc[to_add.ann_id.isin(list(annot_df.ann_id))]
eval_df_joined = eval_df.join(to_add.set_index("token_id"), on="token_id", how="outer")
# Join on the left, as there will be annotations from outside the devtest set in annot_df
print(eval_df_joined.shape)
eval_df_joined = eval_df_joined.join(annot_df.set_index("ann_id"), on="ann_id", how="left")
print(eval_df_joined.shape)  # Looks good!  Same as before join.
eval_df_joined = eval_df_joined.rename(columns={"label":"expected_label"})
eval_df_joined.head()

(2337, 9)
(2337, 11)


| | sentence_id | token_id | token | pos | expected_tag | predicted_tag | _merge | ann_id | token_offsets | ann_offsets | expected_label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 5 | 155 | his | PRP$ | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive | 14379 | (913, 916) | (913, 916) | Gendered-Pronoun |
| 41 | 5 | 157 | he | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive | 14380 | (928, 930) | (928, 930) | Gendered-Pronoun |
| 62 | 7 | 216 | His | PRP$ | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive | 14382 | (1241, 1244) | (1241, 1244) | Gendered-Pronoun |
| 72 | 7 | 226 | he | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive | 14383 | (1315, 1317) | (1315, 1317) | Gendered-Pronoun |
| 218 | 16 | 435 | He | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive | 9516 | (677, 679) | (677, 679) | Gendered-Pronoun |


Replace the predicted tags with their corresponding labels:

In [129]:
eval_df_joined.expected_label = eval_df_joined.expected_label.fillna("no_label")
eval_df_joined.expected_tag = eval_df_joined.expected_tag.fillna("no_label")
eval_df_joined.predicted_tag = eval_df_joined.predicted_tag.fillna("no_label")
# eval_df_joined.predicted_tag.value_counts()

In [130]:
predicted_labels = list(eval_df_joined.predicted_tag)
predicted_labels = [tag[2:] if tag != "no_label" else tag for tag in predicted_labels]
eval_df_joined.insert(len(eval_df_joined.columns), "predicted_label", predicted_labels)
# eval_df_joined.head()

In [131]:
cols_to_keep = ["sentence_id", "token_id", "token", "expected_label", "predicted_label", "_merge", "token_offsets", "ann_offsets", "ann_id"]
eval_by_ann = utils.implodeDataFrame(eval_df_joined[cols_to_keep], ["ann_id", "ann_offsets"]).reset_index()
exp_labels = list(eval_by_ann["expected_label"])
exp_labels = [labels[0] for labels in exp_labels]
eval_by_ann["expected_label"] = exp_labels
# eval_by_ann.head()

In [132]:
assert eval_by_ann.loc[eval_by_ann.expected_label == "no_label"].shape[0] == 0
assert eval_by_ann.loc[eval_by_ann.expected_label.isna()].shape[0] == 0

Every row should have an annotation label (a Linguistic label in `expected_label`).

In [133]:
ann_agmts = []
token_agmts = (eval_by_ann["_merge"])
for agmts in token_agmts:
    if "true positive" in agmts:
        ann_agmt = "true positive"
    elif "false positive" in agmts:
        ann_agmt = "false positive"
    else:
        ann_agmt = "false negative"
    ann_agmts += [ann_agmt]
assert len(ann_agmts) == eval_by_ann.shape[0]
eval_by_ann.insert(len(eval_by_ann.columns), "annotation_agreement", ann_agmts)
eval_by_ann.head()

| | ann_id | ann_offsets | sentence_id | token_id | token | expected_label | predicted_label | _merge | token_offsets | annotation_agreement |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | (1407, 1415) | [5760, 5760] | [133674, 133674] | [knighted, knighted] | Gendered-Role | [no_label, Generalization] | [false negative, false positive] | [(1407, 1415), (1407, 1415)] | false positive |
| 1 | 1063 | (1532, 1535) | [12112] | [272117] | [his] | Gendered-Pronoun | [Gendered-Pronoun] | [true positive] | [(1532, 1535)] | true positive |
| 2 | 1064 | (1548, 1550) | [12113] | [272120] | [He] | Gendered-Pronoun | [Gendered-Pronoun] | [true positive] | [(1548, 1550)] | true positive |
| 3 | 1078 | (496, 500) | [12106] | [271938] | [male] | Gendered-Role | [no_label] | [false negative] | [(496, 500)] | false negative |
| 4 | 1079 | (505, 511) | [12106] | [271940] | [female] | Gendered-Role | [no_label] | [false negative] | [(505, 511)] | false negative |
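The annotation-level rule applied above can be written as a small function, which makes the priority order explicit: any true-positive token makes the whole annotation a true positive, otherwise any false-positive token makes it a false positive, and only annotations with nothing but false-negative tokens count as false negatives. The function name is hypothetical; the inputs are the `_merge` lists from the table above.

```python
def annotation_agreement(token_agreements):
    """Reduce a list of token-level agreement types to one annotation-level type."""
    if "true positive" in token_agreements:
        return "true positive"
    if "false positive" in token_agreements:
        return "false positive"
    return "false negative"

# Token-level agreement lists taken from the first rows of the table above
print(annotation_agreement(["false negative", "false positive"]))  # false positive
print(annotation_agreement(["true positive"]))                     # true positive
print(annotation_agreement(["false negative"]))                    # false negative
```

Under this rule an annotation spanning several tokens is credited as soon as one of its tokens is tagged correctly, so annotation-level recall can be higher than strict token-level recall.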


Save the data:

In [134]:
eval_by_ann[["ann_id", "ann_offsets", "token_id", "expected_label", "predicted_label", "annotation_agreement"]].to_csv(
    config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_annot_evaluation.csv".format(a=a,d=d)
)

Calculate annotation agreement metrics for each label:

In [135]:
annot_agmt = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [136]:
labels = ling_label_tags.keys()
for label in labels:
    agmt_df = eval_by_ann.loc[eval_by_ann.expected_label == label]
    tp = agmt_df.loc[agmt_df.annotation_agreement == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df.annotation_agreement == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df.annotation_agreement == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    annot_agmt = pd.concat([annot_agmt, label_agmt])
annot_agmt

| | label | false negative | false positive | true positive | precision | recall | f1 |
|---|---|---|---|---|---|---|---|
| 0 | Gendered-Pronoun | 19.0 | 0.0 | 1401.0 | 1.0 | 0.98662 | 0.993265 |
| 0 | Gendered-Role | 103.0 | 4.0 | 230.0 | 0.982906 | 0.690691 | 0.811287 |
| 0 | Generalization | 77.0 | 3.0 | 124.0 | 0.976378 | 0.616915 | 0.756098 |


Save the scores:

In [137]:
# eval_by_ann.loc[eval_by_ann.expected_label == "Gendered-Role"]["annotation_agreement"].value_counts()     # Looks good
# eval_by_ann.loc[eval_by_ann.expected_label == "Gendered-Pronoun"]["annotation_agreement"].value_counts()  # Looks good
# eval_by_ann.loc[eval_by_ann.expected_label == "Generalization"]["annotation_agreement"].value_counts()    # Looks good
annot_agmt.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_annot_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [138]:
a = "rf"
eval_df = pd.read_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation.csv".format(a=a,d=d), index_col=0)
loose_eval_df = eval_df.copy()
for label,tags in ling_label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()
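Loose agreement compares label names rather than full IOB tags, so a prediction of `B-Generalization` for a token annotated `I-Generalization` counts as a match. A minimal sketch of that comparison, with hypothetical helper names:

```python
def tag_to_label(tag, no_tag="O"):
    """Strip the B-/I- prefix from an IOB tag, leaving other values unchanged."""
    if tag == no_tag:
        return tag
    prefix, _, label = tag.partition("-")
    return label if prefix in ("B", "I") else tag

def loose_match(expected_tag, predicted_tag):
    """True when both tags name the same label, ignoring the IOB prefix."""
    return tag_to_label(expected_tag) == tag_to_label(predicted_tag)

print(tag_to_label("B-Gendered-Pronoun"))                   # Gendered-Pronoun
print(loose_match("I-Generalization", "B-Generalization"))  # True
print(loose_match("B-Gendered-Role", "B-Generalization"))   # False
```

Splitting on the first hyphen only keeps hyphenated label names such as `Gendered-Pronoun` intact.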

In [139]:
loose_eval_df.loc[loose_eval_df.predicted_tag.isna()].shape

(476, 7)

In [140]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
loose_eval_df.head()

| | sentence_id | token_id | token | pos | expected_tag | predicted_tag | _merge |
|---|---|---|---|---|---|---|---|
| 39 | 5 | 155 | his | PRP$ | Gendered-Pronoun | Gendered-Pronoun | true positive |
| 41 | 5 | 157 | he | PRP | Gendered-Pronoun | Gendered-Pronoun | true positive |
| 62 | 7 | 216 | His | PRP$ | Gendered-Pronoun | Gendered-Pronoun | true positive |
| 72 | 7 | 226 | he | PRP | Gendered-Pronoun | Gendered-Pronoun | true positive |
| 218 | 16 | 435 | He | PRP | Gendered-Pronoun | Gendered-Pronoun | true positive |


In [141]:
loose_eval_df.to_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation_loose.csv".format(a=a,d=d))

In [142]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [143]:
for label,tags in ling_label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

| | tag(s) | false negative | false positive | true positive | precision | recall | f1 |
|---|---|---|---|---|---|---|---|
| 0 | Gendered-Pronoun | 38.0 | 0.0 | 2802.0 | 1.0 | 0.98662 | 0.993265 |
| 0 | Gendered-Role | 229.0 | 0.0 | 438.0 | 1.0 | 0.656672 | 0.79276 |
| 0 | Generalization | 209.0 | 0.0 | 148.0 | 1.0 | 0.414566 | 0.586139 |


Great!  The performance of this model looks comparable to the model trained on 60% of the data.

Save the data:

In [144]:
loose_agmt.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_loose_agmt.csv".format(a=a,d=d))

<a id="ii"></a>
## II. Predict Over All Data

In [145]:
all_predictions = clf.predict(X_all)

In [146]:
pred_df = utils.makePredictionDF(all_predictions, all_data, "tag", "predicted_tag", "O", mlb)
assert pred_df.loc[pred_df.predicted_tag.isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"
pred_df.head()

| | sentence_id | token_id | token | pos | predicted_tag |
|---|---|---|---|---|---|
| 0 | 2 | 16 | Scope | NN | O |
| 1 | 2 | 17 | and | CC | O |
| 2 | 2 | 18 | Contents | NNS | O |
| 3 | 2 | 19 | : | : | O |
| 4 | 2 | 20 | Sermons | NNS | O |


In [147]:
print(pred_df.shape[0], len(pred_df.token_id.unique()))

755963 753521


Save the prediction data:

In [148]:
pred_df.to_csv(config.experiment1_output_path+"cc-{a}_ling_baseline_fastText{d}_predictions_ALLDATA.csv".format(a=a,d=d))

#### Evaluate: Strict, All Labels

In [149]:
print("Precision - macro:", sklearn.metrics.precision_score(y_all, all_predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print("Recall - macro:", sklearn.metrics.recall_score(y_all, all_predictions, average="macro", zero_division=0))
print("F1 Score - macro:", sklearn.metrics.f1_score(y_all, all_predictions, average="macro", zero_division=0))
print("Accuracy - normalized:", sklearn.metrics.accuracy_score(y_all, all_predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", sklearn.metrics.accuracy_score(y_all, all_predictions, normalize=False))  # number of correctly classified samples

Precision - macro: 0.5920109983470005
Recall - macro: 0.44067573805476495
F1 Score - macro: 0.44051951753599433
Accuracy - normalized: 0.9929769707811726
Accuracy - unnormalized: 748229


In [150]:
print("Total samples:", X_all.shape[0])

Total samples: 753521


#### Evaluate: Each Label

There are more predictions than unique tokens, because with multilabel classification, one token can have multiple predicted tags.

In [151]:
exp_df = all_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
exp_df.head()

| | sentence_id | token_id | token | pos | expected_tag |
|---|---|---|---|---|---|
| 0 | 2 | 16 | Scope | NN | O |
| 1 | 2 | 17 | and | CC | O |
| 2 | 2 | 18 | Contents | NNS | O |
| 3 | 2 | 19 | : | : | O |
| 4 | 2 | 20 | Sermons | NNS | O |


In [152]:
print(exp_df.shape[0], len(exp_df.token_id.unique()))

757416 753521


In [153]:
exp_pred_df = pd.merge(
    left=exp_df, 
    right=pred_df.loc[pred_df.predicted_tag != "O"], # only include the predictions of Linguistic labels
    how="outer",
    left_on=["sentence_id", "token_id", "token", "pos", "expected_tag"],
    right_on=["sentence_id", "token_id", "token", "pos", "predicted_tag"],
    suffixes=["", "_pred"],
    indicator=True
)
exp_pred_df.shape

(759346, 7)

Record the agreement type for each row, ignoring rows with `'O'` and `NaN` value pairs (the `true negative` agreement type, which doesn't go into the precision, recall, or F1 score calculations).

In [154]:
exp_col = "expected_tag"
pred_col = "predicted_tag"
no_tag_value = "O"
# Find true negatives based on the expected and predicted tags
sub_exp_pred_df = exp_pred_df.loc[exp_pred_df[exp_col] == no_tag_value]
sub_exp_pred_df = sub_exp_pred_df.loc[sub_exp_pred_df[pred_col].isna()]
# sub_exp_pred_df.replace(to_replace="left_only", value="true negative", inplace=True)
tn_tokens = list(sub_exp_pred_df["token_id"])

# Record false negatives, false positives, and true positives based on the merge values
eval_df = exp_pred_df.loc[~exp_pred_df["token_id"].isin(tn_tokens)]
eval_df = eval_df.replace(to_replace="left_only", value="false negative")
eval_df = eval_df.replace(to_replace="right_only", value="false positive")
eval_df = eval_df.replace(to_replace="both", value="true positive")
eval_df = eval_df.sort_index()
eval_df.head()

| | sentence_id | token_id | token | pos | expected_tag | predicted_tag | _merge |
|---|---|---|---|---|---|---|---|
| 129 | 19 | 533 | he | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive |
| 135 | 19 | 539 | he | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive |
| 261 | 37 | 960 | he | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive |
| 274 | 37 | 973 | he | PRP | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive |
| 292 | 39 | 1002 | his | PRP$ | B-Gendered-Pronoun | B-Gendered-Pronoun | true positive |


In [155]:
eval_df.shape

(5264, 7)

Save the data:

In [156]:
eval_df.to_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation_ALLDATA.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the true positives, false positives, false negatives, precision, recall, and F1 metrics for all tags and each tag individually.

In [157]:
# Use the all-data targets and predictions (not the devtest ones) for the "all" row
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_all, all_predictions)
for label_tag in ling_label_subset:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

| | tag(s) | false negative | false positive | true positive | precision | recall | f1 |
|---|---|---|---|---|---|---|---|
| 0 | all | 1045 | 16 | 4203 | 0.592011 | 0.440676 | 0.440520 |
| 0 | B-Generalization | 261 | 10 | 444 | 0.977974 | 0.629787 | 0.766178 |
| 0 | I-Generalization | 221 | 2 | 2 | 0.5 | 0.008969 | 0.017621 |
| 0 | B-Gendered-Role | 177 | 1 | 1178 | 0.999152 | 0.869373 | 0.929755 |
| 0 | I-Gendered-Role | 319 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 0 | B-Gendered-Pronoun | 17 | 3 | 6782 | 0.999558 | 0.9975 | 0.998528 |
| 0 | I-Gendered-Pronoun | 50 | 0 | 0 | 0.0 | 0.0 | 0.0 |


Save the data:

In [158]:
agmt_stats.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_strict_agmt_ALLDATA.csv".format(a=a,d=d))

##### Annotation-level Agreement

Join the manual annotations' offsets to the evaluation data:

In [159]:
annot_df = pd.read_csv(config.agg_path+"aggregated_final.csv")#, usecols=["description_id","agg_ann_id", "ann_offsets"])
# Get only the Linguistic annotations
annot_df = annot_df.loc[annot_df.category == "Linguistic"]
annot_df = annot_df[["agg_ann_id", "ann_offsets", "label"]]
annot_df = annot_df.rename(columns={"agg_ann_id":"ann_id"})
# annot_df.head()

In [160]:
dev_token_ids = list(dev_data.token_id.unique())
ling_dev_subset = ling_dev.loc[ling_dev.token_id.isin(dev_token_ids)]

In [161]:
to_add = ling_dev_subset[["ann_id", "token_id", "token_offsets"]]
# Only include annotations with Linguistic labels
to_add = to_add.loc[to_add.ann_id.isin(list(annot_df.ann_id))]
eval_df_joined = eval_df.join(to_add.set_index("token_id"), on="token_id", how="outer")
# Join on the left, as there will be annotations from outside the devtest set in annot_df
print(eval_df_joined.shape)
eval_df_joined = eval_df_joined.join(annot_df.set_index("ann_id"), on="ann_id", how="left")
print(eval_df_joined.shape)  # Looks good!  Same as before join.
eval_df_joined = eval_df_joined.rename(columns={"label":"expected_label"})
# eval_df_joined.head()

(7083, 9)
(7083, 11)


Replace the predicted tags with their corresponding labels:

In [162]:
eval_df_joined.expected_label = eval_df_joined.expected_label.fillna("no_label")
eval_df_joined.expected_tag = eval_df_joined.expected_tag.fillna("no_label")
eval_df_joined.predicted_tag = eval_df_joined.predicted_tag.fillna("no_label")
# eval_df_joined.predicted_tag.value_counts()

In [163]:
predicted_labels = list(eval_df_joined.predicted_tag)
predicted_labels = [tag[2:] if tag != "no_label" else tag for tag in predicted_labels]
eval_df_joined.insert(len(eval_df_joined.columns), "predicted_label", predicted_labels)
# eval_df_joined.head()

In [164]:
cols_to_keep = ["sentence_id", "token_id", "token", "expected_label", "predicted_label", "_merge", "token_offsets", "ann_offsets", "ann_id"]
eval_by_ann = utils.implodeDataFrame(eval_df_joined[cols_to_keep], ["ann_id", "ann_offsets"]).reset_index()
exp_labels = list(eval_by_ann["expected_label"])
exp_labels = [labels[0] for labels in exp_labels]
eval_by_ann["expected_label"] = exp_labels
# eval_by_ann.head()

In [165]:
assert eval_by_ann.loc[eval_by_ann.expected_label == "no_label"].shape[0] == 0
assert eval_by_ann.loc[eval_by_ann.expected_label.isna()].shape[0] == 0

Every row should have an annotation label (a Linguistic label in `expected_label`).

In [166]:
ann_agmts = []
token_agmts = (eval_by_ann["_merge"])
for agmts in token_agmts:
    if "true positive" in agmts:
        ann_agmt = "true positive"
    elif "false positive" in agmts:
        ann_agmt = "false positive"
    else:
        ann_agmt = "false negative"
    ann_agmts += [ann_agmt]
assert len(ann_agmts) == eval_by_ann.shape[0]
eval_by_ann.insert(len(eval_by_ann.columns), "annotation_agreement", ann_agmts)
# eval_by_ann.head()

Save the data:

In [167]:
eval_by_ann[["ann_id", "ann_offsets", "token_id", "expected_label", "predicted_label", "annotation_agreement"]].to_csv(
    config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_annot_evaluation_ALLDATA.csv".format(a=a,d=d)
)

Calculate annotation agreement metrics for each label:

In [168]:
annot_agmt = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [169]:
labels = ling_label_tags.keys()
for label in labels:
    agmt_df = eval_by_ann.loc[eval_by_ann.expected_label == label]
    tp = agmt_df.loc[agmt_df.annotation_agreement == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df.annotation_agreement == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df.annotation_agreement == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    annot_agmt = pd.concat([annot_agmt, label_agmt])
annot_agmt

| | label | false negative | false positive | true positive | precision | recall | f1 |
|---|---|---|---|---|---|---|---|
| 0 | Gendered-Pronoun | 131.0 | 0.0 | 1401.0 | 1.0 | 0.914491 | 0.955336 |
| 0 | Gendered-Role | 976.0 | 4.0 | 230.0 | 0.982906 | 0.190713 | 0.319444 |
| 0 | Generalization | 408.0 | 3.0 | 124.0 | 0.976378 | 0.233083 | 0.376328 |


Save the scores:

In [170]:
# eval_by_ann.loc[eval_by_ann.expected_label == "Gendered-Role"]["annotation_agreement"].value_counts()     # Looks good
# eval_by_ann.loc[eval_by_ann.expected_label == "Gendered-Pronoun"]["annotation_agreement"].value_counts()  # Looks good
# eval_by_ann.loc[eval_by_ann.expected_label == "Generalization"]["annotation_agreement"].value_counts()    # Looks good
annot_agmt.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_annot_agmt_ALLDATA.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [171]:
a = "rf"
eval_df = pd.read_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation_ALLDATA.csv".format(a=a,d=d), index_col=0)
loose_eval_df = eval_df.copy()
for label,tags in ling_label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [172]:
loose_eval_df.loc[loose_eval_df.predicted_tag.isna()].shape

(1045, 7)

In [173]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [174]:
loose_eval_df.to_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation_loose_ALLDATA.csv".format(a=a,d=d))

In [175]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [176]:
for label,tags in ling_label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

| | tag(s) | false negative | false positive | true negative | true positive | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| 0 | Gendered-Pronoun | 67.0 | 0.0 | NaN | 6782.0 | 1.0 | 0.990218 | 0.995085 |
| 0 | Gendered-Role | 496.0 | 0.0 | NaN | 1178.0 | 1.0 | 0.703704 | 0.826087 |
| 0 | Generalization | 482.0 | 0.0 | NaN | 446.0 | 1.0 | 0.480603 | 0.649199 |


Save the data:

In [177]:
loose_agmt.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_loose_agmt_ALLDATA.csv".format(a=a,d=d))