# Experiment 1

#### Model Setup

Run the models in the following order, using each model's output labels as features for the next model:

[1.](#1) Multilabel Linguistic Classifier

[2.](#2) Multiclass Person Name + Occupation Sequence Classifier

[3.](#3) Multilabel Document Classifier

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
    
***

Load programming resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For multilabel token classification
import sklearn.metrics
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# For multiclass sequence classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

Define resources for the models:

In [2]:
Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
Path(config.experiment1_output_path).mkdir(parents=True, exist_ok=True)  # For predictions
Path(config.experiment1_agmt_path).mkdir(parents=True, exist_ok=True)    # For agreement metrics

In [3]:
# Model 1:
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# Model 2:
pers_o_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]
# Model 3:
so_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]

In [4]:
ling_label_tags = {
    "Gendered-Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"],
    "Gendered-Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"],
}
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"],
    "Feminine": ["B-Feminine", "I-Feminine"],
    "Masculine": ["B-Masculine", "I-Masculine"],
    "Occupation": ["B-Occupation", "I-Occupation"],
}
so_label_tags = {
    "Stereotype": ["B-Stereotype", "I-Stereotype"],
    "Omission": ["B-Omission", "I-Omission"],
}

In [5]:
d = 100  # dimensions of word embeddings (should match utils1.py)

<a id="1"></a>
## 1. Linguistic Classifier

Train a multilabel classifier on the training set, applying only the Linguistic category of labels: Gendered Pronoun, Gendered Role, and Generalization.

Use a Classifier Chain with Random Forest, as this was the highest-performing multilabel model setup from previous algorithm experiments for the Linguistic labels.

For this experiment, we'll train the model on 40% of the data, rather than 60%.  We'll use fastText embeddings of 100 dimensions, matching the embeddings used in that earlier highest-performing setup.

#### Preprocessing

In [5]:
train_df = pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", index_col=0)
dev_df = pd.read_csv(config.tokc_path+"experiment_input/token_validate.csv", index_col=0)
ling_train, ling_dev = utils.selectDataForLabels(train_df, dev_df, "tag", ling_label_subset)

In [6]:
train_data = utils1.loadData(ling_train)
dev_data = utils1.loadData(ling_dev)
print(train_data.shape, dev_data.shape)
train_data.head()

(298617, 5) (305924, 5)


Unnamed: 0,sentence_id,token_id,token,pos,tag
0,2,16,Scope,NN,[O]
1,2,17,and,CC,[O]
2,2,18,Contents,NNS,[O]
3,2,19,:,:,[O]
4,2,20,Sermons,NNS,[O]


In [7]:
assert train_data.shape[0] == len(train_data.token_id.unique())
assert dev_data.shape[0] == len(dev_data.token_id.unique())

Create feature matrices:

In [8]:
train_tokens = utils1.zipTokensFeatures(train_data)
dev_tokens = utils1.zipTokensFeatures(dev_data)
X_train = utils1.makeFastTextFeatureMatrix(train_tokens)
X_dev = utils1.makeFastTextFeatureMatrix(dev_tokens)

Binarize targets:

In [9]:
mlb, y_train = utils1.binarizeTrainTargets(train_data)
y_dev = utils1.binarizeDevTargets(mlb, dev_data)

#### Train & Predict

In [10]:
a = "rf"  # algorithm abbreviation (Random Forest), used in output filenames
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)

In [11]:
predictions = clf.predict(X_dev)

#### Evaluate: Strict, All Labels

In [14]:
print("Precision - macro:", sklearn.metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print("Recall - macro:", sklearn.metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print("F1 Score - macro:", sklearn.metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print("Accuracy - normalized:", sklearn.metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", sklearn.metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - macro: 0.46238272522632073
Recall - macro: 0.42298863525280694
F1 Score - macro: 0.42251820432443793
Accuracy - normalized: 0.9925798564349315
Accuracy - unnormalized: 303654
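
The macro average gives every label equal weight, while exact-match accuracy counts every token equally, which is one way accuracy can be high while macro scores stay low. The following is a minimal pure-Python sketch of macro averaging over binary indicator rows. The function name `macro_scores` and the toy data are illustrative assumptions, not part of this notebook or of scikit-learn.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 for binary indicator rows.

    Undefined scores (division by zero) are set to 0, similar in spirit
    to passing zero_division=0 to the scikit-learn metric functions.
    """
    n_labels = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    # Unweighted mean over labels: a rare label counts as much as a common one
    return (sum(precisions) / n_labels, sum(recalls) / n_labels, sum(f1s) / n_labels)


# Toy data (hypothetical): 4 tokens, 3 labels; the third label is never predicted
toy_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
macro_p, macro_r, macro_f1 = macro_scores(toy_true, toy_pred)
exact_match = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(macro_p, macro_r, macro_f1, exact_match)
```

In this toy case, three of four rows match exactly, but the one missed label pulls each macro score down to two thirds.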


In [16]:
print("Total samples:", X_dev.shape[0])

Total samples: 305924


#### Evaluate: Each Label

In [17]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
assert pred_df.loc[pred_df.predicted_tag.isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,1,3,Title,NN,O
1,1,4,:,:,O
2,1,5,Papers,NNS,O
3,1,6,of,IN,O
4,1,7,The,DT,O


In [60]:
print(pred_df.shape[0], len(pred_df.token_id.unique()))

306940 305924


Save the prediction data:

In [83]:
pred_df.to_csv(config.experiment1_output_path+"cc-{a}_ling_baseline_fastText{d}_predictions.csv".format(a=a,d=d))

There are more predictions than unique tokens, because with multilabel classification, one token can have multiple predicted tags.
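
As a small illustration of why the row count can exceed the number of unique tokens, the sketch below expands each token's list of tags into one row per tag. The helper name `explode_tags` and the example rows are hypothetical.

```python
def explode_tags(rows):
    """Return one (token_id, token, tag) row per tag in each token's tag list."""
    exploded = []
    for token_id, token, tags in rows:
        for tag in tags:
            exploded.append((token_id, token, tag))
    return exploded


# Hypothetical tokens: the first carries two tags, the second carries one
rows = [
    (7, "The", ["B-Unknown", "B-Masculine"]),
    (8, "Very", ["O"]),
]
exploded = explode_tags(rows)
n_rows = len(exploded)
n_tokens = len({token_id for token_id, _, _ in exploded})
print(n_rows, n_tokens)
```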

In [61]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
exp_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag
0,1,3,Title,NN,O
1,1,4,:,:,O
2,1,5,Papers,NNS,O
3,1,6,of,IN,O
4,1,7,The,DT,O


In [62]:
print(exp_df.shape[0], len(exp_df.token_id.unique()))

307597 305924


In [65]:
exp_pred_df = pd.merge(
    left=exp_df, 
    right=pred_df.loc[pred_df.predicted_tag != "O"], # only include the predictions of Linguistic labels
    how="outer",
    left_on=["sentence_id", "token_id", "token", "pos", "expected_tag"],
    right_on=["sentence_id", "token_id", "token", "pos", "predicted_tag"],
    suffixes=["", "_pred"],
    indicator=True
)
exp_pred_df.shape

(308371, 7)

Record the agreement type for each row, ignoring rows with `'O'` and `NaN` value pairs (the `true negative` agreement type, which doesn't go into the precision, recall, or F1 score calculations).
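
The mapping from merge outcome to agreement type can be sketched in pure Python with sets of `(token_id, tag)` pairs. This is an illustrative simplification under stated assumptions, not the notebook's pandas code: the function name `agreement_types` is hypothetical, and because the sketch works on pairs rather than filtering by `token_id`, it counts a predicted tag on a token expected to be `'O'` as a false positive, which may differ from how the cell below treats such rows.

```python
def agreement_types(expected, predicted, no_tag="O"):
    """Classify (token_id, tag) pairs as true positive, false negative, or false positive.

    Pairs where the tag is `no_tag` are dropped, so true negatives never appear.
    """
    exp_pairs = {(tok, tag) for tok, tag in expected if tag != no_tag}
    pred_pairs = {(tok, tag) for tok, tag in predicted if tag != no_tag}
    result = {}
    for pair in exp_pairs | pred_pairs:
        if pair in exp_pairs and pair in pred_pairs:
            result[pair] = "true positive"    # outer-merge value "both"
        elif pair in exp_pairs:
            result[pair] = "false negative"   # outer-merge value "left_only"
        else:
            result[pair] = "false positive"   # outer-merge value "right_only"
    return result


# Hypothetical expected and predicted tags for three tokens
expected = [(155, "B-Gendered-Pronoun"), (156, "O"), (300, "B-Gendered-Role")]
predicted = [(155, "B-Gendered-Pronoun"), (156, "O"), (300, "B-Generalization")]
agreements = agreement_types(expected, predicted)
print(agreements)
```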

In [73]:
exp_col = "expected_tag"
pred_col = "predicted_tag"
no_tag_value = "O"
# Find true negatives based on the expected and predicted tags
sub_exp_pred_df = exp_pred_df.loc[exp_pred_df[exp_col] == no_tag_value]
sub_exp_pred_df = sub_exp_pred_df.loc[sub_exp_pred_df[pred_col].isna()]
# sub_exp_pred_df.replace(to_replace="left_only", value="true negative", inplace=True)
tn_tokens = list(sub_exp_pred_df["token_id"])

# Record false negatives, false positives, and true positives based on the merge values
eval_df = exp_pred_df.loc[~exp_pred_df["token_id"].isin(tn_tokens)]
eval_df = eval_df.replace(to_replace="left_only", value="false negative")
eval_df = eval_df.replace(to_replace="right_only", value="false positive")
eval_df = eval_df.replace(to_replace="both", value="true positive")
eval_df = eval_df.sort_index()
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
39,5,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
41,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
62,7,216,His,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
72,7,226,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
218,16,435,He,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive


In [74]:
eval_df.shape

(2177, 7)

Save the data:

In [84]:
eval_df.to_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the true positives, false positives, false negatives, precision, recall, and F1 metrics for all tags and each tag individually.
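
For reference, precision, recall, and F1 follow directly from the three counts. The function below is a minimal sketch with a hypothetical name (`precision_recall_f1`); it is not the implementation in `utils`, and returning 0 when a denominator is zero is an assumption. The example counts are those reported for `B-Generalization` in the table further down.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from agreement counts.

    Returns 0.0 for any score whose denominator is zero (an assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Counts reported for B-Generalization: 148 true positives, 3 false positives, 124 false negatives
precision, recall, f1 = precision_recall_f1(tp=148, fp=3, fn=124)
print(round(precision, 6), round(recall, 6), round(f1, 6))  # 0.980132 0.544118 0.699764
```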

In [76]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)
for label_tag in ling_label_subset:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,all,476,7,1694,0.462383,0.422989,0.422518
0,B-Generalization,124,3,148,0.980132,0.544118,0.699764
0,I-Generalization,85,1,0,0.0,0.0,0.0
0,B-Gendered-Role,97,1,438,0.997722,0.818692,0.899384
0,I-Gendered-Role,132,0,0,0.0,0.0,0.0
0,B-Gendered-Pronoun,15,2,2802,0.999287,0.994675,0.996976
0,I-Gendered-Pronoun,23,0,0,0.0,0.0,0.0


Save the data:

In [81]:
agmt_stats.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_strict_agmt.csv".format(a=a,d=d))

##### Annotation-level Agreement

Join the manual annotations' offsets to the evaluation data:

In [234]:
annot_df = pd.read_csv(config.agg_path+"aggregated_final.csv")#, usecols=["description_id","agg_ann_id", "ann_offsets"])
# Get only the Linguistic annotations
annot_df = annot_df.loc[annot_df.category == "Linguistic"]
annot_df = annot_df[["agg_ann_id", "ann_offsets", "label"]]
annot_df = annot_df.rename(columns={"agg_ann_id":"ann_id"})
# annot_df.head()

In [237]:
dev_token_ids = list(dev_data.token_id.unique())
ling_dev_subset = ling_dev.loc[ling_dev.token_id.isin(dev_token_ids)]

In [238]:
to_add = ling_dev_subset[["ann_id", "token_id", "token_offsets"]]
# Only include annotations with Linguistic labels
to_add = to_add.loc[to_add.ann_id.isin(list(annot_df.ann_id))]
eval_df_joined = eval_df.join(to_add.set_index("token_id"), on="token_id", how="outer")
print(eval_df_joined.shape)
# Join on the left, as there will be annotations from outside the devtest set in annot_df
eval_df_joined = eval_df_joined.join(annot_df.set_index("ann_id"), on="ann_id", how="left")
print(eval_df_joined.shape)  # Looks good!  Same number of rows as before the join.
eval_df_joined = eval_df_joined.rename(columns={"label":"expected_label"})
eval_df_joined.head()

(3996, 9)
(3996, 11)


Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge,ann_id,token_offsets,ann_offsets,expected_label
39.0,5.0,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive,14379,"(913, 916)","(913, 916)",Gendered-Pronoun
41.0,5.0,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive,14380,"(928, 930)","(928, 930)",Gendered-Pronoun
62.0,7.0,216,His,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive,14382,"(1241, 1244)","(1241, 1244)",Gendered-Pronoun
72.0,7.0,226,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive,14383,"(1315, 1317)","(1315, 1317)",Gendered-Pronoun
218.0,16.0,435,He,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive,9516,"(677, 679)","(677, 679)",Gendered-Pronoun


Replace the predicted tags with their corresponding labels:

In [239]:
eval_df_joined.expected_label = eval_df_joined.expected_label.fillna("no_label")
eval_df_joined.expected_tag = eval_df_joined.expected_tag.fillna("no_label")
eval_df_joined.predicted_tag = eval_df_joined.predicted_tag.fillna("no_label")
eval_df_joined.predicted_tag.value_counts()

no_label              2234
B-Gendered-Pronoun    1447
B-Gendered-Role        232
B-Generalization        82
I-Generalization         1
Name: predicted_tag, dtype: int64

In [240]:
predicted_labels = list(eval_df_joined.predicted_tag)
predicted_labels = [tag[2:] if tag != "no_label" else tag for tag in predicted_labels]
eval_df_joined.insert(len(eval_df_joined.columns), "predicted_label", predicted_labels)
# eval_df_joined.head()

In [251]:
cols_to_keep = ["sentence_id", "token_id", "token", "expected_label", "predicted_label", "_merge", "token_offsets", "ann_offsets", "ann_id"]
eval_by_ann = utils.implodeDataFrame(eval_df_joined[cols_to_keep], ["ann_id", "ann_offsets"]).reset_index()
exp_labels = list(eval_by_ann["expected_label"])
exp_labels = [labels[0] for labels in exp_labels]
eval_by_ann["expected_label"] = exp_labels
# eval_by_ann.head()

In [248]:
assert eval_by_ann.loc[eval_by_ann.expected_label == "no_label"].shape[0] == 0
assert eval_by_ann.loc[eval_by_ann.expected_label.isna()].shape[0] == 0

Every row should have an annotation label (a Linguistic label in `expected_label`).

In [250]:
ann_agmts = []
token_agmts = (eval_by_ann["_merge"])
for agmts in token_agmts:
    if "true positive" in agmts:
        ann_agmt = "true positive"
    elif "false positive" in agmts:
        ann_agmt = "false positive"
    else:
        ann_agmt = "false negative"
    ann_agmts += [ann_agmt]
assert len(ann_agmts) == eval_by_ann.shape[0]
eval_by_ann.insert(len(eval_by_ann.columns), "annotation_agreement", ann_agmts)
eval_by_ann.head()

Unnamed: 0,ann_id,ann_offsets,sentence_id,token_id,token,expected_label,predicted_label,_merge,token_offsets,annotation_agreement
0,0,"(1407, 1415)","[5760.0, 5760.0]","[133674, 133674]","[knighted, knighted]",Gendered-Role,"[no_label, Generalization]","[false negative, false positive]","[(1407, 1415), (1407, 1415)]",false positive
1,1,"(9625, 9635)","[nan, nan]","[228678, 228679]","[nan, nan]",Gendered-Role,"[no_label, no_label]","[nan, nan]","[(9625, 9635), (9635, 9636)]",false negative
2,2,"(2426, 2439)","[nan, nan, nan]","[196525, 196526, 196527]","[nan, nan, nan]",Gendered-Role,"[no_label, no_label, no_label]","[nan, nan, nan]","[(2426, 2432), (2433, 2439), (2439, 2440)]",false negative
3,93,"(4141, 4148)","[nan, nan]","[4714, 4715]","[nan, nan]",Gendered-Role,"[no_label, no_label]","[nan, nan]","[(4141, 4148), (4148, 4149)]",false negative
4,1063,"(1532, 1535)",[12112.0],[272117],[his],Gendered-Pronoun,[Gendered-Pronoun],[true positive],"[(1532, 1535)]",true positive


Save the data:

In [254]:
eval_by_ann[["ann_id", "ann_offsets", "token_id", "expected_label", "predicted_label", "annotation_agreement"]].to_csv(
    config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_annot_evaluation.csv".format(a=a,d=d)
)

Calculate annotation agreement metrics for each label:

In [292]:
annot_agmt = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [293]:
labels = ling_label_tags.keys()
for label in labels:
    agmt_df = eval_by_ann.loc[eval_by_ann.expected_label == label]
    tp = agmt_df.loc[agmt_df.annotation_agreement == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df.annotation_agreement == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df.annotation_agreement == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    annot_agmt = pd.concat([annot_agmt, label_agmt])
annot_agmt

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Gendered-Pronoun,131.0,0.0,1401.0,1.0,0.914491,0.955336
0,Gendered-Role,976.0,4.0,230.0,0.982906,0.190713,0.319444
0,Generalization,408.0,3.0,124.0,0.976378,0.233083,0.376328


Save the scores:

In [262]:
# eval_by_ann.loc[eval_by_ann.expected_label == "Gendered-Role"]["annotation_agreement"].value_counts()     # Looks good
# eval_by_ann.loc[eval_by_ann.expected_label == "Gendered-Pronoun"]["annotation_agreement"].value_counts()  # Looks good
# eval_by_ann.loc[eval_by_ann.expected_label == "Generalization"]["annotation_agreement"].value_counts()    # Looks good
annot_agmt.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_annot_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).
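
A minimal sketch of the loose comparison, assuming every tag other than `'O'` has a two-character `B-` or `I-` prefix. The helper names `tag_to_label` and `loose_agreement` are hypothetical and are not the functions in `utils`.

```python
def tag_to_label(tag, no_tag="O"):
    """Strip the 'B-' or 'I-' prefix from an IOB tag, leaving the label name."""
    return tag if tag == no_tag else tag[2:]


def loose_agreement(expected_tag, predicted_tag, no_tag="O"):
    """Compare tags at the label level, ignoring the IOB prefix."""
    exp = tag_to_label(expected_tag, no_tag)
    pred = tag_to_label(predicted_tag, no_tag)
    if exp == no_tag and pred == no_tag:
        return "true negative"
    if exp == pred:
        return "true positive"
    if pred == no_tag:
        return "false negative"
    return "false positive"


# A strict mismatch (I- versus B-) counts as a match under loose agreement
print(loose_agreement("I-Generalization", "B-Generalization"))  # true positive
```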

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [7]:
a = "rf"
eval_df = pd.read_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation.csv".format(a=a,d=d), index_col=0)
loose_eval_df = eval_df.copy()
for label,tags in ling_label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [8]:
loose_eval_df.loc[loose_eval_df.predicted_tag.isna()].shape

(476, 7)

In [9]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
loose_eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
39,5,155,his,PRP$,Gendered-Pronoun,Gendered-Pronoun,true positive
41,5,157,he,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive
62,7,216,His,PRP$,Gendered-Pronoun,Gendered-Pronoun,true positive
72,7,226,he,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive
218,16,435,He,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive


In [12]:
loose_eval_df.to_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation_loose.csv".format(a=a,d=d))

In [10]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [11]:
for label,tags in ling_label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Gendered-Pronoun,38.0,0.0,,2802.0,1.0,0.98662,0.993265
0,Gendered-Role,229.0,0.0,,438.0,1.0,0.656672,0.79276
0,Generalization,209.0,0.0,,148.0,1.0,0.414566,0.586139


Great!  The performance of this model looks comparable to the model trained on 60% of the data.

Save the data:

In [278]:
loose_agmt.to_csv(config.experiment1_agmt_path+"cc-{a}_baseline_fastText{d}_ling_loose_agmt.csv".format(a=a,d=d))

***
<a id="2"></a>
## 2. Person Name + Occupation Labels

Train a multiclass sequence classifier, a Conditional Random Field (CRF) trained with Adaptive Regularization of Weight Vectors (AROW), on the Person Name and Occupation labels, **passing in the Linguistic labels (the label names, not the specific BIO tags) from the previous model's predictions as features to this model.**

Multiclass is a suitable setup for these labels because they are mutually exclusive (no one token should have more than one of these labels).  The sequence classifier with AROW was the highest performing for past algorithm experiments with sequence classifiers for Person Name and Occupation labels.

In [13]:
# loose_eval_df = pd.read_csv(config.experiment1_agmt_path+"cc-{a}_ling_baseline_fastText{d}_evaluation_loose.csv".format(a=a,d=d), index_col=0)
train_ling_features = loose_eval_df[["token_id", "predicted_tag"]]
train_ling_features = train_ling_features.rename(columns={"predicted_tag":"pred_ling_tag"})
train_ling_features.head()

Unnamed: 0,token_id,pred_ling_tag
39,155,Gendered-Pronoun
41,157,Gendered-Pronoun
62,216,Gendered-Pronoun
72,226,Gendered-Pronoun
218,435,Gendered-Pronoun


The devtest data subset from the model in step 1 will be the train data subset in this step, with the predicted Linguistic labels as features passed into this second model.  The train data subset from the first model will be the devtest data subset for this second model.

In [14]:
train_df = pd.read_csv(config.tokc_path+"experiment_input/token_validate.csv", index_col=0)
dev_df =  pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", index_col=0)
perso_train, perso_dev = utils.selectDataForLabels(train_df, dev_df, "tag", pers_o_label_subset)
print(perso_train.shape, perso_dev.shape)

(316721, 10) (308583, 10)


Join the linguistic labels (features) to the train data:

In [15]:
perso_train = perso_train.join(train_ling_features.set_index("token_id"), on="token_id", how="outer")
perso_train.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset,pred_ling_tag
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,dev,
4,1,1,99999,4,:,"(22, 23)",:,O,Title,dev,
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,dev,
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,dev,
9,1,1,52952,7,The,"(34, 37)",DT,O,Title,dev,


In [16]:
perso_train.pred_ling_tag = perso_train.pred_ling_tag.fillna("O")
perso_train.pred_ling_tag.value_counts()

O                   315090
Gendered-Pronoun      1447
Gendered-Role          232
Generalization          83
Name: pred_ling_tag, dtype: int64

Looks good!

#### Preprocessing

In [17]:
train_df = perso_train.drop(columns=["description_id", "ann_id", "token_offsets", "field", "subset", "pos"])
dev_df = perso_dev.drop(columns=["description_id", "ann_id", "token_offsets", "field", "subset", "pos"])

In [18]:
df_train_token_groups = utils.implodeDataFrame(train_df, ['token_id', 'sentence_id', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
# df_train_token_groups.head()

In [19]:
df_dev_token_groups = utils.implodeDataFrame(dev_df, ['token_id', 'sentence_id', 'token'])
df_dev_token_groups = df_dev_token_groups.reset_index()
# df_dev_token_groups.head()

In [20]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_train_grouped.head()

sentence_id,token_id,sentence,tag,pred_ling_tag
1,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[[O], [O], [O], [O], [O, B-Unknown, B-Masculin...","[[O], [O], [O], [O], [O, O, O], [O, O, O], [O,..."
3,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[[O], [O], [O], [O], [B-Masculine], [I-Masculi...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[[O], [Gendered-Pronoun], [O], [Gendered-Prono..."
7,"[216, 217, 218, 219, 220, 221, 222, 223, 224, ...","[His, primary, interests, were, in, liturgy, a...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[[Gendered-Pronoun], [O], [O], [O], [O], [O], ..."
9,"[256, 257, 258, 259, 260, 261, 262, 263, 264, ...","[The, service, was, relayed, around, the, worl...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."


Zip the linguistic label and BIO tags together with the tokens so each sentence item is a tuple: `(TOKEN, LING_LABEL, TAG_LIST)`
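
The tuple structure can be sketched with Python's built-in `zip`. The helper names `zip_sentence` and `sentence_to_features` are hypothetical, and the feature dictionary is only one plausible shape (sklearn-crfsuite accepts each sentence as a list of per-token feature dicts); the features actually extracted in `utils1` may differ.

```python
def zip_sentence(tokens, ling_labels, tags):
    """Combine parallel lists into (TOKEN, LING_LABEL, TAG_LIST) tuples."""
    return list(zip(tokens, ling_labels, tags))


def sentence_to_features(sentence):
    """Build one feature dict per token (illustrative feature names only)."""
    return [
        {"token.lower": token.lower(), "ling_label": "|".join(ling_label)}
        for token, ling_label, _tags in sentence
    ]


tokens = ["After", "his", "ordination"]
ling_labels = [["O"], ["Gendered-Pronoun"], ["O"]]
tags = [["O"], ["O"], ["O"]]
sentence = zip_sentence(tokens, ling_labels, tags)
features = sentence_to_features(sentence)
print(sentence)
print(features[1])
```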

In [21]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_pers = utils1.zipTrainFeaturesAndTarget(df_train_grouped, "tag")
print(train_sentences_pers[2][:3])
dev_sentences_pers = utils1.zipDevFeaturesAndTarget(df_dev_grouped, "tag")
print(dev_sentences_pers[0][:3])

[('After', ['O'], ['O']), ('his', ['Gendered-Pronoun'], ['O']), ('ordination', ['O'], ['O'])]
[('Scope', ['O']), ('and', ['O']), ('Contents', ['O'])]


In [22]:
train_sentences = train_sentences_pers
dev_sentences = dev_sentences_pers

In [23]:
# Features
X_train = [utils1.extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [utils1.extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [utils1.extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [utils1.extractSentenceTargets(sentence) for sentence in dev_sentences]

#### Train

Train a Conditional Random Field (CRF) model with the AROW algorithm on the **Person Name** and **Occupation** tags.  We'll set `variance=0.5`, cap training at 100 iterations, and allow all possible transitions.

In [24]:
a = "arow"
clf_pers = sklearn_crfsuite.CRF(algorithm=a, variance=0.5, max_iterations=100, all_possible_transitions=True)

In [25]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list, because we are interested in how well the model applies the tags for gendered and gender-biased language, and the `'O'` tags far outnumber those tags.
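
To illustrate the effect of leaving `'O'` out of the scored labels, here is a pure-Python sketch of a support-weighted recall over flattened sequences. The function names and toy sequences are hypothetical; the notebook itself relies on the `sklearn_crfsuite.metrics` functions.

```python
from collections import Counter


def flatten(sequences):
    """Flatten a list of tag sequences into a single list of tags."""
    return [tag for sentence in sequences for tag in sentence]


def weighted_recall(y_true, y_pred, labels):
    """Support-weighted mean of per-label recall, restricted to `labels`."""
    labels = set(labels)
    true_flat, pred_flat = flatten(y_true), flatten(y_pred)
    support = Counter(t for t in true_flat if t in labels)
    hits = Counter(t for t, p in zip(true_flat, pred_flat) if t == p and t in labels)
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(support[l] * (hits[l] / support[l]) for l in support) / total


# Toy sequences (hypothetical): most tokens are 'O'
toy_true = [["O", "O", "B-Masculine", "I-Masculine"], ["O", "B-Feminine", "O"]]
toy_pred = [["O", "O", "B-Masculine", "O"], ["O", "O", "O"]]
with_o = weighted_recall(toy_true, toy_pred, ["O", "B-Masculine", "I-Masculine", "B-Feminine"])
without_o = weighted_recall(toy_true, toy_pred, ["B-Masculine", "I-Masculine", "B-Feminine"])
print(with_o, without_o)
```

With `'O'` included, the many correct `'O'` predictions lift the score; without it, the score reflects only the labels of interest.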

In [26]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['I-Unknown', 'B-Masculine', 'I-Masculine', 'B-Occupation', 'I-Occupation', 'B-Unknown', 'I-Feminine', 'B-Feminine']


#### Predict

In [27]:
y_pred = clf_pers.predict(X_dev)

#### Evaluate: All Labels

In [28]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.40267473104429147
  - Prec: 0.4517275156285195
  - Rec 0.3767280305661688


Save the prediction data:

In [29]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag":"tag_pers_o_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_pers_o_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0,sentence_id,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,2,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,8,"[233, 234, 235, 236, 237, 238, 239, 240, 241, ...","[James, Whyte, was, called, upon, to, preach, ...","[[B-Masculine], [I-Masculine], [O], [O], [O], ...","[B-Masculine, I-Unknown, O, O, O, O, O, O, O, ..."
2,19,"[520, 521, 522, 523, 524, 525, 526, 527, 528, ...","[Rev, Tom, Allan, 's, first, charge, was, Nort...","[[B-Masculine], [I-Masculine], [I-Masculine], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,21,"[579, 580, 581, 582, 583, 584, 585, 586, 587, ...","[In, 1953, the, "", Tell, Scotland, "", committe...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,30,"[768, 769, 770, 771, 772, 773, 774, 775, 776, ...","[Title, :, Papers, of, Rev, Prof, Alec, Campbe...","[[O], [O], [O], [O], [B-Masculine, B-Unknown],...","[O, O, O, O, O, O, B-Unknown, I-Unknown, I-Mas..."


In [30]:
df_dev_grouped = df_dev_grouped.set_index("sentence_id")
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded.head()

sentence_id,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
2,16,Scope,[O],O
2,17,and,[O],O
2,18,Contents,[O],O
2,19,:,[O],O
2,20,Sermons,[O],O


In [31]:
filename = "crf_{a}_pers_o_baseline_fastText{d}_predictions.csv".format(a=a, d=d)
df_dev_exploded.to_csv(config.experiment1_output_path+filename)

#### Evaluate: Each Label

The built-in evaluation approach is strict, so unless the model's predicted labels are on text spans that exactly match the annotated text spans in the development data, the predicted labels will be deemed incorrect.
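
A token-level version of this strict check might look like the sketch below, where each token has a list of expected tags and a single predicted tag. The function name `token_agreement` and its rules are assumptions inferred from the outputs shown in this notebook; they are not the implementation of `utils.isPredictedInExpected`.

```python
def token_agreement(expected_tags, predicted_tag, no_tag="O"):
    """Classify one token's prediction against its list of expected tags."""
    expected = [tag for tag in expected_tags if tag != no_tag]
    if predicted_tag == no_tag:
        # Nothing predicted: a miss if any tag was expected, otherwise a true negative
        return "false negative" if expected else "true negative"
    # Something predicted: correct only if the exact tag was expected
    return "true positive" if predicted_tag in expected else "false positive"


print(token_agreement(["B-Masculine"], "B-Masculine"))  # true positive
print(token_agreement(["I-Masculine"], "I-Unknown"))    # false positive
```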

In [32]:
a = "arow"
category = "pers_o"
filename = "crf_{a}_{c}_baseline_fastText{d}_predictions.csv".format(a=a, c=category, d=d)
pred_perso = pd.read_csv(config.experiment1_output_path+filename)
pred_perso = utils.getColumnValuesAsLists(pred_perso, "tag_{}_expected".format(category))
# pred_perso.head()

Calculate performance metrics for each category of labels:

In [33]:
pred_perso = utils.isPredictedInExpected(pred_perso, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
pred_perso.head()

Unnamed: 0,sentence_id,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted,_merge
0,2,16,Scope,[O],O,true negative
1,2,17,and,[O],O,true negative
2,2,18,Contents,[O],O,true negative
3,2,19,:,[O],O,true negative
4,2,20,Sermons,[O],O,true negative


In [34]:
pred_perso_stats = utils.getScoresByCatTags(
    pred_perso, "_merge", pers_o_label_subset[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(pers_o_label_subset)):
    tag_stats = utils.getScoresByCatTags(
        pred_perso, "_merge", pers_o_label_subset[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_perso_stats = pd.concat([pred_perso_stats, tag_stats])
pred_perso_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,B-Unknown,1086,1003,1314,0.567113,0.5475,0.557134
0,I-Unknown,2117,2002,2639,0.568627,0.554878,0.561669
0,B-Feminine,45,215,309,0.589695,0.872881,0.703872
0,I-Feminine,249,128,311,0.708428,0.555357,0.622623
0,B-Masculine,303,315,404,0.561892,0.571429,0.56662
0,I-Masculine,697,433,314,0.420348,0.310584,0.357224
0,B-Occupation,373,712,655,0.479151,0.63716,0.546973
0,I-Occupation,662,477,553,0.536893,0.455144,0.49265


Save the statistics:

In [35]:
pred_perso_stats.to_csv(
    config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_strict_agmt.csv".format(a=a, c=category, d=d)
)

#### Annotation Agreement

Calculate agreement at the annotation level, so if the model labels any word correctly from a manually annotated text span, that annotation is recorded as being correctly labeled (`true positive`).  Note whether the models' labels are an `exact_match`, `label_match`, `category_match` or `mismatch`.
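
The annotation-level rule can be sketched as a function over the token-level agreement types within one annotation, mirroring the rule applied to the Linguistic labels in step 1. The name `annotation_agreement` is hypothetical, and treating an annotation whose tokens are all true negatives as a true negative is an assumption.

```python
def annotation_agreement(token_agreements):
    """Reduce the token-level agreement types of one annotation to a single type."""
    if token_agreements and all(a == "true negative" for a in token_agreements):
        return "true negative"  # assumption: no label expected or predicted
    if "true positive" in token_agreements:
        return "true positive"  # any correctly labeled token is enough
    if "false positive" in token_agreements:
        return "false positive"
    return "false negative"


print(annotation_agreement(["true positive", "false positive"]))   # true positive
print(annotation_agreement(["false negative", "false positive"]))  # false positive
```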

Load the annotation data:

*Note: `ann_id` of `99999` indicates no annotation*

In [36]:
dev_df =  pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", usecols=["sentence_id", "ann_id", "token_id", "tag"])
# dev_df.head()

Group the annotation data by token:

In [37]:
df_ann = utils.implodeDataFrame(dev_df, ["sentence_id", "ann_id", "token_id"])
df_ann = df_ann.reset_index()
# print(df_ann.shape)
# df_ann.head()

Align the columns of the dev and prediction DataFrames:

In [38]:
# Rename `sentence` column `token`
pred_perso = pred_perso.rename(columns={"sentence":"token"})
# pred_perso.head()

Join the data, adding the annotation IDs (`ann_id` column) to the prediction DataFrames:

In [39]:
index_list = ["sentence_id", "token_id"]

In [40]:
pred_perso_ann = pred_perso.join(df_ann.set_index(index_list), on=index_list, how="left")
pred_perso_ann = pred_perso_ann.drop(columns=["tag"])  # duplicate of tag_expected
assert pred_perso_ann.loc[pred_perso_ann["token_id"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["ann_id"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["tag_pers_o_predicted"].isna()].shape[0] == 0
assert pred_perso_ann.loc[pred_perso_ann["tag_pers_o_expected"].isna()].shape[0] == 0
# pred_perso_ann.head()
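
The join matches the `on` columns of the left DataFrame against the index of the right DataFrame, which is why the annotation data is re-indexed with `set_index` first. The toy example below (made-up values) illustrates this and shows why the `isna()` assertions are a useful guard: a left join keeps unmatched rows and fills their new columns with `NaN`.

```python
import pandas as pd

# Toy prediction rows (left) and annotation rows (right); values are made up
left = pd.DataFrame({
    "sentence_id": [8, 8, 8],
    "token_id": [233, 234, 235],
    "tag_pers_o_predicted": ["B-Masculine", "I-Unknown", "O"],
})
right = pd.DataFrame({
    "sentence_id": [8, 8],
    "token_id": [233, 234],
    "ann_id": [14387, 14387],
})

keys = ["sentence_id", "token_id"]
# `on` names columns of `left`; they are matched against the index of `right`
joined = left.join(right.set_index(keys), on=keys, how="left")

# Rows of `left` without a partner in `right` get NaN in the joined columns
unmatched = joined.loc[joined["ann_id"].isna()]
print(unmatched)
```

In the toy data, token `235` has no annotation row, so an assertion like the ones above would fail and flag the misalignment.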

Explode the DataFrame so that each expected tag has its own row:

In [41]:
pred_perso_ann = pred_perso_ann.explode(["tag_pers_o_expected"])
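
As a quick illustration of what `explode` does to a list-valued column, here is a toy example with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "token_id": [233, 533],
    "tag_pers_o_expected": [["B-Masculine", "B-Unknown"], ["O"]],
})
# Each element of the list-valued column becomes its own row;
# the other columns and the index label are repeated
exploded = toy.explode("tag_pers_o_expected")
print(exploded)
```

Because index labels are repeated after exploding, call `reset_index(drop=True)` if a later step needs a unique index.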

Generalize the BIO tags to label names:

In [42]:
# Get the predicted labels
pred_labels = list(pred_perso_ann["tag_{}_predicted".format(category)])
pred_labels = [label if label == "O" else label[2:] for label in pred_labels]
pred_perso_ann.insert(len(pred_perso_ann.columns), "label_{}_predicted".format(category), pred_labels)
# Get the lists of expected labels
exp_labels = list(pred_perso_ann["tag_{}_expected".format(category)])
exp_labels = [label if label == "O" else label[2:] for label in exp_labels]
pred_perso_ann.insert(len(pred_perso_ann.columns), "label_{}_expected".format(category), exp_labels)
# pred_perso_ann.head()
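
Both list comprehensions above apply the same rule: keep `O` as is, otherwise drop the two-character `B-` or `I-` prefix. A small helper makes that rule explicit and reusable (`bio_to_label` is a hypothetical name, not part of `utils`):

```python
def bio_to_label(tag, no_tag="O"):
    """Drop the two-character 'B-' or 'I-' prefix; leave the no-tag marker unchanged."""
    return tag if tag == no_tag else tag[2:]

tags = ["B-Masculine", "I-Masculine", "O", "B-Occupation"]
labels = [bio_to_label(t) for t in tags]
print(labels)
```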

Group the data by annotation:

In [43]:
pred_perso_ann = pred_perso_ann.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_perso_ann = utils.implodeDataFrame(pred_perso_ann, ["sentence_id", "ann_id"])
pred_perso_ann = pred_perso_ann.reset_index()
pred_perso_ann.head()

Unnamed: 0,sentence_id,ann_id,token_id,token,_merge,label_pers_o_predicted,label_pers_o_expected
0,2,99999,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[true negative, true negative, true negative, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,8,14387,"[233, 234]","[James, Whyte]","[true positive, false positive]","[Masculine, Unknown]","[Masculine, Masculine]"
2,8,99999,"[235, 236, 237, 238, 239, 240, 241, 242, 243, ...","[was, called, upon, to, preach, at, the, memor...","[true negative, true negative, true negative, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,19,9518,[533],[he],[true negative],[O],[O]
4,19,9519,[539],[he],[true negative],[O],[O]


Record the agreements and disagreements:

In [44]:
agmt_types_perso, agmt_labels_perso = utils1.getAnnotationAgreement(pred_perso_ann, "label_pers_o_predicted", "label_pers_o_expected")

In [45]:
pred_perso_ann.insert(len(pred_perso_ann.columns), "annotation_agreement", agmt_types_perso)
pred_perso_ann.insert(len(pred_perso_ann.columns), "agreement_label", agmt_labels_perso)
pred_perso_ann.head()

Unnamed: 0,sentence_id,ann_id,token_id,token,_merge,label_pers_o_predicted,label_pers_o_expected,annotation_agreement,agreement_label
0,2,99999,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[true negative, true negative, true negative, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",false positive,Occupation
1,8,14387,"[233, 234]","[James, Whyte]","[true positive, false positive]","[Masculine, Unknown]","[Masculine, Masculine]",true positive,Masculine
2,8,99999,"[235, 236, 237, 238, 239, 240, 241, 242, 243, ...","[was, called, upon, to, preach, at, the, memor...","[true negative, true negative, true negative, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",true negative,O
3,19,9518,[533],[he],[true negative],[O],[O],true negative,O
4,19,9519,[539],[he],[true negative],[O],[O],true negative,O
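
The source of `utils1.getAnnotationAgreement` is not shown here. The sketch below is one plausible reading of the rule described at the start of this section, and it is consistent with the rows printed above. The function name, the choice of the first label when several apply, and the handling of unannotated spans are all assumptions. It does not cover the `exact_match`, `label_match`, `category_match`, and `mismatch` distinctions.

```python
def annotation_agreement_sketch(predicted, expected, no_tag="O"):
    """Hypothetical annotation-level agreement for one annotation.

    predicted, expected: equal-length lists of labels, one per token.
    Returns (agreement_type, label).
    """
    expected_labels = [lab for lab in expected if lab != no_tag]
    if expected_labels:
        # Annotated span: one correctly labeled token is enough
        for pred, exp in zip(predicted, expected):
            if exp != no_tag and pred == exp:
                return "true positive", exp
        return "false negative", expected_labels[0]
    # Unannotated span: any predicted label is a false positive
    predicted_labels = [lab for lab in predicted if lab != no_tag]
    if predicted_labels:
        return "false positive", predicted_labels[0]
    return "true negative", no_tag

# Mirrors the "James Whyte" row above: one of the two tokens is labeled correctly
result = annotation_agreement_sketch(["Masculine", "Unknown"], ["Masculine", "Masculine"])
print(result)
```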


In [46]:
metrics_perso_all = utils1.getAnnotationAgreementMetrics(pred_perso_ann, "all")
metrics_perso_pn = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[~(pred_perso_ann.agreement_label.isin(["Occupation","O"]))], "Person Name")
metrics_perso_unk = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Unknown"], "Unknown")
metrics_perso_fem = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Feminine"], "Feminine")
metrics_perso_mas = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Masculine"], "Masculine")
metrics_perso_occ = utils1.getAnnotationAgreementMetrics(pred_perso_ann.loc[pred_perso_ann.agreement_label == "Occupation"], "Occupation")
metrics_perso = pd.concat([metrics_perso_all, metrics_perso_pn, metrics_perso_unk, metrics_perso_fem, metrics_perso_mas, metrics_perso_occ])
metrics_perso

Unnamed: 0,labels,false negative,true positive,false positive,precision,recall,f_1
0,all,4850,5243,2235,0.701123,0.519469,0.596779
0,Person Name,4307,4376,1684,0.722112,0.503973,0.593638
0,Unknown,3037,2744,1146,0.705398,0.474658,0.56747
0,Feminine,231,641,166,0.7943,0.735092,0.76355
0,Masculine,1039,991,372,0.727073,0.488177,0.584144
0,Occupation,543,867,551,0.611425,0.614894,0.613154
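
The precision, recall, and F1 columns follow from the three count columns. The sketch below recomputes them from the counts in the `all` row; the function name is hypothetical, and returning `0.0` for an empty denominator is an assumed convention, not necessarily what `utils1.getAnnotationAgreementMetrics` does.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.
    Returns 0.0 where a denominator would be zero (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts taken from the "all" row of the table above
p, r, f = precision_recall_f1(tp=5243, fp=2235, fn=4850)
print(round(p, 6), round(r, 6), round(f, 6))
```

Rounded to six decimals, these values agree with the `precision`, `recall`, and `f_1` entries reported for `all`.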


Save the metrics:

In [47]:
metrics_perso.to_csv(
    config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_annot_agmt.csv".format(a=a, d=d, c=category)
)

### Loose Evaluation

As with the manual annotation evaluation, also evaluate the predictions more loosely, counting overlapping text spans as matches in addition to exactly matching text spans.

#### Token Agreement

First, generalize the tokens' BIO tags to their label names, then calculate agreement scores for each label.

In [50]:
pred_perso_labels = pred_perso.drop(columns=["_merge"])
tag_exp = list(pred_perso_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_perso_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_perso_labels = pred_perso_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_perso_labels.insert(len(pred_perso_labels.columns), "label_{}_expected".format(category), label_exp)
pred_perso_labels.insert(len(pred_perso_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_perso_labels.loc[pred_perso_labels["label_{}_predicted".format(category)] == "Feminine"].head()  # Looks good

Calculate the agreement metrics at the label level for each token:

In [51]:
tags = ['Unknown', 'Feminine', 'Masculine', 'Occupation']
pred_perso_labels = utils.isPredictedInExpected(pred_perso_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')

pred_perso_stats = utils.getScoresByCatTags(
    pred_perso_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_perso_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_perso_stats = pd.concat([pred_perso_stats, tag_stats])
pred_perso_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Unknown,3198,2653,4305,0.618712,0.57377,0.595395
0,Feminine,293,294,669,0.694704,0.695426,0.695065
0,Masculine,998,655,811,0.553206,0.448314,0.495267
0,Occupation,1035,1087,1310,0.546516,0.558635,0.552509
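
At the token level, each token has one predicted label and a list of expected labels, so "loose" agreement can be read as a membership test. The sketch below is a hypothetical stand-in for the per-token rule inside `utils.isPredictedInExpected`, whose source is not shown; in particular, counting a predicted `O` against an annotated token as a false negative is an assumption.

```python
def loose_token_agreement(predicted, expected, no_tag="O"):
    """Hypothetical per-token rule.

    predicted: a single label; expected: list of labels for the same token.
    """
    has_expected = any(lab != no_tag for lab in expected)
    if predicted == no_tag:
        # Nothing predicted: a miss only if something was annotated
        return "false negative" if has_expected else "true negative"
    # Something predicted: correct if it is among the expected labels
    return "true positive" if predicted in expected else "false positive"

outcome = loose_token_agreement("Feminine", ["Unknown", "Feminine"])
print(outcome)
```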


Save the performance measures:

In [52]:
pred_perso_stats.to_csv(
    config.experiment1_agmt_path+"crf_{a}_baseline_fastText{d}_{c}_loose_agmt.csv".format(a=a, d=d, c=category)
)