# Experiment 1, Model 1

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multilabel Linguistic Classifier
2. Multiclass Person Name + Occupation Sequence Classifier
3. Multilabel Stereotype and Omission Document Classifier

Train the first model and then run it over the entire dataset.

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
    
***

**Table of Contents**

[I.](#i) Linguistic Classifier
* [Preprocessing](#prep)
* [Training & Prediction](#tp)
* [Evaluation](#eval)

Load programming resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For multilabel token classification
import sklearn.metrics
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier

# For saving model
from joblib import dump, load

Define resources for the models:

In [2]:
# Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
predictions_dir = config.experiment1_path+"5fold/output/"
Path(predictions_dir).mkdir(parents=True, exist_ok=True)  # For predictions
agreement_dir = config.experiment1_path+"5fold/agreement/"
Path(agreement_dir).mkdir(parents=True, exist_ok=True)    # For agreement metrics

In [3]:
# Model 1:
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]

In [4]:
ling_label_tags = {
    "Gendered-Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"],
    "Gendered-Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"],
}

In [5]:
d = 100  # dimensions of word embeddings (should match utils1.py)

<a id="i"></a>
## I. Train the Linguistic Classifier

Run a multilabel classifier on the train set of the data, focusing only on applying the Linguistic category of labels: Gendered Pronoun, Gendered Role, and Generalization.

Use a Classifier Chain with Random Forest, as this was the highest-performing multilabel model setup from previous algorithm experiments for the Linguistic labels.

<a id="prep"></a>
### Preprocessing

For this experiment, we'll repeatedly train models on different 80% selections of data and predict on the remaining 20% split, for a modified 5-fold cross-validation approach.

In [6]:
df = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
df = df.drop_duplicates()
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,fold
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier,split4
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier,split4
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier,split4
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,split2
4,1,1,99999,4,:,"(22, 23)",:,O,Title,split2


Remove rows with the `B-Nonbinary` and `I-Nonbinary` tags, which were applied by mistake and should already have been removed from the data:

In [7]:
df = df.loc[df.tag != "B-Nonbinary"]
df = df.loc[df.tag != "I-Nonbinary"]

In [8]:
print(df.shape)

(778801, 10)


Make sure only Linguistic tags are considered:

In [9]:
df = utils1.selectDataForLabels(df, "tag", ling_label_subset)

Replace the tags with label names (remove ``B-`` and ``I-``):

In [10]:
labels_col = utils1.getLabelColFromTagCol(df, "tag")
df.insert(len(df.columns), "label", labels_col)
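
The tag-to-label conversion amounts to dropping the IOB prefix. `utils1.getLabelColFromTagCol` is defined outside this notebook, so the helper below (`tag_to_label`, a hypothetical name) is only a minimal sketch of the idea, assuming every tag is either `O` or starts with `B-` or `I-`:

```python
def tag_to_label(tag: str) -> str:
    """Strip a leading IOB prefix ('B-' or 'I-'); leave other tags such as 'O' unchanged."""
    if tag.startswith(("B-", "I-")):
        return tag[2:]
    return tag


# Illustrative tags only, not rows from the dataset
tags = ["O", "B-Gendered-Role", "I-Gendered-Role", "B-Generalization"]
labels = [tag_to_label(tag) for tag in tags]
print(labels)
```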

In [11]:
# df.head()
df.label.value_counts()

O                   769616
Gendered-Pronoun      3732
Gendered-Role         3392
Generalization        2061
Name: label, dtype: int64

Get the label associated with each annotation for future evaluation:

In [45]:
df_by_ann = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
df_by_ann = df_by_ann.drop_duplicates()
df_by_ann = utils.implodeDataFrame(df_by_ann, ["ann_id"])
tags_col = list(df_by_ann.tag)
labels = [[tag[2:] if tag != "O" else tag for tag in tags] for tags in tags_col]
labels = [label_list[0] for label_list in labels]
df_by_ann.insert(len(df_by_ann.columns), "label", labels)
ling_labels = list(ling_label_tags.keys())
df_by_ann = df_by_ann.loc[df_by_ann.label.isin(ling_labels)]
df_by_ann.head()

ann_id,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,fold,label
0,[2364],[5760],[133674],[knighted],"[(1407, 1415)]",[VBN],[B-Gendered-Role],[Biographical / Historical],[split3],Gendered-Role
1,"[4542, 4542]","[10365, 10365]","[228678, 228679]","[knighthood, .]","[(9625, 9635), (9635, 9636)]","[NN, .]","[B-Gendered-Role, I-Gendered-Role]","[Scope and Contents, Scope and Contents]","[split2, split2]",Gendered-Role
2,"[3660, 3660, 3660]","[8733, 8733, 8733]","[196525, 196526, 196527]","[Prince, Regent, .]","[(2426, 2432), (2433, 2439), (2439, 2440)]","[NNP, NNP, .]","[B-Gendered-Role, I-Gendered-Role, I-Gendered-...","[Biographical / Historical, Biographical / His...","[split3, split3, split3]",Gendered-Role
3,"[4678, 4678]","[10637, 10637]","[236354, 236355]","[knighthood, .]","[(9993, 10003), (10003, 10004)]","[NN, .]","[B-Gendered-Role, I-Gendered-Role]","[Scope and Contents, Scope and Contents]","[split0, split0]",Gendered-Role
4,[4732],[10763],[239212],[Sir],"[(7192, 7195)]",[NNP],[B-Gendered-Role],[Biographical / Historical],[split1],Gendered-Role


Make sure any tokens with a label don't also have an `O` tag:

In [12]:
df_imploded = utils.implodeDataFrame(
    df[["description_id", "sentence_id", "ann_id", "token_id", "token", "field", "token_offsets", "pos", "tag", "label", "fold"]], 
    ["description_id", "sentence_id", "token_id", "token", "field", "token_offsets", "pos", "fold"]
).reset_index()
df_imploded.head()

Unnamed: 0,description_id,sentence_id,token_id,token,field,token_offsets,pos,fold,ann_id,tag,label
0,0,0,0,Identifier,Identifier,"(0, 10)",NN,split4,[99999],[O],[O]
1,0,0,1,:,Identifier,"(10, 11)",:,split4,[99999],[O],[O]
2,0,0,2,AA5,Identifier,"(12, 15)",NN,split4,[99999],[O],[O]
3,1,1,3,Title,Title,"(17, 22)",NN,split2,[99999],[O],[O]
4,1,1,4,:,Title,"(22, 23)",:,split2,[99999],[O],[O]
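
`utils.implodeDataFrame` is defined outside this notebook. Judging from the output above, it groups rows on the given key columns and collects the remaining columns into lists. A minimal pure-Python sketch of that idea follows; the `implode` function and the sample rows are hypothetical and may differ from the real helper:

```python
from collections import defaultdict


def implode(rows, key_fields):
    """Group rows on key_fields and collect every other field into a list."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        for field, value in row.items():
            if field not in key_fields:
                grouped[key][field].append(value)
    # Rebuild one dictionary per group: key columns first, then the collected lists
    return [{**dict(zip(key_fields, key)), **dict(lists)} for key, lists in grouped.items()]


# Token 1 carries two annotations; token 2 carries one
rows = [
    {"token_id": 1, "ann_id": 99999, "label": "O"},
    {"token_id": 1, "ann_id": 12, "label": "Generalization"},
    {"token_id": 2, "ann_id": 99999, "label": "O"},
]
imploded = implode(rows, ["token_id"])
print(imploded)
```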


In [18]:
target_col = "label"  #"tag"

In [19]:
tags = list(df_imploded[target_col])
ann_ids = list(df_imploded["ann_id"])
new_tags, new_ann_ids = [], []
for i,tag_list in enumerate(tags):
    unique_tags = list(set(tag_list))
    ann_list = ann_ids[i]
    if (len(unique_tags) > 1) and ("O" in unique_tags):
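        # Look up "O" in tag_list, whose positions line up with ann_list;
        # unique_tags comes from a set, so its order is arbitrary
        o_index = tag_list.index("O")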
        unique_tags.remove("O")
        
        ann_to_remove = ann_list[o_index]
        ann_list.remove(ann_to_remove)
    
    new_tags += [unique_tags]
    new_ann_ids += [ann_list]
    
df_imploded[target_col] = new_tags
df_imploded["ann_id"] = new_ann_ids
# # df_imploded.head(20)
# df_imploded.tag.value_counts()  # Looks good
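
The cleanup in the cell above can also be written as a filter over paired lists. This is a sketch only, assuming each token's label list and annotation ID list are aligned position by position; `drop_outside_label` and the sample values are hypothetical:

```python
def drop_outside_label(labels, ann_ids, outside="O"):
    """Drop the outside label and its paired annotation ID from a token
    that also carries at least one real label."""
    if all(label == outside for label in labels):
        # Nothing but the outside label: keep the token as it is
        return [outside], list(ann_ids)
    kept = [(label, ann_id) for label, ann_id in zip(labels, ann_ids) if label != outside]
    kept_labels = sorted({label for label, _ in kept})
    kept_ann_ids = [ann_id for _, ann_id in kept]
    return kept_labels, kept_ann_ids


print(drop_outside_label(["O", "Gendered-Role"], [99999, 5]))
print(drop_outside_label(["O"], [99999]))
```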

***
Calculate the number of tokens with different combinations of labels:

In [22]:
df_value_counts = pd.DataFrame(df_imploded.label.value_counts()).reset_index()
df_value_counts = df_value_counts.rename(columns={"index":"labels", "label":"token_count"})
df_value_counts

Unnamed: 0,labels,token_count
0,[O],744728
1,[Gendered-Pronoun],3624
2,[Gendered-Role],3151
3,[Generalization],1808
4,"[Gendered-Pronoun, Generalization]",107
5,"[Generalization, Gendered-Role]",103


`Generalization` is the only label that co-occurs with other Linguistic labels. Co-occurrence accounts for about 10% of its occurrences (210 of 2,018 annotated tokens). Future work could test sequence classifiers for the Linguistic labels to see whether they improve the downstream classification of gender-biased language (Stereotype, Omission).
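
The 10% figure follows directly from the counts in the table above, as this short check shows (the variable names are only illustrative):

```python
# Token counts copied from the table of label combinations above
generalization_only = 1808
with_gendered_pronoun = 107
with_gendered_role = 103

co_occurring = with_gendered_pronoun + with_gendered_role
generalization_total = generalization_only + co_occurring
share = co_occurring / generalization_total

print(co_occurring, generalization_total, f"{share:.1%}")  # 210 2018 10.4%
```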

In [23]:
df_value_counts.to_csv(config.tokc_path+"linguistic_annotation_occurrences.csv")

***

In [17]:
df_exploded = df_imploded.explode([target_col])
df_exploded[target_col].value_counts()  # Looks good

O                   744728
Gendered-Pronoun      3731
Gendered-Role         3254
Generalization        2018
Name: label, dtype: int64

In [18]:
print(df_exploded.shape)
df_exploded.head()

(753731, 11)


Unnamed: 0,description_id,sentence_id,token_id,token,field,token_offsets,pos,fold,ann_id,tag,label
0,0,0,0,Identifier,Identifier,"(0, 10)",NN,split4,[99999],[O],O
1,0,0,1,:,Identifier,"(10, 11)",:,split4,[99999],[O],O
2,0,0,2,AA5,Identifier,"(12, 15)",NN,split4,[99999],[O],O
3,1,1,3,Title,Title,"(17, 22)",NN,split2,[99999],[O],O
4,1,1,4,:,Title,"(22, 23)",:,split2,[99999],[O],O


In [19]:
assert df_exploded.loc[df_exploded.ann_id.isna()].shape[0] == 0
assert df_exploded.loc[df_exploded.token_id.isna()].shape[0] == 0
assert df_exploded.loc[df_exploded[target_col].isna()].shape[0] == 0

Remove the Annotation ID column for training and testing the classification model:

In [22]:
df = df_exploded.drop(columns=["ann_id"])

Define the five splits of the data to combine iteratively into training and test sets using five-fold cross-validation:

In [22]:
split_col = "fold"
splits = df[split_col].unique()
splits.sort()
print(splits)

['split0' 'split1' 'split2' 'split3' 'split4']


In [23]:
train0, test0 = list(splits[:4]), splits[4]
train1, test1 = list(splits[1:]), splits[0]
train2, test2 = list(splits[2:])+[splits[0]], splits[1]
train3, test3 = list(splits[3:])+list(splits[:2]), splits[2]
train4, test4 = [splits[4]]+list(splits[:3]), splits[3]

In [24]:
runs = [(train0, test0), (train1, test1), (train2, test2), (train3, test3), (train4, test4)]
for run in runs:
    print(run)

(['split0', 'split1', 'split2', 'split3'], 'split4')
(['split1', 'split2', 'split3', 'split4'], 'split0')
(['split2', 'split3', 'split4', 'split0'], 'split1')
(['split3', 'split4', 'split0', 'split1'], 'split2')
(['split4', 'split0', 'split1', 'split2'], 'split3')


Looks good!
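
The same train/test pairings can be generated for any number of splits. The sketch below uses a hypothetical helper, `rotating_runs`, and holds out the splits in list order, so the runs come out in a different order from the cell above (which holds out `split4` first), but the pairings are the same:

```python
def rotating_runs(splits):
    """Pair each split (as the test set) with all remaining splits (as the training set)."""
    runs = []
    for held_out in splits:
        train = [split for split in splits if split != held_out]
        runs.append((train, held_out))
    return runs


splits = ["split0", "split1", "split2", "split3", "split4"]
runs_sketch = rotating_runs(splits)
for train, test in runs_sketch:
    print(train, test)
```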

<a id="tp"></a>
### Training & Prediction

In [29]:
pred_df = pd.DataFrame()
a = "rf"

# Drop tag column if using labels as targets:
df = df.drop(columns=["tag"])

for run in runs:
    # Get the train (80%) and test (20%) subsets of data
    train_splits, test_split = run[0], run[1]
    print("Training on:", train_splits)
    train_df = df.loc[df[split_col].isin(train_splits)]
    dev_df = df.loc[df[split_col] == test_split]
    
    ling_train = train_df.rename(columns={"fold":"subset"})  # Change column name to next function's expected column name
    ling_dev = dev_df.rename(columns={"fold":"subset"})      # Change column name to next function's expected column name
    train_data = utils1.loadData(ling_train)
    dev_data = utils1.loadData(ling_dev)
    
    # Create feature matrices
    train_tokens = utils1.zipTokensFeatures(train_data)
    dev_tokens = utils1.zipTokensFeatures(dev_data)
    X_train = utils1.makeFastTextFeatureMatrix(train_tokens)
    X_dev = utils1.makeFastTextFeatureMatrix(dev_tokens)
    
    # Binarize targets
    mlb, y_train = utils1.binarizeTrainTargets(train_data, target_col=target_col)
    y_dev = utils1.binarizeDevTargets(mlb, dev_data, target_col=target_col)

    # Train a classification model
    clf = ClassifierChain(
        classifier = RandomForestClassifier(random_state=22),
    )
    clf.fit(X_train, y_train)
    
    # Predict with the trained model
    print("Predicting on:", test_split)
    predictions = clf.predict(X_dev)
    if pred_df.shape[0] > 0:
        next_pred_df = utils.makePredictionDF(predictions, dev_data, target_col, "predicted_{}".format(target_col), "O", mlb)
        pred_df = pd.concat([pred_df, next_pred_df])
    else:
        pred_df = utils.makePredictionDF(predictions, dev_data, target_col, "predicted_{}".format(target_col), "O", mlb)

assert pred_df.loc[pred_df["predicted_{}".format(target_col)].isna()].shape[0] == 0, "Any NaN values should be replaced with 'O'"
print("Modified 5-fold cross-validation complete!")

Modified 5-fold cross-validation complete!


In [30]:
print(pred_df.shape[0], len(pred_df.token_id.unique()))
pred_df.head()

753522 753521


Unnamed: 0,description_id,sentence_id,token_id,token,pos,predicted_label
0,0,0,0,Identifier,NN,O
1,0,0,1,:,:,O
2,0,0,2,AA5,NN,O
3,3,4,134,He,PRP,Gendered-Pronoun
4,3,4,135,was,VBD,O


Save the prediction data:

In [31]:
pred_df.to_csv(predictions_dir+"cc-{a}_linglabels_baseline_fastText{d}_predictions.csv".format(a=a,d=d))

Save the model. Only the classifier trained in the final cross-validation run is saved:

In [32]:
model_dir = "models/experiment1/"
Path(model_dir).mkdir(parents=True, exist_ok=True)
filename = model_dir+"cc-{a}_linglabels_F-fastText{d}_T-ling.joblib".format(a=a, d=d)  # include features (F) and targets (T) in model's file name
dump(clf, filename)

['models/experiment1/cc-{a}_linglabels_F-fastText{d}_T-ling.joblib']

<a id="eval"></a>
### Evaluation
#### Evaluate: Strict, Each Label

There are more predictions than unique tokens because, with multilabel classification, one token can have multiple predicted labels.
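
A quick way to see which tokens account for the extra rows is to count prediction rows per token ID. The rows below are made-up examples, not data from this experiment:

```python
from collections import Counter

# Hypothetical (token_id, predicted_label) rows; token 7 received two labels
predictions = [
    (5, "O"),
    (6, "Gendered-Pronoun"),
    (7, "Generalization"),
    (7, "Gendered-Role"),
]

rows_per_token = Counter(token_id for token_id, _ in predictions)
multilabel_tokens = [token_id for token_id, n in rows_per_token.items() if n > 1]

# Number of prediction rows, number of unique tokens, tokens with several labels
print(len(predictions), len(rows_per_token), multilabel_tokens)
```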

In [6]:
a = "rf"

In [7]:
pred_df = pd.read_csv(predictions_dir+"cc-{a}_linglabels_baseline_fastText{d}_predictions.csv".format(a=a,d=d), index_col=0)
print(pred_df.shape)
pred_df.head()

(753522, 6)


Unnamed: 0,description_id,sentence_id,token_id,token,pos,predicted_label
0,0,0,0,Identifier,NN,O
1,0,0,1,:,:,O
2,0,0,2,AA5,NN,O
3,3,4,134,He,PRP,Gendered-Pronoun
4,3,4,135,was,VBD,O


In [25]:
target_col = "label"
exp_df = df.drop(columns=["ann_id"])
exp_df = exp_df.rename(columns={target_col:"expected_{}".format(target_col)})
print(exp_df.shape)
exp_df.head()

(778801, 10)


Unnamed: 0,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,fold,expected_label
0,0,0,0,Identifier,"(0, 10)",NN,O,Identifier,split4,O
1,0,0,1,:,"(10, 11)",:,O,Identifier,split4,O
2,0,0,2,AA5,"(12, 15)",NN,O,Identifier,split4,O
3,1,1,3,Title,"(17, 22)",NN,O,Title,split2,O
4,1,1,4,:,"(22, 23)",:,O,Title,split2,O


In [26]:
assert len(pred_df.token_id.unique()) == len(exp_df.token_id.unique())

In [35]:
exp_col = "expected_{}".format(target_col)
pred_col = "predicted_{}".format(target_col)
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["description_id", "sentence_id", "token_id", "token", "pos", exp_col],
    ["description_id", "sentence_id", "token_id", "token", "pos", pred_col],
    ["description_id", "sentence_id", "token_id", "token", "token_offsets", "pos", "tag", "field", "fold", exp_col, pred_col, "_merge"], 
    pred_col,
    exp_col,
    "O"
)

In [36]:
assert eval_df.loc[eval_df.token_id == ""].shape[0] == 0

In [37]:
eval_df.head()

Unnamed: 0,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,fold,expected_label,predicted_label,_merge
0,0,0,0,Identifier,"(0, 10)",NN,O,Identifier,split4,O,O,true negative
1,0,0,1,:,"(10, 11)",:,O,Identifier,split4,O,O,true negative
2,0,0,2,AA5,"(12, 15)",NN,O,Identifier,split4,O,O,true negative
3,1,1,3,Title,"(17, 22)",NN,O,Title,split2,O,O,true negative
4,1,1,4,:,"(22, 23)",:,O,Title,split2,O,O,true negative


In [38]:
eval_df._merge.value_counts()

true negative     770519
true positive       6214
false negative      2971
false positive      2020
Name: _merge, dtype: int64

Save the data:

In [39]:
eval_df.to_csv(predictions_dir+"cc-{a}_linglabels_baseline_fastText{d}_strict_evaluation.csv".format(a=a,d=d))

Calculate the true positives, false positives, false negatives, precision, recall, and F1 metrics for each label:

In [40]:
labels = list(ling_label_tags.keys())
print(labels)

['Gendered-Pronoun', 'Gendered-Role', 'Generalization']


In [42]:
agmt_scores = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })
for label in labels:
    agmt_df = pd.concat([eval_df.loc[eval_df[exp_col] == label], eval_df.loc[eval_df[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Gendered-Pronoun,77.0,990.0,3654.0,0.786822,0.979362,0.872597
0,Gendered-Role,1064.0,862.0,2192.0,0.717747,0.673219,0.69477
0,Generalization,1728.0,168.0,294.0,0.636364,0.145401,0.236715
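
`utils.precisionRecallF1` is defined outside this notebook. Assuming it applies the standard definitions, the scores can be recomputed from the counts alone. The helper name `precision_recall_f1` is an illustrative stand-in, not the project's function:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gendered-Pronoun counts from the strict token-level table above
p, r, f = precision_recall_f1(tp=3654, fp=990, fn=77)
print(round(p, 6), round(r, 6), round(f, 6))
```

With the Gendered-Pronoun counts this gives roughly 0.787, 0.979, and 0.873, in line with the first row of the table.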


Save the data:

In [43]:
agmt_scores.to_csv(agreement_dir+"cc-{a}_linglabels_baseline_fastText{d}_strict_agmt.csv".format(a=a,d=d))

#### Evaluate: Each Annotation

Join the manual annotation IDs to the evaluation data:

In [46]:
df_by_ann = df_by_ann.explode(["description_id", "sentence_id", "token_id", "token_offsets", "pos", "tag", "field", "fold"])
df_by_ann = df_by_ann.reset_index()
df_by_ann.head()

Unnamed: 0,ann_id,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,fold,label
0,0,2364,5760,133674,[knighted],"(1407, 1415)",VBN,B-Gendered-Role,Biographical / Historical,split3,Gendered-Role
1,1,4542,10365,228678,"[knighthood, .]","(9625, 9635)",NN,B-Gendered-Role,Scope and Contents,split2,Gendered-Role
2,1,4542,10365,228679,"[knighthood, .]","(9635, 9636)",.,I-Gendered-Role,Scope and Contents,split2,Gendered-Role
3,2,3660,8733,196525,"[Prince, Regent, .]","(2426, 2432)",NNP,B-Gendered-Role,Biographical / Historical,split3,Gendered-Role
4,2,3660,8733,196526,"[Prince, Regent, .]","(2433, 2439)",NNP,I-Gendered-Role,Biographical / Historical,split3,Gendered-Role


In [47]:
pred_df.head()

Unnamed: 0,description_id,sentence_id,token_id,token,pos,predicted_label
0,0,0,0,Identifier,NN,O
1,0,0,1,:,:,O
2,0,0,2,AA5,NN,O
3,3,4,134,He,PRP,Gendered-Pronoun
4,3,4,135,was,VBD,O


In [48]:
to_add = df_by_ann[["ann_id", "token_id", "label"]]
eval_df_joined = pred_df.join(to_add.set_index("token_id"), on="token_id", how="outer")
print(eval_df_joined.shape)
eval_df_joined = eval_df_joined.rename(columns={"label":exp_col})
eval_df_joined["ann_id"] = eval_df_joined["ann_id"].fillna(99999)
eval_df_joined[exp_col] = eval_df_joined[exp_col].fillna("")
eval_df_joined.head()

(753914, 8)


Unnamed: 0,description_id,sentence_id,token_id,token,pos,predicted_label,ann_id,expected_label
0,0,0,0,Identifier,NN,O,99999.0,
1,0,0,1,:,:,O,99999.0,
2,0,0,2,AA5,NN,O,99999.0,
3,3,4,134,He,PRP,Gendered-Pronoun,14377.0,Gendered-Pronoun
4,3,4,135,was,VBD,O,99999.0,


In [50]:
eval_by_ann = utils.implodeDataFrame(eval_df_joined, ["description_id", "sentence_id", "ann_id", "expected_label"]).reset_index()
print(eval_by_ann.shape)
eval_by_ann.head()

(49808, 8)


Unnamed: 0,description_id,sentence_id,ann_id,expected_label,token_id,token,pos,predicted_label
0,0,0,99999.0,,"[0, 1, 2]","[Identifier, :, AA5]","[NN, :, NN]","[O, O, O]"
1,1,1,99999.0,,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[NN, :, NNS, IN, DT, NNP, NNP, NNP, NNP, NNP, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O]"
2,2,2,99999.0,,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,3,3,99999.0,,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[NNP, /, NNP, :, NNP, NNP, NNP, NNP, VBD, DT, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,3,4,14377.0,Gendered-Pronoun,[134],[He],[PRP],[Gendered-Pronoun]


Reduce the predicted labels for each annotation to their unique values:

In [51]:
pred_label_col = list(eval_by_ann[pred_col])
unique_pred_label_col = [list(set(pred_labels)) for pred_labels in pred_label_col]
eval_by_ann = eval_by_ann.drop(columns=[pred_col])
eval_by_ann.insert(len(eval_by_ann.columns), pred_col, unique_pred_label_col)
eval_by_ann = eval_by_ann.explode([pred_col])
print(eval_by_ann.shape)
eval_by_ann.head()

(52047, 8)


Unnamed: 0,description_id,sentence_id,ann_id,expected_label,token_id,token,pos,predicted_label
0,0,0,99999.0,,"[0, 1, 2]","[Identifier, :, AA5]","[NN, :, NN]",O
1,1,1,99999.0,,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[NN, :, NNS, IN, DT, NNP, NNP, NNP, NNP, NNP, ...",O
2,2,2,99999.0,,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...",O
3,3,3,99999.0,,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[NNP, /, NNP, :, NNP, NNP, NNP, NNP, VBD, DT, ...",O
4,3,4,14377.0,Gendered-Pronoun,[134],[He],[PRP],Gendered-Pronoun


In [52]:
assert eval_by_ann.loc[eval_by_ann.expected_label.isna()].shape[0] == 0
assert eval_by_ann.loc[eval_by_ann.predicted_label.isna()].shape[0] == 0

In [53]:
# exp_col = "expected_label"
# pred_col = "predicted_label"
df_pred = eval_by_ann.drop(columns=[exp_col, "token_id", "token", "pos"])
df_exp = eval_by_ann.drop(columns=[pred_col, "token_id", "token", "pos"])

Record the agreement type for each row, either false positive, true positive, false negative, or true negative:

In [62]:
join_on =  ["description_id", "sentence_id", "ann_id"]
eval_df = utils.makeEvaluationDataFrame(
    df_exp, 
    df_pred, 
    join_on+[exp_col], 
    join_on+[pred_col], 
    ["description_id", "sentence_id", "ann_id", "expected_label", "predicted_label", "_merge"], 
    exp_col, 
    pred_col, 
    "O"
)
eval_df = eval_df.sort_values(by=["ann_id", exp_col, pred_col])
eval_df.head()

Unnamed: 0,description_id,sentence_id,ann_id,expected_label,predicted_label,_merge
9449,2364,5760,0.0,Gendered-Role,O,false negative
58683,2364,5760,0.0,O,Generalization,false positive
16313,4542,10365,1.0,Gendered-Role,Gendered-Role,true positive
16314,4542,10365,1.0,Gendered-Role,Gendered-Role,true positive
63847,4542,10365,1.0,O,O,true negative
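
The four agreement types can be summarized as a small decision function. This is a sketch rather than the logic of `utils.makeEvaluationDataFrame`, which is not shown here. It assumes `O` marks the absence of a label and that a disagreement between two different labels has already been split into a false negative row and a false positive row, as in the first two rows above. The name `agreement_type` is hypothetical:

```python
def agreement_type(expected, predicted, outside="O"):
    """Classify one expected/predicted label pair into an agreement category."""
    if expected == predicted:
        return "true negative" if expected == outside else "true positive"
    if expected == outside:
        return "false positive"
    if predicted == outside:
        return "false negative"
    # Two different real labels: assumed to be split into separate rows upstream
    raise ValueError("expected and predicted carry different non-outside labels")


pairs = [
    ("O", "O"),
    ("Gendered-Role", "Gendered-Role"),
    ("Gendered-Role", "O"),
    ("O", "Generalization"),
]
categories = [agreement_type(expected, predicted) for expected, predicted in pairs]
print(categories)
```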


In [13]:
eval_df = eval_df.drop_duplicates()
eval_df.shape

(95641, 6)

Save the data:

In [14]:
eval_df.to_csv(predictions_dir+"cc-{a}_linglabels_baseline_fastText{d}_annot_evaluation.csv".format(a=a,d=d))

Calculate annotation agreement metrics for each label:

In [65]:
labels = list(ling_label_tags.keys())
print(labels)

['Gendered-Pronoun', 'Gendered-Role', 'Generalization']


In [15]:
agmt_scores = pd.DataFrame.from_dict({
        "label":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })
for label in labels:
    agmt_df = pd.concat([eval_df.loc[eval_df[exp_col] == label], eval_df.loc[eval_df[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Gendered-Pronoun,29.0,851.0,3654.0,0.811099,0.992126,0.892526
0,Gendered-Role,535.0,791.0,2255.0,0.740315,0.808244,0.77279
0,Generalization,1010.0,167.0,305.0,0.646186,0.231939,0.341354


Save the scores:

In [16]:
agmt_scores.to_csv(agreement_dir+"cc-{a}_linglabels_baseline_fastText{d}_annot_agmt.csv".format(a=a,d=d))

#### Evaluate: Loose, Each Label

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

In [None]:
## Create a copy of the evaluation DataFrame where tags are replaced by label names:
a = "rf"
eval_df = pd.read_csv(predictions_dir+"cc-{a}_ling_baseline_fastText{d}_strict_evaluation.csv".format(a=a,d=d), index_col=0)
loose_eval_df = eval_df.copy()
for label,tags in ling_label_tags.items():
    for tag in tags:
        loose_eval_df[exp_col] = loose_eval_df[exp_col].replace(to_replace=tag, value=label)
        loose_eval_df[pred_col] = loose_eval_df[pred_col].replace(to_replace=tag, value=label)
# loose_eval_df.head()
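
The difference between strict and loose matching can be illustrated on individual tag pairs. The helpers below (`label_of`, `strict_match`, `loose_match`) are hypothetical names for a minimal sketch, assuming tags use the `B-`/`I-` prefixes listed in `ling_label_tags`:

```python
def label_of(tag):
    """Drop a leading 'B-' or 'I-' so that only the label name remains."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


def strict_match(expected_tag, predicted_tag):
    """Strict: the full IOB tags must be identical."""
    return expected_tag == predicted_tag


def loose_match(expected_tag, predicted_tag):
    """Loose: only the label names must agree."""
    return label_of(expected_tag) == label_of(predicted_tag)


pairs = [
    ("B-Gendered-Role", "B-Gendered-Role"),   # same tag
    ("I-Gendered-Role", "B-Gendered-Role"),   # same label, different IOB prefix
    ("B-Generalization", "B-Gendered-Role"),  # different label
]
results = [(strict_match(e, p), loose_match(e, p)) for e, p in pairs]
print(results)
```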

In [None]:
loose_eval_df.loc[loose_eval_df.predicted_tag.isna()].shape

(588, 8)

In [None]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
loose_eval_df.head()

Unnamed: 0,description_id,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
1,3,5,155,his,PRP$,Gendered-Pronoun,Gendered-Pronoun,true positive
3,3,5,157,he,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive
152,7,24,668,he,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive
158,7,24,674,he,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive
220,7,28,756,He,PRP,Gendered-Pronoun,Gendered-Pronoun,true positive


In [None]:
loose_eval_df.to_csv(predictions_dir+"cc-{a}_ling_baseline_fastText{d}_evaluation_loose.csv".format(a=a,d=d))

In [None]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[],
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [None]:
for label,tags in ling_label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Gendered-Pronoun,24.0,0.0,1470.0,1.0,0.983936,0.991903
0,Gendered-Role,243.0,0.0,900.0,1.0,0.787402,0.881057
0,Generalization,321.0,0.0,128.0,1.0,0.285078,0.443674


Save the data:

In [None]:
loose_agmt.to_csv(agreement_dir+"cc-{a}_ling_baseline_fastText{d}_loose_agmt.csv".format(a=a,d=d))