# Gender Biased Sequence Classifiers

## Feature Engineering

### Target: Labels

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/crf_l2sgd/`
* Sequence classification
    * 9 lables (2 from original annotation taxonomy weren't applied during manual annotation):
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not annotated with)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering not annotated with)
    * 1 model per category
* Word embeddings
    * Custom fastText (word2vec with subwords, trained on Archives' descriptive metadata extracted in October 2020)  

***

### Table of Contents

[0.](#0) Preprocessing

  * [Hypotheses](#h)

[1.](#1) Models

  * [Linguistic](#ling)
  * [Person Name](#pers)
  * [Contextual](#cont)

[2.](#2) Performance Evaluation

[3.](#3) Transitions

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats
from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay#, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [6]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag"]

In [7]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create separate subsets of data for each category so they can be used with three separate models, replacing `NaN` tag values with `'O'`:

In [8]:
tags = (df_train.tag.unique())
tags.sort()
print(tags)

['B-Feminine' 'B-Gendered-Pronoun' 'B-Gendered-Role' 'B-Generalization'
 'B-Masculine' 'B-Occupation' 'B-Omission' 'B-Stereotype' 'B-Unknown'
 'I-Feminine' 'I-Gendered-Pronoun' 'I-Gendered-Role' 'I-Generalization'
 'I-Masculine' 'I-Occupation' 'I-Omission' 'I-Stereotype' 'I-Unknown' 'O']


In [9]:
ling_cat_tags = ['B-Gendered-Pronoun', 'B-Gendered-Role', 'B-Generalization', 'I-Gendered-Pronoun', 'I-Gendered-Role', 'I-Generalization']
df_train_ling = df_train.loc[df_train.tag.isin(ling_cat_tags)]
df_dev_ling = df_dev.loc[df_dev.tag.isin(ling_cat_tags)]

In [10]:
pers_cat_tags = ['B-Feminine', 'B-Masculine', 'B-Unknown', 'I-Feminine', 'I-Masculine', 'I-Unknown']
df_train_pers = df_train.loc[df_train.tag.isin(pers_cat_tags)]
df_dev_pers = df_dev.loc[df_dev.tag.isin(pers_cat_tags)]

In [11]:
cont_cat_tags = ['B-Occupation', 'B-Omission', 'B-Stereotype', 'I-Occupation', 'I-Omission', 'I-Stereotype']
df_train_cont = df_train.loc[df_train.tag.isin(cont_cat_tags)]
df_dev_cont = df_dev.loc[df_dev.tag.isin(cont_cat_tags)]

In [12]:
df_train = (df_train.drop(columns=["tag"])).drop_duplicates()
df_dev = (df_dev.drop(columns=["tag"])).drop_duplicates()

In [13]:
join_cols = ["sentence_id", "token_id", "pos", "token"]

In [14]:
df_train_ling = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
df_train_ling = df_train_ling.rename(columns={"tag":"tag_linguistic"})
df_train_ling = df_train_ling.fillna('O')
# df_train_ling.head()
df_dev_ling = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
df_dev_ling = df_dev_ling.rename(columns={"tag":"tag_linguistic"})
df_dev_ling = df_dev_ling.fillna('O')
# df_dev_ling.head()

In [15]:
df_train_pers = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer")
df_train_pers = df_train_pers.rename(columns={"tag":"tag_personname"})
df_train_pers = df_train_pers.fillna('O')
df_dev_pers = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer")
df_dev_pers = df_dev_pers.rename(columns={"tag":"tag_personname"})
df_dev_pers = df_dev_pers.fillna('O')
# df_dev_pers.head()

In [16]:
df_train_cont = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer")
df_train_cont = df_train_cont.rename(columns={"tag":"tag_contextual"})
df_train_cont = df_train_cont.fillna('O')
df_dev_cont = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer")
df_dev_cont = df_dev_cont.rename(columns={"tag":"tag_contextual"})
df_dev_cont = df_dev_cont.fillna('O')
df_train_cont.head()

Unnamed: 0,sentence_id,token_id,pos,token,tag_contextual
3,1,3,NN,Title,O
4,1,4,:,:,O
5,1,5,NNS,Papers,O
6,1,6,IN,of,O
7,1,7,DT,The,B-Stereotype


In [17]:
df_train_ling = df_train_ling.drop_duplicates()
df_dev_ling = df_dev_ling.drop_duplicates()
df_train_pers = df_train_pers.drop_duplicates()
df_dev_pers = df_dev_pers.drop_duplicates()
df_train_cont = df_train_cont.drop_duplicates()
df_dev_cont = df_dev_cont.drop_duplicates()

In [18]:
train_dfs = [df_train_ling, df_train_pers, df_train_cont]
dev_dfs = [df_dev_ling, df_dev_pers, df_dev_cont]
for df in train_dfs:
    print(df.shape[0], len(df.token_id.unique()))
print()
for df in dev_dfs:
    print(df.shape[0], len(df.token_id.unique()))

452222 452086
455327 452086
453119 452086

152494 152455
153568 152455
152768 152455


Tokens can have multiple tags, so there are more rows than unique token IDs.  In order to pass the data into a CRF model, we need to have one tag per token, so we'll simply **take the first tag** when we extract features for each token.

### Feature Engineering

Define feature dictionaries for baseline models, using word embeddings, tokens, and part-of-speech tags as features.

**Word Embeddings**

Use the custom fastText word embeddings, trained on the entire dataset of descriptive metadata from the Archives (harvested in October 2020) using the Continuous Bag-of-Words (CBOW) algorithm.  Subword embeddings (for subwords from 2 to 6 characters long, inclusive) are used to infer the embeddings for out-of-vocabulary (OOV) words.

Use the word embedding model trained on lowercased text to 100 dimensions: 

In [19]:
file_name = config.tokc_path+"fasttext100_lowercased.model"  #get_tmpfile()
embedding_model = FastText.load(file_name)

In [20]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


In [68]:
# Reference: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,
        'token': token,
#         'token[-3:]': token[-3:]
    }
    
    # Add each value in a token's word embedding as a separate feature
    # Reference: https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag_list[0] for token, pos, tag_list in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag_list in sentence]

################################################
# For Person Name Model
################################################
def extractPersonNameTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,
        'token': token,
        'pos': pos,
        'pos[:2]': pos[:2],
#         'token[-3:]': token[-3:]
    }
    
    # Add each value in a token's word embedding as a separate feature
    # Reference: https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractPersonNameSentenceFeatures(sentence):
    return [extractPersonNameTokenFeatures(sentence, i) for i in range(len(sentence))]


<a id="h"></a>
#### Hypotheses:
Note: Any features that improve a model's performance will be kept for subsequence feature engineering steps.

**1. Part-of-Speech (POS) Tags**

* POS tags will improve the Linguistic model's performance, raising the F1 score by over 0.1 - *countered*
* POS tags will improve the Person Name model's performance, raising the F1 score by over 0.1 - *supported*
* POS tags will not improve the Contextual model's performance, raising the F1 score by no more than 0.1 - *countered*

**2. Token Suffixes (last 3 letters):**

* Suffixes will improve the Linguistic model's F1 score by over 0.1 - *countered*
* Suffixes will not change the Person Name model's F1 score by more than +0.1 or -0.1 - *countered (worsened performance >0.1)*
* Suffixes will not improve the Contextual model's performance, raising the F1 score by no more than 0.1 - *supported*

<a id="1"></a>
## 1. Models
<a id="ling"></a>
### Linguistic

* **Features:** custom fastText embeddings, token suffixes
* **Target:** Linguistic label category IOB tags
* **Algorithm:** L2SGD

#### Preprocessing

In [69]:
df_train = df_train_ling
df_dev = df_dev_ling

Group the data by token, so the all the tags for one token are recorded in a list for that token's row:

In [70]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

Group the data by sentence, where each sentence is a list of tokens:

In [71]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [72]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_ling = utils.zipFeaturesAndTarget(df_train_grouped, "tag_linguistic")
# print(train_sentences_ling[0][:3])
dev_sentences_ling = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_linguistic")
# print(dev_sentences_ling[0][:3])

[('Title', 'NN', ['O']), (':', ':', ['O']), ('Papers', 'NNS', ['O'])]
[('After', 'IN', ['O']), ('his', 'PRP$', ['B-Gendered-Pronoun']), ('ordination', 'NN', ['O'])]


Extract the features and targets:

In [73]:
train_sentences = train_sentences_ling
dev_sentences = dev_sentences_ling

In [74]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** pa with pa_type=0 was best, however, arow with variance=0.5 also had strong performance, and is strongest with other categories, Person Name and Contextual, so we'll use arow.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Linguistic** category of tags.  We'll increase the max_iterations to 100 for this model.

In [75]:
clf_ling = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [76]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_ling.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [77]:
targets = list(clf_ling.classes_)
targets.remove('O')
print(targets)

['B-Gendered-Pronoun', 'B-Generalization', 'B-Gendered-Role', 'I-Generalization', 'I-Gendered-Role', 'I-Gendered-Pronoun']


#### Predict

In [78]:
y_pred = clf_ling.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [79]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.5939825098668525
  - Prec: 0.6556302138252653
  - Rec 0.5489432703003337


The baseline model's scores were:
  - F1: 0.6697698893667122
  - Prec: 0.696012355375722
  - Rec 0.6718576195773082
  
Adding part-of-speech tags seems to decrease the model's performance, **counter** to the hypothesis:
  - F1: 0.5351115706558578
  - Prec: 0.6588512211470845
  - Rec 0.46051167964404893
  
Adding token suffixes also seems to decrease the model's performance (though not as much as adding POS tags), **counter** to the hypothesis:
  - F1: 0.5939825098668525
  - Prec: 0.6556302138252653
  - Rec 0.5489432703003337

For closer inspection, save the prediction data in a directory specific to this model:

In [80]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_linguistic":"tag_linguistic_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_linguistic_predicted", y_pred)
# df_dev_grouped.head()

In [81]:
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns)[1:])
df_dev_exploded.head()

Unnamed: 0,sentence_id,token_id,pos,sentence,tag_linguistic_expected,tag_linguistic_predicted
0,5,154,IN,After,[O],O
0,5,155,PRP$,his,[B-Gendered-Pronoun],O
0,5,156,NN,ordination,[O],O
0,5,157,PRP,he,[B-Gendered-Pronoun],B-Gendered-Pronoun
0,5,158,VBD,spent,[O],O


In [82]:
output_path = "model_output/crf_l2sgd/"
Path(config.tokc_path+output_path).mkdir(exist_ok=True, parents=True)

In [83]:
# filename = "crf_l2sgd_linguistic_labels_withPOS.csv"
# df_dev_exploded.to_csv(config.tokc_path+output_path+filename)
filename = "crf_l2sgd_linguistic_labels_withSuffix.csv"
df_dev_exploded[["token_id", "tag_linguistic_expected", "tag_linguistic_predicted"]].to_csv(config.tokc_path+output_path+filename)

<a id="pers"></a>
### Person Name

* **Features:** custom fastText embeddings, POS tags, token suffixes
* **Target:** Person-Name label category IOB tags
* **Algorithm:** L2SGD

#### Preprocessing

In [84]:
df_train = df_train_pers
df_dev = df_dev_pers

In [85]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [86]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [87]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_pers = utils.zipFeaturesAndTarget(df_train_grouped, "tag_personname")
# print(train_sentences_pers[0][:3])
dev_sentences_pers = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_personname")
# print(dev_sentences_pers[0][:3])

In [88]:
train_sentences = train_sentences_pers
dev_sentences = dev_sentences_pers

In [89]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=0.5 was best-performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags.  We'll increase the max iterations to 100 for this model.

In [90]:
clf_pers = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [91]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [92]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['B-Unknown', 'I-Unknown', 'I-Masculine', 'B-Masculine', 'B-Feminine', 'I-Feminine']


#### Predict

In [93]:
y_pred = clf_pers.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [94]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.4724254452794501
  - Prec: 0.5036375286524056
  - Rec 0.45406486640948707


The baseline model's performance scores were:
  - F1: 0.4789287297354048
  - Prec: 0.5381704595959792
  - Rec 0.43620517216745247
  
Adding POS tags as features improves the performance of the Person Name sequence classification model, **supporting** our hypothesis:
  - F1: 0.5011854407104209
  - Prec: 0.5297892459718619
  - Rec 0.4823546220888698
  
Adding token suffixes seems to worsen the performance of the model overall, **countering** our hypothesis:
  - F1: 0.4724254452794501
  - Prec: 0.5036375286524056
  - Rec 0.45406486640948707

For further inspection, save the prediction data:

In [95]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_personname":"tag_personname_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_personname_predicted", y_pred)
# df_dev_grouped.head()

Unnamed: 0,sentence_id,token_id,pos,sentence,tag_personname_expected,tag_personname_predicted
0,5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[[O], [O], [O]]","[O, O, O]"
2,13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[[O], [O], [B-Masculine], [I-Masculine], [O], ...","[O, O, O, O, O, B-Unknown, O, O, O, O, O, O, O..."
4,24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[[O], [O], [B-Masculine], [I-Masculine], [I-Ma...","[O, O, B-Masculine, I-Masculine, I-Masculine, ..."


In [96]:
df_dev_grouped = df_dev_grouped.set_index("sentence_id")
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
# df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_personname_expected,tag_personname_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,[O],O
5,155,PRP$,his,[O],O
5,156,NN,ordination,[O],O
5,157,PRP,he,[O],O
5,158,VBD,spent,[O],O


In [97]:
# filename = "crf_l2sgd_personname_labels_withPOS.csv"
# df_dev_exploded.to_csv(config.tokc_path+output_path+filename)
filename = "crf_l2sgd_personname_labels_withPOSSuffix.csv"
df_dev_exploded[["token_id", "tag_personname_expected", "tag_personname_predicted"]].to_csv(config.tokc_path+output_path+filename)

<a id="cont"></a>
### Contextual

* **Features:** custom fastText embeddings, token suffixes
* **Target:** Contextual label category IOB tags
* **Algorithm:** L2SGD

#### Preprocessing

In [98]:
df_train = df_train_cont
df_dev = df_dev_cont

In [99]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [100]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [101]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_cont = utils.zipFeaturesAndTarget(df_train_grouped, "tag_contextual")
# print(train_sentences_cont[0][:3])
dev_sentences_cont = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_contextual")
# print(dev_sentences_cont[0][:3])

In [102]:
train_sentences = train_sentences_cont
dev_sentences = dev_sentences_cont

In [103]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=0.5 was the best performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Contextual** category of tags.  We'll increase the maximum iterations to 100 for this model.

In [104]:
clf_cont = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [105]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_cont.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [106]:
targets = list(clf_cont.classes_)
targets.remove('O')
print(targets)

['B-Stereotype', 'I-Stereotype', 'B-Occupation', 'I-Occupation', 'B-Omission', 'I-Omission']


#### Predict

In [107]:
y_pred = clf_cont.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [108]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.4087296343564073
  - Prec: 0.4288273306621178
  - Rec 0.3930510314875136


The baseline model's performance scores were:
  - F1: 0.40803990799640844
  - Prec: 0.4124968743940007
  - Rec 0.4091205211726384
  
Adding POS tags seems to decrease the model's performance, **counter** to the hypothesis:
  - F1: 0.3848080068912827
  - Prec: 0.41570355576471396
  - Rec 0.358957654723127

Adding token suffixes seems to have less than a 0.1 impact on the model's performance, in **support** of the hypothesis:
  - F1: 0.4087296343564073
  - Prec: 0.4288273306621178
  - Rec 0.3930510314875136

For further inspection, save the prediction data:

In [109]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_contextual":"tag_contextual_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_contextual_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0,sentence_id,token_id,pos,sentence,tag_contextual_expected,tag_contextual_predicted
0,5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, B-Occupation, O..."
1,11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[[O], [O], [O]]","[O, O, O]"
2,13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[[O], [O], [B-Stereotype], [I-Stereotype], [I-...","[O, O, O, O, O, O, O, O, O, O, B-Omission, I-O..."
4,24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, B-Occupation, O..."


In [110]:
df_dev_grouped = df_dev_grouped.set_index("sentence_id")
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_contextual_expected,tag_contextual_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,[O],O
5,155,PRP$,his,[O],O
5,156,NN,ordination,[O],O
5,157,PRP,he,[O],O
5,158,VBD,spent,[O],O


In [111]:
# filename = "crf_l2sgd_contextual_labels_withPOS.csv"
# df_dev_exploded.to_csv(config.tokc_path+output_path+filename)
filename = "crf_l2sgd_contextual_labels_withPOSSuffixes.csv"
df_dev_exploded[["token_id", "tag_contextual_expected", "tag_contextual_predicted"]].to_csv(config.tokc_path+output_path+filename)

<a id="2"></a>
## NOT YET UPDATED & RUN: 2. Performance Evaluation

### Strict Evaluation

The built-in evaluation approach is strict, so unless the model predictions' labels are on text spans that exactly match the development data's test, the predicted labels will be deemed incorrect.

In [174]:
output_path = "model_output/crf_l2sgd/"

In [175]:
category = "contextual"
filename = "crf_l2sgd_{}_labels_baseline.csv".format(category)
pred_cont = pd.read_csv(config.tokc_path+output_path+filename, index_col=0)
pred_cont = utils.getColumnValuesAsLists(pred_cont, "tag_{}_expected".format(category))
# pred_cont.head()

In [176]:
category = "personname"
filename = "crf_l2sgd_{}_labels_baseline.csv".format(category)
pred_pers = pd.read_csv(config.tokc_path+output_path+filename, index_col=0)
pred_pers = utils.getColumnValuesAsLists(pred_pers, "tag_{}_expected".format(category))
# pred_pers.head()

In [177]:
category = "linguistic"
filename = "crf_l2sgd_{}_labels_baseline.csv".format(category)
pred_ling = pd.read_csv(config.tokc_path+output_path+filename, index_col=0)
pred_ling = utils.getColumnValuesAsLists(pred_ling, "tag_{}_expected".format(category))
pred_ling = pred_ling.set_index("sentence_id")
# pred_ling.head()

Calculate performance metrics for each category of labels:

In [178]:
category = "contextual"
pred_cont = utils.isPredictedInExpected(pred_cont, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_cont.head()

In [179]:
category = "personname"
pred_pers = utils.isPredictedInExpected(pred_pers, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_pers.head()

In [180]:
category = "linguistic"
pred_ling = utils.isPredictedInExpected(pred_ling, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_ling.head()

In [181]:
category = "contextual"
tags = ['B-Occupation', 'I-Occupation', 'B-Omission', 'I-Omission', 'B-Stereotype', 'I-Stereotype']
pred_cont_stats = utils.getScoresByCatTags(
    pred_cont, "_merge", tags[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_cont, "_merge", tags[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_cont_stats = pd.concat([pred_cont_stats, tag_stats])
# pred_cont_stats

In [182]:
category = "personname"
tags = ['B-Feminine', 'I-Feminine', 'B-Masculine', 'I-Masculine', 'B-Unknown', 'I-Unknown', "B-Nonbinary", "I-Nonbinary"]
pred_pers_stats = utils.getScoresByCatTags(
    pred_pers, "_merge", tags[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_pers, "_merge", tags[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_pers_stats = pd.concat([pred_pers_stats, tag_stats])
# pred_pers_stats

In [183]:
category = "linguistic"
tags = ["B-Gendered-Pronoun", "I-Gendered-Pronoun", "B-Gendered-Role", "I-Gendered-Role", "B-Generalization", "I-Generalization"]
pred_ling_stats = utils.getScoresByCatTags(
    pred_ling, "_merge", tags[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_ling, "_merge", tags[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_ling_stats = pd.concat([pred_ling_stats, tag_stats])
# pred_ling_stats

Combine the statistics:

In [184]:
stats = pd.concat([pred_cont_stats, pred_pers_stats, pred_ling_stats])
stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,B-Occupation,270,191,350,0.64695,0.564516,0.602929
0,I-Occupation,386,326,351,0.518464,0.476255,0.496464
0,B-Omission,429,546,524,0.48972,0.549843,0.518043
0,I-Omission,897,1194,522,0.304196,0.367865,0.333014
0,B-Stereotype,155,98,41,0.294964,0.209184,0.244776
0,I-Stereotype,492,360,148,0.291339,0.23125,0.25784
0,B-Feminine,38,95,202,0.680135,0.841667,0.752328
0,I-Feminine,143,149,402,0.729583,0.737615,0.733577
0,B-Masculine,236,280,456,0.619565,0.65896,0.638655
0,I-Masculine,345,435,432,0.49827,0.555985,0.525547


Save the statistics:

In [185]:
stats.to_csv(config.tokc_path+output_path+"crf_l2sgd_baseline_performance_strict_alltags.csv")

#### Annotation Agreement

### Loose Evaluation

As with the manual annotation evaluation, we want to evaluate the predictions more loosely, considering overlapping text spans in addition to exactly matching text spans.

#### Token Agreement

First, generalize the tokens' IOB tags to the label, and calculate agreement scores for each label.

In [186]:
category = "contextual"
pred_cont_labels = pred_cont.copy()
tag_exp = list(pred_cont_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_cont_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
# print(label_exp[:20])  # Looks good
# print(label_pred[:20]) # Looks good
pred_cont_labels = pred_cont_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_cont_labels.insert(len(pred_cont_labels.columns), "label_{}_expected".format(category), label_exp)
pred_cont_labels.insert(len(pred_cont_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_cont_labels.head(20)  # Looks good

In [187]:
category = "personname"
pred_pers_labels = pred_pers.copy()
tag_exp = list(pred_pers_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_pers_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_pers_labels = pred_pers_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_pers_labels.insert(len(pred_pers_labels.columns), "label_{}_expected".format(category), label_exp)
pred_pers_labels.insert(len(pred_pers_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_pers_labels.loc[pred_pers_labels.label_personname_predicted == "Feminine"].head()  # Looks good

In [188]:
category = "linguistic"
pred_ling_labels = pred_ling.copy()
tag_exp = list(pred_ling_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_ling_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_ling_labels = pred_ling_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_ling_labels.insert(len(pred_ling_labels.columns), "label_{}_expected".format(category), label_exp)
pred_ling_labels.insert(len(pred_ling_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_ling_labels.head()  # Looks good

Calculate the agreement metrics at the label level for each token:

In [190]:
category = "contextual"
tags = ['Occupation', 'Omission', 'Stereotype']
pred_cont_labels = pred_cont_labels.drop(columns=["_merge"])
pred_cont_labels = utils.isPredictedInExpected(pred_cont_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')

pred_cont_stats = utils.getScoresByCatTags(
    pred_cont_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_cont_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_cont_stats = pd.concat([pred_cont_stats, tag_stats])
pred_cont_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Occupation,656,481,737,0.60509,0.529074,0.564535
0,Omission,1297,1684,1102,0.395549,0.459358,0.425072
0,Stereotype,638,449,198,0.306028,0.236842,0.267026


In [191]:
category = "personname"
tags = ['Feminine', 'Masculine', 'Unknown', "Nonbinary"]
pred_pers_labels = pred_pers_labels.drop(columns=["_merge"])
pred_pers_labels = utils.isPredictedInExpected(pred_pers_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')


pred_pers_stats = utils.getScoresByCatTags(
    pred_pers_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_pers_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_pers_stats = pd.concat([pred_pers_stats, tag_stats])
pred_pers_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Feminine,176,210,638,0.752358,0.783784,0.76775
0,Masculine,578,669,934,0.582658,0.617725,0.599679
0,Unknown,1801,1105,2154,0.660939,0.544627,0.597172
0,Nonbinary,0,0,0,0.0,0.0,0.0


In [192]:
category = "linguistic"
tags = ["Gendered-Pronoun", "Gendered-Role", "Generalization"]
pred_ling_labels = pred_ling_labels.drop(columns=["_merge"])
pred_ling_labels = utils.isPredictedInExpected(pred_ling_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')


pred_ling_stats = utils.getScoresByCatTags(
    pred_ling_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_ling_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_ling_stats = pd.concat([pred_ling_stats, tag_stats])
pred_ling_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Gendered-Pronoun,41,169,718,0.80947,0.945982,0.872418
0,Gendered-Role,264,170,421,0.712352,0.614599,0.659875
0,Generalization,268,90,84,0.482759,0.238636,0.319392


Combine and save the performance measures:

In [193]:
loose_stats = pd.concat([pred_cont_stats, pred_pers_stats, pred_ling_stats])
# loose_stats

In [194]:
loose_stats.to_csv(config.tokc_path+output_path+"crf_l2sgd_baseline_performance_loose_alltags.csv")

#### Annotation Agreement

Calculate agreement at the annotation level, so if the model labels any word correctly from a manually annotated text span, that annotation is recorded as being correctly labeled (`true positive`).  Note whether the models' labels are an `exact_match`, `label_match`, `category_match` or `mismatch`.

<a id="3"></a>
## 3. Transitions