# Baseline Gender Biased Token Classifiers

### Target: Labels

### Features: Word Embeddings

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/crf_l2sgd_baseline/`
* Sequence classification
    * 9 lables (2 from original annotation taxonomy weren't applied during manual annotation):
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not annotated with)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering not annotated with)
    * 1 model per category
* Word embeddings
    * Custom fastText (word2vec with subwords, trained on Archives' descriptive metadata extracted in October 2020)  

***

### Table of Contents

[0.](#0) Preprocessing

[1.](#1) Models

[2.](#2) Performance Evaluation

[3.](#3) Transitions

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats
from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay#, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [6]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag"]

In [7]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create separate subsets of data for each category so they can be used with three separate models, replacing `NaN` tag values with `'O'`:

In [8]:
tags = (df_train.tag.unique())
tags.sort()
print(tags)

['B-Feminine' 'B-Gendered-Pronoun' 'B-Gendered-Role' 'B-Generalization'
 'B-Masculine' 'B-Occupation' 'B-Omission' 'B-Stereotype' 'B-Unknown'
 'I-Feminine' 'I-Gendered-Pronoun' 'I-Gendered-Role' 'I-Generalization'
 'I-Masculine' 'I-Occupation' 'I-Omission' 'I-Stereotype' 'I-Unknown' 'O']


In [9]:
ling_cat_tags = ['B-Gendered-Pronoun', 'B-Gendered-Role', 'B-Generalization', 'I-Gendered-Pronoun', 'I-Gendered-Role', 'I-Generalization']
df_train_ling = df_train.loc[df_train.tag.isin(ling_cat_tags)]
df_dev_ling = df_dev.loc[df_dev.tag.isin(ling_cat_tags)]

In [10]:
pers_cat_tags = ['B-Feminine', 'B-Masculine', 'B-Unknown', 'I-Feminine', 'I-Masculine', 'I-Unknown']
df_train_pers = df_train.loc[df_train.tag.isin(pers_cat_tags)]
df_dev_pers = df_dev.loc[df_dev.tag.isin(pers_cat_tags)]

In [11]:
cont_cat_tags = ['B-Occupation', 'B-Omission', 'B-Stereotype', 'I-Occupation', 'I-Omission', 'I-Stereotype']
df_train_cont = df_train.loc[df_train.tag.isin(cont_cat_tags)]
df_dev_cont = df_dev.loc[df_dev.tag.isin(cont_cat_tags)]

In [12]:
df_train = (df_train.drop(columns=["tag"])).drop_duplicates()
df_dev = (df_dev.drop(columns=["tag"])).drop_duplicates()

In [13]:
join_cols = ["sentence_id", "token_id", "pos", "token"]

In [14]:
df_train_ling = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
df_train_ling = df_train_ling.rename(columns={"tag":"tag_linguistic"})
df_train_ling = df_train_ling.fillna('O')
# df_train_ling.head()
df_dev_ling = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
df_dev_ling = df_dev_ling.rename(columns={"tag":"tag_linguistic"})
df_dev_ling = df_dev_ling.fillna('O')
# df_dev_ling.head()

In [15]:
df_train_pers = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer")
df_train_pers = df_train_pers.rename(columns={"tag":"tag_personname"})
df_train_pers = df_train_pers.fillna('O')
df_dev_pers = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer")
df_dev_pers = df_dev_pers.rename(columns={"tag":"tag_personname"})
df_dev_pers = df_dev_pers.fillna('O')
# df_dev_pers.head()

In [16]:
df_train_cont = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer")
df_train_cont = df_train_cont.rename(columns={"tag":"tag_contextual"})
df_train_cont = df_train_cont.fillna('O')
df_dev_cont = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer")
df_dev_cont = df_dev_cont.rename(columns={"tag":"tag_contextual"})
df_dev_cont = df_dev_cont.fillna('O')
df_train_cont.head()

Unnamed: 0,sentence_id,token_id,pos,token,tag_contextual
3,1,3,NN,Title,O
4,1,4,:,:,O
5,1,5,NNS,Papers,O
6,1,6,IN,of,O
7,1,7,DT,The,B-Stereotype


In [17]:
df_train_ling = df_train_ling.drop_duplicates()
df_dev_ling = df_dev_ling.drop_duplicates()
df_train_pers = df_train_pers.drop_duplicates()
df_dev_pers = df_dev_pers.drop_duplicates()
df_train_cont = df_train_cont.drop_duplicates()
df_dev_cont = df_dev_cont.drop_duplicates()

In [18]:
train_dfs = [df_train_ling, df_train_pers, df_train_cont]
dev_dfs = [df_dev_ling, df_dev_pers, df_dev_cont]
for df in train_dfs:
    print(df.shape[0], len(df.token_id.unique()))
print()
for df in dev_dfs:
    print(df.shape[0], len(df.token_id.unique()))

452222 452086
455327 452086
453119 452086

152494 152455
153568 152455
152768 152455


Tokens can have multiple tags, so there are more rows than unique token IDs.  In order to pass the data into a CRF model, we need to have one tag per token, so we'll simply **take the first tag** when we extract features for each token.

#### Word Embeddings

Use the custom fastText word embeddings, trained on the entire dataset of descriptive metadata from the Archives (harvested in October 2020) using the Continuous Bag-of-Words (CBOW) algorithm.  Subword embeddings (for subwords from 2 to 6 characters long, inclusive) are used to infer the embeddings for out-of-vocabulary (OOV) words.

Use the word embedding model trained on lowercased text to 100 dimensions: 

In [19]:
file_name = config.tokc_path+"fasttext100_lowercased.model"  #get_tmpfile()
embedding_model = FastText.load(file_name)

In [20]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


Define feature dictionaries for baseline models, using only the word embeddings and token as features:

In [21]:
# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,
        'token': token
    }
    
    # Add each value in a token's word embedding as a separate feature
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag_list[0] for token, pos, tag_list in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag_list in sentence]

*References:*
* *https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html*
* *https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training*

<a id="1"></a>
## 1. Models

### Linguistic

* **Features:** custom fastText embeddings
* **Target:** Linguistic label category IOB tags
* **Algorithm:** L2SGD

#### Preprocessing

In [131]:
df_train = df_train_ling
df_dev = df_dev_ling

Group the data by token, so the all the tags for one token are recorded in a list for that token's row:

In [132]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

Group the data by sentence, where each sentence is a list of tokens:

In [133]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [134]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_ling = utils.zipFeaturesAndTarget(df_train_grouped, "tag_linguistic")
print(train_sentences_ling[0][:3])
dev_sentences_ling = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_linguistic")
print(dev_sentences_ling[0][:3])

[('Title', 'NN', ['O']), (':', ':', ['O']), ('Papers', 'NNS', ['O'])]
[('After', 'IN', ['O']), ('his', 'PRP$', ['B-Gendered-Pronoun']), ('ordination', 'NN', ['O'])]


Extract the features and targets:

In [135]:
train_sentences = train_sentences_ling
dev_sentences = dev_sentences_ling

In [136]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

#### Optimization

Look for the highest-performing (based on F1 score) models by trying different algorithms and parameters.  Algorithms vailable with sklearn_crfsuite are:
 * 'lbfgs' - Gradient descent using the L-BFGS method
 * 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
 * 'ap' - Averaged Perceptron
 * 'pa' - Passive Aggressive (PA)
 * 'arow' - Adaptive Regularization Of Weight Vector (AROW)

In [100]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
max_iters=50

In [29]:
# df_train["tag_linguistic"].unique()
targets = [
        'B-Gendered-Pronoun', 'I-Gendered-Pronoun', 'B-Generalization',
        'I-Generalization', 'B-Gendered-Role', 'I-Gendered-Role'
]
f1_scorer = make_scorer(
    metrics.flat_f1_score, average='None', 
    labels=targets
)

In [30]:
crf0A = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0, max_iterations=max_iters, all_possible_transitions=True) # unlimited iterations
crf0B = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0.1, max_iterations=max_iters, all_possible_transitions=True)
crf0C = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0.1, c2=0.2, max_iterations=max_iters, all_possible_transitions=True)

crf1A = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=1.0, max_iterations=max_iters, all_possible_transitions=True) # max iters: 1000
crf1B = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.2, max_iterations=max_iters, all_possible_transitions=True)

crf2 = sklearn_crfsuite.CRF(algorithm=algorithms[2], max_iterations=max_iters, all_possible_transitions=True) # max iters: 100

crf3A = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=0, max_iterations=max_iters, all_possible_transitions=True) # max iters: 100
crf3B = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=1, max_iterations=max_iters, all_possible_transitions=True)
crf3C = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=2, max_iterations=max_iters, all_possible_transitions=True)

crf4A = sklearn_crfsuite.CRF(algorithm=algorithms[4], variance=1, max_iterations=max_iters, all_possible_transitions=True) # max iters: 100
crf4B = sklearn_crfsuite.CRF(algorithm=algorithms[4], variance=0.5, max_iterations=max_iters, all_possible_transitions=True)

In [44]:
# MODEL REMEMBERED WHAT IT LEARNED FROM PREVIOUS FITS WHEN TRIED THIS APPROACH:
# # RandomizedSearchCV and GridSearchCV return an attribute error, even when try fixes found online: # https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
# # So, here we manually set possible parameter values and iterate through eleven models to see what works best
# for model,params in opt_dict.items():
#     if params["algorithm"] == algorithms[0]:
#         if "c2" in params.keys():
#             crf = sklearn_crfsuite.CRF(
#                 algorithm=params["algorithm"], 
#                 c1=params["c1"], 
#                 c2=params["c2"]
#                 max_iterations=max_iters, 
#                 all_possible_transitions=True
#             )
#         else:
#             crf = 
    
#     # Train
#     try:
#         crf.fit(X_train, y_train)
#     except AttributeError:
#         pass
    
#     # Predict
#     y_pred = crf.predict(X_dev)
    
#     # Evaluate
#     print(crf.algorithm, crf.c1, crf.c2, crf.pa_type, crf.variance)
#     print("  Weighted:")
#     print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
#     print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
#     print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
#     print("  Unweighted:")
#     print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
#     print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
#     print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
#     print()

lbfgs 0 None None None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.599280238771264


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7038549022988657


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.5655790147152912
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.85307443 0.         0.         0.48390942 0.29530201]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.81559406 0.         0.         0.74087591 0.6875    ]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.89416554 0.         0.         0.35929204 0.18803419]

lbfgs 0.1 None None None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.6923958948078268


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7639084295038221


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.6948176583493282
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.87484511 0.         0.05298013 0.67586207 0.40993789]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.8050171  0.         0.57142857 0.76222222 0.75      ]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.95793758 0.         0.02777778 0.60707965 0.28205128]

lbfgs 0.1 0.2 None None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.6990028072783006


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7613420596160072


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.690978886756238
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.85677912 0.         0.15662651 0.6903164  0.41463415]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80695444 0.         0.59090909 0.75313808 0.72340426]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.91316147 0.         0.09027778 0.63716814 0.29059829]

l2sgd None 1.0 None None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.6684952523610904


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7215503482052836


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.6692258477287268
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.88389058 0.         0.         0.62406816 0.34899329]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80066079 0.         0.         0.78342246 0.8125    ]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.98643148 0.         0.         0.51858407 0.22222222]

l2sgd None 0.2 None None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.5937918194490615


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.6815801083979136


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.5911708253358925
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.88242424 0.         0.         0.41344956 0.37735849]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.7973713  0.         0.         0.69747899 0.71428571]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.98778833 0.         0.         0.29380531 0.25641026]

ap None None None None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.6759137161198069


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7977315369298237


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.6698656429942419
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.87876923 0.         0.10457516 0.63731656 0.28767123]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80405405 0.         0.88888889 0.781491   0.72413793]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.9687924  0.         0.05555556 0.5380531  0.17948718]

pa None None 0 None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.7101857203438285


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7616816921141131


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.7140115163147792
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.87945879 0.         0.25142857 0.67249757 0.39053254]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80427447 0.         0.70967742 0.74568966 0.63461538]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.97014925 0.         0.15277778 0.61238938 0.28205128]

pa None None 1 None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.7106638674759423


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7625822250742893


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.7140115163147792
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.87945879 0.         0.25142857 0.67185979 0.4       ]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80427447 0.         0.70967742 0.74675325 0.64150943]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.97014925 0.         0.15277778 0.61061947 0.29059829]

pa None None 2 None
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.7134352707185775


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7638416116729302


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.7216890595009597
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.88456865 0.         0.25142857 0.671875   0.4047619 ]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80088009 0.         0.70967742 0.74945534 0.66666667]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.98778833 0.         0.15277778 0.60884956 0.29059829]

arow None None None 1
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.7221649891290777


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.7230511018940287


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.72808701215611
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.85987261 0.         0.20883534 0.70401494 0.57416268]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.81032413 0.         0.24761905 0.74505929 0.65217391]


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec [0.91587517 0.         0.18055556 0.66725664 0.51282051]

arow None None None 0.5
  Weighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: 0.7334680388132536


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: 0.707506150271441


  _warn_prf(average, modifier, msg_start, len(result))


  - Rec 0.7722328854766475
  Unweighted:


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


  - F1: [0.86005089 0.         0.2761194  0.72875817 0.52173913]


  _warn_prf(average, modifier, msg_start, len(result))


  - Prec: [0.80958084 0.         0.2983871  0.676783   0.71641791]
  - Rec [0.91723202 0.         0.25694444 0.78938053 0.41025641]



  _warn_prf(average, modifier, msg_start, len(result))


In [31]:
try:
    crf0A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0A.predict(X_dev)

# Evaluate
print(crf0A.algorithm, crf0A.c1, crf0A.c2, crf0A.pa_type, crf0A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0 None None None
  Weighted:
  - F1: 0.5220463286505957
  - Prec: 0.6424500624544645
  - Rec 0.4922135706340378
  Unweighted:
  - F1: [0.85307443 0.         0.00892857 0.         0.48390942 0.29530201]
  - Prec: [0.81559406 0.         0.25       0.         0.74087591 0.6875    ]
  - Rec [0.89416554 0.         0.00454545 0.         0.35929204 0.18803419]


In [32]:
try:
    crf0B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0B.predict(X_dev)

# Evaluate
print(crf0B.algorithm, crf0B.c1, crf0B.c2, crf0B.pa_type, crf0B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0.1 None None None
  Weighted:
  - F1: 0.6083674858352446
  - Prec: 0.7689434393136277
  - Rec 0.60734149054505
  Unweighted:
  - F1: [0.87484511 0.         0.05286344 0.05298013 0.67586207 0.40993789]
  - Prec: [0.8050171  0.         0.85714286 0.57142857 0.76222222 0.75      ]
  - Rec [0.95793758 0.         0.02727273 0.02777778 0.60707965 0.28205128]


In [33]:
try:
    crf0C.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0C.predict(X_dev)

# Evaluate
print(crf0C.algorithm, crf0C.c1, crf0C.c2, crf0C.pa_type, crf0C.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0.1 0.2 None None
  Weighted:
  - F1: 0.6427339974408373
  - Prec: 0.7651587413557269
  - Rec 0.6218020022246941
  Unweighted:
  - F1: [0.85677912 0.         0.28679245 0.15662651 0.6903164  0.41463415]
  - Prec: [0.80695444 0.         0.84444444 0.59090909 0.75313808 0.72340426]
  - Rec [0.91316147 0.         0.17272727 0.09027778 0.63716814 0.29059829]


In [34]:
try:
    crf1A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf1A.predict(X_dev)

# Evaluate
print(crf1A.algorithm, crf1A.c1, crf1A.c2, crf1A.pa_type, crf1A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

l2sgd None 1.0 None None
  Weighted:
  - F1: 0.5811224023583894
  - Prec: 0.6272431558647711
  - Rec 0.5817575083426029
  Unweighted:
  - F1: [0.88389058 0.         0.         0.         0.62406816 0.34899329]
  - Prec: [0.80066079 0.         0.         0.         0.78342246 0.8125    ]
  - Rec [0.98643148 0.         0.         0.         0.51858407 0.22222222]


In [35]:
try:
    crf1B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf1B.predict(X_dev)

# Evaluate
print(crf1B.algorithm, crf1B.c1, crf1B.c2, crf1B.pa_type, crf1B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

l2sgd None 0.2 None None
  Weighted:
  - F1: 0.517290080102963
  - Prec: 0.7148552332736035
  - Rec 0.514460511679644
  Unweighted:
  - F1: [0.88242424 0.         0.00904977 0.         0.41344956 0.37735849]
  - Prec: [0.7973713  0.         1.         0.         0.69747899 0.71428571]
  - Rec [0.98778833 0.         0.00454545 0.         0.29380531 0.25641026]


In [36]:
try:
    crf2.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf2.predict(X_dev)

# Evaluate
print(crf2.algorithm, crf2.c1, crf2.c2, crf2.pa_type, crf2.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

ap None None None None
  Weighted:
  - F1: 0.6014970716276341
  - Prec: 0.8224996619695852
  - Rec 0.5912124582869855
  Unweighted:
  - F1: [0.87876923 0.64       0.07017544 0.10457516 0.63731656 0.28767123]
  - Prec: [0.80405405 0.8        1.         0.88888889 0.781491   0.72413793]
  - Rec [0.9687924  0.53333333 0.03636364 0.05555556 0.5380531  0.17948718]


In [31]:
try:
    crf3A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3A.predict(X_dev)

# Evaluate
print(crf3A.algorithm, crf3A.c1, crf3A.c2, crf3A.pa_type, crf3A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 0 None
  Weighted:
  - F1: 0.6608095292949409
  - Prec: 0.7676344251172861
  - Rec 0.6551724137931034
  Unweighted:
  - F1: [0.88456865 0.69230769 0.29104478 0.23255814 0.67378641 0.40697674]
  - Prec: [0.80088009 0.81818182 0.8125     0.71428571 0.74623656 0.63636364]
  - Rec [0.98778833 0.6        0.17727273 0.13888889 0.61415929 0.2991453 ]


In [32]:
try:
    crf3B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3B.predict(X_dev)

# Evaluate
print(crf3B.algorithm, crf3B.c1, crf3B.c2, crf3B.pa_type, crf3B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 1 None
  Weighted:
  - F1: 0.6599815730795738
  - Prec: 0.762389078290318
  - Rec 0.6490545050055617
  Unweighted:
  - F1: [0.87945879 0.69230769 0.29927007 0.24277457 0.67249757 0.40462428]
  - Prec: [0.80427447 0.81818182 0.75925926 0.72413793 0.74568966 0.625     ]
  - Rec [0.97014925 0.6        0.18636364 0.14583333 0.61238938 0.2991453 ]


In [33]:
try:
    crf3C.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3C.predict(X_dev)

# Evaluate
print(crf3C.algorithm, crf3C.c1, crf3C.c2, crf3C.pa_type, crf3C.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 2 None
  Weighted:
  - F1: 0.6597046279568559
  - Prec: 0.7595230166063811
  - Rec 0.6490545050055617
  Unweighted:
  - F1: [0.87730061 0.69230769 0.30434783 0.25142857 0.67120623 0.4       ]
  - Prec: [0.80067189 0.81818182 0.75       0.70967742 0.74514039 0.64150943]
  - Rec [0.97014925 0.6        0.19090909 0.15277778 0.61061947 0.29059829]


In [42]:
try:
    crf4A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf4A.predict(X_dev)

# Evaluate
print(crf4A.algorithm, crf4A.c1, crf4A.c2, crf4A.pa_type, crf4A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 1
  Weighted:
  - F1: 0.6586210838588668
  - Prec: 0.6726122299605511
  - Rec 0.664071190211346
  Unweighted:
  - F1: [0.87015385 0.54545455 0.28490028 0.27237354 0.66475645 0.48913043]
  - Prec: [0.79617117 0.5        0.38167939 0.30973451 0.7219917  0.67164179]
  - Rec [0.95929444 0.6        0.22727273 0.24305556 0.6159292  0.38461538]


In [43]:
try:
    crf4B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf4B.predict(X_dev)

# Evaluate
print(crf4B.algorithm, crf4B.c1, crf4B.c2, crf4B.pa_type, crf4B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 0.5
  Weighted:
  - F1: 0.6488288150410274
  - Prec: 0.6222010099532319
  - Rec 0.6846496106785317
  Unweighted:
  - F1: [0.86294416 0.69230769 0.29045643 0.21761658 0.67236955 0.38541667]
  - Prec: [0.81048868 0.81818182 0.26717557 0.17355372 0.65066225 0.49333333]
  - Rec [0.92265943 0.6        0.31818182 0.29166667 0.69557522 0.31623932]


**Best model:** pa with pa_type=0; arow with variance=0.5 also has strong performance, and is strongest with other categories, Person Name and Contextual.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Linguistic** category of tags.  We'll increase the max_iterations to 100 for this model.

In [137]:
clf_ling = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [138]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_ling.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [139]:
targets = list(clf_ling.classes_)
targets.remove('O')
print(targets)

['B-Gendered-Pronoun', 'B-Generalization', 'B-Gendered-Role', 'I-Generalization', 'I-Gendered-Role', 'I-Gendered-Pronoun']


#### Predict

In [140]:
y_pred = clf_ling.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [141]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.6697698893667122
  - Prec: 0.696012355375722
  - Rec 0.6718576195773082


Save the prediction data in a directory specific to this model:

In [142]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_linguistic":"tag_linguistic_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_linguistic_predicted", y_pred)
# df_dev_grouped.head()

In [143]:
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns)[1:])
df_dev_exploded.head()

Unnamed: 0,sentence_id,token_id,pos,sentence,tag_linguistic_expected,tag_linguistic_predicted
0,5,154,IN,After,[O],O
0,5,155,PRP$,his,[B-Gendered-Pronoun],B-Gendered-Pronoun
0,5,156,NN,ordination,[O],O
0,5,157,PRP,he,[B-Gendered-Pronoun],B-Gendered-Pronoun
0,5,158,VBD,spent,[O],O


In [144]:
output_path = "model_output/crf_l2sgd_baseline/"
Path(config.tokc_path+output_path).mkdir(exist_ok=True, parents=True)

In [145]:
filename = "crf_l2sgd_linguistic_labels_baseline.csv"
df_dev_exploded.to_csv(config.tokc_path+output_path+filename)

### Person Name

* **Features:** custom fastText embeddings
* **Target:** Person-Name label category IOB tags
* **Algorithm:** L2SGD

#### Preprocessing

In [146]:
df_train = df_train_pers
df_dev = df_dev_pers

In [147]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [148]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_personname
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[[O], [O], [O]]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[[O], [O], [B-Masculine], [I-Masculine], [O], ..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[[O], [O], [B-Masculine], [I-Masculine], [I-Ma..."


Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [149]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_pers = utils.zipFeaturesAndTarget(df_train_grouped, "tag_personname")
print(train_sentences_pers[0][:3])
dev_sentences_pers = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_personname")
print(dev_sentences_pers[0][:3])

[('Title', 'NN', ['O']), (':', ':', ['O']), ('Papers', 'NNS', ['O'])]
[('After', 'IN', ['O']), ('his', 'PRP$', ['O']), ('ordination', 'NN', ['O'])]


In [150]:
train_sentences = train_sentences_pers
dev_sentences = dev_sentences_pers

In [151]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

#### Optimization

Look for the highest-performing (based on F1 score) models by trying different algorithms and parameters.  Algorithms vailable with sklearn_crfsuite are:
 * 'lbfgs' - Gradient descent using the L-BFGS method
 * 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
 * 'ap' - Averaged Perceptron
 * 'pa' - Passive Aggressive (PA)
 * 'arow' - Adaptive Regularization Of Weight Vector (AROW)

In [116]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
max_iters=50

In [41]:
# df_train["tag_personname"].unique()
targets = [
        'B-Unknown', 'I-Unknown', 'B-Feminine',
        'I-Feminine', 'B-Masculine', 'I-Masculine'
]
f1_scorer = make_scorer(
    metrics.flat_f1_score, average='None', 
    labels=targets
)

In [42]:
crf0A = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0, max_iterations=max_iters, all_possible_transitions=True) # unlimited iterations
crf0B = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0.1, max_iterations=max_iters, all_possible_transitions=True)
crf0C = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0.1, c2=0.2, max_iterations=max_iters, all_possible_transitions=True)

crf1A = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=1.0, max_iterations=max_iters, all_possible_transitions=True) # max iters: 1000
crf1B = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.2, max_iterations=max_iters, all_possible_transitions=True)

crf2 = sklearn_crfsuite.CRF(algorithm=algorithms[2], max_iterations=max_iters, all_possible_transitions=True) # max iters: 100

crf3A = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=0, max_iterations=max_iters, all_possible_transitions=True) # max iters: 100
crf3B = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=1, max_iterations=max_iters, all_possible_transitions=True)
crf3C = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=2, max_iterations=max_iters, all_possible_transitions=True)

crf4A = sklearn_crfsuite.CRF(algorithm=algorithms[4], variance=1, max_iterations=max_iters, all_possible_transitions=True) # max iters: 100
crf4B = sklearn_crfsuite.CRF(algorithm=algorithms[4], variance=0.5, max_iterations=max_iters, all_possible_transitions=True)

In [43]:
try:
    crf0A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0A.predict(X_dev)

# Evaluate
print(crf0A.algorithm, crf0A.c1, crf0A.c2, crf0A.pa_type, crf0A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0 None None None
  Weighted:
  - F1: 0.25563333936374655
  - Prec: 0.48241170869152883
  - Rec 0.17545363623374768
  Unweighted:
  - F1: [0.21609604 0.24892487 0.40752351 0.28761651 0.32380952 0.23603462]
  - Prec: [0.42631579 0.42087254 0.77380952 0.72       0.51987768 0.51020408]
  - Rec [0.14472901 0.17672414 0.27659574 0.1797005  0.2351314  0.15353122]


In [44]:
try:
    crf0B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0B.predict(X_dev)

# Evaluate
print(crf0B.algorithm, crf0B.c1, crf0B.c2, crf0B.pa_type, crf0B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0.1 None None None
  Weighted:
  - F1: 0.3952936125163706
  - Prec: 0.6164456508900145
  - Rec 0.2961851693099014
  Unweighted:
  - F1: [0.34542314 0.37176232 0.62222222 0.53304904 0.43134087 0.38205128]
  - Prec: [0.62794349 0.63431542 0.74117647 0.74183976 0.5184466  0.51114923]
  - Rec [0.23823705 0.26293103 0.53617021 0.41597338 0.36929461 0.30501535]


In [45]:
try:
    crf0C.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0C.predict(X_dev)

# Evaluate
print(crf0C.algorithm, crf0C.c1, crf0C.c2, crf0C.pa_type, crf0C.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0.1 0.2 None None
  Weighted:
  - F1: 0.4505459769939627
  - Prec: 0.6028142561062638
  - Rec 0.36133733390484357
  Unweighted:
  - F1: [0.45994065 0.48070953 0.63333333 0.5320911  0.38327526 0.30410184]
  - Prec: [0.60963618 0.62804171 0.71891892 0.70410959 0.51764706 0.49199085]
  - Rec [0.36926742 0.38936782 0.56595745 0.42762063 0.30428769 0.22006141]


In [46]:
try:
    crf1A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf1A.predict(X_dev)

# Evaluate
print(crf1A.algorithm, crf1A.c1, crf1A.c2, crf1A.pa_type, crf1A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

l2sgd None 1.0 None None
  Weighted:
  - F1: 0.39947172781326473
  - Prec: 0.5390631041498564
  - Rec 0.31975996570938703
  Unweighted:
  - F1: [0.37020484 0.37389855 0.5990566  0.5520728  0.44057052 0.35034657]
  - Prec: [0.49403579 0.55477032 0.67195767 0.70360825 0.51576994 0.4557377 ]
  - Rec [0.29600953 0.28196839 0.54042553 0.45424293 0.38450899 0.28454452]


In [47]:
try:
    crf1B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf1B.predict(X_dev)

# Evaluate
print(crf1B.algorithm, crf1B.c1, crf1B.c2, crf1B.pa_type, crf1B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

l2sgd None 0.2 None None
  Weighted:
  - F1: 0.28237156184801626
  - Prec: 0.6289538389122998
  - Rec 0.19402771824546364
  Unweighted:
  - F1: [0.26691042 0.26218487 0.63341646 0.54115226 0.31621349 0.09779482]
  - Prec: [0.57367387 0.59541985 0.76506024 0.70889488 0.58148148 0.77272727]
  - Rec [0.17391304 0.16810345 0.54042553 0.43760399 0.21715076 0.05220061]


In [48]:
try:
    crf2.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf2.predict(X_dev)

# Evaluate
print(crf2.algorithm, crf2.c1, crf2.c2, crf2.pa_type, crf2.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

ap None None None None
  Weighted:
  - F1: 0.43418543421107464
  - Prec: 0.6506056343043833
  - Rec 0.33190455779397054
  Unweighted:
  - F1: [0.41818182 0.42871094 0.6367713  0.60142712 0.46962233 0.29945694]
  - Prec: [0.62162162 0.66920732 0.67298578 0.77631579 0.57777778 0.61858974]
  - Rec [0.31506849 0.31537356 0.60425532 0.49084859 0.395574   0.1975435 ]


In [49]:
try:
    crf3A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3A.predict(X_dev)

# Evaluate
print(crf3A.algorithm, crf3A.c1, crf3A.c2, crf3A.pa_type, crf3A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 0 None
  Weighted:
  - F1: 0.46544025425232854
  - Prec: 0.6526455542987322
  - Rec 0.36676668095442205
  Unweighted:
  - F1: [0.46613697 0.48264984 0.66350711 0.59205021 0.45128205 0.30015552]
  - Prec: [0.63900415 0.64752116 0.7486631  0.7971831  0.59060403 0.62459547]
  - Rec [0.36688505 0.38469828 0.59574468 0.47088186 0.36514523 0.1975435 ]


In [50]:
try:
    crf3B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3B.predict(X_dev)

# Evaluate
print(crf3B.algorithm, crf3B.c1, crf3B.c2, crf3B.pa_type, crf3B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 1 None
  Weighted:
  - F1: 0.46124270643488524
  - Prec: 0.6487355256491798
  - Rec 0.36233747678239747
  Unweighted:
  - F1: [0.46369138 0.47971145 0.66019417 0.58029979 0.44633731 0.29434547]
  - Prec: [0.63523316 0.6440678  0.76836158 0.81381381 0.58093126 0.60509554]
  - Rec [0.36509827 0.38218391 0.5787234  0.45091514 0.36237898 0.19447288]


In [51]:
try:
    crf3C.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3C.predict(X_dev)

# Evaluate
print(crf3C.algorithm, crf3C.c1, crf3C.c2, crf3C.pa_type, crf3C.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 2 None
  Weighted:
  - F1: 0.4648015025167207
  - Prec: 0.6483430407678211
  - Rec 0.36748106872410347
  Unweighted:
  - F1: [0.46373544 0.48826291 0.65876777 0.58201058 0.4467354  0.29439252]
  - Prec: [0.62830957 0.64653641 0.74331551 0.7994186  0.58956916 0.61563518]
  - Rec [0.36748064 0.39224138 0.59148936 0.45757072 0.35961272 0.19344933]


In [52]:
try:
    crf4A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf4A.predict(X_dev)

# Evaluate
print(crf4A.algorithm, crf4A.c1, crf4A.c2, crf4A.pa_type, crf4A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 1
  Weighted:
  - F1: 0.4815200566676591
  - Prec: 0.5175284691788858
  - Rec 0.46506643806258036
  Unweighted:
  - F1: [0.45074415 0.48698438 0.6437247  0.59171598 0.51690294 0.38585209]
  - Prec: [0.55643045 0.55022624 0.61389961 0.60137457 0.42664266 0.35      ]
  - Rec [0.3787969  0.43678161 0.67659574 0.58236273 0.65560166 0.42988741]


In [53]:
try:
    crf4B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf4B.predict(X_dev)

# Evaluate
print(crf4B.algorithm, crf4B.c1, crf4B.c2, crf4B.pa_type, crf4B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 0.5
  Weighted:
  - F1: 0.4947955715164867
  - Prec: 0.5537045694622348
  - Rec 0.44920702957565367
  Unweighted:
  - F1: [0.50312809 0.51784329 0.65154639 0.62031107 0.41166937 0.36140135]
  - Prec: [0.56259205 0.56281619 0.632      0.68902439 0.49706458 0.45230769]
  - Rec [0.45503276 0.47952586 0.67234043 0.5640599  0.35131397 0.30092119]


**Best model:** arow with variance=0.5

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags.  We'll increase the max iterations to 100 for this model.

In [152]:
clf_pers = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [153]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [154]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['B-Unknown', 'I-Unknown', 'I-Masculine', 'B-Masculine', 'B-Feminine', 'I-Feminine']


#### Predict

In [155]:
y_pred = clf_pers.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [156]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.4789287297354048
  - Prec: 0.5381704595959792
  - Rec 0.43620517216745247


Save the prediction data:

In [157]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_personname":"tag_personname_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_personname_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0,sentence_id,token_id,pos,sentence,tag_personname_expected,tag_personname_predicted
0,5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[[O], [O], [O]]","[O, O, O]"
2,13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[[O], [O], [B-Masculine], [I-Masculine], [O], ...","[O, O, O, O, O, B-Unknown, I-Unknown, O, O, O,..."
4,24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[[O], [O], [B-Masculine], [I-Masculine], [I-Ma...","[O, O, B-Masculine, I-Masculine, I-Masculine, ..."


In [158]:
df_dev_grouped = df_dev_grouped.set_index("sentence_id")
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_personname_expected,tag_personname_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,[O],O
5,155,PRP$,his,[O],O
5,156,NN,ordination,[O],O
5,157,PRP,he,[O],O
5,158,VBD,spent,[O],O


In [159]:
filename = "crf_l2sgd_personname_labels_baseline.csv"
df_dev_exploded.to_csv(config.tokc_path+output_path+filename)

<a id="1"></a>
### Contextual

* **Features:** custom fastText embeddings
* **Target:** Contextual label category IOB tags
* **Algorithm:** L2SGD

#### Preprocessing

In [160]:
df_train = df_train_cont
df_dev = df_dev_cont

In [161]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [162]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_contextual
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[[O], [O], [O]]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[[O], [O], [B-Stereotype], [I-Stereotype], [I-..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."


Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [163]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_cont = utils.zipFeaturesAndTarget(df_train_grouped, "tag_contextual")
print(train_sentences_cont[0][:3])
dev_sentences_cont = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_contextual")
print(dev_sentences_cont[0][:3])

[('Title', 'NN', ['O']), (':', ':', ['O']), ('Papers', 'NNS', ['O'])]
[('After', 'IN', ['O']), ('his', 'PRP$', ['O']), ('ordination', 'NN', ['O'])]


In [164]:
train_sentences = train_sentences_cont
dev_sentences = dev_sentences_cont

In [165]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

#### Optimization

Look for the highest-performing (based on F1 score) models by trying different algorithms and parameters.  Algorithms vailable with sklearn_crfsuite are:
 * 'lbfgs' - Gradient descent using the L-BFGS method
 * 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
 * 'ap' - Averaged Perceptron
 * 'pa' - Passive Aggressive (PA)
 * 'arow' - Adaptive Regularization Of Weight Vector (AROW)

In [78]:
algorithms = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
max_iters=50

In [79]:
# df_train["tag_contextual"].unique()
targets = [
        'B-Occupation', 'I-Occupation', 'B-Stereotype',
        'I-Stereotype', 'B-Omission', 'I-Omission'
]
f1_scorer = make_scorer(
    metrics.flat_f1_score, average='None', 
    labels=targets
)

In [80]:
crf0A = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0, max_iterations=max_iters, all_possible_transitions=True) # unlimited iterations
crf0B = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0.1, max_iterations=max_iters, all_possible_transitions=True)
crf0C = sklearn_crfsuite.CRF(algorithm=algorithms[0], c1=0.1, c2=0.2, max_iterations=max_iters, all_possible_transitions=True)

crf1A = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=1.0, max_iterations=max_iters, all_possible_transitions=True) # max iters: 1000
crf1B = sklearn_crfsuite.CRF(algorithm=algorithms[1], c2=0.2, max_iterations=max_iters, all_possible_transitions=True)

crf2 = sklearn_crfsuite.CRF(algorithm=algorithms[2], max_iterations=max_iters, all_possible_transitions=True) # max iters: 100

crf3A = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=0, max_iterations=max_iters, all_possible_transitions=True) # max iters: 100
crf3B = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=1, max_iterations=max_iters, all_possible_transitions=True)
crf3C = sklearn_crfsuite.CRF(algorithm=algorithms[3], pa_type=2, max_iterations=max_iters, all_possible_transitions=True)

crf4A = sklearn_crfsuite.CRF(algorithm=algorithms[4], variance=1, max_iterations=max_iters, all_possible_transitions=True) # max iters: 100
crf4B = sklearn_crfsuite.CRF(algorithm=algorithms[4], variance=0.5, max_iterations=max_iters, all_possible_transitions=True)

In [81]:
try:
    crf0A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0A.predict(X_dev)

# Evaluate
print(crf0A.algorithm, crf0A.c1, crf0A.c2, crf0A.pa_type, crf0A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0 None None None
  Weighted:
  - F1: 0.07526574269976748
  - Prec: 0.3419825203723303
  - Rec 0.04234527687296417
  Unweighted:
  - F1: [0.         0.         0.         0.         0.18710263 0.11749681]
  - Prec: [0.         0.         0.         0.         0.76865672 0.58974359]
  - Rec [0.         0.         0.         0.         0.10651499 0.06524823]


In [82]:
try:
    crf0B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0B.predict(X_dev)

# Evaluate
print(crf0B.algorithm, crf0B.c1, crf0B.c2, crf0B.pa_type, crf0B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0.1 None None None
  Weighted:
  - F1: 0.27203523227916193
  - Prec: 0.6341356146479743
  - Rec 0.1780673181324647
  Unweighted:
  - F1: [0.27612903 0.25559105 0.11299435 0.16603774 0.48888889 0.1993205 ]
  - Prec: [0.75352113 0.65934066 0.58823529 0.56410256 0.79672897 0.49438202]
  - Rec [0.16903633 0.15852048 0.0625     0.09734513 0.35263702 0.1248227 ]


In [83]:
try:
    crf0C.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf0C.predict(X_dev)

# Evaluate
print(crf0C.algorithm, crf0C.c1, crf0C.c2, crf0C.pa_type, crf0C.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

lbfgs 0.1 0.2 None None
  Weighted:
  - F1: 0.36974453222807907
  - Prec: 0.6650245193879263
  - Rec 0.2625407166123778
  Unweighted:
  - F1: [0.47764449 0.42293907 0.16304348 0.22738386 0.53013699 0.27465536]
  - Prec: [0.77112676 0.65738162 0.625      0.66428571 0.78498986 0.54411765]
  - Rec [0.34597156 0.31175694 0.09375    0.13716814 0.40020683 0.18368794]


In [84]:
try:
    crf1A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf1A.predict(X_dev)

# Evaluate
print(crf1A.algorithm, crf1A.c1, crf1A.c2, crf1A.pa_type, crf1A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

l2sgd None 1.0 None None
  Weighted:
  - F1: 0.14238250038533273
  - Prec: 0.6949935082632666
  - Rec 0.08534201954397394
  Unweighted:
  - F1: [0.01253918 0.01308901 0.02325581 0.05890603 0.33637117 0.1907061 ]
  - Prec: [0.8        0.71428571 0.16666667 0.6        0.84583333 0.63967611]
  - Rec [0.00631912 0.00660502 0.0125     0.03097345 0.20992761 0.11205674]


In [85]:
try:
    crf1B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf1B.predict(X_dev)

# Evaluate
print(crf1B.algorithm, crf1B.c1, crf1B.c2, crf1B.pa_type, crf1B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

l2sgd None 0.2 None None
  Weighted:
  - F1: 0.23290156335890247
  - Prec: 0.587957004746522
  - Rec 0.16547231270358306
  Unweighted:
  - F1: [0.03703704 0.06532663 0.17894737 0.16252822 0.51733333 0.25569358]
  - Prec: [0.8        0.66666667 0.56666667 0.34615385 0.72795497 0.47318008]
  - Rec [0.01895735 0.0343461  0.10625    0.10619469 0.40124095 0.1751773 ]


In [86]:
try:
    crf2.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf2.predict(X_dev)

# Evaluate
print(crf2.algorithm, crf2.c1, crf2.c2, crf2.pa_type, crf2.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

ap None None None None
  Weighted:
  - F1: 0.22686387330210564
  - Prec: 0.8443018920394695
  - Rec 0.14505971769815418
  Unweighted:
  - F1: [0.26168224 0.20303384 0.02453988 0.03478261 0.49893086 0.15275995]
  - Prec: [0.84482759 0.87       0.66666667 1.         0.80275229 0.80405405]
  - Rec [0.15481833 0.11492734 0.0125     0.01769912 0.36194416 0.08439716]


In [87]:
try:
    crf3A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3A.predict(X_dev)

# Evaluate
print(crf3A.algorithm, crf3A.c1, crf3A.c2, crf3A.pa_type, crf3A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 0 None
  Weighted:
  - F1: 0.306310049934223
  - Prec: 0.80194162836387
  - Rec 0.20781758957654722
  Unweighted:
  - F1: [0.45810056 0.32640333 0.08284024 0.04329004 0.5375603  0.22061483]
  - Prec: [0.78244275 0.76585366 0.77777778 1.         0.80578512 0.73493976]
  - Rec [0.32385466 0.20739762 0.04375    0.02212389 0.4033092  0.12978723]


In [88]:
try:
    crf3B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3B.predict(X_dev)

# Evaluate
print(crf3B.algorithm, crf3B.c1, crf3B.c2, crf3B.pa_type, crf3B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 1 None
  Weighted:
  - F1: 0.30799201369715534
  - Prec: 0.8058167663636654
  - Rec 0.20868621064060802
  Unweighted:
  - F1: [0.45240761 0.32432432 0.08284024 0.04892086 0.5399449  0.22543701]
  - Prec: [0.77692308 0.76097561 0.77777778 1.         0.80824742 0.75100402]
  - Rec [0.31911532 0.20607662 0.04375    0.02507375 0.40537746 0.13262411]


In [89]:
try:
    crf3C.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf3C.predict(X_dev)

# Evaluate
print(crf3C.algorithm, crf3C.c1, crf3C.c2, crf3C.pa_type, crf3C.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

pa None None 2 None
  Weighted:
  - F1: 0.30804522569488546
  - Prec: 0.809491042440604
  - Rec 0.20912052117263843
  Unweighted:
  - F1: [0.46784922 0.32398754 0.08333333 0.04610951 0.53830228 0.22128174]
  - Prec: [0.78438662 0.75728155 0.875      1.         0.80912863 0.75      ]
  - Rec [0.33333333 0.20607662 0.04375    0.02359882 0.4033092  0.12978723]


In [90]:
try:
    crf4A.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf4A.predict(X_dev)

# Evaluate
print(crf4A.algorithm, crf4A.c1, crf4A.c2, crf4A.pa_type, crf4A.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 1
  Weighted:
  - F1: 0.36669929570944
  - Prec: 0.40929601777426833
  - Rec 0.3355048859934853
  Unweighted:
  - F1: [0.6036036  0.5095057  0.24512535 0.24287653 0.37733645 0.24971537]
  - Prec: [0.70230608 0.60035842 0.22110553 0.22487437 0.43355705 0.26857143]
  - Rec [0.52922591 0.44253633 0.275      0.2640118  0.33402275 0.23333333]


In [91]:
try:
    crf4B.fit(X_train, y_train)
except AttributeError:
    pass

# Predict
y_pred = crf4B.predict(X_dev)

# Evaluate
print(crf4B.algorithm, crf4B.c1, crf4B.c2, crf4B.pa_type, crf4B.variance)
print("  Weighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  Unweighted:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 0.5
  Weighted:
  - F1: 0.3925447072732632
  - Prec: 0.43471962523819374
  - Rec 0.36503800217155263
  Unweighted:
  - F1: [0.57769653 0.50433526 0.21782178 0.22643746 0.42546064 0.32653061]
  - Prec: [0.68546638 0.55661882 0.18032787 0.18929633 0.46237864 0.38461538]
  - Rec [0.49921011 0.46103038 0.275      0.28171091 0.39400207 0.28368794]


**Best model:** arow with variance=0.5

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Contextual** category of tags.  We'll increase the maximum iterations to 100 for this model.

In [166]:
clf_cont = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [167]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_cont.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [168]:
targets = list(clf_cont.classes_)
targets.remove('O')
print(targets)

['B-Stereotype', 'I-Stereotype', 'B-Occupation', 'I-Occupation', 'B-Omission', 'I-Omission']


#### Predict

In [169]:
y_pred = clf_cont.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [170]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.40803990799640844
  - Prec: 0.4124968743940007
  - Rec 0.4091205211726384


Save the prediction data:

In [171]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag_contextual":"tag_contextual_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_contextual_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0,sentence_id,token_id,pos,sentence,tag_contextual_expected,tag_contextual_predicted
0,5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, B-Occupation, O..."
1,11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[[O], [O], [O]]","[O, O, O]"
2,13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[[O], [O], [B-Stereotype], [I-Stereotype], [I-...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, B-Occupation, O..."


In [172]:
df_dev_grouped = df_dev_grouped.set_index("sentence_id")
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_contextual_expected,tag_contextual_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,[O],O
5,155,PRP$,his,[O],O
5,156,NN,ordination,[O],O
5,157,PRP,he,[O],O
5,158,VBD,spent,[O],O


In [173]:
filename = "crf_l2sgd_contextual_labels_baseline.csv"
df_dev_exploded.to_csv(config.tokc_path+output_path+filename)

<a id="2"></a>
## 2. Performance Evaluation

### Strict Evaluation

The built-in evaluation approach is strict, so unless the model predictions' labels are on text spans that exactly match the development data's test, the predicted labels will be deemed incorrect.

In [174]:
output_path = "model_output/crf_l2sgd_baseline/"

In [175]:
category = "contextual"
filename = "crf_l2sgd_{}_labels_baseline.csv".format(category)
pred_cont = pd.read_csv(config.tokc_path+output_path+filename, index_col=0)
pred_cont = utils.getColumnValuesAsLists(pred_cont, "tag_{}_expected".format(category))
# pred_cont.head()

In [176]:
category = "personname"
filename = "crf_l2sgd_{}_labels_baseline.csv".format(category)
pred_pers = pd.read_csv(config.tokc_path+output_path+filename, index_col=0)
pred_pers = utils.getColumnValuesAsLists(pred_pers, "tag_{}_expected".format(category))
# pred_pers.head()

In [177]:
category = "linguistic"
filename = "crf_l2sgd_{}_labels_baseline.csv".format(category)
pred_ling = pd.read_csv(config.tokc_path+output_path+filename, index_col=0)
pred_ling = utils.getColumnValuesAsLists(pred_ling, "tag_{}_expected".format(category))
pred_ling = pred_ling.set_index("sentence_id")
# pred_ling.head()

Calculate performance metrics for each category of labels:

In [178]:
category = "contextual"
pred_cont = utils.isPredictedInExpected(pred_cont, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_cont.head()

In [179]:
category = "personname"
pred_pers = utils.isPredictedInExpected(pred_pers, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_pers.head()

In [180]:
category = "linguistic"
pred_ling = utils.isPredictedInExpected(pred_ling, "tag_{}_expected".format(category), "tag_{}_predicted".format(category), '_merge', 'O')
# pred_ling.head()

In [181]:
category = "contextual"
tags = ['B-Occupation', 'I-Occupation', 'B-Omission', 'I-Omission', 'B-Stereotype', 'I-Stereotype']
pred_cont_stats = utils.getScoresByCatTags(
    pred_cont, "_merge", tags[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_cont, "_merge", tags[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_cont_stats = pd.concat([pred_cont_stats, tag_stats])
# pred_cont_stats

In [182]:
category = "personname"
tags = ['B-Feminine', 'I-Feminine', 'B-Masculine', 'I-Masculine', 'B-Unknown', 'I-Unknown', "B-Nonbinary", "I-Nonbinary"]
pred_pers_stats = utils.getScoresByCatTags(
    pred_pers, "_merge", tags[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_pers, "_merge", tags[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_pers_stats = pd.concat([pred_pers_stats, tag_stats])
# pred_pers_stats

In [183]:
category = "linguistic"
tags = ["B-Gendered-Pronoun", "I-Gendered-Pronoun", "B-Gendered-Role", "I-Gendered-Role", "B-Generalization", "I-Generalization"]
pred_ling_stats = utils.getScoresByCatTags(
    pred_ling, "_merge", tags[0], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_ling, "_merge", tags[i], "tag_{}_expected".format(category), "tag_{}_predicted".format(category), "token_id"
    )
    pred_ling_stats = pd.concat([pred_ling_stats, tag_stats])
# pred_ling_stats

Combine the statistics:

In [184]:
stats = pd.concat([pred_cont_stats, pred_pers_stats, pred_ling_stats])
stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,B-Occupation,270,191,350,0.64695,0.564516,0.602929
0,I-Occupation,386,326,351,0.518464,0.476255,0.496464
0,B-Omission,429,546,524,0.48972,0.549843,0.518043
0,I-Omission,897,1194,522,0.304196,0.367865,0.333014
0,B-Stereotype,155,98,41,0.294964,0.209184,0.244776
0,I-Stereotype,492,360,148,0.291339,0.23125,0.25784
0,B-Feminine,38,95,202,0.680135,0.841667,0.752328
0,I-Feminine,143,149,402,0.729583,0.737615,0.733577
0,B-Masculine,236,280,456,0.619565,0.65896,0.638655
0,I-Masculine,345,435,432,0.49827,0.555985,0.525547


Save the statistics:

In [185]:
stats.to_csv(config.tokc_path+output_path+"crf_l2sgd_baseline_performance_strict_alltags.csv")

#### Annotation Agreement

### Loose Evaluation

As with the manual annotation evaluation, we want to evaluate the predictions more loosely, considering overlapping text spans in addition to exactly matching text spans.

#### Token Agreement

First, generalize the tokens' IOB tags to the label, and calculate agreement scores for each label.

In [186]:
category = "contextual"
pred_cont_labels = pred_cont.copy()
tag_exp = list(pred_cont_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_cont_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
# print(label_exp[:20])  # Looks good
# print(label_pred[:20]) # Looks good
pred_cont_labels = pred_cont_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_cont_labels.insert(len(pred_cont_labels.columns), "label_{}_expected".format(category), label_exp)
pred_cont_labels.insert(len(pred_cont_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_cont_labels.head(20)  # Looks good

In [187]:
category = "personname"
pred_pers_labels = pred_pers.copy()
tag_exp = list(pred_pers_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_pers_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_pers_labels = pred_pers_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_pers_labels.insert(len(pred_pers_labels.columns), "label_{}_expected".format(category), label_exp)
pred_pers_labels.insert(len(pred_pers_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_pers_labels.loc[pred_pers_labels.label_personname_predicted == "Feminine"].head()  # Looks good

In [188]:
category = "linguistic"
pred_ling_labels = pred_ling.copy()
tag_exp = list(pred_ling_labels["tag_{}_expected".format(category)])
tag_pred = list(pred_ling_labels["tag_{}_predicted".format(category)])
label_exp = [[tag if tag == "O" else tag[2:] for tag in tag_exp_list] for tag_exp_list in tag_exp]
label_pred = [tag if tag == "O" else tag[2:] for tag in tag_pred]
pred_ling_labels = pred_ling_labels.drop(columns=["tag_{}_expected".format(category), "tag_{}_predicted".format(category)])
pred_ling_labels.insert(len(pred_ling_labels.columns), "label_{}_expected".format(category), label_exp)
pred_ling_labels.insert(len(pred_ling_labels.columns), "label_{}_predicted".format(category), label_pred)
# pred_ling_labels.head()  # Looks good

Calculate the agreement metrics at the label level for each token:

In [190]:
category = "contextual"
tags = ['Occupation', 'Omission', 'Stereotype']
pred_cont_labels = pred_cont_labels.drop(columns=["_merge"])
pred_cont_labels = utils.isPredictedInExpected(pred_cont_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')

pred_cont_stats = utils.getScoresByCatTags(
    pred_cont_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_cont_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_cont_stats = pd.concat([pred_cont_stats, tag_stats])
pred_cont_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Occupation,656,481,737,0.60509,0.529074,0.564535
0,Omission,1297,1684,1102,0.395549,0.459358,0.425072
0,Stereotype,638,449,198,0.306028,0.236842,0.267026


In [191]:
category = "personname"
tags = ['Feminine', 'Masculine', 'Unknown', "Nonbinary"]
pred_pers_labels = pred_pers_labels.drop(columns=["_merge"])
pred_pers_labels = utils.isPredictedInExpected(pred_pers_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')


pred_pers_stats = utils.getScoresByCatTags(
    pred_pers_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_pers_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_pers_stats = pd.concat([pred_pers_stats, tag_stats])
pred_pers_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Feminine,176,210,638,0.752358,0.783784,0.76775
0,Masculine,578,669,934,0.582658,0.617725,0.599679
0,Unknown,1801,1105,2154,0.660939,0.544627,0.597172
0,Nonbinary,0,0,0,0.0,0.0,0.0


In [192]:
category = "linguistic"
tags = ["Gendered-Pronoun", "Gendered-Role", "Generalization"]
pred_ling_labels = pred_ling_labels.drop(columns=["_merge"])
pred_ling_labels = utils.isPredictedInExpected(pred_ling_labels, "label_{}_expected".format(category), "label_{}_predicted".format(category), '_merge', 'O')


pred_ling_stats = utils.getScoresByCatTags(
    pred_ling_labels, "_merge", tags[0], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
)
for i in range(1, len(tags)):
    tag_stats = utils.getScoresByCatTags(
        pred_ling_labels, "_merge", tags[i], "label_{}_expected".format(category), "label_{}_predicted".format(category), "token_id"
    )
    pred_ling_stats = pd.concat([pred_ling_stats, tag_stats])
pred_ling_stats

Unnamed: 0,tag(s),false negative,false positive,true positive,precision,recall,f1
0,Gendered-Pronoun,41,169,718,0.80947,0.945982,0.872418
0,Gendered-Role,264,170,421,0.712352,0.614599,0.659875
0,Generalization,268,90,84,0.482759,0.238636,0.319392


Combine and save the performance measures:

In [193]:
loose_stats = pd.concat([pred_cont_stats, pred_pers_stats, pred_ling_stats])
# loose_stats

In [194]:
loose_stats.to_csv(config.tokc_path+output_path+"crf_l2sgd_baseline_performance_loose_alltags.csv")

#### Annotation Agreement

Calculate agreement at the annotation level, so if the model labels any word correctly from a manually annotated text span, that annotation is recorded as being correctly labeled (`true positive`).  Note whether the models' labels are an `exact_match`, `label_match`, `category_match` or `mismatch`.

<a id="3"></a>
## 3. Transitions