# Baseline Gender Bias Sequence Classifiers with GloVe

### Target: Labels

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/crf_l2sgd/`
* Sequence classification
    * 9 lables (2 from original annotation taxonomy weren't applied during manual annotation):
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not annotated with)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering not annotated with enough)
    * 1 model per category
* Word embeddings
    * GloVe (trained on Gigaword 5 + English Wikipedia dump)

***

### Table of Contents

[0.](#0) Preprocessing

  * [Hypotheses](#h)

[1.](#1) Models

  * [All Labels](#all)
  * [Linguistic](#ling)
  * [Person Name](#pers)
  * [Contextual](#cont)
  * [Person Name + Occupation](#perso)
  
[2.](#2) Performance Evaluation

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats

from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay#, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [6]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag"]

In [7]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create separate subsets of data for each category so they can be used with three separate models, replacing `NaN` tag values with `'O'`:

In [8]:
tags = (df_train.tag.unique())
tags.sort()
print(tags)

['B-Feminine' 'B-Gendered-Pronoun' 'B-Gendered-Role' 'B-Generalization'
 'B-Masculine' 'B-Occupation' 'B-Omission' 'B-Stereotype' 'B-Unknown'
 'I-Feminine' 'I-Gendered-Pronoun' 'I-Gendered-Role' 'I-Generalization'
 'I-Masculine' 'I-Occupation' 'I-Omission' 'I-Stereotype' 'I-Unknown' 'O']


***
**Optional Preprocessing** - if only using a subset of tags

In [9]:
ling_cat_tags = ['B-Gendered-Pronoun', 'B-Gendered-Role', 'B-Generalization', 'I-Gendered-Pronoun', 'I-Gendered-Role', 'I-Generalization']
df_train_ling = df_train.loc[df_train.tag.isin(ling_cat_tags)]
df_dev_ling = df_dev.loc[df_dev.tag.isin(ling_cat_tags)]

In [10]:
pers_cat_tags = ['B-Feminine', 'B-Masculine', 'B-Unknown', 'I-Feminine', 'I-Masculine', 'I-Unknown']
df_train_pers = df_train.loc[df_train.tag.isin(pers_cat_tags)]
df_dev_pers = df_dev.loc[df_dev.tag.isin(pers_cat_tags)]

In [11]:
cont_cat_tags = ['B-Occupation', 'B-Omission', 'B-Stereotype', 'I-Occupation', 'I-Omission', 'I-Stereotype']
df_train_cont = df_train.loc[df_train.tag.isin(cont_cat_tags)]
df_dev_cont = df_dev.loc[df_dev.tag.isin(cont_cat_tags)]

In [9]:
perso_cat_tags = ['B-Feminine', 'B-Masculine', 'B-Occupation', 'B-Unknown', 'I-Feminine', 'I-Masculine', 'I-Occupation', 'I-Unknown']
df_train_perso = df_train.loc[df_train.tag.isin(perso_cat_tags)]
df_dev_perso = df_dev.loc[df_dev.tag.isin(perso_cat_tags)]

In [10]:
df_train = (df_train.drop(columns=["tag"])).drop_duplicates()
df_dev = (df_dev.drop(columns=["tag"])).drop_duplicates()

In [11]:
join_cols = ["sentence_id", "token_id", "pos", "token"]

In [14]:
df_train_ling = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
df_train_ling = df_train_ling.rename(columns={"tag":"tag_linguistic"})
df_train_ling = df_train_ling.fillna('O')
# df_train_ling.head()
df_dev_ling = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
df_dev_ling = df_dev_ling.rename(columns={"tag":"tag_linguistic"})
df_dev_ling = df_dev_ling.fillna('O')
# df_dev_ling.head()

In [15]:
df_train_pers = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer")
df_train_pers = df_train_pers.rename(columns={"tag":"tag_personname"})
df_train_pers = df_train_pers.fillna('O')
df_dev_pers = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer")
df_dev_pers = df_dev_pers.rename(columns={"tag":"tag_personname"})
df_dev_pers = df_dev_pers.fillna('O')
# df_dev_pers.head()

In [16]:
df_train_cont = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer")
df_train_cont = df_train_cont.rename(columns={"tag":"tag_contextual"})
df_train_cont = df_train_cont.fillna('O')
df_dev_cont = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer")
df_dev_cont = df_dev_cont.rename(columns={"tag":"tag_contextual"})
df_dev_cont = df_dev_cont.fillna('O')
df_train_cont.head()

Unnamed: 0,sentence_id,token_id,pos,token,tag_contextual
3,1,3,NN,Title,O
4,1,4,:,:,O
5,1,5,NNS,Papers,O
6,1,6,IN,of,O
7,1,7,DT,The,B-Stereotype


In [12]:
df_train_perso = df_train.join(df_train_perso.set_index(join_cols), on=join_cols, how="outer")
df_train_perso = df_train_perso.rename(columns={"tag":"tag_pno"})
df_train_perso = df_train_perso.fillna('O')
df_dev_perso = df_dev.join(df_dev_perso.set_index(join_cols), on=join_cols, how="outer")
df_dev_perso = df_dev_perso.rename(columns={"tag":"tag_pno"})
df_dev_perso = df_dev_perso.fillna('O')
# df_dev_perso.head()

In [13]:
# df_train_ling = df_train_ling.drop_duplicates()
# df_dev_ling = df_dev_ling.drop_duplicates()
# df_train_pers = df_train_pers.drop_duplicates()
# df_dev_pers = df_dev_pers.drop_duplicates()
# df_train_cont = df_train_cont.drop_duplicates()
# df_dev_cont = df_dev_cont.drop_duplicates()
df_train_perso = df_train_perso.drop_duplicates()
df_dev_perso = df_dev_perso.drop_duplicates()

In [18]:
train_dfs = [df_train_ling, df_train_pers, df_train_cont]
dev_dfs = [df_dev_ling, df_dev_pers, df_dev_cont]
for df in train_dfs:
    print(df.shape[0], len(df.token_id.unique()))
print()
for df in dev_dfs:
    print(df.shape[0], len(df.token_id.unique()))

452222 452086
455327 452086
453119 452086

152494 152455
153568 152455
152768 152455


***

Tokens can have multiple tags, so there are more rows than unique token IDs.  In order to pass the data into a CRF model, we need to have one tag per token, so we'll simply **take the first tag** when we extract features for each token.

#### Word Embeddings

Get GloVe word embeddings (which were trained on English Wikipedia entries) for the vocabulary of the dataset (the unique tokens in the training set):

In [11]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[0]

In [12]:
glove = utils.getGloveEmbeddings(d)
# print(glove["the"])

In [13]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


In [14]:
word_embeddings = utils.getEmbeddingsForTokens(glove, vocabulary)

In [15]:
assert np.array_equal(word_embeddings[0], glove[vocabulary[0].lower()])

In [16]:
embedding_dict = dict(zip(vocabulary, word_embeddings))

In [17]:
embedding_dict_keys = list(embedding_dict.keys())
for token in vocabulary:
    assert token in embedding_dict_keys

In [18]:
# Reference: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

# Get a vector representation of a token from a GloVe word embedding
def extractEmbedding(token, embedding_dict=glove, dimensions=int(d)):
    if token.isalpha():
        token = token.lower()
    try:
        embedding = embedding_dict[token]
    except KeyError:
        embedding = np.zeros((dimensions,))
    return embedding.reshape(-1,1)

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,
        'token': token,
    }
    
    # Add each value in a token's word embedding as a separate feature
    # Reference: https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag_list[0] for token, pos, tag_list in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag_list in sentence]

<a id="h"></a>
#### Hypothesis:
The three multiclass baseline sequence classifiers with GloVe word embeddings as features will have higher F1 scores than the same models with custom fastText embeddings. 

<a id="1"></a>
## 1. Models

<a id="all"></a>
## All Labels

* **Features:** custom fastText embeddings
* **Target:** Person-Name label category IOB tags
* **Algorithm:** AROW, variance=1

#### Preprocessing

In [19]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [20]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [21]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences = utils.zipFeaturesAndTarget(df_train_grouped, "tag")
dev_sentences = utils.zipFeaturesAndTarget(df_dev_grouped, "tag")

In [23]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=1 was best-performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags.  We'll increase the max iterations to 100 for this model.

In [25]:
clf = sklearn_crfsuite.CRF(algorithm='arow', variance=1, max_iterations=50, all_possible_transitions=True)

In [26]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [27]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Unknown', 'I-Unknown', 'I-Stereotype', 'I-Masculine', 'B-Masculine', 'B-Stereotype', 'B-Occupation', 'I-Occupation', 'B-Gendered-Pronoun', 'B-Omission', 'B-Generalization', 'B-Gendered-Role', 'B-Feminine', 'I-Omission', 'I-Generalization', 'I-Feminine', 'I-Gendered-Role', 'I-Gendered-Pronoun']


#### Predict

In [28]:
y_pred = clf.predict(X_dev)

#### Evaluate

##### Summary (with O label)

In [29]:
# Evaluate
print(clf.algorithm, clf.c1, clf.c2, clf.pa_type, clf.variance)
print("  Macro:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  Micro:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  Per Label:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 1
  Macro:
  - F1: 0.3448072001994752
  - Prec: 0.3280949779307769
  - Rec 0.3718758583588562
  Micro:
  - F1: 0.38332083645755227
  - Prec: 0.35861262665627436
  - Rec 0.4116857551896922
  Per Label:
  - F1: [0.42539683 0.43977055 0.16512456 0.29272811 0.30212014 0.17391304
 0.59833333 0.49354005 0.8297456  0.13043478 0.2893617  0.50815217
 0.44311377 0.14482201 0.2364532  0.36917866 0.36434109 0.        ]
  - Prec: [0.45995423 0.44384408 0.13875598 0.24515128 0.28547579 0.15483871
 0.61472603 0.47869674 0.77372263 0.10700132 0.31775701 0.51944444
 0.43529412 0.11949266 0.18604651 0.33538462 0.29012346 0.        ]
  - Rec [0.39566929 0.43577113 0.20386643 0.36321839 0.32082552 0.19834711
 0.58279221 0.50933333 0.89451477 0.16701031 0.265625   0.49734043
 0.45121951 0.18377823 0.32432432 0.41054614 0.48958333 0.        ]


##### Strict Evaluation

In [83]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag":"tag_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_predicted", y_pred)
# df_dev_grouped.head()

In [85]:
df_dev_grouped = df_dev_grouped.set_index(["sentence_id"])
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
# df_dev_exploded.head()
df_dev_exploded = df_dev_exploded.explode(["tag_expected"])
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_expected,tag_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,O,O
5,155,PRP$,his,B-Gendered-Pronoun,B-Gendered-Pronoun
5,156,NN,ordination,O,O
5,157,PRP,he,B-Gendered-Pronoun,B-Gendered-Pronoun
5,158,VBD,spent,O,O


In [86]:
df_dev_exploded = df_dev_exploded.fillna("O")

In [88]:
df_dev_exploded = df_dev_exploded.rename(columns={"sentence":"token"})

In [89]:
exp_df = df_dev_exploded.drop(columns=["tag_predicted"]).reset_index()
pred_df = df_dev_exploded.drop(columns=["tag_expected"]).reset_index()
# pred_df.head()

In [90]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "tag_expected"],   # left on
    ["sentence_id", "token_id", "token", "pos", "tag_predicted"],  # right on
    ["sentence_id", "token_id", "token", "pos", "tag_expected", "tag_predicted", "_merge"],  # final column list
    "tag_expected",
    "tag_predicted", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag_expected,tag_predicted,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative


In [91]:
filename = "crf_arow_var1_baseline_GloVe{d}_predictions.csv".format(d=d)
eval_df.to_csv(config.tokc_path+"sequence_model_output/"+filename)

Calculate precision, recall, and F1 score at the token level for each tag:

In [92]:
agmt_stats = pd.DataFrame()
for tag in targets:
#     getScoresByTags(df, eval_col, tags, exp_col="expected_tag", pred_col="predicted_tag"):
    tag_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [tag], exp_col="tag_expected", pred_col="tag_predicted")
    agmt_stats = pd.concat([agmt_stats, tag_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,B-Unknown,1182,654,0,1928,0.746708,0.619936,0.677442
0,I-Unknown,1772,1408,0,3244,0.697334,0.64673,0.67108
0,I-Stereotype,585,728,0,326,0.309298,0.357849,0.331807
0,I-Masculine,801,917,0,1352,0.595857,0.627961,0.611488
0,B-Masculine,676,357,0,1110,0.756646,0.621501,0.682447
0,B-Stereotype,222,136,0,84,0.381818,0.27451,0.319392
0,B-Occupation,283,213,0,816,0.793003,0.742493,0.766917
0,I-Occupation,389,422,0,824,0.661316,0.679308,0.670191
0,B-Gendered-Pronoun,85,165,0,1408,0.895105,0.943068,0.918461
0,B-Omission,834,612,0,798,0.565957,0.488971,0.524655


Save the statistics:

In [94]:
filename = "crf_arow_var1_baseline_GloVe{d}_agreement.csv".format(d=d)
stats.to_csv(config.tokc_path+"sequence_model_performance/"+filename)

<a id="ling"></a>
### Linguistic

* **Features:** custom fastText embeddings, token suffixes
* **Target:** Linguistic label category IOB tags
* **Algorithm:** AROW

#### Preprocessing

In [27]:
df_train = df_train_ling
df_dev = df_dev_ling

Group the data by token, so the all the tags for one token are recorded in a list for that token's row:

In [28]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

Group the data by sentence, where each sentence is a list of tokens:

In [29]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [30]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_ling = utils.zipFeaturesAndTarget(df_train_grouped, "tag_linguistic")
# print(train_sentences_ling[0][:3])
dev_sentences_ling = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_linguistic")
# print(dev_sentences_ling[0][:3])

Extract the features and targets:

In [31]:
train_sentences = train_sentences_ling
dev_sentences = dev_sentences_ling

In [32]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** pa with pa_type=0 was best, however, arow with variance=0.5 also had strong performance, and is strongest with other categories, Person Name and Contextual, so we'll use arow.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Linguistic** category of tags.  We'll increase the max_iterations to 100 for this model.

In [33]:
clf_ling = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [34]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_ling.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [35]:
targets = list(clf_ling.classes_)
targets.remove('O')
print(targets)

['B-Gendered-Pronoun', 'B-Generalization', 'B-Gendered-Role', 'I-Generalization', 'I-Gendered-Role', 'I-Gendered-Pronoun']


#### Predict

In [36]:
y_pred = clf_ling.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [37]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.46127525476481473
  - Prec: 0.6776789299424809
  - Rec 0.35706340378198


The baseline model's scores with GloVe embeddings of 100 dimensions were:

The baseline model's scores with GloVe embeddings of 50 dimensions were:
  - F1: 0.6682632094564904
  - Prec: 0.6397609767051633
  - Rec 0.7057842046718577
  
Which is a slightly lower performance than the baseline model with custom fastText embeddings:
  - F1: 0.6697698893667122
  - Prec: 0.696012355375722
  - Rec 0.6718576195773082

<a id="pers"></a>
### Person Name

* **Features:** custom fastText embeddings
* **Target:** Person-Name label category IOB tags
* **Algorithm:** AROW

#### Preprocessing

In [27]:
df_train = df_train_pers
df_dev = df_dev_pers

In [28]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [29]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [30]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_pers = utils.zipFeaturesAndTarget(df_train_grouped, "tag_personname")
# print(train_sentences_pers[0][:3])
dev_sentences_pers = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_personname")
# print(dev_sentences_pers[0][:3])

In [31]:
train_sentences = train_sentences_pers
dev_sentences = dev_sentences_pers

In [32]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=0.5 was best-performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags.  We'll increase the max iterations to 100 for this model.

In [33]:
clf_pers = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [34]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [35]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['B-Unknown', 'I-Unknown', 'I-Masculine', 'B-Masculine', 'B-Feminine', 'I-Feminine']


#### Predict

In [36]:
y_pred = clf_pers.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [37]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.45720798958795417
  - Prec: 0.45011745722056856
  - Rec 0.4673524789255608


The baseline model's performance scores were:
  - F1: 0.45688362747838485
  - Prec: 0.44852949944619963
  - Rec 0.4783540505786541

Compared to the baseline model with custom fastText embeddings:
  - F1: 0.4789287297354048
  - Prec: 0.5381704595959792
  - Rec 0.43620517216745247

<a id="cont"></a>
### Contextual

* **Features:** custom fastText embeddings, token suffixes
* **Target:** Contextual label category IOB tags
* **Algorithm:** AROW

#### Preprocessing

In [27]:
df_train = df_train_cont
df_dev = df_dev_cont

In [28]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [29]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [30]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_cont = utils.zipFeaturesAndTarget(df_train_grouped, "tag_contextual")
# print(train_sentences_cont[0][:3])
dev_sentences_cont = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_contextual")
# print(dev_sentences_cont[0][:3])

In [31]:
train_sentences = train_sentences_cont
dev_sentences = dev_sentences_cont

In [32]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=0.5 was the best performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Contextual** category of tags.  We'll increase the maximum iterations to 100 for this model.

In [33]:
clf_cont = sklearn_crfsuite.CRF(algorithm='arow', variance=0.5, max_iterations=100, all_possible_transitions=True)

In [34]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_cont.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [35]:
targets = list(clf_cont.classes_)
targets.remove('O')
print(targets)

['B-Stereotype', 'I-Stereotype', 'B-Occupation', 'I-Occupation', 'B-Omission', 'I-Omission']


#### Predict

In [36]:
y_pred = clf_cont.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [37]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.4043437926935072
  - Prec: 0.4144046520219574
  - Rec 0.3993485342019544


The baseline model's performance scores were:
  - F1: 0.3896210891326545
  - Prec: 0.40701772348045956
  - Rec 0.3752442996742671
  
Compared to the baseline model with custom fastText embeddings:
  - F1: 0.40803990799640844
  - Prec: 0.4124968743940007
  - Rec 0.4091205211726384

<a id="perso"></a>
### Person Name + Occupation

* **Features:** custom fastText embeddings
* **Target:** Person-Name label category IOB tags
* **Algorithm:** AROW, variance=1

#### Preprocessing

In [22]:
df_train = df_train_perso
df_dev = df_dev_perso

In [23]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [24]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [26]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_perso = utils.zipFeaturesAndTarget(df_train_grouped, "tag_pno")
dev_sentences_perso = utils.zipFeaturesAndTarget(df_dev_grouped, "tag_pno")

In [27]:
train_sentences = train_sentences_perso
dev_sentences = dev_sentences_perso

In [28]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=1 was best-performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags.  We'll increase the max iterations to 100 for this model.

In [31]:
clf_pers = sklearn_crfsuite.CRF(algorithm='arow', variance=1, max_iterations=50, all_possible_transitions=True)

In [32]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [33]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['B-Unknown', 'I-Unknown', 'I-Masculine', 'B-Masculine', 'B-Occupation', 'I-Occupation', 'B-Feminine', 'I-Feminine']


#### Predict

In [34]:
y_pred = clf_pers.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [35]:
# Evaluate
print(clf_pers.algorithm, clf_pers.c1, clf_pers.c2, clf_pers.pa_type, clf_pers.variance)
print("  Macro:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  Micro:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  Per Label:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 1
  Macro:
  - F1: 0.5243867381337348
  - Prec: 0.5403350348885592
  - Rec 0.5144901877709556
  Micro:
  - F1: 0.49842427238460113
  - Prec: 0.5195156511657865
  - Rec 0.478978622327791
  Per Label:
  - F1: [0.50196335 0.48682946 0.37071173 0.46112957 0.63591433 0.54187192
 0.66090713 0.53576642]
  - Prec: [0.55660377 0.52335676 0.37052201 0.44316731 0.6819788  0.59875583
 0.67105263 0.47724317]
  - Rec [0.45709178 0.4550683  0.37090164 0.48060942 0.59567901 0.49485861
 0.65106383 0.61064892]
