# Baseline Gender Bias Sequence Classifiers with GloVe

### Target: Labels

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/crf_l2sgd/`
* Sequence classification
    * 9 lables (2 from original annotation taxonomy weren't applied during manual annotation):
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not annotated with)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering not annotated with enough)
    * 1 model per category
* Word embeddings
    * GloVe (trained on Gigaword 5 + English Wikipedia dump)

***

### Table of Contents

[0.](#0) Preprocessing

  * [Hypotheses](#h)

[1.](#1) Models

  * [All Labels](#all)
  * [Linguistic](#ling)
  * [Person Name](#pers)
  * [Contextual](#cont)
  * [Person Name + Occupation](#perso)
  
[2.](#2) Performance Evaluation

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt

# For preprocessing
from nltk.stem import WordNetLemmatizer
import scipy.stats

from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# For evaluation
from collections import Counter
from sklearn.metrics import classification_report, make_scorer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay#, plot_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score
from intervaltree import Interval, IntervalTree

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [6]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag"]

In [7]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create separate subsets of data for each category so they can be used with three separate models, replacing `NaN` tag values with `'O'`:

In [8]:
tags = (df_train.tag.unique())
tags.sort()
print(tags)

['B-Feminine' 'B-Gendered-Pronoun' 'B-Gendered-Role' 'B-Generalization'
 'B-Masculine' 'B-Occupation' 'B-Omission' 'B-Stereotype' 'B-Unknown'
 'I-Feminine' 'I-Gendered-Pronoun' 'I-Gendered-Role' 'I-Generalization'
 'I-Masculine' 'I-Occupation' 'I-Omission' 'I-Stereotype' 'I-Unknown' 'O']


***
**Optional Preprocessing** - if only using a subset of tags

In [9]:
ling_cat_tags = ['B-Gendered-Pronoun', 'B-Gendered-Role', 'B-Generalization', 'I-Gendered-Pronoun', 'I-Gendered-Role', 'I-Generalization']
df_train_ling = df_train.loc[df_train.tag.isin(ling_cat_tags)]
df_dev_ling = df_dev.loc[df_dev.tag.isin(ling_cat_tags)]
category = "linguistic"

In [10]:
# pers_cat_tags = ['B-Feminine', 'B-Masculine', 'B-Unknown', 'I-Feminine', 'I-Masculine', 'I-Unknown']
# df_train_pers = df_train.loc[df_train.tag.isin(pers_cat_tags)]
# df_dev_pers = df_dev.loc[df_dev.tag.isin(pers_cat_tags)]

In [11]:
# cont_cat_tags = ['B-Occupation', 'B-Omission', 'B-Stereotype', 'I-Occupation', 'I-Omission', 'I-Stereotype']
# df_train_cont = df_train.loc[df_train.tag.isin(cont_cat_tags)]
# df_dev_cont = df_dev.loc[df_dev.tag.isin(cont_cat_tags)]

In [12]:
# perso_cat_tags = ['B-Feminine', 'B-Masculine', 'B-Occupation', 'B-Unknown', 'I-Feminine', 'I-Masculine', 'I-Occupation', 'I-Unknown']
# df_train_perso = df_train.loc[df_train.tag.isin(perso_cat_tags)]
# df_dev_perso = df_dev.loc[df_dev.tag.isin(perso_cat_tags)]
# category = "pers_o"

In [13]:
df_train = (df_train.drop(columns=["tag"])).drop_duplicates()
df_dev = (df_dev.drop(columns=["tag"])).drop_duplicates()

In [14]:
join_cols = ["sentence_id", "token_id", "pos", "token"]

In [15]:
df_train_ling = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
df_train_ling = df_train_ling.fillna('O')
# df_train_ling.head()
df_dev_ling = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
df_dev_ling = df_dev_ling.fillna('O')
# df_dev_ling.head()

In [16]:
# df_train_pers = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer")
# df_train_pers = df_train_pers.rename(columns={"tag":"tag_personname"})
# df_train_pers = df_train_pers.fillna('O')
# df_dev_pers = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer")
# df_dev_pers = df_dev_pers.rename(columns={"tag":"tag_personname"})
# df_dev_pers = df_dev_pers.fillna('O')
# # df_dev_pers.head()

In [17]:
# df_train_cont = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer")
# df_train_cont = df_train_cont.rename(columns={"tag":"tag_contextual"})
# df_train_cont = df_train_cont.fillna('O')
# df_dev_cont = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer")
# df_dev_cont = df_dev_cont.rename(columns={"tag":"tag_contextual"})
# df_dev_cont = df_dev_cont.fillna('O')
# df_train_cont.head()

In [18]:
# df_train_perso = df_train.join(df_train_perso.set_index(join_cols), on=join_cols, how="outer")
# df_train_perso = df_train_perso.fillna('O')
# df_dev_perso = df_dev.join(df_dev_perso.set_index(join_cols), on=join_cols, how="outer")
# df_dev_perso = df_dev_perso.fillna('O')
# # df_dev_perso.head()

In [19]:
df_train_ling = df_train_ling.drop_duplicates()
df_dev_ling = df_dev_ling.drop_duplicates()
# df_train_pers = df_train_pers.drop_duplicates()
# df_dev_pers = df_dev_pers.drop_duplicates()
# df_train_cont = df_train_cont.drop_duplicates()
# df_dev_cont = df_dev_cont.drop_duplicates()
# df_train_perso = df_train_perso.drop_duplicates()
# df_dev_perso = df_dev_perso.drop_duplicates()

In [20]:
# train_dfs = [df_train_ling, df_train_pers, df_train_cont]
# dev_dfs = [df_dev_ling, df_dev_pers, df_dev_cont]
# for df in train_dfs:
#     print(df.shape[0], len(df.token_id.unique()))
# print()
# for df in dev_dfs:
#     print(df.shape[0], len(df.token_id.unique()))

***

Tokens can have multiple tags, so there are more rows than unique token IDs.  In order to pass the data into a CRF model, we need to have one tag per token, so we'll simply **take the first tag** when we extract features for each token.

#### Word Embeddings

Get GloVe word embeddings (which were trained on English Wikipedia entries) for the vocabulary of the dataset (the unique tokens in the training set):

In [21]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[1]

In [22]:
glove = utils.getGloveEmbeddings(d)
# print(glove["the"])

In [23]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


In [24]:
word_embeddings = utils.getEmbeddingsForTokens(glove, vocabulary)

In [25]:
assert np.array_equal(word_embeddings[0], glove[vocabulary[0].lower()])

In [26]:
embedding_dict = dict(zip(vocabulary, word_embeddings))

In [27]:
embedding_dict_keys = list(embedding_dict.keys())
for token in vocabulary:
    assert token in embedding_dict_keys

In [28]:
# Reference: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

# Get a vector representation of a token from a GloVe word embedding
def extractEmbedding(token, embedding_dict=glove, dimensions=int(d)):
    if token.isalpha():
        token = token.lower()
    try:
        embedding = embedding_dict[token]
    except KeyError:
        embedding = np.zeros((dimensions,))
    return embedding.reshape(-1,1)

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    pos = sentence[i][1]
    features = {
        'bias': 1.0,
        'token': token,
    }
    
    # Add each value in a token's word embedding as a separate feature
    # Reference: https://stackoverflow.com/questions/58736548/how-to-use-word-embedding-as-features-for-crf-sklearn-crfsuite-model-training
    embedding = extractEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [tag_list[0] for token, pos, tag_list in sentence]

def extractSentenceTokens(sentence):
    return [token for token, pos, tag_list in sentence]

<a id="h"></a>
#### Hypothesis:
The three multiclass baseline sequence classifiers with GloVe word embeddings as features will have higher F1 scores than the same models with custom fastText embeddings. 

<a id="1"></a>
## 1. Models

<a id="all"></a>
## Custom Label Selection

* **Features:** custom fastText embeddings
* **Algorithm:** AROW, variance=1

#### Preprocessing

In [29]:
df_train = df_train_ling #df_train_perso
df_dev = df_dev_ling #df_dev_perso

In [30]:
df_train_token_groups = utils.implodeDataFrame(df_train, ['token_id', 'sentence_id', 'pos', 'token'])
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'pos', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
df_dev_token_groups = df_dev_token_groups.reset_index()

In [31]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
# df_dev_grouped.head()

Zip the POS and category tags together with the tokens so each sentence item is a tuple: `(TOKEN, POS-TAG, TAG_LIST)`

In [32]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences = utils.zipFeaturesAndTarget(df_train_grouped, "tag")
dev_sentences = utils.zipFeaturesAndTarget(df_dev_grouped, "tag")

In [33]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

**From Optimization of Baseline Model:** arow with variance=1 was best-performing algorithm/parameter combination.

#### Train

Train a Conditional Random Field (CRF) model with 50 maximum iterations.

In [34]:
clf = sklearn_crfsuite.CRF(algorithm='arow', variance=1, max_iterations=50, all_possible_transitions=True)

In [35]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [36]:
targets = list(clf.classes_)
targets.remove('O')
print(targets)

['B-Gendered-Pronoun', 'B-Generalization', 'B-Gendered-Role', 'I-Generalization', 'I-Gendered-Role', 'I-Gendered-Pronoun']


#### Predict

In [37]:
y_pred = clf.predict(X_dev)

#### Evaluate

##### Summary (with O label)

In [38]:
# Evaluate
print(clf.algorithm, clf.c1, clf.c2, clf.pa_type, clf.variance)
print("  Macro:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="macro", zero_division=0, labels=targets))
print("  Micro:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="micro", zero_division=0, labels=targets))
print("  Per Label:")
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average=None, zero_division=0, labels=targets))

arow None None None 1
  Macro:
  - F1: 0.4018151141437465
  - Prec: 0.5644738220880121
  - Rec 0.3439536665778246
  Micro:
  - F1: 0.49202578893790294
  - Prec: 0.6309834638816362
  - Rec 0.4032258064516129
  Per Label:
  - F1: [0.45490196 0.24347826 0.65107577 0.25660377 0.5826087  0.22222222]
  - Prec: [0.81978799 0.336      0.69047619 0.28099174 0.59292035 0.66666667]
  - Rec [0.31478969 0.19090909 0.6159292  0.23611111 0.57264957 0.13333333]


##### Strict Evaluation

In [39]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag":"tag_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_predicted", y_pred)
# df_dev_grouped.head()

In [40]:
df_dev_grouped = df_dev_grouped.set_index(["sentence_id"])
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
# df_dev_exploded.head()
df_dev_exploded = df_dev_exploded.explode(["tag_expected"])
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_expected,tag_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,154,IN,After,O,O
5,155,PRP$,his,B-Gendered-Pronoun,O
5,156,NN,ordination,O,O
5,157,PRP,he,B-Gendered-Pronoun,O
5,158,VBD,spent,O,O


In [41]:
df_dev_exploded = df_dev_exploded.fillna("O")

In [42]:
df_dev_exploded = df_dev_exploded.rename(columns={"sentence":"token"})

In [43]:
exp_df = df_dev_exploded.drop(columns=["tag_predicted"]).reset_index()
pred_df = df_dev_exploded.drop(columns=["tag_expected"]).reset_index()
# pred_df.head()

In [44]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "tag_expected"],   # left on
    ["sentence_id", "token_id", "token", "pos", "tag_predicted"],  # right on
    ["sentence_id", "token_id", "token", "pos", "tag_expected", "tag_predicted", "_merge"],  # final column list
    "tag_expected",
    "tag_predicted", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag_expected,tag_predicted,_merge
0,5,154,After,IN,O,O,true negative
2,5,156,ordination,NN,O,O,true negative
4,5,158,spent,VBD,O,O,true negative
5,5,159,three,CD,O,O,true negative
6,5,160,years,NNS,O,O,true negative


In [45]:
filename = "crf_arow_var1_{c}_baseline_GloVe{d}_predictions.csv".format(d=d, c=category)
eval_df.to_csv(config.tokc_path+"sequence_model_output/"+filename)

Calculate precision, recall, and F1 score at the token level for each tag:

In [49]:
targets = ['B-Gendered-Pronoun', 'I-Gendered-Pronoun', 'B-Gendered-Role', 'I-Gendered-Role', 'B-Generalization', 'I-Generalization']

In [50]:
agmt_stats = pd.DataFrame()
for tag in targets:
#     getScoresByTags(df, eval_col, tags, exp_col="expected_tag", pred_col="predicted_tag"):
    tag_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [tag], exp_col="tag_expected", pred_col="tag_predicted")
    agmt_stats = pd.concat([agmt_stats, tag_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,B-Gendered-Pronoun,4,0,0,480,1.0,0.991736,0.995851
0,I-Gendered-Pronoun,3,0,0,4,1.0,0.571429,0.727273
0,B-Gendered-Role,3,2,0,720,0.99723,0.995851,0.99654
0,I-Gendered-Role,3,2,0,134,0.985294,0.978102,0.981685
0,B-Generalization,15,5,0,88,0.946237,0.854369,0.897959
0,I-Generalization,3,7,0,68,0.906667,0.957746,0.931507


Save the statistics:

In [51]:
filename = "crf_arow_var1_{c}_baseline_GloVe{d}_agreement.csv".format(d=d, c=category)
agmt_stats.to_csv(config.tokc_path+"sequence_model_performance/"+filename)