# Baseline Gender Biased Token Classifiers with GloVe

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not applied during annotation)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering only applied by one annotator and too few times for training)
* Word Embeddings: GloVe 

***

### Table of Contents

[0.](#0) Preprocessing

[1.](#CC) Classifier Chain Models

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For classification
import sklearn.metrics as metrics
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.adapt import MLTSVM
# Base estimators
from sklearn.ensemble import RandomForestClassifier #, ExtraTreesClassifier
# from sklearn.linear_model import LogisticRegression, RidgeClassifier, PassiveAggressiveClassifier

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


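|   | description_id | sentence_id | ann_id | token_id | token | token_offsets | pos | tag | field | subset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 1 | 1 | 99999 | 3 | Title | (17, 22) | NN | O | Title | train |
| 4 | 1 | 1 | 99999 | 4 | : | (22, 23) | : | O | Title | train |
| 5 | 1 | 1 | 99999 | 5 | Papers | (24, 30) | NNS | O | Title | train |
| 6 | 1 | 1 | 99999 | 6 | of | (31, 33) | IN | O | Title | train |
| 7 | 1 | 1 | 14384 | 7 | The | (34, 37) | DT | B-Unknown | Title | train |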


In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove the Non-binary labels: they were applied by mistake, identified early on, and meant to be excluded. Because only one token carries this label, keeping it would also prevent the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

***
#### Optional Preprocessing Step

If not classifying all labels at once, keep only the tags for the selected subset of labels, replacing every tag outside that subset with `"O"`:

In [6]:
# cont_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]
# pers_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]#, "B-Nonbinary", "I-Nonbinary"]
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", pers_label_subset)
df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", ling_label_subset)
print(df_train.shape, df_dev.shape)

(463439, 9) (156146, 9)
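
`utils.selectDataForLabels` is a project helper whose source is not shown here. The following is a minimal sketch of the behaviour described above, assuming the helper only rewrites out-of-subset tags to `"O"` and keeps every row (which would be consistent with the unchanged shapes printed above). The function name `keep_only_labels` and the toy data are hypothetical.

```python
import pandas as pd

def keep_only_labels(df, tag_col, label_subset, no_tag="O"):
    """Hypothetical stand-in: return a copy of df where tags outside label_subset become no_tag."""
    out = df.copy()
    # Series.where keeps values where the condition holds and substitutes no_tag elsewhere
    out[tag_col] = out[tag_col].where(out[tag_col].isin(label_subset), no_tag)
    return out

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "token": ["his", "wife", "Papers"],
    "tag": ["B-Gendered-Pronoun", "B-Gendered-Role", "B-Unknown"],
})
subset = ["B-Gendered-Pronoun", "I-Gendered-Pronoun"]
filtered = keep_only_labels(toy, "tag", subset)
print(filtered["tag"].tolist())
```

Working on a copy leaves the input frame untouched, so a different label subset can be selected later without reloading the data.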


***

Group the data by token, so there is one row per token rather than one row per token-tag pair:

In [7]:
subdf_train = df_train.drop(columns=["description_id", "field", "subset", "token_offsets"])
subdf_dev = df_dev.drop(columns=["description_id", "field", "subset", "token_offsets"])
df_train_imploded = utils.implodeDataFrame(subdf_train, ["sentence_id", "token_id", "token", "pos"])
df_train_imploded = df_train_imploded.reset_index()
df_dev_imploded = utils.implodeDataFrame(subdf_dev, ["sentence_id", "token_id", "token", "pos"])
df_dev_imploded = df_dev_imploded.reset_index()
df_dev_imploded.head()

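|   | sentence_id | token_id | token | pos | tag |
| --- | --- | --- | --- | --- | --- |
| 0 | 5 | 154 | After | IN | [O] |
| 1 | 5 | 155 | his | PRP$ | [B-Gendered-Pronoun] |
| 2 | 5 | 156 | ordination | NN | [O] |
| 3 | 5 | 157 | he | PRP | [B-Gendered-Pronoun] |
| 4 | 5 | 158 | spent | VBD | [O] |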
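
The grouping step relies on `utils.implodeDataFrame`, which is not shown in this notebook. A rough sketch of the same idea with plain pandas, assuming the helper collects the remaining column into a list per group; the name `implode` and the toy rows are hypothetical.

```python
import pandas as pd

def implode(df, group_cols, list_col="tag"):
    """Hypothetical stand-in: one row per group, with list_col gathered into a list."""
    return df.groupby(group_cols)[list_col].agg(list).reset_index()

# Toy data: token 7 carries two tags, token 8 carries one
toy = pd.DataFrame({
    "sentence_id": [1, 1, 1],
    "token_id": [7, 7, 8],
    "token": ["The", "The", "Misses"],
    "pos": ["DT", "DT", "NNPS"],
    "tag": ["B-Unknown", "B-Feminine", "I-Unknown"],
})
imploded = implode(toy, ["sentence_id", "token_id", "token", "pos"])
print(imploded)
```

The notebook calls `reset_index()` on the helper's result, which suggests the real helper returns the grouping columns as the index; the sketch folds that step into the function.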


Replace the tags with label names (remove ``B-`` and ``I-``):

In [8]:
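def getLabelColFromTagCol(df, col):
    # Convert each row's list of IOB tags (e.g. "B-Unknown") into a sorted list
    # of unique label names (e.g. "Unknown"); "O" is kept as is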
    col_list = list(df[col])
    new_col = []
    for value_list in col_list:
        new_value_list = []
        for value in value_list:
            if value != "O":
                new_value = value[2:]
                new_value_list += [new_value]
            else:
                new_value_list += [value]
        # Remove any duplicates from the list of labels
        unique_values = list(set(new_value_list))
        # Sort the list of labels alphabetically
        unique_values.sort()
        new_col += [unique_values]
    assert len(new_col) == len(col_list)
    return new_col
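
For a quick check of the conversion on a single row, the same logic can be written per row. This is a compact sketch, assuming every tag other than `"O"` starts with a two-character `B-` or `I-` prefix; the name `tags_to_labels` is hypothetical.

```python
def tags_to_labels(tags, no_tag="O"):
    """Hypothetical per-row equivalent: strip the IOB prefix, deduplicate, sort."""
    return sorted({tag if tag == no_tag else tag[2:] for tag in tags})

print(tags_to_labels(["B-Unknown", "I-Unknown", "O"]))
print(tags_to_labels(["O", "O", "O"]))
```

Note that a token tagged both `"O"` and a label keeps both values, so `"O"` can appear alongside a real label in the result.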

In [9]:
train_labels = getLabelColFromTagCol(df_train_imploded, "tag")
# train_labels[:10]  # Looks good
dev_labels = getLabelColFromTagCol(df_dev_imploded, "tag")
# dev_labels[:10] # Looks good

In [10]:
df_train_imploded.insert(len(df_train_imploded.columns), "label", train_labels)
df_train_imploded.head()

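|   | sentence_id | token_id | token | pos | tag | label |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3 | Title | NN | [O] | [O] |
| 1 | 1 | 4 | : | : | [O] | [O] |
| 2 | 1 | 5 | Papers | NNS | [O] | [O] |
| 3 | 1 | 6 | of | IN | [O] | [O] |
| 4 | 1 | 7 | The | DT | [O, O, O] | [O] |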


In [11]:
df_dev_imploded.insert(len(df_dev_imploded.columns), "label", dev_labels)
df_dev_imploded.head()

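|   | sentence_id | token_id | token | pos | tag | label |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | 154 | After | IN | [O] | [O] |
| 1 | 5 | 155 | his | PRP$ | [B-Gendered-Pronoun] | [Gendered-Pronoun] |
| 2 | 5 | 156 | ordination | NN | [O] | [O] |
| 3 | 5 | 157 | he | PRP | [B-Gendered-Pronoun] | [Gendered-Pronoun] |
| 4 | 5 | 158 | spent | VBD | [O] | [O] |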


#### Word Embeddings

Get pretrained GloVe word embeddings for the dataset's vocabulary (the unique tokens in the training set). The available dimensions (50, 100, 200, 300) match the `glove.6B` release, which was trained on English Wikipedia 2014 plus Gigaword 5:

In [12]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[1]

In [13]:
glove = utils.getGloveEmbeddings(d)
# print(glove["the"])
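
`utils.getGloveEmbeddings` is a project helper that is not shown here. As a rough illustration of what such a loader might do, the sketch below parses GloVe's plain-text format (one token per line, followed by its space-separated vector components) into a dictionary. The function name `load_glove` and the in-memory stand-in for the file are assumptions, not the project's code.

```python
import io
import numpy as np

def load_glove(lines):
    """Hypothetical loader: map each token to its vector, given an iterable of text lines."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        # First field is the token, the rest are the vector components
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Stand-in for an open file handle on a GloVe text file (toy 3-dimensional vectors)
fake_file = io.StringIO("the 0.1 0.2 0.3\nshe -0.4 0.5 0.6\n")
glove_toy = load_glove(fake_file)
print(glove_toy["she"].shape)
```

A plain dictionary raises `KeyError` for unknown tokens, which is the behaviour the feature extraction further down relies on when it falls back to a zero vector.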

In [14]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


In [15]:
word_embeddings = utils.getEmbeddingsForTokens(glove, vocabulary)

In [16]:
assert np.array_equal(word_embeddings[0], glove[vocabulary[0].lower()])

In [17]:
embedding_dict = dict(zip(vocabulary, word_embeddings))

In [18]:
embedding_dict_keys = list(embedding_dict.keys())
for token in vocabulary:
    assert token in embedding_dict_keys

Vectorize and binarize the data:

In [19]:
mlb = MultiLabelBinarizer()

In [20]:
target_col = "label"
feature_cols = ["token_id", "token"]
train_data = df_train_imploded
dev_data = df_dev_imploded

Extract features:

In [21]:
# Get a vector representation of a token from a GloVe word embedding
def extractEmbedding(token, embedding_dict=glove, dimensions=int(d)):
    if token.isalpha():
        token = token.lower()
    try:
        embedding = embedding_dict[token]
    except KeyError:
        embedding = np.zeros((dimensions,))
    return embedding.reshape(-1,1)

def makeFeatureMatrix(token_data, dimensions=int(d)):    
    feature_list = [extractEmbedding(token) for token_id,token in token_data]
    return np.array(feature_list).reshape(-1,dimensions)
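
To see how the lookup treats casing and out-of-vocabulary tokens, here is a small self-contained sketch of the same idea on a toy dictionary. It assumes a plain `dict` of vectors; the name `lookup` and the toy embedding are hypothetical.

```python
import numpy as np

def lookup(token, embedding_dict, dimensions):
    """Hypothetical helper: lowercase alphabetic tokens, fall back to zeros when unknown."""
    key = token.lower() if token.isalpha() else token
    return embedding_dict.get(key, np.zeros(dimensions))

# Toy 3-dimensional embedding with a single known word
toy_embeddings = {"papers": np.array([0.1, 0.2, 0.3])}
tokens = ["Papers", "of", "1920-1935"]

# Stack one row per token into a (number_of_samples, number_of_features) matrix
X = np.vstack([lookup(token, toy_embeddings, 3) for token in tokens])
oov_rate = (X == 0).all(axis=1).mean()  # share of tokens that received the zero vector
print(X.shape, oov_rate)
```

Because every unknown token maps to the same all-zero row, the classifier cannot tell such tokens apart, so the out-of-vocabulary rate is worth checking before interpreting the scores.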

In [22]:
train_tokens = list(zip(train_data[feature_cols[0]], train_data[feature_cols[1]]))
dev_tokens = list(zip(dev_data[feature_cols[0]], dev_data[feature_cols[1]]))

In [23]:
X_train = makeFeatureMatrix(train_tokens)
X_dev = makeFeatureMatrix(dev_tokens)
print(X_train.shape, X_dev.shape)  # (number_of_samples, number_of_features)

(452086, 100) (152455, 100)


Binarize targets:

In [24]:
y_train_labels = train_data[target_col]
y_train = mlb.fit_transform(y_train_labels)
y_dev_labels = dev_data[target_col]
y_dev = mlb.transform(y_dev_labels)
print(y_train.shape, y_dev.shape)  # number_of_samples, number_of_labels

(452086, 4) (152455, 4)


In [25]:
for labels in y_train:
    if sum(labels) > 1:
        print("Multilabelled tokens exist, as expected.")
        break

Multilabelled tokens exist, as expected.
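
A toy example may make the binarized target easier to read. The sketch below runs `MultiLabelBinarizer` on three hand-written label lists (toy data, not from the dataset); the variable names are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three toy tokens: unlabelled, single-labelled, and multilabelled
toy_labels = [["O"], ["Gendered-Pronoun"], ["Gendered-Role", "Generalization"]]

toy_mlb = MultiLabelBinarizer()
Y = toy_mlb.fit_transform(toy_labels)

print(list(toy_mlb.classes_))  # column order of Y (sorted label names)
print(Y)
```

Each column corresponds to one entry of `classes_`, and `"O"` gets a column of its own, which would explain why the linguistic subset of three labels yields four target columns above.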


For baseline models, use only the tokens' embeddings as features.

<a id="CC"></a>
## 1. Classifier Chain Models

Classifier chain models with a Random Forest base estimator performed best in the earlier experiments with custom fastText embeddings, so the same model is used here to compare performance with GloVe embeddings.

*Reference: http://scikit.ml/api/skmultilearn.problem_transform.cc.html#skmultilearn.problem_transform.ClassifierChain*
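
In a classifier chain, one binary classifier is trained per label, and each classifier receives the input features plus the labels that come earlier in the chain, so later labels can depend on earlier ones. The sketch below illustrates this on synthetic data. Note that it swaps in scikit-learn's own `sklearn.multioutput.ClassifierChain` rather than the `skmultilearn` class used in this notebook (the two differ in API details), and all data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain

# Synthetic data: 200 samples, 5 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))

# Label 0 depends on feature 0; label 1 can only be set when label 0 is set
y0 = (X_toy[:, 0] > 0).astype(int)
y1 = ((X_toy[:, 1] > 0) & (y0 == 1)).astype(int)
Y_toy = np.column_stack([y0, y1])

# order=[0, 1]: the classifier for label 1 sees the features plus label 0
chain = ClassifierChain(
    RandomForestClassifier(n_estimators=20, random_state=22),
    order=[0, 1],
)
chain.fit(X_toy, Y_toy)
pred = chain.predict(X_toy[:5])
print(pred.shape)
```

The chain order matters when labels are correlated, since only labels earlier in the chain are available as extra inputs to later classifiers.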

#### Train & Predict

In [26]:
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)

In [27]:
predictions = clf.predict(X_dev)

#### Evaluate: All Labels

In [28]:
print("Precision - micro:", metrics.precision_score(y_dev, predictions, average="micro", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = unweighted mean of the per-label scores
print()
print("Recall - micro:", metrics.recall_score(y_dev, predictions, average="micro", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - micro:", metrics.f1_score(y_dev, predictions, average="micro", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # subset accuracy: fraction of samples whose full label set is predicted exactly
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of samples whose full label set is predicted exactly

Precision - micro: 0.995827747441389
Precision - macro: 0.780548135830234

Recall - micro: 0.9937351945026331
Recall - macro: 0.6674947560560105

F1 Score - micro: 0.9947803705348977
F1 Score - macro: 0.6854051119988238

Accuracy - normalized: 0.9927322816568823
Accuracy - unnormalized: 151347
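
The gap between the micro and macro scores above is typical when one label dominates. Micro averaging pools the counts over all label columns (here presumably including the `"O"` column created by the binarizer), while macro averaging gives each label equal weight. The toy sketch below, with made-up data, shows the effect.

```python
import numpy as np
from sklearn.metrics import f1_score

# Columns: [rare label, "O"]; 10 toy tokens, 2 of which carry the rare label
y_true = np.array([[1, 0]] * 2 + [[0, 1]] * 8)
# The toy model finds only one of the two rare tokens and calls the other "O"
y_pred = np.array([[1, 0]] + [[0, 1]] * 9)

micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(round(micro, 3), round(macro, 3))
```

Missing half of the rare label barely moves the micro score but pulls the macro score down noticeably, so the macro and per-label scores are the more informative ones for the rare labels.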


#### Evaluate: Each Label

In [29]:
pred_df = utils.makePredictionDF(predictions, dev_data, "label", "predicted_label", "O", mlb)
pred_df.head()

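|   | sentence_id | token_id | token | pos | tag | predicted_label |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | 154 | After | IN | [O] | O |
| 1 | 5 | 155 | his | PRP$ | [B-Gendered-Pronoun] | Gendered-Pronoun |
| 2 | 5 | 156 | ordination | NN | [O] | O |
| 3 | 5 | 157 | he | PRP | [B-Gendered-Pronoun] | Gendered-Pronoun |
| 4 | 5 | 158 | spent | VBD | [O] | O |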


In [30]:
exp_df = dev_data.explode(["label"])
exp_df = exp_df.rename(columns={"label":"expected_label"})
# exp_df.head()

In [31]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_label"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_label"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_label", "predicted_label", "_merge"],  # final column list
    "expected_label",
    "predicted_label", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

|   | sentence_id | token_id | token | pos | expected_label | predicted_label | _merge |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | 154 | After | IN | O | O | true negative |
| 1 | 5 | 155 | his | PRP$ | Gendered-Pronoun | Gendered-Pronoun | true positive |
| 2 | 5 | 156 | ordination | NN | O | O | true negative |
| 3 | 5 | 157 | he | PRP | Gendered-Pronoun | Gendered-Pronoun | true positive |
| 4 | 5 | 158 | spent | VBD | O | O | true negative |


Save the data:

In [32]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
# eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-rf_baseline_GloVe50_predictions.csv")
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-rf_linguistic_baseline_GloVe{}_predictions.csv".format(d))

##### Strict Agreement

In [37]:
agmt_stats = pd.DataFrame()

Calculate precision, recall, and F1 score at the token level for each label:

In [38]:
labels = [ 
#     'Feminine', 'Masculine', 'Unknown',                      # Person Name category of labels
    'Gendered-Pronoun','Gendered-Role', 'Generalization',   # Linguistic category of labels
#     'Occupation', 'Omission', 'Stereotype'                  # Contextual category of labels
]

In [39]:
for label in labels:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label], exp_col="expected_label", pred_col="predicted_label")
    label_agmt_stats = label_agmt_stats.rename(columns={"tag(s)":"label(s)"})
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

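|   | label(s) | false negative | false positive | true negative | true positive | precision | recall | f1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Gendered-Pronoun | 0 | 0 | 0 | 1374 | 1.0 | 1.0 | 1.0 |
| 0 | Gendered-Role | 1 | 0 | 0 | 112 | 1.0 | 0.99115 | 0.995556 |
| 0 | Generalization | 16 | 1 | 0 | 46 | 0.978723 | 0.741935 | 0.844037 |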
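
The scores in the table can be reproduced from the counts with the standard definitions of precision, recall, and F1. The sketch below assumes `utils.getScoresByTags` uses these definitions (its source is not shown here); the helper name `prf` is hypothetical, and the counts are those reported for Generalization.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts for Generalization from the table above
p, r, f = prf(tp=46, fp=1, fn=16)
print(round(p, 6), round(r, 6), round(f, 6))  # 0.978723 0.741935 0.844037
```

True negatives do not enter any of the three scores, so the zero true-negative counts in the table have no effect on them.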


Save the data:

In [40]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
# agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-rf_baseline_GloVe{}_strict_agmt.csv".format(d))
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-rf_linguistic_baseline_GloVe{}_strict_agmt.csv".format(d))