# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* Also could try a [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain)!
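
The classifier chain idea can be sketched on toy data as follows. This is a minimal illustration, assuming synthetic multilabel data from `make_multilabel_classification` rather than the token data; the variable names (`X_toy`, `Y_toy`, `chain`) are placeholders and not part of this notebook.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in for the vectorized tokens and binarized labels (assumption)
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=3, random_state=22
)

# Each link in the chain is trained on the features plus the labels
# that come earlier in the chain order
chain = ClassifierChain(
    LogisticRegression(solver="liblinear"), order="random", random_state=22
)
chain.fit(X_toy, Y_toy)

# At prediction time, earlier links' predictions are fed to later links
Y_pred = chain.predict(X_toy)
print(Y_pred.shape)
```

Unlike one-vs-rest, a chain can pick up dependencies between labels, at the cost of making the result depend on the label order.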

***

**Table of Contents**

[0.](#0) Preprocessing

[1.](#1) Logistic Regression (LR)

[2.](#2) Perceptron

[3.](#3) Stochastic Gradient Descent (SGD)

[4.](#4) Error Analysis

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis with Python for Social Scientists (Hovy, 2022)

Load necessary libraries:

In [27]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
# from nltk.corpus import stopwords
from nltk.tag import pos_tag

# For classification
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, Perceptron, SGDClassifier
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support, f1_score

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [3]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_train = df_train.loc[df_train.field != "Identifier"]
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
df_dev = df_dev.loc[df_dev.field != "Identifier"]
print(df_train.shape, df_dev.shape)
df_train.head()

(466922, 10) (157566, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. Because only one token has this label, keeping it also prevents the data from being passed to models that use cross-validation.

In [4]:
# # df_train.loc[df_train.tag == "B-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# # df_train.loc[df_train.tag == "I-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# df_dev.loc[df_dev.tag == "B-Nonbinary"]  # No results
# df_dev.loc[df_dev.tag == "I-Nonbinary"]  # No results
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(466920, 10)

If not classifying all labels at once, consider only the rows with tags for the select subset of labels:

In [44]:
# label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
# label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Nonbinary", "I-Nonbinary"]
# label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# df_train = df_train.loc[df_train.tag.isin(label_subset)]
# df_dev = df_dev.loc[df_dev.tag.isin(label_subset)]
# print(df_train.shape, df_dev.shape)

#### Normalization 
Lemmatize the tokens:

In [8]:
lmtzr = WordNetLemmatizer()

In [9]:
tokens_train = list(df_train.token)
lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
tokens_dev = list(df_dev.token)
lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [10]:
df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)

In [11]:
# df_train.tail()
df_dev.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,lemma,token_offsets,pos,tag,field,subset
172,3,5,99999,154,After,After,"(907, 912)",IN,O,Biographical / Historical,dev
173,3,5,14379,155,his,his,"(913, 916)",PRP$,B-Gendered-Pronoun,Biographical / Historical,dev
174,3,5,99999,156,ordination,ordination,"(917, 927)",NN,O,Biographical / Historical,dev
175,3,5,14380,157,he,he,"(928, 930)",PRP,B-Gendered-Pronoun,Biographical / Historical,dev
176,3,5,99999,158,spent,spent,"(931, 936)",VBD,O,Biographical / Historical,dev


Binarize and encode the data:

In [21]:
v = DictVectorizer(sparse=True)
mlb = MultiLabelBinarizer()
labels2numbers = LabelEncoder()

In [23]:
feature_cols = ["sentence_id", "token"]  #"lemma"
target_col = "tag"
labels = list(df_train.tag.unique())
numeric_labels = labels2numbers.fit_transform(labels)
print(numeric_labels)
# Map each string label to its numeric code and back
label_to_no = dict(zip(labels, list(numeric_labels)))
no_to_label = dict(zip(list(numeric_labels), labels))
print(label_to_no)

[18  8  4  7 17 16 13  5 14  1  6 15  3  2  0 12 11  9 10]
{'O': 18, 'B-Unknown': 8, 'B-Masculine': 4, 'B-Stereotype': 7, 'I-Unknown': 17, 'I-Stereotype': 16, 'I-Masculine': 13, 'B-Occupation': 5, 'I-Occupation': 14, 'B-Gendered-Pronoun': 1, 'B-Omission': 6, 'I-Omission': 15, 'B-Generalization': 3, 'B-Gendered-Role': 2, 'B-Feminine': 0, 'I-Generalization': 12, 'I-Gendered-Role': 11, 'I-Feminine': 9, 'I-Gendered-Pronoun': 10}


In [44]:
X_train = df_train[feature_cols]
# X_train = v.fit_transform(X_train.to_dict('records'))  # v.fit() - make sure column count same as X_dev!
# print(X_train.shape)

y_train = df_train[target_col].values
# Convert the string labels to numeric labels
y_train_numeric = utils.getNumericLabels(y_train, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_train_binarized = mlb.fit_transform(y_train_numeric)
print(y_train.shape, y_train_binarized.shape)

(466920,) (466920, 19)
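
The encoding and binarization steps can be illustrated on a handful of tags. This is a sketch under assumptions: `to_numeric_labels` is a hypothetical stand-in for `utils.getNumericLabels` (whose implementation is not shown here), and it is assumed to wrap each numeric label in a list so that `MultiLabelBinarizer` receives an iterable of iterables.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

toy_tags = ["O", "B-Unknown", "I-Unknown", "O", "B-Masculine"]

# LabelEncoder assigns numbers by the sorted order of the unique labels
encoder = LabelEncoder()
encoder.fit(toy_tags)
toy_label_to_no = {label: int(no) for no, label in enumerate(encoder.classes_)}

def to_numeric_labels(tags, mapping):
    # Hypothetical helper: one single-element list of label numbers per token
    return [[mapping[tag]] for tag in tags]

toy_numeric = to_numeric_labels(toy_tags, toy_label_to_no)

# One indicator column per label; each row has a single 1 in this toy case
toy_mlb = MultiLabelBinarizer()
toy_binarized = toy_mlb.fit_transform(toy_numeric)
print(toy_binarized.shape)
```

With one tag per token, every row of the indicator matrix contains exactly one 1, which is why the matrix above has as many columns as there are distinct tags.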


In [45]:
X_train.head()

Unnamed: 0,sentence_id,token
3,1,Title
4,1,:
5,1,Papers
6,1,of
7,1,The


In [51]:
X_dev = df_dev[feature_cols]
# X_dev = v.transform(X_dev.to_dict('records'))
# print(X_dev.shape)

y_dev = df_dev[target_col].values
# Convert the string labels to numeric labels
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_dev_binarized = mlb.transform(y_dev_numeric)
print(y_dev.shape, y_dev_binarized.shape)

(157566,) (157566, 19)


In [52]:
assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."

#### Word Embeddings

In [63]:
df = pd.concat([df_train, df_dev])  # df_train

In [64]:
unique_tokens = list(set(list(df.token)))
unique_lemmas = list(set(list(df.lemma))) 
unique_lemmas = [lemma for lemma in unique_lemmas if lemma.isalpha()]
lemmas_lower = [lemma.lower() for lemma in unique_lemmas]
unique_lemmas_lower = list(set(lemmas_lower))
unique_words = [token for token in unique_tokens if token.isalpha()]  # keep tokens with only alphabetic characters
print(len(unique_words), len(unique_lemmas), len(unique_lemmas_lower))

32955 31155 26755


Load the [GloVe word embeddings](https://github.com/stanfordnlp/GloVe), which were trained on 2014 English Wikipedia entries and Gigaword 5:

*Note: could also try [GN-GloVe](https://github.com/uclanlp/gn_glove), which supposedly has gender-neutral word embeddings*

In [60]:
# Reference: https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db
dimensions = ["50", "100", "200", "300"]  # pretrained GloVe embeddings come as vectors with one of these four dimensions
d = dimensions[0]  # start small to begin with
glove_path = config.inf_data_path+"glove.6B/glove.6B.{}d.txt".format(d)

In [61]:
glove = dict()
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        glove[word] = vector
print(glove["recipient"])

[-0.13412    1.5041     0.39706   -0.41357    0.71336    0.63438
  0.1214     0.16746    1.4104     0.013127   0.68386    0.85236
  0.92599    0.098158   0.425     -0.83799    0.08482    0.22135
 -0.15247    0.72654    0.052728  -1.0064    -0.032626  -0.63342
  0.0039789 -1.5288    -0.74005   -1.2051    -1.0227    -0.10588
  1.2225     0.14845   -0.1641    -0.52887   -0.29012    0.59774
  0.62847    0.49003   -0.14227   -1.2193     0.56094   -0.17673
 -0.11216   -0.41801   -0.40841   -0.41748   -0.40276    0.25091
 -0.43016    0.26412  ]


Calculate GloVe's coverage of our vocabulary: 

In [65]:
not_in_glove = []
in_glove = dict()
for word in unique_words:
    lowercased = word.lower()
    try:
        vector = glove[lowercased]
        in_glove[word] = vector
    except KeyError:
        not_in_glove += [word]

print("Total words in vocabulary not in GloVe:", len(not_in_glove))
print("Proportion of vocabulary not in GloVe:",(len(not_in_glove))/len(unique_words))

Total words in vocabulary not in GloVe: 6453
Proportion of vocabulary not in GloVe: 0.19581247155211653


In [66]:
still_not_found = []
newly_found = 0
# partial_glove_match = dict()
for word in not_in_glove:
    found = re.findall("[A-Z]{0,1}[a-z]+", word)
    for f in found:
        lowercased = f.lower()
        try:
            vector = glove[lowercased]
            in_glove[word] = vector  # partial_glove_match[word] = vector
            newly_found += 1
        except KeyError:
            still_not_found += [word]
print("Not possible to find in GloVe:", len(still_not_found))
print("Proportion not found but possible to find in GloVe:", newly_found/(len(not_in_glove)))
print("Proportion of vocabulary not possible to find in GloVe:",(len(still_not_found))/len(unique_words))

Not possible to find in GloVe: 4187
Proportion not found but possible to find in GloVe: 0.7311328064466139
Proportion of vocabulary not possible to find in GloVe: 0.1270520406615081


GloVe has much better coverage than sense2vec, as expected due to the better domain match (Wikipedia entries are more similar to archival metadata descriptions than Reddit comments!).

In [67]:
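words_to_vectors = in_glove.copy()
for word in unique_words:
    # Test membership against the dict itself: np.array(dict.keys()) wraps the
    # whole keys view in a 0-d object array, so `in` would never match a word
    if word not in words_to_vectors:
        # Fall back to the lowercased entry if present, else an empty vector
        vector = words_to_vectors.get(word.lower(), np.array([]))
        words_to_vectors[word] = vector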

In [68]:
assert len(words_to_vectors) == len(unique_words)
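
One possible way to turn such a word-to-vector mapping into model features is sketched below. It is an illustration only: `embed_tokens`, `toy_vectors`, and the 4-dimensional vectors are made up for the example, and replacing missing or empty vectors with zeros is an assumption rather than a choice made in this notebook.

```python
import numpy as np

toy_dim = 4
toy_vectors = {
    "he": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "papers": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}

def embed_tokens(tokens, vectors, dim):
    """Stack one vector per token, using zeros when no embedding is available."""
    rows = []
    for token in tokens:
        # Try the token as written, then its lowercased form
        vector = vectors.get(token, vectors.get(token.lower()))
        if vector is None or vector.size == 0:
            vector = np.zeros(dim, dtype="float32")
        rows.append(vector)
    return np.vstack(rows)

toy_matrix = embed_tokens(["He", "papers", "zzz"], toy_vectors, toy_dim)
print(toy_matrix.shape)
```

A fixed-width fallback matters because empty arrays cannot be stacked with the 50-dimensional vectors into a single feature matrix.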

#### Pipeline

In [99]:
# # class PosExtractor(BaseEstimator, TransformerMixin):
# #     def __init__(self):
# #         self
    
# #     def fit(self, X, y=None):
# #         return self
    
# #     def transform(self, X):
# #         X_pos = X["pos"]
# #         return X_pos

# # class LemmaExtractor(BaseEstimator, TransformerMixin):
# #     def __init__(self):
# #         self
    
# #     def fit(self, X, y=None):
# #         return self
    
# #     def transform(self, X):
# #         X_lemma = X["lemma"]
# #         return X_lemma

# # class TokenExtractor(BaseEstimator, TransformerMixin):
# #     def __init__(self):
# #         self
    
# #     def fit(self, X, y=None):
# #         return self
    
# #     def transform(self, X):
# #         X_token = X["token"]
# #         return X_token
    
# # class SentenceExtractor(BaseEstimator, TransformerMixin):
# #     def __init__(self):
# #         self
    
# #     def fit(self, X, y=None):
# #         return self
    
# #     def transform(self, X):
# #         X_sentence = X["sentence_id"]
# #         return X_sentence

# class FeatureExtractor(BaseEstimator, TransformerMixin):
#     def fit(self, X, y=None):
#         X = DictVectorizer().fit(X.to_dict('records'))
#         return X
    
# #     def pos_func(self, pos):
# #         return pos
    
#     def transform(self, X):
# #         X_tagged = X.apply(self.pos_func).apply(pd.Series)
#         X_vectorized = DictVectorizer().transform(X.to_dict('records'))
#         return X_vectorized

In [100]:
# # pipe = Pipeline([
# #     ("union", FeatureUnion([
# #         ("pos_features", Pipeline([
# #             ("pos", PosExtractor()),
# #         ])),
# #         ("token_features", Pipeline([
# #             ("token", TokenExtractor()),
# #         ])),
# #         ("sentence_features", Pipeline([
# #             ("sentence", SentenceExtractor()),
# #         ])),
# #     ])),
# #     ("algorithm", OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22)))
# # ])
# pipe = Pipeline([
#     ("selector", FeatureExtractor()),
#     ("algorithm", OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22)))
# ])

In [104]:
X_train = df_train[["sentence_id", "token", "pos"]]
X_train.head(1)

Unnamed: 0,sentence_id,token,pos
3,1,Title,NN


In [111]:
type(list(X_train.sentence_id)[0])

int

In [112]:
binarizer = ColumnTransformer([
    ('binarizer', MultiLabelBinarizer(), "sentence_id")
    ])

log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

pipe = Pipeline([("binarizer", binarizer), ("classifier", log_reg)])

# feature_union = FeatureUnion([""])

In [113]:
log_reg = pipe.fit(X_train["sentence_id"], y_train_binarized)

IndexError: tuple index out of range

In [59]:
predicted_dev = log_reg.predict(X_dev)

ValueError: X has 19376 features per sample; expecting 35754
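
A feature-count mismatch like the one above typically appears when the vectorizer is fit separately on the train and dev data. A minimal sketch of one way to avoid it is to put the vectorizer inside the pipeline, so it is fit on the training data only and reused at prediction time. The toy DataFrames and the `to_records` helper are assumptions made for illustration, not this notebook's data or code.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer

toy_train = pd.DataFrame({
    "token": ["his", "ordination", "he", "spent", "she", "papers"],
    "pos": ["PRP$", "NN", "PRP", "VBD", "PRP", "NNS"],
    "tag": ["B-Gendered-Pronoun", "O", "B-Gendered-Pronoun", "O",
            "B-Gendered-Pronoun", "O"],
})
toy_dev = pd.DataFrame({"token": ["he", "letters"], "pos": ["PRP", "NNS"]})

def to_records(df):
    # DictVectorizer expects a list of {feature: value} dicts
    return df.to_dict("records")

toy_pipe = Pipeline([
    ("to_records", FunctionTransformer(to_records)),
    ("vectorizer", DictVectorizer(sparse=True)),  # one-hot encodes string values
    ("classifier", OneVsRestClassifier(
        LogisticRegression(solver="liblinear", random_state=22))),
])

# Targets are binarized outside the pipeline; pipeline steps only transform X
toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform([[tag] for tag in toy_train["tag"]])

toy_pipe.fit(toy_train[["token", "pos"]], toy_y)
toy_pred = toy_pipe.predict(toy_dev)  # unseen dev tokens are ignored, not added
print(toy_pred.shape)
```

Because the fitted `DictVectorizer` is reused for the dev data, the dev matrix always has the same columns as the training matrix.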

In [None]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

<a id="1"></a>
## 1. Logistic Regression

In [64]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [71]:
clf = log_reg.fit(X_train, y_train_binarized)

In [72]:
predicted_dev = clf.predict(X_dev)
print(predicted_dev[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


In [None]:
# print("Contextual Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 33%
# print("Person Name Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 40%
# print("Linguistic Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 76%
# print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev))  # 90%
# print("Dev Accuracy (all labels) on `lemma` col:", np.mean(predicted_dev == y_dev))  # 90%
# print("Scaled Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))  # 99%

#### Performance

In [74]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,B-Feminine,158479,357,0,0,0.0,0.0,0.0
1,B-Gendered-Pronoun,158003,833,0,0,0.0,0.0,0.0
2,B-Gendered-Role,158069,767,0,0,0.0,0.0,0.0
3,B-Generalization,158450,386,0,0,0.0,0.0,0.0
4,B-Masculine,157711,1125,0,0,0.0,0.0,0.0
5,B-Occupation,157970,866,0,0,0.0,0.0,0.0
6,B-Omission,157360,1476,0,0,0.0,0.0,0.0
7,B-Stereotype,158309,527,0,0,0.0,0.0,0.0
8,B-Unknown,154587,4249,0,0,0.0,0.0,0.0
9,I-Feminine,157887,949,0,0,0.0,0.0,0.0


In [79]:
print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))

Dev Accuracy (all labels) on `token` col: 0.9820297930603031
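
The elementwise accuracy above is high even though the listed labels have no true positives, because most cells of the indicator matrix are 0 and most tokens are tagged `O`. The toy example below is a sketch of that effect with made-up data: a model that predicts `O` for every token still scores well elementwise, while exact-match accuracy and macro F1 are much lower.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy indicator matrix: 6 tokens, 3 labels; the last column plays the role of "O"
y_true = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])
# A model that predicts "O" for every token
y_pred = np.tile([0, 0, 1], (6, 1))

# Fraction of matching cells, as computed with np.mean(pred == true)
elementwise = np.mean(y_pred == y_true)
# Fraction of tokens whose whole label row is correct
exact_match = accuracy_score(y_true, y_pred)
# Unweighted mean of the per-label F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(elementwise, exact_match, macro_f1)
```

For imbalanced token labels like these, per-label precision, recall, and F1 are likely to be more informative than either accuracy figure.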


Try using cross-validation (stratified k fold, where k=3) with Logistic Regression:

In [76]:
k = 3 # number of folds

In [88]:
log_reg_cv = OneVsRestClassifier(LogisticRegressionCV(
    solver="liblinear", multi_class="ovr", cv=k, scoring="f1", random_state=22)  #max_iter=500, --> default is 100 iterations
                                )
clf1 = log_reg_cv.fit(X_train, y_train_binarized)
pred1_dev = clf1.predict(X_dev)
# print("Dev Accuracy (all labels) on `lemma` col:", np.mean(pred1_dev == y_dev))  # 90%
# print("Dev Accuracy (all labels) on `token` col:", np.mean(pred1_dev == y_dev))  # 90%
# from sklearn import metrics
# metrics.SCORERS.keys()
# print("Accuracy:", clf1.score(pred1_dev, y_dev_binarized))

In [85]:
original_labels = mlb.classes_
dev_matrix1 = multilabel_confusion_matrix(y_dev_binarized, pred1_dev, labels=mlb.classes_)
df_dev_perf1 = utils.getPerformanceMetrics(y_dev_binarized, pred1_dev, dev_matrix1, mlb.classes_, original_labels, no_to_label)
df_dev_perf1

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,B-Feminine,158479,357,0,0,0.0,0.0,0.0
1,B-Gendered-Pronoun,158003,833,0,0,0.0,0.0,0.0
2,B-Gendered-Role,158069,767,0,0,0.0,0.0,0.0
3,B-Generalization,158450,386,0,0,0.0,0.0,0.0
4,B-Masculine,157711,1125,0,0,0.0,0.0,0.0
5,B-Occupation,157970,866,0,0,0.0,0.0,0.0
6,B-Omission,157360,1476,0,0,0.0,0.0,0.0
7,B-Stereotype,158309,527,0,0,0.0,0.0,0.0
8,B-Unknown,154587,4249,0,0,0.0,0.0,0.0
9,I-Feminine,157887,949,0,0,0.0,0.0,0.0


**QUESTION:** Are scores averaged across the 3 folds?
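
As far as the scikit-learn documentation describes it, `LogisticRegressionCV` with the default `refit=True` averages the scores across folds to choose the regularization strength `C` and then refits on all the training data, so it selects a hyperparameter rather than reporting an averaged performance estimate. To look at per-fold scores directly, one option is `cross_val_score`, sketched here on synthetic data (the data and variable names are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem standing in for a single label column
X_toy, y_toy = make_classification(n_samples=150, n_features=10, random_state=22)

folds = StratifiedKFold(n_splits=3)
fold_scores = cross_val_score(
    LogisticRegression(solver="liblinear", random_state=22),
    X_toy, y_toy, cv=folds, scoring="f1",
)

print(fold_scores)         # one F1 score per fold
print(fold_scores.mean())  # the average has to be taken explicitly
```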

<a id="2"></a>
## 2. Perceptron

In [37]:
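per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
# Note: the third positional argument of fit() is coef_init, not the class list,
# so passing `labels` here triggers the ValueError below; per.fit(X_train, y_train)
# avoids it (classes are only passed to partial_fit)
clf2 = per.fit(X_train, y_train, labels)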

ValueError: Provided ``coef_`` does not match dataset. 

In [20]:
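# Exclude the "O" (outside) tag from the report; remove it by value, because
# pop() drops whichever label happens to be last in the list
labels_noO = labels.copy()
labels_noO.remove("O")
print(classification_report(y_pred=per.predict(X_dev), y_true=y_dev, labels=labels_noO, zero_division=0))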

                    precision    recall  f1-score   support

        B-Feminine       0.00      0.00      0.00       357
B-Gendered-Pronoun       0.00      0.00      0.00       833
   B-Gendered-Role       0.00      0.00      0.00       767
  B-Generalization       0.00      0.00      0.00       386
       B-Masculine       0.00      0.00      0.00      1125
      B-Occupation       0.00      0.00      0.00       866
        B-Omission       0.00      0.00      0.00      1476
      B-Stereotype       0.00      0.00      0.00       527
         B-Unknown       0.00      0.00      0.00      4249
        I-Feminine       0.00      0.00      0.00       949
I-Gendered-Pronoun       0.00      0.00      0.00        18
   I-Gendered-Role       0.00      0.00      0.00       158
  I-Generalization       0.00      0.00      0.00       165
       I-Masculine       0.00      0.00      0.00      1568
      I-Occupation       0.00      0.00      0.00       953
        I-Omission       0.00      0.00

<a id="3"></a>
## 3. Stochastic Gradient Descent

In [42]:
sgd = SGDClassifier()
clf3 = sgd.partial_fit(X_train, y_train, labels)

In [43]:
print(classification_report(y_pred=clf3.predict(X_dev), y_true=y_dev, labels=labels))

                    precision    recall  f1-score   support

        B-Feminine       0.00      0.00      0.00       357
B-Gendered-Pronoun       0.00      0.00      0.00       833
   B-Gendered-Role       0.00      0.00      0.00       767
  B-Generalization       0.00      0.00      0.00       386
       B-Masculine       0.00      0.00      0.00      1125
      B-Occupation       0.00      0.00      0.00       866
        B-Omission       0.00      0.00      0.00      1476
      B-Stereotype       0.00      0.00      0.00       527
         B-Unknown       0.00      0.00      0.00      4249
        I-Feminine       0.00      0.00      0.00       949
I-Gendered-Pronoun       0.00      0.00      0.00        18
   I-Gendered-Role       0.00      0.00      0.00       158
  I-Generalization       0.00      0.00      0.00       165
       I-Masculine       0.00      0.00      0.00      1568
      I-Occupation       0.00      0.00      0.00       953
        I-Omission       0.00      0.00