# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* A [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain) is another approach worth trying.
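
As a rough illustration of the classifier chain idea, the sketch below fits scikit-learn's `ClassifierChain` on a tiny made-up dataset. The arrays `X_toy` and `Y_toy`, the chain order, and the choice of base estimator are assumptions for demonstration only, not the notebook's data or final setup. Each model in the chain receives the original features plus the predictions for the labels earlier in the chain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Toy data (assumption): 6 samples, 3 binary features, 2 binary labels
X_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
Y_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [0, 0], [1, 0]])

# Label 0 is predicted first; its prediction becomes an extra feature for label 1
chain = ClassifierChain(LogisticRegression(solver="liblinear"), order=[0, 1], random_state=22)
chain.fit(X_toy, Y_toy)

Y_pred = chain.predict(X_toy)
print(Y_pred.shape)  # one column per label
```

With the 19 tags used later in this notebook, the chain order would matter more, so trying several orders may be worthwhile.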

***

**Table of Contents**

[0.](#0) Preprocessing

[1.](#1) Logistic Regression (LR)

[2.](#2) Perceptron

[3.](#3) Stochastic Gradient Descent (SGD)

[4.](#4) Support Vector Machines (SVM) (a.k.a. support vector classification (SVC))

[5.](#5) Error Analysis

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis in Python for Social Scientists (Hovy, 2022)

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer
import spacy

# For classification
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression #, LogisticRegressionCV, SGDClassifier, Perceptron
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support, f1_score

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


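| | description_id | sentence_id | ann_id | token_id | token | token_offsets | pos | tag | field | subset |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1 | 99999 | 3 | Title | (17, 22) | NN | O | Title | train |
| 4 | 1 | 1 | 99999 | 4 | : | (22, 23) | : | O | Title | train |
| 5 | 1 | 1 | 99999 | 5 | Papers | (24, 30) | NNS | O | Title | train |
| 6 | 1 | 1 | 99999 | 6 | of | (31, 33) | IN | O | Title | train |
| 7 | 1 | 1 | 14384 | 7 | The | (34, 37) | DT | B-Unknown | Title | train |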


Remove the Non-binary labels. They were applied by mistake, were identified early on, and were meant to be excluded. Because only one token carries this label, keeping it would also prevent the data from being passed to models that use cross-validation.

In [3]:
# # df_train.loc[df_train.tag == "B-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# # df_train.loc[df_train.tag == "I-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# df_dev.loc[df_dev.tag == "B-Nonbinary"]  # No results
# df_dev.loc[df_dev.tag == "I-Nonbinary"]  # No results
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [4]:
df_train.shape

(467562, 10)
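
The cross-validation constraint mentioned above can be checked directly: stratified k-fold splitting cannot place a class in every fold when that class has fewer than k members. The sketch below is a minimal, standard-library check; the helper name `labels_too_rare_for_cv` and the toy tag list are hypothetical and not part of `utils`.

```python
from collections import Counter

def labels_too_rare_for_cv(tags, k):
    """Return the tags that occur fewer than k times, sorted alphabetically."""
    counts = Counter(tags)
    return sorted(tag for tag, n in counts.items() if n < k)

# Toy tag column (assumption): the Non-binary tags each occur only once
toy_tags = ["O", "O", "O", "B-Unknown", "B-Unknown", "B-Unknown",
            "O", "B-Nonbinary", "I-Nonbinary"]

rare = labels_too_rare_for_cv(toy_tags, k=3)
print(rare)
```

On the real data, the same check could be run as `labels_too_rare_for_cv(df_train.tag, k=3)` before choosing the number of folds.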

If not classifying all labels at once, consider only the rows with tags for the select subset of labels:

In [5]:
# label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
# label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Nonbinary", "I-Nonbinary"]
# label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# df_train = df_train.loc[df_train.tag.isin(label_subset)]
# df_dev = df_dev.loc[df_dev.tag.isin(label_subset)]
# print(df_train.shape, df_dev.shape)
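
Rather than typing each `B-`/`I-` pair by hand, the tag lists could be generated from the category names. This is a small sketch with a hypothetical helper, `iob_tags`, that is not part of the notebook's `utils` module.

```python
def iob_tags(categories):
    """Build the B- and I- tag for each category name, keeping the category order."""
    return [f"{prefix}-{category}" for category in categories for prefix in ("B", "I")]

contextual = iob_tags(["Stereotype", "Omission", "Occupation"])
print(contextual)
```

A list built this way contains no `O` tag, so filtering with `df_train.tag.isin(...)` would keep only annotated tokens and drop every unlabeled one.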

Optionally, lemmatize the tokens (a form of normalization, a.k.a. standardization):

In [6]:
lmtzr = WordNetLemmatizer()

In [7]:
tokens_train = list(df_train.token)
lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
tokens_dev = list(df_dev.token)
lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [8]:
df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)

In [9]:
# df_train.tail()
df_dev.head()

| | description_id | sentence_id | ann_id | token_id | token | lemma | token_offsets | pos | tag | field | subset |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 172 | 3 | 5 | 99999 | 154 | After | After | (907, 912) | IN | O | Biographical / Historical | dev |
| 173 | 3 | 5 | 14379 | 155 | his | his | (913, 916) | PRP$ | B-Gendered-Pronoun | Biographical / Historical | dev |
| 174 | 3 | 5 | 99999 | 156 | ordination | ordination | (917, 927) | NN | O | Biographical / Historical | dev |
| 175 | 3 | 5 | 14380 | 157 | he | he | (928, 930) | PRP | B-Gendered-Pronoun | Biographical / Historical | dev |
| 176 | 3 | 5 | 99999 | 158 | spent | spent | (931, 936) | VBD | O | Biographical / Historical | dev |
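
`WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to `"n"` (noun), which is why a verb such as `spent` is left unchanged in the output above. Since the data already has a Penn Treebank `pos` column, one option is to map those tags to WordNet's part-of-speech codes. The helper below, `penn_to_wordnet_pos`, is a hypothetical sketch and does not import NLTK itself.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code: 'a', 'v', 'r', or 'n'."""
    if penn_tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if penn_tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if penn_tag.startswith("RB"):   # RB, RBR, RBS (but not RP, the particle tag)
        return "r"
    return "n"                      # fall back to the lemmatizer's own default

for tag in ["VBD", "NN", "PRP$", "JJ", "RB", "RP"]:
    print(tag, "->", penn_to_wordnet_pos(tag))

# Possible use with the lemmatizer defined above (not run here):
# lemmas_train = [lmtzr.lemmatize(tok, pos=penn_to_wordnet_pos(tag))
#                 for tok, tag in zip(df_train.token, df_train.pos)]
```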


<a id="1"></a>
## 1. Logistic Regression

Binarize and encode the data:

In [10]:
v = DictVectorizer(sparse=True)
mlb = MultiLabelBinarizer()
labels2numbers = LabelEncoder()

In [11]:
feature_cols = ["sentence_id", "token", "lemma"]
target_col = "tag"
labels = list(np.unique(df_train.tag))
y = labels2numbers.fit_transform(labels)
label_to_no = dict(zip(labels,list(y)))
no_to_label = dict(zip(list(y),labels))
print(label_to_no)

{'B-Feminine': 0, 'B-Gendered-Pronoun': 1, 'B-Gendered-Role': 2, 'B-Generalization': 3, 'B-Masculine': 4, 'B-Occupation': 5, 'B-Omission': 6, 'B-Stereotype': 7, 'B-Unknown': 8, 'I-Feminine': 9, 'I-Gendered-Pronoun': 10, 'I-Gendered-Role': 11, 'I-Generalization': 12, 'I-Masculine': 13, 'I-Occupation': 14, 'I-Omission': 15, 'I-Stereotype': 16, 'I-Unknown': 17, 'O': 18}


In [12]:
X_train = df_train[feature_cols]
X_train = v.fit_transform(X_train.to_dict('records'))  # v.fit() - make sure column count same as X_dev!
print(X_train.shape)

y_train = df_train[target_col].values
# Convert the string labels to numeric labels
y_train_numeric = utils.getNumericLabels(y_train, label_to_no)
# Binarize the numeric labels into a multilabel indicator matrix (one column per label)
y_train_binarized = mlb.fit_transform(y_train_numeric)
print(y_train.shape, y_train_binarized.shape)

(467562, 70367)
(467562,) (467562, 19)


In [13]:
X_dev = df_dev[feature_cols]
# X_dev = df_dev.drop("tag", axis=1)
X_dev = v.transform(X_dev.to_dict('records'))
print(X_dev.shape)

y_dev = df_dev[target_col].values
# Convert the string labels to numeric labels
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)
# Binarize the numeric labels with the binarizer already fitted on the training data
y_dev_binarized = mlb.transform(y_dev_numeric)
print(y_dev.shape, y_dev_binarized.shape)

(157740, 70367)
(157740,) (157740, 19)


In [14]:
assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."
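
To see why the column counts match, the sketch below runs `DictVectorizer` on a few made-up records (the records are assumptions, loosely modeled on the rows shown earlier). String-valued features are one-hot encoded as `name=value` columns, numeric features keep a single column, and feature values seen only at `transform` time are silently ignored.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records (assumption) with the same feature columns as above
train_records = [
    {"sentence_id": 1, "token": "Papers", "lemma": "Papers"},
    {"sentence_id": 1, "token": "of", "lemma": "of"},
]
dev_records = [
    {"sentence_id": 5, "token": "his", "lemma": "his"},  # token and lemma unseen in train
    {"sentence_id": 5, "token": "of", "lemma": "of"},
]

vec = DictVectorizer(sparse=True)
X_tr = vec.fit_transform(train_records)  # learns the feature columns
X_dv = vec.transform(dev_records)        # reuses them; unseen values get no column

print(vec.feature_names_)
print(X_tr.shape, X_dv.shape)
print(X_dv.toarray())
```

In this sketch the first dev row is all zeros apart from the numeric `sentence_id` column, so a token absent from the training vocabulary would contribute no lexical signal to the model.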

#### Train the Model

In [15]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [16]:
clf = log_reg.fit(X_train, y_train_binarized)

#### Predict

In [17]:
predicted_dev = clf.predict(X_dev)
print(predicted_dev[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


#### Evaluate Model Performance

In [18]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

| | labels | true_neg | false_neg | true_pos | false_pos | precision | recall | f_1 |
|---|---|---|---|---|---|---|---|---|
| 0 | B-Feminine | 157417 | 323 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 1 | B-Gendered-Pronoun | 156996 | 744 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 2 | B-Gendered-Role | 157150 | 590 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | B-Generalization | 157495 | 245 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 4 | B-Masculine | 156716 | 1024 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 5 | B-Occupation | 157085 | 655 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 6 | B-Omission | 156658 | 1082 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 7 | B-Stereotype | 157481 | 259 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 8 | B-Unknown | 155680 | 2060 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 9 | I-Feminine | 156894 | 846 | 0 | 0 | 0.0 | 0.0 | 0.0 |
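
For reference, the precision, recall, and F1 columns follow from the confusion counts. The sketch below uses the standard definitions and returns 0.0 when a denominator is zero. That zero-division convention is an assumption about `utils.getPerformanceMetrics`, whose code is not shown here, although it is consistent with the 0.0 values in the table.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from counts, using 0.0 when undefined."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f_1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_1

# B-Feminine row from the table above: no positive predictions at all
print(precision_recall_f1(true_pos=0, false_pos=0, false_neg=323))

# Made-up counts (assumption) to show a non-degenerate case
print(precision_recall_f1(true_pos=30, false_pos=10, false_neg=20))
```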


In [19]:
print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))

Dev Accuracy (all labels) on `token` col: 0.9890666186195806
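
This accuracy is computed over every cell of the indicator matrix, so it can be high even when no gendered label is ever predicted. The toy arrays below are assumptions chosen to mimic that situation on a small scale: the model always predicts the last column (standing in for `O`).

```python
import numpy as np

# Columns (assumption): label A, label B, "O"
y_true_toy = np.array([[0, 0, 1],
                       [0, 0, 1],
                       [1, 0, 0],
                       [0, 0, 1],
                       [0, 1, 0]])
y_pred_toy = np.tile([0, 0, 1], (5, 1))  # always predict "O"

elementwise_acc = np.mean(y_pred_toy == y_true_toy)        # same formula as the cell above
exact_match = (y_pred_toy == y_true_toy).all(axis=1).mean()  # rows predicted entirely right
recall_label_a = (y_pred_toy[:, 0] & y_true_toy[:, 0]).sum() / y_true_toy[:, 0].sum()

print(elementwise_acc, exact_match, recall_label_a)
```

With 19 label columns and mostly `O` tokens, the element-wise figure is diluted even further, so the per-label precision and recall are the more informative numbers.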


Try using cross-validation (stratified k-fold, with k=3) with Logistic Regression:

In [20]:
# k = 3 # number of folds

In [21]:
# log_reg_cv = OneVsRestClassifier(LogisticRegressionCV(
#     solver="liblinear", multi_class="ovr", cv=k, scoring="f1", random_state=22)  #max_iter=500, --> default is 100 iterations
#                                 )
# clf1 = log_reg_cv.fit(X_train, y_train_binarized)
# pred1_dev = clf1.predict(X_dev)
# # print("Dev Accuracy (all labels) on `lemma` col:", np.mean(pred1_dev == y_dev))  # 90%
# # print("Dev Accuracy (all labels) on `token` col:", np.mean(pred1_dev == y_dev))  # 90%
# # from sklearn import metrics
# # metrics.SCORERS.keys()
# # print("Accuracy:", clf1.score(pred1_dev, y_dev_binarized))

In [22]:
# original_labels = mlb.classes_
# dev_matrix1 = multilabel_confusion_matrix(y_dev_binarized, pred1_dev, labels=mlb.classes_)
# df_dev_perf1 = utils.getPerformanceMetrics(y_dev_binarized, pred1_dev, dev_matrix1, mlb.classes_, original_labels, no_to_label)
# df_dev_perf1

**QUESTION:** Are scores averaged across the 3 folds?
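
One way to inspect per-fold scores explicitly, instead of relying on what `LogisticRegressionCV` does internally, is `cross_val_score` with a `StratifiedKFold` splitter. The sketch below uses a synthetic binary dataset from `make_classification` (an assumption for illustration, not the token data) and averages the fold scores by hand.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem (assumption), standing in for one label column
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=22)

folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=22)
fold_scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X_toy, y_toy, cv=folds, scoring="f1"
)

print(fold_scores)         # one F1 score per fold
print(fold_scores.mean())  # the average is computed explicitly here
```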

<a id="1.1"></a>
## 1.1. With Word Embeddings

Obtain the vocabulary of the annotated data:

In [24]:
df = df_train

In [35]:
unique_tokens = list(set(list(df.token)))
unique_words = [token for token in unique_tokens if token.isalpha()]  # keep tokens with only alphabetic characters
print(len(unique_words))

28545


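Load the sense2vec word embeddings as a spaCy pipeline component. These vectors are keyed by word sense (a word or phrase together with its part-of-speech or entity tag) and were trained on 2015 Reddit comments: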

In [36]:
nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk(config.s2v_reddit_path)

#-------------
# doc = nlp("A sentence about natural language processing.")
# assert doc[3:6].text == "natural language processing"
# freq = doc[3:6]._.s2v_freq
# vector = doc[3:6]._.s2v_vec
# most_similar = doc[3:6]._.s2v_most_similar(3)
# # [(('machine learning', 'NOUN'), 0.8986967),
# #  (('computer vision', 'NOUN'), 0.8636297),
# #  (('deep learning', 'NOUN'), 0.8573361)]

<sense2vec.component.Sense2VecComponent at 0x7f87677c5d30>

In [38]:
print(unique_words[:20])

['Cliffs', 'rpm', 'Heal', 'hereditary', 'Medjid', 'presumedly', 'Mondays', 'Routine', 'recipientESPMedawar', 'Kirkuk', 'Seton', 'Venado', 'Edith', 'Mackay', 'Visiting', 'interwar', 'atherogenic', 'Cawdor', 'jockey', 'Burgesses']


In [75]:
not_in_s2v = []
for word in unique_words:
    # Look up the lowercased form; the first token's vector is None if it is not in Sense2Vec
    w = nlp(word.lower())[0]
    if w._.s2v_vec is None:
        not_in_s2v.append(word)

print("Total words in vocabulary not in Sense2Vec:", len(not_in_s2v))
print("Proportion of vocabulary not in Sense2Vec:",(len(not_in_s2v))/len(unique_words))

Total words in vocabulary not in Sense2Vec: 11529
Proportion of vocabulary not in Sense2Vec: 0.40388859695218077


In [61]:
# print(not_in_s2v[:100])
# print(not_in_s2v[1000:1100])
print(not_in_s2v[-100:])

['recipient', 'squabs', 'Darby', 'poulterer', 'Scotsman', 'sepolero', 'Morphogenetic', 'Hynes', 'Repleta', 'Simal', 'Mai', 'Arithmetic', 'Duce', 'Mme', 'lectureKatchalsky', 'inA', 'Tulsk', 'Berg', 'Mode', 'Bohme', 'Envelope', 'furnitureKoestler', 'Burmester', 'Evang', 'Ilona', 'compagne', 'BarnArthur', 'Cant', 'GollyArthur', 'Hatano', 'ElectionsAccompanied', 'accomodement', 'Finney', 'poetarum', 'Realites', 'Margaropus', 'LI', 'Lennox', 'Basberg', 'Ignacio', 'Glennie', 'Duffus', 'Takagi', 'Aliza', 'emir', 'Hersham', 'Rossini', 'Bald', 'Magnus', 'Pattison', 'Skefhill', 'Sorrel', 'Landolphin', 'Staub', 'ME', 'Alumbadi', 'Model', 'Majesties', 'Harian', 'trichocysts', 'Wandervogel', 'Rumped', 'Waetjen', 'Verbena', 'ofThe', 'sturzte', 'Lee', 'Copernicus', 'Verasis', 'Altenberg', 'unnumbered', 'environs', 'Neutral', 'Wynne', 'Mossman', 'ReneeESPCutten', 'ReadingSent', 'sequelae', 'Calary', 'Soviets', 'AustriaKoestler', 'Karlsburg', 'Peckham', 'Macdougall', 'Goldschmidt', 'electroencephalogra

In [74]:
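# Examples of fused or noisy tokens:
# x = "lettersBaillie"
# y = "writer'"
# z = "MunsterSent"
newly_found = 0  # counts recoverable fragments, so one word can contribute more than once
for word in not_in_s2v:
    # Split at capitalization boundaries, e.g. "lettersBaillie" -> ["letters", "Baillie"]
    found = re.findall(r"[A-Z]?[a-z]+", word)
    for f in found:
        lowercased = f.lower()
        w = nlp(lowercased)[0]
        if w._.s2v_vec is not None:
            newly_found += 1
print(newly_found)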

2659
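
Because the loop above counts fragments, a fused token whose two halves are both in the vocabulary is counted twice. The sketch below separates the two counts. The set `known` is a hypothetical stand-in for the Sense2Vec lookup, and `count_recoverable` is an illustrative helper, not part of `utils`.

```python
import re

FRAGMENT_RE = re.compile(r"[A-Z]?[a-z]+")

def split_fused(word):
    """Split a token at capitalization boundaries and drop non-letter characters."""
    return FRAGMENT_RE.findall(word)

def count_recoverable(words, in_vocab):
    """Return (fragments found, words with at least one fragment found)."""
    fragments_found = 0
    words_with_hit = 0
    for word in words:
        hits = [f for f in split_fused(word) if in_vocab(f.lower())]
        fragments_found += len(hits)
        words_with_hit += bool(hits)
    return fragments_found, words_with_hit

# Hypothetical vocabulary standing in for Sense2Vec
known = {"letters", "writer", "munster", "sent"}

print(split_fused("lettersBaillie"))
print(count_recoverable(["lettersBaillie", "writer'", "MunsterSent"], known.__contains__))
```

Here three words yield four recoverable fragments, so dividing a fragment count by the number of missing words gives an upper bound on the share of words that could be recovered.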


In [76]:
print("Proportion not found but possible to find:", newly_found/(len(not_in_s2v)))
print("Proportion of vocabulary not possible to find in Sense2Vec:",(len(not_in_s2v)-newly_found)/len(unique_words))

Proportion not found but possible to find: 0.23063578801283718
Proportion of vocabulary not possible to find in Sense2Vec: 0.3107374321247154


I'm not sure it's worth reworking the tokenization and part-of-speech tagging to raise Sense2Vec's coverage of the vocabulary by only about 9 percentage points (from roughly 60% to 69% at best), so we'll keep going with the model input data as it is for now.
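
The figures behind that estimate can be reproduced from the counts printed above. The sketch treats every recovered fragment as one recovered word, which is an optimistic assumption, so the result is best read as an upper bound.

```python
vocab_size = 28545           # alphabetic tokens in the vocabulary
missing = 11529              # words not found in Sense2Vec
recovered_fragments = 2659   # fragments found after splitting fused tokens

coverage_now = 1 - missing / vocab_size
coverage_best = 1 - (missing - recovered_fragments) / vocab_size
gain_points = (coverage_best - coverage_now) * 100

print(f"coverage now:      {coverage_now:.1%}")
print(f"best-case coverage: {coverage_best:.1%}")
print(f"gain: {gain_points:.1f} percentage points")
```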