# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* Also worth trying: a [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain)
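
A classifier chain fits one binary classifier per label and passes each classifier the labels that come earlier in the chain as extra features, so it can pick up on correlations between labels. A minimal sketch on made-up toy data follows; the feature matrix, the label matrix, and the chain order are illustrative assumptions, not this project's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Toy data (hypothetical): 8 samples, 2 features, 3 binary labels per sample
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y_toy = np.array([
    [0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0],
    [0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0],
])

# Each link in the chain sees the original features plus the earlier labels
chain = ClassifierChain(LogisticRegression(solver="liblinear"), order=[0, 1, 2], random_state=22)
chain.fit(X_toy, Y_toy)

Y_pred = chain.predict(X_toy)
print(Y_pred.shape)  # one column per label
```

The chain order is fixed here for readability; `order="random"` is another option the class accepts.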

***

**Table of Contents**

[0.](#0) Preprocessing

[1.](#1) Logistic Regression (LR)

[2.](#2) Perceptron

[3.](#3) Stochastic Gradient Descent (SGD)

[4.](#4) Support Vector Machines (SVM) (a.k.a. support vector classification (SVC))

[5.](#5) Error Analysis

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis in Python for Social Scientists (Hovy, 2022)

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer
from sense2vec import Sense2Vec

# For classification
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV  #, SGDClassifier, Perceptron
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import precision_recall_fscore_support, f1_score

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [28]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(470712, 8) (158836, 8)


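| | description_id | sentence_id | ann_id | token_id | token | tag | field | subset |
|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1 | 99999 | 3 | Title | O | Title | train |
| 4 | 1 | 1 | 99999 | 4 | : | O | Title | train |
| 5 | 1 | 1 | 99999 | 5 | Papers | O | Title | train |
| 6 | 1 | 1 | 99999 | 6 | of | O | Title | train |
| 7 | 1 | 1 | 14384 | 7 | The | B-Unknown | Title | train |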


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. In addition, only one token has this label, which prevents the data from being passed to the models with cross-validation.
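
Stratified k-fold cross-validation tries to place examples of every class in each fold, so a label with fewer than k occurrences leaves some folds without it. A small sketch of that check using only the standard library follows; the tag list and the helper name `labels_too_rare_for_cv` are hypothetical:

```python
from collections import Counter

def labels_too_rare_for_cv(tags, k):
    """Return the labels that occur fewer than k times (too few to appear in all k folds)."""
    counts = Counter(tags)
    return sorted(label for label, n in counts.items() if n < k)

# Hypothetical tag column: one label occurs only once
toy_tags = ["O"] * 6 + ["B-Unknown"] * 3 + ["B-Nonbinary"]
rare = labels_too_rare_for_cv(toy_tags, k=3)

# Keep only the rows whose label is frequent enough
kept_tags = [t for t in toy_tags if t not in rare]
print(rare, len(kept_tags))
```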

In [29]:
# Inspect the Non-binary rows before dropping them:
# df_train.loc[df_train.tag == "B-Nonbinary"]
# df_train.loc[df_train.tag == "I-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# df_dev.loc[df_dev.tag == "B-Nonbinary"]  # No results
# df_dev.loc[df_dev.tag == "I-Nonbinary"]  # No results
# Drop the Non-binary rows from the training data
df_train = df_train.loc[~df_train.tag.isin(["B-Nonbinary", "I-Nonbinary"])]

In [30]:
df_train.shape

(470710, 8)

In [21]:
df_train.groupby("tag").size().reset_index(name="total")

| | tag | total |
|---|---|---|
| 0 | B-Feminine | 927 |
| 1 | B-Gendered-Pronoun | 2475 |
| 2 | B-Gendered-Role | 2146 |
| 3 | B-Generalization | 1280 |
| 4 | B-Masculine | 3666 |
| 5 | B-Occupation | 2477 |
| 6 | B-Omission | 4447 |
| 7 | B-Stereotype | 1595 |
| 8 | B-Unknown | 12964 |
| 9 | I-Feminine | 2051 |


In [22]:
df_dev.groupby("tag").size().reset_index(name="total")

| | tag | total |
|---|---|---|
| 0 | B-Feminine | 357 |
| 1 | B-Gendered-Pronoun | 833 |
| 2 | B-Gendered-Role | 767 |
| 3 | B-Generalization | 386 |
| 4 | B-Masculine | 1125 |
| 5 | B-Occupation | 866 |
| 6 | B-Omission | 1476 |
| 7 | B-Stereotype | 527 |
| 8 | B-Unknown | 4249 |
| 9 | I-Feminine | 949 |


If not classifying all labels at once, keep only the rows whose tags belong to the selected subset of labels:

In [44]:
# label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
# label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Nonbinary", "I-Nonbinary"]
# label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# df_train = df_train.loc[df_train.tag.isin(label_subset)]
# df_dev = df_dev.loc[df_dev.tag.isin(label_subset)]
# print(df_train.shape, df_dev.shape)

Optionally, lemmatize the tokens (a form of normalization, a.k.a. standardization):

In [45]:
# lmtzr = WordNetLemmatizer()

In [46]:
# tokens_train = list(df_train.token)
# lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
# tokens_dev = list(df_dev.token)
# lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [47]:
# df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
# df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)

In [5]:
# df_train.tail()

In [6]:
# df_dev.head()

Binarize and encode the data:

In [31]:
v = DictVectorizer(sparse=True)
mlb = MultiLabelBinarizer()
labels2numbers = LabelEncoder()

In [32]:
feature_cols = ["sentence_id", "token"]  # or "lemma" in place of "token"
target_col = "tag"
labels = list(np.unique(df_train.tag))
# Map each string label to an integer and back (the binarized targets below rely on these)
y = labels2numbers.fit_transform(labels)
label_to_no = dict(zip(labels, list(y)))
no_to_label = dict(zip(list(y), labels))
# print(label_to_no)

In [33]:
X_train = df_train[feature_cols]
X_train = v.fit_transform(X_train.to_dict('records'))  # fit on train only so X_dev gets the same columns
print(X_train.shape)

y_train = df_train[target_col].values
# Convert the string labels to numeric labels
y_train_numeric = utils.getNumericLabels(y_train, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_train_binarized = mlb.fit_transform(y_train_numeric)
# print(y_train.shape, y_train_binarized.shape)

(470710, 35979)


In [34]:
X_dev = df_dev[feature_cols]
# X_dev = df_dev.drop("tag", axis=1)
X_dev = v.transform(X_dev.to_dict('records'))  # reuse the vectorizer fitted on the training data
print(X_dev.shape)

y_dev = df_dev[target_col].values
# Convert the string labels to numeric labels
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_dev_binarized = mlb.transform(y_dev_numeric)
# print(y_dev.shape, y_dev_binarized.shape)

(158836, 35979)


In [35]:
assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."
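
The encoding steps above can be sketched on toy data to show what the vectorizer and the binarizer produce. The records and label lists below are hypothetical stand-ins that mirror the `sentence_id` and `token` feature columns:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical records mirroring the feature columns used above
train_records = [
    {"sentence_id": 1, "token": "Papers"},
    {"sentence_id": 1, "token": "of"},
    {"sentence_id": 2, "token": "The"},
]
dev_records = [{"sentence_id": 3, "token": "of"}, {"sentence_id": 3, "token": "unseen"}]

vec = DictVectorizer(sparse=True)
X_toy_train = vec.fit_transform(train_records)  # numeric values pass through; strings are one-hot encoded
X_toy_dev = vec.transform(dev_records)          # tokens unseen in training get no column
print(vec.feature_names_)

# Each sample carries a collection of labels; the binarizer makes one 0/1 column per label
mlb_toy = MultiLabelBinarizer()
Y_toy = mlb_toy.fit_transform([["O"], ["O"], ["B-Unknown"]])
print(list(mlb_toy.classes_), Y_toy.tolist())
```

Fitting the vectorizer on the training records only, then calling `transform` on the dev records, is what keeps the column counts equal.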

<a id="1"></a>
## 1. Logistic Regression

In [64]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [71]:
clf = log_reg.fit(X_train, y_train_binarized)

In [72]:
predicted_dev = clf.predict(X_dev)
print(predicted_dev[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


In [None]:
# print("Contextual Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 33%
# print("Person Name Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 40%
# print("Linguistic Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 76%
# print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev))  # 90%
# print("Dev Accuracy (all labels) on `lemma` col:", np.mean(predicted_dev == y_dev))  # 90%
# print("Scaled Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))  # 99%

#### Performance

In [74]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

| | labels | true_neg | false_neg | true_pos | false_pos | precision | recall | f_1 |
|---|---|---|---|---|---|---|---|---|
| 0 | B-Feminine | 158479 | 357 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 1 | B-Gendered-Pronoun | 158003 | 833 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 2 | B-Gendered-Role | 158069 | 767 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | B-Generalization | 158450 | 386 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 4 | B-Masculine | 157711 | 1125 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 5 | B-Occupation | 157970 | 866 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 6 | B-Omission | 157360 | 1476 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 7 | B-Stereotype | 158309 | 527 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 8 | B-Unknown | 154587 | 4249 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 9 | I-Feminine | 157887 | 949 | 0 | 0 | 0.0 | 0.0 | 0.0 |


In [79]:
print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))

Dev Accuracy (all labels) on `token` col: 0.9820297930603031
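
The accuracy above is high even though the per-label table shows no true positives, which can happen when most cells of the binarized target are 0. A minimal sketch of the per-label arithmetic follows, using the B-Unknown counts from the performance table. The helper name `label_metrics` is hypothetical, and reporting 0.0 for an undefined ratio is an assumption about the convention in use:

```python
def label_metrics(true_neg, false_neg, true_pos, false_pos):
    """Compute accuracy, precision, recall and F1 for one label from its confusion counts."""
    total = true_neg + false_neg + true_pos + false_pos
    accuracy = (true_pos + true_neg) / total
    # Assumption: report 0.0 when a ratio is undefined (no predicted or no actual positives)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f_1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f_1

# Counts for B-Unknown on the dev set, copied from the performance table
acc, prec, rec, f_1 = label_metrics(true_neg=154587, false_neg=4249, true_pos=0, false_pos=0)
print(round(acc, 4), prec, rec, f_1)
```

A label that is never predicted still scores well on accuracy because the true negatives dominate, so precision, recall, and F1 are the more informative columns here.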


Try cross-validation (stratified k-fold, where k=3) with Logistic Regression:

In [76]:
k = 3 # number of folds

In [88]:
log_reg_cv = OneVsRestClassifier(
    LogisticRegressionCV(solver="liblinear", multi_class="ovr", cv=k, scoring="f1", random_state=22)  # max_iter=500 is an option; the default is 100 iterations
)
clf1 = log_reg_cv.fit(X_train, y_train_binarized)
pred1_dev = clf1.predict(X_dev)
# print("Dev Accuracy (all labels) on `lemma` col:", np.mean(pred1_dev == y_dev))  # 90%
# print("Dev Accuracy (all labels) on `token` col:", np.mean(pred1_dev == y_dev))  # 90%
# from sklearn import metrics
# metrics.SCORERS.keys()
# print("Accuracy:", clf1.score(X_dev, y_dev_binarized))

In [85]:
original_labels = mlb.classes_
dev_matrix1 = multilabel_confusion_matrix(y_dev_binarized, pred1_dev, labels=mlb.classes_)
df_dev_perf1 = utils.getPerformanceMetrics(y_dev_binarized, pred1_dev, dev_matrix1, mlb.classes_, original_labels, no_to_label)
df_dev_perf1

| | labels | true_neg | false_neg | true_pos | false_pos | precision | recall | f_1 |
|---|---|---|---|---|---|---|---|---|
| 0 | B-Feminine | 158479 | 357 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 1 | B-Gendered-Pronoun | 158003 | 833 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 2 | B-Gendered-Role | 158069 | 767 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | B-Generalization | 158450 | 386 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 4 | B-Masculine | 157711 | 1125 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 5 | B-Occupation | 157970 | 866 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 6 | B-Omission | 157360 | 1476 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 7 | B-Stereotype | 158309 | 527 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 8 | B-Unknown | 154587 | 4249 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 9 | I-Feminine | 157887 | 949 | 0 | 0 | 0.0 | 0.0 | 0.0 |


**QUESTION:** Are scores averaged across the 3 folds?
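
As far as the scikit-learn documentation describes it, `LogisticRegressionCV` stores the per-fold scores in its `scores_` attribute. With the default `refit=True`, it averages them across folds to choose the regularization strength `C_` before a final refit on all of the training data. A small sketch for inspecting this on synthetic data follows; the dataset and the `Cs` grid are illustrative assumptions, and the dictionary layout of `scores_` may differ between scikit-learn releases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary problem (hypothetical stand-in for one label's indicator column)
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=22)

k = 3  # number of folds
model = LogisticRegressionCV(Cs=4, cv=k, solver="liblinear", scoring="f1", random_state=22)
model.fit(X_toy, y_toy)

# scores_ maps the positive class to an array of shape (n_folds, n_Cs)
fold_scores = model.scores_[1]
mean_per_C = fold_scores.mean(axis=0)  # average over the k folds
print(fold_scores.shape, mean_per_C, model.C_)
```

Inside `OneVsRestClassifier`, each fitted per-label model should be reachable through `clf1.estimators_`, so the same attributes can be inspected label by label.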

<a id="1.1"></a>
## 1.1. With Word Embeddings

In [3]:
s2v = Sense2Vec().from_disk(config.s2v_reddit_path)

In [None]:
# from sense2vec import Sense2Vec

# s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
# query = "natural_language_processing|NOUN"
# assert query in s2v
# vector = s2v[query]
# freq = s2v.get_freq(query)
# most_similar = s2v.most_similar(query, n=3)
# # [('machine_learning|NOUN', 0.8986967),
# #  ('computer_vision|NOUN', 0.8636297),
# #  ('deep_learning|NOUN', 0.8573361)]

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")

In [5]:
s2v.from_disk(config.s2v_reddit_path)

<sense2vec.component.Sense2VecComponent at 0x7f327a4584f0>