# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* Also try a [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain)!!!
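
As a rough idea of what the classifier chain mentioned above could look like, here is a minimal sketch on synthetic data. The features, the two correlated labels, and the chain order are all made up for illustration; only `ClassifierChain` and `LogisticRegression` come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic data: 60 samples, 4 features, 2 correlated binary labels
rng = np.random.RandomState(22)
X = rng.rand(60, 4)
y0 = (X[:, 0] > 0.5).astype(int)
y1 = ((X[:, 1] > 0.5) | (y0 == 1)).astype(int)  # label 1 depends on label 0
Y = np.column_stack([y0, y1])

# Each classifier in the chain sees the features plus the earlier labels
chain = ClassifierChain(LogisticRegression(solver="liblinear"), order=[0, 1], random_state=22)
chain.fit(X, Y)
Y_pred = chain.predict(X)
print(Y_pred.shape)
```

Unlike one-vs-rest, the chain lets a later label's classifier use the predictions for earlier labels, which may help when labels co-occur.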

***

**Table of Contents**

[0.](#0) Preprocessing

[1.](#1) Logistic Regression (LR)

[2.](#2) Decision Tree

[3.](#3) Random Forest

[4.](#4) Support Vector Machines (SVM) (a.k.a. support vector classification (SVC))

[5.](#5) Error Analysis

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis with Python for Social Scientists (Hovy, 2022)

Load necessary libraries:

In [41]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer

# For classification
from sklearn.preprocessing import MaxAbsScaler, LabelEncoder, MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer, HashingVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier
# from sklearn.svm import SVC
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support, f1_score

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467477, 8) (157705, 8)


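|   | description_id | sentence_id | ann_id | token_id | token | tag | field | subset |
|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1 | 99999 | 3 | Title | O | Title | train |
| 4 | 1 | 1 | 99999 | 4 | : | O | Title | train |
| 5 | 1 | 1 | 99999 | 5 | Papers | O | Title | train |
| 6 | 1 | 1 | 99999 | 6 | of | O | Title | train |
| 7 | 1 | 1 | 14384 | 7 | The | B-Unknown | Title | train |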


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. In addition, because only one token has this label, keeping it prevents the data from being input into the models with cross-validation.

In [3]:
# # df_train.loc[df_train.tag == "B-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# # df_train.loc[df_train.tag == "I-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# df_dev.loc[df_dev.tag == "B-Nonbinary"]  # No results
# df_dev.loc[df_dev.tag == "I-Nonbinary"]  # No results
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [4]:
df_train.shape

(467475, 8)
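
A more general way to find such labels, sketched below on toy data, is to count how often each tag occurs and drop any tag that falls below a chosen threshold. The threshold `min_count` and the toy rows are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for df_train; the real data has many more rows and columns
df_toy = pd.DataFrame({
    "token": ["Papers", "of", "The", "they", "Smith", "and", "family", "estate"],
    "tag": ["O", "O", "B-Unknown", "B-Nonbinary", "B-Unknown", "O", "I-Unknown", "I-Unknown"],
})

min_count = 2  # hypothetical threshold, e.g. tied to the number of CV folds
tag_counts = df_toy["tag"].value_counts()
rare_tags = tag_counts[tag_counts < min_count].index.tolist()

# Keep only the rows whose tag is frequent enough
df_filtered = df_toy.loc[~df_toy["tag"].isin(rare_tags)]
print(rare_tags, df_filtered.shape)
```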

Optionally, keep only the rows whose tags belong to a subset of the labels:

In [5]:
# label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
# label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Nonbinary", "I-Nonbinary"]
# label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# df_train = df_train.loc[df_train.tag.isin(label_subset)]
# df_dev = df_dev.loc[df_dev.tag.isin(label_subset)]
# print(df_train.shape, df_dev.shape)

Lemmatize the tokens:

In [82]:
# lmtzr = WordNetLemmatizer()

In [83]:
# tokens_train = list(df_train.token)
# lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
# tokens_dev = list(df_dev.token)
# lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [84]:
# df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
# df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)

In [5]:
# df_train.tail()

In [6]:
# df_dev.head()

Binarize and encode the data:

In [7]:
v = DictVectorizer(sparse=True)  #(sparse=False)
mlb = MultiLabelBinarizer()
labels2numbers = LabelEncoder()

In [8]:
feature_cols = ["sentence_id", "token"]  #"lemma"
target_col = "tag"
labels = np.unique(df_train.tag)
labels2numbers = LabelEncoder()
y = labels2numbers.fit_transform(labels)
label_to_no = dict(zip(labels,list(y)))
print(label_to_no)

{'B-Feminine': 0, 'B-Gendered-Pronoun': 1, 'B-Gendered-Role': 2, 'B-Generalization': 3, 'B-Masculine': 4, 'B-Occupation': 5, 'B-Omission': 6, 'B-Stereotype': 7, 'B-Unknown': 8, 'I-Feminine': 9, 'I-Gendered-Pronoun': 10, 'I-Gendered-Role': 11, 'I-Generalization': 12, 'I-Masculine': 13, 'I-Occupation': 14, 'I-Omission': 15, 'I-Stereotype': 16, 'I-Unknown': 17, 'O': 18}


In [9]:
labels = list(labels)
print(labels)
assert isinstance(labels, list)

['B-Feminine', 'B-Gendered-Pronoun', 'B-Gendered-Role', 'B-Generalization', 'B-Masculine', 'B-Occupation', 'B-Omission', 'B-Stereotype', 'B-Unknown', 'I-Feminine', 'I-Gendered-Pronoun', 'I-Gendered-Role', 'I-Generalization', 'I-Masculine', 'I-Occupation', 'I-Omission', 'I-Stereotype', 'I-Unknown', 'O']
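
To make the two encoding steps concrete, the sketch below maps a few toy tags to integers and then to indicator rows. It assumes, hypothetically, that `utils.getNumericLabels` wraps each token's label ID in its own list; that helper's actual behavior is not shown in this notebook.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

toy_tags = ["O", "B-Unknown", "I-Unknown", "O"]

# Step 1: string tags -> integer IDs (classes are sorted alphabetically)
toy_encoder = LabelEncoder()
toy_ids = toy_encoder.fit_transform(toy_tags)

# Step 2 (assumed helper behavior): one single-element list per token
toy_id_lists = [[int(i)] for i in toy_ids]

# Step 3: lists of IDs -> one indicator column per label
toy_mlb = MultiLabelBinarizer()
toy_binarized = toy_mlb.fit_transform(toy_id_lists)
print(list(toy_encoder.classes_))
print(toy_binarized)
```

Each row has exactly one active column here because every toy token carries a single tag.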


In [37]:
X_train = df_train[feature_cols]
X_train = v.fit_transform(X_train.to_dict('records'))  # v.fit() - make sure column count same as X_dev!
print(X_train.shape)

y_train = df_train[target_col].values
# Convert the string labels to numeric labels
y_train_numeric = utils.getNumericLabels(y_train, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_train_binarized = mlb.fit_transform(y_train_numeric)
print(y_train.shape, y_train_binarized.shape)

(467475, 35979)
(467475,) (467475, 19)


In [38]:
X_dev = df_dev[feature_cols]
# X_dev = df_dev.drop("tag", axis=1)
X_dev = v.transform(X_dev.to_dict('records'))
print(X_dev.shape)

y_dev = df_dev[target_col].values
# Convert the string labels to numeric labels
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_dev_binarized = mlb.transform(y_dev_numeric)
print(y_dev.shape, y_dev_binarized.shape)

(157705, 35979)
(157705,) (157705, 19)


In [39]:
assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."
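
To illustrate why the column counts match, here is a minimal sketch with toy records (the tokens and IDs are made up, not taken from the dataset). `DictVectorizer` one-hot encodes string-valued features such as `token`, while numeric values such as `sentence_id` are kept as a single numeric column. Calling `transform` reuses the fitted vocabulary, so a token that never occurred in the training records gets no active token column.

```python
from sklearn.feature_extraction import DictVectorizer

train_records = [
    {"sentence_id": 1, "token": "Papers"},
    {"sentence_id": 1, "token": "of"},
    {"sentence_id": 2, "token": "The"},
]
dev_records = [{"sentence_id": 3, "token": "Unseen"}]

vec = DictVectorizer(sparse=True)
X_toy_train = vec.fit_transform(train_records)  # learns the feature names
X_toy_dev = vec.transform(dev_records)          # reuses them; no new columns

print(vec.feature_names_)
print(X_toy_train.shape, X_toy_dev.shape)
print(X_toy_dev.toarray())
```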

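**Feature Scaling:** Scale each feature by its maximum absolute value with `MaxAbsScaler`, so that every training value falls in the range [-1, 1]. Unlike z-score standardization, this does not center the data, so the feature matrix stays sparse.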

In [42]:
scaler = MaxAbsScaler()  # scaler designed for sparse matrices

In [43]:
# Fit the scaler on the training data only, then reuse it for the dev data
X_train_scaled = scaler.fit_transform(X_train)
X_dev_scaled = scaler.transform(X_dev)

In [44]:
assert X_dev_scaled.shape[1] == X_train_scaled.shape[1], "The train and dev data must have the same number of columns."
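
The following sketch shows what `MaxAbsScaler` does to a small sparse matrix. The numbers are invented for illustration. The scaler is fitted on the toy training matrix and only applied to the toy dev matrix, so dev values can fall outside [-1, 1] when they exceed the training maximum.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

X_toy_train = csr_matrix(np.array([[1.0, 0.0, 4.0],
                                   [2.0, 0.0, 0.0],
                                   [0.0, 0.0, -8.0]]))
X_toy_dev = csr_matrix(np.array([[4.0, 3.0, 2.0]]))

toy_scaler = MaxAbsScaler()
# Learn each column's maximum absolute value from the training data
X_toy_train_scaled = toy_scaler.fit_transform(X_toy_train)
# Divide the dev columns by the training maxima (no refitting)
X_toy_dev_scaled = toy_scaler.transform(X_toy_dev)

print(X_toy_train_scaled.toarray())
print(X_toy_dev_scaled.toarray())
```

Zero entries stay zero and the output remains sparse, which matters for a feature matrix with tens of thousands of one-hot columns.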

<a id="1"></a>
## 1. Logistic Regression

In [21]:
# clf_pipeline = Pipeline([
# CROSS-VALIDATION?
#     ("vect", CountVectorizer()),
#     ("tfidf", TfidfTransformer()),
#     ("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22)))
# ])
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [22]:
# clf_pipeline.fit(X_train, y_train_binarized)
clf = log_reg.fit(X_train_scaled, y_train_binarized)  # (X_train, y_train)

In [23]:
predicted_dev = clf.predict(X_dev_scaled)  # (X_dev)

In [26]:
# print("Contextual Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 33%
# print("Person Name Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 40%
# print("Linguistic Labels Dev Accuracy on `token` col:", np.mean(predicted_dev == y_dev))  # 76%
# print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev))  # 90%
# print("Dev Accuracy (all labels) on `lemma` col:", np.mean(predicted_dev == y_dev))  # 90%
print("Scaled Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))  # 99%

Scaled Dev Accuracy (all labels) on `token` col: 0.9920988387712568
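
Note that `np.mean(predicted_dev == y_dev_binarized)` compares individual cells of the indicator matrices, so the many zero cells count as agreement. The toy sketch below (invented values, three labels) contrasts this cell-level agreement with subset accuracy and macro F1 for a classifier that always predicts the majority label. This may partly explain a score near 99% on 19 label columns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true_toy = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 0, 1]])
# A trivial predictor that always outputs the last (majority) label
y_pred_toy = np.array([[0, 0, 1]] * 4)

cell_agreement = np.mean(y_pred_toy == y_true_toy)        # share of matching cells
subset_accuracy = accuracy_score(y_true_toy, y_pred_toy)  # share of fully correct rows
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

print(cell_agreement, subset_accuracy, macro_f1)
```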


Try using cross-validation (stratified k-fold, where k=3) with Logistic Regression:

In [27]:
k = 3 # number of folds
# from sklearn import metrics
# metrics.SCORERS.keys()

In [46]:
log_reg_cv = OneVsRestClassifier(LogisticRegressionCV(
    penalty="l2", solver="sag", multi_class="ovr", cv=k, scoring="f1", max_iter=500, random_state=22)
                                )
clf1 = log_reg_cv.fit(X_train_scaled, y_train_binarized)  # (X_train, y_train)
pred1_dev = clf1.predict(X_dev_scaled)  # (X_dev)
# print("Dev Accuracy (all labels) on `lemma` col:", np.mean(pred1_dev == y_dev))  # 90%
# print("Dev Accuracy (all labels) on `token` col:", np.mean(pred1_dev == y_dev))  # 90%
clf1.score(X_dev_scaled, y_dev_binarized)  # (X_dev, y_dev)



KeyboardInterrupt: 
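
For reference, the sketch below shows what stratification does on a tiny, made-up binary target: each of the k folds receives roughly the same share of positive examples. It uses `StratifiedKFold` directly rather than `LogisticRegressionCV`, so it illustrates the splitting only, not the interrupted model fit above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 samples, 3 of them positive
X_toy = np.arange(12).reshape(-1, 1)
y_toy = np.array([0] * 9 + [1] * 3)

k = 3  # number of folds
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=22)

# Count the positive examples that land in each held-out fold
positives_per_fold = [int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)]
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X_toy, y_toy)]
print(positives_per_fold, fold_sizes)
```

With fewer positive examples than folds, some folds cannot contain any positives, which is one reason very rare labels are a problem for cross-validation.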

### LR Performance

In [None]:
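# Numeric label IDs in the column order used by the binarizer
original_labels = mlb.classes_
# One 2x2 matrix per label column, laid out as [[TN, FP], [FN, TP]].
# Both inputs are indicator matrices, so no string `labels` argument is passed.
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev)
# TODO: build df_dev_perf from dev_matrix
# df_dev_perf = 
# See: https://github.com/thegoose20/gender-bias/blob/main/document_classification/BaselineDocumentClassifiers.ipynb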
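
One possible way to tabulate per-label performance from a multilabel confusion matrix is sketched below on toy data. The label names, their column order, and the table layout are assumptions for illustration; this is not the approach of the linked notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import multilabel_confusion_matrix

toy_labels = ["B-Unknown", "I-Unknown", "O"]  # hypothetical column order
y_true_toy = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
y_pred_toy = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]])

# Shape (n_labels, 2, 2); each matrix is [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(y_true_toy, y_pred_toy)
df_perf = pd.DataFrame({
    "label": toy_labels,
    "tn": mcm[:, 0, 0], "fp": mcm[:, 0, 1],
    "fn": mcm[:, 1, 0], "tp": mcm[:, 1, 1],
})

# Precision and recall per label, set to 0 when the denominator is 0
pred_pos = df_perf["tp"] + df_perf["fp"]
actual_pos = df_perf["tp"] + df_perf["fn"]
df_perf["precision"] = (df_perf["tp"] / pred_pos.where(pred_pos > 0)).fillna(0.0)
df_perf["recall"] = (df_perf["tp"] / actual_pos.where(actual_pos > 0)).fillna(0.0)
print(df_perf)
```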