# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* Also try a [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain)!!!

***

**Table of Contents**

[0.](#0) Preprocessing

[1.](#1) Logistic Regression (LR)

[2.](#2) Decision Tree

[3.](#3) Random Forest

[4.](#4) Support Vector Machines (SVM) (a.k.a. support vector classification (SVC))

[5.](#4) Error Analysis

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis with Python for Social Scienctists (Hovy, 2022)

Load necessary libraries:

In [31]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer

# For classifcation
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer, HashingVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(470712, 8) (158836, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
3,1,1,99999,3,Title,O,Title,train
4,1,1,99999,4,:,O,Title,train
5,1,1,99999,5,Papers,O,Title,train
6,1,1,99999,6,of,O,Title,train
7,1,1,14384,7,The,B-Unknown,Title,train


Lemmatize the tokens:

In [3]:
lmtzr = WordNetLemmatizer()

In [4]:
tokens_train = list(df_train.token)
lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
tokens_dev = list(df_dev.token)
lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [5]:
df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)

In [6]:
df_train.tail()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,lemma,tag,field,subset
784526,27907,42028,99999,753916,medical,medical,O,Scope and Contents,train
784527,27907,42028,99999,753917,treatment,treatment,O,Scope and Contents,train
784528,27907,42028,99999,753918,of,of,O,Scope and Contents,train
784529,27907,42028,99999,753919,homosexuality,homosexuality,O,Scope and Contents,train
784530,27907,42028,99999,753920,.,.,O,Scope and Contents,train


In [7]:
df_dev.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,lemma,tag,field,subset
172,3,5,99999,154,After,After,O,Biographical / Historical,dev
173,3,5,14379,155,his,his,B-Gendered-Pronoun,Biographical / Historical,dev
174,3,5,99999,156,ordination,ordination,O,Biographical / Historical,dev
175,3,5,14380,157,he,he,B-Gendered-Pronoun,Biographical / Historical,dev
176,3,5,99999,158,spent,spent,O,Biographical / Historical,dev


Binarize and encode the data:

In [16]:
# df_train = df_train.drop(columns=["description_id","ann_id", "token_id", "field", "subset"])
# df_dev = df_dev.drop(columns=["description_id","ann_id", "token_id", "field", "subset"])
v = DictVectorizer(sparse=False)
mlb = MultiLabelBinarizer()
labels2numbers = LabelEncoder()

In [17]:
feature_cols = ["sentence_id", "lemma"]
target_col = "tag"
labels = np.unique(y_train)
labels2numbers = LabelEncoder()
y = labels2numbers.fit_transform(labels)
label_to_no = dict(zip(labels,list(y)))
print(label_to_no)

{'B-Feminine': 0, 'B-Gendered-Pronoun': 1, 'B-Gendered-Role': 2, 'B-Generalization': 3, 'B-Masculine': 4, 'B-Nonbinary': 5, 'B-Occupation': 6, 'B-Omission': 7, 'B-Stereotype': 8, 'B-Unknown': 9, 'I-Feminine': 10, 'I-Gendered-Pronoun': 11, 'I-Gendered-Role': 12, 'I-Generalization': 13, 'I-Masculine': 14, 'I-Nonbinary': 15, 'I-Occupation': 16, 'I-Omission': 17, 'I-Stereotype': 18, 'I-Unknown': 19, 'O': 20}


In [22]:
# TAKE A SAMPLE OF 25% FOR NOW
sent_ids = list(df_train.sentence_id)
quarter = int(len(sent_ids)/4)
sent_ids_sample = list(set(sent_ids[:quarter]))
df_train_sample = df_train.loc[df_train.sentence_id.isin(sent_ids_sample)]

X_train = df_train_sample[feature_cols]
# X_train = df_train[["sentence_id", "lemma"]]
X_train = v.fit_transform(X_train.to_dict('records'))
print(X_train.shape)

y_train = df_train_sample[target_col].values
# Convert the string labels to numeric labels
y_train_numeric = utils.getNumericLabels(y_train, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_train_binarized = mlb.fit_transform(y_train_numeric)
print(y_train_binarized.shape)
print(y_train_binarized[5])
print(y_train_numeric[5])
print(y_train[5])

(117686, 12897)
(117686, 19)
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(4,)
B-Masculine


Looks good!

In [38]:
# TAKE A SAMPLE OF 25% FOR NOW
sent_ids = list(df_dev.sentence_id)
quarter = int(len(sent_ids)/4)
sent_ids_sample = list(set(sent_ids[:quarter]))
df_dev_sample = df_dev.loc[df_dev.sentence_id.isin(sent_ids_sample)]

X_dev = df_dev_sample[feature_cols]
# X_dev = df_dev.drop("tag", axis=1)
X_dev = v.fit_transform(X_dev.to_dict('records'))
print(X_dev.shape)

y_dev = df_dev_sample[target_col].values
# Convert the string labels to numeric labels
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)
# Convert each iterable of iterables above to a multilabel format
y_dev_binarized = mlb.fit_transform(y_dev_numeric)
print(y_dev_binarized.shape)
print(y_dev_binarized[5])
print(y_dev_numeric[5])
print(y_dev[5])

(39741, 7021)
(39741, 19)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
(20,)
O


<a id="1"></a>
## 1. Logistic Regression

In [35]:
# clf_pipeline = Pipeline([
#     ("vect", CountVectorizer()),
#     ("tfidf", TfidfTransformer()),
#     ("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr")))
# ])
clf = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr"))

In [36]:
# clf_pipeline.fit(X_train, y_train_binarized)
clf.fit(X_train, y_train_binarized)
predicted_dev = clf.predict(X_dev)

ValueError: X has 7021 features per sample; expecting 12897

In [None]:
print("Dev Accuracy:", np.mean(predicted_dev == y_dev_binarized))

In [None]:
dev_matrix = multilabel_confusion_matrix()