# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Training, validation, and (blind) test data are under the directory `../data/token_clf_data/model_input/`
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* To try: a [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain), which passes each label's prediction on as a feature to the next classifier in the chain
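
As a rough illustration of the classifier-chain idea mentioned above, the following sketch fits scikit-learn's `ClassifierChain` on a tiny made-up dataset. The feature matrix, the label columns, and the chain order are all assumptions for demonstration; they are not taken from the project data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Toy feature matrix: 6 tokens x 3 binary features (hypothetical values).
X = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# Toy multilabel targets, one column per label. The column meanings
# (e.g. Masculine, Gendered-Pronoun, Occupation) are illustrative only.
Y = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# Each classifier in the chain sees the original features plus the
# predictions for the labels that come earlier in `order`.
chain = ClassifierChain(LogisticRegression(), order=[0, 1, 2], random_state=0)
chain.fit(X, Y)

Y_pred = chain.predict(X)
print(Y_pred.shape)
```

The chain order matters because later labels can only condition on earlier ones, so in practice it may be worth comparing a few orders on the validation data.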

***

**Table of Contents**

[0.](#0) Preprocessing

[1.](#1) Logistic Regression (LR)

[2.](#2) Decision Tree

[3.](#3) Random Forest

[4.](#4) Support Vector Machines (SVM) (a.k.a. support vector classification (SVC))

[5.](#5) Error Analysis

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis in Python for Social Scientists (Hovy, 2022)

Load necessary libraries:

In [17]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer

# For classification
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# from sklearn.pipeline import Pipeline
# from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
# from sklearn.metrics import precision_recall_fscore_support

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [21]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(213094, 8) (70683, 8)


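| | description_id | sentence_id | ann_id | token_id | token | tag | field | subset |
|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1 | 14384 | 7 | The | B-Unknown | Title | train |
| 4 | 1 | 1 | 24275 | 7 | The | B-Masculine | Title | train |
| 5 | 1 | 1 | 52952 | 7 | The | B-Stereotype | Title | train |
| 6 | 1 | 1 | 14384 | 8 | Very | I-Unknown | Title | train |
| 7 | 1 | 1 | 24275 | 8 | Very | I-Masculine | Title | train |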
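
In the rows above, the same token appears once per annotation, so one token can carry several tags. One possible way to turn this long format into multilabel targets is to group the tags by token and binarize them. The sketch below uses a small hand-made frame that mirrors those rows; grouping on `token_id` and using `MultiLabelBinarizer` are assumptions, not necessarily the approach used later in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hand-made rows mirroring the head of the training data (one row per tag).
rows = pd.DataFrame({
    "token_id": [7, 7, 7, 8, 8],
    "token": ["The", "The", "The", "Very", "Very"],
    "tag": ["B-Unknown", "B-Masculine", "B-Stereotype",
            "I-Unknown", "I-Masculine"],
})

# Collect all tags that belong to the same token into one list.
tags_per_token = (
    rows.groupby(["token_id", "token"])["tag"]
    .agg(list)
    .reset_index()
)

# Convert the lists of tags into a binary indicator matrix
# with one column per distinct tag.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags_per_token["tag"])

print(list(mlb.classes_))
print(Y)
```

The resulting matrix has one row per token and one column per tag, which is the target shape scikit-learn's multilabel estimators expect.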


Lemmatize the tokens:

In [22]:
lmtzr = WordNetLemmatizer()

In [23]:
tokens_train = list(df_train.token)
lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
tokens_dev = list(df_dev.token)
lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [24]:
df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)
df_dev.head()

| | description_id | sentence_id | ann_id | token_id | token | lemma | tag | field | subset |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 99999 | 0 | Identifier | Identifier | O | Identifier | dev |
| 1 | 0 | 0 | 99999 | 1 | : | : | O | Identifier | dev |
| 2 | 0 | 0 | 99999 | 2 | AA5 | AA5 | O | Identifier | dev |
| 138 | 3 | 4 | 14377 | 134 | He | He | B-Gendered-Pronoun | Biographical / Historical | dev |
| 139 | 3 | 4 | 14378 | 148 | he | he | B-Gendered-Pronoun | Biographical / Historical | dev |


In [29]:
df_dev.loc[df_dev.sentence_id == 5]

| | description_id | sentence_id | ann_id | token_id | token | lemma | tag | field | subset |
|---|---|---|---|---|---|---|---|---|---|
| 140 | 3 | 5 | 14379 | 155 | his | his | B-Gendered-Pronoun | Biographical / Historical | dev |
| 141 | 3 | 5 | 14380 | 157 | he | he | B-Gendered-Pronoun | Biographical / Historical | dev |
| 142 | 3 | 5 | 41262 | 163 | army | army | B-Occupation | Biographical / Historical | dev |
| 143 | 3 | 5 | 41262 | 164 | Chaplain | Chaplain | I-Occupation | Biographical / Historical | dev |


In [28]:
df_train.loc[df_train.sentence_id == 5]

| | description_id | sentence_id | ann_id | token_id | token | lemma | tag | field | subset |
|---|---|---|---|---|---|---|---|---|---|


Binarize and encode the **training** data:

In [15]:
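# Keep only sentence_id, token, and lemma (features) plus tag (target)
df_train = df_train.drop(columns=["description_id","ann_id", "token_id", "field", "subset"])
df_dev = df_dev.drop(columns=["description_id","ann_id", "token_id", "field", "subset"])
# sparse=False requests a dense array, which runs out of memory on the training data in the next cell
v = DictVectorizer(sparse=False)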

In [16]:
X_train = df_train.drop("tag", axis=1)
X_train = v.fit_transform(X_train.to_dict('records'))
y_train = df_train.tag.values

MemoryError: Unable to allocate 33.7 GiB for an array with shape (213094, 21223) and data type float64
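
The error above comes from materializing the one-hot features as a dense `float64` array. A minimal sketch of one way around it, assuming the downstream estimators accept SciPy sparse input (as `LogisticRegression` and `RandomForestClassifier` do), is to keep the `DictVectorizer` output sparse. The records below are made-up stand-ins for the real rows.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Size of the dense array from the error message:
# rows x columns x 8 bytes per float64, expressed in GiB.
n_rows, n_cols = 213094, 21223
dense_gib = n_rows * n_cols * 8 / 2**30
print(f"dense float64 array: {dense_gib:.1f} GiB")

# Toy records with the same keys as the notebook's feature columns
# (the values are illustrative).
records = [
    {"sentence_id": 1, "token": "The", "lemma": "The"},
    {"sentence_id": 1, "token": "Very", "lemma": "Very"},
    {"sentence_id": 4, "token": "he", "lemma": "he"},
]

# sparse=True (the default) stores only the non-zero entries.
v_sparse = DictVectorizer(sparse=True)
X_sparse = v_sparse.fit_transform(records)

print(sparse.issparse(X_sparse), X_sparse.shape, X_sparse.nnz)
```

Each record contributes only a handful of non-zero entries (one per string feature plus the numeric `sentence_id`), so the sparse matrix grows with the number of rows rather than with rows times vocabulary size.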

In [12]:
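X_dev = df_dev.drop("tag", axis=1)
# NOTE: fit_transform re-fits the vectorizer on the dev data, so these columns
# will not line up with the training features; once `v` has been fit on the
# training data, use v.transform(...) here instead.
X_dev = v.fit_transform(X_dev.to_dict('records'))
y_dev = df_dev.tag.values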
print(X_dev.shape)

(70683, 10811)
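
The dev matrix above has 10811 columns while the training matrix would have had 21223, because the vectorizer was fit separately on each split. A model trained on one feature space cannot score the other. The sketch below, on made-up records, shows the usual pattern of fitting on the training data only and reusing that vocabulary for the dev data.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy stand-ins for the train and dev records (values are illustrative).
train_records = [{"token": "He"}, {"token": "army"}, {"token": "Chaplain"}]
dev_records = [{"token": "he"}, {"token": "army"}]

vec = DictVectorizer(sparse=True)

# Learn the feature vocabulary from the training data only.
X_train_toy = vec.fit_transform(train_records)

# Reuse the same vocabulary for dev; feature values never seen in
# training (here "token=he") are silently dropped.
X_dev_toy = vec.transform(dev_records)

print(X_train_toy.shape, X_dev_toy.shape)
```

Both matrices end up with the same number of columns, and the first dev row is all zeros because its only feature was not in the training vocabulary.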

In [None]:
# Sorted unique tags in the dev data; NumPy arrays use .tolist(), not .to_list()
classes = np.unique(y_dev)
classes = classes.tolist()