# Baseline Gender Biased Token Classifiers

## Target: Label Categories

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Non-binary, Feminine, Masculine
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Empowering, Occupation, Omission, Stereotype
* A [classifier chain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain) is also worth trying
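
As a rough illustration of the classifier chain idea mentioned above, here is a minimal sketch on synthetic data. The data, the chain order, and the variable names (`X_toy`, `Y_toy`, `chain`) are assumptions for illustration only and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic multilabel data standing in for token features and label indicators (assumption)
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=22
)

# Each classifier in the chain is trained on the original features plus
# the labels that come earlier in the chain order
chain = ClassifierChain(LogisticRegression(solver="liblinear"), order=[0, 1, 2], random_state=22)
chain.fit(X_toy, Y_toy)

Y_chain_pred = chain.predict(X_toy)
print(Y_chain_pred.shape)
```

The chain order matters: labels later in the chain can use earlier labels as features, so it may be worth trying more than one order.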

***

#### Table of Contents

**[0.](#0) Preprocessing**

**[1.](#1) Logistic Regression (LR) Model**

  [1.1](#1.1) Input Tokens Individually
 
  [1.2](#1.2) Input Tokens Grouped by Sentence

**[2.](#2) Error Analysis**

***

**References**
* https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets 
* https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
* Text Analysis with Python for Social Scientists (Hovy, 2022)

Load necessary libraries:

In [131]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer
# import spacy
from scipy import spatial

# For classification
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, FunctionTransformer, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.feature_extraction import DictVectorizer  # does binary one-hot encoding if features are strings
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier #, Perceptron
# from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support, f1_score

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. In addition, because only one token has this label, keeping it prevents the data from being input into the models with cross-validation.

In [4]:
# # df_train.loc[df_train.tag == "B-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# # df_train.loc[df_train.tag == "I-Nonbinary"]
# df_train.loc[df_train.token_id == 230697]
# df_dev.loc[df_dev.tag == "B-Nonbinary"]  # No results
# df_dev.loc[df_dev.tag == "I-Nonbinary"]  # No results
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)
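
A more general guard against labels that are too rare for k-fold cross-validation could look like the sketch below. The helper name `drop_rare_labels`, the toy data, and the threshold are hypothetical and only illustrate the idea of filtering by label count.

```python
import pandas as pd

def drop_rare_labels(df, label_col, min_count):
    """Keep only rows whose label occurs at least `min_count` times."""
    counts = df[label_col].value_counts()
    labels_to_keep = counts[counts >= min_count].index
    return df.loc[df[label_col].isin(labels_to_keep)]

# Toy data: "B-Rare" appears once, so it cannot be spread across 3 folds
toy_tags = pd.DataFrame({
    "token": ["a", "b", "c", "d", "e", "f", "g"],
    "tag": ["O", "O", "O", "B-X", "B-X", "B-X", "B-Rare"],
})
toy_filtered = drop_rare_labels(toy_tags, "tag", min_count=3)
print(toy_filtered.tag.value_counts())
```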

***
#### Optional Preprocessing Steps

If not classifying all labels at once, keep only the rows whose tags belong to the selected subset of labels:

In [27]:
# label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
# label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Nonbinary", "I-Nonbinary"]
# label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# df_train = df_train.loc[df_train.tag.isin(label_subset)]
# df_dev = df_dev.loc[df_dev.tag.isin(label_subset)]
# print(df_train.shape, df_dev.shape)

Optionally, lemmatize the tokens (a form of normalization, a.k.a. standardization):

In [28]:
lmtzr = WordNetLemmatizer()

In [29]:
tokens_train = list(df_train.token)
lemmas_train = [lmtzr.lemmatize(token) for token in tokens_train]
tokens_dev = list(df_dev.token)
lemmas_dev = [lmtzr.lemmatize(token) for token in tokens_dev]

In [30]:
df_train.insert((list(df_train.columns).index("token")+1), "lemma", lemmas_train)
df_dev.insert((list(df_dev.columns).index("token")+1), "lemma", lemmas_dev)

In [31]:
# df_train.tail()
df_dev.head()

Unnamed: 0,description_id,sentence_id,token_id,token,lemma,token_offsets,pos,tag,field,subset
172,3,5,154,After,After,"(907, 912)",IN,O,Biographical / Historical,dev
173,3,5,155,his,his,"(913, 916)",PRP$,B-Gendered-Pronoun,Biographical / Historical,dev
174,3,5,156,ordination,ordination,"(917, 927)",NN,O,Biographical / Historical,dev
175,3,5,157,he,he,"(928, 930)",PRP,B-Gendered-Pronoun,Biographical / Historical,dev
176,3,5,158,spent,spent,"(931, 936)",VBD,O,Biographical / Historical,dev


Add the annotation label categories as a column of higher-level Inside-Outside-Beginning (IOB) tags so they can be used as targets:

In [8]:
df_train = utils.addCategoryTagColumn(df_train)
df_train.head()

Unnamed: 0,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,subset,tag_cat
3,1,1,3,Title,"(17, 22)",NN,O,Title,train,O
4,1,1,4,:,"(22, 23)",:,O,Title,train,O
5,1,1,5,Papers,"(24, 30)",NNS,O,Title,train,O
6,1,1,6,of,"(31, 33)",IN,O,Title,train,O
7,1,1,7,The,"(34, 37)",DT,B-Unknown,Title,train,B-Person-Name


In [9]:
df_dev = utils.addCategoryTagColumn(df_dev)
df_dev.head()

Unnamed: 0,description_id,sentence_id,token_id,token,token_offsets,pos,tag,field,subset,tag_cat
172,3,5,154,After,"(907, 912)",IN,O,Biographical / Historical,dev,O
173,3,5,155,his,"(913, 916)",PRP$,B-Gendered-Pronoun,Biographical / Historical,dev,B-Linguistic
174,3,5,156,ordination,"(917, 927)",NN,O,Biographical / Historical,dev,O
175,3,5,157,he,"(928, 930)",PRP,B-Gendered-Pronoun,Biographical / Historical,dev,B-Linguistic
176,3,5,158,spent,"(931, 936)",VBD,O,Biographical / Historical,dev,O
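
The implementation of `utils.addCategoryTagColumn` is not shown in this notebook. The sketch below is one way such a mapping could work, based only on the three label categories listed at the top and the tag spellings visible in the data. The names `CATEGORY_OF` and `to_category_tag` are hypothetical.

```python
import pandas as pd

# Label -> category mapping taken from the category list at the top of the notebook
CATEGORY_OF = {
    "Unknown": "Person-Name", "Nonbinary": "Person-Name",
    "Feminine": "Person-Name", "Masculine": "Person-Name",
    "Generalization": "Linguistic", "Gendered-Pronoun": "Linguistic",
    "Gendered-Role": "Linguistic",
    "Empowering": "Contextual", "Occupation": "Contextual",
    "Omission": "Contextual", "Stereotype": "Contextual",
}

def to_category_tag(tag):
    """Map an IOB label tag (e.g. 'B-Unknown') to its category tag (e.g. 'B-Person-Name')."""
    if tag == "O":
        return "O"
    prefix, label = tag.split("-", 1)  # split only on the first hyphen: 'B', 'Gendered-Pronoun'
    return f"{prefix}-{CATEGORY_OF[label]}"

toy_tag_df = pd.DataFrame({"tag": ["O", "B-Unknown", "I-Unknown", "B-Gendered-Pronoun", "B-Stereotype"]})
toy_tag_df["tag_cat"] = toy_tag_df["tag"].map(to_category_tag)
print(toy_tag_df)
```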


Look at only the *Linguistic* category to start:

In [144]:
df_train_ling = df_train.drop(columns=["tag"])
df_train_ling["tag_cat"] = df_train_ling["tag_cat"].replace(['B-Person-Name', 'B-Contextual', 'I-Person-Name', 'I-Contextual'], 'O')
df_train_ling = df_train_ling.drop_duplicates()

df_dev_ling = df_dev.drop(columns=["tag"])
df_dev_ling["tag_cat"] = df_dev_ling["tag_cat"].replace(['B-Person-Name', 'B-Contextual', 'I-Person-Name', 'I-Contextual'], 'O')
df_dev_ling = df_dev_ling.drop_duplicates()
df_dev_ling.head()

Unnamed: 0,description_id,sentence_id,token_id,token,token_offsets,pos,field,subset,tag_cat
172,3,5,154,After,"(907, 912)",IN,Biographical / Historical,dev,O
173,3,5,155,his,"(913, 916)",PRP$,Biographical / Historical,dev,B-Linguistic
174,3,5,156,ordination,"(917, 927)",NN,Biographical / Historical,dev,O
175,3,5,157,he,"(928, 930)",PRP,Biographical / Historical,dev,B-Linguistic
176,3,5,158,spent,"(931, 936)",VBD,Biographical / Historical,dev,O


Group the data by sentence, adding padding to sentences so that they are all the same length:

In [259]:
cols_to_keep = ["sentence_id", "pos", "token", "tag_cat"]

In [260]:
df_train_ling_grouped = utils.implodeDataFrame(df_train_ling[cols_to_keep], ["sentence_id"])
df_dev_ling_grouped = utils.implodeDataFrame(df_dev_ling[cols_to_keep], ["sentence_id"])

In [265]:
df_train_ling_grouped = df_train_ling_grouped.rename(columns={"token":"sentence"})
df_dev_ling_grouped = df_dev_ling_grouped.rename(columns={"token":"sentence"})
df_dev_ling_grouped.head()

sentence_id,pos,sentence,tag_cat
5,"[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,..."
11,"[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]"
13,"[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
24,"[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [275]:
df_train_ling_grouped = utils.addPaddedSentenceColumn(df_train_ling_grouped)
# print(df_train_ling_grouped.sentence.values[0][:20])
df_dev_ling_grouped = utils.addPaddedSentenceColumn(df_dev_ling_grouped)

***

#### Word Embeddings

Get GloVe word embeddings (which were trained on English Wikipedia entries) for the tokens:

In [148]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[1]

In [149]:
glove = utils.getGloveEmbeddings(d)
print(glove["recipient"])

[ 0.063877   0.95793   -0.053323  -0.068542   0.76758   -0.27335
 -0.043212  -0.39447    0.15885    0.25465   -0.34075   -0.30437
  0.24691    0.49041   -0.54421   -0.026556   0.99498   -0.22903
 -0.083907   0.40962   -1.3918    -0.37756   -0.5675     0.090421
  0.71336    0.43176   -0.057562  -0.34407    1.3235    -0.82601
  0.46754    1.1343     0.44713    0.29694    0.61125    0.080119
 -0.95791    0.43931   -0.74273    0.4412    -0.068448   0.74451
  0.16243    0.1931     0.85294    0.39898    0.24571   -0.3771
 -0.96994    0.19199    0.057375   0.047835   0.74642   -0.075984
 -0.54556   -0.72614   -0.010644  -0.60529    1.0421    -0.03876
  0.18461    0.53881   -0.225      0.47586    0.63071   -0.6616
 -0.51847    0.90297    1.1178    -0.01349    0.19686    0.13684
 -0.38346    0.59652    0.3418     0.80315    0.061273  -0.48047
 -0.38057   -0.47128    0.45696    0.44741   -0.18594   -0.29276
 -0.8917     0.092826   0.20231   -0.72893    0.58968   -0.64259
 -0.34245    0.0076589  
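
For reference, GloVe vectors are distributed as plain text with one word per line followed by its space-separated values. The sketch below parses that format from an in-memory sample. `parse_glove` is a hypothetical stand-in and may differ from what `utils.getGloveEmbeddings` does.

```python
import io
import numpy as np

def parse_glove(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into a dict of word -> vector."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=float)
    return embeddings

# Tiny 3-dimensional sample; a real file would have 50, 100, 200, or 300 values per line
sample_file = io.StringIO("recipient 0.063877 0.95793 -0.053323\nthe 0.1 0.2 0.3\n")
glove_toy = parse_glove(sample_file)
print(glove_toy["recipient"])
```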

In [150]:
train_tokens = list(df_train.token)
dev_tokens = list(df_dev.token)

In [151]:
train_embeddings = utils.getEmbeddingsForTokens(glove, train_tokens)
dev_embeddings = utils.getEmbeddingsForTokens(glove, dev_tokens)

In [152]:
assert np.array_equal(train_embeddings[0], glove[train_tokens[0].lower()])
assert np.array_equal(dev_embeddings[0], glove[dev_tokens[0].lower()])
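
The assertions above suggest that tokens are lowercased before lookup. A minimal sketch of such a lookup is below. The zero-vector fallback for out-of-vocabulary tokens is an assumption (it mirrors the fallback used in the transformer later in this notebook), and `embeddings_for_tokens` is a hypothetical name.

```python
import numpy as np

def embeddings_for_tokens(embeddings, tokens, dim):
    """Look up each lowercased token, using a zero vector when the token is unknown."""
    zero_vector = np.zeros(dim)
    return [embeddings.get(token.lower(), zero_vector) for token in tokens]

toy_glove = {"title": np.array([0.1, 0.2]), "papers": np.array([0.3, 0.4])}
toy_vectors = embeddings_for_tokens(toy_glove, ["Title", "Papers", "AA6"], dim=2)
print(toy_vectors)
```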

In [153]:
embedding_dict = dict(zip(train_tokens, train_embeddings))
embedding_dict.update(dict(zip(dev_tokens, dev_embeddings)))

In [154]:
embedding_dict_keys = set(embedding_dict.keys())  # a set makes the membership checks below fast
unique_train_tokens = list(set(train_tokens))
unique_dev_tokens = list(set(dev_tokens))
for token in unique_dev_tokens:
    assert token in embedding_dict_keys
for token in unique_train_tokens:
    assert token in embedding_dict_keys

<a id="1"></a>
## 1. Logistic Regression Model

<a id="1.1"></a>
### 1.1 Input Tokens Individually

#### Feature Engineering

In [244]:
one_hot_encoder = OneHotEncoder()
labels2numbers = LabelEncoder()
mlb = MultiLabelBinarizer()

In [220]:
# # class CountWordCaps(BaseEstimator, TransformerMixin):
# # """ Model that extracts a counter of capital words from text. """
# #     def fit(self, X, y=None):
# #         return self
# #    def transform(self, texts):
# #         """ transform data :texts: The texts to count capital words in :returns: list of counts for each text """
# #         return [[sum(w.isupper() for w in nltk.word_tokenize(text))] for text in texts]
    
class GloveTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, token):
        # Look up each token's embedding; unseen tokens get a zero vector (d is a string, so cast it to int)
        return pd.DataFrame([embedding_dict[t] if t in embedding_dict else np.zeros(int(d)) for t in token])

Vectorize tokens:

In [221]:
col_transformer = ColumnTransformer(
    [
        ('pos_encodings', OneHotEncoder(handle_unknown="ignore"), ['pos']),
        ('token_embeddings', GloveTransformer(), 'token'),
    ],
    remainder='passthrough'
)

In [255]:
feature_cols = ["pos", "token"]  #, "lemma"]
target_col = "tag_cat"  #"tag"
labels = list(np.unique(df_train_ling[target_col]))
labels2numbers = LabelEncoder()
y = labels2numbers.fit_transform(labels)
label_to_no = dict(zip(labels,list(y)))
no_to_label = dict(zip(list(y),labels))
print(label_to_no)

{'B-Linguistic': 0, 'I-Linguistic': 1, 'O': 2}


In [235]:
X_train = df_train_ling[feature_cols]

In [252]:
y_train = list(df_train_ling[target_col].values)
y_train_numeric = utils.getNumericLabels(y_train, label_to_no)  # Convert the string labels to numeric labels
y_train_binarized = mlb.fit_transform(y_train_numeric)          # Convert each iterable of iterables above to a multilabel format
print(y_train[0])
print(y_train_numeric[0])
print(y_train_binarized[0])

O
(2,)
[0 0 1]
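
Judging by the printed `(2,)`, each numeric label appears to be wrapped in a one-element tuple before binarizing. The toy example below shows that step with `MultiLabelBinarizer`. The list comprehension is an assumed equivalent of `utils.getNumericLabels`, whose code is not shown.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_label_to_no = {"B-Linguistic": 0, "I-Linguistic": 1, "O": 2}
toy_tags = ["O", "B-Linguistic", "I-Linguistic", "O"]

# Wrap each numeric label in a tuple so that every sample is an iterable of labels
toy_numeric = [(toy_label_to_no[tag],) for tag in toy_tags]

toy_mlb = MultiLabelBinarizer()
toy_binarized = toy_mlb.fit_transform(toy_numeric)  # one indicator column per class
print(toy_mlb.classes_)
print(toy_binarized)
```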


In [253]:
X_dev = df_dev_ling[feature_cols]
y_dev = df_dev_ling[target_col].values
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)  # Make numeric
y_dev_binarized = mlb.transform(y_dev_numeric)              # Binarize
print(y_dev.shape, y_dev_binarized.shape)

(153199,) (153199, 3)


In [254]:
assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."

#### Train the Model

In [162]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [163]:
pipeline = Pipeline([
    ("col_transformer", col_transformer),
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("classifier", log_reg)
])

In [164]:
clf = pipeline.fit(X_train, y_train_binarized)

#### Predict

In [165]:
predicted_dev = clf.predict(X_dev)
print(predicted_dev[0])

[0 0 1]


#### Evaluate Model Performance

Strict evaluation:

In [166]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,B-Linguistic,151320,674,849,356,0.704564,0.557452,0.622434
1,I-Linguistic,152923,276,0,0,0.0,0.0,0.0
2,O,853,356,151044,946,0.993776,0.997649,0.995708
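
The precision, recall, and F1 columns follow directly from the confusion counts. The sketch below recomputes them for the `B-Linguistic` row using the standard definitions. Returning 0 when a denominator is zero is an assumption about how `utils.getPerformanceMetrics` handles the `I-Linguistic` row.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from confusion counts, returning 0.0 on empty denominators."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts taken from the B-Linguistic row of the table above
p_b, r_b, f1_b = precision_recall_f1(true_pos=849, false_pos=356, false_neg=674)
print(p_b, r_b, f1_b)

# Counts taken from the I-Linguistic row: nothing was predicted as I-Linguistic
p_i, r_i, f1_i = precision_recall_f1(true_pos=0, false_pos=0, false_neg=276)
print(p_i, r_i, f1_i)
```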


In [167]:
print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))

Dev Accuracy (all labels) on `token` col: 0.9943254633951049
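
Note that `np.mean(predicted == true)` on binarized labels averages over every cell of the indicator matrix, so the large number of correct zeros inflates the score. The toy example below contrasts this element-wise accuracy with subset accuracy from `accuracy_score`, which only counts a row as correct when all of its indicators match. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy indicator matrices with columns [B-Linguistic, I-Linguistic, O]
toy_true = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
toy_pred = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]])  # always predicts "O"

elementwise_acc = np.mean(toy_pred == toy_true)   # fraction of matching cells
subset_acc = accuracy_score(toy_true, toy_pred)   # fraction of fully matching rows
print(elementwise_acc, subset_acc)
```

Even this always-"O" predictor scores higher on the element-wise measure than on subset accuracy, which is why the per-label precision and recall above are more informative for the minority labels.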


Relaxed evaluation:

#### Cross Validation

Try cross-validation (stratified k-fold, where k=3) with Logistic Regression:

In [129]:
k = 3 # number of folds

In [132]:
log_reg_cv = OneVsRestClassifier(LogisticRegressionCV(
    solver="liblinear", multi_class="ovr", cv=k, scoring="f1", random_state=22)  #max_iter=500, --> default is 100 iterations
                                )

In [133]:
pipeline = Pipeline([
    ("col_transformer", col_transformer),
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("classifier", log_reg_cv)
])

In [134]:
clf = pipeline.fit(X_train, y_train_binarized)

KeyboardInterrupt: 

In [None]:
predicted_dev = clf.predict(X_dev)

In [None]:
original_labels = mlb.classes_
dev_matrix1 = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf1 = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix1, mlb.classes_, original_labels, no_to_label)
df_dev_perf1

**QUESTION:** Are scores averaged across the 3 folds?
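
The scikit-learn documentation for `LogisticRegressionCV` states that with `refit=True` (the default) the scores are averaged across folds to choose the regularization strength before a final refit. To inspect per-fold scores explicitly, one option is `cross_val_score`, sketched below on synthetic data. The data and variable names are assumptions and this is not the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data standing in for one one-vs-rest label (assumption)
X_cv_toy, y_cv_toy = make_classification(n_samples=300, n_features=10, random_state=22)

# Stratified 3-fold split, scored with F1 as in the cell above
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=22)
fold_scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X_cv_toy, y_cv_toy, cv=cv, scoring="f1"
)

# One score per fold; averaging is done explicitly here
print(fold_scores, fold_scores.mean())
```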

<a id="1.2"></a>
### 1.2 Input Tokens Grouped by Sentence

In [257]:
X_train = df_train_ling_grouped[feature_cols]
print(X_train.shape)

(25218, 2)


In [258]:
y_train = df_train_ling_grouped[target_col].values
y_train_numeric = [labels2numbers.fit_transform(labels_list) for labels_list in y_train]
y_train_binarized = [one_hot_encoder.fit_transform((number_list).reshape(-1,1)) for number_list in y_train_numeric]
# y_train_numeric = utils.getNumericLabels(y_train, label_to_no)  # Convert the string labels to numeric labels
# y_train_binarized = mlb.fit_transform(y_train_numeric)          # Convert each iterable of iterables above to a multilabel format
print(y_train.shape)
print(y_train[3])
print(y_train_numeric[3])
print(y_train_binarized[3])

(25218,)
['B-Linguistic', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Linguistic', 'O', 'O', 'O', 'O', 'O', 'O']
[0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1]
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 1)	1.0
  (4, 1)	1.0
  (5, 1)	1.0
  (6, 1)	1.0
  (7, 1)	1.0
  (8, 1)	1.0
  (9, 1)	1.0
  (10, 0)	1.0
  (11, 1)	1.0
  (12, 1)	1.0
  (13, 1)	1.0
  (14, 1)	1.0
  (15, 1)	1.0
  (16, 1)	1.0
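
In the output above, `O` is encoded as 1 for this sentence even though `label_to_no` maps `O` to 2. This happens because the encoder is refit on each sentence, so the numbering depends on which labels the sentence contains. The sketch below reproduces the effect on toy data and shows one possible fix, fitting a single encoder on the full label set. The variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

toy_label_lists = [["O", "O", "O"], ["B-Linguistic", "O", "I-Linguistic"]]

# Refitting per sentence: "O" becomes 0 in the first sentence and 2 in the second
per_sentence = [LabelEncoder().fit_transform(labels) for labels in toy_label_lists]

# Fitting once on all labels keeps the numbering consistent across sentences
shared_encoder = LabelEncoder().fit(["B-Linguistic", "I-Linguistic", "O"])
consistent = [shared_encoder.transform(labels) for labels in toy_label_lists]

print(per_sentence)
print(consistent)
```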


In [206]:
y_train[0]

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

In [211]:
y_encoded = LabelEncoder().fit_transform(y_train[3])
y_encoded

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [202]:
y = y.reshape(-1, 1)
ohe = one_hot_encoder.fit(y)

In [204]:
ohe.fit_transform(y_train_numeric)

  return array(a, dtype, copy=False, order=order)


ValueError: Expected 2D array, got 1D array instead:
array=[list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 ... list([2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
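
The error arises because `y_train_numeric` is a ragged list of lists (sentences of different lengths), which cannot be converted to a 2D array. One possible workaround, sketched below on toy data, is to fit the encoder on the flattened labels and then transform each sentence separately. Fixing `categories` up front is an assumption that all three classes are known in advance.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy numeric labels for two sentences of different lengths
toy_y_numeric = [[2, 2, 2], [0, 2, 1, 2]]

# Fit once on all labels stacked into a single column
toy_encoder = OneHotEncoder(categories=[[0, 1, 2]])
flat_labels = np.concatenate([np.asarray(s) for s in toy_y_numeric]).reshape(-1, 1)
toy_encoder.fit(flat_labels)

# Transform sentence by sentence so each sentence keeps its own length
toy_one_hot = [
    toy_encoder.transform(np.asarray(s).reshape(-1, 1)).toarray() for s in toy_y_numeric
]
print(toy_one_hot[1])
```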