# Baseline Gender Biased Token Classifiers

## Target: Label Categories

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory ../data/token_clf_data/model_input/
* Multilabel classification
    * 3 categories of labels: Linguistic, Person Name, Contextual
* Models: 
    * 3 Logistic Regression classifiers, 1 per category
    * One model's output becomes next model's input

***

#### Table of Contents

**[0.](#0) Preprocessing**

**[1.](#1) Logistic Regression Models**

  [1.1](#1.1) Target Category: Linguistic
 
  [1.2](#1.2) Target Category: Person Name
  
  [1.3](#1.3) Target Category: Contextual

**[2.](#2) Error Analysis**

***

Load necessary libraries:

In [89]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For preprocessing
from nltk.stem import WordNetLemmatizer
# import spacy
from scipy import spatial

# For classification
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# LR with OvR provides multilabel model
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Multilabel models in sklearn
from sklearn.ensemble import RandomForestClassifier
#     tree.DecisionTreeClassifier
#     tree.ExtraTreeClassifier
#     ensemble.ExtraTreesClassifier
#     neighbors.KNeighborsClassifier
#     neural_network.MLPClassifier
#     neighbors.RadiusNeighborsClassifier
#     linear_model.RidgeClassifier
#     linear_model.RidgeClassifierCV

# For evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support, f1_score

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [18]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


Drop duplicate rows with all but the same annotation ID:

In [19]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, and because only one token has this label, it prevents the data from being input into the models with cross-validation.

In [20]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [21]:
df_train.shape

(463439, 9)

***

#### Label Categories

Add the annotation label categories as a column of higher-level Inside-Outside-Beginning (IOB) tags so they can be used as targets:

In [22]:
df_train = utils.addCategoryTagColumn(df_train)
# df_train.head(20)

In [23]:
df_dev = utils.addCategoryTagColumn(df_dev)
# df_dev.head()

Remove columns that won't be used as features for the classifiers and remove any duplicate rows that remain:

In [24]:
cols_to_keep = ["sentence_id", "token_id", "pos", "token", "tag_cat"]

In [25]:
df_train = df_train[cols_to_keep]
df_train = df_train.drop_duplicates()
df_dev = df_dev[cols_to_keep]
df_dev = df_dev.drop_duplicates()
# df_train.head(20)

Create columns for each category so they can be passed into the models as individual features:

In [26]:
ling_cat_tags = ["B-Linguistic", "I-Linguistic"]
df_train_ling = df_train.loc[df_train.tag_cat.isin(ling_cat_tags)]
df_dev_ling = df_dev.loc[df_dev.tag_cat.isin(ling_cat_tags)]

In [27]:
pers_cat_tags = ["B-Person-Name", "I-Person-Name"]
df_train_pers = df_train.loc[df_train.tag_cat.isin(pers_cat_tags)]
df_dev_pers = df_dev.loc[df_dev.tag_cat.isin(pers_cat_tags)]

In [28]:
cont_cat_tags = ["B-Contextual", "I-Contextual"]
df_train_cont = df_train.loc[df_train.tag_cat.isin(cont_cat_tags)]
df_dev_cont = df_dev.loc[df_dev.tag_cat.isin(cont_cat_tags)]

In [29]:
df_train = (df_train.drop(columns=["tag_cat"])).drop_duplicates()
df_dev = (df_dev.drop(columns=["tag_cat"])).drop_duplicates()

In [30]:
join_cols = ["sentence_id", "token_id", "pos", "token"]

In [31]:
df_train = df_train.join(df_train_ling.set_index(join_cols), on=join_cols, how="outer")
df_train = df_train.join(df_train_pers.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_personname")
df_train = df_train.join(df_train_cont.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_contextual")
df_train = df_train.rename(columns={"tag_cat":"tag_cat_linguistic"})
# df_train.head(30)  # Should have one row per token!

In [33]:
df_dev = df_dev.join(df_dev_ling.set_index(join_cols), on=join_cols, how="outer")
df_dev = df_dev.join(df_dev_pers.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_personname")
df_dev = df_dev.join(df_dev_cont.set_index(join_cols), on=join_cols, how="outer", lsuffix="", rsuffix="_contextual")
df_dev = df_dev.rename(columns={"tag_cat":"tag_cat_linguistic"})
# df_dev.head(30)

In [62]:
df_train.tail(30)

Unnamed: 0,sentence_id,token_id,pos,token,tag_cat_linguistic,tag_cat_personname,tag_cat_contextual
779229,42027,753891,NN,treatment,,,
779230,42027,753892,IN,of,,,
779231,42027,753893,NN,homosexuality,,,
779232,42027,753894,IN,in,,,
779233,42027,753895,JJ,contemporary,,,
779234,42027,753896,JJ,medical,,,
779235,42027,753897,NNS,journals,,,
779236,42027,753898,CC,and,,,
779237,42027,753899,NNS,books,,,
779238,42027,753900,",",",",,,


**REMEMBER:** check that model input data created on correct subset of files - no stereotypes about homosexuality "offences" or "medical treatment" of homosexuality??? 

Replace the `tag_cat_` columns' `nan` values with `'O'`:

In [43]:
tag_cat_cols = ["tag_cat_linguistic", "tag_cat_personname", "tag_cat_contextual"]
df_train[tag_cat_cols] = df_train[tag_cat_cols].fillna('O')
df_dev[tag_cat_cols] = df_dev[tag_cat_cols].fillna('O')
df_dev.head()

Unnamed: 0,sentence_id,token_id,pos,token,tag_cat_linguistic,tag_cat_personname,tag_cat_contextual
172,5,154,IN,After,O,O,O
173,5,155,PRP$,his,B-Linguistic,O,O
174,5,156,NN,ordination,O,O,O
175,5,157,PRP,he,B-Linguistic,O,O
176,5,158,VBD,spent,O,O,O


Group the data by sentence, so the token column becomes a list of tokens for each sentence:

In [38]:
df_train_grouped = utils.implodeDataFrame(df_train, ["sentence_id"])
df_dev_grouped = utils.implodeDataFrame(df_dev, ["sentence_id"])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_dev_grouped.head()

Unnamed: 0_level_0,token_id,pos,sentence,tag_cat_linguistic,tag_cat_personname,tag_cat_contextual
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[IN, PRP$, NN, PRP, VBD, CD, NNS, IN, DT, NN, ...","[After, his, ordination, he, spent, three, yea...","[O, B-Linguistic, O, B-Linguistic, O, O, O, O,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, B-Contextual, I-Co..."
11,"[308, 309, 310]","[NN, :, NN]","[Identifier, :, AA6]","[O, O, O]","[O, O, O]","[O, O, O]"
13,"[321, 322, 323, 324, 325, 326, 327, 328, 329, ...","[NN, CC, NNS, :, NNS, CC, NNS, ,, JJ, ;, NNS, ...","[Scope, and, Contents, :, Sermons, and, addres...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
18,"[498, 499, 500, 501, 502, 503, 504, 505, 506, ...","[IN, CD, NNP, NNP, VBD, NNP, NNP, CC, PRP, VBD...","[In, 1941, Tom, Allan, married, Jane, Moore, a...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, O, B-Pers...","[O, O, B-Contextual, I-Contextual, I-Contextua..."
24,"[649, 650, 651, 652, 653, 654, 655, 656, 657, ...","[IN, CD, NNP, NNP, NNP, VBD, DT, NN, TO, VB, N...","[In, 1955, Rev, Tom, Allan, accepted, a, call,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, B-Person-Name, I-Person-Name, I-Person-...","[O, O, O, O, O, O, O, O, O, O, B-Contextual, O..."


Pad the sentences so they all have the same lengths:

In [44]:
df_train_grouped = utils.addPaddedSentenceColumn(df_train_grouped)
print(df_train_grouped.sentence.values[0][:20])
df_dev_grouped = utils.addPaddedSentenceColumn(df_dev_grouped)

['Title', ':', 'Papers', 'of', 'The', 'Very', 'Rev', 'Rev', 'Prof', 'James', 'Whyte', '(', '1920-2005', ')', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']


***

#### Word Embeddings

Get GloVe word embeddings (which were trained on English Wikipedia entries) for the vocabulary of the dataset (the unique tokens in the training set):

In [45]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[0]

In [52]:
glove = utils.getGloveEmbeddings(d)
# print(glove["recipient"])

In [47]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


In [48]:
word_embeddings = utils.getEmbeddingsForTokens(glove, vocabulary)

In [49]:
assert np.array_equal(word_embeddings[0], glove[vocabulary[0].lower()])

In [50]:
embedding_dict = dict(zip(vocabulary, word_embeddings))

In [51]:
embedding_dict_keys = list(embedding_dict.keys())
for token in vocabulary:
    assert token in embedding_dict_keys

<a id="1"></a>
## 1. Logistic Regression Model

#### Feature Engineering

Encode and binarize the text data so it can be input into sklearn models.

In [114]:
one_hot_encoder = OneHotEncoder()
# labels2numbers = LabelEncoder()
# mlb = MultiLabelBinarizer()

In [115]:
# Transforms each token to a GloVe word embedding of dimension d,
# or a zero vector of dimension d for any unrecognized tokens 
class GloveTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, sentence):
        return pd.DataFrame([np.asarray([embedding_dict[t] if t in embedding_dict.keys() else np.zeros((d,)) for t in token]) for token in sentence])

In [116]:
col_transformer = ColumnTransformer(
    [
#         ('pos_encodings', OneHotEncoder(handle_unknown='ignore'), ['pos']),  # onehotencoder doesn't work on list...?
        ('token_embeddings', GloveTransformer(), 'sentence')
    ],
    remainder='drop'
)

In [128]:
feature_cols = ['sentence']  #'pos', 
target_cols = ['tag_cat_linguistic', 'tag_cat_personname', 'tag_cat_contextual']

In [129]:
# labels = list(np.unique(df_train_ling[target_col]))
# labels2numbers = LabelEncoder()
# y = labels2numbers.fit_transform(labels)
# label_to_no = dict(zip(labels,list(y)))
# no_to_label = dict(zip(list(y),labels))
# print(label_to_no)

In [130]:
X_train = df_train_grouped[feature_cols]
X_dev = df_dev_grouped[feature_cols]

In [133]:
# y_train = list(df_train_ling[target_col].values)
# y_train_numeric = utils.getNumericLabels(y_train, label_to_no)  # Convert the string labels to numeric labels
# y_train_binarized = mlb.fit_transform(y_train_numeric)          # Convert each iterable of iterables above to a multilabel format
# print(y_train[0])
# print(y_train_numeric[0])
# print(y_train_binarized[0])

y_train = one_hot_encoder.fit_transform(df_train[target_cols]) #[0]])
# y1_train = one_hot_encoder.fit_transform(df_train[target_cols[1]])
# y2_train = one_hot_encoder.fit_transform(df_train[target_cols[2]])

In [134]:
print(y_train.shape)

(452799, 3)


In [136]:
# y_dev = df_dev_ling[target_col].values
# y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)  # Make numeric
# y_dev_binarized = mlb.transform(y_dev_numeric)              # Binarize
# print(y_dev.shape, y_dev_binarized.shape)
y_dev = one_hot_encoder.transform(df_dev[target_cols])

In [137]:
print(y_dev.shape)

(152711, 3)


In [138]:
assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."

#### Train the Model

In [139]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [140]:
pipeline = Pipeline([
    ("col_transformer", col_transformer),
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("classifier", log_reg)
])

In [141]:
clf = pipeline.fit(X_train, y_train)  
# TypeError - Do I need to turn tokens into numbers before getting embedding, like with Stanford blog's approach

TypeError: 'str' object cannot be interpreted as an integer

#### Predict

In [165]:
predicted_dev = clf.predict(X_dev)
print(predicted_dev[0])

[0 0 1]


#### Evaluate Model Performance

Strict evaluation:

In [166]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,B-Linguistic,151320,674,849,356,0.704564,0.557452,0.622434
1,I-Linguistic,152923,276,0,0,0.0,0.0,0.0
2,O,853,356,151044,946,0.993776,0.997649,0.995708


In [167]:
print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))

Dev Accuracy (all labels) on `token` col: 0.9943254633951049


Relaxed evaluation:

<a id="1.2"></a>
### 1.2 Input Tokens Grouped by Sentence

In [257]:
X_train = df_train_ling_grouped[feature_cols]
print(X_train.shape)

(25218, 2)


In [258]:
y_train = df_train_ling_grouped[target_col].values
y_train_numeric = [labels2numbers.fit_transform(labels_list) for labels_list in y_train]
y_train_binarized = [one_hot_encoder.fit_transform((number_list).reshape(-1,1)) for number_list in y_train_numeric]
# y_train_numeric = utils.getNumericLabels(y_train, label_to_no)  # Convert the string labels to numeric labels
# y_train_binarized = mlb.fit_transform(y_train_numeric)          # Convert each iterable of iterables above to a multilabel format
print(y_train.shape)
print(y_train[3])
print(y_train_numeric[3])
print(y_train_binarized[3])

(25218,)
['B-Linguistic', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Linguistic', 'O', 'O', 'O', 'O', 'O', 'O']
[0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1]
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 1)	1.0
  (4, 1)	1.0
  (5, 1)	1.0
  (6, 1)	1.0
  (7, 1)	1.0
  (8, 1)	1.0
  (9, 1)	1.0
  (10, 0)	1.0
  (11, 1)	1.0
  (12, 1)	1.0
  (13, 1)	1.0
  (14, 1)	1.0
  (15, 1)	1.0
  (16, 1)	1.0


In [206]:
y_train[0]

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

In [211]:
# y_encoded = LabelEncoder().fit_transform(y_train[3])
y_encoded

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [202]:
y = y.reshape(-1, 1)
ohe = one_hot_encoder.fit(y)

In [204]:
ohe.fit_transform(y_train_numeric)

  return array(a, dtype, copy=False, order=order)


ValueError: Expected 2D array, got 1D array instead:
array=[list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 ... list([2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
 list([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.