# Baseline Gender Bias Multilabel Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: under directory `../data/token_clf_data/multilabel_model_output/`
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Feminine, Masculine (Non-binary was not applied during annotation)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering was applied by only one annotator, too few times for training)
* Word embeddings: custom fastText embeddings

***

### Table of Contents

[0.](#0) Preprocessing

[1.](#CC) Classifier Chain Models

[2.](#2) Person Name Model

[3.](#3) Linguistic Model

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For embeddings
from gensim.models import FastText
from gensim import utils as gensim_utils

# For classification
import sklearn.metrics as metrics
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
# Base estimators
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier, PassiveAggressiveClassifier

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [52]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


In [53]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. In addition, only one token carries this label, which prevents the data from being passed to models trained with cross-validation.

In [54]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [55]:
df_train.shape

(463439, 9)

***
#### Optional Preprocessing Step

If not classifying all labels at once, keep only the tags for the selected subset of labels and replace every tag outside that subset with `"O"`:

In [56]:
# cont_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
pers_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine"]#, "B-Nonbinary", "I-Nonbinary"]
# ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", pers_label_subset)
# df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", ling_label_subset)
print(df_train.shape, df_dev.shape)

(463439, 9) (156146, 9)
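
The body of `utils.selectDataForLabels` is not shown in this notebook. The sketch below is one plausible way to do the replacement for a single DataFrame, assuming the function only overwrites tags and leaves the row count unchanged (which would match the shapes printed above). The name `keep_label_subset` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for utils.selectDataForLabels (single DataFrame only).
def keep_label_subset(df, tag_col, label_subset, no_tag="O"):
    out = df.copy()
    # Keep tags that are in the subset; overwrite every other tag with no_tag.
    out[tag_col] = out[tag_col].where(out[tag_col].isin(label_subset), no_tag)
    return out

toy = pd.DataFrame({
    "token_id": [7, 7, 8, 9],
    "token": ["The", "The", "Rev.", "minister"],
    "tag": ["B-Unknown", "B-Occupation", "I-Unknown", "B-Occupation"],
})
result = keep_label_subset(toy, "tag", ["B-Unknown", "I-Unknown"])
print(result)
```

Because rows are overwritten rather than dropped, a token that had a tag outside the subset keeps a row with `"O"` next to any tags that remain.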


***

Group the data by token, so there is one row per token rather than one row per token-tag pair:

In [57]:
subdf_train = df_train.drop(columns=["description_id", "field", "subset", "token_offsets"])
subdf_dev = df_dev.drop(columns=["description_id", "field", "subset", "token_offsets"])
df_train_imploded = utils.implodeDataFrame(subdf_train, ["sentence_id", "token_id", "token", "pos"])
df_train_imploded = df_train_imploded.reset_index()
df_dev_imploded = utils.implodeDataFrame(subdf_dev, ["sentence_id", "token_id", "token", "pos"])
df_dev_imploded = df_dev_imploded.reset_index()
df_dev_imploded.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag
0,5,154,After,IN,[O]
1,5,155,his,PRP$,[O]
2,5,156,ordination,NN,[O]
3,5,157,he,PRP,[O]
4,5,158,spent,VBD,[O]
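
`utils.implodeDataFrame` is a custom helper whose code is not shown here. A minimal sketch of the same idea, assuming it amounts to a group-by that collects each token's tags into a list, could look like this (the toy data is made up):

```python
import pandas as pd

# Toy token-tag pairs: token 7 carries two tags, token 8 carries one.
pairs = pd.DataFrame({
    "sentence_id": [1, 1, 1],
    "token_id": [7, 7, 8],
    "token": ["The", "The", "Rev."],
    "pos": ["DT", "DT", "NNP"],
    "tag": ["B-Unknown", "B-Masculine", "I-Unknown"],
})
key_cols = ["sentence_id", "token_id", "token", "pos"]

# "Implode": one row per token, with all of its tags in a list-valued cell.
imploded = pairs.groupby(key_cols, sort=False)["tag"].agg(list).reset_index()

# "Explode" reverses the step, giving one row per token-tag pair again.
exploded = imploded.explode("tag").reset_index(drop=True)
print(imploded)
```

The `explode` call is the same pandas method used later in this notebook to turn the list-valued `tag` column back into one row per expected tag.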


Associate word embeddings to the tokens:

In [58]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[1]
file_name = config.fasttext_path+"fasttext{}_lowercased.model".format(d)
embedding_model = FastText.load(file_name)

In [59]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


Vectorize and binarize the data:

In [60]:
mlb = MultiLabelBinarizer()

In [61]:
target_col = "tag"
feature_cols = ["token_id", "token"]
train_data = df_train_imploded
dev_data = df_dev_imploded

In [62]:
train_data.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag
0,1,3,Title,NN,[O]
1,1,4,:,:,[O]
2,1,5,Papers,NNS,[O]
3,1,6,of,IN,[O]
4,1,7,The,DT,"[O, B-Unknown, B-Masculine]"


Extract features:

In [63]:
# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def makeFeatureMatrix(token_data):
    feature_list = [extractEmbedding(token) for token_id,token in token_data]
    return np.array(feature_list)

In [64]:
train_tokens = list(zip(train_data[feature_cols[0]], train_data[feature_cols[1]]))
dev_tokens = list(zip(dev_data[feature_cols[0]], dev_data[feature_cols[1]]))

In [65]:
X_train = makeFeatureMatrix(train_tokens)
X_dev = makeFeatureMatrix(dev_tokens)
print(X_train.shape, X_dev.shape)  # number_of_samples, number_of_features

(452086, 100) (152455, 100)
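
One detail of `extractEmbedding` is easy to miss: the token is lowercased only when `str.isalpha()` is true, so tokens containing punctuation or digits are looked up unchanged. The sketch below isolates that rule; `normalize_token` is a hypothetical name used only for this illustration.

```python
# Mirrors the normalisation rule inside extractEmbedding.
def normalize_token(token):
    return token.lower() if token.isalpha() else token

examples = ["Title", "Papers", "Rev.", "MS-123", "1920s", ":"]
normalized = {token: normalize_token(token) for token in examples}
for original, result in normalized.items():
    print(f"{original!r} -> {result!r}")
```

A token such as `Rev.` therefore keeps its capital letter when it is looked up. fastText can typically still build a vector for such a token from its character n-grams, but the result may differ from the vector of a lowercased form.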


Binarize targets:

In [66]:
y_train_labels = train_data[target_col]
y_train = mlb.fit_transform(y_train_labels)
y_dev_labels = dev_data[target_col]
y_dev = mlb.transform(y_dev_labels)
print(y_train.shape, y_dev.shape)  # number_of_samples, number_of_labels

(452086, 7) (152455, 7)
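
The seven columns correspond to the six Person Name tags plus `"O"`. As an illustration of what the binarization produces, here is a plain-Python sketch of multi-hot encoding on made-up tag lists. It is not the notebook's code, which uses `MultiLabelBinarizer`, but it orders the classes by sorting in the same way.

```python
# Toy tag lists, one list per token.
tag_lists = [["O"], ["O", "B-Unknown", "B-Masculine"], ["I-Unknown"]]

# Column order: the sorted set of all tags seen in the data.
classes = sorted({tag for tags in tag_lists for tag in tags})
column = {tag: i for i, tag in enumerate(classes)}

def binarize(tags):
    # One row per token: 1 in the column of each tag the token carries.
    row = [0] * len(classes)
    for tag in tags:
        row[column[tag]] = 1
    return row

y = [binarize(tags) for tags in tag_lists]
print(classes)
print(y)
```

A row that sums to more than 1 is a multilabelled token, which is what the next cell checks for.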


In [67]:
# Check whether any token (row) has more than one label set
if (y_train.sum(axis=1) > 1).any():
    print("Multilabelled tokens exist, as expected.")

Multilabelled tokens exist, as expected.


For baseline models, use only the tokens' embeddings as features.

<a id="CC"></a>
## 1. Classifier Chain Models

*Reference: http://scikit.ml/api/skmultilearn.problem_transform.cc.html#skmultilearn.problem_transform.ClassifierChain*

#### Train & Predict

In [59]:
a = "rf"
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)

In [60]:
predictions = clf.predict(X_dev)
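
For readers unfamiliar with classifier chains, the sketch below shows the general idea on random toy data: one binary classifier per label, where each classifier also receives the earlier labels as extra features. It is an illustration under assumptions, not the scikit-multilearn implementation. The helpers `fit_chain` and `predict_chain` are hypothetical, and `LogisticRegression` replaces the random forest only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one binary classifier per label, in column order."""
    models = []
    for j in range(Y.shape[1]):
        # Features for label j: original features plus true labels 0..j-1.
        X_aug = np.hstack([X, Y[:, :j]])
        models.append(LogisticRegression().fit(X_aug, Y[:, j]))
    return models

def predict_chain(models, X):
    preds = np.zeros((X.shape[0], len(models)), dtype=int)
    for j, model in enumerate(models):
        # At prediction time the earlier labels are the chain's own predictions.
        X_aug = np.hstack([X, preds[:, :j]])
        preds[:, j] = model.predict(X_aug)
    return preds

rng = np.random.default_rng(22)
X = rng.normal(size=(200, 5))
# Toy labels: label 1 can only be set when label 0 is set.
y0 = (X[:, 0] > 0).astype(int)
y1 = ((X[:, 1] > 0) & (y0 == 1)).astype(int)
Y = np.column_stack([y0, y1])

models = fit_chain(X, Y)
Y_pred = predict_chain(models, X)
print(Y_pred.shape)
```

The second classifier sees six inputs (five features plus label 0), which is how a chain can pick up dependencies between labels that independent per-label classifiers would miss.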

#### Evaluate: All Labels

In [61]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9139355112811681
Precision - macro: 0.5117583387635518

Recall - weighted: 0.9275677891204386
Recall - macro: 0.33299717339152485

F1 Score - weighted: 0.9144440009060665
F1 Score - macro: 0.3739269852824525

Accuracy - normalized: 0.9373192089469023
Accuracy - unnormalized: 142899
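
The gap between the weighted and macro scores comes from how the per-label scores are averaged: macro gives every label equal weight, while weighted scales each label by its number of true instances, so a frequent label such as `"O"` dominates. The sketch below uses made-up per-label scores to show the effect. It also recomputes the accuracy printed above, which for multilabel targets is subset accuracy (a token counts as correct only if its whole label set matches).

```python
# Hypothetical per-label F1 scores and supports (count of true instances);
# these are illustrative numbers, not results from this notebook.
per_label = {
    "O":         {"f1": 0.98, "support": 9000},
    "B-Unknown": {"f1": 0.50, "support": 600},
    "I-Unknown": {"f1": 0.30, "support": 400},
}

# Macro: unweighted mean over labels.
macro_f1 = sum(v["f1"] for v in per_label.values()) / len(per_label)

# Weighted: mean over labels, weighted by support.
total = sum(v["support"] for v in per_label.values())
weighted_f1 = sum(v["f1"] * v["support"] for v in per_label.values()) / total

# Subset accuracy from the counts printed above:
# exactly matched dev tokens / all dev tokens.
subset_accuracy = 142899 / 152455

print(round(macro_f1, 4), round(weighted_f1, 4), round(subset_accuracy, 4))
```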


#### Evaluate: Each Label

In [62]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,5,154,After,IN,O
1,5,155,his,PRP$,B-Gendered-Pronoun
2,5,156,ordination,NN,O
3,5,157,he,PRP,B-Gendered-Pronoun
4,5,158,spent,VBD,O


In [63]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [64]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative
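
The `_merge` column suggests that `utils.makeEvaluationDataFrame` builds on an outer merge with an indicator, although its code is not shown here. The sketch below illustrates that approach on toy data. The `outcome` helper, the shared `tag` column, and the handling of unmatched `"O"` rows are all assumptions made for this example.

```python
import pandas as pd

expected = pd.DataFrame({
    "token_id": [154, 155, 157, 158],
    "tag": ["O", "B-Gendered-Pronoun", "B-Gendered-Pronoun", "O"],
})
predicted = pd.DataFrame({
    "token_id": [154, 155, 157, 158],
    "tag": ["O", "O", "B-Gendered-Pronoun", "B-Unknown"],
})

# Outer merge on token and tag: "both" means expected and predicted agree.
merged = expected.merge(predicted, how="outer", on=["token_id", "tag"], indicator=True)
merged["_merge"] = merged["_merge"].astype(str)

def outcome(row, no_tag="O"):
    if row["_merge"] == "both":
        return "true negative" if row["tag"] == no_tag else "true positive"
    if row["tag"] == no_tag:
        # Assumption: the paired non-O row already records this error.
        return "unmatched O"
    return "false negative" if row["_merge"] == "left_only" else "false positive"

merged["outcome"] = merged.apply(outcome, axis=1)
counts = merged["outcome"].value_counts().to_dict()
print(merged)
```

Rows found only in the expected data become false negatives and rows found only in the predictions become false positives, which matches the pattern of an expected tag paired with an empty predicted tag.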


Save the data:

In [65]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_baseline_fastText{d}_predictions.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [66]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [67]:
label_tags = [ 
    'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
    'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
    'B-Generalization', 'I-Generalization', 
    'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [68]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,11310,9593,140352,4484,0.511758,0.332997,0.373927
0,B-Unknown,1321,369,0,1130,0.753836,0.461036,0.572152
0,I-Unknown,2423,509,0,1278,0.715165,0.345312,0.465743
0,B-Feminine,105,78,0,352,0.818605,0.770241,0.793687
0,I-Feminine,505,58,0,340,0.854271,0.402367,0.547064
0,B-Masculine,468,262,0,1000,0.792393,0.681199,0.732601
0,I-Masculine,1014,306,0,452,0.596306,0.308322,0.406475
0,B-Gendered-Pronoun,9,178,0,1470,0.89199,0.993915,0.940198
0,I-Gendered-Pronoun,15,0,0,0,0.0,0.0,0.0
0,B-Gendered-Role,140,159,0,870,0.845481,0.861386,0.853359
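
The precision, recall, and F1 columns follow from the count columns in the usual way. As a check, the sketch below recomputes the B-Unknown row from its counts and reproduces the values in the table (0.753836, 0.461036, 0.572152). The `scores` helper is a hypothetical name, and returning 0.0 for empty denominators is an assumption that mirrors `zero_division=0` above.

```python
def scores(tp, fp, fn):
    # Precision: share of predicted tags that were correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of expected tags that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts copied from the B-Unknown row of the table above.
precision, recall, f1 = scores(tp=1130, fp=369, fn=1321)
print(round(precision, 6), round(recall, 6), round(f1, 6))

# I-Gendered-Pronoun has no true or false positives, so the guards return 0.0.
empty = scores(tp=0, fp=0, fn=15)
```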


Save the data:

In [69]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_strict_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [70]:
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"], 
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [71]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()
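
To see how loose agreement differs from strict agreement, the following plain-Python sketch applies the same tag-to-label replacement to a few made-up (expected, predicted) pairs. The `to_label` helper and the example pairs are hypothetical.

```python
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"],
}
# Invert the label -> tags dictionary into a tag -> label lookup.
tag_to_label = {tag: label for label, tags in label_tags.items() for tag in tags}

def to_label(tag):
    # Tags without an entry (such as "O") pass through unchanged.
    return tag_to_label.get(tag, tag)

pairs = [("B-Unknown", "I-Unknown"), ("B-Gendered-Pronoun", "O"), ("O", "O")]
loose = [(to_label(expected), to_label(predicted)) for expected, predicted in pairs]

strict_matches = sum(e == p for e, p in pairs)
loose_matches = sum(e == p for e, p in loose)
print(strict_matches, loose_matches)
```

The first pair disagrees on the IOB prefix only, so it counts as a match under loose agreement but not under strict agreement.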

In [72]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [73]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [74]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Unknown,3744.0,0.0,0.0,2408.0,1.0,0.391417,0.562617
0,Feminine,610.0,0.0,0.0,692.0,1.0,0.53149,0.694082
0,Masculine,1482.0,0.0,0.0,1452.0,1.0,0.494888,0.662107
0,Gendered Pronoun,24.0,0.0,0.0,1470.0,1.0,0.983936,0.991903
0,Gendered Role,258.0,0.0,0.0,870.0,1.0,0.771277,0.870871
0,Generalization,322.0,0.0,0.0,126.0,1.0,0.28125,0.439024
0,Stereotype,930.0,0.0,0.0,86.0,1.0,0.084646,0.15608
0,Omission,1994.0,0.0,0.0,1032.0,1.0,0.341044,0.508625
0,Occupation,1019.0,0.0,0.0,832.0,1.0,0.449487,0.620201


Save the data:

In [75]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_loose_agmt.csv".format(a=a,d=d))

<a id="2"></a>

## 2. Person Name Model

Create multilabel models with the `PassiveAggressiveClassifier` for the Person Name category of labels in order to compare their performance to the sequence classifier's performance (the Passive Aggressive algorithm was top-performing for the baseline Person Name sequence classification model).

Then, try the `RandomForestClassifier`, since it yielded high performance in the optimization experiments.

#### Hypothesis
* The baseline sequence classifiers will outperform (F1 score >0.1 higher) the baseline multilabel token classifiers for labels in the Person Name category.

In [87]:
a = "pa"
pn_clf = ClassifierChain(classifier=PassiveAggressiveClassifier(
        max_iter=100, 
        loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
        random_state=22,
    )
)
# a = "rf"
# pn_clf = ClassifierChain(classifier=RandomForestClassifier(random_state=22))
pn_clf.fit(X_train, y_train)

In [88]:
predictions = pn_clf.predict(X_dev)

#### Evaluate

In [89]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9246753427798468
Precision - macro: 0.29125171686438994

Recall - weighted: 0.9385097847295113
Recall - macro: 0.29917853276102013

F1 Score - weighted: 0.9308920661329323
F1 Score - macro: 0.2790846321002277

Accuracy - normalized: 0.8960086582926109
Accuracy - unnormalized: 136601


In [90]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,5,154,After,IN,O
1,5,155,his,PRP$,O
2,5,156,ordination,NN,O
3,5,157,he,PRP,I-Unknown
3,5,157,he,PRP,O


In [91]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [92]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,O,O,true negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,O,O,true negative
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [93]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_baseline_fastText{d}_pn_predictions.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [94]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [95]:
label_tags = [ 
    'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
#     'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
#     'B-Generalization', 'I-Generalization', 
#     'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [96]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,8061,9982,145151,1042,0.291252,0.299179,0.279085
0,B-Unknown,1051,1440,0,730,0.336406,0.409882,0.369527
0,I-Unknown,2319,2247,0,714,0.241135,0.23541,0.238238
0,B-Feminine,156,42,0,104,0.712329,0.4,0.512315
0,I-Feminine,382,320,0,234,0.422383,0.37987,0.4
0,B-Masculine,431,658,0,168,0.20339,0.280467,0.235789
0,I-Masculine,839,725,0,134,0.155995,0.137718,0.146288


Save the data:

In [97]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_pn_strict_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [98]:
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
#     "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
#     "Generalization": ["B-Generalization", "I-Generalization"], 
#     "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [99]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [100]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [101]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [102]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Unknown,3370.0,0.0,0.0,1444.0,1.0,0.299958,0.461489
0,Feminine,538.0,0.0,0.0,338.0,1.0,0.385845,0.556837
0,Masculine,1270.0,0.0,0.0,302.0,1.0,0.192112,0.322305


For a Classifier Chain, the Random Forest estimator performs better than the Passive Aggressive estimator.

Compared to the Baseline Sequence Classifier for Person Names:
* Unknown F1: 0.597172
* Feminine F1: 0.767750
* Masculine F1: 0.599679
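
The hypothesis for this section can be checked against these numbers with a few lines of arithmetic. The sketch assumes that the three values listed above are the baseline sequence classifier's F1 scores and that they are comparable to the loose-agreement F1 scores of the Passive Aggressive run shown in this section. Both points are assumptions, since the notebook does not state them explicitly.

```python
# F1 scores copied from this section (see the assumptions in the text above).
sequence_f1 = {"Unknown": 0.597172, "Feminine": 0.767750, "Masculine": 0.599679}
token_f1 = {"Unknown": 0.461489, "Feminine": 0.556837, "Masculine": 0.322305}

margin = 0.1  # threshold named in the hypothesis

# Gap per label: sequence classifier F1 minus token classifier F1.
gaps = {label: sequence_f1[label] - token_f1[label] for label in sequence_f1}
supported = {label: gap > margin for label, gap in gaps.items()}

for label, gap in gaps.items():
    print(f"{label}: gap = {gap:.6f}, exceeds margin: {supported[label]}")
```

Under those assumptions, the gap exceeds 0.1 for all three labels in this run.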

Save the data:

In [103]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_pn_loose_agmt.csv".format(a=a,d=d))

<a id="3"></a>

## 3. Linguistic Model

Create multilabel models with the `PassiveAggressiveClassifier` for the Linguistic category of labels in order to compare their performance to the sequence classifier's performance (the Passive Aggressive algorithm was top-performing for the baseline Linguistic sequence classification model).

Then, try the `RandomForestClassifier`, since it yielded high performance in the optimization experiments.

#### Hypothesis
* The baseline sequence classifiers will have worse performance (F1 score >=0.1 lower) than the baseline multilabel token classifiers for labels in the Linguistic category.

In [35]:
a = "pa"
l_clf = ClassifierChain(classifier=PassiveAggressiveClassifier(
        max_iter=100, 
        loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
        random_state=22,
    )
)
# a = "rf"
# l_clf = ClassifierChain(classifier = RandomForestClassifier(random_state=22))
l_clf.fit(X_train, y_train)

In [36]:
predictions = l_clf.predict(X_dev)

#### Evaluate

In [37]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9908974243394203
Precision - macro: 0.36544387681923524

Recall - weighted: 0.9909878162584754
Recall - macro: 0.3336048125286899

F1 Score - weighted: 0.9908594769466269
F1 Score - macro: 0.34610025735673317

Accuracy - normalized: 0.987091272834607
Accuracy - unnormalized: 150487


In [38]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,5,154,After,IN,O
1,5,155,his,PRP$,O
2,5,156,ordination,NN,O
3,5,157,he,PRP,B-Gendered-Pronoun
4,5,158,spent,VBD,O


In [39]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [40]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,,false negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [41]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_baseline_fastText{d}_ling_predictions.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [42]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [43]:
label_tags = [ 
#     'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
    'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
    'B-Generalization', 'I-Generalization', 
#     'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [44]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,922,844,153929,637,0.365444,0.333605,0.3461
0,B-Gendered-Pronoun,196,95,0,1048,0.916885,0.842444,0.87809
0,I-Gendered-Pronoun,14,8,0,0,0.0,0.0,0.0
0,B-Gendered-Role,139,22,0,132,0.857143,0.487085,0.621176
0,I-Gendered-Role,65,22,0,0,0.0,0.0,0.0
0,B-Generalization,84,123,0,94,0.43318,0.52809,0.475949
0,I-Generalization,44,20,0,0,0.0,0.0,0.0


Save the data:

In [45]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_ling_strict_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [46]:
label_tags = {
#     "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"], 
#     "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [47]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [48]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [49]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [50]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Gendered Pronoun,210.0,0.0,0.0,1048.0,1.0,0.833068,0.908933
0,Gendered Role,204.0,0.0,0.0,132.0,1.0,0.392857,0.564103
0,Generalization,128.0,0.0,0.0,94.0,1.0,0.423423,0.594937


With a Classifier Chain, the Random Forest estimator yields better results than the Passive Aggressive estimator.

Compared to the Baseline Sequence Classifier:
* Gendered-Pronoun F1: 0.872418
* Gendered-Role F1: 0.659875
* Generalization F1: 0.319392


Save the data:

In [51]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_ling_loose_agmt.csv".format(a=a,d=d))