# Baseline Gender Bias Multilabel Token Classifiers with fastText

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: under directory `../data/token_clf_data/multilabel_model_output/`
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not applied during annotation)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering only applied by one annotator and too few times for training)
* Word embeddings: custom fastText embeddings

***

### Table of Contents

[0.](#0) Preprocessing

[1.](#CC) Classifier Chain Model

[2.](#2) Person Name Model

[3.](#3) Linguistic Model

***

Load necessary libraries:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For embeddings
from gensim.models import FastText
from gensim import utils as gensim_utils

# For classification
import sklearn.metrics as metrics
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
# Base estimators
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier, PassiveAggressiveClassifier

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [2]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


In [3]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. In addition, only one token has this label, which prevents the data from being input into the models with cross-validation.

In [4]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [5]:
df_train.shape

(463439, 9)

***
#### Optional Preprocessing Step

If not classifying all labels at once, keep only the tags for the selected subset of labels and replace every tag outside that subset with `"O"`:

In [7]:
# cont_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]
perso_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]#, "B-Nonbinary", "I-Nonbinary"]
category = "pers_o"
# ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# category = "linguistic"
df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", perso_label_subset)
# df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", ling_label_subset)
# print(df_train.shape, df_dev.shape)
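
`utils.selectDataForLabels` is a project helper whose implementation is not shown in this notebook. The following is only a minimal sketch of the behaviour described above, assuming the helper simply overwrites out-of-subset tags with `"O"`. The function name `keep_only_labels` and the toy data are hypothetical.

```python
import pandas as pd

def keep_only_labels(df, tag_col, label_subset, no_tag="O"):
    """Return a copy of df in which tags outside label_subset become no_tag."""
    out = df.copy()
    # Series.where keeps values where the condition holds and substitutes no_tag elsewhere
    out[tag_col] = out[tag_col].where(out[tag_col].isin(label_subset), no_tag)
    return out

# Hypothetical toy data, not taken from the real training set
toy = pd.DataFrame({
    "token": ["Papers", "of", "The", "he"],
    "tag": ["O", "O", "B-Unknown", "B-Gendered-Pronoun"],
})
subset = ["B-Unknown", "I-Unknown"]
filtered = keep_only_labels(toy, "tag", subset)
print(filtered["tag"].tolist())
```

Rows are kept rather than dropped, so every token remains available as a negative (`"O"`) example for the selected labels.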

***

Group the data by token, so there is one row per token rather than one row per token-tag pair:

In [8]:
subdf_train = df_train.drop(columns=["description_id", "field", "subset", "token_offsets"])
subdf_dev = df_dev.drop(columns=["description_id", "field", "subset", "token_offsets"])
df_train_imploded = utils.implodeDataFrame(subdf_train, ["sentence_id", "token_id", "token", "pos"])
df_train_imploded = df_train_imploded.reset_index()
df_dev_imploded = utils.implodeDataFrame(subdf_dev, ["sentence_id", "token_id", "token", "pos"])
df_dev_imploded = df_dev_imploded.reset_index()
df_dev_imploded.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag
0,5,154,After,IN,[O]
1,5,155,his,PRP$,[O]
2,5,156,ordination,NN,[O]
3,5,157,he,PRP,[O]
4,5,158,spent,VBD,[O]
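
`utils.implodeDataFrame` is also a project helper. One way to get the same one-row-per-token shape, assuming the helper groups on the given columns and gathers the remaining column into lists, is a pandas `groupby` followed by `agg(list)`. The function name `implode` and the toy rows are hypothetical.

```python
import pandas as pd

def implode(df, group_cols):
    """Collapse rows that share group_cols, collecting the other columns into lists."""
    return df.groupby(group_cols, sort=False).agg(list)

# Hypothetical toy rows: token 7 carries two tags, so it appears twice
toy = pd.DataFrame({
    "sentence_id": [1, 1, 1],
    "token_id": [6, 7, 7],
    "token": ["of", "The", "The"],
    "pos": ["IN", "DT", "DT"],
    "tag": ["O", "B-Unknown", "B-Masculine"],
})
imploded = implode(toy, ["sentence_id", "token_id", "token", "pos"]).reset_index()
print(imploded)
```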


Replace the tags with label names (remove ``B-`` and ``I-``):

In [9]:
def getLabelColFromTagCol(df, col):
    col_list = list(df[col])
    new_col = []
    for value_list in col_list:
        new_value_list = []
        for value in value_list:
            if value != "O":
                new_value = value[2:]
                new_value_list += [new_value]
            else:
                new_value_list += [value]
        # Remove any duplicates from the list of labels
        unique_values = list(set(new_value_list))
        # Sort the list of labels alphabetically
        unique_values.sort()
        new_col += [unique_values]
    assert len(new_col) == len(col_list)
    return new_col
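
As a quick check of the logic above, here is a compact equivalent written with a set comprehension and applied to a few hand-made tag lists. The name `tags_to_labels` and the example input are illustrative only.

```python
def tags_to_labels(tag_lists, no_tag="O"):
    """Strip the two-character IOB prefix, then deduplicate and sort each list."""
    return [
        sorted({tag if tag == no_tag else tag[2:] for tag in tags})
        for tags in tag_lists
    ]

example = [["O"], ["O", "B-Unknown", "B-Masculine"], ["B-Feminine", "I-Feminine"]]
print(tags_to_labels(example))
```

The second list reproduces the `[Masculine, O, Unknown]` ordering that appears in the training data preview, and the third shows `B-` and `I-` tags of the same label collapsing into one entry.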

In [10]:
train_labels = getLabelColFromTagCol(df_train_imploded, "tag")
# train_labels[:10]  # Looks good
dev_labels = getLabelColFromTagCol(df_dev_imploded, "tag")
# dev_labels[:10] # Looks good

In [11]:
df_train_imploded.insert(len(df_train_imploded.columns), "label", train_labels)
df_train_imploded.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag,label
0,1,3,Title,NN,[O],[O]
1,1,4,:,:,[O],[O]
2,1,5,Papers,NNS,[O],[O]
3,1,6,of,IN,[O],[O]
4,1,7,The,DT,"[O, B-Unknown, B-Masculine]","[Masculine, O, Unknown]"


In [12]:
df_dev_imploded.insert(len(df_dev_imploded.columns), "label", dev_labels)
df_dev_imploded.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag,label
0,5,154,After,IN,[O],[O]
1,5,155,his,PRP$,[O],[O]
2,5,156,ordination,NN,[O],[O]
3,5,157,he,PRP,[O],[O]
4,5,158,spent,VBD,[O],[O]


Associate word embeddings to the tokens:

In [13]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[1]
file_name = config.fasttext_path+"fasttext{}_lowercased.model".format(d)
embedding_model = FastText.load(file_name)

In [14]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


Vectorize and binarize the data:

In [15]:
mlb = MultiLabelBinarizer()

In [16]:
target_col = "label"
feature_cols = ["token_id", "token"]
train_data = df_train_imploded
dev_data = df_dev_imploded

Extract features:

In [17]:
# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def makeFeatureMatrix(token_data):
    feature_list = [extractEmbedding(token) for token_id,token in token_data]
    return np.array(feature_list)
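
Note that `extractEmbedding` lowercases a token only when `str.isalpha()` is true, so tokens containing digits, hyphens, or apostrophes are looked up in their original case. The sketch below isolates that rule. The helper name `normalize` and the sample tokens are hypothetical. Because fastText composes vectors from character n-grams, such tokens would typically still receive a vector even if the cased form was never seen in training.

```python
def normalize(token):
    """Mirror the casing rule in extractEmbedding: lowercase purely alphabetic tokens only."""
    return token.lower() if token.isalpha() else token

samples = ["Papers", "McDonald", "O'Neill", "1st", "Anglo-Saxon"]
normalized = [normalize(t) for t in samples]
print(normalized)
```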

In [18]:
train_tokens = list(zip(train_data[feature_cols[0]], train_data[feature_cols[1]]))
dev_tokens = list(zip(dev_data[feature_cols[0]], dev_data[feature_cols[1]]))

In [19]:
X_train = makeFeatureMatrix(train_tokens)
X_dev = makeFeatureMatrix(dev_tokens)
print(X_train.shape, X_dev.shape)  # number_of_samples, number_of_features

(452086, 100) (152455, 100)


Binarize targets:

In [20]:
y_train_labels = train_data[target_col]
y_train = mlb.fit_transform(y_train_labels)
y_dev_labels = dev_data[target_col]
y_dev = mlb.transform(y_dev_labels)
print(y_train.shape, y_dev.shape)  # number_of_samples, number_of_labels

(452086, 5) (152455, 5)
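
The five target columns correspond to the labels in the selected subset plus `"O"`. `MultiLabelBinarizer` orders its classes alphabetically when none are passed explicitly, which matters when reading per-label scores later. A small illustration on made-up label lists:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists in the same format as the "label" column
toy_train = [["O"], ["Masculine", "O", "Unknown"], ["Feminine"], ["Occupation"]]
toy_dev = [["Unknown"], ["Feminine", "Occupation"]]

toy_mlb = MultiLabelBinarizer()
Y_train = toy_mlb.fit_transform(toy_train)  # learns the class order from training labels
Y_dev = toy_mlb.transform(toy_dev)          # reuses that order for the dev labels
print(toy_mlb.classes_)
print(Y_train)
print(toy_mlb.inverse_transform(Y_dev))
```

Fitting on the training labels only, as done above, keeps the column order identical between `y_train` and `y_dev`.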


In [21]:
for labels in y_train:
    if sum(labels) > 1:
        print("Multilabelled tokens exist, as expected.")
        break

Multilabelled tokens exist, as expected.


For baseline models, use only the tokens' embeddings as features.

<a id="CC"></a>
## 1. Classifier Chain Model

*Reference: http://scikit.ml/api/skmultilearn.problem_transform.cc.html#skmultilearn.problem_transform.ClassifierChain*
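
A classifier chain trains one binary classifier per label, feeding each classifier the original features plus the labels earlier in the chain. The sketch below illustrates the idea on synthetic data. It uses scikit-learn's own `sklearn.multioutput.ClassifierChain` rather than the scikit-multilearn class used in this notebook, so the two interfaces differ. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain as SkClassifierChain

rng = np.random.default_rng(22)
X = rng.normal(size=(200, 4))
# Label 0 depends on feature 0; label 1 depends on feature 1 and on label 0
y0 = (X[:, 0] > 0).astype(int)
y1 = ((X[:, 1] > 0) & (y0 == 1)).astype(int)
Y = np.column_stack([y0, y1])

chain = SkClassifierChain(LogisticRegression(), order=[0, 1], random_state=22)
chain.fit(X, Y)
Y_pred = chain.predict(X)
print(Y_pred.shape)
# The second link is trained on the 4 features plus the 1 earlier label
print(chain.estimators_[1].coef_.shape)
```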

#### Train & Predict

In [22]:
a = "rf"  #"pa"

In [23]:
if a == "rf":
    clf = ClassifierChain(
        classifier = RandomForestClassifier(random_state=22),
    )
elif a == "pa":
    clf = ClassifierChain(
        classifier = PassiveAggressiveClassifier(
            max_iter=100, 
            loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
            random_state=22,
        )
    )
else:
    raise ValueError("a not recognized: {}".format(a))
    
clf.fit(X_train, y_train)

In [24]:
predictions = clf.predict(X_dev)

#### Evaluate: All Labels

In [25]:
print("Precision - micro:", metrics.precision_score(y_dev, predictions, average="micro", zero_division=0))  # micro = calculated from TP, FP, FN sums across all labels
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - micro:", metrics.recall_score(y_dev, predictions, average="micro", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - micro:", metrics.f1_score(y_dev, predictions, average="micro", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - micro: 0.9620258911430498
Precision - macro: 0.7574251504839203

Recall - micro: 0.9545840680627599
Recall - macro: 0.4977462402132226

F1 Score - micro: 0.9582905320282973
F1 Score - macro: 0.5866765372981741

Accuracy - normalized: 0.9539470663474467
Accuracy - unnormalized: 145434


In [26]:
print("Precision - per label:", metrics.precision_score(y_dev, predictions, average=None, zero_division=0))
print()
print("Recall - per label:", metrics.recall_score(y_dev, predictions, average=None, zero_division=0))
print()
print("F1 Score - per label:", metrics.f1_score(y_dev, predictions, average=None, zero_division=0))

Precision - per label: [0.80594059 0.66769231 0.97170646 0.70070922 0.64107717]

Recall - per label: [0.43114407 0.39544419 0.99363917 0.34425087 0.3242529 ]

F1 Score - per label: [0.56176674 0.49670959 0.98255044 0.46168224 0.43067369]


*Note: the scores above are calculated with the `"O"` label*
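
The gap between the micro and macro scores follows from how they are averaged. Macro is the plain mean of the per-label scores, while micro pools true positives, false positives, and false negatives across labels, so it is dominated by the frequent `"O"` label. Assuming the alphabetical class order of `MultiLabelBinarizer`, `"O"` is the third entry in the per-label arrays. The first part of the sketch recomputes the macro F1 from the per-label values printed above, and the second part uses made-up predictions to contrast the two averages.

```python
import numpy as np
from sklearn.metrics import f1_score

# Per-label F1 scores copied from the output above
per_label_f1 = np.array([0.56176674, 0.49670959, 0.98255044, 0.46168224, 0.43067369])
macro_f1 = per_label_f1.mean()
print(round(macro_f1, 6))

# Toy data: a frequent label predicted perfectly, a rare label always missed
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 0]])
toy_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
toy_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(toy_micro, toy_macro)
```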

#### Evaluate: Each Label

In [27]:
pred_df = utils.makePredictionDF(predictions, dev_data, "label", "predicted_label", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag,predicted_label
0,5,154,After,IN,[O],O
1,5,155,his,PRP$,[O],O
2,5,156,ordination,NN,[O],O
3,5,157,he,PRP,[O],O
4,5,158,spent,VBD,[O],O


In [28]:
exp_df = dev_data.explode(["label"])
exp_df = exp_df.rename(columns={"label":"expected_label"})
# exp_df.head()

In [29]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_label"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_label"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_label", "predicted_label", "_merge"],  # final column list
    "expected_label",
    "predicted_label", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_label,predicted_label,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,O,O,true negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,O,O,true negative
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [30]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_{c}_baseline_fastText{d}_predictions.csv".format(a=a,d=d,c=category))

##### Strict Agreement

Calculate precision, recall, and F1 score at the token level for each label:

In [65]:
if category == "linguistic":
    labels = ['Gendered-Pronoun','Gendered-Role', 'Generalization']   # Linguistic category of labels
elif category == "pers_o":
    labels = ['Feminine', 'Masculine', 'Unknown', 'Occupation']      # Person Name labels plus Occupation
else:
    print("Category not recognized:",category)

In [69]:
agmt_scores = pd.DataFrame()
exp_col, pred_col = "expected_label", "predicted_label"
for label in labels:
    agmt_df = pd.concat([eval_df.loc[eval_df[exp_col] == label], eval_df.loc[eval_df[pred_col] == label]])
    agmt_df = agmt_df.drop_duplicates() # True positives will have been duplicated in line above
    tp = agmt_df.loc[agmt_df._merge == "true positive"].shape[0]
    fp = agmt_df.loc[agmt_df._merge == "false positive"].shape[0]
    fn = agmt_df.loc[agmt_df._merge == "false negative"].shape[0]
    prec, rec, f1 = utils.precisionRecallF1(tp, fp, fn)
    label_agmt = pd.DataFrame.from_dict({
            "label":[label], "false negative":[fn], "false positive":[fp],
             "true positive":[tp], "precision":[prec], "recall":[rec], "f1":[f1]
        })
    agmt_scores = pd.concat([agmt_scores, label_agmt])
agmt_scores

Unnamed: 0,label,false negative,false positive,true positive,precision,recall,f1
0,Feminine,39,87,12,0.121212,0.235294,0.16
0,Masculine,59,233,101,0.302395,0.63125,0.408907
0,Unknown,260,57,172,0.751092,0.398148,0.520424
0,Occupation,35,11,7,0.388889,0.166667,0.233333
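
`utils.precisionRecallF1` is a project helper. Assuming it applies the standard definitions, the scores in the table can be reproduced from the counts. The function below is a hypothetical stand-in, checked here against the Masculine row.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 wherever a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from the Masculine row of the table above
prec, rec, f1 = precision_recall_f1(tp=101, fp=233, fn=59)
print(round(prec, 6), round(rec, 6), round(f1, 6))
```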


Save the data:

In [70]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_scores.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_{c}_baseline_fastText{d}_strict_agmt.csv".format(a=a,d=d, c=category))

<a id="2"></a>
***
*Note: models below use tags as targets instead of labels*

## 2. Person Name Model

Create multilabel models with the `PassiveAggressiveClassifier` for the Person Name category of labels in order to compare their performance to the sequence classifier's performance.

Then, try the `RandomForestClassifier`, since this yielded high performance in the optimization experiments.

#### Hypothesis
* The baseline sequence classifiers will outperform (F1 score >0.1 higher) the baseline multilabel token classifiers for labels in the Person Name category.

In [34]:
a = "pa"
pn_clf = ClassifierChain(classifier=PassiveAggressiveClassifier(
        max_iter=100, 
        loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
        random_state=22,
    )
)
# a = "rf"
# pn_clf = ClassifierChain(classifier=RandomForestClassifier(random_state=22))
pn_clf.fit(X_train, y_train)

In [35]:
predictions = pn_clf.predict(X_dev)

#### Evaluate

In [36]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9040107294821986
Precision - macro: 0.2342654641746455

Recall - weighted: 0.8862445780186117
Recall - macro: 0.3036530273250453

F1 Score - weighted: 0.8930894528962644
F1 Score - macro: 0.2420080946447515

Accuracy - normalized: 0.820845495392083
Accuracy - unnormalized: 125142
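
The `weighted` average used above weights each label's score by its support (the number of true instances of that label), so frequent labels such as `"O"` pull it far above the macro average. A small sketch with made-up predictions compares scikit-learn's result with a manual support-weighted mean:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy data: label 0 has support 3, label 1 has support 1
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

per_label = precision_score(y_true, y_pred, average=None, zero_division=0)
support = y_true.sum(axis=0)
manual_weighted = float((per_label * support).sum() / support.sum())

weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)
macro = precision_score(y_true, y_pred, average="macro", zero_division=0)
print(per_label, weighted, macro)
```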


In [37]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,label,predicted_tag
0,5,154,After,IN,[O],O
1,5,155,his,PRP$,[O],O
2,5,156,ordination,NN,[O],O
3,5,157,he,PRP,[O],Masculine
4,5,158,spent,VBD,[O],O


In [38]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [39]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,O,O,true negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,O,O,left_only
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [40]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_baseline_fastText{d}_pn_predictions.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [41]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [42]:
label_tags = [ 
    'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
#     'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
#     'B-Generalization', 'I-Generalization', 
#     'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [43]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,left_only,right_only,true negative,precision,recall,f1,true positive
0,all,691,680,5280.0,5971.0,141319,0.234265,0.303653,0.242008,
0,B-Unknown,221,0,,,0,0.0,0.0,0.0,0.0
0,I-Unknown,216,0,,,0,0.0,0.0,0.0,0.0
0,B-Feminine,30,0,,,0,0.0,0.0,0.0,0.0
0,I-Feminine,22,0,,,0,0.0,0.0,0.0,0.0
0,B-Masculine,75,0,,,0,0.0,0.0,0.0,0.0
0,I-Masculine,85,0,,,0,0.0,0.0,0.0,0.0


Save the data:

In [44]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_pn_strict_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [45]:
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
#     "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
#     "Generalization": ["B-Generalization", "I-Generalization"], 
#     "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [46]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()
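
The nested loop above can also be expressed as a single lookup by inverting the label-to-tags dictionary. The sketch below uses hypothetical toy data to show how loose agreement counts a `B-`/`I-` mismatch within the same label as a match, whereas strict agreement does not.

```python
import pandas as pd

label_to_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"],
    "Feminine": ["B-Feminine", "I-Feminine"],
}
# Invert to a flat tag -> label lookup
tag_to_label = {tag: label for label, tags in label_to_tags.items() for tag in tags}

# Hypothetical expected and predicted tags for four tokens
toy = pd.DataFrame({
    "expected_tag": ["B-Unknown", "I-Unknown", "O", "B-Feminine"],
    "predicted_tag": ["I-Unknown", "O", "O", "I-Feminine"],
})
loose = toy.replace(tag_to_label)  # "O" is not in the lookup, so it is left unchanged
strict_matches = int((toy["expected_tag"] == toy["predicted_tag"]).sum())
loose_matches = int((loose["expected_tag"] == loose["predicted_tag"]).sum())
print(strict_matches, loose_matches)
```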

In [47]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [48]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [49]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Unknown,437.0,0.0,0.0,0.0,0.0,0.0,0.0
0,Feminine,52.0,0.0,0.0,0.0,0.0,0.0,0.0
0,Masculine,160.0,0.0,0.0,0.0,0.0,0.0,0.0


For a Classifier Chain, the Random Forest estimator performs better than the Passive Aggressive estimator.

For comparison, the Baseline Sequence Classifier for Person Names achieved the following F1 scores:
* Unknown F1: 0.597172
* Feminine F1: 0.767750
* Masculine F1: 0.599679

Save the data:

In [50]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_pn_loose_agmt.csv".format(a=a,d=d))

<a id="3"></a>

## 3. Linguistic Model

Create multilabel models with the `PassiveAggressiveClassifier` for the Linguistic category of labels in order to compare their performance to the sequence classifier's performance (the Passive Aggressive algorithm was top-performing for the baseline Linguistic sequence classification model).

#### Hypothesis
* The baseline sequence classifiers will have worse performance (F1 score >=0.1 lower) than the baseline multilabel token classifiers for labels in the Linguistic category.

In [51]:
a = "pa"
l_clf = ClassifierChain(classifier=PassiveAggressiveClassifier(
        max_iter=100, 
        loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
        random_state=22,
    )
)
# a = "rf"
# l_clf = ClassifierChain(classifier = RandomForestClassifier(random_state=22))
l_clf.fit(X_train, y_train)

In [52]:
predictions = l_clf.predict(X_dev)

#### Evaluate: All Labels

In [53]:
print("Precision - micro:", metrics.precision_score(y_dev, predictions, average="micro", zero_division=0))  # micro = calculated from TP, FP, FN sums across all labels
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - micro:", metrics.recall_score(y_dev, predictions, average="micro", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - micro:", metrics.f1_score(y_dev, predictions, average="micro", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - micro: 0.8578084802880332
Precision - macro: 0.2342654641746455

Recall - micro: 0.8862445780186117
Recall - macro: 0.3036530273250453

F1 Score - micro: 0.8717947094703455
F1 Score - macro: 0.2420080946447515

Accuracy - normalized: 0.820845495392083
Accuracy - unnormalized: 125142


In [54]:
print("Precision - per label:", metrics.precision_score(y_dev, predictions, average=None, zero_division=0))
print()
print("Recall - per label:", metrics.recall_score(y_dev, predictions, average=None, zero_division=0))
print()
print("F1 Score - per label:", metrics.f1_score(y_dev, predictions, average=None, zero_division=0))

Precision - per label: [0.04012211 0.08160769 0.95865737 0.01582278 0.07511737]

Recall - per label: [0.09745763 0.42551253 0.93535673 0.02090592 0.03903232]

F1 Score - per label: [0.05684276 0.13695015 0.94686372 0.01801261 0.05137124]


*Note: the scores above are calculated with the `"O"` label*

#### Evaluate: Each Label

In [55]:
pred_df = utils.makePredictionDF(predictions, dev_data, "label", "predicted_label", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag,predicted_label
0,5,154,After,IN,[O],O
1,5,155,his,PRP$,[O],O
2,5,156,ordination,NN,[O],O
3,5,157,he,PRP,[O],Masculine
4,5,158,spent,VBD,[O],O


In [56]:
exp_df = dev_data.explode(["label"])
exp_df = exp_df.rename(columns={"label":"expected_label"})
# exp_df.head()

In [57]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_label"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_label"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_label", "predicted_label", "_merge"],  # final column list
    "expected_label",
    "predicted_label", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_label,predicted_label,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,O,O,true negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,O,O,left_only
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [58]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
# eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_baseline_fastText{d}_predictions.csv".format(a=a,d=d))
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-{a}_linguistic_baseline_fastText{d}_predictions.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [59]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "label(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each label:

In [60]:
labels = [ 
#     'Feminine', 'Masculine', 'Unknown',                      # Person Name category of labels
    'Gendered-Pronoun','Gendered-Role', 'Generalization',   # Linguistic category of labels
#     'Occupation', 'Omission', 'Stereotype'                  # Contextual category of labels
]

In [61]:
for label in labels:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label], exp_col="expected_label", pred_col="predicted_label")
    label_agmt_stats = label_agmt_stats.rename(columns={"tag(s)":"label(s)"})
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,label(s),false negative,false positive,left_only,right_only,true negative,true positive,precision,recall,f1
0,all,393,388,5155.0,5971.0,140738,292,0.234265,0.303653,0.242008
0,Gendered-Pronoun,0,0,,,0,0,0.0,0.0,0.0
0,Gendered-Role,0,0,,,0,0,0.0,0.0,0.0
0,Generalization,0,0,,,0,0,0.0,0.0,0.0


Save the data:

In [62]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
# agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_strict_agmt.csv".format(a=a,d=d))
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_linguistic_baseline_fastText{d}_strict_agmt.csv".format(a=a,d=d))

***
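*Note: the code below is written for tags (`expected_tag`, `predicted_tag`), not labels, so it raises a `KeyError` on this section's `eval_df`, whose columns are `expected_label` and `predicted_label`.*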

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [63]:
label_tags = {
#     "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"], 
#     "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [64]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

KeyError: 'expected_tag'

In [None]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [None]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [None]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

With a Classifier Chain, the Random Forest estimator yields better results than the Passive Aggressive estimator.

For comparison, the Baseline Sequence Classifier achieved the following F1 scores:
* Gendered-Pronoun F1: 0.872418
* Gendered-Role F1: 0.659875
* Generalization F1: 0.319392


Save the data:

In [None]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-{a}_baseline_fastText{d}_ling_loose_agmt.csv".format(a=a,d=d))