# Baseline Gender Biased Token Classifiers

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: under directory `../data/token_clf_data/multilabel_model_output/`
* Multilabel classification
    * 3 categories of labels:
        1. Person Name: Unknown, Feminine, Masculine (Non-binary not applied during annotation)
        2. Linguistic: Generalization, Gendered Pronoun, Gendered Role
        3. Contextual: Occupation, Omission, Stereotype (Empowering only applied by one annotator and too few times for training)
* Word embeddings: custom fastText embeddings

***

### Table of Contents

[0.](#0) Preprocessing

[1.](#CC) Classifier Chain Models

[2.](#2) Person Name Model - *need to rerun without MVC*

[3.](#3) Linguistic Model - *need to rerun without MVC*


**Appendices**

[A.](#A) Algorithm Experiments with Classifier Chain

[B.](#B) Algorithm Experiments with Majority Voting Classifier

[C.](#C) Scratch Work (Logistic Regression)

***

Load necessary libraries:

In [30]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For embeddings
from gensim.models import FastText
from gensim import utils as gensim_utils
from gensim.test.utils import get_tmpfile

# For classification
import sklearn.metrics as metrics
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import GridSearchCV
from skmultilearn.ensemble import LabelSpacePartitioningClassifier, MajorityVotingClassifier
from sklearn.multiclass import OneVsRestClassifier
from skmultilearn.cluster import FixedLabelSpaceClusterer
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.adapt import MLTSVM
# Base estimators
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier, PassiveAggressiveClassifier

<a id="0"></a>
## 0. Preprocessing

Load the train and validation (dev) data:

In [31]:
df_train = pd.read_csv(config.tokc_path+"model_input/token_train.csv", index_col=0)
df_dev = pd.read_csv(config.tokc_path+"model_input/token_validate.csv", index_col=0)
print(df_train.shape, df_dev.shape)
df_train.head()

(467564, 10) (157740, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,train
4,1,1,99999,4,:,"(22, 23)",:,O,Title,train
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,train
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,train
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,train


In [32]:
df_train = df_train.drop(columns=["ann_id"])
df_train = df_train.drop_duplicates()
df_dev = df_dev.drop(columns=["ann_id"])
df_dev = df_dev.drop_duplicates()
print(df_train.shape, df_dev.shape)

(463441, 9) (156146, 9)


Remove the Non-binary labels. These were mistaken labels, identified early on, that were meant to be excluded. In addition, only one token has this label, which prevents the data from being input into the models with cross-validation.

In [33]:
df_train = df_train.loc[df_train.tag != "B-Nonbinary"]
df_train = df_train.loc[df_train.tag != "I-Nonbinary"]

In [34]:
df_train.shape

(463439, 9)
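
The same kind of filtering can be expressed generically. The following is a minimal sketch on toy data, assuming the goal is to drop any tag that occurs too rarely to appear in every cross-validation fold. The function name `drop_rare_tags` and the `min_count` parameter are hypothetical and are not part of `utils`.

```python
import pandas as pd

def drop_rare_tags(df, tag_col="tag", min_count=2):
    """Keep only rows whose tag occurs at least `min_count` times (hypothetical helper)."""
    counts = df[tag_col].value_counts()
    frequent_tags = counts[counts >= min_count].index
    return df.loc[df[tag_col].isin(frequent_tags)]

# Toy data: "B-Nonbinary" occurs once, so it is removed
toy = pd.DataFrame({
    "token": ["Papers", "of", "The", "Mr", "X"],
    "tag": ["O", "O", "B-Unknown", "B-Unknown", "B-Nonbinary"],
})
filtered = drop_rare_tags(toy, min_count=2)
print(filtered["tag"].value_counts())
```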

***
#### Optional Preprocessing Step

If not classifying all labels at once, consider only the rows with tags in the selected subset of labels, replacing all tags outside that subset with `"O"`:

In [41]:
# cont_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission", "B-Occupation", "I-Occupation"]
pers_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine"]#, "B-Nonbinary", "I-Nonbinary"]
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", pers_label_subset)
# df_train, df_dev = utils.selectDataForLabels(df_train, df_dev, "tag", ling_label_subset)
print(df_train.shape, df_dev.shape)
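
The body of `utils.selectDataForLabels` is not shown in this notebook. As a rough sketch of the replacement step described above, assuming it amounts to mapping out-of-subset tags to `"O"` and removing the duplicate rows this creates, one could write the following. The name `keep_label_subset` is hypothetical, and the real function may treat edge cases differently.

```python
import pandas as pd

def keep_label_subset(df, tag_col, label_subset, no_tag="O"):
    """Replace tags outside `label_subset` with `no_tag`, then drop duplicate rows."""
    out = df.copy()
    in_subset = out[tag_col].isin(label_subset)
    out[tag_col] = out[tag_col].where(in_subset, no_tag)
    return out.drop_duplicates().reset_index(drop=True)

# Toy data: token 2 has two tags outside the subset, which collapse to a single "O" row
toy = pd.DataFrame({
    "token_id": [1, 2, 2, 3],
    "tag": ["B-Unknown", "B-Stereotype", "B-Occupation", "O"],
})
subset_df = keep_label_subset(toy, "tag", ["B-Unknown", "I-Unknown"])
print(subset_df)
```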

***

Group the data by token, so there is one row per token rather than one row per token-tag pair:

In [46]:
subdf_train = df_train.drop(columns=["description_id", "field", "subset", "token_offsets"])
subdf_dev = df_dev.drop(columns=["description_id", "field", "subset", "token_offsets"])
df_train_imploded = utils.implodeDataFrame(subdf_train, ["sentence_id", "token_id", "token", "pos"])
df_train_imploded = df_train_imploded.reset_index()
df_dev_imploded = utils.implodeDataFrame(subdf_dev, ["sentence_id", "token_id", "token", "pos"])
df_dev_imploded = df_dev_imploded.reset_index()
df_dev_imploded.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag
0,5,154,After,IN,[O]
1,5,155,his,PRP$,[B-Gendered-Pronoun]
2,5,156,ordination,NN,[O]
3,5,157,he,PRP,[B-Gendered-Pronoun]
4,5,158,spent,VBD,[O]
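
For readers without access to `utils.implodeDataFrame`, the grouping step can be approximated with plain pandas. This is a sketch on toy data, assuming the helper collects each token's tags into a list; the function name `implode` is hypothetical. `DataFrame.explode`, used later in the notebook, reverses the operation.

```python
import pandas as pd

def implode(df, group_cols, list_col="tag"):
    """Collect the values of `list_col` into one list per group (hypothetical helper)."""
    return df.groupby(group_cols, sort=False)[list_col].agg(list).reset_index()

# Toy data: token 7 has two token-tag rows, token 8 has one
toy = pd.DataFrame({
    "sentence_id": [1, 1, 1],
    "token_id": [7, 7, 8],
    "token": ["The", "The", "Papers"],
    "tag": ["B-Unknown", "B-Masculine", "O"],
})
imploded = implode(toy, ["sentence_id", "token_id", "token"])
exploded = imploded.explode("tag")  # back to one row per token-tag pair
print(imploded)
```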


Associate word embeddings to the tokens:

In [47]:
dimensions = ["50", "100", "200", "300"]
d = dimensions[1]
file_name = config.tokc_path+"fasttext{}_lowercased.model".format(d)  # compare to GloVe word embeddings with 50 dimensions
embedding_model = FastText.load(file_name)

In [48]:
vocabulary = list(df_train.token.unique())
vocabulary_lowercased = [token.lower() for token in vocabulary]
vocabulary_lowercased = list(set(vocabulary_lowercased))
print("Vocabulary size:", len(vocabulary))
print("Lowercased vocabulary size:", len(vocabulary_lowercased))

Vocabulary size: 35968
Lowercased vocabulary size: 31335


Vectorize and binarize the data:

In [49]:
mlb = MultiLabelBinarizer()

In [50]:
target_col = "tag"
feature_cols = ["token_id", "token"]
train_data = df_train_imploded
dev_data = df_dev_imploded

In [51]:
train_data.head()

Unnamed: 0,sentence_id,token_id,token,pos,tag
0,1,3,Title,NN,[O]
1,1,4,:,:,[O]
2,1,5,Papers,NNS,[O]
3,1,6,of,IN,[O]
4,1,7,The,DT,"[B-Unknown, B-Masculine, B-Stereotype]"


Extract features:

In [52]:
# Get a vector representation of a token from a fastText word embedding model
def extractEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

def makeFeatureMatrix(token_data):
    feature_list = [extractEmbedding(token) for token_id,token in token_data]
    return np.array(feature_list)
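
One detail of `extractEmbedding` worth checking: `str.isalpha()` is only true when every character is a letter, so tokens containing hyphens, apostrophes, or digits keep their original casing before lookup. The sketch below isolates that normalization rule with made-up tokens. Whether this matches the casing used when the lowercased fastText model was trained is not stated in this notebook, so treat it as something to verify rather than as a confirmed mismatch.

```python
def normalize_token(token):
    """Mirror the casing rule in extractEmbedding: lowercase only purely alphabetic tokens."""
    return token.lower() if token.isalpha() else token

examples = ["Title", "Anglo-Saxon", "O'Neil", "1850s", ":"]
normalized = [normalize_token(t) for t in examples]
for original, result in zip(examples, normalized):
    print(f"{original!r} -> {result!r}")
```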

In [53]:
train_tokens = list(zip(train_data[feature_cols[0]], train_data[feature_cols[1]]))
dev_tokens = list(zip(dev_data[feature_cols[0]], dev_data[feature_cols[1]]))

In [54]:
X_train = makeFeatureMatrix(train_tokens)
X_dev = makeFeatureMatrix(dev_tokens)
print(X_train.shape, X_dev.shape)  # number_of_samples, number_of_features

(452086, 100) (152455, 100)


Binarize targets:

In [55]:
y_train_labels = train_data[target_col]
y_train = mlb.fit_transform(y_train_labels)
y_dev_labels = dev_data[target_col]
y_dev = mlb.transform(y_dev_labels)
print(y_train.shape, y_dev.shape)  # number_of_samples, number_of_labels

(452086, 19) (152455, 19)
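
As a small illustration of what the binarization produces (toy tags, not the project data): `MultiLabelBinarizer` creates one column per distinct tag, in sorted order, and treats `"O"` as a label like any other. That would be consistent with the 19 columns above being 18 IOB tags plus `"O"`, although the notebook does not print `mlb.classes_` to confirm it.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One list of tags per token
toy_tags = [["O"], ["B-Unknown", "B-Masculine"], ["O"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_tags)
toy_classes = list(toy_mlb.classes_)

# Rows with more than one 1 correspond to multilabelled tokens
multilabel_rows = toy_y.sum(axis=1) > 1

print(toy_classes)
print(toy_y)
```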


In [56]:
for labels in y_train:
    if sum(labels) > 1:
        print("Multilabelled tokens exist, as expected.")
        break

Multilabelled tokens exist, as expected.


For baseline models, use only the tokens' embeddings as features.

In [76]:
## Twin Support Vector Machines
# clf = MLTSVM(c_k = 2**-1)
# clf.fit(X_train, y_train)  # full dataset needs 1.47 TiB of data - try on one category of labels at a time?
# predictions = clf.predict(X_test)

<a id="CC"></a>
## 1. Classifier Chain Models

*Reference: http://scikit.ml/api/skmultilearn.problem_transform.cc.html#skmultilearn.problem_transform.ClassifierChain*

#### Train & Predict

In [59]:
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)

In [60]:
predictions = clf.predict(X_dev)
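
To make the chaining idea concrete, here is a hand-rolled sketch of a classifier chain on synthetic data. It is not scikit-multilearn's implementation: it assumes the simplest variant, in which the classifier for label *j* sees the original features plus the true values of labels 0..*j*-1 during training, and the predicted values of those labels at prediction time. `LogisticRegression` stands in for the Random Forest used above only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one binary classifier per label, feeding earlier labels in as extra features."""
    models, X_ext = [], X
    for j in range(Y.shape[1]):
        models.append(LogisticRegression().fit(X_ext, Y[:, j]))
        X_ext = np.hstack([X_ext, Y[:, [j]]])  # append the true label j
    return models

def predict_chain(models, X):
    """Predict labels in order, appending each prediction to the feature matrix."""
    preds, X_ext = [], X
    for model in models:
        p = model.predict(X_ext)
        preds.append(p)
        X_ext = np.hstack([X_ext, p.reshape(-1, 1)])
    return np.column_stack(preds)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
Y_toy = np.column_stack([X_toy[:, 0] > 0, X_toy[:, 1] > 0]).astype(int)

chain = fit_chain(X_toy, Y_toy)
chain_preds = predict_chain(chain, X_toy)
print(chain_preds.shape)
```

Because later classifiers depend on earlier predictions, label order can matter, and errors early in the chain can propagate.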

#### Evaluate: All Labels

In [61]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9139355112811681
Precision - macro: 0.5117583387635518

Recall - weighted: 0.9275677891204386
Recall - macro: 0.33299717339152485

F1 Score - weighted: 0.9144440009060665
F1 Score - macro: 0.3739269852824525

Accuracy - normalized: 0.9373192089469023
Accuracy - unnormalized: 142899
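
The gap between the weighted and macro scores above is easier to read with a worked toy case. In scikit-learn, `average="weighted"` weights each label's score by its support (number of true instances), `average="macro"` takes the unweighted mean over labels, and `accuracy_score` on a multilabel indicator matrix counts a sample as correct only if every label matches. The arrays below are invented for illustration.

```python
import numpy as np
import sklearn.metrics as metrics

# Label 0 is frequent and always predicted; label 1 is rare and never predicted
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

macro_f1 = metrics.f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = metrics.f1_score(y_true, y_pred, average="weighted", zero_division=0)
subset_accuracy = metrics.accuracy_score(y_true, y_pred)

# Per-label F1 is 6/7 for label 0 and 0 for label 1
print("macro F1:", macro_f1)            # (6/7 + 0) / 2
print("weighted F1:", weighted_f1)      # (3 * 6/7 + 1 * 0) / 4
print("subset accuracy:", subset_accuracy)
```

A frequent, easy label therefore lifts the weighted score much more than the macro score, which is the pattern seen in the results above.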


#### Evaluate: Each Label

In [62]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,5,154,After,IN,O
1,5,155,his,PRP$,B-Gendered-Pronoun
2,5,156,ordination,NN,O
3,5,157,he,PRP,B-Gendered-Pronoun
4,5,158,spent,VBD,O


In [63]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [64]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative
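
`utils.makeEvaluationDataFrame` is not shown here. The `_merge` column name suggests an outer merge with pandas' `indicator=True`, so the sketch below shows one way such an outcome column could be derived. It is an assumption about the approach, not the project's code; `label_agreement` and the `outcome` column are hypothetical names.

```python
import pandas as pd

def label_agreement(expected, predicted, no_tag="O"):
    """Outer-merge expected and predicted (token_id, tag) pairs and name each outcome."""
    merged = expected.merge(predicted, how="outer", on=["token_id", "tag"], indicator=True)
    # An unmatched "O" row is already accounted for by the other side's FP or FN row
    unmatched_o = (merged["tag"] == no_tag) & (merged["_merge"] != "both")
    merged = merged.loc[~unmatched_o].copy()

    def outcome(row):
        if row["_merge"] == "both":
            return "true negative" if row["tag"] == no_tag else "true positive"
        return "false negative" if row["_merge"] == "left_only" else "false positive"

    merged["outcome"] = merged.apply(outcome, axis=1)
    return merged

expected = pd.DataFrame({"token_id": [1, 2, 3, 4],
                         "tag": ["O", "B-Feminine", "B-Unknown", "O"]})
predicted = pd.DataFrame({"token_id": [1, 2, 3, 4],
                          "tag": ["O", "B-Masculine", "B-Unknown", "B-Occupation"]})
agreement = label_agreement(expected, predicted)
outcome_counts = agreement["outcome"].value_counts().to_dict()
print(outcome_counts)
```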


Save the data:

In [65]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-rf_baseline_fastText{}_predictions.csv".format(d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [66]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [67]:
label_tags = [ 
    'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
    'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
    'B-Generalization', 'I-Generalization', 
    'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [68]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,11310,9593,140352,4484,0.511758,0.332997,0.373927
0,B-Unknown,1321,369,0,1130,0.753836,0.461036,0.572152
0,I-Unknown,2423,509,0,1278,0.715165,0.345312,0.465743
0,B-Feminine,105,78,0,352,0.818605,0.770241,0.793687
0,I-Feminine,505,58,0,340,0.854271,0.402367,0.547064
0,B-Masculine,468,262,0,1000,0.792393,0.681199,0.732601
0,I-Masculine,1014,306,0,452,0.596306,0.308322,0.406475
0,B-Gendered-Pronoun,9,178,0,1470,0.89199,0.993915,0.940198
0,I-Gendered-Pronoun,15,0,0,0,0.0,0.0,0.0
0,B-Gendered-Role,140,159,0,870,0.845481,0.861386,0.853359
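
The per-tag scores in the table follow from the counts in the usual way: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The short check below recomputes the `B-Feminine` and `I-Gendered-Pronoun` rows from their counts. The helper name is illustrative, and returning 0 when a denominator is 0 is an assumption that matches the `zero_division=0` setting used earlier.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# B-Feminine row: 352 true positives, 78 false positives, 105 false negatives
p, r, f1 = precision_recall_f1(tp=352, fp=78, fn=105)
print(round(p, 6), round(r, 6), round(f1, 6))

# I-Gendered-Pronoun row: no true or false positives, 15 false negatives
zero_scores = precision_recall_f1(tp=0, fp=0, fn=15)
print(zero_scores)
```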


Save the data:

In [69]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-rf_baseline_fastText{}_strict_agmt.csv".format(d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [70]:
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"], 
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [71]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()
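
An equivalent way to collapse IOB tags into label names is to invert the `label_tags` dictionary once and pass the resulting tag-to-label mapping to a single `replace` call per column. The sketch below uses a shortened dictionary and a toy Series; it is an alternative formulation, not a change to the results.

```python
import pandas as pd

toy_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"],
}

# Invert {label: [tags]} into {tag: label}
tag_to_label = {tag: label for label, tags in toy_label_tags.items() for tag in tags}

toy_tags = pd.Series(["B-Unknown", "I-Unknown", "O", "I-Gendered-Pronoun"])
loose_tags = toy_tags.replace(tag_to_label)  # values not in the mapping, such as "O", are unchanged
print(loose_tags.tolist())
```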

In [72]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [73]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [74]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Unknown,3744.0,0.0,0.0,2408.0,1.0,0.391417,0.562617
0,Feminine,610.0,0.0,0.0,692.0,1.0,0.53149,0.694082
0,Masculine,1482.0,0.0,0.0,1452.0,1.0,0.494888,0.662107
0,Gendered Pronoun,24.0,0.0,0.0,1470.0,1.0,0.983936,0.991903
0,Gendered Role,258.0,0.0,0.0,870.0,1.0,0.771277,0.870871
0,Generalization,322.0,0.0,0.0,126.0,1.0,0.28125,0.439024
0,Stereotype,930.0,0.0,0.0,86.0,1.0,0.084646,0.15608
0,Omission,1994.0,0.0,0.0,1032.0,1.0,0.341044,0.508625
0,Occupation,1019.0,0.0,0.0,832.0,1.0,0.449487,0.620201


Save the data:

In [75]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-rf_baseline_fastText{}_loose_agmt.csv".format(d))

<a id="2"></a>

## 2. Person Name Model

Create multilabel models with the `PassiveAggressiveClassifier` for the Person Name category of labels in order to compare their performance to the sequence classifier's performance (the Passive Aggressive algorithm was top-performing for the baseline Person Name sequence classification model).

#### Hypothesis
* The baseline sequence classifiers will outperform (F1 score >0.1 higher) the baseline multilabel token classifiers for labels in the Person Name category - *supported*

In [19]:
pn_clf = ClassifierChain(classifier=PassiveAggressiveClassifier(
        max_iter=100, 
        loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
        random_state=22,
    )
)
pn_clf.fit(X_train, y_train)

In [20]:
predictions = pn_clf.predict(X_dev)

#### Evaluate

In [21]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.007294781083555422
Precision - macro: 0.16438238482824089

Recall - weighted: 0.007154218155357936
Recall - macro: 0.14721860785793983

F1 Score - weighted: 0.006746758975236831
F1 Score - macro: 0.14159993780287958

Accuracy - normalized: 0.0016529467711783805
Accuracy - unnormalized: 252


In [22]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,5,154,After,IN,O
1,5,155,his,PRP$,O
2,5,156,ordination,NN,O
3,5,157,he,PRP,O
4,5,158,spent,VBD,O


In [23]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [24]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,O,O,true negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,O,O,true negative
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [25]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/cc-pa_baseline_fastText100_pn_predictions.csv")

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [26]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [27]:
label_tags = [ 
    'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
#     'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
#     'B-Generalization', 'I-Generalization', 
#     'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [28]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,10275,10973,143374,1111,0.164382,0.147219,0.1416
0,B-Unknown,1264,1838,0,638,0.257674,0.335436,0.291457
0,I-Unknown,2510,1397,0,360,0.204895,0.125436,0.155608
0,B-Feminine,162,60,0,104,0.634146,0.390977,0.483721
0,I-Feminine,411,578,0,258,0.308612,0.38565,0.342857
0,B-Masculine,430,1577,0,696,0.306203,0.618117,0.409532
0,I-Masculine,838,1275,0,166,0.115198,0.165339,0.135787


The baseline sequence classifier performs better, **supporting** the hypothesis.

Save the data:

In [29]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/cc-pa_baseline_fastText100_pn_strict_agmt.csv")

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [35]:
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
#     "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
#     "Generalization": ["B-Generalization", "I-Generalization"], 
#     "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [36]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [37]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [38]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [39]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Unknown,3774.0,0.0,0.0,998.0,1.0,0.209137,0.345927
0,Feminine,573.0,0.0,0.0,362.0,1.0,0.387166,0.558211
0,Masculine,1268.0,0.0,0.0,862.0,1.0,0.404695,0.576203


The baseline sequence classifiers perform better, **supporting** the hypothesis.

One additional note: this model yielded better performance for the Person Name labels than the multilabel classifier with all labels (though that model used a different algorithm, Random Forest).

Save the data:

In [40]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-pa_baseline_fastText100_pn_loose_agmt.csv")

<a id="3"></a>

## 3. Linguistic Model

Create multilabel models with the `PassiveAggressiveClassifier` for the Linguistic category of labels in order to compare their performance to the sequence classifier's performance (the Passive Aggressive algorithm was top-performing for the baseline Linguistic sequence classification model).

#### Hypothesis
* The baseline sequence classifiers will have worse performance (F1 score >=0.1 lower) than the baseline multilabel token classifiers for labels in the Linguistic category - *countered if measured strictly, supported if measured loosely*

In [72]:
l_clf = ClassifierChain(classifier=PassiveAggressiveClassifier(
        max_iter=100, 
        loss="squared_hinge",  # equivalent to pa_type=2 (PA-II)
        random_state=22,
    )
)
l_clf.fit(X_train, y_train)

In [73]:
predictions = l_clf.predict(X_dev)

#### Evaluate

In [74]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.006489524643169926
Precision - macro: 0.22368507592610504

Recall - weighted: 0.005338136350881315
Recall - macro: 0.18990065451145394

F1 Score - weighted: 0.005834330222320042
F1 Score - macro: 0.20366653567181384

Accuracy - normalized: 0.003437079794037585
Accuracy - unnormalized: 524


In [75]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,5,154,After,IN,O
1,5,155,his,PRP$,O
2,5,156,ordination,NN,O
3,5,157,he,PRP,B-Gendered-Pronoun
4,5,158,spent,VBD,O


In [76]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [77]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,,false negative
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [78]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/mvc_cc-rf_baseline_ling_predictions.csv")

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [79]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [80]:
label_tags = [ 
#     'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
    'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
    'B-Generalization', 'I-Generalization', 
#     'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [81]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,1760,1279,153103,818,0.223685,0.189901,0.203667
0,B-Gendered-Pronoun,196,100,0,1050,0.913043,0.842697,0.876461
0,I-Gendered-Pronoun,14,16,0,0,0.0,0.0,0.0
0,B-Gendered-Role,140,176,0,492,0.736527,0.778481,0.756923
0,I-Gendered-Role,65,25,0,0,0.0,0.0,0.0
0,B-Generalization,94,282,0,94,0.25,0.5,0.333333
0,I-Generalization,45,170,0,0,0.0,0.0,0.0


The baseline sequence classifier performs better, **countering** the hypothesis.

Save the data:

In [82]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/mvc_cc-rf_baseline_ling_strict_agmt.csv")

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [83]:
label_tags = {
#     "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"], 
#     "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [84]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [85]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

In [86]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [87]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Gendered Pronoun,210.0,0.0,0.0,1050.0,1.0,0.833333,0.909091
0,Gendered Role,205.0,0.0,0.0,492.0,1.0,0.705882,0.827586
0,Generalization,139.0,0.0,0.0,94.0,1.0,0.403433,0.574924


The baseline sequence classifiers perform worse by the loose agreement measure, **supporting** the hypothesis.

Save the data:

In [88]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/cc-pa_baseline_fastText100_ling_loose_agmt.csv")

<a id="A"></a>
## Appendix A: Algorithm Experiments with Classifier Chain

In [27]:
# Random Forest with default params
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9939906380075957
Precision - macro: 0.4522395266401285

Recall - weighted: 0.9940419089384418
Recall - macro: 0.42926914483057615

F1 Score - weighted: 0.9938517260287932
F1 Score - macro: 0.42858522358505574

Accuracy - normalized: 0.9928569085959792
Accuracy - unnormalized: 151366
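
Since every experiment in this appendix prints the same set of scores, the repeated `print` lines could be wrapped in a small helper. The following is one possible sketch; the name `summarize_scores` is hypothetical and does not exist in `utils`.

```python
import numpy as np
import sklearn.metrics as metrics

def summarize_scores(y_true, y_pred):
    """Return weighted and macro precision, recall, and F1, plus subset accuracy."""
    scorers = {
        "precision": metrics.precision_score,
        "recall": metrics.recall_score,
        "f1": metrics.f1_score,
    }
    scores = {}
    for name, scorer in scorers.items():
        for average in ("weighted", "macro"):
            scores[f"{name}_{average}"] = scorer(y_true, y_pred, average=average, zero_division=0)
    scores["accuracy"] = metrics.accuracy_score(y_true, y_pred)
    return scores

# Toy check: label 0 is predicted perfectly, label 1 is missed
toy_true = np.array([[1, 0], [0, 1]])
toy_pred = np.array([[1, 0], [0, 0]])
toy_scores = summarize_scores(toy_true, toy_pred)
for key, value in toy_scores.items():
    print(f"{key}: {value}")
```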


In [29]:
# Extra Trees Classifier with default params
clf_et = ClassifierChain(
    classifier = ExtraTreesClassifier(random_state=22),
)
clf_et.fit(X_train, y_train)
predictions = clf_et.predict(X_dev)
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9940462341297188
Precision - macro: 0.4580114351178922

Recall - weighted: 0.9939962280650234
Recall - macro: 0.42743072713623403

F1 Score - weighted: 0.9938303866656131
F1 Score - macro: 0.4281386553268569

Accuracy - normalized: 0.9928634679085632
Accuracy - unnormalized: 151367


In [33]:
# LR with OvR with liblinear solver (suitable for smaller datasets, according to documentation)
clf_lr_liblinear = ClassifierChain(
    # Uses binary relevance method for multilabel classification - trains a binary classifier for each label
    classifier = OneVsRestClassifier(LogisticRegression(multi_class='ovr', solver='liblinear', random_state=22)),
)
clf_lr_liblinear.fit(X_train, y_train)
predictions = clf_lr_liblinear.predict(X_dev)
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9926034708384126
Precision - macro: 0.3696959386644671

Recall - weighted: 0.9929259904592233
Recall - macro: 0.3545416942041249

F1 Score - weighted: 0.9925966375241835
F1 Score - macro: 0.3561134491803094

Accuracy - normalized: 0.9924502312157686
Accuracy - unnormalized: 151304


In [35]:
# LR with OvR with liblinear and balanced class weights - DIDN'T CONVERGE
clf_lr_liblinear_balanced = ClassifierChain(
    # Uses binary relevance method for multilabel classification - trains a binary classifier for each label
    classifier = OneVsRestClassifier(LogisticRegression(multi_class='ovr', class_weight='balanced', solver='liblinear', random_state=22)),
)
clf_lr_liblinear_balanced.fit(X_train, y_train)
predictions = clf_lr_liblinear_balanced.predict(X_dev)
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples



KeyboardInterrupt: 

In [28]:
# Ridge Classifier with default params
clf_rc = ClassifierChain(
    classifier = RidgeClassifier(random_state=22),
)
clf_rc.fit(X_train, y_train)
predictions = clf_rc.predict(X_dev)
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.9888414526611495
Precision - macro: 0.26011444229933084

Recall - weighted: 0.9908768769944596
Recall - macro: 0.2586768755313947

F1 Score - weighted: 0.9898578320522367
F1 Score - macro: 0.2593890171337297

Accuracy - normalized: 0.9910006231346955
Accuracy - unnormalized: 151083


Try Grid Search on one of the highest-performing algorithms, Random Forest:

In [None]:
## Use Grid Search to optimize parameters
parameters = {
    'max_depth': [3, 4],
    'class_weight': ['balanced', None],
    'max_features': ['log2', 0.5]
             }
score = 'f1_macro'

clf = GridSearchCV(RandomForestClassifier(random_state=22), parameters, scoring=score)
clf.fit(X_train, y_train)

# print (clf.best_params_, clf.best_score_)  # Lower than RF with default parameters

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


In [None]:
# ## Use Grid Search to optimize parameters
# parameters = {
#     'max_depth': [3, 4],
#     'class_weight': ['balanced', None],
#     'max_features': ['log2', 0.5]
#              }
# score = 'f1_macro'

# clf = GridSearchCV(ExtraTreesClassifier(random_state=22), parameters, scoring=score)
# clf.fit(X_train, y_train)

# print (clf.best_params_, clf.best_score_)

<a id="B"></a>
## Appendix B: Algorithm Experiments with Majority Voting Classifier

The Random Forest Classifier with default parameters performed best.
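
Every evaluation cell in this appendix prints the same ten scores. A small helper could cut that repetition; the sketch below is one possible shape for it. The name `report_scores` is hypothetical (it is not a function in `utils`), and the toy arrays are made-up data used only to show the call.

```python
import numpy as np
import sklearn.metrics as metrics

def report_scores(y_true, y_pred):
    """Print and return the scores used throughout this notebook (hypothetical helper)."""
    scores = {}
    for name, fn in [("Precision", metrics.precision_score),
                     ("Recall", metrics.recall_score),
                     ("F1 Score", metrics.f1_score)]:
        for average in ("weighted", "macro"):
            scores[f"{name} - {average}"] = fn(y_true, y_pred, average=average, zero_division=0)
    # normalize=True gives the fraction of exactly matching samples, False gives their count
    scores["Accuracy - normalized"] = metrics.accuracy_score(y_true, y_pred, normalize=True)
    scores["Accuracy - unnormalized"] = metrics.accuracy_score(y_true, y_pred, normalize=False)
    for label, value in scores.items():
        print(f"{label}: {value}")
    return scores

# Toy multilabel indicator arrays (made-up data) to show the call
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 1, 0]])
toy_scores = report_scores(y_true, y_pred)
```

In the notebook, the call would presumably be `report_scores(y_dev, predictions)` after each `predict`.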

In [54]:
# Majority Voting + Extra Trees with max. depth of 4 and balanced classes
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.004975145358542142
Precision - macro: 0.05673776476853289

Recall - weighted: 0.019385703123999332
Recall - macro: 0.270422193131941

F1 Score - weighted: 0.006264680707162905
F1 Score - macro: 0.07270225712081037

Accuracy - normalized: 0.0028467416614738777
Accuracy - unnormalized: 434


In [50]:
# Majority Voting + Extra Trees with max. depth of 3
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.004173136841655642
Precision - macro: 0.04609639397829385

Recall - weighted: 0.002529683757508998
Recall - macro: 0.027942840973401245

F1 Score - weighted: 0.0031499325783877733
F1 Score - macro: 0.03479409821625192

Accuracy - normalized: 0.0024203863435112
Accuracy - unnormalized: 369


In [41]:
# Majority Voting + LR with OvR and balanced class weights
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.004736068132697176
Precision - macro: 0.0538548492552831

Recall - weighted: 0.01968029920715228
Recall - macro: 0.27861743847249637

F1 Score - weighted: 0.00606212683599175
F1 Score - macro: 0.06995432567570094

Accuracy - normalized: 0.0018825227116198223
Accuracy - unnormalized: 287


In [36]:
# Majority Voting + LR with OvR
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.014542711068512293
Precision - macro: 0.18511478280829652

Recall - weighted: 0.00968324516798381
Recall - macro: 0.12519454215074025

F1 Score - weighted: 0.010533287235813869
F1 Score - macro: 0.13646371728939513

Accuracy - normalized: 0.005588534321603096
Accuracy - unnormalized: 852


In [47]:
# Majority Voting + Ridge Classifier
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.003926223662819222
Precision - macro: 0.04336899547641273

Recall - weighted: 0.0038681746570517336
Recall - macro: 0.04272778720996038

F1 Score - weighted: 0.0038969829991150843
F1 Score - macro: 0.043046003634679114

Accuracy - normalized: 0.0036338591715588207
Accuracy - unnormalized: 554


In [44]:
# Majority Voting + Ridge Classifier with balanced classes
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.004697233991329188
Precision - macro: 0.05287934345155035

Recall - weighted: 0.01952659690289857
Recall - macro: 0.27542877986736086

F1 Score - weighted: 0.005827413355677853
F1 Score - macro: 0.06636114769517683

Accuracy - normalized: 0.0014233708307369388
Accuracy - unnormalized: 217


In [30]:
# Majority Voting + RF
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.01577857101823814
Precision - macro: 0.22404798759314012

Recall - weighted: 0.014313527083626862
Recall - macro: 0.19202597968383292

F1 Score - weighted: 0.01474363832208297
F1 Score - macro: 0.20132410267490758

Accuracy - normalized: 0.008422157357908891
Accuracy - unnormalized: 1284


In [67]:
## Use Grid Search to optimize parameters
parameters = {
    'max_depth': list(range(3, 6)),
    'class_weight': ['balanced', 'balanced_subsample', None],
    'max_features': ['sqrt', 'log2', 10]
}
score = 'f1_macro'

clf = GridSearchCV(RandomForestClassifier(random_state=22), parameters, scoring=score)
clf.fit(X_train, y_train)

print(clf.best_params_, clf.best_score_)  # {'class_weight': None, 'max_depth': 3, 'max_features': 'sqrt'} 0.07822504011047318

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))

{'class_weight': None, 'max_depth': 3, 'max_features': 'sqrt'} 0.07822504011047318


In [20]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.01156642798880111
Precision - macro: 0.1253681719726464

Recall - weighted: 0.004809601270605715
Recall - macro: 0.053879138459085714

F1 Score - weighted: 0.005548354816765824
F1 Score - macro: 0.06240011578730738

Accuracy - normalized: 0.003955265488176839
Accuracy - unnormalized: 603


#### 2.1 All Labels

#### Train & Predict

The Random Forest Classifier with the default parameters (documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)) performed better than other combinations of parameters tried using Grid Search, so we'll use the default parameters for the baseline model:

In [16]:
clf = MajorityVotingClassifier(
    clusterer = FixedLabelSpaceClusterer(clusters = [[1,2,3], [0, 2, 5], [4, 5]]),  # what's a good way to decide on the clusters?
    classifier = ClassifierChain(classifier=RandomForestClassifier(random_state=22))
)
clf.fit(X_train, y_train)

In [17]:
predictions = clf.predict(X_dev)

#### Evaluate

In [18]:
print("Precision - weighted:", metrics.precision_score(y_dev, predictions, average="weighted", zero_division=0))
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print()
print("Recall - weighted:", metrics.recall_score(y_dev, predictions, average="weighted", zero_division=0))
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("F1 Score - weighted:", metrics.f1_score(y_dev, predictions, average="weighted", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print()
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - weighted: 0.015679590603897366
Precision - macro: 0.22096954431353613

Recall - weighted: 0.014409591023785431
Recall - macro: 0.19350795123474884

F1 Score - weighted: 0.014782658612655562
F1 Score - macro: 0.201702022278094

Accuracy - normalized: 0.00846807254599718
Accuracy - unnormalized: 1291


In [24]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O")
pred_df.head()

In [25]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [29]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative


Save the data:

In [31]:
Path(config.tokc_path+"multilabel_model_output/").mkdir(parents=True, exist_ok=True)
eval_df.to_csv(config.tokc_path+"multilabel_model_output/mvc_cc-rf_baseline_predictions.csv")

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [134]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,13125,9715,140771,2250,0.22097,0.193508,0.201702
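
For reference, precision, recall, and F1 can also be derived directly from the pooled counts in this row. Doing so gives lower values than the `precision`, `recall`, and `f1` columns, which match the macro-averaged scores printed in the Evaluate step, so `utils.getAgreementStatsForAllTags` presumably takes them from `y_dev` and `predictions` rather than from the counts. A minimal sketch assuming the standard definitions (`scores_from_counts` is a hypothetical name, not part of `utils`):

```python
def scores_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) computed directly from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts copied from the "all" row above
precision, recall, f1 = scores_from_counts(tp=2250, fp=9715, fn=13125)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.188 0.146 0.165
```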


Calculate precision, recall, and F1 score at the token level for each tag:

In [135]:
label_tags = [ 
    'B-Unknown', 'I-Unknown', 'B-Feminine', 'I-Feminine', 'B-Masculine',  'I-Masculine',
    'B-Gendered-Pronoun', 'I-Gendered-Pronoun','B-Gendered-Role', 'I-Gendered-Role', 
    'B-Generalization', 'I-Generalization', 
    'B-Stereotype', 'I-Stereotype', 'B-Omission', 'I-Omission', 'B-Occupation', 'I-Occupation'
]

In [None]:
for label_tag in label_tags:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Save the data:

In [137]:
Path(config.tokc_path+"multilabel_model_performance/").mkdir(parents=True, exist_ok=True)
agmt_stats.to_csv(config.tokc_path+"multilabel_model_performance/mvc_cc-rf_baseline_strict_agmt.csv")

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [141]:
label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],
    "Generalization": ["B-Generalization", "I-Generalization"], 
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"], "Occupation": ["B-Occupation", "I-Occupation"]
             }

In [151]:
loose_eval_df = eval_df.copy()
for label,tags in label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,Gendered Pronoun,Gendered Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,Gendered Pronoun,Gendered Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative
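
The tag-to-label replacement above can also be written as a single function, which avoids looping over every tag with `replace`. This is a sketch that assumes all tags follow the pattern seen in `label_tags` (a `B-` or `I-` prefix, then the label name with hyphens in place of spaces); `tag_to_label` is a hypothetical name.

```python
def tag_to_label(tag, no_tag="O"):
    """Map an IOB tag such as 'B-Gendered-Pronoun' to its label name 'Gendered Pronoun'."""
    if not isinstance(tag, str) or tag == no_tag:
        return tag  # leave missing values and the no-tag value untouched
    prefix, _, name = tag.partition("-")
    if prefix not in ("B", "I") or not name:
        return tag  # leave anything unexpected untouched
    return name.replace("-", " ")

print(tag_to_label("B-Gendered-Pronoun"))  # Gendered Pronoun
print(tag_to_label("I-Unknown"))           # Unknown
print(tag_to_label("O"))                   # O
```

On the DataFrame, this could be applied with `loose_eval_df["expected_tag"].map(tag_to_label)` and likewise for `predicted_tag`.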


In [154]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
# loose_eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,Gendered Pronoun,Gendered Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,Gendered Pronoun,Gendered Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative


In [156]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [157]:
for label,tags in label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Unknown,4948.0,0.0,0.0,0.0,0.0,0.0,0.0
0,Feminine,780.0,0.0,0.0,352.0,1.0,0.310954,0.474394
0,Masculine,1719.0,0.0,0.0,978.0,1.0,0.362625,0.532245
0,Gendered Pronoun,24.0,0.0,0.0,1470.0,1.0,0.983936,0.991903
0,Gendered Role,258.0,0.0,0.0,870.0,1.0,0.771277,0.870871
0,Generalization,322.0,0.0,0.0,126.0,1.0,0.28125,0.439024
0,Stereotype,973.0,0.0,0.0,0.0,0.0,0.0,0.0
0,Omission,2510.0,0.0,0.0,0.0,0.0,0.0,0.0
0,Occupation,1083.0,0.0,0.0,704.0,1.0,0.393956,0.565235


Save the data:

In [162]:
loose_agmt.to_csv(config.tokc_path+"multilabel_model_performance/mvc_cc-rf_baseline_loose_agmt.csv")

<a id="C"></a>
## Appendix C: Scratch Work (Logistic Regression Model)

#### Feature Engineering

In [118]:
# class CountWordCaps(BaseEstimator, TransformerMixin):
# """ Model that extracts a counter of capital words from text. """
#     def fit(self, X, y=None):
#         return self
#    def transform(self, texts):
#         """ transform data :texts: The texts to count capital words in :returns: list of counts for each text """
#         return [[sum(w.isupper() for w in nltk.word_tokenize(text))] for text in texts]
    
class GloveTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
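    def transform(self, sentence):
        # Returns a nested list (one list of embeddings per entry), not a 2D array,
        # which is likely why ColumnTransformer raises the "should be 2D" ValueError below
        return [[embedding_dict.get(t, np.nan) for t in token] for token in sentence]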
#         self.insert(0, "embedding", embeddings)
#         return self
#         return pd.DataFrame({"embedding":embeddings})

In [119]:
col_transformer = ColumnTransformer(
    [
#         ('pos_encodings', OneHotEncoder(), ['pos']),
        ('token_embeddings', GloveTransformer(), 'token'),
    ],
    remainder='passthrough'
)

In [120]:
# v = DictVectorizer(sparse=True)
mlb = MultiLabelBinarizer()
labels2numbers = LabelEncoder()

In [121]:
feature_cols = ["sentence_id", "token"]  # "pos" and "lemma" are excluded for now
target_col = "tag"
labels = list(np.unique(df_train.tag))
labels2numbers = LabelEncoder()
y = labels2numbers.fit_transform(labels)
label_to_no = dict(zip(labels,list(y)))
no_to_label = dict(zip(list(y),labels))
print(label_to_no)

{'B-Feminine': 0, 'B-Gendered-Pronoun': 1, 'B-Gendered-Role': 2, 'B-Generalization': 3, 'B-Masculine': 4, 'B-Occupation': 5, 'B-Omission': 6, 'B-Stereotype': 7, 'B-Unknown': 8, 'I-Feminine': 9, 'I-Gendered-Pronoun': 10, 'I-Gendered-Role': 11, 'I-Generalization': 12, 'I-Masculine': 13, 'I-Occupation': 14, 'I-Omission': 15, 'I-Stereotype': 16, 'I-Unknown': 17, 'O': 18}


In [132]:
# X_train = train_sentences["sentence"]#df_train[feature_cols]
# # X_train = col_transformer.fit_transform(X_train)
# y_train = list(train_sentences["tag"])#df_train[target_col].values
y_train_numeric = [[tuple((label_to_no[label] for label in labels)) for labels in labels_list] for labels_list in y_train]
# y_train_numeric = utils.getNumericLabels(y_train, label_to_no)  # Convert the string labels to numeric labels
y_train_binarized = mlb.fit_transform(y_train_numeric)          # Convert each iterable of iterables above to a multilabel format
# print(X_train.shape, y_train.shape)
print(y_train_numeric[0])
print(y_train_binarized[0])

[(18,), (18,), (18,), (18,), (8, 4, 7), (17, 16, 13), (17, 13, 8, 16), (16, 13, 17, 17), (13, 17, 16, 17), (17, 13, 17, 16), (18,), (18,), (18,)]
[0 0 0 ... 0 1 0]


In [137]:
i = 0
print(y_train_numeric[i])
# print(y_train_binarized[i])
y_train_binarized[i][4]

[(18,), (18,), (18,), (18,), (8, 4, 7), (17, 16, 13), (17, 13, 8, 16), (16, 13, 17, 17), (13, 17, 16, 17), (17, 13, 17, 16), (18,), (18,), (18,)]


0

In [91]:
X_dev = df_dev[feature_cols]
# X_dev = col_transformer.transform(X_dev)  # No fit?
y_dev = df_dev[target_col].values
y_dev_numeric = utils.getNumericLabels(y_dev, label_to_no)  # Make numeric
y_dev_binarized = mlb.transform(y_dev_numeric)              # Binarize
print(y_dev.shape, y_dev_binarized.shape)

(157740,) (157740, 19)


In [92]:
# assert X_dev.shape[1] == X_train.shape[1], "The train and dev data must have the same number of columns."

#### Train the Model

In [93]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [102]:
pipeline = Pipeline([
    ("col_transformer", col_transformer),
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("classifier", log_reg)
])

In [103]:
clf = pipeline.fit(X_train, y_train_binarized)

ValueError: The output of the 'token_embeddings' transformer should be 2D (scipy matrix, array, or pandas DataFrame).

#### Predict

In [109]:
predicted_dev = clf.predict(X_dev)
print(predicted_dev[0])

ValueError: Found unknown categories ['……', '..', '{'] in column 0 during transform

#### Evaluate Model Performance

In [18]:
original_labels = mlb.classes_
dev_matrix = multilabel_confusion_matrix(y_dev_binarized, predicted_dev, labels=mlb.classes_)
df_dev_perf = utils.getPerformanceMetrics(y_dev_binarized, predicted_dev, dev_matrix, mlb.classes_, original_labels, no_to_label)
df_dev_perf

Unnamed: 0,labels,true_neg,false_neg,true_pos,false_pos,precision,recall,f_1
0,B-Feminine,157417,323,0,0,0.0,0.0,0.0
1,B-Gendered-Pronoun,156996,744,0,0,0.0,0.0,0.0
2,B-Gendered-Role,157150,590,0,0,0.0,0.0,0.0
3,B-Generalization,157495,245,0,0,0.0,0.0,0.0
4,B-Masculine,156716,1024,0,0,0.0,0.0,0.0
5,B-Occupation,157085,655,0,0,0.0,0.0,0.0
6,B-Omission,156658,1082,0,0,0.0,0.0,0.0
7,B-Stereotype,157481,259,0,0,0.0,0.0,0.0
8,B-Unknown,155680,2060,0,0,0.0,0.0,0.0
9,I-Feminine,156894,846,0,0,0.0,0.0,0.0


In [19]:
print("Dev Accuracy (all labels) on `token` col:", np.mean(predicted_dev == y_dev_binarized))

Dev Accuracy (all labels) on `token` col: 0.9890666186195806


Try using cross-validation (stratified k fold, where k=3) with Logistic Regression:

In [20]:
# k = 3 # number of folds

In [21]:
# log_reg_cv = OneVsRestClassifier(LogisticRegressionCV(
#     solver="liblinear", multi_class="ovr", cv=k, scoring="f1", random_state=22  # max_iter=500, --> default is 100 iterations
# ))
# clf1 = log_reg_cv.fit(X_train, y_train_binarized)
# pred1_dev = clf1.predict(X_dev)
# # print("Dev Accuracy (all labels) on `lemma` col:", np.mean(pred1_dev == y_dev))  # 90%
# # print("Dev Accuracy (all labels) on `token` col:", np.mean(pred1_dev == y_dev))  # 90%
# # from sklearn import metrics
# # metrics.SCORERS.keys()
# # score() takes the features and the true labels, not the predictions
# # print("Accuracy:", clf1.score(X_dev, y_dev_binarized))

In [22]:
# original_labels = mlb.classes_
# dev_matrix1 = multilabel_confusion_matrix(y_dev_binarized, pred1_dev, labels=mlb.classes_)
# df_dev_perf1 = utils.getPerformanceMetrics(y_dev_binarized, pred1_dev, dev_matrix1, mlb.classes_, original_labels, no_to_label)
# df_dev_perf1

**QUESTION:** Are scores averaged across the 3 folds?
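
As far as the scikit-learn documentation describes it, `LogisticRegressionCV` with the default `refit=True` averages the scores across folds to pick the best `C`, then refits on all the training data. One way to see per-fold scores explicitly is `cross_val_score`, which returns one score per fold and leaves the averaging to the caller. The sketch below uses a synthetic binary dataset from `make_classification` as a stand-in, not the project's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data as a stand-in for a single label column
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=22)

k = 3  # number of folds
folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=22)

# One F1 score per fold; nothing is averaged until we do it ourselves
fold_f1 = cross_val_score(
    LogisticRegression(solver="liblinear", random_state=22),
    X_demo, y_demo, cv=folds, scoring="f1",
)
mean_f1 = fold_f1.mean()

print("F1 per fold:", fold_f1)
print("Mean F1:", mean_f1)
```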

<a id="1.1"></a>
## 1.1. With Word Embeddings

In [88]:
from sklearn.base import BaseEstimator, TransformerMixin
# # Model that assigns a GloVe embedding to a token.
# class GloveVectorizer(BaseEstimator, TransformerMixin):
#     def fit(self, X, y=None):
#         return self
    
#     # Transform tokens from the input data to their corresponding GloVe embedding
#     def transform(self, tokens):
#         return [words_to_vectors[token] for token in tokens]
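class EmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Map each token in the `token` column of X to its embedding vector."""
    def __init__(self, embedding_dict=None):
        # Falls back to the global `words_to_vectors` when no dictionary is given
        self.embedding_dict = embedding_dict
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Assumes X is a DataFrame with a `token` column
        embedding_dict = words_to_vectors if self.embedding_dict is None else self.embedding_dict
        # Raises KeyError for any token missing from the dictionary
        embeddings = [embedding_dict[token] for token in X["token"]]
        # Stack into a 2D array of shape (n_tokens, embedding_dim)
        return np.vstack(embeddings)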

# embedding_transformer = FunctionTransformer(embedding_extractor)

class PosTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # STOPPED HERE: unfinished, currently returns None
        # use multilabel binarizer?
        # REFERENCE: https://stackoverflow.com/questions/56774862/how-to-add-a-feature-using-a-pipeline-and-featureunion
        
        return #np.array(X["pos"])

# pos_transformer = FunctionTransformer(pos_extractor)

In [82]:
log_reg = OneVsRestClassifier(LogisticRegression(solver="liblinear", multi_class="ovr", random_state=22))

In [90]:
feature_union = FeatureUnion([("embeddings", EmbeddingTransformer), ("pos", PosTransformer)])

In [91]:
pipe = Pipeline([("features", feature_union), ("classifier", log_reg)])

In [100]:
pipe.fit(X_train, y_train_binarized)

AttributeError: fit not found
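
The cause of the traceback above isn't confirmed here. Separately from it, `FeatureUnion` expects transformer instances rather than classes, and each `transform` method needs to accept the input `X` and return a 2D array. A minimal sketch under those assumptions follows. The toy embedding dictionary, toy data, `Toy*` class names, and the one-hot encoding of part-of-speech tags are all hypothetical illustrations, not the notebook's finished design.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical stand-ins for `words_to_vectors` and the training data
toy_vectors = {"she": np.array([1.0, 0.0]), "he": np.array([0.0, 1.0]), "ran": np.array([0.5, 0.5])}
toy_X = pd.DataFrame({"token": ["she", "he", "ran", "she"], "pos": ["PRON", "PRON", "VERB", "PRON"]})
toy_y = np.array([1, 1, 0, 1])

class ToyEmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, embedding_dict):
        self.embedding_dict = embedding_dict

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per token, one column per embedding dimension
        return np.vstack([self.embedding_dict[t] for t in X["token"]])

class ToyPosTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the tags seen during fitting, in a fixed order
        self.tags_ = sorted(X["pos"].unique())
        return self

    def transform(self, X):
        # One-hot encode each tag against the fitted tag list
        return np.array([[float(p == tag) for tag in self.tags_] for p in X["pos"]])

toy_union = FeatureUnion([
    ("embeddings", ToyEmbeddingTransformer(toy_vectors)),  # instances, not classes
    ("pos", ToyPosTransformer()),
])
toy_pipe = Pipeline([("features", toy_union), ("classifier", LogisticRegression())])
toy_pipe.fit(toy_X, toy_y)

toy_features = toy_union.fit_transform(toy_X)
toy_pred = toy_pipe.predict(toy_X)
```

The union concatenates the two feature blocks column-wise, so here each token ends up with two embedding columns followed by two part-of-speech columns.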

In [75]:
# ColumnTransformer expects a list of (name, transformer, columns) tuples
col_transformer = ColumnTransformer([("vectorizer", GloveVectorizer(), "token")])

In [None]:
features = FeatureUnion(["vectorizer"])

In [78]:
# vectorizer = Pipeline(
#     ["col_selector", FunctionTransformer(lambda df: df["token"])],
#     ["vectorizer", GloveVectorizer()]
# )
pipe = Pipeline(
    ["transformer", col_transformer],
    ["estimator", log_reg]
)

TypeError: 'ColumnTransformer' object is not iterable

In [55]:
# ColumnTransformer expects a list of (name, transformer, columns) tuples
col_transformer = ColumnTransformer([("vectorizer", vectorizer, "token")])

In [58]:
pipe = Pipeline(
    ["transformer", col_transformer],
    ["clf", log_reg]
)

TypeError: 'ColumnTransformer' object is not iterable
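
Both `Pipeline` attempts above pass each step as a separate list, so `Pipeline` receives `["transformer", col_transformer]` as its `steps` argument and tries to unpack the `ColumnTransformer` object itself as a `(name, estimator)` pair. `Pipeline` takes a single list of `(name, estimator)` tuples, and `ColumnTransformer` takes a single list of `(name, transformer, columns)` tuples. A minimal sketch of that syntax follows, with `OneHotEncoder` swapped in for the embedding vectorizer and hypothetical toy data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for X_train and one label column
demo_X = pd.DataFrame({"token": ["she", "he", "ran", "she"]})
demo_y = np.array([1, 1, 0, 1])

# One list of (name, transformer, columns) tuples;
# OneHotEncoder needs 2D input, so the column is given as a list
demo_col_transformer = ColumnTransformer([
    ("vectorizer", OneHotEncoder(), ["token"]),
])

# One list of (name, estimator) tuples
demo_pipe = Pipeline([
    ("transformer", demo_col_transformer),
    ("estimator", LogisticRegression()),
])

demo_pipe.fit(demo_X, demo_y)
demo_pred = demo_pipe.predict(pd.DataFrame({"token": ["he", "ran"]}))
```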