# Experiment 2

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multilabel Linguistic Classifier
3. Multilabel Stereotype + Omission Document Classifier

Train the first model and then run it over the entire dataset.

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
    
***

**Table of Contents**

[I.](#i) Train the Stereotype + Omission Classifier

[II.](#ii) Predict Over All Data

Load programming resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For classification
import scipy
import sklearn.metrics
from sklearn.multiclass import OneVsRestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer, FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix#, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support

Define resources for the models:

In [2]:
### For 60-20-20 train-dev-test split of documents (metadata descriptions)
Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data
Path(config.experiment1_output_path).mkdir(parents=True, exist_ok=True)  # For predictions
Path(config.experiment1_agmt_path).mkdir(parents=True, exist_ok=True)    # For agreement metrics
# ----------------------
### For features from modified 5-fold CV
agreement_dir = config.experiment2_path+"5fold/"
Path(agreement_dir).mkdir(parents=True, exist_ok=True)    # For agreement metrics

In [3]:
# Model 1:
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# Model 3:
so_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]

In [4]:
ling_label_tags = {
    "Gendered-Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered-Role": ["B-Gendered-Role", "I-Gendered-Role"],"Generalization": ["B-Generalization", "I-Generalization"]
    }
so_label_tags = {
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"]
             }

In [5]:
d = 100               # dimensions of word embeddings (should match utils1.py) for file names
target_labels = "so"  # for file names

<a id="i"></a>
## I. Train the Stereotype + Omission Classifier

### Preprocessing

Load the document classification model's input data:

In [6]:
train = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_train.csv".format(target_labels), index_col=0)
dev = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_validate.csv".format(target_labels), index_col=0)
test = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_test.csv".format(target_labels), index_col=0)
df_exp = pd.concat([train, dev, test])
df_exp["label"] = df_exp["label"].fillna("{'None'}")
df_exp = df_exp.loc[~df_exp.description.isna()]
df_exp = utils.getColumnValuesAsLists(df_exp, "label")
# df_exp.head()

Load the predicted Linguistic labels to use as features, then associate description IDs with them, creating one row per description ID:

In [7]:
### For 40-40-20 train-dev-test split
# features_filename = "crf_{a}_{t}_baseline_fastText{d}_predictions_ALLDATA.csv".format(a="arow", t="pers_o", d=d)
# df_features = pd.read_csv(config.experiment1_output_path+features_filename, usecols=["sentence_id", "token_id", "pred_ling_tag"])
# df_features = utils.getColumnValuesAsLists(df_features, "pred_ling_tag")
# df_features = df_features.rename(columns={"pred_ling_tag":"ling_pred"})
# ----------------------
### For features from modified 5-fold CV
a="rf"
df_features = pd.read_csv(agreement_dir+"cc-{a}_ling_baseline_fastText{d}_evaluation_loose.csv".format(a=a,d=d), usecols=["sentence_id", "token_id", "predicted_tag"])
df_features = df_features.rename(columns={"predicted_tag":"ling_pred"})
df_features = utils.implodeDataFrame(df_features, ["sentence_id", "token_id"]).reset_index()
df_features.head()

Unnamed: 0,sentence_id,token_id,ling_pred
0,4,134,[Gendered-Pronoun]
1,4,148,[Gendered-Pronoun]
2,5,155,[Gendered-Pronoun]
3,5,157,[Gendered-Pronoun]
4,6,211,[Gendered-Pronoun]


In [8]:
df_desc = pd.read_csv(config.agg_path+"descs_sents_tokens_anns.csv", usecols=["description_id", "sentence_id", "token_id"])
df_desc = df_desc.set_index("description_id")
df_desc = utils1.getColumnValuesAsLists(df_desc, "sentence_id")
df_desc = utils1.getColumnValuesAsLists(df_desc, "token_id")
df_desc_exploded = df_desc.explode(["sentence_id", "token_id"])
df_desc_exploded = df_desc_exploded.reset_index()
df_desc_exploded = df_desc_exploded.astype("int64")
# df_desc_exploded.head()
assert df_desc_exploded.shape[0] == len(df_desc_exploded.token_id.unique())

In [9]:
joined = df_features.join(df_desc_exploded.set_index(["sentence_id", "token_id"]), on=["sentence_id", "token_id"])
grouped = utils.implodeDataFrame(joined, ["description_id"]).reset_index()
grouped.head()

Unnamed: 0,description_id,sentence_id,token_id,ling_pred
0,3,"[4, 4, 5, 5, 6, 7, 7]","[134, 148, 155, 157, 211, 216, 226]","[[Gendered-Pronoun], [Gendered-Pronoun], [Gend..."
1,7,"[16, 17, 19, 19, 22, 22, 23, 23, 24, 24, 28]","[435, 478, 533, 539, 598, 618, 634, 643, 668, ...","[[Gendered-Pronoun], [Gendered-Pronoun], [Gend..."
2,11,"[32, 33, 33, 33, 34, 34, 37, 37, 38, 38, 39, 4...","[856, 875, 876, 883, 888, 902, 960, 973, 990, ...","[[Gendered-Role], [Gendered-Pronoun], [Gendere..."
3,100,"[134, 134]","[1840, 1842]","[[O], [Gendered-Pronoun]]"
4,155,[189],[2176],[[Gendered-Role]]
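
The join and grouping steps above can be illustrated on a toy example. This is only a sketch: `groupby(...).agg(list)` is a stand-in assumed to behave like `utils.implodeDataFrame`, whose implementation is not shown in this notebook, and the toy IDs are made up.

```python
import pandas as pd

# Toy token-level predictions (one row per token), mirroring df_features
toy_features = pd.DataFrame({
    "sentence_id": [4, 4, 5],
    "token_id": [134, 148, 155],
    "ling_pred": [["Gendered-Pronoun"], ["Gendered-Pronoun"], ["Gendered-Role"]],
})

# Toy lookup from (sentence_id, token_id) to description_id, mirroring df_desc_exploded
toy_lookup = pd.DataFrame({
    "description_id": [3, 3, 7],
    "sentence_id": [4, 4, 5],
    "token_id": [134, 148, 155],
})

# Attach the description ID to every token row
toy_joined = toy_features.join(
    toy_lookup.set_index(["sentence_id", "token_id"]),
    on=["sentence_id", "token_id"],
)

# Collapse to one row per description, gathering the other columns into lists
# (assumed equivalent of utils.implodeDataFrame)
toy_grouped = toy_joined.groupby("description_id").agg(list).reset_index()
print(toy_grouped)
```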


Flatten the lists of values in the `ling_pred` column and remove duplicates from the lists:

In [10]:
feature_col1 = "ling_pred"

In [11]:
ling = utils1.flattenFeatureCol(grouped, feature_col1)

In [12]:
grouped.insert(len(grouped.columns), "doc_"+feature_col1, ling)
# grouped.head()
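
For readers without access to `utils1.py`, the flattening step might look roughly like the sketch below. The function name `flatten_unique` and its `exclude` parameter are hypothetical. Dropping the `O` tag is an assumption, made because the binarized feature matrix later in this notebook has only three columns.

```python
def flatten_unique(nested, exclude=("O",)):
    """Flatten a list of lists of tags, keeping each tag once in first-seen order.

    Hypothetical stand-in for utils1.flattenFeatureCol applied to a single row.
    Tags listed in `exclude` are dropped (assumption: "O" is not used as a feature).
    """
    seen = []
    for inner in nested:
        for tag in inner:
            if tag not in exclude and tag not in seen:
                seen.append(tag)
    return seen


toy_row = [["Gendered-Role"], ["O"], ["Gendered-Pronoun"], ["Gendered-Role"]]
print(flatten_unique(toy_row))
```

Applied to a DataFrame, the same function could be mapped over the `ling_pred` column to produce one deduplicated list per description.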

Join the Linguistic feature column to the document classification model data:

In [29]:
features = grouped[["description_id", "doc_"+feature_col1]]
join_on = "description_id"
df = df_exp.join(features.set_index(join_on), on=join_on)
df = df.loc[~df.description.isna()]
df.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,label,doc_ling_pred
4699,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,[Omission],[Gendered-Role]
8942,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,[None],[Gendered-Pronoun]
5440,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,[None],
3474,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,"[Omission, Stereotype]","[Gendered-Role, Generalization, Gendered-Pronoun]"
4769,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,train,[Omission],[Gendered-Pronoun]


Replace `NaN` values in the last column with an empty list (descriptions with no predicted Linguistic labels have no match in the join):

In [31]:
doc_ling_pred = list(df.doc_ling_pred)
new_doc_ling_pred = [preds if isinstance(preds, list) else [] for preds in doc_ling_pred]
# new_doc_ling_pred[:5]  # Looks good
df["doc_"+feature_col1] = new_doc_ling_pred
df.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,label,doc_ling_pred
4699,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,[Omission],[Gendered-Role]
8942,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,[None],[Gendered-Pronoun]
5440,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,[None],[]
3474,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,"[Omission, Stereotype]","[Gendered-Role, Generalization, Gendered-Pronoun]"
4769,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,train,[Omission],[Gendered-Pronoun]


Vectorize the documents, and binarize the features and targets:

In [32]:
def binarizeMultilabelTrainColumn(df_col):
    mlb = MultiLabelBinarizer()
    binarized = mlb.fit_transform(df_col)
    return mlb, binarized

def binarizeMultilabelDevColumn(mlb, df_col):
    binarized = mlb.transform(df_col)
    return binarized

In [33]:
train_df = df.loc[df.subset == "train"]
dev_df = df.loc[df.subset == "dev"]
target_col = "label"
feat1_col = "doc_ling_pred"

In [34]:
mlb_target, y_train = binarizeMultilabelTrainColumn(train_df["label"])
y_dev = binarizeMultilabelDevColumn(mlb_target, dev_df["label"])
print(y_train.shape, y_dev.shape)

(16397, 3) (5452, 3)


In [35]:
mlb_feat1, train_feat1 = binarizeMultilabelTrainColumn(train_df[feat1_col])
dev_feat1 = binarizeMultilabelDevColumn(mlb_feat1, dev_df[feat1_col])
print(train_feat1.shape, dev_feat1.shape)

(16397, 3) (5452, 3)
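
As a small illustration of the fit-on-train, transform-on-dev pattern used above, the sketch below binarizes toy label lists. The toy data are made up; only `MultiLabelBinarizer` itself comes from the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_train_labels = [["Omission"], ["None"], ["Omission", "Stereotype"]]
toy_dev_labels = [["Stereotype"], ["Omission"]]

toy_mlb = MultiLabelBinarizer()
toy_y_train = toy_mlb.fit_transform(toy_train_labels)  # learns the class order
toy_y_dev = toy_mlb.transform(toy_dev_labels)          # reuses that class order

print(toy_mlb.classes_)  # classes are stored in sorted order
print(toy_y_train)
print(toy_y_dev)
```

Fitting only on the training labels keeps the column order identical across splits, which is why the train and dev matrices above share the same number of columns.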


In [36]:
cvectorizer = CountVectorizer()
tfidf = TfidfTransformer()
train_docs = cvectorizer.fit_transform(train_df["description"])
dev_docs = cvectorizer.transform(dev_df["description"])
train_docs = tfidf.fit_transform(train_docs)
dev_docs = tfidf.transform(dev_docs)
print(train_docs.shape, dev_docs.shape)

(16397, 26960) (5452, 26960)
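
The two-step vectorization above (`CountVectorizer` followed by `TfidfTransformer`) should give the same weights as a single `TfidfVectorizer` with default settings. The sketch below checks this on made-up sentences; it is an illustration, not part of the experiment.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

toy_train_docs = [
    "she kept his letters",
    "he signed his name",
    "the letters were published",
]
toy_dev_docs = ["she signed the letters"]

# Two steps, as in the notebook: raw counts, then TF-IDF weighting
toy_counts = CountVectorizer()
toy_weights = TfidfTransformer()
two_step_train = toy_weights.fit_transform(toy_counts.fit_transform(toy_train_docs))
two_step_dev = toy_weights.transform(toy_counts.transform(toy_dev_docs))

# One step: TfidfVectorizer combines both with the same defaults
toy_one_step = TfidfVectorizer()
one_step_train = toy_one_step.fit_transform(toy_train_docs)
one_step_dev = toy_one_step.transform(toy_dev_docs)

same_train = np.allclose(two_step_train.toarray(), one_step_train.toarray())
same_dev = np.allclose(two_step_dev.toarray(), one_step_dev.toarray())
print(same_train, same_dev)
```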


In [37]:
train_feats = scipy.sparse.csr_matrix(train_feat1)
dev_feats = scipy.sparse.csr_matrix(dev_feat1)

Concatenate the documents and features, creating one scipy sparse matrix for the train data and another for the dev data:

In [38]:
X_train = scipy.sparse.hstack([train_docs, train_feats])
X_dev = scipy.sparse.hstack([dev_docs, dev_feats])
print(X_train.shape, X_dev.shape)

(16397, 26963) (5452, 26963)
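
The column count above is the 26,960 TF-IDF columns plus the 3 Linguistic label columns. A toy version of the same concatenation is sketched below; the values are made up, and passing `format="csr"` is an optional choice shown here because `scipy.sparse.hstack` otherwise returns a COO matrix.

```python
import numpy as np
import scipy.sparse

# Toy TF-IDF block: 2 documents x 4 vocabulary terms
toy_text_block = scipy.sparse.csr_matrix(
    np.array([[0.5, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
)
# Toy binarized label-feature block: 2 documents x 3 labels
toy_label_block = scipy.sparse.csr_matrix(
    np.array([[1, 0, 0],
              [0, 1, 1]])
)

# Stack the blocks side by side; rows (documents) must line up
toy_combined = scipy.sparse.hstack([toy_text_block, toy_label_block], format="csr")
print(toy_combined.shape)
```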


### Train & Predict

In [39]:
a = "sgd-svm"

Build a pipeline:

In [40]:
doc_clf = Pipeline([
    ("clf", OneVsRestClassifier(SGDClassifier(loss="hinge")))  # Support Vector Machines loss function
])

In [41]:
doc_clf.fit(X_train, y_train)
predicted_dev = doc_clf.predict(X_dev)
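
To make the one-vs-rest setup concrete, the sketch below fits the same kind of classifier on synthetic data. The random features, the rule generating the labels, and `random_state=0` are all assumptions made for reproducibility; none of them come from the experiment.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
# Three indicator columns, each driven by the sign of one feature
Y_toy = (X_toy[:, :3] > 0).astype(int)

# One linear SVM (hinge loss, trained with SGD) is fitted per label column
toy_ovr = OneVsRestClassifier(SGDClassifier(loss="hinge", random_state=0))
toy_ovr.fit(X_toy, Y_toy)
toy_pred = toy_ovr.predict(X_toy)

print(len(toy_ovr.estimators_))  # one binary classifier per label
print(toy_ovr.classes_)          # column indices of the indicator matrix
print(toy_pred.shape)
```

Because each label gets its own binary classifier, a document can receive any combination of labels, including none.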

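### Performance

Calculate performance metrics for the Stochastic Gradient Descent classifier. The accuracy printed first is the share of individual label indicators (one per document and class) that match, not the share of documents whose full label set is predicted exactly: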

In [42]:
print("Dev Test Accuracy:", np.mean(predicted_dev == y_dev))

Dev Test Accuracy: 0.9397163120567376
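
The difference between element-wise accuracy (as computed with `np.mean` above) and exact-match accuracy can be seen on a toy example. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

toy_true = np.array([[1, 0, 0],
                     [0, 1, 1],
                     [0, 1, 0],
                     [1, 0, 0]])
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0]])

# Share of matching cells in the indicator matrix (9 of 12 here)
elementwise_acc = np.mean(toy_pred == toy_true)
# Share of rows predicted exactly (2 of 4 here); sklearn's multilabel default
subset_acc = accuracy_score(toy_true, toy_pred)
# Hamming loss is one minus the element-wise accuracy
ham = hamming_loss(toy_true, toy_pred)

print(elementwise_acc, subset_acc, ham)
```

Element-wise accuracy is usually the higher of the two, since a document with one wrong label still contributes its correct labels.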


In [43]:
classes = doc_clf.classes_
print(classes)
original_classes = mlb_target.classes_
print(original_classes)
label_dict = dict(zip(original_classes, classes))

[0 1 2]
['None' 'Omission' 'Stereotype']


Create a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-confusion-matrix) of the results, where, for class *i*:
* Count of true negatives (TN) is at position *i*,0,0
* Count of false negatives (FN) is at position *i*,1,0
* Count of true positives (TP) is at position *i*,1,1
* Count of false positives (FP) is at position *i*,0,1
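
The layout described above can be checked on a toy example. The arrays are made up; the indexing follows scikit-learn's documented `[[TN, FP], [FN, TP]]` layout per class.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

toy_true = np.array([[1, 0, 0],
                     [0, 1, 1],
                     [0, 1, 0],
                     [1, 0, 0]])
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0]])

toy_matrix = multilabel_confusion_matrix(toy_true, toy_pred)

# Unpack the four counts for each class i
for i, class_matrix in enumerate(toy_matrix):
    tn, fp = class_matrix[0, 0], class_matrix[0, 1]
    fn, tp = class_matrix[1, 0], class_matrix[1, 1]
    print(f"class {i}: TN={tn} FP={fp} FN={fn} TP={tp}")
```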

In [44]:
matrix = multilabel_confusion_matrix(y_dev, predicted_dev, labels=classes)

In [45]:
scores = utils.getPerformanceMetrics(y_dev, predicted_dev, matrix, classes, original_classes, label_dict)
scores = scores.tail(2) # Remove row for 'None'
scores = scores.drop(columns="true_neg")  # Not accurate because considers 'None' a class
scores["labels"] = original_classes[1:]
scores

Unnamed: 0,labels,false_neg,true_pos,false_pos,precision,recall,f_1
1,Omission,357,447,93,0.827778,0.55597,0.665179
2,Stereotype,87,228,21,0.915663,0.72381,0.808511
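
The precision, recall, and F1 values in the table follow directly from the counts. The sketch below recomputes them; `prf` is a hypothetical helper written for this check and is not the `utils.getPerformanceMetrics` function.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the dev-set table above
omission = prf(tp=447, fp=93, fn=357)
stereotype = prf(tp=228, fp=21, fn=87)

print([round(v, 6) for v in omission])
print([round(v, 6) for v in stereotype])
```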


Save the performance results:

In [58]:
# dir_path = config.tokc_path+"/experiment2/40-40-20/output/"
dir_path = config.tokc_path+"/experiment2/5fold/agreement/"
Path(dir_path).mkdir(parents=True, exist_ok=True)
scores.to_csv(dir_path+"docclf_{a}_{t}_baseline_performance.csv".format(a=a, t=target_labels))

Convert the binarized predictions for the dev data back to label names:

In [48]:
pred_dev_labels = mlb_target.inverse_transform(predicted_dev)
# pred_dev_labels[0]
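
`MultiLabelBinarizer.inverse_transform` returns one tuple of label names per row, which is why predicted labels appear in tuple form such as `(Stereotype,)`. A toy illustration, with made-up indicator rows:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

toy_mlb = MultiLabelBinarizer()
toy_mlb.fit([["None"], ["Omission"], ["Stereotype"]])

toy_indicators = np.array([[0, 1, 0],
                           [0, 1, 1],
                           [0, 0, 0]])
toy_labels = toy_mlb.inverse_transform(toy_indicators)
print(toy_labels)
```

A row with no predicted class maps to an empty tuple, which is distinct from a row where the `None` class itself is predicted.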

Add the classifier's labels to the `aggregated_validate.csv` DataFrame of descriptions to facilitate error analysis:

In [49]:
dev_df = dev_df.rename(columns={"label":"manual_label"})
dev_df.insert(len(dev_df.columns), "{a}_label".format(a=a), pred_dev_labels)
dev_df.head()
# print(len(pred_dev_labels), dev_df.shape)

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,manual_label,doc_ling_pred,sgd-svm_label
5523,5523,367,1965,Biographical / Historical,"Edward Bald Jamieson, from Shetland, was a gra...",dev,[Stereotype],"[Generalization, Gendered-Pronoun]","(Stereotype,)"
4719,4719,5650,5811,Biographical / Historical,This likely refers to an article of the same t...,dev,[Omission],[],"(Omission,)"
735,735,7735,7881,Biographical / Historical,John Baillie kept a collection of the prayers ...,dev,[None],[Gendered-Pronoun],"(None,)"
2183,2183,1072,1372,Biographical / Historical,Joseph W. Hills graduated with the degree of M...,dev,[None],[Gendered-Pronoun],"(None,)"
2299,2299,546,3642,Biographical / Historical,This collection is composed simply of an invit...,dev,"[Omission, Stereotype]","[Gendered-Role, Gendered-Pronoun]","(Omission,)"


Save this version of the data:

In [59]:
dir_path = config.tokc_path+"/experiment2/5fold/output/"
Path(dir_path).mkdir(parents=True, exist_ok=True)
dev_df.to_csv(dir_path+"aggregated_final_validate_predictions_docclf_{a}_{t}.csv".format(a=a, t=target_labels))

<a id="ii"></a>
## II. Predict Over All Data

### Preprocessing
Vectorize the documents, and binarize the features and targets, reusing the vectorizer, TF-IDF transformer, and binarizers already fitted on the training data:

In [60]:
target_col = "label"
feat1_col = "doc_ling_pred"

In [61]:
y_all = binarizeMultilabelDevColumn(mlb_target, df["label"])
print(y_all.shape)

(27312, 3)


In [62]:
all_feat1 = binarizeMultilabelDevColumn(mlb_feat1, df[feat1_col])
print(all_feat1.shape)

(27312, 3)


In [63]:
all_docs = cvectorizer.transform(df["description"])
all_docs = tfidf.transform(all_docs)
print(all_docs.shape)

(27312, 26960)
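
Calling `transform` rather than `fit_transform` keeps the vocabulary learned from the training data, so the column count stays at 26,960 and words not seen during training are ignored. A toy sketch of that behavior, with made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vec = CountVectorizer()
toy_fit = toy_vec.fit_transform(["papers of the minister", "letters of his wife"])

# "diaries" was never seen during fitting, so it gets no column and is dropped
toy_new = toy_vec.transform(["diaries of the minister"])

print(toy_fit.shape, toy_new.shape)
print("diaries" in toy_vec.vocabulary_)
```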


In [64]:
all_feats = scipy.sparse.csr_matrix(all_feat1)

Concatenate the documents and features, creating a single scipy sparse matrix for all of the data:

In [65]:
X_all = scipy.sparse.hstack([all_docs, all_feats])
print(X_all.shape)

(27312, 26963)


### Predict

In [66]:
predicted_all = doc_clf.predict(X_all)

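### Performance

Calculate performance metrics for the Stochastic Gradient Descent classifier. Because `df` includes the documents the classifier was trained on, these scores are not a held-out estimate: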

In [67]:
print("Accuracy:", np.mean(predicted_all == y_all))

Accuracy: 0.951938098027729


In [68]:
matrix = multilabel_confusion_matrix(y_all, predicted_all, labels=classes)

In [69]:
scores = utils.getPerformanceMetrics(y_all, predicted_all, matrix, classes, original_classes, label_dict)
scores = scores.tail(2) # Remove row for 'None'
scores = scores.drop(columns="true_neg")  # Not accurate because considers 'None' a class
scores["labels"] = original_classes[1:]
scores

Unnamed: 0,labels,false_neg,true_pos,false_pos,precision,recall,f_1
1,Omission,1507,2525,328,0.885033,0.62624,0.733479
2,Stereotype,296,1305,63,0.953947,0.815116,0.879084


Save the performance results:

In [73]:
# dir_path = config.tokc_path+"/experiment2/40-40-20/agreement/"
dir_path = config.tokc_path+"/experiment2/5fold/agreement/"
scores.to_csv(dir_path+"docclf_{a}_{t}_baseline_performance_ALLDATA.csv".format(a=a, t=target_labels))

Convert the binarized predictions for all the data back to label names:

In [71]:
pred_all_labels = mlb_target.inverse_transform(predicted_all)

Add the classifier's labels to the DataFrame of all descriptions (train, dev, and test) to facilitate error analysis:

In [72]:
df = df.rename(columns={"label":"manual_label"})
df.insert(len(df.columns), "{a}_label".format(a=a), pred_all_labels)
df.head()
# print(len(pred_all_labels), df.shape)

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,manual_label,doc_ling_pred,sgd-svm_label
4699,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,[Omission],[Gendered-Role],"(Omission,)"
8942,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,[None],[Gendered-Pronoun],"(None,)"
5440,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,[None],[],"(None,)"
3474,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,"[Omission, Stereotype]","[Gendered-Role, Generalization, Gendered-Pronoun]","(Omission, Stereotype)"
4769,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,train,[Omission],[Gendered-Pronoun],"(Omission,)"


Save this version of the data:

In [74]:
# dir_path = config.tokc_path+"/experiment2/40-40-20/output/"
dir_path = config.tokc_path+"/experiment2/5fold/output/"
df.to_csv(dir_path+"aggregated_final_validate_predictions_docclf_{a}_{t}_ALLDATA.csv".format(a=a, t=target_labels))