# Experiment 1, Model 3

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multilabel Linguistic Classifier
2. Multiclass Person Name + Occupation Sequence Classifier
3. Multilabel Stereotype + Omission Document Classifier

Train the first model and then run it over the entire dataset.

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
    
***

**Table of Contents**

[I.](#i) Stereotype + Omission Classifier
* [Preprocessing](#prep)
* [Training & Prediction](#tp)
* [Evaluation](#eval)

[II.](#ii) Predict Over All Data

Load programming resources:

In [1]:
# For custom functions and variables
import utils, utils1, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For classification
import scipy
import sklearn.metrics
from sklearn.multiclass import OneVsRestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer, FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

Define resources for the models:

In [2]:
# Path(config.experiment_input_path).mkdir(parents=True, exist_ok=True)    # For train, devtest, and blind test data

# predictions_dir = config.experiment1_path+"5fold/output/"              # For predictions
# Path(predictions_dir).mkdir(parents=True, exist_ok=True)  
# agreement_dir = config.experiment1_path+"5fold/agreement/"             # For agreement metrics
# Path(agreement_dir).mkdir(parents=True, exist_ok=True)

predictions_dir = config.experiment1_path+"5fold/with_manual_labels/output/"              # For predictions with features as manual labels
Path(predictions_dir).mkdir(parents=True, exist_ok=True)  
agreement_dir = config.experiment1_path+"5fold/with_manual_labels/agreement/"             # For agreement metrics with features as manual labels
Path(agreement_dir).mkdir(parents=True, exist_ok=True)

In [3]:
# Model 1:
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# Model 2:
pers_o_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]
# Model 3:
so_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]

In [4]:
ling_label_tags = {
    "Gendered-Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered-Role": ["B-Gendered-Role", "I-Gendered-Role"],"Generalization": ["B-Generalization", "I-Generalization"]
    }
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
     "Occupation": ["B-Occupation", "I-Occupation"]
    }
so_label_tags = {
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"]
             }
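
The label-to-tag dictionaries above can be inverted to generalize BIO tags to their labels. The following is a minimal sketch of that idea, not code from `utils.py`; the names `tag_to_label` and `generalize` are hypothetical.

```python
# Copy of the Model 3 mapping defined above
so_label_tags = {
    "Stereotype": ["B-Stereotype", "I-Stereotype"],
    "Omission": ["B-Omission", "I-Omission"],
}

# Invert label -> tags into tag -> label (hypothetical helper name)
tag_to_label = {tag: label for label, tags in so_label_tags.items() for tag in tags}

def generalize(tags):
    """Map BIO tags to their labels, leaving unmapped tags (such as "O") unchanged."""
    return [tag_to_label.get(tag, tag) for tag in tags]

print(generalize(["B-Omission", "I-Omission", "O", "B-Stereotype"]))
```

A dictionary lookup per tag avoids the repeated `DataFrame.replace` calls that a nested loop over labels and tags would need.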

In [5]:
d = 100               # dimensions of word embeddings (should match utils1.py) for file names
target_labels = "so"  # for file names

<a id="i"></a>
## I. Stereotype + Omission Classifier
<a id="prep"></a>
### Preprocessing

Load the document classification model's input data:

In [6]:
# # For 60-20-20 data split 
# train = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_train.csv".format(target_labels), index_col=0)
# dev = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_validate.csv".format(target_labels), index_col=0)
# test = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_test.csv".format(target_labels), index_col=0)
# df_exp = pd.concat([train, dev, test])
# df_exp["label"] = df_exp["label"].fillna("{'None'}")
# df_exp = df_exp.loc[~df_exp.description.isna()]
# df_exp = utils.getColumnValuesAsLists(df_exp, "label")
# # df_exp.head()
# ------------------------------
# For modified 5-fold cross validation
df = pd.read_csv(config.tokc_path+"experiment_input/document_5fold.csv", index_col=0)
df_exp = utils.getColumnValuesAsLists(df, "label")
df_exp = df_exp.drop(columns=["subset"])
df_exp.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,label,fold
0,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",[Omission],split3
1,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,[],split2
2,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,[],split0
3,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,"[Omission, Stereotype]",split0
4,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,[Omission],split3


Load the Linguistic, Person Name, and Occupation features and associate description IDs to the data, creating one row per description ID:

In [86]:
# # Predictions as features
# features_filename = "crf_{a}_{t}_baseline_fastText{d}_predictions.csv".format(a="arow", t="pers_o", d=d)
# df_features = pd.read_csv(predictions_dir+features_filename, usecols=["sentence_id", "token_id", "pred_ling_tag", "tag_pers_o_predicted"])
# df_features = utils.getColumnValuesAsLists(df_features, "pred_ling_tag")
# df_features = df_features.rename(columns={"tag_pers_o_predicted":"pers_o_pred", "pred_ling_tag":"ling_pred"})
# # Generalize person name and occupation tags to labels
# for label,tags in pers_o_label_tags.items():
#     for tag in tags:
#         df_features["pers_o_pred"] = df_features["pers_o_pred"].replace(to_replace=tag, value=label)
# df_features.head()
# df_features.pers_o_pred.value_counts()  # Looks good
# ---------------------
# Manual labels as features
feature_col1 = "label_ling_expected"    # Linguistic labels (Model 1)
feature_col2 = "label_pers_o_expected"  # Person Name + Occupation labels (Model 2)
perso_features_filename = "crf_{a}_{t}_baseline_fastText{d}_loose_evaluation.csv".format(a="arow", t="pers_o", d=d)
perso_features = pd.read_csv(config.experiment1_path+"5fold/output/"+perso_features_filename, usecols=["sentence_id", "token_id", feature_col2])
# perso_features.head()
ling_features_filename = "cc-{a}_{t}_baseline_fastText{d}_evaluation_loose.csv".format(a="rf", t="ling", d=d)
ling_features = pd.read_csv(config.experiment1_path+"5fold/output/"+ling_features_filename, usecols=["sentence_id", "token_id", "expected_tag"])
ling_features = ling_features.rename(columns={"expected_tag": feature_col1})
# ling_features.head()
df_features = perso_features.join(ling_features.set_index(["sentence_id", "token_id"]), on=["sentence_id", "token_id"], how="outer")
print(perso_features.shape, ling_features.shape, df_features.shape)
df_features.head()

(753521, 3) (7607, 3) (753550, 4)


Unnamed: 0,sentence_id,token_id,label_pers_o_expected,label_ling_expected
0,8,233,['Masculine'],
1,8,234,['Masculine'],
2,8,235,['O'],
3,8,236,['O'],
4,8,237,['O'],
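
The outer join above keeps every token that appears in either feature table, which is presumably why the joined table has slightly more rows than the Person Name + Occupation table alone. The following toy sketch (invented rows, and `DataFrame.merge` rather than the notebook's `join` on a shared index) shows how unmatched rows on either side end up with missing values.

```python
import pandas as pd

# Toy stand-ins for the two feature tables (values are invented)
perso = pd.DataFrame({
    "sentence_id": [8, 8, 8],
    "token_id": [233, 234, 235],
    "label_pers_o_expected": ["['Masculine']", "['Masculine']", "['O']"],
})
ling = pd.DataFrame({
    "sentence_id": [8, 9],
    "token_id": [235, 300],
    "label_ling_expected": ["Gendered-Pronoun", "Generalization"],
})

# An outer merge on both keys keeps tokens found in either table
keys = ["sentence_id", "token_id"]
merged = perso.merge(ling, on=keys, how="outer")

print(merged.shape)
print(merged)
```

Tokens found only in one table carry `NaN` in the other table's label column, which is why the next step has to handle float values.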


For every row, parse the labels into lists and remove duplicate labels:

In [87]:
perso = list(df_features[feature_col2])
new_perso = [[] if type(labels)==float else labels[2:-2].split("', '") for labels in perso]
new_perso = [list(set(labels)) for labels in new_perso]
ling = list(df_features[feature_col1])
new_ling = ["O" if type(label)==float else label for label in ling]
print(new_perso[:3], new_ling[:3])

[['Masculine'], ['Masculine'], ['O']] ['O', 'O', 'O']
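
The string slicing above assumes each value looks exactly like `"['A', 'B']"`. As an alternative sketch (not the notebook's method), `ast.literal_eval` can parse the stringified list directly; `parse_labels` is a hypothetical helper name.

```python
import ast
import math

def parse_labels(value):
    """Return a sorted, de-duplicated list of labels from a stringified list.

    Missing values (NaN floats, as produced by the outer join) become [].
    """
    if isinstance(value, float) and math.isnan(value):
        return []
    return sorted(set(ast.literal_eval(value)))

raw = ["['Masculine']", "['Unknown', 'Masculine', 'Unknown']", float("nan")]
print([parse_labels(v) for v in raw])
```

Sorting after de-duplication gives a stable label order, whereas `list(set(...))` may order labels differently between runs.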


In [88]:
df_features[feature_col2] = new_perso
df_features[feature_col1] = new_ling
df_features.head()  # Looks good

Unnamed: 0,sentence_id,token_id,label_pers_o_expected,label_ling_expected
0,8,233,[Masculine],O
1,8,234,[Masculine],O
2,8,235,[O],O
3,8,236,[O],O
4,8,237,[O],O


In [89]:
df_desc = pd.read_csv(config.agg_path+"descs_sents_tokens_anns.csv", usecols=["description_id", "sentence_id", "token_id"])
df_desc = df_desc.set_index("description_id")
df_desc = utils1.getColumnValuesAsLists(df_desc, "sentence_id")
df_desc = utils1.getColumnValuesAsLists(df_desc, "token_id")
df_desc_exploded = df_desc.explode(["sentence_id", "token_id"])
df_desc_exploded = df_desc_exploded.reset_index()
df_desc_exploded = df_desc_exploded.astype("int64")
# df_desc_exploded.head()

In [90]:
joined = df_features.join(df_desc_exploded.set_index(["sentence_id", "token_id"]), on=["sentence_id", "token_id"])
grouped = utils.implodeDataFrame(joined, ["description_id"]).reset_index()
grouped.head()

Unnamed: 0,description_id,sentence_id,token_id,label_pers_o_expected,label_ling_expected
0,0,"[0, 0, 0]","[0, 1, 2]","[[O], [O], [O]]","[O, O, O]"
1,1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[[O], [O], [O], [O], [Masculine, O, Unknown], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O]"
2,2,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,3,"[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...","[233, 234, 235, 236, 237, 238, 239, 240, 241, ...","[[Masculine], [Masculine], [O], [O], [O], [O],...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,4,"[11, 11, 11]","[308, 309, 310]","[[O], [O], [O]]","[O, O, O]"
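
The implementation of `utils.implodeDataFrame` is not shown here. A minimal sketch of the same idea, assuming it collects each column's values into one list per group, uses `groupby` with `agg(list)`; `explode` reverses the operation.

```python
import pandas as pd

# Toy token-level table (values are invented)
tokens = pd.DataFrame({
    "description_id": [0, 0, 1],
    "token_id": [0, 1, 3],
    "label_ling_expected": ["O", "Gendered-Pronoun", "O"],
})

# "Implode": one row per description, with list-valued columns
imploded = tokens.groupby("description_id").agg(list).reset_index()

# "Explode": back to one row per token (multi-column explode needs pandas >= 1.3)
exploded = imploded.explode(["token_id", "label_ling_expected"]).reset_index(drop=True)

print(imploded)
print(exploded)
```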


Flatten the lists of values in the feature columns and remove duplicates from the lists:

In [91]:
# feature_col1 = "ling_pred"
# feature_col2 = "pers_o_pred"

In [92]:
# ling = utils1.flattenFeatureCol(grouped, feature_col1)
# -------------------
old_ling = [list(set(labels)) for labels in list(grouped[feature_col1])]
ling = []
for labels in old_ling:
    if len(labels) > 1:
        if "O" in labels:
            labels.remove("O")
    ling += [labels]
perso = utils1.flattenFeatureCol(grouped, feature_col2)
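
The loop above keeps `"O"` only when a description has no other Linguistic label. The sketch below applies the same rule to nested lists of labels; it is an illustration under that assumption and is not the implementation of `utils1.flattenFeatureCol`, which is not shown here. The name `flatten_unique` is hypothetical.

```python
from itertools import chain

def flatten_unique(nested, drop="O"):
    """Flatten a list of label lists, de-duplicate, and remove the `drop`
    label when at least one other label is present."""
    labels = sorted(set(chain.from_iterable(nested)))
    if len(labels) > 1 and drop in labels:
        labels.remove(drop)
    return labels

print(flatten_unique([["O"], ["Masculine", "O", "Unknown"], ["O"]]))
print(flatten_unique([["O"], ["O"]]))
```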

In [93]:
grouped.insert(len(grouped.columns), "doc_"+feature_col1, ling)
grouped.insert(len(grouped.columns), "doc_"+feature_col2, perso)
# grouped.head()

Join the Linguistic and Person-Name+Occupation feature columns to the document classification model data:

In [94]:
features = grouped[["description_id", "doc_"+feature_col1, "doc_"+feature_col2]]
join_on = "description_id"
df = df_exp.join(features.set_index(join_on), on=join_on)
df = df.loc[~df.description.isna()]
df.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,label,fold,doc_label_ling_expected,doc_label_pers_o_expected
0,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",[Omission],split3,[O],"[Masculine, Unknown]"
1,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,[],split2,[O],[Masculine]
2,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,[],split0,[O],"[Unknown, Occupation]"
3,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,"[Omission, Stereotype]",split0,[O],"[Unknown, Occupation, Feminine, Masculine]"
4,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,[Omission],split3,[O],"[Masculine, Unknown]"


In [96]:
# df.label.value_counts()                 # Looks good
# df["doc_"+feature_col1].value_counts()  # Looks good
# df["doc_"+feature_col2].value_counts()  # Looks good

Define the train (80% of the data) and test (20% of the data) splits:

In [97]:
split_col = "fold"
splits = df[split_col].unique()
splits.sort()
print(splits)
train0, test0 = list(splits[:4]), splits[4]
train1, test1 = list(splits[1:]), splits[0]
train2, test2 = list(splits[2:])+[splits[0]], splits[1]
train3, test3 = list(splits[3:])+list(splits[:2]), splits[2]
train4, test4 = [splits[4]]+list(splits[:3]), splits[3]
runs = [(train0, test0), (train1, test1), (train2, test2), (train3, test3), (train4, test4)]
for run in runs:
    print(run)

['split0' 'split1' 'split2' 'split3' 'split4']
(['split0', 'split1', 'split2', 'split3'], 'split4')
(['split1', 'split2', 'split3', 'split4'], 'split0')
(['split2', 'split3', 'split4', 'split0'], 'split1')
(['split3', 'split4', 'split0', 'split1'], 'split2')
(['split4', 'split0', 'split1', 'split2'], 'split3')
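
The five runs above can also be generated programmatically. The following sketch holds out each fold once and trains on the remaining four; it visits the folds in a different order from the cell above (starting with `split0` as the test fold), and `leave_one_fold_out` is a hypothetical helper name.

```python
def leave_one_fold_out(folds):
    """Return (train_folds, test_fold) pairs, holding out each fold exactly once."""
    runs = []
    for held_out in folds:
        train = [fold for fold in folds if fold != held_out]
        runs.append((train, held_out))
    return runs

splits = ["split0", "split1", "split2", "split3", "split4"]
runs = leave_one_fold_out(splits)
for run in runs:
    print(run)
```

Because every fold is a test fold exactly once, concatenating the per-run predictions yields one prediction for every description.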


In [98]:
def binarizeMultilabelTrainColumn(df_col):
    mlb = MultiLabelBinarizer()
    binarized = mlb.fit_transform(df_col)
    return mlb, binarized

def binarizeMultilabelDevColumn(mlb, df_col):
    binarized = mlb.transform(df_col)
    return binarized
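
The two helpers above separate fitting from transforming so that the test data reuses the label vocabulary learned from the train data. A small sketch with invented labels shows the resulting indicator matrices.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented multilabel targets: each row is the list of labels for one description
train_labels = [["Omission"], [], ["Omission", "Stereotype"]]
test_labels = [["Stereotype"], []]

mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train_labels)  # learns the classes from the train data
y_test = mlb.transform(test_labels)        # reuses the same column order

print(mlb.classes_)
print(y_train)
print(y_test)
```

Each column corresponds to one entry of `mlb.classes_`, and a row of zeros represents a description with no labels.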

Vectorize the documents, and binarize the features and targets:

In [17]:
# train_df = df.loc[df.subset == "train"]
# dev_df = df.loc[df.subset == "dev"]
# target_col = "label"
# feat1_col = "doc_ling_pred"
# feat2_col = "doc_pers_o_pred"

In [18]:
# mlb_target, y_train = binarizeMultilabelTrainColumn(train_df["label"])
# y_dev = binarizeMultilabelDevColumn(mlb_target, dev_df["label"])
# print(y_train.shape, y_dev.shape)

(16397, 3) (5452, 3)


In [19]:
# mlb_feat1, train_feat1 = binarizeMultilabelTrainColumn(train_df[feat1_col])
# dev_feat1 = binarizeMultilabelDevColumn(mlb_feat1, dev_df[feat1_col])
# mlb_feat2, train_feat2 = binarizeMultilabelTrainColumn(train_df[feat2_col])
# dev_feat2 = binarizeMultilabelDevColumn(mlb_feat2, dev_df[feat2_col])
# print(train_feat1.shape, dev_feat1.shape)
# print(train_feat2.shape, dev_feat2.shape)

(16397, 3) (5452, 3)
(16397, 4) (5452, 4)


In [20]:
# cvectorizer = CountVectorizer()
# tfidf = TfidfTransformer()
# train_docs = cvectorizer.fit_transform(train_df["description"])
# dev_docs = cvectorizer.transform(dev_df["description"])
# train_docs = tfidf.fit_transform(train_docs)
# dev_docs = tfidf.transform(dev_docs)
# print(train_docs.shape, dev_docs.shape)

(16397, 26960) (5452, 26960)


In [21]:
# train_feats = scipy.sparse.csr_matrix(np.concatenate([train_feat1, train_feat2], axis=1))
# dev_feats = scipy.sparse.csr_matrix(np.concatenate([dev_feat1, dev_feat2], axis=1))

Concatenate the documents and features, creating one scipy sparse matrix for the train data and another for the dev data:

In [22]:
# X_train = scipy.sparse.hstack([train_docs, train_feats])
# X_dev = scipy.sparse.hstack([dev_docs, dev_feats])
# print(X_train.shape, X_dev.shape)

(16397, 26967) (5452, 26967)


<a id="tp"></a>
### Training & Prediction

In [99]:
a = "sgd-svm"

In [100]:
pred_df = pd.DataFrame()
target_col = "label"
feat1_col = "doc_label_ling_expected"    #"doc_ling_pred"
feat2_col = "doc_label_pers_o_expected"  #"doc_pers_o_pred"
for run in runs:
    # Get the train (80%) and test (20%) subsets of data
    train_splits, test_split = run[0], run[1]
    print("Training on:", train_splits)
    train_df = df.loc[df[split_col].isin(train_splits)]
    dev_df = df.loc[df[split_col] == test_split]
    
    # Binarize the features
    mlb_feat1, train_feat1 = binarizeMultilabelTrainColumn(train_df[feat1_col])
    dev_feat1 = binarizeMultilabelDevColumn(mlb_feat1, dev_df[feat1_col])
    mlb_feat2, train_feat2 = binarizeMultilabelTrainColumn(train_df[feat2_col])
    dev_feat2 = binarizeMultilabelDevColumn(mlb_feat2, dev_df[feat2_col])
    train_feats = scipy.sparse.csr_matrix(np.concatenate([train_feat1, train_feat2], axis=1))
    dev_feats = scipy.sparse.csr_matrix(np.concatenate([dev_feat1, dev_feat2], axis=1))
    
    # Vectorize the documents (descriptions)
    cvectorizer = CountVectorizer()
    tfidf = TfidfTransformer()
    train_docs = cvectorizer.fit_transform(train_df["description"])
    dev_docs = cvectorizer.transform(dev_df["description"])
    train_docs = tfidf.fit_transform(train_docs)
    dev_docs = tfidf.transform(dev_docs)
    
    # Concatenate the features and documents
    X_train = scipy.sparse.hstack([train_docs, train_feats])
    X_dev = scipy.sparse.hstack([dev_docs, dev_feats])
    
    # Binarize targets
    mlb_target, y_train = binarizeMultilabelTrainColumn(train_df["label"])
    y_dev = binarizeMultilabelDevColumn(mlb_target, dev_df["label"])

    # Train a classification model
    clf = OneVsRestClassifier(SGDClassifier(loss="hinge"))  # Support Vector Machines loss function
    clf.fit(X_train, y_train)
    
    # Predict with the trained model
    print("Predicting on:", test_split)
    predictions = clf.predict(X_dev)
    pred_labels = mlb_target.inverse_transform(predictions)    
    if pred_df.shape[0] > 0:
        next_pred_df = dev_df.copy()
        next_pred_df.insert(len(next_pred_df.columns), "{}_label".format(a), pred_labels)
        pred_df = pd.concat([pred_df, next_pred_df])
    else:
        pred_df = dev_df.copy()
        pred_df.insert(len(pred_df.columns), "{}_label".format(a), pred_labels)

print("Modified 5-fold cross-validation complete!")
print(pred_df.shape)

Training on: ['split0', 'split1', 'split2', 'split3']
Predicting on: split4
Training on: ['split1', 'split2', 'split3', 'split4']
Predicting on: split0
Training on: ['split2', 'split3', 'split4', 'split0']
Predicting on: split1
Training on: ['split3', 'split4', 'split0', 'split1']
Predicting on: split2
Training on: ['split4', 'split0', 'split1', 'split2']
Predicting on: split3
Modified 5-fold cross-validation complete!
(27312, 10)
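
The core of each run in the loop above is: TF-IDF vectors for the text, binarized labels as extra feature columns, a horizontal stack of the two, and a one-vs-rest linear SVM trained with stochastic gradient descent. The sketch below reproduces that flow on four invented documents. It uses `TfidfVectorizer`, which is equivalent to `CountVectorizer` followed by `TfidfTransformer`, and fixes `random_state` for repeatability, which the notebook does not do.

```python
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy data: descriptions, upstream label features, and targets
docs = ["he was a surgeon", "she kept the house",
        "papers of the society", "his wife assisted him"]
feats = [["Masculine", "Occupation"], ["Feminine"], [], ["Masculine", "Feminine"]]
labels = [["Omission"], ["Stereotype"], [], ["Omission", "Stereotype"]]

# Text features
X_text = TfidfVectorizer().fit_transform(docs)

# Upstream labels as binary indicator columns
X_feats = scipy.sparse.csr_matrix(MultiLabelBinarizer().fit_transform(feats))

# One sparse matrix: text columns first, then label-feature columns
X = scipy.sparse.hstack([X_text, X_feats]).tocsr()

# Binarized targets and a one-vs-rest classifier with hinge loss (linear SVM)
mlb_target = MultiLabelBinarizer()
y = mlb_target.fit_transform(labels)
clf = OneVsRestClassifier(SGDClassifier(loss="hinge", random_state=0))
clf.fit(X, y)

pred = clf.predict(X)
print(X.shape, pred.shape)
print(mlb_target.inverse_transform(pred))
```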


In [101]:
pred_df = pred_df.rename(columns={"label":"manual_label"})
pred_df.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,manual_label,fold,doc_label_ling_expected,doc_label_pers_o_expected,sgd-svm_label
6,3027,627,1162,Biographical / Historical,Thomas Young was probably born in 1725. By the...,[Stereotype],split4,[Gendered-Pronoun],"[Masculine, Unknown, Occupation]","(Stereotype,)"
7,3397,8095,8334,Biographical / Historical,Andrew Tait worked on Paramecium in Beale's la...,[Omission],split4,[O],"[Masculine, Unknown, Occupation]","(Omission,)"
10,4736,9951,10026,Biographical / Historical,Delivered by Thomson to teachers in Darlington.,[Omission],split4,[O],"[Masculine, Unknown, Occupation]","(Omission,)"
14,4712,4199,4485,Biographical / Historical,"This was gifted by Thomson to his secretary, M...",[Omission],split4,[O],"[Unknown, Occupation, Feminine, Masculine]","(Omission,)"
22,15684,845,1179,Biographical / Historical,Catherine Robina Borland was responsible for t...,[],split4,[O],[],"(,)"


Save the predictions data:

In [102]:
pred_df.to_csv(predictions_dir+"aggregated_final_validate_predictions_docclf_{a}_{t}.csv".format(a=a, t=target_labels))

Build a pipeline:

In [24]:
# doc_clf = Pipeline([
#     ("clf", OneVsRestClassifier(SGDClassifier(loss="hinge")))  # Support Vector Machines loss function
#     ])

In [19]:
# doc_clf.fit(X_train, y_train)
# predictions = doc_clf.predict(X_dev)

<a id="eval"></a>
### Evaluation

Calculate performance metrics for the Stochastic Gradient Descent classifier:

In [26]:
# print("Dev Test Accuracy:", np.mean(predictions == y_dev))

Dev Test Accuracy: 0.9394106138420152


In [103]:
classes = clf.classes_  #doc_clf.classes_
print(classes)
original_classes = mlb_target.classes_
print(original_classes)
label_dict = dict(zip(original_classes, classes))

[0 1 2]
['' 'Omission' 'Stereotype']


Create a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-confusion-matrix) of the results, where, for class *i*:
* Count of true negatives (TN) is at position *i*,0,0
* Count of false negatives (FN) is at position *i*,1,0
* Count of true positives (TP) is at position *i*,1,1
* Count of false positives (FP) is at position *i*,0,1
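
The layout described above can be checked on a small invented example in which class 0 has TN=0, FP=1, FN=2, and TP=3, so each count lands in a recognizable cell.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Invented indicator matrices: rows are documents, columns are classes
y_true = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 1], [1, 1], [0, 1], [0, 0], [1, 1]])

matrix = multilabel_confusion_matrix(y_true, y_pred)

for i, class_matrix in enumerate(matrix):
    # Flattening one 2x2 matrix yields TN, FP, FN, TP in that order
    tn, fp, fn, tp = class_matrix.ravel()
    print("class", i, "TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```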

In [104]:
y_dev = binarizeMultilabelDevColumn(mlb_target, pred_df["manual_label"])
predictions = mlb_target.transform(pred_df["{}_label".format(a)])
assert len(y_dev) == len(predictions)

In [105]:
matrix = multilabel_confusion_matrix(y_dev, predictions, labels=classes)

In [106]:
scores = utils.getPerformanceMetrics(y_dev, predictions, matrix, classes, original_classes, label_dict)
scores = scores.tail(2) # Remove row for ''
scores = scores.drop(columns="true_neg")  # Not accurate because considers '' a class
scores["labels"] = original_classes[1:]
scores

Unnamed: 0,labels,false_neg,true_pos,false_pos,precision,recall,f_1
1,Omission,1696,2336,456,0.836676,0.579365,0.684642
2,Stereotype,405,1196,79,0.938039,0.747033,0.831711
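
The precision, recall, and F1 columns follow from the counts in the same row. The sketch below recomputes them for the Omission row using the standard definitions; `precision_recall_f1` is a hypothetical helper, not part of `utils.getPerformanceMetrics`.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the Omission row of the table above
p, r, f = precision_recall_f1(true_pos=2336, false_pos=456, false_neg=1696)
print(round(p, 6), round(r, 6), round(f, 6))
```

The gap between precision and recall for Omission reflects the large number of false negatives relative to false positives.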


Save the performance results:

In [107]:
scores.to_csv(agreement_dir+"docclf_{a}_{t}_baseline_performance.csv".format(a=a, t=target_labels))

Add the predicted labels to the dev data:

In [32]:
# pred_labels = mlb_target.inverse_transform(predictions)
# # pred_labels[0]

Add the classifier's labels to the `aggregated_validate.csv` DataFrame of descriptions to facilitate error analysis:

In [33]:
# df = df.rename(columns={"label":"manual_label"})
# df.insert(len(df.columns), "{a}_label".format(a=a), pred_labels)
# df.head()
# # print(len(pred_labels), df.shape)

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,manual_label,doc_ling_pred,doc_pers_o_pred,sgd-svm_label
5523,5523,367,1965,Biographical / Historical,"Edward Bald Jamieson, from Shetland, was a gra...",dev,[Stereotype],"[Gendered-Pronoun, Generalization]","[Masculine, Unknown, Occupation]","(Stereotype,)"
4719,4719,5650,5811,Biographical / Historical,This likely refers to an article of the same t...,dev,[Omission],[],[Masculine],"(Omission,)"
735,735,7735,7881,Biographical / Historical,John Baillie kept a collection of the prayers ...,dev,[None],[Gendered-Pronoun],[Unknown],"(None,)"
2183,2183,1072,1372,Biographical / Historical,Joseph W. Hills graduated with the degree of M...,dev,[None],[Gendered-Pronoun],[Unknown],()
2299,2299,546,3642,Biographical / Historical,This collection is composed simply of an invit...,dev,"[Omission, Stereotype]","[Gendered-Pronoun, Gendered-Role]","[Unknown, Occupation, Masculine, Feminine]","(Omission, Stereotype)"


Save this version of the data:

In [34]:
# df.to_csv(predictions_dir+"aggregated_final_validate_predictions_docclf_{a}_{t}.csv".format(a=a, t=target_labels))

***

#### *For train-dev-test (i.e., 60-20-20) approach*

<a id="ii"></a>
## II. Predict Over All Data

### Preprocessing
Vectorize the documents, and binarize the features and targets:

In [35]:
target_col = "label"
feat1_col = "doc_ling_pred"
feat2_col = "doc_pers_o_pred"

In [36]:
y_all = binarizeMultilabelDevColumn(mlb_target, df["label"])
print(y_all.shape)

(27312, 3)


In [37]:
all_feat1 = binarizeMultilabelDevColumn(mlb_feat1, df[feat1_col])
all_feat2 = binarizeMultilabelDevColumn(mlb_feat2, df[feat2_col])
print(all_feat1.shape, all_feat2.shape)

(27312, 3) (27312, 4)


In [38]:
all_docs = cvectorizer.transform(df["description"])
all_docs = tfidf.transform(all_docs)
print(all_docs.shape)

(27312, 26960)


In [39]:
all_feats = scipy.sparse.csr_matrix(np.concatenate([all_feat1, all_feat2], axis=1))

Concatenate the documents and features, creating one scipy sparse matrix for all of the data:

In [40]:
X_all = scipy.sparse.hstack([all_docs, all_feats])
print(X_all.shape)

(27312, 26967)


### Predict

In [41]:
predicted_all = doc_clf.predict(X_all)

### Performance

Calculate performance metrics for the Stochastic Gradient Descent classifier:

In [42]:
print("Accuracy:", np.mean(predicted_all == y_all))

Accuracy: 0.951938098027729
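
Note that `np.mean` over two binarized matrices measures the share of matching cells (one cell per document and class), not the share of documents whose full label set is predicted correctly. The sketch below contrasts the two on invented data, using scikit-learn's `accuracy_score`, which computes subset accuracy for multilabel indicator input.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented indicator matrices: 4 documents, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 0]])

# Share of individual cells that match (10 of 12 here)
cell_accuracy = np.mean(y_pred == y_true)

# Share of rows whose entire label set matches (2 of 4 here)
subset_accuracy = accuracy_score(y_true, y_pred)

print(cell_accuracy, subset_accuracy)
```

Cell-level accuracy is inflated when most cells are zero, so the per-class precision, recall, and F1 scores are usually the more informative figures.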


In [43]:
matrix = multilabel_confusion_matrix(y_all, predicted_all, labels=classes)

In [44]:
scores = utils.getPerformanceMetrics(y_all, predicted_all, matrix, classes, original_classes, label_dict)
scores = scores.tail(2) # Remove row for 'None'
scores = scores.drop(columns="true_neg")  # Not accurate because considers 'None' a class
scores["labels"] = original_classes[1:]
scores

Unnamed: 0,labels,false_neg,true_pos,false_pos,precision,recall,f_1
1,Omission,1563,2469,297,0.892625,0.612351,0.72639
2,Stereotype,309,1292,52,0.96131,0.806996,0.877419


Save the performance results:

In [45]:
# dir_path = config.tokc_path+"/experiment1/output/"
scores.to_csv(agreement_dir+"docclf_{a}_{t}_baseline_performance_ALLDATA.csv".format(a=a, t=target_labels))

Add the predicted labels to the data:

In [46]:
pred_all_labels = mlb_target.inverse_transform(predicted_all)

Add the classifier's labels to the `aggregated_validate.csv` DataFrame of descriptions to facilitate error analysis:

In [47]:
df = df.rename(columns={"label":"manual_label"})
df.insert(len(df.columns), "{a}_label".format(a=a), pred_all_labels)
df.head()
# print(len(pred_all_labels), df.shape)

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,manual_label,doc_ling_pred,doc_pers_o_pred,sgd-svm_label
4699,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,[Omission],[Gendered-Role],[Unknown],()
8942,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,[None],[Gendered-Pronoun],[Unknown],"(None,)"
5440,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,[None],[],"[Masculine, Unknown, Occupation]","(None,)"
3474,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,"[Omission, Stereotype]","[Generalization, Gendered-Role, Gendered-Pronoun]","[Unknown, Occupation, Masculine, Feminine]","(Omission, Stereotype)"
4769,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,train,[Omission],[Gendered-Pronoun],"[Masculine, Unknown]","(Omission,)"


Save this version of the data:

In [49]:
df.to_csv(predictions_dir+"aggregated_final_validate_predictions_docclf_{a}_{t}_ALLDATA.csv".format(a=a, t=target_labels))