# Experiment 1

#### Model Setup

Run models in the following order, using their output labels as features for the next model:

1. Multilabel Linguistic Classifier
2. Multiclass Person Name + Occupation Sequence Classifier
3. Multilabel Document Classifier

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/experiment_input/`
    * Prediction Data: Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
    
***

Load programming resources:

In [152]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For multilabel token classification
import sklearn.metrics as metrics
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# For multiclass sequence classification
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

Define resources for the models:

In [2]:
# Step 1:
ling_label_subset = ["B-Generalization", "I-Generalization", "B-Gendered-Role", "I-Gendered-Role", "B-Gendered-Pronoun", "I-Gendered-Pronoun"]
# Step 2:
pers_o_label_subset = ["B-Unknown", "I-Unknown", "B-Feminine", "I-Feminine", "B-Masculine", "I-Masculine", "B-Occupation", "I-Occupation"]
# Step 3:
so_label_subset = ["B-Stereotype", "I-Stereotype", "B-Omission", "I-Omission"]

In [34]:
ling_label_tags = {
    "Gendered Pronoun": ["B-Gendered-Pronoun", "I-Gendered-Pronoun"], "Gendered Role": ["B-Gendered-Role", "I-Gendered-Role"],"Generalization": ["B-Generalization", "I-Generalization"]
    }
pers_o_label_tags = {
    "Unknown": ["B-Unknown", "I-Unknown"], "Feminine": ["B-Feminine", "I-Feminine"], "Masculine": ["B-Masculine", "I-Masculine"],
     "Occupation": ["B-Occupation", "I-Occupation"]
    }
so_label_tags = {
    "Stereotype": ["B-Stereotype", "I-Stereotype"], "Omission": ["B-Omission", "I-Omission"]
             }

In [79]:
d = 100  # Dimensionality of word embeddings
file_name = config.fasttext_path+"fasttext{}_lowercased.model".format(d)
embedding_model = FastText.load(file_name)
def extractFastTextEmbedding(token, fasttext_model=embedding_model):
    if token.isalpha():
        token = token.lower()
    embedding = fasttext_model.wv[token]
    return embedding

<a id="1"></a>
## 1. Linguistic Classifier

Run a multilabel classifier on the train set of the data, focusing only on applying the Linguistic category of labels: Gendered Pronoun, Gendered Role, and Generalization.

Use a Classifier Chain with Random Forest, as this was the highest-performing multilabel model setup from previous algorithm experiments for the Linguistic labels.  When evaluated loosely, this type of model trained on 60% of the data performed as follows:

In [8]:
print('''
label           | precision   | recall    | f1           |
----------------------------------------------------------
Gendered Pronoun| 1.0         | 0.98459   | 0.99223      |
----------------------------------------------------------
Gendered Role   | 1.0         | 0.69018   | 0.81669      |
----------------------------------------------------------
Generalization  | 1.0         | 0.47423   | 0.64336      |
''')


label           | precision   | recall    | f1           |
----------------------------------------------------------
Gendered Pronoun| 1.0         | 0.98459   | 0.99223      |
----------------------------------------------------------
Gendered Role   | 1.0         | 0.69018   | 0.81669      |
----------------------------------------------------------
Generalization  | 1.0         | 0.47423   | 0.64336      |



For this experiment, we'll train the model on 40% of the data, rather than 60%.  We'll use fastText embeddings of 100 dimensions, as was used in the model that achieved the above scores.

#### Preprocessing

In [8]:
# ------------------------
# Load data
# ------------------------
def loadData(df):
    df = df.drop(columns=["ann_id"])
    df = df.drop_duplicates()
    # Remove Non-binary labels as these were mistaken labels identified early on that were meant to be excluded, 
    # and because only one token has this label, it prevents the data from being input into the models with cross-validation
    df = df.loc[~df.tag.isin(["B-Nonbinary", "I-Nonbinary"])]
    df = df.drop(columns=["description_id", "field", "subset", "token_offsets"])
    df = utils.implodeDataFrame(df, ["sentence_id", "token_id", "token", "pos"])
    return df.reset_index()

def zipTokensFeatures(loaded_data, feature_cols=["token_id", "token"]):
    token_data = list(zip(loaded_data[feature_cols[0]], loaded_data[feature_cols[1]]))
    return token_data

# ------------------------
# Extract fastText features
# ------------------------
def makeFastTextFeatureMatrix(token_data):
    feature_list = [extractFastTextEmbedding(token) for token_id,token in token_data]
    return np.array(feature_list)

# ------------------------
# Binarize targets
# ------------------------
def binarizeTrainTargets(train_data, target_col="tag", mlb=MultiLabelBinarizer()):
    y_train_labels = train_data[target_col]
    y_train = mlb.fit_transform(y_train_labels)
    return mlb, y_train

def binarizeDevTargets(mlb, dev_data, target_col="tag"):
    y_dev_labels = dev_data[target_col]
    y_dev = mlb.transform(y_dev_labels)
    return y_dev

In [6]:
train_df = pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", index_col=0)
dev_df = pd.read_csv(config.tokc_path+"experiment_input/token_validate.csv", index_col=0)
ling_train, ling_dev = utils.selectDataForLabels(train_df, dev_df, "tag", ling_label_subset)

In [10]:
train_data = loadData(ling_train)
dev_data = loadData(ling_dev)
print(train_data.shape, dev_data.shape)
train_data.head()

(298617, 5) (305924, 5)


Unnamed: 0,sentence_id,token_id,token,pos,tag
0,2,16,Scope,NN,[O]
1,2,17,and,CC,[O]
2,2,18,Contents,NNS,[O]
3,2,19,:,:,[O]
4,2,20,Sermons,NNS,[O]


Create feature matrices:

In [11]:
train_tokens = zipTokensFeatures(train_data)
dev_tokens = zipTokensFeatures(dev_data)
X_train = makeFastTextFeatureMatrix(train_tokens)
X_dev = makeFastTextFeatureMatrix(dev_tokens)

Binarize targets:

In [12]:
mlb, y_train = binarizeTrainTargets(train_data)
y_dev = binarizeDevTargets(mlb, dev_data)

#### Train & Predict

In [14]:
a = "rf"
clf = ClassifierChain(
    classifier = RandomForestClassifier(random_state=22),
)
clf.fit(X_train, y_train)

In [15]:
predictions = clf.predict(X_dev)

#### Evaluate: Strict, All Labels

In [16]:
print("Precision - macro:", metrics.precision_score(y_dev, predictions, average="macro", zero_division=0))  # macro = mean of all labels' score
print("Recall - macro:", metrics.recall_score(y_dev, predictions, average="macro", zero_division=0))
print("F1 Score - macro:", metrics.f1_score(y_dev, predictions, average="macro", zero_division=0))
print("Accuracy - normalized:", metrics.accuracy_score(y_dev, predictions, normalize=True))  # fraction of correctly classified samples
print("Accuracy - unnormalized:", metrics.accuracy_score(y_dev, predictions, normalize=False))  # number of correctly classified samples

Precision - macro: 0.46238272522632073
Recall - macro: 0.42298863525280694
F1 Score - macro: 0.42251820432443793
Accuracy - normalized: 0.9925798564349315
Accuracy - unnormalized: 303654


#### Evaluate: Each Label

In [17]:
pred_df = utils.makePredictionDF(predictions, dev_data, "tag", "predicted_tag", "O", mlb)
pred_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,predicted_tag
0,1,3,Title,NN,O
1,1,4,:,:,O
2,1,5,Papers,NNS,O
3,1,6,of,IN,O
4,1,7,The,DT,O


In [18]:
exp_df = dev_data.explode(["tag"])
exp_df = exp_df.rename(columns={"tag":"expected_tag"})
# exp_df.head()

In [19]:
eval_df = utils.makeEvaluationDataFrame(
    exp_df, 
    pred_df, 
    ["sentence_id", "token_id", "token", "pos", "expected_tag"],   # left on
    ["sentence_id", "token_id", "token", "pos", "predicted_tag"],  # right on
    ["sentence_id", "token_id", "token", "pos", "expected_tag", "predicted_tag", "_merge"],  # final column list
    "expected_tag",
    "predicted_tag", 
    "token_id",  # ID column
    "O"          # No tag value
)
eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,1,3,Title,NN,O,O,true negative
1,1,4,:,:,O,O,true negative
2,1,5,Papers,NNS,O,O,true negative
3,1,6,of,IN,O,O,true negative
4,1,7,The,DT,O,O,true negative


Save the data:

In [20]:
experiment_dir = config.tokc_path+"experiment1/"
Path(experiment_dir).mkdir(parents=True, exist_ok=True)
eval_df.to_csv(experiment_dir+"cc-{a}_baseline_fastText{d}_predictions.csv".format(a=a,d=d))

##### Strict Agreement

Calculate the total true positives, false positives, true negatives, and false negatives.

In [21]:
agmt_stats = utils.getAgreementStatsForAllTags(eval_df, "_merge", "token_id", "tag(s)", y_dev, predictions)

Calculate precision, recall, and F1 score at the token level for each tag:

In [22]:
for label_tag in ling_label_subset:
    label_agmt_stats = utils.getScoresByTags(eval_df, "_merge", [label_tag])
    agmt_stats = pd.concat([agmt_stats, label_agmt_stats])
agmt_stats

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,all,1145,990,303152,1876,0.462383,0.422989,0.422518
0,B-Generalization,128,17,0,174,0.910995,0.576159,0.705882
0,I-Generalization,86,1,0,0,0.0,0.0,0.0
0,B-Gendered-Role,99,89,0,552,0.861154,0.847926,0.854489
0,I-Gendered-Role,133,0,0,0,0.0,0.0,0.0
0,B-Gendered-Pronoun,15,379,0,3026,0.888693,0.995067,0.938877
0,I-Gendered-Pronoun,23,0,0,0,0.0,0.0,0.0


Save the data:

In [23]:
agmt_stats.to_csv(experiment_dir+"cc-{a}_baseline_fastText{d}_ling_strict_agmt.csv".format(a=a,d=d))

##### Loose Agreement

Calculate precision, recall, and F1 score at the token level for each label, where a correct prediction is a prediction with the correct annotation label (not necessarily the correct IOB tag).

Create a copy of the evaluation DataFrame where tags are replaced by label names:

In [54]:
loose_eval_df = eval_df.copy()
for label,tags in ling_label_tags.items():
    for tag in tags:
        loose_eval_df["expected_tag"] = loose_eval_df["expected_tag"].replace(to_replace=tag, value=label)
        loose_eval_df["predicted_tag"] = loose_eval_df["predicted_tag"].replace(to_replace=tag, value=label)
# loose_eval_df.head()

In [55]:
loose_eval_df = loose_eval_df.fillna("O")
loose_eval_df = loose_eval_df.drop(columns=["_merge"])
loose_eval_df = utils.compareExpectedPredicted(loose_eval_df, "_merge", "O")
loose_eval_df.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,1,3,Title,NN,O,O,true negative
1,1,4,:,:,O,O,true negative
2,1,5,Papers,NNS,O,O,true negative
3,1,6,of,IN,O,O,true negative
4,1,7,The,DT,O,O,true negative


In [61]:
loose_agmt = pd.DataFrame.from_dict({
        "tag(s)":[], "false negative":[], "false positive":[], "true negative":[], 
         "true positive":[], "precision":[], "recall":[], "f1":[]
    })

In [62]:
for label,tags in ling_label_tags.items():
    labels_agmt_stats = utils.getScoresByTags(loose_eval_df, "_merge", [label])
    loose_agmt = pd.concat([loose_agmt, labels_agmt_stats])
loose_agmt

Unnamed: 0,tag(s),false negative,false positive,true negative,true positive,precision,recall,f1
0,Gendered Pronoun,38.0,0.0,0.0,3026.0,1.0,0.987598,0.99376
0,Gendered Role,232.0,0.0,0.0,552.0,1.0,0.704082,0.826347
0,Generalization,214.0,0.0,0.0,174.0,1.0,0.448454,0.619217


Great!  The performance of this model looks comparable to the model trained on 60% of the data.

Save the data:

In [63]:
loose_agmt.to_csv(experiment_dir+"cc-{a}_baseline_fastText{d}_ling_loose_agmt.csv".format(a=a,d=d))

***
<a id="2"></a>
## 2. Person Name + Occupation Labels

Train a multiclass sequence classifier, using Conditional Random Field with Adaptive Regularization of Weight Vectors (AROW), on the Person Name and Occupation labels, **passing in the Linguistic labels (not specific BIO label-tag pair) from the previous model's predictions as features to this model.**

Multiclass is a suitable setup for these labels because they are mutually exclusive (no one token should have more than one of these labels).  The sequence classifier with AROW was the highest performing for past algorithm experiments with sequence classifiers for Person Name and Occupation labels.

In [81]:
train_ling_features = loose_eval_df[["token_id", "predicted_tag"]]
train_ling_features = train_ling_features.rename(columns={"predicted_tag":"pred_ling_tag"})
train_ling_features.head()

Unnamed: 0,token_id,pred_ling_tag
0,3,O
1,4,O
2,5,O
3,6,O
4,7,O


The devtest data subset from the model in step 1 will be the train data subset in this step, with the predicted Linguistic labels as features passed into this second model.  The train data subset from the first model will be the devtest data subset for this second model.

In [76]:
train_df = pd.read_csv(config.tokc_path+"experiment_input/token_validate.csv", index_col=0)
dev_df =  pd.read_csv(config.tokc_path+"experiment_input/token_train.csv", index_col=0)
perso_train, perso_dev = utils.selectDataForLabels(train_df, dev_df, "tag", pers_o_label_subset)
print(perso_train.shape, perso_dev.shape)

(316721, 10) (308583, 10)


Join the linguistic labels (features) to the train and devtest data:

In [77]:
perso_train = perso_train.join(train_ling_features.set_index("token_id"), on="token_id", how="outer")
perso_train.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset,pred_ling_tag
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,dev,O
4,1,1,99999,4,:,"(22, 23)",:,O,Title,dev,O
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,dev,O
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,dev,O
9,1,1,52952,7,The,"(34, 37)",DT,O,Title,dev,O


In [78]:
perso_train.pred_ling_tag.value_counts()

O                   315669
Gendered Pronoun      2083
Gendered Role          486
Generalization         134
Name: pred_ling_tag, dtype: int64

Looks good!

#### Preprocessing

In [148]:
def zipTrainFeaturesAndTarget(df, target_col, feature_col1="sentence", feature_col2="pred_ling_tag"):
    feature1_list = list(df[feature_col1])  # sentence
    feature2_list = list(df[feature_col2])  # linguistic label
    tag_list = list(df[target_col])
    length = len(feature1_list)
    return [[tuple((feature1_list[i][j], feature2_list[i][j], tag_list[i][j])) for j in range(len(feature1_list[i]))] for i in range(len(feature1_list))]

def zipDevFeaturesAndTarget(df, target_col, feature_col1="sentence"):
    feature1_list = list(df[feature_col1])  # sentence
    tag_list = list(df[target_col])
    length = len(feature1_list)
    return [[tuple((feature1_list[i][j], tag_list[i][j])) for j in range(len(feature1_list[i]))] for i in range(len(feature1_list))]

def extractTokenFeatures(sentence, i):
    token = sentence[i][0]
    if (len(sentence[i]) > 2):
        features = {
        'bias': 1.0,
        'token': token,
        'ling_label':sentence[i][1]
        }
    else:
        features = {
            'bias': 1.0,
            'token': token,
        }
    
    # Add each value in a token's word embedding as a separate feature
    embedding = extractFastTextEmbedding(token)
    for i,n in enumerate(embedding):
        features['e{}'.format(i)] = n
    
    # Record whether a token is the first or last token of a sentence
    if i == 0:
        features['START'] = True
    elif i == (len(sentence) - 1):
        features['END'] = True
    
    return features

def extractSentenceFeatures(sentence):
    return [extractTokenFeatures(sentence, i) for i in range(len(sentence))]

def extractSentenceTargets(sentence):
    return [s[-1][0] for s in sentence]

def extractSentenceTokens(sentence):
    return [token for token, ling_label, tag_list in sentence]

In [125]:
train_df = perso_train.drop(columns=["description_id", "ann_id", "token_offsets", "field", "subset", "pos"])
dev_df = perso_dev.drop(columns=["description_id", "ann_id", "token_offsets", "field", "subset", "pos"])

In [127]:
df_train_token_groups = utils.implodeDataFrame(train_df, ['token_id', 'sentence_id', 'token'])
df_train_token_groups = df_train_token_groups.reset_index()
# df_train_token_groups.head()

Unnamed: 0,token_id,sentence_id,token,tag,pred_ling_tag
0,3,1,Title,[O],[O]
1,4,1,:,[O],[O]
2,5,1,Papers,[O],[O]
3,6,1,of,[O],[O]
4,7,1,The,"[O, B-Unknown, B-Masculine]","[O, O, O]"


In [130]:
df_dev_token_groups = utils.implodeDataFrame(df_dev, ['token_id', 'sentence_id', 'token'])
df_dev_token_groups = df_dev_token_groups.reset_index()
# df_dev_token_groups.head()

Unnamed: 0,token_id,sentence_id,token,tag
0,16,2,Scope,[O]
1,17,2,and,[O]
2,18,2,Contents,[O]
3,19,2,:,[O]
4,20,2,Sermons,[O]


In [140]:
df_train_grouped = utils.implodeDataFrame(df_train_token_groups, ['sentence_id'])
df_dev_grouped = utils.implodeDataFrame(df_dev_token_groups, ['sentence_id'])
df_train_grouped = df_train_grouped.rename(columns={"token":"sentence"})
df_dev_grouped = df_dev_grouped.rename(columns={"token":"sentence"})
df_train_grouped.head()

Unnamed: 0_level_0,token_id,sentence,tag,pred_ling_tag
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,"[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]","[Title, :, Papers, of, The, Very, Rev, Prof, J...","[[O], [O], [O], [O], [O, B-Unknown, B-Masculin...","[[O], [O], [O], [O], [O, O, O], [O, O, O], [O,..."
3,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ...","[Biographical, /, Historical, :, Professor, Ja...","[[O], [O], [O], [O], [B-Masculine], [I-Masculi...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."
5,"[154, 155, 156, 157, 158, 159, 160, 161, 162, ...","[After, his, ordination, he, spent, three, yea...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[[O], [Gendered Pronoun], [O], [Gendered Prono..."
7,"[216, 217, 218, 219, 220, 221, 222, 223, 224, ...","[His, primary, interests, were, in, liturgy, a...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[[Gendered Pronoun], [O], [O], [O], [O], [O], ..."
9,"[256, 257, 258, 259, 260, 261, 262, 263, 264, ...","[The, service, was, relayed, around, the, worl...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ..."


Zip the linguistic label and BIO tags together with the tokens so each sentence item is a tuple: `(TOKEN, LING_LABEL, TAG_LIST)`

In [143]:
df_train_grouped = df_train_grouped.reset_index()
df_dev_grouped = df_dev_grouped.reset_index()
train_sentences_pers = zipTrainFeaturesAndTarget(df_train_grouped, "tag")
print(train_sentences_pers[2][:3])
dev_sentences_pers = zipDevFeaturesAndTarget(df_dev_grouped, "tag")
print(dev_sentences_pers[0][:3])

[('After', ['O'], ['O']), ('his', ['Gendered Pronoun'], ['O']), ('ordination', ['O'], ['O'])]
[('Scope', ['O']), ('and', ['O']), ('Contents', ['O'])]


In [144]:
train_sentences = train_sentences_pers
dev_sentences = dev_sentences_pers

In [149]:
# Features
X_train = [extractSentenceFeatures(sentence) for sentence in train_sentences]
X_dev = [extractSentenceFeatures(sentence) for sentence in dev_sentences]
# Target
y_train = [extractSentenceTargets(sentence) for sentence in train_sentences]
y_dev = [extractSentenceTargets(sentence) for sentence in dev_sentences]

#### Train

Train a Conditional Random Field (CRF) model with the default parameters on the **Person Name** category of tags.  We'll increase the max iterations to 100 for this model.

In [153]:
a = "arow"
clf_pers = sklearn_crfsuite.CRF(algorithm=a, variance=0.5, max_iterations=100, all_possible_transitions=True)

In [154]:
# https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
try:
    clf_pers.fit(X_train, y_train)
except AttributeError:
    pass

Remove `'O'` tags from the targets list since we are interested in the ability to apply the gendered and gender biased language related tags, and the `'O'` tags far outnumber the tags for gendered and gender biased language.

In [155]:
targets = list(clf_pers.classes_)
targets.remove('O')
print(targets)

['I-Unknown', 'B-Masculine', 'I-Masculine', 'B-Occupation', 'I-Occupation', 'B-Unknown', 'I-Feminine', 'B-Feminine']


#### Predict

In [156]:
y_pred = clf_pers.predict(X_dev)

#### Evaluate

##### Strict Evaluation Summary

In [157]:
print("  - F1:", metrics.flat_f1_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Prec:", metrics.flat_precision_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))
print("  - Rec", metrics.flat_recall_score(y_dev, y_pred, average="weighted", zero_division=0, labels=targets))

  - F1: 0.41564801734391454
  - Prec: 0.46410238326931585
  - Rec 0.38680097255991663


Save the prediction data:

In [158]:
df_dev_grouped = df_dev_grouped.rename(columns={"tag":"tag_pers_o_expected"})
df_dev_grouped.insert(len(df_dev_grouped.columns), "tag_pers_o_predicted", y_pred)
df_dev_grouped.head()

Unnamed: 0,sentence_id,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
0,2,"[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2...","[Scope, and, Contents, :, Sermons, and, addres...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,8,"[233, 234, 235, 236, 237, 238, 239, 240, 241, ...","[James, Whyte, was, called, upon, to, preach, ...","[[B-Masculine], [I-Masculine], [O], [O], [O], ...","[B-Masculine, O, O, O, O, O, O, O, O, O, O, O,..."
2,19,"[520, 521, 522, 523, 524, 525, 526, 527, 528, ...","[Rev, Tom, Allan, 's, first, charge, was, Nort...","[[B-Masculine], [I-Masculine], [I-Masculine], ...","[I-Unknown, I-Unknown, I-Unknown, I-Unknown, O..."
3,21,"[579, 580, 581, 582, 583, 584, 585, 586, 587, ...","[In, 1953, the, "", Tell, Scotland, "", committe...","[[O], [O], [O], [O], [O], [O], [O], [O], [O], ...","[O, O, O, O, O, O, O, O, O, O, O, I-Unknown, I..."
4,30,"[768, 769, 770, 771, 772, 773, 774, 775, 776, ...","[Title, :, Papers, of, Rev, Prof, Alec, Campbe...","[[O], [O], [O], [O], [B-Masculine, B-Unknown],...","[O, O, O, O, I-Unknown, I-Unknown, B-Unknown, ..."


In [159]:
df_dev_grouped = df_dev_grouped.set_index("sentence_id")
df_dev_exploded = df_dev_grouped.explode(list(df_dev_grouped.columns))
df_dev_exploded.head()

Unnamed: 0_level_0,token_id,sentence,tag_pers_o_expected,tag_pers_o_predicted
sentence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,16,Scope,[O],O
2,17,and,[O],O
2,18,Contents,[O],O
2,19,:,[O],O
2,20,Sermons,[O],O


In [161]:
filename = "crf_{a}_personname_occupation_labels_baseline_fastText{d}.csv".format(a=a, d=d)
df_dev_exploded.to_csv(experiment_dir+filename)

#### Evaluate: