# Experiment 1

#### Model Setup

Run the models in the following order, using each model's predicted labels as features for the next model:

1. Multilabel Linguistic Classifier
2. Multiclass Person Name + Occupation Sequence Classifier
3. Multilabel Document Classifier
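
The cascade described above can be sketched in a few lines. This is only an illustration of the idea, not the notebook's implementation: the two stage functions, the `run_cascade` helper, and the `B-Occupation` label are hypothetical stand-ins for the trained classifiers, and the document-level third stage is left out because it operates on whole descriptions rather than tokens.

```python
# Hypothetical stand-ins for trained classifiers: each takes a token's
# feature dict and returns a predicted label.
def linguistic_stage(features):
    pronouns = {"he", "his", "him", "she", "her", "hers"}
    return "B-Gendered-Pronoun" if features["token"].lower() in pronouns else "O"


def person_occupation_stage(features):
    # A later stage can read earlier predictions, e.g.
    # features["predicted_linguistic"]; this toy rule only checks the token.
    return "B-Occupation" if features["token"].lower() == "minister" else "O"


def run_cascade(tokens, stages):
    """Apply stages in order, storing each prediction as a new feature."""
    rows = [{"token": token} for token in tokens]
    for feature_name, predict in stages:
        for row in rows:
            row[feature_name] = predict(row)
    return rows


stages = [
    ("predicted_linguistic", linguistic_stage),
    ("predicted_person_occupation", person_occupation_stage),
]
rows = run_cascade(["After", "his", "ordination", "he", "spent"], stages)
for row in rows:
    print(row)
```

Because each stage writes its prediction back into the feature dict, the order of `stages` determines which predictions are visible to which model.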

***

* Supervised learning
    * Train, Validate, and (Blind) Test Data: under directory `../data/token_clf_data/model_input/`
    * Prediction Data: under directory `../data/token_clf_data/model_output/experiment1/`
* Word Embeddings
    * Custom fastText (word2vec with subwords) embeddings of 100 dimensions trained on the CRC Archives catalog's descriptive metadata (harvested October 2020)
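
To make "word2vec with subwords" concrete: fastText represents a word through its character n-grams, extracted after wrapping the word in `<` and `>` boundary markers, so that rare or unseen words still share pieces with known ones. The helper below is a minimal sketch of that extraction step only. The name `char_ngrams` is hypothetical, and the 3 to 6 range mirrors the library's usual defaults rather than any setting confirmed for these embeddings.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("his", 3, 3))
print(char_ngrams("ordination", 3, 4)[:6])
```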
    
***

Load programming resources:

In [1]:
# For custom functions and variables
import utils, config

# For data analysis
import pandas as pd
import numpy as np
import os, re

# For creating directories
from pathlib import Path

# For preprocessing
from gensim.models import FastText
from gensim import utils as gensim_utils

# For classification
import sklearn_crfsuite

# For evaluation
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# # For evaluation
# from collections import Counter
# from sklearn.metrics import classification_report, make_scorer
# from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, ConfusionMatrixDisplay
# from sklearn.metrics import precision_recall_fscore_support,  f1_score

## Step 1: Linguistic Features

Load the output of the multilabel linguistic classifier (a Classifier Chain with Random Forest) for the three Linguistic labels: Gendered Pronoun, Gendered Role, and Generalization. The predicted labels will serve as features for the next model.

In [4]:
linguistic_features = pd.read_csv(config.tokc_path+"multilabel_model_output/cc-rf_baseline_fastText100_ling_predictions.csv", index_col=0)
linguistic_features.head()

Unnamed: 0,sentence_id,token_id,token,pos,expected_tag,predicted_tag,_merge
0,5,154,After,IN,O,O,true negative
1,5,155,his,PRP$,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
2,5,156,ordination,NN,O,O,true negative
3,5,157,he,PRP,B-Gendered-Pronoun,B-Gendered-Pronoun,true positive
4,5,158,spent,VBD,O,O,true negative


When evaluated loosely, this model had the following scores:

In [8]:
print('''
label           | precision   | recall    | f1           |
----------------------------------------------------------
Gendered Pronoun| 1.0         | 0.98459   | 0.99223      |
----------------------------------------------------------
Gendered Role   | 1.0         | 0.69018   | 0.81669      |
----------------------------------------------------------
Generalization  | 1.0         | 0.47423   | 0.64336      |
''')


label           | precision   | recall    | f1           |
----------------------------------------------------------
Gendered Pronoun| 1.0         | 0.98459   | 0.99223      |
----------------------------------------------------------
Gendered Role   | 1.0         | 0.69018   | 0.81669      |
----------------------------------------------------------
Generalization  | 1.0         | 0.47423   | 0.64336      |
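
As a consistency check on the table, F1 is the harmonic mean of precision and recall, so each row's F1 can be recomputed from its other two columns. The sketch below does that with the reported values; the function name `f1_from` is illustrative, and because the recall values are already rounded, the recomputed F1 may differ from the table in the last decimal place.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
reported = {
    "Gendered Pronoun": (1.0, 0.98459),
    "Gendered Role": (1.0, 0.69018),
    "Generalization": (1.0, 0.47423),
}

for label, (precision, recall) in reported.items():
    print(f"{label:<17}| {f1_from(precision, recall):.4f}")
```

With precision at 1.0 for every label, the differences in F1 come entirely from recall: the model rarely predicts a label wrongly, but it misses roughly half of the Generalization annotations.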



In [10]:
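# Keep only the token ID and the predicted tag, renaming the tag column so it
# stays distinguishable after joining with the model input data
ling_pred_features = linguistic_features[["token_id", "predicted_tag"]]
ling_pred_features = ling_pred_features.rename(columns={"predicted_tag": "predicted_linguistic"})
ling_pred_features.head(2)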

Unnamed: 0,token_id,predicted_linguistic
0,154,O
1,155,B-Gendered-Pronoun


Join this data to the model input data: