# Preprocessing Data for Baseline Gender Bias Token Classifiers

* **Supervised learning**
    * Train, Validate, and (Blind) Test Data: under directory `../data/doc_clf_data/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary, Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering, Occupation, Omission, Stereotype

***

**Table of Contents**

[I. Setup](#i)

[II. Tokenization and Categorization](#ii)


***

<a id="i"></a>

## I. Setup

**Import libraries and load data**

In [1]:
# For custom functions
# import utils

# For working with data files and directories
import numpy as np
import pandas as pd
from pathlib import Path

# For preprocessing the text
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
# from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# # For visualization
# import matplotlib.pyplot as plt
# import seaborn as sns

# # For classification with scikit-learn
# from sklearn.preprocessing import MultiLabelBinarizer
# from sklearn.multiclass import OneVsRestClassifier
# from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# # from sklearn.svm import SVC
# # from sklearn.tree import DecisionTreeClassifier
# # from sklearn.tree import export_text
# # from sklearn import tree
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.pipeline import Pipeline
# from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
# from sklearn.metrics import precision_recall_fscore_support

Use the same data splits used for document classification:

In [2]:
# corpus = PlaintextCorpusReader(data_dir, ".*.txt")
# print(corpus.fileids())

# RERUN WITH THESE FILES!!!
# dir_name = "data/aggregated_data/splits/"
# Path(dir_name).mkdir(parents=True, exist_ok=True)
# train.to_csv(dir_name+"aggregated_train.csv")
# validate.to_csv(dir_name+"aggregated_validate.csv")
# test.to_csv(dir_name+"aggregated_test.csv")

df = pd.read_csv("../data/aggregated_data/aggregated_with_eadid_descid_desc_cols.csv")

In [3]:
df = df.rename({"Unnamed: 0":"ann_id"}, axis=1)
df.head()

Unnamed: 0,ann_id,eadid,field,file_ann,offsets_ann,text_ann,label,category,id,description,file_desc,desc_id,file,desc_start_offset,desc_end_offset
0,0,BAI,Title,BAI_01000.ann,"(1290, 1302)",John Baillie,Unknown,Person-Name,211,John Baillie: posthumous,BAI_01000.txt,70381,BAI_01000.txt,1290,1315
1,1,BAI,Scope and Contents,BAI_01300.ann,"(5875, 5894)",Henry Sloane Coffin,Unknown,Person-Name,524,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983
2,2,BAI,Scope and Contents,BAI_01300.ann,"(5925, 5936)",Hugh Martin,Unknown,Person-Name,525,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983
3,3,BAI,Scope and Contents,BAI_01300.ann,"(5951, 5963)",John Baillie,Masculine,Person-Name,526,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983
4,4,BAI,Scope and Contents,BAI_01300.ann,"(5951, 5963)",John Baillie,Unknown,Person-Name,527,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983


Make a new directory to save the token-level classification data:

In [4]:
new_data_dir = "../data/token_clf_data/"
Path(new_data_dir).mkdir(parents=True, exist_ok=True)

<a id="ii"></a>
## II. Tokenization and Categorization

**Categorize every token as one of:**
* **`B-[LABEL_NAME]` for the *beginning* token of an annotated text span with a particular label**
* **`I` for a token *inside* an annotated text span**
* **`O` for a token *outside* any annotated text span**

**Associate each token with a description ID and, for `B-` and `I` tokens, an annotation ID.**
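
The tagging scheme can be illustrated on the first row of the DataFrame shown earlier. This is only a minimal sketch: `bio_tags` is a hypothetical helper, and the regex tokenizer is a stand-in chosen to keep the example self-contained, not the NLTK tokenizer imported in Setup. It assumes annotation offsets are file-level, so they are shifted by the description's start offset before being compared to token positions.

```python
import re

def bio_tags(description, desc_start, ann_span, label):
    """Tag each token of one description against a single annotation span.

    Hypothetical helper for illustration only.
    """
    # Shift the file-level annotation offsets so they are relative to the description
    rel_start = ann_span[0] - desc_start
    rel_end = ann_span[1] - desc_start
    tagged = []
    seen_first = False
    # Stand-in tokenizer: runs of word characters, or single punctuation marks
    for match in re.finditer(r"\w+|[^\w\s]", description):
        if rel_start <= match.start() and match.end() <= rel_end:
            tagged.append((match.group(), "I" if seen_first else f"B-{label}"))
            seen_first = True
        else:
            tagged.append((match.group(), "O"))
    return tagged

# Values taken from row 0 of df.head(): description offsets start at 1290,
# and the annotation "John Baillie" spans (1290, 1302) with label "Unknown"
tagged = bio_tags("John Baillie: posthumous", 1290, (1290, 1302), "Unknown")
print(tagged)
```

A token only counts as annotated here if it lies entirely within the span; how partially overlapping tokens should be treated is a design choice this sketch does not settle.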

#### Create dictionaries with relevant data from the DataFrame

In [5]:
# Get unique description data
desc_df = pd.DataFrame({"description":df.description, "desc_id": df.desc_id, "desc_start_offset":df.desc_start_offset, "desc_end_offset":df.desc_end_offset})
desc_df = desc_df.drop_duplicates()

In [6]:
# Associate each description with its ID: 
# create a dictionary with description texts as keys and description identifiers as values

text_desc_list = list(desc_df.description)
desc_id_list = list(desc_df.desc_id)
desc_text_to_id = dict(zip(text_desc_list, desc_id_list))

assert desc_text_to_id["John Baillie: posthumous"] == 70381
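
Because `desc_text_to_id` is keyed on the description text, two descriptions with identical text but different IDs would collapse into one entry, with the later ID silently replacing the earlier one. Whether the data contains such duplicates is not stated here, so the following is only a sketch of how one might check. `find_shared_texts` is a hypothetical helper and the toy inputs are invented for illustration.

```python
from collections import defaultdict

def find_shared_texts(texts, ids):
    """Return the texts that map to more than one ID (hypothetical helper)."""
    text_to_ids = defaultdict(set)
    for text, desc_id in zip(texts, ids):
        text_to_ids[text].add(desc_id)
    return {text: sorted(id_set) for text, id_set in text_to_ids.items() if len(id_set) > 1}

# Invented toy data: the first and third descriptions share the same text
toy_texts = ["Letters", "Photographs", "Letters"]
toy_ids = [1, 2, 3]
shared = find_shared_texts(toy_texts, toy_ids)
print(shared)
```

In the notebook, the same check could be run on `text_desc_list` and `desc_id_list`; an empty result would mean the text-keyed dictionary loses nothing.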

In [7]:
# Associate each description ID with its offsets:
# create a dictionary with description IDs as keys and tuples of offsets as values

desc_starts = list(desc_df.desc_start_offset)
desc_ends = list(desc_df.desc_end_offset)
desc_offset_tuples = list(zip(desc_starts, desc_ends))
print(desc_offset_tuples[:5])

desc_id_to_offsets = dict(zip(desc_id_list, desc_offset_tuples))
print(desc_id_to_offsets[70381])

[(1290, 1315), (5853, 5983), (5967, 6202), (5297, 5506), (15180, 15419)]
(1290, 1315)


In [9]:
# Associate each description ID with its annotations:
# create a dictionary with description IDs as keys and lists of annotation IDs as values

id_df = pd.DataFrame({"ann_id":df.ann_id, "desc_id":df.desc_id})
desc_id_to_ann_id = dict.fromkeys(desc_id_list)
for desc_id in desc_id_list:
    desc_ann_ids = list((id_df.loc[id_df.desc_id == desc_id]).ann_id)
    desc_id_to_ann_id[desc_id] = desc_ann_ids

assert 2 in desc_id_to_ann_id[47675]
assert 0 not in desc_id_to_ann_id[47675]
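
The loop above filters `id_df` once per description ID, which can become slow as the number of descriptions grows. A single pass over the (annotation ID, description ID) pairs builds the same mapping. The sketch below uses only the standard library and toy rows copied from the `df.head()` output; the names `pairs` and `grouped` are illustrative and not part of the notebook.

```python
from collections import defaultdict

# (ann_id, desc_id) pairs copied from the five rows shown in df.head()
pairs = [(0, 70381), (1, 47675), (2, 47675), (3, 47675), (4, 47675)]

# Group annotation IDs under their description ID in one pass
grouped = defaultdict(list)
for ann_id, desc_id in pairs:
    grouped[desc_id].append(ann_id)
grouped = dict(grouped)

print(grouped)
```

In the notebook, the pairs could come from `zip(df.ann_id, df.desc_id)`. A pandas `groupby` on `desc_id` would be another way to get the same grouping.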

In [21]:
# Associate each annotation ID to its label, text span, and offsets:
# create dictionaries with annotation IDs as keys and labels, text spans, and offsets as values

ann_id_list = list(df.ann_id)
# Make sure every annotation ID is unique
assert len(list(df.ann_id)) == len(set(df.ann_id))
    
# Store the annotation's label name
ann_id_to_label = dict(zip(ann_id_list, list(df.label)))

# Store the text span of the annotation
ann_id_to_text = dict(zip(ann_id_list, list(df.text_ann)))

# Store the annotation's start and end offsets (as a tuple, like the description offsets)
ann_offsets_list = list(df.offsets_ann)
ann_offsets_tuples = []
for offset in ann_offsets_list:
    # Each offset is a string such as "(1290, 1302)": strip the parentheses, then split
    offset_str_pair = offset[1:-1].split(", ")
    ann_offsets_tuples.append(tuple(int(o) for o in offset_str_pair))
ann_id_to_offsets = dict(zip(ann_id_list, ann_offsets_tuples))
assert ann_id_to_offsets[ann_id_list[2]] == ann_offsets_tuples[2]
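
The string slicing above assumes every value in `offsets_ann` has exactly the form `(start, end)` with a comma and a single space. A slightly more defensive sketch parses the string with `ast.literal_eval` from the standard library, which tolerates spacing differences and raises an error on malformed input. `parse_offsets` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import ast

def parse_offsets(offset_str):
    """Parse a string such as "(1290, 1302)" into a tuple of two ints.

    Hypothetical helper for illustration only.
    """
    start, end = ast.literal_eval(offset_str)
    return (int(start), int(end))

# Example value copied from the offsets_ann column shown in df.head()
parsed = parse_offsets("(1290, 1302)")
print(parsed)
```

Unpacking into `start, end` also fails loudly if a string holds anything other than exactly two values.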

#### Tokenize Descriptions and Calculate Token Offsets