# Preprocessing Data for Baseline Gender Bias Token Classifiers

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validation, and test splits under `../data/token_clf_data/model_input/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary, Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering, Occupation, Omission, Stereotype
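
Because the task is multilabel, a single token can carry several tags at once (for example, both a Person Name label and a Contextual label). The following is a minimal sketch, on invented toy rows, of how one-row-per-tag data could be collapsed into one list of tags per token. The column names mirror those used later in this notebook; the name `multilabel` and the toy rows themselves are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy rows in the tagged-token layout: one row per (token, tag) pair
rows = pd.DataFrame({
    "token_id": [7, 7, 7, 8, 8, 9],
    "token": ["The", "The", "The", "Very", "Very", "Rev"],
    "tag": ["B-Unknown", "B-Masculine", "B-Stereotype",
            "I-Unknown", "I-Masculine", "O"],
})

# Collect every tag assigned to a token into one sorted list per token
multilabel = (
    rows.groupby(["token_id", "token"])["tag"]
    .agg(lambda tags: sorted(set(tags)))
    .reset_index(name="tags")
)
print(multilabel)
```

A list-valued column like this could then be passed to a multilabel binarizer, although the rest of this notebook keeps the one-row-per-tag layout.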

***

**Table of Contents**

[0. Setup](#0)

[1. Preprocess Data](#1)

[2. Summarize the Data](#2)

[3. Split the Data](#3)


***

<a id="0"></a>

## 0. Setup

**Import libraries and load data**

In [1]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import numpy as np
import pandas as pd
from pathlib import Path
import os, re

# For preprocessing the text
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# For comparing text offsets
from intervaltree import Interval, IntervalTree

Load the data:

In [2]:
df_tags = pd.read_csv(config.tokc_path+"tagged_tokens.csv", index_col=0)
df_tags.head()

Unnamed: 0,ann_id,description_id,tag,token_id
0,,0,O,0
0,,0,O,1
0,,0,O,2
0,14384.0,1,B-Unknown,7
0,24275.0,1,B-Masculine,7


In [3]:
df_dsat = pd.read_csv(config.tokc_path+"desc_sent_ann_token_tag.csv")
subdf_dsat = df_dsat[["token_id", "sentence_id", "description_id", "token", "token_offsets"]]
subdf_dsat.head()

Unnamed: 0,token_id,sentence_id,description_id,token,token_offsets
0,0,0,0,Identifier,"(0, 10)"
1,1,0,0,:,"(10, 11)"
2,2,0,0,AA5,"(12, 15)"
3,3,1,1,Title,"(17, 22)"
4,4,1,1,:,"(22, 23)"


<a id="1"></a>
## 1. Preprocess the Data

Join the tag data with the token data on the token and description IDs.

In [4]:
df = df_tags.join(subdf_dsat.set_index(["token_id", "description_id"]), on=["token_id", "description_id"])
df.head()

Unnamed: 0,ann_id,description_id,tag,token_id,sentence_id,token,token_offsets
0,,0,O,0,0,Identifier,"(0, 10)"
0,,0,O,1,0,:,"(10, 11)"
0,,0,O,2,0,AA5,"(12, 15)"
0,14384.0,1,B-Unknown,7,1,The,"(34, 37)"
0,24275.0,1,B-Masculine,7,1,The,"(34, 37)"
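
A left join on a composite key silently duplicates rows if the right-hand table contains the same key more than once. As a hedged sketch (toy data, not the project's files), the same join can be written with `DataFrame.merge` and its `validate` argument, which raises an error when the expected key relationship does not hold. The frames `tags` and `tokens` below are illustrative stand-ins for `df_tags` and `subdf_dsat`.

```python
import pandas as pd

tags = pd.DataFrame({
    "description_id": [0, 1, 1],
    "token_id": [0, 7, 7],
    "tag": ["O", "B-Unknown", "B-Masculine"],
})
tokens = pd.DataFrame({
    "description_id": [0, 1],
    "token_id": [0, 7],
    "sentence_id": [0, 1],
    "token": ["Identifier", "The"],
})

# many_to_one: several tag rows may share a token, but each
# (token_id, description_id) pair must be unique on the token side
merged = tags.merge(
    tokens,
    on=["token_id", "description_id"],
    how="left",
    validate="many_to_one",
)
print(merged)
```

If the token table did contain duplicate keys, this call would raise a `MergeError` instead of quietly inflating the row count.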


In [5]:
df = df[["description_id", "sentence_id", "ann_id", "token_id", "token", "tag"]]
df = df.sort_values(by=["description_id", "sentence_id", "token_id"])
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
0,0,0,,0,Identifier,O
0,0,0,,1,:,O
0,0,0,,2,AA5,O
0,1,1,14384.0,7,The,B-Unknown
0,1,1,24275.0,7,The,B-Masculine


<a id="2"></a>
## 2. Summarize the Data

In [6]:
print("Total descriptions:", len(df.description_id.unique()))
print("Total sentences:", len(df.sentence_id.unique()))
print("Total annotations:", len(df.ann_id.unique())-1)  # -1 because NaN value indicates no annotation
print("Total tokens:", len(df.token_id.unique()))
print("Total token tags:", df.shape[0])

Total descriptions: 27908
Total sentences: 36405
Total annotations: 55218
Total tokens: 324824
Total token tags: 355434
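
The `-1` adjustment in the annotation count is correct only when at least one NaN is present in `ann_id`, because `unique()` reports NaN as a value of its own. A small sketch on a toy Series shows that `nunique()`, which drops NaN by default, gives the same count without the manual correction:

```python
import numpy as np
import pandas as pd

ann_ids = pd.Series([np.nan, np.nan, 14384.0, 24275.0, 14384.0])

via_unique = len(ann_ids.unique()) - 1  # unique() keeps NaN as one value
via_nunique = ann_ids.nunique()         # nunique() excludes NaN by default
print(via_unique, via_nunique)
```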


In [7]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals

Unnamed: 0,tag,total
15,I-Nonbinary,1
5,B-Nonbinary,1
11,I-Gendered-Pronoun,67
12,I-Gendered-Role,726
13,I-Generalization,891
0,B-Feminine,1614
3,B-Generalization,2051
8,B-Stereotype,2614
2,B-Gendered-Role,3577
10,I-Feminine,3782


In [8]:
df_tag_totals.to_csv(config.tokc_path+"token_tag_totals.csv")

In [9]:
# # Make the token_offsets column values tuples of ints
# token_offsets = list(df_tags.token_offsets)
# token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets if type(offsets)]
# token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
# df_tags = df_tags.drop(columns=["token_offsets"])
# df_tags.insert(len(df_tags.columns), "token_offsets", token_offsets_tuples)

Save the data for token classification:

In [10]:
Path(config.tokc_path+"model_input").mkdir(parents=True, exist_ok=True)

In [11]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
0,0,0,,0,Identifier,O
1,0,0,,1,:,O
2,0,0,,2,AA5,O
3,1,1,14384.0,7,The,B-Unknown
4,1,1,24275.0,7,The,B-Masculine


In [12]:
df.to_csv(config.tokc_path+"model_input/all_token_data.csv")

<a id="3"></a>
## 3. Split the Data

Split the data randomly, balancing the number of metadata field types (Title, Scope and Contents, Biographical / Historical, and Processing Information) across the train, validation, and test splits.

Implode the data by grouping tokens by sentence, split the sentences into train, validation, and test sets, then explode the data so that each row again holds one token.
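
One possible way to implode and explode token rows with pandas is sketched below on toy data. This is an illustration under assumptions, not necessarily the approach taken in the cells that follow, which attach the subset labels to token rows through a join on `sentence_id`. The names `imploded` and `exploded` are hypothetical.

```python
import pandas as pd

tokens = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "token_id": [0, 1, 2, 3, 4],
    "token": ["Identifier", ":", "AA5", "Title", ":"],
})

# Implode: one row per sentence, with the token columns gathered into lists
imploded = tokens.groupby("sentence_id").agg(list).reset_index()

# ... each sentence row would be assigned to a split at this point ...

# Explode: back to one row per token (multi-column explode needs pandas >= 1.3)
exploded = imploded.explode(["token_id", "token"]).reset_index(drop=True)
print(exploded)
```

Splitting at the sentence level keeps all tokens of a sentence in the same subset, which is the property the implode step is meant to guarantee.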

In [2]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv")
df_descs = df_descs[["description_id", "field"]]
df_descs.head()

Unnamed: 0,description_id,field
0,0,Identifier
1,1,Title
2,2,Scope and Contents
3,3,Biographical / Historical
4,4,Identifier


In [3]:
df = pd.read_csv(config.tokc_path+"model_input/all_token_data.csv", index_col=0)
df = df.join(df_descs.set_index("description_id"), on="description_id", how="left")
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field
0,0,0,,0,Identifier,O,Identifier
1,0,0,,1,:,O,Identifier
2,0,0,,2,AA5,O,Identifier
3,1,1,14384.0,7,The,B-Unknown,Title
4,1,1,24275.0,7,The,B-Masculine,Title


In [7]:
# ann_id_values = list(df.loc[~df.ann_id.isna()].ann_id)
# print(max(ann_id_values))  # 55259.0

# Replace NaN values with 99999 (above the largest real ann_id, 55259) to indicate rows without an annotation
df["ann_id"] = df["ann_id"].fillna(99999)
df = df.astype({"ann_id":"int32"})
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field
0,0,0,99999,0,Identifier,O,Identifier
1,0,0,99999,1,:,O,Identifier
2,0,0,99999,2,AA5,O,Identifier
3,1,1,14384,7,The,B-Unknown,Title
4,1,1,24275,7,The,B-Masculine,Title
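
A sentinel such as 99999 works as long as no real annotation ID ever reaches it. As an alternative worth considering, pandas offers a nullable integer dtype that keeps missing values explicit. The sketch below compares the two options on a toy Series; it is not a change to the pipeline above.

```python
import numpy as np
import pandas as pd

ann_ids = pd.Series([np.nan, 14384.0, 24275.0])

# Option 1: sentinel value, as in the cell above
with_sentinel = ann_ids.fillna(99999).astype("int32")

# Option 2: nullable integer dtype keeps the missing value as <NA>
nullable = ann_ids.astype("Int64")

print(with_sentinel.tolist())
print(nullable.isna().tolist())
```

One trade-off: a nullable column is written to CSV with an empty cell for `<NA>`, so it comes back as a float column on reload unless the dtype is specified again when reading.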


Shuffle the sentence IDs randomly and then assign them to train, validation, and (blind) test set splits of the data for creating and evaluating token classifiers:

In [44]:
sents = df[["sentence_id"]]
sents = sents.drop_duplicates()
shuffled_sents = utils.shuffleDataFrame(sents)
# shuffled_sents.head()
train_size, validat_size, test_size = utils.getTrainValTestSizes(shuffled_sents)
shuffled_sents_splits = utils.assignSubsets(shuffled_sents, train_size, validat_size, test_size)
shuffled_sents_splits.head()

Unnamed: 0,subset,sentence_id
11338,train,1291
39387,train,5333
109416,train,13857
267041,train,32648
251428,train,29676


In [45]:
shuffled_sents_splits.groupby("subset").size().reset_index(name="subset_count")  # Looks good

Unnamed: 0,subset,subset_count
0,dev,7281
1,test,7281
2,train,21843


Join the subset data with the token data:

In [50]:
df_joined = df.join(shuffled_sents_splits.set_index("sentence_id"), on="sentence_id", how="left")
df_joined.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
0,0,0,99999,0,Identifier,O,Identifier,dev
1,0,0,99999,1,:,O,Identifier,dev
2,0,0,99999,2,AA5,O,Identifier,dev
3,1,1,14384,7,The,B-Unknown,Title,train
4,1,1,24275,7,The,B-Masculine,Title,train
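
After the join, two properties are worth checking: every sentence belongs to exactly one subset, and the metadata field types are distributed similarly across subsets. The sketch below shows one way to run both checks on toy data; `joined` stands in for `df_joined`, and the other names are illustrative.

```python
import pandas as pd

joined = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1, 2, 3],
    "field": ["Identifier", "Identifier", "Title", "Title",
              "Title", "Scope and Contents"],
    "subset": ["dev", "dev", "train", "train", "test", "train"],
})

# One row per sentence so that long sentences do not dominate the counts
per_sentence = joined.drop_duplicates("sentence_id")

# Share of each field within each subset (each row sums to 1)
field_share = pd.crosstab(
    per_sentence["subset"], per_sentence["field"], normalize="index"
)
print(field_share)

# Each sentence should belong to exactly one subset
subsets_per_sentence = joined.groupby("sentence_id")["subset"].nunique()
print((subsets_per_sentence == 1).all())
```

Large differences between the rows of `field_share` would suggest that a purely random shuffle did not balance the field types and that a stratified split may be needed.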


**Train Data:**

In [51]:
df_train = df_joined.loc[df_joined.subset == "train"]
print(df_train.shape)
df_train.head()

(213094, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
3,1,1,14384,7,The,B-Unknown,Title,train
4,1,1,24275,7,The,B-Masculine,Title,train
5,1,1,52952,7,The,B-Stereotype,Title,train
6,1,1,14384,8,Very,I-Unknown,Title,train
7,1,1,24275,8,Very,I-Masculine,Title,train


In [52]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_train.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Nonbinary,1
1,B-Nonbinary,1
2,I-Gendered-Pronoun,44
3,I-Gendered-Role,445
4,I-Generalization,529
5,B-Feminine,974
6,B-Generalization,1211
7,B-Stereotype,1601
8,B-Gendered-Role,2163
9,I-Feminine,2262


**Validation Data:**

In [53]:
df_validate = df_joined.loc[df_joined.subset == "dev"]
print(df_validate.shape)
df_validate.head()

(70683, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
0,0,0,99999,0,Identifier,O,Identifier,dev
1,0,0,99999,1,:,O,Identifier,dev
2,0,0,99999,2,AA5,O,Identifier,dev
138,3,4,14377,134,He,B-Gendered-Pronoun,Biographical / Historical,dev
139,3,4,14378,148,he,B-Gendered-Pronoun,Biographical / Historical,dev


In [54]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_validate.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,16
1,I-Gendered-Role,136
2,I-Generalization,183
3,B-Feminine,313
4,B-Generalization,405
5,B-Stereotype,504
6,B-Gendered-Role,703
7,I-Feminine,746
8,B-Occupation,810
9,B-Gendered-Pronoun,874


**Test Data:**

In [55]:
df_test = df_joined.loc[df_joined.subset == "test"]
print(df_test.shape)
df_test.head()

(71657, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
144,3,6,14386,178,James,B-Masculine,Biographical / Historical,test
145,3,6,14386,179,Whyte,I-Masculine,Biographical / Historical,test
146,3,6,41263,196,chair,B-Occupation,Biographical / Historical,test
147,3,6,41263,197,of,I-Occupation,Biographical / Historical,test
148,3,6,41263,198,practical,I-Occupation,Biographical / Historical,test


In [56]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_test.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,7
1,I-Gendered-Role,145
2,I-Generalization,179
3,B-Feminine,327
4,B-Generalization,435
5,B-Stereotype,509
6,B-Gendered-Role,711
7,I-Feminine,774
8,B-Occupation,825
9,B-Gendered-Pronoun,838


Write each data split to its own CSV file:

In [57]:
df_train.to_csv(config.tokc_path+"model_input/token_train.csv")
df_validate.to_csv(config.tokc_path+"model_input/token_validate.csv")
df_test.to_csv(config.tokc_path+"model_input/token_test.csv")