# Preprocessing Data for Baseline Gender Bias Token Classifiers

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validate, and test splits under `../data/token_clf_data/model_input/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary, Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering, Occupation, Omission, Stereotype
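
Because the task is multilabel, one token can carry more than one tag, and the tagged-token table stores each (token, tag) pair as its own row. As a minimal sketch on toy data, assuming only the `token_id` and `tag` column names used later in this notebook, the rows can be collapsed into one label set per token like this:

```python
import pandas as pd

# Toy rows mirroring the tagged-token layout: one row per (token, tag) pair
rows = pd.DataFrame({
    "token_id": [0, 1, 7, 7],
    "tag": ["O", "O", "B-Unknown", "B-Masculine"],
})

# Collect every tag assigned to a token into one sorted list of labels
labels_per_token = rows.groupby("token_id")["tag"].apply(lambda tags: sorted(set(tags)))
print(labels_per_token.to_dict())
```

Token 7 ends up with two labels, which is the situation a multilabel classifier has to handle.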

***

**Table of Contents**

[0. Setup](#0)

[1. Preprocess the Data](#1)

[2. Summarize the Data](#2)

[3. Split the Data](#3)


***

<a id="0"></a>

## 0. Setup

**Import libraries and load data**

In [1]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import numpy as np
import pandas as pd
from pathlib import Path
import os, re

# For preprocessing the text
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# For comparing text offsets
from intervaltree import Interval, IntervalTree

Load the data:

In [2]:
df_tags = pd.read_csv(config.tokc_path+"tagged_tokens.csv", index_col=0)
df_tags.head()

Unnamed: 0,ann_id,description_id,tag,token_id
0,,0,O,0
0,,0,O,1
0,,0,O,2
0,14384.0,1,B-Unknown,7
0,24275.0,1,B-Masculine,7


In [3]:
df_dsat = pd.read_csv(config.tokc_path+"desc_sent_ann_token_tag.csv")
subdf_dsat = df_dsat[["token_id", "sentence_id", "description_id", "token", "token_offsets"]]
subdf_dsat.head()

Unnamed: 0,token_id,sentence_id,description_id,token,token_offsets
0,0,0,0,Identifier,"(0, 10)"
1,1,0,0,:,"(10, 11)"
2,2,0,0,AA5,"(12, 15)"
3,3,1,1,Title,"(17, 22)"
4,4,1,1,:,"(22, 23)"


<a id="1"></a>
## 1. Preprocess the Data

Join the tag data with the token data on the token IDs and description IDs.

In [4]:
df = df_tags.join(subdf_dsat.set_index(["token_id", "description_id"]), on=["token_id", "description_id"])
df.head()

Unnamed: 0,ann_id,description_id,tag,token_id,sentence_id,token,token_offsets
0,,0,O,0,0,Identifier,"(0, 10)"
0,,0,O,1,0,:,"(10, 11)"
0,,0,O,2,0,AA5,"(12, 15)"
0,14384.0,1,B-Unknown,7,1,The,"(34, 37)"
0,24275.0,1,B-Masculine,7,1,The,"(34, 37)"
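
A left join silently fills unmatched rows with NaN, so it can be worth checking that every tagged token found its token text. The following is a minimal sketch on toy data, not part of the original pipeline; it uses `DataFrame.merge` with `indicator=True` in place of `join`, and the toy frames are assumptions made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the tag table and the token table
tags = pd.DataFrame({
    "token_id": [0, 1, 7],
    "description_id": [0, 0, 1],
    "tag": ["O", "O", "B-Unknown"],
})
tokens = pd.DataFrame({
    "token_id": [0, 1],
    "description_id": [0, 0],
    "token": ["Identifier", ":"],
})

# indicator=True adds a "_merge" column recording where each row came from
merged = tags.merge(tokens, on=["token_id", "description_id"], how="left", indicator=True)

# Rows marked "left_only" had no matching token row
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "tagged token(s) without token text")
```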


In [5]:
df = df[["description_id", "sentence_id", "ann_id", "token_id", "token", "tag"]]
df = df.sort_values(by=["description_id", "sentence_id", "token_id"])
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
0,0,0,,0,Identifier,O
0,0,0,,1,:,O
0,0,0,,2,AA5,O
0,1,1,14384.0,7,The,B-Unknown
0,1,1,24275.0,7,The,B-Masculine


<a id="2"></a>
## 2. Summarize the Data

In [6]:
print("Total descriptions:", len(df.description_id.unique()))
print("Total sentences:", len(df.sentence_id.unique()))
print("Total annotations:", df.ann_id.nunique())  # nunique() skips NaN, which marks tokens without an annotation
print("Total tokens:", len(df.token_id.unique()))
print("Total token tags:", df.shape[0])

Total descriptions: 27908
Total sentences: 36405
Total annotations: 55218
Total tokens: 324824
Total token tags: 355434


In [7]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals

Unnamed: 0,tag,total
15,I-Nonbinary,1
5,B-Nonbinary,1
11,I-Gendered-Pronoun,67
12,I-Gendered-Role,726
13,I-Generalization,891
0,B-Feminine,1614
3,B-Generalization,2051
8,B-Stereotype,2614
2,B-Gendered-Role,3577
10,I-Feminine,3782


In [8]:
df_tag_totals.to_csv(config.tokc_path+"token_tag_totals.csv")

In [9]:
# # Make the token_offsets column values tuples of ints
# token_offsets = list(df_tags.token_offsets)
# token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets if type(offsets)]
# token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
# df_tags = df_tags.drop(columns=["token_offsets"])
# df_tags.insert(len(df_tags.columns), "token_offsets", token_offsets_tuples)
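
The commented-out cell above turns the `token_offsets` strings into integer tuples by slicing and splitting each string. An alternative, sketched here on toy data, is `ast.literal_eval` from the standard library, which parses a string such as `"(0, 10)"` directly into a tuple. The `toy` frame is an assumption for illustration only.

```python
import ast
import pandas as pd

# Offsets are stored in the CSV as strings that look like tuples
toy = pd.DataFrame({"token_offsets": ["(0, 10)", "(10, 11)", "(12, 15)"]})

# literal_eval safely evaluates each string as a Python literal
toy["token_offsets"] = toy["token_offsets"].apply(ast.literal_eval)
print(toy["token_offsets"].tolist())
```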

Save the data for token classification:

In [10]:
Path(config.tokc_path+"model_input").mkdir(parents=True, exist_ok=True)

In [11]:
df = df.reset_index(drop=True)  # renumber the rows and discard the old index
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
0,0,0,,0,Identifier,O
1,0,0,,1,:,O
2,0,0,,2,AA5,O
3,1,1,14384.0,7,The,B-Unknown
4,1,1,24275.0,7,The,B-Masculine


In [12]:
df.to_csv(config.tokc_path+"model_input/all_token_data.csv")

<a id="3"></a>
## 3. Split the Data

Split the data randomly, balancing the number of metadata field types (Title, Scope and Contents, Biographical / Historical, and Processing Information) across the train, validation, and test splits.

In [13]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv")
df_descs = df_descs[["description_id", "field"]]
df_descs.head()

Unnamed: 0,description_id,field
0,0,Identifier
1,1,Title
2,2,Scope and Contents
3,3,Biographical / Historical
4,4,Identifier


In [14]:
df = pd.read_csv(config.tokc_path+"model_input/all_token_data.csv", index_col=0)
df = df.join(df_descs.set_index("description_id"), on="description_id", how="left")
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field
0,0,0,,0,Identifier,O,Identifier
1,0,0,,1,:,O,Identifier
2,0,0,,2,AA5,O,Identifier
3,1,1,14384.0,7,The,B-Unknown,Title
4,1,1,24275.0,7,The,B-Masculine,Title


In [16]:
df_train, df_validate, df_test = utils.getShuffledSplitData(df)
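
The body of `utils.getShuffledSplitData` is not shown in this notebook. The sketch below is a hypothetical stand-in, not the project's implementation: it shuffles rows within each `field` value and cuts each group at fixed fractions so that every split receives a proportional share of each field. The function name, the `fractions` and `seed` parameters, and the toy data are assumptions; the 60/20/20 default matches the split sizes reported below (212650, 70885, 70885), and the `train`/`dev`/`test` labels follow the `subset` column in the outputs.

```python
import numpy as np
import pandas as pd

def shuffled_split_by_field(df, field_col="field", fractions=(0.6, 0.2, 0.2), seed=0):
    """Hypothetical helper: shuffle rows per field, then cut into train/dev/test."""
    rng = np.random.default_rng(seed)
    parts = {"train": [], "dev": [], "test": []}
    for _, group in df.groupby(field_col):
        # Shuffle this field's rows with a reproducible permutation
        shuffled = group.iloc[rng.permutation(len(group))]
        n_train = int(round(fractions[0] * len(group)))
        n_dev = int(round(fractions[1] * len(group)))
        parts["train"].append(shuffled.iloc[:n_train])
        parts["dev"].append(shuffled.iloc[n_train:n_train + n_dev])
        parts["test"].append(shuffled.iloc[n_train + n_dev:])
    # Label each split in a "subset" column, as in the outputs of this notebook
    return tuple(pd.concat(parts[name]).assign(subset=name) for name in ("train", "dev", "test"))

toy = pd.DataFrame({
    "token_id": range(20),
    "field": ["Title"] * 10 + ["Scope and Contents"] * 10,
})
toy_train, toy_dev, toy_test = shuffled_split_by_field(toy)
print(len(toy_train), len(toy_dev), len(toy_test))
```

Note that this sketch splits individual rows, so tokens from one description can land in different splits. Splitting on `description_id` instead would keep each description together.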

**Train Data:**

In [20]:
print(df_train.shape)
df_train.head()

(212650, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,subset,field
140640,8892,16887,9753.0,351285,van,B-Masculine,train,Biographical / Historical
40577,2315,5519,53363.0,127520,of,I-Generalization,train,Biographical / Historical
230185,15591,25037,1693.0,497163,Alexander,I-Masculine,train,Biographical / Historical
42968,2384,5854,7991.0,136742,Sim,I-Unknown,train,Biographical / Historical
88919,5445,11974,,268864,1988,O,train,Biographical / Historical


In [26]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_train.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Nonbinary,1
1,B-Nonbinary,1
2,I-Gendered-Pronoun,36
3,I-Gendered-Role,419
4,I-Generalization,498
5,B-Feminine,973
6,B-Generalization,1212
7,B-Stereotype,1544
8,B-Gendered-Role,2184
9,I-Feminine,2272


**Validation Data:**

In [22]:
print(df_validate.shape)
df_validate.head()

(70885, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,subset,field
232039,15714,25371,39398.0,505471,he,B-Gendered-Pronoun,dev,Biographical / Historical
232952,15761,25499,14682.0,508300,",",I-Unknown,dev,Biographical / Historical
59095,3391,7896,2882.0,176164,He,B-Gendered-Pronoun,dev,Biographical / Historical
76598,4732,10734,19367.0,238443,",",I-Unknown,dev,Biographical / Historical
23397,1170,3072,52811.0,70103,whom,I-Stereotype,dev,Biographical / Historical


In [25]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_validate.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,15
1,I-Gendered-Role,166
2,I-Generalization,186
3,B-Feminine,332
4,B-Generalization,425
5,B-Stereotype,551
6,B-Gendered-Role,670
7,I-Feminine,749
8,B-Occupation,812
9,B-Gendered-Pronoun,826


**Test Data:**

In [27]:
print(df_test.shape)
df_test.head()

(70885, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,subset,field
91304,5533,12291,15762.0,277039,the,I-Gendered-Role,test,Biographical / Historical
29066,1517,3701,19065.0,86177,Gibbon,I-Masculine,test,Biographical / Historical
343539,27407,40312,42527.0,722678,Professor,B-Occupation,test,Biographical / Historical
230729,15611,25120,1247.0,499434,Boys,B-Gendered-Role,test,Biographical / Historical
76678,4732,10756,19398.0,239046,his,B-Gendered-Pronoun,test,Biographical / Historical


In [28]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_test.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,16
1,I-Gendered-Role,141
2,I-Generalization,207
3,B-Feminine,309
4,B-Generalization,414
5,B-Stereotype,519
6,B-Gendered-Role,723
7,I-Feminine,761
8,B-Occupation,778
9,B-Gendered-Pronoun,852
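
The full-data totals list a single `B-Nonbinary` token and a single `I-Nonbinary` token, and both appear in the training totals, so neither tag can occur in the validation or test split. A quick way to surface such gaps is to compare the tag sets of two splits. The helper below is a hypothetical sketch on toy data, not a function from `utils`.

```python
import pandas as pd

def tags_missing_from(reference, other, tag_col="tag"):
    """Hypothetical helper: tags that occur in `reference` but never in `other`."""
    return sorted(set(reference[tag_col]) - set(other[tag_col]))

# Toy splits: the rare tags only made it into the training rows
toy_train = pd.DataFrame({"tag": ["O", "B-Nonbinary", "I-Nonbinary", "B-Feminine"]})
toy_test = pd.DataFrame({"tag": ["O", "B-Feminine"]})

print(tags_missing_from(toy_train, toy_test))
```

Run against the real splits, the call would be `tags_missing_from(df_train, df_test)`. A model cannot be evaluated on a tag that is absent from the evaluation split.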


Write each data split to its own CSV file:

In [29]:
df_train.to_csv(config.tokc_path+"model_input/token_train.csv")
df_validate.to_csv(config.tokc_path+"model_input/token_validate.csv")
df_test.to_csv(config.tokc_path+"model_input/token_test.csv")