# Preprocessing Data for Baseline Gender Bias Token Classifiers

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validate, and test splits under `../data/token_clf_data/model_input/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary, Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering, Occupation, Omission, Stereotype
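
Because the task is multilabel, one token can carry several of these labels at once. A minimal sketch of how a token's tags might be turned into a multi-hot vector follows. The `LABELS` ordering and the `multi_hot` helper are hypothetical, and the sketch assumes tags follow a `B-`/`I-` prefix scheme with `O` for untagged tokens and label names spelled as in the list above.

```python
# Hypothetical label order, taken from the three categories listed above
LABELS = [
    "Unknown", "Non-binary", "Feminine", "Masculine",     # Person Name
    "Generalization", "Gendered Pronoun", "Gendered Role", # Linguistic
    "Empowering", "Occupation", "Omission", "Stereotype",  # Contextual
]

def multi_hot(tags):
    """Map a token's tags (e.g. ["B-Unknown", "B-Masculine"]) to a 0/1 vector.

    Assumes each tag is "O" or "<B|I>-<label>"; the prefix is discarded.
    """
    active = {tag.split("-", 1)[1] for tag in tags if tag != "O"}
    return [1 if label in active else 0 for label in LABELS]

vector = multi_hot(["B-Unknown", "B-Masculine", "B-Stereotype"])
untagged = multi_hot(["O"])
```

Dropping the `B-`/`I-` prefix is a simplification for illustration; a classifier that needs span boundaries would keep it.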

***

**Table of Contents**

[0. Setup](#0)

[1. Preprocess the Data](#1)

[2. Split the Data](#2)


***

<a id="0"></a>

## 0. Setup

**Import libraries and load data**

In [9]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import numpy as np
import pandas as pd
from pathlib import Path
import os

# For preprocessing the text
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# # For visualization
# import matplotlib.pyplot as plt
# import seaborn as sns

Load the tagged token data:

In [5]:
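# Read the tagged tokens and drop the leftover index column
df_tags = pd.read_csv(config.tokc_path+"tagged_tokens.csv")
df_tags = df_tags.drop(columns=["Unnamed: 0"])
# sort_values returns a new DataFrame, so assign the result to keep the sorted order
df_tags = df_tags.sort_values(by=["description_id","token_id"])
df_tags.tail()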

Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
784676,,27907,"(6332, 6337)",O,,cases,753906.0
784677,,27907,"(6338, 6347)",O,,involving,753907.0
784678,,27907,"(6348, 6358)",O,,homosexual,753908.0
784679,,27907,"(6359, 6367)",O,,offences,753909.0
784680,,27907,"(6367, 6368)",O,,.,753910.0


In [6]:
df_tags.loc[df_tags.tag != "O"].head()

Unnamed: 0,ann_id,description_id,offsets,tag,text,token,token_id
7,14384.0,1,"(34, 37)",B-Unknown,The Very Rev Prof James Whyte,The,7.0
8,24275.0,1,"(34, 37)",B-Masculine,The Very Rev Prof James Whyte,The,7.0
9,52952.0,1,"(34, 37)",B-Stereotype,The Very Rev Prof James Whyte,The,7.0
10,14384.0,1,"(38, 42)",I-Unknown,The Very Rev Prof James Whyte,Very,8.0
11,24275.0,1,"(38, 42)",I-Masculine,The Very Rev Prof James Whyte,Very,8.0
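
The rows above show the same token repeated once per annotation, which is where the multilabel structure comes from. As a rough illustration, the tags could be collected per token as below. The `rows` list is copied by hand from the output above, and `tags_by_token` is a hypothetical name rather than anything defined in the notebook.

```python
from collections import defaultdict

# (token_id, token, tag) triples copied from the output above
rows = [
    (7, "The", "B-Unknown"),
    (7, "The", "B-Masculine"),
    (7, "The", "B-Stereotype"),
    (8, "Very", "I-Unknown"),
    (8, "Very", "I-Masculine"),
]

# Collect every tag assigned to the same token, preserving row order
tags_by_token = defaultdict(list)
for token_id, token, tag in rows:
    tags_by_token[(token_id, token)].append(tag)

for (token_id, token), tags in tags_by_token.items():
    print(token_id, token, tags)
```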


In [11]:
df_tags.shape

(784681, 7)

Load the description data:

In [7]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
df_descs.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [12]:
df_descs.shape

(27908, 9)

<a id="1"></a>
## 1. Preprocess the Data

**Tokenize the descriptions into sentences, associate each sentence with a description ID, and then associate every token with a sentence ID.**

#### Sentence Tokenization

In [65]:
# Ignore descriptions that weren't annotated
subdf_descs = df_descs.loc[df_descs.field != "Identifier"]
print(subdf_descs.shape)
# print(subdf_descs.loc[subdf_descs.clean_desc.isna()].shape)

(27570, 9)


In [66]:
# clean_desc can be NaN when the description for a metadata field at the end of one file appears in the next file.
# Rather than dropping those rows (commented out below), fill NaN with an empty string.
# subdf_descs = subdf_descs.loc[~subdf_descs.clean_desc.isna()]
subdf_descs = subdf_descs.fillna("")

In [71]:
subdf_descs.head()

Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
5,5,Title:\nPapers of Rev Tom Allan (1916-1965),AA6_00100.txt,17,60,Title,Papers of Rev Tom Allan (1916-1965),7,1
6,6,"Scope and Contents:\nSermons and addresses, 19...",AA6_00100.txt,61,560,Scope and Contents,"Sermons and addresses, 1947-1963; essays and l...",62,2


#### Associate Tokens to Sentences

In [72]:
sents_dict, offsets_dict = utils.getSentsAndOffsetsFromStrings(
    list(subdf_descs.description),
    list(subdf_descs.description_id),
    list(subdf_descs.start_offset),
    list(subdf_descs.end_offset)
)

In [73]:
desc_id_col = list(sents_dict.keys())
sents_col = list(sents_dict.values())
offsets_col = list(offsets_dict.values())
df_sents = pd.DataFrame({"description_id":desc_id_col, "sentences":sents_col, "sent_offsets":offsets_col})
df_sents.head()

Unnamed: 0,description_id,sentences,sent_offsets
0,1,[Title:\nPapers of The Very Rev Prof James Why...,"[(0, 59)]"
1,2,"[Scope and Contents:\nSermons and addresses, 1...","[(0, 556)]"
2,3,[Biographical / Historical:\nProfessor James A...,"[(0, 155), (155, 273), (273, 398), (398, 607),..."
3,5,[Title:\nPapers of Rev Tom Allan (1916-1965)],"[(0, 43)]"
4,6,"[Scope and Contents:\nSermons and addresses, 1...","[(0, 462), (462, 499)]"
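
In the output above, the `sent_offsets` spans start at 0 for each description and run back to back, with each span ending where the next begins. The implementation of `utils.getSentsAndOffsetsFromStrings` is not shown here, so the following is only a sketch of how such contiguous spans could be produced. `sentence_spans` is a hypothetical helper, and it uses a simple regular expression as a stand-in for a real sentence tokenizer such as NLTK's.

```python
import re

def sentence_spans(text):
    """Return contiguous (start, end) spans that together cover all of text.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    The trailing whitespace is kept inside the span so that spans tile the string.
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.end()
        spans.append((start, end))
        start = end
    # Whatever remains after the last boundary is the final sentence
    if start < len(text):
        spans.append((start, len(text)))
    return spans

text = "He was a minister. He taught at St Andrews. Papers follow"
spans = sentence_spans(text)
sentences = [text[s:e] for s, e in spans]
```

A regular expression like this will split wrongly on abbreviations such as "Rev." or "Prof.", which is one reason to prefer a trained tokenizer in practice.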


In [79]:
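# Give each sentence and its offsets their own row (multi-column explode needs pandas 1.3 or later)
df_sents_exploded = df_sents.explode(["sentences", "sent_offsets"])
# reset_index moves the pre-explode row index into a column, so rows from the same description share its value
df_sents_exploded = df_sents_exploded.reset_index()
df_sents_exploded = df_sents_exploded.rename(columns={"index":"sentence_id"})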
df_sents_exploded.head()

Unnamed: 0,sentence_id,description_id,sentences,sent_offsets
0,0,1,Title:\nPapers of The Very Rev Prof James Whyt...,"(0, 59)"
1,1,2,"Scope and Contents:\nSermons and addresses, 19...","(0, 556)"
2,2,3,Biographical / Historical:\nProfessor James Ai...,"(0, 155)"
3,2,3,He was educated at Daniel Stewart's College an...,"(155, 273)"
4,2,3,After his ordination he spent three years as a...,"(273, 398)"
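
Note that in the output above the three sentences of description 3 all share `sentence_id` 2, because the ID is taken from the row index before exploding. If one ID per sentence is wanted, a running counter is one option. The plain-Python sketch below illustrates the idea; `explode_sentences` and the shortened sentence strings are hypothetical and not part of the notebook.

```python
def explode_sentences(rows):
    """Flatten (description_id, sentences, offsets) rows into one record per sentence.

    Each record gets a sentence_id that is unique across all descriptions.
    """
    exploded = []
    for description_id, sentences, offsets in rows:
        for sentence, span in zip(sentences, offsets):
            exploded.append({
                "sentence_id": len(exploded),  # running counter, never repeats
                "description_id": description_id,
                "sentence": sentence,
                "sent_offsets": span,
            })
    return exploded

# Shortened stand-ins for the sentences; offsets copied from the output above
rows = [
    (1, ["Title: ..."], [(0, 59)]),
    (3, ["Biographical ...", "He was educated ...", "After his ordination ..."],
        [(0, 155), (155, 273), (273, 398)]),
]
records = explode_sentences(rows)
```

In pandas, a similar effect could be obtained by calling `reset_index(drop=True)` after exploding and using the new index as the ID.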


Save the file to a CSV:

In [80]:
df_sents_exploded.to_csv(config.tokc_path+"sentences.csv")
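
The step that remains from the goal stated at the top of this section is linking each token to a sentence. The sketch below shows one way this might be done. It assumes, based on the sample rows (for example, token "The" at offset 34 in a description whose `start_offset` is 17), that token offsets are measured from the start of the file while `sent_offsets` are measured from the start of the description. `sentence_index_for_token` is a hypothetical helper.

```python
from bisect import bisect_right

def sentence_index_for_token(token_start, desc_start, sent_spans):
    """Return the position of the sentence span containing the token, or None.

    token_start: token start offset within the file (assumed)
    desc_start:  start offset of the description within the file
    sent_spans:  sorted, contiguous (start, end) spans relative to the description
    """
    relative = token_start - desc_start
    starts = [start for start, _ in sent_spans]
    # Find the last span whose start is <= the token's relative offset
    i = bisect_right(starts, relative) - 1
    if i < 0 or relative >= sent_spans[i][1]:
        return None
    return i

# Spans and start offset of description 3, copied from the outputs above
spans = [(0, 155), (155, 273), (273, 398)]
desc_start = 634

first = sentence_index_for_token(634, desc_start, spans)
second = sentence_index_for_token(634 + 160, desc_start, spans)
outside = sentence_index_for_token(634 + 400, desc_start, spans)
```

The returned value is a position within one description's sentence list, so it would still need to be mapped to whatever sentence ID scheme the saved CSV uses.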