# Preprocessing Data for Baseline Gender Bias Token Classifiers

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validation, and test splits under `../data/token_clf_data/model_input/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary, Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering, Occupation, Omission, Stereotype
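
As a rough illustration of how these label categories could translate into token-level tags, the sketch below builds a BIO tag set (`O` plus `B-`/`I-` variants of each label), the scheme suggested by tags such as `B-Unknown` and `I-Masculine` in the tagged token data. The names `LABEL_CATEGORIES`, `build_bio_tags`, and `tag_to_id` are hypothetical, and the spelling of tags for multi-word labels (for example `Gendered Pronoun`) is an assumption rather than something confirmed by the data shown here.

```python
# Label categories as listed above (the dictionary name is illustrative)
LABEL_CATEGORIES = {
    "Person Name": ["Unknown", "Non-binary", "Feminine", "Masculine"],
    "Linguistic": ["Generalization", "Gendered Pronoun", "Gendered Role"],
    "Contextual": ["Empowering", "Occupation", "Omission", "Stereotype"],
}


def build_bio_tags(categories):
    """Return ["O", "B-<label>", "I-<label>", ...] for every label."""
    tags = ["O"]
    for labels in categories.values():
        for label in labels:
            # ASSUMPTION: spaces in multi-word labels become hyphens
            name = label.replace(" ", "-")
            tags.extend([f"B-{name}", f"I-{name}"])
    return tags


bio_tags = build_bio_tags(LABEL_CATEGORIES)
# Map each tag to an integer ID, e.g. for building multi-hot label vectors
tag_to_id = {tag: i for i, tag in enumerate(bio_tags)}
print(len(bio_tags))
```

With 11 labels, this yields 23 tags: one `O` tag plus a `B-` and an `I-` tag per label.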

***

**Table of Contents**

[0. Setup](#0)

[1. Preprocess Data](#1)

[2. Split the Data](#2)


***

<a id="0"></a>

## 0. Setup

**Import libraries and load data**

In [49]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import numpy as np
import pandas as pd
from pathlib import Path
import os

# For preprocessing the text
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag

# For comparing text offsets
from intervaltree import Interval, IntervalTree

Load the tagged token data:

In [59]:
df_tags = pd.read_csv(config.tokc_path+"tagged_tokens.csv")
df_tags = df_tags.drop(columns=["Unnamed: 0"])
# Make the offsets tuples of ints (conversion currently commented out)
# token_offsets = list(df_tags.offsets)
# token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets if type(offsets)]
# token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
# df_tags = df_tags.drop(columns=["offsets"])
# df_tags.insert(len(df_tags.columns), "token_offsets", token_offsets_tuples)
# df_tags.tail()
# Inspect the rows that have no offsets
df_tags.loc[df_tags.offsets.isna()].head()

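| Unnamed: 0 | ann_id | description_id | offsets | tag | text | token | token_id |
|---|---|---|---|---|---|---|---|
| 12471 | 41306.0 | 639 | | | poet | | |
| 12472 | | 639 | | O | | | |
| 14785 | 41490.0 | 705 | | | editor | | |
| 14786 | 41495.0 | 705 | | | author | | |
| 16722 | 12923.0 | 733 | | | he | | |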
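
Because some rows have no offsets, any conversion of the `offsets` strings into tuples of ints needs to tolerate missing values. The following is a minimal sketch of one way to do that, using only the standard library; `parse_offsets` is a hypothetical helper, not a function from `utils`, and it assumes offsets are stored as strings of the form `"(34, 37)"`.

```python
import ast
import math


def parse_offsets(value):
    """Return (start, end) as ints from a string like "(34, 37)".

    Returns None when the value is missing (None or NaN).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # literal_eval safely parses the tuple literal without running code
    start, end = ast.literal_eval(value)
    return int(start), int(end)


examples = ["(34, 37)", "(38, 42)", float("nan")]
parsed = [parse_offsets(v) for v in examples]
print(parsed)  # [(34, 37), (38, 42), None]
```

Under the same assumption, the helper could be applied to the whole column with `df_tags["offsets"].apply(parse_offsets)`.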


In [3]:
df_tags.loc[df_tags.tag != "O"].head()

| Unnamed: 0 | ann_id | description_id | offsets | tag | text | token | token_id |
|---|---|---|---|---|---|---|---|
| 7 | 14384.0 | 1 | (34, 37) | B-Unknown | The Very Rev Prof James Whyte | The | 7.0 |
| 8 | 24275.0 | 1 | (34, 37) | B-Masculine | The Very Rev Prof James Whyte | The | 7.0 |
| 9 | 52952.0 | 1 | (34, 37) | B-Stereotype | The Very Rev Prof James Whyte | The | 7.0 |
| 10 | 14384.0 | 1 | (38, 42) | I-Unknown | The Very Rev Prof James Whyte | Very | 8.0 |
| 11 | 24275.0 | 1 | (38, 42) | I-Masculine | The Very Rev Prof James Whyte | Very | 8.0 |
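
The rows above show why this is a multilabel task: the same token appears once per annotation, so one token can carry several tags. The sketch below collects the tags for each token from rows copied out of this output. The `collect_tags` helper is hypothetical, and keying on `(token_id, token)` is an assumption about how tokens are uniquely identified.

```python
from collections import defaultdict

# (token_id, token, tag) triples taken from the output above
rows = [
    (7.0, "The", "B-Unknown"),
    (7.0, "The", "B-Masculine"),
    (7.0, "The", "B-Stereotype"),
    (8.0, "Very", "I-Unknown"),
    (8.0, "Very", "I-Masculine"),
]


def collect_tags(rows):
    """Group tags by token, keeping first-seen order and dropping "O"."""
    tags_by_token = defaultdict(list)
    for token_id, token, tag in rows:
        # Create the entry even for untagged tokens so they are not lost
        tags = tags_by_token[(token_id, token)]
        if tag != "O" and tag not in tags:
            tags.append(tag)
    return dict(tags_by_token)


token_tags = collect_tags(rows)
for key, tags in token_tags.items():
    print(key, tags)
```

On the full DataFrame, a comparable grouping could be obtained with `df_tags.groupby("token_id")["tag"].agg(list)`, again assuming `token_id` identifies a token uniquely.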


In [4]:
df_tags.shape

(784681, 7)

Load the description data:

In [5]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
df_descs.head()

| Unnamed: 0 | description_id | description | file | start_offset | end_offset | field | clean_desc | word_count | sent_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Identifier: AA5 | AA5_00100.txt | 0 | 16 | Identifier | AA5 | 1 | 1 |
| 1 | 1 | Title:\nPapers of The Very Rev Prof James Whyt... | AA5_00100.txt | 17 | 76 | Title | Papers of The Very Rev Prof James Whyte (1920-... | 10 | 1 |
| 2 | 2 | Scope and Contents:\nSermons and addresses, 19... | AA5_00100.txt | 77 | 633 | Scope and Contents | Sermons and addresses, 1948-1996; lectures, 19... | 65 | 1 |
| 3 | 3 | Biographical / Historical:\nProfessor James Ai... | AA5_00100.txt | 634 | 1725 | Biographical / Historical | Professor James Aitken White was a leading Sco... | 181 | 8 |
| 4 | 4 | Identifier: AA6 | AA6_00100.txt | 0 | 16 | Identifier | AA6 | 1 | 1 |


In [6]:
df_descs.shape

(27908, 9)