# Splitting Data for Gender Bias Token Classifiers

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validate (a.k.a. development), and test splits under `../data/token_clf_data/model_input/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary,* Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering,* Occupation, Omission, Stereotype

*Annotators did not find text on which to apply these labels during the manual annotation process!

***

**Table of Contents**

[0. Setup](#0)

[1. Preprocess Data](#1)

[2. Summarize the Data](#2)

[3. Split the Data](#3)


***

<a id="0"></a>

## 0. Setup

**Import libraries and load data**

In [1]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import pandas as pd
from pathlib import Path

Load the data:

In [2]:
df_tags = pd.read_csv(config.tokc_path+"tagged_tokens.csv", index_col=0)
df_tags.head()

Unnamed: 0,ann_id,description_id,tag,token_id
0,,0,O,0
0,,0,O,1
0,,0,O,2
0,14384.0,1,B-Unknown,7
0,24275.0,1,B-Masculine,7


In [3]:
df_dsat = pd.read_csv(config.tokc_path+"desc_sent_ann_token_tag.csv")
subdf_dsat = df_dsat[["token_id", "sentence_id", "description_id", "token", "token_offsets"]]
subdf_dsat.head()

Unnamed: 0,token_id,sentence_id,description_id,token,token_offsets
0,0,0,0,Identifier,"(0, 10)"
1,1,0,0,:,"(10, 11)"
2,2,0,0,AA5,"(12, 15)"
3,3,1,1,Title,"(17, 22)"
4,4,1,1,:,"(22, 23)"


**Remove data after file Coll-1434_14700, because...**
* Annotator 0 labeled through file Coll-146_00800
* Annotator 1 labeled through file Coll-1434_14700
* Annotator 2 labeled through file Coll-146_28300
* Annotator 3 labeled through file Coll-1434_14700
* Annotator 4 labeled through file Coll-1497_00400

In [5]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
print(df_descs.shape)
df_descs.head()

(27908, 9)


Unnamed: 0,description_id,description,file,start_offset,end_offset,field,clean_desc,word_count,sent_count
0,0,Identifier: AA5,AA5_00100.txt,0,16,Identifier,AA5,1,1
1,1,Title:\nPapers of The Very Rev Prof James Whyt...,AA5_00100.txt,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,10,1
2,2,"Scope and Contents:\nSermons and addresses, 19...",AA5_00100.txt,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",65,1
3,3,Biographical / Historical:\nProfessor James Ai...,AA5_00100.txt,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,181,8
4,4,Identifier: AA6,AA6_00100.txt,0,16,Identifier,AA6,1,1


In [31]:
last_descid = max(list(df_descs.loc[df_descs.file == "Coll-1434_14700.txt"].description_id))  #"Coll-1497_00400.txt"
print(last_descid)

13547


In [32]:
print("Before:", df_tags.shape)
df_tags = df_tags.loc[df_tags.description_id <= last_descid]
print("After:", df_tags.shape)

Before: (192323, 4)
After: (192323, 4)


<a id="1"></a>
## 1. Preprocess the Data

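Join the tag data to the token data on the token and description IDs. The outer join keeps every token from `subdf_dsat`, including tokens that have no row in `df_tags`.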

In [33]:
df = df_tags.join(subdf_dsat.set_index(["token_id", "description_id"]), on=["token_id", "description_id"], how="outer")
df.head()

Unnamed: 0,ann_id,description_id,tag,token_id,sentence_id,token,token_offsets
0,,0,O,0,0,Identifier,"(0, 10)"
0,,0,O,1,0,:,"(10, 11)"
0,,0,O,2,0,AA5,"(12, 15)"
0,14384.0,1,B-Unknown,7,1,The,"(34, 37)"
0,24275.0,1,B-Masculine,7,1,The,"(34, 37)"

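To make the effect of the outer join easier to inspect, here is a minimal sketch on toy data. It swaps in `DataFrame.merge` with `indicator=True` in place of the `join` call used in this notebook, so that each row reports which table it came from. The frames `tags` and `tokens` and their values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy stand-ins for df_tags and subdf_dsat (illustrative values only)
tags = pd.DataFrame({
    "description_id": [0, 1, 1],
    "token_id": [0, 7, 7],
    "tag": ["O", "B-Unknown", "B-Masculine"],
})
tokens = pd.DataFrame({
    "description_id": [0, 1, 1],
    "token_id": [0, 7, 8],
    "token": ["Identifier", "The", "Very"],
})

# Outer merge on both key columns; the _merge column records the source of each row
keys = ["token_id", "description_id"]
joined = tags.merge(tokens, on=keys, how="outer", indicator=True)

# Rows marked "right_only" are tokens without any tag row; their tag is NaN
untagged = joined.loc[joined["_merge"] == "right_only"]
print(joined)
print("Tokens without a tag row:", untagged["token"].tolist())
```

A token with two annotations (token 7 here) appears twice after the merge, once per tag, which is why the number of token tags can exceed the number of tokens.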

In [34]:
# Replace NaN values in the ann_id column with 99999 to indicate rows without an annotation
df["ann_id"] = df["ann_id"].fillna(99999).astype("int32")
# Replace NaN values in the tag column with O to indicate rows without an annotation
df["tag"] = df["tag"].fillna("O")
df.loc[df.sentence_id == 4].head()

Unnamed: 0,ann_id,description_id,tag,token_id,sentence_id,token,token_offsets
1,14377,3,B-Gendered-Pronoun,134,4,He,"(789, 791)"
1,14378,3,B-Gendered-Pronoun,148,4,he,"(871, 873)"
8076,99999,3,O,135,4,was,"(792, 795)"
8076,99999,3,O,136,4,educated,"(796, 804)"
8076,99999,3,O,137,4,at,"(805, 807)"


In [35]:
df = df[["description_id", "sentence_id", "ann_id", "token_id", "token", "tag"]]
df = df.sort_values(by=["description_id", "sentence_id", "token_id"])
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
0,0,0,99999,0,Identifier,O
0,0,0,99999,1,:,O
0,0,0,99999,2,AA5,O
8076,1,1,99999,3,Title,O
8076,1,1,99999,4,:,O


Check that no token values are missing (NaN):

In [36]:
df.loc[df.token.isna()].shape

(0, 6)

Reset the index:

In [37]:
df = df.reset_index(drop=True)
df.tail()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
779115,27907,42029,99999,753927,cases,O
779116,27907,42029,99999,753928,involving,O
779117,27907,42029,99999,753929,homosexual,O
779118,27907,42029,99999,753930,offences,O
779119,27907,42029,99999,753931,.,O


<a id="2"></a>
## 2. Summarize the Data

In [38]:
print("Total descriptions:", len(df.description_id.unique()))
print("Total sentences:", len(df.sentence_id.unique()))
print("Total annotations:", len(df.ann_id.unique())-1)  # -1 because the 99999 value indicates no annotation
print("Total tokens:", len(df.token_id.unique()))
print("Total token tags:", df.shape[0])

Total descriptions: 27908
Total sentences: 42030
Total annotations: 35258
Total tokens: 753932
Total token tags: 779120


In [39]:
unique_tokens = list(set(df.token))
unique_words = [token for token in unique_tokens if token.isalpha()]
print("Total word types (unique words):", len(unique_words))

Total word types (unique words): 36462

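Note that the count above is case-sensitive and skips any token for which `str.isalpha()` is false, such as hyphenated ranges or clitics. The following sketch, on a made-up token list, shows how a case-folded count could differ; whether case folding is appropriate here is an assumption, not something the notebook states.

```python
# Illustrative tokens only; not taken from the real dataset
tokens = ["Papers", "of", "papers", "1948-1996", "'s", "Rev", "of"]

# Case-sensitive word types, as in the cell above
word_types = {token for token in tokens if token.isalpha()}

# Case-folded word types: "Papers" and "papers" collapse into one type
folded_types = {token.lower() for token in tokens if token.isalpha()}

print("Case-sensitive word types:", len(word_types))
print("Case-folded word types:", len(folded_types))
```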

In [40]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals

Unnamed: 0,tag,total
15,I-Nonbinary,1
5,B-Nonbinary,1
11,I-Gendered-Pronoun,54
12,I-Gendered-Role,611
13,I-Generalization,775
8,B-Stereotype,1234
3,B-Generalization,1269
0,B-Feminine,1461
2,B-Gendered-Role,2772
6,B-Occupation,2951


In [41]:
df_tag_totals.to_csv(config.tokc_path+"token_tag_totals.csv")

In [42]:
# # Make the token_offsets column values tuples of ints
# token_offsets = list(df_tags.token_offsets)
# token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets if type(offsets)]
# token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
# df_tags = df_tags.drop(columns=["token_offsets"])
# df_tags.insert(len(df_tags.columns), "token_offsets", token_offsets_tuples)

Save the data for token classification:

In [43]:
Path(config.tokc_path+"model_input").mkdir(parents=True, exist_ok=True)

In [44]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag
0,0,0,99999,0,Identifier,O
1,0,0,99999,1,:,O
2,0,0,99999,2,AA5,O
3,1,1,99999,3,Title,O
4,1,1,99999,4,:,O


In [45]:
df.to_csv(config.tokc_path+"model_input/all_token_data.csv")

<a id="3"></a>
## 3. Split the Data

Split the data randomly, balancing the number of metadata field types (Title, Scope and Contents, Biographical / Historical, and Processing Information) across the train, validation, and test splits.

Implode the data, grouping tokens by sentence. After the sentences are split into train, validation, and test sets, explode the data again so that each row holds one token.
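
The cells in this section reach the same result by assigning a subset label to each sentence ID and joining it back onto the token rows. For comparison, a literal implode/explode round trip might look like the sketch below. The toy frame and the hard-coded subset labels are illustrative assumptions, and exploding several columns at once assumes pandas 1.3 or later.

```python
import pandas as pd

# Toy token table (illustrative values only)
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "token_id": [0, 1, 2, 3, 4],
    "token": ["Identifier", ":", "AA5", "Title", ":"],
    "tag": ["O", "O", "O", "O", "O"],
})

# Implode: one row per sentence, with list-valued token_id, token, and tag columns
imploded = toy.groupby("sentence_id").agg(list).reset_index()

# Assign each sentence to a split (hard-coded here purely for illustration)
imploded["subset"] = ["test", "train"]

# Explode: back to one row per token; the lists in each row have equal lengths
exploded = imploded.explode(["token_id", "token", "tag"]).reset_index(drop=True)
print(exploded)
```

Because the subset label is attached to the sentence row before exploding, every token of a sentence inherits the same split.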

In [46]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv")
df_descs = df_descs[["description_id", "field"]]
df_descs.head()

Unnamed: 0,description_id,field
0,0,Identifier
1,1,Title
2,2,Scope and Contents
3,3,Biographical / Historical
4,4,Identifier


In [47]:
df = pd.read_csv(config.tokc_path+"model_input/all_token_data.csv", index_col=0)
df = df.join(df_descs.set_index("description_id"), on="description_id", how="left")
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field
0,0,0,99999,0,Identifier,O,Identifier
1,0,0,99999,1,:,O,Identifier
2,0,0,99999,2,AA5,O,Identifier
3,1,1,99999,3,Title,O,Title
4,1,1,99999,4,:,O,Title


In [48]:
# ann_id_values = list(df.loc[~df.ann_id.isna()].ann_id)
# print(max(ann_id_values))  # 55259.0

Shuffle the sentence IDs randomly and then assign them to train, validation, and (blind) test set splits of the data for creating and evaluating token classifiers:

In [49]:
sents = df[["sentence_id"]]
sents = sents.drop_duplicates()
shuffled_sents = utils.shuffleDataFrame(sents)
# shuffled_sents.head()
train_size, validat_size, test_size = utils.getTrainValTestSizes(shuffled_sents)
shuffled_sents_splits = utils.assignSubsets(shuffled_sents, train_size, validat_size, test_size)
shuffled_sents_splits.head()

Unnamed: 0,subset,sentence_id
775153,train,41770
184904,train,7756
347141,train,15662
47948,train,2041
595279,train,31629

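The helpers in `utils` are not shown in this notebook. As a rough sketch of what shuffling and subset assignment could look like, the functions below are hypothetical stand-ins, not the actual `utils` code. The 60/20/20 proportions are an assumption that is consistent with the subset counts reported in the next cell (25218, 8406, and 8406 of 42030 sentences).

```python
import pandas as pd

def shuffle_frame(frame: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return the rows in random order; a fixed seed keeps the split reproducible."""
    return frame.sample(frac=1, random_state=seed)

def split_sizes(n: int, train_frac: float = 0.6, val_frac: float = 0.2) -> tuple:
    """Compute train/validation/test sizes; the test set takes the remainder."""
    train_size = round(n * train_frac)
    val_size = round(n * val_frac)
    return train_size, val_size, n - train_size - val_size

def assign_subsets(frame: pd.DataFrame, train_size: int, val_size: int, test_size: int) -> pd.DataFrame:
    """Label the first rows 'train', the next 'dev', and the remaining rows 'test'."""
    labels = ["train"] * train_size + ["dev"] * val_size + ["test"] * test_size
    return frame.assign(subset=labels)

# Toy sentence IDs (illustrative only)
sents = pd.DataFrame({"sentence_id": range(10)})
shuffled = shuffle_frame(sents, seed=42)
sizes = split_sizes(len(shuffled))
splits = assign_subsets(shuffled, *sizes)
print(splits["subset"].value_counts().to_dict())
```

Passing an explicit seed is a design choice worth considering: without one, rerunning the notebook produces a different split each time.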

In [50]:
shuffled_sents_splits.groupby("subset").size().reset_index(name="subset_count")  # Looks good

Unnamed: 0,subset,subset_count
0,dev,8406
1,test,8406
2,train,25218


Join the subset data with the token data:

In [51]:
df_joined = df.join(shuffled_sents_splits.set_index("sentence_id"), on="sentence_id", how="left")
df_joined.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
0,0,0,99999,0,Identifier,O,Identifier,test
1,0,0,99999,1,:,O,Identifier,test
2,0,0,99999,2,AA5,O,Identifier,test
3,1,1,99999,3,Title,O,Title,train
4,1,1,99999,4,:,O,Title,train


In [52]:
df_joined.loc[df_joined.sentence_id == 4]

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
152,3,4,14377,134,He,B-Gendered-Pronoun,Biographical / Historical,test
153,3,4,99999,135,was,O,Biographical / Historical,test
154,3,4,99999,136,educated,O,Biographical / Historical,test
155,3,4,99999,137,at,O,Biographical / Historical,test
156,3,4,99999,138,Daniel,O,Biographical / Historical,test
157,3,4,99999,139,Stewart,O,Biographical / Historical,test
158,3,4,99999,140,'s,O,Biographical / Historical,test
159,3,4,99999,141,College,O,Biographical / Historical,test
160,3,4,99999,142,and,O,Biographical / Historical,test
161,3,4,99999,143,the,O,Biographical / Historical,test


Looks good!
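
Beyond inspecting one sentence by eye, the join can be checked programmatically. The sketch below uses a toy stand-in for `df_joined` (illustrative values only) to confirm that every row received a subset label and that no sentence is spread over more than one subset. It also lists descriptions whose sentences fall into different subsets, which can happen because the split is made per sentence rather than per description.

```python
import pandas as pd

# Toy stand-in for df_joined (illustrative values only)
toy = pd.DataFrame({
    "description_id": [3, 3, 3, 3, 4],
    "sentence_id": [4, 4, 5, 5, 6],
    "subset": ["test", "test", "dev", "dev", "train"],
})

# Every token row should have received a subset label from the left join
assert toy["subset"].notna().all()

# Every sentence should belong to exactly one subset
subsets_per_sentence = toy.groupby("sentence_id")["subset"].nunique()
assert (subsets_per_sentence == 1).all()

# Descriptions may still span several subsets, because the split is by sentence
subsets_per_description = toy.groupby("description_id")["subset"].nunique()
straddling = subsets_per_description[subsets_per_description > 1].index.tolist()
print("Descriptions spanning more than one subset:", straddling)
```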

**Train Data:**

In [53]:
df_train = df_joined.loc[df_joined.subset == "train"]
print(df_train.shape)
df_train.head()

(467477, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
3,1,1,99999,3,Title,O,Title,train
4,1,1,99999,4,:,O,Title,train
5,1,1,99999,5,Papers,O,Title,train
6,1,1,99999,6,of,O,Title,train
7,1,1,14384,7,The,B-Unknown,Title,train


In [54]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_train.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Nonbinary,1
1,B-Nonbinary,1
2,I-Gendered-Pronoun,24
3,I-Gendered-Role,361
4,I-Generalization,448
5,B-Stereotype,744
6,B-Generalization,772
7,B-Feminine,840
8,B-Gendered-Role,1673
9,B-Occupation,1823


**Validation Data:**

In [55]:
df_validate = df_joined.loc[df_joined.subset == "dev"]
print(df_validate.shape)
df_validate.head()

(157705, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
172,3,5,99999,154,After,O,Biographical / Historical,dev
173,3,5,14379,155,his,B-Gendered-Pronoun,Biographical / Historical,dev
174,3,5,99999,156,ordination,O,Biographical / Historical,dev
175,3,5,14380,157,he,B-Gendered-Pronoun,Biographical / Historical,dev
176,3,5,99999,158,spent,O,Biographical / Historical,dev


In [56]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_validate.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,16
1,I-Gendered-Role,134
2,I-Generalization,144
3,B-Generalization,239
4,B-Stereotype,252
5,B-Feminine,323
6,B-Gendered-Role,586
7,B-Occupation,654
8,B-Gendered-Pronoun,743
9,I-Occupation,780


**Test Data:**

In [57]:
df_test = df_joined.loc[df_joined.subset == "test"]
print(df_test.shape)
df_test.head()

(153938, 8)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,tag,field,subset
0,0,0,99999,0,Identifier,O,Identifier,test
1,0,0,99999,1,:,O,Identifier,test
2,0,0,99999,2,AA5,O,Identifier,test
152,3,4,14377,134,He,B-Gendered-Pronoun,Biographical / Historical,test
153,3,4,99999,135,was,O,Biographical / Historical,test


In [58]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_test.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,14
1,I-Gendered-Role,116
2,I-Generalization,183
3,B-Stereotype,238
4,B-Generalization,258
5,B-Feminine,298
6,B-Occupation,474
7,B-Gendered-Role,513
8,I-Occupation,565
9,I-Feminine,696

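The three per-split tag tables above can also be compared side by side. As a minimal sketch, `pd.crosstab` builds a tag-by-subset count table in one call, and `normalize="columns"` turns the counts into within-split proportions. The toy frame below is an illustrative assumption standing in for `df_joined`.

```python
import pandas as pd

# Toy stand-in for df_joined (illustrative values only)
toy = pd.DataFrame({
    "tag": ["O", "O", "B-Feminine", "O", "B-Occupation", "B-Feminine"],
    "subset": ["train", "train", "train", "dev", "dev", "test"],
})

# Raw counts: one row per tag, one column per subset
counts = pd.crosstab(toy["tag"], toy["subset"])

# Proportions within each subset, so splits of different sizes are comparable
shares = pd.crosstab(toy["tag"], toy["subset"], normalize="columns")

print(counts)
print(shares.round(2))
```

Tags that are missing from a split show up as zeros, which makes rare labels easy to spot.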

Write each data split to its own file:

In [59]:
df_train.to_csv(config.tokc_path+"model_input/token_train.csv")
df_validate.to_csv(config.tokc_path+"model_input/token_validate.csv")
df_test.to_csv(config.tokc_path+"model_input/token_test.csv")