# Gender Biased Document Classification

*February 2023*

* Input data: Aggregated annotated data (`data/aggregated_data/aggregated_final.csv`)
* Output data: Split Person-Name labels data (`data/doc_clf_data/model_input/`)

***

**Table of Contents**

[1.](#1) Prepare the Data

[2.](#2) Split the Data

[3.](#3) Write the Data

***

In [1]:
import utils, config
from pathlib import Path
import numpy as np
import pandas as pd
# from sklearn.model_selection import StratifiedShuffleSplit  # insufficient data to use this
from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
Path(config.docc_path).mkdir(parents=True, exist_ok=True)  # doc_clf_data/

<a id="1"></a>
### 1. Prepare the Data

Load the annotation data:

In [10]:
ann_df = pd.read_csv(config.agg_path+"aggregated_final.csv", index_col=0)
print(ann_df.shape)
ann_df.head()

(55259, 8)


Unnamed: 0,agg_ann_id,file,text,ann_offsets,label,category,associated_genders,description_id
0,0,Coll-1157_00100.ann,knighted,"(1407, 1415)",Gendered-Role,Linguistic,Unclear,2364
1,1,Coll-1310_02300.ann,knighthood,"(9625, 9635)",Gendered-Role,Linguistic,Unclear,4542
2,2,Coll-1281_00100.ann,Prince Regent,"(2426, 2439)",Gendered-Role,Linguistic,Unclear,3660
3,3,Coll-1310_02700.ann,knighthood,"(9993, 10003)",Gendered-Role,Linguistic,Unclear,4678
4,4,Coll-1310_02900.ann,Sir,"(7192, 7195)",Gendered-Role,Linguistic,Unclear,4732


Keep only the `Person-Name` category of labels, since the other labels are better suited to token classification. Annotators assigned `Person-Name` labels based on the information in the description where a person's name occurred, so these labels are suitable for document classification.

In [12]:
ann_df = ann_df.loc[ann_df.category == "Person-Name"]
print("Label count:", ann_df.shape[0])
print("Document (description) count:", len(ann_df.description_id.unique()))

Label count: 31157
Document (description) count: 12658


In [14]:
ann_df.groupby("label").size().reset_index(name="count")

Unnamed: 0,label,count
0,Feminine,1836
1,Masculine,6087
2,Unknown,23234


Implode the annotation data so the annotation labels are grouped by description (one row per description): 

In [15]:
ann_subdf = ann_df.drop(columns=["file", "associated_genders", "category", "text", "ann_offsets"])
ann_subdf_imploded = utils.implodeDataFrame(ann_subdf, ["description_id"])
ann_subdf_imploded.head()

description_id,agg_ann_id,label
1,"[14384, 24275, 26233]","[Unknown, Masculine, Unknown]"
3,"[14385, 14386, 14387]","[Masculine, Masculine, Masculine]"
5,"[9531, 23084]","[Unknown, Masculine]"
7,"[55, 9526, 9527, 9528, 9529, 9530, 9532, 9533,...","[Masculine, Masculine, Unknown, Unknown, Unkno..."
9,"[14000, 24207]","[Unknown, Masculine]"
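`utils.implodeDataFrame` is not shown in this notebook. As a rough sketch of the grouping step, assuming it collects each column's values into a list per group, the same shape of output can be produced with a plain pandas `groupby`. The `implode` helper and the toy rows below are hypothetical and only illustrate the idea:

```python
import pandas as pd


def implode(df, by):
    """Group rows by the `by` columns and collect every other column into lists.

    Hypothetical stand-in for utils.implodeDataFrame, not the project's code.
    """
    return df.groupby(by).agg(list)


# Toy annotation rows: two labels for description 1, one for description 3
toy = pd.DataFrame({
    "description_id": [1, 1, 3],
    "agg_ann_id": [14384, 24275, 14385],
    "label": ["Unknown", "Masculine", "Masculine"],
})

imploded = implode(toy, ["description_id"])
print(imploded.loc[1, "label"])  # ['Unknown', 'Masculine']
```

Each description ID becomes the index of one row, and row order within a group is preserved in the lists.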


Remove any repeated label names from the label lists:

In [16]:
label_lists = list(ann_subdf_imploded.label)
label_sets = [set(label_list) for label_list in label_lists]
ann_subdf_imploded["label"] = label_sets
ann_subdf_imploded.head()

description_id,agg_ann_id,label
1,"[14384, 24275, 26233]","{Unknown, Masculine}"
3,"[14385, 14386, 14387]",{Masculine}
5,"[9531, 23084]","{Unknown, Masculine}"
7,"[55, 9526, 9527, 9528, 9529, 9530, 9532, 9533,...","{Unknown, Masculine, Feminine}"
9,"[14000, 24207]","{Unknown, Masculine}"


Load the description data and join it to the annotation data:

In [21]:
desc_df = pd.read_csv(config.crc_meta_path+"annot_descs.csv", index_col=0)
desc_subdf = desc_df.drop(columns=["file", "description", "word_count", "sent_count"])
desc_subdf = desc_subdf.loc[desc_subdf.field != "Identifier"]
print(desc_subdf.shape)
desc_subdf.head()

(27570, 5)


Unnamed: 0,description_id,start_offset,end_offset,field,clean_desc
1,1,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...
2,2,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19..."
3,3,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...
5,5,17,60,Title,Papers of Rev Tom Allan (1916-1965)
6,6,61,560,Scope and Contents,"Sermons and addresses, 1947-1963; essays and l..."


In [19]:
joined = desc_subdf.join(ann_subdf_imploded, on="description_id", how="outer")
joined = joined.rename(columns={"clean_desc":"description"})
joined = joined.fillna("")
joined.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,agg_ann_id,label
1,1,17,76,Title,Papers of The Very Rev Prof James Whyte (1920-...,"[14384, 24275, 26233]","{Unknown, Masculine}"
2,2,77,633,Scope and Contents,"Sermons and addresses, 1948-1996; lectures, 19...",,
3,3,634,1725,Biographical / Historical,Professor James Aitken White was a leading Sco...,"[14385, 14386, 14387]",{Masculine}
5,5,17,60,Title,Papers of Rev Tom Allan (1916-1965),"[9531, 23084]","{Unknown, Masculine}"
6,6,61,560,Scope and Contents,"Sermons and addresses, 1947-1963; essays and l...",,


In [22]:
assert joined.shape[0] == desc_subdf.shape[0]
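The row-count assertion guards against the outer join adding rows for annotated descriptions that are missing from the description data. If that assertion ever failed, one way to find the offending IDs is a merge with `indicator=True`. This is a minimal sketch on hypothetical toy data, not part of the original pipeline:

```python
import pandas as pd

# Toy stand-ins for the description and imploded annotation tables
descs = pd.DataFrame({
    "description_id": [1, 2, 3],
    "field": ["Title", "Scope and Contents", "Title"],
})
labels = pd.DataFrame({
    "description_id": [1, 3, 4],
    "label": [{"Unknown"}, {"Masculine"}, {"Feminine"}],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = descs.merge(labels, on="description_id", how="outer", indicator=True)

# Annotated description IDs with no matching description row
unmatched_labels = merged.loc[merged["_merge"] == "right_only", "description_id"].tolist()
print(unmatched_labels)  # [4]
```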

Check that every label in each row is one of the valid label names:

In [23]:
valid_label_names = ann_df.label.unique()
print(valid_label_names)

['Feminine' 'Unknown' 'Masculine']


In [24]:
label_col = list(joined.label)
invalid = []
for label_list in label_col:
    if len(label_list) > 0:
        for label_name in label_list:
            if label_name not in valid_label_names:
                invalid += [label_name]
assert len(invalid) == 0, "Label names must be valid"

<a id="2"></a>
### 2. Split the Data

Shuffle the data and then add a column that assigns every row to either the training, validation, or test subset of data. For each DataFrame:
* 60% of the rows are for `training`
* 20% of the rows are for `validation` (or dev test)
* 20% of the rows are for `test` (or blind test)
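`utils.getShuffledSplitData` is not shown in this notebook. A minimal sketch of a shuffled 60/20/20 split, assuming a plain random shuffle followed by positional slicing, could look like the following. The `shuffled_split` helper, its `seed` parameter, and the toy DataFrame are hypothetical, and the exact row counts may differ from the project's function by a row or so because of rounding:

```python
import pandas as pd


def shuffled_split(df, seed=0):
    """Shuffle the rows, then slice them into 60% / 20% / 20% subsets.

    Hypothetical stand-in for utils.getShuffledSplitData.
    """
    shuffled = df.sample(frac=1, random_state=seed)  # frac=1 shuffles every row
    n = len(shuffled)
    train_end = n * 6 // 10      # integer arithmetic avoids float rounding
    validate_end = n * 8 // 10
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:validate_end],
        shuffled.iloc[validate_end:],
    )


demo = pd.DataFrame({
    "description_id": range(10),
    "description": [f"doc {i}" for i in range(10)],
})
demo_train, demo_validate, demo_test = shuffled_split(demo)
print(len(demo_train), len(demo_validate), len(demo_test))  # 6 2 2
```

Fixing the seed keeps the split reproducible, so the same descriptions land in the blind test set on every run.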

In [25]:
train, validate, test = utils.getShuffledSplitData(joined)

In [26]:
splits = [train, validate, test]
for split in splits:
    print(split.shape[0])

16541
5514
5515


In [27]:
total = sum(split.shape[0] for split in splits)
for split in splits:
    print(split.shape[0] / total)  # fraction of rows in each subset

0.5999637286906058
0.2
0.20003627130939428


<a id="3"></a>
### 3. Write the Data

The files separate labels with `\n` (a newline) and descriptions with `\n|\n` (a pipe character surrounded by newlines).
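`utils.writeDocs` and `utils.writeLabels` are not shown in this notebook. The sketch below illustrates the two delimiters on in-memory strings. The helper names are hypothetical, and joining multiple labels for one description with commas is an assumption rather than the project's confirmed format:

```python
DOC_SEP = "\n|\n"  # a pipe on its own line, so descriptions may contain newlines


def join_docs(docs):
    """Serialize descriptions into one string, separated by DOC_SEP."""
    return DOC_SEP.join(docs)


def split_docs(text):
    """Recover the list of descriptions from the serialized string."""
    return text.split(DOC_SEP)


def join_labels(label_sets):
    """One line per description; an unlabelled description yields an empty line.

    Assumption: multiple labels on a line are sorted and comma-separated.
    """
    return "\n".join(",".join(sorted(labels)) for labels in label_sets)


docs = ["Papers of Rev Tom Allan (1916-1965)", "Sermons and addresses,\n1947-1963"]
labels = [{"Unknown", "Masculine"}, ""]  # "" mirrors the fillna("") placeholder

serialized_docs = join_docs(docs)
label_lines = join_labels(labels).split("\n")
print(split_docs(serialized_docs) == docs)  # True
print(label_lines)  # ['Masculine,Unknown', '']
```

A description that itself contains a line consisting only of `|` would break the round trip, so the delimiter relies on that pattern not occurring in the data.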

In [28]:
dir_path = config.docc_path
Path(dir_path).mkdir(parents=True, exist_ok=True)  # doc_clf_data/model_input/

In [29]:
utils.writeDocs(list(train.description), "train_docs.txt", dir_path)
utils.writeLabels(list(train.label), "train_labels.txt", dir_path)

Your documents file has been written!
Your labels file has been written!


In [30]:
utils.writeDocs(list(validate.description), "validate_docs.txt", dir_path)
utils.writeLabels(list(validate.label), "validate_labels.txt", dir_path)

Your documents file has been written!
Your labels file has been written!


In [31]:
utils.writeDocs(list(test.description), "blindtest_docs.txt", dir_path)
utils.writeLabels(list(test.label), "blindtest_labels.txt", dir_path)

Your documents file has been written!
Your labels file has been written!


Write the train, validate, and test split DataFrames to files as well:

In [32]:
dir_path = config.docc_path+"splits_as_csv/"
Path(dir_path).mkdir(parents=True, exist_ok=True)  # doc_clf_data/splits_as_csv/

In [33]:
train.to_csv(dir_path+"aggregated_final_train.csv")
validate.to_csv(dir_path+"aggregated_final_validate.csv")
test.to_csv(dir_path+"aggregated_final_test.csv")