# Splitting Data for Experiments with Gender Biased Language Classifiers

## Modified k-Fold Cross-Validation

### k = 5

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validation (a.k.a. development), and test splits under `../data/token_clf_data/experiment_input/`
* **Multilabel + multiclass classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary\*, Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering\*, Occupation, Omission, Stereotype

\*Annotators did not find text on which to apply these labels during the manual annotation process.

***

#### Table of Contents

[Preparation](#prep)

[Modified k-Fold Cross-Validation](#cv)

[Appendix: Initial Approach](#a)

***

<a id="prep"></a>
## Preparation

### Token Classification Data

Import libraries and load data:

In [1]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import pandas as pd
from pathlib import Path

Implode the data, grouping tokens by sentence. After the sentences are split into train, validation, and test sets, explode the data again so that each row holds one token.
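The implode/explode round trip can be sketched on a toy table as follows. The column names mirror the token data, but the rows are invented for illustration, and exploding several columns at once assumes pandas 1.3 or later.

```python
import pandas as pd

# Toy token table: column names follow the notebook, values are illustrative only
tokens = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "token_id": [0, 1, 2, 3, 4],
    "token": ["Identifier", ":", "AA5", "Title", ":"],
})

# Implode: one row per sentence, with that sentence's values collected into lists
imploded = tokens.groupby("sentence_id", sort=False).agg(list).reset_index()

# ... sentences would be assigned to splits here, one row per sentence ...

# Explode: back to one row per token (list columns are expanded in parallel)
exploded = imploded.explode(["token_id", "token"], ignore_index=True)
print(exploded)
```

Because the split is assigned while each sentence is a single row, every token of a sentence ends up in the same split after exploding.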

In [2]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv")
df_descs = df_descs[["description_id", "field"]]
df_descs.head()

Unnamed: 0,description_id,field
0,0,Identifier
1,1,Title
2,2,Scope and Contents
3,3,Biographical / Historical
4,4,Identifier


In [3]:
df = pd.read_csv(config.tokc_path+"model_input/all_token_data.csv", index_col=0)
df = df.join(df_descs.set_index("description_id"), on="description_id", how="left")
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title
4,1,1,99999,4,:,"(22, 23)",:,O,Title


### Document Classification Data

In [9]:
target_labels="so"
train = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_train.csv".format(target_labels), index_col=0)
dev = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_validate.csv".format(target_labels), index_col=0)
test = pd.read_csv(config.docc_path+"model_input/"+"{}_splits_as_csv/aggregated_final_test.csv".format(target_labels), index_col=0)
df_doc = pd.concat([train, dev, test])
df_doc["label"] = df_doc["label"].fillna("{}")
df_doc = df_doc.loc[~df_doc.description.isna()]
df_doc = utils.getColumnValuesAsLists(df_doc, "label")
df_doc = df_doc.reset_index().drop(columns=["index"])    # Reset the index values
df_doc.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,label
0,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,[Omission]
1,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,[]
2,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,[]
3,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,"[Omission, Stereotype]"
4,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,train,[Omission]


In [10]:
assert len(df_doc.description_id.unique()) == df_doc.shape[0]

In [11]:
df_doc.shape

(27312, 7)

<a id="cv"></a>
## Modified k-Fold Cross-Validation

Split the data into *k* = 5 folds of equal size (20% of the total data each).  Experiment 1 runs three classifiers sequentially, with each classifier providing features for the next.  Five models are run at each step of the experiment, so classifier 1 is actually five models combined, and classifier 2 is likewise five models combined.  This ensures that the features passed to the next classifier are predictions made on testing data, rather than manual annotations of training data.
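One way to picture the five models per step is a rotation in which each fold is held out once. The sketch below assumes that each model is fit on the other four folds (how those four are divided between training and validation is not specified here), and the function name `rotate_folds` is hypothetical.

```python
# Fold names follow the "split{i}" convention used in this notebook
FOLDS = ["split{}".format(i) for i in range(5)]

def rotate_folds(folds):
    """Yield (fit_folds, held_out_fold) pairs, holding out each fold exactly once."""
    for held_out in folds:
        fit_folds = [f for f in folds if f != held_out]
        yield fit_folds, held_out

rounds = list(rotate_folds(FOLDS))
for fit_folds, held_out in rounds:
    print("fit on", fit_folds, "-> predict on", held_out)
```

Collecting the held-out predictions from all five rounds gives one prediction per row, each produced by a model that never saw that row while being fit.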

When randomly splitting data, balance the number of metadata field types (Title, Scope and Contents, Biographical / Historical, and Processing Information) across the train, validation, and test splits, and ensure that sentences are complete (not split across folds of the data).
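A minimal sketch of balancing field types across folds, assuming one row per unit (sentence or description) and a `field` column. The helper `assign_folds_by_field` is hypothetical and is not the method used in the cells below: it shuffles the rows, then deals the rows of each field into folds round-robin.

```python
import pandas as pd

def assign_folds_by_field(df, k=5, seed=0):
    """Shuffle rows, then deal the rows of each field into k folds round-robin."""
    out = df.sample(frac=1, random_state=seed).copy()
    # Position of each row within its field, in shuffled order
    position = out.groupby("field").cumcount()
    out["fold"] = ["split{}".format(p % k) for p in position]
    return out.sort_index()

# Illustrative data: 10 Title units and 5 Scope and Contents units
units = pd.DataFrame({
    "description_id": range(15),
    "field": ["Title"] * 10 + ["Scope and Contents"] * 5,
})
folded = assign_folds_by_field(units)
counts = pd.crosstab(folded["field"], folded["fold"])
print(counts)
```

With this scheme the per-field counts in any two folds differ by at most one. Since each row is a whole sentence or description, no unit is split across folds.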

Shuffle the sentence IDs randomly and then assign them to specific splits of the data for creating, evaluating, and predicting with classifiers:

In [15]:
# INPUT:   A shuffled DataFrame for a particular metadata field and list of floats (fractions) for data splits
# OUTPUT: The shuffled DataFrame with a split assigned to each row (based on the index of the input DataFrame) in
#         an inserted "fold" column
def splitData(df, splits):
    remaining = list(df.index)
    split_col = []
    for i,fraction in enumerate(splits):
        split_name = "split{}".format(i)
        if (i == (len(splits)-1)):
            split_col = split_col + ([split_name]*len(remaining))
        else:
            split = remaining[ : int(df.shape[0]*fraction)]
            remaining = remaining[int(df.shape[0]*fraction) : ]
            split_col = split_col + ([split_name]*len(split))
    df.insert(len(df.columns)-1, "fold", split_col)
    return df

In [13]:
# # Tokens
# sents = df[["sentence_id"]]
# sents = sents.drop_duplicates()
# shuffled_sents = utils.shuffleDataFrame(sents)
# -----------------
# Documents
descs = df_doc[["description_id"]]
shuffled_descs = utils.shuffleDataFrame(descs)
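`utils.shuffleDataFrame` is a project helper whose source is not shown in this notebook. A minimal sketch of an equivalent, assuming it returns the rows in random order while keeping their original index labels, could use `DataFrame.sample`. The name `shuffle_rows` and the seed value are illustrative.

```python
import pandas as pd

def shuffle_rows(df, seed=42):
    """Return the rows of df in random order, keeping the original index labels."""
    return df.sample(frac=1, random_state=seed)

ids = pd.DataFrame({"description_id": [4699, 8942, 5440, 3474, 4769]})
shuffled = shuffle_rows(ids)
print(shuffled)
```

Fixing `random_state` makes the shuffle, and therefore the fold assignment, reproducible across runs.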

In [20]:
# # Tokens
# shuffled_sents_split = splitData(shuffled_sents, [0.2, 0.2, 0.2, 0.2, 0.2])
# shuffled_sents_split.head()
# -----------------
# Documents
shuffled_descs_split = splitData(shuffled_descs, [0.2, 0.2, 0.2, 0.2, 0.2])
shuffled_descs_split.tail()

Unnamed: 0,fold,description_id
20691,split4,376
5699,split4,4289
10742,split4,24177
16921,split4,20189
25796,split4,14908


In [21]:
## Tokens
# shuffled_sents_split.groupby("fold").size().reset_index(name="count")  # Looks good
# -----------------
# Documents
shuffled_descs_split.groupby("fold").size().reset_index(name="count")  # Looks good

Unnamed: 0,fold,count
0,split0,5462
1,split1,5462
2,split2,5462
3,split3,5462
4,split4,5464


All folds have the same or nearly the same number of rows (sentences or descriptions). 
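The slightly larger last fold follows from how `splitData` truncates: each of the first four folds takes `int(n * 0.2)` rows, and the last fold takes whatever remains. The arithmetic for the 27,312 descriptions can be checked directly:

```python
n_rows = 27312
fractions = [0.2, 0.2, 0.2, 0.2, 0.2]

# Every fold except the last gets int(n_rows * fraction) rows
sizes = [int(n_rows * f) for f in fractions[:-1]]
# The last fold absorbs the rows left over after truncation
sizes.append(n_rows - sum(sizes))

print(sizes)  # [5462, 5462, 5462, 5462, 5464]
```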

Join the fold assignments with the document data (or with the token data, using `sentence_id` as the join key):

In [23]:
join_id = "description_id"  # "sentence_id"
shuffled_split = shuffled_descs_split.copy()
df_joined = df_doc.join(shuffled_split.set_index(join_id), on=join_id, how="left")
df_joined.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,label,fold
0,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,[Omission],split3
1,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,[],split2
2,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,[],split0
3,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,"[Omission, Stereotype]",split0
4,4769,2378,2576,Biographical / Historical,Blacker and Thomson became close friends throu...,train,[Omission],split3


Check that the counts for each fold are still relatively close:

In [24]:
df_joined.groupby("fold").size().reset_index(name="count")

Unnamed: 0,fold,count
0,split0,5462
1,split1,5462
2,split2,5462
3,split3,5462
4,split4,5464


Looks good!

Write the data to a file:

In [25]:
Path(config.tokc_path+"experiment_input/").mkdir(parents=True, exist_ok=True)
# df_joined.to_csv(config.tokc_path+"experiment_input/token_5fold.csv")
df_joined.to_csv(config.tokc_path+"experiment_input/document_5fold.csv")

### Distribution of Tags across Folds

In [26]:
# df_joined = pd.read_csv(config.tokc_path+"experiment_input/token_5fold.csv", index_col=0)
# df_joined.head()

In [33]:
df_joined_exploded = df_joined.explode("label")
df_joined_exploded.head()

Unnamed: 0,description_id,start_offset,end_offset,field,description,subset,label,fold
0,4699,1853,2066,Biographical / Historical,"Labelled Apparently some chapters, amounting t...",train,Omission,split3
1,8942,384,540,Biographical / Historical,James Aikman of Perth signed his name to a vol...,train,,split2
2,5440,5692,5850,Biographical / Historical,This piece was published in 'Milk Production i...,train,,split0
3,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,Omission,split0
3,3474,3608,8549,Biographical / Historical,Margaret Winifred Bartholomew was born on 21 A...,train,Stereotype,split0


In [40]:
# subdf = df_joined.loc[df_joined.fold == "split4"]  # split0, split1, split2, split3
# df_tag_totals = subdf.groupby('tag').size().reset_index(name='total')
# df_tag_totals = df_tag_totals.sort_values(by="total")
# df_tag_totals = df_tag_totals.reset_index()
# df_tag_totals.drop("index", axis=1)
# ---------------------------
subdf = df_joined_exploded.loc[df_joined_exploded.fold == "split4"]  # split0, split1, split2, split3
df_label_totals = subdf.groupby('label').size().reset_index(name='total')
df_label_totals = df_label_totals.sort_values(by="total")
df_label_totals = df_label_totals.reset_index()
df_label_totals.drop("index", axis=1)

Unnamed: 0,label,total
0,Stereotype,343
1,Omission,838
2,,4510


**For document classification:** All splits have relatively even numbers of each label.  On average, `Stereotype` occurs less than half as often as `Omission`.

**For token classification:** All splits have examples of every tag.  However, the `I-Gendered-Pronoun`, `I-Gendered-Role`, and `I-Generalization` tags occur very rarely, especially once split across folds.
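Rather than filtering one fold at a time, the counts for every fold can be viewed side by side with `pd.crosstab`. The sketch below uses invented rows with the same `fold` and `label` columns as the exploded data; the `"(none)"` placeholder for unlabeled rows is an assumption made for display.

```python
import pandas as pd

# Illustrative rows only: one row per (description, label) pair
exploded = pd.DataFrame({
    "fold": ["split0", "split0", "split1", "split1", "split1"],
    "label": ["Omission", "Stereotype", "Omission", None, "Omission"],
})

# Rows are folds, columns are labels, cells are counts
counts = pd.crosstab(exploded["fold"], exploded["label"].fillna("(none)"))
print(counts)
```

The same call with `tag` in place of `label` would give the per-fold tag counts for the token data.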

In [15]:
df_joined.loc[df_joined.tag == "I-Gendered-Pronoun"].head(50)
# df_joined.loc[df_joined.tag == "I-Generalization"].head(50)
# df_joined.loc[df_joined.tag == "I-Gendered-Role"].head(50)

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,fold
16904,739,875,12934,15410,.,"(8565, 8566)",.,I-Gendered-Pronoun,Biographical / Historical,split2
31251,1037,1481,8433,28576,',"(14013, 14014)",',I-Gendered-Pronoun,Scope and Contents,split4
37848,1055,1686,7005,34848,'s,"(30634, 30636)",POS,I-Gendered-Pronoun,Scope and Contents,split3
53563,1055,2285,7725,49888,',"(105945, 105946)",',I-Gendered-Pronoun,Scope and Contents,split2
58326,1059,2451,53140,54218,'s,"(16403, 16405)",POS,I-Gendered-Pronoun,Scope and Contents,split3
58327,1059,2451,20792,54218,'s,"(16403, 16405)",POS,I-Gendered-Pronoun,Scope and Contents,split3
62047,1076,2572,20971,57675,'s,"(32922, 32924)",VBZ,I-Gendered-Pronoun,Scope and Contents,split3
74292,1165,3022,11021,68977,',"(3894, 3895)",',I-Gendered-Pronoun,Scope and Contents,split1
90416,1489,3623,17135,83716,;,"(9947, 9948)",;,I-Gendered-Pronoun,Scope and Contents,split4
92540,1516,3694,19033,85679,",","(289, 290)",",",I-Gendered-Pronoun,Scope and Contents,split4


Tokens with `I-Gendered-Pronoun` tags aren't actually gendered pronouns (the examples above are punctuation marks and `'s`), so perhaps those tags are better replaced with `O` (for outside).

Tokens with `I-Gendered-Role` and `I-Generalization` tags are similar: they don't really need to be annotated.
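If these tags were replaced, the relabeling could look like the sketch below. This only illustrates the suggestion above and is not a step performed in this notebook; the token rows and the `RARE_INSIDE_TAGS` name are made up for the example.

```python
import pandas as pd

# Tags proposed for replacement with "O" (an assumption based on the note above)
RARE_INSIDE_TAGS = ["I-Gendered-Pronoun", "I-Gendered-Role", "I-Generalization"]

# Illustrative rows only
tokens = pd.DataFrame({
    "token": ["He", "'s", "was", "wife", "."],
    "tag": ["B-Gendered-Pronoun", "I-Gendered-Pronoun", "O",
            "B-Gendered-Role", "I-Gendered-Role"],
})

# Work on a copy so the original tags remain available for comparison
relabeled = tokens.copy()
relabeled["tag"] = relabeled["tag"].replace(RARE_INSIDE_TAGS, "O")
print(relabeled)
```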

<a id="a"></a>
## Appendix: Initial Approach

Split the data **40-40-20**:

In [None]:
train_frac, val_frac = 0.4, 0.4
train_size, validat_size, test_size = utils.getTrainValTestSizes(shuffled_sents, train_frac, val_frac)
shuffled_sents_splits = utils.assignSubsets(shuffled_sents, train_size, validat_size, test_size)
shuffled_sents_splits.head()
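The helpers `utils.getTrainValTestSizes` and `utils.assignSubsets` are not shown in this notebook. A rough sketch of what such helpers might do, assuming the test set takes whatever rows remain after the train and validation fractions are truncated to whole rows, is given below. The function names are hypothetical; the subset names `train`, `dev`, and `test` match the values filtered on in the following cells.

```python
def train_val_test_sizes(n_rows, train_frac, val_frac):
    """Return (train, validation, test) row counts; test takes the remainder."""
    train_size = int(n_rows * train_frac)
    val_size = int(n_rows * val_frac)
    test_size = n_rows - train_size - val_size
    return train_size, val_size, test_size

def subset_labels(train_size, val_size, test_size):
    """Return one subset name per row of an already shuffled table."""
    return ["train"] * train_size + ["dev"] * val_size + ["test"] * test_size

sizes = train_val_test_sizes(10, 0.4, 0.4)
labels = subset_labels(*sizes)
print(sizes)   # (4, 4, 2)
print(labels)
```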

**Train Data:**

In [10]:
df_train = df_joined.loc[df_joined.subset == "train"]
print(df_train.shape)
df_train.head()

(308583, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
32,2,2,99999,16,Scope,"(77, 82)",NN,O,Scope and Contents,train
33,2,2,99999,17,and,"(83, 86)",CC,O,Scope and Contents,train
34,2,2,99999,18,Contents,"(87, 95)",NNS,O,Scope and Contents,train
35,2,2,99999,19,:,"(95, 96)",:,O,Scope and Contents,train
36,2,2,99999,20,Sermons,"(97, 104)",NNS,O,Scope and Contents,train


In [11]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_train.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,15
1,I-Gendered-Role,213
2,I-Generalization,295
3,B-Stereotype,478
4,B-Generalization,500
5,B-Feminine,566
6,B-Gendered-Role,1059
7,B-Occupation,1177
8,I-Feminine,1243
9,I-Occupation,1371


**Validation Data:**

In [12]:
df_validate = df_joined.loc[df_joined.subset == "dev"]
print(df_validate.shape)
df_validate.head()

(316721, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,dev
4,1,1,99999,4,:,"(22, 23)",:,O,Title,dev
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,dev
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,dev
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,dev


In [13]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_validate.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Nonbinary,1
1,B-Nonbinary,1
2,I-Gendered-Pronoun,25
3,I-Gendered-Role,282
4,I-Generalization,300
5,B-Generalization,525
6,B-Stereotype,533
7,B-Feminine,597
8,B-Gendered-Role,1205
9,B-Occupation,1305


**Test Data:**

In [14]:
df_test = df_joined.loc[df_joined.subset == "test"]
print(df_test.shape)
df_test.head()

(153966, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier,test
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier,test
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier,test
152,3,4,14377,134,He,"(789, 791)",PRP,B-Gendered-Pronoun,Biographical / Historical,test
153,3,4,99999,135,was,"(792, 795)",VBD,O,Biographical / Historical,test


In [15]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
df_tag_totals = df_test.groupby('tag').size().reset_index(name='total')
df_tag_totals = df_tag_totals.sort_values(by="total")
df_tag_totals = df_tag_totals.reset_index()
df_tag_totals.drop("index", axis=1)

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,14
1,I-Gendered-Role,116
2,I-Generalization,183
3,B-Stereotype,240
4,B-Generalization,258
5,B-Feminine,298
6,B-Occupation,474
7,B-Gendered-Role,517
8,I-Occupation,565
9,I-Feminine,696


Write the data splits to a file:

In [16]:
Path(config.tokc_path+"experiment_input/").mkdir(parents=True, exist_ok=True)

In [17]:
df_train.to_csv(config.tokc_path+"experiment_input/token_train_40.csv")
df_validate.to_csv(config.tokc_path+"experiment_input/token_validate_40.csv")
df_test.to_csv(config.tokc_path+"experiment_input/token_test_20.csv")