# Splitting Data for Experiments with Gender Biased Language Classifiers

* **Supervised learning**
    * Source data: `../data/token_clf_data/`
    * Output data: train, validate (a.k.a. development), and test splits under `../data/token_clf_data/experiment_input/`
* **Multilabel classification**
    * 3 categories of labels:
        1. *Person Name:* Unknown, Non-binary,* Feminine, Masculine
        2. *Linguistic:* Generalization, Gendered Pronoun, Gendered Role
        3. *Contextual:* Empowering,* Occupation, Omission, Stereotype

*Annotators did not find text on which to apply these labels during the manual annotation process!

***

Import libraries and load data:

In [1]:
# For custom functions and for paths
import utils, config

# For working with data files and directories
import pandas as pd
from pathlib import Path

Split the data **40-40-20** into train, validation (devtest), and test sets, with the final 20% reserved as a blind test set. The experiments run three classifiers sequentially, each providing features for the next, so the train and devtest sets swap roles between classifiers: the first classifier's train set becomes the second classifier's devtest set, and the first classifier's devtest set becomes the second classifier's train set. The sets swap once more for the third classifier, which uses the original train set as its train set and the original devtest set as its devtest set.
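
The rotation of roles across the three classifiers can be sketched as follows. This is only an illustration: the function name `rotate_roles` and the dictionary keys are hypothetical, and the split names `"train"` and `"dev"` are borrowed from the subset labels used later in this notebook.

```python
# Hypothetical illustration of the train/devtest rotation described above.
def rotate_roles(n_classifiers=3):
    """Return which split each classifier trains on and which it uses as devtest."""
    roles = ("train", "dev")  # original split names
    schedule = []
    for i in range(n_classifiers):
        schedule.append(
            {"classifier": i + 1, "train_on": roles[0], "devtest_on": roles[1]}
        )
        roles = (roles[1], roles[0])  # swap roles for the next classifier
    return schedule


schedule = rotate_roles()
for step in schedule:
    print(step)
```

The blind test split is never part of the rotation, so it stays unseen by all three classifiers.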

When randomly splitting data, balance the number of metadata field types (Title, Scope and Contents, Biographical / Historical, and Processing Information) across the train, validation, and test splits.
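
One possible way to balance field types is to shuffle and split the sentences separately within each field. The sketch below is not taken from `utils` (whose implementation is not shown in this notebook); the function `stratified_assign`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def stratified_assign(sent_fields, train_frac=0.4, val_frac=0.4, seed=0):
    """Assign sentences to train/dev/test separately within each field (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    parts = []
    for _, group in sent_fields.groupby("field"):
        # Shuffle the sentences of this field only
        shuffled = group.iloc[rng.permutation(len(group))].copy()
        n_train = int(len(shuffled) * train_frac)
        n_val = int(len(shuffled) * val_frac)
        labels = ["train"] * n_train + ["dev"] * n_val
        labels += ["test"] * (len(shuffled) - len(labels))  # remainder goes to test
        shuffled["subset"] = labels
        parts.append(shuffled)
    return pd.concat(parts)


# Toy data: 10 sentences for each of two field types
toy = pd.DataFrame({
    "sentence_id": range(20),
    "field": ["Title"] * 10 + ["Scope and Contents"] * 10,
})
toy_splits = stratified_assign(toy)
print(toy_splits.groupby(["field", "subset"]).size())
```

Because the fractions are applied per field, each field contributes to the three splits in roughly the same 40-40-20 proportion, up to rounding.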

First, implode the data, grouping tokens by their sentence. Then, after the sentences are split into train, validation, and test sets, explode the data so that each row again has one token.
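
As a minimal sketch of the implode/explode idea on toy data (the column names follow the token data used in this notebook, but the values and the hand-assigned subsets are made up), assuming pandas 1.3 or later for multi-column `explode`:

```python
import pandas as pd

toy_tokens = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1, 1],
    "token": ["Identifier", ":", "Papers", "of", "The"],
    "tag": ["O", "O", "O", "O", "B-Unknown"],
})

# Implode: one row per sentence, with tokens and tags collected into lists
imploded = toy_tokens.groupby("sentence_id").agg(list).reset_index()

# Assign each sentence to a split (done by hand here, purely for illustration)
imploded["subset"] = ["test", "dev"]

# Explode: back to one row per token, each carrying its sentence's subset
exploded = imploded.explode(["token", "tag"], ignore_index=True)
print(exploded)
```

Assigning the subset at the sentence level guarantees that all tokens of a sentence end up in the same split.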

In [2]:
df_descs = pd.read_csv(config.crc_meta_path+"annot_descs.csv")
df_descs = df_descs[["description_id", "field"]]
df_descs.head()

Unnamed: 0,description_id,field
0,0,Identifier
1,1,Title
2,2,Scope and Contents
3,3,Biographical / Historical
4,4,Identifier


In [3]:
df = pd.read_csv(config.tokc_path+"model_input/all_token_data.csv", index_col=0)
df = df.join(df_descs.set_index("description_id"), on="description_id", how="left")
df.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title
4,1,1,99999,4,:,"(22, 23)",:,O,Title


Shuffle the sentence IDs randomly and then assign them to train, validation, and (blind) test set splits of the data for creating and evaluating token classifiers:

In [4]:
sents = df[["sentence_id"]]
sents = sents.drop_duplicates()
shuffled_sents = utils.shuffleDataFrame(sents)

In [5]:
train_frac = 0.4
val_frac = 0.4
train_size, validat_size, test_size = utils.getTrainValTestSizes(shuffled_sents, train_frac, val_frac)
shuffled_sents_splits = utils.assignSubsets(shuffled_sents, train_size, validat_size, test_size)
shuffled_sents_splits.head()

Unnamed: 0,subset,sentence_id
775304,train,41770
184875,train,7756
347077,train,15662
47939,train,2041
595345,train,31629
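
The `utils` helpers called above are custom functions whose code is not shown in this notebook. The stand-ins below are a hedged guess at what such helpers might do; the names `shuffle_rows`, `split_sizes`, and `assign_subsets` are hypothetical and are not the real `utils` API.

```python
import pandas as pd


def shuffle_rows(frame, seed=None):
    """Return the rows of `frame` in random order (hypothetical stand-in)."""
    return frame.sample(frac=1, random_state=seed)


def split_sizes(frame, train_frac, val_frac):
    """Compute row counts for train, validation, and test; test takes the remainder."""
    n = len(frame)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return n_train, n_val, n - n_train - n_val


def assign_subsets(frame, n_train, n_val, n_test):
    """Label rows in their current order as train, then dev, then test."""
    out = frame.copy()
    out.insert(0, "subset", ["train"] * n_train + ["dev"] * n_val + ["test"] * n_test)
    return out


toy_sents = pd.DataFrame({"sentence_id": range(10)})
toy_shuffled = shuffle_rows(toy_sents, seed=42)
sizes = split_sizes(toy_shuffled, 0.4, 0.4)
toy_assigned = assign_subsets(toy_shuffled, *sizes)
print(toy_assigned.groupby("subset").size())
```

Passing a fixed `seed` makes the shuffle reproducible, which helps when the splits need to be regenerated later.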


In [6]:
shuffled_sents_splits.groupby("subset").size().reset_index(name="subset_count")  # Looks good

Unnamed: 0,subset,subset_count
0,dev,16812
1,test,8406
2,train,16812


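Looks good: the dev and train subsets are the same size (16,812 sentences each), and the test subset is half that size (8,406), which matches the 40-40-20 split.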

Join the subset data with the token data:

In [7]:
df_joined = df.join(shuffled_sents_splits.set_index("sentence_id"), on="sentence_id", how="left")
df_joined.head()

Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier,test
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier,test
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier,test
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,dev
4,1,1,99999,4,:,"(22, 23)",:,O,Title,dev


In [9]:
# df_joined.loc[df_joined.sentence_id == 4]

Looks good!
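
A visual check of a few rows can be complemented with an automated one. The sketch below, using a hypothetical helper `check_split_integrity` and toy data, counts tokens left without a subset after the join and sentences whose tokens ended up in more than one subset; both counts should be zero.

```python
import pandas as pd


def check_split_integrity(tokens):
    """Return (n_unassigned_tokens, n_sentences_in_multiple_subsets)."""
    n_unassigned = int(tokens["subset"].isna().sum())
    subsets_per_sentence = tokens.groupby("sentence_id")["subset"].nunique()
    n_leaking = int((subsets_per_sentence > 1).sum())
    return n_unassigned, n_leaking


toy_joined = pd.DataFrame({
    "sentence_id": [0, 0, 1, 1, 2],
    "token": ["Identifier", ":", "Title", ":", "Sermons"],
    "subset": ["test", "test", "dev", "dev", "train"],
})
print(check_split_integrity(toy_joined))
```

On the real data, the same function could be called as `check_split_integrity(df_joined)`, assuming `df_joined` has the `sentence_id` and `subset` columns shown above.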

**Train Data:**

In [10]:
df_train = df_joined.loc[df_joined.subset == "train"]
print(df_train.shape)
df_train.head()

(308583, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
32,2,2,99999,16,Scope,"(77, 82)",NN,O,Scope and Contents,train
33,2,2,99999,17,and,"(83, 86)",CC,O,Scope and Contents,train
34,2,2,99999,18,Contents,"(87, 95)",NNS,O,Scope and Contents,train
35,2,2,99999,19,:,"(95, 96)",:,O,Scope and Contents,train
36,2,2,99999,20,Sermons,"(97, 104)",NNS,O,Scope and Contents,train


In [11]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
# Count tokens per tag, sorted from least to most frequent
df_tag_totals = df_train.groupby("tag").size().reset_index(name="total")
df_tag_totals = df_tag_totals.sort_values(by="total").reset_index(drop=True)
df_tag_totals

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,15
1,I-Gendered-Role,213
2,I-Generalization,295
3,B-Stereotype,478
4,B-Generalization,500
5,B-Feminine,566
6,B-Gendered-Role,1059
7,B-Occupation,1177
8,I-Feminine,1243
9,I-Occupation,1371


**Validation Data:**

In [12]:
df_validate = df_joined.loc[df_joined.subset == "dev"]
print(df_validate.shape)
df_validate.head()

(316721, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
3,1,1,99999,3,Title,"(17, 22)",NN,O,Title,dev
4,1,1,99999,4,:,"(22, 23)",:,O,Title,dev
5,1,1,99999,5,Papers,"(24, 30)",NNS,O,Title,dev
6,1,1,99999,6,of,"(31, 33)",IN,O,Title,dev
7,1,1,14384,7,The,"(34, 37)",DT,B-Unknown,Title,dev


In [13]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
# Count tokens per tag, sorted from least to most frequent
df_tag_totals = df_validate.groupby("tag").size().reset_index(name="total")
df_tag_totals = df_tag_totals.sort_values(by="total").reset_index(drop=True)
df_tag_totals

Unnamed: 0,tag,total
0,I-Nonbinary,1
1,B-Nonbinary,1
2,I-Gendered-Pronoun,25
3,I-Gendered-Role,282
4,I-Generalization,300
5,B-Generalization,525
6,B-Stereotype,533
7,B-Feminine,597
8,B-Gendered-Role,1205
9,B-Occupation,1305


**Test Data:**

In [14]:
df_test = df_joined.loc[df_joined.subset == "test"]
print(df_test.shape)
df_test.head()

(153966, 10)


Unnamed: 0,description_id,sentence_id,ann_id,token_id,token,token_offsets,pos,tag,field,subset
0,0,0,99999,0,Identifier,"(0, 10)",NN,O,Identifier,test
1,0,0,99999,1,:,"(10, 11)",:,O,Identifier,test
2,0,0,99999,2,AA5,"(12, 15)",NN,O,Identifier,test
152,3,4,14377,134,He,"(789, 791)",PRP,B-Gendered-Pronoun,Biographical / Historical,test
153,3,4,99999,135,was,"(792, 795)",VBD,O,Biographical / Historical,test


In [15]:
# Reference: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2
# Count tokens per tag, sorted from least to most frequent
df_tag_totals = df_test.groupby("tag").size().reset_index(name="total")
df_tag_totals = df_tag_totals.sort_values(by="total").reset_index(drop=True)
df_tag_totals

Unnamed: 0,tag,total
0,I-Gendered-Pronoun,14
1,I-Gendered-Role,116
2,I-Generalization,183
3,B-Stereotype,240
4,B-Generalization,258
5,B-Feminine,298
6,B-Occupation,474
7,B-Gendered-Role,517
8,I-Occupation,565
9,I-Feminine,696
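
To compare tag counts across the three splits side by side rather than in three separate tables, a cross-tabulation is one option. The sketch below uses made-up toy data; on the real data the call would presumably be `pd.crosstab(df_joined["tag"], df_joined["subset"])`.

```python
import pandas as pd

toy = pd.DataFrame({
    "tag": ["O", "O", "B-Feminine", "O", "B-Occupation", "O"],
    "subset": ["train", "train", "train", "dev", "dev", "test"],
})

# Rows are tags, columns are subsets, cells are token counts
tag_by_subset = pd.crosstab(toy["tag"], toy["subset"])
print(tag_by_subset)
```

Tags that are absent from a split appear as zeros, which makes rare labels easy to spot.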


Write each data split to its own CSV file:

In [16]:
Path(config.tokc_path+"experiment_input/").mkdir(parents=True, exist_ok=True)

In [17]:
df_train.to_csv(config.tokc_path+"experiment_input/token_train.csv")
df_validate.to_csv(config.tokc_path+"experiment_input/token_validate.csv")
df_test.to_csv(config.tokc_path+"experiment_input/token_test.csv")