# Data Preparation
Data source: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data?select=all_data.csv

In this notebook, we prepare the data downloaded from Kaggle and export the data subsets that we will feed into our models. Preparation consists of cleaning the data, extracting a subset for each identity category, and splitting each identity subset into train, validation, and test sets via stratified sampling.

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

## Load the data:
The Kaggle competition for this dataset shipped separate CSV files for its train and test subsets. Since the competition has ended, `all_data.csv` was released with labels for both the train and test sets. We therefore use `all_data.csv` as our starting dataset.

In [2]:
all_data_df = pd.read_csv('data/all_data.csv')

## Clean the data

EDA revealed that some rows have a missing value for `comment_text`. What do these rows look like?

In [3]:
all_data_df[pd.isna(all_data_df["comment_text"])]

Unnamed: 0,id,comment_text,split,created_date,publication_id,parent_id,article_id,rating,funny,wow,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
446630,392337,,train,2016-07-18 19:34:48.278774+00,13,392165.0,141670,approved,0,0,...,,,,,,,,,0,4
869804,872115,,train,2017-01-21 02:04:30.064452+00,54,872109.0,163140,approved,5,0,...,,,,,,,,,0,4
1556982,5971919,,train,2017-09-18 02:40:48.161601+00,13,5971615.0,378393,approved,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4
1567442,5353666,,train,2017-06-04 02:48:07.950238+00,13,5352881.0,340316,approved,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4


### Delete the rows with missing comments
These rows have no input text to feed into a model, so they are unusable and we remove them from our dataset.

In [4]:
all_data_df_cleansed = all_data_df.copy().drop(index=all_data_df[pd.isna(all_data_df['comment_text'])].index)

We can see that the cleansed dataset has four fewer rows than the original:

In [5]:
all_data_df.shape

(1999516, 46)

In [6]:
all_data_df_cleansed.shape

(1999512, 46)

## Basic feature engineering

### Add `toxicity_binary` column

In [7]:
all_data_df_cleansed['toxicity_binary'] = (all_data_df_cleansed['toxicity'] >= 0.5).astype(int)

In [8]:
all_data_df_cleansed[['toxicity','toxicity_binary']]

Unnamed: 0,toxicity,toxicity_binary
0,0.373134,0
1,0.605263,1
2,0.666667,1
3,0.815789,1
4,0.550000,1
...,...,...
1999511,0.400000,0
1999512,0.400000,0
1999513,0.400000,0
1999514,0.400000,0
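
The thresholding rule can be checked on a few values around the cutoff. This is a minimal sketch on a hypothetical toy frame (`toy_df` and its scores are made up for illustration); it only shows that a score of exactly 0.5 maps to 1 because the comparison is `>=`.

```python
import pandas as pd

# Hypothetical scores chosen to straddle the 0.5 threshold.
toy_df = pd.DataFrame({"toxicity": [0.373134, 0.5, 0.499999, 0.815789]})

# Scores at or above 0.5 become 1, everything else becomes 0.
toy_df["toxicity_binary"] = (toy_df["toxicity"] >= 0.5).astype(int)

print(toy_df)
```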


Move the new column towards the front of the dataframe:

In [9]:
orig_cols = all_data_df_cleansed.columns.tolist()
reordered_cols = orig_cols[:2] + orig_cols[-1:] + orig_cols[2:-1]
all_data_df_cleansed = all_data_df_cleansed[reordered_cols]
all_data_df_cleansed.head()

Unnamed: 0,id,comment_text,toxicity_binary,split,created_date,publication_id,parent_id,article_id,rating,funny,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,1083994,He got his money... now he lies in wait till a...,0,train,2017-03-06 15:21:53.675241+00,21,,317120,approved,0,...,,,,,,,,,0,67
1,650904,Mad dog will surely put the liberals in mental...,1,train,2016-12-02 16:44:21.329535+00,21,,154086,approved,0,...,,,,,,,,,0,76
2,5902188,And Trump continues his lifelong cowardice by ...,1,train,2017-09-05 19:05:32.341360+00,55,,374342,approved,1,...,,,,,,,,,0,63
3,7084460,"""while arresting a man for resisting arrest"".\...",1,test,2016-11-01 16:53:33.561631+00,13,,149218,approved,0,...,,,,,,,,,0,76
4,5410943,Tucker and Paul are both total bad ass mofo's.,1,train,2017-06-14 05:08:21.997315+00,21,,344096,approved,0,...,,,,,,,,,0,80
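
As an alternative to rebuilding the column list by slicing, a column can be moved with `pop` and `insert`. The sketch below uses a hypothetical four-column toy frame rather than the real dataset, and it modifies the frame in place instead of creating a reordered copy.

```python
import pandas as pd

# Hypothetical toy frame with the new column at the end, as after In [7].
toy_df = pd.DataFrame({
    "id": [1, 2],
    "comment_text": ["first comment", "second comment"],
    "toxicity": [0.2, 0.9],
    "toxicity_binary": [0, 1],
})

# Remove the column, then re-insert it at position 2 (right after comment_text).
toy_df.insert(2, "toxicity_binary", toy_df.pop("toxicity_binary"))

print(toy_df.columns.tolist())
```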


## Drop the columns we won't be using

In [10]:
all_data_df_cleansed = all_data_df_cleansed.drop(columns=['id', 'split', 'created_date', 'publication_id',
       'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes',
       'disagree', 'severe_toxicity', 'obscene', 'sexual_explicit',
       'identity_attack', 'insult', 'threat', 'identity_annotator_count',
       'toxicity_annotator_count'])
all_data_df_cleansed.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,other_religion,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability
0,He got his money... now he lies in wait till a...,0,0.373134,,,,,,,,...,,,,,,,,,,
1,Mad dog will surely put the liberals in mental...,1,0.605263,,,,,,,,...,,,,,,,,,,
2,And Trump continues his lifelong cowardice by ...,1,0.666667,,,,,,,,...,,,,,,,,,,
3,"""while arresting a man for resisting arrest"".\...",1,0.815789,,,,,,,,...,,,,,,,,,,
4,Tucker and Paul are both total bad ass mofo's.,1,0.55,,,,,,,,...,,,,,,,,,,


# Prepare Disability Subset

## Create disability subset

In [11]:
disability_df = all_data_df_cleansed[(all_data_df_cleansed["physical_disability"] > 0) | 
           (all_data_df_cleansed["intellectual_or_learning_disability"] > 0) | 
           (all_data_df_cleansed["psychiatric_or_mental_illness"] > 0) | 
           (all_data_df_cleansed["other_disability"] > 0)]

In [12]:
disability_df.shape

(18665, 27)
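
The same "any subtype column is positive" filter can be written more compactly with `gt(0).any(axis=1)`. This is a sketch on a hypothetical toy frame; note that rows whose identity columns are all `NaN` (comments without identity annotations) compare as `False` and are therefore excluded, just as with the chained `|` conditions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: no mention, physical only, unannotated (NaN), psychiatric only.
toy_df = pd.DataFrame({
    "physical_disability": [0.0, 0.25, np.nan, 0.0],
    "intellectual_or_learning_disability": [0.0, 0.0, np.nan, 0.0],
    "psychiatric_or_mental_illness": [0.0, 0.0, np.nan, 1.0],
    "other_disability": [0.0, 0.0, np.nan, 0.0],
})
subtype_cols = list(toy_df.columns)

# True where at least one subtype column is strictly positive; NaN > 0 is False.
mask = toy_df[subtype_cols].gt(0).any(axis=1)
subset_df = toy_df[mask]

print(subset_df)
```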

## Add disability subtype column
We'll add a categorical feature that specifies which of the following disability subtypes each comment corresponds to:

- `physical_disability`
- `intellectual_or_learning_disability`
- `psychiatric_or_mental_illness`
- `other_disability`

EDA revealed that some comments have nonzero values for more than one of the subtypes above. Since the purpose of this column is to facilitate stratified sampling, each comment is labeled with the subtype that has the largest value for that comment (`idxmax` resolves ties in favor of the first column listed).
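
To make the labeling rule concrete, here is a small sketch of `idxmax(axis=1)` on hypothetical scores (the values are invented for illustration). The second row has a tie, which shows that the first column in the list wins.

```python
import pandas as pd

# Hypothetical subtype scores for three comments.
subtype_scores = pd.DataFrame({
    "physical_disability": [0.75, 0.5, 0.0],
    "intellectual_or_learning_disability": [0.0, 0.5, 0.0],
    "psychiatric_or_mental_illness": [1.0, 0.0, 0.0],
    "other_disability": [0.0, 0.0, 0.2],
})

# For each row, return the name of the column holding the largest value.
labels = subtype_scores.idxmax(axis=1)

print(labels)
```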

In [13]:
disability_subtypes_df = disability_df[['physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness','other_disability']]
disability_df = disability_df.assign(disability_subtype=disability_subtypes_df.idxmax(axis=1))
disability_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,disability_subtype
7705,No sympathy for these two knuckleheads.,1,0.689655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,physical_disability
8073,Wow!\nYour progressive psychosis has become ex...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,psychiatric_or_mental_illness
8115,"Or.... maybe there IS chaos because the ""presi...",1,0.790323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,psychiatric_or_mental_illness
8125,I'll take someone who's physically ill over on...,0,0.352941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.75,0.0,1.0,0.0,psychiatric_or_mental_illness
8263,"Mental Illness at work again, again, again, ag...",1,0.842857,0.25,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,psychiatric_or_mental_illness


### Prepare data splits for disability
**Overview:** We'll split the disability subset into roughly 70% train, 10% validation, and 20% test sets. Comments may differ subtly by disability subtype (e.g. comments about physical disability may differ from comments about intellectual/learning disability). We therefore stratify on disability subtype so that every split has approximately the same subtype proportions.

**Disability+Non-Disability Interweaving Technique:** Every split should be stratified. We'll divide the disability subset into 3 sets stratified on disability subtype, and divide a non-disability subset (e.g. gender) into 3 sets stratified on its own subtype. During fine-tuning we'll alternate between disability and non-disability splits (i.e. train on disability split 1, then gender split 1, then disability split 2, then gender split 2, etc.). This requires 3 training sets and 3 validation sets for disability, all stratified on disability subtype.
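
The alternating order described above can be written down as a simple schedule. This is only a sketch of the ordering: `interleaved_schedule` is a hypothetical helper, and the actual fine-tuning loop is not part of this notebook.

```python
from itertools import chain


def interleaved_schedule(primary, secondary, n_splits=3):
    """Return (subset_name, split_number) pairs, alternating primary then secondary."""
    return list(chain.from_iterable(
        ((primary, i), (secondary, i)) for i in range(1, n_splits + 1)
    ))


schedule = interleaved_schedule("disability", "gender")
for subset_name, split_number in schedule:
    print(f"train on {subset_name} split {split_number}")
```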

**Splitting Method:** To prepare the splits described above, we'll use scikit-learn's `train_test_split()`. Since it only produces two splits per call, we create all of the dataset splits in the following steps:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the set from step 1 that combines train and validation, and split it into group1: 2/3 and group2: 1/3.
1. Take the set from step 3 that was 2/3 and split it into half.
1. Now we have 3 equal splits from step 3 and step 4.
1. For each of the 3 splits, create a train and validation split, keeping 6/7 for train and holding out 1/7 for validation. Relative to the whole subset this gives 0.8 × 6/7 ≈ 68.6% train and 0.8 × 1/7 ≈ 11.4% validation, which is close to the 70%/10% target (an exact 70%/10% split would hold out 1/8 instead).
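
The steps above can be collected into one reusable function. This is a sketch under stated assumptions, not the notebook's own code: `three_way_stratified_splits`, the `subtype` column, and the random toy data are all hypothetical, while the fractions and the seed (266) mirror the cells below.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_stratified_splits(df, strat_col, test_size=0.2, val_size=1/7, seed=266):
    """Return ([(train, val), (train, val), (train, val)], test), all stratified."""
    def split(frame, size):
        return train_test_split(frame, test_size=size, random_state=seed,
                                stratify=frame[strat_col])

    combined, test = split(df, test_size)      # steps 1-2: hold out the test set
    split12, split3 = split(combined, 1/3)     # step 3: 2/3 and 1/3
    split1, split2 = split(split12, 0.5)       # step 4: halve the 2/3 part
    folds = [split(part, val_size) for part in (split1, split2, split3)]  # step 6
    return folds, test


# Hypothetical toy data with three imbalanced subtypes.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(2100)],
    "subtype": rng.choice(["a", "b", "c"], size=2100, p=[0.6, 0.25, 0.15]),
})

folds, test_df = three_way_stratified_splits(toy_df, "subtype")
for number, (train_df, val_df) in enumerate(folds, start=1):
    print(f"split {number}: train={len(train_df)}, val={len(val_df)}")
print("test:", len(test_df))
```

Using one helper for every identity subset would avoid repeating the six `train_test_split` calls per subset, at the cost of hiding the intermediate frames.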

In [14]:
# Split into 80% combined for train and val, and 20% test
disability_combined_df, disability_test_df = train_test_split(disability_df,
                                       test_size=0.2,
                                       random_state=266, stratify=disability_df['disability_subtype'])

# Split the 80% train and val into: 2/3 and 1/3
disability_combined_split12_df, disability_combined_split3_df = train_test_split(disability_combined_df,
                                       test_size=1/3,
                                       random_state=266, stratify=disability_combined_df['disability_subtype'])

# Split the 2/3 train and val into half
disability_combined_split1_df, disability_combined_split2_df = train_test_split(disability_combined_split12_df,
                                       test_size=0.5,
                                       random_state=266, stratify=disability_combined_split12_df['disability_subtype'])

# Split #1: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
disability_split1_train_df, disability_split1_val_df = train_test_split(disability_combined_split1_df,
                                       test_size=1/7,
                                       random_state=266, stratify=disability_combined_split1_df['disability_subtype'])

# Split #2: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
disability_split2_train_df, disability_split2_val_df = train_test_split(disability_combined_split2_df,
                                       test_size=1/7,
                                       random_state=266, stratify=disability_combined_split2_df['disability_subtype'])

# Split #3: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
disability_split3_train_df, disability_split3_val_df = train_test_split(disability_combined_split3_df,
                                       test_size=1/7,
                                       random_state=266, stratify=disability_combined_split3_df['disability_subtype'])

In [15]:
print('\nHow big is each disability split?\n')
print('len(disability_split1_train_df): ', len(disability_split1_train_df))
print('len(disability_split2_train_df): ', len(disability_split2_train_df))
print('len(disability_split3_train_df): ', len(disability_split3_train_df))

print('len(disability_split1_val_df): ', len(disability_split1_val_df))
print('len(disability_split2_val_df): ', len(disability_split2_val_df))
print('len(disability_split3_val_df): ', len(disability_split3_val_df))

print('total disability train: ', len(disability_split1_train_df)+len(disability_split2_train_df)+len(disability_split3_train_df))
print('total disability val: ', len(disability_split1_val_df)+len(disability_split2_val_df)+len(disability_split3_val_df))
print('len(disability_test_df): ', len(disability_test_df))


How big is each disability split?

len(disability_split1_train_df):  4266
len(disability_split2_train_df):  4266
len(disability_split3_train_df):  4266
len(disability_split1_val_df):  711
len(disability_split2_val_df):  711
len(disability_split3_val_df):  712
total disability train:  12798
total disability val:  2134
len(disability_test_df):  3733


In [16]:
disability_train_df = pd.concat([disability_split1_train_df, disability_split2_train_df, disability_split3_train_df], axis=0)
disability_val_df = pd.concat([disability_split1_val_df, disability_split2_val_df, disability_split3_val_df], axis=0)
disability_full_df = pd.concat([disability_train_df, disability_val_df, disability_test_df], axis=0)

In [17]:
print('len(disability_train_df): ', len(disability_train_df))
print('len(disability_val_df): ', len(disability_val_df))
print('len(disability_test_df): ', len(disability_test_df))
print('len(disability_full_df): ', len(disability_full_df))

len(disability_train_df):  12798
len(disability_val_df):  2134
len(disability_test_df):  3733
len(disability_full_df):  18665


In [18]:
print('\nStratified Sampling Sanity Check for Disability')
disability_train_phys = (disability_split1_train_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_train_intel = (disability_split1_train_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_train_psych = (disability_split1_train_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_train_other = (disability_split1_train_df['disability_subtype']=='other_disability').astype(int).sum()
disability_train_total = len(disability_split1_train_df['disability_subtype'])
print('\nSplit 1 disability Train')
print('=====================')
print('phys:\t', disability_train_phys/disability_train_total)
print('intel:\t', disability_train_intel/disability_train_total)
print('psych:\t', disability_train_psych/disability_train_total)
print('other_disability:\t', disability_train_other/disability_train_total)

disability_val_phys = (disability_split1_val_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_val_intel = (disability_split1_val_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_val_psych = (disability_split1_val_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_val_other = (disability_split1_val_df['disability_subtype']=='other_disability').astype(int).sum()
disability_val_total = len(disability_split1_val_df['disability_subtype'])
print('\nSplit 1 disability Val')
print('=====================')
print('phys:\t', disability_val_phys/disability_val_total)
print('intel:\t', disability_val_intel/disability_val_total)
print('psych:\t', disability_val_psych/disability_val_total)
print('other_disability:\t', disability_val_other/disability_val_total)


Stratified Sampling Sanity Check for Disability

Split 1 disability Train
phys:	 0.15002344116268168
intel:	 0.11509610876699485
psych:	 0.5928270042194093
other_disability:	 0.1420534458509142

Split 1 disability Val
phys:	 0.15049226441631505
intel:	 0.11533052039381153
psych:	 0.5921237693389592
other_disability:	 0.1420534458509142
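
The same sanity check can be expressed with `value_counts(normalize=True)`, which avoids one counter variable per subtype. The sketch below runs on hypothetical toy label columns; `subtype_proportions` is an illustrative helper, not part of the notebook.

```python
import pandas as pd


def subtype_proportions(df, subtype_col):
    """Share of rows per subtype label, sorted by label for easy comparison."""
    return df[subtype_col].value_counts(normalize=True).sort_index()


# Hypothetical train and val label columns with matching proportions.
toy_train = pd.DataFrame({"disability_subtype":
    ["psychiatric_or_mental_illness"] * 6 + ["physical_disability"] * 2 + ["other_disability"] * 2})
toy_val = pd.DataFrame({"disability_subtype":
    ["psychiatric_or_mental_illness"] * 3 + ["physical_disability"] * 1 + ["other_disability"] * 1})

# One column per split, one row per subtype.
comparison = pd.concat(
    {"train": subtype_proportions(toy_train, "disability_subtype"),
     "val": subtype_proportions(toy_val, "disability_subtype")},
    axis=1,
)
print(comparison)
```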


### Export disability split datasets into csv

In [19]:
disability_df.to_csv('data/disability-dataset-full.csv')
disability_train_df.to_csv('data/disability-dataset-train.csv')
disability_val_df.to_csv('data/disability-dataset-val.csv')
disability_test_df.to_csv('data/disability-dataset-test.csv')

disability_split1_train_df.to_csv('data/disability-dataset-split1-train.csv')
disability_split2_train_df.to_csv('data/disability-dataset-split2-train.csv')
disability_split3_train_df.to_csv('data/disability-dataset-split3-train.csv')

disability_split1_val_df.to_csv('data/disability-dataset-split1-val.csv')
disability_split2_val_df.to_csv('data/disability-dataset-split2-val.csv')
disability_split3_val_df.to_csv('data/disability-dataset-split3-val.csv')

### Address Data Imbalance Between Disability and Non-Disability Subsets
All of the non-disability identities have many more records than the disability subset. For the interweaving technique, we want the disability and non-disability subsets to be balanced, so that what is learned from the disability subset is not overpowered by training on more non-disability examples. We therefore do stratified undersampling of the non-disability subsets so that they are about the same size as the disability subset. (Since disability is our focus and we only add other identity groups to help with predicting disability-related comments, we're okay with discarding data for the other identity groups.)

Capture the number of samples in the disability train, val, and test sets, to be used when undersampling the non-disability identities:

In [20]:
num_disability_train_samples = len(disability_train_df)
num_disability_val_samples = len(disability_val_df)
num_disability_test_samples = len(disability_test_df)
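
Stratified undersampling can be done in a single `train_test_split` call by passing absolute row counts: integer `train_size` and `test_size` values are interpreted as numbers of samples, and rows that are not drawn are simply left out. The sketch below uses hypothetical toy data and target sizes (160 and 40) in place of the real counts captured above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical larger identity subset (1000 rows, two subtypes).
rng = np.random.default_rng(0)
big_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(1000)],
    "gender_subtype": rng.choice(["male", "female"], size=1000, p=[0.55, 0.45]),
})

# Hypothetical target sizes standing in for the disability set sizes.
target_combined, target_test = 160, 40

# Integer sizes are absolute counts; the remaining 800 rows are discarded.
combined_df, test_df = train_test_split(
    big_df,
    train_size=target_combined,
    test_size=target_test,
    random_state=266,
    stratify=big_df["gender_subtype"],
)

print(len(combined_df), len(test_df))
print(combined_df["gender_subtype"].value_counts(normalize=True))
```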

# Prepare non-disability subsets

## Gender subset

Create gender subset:

In [21]:
gender_df = all_data_df_cleansed[(all_data_df_cleansed['male'] > 0) | 
           (all_data_df_cleansed['female'] > 0) | 
           (all_data_df_cleansed['transgender'] > 0) | 
           (all_data_df_cleansed['other_gender'] > 0)]
gender_df.shape

(137722, 27)

Add gender subtype column:

In [22]:
gender_subtypes_df = gender_df[['male','female','transgender','other_gender']]
gender_df = gender_df.assign(gender_subtype=gender_subtypes_df.idxmax(axis=1))
gender_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,gender_subtype
7681,Blame men. There's always an excuse to blame ...,1,0.545455,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male
7682,And the woman exposing herself saying grab thi...,1,0.728571,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,female
7691,Are you a Pilgrim?\nWhy arn't you growing your...,1,0.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male
7699,"No, he was accused of being a racist white man.",0,0.363636,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male
7709,How do we fight agaisnt women who use sexual f...,1,0.8,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male


### Prepare data splits for gender
**Overview:** We want the gender subset to have roughly the following ratio: 70% train, 10% val, and 20% test, with each set undersampled to the size of the corresponding disability set.

**Disability+Gender Interweaving Technique:** Each split should be stratified. We'll divide the disability subset into 3 sets stratified on disability subtype, and divide the gender subset into 3 sets stratified on gender subtype. During fine-tuning we'll alternate between disability and gender splits (i.e. train on disability split 1, then gender split 1, then disability split 2, then gender split 2, etc.). This requires 3 training sets and 3 validation sets for gender, all stratified on gender subtype.

**Splitting Method:** To prepare the splits described above, we'll use scikit-learn's `train_test_split()`. Since it only produces two splits per call, we create all of the dataset splits in the following steps:

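1. Undersample and split in one call: draw as many rows as the disability train and validation sets combined for group1, and as many rows as the disability test set for group2 (about an 80%/20% ratio).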
1. No need to further split the test set, so leave it alone.
1. Take the set from step 1 that combines train and validation, and split it into group1: 2/3 and group2: 1/3.
1. Take the set from step 3 that was 2/3 and split it into half.
1. Now we have 3 equal splits from step 3 and step 4.
1. For each of the 3 splits, create a train and validation split, keeping 6/7 for train and holding out 1/7 for validation. Relative to the undersampled subset this gives 0.8 × 6/7 ≈ 68.6% train and 0.8 × 1/7 ≈ 11.4% validation, which is close to the 70%/10% target (an exact 70%/10% split would hold out 1/8 instead).

In [23]:
# Undersample to the disability set sizes: train+val rows for the combined set, test rows for the test set (about 80% / 20%)
gender_combined_df, gender_test_df = train_test_split(gender_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=gender_df['gender_subtype'])

# Split the 80% train and val into: 2/3 and 1/3
gender_combined_split12_df, gender_combined_split3_df = train_test_split(gender_combined_df,
                                       test_size=1/3,
                                       random_state=266, stratify=gender_combined_df['gender_subtype'])

# Split the 2/3 train and val into half
gender_combined_split1_df, gender_combined_split2_df = train_test_split(gender_combined_split12_df,
                                       test_size=0.5,
                                       random_state=266, stratify=gender_combined_split12_df['gender_subtype'])

# Split #1: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
gender_split1_train_df, gender_split1_val_df = train_test_split(gender_combined_split1_df,
                                       test_size=1/7,
                                       random_state=266, stratify=gender_combined_split1_df['gender_subtype'])

# Split #2: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
gender_split2_train_df, gender_split2_val_df = train_test_split(gender_combined_split2_df,
                                       test_size=1/7,
                                       random_state=266, stratify=gender_combined_split2_df['gender_subtype'])

# Split #3: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
gender_split3_train_df, gender_split3_val_df = train_test_split(gender_combined_split3_df,
                                       test_size=1/7,
                                       random_state=266, stratify=gender_combined_split3_df['gender_subtype'])

In [24]:
print('\nHow big is each split for gender?\n')
print('len(gender_split1_train_df): ', len(gender_split1_train_df))
print('len(gender_split2_train_df): ', len(gender_split2_train_df))
print('len(gender_split3_train_df): ', len(gender_split3_train_df))

print('len(gender_split1_val_df): ', len(gender_split1_val_df))
print('len(gender_split2_val_df): ', len(gender_split2_val_df))
print('len(gender_split3_val_df): ', len(gender_split3_val_df))

print('total gender train: ', len(gender_split1_train_df)+len(gender_split2_train_df)+len(gender_split3_train_df))
print('total gender val: ', len(gender_split1_val_df)+len(gender_split2_val_df)+len(gender_split3_val_df))
print('len(gender_test_df): ', len(gender_test_df))


How big is each split for gender?

len(gender_split1_train_df):  4266
len(gender_split2_train_df):  4266
len(gender_split3_train_df):  4266
len(gender_split1_val_df):  711
len(gender_split2_val_df):  711
len(gender_split3_val_df):  712
total gender train:  12798
total gender val:  2134
len(gender_test_df):  3733


In [25]:
gender_train_df = pd.concat([gender_split1_train_df, gender_split2_train_df, gender_split3_train_df], axis=0)
gender_val_df = pd.concat([gender_split1_val_df, gender_split2_val_df, gender_split3_val_df], axis=0)
gender_full_df = pd.concat([gender_train_df, gender_val_df, gender_test_df], axis=0)

In [26]:
print('len(gender_train_df): ', len(gender_train_df))
print('len(gender_val_df): ', len(gender_val_df))
print('len(gender_test_df): ', len(gender_test_df))
print('len(gender_full_df): ', len(gender_full_df))

len(gender_train_df):  12798
len(gender_val_df):  2134
len(gender_test_df):  3733
len(gender_full_df):  18665


In [27]:
print('\nStratified Sampling Sanity Check for Gender')
gender_train_male = (gender_split1_train_df['gender_subtype']=='male').astype(int).sum()
gender_train_female = (gender_split1_train_df['gender_subtype']=='female').astype(int).sum()
gender_train_trans = (gender_split1_train_df['gender_subtype']=='transgender').astype(int).sum()
gender_train_other = (gender_split1_train_df['gender_subtype']=='other_gender').astype(int).sum()
gender_train_total = len(gender_split1_train_df['gender_subtype'])
print('\nSplit 1 Gender Train')
print('=====================')
print('male:\t', gender_train_male/gender_train_total)
print('female:\t', gender_train_female/gender_train_total)
print('transgender:\t', gender_train_trans/gender_train_total)
print('other_gender:\t', gender_train_other/gender_train_total)

gender_val_male = (gender_split1_val_df['gender_subtype']=='male').astype(int).sum()
gender_val_female = (gender_split1_val_df['gender_subtype']=='female').astype(int).sum()
gender_val_trans = (gender_split1_val_df['gender_subtype']=='transgender').astype(int).sum()
gender_val_other = (gender_split1_val_df['gender_subtype']=='other_gender').astype(int).sum()
gender_val_total = len(gender_split1_val_df['gender_subtype'])
print('\nSplit 1 Gender Val')
print('=====================')
print('male:\t', gender_val_male/gender_val_total)
print('female:\t', gender_val_female/gender_val_total)
print('transgender:\t', gender_val_trans/gender_val_total)
print('other_gender:\t', gender_val_other/gender_val_total)


Stratified Sampling Sanity Check for Gender

Split 1 Gender Train
male:	 0.5307079231129864
female:	 0.42522269104547583
transgender:	 0.03164556962025317
other_gender:	 0.012423816221284576

Split 1 Gender Val
male:	 0.530239099859353
female:	 0.42616033755274263
transgender:	 0.030942334739803096
other_gender:	 0.012658227848101266


### Export gender split datasets into csv

In [28]:
gender_full_df.to_csv('data/gender-dataset-full.csv')
gender_train_df.to_csv('data/gender-dataset-train.csv')
gender_val_df.to_csv('data/gender-dataset-val.csv')
gender_test_df.to_csv('data/gender-dataset-test.csv')

gender_split1_train_df.to_csv('data/gender-dataset-split1-train.csv')
gender_split2_train_df.to_csv('data/gender-dataset-split2-train.csv')
gender_split3_train_df.to_csv('data/gender-dataset-split3-train.csv')

gender_split1_val_df.to_csv('data/gender-dataset-split1-val.csv')
gender_split2_val_df.to_csv('data/gender-dataset-split2-val.csv')
gender_split3_val_df.to_csv('data/gender-dataset-split3-val.csv')

## Sexual orientation subset

Create sexual orientation subset:

In [29]:
sexual_orientation_df = all_data_df_cleansed[(all_data_df_cleansed['heterosexual'] > 0) | 
           (all_data_df_cleansed['homosexual_gay_or_lesbian'] > 0) | 
           (all_data_df_cleansed['bisexual'] > 0) | 
           (all_data_df_cleansed['other_sexual_orientation'] > 0)]
sexual_orientation_df.shape

(22649, 27)

Add sexual orientation subtype column:

In [30]:
sexual_orientation_subtypes_df = sexual_orientation_df[['heterosexual','homosexual_gay_or_lesbian','bisexual','other_sexual_orientation']]
sexual_orientation_df = sexual_orientation_df.assign(sexual_orientation_subtype=sexual_orientation_subtypes_df.idxmax(axis=1))
sexual_orientation_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,sexual_orientation_subtype
7695,"Ridiculous, indeed. Although Rome does seem to...",1,0.857143,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian
7716,It took them long enough. And it goes against ...,0,0.181818,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian
7746,Well now Murray can simply go for the bisexual...,1,0.8,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,bisexual
7890,He probably thoughthimself gay when he did not...,0,0.453333,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian
7894,Sodomy isn't exclusive to homosexuals. I can s...,0,0.4375,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian


### Prepare data splits for sexual orientation
**Overview:** We want the sexual orientation subset to have roughly the following ratio: 70% train, 10% val, and 20% test, with each set undersampled to the size of the corresponding disability set.

**Disability+Sexual Orientation Interweaving Technique:** Each split should be stratified. We'll divide the disability subset into 3 sets stratified on disability subtype, and divide the sexual orientation subset into 3 sets stratified on sexual orientation subtype. During fine-tuning we'll alternate between disability and sexual orientation splits (i.e. train on disability split 1, then sexual orientation split 1, then disability split 2, then sexual orientation split 2, etc.). This requires 3 training sets and 3 validation sets for sexual orientation, all stratified on sexual orientation subtype.

**Splitting Method:** To prepare the splits described above, we'll use scikit-learn's `train_test_split()`. Since it only produces two splits per call, we create all of the dataset splits in the following steps:

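1. Undersample and split in one call: draw as many rows as the disability train and validation sets combined for group1, and as many rows as the disability test set for group2 (about an 80%/20% ratio).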
1. No need to further split the test set, so leave it alone.
1. Take the set from step 1 that combines train and validation, and split it into group1: 2/3 and group2: 1/3.
1. Take the set from step 3 that was 2/3 and split it into half.
1. Now we have 3 equal splits from step 3 and step 4.
1. For each of the 3 splits, create a train and validation split, keeping 6/7 for train and holding out 1/7 for validation. Relative to the undersampled subset this gives 0.8 × 6/7 ≈ 68.6% train and 0.8 × 1/7 ≈ 11.4% validation, which is close to the 70%/10% target (an exact 70%/10% split would hold out 1/8 instead).

In [31]:
# Undersample to the disability set sizes: train+val rows for the combined set, test rows for the test set (about 80% / 20%)
sexual_orientation_combined_df, sexual_orientation_test_df = train_test_split(sexual_orientation_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=sexual_orientation_df['sexual_orientation_subtype'])

# Split the 80% train and val into: 2/3 and 1/3
sexual_orientation_combined_split12_df, sexual_orientation_combined_split3_df = train_test_split(sexual_orientation_combined_df,
                                       test_size=1/3,
                                       random_state=266, stratify=sexual_orientation_combined_df['sexual_orientation_subtype'])

# Split the 2/3 train and val into half
sexual_orientation_combined_split1_df, sexual_orientation_combined_split2_df = train_test_split(sexual_orientation_combined_split12_df,
                                       test_size=0.5,
                                       random_state=266, stratify=sexual_orientation_combined_split12_df['sexual_orientation_subtype'])

# Split #1: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
sexual_orientation_split1_train_df, sexual_orientation_split1_val_df = train_test_split(sexual_orientation_combined_split1_df,
                                       test_size=1/7,
                                       random_state=266, stratify=sexual_orientation_combined_split1_df['sexual_orientation_subtype'])

# Split #2: keep 6/7 for train and hold out 1/7 for val
# Overall this gives about 68.6% train and 11.4% val (target: 70% / 10%)
sexual_orientation_split2_train_df, sexual_orientation_split2_val_df = train_test_split(sexual_orientation_combined_split2_df,
                                       test_size=1/7,
                                       random_state=266, stratify=sexual_orientation_combined_split2_df['sexual_orientation_subtype'])

# Create (1-1/7) train and 1/7 val for Split #3
# We want 70% train and 10% val
sexual_orientation_split3_train_df, sexual_orientation_split3_val_df = train_test_split(sexual_orientation_combined_split3_df,
                                       test_size=1/7,
                                       random_state=266, stratify=sexual_orientation_combined_split3_df['sexual_orientation_subtype'])

In [32]:
print('\nHow big is each sexual orientation split?\n')
print('len(sexual_orientation_split1_train_df): ', len(sexual_orientation_split1_train_df))
print('len(sexual_orientation_split2_train_df): ', len(sexual_orientation_split2_train_df))
print('len(sexual_orientation_split3_train_df): ', len(sexual_orientation_split3_train_df))

print('len(sexual_orientation_split1_val_df): ', len(sexual_orientation_split1_val_df))
print('len(sexual_orientation_split2_val_df): ', len(sexual_orientation_split2_val_df))
print('len(sexual_orientation_split3_val_df): ', len(sexual_orientation_split3_val_df))

print('total sexual_orientation train: ', len(sexual_orientation_split1_train_df)+len(sexual_orientation_split2_train_df)+len(sexual_orientation_split3_train_df))
print('total sexual_orientation val: ', len(sexual_orientation_split1_val_df)+len(sexual_orientation_split2_val_df)+len(sexual_orientation_split3_val_df))
print('len(sexual_orientation_test_df): ', len(sexual_orientation_test_df))


How big is each sexual orientation split?

len(sexual_orientation_split1_train_df):  4266
len(sexual_orientation_split2_train_df):  4266
len(sexual_orientation_split3_train_df):  4266
len(sexual_orientation_split1_val_df):  711
len(sexual_orientation_split2_val_df):  711
len(sexual_orientation_split3_val_df):  712
total sexual_orientation train:  12798
total sexual_orientation val:  2134
len(sexual_orientation_test_df):  3733


In [33]:
sexual_orientation_train_df = pd.concat([sexual_orientation_split1_train_df, sexual_orientation_split2_train_df, sexual_orientation_split3_train_df], axis=0)
sexual_orientation_val_df = pd.concat([sexual_orientation_split1_val_df, sexual_orientation_split2_val_df, sexual_orientation_split3_val_df], axis=0)
sexual_orientation_full_df = pd.concat([sexual_orientation_train_df, sexual_orientation_val_df, sexual_orientation_test_df], axis=0)

In [34]:
print('len(sexual_orientation_train_df): ', len(sexual_orientation_train_df))
print('len(sexual_orientation_val_df): ', len(sexual_orientation_val_df))
print('len(sexual_orientation_test_df): ', len(sexual_orientation_test_df))
print('len(sexual_orientation_full_df): ', len(sexual_orientation_full_df))

len(sexual_orientation_train_df):  12798
len(sexual_orientation_val_df):  2134
len(sexual_orientation_test_df):  3733
len(sexual_orientation_full_df):  18665


In [35]:
print('\nStratified Sampling Sanity Check for Sexual Orientation')
sexual_orientation_train_hetero = (sexual_orientation_split1_train_df['sexual_orientation_subtype']=='heterosexual').astype(int).sum()
sexual_orientation_train_homo = (sexual_orientation_split1_train_df['sexual_orientation_subtype']=='homosexual_gay_or_lesbian').astype(int).sum()
sexual_orientation_train_bi = (sexual_orientation_split1_train_df['sexual_orientation_subtype']=='bisexual').astype(int).sum()
sexual_orientation_train_other = (sexual_orientation_split1_train_df['sexual_orientation_subtype']=='other_sexual_orientation').astype(int).sum()
sexual_orientation_train_total = len(sexual_orientation_split1_train_df['sexual_orientation_subtype'])
print('\nSplit 1 sexual_orientation Train')
print('=====================')
print('hetero:\t', sexual_orientation_train_hetero/sexual_orientation_train_total)
print('homo:\t', sexual_orientation_train_homo/sexual_orientation_train_total)
print('bi:\t', sexual_orientation_train_bi/sexual_orientation_train_total)
print('other_sexual_orientation:\t', sexual_orientation_train_other/sexual_orientation_train_total)

sexual_orientation_val_hetero = (sexual_orientation_split1_val_df['sexual_orientation_subtype']=='heterosexual').astype(int).sum()
sexual_orientation_val_homo = (sexual_orientation_split1_val_df['sexual_orientation_subtype']=='homosexual_gay_or_lesbian').astype(int).sum()
sexual_orientation_val_bi = (sexual_orientation_split1_val_df['sexual_orientation_subtype']=='bisexual').astype(int).sum()
sexual_orientation_val_other = (sexual_orientation_split1_val_df['sexual_orientation_subtype']=='other_sexual_orientation').astype(int).sum()
sexual_orientation_val_total = len(sexual_orientation_split1_val_df['sexual_orientation_subtype'])
print('\nSplit 1 sexual_orientation Val')
print('=====================')
print('hetero:\t', sexual_orientation_val_hetero/sexual_orientation_val_total)
print('homo:\t', sexual_orientation_val_homo/sexual_orientation_val_total)
print('bi:\t', sexual_orientation_val_bi/sexual_orientation_val_total)
print('other_sexual_orientation:\t', sexual_orientation_val_other/sexual_orientation_val_total)


Stratified Sampling Sanity Check for Sexual Orientation

Split 1 sexual_orientation Train
hetero:	 0.10150023441162681
homo:	 0.7100328176277544
bi:	 0.04195968120018753
other_sexual_orientation:	 0.14650726676043133

Split 1 sexual_orientation Val
hetero:	 0.10126582278481013
homo:	 0.710267229254571
bi:	 0.04219409282700422
other_sexual_orientation:	 0.14627285513361463


### Export sexual orientation split datasets into csv

In [36]:
sexual_orientation_full_df.to_csv('data/sexual_orientation-dataset-full.csv')
sexual_orientation_train_df.to_csv('data/sexual_orientation-dataset-train.csv')
sexual_orientation_val_df.to_csv('data/sexual_orientation-dataset-val.csv')
sexual_orientation_test_df.to_csv('data/sexual_orientation-dataset-test.csv')

sexual_orientation_split1_train_df.to_csv('data/sexual_orientation-dataset-split1-train.csv')
sexual_orientation_split2_train_df.to_csv('data/sexual_orientation-dataset-split2-train.csv')
sexual_orientation_split3_train_df.to_csv('data/sexual_orientation-dataset-split3-train.csv')

sexual_orientation_split1_val_df.to_csv('data/sexual_orientation-dataset-split1-val.csv')
sexual_orientation_split2_val_df.to_csv('data/sexual_orientation-dataset-split2-val.csv')
sexual_orientation_split3_val_df.to_csv('data/sexual_orientation-dataset-split3-val.csv')

## Religion subset

Create religion subset:

In [37]:
religion_df = all_data_df_cleansed[(all_data_df_cleansed['christian'] > 0) | 
           (all_data_df_cleansed['jewish'] > 0) | 
           (all_data_df_cleansed['muslim'] > 0) | 
           (all_data_df_cleansed['hindu'] > 0) | 
           (all_data_df_cleansed['buddhist'] > 0) | 
           (all_data_df_cleansed['atheist'] > 0) | 
           (all_data_df_cleansed['other_religion'] > 0)]
religion_df.shape

Add religion subtype column:

In [38]:
religion_subtypes_df = religion_df[['christian','jewish','muslim','hindu','buddhist','atheist','other_religion']]
religion_df = religion_df.assign(religion_subtype=religion_subtypes_df.idxmax(axis=1))
religion_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,religion_subtype
7678,OH yes - Were those evil Christian Missionarie...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,christian
7689,"Lela, you admit no records exist to support yo...",0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,atheist
7701,The robot censor seems disinclined to accept s...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,muslim
7704,Agreed: there's no equivalence. What is stoppi...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,christian
7716,It took them long enough. And it goes against ...,0,0.181818,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,christian


### Prepare data splits for religion
**Overview:** We want the religion subset to have the following ratio: 70% train, 10% val, and 20% test.

**Disability+religion Interweaving Technique:** Each split should use stratified sampling. We'll divide the disability subset into 3 sets stratified on disability subtype, and divide the religion subset into 3 sets stratified on religion subtype. Then we'll alternate between the disability splits and the religion splits when fine-tuning (i.e. train on disability split 1, then religion split 1, then disability split 2, then religion split 2, etc.). For this alternating technique, we'll need 3 training and 3 validation sets for religion, all stratified on the religion subtype.
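
The alternating order described above can be sketched as follows. This only builds the training schedule; the fine-tuning step itself is not shown, and the split labels are placeholders rather than names used elsewhere in this notebook.

```python
from itertools import chain

num_splits = 3

# Placeholder labels standing in for the (train, val) DataFrame pairs of each split
disability_splits = [f"disability split {i}" for i in range(1, num_splits + 1)]
religion_splits = [f"religion split {i}" for i in range(1, num_splits + 1)]

# Interleave the two lists: disability 1, religion 1, disability 2, religion 2, ...
schedule = list(chain.from_iterable(zip(disability_splits, religion_splits)))

for step, name in enumerate(schedule, start=1):
    print(step, name)
```

In practice each label would be replaced by the matching pair of DataFrames, for example `religion_split1_train_df` and `religion_split1_val_df`.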

**Splitting Method:** To prepare the splits described above, we'll use scikit-learn's `train_test_split()` function. Since it only produces two splits per call, we'll take the following steps to create all of the dataset splits:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the set from step 1 that combines train and validation, and split it into group1: 2/3 and group2: 1/3.
1. Take the set from step 3 that was 2/3 and split it in half.
1. Now we have 3 equal splits from step 3 and step 4.
1. For each of the 3 splits, create a train and validation split, with 6/7 of the split going to train and 1/7 to validation. Since each split comes from the 80% combined set, this works out to roughly 68.6% train and 11.4% validation overall, which is close to the 70%/10% target (hitting it exactly would take 1/8 for validation rather than 1/7).
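
The steps above can be traced with plain integer arithmetic to see where the per-split sizes come from. This is a sketch under one assumption: that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other part. `split_sizes` is a hypothetical helper, and exact fractions are used instead of the floats the notebook passes so that rounding noise does not interfere.

```python
import math
from fractions import Fraction

def split_sizes(n_rows, test_fraction):
    """Return (n_first, n_second), assuming the second part is rounded up."""
    n_second = math.ceil(n_rows * test_fraction)
    return n_rows - n_second, n_second

n_combined = 12798 + 2134  # combined train + val rows, from the printed totals

# Step 3: 2/3 and 1/3; step 4: split the 2/3 part in half
n_split12, n_split3 = split_sizes(n_combined, Fraction(1, 3))
n_split1, n_split2 = split_sizes(n_split12, Fraction(1, 2))

# Step 6: 6/7 train and 1/7 val within each split
for name, n in [("split1", n_split1), ("split2", n_split2), ("split3", n_split3)]:
    n_train, n_val = split_sizes(n, Fraction(1, 7))
    print(name, "train:", n_train, "val:", n_val)
```

Under that rounding assumption the third split is one row larger than the other two (4978 versus 4977), which is why its validation set has 712 rows instead of 711.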

In [39]:
# Sample a combined train+val set (80%) and a test set (20%), using the same row counts as the disability subset
religion_combined_df, religion_test_df = train_test_split(religion_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=religion_df['religion_subtype'])

# Split the 80% train and val into: 2/3 and 1/3
religion_combined_split12_df, religion_combined_split3_df = train_test_split(religion_combined_df,
                                       test_size=1/3,
                                       random_state=266, stratify=religion_combined_df['religion_subtype'])

# Split the 2/3 train and val into half
religion_combined_split1_df, religion_combined_split2_df = train_test_split(religion_combined_split12_df,
                                       test_size=0.5,
                                       random_state=266, stratify=religion_combined_split12_df['religion_subtype'])

# Create (1-1/7) train and 1/7 val for Split #1
# We want 70% train and 10% val
religion_split1_train_df, religion_split1_val_df = train_test_split(religion_combined_split1_df,
                                       test_size=1/7,
                                       random_state=266, stratify=religion_combined_split1_df['religion_subtype'])

# Create (1-1/7) train and 1/7 val for Split #2
# We want 70% train and 10% val
religion_split2_train_df, religion_split2_val_df = train_test_split(religion_combined_split2_df,
                                       test_size=1/7,
                                       random_state=266, stratify=religion_combined_split2_df['religion_subtype'])

# Create (1-1/7) train and 1/7 val for Split #3
# We want 70% train and 10% val
religion_split3_train_df, religion_split3_val_df = train_test_split(religion_combined_split3_df,
                                       test_size=1/7,
                                       random_state=266, stratify=religion_combined_split3_df['religion_subtype'])

In [40]:
print('\nHow big is each split for religion?\n')
print('len(religion_split1_train_df): ', len(religion_split1_train_df))
print('len(religion_split2_train_df): ', len(religion_split2_train_df))
print('len(religion_split3_train_df): ', len(religion_split3_train_df))

print('len(religion_split1_val_df): ', len(religion_split1_val_df))
print('len(religion_split2_val_df): ', len(religion_split2_val_df))
print('len(religion_split3_val_df): ', len(religion_split3_val_df))

print('total religion train: ', len(religion_split1_train_df)+len(religion_split2_train_df)+len(religion_split3_train_df))
print('total religion val: ', len(religion_split1_val_df)+len(religion_split2_val_df)+len(religion_split3_val_df))
print('len(religion_test_df): ', len(religion_test_df))


How big is each split for religion?

len(religion_split1_train_df):  4266
len(religion_split2_train_df):  4266
len(religion_split3_train_df):  4266
len(religion_split1_val_df):  711
len(religion_split2_val_df):  711
len(religion_split3_val_df):  712
total religion train:  12798
total religion val:  2134
len(religion_test_df):  3733


In [41]:
religion_train_df = pd.concat([religion_split1_train_df, religion_split2_train_df, religion_split3_train_df], axis=0)
religion_val_df = pd.concat([religion_split1_val_df, religion_split2_val_df, religion_split3_val_df], axis=0)
religion_full_df = pd.concat([religion_train_df, religion_val_df, religion_test_df], axis=0)

In [42]:
print('len(religion_train_df): ', len(religion_train_df))
print('len(religion_val_df): ', len(religion_val_df))
print('len(religion_test_df): ', len(religion_test_df))
print('len(religion_full_df): ', len(religion_full_df))

len(religion_train_df):  12798
len(religion_val_df):  2134
len(religion_test_df):  3733
len(religion_full_df):  18665


In [43]:
print('\nStratified Sampling Sanity Check for Religion')
religion_train_christian = (religion_split1_train_df['religion_subtype']=='christian').astype(int).sum()
religion_train_jewish = (religion_split1_train_df['religion_subtype']=='jewish').astype(int).sum()
religion_train_muslim = (religion_split1_train_df['religion_subtype']=='muslim').astype(int).sum()
religion_train_hindu = (religion_split1_train_df['religion_subtype']=='hindu').astype(int).sum()
religion_train_other = (religion_split1_train_df['religion_subtype']=='other_religion').astype(int).sum()
religion_train_total = len(religion_split1_train_df['religion_subtype'])
print('\nSplit 1 Religion Train')
print('=====================')
print('christian:\t', religion_train_christian/religion_train_total)
print('jewish:\t', religion_train_jewish/religion_train_total)
print('muslim:\t', religion_train_muslim/religion_train_total)
print('hindu:\t', religion_train_hindu/religion_train_total)
print('other_religion:\t', religion_train_other/religion_train_total)

religion_val_christian = (religion_split1_val_df['religion_subtype']=='christian').astype(int).sum()
religion_val_jewish = (religion_split1_val_df['religion_subtype']=='jewish').astype(int).sum()
religion_val_muslim = (religion_split1_val_df['religion_subtype']=='muslim').astype(int).sum()
religion_val_hindu = (religion_split1_val_df['religion_subtype']=='hindu').astype(int).sum()
religion_val_other = (religion_split1_val_df['religion_subtype']=='other_religion').astype(int).sum()
religion_val_total = len(religion_split1_val_df['religion_subtype'])
print('\nSplit 1 Religion Val')
print('=====================')
print('christian:\t', religion_val_christian/religion_val_total)
print('jewish:\t', religion_val_jewish/religion_val_total)
print('muslim:\t', religion_val_muslim/religion_val_total)
print('hindu:\t', religion_val_hindu/religion_val_total)
print('other_religion:\t', religion_val_other/religion_val_total)


Stratified Sampling Sanity Check for Religion

Split 1 Religion Train
christian:	 0.629395218002813
jewish:	 0.07172995780590717
muslim:	 0.23253633380215658
hindu:	 0.007735583684950774
other_religion:	 0.03727144866385373

Split 1 Religion Val
christian:	 0.630098452883263
jewish:	 0.07172995780590717
muslim:	 0.23347398030942335
hindu:	 0.007032348804500703
other_religion:	 0.03656821378340366
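
The printed proportions above cover five subtypes and sum to about 0.98, presumably because `buddhist` and `atheist` are not printed. A more compact check that lists every subtype present is `value_counts(normalize=True)`. The sketch below runs on toy data, not the notebook's splits, and `toy_train`, `toy_val`, and `comparison` are illustrative names.

```python
import pandas as pd

# Toy stand-ins for one train/val pair (not the notebook's data)
toy_train = pd.Series(
    ["christian"] * 6 + ["muslim"] * 2 + ["atheist"] * 2, name="religion_subtype"
)
toy_val = pd.Series(
    ["christian"] * 3 + ["muslim"] + ["atheist"], name="religion_subtype"
)

# One column of class proportions per split, aligned on the subtype label
comparison = pd.DataFrame(
    {
        "train": toy_train.value_counts(normalize=True),
        "val": toy_val.value_counts(normalize=True),
    }
).fillna(0.0)

print(comparison)
```

On the real data, the same pattern would apply to `religion_split1_train_df['religion_subtype']` and `religion_split1_val_df['religion_subtype']`.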


### Export religion split datasets into csv

In [44]:
religion_full_df.to_csv('data/religion-dataset-full.csv')
religion_train_df.to_csv('data/religion-dataset-train.csv')
religion_val_df.to_csv('data/religion-dataset-val.csv')
religion_test_df.to_csv('data/religion-dataset-test.csv')

religion_split1_train_df.to_csv('data/religion-dataset-split1-train.csv')
religion_split2_train_df.to_csv('data/religion-dataset-split2-train.csv')
religion_split3_train_df.to_csv('data/religion-dataset-split3-train.csv')

religion_split1_val_df.to_csv('data/religion-dataset-split1-val.csv')
religion_split2_val_df.to_csv('data/religion-dataset-split2-val.csv')
religion_split3_val_df.to_csv('data/religion-dataset-split3-val.csv')

## Race subset

Create race subset:

In [45]:
race_df = all_data_df_cleansed[(all_data_df_cleansed['black'] > 0) | 
           (all_data_df_cleansed['white'] > 0) | 
           (all_data_df_cleansed['asian'] > 0) | 
           (all_data_df_cleansed['latino'] > 0) | 
           (all_data_df_cleansed['other_race_or_ethnicity'] > 0)]
race_df.shape

(71648, 27)

Add race subtype column:

In [46]:
race_subtypes_df = race_df[['black', 'white', 'asian','latino', 'other_race_or_ethnicity']]
race_df = race_df.assign(race_subtype=race_subtypes_df.idxmax(axis=1))
race_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,race_subtype
7679,Why is this black racist crap still on the G&M...,1,0.757143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,black
7680,even up here.......BLACKS!,1,0.688525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,black
7684,"""Let's get the black folks and the white folks...",1,0.736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,black
7691,Are you a Pilgrim?\nWhy arn't you growing your...,1,0.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,white
7696,And there it is. Our president is a white supr...,1,0.507042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,white
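
Note how `idxmax(axis=1)` assigns the subtype: it picks the column with the highest score in each row, and when several columns tie for the maximum it returns the first one in column order. Row 7684 in the preview above, with `black` and `white` both at 1.0, is labeled `black` for that reason. The sketch below illustrates this on made-up scores.

```python
import pandas as pd

# Made-up identity scores, using the same column order as the notebook
scores = pd.DataFrame(
    {
        "black": [1.0, 0.0, 0.5],
        "white": [1.0, 1.0, 0.25],
        "asian": [0.0, 0.0, 0.5],
    }
)

# Column label of the row-wise maximum; ties go to the leftmost column
subtype = scores.idxmax(axis=1)
print(subtype.tolist())  # ['black', 'white', 'black']
```

This means the column order passed when building `race_subtypes_df` decides which subtype a tied comment is assigned to.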


### Prepare data splits for race
**Overview:** We want the race subset to have the following ratio: 70% train, 10% val, and 20% test.

**Disability+race Interweaving Technique:** Each split should use stratified sampling. We'll divide the disability subset into 3 sets stratified on disability subtype, and divide the race subset into 3 sets stratified on race subtype. Then we'll alternate between the disability splits and the race splits when fine-tuning (i.e. train on disability split 1, then race split 1, then disability split 2, then race split 2, etc.). For this alternating technique, we'll need 3 training and 3 validation sets for race, all stratified on the race subtype.

**Splitting Method:** To prepare the splits described above, we'll use scikit-learn's `train_test_split()` function. Since it only produces two splits per call, we'll take the following steps to create all of the dataset splits:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the set from step 1 that combines train and validation, and split it into group1: 2/3 and group2: 1/3.
1. Take the set from step 3 that was 2/3 and split it in half.
1. Now we have 3 equal splits from step 3 and step 4.
1. For each of the 3 splits, create a train and validation split, with 6/7 of the split going to train and 1/7 to validation. Since each split comes from the 80% combined set, this works out to roughly 68.6% train and 11.4% validation overall, which is close to the 70%/10% target (hitting it exactly would take 1/8 for validation rather than 1/7).
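
Because the same sequence of calls is repeated for every identity subset, steps 3 to 6 could be wrapped in a helper. The function below is a hypothetical refactoring, not code from this notebook; it is shown on synthetic data, and `three_stratified_splits` and `toy_df` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_stratified_splits(combined_df, column, val_fraction=1/7, seed=266):
    """Split combined_df into 3 (train, val) pairs, each stratified on `column`."""
    # 2/3 and 1/3 of the combined train+val data
    split12, split3 = train_test_split(
        combined_df, test_size=1/3, random_state=seed, stratify=combined_df[column]
    )
    # Halve the 2/3 part to get three roughly equal splits
    split1, split2 = train_test_split(
        split12, test_size=0.5, random_state=seed, stratify=split12[column]
    )
    # Carve a validation set out of each split
    return [
        train_test_split(
            part, test_size=val_fraction, random_state=seed, stratify=part[column]
        )
        for part in (split1, split2, split3)
    ]

# Synthetic data for illustration only
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    {
        "comment_text": [f"comment {i}" for i in range(2100)],
        "race_subtype": rng.choice(
            ["black", "white", "asian"], size=2100, p=[0.3, 0.5, 0.2]
        ),
    }
)

pairs = three_stratified_splits(toy_df, "race_subtype")
for i, (train_df, val_df) in enumerate(pairs, start=1):
    print(f"split {i}: train={len(train_df)}, val={len(val_df)}")
```

With a helper like this, each subset section would reduce to one initial train/test call plus one call to the helper, which also removes the risk of copy-paste slips between sections.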

In [47]:
# Sample a combined train+val set (80%) and a test set (20%), using the same row counts as the disability subset
race_combined_df, race_test_df = train_test_split(race_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=race_df['race_subtype'])

# Split the 80% train and val into: 2/3 and 1/3
race_combined_split12_df, race_combined_split3_df = train_test_split(race_combined_df,
                                       test_size=1/3,
                                       random_state=266, stratify=race_combined_df['race_subtype'])

# Split the 2/3 train and val into half
race_combined_split1_df, race_combined_split2_df = train_test_split(race_combined_split12_df,
                                       test_size=0.5,
                                       random_state=266, stratify=race_combined_split12_df['race_subtype'])

# Create (1-1/7) train and 1/7 val for Split #1
# We want 70% train and 10% val
race_split1_train_df, race_split1_val_df = train_test_split(race_combined_split1_df,
                                       test_size=1/7,
                                       random_state=266, stratify=race_combined_split1_df['race_subtype'])

# Create (1-1/7) train and 1/7 val for Split #2
# We want 70% train and 10% val
race_split2_train_df, race_split2_val_df = train_test_split(race_combined_split2_df,
                                       test_size=1/7,
                                       random_state=266, stratify=race_combined_split2_df['race_subtype'])

# Create (1-1/7) train and 1/7 val for Split #3
# We want 70% train and 10% val
race_split3_train_df, race_split3_val_df = train_test_split(race_combined_split3_df,
                                       test_size=1/7,
                                       random_state=266, stratify=race_combined_split3_df['race_subtype'])

In [48]:
print('\nHow big is each split for race?\n')
print('len(race_split1_train_df): ', len(race_split1_train_df))
print('len(race_split2_train_df): ', len(race_split2_train_df))
print('len(race_split3_train_df): ', len(race_split3_train_df))

print('len(race_split1_val_df): ', len(race_split1_val_df))
print('len(race_split2_val_df): ', len(race_split2_val_df))
print('len(race_split3_val_df): ', len(race_split3_val_df))

print('total race train: ', len(race_split1_train_df)+len(race_split2_train_df)+len(race_split3_train_df))
print('total race val: ', len(race_split1_val_df)+len(race_split2_val_df)+len(race_split3_val_df))
print('len(race_test_df): ', len(race_test_df))


How big is each split for race?

len(race_split1_train_df):  4266
len(race_split2_train_df):  4266
len(race_split3_train_df):  4266
len(race_split1_val_df):  711
len(race_split2_val_df):  711
len(race_split3_val_df):  712
total race train:  12798
total race val:  2134
len(race_test_df):  3733


In [49]:
race_train_df = pd.concat([race_split1_train_df, race_split2_train_df, race_split3_train_df], axis=0)
race_val_df = pd.concat([race_split1_val_df, race_split2_val_df, race_split3_val_df], axis=0)
race_full_df = pd.concat([race_train_df, race_val_df, race_test_df], axis=0)

In [50]:
print('len(race_train_df): ', len(race_train_df))
print('len(race_val_df): ', len(race_val_df))
print('len(race_test_df): ', len(race_test_df))
print('len(race_full_df): ', len(race_full_df))

len(race_train_df):  12798
len(race_val_df):  2134
len(race_test_df):  3733
len(race_full_df):  18665


In [51]:
print('\nStratified Sampling Sanity Check for race')
race_train_black = (race_split1_train_df['race_subtype']=='black').astype(int).sum()
race_train_white = (race_split1_train_df['race_subtype']=='white').astype(int).sum()
race_train_asian = (race_split1_train_df['race_subtype']=='asian').astype(int).sum()
race_train_latino = (race_split1_train_df['race_subtype']=='latino').astype(int).sum()
race_train_other = (race_split1_train_df['race_subtype']=='other_race_or_ethnicity').astype(int).sum()
race_train_total = len(race_split1_train_df['race_subtype'])
print('\nSplit 1 race Train')
print('=====================')
print('black:\t', race_train_black/race_train_total)
print('white:\t', race_train_white/race_train_total)
print('asian:\t', race_train_asian/race_train_total)
print('latino:\t', race_train_latino/race_train_total)
print('other_race_or_ethnicity:\t', race_train_other/race_train_total)

race_val_black = (race_split1_val_df['race_subtype']=='black').astype(int).sum()
race_val_white = (race_split1_val_df['race_subtype']=='white').astype(int).sum()
race_val_asian = (race_split1_val_df['race_subtype']=='asian').astype(int).sum()
race_val_latino = (race_split1_val_df['race_subtype']=='latino').astype(int).sum()
race_val_other = (race_split1_val_df['race_subtype']=='other_race_or_ethnicity').astype(int).sum()
race_val_total = len(race_split1_val_df['race_subtype'])
print('\nSplit 1 race Val')
print('=====================')
print('black:\t', race_val_black/race_val_total)
print('white:\t', race_val_white/race_val_total)
print('asian:\t', race_val_asian/race_val_total)
print('latino:\t', race_val_latino/race_val_total)
print('other_race_or_ethnicity:\t', race_val_other/race_val_total)


Stratified Sampling Sanity Check for race

Split 1 race Train
black:	 0.24917955930614158
white:	 0.3663853727144866
asian:	 0.13877168307548055
latino:	 0.07196436943272386
other_race_or_ethnicity:	 0.17369901547116737

Split 1 race Val
black:	 0.2489451476793249
white:	 0.3656821378340366
asian:	 0.13924050632911392
latino:	 0.07172995780590717
other_race_or_ethnicity:	 0.17440225035161744


### Export race split datasets into csv

In [52]:
race_full_df.to_csv('data/race-dataset-full.csv')
race_train_df.to_csv('data/race-dataset-train.csv')
race_val_df.to_csv('data/race-dataset-val.csv')
race_test_df.to_csv('data/race-dataset-test.csv')

race_split1_train_df.to_csv('data/race-dataset-split1-train.csv')
race_split2_train_df.to_csv('data/race-dataset-split2-train.csv')
race_split3_train_df.to_csv('data/race-dataset-split3-train.csv')

race_split1_val_df.to_csv('data/race-dataset-split1-val.csv')
race_split2_val_df.to_csv('data/race-dataset-split2-val.csv')
race_split3_val_df.to_csv('data/race-dataset-split3-val.csv')
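
The export cells follow one file-naming pattern, `data/<subset>-dataset-<name>.csv`, so they could also be written as a loop. The sketch below is a hypothetical alternative, not the notebook's code; `export_splits` and the toy frames are illustrative, and it writes to a temporary directory rather than `data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_splits(frames, prefix, out_dir):
    """Write each DataFrame in `frames` to '<out_dir>/<prefix>-dataset-<name>.csv'."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = out_dir / f"{prefix}-dataset-{name}.csv"
        frame.to_csv(path)  # keeps the index column, like the cells above
        paths.append(path)
    return paths

# Toy frames standing in for race_train_df, race_split1_val_df, etc.
toy_frames = {
    "train": pd.DataFrame({"comment_text": ["a", "b"]}),
    "split1-val": pd.DataFrame({"comment_text": ["c"]}),
}

with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in export_splits(toy_frames, "race", tmp)]

print(written)  # ['race-dataset-train.csv', 'race-dataset-split1-val.csv']
```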

# Reference code: How to load split data for use in our models

Load the csv files for each data split with the following code:

In [None]:
disability_df_train = pd.read_csv('data/disability-dataset-train.csv')
disability_df_val = pd.read_csv('data/disability-dataset-val.csv')
disability_df_test = pd.read_csv('data/disability-dataset-test.csv')
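
One detail to be aware of when loading: the export cells call `to_csv()` without `index=False`, so each file stores the original row index as an unnamed first column. A plain `pd.read_csv` then exposes it as a column called `Unnamed: 0`. The sketch below, on a toy frame, shows the difference that `index_col=0` makes; whether to restore the index this way is a choice, not something this notebook prescribes.

```python
import io

import pandas as pd

toy_df = pd.DataFrame(
    {"comment_text": ["first", "second"], "toxicity_binary": [0, 1]},
    index=[7678, 7689],
)

csv_text = toy_df.to_csv()  # the index is written as an unnamed first column

default_load = pd.read_csv(io.StringIO(csv_text))
indexed_load = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_load.columns.tolist())  # ['Unnamed: 0', 'comment_text', 'toxicity_binary']
print(indexed_load.index.tolist())    # [7678, 7689]
```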

Now that the data is loaded, we need the labels and text examples as tensors. Use the code below to create them:

In [None]:
import tensorflow as tf

# Form tensors of labels and features.
disability_train_labels = tf.convert_to_tensor(disability_df_train['toxicity_binary'])
disability_val_labels = tf.convert_to_tensor(disability_df_val['toxicity_binary'])
disability_test_labels = tf.convert_to_tensor(disability_df_test['toxicity_binary'])

disability_train_examples = tf.convert_to_tensor(disability_df_train['comment_text'])
disability_val_examples = tf.convert_to_tensor(disability_df_val['comment_text'])
disability_test_examples = tf.convert_to_tensor(disability_df_test['comment_text'])