# Data Preparation
Data source: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data?select=all_data.csv

In this notebook, we prepare the data downloaded from Kaggle and export the data subsets that we will feed into our models. The data preparation process includes cleaning the data, extracting data subsets by identity category, and splitting each identity subset into train, validation, and test sets via stratified sampling.

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

## Load the data:
The Kaggle competition for this dataset shipped separate CSV files for its own train and test subsets. After the competition ended, the `all_data.csv` file was released with labels for both the train and test sets. We therefore use `all_data.csv` as our starting dataset.

In [2]:
all_data_df = pd.read_csv('data/all_data.csv')

## Clean the data

EDA revealed that some rows have a missing value for `comment_text`. What do these rows look like?

In [3]:
all_data_df[pd.isna(all_data_df["comment_text"])]

Unnamed: 0,id,comment_text,split,created_date,publication_id,parent_id,article_id,rating,funny,wow,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
446630,392337,,train,2016-07-18 19:34:48.278774+00,13,392165.0,141670,approved,0,0,...,,,,,,,,,0,4
869804,872115,,train,2017-01-21 02:04:30.064452+00,54,872109.0,163140,approved,5,0,...,,,,,,,,,0,4
1556982,5971919,,train,2017-09-18 02:40:48.161601+00,13,5971615.0,378393,approved,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4
1567442,5353666,,train,2017-06-04 02:48:07.950238+00,13,5352881.0,340316,approved,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4


### Delete the rows with missing comments
Since these rows have no input text to feed into the models, they are unusable, so we'll remove them from our dataset.

In [4]:
all_data_df_cleansed = all_data_df.copy().drop(index=all_data_df[pd.isna(all_data_df['comment_text'])].index)
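
The same rows can also be dropped with `DataFrame.dropna` and its `subset` argument. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not the Kaggle data) to show that both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_data_df (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    "comment_text": ["first comment", np.nan, "third comment", None],
    "toxicity": [0.1, 0.7, 0.4, 0.9],
})

# Approach used in the cell above: drop by the index of rows with a missing comment
dropped_by_index = toy_df.drop(index=toy_df[pd.isna(toy_df["comment_text"])].index)

# Compact alternative: drop rows where comment_text is missing
dropped_by_dropna = toy_df.dropna(subset=["comment_text"])

print(dropped_by_index.equals(dropped_by_dropna))
print(dropped_by_dropna.shape)
```

Both calls return a new dataframe and leave the original untouched, so the explicit `.copy()` is not strictly needed in either form.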

We can see that the cleansed dataset now has four fewer rows:

In [5]:
all_data_df.shape

(1999516, 46)

In [6]:
all_data_df_cleansed.shape

(1999512, 46)

## Basic feature engineering

### Add `toxicity_binary` column

In [7]:
all_data_df_cleansed['toxicity_binary'] = (all_data_df_cleansed['toxicity'] >= 0.5).astype(int)

In [8]:
all_data_df_cleansed[['toxicity','toxicity_binary']]

Unnamed: 0,toxicity,toxicity_binary
0,0.373134,0
1,0.605263,1
2,0.666667,1
3,0.815789,1
4,0.550000,1
...,...,...
1999511,0.400000,0
1999512,0.400000,0
1999513,0.400000,0
1999514,0.400000,0
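
To make the threshold behaviour concrete, here is a minimal sketch on made-up scores (`toy_toxicity` is hypothetical). It shows that a score of exactly 0.5 is labelled toxic because the comparison is `>=`, and that a missing score would silently become 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toxicity scores, including the 0.5 boundary and a missing value
toy_toxicity = pd.Series([0.0, 0.4, 0.499999, 0.5, 0.55, 1.0, np.nan])

# Same rule as the cell above: 1 if the score is at least 0.5, otherwise 0.
# NaN >= 0.5 evaluates to False, so a missing score maps to 0.
toy_binary = (toy_toxicity >= 0.5).astype(int)

print(toy_binary.tolist())
```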


Move the new column towards the front of the dataframe:

In [9]:
orig_cols = all_data_df_cleansed.columns.tolist()
reordered_cols = orig_cols[:2] + orig_cols[-1:] + orig_cols[2:-1]
all_data_df_cleansed = all_data_df_cleansed[reordered_cols]
all_data_df_cleansed.head()

Unnamed: 0,id,comment_text,toxicity_binary,split,created_date,publication_id,parent_id,article_id,rating,funny,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,1083994,He got his money... now he lies in wait till a...,0,train,2017-03-06 15:21:53.675241+00,21,,317120,approved,0,...,,,,,,,,,0,67
1,650904,Mad dog will surely put the liberals in mental...,1,train,2016-12-02 16:44:21.329535+00,21,,154086,approved,0,...,,,,,,,,,0,76
2,5902188,And Trump continues his lifelong cowardice by ...,1,train,2017-09-05 19:05:32.341360+00,55,,374342,approved,1,...,,,,,,,,,0,63
3,7084460,"""while arresting a man for resisting arrest"".\...",1,test,2016-11-01 16:53:33.561631+00,13,,149218,approved,0,...,,,,,,,,,0,76
4,5410943,Tucker and Paul are both total bad ass mofo's.,1,train,2017-06-14 05:08:21.997315+00,21,,344096,approved,0,...,,,,,,,,,0,80
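
The list-slicing reorder above can also be written with `DataFrame.pop` and `DataFrame.insert`, which avoids rebuilding the full column list. This is only a sketch on a made-up frame (`toy_df` is hypothetical); it assumes the new column should land at position 2, as in the cell above.

```python
import pandas as pd

# Toy frame whose last column plays the role of toxicity_binary
toy_df = pd.DataFrame({
    "id": [1, 2],
    "comment_text": ["a", "b"],
    "split": ["train", "test"],
    "toxicity_binary": [0, 1],
})

# Slicing approach from the cell above: first two columns, last column, then the rest
orig_cols = toy_df.columns.tolist()
reordered_cols = orig_cols[:2] + orig_cols[-1:] + orig_cols[2:-1]
by_slicing = toy_df[reordered_cols]

# Alternative: remove the column and re-insert it at position 2 (modifies the copy in place)
by_insert = toy_df.copy()
by_insert.insert(2, "toxicity_binary", by_insert.pop("toxicity_binary"))

print(by_slicing.columns.tolist())
print(by_slicing.equals(by_insert))
```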


## Drop the columns we won't be using

In [10]:
all_data_df_cleansed = all_data_df_cleansed.drop(columns=['id', 'split', 'created_date', 'publication_id',
       'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes',
       'disagree', 'severe_toxicity', 'obscene', 'sexual_explicit',
       'identity_attack', 'insult', 'threat', 'identity_annotator_count',
       'toxicity_annotator_count'])
all_data_df_cleansed.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,other_religion,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability
0,He got his money... now he lies in wait till a...,0,0.373134,,,,,,,,...,,,,,,,,,,
1,Mad dog will surely put the liberals in mental...,1,0.605263,,,,,,,,...,,,,,,,,,,
2,And Trump continues his lifelong cowardice by ...,1,0.666667,,,,,,,,...,,,,,,,,,,
3,"""while arresting a man for resisting arrest"".\...",1,0.815789,,,,,,,,...,,,,,,,,,,
4,Tucker and Paul are both total bad ass mofo's.,1,0.55,,,,,,,,...,,,,,,,,,,


# Prepare Disability Subset

## Create disability subset

In [11]:
disability_df = all_data_df_cleansed[(all_data_df_cleansed["physical_disability"] > 0) | 
           (all_data_df_cleansed["intellectual_or_learning_disability"] > 0) | 
           (all_data_df_cleansed["psychiatric_or_mental_illness"] > 0) | 
           (all_data_df_cleansed["other_disability"] > 0)]

In [12]:
disability_df.shape

(18665, 27)
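
The four-way `|` filter can be expressed more compactly with `gt(0).any(axis=1)`. The sketch below, on a made-up frame (`toy_df` is hypothetical), shows that the two masks agree and that rows without identity annotations (all `NaN`) are excluded, because `NaN > 0` is `False`.

```python
import numpy as np
import pandas as pd

disability_cols = [
    "physical_disability",
    "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness",
    "other_disability",
]

# Rows: unannotated (NaN), annotated but all zero, and two rows with a nonzero score
toy_df = pd.DataFrame({
    "physical_disability": [np.nan, 0.0, 0.25, 0.0],
    "intellectual_or_learning_disability": [np.nan, 0.0, 0.0, 0.0],
    "psychiatric_or_mental_illness": [np.nan, 0.0, 0.0, 1.0],
    "other_disability": [np.nan, 0.0, 0.0, 0.0],
})

# Explicit form, as in the cell above
mask_explicit = ((toy_df["physical_disability"] > 0) |
                 (toy_df["intellectual_or_learning_disability"] > 0) |
                 (toy_df["psychiatric_or_mental_illness"] > 0) |
                 (toy_df["other_disability"] > 0))

# Compact form: True where any of the listed columns is greater than zero
mask_compact = toy_df[disability_cols].gt(0).any(axis=1)

print(mask_compact.tolist())
print(mask_explicit.equals(mask_compact))
```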

## Add disability subtype column
We'll add a categorical feature that specifies which of the following disability subtypes each comment corresponds to:

- `physical_disability`
- `intellectual_or_learning_disability`
- `psychiatric_or_mental_illness`
- `other_disability`

EDA revealed that some comments have nonzero values for more than one of the subtypes above. Since the purpose of this column is to facilitate stratified sampling, the disability subtype label for each comment will be the subtype with the largest value for that comment.

In [13]:
disability_subtypes_df = disability_df[['physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness','other_disability']]
disability_df = disability_df.assign(disability_subtype=disability_subtypes_df.idxmax(axis=1))
disability_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,disability_subtype
7705,No sympathy for these two knuckleheads.,1,0.689655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,physical_disability
8073,Wow!\nYour progressive psychosis has become ex...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,psychiatric_or_mental_illness
8115,"Or.... maybe there IS chaos because the ""presi...",1,0.790323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,psychiatric_or_mental_illness
8125,I'll take someone who's physically ill over on...,0,0.352941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.75,0.0,1.0,0.0,psychiatric_or_mental_illness
8263,"Mental Illness at work again, again, again, ag...",1,0.842857,0.25,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,psychiatric_or_mental_illness
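
One detail worth knowing about `idxmax(axis=1)`: when two subtypes tie for the largest value, it returns the first matching column in column order. A minimal sketch on made-up scores (`toy_scores` is hypothetical):

```python
import pandas as pd

subtype_cols = [
    "physical_disability",
    "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness",
    "other_disability",
]

toy_scores = pd.DataFrame(
    [
        [0.25, 0.0, 0.0, 0.0],  # single nonzero subtype
        [0.75, 0.0, 1.0, 0.0],  # two nonzero subtypes, psychiatric is larger
        [0.50, 0.0, 0.5, 0.0],  # tie between physical and psychiatric
    ],
    columns=subtype_cols,
)

# Label each row with the column holding its largest value;
# on a tie, the first column (in column order) wins
toy_subtype = toy_scores.idxmax(axis=1)

print(toy_subtype.tolist())
```

Tied comments are therefore always assigned to the earlier-listed subtype, which slightly favors subtypes that appear first in the column list.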


## Prepare data splits for disability
**Overview:** We'll split the disability subset into roughly 70% train, 10% validation, and 20% test sets. Comments may differ subtly by disability subtype (e.g. comments about physical disability may differ from comments about intellectual/learning disability). We therefore stratify the sampling on disability subtype, so that each split contains about the same proportion of each subtype.

**Splitting Method**

We'll use the train_test_split() method and since it only creates two splits, we'll take the following steps to create the three train/val/test splits:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the combined train and validation set from step 1 and divide it into train and val sets. We hold out 1/7 of the combined set for validation and keep 6/7 for training, which gives roughly 68.6% train and 11.4% validation overall (an exact 70%/10% split would need a validation fraction of 1/8 here).
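
The arithmetic behind the two-stage split can be checked with exact fractions. The sketch below (the helper `overall_fractions` is ours, not part of the notebook) computes the overall train/val/test shares for a given validation share of the combined 80%.

```python
from fractions import Fraction

test_frac = Fraction(1, 5)       # 20% held out for test in the first split
combined_frac = 1 - test_frac    # 80% left for train + validation

def overall_fractions(val_share_of_combined):
    """Return overall (train, val, test) shares for the second-stage split."""
    val = combined_frac * val_share_of_combined
    train = combined_frac - val
    return train, val, test_frac

# Validation share of 1/7, as used in this notebook
train_7, val_7, _ = overall_fractions(Fraction(1, 7))
print(train_7, val_7, float(train_7), float(val_7))

# Validation share of 1/8, which yields exactly 70% / 10%
train_8, val_8, _ = overall_fractions(Fraction(1, 8))
print(train_8, val_8)
```

With a validation share of 1/7 the overall shares are 24/35 and 4/35, which is where the train and validation ratios of about 0.686 and 0.114 in the output further down come from.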

Split the data:

In [14]:
# Split into 80% combined for train and val, and 20% test
disability_combined_df, disability_test_df = train_test_split(disability_df,
                                       test_size=0.2,
                                       random_state=266, stratify=disability_df['disability_subtype'])

# Split the combined set into 6/7 train and 1/7 val (about 68.6% and 11.4% of the subset)
disability_train_df, disability_val_df = train_test_split(disability_combined_df,
                                       test_size=1/7,
                                       random_state=266, stratify=disability_combined_df['disability_subtype'])

How big is each split?

In [15]:
disability_train_len = len(disability_train_df)
disability_val_len = len(disability_val_df)
disability_test_len = len(disability_test_df)
disability_total = disability_train_len+disability_val_len+disability_test_len
print('disability_train size: ', disability_train_len)
print('disability_val size: ', disability_val_len)
print('disability_test size: ', disability_test_len)
print('disability total: ', disability_total)
print('disability train ratio: ', disability_train_len/disability_total)
print('disability val ratio: ', disability_val_len/disability_total)
print('disability test ratio: ', disability_test_len/disability_total)

disability_train size:  12798
disability_val size:  2134
disability_test size:  3733
disability total:  18665
disability train ratio:  0.6856683632467184
disability val ratio:  0.11433163675328155
disability test ratio:  0.2


In [16]:
print('\nStratified Sampling Sanity Check for Disability')
disability_train_phys = (disability_train_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_train_intel = (disability_train_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_train_psych = (disability_train_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_train_other = (disability_train_df['disability_subtype']=='other_disability').astype(int).sum()
disability_train_total = len(disability_train_df['disability_subtype'])
print('\nDisability Train')
print('=====================')
print('phys:\t', disability_train_phys/disability_train_total)
print('intel:\t', disability_train_intel/disability_train_total)
print('psych:\t', disability_train_psych/disability_train_total)
print('other_disability:\t', disability_train_other/disability_train_total)

disability_val_phys = (disability_val_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_val_intel = (disability_val_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_val_psych = (disability_val_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_val_other = (disability_val_df['disability_subtype']=='other_disability').astype(int).sum()
disability_val_total = len(disability_val_df['disability_subtype'])
print('\nDisability Val')
print('=====================')
print('phys:\t', disability_val_phys/disability_val_total)
print('intel:\t', disability_val_intel/disability_val_total)
print('psych:\t', disability_val_psych/disability_val_total)
print('other_disability:\t', disability_val_other/disability_val_total)

disability_test_phys = (disability_test_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_test_intel = (disability_test_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_test_psych = (disability_test_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_test_other = (disability_test_df['disability_subtype']=='other_disability').astype(int).sum()
disability_test_total = len(disability_test_df['disability_subtype'])
print('\nDisability Test')
print('=====================')
print('phys:\t', disability_test_phys/disability_test_total)
print('intel:\t', disability_test_intel/disability_test_total)
print('psych:\t', disability_test_psych/disability_test_total)
print('other_disability:\t', disability_test_other/disability_test_total)


Stratified Sampling Sanity Check for Disability

Disability Train
phys:	 0.15010157837162055
intel:	 0.11525238318487263
psych:	 0.5926707298015315
other_disability:	 0.1419753086419753

Disability Val
phys:	 0.14995313964386128
intel:	 0.11527647610121837
psych:	 0.5927835051546392
other_disability:	 0.14198687910028115

Disability Test
phys:	 0.15001339405304046
intel:	 0.11518885614787035
psych:	 0.5928207875703188
other_disability:	 0.14197696222877043


### Export disability split datasets into csv

In [17]:
disability_df.to_csv('data/disability-dataset-full.csv')
disability_train_df.to_csv('data/disability-dataset-train.csv')
disability_val_df.to_csv('data/disability-dataset-val.csv')
disability_test_df.to_csv('data/disability-dataset-test.csv')
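
Note that `to_csv` writes the dataframe index as an unnamed first column by default, so the original row ids are preserved in the exported files. When reading a file back, that column shows up as `Unnamed: 0` unless `index_col=0` is passed. A minimal sketch using an in-memory buffer and a made-up frame (`toy_df` is hypothetical):

```python
import io

import pandas as pd

toy_df = pd.DataFrame(
    {"comment_text": ["a", "b"], "toxicity_binary": [0, 1]},
    index=[7705, 8073],
)

# Same default as the cell above: the index is written as the first column
buffer = io.StringIO()
toy_df.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading back without index_col treats the old index as a regular column
naive = pd.read_csv(io.StringIO(csv_text))

# Reading back with index_col=0 restores the original row ids as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(restored.index.tolist())
```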

### Address Data Imbalance Between Disability and Non-Disability Subsets
All of the non-disability identities have many more records than the disability subset. For the interweaving technique, we want the disability and non-disability subsets to be balanced. That is, we don't want whatever is learned from the disability subset to be overpowered by the non-disability subsets because of training on more non-disability examples. Therefore, we'll do stratified undersampling of the non-disability subsets so that they're about the same size as the disability subset. (Since disability is our focus and we're only adding other identity groups to help with predicting disability-related comments, we're okay with discarding data for the other identity groups.)

#### Capture number of disability samples in disability train, val, and test set to be used in the undersampling for non-disability identities:

In [18]:
num_disability_train_samples = len(disability_train_df)
num_disability_val_samples = len(disability_val_df)
num_disability_test_samples = len(disability_test_df)

# Prepare non-disability subsets

**Overview**
- We want each non-disability subset to have the same split sizes as the disability subset: roughly 70% train, 10% val, and 20% test.
- Each split will be stratified on each identity's subtype.
- The disability subset has 18,665 comments. Since the non-disability subsets have more examples, we'll undersample each of them down to the same 18,665 comments. That way the models don't favor whichever identity group has more examples.

**Splitting Method**

We'll use the train_test_split() method and since it only creates two splits, we'll take the following steps to create the three train/val/test splits:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the combined train and validation set from step 1 and divide it into train and val sets. We hold out 1/7 of the combined set for validation and keep 6/7 for training, which gives roughly 68.6% train and 11.4% validation overall (an exact 70%/10% split would need a validation fraction of 1/8 here).

## Gender subset

Create gender subset:

In [19]:
gender_df = all_data_df_cleansed[(all_data_df_cleansed['male'] > 0) | 
           (all_data_df_cleansed['female'] > 0) | 
           (all_data_df_cleansed['transgender'] > 0) | 
           (all_data_df_cleansed['other_gender'] > 0)]
gender_df.shape

(137722, 27)

Add gender subtype column:

In [20]:
gender_subtypes_df = gender_df[['male','female','transgender','other_gender']]
gender_df = gender_df.assign(gender_subtype=gender_subtypes_df.idxmax(axis=1))
gender_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,gender_subtype
7681,Blame men. There's always an excuse to blame ...,1,0.545455,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male
7682,And the woman exposing herself saying grab thi...,1,0.728571,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,female
7691,Are you a Pilgrim?\nWhy arn't you growing your...,1,0.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male
7699,"No, he was accused of being a racist white man.",0,0.363636,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male
7709,How do we fight agaisnt women who use sexual f...,1,0.8,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,male


Split the data:

In [21]:
# Split into 80% combined for train and val, and 20% test
# Perform undersampling by specifying train_size and test_size corresponding to disability splits
gender_combined_df, gender_test_df = train_test_split(gender_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=gender_df['gender_subtype'])

# Split the combined set into 6/7 train and 1/7 val (about 68.6% and 11.4% of the subset)
gender_train_df, gender_val_df = train_test_split(gender_combined_df,
                                       test_size=1/7,
                                       random_state=266, stratify=gender_combined_df['gender_subtype'])
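
The undersampling in the cell above relies on `train_test_split` accepting integer `train_size` and `test_size` values whose sum is smaller than the dataset: rows that land in neither split are simply discarded. A minimal sketch on made-up data (`toy_df`, the subtype names, and the target sizes are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy subset with 1000 rows, far more than we want to keep
toy_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(1000)],
    "subtype": ["male"] * 600 + ["female"] * 300 + ["other"] * 100,
})

# Absolute sizes play the role of the disability split sizes
target_combined, target_test = 80, 20

toy_combined, toy_test = train_test_split(
    toy_df,
    train_size=target_combined,
    test_size=target_test,
    random_state=266,
    stratify=toy_df["subtype"],
)

# 100 rows are kept in total; the remaining 900 are dropped
print(len(toy_combined), len(toy_test))
print(toy_combined["subtype"].value_counts(normalize=True))
```

Because the draw is stratified, the kept rows preserve the subtype proportions of the full subset while matching the requested sizes.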

How big is each split?

In [22]:
gender_train_len = len(gender_train_df)
gender_val_len = len(gender_val_df)
gender_test_len = len(gender_test_df)
gender_total = gender_train_len+gender_val_len+gender_test_len
print('gender_train size: ', gender_train_len)
print('gender_val size: ', gender_val_len)
print('gender_test size: ', gender_test_len)
print('gender total: ', gender_total)
print('gender train ratio: ', gender_train_len/gender_total)
print('gender val ratio: ', gender_val_len/gender_total)
print('gender test ratio: ', gender_test_len/gender_total)

gender_train size:  12798
gender_val size:  2134
gender_test size:  3733
gender total:  18665
gender train ratio:  0.6856683632467184
gender val ratio:  0.11433163675328155
gender test ratio:  0.2


In [23]:
print('\nStratified Sampling Sanity Check for Gender')
gender_train_male = (gender_train_df['gender_subtype']=='male').astype(int).sum()
gender_train_female = (gender_train_df['gender_subtype']=='female').astype(int).sum()
gender_train_trans = (gender_train_df['gender_subtype']=='transgender').astype(int).sum()
gender_train_other = (gender_train_df['gender_subtype']=='other_gender').astype(int).sum()
gender_train_total = len(gender_train_df['gender_subtype'])
print('\nGender Train')
print('=====================')
print('male:\t', gender_train_male/gender_train_total)
print('female:\t', gender_train_female/gender_train_total)
print('transgender:\t', gender_train_trans/gender_train_total)
print('other_gender:\t', gender_train_other/gender_train_total)

gender_val_male = (gender_val_df['gender_subtype']=='male').astype(int).sum()
gender_val_female = (gender_val_df['gender_subtype']=='female').astype(int).sum()
gender_val_trans = (gender_val_df['gender_subtype']=='transgender').astype(int).sum()
gender_val_other = (gender_val_df['gender_subtype']=='other_gender').astype(int).sum()
gender_val_total = len(gender_val_df['gender_subtype'])
print('\nGender Val')
print('=====================')
print('male:\t', gender_val_male/gender_val_total)
print('female:\t', gender_val_female/gender_val_total)
print('transgender:\t', gender_val_trans/gender_val_total)
print('other_gender:\t', gender_val_other/gender_val_total)

gender_test_male = (gender_test_df['gender_subtype']=='male').astype(int).sum()
gender_test_female = (gender_test_df['gender_subtype']=='female').astype(int).sum()
gender_test_trans = (gender_test_df['gender_subtype']=='transgender').astype(int).sum()
gender_test_other = (gender_test_df['gender_subtype']=='other_gender').astype(int).sum()
gender_test_total = len(gender_test_df['gender_subtype'])
print('\nGender Test')
print('=====================')
print('male:\t', gender_test_male/gender_test_total)
print('female:\t', gender_test_female/gender_test_total)
print('transgender:\t', gender_test_trans/gender_test_total)
print('other_gender:\t', gender_test_other/gender_test_total)


Stratified Sampling Sanity Check for Gender

Gender Train
male:	 0.5307079231129864
female:	 0.42545710267229253
transgender:	 0.031411157993436474
other_gender:	 0.012423816221284576

Gender Val
male:	 0.5304592314901593
female:	 0.42549203373945643
transgender:	 0.03139643861293346
other_gender:	 0.012652296157450796

Gender Test
male:	 0.5306723814626306
female:	 0.42539512456469325
transgender:	 0.03134208411465309
other_gender:	 0.012590409858023038


Check positive/negative labels balance:

In [24]:
neg, pos = np.bincount(gender_df['toxicity_binary'])
total = neg + pos
print('Gender ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Gender ALL Examples:
    Total: 137722
    Positive: 19996 (14.52% of total)



In [25]:
neg, pos = np.bincount(gender_train_df['toxicity_binary'])
total = neg + pos
print('Gender Train Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Gender Train Examples:
    Total: 12798
    Positive: 1846 (14.42% of total)



### Export gender split datasets into csv

In [26]:
gender_df.to_csv('data/gender-dataset-full.csv')
gender_train_df.to_csv('data/gender-dataset-train.csv')
gender_val_df.to_csv('data/gender-dataset-val.csv')
gender_test_df.to_csv('data/gender-dataset-test.csv')

## Sexual orientation subset

Create sexual orientation subset:

In [27]:
sexual_orientation_df = all_data_df_cleansed[(all_data_df_cleansed['heterosexual'] > 0) | 
           (all_data_df_cleansed['homosexual_gay_or_lesbian'] > 0) | 
           (all_data_df_cleansed['bisexual'] > 0) | 
           (all_data_df_cleansed['other_sexual_orientation'] > 0)]
sexual_orientation_df.shape

(22649, 27)

Add sexual orientation subtype column:

In [28]:
sexual_orientation_subtypes_df = sexual_orientation_df[['heterosexual','homosexual_gay_or_lesbian','bisexual','other_sexual_orientation']]
sexual_orientation_df = sexual_orientation_df.assign(sexual_orientation_subtype=sexual_orientation_subtypes_df.idxmax(axis=1))
sexual_orientation_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,sexual_orientation_subtype
7695,"Ridiculous, indeed. Although Rome does seem to...",1,0.857143,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian
7716,It took them long enough. And it goes against ...,0,0.181818,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian
7746,Well now Murray can simply go for the bisexual...,1,0.8,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,bisexual
7890,He probably thoughthimself gay when he did not...,0,0.453333,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian
7894,Sodomy isn't exclusive to homosexuals. I can s...,0,0.4375,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,homosexual_gay_or_lesbian


Split the data:

In [29]:
# Split into 80% combined for train and val, and 20% test
# Perform undersampling by specifying train_size and test_size corresponding to disability splits
sexual_orientation_combined_df, sexual_orientation_test_df = train_test_split(sexual_orientation_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=sexual_orientation_df['sexual_orientation_subtype'])

# Split the combined set into 6/7 train and 1/7 val (about 68.6% and 11.4% of the subset)
sexual_orientation_train_df, sexual_orientation_val_df = train_test_split(sexual_orientation_combined_df,
                                       test_size=1/7,
                                       random_state=266, stratify=sexual_orientation_combined_df['sexual_orientation_subtype'])

How big is each split?

In [30]:
sexual_orientation_train_len = len(sexual_orientation_train_df)
sexual_orientation_val_len = len(sexual_orientation_val_df)
sexual_orientation_test_len = len(sexual_orientation_test_df)
sexual_orientation_total = sexual_orientation_train_len+sexual_orientation_val_len+sexual_orientation_test_len
print('sexual_orientation_train size: ', sexual_orientation_train_len)
print('sexual_orientation_val size: ', sexual_orientation_val_len)
print('sexual_orientation_test size: ', sexual_orientation_test_len)
print('sexual_orientation total: ', sexual_orientation_total)
print('sexual_orientation train ratio: ', sexual_orientation_train_len/sexual_orientation_total)
print('sexual_orientation val ratio: ', sexual_orientation_val_len/sexual_orientation_total)
print('sexual_orientation test ratio: ', sexual_orientation_test_len/sexual_orientation_total)

sexual_orientation_train size:  12798
sexual_orientation_val size:  2134
sexual_orientation_test size:  3733
sexual_orientation total:  18665
sexual_orientation train ratio:  0.6856683632467184
sexual_orientation val ratio:  0.11433163675328155
sexual_orientation test ratio:  0.2


In [31]:
print('\nStratified Sampling Sanity Check for Sexual Orientation')
sexual_orientation_train_hetero = (sexual_orientation_train_df['sexual_orientation_subtype']=='heterosexual').astype(int).sum()
sexual_orientation_train_homo = (sexual_orientation_train_df['sexual_orientation_subtype']=='homosexual_gay_or_lesbian').astype(int).sum()
sexual_orientation_train_bi = (sexual_orientation_train_df['sexual_orientation_subtype']=='bisexual').astype(int).sum()
sexual_orientation_train_other = (sexual_orientation_train_df['sexual_orientation_subtype']=='other_sexual_orientation').astype(int).sum()
sexual_orientation_train_total = len(sexual_orientation_train_df['sexual_orientation_subtype'])
print('\nSexual Orientation Train')
print('=====================')
print('hetero:\t', sexual_orientation_train_hetero/sexual_orientation_train_total)
print('homo:\t', sexual_orientation_train_homo/sexual_orientation_train_total)
print('bi:\t', sexual_orientation_train_bi/sexual_orientation_train_total)
print('other_sexual_orientation:\t', sexual_orientation_train_other/sexual_orientation_train_total)

sexual_orientation_val_hetero = (sexual_orientation_val_df['sexual_orientation_subtype']=='heterosexual').astype(int).sum()
sexual_orientation_val_homo = (sexual_orientation_val_df['sexual_orientation_subtype']=='homosexual_gay_or_lesbian').astype(int).sum()
sexual_orientation_val_bi = (sexual_orientation_val_df['sexual_orientation_subtype']=='bisexual').astype(int).sum()
sexual_orientation_val_other = (sexual_orientation_val_df['sexual_orientation_subtype']=='other_sexual_orientation').astype(int).sum()
sexual_orientation_val_total = len(sexual_orientation_val_df['sexual_orientation_subtype'])
print('\nSexual Orientation Val')
print('=====================')
print('hetero:\t', sexual_orientation_val_hetero/sexual_orientation_val_total)
print('homo:\t', sexual_orientation_val_homo/sexual_orientation_val_total)
print('bi:\t', sexual_orientation_val_bi/sexual_orientation_val_total)
print('other_sexual_orientation:\t', sexual_orientation_val_other/sexual_orientation_val_total)

sexual_orientation_test_hetero = (sexual_orientation_test_df['sexual_orientation_subtype']=='heterosexual').astype(int).sum()
sexual_orientation_test_homo = (sexual_orientation_test_df['sexual_orientation_subtype']=='homosexual_gay_or_lesbian').astype(int).sum()
sexual_orientation_test_bi = (sexual_orientation_test_df['sexual_orientation_subtype']=='bisexual').astype(int).sum()
sexual_orientation_test_other = (sexual_orientation_test_df['sexual_orientation_subtype']=='other_sexual_orientation').astype(int).sum()
sexual_orientation_test_total = len(sexual_orientation_test_df['sexual_orientation_subtype'])
print('\nSexual Orientation test')
print('=====================')
print('hetero:\t', sexual_orientation_test_hetero/sexual_orientation_test_total)
print('homo:\t', sexual_orientation_test_homo/sexual_orientation_test_total)
print('bi:\t', sexual_orientation_test_bi/sexual_orientation_test_total)
print('other_sexual_orientation:\t', sexual_orientation_test_other/sexual_orientation_test_total)


Stratified Sampling Sanity Check for Sexual Orientation

Sexual Orientation Train
hetero:	 0.10134395999374902
homo:	 0.7101109548366933
bi:	 0.042115955618065325
other_sexual_orientation:	 0.14642912955149243

Sexual Orientation Val
hetero:	 0.10121836925960637
homo:	 0.7104029990627929
bi:	 0.04217432052483599
other_sexual_orientation:	 0.14620431115276475

Sexual Orientation test
hetero:	 0.10125904098580231
homo:	 0.7101526922046612
bi:	 0.04205732654701313
other_sexual_orientation:	 0.14653094026252345


Check positive/negative labels balance:

In [32]:
neg, pos = np.bincount(sexual_orientation_df['toxicity_binary'])
total = neg + pos
print('Sexual Orientation ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Sexual Orientation ALL Examples:
    Total: 22649
    Positive: 5119 (22.60% of total)



In [33]:
neg, pos = np.bincount(sexual_orientation_train_df['toxicity_binary'])
total = neg + pos
print('Sexual Orientation Train Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Sexual Orientation Train Examples:
    Total: 12798
    Positive: 2834 (22.14% of total)



### Export sexual orientation split datasets into csv

In [34]:
sexual_orientation_df.to_csv('data/sexual_orientation-dataset-full.csv')
sexual_orientation_train_df.to_csv('data/sexual_orientation-dataset-train.csv')
sexual_orientation_val_df.to_csv('data/sexual_orientation-dataset-val.csv')
sexual_orientation_test_df.to_csv('data/sexual_orientation-dataset-test.csv')

## Religion subset

Create religion subset:

In [35]:
religion_df = all_data_df_cleansed[(all_data_df_cleansed['christian'] > 0) | 
           (all_data_df_cleansed['jewish'] > 0) | 
           (all_data_df_cleansed['muslim'] > 0) | 
           (all_data_df_cleansed['hindu'] > 0) | 
           (all_data_df_cleansed['buddhist'] > 0) | 
           (all_data_df_cleansed['atheist'] > 0) | 
           (all_data_df_cleansed['other_religion'] > 0)]
religion_df.shape

Add religion subtype column:

In [36]:
religion_subtypes_df = religion_df[['christian','jewish','muslim','hindu','buddhist','atheist','other_religion']]
religion_df = religion_df.assign(religion_subtype=religion_subtypes_df.idxmax(axis=1))
religion_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,religion_subtype
7678,OH yes - Were those evil Christian Missionarie...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,christian
7689,"Lela, you admit no records exist to support yo...",0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,atheist
7701,The robot censor seems disinclined to accept s...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,muslim
7704,Agreed: there's no equivalence. What is stoppi...,1,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,christian
7716,It took them long enough. And it goes against ...,0,0.181818,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,christian


Split the data:

In [37]:
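# Split off the test set (20% of the sampled rows); the rest is the combined train + val set
# Perform undersampling by specifying train_size and test_size corresponding to disability splits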
religion_combined_df, religion_test_df = train_test_split(religion_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=religion_df['religion_subtype'])

# Split the combined set: 1/7 goes to val (about 11.4% of the sampled rows) and 6/7 to train (about 68.6%)
religion_train_df, religion_val_df = train_test_split(religion_combined_df,
                                       test_size=1/7,
                                       random_state=266, stratify=religion_combined_df['religion_subtype'])

How big is each split?

In [38]:
religion_train_len = len(religion_train_df)
religion_val_len = len(religion_val_df)
religion_test_len = len(religion_test_df)
religion_total = religion_train_len+religion_val_len+religion_test_len
print('religion_train size: ', religion_train_len)
print('religion_val size: ', religion_val_len)
print('religion_test size: ', religion_test_len)
print('religion total: ', religion_total)
print('religion train ratio: ', religion_train_len/religion_total)
print('religion val ratio: ', religion_val_len/religion_total)
print('religion test ratio: ', religion_test_len/religion_total)

religion_train size:  12798
religion_val size:  2134
religion_test size:  3733
religion total:  18665
religion train ratio:  0.6856683632467184
religion val ratio:  0.11433163675328155
religion test ratio:  0.2
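
The ratios printed above are not exactly 70/10/20 because the second split takes 1/7 of the combined train + val rows. The sketch below reproduces the arithmetic from the row counts in the output above. It assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows.

```python
import math

# Row counts copied from the output above.
n_combined = 12798 + 2134   # rows passed to the second train_test_split call
n_test = 3733               # rows held out by the first call

# Assumption: a float test_size is converted to ceil(test_size * n_samples).
n_val = math.ceil(n_combined * (1 / 7))
n_train = n_combined - n_val
n_total = n_combined + n_test

print(n_train, n_val, n_test)
print(round(n_train / n_total, 4), round(n_val / n_total, 4), round(n_test / n_total, 4))
```

An exact 70/10/20 split would need `test_size=1/8` in the second call, since 10% of the total is one eighth of the 80% that remains after the test set is removed.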


In [39]:
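# Note: only five of the seven religion subtypes are checked here; 'buddhist' and
# 'atheist' are not printed, so the fractions below do not sum to 1.
print('\nStratified Sampling Sanity Check for Religion')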
religion_train_christian = (religion_train_df['religion_subtype']=='christian').astype(int).sum()
religion_train_jewish = (religion_train_df['religion_subtype']=='jewish').astype(int).sum()
religion_train_muslim = (religion_train_df['religion_subtype']=='muslim').astype(int).sum()
religion_train_hindu = (religion_train_df['religion_subtype']=='hindu').astype(int).sum()
religion_train_other = (religion_train_df['religion_subtype']=='other_religion').astype(int).sum()
religion_train_total = len(religion_train_df['religion_subtype'])
print('\nReligion Train')
print('=====================')
print('christian:\t', religion_train_christian/religion_train_total)
print('jewish:\t', religion_train_jewish/religion_train_total)
print('muslim:\t', religion_train_muslim/religion_train_total)
print('hindu:\t', religion_train_hindu/religion_train_total)
print('other_religion:\t', religion_train_other/religion_train_total)

religion_val_christian = (religion_val_df['religion_subtype']=='christian').astype(int).sum()
religion_val_jewish = (religion_val_df['religion_subtype']=='jewish').astype(int).sum()
religion_val_muslim = (religion_val_df['religion_subtype']=='muslim').astype(int).sum()
religion_val_hindu = (religion_val_df['religion_subtype']=='hindu').astype(int).sum()
religion_val_other = (religion_val_df['religion_subtype']=='other_religion').astype(int).sum()
religion_val_total = len(religion_val_df['religion_subtype'])
print('\nReligion Val')
print('=====================')
print('christian:\t', religion_val_christian/religion_val_total)
print('jewish:\t', religion_val_jewish/religion_val_total)
print('muslim:\t', religion_val_muslim/religion_val_total)
print('hindu:\t', religion_val_hindu/religion_val_total)
print('other_religion:\t', religion_val_other/religion_val_total)

religion_test_christian = (religion_test_df['religion_subtype']=='christian').astype(int).sum()
religion_test_jewish = (religion_test_df['religion_subtype']=='jewish').astype(int).sum()
religion_test_muslim = (religion_test_df['religion_subtype']=='muslim').astype(int).sum()
religion_test_hindu = (religion_test_df['religion_subtype']=='hindu').astype(int).sum()
religion_test_other = (religion_test_df['religion_subtype']=='other_religion').astype(int).sum()
religion_test_total = len(religion_test_df['religion_subtype'])
print('\nReligion Test')
print('=====================')
print('christian:\t', religion_test_christian/religion_test_total)
print('jewish:\t', religion_test_jewish/religion_test_total)
print('muslim:\t', religion_test_muslim/religion_test_total)
print('hindu:\t', religion_test_hindu/religion_test_total)
print('other_religion:\t', religion_test_other/religion_test_total)


Stratified Sampling Sanity Check for Religion

Religion Train
christian:	 0.6294733552117519
jewish:	 0.07172995780590717
muslim:	 0.23261447101109547
hindu:	 0.007735583684950774
other_religion:	 0.037115174245975935

Religion Val
christian:	 0.6293345829428304
jewish:	 0.07169634489222118
muslim:	 0.23289597000937207
hindu:	 0.007497656982193065
other_religion:	 0.037019681349578254

Religion Test
christian:	 0.6295204929011519
jewish:	 0.07179212429681221
muslim:	 0.2325207607822127
hindu:	 0.0077685507634610235
other_religion:	 0.03696758639164211
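
The same sanity check can be written more compactly with `value_counts(normalize=True)`, which covers every subtype without naming each one. This is a minimal sketch on toy data; `subtype_shares`, `toy_train`, and `toy_val` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def subtype_shares(df, column):
    """Return the fraction of rows per category in `column`, sorted by category name."""
    return df[column].value_counts(normalize=True).sort_index()

# Toy frames standing in for religion_train_df / religion_val_df (hypothetical data).
toy_train = pd.DataFrame({'religion_subtype': ['christian'] * 6 + ['muslim'] * 3 + ['jewish']})
toy_val = pd.DataFrame({'religion_subtype': ['christian'] * 3 + ['muslim'] + ['jewish']})

# One column per split makes the shares easy to compare side by side.
comparison = pd.DataFrame({
    'train': subtype_shares(toy_train, 'religion_subtype'),
    'val': subtype_shares(toy_val, 'religion_subtype'),
})
print(comparison)
```

With the real splits, the call would be `subtype_shares(religion_train_df, 'religion_subtype')`, and likewise for the val and test frames.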


Check the balance of positive and negative labels:

In [40]:
neg, pos = np.bincount(religion_df['toxicity_binary'])
total = neg + pos
print('Religion ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Religion ALL Examples:
    Total: 101410
    Positive: 12581 (12.41% of total)



In [41]:
neg, pos = np.bincount(religion_train_df['toxicity_binary'])
total = neg + pos
print('Religion Train Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Religion Train Examples:
    Total: 12798
    Positive: 1575 (12.31% of total)
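
The splits are stratified on the subtype column, not on `toxicity_binary`, so the label balance is only approximately preserved (12.41% in the full subset versus 12.31% in train). The check can be wrapped in a small helper, sketched here on toy labels. `positive_rate` is a hypothetical name, and `minlength=2` is an added safeguard so that the unpacking still works if a split happens to contain no positive rows.

```python
import numpy as np
import pandas as pd

def positive_rate(labels):
    """Return (total, positives, percent positive) for a 0/1 label column."""
    # minlength=2 keeps the two-value unpacking valid even if one class is absent.
    neg, pos = np.bincount(labels, minlength=2)
    total = neg + pos
    return total, pos, 100 * pos / total

# Toy labels standing in for religion_train_df['toxicity_binary'] (hypothetical data).
toy_labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])
total, pos, pct = positive_rate(toy_labels)
print('Total: {}  Positive: {} ({:.2f}% of total)'.format(total, pos, pct))
```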



### Export religion split datasets into csv

In [42]:
religion_df.to_csv('data/religion-dataset-full.csv')
religion_train_df.to_csv('data/religion-dataset-train.csv')
religion_val_df.to_csv('data/religion-dataset-val.csv')
religion_test_df.to_csv('data/religion-dataset-test.csv')

## Race subset

Create race subset:

In [43]:
race_df = all_data_df_cleansed[(all_data_df_cleansed['black'] > 0) | 
           (all_data_df_cleansed['white'] > 0) | 
           (all_data_df_cleansed['asian'] > 0) | 
           (all_data_df_cleansed['latino'] > 0) | 
           (all_data_df_cleansed['other_race_or_ethnicity'] > 0)]
race_df.shape

(71648, 27)

Add race subtype column:

In [44]:
race_subtypes_df = race_df[['black', 'white', 'asian','latino', 'other_race_or_ethnicity']]
race_df = race_df.assign(race_subtype=race_subtypes_df.idxmax(axis=1))
race_df.head()

Unnamed: 0,comment_text,toxicity_binary,toxicity,male,female,transgender,other_gender,heterosexual,homosexual_gay_or_lesbian,bisexual,...,black,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,race_subtype
7679,Why is this black racist crap still on the G&M...,1,0.757143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,black
7680,even up here.......BLACKS!,1,0.688525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,black
7684,"""Let's get the black folks and the white folks...",1,0.736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,black
7691,Are you a Pilgrim?\nWhy arn't you growing your...,1,0.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,white
7696,And there it is. Our president is a white supr...,1,0.507042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,white
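
`idxmax(axis=1)` assigns each comment the identity column with the highest score. When two columns tie, pandas returns the first of them in column order, which is why row 7684 above (black 1.0, white 1.0) is labelled `black`. A minimal sketch with hypothetical scores:

```python
import pandas as pd

# Toy identity scores (hypothetical values; column names match the race subset).
scores = pd.DataFrame({
    'black': [1.0, 0.0, 0.2],
    'white': [1.0, 1.0, 0.0],
    'asian': [0.0, 0.0, 0.6],
})

# For each row, take the name of the column holding the largest value.
# Row 0 is a tie between 'black' and 'white'; the first column wins.
subtype = scores.idxmax(axis=1)
print(subtype)
```

Because ties are resolved by column order, the order of the list passed when selecting the subtype columns affects the assigned subtype for tied rows.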


Split the data:

In [45]:
# Split off the test set (20% of the sampled rows); the rest is the combined train + val set
# Perform undersampling by specifying train_size and test_size corresponding to disability splits
race_combined_df, race_test_df = train_test_split(race_df,
                                       train_size=num_disability_train_samples+num_disability_val_samples,
                                       test_size=num_disability_test_samples,
                                       random_state=266, stratify=race_df['race_subtype'])

# Split the combined set: 1/7 goes to val (about 11.4% of the sampled rows) and 6/7 to train (about 68.6%)
race_train_df, race_val_df = train_test_split(race_combined_df,
                                       test_size=1/7,
                                       random_state=266, stratify=race_combined_df['race_subtype'])
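
Passing integers for `train_size` and `test_size` asks `train_test_split` for absolute row counts. When the two counts add up to less than the full frame, the remaining rows are left out of both parts, which is how the undersampling above works. A minimal sketch on toy data (`toy_df` and its shares are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical data): 1000 rows in three subtypes with shares 0.6 / 0.3 / 0.1.
toy_df = pd.DataFrame({'race_subtype': ['white'] * 600 + ['black'] * 300 + ['asian'] * 100})

# Integer sizes are absolute row counts; rows beyond 80 + 20 end up in neither part.
toy_combined, toy_test = train_test_split(
    toy_df, train_size=80, test_size=20,
    random_state=266, stratify=toy_df['race_subtype'])

print(len(toy_combined), len(toy_test))
print(toy_combined['race_subtype'].value_counts(normalize=True))
```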

How big is each split?

In [46]:
race_train_len = len(race_train_df)
race_val_len = len(race_val_df)
race_test_len = len(race_test_df)
race_total = race_train_len+race_val_len+race_test_len
print('race_train size: ', race_train_len)
print('race_val size: ', race_val_len)
print('race_test size: ', race_test_len)
print('race total: ', race_total)
print('race train ratio: ', race_train_len/race_total)
print('race val ratio: ', race_val_len/race_total)
print('race test ratio: ', race_test_len/race_total)

race_train size:  12798
race_val size:  2134
race_test size:  3733
race total:  18665
race train ratio:  0.6856683632467184
race val ratio:  0.11433163675328155
race test ratio:  0.2


In [47]:
print('\nStratified Sampling Sanity Check for race')
race_train_black = (race_train_df['race_subtype']=='black').astype(int).sum()
race_train_white = (race_train_df['race_subtype']=='white').astype(int).sum()
race_train_asian = (race_train_df['race_subtype']=='asian').astype(int).sum()
race_train_latino = (race_train_df['race_subtype']=='latino').astype(int).sum()
race_train_other = (race_train_df['race_subtype']=='other_race_or_ethnicity').astype(int).sum()
race_train_total = len(race_train_df['race_subtype'])
print('\nRace Train')
print('=====================')
print('black:\t', race_train_black/race_train_total)
print('white:\t', race_train_white/race_train_total)
print('asian:\t', race_train_asian/race_train_total)
print('latino:\t', race_train_latino/race_train_total)
print('other_race_or_ethnicity:\t', race_train_other/race_train_total)

race_val_black = (race_val_df['race_subtype']=='black').astype(int).sum()
race_val_white = (race_val_df['race_subtype']=='white').astype(int).sum()
race_val_asian = (race_val_df['race_subtype']=='asian').astype(int).sum()
race_val_latino = (race_val_df['race_subtype']=='latino').astype(int).sum()
race_val_other = (race_val_df['race_subtype']=='other_race_or_ethnicity').astype(int).sum()
race_val_total = len(race_val_df['race_subtype'])
print('\nRace Val')
print('=====================')
print('black:\t', race_val_black/race_val_total)
print('white:\t', race_val_white/race_val_total)
print('asian:\t', race_val_asian/race_val_total)
print('latino:\t', race_val_latino/race_val_total)
print('other_race_or_ethnicity:\t', race_val_other/race_val_total)

race_test_black = (race_test_df['race_subtype']=='black').astype(int).sum()
race_test_white = (race_test_df['race_subtype']=='white').astype(int).sum()
race_test_asian = (race_test_df['race_subtype']=='asian').astype(int).sum()
race_test_latino = (race_test_df['race_subtype']=='latino').astype(int).sum()
race_test_other = (race_test_df['race_subtype']=='other_race_or_ethnicity').astype(int).sum()
race_test_total = len(race_test_df['race_subtype'])
print('\nRace Test')
print('=====================')
print('black:\t', race_test_black/race_test_total)
print('white:\t', race_test_white/race_test_total)
print('asian:\t', race_test_asian/race_test_total)
print('latino:\t', race_test_latino/race_test_total)
print('other_race_or_ethnicity:\t', race_test_other/race_test_total)


Stratified Sampling Sanity Check for race

Race Train
black:	 0.24910142209720268
white:	 0.3663072355055477
asian:	 0.13892795749335834
latino:	 0.07188623222378497
other_race_or_ethnicity:	 0.17377715268010627

Race Val
black:	 0.2492970946579194
white:	 0.36644798500468606
asian:	 0.1387066541705717
latino:	 0.07169634489222118
other_race_or_ethnicity:	 0.1738519212746017

Race Test
black:	 0.24912938655237074
white:	 0.3664612911867131
asian:	 0.13876238949906242
latino:	 0.07179212429681221
other_race_or_ethnicity:	 0.17385480846504153


Check the balance of positive and negative labels:

In [48]:
neg, pos = np.bincount(race_df['toxicity_binary'])
total = neg + pos
print('Race ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Race ALL Examples:
    Total: 71648
    Positive: 14682 (20.49% of total)



In [49]:
neg, pos = np.bincount(race_train_df['toxicity_binary'])
total = neg + pos
print('Race Train Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Race Train Examples:
    Total: 12798
    Positive: 2621 (20.48% of total)



### Export race split datasets into csv

In [50]:
race_df.to_csv('data/race-dataset-full.csv')
race_train_df.to_csv('data/race-dataset-train.csv')
race_val_df.to_csv('data/race-dataset-val.csv')
race_test_df.to_csv('data/race-dataset-test.csv')

# Reference code: How to load split data for use in our models

Load the csv files for each data split with the following code:

In [None]:
import pandas as pd

disability_df_train = pd.read_csv('data/disability-dataset-train.csv')
disability_df_val = pd.read_csv('data/disability-dataset-val.csv')
disability_df_test = pd.read_csv('data/disability-dataset-test.csv')
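
The export cells in this notebook call `to_csv` with its default `index=True`, so the original row index is written as an unnamed first column. Assuming the disability files were exported the same way, a plain `read_csv` brings that index back as a column named `Unnamed: 0`, while `index_col=0` restores it as the index. A minimal round-trip sketch on an in-memory buffer (`toy_df` is hypothetical):

```python
import io
import pandas as pd

# Toy frame with a non-default index (hypothetical data).
toy_df = pd.DataFrame(
    {'comment_text': ['first comment', 'second comment'], 'toxicity_binary': [0, 1]},
    index=[101, 205],
)

buffer = io.StringIO()
toy_df.to_csv(buffer)                         # default index=True writes the index column

buffer.seek(0)
plain = pd.read_csv(buffer)                   # saved index returns as 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index again

print(list(plain.columns))
print(list(restored.columns))
```

Either form works for the tensor conversion that follows, since only `toxicity_binary` and `comment_text` are used.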

Now that the data is loaded, we need the labels and text examples as tensors. The code below converts them:

In [None]:
import tensorflow as tf

# Form tensors of labels and features.
disability_train_labels = tf.convert_to_tensor(disability_df_train['toxicity_binary'])
disability_val_labels = tf.convert_to_tensor(disability_df_val['toxicity_binary'])
disability_test_labels = tf.convert_to_tensor(disability_df_test['toxicity_binary'])

disability_train_examples = tf.convert_to_tensor(disability_df_train['comment_text'])
disability_val_examples = tf.convert_to_tensor(disability_df_val['comment_text'])
disability_test_examples = tf.convert_to_tensor(disability_df_test['comment_text'])