# Data Preparation
Data source: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data?select=all_data.csv

In this notebook, we prepare the data downloaded from Kaggle and export the data subsets that we will feed into our models. The data preparation process includes cleaning the data, extracting subsets by identity category, and splitting each identity subset into train, validation, and test sets via stratified sampling.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split

## Load the data:
The Kaggle competition for this dataset came with separate CSV files for its train and test subsets. However, since the competition has ended, an `all_data.csv` file was released containing labels for both the train and test sets. Therefore, we'll use `all_data.csv` as our starting dataset.

In [3]:
all_data_df = pd.read_csv('drive/MyDrive/data/all_data.csv')

## Clean the data

EDA revealed that some rows have a missing value for `comment_text`. What do these rows look like?

In [4]:
all_data_df[pd.isna(all_data_df["comment_text"])]

Unnamed: 0,id,comment_text,split,created_date,publication_id,parent_id,article_id,rating,funny,wow,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
446630,392337,,train,2016-07-18 19:34:48.278774+00,13,392165.0,141670,approved,0,0,...,,,,,,,,,0,4


### Delete the rows with missing comments
Since these rows have no input text to feed into our models, they are unusable, so we'll remove them from the dataset.

In [5]:
all_data_df_cleansed = all_data_df.copy().drop(index=all_data_df[pd.isna(all_data_df['comment_text'])].index)

We can see that the dataset now has one fewer row:

In [6]:
all_data_df.shape

(1999516, 46)

In [7]:
all_data_df_cleansed.shape

(1999515, 46)
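
The same kind of row removal can also be expressed with `dropna`. The sketch below is a minimal illustration on a made-up toy frame (the values are not from the dataset); it assumes only that the text column is named `comment_text`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for all_data_df (values are invented for illustration).
toy_df = pd.DataFrame({
    "comment_text": ["first comment", np.nan, "third comment"],
    "toxicity": [0.1, 0.6, 0.3],
})

# Drop only the rows whose comment_text is missing; NaN in other columns is kept.
toy_cleansed = toy_df.dropna(subset=["comment_text"])

print(toy_df.shape, toy_cleansed.shape)
```

Restricting `subset` to `comment_text` matters here because the identity columns are legitimately NaN for many rows and should not trigger a drop.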

## Drop the columns we won't be using

In [8]:
all_data_df_cleansed = all_data_df_cleansed.drop(columns=['id', 'split', 'created_date', 'publication_id',
       'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes'])
all_data_df_cleansed.head()

Unnamed: 0,comment_text,disagree,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,male,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,He got his money... now he lies in wait till a...,0,0.373134,0.044776,0.089552,0.014925,0.0,0.343284,0.014925,,...,,,,,,,,,0,67
1,Mad dog will surely put the liberals in mental...,0,0.605263,0.013158,0.065789,0.013158,0.092105,0.565789,0.065789,,...,,,,,,,,,0,76
2,And Trump continues his lifelong cowardice by ...,7,0.666667,0.015873,0.031746,0.0,0.047619,0.666667,0.0,,...,,,,,,,,,0,63
3,"""while arresting a man for resisting arrest"".\...",0,0.815789,0.065789,0.552632,0.592105,0.0,0.684211,0.105263,,...,,,,,,,,,0,76
4,Tucker and Paul are both total bad ass mofo's.,0,0.55,0.0375,0.3375,0.275,0.0375,0.4875,0.0,,...,,,,,,,,,0,80


In [9]:
pd.options.display.max_colwidth = 600

In [10]:
all_data_df_cleansed.iloc[[715484-1]]

Unnamed: 0,comment_text,disagree,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,male,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
715484,"So a Christian can pledge allegiance to their religion first, but if a Muslim does then it's automatically more suspect? I'm not suggesting you explicitly said that, but inciting Sharia law or beheadings at the mention of Islam invokes that kind of double standard. As if Islam itself is to blame, not a complex mix if politics, history, and culture that leads to the extremes we see.\n\nI'm trying to get at the core assumptions here, as to the kind of rhetoric I see as harmful and unfair in it's correlation. It's this general belief that peaceful devotion to Islam is fundamentally dangerous ...",0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004457,...,0.0,0.000557,0.0,0.0039,0.000557,0.0,0.0,0.0,1795,4


### Add `toxicity_non_disability` column
Here we set the binary toxicity label for non-disability identity groups. For these groups, comments with toxicity >= 0.25 are considered toxic and receive a value of 1. Comments with toxicity < 0.25 receive a value of 0.

In [11]:
all_data_df_cleansed['toxicity_non_disability'] = (all_data_df_cleansed['toxicity'] >= 0.25).astype(int)

In [12]:
all_data_df_cleansed[['toxicity','toxicity_non_disability']]

Unnamed: 0,toxicity,toxicity_non_disability
0,0.373134,1
1,0.605263,1
2,0.666667,1
3,0.815789,1
4,0.550000,1
...,...,...
1999511,0.400000,1
1999512,0.400000,1
1999513,0.400000,1
1999514,0.400000,1
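
To make the effect of the cut-off easy to check, the labeling step can be wrapped in a small helper that takes the threshold as a parameter. This is a sketch only: `binarize_toxicity` is a hypothetical name, and the scores are toy values.

```python
import pandas as pd

def binarize_toxicity(scores: pd.Series, threshold: float) -> pd.Series:
    """Return 1 where score >= threshold, else 0 (hypothetical helper)."""
    return (scores >= threshold).astype(int)

# Toy scores chosen to sit on either side of the two candidate thresholds.
toy_scores = pd.Series([0.0, 0.24, 0.25, 0.4, 0.5, 0.8])

labels_025 = binarize_toxicity(toy_scores, 0.25)
labels_050 = binarize_toxicity(toy_scores, 0.5)

print(pd.DataFrame({"toxicity": toy_scores, "t=0.25": labels_025, "t=0.5": labels_050}))
```

Because the comparison is `>=`, a score exactly equal to the threshold is labeled positive.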


Move the new `toxicity_non_disability` column towards the front of the dataframe:

In [13]:
orig_cols = all_data_df_cleansed.columns.tolist()
reordered_cols = orig_cols[:2] + orig_cols[-1:] + orig_cols[2:-1]
all_data_df_cleansed = all_data_df_cleansed[reordered_cols]
all_data_df_cleansed.head()

Unnamed: 0,comment_text,disagree,toxicity_non_disability,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,He got his money... now he lies in wait till after the election in 2 yrs.... dirty politicians need to be afraid of Tar and feathers again... but they aren't and so the people get screwed.,0,1,0.373134,0.044776,0.089552,0.014925,0.0,0.343284,0.014925,...,,,,,,,,,0,67
1,Mad dog will surely put the liberals in mental hospitals. Boorah,0,1,0.605263,0.013158,0.065789,0.013158,0.092105,0.565789,0.065789,...,,,,,,,,,0,76
2,And Trump continues his lifelong cowardice by not making this announcement himself.\n\nWhat an awful human being .....,7,1,0.666667,0.015873,0.031746,0.0,0.047619,0.666667,0.0,...,,,,,,,,,0,63
3,"""while arresting a man for resisting arrest"".\n\nIf you cop-suckers can't see a problem with this, then go suck the barrel of a Glock.",0,1,0.815789,0.065789,0.552632,0.592105,0.0,0.684211,0.105263,...,,,,,,,,,0,76
4,Tucker and Paul are both total bad ass mofo's.,0,1,0.55,0.0375,0.3375,0.275,0.0375,0.4875,0.0,...,,,,,,,,,0,80
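
The list slicing used for the reordering can be read as "first two columns, then the last column, then everything in between". A toy sketch with invented column values:

```python
import pandas as pd

# Toy frame whose last column plays the role of the newly added label.
toy_df = pd.DataFrame({
    "comment_text": ["a"],
    "disagree": [0],
    "toxicity": [0.3],
    "insult": [0.1],
    "new_label": [1],
})

cols = toy_df.columns.tolist()
# Keep the first two columns, move the last column to position 2, keep the rest in order.
reordered = cols[:2] + cols[-1:] + cols[2:-1]
toy_df = toy_df[reordered]

print(toy_df.columns.tolist())
```

This only works as intended when the new column is the last one in the frame at the time of reordering.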


# Prepare Disability Subset

## Create disability subset

In [14]:
disability_df = all_data_df_cleansed[(all_data_df_cleansed["physical_disability"] > 0) | 
           (all_data_df_cleansed["intellectual_or_learning_disability"] > 0) | 
           (all_data_df_cleansed["psychiatric_or_mental_illness"] > 0) | 
           (all_data_df_cleansed["other_disability"] > 0)]

In [None]:
disability_df.shape
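
The four OR-ed conditions can equivalently be written as a single row-wise test over the subtype columns. The sketch below uses toy values and a hypothetical `SUBTYPE_COLS` list; note that NaN compares as False, so rows without identity annotations are excluded either way.

```python
import numpy as np
import pandas as pd

# Hypothetical constant listing the four subtype columns used in this notebook.
SUBTYPE_COLS = [
    "physical_disability",
    "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness",
    "other_disability",
]

# Toy rows: no mention, physical mention, unannotated (NaN), psychiatric mention.
toy_df = pd.DataFrame({
    "physical_disability": [0.0, 0.25, np.nan, 0.0],
    "intellectual_or_learning_disability": [0.0, 0.0, np.nan, 0.0],
    "psychiatric_or_mental_illness": [0.0, 0.0, np.nan, 1.0],
    "other_disability": [0.0, 0.0, np.nan, 0.0],
})

# True for rows where at least one subtype score is strictly positive.
mask = toy_df[SUBTYPE_COLS].gt(0).any(axis=1)
toy_disability = toy_df[mask]

print(toy_disability.index.tolist())
```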

## Basic feature engineering

### Add disability subtype column
We'll add a categorical feature that specifies which of the following disability subtypes each comment corresponds to:

- `physical_disability`
- `intellectual_or_learning_disability`
- `psychiatric_or_mental_illness`
- `other_disability`

EDA revealed that some comments have nonzero values for more than one of the subtypes above. Since the purpose of this column is to facilitate stratified sampling, the disability subtype label for each comment will be the subtype with the largest value for that comment.
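
As a small illustration of the "largest value wins" rule, the sketch below applies `idxmax(axis=1)` to toy scores. The tie in the last row is an invented case: pandas returns the first column that holds the maximum, so column order decides ties.

```python
import pandas as pd

# Toy subtype scores; the third row is a deliberate tie between two subtypes.
toy_scores = pd.DataFrame({
    "physical_disability": [0.25, 0.75, 0.5],
    "intellectual_or_learning_disability": [0.0, 0.0, 0.5],
    "psychiatric_or_mental_illness": [0.0, 1.0, 0.0],
    "other_disability": [0.0, 0.0, 0.0],
})

# For each row, the label of the column holding the largest score.
toy_subtype = toy_scores.idxmax(axis=1)

print(toy_subtype.tolist())
```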

In [16]:
disability_subtypes_df = disability_df[['physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness','other_disability']]
disability_df = disability_df.assign(disability_subtype=disability_subtypes_df.idxmax(axis=1))
disability_df.head()

Unnamed: 0,comment_text,disagree,toxicity_non_disability,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,...,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count,disability_subtypes_total,disability_subtype
7705,No sympathy for these two knuckleheads.,0,1,0.689655,0.0,0.017241,0.0,0.0,0.689655,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,4,58,0.25,physical_disability
8073,Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.,0,1,0.8,0.0,0.0,0.0,0.1,0.7,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,4,10,1.0,psychiatric_or_mental_illness
8115,"Or.... maybe there IS chaos because the ""president"" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.",1,1,0.790323,0.032258,0.16129,0.016129,0.145161,0.774194,0.016129,...,0.0,0.0,0.0,0.0,1.0,0.0,4,62,1.0,psychiatric_or_mental_illness
8125,I'll take someone who's physically ill over one who's mentally ill.,0,1,0.352941,0.029412,0.014706,0.014706,0.088235,0.264706,0.014706,...,0.0,0.0,0.75,0.0,1.0,0.0,4,68,1.75,psychiatric_or_mental_illness
8263,"Mental Illness at work again, again, again, again.............This women placed a clamp on the boys genitalia. This woman SHOULD have a clamp placed on HER INNER and OUTER LIPS of her VAGINA...........and Never Removed.............",0,1,0.842857,0.114286,0.171429,0.814286,0.1,0.371429,0.257143,...,0.0,0.0,0.0,0.0,1.0,0.0,4,70,1.0,psychiatric_or_mental_illness


### Add `disability_subtypes_total` column
This column is the sum of the four disability subtype scores. We use it as a rough indicator of how likely a comment is to mention disability.

In [17]:
disability_df['disability_subtypes_total'] = disability_df['physical_disability']+disability_df['intellectual_or_learning_disability']+disability_df['psychiatric_or_mental_illness']+disability_df['other_disability']
disability_df['disability_subtypes_total']

7705       0.250000
8073       1.000000
8115       1.000000
8125       1.750000
8263       1.000000
             ...   
1999482    0.600000
1999507    0.700000
1999508    0.500000
1999514    0.003717
1999515    0.000640
Name: disability_subtypes_total, Length: 18665, dtype: float64
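
One detail worth knowing when summing score columns: adding Series with `+` propagates NaN, while `DataFrame.sum(axis=1)` skips NaN by default. The toy sketch below uses only two of the four columns to keep it short; which behavior is preferable for this dataset is left as an open choice.

```python
import numpy as np
import pandas as pd

# Toy scores; the last row has a missing value in one column.
toy = pd.DataFrame({
    "physical_disability": [0.25, 0.75, np.nan],
    "psychiatric_or_mental_illness": [0.0, 1.0, 0.5],
})

# Element-wise addition: any NaN operand makes the result NaN.
total_plus = toy["physical_disability"] + toy["psychiatric_or_mental_illness"]

# Row-wise sum: NaN values are skipped (skipna=True is the default).
total_sum = toy.sum(axis=1)

print(pd.DataFrame({"plus": total_plus, "sum": total_sum}))
```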

# Operationalize comments **mentioning disability** and toxic **ableist** comments

We need to operationalize the following:

1. What is considered *disability-related*?
1. What is considered *ableist* or *toxic towards people with disabilities*?

Comments whose disability subtype scores are all <= 0.1 mostly appear unrelated to disability, probably because there was too little annotator consensus on disability-relatedness. A few of them mention disability slightly, but most do not mention it at all. In the interest of time, we'll drop rows where `disability_subtypes_total` <= 0.1.

In [18]:
test_filter_condition = disability_df['disability_subtypes_total']<=0.1
display(disability_df.loc[test_filter_condition,
                    ['comment_text','toxicity','insult', 'threat','physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness','other_disability','disability_subtypes_total','identity_annotator_count']])

Unnamed: 0,comment_text,toxicity,insult,threat,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,disability_subtypes_total,identity_annotator_count
9884,"China is the worst possible ""global partner"" anyone could have, it is the biggest dysfunctional and theft-based society with ""entitlement"" attitude - and clearly the wealthy that immigrated here made Chinese-only ""organizations"". What a disgrace.",0.434211,0.381579,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
9902,"Trump is already selling his logo to Hindu contractors. In addition, he has to pay interest to the Russian mafia, who will accept access to projects as payment. Trump is selling his brand in Argentina in exchange for dollars. You Trump people are completely brain-dead. This will blow up in America's face very quickly. You and those like you are complete fools. Trump probably is immune. Let's see if the shameful idiotic Republican Congress has the balls to impeach this cheesy crotch-grabbing pervert lying bustard snake oil salesman. As if!",0.750000,0.750000,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
9904,"LW2: Your daughter is a nut. Amy's right, she needs to get help.",0.714286,0.714286,0.014286,0.000000,0.0,0.1,0.00000,0.100000,10
9907,"What a piece of GARBAGE! Obviously written by a left wing nut that resents the knowledge, experience and power that comes with getting older. It seems that she (and many other clueless people) think that 'aging', working your way up and being a useful part of society in your 'golden years' is something to be ashamed of.\n\nBeing 'old' and part of society is not a dirty word, it's something to be proud of. Remember lady one day you will be an 'old white woman' how you going to feel then?",0.728571,0.657143,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
9944,"great, this guy was a phony, just like trump. and like trump he is a racist, bigot and stupid.",0.866667,0.866667,0.000000,0.000000,0.1,0.0,0.00000,0.100000,10
...,...,...,...,...,...,...,...,...,...,...
1999269,"Pedophiles are rarely interested in a child's sexual orientation; they are interested in satisfying their cravings in an illegal and socially unacceptable manner.\nSince there are ever so many more heterosexuals in the world than there are homosexuals, most child molestation is perpetrated by heterosexuals; also most is perpetrated by men. So, you are suggesting that heterosexual men are trying to change the sexual orientation of children?\nAnother deep breath might be in order.\nIt would be most worthwhile to read the following article by a research psychologist at the University of Cali...",0.400000,0.000000,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
1999375,"I'm glad you brought this up, Mike AA. And there's more. As someone who has lost my husband to cancer and have been through the indescribable anguish and unbearable pain of sitting at the bedside of your terminally ill loved one you're about to have to let go, what I found particularly outrageous among the many new lows Trump has stooped to in the last few days: His call to terminally ill people to stick around long enough so they can vote for him on November 8. This man is devoid of even the smallest spark of compassion and decency.",0.400000,0.400000,0.000000,0.100000,0.0,0.0,0.00000,0.100000,10
1999377,"Paul: I mention Jim Jones as the extreme example of the power of charismatic demogoguery. We already saw Trump invite his supporters to physically attack protesters at his rallies.....and they did so. I don't think he'll be passing out KoolAid, but I do fear if he's defeated he may very well refer to that ""fixed"" election which could easily encourage violent behavior from the most zealous of his followers. \n\nRe: the crying baby. If, indeed, he was ""goofin"", it was in extremely poor taste. Embarassing someone for a ""laugh"" is not funny. The woman had come to his rally. She was a T...",0.400000,0.400000,0.000000,0.000000,0.1,0.0,0.00000,0.100000,10
1999514,I just don't find her a very good representation of the transexual community. She just seems so self-absorbed & concerned with such superficial issues.,0.400000,0.100000,0.000000,0.003717,0.0,0.0,0.00000,0.003717,269


For example, this comment is not disability-related at all:

In [19]:
list(disability_df.loc[[1307202]]['comment_text'])[0]

'Real men eat oil for breakfast!'

This comment is longer, but it still is not disability-related:

In [20]:
list(disability_df.loc[[1627111]]['comment_text'])[0]

"I wouldn't say that at all.  High school students have traditionally been held responsible, and often expelled, for their conduct out of school.  Black students have a right not to feel threatened by fellow students, just as black co-workers have.  You make a mistake in saying the right is restricted to commercial interest - in both cases."

### Drop the rows that are unrelated to disability

In [21]:
disability_cleaned_df = disability_df.loc[disability_df['disability_subtypes_total']>0.1]
print('# disability rows before: ', len(disability_df))
print('# disability rows after: ', len(disability_cleaned_df))

# disability rows before:  18665
# disability rows after:  15158
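
A minimal sketch of the same threshold filter on toy values follows. Taking an explicit `.copy()` of the filtered rows is an addition here, not something the cell above does; it makes the result an independent frame, so columns can be added to it later without touching the original.

```python
import pandas as pd

# Toy totals straddling the 0.1 cut-off (values invented for illustration).
toy = pd.DataFrame({"disability_subtypes_total": [0.05, 0.1, 0.25, 1.0]})

# Keep rows strictly above the threshold; a total of exactly 0.1 is dropped.
kept = toy.loc[toy["disability_subtypes_total"] > 0.1].copy()

# Adding a column to the copy leaves the original frame unchanged.
kept["flag"] = 1

print(len(toy), len(kept))
```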


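After exploring thresholds on the disability subset, we concluded that positive binary toxicity labels should be `toxicity >= 0.25`, not `toxicity >= 0.5`. The cell below keeps the filters we used to inspect comments in different toxicity ranges.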

In [22]:
# test_filter_condition = disability_cleaned_df['toxicity_binary']==1
# test_filter_condition = disability_cleaned_df['toxicity_binary']==0
# test_filter_condition = disability_cleaned_df['toxicity']==0
# test_filter_condition = (disability_cleaned_df['toxicity']>0) & (disability_cleaned_df['toxicity']<0.5)
# test_filter_condition = disability_cleaned_df['toxicity']>0
# test_filter_condition = (disability_cleaned_df['toxicity']>0) & (disability_cleaned_df['toxicity']<0.3) # these should all be negative
# test_filter_condition = (disability_cleaned_df['toxicity']>0.3) & (disability_cleaned_df['toxicity']<0.35) # should be positive
# test_filter_condition = (disability_cleaned_df['toxicity']>0.2) & (disability_cleaned_df['toxicity']<0.25) # GREY AREA
# test_filter_condition = (disability_cleaned_df['toxicity']>0.22) & (disability_cleaned_df['toxicity']<0.25) # some should be negative, some positive
# test_filter_condition = disability_cleaned_df['toxicity']==0 # 8379 rows - CLEARLY NEGATIVE
# test_filter_condition = disability_cleaned_df['toxicity']<0.25 # 12007 rows - NEGATIVE
test_filter_condition = disability_cleaned_df['toxicity']<0.05
# test_filter_condition = (disability_cleaned_df['toxicity']>=0.25) # 6491 rows - POSITIVE
# test_filter_condition = disability_condition # <-- FOR BOTH NEGATIVE AND POSITIVE

disability_cleaned_df.loc[test_filter_condition,
                    ['comment_text','toxicity','identity_attack','insult', 'threat','physical_disability', 'intellectual_or_learning_disability', 'psychiatric_or_mental_illness', 'other_disability','identity_annotator_count','toxicity_annotator_count']]

Unnamed: 0,comment_text,toxicity,identity_attack,insult,threat,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
141881,"SNAP solves a very real and very immediate problem -- that of feeding people who cannot buy food with their own resources. These are real people, from infants to very old, working people, disabled people, those disadvantaged by societal expectations they can't meet, et., etc., etc. I spent the first month of food bank (SNAP and local charity funded) volunteer service in shock, meeting clients and helping them obtain VERY LITTLE food that needed to last them a month. \n\nFamiliarity along with empathy really do help achieve genuine ""charity"" -- the virtue you preach about interminably.",0.008666,0.004333,0.004333,0.000867,0.1,0.0,0.000000,0.100000,10,1154
207222,"Finally a politician who cares what constituents want! No one thinks people with a history of mental illness should have guns. No one needs an automatic weapon. Thanks , Manka!",0.029412,0.000000,0.029412,0.000000,0.0,0.0,0.500000,0.000000,10,34
272394,"I can't help but feel that if life is so bad that people have to get high to bear it, then something is dreadfully wrong with our society as a whole. I more than understand that life can have it's times of depression for a myriad of reasons, but to give one's soul and Spirit to some form of drug, just seems like an act of desperation. Why is there so much of it, especially among the younger generations?",0.023288,0.002740,0.016438,0.000685,0.0,0.0,0.000000,0.166667,6,1460
536064,"Here's other ideas: proper housing, clean water, roads on reserves, actual schools with tenured teachers, lower food prices in northern stores, mental health counsellors, doctors, nurses, hospitals, long term care facilities, community centres ...\n\nMs. Stronach, please put your considerable means and influence towards these improvements first before you start on your laptop program.",0.009332,0.000718,0.008615,0.000718,0.0,0.0,0.700000,0.000000,10,1393
715508,"You're right Greeleaf. I hadn't thought that he/she may be on medication or have a mental health problem. I did not intend to demean him/her ... just trying to help.\nMy apologies.\n\nBest,\n\nRTD",0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,4,4
...,...,...,...,...,...,...,...,...,...,...,...
1883916,"Recent experience and on topic discussions have shown Alaska's fiscal crisis as a real threat to the quality of all our lives. This particular tragedy is one , I wish could be debated as , times are tough , and dollars to help people looking for it, or deemed in need of help , are not there. \nThat is not the case when talking Mental Health , because there has not been dollars for the Mentally Ill for quite some time. The State when allocating dollars signing up with the new federal program, allocated 0 dollars for Mental Health, Let's repeat that amount for it truly does tell the sto...",0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.800000,0.000000,10,4
1883957,"Wait, do I understand this right? This kid was on anti-psychotics when he committed the crime. He's now being held involuntarily in a mental hospital which wants to treat him by putting him back on anti-psychotics. And the hospital has to rely on the courts to give them the right to put this patient back on the medication he was getting before he was arrested?\n\nI'm all for mental patients' civil rights, but this does seem pretty ridiculous.",0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.700000,0.000000,10,4
1883959,"There is so much more to why people are in prison and commit crimes. Many of the prisoners are people with mental illness that the courts deem too dangerous to let out. Thus they are given extreme sentences and left to rot inside. Most are damaged psychically and emotionally and often due to negligence or abuse as children. Understanding that these people are committing crimes in our society we need to deal with that, but we need to change the way we look at crime therapy and getting people back to being members of our communities. Remember that all of these people were once your neighbors...",0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.600000,0.000000,10,4
1883976,"What is the theological misconception? What is the scientific reality? We know that people are born with disabilities both physical and mental, that is the reality. The reality also is that someone with an perfectly normal male body is of the male sex and not female. The same goes for a woman with a perfectly normal female body is of the female sex and not male.\nA middle-aged married man with children who suddenly decides he is in the wrong body obviously has a mental problem. To humour him and encourage him to pretend that he is a woman is cruel. Medical science can give him the cosmeti...",0.000000,0.000000,0.000000,0.000000,0.6,0.0,0.700000,0.000000,10,4


### Add `toxicity_disability` column
Here we set the binary toxicity label for comments where disability is mentioned as determined by human annotators.

- Initially, we planned to set the toxicity label according to the dataset designers' recommendation, under which comments with toxicity >= 0.5 are considered toxic. However, EDA revealed that many comments with toxicity < 0.5 are toxic towards people with disabilities, so 0.5 is not an appropriate threshold here.
- After careful inspection of the comments where disability is mentioned, we found that whether a comment is toxic becomes a grey area at around toxicity = 0.25.
- Therefore, we create a `toxicity_disability` column that maps comments with toxicity < 0.25 to 0 (negative/non-toxic) and comments with toxicity >= 0.25 to 1 (positive/toxic). This column will serve as the labels we train and evaluate on.
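
To see how much the choice of threshold matters, one can count the comments whose label depends on it, that is, those with scores in the band from 0.25 up to (but not including) 0.5. The sketch below uses toy scores, not dataset values.

```python
import pandas as pd

# Toy toxicity scores (invented for illustration).
scores = pd.Series([0.0, 0.1, 0.25, 0.3, 0.45, 0.5, 0.9])

# Positives under each candidate threshold.
pos_025 = int((scores >= 0.25).sum())
pos_050 = int((scores >= 0.5).sum())

# Comments labeled positive at 0.25 but negative at 0.5.
flipped = int(((scores >= 0.25) & (scores < 0.5)).sum())

print(pos_025, pos_050, flipped)
```

The number of flipped rows always equals the difference between the two positive counts, which gives a quick consistency check when running this on the real subset.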

In [23]:
# Label from the subset's own toxicity column; assign() avoids SettingWithCopyWarning
disability_cleaned_df = disability_cleaned_df.assign(toxicity_disability=(disability_cleaned_df['toxicity'] >= 0.25).astype(int))


In [24]:
disability_cleaned_df[['toxicity','toxicity_disability']]

Unnamed: 0,toxicity,toxicity_disability
7705,0.689655,1
8073,0.800000,1
8115,0.790323,1
8125,0.352941,1
8263,0.842857,1
...,...,...
1999476,0.400000,1
1999478,0.400000,1
1999482,0.400000,1
1999507,0.400000,1


Move the new `toxicity_disability` column towards the front of the dataframe:

In [25]:
orig_cols = disability_cleaned_df.columns.tolist()
reordered_cols = orig_cols[:2] + orig_cols[-1:] + orig_cols[2:-1]
disability_cleaned_df = disability_cleaned_df[reordered_cols]
disability_cleaned_df.head()

Unnamed: 0,comment_text,disagree,toxicity_disability,toxicity_non_disability,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,...,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count,disability_subtypes_total,disability_subtype
7705,No sympathy for these two knuckleheads.,0,1,1,0.689655,0.0,0.017241,0.0,0.0,0.689655,...,0.0,0.0,0.25,0.0,0.0,0.0,4,58,0.25,physical_disability
8073,Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.,0,1,1,0.8,0.0,0.0,0.0,0.1,0.7,...,0.0,0.0,0.0,0.0,1.0,0.0,4,10,1.0,psychiatric_or_mental_illness
8115,"Or.... maybe there IS chaos because the ""president"" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.",1,1,1,0.790323,0.032258,0.16129,0.016129,0.145161,0.774194,...,0.0,0.0,0.0,0.0,1.0,0.0,4,62,1.0,psychiatric_or_mental_illness
8125,I'll take someone who's physically ill over one who's mentally ill.,0,1,1,0.352941,0.029412,0.014706,0.014706,0.088235,0.264706,...,0.0,0.0,0.75,0.0,1.0,0.0,4,68,1.75,psychiatric_or_mental_illness
8263,"Mental Illness at work again, again, again, again.............This women placed a clamp on the boys genitalia. This woman SHOULD have a clamp placed on HER INNER and OUTER LIPS of her VAGINA...........and Never Removed.............",0,1,1,0.842857,0.114286,0.171429,0.814286,0.1,0.371429,...,0.0,0.0,0.0,0.0,1.0,0.0,4,70,1.0,psychiatric_or_mental_illness


## Prepare data splits for disability
**Overview:** We'll split the disability subset into roughly 70% train, 10% validation, and 20% test sets. Comments may differ subtly across disability subtypes (e.g., comments about physical disability may differ from comments about intellectual/learning disability). Therefore, we stratify the sampling on disability subtype so that each split has approximately the same subtype proportions.

**Splitting Method**

We'll use scikit-learn's `train_test_split()` function. Since it only creates two splits at a time, we'll take the following steps to create the train/val/test splits:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the combined train-and-validation set from step 1 and divide it into train and val sets. The code below uses 1/7 of the combined set for validation and the remaining 6/7 for training, which works out to about 68.6% train and 11.4% validation overall (an exact 70%/10% split would use 1/8 for validation instead).
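
The second-stage fraction can be checked with a little arithmetic. In the sketch below, `second_stage_val_fraction` is a hypothetical helper; it computes the share of the combined train+val set that must go to validation to hit given overall targets, and then shows what a 1/7 share yields overall.

```python
from fractions import Fraction

def second_stage_val_fraction(train_frac: Fraction, val_frac: Fraction) -> Fraction:
    """Share of the combined train+val set to hold out as validation (hypothetical helper)."""
    return val_frac / (train_frac + val_frac)

# Overall targets of 70% train and 10% val imply holding out 1/8 of the combined 80%.
exact_share = second_stage_val_fraction(Fraction(7, 10), Fraction(1, 10))

# Using 1/7 of the combined 80% instead gives these overall fractions.
combined = Fraction(8, 10)
val_overall = combined * Fraction(1, 7)    # 4/35, about 0.1143
train_overall = combined * Fraction(6, 7)  # 24/35, about 0.6857

print(exact_share, float(val_overall), float(train_overall))
```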

Split the data:

In [26]:
# Split into 80% combined for train and val, and 20% test
disability_combined_df, disability_test_df = train_test_split(disability_cleaned_df,
                                       test_size=0.2,
                                       random_state=266, stratify=disability_cleaned_df['disability_subtype'])

# Split the combined 80%: 6/7 for train (about 68.6% overall) and 1/7 for val (about 11.4% overall)
disability_train_df, disability_val_df = train_test_split(disability_combined_df,
                                       test_size=1/7,
                                       random_state=266, stratify=disability_combined_df['disability_subtype'])

How big is each split?

In [27]:
disability_train_len = len(disability_train_df)
disability_val_len = len(disability_val_df)
disability_test_len = len(disability_test_df)
disability_total = disability_train_len+disability_val_len+disability_test_len
print('disability_train size: ', disability_train_len)
print('disability_val size: ', disability_val_len)
print('disability_test size: ', disability_test_len)
print('disability total: ', disability_total)
print('disability train ratio: ', disability_train_len/disability_total)
print('disability val ratio: ', disability_val_len/disability_total)
print('disability test ratio: ', disability_test_len/disability_total)

disability_train size:  10393
disability_val size:  1733
disability_test size:  3032
disability total:  15158
disability train ratio:  0.6856445441351102
disability val ratio:  0.11432906715925584
disability test ratio:  0.200026388705634
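
For comparison, here is a toy sketch of a two-stage stratified split that holds out 1/8 of the combined set in the second stage, which lands on 70/10/20 for this toy data. The frame, the shortened subtype names, and the seed are all invented; only `train_test_split` and its `stratify` argument are taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame of 100 rows with a 60/20/10/10 subtype mix (invented values).
toy = pd.DataFrame({
    "comment_id": range(100),
    "disability_subtype": ["psych"] * 60 + ["phys"] * 20 + ["intel"] * 10 + ["other"] * 10,
})

# Stage 1: hold out 20% for test, stratified on subtype.
combined, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["disability_subtype"])

# Stage 2: hold out 1/8 of the remaining 80% for validation (10% overall).
train, val = train_test_split(
    combined, test_size=1 / 8, random_state=0, stratify=combined["disability_subtype"])

print(len(train), len(val), len(test))
print(test["disability_subtype"].value_counts().to_dict())
```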


How balanced is each split in terms of negative and positive labels?

In [28]:
neg, pos = np.bincount(disability_train_df['toxicity_disability'])
total = neg + pos
print('Train Disability Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
neg, pos = np.bincount(disability_val_df['toxicity_disability'])
total = neg + pos
print('Val Disability Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
neg, pos = np.bincount(disability_test_df['toxicity_disability'])
total = neg + pos
print('Test Disability Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Train Disability Examples:
    Total: 10393
    Positive: 3551 (34.17% of total)

Val Disability Examples:
    Total: 1733
    Positive: 622 (35.89% of total)

Test Disability Examples:
    Total: 3032
    Positive: 1093 (36.05% of total)



In [29]:
print('\nStratified Sampling Sanity Check for Disability')
disability_train_phys = (disability_train_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_train_intel = (disability_train_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_train_psych = (disability_train_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_train_other = (disability_train_df['disability_subtype']=='other_disability').astype(int).sum()
disability_train_total = len(disability_train_df['disability_subtype'])
print('\nDisability Train')
print('=====================')
print('phys:\t', disability_train_phys/disability_train_total)
print('intel:\t', disability_train_intel/disability_train_total)
print('psych:\t', disability_train_psych/disability_train_total)
print('other_disability:\t', disability_train_other/disability_train_total)

disability_val_phys = (disability_val_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_val_intel = (disability_val_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_val_psych = (disability_val_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_val_other = (disability_val_df['disability_subtype']=='other_disability').astype(int).sum()
disability_val_total = len(disability_val_df['disability_subtype'])
print('\nDisability Val')
print('=====================')
print('phys:\t', disability_val_phys/disability_val_total)
print('intel:\t', disability_val_intel/disability_val_total)
print('psych:\t', disability_val_psych/disability_val_total)
print('other_disability:\t', disability_val_other/disability_val_total)

disability_test_phys = (disability_test_df['disability_subtype']=='physical_disability').astype(int).sum()
disability_test_intel = (disability_test_df['disability_subtype']=='intellectual_or_learning_disability').astype(int).sum()
disability_test_psych = (disability_test_df['disability_subtype']=='psychiatric_or_mental_illness').astype(int).sum()
disability_test_other = (disability_test_df['disability_subtype']=='other_disability').astype(int).sum()
disability_test_total = len(disability_test_df['disability_subtype'])
print('\nDisability Test')
print('=====================')
print('phys:\t', disability_test_phys/disability_test_total)
print('intel:\t', disability_test_intel/disability_test_total)
print('psych:\t', disability_test_psych/disability_test_total)
print('other_disability:\t', disability_test_other/disability_test_total)


Stratified Sampling Sanity Check for Disability

Disability Train
phys:	 0.13816992206292697
intel:	 0.09833541806985471
psych:	 0.6535167901472144
other_disability:	 0.10997786972000385

Disability Val
phys:	 0.13791113675706868
intel:	 0.09867282169648009
psych:	 0.6532025389497981
other_disability:	 0.1102135025966532

Disability Test
phys:	 0.13819261213720316
intel:	 0.09828496042216359
psych:	 0.6533641160949868
other_disability:	 0.11015831134564644
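The same per-subtype shares can be computed in one call per split with `value_counts(normalize=True)`. The sketch below uses a small made-up frame in place of `disability_train_df`; only the `disability_subtype` column name is taken from this notebook, and the helper name `subtype_proportions` is hypothetical.

```python
import pandas as pd

# Toy stand-in for disability_train_df (values are illustrative only).
toy_train_df = pd.DataFrame({
    "disability_subtype": [
        "psychiatric_or_mental_illness",
        "psychiatric_or_mental_illness",
        "physical_disability",
        "other_disability",
    ]
})

def subtype_proportions(df, column="disability_subtype"):
    """Return the share of rows belonging to each subtype."""
    return df[column].value_counts(normalize=True)

proportions = subtype_proportions(toy_train_df)
print(proportions)
```

Calling the helper on the train, val, and test frames in turn should give three closely matching series when the split was stratified.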


### Export disability split datasets into csv

In [30]:
# disability_df.to_csv('drive/MyDrive/data/disability-dataset-full.csv')
disability_train_df.to_csv('drive/MyDrive/data/disability-dataset-train.csv')
disability_val_df.to_csv('drive/MyDrive/data/disability-dataset-val.csv')
disability_test_df.to_csv('drive/MyDrive/data/disability-dataset-test.csv')

### Address Data Imbalance Between Disability and Non-Disability Subsets
All of the non-disability identities have many more records than the disability subset. For the interweaving technique, we want the disability and non-disability subsets to be balanced in size, so that whatever is learned from the disability subset is not overpowered by training on more non-disability examples. Therefore, we'll divide each non-disability subset into chunks that are the same size as the disability subset.

#### Capture the number of samples in the disability train, val, and test sets, to be used when undersampling the non-disability identities:

In [31]:
num_disability_train_samples = len(disability_train_df)
num_disability_val_samples = len(disability_val_df)
num_disability_test_samples = len(disability_test_df)

# Prepare non-disability subsets

**Overview**
- We want each non-disability subset to have the following ratio: 70% train, 10% val, and 20% test.
- Each split will be stratified on each identity's subtype.
- Disability has about 15k comments (15,158). Since the non-disability identities have more examples, we'll divide each of them into chunks of that size. That way the models don't favor the identity groups that have more examples.
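A minimal sketch of a stratified split with `train_test_split`, assuming a column that holds each row's identity subtype. The column name `subtype` and the toy data are hypothetical; `stratify=` is the scikit-learn parameter that keeps class proportions aligned across the two parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 'subtype' is a hypothetical column with each row's identity subtype.
toy_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(100)],
    "subtype": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
})

# Passing stratify= keeps the subtype proportions aligned in both parts.
rest_df, test_df = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["subtype"], random_state=266
)

print(test_df["subtype"].value_counts().to_dict())
```

Without `stratify=`, the split is purely random, so small subtypes can end up over- or under-represented in a given part.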

**Splitting Method**

We'll use scikit-learn's `train_test_split()` function. Since it only creates two splits per call, we'll take the following steps to create the three train/val/test splits:

1. Split into group1: 80% for train and validation, and group2: 20% for test.
1. No need to further split the test set, so leave it alone.
1. Take the combined train-and-validation set from step 1 and divide it into train and val sets, using 6/7 for train and 1/7 for validation. Note that 1/7 of the combined 80% is about 11.4% of the whole chunk, so the resulting overall split is roughly 68.6% train, 11.4% val, and 20% test. A validation fraction of 1/8 would give exactly 70/10/20.
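The arithmetic behind the two-step split can be checked with plain Python. The sketch below is a hypothetical helper, not part of the notebook's pipeline; it assumes the held-out part of each split is rounded up, which is consistent with the split sizes printed later in this notebook.

```python
import math

def three_way_sizes(n_rows, test_frac, val_frac_of_rest):
    """Return (train, val, test) sizes for two successive splits.

    Assumes the held-out part of each split is rounded up.
    """
    n_test = math.ceil(n_rows * test_frac)
    n_rest = n_rows - n_test
    n_val = math.ceil(n_rest * val_frac_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n = 15158  # size of one chunk in this notebook
for frac in (1 / 7, 1 / 8):
    train, val, test = three_way_sizes(n, 0.2, frac)
    print(f"val fraction {frac:.4f}: "
          f"train={train / n:.3f}, val={val / n:.3f}, test={test / n:.3f}")
```

With a validation fraction of 1/7 the overall shares come out near 68.6/11.4/20, while 1/8 lands almost exactly on 70/10/20.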

## Gender subset

Create gender subset:

In [32]:
gender_df = all_data_df_cleansed[(all_data_df_cleansed['male'] > 0) | 
           (all_data_df_cleansed['female'] > 0) | 
           (all_data_df_cleansed['transgender'] > 0) | 
           (all_data_df_cleansed['other_gender'] > 0)]
gender_df.shape

(137722, 36)

Divide gender data into chunks whose size is the same as the disability subset:

In [33]:
# get number of disability rows, which will be the size of each chunk
chunk_size = len(disability_cleaned_df)
# shuffle gender data rows before dividing the data
shuffled_gender_df = gender_df.sample(frac = 1)
# reset the index numbers since we'll be using the index for chunking
shuffled_gender_df = shuffled_gender_df.reset_index()

gender_chunk1_df = shuffled_gender_df.iloc[0:chunk_size]
gender_chunk2_df = shuffled_gender_df.iloc[chunk_size:chunk_size*2]
gender_chunk3_df = shuffled_gender_df.iloc[chunk_size*2:chunk_size*3]
gender_chunk4_df = shuffled_gender_df.iloc[chunk_size*3:chunk_size*4]
gender_chunk5_df = shuffled_gender_df.iloc[chunk_size*4:chunk_size*5]
gender_chunk6_df = shuffled_gender_df.iloc[chunk_size*5:chunk_size*6]
gender_chunk7_df = shuffled_gender_df.iloc[chunk_size*6:chunk_size*7]
gender_chunk8_df = shuffled_gender_df.iloc[chunk_size*7:chunk_size*8]
gender_chunk9_df = shuffled_gender_df.iloc[chunk_size*8:]

print('disability size: ', len(disability_cleaned_df))
print('gender_chunk1_df size: ', len(gender_chunk1_df))
print('gender_chunk2_df size: ', len(gender_chunk2_df))
print('gender_chunk3_df size: ', len(gender_chunk3_df))
print('gender_chunk4_df size: ', len(gender_chunk4_df))
print('gender_chunk5_df size: ', len(gender_chunk5_df))
print('gender_chunk6_df size: ', len(gender_chunk6_df))
print('gender_chunk7_df size: ', len(gender_chunk7_df))
print('gender_chunk8_df size: ', len(gender_chunk8_df))
print('gender_chunk9_df size: ', len(gender_chunk9_df))

disability size:  15158
gender_chunk1_df size:  15158
gender_chunk2_df size:  15158
gender_chunk3_df size:  15158
gender_chunk4_df size:  15158
gender_chunk5_df size:  15158
gender_chunk6_df size:  15158
gender_chunk7_df size:  15158
gender_chunk8_df size:  15158
gender_chunk9_df size:  16458
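The chunking above could also be written as a small helper rather than one line per chunk. This is a sketch under assumptions: the function name `split_into_chunks` and the fixed `seed` are not in the original (the notebook's shuffle is unseeded), and `reset_index(drop=True)` discards the old index rather than keeping it as a column. Like the gender cell, it lets the last chunk absorb the leftover rows.

```python
import pandas as pd

def split_into_chunks(df, chunk_size, seed=266):
    """Shuffle df and cut it into chunks of chunk_size rows.

    The last chunk absorbs any leftover rows, so it can be larger
    than chunk_size.
    """
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_chunks = max(len(shuffled) // chunk_size, 1)
    chunks = []
    for i in range(n_chunks):
        start = i * chunk_size
        # The final chunk runs to the end of the frame.
        stop = None if i == n_chunks - 1 else start + chunk_size
        chunks.append(shuffled.iloc[start:stop])
    return chunks

toy_df = pd.DataFrame({"comment_text": [f"c{i}" for i in range(25)]})
chunks = split_into_chunks(toy_df, chunk_size=10)
print([len(c) for c in chunks])
```

Seeding the shuffle makes the chunk membership reproducible across notebook runs, which matters if the exported CSV files are regenerated.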


Split the data for each chunk:

In [34]:
# Gender Chunk 1
# Split into 80% combined for train and val, and 20% test
gender_combined_chunk1_df, gender_test_chunk1_df = train_test_split(gender_chunk1_df, test_size=0.2, random_state=266)
# Split the combined set into train (6/7) and val (1/7), roughly 69%/11% overall
gender_train_chunk1_df, gender_val_chunk1_df = train_test_split(gender_combined_chunk1_df, test_size=1/7, random_state=266)

# Gender Chunk 2
gender_combined_chunk2_df, gender_test_chunk2_df = train_test_split(gender_chunk2_df, test_size=0.2, random_state=266)
gender_train_chunk2_df, gender_val_chunk2_df = train_test_split(gender_combined_chunk2_df, test_size=1/7, random_state=266)

# Gender Chunk 3
gender_combined_chunk3_df, gender_test_chunk3_df = train_test_split(gender_chunk3_df, test_size=0.2, random_state=266)
gender_train_chunk3_df, gender_val_chunk3_df = train_test_split(gender_combined_chunk3_df, test_size=1/7, random_state=266)

# Gender Chunk 4
gender_combined_chunk4_df, gender_test_chunk4_df = train_test_split(gender_chunk4_df, test_size=0.2, random_state=266)
gender_train_chunk4_df, gender_val_chunk4_df = train_test_split(gender_combined_chunk4_df, test_size=1/7, random_state=266)

# Gender Chunk 5
gender_combined_chunk5_df, gender_test_chunk5_df = train_test_split(gender_chunk5_df, test_size=0.2, random_state=266)
gender_train_chunk5_df, gender_val_chunk5_df = train_test_split(gender_combined_chunk5_df, test_size=1/7, random_state=266)

# Gender Chunk 6
gender_combined_chunk6_df, gender_test_chunk6_df = train_test_split(gender_chunk6_df, test_size=0.2, random_state=266)
gender_train_chunk6_df, gender_val_chunk6_df = train_test_split(gender_combined_chunk6_df, test_size=1/7, random_state=266)

# Gender Chunk 7
gender_combined_chunk7_df, gender_test_chunk7_df = train_test_split(gender_chunk7_df, test_size=0.2, random_state=266)
gender_train_chunk7_df, gender_val_chunk7_df = train_test_split(gender_combined_chunk7_df, test_size=1/7, random_state=266)

# Gender Chunk 8
gender_combined_chunk8_df, gender_test_chunk8_df = train_test_split(gender_chunk8_df, test_size=0.2, random_state=266)
gender_train_chunk8_df, gender_val_chunk8_df = train_test_split(gender_combined_chunk8_df, test_size=1/7, random_state=266)

# Gender Chunk 9
gender_combined_chunk9_df, gender_test_chunk9_df = train_test_split(gender_chunk9_df, test_size=0.2, random_state=266)
gender_train_chunk9_df, gender_val_chunk9_df = train_test_split(gender_combined_chunk9_df, test_size=1/7, random_state=266)

Sanity check: how big is each split for gender_chunk1?

In [35]:
gender_train_chunk1_len = len(gender_train_chunk1_df)
gender_val_chunk1_len = len(gender_val_chunk1_df)
gender_test_chunk1_len = len(gender_test_chunk1_df)
gender_chunk1_total = gender_train_chunk1_len+gender_val_chunk1_len+gender_test_chunk1_len
print('gender_train_chunk1 size: ', gender_train_chunk1_len)
print('gender_val_chunk1 size: ', gender_val_chunk1_len)
print('gender_test_chunk1 size: ', gender_test_chunk1_len)
print('gender total_chunk1: ', gender_chunk1_total)
print('\ngender train chunk 1 ratio: ', gender_train_chunk1_len/gender_chunk1_total)
print('gender val chunk 1 ratio: ', gender_val_chunk1_len/gender_chunk1_total)
print('gender test chunk 1 ratio: ', gender_test_chunk1_len/gender_chunk1_total)

gender_train_chunk1 size:  10393
gender_val_chunk1 size:  1733
gender_test_chunk1 size:  3032
gender total_chunk1:  15158

gender train chunk 1 ratio:  0.6856445441351102
gender val chunk 1 ratio:  0.11432906715925584
gender test chunk 1 ratio:  0.200026388705634


Check positive/negative labels balance:

In [36]:
neg, pos = np.bincount(gender_df['toxicity_non_disability'])
total = neg + pos
print('Gender ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Gender ALL Examples:
    Total: 137722
    Positive: 37815 (27.46% of total)



In [37]:
neg, pos = np.bincount(gender_train_chunk1_df['toxicity_non_disability'])
total = neg + pos
print('Gender Train Chunk 1 Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Gender Train Chunk 1 Examples:
    Total: 10393
    Positive: 2878 (27.69% of total)
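The label-balance check is repeated for every identity, so it could be wrapped in a helper. The sketch below is illustrative: `positive_rate` is a hypothetical name, and it assumes a 0/1 label column such as `toxicity_non_disability`. Passing `minlength=2` to `np.bincount` keeps the two-value unpacking from failing when a split happens to contain no positive rows.

```python
import numpy as np
import pandas as pd

def positive_rate(labels):
    """Return (total, positives, percent positive) for a 0/1 label column."""
    # minlength=2 keeps the unpacking safe even if one class is absent.
    neg, pos = np.bincount(labels, minlength=2)
    total = neg + pos
    return int(total), int(pos), 100 * pos / total

toy_labels = pd.Series([0, 1, 0, 0, 1, 0, 0, 0])
total, pos, pct = positive_rate(toy_labels)
print(f"Total: {total}, Positive: {pos} ({pct:.2f}% of total)")
```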



### Export gender split chunk datasets into csv

In [38]:
gender_train_chunk1_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk1.csv')
gender_val_chunk1_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk1.csv')
gender_test_chunk1_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk1.csv')

gender_train_chunk2_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk2.csv')
gender_val_chunk2_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk2.csv')
gender_test_chunk2_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk2.csv')

gender_train_chunk3_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk3.csv')
gender_val_chunk3_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk3.csv')
gender_test_chunk3_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk3.csv')

gender_train_chunk4_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk4.csv')
gender_val_chunk4_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk4.csv')
gender_test_chunk4_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk4.csv')

gender_train_chunk5_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk5.csv')
gender_val_chunk5_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk5.csv')
gender_test_chunk5_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk5.csv')

gender_train_chunk6_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk6.csv')
gender_val_chunk6_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk6.csv')
gender_test_chunk6_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk6.csv')

gender_train_chunk7_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk7.csv')
gender_val_chunk7_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk7.csv')
gender_test_chunk7_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk7.csv')

gender_train_chunk8_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk8.csv')
gender_val_chunk8_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk8.csv')
gender_test_chunk8_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk8.csv')

gender_train_chunk9_df.to_csv('drive/MyDrive/data/gender-dataset-train-chunk9.csv')
gender_val_chunk9_df.to_csv('drive/MyDrive/data/gender-dataset-val-chunk9.csv')
gender_test_chunk9_df.to_csv('drive/MyDrive/data/gender-dataset-test-chunk9.csv')
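The export cell follows a fixed file-name pattern, so it could be generated in a loop. This is a sketch, assuming the split frames are collected in a nested dict keyed by chunk number and split name; that layout and the `export_splits` helper are hypothetical, while the file-name pattern mirrors the one used above. A temporary directory stands in for the Drive folder.

```python
import os
import tempfile
import pandas as pd

def export_splits(splits_by_chunk, identity, out_dir):
    """Write every (chunk, split) frame to its own CSV and return the paths."""
    paths = []
    for chunk_no, splits in splits_by_chunk.items():
        for split_name, frame in splits.items():
            file_name = f"{identity}-dataset-{split_name}-chunk{chunk_no}.csv"
            path = os.path.join(out_dir, file_name)
            frame.to_csv(path)
            paths.append(path)
    return paths

# Toy frames standing in for the real train/val/test chunks.
toy = pd.DataFrame({"comment_text": ["a", "b"]})
splits_by_chunk = {
    1: {"train": toy, "val": toy, "test": toy},
    2: {"train": toy, "val": toy, "test": toy},
}

with tempfile.TemporaryDirectory() as out_dir:
    written = export_splits(splits_by_chunk, "gender", out_dir)
    file_names = sorted(os.path.basename(p) for p in written)
print(file_names)
```

Generating the names from the loop variables avoids copy-paste slips such as exporting the same chunk twice.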

## Sexual orientation subset

Create sexual orientation subset:

In [39]:
sexual_orientation_df = all_data_df_cleansed[(all_data_df_cleansed['heterosexual'] > 0) | 
           (all_data_df_cleansed['homosexual_gay_or_lesbian'] > 0) | 
           (all_data_df_cleansed['bisexual'] > 0) | 
           (all_data_df_cleansed['other_sexual_orientation'] > 0)]
sexual_orientation_df.shape

(22649, 36)

Divide sexual orientation data into chunks whose size is the same as the disability subset:

In [40]:
# get number of disability rows, which will be the size of each chunk
chunk_size = len(disability_cleaned_df)
# shuffle sexual orientation data rows before dividing the data
shuffled_sexual_orientation_df = sexual_orientation_df.sample(frac = 1)
# reset the index numbers since we'll be using the index for chunking
shuffled_sexual_orientation_df = shuffled_sexual_orientation_df.reset_index()

# chunk 1
sexual_orientation_chunk1_df = shuffled_sexual_orientation_df.iloc[0:chunk_size]

# chunk 2
sexual_orientation_chunk2_df = shuffled_sexual_orientation_df.iloc[chunk_size:]
# sample from chunk1 to fill the remaining space
so_chunk1_sample = sexual_orientation_chunk1_df.sample(frac=1-len(sexual_orientation_chunk2_df)/chunk_size)
sexual_orientation_chunk2_df = pd.concat([sexual_orientation_chunk2_df, so_chunk1_sample], axis=0)

print('disability size: ', len(disability_cleaned_df))
print('sexual_orientation_chunk1_df size: ', len(sexual_orientation_chunk1_df))
print('sexual_orientation_chunk2_df size: ', len(sexual_orientation_chunk2_df))

disability size:  15158
sexual_orientation_chunk1_df size:  15158
sexual_orientation_chunk2_df size:  15158
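The top-up of the short second chunk can also be expressed with an explicit row count rather than a fraction. The sketch below is one possible formulation; the helper name `top_up` and the fixed `seed` are assumptions, and the toy frames stand in for the real chunks.

```python
import pandas as pd

def top_up(short_chunk, donor_chunk, chunk_size, seed=266):
    """Pad short_chunk to chunk_size rows with rows sampled from donor_chunk."""
    n_missing = chunk_size - len(short_chunk)
    if n_missing <= 0:
        return short_chunk
    # n= requests an exact row count, avoiding rounding of a fraction.
    filler = donor_chunk.sample(n=n_missing, random_state=seed)
    return pd.concat([short_chunk, filler], axis=0)

donor = pd.DataFrame({"comment_text": [f"d{i}" for i in range(10)]})
short = pd.DataFrame({"comment_text": [f"s{i}" for i in range(4)]})
padded = top_up(short, donor, chunk_size=10)
print(len(padded))
```

Because the filler rows are borrowed from chunk 1, the two chunks are not disjoint: a comment in chunk 2's train split may also appear in chunk 1's test split, which is worth keeping in mind when evaluating across chunks.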


Split the data for each chunk:

In [41]:
# Sexual Orientation Chunk 1
# Split into 80% combined for train and val, and 20% test
sexual_orientation_combined_chunk1_df, sexual_orientation_test_chunk1_df = train_test_split(sexual_orientation_chunk1_df, test_size=0.2, random_state=266)
# Split the combined set into train (6/7) and val (1/7), roughly 69%/11% overall
sexual_orientation_train_chunk1_df, sexual_orientation_val_chunk1_df = train_test_split(sexual_orientation_combined_chunk1_df, test_size=1/7, random_state=266)

# Sexual Orientation Chunk 2
sexual_orientation_combined_chunk2_df, sexual_orientation_test_chunk2_df = train_test_split(sexual_orientation_chunk2_df, test_size=0.2, random_state=266)
sexual_orientation_train_chunk2_df, sexual_orientation_val_chunk2_df = train_test_split(sexual_orientation_combined_chunk2_df, test_size=1/7, random_state=266)

Sanity check: how big is each split for sexual_orientation_chunk1?

In [42]:
sexual_orientation_train_chunk1_len = len(sexual_orientation_train_chunk1_df)
sexual_orientation_val_chunk1_len = len(sexual_orientation_val_chunk1_df)
sexual_orientation_test_chunk1_len = len(sexual_orientation_test_chunk1_df)
sexual_orientation_chunk1_total = sexual_orientation_train_chunk1_len+sexual_orientation_val_chunk1_len+sexual_orientation_test_chunk1_len
print('sexual_orientation_train_chunk1 size: ', sexual_orientation_train_chunk1_len)
print('sexual_orientation_val_chunk1 size: ', sexual_orientation_val_chunk1_len)
print('sexual_orientation_test_chunk1 size: ', sexual_orientation_test_chunk1_len)
print('sexual_orientation total_chunk1: ', sexual_orientation_chunk1_total)
print('\nsexual_orientation train chunk 1 ratio: ', sexual_orientation_train_chunk1_len/sexual_orientation_chunk1_total)
print('sexual_orientation val chunk 1 ratio: ', sexual_orientation_val_chunk1_len/sexual_orientation_chunk1_total)
print('sexual_orientation test chunk 1 ratio: ', sexual_orientation_test_chunk1_len/sexual_orientation_chunk1_total)

sexual_orientation_train_chunk1 size:  10393
sexual_orientation_val_chunk1 size:  1733
sexual_orientation_test_chunk1 size:  3032
sexual_orientation total_chunk1:  15158

sexual_orientation train chunk 1 ratio:  0.6856445441351102
sexual_orientation val chunk 1 ratio:  0.11432906715925584
sexual_orientation test chunk 1 ratio:  0.200026388705634


Check positive/negative labels balance:

In [43]:
neg, pos = np.bincount(sexual_orientation_df['toxicity_non_disability'])
total = neg + pos
print('Sexual Orientation ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Sexual Orientation ALL Examples:
    Total: 22649
    Positive: 9688 (42.77% of total)



In [44]:
neg, pos = np.bincount(sexual_orientation_train_chunk1_df['toxicity_non_disability'])
total = neg + pos
print('Sexual Orientation Train Chunk 1 Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Sexual Orientation Train Chunk 1 Examples:
    Total: 10393
    Positive: 4428 (42.61% of total)



### Export sexual orientation split chunk datasets into csv

In [45]:
sexual_orientation_train_chunk1_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-train-chunk1.csv')
sexual_orientation_val_chunk1_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-val-chunk1.csv')
sexual_orientation_test_chunk1_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-test-chunk1.csv')

sexual_orientation_train_chunk2_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-train-chunk2.csv')
sexual_orientation_val_chunk2_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-val-chunk2.csv')
sexual_orientation_test_chunk2_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-test-chunk2.csv')

## Religion subset

Create religion subset:

In [46]:
religion_df = all_data_df_cleansed[(all_data_df_cleansed['christian'] > 0) | 
           (all_data_df_cleansed['jewish'] > 0) | 
           (all_data_df_cleansed['muslim'] > 0) | 
           (all_data_df_cleansed['hindu'] > 0) | 
           (all_data_df_cleansed['buddhist'] > 0) | 
           (all_data_df_cleansed['atheist'] > 0) | 
           (all_data_df_cleansed['other_religion'] > 0)]
religion_df.shape

(101410, 36)

Divide religion data into chunks whose size is the same as the disability subset:

In [58]:
# get number of disability rows, which will be the size of each chunk
chunk_size = len(disability_cleaned_df)
# shuffle religion data rows before dividing the data
shuffled_religion_df = religion_df.sample(frac = 1)
# reset the index numbers since we'll be using the index for chunking
shuffled_religion_df = shuffled_religion_df.reset_index()

religion_chunk1_df = shuffled_religion_df.iloc[0:chunk_size]
religion_chunk2_df = shuffled_religion_df.iloc[chunk_size:chunk_size*2]
religion_chunk3_df = shuffled_religion_df.iloc[chunk_size*2:chunk_size*3]
religion_chunk4_df = shuffled_religion_df.iloc[chunk_size*3:chunk_size*4]
religion_chunk5_df = shuffled_religion_df.iloc[chunk_size*4:chunk_size*5]
religion_chunk6_df = shuffled_religion_df.iloc[chunk_size*5:chunk_size*6]
religion_chunk7_df = shuffled_religion_df.iloc[chunk_size*6:chunk_size*7]

print('disability size: ', len(disability_cleaned_df))
print('religion_chunk1_df size: ', len(religion_chunk1_df))
print('religion_chunk2_df size: ', len(religion_chunk2_df))
print('religion_chunk3_df size: ', len(religion_chunk3_df))
print('religion_chunk4_df size: ', len(religion_chunk4_df))
print('religion_chunk5_df size: ', len(religion_chunk5_df))
print('religion_chunk6_df size: ', len(religion_chunk6_df))
print('religion_chunk7_df size: ', len(religion_chunk7_df))

disability size:  15158
religion_chunk1_df size:  15158
religion_chunk2_df size:  15158
religion_chunk3_df size:  15158
religion_chunk4_df size:  15158
religion_chunk5_df size:  15158
religion_chunk6_df size:  15158
religion_chunk7_df size:  10462


Split the data for each chunk:

In [59]:
# Religion Chunk 1
# Split into 80% combined for train and val, and 20% test
religion_combined_chunk1_df, religion_test_chunk1_df = train_test_split(religion_chunk1_df, test_size=0.2, random_state=266)
# Split the combined set into train (6/7) and val (1/7), roughly 69%/11% overall
religion_train_chunk1_df, religion_val_chunk1_df = train_test_split(religion_combined_chunk1_df, test_size=1/7, random_state=266)

# Religion Chunk 2
religion_combined_chunk2_df, religion_test_chunk2_df = train_test_split(religion_chunk2_df, test_size=0.2, random_state=266)
religion_train_chunk2_df, religion_val_chunk2_df = train_test_split(religion_combined_chunk2_df, test_size=1/7, random_state=266)

# Religion Chunk 3
religion_combined_chunk3_df, religion_test_chunk3_df = train_test_split(religion_chunk3_df, test_size=0.2, random_state=266)
religion_train_chunk3_df, religion_val_chunk3_df = train_test_split(religion_combined_chunk3_df, test_size=1/7, random_state=266)

# Religion Chunk 4
religion_combined_chunk4_df, religion_test_chunk4_df = train_test_split(religion_chunk4_df, test_size=0.2, random_state=266)
religion_train_chunk4_df, religion_val_chunk4_df = train_test_split(religion_combined_chunk4_df, test_size=1/7, random_state=266)

# Religion Chunk 5
religion_combined_chunk5_df, religion_test_chunk5_df = train_test_split(religion_chunk5_df, test_size=0.2, random_state=266)
religion_train_chunk5_df, religion_val_chunk5_df = train_test_split(religion_combined_chunk5_df, test_size=1/7, random_state=266)

# Religion Chunk 6
religion_combined_chunk6_df, religion_test_chunk6_df = train_test_split(religion_chunk6_df, test_size=0.2, random_state=266)
religion_train_chunk6_df, religion_val_chunk6_df = train_test_split(religion_combined_chunk6_df, test_size=1/7, random_state=266)

# Religion Chunk 7
religion_combined_chunk7_df, religion_test_chunk7_df = train_test_split(religion_chunk7_df, test_size=0.2, random_state=266)
religion_train_chunk7_df, religion_val_chunk7_df = train_test_split(religion_combined_chunk7_df, test_size=1/7, random_state=266)

Sanity check: how big is each split for religion_chunk1?

In [None]:
religion_train_chunk1_len = len(religion_train_chunk1_df)
religion_val_chunk1_len = len(religion_val_chunk1_df)
religion_test_chunk1_len = len(religion_test_chunk1_df)
religion_chunk1_total = religion_train_chunk1_len+religion_val_chunk1_len+religion_test_chunk1_len
print('religion_train_chunk1 size: ', religion_train_chunk1_len)
print('religion_val_chunk1 size: ', religion_val_chunk1_len)
print('religion_test_chunk1 size: ', religion_test_chunk1_len)
print('religion total_chunk1: ', religion_chunk1_total)
print('\nreligion train chunk 1 ratio: ', religion_train_chunk1_len/religion_chunk1_total)
print('religion val chunk 1 ratio: ', religion_val_chunk1_len/religion_chunk1_total)
print('religion test chunk 1 ratio: ', religion_test_chunk1_len/religion_chunk1_total)

religion_train_chunk1 size:  10393
religion_val_chunk1 size:  1733
religion_test_chunk1 size:  3032
religion total_chunk1:  15158

religion train chunk 1 ratio:  0.6856445441351102
religion val chunk 1 ratio:  0.11432906715925584
religion test chunk 1 ratio:  0.200026388705634


Check positive/negative labels balance:

In [60]:
neg, pos = np.bincount(religion_df['toxicity_non_disability'])
total = neg + pos
print('Religion ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Religion ALL Examples:
    Total: 101410
    Positive: 26187 (25.82% of total)



In [61]:
neg, pos = np.bincount(religion_train_chunk1_df['toxicity_non_disability'])
total = neg + pos
print('Religion Train Chunk 1 Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Religion Train Chunk 1 Examples:
    Total: 10393
    Positive: 2747 (26.43% of total)



### Export religion split chunk datasets into csv

In [63]:
religion_train_chunk1_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk1.csv')
religion_val_chunk1_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk1.csv')
religion_test_chunk1_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk1.csv')

religion_train_chunk2_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk2.csv')
religion_val_chunk2_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk2.csv')
religion_test_chunk2_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk2.csv')

religion_train_chunk3_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk3.csv')
religion_val_chunk3_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk3.csv')
religion_test_chunk3_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk3.csv')

religion_train_chunk4_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk4.csv')
religion_val_chunk4_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk4.csv')
religion_test_chunk4_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk4.csv')

religion_train_chunk5_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk5.csv')
religion_val_chunk5_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk5.csv')
religion_test_chunk5_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk5.csv')

religion_train_chunk6_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk6.csv')
religion_val_chunk6_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk6.csv')
religion_test_chunk6_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk6.csv')

religion_train_chunk7_df.to_csv('drive/MyDrive/data/religion-dataset-train-chunk7.csv')
religion_val_chunk7_df.to_csv('drive/MyDrive/data/religion-dataset-val-chunk7.csv')
religion_test_chunk7_df.to_csv('drive/MyDrive/data/religion-dataset-test-chunk7.csv')

## Race subset

Create race subset:

In [64]:
race_df = all_data_df_cleansed[(all_data_df_cleansed['black'] > 0) | 
           (all_data_df_cleansed['white'] > 0) | 
           (all_data_df_cleansed['asian'] > 0) | 
           (all_data_df_cleansed['latino'] > 0) | 
           (all_data_df_cleansed['other_race_or_ethnicity'] > 0)]
race_df.shape

(71648, 36)

Divide race data into chunks whose size is the same as the disability subset:

In [65]:
# get number of disability rows, which will be the size of each chunk
chunk_size = len(disability_cleaned_df)
# shuffle race data rows before dividing the data
shuffled_race_df = race_df.sample(frac = 1)
# reset the index numbers since we'll be using the index for chunking
shuffled_race_df = shuffled_race_df.reset_index()

race_chunk1_df = shuffled_race_df.iloc[0:chunk_size]
race_chunk2_df = shuffled_race_df.iloc[chunk_size:chunk_size*2]
race_chunk3_df = shuffled_race_df.iloc[chunk_size*2:chunk_size*3]
race_chunk4_df = shuffled_race_df.iloc[chunk_size*3:chunk_size*4]
race_chunk5_df = shuffled_race_df.iloc[chunk_size*4:]

print('disability size: ', len(disability_cleaned_df))
print('race_chunk1_df size: ', len(race_chunk1_df))
print('race_chunk2_df size: ', len(race_chunk2_df))
print('race_chunk3_df size: ', len(race_chunk3_df))
print('race_chunk4_df size: ', len(race_chunk4_df))
print('race_chunk5_df size: ', len(race_chunk5_df))

disability size:  15158
race_chunk1_df size:  15158
race_chunk2_df size:  15158
race_chunk3_df size:  15158
race_chunk4_df size:  15158
race_chunk5_df size:  11016


Split the data for each chunk:

In [66]:
# Race Chunk 1
# Split into 80% combined for train and val, and 20% test
race_combined_chunk1_df, race_test_chunk1_df = train_test_split(race_chunk1_df, test_size=0.2, random_state=266)
# Split the combined set into train (6/7) and val (1/7), roughly 69%/11% overall
race_train_chunk1_df, race_val_chunk1_df = train_test_split(race_combined_chunk1_df, test_size=1/7, random_state=266)

# Race Chunk 2
race_combined_chunk2_df, race_test_chunk2_df = train_test_split(race_chunk2_df, test_size=0.2, random_state=266)
race_train_chunk2_df, race_val_chunk2_df = train_test_split(race_combined_chunk2_df, test_size=1/7, random_state=266)

# Race Chunk 3
race_combined_chunk3_df, race_test_chunk3_df = train_test_split(race_chunk3_df, test_size=0.2, random_state=266)
race_train_chunk3_df, race_val_chunk3_df = train_test_split(race_combined_chunk3_df, test_size=1/7, random_state=266)

# Race Chunk 4
race_combined_chunk4_df, race_test_chunk4_df = train_test_split(race_chunk4_df, test_size=0.2, random_state=266)
race_train_chunk4_df, race_val_chunk4_df = train_test_split(race_combined_chunk4_df, test_size=1/7, random_state=266)

# Race Chunk 5
race_combined_chunk5_df, race_test_chunk5_df = train_test_split(race_chunk5_df, test_size=0.2, random_state=266)
race_train_chunk5_df, race_val_chunk5_df = train_test_split(race_combined_chunk5_df, test_size=1/7, random_state=266)

Sanity check: how big is each split for race_chunk1?

In [67]:
race_train_chunk1_len = len(race_train_chunk1_df)
race_val_chunk1_len = len(race_val_chunk1_df)
race_test_chunk1_len = len(race_test_chunk1_df)
race_chunk1_total = race_train_chunk1_len+race_val_chunk1_len+race_test_chunk1_len
print('race_train_chunk1 size: ', race_train_chunk1_len)
print('race_val_chunk1 size: ', race_val_chunk1_len)
print('race_test_chunk1 size: ', race_test_chunk1_len)
print('race total_chunk1: ', race_chunk1_total)
print('\nrace train chunk 1 ratio: ', race_train_chunk1_len/race_chunk1_total)
print('race val chunk 1 ratio: ', race_val_chunk1_len/race_chunk1_total)
print('race test chunk 1 ratio: ', race_test_chunk1_len/race_chunk1_total)

race_train_chunk1 size:  10393
race_val_chunk1 size:  1733
race_test_chunk1 size:  3032
race total_chunk1:  15158

race train chunk 1 ratio:  0.6856445441351102
race val chunk 1 ratio:  0.11432906715925584
race test chunk 1 ratio:  0.200026388705634


Check positive/negative labels balance:

In [68]:
neg, pos = np.bincount(race_df['toxicity_non_disability'])
total = neg + pos
print('Race ALL Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Race ALL Examples:
    Total: 71648
    Positive: 27642 (38.58% of total)



In [69]:
neg, pos = np.bincount(race_train_chunk1_df['toxicity_non_disability'])
total = neg + pos
print('Race Train Chunk 1 Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

Race Train Chunk 1 Examples:
    Total: 10393
    Positive: 4084 (39.30% of total)



### Export race split chunk datasets into csv

In [70]:
race_train_chunk1_df.to_csv('drive/MyDrive/data/race-dataset-train-chunk1.csv')
race_val_chunk1_df.to_csv('drive/MyDrive/data/race-dataset-val-chunk1.csv')
race_test_chunk1_df.to_csv('drive/MyDrive/data/race-dataset-test-chunk1.csv')

race_train_chunk2_df.to_csv('drive/MyDrive/data/race-dataset-train-chunk2.csv')
race_val_chunk2_df.to_csv('drive/MyDrive/data/race-dataset-val-chunk2.csv')
race_test_chunk2_df.to_csv('drive/MyDrive/data/race-dataset-test-chunk2.csv')

race_train_chunk3_df.to_csv('drive/MyDrive/data/race-dataset-train-chunk3.csv')
race_val_chunk3_df.to_csv('drive/MyDrive/data/race-dataset-val-chunk3.csv')
race_test_chunk3_df.to_csv('drive/MyDrive/data/race-dataset-test-chunk3.csv')

race_train_chunk4_df.to_csv('drive/MyDrive/data/race-dataset-train-chunk4.csv')
race_val_chunk4_df.to_csv('drive/MyDrive/data/race-dataset-val-chunk4.csv')
race_test_chunk4_df.to_csv('drive/MyDrive/data/race-dataset-test-chunk4.csv')

race_train_chunk5_df.to_csv('drive/MyDrive/data/race-dataset-train-chunk5.csv')
race_val_chunk5_df.to_csv('drive/MyDrive/data/race-dataset-val-chunk5.csv')
race_test_chunk5_df.to_csv('drive/MyDrive/data/race-dataset-test-chunk5.csv')

# Reference code: How to load split data for use in our models

Load the csv files for each data split with the following code:

In [None]:
# disability_df_train = pd.read_csv('data/disability-dataset-train.csv')
# disability_df_val = pd.read_csv('data/disability-dataset-val.csv')
# disability_df_test = pd.read_csv('data/disability-dataset-test.csv')
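One detail to watch when reloading: the export cells call `to_csv()` with its defaults, which writes the DataFrame index as an unnamed first column. The sketch below round-trips a toy frame through an in-memory buffer to show the effect; the toy values are made up, and `index_col=0` is one way to restore the index on load.

```python
import io
import pandas as pd

toy_df = pd.DataFrame(
    {"comment_text": ["first", "second"], "toxicity_binary": [0, 1]},
    index=[7, 42],
)

# to_csv() writes the index as an unnamed first column by default.
buffer = io.StringIO()
toy_df.to_csv(buffer)

# Without index_col, that column comes back as "Unnamed: 0".
buffer.seek(0)
plain = pd.read_csv(buffer)

# With index_col=0, it is restored as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(plain.columns))
print(list(restored.columns))
```

Either form works for modeling, since only the label and text columns are used downstream; `index_col=0` simply keeps the extra column out of the loaded frame.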

Now that the data is loaded, we need the labels and text examples in the form of tensors. Use the code below to create them:

In [None]:
# import tensorflow as tf
# # Form tensors of labels and features.
# disability_train_labels = tf.convert_to_tensor(disability_df_train['toxicity_binary'])
# disability_val_labels = tf.convert_to_tensor(disability_df_val['toxicity_binary'])
# disability_test_labels = tf.convert_to_tensor(disability_df_test['toxicity_binary'])

# disability_train_examples = tf.convert_to_tensor(disability_df_train['comment_text'])
# disability_val_examples = tf.convert_to_tensor(disability_df_val['comment_text'])
# disability_test_examples = tf.convert_to_tensor(disability_df_test['comment_text'])