# Data Preparation for Evaluation Datasets

🛑 👉 **DISCLAIMER:** 👈 Much of the text in our comment datasets may be offensive or obscene and may contain profanity.

Data sources used for training our models:
- Toxigen: https://github.com/microsoft/toxigen
- Hatemoji: https://github.com/HannahKirk/Hatemoji

Data sources used for additional evaluation at the end:
- HateCheck: https://hatecheck.ai/download/
- Jigsaw: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data?select=all_data.csv

In this notebook, we prepare the data downloaded from Hugging Face and Kaggle and export the data subsets that we will feed into our models. Data preparation includes cleaning the data, extracting subsets by identity category, and splitting each identity subset into train, validation, and test sets.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!pip install -q datasets

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.9/132.9 kB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m200.1/200.1 kB[0m [31m18.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.2/212.2 kB[0m [31m20.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m27.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m264.6/264.6 kB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.2/114.2 kB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset

In [None]:
access_token = "REDACTED"

# Prepare HateCheckCorpus data
We'll use data from the HateCheckCorpus for the final evaluation and analysis of our models.

In [None]:
from datasets import load_dataset
hatecheck_dataset = load_dataset("Paul/hatecheck")


In [None]:
hatecheck_dataset

DatasetDict({
    test: Dataset({
        features: ['functionality', 'case_id', 'test_case', 'label_gold', 'target_ident', 'direction', 'focus_words', 'focus_lemma', 'ref_case_id', 'ref_templ_id', 'templ_id', 'case_templ'],
        num_rows: 3728
    })
})

In [None]:
hatecheck_df = pd.DataFrame(hatecheck_dataset['test'])
hatecheck_df = hatecheck_df.rename(columns={"test_case": "comment_text", "label_gold": "toxicity_binary"})
hatecheck_df['toxicity_binary'] = (hatecheck_df['toxicity_binary'] == 'hateful').astype(int)
hatecheck_df

Unnamed: 0,functionality,case_id,comment_text,toxicity_binary,target_ident,direction,focus_words,focus_lemma,ref_case_id,ref_templ_id,templ_id,case_templ
0,derog_neg_emote_h,1,I hate women.,1,women,general,hate,hate,,,1,I hate [IDENTITY_P].
1,derog_neg_emote_h,2,I hate trans people.,1,trans people,general,hate,hate,,,1,I hate [IDENTITY_P].
2,derog_neg_emote_h,3,I hate gay people.,1,gay people,general,hate,hate,,,1,I hate [IDENTITY_P].
3,derog_neg_emote_h,4,I hate black people.,1,black people,general,hate,hate,,,1,I hate [IDENTITY_P].
4,derog_neg_emote_h,5,I hate disabled people.,1,disabled people,general,hate,hate,,,1,I hate [IDENTITY_P].
...,...,...,...,...,...,...,...,...,...,...,...,...
3723,spell_leet_h,3897,Gay people are all complete 4ssholes.,1,gay people,general,4ssholes,asshole,1232.0,256.0,866,[IDENTITY_P] are all complete 4ssholes.
3724,spell_leet_h,3898,Black people are all complete 4ssholes.,1,black people,general,4ssholes,asshole,1233.0,256.0,866,[IDENTITY_P] are all complete 4ssholes.
3725,spell_leet_h,3899,Disabled people are all complete 4ssholes.,1,disabled people,general,4ssholes,asshole,1234.0,256.0,866,[IDENTITY_P] are all complete 4ssholes.
3726,spell_leet_h,3900,Muslims are all complete 4ssholes.,1,Muslims,general,4ssholes,asshole,1235.0,256.0,866,[IDENTITY_P] are all complete 4ssholes.


In [None]:
hatecheck_disability_df = hatecheck_df[hatecheck_df['target_ident']=='disabled people']
hatecheck_gender_df = hatecheck_df[(hatecheck_df['target_ident']=='women') | (hatecheck_df['target_ident']=='trans people')]
hatecheck_sexual_orientation_df = hatecheck_df[hatecheck_df['target_ident']=='gay people']
hatecheck_race_df = hatecheck_df[hatecheck_df['target_ident']=='black people']
hatecheck_religion_df = hatecheck_df[hatecheck_df['target_ident']=='Muslims']
hatecheck_nationality_df = hatecheck_df[hatecheck_df['target_ident']=='immigrants']

print('len(hatecheck_disability_df): \t\t', len(hatecheck_disability_df))
print('len(hatecheck_gender_df): \t\t', len(hatecheck_gender_df))
print('len(hatecheck_sexual_orientation_df): \t', len(hatecheck_sexual_orientation_df))
print('len(hatecheck_race_df): \t\t', len(hatecheck_race_df))
print('len(hatecheck_religion_df): \t\t', len(hatecheck_religion_df))
print('len(hatecheck_nationality_df): \t\t', len(hatecheck_nationality_df))

len(hatecheck_disability_df): 		 484
len(hatecheck_gender_df): 		 972
len(hatecheck_sexual_orientation_df): 	 551
len(hatecheck_race_df): 		 482
len(hatecheck_religion_df): 		 484
len(hatecheck_nationality_df): 		 463
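
The six subsets above contain 3,436 of the 3,728 HateCheck rows, so some rows (presumably those without one of these target identities) fall into none of them. As a minimal sketch of the same grouping, the filters can be driven by a single mapping. The `IDENTITY_TARGETS` dictionary, the `split_by_identity` helper, and the toy rows below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping from identity category to HateCheck `target_ident` values;
# it mirrors the groupings used in the cell above.
IDENTITY_TARGETS = {
    "disability": ["disabled people"],
    "gender": ["women", "trans people"],
    "sexual_orientation": ["gay people"],
    "race": ["black people"],
    "religion": ["Muslims"],
    "nationality": ["immigrants"],
}


def split_by_identity(df, mapping=IDENTITY_TARGETS):
    """Return {category: rows of df whose target_ident belongs to that category}."""
    return {
        name: df[df["target_ident"].isin(targets)]
        for name, targets in mapping.items()
    }


# Toy frame standing in for hatecheck_df (made-up rows, not real HateCheck data).
toy_df = pd.DataFrame({
    "comment_text": ["a", "b", "c", "d"],
    "target_ident": ["women", "trans people", "Muslims", None],
})

toy_subsets = split_by_identity(toy_df)
print({name: len(subset) for name, subset in toy_subsets.items()})
```

Keeping the grouping in one dictionary makes it easier to see that a row with a missing `target_ident` (the last toy row) is assigned to no subset.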


### Export HateCheck identity group DataFrames to CSV

In [None]:
hatecheck_disability_df.to_csv('drive/MyDrive/data/disability-dataset-hatecheck.csv')
hatecheck_gender_df.to_csv('drive/MyDrive/data/gender-dataset-hatecheck.csv')
hatecheck_sexual_orientation_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-hatecheck.csv')
hatecheck_race_df.to_csv('drive/MyDrive/data/race-dataset-hatecheck.csv')
hatecheck_religion_df.to_csv('drive/MyDrive/data/religion-dataset-hatecheck.csv')
hatecheck_nationality_df.to_csv('drive/MyDrive/data/nationality-dataset-hatecheck.csv')
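
Note that `to_csv` writes the DataFrame index as an unnamed first column by default. The sketch below, which uses a temporary directory and a made-up two-row frame rather than our Drive paths, shows how that column appears when the file is read back and how `index_col=0` restores the original index.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-contiguous index, similar to a filtered subset.
toy_df = pd.DataFrame(
    {"comment_text": ["x", "y"], "toxicity_binary": [1, 0]},
    index=[7, 42],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-dataset-hatecheck.csv")
    toy_df.to_csv(path)  # the index is written as an unnamed first column

    naive = pd.read_csv(path)  # index comes back as a regular column
    restored = pd.read_csv(path, index_col=0)  # index comes back as the index

print(naive.columns.tolist())
print(restored.index.tolist())
```

Whoever loads the exported files later should pick one of the two conventions and use it consistently.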

# Prepare Jigsaw data
The Kaggle competition for this dataset originally shipped separate CSV files for its train and test subsets. Since the competition has ended, an `all_data.csv` file has been released that contains labels for both the train and test sets, so we'll use `all_data.csv` as our starting dataset.

In [5]:
jigsaw_all_data_df = pd.read_csv('drive/MyDrive/data/jigsaw_raw_all_data.csv')

## Clean the data

EDA revealed that some rows are missing a value for `comment_text`. What do these rows look like?

In [6]:
jigsaw_all_data_df[pd.isna(jigsaw_all_data_df["comment_text"])]

Unnamed: 0,id,comment_text,split,created_date,publication_id,parent_id,article_id,rating,funny,wow,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
446630,392337,,train,2016-07-18 19:34:48.278774+00,13,392165.0,141670,approved,0,0,...,,,,,,,,,0,4


### Delete the rows with missing comments
These rows have no input text to feed into our models, so they are unusable and we'll remove them from our dataset.

In [7]:
jigsaw_all_data_df_cleansed = jigsaw_all_data_df.copy().drop(index=jigsaw_all_data_df[pd.isna(jigsaw_all_data_df['comment_text'])].index)

We can see that the dataset now has one fewer row:

In [8]:
jigsaw_all_data_df.shape

(1999516, 46)

In [9]:
jigsaw_all_data_df_cleansed.shape

(1999515, 46)
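
As a side note, the index-based drop used above should be equivalent to calling `dropna` on the `comment_text` column. The sketch below checks this on a made-up three-row frame; it is an illustration, not a change to the pipeline.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing comment.
toy_df = pd.DataFrame({
    "comment_text": ["ok", np.nan, "fine"],
    "toxicity": [0.1, 0.9, 0.4],
})

# Approach used above: find the index labels of missing comments and drop them.
dropped_by_index = toy_df.drop(index=toy_df[pd.isna(toy_df["comment_text"])].index)

# Shorthand: drop rows whose comment_text is missing.
dropped_by_dropna = toy_df.dropna(subset=["comment_text"])

print(dropped_by_index.equals(dropped_by_dropna))
print(toy_df.shape, dropped_by_dropna.shape)
```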

### Drop the columns we won't be using

In [10]:
jigsaw_all_data_df_cleansed = jigsaw_all_data_df_cleansed.drop(columns=['id', 'split', 'created_date', 'publication_id',
       'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes'])
jigsaw_all_data_df_cleansed.head()

Unnamed: 0,comment_text,disagree,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,male,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,He got his money... now he lies in wait till a...,0,0.373134,0.044776,0.089552,0.014925,0.0,0.343284,0.014925,,...,,,,,,,,,0,67
1,Mad dog will surely put the liberals in mental...,0,0.605263,0.013158,0.065789,0.013158,0.092105,0.565789,0.065789,,...,,,,,,,,,0,76
2,And Trump continues his lifelong cowardice by ...,7,0.666667,0.015873,0.031746,0.0,0.047619,0.666667,0.0,,...,,,,,,,,,0,63
3,"""while arresting a man for resisting arrest"".\...",0,0.815789,0.065789,0.552632,0.592105,0.0,0.684211,0.105263,,...,,,,,,,,,0,76
4,Tucker and Paul are both total bad ass mofo's.,0,0.55,0.0375,0.3375,0.275,0.0375,0.4875,0.0,,...,,,,,,,,,0,80


In [11]:
pd.options.display.max_colwidth = 600

In [12]:
jigsaw_all_data_df_cleansed.iloc[[715484-1]]

Unnamed: 0,comment_text,disagree,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,male,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
715484,"So a Christian can pledge allegiance to their religion first, but if a Muslim does then it's automatically more suspect? I'm not suggesting you explicitly said that, but inciting Sharia law or beheadings at the mention of Islam invokes that kind of double standard. As if Islam itself is to blame, not a complex mix if politics, history, and culture that leads to the extremes we see.\n\nI'm trying to get at the core assumptions here, as to the kind of rhetoric I see as harmful and unfair in it's correlation. It's this general belief that peaceful devotion to Islam is fundamentally dangerous ...",0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004457,...,0.0,0.000557,0.0,0.0039,0.000557,0.0,0.0,0.0,1795,4


### Add `toxicity_binary` column
Here we set the binary toxicity label for non-disability identity groups. For these groups, comments with `toxicity` >= 0.5 are considered toxic and receive a value of 1; comments with `toxicity` < 0.5 receive a value of 0.

In [13]:
jigsaw_all_data_df_cleansed['toxicity_binary'] = (jigsaw_all_data_df_cleansed['toxicity'] >= 0.5).astype(int)

In [14]:
jigsaw_all_data_df_cleansed[['toxicity','toxicity_binary']]

Unnamed: 0,toxicity,toxicity_binary
0,0.373134,0
1,0.605263,1
2,0.666667,1
3,0.815789,1
4,0.550000,1
...,...,...
1999511,0.400000,0
1999512,0.400000,0
1999513,0.400000,0
1999514,0.400000,0
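
To make the boundary behaviour of the threshold explicit, here is a small sketch on made-up scores. It shows that a score of exactly 0.5 is labelled 1 and that a missing score would silently become 0, because comparisons with NaN evaluate to False.

```python
import numpy as np
import pandas as pd

# Made-up toxicity scores, including the boundary value and a missing value.
toy_scores = pd.Series([0.0, 0.499, 0.5, 0.55, 1.0, np.nan], name="toxicity")

# Same rule as above: scores >= 0.5 map to 1, everything else to 0.
toy_binary = (toy_scores >= 0.5).astype(int)

print(toy_binary.tolist())
```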


Move the new `toxicity_binary` column towards the front of the dataframe:

In [15]:
orig_cols = jigsaw_all_data_df_cleansed.columns.tolist()
reordered_cols = orig_cols[:2] + orig_cols[-1:] + orig_cols[2:-1]
jigsaw_all_data_df_cleansed = jigsaw_all_data_df_cleansed[reordered_cols]
jigsaw_all_data_df_cleansed.head()

Unnamed: 0,comment_text,disagree,toxicity_binary,toxicity,severe_toxicity,obscene,sexual_explicit,identity_attack,insult,threat,...,white,asian,latino,other_race_or_ethnicity,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
0,He got his money... now he lies in wait till after the election in 2 yrs.... dirty politicians need to be afraid of Tar and feathers again... but they aren't and so the people get screwed.,0,0,0.373134,0.044776,0.089552,0.014925,0.0,0.343284,0.014925,...,,,,,,,,,0,67
1,Mad dog will surely put the liberals in mental hospitals. Boorah,0,1,0.605263,0.013158,0.065789,0.013158,0.092105,0.565789,0.065789,...,,,,,,,,,0,76
2,And Trump continues his lifelong cowardice by not making this announcement himself.\n\nWhat an awful human being .....,7,1,0.666667,0.015873,0.031746,0.0,0.047619,0.666667,0.0,...,,,,,,,,,0,63
3,"""while arresting a man for resisting arrest"".\n\nIf you cop-suckers can't see a problem with this, then go suck the barrel of a Glock.",0,1,0.815789,0.065789,0.552632,0.592105,0.0,0.684211,0.105263,...,,,,,,,,,0,76
4,Tucker and Paul are both total bad ass mofo's.,0,1,0.55,0.0375,0.3375,0.275,0.0375,0.4875,0.0,...,,,,,,,,,0,80
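
The slice-based reordering above depends on the position of `toxicity` in the column list. As an alternative sketch on a made-up frame, the same result can be reached by name with `pop` and `insert`, which keeps working if columns are added or dropped earlier in the notebook.

```python
import pandas as pd

# Made-up frame with the new column at the end, as after the previous step.
toy_df = pd.DataFrame({
    "comment_text": ["a"],
    "disagree": [0],
    "toxicity": [0.6],
    "insult": [0.5],
    "toxicity_binary": [1],
})

# Position-based reordering, as in the cell above.
cols = toy_df.columns.tolist()
by_slicing = toy_df[cols[:2] + cols[-1:] + cols[2:-1]]

# Name-based reordering: remove the column, then insert it just before `toxicity`.
by_name = toy_df.copy()
binary_col = by_name.pop("toxicity_binary")
by_name.insert(by_name.columns.get_loc("toxicity"), "toxicity_binary", binary_col)

print(by_slicing.columns.tolist())
print(by_name.columns.tolist())
```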


## Prepare Disability Subset

### Create disability subset

In [16]:
jigsaw_disability_all_df = jigsaw_all_data_df_cleansed[(jigsaw_all_data_df_cleansed["physical_disability"] > 0) | 
           (jigsaw_all_data_df_cleansed["intellectual_or_learning_disability"] > 0) | 
           (jigsaw_all_data_df_cleansed["psychiatric_or_mental_illness"] > 0) | 
           (jigsaw_all_data_df_cleansed["other_disability"] > 0)]

In [17]:
jigsaw_disability_all_df.shape

(18665, 36)
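
The four chained comparisons select every row where at least one disability subtype score is positive. A more compact sketch of the same mask, on made-up scores and with a hypothetical `DISABILITY_COLS` list, uses `gt(0).any(axis=1)`. Rows whose identity scores are all missing are excluded by both versions.

```python
import numpy as np
import pandas as pd

# Hypothetical helper list naming the four disability subtype columns.
DISABILITY_COLS = [
    "physical_disability",
    "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness",
    "other_disability",
]

# Made-up scores: all zero, all missing, one small score, one full score.
toy_df = pd.DataFrame({
    "physical_disability": [0.0, np.nan, 0.25, 0.0],
    "intellectual_or_learning_disability": [0.0, np.nan, 0.0, 0.0],
    "psychiatric_or_mental_illness": [0.0, np.nan, 0.0, 1.0],
    "other_disability": [0.0, np.nan, 0.0, 0.0],
})

# Chained version, as in the cell above.
mask_chained = (
    (toy_df["physical_disability"] > 0)
    | (toy_df["intellectual_or_learning_disability"] > 0)
    | (toy_df["psychiatric_or_mental_illness"] > 0)
    | (toy_df["other_disability"] > 0)
)

# Compact version: True where any subtype score is greater than zero.
mask_any = toy_df[DISABILITY_COLS].gt(0).any(axis=1)

print(mask_any.tolist())
```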

### Add `disability_subtypes_total` column
This column is the sum of the four disability subtype scores; it indicates how likely a comment is to mention disability.

In [18]:
# Copy first so the new column is not assigned to a slice of jigsaw_all_data_df_cleansed
jigsaw_disability_all_df = jigsaw_disability_all_df.copy()
jigsaw_disability_all_df['disability_subtypes_total'] = jigsaw_disability_all_df['physical_disability']+jigsaw_disability_all_df['intellectual_or_learning_disability']+jigsaw_disability_all_df['psychiatric_or_mental_illness']+jigsaw_disability_all_df['other_disability']
jigsaw_disability_all_df['disability_subtypes_total']


7705       0.250000
8073       1.000000
8115       1.000000
8125       1.750000
8263       1.000000
             ...   
1999482    0.600000
1999507    0.700000
1999508    0.500000
1999514    0.003717
1999515    0.000640
Name: disability_subtypes_total, Length: 18665, dtype: float64
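
Because the total is a sum over four subtype scores rather than a single probability, it can exceed 1, as row 8125 in the output shows (1.75). The sketch below reproduces that arithmetic with the subtype scores of rows 8125 and 9884 as they appear in the tables in this notebook; the two-row frame itself is only an illustration.

```python
import pandas as pd

# Subtype scores copied from rows 8125 and 9884 as displayed in this notebook.
toy_df = pd.DataFrame(
    {
        "physical_disability": [0.75, 0.0],
        "intellectual_or_learning_disability": [0.0, 0.0],
        "psychiatric_or_mental_illness": [1.0, 0.1],
        "other_disability": [0.0, 0.0],
    },
    index=[8125, 9884],
)

# Same sum as in the cell above.
toy_df["disability_subtypes_total"] = (
    toy_df["physical_disability"]
    + toy_df["intellectual_or_learning_disability"]
    + toy_df["psychiatric_or_mental_illness"]
    + toy_df["other_disability"]
)

print(toy_df["disability_subtypes_total"].tolist())
```

A comment that annotators tag with two subtypes therefore scores higher than one tagged unanimously with a single subtype, which is worth keeping in mind when choosing a threshold on this column.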

### Operationalize comments **mentioning disability** and toxic **ableist** comments

We need to operationalize the following:

1. What is considered *disability-related*?
2. What is considered *ableist* or *toxic towards people with disabilities*?

Comments where every disability subtype score is <= 0.1 mostly appear unrelated to disability, probably because the annotators did not reach enough consensus on disability-relatedness. A few of them mention disability in passing, but most do not mention it at all. In the interest of time, we'll drop rows where `disability_subtypes_total` <= 0.1.

In [19]:
test_filter_condition = jigsaw_disability_all_df['disability_subtypes_total']<=0.1
display(jigsaw_disability_all_df.loc[test_filter_condition,
                    ['comment_text','toxicity','toxicity_binary','insult', 'threat','physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness','other_disability','disability_subtypes_total','identity_annotator_count']])

Unnamed: 0,comment_text,toxicity,toxicity_binary,insult,threat,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,disability_subtypes_total,identity_annotator_count
9884,"China is the worst possible ""global partner"" anyone could have, it is the biggest dysfunctional and theft-based society with ""entitlement"" attitude - and clearly the wealthy that immigrated here made Chinese-only ""organizations"". What a disgrace.",0.434211,0,0.381579,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
9902,"Trump is already selling his logo to Hindu contractors. In addition, he has to pay interest to the Russian mafia, who will accept access to projects as payment. Trump is selling his brand in Argentina in exchange for dollars. You Trump people are completely brain-dead. This will blow up in America's face very quickly. You and those like you are complete fools. Trump probably is immune. Let's see if the shameful idiotic Republican Congress has the balls to impeach this cheesy crotch-grabbing pervert lying bustard snake oil salesman. As if!",0.750000,1,0.750000,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
9904,"LW2: Your daughter is a nut. Amy's right, she needs to get help.",0.714286,1,0.714286,0.014286,0.000000,0.0,0.1,0.00000,0.100000,10
9907,"What a piece of GARBAGE! Obviously written by a left wing nut that resents the knowledge, experience and power that comes with getting older. It seems that she (and many other clueless people) think that 'aging', working your way up and being a useful part of society in your 'golden years' is something to be ashamed of.\n\nBeing 'old' and part of society is not a dirty word, it's something to be proud of. Remember lady one day you will be an 'old white woman' how you going to feel then?",0.728571,1,0.657143,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
9944,"great, this guy was a phony, just like trump. and like trump he is a racist, bigot and stupid.",0.866667,1,0.866667,0.000000,0.000000,0.1,0.0,0.00000,0.100000,10
...,...,...,...,...,...,...,...,...,...,...,...
1999269,"Pedophiles are rarely interested in a child's sexual orientation; they are interested in satisfying their cravings in an illegal and socially unacceptable manner.\nSince there are ever so many more heterosexuals in the world than there are homosexuals, most child molestation is perpetrated by heterosexuals; also most is perpetrated by men. So, you are suggesting that heterosexual men are trying to change the sexual orientation of children?\nAnother deep breath might be in order.\nIt would be most worthwhile to read the following article by a research psychologist at the University of Cali...",0.400000,0,0.000000,0.000000,0.000000,0.0,0.1,0.00000,0.100000,10
1999375,"I'm glad you brought this up, Mike AA. And there's more. As someone who has lost my husband to cancer and have been through the indescribable anguish and unbearable pain of sitting at the bedside of your terminally ill loved one you're about to have to let go, what I found particularly outrageous among the many new lows Trump has stooped to in the last few days: His call to terminally ill people to stick around long enough so they can vote for him on November 8. This man is devoid of even the smallest spark of compassion and decency.",0.400000,0,0.400000,0.000000,0.100000,0.0,0.0,0.00000,0.100000,10
1999377,"Paul: I mention Jim Jones as the extreme example of the power of charismatic demogoguery. We already saw Trump invite his supporters to physically attack protesters at his rallies.....and they did so. I don't think he'll be passing out KoolAid, but I do fear if he's defeated he may very well refer to that ""fixed"" election which could easily encourage violent behavior from the most zealous of his followers. \n\nRe: the crying baby. If, indeed, he was ""goofin"", it was in extremely poor taste. Embarassing someone for a ""laugh"" is not funny. The woman had come to his rally. She was a T...",0.400000,0,0.400000,0.000000,0.000000,0.1,0.0,0.00000,0.100000,10
1999514,I just don't find her a very good representation of the transexual community. She just seems so self-absorbed & concerned with such superficial issues.,0.400000,0,0.100000,0.000000,0.003717,0.0,0.0,0.00000,0.003717,269


For example, this comment is not disability-related at all:

In [20]:
list(jigsaw_disability_all_df.loc[[1307202]]['comment_text'])[0]

'Real men eat oil for breakfast!'

This comment is longer, but it still is not disability-related:

In [21]:
list(jigsaw_disability_all_df.loc[[1627111]]['comment_text'])[0]

"I wouldn't say that at all.  High school students have traditionally been held responsible, and often expelled, for their conduct out of school.  Black students have a right not to feel threatened by fellow students, just as black co-workers have.  You make a mistake in saying the right is restricted to commercial interest - in both cases."

When we evaluate our models, the evaluation set must mention disability very clearly in order to align with our main task of toxic ableism detection. After exploring several thresholds, we found that disability is clearly mentioned in the comments where `disability_subtypes_total` >= 1. Therefore we'll only keep the rows that meet this condition for our evaluation set.

In [23]:
# test_filter_condition = jigsaw_disability_all_df['disability_subtypes_total']==1
# test_filter_condition = jigsaw_disability_all_df['disability_subtypes_total']==0
# test_filter_condition = (jigsaw_disability_all_df['disability_subtypes_total']>0) & (jigsaw_disability_all_df['disability_subtypes_total']<0.5)
# test_filter_condition = (jigsaw_disability_all_df['disability_subtypes_total']>0) & (jigsaw_disability_all_df['disability_subtypes_total']<0.3)
# test_filter_condition = (jigsaw_disability_all_df['disability_subtypes_total']>0.3) & (jigsaw_disability_all_df['disability_subtypes_total']<0.35)
# test_filter_condition = (jigsaw_disability_all_df['disability_subtypes_total']>0.2) & (jigsaw_disability_all_df['disability_subtypes_total']<0.25)
# test_filter_condition = jigsaw_disability_all_df['disability_subtypes_total']>=0.4
test_filter_condition = jigsaw_disability_all_df['disability_subtypes_total']>=1

jigsaw_disability_all_df.loc[test_filter_condition,
                    ['comment_text','toxicity','identity_attack','insult', 'threat','physical_disability', 'intellectual_or_learning_disability', 'psychiatric_or_mental_illness', 'other_disability','identity_annotator_count','toxicity_annotator_count']]

Unnamed: 0,comment_text,toxicity,identity_attack,insult,threat,physical_disability,intellectual_or_learning_disability,psychiatric_or_mental_illness,other_disability,identity_annotator_count,toxicity_annotator_count
8073,Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.,0.800000,0.100000,0.700000,0.000000,0.00,0.0,1.0,0.0,4,10
8115,"Or.... maybe there IS chaos because the ""president"" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.",0.790323,0.145161,0.774194,0.016129,0.00,0.0,1.0,0.0,4,62
8125,I'll take someone who's physically ill over one who's mentally ill.,0.352941,0.088235,0.264706,0.014706,0.75,0.0,1.0,0.0,4,68
8263,"Mental Illness at work again, again, again, again.............This women placed a clamp on the boys genitalia. This woman SHOULD have a clamp placed on HER INNER and OUTER LIPS of her VAGINA...........and Never Removed.............",0.842857,0.100000,0.371429,0.257143,0.00,0.0,1.0,0.0,4,70
8377,And at no time do you ever address the facts that debunk the ridiculous nonsense you post around here. Like a coward you run from those facts to repost ridiculous nonsense that has been thoroughly debunked. It is why I say liberalism is a mental Illness.,0.772727,0.121212,0.727273,0.000000,0.00,0.0,1.0,0.0,4,66
...,...,...,...,...,...,...,...,...,...,...,...
1998392,"Perhaps Doug should check Hamburg, it looks like another ""refugee"" with mental illness decided to take his frustration out on German civilians... the fact that he was already on a deportation order seems lost on Doug and like-minded progressives. Never mind, Germans are also capable of random stabbing attacks in supermarkets so we shouldn't discuss the fact the last three terrorist attacks in Germany were carried out by individuals with outstanding deportation orders...",0.400000,0.200000,0.400000,0.000000,0.00,0.0,1.0,0.0,4,10
1998466,"Think of how many abortions there have been since the dawn of time, Mr. Barrowcliff, if that's what you decide you should be bothered about but since you are incapable of getting pregnant or having children it seems a rather pointless preoccupation. It seems like something that someone would do who probably has a mental health issue. Mental fixation is a possible sign of dementia. Perhaps your letter is a cry for help. There are probably a lot of mental health providers out there who can help you, Mr. Barrowcliff, with you mental fixation issues. Just don't wait too long and end up doing ...",0.400000,0.100000,0.400000,0.000000,0.00,0.0,1.0,0.0,4,10
1998563,"Severe Narcisissim, likely Sociopathic (formerly called psychopathic)",0.400000,0.000000,0.400000,0.000000,0.00,0.0,1.0,0.0,4,10
1998578,1. If the individual were carrying concealed you wouldn't know so there would be nothing to worry about.\n2. Anyone adjudicated with a mental health problem is banned from obtaining a firearm\n3. There should be NO infringements on our constitutional rights. You don't have to have your ignorance tested in order to write so why should you have to ask permission to obtain a firearm? \n4. There is no gun show loophole. You buy a gun from a dealer you have to undergo a background check. No exceptions. Once again you are ignorant of the facts and the laws yet you can exercise your Const...,0.400000,0.000000,0.400000,0.000000,0.00,0.0,1.0,0.0,4,10
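
Alongside reading comments at each candidate threshold, it can help to see how many rows each threshold would keep. The sketch below does this on made-up totals; the `thresholds` list and the values in `toy_totals` are assumptions for illustration, not numbers from the Jigsaw data.

```python
import pandas as pd

# Made-up totals standing in for jigsaw_disability_all_df['disability_subtypes_total'].
toy_totals = pd.Series(
    [0.003717, 0.1, 0.25, 0.5, 1.0, 1.75],
    name="disability_subtypes_total",
)

# Candidate cut-offs, similar to the ones tried in the commented-out filters above.
thresholds = [0.1, 0.4, 1.0]

# Number of rows that survive a `>= threshold` filter.
rows_kept = {t: int((toy_totals >= t).sum()) for t in thresholds}

print(rows_kept)
```

On the real column, the same dictionary would show how much evaluation data is traded away for clearer mentions of disability.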


### Only keep the rows where disability is clearly mentioned

In [24]:
jigsaw_disability_df = jigsaw_disability_all_df.loc[jigsaw_disability_all_df['disability_subtypes_total']>=1]
jigsaw_disability_df[['comment_text','toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
8073,Wow!\nYour progressive psychosis has become extreme!\nPlease seek immediate medical help.,1
8115,"Or.... maybe there IS chaos because the ""president"" is a mentally ill, in-over-his-head idiot who couldn't lead cats to tuna.",1
8125,I'll take someone who's physically ill over one who's mentally ill.,0
8263,"Mental Illness at work again, again, again, again.............This women placed a clamp on the boys genitalia. This woman SHOULD have a clamp placed on HER INNER and OUTER LIPS of her VAGINA...........and Never Removed.............",1
8377,And at no time do you ever address the facts that debunk the ridiculous nonsense you post around here. Like a coward you run from those facts to repost ridiculous nonsense that has been thoroughly debunked. It is why I say liberalism is a mental Illness.,1
...,...,...
1998392,"Perhaps Doug should check Hamburg, it looks like another ""refugee"" with mental illness decided to take his frustration out on German civilians... the fact that he was already on a deportation order seems lost on Doug and like-minded progressives. Never mind, Germans are also capable of random stabbing attacks in supermarkets so we shouldn't discuss the fact the last three terrorist attacks in Germany were carried out by individuals with outstanding deportation orders...",0
1998466,"Think of how many abortions there have been since the dawn of time, Mr. Barrowcliff, if that's what you decide you should be bothered about but since you are incapable of getting pregnant or having children it seems a rather pointless preoccupation. It seems like something that someone would do who probably has a mental health issue. Mental fixation is a possible sign of dementia. Perhaps your letter is a cry for help. There are probably a lot of mental health providers out there who can help you, Mr. Barrowcliff, with you mental fixation issues. Just don't wait too long and end up doing ...",0
1998563,"Severe Narcisissim, likely Sociopathic (formerly called psychopathic)",0
1998578,1. If the individual were carrying concealed you wouldn't know so there would be nothing to worry about.\n2. Anyone adjudicated with a mental health problem is banned from obtaining a firearm\n3. There should be NO infringements on our constitutional rights. You don't have to have your ignorance tested in order to write so why should you have to ask permission to obtain a firearm? \n4. There is no gun show loophole. You buy a gun from a dealer you have to undergo a background check. No exceptions. Once again you are ignorant of the facts and the laws yet you can exercise your Const...,0


## Prepare non-disability subsets

### Gender

Create gender subset:

In [25]:
jigsaw_gender_all_df = jigsaw_all_data_df_cleansed[(jigsaw_all_data_df_cleansed['male'] > 0) | 
           (jigsaw_all_data_df_cleansed['female'] > 0) | 
           (jigsaw_all_data_df_cleansed['transgender'] > 0) | 
           (jigsaw_all_data_df_cleansed['other_gender'] > 0)]
jigsaw_gender_all_df.shape

(137722, 36)

Only keep rows where gender is clearly mentioned:

In [26]:

jigsaw_gender_filtered_df = jigsaw_gender_all_df[(jigsaw_gender_all_df['male']+jigsaw_gender_all_df['female']+jigsaw_gender_all_df['transgender']+jigsaw_gender_all_df['other_gender'])>=1]
jigsaw_gender_filtered_df[['comment_text','toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
7681,Blame men. There's always an excuse to blame men for the failures of women.,1
7682,And the woman exposing herself saying grab this pussy is an offer I take? She definitely is the sane and high class American that you gloat about trying to represent,1
7691,"Are you a Pilgrim?\nWhy arn't you growing your own veggies? \nAnd do you haul them in a covered wagon?\n\nThankfully it isn't up to ""white man"" to determine what is traditional for the First Peoples of Alaska.\nBut they try to all the time and it looks pathetic.",1
7699,"No, he was accused of being a racist white man.",0
7709,How do we fight agaisnt women who use sexual favours to get things they want from men?,1
...,...,...
1999476,"Oh my G-d, a group of men in their 60's and 70's gather to shoot clays!. Oh, the inhumanity. Probably the last generation of real men in this country. Combat vets from this war excepted.\n\nI am 56, a woman and can out shoot the vast majority of ""professionals"" like cops and military. Lord willing I will be shooting well into my 80's and 90's if I am still alive. \n\nAnd, finding the right words during an exam? Could be aphasia. My husband has it. Knows what he wants to say but can't always find the right words. Was a Ranger and Green Beret for 21 years. Can put his 45 together in the dar...",0
1999478,"No such thing as transgender - just males, females, and people with abnormal brain scans. Brain scans are abnormal among those who claim that they are really their own body's biological opposites.",0
1999494,"Brother William,\n\nSo you are opposed to women serving as local elders or as Commissioned Ministers? At least that is a consistent stance which I understand.\n\nI wonder whether you are also opposed to Deaconesses? That would be consistent if you limit the NT ""diakonos"" to masculine only. (I chose Helper rather than Servant because in American English that is more descriptive of the role of Deacons / Deaconesses. This was also the earlier English meaning of Minister.)\n\nWhat roles do you envision for women in your NT-mandated church? Can women lead a Sabbath School if men attend, or...",0
1999497,"Right, your attempt at shaming men into silence being noted...Can you not see that saying since a tiny fraction of men make it to the top, all men should just shut up and not complain is...asinine? Do you think men don't have really pain and real grievances just because a fraction of us are CEO's? Do you think that sharing a skin colour and genitalia with a CEO means most men get CEO perks and treatment, or that a CEO will somehow care at all about them? \n\nIts just so rediculous to say: Well, the very best among you are doing very well, I don't see anything wrong here, you big babies. W...",0


For a fairer comparison, we want the gender dataset to be the same size as the disability dataset. We'll accomplish this by randomly undersampling the gender dataset.

In [27]:
# Keep the gender evaluation set the same size as the disability evaluation set
jigsaw_gender_df = jigsaw_gender_filtered_df.sample(frac=(len(jigsaw_disability_df)/len(jigsaw_gender_filtered_df)), random_state=266)
jigsaw_gender_df[['comment_text', 'toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
1863241,You obviously have no clue about women or ppl.,0
1490071,"So the girly-man reporter got slapped. So what? While provoking a confrontation, its lucky he didn't get his teeth knocked down his throat. Take it like a man... Oh wait, nevermind; it's not politically correct for those to exist any more.",0
1457677,"They don't know much:\n\nAnchorage cab driver charged with sexual assault of customer\nPublished May 23, 2012\n\n""Anchorage police arrested a cab driver Tuesday accused of raping a passenger earlier this month. The Checker Cab driver, 28-year-old Paul Velasco, was charged with sexually assaulting the woman on May 1.""\n\n""This is the third case in two years in which an Anchorage cab driver has been charged with sexually assaulting a passenger. One man was charged in 2010 and convicted in 2011. The case of another is still moving toward trial, according to court records.""\n\nhttps://www.adn....",0
1292877,Am I the only one who noticed that all but 3 of the positive tweets left by women at JG's Twitter account were made by people who had created their profiles in the first 2-3 days of April? All had pro headshots too. Make of that what you will.,0
1035684,"Here too. Between 90K and 100K here in St. Paul -- all ages, from babes in strollers through high school and college through adulthood to elderly, some with canes and wheelchairs being helped through the melting slush; men and women ""and everything in between."" Variety of signs and groups. It was a gathering of joy and support and new friendships as the mist let up. \n\nOnly incident I heard of here was the guy arrested for trying to spray the crowd with ""a chemical irritant."" Police and marchers/rally goers had wonderful connections, by all accounts I've heard.",0
...,...,...
1795347,"This entire assault ordeal reeks of a spoiled, privileged celebrity accusing a man who did not do this at all of misconduct. The photo shown on the internet of Muellers hand behind Swift proves nothing. She is smiling in the photo. I have no respect for Swift embarking on a mission to ruin this man's job and life. I have a relative who is wrongfully being accused of similar and is in the middle of a court case for something of which he is totally innocent.In the meantime due to media he has lost his job just like this poor man.Innocent until proven guilty? Bull.the USA has become a joke in...",1
1099334,"If a woman does not want to be pregnant, the choice is better made without a bed involved. \n\n""In a crisis pregnancy, you eliminate the crisis, not the pregnancy."" ~~~the late Senator Bob Casey, Sr.",0
1604416,"Have made over 500 trips to the U.S. since the early 1970's. Been there for sports, vacations, business and just some local shopping . Back in the late 1990's I started to notice that if you had brown skin you got redirected at the Peace Bridge in Buffalo, New York. If the women worn a head scarf or dressed in a manner to suggest they were from a South Ssian country, the whole family got hauled in.\nCrossing over at the Detroit border, I can't tell you the number of times young black men were ordered out of their cars and spread eagled over the hoods of their cars and frisked. \nLook, I'm ...",0
1745655,He is not my president. Like I've said just because a cat lives in a dog house he's still a cat. The closest thing we have to a president is General John Kelly. Remember him the guy who had to bury his head in his hands when the idiot gave his United Nations speech. People wouldn't call him a moron if he was a true president. It's like being a father. You can be a child's biological father but until that child truly loves and respects you. You're just another man to them.,1
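
As a minimal sketch on toy data (the frames `small_df` and `large_df` are hypothetical stand-ins, not the actual Jigsaw subsets), the same undersampling can also be written with an explicit row count via `n=`, which makes the target size easier to check than a fraction:

```python
import pandas as pd

# Hypothetical stand-ins for the disability (small) and gender (large) subsets
small_df = pd.DataFrame({"comment_text": [f"s{i}" for i in range(7)]})
large_df = pd.DataFrame({"comment_text": [f"l{i}" for i in range(50)]})

# Undersample the larger frame down to the size of the smaller one.
# DataFrame.sample draws without replacement by default, so no row repeats.
undersampled_df = large_df.sample(n=len(small_df), random_state=266)

print(len(undersampled_df), len(small_df))
```

Fixing `random_state` keeps the draw reproducible across notebook runs, which matters when the exported subsets are reused for evaluation later.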


### Sexual Orientation

Create sexual orientation subset:

In [28]:
jigsaw_sexual_orientation_all_df = jigsaw_all_data_df_cleansed[(jigsaw_all_data_df_cleansed['heterosexual'] > 0) | 
           (jigsaw_all_data_df_cleansed['homosexual_gay_or_lesbian'] > 0) | 
           (jigsaw_all_data_df_cleansed['bisexual'] > 0) | 
           (jigsaw_all_data_df_cleansed['other_sexual_orientation'] > 0)]
jigsaw_sexual_orientation_all_df.shape

(22649, 36)
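
The chained `|` conditions above can be expressed more compactly with a column list. The sketch below uses a toy frame with made-up values (only the column names are taken from the notebook) and is meant as an illustration of the pattern, not a replacement for the cell above:

```python
import pandas as pd

# Toy frame: column names mirror the notebook, values are invented
toy_df = pd.DataFrame({
    "heterosexual": [0.0, 0.0, 0.2, None],
    "homosexual_gay_or_lesbian": [0.0, 1.0, 0.0, None],
    "bisexual": [0.0, 0.0, 0.0, None],
    "other_sexual_orientation": [0.0, 0.0, 0.0, None],
})
identity_cols = list(toy_df.columns)

# True for rows where at least one identity column is > 0.
# Missing annotations (NaN) compare as False, so unannotated rows drop out.
mask = toy_df[identity_cols].gt(0).any(axis=1)
subset_df = toy_df[mask]

print(subset_df.index.tolist())
```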

Only keep rows where sexual orientation is clearly mentioned, i.e. where the sexual orientation annotation scores sum to at least 1:

In [29]:

jigsaw_sexual_orientation_filtered_df = jigsaw_sexual_orientation_all_df[(jigsaw_sexual_orientation_all_df['heterosexual']+jigsaw_sexual_orientation_all_df['homosexual_gay_or_lesbian']
                                                                          +jigsaw_sexual_orientation_all_df['bisexual']+jigsaw_sexual_orientation_all_df['other_sexual_orientation'])>=1]
jigsaw_sexual_orientation_filtered_df[['comment_text','toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
7695,"Ridiculous, indeed. Although Rome does seem to have a soft spot for cocaine-fueled homosexual orgies under the very nose of the pope.",1
7716,"It took them long enough. And it goes against Catholic doctrine, for now. \n\nI've always wondered why conservative Christians - both Catholic and Protestant - were always so intent on enforcing two or three Old Testament texts on homosexuality, but were happy to reject slavery even though there are far more extensive Biblical passages that permit it.",0
7746,"Well now Murray can simply go for the bisexual vote too. After all, he likes men AND boys.",1
7890,"He probably thoughthimself gay when he did not like women, but I doubt most gays would accept him (or any true asexual).",0
7894,Sodomy isn't exclusive to homosexuals. I can see you've given this a lot of thought.\n\nMute.,0
...,...,...
1999415,"Here are the ones so far in 2016 : \n\nBlack activist in Seattle defaces African-American cultural center with racist graffiti\n\nAfrican-American students at the University of Albany falsely claim to be victims of racial abuse by fellow passenger on a bus\n\nPastor in Austin, Texas claims a local Whole Foods store wrote “@!$%#” on a cake he ordered for a gay congregation member, but surveillance footage shows he tampered with the cake himself\n\nMuslim woman in New York City admits to faking knife hate crime attack",0
1999422,"I share the common outrage about this ""new policy,"" but this is the reality of how many seminaries have pragmatically operated since the last declaration. Candidates are asked their sexual orientation, asked if they are sexually active, asked if they intend to live chastely during formation and after ordination -- and they are also scheduled for more ""counseling"" and ""closer supervision"" as seminarians. In the seminaries I'm familiar with, homosexuality is not an absolute barrier either to admittance or ordination -- as long as celibacy is the candidate's goal. Very recently, the head o...",0
1999425,"I agree that respect and civility are goals worth working at.\n\nUnfortunately, anti-gay hate crimes are on the rise even in the first world (doubled in the UK in the last 4 years, on the rise in Europe and the US ... transgender folk being routinely murdered ... corrective rape in South Africa rampant ... that's my neck of the woods). \n\nThere's no polite dialogue with Adolf Hitler (even though we daren't impugn his intentions ... that would be a really egregious crime!).\n\nClerics who equate homosexuality with the stuff of Satan and the destruction of civilization feed directly into h...",0
1999470,"Unity is God's will) for gays and straights. Complementarity is a conclusion, not an argument. If you quit thinking of gay sex as icky, it will start to click. The institute for marriage tried that argument in Perry and it was not coherent to anyone who heard it. This issue is settled civilly and resistence to it from the Church will be ended from within by gay couples, their families and gay priests and bishops. There is ample justification for when they make their move to do so, just as clerical celibacy and a male priesthood will also fall.",0
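
A minimal sketch of the "sum of scores is at least 1" filter on toy data (values invented, column names from the notebook). It uses `DataFrame.sum(axis=1)` instead of chained `+`; note that `sum` skips NaN by default while `+` propagates it, so the two forms can differ on rows with partially missing annotations:

```python
import pandas as pd

# Toy annotation scores; each value is the share of annotators who saw the identity
toy_df = pd.DataFrame({
    "heterosexual": [0.5, 0.0, 0.2, 0.0],
    "homosexual_gay_or_lesbian": [0.5, 1.0, 0.0, 0.0],
    "bisexual": [0.0, 0.0, 0.25, 0.0],
    "other_sexual_orientation": [0.0, 0.0, 0.0, 0.0],
})
identity_cols = list(toy_df.columns)

# Row-wise total of the identity scores
score_sum = toy_df[identity_cols].sum(axis=1)

# Keep only rows whose combined score reaches the threshold of 1
filtered_df = toy_df[score_sum >= 1]

print(filtered_df.index.tolist())
```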


For a fairer comparison, we want the sexual orientation dataset to be the same size as the disability dataset. We'll accomplish this by randomly undersampling the sexual orientation dataset.

In [30]:
# Keep the sexual orientation evaluation set the same size as the disability evaluation set
jigsaw_sexual_orientation_df = jigsaw_sexual_orientation_filtered_df.sample(frac=(len(jigsaw_disability_df)/len(jigsaw_sexual_orientation_filtered_df)), random_state=266)
jigsaw_sexual_orientation_df[['comment_text', 'toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
1740413,"That's a very large generalization. So if I say ""My friend Joe and I went hiking..."" I'm accused of telling everyone ""I have a friend??"" That's laughable. Saying ""I went hiking with my GAY friend"" is a proper analogy. When a hetero person says ""my husband, wife, gf or bf"" it is not announcing their sexuality. They're announcing WHO THAT PERSON IS in relation to them. Stop making it bigger than it is.",0
1755494,"Try talking to someone that's gay. Every one of the ones I've talked to said they knew as young as 5 that they were different. You honestly believe it's a choice? Yes, sign me up because I want to be treated like a second class citizen. I love being told I'm an abomination and will rot in hell. I love being beaten up because of who I am. Again, when did you decide to be hetero?",1
1853895,"In other words, you have NOT read any of Martin's books. You certainly jave not read the one he wrote about gays. Or else your ability to read would shame a sixth grader.",0
1959690,"I too prefer Bernier over Scheer, but I think Scheer is more electable. Bernier is way too libertarian to win. His policies on health care and spending cuts would make him DOA. Scheer is a social conservative, but unlike Trost and Lemieux he has promised to not to introduce any new legislation on gay marriage and abortion. One can personally oppose abortion on a personal level while still support keeping it legal otherwise believing its not the state's job to enforce morality. Paul Martin personally opposed abortion, but was a strong supporter of keeping it legal. Opposing something ...",0
1385242,"We need to encourage men to take a larger role in direct care for their children as well. So where are the benefits for them? We women can't carry ALL the responsibility AND expect equal representation in the boardrooms. And just because men CAN take the time off and take advantage of parental leave, doesn't mean they WILL. Most of the men I know who DO take a longer parental leave are gay.",1
...,...,...
1099718,"Should a Christian film maker be forced to film porn, because to not do so would be discrimination? Should a Jewish baker be forced to make a cake for a Nazi themed party? Should a Muslim baker be forced to bake a cake for a Gay Pride party?",0
1625595,"Hey, remember the 20th Century, when our hearts were young and gay?!\nNow our hair is old and gray - and our hearts are married. And our lives are harried.\n\nMother-in-law! She's the worst person in the world. But, hey, she's politically correct.\nNo, Donald Trump is the worst person in the world. My mother-in-law is the worst neo-liberal politically correct person in the world.\n But my father-in-law... He's the worst politically incorrect person in the world. Worse than Donald Trump.\n It's been a slice. Good night. Drive carefully.",0
1444419,"Gay men are not pedofiles but if you're a man who preys on boys, you're gay and a pedofile.\n\nThe gay community can try to argue otherwise but that is the way it is.",1
1956738,"And if he believes he has been fired because he's straight, he can bring action, also. The law works both ways, for straights and gays. Oh, ""no,"" you say, ""You've got to be kidding!""",0


### Religion

Create religion subset:

In [31]:
jigsaw_religion_all_df = jigsaw_all_data_df_cleansed[(jigsaw_all_data_df_cleansed['christian'] > 0) | 
           (jigsaw_all_data_df_cleansed['jewish'] > 0) | 
           (jigsaw_all_data_df_cleansed['muslim'] > 0) | 
           (jigsaw_all_data_df_cleansed['hindu'] > 0) | 
           (jigsaw_all_data_df_cleansed['buddhist'] > 0) | 
           (jigsaw_all_data_df_cleansed['atheist'] > 0) | 
           (jigsaw_all_data_df_cleansed['other_religion'] > 0)]
jigsaw_religion_all_df.shape

(101410, 36)

Only keep rows where religion is clearly mentioned, i.e. where the religion annotation scores summed in the cell below total at least 1:

In [32]:

# Note: 'christian' is not part of this sum, although it was used to build jigsaw_religion_all_df above
jigsaw_religion_filtered_df = jigsaw_religion_all_df[(jigsaw_religion_all_df['jewish']+jigsaw_religion_all_df['muslim']
                                                      +jigsaw_religion_all_df['hindu']+jigsaw_religion_all_df['buddhist']
                                                      +jigsaw_religion_all_df['atheist']+jigsaw_religion_all_df['other_religion'])>=1]
jigsaw_religion_filtered_df[['comment_text','toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
7689,"Lela, you admit no records exist to support your claims of a restrained religious hoard, but then state the atheists are worse? It's all bad, but it's all within human nature. If we can do away with any of the cults involved we might lessen future occurrences.",0
7701,"The robot censor seems disinclined to accept some harsh truths, but let's try again:\n\nWhy is it that despite 70% of serial killers and 80-90% of child molesters and mass-shooters being white, most white Americans are fixated on BLACK crime? Perhaps its related to a long-perpetuated double-standard that when black or brown (or Muslim) criminals commit crimes, it's held up as a reflection on their entire communities, whereas when white criminals do so, it's only a reflection on said criminals themselves. And that in turn seems predicated on a widely held and nearly as frequently denied a...",1
7719,C.Parsons\nIf the terrorists are loathsome for their killing then Brit soldiers are loathsome for their killing. And there has been much more of that over the decades- Brits killing Muslims.,1
7720,"Once again, you label anybody who questions Islam a hate monger.",0
7730,Muslims need to take a look in the mirror.,0
...,...,...
1999463,"Some of these guys are anti-muslim, others are simply anti religion. They disguise their hatred for religion in general, or muslims in particular, within an intellectual-sounding debate containing the word accomodation.\n\nThe truth is that religious symbols hurt no one. It is a total non-issue.\n\nA secular state tolerates religions. This debate is about shutting down religion altogether, which is the way of an atheist state, not a secular one.",0
1999464,"The Muslim community will not stop at prayer in public schools. And some of these prayers are absolutely toxic as we should all know by now. \n\nThere will be continuous pressure for more accommodation of Islamic principles that most Canadians find unacceptible particularly as it relates to girls, or any discussion which contravenes their books. Canada is not Iran or Saudi Arabia and yet you will find many who will support initiatives to move our schools in this direction. This will happen. \n\nBest to draw the line now. Public schools are not Madrassas. Stick to it and stop this appease...",0
1999473,"""The terrorist attack outside the Muslim Welfare House mosque in London’s Finsbury Park in the small hours of Monday morning didn’t happen in a vacuum.""\n----------------------------------------\n\nCorrect, it happened following Islamic inspired attacks of Westminster, Manchester and London Bridge as well as other attacks in Europe. Interestingly, Ms. Aziz neglects to mention the Islamic affiliation, simply using the words ""since the Manchester and London terrorist attacks""\n\nThe recent attack in Finsbury Park is sad and atrocious, as is the rise of assaults and abuse.\n\nBut if we need t...",0
1999479,Fogel did not mention the two-state solution because it is not a solution. All it will do is compromise Israel's security and give the Palestinians military sovereignty to load up and attack Israel in the Arabs' ongoing objective to eliminate Israel and eradicate Jews from the region.,0


For a fairer comparison, we want the religion dataset to be the same size as the disability dataset. We'll accomplish this by randomly undersampling the religion dataset.

In [33]:
# Keep the religion evaluation set the same size as the disability evaluation set
jigsaw_religion_df = jigsaw_religion_filtered_df.sample(frac=(len(jigsaw_disability_df)/len(jigsaw_religion_filtered_df)), random_state=266)
jigsaw_religion_df[['comment_text', 'toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
1834737,"If Islamophobia is perfectly understandable, then so is Mormonophobia, Amishophobia, Buddhistophobia and so on. \n\nThe DSM-V says a 'phobia' is a psychological disorder. Does M-103 say what physicians may diagnose Islamophobia?\n\nIslamophobia is a meme constructed to carry forward special consideration, and to silence criticism. If you flip over the Islamophobia card, the other side says ""Victim"".\n\nBigotry is bigotry, and is ugly-- regardless of the parties involved. It is relevant, however, to name the perpetrators, reveal their agenda, detail the facts of the case, and identify the ...",0
1640962,"Muslim countries believed NOT on Trump's list include Saudia Arabia, Turkey, Dubai, and Azerbaijan. \n\n... all countries in which Donald Trump had business interests - or still does. \n\nStill. Does.",0
1752481,"you assert ""Chrystia Freeland? The woman who went on US television in 2015 (post-election) to say that it is offensive to criticize honour killings?"" I hope you can provide a citation from a credible news source to back up that surprising and seemingly ridiculous fake news claim. \n\n I did a bit of a check and discovered that in response to a rather sweeping condemnation of all Syrian refugees by American talk show host Bill Maher, Ms.Freeland said that those of the muslim faith share many of the same values as Christians. Extrapolating from that quite defensible pov to saying Ms Freel...",0
1429809,"Racism will be cured when you take off yrou hijab when asked. \n\nMuslims excoriate us about alleged Islamophobia, but we should bear in mind Islam's entrenched racism, xenophobia, homophobia, misogyny, anti-Semitism, Christophbia, Europhobia, Americanophobia, destructive iconoclasm, and even Islamophobia against each other. Ancient tribal wars between Kurds, Shiite, Sunni, Wahhabi, Druze and other Islamic sects such s ISIS , Taliban and Al Quada are the cause of most wars in the world today.",0
1461189,"""A correction now: Over the course of this last week we have made some regrettable errors on air regarding the Muslim population in Europe, particularly with regard to England and France. Now, this applies especially to discussions of so-called ‘no-go zones,’ areas where non-Muslims allegedly aren’t allowed in and police supposedly won’t go. \n\n""To be clear, there is no formal designation of these zones in either country and no credible information to support the assertion there are specific areas in these countries that exclude individuals based solely on their religion. There are certai...",0
...,...,...
1570430,"well NeoClassic, i am unaware of your heritage, (and it isn't that important to me to know). Me, i am from eastern European Jewish stock. The (use of) quote that you object to was a response to good people doing nothing as the Nazis exterminated millions. Claiming the use of the quote ""....appropriates the Holocaust..."" seems silly to me. At most it could misappropriate the quote, at least in your view, but certainly not mine. It is impossible to imagine the originator of the quote would agree with you, as it seems entirely appropriate given the circumstances unfolding in America now. That...",0
1471683,"The Bible, the Koran, and any other religious texts ""matter"" to me only in their impact on the way followers treat others, the creatures of the earth and the earth itself. If those texts promote decent, respectful, compassionate actions....great. If they promote intolerance and condemnation of ""others"",......no so great. Much good has been done in the name of those texts. Unfortunately, much evil has been done as well. Fundamentalist radical Muslims condemn Christianity. Fundamentalist radical Christians condemn Islam. That's working out real well, isn't it. Gary Crum",0
1680550,"I find it all very amusing how in the 90's all the SJW's were up in arms if my christian cross that I wore around my neck was visible. There were laws enacted that government workers couldn't wear a visible cross while working at their jobs, and my religion was always maligned. Now the very same SJW's say that the muslims have all the rights to the world to their religion and their customs, etc. I guess the only intolerance still allowed in this country is toward Christians and white men.",0
1871719,"I'm almost sure the remark of ""Kurgan"" to which I'm replying is satire, but if not, as my Jewish neighbors would say, ""Oy Vey"". It appears the up vote came from someone taking you at face value.",0


### Race

Create race subset:

In [34]:
jigsaw_race_all_df = jigsaw_all_data_df_cleansed[(jigsaw_all_data_df_cleansed['black'] > 0) | 
           (jigsaw_all_data_df_cleansed['white'] > 0) | 
           (jigsaw_all_data_df_cleansed['asian'] > 0) | 
           (jigsaw_all_data_df_cleansed['latino'] > 0) | 
           (jigsaw_all_data_df_cleansed['other_race_or_ethnicity'] > 0)]
jigsaw_race_all_df.shape

(71648, 36)

Only keep rows where race is clearly mentioned, i.e. where the race annotation scores sum to at least 1:

In [35]:

jigsaw_race_filtered_df = jigsaw_race_all_df[(jigsaw_race_all_df['black']+jigsaw_race_all_df['white']
                                              +jigsaw_race_all_df['asian']+jigsaw_race_all_df['latino']
                                              +jigsaw_race_all_df['other_race_or_ethnicity'])>=1]
jigsaw_race_filtered_df[['comment_text','toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
7679,Why is this black racist crap still on the G&M website?\n\nI stopped talking to black people long ago because I was afraid for my life!!!!!\nSince 80%-90% of all crime is by black people it kinda makes sense doesn't it.\nOh wait thats profiling or is it driving while black? \nI don't know you LEFTIST MARXISTS and BLM nutjobs change the names so fast no one can keep up.\nSo what do we do when the facts speak like this?\nWash it all away by letting identity Marxists spew their racist hatred of white people?\nI think not.\nSo because you commit most of the crimes in the world why the why the...,1
7680,even up here.......BLACKS!,1
7684,"""Let's get the black folks and the white folks at each others' throats""",1
7691,"Are you a Pilgrim?\nWhy arn't you growing your own veggies? \nAnd do you haul them in a covered wagon?\n\nThankfully it isn't up to ""white man"" to determine what is traditional for the First Peoples of Alaska.\nBut they try to all the time and it looks pathetic.",1
7696,And there it is. Our president is a white supremacist.,1
...,...,...
1999437,"Since decisions about release have to be made based on prior actions by similarly appearing defendants, a Mexican would be more likely to go to Mexico than this defendant, true? So, higher flight risk.\n\nAnd if you judge an unemployed black defendant against the history of unemployed black defendants and compare skip rates of that demographic to that of Alison, I'd bet a couple hundred bucks that their skip rate is higher. So, higher flight risk.",0
1999438,"gary says - ""Icon: You make excellent points. I agree.""\n\n Like this little morsel of racism? - ""Just sit back, get comfortable, and enjoy the ride white boys--the lady is on top.?",0
1999477,"Charlottesville police have reportedly issued warrants for the arrest of Christopher Cantwell, a white supremacist prominently featured in a Vice News documentary.\n\nCantwell is wanted for his involvement in the white supremacist rally in Charlottesville, Virginia last weekend, where a counter-protester was killed when a man affiliated with white nationalists allegedly drove his car through a crow\n\nThe warrants are for illegal use of gases and injury by caustic agent or explosive, the Boston Globe reported.\n\nCantwell posted a tearful video on YouTube earlier this week after apparently...",0
1999491,"Now here is an example of what I consider to be racist. Rapper XXXTentacion (sounds like a true young blooded American name), produced a music video featuring a White boy being lynched on a stage as a Black boy looks on. The child's feet twitch then suddenly stop. Now that IMO is racist. Do you agree?",0


For a fairer comparison, we want the race dataset to be the same size as the disability dataset. We'll accomplish this by randomly undersampling the race dataset.

In [36]:
# Keep the race evaluation set the same size as the disability evaluation set
jigsaw_race_df = jigsaw_race_filtered_df.sample(frac=(len(jigsaw_disability_df)/len(jigsaw_race_filtered_df)), random_state=266)
jigsaw_race_df[['comment_text', 'toxicity_binary']]

Unnamed: 0,comment_text,toxicity_binary
1719866,">>Duke Charles Martel\n>>Are you a muslim Alceste?\n.\nTo that question, Alceste has a canned answer - ""Are you a White Supremacist?"".. or ""At least I am not a White Supremacist?""\nEither way - it is pretty telling how he feels about himself. And what does he equates being Muslim with.\nNow, to the apprentice who would be given the task of ""censoring"" this post - read twice.. Just because there are some words in it that may scare you into censorship, there is nothing offensive, or false in it, it is based on actual posts ( go search your database).",0
599504,Poor sick white men who have controlled America the First 200 years cannot now deal with minorities getting their just due in this nation. They will just have to learn to live with the fact that our Nation is for everyone not just for White Males.,1
1910244,Good idea. \n\nDo you think the white supremacist are motivated to historical significance from our perspective? Do they respect the person or the racial dominance that person represents?\n\nJust sayin let's not 'tidy up' the supremacist interest in racial dominance as an issue of 'the past'. It's their dream future.,0
1772793,"Still, no one can tell me how protesting the national anthem has anything to do with police picking on black people...which in and of itself is a spurious claim at best. Are there individual racist cops that have done inappropriate things? Of course. And when/if they do act in a racist manner, they should be (and are) punished appropriately. Is there a systemic based racism in all cops. Nope. But again I'll ask, can anyone explain to me how kneeling for our national anthem is somehow related to cops? And is it really a wise way to protest cops when the correlation between cops and o...",0
1925584,"Wilson has been in office too long. Its time that the black community elect a younger person that has not been influenced by the ""Black Panthers"". I know that the black community has some very bright and able people that can do a better job than ""Cowboy Hat"" Wilson.",0
...,...,...
1976270,Who is going to do this 'monitoring'? \n\nPeople wearing brown shirts and swastika armbands?\n\nAre there laws that permit me to 'monitor' you if you are just a black man who has had a fight with his girlfriend a month ago?,0
140595,"Protesting the national anthem, because you think all cops are racist pigs who kill blacks based on color, is like refusing to drink Gatorade, because you think coffee growers are killing bean pickers in the fields. One has absolutely nothing to do with the other. If you want to protest police policy, why not go to the station with signs? \n\nThis all started when the racist, BLM cop hater Kaepernick wore socks with cartoon pigs wearing police uniforms. BLM is a parade of lies. Trayvon Martin and Mike Brown were both killed during violent assaults. Hands up don't shoot was a huge lie.\n\nI...",1
470310,"Sorry tRump and tRumpettes, but it's too late. Too much of tRump's lame brain comments have resinated and nobody is about to forget tRump proclaiming that his bedroom was his Vietnam, he knows more than the nation's generals about ISIS, if he shoots someone he'll still get votes, making fun of a disabled reporter by stupidly convulsing behind a podium, mocking women and anyone with black or brown skin etc. etc. Noooooooo, tRump. It's over and you know it.",1
1607891,"Send all the faculty to the re-education camps. At least all the white males, anyway ...",1


## Export Jigsaw identity datasets to CSV

In [37]:
jigsaw_disability_df.to_csv('drive/MyDrive/data/disability-dataset-jigsaw.csv')
jigsaw_gender_df.to_csv('drive/MyDrive/data/gender-dataset-jigsaw.csv')
jigsaw_sexual_orientation_df.to_csv('drive/MyDrive/data/sexual_orientation-dataset-jigsaw.csv')
jigsaw_religion_df.to_csv('drive/MyDrive/data/religion-dataset-jigsaw.csv')
jigsaw_race_df.to_csv('drive/MyDrive/data/race-dataset-jigsaw.csv')
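
The five export calls follow one naming pattern, so they could also be driven from a dictionary. The sketch below is an illustration under assumptions: the frames are tiny hypothetical stand-ins and the files go to a temporary directory rather than the Drive folder. It also shows that `to_csv` writes the index by default, which is why a later `read_csv` without `index_col` yields an `Unnamed: 0` column:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the identity subsets built in this notebook
subsets = {
    "disability": pd.DataFrame({"comment_text": ["a"], "toxicity_binary": [0]}, index=[101]),
    "gender": pd.DataFrame({"comment_text": ["b"], "toxicity_binary": [1]}, index=[202]),
}

# Temporary directory used here in place of 'drive/MyDrive/data'
output_dir = Path(tempfile.mkdtemp())

for name, df in subsets.items():
    # Same file naming scheme as the export cell above
    df.to_csv(output_dir / f"{name}-dataset-jigsaw.csv")

written_files = sorted(p.name for p in output_dir.iterdir())

# Reading back with index_col=0 restores the original row ids as the index
reloaded_df = pd.read_csv(output_dir / "gender-dataset-jigsaw.csv", index_col=0)

print(written_files)
```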