#### Goals of this notebook: 

* Find datasets for training a model to rate toxic severity of comments


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Jigsaw III

[Jigsaw Toxic Severity Rating Challenge 2021](https://www.kaggle.com/c/jigsaw-toxic-severity-rating)

Files:

* /kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv
* /kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv
* /kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv

In [None]:
df_sample_submission = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv")
df_validation = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv")
df_comments = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")

In [None]:
df_comments.head()

In [None]:
df_validation.head()

In [None]:
len(df_validation)

In [None]:
df_sample_submission.head()

In [None]:
df_comments.iloc[0]['text'].replace("\n",'').replace('\\','')
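The chained `replace` calls above can be wrapped in a small helper so the same cleanup is applied to every comment. This is a minimal sketch: `clean_text` is a hypothetical name, and replacing newlines with a single space (rather than deleting them, as the cell above does) is an assumption made here so that words on adjacent lines are not glued together.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper: drop backslashes and flatten whitespace."""
    text = text.replace("\\", "")     # remove literal backslashes
    text = re.sub(r"\s+", " ", text)  # newlines, tabs, repeated spaces -> one space
    return text.strip()


sample = "First line\nsecond line with a stray \\ backslash"
print(clean_text(sample))
```

With the real data this could be applied column-wise, for example `df_comments['text'].map(clean_text)`.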

### Jigsaw II

[Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification)


Detect toxicity across a diverse range of conversations

Files: 

* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/all_data.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/test_public_expanded.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/test_private_expanded.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/toxicity_individual_annotations.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/train.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/identity_individual_annotations.csv
* /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/test.csv

#### File descriptions
```

train.csv - the training set, which includes toxicity labels and subgroups
test.csv - the test set, which does not include toxicity labels or subgroups
sample_submission.csv - a sample submission file in the correct format
The following files were added after the competition closed, for use in additional research:
```

```
test_public_expanded.csv - The public leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be easily reconstructed using a >=0.5 threshold.

test_private_expanded.csv - The private leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be easily reconstructed using a >=0.5 threshold.

toxicity_individual_annotations.csv - The individual rater decisions for toxicity questions. Columns are:
    id - The comment id. Corresponds to id field in train.csv, test_public_expanded.csv, or test_private_expanded.csv.
    
    worker - The id of the individual annotator. These worker ids are shared between toxicity_individual_annotations.csv and identity_individual_annotations.csv.
    
    toxic - 1 if the worker said the comment was toxic, 0 otherwise.
    severe_toxic - 1 if the worker said the comment was severely toxic, 0 otherwise. Note that any comment that was considered severely toxic was also considered toxic.
    identity_attack, insult, obscene, sexual_explicit, threat - Toxicity subtype attributes. 1 if the worker said the comment exhibited each of these traits, 0 otherwise.
    
identity_individual_annotations.csv - The individual rater decisions for identity questions. Columns are:
    id - The comment id. Corresponds to id field in train.csv, test_public_expanded.csv, or test_private_expanded.csv.
    worker - The id of the individual annotator. These worker ids are shared between toxicity_individual_annotations.csv and identity_individual_annotations.csv.
    
    disability, gender, race_or_ethnicity, religion, sexual_orientation - The list of identities within this category that the rater noticed in the comment. Formatted as space-separated strings.
```
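The descriptions above say the competition target can be rebuilt from the `toxicity` column with a `>= 0.5` threshold. A minimal sketch of that step on a toy frame follows; the rows are made up, and `target_binary` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for test_public_expanded.csv; values are invented for illustration.
toy = pd.DataFrame({"id": [1, 2, 3, 4], "toxicity": [0.0, 0.49, 0.5, 0.9]})

# Binarize with the >= 0.5 threshold quoted in the file description.
toy["target_binary"] = (toy["toxicity"] >= 0.5).astype(int)
print(toy)
```

Note that a score of exactly 0.5 counts as toxic under this rule.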
Questions that came to mind after looking at the file names:

1. What is the difference between all_data.csv and train.csv?
2. Is any of the data redundant, so that some files can be excluded?
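One way to approach both questions is to compare the files by comment id and by column set. The sketch below uses toy frames in place of the real files and assumes that all_data.csv and train.csv share an `id` column, which is not confirmed here.

```python
import pandas as pd

# Toy stand-ins; with the real data these would come from pd.read_csv(...).
all_data = pd.DataFrame({"id": [1, 2, 3, 4, 5], "comment_text": list("abcde")})
train = pd.DataFrame({"id": [1, 2, 3], "comment_text": list("abc")})

# Outer merge on id; the _merge column records which file each id appears in.
overlap = all_data[["id"]].merge(train[["id"]], on="id", how="outer", indicator=True)
print(overlap["_merge"].value_counts())

# Columns present in one file but not the other.
print(set(all_data.columns) ^ set(train.columns))
```

If every `train` id shows up as `both`, then train.csv is a subset of all_data.csv and loading both would be redundant.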

In [None]:
df2_train = pd.read_csv("/kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
df2_train.head()

In [None]:
len(df2_train)

In [None]:
df2_train[df2_train["severe_toxicity"]>0]

In [None]:
len(df2_train[df2_train["severe_toxicity"]>0])

In [None]:
df2_train[(df2_train["severe_toxicity"]>0) | (df2_train['obscene']>0) | (df2_train['identity_attack']>0) | (df2_train['insult']>0)]

In [None]:
df2_train[df2_train[["severe_toxicity", "obscene", 'identity_attack', 'insult']].sum(axis=1)>0.5]

In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
df2_train[df2_train['target']>0.5][["comment_text", "target"]]

In [None]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df = df2_train[["comment_text", "target"]].copy()

In [None]:
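# Six edges give five equal-width bins over [0, 1]
bins = np.linspace(0, 1, 6, dtype=float)
print(bins)
labels = [1, 2, 3, 4, 5]
# Map each target score to a severity level from 1 (least toxic) to 5 (most toxic)
df['binned'] = pd.cut(df['target'], bins=bins, labels=labels, include_lowest=True)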
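To see how `pd.cut` assigns scores that sit exactly on a bin edge, here is a small worked example. The scores are made up; the bin edges and labels mirror the cell above.

```python
import numpy as np
import pandas as pd

bins = np.linspace(0, 1, 6)  # six edges from 0 to 1, i.e. five bins of width 0.2
labels = [1, 2, 3, 4, 5]
scores = pd.Series([0.0, 0.2, 0.21, 0.5, 0.8, 1.0])

# Intervals are closed on the right, so 0.2 belongs to the first bin.
# include_lowest=True additionally lets 0.0 fall into the first bin
# instead of becoming NaN.
binned = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)
print(pd.DataFrame({"score": scores, "bin": binned}))
```

Without `include_lowest=True`, comments with a target of exactly 0 would be left unbinned.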

In [None]:
df.groupby('binned').count()

In [None]:
df

In [None]:
import seaborn as sns

sns.set_theme()
np.random.seed(0)

In [None]:
sns.histplot(data=df, x="binned", bins=5)

In [None]:
len(df2_train)

In [None]:
k = 2000  # rows to sample from each bin
# Draw k rows per bin and concatenate once, so every severity level is equally represented
samples = [df[df['binned'] == i].sample(n=k, random_state=0) for i in labels]
df_stack = pd.concat(samples, axis=0)
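Sampling a fixed `k` rows per bin raises an error if any bin holds fewer than `k` rows. A minimal sketch of a guarded variant follows, using a toy frame with plain integer bins; capping at the bin size is a design choice made here, not something the notebook does.

```python
import pandas as pd

# Toy frame with an uneven bin distribution (values are made up).
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "binned": [1] * 6 + [2] * 3 + [3] * 1,
})

k = 2  # rows wanted per bin

# Sample up to k rows from each bin; min() keeps small bins from raising.
parts = [
    group.sample(n=min(k, len(group)), random_state=0)
    for _, group in toy.groupby("binned")
]
balanced = pd.concat(parts, ignore_index=True)
print(balanced["binned"].value_counts().sort_index())
```

The trade-off is that the result is no longer perfectly balanced when a bin is short, so it is worth checking the per-bin counts afterwards.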

In [None]:
df_stack

In [None]:
df_stack.to_csv("jigsaw_II_training_data.csv", index=False)

In [None]:
df_stack.groupby('binned').count()

### Jigsaw I

[Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

Identify and classify toxic online comments

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

* toxic
* severe_toxic
* obscene
* threat
* insult
* identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.

Files:

* /kaggle/input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv
* /kaggle/input/jigsaw-toxic-comment-classification-challenge/test_labels.csv
* /kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv
* /kaggle/input/jigsaw-toxic-comment-classification-challenge/test.csv

#### File descriptions

* train.csv - the training set, contains comments with their binary labels
* test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
* sample_submission.csv - a sample submission file in the correct format
* test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring
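Since a value of -1 in test_labels.csv marks comments that were not scored, those rows need to be dropped before the test set can serve as extra labelled data. A minimal sketch on a toy frame, with made-up rows and only two of the six label columns shown:

```python
import pandas as pd

# Toy stand-in for test_labels.csv (invented rows, subset of label columns).
test_labels = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "toxic": [0, -1, 1, -1],
    "severe_toxic": [0, -1, 0, -1],
})

label_cols = ["toxic", "severe_toxic"]

# Keep only rows that were scored, i.e. rows where no label is -1.
scored = test_labels[(test_labels[label_cols] != -1).all(axis=1)]
print(scored)
```

The filtered labels could then be joined to test.csv on `id` to recover the comment text.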




In [None]:
df3_train = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv")
df3_train.head()

In [None]:
df3_train.describe()


### Ruddit Dataset

#### ruddit-comments

Ruddit comments extracted from Reddit using PRAW (the Python Reddit API Wrapper). This data supplements the Jigsaw Toxic Severity Rating dataset.

Files: 
* /kaggle/input/ruddit-jigsaw-dataset/LICENSE
* /kaggle/input/ruddit-jigsaw-dataset/README.md
* /kaggle/input/ruddit-jigsaw-dataset/requirements.txt
* /kaggle/input/ruddit-jigsaw-dataset/ruddit-comment-extraction.ipynb
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/create_dataset_variants.py
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/identityterms_group.txt
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/Ruddit.csv
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/ReadMe.md
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/Ruddit_individual_annotations.csv
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/node_dictionary.npy
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/post_with_issues.csv
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/Thread_structure.txt
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/load_node_dictionary.py
* /kaggle/input/ruddit-jigsaw-dataset/Dataset/sample_input_file.csv
* /kaggle/input/ruddit-jigsaw-dataset/Models/BERT.py
* /kaggle/input/ruddit-jigsaw-dataset/Models/create_splits.py
* /kaggle/input/ruddit-jigsaw-dataset/Models/README.md
* /kaggle/input/ruddit-jigsaw-dataset/Models/BiLSTM.py
* /kaggle/input/ruddit-jigsaw-dataset/Models/info.md
* /kaggle/input/ruddit-jigsaw-dataset/Models/HateBERT.py

Ref: [Ruddit: Norms of Offensiveness for English Reddit Comments](https://aclanthology.org/2021.acl-long.210.pdf)

In [None]:
df_ruddit = pd.read_csv("/kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv")
df_ruddit.head()

In [None]:
len(df_ruddit)
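To combine Ruddit with the Jigsaw II targets, the two score scales need to line up. The sketch below assumes that the Ruddit score column is named `offensiveness_score` and ranges from -1 to 1, as described in the Ruddit paper; both points should be checked against `df_ruddit` before relying on this. `score_01` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in; column name and [-1, 1] range are assumptions, values are made up.
toy_ruddit = pd.DataFrame({"offensiveness_score": [-1.0, -0.5, 0.0, 0.5, 1.0]})

# Linear map from [-1, 1] to [0, 1], matching the range of the Jigsaw II target.
toy_ruddit["score_01"] = (toy_ruddit["offensiveness_score"] + 1) / 2
print(toy_ruddit)
```

A linear rescale only aligns the ranges; the two datasets were annotated differently, so equal values do not necessarily mean equal severity.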

### [Sarcasm on Reddit](https://www.kaggle.com/danofer/sarcasm)

1.3 million labelled comments from Reddit

##### Context
This dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. The dataset was generated by scraping comments from Reddit (not by me :)) containing the /s (sarcasm) tag. This tag is often used by Redditors to indicate that their comment is in jest and not meant to be taken seriously, and is generally a reliable indicator of sarcastic comment content.

##### Content
The data comes in balanced and imbalanced (i.e. true distribution) versions; the true ratio is about 1:100. The corpus has 1.3 million sarcastic statements, along with what they responded to, as well as many non-sarcastic comments from the same source.

Labelled comments are in the train-balanced-sarcasm.csv file.
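A quick way to confirm which version of the data has been loaded is to look at the class shares. This sketch runs on a toy frame and assumes the label column is called `label` with 1 for sarcastic and 0 otherwise, which should be verified against the actual file.

```python
import pandas as pd

# Toy stand-in; the `label` column name and its 0/1 coding are assumptions.
toy_sarcasm = pd.DataFrame({"label": [1, 0, 1, 0, 0, 1]})

# Share of each class; a balanced file should be close to 0.5 / 0.5,
# while the true distribution would be heavily skewed toward 0.
balance = toy_sarcasm["label"].value_counts(normalize=True)
print(balance)
```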

Files: 
* /kaggle/input/sarcasm/train-balanced-sarc.csv.gz
* /kaggle/input/sarcasm/train-balanced-sarcasm.csv
* /kaggle/input/sarcasm/test-balanced.csv
* /kaggle/input/sarcasm/test-unbalanced.csv    

In [None]:
df_sarcasm = pd.read_csv("/kaggle/input/sarcasm/train-balanced-sarcasm.csv")
df_sarcasm.head()