# ☣️ Jigsaw - Explore Previous Competitions Datasets



## This notebook goes over the data from the 3 previous competitions to determine what is useful and what is not. It provides 2 curated (but not processed) dataframes, and argues that these contain all the relevant data from those competitions.

## The resulting dataframes are stored as a dataset here: [jigsaw-curated-raw-datasets](https://www.kaggle.com/julian3833/jigsaw-curated-raw-datasets)

---
# Input datasets (from the competitions):
* Dataset: [jigsaw-toxic-comment-classification-challenge](https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge). 

   Competition: [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) (2018)

* Dataset: [jigsaw-unintended-bias-in-toxicity-classification](https://www.kaggle.com/julian3833/jigsaw-unintended-bias-in-toxicity-classification). 

  Competition: [Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) (2019)



* Dataset: [jigsaw-multilingual-toxic-comment-classification](https://www.kaggle.com/julian3833/jigsaw-multilingual-toxic-comment-classification). 

  Competition: [Jigsaw Multilingual Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification) (2020)


---

In this notebook we will:
1. Extract the 2 main dataframes from previous competitions and save a curated version of them to a [dataset](https://www.kaggle.com/julian3833/jigsaw-curated-raw-datasets)
2. Verify step-by-step that this data is all the relevant data
3. Provide a utility function to post-process the classification dataset into common formats (binary classification, regression, regression with weights)


# Please, _DO_ upvote if you find this useful!

---

#  Summary / conclusions

There are two main relevant dataframes that can be read from the ported datasets:

```python

df_classification = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")

df_bias_all = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/all_data.csv")


```

## Data from the 2018 Classification Challenge
The first one has the training and test data from the original classification competition. It is merged and provided all together as part of the _multilingual_ competition. It can be easily assembled using the files `train.csv`, `test.csv` and `test_labels.csv` from the classification competition itself. The version of the multilingual competition is exactly the same as the assembled one (See below).

* Rows: `223549`
* Labels: `['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']` (6)

This dataset is stored as `classification-dataset.csv` in this notebook and provided in this [dataset](https://www.kaggle.com/julian3833/jigsaw-curated-raw-datasets) as well.


## Data from the 2019 Unintended Bias in Toxicity Classification

The second one has the training and test data from the unintended bias competition, and is provided in that competition (it was released _after_ the deadline). It contains the 6 labels of the original one, with some changes, plus other labels and many more columns.

It contains much more data, but they come

* Rows: `1999516`
* Labels: `['toxicity', 'severe_toxicity', 'obscene', 'sexual_explicit', 'identity_attack', 'insult', 'threat']` (7)
* Other columns: 24 identities/flags + 14 others
* Values are floats between `0` and `1` and not binary flags. 

The version of this data in the multilingual competition differs partially from the originally assembled one (`all_data.csv`).




# Provision of almost raw curated dataframes

Here the previously mentioned datasets are stored to disk

Their names are `classification-dataset.csv` and `unintented-bias-dataset.csv`. They can be found in this dataset: [jigsaw-curated-raw-datasets](https://www.kaggle.com/julian3833/jigsaw-curated-raw-datasets).

For the one from the Unintended Bias competition, we drop the extra columns and rename the labels to match those of the Classification Challenge. The `sexual_explicit` label is kept, so the columns of the two dataframes don't match exactly.

We don't binarize the labels either, since that is already an important decision, and we want to provide a dataset as close to the original as possible.
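Since binarization is left to the reader, here is a minimal sketch of one way it could be done later. The helper name `binarize_labels`, the toy dataframe, and the `0.5` threshold are all assumptions made for illustration; the right threshold depends on the task.

```python
import pandas as pd

# Toy stand-in for a few rows of the curated unintended-bias dataframe.
toy_bias = pd.DataFrame({
    "id": [1, 2, 3],
    "toxic": [0.0, 0.5, 0.9],
    "insult": [0.1, 0.7, 0.0],
})


def binarize_labels(df, label_cols, threshold=0.5):
    """Return a copy of df whose label columns are 0/1 flags.

    `threshold` is a hypothetical choice, not something fixed by the data.
    """
    out = df.copy()
    out[label_cols] = (out[label_cols] >= threshold).astype(int)
    return out


toy_binary = binarize_labels(toy_bias, ["toxic", "insult"])
print(toy_binary)
```

Keeping the float values on disk and binarizing at load time makes it easy to try several thresholds without regenerating the CSV.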

In [1]:
import numpy as np
import pandas as pd

In [1]:
# 2018 classification challenge dataset
df_classification = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
df_classification.head()

In [1]:
df_classification.shape

In [1]:
df_classification.to_csv("classification-dataset.csv", index=False)

In [1]:
# 2019 unintended bias dataset
df_bias_all = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/all_data.csv")
df_bias_all.head()

In [1]:
df_bias_all.shape
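Because only a handful of columns are kept from `all_data.csv`, one option is to read just those columns with the `usecols` parameter of `pd.read_csv`. The sketch below uses a tiny in-memory CSV with made-up extra columns (`extra_a`, `extra_b`) rather than the real file, so it only illustrates the mechanism.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for all_data.csv; extra_a/extra_b are
# hypothetical columns representing the ones that would be dropped.
toy_csv = io.StringIO(
    "id,comment_text,toxicity,extra_a,extra_b\n"
    "1,hello,0.0,0,0\n"
    "2,you fool,0.8,1,0\n"
)

wanted_cols = ["id", "comment_text", "toxicity"]

# Only the requested columns are parsed and kept in memory.
toy_subset = pd.read_csv(toy_csv, usecols=wanted_cols)
print(toy_subset.columns.tolist())
```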

In [1]:
keep_cols = ['id', 'comment_text', 'toxicity', 'severe_toxicity', 
             'obscene', 'threat', 'insult', 'identity_attack',  'sexual_explicit']
df_bias_cleaned = df_bias_all[keep_cols].rename(columns={'toxicity': 'toxic', 'severe_toxicity': 'severe_toxic', 'identity_attack': 'identity_hate'})
df_bias_cleaned.head()

In [1]:
df_bias_cleaned.to_csv("unintented-bias-dataset.csv", index=False)
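The two curated files share six label columns, but the unintended-bias one also carries `sexual_explicit`. If a single combined dataframe is wanted, one possible approach is to restrict both to the shared columns before concatenating. This is a sketch on toy frames; the `source` column is an addition made here for illustration and is not part of the stored datasets.

```python
import pandas as pd

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-ins for classification-dataset.csv and unintented-bias-dataset.csv.
toy_cls = pd.DataFrame(
    {"id": ["a"], "comment_text": ["x"], **{label: [0] for label in labels}}
)
toy_bias = pd.DataFrame(
    {"id": [1], "comment_text": ["y"], **{label: [0.2] for label in labels},
     "sexual_explicit": [0.1]}
)

shared_cols = ["id", "comment_text"] + labels

# Keep only the shared columns and tag each row with its origin
# ("source" is a hypothetical helper column).
toy_combined = pd.concat(
    [
        toy_cls[shared_cols].assign(source="classification"),
        toy_bias[shared_cols].assign(source="unintended_bias"),
    ],
    ignore_index=True,
)
print(toy_combined.shape)
```

Note that the combined labels would mix binary flags with float scores, so some binarization or rescaling decision is still needed before training on them together.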

# Detailed analysis

## Verification for the `classification-dataset.csv`: 
### The data provided in the multilingual challenge is exactly the same as the original ✅

* `jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv`

is equal to 
* `jigsaw-toxic-comment-classification-challenge/train.csv`
* `jigsaw-toxic-comment-classification-challenge/test.csv`
* `jigsaw-toxic-comment-classification-challenge/test_labels.csv`

In [1]:
df_train = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
print(df_train.shape)
df_train.head()

In [1]:
df_test = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/test.csv")
df_test_labels = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv")
df_test = df_test.merge(df_test_labels[df_test_labels['toxic'] != -1], on='id', how='inner')
print(df_test.shape)
df_test.head()

In [1]:
df_classification_full = pd.concat([df_train, df_test]).reset_index(drop=True)
df_classification_full.shape

In [1]:
# The provided dataframe contains both the train and the test data of the Classification Challenge
df_ml = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")

In [1]:
# Are they exactly the same?
(df_ml == df_classification_full).all().all()
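One caveat about element-wise `==`: `NaN` is never equal to itself, so two identical frames that contain missing values would fail this check. `DataFrame.equals` treats `NaN`s in the same position as equal (it also requires matching dtypes). A small sketch on toy data:

```python
import numpy as np
import pandas as pd

toy_a = pd.DataFrame({"id": [1, 2], "score": [0.5, np.nan]})
toy_b = toy_a.copy()

# Element-wise comparison: the NaN cell compares as False.
elementwise_same = bool((toy_a == toy_b).all().all())

# DataFrame.equals: NaNs in the same location count as equal.
equals_same = toy_a.equals(toy_b)

print(elementwise_same, equals_same)  # False True
```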

# Verifications for `unintented-bias-dataset.csv`

## 1. `all_data.csv` is a superset of `train.csv` (only 1 extra row) ✅ 

In [1]:
# Read all_data.csv
df_all = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/all_data.csv")
df_all.head()

In [1]:
keep_cols = ['id', 'comment_text', 'split', 'toxicity', 'severe_toxicity', 
             'obscene', 'sexual_explicit', 'identity_attack', 'insult', 'threat']
df_all = df_all[keep_cols]
df_all.head()

In [1]:
# Read train.csv
df_train = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
df_train.head()

In [1]:
keep_cols = ['id', 'comment_text', 'target',  'severe_toxicity', 'obscene',
             'sexual_explicit', 'identity_attack', 'insult', 'threat']
df_train = df_train[keep_cols]
df_train.head()

In [1]:
# Get the training part of all_data.csv
df_all_train_split = df_all[df_all['split'] == 'train'].drop("split", axis=1).sort_values("id").reset_index(drop=True)

In [1]:
# Align indexes of train.csv
df_train_massaged = df_train.rename(columns={'target': 'toxicity'}).sort_values("id").reset_index(drop=True)

In [1]:
# The shapes differ by 1 sample: all_data.csv's train split has one extra row!
df_all_train_split.shape, df_train_massaged.shape
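The shape comparison says that one row differs, but not which one. A minimal sketch of locating such rows with `merge(..., indicator=True)`, shown on toy frames with made-up ids rather than the real data:

```python
import pandas as pd

# Toy stand-ins for the train split of all_data.csv and for train.csv.
toy_all_train = pd.DataFrame({"id": [1, 2, 3]})
toy_train = pd.DataFrame({"id": [1, 2]})

# The outer merge adds a "_merge" column saying where each id was found.
merged = toy_all_train.merge(toy_train, on="id", how="outer", indicator=True)

only_in_all = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
only_in_train = merged.loc[merged["_merge"] == "right_only", "id"].tolist()

print(only_in_all, only_in_train)  # [3] []
```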

In [1]:
# Drop the extra row in all_data.csv's train split
df_all_train_split = df_all_train_split[df_all_train_split['id'].isin(df_train_massaged['id'])].reset_index(drop=True)

In [1]:
# Do they exactly match now?
(df_all_train_split == df_train_massaged).all().all()

## 2. `all_data.csv` contains the `test.csv` with labels ✅ 

In [1]:
df_all[df_all['split'] == 'test'].shape

In [1]:
df_test = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv")
df_test.head()

In [1]:
# All the samples of df_test are contained in df_all, so we can use df_all
df_test['id'].isin(df_all['id']).all()
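The check above compares ids only. If a stronger guarantee is wanted, the comment texts can be compared as well after joining on `id`. This is a sketch on toy frames; the suffix names are arbitrary choices made for the example.

```python
import pandas as pd

# Toy stand-ins for test.csv and for the trimmed all_data.csv.
toy_test = pd.DataFrame({"id": [10, 11], "comment_text": ["foo", "bar"]})
toy_all = pd.DataFrame({
    "id": [10, 11, 12],
    "comment_text": ["foo", "bar", "baz"],
    "toxicity": [0.0, 0.6, 0.1],
})

# Left join keeps every test row and attaches the labels found in toy_all.
toy_test_labeled = toy_test.merge(
    toy_all, on="id", how="left", suffixes=("_test", "_all")
)

# True only if every test comment has the same text in both sources.
texts_match = bool(
    (toy_test_labeled["comment_text_test"]
     == toy_test_labeled["comment_text_all"]).all()
)
print(texts_match)
```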

In [1]:
# Conclusion: all_data.csv seems good enough!!

# Verifications for the multilingual competition:

## 1. The multilingual competition doesn't have train data of its own ✅ 

This competition doesn't have train data. 

The train data is that of the two previous competitions.

Here we will see if anything is useful at all.

In [1]:
import os
os.listdir("../input/jigsaw-multilingual-toxic-comment-classification/")

In [1]:
# Is the test set useful?
df_test = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/test.csv")

In [1]:
# It doesn't seem so
df_test['lang'].value_counts()

In [1]:
# Validation doesn't have English either
df_val = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/validation.csv")
df_val['lang'].value_counts()

## 2. Is the unintended bias summary provided in the multilingual competition better than `all_data.csv`? No

In [1]:
df_bias_ml = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv")

In [1]:
df_bias_ml.shape

In [1]:
df_bias_ml['id'].isin(df_all['id']).all()

In [1]:
df_all['id'].isin(df_bias_ml['id']).all()

In [1]:
# all_data.csv seems to have more data than the version provided in the multilingual competition
df_all.head()
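The two `isin(...).all()` checks give only a yes/no answer in each direction. To see how large the difference is, the id sets can be compared directly. `id_overlap` is a hypothetical helper written for this sketch, demonstrated on toy ids:

```python
import pandas as pd


def id_overlap(left_ids, right_ids):
    """Count ids found only on the left, only on the right, and in both."""
    left, right = set(left_ids), set(right_ids)
    return {
        "only_left": len(left - right),
        "only_right": len(right - left),
        "both": len(left & right),
    }


# Toy ids: the left side plays the role of all_data.csv, the right side
# the role of the multilingual copy.
toy_overlap = id_overlap(pd.Series([1, 2, 3, 4]), pd.Series([2, 3]))
print(toy_overlap)  # {'only_left': 2, 'only_right': 0, 'both': 2}
```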

# Utility function `get_classification_dataset_as`: simple common postprocessing for `classification-dataset.csv` 

This function converts `classification-dataset.csv` into a binary classification or a regression task by adding a column `y` and dropping the original labels.

For `task=binary`
  * `y=1` if any label is non-zero, and `y=0` otherwise.

For `task=regression`, 
 * `y` will be the sum of the non-zero labels.
 * the parameter `regression_weights` can be used to control how much each label affects `y`. See an example usage below.
 > For example: `regression_weights={'obscene': 0.1, 'insult': 15}`. Labels missing from the dictionary get a default weight of `1`.
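To make the weighting concrete, here is a small worked example on a toy dataframe. It computes the same weighted sum described above but does not read `classification-dataset.csv`; the rows are made up for illustration.

```python
import pandas as pd

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Two toy comments: one flagged toxic + obscene + insult, one clean.
toy = pd.DataFrame(
    [[1, 0, 1, 0, 1, 0],
     [0, 0, 0, 0, 0, 0]],
    columns=labels,
)

regression_weights = {"obscene": 0.1, "insult": 15}

# Labels missing from the dictionary default to a weight of 1.
weight_vector = pd.Series(
    {label: regression_weights.get(label, 1) for label in labels}
)

# Multiply each label column by its weight, then sum across columns.
toy_y = toy[labels].mul(weight_vector, axis=1).sum(axis=1)

# First row: 1*1 (toxic) + 0.1*1 (obscene) + 15*1 (insult) = 16.1
print(toy_y.tolist())
```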





In [1]:
def get_classification_dataset_as(task="binary", regression_weights=None):
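    """
    Load `classification-dataset.csv` and collapse its 6 labels into a single column `y`.

    Args:
        task: one of ['binary', 'regression']
        regression_weights: optional dictionary {label => weight}, only used if task == 'regression'.
            Labels missing from the dictionary get a weight of 1.

    Returns:
        The dataframe with the label columns replaced by `y`.
    """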
    assert task in ['binary', 'regression']
    
    df = pd.read_csv("classification-dataset.csv")

    
    labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    
    if task == 'binary':    
        df['y'] = (df[labels].sum(axis=1) > 0).astype(int)
        df = df.drop(labels, axis=1)
    elif task == 'regression':
        if regression_weights is None:
            df['y'] = df[labels].sum(axis=1)
        else:
            # Scale each label column by its weight (default 1), then sum across labels
            weighted_columns = [regression_weights.get(label, 1) * df[label] for label in labels]
            df['y'] = pd.concat(weighted_columns, axis=1).sum(axis=1)
        df = df.drop(labels, axis=1)
    return df
    

# Example usage

In [1]:
df_binary = get_classification_dataset_as('binary')

In [1]:
df_regression = get_classification_dataset_as('regression')

In [1]:
df_regression_weighted = get_classification_dataset_as('regression', regression_weights={'severe_toxic': 2})

In [1]:
!mkdir -p postprocessed

In [1]:
df_binary.to_csv("postprocessed/classification-dataset-binary.csv", index=False)
df_regression.to_csv("postprocessed/classification-dataset-regression.csv", index=False)
df_regression_weighted.to_csv("postprocessed/classification-dataset-regression-weighted.csv", index=False)
