#### What are you trying to do in this notebook?
This notebook visualises the text data to look for patterns that might help differentiate between less_toxic and more_toxic comments.
It attempts to perform EDA on the Jigsaw Toxic Severity Rating dataset. The focus of this competition is on ranking the severity of comment toxicity from innocuous to outrageous.

#### Why are you trying it?
In this competition you will be ranking comments in order of severity of toxicity. You are given a list of comments, and each comment should be scored according to its relative toxicity. Comments with a higher degree of toxicity should receive a higher numerical value than comments with a lower degree of toxicity.
To avoid leaks, the same text must always be placed in the same fold.
For a single document this is easy, but keeping both documents of a pair in the same fold is trickier, because a text can appear in many pairs.
This simple notebook follows pairs of texts recursively to group them and tries to create a leak-free fold split.
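
The grouping idea described above can also be sketched with a union-find (disjoint-set) structure instead of recursion. This is only an illustration under stated assumptions, not the method used in the cells of this notebook: `group_pairs` and the toy pairs are hypothetical names and data introduced here.

```python
def group_pairs(pairs):
    """Give every text a group id so that paired texts share the same id."""
    parent = {}

    def find(x):
        # Register unseen texts as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the groups of the two texts in every pair
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Relabel roots as consecutive integers
    roots = {}
    return {t: roots.setdefault(find(t), len(roots)) for t in list(parent)}


# Toy pairs (hypothetical): a-b and b-c are linked through b; d-e is separate
groups = group_pairs([("a", "b"), ("b", "c"), ("d", "e")])
print(groups)
```

Texts that are connected through any chain of pairs end up with the same group id, which is the property `GroupKFold` needs in order to keep them in one fold.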

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from tqdm import tqdm
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

In [None]:
n_splits=5
nrows = None

In [None]:
df = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv", nrows=nrows)
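# Deduplicate in order of first appearance so that ids are reproducible across runs
# (iterating over a set of strings can give a different order in each session)
texts = list(dict.fromkeys(df.less_toxic.to_list() + df.more_toxic.to_list()))
text2id = {t: i for i, t in enumerate(texts)}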
df['less_id'] = df['less_toxic'].map(text2id)
df['more_id'] = df['more_toxic'].map(text2id)
df

In [None]:
# Set array to store pair information
len_ids = len(text2id)
idarr = np.zeros((len_ids,len_ids), dtype=bool)

for lid, mid in df[['less_id', 'more_id']].values:
    min_id = min(lid, mid)
    max_id = max(lid, mid)
    idarr[max_id, min_id] = True

In [None]:
# Recursively retrieve every text that is paired with the text whose id is i
# and store its id in this_list.
# Each visited entry idarr[i, j] is set to False so that it is not followed twice.
def add_ids(i, this_list):
    for j in range(len_ids):
        if idarr[i, j]:
            idarr[i, j] = False
            this_list.append(j)
            this_list = add_ids(j,this_list)
            #print(j,i)
    for j in range(i+1,len_ids):
        if idarr[j, i]:
            idarr[j, i] = False
            this_list.append(j)
            this_list = add_ids(j,this_list)
            #print(j,i)
    return this_list

group_list = []
for i in tqdm(range(len_ids)):
    for j in range(i+1,len_ids):
        if idarr[j, i]:
            this_list = add_ids(i,[i])
            #print(this_list)
            group_list.append(this_list)

id2groupid = {}
for gid, ids in enumerate(group_list):
    for text_id in ids:  # avoid shadowing the built-in id()
        id2groupid[text_id] = gid

df['less_gid'] = df['less_id'].map(id2groupid)
df['more_gid'] = df['more_id'].map(id2groupid)
df

In [None]:
print('unique text count:', len_ids)
print('group count:', len(group_list))

In [None]:
# now we can use GroupKFold with group id
group_kfold = GroupKFold(n_splits=n_splits)

# Both texts of a pair belong to the same group, so df.less_gid and df.more_gid
# are identical; df.less_gid is used here.
for fold, (trn, val) in enumerate(group_kfold.split(df, df, df.less_gid)):
    df.loc[val, "fold"] = fold

df["fold"] = df["fold"].astype(int)
df
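
One way to check whether a split like the one above is leak-free is to count, for every text, how many different folds it appears in. The sketch below assumes a frame with the `less_toxic`, `more_toxic` and `fold` columns used in this notebook; `find_leaking_texts` and the toy frames are hypothetical and only there for illustration.

```python
import pandas as pd


def find_leaking_texts(frame):
    """Return the texts that appear in more than one fold (empty if leak-free)."""
    # Stack both text columns into one long table of (text, fold) rows
    stacked = pd.concat([
        frame[["less_toxic", "fold"]].rename(columns={"less_toxic": "text"}),
        frame[["more_toxic", "fold"]].rename(columns={"more_toxic": "text"}),
    ])
    folds_per_text = stacked.groupby("text")["fold"].nunique()
    return folds_per_text[folds_per_text > 1].index.tolist()


# Toy data (hypothetical): "b" occurs in two pairs, both placed in fold 0
toy = pd.DataFrame({
    "less_toxic": ["a", "b", "d"],
    "more_toxic": ["b", "c", "e"],
    "fold": [0, 0, 1],
})
leaks_ok = find_leaking_texts(toy)

# Moving the second pair to fold 1 splits "b" across folds 0 and 1
leaks_bad = find_leaking_texts(toy.assign(fold=[0, 1, 1]))
print(leaks_ok, leaks_bad)
```

Calling `find_leaking_texts(df)` on the real frame should return an empty list if every text ended up in exactly one fold.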

#### Did it work?
There is no training data for this competition. You can refer to previous Jigsaw competitions for data that might be useful to train models. But note that the task of previous competitions has been to predict the probability that a comment was toxic, rather than the degree or severity of a comment's toxicity.

#### What did you not understand about this process?
Everything is explained on the competition data page, and I had no problems while working on it. If anything in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
While we don't include training data, we do provide a set of paired toxicity rankings that can be used to validate models.
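
A minimal sketch of how the paired rankings could be used for validation, assuming agreement is measured as the share of pairs in which the `more_toxic` comment gets a strictly higher score than the `less_toxic` one. `pairwise_accuracy`, `toy_score` and the toy pairs are hypothetical; a real model's scoring function would take the place of `toy_score`.

```python
import pandas as pd


def pairwise_accuracy(pairs, score_fn):
    """Share of pairs where more_toxic is scored strictly above less_toxic."""
    less = pairs["less_toxic"].map(score_fn)
    more = pairs["more_toxic"].map(score_fn)
    return float((less < more).mean())


def toy_score(text):
    # Hypothetical stand-in for a model: counts exclamation marks
    return text.count("!")


# Toy pairs (hypothetical data, not taken from the competition files)
toy_pairs = pd.DataFrame({
    "less_toxic": ["hello", "fine!", "ok"],
    "more_toxic": ["go away!", "meh", "stop!!"],
})
acc = pairwise_accuracy(toy_pairs, toy_score)
print(acc)
```

Computing this per fold, with the model trained only on texts outside that fold, keeps the validation estimate free of the leak discussed earlier.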

