# Overview

**No training dataset is provided for this competition, so this notebook shows how to build one from the datasets of previous Jigsaw competitions. Don't just use the output file: what I've done at the end is only a basic idea. Understand the approach so that you can create a dataset of your own, for example by deciding how many times a particular comment may be repeated.**

**Training data**

Note, there is no training data for this competition. You can refer to previous Jigsaw competitions for data that might be useful to train models. But note that the task of previous competitions has been to predict the probability that a comment was toxic, rather than the degree or severity of a comment's toxicity.

* Toxic Comment Classification Challenge
* Jigsaw Unintended Bias in Toxicity Classification
* Jigsaw Multilingual Toxic Comment Classification

While we don't include training data, we do provide a set of paired toxicity rankings that can be used to validate models.

**Evaluation**

Submissions are evaluated on Average Agreement with Annotators. For the ground truth, annotators were shown two comments and asked to identify which of the two was more toxic. Pairs of comments can be, and often are, rated by more than one annotator, and may have been ordered differently by different annotators.

For each of the approximately 200,000 pair ratings in the ground truth test data, we use your predicted toxicity score to rank the comment pair. The pair receives a 1 if this ranking matches the annotator ranking, or 0 if it does not match.

The final score is the average across all the pair evaluations.
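
To make the metric concrete, here is a minimal sketch of how the average agreement could be computed on pairs in the less_toxic / more_toxic layout. The comments, the `toy_scores` lookup and the `average_agreement` helper are all made up for illustration, and counting a tie as a miss (strict `>`) is my assumption, not something the competition text states.

```python
import numpy as np
import pandas as pd

def average_agreement(pairs, score):
    """Fraction of pairs where the more_toxic comment gets a strictly higher score."""
    less = pairs["less_toxic"].map(score).to_numpy()
    more = pairs["more_toxic"].map(score).to_numpy()
    return float(np.mean(more > less))

# Toy annotated pairs (made-up text). The last row repeats the first pair in the
# opposite order, as happens when two annotators disagree.
pairs = pd.DataFrame({
    "less_toxic": ["thanks for the edit", "you are wrong", "nice work", "you are wrong"],
    "more_toxic": ["you are wrong", "you are an idiot", "you are an idiot", "thanks for the edit"],
})

# Hypothetical scorer: a hand-written lookup standing in for a trained model.
toy_scores = {
    "thanks for the edit": 0.0,
    "nice work": 0.0,
    "you are wrong": 0.3,
    "you are an idiot": 0.9,
}

agreement = average_agreement(pairs, toy_scores.get)
print(agreement)
```

Because the same pair can be ordered differently by different annotators, no scorer can reach a perfect agreement on such rows; here the model matches three of the four ratings.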

**Objective**

In this competition you will be ranking comments in order of severity of toxicity. You are given a list of comments, and each comment should be scored according to its relative toxicity. Comments with a higher degree of toxicity should receive a higher numerical value compared to comments with a lower degree of toxicity.

In [None]:
import numpy as np
import pandas as pd
import random

In [None]:
comp1=pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')

In [None]:
comp1[['id','comment_text','toxic','severe_toxic']].head()

In [None]:
comp2=pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv')

In [None]:
comp2[['id','comment_text','toxic','severe_toxicity']].head()

In [None]:
print(comp1.shape)
print(comp2.shape)

* For now let's just consider the dataset from the very first Jigsaw competition and see how we can use that to create a train dataset for the current competition.

In [None]:
comp1.toxic.value_counts()

In [None]:
comp1.severe_toxic.value_counts()

# Idea

* We can create a dataset to train our models in this competition in the format of validation_data.csv.
* There are many ways to create a training dataset. The simplest is to treat the non-toxic comments (toxic==0) as the less toxic ones and the toxic comments (toxic==1) as the more toxic ones.
* We can also use the severe_toxic column to compare comments whose toxic==1: those with severe_toxic==0 go to less_toxic and those with severe_toxic==1 go to more_toxic. All of this uses only the dataset from the first Jigsaw competition. The 2nd competition dataset is almost 10 times the size of the first.
* If we consider all the possibilities, the dataset gets huge: pairing every non-toxic comment with every toxic comment alone gives 202165 × 21384 = 4,323,096,360 pairs.
* So for now I've created it this way: for each non-toxic comment (202165 of those) I've randomly chosen one toxic comment (from 21384 comments), so train.csv has 202165 rows. The toxic comments are drawn with replacement, so the same toxic comment can appear in many pairs.
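
The severe_toxic idea from the list above could be sketched like this, on a tiny made-up frame with the same label columns as comp1. The `sample_pairs` helper, the tier names and the seed are mine, and the sketch assumes that every severe_toxic==1 comment also has toxic==1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for comp1 (made-up rows, same label columns).
toy = pd.DataFrame({
    "comment_text": ["a", "b", "c", "d", "e", "f"],
    "toxic":        [0, 0, 0, 1, 1, 1],
    "severe_toxic": [0, 0, 0, 0, 0, 1],
})

# Three tiers, ordered from least to most toxic.
non_toxic = toy[toy.toxic == 0]
mild = toy[(toy.toxic == 1) & (toy.severe_toxic == 0)]
severe = toy[toy.severe_toxic == 1]

def sample_pairs(less_df, more_df, rng):
    """Pair every row of less_df with one randomly drawn row of more_df."""
    picks = rng.integers(0, len(more_df), size=len(less_df))
    return pd.DataFrame({
        "less_toxic": less_df["comment_text"].to_numpy(),
        "more_toxic": more_df["comment_text"].to_numpy()[picks],
    })

rng = np.random.default_rng(0)
tier_pairs = pd.concat([
    sample_pairs(non_toxic, mild, rng),    # non-toxic < toxic
    sample_pairs(non_toxic, severe, rng),  # non-toxic < severely toxic
    sample_pairs(mild, severe, rng),       # toxic < severely toxic
], ignore_index=True)

# Size of the dataset if every cross-tier combination were used instead.
n_all = (len(non_toxic) * len(mild)
         + len(non_toxic) * len(severe)
         + len(mild) * len(severe))
print(len(tier_pairs), n_all)
```

Sampling one partner per comment keeps the dataset linear in the number of comments, while the full cross product grows with the product of the tier sizes.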


In [None]:
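less_toxic=[]
more_toxic=[]
# Split the first competition's comments by the binary toxic label
non_toxic=comp1[comp1.toxic==0].reset_index(drop=True)
toxic=comp1[comp1.toxic==1].reset_index(drop=True)
for i in range(len(non_toxic)):
    # Every non-toxic comment is the less toxic side of exactly one pair
    less_toxic.append(non_toxic.loc[i,'comment_text'])
    # Its partner is a toxic comment drawn uniformly at random (with replacement)
    j=random.randint(0,len(toxic)-1)
    more_toxic.append(toxic.loc[j,'comment_text'])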
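
The same random pairing can probably be done without a Python loop. Below is a rough sketch on made-up data that uses `Series.sample` with `replace=True`; the toy frame and the `random_state` value are assumptions I added so the draw is reproducible.

```python
import pandas as pd

# Toy stand-in for comp1 (made-up comments).
toy = pd.DataFrame({
    "comment_text": ["fine", "ok", "great", "rude", "nasty"],
    "toxic": [0, 0, 0, 1, 1],
})

non_toxic = toy.loc[toy.toxic == 0, "comment_text"].reset_index(drop=True)
toxic = toy.loc[toy.toxic == 1, "comment_text"]

# One toxic comment per non-toxic comment, drawn with replacement.
# Resetting the index lines the sample up row by row with non_toxic.
sampled = toxic.sample(n=len(non_toxic), replace=True, random_state=42)
sampled = sampled.reset_index(drop=True)

pairs = pd.DataFrame({"less_toxic": non_toxic, "more_toxic": sampled})
print(pairs)
```

Fixing the seed means the generated train.csv is the same on every run, which makes experiments easier to compare.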

In [None]:
train=pd.DataFrame()
train['less_toxic']=less_toxic
train['more_toxic']=more_toxic

In [None]:
train.head()

In [None]:
train.to_csv('train.csv',index=False)
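
Since 202165 pairs are drawn from only 21384 toxic comments, each toxic comment shows up roughly 9 to 10 times on average. If you want to control that, here is one possible way to count the repeats and cap them, shown on made-up pairs. `max_repeats` is a hypothetical knob I introduced, not something from the competition.

```python
import pandas as pd

# Toy pairs in the same layout as train (made-up text).
toy_train = pd.DataFrame({
    "less_toxic": ["a", "b", "c", "d", "e"],
    "more_toxic": ["x", "x", "x", "y", "x"],
})

# How many times each toxic comment appears on the more_toxic side.
repeat_counts = toy_train["more_toxic"].value_counts()
print(repeat_counts)

# Keep only the first max_repeats rows for each toxic comment.
max_repeats = 2
occurrence = toy_train.groupby("more_toxic").cumcount()  # 0, 1, 2, ... within each comment
capped = toy_train[occurrence < max_repeats].reset_index(drop=True)
print(capped)
```

Capping drops rows, so the capped dataset is smaller than the original; whether fewer repeats or more pairs works better is something to check against validation_data.csv.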