<p style = "font-size:40px; 
font-family: Helvetica; 
font-weight : bold; 
background-color: #036EB7; 
color : #FFFFFF; 
text-align: left; 
padding: 0px 15px; 
border-radius:3px">
	Jigsaw All-in-One Dataset
</p>

My goal is to build an all-in-one dataset for the Jigsaw competitions.  
As a first step, I concatenated the datasets from the **Toxic Comment Classification Challenge** and **Jigsaw Unintended Bias in Toxicity Classification**.  
If I find other useful datasets, I'll update this notebook.  

In [None]:
import pandas as pd
import os.path as osp

In [None]:
INPUT_PATH = '/kaggle/input/'

<p style = "font-size:25px; 
font-family: Helvetica; 
font-weight : normal; 
background-color: #036EB7; 
color : #FFFFFF; 
text-align: left; 
padding: 0px 15px; 
border-radius:3px">
    [Toxic Comment Classification Challenge] Data
</p>

### Reference : [https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

* `toxic`
* `severe_toxic`
* `obscene`
* `threat`
* `insult`
* `identity_hate`

You must create a model which predicts a probability of each type of toxicity for each comment.

File descriptions  
* **train.csv** - the training set, contains comments with their binary labels
* **test.csv** - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
* **sample_submission.csv** - a sample submission file in the correct format
* **test_labels.csv** - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)


In [None]:
!unzip -n /kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv.zip -d ./ 
!unzip -n /kaggle/input/jigsaw-toxic-comment-classification-challenge/test.csv.zip -d ./
!unzip -n /kaggle/input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip -d ./

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test_labels = pd.read_csv('test_labels.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
test_labels.head()

In [None]:
test_labels.describe()

Some rows are labeled -1, which means they were not used for scoring. Drop them.
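
The cell below filters on the `toxic` column only, which is enough as long as an unscored row carries -1 in every label column (an assumption here, based on the file description above). A slightly more defensive variant, sketched on made-up toy frames rather than the real files, checks all label columns at once:

```python
import pandas as pd

# Toy stand-ins for test.csv and test_labels.csv (all values are made up).
toy_test = pd.DataFrame({"id": ["a", "b", "c"],
                         "comment_text": ["first", "second", "third"]})
toy_labels = pd.DataFrame({"id": ["a", "b", "c"],
                           "toxic": [0, -1, 1],
                           "obscene": [0, -1, 0]})

merged = toy_test.merge(toy_labels, on="id")
label_cols = ["toxic", "obscene"]

# Keep a row only if none of its labels is -1.
scored = merged[(merged[label_cols] != -1).all(axis=1)].reset_index(drop=True)
print(scored)
```

On the real data, `label_cols` would be the six toxicity columns listed earlier.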

In [None]:
test = test.merge(test_labels, on="id")
test = test[test.toxic != -1]
test.head()

In [None]:
test.describe()

Cool!

In [None]:
toxic_comment = pd.concat([train, test])

In [None]:
toxic_comment.describe()

<p style = "font-size:25px; 
font-family: Helvetica; 
font-weight : normal; 
background-color: #036EB7; 
color : #FFFFFF; 
text-align: left; 
padding: 0px 15px; 
border-radius:3px">
   [Jigsaw Unintended Bias in Toxicity Classification] Data
</p>

### Reference : [https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data)

At the end of 2017 the Civil Comments platform shut down and chose to make its ~2m public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.

In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. This attribute (and all others) are fractional values which represent the fraction of human raters who believed the attribute applied to the given comment. For evaluation, test set examples with target >= 0.5 will be considered to be in the positive class (toxic).
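
Since these labels are fractions while the Toxic Comment labels are binary, it can be handy to derive a 0/1 column using the same `>= 0.5` rule the competition uses for evaluation. A minimal sketch on a made-up toy frame (the column name `target_binary` is hypothetical and not part of the dataset):

```python
import pandas as pd

# Toy data: `target` is the fraction of raters who found the comment toxic.
toy = pd.DataFrame({"comment_text": ["x", "y", "z"],
                    "target": [0.0, 0.5, 0.83]})

# Positive class when at least half of the raters agreed.
toy["target_binary"] = (toy["target"] >= 0.5).astype(int)
print(toy)
```

This notebook keeps the fractional values as they are; thresholding is left to whoever uses the dataset.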

The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition; they are included as an additional avenue for research. Subtype attributes are:

* `severe_toxicity`
* `obscene`
* `threat`
* `insult`
* `identity_attack`
* `sexual_explicit`

In [None]:
!ls /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/

We need `all_data.csv`.

In [None]:
!cp /kaggle/input/jigsaw-unintended-bias-in-toxicity-classification/all_data.csv ./unintended.csv

In [None]:
unintended = pd.read_csv('unintended.csv')

In [None]:
unintended.columns

In [None]:
unintended.head()

In [None]:
target_columns = [
    "id", "comment_text", "toxicity", "severe_toxicity", 
    "obscene", "identity_attack", "insult", "threat"
]
unintended = unintended[target_columns]

In [None]:
unintended.describe()

<p style = "font-size:25px; 
font-family: Helvetica; 
font-weight : normal; 
background-color: #036EB7; 
color : #FFFFFF; 
text-align: left; 
padding: 0px 15px; 
border-radius:3px">
    Merge
</p>

In [None]:
base_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
intended_columns = ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack"]

In [None]:
unintended = unintended.rename(columns=dict(zip(intended_columns, base_columns)))

In [None]:
print("length of the jigsaw-toxic-comment-classification-challenge dataset:", len(toxic_comment))

In [None]:
print("length of the jigsaw-unintended-bias-in-toxicity-classification dataset:", len(unintended))

In [None]:
toxic_comment['dataset'] = 'toxic_comment'
unintended['dataset'] = 'unintended'

In [None]:
# ignore_index=True gives every row a unique index after stacking both sources
final = pd.concat([toxic_comment, unintended], ignore_index=True)
final.head()

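This is the final dataset (toxic_comment + unintended). Note that the `toxic_comment` rows carry binary labels, while the `unintended` rows carry fractional labels (the share of raters who applied the attribute); the `dataset` column tells them apart.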
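
The rename-then-concatenate steps above can be sketched on two tiny made-up frames. The frames and the reduced column lists are illustrative only:

```python
import pandas as pd

# Reduced, illustrative versions of the two column lists.
base_columns = ["toxic", "identity_hate"]
other_columns = ["toxicity", "identity_attack"]

a = pd.DataFrame({"id": [1], "comment_text": ["foo"],
                  "toxic": [1], "identity_hate": [0]})
b = pd.DataFrame({"id": [2], "comment_text": ["bar"],
                  "toxicity": [0.25], "identity_attack": [0.0]})

# Align the second frame's label names with the first one's.
b = b.rename(columns=dict(zip(other_columns, base_columns)))

# Tag each row with its source before stacking.
a["dataset"] = "toxic_comment"
b["dataset"] = "unintended"

combined = pd.concat([a, b], ignore_index=True)
print(combined)
print(combined.groupby("dataset").size())
```

Because the columns line up after the rename, no NaN padding appears, and the per-source counts from `groupby` give a quick sanity check against the lengths printed earlier.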

<p style = "font-size:25px; 
font-family: Helvetica; 
font-weight : normal; 
background-color: #036EB7; 
color : #FFFFFF; 
text-align: left; 
padding: 0px 15px; 
border-radius:3px">
    Toxic Data Preprocessing
</p>

### Reference : [https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing](https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing)
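
Before applying its pattern substitutions, the tokenizer in the next cell lowercases the text, replaces characters outside an allowed set with spaces, and collapses any character repeated three or more times. Here is a minimal standalone sketch of just those three steps, reusing the same two regular expressions. The function name `normalize` is hypothetical and only for illustration:

```python
import re

# Same expressions as the PatternTokenizer defaults in this notebook.
INITIAL_FILTERS = r"[^a-z0-9!@#\$%\^\&\*_\-,\.' ]"
REPETITIONS = re.compile(r"(.)\1{2,}", re.DOTALL)


def normalize(text):
    """Lowercase, blank out disallowed characters, collapse 3+ repeats."""
    text = text.lower()
    text = re.sub(INITIAL_FILTERS, " ", text)
    return REPETITIONS.sub(r"\1", text)


print(normalize("fuuuuck"))
print(normalize("Heeellooo, World?"))
```

Only runs of three or more identical characters are collapsed, so ordinary doubled letters (as in "hello") are left alone.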

In [None]:
# Only the imports this cell actually uses
import copy
import re

import pandas as pd


class BaseTokenizer(object):
    def process_text(self, text):
        raise NotImplementedError

    def process(self, texts):
        for text in texts:
            yield self.process_text(text)


RE_PATTERNS = {
    ' american ':
        [
            'amerikan'
        ],
    ' adolf ':
        [
            'adolf'
        ],
    ' hitler ':
        [
            'hitler'
        ],
    ' fuck':
        [
            '(f)(u|[^a-z0-9 ])(c|[^a-z0-9 ])(k|[^a-z0-9 ])([^ ])*',
            '(f)([^a-z]*)(u)([^a-z]*)(c)([^a-z]*)(k)',
            r' f[!@#\$%\^\&\*]*u[!@#\$%\^&\*]*k', 'f u u c',
            '(f)(c|[^a-z ])(u|[^a-z ])(k)', r'f\*',
            'feck ', ' fux ', r'f\*\*',
            r'f\-ing', r'f\.u\.', 'f###', ' fu ', 'f@ck', 'f u c k', 'f uck', 'f ck'
        ],
    ' ass ':
        [
            '[^a-z]ass ', '[^a-z]azz ', 'arrse', ' arse ', r'@\$\$',
            r'[^a-z]anus', r' a\*s\*s', '[^a-z]ass[^a-z ]',
            r'a[@#\$%\^&\*][@#\$%\^&\*]', '[^a-z]anal ', 'a s s'
        ],
    ' ass hole ':
        [
            ' a[s|z]*wipe', 'a[s|z]*[w]*h[o|0]+[l]*e', r'@\$\$hole'
        ],
    ' bitch ':
        [
            'b[w]*i[t]*ch', 'b!tch',
            r'bi\+ch', r'b!\+ch', '(b)([^a-z]*)(i)([^a-z]*)(t)([^a-z]*)(c)([^a-z]*)(h)',
            'biatch', r'bi\*\*h', 'bytch', 'b i t c h'
        ],
    ' bastard ':
        [
            'ba[s|z]+t[e|a]+rd'
        ],
    ' trans gender':
        [
            'transgender'
        ],
    ' gay ':
        [
            'gay'
        ],
    ' cock ':
        [
            '[^a-z]cock', 'c0ck', '[^a-z]cok ', 'c0k', '[^a-z]cok[^aeiou]', ' cawk',
            '(c)([^a-z ])(o)([^a-z ]*)(c)([^a-z ]*)(k)', 'c o c k'
        ],
    ' dick ':
        [
            ' dick[^aeiou]', 'deek', 'd i c k'
        ],
    ' suck ':
        [
            'sucker', '(s)([^a-z ]*)(u)([^a-z ]*)(c)([^a-z ]*)(k)', 'sucks', '5uck', 's u c k'
        ],
    ' cunt ':
        [
            'cunt', 'c u n t'
        ],
    ' bull shit ':
        [
            r'bullsh\*t', r'bull\$hit'
        ],
    ' homo sex ual':
        [
            'homosexual'
        ],
    ' jerk ':
        [
            'jerk'
        ],
    ' idiot ':
        [
            'i[d]+io[t]+', '(i)([^a-z ]*)(d)([^a-z ]*)(i)([^a-z ]*)(o)([^a-z ]*)(t)', 'idiots',
            'i d i o t'
        ],
    ' dumb ':
        [
            '(d)([^a-z ]*)(u)([^a-z ]*)(m)([^a-z ]*)(b)'
        ],
    ' shit ':
        [
            'shitty', '(s)([^a-z ]*)(h)([^a-z ]*)(i)([^a-z ]*)(t)', 'shite', r'\$hit', 's h i t'
        ],
    ' shit hole ':
        [
            'shythole'
        ],
    ' retard ':
        [
            'returd', 'retad', 'retard', 'wiktard', 'wikitud'
        ],
    ' rape ':
        [
            ' raped'
        ],
    ' dumb ass':
        [
            'dumbass', 'dubass'
        ],
    ' ass head':
        [
            'butthead'
        ],
    ' sex ':
        [
            'sexy', 's3x', 'sexuality'
        ],
    ' nigger ':
        [
            'nigger', 'ni[g]+a', ' nigr ', 'negrito', 'niguh', 'n3gr', 'n i g g e r'
        ],
    ' shut the fuck up':
        [
            'stfu'
        ],
    ' pussy ':
        [
            'pussy[^c]', 'pusy', 'pussi[^l]', 'pusses'
        ],
    ' faggot ':
        [
            'faggot', ' fa[g]+[s]*[^a-z ]', 'fagot', 'f a g g o t', 'faggit',
            '(f)([^a-z ]*)(a)([^a-z ]*)([g]+)([^a-z ]*)(o)([^a-z ]*)(t)', 'fau[g]+ot', 'fae[g]+ot',
        ],
    ' mother fucker':
        [
            ' motha ', ' motha f', ' mother f', 'motherucker',
        ],
    ' whore ':
        [
            r'wh\*\*\*', 'w h o r e'
        ],
}


class PatternTokenizer(BaseTokenizer):
    def __init__(self, lower=True, initial_filters=r"[^a-z0-9!@#\$%\^\&\*_\-,\.' ]", patterns=RE_PATTERNS,
                 remove_repetitions=True):
        self.lower = lower
        self.patterns = patterns
        self.initial_filters = initial_filters
        self.remove_repetitions = remove_repetitions

    def process_text(self, text):
        x = self._preprocess(text)
        for target, patterns in self.patterns.items():
            for pat in patterns:
                x = re.sub(pat, target, x)
        x = re.sub(r"[^a-z' ]", ' ', x)
        return x.split()

    def process_ds(self, ds):
        # ds is a pandas Series of raw comment strings.
        # regex=True is passed explicitly: pandas 2.0 changed the default of
        # Series.str.replace to literal (non-regex) matching.

        # lower
        ds = copy.deepcopy(ds)
        if self.lower:
            ds = ds.str.lower()
        # remove special chars
        if self.initial_filters is not None:
            ds = ds.str.replace(self.initial_filters, ' ', regex=True)
        # fuuuuck => fuck
        if self.remove_repetitions:
            pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
            ds = ds.str.replace(pattern, r"\1", regex=True)

        for target, patterns in self.patterns.items():
            for pat in patterns:
                ds = ds.str.replace(pat, target, regex=True)

        ds = ds.str.replace(r"[^a-z' ]", ' ', regex=True)

        return ds.str.split()

    def _preprocess(self, text):
        # lower
        if self.lower:
            text = text.lower()

        # remove special chars
        if self.initial_filters is not None:
            text = re.sub(self.initial_filters, ' ', text)

        # fuuuuck => fuck
        if self.remove_repetitions:
            pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
            text = pattern.sub(r"\1", text)
        return text

In [None]:
tokenizer = PatternTokenizer()
final["comment_text_processed"] = tokenizer.process_ds(final["comment_text"]).str.join(sep=" ")

In [None]:
!rm *.csv

In [None]:
final.head()

In [None]:
# index=False keeps the row index out of the file as an unnamed extra column
final.to_csv('all_in_one_jigsaw.csv', index=False)

<p style = "font-size:25px; 
font-family: Helvetica; 
font-weight : normal; 
background-color: #036EB7; 
color : #FFFFFF; 
text-align: right; 
padding: 0px 15px; 
border-radius:3px">
    Final!!!
</p>