# Toxic Comment Classification Challenge

This is my submission to the Toxic Comment Classification Challenge (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/).   


# Data preprocessing

##### control the use of the gpu

In [1]:
import time

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

##### import libraries

In [2]:
import numpy as np
import pandas as pd

import random

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


Using TensorFlow backend.


## Read data

##### read data as dataframes with pandas

In [3]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')

In [4]:
print("The training dataset contains",df_train.shape[0],"examples.")
print("The dataframe is composed of",df_train.shape[1],"columns:")
for col in df_train.columns:
    print(col)

The training dataset contains 159571 examples.
The dataframe is composed of 8 columns:
id
comment_text
toxic
severe_toxic
obscene
threat
insult
identity_hate


In [5]:
id_train = df_train['id']
x_train = df_train['comment_text']
y_train = df_train.iloc[:, 2:].values

id_test = df_test['id']
x_test = df_test['comment_text']

##### display some examples of comments

In [6]:
# randomly select a few examples
examples = random.sample(range(x_train.shape[0]),3)

print(examples)
for ex in examples:
    print('Comment:  {}'.format(x_train[ex]))
    print('Labels:    {}'.format(y_train[ex]))
    print('\n')

[55964, 91486, 127653]
Comment:  And you'll be pleased to know that this article has now come back to life with some decent references. OIC and myself are in the middle of sourcing even more for it, so don't be suprised if it gets further expansion. This is us being a little spiteful towards a user who doesn't have a snowflakes chance in hell of getting the mop.  (talk)
Labels:    [1 0 0 0 0 0]


Comment:  No it's not. There are still several typos, and the overall writing could be better. ♬♩  (talk)
Labels:    [0 0 0 0 0 0]


Comment:  "
 Speedy deletion of ""27 Tricor Ave. New Paltz, NY 12561"" 

 A page you created, 27 Tricor Ave. New Paltz, NY 12561, has been tagged for deletion, as it meets one or more of the criteria for speedy deletion; specifically, it is nonsense or gibberish.

You are welcome to contribute content which complies with our content policies and any applicable inclusion guidelines. However, please do not simply re-create the page with the same content. You may al

## Preprocess data

### Tokenizer

Create a tokenizer, which will be used to vectorize the text of the comments (see: https://keras.io/preprocessing/text/ ).   
Then fit the tokenizer on the training dataset and run it on the training and test comments.
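
To make the fit/run steps concrete, here is a simplified pure-Python sketch of what a word-level tokenizer of this kind does. It is an illustration and not the Keras implementation: the helper names (`split_words`, `fit_word_index`, `texts_to_sequences`) and the toy sentences are made up for this example. It assumes the same filter characters as the notebook, lowercasing, indexes assigned by decreasing frequency starting at 1, and unseen words being dropped (the behaviour with `oov_token=None`).

```python
from collections import Counter

# Same filter characters as the Tokenizer call in this notebook.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Assign indexes 1, 2, 3, ... by decreasing frequency (0 stays free for padding)."""
    counts = Counter(word for text in texts for word in split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index):
    """Map each text to a list of integers, dropping words not seen during fitting."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]

# Hypothetical toy data, for illustration only
toy_train = ["You are great!", "You are not great, you are rude."]
toy_index = fit_word_index(toy_train)
toy_seqs = texts_to_sequences(["You are rude?", "Unknown words vanish"], toy_index)
print(toy_index)
print(toy_seqs)
```

Fitting only on the training comments means that any word appearing only in the test set has no index and simply disappears from the test sequences.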

In [7]:
# create tokenizer
max_num_words = None

tokenizer = Tokenizer(num_words = max_num_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n',
                      lower=True, split=' ', char_level=False, oov_token=None)

In [8]:
# fit tokenizer
tokenizer.fit_on_texts(list(x_train))

In [9]:
word_index = tokenizer.word_index # dictionary mapping each word to its index (a uniquely assigned integer)

voc_size = len(word_index)
print('The tokenizer found {} unique tokens.'.format(voc_size))

The tokenizer found 210337 unique tokens.


In [10]:
# run tokenizer
seq_train = tokenizer.texts_to_sequences(x_train)
seq_test = tokenizer.texts_to_sequences(x_test)

In [11]:
avg_len = np.mean([len(seq) for seq in seq_train])
std_len = np.std([len(seq) for seq in seq_train])
longest_len = max(len(seq) for seq in seq_train)
shortest_len = min(len(seq) for seq in seq_train)

print("The average length of the training sequences is {} with a standard deviation of {}.".format(avg_len,std_len))
print("The longest sequence has a size of {}, and the shortest only {}.".format(longest_len,shortest_len))

The average length of the training sequences is 68.22156908210138 with a standard deviation of 101.07344657013672.
The longest sequence has a size of 1403, and the shortest only 1.


In [23]:
# Fix a maximum length for the sequences: mean length plus three standard deviations
max_len = int(avg_len + std_len * 3)
print("The maximum length for the sequences is fixed at {}.".format(max_len))

The maximum length for the sequences is fixed at 371.
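
The rule used above (mean length plus three standard deviations) keeps most sequences intact and only truncates the long tail. The sketch below illustrates this on hypothetical token counts; `toy_lengths` is invented for the example and is not taken from the dataset. It uses the population standard deviation, which matches the default of `np.std`.

```python
import statistics

# Hypothetical token counts per comment, for illustration only
toy_lengths = [20] * 30 + [60] * 15 + [100] * 4 + [1500]

toy_mean = statistics.mean(toy_lengths)
toy_std = statistics.pstdev(toy_lengths)  # population std, like np.std with ddof=0
toy_max_len = int(toy_mean + 3 * toy_std)

# Share of sequences that fit without being truncated
coverage = sum(length <= toy_max_len for length in toy_lengths) / len(toy_lengths)
print(toy_max_len, coverage)
```

The same coverage check could be run on `seq_train` to see how many real comments are cut at the chosen length.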


### pad_sequences

Pad the sequences to the same length, "max_len". Sequences shorter than "max_len" are padded with "0" at the end, and longer sequences are truncated at the end.   
(see https://keras.io/preprocessing/sequence/ ).
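
As a minimal sketch of the padding and truncation described above, the following pure-Python function mimics the `padding='post'` and `truncating='post'` settings. The function name `pad_post` and the toy sequences are hypothetical; the real `pad_sequences` returns a NumPy array and supports more options.

```python
def pad_post(sequences, max_len, value=0):
    """Cut each sequence to max_len items, then right-pad it with `value`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:max_len])               # truncating='post': drop items past max_len
        kept += [value] * (max_len - len(kept))  # padding='post': fill at the end
        padded.append(kept)
    return padded

# Hypothetical toy sequences, for illustration only
toy_padded = pad_post([[4, 7], [1, 2, 3, 4, 5, 6], []], max_len=4)
print(toy_padded)
```

Because the tokenizer never assigns index 0 to a word, the padding value cannot be confused with a real token.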

In [59]:
data_train = pad_sequences(seq_train, maxlen=max_len, padding='post', truncating='post', value=0)
data_test = pad_sequences(seq_test, maxlen=max_len, padding='post', truncating='post', value=0)

print('Shape of training data tensor:', data_train.shape)
print('Shape of test data tensor:', data_test.shape)

Shape of training data tensor: (159571, 371)
Shape of test data tensor: (153164, 371)
