# Problem recap:

Ref: https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/overview
        
It only takes one toxic comment to sour an online discussion. The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. If these toxic contributions can be identified, we could have a safer, more collaborative internet.

In the previous 2018 Toxic Comment Classification Challenge, Kagglers built multi-headed models to recognize toxicity and several subtypes of toxicity. In 2019, in the Unintended Bias in Toxicity Classification Challenge, you worked to build toxicity models that operate fairly across a diverse range of conversations. This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data.

Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results "translate" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages.

As our computing resources and modeling capabilities grow, so does our potential to support healthy conversations across the globe. Develop strategies to build effective multilingual models and you'll help Conversation AI and the entire industry realize that potential.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

<b>In this notebook, we perform EDA on the data, assess the potential of simple classical ML models, and try some SOTA neural network models.</b>

In [1]:
#useful imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import os

In [2]:
base_path = "../../data/jigsaw-multilingual-toxic-comment-classification"
for dirname, _, filenames in os.walk(base_path):
    print("total files ",len(filenames))
    for filename in filenames:
        print(os.path.join(dirname, filename))

total files  9
../../data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv
../../data/jigsaw-multilingual-toxic-comment-classification/validation.csv
../../data/jigsaw-multilingual-toxic-comment-classification/test.csv
../../data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-processed-seqlen128.csv
../../data/jigsaw-multilingual-toxic-comment-classification/validation-processed-seqlen128.csv
../../data/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv
../../data/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train-processed-seqlen128.csv
../../data/jigsaw-multilingual-toxic-comment-classification/test-processed-seqlen128.csv
../../data/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv


In [3]:
try:

    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


In [4]:
# Imports for splitting, modelling and evaluation
# (numpy, pandas, tensorflow and os are already imported in cell 1)
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Embedding, BatchNormalization
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping

# Early-stopping callback on validation loss (defined here, not passed to model.fit below)
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

max_len = 1200  # sequences are padded/truncated to this many tokens
batch_size = 400 * strategy.num_replicas_in_sync  # global batch size across replicas
epoch = 5  # number of training epochs
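
The `batch_size` above is a global batch size: a per-replica batch of 400 multiplied by the number of replicas the strategy reports. The arithmetic can be sketched as follows. The helper names are hypothetical, the 8-replica case assumes a TPU v3-8, and the sample count is the one reported by `model.fit` in this notebook's training log.

```python
import math

def global_batch_size(per_replica_batch, num_replicas):
    # Each replica processes `per_replica_batch` samples per step
    return per_replica_batch * num_replicas

def steps_per_epoch(num_samples, batch_size):
    # The last, possibly smaller, batch still counts as a step
    return math.ceil(num_samples / batch_size)

n_train = 185239  # training samples reported by model.fit in this notebook

for replicas in (1, 8):  # 1 = CPU / single GPU, 8 = assumed TPU v3-8
    bs = global_batch_size(400, replicas)
    print(replicas, bs, steps_per_epoch(n_train, bs))
```

With more replicas, each epoch takes fewer (larger) steps, which is one reason the learning rate is often re-tuned when moving from a single device to a TPU.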

In [5]:
# nltk
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, word_tokenize

# gensim
from gensim.models import Word2Vec

lem = WordNetLemmatizer()
tokenizer = TweetTokenizer()
np.random.seed(0)  # fix NumPy's random seed for reproducibility

In [6]:
train = pd.read_csv(os.path.join(base_path, 'jigsaw-toxic-comment-train.csv'))
validation = pd.read_csv(os.path.join(base_path, 'validation.csv'))
test = pd.read_csv(os.path.join(base_path, 'test.csv'))

In [7]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [8]:
validation.head()

Unnamed: 0,id,comment_text,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0


In [9]:
test.head()

Unnamed: 0,id,content,lang
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru
2,2,"Quindi tu sei uno di quelli conservativi , ...",it
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr


In [10]:
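# Keep only the comment text and the binary `toxic` label
train.drop(['id', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1, inplace=True)
validation.drop(['id', 'lang'], axis=1, inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
X = pd.concat([train, validation])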

In [11]:
# size_data = 50000
# X = X.loc[:size_data, :]

# Maximum comment length in whitespace-separated tokens
print(train.comment_text.apply(lambda x: len(str(x).split())).max())


def roc_auc(predictions, target):
    """Return the area under the ROC curve for the predicted scores."""
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    return metrics.auc(fpr, tpr)


# Stratified 80/20 split so both parts keep the same toxic/non-toxic ratio
xtrain, xvalid, ytrain, yvalid = train_test_split(X.comment_text.values, X.toxic.values,
                                                stratify=X.toxic.values,
                                                random_state=42,
                                                test_size=0.2, shuffle=True)

2321
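
The longest training comment has 2321 whitespace-separated tokens, while `max_len` is set to 1200, so the longest comments are truncated when padded. One way to judge a candidate limit is to measure what fraction of comments fits without truncation. The sketch below does this on made-up toy comments; `token_count` and `coverage` are hypothetical helper names, not part of the original notebook.

```python
def token_count(comment):
    # Same measure as the cell above: whitespace-separated tokens
    return len(str(comment).split())

def coverage(comments, max_len):
    # Fraction of comments that fit into `max_len` tokens without truncation
    lengths = [token_count(c) for c in comments]
    return sum(n <= max_len for n in lengths) / len(lengths)

# Toy comments (made up for illustration, not taken from the dataset)
toy = ["short one", "a bit longer comment here", "word " * 50, "word " * 2000]

for limit in (5, 100, 1200):
    print(limit, coverage(toy, limit))
```

Applied to `train.comment_text`, the same function would show how many comments a given `max_len` actually cuts off.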


In [12]:
# Tokenizer
token = text.Tokenizer(num_words=None)

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
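
To make the tokenize-and-pad step concrete, here is a simplified pure-Python sketch of the same idea: words are ranked by frequency and indexed from 1, index 0 is kept for padding, and sequences are left-padded (or cut from the front) to a fixed length, which matches the default behaviour of `pad_sequences`. This is an illustration, not the Keras implementation: it ignores punctuation filtering, and the function names are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency; indices start at 1 because 0 is kept for padding
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Words missing from the index are skipped
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_pre(seqs, maxlen):
    # Keep the last `maxlen` tokens and left-pad shorter sequences with zeros
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]

texts = ["you are great", "you are you"]
index = build_word_index(texts)
padded = pad_pre(to_sequences(texts, index), maxlen=4)
print(index)   # {'you': 1, 'are': 2, 'great': 3}
print(padded)  # [[0, 1, 2, 3], [0, 1, 2, 1]]
```

Because index 0 never maps to a real word, the embedding layer needs `len(word_index) + 1` rows.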

In [13]:
def compile_model(model):
    # Binary cross-entropy with Adam, plus confusion-matrix and ranking metrics
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[
        keras.metrics.TruePositives(name='tp'),
        keras.metrics.FalsePositives(name='fp'),
        keras.metrics.TrueNegatives(name='tn'),
        keras.metrics.FalseNegatives(name='fn'),
        keras.metrics.BinaryAccuracy(name='accuracy'),
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc')])
    return model

def create_model():
    # Embedding -> SimpleRNN -> sigmoid output for binary toxicity
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,  # +1 because index 0 is reserved for padding
                    300,
                    input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    return compile_model(model)

def save_model(model):
    # Architecture goes to JSON, weights to HDF5
    model_json = model.to_json()
    with open("model.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights("model.h5")
    print("Saved model to disk")

def load_model():
    with open('model.json', 'r') as json_file:
        loaded_model_json = json_file.read()
    loaded_model = model_from_json(loaded_model_json)
    loaded_model.load_weights("model.h5")
    print("Loaded model from disk")
    # model_from_json returns an uncompiled model, so compile before calling fit
    return compile_model(loaded_model)

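# Balanced class weights: total / (n_classes * class_count), so the rarer toxic class weighs more.
# Counts are looked up by label instead of relying on the sort order of value_counts().
class_counts = train.toxic.value_counts()
total = class_counts.sum()
nontoxic_weights = total / (2 * class_counts.loc[0])
toxic_weights = total / (2 * class_counts.loc[1])
class_weights = {0: nontoxic_weights, 1: toxic_weights}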

if os.path.exists("model.h5") and os.path.exists('model.json'):
    model = load_model()
else:
    model = create_model()

print(model.summary())


model.fit(xtrain_pad, ytrain, epochs=epoch, batch_size=batch_size, class_weight=class_weights, validation_data=(xvalid_pad, yvalid))
save_model(model)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1200, 300)         107498400 
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 100)               40100     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 107,538,601
Trainable params: 107,538,601
Non-trainable params: 0
_________________________________________________________________
None
Train on 185239 samples, validate on 46310 samples
Epoch 1/5
Epoch 2/5
Epo
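
The class weights passed to `model.fit` above follow the usual "balanced" scheme: each class gets `total / (n_classes * class_count)`, so both classes contribute the same total weight to the loss. A minimal sketch with hypothetical counts (not the dataset's real class counts) shows the effect; `balanced_class_weights` is an illustrative helper name.

```python
def balanced_class_weights(counts):
    # weight_c = total / (n_classes * count_c)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Hypothetical counts for illustration (not the dataset's real class counts)
counts = {0: 900, 1: 100}
weights = balanced_class_weights(counts)
print(weights)

# Each class now carries the same total weight
print(counts[0] * weights[0], counts[1] * weights[1])
```

With a 9:1 imbalance, the minority class is weighted nine times as heavily as the majority class.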

In [14]:
scores = model.predict(xvalid_pad)
# AUC is a fraction in [0, 1], not a percentage
print("AUC: %.2f" % roc_auc(scores, yvalid))

AUC: 0.93
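
ROC AUC, the metric reported above, can be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one. The sketch below computes it by brute force over all positive/negative pairs. It is meant to illustrate the definition on toy data, not to replace `metrics.roc_curve`, and `pairwise_auc` is a hypothetical helper name.

```python
def pairwise_auc(scores, labels):
    # Probability that a random positive is scored above a random negative;
    # ties count as half a win
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: three of the four positive/negative pairs are ranked correctly
print(pairwise_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Because the metric depends only on the ranking of scores, it is insensitive to the decision threshold and to class imbalance in a way that plain accuracy is not.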


In [15]:
test_data = token.texts_to_sequences(test.content)
test_data_seq = sequence.pad_sequences(test_data, maxlen=max_len)

In [16]:
test['toxic'] = model.predict(test_data_seq, verbose=1).ravel()  # flatten (n, 1) predictions to 1-D



In [17]:
test[['id', 'toxic']].to_csv(os.path.join(base_path, 'submission.csv'), index=False)

# References

https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models/notebook

https://www.kaggle.com/joydeb28/simplernn-achive-good-accuracy