* Analyzed Wikipedia talk-page comments (Jigsaw Toxic Comment
  Classification Challenge) to detect and classify them into different
  types of toxicity such as threats, obscenity, insults and hate.
* Applied a deep learning technique, a recurrent neural network (RNN), to
  understand each sentence and classify it against the
  6 types of toxicity.
* Achieved 96% accuracy on the test dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
#from textblob import TextBlob
#from spellchecker import SpellChecker
#from autocorrect import spell
#from gingerit.gingerit import GingerIt
#from symspellpy.symspellpy import SymSpell, Verbosity
#from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
#from glove import Glove, Corpus
import tensorflow
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Flatten, SimpleRNN, RNN,GRU, SpatialDropout1D, Dropout
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence


## Loading Dataset into Kernel

In [None]:
dataset = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')
X = dataset.iloc[:,1].values
y = dataset.iloc[:,2:].values

dataset_test = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')
X_test = dataset_test.iloc[:,1].values
X_test = X_test.reshape(-1, 1)  # single column; avoids hard-coding the row count

test_labels = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip')
Y_test = test_labels.iloc[:,1:].values

merged = pd.merge(dataset_test, test_labels, how="left", on="id")

In [None]:
dataset.shape

In [None]:
merged.head()
Y_test

In [None]:
merged['sum'] = merged['toxic'] + merged['severe_toxic'] + merged['obscene'] + merged['threat'] + merged['insult'] + merged['identity_hate']

In [None]:
merged.drop('id',axis=1, inplace=True)
merged.head()

In [None]:
# Rows whose six labels are all -1 were not used for scoring; keep only the labeled rows
merge = merged[merged['sum'] != -6]

In [None]:
merge.shape

In [None]:
Y_test = merge.iloc[:,1:7].values
X_test = merge.iloc[:,0].values

In [None]:
X_test.shape

## Tokenizing the data, removing all non-alphabetic characters with regular expressions, and lowercasing

In [None]:
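# Tokenize every comment into a list of word tokens
tokens = [word_tokenize(str(sentence)) for sentence in X]
rm = []
for w in tokens:
    # Replace everything except letters with a space, then split on whitespace
    sm = re.sub('[^A-Za-z]', ' ', str(w))
    x = re.split(r"\s", sm)
    rm.append(x)

# Removing the empty strings left behind by the split
for sent in rm:
    while '' in sent:
        sent.remove('')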

# Lowercasing
low = []
for i in rm:
    i = [x.lower() for x in i]
    low.append(i)


## Lemmatization and Removal of Stopwords
WordNetLemmatizer is used to reduce each word in the dataset to its root form. For example, reducing "increased" to "increase" cuts down the number of redundant word forms and therefore the dimensionality of the dataset. Note that the call below uses the lemmatizer's default part of speech (noun), so verb forms such as "increased" are only reduced when a verb tag is passed. Stopwords such as a, an, the, and, not are also removed, since they carry little information on their own and do not influence the predictions much.

In [None]:
lemma = []
wnl = WordNetLemmatizer()
for doc in low:
    tokens = [wnl.lemmatize(w) for w in doc]
    lemma.append(tokens)

# Removing Stopwords
filter_words = []
Stopwords = set(stopwords.words('english'))

#ab = spell('nd')
for sent in lemma:
    tokens = [w for w in sent if w not in Stopwords]
    filter_words.append(tokens)

space = ' ' 
sentences = []
for sentence in filter_words:
    sentences.append(space.join(sentence))


In [None]:
filtered_words = []
for sent in filter_words:
    token = [word for word in sent if len(word)>2]
    filtered_words.append(token)

## Word Embedding
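Word2Vec is used to learn the relationships between words. Each word is mapped to a vector, and the similarity between two words is measured from their vectors. The resulting vectors form the pre-trained embedding matrix used to initialize the network.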

### Word2Vec
Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic context of words. Word2Vec takes a large corpus as input and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a vector in that space. Words that share similar contexts in the corpus are placed close to each other. The cosine of the angle between the vectors of words with similar contexts should be close to 1, i.e., the angle should be close to 0.
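
To make the cosine claim concrete, here is a minimal sketch of cosine similarity in NumPy. The three 3-dimensional vectors are made-up values for illustration only, not vectors from the trained model, and `cosine_similarity` is a helper defined here rather than a Gensim function.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors, invented for this example
vec_mother = np.array([0.9, 0.1, 0.3])
vec_father = np.array([0.8, 0.2, 0.3])
vec_car = np.array([-0.2, 0.9, -0.4])

sim_close = cosine_similarity(vec_mother, vec_father)  # similar direction
sim_far = cosine_similarity(vec_mother, vec_car)       # different direction
print(sim_close, sim_far)
```

Vectors pointing in nearly the same direction give a value near 1, while unrelated directions give values near 0 or below.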

This is the idea behind distributed representations. Intuitively, the representation of one word is made to depend on other words, and the words that appear in its context get the greatest share of this dependence.

Word2Vec can use either of two model architectures to produce a distributed representation of words: continuous bag of words (CBOW) or continuous skip-gram.

#### Continuous Bag of Words (CBOW)
This method takes the context of each word as input and tries to predict the word corresponding to the context from a window of surrounding context words. The order of context words does not influence the predictions. We learn the vector representation of the target word.
 
In CBOW, the input is the set of context words and the network tries to predict the center word, so that the cosine between the context and center word vectors approaches one. The input layer is the one-hot encoding of each context word. Multiplying it by the input-to-hidden weight matrix W selects that word's vector representation, which forms the hidden layer. The hidden layer is then multiplied by the hidden-to-output weight matrix W' and the result is fed to the softmax function, which calculates a probability for every word in the vocabulary; the word with the highest probability is chosen as the predicted center word.

With multiple context words, the vectors obtained in the hidden layer are averaged before going further.
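
The forward pass described above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions: the five-word vocabulary, the embedding size of 3 and the random weights are all invented, and no training step is shown, so the predicted word is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
V, N = len(vocab), 3                         # vocabulary size, embedding size

W = rng.normal(size=(V, N))        # input -> hidden weights (one row per word)
W_prime = rng.normal(size=(N, V))  # hidden -> output weights

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def cbow_forward(context_ids):
    # A one-hot vector times W selects that word's row of W;
    # with several context words the rows are averaged.
    hidden = np.mean([one_hot(i, V) @ W for i in context_ids], axis=0)
    scores = hidden @ W_prime
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return exp / exp.sum()

context_ids = [vocab.index("the"), vocab.index("sat")]
probs = cbow_forward(context_ids)
predicted = vocab[int(np.argmax(probs))]
print(probs, predicted)
```

During training, W and W' would be adjusted so that the probability of the true center word increases; the rows of W are then used as the word vectors.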
 
#### Skip gram
Skip-gram is the opposite of CBOW. In this algorithm, the target word is fed as input to the network, which learns by predicting the surrounding context words. It weighs nearby context words more heavily than more distant ones.
 
Both models learn about words from their local usage context, where the context is defined by a window of neighboring words. CBOW is considered faster to train than skip-gram, but skip-gram performs well on small datasets and is found to represent rare words well. CBOW gives better representations for more frequent words.
Word2Vec is implemented here using the Gensim package.
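
As a rough illustration of what "a window of neighboring words" means for skip-gram, the sketch below builds the (target, context) training pairs for one sentence. `skipgram_pairs` is a hypothetical helper written for this example; it uses a fixed window and does not reproduce the distance weighting mentioned above. In Gensim itself the architecture is selected with the `sg` parameter of `Word2Vec` (0 for CBOW, 1 for skip-gram).

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["you", "are", "very", "rude"], window=1)
print(pairs)
# [('you', 'are'), ('are', 'you'), ('are', 'very'),
#  ('very', 'are'), ('very', 'rude'), ('rude', 'very')]
```

CBOW uses the same windows but groups all context words of a target into a single training example.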


In [None]:
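# Train Word2Vec with the Gensim defaults (CBOW architecture, sg=0)
model_cbow = Word2Vec(filtered_words)
word_vectors = model_cbow.wv
# Note: `.vocab` is the Gensim 3.x API; Gensim 4 replaced it with `key_to_index`
vocabulary = word_vectors.vocab.items()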


In [None]:
# Query through the KeyedVectors object; calling most_similar on the model itself is deprecated
model_cbow.wv.most_similar('mother')

In [None]:
len(word_vectors.vocab)

## Handling of Unknown words

In [None]:
keys = list(word_vectors.vocab.keys())
unk = 0
total = 0

embedding_matrix = word_vectors.vectors
## Word with their index values
word2id = {k:v.index for k,v in word_vectors.vocab.items()}

## Unknown values
UNK_INDEX = 0
UNK_TOKEN = 'UNK'
unk_vector = embedding_matrix.mean(0)

## Inserting row for unknown words 
embedding_matrix = np.insert(embedding_matrix, [UNK_INDEX], [unk_vector], axis=0)
word2id = {word:(index+1) if index >= UNK_INDEX else index for word, index in 
           word2id.items()}
word2id[UNK_TOKEN] = UNK_INDEX

## Replacing words in x_train with their respective indices and replacing each unknown 
## word with index 0
L = []
for sent in filter_words:
    Z = []
    for word in sent:
        if word in word2id:
            Z.append(word2id.get(word))
        else:
            Z.append(UNK_INDEX)
            unk+=1
    L.append(Z)
# Pad (or truncate) every sequence to 100 word ids; ids are integers, so use an integer dtype
X_train = pad_sequences(L, maxlen=100, padding='post',
                        dtype='int32')
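
The index shifting in the cell above is easier to follow on a tiny example. The sketch below repeats the same steps on a made-up 3-word vocabulary with 2-dimensional vectors; `toy_matrix`, `toy_word2id` and `encode` are hypothetical names used only here, and `encode` is a simplified stand-in for `pad_sequences` with post-padding.

```python
import numpy as np

# Toy stand-ins for the trained vectors and vocabulary (made-up values)
toy_matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_word2id = {"good": 0, "bad": 1, "rude": 2}

UNK_INDEX = 0
unk_vector = toy_matrix.mean(axis=0)         # mean of all known vectors

# Put the mean vector in row 0 and shift every known word down by one row
toy_matrix = np.insert(toy_matrix, UNK_INDEX, unk_vector, axis=0)
toy_word2id = {word: index + 1 for word, index in toy_word2id.items()}
toy_word2id["UNK"] = UNK_INDEX

def encode(tokens, word2id, maxlen):
    """Map tokens to ids (unknown -> UNK_INDEX), keep the last maxlen, pad with 0 at the end."""
    ids = [word2id.get(tok, UNK_INDEX) for tok in tokens][-maxlen:]
    return ids + [0] * (maxlen - len(ids))

encoded = encode(["bad", "xyzzy", "good"], toy_word2id, maxlen=5)
print(encoded)  # [2, 0, 1, 0, 0]
```

One consequence worth noting: the padding value 0 is also the UNK index, so padded positions are embedded with the same mean vector as unknown words.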


## Implementing an RNN (GRU/LSTM) with Keras

In [None]:
## Implementing RNN using GRU/LSTM
vocab_len = len(embedding_matrix)
model = Sequential()
model.add(Embedding(vocab_len, 100, input_length = 100,weights=[embedding_matrix]))
model.add(GRU(units=100, activation='tanh'))
#model.add(LSTM(units=120, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(50,activation='tanh'))
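# Sigmoid gives an independent probability per label, which is what
# binary_crossentropy expects when a comment can carry several labels at once
model.add(Dense(6,activation='sigmoid'))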
model.compile(optimizer='adam',loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
model.fit(X_train,y,batch_size=1000,epochs=10)


## Testing model on test data

In [None]:
import pandas as pd
x_test = []
for sentence in X_test:
    x_test.append(text_to_word_sequence(str(sentence),filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' '))

filter_test = []
for sent in x_test:
    tokens = [w for w in sent if w not in Stopwords]
    filter_test.append(tokens)

## Converting the text into sequences using ids
## (iterate over filter_test so stopwords are dropped, as in training)
L = []
for sent in filter_test:
    Z = []
    for word in sent:
        if word in word2id and len(word)>2:
            Z.append(word2id.get(word))
        else:
            Z.append(UNK_INDEX)
            unk+=1
    L.append(Z)

X_test = pad_sequences(L, maxlen=100, padding= 'post',dtype='float')
y_pred = model.predict(X_test)


In [None]:
model.evaluate(X_test,Y_test)
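
`model.predict` returns one probability per label, so the predictions still have to be turned into 0/1 labels before they can be compared with `Y_test` label by label. A minimal sketch, assuming a 0.5 decision threshold; the probabilities and ground truth below are made up for illustration and are not results from this model.

```python
import numpy as np

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up probabilities and ground truth for three comments
toy_pred = np.array([[0.91, 0.10, 0.75, 0.02, 0.66, 0.05],
                     [0.03, 0.01, 0.02, 0.01, 0.04, 0.01],
                     [0.55, 0.40, 0.30, 0.60, 0.20, 0.10]])
toy_true = np.array([[1, 0, 1, 0, 1, 0],
                     [0, 0, 0, 0, 0, 0],
                     [1, 0, 0, 0, 0, 0]])

toy_labels = (toy_pred >= 0.5).astype(int)                  # assumed threshold
per_label_accuracy = (toy_labels == toy_true).mean(axis=0)  # one value per label

for name, acc in zip(labels, per_label_accuracy):
    print(f"{name}: {acc:.2f}")
```

The same two lines can be applied to `y_pred` and `Y_test` to see how accuracy varies across the six labels.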