<h2 style="text-align:center;color:blue;">Objective:</h2>

<h3 style="text-align:center;">In this competition we rank comments by the severity of their toxicity. We are given a list of comments, and each comment should be scored according to its relative toxicity. Comments with a higher degree of toxicity should receive a higher numerical value than comments with a lower degree of toxicity.</h3>
    
  <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSblnNX1zqaG70dan0DywBXM1VP75dbjCYbkA&usqp=CAU" width="400"></img>

<p style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Importing Libraries & Data</b></p> 

In [None]:
import sys
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import re
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import gensim

from wordcloud import WordCloud, STOPWORDS
import nltk
nltk.download('stopwords')
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Flatten, Dropout, Dense, LSTM, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
dff = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv')
dff.head()

<p style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Quick EDA</b></p> 

In [None]:
dff.shape

In [None]:
dff.isna().sum()

In [None]:
dff.describe()

In [None]:
dff.severe_toxic.value_counts()

In [None]:
dff['toxicity'] = (dff[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1) > 0 ).astype(int)
dff = dff[['comment_text', 'toxicity']].rename(columns={'comment_text': 'text'})
dff.sample(5)

In [None]:
dff.describe()

In [None]:
dff.head()

In [None]:
dff.toxicity.value_counts()

The classes are imbalanced, so we undersample the non-toxic comments to match the number of toxic ones.

In [None]:
min_len = (dff['toxicity'] == 1).sum()
df_undersample = dff[dff['toxicity'] == 0].sample(n=min_len, random_state=201)
dff = pd.concat([df_undersample, dff[dff['toxicity'] == 1]])
dff = shuffle(dff)

In [None]:
dff.text = dff.text.map(lambda x:x.replace('\n', ' '))
dff.text[:2]

In [None]:
toxic = dff[dff['toxicity'] == 1]
not_toxic = dff[dff['toxicity'] == 0]

In [None]:
wordcloud = WordCloud(width=1400, height=700, background_color='white', max_words=100).generate(' '.join(toxic.text.tolist()))
fig = plt.figure(figsize=(30,10), facecolor='white')
plt.imshow(wordcloud)
plt.axis('off')
plt.title('The 100 most frequent words in the toxic comments', fontsize=50)
plt.tight_layout(pad=0)
plt.show()

In [None]:
wordcloud = WordCloud(width=1400, height=700, background_color='white', max_words=100).generate(' '.join(not_toxic.text.tolist()))
fig = plt.figure(figsize=(30,10), facecolor='white')
plt.imshow(wordcloud)
plt.axis('off')
plt.title('The 100 most frequent words in the normal comments', fontsize=50)
plt.tight_layout(pad=0)
plt.show()

<p style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Text Pre-Processing</b></p> 

We separate the features (x) from the target (y), then make a copy of the features to work on:

In [None]:
y = dff.toxicity
x = dff.drop('toxicity', axis = 1)

In [None]:
texts = x.copy()
texts.reset_index(inplace = True, drop = True)
texts.head()

To avoid a RecursionError, we raise the recursion limit to 6000.

In [None]:
print(sys.getrecursionlimit())

In [None]:
sys.setrecursionlimit(6000)

When dealing with text, we first clean and stem it:

### What Is Stemming?

Stemming reduces a word to its stem, or root, by stripping suffixes.

### Example:

Take the words **connect, connected and connecting**. All three are forms of the same root word **connect**, so after stemming only one word remains: connect. Irregular forms are not merged, though: the Porter stemmer used here leaves **sent** as it is rather than mapping it to **send**.
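
As a rough illustration of the idea, here is a minimal sketch of suffix stripping. It is a deliberately naive stand-in, not the Porter algorithm used in this notebook; the function name `naive_stem` and its suffix list are made up for the example.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connect", "connected", "connecting", "connects"]
stems = [naive_stem(w) for w in words]
print(stems)  # every form collapses to the same stem
```

A real stemmer applies many more rules (and conditions on when to apply them), which is why the notebook relies on `PorterStemmer` instead.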

In [None]:
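ps = PorterStemmer()
# Build the stop-word set once; rebuilding the list for every word is very slow
stop_words = set(stopwords.words('english'))
corpus = []

for i in tqdm(range(len(texts))):
    # Keep letters only, lowercase, and split into tokens
    cleaned = re.sub('[^a-zA-Z]', ' ', texts['text'][i])
    cleaned = cleaned.lower().split()

    # Drop stop words and stem what remains
    cleaned = [ps.stem(word) for word in cleaned if word not in stop_words]
    cleaned = ' '.join(cleaned)
    corpus.append(cleaned)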
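
To make the cleaning step concrete, here is a small sketch of what happens to a single comment before stemming. It uses only the standard library; `toy_stopwords` is a tiny made-up set standing in for the NLTK English stop-word list, and the sample sentence is invented.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration
toy_stopwords = {"you", "are", "a", "the", "is"}


def clean(text, stop_words=toy_stopwords):
    """Keep letters only, lowercase, split into tokens, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [t for t in tokens if t not in stop_words]


cleaned_tokens = clean("You are a TROLL!!! Stop it, 2 times already.")
print(cleaned_tokens)  # ['troll', 'stop', 'it', 'times', 'already']
```

Digits and punctuation disappear because the regular expression replaces every non-letter with a space.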

Our model cannot work with raw text; it needs numbers as input, so we first build word embeddings.

### What is Word Embedding?

Word embedding is one of the most popular representations of document vocabulary. It can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

A word embedding is a vector representation of a particular word.
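
One common way to compare two word vectors is cosine similarity. The sketch below uses hand-written three-dimensional toy vectors (the values and the dictionary `toy_vectors` are invented for illustration, not taken from a trained model) to show that vectors pointing in similar directions score close to 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the first two point in nearly the same direction
toy_vectors = {
    "insult": [0.9, 0.1, 0.0],
    "offens": [0.8, 0.2, 0.1],
    "thank": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["insult"], toy_vectors["offens"])
sim_far = cosine_similarity(toy_vectors["insult"], toy_vectors["thank"])
print(round(sim_close, 3), round(sim_far, 3))
```

In a trained model the vectors have many more dimensions, but the comparison works the same way.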

In [None]:
DIM = 100

X = [d.split() for d in corpus]
w2v_model = gensim.models.Word2Vec(sentences = X, vector_size = DIM, window = 10, min_count = 1)

Let's see how many words are in our vocabulary:

In [None]:
len(w2v_model.wv.key_to_index)

We can look up the words most similar to a given one. Let's try the word 'toxic':

In [None]:
w2v_model.wv.most_similar('toxic')

Now we tokenize the sentences and convert X into sequences of numbers:

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X) 

In [None]:
X = tokenizer.texts_to_sequences(X)
X[:3]
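
To illustrate what the tokenizer does, here is a minimal pure-Python sketch. It assumes indices are assigned from 1 upward by descending word frequency, which is how the Keras `Tokenizer` builds its `word_index`; the toy `docs` list is made up for the example.

```python
from collections import Counter

# Already-tokenized toy documents (hypothetical data)
docs = [["stupid", "troll"], ["thank", "edit"], ["stupid", "edit", "stupid"]]

# Count every token, then give the most frequent word index 1, the next 2, ...
counts = Counter(token for doc in docs for token in doc)
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace each token by its index
sequences = [[word_index[t] for t in doc] for doc in docs]
print(word_index)
print(sequences)
```

Index 0 is left unused here on purpose, since it is reserved for padding in the next step.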

Let's pad or truncate all the sequences to the same length, which is 20 in our case:

In [None]:
X = pad_sequences(X, padding = 'pre', maxlen = 20)
X[:3]
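
The effect of `pad_sequences` with `padding = 'pre'` can be sketched in a few lines of plain Python. The helper `pad_pre` below is a hypothetical stand-in written for illustration; it also mimics the library's default of truncating from the front, so long sequences keep their last `maxlen` items.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

With `maxlen = 20`, any comment longer than 20 tokens therefore loses its beginning, not its end.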

We use the Word2Vec vectors as the initial weights of the embedding layer; training then fine-tunes them to improve accuracy:

In [None]:
vocab_size = len(tokenizer.word_index) + 1 
vocab = tokenizer.word_index

In [None]:
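def get_weights_matrix(model) :
    # Row i holds the Word2Vec vector of the word with tokenizer index i;
    # row 0 stays all zeros because index 0 is reserved for padding
    weights_matrix = np.zeros((vocab_size, DIM))
    
    for word, i in vocab.items() :
        weights_matrix[i] = model.wv[word]
        
    return weights_matrix


embedding_vectors = get_weights_matrix(w2v_model) 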
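
Conceptually, the embedding layer uses this matrix as a lookup table: each token index in a padded sequence is replaced by the matching row. The sketch below shows that lookup with a made-up two-dimensional matrix (`toy_weights` and the token names in the comments are hypothetical).

```python
# Toy weights matrix: row i holds the vector for token index i.
# Row 0 is all zeros because index 0 is reserved for padding.
toy_weights = [
    [0.0, 0.0],  # index 0: padding
    [0.9, 0.1],  # index 1: hypothetical token "stupid"
    [0.1, 0.8],  # index 2: hypothetical token "thank"
]


def embed(sequence, weights):
    """Replace each token index with its row of the weights matrix."""
    return [weights[i] for i in sequence]


padded_sequence = [0, 0, 2, 1]
embedded = embed(padded_sequence, toy_weights)
print(embedded)
```

The real layer does the same thing with a `vocab_size` by `DIM` matrix and sequences of length 20.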

<p style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Modeling & Training</b></p> 

In [None]:
model = Sequential()

model.add(Embedding(vocab_size, output_dim = DIM, weights = [embedding_vectors], input_length = 20)) 
model.add(Dropout(0.2))

model.add(LSTM(64))
model.add(Dropout(0.2))

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(1, activation = 'linear'))

In [None]:
model.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['accuracy'])
model.summary()

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

es = EarlyStopping(patience=3, 
                   monitor='val_loss',  # watch the held-out loss rather than the training loss
                   restore_best_weights=True, 
                   mode='min', 
                   verbose=1)

# train the model 
hist = model.fit(x_train, y_train, validation_data = (x_test, y_test), epochs = 15,
                 callbacks=[es], batch_size = 32, shuffle=True)

In [None]:
plt.style.use('fivethirtyeight')

# visualize the model's accuracy
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc = 'upper left')
plt.show()  

<p style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Submission</b></p> 

In [None]:
sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")

In [None]:
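# Clean and stem the new comments the same way as the training text, so tokens match the fitted vocabulary
stop_words = set(stopwords.words('english'))
new_text = [[ps.stem(w) for w in re.sub('[^a-zA-Z]', ' ', t).lower().split() if w not in stop_words] for t in sub.text]
new_text = pad_sequences(tokenizer.texts_to_sequences(new_text), padding = 'pre', maxlen = 20)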

In [None]:
sub['score'] = model.predict(new_text).ravel() * 1000  # flatten the (n, 1) predictions into one column
sub.head()
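
Since the task only asks for a relative ordering of the comments, multiplying the predictions by a positive constant such as 1000 does not change the ranking. The short sketch below checks this on made-up scores (`raw_scores` is hypothetical, and `ranking` is a helper written for this example).

```python
def ranking(scores):
    """Return item positions ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


raw_scores = [0.12, 0.85, 0.40]  # hypothetical model outputs
scaled_scores = [s * 1000 for s in raw_scores]

print(ranking(raw_scores))     # [0, 2, 1]
print(ranking(scaled_scores))  # [0, 2, 1]
```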

In [None]:
sub[['comment_id', 'score']].to_csv("submission.csv", index=False)
