# Introduction

![](https://storage.googleapis.com/kaggle-media/competitions/jigsaw/003-avatar.png)

**Understanding the Dataset**

**The dataset consists of Wikipedia comments that human raters have labeled for toxic behavior. It is part of the competition [here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview). The types of toxicity are:**
* toxic
* severe_toxic
* obscene
* threat
* insult
* identity_hate

**You must create a model which predicts a probability of each type of toxicity for each comment.**

1. train.csv - the training set, contains comments with their binary labels
2. test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
3. sample_submission.csv - a sample submission file in the correct format
4. test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring

In [None]:
!pip install wordcloud

# Loading Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
# Import layers from keras.layers directly; the old submodule paths
# (keras.layers.recurrent, .core, .embeddings, .normalization) are not available in newer Keras releases
from keras.layers import LSTM, GRU, SimpleRNN
from keras.layers import Dense, Activation, Dropout
from keras.layers import Embedding
from keras.layers import BatchNormalization
from keras.utils import to_categorical  # used in place of the keras.utils.np_utils module
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D, Input
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from pandas_profiling import ProfileReport

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading Dataset

In [None]:
train_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip') 
test_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')
labels_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip')

print('The Training data has {} rows and {} columns '.format(train_df.shape[0], train_df.shape[1]))

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
labels_df.head()

***This file contains rows that were not used for scoring; they are marked with -1***
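
One way to keep only the scored rows is to drop every row whose labels are -1. The sketch below uses a small hand-made frame in place of `labels_df`; the `scored_only` helper and the sample values are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

LABEL_COLS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def scored_only(labels):
    """Return the rows in which no label is -1 (hypothetical helper)."""
    mask = (labels[LABEL_COLS] != -1).all(axis=1)
    return labels[mask]

# Toy stand-in for labels_df; the values are made up for illustration
toy_labels = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'toxic': [-1, 0, 1],
    'severe_toxic': [-1, 0, 0],
    'obscene': [-1, 0, 1],
    'threat': [-1, 0, 0],
    'insult': [-1, 0, 1],
    'identity_hate': [-1, 0, 0],
})

scored = scored_only(toy_labels)
print(scored['id'].tolist())
```

Applied to the real data, the same call would be `scored_only(labels_df)`.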

# Let's Dive into EDA!!

**Let's start by using Pandas profiling to get a concise exploration of the training data**

In [None]:
ProfileReport(train_df)

**As can be seen from above, the Dataset does not contain any missing values, so our work becomes easier now!!**

In [None]:
# Count toxic vs. non-toxic comments once and reuse the result
toxic_counts = train_df['toxic'].value_counts()
labels = toxic_counts.index
values = toxic_counts.values
color = ['green', 'lightblue']

# Plotly draws its own figure, so no matplotlib figure is needed here
fig = go.Figure(data=[go.Pie(labels=labels, textinfo='label+percent', values=values, marker=dict(colors=color))])
fig.show()

In [None]:
train_df.info()

# Correlation

***We plot the correlations among the label columns to see which types of toxicity tend to occur together; this can help when choosing a model later on***

In [None]:
plt.figure(figsize=(10,5))
colormap = plt.cm.plasma
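# Correlate only the numeric (label) columns; id and comment_text are text
sns.heatmap(train_df.select_dtypes(include='number').corr(), annot=True, cmap=colormap)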

**As we can see, Obscene correlates well with Toxic, and Insult and Obscene have the highest correlation among the columns, at 0.74**
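
Reading the strongest pairs off a heatmap by eye can be error-prone, so here is a minimal sketch that ranks column pairs by absolute correlation. The `top_correlated_pairs` helper and the toy frame are assumptions made for illustration; on the real data you would pass the label columns of `train_df`.

```python
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Rank distinct column pairs by absolute Pearson correlation (hypothetical helper)."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pairs.append((a, b, float(corr.loc[a, b])))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:n]

# Toy label columns; the values are made up for illustration
toy = pd.DataFrame({
    'toxic':   [1, 1, 0, 0, 1, 0],
    'obscene': [1, 1, 0, 0, 0, 0],
    'threat':  [0, 1, 0, 1, 0, 1],
})

top = top_correlated_pairs(toy)
for a, b, value in top:
    print('{} / {}: {:.2f}'.format(a, b, value))
```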

**Let's look at some Toxic and Non Toxic Comments**

In [None]:
train_df['comment_text'][0]

In [None]:
train_df['comment_text'][2]

In [None]:
sns.countplot(x=train_df['toxic'])

In [None]:
sns.countplot(y=train_df['obscene'])

# Wordcloud

**A word cloud is a great visualization technique for seeing which words are repeated most often, which gives some insight into the emotions of the person who wrote the comment**
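
A word cloud sizes each word by how often it appears, so the underlying idea can be sketched with a plain frequency count. The `word_frequencies` helper and the sample sentence are illustrative assumptions; unlike `WordCloud`, which drops common stopwords by default, this sketch keeps every word.

```python
import re
from collections import Counter

def word_frequencies(comment, top_n=5):
    """Count lowercase word occurrences in one comment (hypothetical helper)."""
    words = re.findall(r"[a-z']+", comment.lower())
    return Counter(words).most_common(top_n)

sample = "Thanks for the edit. The edit looks fine, thanks!"
print(word_frequencies(sample, top_n=3))
```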

In [None]:
texts = train_df['comment_text'][0]
wordcloud = WordCloud().generate(texts)

# Display the generated image:
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
texts = train_df['comment_text'][2]
wordcloud = WordCloud().generate(texts)

# Display the generated image:
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

**Let's define a ROC AUC metric for our dataset**

In [None]:
def roc_auc(predictions, target):
    """Area under the ROC curve for a single label."""
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    return metrics.auc(fpr, tpr)
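
The competition scores submissions by the mean column-wise ROC AUC, so it can help to average the per-label AUC over all six columns. The following is a minimal sketch under that assumption; `mean_columnwise_auc` and the toy arrays are hypothetical, and it expects predicted probabilities rather than hard 0/1 predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_columnwise_auc(y_true, y_prob):
    """Average the ROC AUC computed separately for each label column (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    aucs = [roc_auc_score(y_true[:, j], y_prob[:, j]) for j in range(y_true.shape[1])]
    return float(np.mean(aucs))

# Two toy labels with four comments each; the values are made up for illustration
y_true = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_prob = np.array([[0.1, 0.4], [0.4, 0.8], [0.35, 0.3], [0.8, 0.9]])

score = mean_columnwise_auc(y_true, y_prob)
print(score)
```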

In [None]:
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    # tf.distribute.TPUStrategy is the non-experimental name in newer TensorFlow 2.x releases
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

**Let's compute the character length of each comment in the train data**

In [None]:
train_df['char_length'] = train_df['comment_text'].apply(lambda x : len(str(x)))

**Let's compute the character length of each comment in the test data**

In [None]:
test_df['char_length'] = test_df['comment_text'].apply(lambda x : len(str(x)))
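
To see how these lengths are distributed, a summary of the quantiles is a quick first step. The sketch below uses three made-up comments in place of the real `comment_text` column, so the numbers are for illustration only.

```python
import pandas as pd

# Toy stand-in for the comment_text column
comments = pd.Series(["short", "a somewhat longer comment", "x" * 120])

char_length = comments.str.len()    # same idea as apply(lambda x: len(str(x)))
summary = char_length.describe()    # count, mean, std, min, quartiles, max
print(summary[['min', '50%', 'max']])
```

On the real data, `train_df['char_length'].describe()` gives the same summary, and a histogram such as `px.histogram(train_df, x='char_length')` would show the full shape.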

# Cleaning comments

In [None]:
import re

def clean_text(texts):
    """Lowercase a comment, expand common contractions and strip non-word characters."""
    texts = texts.lower()
    texts = re.sub(r"what's", "what is ", texts)
    # handle 'scuse before the generic 's rule, which would otherwise consume it
    texts = re.sub(r"\'scuse", " excuse ", texts)
    texts = re.sub(r"\'s", " ", texts)
    texts = re.sub(r"\'ve", " have ", texts)
    texts = re.sub(r"can't", "cannot ", texts)
    texts = re.sub(r"n't", " not ", texts)
    texts = re.sub(r"i'm", "i am ", texts)
    texts = re.sub(r"\'re", " are ", texts)
    texts = re.sub(r"\'d", " would ", texts)
    texts = re.sub(r"\'ll", " will ", texts)
    # raw strings avoid invalid-escape warnings for \W and \s
    texts = re.sub(r'\W', ' ', texts)
    texts = re.sub(r'\s+', ' ', texts)
    texts = texts.strip(' ')
    return texts

In [None]:
# clean the comment_text in train_df [Thanks to Pulkit Jha for the useful pointer.]
train_df['comment_text'] = train_df['comment_text'].map(lambda com : clean_text(com))
# clean the comment_text in test_df [Thanks, Pulkit Jha.]
test_df['comment_text'] = test_df['comment_text'].map(lambda com : clean_text(com))

In [None]:
train_df = train_df.drop('char_length',axis=1)
x = train_df.comment_text
x_test = test_df.comment_text

In [None]:
x.shape

In [None]:
x_test.shape

# Tokenization
**We will use scikit-learn's `TfidfVectorizer` to turn the comments from the dataset into TF-IDF feature vectors**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


vect = TfidfVectorizer(max_features=5000,stop_words='english')
vect
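
To make the vectorizer's output more concrete, here is a minimal sketch on a three-document toy corpus (the corpus and the `toy_vect` name are assumptions for illustration). Each row of the result is one document, each column one vocabulary term, and with the default settings every row is scaled to unit length. Terms that appear in fewer documents get a larger weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["toxic comment here", "nice comment", "nice edit"]

toy_vect = TfidfVectorizer()            # defaults only; no max_features or stop words
matrix = toy_vect.fit_transform(corpus)  # sparse matrix of shape (documents, terms)

vocab = toy_vect.vocabulary_             # maps each term to its column index
row0 = matrix.toarray()[0]
print(matrix.shape)
print(sorted(vocab))
print(row0[vocab['toxic']] > row0[vocab['comment']])  # rarer term weighs more
```

In the notebook's own vectorizer, `max_features=5000` additionally limits the vocabulary to the 5000 most frequent terms.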

In [None]:
x_train = vect.fit_transform(x)

In [None]:
x_test = vect.transform(x_test)

In [None]:
cols = ['obscene','insult','toxic','severe_toxic','identity_hate','threat']

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

rfc = RandomForestClassifier()

submission_binary = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip')

# Binary relevance: fit one independent classifier per label column
for label in cols:
    print('Started with {}'.format(label))
    y = train_df[label]
    rfc.fit(x_train, y)

    # These predictions are made on the training data, so this is training accuracy
    y_preds = rfc.predict(x_train)
    print('Training accuracy is {}'.format(accuracy_score(y, y_preds)))
    # compute the predicted probabilities for x_test
    y_prob = rfc.predict_proba(x_test)[:, 1]
    submission_binary[label] = y_prob
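
The loop above scores the model on the same rows it was fitted on, so the printed accuracy is optimistic. A hold-out split gives a fairer estimate. The sketch below shows the idea on synthetic data from `make_classification`; the data, the split size and the `n_estimators` value are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one (features, label) pair
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
val_acc = accuracy_score(y_val, model.predict(X_val))
# ROC AUC is computed from predicted probabilities, not hard 0/1 predictions
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print('train accuracy: {:.3f}'.format(train_acc))
print('validation accuracy: {:.3f}'.format(val_acc))
print('validation ROC AUC: {:.3f}'.format(val_auc))
```

With heavily imbalanced labels such as `threat`, the validation ROC AUC is usually more informative than accuracy.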

In [None]:
# y_preds and y come from the last loop iteration ('threat'), so this AUC covers that label only
rc = roc_auc(y_preds, y)
rc

**Let's see the Classification Report**

In [None]:
# classification_report expects the true labels first, then the predictions
cf = classification_report(y, y_preds)
print('The Classification Report {} \n '.format(cf))

In [None]:
submission_binary.head()

# Submission

In [None]:
submission_binary.to_csv('submission_binary.csv', index=False)
print('Submission file is successfully created!!')
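
Before uploading, it can be worth checking that the file has the expected columns and that every prediction is a probability. This is a minimal sketch: `check_submission` is a hypothetical helper, and the column order is assumed to match `sample_submission.csv`.

```python
import pandas as pd

# Assumed column layout of sample_submission.csv
EXPECTED_COLS = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def check_submission(sub):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(sub.columns) != EXPECTED_COLS:
        problems.append('unexpected columns')
        return problems
    probs = sub[EXPECTED_COLS[1:]]
    if probs.isnull().any().any():
        problems.append('missing values')
    if ((probs < 0) | (probs > 1)).any().any():
        problems.append('values outside [0, 1]')
    return problems

# Toy rows; the values are made up for illustration
good = pd.DataFrame([['a', 0.1, 0.0, 0.2, 0.0, 0.3, 0.0]], columns=EXPECTED_COLS)
bad = good.copy()
bad.loc[0, 'toxic'] = 1.5

print(check_submission(good))
print(check_submission(bad))
```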

**TO DO Neural Networks part**