**In this NLP tutorial, I will explore the Pipeline capabilities of scikit-learn to identify toxic messages in this dataset.**   <br>

Note: I will revisit this notebook later with amendments.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec
import gc
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import log_loss,confusion_matrix,classification_report,roc_curve,auc
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from scipy import sparse
%matplotlib inline
seed = 2390

Import the train and test datasets

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()

What kind of messages will we be dealing with?
               ----
 > Check a few messages

In [None]:
for message_no, message in enumerate(train['comment_text'][:10]):
    print(message_no, message)
    print('\n')

In [None]:
'''
Messtype1 = train[['comment_text','toxic']]
Messtype2 = train[['comment_text','severe_toxic']]
Messtype3 = train[['comment_text','obscene']]
Messtype4 = train[['comment_text','threat']]
Messtype5 = train[['comment_text','insult']]
Messtype6 = train[['comment_text','identity_hate']]
''';

We have six classification labels in total, and a message can carry more than one at the same time (for example, toxic and severe_toxic, or toxic and insult). To tell the message types apart, each label has its own separate column: <br>
1.  toxic <br>
2. severe_toxic <br>
3. obscene <br>
4. threat <br>
5. insult <br>
6. identity_hate
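
Because one comment can carry several labels, it can help to count how many labels each comment has. The following is only a minimal sketch on a hypothetical toy frame (`toy`, `labels_per_comment` and `multi_label_share` are illustrative names, not part of the dataset); on the real data, `train` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data with the same six label columns as train.csv.
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],  # clean comment
        [1, 0, 0, 0, 0, 0],  # toxic only
        [1, 0, 1, 0, 1, 0],  # toxic, obscene and insulting at once
        [1, 1, 1, 0, 1, 0],  # four labels at once
    ],
    columns=label_cols,
)

# Number of labels attached to each comment.
labels_per_comment = toy[label_cols].sum(axis=1)
print(labels_per_comment.value_counts().sort_index())

# Share of comments that carry more than one label.
multi_label_share = (labels_per_comment > 1).mean()
print(multi_label_share)
```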

In [None]:
cols= ['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate']

plt.figure(figsize=(14,8))
gs = gridspec.GridSpec(2,3)
for i, cn in enumerate(cols):
    ax = plt.subplot(gs[i])
    sns.countplot(y = cn , data = train)
    ax.set_xlabel('')
    ax.set_title(str(cn))
    ax.set_ylabel(' ')

Let's check the correlation between the message types

In [None]:
plt.figure(figsize=(14,6))
sns.heatmap(train[['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate']].corr(),annot=True, fmt = ".2f", cmap = "coolwarm");

1) Insult and obscene remarks are strongly correlated here <br>
2) Next are toxic and obscene remarks <br>
3) Then toxic and insulting remarks

Let's start building our prediction model on the train dataset to identify toxic remarks
             ----
> You can come up with an innovative way to reuse the model below to identify the rest of the message types, which are 'severe_toxic', 'obscene', 'threat', 'insult' and 'identity_hate'
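
One possible way to reuse the same model for the other message types is to fit a separate copy of the pipeline for each label column. This is only a sketch under assumptions: the toy data, the `base_pipeline` and `models` names, and the default `CountVectorizer` tokenizer are illustrative choices, not the setup used later in this notebook.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook this would be the `train` DataFrame.
toy = pd.DataFrame({
    'comment_text': [
        'you are an idiot', 'have a nice day', 'stupid idiot go away',
        'thanks for the help', 'i will hurt you', 'great article, thanks',
    ],
    'toxic':  [1, 0, 1, 0, 1, 0],
    'insult': [1, 0, 1, 0, 0, 0],
})

base_pipeline = Pipeline([
    ('bow', CountVectorizer()),      # default tokenizer, for brevity
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Fit one independent, unfitted copy of the pipeline per label column.
models = {}
for label in ['toxic', 'insult']:
    model = clone(base_pipeline)
    model.fit(toy['comment_text'], toy[label])
    models[label] = model

# Probability of each label for unseen comments (column 1 is the positive class).
new_comments = ['what an idiot', 'nice article']
probabilities = pd.DataFrame(
    {label: model.predict_proba(new_comments)[:, 1] for label, model in models.items()},
    index=new_comments,
)
print(probabilities)
```

On the real data the loop would run over all six label columns, giving one probability column per message type.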

In [None]:
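# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
Messtype1 = train[['comment_text','toxic']].copy()
#Messtype1['length'] = Messtype1['comment_text'].apply(len)
# Length of each comment in words (whitespace-separated tokens)
Messtype1['length'] = Messtype1['comment_text'].str.split().apply(len)
Messtype1.head()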

In [None]:
Messtype1['length'].plot(bins=50, kind='hist');

We have some really long messages here, running to more than a thousand words by this word count. Let's quickly check.

In [None]:
Messtype1.length.describe()

In [None]:
Messtype1[Messtype1['length'] == 1411]['comment_text'].iloc[0]

Whoever wrote the above text is in urgent need of medical help. :D
              --

In [None]:
Messtype1.hist(column='length', by='toxic', bins=50,figsize=(12,4));

Nothing interesting shows up in the above graph. The length of a comment is not of major help in identifying toxic comments in this case.

Let's build a function for text pre-processing that performs the activities below
               ---
 
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return a list of the cleaned text
    

In [None]:
# Build the stopword set once; calling stopwords.words('english') for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def text_process(mess):
    # Keep only the characters that are not punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords (case-insensitive match)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]

Let's check whether the above function is working properly
               -----

In [None]:
Messtype1['comment_text'].head(5).apply(text_process)

In [None]:
Messtype1.head()

Compare the two outputs above; the differences confirm that our function is working properly
     ----

Customize the small, rough model below on your desktop to train on the full dataset, then use it on the test dataset for prediction.
                   ---

In [None]:
from sklearn.model_selection import train_test_split

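# Fix random_state (seed is defined in the imports cell) so the split is reproducible
msg_train, msg_test, label_train, label_test = \
train_test_split(Messtype1['comment_text'], Messtype1['toxic'], test_size=0.2, random_state=seed)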

print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))
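
If the toxic class turns out to be a small minority (the count plots earlier are the place to check), a plain random split can leave the two parts with different class ratios. A minimal sketch of a stratified split follows; the toy frame and the 20% positive rate are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 toxic comments out of 20.
toy = pd.DataFrame({
    'comment_text': ['comment %d' % i for i in range(20)],
    'toxic': [1] * 4 + [0] * 16,
})

# stratify keeps the share of toxic comments the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy['comment_text'], toy['toxic'],
    test_size=0.25, stratify=toy['toxic'], random_state=2390,
)
print(y_train.mean(), y_test.mean())
```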

scikit-learn's Pipeline capabilities, combined with the function we just built above
          ----

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
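
To see what the first two pipeline steps produce, here is a small sketch on a hypothetical two-document corpus. It uses the default `CountVectorizer` tokenizer rather than `text_process`, purely to stay self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical two-document corpus.
corpus = ['good article', 'bad bad article']

# Step 1: strings to token counts (one row per document, one column per word).
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(sorted(bow.vocabulary_, key=bow.vocabulary_.get))  # words in column order
print(counts.toarray())

# Step 2: counts to TF-IDF weights; each row is L2-normalised by default.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)
print(weights.toarray().round(2))
```

Words that occur in every document (here `article`) receive a lower IDF weight than words that occur in only one.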

In [None]:
pipeline.fit(msg_train,label_train)
predictions = pipeline.predict(msg_test)

Classification report to identify Toxic comments
         ---

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(label_test, predictions))
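
The imports at the top also bring in `log_loss`, `roc_curve` and `auc`, which score predicted probabilities rather than hard labels. A minimal sketch on made-up numbers follows; in the notebook, `pipeline.predict_proba(msg_test)[:, 1]` would play the role of the hypothetical `y_prob`.

```python
from sklearn.metrics import auc, log_loss, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)

# Log loss penalises confident wrong probabilities.
loss = log_loss(y_true, y_prob)
print(roc_auc, loss)
```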

We can use different classifiers with Pipeline here and tune them for even better results.
             ----
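
As one illustration of swapping and tuning a classifier, the sketch below replaces Naive Bayes with logistic regression and searches a small parameter grid. The toy data, the choice of `LogisticRegression`, and the grid values are all assumptions made for the example, not tuned recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four toxic and four clean comments.
texts = [
    'you are an idiot', 'stupid idiot go away', 'shut up you fool', 'what a stupid fool',
    'have a nice day', 'thanks for the help', 'great article thanks', 'nice work on the article',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', LogisticRegression(max_iter=1000)),  # swapped-in classifier
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'tfidf__use_idf': [True, False],
    'classifier__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=2, scoring='roc_auc')
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```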
 