# Welcome.
This is my first notebook on NLP. I have chosen the SMS Spam Collection, a "Hello World" dataset for NLP.
The dataset is from Kaggle and can be found [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset). 
The original dataset is published by the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
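
To make the two-column layout concrete, here is a minimal sketch that parses a tiny in-memory sample with Python's standard `csv` module. The two sample rows are invented for illustration, and the `v1,v2` header row is an assumption about the file layout; the notebook itself reads the real file with pandas further down.

```python
import csv
import io

# Hypothetical sample in the assumed layout: a header row, then label and raw text.
sample = io.StringIO(
    'v1,v2\n'
    'ham,"Ok, see you at lunch"\n'
    'spam,"WINNER! Claim your free prize now"\n'
)

rows = list(csv.DictReader(sample))
labels = [row["v1"] for row in rows]   # 'ham' or 'spam'
texts = [row["v2"] for row in rows]    # raw message text

print(labels)
print(texts[0])
```

Quoting the text column matters because messages can themselves contain commas.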

Our aim is to build a machine learning model that can identify spam messages. This is similar to how Gmail filters spam emails out of your inbox.

# Agenda
* Reading the data
* Exploring the data
* Cleaning the data
* Vectorizing the data
* Comparing classifiers with the PyCaret AutoML library
* Fitting logistic regression, multinomial Naive Bayes and LightGBM models
* Fitting an LSTM neural network
* Comparing results

First we will import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)  # make sure the stopword list is available locally
import string

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

# Reading the data
We will read the data into a dataframe named `messages` using the pandas `read_csv` function.

In [None]:
messages = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')
messages

We only need the columns v1 and v2. We will select them from the dataframe and rename them.

In [None]:
messages = messages[['v1', 'v2']]
messages.columns = ['label', 'message']
messages

# Exploring the data



We first check for missing values.

In [None]:
messages.isna().sum()

We see that there are no missing values in our data.
We now check how many ham and spam messages we have in our dataset.

In [None]:
messages.groupby('label').count()

In [None]:
sns.countplot(x='label', data=messages)  # pass the column by name; recent seaborn versions expect keyword arguments
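
The class counts can also be expressed as shares, which is worth keeping in mind when reading accuracy figures later on. The sketch below uses made-up labels, not the real counts from this dataset, and shows the accuracy that a model reaches by always predicting the majority class.

```python
from collections import Counter

# Hypothetical labels; the real ones come from messages['label'].
labels = ["ham"] * 8 + ["spam"] * 2

counts = Counter(labels)
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the majority class already reaches this accuracy.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)
print(majority_label, baseline_accuracy)
```

This is one reason the notebook also reports precision, recall and AUC rather than accuracy alone.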

We see that the data is imbalanced, as we would expect: there are far fewer spam messages than ham messages.
Let's see how many words we have in each message.

In [None]:
message_len = messages['message'].apply(lambda x: len(x.split()))
message_len.describe()

In [None]:
sns.histplot(message_len, kde=True)  # distplot is deprecated in recent seaborn versions

We see that most of our messages have fewer than 50 words, with a mean of about 15 words.

# Cleaning the data


Let us first take an example of the 6th message. 

In [None]:
messages['message'][5]

We will convert all the words to lowercase so that our algorithm treats 'Hello', 'hello' and 'HELLO' as the same word.
The messages also contain numbers and punctuation. We remove all of these non-letter characters because we do not need them for our analysis.

In [None]:
review = messages['message'][5].lower()
review = re.sub('[^a-z]', ' ', review)
review

Now we will remove all the stopwords (words like 'a', 'there', 'to', 'for' etc) which do not help us in our analysis. You can read more about stopwords [here](https://en.wikipedia.org/wiki/Stop_word).
We also apply Stemming. This is the process of converting each word into its root word. You can read more about it [here](https://en.wikipedia.org/wiki/Stemming).

In [None]:
stop_words = set(stopwords.words('english'))  # build the lookup set once instead of once per word
review = review.split()
review = [ps.stem(word) for word in review if word not in stop_words]
review = ' '.join(review)
review
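
The cleaning steps so far can be collected into one small helper. This is a simplified sketch: the stopword set is a tiny hand-picked stand-in for NLTK's English list, the function name `clean_text` is made up for illustration, and stemming is left out so that the example runs without NLTK.

```python
import re

# Tiny hand-picked stopword set for illustration only; NLTK's English list is much longer.
STOP_WORDS = {"a", "the", "to", "for", "is", "you", "there"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, replace non-letters with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)

print(clean_text("Hello!! There is a FREE prize for you, call 0800 now"))
```

Note that the digits in `0800` disappear entirely, so any signal carried by phone numbers or prices is lost at this step.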

Now the message looks clean. We will now apply these steps to all of our messages and collect the cleaned messages in a list named `corpus`.

In [None]:
stop_words = set(stopwords.words('english'))  # build the lookup set once instead of once per word
corpus = []
for text in messages['message']:
    review = text.lower()                    # normalise case
    review = re.sub('[^a-z]', ' ', review)   # keep letters only
    review = review.split()
    # drop stopwords, then stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

corpus

# Vectorizing the data
We now convert the text data into numeric data that the machine can understand. We will use Count Vectorizer here, keeping a maximum of 2500 feature words. You can read more about count vectorizer [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()
X
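
To show what a count vectorizer does under the hood, here is a plain-Python sketch of a bag-of-words matrix with a feature cap. It is a simplification and not scikit-learn's implementation: the helper names are made up, tokens are split on whitespace only, and ties in word frequency are broken alphabetically.

```python
from collections import Counter

def fit_vocabulary(docs, max_features=None):
    """Keep the max_features most frequent words and index them alphabetically."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    kept = sorted(word for word, _ in ranked[:max_features])
    return {word: index for index, word in enumerate(kept)}

def transform(docs, vocabulary):
    """Turn each document into a row of word counts, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

toy_corpus = ["free prize call", "call later", "free free entri"]
vocab = fit_vocabulary(toy_corpus, max_features=3)
matrix = transform(toy_corpus, vocab)

print(vocab)
print(matrix)
```

Words that do not make the cap (here `prize` and `later`) are simply dropped, which is what `max_features=2500` does on the real corpus.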

We also convert labels into binary numbers. 

In [None]:
messages['label']

In [None]:
y = pd.get_dummies(messages['label']).iloc[:, 1].astype(int).values  # second dummy column is 'spam': 1 = spam, 0 = ham

In [None]:
y

As we can see 0 is assigned to 'ham' and 1 is assigned to 'spam' messages.

# Modelling

To start, I like to use an AutoML library to compare how the available classifiers perform on the data. I used the PyCaret library for this purpose. PyCaret is an open-source, low-code machine learning library that compares models across a range of metrics. You can read more and learn to apply PyCaret [here](https://pycaret.org).
Let's install PyCaret, set up our data and start comparing models.

In [None]:
!pip install pycaret

In [None]:
from pycaret.classification import *

In [None]:
data = pd.DataFrame(X)
data['label'] = y
data.head()

In [None]:
clf = setup(data = data, target = 'label')

In [None]:
compare_models()

We see that Logistic Regression and LightGBM outperform the rest of the models, so we fit both of them manually on our data. We will also manually apply the multinomial Naive Bayes classifier, which has proven to perform well on NLP problems.

We split the data into training and testing set in the ratio 80:20.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 420)
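
Conceptually, an 80:20 split shuffles the row indices and cuts off a fifth of them for testing. The sketch below illustrates the idea in plain Python; `split_indices` is a made-up helper, and its shuffle will not reproduce the exact rows that `train_test_split` selects for the same seed.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=420):
    """Shuffle the indices 0..n_samples-1 and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # a fixed seed makes the split repeatable
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=420` above: rerunning the notebook yields the same split.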

We tune the hyperparameters of Logistic Regression model using Grid search cross validation. 

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
param_grid = {'C': np.logspace(-4, 4, 20), 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
grid = GridSearchCV(LR,param_grid,refit=True,verbose=3, scoring = 'roc_auc')
grid.fit(x_train,y_train)
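
It helps to know how much work this grid search implies. The sketch below rebuilds the same grid in plain Python and counts the candidates. It assumes the default of 5 cross-validation folds, which `GridSearchCV` uses when `cv` is not specified.

```python
from itertools import product

# Same spacing as np.logspace(-4, 4, 20): 20 values from 1e-4 to 1e4.
c_values = [10 ** (-4 + 8 * i / 19) for i in range(20)]
penalties = ["l1", "l2"]
solvers = ["liblinear"]

candidates = [
    {"C": c, "penalty": p, "solver": s}
    for c, p, s in product(c_values, penalties, solvers)
]

n_folds = 5  # assumed: GridSearchCV's default when cv is not given
print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # number of model fits during the search
```

With `refit=True`, one more fit follows on the whole training set using the best combination.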

We make predictions on the test dataset using the best hyperparameters obtained from the Grid search.

In [None]:
pred_LR = grid.predict(x_test)
proba_LR = grid.predict_proba(x_test)

Now we see how our model performed. We compare our predictions to the actual labels in the test set. 

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
print(classification_report(y_test,pred_LR))
print(confusion_matrix(y_test,pred_LR))
print('AUC score is: {}'.format(roc_auc_score(y_test, pred_LR)))
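
The numbers in the classification report all derive from the four cells of the confusion matrix. The following plain-Python sketch, using invented labels and a made-up helper name, shows how precision, recall and F1 for the spam class are computed from them. The matrix uses the same layout that `confusion_matrix` prints: rows are actual classes, columns are predicted classes.

```python
def spam_metrics(y_true, y_pred):
    """Confusion matrix plus precision, recall and F1 for the positive (spam = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham left alone
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"confusion": [[tn, fp], [fn, tp]],
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions, not results from this notebook.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(spam_metrics(y_true_toy, y_pred_toy))
```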

In [None]:
from sklearn.metrics import roc_curve
# roc_curve needs continuous scores, so we pass the predicted spam probability
fpr, tpr, thresholds = roc_curve(y_test, proba_LR[:, 1])
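
One caveat about the AUC values printed in this notebook: `roc_auc_score` is called with hard 0/1 predictions, and that usually gives a different number than calling it with predicted probabilities such as `proba_LR[:, 1]`. The plain-Python sketch below illustrates the difference on invented data. It computes AUC as the fraction of (spam, ham) pairs in which the spam message gets the higher score, counting ties as half.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

y_true_toy = [0, 0, 0, 1, 1]
proba_toy = [0.1, 0.4, 0.6, 0.45, 0.9]               # hypothetical spam probabilities
hard_toy = [1 if p >= 0.5 else 0 for p in proba_toy]  # thresholded at 0.5

print(roc_auc(y_true_toy, proba_toy))  # uses the full ranking
print(roc_auc(y_true_toy, hard_toy))   # only sees 0/1 decisions
```

Thresholding throws away the ranking information within each predicted class, so the two values generally differ.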


Now let's try the Multinomial Naive Bayes model on our data. We keep its default settings (including the smoothing parameter `alpha`), so we can fit the model directly to our data.

In [None]:
from sklearn.naive_bayes import MultinomialNB
model_NB = MultinomialNB().fit(x_train, y_train)
pred_NB = model_NB.predict(x_test)
proba_NB = model_NB.predict_proba(x_test)
print(classification_report(y_test,pred_NB))
print(confusion_matrix(y_test,pred_NB))
print('AUC score is: {}'.format(roc_auc_score(y_test, pred_NB)))
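
For intuition, here is a compact plain-Python sketch of how a multinomial Naive Bayes classifier works: it counts words per class, applies add-one (Laplace) smoothing, and picks the class with the highest log-probability. The helper names and the four toy messages are invented, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {"vocab": set(vocab), "log_prior": {}, "log_likelihood": {}}
    for label in class_counts:
        model["log_prior"][label] = math.log(class_counts[label] / len(labels))
        # alpha is added to every word count so unseen words never get probability 0
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total) for word in vocab
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for label, prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # Words outside the training vocabulary are ignored.
        scores[label] = prior + sum(
            likelihood[word] for word in doc.split() if word in model["vocab"]
        )
    return max(scores, key=scores.get)

toy_docs = ["free prize call", "win free cash", "see you later", "call me later"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "free cash prize"))
print(predict_nb(toy_model, "see you"))
```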

For the LGBM classifier, we use the parameters provided by our pycaret model.  

In [None]:
from lightgbm import LGBMClassifier

# `silent=True` from the PyCaret output is left out because it is deprecated in newer LightGBM releases
model_LGBM = LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=420, reg_alpha=0.0, reg_lambda=0.0,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0).fit(x_train, y_train)
pred_LGBM = model_LGBM.predict(x_test)
proba_LGBM = model_LGBM.predict_proba(x_test)
print(classification_report(y_test,pred_LGBM))
print(confusion_matrix(y_test,pred_LGBM))
print('AUC score is: {}'.format(roc_auc_score(y_test, pred_LGBM)))

We now fit an LSTM neural network to our corpus, using techniques like embedding and padding.

In [None]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
voc_size = 10000
onehot_repr=[one_hot(words,voc_size)for words in corpus] 

In [None]:
sent_length=60
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
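
The two steps above, hashing words into integer ids and padding every message to a fixed length, can be sketched in plain Python as follows. The MD5-based hash is a stand-in chosen so that the example is reproducible, not the function Keras uses. The padding helper mirrors `padding='pre'` together with the default behaviour of truncating long sequences from the front.

```python
import hashlib

def hash_encode(text, vocab_size):
    """Map each word to an integer in 1..vocab_size-1 (0 is reserved for padding)."""
    encoded = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        encoded.append(int(digest, 16) % (vocab_size - 1) + 1)
    return encoded

def pad_pre(sequence, maxlen):
    """Left-pad with zeros; if the sequence is too long, keep only the last maxlen entries."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

encoded = hash_encode("free prize call", vocab_size=10000)
padded = pad_pre(encoded, maxlen=6)

print(encoded)
print(padded)
```

Because words are hashed rather than looked up in a dictionary, two different words can land on the same id. A larger `voc_size` makes such collisions less likely.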

In [None]:
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

In [None]:
X_final=np.array(embedded_docs)
y_final=np.array(pd.get_dummies(messages['label']).iloc[:,1]).astype(int)  # 1 = spam, 0 = ham

In [None]:
X_final.shape,y_final.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.2, random_state=420)

In [None]:
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=1000, callbacks=[early_stop])
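
The `patience=25` setting means that training stops once the validation loss has gone 25 consecutive epochs without improving on its best value. A simplified sketch of that logic, with invented loss values and a made-up helper name, is shown below. The real Keras callback offers further options such as `min_delta` and `restore_best_weights`.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best validation loss: reset the counter
            waited = 0
        else:
            waited += 1      # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

toy_losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44]  # hypothetical validation losses
print(stopping_epoch(toy_losses, patience=3))
print(stopping_epoch(toy_losses, patience=25))
```

With a large patience and `epochs=1000`, the model may keep training well past its best epoch before it stops.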
          
          

In [None]:
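# predict_classes was removed from recent TensorFlow releases; threshold the sigmoid output instead
pred_LSTM = (model.predict(X_test) > 0.5).astype('int32').ravel()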
print(classification_report(y_test,pred_LSTM))
print(confusion_matrix(y_test,pred_LSTM))
print('AUC score is: {}'.format(roc_auc_score(y_test, pred_LSTM)))

# Comparing the models:
According to the results, the Naive Bayes model performed best at classifying spam messages. 