This is a project to classify spam and ham messages using an LSTM. It is a basic, simple model that takes you through the detailed steps of approaching an NLP problem.

**Steps involved in the Project:**

*  **Data Cleaning:**
1. Removing unwanted columns.
2. Exploring & comparing length of messages.
3. Performing undersampling on dataset.

* **Text preparation:**
1. Tokenization of Messages.
2. One-hot encoding (`one_hot`) of the tokenized messages (corpus).
3. Padding the encoded sequences to a fixed length for word embedding.

* **Data preparation/Data Splitting:**
1. Split the data into training+validation(85%) & testing(15%) data.
2. Further split the training+validation data into training(85%) and validation(15%) data.

* **Model Building:**
1. Build a Sequential model: Embedding Layer->LSTM->Dense(output layer)
2. Fit and validate the model on the training and validation data.

* **Model Evaluation:**
1. Evaluate the model on test dataset.
2. Get the model accuracy score and visualize confusion matrix


* **Model Testing:**
1. Create a function that classifies messages using the model.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Import seaborn and matplotlib for visualization

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Read the .csv file and display the first entries of the data

In [None]:
data=pd.read_csv("../input/sms-spam-collection-dataset/spam.csv",encoding="latin")
data.head()

The dataset contains 5 columns of which 3 Unnamed columns are not important for us

In [None]:
data.columns

Drop these three columns

In [None]:
data=data.drop(columns=["Unnamed: 2","Unnamed: 3","Unnamed: 4"])

The column names **v1** & **v2** are not meaningful, so rename them to **Category** & **Message**.

In [None]:
data=data.rename(
{
    "v1":"Category",
    "v2":"Message"
},
    axis=1
)

Display head of the new dataset

In [None]:
data.head()

Check if the dataset contains any **null** values. Luckily, the dataset has no null values.

In [None]:
data.isnull().sum()

In [None]:
data.info()

Create new column called **Message Length** that would compute the message lengths

In [None]:
data["Message Length"]=data["Message"].apply(len)

**Visualize** the Messages length using **histogram**

It is evident from the plot that **spam** messages are on average **lengthier** than the **non-spam (ham)** messages.

In [None]:
fig=plt.figure(figsize=(12,8))
sns.histplot(
    x=data["Message Length"],
    hue=data["Category"]
)
plt.title("ham & spam message length comparison")
plt.show()

Display the **description** of the length of **ham** and **spam** messages separately, each as its own series.

From the two descriptions we can see that the longest message (910 characters) is a ham message. However, more than 70% of the ham messages are shorter than 90 characters.

On the other hand, 75% of the spam messages are longer than 130 characters. Hence we can conclude that spam messages are usually lengthier.

In [None]:
ham_desc=data[data["Category"]=="ham"]["Message Length"].describe()
spam_desc=data[data["Category"]=="spam"]["Message Length"].describe()

print("Ham Message Length Description:\n",ham_desc)
print("************************************")
print("Spam Message Length Description:\n",spam_desc)

From the overall data description we can see that the label **Category** contains 2 unique values; it is therefore a categorical variable, and this is a binary classification problem.

In [None]:
data.describe(include="all")

The two unique values are **ham** & **spam**: **ham** has **4825** entries and **spam** has **747**, which is a vast difference.

In [None]:
data["Category"].value_counts()

Visualizing the **Category** using countplot shows our **spam** messages are relatively less compared to the **ham** messages in the dataset

In [None]:
sns.countplot(
    data=data,
    x="Category"
)
plt.title("ham vs spam")
plt.show()

**Ham** makes up **86.6%** of the dataset while **spam** constitutes only **13.4%**, so we can conclude that the data is **imbalanced**.

**Imbalanced Data:** An imbalanced dataset is one with an unequal class distribution. Since ham makes up more than 86% of the dataset, a model can reach an accuracy score of about **86%** just by classifying every entry as ham. However, we don't just want a high accuracy score; we want a model that can **generalize** well.
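
To make the accuracy argument concrete, here is a minimal sketch that scores a hypothetical classifier which labels every message as ham. The class counts are the ones reported in this notebook; the "always ham" model itself is only an illustration, not something built here.

```python
# Class counts reported by data["Category"].value_counts() in this notebook
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# A hypothetical "model" that labels every message as ham
baseline_accuracy = ham_count / total   # every ham is right, every spam is wrong
spam_recall = 0 / spam_count            # no spam message is ever caught

print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"Spam recall: {spam_recall:.1%}")
```

The accuracy looks respectable even though the classifier never finds a single spam message, which is why accuracy alone is a poor guide on imbalanced data.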



In [None]:
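#look the counts up by label; integer positions on a label-indexed Series are deprecated in recent pandas
ham_count=data["Category"].value_counts()["ham"]
spam_count=data["Category"].value_counts()["spam"]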

total_count=data.shape[0]

print("Ham contains:{:.2f}% of total data.".format(ham_count/total_count*100))
print("Spam contains:{:.2f}% of total data.".format(spam_count/total_count*100))

Since our dataset is imbalanced I have used undersampling technique to make a balanced dataset.

**Undersampling**: It is a technique of obtaining an equivalent sample from the dataset by simply **deleting** some of the examples of **majority** class

In [None]:
#compute the length of majority & minority class
minority_len=len(data[data["Category"]=="spam"])
majority_len=len(data[data["Category"]=="ham"])

#store the indices of majority and minority class
minority_indices=data[data["Category"]=="spam"].index
majority_indices=data[data["Category"]=="ham"].index

#generate new majority indices from the total majority_indices
#with size equal to minority class length so we obtain equivalent number of indices length
random_majority_indices=np.random.choice(
    majority_indices,
    size=minority_len,
    replace=False
)

#concatenate the two indices to obtain indices of new dataframe
undersampled_indices=np.concatenate([minority_indices,random_majority_indices])

#create df using new indices
df=data.loc[undersampled_indices]

#shuffle the sample
df=df.sample(frac=1)

#reset the index as its all mixed
df=df.reset_index()

#drop the older index
df=df.drop(
    columns=["index"],
)
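
The same undersampling idea can be sketched in plain Python on a toy list of labels. The data below is made up, and the seeded `random.Random(42)` generator is my own addition so that the draw is repeatable; the cell above uses an unseeded `np.random.choice`, so it keeps a different set of ham rows on every run unless a seed is set first (for example with `np.random.seed`).

```python
import random

# Toy labels: 8 ham and 3 spam (made-up data, not the SMS dataset)
labels = ["ham"] * 8 + ["spam"] * 3

minority_idx = [i for i, lab in enumerate(labels) if lab == "spam"]
majority_idx = [i for i, lab in enumerate(labels) if lab == "ham"]

# A seeded generator makes the random draw repeatable
rng = random.Random(42)

# Draw as many majority rows as there are minority rows, without replacement
sampled_majority_idx = rng.sample(majority_idx, k=len(minority_idx))

# Combine the kept indices and shuffle them
undersampled_idx = minority_idx + sampled_majority_idx
rng.shuffle(undersampled_idx)

balanced_labels = [labels[i] for i in undersampled_idx]
print(balanced_labels.count("ham"), balanced_labels.count("spam"))
```

Both classes end up with the minority-class count, at the cost of discarding most of the majority-class examples.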


The resulting dataframe has **1494** rows and **3** columns (**Category**, **Message** and **Message Length**).

In [None]:
df.shape

Now we can see that the category contains **747** entries each for **ham** and **spam**, so we have a **balanced** dataset.

In [None]:
df["Category"].value_counts()

Both the **ham** and **spam** bars are now **equal**, confirming that we have obtained a **balanced** dataset.

In [None]:
sns.countplot(
    data=df,
    x="Category"
)
plt.title("ham vs spam")
plt.show()

Display the head of new **df**

In [None]:
df.head()

Create a new column **Label** and encode **ham** as **0** and **spam** as **1**

In [None]:
df["Label"]=df["Category"].map(
    {
        "ham":0,
        "spam":1
    }
)

Display the head to see the new column

In [None]:
df.head()

Import libraries to perform word **tokenization**

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer=PorterStemmer()

Perform word **tokenization** using the below block of code

In [None]:
#declare empty list to store tokenized message
corpus=[]

#build the English stopword set once instead of once per word
english_stopwords=set(stopwords.words("english"))

#iterate through the df["Message"]
for message in df["Message"]:
    
    #replace every special character, number etc. with whitespace
    #so that only letters are retained
    message=re.sub("[^a-zA-Z]"," ",message)
    
    #convert every letter to lowercase
    message=message.lower()
    
    #split the message into a list of individual words
    message=message.split()
    
    #drop English stopwords and stem the remaining words using PorterStemmer
    message=[stemmer.stem(word)
             for word in message
             if word not in english_stopwords
            ]
    #join the stemmed words with whitespace
    message=" ".join(message)
    
    #append the message to the corpus list
    corpus.append(message)
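
To see what the cleaning steps do to a single message, here is a simplified sketch that runs without nltk. The stopword set `TOY_STOPWORDS` and the helper `clean_message` are hypothetical names introduced only for this illustration, the sample message is made up, and stemming is left out, so the output is not identical to what the loop above produces.

```python
import re

# Tiny hand-written stopword set, for illustration only;
# the notebook uses nltk's full English list instead
TOY_STOPWORDS = {"a", "an", "the", "you", "have", "is", "to", "and", "now"}

def clean_message(message):
    # keep letters only, lowercase, and split on whitespace
    words = re.sub("[^a-zA-Z]", " ", message).lower().split()
    # drop stopwords (stemming is omitted in this sketch)
    return " ".join(word for word in words if word not in TOY_STOPWORDS)

cleaned = clean_message("WINNER!! You have won a £1000 prize, call 09061701461 now")
print(cleaned)  # winner won prize call
```

Note that digits and currency symbols disappear entirely, so any signal carried by phone numbers or prize amounts is lost at this stage.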

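Perform **one_hot** on the corpus

I have initialized the **vocabulary** size to **10,000**.

Despite its name, Keras `one_hot` does not produce one-hot vectors: it hashes each word to an integer index. **oneHot_doc** will therefore contain, for each message in the corpus, a list of word indices between **1 and 10,000**. Because hashing is used, two different words can occasionally share the same index.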

In [None]:
from tensorflow.keras.preprocessing.text import one_hot
vocab_size=10000

oneHot_doc=[one_hot(words,n=vocab_size)
           for words in corpus
           ]
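
The idea behind this kind of encoding (the hashing trick) can be sketched with the standard library. This is not the Keras implementation: `hashed_index` and `encode` are hypothetical helpers, and md5 is used here because it gives the same result on every run. To my understanding, `one_hot` relies on Python's built-in `hash` by default, which can differ between Python sessions for strings, so indices computed in one session may not match a model trained in another.

```python
import hashlib

def hashed_index(word, n):
    # md5 gives the same digest on every run, unlike Python's built-in hash()
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    # map into 1..n-1 so that 0 stays free for padding
    return digest % (n - 1) + 1

def encode(text, n):
    # one index per whitespace-separated word
    return [hashed_index(word, n) for word in text.split()]

toy_vocab_size = 50  # deliberately tiny, so collisions are easy to provoke
indices = encode("free prize call free", toy_vocab_size)
print(indices)
```

The first and last index are the same because both come from the word "free". With a small `n`, unrelated words will also start to collide, which is the trade-off for not having to build a vocabulary.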

After **one_hot** we will prepare the sequences for **word embedding**.

The lists produced by one_hot have uneven lengths because the tokenized messages in the corpus contain different numbers of words.

The Embedding layer needs inputs of a fixed length, so to choose a sentence length for our dataset we will look at the distribution of message lengths.

In [None]:
df["Message Length"].describe()

**Visualizing the message lengths using kdeplot.**

From the plot we can see that spam messages are usually lengthier than ham messages. Both classes are concentrated between lengths of 0 and 200, although a few messages are longer than 200. We will therefore take 200 as the common length for both classes. Note that **Message Length** counts characters while the padded sequences count words, so 200 is a generous bound for most messages.

In [None]:
fig=plt.figure(figsize=(12,8))
sns.kdeplot(
    x=df["Message Length"],
    hue=df["Category"]
)
plt.title("ham & spam message length comparison")
plt.show()

We will use **pad_sequences** from Keras to pad the encoded sequences; the **word embedding** itself is learned later by the model's Embedding layer.

This makes every list the same length (**sentence length**) so that it can later be fed to our model.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentence_len=200
embedded_doc=pad_sequences(
    oneHot_doc,
    maxlen=sentence_len,
    padding="pre"
)
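
What "pre" padding does to a single sequence can be sketched in a few lines of plain Python. `pad_pre` is a hypothetical helper written for this illustration, not the Keras function; it assumes short sequences get zeros in front and long sequences lose their earliest items, which matches `padding="pre"` together with what I believe is the default truncation side.

```python
def pad_pre(sequence, maxlen, value=0):
    # keep only the last maxlen items when the sequence is too long
    trimmed = sequence[-maxlen:]
    # add padding values in front until the length is exactly maxlen
    return [value] * (maxlen - len(trimmed)) + trimmed

short_padded = pad_pre([5, 8, 2], maxlen=5)
long_trimmed = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_padded)  # [0, 0, 5, 8, 2]
print(long_trimmed)  # [3, 4, 5, 6]
```

Padding at the front keeps the real words next to the end of the sequence, which is the last thing the LSTM reads before it produces its output.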

Let us make a data frame using embedded document, and a target using **Label** column from df

In [None]:
extract_features=pd.DataFrame(
    data=embedded_doc
)
target=df["Label"]

I have concatenated two dataframes to get the final dataframe

In [None]:
df_final=pd.concat([extract_features,target],axis=1)

The Resulting dataframe contains 201 columns where 200 are independent features and 1 is our target class.

In [None]:
df_final.head()

Split the dataframe into **dependent(y)** & **independent(X)** variables 

In [None]:
X=df_final.drop("Label",axis=1)
y=df_final["Label"]

We will now split the dataset into **training**, **validation** and **testing** sets.

In [None]:
from sklearn.model_selection import train_test_split

The code will split the whole data into two parts, **trainval** & **testing**.

The **trainval** part holds the data for the training and validation sets and constitutes 85% of the data; the remaining 15% is the test data.

This is done to prevent **data leakage**. If we use the test data for both validation and testing, choices made while validating are tuned to that same data, so the final test score is likely to be optimistic and can hide an **overfitted model**.

So I wanted to completely **isolate** the **test** data from the **train** and **validation** data, so that the test score is an honest measure of **generalization**.

In [None]:
X_trainval,X_test,y_trainval,y_test=train_test_split(
    X,
    y,
    random_state=42,
    test_size=0.15
)


We have set aside **85**% of the dataset for **training** and **validation**, so let's further split the **trainval** data into **training** (**85**%) and **validation** (**15**%) sets.

In [None]:
X_train,X_val,y_train,y_val=train_test_split(
    X_trainval,
    y_trainval,
    random_state=42,
    test_size=0.15
)
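
Because the second split is applied to the 85% that remains after the first, the final shares of the whole dataset are not 85/15. The arithmetic below works them out. The row counts are an estimate for the 1494-row balanced dataframe and assume the held-out part of each split is rounded up, which is how I understand `train_test_split` to behave; checking `X_train.shape`, `X_val.shape` and `X_test.shape` is the reliable way to confirm.

```python
import math

test_share = 0.15   # test_size used in the first split
val_share = 0.15    # test_size used in the second split

# Shares of the whole dataset
trainval_share = 1 - test_share
val_of_total = trainval_share * val_share
train_of_total = trainval_share * (1 - val_share)
print(f"train {train_of_total:.2%}, val {val_of_total:.2%}, test {test_share:.2%}")

# Estimated row counts, assuming the held-out part is rounded up
n_rows = 1494
n_test = math.ceil(test_share * n_rows)
n_trainval = n_rows - n_test
n_val = math.ceil(val_share * n_trainval)
n_train = n_trainval - n_val
print(n_train, n_val, n_test)
```

So roughly 72% of the balanced data is used for training, about 13% for validation and 15% for the final test.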

import libraries to create **model**

In [None]:
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential


Initialize the **Sequential** model

In [None]:
model=Sequential()

Create a model using **Embedding layer->LSTM->Dense**

**Embedding layer**: We set the input dimension to our vocabulary size (10,000), the input length to the sentence length and the output dimension to 100.

**LSTM**: Use **128** units; the layer's output is fed as input to the Dense output layer.

**Dense**: Use 1 unit (neuron) with a **sigmoid** activation, since we have a binary classification problem. For a multi-class classification problem, a **softmax** activation with **units=no. of classes** is the usual choice.

In [None]:
feature_num=100
model.add(
    Embedding(
        input_dim=vocab_size,
        output_dim=feature_num,
        input_length=sentence_len
    )
)
model.add(
    LSTM(
    units=128
    )
)

model.add(
    Dense(
        units=1,
        activation="sigmoid"
    )
)
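
As a sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates and one bias vector per gate; `model.summary()` is the authoritative source for the actual numbers.

```python
vocab_size = 10000   # input_dim of the Embedding layer
feature_num = 100    # output_dim of the Embedding layer
lstm_units = 128

# Embedding: one feature_num-long vector per vocabulary index
embedding_params = vocab_size * feature_num

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (feature_num + lstm_units) + lstm_units)

# Dense output: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost 90% of the parameters sit in the Embedding layer, which is worth keeping in mind given that the balanced dataset has only 1494 messages.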

Let's **compile** the model built above.

I have used the **Adam** optimizer with a **learning rate of 0.001** & **binary_crossentropy** as the loss function.

In [None]:
from tensorflow.keras.optimizers import Adam
model.compile(
    optimizer=Adam(
    learning_rate=0.001
    ),
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
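
For intuition about the loss, binary cross-entropy for a single message can be computed by hand. The function below is a plain-Python sketch of the textbook formula, not the Keras implementation (which also clips probabilities and averages over a batch); `p_spam` stands for the sigmoid output of the model.

```python
import math

def binary_crossentropy(y_true, p_spam):
    # y_true is 0 (ham) or 1 (spam); p_spam is the predicted spam probability
    return -(y_true * math.log(p_spam) + (1 - y_true) * math.log(1 - p_spam))

confident_right = binary_crossentropy(1, 0.9)   # spam predicted as likely spam
confident_wrong = binary_crossentropy(1, 0.1)   # spam predicted as likely ham
ham_right = binary_crossentropy(0, 0.1)         # ham predicted as likely ham

print(confident_right, confident_wrong, ham_right)
```

A confident correct prediction costs little, while a confident wrong one is penalized heavily, which is what pushes the sigmoid output toward the true label during training.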

Once the model is compiled we will **fit** it using the **train** and **validation** datasets.

In [None]:
model.fit(
    X_train,
    y_train,
    validation_data=(
        X_val,
        y_val
    ),
    epochs=10
)

Now that the model is fitted, it's time to see how it predicts the test data we isolated earlier.

The predictions are stored in a boolean array: a predicted value **greater than 0.5** is assigned **True (spam)**, otherwise it is **False (ham)**.

In [None]:
y_pred=model.predict(X_test)
y_pred=(y_pred>0.5)

Lets import **metrics** to **evaluate** our **model**

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix

The model predicts pretty well on the **test** data as evident from the accuracy score.

In [None]:
score=accuracy_score(y_test,y_pred)
print("Test Score:{:.2f}%".format(score*100))

Let's **visualize** our **confusion_matrix** using a **heatmap**

The model also generalizes well, since the numbers of **False Positives (FP)** and **False Negatives (FN)** are much smaller than the numbers of **True Positives (TP)** and **True Negatives (TN)**.

In [None]:
cm=confusion_matrix(y_test,y_pred)
fig=plt.figure(figsize=(12,8))
sns.heatmap(
    cm,
    annot=True,
    fmt="d", #show the counts as integers rather than in scientific notation
)
plt.title("Confusion Matrix")
cm
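
To read the matrix, it helps to know its layout: for labels 0 (ham) and 1 (spam), rows are the true class and columns are the predicted class, giving `[[TN, FP], [FN, TP]]`. The sketch below rebuilds that layout by hand on made-up labels and derives precision and recall for the spam class; `y_true_toy` and `y_hat_toy` are illustrative values, not results from this model.

```python
# Made-up labels for illustration: 1 = spam, 0 = ham
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat_toy = [1, 1, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true_toy, y_hat_toy))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham kept as ham
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught

# Same layout as the matrix above: rows = true class, columns = predicted class
toy_matrix = [[tn, fp], [fn, tp]]

precision = tp / (tp + fp)  # share of spam predictions that were really spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(toy_matrix, precision, recall)
```

For a spam filter, false positives (ham flagged as spam) are often the costlier mistake, so precision is worth reading alongside the overall accuracy.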

This is the final code (**function**) that takes a **raw message** and classifies it using the model.

In [None]:
#The function takes the model and a raw message as parameters
def classify_message(model,message):
    
    #apply the same cleaning that was used to build the corpus:
    #keep letters only, convert to lowercase and split into words
    words=re.sub("[^a-zA-Z]"," ",message).lower().split()
    
    #drop English stopwords and stem the remaining words
    english_stopwords=set(stopwords.words("english"))
    words=[stemmer.stem(word)
           for word in words
           if word not in english_stopwords
          ]
    cleaned=" ".join(words)
    
    #perform one_hot on the cleaned message
    oneHot=[one_hot(cleaned,n=vocab_size)]
    
    #pad the sequence to sentence_len using pad_sequences
    #so that it can be fed to our model
    text=pad_sequences(oneHot,maxlen=sentence_len,padding="pre")
    
    #predict the spam probability; predict returns an array of shape (1,1)
    predict=model.predict(text)[0][0]
    
    #if the predicted value is greater than 0.5 it is spam
    if predict>0.5:
        print("It is a spam")
    #else the message is not spam
    else:
        print("It is not a spam")

In [None]:
message1="I am having a bad day and I would like to have a break today"
message2="This is to inform you had won a lottery and the subscription will end in a week so call us."


The model predicts **message1** as **not a spam message**

In [None]:
classify_message(model,message1)

The model predicts **message2** as **spam message**

In [None]:
classify_message(model,message2)