# Spam Detection with NLTK (98.1% accuracy)

# Overview
1. Importing Libraries
2. Reading the Dataset
3. Exploratory Data Analysis (EDA)
     - Mapping Labels
     - Dropping Duplicates
     - Adding "length" column
     - Adding "contain" column
4. Data Preprocessing
     - Removing Punctuation & Digits
     - Tokenization & Lower Case
     - Removing Stopwords
     - Lemmatization 
     - Merging Tokens
     - Count Vectorization
     - TFIDF
5. Model Training
    - Multinomial Naive Bayes 
    - Decision Tree
    - Random Forest

# 1. Importing Libraries

In [None]:
#importing libraries
import pandas as pd
import seaborn as sns
import string 
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import warnings

warnings.filterwarnings("ignore")
nltk.download('stopwords')
nltk.download('wordnet')

# 2. Reading the Dataset
The dataset has 3 empty columns (Unnamed: 2, Unnamed: 3, Unnamed: 4). Dropping those columns.

Renaming v1 and v2 columns as 'label' and 'text' respectively.

In [None]:
#reading the dataset 
#dataset: https://www.kaggle.com/uciml/sms-spam-collection-data
msg=pd.read_csv('../input/sms-spam-collection-dataset/spam.csv',encoding='latin-1')
msg.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)
msg.rename(columns={'v1':'label','v2':'text'},inplace=True)
msg.head()

# 3. Exploratory Data Analysis (EDA)

### Mapping Labels
Mapping ham to 0 and spam to 1

In [None]:
#mapping ham=0 and spam=1
#vectorized mapping avoids chained assignment (msg['label'][i]=...), which pandas may silently ignore
msg['label']=msg['label'].map({'ham':0,'spam':1})
msg.head()

### Dropping Duplicates

In [None]:
#category count plot (count of spam and ham)
#passing the column as x= (recent seaborn versions no longer accept it positionally)
sns.countplot(x=msg['label'])

In [None]:
#data description grouped by labels 
msg.groupby('label').describe()

We have 4825 ham messages (4516 unique) and 747 spam messages (653 unique).

In [None]:
#dropping duplicate rows
msg=msg.drop_duplicates()
msg.groupby('label').describe()

### Adding "length" column

In [None]:
#adding length column to the dataset 
msg['length']=msg['text'].apply(len)
msg.head()

In [None]:
msg[msg.label==0].describe()

In [None]:
sns.distplot(a=msg[msg['label']==0].length,kde=False)

In [None]:
msg[msg.label==1].describe()

In [None]:
sns.distplot(a=msg[msg['label']==1].length,kde=False)

From the above outputs and graphs, we notice that
* Most ham messages are shorter than 100 characters (mean around 70)
* Most spam messages are around 150 characters long (mean around 132)

So, we have discovered that spam messages generally have more characters than ham messages.

### Adding "contain" column
Let us examine the spam messages and see if we can find any trends.

In [None]:
#examining spam texts
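#printing the spam texts among the first 50 rows
#(label-based msg['label'][i] can raise a KeyError once duplicates are dropped and the index has gaps)
first_rows=msg.head(50)
for text in first_rows.loc[first_rows['label']==1,'text']:
  print(text)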

We observe that spam texts are more likely to contain numbers (charges, phone numbers), emails, links, and symbols!

Let us add a column named "contain" denoting whether a text contains numbers, emails, links, or symbols!

In [None]:
#flagging texts that contain currency/percent symbols, links, emails, SMS keywords, or runs of 5+ digits
#one raw-string pattern: '$' is escaped, 'http' also covers 'https', and \d{5} also covers 10- and 11-digit numbers
pattern=r'£|%|€|\$|T&C|www|WWW|http|HTTP|@|email|Email|EMAIL|SMS|sms|FREEPHONE|\d{5}'
msg['contain']=msg['text'].str.contains(pattern,regex=True).astype(int)

msg.head()
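
The same check can be tried on individual strings with Python's re module. This is a minimal sketch: the has_spam_marker helper and the sample messages are made up for illustration, and the pattern simply joins the substrings checked above into one regular expression.

```python
import re

# One pattern covering the markers checked above; \d{5} also matches any longer digit run
SPAM_MARKERS = re.compile(
    r"£|%|€|\$|T&C|www|WWW|http|HTTP|@|email|Email|EMAIL|SMS|sms|FREEPHONE|\d{5}"
)

def has_spam_marker(text):
    """Return 1 if the text contains any marker, else 0 (hypothetical helper)."""
    return int(SPAM_MARKERS.search(text) is not None)

# Made-up sample messages
samples = [
    "WIN a £1000 prize, call 09061701461 now",  # currency symbol and long number
    "Visit www.example.com for T&C",            # link and T&C
    "See you at 5 tonight?",                    # only a single digit
]
flags = [has_spam_marker(s) for s in samples]
print(flags)
```

Note that a single short number does not set the flag; only runs of five or more digits do.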

In [None]:
sns.distplot(a=msg[msg['label']==0].contain,kde=False)

In [None]:
sns.distplot(a=msg[msg['label']==1].contain,kde=False)

The above graphs confirm our observation that spam texts have a high occurrence of numbers, emails, links, and symbols as compared to ham texts.

# 4. Data Preprocessing

The goal of text preprocessing is to convert the text into a form that is easy to process and analyze. 

It helps us get rid of unwanted data & noise by removing punctuation, digits, and stopwords, converting to lower case, etc.

### Removing punctuation & digits
Using the built-in constant string.punctuation and the string method .isdigit() to check for punctuation and digits and remove them.

In [None]:
#data cleaning/preprocessing - removing punctuation and digits 
def remove_punct_digits(text):
  #keeping only characters that are neither punctuation nor digits
  return "".join(ch for ch in text if ch not in string.punctuation and not ch.isdigit())

#apply() writes the whole column at once and avoids chained assignment
msg['cleaned_text']=msg['text'].apply(remove_punct_digits)

msg.drop(['text'],axis=1,inplace=True)
msg.head() 
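
As an alternative sketch of the same cleaning step, str.translate can delete all unwanted characters in one pass. The strip_punct_digits helper and the sample message are illustrative only. One assumption to be aware of: string.digits covers just the ASCII digits 0-9, whereas .isdigit() also accepts some other Unicode digit characters, so the two approaches can differ on unusual input.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit
REMOVE_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def strip_punct_digits(text):
    """Remove punctuation and digits from a string (hypothetical helper)."""
    return text.translate(REMOVE_TABLE)

cleaned = strip_punct_digits("Free entry!! Text WIN to 80086, cost 150p/day.")
print(repr(cleaned))
```

Removing characters without replacing them can leave double spaces or glue words together ("150p/day" becomes "pday"); the tokenization step that follows splits on whitespace runs, so extra spaces are harmless.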

### Tokenizing & converting to lower case 
Using re.split() to split text into words(tokens) and using .lower() to convert them into lower case.

In [None]:
#data cleaning/preprocessing - tokenization and convert to lower case 
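#lower-casing each text, then splitting on runs of non-word characters (raw string avoids an invalid-escape warning)
msg['token']=msg['cleaned_text'].str.lower().apply(lambda text:re.split(r"\W+",text))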

msg.head()
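
A small sketch of how re.split behaves on cleaned text may help here. The tokenize helper is illustrative. It shows one detail worth knowing: when a text starts or ends with a separator, re.split returns empty strings at the edges, which can be filtered out if they are not wanted.

```python
import re

def tokenize(text):
    """Lower-case the text, then split on runs of non-word characters."""
    return re.split(r"\W+", text.lower())

plain = tokenize("Free entry Text WIN")
padded = tokenize(" Ok lar ")

# Optional clean-up: drop the empty strings produced at the edges
padded_clean = [tok for tok in padded if tok]

print(plain)
print(padded)
print(padded_clean)
```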

### Removing Stopwords
Stopwords are the most commonly used words in a language. In English, examples include "on", "in", "a", and "the".

More on stopwords: https://www.tutorialspoint.com/python_text_processing/python_remove_stopwords.htm

In [None]:
#data cleaning/preprocessing - stopwords
#a set makes each membership check fast compared to scanning a list
stopwords=set(nltk.corpus.stopwords.words('english'))

#keeping only the tokens that are not stopwords
msg['updated_token']=msg['token'].apply(lambda tokens:[t for t in tokens if t not in stopwords])

msg.drop(['token'],axis=1,inplace=True)
msg.head()
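
The filtering step can be illustrated without downloading anything. In this sketch the STOPWORDS set is a tiny hand-picked stand-in (NLTK's English list is much longer), and remove_stopwords is a hypothetical helper.

```python
# Tiny illustrative stopword set; not the NLTK list
STOPWORDS = {"on", "in", "a", "the", "to", "is"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = ["the", "prize", "is", "waiting", "in", "a", "box"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Storing the stopwords in a set rather than a list keeps each lookup cheap, which matters when the check runs once per token across thousands of messages.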

### Lemmatization 
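Lemmatization converts the different forms of a word to its root form (lemma).
For example,
eating->eat, 
ran->run, 
runs->run, 
books->book

Note that WordNetLemmatizer.lemmatize() treats every word as a noun unless a part-of-speech tag is passed (for example pos='v'). With the default call used below, noun forms such as books->book are reduced, while verb forms such as ran->run and eating->eat are left unchanged.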

More on Lemmatization: https://www.geeksforgeeks.org/python-lemmatization-with-nltk/

In [None]:
#data cleaning/preprocessing - lemmatization 
wordlem=nltk.WordNetLemmatizer()

#lemmatizing every token (default part of speech is noun)
msg['lem_text']=msg['updated_token'].apply(lambda tokens:[wordlem.lemmatize(t) for t in tokens])

msg.drop(['updated_token'],axis=1,inplace=True)
msg.head()

### Merging Tokens
Merging tokens to form the final text string.

In [None]:
#data cleaning/preprocessing - merging token
#joining each token list back into a single space-separated string
msg['final_text']=msg['lem_text'].apply(" ".join)

msg.drop(['cleaned_text','lem_text'],axis=1,inplace=True)
msg.head()

Let's separate the target from the features, and then split them into training and validation sets.

In [None]:
#separating target and features
#a 1-D Series is the shape scikit-learn expects for the target
y=msg['label']
x=msg.drop(['label'],axis=1)

In [None]:
#splitting the data (80:20 ratio)
x_train,x_val,y_train,y_val=train_test_split(x,y,train_size=0.8,test_size=0.2,random_state=0)

### Count Vectorization
It involves counting the number of occurrences of each word/token in a given text.

More on Count Vectorization: https://www.educative.io/edpresso/countvectorizer-in-python

In [None]:
#count vectorization 
cv=CountVectorizer(max_features=5000)
temp_train=cv.fit_transform(x_train['final_text']).toarray()
temp_val=cv.transform(x_val['final_text']).toarray()
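
To make the idea concrete, here is a rough standard-library sketch of what a count vectorizer computes: a sorted vocabulary and one row of word counts per text. The three sample texts are made up, and this is a simplification, not CountVectorizer's actual implementation (which, among other things, applies its own tokenization rules and can cap the vocabulary with max_features).

```python
from collections import Counter

# Made-up sample texts, already cleaned and lower-cased
docs = ["free prize call now", "call me now", "free free entry"]

# Vocabulary: every distinct token, sorted alphabetically
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per text, one column per vocabulary word
counts = []
for doc in docs:
    word_counts = Counter(doc.split())
    counts.append([word_counts[word] for word in vocab])

print(vocab)
for row in counts:
    print(row)
```

Each column corresponds to one vocabulary word, so the third text gets a 2 in the "free" column and zeros for words it does not contain.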

### TFIDF
TF-IDF (term frequency-inverse document frequency) tells us how important a word is to a text within a collection of texts. It is calculated by multiplying the term frequency (how often the word occurs in the text) by the inverse document frequency (how rare the word is across the collection, calculated as log(number of texts/number of texts that contain the word)).

More on TFIDF: https://monkeylearn.com/blog/what-is-tf-idf/
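
The formula above can be worked through by hand on a toy collection. This sketch uses the plain definition (raw count times log(N/df)) and made-up texts. scikit-learn's TfidfTransformer uses a smoothed variant of the idf by default and normalizes each row, so its numbers will differ from these.

```python
import math

# Made-up tokenized texts
docs = [["free", "prize", "call"], ["call", "me"], ["free", "free", "entry"]]
n_docs = len(docs)

def tf(word, doc):
    """Term frequency: how many times the word occurs in this text."""
    return doc.count(word)

def idf(word):
    """Inverse document frequency: log(number of texts / texts containing the word)."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / df)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

score_free = tfidf("free", docs[2])    # 2 * log(3/2)
score_entry = tfidf("entry", docs[2])  # 1 * log(3/1)
print(round(score_free, 4), round(score_entry, 4))
```

Although "free" occurs twice in the third text, "entry" scores higher because it appears in only one of the three texts.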

In [None]:
#tfidf
tf=TfidfTransformer()
temp_train=tf.fit_transform(temp_train)
temp_val=tf.transform(temp_val)

In [None]:
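#merging temp dataframe with original dataframe
#add_prefix gives the TF-IDF columns string names; recent scikit-learn versions reject a mix of string and integer column names
temp_train=pd.DataFrame(temp_train.toarray(),index=x_train.index).add_prefix('tfidf_')
temp_val=pd.DataFrame(temp_val.toarray(),index=x_val.index).add_prefix('tfidf_')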
x_train=pd.concat([x_train,temp_train],axis=1,sort=False)
x_val=pd.concat([x_val,temp_val],axis=1,sort=False)

x_train.head()

In [None]:
#dropping the final_text column
x_train.drop(['final_text'],axis=1,inplace=True)
x_val.drop(['final_text'],axis=1,inplace=True)

x_train.head()

In [None]:
#converting the labels to int datatype (for model training)
y_train=y_train.astype(int)
y_val=y_val.astype(int)

# 5. Model Training

### Multinomial Naive Bayes

In [None]:
#Multinomial Naive Bayes
model=MultinomialNB()
model.fit(x_train,y_train)
y_preds=model.predict(x_val)
print("Multinomial Naive Bayes:",accuracy_score(y_val,y_preds))

### Decision Tree

In [None]:
#Decision Tree
model=DecisionTreeClassifier(random_state=0)
model.fit(x_train,y_train)
y_preds=model.predict(x_val)
print("Decision Tree:",accuracy_score(y_val,y_preds))

### Random Forest

In [None]:
#Random Forest
model=RandomForestClassifier(n_estimators=100,random_state=0)
model.fit(x_train,y_train)
y_preds=model.predict(x_val)
print("Random Forest:",accuracy_score(y_val,y_preds))
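
When reading these accuracy scores, it helps to keep the class imbalance in mind. As a rough sketch, the snippet below uses the unique message counts reported in the EDA section to compute the accuracy of a trivial classifier that always predicts ham. The exact ratio in the validation split may differ slightly, so treat the number as an approximate baseline.

```python
# Unique message counts reported in the EDA section (after dropping duplicates)
n_ham, n_spam = 4516, 653

# Accuracy of a classifier that labels every message as ham
baseline_accuracy = n_ham / (n_ham + n_spam)
print(round(baseline_accuracy, 4))
```

Any model worth keeping should score clearly above this baseline.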