## Introduction


### Problem Area

The challenge in spam detection is to distinguish spam from legitimate messages reliably across a variety of settings, including social media, messaging platforms, and email, in order to improve the user experience.


### Objectives

The main objectives of spam detection are to safeguard you from unsolicited communications, conserve your time and effort, and ensure that you receive only communication that is pertinent and important. It serves as a filter, keeping your inbox clear, preventing fraud, and sparing you the time wasted on unimportant messages.

### Datasets

This project uses the ‘SMS Spam Collection Dataset’, available from the Kaggle website. The file contains one message per line. Each line has two columns: v1 holds the label (ham or spam) and v2 holds the raw text. The dataset was gathered from various sources on the internet, each either freely available or free for research purposes.

A collection of 425 SMS spam messages was manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam, most of them without quoting the actual spam message received.

A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions were going to be made publicly available.

A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis.

Finally, the dataset incorporates the SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham messages and 322 spam messages and is publicly available.


### Evaluation Methodology

The evaluation metrics I will be using are precision and accuracy. Precision matters most when the goal is to reduce false positives, that is, to make sure that the messages the model flags as spam really are spam. Accuracy is used because it is easy to interpret and provides a single number summarizing the model's overall capability.
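
For reference, both metrics can be written in terms of confusion-matrix counts. The definitions below assume spam is treated as the positive class, so $TP$ is spam correctly flagged, $FP$ is ham wrongly flagged as spam, $TN$ is ham correctly passed through, and $FN$ is spam that was missed:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\text{Precision} = \frac{TP}{TP + FP}
$$

Precision does not involve $FN$, so a model can reach high precision while still missing some spam. Accuracy counts both kinds of error equally.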


#### Install packages

In [158]:

!pip install nltk scikit-learn regex numpy pandas wordcloud matplotlib 



In [204]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Reading dataset

In [105]:
df=pd.read_csv('spam.csv',encoding='latin-1')

In [106]:
df.shape

(5572, 5)

In [107]:
df.head(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


#### Data cleaning

In [108]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [109]:
# dropping unnecessary columns
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)

In [110]:
#renaming the columns
df.rename(columns = {'v1':'target', 'v2': 'text'},inplace=True)

In [111]:
#Checking missing values
df.isnull().sum()

target    0
text      0
dtype: int64

In [112]:
#checking duplicated values
df.duplicated().sum()

403

In [113]:
#dropping duplicate values
print("before removing duplicates;",df.shape)
df.drop_duplicates(keep='first',inplace=True)
print("after removing duplicates",df.shape)

before removing duplicates; (5572, 2)
after removing duplicates (5169, 2)


In [114]:
df.head()

Unnamed: 0,target,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Implementation

### Preprocessing

#### Removing Stopwords, punctuation and stemming

In [135]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')  # needed by stopwords.words('english') below

# Build the stopword set and the stemmer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

#Function to transform the text
def transform_text(text):

    #Convert text to lowercase
    text = text.lower()

    #Tokenize the text into individual words
    tokens = nltk.word_tokenize(text)

    #Keep only alphanumeric tokens (this already drops pure punctuation)
    tokens = [word for word in tokens if word.isalnum()]

    #Remove stopwords and any remaining punctuation from the text
    tokens = [word for word in tokens
              if word not in STOP_WORDS and word not in string.punctuation]

    #Apply stemming to each remaining word
    tokens = [stemmer.stem(word) for word in tokens]

    #Join the stemmed words to form the transformed text
    return " ".join(tokens)
   

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gourisrinijag/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [136]:
# Creating a new column
df['transformed_text'] = df['text'].apply(transform_text)

In [137]:
df.head()

Unnamed: 0,target,text,transformed_text
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah think goe usf live around though


#### Bag-of-Words

In [170]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [171]:
X = cv.fit_transform(df['transformed_text']).toarray()

In [172]:
X.shape

(5169, 6708)

In [173]:
y = df['target'].values

In [175]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

### Baseline

#### Naive Bayes

In [176]:
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score

In [177]:
# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()
# Create a Multinomial Naive Bayes classifier
mnb = MultinomialNB()
# Create a Bernoulli Naive Bayes classifier
bnb = BernoulliNB()
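
The three variants treat the count features differently: Gaussian Naive Bayes models each feature as a continuous value, Multinomial uses the word counts themselves, and Bernoulli only looks at whether a word is present. The sketch below illustrates the last two points on a tiny made-up count matrix (`X_counts` and `y_toy` are hypothetical). It relies on `BernoulliNB` binarizing its input at its default threshold of 0.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up word-count matrix: 4 messages x 3 vocabulary words
X_counts = np.array([[2, 0, 1],
                     [0, 3, 0],
                     [1, 0, 0],
                     [0, 1, 2]])
y_toy = np.array(["spam", "ham", "spam", "ham"])

# Presence/absence version of the same matrix
X_binary = (X_counts > 0).astype(int)

# BernoulliNB learns the same parameters from counts and from 0/1 indicators
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit(X_binary, y_toy)
same_bernoulli = np.allclose(bnb_counts.feature_log_prob_,
                             bnb_binary.feature_log_prob_)

# MultinomialNB is sensitive to how often each word occurs
mnb_counts = MultinomialNB().fit(X_counts, y_toy)
mnb_binary = MultinomialNB().fit(X_binary, y_toy)
same_multinomial = np.allclose(mnb_counts.feature_log_prob_,
                               mnb_binary.feature_log_prob_)

print(same_bernoulli, same_multinomial)
```

This is one possible reason the variants can score differently on the same bag-of-words features, although the toy data says nothing about which one works best on real messages.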

##### Gaussian Naive Bayes Classifier 

In [202]:

gnb.fit(X_train,y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test,y_pred1))
print(confusion_matrix(y_test,y_pred1))
print(precision_score(y_test,y_pred1, pos_label='spam'))

0.8800773694390716
[[792 104]
 [ 20 118]]
0.5315315315315315


##### Bernoulli Naive Bayes Classifier

In [203]:
bnb.fit(X_train,y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test,y_pred3))
print(confusion_matrix(y_test,y_pred3))
print(precision_score(y_test,y_pred3, pos_label='spam'))

0.9700193423597679
[[893   3]
 [ 28 110]]
0.9734513274336283


##### Multinomial Naive Bayes Classifier

In [201]:
mnb.fit(X_train,y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))
print(precision_score(y_test, y_pred2, pos_label='spam'))

0.9642166344294004
[[871  25]
 [ 12 126]]
0.8344370860927153
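
As a cross-check, the reported accuracy and precision can be recomputed by hand from the three confusion matrices printed above. The sketch below keys each matrix by the classifier object that produced it (`gnb`, `bnb`, `mnb`) and assumes scikit-learn's layout: rows are true labels, columns are predicted labels, in sorted order (ham, spam). Recall is included only for context and is not one of the chosen metrics.

```python
# Confusion matrices copied from the outputs above, keyed by classifier object
matrices = {
    "gnb": [[792, 104], [20, 118]],
    "bnb": [[893, 3], [28, 110]],
    "mnb": [[871, 25], [12, 126]],
}

def summarize(cm):
    """Compute accuracy, precision and recall with spam as the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

results = {name: summarize(cm) for name, cm in matrices.items()}
for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

The recomputed values match the printed scores. The matrices also show the trade-off behind them: `bnb` wrongly flags only 3 ham messages but misses 28 of the 138 spam messages, while `mnb` misses 12 spam messages at the cost of 25 false positives.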


### Classification approach

A Support Vector Machine (SVM) classifier with a linear kernel is used.

In [208]:
from sklearn import svm
from sklearn.metrics import classification_report  # used in the evaluation cell below

In [209]:
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

SVC(kernel='linear')

In [211]:
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

precision = precision_score(y_test, y_pred, pos_label='spam')
print("Precision:", precision)

classification_rep = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_rep)

Accuracy: 0.9777562862669246
Precision: 0.9752066115702479
Classification Report:
               precision    recall  f1-score   support

         ham       0.98      1.00      0.99       896
        spam       0.98      0.86      0.91       138

    accuracy                           0.98      1034
   macro avg       0.98      0.93      0.95      1034
weighted avg       0.98      0.98      0.98      1034
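
To classify new messages, the vectorizer and the classifier can be bundled into a single object. The following is a minimal sketch on made-up training messages, not the notebook's actual training run. The names `model`, `train_texts`, and `new_messages` are hypothetical, and in practice each message would first go through `transform_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training messages, for illustration only
train_texts = ["free prize win", "win cash free",
               "see you tonight", "call me later"]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vocabulary and the linear SVM in one step
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
new_messages = ["free cash prize", "see you later"]
predictions = model.predict(new_messages).tolist()
print(predictions)
```

One design consideration: with a pipeline, the vocabulary is learned only from the data passed to `fit`, so fitting it on the training split alone would keep test messages out of the feature-building step.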



## Conclusion

### Evaluation

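I used the Naive Bayes algorithm as my baseline, with three classifier models: Multinomial Naive Bayes, Gaussian Naive Bayes, and Bernoulli Naive Bayes. The strongest baseline, the Bernoulli Naive Bayes classifier (`bnb`), has an accuracy score of 0.97 and a precision score of 0.97. The Multinomial classifier (`mnb`) reaches 0.96 accuracy and 0.83 precision, and the Gaussian classifier (`gnb`) reaches 0.88 accuracy and 0.53 precision. My SVM classifier achieves a similar accuracy (0.98) and precision (0.98), a small improvement over the best baseline.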

### Citations

“The National University of Singapore SMS Corpus.” ScholarBank@NUS, doi.org/10.25540/WVM0-4RNX.

Tagg, Caroline. “A Corpus Linguistics Study of SMS Text Messaging.” PhD thesis, 2009.