<a href="https://colab.research.google.com/github/shamiya829/SMS-Spam-Detection/blob/main/i310_Project_Spring_2024_KMNS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Importing Libraries and Initial Loading of Dataset

Reference: https://www.datacamp.com/tutorial/naive-bayes-scikit-learn

In [2]:
# Importing necessary libraries
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Load the SMS spam dataset (spam.csv, downloaded from Kaggle)
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]  # Select only the relevant columns
df.columns = ['label', 'text']  # Rename columns for clarity

# Show the first 10 rows to confirm the data loaded and the columns were renamed correctly
df.head(10)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


### Text Preprocessing

Following Neptune.ai's [article](https://neptune.ai/blog/tokenization-in-nlp) on tokenization in natural language processing, we used NLTK's `word_tokenize` to split each message into individual tokens (words and punctuation marks). We also used the pandas string method `.str.lower()` to convert all text to lowercase, so that case variants such as "Free" and "free" are counted as the same word. Note that the resulting `tokens` column is used for inspection only: the model below is trained on the `text` column, which `CountVectorizer` tokenizes on its own.

In [3]:
# Cleaning data: lowercase conversion and tokenization
df['text'] = df['text'].str.lower()  # convert every message to lowercase
df['tokens'] = df['text'].apply(word_tokenize)  # split each message into word and punctuation tokens

df.head(10)  # Check that lowercase conversion and tokenization were successful

| | label | text | tokens |
|---|---|---|---|
| 0 | ham | go until jurong point, crazy.. available only ... | [go, until, jurong, point, ,, crazy, .., avail... |
| 1 | ham | ok lar... joking wif u oni... | [ok, lar, ..., joking, wif, u, oni, ...] |
| 2 | spam | free entry in 2 a wkly comp to win fa cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... |
| 3 | ham | u dun say so early hor... u c already then say... | [u, dun, say, so, early, hor, ..., u, c, alrea... |
| 4 | ham | nah i don't think he goes to usf, he lives aro... | [nah, i, do, n't, think, he, goes, to, usf, ,,... |
| 5 | spam | freemsg hey there darling it's been 3 week's n... | [freemsg, hey, there, darling, it, 's, been, 3... |
| 6 | ham | even my brother is not like to speak with me. ... | [even, my, brother, is, not, like, to, speak, ... |
| 7 | ham | as per your request 'melle melle (oru minnamin... | [as, per, your, request, 'melle, melle, (, oru... |
| 8 | spam | winner!! as a valued network customer you have... | [winner, !, !, as, a, valued, network, custome... |
| 9 | spam | had your mobile 11 months or more? u r entitle... | [had, your, mobile, 11, months, or, more, ?, u... |


### Splitting Data into Training and Testing Sets

In [4]:
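# Training/Testing Data Split
# Summary statistics per label (only displayed when this is the last expression in the cell)
df.groupby('label').describe()

# Encode the label as a binary target: 1 for spam, 0 for ham
df['spam'] = df['label'].apply(lambda x: 1 if x == 'spam' else 0)

# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(df.text, df.spam, test_size=0.2, shuffle=True)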
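
The split above is random and unseeded, so the exact train/test membership (and the accuracy reported later) can change from run to run. The sketch below shows one possible variation, not the split used in this notebook: it fixes `random_state` and passes `stratify` so both splits keep the same spam/ham ratio. The `toy` DataFrame and the value `random_state=42` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the SMS data: 8 ham (0) and 2 spam (1) messages.
toy = pd.DataFrame({
    "text": [f"message {i}" for i in range(10)],
    "spam": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

# stratify keeps the class ratio equal in both splits;
# random_state makes the shuffle repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["spam"],
    test_size=0.5,
    stratify=toy["spam"],
    random_state=42,
)

print("train class counts:", y_tr.value_counts().to_dict())
print("test class counts:", y_te.value_counts().to_dict())
```

Applied to the real data, the same two arguments would be added to the existing `train_test_split` call, with `stratify=df.spam`.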



Reference: https://www.geeksforgeeks.org/multinomial-naive-bayes/#

### Vectorize the Data

In [5]:
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

### Train the Model

In [6]:
trained_model = MultinomialNB()
trained_model.fit(X_train_vec, y_train)

### Predict Whether the Message is Spam or Ham

In [7]:
y_pred = trained_model.predict(X_test_vec)

### Test How Accurate the Prediction Was

In [8]:
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.9829596412556054
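
Accuracy alone can look good when one class is much more common than the other, because missed spam barely moves the score. The notebook imports `classification_report` but does not call it; the sketch below shows how it and `confusion_matrix` could be used, on hypothetical labels rather than this model's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels (1 = spam, 0 = ham): one of the two spam messages is missed.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)  # rows = true class, columns = predicted

print(f"accuracy: {acc_toy:.2f}")
print(cm_toy)
# Per-class precision, recall and F1; target_names follow the sorted labels 0, 1.
print(classification_report(y_true_toy, y_pred_toy, target_names=["ham", "spam"]))
```

In the notebook itself, the equivalent call would be `classification_report(y_test, y_pred, target_names=['ham', 'spam'])`.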


In [10]:
# Sample 1
sample_text_1 = "Congratulations! You've won a free cruise. Call now to claim your prize."

sample_text_vec_1 = vectorizer.transform([sample_text_1])
sample_prediction_1 = trained_model.predict(sample_text_vec_1)

print(f"Text: {sample_text_1}")
print(f"Predicted label: {'spam' if sample_prediction_1[0] == 1 else 'ham'}")
print()



Text: Congratulations! You've won a free cruise. Call now to claim your prize.
Predicted label: spam



In [11]:
# Sample 2
sample_text_2 = "Hey, how are you doing today? Let's catch up sometime."

sample_text_vec_2 = vectorizer.transform([sample_text_2])
sample_prediction_2 = trained_model.predict(sample_text_vec_2)

print(f"Text: {sample_text_2}")
print(f"Predicted label: {'spam' if sample_prediction_2[0] == 1 else 'ham'}")
print()


Text: Hey, how are you doing today? Let's catch up sometime.
Predicted label: ham



In [12]:
# Sample 3
sample_text_3 = "Get 50% off on all products today only. Visit our website for more details."

sample_text_vec_3 = vectorizer.transform([sample_text_3])
sample_prediction_3 = trained_model.predict(sample_text_vec_3)

print(f"Text: {sample_text_3}")
print(f"Predicted label: {'spam' if sample_prediction_3[0] == 1 else 'ham'}")
print()


Text: Get 50% off on all products today only. Visit our website for more details.
Predicted label: spam
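
The three sample cells repeat the same transform, predict and print steps. As a sketch, those steps could be wrapped in one helper that also reports the estimated spam probability through `predict_proba`. The helper name `classify_message` and the tiny training set are hypothetical; in the notebook the helper would be called with the existing `vectorizer` and `trained_model`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data so the example runs on its own (1 = spam, 0 = ham).
toy_texts = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still meeting today",
    "see you at lunch",
]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)


def classify_message(message, vectorizer, model):
    """Return (label, spam_probability) for one message (illustrative helper)."""
    features = vectorizer.transform([message])
    spam_index = list(model.classes_).index(1)  # column of the spam class
    spam_prob = float(model.predict_proba(features)[0, spam_index])
    label = "spam" if model.predict(features)[0] == 1 else "ham"
    return label, spam_prob


for msg in ["claim your free prize", "see you today"]:
    label, prob = classify_message(msg, toy_vec, toy_model)
    print(f"Text: {msg}")
    print(f"Predicted label: {label} (estimated P(spam) = {prob:.2f})")
```

Looking at the probability as well as the label can help show how confident the model is on borderline messages.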

