Machine Learning Project: Predicting Whether an Email Is Spam


*   **Group members of Project**:

*   F2022266178 (Muhammad Saad Qureshi)
*   F2022266178 (Muhammad Awais Naeem)


In [35]:
import pandas as pd  # Reads the dataset CSV file and handles other tabular-data tasks.
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts a collection of documents (emails) into a matrix of TF-IDF features.
from sklearn.model_selection import train_test_split  # Splits the data into random train and test subsets.
import string  # Provides string.punctuation, used to remove punctuation while cleaning the text.
import nltk  # Natural Language Toolkit; used here to remove stopwords (commonly used words) from the data.

from nltk.corpus import stopwords  # The stopword lists.
nltk.download('stopwords')  # Downloads the stopword corpus if it is not already present.
from sklearn.metrics import accuracy_score, classification_report  # Accuracy and a per-class report of precision, recall, and F1 for the trained model.

from sklearn.svm import SVC  # SVC stands for Support Vector Classification; this notebook uses an SVM model.

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
# The dataset is stored on Google Drive, so mount the drive first to access it.
from google.colab import drive
drive.mount('/content/drive')

# Read the CSV file into a DataFrame with pandas' read_csv function.
df = pd.read_csv("/content/drive/MyDrive/csv_files/spam.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
df

Unnamed: 0,Spam/Not_Spam,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ÃÂ_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [41]:
# groupby collects the rows that share each unique value in the Spam/Not_Spam column,
# and describe reports summary statistics per group: count, number of unique values, most frequent value (top), and its frequency (freq).
df.groupby('Spam/Not_Spam').describe()

Spam/Not_Spam,message count,message unique,message top,message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4
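
The counts above show that the classes are imbalanced, which matters when reading the accuracy reported later. The short sketch below only redoes the arithmetic on the numbers printed by `describe()`; the dictionary layout and variable names are illustrative and not part of the notebook.

```python
# Counts copied from the describe() output above.
counts = {
    "ham": {"count": 4825, "unique": 4516},
    "spam": {"count": 747, "unique": 653},
}

total = sum(c["count"] for c in counts.values())

# Share of ham: the accuracy a model would get by always answering "ham".
ham_share = counts["ham"]["count"] / total

# Rows whose text repeats an earlier message within the same class.
duplicates = sum(c["count"] - c["unique"] for c in counts.values())

print(f"total messages:       {total}")
print(f"ham share (baseline): {ham_share:.3f}")
print(f"repeated messages:    {duplicates}")
```

A model therefore has to beat roughly 87% accuracy before it is doing better than always predicting ham.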


In [42]:
# Copy the message column into a new Series so the original DataFrame stays unchanged

df_copy = df['message'].copy()

In [43]:
# Removes punctuation and common words (stopwords) so the model can focus on the informative words

# Build the stopword set once; calling stopwords.words('english') for every word is slow.
stop_words = set(stopwords.words('english'))

def text_preprocess(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

In [49]:
# Apply the function to every message in the data

df_copy = df_copy.apply(text_preprocess)


In [48]:
# Inspect the cleaned messages

df_copy

0       Go jurong point crazy Available bugis n great ...
1                                 Ok lar Joking wif u oni
2       Free entry 2 wkly comp win FA Cup final tkts 2...
3                     U dun say early hor U c already say
4             Nah dont think goes usf lives around though
                              ...                        
5567    2nd time tried 2 contact u U ÃÂ¥ÃÂ£750 Pound...
5568                       ÃÂ b going esplanade fr home
5569                          Pity mood Soany suggestions
5570    guy bitching acted like id interested buying s...
5571                                       Rofl true name
Name: message, Length: 5572, dtype: object
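
Row 5569 in the output above reads "Soany", because deleting punctuation outright glues together words that were separated only by punctuation ("So...any"). The sketch below compares that behaviour with replacing punctuation by spaces. It uses a tiny hand-written stop list (an assumption made only so the example runs without NLTK), the helper names are hypothetical, and the sample sentence is reconstructed from the truncated row shown earlier.

```python
import string

# Hypothetical miniature stop list for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"was", "in", "for", "that", "so", "any", "other"}

def strip_punctuation(text):
    """Delete punctuation characters outright, as the notebook's function does."""
    return text.translate(str.maketrans("", "", string.punctuation))

def space_punctuation(text):
    """Replace each punctuation character with a space instead."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def drop_stop_words(text):
    """Keep only the words that are not in the stop list (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

sample = "Pity, * was in mood for that. So...any other suggestions?"

deleted = drop_stop_words(strip_punctuation(sample))
spaced = drop_stop_words(space_punctuation(sample))

print(deleted)  # "So...any" collapses into the single token "Soany"
print(spaced)   # "So" and "any" stay separate and are filtered as stopwords
```

Whether the space-based variant is preferable depends on the data: it also splits contractions such as "don't" into "don" and "t", so it is a trade-off and not a strict improvement.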

In [55]:
# Initialize a TF-IDF vectorizer using the TfidfVectorizer class.
# TF-IDF stands for Term Frequency-Inverse Document Frequency: a weight that reflects how important a word
# is to one message relative to the whole dataset.
tfidf = TfidfVectorizer()



# fit_transform learns the vocabulary used in the data and how many messages each word appears in,
# then converts every message into a numerical vector that the model can work with.
X = tfidf.fit_transform(df_copy)

#print(X)

y = df['Spam/Not_Spam']



0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: Spam/Not_Spam, Length: 5572, dtype: object
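
To make the TF-IDF weighting described above concrete, here is a minimal pure-Python sketch of the calculation on a made-up three-message corpus. It assumes the scheme that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 normalisation) and skips the tokenisation and lowercasing the real class performs; the corpus and function names are illustrative only.

```python
import math
from collections import Counter

# Made-up, already-cleaned messages for illustration; not taken from spam.csv.
docs = [
    "free entry win cup",
    "ok lar joking",
    "free call now win prize",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of messages that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Term count times IDF for each term, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first message, "entry" and "cup" appear in no other message and so receive a higher weight than "free" and "win", which also occur in the third message.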


In [58]:
# Split the features and labels into training and test sets

# X_train and X_test contain the input data for the training and test sets, respectively.
# y_train and y_test contain the corresponding labels for the training and test sets.
# test_size=0.1 holds out 10% of the rows for testing; random_state fixes the shuffle so the split is reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)

In [59]:
# Model selection and training (SVM)
model_svm = SVC(kernel='linear')  # Create a Support Vector Classification object with a linear kernel
model_svm.fit(X_train, y_train)  # Train the model on the training data
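
One caveat about the flow above: the vectorizer was fitted on all messages before the split, so vocabulary and document frequencies from the test rows influence the training features. A possible alternative, sketched here on a tiny made-up corpus and not on the notebook's data, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training portion only. The names `messages`, `labels`, and `pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus for illustration only; not the notebook's spam.csv.
messages = [
    "free entry win cash prize now",
    "call now to claim your free prize",
    "win a free mobile call today",
    "urgent claim your cash prize",
    "ok see you at home later",
    "are you coming home for dinner",
    "sorry i will call you later",
    "see you at dinner tonight",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Split the raw text first; stratify keeps the class ratio equal in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=2, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the SVM on it.
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipeline.fit(texts_train, y_train)

# predict() reuses the fitted vectorizer, so raw strings can be passed directly.
predictions = pipeline.predict(texts_test)
print(list(zip(texts_test, predictions)))
```

With this layout a new message can be classified with `pipeline.predict([text])`, with no separate call to `tfidf.transform`.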

In [26]:
y_pred_svm = model_svm.predict(X_test)  # Use the trained model to predict labels for the test data

In [27]:
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm)

print("SVM Model Accuracy:", accuracy_svm)
print("SVM Classification Report:\n", report_svm)

SVM Model Accuracy: 0.9713261648745519
SVM Classification Report:
               precision    recall  f1-score   support

         ham       0.97      1.00      0.98       490
        spam       1.00      0.76      0.87        68

    accuracy                           0.97       558
   macro avg       0.98      0.88      0.93       558
weighted avg       0.97      0.97      0.97       558
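
The report above can be traced back to four counts. The notebook does not print a confusion matrix, so the counts below are inferred: 97.13% of 558 is 542 correct predictions, and a spam precision of 1.00 with a ham recall of 1.00 implies that all 16 errors are spam messages predicted as ham. The sketch recomputes the reported figures from those inferred counts; the variable names are illustrative.

```python
# Counts inferred from the report above (assumption: 16 errors, all of them missed spam).
tp, fn = 52, 16   # spam predicted as spam / spam predicted as ham
fp, tn = 0, 490   # ham predicted as spam / ham predicted as ham

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)

print(f"accuracy       {accuracy:.4f}")
print(f"spam precision {spam_precision:.2f}  recall {spam_recall:.2f}  f1 {spam_f1:.2f}")
print(f"ham  precision {ham_precision:.2f}  recall {ham_recall:.2f}")
```

Read this way, the headline accuracy hides the weak spot: roughly one spam message in four in the test set is let through as ham.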



In [57]:
# Try the trained model (`model_svm`) and the fitted TF-IDF vectorizer (`tfidf`) on a sample email
sample_email = """07732584351 - Rodger Burns - MSG = We tried to call you re your reply to our sms for a free nokia mobile +
free camcorder. Please call now 08000930705 for delivery tomorrow"""

#sample_email = "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k?"

#sample_email = "Yeah he got in at 2 and was v apologetic. n had fallen out and she was actin like spoilt child and"

# Preprocess the sample email
sample_email = text_preprocess(sample_email)

# Transform the sample email using the same TF-IDF vectorizer
sample_email_vector = tfidf.transform([sample_email])

# Use the trained model to predict
prediction = model_svm.predict(sample_email_vector)

print(prediction)

if prediction[0] == 'spam':
    print("The sample email is classified as SPAM.")
else:
    print("The sample email is classified as NOT SPAM.")

['ham']
The sample email is classified as NOT SPAM.
