<a href="https://colab.research.google.com/github/widura26/machine-learning-portfolio/blob/main/email_spam_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References

[Medium](https://medium.com/@oluyaled/email-spam-detection-using-machine-learning-scikit-python-1b15ee1c6f75)

[YouTube](https://www.youtube.com/watch?v=nkPNQk4-3UE)

[Dataset](https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data?select=spam_ham_dataset.csv)




In [None]:
import pandas as pd
import joblib
import string
import csv
import nltk # Natural Language Toolkit, used here for text preprocessing
from nltk.corpus import stopwords # common words (e.g. "the", "is") to drop before text analysis
from nltk.stem import PorterStemmer # reduces a word to its stem, e.g. running -> run

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

from google.colab import drive

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = "/content/drive/MyDrive/dataset/spam_ham_dataset.csv"
emails = pd.read_csv(file_path)
emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [None]:
emails

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0



In [None]:
emails['text'][0]

"Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes ."

In [None]:
emails['text'] = emails['text'].apply(lambda x : x.replace('\r\n', ' '))

In [None]:
stemmer = PorterStemmer()
corpus = []
stop_words = set(stopwords.words('english'))

for i in range(len(emails)):
  text = emails['text'].iloc[i].lower()
  text = text.translate(str.maketrans('', '', string.punctuation)).split()
  text = [stemmer.stem(word) for word in text if word not in stop_words]
  text = ' '.join(text)
  corpus.append(text)

In [None]:
corpus[0]

'subject enron methanol meter 988291 follow note gave monday 4 3 00 preliminari flow data provid daren pleas overrid pop daili volum present zero reflect daili activ obtain ga control chang need asap econom purpos'
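The cleaning loop above can also be wrapped in a reusable function, so that the same steps are applied to any text that later reaches the model. The following is a minimal sketch, not the notebook's own code: `preprocess`, `demo_stop_words`, and `demo_stem` are hypothetical names, and the tiny stop-word set and identity stemmer are stand-ins so the example runs without NLTK.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text, stop_words, stem):
    """Lowercase, strip punctuation, drop stop words, and stem each token."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return ' '.join(stem(tok) for tok in tokens if tok not in stop_words)

# Stand-ins so the sketch runs without NLTK (assumptions, not the real lists).
demo_stop_words = {'this', 'is', 'a', 'the', 'to'}
demo_stem = lambda tok: tok  # identity "stemmer", for illustration only

cleaned = preprocess("This is a follow up to the note!", demo_stop_words, demo_stem)
print(cleaned)
```

In the notebook itself, the call would presumably look like `corpus = [preprocess(t, stop_words, stemmer.stem) for t in emails['text']]`, reusing the `stop_words` and `stemmer` objects defined earlier.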

In [None]:
# Extract bag-of-words count features from the text
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus).toarray()
y = emails['label']

In [None]:
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
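Because the dataset has noticeably more ham than spam, it may be worth keeping the class ratio identical in both splits. The sketch below shows the `stratify` argument of `train_test_split` on invented toy labels; the variable names are hypothetical and the data is not from the notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy features and imbalanced labels, invented for illustration only.
toy_x = list(range(20))
toy_y = ["ham"] * 15 + ["spam"] * 5

# stratify=toy_y keeps the ham/spam proportion the same in train and test.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print(Counter(ys_train), Counter(ys_test))
```

Applied to the cell above, this would amount to adding `stratify=y` to the existing call; the reported scores would then likely differ slightly from the ones shown here.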

In [None]:
x[0]

array([1, 0, 0, ..., 0, 0, 0])
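As the output above suggests, each row is mostly zeros. `CountVectorizer` returns a SciPy sparse matrix, and `MultinomialNB` accepts that directly, so the `.toarray()` call is optional and skipping it can save a good deal of memory on larger corpora. A small sketch on an invented toy corpus (all `toy_*` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus and labels, invented for illustration only.
toy_corpus = ["win free prize now", "meeting schedule attached",
              "free prize claim now", "see attached schedule"]
toy_labels = ["spam", "ham", "spam", "ham"]

toy_vectorizer = CountVectorizer()
x_sparse = toy_vectorizer.fit_transform(toy_corpus)  # stays sparse, no .toarray()

# The classifier trains and predicts on the sparse matrix as-is.
toy_clf = MultinomialNB().fit(x_sparse, toy_labels)
toy_pred = toy_clf.predict(toy_vectorizer.transform(["claim free prize"]))
print(x_sparse.shape, toy_pred[0])
```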

In [None]:
# Train the classification model using multinomial naive Bayes
clf = MultinomialNB()
clf = clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Classification report")
print(report)

Accuracy:  0.9748792270531401
Classification report
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       742
        spam       0.95      0.97      0.96       293

    accuracy                           0.97      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.97      0.97      1035
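The reported accuracy corresponds to 1009 correct predictions out of the 1035 test emails. To make the other columns of the report concrete, the sketch below recomputes accuracy, precision, recall, and F1 by hand for one class. `binary_metrics` is a hypothetical helper and the labels are invented toy data, not the notebook's test set.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall, f1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels, invented for illustration only.
demo_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
demo_pred = ["spam", "spam", "ham",  "ham", "ham", "ham", "ham", "spam"]
acc, prec, rec, f1 = binary_metrics(demo_true, demo_pred)
print(acc, prec, rec, f1)
```

For a spam filter, precision on the spam class indicates how often a flagged email really is spam, while recall indicates how much spam gets caught; the two can trade off against each other.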



In [None]:
joblib.dump(clf, '/content/drive/MyDrive/ML_Models/spam_email__detection_model.pkl') # save the model to a file

['/content/drive/MyDrive/ML_Models/spam_email__detection_model.pkl']
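Only the classifier is written to disk here, yet predictions also need the fitted `CountVectorizer`, since its vocabulary defines the feature columns. One possible approach, sketched under assumptions, is to store both objects in a single file. The file name `spam_bundle.pkl`, the dictionary keys, and the toy data are all hypothetical, and a temporary directory stands in for Google Drive.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration only.
texts = ["win free prize now", "meeting schedule attached"]
labels = ["spam", "ham"]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(texts), labels)

# Hypothetical file name; a temporary directory stands in for Google Drive.
bundle_path = os.path.join(tempfile.mkdtemp(), "spam_bundle.pkl")
joblib.dump({"vectorizer": vec, "model": model}, bundle_path)

# Later, possibly in a fresh session: restore both pieces together.
bundle = joblib.load(bundle_path)
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(["free prize"]))
print(restored_pred[0])
```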

In [None]:
loaded_model = joblib.load('/content/drive/MyDrive/ML_Models/spam_email__detection_model.pkl')

In [None]:
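# Note: unlike the training corpus, this text is not stemmed or stripped of stop words before vectorizing
email = ["Congratulations! You've won a prize. Claim it now."]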
email = vectorizer.transform(email)
prediction = loaded_model.predict(email)

if prediction[0] == "spam":
  print("Spam")
else:
  print("Not Spam")

Spam
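A scikit-learn `Pipeline` is one way to keep vectorizing and classification together, so that raw text goes in at both training and prediction time and the vocabulary is learned from the training split only. This is a minimal sketch on invented toy data, not the notebook's method; `spam_pipeline` and the step names are hypothetical, and it omits the NLTK stemming and stop-word steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration; the notebook would use its own train split.
train_texts = ["congratulations you won a free prize", "claim your prize now",
               "please review the attached nomination", "meeting moved to monday"]
train_labels = ["spam", "spam", "ham", "ham"]

spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),  # vocabulary is learned from training text only
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(train_texts, train_labels)

# Raw text is passed straight in; the pipeline vectorizes it consistently.
new_email = ["Congratulations! You've won a prize. Claim it now."]
pipeline_pred = spam_pipeline.predict(new_email)
print("Spam" if pipeline_pred[0] == "spam" else "Not Spam")
```

A fitted pipeline is a single object, so it can be saved and loaded with `joblib` in one call, which avoids the vectorizer and the classifier drifting apart.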
