<a href="https://colab.research.google.com/github/styxOO7/Spam-Mail-Prediction-/blob/main/Spam_Mail_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the Dependencies

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF (term frequency-inverse document frequency) vectorizer: converts text into numeric feature vectors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Data Collection & Pre-Processing

In [25]:
raw_mail_data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/csvs/mail_data.csv")
raw_mail_data.head(10)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [26]:
# raw_mail_data[['Message']].isna() --> returns a boolean mask marking the entries that are NaN

# Replace missing values with empty strings so that every message is a string
mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')
mail_data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Label Encoding:

In [27]:
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

Spam -> 0 &
Ham -> 1
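
The same encoding can also be written with an explicit mapping, which yields an integer column directly. This is a minimal sketch on a small made-up frame; the names `toy`, `label_map` and `Label` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you soon", "win a free prize", "ok thanks"],
})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map).astype("int64")

print(toy["Label"].tolist())  # [1, 0, 1]
```

A category that is missing from `label_map` would become NaN and make `astype("int64")` fail, which surfaces unexpected labels early.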

In [28]:
x = mail_data['Message']
y = mail_data['Category']

In [29]:
print(x)
print(y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object
0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


Train Test Split

In [30]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
x_test

1078                         Yep, by the pretty sculpture
4028        Yes, princess. Are you going to make me moan?
958                            Welp apparently he retired
4642                                              Havent.
4674    I forgot 2 ask ü all smth.. There's a card on ...
                              ...                        
324     That would be great. We'll be at the Guild. Co...
1163    Free entry in 2 a wkly comp to win FA Cup fina...
86      For real when u getting on yo? I only need 2 m...
4214                     I attended but nothing is there.
90      Yeah do! Don‘t stand to close tho- you‘ll catc...
Name: Message, Length: 1115, dtype: object

Feature Extraction:

In [31]:
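# Convert the text into numerical features that the model can work with.
# Each message becomes a vector with one weight per vocabulary word:
# weight(word) = TF * IDF
# TF  = (occurrences of the word in the message) / (total words in the message)
# IDF = log((total no. of messages) / (no. of messages containing the word))
# Note: by default TfidfVectorizer uses raw counts for TF, a smoothed IDF,
# and L2-normalizes each message vector.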

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

# Convert y into integers:
y_train = y_train.astype('int')
y_test = y_test.astype('int')
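
The comments in the cell above give the textbook TF-IDF definition. With its default settings, `TfidfVectorizer` computes a close variant: TF is the raw count of the word in the message, IDF is smoothed as ln((1 + n) / (1 + df)) + 1, and each message vector is then scaled to unit L2 length. The sketch below reproduces one weight by hand on a three-message toy corpus; the corpus and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = ["free prize now", "call me now", "free free entry"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_

n_docs = len(docs)
df_free = sum("free" in d.split() for d in docs)    # messages containing "free"
df_entry = sum("entry" in d.split() for d in docs)  # messages containing "entry"

# Smoothed IDF as used by scikit-learn's default settings.
idf_free = np.log((1 + n_docs) / (1 + df_free)) + 1
idf_entry = np.log((1 + n_docs) / (1 + df_entry)) + 1

# Third message: "free" occurs twice, "entry" once (raw counts as TF).
raw = np.array([2 * idf_free, 1 * idf_entry])
manual_free = raw[0] / np.linalg.norm(raw)  # L2-normalize the message vector

sklearn_free = tfidf[2, vocab["free"]]
print(manual_free, sklearn_free)
```

The two printed values should agree up to floating-point error.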

In [32]:
print(x_test_features)

  (0, 7406)	0.7202901083692191
  (0, 5207)	0.693672948719682
  (1, 7408)	0.39146814311442063
  (1, 5220)	0.4705918182872853
  (1, 4400)	0.5934443291757167
  (1, 4191)	0.3895427356578373
  (1, 3071)	0.3483910428713775
  (2, 7181)	0.7357795587192053
  (2, 994)	0.6772211167491541
  (3, 3259)	1.0
  (4, 7336)	0.3056500641731669
  (4, 7100)	0.18546351525534188
  (4, 6030)	0.5868928485020234
  (4, 5925)	0.3163109675928492
  (4, 5196)	0.33058431450158204
  (4, 3951)	0.2762463598251023
  (4, 2854)	0.2623110272820492
  (4, 2093)	0.19907660636915728
  (4, 1604)	0.2934464242510117
  (4, 1067)	0.2180322556038374
  (5, 7120)	0.41930198660651424
  (5, 7095)	0.47776118013894237
  (5, 6602)	0.4818760834807631
  (5, 4743)	0.33583035252339755
  (5, 3101)	0.3500213296091081
  :	:
  (1111, 2886)	0.11453235557068618
  (1111, 2761)	0.18749587684632552
  (1111, 2652)	0.4776476584652413
  (1111, 2540)	0.36360411501974377
  (1111, 2067)	0.2044709693034356
  (1111, 1873)	0.19469034463818594
  (1111, 1002)	0.1695

Training the Model:


In [33]:
model = LogisticRegression()
model.fit(x_train_features, y_train)

LogisticRegression()

In [34]:
model.score(x_test_features, y_test)

0.9704035874439462

Check for Underfitting and Overfitting

In [35]:
# Compare accuracy on the training and test data to check for
# overfitting (low training error but high test error) and
# underfitting (high error on both training and test data).

# Prediction on training data:
y_train_pred = model.predict(x_train_features)
accuracy_train_data = accuracy_score(y_train, y_train_pred)

In [36]:
print("Accuracy on training data = ",accuracy_train_data)

Accuracy on training data =  0.9681400044873233


In [37]:
# Prediction on test data:
y_test_pred = model.predict(x_test_features)
accuracy_test_data = accuracy_score(y_test, y_test_pred)

print("Accuracy on test data = ",accuracy_test_data)

Accuracy on test data =  0.9704035874439462
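
The two accuracies above differ by less than 0.3 percentage points, which gives no sign of overfitting. The comparison can be wrapped in a small helper, sketched below on made-up data. The function name `accuracy_gap` and the toy messages are assumptions for illustration, and the numbers printed for the toy data say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def accuracy_gap(clf, x_tr, y_tr, x_te, y_te):
    """Return (train accuracy, test accuracy, train minus test)."""
    train_acc = accuracy_score(y_tr, clf.predict(x_tr))
    test_acc = accuracy_score(y_te, clf.predict(x_te))
    return train_acc, test_acc, train_acc - test_acc


# Toy data, invented for illustration only (0 = spam, 1 = ham).
toy_train = ["win free prize", "free cash claim",
             "lunch tomorrow noon", "meeting friday morning"]
toy_train_y = [0, 0, 1, 1]
toy_test = ["free prize claim", "lunch friday"]
toy_test_y = [0, 1]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(toy_train), toy_train_y)

train_acc, test_acc, gap = accuracy_gap(
    clf, vec.transform(toy_train), toy_train_y,
    vec.transform(toy_test), toy_test_y,
)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:+.3f}")
```

A large positive gap points toward overfitting, while low values for both accuracies point toward underfitting.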


Building a Predictive System:

In [45]:
input_mail = [input("Enter the mail to be tested: ")]
input_mail_feature = feature_extraction.transform(input_mail)

# predict() returns an array, so take its single element before comparing
output_mail = "SPAM" if model.predict(input_mail_feature)[0] == 0 else "HAM"
print(f"It's a {output_mail} mail.")

Enter the mail to be tested: URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
It's a SPAM mail.
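
For repeated use, the vectorizer and the classifier can be bundled so that raw text goes in and a label comes out. The following is a minimal sketch using scikit-learn's `make_pipeline`, trained on a few made-up messages. The names `spam_filter` and `classify_mail` and the toy data are assumptions for illustration; in the notebook the pipeline would be fitted on `x_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only (0 = spam, 1 = ham).
toy_texts = ["win free prize now", "claim free cash prize",
             "lunch tomorrow at noon", "meeting moved to friday"]
toy_labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF and then the classifier in one call.
spam_filter = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_filter.fit(toy_texts, toy_labels)


def classify_mail(text, clf=spam_filter):
    """Return "SPAM" or "HAM" for a single raw message string."""
    label = clf.predict([text])[0]
    return "SPAM" if label == 0 else "HAM"


print(classify_mail("free prize waiting, claim now"))
```

Keeping both steps in one object avoids the mistake of transforming new mail with a different vectorizer than the one used for training.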
