# SPAM MAIL PREDICTION

In this project, I built a Machine Learning model that predicts whether a mail is spam or ham (not spam), based on the content of the mail. I used Logistic Regression, a supervised ML technique, together with a TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer, a feature extraction technique that converts text data into a numerical representation. Logistic Regression is a statistical method used for binary classification and can be extended to multiclass classification problems. Despite its name, it is a classification algorithm rather than a regression algorithm.
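To give a feel for what Logistic Regression does with a vectorized mail, here is a minimal sketch of the logistic (sigmoid) function turning a weighted sum of features into a probability. The weights, bias and feature values are made-up numbers for illustration only; they are not taken from the trained model in this project.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only to illustrate the computation
weights = np.array([1.5, -2.0, 0.5])   # one weight per feature
bias = -0.25
features = np.array([0.4, 0.1, 0.0])   # e.g. TF-IDF values of three words

score = np.dot(weights, features) + bias   # linear part of the model
probability = sigmoid(score)               # probability of class 1
label = int(probability >= 0.5)            # threshold at 0.5 to get a class

print(score, probability, label)
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.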

For this project, I used the mail dataset available on Kaggle. You can download the dataset from [here](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification/download?datasetVersionNumber=1).

In [None]:
#Importing the libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Data Collection and Preprocessing

In [None]:
df= pd.read_csv('/content/mail_data.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


We need to replace any null values with an empty string.

In [None]:
#Replace null values with an empty string
df = df.where((pd.notnull(df)),'')
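As a small side sketch, the same clean-up can be checked on a toy DataFrame. The `toy` frame below is invented for illustration (it is not the Kaggle data), and `fillna('')` is used as an alternative way to get the same result as the `where` call.

```python
import pandas as pd

# Toy data with one missing message (hypothetical, not from the dataset)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", None, "ok"],
})

# Count missing values per column before cleaning
missing_before = toy.isnull().sum()
print(missing_before)

# Replace every missing value with an empty string
cleaned = toy.fillna("")
missing_after = cleaned.isnull().sum().sum()
print(missing_after)
```

Checking `isnull().sum()` first tells you whether the replacement is needed at all.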

In [None]:
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Now, we will do Label Encoding: spam mails are encoded as 0 and ham mails as 1.

In [None]:
#Labelling spam as 0, ham as 1
df.loc[df['Category']=='spam', 'Category']=0
df.loc[df['Category']=='ham', 'Category']=1

0 represents spam mail and 1 represents ham mail

In [None]:
df['Category']

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object
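Note that the column still has dtype `object` after the `.loc` assignments. A possible alternative, sketched here on a made-up toy column, is to use `map` with a dictionary, which gives an integer column in one step. The names `toy`, `label_map` and `Label` are my own and only used for this illustration.

```python
import pandas as pd

# Toy column (hypothetical, not the real dataset)
toy = pd.DataFrame({"Category": ["ham", "spam", "ham", "spam"]})

# Same encoding as in the project: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map)

print(toy)
print(toy["Label"].dtype)
```

If a category is missing from the dictionary, `map` returns NaN for it, so it is worth checking for nulls afterwards.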

Now, we make 2 separate variables X and Y, where X will contain the content of the mail, and Y will contain the outcome of whether the mail is a spam mail or a ham mail.

In [8]:
X = df['Message']
Y = df['Category']

In [9]:
X

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object

In [10]:
Y

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object

Now, we can split the data into training and test data by using the train_test_split function present in the sklearn library.

In [32]:
#Splitting the data
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size=0.2, random_state=3)
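If one class is much rarer than the other, a plain random split can leave the test set with a different spam/ham ratio than the training set. One option, which I did not use above, is the `stratify` argument of `train_test_split`. Here is a minimal sketch on invented toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 4 spam (0) and 16 ham (1) messages, made up for illustration
messages = pd.Series([f"message {i}" for i in range(20)])
labels = pd.Series([0] * 4 + [1] * 16)

# stratify=labels keeps the spam/ham ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.25, random_state=3, stratify=labels
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

With this toy data, both parts keep the 1:4 ratio of spam to ham.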

Feature extraction

We need to convert the text data into numerical data, because the model can only work with numerical input. We'll use feature extraction for that purpose.

In [33]:
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True) #min_df=1 keeps every word that occurs in at least one document

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test) #we only transform X_test (no fit), so the vectorizer learns nothing from the test data

#Converting Y_train and Y_test into numeric type from object type
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
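To see why we call `fit_transform` on the training data but only `transform` on the test data, here is a small sketch with invented sentences. The vocabulary is learned from the training documents only, so a word that appears only in the test document (here "tomorrow") is simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration
train_docs = ["free prize win now", "see you at lunch", "win a free phone"]
test_docs = ["free lunch tomorrow"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_mat = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_mat = vec.transform(test_docs)        # reuses them, learns nothing new

vocab = vec.vocabulary_                    # dict: word -> column index
print(sorted(vocab))
print(train_mat.shape, test_mat.shape)
```

Both matrices have the same number of columns, which is what lets the model trained on one be applied to the other.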

In [34]:
print(X_train_features)

  (0, 5413)	0.6198254967574347
  (0, 4456)	0.4168658090846482
  (0, 2224)	0.413103377943378
  (0, 3811)	0.34780165336891333
  (0, 2329)	0.38783870336935383
  (1, 4080)	0.18880584110891163
  (1, 3185)	0.29694482957694585
  (1, 3325)	0.31610586766078863
  (1, 2957)	0.3398297002864083
  (1, 2746)	0.3398297002864083
  (1, 918)	0.22871581159877646
  (1, 1839)	0.2784903590561455
  (1, 2758)	0.3226407885943799
  (1, 2956)	0.33036995955537024
  (1, 1991)	0.33036995955537024
  (1, 3046)	0.2503712792613518
  (1, 3811)	0.17419952275504033
  (2, 407)	0.509272536051008
  (2, 3156)	0.4107239318312698
  (2, 2404)	0.45287711070606745
  (2, 6601)	0.6056811524587518
  (3, 2870)	0.5864269879324768
  (3, 7414)	0.8100020912469564
  (4, 50)	0.23633754072626942
  (4, 5497)	0.15743785051118356
  :	:
  (4454, 4602)	0.2669765732445391
  (4454, 3142)	0.32014451677763156
  (4455, 2247)	0.37052851863170466
  (4455, 2469)	0.35441545511837946
  (4455, 5646)	0.33545678464631296
  (4455, 6810)	0.29731757715898277
  (4

Now that we're done with our data pre-processing, we are ready to train the model.

In [35]:
model = LogisticRegression()

In [36]:
# Training the logistic regression model with the training data
model.fit(X_train_features, Y_train)

Now that our model is trained, we will check how accurately it predicts.

In [37]:
#Predicting on training data
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

accuracy_on_training_data

0.9670181736594121

We can conclude that our model predicts about 96.7% of the mails correctly on the data it was trained on. Now let's see its performance on test data.

In [38]:
#Predicting on test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

accuracy_on_test_data

0.9659192825112107

We can conclude that our model predicts about 96.6% of the mails correctly on unseen data. Since this is very close to the training accuracy, the model does not appear to be overfitting.
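Accuracy alone can hide mistakes on the rarer class, so it can be useful to also look at a confusion matrix, precision and recall. I did not compute these in this project; the sketch below uses invented labels and predictions only to show how the calls work, with spam as label 0 as in the rest of the notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

# Hypothetical true labels and predictions (0 = spam, 1 = ham)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

accuracy = accuracy_score(y_true, y_pred)
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print(accuracy, spam_precision, spam_recall)
```

In this toy case the accuracy is 0.9, yet only half of the spam mails are caught (recall 0.5), which accuracy alone would not reveal.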

Now our model is ready to classify new mails based on their content, so let's try it on a couple of examples.

# Building a Predictive System

In [40]:
input_mail = ["Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged"]

#converting it into numerical form
input_mail_vectorized = feature_extraction.transform(input_mail)

#making prediction
prediction = model.predict(input_mail_vectorized)

print(prediction[0])
if(prediction[0]==0):
  print("This is a spam mail")
else:
  print("This is a ham mail")

0
This is a spam mail


In [41]:
input_mail = ["Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!"]

#converting it into numerical form
input_mail_vectorized = feature_extraction.transform(input_mail)

#making prediction
prediction = model.predict(input_mail_vectorized)

print(prediction[0])
if(prediction[0]==0):
  print("This is a spam mail")
else:
  print("This is a ham mail")

1
This is a ham mail
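The two cells above repeat the same vectorize-then-predict steps. One possible way to tidy this up, sketched here under my own assumptions, is to wrap the vectorizer and the model in a scikit-learn `Pipeline` and put the prediction in a small helper. The training messages below are invented toy data and `classify_mail` is a hypothetical helper name, so the predictions of this sketch say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): 0 = spam, 1 = ham
train_messages = [
    "win a free prize now",
    "free ringtone subscription reply yes",
    "claim your cash prize today",
    "are we still meeting for lunch",
    "call me when you get home",
    "how did saturday go",
]
train_labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then Logistic Regression in one object
spam_filter = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_filter.fit(train_messages, train_labels)

def classify_mail(text, model=spam_filter):
    """Return 'spam' or 'ham' for a single raw mail text."""
    label = model.predict([text])[0]
    return "spam" if label == 0 else "ham"

result = classify_mail("free prize waiting for you")
print(result)
```

Because the pipeline holds both steps, there is no risk of forgetting to vectorize a new mail before calling `predict`.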


## CONCLUSION


In this project, I used TF-IDF feature extraction and a supervised learning technique, Logistic Regression, to build a model that predicts whether a mail is spam or not, based on its contents. The model reached an accuracy of about 96.6% on the test data.

## References and Future Work

Below are links to the resources that I found useful while working on this project, where you can learn more about the tools and libraries used in it.


*   Kaggle Dataset: https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification/download?datasetVersionNumber=1
*   Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
*   Numpy user guide: https://numpy.org/doc/stable/user/absolute_beginners.html
*   Logistic Regression user guide: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
*   Feature Extraction user guide: https://scikit-learn.org/stable/modules/feature_extraction.html
