
Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
# for splitting the data into train and test sets
from sklearn.model_selection import train_test_split
# for converting text data into numerical feature vectors
from sklearn.feature_extraction.text import TfidfVectorizer
# the classifier used in this notebook
from sklearn.linear_model import LogisticRegression
# for evaluating the model
from sklearn.metrics import accuracy_score

Data Collection & Pre-Processing

In [2]:
# loading the data from the CSV file into a pandas DataFrame
raw_mail_data=pd.read_csv('/content/mail_data.csv')

In [3]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [4]:
raw_mail_data.isnull().sum()

Category    0
Message     0
dtype: int64

In [5]:
# replace null values with empty strings
new_mail_data=raw_mail_data.where((pd.notnull(raw_mail_data)),'')
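
The `where`/`notnull` idiom above can also be written with `fillna`. The following is a minimal sketch on a toy frame (the `toy` data is made up for illustration and is not taken from `mail_data.csv`), showing that both forms give the same result for object columns.

```python
import pandas as pd

# Hypothetical two-row frame with one missing message (illustration only).
toy = pd.DataFrame({"Category": ["ham", "spam"], "Message": ["ok lar", None]})

# The notebook's approach: keep non-null cells, replace the rest with ''.
via_where = toy.where(pd.notnull(toy), "")

# A shorter spelling of the same idea.
via_fillna = toy.fillna("")

print(via_where["Message"].tolist())
print(via_fillna["Message"].tolist())
```

Since `isnull().sum()` reported no missing values for this dataset, the replacement is a safeguard rather than a required step.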

In [6]:
#printing the first 5 rows of the dataframe
new_mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [8]:
new_mail_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [9]:
#checking no. of rows and columns in the dataframe
new_mail_data.shape

(5572, 2)

Label Encoding - Converting text labels to numeric form


In [10]:
# labelling spam mail as 0 and ham mail as 1
new_mail_data.loc[new_mail_data['Category']=='spam','Category']=0
new_mail_data.loc[new_mail_data['Category']=='ham','Category']=1
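
As an alternative to assigning through `.loc`, the same spam = 0 / ham = 1 encoding can be sketched with `Series.map`. This is not the notebook's method; the toy frame and the `label_map` and `Label` names are assumptions made for illustration. One practical difference is that `map` produces an integer column directly, so no later `astype('int')` conversion is needed.

```python
import pandas as pd

# Toy frame standing in for the notebook's data (values are illustrative).
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "WIN a cash prize now", "ok lar"],
})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map)

# Any category missing from label_map would become NaN, so check for that.
assert toy["Label"].notna().all()

print(toy["Label"].tolist())
print(toy["Label"].dtype)
```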

In [11]:
#separating the data as texts(messages/mails) and labels(1/0 for ham/spam respectively)

X=new_mail_data['Message']
Y=new_mail_data['Category']

In [12]:
#printing all messages/mails
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [13]:
#printing all labels(for spam/ham mails)
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


Splitting data into Training & Test data

In [14]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=3)

In [15]:
# printing the data split (no. of rows) between train and test data
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)
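
The split above is purely random. If the two classes are unevenly represented, as is common for spam data, passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below uses made-up labels (80 ham, 20 spam) rather than the notebook's data, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 80 ham (1) and 20 spam (0).
labels = pd.Series([1] * 80 + [0] * 20)
texts = pd.Series([f"message {i}" for i in range(100)])

# stratify=labels asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=3, stratify=labels
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # share of ham in each split
```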


Feature Extraction

In [17]:
# transform the text data into feature vectors that can be used as input to the logistic regression

feature_extraction=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)

In [24]:
# fit the vectorizer on X_train and convert the text into numeric features
X_train_features=feature_extraction.fit_transform(X_train)

# reuse the fitted vocabulary for X_test (transform only, no refitting)
X_test_features=feature_extraction.transform(X_test)

In [23]:
Y_train

3075    1
1787    1
1614    1
4304    1
3266    0
       ..
789     0
968     1
1667    1
3321    1
1688    0
Name: Category, Length: 4457, dtype: object

In [25]:
Y_test

2632    0
454     1
983     0
1282    1
4610    1
       ..
4827    1
5291    1
3325    1
3561    1
1136    1
Name: Category, Length: 1115, dtype: object

In [29]:
# Y_train and Y_test have dtype object, so convert them to integers

Y_train=Y_train.astype('int')
Y_test=Y_test.astype('int')

In [30]:
Y_train

3075    1
1787    1
1614    1
4304    1
3266    0
       ..
789     0
968     1
1667    1
3321    1
1688    0
Name: Category, Length: 4457, dtype: int64

In [31]:
Y_test

2632    0
454     1
983     0
1282    1
4610    1
       ..
4827    1
5291    1
3325    1
3561    1
1136    1
Name: Category, Length: 1115, dtype: int64

In [None]:
print(X_train_features)

Training model

In [34]:
model=LogisticRegression()

In [35]:
# training the logistic regression model
model.fit(X_train_features,Y_train)

LogisticRegression()
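
Besides hard 0/1 labels, a fitted `LogisticRegression` also exposes class probabilities through `predict_proba`, with columns ordered as in `classes_`. The sketch below trains on a tiny made-up corpus (an assumption for illustration, not the notebook's data) only to show the shape and ordering of the output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training texts; labels follow the notebook (0 = spam, 1 = ham).
texts = [
    "win cash prize today",
    "urgent prize claim",
    "lunch meeting tomorrow",
    "project meeting notes",
]
labels = [0, 0, 1, 1]

vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(texts), labels)

# classes_ gives the column order of predict_proba.
print(clf.classes_)

# One row per input message, one column per class; each row sums to 1.
proba = clf.predict_proba(vec.transform(["cash prize"]))
print(proba)
```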

Evaluating the trained model

In [38]:
# prediction on training data
prediction_on_training_data=model.predict(X_train_features)

accuracy_on_training_data=accuracy_score(Y_train,prediction_on_training_data)

In [41]:
print("Accuracy on training data is ",accuracy_on_training_data)


Accuracy on training data is  0.9670181736594121


In [43]:
# prediction on test data
prediction_on_testing_data=model.predict(X_test_features)

accuracy_on_test_data=accuracy_score(Y_test,prediction_on_testing_data)

In [45]:
print("Accuracy on test data is ",accuracy_on_test_data)

Accuracy on test data is  0.9659192825112107
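
Accuracy alone can hide how well the rarer class is handled. As a hedged illustration, the sketch below computes a confusion matrix plus precision and recall for the spam class. The `y_true` and `y_pred` lists are invented for the example and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels only: 0 = spam, 1 = ham.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Treat spam (label 0) as the positive class when scoring.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)
print(spam_precision, spam_recall)
```

In this toy case 8 of 10 predictions are correct, yet only one of the three spam messages is caught, which the recall value makes visible.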


Building a predictive system

In [54]:
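def predict(input_mail):
  """Print whether the first mail in input_mail (a list of strings) is ham or spam."""
  # convert the text into numeric features using the fitted vectorizer
  input_data_features=feature_extraction.transform(input_mail)
  # make the prediction
  prediction=model.predict(input_data_features)
  if prediction[0]==1:
    print('Ham mail')
  else:
    print('Spam mail')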

In [60]:
# take a ham mail from our data set and check the output
input_ham_mail=["As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"]


In [61]:
predict(input_ham_mail)

Ham mail


In [62]:
# take a spam mail from our data set and check the output
input_spam_mail=["WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."]

In [63]:
predict(input_spam_mail)

Spam mail
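
One possible way to package the vectorizer and the classifier together is a scikit-learn `Pipeline`, so that raw text goes in and labels come out in a single call. The sketch below is an assumption-laden illustration, not the notebook's implementation: the corpus is made up, and `spam_pipeline` and `classify` are hypothetical names. It returns labels for every message instead of printing only the first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; labels follow the notebook (0 = spam, 1 = ham).
texts = [
    "win cash prize today",
    "urgent prize claim code",
    "lunch meeting tomorrow",
    "project meeting notes attached",
]
labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(texts, labels)

def classify(messages):
    """Return 'ham' or 'spam' for each message in a list of strings."""
    predicted = spam_pipeline.predict(messages)
    return ["ham" if label == 1 else "spam" for label in predicted]

result = classify(["claim your cash prize", "lunch tomorrow?"])
print(result)
```

As with `predict` above, the input must be a list of strings; passing a bare string to the vectorizer raises an error.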
