# Problem Statement 
* The objective is to build a predictive model that distinguishes spam from legitimate (ham) emails.
* The model relies on features extracted from the message text to make its predictions, helping users filter out unwanted and potentially harmful messages.
* Because the input is raw text, Natural Language Processing (NLP) techniques are used to clean and vectorize it.



In [1]:
import numpy as np
import pandas as pd

# for text data preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


In [2]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [4]:
mail_data = pd.read_csv('/kaggle/input/spam-mail-predict/Day17_Mail_Data.csv') 
mail_data

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [5]:
print('The size of Dataframe is: ', mail_data.shape)
print('The Column Name, Record Count and Data Types are as follows: ')
mail_data.info()

The size of Dataframe is:  (5572, 2)
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in mail_data.columns if mail_data[feature].dtype != 'O']
categorical_features = [feature for feature in mail_data.columns if mail_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))


We have 0 numerical features : []

We have 2 categorical features : ['Category', 'Message']


In [7]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
total=mail_data.isnull().sum().sort_values(ascending=False)
percent=(mail_data.isnull().sum()/mail_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])


Missing Value Presence in different columns of DataFrame are as follows : 


,Total,Percent
Category,0,0.0
Message,0,0.0


In [8]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
mail_data.describe(include='object')


Summary Statistics of categorical features for DataFrame are as follows:


,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30


In [9]:
mail_data = mail_data.where((pd.notnull(mail_data)),'')
mail_data

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [10]:
# encode the labels: spam -> 0, ham -> 1 (ham is therefore the positive class for sklearn metrics)
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1
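
The same encoding can also be written as a single `map` call, which produces an integer column straight away instead of an `object` column. The sketch below is an alternative, not the approach used in this notebook; the small DataFrame `toy` and the dictionary `label_map` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# hypothetical stand-in for mail_data, with the same two columns
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar...", "Free entry in 2 a wkly comp", "U dun say so early hor"],
})

# same convention as the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_map)

print(toy["Category"].tolist())  # [1, 0, 1]
print(toy["Category"].dtype)     # an integer dtype, so no later astype('int') is needed
```

Any category missing from `label_map` would become `NaN`, so this form assumes the column contains only the two known labels.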


In [11]:
mail_data

,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,0,This is the 2nd time we have tried 2 contact u...
5568,1,Will ü b going to esplanade fr home?
5569,1,"Pity, * was in mood for that. So...any other s..."
5570,1,The guy did some bitching but I acted like i'd...


In [12]:
mail_data['Category'].value_counts()


Category
1    4825
0     747
Name: count, dtype: int64

In [13]:
porter_stemmer = PorterStemmer()


In [14]:
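# build the stopword set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into tokens
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and reduce the remaining words to their Porter stems
    stemmed_content = [porter_stemmer.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content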
mail_data['Message'] = mail_data['Message'].apply(stemming)
mail_data['Message']


0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri wkli comp win fa cup final tkt st m...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    nd time tri contact u u pound prize claim easi...
5568                                b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: Message, Length: 5572, dtype: object
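
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same three operations (letters only, stopword removal, Porter stemming) to the start of the first message. It assumes a tiny hand-written stopword set, `demo_stopwords`, so that it runs without downloading NLTK data; the notebook itself uses NLTK's full English list. `clean_message` and `demo_stemmer` are hypothetical names.

```python
import re
from nltk.stem.porter import PorterStemmer

# hypothetical mini stopword list; the notebook uses NLTK's full English list
demo_stopwords = {"until", "only", "in", "the", "a"}
demo_stemmer = PorterStemmer()

def clean_message(text):
    """Keep letters only, lowercase, drop stopwords, and stem each word."""
    tokens = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(demo_stemmer.stem(tok) for tok in tokens if tok not in demo_stopwords)

cleaned = clean_message("Go until jurong point, crazy.. Available only")
print(cleaned)  # go jurong point crazi avail
```

Note that the regular expression also removes digits, so tokens such as "2" or "2nd" lose their numeric part before stemming.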

In [15]:
# separating the data and labels
X = mail_data['Message'] # Feature matrix
y = mail_data['Category'] # Target variable

In [16]:
X

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri wkli comp win fa cup final tkt st m...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    nd time tri contact u u pound prize claim easi...
5568                                b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: Message, Length: 5572, dtype: object

In [17]:
y

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object

In [18]:
# convert the labels in y from object dtype to integers
y = y.astype('int')

In [19]:
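# learn the vocabulary and IDF weights, then convert each message to a TF-IDF vector
# note: fitting on all of X before the train/test split lets test-set statistics leak into the features
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)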
X

<5572x6296 sparse matrix of type '<class 'numpy.float64'>'
	with 45096 stored elements in Compressed Sparse Row format>
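
The sparse matrix above has one row per message and one column per vocabulary term. A small sketch on three hypothetical, already-stemmed messages (`toy_messages`) shows two defaults of `TfidfVectorizer` that shape the result: its token pattern keeps only tokens of two or more characters, so a single-letter token like "u" gets no column, and each row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical, already-stemmed messages
toy_messages = ["free entri win", "ok lar joke wif u", "free prize claim"]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_messages)

# vocabulary terms in alphabetical order; "u" is absent
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_tfidf.shape)  # (3, 9)

# every row has unit Euclidean length because norm='l2' is the default
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```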

In [20]:
print(X)

  (0, 6135)	0.23616756554565888
  (0, 5957)	0.19460776670194488
  (0, 4091)	0.24055424511726686
  (0, 2932)	0.28506031120996994
  (0, 2827)	0.3522946643655987
  (0, 2245)	0.19460776670194488
  (0, 2208)	0.1649859743034801
  (0, 2171)	0.14066343975170745
  (0, 1169)	0.27282796669086984
  (0, 964)	0.29761995607435426
  (0, 738)	0.29761995607435426
  (0, 736)	0.33630333732147566
  (0, 379)	0.26350491969128115
  (0, 190)	0.3522946643655987
  (1, 6056)	0.44597659211687757
  (1, 3785)	0.564793662023427
  (1, 3760)	0.2809319560263009
  (1, 2960)	0.4218982744467187
  (1, 2794)	0.4745440766926726
  (2, 6101)	0.21369536090695063
  (2, 6067)	0.16011115093017092
  (2, 5695)	0.13727833879237866
  (2, 5536)	0.2476330040187214
  (2, 5420)	0.1320245012320154
  (2, 5131)	0.22058857181065877
  :	:
  (5567, 784)	0.15667410716389937
  (5567, 724)	0.2920652264491494
  (5568, 2457)	0.37457404553349233
  (5568, 2171)	0.29597505521175127
  (5568, 1996)	0.5740672391289212
  (5568, 1704)	0.6652366917601374
  (5

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)


In [22]:
print("For X ; ",X.shape, X_train.shape, X_test.shape)
print("\n\n")
print("For y ; ",y.shape, y_train.shape, y_test.shape)


For X ;  (5572, 6296) (4457, 6296) (1115, 6296)



For y ;  (5572,) (4457,) (1115,)
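
In this notebook the vectorizer is fitted on all messages before the split, so the vocabulary and IDF weights also reflect the test messages. One common way to avoid that is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline, so that the TF-IDF step is fitted on training data only. The sketch below illustrates the idea on a hypothetical toy corpus (`toy_texts`, `toy_labels`); it is not the procedure that produced the results reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus (already cleaned); 1 = ham, 0 = spam as in the notebook
toy_texts = [
    "ok lar joke wif u oni", "nah think goe usf live around",
    "b go esplanad fr home", "piti mood suggest",
    "rofl true name", "u dun say earli hor",
    "see later home dinner", "call later meet",
    "free entri wkli comp win fa cup", "win cash prize claim call",
    "urgent claim free prize", "free tone txt win",
]
toy_labels = [1] * 8 + [0] * 4

# split the raw text first, keeping the class ratio in both parts
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=45
)

# the pipeline fits TF-IDF on the training text only, then trains the classifier
leak_free_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
leak_free_model.fit(txt_train, lab_train)

# at prediction time the test text is transformed with the training vocabulary
toy_pred = leak_free_model.predict(txt_test)
print(len(txt_train), len(txt_test))  # 9 3
```

Words that occur only in the test messages are simply ignored at prediction time, which mirrors what happens with genuinely unseen mail.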


In [23]:
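# model classes to compare; each one is instantiated with its default hyperparameters
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    # train on the training split and predict the held-out test split
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # precision, recall and F1 use sklearn's default pos_label=1, i.e. the ham class
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))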


In [24]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df


Model,Accuracy,Precision,Recall,F1 Score
Logistic Regression,0.953363,0.949803,0.998965,0.973764
SVM,0.972197,0.968907,1.0,0.984208
Decision Tree,0.964126,0.976337,0.982402,0.97936
Random Forest,0.975785,0.97281,1.0,0.986217
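
Because ham is encoded as 1 and scikit-learn's `precision_score`, `recall_score` and `f1_score` default to `pos_label=1`, the precision, recall and F1 columns above describe the ham class. The sketch below uses hypothetical labels (`demo_true`, `demo_pred`) to show how the same predictions look from the spam side when `pos_label=0` is passed, and how the confusion matrix exposes both kinds of error.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# hypothetical labels: 1 = ham, 0 = spam (same encoding as the notebook)
demo_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
demo_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # one spam message slips through as ham

# default pos_label=1: metrics for the ham class
ham_precision = precision_score(demo_true, demo_pred)
ham_recall = recall_score(demo_true, demo_pred)

# pos_label=0: metrics for the spam class
spam_precision = precision_score(demo_true, demo_pred, pos_label=0)
spam_recall = recall_score(demo_true, demo_pred, pos_label=0)

# rows are true labels, columns are predicted labels, both ordered [spam, ham]
cm = confusion_matrix(demo_true, demo_pred, labels=[0, 1])

print(f"ham : precision={ham_precision:.3f} recall={ham_recall:.3f}")    # 0.857, 1.000
print(f"spam: precision={spam_precision:.3f} recall={spam_recall:.3f}")  # 1.000, 0.750
print(cm)  # [[3 1], [0 6]]
```

In this toy case a ham recall of 100% coexists with a spam recall of only 75%, which is why it helps to report the spam-side metrics for a spam filter as well.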


# Inference
* Four models were evaluated on the held-out test set of 1,115 messages: Logistic Regression, SVM, Decision Tree, and Random Forest.

* All models reach high accuracy, with SVM and Random Forest leading the way at approximately 97.22% and 97.58%, respectively. This suggests that these models are effective at separating spam from legitimate emails.

* Because ham is encoded as 1, the precision, recall, and F1 scores describe the ham class. SVM and Random Forest again excel here: their recall of 100% means that no legitimate email in the test set was flagged as spam, while their precision of 96.89% and 97.28% means that a small share of the messages predicted as ham were actually spam. Decision Tree has the highest precision (97.63%) but a lower recall (98.24%).

* In summary, SVM and Random Forest models stand out as strong contenders for spam mail prediction, with high accuracy, precision, and recall scores. 

* However, the choice of the best model may also depend on specific requirements such as computational efficiency and interpretability.
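
Computational efficiency can be compared with a simple timing loop around `fit`. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` (`X_demo`, `y_demo`) as a stand-in, whereas in this notebook the same loop would run on `X_train` and `y_train`. Absolute timings depend on the machine, so only the relative order is informative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data; in the notebook this would be X_train, y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=50, random_state=45)

# wall-clock training time per model, keyed by class name
fit_seconds = {}
for model_cls in [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]:
    start = time.perf_counter()
    model_cls().fit(X_demo, y_demo)
    fit_seconds[model_cls.__name__] = time.perf_counter() - start

# print the models from fastest to slowest
for name, seconds in sorted(fit_seconds.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```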