# Spam Mail Prediction

**WORKFLOW:**

1. **Data Collection**: First, we gather the email data, which includes both spam and ham emails. This data will be used to train our machine learning model.

2. **Data Preprocessing**: Next, we preprocess the data by converting the raw message text into numerical feature vectors (TF-IDF scores) that a machine learning model can work with.

3. **Data Splitting and Model Training**: After preprocessing, we split the dataset into training and testing data. The training data is used to fit the machine learning model, while the test data is held back to evaluate it. We will use a logistic regression model, which is well suited to binary classification problems, where there are exactly two classes to predict: in this case, spam and ham.

4. **Model Prediction**: Once the model is trained, it can predict whether a new, unseen email is spam or ham.
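
The four steps above can be sketched end to end on a handful of messages. This is only an illustrative sketch: the toy messages, the `toy_*` names, and the use of `make_pipeline` to chain the vectorizer and the classifier are assumptions for the example, not part of the notebook that follows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: hypothetical toy messages (not taken from mail_data.csv); 0 = spam, 1 = ham
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after lunch",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# Steps 2-3: turn text into TF-IDF vectors and fit logistic regression on them
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_messages, toy_labels)

# Step 4: predict the label of a new message
toy_prediction = toy_pipeline.predict(["free cash prize"])
print(toy_prediction)
```

The notebook below performs the same steps one at a time, with a proper train/test split, so that each intermediate result can be inspected.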

## Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Data Collection & Pre-Processing

In [2]:
# Loading the data from csv file to a pandas dataframe
raw_mail_data = pd.read_csv('mail_data.csv')

In [3]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [4]:
# Replace missing (null) values with an empty string
mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')
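
As a side note, the `where`/`notnull` combination used here should behave like `fillna('')`. The sketch below checks this on a made-up two-row frame (`toy_frame` is hypothetical and stands in for the real data) and also shows how missing values could be counted beforehand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing message
toy_frame = pd.DataFrame({"Category": ["ham", "spam"], "Message": ["Ok lar", np.nan]})

# Count missing values per column before replacing them
missing_per_column = toy_frame.isnull().sum()
print(missing_per_column)

# Two ways to replace missing values with an empty string
via_where = toy_frame.where(pd.notnull(toy_frame), "")
via_fillna = toy_frame.fillna("")

print(via_where["Message"].tolist())
print(via_fillna["Message"].tolist())
```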

In [5]:
# Printing the first 5 rows of the dataframe
mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [6]:
# Checking the number of rows and columns in the dataframe
mail_data.shape

(5572, 2)

## Label Encoding

In [7]:
# Encode the labels as numbers: spam mail as 0, ham mail as 1

# Set 'Category' to 0 in every row where it is 'spam'
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0

# Set 'Category' to 1 in every row where it is 'ham'
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

**Encoding:** spam → 0, ham → 1
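
An alternative way to encode the labels, sketched here on a made-up series, is `Series.map` with a dictionary. One possible advantage is that the result is an integer column straight away, whereas assigning through `.loc` leaves the column with `object` dtype. The `toy_categories` and `label_map` names are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for mail_data['Category']
toy_categories = pd.Series(["ham", "spam", "ham"], name="Category")

# Same encoding as in the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
encoded = toy_categories.map(label_map)

print(encoded.tolist())
print(encoded.dtype)
```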

In [8]:
# Separate the data into message texts (X) and labels (Y)

X = mail_data['Message']

Y = mail_data['Category']

In [9]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [10]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


## Splitting the data into training data & testing data

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [12]:
print(X.shape, X_train.shape, X_test.shape)

(5572,) (4457,) (1115,)
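
When one class is much rarer than the other, a plain random split can leave the two subsets with slightly different class proportions. A possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up data with 20% spam to show that both subsets keep that proportion.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 spam (0) and 40 ham (1) messages
toy_messages = [f"message {i}" for i in range(50)]
toy_labels = [0] * 10 + [1] * 40

# stratify=toy_labels keeps the spam/ham ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_messages, toy_labels, test_size=0.2, random_state=3, stratify=toy_labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```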


## Feature Extraction

In [13]:
# Transform the text data into feature vectors that can be used as input to the Logistic Regression model
feature_extraction = TfidfVectorizer(min_df = 1, # Keep every term that appears in at least 1 document (so no term is dropped)
                                     stop_words = 'english', # Remove common English words (e.g., "the", "and") that carry little signal
                                     lowercase = True) # Convert all text to lowercase before tokenizing

# Fit the TfidfVectorizer on the training data (learning the vocabulary and IDF weights) and transform it into feature vectors
X_train_features = feature_extraction.fit_transform(X_train)
# Only transform the test data: the vocabulary and IDF weights must come from the training data alone,
# so that no information from the test set leaks into the features
X_test_features = feature_extraction.transform(X_test)

# Convert Y_train and Y_test to integer type as they are currently of object type
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
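
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on invented sentences (the `toy_*` names and texts are assumptions for illustration). A word that appears only in the test text is not part of the learned vocabulary, so it is silently ignored, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts; "pizza" never occurs in the training texts
toy_train = ["free prize draw", "lunch with friends", "free lunch"]
toy_test = ["free pizza"]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training texts
toy_train_features = toy_vectorizer.fit_transform(toy_train)
# transform reuses them unchanged; unseen words contribute nothing
toy_test_features = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_features.shape, toy_test_features.shape)
```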

In [14]:
# Each row is the feature vector of one message; each stored entry is the TF-IDF weight of one term in that message
# Printing a sparse matrix lists only its non-zero entries
# (document index, term index) -> TF-IDF score
print(X_train_features)

  (0, 5413)	0.6198254967574347
  (0, 4456)	0.4168658090846482
  (0, 2224)	0.413103377943378
  (0, 3811)	0.34780165336891333
  (0, 2329)	0.38783870336935383
  (1, 4080)	0.18880584110891163
  (1, 3185)	0.29694482957694585
  (1, 3325)	0.31610586766078863
  (1, 2957)	0.3398297002864083
  (1, 2746)	0.3398297002864083
  (1, 918)	0.22871581159877646
  (1, 1839)	0.2784903590561455
  (1, 2758)	0.3226407885943799
  (1, 2956)	0.33036995955537024
  (1, 1991)	0.33036995955537024
  (1, 3046)	0.2503712792613518
  (1, 3811)	0.17419952275504033
  (2, 407)	0.509272536051008
  (2, 3156)	0.4107239318312698
  (2, 2404)	0.45287711070606745
  (2, 6601)	0.6056811524587518
  (3, 2870)	0.5864269879324768
  (3, 7414)	0.8100020912469564
  (4, 50)	0.23633754072626942
  (4, 5497)	0.15743785051118356
  :	:
  (4454, 4602)	0.2669765732445391
  (4454, 3142)	0.32014451677763156
  (4455, 2247)	0.37052851863170466
  (4455, 2469)	0.35441545511837946
  (4455, 5646)	0.33545678464631296
  (4455, 6810)	0.29731757715898277
  (4

## Training the Model
### Logistic Regression

In [15]:
model = LogisticRegression() # Initialize the Logistic Regression model

In [16]:
# Training the Logistic Regression model with the training data
# Fitting adjusts the model's coefficients so that they capture the relationship between the feature vectors (X_train_features)
# and the true labels (Y_train). The fitted model can then be used to predict labels for new, unseen data.
model.fit(X_train_features, Y_train)
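
For intuition about what the fitted model computes, the sketch below reproduces a binary logistic regression's predicted probabilities by hand: a linear score `z = w*x + b` passed through the sigmoid function. The one-dimensional `toy_X` and `toy_y` are invented for illustration and have nothing to do with the mail data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D feature with two classes
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

# Linear score z = w*x + b for every sample
z = (toy_X @ toy_model.coef_.T + toy_model.intercept_).ravel()

# The sigmoid turns the score into the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = toy_model.predict_proba(toy_X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# Class 1 is predicted whenever the score is positive (probability above 0.5)
manual_labels = (z > 0).astype(int)
print(np.array_equal(manual_labels, toy_model.predict(toy_X)))
```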

## Evaluating the trained model

In [17]:
# Prediction on training data

# Predict labels for the training data using the trained model
prediction_on_training_data = model.predict(X_train_features)

# Calculate the accuracy of the predictions compared to the true labels
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [18]:
print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9670181736594121


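**Observation:** The model classifies about 96.7% of the training emails correctly, i.e., roughly 97 out of every 100 emails it was trained on. Training accuracy alone does not tell us how the model behaves on new emails; that is what the test set is for. There is also no universal threshold for a "good" accuracy: the score should be compared with a simple baseline, such as always predicting the more frequent class, which can already score highly when one class dominates the data.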
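
To illustrate why accuracy should be read against a baseline, the sketch below uses ten made-up labels of which eight are ham. A "classifier" that always answers ham already reaches 80% accuracy, and a confusion matrix shows which kind of mistake a model makes. All labels here are invented for the example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = ham, 0 = spam; 8 of the 10 messages are ham
toy_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_ham = [1] * 10                              # trivial baseline
toy_predicted = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]     # misses one spam message

baseline_accuracy = accuracy_score(toy_true, always_ham)
model_accuracy = accuracy_score(toy_true, toy_predicted)
print(baseline_accuracy, model_accuracy)

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
matrix = confusion_matrix(toy_true, toy_predicted, labels=[0, 1])
print(matrix)
```

The same comparison could be made for the real data by checking the class counts of `Y_train` before interpreting the accuracy scores.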

In [19]:
# Prediction on test data

# Predict labels for the test data using the trained model
prediction_on_test_data = model.predict(X_test_features)

# Calculate the accuracy of the predictions compared to the true labels
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [20]:
print('Accuracy on test data : ', accuracy_on_test_data)

Accuracy on test data :  0.9659192825112107


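**Question:** Why do we measure accuracy on the training data (against `Y_train`) as well as on the test data (against `Y_test`)?

**Answer:** Predictions must always be compared with the true labels of the same samples, so predictions on `X_train_features` are scored against `Y_train` and predictions on `X_test_features` against `Y_test`. Measuring accuracy on both sets is important because comparing the two scores helps us identify whether the model is overfitting.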

**Overfitting** occurs when a model performs very well on the training data but poorly on new, unseen data. Here's how it works:

- **High Training Accuracy**: If the model has high accuracy on the training data (e.g., 96%), it means the model has learned the training data well.

- **Low Test Accuracy**: If the model has low accuracy on the test data (e.g., 60%), it indicates that the model does not generalize well to new data.

A large difference between training accuracy and test accuracy suggests overfitting, meaning the model is too closely tailored to the training data and does not perform well on new, unseen data. This usually happens when the model is too complex or has learned the noise in the training data rather than the underlying patterns. In this notebook the two scores are almost identical (about 96.7% on training data versus 96.6% on test data), so there is no sign of overfitting.
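
The comparison can be written as a small check. The two accuracy values are the ones printed earlier in this notebook; the threshold `MAX_ACCEPTABLE_GAP` is a hypothetical value chosen only for illustration, since what counts as a "large" gap depends on the problem.

```python
# Accuracies printed earlier in this notebook
accuracy_on_training_data = 0.9670181736594121
accuracy_on_test_data = 0.9659192825112107

# Hypothetical threshold; not a standard value
MAX_ACCEPTABLE_GAP = 0.05

gap = accuracy_on_training_data - accuracy_on_test_data
verdict = "possible overfitting" if gap > MAX_ACCEPTABLE_GAP else "no sign of overfitting"
print(f"train/test accuracy gap: {gap:.4f} ({verdict})")
```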

## Building a Predictive System

In [21]:
# This is the input data; it doesn't include the label, as the model will predict it.
input_mail = ["A gram usually runs like  &lt;#&gt; , a half eighth is smarter though and gets you almost a whole second gram for  &lt;#&gt;"] 

# Convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# Making Prediction
prediction = model.predict(input_data_features)
print(prediction)

if prediction[0] == 1:
    print("Ham mail")
else:
    print("Spam mail")

[1]
Ham mail
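
For repeated use, the prediction steps above could be wrapped in a small helper. The sketch below is one possible shape for it: `classify_mail` is a hypothetical function name, and the toy messages stand in for the real training data so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_mail(text, vectorizer, classifier):
    """Return 'Ham mail' or 'Spam mail' for one message (1 = ham, 0 = spam)."""
    features = vectorizer.transform([text])   # the vectorizer expects a list of texts
    label = classifier.predict(features)[0]
    return "Ham mail" if label == 1 else "Spam mail"


# Hypothetical toy training set standing in for mail_data.csv
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after lunch",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_classifier = LogisticRegression()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

ham_result = classify_mail("are we meeting for lunch", toy_vectorizer, toy_classifier)
spam_result = classify_mail("win free cash", toy_vectorizer, toy_classifier)
print(ham_result, "|", spam_result)
```

With the objects from this notebook, the call would take `feature_extraction` and `model` as the last two arguments.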
