# **Text classification in NLP (Natural Language Processing)**

# **Problem Statement**
With the rapid rise of digital communication via emails and SMS, spam detection has become crucial for filtering out unwanted or harmful content. In this project, I will use the text data from the spam.tsv dataset to develop a model that can accurately classify messages as either "spam" or "ham."

# Overview
The spam.tsv dataset is designed for a classic spam detection task, where we aim to classify text messages as either "spam" or "ham" (not spam). This dataset contains two columns:


*   **Class**: The target label, indicating whether a message is "spam" or "ham."
*   **Message**: The text content of the message that needs to be analyzed.

The goal is to build a text classification model that predicts whether a given message is spam or ham based on its content.


# Data Preprocessing Steps

# 1. Lowercasing:
All the text in the messages was converted to lowercase. This helps standardize the data by removing case sensitivity. For example, "Free" and "free" are treated as the same word.

# 2. HTML Tag Removal:
Any HTML tags (e.g., `<br>`, `<p>`) were stripped from the text. This matters when the text may have been scraped from websites or contains formatting markup.

# 3. Punctuation Removal:
I removed punctuation marks (such as commas, periods, and exclamation points) from the messages. In a bag-of-words model, punctuation usually adds little information about a message's intent, so removing it reduces noise in the vocabulary.
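
The three cleaning steps above can be chained into a single helper. This is only a sketch: `clean_message` and the sample string are hypothetical and not part of the notebook. It also illustrates why tag removal should run before punctuation removal, since `<`, `>`, and `/` are themselves members of `string.punctuation`.

```python
import re
import string

def clean_message(text):
    """Lowercase, strip HTML tags, then drop ASCII punctuation (hypothetical helper)."""
    text = text.lower()
    # Non-greedy match of anything between '<' and the next '>'
    text = re.sub(r"<.*?>", "", text)
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "FREE entry! <br>Text WIN to 80086, now."
cleaned = clean_message(sample)
print(cleaned)  # free entry text win to 80086 now
```

If the punctuation step ran first, `<br>` would be reduced to the stray token `br` instead of being removed.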


# 4. Text Vectorization using CountVectorizer:

After cleaning the text, I converted it into numerical form using the CountVectorizer technique. This method transforms the text into a matrix where each row corresponds to a message, and each column corresponds to a word (or token). The value in each cell represents how many times the word appears in the message.


# Step 1. Importing libraries

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2. Loading the dataset

In [28]:
# Tab-separated file with no header row, so the column names are supplied explicitly
df = pd.read_csv('/content/spam.tsv', sep='\t', names=['class', 'message'])
df

index,class,message
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!
...,...,...
5562,spam,This is the 2nd time we have tried 2 contact u...
5563,ham,Will ü b going to esplanade fr home?
5564,ham,"Pity, * was in mood for that. So...any other s..."
5565,ham,The guy did some bitching but I acted like i'd...


# Step 3. Data preprocessing





In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   class    5567 non-null   object
 1   message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB


In [31]:
df['class'].value_counts()

class,count
ham,4821
spam,746
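
These counts show that the classes are imbalanced, which matters when reading accuracy figures later. As a quick worked calculation (a sketch using only the two counts above), the share of spam and the accuracy of a trivial "always predict ham" baseline are:

```python
ham, spam = 4821, 746
total = ham + spam  # 5567 messages in all

spam_share = spam / total         # fraction of messages labelled spam
majority_baseline = ham / total   # accuracy of always predicting "ham"

print(f"spam share: {spam_share:.3f}")                # spam share: 0.134
print(f"majority baseline: {majority_baseline:.3f}")  # majority baseline: 0.866
```

Any model therefore needs to beat roughly 87% accuracy to be better than ignoring the message text entirely.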


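Next, I add a "length" column that stores the number of characters in each message, computed with Python's built-in `len`. Because it is computed before cleaning, it reflects the raw message length.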

In [34]:
df['length'] = df['message'].apply(len)
df


index,class,message,length
0,ham,I've been searching for the right words to tha...,196
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
2,ham,"Nah I don't think he goes to usf, he lives aro...",61
3,ham,Even my brother is not like to speak with me. ...,77
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!,36
...,...,...,...
5562,spam,This is the 2nd time we have tried 2 contact u...,160
5563,ham,Will ü b going to esplanade fr home?,36
5564,ham,"Pity, * was in mood for that. So...any other s...",57
5565,ham,The guy did some bitching but I acted like i'd...,125


**Class label encoding:** I mapped the "class" column from categorical values to numerical values:

1. "spam" → 1
2. "ham" → 0




In [35]:

df['class'] = df['class'].map({'ham': 0, 'spam': 1})
df

index,class,message,length
0,0,I've been searching for the right words to tha...,196
1,1,Free entry in 2 a wkly comp to win FA Cup fina...,155
2,0,"Nah I don't think he goes to usf, he lives aro...",61
3,0,Even my brother is not like to speak with me. ...,77
4,0,I HAVE A DATE ON SUNDAY WITH WILL!!!,36
...,...,...,...
5562,1,This is the 2nd time we have tried 2 contact u...,160
5563,0,Will ü b going to esplanade fr home?,36
5564,0,"Pity, * was in mood for that. So...any other s...",57
5565,0,The guy did some bitching but I acted like i'd...,125


In [40]:
# Convert every message to lowercase
df['message'] = df['message'].str.lower()


In [43]:
# Strip HTML tags such as <br> or <p>
import re
def remove_html(text):
    pattern = re.compile(r'<.*?>')  # non-greedy, so each tag is matched separately
    return pattern.sub('', text)


In [44]:
df['message'] = df['message'].apply(remove_html)
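
One caveat with the `<.*?>` pattern is that it removes any text between a `<` and the next `>`, not only real tags. In informal SMS text, where something like `<3` can appear, this may delete words. The sketch below contrasts it with a stricter pattern. `STRICT_TAG` and the sample strings are assumptions for illustration, and the stricter pattern is just one possible alternative, not what the notebook uses.

```python
import re

LOOSE_TAG = re.compile(r"<.*?>")               # same pattern as remove_html
STRICT_TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # assumed alternative: '<' must start a tag name

text = "love you <3 and 5 > 4"
loose_result = LOOSE_TAG.sub("", text)
strict_result = STRICT_TAG.sub("", text)
print(repr(loose_result))   # 'love you  4'
print(repr(strict_result))  # 'love you <3 and 5 > 4'

# The stricter pattern still removes ordinary tags
html_result = STRICT_TAG.sub("", "win <b>cash</b> now")
print(html_result)  # win cash now
```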

In [46]:
# Remove ASCII punctuation (every character in string.punctuation)
import string
def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

In [47]:
df['message'] = df['message'].apply(remove_punct)

In [48]:
df

index,class,message,length
0,0,ive been searching for the right words to than...,196
1,1,free entry in 2 a wkly comp to win fa cup fina...,155
2,0,nah i dont think he goes to usf he lives aroun...,61
3,0,even my brother is not like to speak with me t...,77
4,0,i have a date on sunday with will,36
...,...,...,...
5562,1,this is the 2nd time we have tried 2 contact u...,160
5563,0,will ü b going to esplanade fr home,36
5564,0,pity was in mood for that soany other suggest...,57
5565,0,the guy did some bitching but i acted like id ...,125


In [60]:

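# Build the count matrix: one row per message, one column per token
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x = cv.fit_transform(df['message']).toarray()  # dense copy; MultinomialNB also accepts the sparse result
x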




array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [65]:
# Target vector: the encoded class labels
y = df['class'].values
# or, equivalently, select the first column by position
y = df.iloc[:, 0].values
y


array([0, 1, 0, ..., 0, 0, 0])

# Step 4. Splitting the data and applying the Multinomial Naive Bayes algorithm

In [67]:
# Hold out 20% of the messages as a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [69]:

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
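
In the cells above, the vectorizer learns its vocabulary from every message before the split, so the test messages influence the feature space. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vocabulary is learned from the training portion only. The sketch below assumes a tiny made-up corpus; `texts`, `labels`, and `pipe` are hypothetical names, and `stratify` is an optional extra that keeps the spam ratio similar in both splits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df['message'] and df['class'] (hypothetical data)
texts = [
    "free prize win cash now", "win free entry today",
    "claim your free cash prize", "urgent win a free phone",
    "see you at lunch", "are you coming home tonight",
    "call me when you are free", "lunch at home today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test messages stay unseen
demo_train, demo_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only,
# then passes the sparse counts straight to MultinomialNB
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(demo_train, lab_train)

preds = pipe.predict(demo_test)
print(len(preds))  # 2
```

Because the pipeline keeps the counts sparse, this approach also avoids the memory cost of `.toarray()` on a large vocabulary.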

In [70]:
y_pred = model.predict(x_test)
y_pred

array([0, 0, 0, ..., 0, 0, 1])

In [71]:
y_test

array([0, 0, 0, ..., 0, 0, 0])

# 4.1 Evaluation

In [72]:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9802513464991023
[[954  15]
 [  7 138]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.99       969
           1       0.90      0.95      0.93       145

    accuracy                           0.98      1114
   macro avg       0.95      0.97      0.96      1114
weighted avg       0.98      0.98      0.98      1114
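
The figures in the report can be checked by hand from the confusion matrix. scikit-learn lays it out with true labels as rows and predicted labels as columns, so for labels 0 (ham) and 1 (spam) the entries are `[[TN, FP], [FN, TP]]`. A short worked calculation for the spam class:

```python
# Entries of the confusion matrix printed above
tn, fp = 954, 15
fn, tp = 7, 138

total = tn + fp + fn + tp                           # 1114 test messages
accuracy = (tp + tn) / total                        # correct predictions overall
precision = tp / (tp + fp)                          # flagged messages that really are spam
recall = tp / (tp + fn)                             # real spam that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy:  {accuracy:.4f}")   # accuracy:  0.9803
print(f"precision: {precision:.2f}")  # precision: 0.90
print(f"recall:    {recall:.2f}")     # recall:    0.95
print(f"f1:        {f1:.2f}")         # f1:        0.93
```

In practical terms, 15 legitimate messages were flagged as spam and 7 spam messages slipped through.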



# Step 5. Spam classification application

In [82]:

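# Classify a new message with the fitted vectorizer and model
msg = 'free entry in 2 a wkly co'
msg_input = cv.transform([msg]).toarray()  # transform expects a list of strings
prediction = model.predict(msg_input)[0]   # predict returns an array; take its only element
if prediction == 1:
    print('spam')
else:
    print('not spam')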
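
For new messages, it helps to apply the same cleaning that was applied to the training data before vectorizing. The sketch below wraps this in a reusable function. `clean_message`, `predict_message`, and the four-message training set are hypothetical and stand in for the notebook's fitted `cv` and `model`.

```python
import re
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def clean_message(text):
    """Same cleaning as the training data: lowercase, strip tags, drop punctuation."""
    text = re.sub(r"<.*?>", "", text.lower())
    return text.translate(str.maketrans("", "", string.punctuation))

def predict_message(text, vectorizer, classifier):
    """Return 'spam' or 'ham' for one raw message (hypothetical helper)."""
    features = vectorizer.transform([clean_message(text)])
    return "spam" if classifier.predict(features)[0] == 1 else "ham"

# Tiny made-up training set standing in for the cleaned spam.tsv messages
train_texts = ["free prize win now", "win free cash",
               "see you at lunch", "are you coming home"]
train_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(train_texts), train_labels)

print(predict_message("FREE <b>prize</b>!!!", toy_cv, toy_model))  # spam
print(predict_message("See you at home?", toy_cv, toy_model))      # ham
```

With the notebook's own objects, the call would be `predict_message(msg, cv, model)`.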


**Conclusion:**
I built a spam detection model using the spam.tsv dataset. After preprocessing (lowercasing, HTML and punctuation removal, and label encoding), I used CountVectorizer for feature extraction. The Multinomial Naive Bayes model reached 98% accuracy on the held-out test set of 1,114 messages, with 0.90 precision and 0.95 recall on the spam class. Finally, I implemented a short routine that predicts spam or ham for new messages.