Implement e-mail spam filtering using a text classification algorithm with an appropriate dataset


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv(r"F:\BE\My Cl2\I5\email_spam.csv")



In [3]:
df.drop_duplicates(inplace=True)
print("Duplicates removed.")

Duplicates removed.


In [4]:
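# LabelEncoder assigns integer codes in sorted order of the class names;
# inspect le.classes_ to confirm which code corresponds to spam.
le = LabelEncoder()
df['spam'] = le.fit_transform(df['type'])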

In [5]:
# Replace newlines with a space so words on adjacent lines are not fused together.
df['text'] = df['text'].replace("\n", " ", regex=True)

In [6]:
x_train, x_test, y_train, y_test = train_test_split(df['text'], df['spam'], test_size=0.1, random_state=42)


In [7]:
vectorizer = CountVectorizer()
x_train_count = vectorizer.fit_transform(x_train.values)
x_test_count = vectorizer.transform(x_test.values)


In [8]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(x_train_count, y_train)

In [9]:
accuracy = knn_model.score(x_test_count, y_test)
print(f"Model accuracy: {accuracy}")

Model accuracy: 0.6666666666666666


In [10]:
email = ['50% discount on data science courses signup now']
new_email_count = vectorizer.transform(email)
prediction = knn_model.predict(new_email_count)

In [11]:
if prediction[0] == 1:
    print("The email is Spam.")
else:
    print("The email is Not Spam.")

The email is Spam.


Possible Viva Questions
What is the purpose of CountVectorizer in this code?

Answer: CountVectorizer converts the text data into a sparse document-term matrix in which each row is an email and each column counts the occurrences of one vocabulary word, allowing the model to treat text as numerical features for classification.
Why do we use K-Nearest Neighbors (KNN) for spam detection here?

Answer: KNN is a simple, non-parametric classifier that assigns each email the majority label among its k nearest training emails (k = 3 here), where nearness is the distance between word count vectors. KNN scales poorly on large, high-dimensional text data, but it’s useful for understanding how instance-based learning works.
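
The idea can be sketched on a toy example. The six emails and their labels below are made up for illustration (1 = spam, 0 = not spam), and the variable names are assumptions rather than anything from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy labelled emails, invented for illustration; 1 = spam, 0 = not spam.
texts = [
    "free prize claim now",
    "claim your free discount",
    "win free cash prize",
    "project meeting tomorrow",
    "lunch meeting with team",
    "team project report",
]
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# Same k as the notebook; the default metric is Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, labels)

# The new email shares words only with the spam examples, so its three
# nearest neighbours are all spam and the majority vote is 1.
new_email = vec.transform(["claim free prize"])
pred = knn.predict(new_email)
print(pred[0])
```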
What are the pros and cons of using KNN for this type of classification problem?

Answer: Pros include simplicity and interpretability. Cons include higher memory and computational costs, especially with large datasets, as KNN stores all data points and calculates distances during classification.
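
To see where the prediction cost comes from, here is a hand-written sketch of the KNN decision rule in NumPy. It is not how scikit-learn implements it internally; `knn_predict` is a hypothetical helper and the count rows are invented.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # One distance per stored training row: this O(n * d) step is repeated
    # for every single prediction, which is why KNN gets slow on large data.
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0], distances


# Hypothetical word-count rows (columns: "free", "prize", "meeting").
train_X = np.array([
    [2, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 2],
    [0, 0, 1],
])
train_y = np.array([1, 1, 1, 0, 0])  # 1 = spam, 0 = not spam

label, distances = knn_predict(train_X, train_y, np.array([1, 1, 0]))
print(label, len(distances))
```

There is no training step beyond keeping `train_X` and `train_y` in memory, which is the simplicity advantage; the price is that every query touches every stored row.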
Why do we remove duplicate rows from the dataset?

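Answer: Duplicate rows can bias the model by overrepresenting certain data points; with KNN, each copy counts as an extra vote. If copies of the same email land in both the training and test sets, the test accuracy is also inflated.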
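
A small sketch of the deduplication step, using a hypothetical DataFrame with the same column names as the notebook ("type" and "text"). The rows and the "ham" label value are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; the real dataset's label values may differ.
df = pd.DataFrame({
    "type": ["spam", "spam", "ham", "spam"],
    "text": ["win cash now", "win cash now", "see you at lunch", "win cash now"],
})

before = len(df)
# Rows count as duplicates only if every column matches.
deduped = df.drop_duplicates().reset_index(drop=True)
after = len(deduped)

print(f"Removed {before - after} duplicate rows; {after} remain.")
```

Counting rows before and after is more informative than a fixed "Duplicates removed." message, since it shows whether anything was actually dropped.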
How could this model be improved for spam detection?

Answer: Improvements could include using TF-IDF instead of raw counts to down-weight words that appear in most emails, trying algorithms better suited to sparse text features, such as Naive Bayes or Support Vector Machines, and adding preprocessing steps like stemming or lemmatization to shrink the vocabulary.
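
One of these options, TF-IDF features feeding a Multinomial Naive Bayes classifier, might look like the sketch below. The training emails are invented, so it only illustrates how the pieces fit together and says nothing about accuracy on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled emails, invented for illustration; 1 = spam, 0 = not spam.
texts = [
    "free prize claim now",
    "claim your free discount",
    "win free cash prize",
    "project meeting tomorrow",
    "lunch meeting with team",
    "team project report",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies the vectorizer and the classifier in sequence, so raw
# strings can be passed to both fit() and predict().
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["free cash prize", "team meeting tomorrow"])
print(list(pred))
```

Swapping `MultinomialNB()` for another estimator, or `TfidfVectorizer()` for `CountVectorizer()`, changes one argument and leaves the rest of the code as is, which makes it easy to compare alternatives on the same train/test split.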