Classifying Emails as Spam Using Decision Trees

Dataset: Spam Email Dataset

Preprocessing Steps:
Handle missing values if any.
Standardize features.
Encode categorical variables if present.

Task: Implement a decision tree classifier to classify emails as spam or not spam (ham), and evaluate the model using precision, recall, and F1-score.

In [88]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

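# Load the dataset and inspect its structure
df = pd.read_csv('/content/spam.csv')
print(df.head())
df.info()
# describe() returns a summary table; its result is discarded here because
# it is neither printed nor the last expression in the cell
df.describe()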

# Check for missing values
missing_values = df.isna().sum()
print(missing_values)

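# Extract the message text (feature) and the label (target)
X = df['Message']
y = df['Category']

# Text vectorization: TF-IDF over the 1000 most frequent terms, English stop words removed.
# Note: the vectorizer is fit on all messages before the split, so test-set
# vocabulary and IDF statistics leak into training and the scores may be optimistic.
tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
X = tfidf.fit_transform(X).toarray()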

# Encode the labels (ham = 0, spam = 1)
y = y.map({'ham': 0, 'spam': 1})

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model. No random_state is set, so the fitted tree (and the scores)
# can vary slightly between runs.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

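# Model evaluation. The scores treat spam (label 1) as the positive class,
# which is scikit-learn's default pos_label for binary targets.
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Rows are true labels, columns are predicted labels (order: ham, spam)
conf_matrix = confusion_matrix(y_test, y_pred)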

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
Category    0
Message     0
dtype: int64
Precision: 0.9014084507042254
Recall: 0.8590604026845637
F1 Score: 0.8797250859106529
Confusion Matrix:
[[952  14]
 [ 21 128]]
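
The reported scores follow directly from the confusion matrix above, where rows are true labels and columns are predicted labels (scikit-learn's convention) and spam is the positive class. A minimal sketch in plain Python that recomputes them; the variable names are illustrative and not part of the original cell:

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
tn, fp = 952, 14   # true ham: correctly kept, wrongly flagged as spam
fn, tp = 21, 128   # true spam: missed, correctly flagged

precision = tp / (tp + fp)   # of messages flagged as spam, the share that was spam
recall = tp / (tp + fn)      # of actual spam messages, the share that was caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

Accuracy comes out near 0.97, noticeably higher than the F1 score, because only 149 of the 1,115 test messages are spam; labeling every message as ham would already reach about 0.87 accuracy. This is why precision, recall, and F1 are the more informative metrics here.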

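One caveat with the cell above is that the TF-IDF vectorizer is fit before the split, so test messages influence the vocabulary and IDF weights. A sketch of a leakage-free variant: split the raw text first, then let a `Pipeline` fit the vectorizer on the training portion only. The small `messages` and `labels` lists are made-up stand-ins for `df['Message']` and the encoded `df['Category']`; `stratify` and `random_state=42` on the tree are choices of this sketch, not of the original notebook. No scores are shown because they would differ from the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for df['Message'] and the 0/1 labels
messages = [
    "Free entry to win a prize, text now",
    "Win cash now, claim your free prize",
    "Urgent: you have won a free holiday",
    "Claim your free ringtone, text WIN",
    "Are we still meeting for lunch today",
    "Ok, see you at home later",
    "Can you call me after class",
    "I will be late, start without me",
    "Thanks for the lift yesterday",
    "Did you finish the homework",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first so the test messages stay unseen;
# stratify keeps the spam/ham ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.3, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the tree
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=1000)),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
clf.fit(X_train, y_train)

# predict() reuses the training vocabulary to transform the test text
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

With the real data, `messages` would be replaced by `df['Message']` and `labels` by the mapped `y`, keeping `test_size=0.2` as in the original split.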