pandas is a Python library for data analysis and manipulation.
scikit-learn is a Python machine learning library.

In [3]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

The dataset has been cleaned by removing fields such as the subject and the sender's email address, leaving only the spam/ham label and the email content, to limit the curse of dimensionality. No further pre-processing (such as stemming or lemmatization) is applied, as it only increased the runtime and did not yield higher accuracy.

In [5]:
# Load the data
df = pd.read_csv(".../datasets/trec07p_data.csv")

The Pipeline object chains the text transformation and the model training into a single estimator, so both steps are fitted and applied together.
TfidfVectorizer is more refined than CountVectorizer and gave better results: instead of raw counts, it down-weights words that appear in many documents.
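
To make the down-weighting concrete, here is a minimal sketch in plain Python of smoothed TF-IDF with L2 normalisation, which is how I understand TfidfVectorizer to behave with its default settings. The toy corpus, the whitespace tokenisation and the helper name `tfidf_weights` are assumptions for illustration only, not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from trec07p)
docs = [
    "free offer click now",
    "meeting notes for the team",
    "free lunch for the team",
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    tokenised = [d.split() for d in docs]  # simplification: split on whitespace
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    result = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        result.append({t: w / norm for t, w in raw.items()})
    return result

weights = tfidf_weights(docs)
# "click" and "free" both occur once in the first email, but "free" also
# appears in another document, so it ends up with the smaller weight.
print(weights[0])
```

With plain counts both words would get the same value of 1; the IDF factor is what separates them.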

LinearSVC is used instead of SVC because it is much more time-efficient and yields similar accuracy.
I used a support vector machine rather than a naive Bayes classifier, since SVMs handle high-dimensional data well.

In [6]:
x = df['text']
y = df['label']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC())
])

# Split the data into training (80%) and testing (20%) sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

Hyperparameters are tuned with a grid search.

Explanation of each parameter:
- "max_df" - an upper threshold on document frequency (the proportion of documents that contain a term); any term above the threshold is excluded from the feature set, which acts as a corpus-specific form of stop-word removal
- "ngram_range" - (min_n, max_n); (1, 1) extracts unigrams only, while (1, 2) extracts both unigrams and bigrams
- "sublinear_tf" - replaces the term frequency TF with "1 + log(TF)", which dampens the impact of very frequent words and makes the term representation more balanced
- "C" - regularisation parameter; the strength of regularisation is inversely proportional to C, so a small C favours a simpler model with a wider margin, while a large C fits the training data more closely and risks overfitting
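
The three vectorizer parameters above can be illustrated with a small plain-Python sketch. It mimics what I understand the options to do rather than calling scikit-learn; the toy corpus and the helper names `ngrams`, `document_frequency` and `sublinear` are hypothetical.

```python
import math

# Toy corpus (hypothetical)
docs = [
    "free offer click now",
    "free offer for you",
    "team meeting for you now",
    "free free free prize",
]

def ngrams(tokens, min_n, max_n):
    """All contiguous word n-grams with min_n <= n <= max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# ngram_range: (1, 1) gives unigrams only, (1, 2) adds bigrams
tokens = docs[0].split()
unigrams = ngrams(tokens, 1, 1)
uni_and_bigrams = ngrams(tokens, 1, 2)

def document_frequency(term, docs):
    """Proportion of documents that contain the term at least once."""
    return sum(term in d.split() for d in docs) / len(docs)

# max_df: drop terms whose document frequency is above the threshold
max_df = 0.5
vocabulary = sorted({t for d in docs for t in d.split()})
kept = [t for t in vocabulary if document_frequency(t, docs) <= max_df]

def sublinear(tf):
    """Sublinear scaling: replace a positive term frequency by 1 + ln(tf)."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# sublinear_tf: "free" occurs 3 times in the last email
raw_tf = docs[3].split().count("free")

print(unigrams)
print(uni_and_bigrams)
print(kept)
print(raw_tf, sublinear(raw_tf))
```

Here "free" appears in three of the four documents (0.75), so a max_df of 0.5 removes it, and its count of 3 in the last email is damped to roughly 2.1 by the sublinear scaling.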

In [7]:
# Define parameter grid for GridSearchCV
param_grid = {
    'tfidf__max_df': [0.5, 0.75, 0.85, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10, 100]
}
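
The keys in the grid follow the pipeline's "step name, double underscore, parameter name" convention. The short check below rebuilds the same two-step pipeline and confirms that each grid key is a parameter the pipeline exposes; it is a sketch that assumes scikit-learn is installed, and the variable `grid_keys` is only there for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# get_params() lists nested parameters as <step>__<parameter>
params = pipeline.get_params()
grid_keys = ['tfidf__max_df', 'tfidf__ngram_range', 'tfidf__sublinear_tf', 'clf__C']
for key in grid_keys:
    print(key, '->', params[key])  # current (default) value of each parameter

# The same names work with set_params, which is what the grid search
# relies on when it tries each combination.
pipeline.set_params(clf__C=10, tfidf__ngram_range=(1, 2))
```

A misspelt key (for example a lowercase `clf__c`) would not be in `get_params()`, so this is a cheap way to catch typos before starting a long search.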

"n_jobs" - the number of jobs to run in parallel; "-1" uses all available CPU cores

"cv" as an integer - the number of folds; for a classifier, GridSearchCV uses stratified k-fold (StratifiedKFold)
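
To get a feel for the cost of the search, the sketch below counts how many models are fitted for this grid, assuming the 10 folds used in the GridSearchCV call. It only uses the standard library; the names `cv_folds`, `n_candidates` and `n_fits` are mine.

```python
from itertools import product

param_grid = {
    'tfidf__max_df': [0.5, 0.75, 0.85, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10, 100],
}
cv_folds = 10  # assumption: same value as cv in the GridSearchCV call

# Every combination of values is one candidate
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is fitted once per fold
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # prints: 80 800
```

With the default refit behaviour there is one more fit at the end, when the best candidate is retrained on the whole training set.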

In [8]:
# Initialize GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=10, n_jobs=-1, scoring='accuracy')

# Train the model
grid_search.fit(x_train, y_train)

# Predict on the test set
y_pred = grid_search.predict(x_test)

In [15]:
# Print the best parameters and classification report
print("Best Parameters:", grid_search.best_params_)
print(classification_report(y_test, y_pred))

# Print results
print('Classification accuracy {:.3%}'.format(accuracy_score(y_test, y_pred)))

Best Parameters: {'clf__C': 10, 'tfidf__max_df': 0.85, 'tfidf__ngram_range': (1, 2), 'tfidf__sublinear_tf': True}
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      4760
           1       1.00      1.00      1.00      5974

    accuracy                           1.00     10734
   macro avg       1.00      1.00      1.00     10734
weighted avg       1.00      1.00      1.00     10734

Classification accuracy 99.599%
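
The report rounds to two decimals, so every entry shows as 1.00 even though the classifier is not perfect. As a rough check, the sketch below works backwards from the printed accuracy and the test-set size (10734) to the number of misclassified emails. It uses only the two figures printed above; the variable names are mine.

```python
n_test = 10734        # total support from the classification report
reported = "99.599%"  # accuracy as printed with the {:.3%} format

# Find every error count that would be printed as the reported accuracy
matches = [
    errors for errors in range(n_test + 1)
    if "{:.3%}".format((n_test - errors) / n_test) == reported
]
print(matches)  # prints: [43]
```

So the printed accuracy corresponds to 43 misclassified emails out of 10734.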


The trained model can be loaded later to classify new emails. joblib is used because it handles large NumPy arrays more efficiently.

In [10]:
# Save the fitted pipeline (vectorizer and classifier together)
import joblib
joblib.dump(grid_search.best_estimator_, 'spam_classifier.pkl')

['spam_classifier.pkl']
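
Loading the model back could look like the sketch below. To keep it self-contained it trains a tiny stand-in pipeline on four made-up emails and writes to a temporary directory, both of which are assumptions for illustration; in the notebook you would call `joblib.load('spam_classifier.pkl')` on the file saved above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in training set (hypothetical); 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free offer click now",
    "agenda for the team meeting",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
model.fit(texts, labels)

# Save and reload; the pipeline carries its fitted vocabulary with it
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'spam_classifier.pkl')
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The loaded pipeline takes raw text, exactly like the original one
new_emails = ["claim your free prize now", "agenda for the next meeting"]
print(loaded.predict(new_emails))
```

Because the vectorizer is stored inside the pipeline, no separate vectorizer file is needed when predicting.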

Cross-validation scores estimate the model's performance on unseen data and help check for overfitting.

Note that the test accuracy is the same with or without grid search, but the mean cross-validation score increases slightly (0.9932 → 0.9949). It may therefore be more time- and cost-efficient to train the model without grid search.

In [11]:
# Additional diagnostics: cross-validation scores
cv_scores = cross_val_score(grid_search.best_estimator_, x_train, y_train, cv=10)
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", cv_scores.mean())

Cross-validation scores: [0.99417792 0.99324639 0.99697252 0.99417792 0.99394363 0.99534125
 0.99580713 0.99487538 0.99510832 0.99487538]
Mean cross-validation score: 0.9948525838631385
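
The mean alone hides how much the folds differ. A quick sketch with the standard library, using the ten scores printed above (copied by hand, so rounded to eight decimals), summarises their spread:

```python
import statistics

# Fold scores copied from the output above
cv_scores = [
    0.99417792, 0.99324639, 0.99697252, 0.99417792, 0.99394363,
    0.99534125, 0.99580713, 0.99487538, 0.99510832, 0.99487538,
]

mean_score = statistics.mean(cv_scores)
std_score = statistics.stdev(cv_scores)        # sample standard deviation
spread = max(cv_scores) - min(cv_scores)       # best fold minus worst fold

print(f"mean={mean_score:.4f} std={std_score:.4f} range={spread:.4f}")
```

The folds differ by well under half a percentage point, which suggests the score does not depend on one lucky split.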


The new email below was generated by ChatGPT.

In [14]:
# Using custom input to test the model
new_email = ["""Hi there,

We’re excited to offer you a special opportunity! For a limited time, you can take advantage of our exclusive promotion. This offer is only available to select individuals like you.

Here’s what you’ll get:

Special Discount: 50% off your first purchase
Free Gift: A complimentary item with every order
Limited Time Only: Act fast to secure your deal
Don’t miss out! Click the link below to claim your offer and learn more about our exciting products.

Claim Your Offer Now

Best regards,
The Special Offers Team

P.S. If you have any questions, feel free to reply to this email. We’re here to help!"""]
prediction = grid_search.predict(new_email)
# predict returns one label per input email; label 1 is treated as spam
print("predicted spam" if prediction[0] == 1 else "predicted ham")

predicted spam
