# **Dataset loading**

In [106]:
import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

print(df.head()) # Output of the first few rows

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


# **Data preprocessing**

Before being used by the model, the text goes through a few cleaning steps to make it easier to work with. Everything is turned to lowercase to keep it consistent, punctuation is removed, and common words like “the” or “is” that don’t add much meaning are taken out. Finally, the text is split into individual words (tokenized) so the model can better understand and analyze it.

In [107]:
import string
import re
from sklearn.model_selection import train_test_split

def clean(text):
  text = text.lower()
  text = re.sub(f'{re.escape(string.punctuation)}', '', text)
  return text

df['cleaned'] = df['message'].apply(clean)
df['bin_label'] = df['label'].map({'ham': 0, 'spam': 1}) # Mapping labels to binary values

X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['bin_label'], test_size=0.2, random_state=42)

`test_size=0.2` ensures 20% of the data will be used for testing the dataset once trained, the remaining 80% will be used for model training. `X_train` contains training features, `X_test` contains testing features, `y_train` contains training labels, `y_test` contains testing labels.

# **Text vectorizing**

**TF-IDF** vectorizer converts text data into numerical vectors, giving higher scores to words that important in a specific message and not common across all. **TF-IDF** stands for *Term Frequency - Inverse Document Frequency* and the function is defined as $w_{x,y} = \text{tf}(t, d) \times \text{idf}(t, D) = \frac{f_{t, d}}{\sum_{\bar{t} \in d} f_{\bar t, d}} \times \log \frac{N}{|\{d \in D : t \in d\}|}$



In [108]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2)); # Vectorizer object
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# **Classifier training**

**Multinomial Naive Bayes** is a simple yet widely used classification algorithm for text data, and it is in fact pretty naive. The model looks at how many times a certain word appears in *spam* or *ham* messages and uses that to determine wheter a message is spam or not. It uses **multinomial** **distribution** to calculate the probability of a message belonging to a certain category.

Multinomial distribution is defined as $P(X)=\frac{n!}{n_1! n_2! \dots n_m!} p_1^{n_1}p_2^{n_2} \dots p_m^{n^m}$, where $n$ is the number of trials, $n_i$ is the count of occurrencies for outcome $i$, $p_i$ is the probability of outcome $i$. [**Maximum Likelihood Estimation**](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation/) (MLE) is used to predict how likely each single word is *spam* or *ham*.

In [109]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = MultinomialNB(class_prior=(0.44, 0.56))
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)

# **Model evaluation**

The model is evaluated on test data using accuracy and classification report. The metrics are **precision** (how many predicted spams were actually spams), **recall** (how many real spams were caught), **F1-Score** (harmonic mean of precision and recall). A confusion matrix was added to provide a detailed breakdown of the model’s predictions, showing how many spam and non-spam messages were correctly or incorrectly classified.

In [118]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"\nMetrics\n____________")
print("Accuracy:", accuracy, "\nReport:\n", report, "\nConfusion Matrix:\n____________")
print(f'''\t\tCorrect\tWrong\n
          Ham   {cm[0][0]}\t{cm[0][1]}\n
          Spam  {cm[1][1]}\t{cm[1][0]}
      ''')


Metrics
____________
Accuracy: 0.9802690582959641 
Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.99       966
           1       0.95      0.90      0.92       149

    accuracy                           0.98      1115
   macro avg       0.97      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
 
Confusion Matrix:
____________
		Correct	Wrong

          Ham   959	7

          Spam  134	15
      


# **Try a custom message**

To improve the flexibility of spam detection, a custom threshold check was implemented using the `predict_thresh()` function. Unlike the default model.`predict()` method, which uses a fixed threshold of $0.5$ to classify messages, this function allows us to define our own threshold (e.g., $0.4$) for the spam probability.

This is useful for fine-tuning the balance between catching more spam (recall) and avoiding false positives (precision), depending on the specific needs of the application.

In [119]:
def predict_thresh(thresh, msg):
  cleaned = clean(msg)
  vec = vectorizer.transform([cleaned])
  return (model.predict_proba(vec)[:, 1] >= thresh).astype(int), thresh

def predict_no_thresh(msg):
  cleaned = clean(msg)
  vec = vectorizer.transform([cleaned])
  return model.predict(vec)

sample = str(input("Your message >> "))

thresh = 0.4
prediction_thresh = predict_thresh(thresh, sample)
prediction = predict_no_thresh(sample)

print(f"\nPrediction (fixed probability threshold >0.5):")
print("Spam (1)\n" if prediction[0] == 1 else "Ham (0)\n")
print(f"Prediction (custom probability threshold >{thresh}):")
print("Spam (1)" if prediction_thresh[0] == 1 else "Ham (0)")

Your message >> URGENT: Your account has been compromised. Verify now at secure-update-login.net to avoid suspension!

Prediction (fixed probability threshold >0.5):
Spam (1)

Prediction (custom probability threshold >0.4):
Spam (1)
