##### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2025 Semester 1

## Assignment 1: Scam detection with Naive Bayes


**Student ID(s):**     1462539


## 0. Set-up

Let us load the labelled training data from `sms_supervised_train.csv`.

In [4]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from typing import List, Set, Dict

# Load data
labeled_train = pd.read_csv('data/sms_supervised_train.csv')

Now that the data is loaded, let us delete the rows where `textPreprocessed` is empty by calling `dropna` on the DataFrame. These instances have no features we can learn from. We can then tokenise (split) the already preprocessed text for the next stage.

In [5]:
# Drop rows with no preprocessed text, then split the remaining text into tokens
labeled_train.dropna(subset=['textPreprocessed'], inplace=True)
labeled_train['tokens'] = labeled_train['textPreprocessed'].apply(lambda x: x.split())

Let us define the **vocabulary**: the list of every distinct word that occurs in the training data set.

In [6]:
vocabulary = set()
for tokens in labeled_train['tokens']:
    vocabulary.update(tokens)
vocabulary = sorted(vocabulary)  # sorted so the column order is reproducible across runs

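Define the **count** matrix (the Bag-of-Words feature matrix). Since the dataset is already preprocessed, I supply `vocabulary=vocabulary` directly to `CountVectorizer` and skip `fit()`, so the columns are exactly the vocabulary built above. Note that `CountVectorizer` still applies its own tokeniser during `transform`, so tokens that the default `token_pattern` does not match (for example, single-character tokens) receive a zero count.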

In [14]:
vectorizer: CountVectorizer = CountVectorizer(vocabulary=vocabulary)
# transform() returns a SciPy sparse matrix (messages x vocabulary), not a dense ndarray
X = vectorizer.transform(labeled_train['textPreprocessed'])
y = labeled_train['class'].values
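
One caveat worth checking here: a fixed vocabulary only sets the columns of the matrix. At transform time `CountVectorizer` still tokenises each message with its `token_pattern`, whose documented default is `(?u)\b\w\w+\b` (tokens of two or more word characters). The sketch below applies that regular expression with Python's `re` module to show which whitespace-separated tokens would survive. The example message and the helper name `default_tokens` are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer:
# it keeps only tokens made of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text: str) -> list:
    """Tokenise text with the default pattern (hypothetical helper)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


message = "u win 2 free ticket"  # made-up example message
whitespace_tokens = message.split()
pattern_tokens = default_tokens(message)

# Tokens present after str.split() but absent after the default pattern
dropped = [t for t in whitespace_tokens if t not in pattern_tokens]
print("split():        ", whitespace_tokens)
print("default pattern:", pattern_tokens)
print("dropped:        ", dropped)
```

If single-character tokens matter for the model, one option is to pass a more permissive pattern such as `token_pattern=r"(?u)\b\w+\b"` when constructing the vectorizer.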

## 1. Supervised model training


We now compute the prior probability of each class, $P(C)$. The function below returns a dictionary that maps each class label $C$ to its prior probability $P(C)$.

In [8]:
def calc_prior(data: pd.DataFrame, label_col: str = 'class') -> Dict[int, float]:
    class_counts = data[label_col].value_counts()
    total = class_counts.sum()
    return (class_counts / total).to_dict()
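
As a quick sanity check of what `calc_prior` computes, the prior is simply each class's share of the training rows. A minimal sketch on made-up labels (the list `toy_labels` is an assumption for illustration, not taken from the data):

```python
from collections import Counter

# Made-up labels: 1 = scam, 0 = non-malicious
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]

counts = Counter(toy_labels)
total = sum(counts.values())

# P(C) = (number of messages in class C) / (number of messages)
toy_priors = {label: n / total for label, n in counts.items()}
print(toy_priors)
```

The priors always sum to 1, which is a cheap check to run on the real output as well.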

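The next set of probabilities we need are the conditional probabilities: the likelihood of each word given a specific class label.

For each class, we sum the count vectors of all messages in that class. For example, with the vocabulary `[hello, win, now]`:

message1: [1, 0, 2]   → "hello now now"

message2: [0, 1, 1]   → "win now"

Summing the two vectors gives

[1 + 0, 0 + 1, 2 + 1] = [1, 1, 3]

Each `word_counts[i]` is exactly $count_{c,i}$ in the Laplace-smoothed estimate

$$
p_{c,i} = \frac{count_{c,i} + \alpha}{total_c + V \alpha}
$$

where $total_c = \sum_i count_{c,i}$ is the total number of word occurrences in class $c$, $V$ is the vocabulary size, and $\alpha$ is the smoothing parameter.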




In [9]:
def calc_likelihood(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> Dict[int, np.ndarray]:
    classes = np.unique(y)
    vocab_size = X.shape[1]
    likelihoods = {}

    for c in classes:
        X_c = X[np.where(y == c)[0]]
        word_counts = X_c.sum(axis=0)
        word_counts = np.asarray(word_counts).flatten()
        total_count = word_counts.sum()
        likelihoods[c] = (word_counts + alpha) / (total_count + alpha * vocab_size)

    return likelihoods
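
To make the smoothing formula concrete, here is a small sketch that evaluates it on the two toy messages `[1, 0, 2]` and `[0, 1, 1]`, treated as the only messages of a single class. The column order `["hello", "win", "now"]` is an assumption inferred from the example messages.

```python
import numpy as np

# Toy count matrix for one class; columns are assumed to follow
# the vocabulary ["hello", "win", "now"] (made-up data)
X_c = np.array([
    [1, 0, 2],  # "hello now now"
    [0, 1, 1],  # "win now"
])
alpha = 1.0
vocab_size = X_c.shape[1]         # V

word_counts = X_c.sum(axis=0)     # count_{c,i} for each word i
total_count = word_counts.sum()   # total_c

# Laplace-smoothed estimate of P(word_i | class c)
p = (word_counts + alpha) / (total_count + alpha * vocab_size)
print(word_counts, total_count, p)
```

With $\alpha = 1$ and $V = 3$ the denominator is $5 + 3 = 8$, so the estimates are $2/8$, $2/8$ and $4/8$, and they sum to 1 as a probability distribution should.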

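We are ready to train our model, but before that, let us create two helper functions: one that returns the most probable words in each class, and one that returns the most predictive words for each class, ranked by the ratio of the two class likelihoods.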

In [10]:
def get_top_words(likelihood, vocab, class_label, top_n=10):
    indices = np.argsort(-likelihood[class_label])[:top_n]
    return [(vocab[i], likelihood[class_label][i]) for i in indices]

In [11]:
def get_predictive_words(likelihoods, vocab, top_n=10):
    ratio_1_over_0 = likelihoods[1] / likelihoods[0]
    ratio_0_over_1 = likelihoods[0] / likelihoods[1]
    top_1_indices = np.argsort(-ratio_1_over_0)[:top_n]
    top_0_indices = np.argsort(-ratio_0_over_1)[:top_n]
    top_scams = [(vocab[i], ratio_1_over_0[i]) for i in top_1_indices]
    top_non_malicious = [(vocab[i], ratio_0_over_1[i]) for i in top_0_indices]
    return top_scams, top_non_malicious
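
The difference between "most probable" and "most predictive" is easy to miss, so here is a minimal sketch with made-up likelihoods (the three-word vocabulary and all probabilities are assumptions for illustration). A word can be the most frequent in the scam class yet only weakly predictive if it is also common in the other class.

```python
import numpy as np

vocab = ["call", "prize", "ok"]        # made-up vocabulary
likelihoods = {
    0: np.array([0.30, 0.01, 0.69]),   # P(w | non-malicious), made up
    1: np.array([0.50, 0.40, 0.10]),   # P(w | scam), made up
}

# Most probable word in the scam class: largest P(w | 1)
most_probable_scam = vocab[int(np.argmax(likelihoods[1]))]

# Most predictive words for scam: largest P(w | 1) / P(w | 0)
ratio_scam = likelihoods[1] / likelihoods[0]
ranked = [vocab[i] for i in np.argsort(-ratio_scam)]

print("most probable in scam:", most_probable_scam)
print("ranked by P(w|1)/P(w|0):", ranked)
```

Here "call" is the most probable scam word, but "prize" tops the ratio ranking because it almost never appears in non-malicious messages.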

In [16]:
priors = calc_prior(labeled_train)
likelihoods = calc_likelihood(X, y)


top_words_0 = get_top_words(likelihoods, vocabulary, 0)
top_words_1 = get_top_words(likelihoods, vocabulary, 1)

top_predictive_1, top_predictive_0 = get_predictive_words(likelihoods, vocabulary)

# --- Print results ---
print("Prior probabilities:")
for c, p in priors.items():
    print(f"Class {c} ({'scam' if c==1 else 'non-malicious'}): {p:.4f}")

print("\nTop 10 most probable words in non-malicious class:")
for word, prob in top_words_0:
    print(f"{word}: {prob:.4f}")

print("\nTop 10 most probable words in scam class:")
for word, prob in top_words_1:
    print(f"{word}: {prob:.4f}")

print("\nTop 10 most predictive words for scam class (P(w|1) / P(w|0)):")
for word, ratio in top_predictive_1:
    print(f"{word}: {ratio:.2f}")

print("\nTop 10 most predictive words for non-malicious class (P(w|0) / P(w|1)):")
for word, ratio in top_predictive_0:
    print(f"{word}: {ratio:.2f}")


Prior probabilities:
Class 0 (non-malicious): 0.7995
Class 1 (scam): 0.2005

Top 10 most probable words in non-malicious class:
go: 0.0161
get: 0.0143
gt: 0.0085
lt: 0.0084
call: 0.0083
ok: 0.0078
come: 0.0075
ur: 0.0075
know: 0.0075
good: 0.0071

Top 10 most probable words in scam class:
call: 0.0274
free: 0.0137
claim: 0.0100
customer: 0.0092
txt: 0.0090
ur: 0.0085
text: 0.0082
stop: 0.0082
reply: 0.0080
mobile: 0.0078

Top 10 most predictive words for scam class (P(w|1) / P(w|0)):
prize: 88.80
tone: 57.46
select: 41.79
claim: 41.21
50: 34.82
paytm: 33.08
code: 31.34
award: 28.73
won: 27.86
18: 26.12

Top 10 most predictive words for non-malicious class (P(w|0) / P(w|1)):
gt: 60.30
lt: 59.73
lor: 32.16
hope: 27.57
ok: 27.57
da: 22.40
let: 20.10
wat: 19.53
oh: 18.38
lol: 17.80


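To classify a message with word-count vector $count$, we pick the class with the highest posterior, which by Bayes' rule is proportional to the prior times the likelihood:

$$P(c | count) \propto P(c)P(count|c)$$

Under the multinomial model, $P(count|c) \propto \prod_i p_{c,i}^{count_i}$, so in log space the score for class $c$ is $\log P(c) + \sum_i count_i \log p_{c,i}$. The cell below collects the training functions and adds prediction.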

In [None]:
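def calc_prior(data: pd.DataFrame, label_col: str = 'class') -> Dict[int, float]:
    """Calculate P(class) for each class"""
    class_counts = data[label_col].value_counts()
    # Return a dict: iterating a Series yields values, not labels, which would break calc_posterior
    return (class_counts / class_counts.sum()).to_dict()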


def calc_likelihood(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> Dict[int, np.ndarray]:
    """Calculate P(word|class) with Laplace smoothing"""
    classes = np.unique(y)
    vocab_size = X.shape[1]
    likelihoods = {}

    for c in classes:
        X_c = X[np.where(y == c)[0]]
        word_counts = X_c.sum(axis=0)
        word_counts = np.asarray(word_counts).flatten()
        total_count = word_counts.sum()
        likelihoods[c] = (word_counts + alpha) / (total_count + alpha * vocab_size)

    return likelihoods


def calc_posterior(counts: np.ndarray, priors: Dict[int, float], likelihoods: Dict[int, np.ndarray]) -> Dict[int, float]:
    """Calculate log-posterior score for each class"""
    scores = {}
    for c in priors:
        log_prior = np.log(priors[c])
        log_likelihood = np.log(likelihoods[c])
        scores[c] = log_prior + np.dot(counts, log_likelihood)
    return scores


def predict_NB(text: str, priors: Dict[int, float], likelihoods: Dict[int, np.ndarray], vectorizer: CountVectorizer) -> int:
    """Predict class for one message using trained NB model"""
    counts = vectorizer.transform([text]).toarray().flatten()
    scores = calc_posterior(counts, priors, likelihoods)
    return max(scores, key=scores.get)


def train_NB(data: pd.DataFrame, text_col: str = 'textPreprocessed', label_col: str = 'class', alpha: float = 1.0):
    """
    Trains a multinomial NB model on preprocessed SMS data.
    Returns: (priors, likelihoods, fitted_vectorizer)
    """
    data = data.dropna(subset=[text_col])
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data[text_col])
    y = data[label_col].values
    priors = calc_prior(data, label_col)
    likelihoods = calc_likelihood(X, y, alpha)
    return priors, likelihoods, vectorizer
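
The reason `calc_posterior` works with logarithms can be seen in a small sketch. The priors, likelihoods and count vectors below are made-up values over a three-word vocabulary, and the "long message" is deliberately exaggerated to force floating-point underflow.

```python
import numpy as np

priors = {0: 0.8, 1: 0.2}              # made-up P(c)
likelihoods = {
    0: np.array([0.30, 0.01, 0.69]),   # made-up P(w | c=0)
    1: np.array([0.50, 0.40, 0.10]),   # made-up P(w | c=1)
}


def log_scores(counts):
    """log P(c) + sum_i count_i * log p_{c,i} for each class."""
    return {c: np.log(priors[c]) + counts @ np.log(likelihoods[c])
            for c in priors}


# Short message: word 0 once, word 1 twice
counts = np.array([1, 2, 0])
scores = log_scores(counts)
prediction = max(scores, key=scores.get)

# Exaggerated long message: the direct product underflows to 0.0
long_counts = counts * 1000
naive = {c: priors[c] * np.prod(likelihoods[c] ** long_counts)
         for c in priors}
long_scores = log_scores(long_counts)
long_prediction = max(long_scores, key=long_scores.get)

print(scores, prediction)
print(naive, long_scores, long_prediction)
```

In the long-message case both direct products collapse to `0.0`, so the classes cannot be compared, while the log scores remain finite and still rank the classes.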


## 2. Supervised model evaluation

## 3. Extending the model with semi-supervised training

## 4. Semi-supervised model evaluation