# CAS2105 Mini AI Pipeline Project

### spam_classification.ipynb
Name : [ YAN SHIYU ]

StudentID : [ 2021147609 ]

This notebook demonstrates a mini AI pipeline for text classification (SMS spam detection).

It covers data loading, a naive keyword-based baseline, and a transformer-based AI pipeline built on sentence embeddings.

Accuracy, precision, recall, and F1 score are used to evaluate and compare the two methods.

In [14]:
import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"  # hide Hugging Face Hub download progress bars to keep the output clean

### 1.1 Data Loading

In [15]:
from datasets import load_dataset

dataset = load_dataset("codesignal/sms-spam-collection")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'message'],
        num_rows: 5572
    })
})

### 1.2 Data Preparation
Convert the raw data into a numerical form that the model can use: labels become 1 for spam and 0 for ham.

In [2]:
import numpy as np
data = dataset["train"]
texts = data["message"]
labels_str = data["label"]

labels = np.array([1 if l == "spam" else 0 for l in labels_str])

print("Total samples:", len(texts))
print("Spam ratio:", labels.mean())

Total samples: 5572
Spam ratio: 0.13406317300789664


At this point, we have:
1. Counted the total number of samples
2. Converted the string labels to numeric form
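
Because only about 13% of the messages are spam, accuracy alone can be misleading. The short sketch below uses just the two figures printed above (total samples and spam ratio) to work out what a trivial "always predict ham" rule would score; the variable names are illustrative and not part of the notebook.

```python
# Figures copied from the printed output above.
total_samples = 5572
spam_ratio = 0.13406317300789664

# Recover the class counts from the ratio.
n_spam = round(total_samples * spam_ratio)
n_ham = total_samples - n_spam

# A rule that always predicts ham is right on every ham message
# and wrong on every spam message.
majority_accuracy = n_ham / total_samples

print("spam messages:", n_spam)
print("ham messages: ", n_ham)
print(f"always-ham accuracy: {majority_accuracy:.4f}")
```

Any method worth keeping should clearly beat this accuracy, which is why precision, recall, and F1 on the spam class are the more informative metrics here.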

### 1.3 Data Splitting
Split the data into train / validation / test sets (roughly 80% / 10% / 10%), stratified by label.

In [4]:
from sklearn.model_selection import train_test_split
import numpy as np

texts = list(data["message"])
labels = np.array([1 if l == "spam" else 0 for l in data["label"]])

# hold out 10% of the data as the test set
texts_tv, texts_test, y_tv, y_test = train_test_split(
    texts,
    labels,
    test_size=0.1,
    random_state=42,
    stratify=labels
)

# split the remaining 90% into train/val; 0.1111 ≈ 1/9, so val is about 10% of the full data
texts_train, texts_val, y_train, y_val = train_test_split(
    texts_tv,
    y_tv,
    test_size=0.1111,
    random_state=42,
    stratify=y_tv
)

print("Train:", len(texts_train))
print("Val:  ", len(texts_val))
print("Test: ", len(texts_test))

Train: 4456
Val:   558
Test:  558
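
The second `test_size=0.1111` may look arbitrary, so here is the arithmetic behind it as a small sketch. It assumes the held-out part of each split is rounded up to a whole sample, which is consistent with the sizes printed above; the variable names are illustrative.

```python
import math

n_total = 5572

# First split: 10% of everything becomes the test set.
n_test = math.ceil(0.1 * n_total)
n_train_val = n_total - n_test

# Second split: 0.1111 (about 1/9) of the remaining 90% becomes validation,
# which is again about 10% of the full dataset.
n_val = math.ceil(0.1111 * n_train_val)
n_train = n_train_val - n_val

print("Train:", n_train, f"({n_train / n_total:.1%})")
print("Val:  ", n_val, f"({n_val / n_total:.1%})")
print("Test: ", n_test, f"({n_test / n_total:.1%})")
```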


In [5]:
def spam_ratio(y):
    return y.mean()

print("Spam ratio:")
print(" Train:", spam_ratio(y_train))
print(" Val:  ", spam_ratio(y_val))
print(" Test: ", spam_ratio(y_test))

Spam ratio:
 Train: 0.13397666068222622
 Val:   0.13440860215053763
 Test:  0.13440860215053763


### 2.1 Naive Baseline

In [6]:
import re
import numpy as np

def preprocess_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()
    return s

SPAM_KEYWORDS = {
    "free", "win", "prize", "cash", "claim", "urgent", 
    "offer", "click", "call", "reply", "txt", "stop"
}

def spam_score(text: str) -> int:
    t = preprocess_text(text)
    return sum(kw in t for kw in SPAM_KEYWORDS)

def baseline_predict(texts, threshold: int = 2):
    scores = np.array([spam_score(t) for t in texts])
    preds = (scores >= threshold).astype(int)  # 1=spam, 0=ham
    return preds, scores
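
One property of the rule above is worth noting: `kw in t` is a substring test, so "win" also matches "window". The sketch below contrasts it with a whole-word variant. `substring_score` mirrors the notebook's scoring, while `token_score` and the sample message are hypothetical additions for illustration, not something the notebook uses.

```python
import re

SPAM_KEYWORDS = {
    "free", "win", "prize", "cash", "claim", "urgent",
    "offer", "click", "call", "reply", "txt", "stop"
}

def substring_score(text: str) -> int:
    # Same idea as spam_score: count keywords that occur anywhere in the text.
    t = text.lower()
    return sum(kw in t for kw in SPAM_KEYWORDS)

def token_score(text: str) -> int:
    # Hypothetical variant: count keywords that appear as whole words only.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & SPAM_KEYWORDS)

msg = "I will open the window and call you later"
print(substring_score(msg))  # 2: "win" (inside "window") and "call"
print(token_score(msg))      # 1: only "call"
```

With a threshold of 2, the substring version would flag this ordinary message as spam, while the whole-word version would not. Whether whole-word matching improves the overall scores on this dataset has not been measured here.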


In [8]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def eval_binary(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred,
        average="binary", pos_label=1,
        zero_division=0
    )
    return acc, p, r, f1


best = None
for th in [1,2,3,4,5,6]:
    val_pred, _ = baseline_predict(texts_val, threshold=th)
    acc, p, r, f1 = eval_binary(y_val, val_pred)
    print(f"threshold={th} | acc={acc:.4f}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}")
    if best is None or f1 > best["f1"]:
        best = {"th": th, "acc": acc, "p": p, "r": r, "f1": f1}

best_th = best["th"]
print("\nBest threshold on validation:", best)

threshold=1 | acc=0.8817  precision=0.5391  recall=0.8267  f1=0.6526
threshold=2 | acc=0.9265  precision=0.8542  recall=0.5467  f1=0.6667
threshold=3 | acc=0.9032  precision=1.0000  recall=0.2800  f1=0.4375
threshold=4 | acc=0.8781  precision=1.0000  recall=0.0933  f1=0.1707
threshold=5 | acc=0.8674  precision=1.0000  recall=0.0133  f1=0.0263
threshold=6 | acc=0.8656  precision=0.0000  recall=0.0000  f1=0.0000

Best threshold on validation: {'th': 2, 'acc': 0.9265232974910395, 'p': 0.8541666666666666, 'r': 0.5466666666666666, 'f1': 0.6666666666666666}
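
To make the metrics concrete, the sketch below recomputes the threshold=2 row from confusion-matrix counts. The counts are not printed by the notebook; they are inferred from the reported precision, recall, and accuracy on the 558 validation messages (75 of which are spam), so treat them as a reconstruction.

```python
# Inferred confusion counts for threshold=2 on the validation set.
tp = 41   # spam correctly flagged
fp = 7    # ham wrongly flagged as spam
fn = 34   # spam that was missed
tn = 476  # ham correctly passed

precision = tp / (tp + fp)                  # of flagged messages, how many are spam
recall = tp / (tp + fn)                     # of spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"acc={accuracy:.4f}  precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
# acc=0.9265  precision=0.8542  recall=0.5467  f1=0.6667
```

The baseline misses 34 of the 75 spam messages at this threshold, which is what holds its F1 down despite the high accuracy.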


### 3.1 AI Pipeline

In [9]:
!pip -q install sentence-transformers

In [10]:
import numpy as np
from sentence_transformers import SentenceTransformer
# Load the pretrained sentence-embedding model
encoder = SentenceTransformer("all-MiniLM-L6-v2") 

X_train = encoder.encode(list(texts_train), show_progress_bar=True)
X_val   = encoder.encode(list(texts_val), show_progress_bar=True)
X_test  = encoder.encode(list(texts_test), show_progress_bar=True)

print(X_train.shape, X_val.shape, X_test.shape)

Batches: 100%|████████████████████████████████| 140/140 [00:11<00:00, 12.16it/s]
Batches: 100%|██████████████████████████████████| 18/18 [00:00<00:00, 18.66it/s]
Batches: 100%|██████████████████████████████████| 18/18 [00:00<00:00, 44.16it/s]

(4456, 384) (558, 384) (558, 384)





#### Train the classifier (Logistic Regression) and tune the hyperparameter C on the validation set

#### Evaluate several values of the regularization parameter C on the validation set and select the best-performing one.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def eval_binary(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", pos_label=1, zero_division=0)
    return acc, p, r, f1

best = None
for C in [0.1, 1.0, 3.0, 10.0]:
    clf = LogisticRegression(max_iter=2000, C=C)
    clf.fit(X_train, y_train)

    val_pred = clf.predict(X_val)
    acc, p, r, f1 = eval_binary(y_val, val_pred)
    print(f"C={C:<4} | acc={acc:.4f}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}")

    if best is None or f1 > best["f1"]:
        best = {"C": C, "acc": acc, "p": p, "r": r, "f1": f1, "model": clf}

best_C = best["C"]
best_clf = best["model"]
print("\nBest C on validation:", {k:v for k,v in best.items() if k!='model'})

C=0.1  | acc=0.9534  precision=1.0000  recall=0.6533  f1=0.7903
C=1.0  | acc=0.9785  precision=0.9846  recall=0.8533  f1=0.9143
C=3.0  | acc=0.9821  precision=0.9851  recall=0.8800  f1=0.9296
C=10.0 | acc=0.9892  precision=1.0000  recall=0.9200  f1=0.9583

Best C on validation: {'C': 10.0, 'acc': 0.989247311827957, 'p': 1.0, 'r': 0.92, 'f1': 0.9583333333333334}


#### Evaluate once on the test set, with no further parameter tuning

### Compare with baseline

In [12]:
test_pred = best_clf.predict(X_test)
acc, p, r, f1 = eval_binary(y_test, test_pred)

print("Test performance (AI pipeline):")
print(f"accuracy={acc:.4f} precision={p:.4f} recall={r:.4f} f1={f1:.4f}")

Test performance (AI pipeline):
accuracy=0.9785 precision=0.9315 recall=0.9067 f1=0.9189


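The keyword-rule baseline, at its best validation threshold (2), scored:

Baseline F1 ≈ 0.67 (validation set)

The AI pipeline scored:

AI pipeline F1 = 0.9189 (test set)

The two numbers come from different splits, because the baseline was not evaluated on the test set, so the comparison is approximate. Even so, the gap indicates that the embedding-based pipeline substantially outperforms the naive keyword-based baseline.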
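
For a like-for-like comparison, both methods would need to be scored on the same split. Below is a minimal, dependency-free sketch of such a helper. `f1_binary`, `compare_on_split`, the toy messages, and the two toy predictors are all hypothetical names and data introduced for illustration; none of them come from the notebook.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (1 = spam); 0.0 when nothing is correctly flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def compare_on_split(texts, y_true, predictors):
    """Score every predictor (a function: list of texts -> list of 0/1) on one split."""
    return {
        name: f1_binary(y_true, [int(p) for p in predict(texts)])
        for name, predict in predictors.items()
    }

# Toy data, only to show the call pattern.
toy_texts = [
    "win a free prize now",
    "see you at lunch",
    "urgent: claim your cash",
    "call me later",
]
toy_labels = [1, 0, 1, 0]

def toy_keyword_rule(texts):
    return [1 if ("free" in t or "cash" in t) else 0 for t in texts]

def toy_always_ham(texts):
    return [0 for _ in texts]

scores = compare_on_split(
    toy_texts,
    toy_labels,
    {"keyword_rule": toy_keyword_rule, "always_ham": toy_always_ham},
)
print(scores)
```

In the notebook itself, the same pattern could be applied to `texts_test` and `y_test`, wrapping `baseline_predict` (with `best_th`) and `best_clf.predict` on the encoded texts as the two predictors.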