Expense Categorizer Using Sentence-BERT MiniLM (all-MiniLM-L6-v2)

1️⃣ Sentence-BERT MiniLM Embedding (Transfer Learning)

The model uses Sentence-BERT MiniLM (all-MiniLM-L6-v2), a pretrained transformer from the SentenceTransformers library, which converts each transaction description into a 384-dimensional semantic vector.

These embeddings capture meaning, not just keywords.

Example:
“Starbucks coffee” and “Dunkin latte” produce similar embeddings, helping classification even if exact words differ.
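
"Similar embeddings" is usually measured with cosine similarity. The sketch below shows the calculation on hypothetical 4-dimensional vectors that stand in for the real 384-dimensional MiniLM output; the numbers are made up for illustration and are not produced by the model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional stand-ins for the real 384-dimensional embeddings.
starbucks_coffee = np.array([0.9, 0.1, 0.0, 0.2])
dunkin_latte = np.array([0.8, 0.2, 0.1, 0.3])
uber_ride = np.array([0.0, 0.9, 0.4, 0.1])

sim_coffee_pair = cosine_similarity(starbucks_coffee, dunkin_latte)
sim_coffee_uber = cosine_similarity(starbucks_coffee, uber_ride)

print(f"coffee vs latte: {sim_coffee_pair:.3f}")
print(f"coffee vs uber:  {sim_coffee_uber:.3f}")
```

With the real model, the vectors would come from `encoder.encode([...])` instead of being written by hand.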

2️⃣ Logistic Regression Classifier (Supervised Learning)

The embeddings are fed into a Logistic Regression classifier.

The classifier learns patterns that associate certain semantic embeddings with categories.

Example:

Coffee, pizza, groceries → Food

Uber, tolls, metro passes → Transport

Netflix, Spotify → Entertainment

It uses balanced class weights, so errors on rare categories count more during training and frequent categories do not dominate.
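
As a rough picture of what the classifier computes, a multinomial logistic regression scores each category with a linear function of the embedding and passes the scores through a softmax. The weights, bias, and embedding below are hypothetical 3-dimensional values chosen for illustration; they are not the fitted parameters from this notebook.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

categories = ["Food", "Transport", "Entertainment"]

# Hypothetical learned parameters: one weight row and one bias per category.
# A real model on MiniLM embeddings would have 384 columns instead of 3.
W = np.array([
    [2.0, -1.0, 0.0],   # Food
    [-1.0, 2.0, 0.0],   # Transport
    [0.0, 0.0, 2.0],    # Entertainment
])
b = np.zeros(3)

embedding = np.array([0.9, 0.1, 0.2])  # hypothetical embedding of a coffee purchase

probs = softmax(W @ embedding + b)
predicted = categories[int(np.argmax(probs))]
print(dict(zip(categories, probs.round(3))), "->", predicted)
```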

3️⃣ Final Output

When you input a new transaction:

➡️ MiniLM converts it to a semantic vector

➡️ Logistic Regression predicts the category

➡️ LabelEncoder converts numeric label → human-readable category

Example:

Input: "Monthly rent payment"
Output: "Rent"
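
The label round trip in the last step can be sketched without scikit-learn. This is a pure-Python stand-in that mimics the sorted-classes behaviour of `LabelEncoder`, using the five categories from the sample rows; it is an illustration, not the encoder used in the notebook.

```python
# Pure-Python stand-in for the fit / transform / inverse_transform round trip.
categories = ["Food", "Rent", "Transport", "Utilities", "Food"]

classes = sorted(set(categories))               # alphabetical, like LabelEncoder.classes_
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in categories]  # integer labels the classifier trains on
decoded = [classes[i] for i in encoded]         # human-readable labels shown to the user

print(classes)
print(encoded)
```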


In [1]:
import pandas as pd
import numpy as np
import os
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import joblib

# Config
DATA_PATH = "/content/expense_data.csv"   # update if needed
MODEL_DIR = "models"
os.makedirs(MODEL_DIR, exist_ok=True)

# Load Dataset
data = pd.read_csv(DATA_PATH)

# Expecting columns: Description, Category
assert "Description" in data.columns and "Category" in data.columns, \
    "CSV must contain 'Description' and 'Category' columns."

# Drop missing values
data = data.dropna(subset=["Description", "Category"]).reset_index(drop=True)

print("Sample Data:")
print(data.head())


Sample Data:
                   Description   Category
0  Grocery shopping at Walmart       Food
1         Monthly rent payment       Rent
2      Uber ride to the office  Transport
3     Electricity bill payment  Utilities
4   Dining out at a restaurant       Food


In [2]:
# Remove rare categories FIRST
counts = data['Category'].value_counts()
rare_classes = counts[counts < 2].index.tolist()

print("\nDropping rare categories:", rare_classes)

data = data[~data['Category'].isin(rare_classes)].reset_index(drop=True)




Dropping rare categories: ['Shopping', 'Hobbies', 'Debt Repayment', 'Marketing']
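
The threshold of 2 matters because a stratified split needs at least two rows per class; a class with a single row cannot be stratified. A minimal sketch of the same filter on hypothetical labels, using only the standard library:

```python
from collections import Counter

# Hypothetical labels; "Hobbies" appears once, so it cannot be stratified.
labels = ["Food", "Food", "Rent", "Rent", "Hobbies", "Transport", "Transport"]

counts = Counter(labels)
rare = sorted(name for name, n in counts.items() if n < 2)
kept = [name for name in labels if counts[name] >= 2]

print("rare:", rare)
print("kept:", kept)
```

Classes that only just clear the threshold still give the classifier very little to learn from, so their test metrics should be read with caution.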


In [3]:
# Encode Labels
label_encoder = LabelEncoder()
data["label"] = label_encoder.fit_transform(data["Category"])

# Prepare inputs AFTER filtering
X_text = data["Description"].astype(str).tolist()
y = data["label"].values


In [4]:
# Train-test split
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"\nTrain size: {len(X_train_text)}, Test size: {len(X_test_text)}")
print("Number of classes:", len(label_encoder.classes_))
print("Classes:", list(label_encoder.classes_))




Train size: 90, Test size: 23
Number of classes: 20
Classes: ['Books', 'Business Expenses', 'Charity', 'Clothing', 'Education', 'Electronics', 'Entertainment', 'Food', 'Gifts', 'Health & Fitness', 'Household', 'Insurance', 'Personal Care', 'Pets', 'Professional Services', 'Rent', 'Software', 'Transport', 'Travel', 'Utilities']
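
The split sizes printed above can be checked by hand, assuming scikit-learn rounds a fractional test size up and gives the remainder to the training set. The arithmetic also shows how thin the data is per class.

```python
import math

n_samples = 90 + 23          # train + test sizes printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test sizes round up
n_train = n_samples - n_test

n_classes = 20
print(n_train, n_test)
print(f"average training rows per class: {n_train / n_classes:.1f}")
```

With roughly four to five training rows per class on average, per-class scores on the 23-row test set will move a lot from one random split to another.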


In [5]:
# Load Sentence-BERT MiniLM
print("\nLoading Sentence-BERT MiniLM (all-MiniLM-L6-v2)...")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Compute Embeddings
print("\nEncoding training sentences...")
X_train_emb = encoder.encode(
    X_train_text,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

print("\nEncoding test sentences...")
X_test_emb = encoder.encode(
    X_test_text,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Compute Class Weights
classes = np.unique(y_train)
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=classes,
    y=y_train
)
class_weight_dict = {cls: w for cls, w in zip(classes, class_weights)}

print("\nClass Weights:", class_weight_dict)



Loading Sentence-BERT MiniLM (all-MiniLM-L6-v2)...




Encoding training sentences...




Encoding test sentences...




Class Weights: {np.int64(0): np.float64(2.25), np.int64(1): np.float64(1.125), np.int64(2): np.float64(2.25), np.int64(3): np.float64(1.5), np.int64(4): np.float64(1.125), np.int64(5): np.float64(0.9), np.int64(6): np.float64(0.5), np.int64(7): np.float64(0.34615384615384615), np.int64(8): np.float64(1.5), np.int64(9): np.float64(0.6428571428571429), np.int64(10): np.float64(0.9), np.int64(11): np.float64(1.125), np.int64(12): np.float64(1.125), np.int64(13): np.float64(2.25), np.int64(14): np.float64(4.5), np.int64(15): np.float64(2.25), np.int64(16): np.float64(2.25), np.int64(17): np.float64(0.5), np.int64(18): np.float64(0.9), np.int64(19): np.float64(1.125)}
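
The printed weights are consistent with the "balanced" heuristic, n_samples / (n_classes * class_count), applied to 90 training rows and 20 classes. The class counts in the sketch are inferred from the weights rather than read from the dataset.

```python
def balanced_weight(n_samples: int, n_classes: int, class_count: int) -> float:
    """Weight given to a class by the 'balanced' heuristic."""
    return n_samples / (n_classes * class_count)

n_samples, n_classes = 90, 20   # train size and class count printed earlier

# Class counts here are inferred from the printed weights, not read from the data.
for count in (1, 2, 9, 13):
    print(count, round(balanced_weight(n_samples, n_classes, count), 4))
```

A weight of 4.5 therefore corresponds to a class with a single training row, and the smallest weight (about 0.346) to a class with 13 rows.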


In [6]:

# Train Logistic Regression classifier
clf = LogisticRegression(
    max_iter=5000,
    class_weight=class_weight_dict,
    n_jobs=-1
)

print("\nTraining classifier...")
clf.fit(X_train_emb, y_train)

# Evaluate model
y_pred = clf.predict(X_test_emb)
accuracy = accuracy_score(y_test, y_pred)

print("\n================= MODEL PERFORMANCE =================")
print(f"Accuracy: {accuracy:.4f}")

# ---- Correct Classification Report ----
test_classes = np.unique(y_test)

print("\nClassification Report:")
print(
    classification_report(
        y_test,
        y_pred,
        labels=test_classes,
        target_names=label_encoder.inverse_transform(test_classes)
    )
)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred, labels=test_classes))



Training classifier...

Accuracy: 0.6957

Classification Report:
                       precision    recall  f1-score   support

                Books       0.00      0.00      0.00         1
    Business Expenses       0.00      0.00      0.00         1
             Clothing       0.00      0.00      0.00         1
            Education       0.00      0.00      0.00         1
          Electronics       0.00      0.00      0.00         1
        Entertainment       0.67      1.00      0.80         2
                 Food       1.00      1.00      1.00         3
                Gifts       1.00      1.00      1.00         1
     Health & Fitness       1.00      1.00      1.00         2
            Household       0.50      1.00      0.67         1
            Insurance       1.00      1.00      1.00         1
        Personal Care       0.00      0.00      0.00         1
Professional Services       0.00      0.00      0.00         1
             Software       0.50      1.00      0.6

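
The figures in the report can be reproduced by hand. The sketch below recomputes the Entertainment row and the overall accuracy; the true-positive and false-positive counts, and the 16 correct predictions, are inferred from the printed values rather than taken from the confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts inferred from the Entertainment row (precision 0.67, recall 1.00, support 2):
# both true Entertainment rows were found, plus one row wrongly labelled Entertainment.
p, r, f1 = precision_recall_f1(tp=2, fp=1, fn=0)
print(f"{p:.2f} {r:.2f} {f1:.2f}")

# A class that is never predicted correctly (for example Books, support 1) scores zero.
print(precision_recall_f1(tp=0, fp=0, fn=1))

# Accuracy 0.6957 on 23 test rows corresponds to 16 correct predictions (inferred).
accuracy = 16 / 23
print(f"{accuracy:.4f}")
```

Because most classes have a support of 1 or 2, a single misprediction moves a class from 1.00 to 0.00, so the per-class rows are only a rough indication.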


In [8]:
# Save Encoder, Classifier, LabelEncoder
print("\nSaving models...")

encoder_save_path = os.path.join(MODEL_DIR, "minilm_sentence_encoder")
encoder.save(encoder_save_path)

clf_path = os.path.join(MODEL_DIR, "minilm_expense_classifier.pkl")
joblib.dump(clf, clf_path)

le_path = os.path.join(MODEL_DIR, "label_encoder.pkl")
joblib.dump(label_encoder, le_path)

print(f"\n Saved encoder to: {encoder_save_path}")
print(f" Saved classifier to: {clf_path}")
print(f" Saved label encoder to: {le_path}")
print("\n Training complete!")


Saving models...

 Saved encoder to: models/minilm_sentence_encoder
 Saved classifier to: models/minilm_expense_classifier.pkl
 Saved label encoder to: models/label_encoder.pkl

 Training complete!


In [9]:
#TESTING
import joblib
from sentence_transformers import SentenceTransformer
import os

MODEL_DIR = "models"

# Load encoder, classifier, and label encoder
encoder = SentenceTransformer(os.path.join(MODEL_DIR, "minilm_sentence_encoder"))
clf = joblib.load(os.path.join(MODEL_DIR, "minilm_expense_classifier.pkl"))
label_encoder = joblib.load(os.path.join(MODEL_DIR, "label_encoder.pkl"))

def predict_category(description: str):
    emb = encoder.encode([description], convert_to_numpy=True)
    pred_label = clf.predict(emb)[0]
    category = label_encoder.inverse_transform([pred_label])[0]
    return category

# Example:
examples = [
    "Starbucks coffee with friends",
    "Uber ride to airport",
    "Recharge Netflix subscription",
    "Electricity bill payment",
    "Dinner at a restaurant"
]

for text in examples:
    print(f"{text}  →  {predict_category(text)}")


Starbucks coffee with friends  →  Food
Uber ride to airport  →  Transport
Recharge Netflix subscription  →  Entertainment
Electricity bill payment  →  Utilities
Dinner at a restaurant  →  Food


Synthetic Data Creation for Further Testing

In [10]:
data_samples = [
    # FOOD
    ("Starbucks iced latte and sandwich", "Food"),
    ("Grocery shopping at Walmart", "Food"),
    ("Dominos Pizza delivery", "Food"),
    ("Dinner at Olive Garden", "Food"),
    ("Snacks from 7-Eleven", "Food"),
    ("Subway lunch combo", "Food"),

    # TRANSPORT
    ("Uber ride to office", "Transport"),
    ("Shell gas refill", "Transport"),
    ("Parking fee downtown", "Transport"),
    ("Taxi fare late night", "Transport"),
    ("Monthly metro pass", "Transport"),

    # SHOPPING
    ("Amazon purchase – phone charger", "Shopping"),
    ("H&M clothing shopping", "Shopping"),
    ("Target home essentials", "Shopping"),
    ("IKEA furniture purchase", "Shopping"),
    ("Best Buy laptop accessories", "Shopping"),

    # RENT
    ("Monthly apartment rent", "Rent"),
    ("Security deposit to landlord", "Rent"),
    ("Rent payment through bank", "Rent"),

    # UTILITIES
    ("Electricity bill payment", "Utilities"),
    ("Water bill for July", "Utilities"),
    ("Internet bill – Verizon", "Utilities"),
    ("Heating and gas charges", "Utilities"),

    # ENTERTAINMENT
    ("Netflix monthly subscription", "Entertainment"),
    ("Movie tickets AMC", "Entertainment"),
    ("Spotify premium renewal", "Entertainment"),
    ("Bowling alley games", "Entertainment"),

    # HEALTH
    ("Pharmacy medical supplies", "Health"),
    ("Doctor consultation fee", "Health"),
    ("Dental cleaning appointment", "Health"),
    ("Health insurance premium", "Health"),

    # MISC
    ("Gift purchase for friend", "Misc"),
    ("Charity donation", "Misc"),
    ("Stationery purchase at Staples", "Misc"),
    ("Bank ATM withdrawal fee", "Misc"),
]

df_generated = pd.DataFrame(data_samples, columns=["Description", "Category"])
df_generated.to_csv("generated_test_data.csv", index=False)

print("Generated dataset saved as generated_test_data.csv")
df_generated.head()


Generated dataset saved as generated_test_data.csv


                         Description  Category
0  Starbucks iced latte and sandwich      Food
1        Grocery shopping at Walmart      Food
2             Dominos Pizza delivery      Food
3             Dinner at Olive Garden      Food
4               Snacks from 7-Eleven      Food


In [11]:
MODEL_DIR = "models"

# Load encoder, classifier, and label encoder
encoder = SentenceTransformer(os.path.join(MODEL_DIR, "minilm_sentence_encoder"))
clf = joblib.load(os.path.join(MODEL_DIR, "minilm_expense_classifier.pkl"))
label_encoder = joblib.load(os.path.join(MODEL_DIR, "label_encoder.pkl"))

def predict_category(description: str):
    emb = encoder.encode([description], convert_to_numpy=True)
    pred_label = clf.predict(emb)[0]
    category = label_encoder.inverse_transform([pred_label])[0]
    return category

# Load your generated test data
test_df = pd.read_csv("generated_test_data.csv")

print("\n=== Testing MiniLM Classifier on Generated Dataset ===\n")

for i, row in test_df.iterrows():
    desc = row["Description"]
    actual = row["Category"]
    predicted = predict_category(desc)
    print(f"Text: {desc}")
    print(f"Actual Category   → {actual}")
    print(f"Predicted Category → {predicted}")
    print("-" * 60)



=== Testing MiniLM Classifier on Generated Dataset ===

Text: Starbucks iced latte and sandwich
Actual Category   → Food
Predicted Category → Food
------------------------------------------------------------
Text: Grocery shopping at Walmart
Actual Category   → Food
Predicted Category → Food
------------------------------------------------------------
Text: Dominos Pizza delivery
Actual Category   → Food
Predicted Category → Business Expenses
------------------------------------------------------------
Text: Dinner at Olive Garden
Actual Category   → Food
Predicted Category → Food
------------------------------------------------------------
Text: Snacks from 7-Eleven
Actual Category   → Food
Predicted Category → Food
------------------------------------------------------------
Text: Subway lunch combo
Actual Category   → Food
Predicted Category → Food
------------------------------------------------------------
Text: Uber ride to office
Actual Category   → Transport
Predicted Category

In [12]:
def predict_category(text):
    emb = encoder.encode([text], convert_to_numpy=True)
    pred = clf.predict(emb)[0]
    return label_encoder.inverse_transform([pred])[0]

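# Make predictions
# Note: "Shopping", "Health", and "Misc" are not among the classes the model was
# trained on, so rows with those labels can never be scored as correct.
test_df["Predicted"] = test_df["Description"].apply(predict_category)
y_true = test_df["Category"]
y_pred = test_df["Predicted"]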

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print("========== EVALUATION RESULTS ==========")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", cm)


Accuracy: 0.4857

Classification Report:
                        precision    recall  f1-score   support

    Business Expenses       0.00      0.00      0.00         0
              Charity       0.00      0.00      0.00         0
             Clothing       0.00      0.00      0.00         0
          Electronics       0.00      0.00      0.00         0
        Entertainment       1.00      0.50      0.67         4
                 Food       0.83      0.83      0.83         6
                Gifts       0.00      0.00      0.00         0
               Health       0.00      0.00      0.00         4
            Household       0.00      0.00      0.00         0
            Insurance       0.00      0.00      0.00         0
                 Misc       0.00      0.00      0.00         4
        Personal Care       0.00      0.00      0.00         0
Professional Services       0.00      0.00      0.00         0
                 Rent       1.00      1.00      1.00         3
            

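
A likely contributor to the 0.4857 accuracy is a label mismatch: the generated test set uses category names the classifier was never trained on. The sketch below compares the trained class list with the generated labels and computes the best accuracy that is reachable under that mismatch. Row counts are tallied from the `data_samples` list.

```python
# Classes the classifier was trained on (copied from the "Classes:" output earlier).
trained = {
    "Books", "Business Expenses", "Charity", "Clothing", "Education",
    "Electronics", "Entertainment", "Food", "Gifts", "Health & Fitness",
    "Household", "Insurance", "Personal Care", "Pets", "Professional Services",
    "Rent", "Software", "Transport", "Travel", "Utilities",
}

# Rows per category in the generated test set.
generated = {
    "Food": 6, "Transport": 5, "Shopping": 5, "Rent": 3,
    "Utilities": 4, "Entertainment": 4, "Health": 4, "Misc": 4,
}

unseen = sorted(set(generated) - trained)
n_total = sum(generated.values())
n_unreachable = sum(generated[name] for name in unseen)

print("labels the model cannot output:", unseen)
print(f"accuracy ceiling: {(n_total - n_unreachable) / n_total:.4f}")
```

Mapping the generated labels onto the trained ones (for example "Health" to "Health & Fitness") before scoring, or scoring only the rows whose labels the model knows, would give a fairer picture of the classifier.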
