# Part 1: Theory
We trained a classifier… now what?

https://medium.com/@vivek.bharti31/from-notebook-to-production-what-most-ml-tutorials-dont-teach-5bdea33b20bb

# Part 2: Build the Spam Classifier

In this post, we’ll train a spam classifier using the SMS Spam Collection dataset — but we’ll take a production-minded, real-world approach at every step. That means:

✅ Creating a holdout set for final evaluation
✅ Preprocessing the data robustly
✅ Comparing multiple models with real-world tradeoffs
✅ Optimizing for the right evaluation metric
✅ Saving the model for real-world deployment

https://medium.com/@vivek.bharti31/build-a-spam-classifier-like-a-production-ml-engineer-05acb540c9c3

## Step 1: Load & Explore the SMS Spam Dataset (UCI Repository)

In [13]:
import pandas as pd

df = pd.read_csv("../data/SMSSpamCollection", sep="\t", header=None, names=["label", "text"])
print(df.label.value_counts())
df.head()

label
ham     4825
spam     747
Name: count, dtype: int64


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Step 2: Create a Holdout Set (To Mimic the Real World)

In [14]:
# Most tutorials split into train/test. But in production, you almost always need an unseen holdout set — 
# data that stays hidden until the very end. This helps evaluate your final model more realistically.

from sklearn.model_selection import train_test_split

df_train_val, df_holdout = train_test_split(
    df, test_size=0.1, stratify=df['label'], random_state=42
)

In [15]:
# save these to disk to simulate a real ML pipeline where training, testing, and deployment can be handled separately
df_train_val.to_csv('../data/raw/spam_train_val.csv', index=False)
df_holdout.to_csv('../data/raw/spam_holdout.csv', index=False)
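
As a quick sanity check on the stratified split, the class shares in the holdout set can be compared with those in the full data. The sketch below uses a synthetic DataFrame (an assumption, built to roughly mirror the ham/spam imbalance) rather than the real file, so it runs on its own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SMS data (assumed imbalance: 87% ham, 13% spam)
toy = pd.DataFrame({
    "label": ["ham"] * 870 + ["spam"] * 130,
    "text": [f"message {i}" for i in range(1000)],
})

# Same call pattern as above: 10% holdout, stratified on the label
train_val, holdout = train_test_split(
    toy, test_size=0.1, stratify=toy["label"], random_state=42
)

# Compare class proportions in the full data and in the holdout set
full_share = toy["label"].value_counts(normalize=True)
holdout_share = holdout["label"].value_counts(normalize=True)
print(pd.DataFrame({"full": full_share, "holdout": holdout_share}).round(3))

# The two parts should not share any rows
no_overlap = set(train_val.index).isdisjoint(holdout.index)
print("No overlap between train/val and holdout:", no_overlap)
```

If the shares differ noticeably, the split was probably not stratified on the label column.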

## Step 3: Preprocess the Text

In [16]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import string, nltk

nltk.download('punkt_tab')
nltk.download('stopwords')

# Build the stop-word set and the stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text):
    # Lowercase and strip punctuation before tokenizing
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    # Drop stop words, then stem the remaining tokens
    tokens = [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]
    return ' '.join(tokens)

df_train_val['clean_text'] = df_train_val['text'].apply(preprocess_text)
df_train_val['label_num'] = df_train_val['label'].map({'ham': 0, 'spam': 1})

df_train_val.head()

[nltk_data] Downloading package punkt_tab to /Users/thi/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/thi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | label | text | clean_text | label_num |
|---|---|---|---|---|
| 3398 | ham | Heehee that was so funny tho | heehe funni tho | 0 |
| 3325 | ham | I don wake since. I checked that stuff and saw... | wake sinc check stuff saw true avail space pl ... | 0 |
| 2498 | ham | Dai what this da.. Can i send my resume to thi... | dai da send resum id | 0 |
| 1553 | ham | U too... | u | 0 |
| 46 | ham | Didn't you get hep b immunisation in nigeria. | didnt get hep b immunis nigeria | 0 |


In [17]:
# split into training and test 

X = df_train_val['clean_text']
y = df_train_val['label_num']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

## Step 4: Train and Compare Multiple Models

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

models = {
    'LogisticRegression': LogisticRegression(class_weight='balanced', max_iter=1000),
    'SVM': SVC(kernel='linear', class_weight='balanced', probability=True),
    'MultinomialNB': MultinomialNB(),
    'RandomForest': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
}

for name, clf in models.items():
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', clf),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f"\nModel: {name}")
    print(classification_report(y_test, y_pred))


Model: LogisticRegression
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       869
           1       0.92      0.92      0.92       134

    accuracy                           0.98      1003
   macro avg       0.96      0.95      0.95      1003
weighted avg       0.98      0.98      0.98      1003


Model: SVM
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       869
           1       0.97      0.90      0.93       134

    accuracy                           0.98      1003
   macro avg       0.98      0.95      0.96      1003
weighted avg       0.98      0.98      0.98      1003


Model: MultinomialNB
              precision    recall  f1-score   support

           0       0.95      1.00      0.98       869
           1       1.00      0.68      0.81       134

    accuracy                           0.96      1003
   macro avg       0.98      0.84      0.89      1003
weighted avg 

Result (from tutorial): Logistic Regression gave the best balance of precision and recall for the minority spam class (1). SVM matched it on accuracy at two decimals (0.98) and had higher spam precision (0.97 vs. 0.92), but its spam recall was lower (0.90 vs. 0.92), so we selected logistic regression as our final model. In this case, **recall is more important because we don’t want to misclassify spam as ham**: missing a spam message is treated as worse than flagging a legitimate one, even if it means slightly more false positives.
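
To make the precision/recall tradeoff concrete, here is a small worked sketch on made-up labels (the ten labels and predictions are illustrative only, not taken from the dataset). It derives both metrics from the confusion-matrix counts and checks them against scikit-learn's helpers:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = spam, 0 = ham (illustrative values only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of actual spam that was caught
recall = tp / (tp + fn)
# Precision: share of messages flagged as spam that really were spam
precision = tp / (tp + fp)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"recall={recall:.2f} precision={precision:.2f}")

# Same values via the library helpers
lib_recall = recall_score(y_true, y_pred)
lib_precision = precision_score(y_true, y_pred)
```

A missed spam message counts as a false negative and lowers recall; a flagged legitimate message counts as a false positive and lowers precision.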

### Optional: Grid Search for Spam Recall
To validate our default model and see if hyperparameter tuning might yield improvement, we ran a grid search using GridSearchCV, scoring by recall on the spam class. This helps us ensure we're not missing a better configuration.

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

param_grid = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__class_weight': [None, 'balanced'],
}

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, solver='liblinear')),
])

grid = GridSearchCV(pipeline, param_grid, scoring=make_scorer(recall_score, pos_label=1), cv=5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words='english')),
                                       ('clf',
                                        LogisticRegression(max_iter=1000,
                                                           solver='liblinear'))]),
             param_grid={'clf__C': [0.01, 0.1, 1, 10],
                         'clf__class_weight': [None, 'balanced']},
             scoring=make_scorer(recall_score, response_method='predict', pos_label=1))

In [8]:
print("Best Params:", grid.best_params_)
print("Best Recall Score:", grid.best_score_)

Best Params: {'clf__C': 10, 'clf__class_weight': 'balanced'}
Best Recall Score: 0.9088785046728972


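Model Selection Note (from tutorial): The best grid search configuration (C=10, class_weight='balanced') reached a mean cross-validated spam recall of about 0.91, while the default model reached 0.92 recall on the test split. The two figures come from different data splits, so the gap is indicative rather than exact. The tutorial reports slightly higher precision for the tuned model; since our primary goal is catching spam, we chose the default model. In production ML, the goal is not the most optimized score but the right business metric.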
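
One way to see recall and precision side by side for every grid candidate is multi-metric scoring in GridSearchCV. The sketch below is a minimal illustration on a tiny invented corpus (the messages, the reduced grid, and cv=3 are assumptions chosen so it runs quickly); it is not the tutorial's actual search:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny invented corpus, repeated so every CV fold contains both classes
ham = ["see you at lunch", "call me when you are home", "ok thanks for the update",
       "are we still meeting tomorrow", "happy birthday to you"]
spam = ["win a free prize now", "free entry claim your cash",
        "urgent call now to win cash"]
texts = (ham + spam) * 6
labels = ([0] * len(ham) + [1] * len(spam)) * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Track two metrics; refit names the one used to pick the best candidate
grid = GridSearchCV(
    pipeline,
    {"clf__C": [0.1, 1, 10]},
    scoring={"recall": "recall", "precision": "precision"},
    refit="recall",
    cv=3,
)
grid.fit(texts, labels)

# cv_results_ holds one mean score per metric and candidate
results = grid.cv_results_
for params, rec, prec in zip(results["params"],
                             results["mean_test_recall"],
                             results["mean_test_precision"]):
    print(params, "recall:", round(rec, 3), "precision:", round(prec, 3))
```

Because all candidates are scored on the same folds, their recall and precision can be compared directly.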

## Step 5: Retrain on Full Data and Save the Model

In [22]:
X_full = pd.concat([X_train, X_test])
y_full = pd.concat([y_train, y_test])

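We now save the trained model as a pipeline that includes both the TF-IDF vectorizer and the logistic regression classifier. This ensures the same vectorization is applied during inference. Note that preprocess_text (lowercasing, punctuation removal, stop-word removal, stemming) is not part of the pipeline, so it must still be applied to raw text before calling the model, as Step 6 does.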

In [23]:
import pickle
import os

# Create the folder if it does not exist
os.makedirs("../models", exist_ok=True)

final_model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
final_model.fit(X_full, y_full)
with open('../models/logreg_spam_pipeline.pkl', 'wb') as f:
    pickle.dump(final_model, f)
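
If you would rather have the text cleaning travel with the model, one option is to pass a cleaning function to TfidfVectorizer through its preprocessor argument, so the pipeline accepts raw text directly. The sketch below is an illustration under assumptions: simple_clean is a hypothetical, simplified stand-in for preprocess_text (no stemming, no NLTK), and the training messages are invented.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def simple_clean(text):
    """Hypothetical stand-in for preprocess_text: lowercase, strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Invented training data: 1 = spam, 0 = ham
texts = ["Win a FREE prize now!!!", "Free cash, claim today!",
         "See you at lunch?", "Thanks, talk tomorrow."]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    # The callable replaces the vectorizer's built-in lowercasing step
    ("tfidf", TfidfVectorizer(preprocessor=simple_clean, stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(texts, labels)

# Raw and pre-cleaned input go through the same cleaning inside the pipeline
raw_pred = pipe.predict(["FREE PRIZE!!!"])
clean_pred = pipe.predict(["free prize"])
print(raw_pred, clean_pred)
```

One caveat with this design: pickle stores the function by reference, so simple_clean would have to be importable wherever the saved pipeline is loaded.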

## Step 6: Final Check on Holdout Set
Our last step: test the model on our untouched holdout set. This final evaluation step ensures that the model generalizes well and hasn’t overfitted to the training or test data. Based on the results, we observe strong performance with no signs of overfitting.

In [24]:
df_holdout = pd.read_csv('../data/raw/spam_holdout.csv')
df_holdout['label_num'] = df_holdout['label'].map({'ham': 0, 'spam': 1})
df_holdout['clean_text'] = df_holdout['text'].apply(preprocess_text)

with open('../models/logreg_spam_pipeline.pkl', 'rb') as f:
    model = pickle.load(f)
X_holdout = df_holdout['clean_text']
y_holdout = df_holdout['label_num']
y_pred = model.predict(X_holdout)
print(classification_report(y_holdout, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       483
           1       0.88      0.89      0.89        75

    accuracy                           0.97       558
   macro avg       0.93      0.94      0.93       558
weighted avg       0.97      0.97      0.97       558



Final Results on Holdout Set: The model achieved 97% accuracy and 89% recall on the spam class, close to the 92% recall seen on the test split. This suggests the model generalizes to unseen data, which is an encouraging sign for production use.

# Part 3: Serving ML with Flask: Your First Spam Detection API

In this post, we’ll build a lightweight Flask API that takes an SMS message and tells you whether it’s spam — in real time.

https://medium.com/@vivek.bharti31/serving-ml-with-flask-your-first-spam-detection-api-7f0a1669726e

### Why APIs Matter for ML

Most machine learning models never make it into production. When they do, it’s usually through an API — a simple, structured interface that lets other software talk to your model.

Whether it’s a web app, a mobile app, or a data pipeline, an API makes your model accessible to the world. Even a lightweight Flask app is a huge step toward production.

### Quick Intro to Flask

Flask is a micro web framework in Python. It’s ideal for rapid prototyping and ML demos.

In [26]:
# app.py

from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    return "Hello, world!"

if __name__ == '__main__':
    app.run(debug=True)

With just a few lines, you’ve spun up a working web server. Now let’s connect this to our saved spam classifier.

### Load the Model and Handle Predictions

We’ll load the model pipeline we saved earlier; it already includes both the TF-IDF vectorizer and the logistic regression classifier. The endpoint applies a custom probability threshold instead of the default 0.5. Note that the code below passes the request text straight to the pipeline; because the model was trained on text cleaned with preprocess_text, applying that same function before prediction would keep inference consistent with training.
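
The effect of a custom threshold can be sketched on a few made-up spam probabilities (the values are illustrative; 0.62 matches the BEST_THRESHOLD constant used in the API code):

```python
import numpy as np

# Made-up spam probabilities, as predict_proba(...)[:, 1] might return them
probs = np.array([0.10, 0.55, 0.62, 0.90])

DEFAULT_THRESHOLD = 0.5
CUSTOM_THRESHOLD = 0.62

# 1 = spam, 0 = ham
default_preds = (probs >= DEFAULT_THRESHOLD).astype(int)
custom_preds = (probs >= CUSTOM_THRESHOLD).astype(int)

print("default:", default_preds.tolist())
print("custom: ", custom_preds.tolist())
```

Raising the threshold turns borderline cases (here the message at 0.55) from spam into ham, which trades some recall for precision.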

In [None]:
from flask import Flask, request, jsonify
import pickle

# --- Optional: NLTK, only needed if you tokenize manually ---
# import nltk
# nltk.download('punkt', quiet=True)
# nltk.download('stopwords', quiet=True)
# from nltk.corpus import stopwords
# from nltk.stem import PorterStemmer
# from nltk.tokenize import word_tokenize

app = Flask(__name__)

# Load the model (pipeline with TF-IDF + LogisticRegression)
with open('models/logreg_spam_pipeline.pkl', 'rb') as f:
    logreg_pipeline = pickle.load(f)

BEST_THRESHOLD = 0.620

@app.route('/', methods=['GET'])
def home():
    return jsonify(status="ok", message="Spam API is running"), 200

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True) or {}
    text = data.get('text')
    if not text:
        return jsonify({'error': 'No text provided'}), 400

    # The raw text goes directly into the pipeline here. The pipeline was
    # fitted on text cleaned with preprocess_text, so apply that function
    # first if you want preprocessing identical to training.
    prob = float(logreg_pipeline.predict_proba([text])[0][1])
    pred = int(prob >= BEST_THRESHOLD)
    label = 'spam' if pred == 1 else 'ham'
    return jsonify({'prediction': label, 'probability_spam': prob})

if __name__ == '__main__':
    # Debug is fine locally, but keep the reloader off so nothing runs twice
    app.run(debug=True, use_reloader=False)

This API accepts POST requests with a JSON payload like { "text": "You've won a free prize! Click here" } and returns whether it’s spam/ham.
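
The request/response contract described above can also be checked without a running server by using Flask's built-in test client. The sketch below is self-contained under one loud assumption: StubModel is a hypothetical stand-in for the pickled pipeline (it simply flags texts containing "free"), so the numbers it returns are not real model output.

```python
from flask import Flask, request, jsonify


class StubModel:
    """Hypothetical stand-in for the pickled pipeline (not a real classifier)."""

    def predict_proba(self, texts):
        # Pretend any text containing "free" is very likely spam
        return [[0.1, 0.9] if "free" in t.lower() else [0.9, 0.1] for t in texts]


app = Flask(__name__)
model = StubModel()
THRESHOLD = 0.62


@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(silent=True) or {}
    text = data.get("text")
    if not text:
        return jsonify({"error": "No text provided"}), 400
    prob = float(model.predict_proba([text])[0][1])
    label = "spam" if prob >= THRESHOLD else "ham"
    return jsonify({"prediction": label, "probability_spam": prob})


# The test client sends requests in-process; no server is started
client = app.test_client()
ok_resp = client.post("/predict", json={"text": "You've won a free prize! Click here"})
bad_resp = client.post("/predict", json={})

print(ok_resp.status_code, ok_resp.get_json())
print(bad_resp.status_code, bad_resp.get_json())
```

The same pattern works with the real app if you import it and call its test_client() instead of defining a stub.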

### Running the API Locally

To start the Flask server, run the following command from the terminal in the same directory as your app.py file: 

python3 app.py

By default, this will launch the API at http://localhost:5000.

### Test with Postman or cURL

Once your Flask server is running, you can test it with either of the tools below.

### Postman (GUI Tool)

Postman is a popular desktop app that helps developers test APIs without writing code. It provides a user-friendly interface to craft requests and view responses.

Steps:

1. Open Postman and set the method to POST
2. Set the URL to http://localhost:5000/predict
3. Go to the Body tab → select raw → choose JSON
4. Paste this JSON:

{ "text": "You've won a free prize! Click here" }

5. Click Send and check the response

### cURL (Command Line Tool)

If you prefer the terminal, cURL is a command-line tool to send HTTP requests:

In [28]:
curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "You’ve won a free prize! Click here"}'

### Security and Production Tips

This is a minimal setup — perfect for learning, but not secure for public exposure. Here are a few things to keep in mind:

    Add input validation and logging
    Limit request size and rate
    Disable debug mode in production
    Serve the app with a production WSGI server such as Gunicorn (uvicorn targets ASGI apps, so Flask would need an adapter to run on it)
    Eventually, containerize your app with Docker or deploy it to a cloud platform like AWS, GCP, or Heroku
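
As a starting point for the first two tips, input validation and logging can live in a small helper that the /predict handler calls before touching the model. This is a minimal sketch, not a complete hardening: validate_payload, the spam_api logger name, and the 1000-character limit are all hypothetical choices made for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("spam_api")  # hypothetical logger name

MAX_TEXT_CHARS = 1000  # hypothetical limit; tune to your use case


def validate_payload(data):
    """Return (text, error); exactly one of the two is None."""
    if not isinstance(data, dict):
        return None, "Body must be a JSON object"
    text = data.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, "Field 'text' must be a non-empty string"
    if len(text) > MAX_TEXT_CHARS:
        return None, f"Field 'text' exceeds {MAX_TEXT_CHARS} characters"
    return text.strip(), None


# Try a few payloads and log every rejection
samples = [{"text": " Free prize! "}, {}, {"text": 42}, {"text": "x" * 2000}]
for payload in samples:
    text, error = validate_payload(payload)
    if error:
        logger.warning("Rejected request: %s", error)
    else:
        logger.info("Accepted text of length %d", len(text))
```

In the Flask handler, a non-None error would map to a 400 response. Flask's MAX_CONTENT_LENGTH config setting can additionally cap the raw request body size.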