MLOps Capstone Project: Email Spam Detector

Project Description

This capstone project demonstrates data and modelling pipelines built around the main aspects of MLOps: modelling experimentation and tracking, model registry, workflow orchestration, model deployment and monitoring.


Problem Statement and Objective

Many spam emails still manage to get into my inbox daily, and manually reporting each one as spam is repetitive and time-wasting. The goal is to automate this process with a spam detector, treating the problem as a text classification task.

Steps Overview

The order of the pipeline is as follows:

  1. Data Collection
  2. Model Experimentation and Tracking, and Orchestration
  3. Model Deployment
  4. Monitoring and Orchestration

Pipeline Steps Execution

Data Collection

Instead of building an API to fetch emails from my email client, I decided to simulate it using Deysi/spam-detection-dataset. The dataset consists of 8,180 train samples and 2,730 test samples. A small subset from each of the training and test datasets is used to train and test the model, respectively. The test subset also acts as the reference data for data drift monitoring. Whenever unseen samples are needed, data is randomly sampled from the training dataset.
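
For illustration, a minimal sketch of this sampling step, assuming the Hugging Face datasets library (subset sizes here are arbitrary; the actual sizes are set in the training code):

# Minimal sketch of the data-collection step; assumes the Hugging Face
# `datasets` library, subset sizes are illustrative.
from datasets import load_dataset

raw = load_dataset("Deysi/spam-detection-dataset")  # splits: train / test

# Small, reproducible subsets for training and testing; the test subset
# doubles as the reference data for drift monitoring.
train_subset = raw["train"].shuffle(seed=42).select(range(1000))
test_subset = raw["test"].shuffle(seed=42).select(range(300))

# "Unseen" data is simulated by sampling again from the training split.
unseen = raw["train"].shuffle(seed=7).select(range(200))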

Model Experimentation and Tracking

I first experimented with the solution in a Jupyter notebook, notebook/basic_modelling.ipynb, and then refactored the notebook code into model_training/training.py. Amazon EC2, Amazon RDS and Amazon S3 are set up to host the MLFlow tracking server, store the MLFlow metadata and store the artifacts, respectively.

Once an Amazon EC2 instance is running and MLFlow is installed on it, run this command with all the placeholders substituted:

mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://postgres:<password>@<aws_rds_hostname>:5432/mlflow_db --default-artifact-root s3://<s3_bucket_name>

The MLFlow UI will be available at http://<ec2_public_address>:5000

Once the MLFlow tracking server is ready, the training code can be run. The overview of model_training/training.py is as follows (a condensed sketch is shown after the list):

  1. Initialize MLFlow tracking URI and experiment name
  2. Get the training and test datasets
  3. Data preprocessing
  4. Model training and hyperparameter tuning
  5. Model registry staging
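
A condensed sketch of these steps (helper function names and the experiment/model names are illustrative; the actual implementation is in model_training/training.py):

# Condensed sketch of the training script; helper names are illustrative.
import os
import mlflow

# 1. Point MLFlow at the EC2 tracking server and pick an experiment
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])  # http://<ec2_public_address>:5000
mlflow.set_experiment("spam-detector")

# 2-3. Get and preprocess the data (hypothetical helpers)
train_df, test_df = get_datasets()
X_train, y_train, X_test, y_test = preprocess(train_df, test_df)

# 4-5. Train, log metrics and register the model
with mlflow.start_run():
    model = train_model(X_train, y_train)  # includes hyperparameter tuning
    mlflow.log_metric("test_accuracy", float(model.score(X_test, y_test)))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="spam-detector",
    )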

To ease the modelling process, the training flow is deployed to Prefect Cloud. For Prefect version 2.11.0, these are the steps to create a deployment (a minimal sketch of the flow entry point follows the list):

  1. Make sure terminal directory is at root.
  2. Run prefect init and follow the prompts; I chose the git recipe
  3. Run prefect deploy and follow the UI instructions
  4. Run prefect worker start --pool 'mlops-capstone' to start worker
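
The deployment's entry point is a Prefect flow wrapping the training script; a minimal sketch (task bodies are elided, names are illustrative):

# Minimal sketch of the Prefect flow that the deployment points to;
# task bodies are elided, names are illustrative.
from prefect import flow, task

@task
def load_data():
    ...

@task
def train_and_register_model():
    ...

@flow(name="Spam Detector Capstone")
def main_flow():
    load_data()
    train_and_register_model()

if __name__ == "__main__":
    main_flow()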

Once the deployment is successful, a flow run can be executed from Spam Detector Capstone/mlops-capstone-spam-detector.

At the end of the run, a model will be staged in MLFlow Model Registry.
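
Staging can be done through the MLflow client API; a sketch, assuming the registered model is named spam-detector:

# Sketch of promoting the newest registered version to Staging
# (model name is illustrative; MLFlow 2.x client API).
from mlflow.tracking import MlflowClient

client = MlflowClient()
latest = client.get_latest_versions("spam-detector", stages=["None"])[0]
client.transition_model_version_stage(
    name="spam-detector",
    version=latest.version,
    stage="Staging",
)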

Model Deployment

In this section, I created an inference script, inference.py, with the following steps (a condensed sketch follows the list):

  1. Get unseen data, simulated by getting randomly from the training dataset
  2. Preprocess unseen data
  3. Initialize MLFlow tracking URI
  4. Load production model from MLFlow Model Registry
  5. Predict on the unseen data
  6. Write results to parquet, to simulate passing the results to email client API
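
A condensed sketch of the script (bucket name, model name and the helper functions are illustrative; the real logic lives in inference.py):

# Condensed sketch of inference.py; helper functions, bucket and model
# names are illustrative.
from datetime import datetime
import mlflow

mlflow.set_tracking_uri("http://<ec2_public_address>:5000")

# 1-2. Simulate unseen data and preprocess it (hypothetical helpers)
unseen_df = get_unseen_data()
features = preprocess(unseen_df)

# 3-5. Load the production model from the registry and predict
model = mlflow.pyfunc.load_model("models:/spam-detector/Production")
unseen_df["prediction"] = model.predict(features)

# 6. Write the results to S3, partitioned by the current date
#    (writing directly to s3:// paths assumes s3fs is installed)
key = datetime.utcnow().strftime("%Y/%m/%d")
unseen_df.to_parquet(f"s3://<s3_bucket_name>/{key}/spam_detection.parquet")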

The inference script is containerized using a Dockerfile, which can be hosted on Amazon ECS.

Steps to build and run the container:

  1. Make sure terminal directory is at root.
  2. Run docker build -t spam-detection-inference:v1 -f .\deployment\Dockerfile .
  3. Run docker run -it -e AWS_ACCESS_KEY_ID=<XXX> -e AWS_SECRET_ACCESS_KEY=<XXX> spam-detection-inference:v1 <MLFLOW_TRACKING_URL> with all the placeholders substituted.

Once the run is successful, you should see a spam_detection.parquet file in the S3 bucket, with the current year and date as part of the prefix key.

Monitoring

The monitoring section leverages the Evidently AI library. I have mainly modified evidently_metrics_calculation.py from the course to fit my use case. The steps of the script are:

  1. Prepare the PostgreSQL database and table (a sketch follows this list)
  2. Initialize MLFlow tracking URI
  3. Load production model from MLFlow Model Registry
  4. Get unseen data, simulated by getting randomly from the training dataset
  5. Preprocess unseen data
  6. Predict on the unseen data
  7. Get reference data from MLFlow production model run
  8. Calculate drift metrics at 5 random time intervals
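
As a sketch of step 1, the metrics table could be prepared like this (connection settings and the schema are illustrative; the real ones are defined in evidently_metrics_calculation.py):

# Sketch of preparing the metrics table (step 1); connection settings and
# the schema below are illustrative.
import psycopg

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS embedding_drift_metrics (
    timestamp       TIMESTAMP,
    model_drift     BOOLEAN,
    mmd_drift       BOOLEAN,
    cosine_drift    BOOLEAN,
    drift_detected  BOOLEAN
);
"""

with psycopg.connect("host=localhost port=5432 user=postgres password=example") as conn:
    conn.execute(CREATE_TABLE_SQL)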

From the root directory, run

docker compose -f .\monitoring\docker-compose.yml up --build -d

to prepare PostgreSQL, Adminer and Grafana. Likewise, this can be deployed to Amazon ECS.

Since the text embeddings play a large role in the model's performance, three embedding drift detection methods are used: classifier model, maximum mean discrepancy and cosine distance. This is based on a good blog write-up by Evidently AI. Drift is considered to have occurred when at least two of the three methods detect it. When drift is detected, a flow from the model training deployment Spam Detector Capstone/mlops-capstone-spam-detector is automatically run to retrain the model.
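
A sketch of the majority-vote drift check and the retraining trigger (the embedding column names and data frames are illustrative; Evidently's embedding drift metric and Prefect's run_deployment, roughly as of the versions used here):

# Sketch of the drift check and retraining trigger; embedding column names
# are illustrative (Evidently ~0.4.x, Prefect 2.x).
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import EmbeddingsDriftMetric
from evidently.metrics.data_drift.embedding_drift_methods import model, mmd, distance
from prefect.deployments import run_deployment

embedding_cols = [c for c in reference_df.columns if c.startswith("emb_")]
column_mapping = ColumnMapping(embeddings={"text_embeddings": embedding_cols})

report = Report(metrics=[
    EmbeddingsDriftMetric("text_embeddings", drift_method=model()),                 # classifier model
    EmbeddingsDriftMetric("text_embeddings", drift_method=mmd()),                   # maximum mean discrepancy
    EmbeddingsDriftMetric("text_embeddings", drift_method=distance(dist="cosine")), # cosine distance
])
report.run(reference_data=reference_df, current_data=current_df,
           column_mapping=column_mapping)

# Count how many of the three methods flagged drift
n_drifts = sum(m["result"]["drift_detected"] for m in report.as_dict()["metrics"])

# Retrain when at least two of the three methods agree
if n_drifts >= 2:
    run_deployment(name="Spam Detector Capstone/mlops-capstone-spam-detector")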

The monitoring script has also been deployed to Prefect Cloud using the same steps described earlier, and it is scheduled to run daily. A flow run can be executed from Batch Monitoring/batch-drift-monitoring.

Once everything has run successfully, Adminer can be accessed at http://localhost:8081/ and you should see some data in the embedding_drift_metrics table.

The Grafana dashboard can be accessed from http://localhost:3001/.

Code best practices

The following have been developed:

  1. Unit tests
  2. Integration tests
  3. Automatic code formatting using Black
  4. Makefile
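
As an example of the test style (the helper and its expected behaviour are hypothetical stand-ins for the project's actual tests):

# Illustrative pytest unit test; the helper and the expected behaviour are
# hypothetical stand-ins for the real tests.
from model_training.training import preprocess_text  # hypothetical helper

def test_preprocess_returns_clean_lowercase_text():
    assert preprocess_text("  WIN a FREE prize  ") == "win a free prize"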
