MLOps Capstone Project: Email Spam Detector

Project Description

This capstone project demonstrates data and modelling pipelines built around the main aspects of MLOps: modelling experimentation and tracking, model registry, workflow orchestration, model deployment and monitoring.


Problem Statement and Objective

Many spam emails still manage to get into my inbox daily, and manually reporting each one as spam is repetitive and time-wasting. The goal is to automate this process with a spam detector, treating the problem as a text classification task.

Steps Overview

The order of the pipeline is as follows:

  1. Data Collection
  2. Model Experimentation and Tracking, and Orchestration
  3. Model Deployment
  4. Monitoring and Orchestration

Pipeline Steps Execution

Data Collection

Instead of building an API to fetch emails from my email client, I decided to simulate it using Deysi/spam-detection-dataset. The dataset consists of 8,180 train samples and 2,730 test samples. A small subset from each of the training and test datasets is used to train and test the model, respectively. The test subset also acts as the reference data for data drift monitoring. Whenever unseen samples are needed, data is randomly sampled from the training dataset.
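
For illustration, a minimal sketch of this sampling step, assuming the Hugging Face datasets library (subset sizes here are arbitrary; the actual sizes are set in the training code):

# Minimal sketch of the data-collection step; assumes the Hugging Face
# `datasets` library, subset sizes are illustrative.
from datasets import load_dataset

raw = load_dataset("Deysi/spam-detection-dataset")  # splits: train / test

# Small, reproducible subsets for training and testing; the test subset
# doubles as the reference data for drift monitoring.
train_subset = raw["train"].shuffle(seed=42).select(range(1000))
test_subset = raw["test"].shuffle(seed=42).select(range(300))

# "Unseen" data is simulated by sampling again from the training split.
unseen = raw["train"].shuffle(seed=7).select(range(200))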

Model Experimentation and Tracking

I first experimented with the solution in a Jupyter notebook, notebook/basic_modelling.ipynb, and then refactored the notebook code into model_training/training.py. Amazon EC2, Amazon RDS and Amazon S3 are set up to host the MLFlow tracking server, store the MLFlow metadata and store the artifacts, respectively.

Once an Amazon EC2 instance is running and MLFlow is installed on it, run this command with all the placeholders substituted:

mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://postgres:<password>@<aws_rds_hostname>:5432/mlflow_db --default-artifact-root s3://<s3_bucket_name>

The MLFlow UI will be available at http://<ec2_public_address>:5000

Once the MLFlow tracking server is ready, the training code can be run. The overview of model_training/training.py is as follows (a condensed sketch is shown after the list):

  1. Initialize MLFlow tracking URI and experiment name
  2. Get the training and test datasets
  3. Data preprocessing
  4. Model training and hyperparameter tuning
  5. Model registry staging
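
A condensed sketch of these steps (helper function names and the experiment/model names are illustrative; the actual implementation is in model_training/training.py):

# Condensed sketch of the training script; helper names are illustrative.
import os
import mlflow

# 1. Point MLFlow at the EC2 tracking server and pick an experiment
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])  # http://<ec2_public_address>:5000
mlflow.set_experiment("spam-detector")

# 2-3. Get and preprocess the data (hypothetical helpers)
train_df, test_df = get_datasets()
X_train, y_train, X_test, y_test = preprocess(train_df, test_df)

# 4-5. Train, log metrics and register the model
with mlflow.start_run():
    model = train_model(X_train, y_train)  # includes hyperparameter tuning
    mlflow.log_metric("test_accuracy", float(model.score(X_test, y_test)))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="spam-detector",
    )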

To ease the modelling process, the training flow is deployed to Prefect Cloud. For Prefect version 2.11.0, these are the steps to create a deployment (a minimal sketch of the flow entry point follows the list):

  1. Make sure terminal directory is at root.
  2. Run prefect init and follow the prompts; I chose the git recipe
  3. Run prefect deploy and follow the UI instructions
  4. Run prefect worker start --pool 'mlops-capstone' to start worker
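
The deployment's entry point is a Prefect flow wrapping the training script; a minimal sketch (task bodies are elided, names are illustrative):

# Minimal sketch of the Prefect flow that the deployment points to;
# task bodies are elided, names are illustrative.
from prefect import flow, task

@task
def load_data():
    ...

@task
def train_and_register_model():
    ...

@flow(name="Spam Detector Capstone")
def main_flow():
    load_data()
    train_and_register_model()

if __name__ == "__main__":
    main_flow()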

Once the deployment is successful, a flow run can be executed from Spam Detector Capstone/mlops-capstone-spam-detector.

At the end of the run, a model will be staged in MLFlow Model Registry.
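
Staging can be done through the MLflow client API; a sketch, assuming the registered model is named spam-detector:

# Sketch of promoting the newest registered version to Staging
# (model name is illustrative; MLFlow 2.x client API).
from mlflow.tracking import MlflowClient

client = MlflowClient()
latest = client.get_latest_versions("spam-detector", stages=["None"])[0]
client.transition_model_version_stage(
    name="spam-detector",
    version=latest.version,
    stage="Staging",
)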

Model Deployment

In this section, I created an inference script, inference.py, with the following steps (a condensed sketch follows the list):

  1. Get unseen data, simulated by getting randomly from the training dataset
  2. Preprocess unseen data
  3. Initialize MLFlow tracking URI
  4. Load production model from MLFlow Model Registry
  5. Predict on the unseen data
  6. Write results to parquet, to simulate passing the results to email client API
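
A condensed sketch of the script (bucket name, model name and the helper functions are illustrative; the real logic lives in inference.py):

# Condensed sketch of inference.py; helper functions, bucket and model
# names are illustrative.
from datetime import datetime
import mlflow

mlflow.set_tracking_uri("http://<ec2_public_address>:5000")

# 1-2. Simulate unseen data and preprocess it (hypothetical helpers)
unseen_df = get_unseen_data()
features = preprocess(unseen_df)

# 3-5. Load the production model from the registry and predict
model = mlflow.pyfunc.load_model("models:/spam-detector/Production")
unseen_df["prediction"] = model.predict(features)

# 6. Write the results to S3, partitioned by the current date
#    (writing directly to s3:// paths assumes s3fs is installed)
key = datetime.utcnow().strftime("%Y/%m/%d")
unseen_df.to_parquet(f"s3://<s3_bucket_name>/{key}/spam_detection.parquet")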

The inference script is containerized using a Dockerfile, which can be hosted on Amazon ECS.

Steps to build and run the container:

  1. Make sure terminal directory is at root.
  2. Run docker build -t spam-detection-inference:v1 -f .\deployment\Dockerfile .
  3. Run docker run -it -e AWS_ACCESS_KEY_ID=<XXX> -e AWS_SECRET_ACCESS_KEY=<XXX> spam-detection-inference:v1 <MLFLOW_TRACKING_URL> with all the placeholders substituted.

Once the run is successful, you should see a spam_detection.parquet file in the S3 bucket, with the current year and date as part of the prefix key.

Monitoring

The monitoring section leverages the Evidently AI library. I have mainly modified evidently_metrics_calculation.py from the course to fit my use case. The steps of the script are:

  1. Prepare the PostgreSQL database and table (a sketch follows this list)
  2. Initialize MLFlow tracking URI
  3. Load production model from MLFlow Model Registry
  4. Get unseen data, simulated by getting randomly from the training dataset
  5. Preprocess unseen data
  6. Predict on the unseen data
  7. Get reference data from MLFlow production model run
  8. Calculate drift metrics at 5 random time intervals
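
As a sketch of step 1, the metrics table could be prepared like this (connection settings and the schema are illustrative; the real ones are defined in evidently_metrics_calculation.py):

# Sketch of preparing the metrics table (step 1); connection settings and
# the schema below are illustrative.
import psycopg

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS embedding_drift_metrics (
    timestamp       TIMESTAMP,
    model_drift     BOOLEAN,
    mmd_drift       BOOLEAN,
    cosine_drift    BOOLEAN,
    drift_detected  BOOLEAN
);
"""

with psycopg.connect("host=localhost port=5432 user=postgres password=example") as conn:
    conn.execute(CREATE_TABLE_SQL)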

From the root directory, run

docker compose -f .\monitoring\docker-compose.yml up --build -d

to prepare PostgreSQL, Adminer and Grafana. Likewise, this can be deployed to Amazon ECS.

Since the text embeddings play a large role in the model's performance, three embedding drift detection methods are used: classifier model, maximum mean discrepancy and cosine distance. This is based on a good blog write-up by Evidently AI. Drift is considered to have occurred when at least two of the three methods detect it. When drift is detected, a flow from the model training deployment Spam Detector Capstone/mlops-capstone-spam-detector is automatically run to retrain the model.
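
A sketch of the majority-vote drift check and the retraining trigger (the embedding column names and data frames are illustrative; Evidently's embedding drift metric and Prefect's run_deployment, roughly as of the versions used here):

# Sketch of the drift check and retraining trigger; embedding column names
# are illustrative (Evidently ~0.4.x, Prefect 2.x).
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import EmbeddingsDriftMetric
from evidently.metrics.data_drift.embedding_drift_methods import model, mmd, distance
from prefect.deployments import run_deployment

embedding_cols = [c for c in reference_df.columns if c.startswith("emb_")]
column_mapping = ColumnMapping(embeddings={"text_embeddings": embedding_cols})

report = Report(metrics=[
    EmbeddingsDriftMetric("text_embeddings", drift_method=model()),                 # classifier model
    EmbeddingsDriftMetric("text_embeddings", drift_method=mmd()),                   # maximum mean discrepancy
    EmbeddingsDriftMetric("text_embeddings", drift_method=distance(dist="cosine")), # cosine distance
])
report.run(reference_data=reference_df, current_data=current_df,
           column_mapping=column_mapping)

# Count how many of the three methods flagged drift
n_drifts = sum(m["result"]["drift_detected"] for m in report.as_dict()["metrics"])

# Retrain when at least two of the three methods agree
if n_drifts >= 2:
    run_deployment(name="Spam Detector Capstone/mlops-capstone-spam-detector")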

The monitoring script has also been deployed to Prefect Cloud using the same steps described earlier, and it is scheduled to run daily. A flow run can be executed from Batch Monitoring/batch-drift-monitoring.

Once everything has run successfully, Adminer can be accessed at http://localhost:8081/ and you should see some data in the embedding_drift_metrics table.

The Grafana dashboard can be accessed from http://localhost:3001/.

Code best practices

The following have been developed:

  1. Unit tests
  2. Integration tests
  3. Automatic code formatting using Black
  4. Makefile
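
As an example of the test style (the helper and its expected behaviour are hypothetical stand-ins for the project's actual tests):

# Illustrative pytest unit test; the helper and the expected behaviour are
# hypothetical stand-ins for the real tests.
from model_training.training import preprocess_text  # hypothetical helper

def test_preprocess_returns_clean_lowercase_text():
    assert preprocess_text("  WIN a FREE prize  ") == "win a free prize"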
