# Testing Machine Learning and MLOps Code
> A tutorial that provides sample tests for the most common operations in MLOps and machine learning projects. Testing the code used in MLOps or machine learning projects follows the same principles as testing any other software project.

> Some scenarios might seem different or more difficult to test. The best way to approach them is to always hold a test design session that focuses on inputs and outputs, exceptions, and the behavior of data transformations. Designing the tests first makes the code easier to test because it forces a more modular style, where each function has one purpose and common functionality is extracted into shared functions and modules.

> Below are some common operations in MLOps or Data Science projects, along with suggestions on how to test them.

* Saving and loading data
* Transforming data
* Model load or predict
* Data validation
* Model testing

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/chart-preview.png
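As a rough illustration of the first two operations in the list above (saving/loading and transforming data), here is a minimal sketch that uses only the Python standard library. The helpers `save_rows`, `load_rows`, and `normalize_age` are hypothetical names made up for this example, not part of any library. The tests cover the three things a test design session would focus on: a known input/output pair, an exception, and the behavior of a transformation.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, path):
    """Write a list of dicts to a CSV file (hypothetical helper)."""
    if not rows:
        raise ValueError("refusing to save an empty dataset")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read a CSV file back into a list of dicts; values come back as strings."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def normalize_age(rows, max_age=100.0):
    """Return new rows with 'age' scaled by max_age; the input is not modified."""
    return [{**row, "age": float(row["age"]) / max_age} for row in rows]


def test_save_then_load_round_trip():
    # Input/output: what we save is what we get back.
    rows = [{"name": "a", "age": "30"}, {"name": "b", "age": "45"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.csv"
        save_rows(rows, path)
        assert load_rows(path) == rows


def test_save_rejects_empty_input():
    # Exception: saving nothing should fail loudly.
    with tempfile.TemporaryDirectory() as tmp:
        try:
            save_rows([], Path(tmp) / "data.csv")
        except ValueError:
            return
        raise AssertionError("expected ValueError")


def test_normalize_age_scales_and_preserves_other_fields():
    # Transformation behavior: one column changes, the input stays untouched.
    rows = [{"name": "a", "age": "50"}]
    out = normalize_age(rows)
    assert out == [{"name": "a", "age": 0.5}]
    assert rows == [{"name": "a", "age": "50"}]
```

Because each function has a single purpose, each test stays small and needs no shared state beyond a temporary directory. The functions follow the `test_*` naming convention, so a runner such as pytest should collect them as written.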

# Distinguish between traditional software tests and machine learning (ML) tests

* Software tests check the written logic, while ML tests check the learned logic.
* ML tests can be further split into testing and evaluation.

We’re familiar with ML evaluation, where we train a model and evaluate its performance on an unseen validation set using metrics (e.g., accuracy, area under the receiver operating characteristic curve (ROC AUC)) and visuals (e.g., precision-recall curves).

ML testing involves checks on model behaviour. Pre-train tests, which can be run without trained parameters, check if our written logic is correct. For example, is the classification probability between 0 and 1? Post-train tests check if the learned logic is what we expect. For example, on the Titanic dataset, we should expect females to have a higher survival probability than males.
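A pre-train test of this kind might look like the sketch below. It assumes a hypothetical logistic model whose `predict_proba` function takes weights, a bias, and a feature vector. The names and the random-weight setup are illustrative only. No training happens: the test only checks that the written logic always yields a valid probability.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def test_untrained_model_outputs_valid_probabilities():
    # Random (untrained) weights and deliberately large inputs.
    rng = random.Random(0)
    weights = [rng.uniform(-1, 1) for _ in range(3)]
    for _ in range(1000):
        features = [rng.uniform(-1e3, 1e3) for _ in range(3)]
        p = predict_proba(weights, 0.0, features)
        assert 0.0 <= p <= 1.0


def test_sigmoid_handles_extreme_inputs():
    # A naive 1 / (1 + exp(-z)) would overflow for very negative z.
    assert sigmoid(-1e6) == 0.0
    assert sigmoid(0.0) == 0.5
    assert sigmoid(1e6) == 1.0
```

Feeding extreme values on purpose is a design choice: bugs in written logic, such as numeric overflow, tend to show up at the edges of the input range rather than on typical data.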

Taken together, these checks form a workflow. To illustrate it, we’ll implement a machine learning model and run the following tests on it:

- Pre-train tests to ensure correct implementation
- Post-train tests to ensure expected learned behaviour
- Evaluation to ensure satisfactory model performance
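To make the post-train step concrete, here is a minimal sketch of a directional expectation test. It does not use the real Titanic dataset. `TOY_ROWS` is a small made-up table in which females survive more often, and the model is a hand-rolled logistic regression trained by gradient descent. Both are assumptions chosen to keep the example self-contained.

```python
import math

# Made-up toy rows, NOT the real Titanic data:
# (is_female, is_first_class, survived)
TOY_ROWS = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
    (1, 0, 1), (1, 0, 1), (1, 0, 0), (1, 0, 0),
    (0, 1, 1), (0, 1, 0), (0, 1, 0), (0, 1, 0),
    (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0),
]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, is_female, is_first_class):
    """Survival probability; w = [bias, w_female, w_first_class]."""
    return sigmoid(w[0] + w[1] * is_female + w[2] * is_first_class)


def train(rows, lr=0.5, steps=2000):
    """Fit logistic regression with plain batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for is_female, is_first_class, survived in rows:
            err = predict_proba(w, is_female, is_first_class) - survived
            grad[0] += err
            grad[1] += err * is_female
            grad[2] += err * is_first_class
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w


def test_female_survival_probability_is_higher():
    # Directional expectation: flipping only the sex feature from male to
    # female should raise the predicted survival probability.
    w = train(TOY_ROWS)
    for is_first_class in (0, 1):
        p_female = predict_proba(w, 1, is_first_class)
        p_male = predict_proba(w, 0, is_first_class)
        assert p_female > p_male
```

The test holds every other feature fixed and changes only the one under scrutiny, so a failure points at the learned relationship for that feature rather than at overall model quality, which is what evaluation covers.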



# Writing Robust Tests for Data & Machine Learning Pipelines

How are data and machine learning pipelines tested, and why do some tests break more often than others? Often it is not because the new code is wrong: the new code is correct, but the tests break anyway and need to be updated.

This section looks at why certain tests break (incorrectly) more often than others and tries to find a better way to test pipelines. We’ll start with a simple pipeline and test it via unit, schema, and integration tests. Then we’ll introduce new data and logic, observe how the tests break, and draw patterns from it. Finally, we’ll suggest how to make pipeline testing less brittle.

- Overview of testing scopes: unit, integration, functional, etc.
  - Smaller testing scopes = shorter feedback loops
- An example pipeline: behavioral logs -> batch inference output
- Writing tests for our pipeline: unit, schema, integration
- Adding new data (visible impressions) and logic to our pipeline
- The additive vs. retroactive impact of new data and logic on tests
- Related concepts: Test validity, granularity, and scope
- Suggestions for more robust pipeline tests
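As a starting point for the outline above, the three test scopes could be sketched on a tiny stand-in pipeline like the one below. The functions `count_clicks` and `build_output`, the log fields, and the `score` column are hypothetical and invented for illustration. They are not the actual pipeline discussed here.

```python
from collections import Counter

# Hypothetical behavioral logs: one dict per event.
SAMPLE_LOGS = [
    {"user_id": "u1", "item_id": "a", "event": "click"},
    {"user_id": "u1", "item_id": "b", "event": "impression"},
    {"user_id": "u2", "item_id": "a", "event": "click"},
    {"user_id": "u2", "item_id": "b", "event": "click"},
]

# Expected column names and types of the pipeline output.
EXPECTED_SCHEMA = {"item_id": str, "clicks": int, "score": float}


def count_clicks(logs):
    """Count click events per item from raw behavioral logs."""
    return Counter(e["item_id"] for e in logs if e["event"] == "click")


def build_output(logs):
    """Stand-in pipeline: logs -> one output row per clicked item."""
    clicks = count_clicks(logs)
    total = sum(clicks.values())
    return [
        {"item_id": item, "clicks": n, "score": n / total}
        for item, n in sorted(clicks.items())
    ]


def test_count_clicks_ignores_non_click_events():
    # Unit test: one function, exact expected values.
    assert dict(count_clicks(SAMPLE_LOGS)) == {"a": 2, "b": 1}


def test_output_schema():
    # Schema test: column names and types only, not values.
    for row in build_output(SAMPLE_LOGS):
        assert set(row) == set(EXPECTED_SCHEMA)
        for column, expected_type in EXPECTED_SCHEMA.items():
            assert isinstance(row[column], expected_type)


def test_pipeline_end_to_end():
    # Integration test: run the whole pipeline and check coarse properties.
    rows = build_output(SAMPLE_LOGS)
    assert [r["item_id"] for r in rows] == ["a", "b"]
    assert abs(sum(r["score"] for r in rows) - 1.0) < 1e-9
```

Note how the scopes differ in what they pin down: the unit test asserts exact values for one function, the schema test asserts only structure, and the integration test asserts properties of the final output. That difference is one reason new data or logic can break some tests and leave others untouched.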