<a href="https://colab.research.google.com/github/victorviro/Machine-Learning-Python/blob/master/Model_Deployment_MLOps_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 ML model deployment

# Table of contents

1. [ℹ Introduction](#1)
2. [⬅️ Before production](#2)
    1. [⚙ Runtime environments](#2.1)
    2. [⛔️ Data access](#2.2)
    3. [🧐 Model risk evaluation](#2.3)
    4. [✔ QA and Auditability for ML](#2.4)
    5. [🔒 ML security](#2.5)
        1. [😈 Adversarial attacks](#2.5.1)
        2. [🔥 Other vulnerabilities](#2.5.2)
    6. [⬇ Model risk mitigation](#2.6)
3. [🚀 Deploying to production](#3)
    1. [👷‍♂️ CI/CD](#3.1)
        1. [✅ Testing](#3.1.1)
    2. [↔️ ML pipeline](#3.2)
    3. [🚀 Model deployment](#3.3)
        1. [🚚🏎 Batch inference vs online inference](#3.3.1)
        2. [👉 Model deployment strategies](#3.3.2)
        3. [🔧 Maintenance in production](#3.3.3)
    4. [📦 Containerization](#3.4)
    5. [⬆ Scaling deployments](#3.5)
4. [📕 References](#4)

# ℹ Introduction <a name="1"></a>

Productionalizing and 🚀 **deploying models** and ML pipelines is a 🔑 key component of MLOps that presents **different** technical **challenges than developing the model**. It's the **domain of the software/ML engineer and the DevOps** team, and the organizational challenges in managing the information exchange between the data scientists and these teams must not be underestimated. As described in the notebook [Introduction to MLOps](https://nbviewer.jupyter.org/github/victorviro/Machine-Learning-Python/blob/master/Introduction_to_MLOps.ipynb), without effective 🔗 collaboration between the teams, ⌛️ delays or 👎 failures to deploy are inevitable.


# ⬅️ Before production <a name="2"></a>

> That something works in the 🧪 lab does not mean it will work well in the real 🌏 world. 

The **production environment is typically different from the development environment**, and the **commercial risks** associated with models **in production are ⬆ greater**. The complexities of the ⏩ transition to production need to be understood and tested, and the potential risks adequately mitigated.

This section explores some important **considerations to prepare for production** in robust MLOps systems.

## ⚙ Runtime environments <a name="2.1"></a>

The 1️⃣ first step in 🚀 **deploying a model** to production is making sure **it’s technically possible**. **Ideal** MLOps systems favor ✈ **rapid, 🤖 automated deployment** over labor-intensive processes. **Production environments take a wide variety of forms**: custom-built services, data science platforms, dedicated services like TensorFlow Serving, low-level infrastructure like Kubernetes clusters, etc.

Ideally, models running in the development environment would be ✅ validated and ➡️ sent as is to production; thus ⬇ minimizing the amount of adaptation work and improving the chances that the model in production will behave as it did in development. Unfortunately, this **ideal scenario is not always possible**.

**Adaptation from development to production environment**

**If** the development and production platforms are 🔄 interoperable, and **the model can run without any modification in production**, then the **steps required to push the model into production** are reduced to **a few 🖱️ clicks or commands**, and all efforts can be focused on ✅ validation.

However, there are **cases** (the reality in many organizations) **where the model needs to be ↩️ reimplemented** from scratch, possibly by another team, and possibly in another programming language (such a model probably won’t reach production for months). In these cases, **model transformations or interactions with the environment may be needed to make the model compatible with production**. In all cases, it is crucial to **perform validation in an environment that mimics production as closely as possible**.

- The **format required** to send to production should be 🤔 considered ⬅ early, as it **may affect** the model itself and the work required to productionalize it. For example, when a model is developed using Python and production is a Java environment that expects PMML or ONNX as input, ↪️ conversion is required.

- Performance is important when the production model must **run on a low-power 📱 device** (Edge AI). With 🧠 deep neural networks, for example, trained models can be extremely large. One solution is to use 🗜 compression techniques like quantization, pruning, or distillation. These methods are still recent but already used in NLP.
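To make the idea of 🗜 compression more concrete, here is a minimal sketch of quantization in plain Python. It is only an illustration under simple assumptions (a flat list of weights and min/max affine mapping to 8-bit codes); the function names `quantize` and `dequantize` are hypothetical, and real frameworks use more elaborate schemes.

```python
def quantize(weights, num_bits=8):
    """Map float weights to integer codes in [0, 2**num_bits - 1] (min/max affine scheme)."""
    levels = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights are equal (range is zero).
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


# Illustrative weights (made up for this example).
weights = [-0.62, -0.1, 0.0, 0.33, 0.91]
codes, scale, w_min = quantize(weights)
restored = dequantize(codes, scale, w_min)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored as one small integer instead of a float, at the price of a bounded rounding error (`max_error` is at most `scale / 2`).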

## ⛔️ Data access <a name="2.2"></a>

Another aspect to address before validation and launch to 🚀 production is ⛔️ data access. For example, a model predicting 🏠 apartment 💰 prices may use the average market price in a zip code area; however, the 👨 user requesting the scoring will probably not provide this average and would provide simply the zip code, meaning **a lookup is necessary** to fetch the value of the average.

**In some cases**, data can be frozen and bundled with the model. But when this is not possible (e.g., if the dataset is too large or the enrichment data needs to be 📅 up to date), the **production environment should access a database** and thus have the appropriate **network 🖇 connectivity, libraries, or 🚙 drivers required to communicate with the data storage** installed, and authentication 🔐 credentials stored in some form of production 🔧 configuration. Managing this setup and configuration can be complex in practice since it requires appropriate ⚒️ tooling and collaboration.
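The lookup described above could be sketched as follows. This is a minimal illustration that uses an in-memory SQLite database as a stand-in for the production data store; the table name `zip_stats`, the column names, the function `enrich_request`, and the prices are all assumptions made for the example.

```python
import sqlite3

# In-memory database standing in for the production data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zip_stats (zip_code TEXT PRIMARY KEY, avg_price REAL)")
conn.executemany(
    "INSERT INTO zip_stats VALUES (?, ?)",
    [("28001", 5200.0), ("08002", 4100.0)],  # made-up values
)


def enrich_request(request, connection, default_avg_price=None):
    """Add the average market price for the request's zip code to the features."""
    row = connection.execute(
        "SELECT avg_price FROM zip_stats WHERE zip_code = ?",
        (request["zip_code"],),
    ).fetchone()
    features = dict(request)  # do not mutate the caller's request
    features["avg_market_price"] = row[0] if row else default_avg_price
    return features


features = enrich_request({"zip_code": "28001", "surface_m2": 80}, conn)
```

In a real deployment, the connection string and credentials would come from production configuration rather than being hard-coded, and the behavior for unknown zip codes (here, a default value) is a design decision to make explicitly.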

## 🧐 Model risk evaluation <a name="2.3"></a>

**Implementation of models can have 🐛 bugs**, as can the environment they are executing in.  It is possible to ⬅ **anticipate the risks of models in production** and thus design and ☑️ validate them to ⬇ **minimize these risks**.

Before putting a model in 🚀 production teams should **ask the next ❓ questions**:

- What if the model acts in the 😟 worst imaginable way?
- What if a user manages to extract the training data or the internal logic of the model?
- What are the financial, business, legal, safety, and reputational ⚠️ risks?

For ⚡️ **high-risk applications**, the whole team must be 🧐 aware of these risks so that they can design the ✔ validation process appropriately and apply the complexity appropriate for the magnitude of the risks.

ML model risk originates essentially from:

- 🐛 Bugs in designing, training, or evaluating the model
- Bugs in the runtime framework, in the model post-processing/conversion, or hidden incompatibilities between the model and its runtime
- ⬇ Low quality of training data
- High difference between production data and training data
- Adversarial 😈 attacks
- Reputational risk due to bias, unethical use of ML, etc.

The risk and its 🎚 magnitude can be ⬆ amplified by:

- Broad use of the model
- A 🏎 rapidly changing environment
- Complex ⛓ interactions between models

## ✔ QA and auditability for ML <a name="2.4"></a>

Software engineering has developed 🛠 tools and methodologies for quality assurance (QA), but the equivalent for ML is still in its 👶 infancy. The 🎯 **purpose of QA for ML** is to **ensure compliance with ⚙ processes as well as ML and computational performance requirements**, with a level of detail appropriate to the 🎚 level of ⚠️ risk.

The organization’s structure needs to give people in charge of validation the 👮 authority to appropriately 📝 report 🐛 issues, contribute to continuous improvement, and ✋ block passage to production if the level of risk justifies it. Performing QA before sending to production is not only about technical validation, it is also the occasion to create ✍️ documentation and validate the model against organizational 📋 guidelines. The origin of all input datasets, pre-trained models, or other assets should be known, as they could be subject to 📜 regulations or copyrights.

**Auditability** is related to reproducibility (which we discussed in the notebook [Model development in MLOps](https://nbviewer.jupyter.org/github/victorviro/Machine-Learning-Python/blob/master/Model_development_in_MLOps.ipynb)), but it adds some requirements. For a model to be auditable, it must be possible to **access the full history (lineage) of the ML pipeline from a central and reliable storage** and to easily fetch metadata on all model versions including:

- The full 📓 documentation
- An artifact that allows running the model with its exact initial environment
- ✔ Test results, including model explanations and fairness reports
- Detailed model 🔊 logs and 🖥 monitoring metadata

Auditability can be required in ⬆ highly regulated applications, but it has benefits for all organizations because it **facilitates model 🐞 debugging, continuous improvement, and keeping ✍️ track of actions and responsibilities** (which is an essential part of governance for responsible AI). A full QA 🧰 toolchain for ML should provide a clear view of model performance about requirements while also facilitating auditability.

Auditability must allow for an intuitive human 🤔 understanding of all the parts of the system and their version histories. Depending on the ⚡ criticality of the application, a wider audience may need to understand the details of the model. As a result, 🔋 full auditability comes at a cost that should be ⚖️ balanced with the criticality of the model itself.

## 🔒 ML security <a name="2.5"></a>

As a piece of software, **a 🚀 deployed model can present security 🐛 issues**. ML introduces a new range of potential threats where an **😈 attacker provides malicious data designed to cause the model to make a 👎 mistake**.

There are numerous cases of potential 💥 attacks. For **example, spam filters** based on scoring words that were in a dictionary. One way for spam creators to avoid detection was to avoid ✍️ writing these words while still making their message easily understandable by a 👨 human reader (e.g., using exotic Unicode characters, introducing typos or 🖼 images).

### 😈 Adversarial attacks <a name="2.5.1"></a>



A more modern example of an ML model security issue is an **adversarial attack for 🧠 deep neural networks**, in which a **slight modification** of an 🧑 image, one that can seem minor or even be impossible for a human 👁 eye to notice, **can cause the model to drastically change its prediction**. Since deep learning inference is essentially matrix multiplication, carefully chosen small perturbations to coefficients can cause a large change in the output numbers.

<center><img src='https://i.ibb.co/QbYkDF5/adversarial-attack.png'></center>
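The arithmetic behind this effect can be illustrated with a single dot product, which stands in here for one unit of a network. This is only a sketch under simplifying assumptions (random weights, a sign-based perturbation of size `epsilon`); it is not a full attack on a trained model.

```python
import random

random.seed(0)
n_features = 1000

# Random weights and inputs standing in for one unit of a network (assumption).
weights = [random.uniform(-1, 1) for _ in range(n_features)]
x = [random.uniform(0, 1) for _ in range(n_features)]


def score(inputs):
    """Dot product between the weights and the inputs."""
    return sum(w * v for w, v in zip(weights, inputs))


def sign(value):
    return 1.0 if value > 0 else -1.0 if value < 0 else 0.0


# Move every input by at most epsilon, in the direction that raises the score.
epsilon = 0.01
x_adv = [v + epsilon * sign(w) for w, v in zip(weights, x)]

shift = score(x_adv) - score(x)
# The output moves by epsilon times the sum of absolute weights.
expected_shift = epsilon * sum(abs(w) for w in weights)
```

Each input changes by at most `epsilon`, but the output shifts by `epsilon` times the sum of the absolute weights, which grows with the number of inputs. That is why tiny per-pixel changes can add up to a large change in the output.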

For complex models like deep learning, the 😈 attacker will probably need to perform many queries and either use 💪 brute force to test as many combinations as possible or use a model to search for problematic examples. The **difficulty of countermeasures ⬆ increases with the complexity of models** and their availability. Simple models such as logistic regressions are essentially immune, while an open-source pre-trained deep neural network will always be vulnerable, even with advanced attack detectors.

Adversarial attacks don’t necessarily happen at inference time. If an attacker can get **access to the training data**, even partially, then they get 🕹 control over the system. This kind of attack is traditionally known as a ☠️ **poisoning attack** in computer security. One famous **example** is the **Twitter 🤖 chatbot** released by Microsoft in 2016. Just a few ⏳ hours after launch, the bot **started to generate 🖕 offensive tweets**. This was caused by the bot adapting to its input: when some users submitted a large amount of offensive content, the bot started to replicate it. In theory, a poisoning attack can occur as a result of an intrusion or even, in a more sophisticated way, through pre-trained models. But in practice, one should mostly care about data collected from easily manipulated data sources. Tweets sent to a specific account are a particularly clear example.

### 🔥 Other vulnerabilities <a name="2.5.2"></a>

Since the **ML models** can be considered a 📄 summary of the data they have been trained on, they **can leak** more or less 👌 precise **information on the training data**. Imagine, for example, that a model predicts how much someone is paid using the nearest neighbor algorithm. If one knows the zip code, age, and profession of a certain person registered on the service, it’s pretty easy to obtain that person’s exact 💲 income. There is a wide range of attacks that can extract information from models in this way.
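The nearest neighbor example can be sketched in a few lines. The records, the distance function, and the name `predict_income` are fabricated for illustration; the point is only that a 1-nearest-neighbor predictor returns a stored training label verbatim when the query matches a training record.

```python
# Fabricated training records: (zip_code, age, profession, income).
training_records = [
    ("28001", 34, "nurse", 31500),
    ("28001", 51, "lawyer", 78200),
    ("08002", 29, "teacher", 27900),
]


def distance(a, b):
    """Toy distance: mismatch on zip code and profession, scaled age difference."""
    return (a[0] != b[0]) + abs(a[1] - b[1]) / 100 + (a[2] != b[2])


def predict_income(query):
    """1-nearest-neighbor prediction: return the income of the closest record."""
    nearest = min(training_records, key=lambda rec: distance(rec[:3], query))
    return nearest[3]


# Someone who knows a registered person's attributes recovers the exact income.
leaked = predict_income(("28001", 51, "lawyer"))
```

Here `leaked` equals the income stored for that person in the training data, so the "prediction" is in fact a disclosure.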

In addition to technical hardening and audit, **governance** is critical in security. **Responsibilities** must be 📎 assigned clearly and in an appropriate ⚖️ balance between security and capacity of execution. It is also important to put in place 🔄 **feedback ⚙ mechanisms**, and employees and users should have an easy **channel to 🔉 communicate breaches** (including programs that 🎁 reward reporting vulnerabilities). It is also possible, and necessary, to build safety nets around the system to mitigate the risks.

One of the main ideas is that computer system **security is not an ➕ additional independent feature of the system**; that is, generally we cannot secure a system that is not designed to be secure, and the organization processes must take into account the nature of the threat from the beginning.

## ⬇ Model risk mitigation <a name="2.6"></a>


To 🕹 control the 🚀 deployment of new versions, progressive or **canary deployments** allow 🆕 **new versions of models to be served to a small proportion of the customer base 1️⃣ first, with that proportion slowly 📈 increased**, while 📊 monitoring behavior and getting human 🔄 feedback if appropriate.

**Changing environments**

🏎 Rapidly **changing environments ⬆ multiply risk**. Even with an efficient monitoring system and a ⚙ procedure to 🔁 retrain models, the ⌛ time necessary to remediate may be a critical threat, especially if simply retraining the model on new data is not enough and a new model must be 🧑‍💻 developed. During this time, the production systems misbehaving can cause large 💰 losses for the organization.

To control this risk, monitoring should be reactive enough (⏰ alerting on distributions computed every week might not be enough), and the procedure should consider the period necessary for remediation. For example, in addition to retraining or ↩️ rollout strategies, the procedure may define 🔢 **thresholds that would trigger a degraded mode for the system**. A degraded mode may simply consist of a ⚠️ warning message displayed for end-users but could be as drastic as shutting down the system to avoid harm until a stable solution can be deployed.
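As a rough sketch of such a procedure, the snippet below maps a monitoring signal to an operating mode. The drift score, the threshold values, and the names `Mode` and `select_mode` are hypothetical; in practice they would be agreed on with the business according to the risk.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    WARNING = "warning"    # degraded mode: keep serving, warn end-users
    SHUTDOWN = "shutdown"  # degraded mode: stop serving predictions


# Hypothetical thresholds on a drift score in [0, 1].
WARNING_THRESHOLD = 0.10
SHUTDOWN_THRESHOLD = 0.30


def select_mode(drift_score):
    """Choose the operating mode from the latest monitoring signal."""
    if drift_score >= SHUTDOWN_THRESHOLD:
        return Mode.SHUTDOWN
    if drift_score >= WARNING_THRESHOLD:
        return Mode.WARNING
    return Mode.NORMAL


current_mode = select_mode(0.15)
```

The serving layer would call `select_mode` each time the monitoring system publishes a new value, so the reaction time is bounded by the monitoring frequency.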

In many cases, 🔁 retraining the model on more data will ⬆ improve the model, and this problem will disappear, but this can take ⏳ time. Before this convergence, a solution might be to use a simpler model that may have a ⬇ lower performance but be more consistent in a frequently changing environment.

🖇 **Interactions between models**

Complex interactions between models are a challenging source of risk. ➕ **Adding models will often add complexity** to an organization, but the complexity **does not necessarily grow linearly in proportion to the number of models**. The absence of interactions between models makes the complexity grow closer to linearly (in practice, it is rarely the case, as there can always be interactions in the real 🌍 world even if models are not connected).

The total complexity is determined by how the 🔗 interactions with models are designed. Using **models in ⛓ chains (where a model uses inputs from another model) can create additional complexity** as well as unexpected results, **whereas using models independently**, each as short and explainable as possible, **is** a much **more ♻️ sustainable** way to design the large-scale deployment of ML. On the other hand, [ensembled models](https://nbviewer.org/github/victorviro/ML_algorithms_python/blob/master/Ensemble_learning.ipynb) can reduce errors: if a decision is based on several independent models with methods as different as possible, it can be more robust.

**Model misbehavior**

Many 📏 measures can be implemented to avoid model misbehavior. 

🕹 Controlling **feature-value intervals** is a **useful** and simple technique. For example, if the value of a **feature at inference time is out of bounds, the system can trigger appropriate measures** (e.g., dispatching a ⚠️ warning message). But **it might be insufficient**. For example, when training an algorithm to evaluate 🚙 car 💲 prices, the data may have provided examples of recent light cars and old heavy cars, but no recent heavy cars. The performance of the model on such recent heavy cars is unpredictable. When the number of features is large, this issue becomes unavoidable due to the [curse of dimensionality](https://nbviewer.org/github/victorviro/ML_algorithms_python/blob/master/Dimensionality_reduction_algorithms.ipynb#The-Curse-of-Dimensionality).
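A minimal sketch of such an interval check is shown below. The feature names and bounds in `FEATURE_BOUNDS` are assumptions for the car example (they would normally be derived from the training data), and `out_of_bounds` is a hypothetical helper.

```python
# Hypothetical per-feature bounds, e.g. taken from the training data.
FEATURE_BOUNDS = {
    "age_years": (0, 30),
    "weight_kg": (600, 3500),
}


def out_of_bounds(record, bounds=FEATURE_BOUNDS):
    """Return the names of features that are missing or outside their interval."""
    violations = []
    for name, (low, high) in bounds.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            violations.append(name)
    return violations


# An implausible weight is caught...
too_heavy = out_of_bounds({"age_years": 3, "weight_kg": 9000})
# ...but a recent heavy car passes, even if no such car was seen in training.
recent_heavy = out_of_bounds({"age_years": 1, "weight_kg": 3000})
```

The second call returns an empty list: every feature is inside its own interval, yet the combination may be far from anything in the training set. This is the limitation described above.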

In these situations, more sophisticated methods can be used, including **anomaly detection** to identify records where the model is used outside of its application domain. After scoring, the outputs of the model can be 🧐 examined before ✅ confirming the inference. In the case of 🏷 classification, many algorithms provide certainty scores in addition to their prediction, and a 🔢 threshold can be fixed to accept an inference output.

**Conformal prediction** is a set of ⚒️ techniques that help ⚖️ calibrate these predictions to obtain an accurate estimation of the probability of correctness. For regression, the value can be ✅ checked against a predetermined interval. For example, if the model predicts that a car costs 50\$ or 500,000\$, we may not want to ❌ commit any business on this prediction.
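These post-scoring checks can be sketched as two small guards. The threshold `min_confidence`, the price interval, and the function names are assumptions for illustration. Note that this sketch uses the raw scores as they come from the model; it does not implement conformal prediction or any other calibration step.

```python
def accept_classification(probabilities, min_confidence=0.8):
    """Return (label, score) if the top score reaches the threshold, else None."""
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    if confidence >= min_confidence:
        return label, confidence
    return None  # caller falls back to a default action or human review


def accept_regression(value, low=500.0, high=200_000.0):
    """Accept a predicted car price only inside a predetermined interval."""
    return low <= value <= high


confident = accept_classification({"spam": 0.95, "ham": 0.05})
uncertain = accept_classification({"spam": 0.55, "ham": 0.45})
```

What happens when a prediction is rejected (a default answer, a simpler model, a human review) is a product decision that should be defined before deployment.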

# 🚀 Deploying to production <a name="3"></a>

🚀 **Deploying to production** is a 🗝 key component of MLOps, and having the right ⚙ processes and 🧰 tools can ensure that it happens ✈ **quickly**. The good 📰 news is that many of the elements of success in software engineering, particularly **CI/CD best practices**, **can be applied to ML**.

This section dives into the concepts and considerations when 🚀 deploying ML pipelines and models to production.

## 👷‍♂️ CI/CD <a name="3.1"></a>

Let's understand some DevOps concepts:

- ***Integration***: The process of **merging a contribution to a central repository** (typically merging a Git 🌿 branch to the main branch) **and performing ✅ tests**.

- ***Delivery***: The process of **building a fully 📦 packaged and validated version of the software ready to be 🚀 deployed** to production.

- ***Deployment***: The process of **updating and 🔛 running a new version of the software on a 🎯 target infrastructure**. 

- ***Release***: **A release is a version of the software**. **Releasing** is the act of **making the release version 👀 visible to our customers**. 🚀 Deployment doesn't necessarily expose customers to the 🆕 new version of the software, although in many cases, it is implicit. But, there are ways to hide the new version from customers even while it's deployed (like in blue/green deployment or shadow testing). The goal of releasing is ensuring a feature meets customer needs and is 🔙 turned off when defective.

[***CI/CD***](https://en.wikipedia.org/wiki/CI/CD) is an acronym for **continuous integration and continuous delivery**, a philosophy of agile software development, and a set of 📜 practices and 🛠 tools to **release applications more often and ✈ faster**, while also better 🕹 controlling quality and risk. They are a critical part of MLOps strategy.

- The goal of **continuous integration** is to **frequently merge the work from several contributors into a shared repository** (using a version control system). Each integration can then be ✅ verified by **automated tests** (automated build). While automated testing is not strictly part of CI, it is typically implied. By integrating regularly, 🐛 errors can be detected quickly and located more easily. It allows checking that the application is not broken whenever new commits are integrated into the main branch.

- **Continuous delivery** is an extension of CI since it 🤖 automatically deploys the new version of the software to a testing and/or production-like environment after the build stage to ensure the software will 👍 work in production. So, the **software can be reliably released to production at any time**. Beyond making sure our application passes automated tests, it has to have all the configuration necessary to push it into production. 

- Continuous delivery is sometimes confused with continuous deployment. **Continuous deployment** goes one step further than continuous delivery. It means that **every change goes through the pipeline and automatically gets put into production**. There's no ✋ human intervention, and only a ❎ failed test will prevent a new change from being 🚀 deployed to production. Continuous delivery just means that we can deploy frequently, but we may choose not to. Fully 🤖 automated deployment is not always desirable and is a business decision as much as a technical decision. A further explanation of the differences between continuous delivery and continuous deployment is available in this [video](https://youtu.be/LNLKZ4Rvk8w).

<center><img src='https://i.ibb.co/hcrdszd/cd-cd.png'></center>

The most **widespread 🛠 tool for CI/CD** is [***Jenkins***](https://www.jenkins.io/), an open-source system that allows for the building of CI/CD pipelines regardless of the programming language, testing framework, etc. Jenkins can be used in data science to orchestrate CI/CD pipelines, although there are many other options (like [GitHub Actions](https://github.com/features/actions)).

An ⬆ **incremental approach to building a CI/CD pipeline** is natural since a starting project does not have the infrastructure 📋 requirements of a tech giant, and it can be hard to know ⬅️ upfront which challenges deployments will present. The best 🛣️ path is to start from a simple (but fully functional) CI/CD workflow and introduce more sophisticated steps along the way as quality or scaling challenges appear.

### ✅ Testing <a name="3.1.1"></a>

✅ Testing is an important strategy for improving reliability, ⬇ reducing technical debt and lowering long-term 🔧 maintenance costs. However, as suggested in the following figure (from [The ML Test Score](https://research.google/pubs/pub46555/)), **ML system testing is more complex than testing traditional software systems**, since **ML system behavior** is not specified directly in code but **is learned from data**.

<center><img src='https://i.ibb.co/2WF825x/ML-testing-monitoring.png'></center>

Therefore, while traditional software can rely on **unit tests and integration tests of the code**, for ML we need to ➕ **add tests for the data, and the model**.

- **Data validation** is required ⬅ before model training to decide whether we should retrain the model or stop the execution of the pipeline. This decision can be made automatically based on **data schema or data value skews**.

- **Model validation** occurs ➡ after we successfully train the model given the new data, and ⬅ before it's promoted to 🚀 production. In the notebook [Model development in MLOps](https://nbviewer.jupyter.org/github/victorviro/Machine-Learning-Python/blob/master/Model_development_in_MLOps.ipynb), we discussed how to evaluate a model properly. Some additional steps for this **offline model validation** are:

 - **Compare** the evaluation 📏 metric values produced by our 🆕 newly trained model **to the production model** (or previous versions) or to a performance 🔢 threshold (does the new model produce better performance than the current model?).

 - Check that the **model performs consistently on** various segments of the data (**subpopulations**).

 - Test the model for deployment, including infrastructure compatibility and consistency with the prediction service API.
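The first two offline checks above could be automated with a small promotion gate like the one sketched here. The metric layout (an `"overall"` value plus per-segment values, higher is better), the tolerances, and the name `should_promote` are assumptions made for this example.

```python
def should_promote(candidate, production, min_gain=0.0, max_segment_drop=0.02):
    """Decide whether a newly trained model may replace the production model.

    Both arguments are dicts with an "overall" metric and a "segments" dict of
    per-subpopulation metrics, where higher values are better.
    """
    if candidate["overall"] < production["overall"] + min_gain:
        return False, "no overall improvement"
    for segment, prod_value in production["segments"].items():
        cand_value = candidate["segments"].get(segment)
        if cand_value is None:
            return False, f"missing segment: {segment}"
        if cand_value < prod_value - max_segment_drop:
            return False, f"regression on segment: {segment}"
    return True, "ok"


# Made-up metric values for illustration.
production = {"overall": 0.90, "segments": {"new_users": 0.88, "returning": 0.91}}
better = {"overall": 0.92, "segments": {"new_users": 0.89, "returning": 0.92}}
uneven = {"overall": 0.93, "segments": {"new_users": 0.80, "returning": 0.95}}

decision_better = should_promote(better, production)
decision_uneven = should_promote(uneven, production)
```

The `uneven` candidate has the best overall metric but is rejected because it degrades one subpopulation. Returning a reason along with the decision makes a failed gate easier to diagnose.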

Good tests should make it as **easy** as possible **to diagnose the source of the 🐛 issue when they ❎ fail**. For that purpose, **naming the tests appropriately is important**.

**Online validation**: In addition to offline validation, some ✅ checks should be performed and automatically analyzed when the model is already 🚀 deployed to production (canary deployment or A/B testing).

- An essential part of model validation is testing on recent production data. One or several datasets should be used, extracted from several time windows, and named appropriately. 
-  Online validation also should monitor computational performance (CPU, memory, disk, latency, etc).

**🤖 Automating these tests** as much as possible **is essential** and a 🔑 key component of efficient MLOps. A lack of automation or 🚄 speed wastes ⌛ time, and it also 😞 discourages the development team from testing and deploying often, which can 🕐 delay the discovery of 🐛 bugs or design choices that make it impossible to deploy to production.

## ↔️ ML pipeline <a name="3.2"></a>

**In many businesses** that are 🌱 beginning to apply ML to their use cases, their **process for building and deploying ML models is entirely 🤚 manual**. This process can separate data scientists who create the model and engineers who serve the model as a prediction service, and it presents a **disconnection between ML and operations**. In practice, models often break when they are deployed in the real 🌏 world (they 👎 fail to adapt to changes in the dynamics of the environment or changes in the data that describes the environment).

To address the challenges of this manual process, MLOps practices for CI/CD are helpful. By 🚀 **deploying an ML training pipeline**, we can set up a CI/CD system to ✈ rapidly ✅ test, 🏗 build, and 🚀 deploy new implementations of the ML pipeline. 🤖 Automated data validation and model validation steps must be added to the ML pipeline.

<center><img src='https://i.ibb.co/bKLDmN8/automatic-ml-pipeline.png'></center>
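To give a flavor of an automated validation step inside a pipeline, here is a minimal sketch in which a schema check decides whether training runs at all. The schema, the row format, and the names `validate_schema` and `run_pipeline` are hypothetical, and the "training" function is a trivial stand-in.

```python
# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"zip_code": str, "surface_m2": float, "price": float}


def validate_schema(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema errors (empty if the data is valid)."""
    errors = []
    for i, row in enumerate(rows):
        missing = sorted(set(schema) - set(row))
        if missing:
            errors.append(f"row {i}: missing {missing}")
            continue
        for name, expected_type in schema.items():
            if not isinstance(row[name], expected_type):
                errors.append(f"row {i}: {name} is not {expected_type.__name__}")
    return errors


def run_pipeline(rows, train_fn):
    """Run data validation first and stop the pipeline if it fails."""
    errors = validate_schema(rows)
    if errors:
        return {"status": "stopped", "errors": errors}
    return {"status": "trained", "model": train_fn(rows)}


def mean_price(rows):
    """Stand-in for training: the 'model' is just the mean price."""
    return sum(r["price"] for r in rows) / len(rows)


good_rows = [{"zip_code": "28001", "surface_m2": 80.0, "price": 400000.0}]
bad_rows = [{"zip_code": "28001", "surface_m2": "80", "price": 400000.0}]

good_run = run_pipeline(good_rows, mean_price)
bad_run = run_pipeline(bad_rows, mean_price)
```

A model validation step would follow training in the same way, blocking promotion instead of training.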

## 🚀 Model deployment <a name="3.3"></a>

The process of **taking a trained ML model and making its predictions available** to 🧑 users or other systems is known as model 🚀 deployment in ML. To decide **how to deploy a model**, we need to 🤔 understand **how end-users should interact with the model’s predictions**. There are **multiple ❓ factors to consider** when determining how to deploy an ML model. These factors include:

- How 🕛 frequently should predictions be generated?
- Should predictions be generated for a 1️⃣ single instance at a time or a batch of instances?
- The number of applications that will access the model.
- The ⏳ latency requirements of these applications.

### 🚚🏎 Batch inference vs online inference <a name="3.3.1"></a>

In addition to different deployment strategies, there are two ways to approach model 🚀 deployment:

- 🚚 **Batch or offline inference** generates predictions on a **batch of observations**. The batch jobs typically run on some **recurring schedule** (e.g. 🕑 **hourly, 📅 daily**). **Latency is often not a concern**. These predictions are then stored in a database and can be made available to 🧑‍💻 developers or end-users. Batch inference may use big data 🧰 tools such as Spark to generate predictions on large batches. Some advantages are:

 - We can ✅ check and verify all of our predictions ⬅ before they are used.

 - Once the predictions have been ✍️ written to some look-up table, they can be served with minimal ⏳ latency (no feature computation or model inference needs to be done at request time).

 <center><img src='https://i.ibb.co/m4RbBs7/batch-inference.png'></center>

- 🏎 **Real-time or online inference** generates predictions in ✈ real-time upon request. **Typically**, these predictions are generated **on a 1️⃣ single observation** of data at runtime. For example, when an ad is displayed on a website, the user session is scored by models to decide what to display.

  <center><img src='https://i.ibb.co/c60bKn5/online-inference.png'></center>


In both cases, multiple instances of the model can be 🚀 deployed to ⬆ increase throughput and lower ⏳ latency.
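The difference between the two approaches can be sketched as follows. The `model` function is a trivial stand-in, and the in-memory dictionary `prediction_table` plays the role of the database or look-up table; all names are assumptions for the example.

```python
def model(features):
    """Stand-in for a trained model (assumption for this example)."""
    return 2.0 * features["x"] + 1.0


def run_batch_job(records):
    """Batch inference: score all records at once and store the results."""
    return {rec["id"]: model(rec) for rec in records}


# Scheduled job (e.g. hourly or daily) fills the look-up table.
prediction_table = run_batch_job([{"id": "a", "x": 1.0}, {"id": "b", "x": 2.5}])


def serve_batch(record_id):
    """At request time, batch serving is only a look-up (no model call)."""
    return prediction_table.get(record_id)


def serve_online(features):
    """Online inference: the model is called for each incoming request."""
    return model(features)


from_table = serve_batch("b")
on_demand = serve_online({"x": 1.0})
```

Batch serving can only answer for records known when the job ran (`serve_batch` returns `None` otherwise), whereas online serving handles any request but pays the inference cost each time.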

Typically, **online inference faces more challenges** than batch inference. Online inference tends to be more complex because of the added 🧰 tooling and systems required to meet latency requirements. A system that needs to respond with a prediction within 100ms is much harder to implement than a system with a service-level agreement of 24 hours. In those 100ms, the system needs to ⬅️ retrieve any necessary data to generate predictions, perform ⚙ inference, ✅ validate the model output, and then (typically) ↩️ return the results over a network.

One of the first ❓ questions we’ll need to answer when deciding how to deploy our ML models is whether to use **batch inference or online inference**. This 🗳 choice is mainly driven by product **factors**: 
- Who is using the inferences, and how ✈ soon do they need them? 
- If the predictions do not need to be served immediately, we may opt for the simplicity of 🚚 batch inference. 
- If predictions need to be served on an individual basis and within the time of a single web request, 🏎 online inference is the way to go.

### 👉 Model deployment strategies <a name="3.3.2"></a>

When ➡️ sending a new model version to production, the first consideration is often to **avoid downtime**, in particular for 🏎 online inference. 

- [***Blue-green deployment***](https://en.wikipedia.org/wiki/Blue-green_deployment): Rather than shutting down the system, upgrading it, and then putting it back online, the **new system** can be **deployed and tested in another environment (green)**, that is as identical as possible to the stable production environment (blue). Note that it's 🚀 deployed but it's not released, since it does not start responding to production requests. **Once the software is working in the green environment, the live 🚦 traffic is 🔛 switched** so that all incoming requests go to the green environment. If no 🐛 issues are found for a period of time, the 🔵 blue environment can be ⚰️ removed.

 <center><img src='https://i.ibb.co/J7thY91/blue-green-deployment.png'></center>

- ***Canary deployment***. The stable version of the model is kept in production, but a **small percentage of the workload is ➡ redirected to the new model, and results are 🖥 monitored**. Other 🧑 users continue to use the previous version until **we’re 👍 satisfied with the new release. Then, we can gradually roll the new release out to all users** (potentially in several workload percentage ⬆ increments). This way, a 🐛 malfunction would likely impact only a small portion of the workload. Computational performance and statistical ✔ tests can be performed to decide whether to fully 🔛 switch to the new model.

 <center><img src='https://i.ibb.co/6N6326V/canary-release1.png'></center>
 <center><img src='https://i.ibb.co/KX0NT0j/canary-release2.png'></center>

 Note that **requests handled by the canary model should be carefully 👉 picked**. For example, if the canary model is serving only one region, it may, for ML or infrastructure reasons, not perform as expected in other regions. A more robust approach is to pick the portion of 🧑 users served by the new model at random, but then it is often desirable for user experience to implement an affinity ⚙ mechanism so that the same user always uses the same version of the model.

 Canary testing can be used to carry out A/B ✅ testing, which is a process to compare two versions of an application in terms of a business performance 📏 metric. A/B testing and shadow testing will be discussed in the following notebook.

 Overall, canary releases are a powerful ⚒️ tool, but they require advanced 🧰️ tooling to manage the 🚀 deployment, gather the 📏 metrics, specify and run computations on them, display the results, and dispatch and process ⏰ alerts.
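One simple way to combine a random-looking split with user affinity is to hash the user identifier, as in the sketch below. The function name `assign_version` and the bucket count of 100 are assumptions; real traffic routers usually offer this as a configuration option.

```python
import hashlib


def assign_version(user_id, canary_percent):
    """Route a user to "canary" or "stable" based on a hash of the user id.

    The same user always lands in the same bucket, so they keep seeing the
    same model version as long as canary_percent is unchanged or increased.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "canary" if bucket < canary_percent else "stable"


first_call = assign_version("user-42", 10)
second_call = assign_version("user-42", 10)
```

Raising `canary_percent` step by step only moves additional users to the new model: users already served by the canary stay on it.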

### 🔧 Maintenance in production <a name="3.3.3"></a>

Once a model is released, it must be 🔧 maintained. At a high level, there are three maintenance 📏 measures:

- 💻 **Resource monitoring**: Just as for any application running on a server, collecting **IT metrics** such as **CPU, memory, disk, or network usage** can be useful to detect and troubleshoot 🐛 issues.

- ✅ **Health check**: To check if the model is indeed online and to analyze its ⏳ **latency**, it is common to implement a health check ⚙ mechanism that simply queries the model at a fixed interval (on the order of one minute) and ✍️ logs the results.

- 📏 **ML metrics monitoring**: This is about analyzing the **accuracy** of the model and **comparing it to another version or detecting when it is going stale**. Since it may require heavy computation, it is typically run at a lower frequency than the other two checks, often once a week, although the right cadence depends on the application.


Finally, **when a 🐛 malfunction is detected, a ↩️ rollback to a previous version may be necessary**. It is critical to have the rollback procedure ready and as 🤖 automated as possible.
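
As a rough illustration of the health check mechanism listed above, the sketch below probes a model at a fixed interval and logs availability and latency. It assumes the model is reachable through a Python callable; `check_once`, `run_health_checks`, and the `predict` argument are hypothetical names, and `max_checks` exists only so that the example terminates.

```python
import logging
import time
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("health_check")


def check_once(predict: Callable[[list], object], probe: list) -> Dict[str, object]:
    """Send one probe request; report whether it succeeded and how long it took."""
    start = time.perf_counter()
    try:
        predict(probe)
        healthy = True
    except Exception:  # any failure of the probe counts as unhealthy
        logger.exception("health check failed")
        healthy = False
    latency_ms = (time.perf_counter() - start) * 1000
    return {"healthy": healthy, "latency_ms": latency_ms}


def run_health_checks(
    predict: Callable[[list], object],
    probe: list,
    interval_s: float = 60.0,  # on the order of one minute, as in the text
    max_checks: int = 3,
) -> List[Dict[str, object]]:
    """Probe the model every `interval_s` seconds and log each result."""
    results = []
    for i in range(max_checks):
        result = check_once(predict, probe)
        logger.info("healthy=%s latency_ms=%.2f", result["healthy"], result["latency_ms"])
        results.append(result)
        if i < max_checks - 1:
            time.sleep(interval_s)
    return results
```

In a real deployment the probe would more likely be an HTTP request to the serving endpoint, and the logged results would feed the alerting that triggers a rollback.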

## 📦 Containerization <a name="3.4"></a>

Managing the versions of a model is more than **saving its code into a version control system**. It's necessary to provide a 📄 **description of the environment** (**libraries and their versions**, **🖇 dependencies**, etc). But storing this information is not enough. Deploying to production should 🤖 automatically and reliably 🔁 **rebuild this environment on the target 💻 machine**. The target machine may run multiple models simultaneously, and two models may have incompatible dependency versions. Several models running on the same machine could compete for resources, and one misbehaving model could hurt the performance of multiple cohosted models.

📦 **Containerization** technology is used to **tackle these challenges**. Containers bundle an application together with the 🔩 configuration files, libraries, and 🔗 dependencies that it requires to run across different operating environments. Unlike virtual machines, containers do not duplicate the operating system and are therefore far more 👍 efficient.

The best-known containerization technology is the open-source platform 🐋 [***Docker***](https://www.docker.com/). It allows an application to be 📦 packaged, ➡ sent to a server, and run with all its dependencies in isolation from other applications.

Building a model-serving environment that can accommodate many models, each of which may run in multiple copies, raises several questions:

- Which Docker host(s) should receive the container?
- When a model is 🚀 deployed in several copies, how can the workload be ⚖️ balanced?
- What happens if the model becomes unresponsive, for example, if the machine hosting it ❌ fails? How can that be 🧐 detected and a container reprovisioned?
- How can a model running on multiple machines be upgraded, with assurances that old and new versions are switched on and off and that the load balancer is updated in the correct sequence?

☸️ [***Kubernetes***](https://kubernetes.io/es/) is an open-source platform for **container orchestration**, which greatly simplifies these issues and many others. It provides a powerful declarative API to run applications in a group of Docker hosts, called a **Kubernetes cluster**. The word declarative means that rather than trying to express in code the steps to set up, 🖥 monitor, upgrade, ✋ stop, and connect the container (which can be complex and error-prone), users specify in a 📄 **configuration file** the desired state, and Kubernetes makes it happen and then maintains it.

For example, users need only specify to Kubernetes "make sure four instances of this container run at all times", and Kubernetes will allocate the hosts, start the containers, monitor them, and start a new instance if one of them 👎 fails. 
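
The declarative idea can be illustrated with a toy reconciliation loop: the user states the desired number of instances, and a controller repeatedly compares it with what is actually running. This sketch does not use the Kubernetes API; `ToyCluster` and `reconcile` are hypothetical names for an in-memory stand-in.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """In-memory stand-in for a cluster that tracks running instances."""

    running: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1), repr=False)

    def start_instance(self) -> str:
        name = f"model-{next(self._ids)}"
        self.running.add(name)
        return name

    def stop_instance(self) -> str:
        name = sorted(self.running)[-1]  # which instance is stopped is arbitrary here
        self.running.remove(name)
        return name


def reconcile(cluster: ToyCluster, desired_replicas: int) -> None:
    """Drive the actual state toward the declared desired state."""
    while len(cluster.running) < desired_replicas:
        cluster.start_instance()
    while len(cluster.running) > desired_replicas:
        cluster.stop_instance()


if __name__ == "__main__":
    cluster = ToyCluster()
    reconcile(cluster, desired_replicas=4)  # "make sure four instances run"
    cluster.running.discard("model-2")      # simulate one instance failing
    reconcile(cluster, desired_replicas=4)  # the next pass starts a replacement
    print(len(cluster.running))
```

The user never writes the steps to recover from a failure; rerunning the same reconciliation against the same declared state is enough.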

Docker with Kubernetes can provide a powerful 🏗 infrastructure to host applications, including ML models. Leveraging these products greatly simplifies the implementation of the 🚀 deployment strategies, like blue-green deployments or canary releases, although they are not aware of the nature of the deployed applications and thus can’t natively manage the ML performance 📊 analysis. Another advantage is the ability to easily ⬆ scale the model’s deployment.

## ⬆ Scaling deployments <a name="3.5"></a>

As ML adoption ⬆ grows, organizations face two types of challenges:

- The ability to use a model in production with high-scale data

- The ability to train larger and larger numbers of models

For the first challenge, the approach depends on the type of inference:

- Handling more data **for online inference** is easier with frameworks such as Kubernetes. Models can be replicated in the cluster in as many copies as necessary. The 🤖⬆ **auto-scaling** features in Kubernetes can provision new machines and handle load ⚖️ balancing. The major difficulty can then be to process a large amount of monitoring data.

- **For batch scoring**, the situation can be more complex. When the volume of data becomes too large, there are essentially two types of strategies to ➡️ distribute the computation:

 - Using a 🧰 framework that handles **distributed computation** natively like [**Spark**](https://spark.apache.org/). Spark is a distributed ⚙ computation framework that **can split the data and the computation among its nodes**.

 - Another way to distribute batch processing is to ✂️ **partition the data**. The general idea is that inference is typically a row-by-row operation (each row is scored one by one), and the data can be split in some way so that several machines can each read a subset of the data and score a subset of the rows.
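
The partitioning strategy can be sketched as follows. This is a minimal single-machine illustration in which worker threads stand in for separate machines; `toy_model` is a hypothetical placeholder for a trained model, and `partition`, `score_chunk`, and `score_in_partitions` are names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(rows: Sequence, n_parts: int) -> List[Sequence]:
    """Split rows into at most n_parts contiguous chunks of near-equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    size, remainder = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < remainder else 0)
        if end > start:  # skip empty chunks when there are fewer rows than parts
            chunks.append(rows[start:end])
        start = end
    return chunks


def toy_model(row: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: scores a row by summing it."""
    return float(sum(row))


def score_chunk(chunk: Sequence[Sequence[float]]) -> List[float]:
    """Score one subset of rows, row by row, independently of the other subsets."""
    return [toy_model(row) for row in chunk]


def score_in_partitions(rows: Sequence[Sequence[float]], n_workers: int = 4) -> List[float]:
    """Score each partition in its own worker and reassemble results in order."""
    chunks = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scored = list(pool.map(score_chunk, chunks))  # map preserves chunk order
    return [score for chunk_scores in scored for score in chunk_scores]


if __name__ == "__main__":
    data = [[i, i + 1] for i in range(10)]
    print(score_in_partitions(data, n_workers=3))
```

This works because scoring one row does not depend on any other row, so each worker only needs its own subset of the data and a copy of the model.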

The second challenge, training and deploying a growing number of models, is also demanding in terms of governance and processes. ⬆ **Scaling the number of models** means that the **CI/CD pipeline** must be able to **handle many 🚀 deployments**. As the number of models ⬆ grows, the need for 🤖 automation and governance grows, as human verification cannot necessarily be systematic or consistent. In some applications, it is possible to rely on fully 🤖 automated continuous deployment if the ⚠️ risks are controlled by automated ✅ validation, canary releases, etc. There can also be 🏗 infrastructure challenges, since training, building models, validating on test data, etc., all need to be performed on clusters rather than on a single 💻 machine.

Also, with a ⬆ higher number of models, the CI/CD pipeline of each model can vary, and if nothing is done, each team could develop its own CI/CD pipeline for each model. This is suboptimal from efficiency and governance perspectives. While some models may need specific validation pipelines, most projects can probably use common patterns. In addition, 🔧 maintenance becomes more complex: if the pipelines do not share a common structure, adding a new validation step to all of them may be impractical, because they cannot be updated safely. 🖇 Sharing practices and standardized pipelines can help limit complexity.

**Scalable and elastic systems**

A computational system is said to be horizontally scalable if it is possible to ➕ add more servers to expand its processing 💪 power. For example, a Kubernetes cluster can be expanded to 💯 hundreds of machines. However, if a system includes only one machine, it may be challenging to incrementally upgrade it significantly, and at some point, a migration to a bigger machine or a horizontally scalable system will be required.

An elastic system allows, in addition to being scalable, easy ➕ addition and ⚰️ removal of resources to match the compute requirements. For example, a Kubernetes cluster in the cloud can have an **auto-scaling** capability that 🤖 automatically **adds machines when the cluster usage metrics are ⬆ high and removes them when they are ⬇ low**. In principle, elastic systems can ♻️ optimize the usage of resources; they **automatically adapt to an increase in usage without the need to permanently provision resources that are rarely required**.
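
The elastic behavior described above can be reduced to a simple threshold rule. The sketch below is an assumption-laden illustration, not the algorithm of any specific autoscaler: the function name `desired_machine_count`, the thresholds, and the machine limits are all hypothetical.

```python
def desired_machine_count(
    current: int,
    utilization: float,
    *,
    high: float = 0.8,      # hypothetical threshold above which a machine is added
    low: float = 0.3,       # hypothetical threshold below which a machine is removed
    min_machines: int = 1,
    max_machines: int = 10,
) -> int:
    """Return how many machines the cluster should have after one scaling step."""
    if utilization > high:
        target = current + 1
    elif utilization < low:
        target = current - 1
    else:
        target = current
    # Keep the result within the allowed cluster size.
    return max(min_machines, min(max_machines, target))


if __name__ == "__main__":
    machines = 3
    for usage in (0.9, 0.95, 0.5, 0.1, 0.1):
        machines = desired_machine_count(machines, usage)
        print(f"utilization={usage:.2f} -> machines={machines}")
```

Keeping a gap between the `high` and `low` thresholds avoids adding and removing machines repeatedly when utilization hovers around a single value.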

# 📕 References <a name="4"></a>

- [MLOps: Continuous delivery and automation pipelines in machine learning](https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)

- [Book "Introducing MLOps"](https://www.oreilly.com/library/view/introducing-mlops/9781492083283/)


- [AWS machine-learning-lens](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/model-deployment-approaches.html)

- [MLinproduction blog](https://mlinproduction.com/)

- [Continuous Delivery for Machine Learning](https://martinfowler.com/articles/cd4ml.html)

- [The ML Test Score](https://research.google/pubs/pub46555/)

- [Awesome MLOps repo](https://github.com/visenger/awesome-mlops)

