# 6.3 Deploy the model

Once our model has successfully cleared the final evaluation on the test set, it's time to deploy the model and begin its productive life!

## 6.3.1 Explain your work to stakeholders and set expectations

- Trust is about consistently meeting or exceeding people's expectations.
- The actual system is only half of that picture. The other half is setting appropriate expectations before launch.
- People outside AI often have unrealistic expectations of the system. They think it "understands" the task.
  - To counter this, show them *failure modes* of your model (e.g., what incorrectly classified examples look like).
- They often expect human-level performance on tasks that were previously performed by humans. Most ML models do not get there.
- Clearly convey the model performance expectations and avoid abstract statements such as "the model has 98% accuracy".
- Prefer talking about false negative/positive rates.
  - Example: “*With these settings, the fraud detection model would have a 5% false negative rate and a 2.5% false positive rate. Every day, an average of 200 valid transactions would be flagged as fraudulent and sent for manual review, and an average of 14 fraudulent transactions would be missed. An average of 266 fraudulent transactions would be correctly caught.*”
- Clearly relate model's performance metrics to the business goals.
- Make sure to discuss the choice of key launch parameters with the stakeholders (e.g., the probability threshold above which a transaction is flagged). Such choices involve trade-offs that can only be handled with a deep understanding of the business context.
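
The figures in the fraud example above follow directly from the two error rates and the daily volumes. A minimal sketch, assuming roughly 8,000 valid and 280 fraudulent transactions per day (volumes back-derived from the quoted numbers; the function name is hypothetical):

```python
def expected_daily_outcomes(n_valid, n_fraud, false_positive_rate, false_negative_rate):
    """Translate error rates into expected daily counts a stakeholder can act on."""
    flagged_valid = n_valid * false_positive_rate  # valid transactions sent to manual review
    missed_fraud = n_fraud * false_negative_rate   # fraudulent transactions that slip through
    caught_fraud = n_fraud - missed_fraud          # fraudulent transactions correctly flagged
    return {
        "valid_flagged_for_review": round(flagged_valid),
        "fraud_missed": round(missed_fraud),
        "fraud_caught": round(caught_fraud),
    }


# Assumed volumes: 8,000 valid and 280 fraudulent transactions per day.
outcomes = expected_daily_outcomes(
    8000, 280, false_positive_rate=0.025, false_negative_rate=0.05
)
print(outcomes)
```

Re-running this with a different threshold's error rates gives stakeholders a concrete picture of the trade-off between review workload and missed fraud.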



## 6.3.2 Ship an inference model

- An ML project doesn't end when you save a trained model on Google Colab.
- First, we may have to export the model to something other than Python:
  - The production environment may not support Python at all (mobile/embedded systems).
  - If the rest of the app is not in Python, using Python to serve a model may induce significant overhead.
- Since we only want predictions from our production model (a phase called *inference*), we have room to perform various optimizations that can make the model quicker and reduce its memory footprint.

Different model deployment options:

### Deploying a model as a REST API

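- A common way to turn a model into a product: the model runs on a server or cloud instance, and the app queries its predictions over a REST API.
- Could build your own serving app or use *TensorFlow Serving*: TF's own library for shipping models as APIs. You can deploy a Keras model in minutes with this.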
- Use this when:
  - The app consuming the model's prediction has reliable access to the internet.
  - The app does not have strict latency requirements.
  - Input data sent for inference is not highly sensitive (the data must be available on the server in decrypted form).
- Important question when going with a REST API: do you want to host your own code or use a fully managed third-party cloud service like Cloud AI Platform by Google?
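
For illustration, querying a model hosted with TensorFlow Serving could look like the sketch below. It assumes the server exposes its REST predict endpoint on the default port 8501; the host, the model name `fraud_model`, and the input values are hypothetical.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, port=8501):
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(host, model_name, instances, timeout=5.0):
    """Send the request and return the list stored under the "predictions" key."""
    request = build_predict_request(host, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]


# Hypothetical model name and input; nothing is sent until predict() is called.
request = build_predict_request("localhost", "fraud_model", [[0.1, 0.2, 0.3]])
print(request.full_url)
```

Note that the input leaves the client in the request body, which is why this option is a poor fit for highly sensitive data.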

### Deploying a model on a device

- Sometimes we need the model to be on the same device as the app that uses it.
- Do this when:
  - Strict latency constraints.
  - Model can be made sufficiently small. Can use TensorFlow Model Optimization Toolkit.
  - Getting the highest possible accuracy is not mission critical for the task.
  - Input data is highly sensitive and should not be decryptable on a remote server.
- To deploy a Keras model on a smartphone or an embedded device, use TensorFlow Lite. It runs on Android, iOS, ARM64-based computers, Raspberry Pi, etc., and includes a converter that turns your Keras model into the TensorFlow Lite format.
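
A minimal sketch of the conversion step, assuming TensorFlow is installed and `keras_model` is an already trained Keras model. The helper names are hypothetical; the converter calls are from the `tf.lite` API.

```python
def convert_to_tflite(keras_model, quantize=False):
    """Convert an in-memory Keras model to TensorFlow Lite flatbuffer bytes."""
    import tensorflow as tf  # imported lazily so this file loads without TensorFlow

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    if quantize:
        # Default optimizations enable post-training quantization of the weights.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()


def save_tflite(tflite_bytes, path):
    """Write the converted model to disk so it can be bundled with the app."""
    with open(path, "wb") as f:
        f.write(tflite_bytes)
```

Typical usage would be `save_tflite(convert_to_tflite(model), "model.tflite")`, with the resulting file shipped inside the mobile or embedded app.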

### Deploying a model in the browser

- DL is often used in browser-based or desktop-based JS apps.
- Use this when:
  - You want to offload compute to the end user to save your own server costs.
  - Input data needs to stay on the end user's device.
  - Strict latency constraints.
  - Need your app to keep working without connectivity.
- Only go with this option if the model is small enough to not hog the CPU, GPU or RAM of the user's device.
- Make sure nothing about the model needs to stay confidential (the entire model is downloaded to the user's device).
- It is usually possible to recover some information about the training data from a trained model, so do not make the model public if it was trained on sensitive data.
- To deploy a model in JS, use TensorFlow.js, a JS library for DL that implements almost all of the Keras API and many lower-level TF APIs.
- Can easily import a saved Keras model into TF.js to query it as part of your browser-based JS app or desktop Electron app.

### Inference model optimization

- Optimizing the model for inference is important when the deployment environment has strict constraints on power/memory, or when the app has strict latency requirements.
- Two popular optimization techniques:
  1. *Weight pruning*: not every weight contributes equally. We can prune the weights that are not important. Reduces memory and compute footprint, at a small cost in performance metrics.
  2. *Weight quantization*: DL models are trained with `float32` weights, but you can *quantize* weights to `int8` to get an inference-only model that's a quarter of the size but near the accuracy of the original model.
- The TF ecosystem includes a weight pruning and quantization toolkit (http://tensorflow.org/model_optimization) that is deeply integrated with the Keras API.
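
To make the two ideas concrete, here is a NumPy-only sketch on a random weight matrix. It is not the toolkit's API: it assumes simple magnitude-based pruning and symmetric per-tensor `int8` quantization, and all function names are hypothetical.

```python
import numpy as np


def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0).astype(weights.dtype)


def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> (int8 values, float scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)  # stand-in for a layer's weights

pruned = prune_by_magnitude(w, sparsity=0.5)
q, scale = quantize_int8(w)

print("zeroed fraction:", float(np.mean(pruned == 0)))
print("size ratio float32/int8:", w.nbytes / q.nbytes)
print("max abs quantization error:", float(np.max(np.abs(w - dequantize(q, scale)))))
```

The rounding error of each weight is bounded by half the quantization step (`scale / 2`), which is why accuracy usually stays close to the original. Pruned weights only save memory and compute when stored or executed in a sparse-aware format.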

## 6.3.3 Monitor your model in the wild

- By this point you have exported an inference model, integrated it into the app, done a dry run on production data, written unit tests, and set up logging and status monitoring. Now deploy the model to production!
- This is not the end! Once deployed, you need to keep monitoring its behavior, its performance on new data, its interaction with the rest of the app, and its eventual impact on business metrics.
  - *Randomized A/B testing*: send a subset of cases through your new model, and another subset through the old process. After many cases, the difference in outcomes between the 2 is likely attributable to your model.
  - Do a regular manual audit on production data. Send some fraction of production data for manual annotation, and compare the model's predictions to new annotations.
  - When manual annotation is not possible, consider alternative evaluation avenues such as user surveys.
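
The A/B split and the manual audit above can be sketched as follows. This assumes each case has a stable ID and uses a hash of that ID as the source of randomness, so a given case always lands in the same arm; the function names, the 10% fraction, and the sample labels are hypothetical.

```python
import hashlib


def assign_arm(case_id, new_model_fraction=0.1):
    """Deterministically route a case to 'new_model' or 'old_process'."""
    digest = hashlib.sha256(str(case_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # approximately uniform in [0, 1]
    return "new_model" if bucket < new_model_fraction else "old_process"


def audit_agreement(predictions, annotations):
    """Fraction of audited samples where the model matches the fresh manual label."""
    if len(predictions) != len(annotations) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == a for p, a in zip(predictions, annotations))
    return matches / len(predictions)


arms = [assign_arm(f"case-{i}") for i in range(10_000)]
print("share routed to new model:", arms.count("new_model") / len(arms))

# Hypothetical audit batch: model predictions vs. new manual annotations.
agreement = audit_agreement(["fraud", "ok", "ok", "ok"], ["fraud", "ok", "fraud", "ok"])
print("audit agreement:", agreement)
```

Tracking the audit agreement over time gives an early signal that the model's production performance is slipping.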

## 6.3.4 Maintain your model

- No model lasts forever. Remember *concept drift*: the characteristics of production data change over time, gradually degrading the model's performance.
- As soon as your model is launched, get ready to train the next generation that will replace it.
  - Watch for changes in production data. Are new features available? Should you expand or edit the label set?
  - Keep collecting and annotating data, and keep improving your annotation pipeline over time.
  - Pay attention to collecting samples that are hard for your current model to classify; adding them to the training set is the most likely way to improve performance.
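
One way to watch for changes in production data is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the population stability index (PSI), which is an assumption of this example rather than a method named in the notes. The bin count, the smoothing constant, and the synthetic data are also illustrative.

```python
import math
import random


def population_stability_index(reference, production, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature; larger means more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # feature at training time
stable = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # production, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # production, mean has drifted

psi_stable = population_stability_index(train, stable)
psi_shifted = population_stability_index(train, shifted)
print("PSI without drift:", psi_stable)
print("PSI with drift:", psi_shifted)
```

A rising PSI on important features is a cue to collect and annotate fresh data and to start training the next model generation. It measures drift in the inputs only, so it complements the manual audits rather than replacing them.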

# Summary

1. New ML project? First define the problem:
   - Understand the broader context of the problem, the end goal, and the constraints.
   - Collect and annotate a dataset, and understand the data in depth.
   - Choose a measure of success: which metrics will you monitor on the validation data?
2. Next, develop a model:
   - Prepare the data.
   - Pick an evaluation protocol: holdout validation, K-fold validation, or iterated K-fold validation? Which portion of the data to use for validation?
   - Achieve statistical power: beat a simple baseline.
   - Scale up: develop a model that can overfit.
   - Regularize the model and tune its hyperparameters.
3. When this is done, it's time for deployment:
   - Set appropriate expectations with the stakeholders.
   - Optimize the final model for inference and ship it to the deployment environment of choice: web server, mobile, browser, embedded device, etc.
   - Monitor the model's performance in production and keep collecting data to work on the next generation of the model.
