# Chapter 7: Model Deployment and Prediction Service

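- Deployment can be as easy as wrapping your model in a Flask/FastAPI endpoint with some boilerplate
    ```python
    from flask import Flask, jsonify, request

    app = Flask(__name__)  # MODEL is assumed to be a fitted model loaded at startup

    @app.route('/predict', methods=['POST'])
    def predict():
        X = request.get_json()['X']    # expects a JSON body like {"X": [[...], ...]}
        y = MODEL.predict(X).tolist()  # NumPy array -> JSON-serialisable list
        return jsonify({'y': y}), 200
    ```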

- But what if you need your model to be highly available (e.g. 99% uptime), with infra set up so people are notified when things go wrong?

## Deployment Myths

- Myth: You only deploy 1 or 2 models in production
    - tbh even a simple service like GFG has hundreds of models
- Myth: Model work ends upon deployment
    - Your model probably sucks, and will suck more over time as production data drifts away from the data it was trained on
- Myth: You won't need to update your model
    - See the previous myth
- Myth: Most MLEs don't need to worry about scale
    - Yea that's completely untrue. Nobody is going to accept that doubling a product's reach doubles costs, when the entire point of ML is scalability to begin with

## Batch vs Online

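- Online prediction: Predictions are generated on demand, as soon as a request arrives
- Batch prediction: Predictions are precomputed at fixed intervals (e.g. every few hours), stored, and looked up when needed
- Often done side by side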
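
A minimal sketch of how the two modes can sit side by side, assuming a hypothetical `model_predict` stand-in for a real model and an in-memory dict as the prediction store (both are illustrative names, not from the source). Known users are scored ahead of time by a batch job, and anything missing from the store falls back to online prediction:

```python
from typing import Dict, List, Tuple


def model_predict(features: List[float]) -> float:
    """Hypothetical stand-in for a trained model: returns the mean of the features."""
    return sum(features) / len(features)


def run_batch_job(feature_store: Dict[str, List[float]]) -> Dict[str, float]:
    """Batch prediction: score every known user ahead of time and store the results."""
    return {user_id: model_predict(feats) for user_id, feats in feature_store.items()}


def serve(user_id: str, features: List[float], cache: Dict[str, float]) -> Tuple[float, str]:
    """Return a precomputed prediction if one exists, otherwise predict on demand."""
    if user_id in cache:
        return cache[user_id], "batch"
    return model_predict(features), "online"


feature_store = {"alice": [1.0, 3.0], "bob": [2.0, 6.0]}
cache = run_batch_job(feature_store)               # would run on a schedule in practice

alice_result = serve("alice", [1.0, 3.0], cache)   # served from the precomputed store
carol_result = serve("carol", [4.0, 4.0], cache)   # unseen user, computed on demand
```

In a real system the dict would be replaced by a database or key-value store, and the batch job would run on a scheduler.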

## Unified Batch and Streaming Pipeline 

![batch v stream](./artifacts/7_image.png)

![system](./artifacts/7a_image.png)

## Model Compression

- Low-rank factorisation
    - When you have blocks that are super large (e.g. big layers in neural nets), it can make sense to replace them with more compact components that need fewer parameters
    - e.g. Instead of 3x3 convolutions, do 1x1 convolutions

- Knowledge distillation
    - Train a smaller model (student) to mimic the predictions of a bigger model (teacher)

- Pruning
    - Remove unnecessary parts of the model (e.g. nodes that contribute little), or set weights that are close to zero to exactly zero so the model becomes sparse

- Quantization
    - Use fewer bits to represent each parameter, e.g. go from 32-bit floats to 16-bit floats or 8-bit integers
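
To make the quantisation idea concrete, here is a rough sketch assuming float32 weights and an int8 target. It uses a simple affine (scale plus offset) scheme; `quantize_int8` and `dequantize` are illustrative helper names, and real frameworks use more elaborate schemes (per-channel scales, calibration data, and so on):

```python
import numpy as np


def quantize_int8(w):
    """Affine quantisation: map float32 values onto the int8 range [-128, 127]."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0          # fall back to 1.0 if all values are equal
    zero_point = int(round(-128 - w_min / scale))   # offset so that w_min lands on -128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original float values."""
    return (q.astype(np.float32) - zero_point) * scale


# Random values standing in for a layer's weights (illustrative only)
weights = np.random.default_rng(0).standard_normal(1000).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = float(np.abs(weights - restored).max())

# int8 storage takes a quarter of the bytes of float32 storage
memory_ratio = weights.nbytes / q.nbytes
```

The trade-off is precision: each restored weight is off by roughly one quantisation step at most, and whether that is acceptable depends on the model.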


## ML on Edge

- Instead of running inference on your own compute, offload to edge devices

- This requires a specialised skill set to map compute to hardware

![memory](./artifacts/7b_image.png)

- To avoid needing a specific framework for each type of compute primitive, it's best to rely on a middleman: a compiler that translates model code through a series of intermediate representations (IRs), from high-level to low-level, before generating machine code

- This is known as "lowering"

### Model Optimisation

- There is sometimes an additional role between MLEs and data scientists: optimisation engineers
- Basically they take single-threaded models and rewrite components to make them run faster

- Some standard optimisation techniques:
    - Vectorisation: Given a loop or a nested loop, instead of executing it one item at a time, execute multiple elements contiguous in memory at the same time to reduce latency caused by data I/O.
    - Parallelization: Operate on multiple chunks of the data at the same time
    - Loop tiling: Change the data accessing order in a loop to leverage hardware's memory layout and cache. This kind of optimization is hardware dependent: A good access pattern on CPUs is not a good access pattern on GPUs.
    - Operator fusion: Fuse multiple operators into one to avoid redundant memory access.
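
As an illustration of loop tiling, here is a sketch of a matrix multiplication written both naively and in tiled form. The function names and the `tile` parameter are made up for this example. In pure Python you won't see a speedup; the point is the changed loop structure, which in compiled code keeps each small block of data in cache while it is being reused:

```python
def matmul_naive(A, B):
    """Plain triple loop: walks entire rows and columns for every output cell."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same arithmetic, but the work is split into tile x tile blocks."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    # Outer loops pick a block; inner loops do all the work inside that block
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        for k in range(k0, min(k0 + tile, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


A = [[float(i + j) for j in range(4)] for i in range(3)]      # 3 x 4
B = [[float(i * j + 1) for j in range(5)] for i in range(4)]  # 4 x 5

C_naive = matmul_naive(A, B)
C_tiled = matmul_tiled(A, B, tile=2)
```

The best tile size depends on the cache sizes of the target hardware, which is why this optimisation is hardware dependent.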

### ML for Optimisation

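- Instead of relying on hand-written heuristics, you can search for (or learn) the fastest way to run a model on a given piece of hardware
- For example, in PyTorch on CUDA GPUs, setting `torch.backends.cudnn.benchmark = True` enables cuDNN autotune, which benchmarks the available convolution algorithms and picks the fastest one for your input shapes
- `autoTVM` (part of Apache TVM) is another such solution: it searches for the fastest low-level implementation (schedule) of each operator on the target hardware, rather than changing the architecture of your model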
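
The core idea behind autotuning can be sketched in a few lines: time several equivalent implementations on a representative input and keep the fastest. This is a toy illustration with made-up function names, not how cuDNN or autoTVM are implemented. In particular, autoTVM uses a learned cost model to decide which candidates are worth measuring, since exhaustively timing everything gets expensive:

```python
import time


def sum_squares_loop(xs):
    total = 0
    for x in xs:
        total += x * x
    return total


def sum_squares_genexpr(xs):
    return sum(x * x for x in xs)


def sum_squares_map(xs):
    return sum(map(lambda x: x * x, xs))


def autotune(candidates, sample_input, repeats=5):
    """Time each candidate on the sample input and return the fastest one."""
    timings = {}
    for fn in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample_input)
            best = min(best, time.perf_counter() - start)
        timings[fn.__name__] = best  # keep the best of several runs to reduce noise
    fastest = min(candidates, key=lambda fn: timings[fn.__name__])
    return fastest, timings


candidates = [sum_squares_loop, sum_squares_genexpr, sum_squares_map]
sample = list(range(10_000))

fastest, timings = autotune(candidates, sample)
result = fastest(sample)  # use the tuned implementation from here on
```

Which candidate wins depends on the machine and the input, which is exactly why the choice is measured rather than hard-coded.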

### ML in Browsers

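- WebAssembly (WASM) is a common compilation target for running ML in browsers: models are compiled to WASM and executed by the browser. It is faster than JavaScript, though still slow compared to running natively on the device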

## Overall

![big picture](./artifacts/7d_image.png)