<a href="https://colab.research.google.com/github/venkatasl/AIML_TRAINING_VENKAT/blob/main/PSU_Day_09_MLDevOps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLDevOps - Deployment, Adaptation and Maintenance

## What is DevOps?

![](https://ml-devops-tutorial.readthedocs.io/en/latest/_images/devops1.jpg)

* it is not just tools
* it is not just processes
* it is not a trendy job title
* it is not just automation

DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support.

![](https://ml-devops-tutorial.readthedocs.io/en/latest/_images/devops-whatisdevops.png)

Adopting these practices can lead to more robust and reliable systems and can improve the whole development cycle, from R&D through development to production: shorter deployment turnarounds, better system monitoring and alerting, and better development planning. DevOps focuses on continuous integration and continuous delivery of software by leveraging on-demand IT resources (infrastructure as code) and by automating the integration, testing, and deployment of code.

## Some concepts that are part of an MLDevOps framework:

**1. Development platform:** a collaborative platform for performing ML experiments and empowering the creation of ML models by data scientists should be considered part of the MLOps framework. This platform should enable secure access to data sources (e.g., from data engineering workflows). We want the handover from ML training to deployment to be as smooth as possible, which is more likely with such a platform than with ML models developed in different local environments.

**2. Model unit testing:** every time we create, change, or retrain a model, we should automatically validate the integrity of the model, e.g.
- should meet minimum performance on a test set
- should perform well on synthetic use case-specific datasets
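
The first check above can be sketched as a small quality gate that runs whenever a model is created or retrained. This is a minimal sketch: the helper names (`accuracy`, `passes_quality_gate`), the toy model, and the 0.9 threshold are illustrative assumptions, not part of any particular framework.

```python
from typing import Callable, Sequence

def accuracy(predict: Callable[[Sequence[float]], int],
             samples: Sequence[Sequence[float]],
             labels: Sequence[int]) -> float:
    """Fraction of samples for which predict() returns the expected label."""
    if len(samples) != len(labels) or not samples:
        raise ValueError("samples and labels must be non-empty and equally long")
    correct = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return correct / len(labels)

def passes_quality_gate(predict, samples, labels, min_accuracy: float = 0.9) -> bool:
    """Return True only if the model reaches the minimum accuracy on the test set."""
    return accuracy(predict, samples, labels) >= min_accuracy

# Toy stand-in for a trained model (assumption): predicts 1 when the first feature is positive.
def toy_model(x):
    return 1 if x[0] > 0 else 0

test_samples = [[1.0], [2.0], [-1.0], [-3.0], [0.5]]
test_labels = [1, 1, 0, 0, 0]   # the last label disagrees with toy_model on purpose

if not passes_quality_gate(toy_model, test_samples, test_labels):
    print("Model rejected: accuracy below the required minimum")
```

In a CI pipeline the same check would typically be an assertion in a test suite, so that a failing model blocks the deployment step.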

**3. Versioning:** it should be possible to go back in time to inspect everything relating to a given model, e.g., what data & code was used. Why? Because if something breaks, we need to be able to go back in time and see why.
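
One lightweight way to make "what data & code was used" inspectable later is to store a manifest next to each trained model. The sketch below is an assumption about how such a manifest could look; `fingerprint`, `build_manifest`, and the field names are hypothetical, and real setups often rely on dedicated data and experiment versioning tools instead.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Stable SHA-256 hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_manifest(data_rows, hyperparams, code_version):
    """Collect everything needed to trace a model back to its inputs."""
    return {
        "data_sha256": fingerprint(data_rows),   # changes whenever the training data changes
        "hyperparams": hyperparams,
        "code_version": code_version,            # e.g. a git commit hash (hypothetical value below)
    }

manifest = build_manifest(
    data_rows=[[0.1, 0.2, 1], [0.3, 0.4, 0]],
    hyperparams={"lr": 0.01, "epochs": 5},
    code_version="abc1234",
)
print(json.dumps(manifest, indent=2))
```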

**4. Model registry:** there should be an overview of deployed & decommissioned ML models, their version history, and the deployment stage of each version. Why? If something breaks, we can roll a previously archived version back into production.
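
To make the idea concrete, here is a minimal in-memory sketch of a registry with version history, stages, and rollback. The class names, stage names, and artifact paths are assumptions for illustration; a production registry would persist this state and enforce access control.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

STAGES = ("dev", "test", "prod", "archived")

@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    stage: str = "dev"

@dataclass
class ModelRegistry:
    versions: Dict[str, List[ModelVersion]] = field(default_factory=dict)

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        """Add a new version; version numbers start at 1 and grow by one."""
        history = self.versions.setdefault(name, [])
        mv = ModelVersion(version=len(history) + 1, artifact_uri=artifact_uri)
        history.append(mv)
        return mv

    def get(self, name: str, version: int) -> ModelVersion:
        for mv in self.versions.get(name, []):
            if mv.version == version:
                return mv
        raise KeyError(f"{name} v{version} is not registered")

    def promote(self, name: str, version: int, stage: str) -> None:
        """Move a version to a stage; at most one version is in 'prod' at a time."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self.get(name, version)
        if stage == "prod":
            for mv in self.versions[name]:
                if mv.stage == "prod":
                    mv.stage = "archived"
        target.stage = stage

    def production(self, name: str) -> Optional[ModelVersion]:
        for mv in self.versions.get(name, []):
            if mv.stage == "prod":
                return mv
        return None

registry = ModelRegistry()
registry.register("digits", "models/digits/v1.pt")   # hypothetical artifact paths
registry.register("digits", "models/digits/v2.pt")
registry.promote("digits", 1, "prod")
registry.promote("digits", 2, "prod")                 # v1 is archived automatically
registry.promote("digits", 1, "prod")                 # rollback: v1 serves again, v2 is archived
```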

**5. Model Governance:** only certain people should be able to see the training details of any given model. There should be access control for who can request/reject/approve transitions between deployment stages (e.g., dev to test to prod) in the model registry.

**6. Deployments:** deployment can be many things, but in this post, I consider the case where we want to deploy a model to cloud infrastructure and expose an API, which enables other people to consume and use the model, i.e., I’m not considering cases where we want to deploy ML models into embedded systems. Efficient model deployments on appropriate infrastructure should:
- support multiple ML frameworks + custom models
- have well-defined API spec (e.g., Swagger/OpenAPI)
- support containerized model servers

**7. Monitoring:** tracking performance metrics (throughput, uptime, etc.). Why? If a model suddenly starts returning errors or becomes unexpectedly slow, we need to know before the end-user complains so that we can fix it.
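
A minimal sketch of how call counts, error rate, and latency could be collected around a prediction function. `EndpointMetrics` and `fake_predict` are hypothetical names; real deployments would usually export such numbers to a monitoring system rather than keep them in memory.

```python
import time
from functools import wraps

class EndpointMetrics:
    """Counts calls and errors and accumulates latency for one endpoint."""

    def __init__(self):
        self.calls = 0
        self.errors = 0
        self.total_seconds = 0.0

    def track(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                self.errors += 1
                raise                      # re-raise so callers still see the failure
            finally:
                self.calls += 1
                self.total_seconds += time.perf_counter() - start
        return wrapper

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def mean_latency(self) -> float:
        return self.total_seconds / self.calls if self.calls else 0.0

metrics = EndpointMetrics()

@metrics.track
def fake_predict(x):
    """Stand-in for a model call (assumption): fails on negative input."""
    if x < 0:
        raise ValueError("negative input")
    return x * 2
```

An alert could then fire when `metrics.error_rate` or `metrics.mean_latency` crosses a threshold chosen for the service.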

**8. Feedback:** we need to feed information back to the model about how well it is performing. Why? Typically we run predictions on new samples for which we do not yet know the ground truth. As we learn the truth, however, we need to report it back so that we can measure how well the model is actually doing.
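
The feedback loop can be sketched as a log that stores each prediction under a request ID and later joins it with the ground truth once it becomes known. `FeedbackLog` and its method names are assumptions made for this illustration.

```python
from typing import Dict, Optional

class FeedbackLog:
    """Stores predictions and, once known, the matching ground-truth labels."""

    def __init__(self):
        self._predictions: Dict[str, int] = {}
        self._labels: Dict[str, int] = {}

    def record_prediction(self, request_id: str, predicted: int) -> None:
        self._predictions[request_id] = predicted

    def record_ground_truth(self, request_id: str, actual: int) -> None:
        if request_id not in self._predictions:
            raise KeyError(f"no prediction logged for request {request_id}")
        self._labels[request_id] = actual

    def live_accuracy(self) -> Optional[float]:
        """Accuracy over requests whose ground truth is known; None if there are none yet."""
        if not self._labels:
            return None
        hits = sum(1 for rid, y in self._labels.items() if self._predictions[rid] == y)
        return hits / len(self._labels)

log = FeedbackLog()
log.record_prediction("req-1", 7)
log.record_prediction("req-2", 3)
log.record_prediction("req-3", 1)      # ground truth for this one never arrives
log.record_ground_truth("req-1", 7)    # the model was right
log.record_ground_truth("req-2", 8)    # the model was wrong
print(log.live_accuracy())
```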

**9. A/B testing:** no matter how solid we think our cross-validation is, we never know how the model will perform until it is actually deployed. It should be easy to perform A/B experiments with live models within the MLOps framework.
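
One common ingredient of an A/B experiment is a stable assignment of users to model variants. The sketch below hashes a user ID into a bucket so that the same user always sees the same variant; the function name, the salt value, and the 50/50 default split are assumptions.

```python
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.5, salt: str = "exp-1") -> str:
    """Deterministically map a user to model 'A' (control) or 'B' (treatment)."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "B" if bucket < treatment_share else "A"

# The same user is always routed to the same variant within one experiment.
for uid in ("alice", "bob", "carol"):
    print(uid, assign_variant(uid))
```

Changing the salt starts a new experiment with an independent assignment, which avoids reusing the same user groups across experiments.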

**10. Drift detection:** typically, the longer a model has been deployed, the worse it performs, because circumstances change relative to when the model was trained. We can monitor and alert on these changes, or “drifts”, before they get too severe:
- Concept drift: when the relation between input and output has changed
- Prediction drift: changes in predictions, but the model still holds
- Label drift: change in the model’s outcomes compared to training data
- Feature drift: change in the distribution of model input data
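
As an illustration of feature drift monitoring, the sketch below compares the distribution of one input feature at training time with its live distribution using the Population Stability Index (PSI). PSI is a swapped-in example technique, not something the text above prescribes; the function names, the bin count, and the commonly quoted rule of thumb that values above roughly 0.25 indicate a large shift are assumptions to tune per use case.

```python
import math
from typing import List, Sequence

def histogram_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per bin; values outside the edges fall into the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]

def population_stability_index(reference: Sequence[float], live: Sequence[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference (training) sample and a live sample of one feature."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if lo == hi:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    ref_shares = histogram_shares(reference, edges)
    live_shares = histogram_shares(live, edges)
    psi = 0.0
    for r, c in zip(ref_shares, live_shares):
        r, c = max(r, eps), max(c, eps)      # avoid log(0) for empty bins
        psi += (c - r) * math.log(c / r)
    return psi

training_feature = [i / 100 for i in range(100)]
shifted_feature = [x + 0.5 for x in training_feature]   # simulated drift
print(population_stability_index(training_feature, training_feature))
print(population_stability_index(training_feature, shifted_feature))
```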

**11. Outlier detection:** if a deployed model receives an input sample that is significantly different from anything observed during training, we can try to identify this sample as a potential outlier, and the returned prediction should be marked as such, indicating that the end-user should be careful in trusting the prediction.
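
A very simple way to flag such inputs is a per-feature z-score against the training statistics. This is only a sketch under the assumption of roughly bell-shaped, tabular features; `ZScoreOutlierDetector` and the threshold of 3 standard deviations are illustrative choices, and image inputs like those used later in this notebook would need a different approach.

```python
import statistics
from typing import List, Sequence

class ZScoreOutlierDetector:
    """Flags a sample if any feature lies too many standard deviations from the training mean."""

    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold
        self.means: List[float] = []
        self.stdevs: List[float] = []

    def fit(self, training_rows: Sequence[Sequence[float]]) -> "ZScoreOutlierDetector":
        columns = list(zip(*training_rows))
        self.means = [statistics.fmean(col) for col in columns]
        self.stdevs = [statistics.pstdev(col) for col in columns]
        return self

    def is_outlier(self, row: Sequence[float]) -> bool:
        for x, mu, sd in zip(row, self.means, self.stdevs):
            if sd == 0:
                if x != mu:                  # constant feature in training, different value now
                    return True
                continue
            if abs(x - mu) / sd > self.threshold:
                return True
        return False

detector = ZScoreOutlierDetector().fit([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [2.0, 11.0]])

def predict_with_flag(row):
    """Return a (hypothetical) prediction together with an outlier warning flag."""
    prediction = sum(row)                    # stand-in for a real model call
    return {"prediction": prediction, "possible_outlier": detector.is_outlier(row)}
```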

**12. Adversarial Attack Detection:** we should be warned when adversarial samples attack our models (e.g., someone trying to abuse/manipulate the outcome of our algorithms).

**13. Interpretability:** the ML deployments should support endpoints returning the explanation of our prediction, e.g., through SHAP values. Why? For many use cases, a prediction is not enough, and the end-user needs to know why a given prediction was made.
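
For intuition, consider the special case of a linear model with independent features: there, the SHAP value of feature i reduces to `w_i * (x_i - mean_i)`, and the values sum to the difference between the prediction and the prediction at the feature means. The sketch below computes these by hand; the weights, baseline, and function names are made up for illustration, and it does not use the `shap` library.

```python
from typing import List, Sequence

def linear_predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Prediction of a linear model f(x) = bias + sum(w_i * x_i)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def linear_contributions(weights: Sequence[float], baseline: Sequence[float],
                         x: Sequence[float]) -> List[float]:
    """Per-feature contribution relative to the baseline (e.g. the training feature means)."""
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, baseline)]

weights = [2.0, -1.0, 0.5]        # hypothetical trained weights
bias = 1.0
baseline = [1.0, 2.0, 4.0]        # hypothetical training means
sample = [3.0, 2.0, 0.0]

contributions = linear_contributions(weights, baseline, sample)
explained = linear_predict(weights, bias, sample) - linear_predict(weights, bias, baseline)
print(contributions, explained)
```

An explanation endpoint could return `contributions` next to the prediction so the end-user sees which features pushed the output up or down.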

**14. Governance of deployments:** we not only need access restrictions on who can see the data, trained models, etc., but also on who can eventually use the deployed models. These deployed models can often be just as confidential as the data they were trained on.

**15. Data-centricity:** rather than focus on model performance & improvements, it makes sense that an MLOps framework also enables an increased focus on how the end-user can improve data quality and breadth.



---



## **Let's deploy a simple classification model using Flask**

### Import Libraries

In [None]:
import io
import os
import json
import time
import threading
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
from flask import Flask, jsonify, request

In [None]:
! wget https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/index_to_name.json

--2023-09-05 18:33:52--  https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/index_to_name.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35363 (35K) [text/plain]
Saving to: ‘index_to_name.json’


2023-09-05 18:33:53 (9.75 MB/s) - ‘index_to_name.json’ saved [35363/35363]



### Code to setup model for inference

In [None]:
import torch                                              # Needed for torch.no_grad() below

app = Flask(__name__)
# DenseNet-121 trained on 1000 classes from ImageNet. The `weights` argument needs
# torchvision >= 0.13; older releases use `pretrained=True` instead.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.eval()                                              # Inference mode: disables dropout, freezes batch-norm statistics

img_class_map = None
mapping_file_path = 'index_to_name.json'                  # Human-readable names for ImageNet classes
if os.path.isfile(mapping_file_path):
    with open(mapping_file_path) as f:
        img_class_map = json.load(f)

# Transform input into the form our model expects
def transform_image(infile):
    input_transforms = [transforms.Resize(255),           # We use multiple TorchVision transforms to ready the image
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406],       # Standard normalization for ImageNet model input
            [0.229, 0.224, 0.225])]
    my_transforms = transforms.Compose(input_transforms)
    image = Image.open(infile).convert('RGB')             # Open the image file; force 3 channels (handles grayscale/RGBA)
    timg = my_transforms(image)                           # Transform PIL image to appropriately-shaped PyTorch tensor
    timg.unsqueeze_(0)                                    # PyTorch models expect batched input; create a batch of 1
    return timg

# Get a prediction
def get_prediction(input_tensor):
    with torch.no_grad():                                 # Gradients are not needed for inference
        outputs = model(input_tensor)                     # Get scores for all ImageNet classes
    _, y_hat = outputs.max(1)                             # Extract the most likely class
    prediction = y_hat.item()                             # Extract the int value from the PyTorch tensor
    return prediction

# Make the prediction human-readable
def render_prediction(prediction_idx):
    stridx = str(prediction_idx)
    class_name = 'Unknown'
    if img_class_map is not None and stridx in img_class_map:
        class_name = img_class_map[stridx][1]
    return prediction_idx, class_name

# Retrain model
def retrain_network(file):
    # Placeholder: write code to retrain the network here

    return jsonify({'message': "This is a new image!! Network has now been trained on this sample..."})

Downloading: "https://download.pytorch.org/models/densenet121-a639ec97.pth" to /root/.cache/torch/hub/checkpoints/densenet121-a639ec97.pth
100%|██████████| 30.8M/30.8M [00:00<00:00, 177MB/s]


### Define flask routes where the model API will be served

In [None]:
@app.route('/', methods=['GET'])
def root():
    return jsonify({'msg' : 'Try POSTing to the /predict endpoint with an RGB image attachment'})

@app.route('/predict', methods=['POST'])
def predict():
    file = request.files.get('file')                      # None if the form has no 'file' field
    if file is None:
        return jsonify({'error': "Attach an image as form field 'file'"}), 400
    input_tensor = transform_image(file)
    prediction_idx = get_prediction(input_tensor)
    class_id, class_name = render_prediction(prediction_idx)
    if class_name == "Unknown":                           # No readable name for this class index
        return retrain_network(file)
    return jsonify({'class_id': class_id, 'class_name': class_name})

### Start Flask server in background

In [None]:
def network_call():
    app.run()                                             # Blocks, so the server runs in a background thread

threading.Thread(target=network_call).start()

### Test on a cat image

In [None]:
! wget https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/kitten.jpg

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


--2023-09-05 18:33:53--  https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/kitten.jpg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 110969 (108K) [image/jpeg]
Saving to: ‘kitten.jpg’


2023-09-05 18:33:54 (4.59 MB/s) - ‘kitten.jpg’ saved [110969/110969]



In [None]:
! curl -X POST -H "Content-Type: multipart/form-data" http://localhost:5000/predict -F "file=@kitten.jpg"

INFO:werkzeug:127.0.0.1 - - [05/Sep/2023 18:33:54] "POST /predict HTTP/1.1" 200 -


{"class_id":282,"class_name":"tiger_cat"}


### Retrain network when unknown data is uploaded

In [None]:
! curl -X POST -H "Content-Type: multipart/form-data" http://localhost:5000/predict -F "file=@unknown.jpg"

curl: (26) Failed to open/read local data from file/application


**CLASS ASSIGNMENT:**

**Above we used the DenseNet121 model pretrained on ImageNet (1000 classes). Now do the same for the model that we trained in an earlier tutorial, which predicts the digit present in an image (MNIST dataset). Additionally, log and plot the class IDs predicted so far by the model using a tool like [WandB](https://wandb.ai/).**

In [None]:
# Write code here