# PyTorch Serve

This tutorial assumes you already know the basic concepts of TorchServe. If you do not, see https://pytorch.org/serve/ first.

This tutorial walks you through running inference with the bloom-7b1 model. See https://huggingface.co/bigscience/bloom-7b1 to learn more about the model.
There are three main parts:
* Load large Huggingface models with constrained resources using accelerate
* Start model with torchserve
* Run model inference to test

## 0. Install dependencies

In [2]:
!pip install transformers
!pip install accelerate

# Pin pynvml to avoid pynvml.nvml.NVMLError_FunctionNotFound when starting the bloom-7b1 model
!pip install pynvml==8.0.4

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting accelerate
  Downloading accelerate-0.18.0-py3-none-any.whl (215 kB)
Installing collected packages: accelerate
  NOTE: The current PATH contains path(s) starting with `~`, which may not be expanded by all applications.
Successfully installed accelerate-0.18.0
Defaulting to user installation because normal site-packages is not writeable
Collecting pynvml==8.0.4
  Downloading pynvml-8.0.4-py3-none-any.whl (36 kB)
Installing collected packages: pynvml
Successfully installed pynvml-8.0.4


## 1. Prepare model and configurations
To start the model, we need a model archive (MAR) file and a TorchServe configuration file (config.properties). This step walks through preparing them.
### 1.1 Download bloom-7b1 model

In [None]:
!python Download_model.py --model_name bigscience/bloom-7b1

The script prints the path where the model was downloaded, for example:

model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/

The downloaded model is around 14GB.
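
If you lose the printed path, a small helper can find it again. The sketch below assumes the directory layout shown in the example path above (`<cache_dir>/models--*/snapshots/<revision>/`); the function name `find_snapshot_dirs` is hypothetical and not part of Download_model.py.

```python
from pathlib import Path


def find_snapshot_dirs(cache_dir):
    """Return snapshot directories under cache_dir, most recently modified first.

    Assumption: the layout is <cache_dir>/models--*/snapshots/<revision>/,
    as in the example path printed by the download script.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    candidates = [p for p in root.glob("models--*/snapshots/*") if p.is_dir()]
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)
```

For example, `find_snapshot_dirs("model")` would list the snapshot directories under the `model/` folder used in this tutorial.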

### 1.2 Compress downloaded model

Navigate to the path printed by the script above and zip its contents. In this example:

In [None]:
cd model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
zip -r ~/model.zip *
cd ~
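
If the `zip` command is not available, the same step can be approximated in Python. This is a minimal sketch using only the standard library; the function name `zip_directory_contents` is hypothetical. Like running `zip -r` from inside the directory, it stores entries relative to the source directory, so the model files sit at the root of the archive. Unlike the `*` glob, it also includes hidden files.

```python
import os
import zipfile


def zip_directory_contents(src_dir, zip_path):
    """Zip every file under src_dir, storing paths relative to src_dir.

    zip_path should live outside src_dir so the archive does not include itself.
    """
    src_dir = os.path.abspath(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder, _subdirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                # Symlinked files are read through the link, so their content is stored.
                zf.write(full_path, os.path.relpath(full_path, src_dir))
    return zip_path
```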

### 1.3 Generate MAR file

Navigate to the directory that contains custom_handler.py, model.zip, and setup_config.json.
* custom_handler.py: code for model initialization, pre-processing, post-processing, etc.
* model.zip: compressed package of the model files; the checkpoint files (*.bin) must be included.
* setup_config.json: settings used when loading a large Hugging Face model. See https://huggingface.co/docs/transformers/main_classes/model#large-model-loading
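
To make the role of model.zip concrete: a handler usually has to unpack it before loading the checkpoint. The sketch below shows one way this could look. It is an assumption for illustration, not the actual contents of custom_handler.py; `extract_model_archive` and the `model` target folder are hypothetical names.

```python
import os
import zipfile


def extract_model_archive(model_dir, archive_name="model.zip", target_name="model"):
    """Unpack <model_dir>/<archive_name> into <model_dir>/<target_name>.

    model_dir stands for the directory where TorchServe unpacks the MAR file.
    Returns the directory that now holds the checkpoint files.
    """
    archive_path = os.path.join(model_dir, archive_name)
    target_dir = os.path.join(model_dir, target_name)
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(target_dir)
    return target_dir
```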

**Notice**: Update the parameters in setup_config.json to match your device resources.
For example, with device_map="sequential", Hugging Face accelerate fills GPU memory in GPU order, and the following setting caps GPU 0 at 32GB:
"max_memory": {
        "0": "32GB"
    }
This example assumes a single 32GB GPU that holds the whole model.
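
One detail worth knowing when reading this file: JSON object keys are always strings, while accelerate's documentation shows GPU indices in `max_memory` as integers. The sketch below loads the file and converts numeric keys. It is a hedged illustration, not the handler's actual code; `load_setup_config` is a hypothetical name, and only `device_map` and `max_memory` are taken from this tutorial.

```python
import json


def load_setup_config(path):
    """Load setup_config.json and turn numeric max_memory keys ("0") into ints (0)."""
    with open(path) as f:
        config = json.load(f)
    max_memory = config.get("max_memory")
    if max_memory is not None:
        # Keep non-numeric keys such as "cpu" unchanged.
        config["max_memory"] = {
            (int(key) if key.isdigit() else key): value
            for key, value in max_memory.items()
        }
    return config
```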

In [3]:
!torch-model-archiver --model-name bloom --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json

Once the command finishes, you will see the bloom.mar file in the current directory.
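
A quick way to confirm the archive is complete is to list its contents. In the default archive format a MAR file is a zip archive, so the standard library can read it. The sketch below assumes the handler and the extra files are stored at the archive root; `missing_from_archive` and `EXPECTED_FILES` are hypothetical names.

```python
import zipfile

# Files passed to torch-model-archiver in this tutorial.
EXPECTED_FILES = ("custom_handler.py", "model.zip", "setup_config.json")


def missing_from_archive(mar_path, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in the MAR archive."""
    with zipfile.ZipFile(mar_path) as zf:
        names = set(zf.namelist())
    return [name for name in expected if name not in names]
```

An empty list from `missing_from_archive("bloom.mar")` would suggest that all three files were packaged.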

## 2. Start model with torchserve

Move the MAR file to the model store directory.

Then update config.properties. Pay particular attention to model_store, which must point to the directory that contains the MAR file.

In [8]:
# torchserve/model_store/*.mar
!mv bloom.mar torchserve/model_store/

In [7]:
%%writefile torchserve/config/config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_envvars_config=true
install_py_dep_per_model=true
number_of_gpu=1
load_models=all
max_response_size=655350000
default_response_timeout=6000
model_store=/home/model-server/bloom/torchserve/model_store

Overwriting torchserve/config/config.properties
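
Because a wrong model_store path is an easy mistake to make, it can help to check the file before starting the server. The sketch below parses simple `key=value` lines and lists the MAR files in the configured directory. Both function names are hypothetical, and the parser only handles the plain format used in this tutorial.

```python
import os


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def list_mar_files(props):
    """Return the sorted .mar file names found in the configured model_store."""
    store = props.get("model_store")
    if not store or not os.path.isdir(store):
        return []
    return sorted(name for name in os.listdir(store) if name.endswith(".mar"))
```

With the configuration above, `list_mar_files(parse_properties(open("torchserve/config/config.properties").read()))` should include `bloom.mar`.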


In [10]:
!ls -al torchserve/model_store/
!ls -al torchserve/config/

total 3041736
drwxr-xr-x 2 model-server model-server       4096 Apr 18 02:13 .
drwxr-xr-x 5 model-server model-server       4096 Apr 18 02:12 ..
-rw-r--r-- 1 model-server model-server 3114721892 Apr 18 02:06 bloom.mar
total 16
drwxr-xr-x 3 model-server model-server 4096 Apr 18 02:10 .
drwxr-xr-x 5 model-server model-server 4096 Apr 18 02:12 ..
drwxr-xr-x 2 model-server model-server 4096 Apr 18 02:10 .ipynb_checkpoints
-rw-r--r-- 1 model-server model-server  320 Apr 18 02:12 config.properties


**Starting the model takes about 2 to 5 minutes, depending on your device resources.**

To follow the logs in real time, run the command below in a terminal. If you start the model from this notebook instead, you may not see the logs.

In [17]:
!torchserve --start --ncs --ts-config ./torchserve/config/config.properties

TorchServe is already running, please use torchserve --stop to stop TorchServe.
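
Since startup can take a few minutes, one option is to poll TorchServe's health check (`GET /ping` on the inference address) before sending requests. The sketch below is an illustration under assumptions: the helper names and default timings are hypothetical, and a healthy ping does not by itself guarantee that the bloom workers are ready. The management API (`GET /models/bloom` on port 8081) reports worker status if you need a stricter check.

```python
import json
import time
import urllib.request


def fetch_status(url, timeout_s=5.0):
    """Return the "status" field of the ping response, or None if unavailable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.loads(resp.read().decode("utf-8")).get("status")
    except (OSError, ValueError, AttributeError):
        # Connection refused, timeout, or an unexpected response body.
        return None


def wait_until_healthy(url="http://localhost:8080/ping", timeout_s=600.0,
                       interval_s=5.0, fetch=fetch_status, sleep=time.sleep):
    """Poll the ping endpoint until it reports "Healthy" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch(url) == "Healthy":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `fetch` and `sleep` parameters exist only to make the loop easy to exercise without a running server.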


## 3. Run model inference to test

### Option 1: Request model inference by curl command

In [8]:
!cat sample_text.txt

Today the weather is really nice and I am planning on


In [4]:
!curl -v "http://localhost:8080/predictions/bloom" -T sample_text.txt

*   Trying ::1:8080...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> PUT /predictions/bloom HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> Content-Length: 54
> Expect: 100-continue
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 
< x-request-id: 5864229e-1bf1-428c-bcb4-b7ba5653779b
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 385
< connection: keep-alive
< 
Today the weather is really nice and I am planning on
traveling to the mountains on a holiday in the spring time and there is only one good thing :
Cavals (a bit like a backpacker backpack on the other hand) are one of the most commonly used transportation methods on the island.
* Connection #0 to host localhost left intact
My name is Jekyll, my mother is Hester and I am one of the

### Option 2: Request model inference with Python requests

In [11]:
with open('./sample_text1.txt') as f:
    text=f.read()
    
display(text)

'The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows\n'

In [12]:
import requests

# Inference endpoint of the bloom model
url = 'http://localhost:8080/predictions/bloom'

def make_request(url, text_content):
    # Send the prompt as the raw request body and return the generated text
    response = requests.post(url, data=text_content)
    return response.text

answer = make_request(url, text)
display(answer)

'The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows\nThe Kubeflow project is dedicated to making deployments of machine learning (ML) workflows. It includes three main activities: • Using Kubeflow to define the workflow • Using Kubeflow to write the workflow • Implementing the Kubeflow workflow in a Java application with a Python package This project builds upon the Kubeflow project to support the creation of a Java-based workflow engine using'
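
In the output above, the response begins with the prompt itself. If you only want the newly generated text, a small post-processing step can remove that prefix. This is a minimal sketch based on the output shown here; `strip_prompt` is a hypothetical helper, and other handlers may not echo the prompt.

```python
def strip_prompt(prompt, generated):
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated.strip()
```

For example, `strip_prompt(text, answer)` would return only the continuation produced by the model.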