# Deployment

The inference server used below comes from the `transformers-bloom-inference` repository:

https://github.com/huggingface/transformers-bloom-inference/tree/main/inference_server

In [None]:
!pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.2

In [None]:
!pip install protobuf==3.20.0

Add the following target to the `Makefile` (recipe lines must be indented with a tab):

```
bloom-176b-int8:
	TOKENIZERS_PARALLELISM=false \
	MODEL_NAME=microsoft/bloom-deepspeed-inference-int8 \
	DEPLOYMENT_FRAMEWORK=ds_inference \
	DTYPE=int8 \
	MAX_INPUT_LENGTH=2048 \
	MAX_BATCH_SIZE=4 \
	CUDA_VISIBLE_DEVICES=0,1,2,3 \
	gunicorn -t 0 -w 1 -b 127.0.0.1:5000 inference_server.server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s'
```

In [None]:
!make bloom-176b-int8
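
Before sending requests it can help to wait until the server started by the target above is reachable on `127.0.0.1:5000`. The following is a minimal sketch using only the standard library; `port_open` and `wait_until_ready` are hypothetical helpers, not part of the inference server, and the wait times are arbitrary assumptions. An open port may only show that gunicorn is listening, not that the model has finished loading.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait: float = 600.0,  # assumed upper bound in seconds, adjust as needed
    interval: float = 5.0,    # assumed pause between attempts in seconds
) -> bool:
    """Call probe() repeatedly until it returns True or max_wait has elapsed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

From a second notebook or shell this could be called as `wait_until_ready(lambda: port_open("127.0.0.1", 5000))`.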

# Inference

In [9]:
!python inference_server/examples/server_request.py --host=127.0.0.1 --port=5000

{'method': 'generate', 'num_generated_tokens': [40, 40, 40, 40], 'query_id': 2, 'text': ['DeepSpeed</em></p>\n *\n * <p>The following example shows how to use the {@link #get(int)} method to get the\n * deepest node in the tree:</p>\n * <', 'DeepSpeed is a measure of the amount of time it takes for a sound to travel through a medium. It is measured in units of distance per second.</em></p>\n<p>The speed of sound in air is about', 'DeepSpeed is a machine learning framework for training deep neural networks using stochastic gradient descent. It is designed to be fast and easy to use. It is a library for deep learning in Python. It is designed to be modular and', 'DeepSpeed is a machine learning framework for the deep learning community. It is a Python package that provides a set of tools for deep learning. It is a framework for building and training deep neural networks. It is a framework for building and'], 'total_time_taken': '8.10 secs'} 

{'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1

In [4]:
!python inference_server/examples/server_request.py --host=127.0.0.1 --port=5000

{'method': 'generate', 'num_generated_tokens': [40, 40, 40, 40], 'query_id': 0, 'text': ['DeepSpeed</em></p>\n *\n * <p>The following example shows how to use the {@link #get(int)} method to get the\n * deepest node in the tree:</p>\n * <', 'DeepSpeed is a measure of the amount of time it takes for a sound to travel through a medium. It is measured in units of distance per second.</em></p>\n<p>The speed of sound in air is about', 'DeepSpeed is a machine learning framework for training deep neural networks using stochastic gradient descent. It is designed to be fast and easy to use. It is a library for deep learning in Python. It is designed to be modular and', 'DeepSpeed is a machine learning framework for the deep learning community. It is a Python package that provides a set of tools for deep learning. It is a framework for building and training deep neural networks. It is a framework for building and'], 'total_time_taken': '12.10 secs'} 

{'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 

In [1]:
import requests

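def generate(url: str) -> None:
    """POST four prompts of increasing length to /generate/ and print the JSON reply."""
    url = url + "/generate/"

    # All prompts go out in a single request, so they are generated as one batch.
    request_body = {
        "text": [
            "DeepSpeed",
            "DeepSpeed is a",
            "DeepSpeed is a machine",
            "DeepSpeed is a machine learning framework",
        ],
        "max_new_tokens": 40,
    }
    # verify=False only affects HTTPS certificate checks; it is a no-op for plain HTTP.
    response = requests.post(url=url, json=request_body, verify=False)
    print(response.json(), "\n")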


In [2]:
generate("http://127.0.0.1:5000")

{'method': 'generate', 'num_generated_tokens': [40, 40, 40, 40], 'query_id': 0, 'text': ['DeepSpeed, a leading provider of high-performance computing (HPC) solutions, today announced that it has been selected by the U.S. Department of Energy (DOE) to provide a high-per', 'DeepSpeed is a new, fast, and accurate method for the detection of protein-protein interactions. It is based on the use of a single, high-affinity, small molecule that binds to the protein of interest', 'DeepSpeed is a machine learning framework that is designed to be used by researchers and developers who are interested in applying deep learning to their own problems. It is a Python library that provides a set of tools for training and evaluating deep', 'DeepSpeed is a machine learning framework that is designed to be used by researchers and developers who are interested in applying deep learning to their own problems. It is a Python library that provides a set of tools for training and evaluating deep neural network

Fields that can be set in the `/generate/` request body:

    text: List[str] = None
    min_length: int = None
    do_sample: bool = None
    early_stopping: bool = None
    temperature: float = None
    top_k: int = None
    top_p: float = None
    typical_p: float = None
    repetition_penalty: float = None
    bos_token_id: int = None
    pad_token_id: int = None
    eos_token_id: int = None
    length_penalty: float = None
    no_repeat_ngram_size: int = None
    encoder_no_repeat_ngram_size: int = None
    num_return_sequences: int = None
    max_time: float = None
    max_new_tokens: int = None
    decoder_start_token_id: int = None
    diversity_penalty: float = None
    forced_bos_token_id: int = None
    forced_eos_token_id: int = None
    exponential_decay_length_penalty: float = None
    remove_input_from_output: bool = False
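
As a minimal sketch of how these fields might be used from a client, the helper below builds a JSON body for `/generate/`, keeps only the fields that were explicitly set, and rejects names that are not in the list above. The helper `build_generate_body` and the set `ALLOWED_FIELDS` are hypothetical and not part of the inference server; how the server itself treats unknown fields is not shown in this notebook.

```python
from typing import Any, Dict, List

# Field names copied from the list above (client-side convenience, hypothetical).
ALLOWED_FIELDS = {
    "min_length", "do_sample", "early_stopping", "temperature", "top_k",
    "top_p", "typical_p", "repetition_penalty", "bos_token_id",
    "pad_token_id", "eos_token_id", "length_penalty", "no_repeat_ngram_size",
    "encoder_no_repeat_ngram_size", "num_return_sequences", "max_time",
    "max_new_tokens", "decoder_start_token_id", "diversity_penalty",
    "forced_bos_token_id", "forced_eos_token_id",
    "exponential_decay_length_penalty", "remove_input_from_output",
}


def build_generate_body(text: List[str], **params: Any) -> Dict[str, Any]:
    """Build a request body, dropping unset (None) fields and rejecting typos."""
    unknown = set(params) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")
    body: Dict[str, Any] = {"text": list(text)}
    body.update({name: value for name, value in params.items() if value is not None})
    return body


body = build_generate_body(
    ["DeepSpeed is a machine learning framework"],
    max_new_tokens=64,
    top_p=1.0,
    temperature=1.0,
)
print(body)
```

The resulting dictionary can be passed as `json=body` to `requests.post`.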

In [3]:
import requests

url = "http://127.0.0.1:5000/generate/"

request_body = {
    "text": [
        "DeepSpeed is a machine learning framework",
    ],
    "max_new_tokens": 64,
    "top_p": 1.0,
    "temperature": 1.0
}
response = requests.post(url=url, json=request_body, verify=False)
print(response.json(), "\n")


{'method': 'generate', 'num_generated_tokens': [64], 'query_id': 0, 'text': ['DeepSpeed is a machine learning framework that is used to train a model to predict the best time to a certain point in a sequence. In a real-world scenario, the time to a certain point in a sequence is a function of the time in the previous time period, the time in the current time period, the time in the next time period, the'], 'total_time_taken': '1.29 secs'} 



In [4]:
response.json()['text'][0]

'DeepSpeed is a machine learning framework that is used to train a model to predict the best time to a certain point in a sequence. In a real-world scenario, the time to a certain point in a sequence is a function of the time in the previous time period, the time in the current time period, the time in the next time period, the'
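
The returned text starts with the prompt. The `remove_input_from_output` field listed earlier is presumably meant to control this, but its server-side behaviour is not shown in this notebook, so here is a client-side sketch that strips the prompt prefix instead. `strip_prompt` is a hypothetical helper, and the `generated` string is a shortened copy of the output above.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Return only the continuation when `generated` starts with `prompt`."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # Leave the text untouched if the prompt is not a literal prefix.
    return generated


prompt = "DeepSpeed is a machine learning framework"
generated = "DeepSpeed is a machine learning framework that is used to train a model"
print(repr(strip_prompt(prompt, generated)))  # ' that is used to train a model'
```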

In [5]:
import requests

url = "http://127.0.0.1:5000/logprob/"

request_body = {
    "text": [
        "DeepSpeed is a machine learning framework",
    ],    
}
response = requests.post(url=url, json=request_body, verify=False)
print(response, "\n")


<Response [200]> 



In [4]:
response.json()

{'logprobs': [[368.0, 368.0, 434.0, 430.0, 432.0, 432.0]],
 'mean_log_prob': 410.6666564941406}

In [5]:
response.json()['mean_log_prob']

410.6666564941406

In [2]:
import requests

url = "http://127.0.0.1:5000/logprob/"

request_body = {
    "text": [
        "DeepSpeed is a machine learning framework: correct",
    ],    
}
response = requests.post(url=url, json=request_body, verify=False)
print(response.content, "\n")

b'{"logprobs":[[368.0,368.0,434.0,430.0,432.0,432.0,434.0,426.0]],"mean_log_prob":415.5}\n' 



In [1]:
import requests

url = "http://127.0.0.1:5000/logprob/"

request_body = {
    "text": [
        "DeepSpeed is a machine learning framework: incorrect",
    ],    
}
response = requests.post(url=url, json=request_body, verify=False)
print(response.json(), "\n")

{'logprobs': [[368.0, 368.0, 434.0, 430.0, 432.0, 432.0, 434.0, 422.0]], 'mean_log_prob': 415.0} 
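
The last two requests score the same prompt with two different suffixes, which suggests using `/logprob/` to rank candidate completions. Below is a minimal sketch of that comparison. It uses the two responses printed above as literal data so that it runs without the server; `pick_best` and `mean` are hypothetical helpers. Log-probabilities of tokens are normally non-positive, so the large positive values returned here are worth checking before relying on such a ranking, and the sketch flags them.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pick_best(scores: Dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


# Responses copied from the two /logprob/ outputs above.
responses = {
    "correct": {
        "logprobs": [[368.0, 368.0, 434.0, 430.0, 432.0, 432.0, 434.0, 426.0]],
        "mean_log_prob": 415.5,
    },
    "incorrect": {
        "logprobs": [[368.0, 368.0, 434.0, 430.0, 432.0, 432.0, 434.0, 422.0]],
        "mean_log_prob": 415.0,
    },
}

scores = {label: r["mean_log_prob"] for label, r in responses.items()}
best = pick_best(scores)

# True log-probabilities cannot be positive; flag the responses if any value is.
suspicious = any(v > 0 for r in responses.values() for v in r["logprobs"][0])

print("best:", best)
print("recomputed means:", {k: mean(r["logprobs"][0]) for k, r in responses.items()})
print("values look suspicious:", suspicious)
```

The two sequences differ only in the final token score (426.0 versus 422.0), which accounts for the whole difference between the means.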

