<div align="center" dir="auto">
<p dir="auto">

<a href="https://colab.research.google.com/github/write-with-neurl/modelbit-notebooks/blob/main/deploy-tinyvicuna/Deploy_Tiny_Vicuna_1B_With_ModelBit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

</p>

# ⚡ Deploying Tiny Vicuna 1B LLM to a REST API Endpoint for Text Generation

In this example, we'll load a Tiny Vicuna 1B model from Hugging Face and deploy it with Modelbit as a REST endpoint for text generation inference.

## 🧑‍💻 Installations and Set Up

In [None]:
!pip install transformers sentencepiece torch protobuf huggingface_hub modelbit


In [None]:
import modelbit
import requests
from huggingface_hub import snapshot_download

## Loading and Caching of Tiny Vicuna 1B and Tokenizer

We'll be setting up a function to load and cache a language model and its tokenizer for efficient usage. The first part involves importing necessary modules: `AutoModelForCausalLM` and `AutoTokenizer` from the `transformers` library, and `cache` from `functools`.

The `get_vicuna_model` function, decorated with `@cache`, does the loading. It uses `snapshot_download` to fetch the model files (here, [Tiny-Vicuna-1B](https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B/tree/main)) and then initializes the tokenizer and the model with `AutoTokenizer.from_pretrained` and `AutoModelForCausalLM.from_pretrained`, respectively.

Because of `@cache`, the model and tokenizer are loaded once per process and kept in memory. Later calls return the same objects instead of reloading them from scratch, which matters in a deployment where the function runs on every request.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from functools import cache

@cache
def get_vicuna_model():
    model_path = snapshot_download(repo_id="Jiayi-Pan/Tiny-Vicuna-1B")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    return model, tokenizer

## Inference Function for Generating Responses

In this section, we write the inference function, `tiny_vicuna_inference`. Given a text prompt, the function first retrieves the Vicuna model and tokenizer by calling `get_vicuna_model()`. Thanks to the caching set up earlier, only the first call pays the loading cost.

Next, the function encodes the prompt with the tokenizer and asks for PyTorch tensors (`return_tensors="pt"`). The model then generates a response from these inputs. Note that `max_length=512` caps the total sequence length, prompt tokens plus generated tokens, at 512. Finally, the output is decoded back into text with special tokens skipped; the decoded string includes the original prompt followed by the generated continuation.

In [None]:
def tiny_vicuna_inference(prompt: str) -> str:
    model, tokenizer = get_vicuna_model()

    # Encode the prompt and generate a response
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response
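
If you would rather cap only the newly generated tokens and return just the continuation, one possible variant is sketched below. It relies on the `max_new_tokens` argument of `generate` and slices the prompt tokens off the output before decoding. The function name `generate_continuation` and the default of 128 new tokens are assumptions for illustration, not part of the original notebook; the model and tokenizer are passed in explicitly so the function is easy to try with any causal LM.

```python
def generate_continuation(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> str:
    """Generate text and return only the part that follows the prompt."""
    # Encode the prompt as PyTorch tensors, as in tiny_vicuna_inference
    inputs = tokenizer(prompt, return_tensors="pt")

    # max_new_tokens limits generated tokens only, regardless of prompt length
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # For decoder-only models the output starts with the prompt tokens,
    # so drop them before decoding
    prompt_length = len(inputs["input_ids"][0])
    new_tokens = outputs[0][prompt_length:]

    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With the loader from earlier, this would be called as `generate_continuation(*get_vicuna_model(), "My name is Clara and I am")`.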

## 🚀 Deploying Model to a REST API

### Checking and Displaying Versions of Key Python Libraries

Deploying with [Modelbit](https://doc.modelbit.com/getting-started/) requires specifying library versions, so before deploying we check which versions are installed in the notebook. The cell below uses `pkg_resources` to print the versions of `transformers`, `sentencepiece`, and `torch`, the libraries needed to run inference.

In [None]:
import pkg_resources

# Check versions of transformers, sentencepiece and torch
transformers_version = pkg_resources.get_distribution("transformers").version
sentencepiece_version = pkg_resources.get_distribution("sentencepiece").version
torch_version = pkg_resources.get_distribution("torch").version

print(f"🔍 Transformers version: {transformers_version}")
print(f"🔍 Sentencepiece version: {sentencepiece_version}")
print(f"🔍 Torch version: {torch_version}")

🔍 Transformers version: 4.35.2
🔍 Sentencepiece version: 0.1.99
🔍 Torch version: 2.1.0+cu121
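
As an alternative to `pkg_resources`, the standard library's `importlib.metadata` can report the same versions. The sketch below turns them into `name==version` strings of the kind passed to `python_packages` later on. The helper names `strip_local_label` and `pinned_requirements` are hypothetical, and dropping the local label (such as `+cu121`) is an assumption: it presumes the deployment environment installs packages from PyPI, which does not host local-version builds.

```python
from importlib.metadata import PackageNotFoundError, version


def strip_local_label(version_string: str) -> str:
    # "2.1.0+cu121" -> "2.1.0"; versions without a "+" are returned unchanged
    return version_string.split("+", 1)[0]


def pinned_requirements(package_names):
    """Build 'name==version' pins for the packages installed in this environment."""
    pins = []
    for name in package_names:
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Skip packages that are not installed instead of failing
            continue
        pins.append(f"{name}=={strip_local_label(installed)}")
    return pins


print(pinned_requirements(["transformers", "sentencepiece", "torch"]))
```

Printing the pins before deploying makes it easy to spot a mismatch between the notebook environment and the versions you request for the deployment.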


### 🔐 Log into `modelbit`

In [None]:
# Log into Modelbit
mb = modelbit.login(branch="main")

We are now ready to deploy our model to a REST API endpoint on Modelbit. We deploy the `tiny_vicuna_inference` function, which wraps both model loading and inference: it takes a text prompt as input and returns the generated text. The `python_packages` argument pins the library versions for the deployed environment, and `require_gpu=True` requests a GPU.

In [None]:
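# Deploy the inference function to Modelbit
# The pins below set the package versions for the deployed environment;
# adjust them to match the versions installed in your notebook
mb.deploy(
    tiny_vicuna_inference,
    python_packages=["transformers==4.36.2", "sentencepiece==0.1.99", "torch==2.1.2"],
    require_gpu=True,
)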

## 📩 Test the REST Endpoint with a Prompt

In this section, we test the deployed `tiny_vicuna_inference` model using a Python function. The function `test_vicuna_inference` makes a POST request to the Modelbit endpoint, sending a text prompt and receiving the generated text in return.

In [None]:
import requests
import json

def test_vicuna_inference(prompt: str):
    # Construct the URL for the Modelbit endpoint
    url = "https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/tiny_vicuna_inference/latest"
    # Set the headers to indicate JSON content type
    headers = {"Content-Type": "application/json"}
    # Format the payload as JSON, with the prompt under the 'data' key
    data = json.dumps({"data": prompt})
    # Make the POST request and return the parsed JSON response
    response = requests.post(url, headers=headers, data=data)
    return response.json()

In [None]:
# Example usage
test_prompt = "My name is Clara and I am"
print(test_vicuna_inference(test_prompt))

{'data': 'My name is Clara and I am a student at the University of California, Berkeley. I am currently pursuing a degree in Computer Science. I am interested in learning about the field of computer science and how it can be applied to real-world problems. In my free time, I enjoy playing video games, reading, and exploring new places.'}


You can also test your endpoint from the command line using:


> `curl -s -XPOST "https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/tiny_vicuna_inference/latest" -d '{"data": "Once upon a time,"}' | json_pp`

---
> ⚠️ Replace the `ENTER_WORKSPACE_NAME` placeholder with your workspace name.
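
To avoid editing the URL by hand in several places, the endpoint can be assembled from its parts. The sketch below simply mirrors the URL pattern used in the examples above; the helper name `build_endpoint_url` and its parameters are hypothetical, and the `us-east-1` default is taken from the example URL, so check the address shown in your own Modelbit workspace.

```python
def build_endpoint_url(
    workspace: str,
    deployment: str = "tiny_vicuna_inference",
    version: str = "latest",
    region: str = "us-east-1",
) -> str:
    """Assemble the endpoint URL following the pattern used in this notebook."""
    # Catch the common mistake of leaving the placeholder in place
    if not workspace or workspace == "ENTER_WORKSPACE_NAME":
        raise ValueError("Set `workspace` to your Modelbit workspace name.")
    return f"https://{workspace}.{region}.modelbit.com/v1/{deployment}/{version}"


# "my-workspace" is a made-up workspace name for illustration
print(build_endpoint_url("my-workspace"))
# https://my-workspace.us-east-1.modelbit.com/v1/tiny_vicuna_inference/latest
```

The returned string can be used as the `url` in `test_vicuna_inference` or in the `curl` command.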

## 🚀 Model Hub by Modelbit

Interested in more notebooks like this to deploy LLMs? Check out Model Hub by Modelbit ⚡️: https://www.modelbit.com/model-hub