<div align="center" dir="auto">
<p dir="auto">

<a href="https://colab.research.google.com/github/write-with-neurl/modelbit-notebooks/blob/main/deploy-falcon7b/Deploy_Falcon_7B_With_ModelBit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

</p>

# ⚡ Deploying Falcon 7B LLM to a REST API Endpoint with Modelbit for Text Generation

In this example, we'll use Hugging Face Transformers to load the Falcon 7B model and Modelbit to deploy it as a REST endpoint for text generation inference.

## 🧑‍💻 Installations and Set Up

In [None]:
!pip3 install modelbit protobuf==3.20 accelerate==0.25.0 bitsandbytes==0.41.3.post2 transformers==4.36.2 scipy==1.11.4 torch==2.1.0 cloudpickle==3.0.0



In [None]:
import torch
import cloudpickle
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

## Loading and Quantization of Falcon 7B and Tokenizer
In this section, we load the language model and its tokenizer, quantizing the model to reduce its memory footprint. Later, we'll wrap the result in a cached loader function for inference. The imports above bring in the classes we need: `AutoModelForCausalLM`, `AutoTokenizer`, `BitsAndBytesConfig` and `pipeline` from the `transformers` library.

### Setting Up the Device and Model
We set the target device to CUDA so that inference runs on a GPU, and select the `tiiuae/falcon-7b` checkpoint from the Hugging Face Hub.

In [None]:
device = "cuda"
llm_model = "tiiuae/falcon-7b"
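
Before loading a model of this size, it can help to confirm that a GPU is actually visible. The following is a minimal sketch, not part of the original notebook: `resolve_device` is a hypothetical helper name, and the only library call it relies on is `torch.cuda.is_available()`.

```python
import importlib.util


def resolve_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise raise a clear error."""
    # find_spec avoids an ImportError when torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        raise RuntimeError("PyTorch is not installed, so no GPU can be used.")
    import torch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA GPU detected; this notebook assumes one.")
    return "cuda"
```

Calling `device = resolve_device()` in place of the hard-coded string fails early with a readable message instead of partway through the model download.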

### Quantization Configuration

`BitsAndBytesConfig` sets up the quantization configuration for the model. Here we use 4-bit quantization with the `nf4` quantization type and double quantization enabled, which reduces the model's memory footprint. This is particularly useful for running large models on hardware with limited memory.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
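
To get a feel for what 4-bit quantization saves, here is a rough back-of-the-envelope sketch. It assumes a round figure of 7 billion parameters and counts only the weights; quantization metadata, activations, and the CUDA context add overhead that this estimate ignores. `weight_memory_gib` and `ASSUMED_PARAMS` are illustrative names, not part of any library.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 7e9  # assumption: round figure for a "7B" model

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Under these assumptions, the 4-bit weights take roughly a quarter of the space of the bfloat16 weights.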

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    llm_model,
    # 4-bit loading is already requested by bnb_config, so no separate load_in_4bit argument is needed
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(llm_model)


### Pipeline Setup
The `pipeline` function creates a text-generation pipeline from the following arguments:

* `model=model` and `tokenizer=tokenizer` pass the loaded model and tokenizer to the pipeline.
* `torch_dtype=torch.bfloat16` ensures the pipeline uses bfloat16 precision.
* `trust_remote_code=True` allows the execution of remote custom code, which can be necessary for some custom models.
* `device_map="auto"` allows the pipeline to automatically determine the best way to distribute the model across the available hardware.



In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

### Serializing the Pipeline with cloudpickle

In this section, we serialize the `pipe` object, which wraps the 4-bit quantized model, and write it to a pickle file. We use `cloudpickle`, which extends Python's standard `pickle` module and can serialize a wider range of Python objects.

In [None]:
with open('falcon_pipe_int4.pkl', 'wb') as file:
    cloudpickle.dump(pipe, file)

## Inference Function for Generating Responses

In [None]:
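from functools import cache
import pickle

@cache
def get_llm():
    """Load the pickled pipeline once; @cache reuses it on later calls."""
    # Files written by cloudpickle can be read back with the standard pickle module.
    with open('falcon_pipe_int4.pkl', 'rb') as file:
        content = pickle.load(file)
    return content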
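
The effect of `@cache` on a zero-argument loader can be seen with a small stand-in object. This is a minimal sketch using only the standard library: the temporary file, `get_demo_object`, and `load_calls` are hypothetical names for illustration, and a small dict takes the place of the pickled pipeline.

```python
import pickle
import tempfile
from functools import cache
from pathlib import Path

# Stand-in for the pickled pipeline: a small dict written to a temporary file.
demo_path = Path(tempfile.mkdtemp()) / "demo_object.pkl"
demo_path.write_bytes(pickle.dumps({"name": "stand-in"}))

load_calls = []  # records every time the file is actually read


@cache
def get_demo_object():
    load_calls.append(1)
    with open(demo_path, "rb") as file:
        return pickle.load(file)


first = get_demo_object()
second = get_demo_object()

print(len(load_calls))   # number of times the file was read
print(first is second)   # whether both calls returned the same object
```

The file is read on the first call only; later calls return the same in-memory object, which is what keeps repeated requests from reloading a multi-gigabyte pickle.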

In [None]:
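def run_falcon_prompt(prompt):
    """Generate a continuation for `prompt` and return it under the 'output' key."""
    falcon_pipe = get_llm()
    sequences = falcon_pipe(
        prompt,
        # Greedy decoding: temperature and top_k only take effect when do_sample=True.
        do_sample=False,
        batch_size=8,
        max_new_tokens=50,
        temperature=0.7,
        top_k=10,
        num_return_sequences=1,
    )
    return {'output': sequences[0]['generated_text']}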
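
The generation arguments above mix a greedy setting (`do_sample=False`) with sampling settings (`temperature`, `top_k`). The toy sketch below illustrates the difference between the two strategies for a single decoding step. It is a simplified illustration in plain Python, not the `transformers` implementation, and the logits are made-up numbers.

```python
import math
import random


def greedy_pick(logits):
    """Greedy decoding step: always choose the highest-scoring token index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, temperature, top_k, rng):
    """Sampling step: keep the top_k tokens, rescale by temperature, then draw one."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [1.0, 3.0, 2.5, 0.2]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)

print(greedy_pick(toy_logits))
print([sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=rng) for _ in range(5)])
```

Greedy decoding never consults `temperature` or `top_k`, which is why the same prompt yields the same text on every call when `do_sample=False`. Sampling draws from the two highest-scoring tokens here, so repeated calls can differ.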

In [None]:
run_falcon_prompt("My name is Clara and I am")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


{'output': 'My name is Clara and I am a 20 year old student from the Netherlands. I am currently studying International Business and Management at the University of Groningen. I am a very outgoing person and I love to meet new people. I am very interested in travelling and I have been to'}

## 🚀 Deploying Model to a REST API

### 🔐 Log into `modelbit`

In [None]:
import modelbit as mb

mb.login()

We are now ready to deploy our model to a REST API endpoint on Modelbit.
For this deployment, we'll use the `run_falcon_prompt` function, which loads the cached pipeline and performs inference. It takes a text prompt as input and returns the generated text.

In [None]:
mb.deploy(run_falcon_prompt, python_packages=["transformers==4.36.2", "torch==2.1.0+cu121", "accelerate==0.25.0", "cloudpickle==3.0.0"],
           extra_files=["falcon_pipe_int4.pkl"], require_gpu="A10G")

## 📩 Test the REST Endpoint with a Prompt

In this section, we test the deployed `run_falcon_prompt` model using a Python function. The function `test_falcon_inference` makes a POST request to the Modelbit endpoint, sending a text prompt and receiving the generated text in return.

In [None]:
import requests
import json

def test_falcon_inference(prompt: str):
    # Construct the URL for the Modelbit endpoint (replace ENTER_WORKSPACE_NAME with your workspace name)
    url = "https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/run_falcon_prompt/latest"
    # Set the headers to indicate JSON content type
    headers = {"Content-Type": "application/json"}
    # Format the data payload as JSON, with the prompt under the 'data' key
    data = json.dumps({"data": prompt})
    # Make the POST request and return the JSON response
    response = requests.post(url, headers=headers, data=data)
    return response.json()

In [None]:
# Example usage
test_prompt = "Once upon a time"
print(test_falcon_inference(test_prompt))
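
If you would rather not depend on `requests`, the same call can be sketched with the standard library. This is one possible variant under stated assumptions: `build_inference_request` and `call_endpoint` are hypothetical helper names, the URL pattern is copied from the example above, and the 60-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_inference_request(workspace: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the deployed function."""
    url = f"https://{workspace}.us-east-1.modelbit.com/v1/run_falcon_prompt/latest"
    body = json.dumps({"data": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_endpoint(workspace: str, prompt: str, timeout_s: float = 60.0) -> dict:
    """Send the request; urlopen raises urllib.error.HTTPError on error statuses."""
    request = build_inference_request(workspace, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without contacting the server.
request = build_inference_request("ENTER_WORKSPACE_NAME", "Once upon a time")
print(request.full_url)
print(request.data)
```

Separating request construction from sending makes the payload easy to inspect, and the explicit timeout keeps a slow first response from blocking indefinitely.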

You can also test your endpoint from the command line using:


> `curl -s -XPOST "https://ENTER_WORKSPACE_NAME.us-east-1.modelbit.com/v1/run_falcon_prompt/latest" -d '{"data": "Once upon a time,"}' | json_pp`

---
> ⚠️ Replace the `ENTER_WORKSPACE_NAME` placeholder with your workspace name.

## 🚀 Model Hub by Modelbit

Interested in more notebooks like this to deploy LLMs? Check out Model Hub by Modelbit ⚡️: https://www.modelbit.com/model-hub