<div align="center" dir="auto">
<p dir="auto">

<a href="https://colab.research.google.com/github/write-with-neurl/modelbit-articles/blob/main/modelbit-08/Deploy_Llama_2_7b_Text_Summarization_LangChain_Modelbit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

</p>

# ⚡Deploy Llama 2-7B 🦙 as a REST Endpoint with Langchain 🦜🔗 and Modelbit 🟪

## 🧑‍💻 Installations and Set Up

This walkthrough will guide you through deploying a Llama 2-7B 🦙 directly from this notebook using Modelbit.





### 📥 Install Pre-requisite Libraries

To set up your environment, install the following packages:
- LangChain - `langchain==0.0.335`
- Modelbit - `modelbit==0.30.13`
- Huggingface_hub - `huggingface-hub==0.19.1`
- LLaMA_cpp_python - `llama_cpp_python==0.2.17`

As of this writing, those are the specific versions the following `pip` command installs:


In [1]:
!pip install --upgrade langchain huggingface_hub modelbit

Collecting langchain
  Downloading langchain-0.1.8-py3-none-any.whl (816 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m816.1/816.1 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
Collecting modelbit
  Downloading modelbit-0.34.0-py3-none-any.whl (119 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.3/119.3 kB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.21 (from langchain)
  Downloading langchain_community-0.0.21-py3-none-any.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m19.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting langchain-core<0.2,>=0.1.24 (from langchain)
  Downloading langchain_core-0.1.24-py3-none-any.whl (241 kB)
[2K

### 🏗️ Build  Llama 2-7b 🦙 with GPU Support

You will load the embedding model directly onto your GPU device. Set an environment variable `CMAKE_ARGS` with the value `-DLLAMA_CUBLAS=on` to indicate that the `llama_cpp_python` package should be built with [cuBLAS support](https://developer.nvidia.com/cublas). cuBLAS is a GPU-accelerated library provided by NVIDIA as part of their CUDA toolkit, which offers optimized implementations for standard basic linear algebra subprograms.

> The `llama_cpp_python` package provides Python bindings for the llama.cpp library that bridges the gap between the C++ codebase of `llama.cpp` and Python to access and use the functionalities of llama.cpp directly from Python scripts.

Use the `FORCE_CMAKE=1`` environment variable to force the use of cmake and install the pip package for the Metal support ([source](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-openblas--cublas--clblast)).

In [2]:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" # Build llama_cpp_python on NVIDIA cuBLAS for GPU support
FORCE_CMAKE=1 # Force installation to use cuBLAS

!pip install llama_cpp_python

Collecting llama_cpp_python
  Downloading llama_cpp_python-0.2.44.tar.gz (36.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m36.6/36.6 MB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: llama_cpp_python
  Building wheel for llama_cpp_python (pyproject.toml) ... [?25l[?25hdone
  Created wheel for llama_cpp_python: filename=llama_cpp_python-0.2.44-cp310-cp310-manylinux_2_35_x86_64.whl size=2591542 sha256=65e2986358a90649367a87cd3f5c91499bd88b2801cc247a1b3ee41c0fc13780
  Stored in directory: /root/.cache/pip/wheels/6e/f0/52/1716aa7fefc7eb2a9b76775b0a61fc131b7dcc961e310a048a
Successfully built llama_cpp_python
Installing collected packages: llama_cpp_python
Successfully installed llama_cpp_python-0.2.44

### 📥🦙 Download the Llama-2-7B-GGUF model using the Hugging Face CLI command

Use the HF CLI to download the Llama-2-7B-GGUF model file from a repository in the Hugging Face model Hub and save it to this notebook environment.

In [3]:
!huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir . --local-dir-use-symlinks False

Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf to /root/.cache/huggingface/hub/tmp076udjtf
llama-2-7b.Q4_0.gguf: 100% 3.83G/3.83G [00:39<00:00, 96.8MB/s]
./llama-2-7b.Q4_0.gguf


## 🦜🔗 Set Up the Prompt Template with LangChain



After downloading the model file, you need to set up a [prompt template](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/). The prompt template defines the input variables and the response format for the LlamaCpp model.

In this case, the input variables are `text` and `num_of_words`, which represent the Twitter thread and the desired length of the summary, respectively.

The response format is a `str` that includes the original Twitter thread and the generated summary.

In [9]:
from langchain.prompts import PromptTemplate


prompt = PromptTemplate(
    input_variables=["text", "num_of_words"],
    template="""\n\n### Instruction:\nGiven a ubuntu command: '{text}', You are tasked with providing a command for ubuntu terminal in {num_of_words} words and ensure the command doesn't lose the main context of ubuntu. \n\n### Response:\n""",
)

## 🦙⛓️ initialize Llama Model Weights and LLMChain




After crafting your prompt, initialize the LlamaCpp model with the model file path and context size(`n_ctx`).

Use the prompt template to create an LLMChain class with the initialized LlamaCpp model and prompt template.

Finally, the `load_llm` function is defined to load the LlamaCpp model embeddings/weights and cache it for future use. Then we deifne a function `summarize_thread` that takes a text string and a number of words as input, and generates a summary of the text.



In [5]:
# Importing necessary classes and modules
from langchain.chains import LLMChain  # Used for creating language model chains
from langchain.callbacks.stdout import StdOutCallbackHandler  # Handles output to standard output
import os  # Used for operating system dependent functionality
from langchain.llms.llamacpp import LlamaCpp  # Interface for LlamaCpp model
from functools import cache  # Used for caching function results

# Setting up the file path for the model
folder = ''  # Directory where the model file is located (empty if in current directory)
file_name = 'llama-2-7b.Q4_0.gguf'  # Name of the model file
file_path = os.path.join(folder, file_name)  # Full path to the model file

@cache
def load_llm():
    """ Loads the LlamaCpp model with a specified model path and context size.
      Uses caching to optimize performance by storing the result for subsequent calls.
    """
    llm = LlamaCpp(model_path=file_path, n_ctx=8191,n_gpu_layers=20,
    n_batch=20, f16_kv=True)  # Load the LlamaCpp model // context window - 8191 // batch size 20 // gpu layer offload size 20 // F16_kv=True to enable half precision for key value cache

    return llm

### 📜 Define Inference Function

This function takes a text string and an optional `num_of_words argument` (defaulting to 200).

`llm = load_llm()` - calls the `load_llm` function to get the loaded LlamaCpp model.

`chain = LLMChain(llm=llm, prompt=prompt)` - Instantiates an LLMChain object with the LlamaCpp model and a prompt.

`return chain.run(...)` - executes the chain with the specified parameters and returns the result. The run method is passed the text and number of words, along with a callback handler `(StdOutCallbackHandler())` which  handles the output.

In [6]:
def summarize_thread(text: str, num_of_words: int = 200):
    """ Summarizes the given text using the LlamaCpp model.

    Args:
        text (str): Text to be summarized.
        num_of_words (int, optional): Number of words for the summary. Defaults to 200.

    Returns:
        The summary of the text.
    """
    llm = load_llm()  # Load the LlamaCpp model
    # Initialize LLMChain with the LlamaCpp model and the provided text as prompt

    chain = LLMChain(llm=llm, prompt=prompt)
    # Run the chain with the text and word limit, and return the result
    return chain.run({"text": text, "num_of_words":num_of_words}, callbacks=[StdOutCallbackHandler()])

## 🧪 Test the Inference Function with a Sample Twitter Thread 🧵

In [17]:
text="""reset wifi command ubuntu"""

# call inference function
summarize_thread(text='test me', num_of_words=200)



[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3m

### Instruction:
Given a ubuntu command: 'test me', You are tasked with providing a command for ubuntu terminal in 200 words and ensure the command doesn't lose the main context of ubuntu. 

### Response:
[0m


Llama.generate: prefix-match hit

llama_print_timings:        load time =   16101.98 ms
llama_print_timings:      sample time =     145.70 ms /   256 runs   (    0.57 ms per token,  1757.08 tokens per second)
llama_print_timings: prompt eval time =   26987.00 ms /    42 tokens (  642.55 ms per token,     1.56 tokens per second)
llama_print_timings:        eval time =  195704.03 ms /   255 runs   (  767.47 ms per token,     1.30 tokens per second)
llama_print_timings:       total time =  223772.53 ms /   297 tokens



[1m> Finished chain.[0m


'The instruction was to provide a command for Ubuntu Terminal, and the command should not lose its main context. As the first word, I believe it is not necessary to explain what is Ubuntu Terminal, so I think it is not necessary to explain it, but it should be introduced that it is a shell, which is used in Ubuntu. In my opinion, the commands should be simple, so I hope I can write something simple, but not too simple that it does not contain any meaningful commands.\n\n### Sample Output\n```\n> test me\n```\n\n### My Solution\n\n#### Instructions\n\n##### Explain the solution and how it works\nI think that it should be simple, so I just use the basic commands like "cat" and "echo" but I don\'t think it is too simple because I tried to introduce some commands that I think is useful, like "find", "clear", "sleep", "read" etc...\n\n##### How it works? Explain your solution(s) to the problem and how it(they) solves(s) the problem.\nI think it is simple, but it works because I tried to mak

## 🚀 Deploying Llama 2-7B 🦙 as a REST API Endpoint with Modelbit



Modelbit is a lightweight platform designed to make deploying any ML model to a production endpoint fast and simple. With the ability to [deploy models from anywhere](https://www.modelbit.com/product/deploy-from-anywhere), it makes deploying your custom ML model as simple as passing an inference function to “modelbit.deploy()”.

Here are the basics you need to know about Modelbit:
- **Deployment from any Python environment:** Models can be deployed directly from Jupyter Notebooks, Hex, Deepnote, and VS Code.
- **Dependency detection:** Automatically detects which dependencies, libraries, and data your model needs and includes them in your model’s production Docker container.
- **REST API Endpoints:** Your models will be callable as [RESTful API endpoints](https://doc.modelbit.com/endpoints/).
- **Git-based version control:** Track and manage model iterations with Git repositories.
- **CI/CD integration:** Seamlessly integrate model updates and deployment into continuous integration and continuous delivery (CI/CD [link text](https://doc.modelbit.com/git/#connect-your-own-github-gitlab-or-azure-devops-repo)) pipelines like [GitHub Actions](https://github.com/features/actions) and [GitLab CI/CD](https://about.gitlab.com/solutions/continuous-integration/).

## 🔒 Log into Modelbit

Modelbit integrates with version control and exepriment iteration tools like gitlab, github, Weights & Biases and neptune.ai to move models from development to production quickly.

 To get started;

Create an account with [Modelbit]('modelbit.com') if you haven't already.**

In [12]:
import modelbit

# Log into the 'modelbit' service using the development ("dev") branch
# Ensure you create a "dev" branch in Modelbit or use the "main" branch for your deployment
mb = modelbit.login()

## ⚡ Time to `modelbit.deploy()`

Deploy the `summarize_thread` function to Modelbit with the [`mb.deploy()`](https://doc.modelbit.com/api-reference/deploy/) API.

This API takes the inference function as an argument and deploys it to Modelbit, making it available through a REST API.

Once deployed, you can use the REST API to call the function and generate summaries of Twitter threads.

In [13]:
mb.deploy(summarize_thread, extra_files=["llama-2-7b.Q4_0.gguf"], python_packages=["langchain==0.0.335","llama_cpp_python==0.2.17"],
          require_gpu=True)

Uploading 'prompt': 100%|██████████| 432/432 [00:00<00:00, 598B/s]


## 🧑‍🍳 Test the Llama 2-7B 🦙 REST API Endpoint



To test your REST Endpoint, you can use the requests package to send single or batch requests to the API.

You can use the `requests.post()` method to send a POST request to the API and use the json module to format the response. Here's an example of how to test your REST Endpoint using Python

In [None]:
twitter_thread="""In today's fast-paced world, it's crucial to stay connected and informed. Communication has evolved over the years, with technology playing a pivotal role. We're now able to share ideas, news, and personal stories with a global audience instantly.

The rise of social media has been a game-changer."""

Test your endpoint from the command line using:

> ⚠️ Replace the `ENTER_WORKSPACE_NAME` placeholder with your workspace name.

In [18]:
"""Testing the endpoint"""
import requests
import json

# Replace “ENTER_WORKSPACE_NAME” with your Modelbit username
# Also change the version branch from `v1` if you have a different deployment version

url = "https://nupipay.app.modelbit.com/v1/summarize_thread/latest"


headers = {
    'Content-Type': 'application/json'
}
data = {
    "data": ['shutdown command', 200] # text and num_of_words argument for inf function
}


# POST your request to the endpoint
response = requests.post(url, headers=headers, json=data)
response_json = response.json()

print(json.dumps(response_json, indent=4))

{
    "error": "Runtime exited unexpectedly."
}


** Use `cURL` **

> ⚠️ Replace the `ENTER_WORKSPACE_NAME` placeholder with your workspace name.

In [None]:
!curl -s -XPOST "https://ENTER_WORSKPACE_NAME.app.modelbit.com/v1/resnet_inference/dev/latest" -d '{"data": [twitter_thread, 200]}' | json_pp

# 📚 Modelbit Machine Learning Blog

Enjoyed this walkthrough? Check out our blog for more machine learning tutorials.

Recommendation:

- [Deploying a BERT Model to a REST API Endpoint for Text Classification](https://www.modelbit.com/blog/deploying-a-bert-model-to-a-rest-api-endpoint-for-text-classification).