# **Hosting Llama 2 with Free GPU via Google Collab**

**Reference**
https://medium.com/@yuhongsun96/host-a-llama-2-api-on-gpu-for-free-a5311463c183

## Install Dependencies
- Requirements for running FastAPI Server
- Requirements for creating a public model serving URL via Ngrok
- Requirements for running Llama2 13B (including Quantization)


In [1]:
!pip install fastapi[all] uvicorn python-multipart pydantic
!pip install accelerate tokenizer
!pip install transformers[torch]
!pip install einops
!pip install xformers
!pip install langchain
!pip install langchain-community
!pip install langchain-core
!pip install faiss-gpu
!pip install sentence_transformers
!pip install bitsandbytes
!pip install peft
!pip install pyngrok

Collecting fastapi[all]
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.0/92.0 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting uvicorn
  Downloading uvicorn-0.30.0-py3-none-any.whl (62 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.4/62.4 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting python-multipart
  Downloading python_multipart-0.0.9-py3-none-any.whl (22 kB)
Collecting starlette<0.38.0,>=0.37.2 (from fastapi[all])
  Downloading starlette-0.37.2-py3-none-any.whl (71 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.9/71.9 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
Collecting fastapi-cli>=0.0.2 (from fastapi[all])
  Downloading fastapi_cli-0.0.4-py3-none-any.whl (9.5 kB)
Collecting httpx>=0.23.0 (from fastapi[all])
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6

In [2]:
# This downloads and sets up the Ngrok executable in the Google Colab instance
# !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
# !unzip -o ngrok-stable-linux-amd64.zip

Ngrok is used to make the FastAPI server accessible via a public URL.

Users are required to make a free account and provide their auth token to use Ngrok. The free version only allows 1 local tunnel and the auth token is used to track this usage limit.

In [3]:
# https://dashboard.ngrok.com/signup
!./ngrok authtoken 2go3dgdjdxdqlzeNUmFS2ML9Yqa_6JYfKjSxtA8VpiSYDHxC9

/bin/bash: line 1: ./ngrok: No such file or directory


## Create FastAPI App
This provides an API to the Llama 2 model. The model version can be changed in the code below as desired.

For this demo we will use the 13 billion parameter version which is finetuned for instruction (chat) following.

Despite the compression, it is still a more powerful model than the 7B variant.

In [4]:
# !huggingface-cli login

In [5]:
from huggingface_hub import login
login(new_session = False,
     token = "")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [6]:
%%writefile app.py
import torch
import transformers
from typing import Any

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
from langchain_core.prompts import PromptTemplate
from langchain import LLMChain
from langchain.llms import HuggingFacePipeline


bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = False
)
model_name = "Tien094/vinallama-sum"
model  = AutoModelForCausalLM.from_pretrained(model_name,
                                              quantization_config = bnb_config,
                                              trust_remote_code=True,
                                             device_map = "auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
template = '''
<|im_start|>system
Bạn là một trợ lí AI hữu ích. Hãy trả lời tóm tắt đoạn văn dưới đây một cách ngắn gọn.
<|im_end|>
<|im_start|>user
{text}<|im_end|>
<|im_start|>assistant
'''
prompt = PromptTemplate(template = template, input_variables = ["text"])

generation_pipeline = transformers.pipeline(
    model = model,
    task = "text-generation",
    tokenizer = tokenizer,
    temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=256  # max number of tokens to generate in the output
)
my_pipeline = HuggingFacePipeline(pipeline=generation_pipeline)
llm_chain = LLMChain(prompt=prompt,
                    llm=my_pipeline)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!


app = FastAPI()

# This defines the data json format expected for the endpoint, change as needed
class TextInput(BaseModel):
    inputs: str


@app.get("/")
def status_gpu_check() -> dict[str, str]:
    gpu_msg = "Available" if torch.cuda.is_available() else "Unavailable"
    return {
        "status": "I am ALIVE!",
        "gpu": gpu_msg
    }


@app.post("/generate/")
async def generate_text(data: TextInput) -> dict[str, str]:
    try:
        model_out = llm_chain.run({"text": data.inputs})
        return {"response": model_out}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Writing app.py


In [7]:
# llm_chain.run({"text":"Trong một ngôi làng nhỏ nằm ở miền quê xa xôi, có một cậu bé tên là Peter. Peter luôn mong ước được trở thành một hiệp sĩ dũng mãnh, nhưng ông bà của cậu lại muốn cậu trở thành một nhà nghiên cứu. Một ngày nọ, khi cậu đang lang thang trong khu rừng cạnh làng, Peter bắt gặp một con rồng lớn đang bị mắc kẹt trong một cái lưới. Thay vì sợ hãi, Peter quyết định giải thoát cho con rồng. Không ngờ rằng, sau đó, con rồng đã giúp Peter thực hiện ước mơ của mình bằng cách dạy cho cậu những kỹ năng và sức mạnh của một hiệp sĩ. Với sự giúp đỡ của con rồng, Peter trở thành một hiệp sĩ vĩ đại và bảo vệ ngôi làng của mình khỏi sự đe dọa của một kẻ ác."})

In [8]:
# llm_chain.run({"text": "giới thiệu về hà nội"})

In [9]:
# Set the authtoken
import nest_asyncio
from pyngrok import ngrok
ngrok.set_auth_token("2go3T7u0mYBxKMPrLH6pUIThIDO_3cUSudweBw7cURoofQ9Es")

# Connect to ngrok
ngrok_tunnel = ngrok.connect(8000)

# Print the public URL
print('Public URL:', ngrok_tunnel.public_url)
# Apply nest_asyncio
nest_asyncio.apply()

Public URL: https://2fa0-34-142-171-233.ngrok-free.app


## Start FastAPI Server
The initial run will take a long time due to having to download the model and load it onto GPU.

Note: interrupting the Google Colab runtime will send a SIGINT and stop the server.

In [None]:
# This cell finishes quickly because it just needs to start up the server
# The server will start the model download and will take a while to start up
# ~5 minutes
!uvicorn app:app --host 0.0.0.0 --port 8000


`from langchain_community.llms import HuggingFacePipeline`.

To install langchain-community run `pip install -U langchain-community`.
config.json: 100% 1.21k/1.21k [00:00<00:00, 7.49MB/s]
adapter_config.json: 100% 649/649 [00:00<00:00, 4.00MB/s]
config.json: 100% 682/682 [00:00<00:00, 4.65MB/s]
pytorch_model.bin: 100% 5.55G/5.55G [00:51<00:00, 108MB/s]


In [None]:
import transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
from langchain_core.prompts import PromptTemplate
from langchain import LLMChain
from langchain.llms import HuggingFacePipeline

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = False
)
# model_name = "Tien094/vinallama-vietnews"
# model_name = "Tien094/vinallama-sum-2"
# model_name = "vilm/vinallama-2.7b-chat"
model_name = "Tien094/vinallama-sum"
model  = AutoModelForCausalLM.from_pretrained(model_name,
                                              quantization_config = bnb_config,
                                              trust_remote_code=True,
                                             device_map = "auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)

generation_pipeline = transformers.pipeline(
    model = model,
    task = "text-generation",
    tokenizer = tokenizer,
    temperature=0.7,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=256,  # max number of tokens to generate in the output
)
my_pipeline = HuggingFacePipeline(pipeline=generation_pipeline)
template = '''
<|im_start|>system
Bạn là một trợ lí AI hữu ích. Hãy tóm tắt đoạn văn dưới đây một cách ngắn gọn.
<|im_end|>
<|im_start|>user
{text}<|im_end|>
<|im_start|>assistant
'''
prompt = PromptTemplate(template = template, input_variables = ["text"])

llm_chain = LLMChain(prompt=prompt,
                    llm=my_pipeline)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/649 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/682 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/5.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/83.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.67M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/558 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
  warn_deprecated(
  warn_deprecated(


Check the logs at server.log to see progress.

Wait until model is loaded and check with the next cell before moving on.

In [None]:
# If you see "Failed to connect", it's because the server is still starting up
# Wait for the model to be downloaded and the server to fully start
# Check the server.log file to see the status
!curl localhost:8000

## Shutting Down
To shut down the processes, run the following commands in a new cell:
```
!pkill uvicorn
!pkill ngrok
```

In [None]:
# !pkill uvicorn

In [None]:
# !pkill ngrok