<a href="https://colab.research.google.com/github/raymond91125/Notebook/blob/master/HostLlama2BehindAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hosting Llama 2 with Free GPU via Google Colab**

https://medium.com/@yuhongsun96/host-a-llama-2-api-on-gpu-for-free-a5311463c183

**Before getting started, if running on Google Colab, check that the runtime is set to T4 GPU**

## Install Dependencies
- Requirements for running FastAPI Server
- Requirements for creating a public model serving URL via Ngrok
- Requirements for running Llama2 13B (including Quantization)


In [1]:
# Build Llama cpp
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.11.tar.gz (3.6 MB)
     3.6/3.6 MB 12.6 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.11-cp310-cp310-manylinux_2_35_x86_64.whl size=6423607 sha256=1a10c4c3de05174b2408bd85160f8778f5ad3a700d7fb1c38b14a6c20897d3f9
  Stored in directory: /root/.cache/pip/wheels/dc/42/77/a3ab0d02700427ea364de5797786c0272779dce795f62c3bc2
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.2.11


In [None]:
# If pip complains about the dependency resolver, it is safe to ignore
!pip install fastapi[all] uvicorn python-multipart transformers pydantic tensorflow

In [None]:
# This downloads and sets up the Ngrok executable in the Google Colab instance
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -o ngrok-stable-linux-amd64.zip

Ngrok is used to make the FastAPI server accessible via a public URL.

Ngrok requires a free account, and you must provide its auth token to use it. The free version allows only 1 local tunnel, and the auth token is how Ngrok tracks this usage limit.

In [None]:
# https://dashboard.ngrok.com/signup
!./ngrok authtoken <YOUR-NGROK-TOKEN-HERE>

## Create FastAPI App
This provides an API to the Llama 2 model. The model version can be changed in the code below as desired.

For this demo we will use the 13-billion-parameter version, which is fine-tuned for instruction following (chat).

The model file is quantized (compressed) so that it fits on the GPU. Even so, it is still a more powerful model than the 7B variant.
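
As a rough back-of-the-envelope check on why quantization is needed here, the sketch below estimates weight storage for a 13B model. All figures are assumptions rather than measurements: about 13e9 parameters, 16 bits per weight for half precision, roughly 6 bits per weight for the `q5_1` format (5-bit values plus a per-block scale and offset), and 16 GB of nominal memory on a T4. It ignores the context cache and runtime overhead, so treat the result as a lower bound.

```python
# Back-of-the-envelope estimate of model weight size.
# All constants below are assumptions for illustration, not measured values.


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9       # assumption: ~13 billion parameters
T4_MEMORY_GB = 16.0   # assumption: nominal T4 GPU memory

fp16_gb = weight_size_gb(N_PARAMS, 16)  # unquantized half precision
q5_1_gb = weight_size_gb(N_PARAMS, 6)   # assumption: ~6 bits per weight for q5_1

print(f"fp16: {fp16_gb:.2f} GB, fits on T4: {fp16_gb < T4_MEMORY_GB}")
print(f"q5_1: {q5_1_gb:.2f} GB, fits on T4: {q5_1_gb < T4_MEMORY_GB}")
```

Under these assumptions the half-precision weights alone exceed the T4's memory, while the quantized weights leave room for the context cache.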

In [None]:
%%writefile app.py
from typing import Any

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import tensorflow as tf


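# A quantized GGML model is required to fit Llama2-13B on a T4 GPU
# Note: GGML (ggmlv3) files load only with llama-cpp-python versions before 0.1.79;
# later versions expect GGUF files instead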
GENERATIVE_AI_MODEL_REPO = "TheBloke/Llama-2-13B-chat-GGML"
GENERATIVE_AI_MODEL_FILE = "llama-2-13b-chat.ggmlv3.q5_1.bin"

model_path = hf_hub_download(
    repo_id=GENERATIVE_AI_MODEL_REPO,
    filename=GENERATIVE_AI_MODEL_FILE
)

llama2_model = Llama(
    model_path=model_path,
    n_gpu_layers=64,
    n_ctx=2000
)

# Test an inference
print(llama2_model(prompt="Hello ", max_tokens=1))


app = FastAPI()


# This defines the JSON format expected by the endpoint; change as needed
class TextInput(BaseModel):
    inputs: str
    # Defaults to None so that clients may omit "parameters" entirely
    parameters: dict[str, Any] | None = None


@app.get("/")
def status_gpu_check() -> dict[str, str]:
    # tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
    gpu_msg = "Available" if tf.config.list_physical_devices("GPU") else "Unavailable"
    return {
        "status": "I am ALIVE!",
        "gpu": gpu_msg
    }


@app.post("/generate/")
async def generate_text(data: TextInput) -> dict[str, str]:
    try:
        params = data.parameters or {}
        response = llama2_model(prompt=data.inputs, **params)
        model_out = response['choices'][0]['text']
        return {"generated_text": model_out}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

## Start FastAPI Server
The initial run takes a long time because the server has to download the model and load it onto the GPU.

Note: interrupting the Google Colab runtime will send a SIGINT and stop the server.

In [None]:
# This cell finishes quickly because it just needs to start up the server
# The server will start the model download and will take a while to start up
# ~5 minutes
!uvicorn app:app --host 0.0.0.0 --port 8000 > server.log 2>&1 &

Check `server.log` to see progress.

Wait until the model is loaded, and confirm with the next cell before moving on.

In [None]:
# If you see "Failed to connect", it's because the server is still starting up
# Wait for the model to be downloaded and the server to fully start
# Check the server.log file to see the status
!curl localhost:8000
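
Instead of re-running the `curl` cell by hand, the wait can be scripted. The following is a minimal sketch, not part of the original notebook: the function names, the 15-minute timeout, and the 10-second polling interval are all assumptions chosen for illustration. It polls the status endpoint on `localhost:8000` until it answers with HTTP 200.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_up(url: str) -> bool:
    """Return True if a GET request to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: server not ready yet
        return False


def wait_for_server(
    url: str = "http://localhost:8000",
    timeout_s: float = 900.0,   # assumption: generous upper bound for download + load
    interval_s: float = 10.0,   # assumption: polling interval
    probe: Callable[[str], bool] = server_is_up,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage in a notebook cell (uncomment to run):
# print("Server ready" if wait_for_server() else "Timed out; check server.log")
```

The `probe` argument exists only so the polling loop can be exercised without a live server; in normal use the default is sufficient.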

## Use Ngrok to Create a Public URL for the FastAPI Server
**IMPORTANT:** If you created an account via email, please verify your email or the next 2 cells won't work.

If you signed up via Google or GitHub account, you're good to go.

In [None]:
# This starts Ngrok and creates the public URL
from IPython import get_ipython
get_ipython().system_raw('./ngrok http 8000 &')

Open the URL generated by the next cell. It should report that the FastAPI server is alive and that the GPU is available.

To hit the model endpoint, append `/generate/` to the URL and send a POST request.

In [None]:
# Get the Public URL
# If this doesn't work, make sure you verified your email
# Then run the previous code cell and this one again
!curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
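
Once the public URL is known, the endpoint can be called from any machine. The sketch below is one possible client using only the standard library; the helper names and the example URL are hypothetical. The request body follows the `TextInput` schema from `app.py` (`inputs` plus optional `parameters`), and the entries in `parameters` are passed through to the model call as keyword arguments, so `max_tokens` and `temperature` here are examples of what that call accepts.

```python
import json
import urllib.request
from typing import Any, Optional


def build_generate_request(
    base_url: str, prompt: str, parameters: Optional[dict[str, Any]] = None
) -> urllib.request.Request:
    """Build a POST request whose JSON body matches the TextInput schema."""
    payload = {"inputs": prompt, "parameters": parameters}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(
    base_url: str,
    prompt: str,
    parameters: Optional[dict[str, Any]] = None,
    timeout_s: float = 120.0,  # assumption: generation may take a while
) -> str:
    """Send the prompt to the server and return the generated text."""
    req = build_generate_request(base_url, prompt, parameters)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)["generated_text"]


# Example usage (hypothetical URL; substitute the one printed above):
# text = generate(
#     "https://<your-subdomain>.ngrok.io",
#     "Write a haiku about GPUs.",
#     {"max_tokens": 64, "temperature": 0.7},
# )
# print(text)
```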

## Shutting Down
To shut down the processes, run the following commands in a new cell:
```
!pkill uvicorn
!pkill ngrok
```