# **Hosting Llama 2 with Free GPU via Google Colab**

**Before getting started, if running on Google Colab, check that the runtime is set to T4 GPU**
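One way to confirm the runtime setting from inside the notebook is to ask `nvidia-smi` for the GPU name. The sketch below is only an illustration: the helper name `detect_gpu_name` is made up for this example, and it assumes that a missing or failing `nvidia-smi` means no GPU runtime is attached.

```python
import shutil
import subprocess
from typing import Optional


def detect_gpu_name() -> Optional[str]:
    """Return the name of the first NVIDIA GPU, or None if none is visible."""
    # nvidia-smi ships with the NVIDIA driver; if it is absent, assume no GPU runtime.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            text=True,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    names = [line.strip() for line in out.splitlines() if line.strip()]
    return names[0] if names else None


name = detect_gpu_name()
print(name if name else "No GPU detected: switch the runtime type to T4 GPU")
```

On a correctly configured Colab runtime the printed name should mention the T4; otherwise change the runtime type before continuing.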

## Install Dependencies
- Requirements for running FastAPI Server
- Requirements for creating a public model serving URL via Ngrok
- Requirements for running Llama2 7B (including Quantization)


In [None]:
# Build Llama cpp
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.42.tar.gz (10.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.7/10.7 MB 31.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.42-cp310-cp310-manylinux_2_35_x86_64.whl size=20499785 sha256=4e780d8a31802c172e01d1b0b5212e5bbb4070f14e9964df5f53d011f2afd27c
  Stored in directory: /root/.cache/pip/wheels/7d/5a/01/49d85bf7f082e503f94e01785a55f7fc07dd41441ac02d25cd
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.2.4

In [None]:
# If this complains about dependency resolver, it's safe to ignore
!pip install fastapi[all] uvicorn python-multipart transformers pydantic tensorflow

Collecting fastapi[all]
  Downloading fastapi-0.109.2-py3-none-any.whl (92 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.1/92.1 kB 3.4 MB/s eta 0:00:00
Collecting uvicorn
  Downloading uvicorn-0.27.1-py3-none-any.whl (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.8/60.8 kB 8.4 MB/s eta 0:00:00
Collecting python-multipart
  Downloading python_multipart-0.0.9-py3-none-any.whl (22 kB)
Collecting starlette<0.37.0,>=0.36.3 (from fastapi[all])
  Downloading starlette-0.36.3-py3-none-any.whl (71 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.5/71.5 kB 8.7 MB/s eta 0:00:00
Collecting email-validator>=2.0.0 (from fastapi[all])
  Downloading email_validator-2.1.0.post1-py3-none-any.whl (32 kB)
Collecting httpx>=0.23.0 (from fastapi[all])
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
# This downloads and sets up the Ngrok executable in the Google Colab instance
# Import the ngrok GPG key
!curl -s https://ngrok-agent.s3.amazonaws.com/ngrok.asc | gpg --import -

# Add the ngrok repository to the apt sources list
!echo "deb https://ngrok-agent.s3.amazonaws.com buster main" | sudo tee /etc/apt/sources.list.d/ngrok.list

# Fetch the public key associated with the ngrok repository
!sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 0E61D3BBAAEE37FE

# Update the apt package lists
!sudo apt-get update

# Install ngrok
!sudo apt-get install ngrok


gpg: directory '/root/.gnupg' created
gpg: keybox '/root/.gnupg/pubring.kbx' created
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 0E61D3BBAAEE37FE: public key "ngrok agent apt repo release bot <release-bot@ngrok.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
deb https://ngrok-agent.s3.amazonaws.com buster main
Executing: /tmp/apt-key-gpghome.JdL9E8Nrw9/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 0E61D3BBAAEE37FE
gpg: key 0E61D3BBAAEE37FE: public key "ngrok agent apt repo release bot <release-bot@ngrok.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_6

Ngrok is used to make the FastAPI server accessible via a public URL.

Users are required to make a free account and provide their auth token to use Ngrok. The free version only allows 1 local tunnel and the auth token is used to track this usage limit.
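Because the auth token identifies your account, it is safer to keep it out of the notebook text. A minimal sketch, assuming the token has been placed in an environment variable called `NGROK_AUTHTOKEN` (the variable name and the helper `build_authtoken_command` are choices made for this example, not part of the original notebook):

```python
import os
from typing import List, Mapping


def build_authtoken_command(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the `ngrok authtoken` command from an environment variable."""
    # Assumption: the token was exported as NGROK_AUTHTOKEN beforehand.
    token = env.get("NGROK_AUTHTOKEN", "").strip()
    if not token:
        raise ValueError("Set NGROK_AUTHTOKEN before configuring ngrok")
    return ["ngrok", "authtoken", token]


# Demonstration with a placeholder; a real token comes from the ngrok dashboard.
print(build_authtoken_command({"NGROK_AUTHTOKEN": "your-token-here"}))
```

In the notebook, the resulting list could be passed to `subprocess.run(..., check=True)` in place of the hard-coded shell command.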

In [None]:
# Sign up at https://dashboard.ngrok.com/signup, then paste your own token below.
# Do not share a real authtoken in a public notebook.
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


## Create FastAPI App
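This provides an API to the Llama 2 model. The model version can be changed in the code below as desired.

For this demo we use the 7-billion-parameter Llama 2 base model in a 4-bit quantized GGUF build (`llama-2-7b.Q4_K_M.gguf`), which fits in the memory of a T4 GPU.

To use a larger or chat-tuned variant instead, change the repository and file name in the code. Quantization (compression of the weights) is what makes these models small enough to load on a free GPU.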
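To see why the compression matters on a T4, which has about 16 GB of memory, a rough back-of-the-envelope estimate of weight size helps. The sketch below assumes nominal parameter counts and an average of about 4.5 bits per weight for a 4-bit format (an approximation; the exact figure depends on the quantization scheme), and it ignores the extra memory needed for the context cache.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


FP16_BITS = 16   # unquantized half-precision weights
Q4_BITS = 4.5    # assumption: rough average for a 4-bit format including scales

for label, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = estimate_weights_gb(n_params, FP16_BITS)
    q4 = estimate_weights_gb(n_params, Q4_BITS)
    print(f"{label}: fp16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

Under these assumptions the half-precision 13B weights alone would exceed the T4's memory, while the 4-bit versions of both sizes leave room to spare.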

In [None]:
%%writefile app.py

from typing import Any
from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import tensorflow as tf


# Quantized GGUF model so that Llama 2 fits on a T4 GPU

GENERATIVE_AI_MODEL_REPO = "TheBloke/Llama-2-7B-GGUF"
GENERATIVE_AI_MODEL_FILE = "llama-2-7b.Q4_K_M.gguf"

model_path = hf_hub_download(
    repo_id=GENERATIVE_AI_MODEL_REPO,
    filename=GENERATIVE_AI_MODEL_FILE
)

llama2_model = Llama(
    model_path=model_path,
    n_gpu_layers=64,
    n_ctx=2000
)

# Test an inference
print(llama2_model(prompt="Hello ", max_tokens=1))

app = FastAPI()

# This defines the data json format expected for the endpoint, change as needed
class TextInput(BaseModel):
    inputs: str
    parameters: dict[str, Any] | None = None  # optional generation settings

@app.get("/")
def status_gpu_check() -> dict[str, str]:
    gpu_msg = "Available" if tf.config.list_physical_devices("GPU") else "Unavailable"
    return {
        "status": "I am ALIVE!",
        "gpu": gpu_msg
    }

@app.post("/generate/")
async def generate_text(data: TextInput) -> dict[str, str]:
    try:
        print(type(data))
        print(data)
        params = data.parameters or {}
        response = llama2_model(prompt=data.inputs, **params)
        model_out = response['choices'][0]['text']
        return {"generated_text": model_out}
    except Exception as e:
        print(type(data))
        print(data)
        raise HTTPException(status_code=500, detail=str(e))

Writing app.py
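The `/generate/` endpoint forwards whatever arrives in `parameters` straight into the model call. Since the server will be reachable from a public URL, it may be worth filtering that dictionary first. The following is a hypothetical sketch, not part of the app above: the allow-list, the cap on `max_tokens`, and the name `sanitize_params` are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional

# Assumption: an illustrative allow-list of completion settings.
ALLOWED_PARAMS = {"max_tokens", "temperature", "top_p", "top_k", "stop", "repeat_penalty"}
MAX_TOKENS_CAP = 512  # hypothetical upper bound to keep requests short


def sanitize_params(params: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Keep only allow-listed keys and clamp max_tokens to the cap."""
    clean = {k: v for k, v in (params or {}).items() if k in ALLOWED_PARAMS}
    if "max_tokens" in clean:
        clean["max_tokens"] = min(int(clean["max_tokens"]), MAX_TOKENS_CAP)
    return clean


print(sanitize_params({"temperature": 0.1, "max_tokens": 10_000, "unexpected": True}))
```

Inside the endpoint, `params = sanitize_params(data.parameters)` could stand in for the `data.parameters or {}` expression.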


## Start FastAPI Server
The initial run takes a long time because the model has to be downloaded and loaded onto the GPU.

Note: interrupting the Google Colab runtime will send a SIGINT and stop the server.

Check the logs at server.log to see progress.

When successful, it should report that the FastAPI server is alive and that the GPU is available.
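To follow progress while the server starts, the last few lines of `server.log` can be printed from another cell. A small sketch, assuming the log sits in the current working directory; the helper name `tail_log` is made up for this example.

```python
from collections import deque
from pathlib import Path
from typing import List


def tail_log(path: str = "server.log", n: int = 20) -> List[str]:
    """Return the last n lines of the log, or an empty list if it does not exist yet."""
    log = Path(path)
    if not log.exists():
        return []
    with log.open("r", errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while reading
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


for line in tail_log():
    print(line)
```

Re-running the cell shows the download and model-loading messages as they appear.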

In [None]:
# The server will start the model download and will take a while to start up
# ~5 minutes if it's not already downloaded

import subprocess
import time

from ipywidgets import HTML
from IPython.display import display

t = HTML(
    value="0 Seconds",
    description = 'Server is Starting Up... Elapsed Time:' ,
    style={'description_width': 'initial'},
)
display(t)

timer = 0

def server_is_up():
    """Return True once the FastAPI root endpoint answers."""
    try:
        subprocess.check_output(['curl', "localhost:8000"])
        return True
    except subprocess.CalledProcessError:
        return False

# Start uvicorn in the background only if nothing is listening yet
if not server_is_up():
    get_ipython().system_raw('uvicorn app:app --host 0.0.0.0 --port 8000 > server.log 2>&1 &')

# Poll once per second for up to 10 minutes
while not server_is_up() and timer < 600:
    time.sleep(1)
    timer += 1
    t.value = str(timer) + " Seconds"

if timer >= 600:
    print("Error: timed out! Took more than 10 minutes :(")
subprocess.check_output(['curl', "localhost:8000"])

HTML(value='0 Seconds', description='Server is Starting Up... Elapsed Time:', style=DescriptionStyle(descripti…

b'{"status":"I am ALIVE!","gpu":"Available"}'

## Use Ngrok to create a public URL for the FastAPI server.
**IMPORTANT:** If you created an account via email, please verify your email or the next 2 cells won't work.

If you signed up via Google or GitHub account, you're good to go.

To hit the model endpoint, append `/generate/` to the URL.
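The public URL comes from the JSON that the local ngrok agent serves at `http://localhost:4040/api/tunnels`. As a sketch of how that response can be turned into the model endpoint, the helpers below (names chosen for this example) pick the first HTTPS tunnel and append the route; the sample payload and its hostname are made-up placeholders.

```python
import json
from typing import Optional


def pick_public_url(tunnels_json: str) -> Optional[str]:
    """Return the first HTTPS public URL in the tunnels response, if any."""
    tunnels = json.loads(tunnels_json).get("tunnels", [])
    for tunnel in tunnels:
        url = tunnel.get("public_url", "")
        if url.startswith("https://"):
            return url
    return None


def generate_endpoint(public_url: str) -> str:
    """Append the /generate/ route, avoiding a doubled slash."""
    return public_url.rstrip("/") + "/generate/"


# Placeholder payload shaped like the ngrok tunnels response.
sample = '{"tunnels": [{"public_url": "https://example.ngrok-free.app"}]}'
print(generate_endpoint(pick_public_url(sample)))
```

Returning `None` when no tunnel is listed makes it easy to retry if ngrok has not finished starting.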

In [None]:
# This starts Ngrok and creates the public URL
import subprocess
import time
import sys
import json

from IPython import get_ipython
get_ipython().system_raw('ngrok http 8000 &')
time.sleep(1)
curlOut = subprocess.check_output(['curl',"http://localhost:4040/api/tunnels"],universal_newlines=True)
time.sleep(1)
ngrokURL = json.loads(curlOut)['tunnels'][0]['public_url']
%store ngrokURL
print(ngrokURL)

Stored 'ngrokURL' (str)
https://d02a-35-230-81-180.ngrok-free.app


## Testing API
The URL from the previous cell is stored and reused in this driver code. You can change the prompt under *inputs*, then run the cell.

In [None]:
import requests
# Define the URL for the FastAPI endpoint
%store -r ngrokURL

# Define the data to send in the POST request
data = {
  "inputs": '''
Tell me how to make a chocolate cake?
''',
  # Parameters are documented at https://abetlen.github.io/llama-cpp-python/#llama_cpp.llama.Llama.create_completion
  "parameters": {"temperature":0.1,
                 "max_tokens":200}
  # A higher temperature gives more varied output; a lower one gives more deterministic output
  # max_tokens is the maximum number of tokens (roughly, word pieces) to generate
}


# Send the POST request
response = requests.post(ngrokURL + "/generate/", json=data)

# Check the response
if response.status_code == 200:
    result = response.json()
    print("Generated Text:\n", data["inputs"], result["generated_text"].strip())
else:
    print("Request failed with status code:", response.status_code)

Generated Text:
 
Tell me how to make a chocolate cake?
 I'm not sure if I can.
But I will try my best.
I'll need some eggs, and flour, and sugar, and butter.
And maybe some milk.
And then I'll have to mix it all together.
And then I'll have to bake it in the oven.
And then I'll have to wait for it to cool down.
And then I'll have to cut it into pieces.
And then I'll have to eat it.
And then I'll have to tell you how it tasted.
And then I'll have to tell you what it was like.
And then I'll have to tell you what it was like when it was done.
And then I'll have to tell you what it was like when it was done.
And then I'll have to tell you what it was like when it was done.
And then


## Shutting Down
To shut down the processes, run the following commands:

In [None]:
!pkill uvicorn

In [None]:
!pkill ngrok