<a href="https://colab.research.google.com/github/saurabhkaul/Assignment_1_Empowered_Coder/blob/master/turboml_llm_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TurboML LLM Tutorial
TurboML can spin up LLM servers that expose an OpenAI-compatible API. We currently support models
in the GGUF format, as well as non-GGUF models that can be converted to GGUF. In the latter
case, you choose the quantization type to use.

## Set up the environment and install TurboML's SDK
We use `turboml-installer` to set up the environment for TurboML's SDK.

In [1]:
!pip install -q turboml-installer
import turboml_installer ; turboml_installer.install_on_colab()

📦 Installing...
🩹 Patching environment...
⏲ Done in 0:00:43
🔁 Restarting kernel...


The kernel should now be restarted with TurboML's SDK installed.

## Log in to your TurboML instance

You can replace this snippet with the one shown on your TurboML homepage.

In [3]:
import turboml as tb

tb.init(
  backend_url="https://screeching-dolphin.api.turboml.online",
  api_key="tb_iVKKijh8TKeezNjButxCsCHqdYi8HreO_7e07ce66"
)
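
Hardcoding the API key in a notebook makes it easy to leak when the notebook is shared. One alternative is to read the credentials from environment variables. This is a minimal sketch: the variable names `TURBOML_BACKEND_URL` and `TURBOML_API_KEY` and the helper `load_turboml_credentials` are hypothetical and not part of the SDK.

```python
import os


def load_turboml_credentials(env=None):
    """Return keyword arguments for tb.init(), read from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    required = ("TURBOML_BACKEND_URL", "TURBOML_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "backend_url": env["TURBOML_BACKEND_URL"],
        "api_key": env["TURBOML_API_KEY"],
    }
```

With the variables set, the login call would become `tb.init(**load_turboml_credentials())`.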

In [4]:
LlamaServerRequest = tb.llm.LlamaServerRequest
HuggingFaceSpec = LlamaServerRequest.HuggingFaceSpec
ServerParams = LlamaServerRequest.ServerParams

## Choose a model
Let's use a LLaVA v1.5 7B quant (Q4_K) that is already in the GGUF format.

In [7]:
hf_spec = HuggingFaceSpec(
    hf_repo_id="Mozilla/llava-v1.5-7b-llamafile",
    select_gguf_file="llava-v1.5-7b-Q4_K.gguf",
)

## Spawn a server
Spawning a server returns a `server_id`, which you use to reference it later, and a `server_relative_url`, which you
use to reach it. The call is synchronous, so it can take a while to return while we retrieve (and, if needed, convert) your model.

In [8]:
response = tb.llm.spawn_llm_server(
    LlamaServerRequest(
        source_type=LlamaServerRequest.SourceType.HUGGINGFACE,
        hf_spec=hf_spec,
        server_params=ServerParams(
            threads=-1,
            seed=-1,
            context_size=0,
            flash_attention=False,
        ),
    )
)
response

INFO:turboml.llm:[hf-acquisition] Status: in_progress, Progress: Downloading model from HF...
INFO:turboml.llm:[hf-acquisition] Status: completed, Progress: Completed successfully.
INFO:turboml.llm:[hf-acquisition] Acquisition Done, gguf_id = Mozilla$$llava-v1.5-7b-llamafile$$llava-v15-7b-Q4_Kgguf


LlamaServerResponse(server_id='Mozilla$$llava-v1.5-7b-llamafile$$llava-v15-7b-Q4_Kgguf.1939615351', server_relative_url='/openai/Mozilla$$llava-v1.5-7b-llamafile$$llava-v15-7b-Q4_Kgguf.1939615351/api/v1')

In [9]:
server_id = response.server_id

In [11]:
from IPython.display import display, Image, Audio
import cv2  # OpenCV reads the video; install it with: pip install opencv-python
import base64
import time
# from openai import OpenAI
import os
import requests
import datetime

In [13]:
video = cv2.VideoCapture("/content/Video-642_480.mov")
frames = video.get(cv2.CAP_PROP_FRAME_COUNT)
fps = video.get(cv2.CAP_PROP_FPS)

# calculate duration of the video
seconds = round(frames / fps)
video_time = datetime.timedelta(seconds=seconds)
print(f"duration in seconds: {seconds}")
print(f"video time: {video_time}")

duration in seconds: 23
video time: 0:00:23
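
OpenCV returns 0 for properties it cannot read, for example when the file failed to open, and in that case `frames / fps` raises a `ZeroDivisionError`. A small guard avoids that. This is a sketch: the helper name `video_duration` is hypothetical, and the 30 fps in the example call is an assumed value, since the notebook does not print the real frame rate.

```python
import datetime


def video_duration(frame_count, fps):
    """Return the clip length as a timedelta, or None if the metadata is unusable."""
    if fps <= 0 or frame_count <= 0:
        return None
    return datetime.timedelta(seconds=round(frame_count / fps))


# 688 frames at an assumed 30 fps
print(video_duration(688, 30.0))  # 0:00:23
```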


In [14]:
base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

688 frames read.
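
Sending all 688 frames to the model is impractical, so only a handful are used later. Instead of hardcoding frame numbers, the indices can be spread evenly over the clip. This is a minimal sketch, and the helper name `evenly_spaced_indices` is hypothetical.

```python
def evenly_spaced_indices(total, count):
    """Return up to `count` indices spread evenly over range(total), ends included."""
    if total <= 0 or count <= 0:
        return []
    count = min(count, total)
    if count == 1:
        return [0]
    step = (total - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


print(evenly_spaced_indices(688, 4))  # [0, 229, 458, 687]
```

The selected frames would then be `[base64Frames[i] for i in evenly_spaced_indices(len(base64Frames), 4)]`.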


## Interacting with the LLM

Our LLM is exposed through an OpenAI-compatible API, so we can use the OpenAI SDK, or any
other compatible tool, to interact with it.

In [None]:
from openai import OpenAI

base_url = tb.common.env.CONFIG.TURBOML_BACKEND_SERVER_ADDRESS
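# server_relative_url already starts with "/", so normalize the slashes to avoid "//" in the URL
server_url = f"{base_url.rstrip('/')}/{response.server_relative_url.lstrip('/')}"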

client = OpenAI(base_url=server_url, api_key="-")

prompt = "Describe what's happening in this Instagram reel, and explain how we can improve it."


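# Use a new name so the `response` returned by spawn_llm_server is not overwritten.
completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # The OpenAI chat format expects images as `image_url` parts; base64 frames go in a data URI.
                # Assumption: the server runs a vision-capable model and accepts this format.
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64Frames[i]}"}}
                    for i in (0, 100, 500, 600)
                ],
            ],
        },
    ],
    model="-",
)

print(completion)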

INFO:httpx:HTTP Request: POST https://screeching-dolphin.api.turboml.online//openai/Mozilla$$llava-v1.5-7b-llamafile$$llava-v15-7b-Q4_Kgguf.1939615351/api/v1/chat/completions "HTTP/1.1 504 Gateway Time-out"
INFO:openai._base_client:Retrying request to /chat/completions in 0.441204 seconds
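
The request path in the log above contains `//openai`: the backend address was joined to a relative URL that already begins with `/`. Whether this contributes to the timeout is not known, but normalizing the join is harmless. This is a minimal sketch: the helper name `join_server_url` is hypothetical, and `example-server` is a placeholder rather than a real server ID.

```python
def join_server_url(base_url, relative_url):
    """Join a backend address and a server-relative path with exactly one slash."""
    return f"{base_url.rstrip('/')}/{relative_url.lstrip('/')}"


print(join_server_url(
    "https://screeching-dolphin.api.turboml.online",
    "/openai/example-server/api/v1",
))
# https://screeching-dolphin.api.turboml.online/openai/example-server/api/v1
```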


In [10]:
%pip install openai

Collecting openai
  Downloading openai-1.61.1-py3-none-any.whl.metadata (27 kB)
Collecting anyio<5,>=3.5.0 (from openai)
  Downloading anyio-4.8.0-py3-none-any.whl.metadata (4.6 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.8.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting sniffio (from openai)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.7-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.61.1-py3-none-any.whl (463 kB)
Downloading anyio-4.8.0-py3-none-any.whl (96 kB)
Downloading httpx-0.28.1-py3-none-any.whl (73 kB)
Downloading httpcore-1.0.7-py3-none-any.whl (78 kB)
Downloading jiter

In [2]:
embeddings = (
    client.embeddings.create(input=["Hello there how are you doing today?"], model="-")
    .data[0]
    .embedding
)
len(embeddings), embeddings[:5]

NameError: name 'client' is not defined

## Stop the server

In [None]:
tb.llm.stop_llm_server(server_id)