## Setup Dependencies

In [None]:
!pip install groq
!pip install -U llama-stack





In [4]:
# Set the variable through IPython so the cell also works on Windows, where
# cmd.exe does not understand the POSIX-style `VAR=value command` prefix.
%env UV_SYSTEM_PYTHON=1
!llama stack build --template together --image-type venv


In [5]:
!pip install yt-dlp pytubefix youtube-transcript-api

Collecting yt-dlp
  Downloading yt_dlp-2025.2.19-py3-none-any.whl.metadata (171 kB)
Collecting pytubefix
  Downloading pytubefix-8.12.2-py3-none-any.whl.metadata (5.3 kB)
Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.6.3-py3-none-any.whl.metadata (17 kB)
Collecting defusedxml<0.8.0,>=0.7.1 (from youtube-transcript-api)
  Downloading defusedxml-0.7.1-py2.py3-none-any.whl.metadata (32 kB)
Downloading yt_dlp-2025.2.19-py3-none-any.whl (3.2 MB)


## Setup Tools

In [None]:
import yt_dlp
from dataclasses import dataclass
from datetime import datetime
import pytubefix
from youtube_transcript_api import YouTubeTranscriptApi

@dataclass
class VideoMetadata:
    title: str
    upload_date: str  # formatted as MM/DD/YYYY
    duration_s: int
    url: str

def search_youtube(search_query, num_queries=2):
    """Return metadata for the top `num_queries` YouTube search results."""
    ydl_opts = {
        "default_search": f"ytsearch{num_queries}",
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        search_results = ydl.extract_info(search_query, download=False)

    return [
        VideoMetadata(
            title=x['title'],
            # yt-dlp reports upload_date as a 'YYYYMMDD' string, not a Unix timestamp
            upload_date=datetime.strptime(x['upload_date'], '%Y%m%d').strftime('%m/%d/%Y'),
            duration_s=x['duration'],
            url=x['webpage_url'],
        )
        for x in search_results['entries']
    ]

def get_transcript(url, fast=True):
    if fast:
        vid_id = pytubefix.YouTube(url).video_id
    else:
        with yt_dlp.YoutubeDL({'quiet':True}) as ydl:
            vid_id = ydl.extract_info(url, download=False)['id']
    return YouTubeTranscriptApi.get_transcript(vid_id)
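
Both branches of `get_transcript` only need the video ID, which is already present in the URL. As a rough illustration of what that extraction involves, here is a minimal standard-library sketch. The name `extract_video_id` and the list of accepted hosts are assumptions made for this example; this is not how pytubefix or yt-dlp implement it, and it covers only the common URL shapes.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# YouTube video IDs are 11 characters drawn from this alphabet.
_ID_RE = re.compile(r"[0-9A-Za-z_-]{11}")


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: pull the 11-character video ID out of a YouTube URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/")
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        else:
            # Path-style links such as /embed/<id> or /shorts/<id>
            parts = [p for p in parsed.path.split("/") if p]
            candidate = parts[1] if len(parts) >= 2 else ""
    else:
        return None

    # Matching only the first 11 valid characters drops stray trailing
    # punctuation (for example a "?" that ended the surrounding sentence).
    match = _ID_RE.match(candidate)
    return match.group(0) if match else None
```

Calling `extract_video_id("https://www.youtube.com/watch?v=WsQQvHm4lSw")` returns `"WsQQvHm4lSw"`, and URLs from other hosts return `None`, which gives a cheap validity check before any network call is made.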

In [None]:
import logging
from llama_stack_client.lib.agents.client_tool import client_tool

# Set up logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

@client_tool
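def get_video_transcript(video_url: str) -> str:
    """Simple YouTube video information tool that provides a timestamped video transcription.

    :param video_url: The complete YouTube video URL to extract information from
    :returns: String containing the video transcription, one segment per line, each prefixed by its start time in seconds in brackets
    """
    vid_id = pytubefix.YouTube(video_url).video_id
    transcript = YouTubeTranscriptApi.get_transcript(vid_id)
    # get_transcript returns a list of segment dicts; join them into the string the signature promises
    return "\n".join(f"[{seg['start']:.1f}] {seg['text']}" for seg in transcript)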
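
The tool's docstring promises a string with bracketed timestamps, while the transcript API hands back a list of segments. The sketch below shows one way to render such a list as `[mm:ss] text` lines, with an optional character cap so a long video does not overflow a small model's context. The helper name `format_transcript`, the `max_chars` parameter, and the segment shape (dicts with `text`, `start`, and `duration` keys, as youtube-transcript-api 0.6.x returns from `get_transcript`) are assumptions for this example.

```python
from typing import Dict, List, Optional, Union

Segment = Dict[str, Union[str, float]]


def format_transcript(segments: List[Segment], max_chars: Optional[int] = None) -> str:
    """Hypothetical helper: render transcript segments as '[mm:ss] text' lines."""
    lines = []
    total = 0
    for seg in segments:
        # 'start' is assumed to be the segment's offset in seconds.
        minutes, seconds = divmod(int(seg["start"]), 60)
        line = f"[{minutes:02d}:{seconds:02d}] {str(seg['text']).strip()}"
        # Stop before exceeding the cap (the +1 accounts for the newline).
        if max_chars is not None and total + len(line) + 1 > max_chars:
            break
        lines.append(line)
        total += len(line) + 1
    return "\n".join(lines)
```

Truncating from the end is the simplest policy but loses the conclusion of long videos; sampling segments evenly or summarizing in chunks are alternatives worth considering if full coverage matters.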

In [None]:
import os

with open('api_key', 'r') as f:
    # strip() drops the trailing newline that readline() would otherwise leave in the key
    os.environ['TOGETHER_API_KEY'] = f.readline().strip()

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("together")
client.initialize()
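
Reading a key from a local file is easy to get subtly wrong: a trailing newline or an empty file only surfaces later as an authentication error. The sketch below is one defensive way to load it, preferring an environment variable that is already set and failing early otherwise. The helper name `load_api_key` and its parameters are assumptions for this example; only the `api_key` file name and the `TOGETHER_API_KEY` variable come from the cell above.

```python
import os
from pathlib import Path


def load_api_key(path: str = "api_key", env_var: str = "TOGETHER_API_KEY") -> str:
    """Hypothetical helper: return the API key from env_var, else from the first line of path."""
    # Prefer a value that is already present in the environment.
    key = os.environ.get(env_var, "").strip()

    if not key:
        key_file = Path(path)
        if key_file.is_file():
            lines = key_file.read_text(encoding="utf-8").splitlines()
            # Use only the first line and strip surrounding whitespace.
            key = lines[0].strip() if lines else ""

    if not key:
        raise RuntimeError(
            f"No API key found: set {env_var} or put the key on the first line of {path!r}"
        )
    return key
```

With this in place, the assignment in the cell above could become `os.environ['TOGETHER_API_KEY'] = load_api_key()`, and the same helper could serve other providers by passing a different `env_var`.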

In [None]:
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types import ToolDef, ToolInvocationResult

model_id = "meta-llama/Llama-3.1-8B-Instruct"
client_tools = [get_video_transcript]
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant. Use the get_video_transcript tool to fetch a YouTube video's transcript before answering questions about it.",
    toolgroups=[],
    client_tools=[
            client_tool.get_tool_definition() for client_tool in client_tools
        ],
    tool_choice="auto",
    enable_session_persistence=True,
)

agent = Agent(client, agent_config, client_tools)

user_prompts = [
    "Hello",
    # Keep punctuation away from the end of the URL so it is not parsed as part of the video ID
    "Can you help me summarize this YouTube video? https://www.youtube.com/watch?v=WsQQvHm4lSw",
]

session_id = agent.create_session("test-session")
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in EventLogger().log(response):
        log.print()

In [1]:
import os
from groq import Groq

with open('api_key', 'r') as f:
    api_key = f.readline().strip()  # drop the trailing newline

client = Groq(
    api_key=api_key,  # if omitted, the client falls back to the GROQ_API_KEY environment variable
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of low latency LLMs",
        }
    ],
    model="deepseek-r1-distill-llama-70b",
)
print(chat_completion.choices[0].message.content)

<think>
Okay, so I'm trying to understand why low latency is important for large language models (LLMs). I remember reading that latency refers to the delay before a response is received, so low latency means faster responses. But why is that a big deal for LLMs?

First, I think about where LLMs are used. They're in things like chatbots, virtual assistants, and maybe even in real-time applications. So if someone is using a chatbot and asks a question, they don't want to wait a long time for an answer. If the latency is high, the user experience would be slow and frustrating. That makes sense. People expect quick responses, especially when they're interacting in real-time, like in a conversation.

Then there's real-time applications. I'm not entirely sure what qualifies as a real-time application, but maybe things like live translation or live subtitles. If you're translating speech in real-time, any delay could make the translation useless because the speaker has already moved on. So l
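
As the output above shows, this DeepSeek-R1 distill model emits its reasoning inside a `<think>` block before the answer. If only the final answer is wanted, the two parts can be separated after the call. The following is a minimal sketch, assuming the reasoning is delimited by a single `<think>...</think>` pair; the helper name `split_reasoning` is hypothetical and not part of the Groq SDK.

```python
import re
from typing import Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Hypothetical helper: return (reasoning, answer) from a response with a <think> block."""
    match = _THINK_RE.search(text)
    if match is None:
        stripped = text.lstrip()
        if stripped.startswith("<think>"):
            # Opening tag without a closing one (for example a truncated
            # response): treat everything after the tag as reasoning.
            return stripped[len("<think>"):].strip(), ""
        # No reasoning block at all: the whole text is the answer.
        return "", text.strip()

    reasoning = match.group(1).strip()
    # The answer is whatever surrounds the reasoning block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For the cell above, `reasoning, answer = split_reasoning(chat_completion.choices[0].message.content)` would let the notebook print only `answer` while keeping `reasoning` available for inspection.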