In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
import re
import sys
import time
import json
import numpy as np
import chromadb

from llama_index.llms.groq import Groq
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [2]:
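# Read the key from the environment rather than hardcoding it in the notebook.
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")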

In [3]:
user_query = """
Explain the importance of low latency LLMs
"""

In [4]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [5]:
llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)

In [6]:
response = llm.complete(user_query)
print(response)

Low-latency Large Language Models (LLMs) are crucial in various applications where real-time or near-real-time processing is essential. Here are some reasons why low-latency LLMs are important:

1. **Interactive Systems**: In interactive systems like chatbots, virtual assistants, or conversational interfaces, low-latency LLMs enable rapid response times, making the interaction feel more natural and engaging. Users expect quick responses, and high latency can lead to frustration and abandonment.
2. **Real-time Decision Making**: In applications like autonomous vehicles, medical diagnosis, or financial trading, low-latency LLMs can process vast amounts of data quickly, enabling rapid decision-making and reaction to changing circumstances.
3. **Live Streaming and Broadcasting**: Low-latency LLMs can facilitate real-time language translation, captioning, or subtitling for live events, ensuring that audiences receive timely and accurate information.
4. **Gaming and Simulation**: In online g

In [7]:
model = "llama3-70b-8192"
collection_name = "ucds-test"
embedding_model = "BAAI/bge-small-en-v1.5"
temperature = 1.0

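# In-memory Chroma client; nothing is persisted between runs.
chroma_client = chromadb.EphemeralClient()
try:
    # Drop any collection left over from re-running this cell.
    chroma_client.delete_collection(collection_name)
except Exception:
    # The collection did not exist yet, so there is nothing to delete.
    pass
chroma_collection = chroma_client.create_collection(collection_name)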

embed_model = HuggingFaceEmbedding(model_name=embedding_model)

documents = SimpleDirectoryReader("./rag_files").load_data()

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
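# Make the temperature-configured LLM the default and pass the API key explicitly.
Settings.llm = Groq(model=model, temperature=temperature, api_key=GROQ_API_KEY)
# Query with the LLM configured above.
query_engine = index.as_query_engine(llm=Settings.llm)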
response = query_engine.query(user_query)
print(response)

There is no mention of low latency LLMs (Large Language Models) in the provided context information. The context appears to be a collection of references and citations from various sources, including news articles, websites, and academic publications, covering a wide range of topics such as biographies, sports, entertainment, and more. Therefore, it is not possible to explain the importance of low latency LLMs based on this context.
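
The answer above is consistent with how vector retrieval generally behaves: the retriever returns the nearest chunks by similarity even when none of them is actually about the query, and the LLM then reports that the context is unrelated. Below is a minimal, self-contained sketch of top-k cosine-similarity retrieval. The three-dimensional vectors and the chunk names (`sports_refs`, `biography_refs`, `film_refs`) are hypothetical toy values, not embeddings produced by `BAAI/bge-small-en-v1.5`, and the helper functions are illustrative rather than LlamaIndex or Chroma internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (score, name) pairs with the highest similarity to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical toy embeddings standing in for chunks of an unrelated corpus.
toy_chunks = {
    "sports_refs": [0.9, 0.1, 0.0],
    "biography_refs": [0.7, 0.3, 0.1],
    "film_refs": [0.2, 0.9, 0.1],
}
# Hypothetical query embedding that points away from every chunk.
toy_query = [0.0, 0.1, 0.9]

results = top_k(toy_query, toy_chunks, k=2)
for score, name in results:
    # Two chunks are still returned, but with low similarity scores.
    print(f"{name}: {score:.3f}")
```

In the notebook itself, the chunks that were actually retrieved and their similarity scores can be inspected through `response.source_nodes` on the query result, which is a quick way to check whether the files in `./rag_files` cover the question at all.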
