logo

LLMTrack is a Python package designed to streamline the usage of language models, especially during batch generation. It offers easy model loading, generation caching to avoid redundant API calls, detailed logging, and per-model token usage recording. Whether you're doing research or deploying models in production, LLMTrack helps you manage and monitor your models efficiently.

Installation

pip install llmtrack

LLM Loading

from llmtrack import get_llm
llm = get_llm(model_name="openai/gpt-4o-mini")
print(llm.generate("Generate ONLY a random word"))

A public LLM API is selected simply by the model_name argument, which combines an API provider and a model name in the form {provider}/{model}. The supported APIs are listed below (see the sketch after the list):

  • OpenAI, e.g., "openai/xxxx" (replace xxxx with a specific model name)
    • The environment variable OPENAI_API_KEY must be set
    • Popular model_name values: gpt-4o-mini, gpt-3.5-turbo
    • All available model_name values: see the documentation
  • Azure OpenAI, e.g., "azure_openai/chatgpt-4k"
    • The three environment variables AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and AZURE_OPENAI_API_VERSION must be set
    • Ask your provider for the specific model names
  • MoonShot, e.g., "moonshot/moonshot-v1-8k"
    • The environment variable MOONSHOT_API_KEY must be set
  • Groq
    • Popular model_name values: llama3-8b-8192, llama3-70b-8192
    • All available model_name values: see the documentation
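A minimal sketch of provider selection, assuming a valid API key (the key value and the "Say hello" prompt below are placeholders):

import os
from llmtrack import get_llm

# Placeholder key: replace with your own before running.
os.environ["OPENAI_API_KEY"] = "sk-..."

# The provider prefix ("openai", "azure_openai", "moonshot", "groq") selects the API;
# the suffix is the provider-specific model name.
llm = get_llm(model_name="openai/gpt-3.5-turbo")
print(llm.generate("Say hello"))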

Unified Parameters

Parameter              Description
num_return_sequences   Number of sequences to return; defaults to 1. Same as n in the OpenAI API
temperature            Sampling temperature; higher values (> 1.0) are more random, lower values (< 1.0) are more deterministic
max_tokens             Maximum number of tokens to generate
top_p                  Nucleus (top-p) sampling threshold; see the paper: https://arxiv.org/abs/1904.09751
stop                   Stop sequence at which generation halts

An example:

params = {"temperature": 0.2, "num_return_sequences": 1}
print(llm.generate("Generate ONLY a random word", **params))

Caching

from llmtrack import get_llm
llm = get_llm(model_name="openai/gpt-4o-mini", cache=True)
print(llm.generate("Generate ONLY a random word "))

After running the code above, the generation cache is stored in cache_llmtrack/openai/gpt-4o-mini, following the naming rule cache_llmtrack/{API provider}/{model name}.

If you invoke the same model with the same prompt, the cache will be used.

Note: You can verify this by checking whether token usage increases, using the next feature: Token Usage Tracking.
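Another way to check that the cache is being hit is to time two identical calls and inspect the cache folder; this is a sketch, assuming the cache directory layout described above:

import os
import time
from llmtrack import get_llm

llm = get_llm(model_name="openai/gpt-4o-mini", cache=True)

start = time.time()
llm.generate("Generate ONLY a random word ")   # first call goes to the API
first = time.time() - start

start = time.time()
llm.generate("Generate ONLY a random word ")   # identical prompt: served from the cache
second = time.time() - start

print(f"first call: {first:.2f}s, cached call: {second:.2f}s")
print(os.listdir("cache_llmtrack/openai/gpt-4o-mini"))  # cache files for this model (assumed path)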

Token Usage Tracking

from llmtrack import get_llm
llm = get_llm("openai/gpt-4o-mini", cache=True, token_usage=True)
print(llm.generate("Generate ONLY a random word "))
print(llm.generate("Generate ONLY a random word "))

The token usage is tracked in ./gpt-4o-mini_token_usage.json. Only one record exists even though the LLM was invoked twice, because the second call was served from the cache.

{"prompt": 17, "completion": 4, "total": 21, "time": "2024-08-25 14:02:36"}

Logging (TBA)
