- A model provider's API may start rate limiting you if you make too many requests.
- For example, this can happen when you run many parallel queries to benchmark a chat model on a test dataset.
- In that situation, you can use a rate limiter to match the rate at which you make requests to the rate the API allows.

- LangChain comes with a built-in in-memory rate limiter.
- The provided rate limiter can only limit the number of requests per unit time. It does not help if you also need to limit based on the size of the requests.

In [1]:
from langchain_core.rate_limiters import InMemoryRateLimiter

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,  # <-- Super slow! We can only make a request once every 10 seconds!!
    check_every_n_seconds=0.1,  # Wake up every 100 ms to check whether a request is allowed.
    max_bucket_size=10,  # Controls the maximum burst size.
)
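
- The three parameters above are easier to reason about with the token bucket idea in mind, which is what LangChain's documentation says `InMemoryRateLimiter` is based on. The block below is a minimal standalone sketch of that idea, not LangChain's actual code. The class name `TokenBucketSketch`, the injectable `clock` and `sleep` arguments, and the choice to start with an empty bucket are assumptions made for illustration.

```python
import threading
import time
from typing import Callable


class TokenBucketSketch:
    """Illustrative token bucket; NOT LangChain's implementation."""

    def __init__(
        self,
        requests_per_second: float,
        check_every_n_seconds: float = 0.1,
        max_bucket_size: float = 1.0,
        clock: Callable[[], float] = time.monotonic,  # hypothetical hook for testing
        sleep: Callable[[float], None] = time.sleep,  # hypothetical hook for testing
    ) -> None:
        self.requests_per_second = requests_per_second
        self.check_every_n_seconds = check_every_n_seconds
        self.max_bucket_size = max_bucket_size
        self._clock = clock
        self._sleep = sleep
        self._tokens = 0.0  # assumption: the bucket starts empty
        self._last = clock()
        self._lock = threading.Lock()

    def _try_consume(self) -> bool:
        """Credit tokens for elapsed time, then try to spend one."""
        with self._lock:
            now = self._clock()
            earned = (now - self._last) * self.requests_per_second
            # The cap is what bounds the burst size after an idle period.
            self._tokens = min(self.max_bucket_size, self._tokens + earned)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False

    def acquire(self) -> None:
        """Block until one request token is available."""
        while not self._try_consume():
            # Polling interval: smaller values react faster but wake up more often.
            self._sleep(self.check_every_n_seconds)
```

- In this sketch, `requests_per_second=0.1` means a token is earned every 10 seconds, `check_every_n_seconds=0.1` is how often a blocked caller polls, and `max_bucket_size=10` means that after a long idle period up to 10 requests can go through back to back before callers have to wait again.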

In [4]:
import getpass
import os
from langchain.chat_models import init_chat_model

In [3]:
if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API key for Groq: ")

Enter API key for Groq:  ········


In [6]:
model = init_chat_model("llama3-8b-8192", model_provider="groq")

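- Let's check whether the rate limiter is working. It only takes effect when it is attached to the model through the `rate_limiter` argument, for example `init_chat_model("llama3-8b-8192", model_provider="groq", rate_limiter=rate_limiter)`. Once attached, we should only be able to invoke the model once every 10 seconds.
- The model above was created without the limiter, so the calls timed below are not throttled.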

In [8]:
import time

# Time five consecutive calls; each printed value is one call's duration in seconds.
for _ in range(5):
    tic = time.time()
    model.invoke("hello")
    toc = time.time()
    print(toc - tic)

0.6157643795013428
0.33002495765686035
0.3189585208892822
0.3052046298980713
0.7480382919311523
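
- A quick arithmetic check helps when reading timings like the ones above. The sketch below compares the total observed time with the minimum time a limiter at 0.1 requests per second would impose on the same number of calls. The helper `min_expected_seconds` is hypothetical, and the calculation assumes the limiter starts with no stored tokens (`initial_tokens=0`), so every call has to wait for its own token.

```python
def min_expected_seconds(
    n_calls: int,
    requests_per_second: float,
    initial_tokens: float = 0.0,  # assumption: the bucket starts empty
) -> float:
    """Lower bound on total wall time for n_calls under a token-bucket limit."""
    calls_that_must_wait = max(0.0, n_calls - initial_tokens)
    return calls_that_must_wait / requests_per_second


# Durations copied from the output above, in seconds.
observed = [
    0.6157643795013428,
    0.33002495765686035,
    0.3189585208892822,
    0.3052046298980713,
    0.7480382919311523,
]

total = sum(observed)
floor = min_expected_seconds(len(observed), requests_per_second=0.1)

print(f"observed total: {total:.2f} s, throttled minimum: {floor:.0f} s")
print("consistent with throttling" if total >= floor else "not throttled")
```

- Five calls finishing in about 2.3 seconds, against a floor of 50 seconds, means the limiter was not applied to these requests.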
