Add support for LiteLLM, to get Azure, Anthropic, and many others out of the box #10

Merged: 2 commits, Aug 26, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@ chainlit.md
build/
langstream.egg-info/
.chroma
litellm_uuid.txt

# Generated markdown files from jupyter notebooks
docs/docs/examples/*.md
51 changes: 51 additions & 0 deletions docs/docs/llms/lite_llm.md
@@ -0,0 +1,51 @@
---
sidebar_position: 5
---

# Lite LLM (Azure, Anthropic, etc)

[Lite LLM](https://github.com/BerriAI/litellm) is a library that wraps many other LLM APIs, such as Azure, Anthropic, Cohere, HuggingFace, Replicate, and so on, standardizing all of them to the same API surface as the OpenAI Chat Completion API. LangStream provides a wrapper for LiteLLM so you can build your streams on most of the LLMs available on the market. To check all the possible models, take a look at the [Lite LLM Docs](https://docs.litellm.ai/docs/completion/supported).

To use the multiple LLMs provided by Lite LLM, you will need to have the API keys for the models you are going to use available in your environment, for example:

```bash
export OPENAI_API_KEY=<your key here>
export AZURE_API_BASE=<your azure api base here>
export ANTHROPIC_API_KEY=<your key here>
export HUGGINGFACE_API_KEY=<your key here>
```

Then, you should be able to use the [`LiteLLMChatStream`](pathname:///reference/langstream/contrib/index.html#langstream.contrib.LiteLLMChatStream), which has basically the same interface as the [`OpenAIChatStream`](pathname:///reference/langstream/contrib/index.html#langstream.contrib.OpenAIChatStream). Check it out:

## Chat Completion

```python
from langstream import Stream, join_final_output
from langstream.contrib import LiteLLMChatStream, LiteLLMChatMessage, LiteLLMChatDelta

recipe_stream: Stream[str, str] = LiteLLMChatStream[str, LiteLLMChatDelta](
"RecipeStream",
lambda recipe_name: [
LiteLLMChatMessage(
role="system",
content="You are ChefLiteLLM, an assistant bot trained on all culinary knowledge of the world's most prominent Michelin Chefs",
),
LiteLLMChatMessage(
role="user",
content=f"Hello, could you write me a recipe for {recipe_name}?",
),
],
model="gpt-3.5-turbo",
).map(lambda delta: delta.content)

await join_final_output(recipe_stream("instant noodles"))
#=> "Of course! Here's a simple and delicious recipe for instant noodles:\n\nIngredients:\n- 1 packet of instant noodles (your choice of flavor)\n- 2 cups of water\n- 1 tablespoon of vegetable oil\n- 1 small onion, thinly sliced\n- 1 clove of garlic, minced\n- 1 small carrot, julienned\n- 1/2 cup of sliced mushrooms\n- 1/2 cup of shredded cabbage\n- 2 tablespoons of soy sauce\n- 1 teaspoon of sesame oil\n- Optional toppings: sliced green onions, boiled egg, cooked chicken or shrimp, chili flakes\n\nInstructions:\n1. In a medium-sized pot, bring the water to a boil. Add the instant noodles and cook according to the package instructions until they are al dente. Drain and set aside.\n\n2. In the same pot, heat the vegetable oil over medium heat. Add the sliced onion and minced garlic, and sauté until they become fragrant and slightly caramelized.\n\n3. Add the julienned carrot, sliced mushrooms, and shredded cabbage to the pot. Stir-fry for a few minutes until the vegetables are slightly softened.\n\n4. Add the cooked instant noodles to the pot and toss them with the vegetables.\n\n5. In a small bowl, mix together the soy sauce and sesame oil. Pour this mixture over the noodles and vegetables, and toss everything together until well combined.\n\n6. Cook for an additional 2-3 minutes, stirring occasionally, to allow the flavors to meld together.\n\n7. Remove the pot from heat and divide the noodles into serving bowls. Top with your desired toppings such as sliced green onions, boiled egg, cooked chicken or shrimp, and chili flakes.\n\n8. Serve the instant noodles hot and enjoy!\n\nFeel free to customize this recipe by adding your favorite vegetables or protein. Enjoy your homemade instant noodles!"
```

Analogous to OpenAI, it takes [`LiteLLMChatMessage`](pathname:///reference/langstream/contrib/index.html#langstream.contrib.LiteLLMChatMessage)s and produces [`LiteLLMChatDelta`](pathname:///reference/langstream/contrib/index.html#langstream.contrib.LiteLLMChatDelta)s. It can also take `functions` as an argument for function calling, but keep in mind that not all models support it and the argument might simply be ignored, so be sure to check the [Lite LLM Docs](https://docs.litellm.ai/docs/completion/supported) for the model you are using; a minimal sketch is shown below.
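
As a quick illustration, here is a minimal function-calling sketch, adapted from the example in the `LiteLLMChatStream` docstring; the `get_current_weather` function and its schema are illustrative assumptions, and the model you pick must actually support function calling:

```python
import json
from typing import Dict, Union

from langstream import Stream, collect_final_output
from langstream.contrib import LiteLLMChatStream, LiteLLMChatMessage, LiteLLMChatDelta


# Illustrative local function that the model can "call"
def get_current_weather(location: str) -> Dict[str, str]:
    return {"location": location, "forecast": "sunny", "temperature": "25 C"}


weather_stream: Stream[str, Union[LiteLLMChatDelta, Dict[str, str]]] = LiteLLMChatStream[
    str, Union[LiteLLMChatDelta, Dict[str, str]]
](
    "WeatherStream",
    lambda user_input: [LiteLLMChatMessage(role="user", content=user_input)],
    model="gpt-3.5-turbo",
    functions=[
        {
            "name": "get_current_weather",
            "description": "Gets the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "description": "The city to get the weather for, e.g. San Francisco",
                        "type": "string",
                    },
                },
                "required": ["location"],
            },
        }
    ],
    temperature=0,
).map(
    # When the model decides to call the function, the delta has role="function"
    # and its content is a JSON string with the arguments
    lambda delta: get_current_weather(**json.loads(delta.content))
    if delta.role == "function" and delta.name == "get_current_weather"
    else delta
)

await collect_final_output(weather_stream("how is the weather today in Rio de Janeiro?"))
#=> [{'location': 'Rio de Janeiro', 'forecast': 'sunny', 'temperature': '25 C'}]
```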

Another caveat is that, by default, LangStream tries to stream all outputs, but not all models and APIs support streaming, so you might need to disable it with `stream=False`, otherwise they might throw exceptions. Again, be sure to check which models support streaming; a small example of disabling it follows.
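
Here is a minimal sketch of disabling streaming, assuming a model that does not support it (the model name below is just an illustration, substitute whichever provider you are using):

```python
from langstream import Stream, join_final_output
from langstream.contrib import LiteLLMChatStream, LiteLLMChatMessage, LiteLLMChatDelta

# With stream=False the whole completion arrives at once instead of
# one delta per token
haiku_stream: Stream[str, str] = LiteLLMChatStream[str, LiteLLMChatDelta](
    "HaikuStream",
    lambda topic: [
        LiteLLMChatMessage(role="user", content=f"Write a haiku about {topic}"),
    ],
    model="command-nightly",  # illustrative model name, substitute your provider's model
    stream=False,
).map(lambda delta: delta.content)

await join_final_output(haiku_stream("instant noodles"))
```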

We hope that with this you will be able to use the best LLM available to you, or even mix and match LLMs in the middle of your streams depending on the need, or fall back to another one if an LLM is not generating the right answer, and so on.

Keep on reading the next part of the docs!
2 changes: 1 addition & 1 deletion docs/docs/llms/memory.md
@@ -1,5 +1,5 @@
---
sidebar_position: 5
sidebar_position: 6
---

# Adding Memory
2 changes: 1 addition & 1 deletion docs/docs/llms/zero_temperature.md
@@ -1,5 +1,5 @@
---
sidebar_position: 6
sidebar_position: 7
---

# Zero Temperature
8 changes: 8 additions & 0 deletions langstream/contrib/__init__.py
@@ -16,11 +16,19 @@
OpenAIChatDelta,
)
from langstream.contrib.llms.gpt4all_stream import GPT4AllStream
from langstream.contrib.llms.lite_llm import (
LiteLLMChatStream,
LiteLLMChatMessage,
LiteLLMChatDelta,
)

__all__ = (
"OpenAICompletionStream",
"OpenAIChatStream",
"OpenAIChatMessage",
"OpenAIChatDelta",
"GPT4AllStream",
"LiteLLMChatStream",
"LiteLLMChatMessage",
"LiteLLMChatDelta",
)
304 changes: 304 additions & 0 deletions langstream/contrib/llms/lite_llm.py
@@ -0,0 +1,304 @@
import asyncio
from dataclasses import dataclass
import itertools
from types import GeneratorType
from typing import (
Any,
AsyncGenerator,
Callable,
Dict,
List,
Literal,
Optional,
TypeVar,
Union,
cast,
)

import litellm
from colorama import Fore
from retry import retry

from langstream.core.stream import Stream, StreamOutput

T = TypeVar("T")
U = TypeVar("U")
V = TypeVar("V")


@dataclass
class LiteLLMChatMessage:
"""
LiteLLMChatMessage is a data class that represents a chat message for building a `LiteLLMChatStream` prompt.

Attributes
----------
role : Literal["system", "user", "assistant", "function"]
The role of who sent this message in the chat; it can be one of `"system"`, `"user"`, `"assistant"` or `"function"`

name: Optional[str]
The name is used when `role` is `"function"`; it represents the name of the function that was called

content : str
A string with the full content of what the given role said

"""

role: Literal["system", "user", "assistant", "function"]
content: str
name: Optional[str] = None

def to_dict(self):
return {k: v for k, v in self.__dict__.items() if v is not None}


@dataclass
class LiteLLMChatDelta:
"""
LiteLLMChatDelta is a data class that represents the output of a `LiteLLMChatStream`.

Attributes
----------
role : Optional[Literal["assistant", "function"]]
The role of the output message; the first delta will have the role, while
the subsequent partial content ones will have the role as `None`.
The possible values are `None`, `"assistant"` or `"function"`

name: Optional[str]
The name is used when `role` is `"function"`; it represents the name of the function that was called

content : str
A string with the partial content being outputted by the LLM; this generally
translates to each token the LLM is producing

"""

role: Optional[Literal["assistant", "function"]]
content: str
name: Optional[str] = None

def __stream_debug__(self):
name = ""
if self.name:
name = f" {self.name}"
if self.role is not None:
print(f"{Fore.YELLOW}{self.role.capitalize()}{name}:{Fore.RESET} ", end="")
print(
self.content,
end="",
flush=True,
)


class LiteLLMChatStream(Stream[T, U]):
"""
`LiteLLMChatStream` is a wrapper for [LiteLLM](https://github.com/BerriAI/litellm), which gives you access to OpenAI, Azure OpenAI, Anthropic, Google VertexAI,
HuggingFace, Replicate, AI21, Cohere and a bunch of other LLMs, all at the same time, all while keeping the standard OpenAI chat interface. Check out the completion API
and the available models [on their docs](https://docs.litellm.ai/docs/).

Be aware that not all models support streaming, and LangStream by default tries to stream everything. So if the model you choose is not working, you might need to set `stream=False`
when calling the `LiteLLMChatStream`.

The `LiteLLMChatStream` takes a lambda function that should return a list of `LiteLLMChatMessage` for the assistant to reply to. It is stateless, so it doesn't keep
memory of past chat messages; you will have to handle the memory yourself, you can [follow this guide to get started on memory](https://rogeriochaves.github.io/langstream/docs/llms/memory).

The `LiteLLMChatStream` also produces `LiteLLMChatDelta` as output, one per token. Each delta contains the `role` that started the output, followed by subsequent `content` updates.
If you want the final content as a string, you will need to use the `.content` property from the deltas and accumulate them for the final result.

To use this stream you will need to have the proper environment keys available depending on the model you are using, like `OPENAI_API_KEY`, `COHERE_API_KEY`, `HUGGINGFACE_API_KEY`, etc.
Check out more details on the [LiteLLM docs](https://docs.litellm.ai/docs/completion/supported)

Example
-------

>>> from langstream import Stream, join_final_output
>>> from langstream.contrib import LiteLLMChatStream, LiteLLMChatMessage, LiteLLMChatDelta
>>> import asyncio
...
>>> async def example():
... recipe_stream: Stream[str, str] = LiteLLMChatStream[str, LiteLLMChatDelta](
... "RecipeStream",
... lambda recipe_name: [
... LiteLLMChatMessage(
... role="system",
... content="You are Chef Claude, an assistant bot trained on all culinary knowledge of the world's most prominent Michelin Chefs",
... ),
... LiteLLMChatMessage(
... role="user",
... content=f"Hello, could you write me a recipe for {recipe_name}?",
... ),
... ],
... model="claude-2",
... max_tokens=10,
... ).map(lambda delta: delta.content)
...
... return await join_final_output(recipe_stream("instant noodles"))
...
>>> asyncio.run(example()) # doctest:+SKIP
"Of course! Here's a simple and delicious recipe"

You can also pass LiteLLM function schemas in the `functions` argument with all the parameter definitions, just like for OpenAI models, but be aware that not all models support it.
Once you pass a `functions` param, the model may then produce a `function` role `LiteLLMChatDelta` as output,
using your function, with the `content` field as a JSON string which you can parse to call an actual function.

Take a look [at our OpenAI guide](https://rogeriochaves.github.io/langstream/docs/llms/open_ai_functions) to learn more about LLM function calls in LangStream, it works the same with LiteLLM.

Function Call Example
---------------------

>>> from langstream import Stream, collect_final_output
>>> from langstream.contrib import LiteLLMChatStream, LiteLLMChatMessage, LiteLLMChatDelta
>>> from typing import Literal, Union, Dict
>>> import json
>>> import asyncio
...
>>> async def example():
... def get_current_weather(
... location: str, format: Literal["celsius", "fahrenheit"] = "celsius"
... ) -> Dict[str, str]:
... return {
... "location": location,
... "forecast": "sunny",
... "temperature": "25 C" if format == "celsius" else "77 F",
... }
...
... stream : Stream[str, Union[LiteLLMChatDelta, Dict[str, str]]] = LiteLLMChatStream[str, Union[LiteLLMChatDelta, Dict[str, str]]](
... "WeatherStream",
... lambda user_input: [
... LiteLLMChatMessage(role="user", content=user_input),
... ],
... model="gpt-3.5-turbo",
... functions=[
... {
... "name": "get_current_weather",
... "description": "Gets the current weather in a given location, use this function for any questions related to the weather",
... "parameters": {
... "type": "object",
... "properties": {
... "location": {
... "description": "The city to get the weather, e.g. San Francisco. Guess the location from user messages",
... "type": "string",
... },
... "format": {
... "description": "The temperature unit to use, either celsius or fahrenheit",
... "type": "string",
... "enum": ("celsius", "fahrenheit"),
... },
... },
... "required": ["location"],
... },
... }
... ],
... temperature=0,
... ).map(
... lambda delta: get_current_weather(**json.loads(delta.content))
... if delta.role == "function" and delta.name == "get_current_weather"
... else delta
... )
...
... return await collect_final_output(stream("how is the weather today in Rio de Janeiro?"))
...
>>> asyncio.run(example()) # doctest:+SKIP
[{'location': 'Rio de Janeiro', 'forecast': 'sunny', 'temperature': '25 C'}]

"""

def __init__(
self: "LiteLLMChatStream[T, LiteLLMChatDelta]",
name: str,
call: Callable[
[T],
List[LiteLLMChatMessage],
],
model: str,
custom_llm_provider: Optional[str] = None,
stream: bool = True,
functions: Optional[List[Dict[str, Any]]] = None,
function_call: Optional[Union[Literal["none", "auto"], Dict[str, Any]]] = None,
temperature: Optional[float] = 0,
max_tokens: Optional[int] = None,
timeout: int = 5,
retries: int = 3,
) -> None:
async def chat_completion(
messages: List[LiteLLMChatMessage],
) -> AsyncGenerator[StreamOutput[LiteLLMChatDelta], None]:
loop = asyncio.get_event_loop()

@retry(tries=retries)
def get_completions():
function_kwargs = {}
if functions is not None:
function_kwargs["functions"] = functions
if function_call is not None:
function_kwargs["function_call"] = function_call

return litellm.completion(
request_timeout=timeout,
model=model,
custom_llm_provider=custom_llm_provider,
messages=[m.to_dict() for m in messages],
temperature=temperature, # type: ignore (why is their type int?)
stream=stream,
max_tokens=max_tokens, # type: ignore (why is their type float?)
**function_kwargs,
)

completions = await loop.run_in_executor(None, get_completions)

pending_function_call: Optional[LiteLLMChatDelta] = None

# litellm returns a generator when streaming and a single response otherwise;
# normalize to an iterable so the loop below handles both cases
completions = (
completions if isinstance(completions, GeneratorType) else [completions]
)
for output in completions:
output = cast(dict, output)
if "choices" not in output:
continue

if len(output["choices"]) == 0:
continue

delta = (
output["choices"][0]["message"]
if "delta" not in output["choices"][0]
else output["choices"][0]["delta"]
)

# Function-call arguments arrive in pieces when streaming; accumulate them
# into a single delta and only emit it once the arguments are complete
if "function_call" in delta:
role = delta["role"] if "role" in delta else None
function_name: Optional[str] = delta["function_call"].get("name")
function_arguments: Optional[str] = delta["function_call"].get(
"arguments"
)

if function_name is not None:
pending_function_call = LiteLLMChatDelta(
role="function",
name=function_name,
content=function_arguments or "",
)
elif (
pending_function_call is not None
and function_arguments is not None
):
pending_function_call.content += function_arguments
elif "content" in delta:
role = delta["role"] if "role" in delta else None
yield self._output_wrap(
LiteLLMChatDelta(
role=role,
content=delta["content"],
)
)
else:
if pending_function_call:
yield self._output_wrap(pending_function_call)
pending_function_call = None
if pending_function_call:
yield self._output_wrap(pending_function_call)
pending_function_call = None

super().__init__(
name,
lambda input: cast(AsyncGenerator[U, None], chat_completion(call(input))),
)