# Unify

[Unify](https://unify.ai/hub) dynamically routes each query to the best LLM, with support for providers such as OpenAI, Mistral AI, Perplexity AI, and Together AI. You can also access each provider individually using a single API key.

You can check out our [live benchmarks](https://unify.ai/hub/mixtral-8x7b-instruct-v0.1) to see where the data is coming from!

## Installation

First, let's install LlamaIndex 🦙 and the Unify integration.

In [None]:
%pip install llama-index-llms-unify llama-index

## Environment Setup

Make sure to set the `UNIFY_API_KEY` environment variable. You can get a key in the [Unify Console](https://console.unify.ai/login).

In [None]:
import os
os.environ["UNIFY_API_KEY"] = "<YOUR API KEY>"
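
A missing or placeholder key usually only surfaces when the first request fails, so it can help to check it up front. The sketch below is an illustration only: `require_api_key` is a hypothetical helper, not part of the Unify integration, and it only assumes that the key lives in `UNIFY_API_KEY` as described above.

```python
import os


def require_api_key(name: str = "UNIFY_API_KEY") -> str:
    """Return the key stored in the environment, or fail with a clear message.

    Hypothetical helper for illustration; not part of llama-index or Unify.
    """
    key = os.environ.get(name, "")
    # Treat an unset variable and the notebook's placeholder value the same way.
    if not key or key == "<YOUR API KEY>":
        raise RuntimeError(
            f"Set the {name} environment variable before creating the LLM."
        )
    return key
```

Calling `require_api_key()` once at the top of a notebook makes a configuration problem visible before any request is sent.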

## Using LlamaIndex with Unify

### Routing a request

The first thing we can do is initialize and query a chat model. To configure Unify's router, pass an endpoint string to `Unify`. You can read more about this in [Unify's docs](https://unify.ai/docs/hub/concepts/runtime_routing.html).

In this case, we use the endpoint for `llama-2-70b-chat` with the lowest input cost and then call `complete`.

In [9]:
from llama_index.llms.unify import Unify
llm = Unify(model="llama-2-70b-chat@input-cost")
llm.complete("How are you today, llama?")

CompletionResponse(text="  I'm doing well, thanks for asking! It's always a pleasure to chat with you. I hope you're having a great day too! Is there anything specific you'd like to talk about or ask me? I'm here to help with any questions you might have.", additional_kwargs={}, raw={'id': 'meta-llama/Llama-2-70b-chat-hf-b90de288-1927-4f32-9ecb-368983c45321', 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="  I'm doing well, thanks for asking! It's always a pleasure to chat with you. I hope you're having a great day too! Is there anything specific you'd like to talk about or ask me? I'm here to help with any questions you might have.", role='assistant', function_call=None, tool_calls=None, tool_call_id=None))], 'created': 1711047739, 'model': 'llama-2-70b-chat@anyscale', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=62, prompt_tokens=16, total_tokens=78, cost=7.8e-05)}, logprobs
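
In the output above, the `model` field of the raw response reads `llama-2-70b-chat@anyscale`, which suggests that the router reports the provider it selected in the same `model@provider` form as the endpoint string. A minimal sketch for splitting such a string follows; `split_endpoint` is a hypothetical helper, and the assumption that the part after `@` is either a provider or a routing metric comes only from the examples in this notebook.

```python
from typing import Tuple


def split_endpoint(endpoint: str) -> Tuple[str, str]:
    """Split 'model@target' into (model, target).

    `target` is whatever follows the first '@': a provider name such as
    'anyscale' or a routing metric such as 'input-cost'. It is an empty
    string when the endpoint contains no '@'.
    """
    model, _, target = endpoint.partition("@")
    return model, target


# Value copied from the raw response shown above.
model, provider = split_endpoint("llama-2-70b-chat@anyscale")
print(model, provider)
```

With a real response, the same helper could be applied to `response.raw["model"]`.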

### Single Sign-On

If you don't want the router to select the provider, you can also use our SSO to query endpoints across different providers without creating an account with each of them. For example, all of these are valid endpoints:

In [10]:
llm = Unify(model="llama-2-70b-chat@together-ai")
llm = Unify(model="gpt-3.5-turbo@openai")
llm = Unify(model="mixtral-8x7b-instruct-v0.1@mistral-ai")

This allows you to quickly switch and test different models and providers. For example, if you are working on an application that uses gpt-4 under the hood, you can use this to query a much cheaper LLM during development and/or testing to reduce costs.

Take a look at the available ones [here](https://unify.ai/hub)!
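
One way to switch between a cheaper endpoint during development and another one in production is to choose the endpoint string from configuration. The sketch below is only an illustration: the `APP_ENV` variable, the `ENDPOINTS` mapping, and `pick_endpoint` are hypothetical names, and the endpoint strings are the ones listed above, to be replaced with whatever your application actually uses.

```python
import os

# Endpoint strings taken from the examples above; the stage names are assumptions.
ENDPOINTS = {
    "production": "gpt-3.5-turbo@openai",
    "development": "llama-2-70b-chat@together-ai",
}


def pick_endpoint(env_var: str = "APP_ENV") -> str:
    """Return the endpoint for the current stage, defaulting to development."""
    stage = os.environ.get(env_var, "development")
    # Unknown stage names also fall back to the development endpoint.
    return ENDPOINTS.get(stage, ENDPOINTS["development"])
```

The result can then be passed to the constructor, for example `Unify(model=pick_endpoint())`.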

### Streaming and optimizing for latency

If you are building an application where responsiveness is key, you most likely want a streaming response. On top of that, you would ideally use the provider with the lowest time to first token (TTFT), to reduce the time your users wait for a response. With Unify, this looks something like:

In [15]:
llm = Unify(model="mixtral-8x7b-instruct-v0.1@ttft")

response = llm.stream_complete(
    "Translate the following to German: "
    "Hey, there's an emergency in translation street, "
    "please send help asap!"
)

In [16]:
show_provider = True
for r in response:
    if show_provider:
        print(f"Model and provider are : {r.raw['model']}\n")
        show_provider = False
    print(r.delta, end="", flush=True)

Model and provider are : mixtral-8x7b-instruct-v0.1@mistral-ai

Hallo, es gibt einen Notfall in der Übersetzungsstraße, bitte senden Sie Hilfe so schnell wie möglich!

(Note: This is a literal translation and the term "Übersetzungsstraße" is not a standard or commonly used term in German. A more natural way to express the idea of a "emergency in translation" could be "Notfall bei Übersetzungen" or "akute Übersetzungsnotwendigkeit".)
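
To see how long the first chunk actually takes to arrive, a stream can be wrapped in a small timing generator. This is a rough sketch rather than a benchmark: `timed_stream` and `fake_stream` are hypothetical names, the clock starts when iteration begins (which may not coincide with the moment the request is sent), and a stand-in generator replaces the real stream so the example runs without an API key.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_stream(stream: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds since iteration started, chunk) for every chunk."""
    start = time.perf_counter()
    for chunk in stream:
        yield time.perf_counter() - start, chunk


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response: a short delay, then three chunks."""
    time.sleep(0.05)
    yield from ["Hallo", ", ", "Welt"]


for i, (elapsed, chunk) in enumerate(timed_stream(fake_stream())):
    if i == 0:
        # The first elapsed value approximates the time to first token.
        print(f"First chunk after {elapsed:.3f}s")
    print(chunk, end="", flush=True)
```

With a real response, `timed_stream(llm.stream_complete(...))` would yield the response objects from the loop above, so `chunk.delta` would take the place of the plain string.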

### Async calls and Lowest Input Cost

Last but not least, you can also run requests asynchronously. For tasks like long document summarization, optimizing for input costs is crucial. Unify's dynamic router can do this too!

In [17]:
llm = Unify(model="mixtral-8x7b-instruct-v0.1@input-cost")

response = await llm.acomplete(
    "Summarize this in 10 words or less. OpenAI is a U.S. based artificial intelligence "
    "(AI) research organization founded in December 2015, researching artificial intelligence "
    "with the goal of developing 'safe and beneficial' artificial general intelligence, "
    "which it defines as 'highly autonomous systems that outperform humans at most economically "
    "valuable work'. As one of the leading organizations of the AI spring, it has developed "
    "several large language models, advanced image generation models, and previously, released "
    "open-source models. Its release of ChatGPT has been credited with starting the AI spring"
)

print(f"Model and provider are : {response.raw['model']}\n")
print(response)

Model and provider are : mixtral-8x7b-instruct-v0.1@deepinfra

 OpenAI: Pioneering 'safe' artificial general intelligence.
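
Async calls pay off most when several requests are in flight at once. The sketch below shows one way to do that with `asyncio.gather`; `complete_all` and `fake_acomplete` are hypothetical names, and the stand-in coroutine replaces `llm.acomplete` so the example runs without an API key.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def complete_all(
    acomplete: Callable[[str], Awaitable[Any]], prompts: List[str]
) -> List[Any]:
    """Run one completion per prompt concurrently, preserving input order."""
    # gather schedules every coroutine at once and returns results in order.
    return await asyncio.gather(*(acomplete(p) for p in prompts))


async def fake_acomplete(prompt: str) -> str:
    """Stand-in for an async completion call: waits briefly, echoes in caps."""
    await asyncio.sleep(0.01)
    return prompt.upper()
```

In a notebook, `await complete_all(llm.acomplete, prompts)` would run the real requests. In a plain script, wrap the call in `asyncio.run(...)` instead.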
