# HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

#### Members: Kunal Gurnani, Murad Taher
#### Emails: kunal.gurnani@torontomu.ca, mtaher@torontomu.ca

# Introduction:

### Problem Description:

Although current LLMs exhibit superior capabilities in language understanding, generation, and reasoning, they are still imperfect and confront some challenges on the way to building an advanced AI system. Current LLMs lack the ability to process complex information such as vision and speech, the ability to coordinate multiple models to solve multiple sub-tasks that compose a complex one, and are weaker than some experts (e.g., fine-tuned models).

### Context of the Problem:

Many AI models, LLMs in particular, have become very good at understanding and generating language, but they still fail at tasks that require several kinds of understanding and several specialized tools at once. The challenge is therefore not to build a single, more capable model, but to find a way to integrate multiple models so that together they can solve real-world problems, which are rarely confined to one task or modality.

### Limitation About other Approaches:

Many alternative approaches struggle with complex combinations of information such as pictures and sounds, or fail to bring specialized solutions together for a broader, more general problem. They fall short because they cannot effectively and seamlessly combine multiple AI tools into one unified and coherent task. No existing solution clearly accommodates complex problems with numerous specific needs while also ensuring that all the tools employed work together coherently to produce a logical solution.

### Solution:

The authors propose a control center in which an LLM directs different AI models from Hugging Face through API calls. This lets the system coherently leverage multiple models to solve more complex and multi-faceted tasks than traditional single-model approaches.

# Background

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Alayrac, Jean-Baptiste, et al. [1] | They propose a Vision Language Model named Flamingo that can perform open-ended vision and language tasks| M3W, ALIGN, LTIP, and VTP datasets | Performance lags behind on classification tasks, trade-offs of few-shot learning methods, hallucination |
| Huang, Shaohan, et al. [2] | They propose Kosmos-1, a multimodal large language model that can perceive general modalities, learn in context and follow instructions | The Pile, Common Crawl, English LAION-2B, LAION-400M, and COYO-700M datasets | Worse performance in zero-shot and one-shot language tasks compared to a baseline LLM |
| Shen, Yongliang, et al. [3] | They propose an LLM-powered agent that disassembles tasks based on requests and assigns suitable models to the tasks | Requests submitted by annotators | Requires multiple interactions with LLMs thus increasing time costs and monetary costs |

# Methodology

### Overview

The proposed LLM-powered agent in this paper tackles a wide range of complex AI tasks by connecting LLMs (i.e., ChatGPT) and the ML community (i.e., Hugging Face) and can process inputs from different modalities.  More specifically, the LLM acts as a brain: on one hand, it disassembles tasks based on user requests, and on the other hand, assigns suitable models to the tasks according to the model description. By executing models and integrating results in the planned tasks, HuggingGPT can autonomously fulfill complex user requests. The whole process of HuggingGPT can be divided into four stages:

1. **Task Planning:** Using ChatGPT to analyze the requests of users to understand their intention, and disassemble them into possible solvable tasks.
2. **Model Selection:** To solve the planned tasks, ChatGPT selects expert models that are hosted on Hugging Face based on model descriptions.
3. **Task Execution:** Invoke and execute each selected model, and return the results to ChatGPT.
4. **Response Generation:** Finally, ChatGPT is utilized to integrate the predictions from all models and generate responses for users.

![HuggingGPT](https://drive.google.com/uc?id=1rYzruuiywruPcmZEzcB1JGhKtBdjCEw3)

### Task Planning

The AI assistant performs task parsing on user input, generating a list of tasks with the following format: `[{"task": task, "id": task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]`. The "dep" field lists the ids of the previous tasks that generate a resource on which the current task relies. The tag "<resource>-task_id" represents the text, image, audio, or video generated by the dependency task with the corresponding task_id.

#### Template for Task Planning

| Name | Definitions |
| ---- | ------ |
| task | It represents the type of the parsed task. It covers different tasks in language, visual, video, audio, etc. |
| id | The unique identifier for task planning, which is used for references to dependent tasks and their generated resources. |
| dep | It defines the pre-requisite tasks required for execution. The task will be launched only when all the pre-requisite dependent tasks are finished. |
| args | It contains the list of required arguments for task execution. It contains three subfields populated with text, image, and audio resources according to the task type. They are resolved from either the user's request or the generated resources of the dependent tasks. |
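As a rough illustration of the format above, the sketch below writes a two-task plan as plain Python data and checks that every dependency points at an existing task. The sample request and the `validate_plan` helper are hypothetical and not part of the original implementation. The `<GENERATED>-id` spelling of the resource tag and the use of `-1` for "no dependency" follow the prompts and demonstrations shown in this notebook.

```python
# Hypothetical plan for: "Describe example.jpg and read the description aloud."
plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "example.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0], "args": {"text": "<GENERATED>-0"}},
]

def validate_plan(tasks):
    """Return a list of problems found in a parsed task list (empty if none)."""
    problems = []
    known_ids = {task["id"] for task in tasks}
    for task in tasks:
        for dep_id in task["dep"]:
            # -1 marks a task with no prerequisite.
            if dep_id != -1 and dep_id not in known_ids:
                problems.append(f"task {task['id']} depends on unknown task {dep_id}")
        for value in task["args"].values():
            if isinstance(value, str) and value.startswith("<GENERATED>-"):
                ref = int(value.split("-")[1])
                if ref not in task["dep"]:
                    problems.append(
                        f"task {task['id']} uses <GENERATED>-{ref} without listing it in dep"
                    )
    return problems

print(validate_plan(plan))
```

A check like this could run right after parsing the LLM reply, before any model is invoked, so that malformed plans fail early.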

#### Example of Tasks and Models

| Task | Candidate Models |
| ---- | ---------------- |
| Text-CLS | [cardiffnlp/twitter-roberta-base-sentiment, ...] |
| Summarization | [bart-large-cnn, ...] |
| Translation | [t5-base, ...] |
| Image-to-Text | [nlpconnect/vit-gpt2-image-captioning, ...] |
| Segmentation | [facebook/detr-resnet-50-panoptic, ...] |
| Object-Detection | [facebook/detr-resnet-50, ...] |
| Text-to-Speech | [espnet/kanbayashi_ljspeech_vits, ...] |
| Text-to-Video | [damo-vilab/text-to-video-ms-1.7b, ...] |

#### Demonstration-based Parsing

To better understand the intention and criteria for task planning, HuggingGPT incorporates multiple demonstrations in the prompt. Each demonstration consists of a user request and its corresponding output, which represents the expected sequence of parsed tasks.
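One plausible way to place demonstrations in the prompt is to put them between the system prompt and the actual request in a chat message list. The sketch below assumes the demonstrations are stored as a JSON array of `role`/`content` messages, as in `demo_parse_task.json`. The `build_messages` helper, the shortened system prompt, and the sample request are illustrative assumptions.

```python
import json

def build_messages(system_prompt, demos_json, user_prompt):
    """Assemble a chat message list: system prompt, few-shot demos, then the request."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(json.loads(demos_json))  # alternating user/assistant examples
    messages.append({"role": "user", "content": user_prompt})
    return messages

# One hypothetical demonstration: a request and the task list expected for it.
demos_json = json.dumps([
    {"role": "user", "content": "Please describe the image e1.jpg"},
    {"role": "assistant",
     "content": '[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "e1.jpg"}}]'},
])

# The system prompt is abbreviated here; the full text appears in this notebook.
messages = build_messages("#1 Task Planning Stage: ...", demos_json, "What is in photo.png?")
for message in messages:
    print(message["role"])
```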

#### Example Top Level Prompt

`#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one generated text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classification", "image-to-image", "image-to-text", "text-to-image", "text-to-video", "visual-question-answering", "document-question-answering", "image-segmentation", "depth-estimation", "text-to-speech", "automatic-speech-recognition", "audio-to-audio", "audio-classification", "canny-control", "hed-control", "mlsd-control", "normal-control", "openpose-control", "canny-text-to-image", "depth-text-to-image", "hed-text-to-image", "mlsd-text-to-image", "normal-text-to-image", "openpose-text-to-image", "seg-text-to-image". There may be multiple tasks of the same type. Think step by step about all the tasks needed to resolve the user's request. Parse out as few tasks as possible while ensuring that the user request can be resolved. Pay attention to the dependencies and order among tasks. If the user input can't be parsed, you need to reply empty JSON [].`

#### Example Follow Up Prompt

`The chat log [ {{context}} ] may contain the resources I mentioned. Now I input { {{input}} }. Pay attention to the input and output types of tasks and the dependencies between tasks.`

### Model Selection

Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: `{"id": "id", "reason": "your detail reason for the choice"}`.

#### Example Top Level Prompt

`#2 Model Selection Stage: Given the user request and the parsed tasks, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The assistant should focus more on the description of the model and find the model that has the most potential to solve requests and tasks. Also, prefer models with local inference endpoints for speed and stability.`

#### Example Follow Up Prompt

`Please choose the most suitable model from {{metas}} for the task {{task}}. The output must be in a strict JSON format: {"id": "id", "reason": "your detail reasons for the choice"}.`
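The `{{metas}}` and `{{task}}` placeholders have to be filled in before the prompt is sent. A minimal sketch of that substitution is shown below. The `fill_template` helper and the candidate entry are hypothetical, and the notebook's own `replace_slot` function plays this role in the implementation.

```python
def fill_template(template, values):
    """Replace each {{name}} placeholder with the string form of its value."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template

template = "Please choose the most suitable model from {{metas}} for the task {{task}}."
# Hypothetical candidate metadata; real entries would come from the model list file.
candidates = [{"id": "facebook/detr-resnet-50", "description": "object detection model"}]
filled = fill_template(template, {"metas": candidates, "task": {"task": "object-detection"}})
print(filled)
```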

### Task Execution

In this stage, HuggingGPT will automatically feed these task arguments into the models, execute these models to obtain the inference results, and then send them back to the LLM.

#### Resource Dependency

HuggingGPT identifies the resources generated by the prerequisite task as <resource>-task_id, where task_id is the id of the prerequisite task. During the task planning stage, if some tasks are dependent on the outputs of previously executed tasks (e.g., task_id), HuggingGPT sets this symbol (i.e., <resource>-task_id) to the corresponding resource subfield in the arguments. Then in the task execution stage, HuggingGPT dynamically replaces this symbol with the resource generated by the prerequisite task.
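The substitution step can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's actual code: the placeholder is spelled `<GENERATED>-id` as in the prompts, and the `results` dictionary is assumed to map each finished task id to a dictionary keyed by resource type.

```python
def resolve_args(args, results):
    """Replace "<GENERATED>-<id>" placeholders with outputs of finished tasks."""
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("<GENERATED>-"):
            dep_id = int(value.split("-")[1])
            # Assumed layout: results[dep_id] maps a resource type to its value.
            resolved[key] = results[dep_id][key]
        else:
            resolved[key] = value
    return resolved

# Hypothetical output of task 0 (an image-to-text task).
results = {0: {"text": "a pizza with mushrooms"}}
print(resolve_args({"text": "<GENERATED>-0"}, results))
```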

### Response Generation

With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: `{{ User Input}}`, Task Planning: `{{ Tasks }}`, Model Selection: `{{ Model Assignment }}`, Task Execution: `{{ Predictions }}`.
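A small sketch of how the four pieces might be combined into a single log string is given below. The `format_stage_log` helper and the sample values are hypothetical, and serializing each stage with `json.dumps` is an assumption made for readability.

```python
import json

def format_stage_log(user_input, tasks, assignments, predictions):
    """Render the outputs of the earlier stages as one text block for the LLM."""
    return (
        f"User Input: {user_input}, "
        f"Task Planning: {json.dumps(tasks)}, "
        f"Model Selection: {json.dumps(assignments)}, "
        f"Task Execution: {json.dumps(predictions)}"
    )

log = format_stage_log(
    "What is in photo.png?",
    [{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "photo.png"}}],
    {"0": "nlpconnect/vit-gpt2-image-captioning"},
    {"0": {"generated text": "a pizza on a table"}},
)
print(log)
```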

#### Example Top Level Prompt

`#4 Response Generation Stage: With the task execution logs, the AI assistant needs to describe the process and inference results.`

#### Example Follow Up Prompt

`Yes. Please first think carefully and directly answer my request based on the inference results. Some of the inferences may not always turn out to be correct and require you to make careful consideration in making decisions. Then please detail your workflow including the used models and inference results for my request in your friendly tone. Please filter out information that is not relevant to my request. Tell me the complete path or urls of files in inference results. If there is nothing in the results, please tell me you can\'t make it.`

### Human Evaluation

To properly evaluate the outputs of the different stages where we interact with the LLM agent, the authors of the original paper have used these three key metrics on a collection of 130 diverse user requests:

* Passing Rate: to determine whether the planned task graph or selected model can be successfully executed.
* Rationality: to assess whether the generated task sequence or selected tools align with user requests in a rational manner.
* Success Rate: to verify if the final results satisfy the user's request.

Because we do not have access to these requests, and developing our own test set would take considerable time, we won't replicate the evaluation in this notebook. Instead, we reproduce the results table from the paper:

| LLM | Task Planning Passing Rate | Task Planning Rationality | Model Selection Passing Rate | Model Selection Rationality | Response Success Rate |
| --- | --- | --- | --- | --- | --- |
| Alpaca-13b | 51.04 | 32.17 | - | - | 6.92 |
| Vicuna-13b | 79.41 | 58.41 | - | - | 15.64 |
| GPT-3.5 | 91.22 | 78.47 | 93.89 | 84.29 | 63.08 |


# Implementation

### Dependencies Installation

We install some important dependencies that don't come with Colab.

In [None]:
!pip install tiktoken diffusers pydub

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting diffusers
  Downloading diffusers-0.27.2-py3-none-any.whl (2.0 MB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub, tiktoken, diffusers
Successfully installed diffusers-0.27.2 pydub-0.25.1 tiktoken-0.6.0


### Mounting Google Drive to access files

We save important files in Drive that are going to be used either as input or are going to be used to prompt the LLM agent. For example, `demo_parse_task.json` contains several examples for task parsing in json format.

All the files to upload are:


*   p0_models.jsonl
*   demo_parse_task.json
*   demo_choose_model.json
*   demo_response_results.json
*   food.jpeg

We have saved everything in this folder on our Drive: "/MyDrive/Colab Notebooks/HuggingGPT"



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports

In [None]:
import tiktoken
import requests
from diffusers.utils import load_image
import re
import json
import copy
import time
import threading
from queue import Queue
from huggingface_hub.inference_api import InferenceApi
from huggingface_hub.inference_api import ALL_TASKS
from io import BytesIO
import io
import uuid
import random
from PIL import Image, ImageDraw
import base64
from pydub import AudioSegment

### OpenAI Key and HuggingFace Token

Here we set our OpenAI key and Hugging Face token. Both are needed to call the respective APIs.

In [None]:
OPENAI_KEY = input("Your OpenAI Key: ")
HUGGINGFACE_TOKEN = input("Your HuggingFace Token: ")

### Setting up Constants

Here we set some values that won't change. The inference mode is fixed to `huggingface` for this implementation. We also set the encoding and maximum context length for gpt-4, read a file describing the models available on Hugging Face, and set the Hugging Face request headers.

In [None]:
inference_mode = "huggingface"

encodings = {
    "gpt-4": tiktoken.get_encoding("cl100k_base")
}

max_length = {
    "gpt-4": 8192
}

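# Read one JSON object per line; the context manager closes the file afterwards.
with open("/content/drive/MyDrive/Colab Notebooks/HuggingGPT/p0_models.jsonl", "r") as f:
    MODELS = [json.loads(line) for line in f if line.strip()]
# Group the model descriptions by the task they support.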
MODELS_MAP = {}
for model in MODELS:
    tag = model["task"]
    if tag not in MODELS_MAP:
        MODELS_MAP[tag] = []
    MODELS_MAP[tag].append(model)

LLM = "gpt-4"
LLM_encoding = LLM

HUGGINGFACE_HEADERS = {
    "Authorization": f"Bearer {HUGGINGFACE_TOKEN}",
}

openai_endpoint = "https://api.openai.com/v1/chat/completions"

### Prompts

These template prompts are used to "chat" with the LLM agent and get data back in the structure we want. The exception is input_prompt, which is our example request for showing the capabilities of this method in this implementation. We will see them in action later.

In [None]:
input_prompt = "There is a picture of some pizza located in '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg' what topping does it have?"

parse_task_prompt = """
The chat log [ {{context}} ] may contain the resources I mentioned. Now I input { {{input}} }. Pay attention to the input and output types of tasks and the dependencies between tasks.
"""
parse_task_tprompt = """
#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one generated text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classification", "image-to-image", "image-to-text", "text-to-image", "text-to-video", "visual-question-answering", "document-question-answering", "image-segmentation", "depth-estimation", "text-to-speech", "automatic-speech-recognition", "audio-to-audio", "audio-classification", "canny-control", "hed-control", "mlsd-control", "normal-control", "openpose-control", "canny-text-to-image", "depth-text-to-image", "hed-text-to-image", "mlsd-text-to-image", "normal-text-to-image", "openpose-text-to-image", "seg-text-to-image". There may be multiple tasks of the same type. Think step by step about all the tasks needed to resolve the user's request. Parse out as few tasks as possible while ensuring that the user request can be resolved. Pay attention to the dependencies and order among tasks. If the user input can't be parsed, you need to reply empty JSON [].
"""

choose_model_prompt = 'Please choose the most suitable model from {{metas}} for the task {{task}}. The output must be in a strict JSON format: {"id": "id", "reason": "your detail reasons for the choice"}.'
choose_model_tprompt = "#2 Model Selection Stage: Given the user request and the parsed tasks, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The assistant should focus more on the description of the model and find the model that has the most potential to solve requests and tasks. Also, prefer models with local inference endpoints for speed and stability."

response_results_prompt = 'Yes. Please first think carefully and directly answer my request based on the inference results. Some of the inferences may not always turn out to be correct and require you to make careful consideration in making decisions. Then please detail your workflow including the used models and inference results for my request in your friendly tone. Please filter out information that is not relevant to my request. Tell me the complete path or urls of files in inference results. If there is nothing in the results, please tell me you can\'t make it.'
response_results_tprompt = "#4 Response Generation Stage: With the task execution logs, the AI assistant needs to describe the process and inference results."

### Task examples for agent

For the agent to know how to respond to different prompts, it helps to give it some examples. This technique is called few-shot prompting: we provide demonstrations in the prompt to enable in-context learning and steer the model toward better performance.

In [None]:
parse_task_demos_or_presteps = open("/content/drive/MyDrive/Colab Notebooks/HuggingGPT/demo_parse_task.json", "r").read()
choose_model_demos_or_presteps = open("/content/drive/MyDrive/Colab Notebooks/HuggingGPT/demo_choose_model.json", "r").read()
response_results_demos_or_presteps = open("/content/drive/MyDrive/Colab Notebooks/HuggingGPT/demo_response_results.json", "r").read()

In [None]:
print(parse_task_demos_or_presteps)

[
    {
        "role": "user",
        "content": "Give you some pictures e1.jpg, e2.png, e3.jpg, help me count the number of sheep?"
    },
    {
        "role": "assistant",
        "content": "[{\"task\": \"image-to-text\", \"id\": 0, \"dep\": [-1], \"args\": {\"image\": \"e1.jpg\" }}, {\"task\": \"object-detection\", \"id\": 1, \"dep\": [-1], \"args\": {\"image\": \"e1.jpg\" }}, {\"task\": \"visual-question-answering\", \"id\": 2, \"dep\": [1], \"args\": {\"image\": \"<GENERATED>-1\", \"text\": \"How many sheep in the picture\"}} }}, {\"task\": \"image-to-text\", \"id\": 3, \"dep\": [-1], \"args\": {\"image\": \"e2.png\" }}, {\"task\": \"object-detection\", \"id\": 4, \"dep\": [-1], \"args\": {\"image\": \"e2.png\" }}, {\"task\": \"visual-question-answering\", \"id\": 5, \"dep\": [4], \"args\": {\"image\": \"<GENERATED>-4\", \"text\": \"How many sheep in the picture\"}} }}, {\"task\": \"image-to-text\", \"id\": 6, \"dep\": [-1], \"args\": {\"image\": \"e3.jpg\" }},  {\"task\": \"o

### Helper Functions (Token-related)

Here we define some helper functions on top of the tiktoken library for encoding text into tokens and counting them, among other things. Good documentation is available at https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb.

In [None]:
def count_tokens(model_name, text):
    return len(encodings[model_name].encode(text))

def get_max_context_length(model_name):
    return max_length[model_name]

def get_token_ids_for_task_parsing(model_name):
    text = '''{"task": "text-classification",  "token-classification", "text2text-generation", "summarization", "translation",  "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classification", "image-to-image", "image-to-text", "text-to-image", "visual-question-answering", "document-question-answering", "image-segmentation", "text-to-speech", "text-to-video", "automatic-speech-recognition", "audio-to-audio", "audio-classification", "canny-control", "hed-control", "mlsd-control", "normal-control", "openpose-control", "canny-text-to-image", "depth-text-to-image", "hed-text-to-image", "mlsd-text-to-image", "normal-text-to-image", "openpose-text-to-image", "seg-text-to-image", "args", "text", "path", "dep", "id", "<GENERATED>-"}'''
    res = encodings[model_name].encode(text)
    res = list(set(res))
    return res

def get_token_ids_for_choose_model(model_name):
    text = '''{"id": "reason"}'''
    res = encodings[model_name].encode(text)
    res = list(set(res))
    return res

choose_model_highlight_ids = get_token_ids_for_choose_model(LLM_encoding)
task_parsing_highlight_ids = get_token_ids_for_task_parsing(LLM_encoding)
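A plausible use for these highlight ids is to nudge the LLM toward the tokens of the expected JSON vocabulary. The OpenAI chat completions API accepts a `logit_bias` field that maps token ids to a bias between -100 and 100. The sketch below shows how such a map could be built. The `build_logit_bias` helper, the example token ids, and the bias value of 0.1 are assumptions chosen for illustration.

```python
def build_logit_bias(token_ids, bias):
    """Map each token id to the same bias, using string keys as in a JSON object."""
    if not -100 <= bias <= 100:
        raise ValueError("bias must lie in [-100, 100]")
    return {str(token_id): bias for token_id in token_ids}

# Hypothetical token ids; in the notebook they would come from
# task_parsing_highlight_ids or choose_model_highlight_ids.
example_ids = [5018, 8366, 794]

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "..."}],  # placeholder message
    "logit_bias": build_logit_bias(example_ids, 0.1),
}
print(payload["logit_bias"])
```

A small positive bias only makes the listed tokens slightly more likely, so the model can still produce other tokens when it needs to.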

### Helper Functions (Chat-related)

These functions are used to send requests to OpenAI's API.

In [None]:
def send_request(data):
    api_key = data.pop("api_key")
    api_endpoint = data.pop("api_endpoint")
    HEADER = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    response = requests.post(api_endpoint, json=data, headers=HEADER, proxies=None)
    if "error" in response.json():
        return response.json()
    return response.json()["choices"][0]["message"]["content"]

def chitchat(messages, api_key, api_endpoint):
    data = {
        "model": LLM,
        "messages": messages,
        "api_key": api_key,
        "api_endpoint": api_endpoint
    }
    return send_request(data)
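Note that `send_request` returns a string on success and the decoded error body (a dictionary) on failure, so callers need to check the type of the result. A hedged alternative is to separate the two cases explicitly. The `extract_reply` helper below is a hypothetical sketch that works on an already decoded response body, and the sample payloads are made up for illustration.

```python
def extract_reply(payload):
    """Return (content, error) from a decoded chat completions response body."""
    if "error" in payload:
        error = payload["error"]
        message = error.get("message") if isinstance(error, dict) else None
        return None, message or str(error)
    choices = payload.get("choices") or []
    if not choices:
        return None, "response contained no choices"
    return choices[0]["message"]["content"], None

# Made-up payloads in the general shape of a success and an error response.
ok_payload = {"choices": [{"message": {"role": "assistant", "content": "[]"}}]}
bad_payload = {"error": {"message": "Incorrect API key provided", "type": "invalid_request_error"}}

print(extract_reply(ok_payload))
print(extract_reply(bad_payload))
```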

These functions here are used to apply transformations to data be it images or text, and to extract information from text. These are used during the execution of the task.

In [None]:
def replace_slot(text, entries):
    for key, value in entries.items():
        if not isinstance(value, str):
            value = str(value)
        text = text.replace("{{" + key +"}}", value.replace('"', "'").replace('\n', ""))
    return text

def find_json(s):
    s = s.replace("\'", "\"")
    start = s.find("{")
    end = s.rfind("}")
    res = s[start:end+1]
    res = res.replace("\n", "")
    return res

def field_extract(s, field):
    # Try a permissive pattern first, then fall back to a stricter `field: "value"` pattern.
    try:
        field_rep = re.compile(rf'{field}.*?:.*?"(.*?)"', re.IGNORECASE)
        extracted = field_rep.search(s).group(1).replace("\"", "\'")
    except AttributeError:  # search() returned None, so nothing matched
        field_rep = re.compile(rf'{field}: *"(.*?)"', re.IGNORECASE)
        extracted = field_rep.search(s).group(1).replace("\"", "\'")
    return extracted

def image_to_bytes(img_url):
    # Load the image from a URL or local path and re-encode it as PNG bytes.
    img_byte = io.BytesIO()
    load_image(img_url).save(img_byte, format="png")
    img_data = img_byte.getvalue()
    return img_data

def get_id_reason(choose_str):
    reason = field_extract(choose_str, "reason")
    id = field_extract(choose_str, "id")
    choose = {"id": id, "reason": reason}
    return id.strip(), reason.strip(), choose

def collect_result(command, choose, inference_result):
    result = {"task": command}
    result["inference result"] = inference_result
    result["choose model result"] = choose
    return result
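Because the model selection stage asks for strict JSON, another option is to try `json.loads` first and use a regular expression only as a fallback. The sketch below is a hypothetical alternative to `get_id_reason`, not the notebook's method. The `parse_choice` name and the sample replies are assumptions.

```python
import json
import re

def parse_choice(reply):
    """Extract {"id", "reason"} from an LLM reply, tolerating surrounding text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        try:
            parsed = json.loads(reply[start:end + 1])
            return {"id": str(parsed["id"]).strip(), "reason": str(parsed["reason"]).strip()}
        except (json.JSONDecodeError, KeyError, TypeError):
            pass  # fall through to the regex fallback
    fields = {}
    for field in ("id", "reason"):
        match = re.search(rf'"?{field}"?\s*:\s*"(.*?)"', reply, re.IGNORECASE | re.DOTALL)
        if match is None:
            raise ValueError(f"could not find field {field!r} in reply")
        fields[field] = match.group(1).strip()
    return fields

reply = 'Sure! {"id": "facebook/detr-resnet-50", "reason": "It is trained for object detection."}'
print(parse_choice(reply))
```

Raising `ValueError` when a field is missing makes a malformed reply visible to the caller instead of failing later with an unrelated error.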

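These functions transform the result from the LLM agent into something we can use. `fix_dep` rebuilds each task's `dep` list from the `<GENERATED>` tags in its arguments, and `unfold` splits a task that references several generated resources into one task per resource, so we end up with a clean list of tasks to execute.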

In [None]:
def fix_dep(tasks):
    for task in tasks:
        args = task["args"]
        task["dep"] = []
        for k, v in args.items():
            if "<GENERATED>" in v:
                dep_task_id = int(v.split("-")[1])
                if dep_task_id not in task["dep"]:
                    task["dep"].append(dep_task_id)
        if len(task["dep"]) == 0:
            task["dep"] = [-1]
    return tasks

def unfold(tasks):
    try:
        for task in tasks:
            for key, value in task["args"].items():
                if "<GENERATED>" in value:
                    generated_items = value.split(",")
                    if len(generated_items) > 1:
                        for item in generated_items:
                            new_task = copy.deepcopy(task)
                            dep_task_id = int(item.split("-")[1])
                            new_task["dep"] = [dep_task_id]
                            new_task["args"][key] = item
                            tasks.append(new_task)
                        tasks.remove(task)
    except Exception as e:
        print(e)

    return tasks
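The `unfold` function above appends to and removes from `tasks` while iterating over it, which can be hard to reason about. A simplified variant that builds a new list instead is sketched below. It is an illustration only: `unfold_tasks` is a hypothetical name, it splits on the first multi-resource argument it finds, and it strips whitespace around each item, which the original does not do.

```python
import copy

def unfold_tasks(tasks):
    """Return a new task list where a multi-resource argument is split into copies."""
    unfolded = []
    for task in tasks:
        split_key, items = None, []
        for key, value in task["args"].items():
            if isinstance(value, str) and "<GENERATED>" in value:
                parts = [part.strip() for part in value.split(",")]
                if len(parts) > 1:
                    split_key, items = key, parts
                    break
        if split_key is None:
            unfolded.append(task)  # nothing to split, keep the task as is
            continue
        for item in items:
            new_task = copy.deepcopy(task)
            new_task["dep"] = [int(item.split("-")[1])]
            new_task["args"][split_key] = item
            unfolded.append(new_task)
    return unfolded

tasks = [{"task": "image-to-text", "id": 2, "dep": [0, 1],
          "args": {"image": "<GENERATED>-0,<GENERATED>-1"}}]
for task in unfold_tasks(tasks):
    print(task["dep"], task["args"]["image"])
```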

These functions here check if certain models are available in the hugging face inference api, and return them accordingly.

In [None]:
def get_model_status(model_id, url, headers, queue = None):
    endpoint_type = "huggingface" if "huggingface" in url else "local"
    if "huggingface" in url:
        r = requests.get(url, headers=headers, proxies=None)
    else:
        r = requests.get(url)

    if r.status_code == 200 and "loaded" in r.json():
        if queue:
            queue.put((model_id, True, endpoint_type))
        return True
    else:
        if queue:
            queue.put((model_id, False, None))
        return False

def get_avaliable_models(candidates, topk=5):
    all_available_models = {"local": [], "huggingface": []}
    threads = []
    result_queue = Queue()

    for candidate in candidates:
        model_id = candidate["id"]

        if inference_mode != "local":
            huggingfaceStatusUrl = f"https://api-inference.huggingface.co/status/{model_id}"
            thread = threading.Thread(target=get_model_status, args=(model_id, huggingfaceStatusUrl, HUGGINGFACE_HEADERS, result_queue))
            threads.append(thread)
            thread.start()

    result_count = len(threads)
    while result_count:
        model_id, status, endpoint_type = result_queue.get()
        if status and model_id not in all_available_models[endpoint_type]:
            all_available_models[endpoint_type].append(model_id)
        if len(all_available_models["local"] + all_available_models["huggingface"]) >= topk:
            break
        result_count -= 1

    for thread in threads:
        thread.join()

    return all_available_models
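The same fan-out pattern can also be written with `concurrent.futures` from the standard library, which manages the threads and the result collection. The sketch below is an alternative under stated assumptions rather than the notebook's implementation. The `first_available` helper is hypothetical, and `is_available` stands in for the HTTP status check, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def first_available(model_ids, is_available, topk=5, max_workers=8):
    """Check models concurrently and return up to topk ids that report as available."""
    available = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(is_available, model_id): model_id for model_id in model_ids}
        for future in as_completed(futures):
            try:
                ok = future.result()
            except Exception:
                ok = False  # treat a failed check as "not available"
            if ok:
                available.append(futures[future])
            if len(available) >= topk:
                break
    return available

# Stub status table standing in for real status requests.
fake_status = {"model-a": True, "model-b": False, "model-c": True}
print(sorted(first_available(fake_status, fake_status.get)))
```

Results arrive in completion order, so the example sorts them before printing to keep the output stable.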

Once a model has been chosen, we use this function to call the Hugging Face API with the correct parameters for the model's inference. We structure the parameters as required by the type of task to be executed.

In [None]:
def huggingface_model_inference(model_id, data, task):
    task_url = f"https://api-inference.huggingface.co/models/{model_id}" # InferenceApi does not yet support some tasks
    inference = InferenceApi(repo_id=model_id, token=HUGGINGFACE_TOKEN)

    # NLP tasks
    if task == "question-answering":
        inputs = {"question": data["text"], "context": (data["context"] if "context" in data else "" )}
        result = inference(inputs)
    if task == "sentence-similarity":
        inputs = {"source_sentence": data["text1"], "target_sentence": data["text2"]}
        result = inference(inputs)
    if task in ["text-classification",  "token-classification", "text2text-generation", "summarization", "translation", "conversational", "text-generation"]:
        inputs = data["text"]
        result = inference(inputs)

    # CV tasks
    if task == "visual-question-answering" or task == "document-question-answering":
        img_url = data["image"]
        text = data["text"]
        img_data = image_to_bytes(img_url)
        img_base64 = base64.b64encode(img_data).decode("utf-8")
        json_data = {}
        json_data["inputs"] = {}
        json_data["inputs"]["question"] = text
        json_data["inputs"]["image"] = img_base64
        json_data["wait_for_model"] = True
        result = requests.post(task_url, headers=HUGGINGFACE_HEADERS, json=json_data).json()
        # result = inference(inputs) # not support

    if task == "image-to-image":
        img_url = data["image"]
        img_data = image_to_bytes(img_url)
        # result = inference(data=img_data) # not support
        HUGGINGFACE_HEADERS["Content-Length"] = str(len(img_data))
        r = requests.post(task_url, headers=HUGGINGFACE_HEADERS, data=img_data)
        result = r.json()
        if "path" in result:
            result["generated image"] = result.pop("path")

    if task == "text-to-image":
        inputs = data["text"]
        img = inference(inputs)
        name = str(uuid.uuid4())[:4]
        img.save(f"/content/drive/MyDrive/Colab Notebooks/HuggingGPT/{name}.png")
        result = {}
        result["generated image"] = f"/images/{name}.png"

    if task == "image-segmentation":
        img_url = data["image"]
        img_data = image_to_bytes(img_url)
        image = Image.open(BytesIO(img_data))
        predicted = inference(data=img_data)
        colors = []
        for i in range(len(predicted)):
            colors.append((random.randint(100, 255), random.randint(100, 255), random.randint(100, 255), 155))
        for i, pred in enumerate(predicted):
            label = pred["label"]
            mask = pred.pop("mask").encode("utf-8")
            mask = base64.b64decode(mask)
            mask = Image.open(BytesIO(mask), mode='r')
            mask = mask.convert('L')

            layer = Image.new('RGBA', mask.size, colors[i])
            image.paste(layer, (0, 0), mask)
        name = str(uuid.uuid4())[:4]
        image.save(f"/content/drive/MyDrive/Colab Notebooks/HuggingGPT/{name}.jpg")
        result = {}
        result["generated image"] = f"/images/{name}.jpg"
        result["predicted"] = predicted

    if task == "object-detection":
        img_url = data["image"]
        img_data = image_to_bytes(img_url)
        predicted = inference(data=img_data)
        image = Image.open(BytesIO(img_data))
        draw = ImageDraw.Draw(image)
        labels = list(item['label'] for item in predicted)
        color_map = {}
        for label in labels:
            if label not in color_map:
                color_map[label] = (random.randint(0, 255), random.randint(0, 100), random.randint(0, 255))
        for label in predicted:
            box = label["box"]
            draw.rectangle(((box["xmin"], box["ymin"]), (box["xmax"], box["ymax"])), outline=color_map[label["label"]], width=2)
            draw.text((box["xmin"]+5, box["ymin"]-15), label["label"], fill=color_map[label["label"]])
        name = str(uuid.uuid4())[:4]
        image.save(f"/content/drive/MyDrive/Colab Notebooks/HuggingGPT/{name}.jpg")
        result = {}
        result["generated image"] = f"/images/{name}.jpg"
        result["predicted"] = predicted

    if task in ["image-classification"]:
        img_url = data["image"]
        img_data = image_to_bytes(img_url)
        result = inference(data=img_data)

    if task == "image-to-text":
        img_url = data["image"]
        img_data = image_to_bytes(img_url)
        HUGGINGFACE_HEADERS["Content-Length"] = str(len(img_data))
        r = requests.post(task_url, headers=HUGGINGFACE_HEADERS, data=img_data, proxies=None)
        result = {}
        if "generated_text" in r.json()[0]:
            result["generated text"] = r.json()[0].pop("generated_text")

    # AUDIO tasks
    if task == "text-to-speech":
        inputs = data["text"]
        response = inference(inputs, raw_response=True)
        # response = requests.post(task_url, headers=HUGGINGFACE_HEADERS, json={"inputs": text})
        name = str(uuid.uuid4())[:4]
        with open(f"/content/drive/MyDrive/Colab Notebooks/HuggingGPT/{name}.flac", "wb") as f:
            f.write(response.content)
        result = {"generated audio": f"/audios/{name}.flac"}
    if task in ["automatic-speech-recognition", "audio-to-audio", "audio-classification"]:
        audio_url = data["audio"]
        audio_data = requests.get(audio_url, timeout=10).content
        response = inference(data=audio_data, raw_response=True)
        result = response.json()
        if task == "audio-to-audio":
            content = None
            type = None
            for k, v in result[0].items():
                if k == "blob":
                    content = base64.b64decode(v.encode("utf-8"))
                if k == "content-type":
                    type = "audio/flac".split("/")[-1]
            audio = AudioSegment.from_file(BytesIO(content))
            name = str(uuid.uuid4())[:4]
            audio.export(f"/content/drive/MyDrive/Colab Notebooks/HuggingGPT/{name}.{type}", format=type)
            result = {"generated audio": f"/audios/{name}.{type}"}
    return result

def model_inference(model_id, data, hosted_on, task):
    try:
        inference_result = huggingface_model_inference(model_id, data, task)
    except Exception as e:
        print(e)
        inference_result = {"error":{"message": str(e)}}
    return inference_result

This function chooses a model: it sends the list of candidate models to our LLM agent and asks which one is best suited to the task.

In [None]:
def choose_model(input, task, metas, api_key, api_endpoint):
    prompt = replace_slot(choose_model_prompt, {
        "input": input,
        "task": task,
        "metas": metas,
    })
    print("Prompt we use to choose model given a list of candidates")
    print(prompt)
    print("")
    demos_or_presteps = replace_slot(choose_model_demos_or_presteps, {
        "input": input,
        "task": task,
        "metas": metas
    })
    messages = json.loads(demos_or_presteps)
    messages.insert(0, {"role": "system", "content": choose_model_tprompt})
    messages.append({"role": "user", "content": prompt})
    print("Prompt with examples that we send:")
    print(messages)
    print("")
    data = {
        "model": LLM,
        "messages": messages,
        "temperature": 0,
        "logit_bias": {item: 5 for item in choose_model_highlight_ids},
        "api_key": api_key,
        "api_endpoint": api_endpoint
    }
    return send_request(data)
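
Each stage in this notebook assembles its request the same way: a system prompt first, then the few-shot demonstrations parsed from a JSON string, then the user prompt. The pattern can be sketched on its own as follows. `build_messages` is a hypothetical helper, not part of the notebook, and the strings are placeholders standing in for the real prompts.

```python
import json


def build_messages(system_prompt, demos_json, user_prompt):
    """Assemble a chat message list: system prompt, demonstrations, user prompt."""
    # The demonstrations are stored as a JSON-encoded list of chat messages.
    messages = json.loads(demos_json)
    messages.insert(0, {"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Placeholder demonstrations (assumed content, for illustration only).
demos = json.dumps([
    {"role": "user", "content": "example request"},
    {"role": "assistant", "content": "example answer"},
])
messages = build_messages("You choose a model.", demos, "Pick one for image-to-text.")
roles = [m["role"] for m in messages]
print(roles)
```

Keeping the system prompt at index 0 and the live request last means the demonstrations always sit between the instructions and the question they illustrate.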

This is our main function for executing a single task. It calls the other functions, directly or indirectly.

In [None]:
def run_task(input, command, results, api_key, api_endpoint):
    id = command["id"]
    args = command["args"]
    task = command["task"]
    deps = command["dep"]
    if deps[0] != -1:
        dep_tasks = [results[dep] for dep in deps]
    else:
        dep_tasks = []

    print(f"Run task: {id} - {task}")
    print("")
    print("Deps: " + json.dumps(dep_tasks))
    print("")

    if deps[0] != -1:
        if "image" in args and "<GENERATED>-" in args["image"]:
            resource_id = int(args["image"].split("-")[1])
            if "generated image" in results[resource_id]["inference result"]:
                args["image"] = results[resource_id]["inference result"]["generated image"]
        if "audio" in args and "<GENERATED>-" in args["audio"]:
            resource_id = int(args["audio"].split("-")[1])
            if "generated audio" in results[resource_id]["inference result"]:
                args["audio"] = results[resource_id]["inference result"]["generated audio"]
        if "text" in args and "<GENERATED>-" in args["text"]:
            resource_id = int(args["text"].split("-")[1])
            if "generated text" in results[resource_id]["inference result"]:
                args["text"] = results[resource_id]["inference result"]["generated text"]

    text = image = audio = None
    for dep_task in dep_tasks:
        if "generated text" in dep_task["inference result"]:
            text = dep_task["inference result"]["generated text"]
            print("Detect the generated text of dependency task (from results):" + text)
            print("")
        elif "text" in dep_task["task"]["args"]:
            text = dep_task["task"]["args"]["text"]
            print("Detect the text of dependency task (from args): " + text)
            print("")
        if "generated image" in dep_task["inference result"]:
            image = dep_task["inference result"]["generated image"]
            print("Detect the generated image of dependency task (from results): " + image)
            print("")
        elif "image" in dep_task["task"]["args"]:
            image = dep_task["task"]["args"]["image"]
            print("Detect the image of dependency task (from args): " + image)
            print("")
        if "generated audio" in dep_task["inference result"]:
            audio = dep_task["inference result"]["generated audio"]
            print("Detect the generated audio of dependency task (from results): " + audio)
            print("")
        elif "audio" in dep_task["task"]["args"]:
            audio = dep_task["task"]["args"]["audio"]
            print("Detect the audio of dependency task (from args): " + audio)
            print("")

    if "image" in args and "<GENERATED>" in args["image"]:
        if image:
            args["image"] = image
    if "audio" in args and "<GENERATED>" in args["audio"]:
        if audio:
            args["audio"] = audio
    if "text" in args and "<GENERATED>" in args["text"]:
        if text:
            args["text"] = text

    for resource in ["image", "audio"]:
        if resource in args and not args[resource].startswith("public/") and len(args[resource]) > 0 and not args[resource].startswith("http"):
            args[resource] = f"{args[resource]}"

    if "-text-to-image" in command['task'] and "text" not in args:
        print("control-text-to-image task, but text is empty, so we use control-generation instead.")
        print("")
        control = task.split("-")[0]

        if control == "seg":
            task = "image-segmentation"
            command['task'] = task
        elif control == "depth":
            task = "depth-estimation"
            command['task'] = task
        else:
            task = f"{control}-control"

    command["args"] = args
    print(f"parsed task: {command}")
    print("")

    if task.endswith("-text-to-image") or task.endswith("-control"):
        if inference_mode != "huggingface":
            if task.endswith("-text-to-image"):
                control = task.split("-")[0]
                best_model_id = f"lllyasviel/sd-controlnet-{control}"
            else:
                best_model_id = task
            hosted_on = "local"
            reason = "ControlNet is the best model for this task."
            choose = {"id": best_model_id, "reason": reason}
            print(f"chosen model: {choose}")
            print("")
        else:
            print(f"Task {command['task']} is not available. ControlNet need to be deployed locally.")
            print("")
            inference_result = {"error": f"service related to ControlNet is not available."}
            results[id] = collect_result(command, "", inference_result)
            return False
    elif task in ["summarization", "translation", "conversational", "text-generation", "text2text-generation"]:
        best_model_id = "ChatGPT"
        reason = "ChatGPT performs well on some NLP tasks as well."
        choose = {"id": best_model_id, "reason": reason}
        messages = [{
            "role": "user",
            "content": f"[ {input} ] contains a task in JSON format {command}. Now you are a {command['task']} system, the arguments are {command['args']}. Just help me do {command['task']} and give me the result. The result must be in text form without any urls."
        }]
        response = chitchat(messages, api_key, api_endpoint)
        results[id] = collect_result(command, choose, {"response": response})
        return True
    else:
        if task not in MODELS_MAP:
            print(f"no available models on {task} task.")
            print("")
            inference_result = {"error": f"{command['task']} not found in available tasks."}
            results[id] = collect_result(command, "", inference_result)
            return False

        candidates = MODELS_MAP[task][:10]
        all_avaliable_models = get_avaliable_models(candidates, 5)
        all_avaliable_model_ids = all_avaliable_models["local"] + all_avaliable_models["huggingface"]
        print(f"avaliable models on {command['task']}: {all_avaliable_models}")
        print("")

        if len(all_avaliable_model_ids) == 0:
            print(f"no available models on {command['task']}")
            print("")
            inference_result = {"error": f"no available models on {command['task']} task."}
            results[id] = collect_result(command, "", inference_result)
            return False

        if len(all_avaliable_model_ids) == 1:
            best_model_id = all_avaliable_model_ids[0]
            hosted_on = "local" if best_model_id in all_avaliable_models["local"] else "huggingface"
            reason = "Only one model available."
            choose = {"id": best_model_id, "reason": reason}
            print(f"chosen model: {choose}")
            print("")
        else:
            cand_models_info = [
                {
                    "id": model["id"],
                    "inference endpoint": all_avaliable_models.get(
                        "local" if model["id"] in all_avaliable_models["local"] else "huggingface"
                    ),
                    "likes": model.get("likes"),
                    "description": model.get("description", "")[:100],
                    "tags": model.get("meta").get("tags") if model.get("meta") else None,
                }
                for model in candidates
                if model["id"] in all_avaliable_model_ids
            ]

            choose_str = choose_model(input, command, cand_models_info, api_key, api_endpoint)
            print(f"chosen model: {choose_str}")
            print("")
            try:
                choose = json.loads(choose_str)
                reason = choose["reason"]
                best_model_id = choose["id"]
                hosted_on = "local" if best_model_id in all_avaliable_models["local"] else "huggingface"
            except Exception as e:
                print(f"the response [ {choose_str} ] is not a valid JSON, try to find the model id and reason in the response.")
                print("")
                choose_str = find_json(choose_str)
                best_model_id, reason, choose  = get_id_reason(choose_str)
                hosted_on = "local" if best_model_id in all_avaliable_models["local"] else "huggingface"
    print("Model we use and args we send:")
    print(best_model_id)
    print(args)
    print("")
    inference_result = model_inference(best_model_id, args, hosted_on, command['task'])

    if "error" in inference_result:
        print(f"Inference error: {inference_result['error']}")
        print("")
        results[id] = collect_result(command, choose, inference_result)
        return False

    results[id] = collect_result(command, choose, inference_result)
    return True
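
The placeholder handling at the top of `run_task` can be looked at in isolation. The following is a minimal sketch rather than the notebook's code: `resolve_generated_args` is a hypothetical name, and it assumes `results` is a dict keyed by task id whose values contain an `"inference result"` dict, as in the outputs printed later in this notebook.

```python
def resolve_generated_args(args, results):
    """Return a copy of args with "<GENERATED>-<id>" tags replaced by real resources.

    Hypothetical helper mirroring the first substitution pass in run_task.
    """
    resolved = dict(args)
    for kind in ("image", "audio", "text"):
        value = resolved.get(kind)
        if not isinstance(value, str) or "<GENERATED>-" not in value:
            continue
        # "<GENERATED>-0" refers to the output of dependency task 0.
        dep_id = int(value.split("-")[1])
        produced = results.get(dep_id, {}).get("inference result", {})
        key = f"generated {kind}"
        if isinstance(produced, dict) and key in produced:
            resolved[kind] = produced[key]
    return resolved


results = {
    0: {"inference result": {"generated text": "a pepperoni pizza on a wooden table"}}
}
args = {"text": "<GENERATED>-0", "image": "food.jpeg"}
resolved = resolve_generated_args(args, results)
print(resolved)
```

If the dependency did not produce a resource of the requested type, the tag is left in place. That is the case the second pass in `run_task` handles, by falling back to the dependency's own arguments.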

Once we have our inference results, we ask the LLM agent to answer the original request based on them.

In [None]:
def response_results(input, results, api_key, api_endpoint):
    results = [v for k, v in sorted(results.items(), key=lambda item: item[0])]
    prompt = replace_slot(response_results_prompt, {
        "input": input,
    })
    print("Response prompt we will be sending:")
    print(prompt)
    print("")
    demos_or_presteps = replace_slot(response_results_demos_or_presteps, {
        "input": input,
        "processes": results
    })
    messages = json.loads(demos_or_presteps)
    messages.insert(0, {"role": "system", "content": response_results_tprompt})
    messages.append({"role": "user", "content": prompt})
    print("All messages we will be sending:")
    print(messages)
    print("")
    data = {
        "model": LLM,
        "messages": messages,
        "temperature": 0,
        "api_key": api_key,
        "api_endpoint": api_endpoint
    }
    return send_request(data)

### Main Execution

We create a list that stores all our chat messages as the history. We start by appending our prompt with the role "user".

In [None]:
all_messages = []
all_messages.append({"role":"user", "content":input_prompt})

Our input prompt that we will be working with is the following:

In [None]:
print(input_prompt)

There is a picture of some pizza located in '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg' what topping does it have?


Our parse task planning prompt.

In [None]:
print(parse_task_tprompt)


#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one generated text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classification", "image-to-image", "image-t

We then start by reading our task prompt for the task planning stage and append it to the start of our "parse task" examples.

In [None]:
messages = json.loads(parse_task_demos_or_presteps)
messages.insert(0, {"role": "system", "content": parse_task_tprompt})
print(messages[0])
print(messages[1])

{'role': 'system', 'content': '\n#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one generated text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classificat

Here we just separate any historical messages that may be in all_messages, but for this example, these two lines can be ignored, as we are only working with one example prompt.

In [None]:
context = all_messages[:-1]
input = all_messages[-1]["content"]
print(context)
print(input)

[]
There is a picture of some pizza located in '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg' what topping does it have?


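Here we add the chat history and the new input to the messages list, which already holds the example prompts. We then join everything into one history text and count its tokens. If fewer than 800 tokens of context would remain, we drop the two oldest history messages and try again. Notice that in this example there is no history in the chat log.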

In [None]:
start = 0
while start <= len(context):
    history = context[start:]
    prompt = replace_slot(parse_task_prompt, {
        "input": input,
        "context": history
    })
    print("Prompt to be added to array:")
    print(prompt)
    messages.append({"role": "user", "content": prompt})
    history_text = "<im_end>\nuser<im_start>".join([m["content"] for m in messages])
    num = count_tokens(LLM_encoding, history_text)
    if get_max_context_length(LLM) - num > 800:
        break
    messages.pop()
    start += 2

Prompt to be added to array:

The chat log [ [] ] may contain the resources I mentioned. Now I input { There is a picture of some pizza located in '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg' what topping does it have? }. Pay attention to the input and output types of tasks and the dependencies between tasks.
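
The loop above trims the oldest history until the prompt leaves enough headroom in the context window. A minimal sketch of the same idea follows, under clearly stated assumptions: `count_tokens_approx` is a hypothetical whitespace-based stand-in for the notebook's `count_tokens`, and `max_context` and `reserve` are illustrative parameters rather than values read from a real model.

```python
def count_tokens_approx(text):
    # Hypothetical stand-in: counts whitespace-separated words, not real tokens.
    return len(text.split())


def trim_history(context, fixed_messages, user_input, max_context, reserve=800):
    """Drop the oldest turns, two at a time, until the prompt fits.

    Returns the part of the context that can be kept.
    """
    start = 0
    while start <= len(context):
        history = context[start:]
        parts = fixed_messages + [m["content"] for m in history] + [user_input]
        used = count_tokens_approx(" ".join(parts))
        if max_context - used > reserve:
            return history
        start += 2  # drop one user/assistant pair
    return []


context = [
    {"role": "user", "content": "a b c d"},
    {"role": "assistant", "content": "e f g h"},
    {"role": "user", "content": "i j"},
    {"role": "assistant", "content": "k l"},
]
# Tiny budget so the example actually has to drop the first pair.
kept = trim_history(context, ["system prompt here"], "what topping", max_context=12, reserve=2)
print(kept)
```

Stepping by two keeps user and assistant turns paired, so the retained history never starts with an orphaned reply.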



We format our data to send it to OpenAI through the API, and we print the first and last messages we will be sending. Notice that the first one is the system prompt that explains task planning and the last one is our prompt built from the parse task template.

In [None]:
data = {
    "model": LLM,
    "messages": messages,
    "temperature": 0,
    "logit_bias": {item: 0.1 for item in task_parsing_highlight_ids},
    "api_key": OPENAI_KEY,
    "api_endpoint": openai_endpoint
}
print(messages[0])
print(messages[-1])

{'role': 'system', 'content': '\n#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one generated text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classificat

We send our data and get the tasks with the format we requested.

In [None]:
tasks = json.loads(send_request(data))
for task in tasks:
    print(task)

{'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg'}}
{'task': 'visual-question-answering', 'id': 1, 'dep': [0], 'args': {'image': '<GENERATED>-0', 'text': 'what topping does it have?'}}


In [None]:
tasks = unfold(tasks)
tasks = fix_dep(tasks)
for task in tasks:
    print(task)

{'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg'}}
{'task': 'visual-question-answering', 'id': 1, 'dep': [0], 'args': {'image': '<GENERATED>-0', 'text': 'what topping does it have?'}}
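
The planning prompt places several constraints on the task list: the keys of `args` are limited to `text`, `image`, and `audio`, and any `<GENERATED>-dep_id` tag must name an id that appears in `dep`. A small checker can make those constraints explicit. This is only a sketch: `validate_plan` is a hypothetical helper and is not part of the notebook, which relies on `unfold` and `fix_dep` instead.

```python
ALLOWED_ARG_KEYS = {"text", "image", "audio"}


def validate_plan(tasks):
    """Return a list of human-readable problems found in a parsed task plan."""
    problems = []
    for task in tasks:
        tid = task["id"]
        extra = set(task["args"]) - ALLOWED_ARG_KEYS
        if extra:
            problems.append(f"task {tid}: unexpected arg keys {sorted(extra)}")
        for dep_id in task["dep"]:
            # -1 means "no dependency"; otherwise a dependency must be an earlier task.
            if dep_id != -1 and dep_id >= tid:
                problems.append(f"task {tid}: dependency {dep_id} is not an earlier task")
        for key, value in task["args"].items():
            if isinstance(value, str) and value.startswith("<GENERATED>-"):
                ref = int(value.split("-")[1])
                if ref not in task["dep"]:
                    problems.append(f"task {tid}: {key} refers to {ref}, which is not in dep")
    return problems


plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "food.jpeg"}},
    {"task": "visual-question-answering", "id": 1, "dep": [0],
     "args": {"image": "<GENERATED>-0", "text": "what topping does it have?"}},
]
plan_problems = validate_plan(plan)
print(plan_problems)
```

An empty list means the plan satisfies the checks sketched here; it says nothing about whether the chosen tasks are the right ones for the request.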


Here is where we start executing our tasks. We go through all our tasks, starting with the ones that have no dependencies (dep = -1). We run each task in a different thread. In the output section of this cell, you will see how it goes from parsing the tasks, to knowing which models are available, and most importantly the prompt it uses for model selection. Following that trail of outputs will give you a much better idea of how it reasons through the tasks.

In [None]:
results = {}
threads = []
tasks = tasks[:]
d = dict()
retry = 0
while True:
    num_thread = len(threads)
    for task in list(tasks):  # iterate over a copy, since tasks is modified inside the loop
        for dep_id in task["dep"]:
            if dep_id >= task["id"]:
                task["dep"] = [-1]
                break
        dep = task["dep"]
        if dep[0] == -1 or len(list(set(dep).intersection(d.keys()))) == len(dep):
            tasks.remove(task)
            thread = threading.Thread(target=run_task, args=(input, task, d, OPENAI_KEY, openai_endpoint))
            thread.start()
            threads.append(thread)
    if num_thread == len(threads):
        time.sleep(0.5)
        retry += 1
    if retry > 160:
        break
    if len(tasks) == 0:
        break
for thread in threads:
    thread.join()

results = d.copy()

Run task: 0 - image-to-text

Deps: []

parsed task: {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg'}}

avaliable models on image-to-text: {'local': [], 'huggingface': ['Salesforce/blip-image-captioning-base', 'Salesforce/blip2-opt-2.7b', 'microsoft/trocr-base-handwritten', 'Salesforce/blip-image-captioning-large', 'nlpconnect/vit-gpt2-image-captioning']}

Prompt we use to choose model given a list of candidates
Please choose the most suitable model from [{'id': 'nlpconnect/vit-gpt2-image-captioning', 'inference endpoint': ['Salesforce/blip-image-captioning-base', 'Salesforce/blip2-opt-2.7b', 'microsoft/trocr-base-handwritten', 'Salesforce/blip-image-captioning-large', 'nlpconnect/vit-gpt2-image-captioning'], 'likes': 219, 'description': '\n\n# nlpconnect/vit-gpt2-image-captioning\n\nThis is an image captioning model trained by @ydshieh in [', 'tags': ['image-to-text', 'image-captioning']}, {'id': 'Salesforc



Run task: 1 - visual-question-answering

Deps: [{"task": {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg"}}, "inference result": {"generated text": "a pepperoni pizza on a wooden table "}, "choose model result": {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "The 'nlpconnect/vit-gpt2-image-captioning' model is specifically designed for image captioning tasks, which is exactly what we need for this task. It has a high number of likes indicating its popularity and effectiveness. Moreover, it has a local inference endpoint which ensures speed and stability."}}]

Detect the generated text of dependency task (from results):a pepperoni pizza on a wooden table 

Detect the image of dependency task (from args): /content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg

parsed task: {'task': 'visual-question-answering', 'id': 1, 'dep': [0], 'args': {'image': '/content/drive/MyDrive/Colab Notebooks/Hugging
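
The readiness test used by the scheduling loop (a task may start once every id in its `dep` list has a finished result) can be seen more easily in a single-threaded sketch. `ready_tasks` and `execution_order` are hypothetical names, and running a task is simulated here by simply recording its id.

```python
def ready_tasks(pending, done_ids):
    """Return the pending tasks whose dependencies have all finished."""
    return [
        t for t in pending
        if t["dep"][0] == -1 or set(t["dep"]).issubset(done_ids)
    ]


def execution_order(tasks):
    """Simulate scheduling in waves; returns a list of waves of task ids."""
    pending = list(tasks)
    done_ids = set()
    waves = []
    while pending:
        ready = ready_tasks(pending, done_ids)
        if not ready:
            break  # unsatisfiable dependencies: stop instead of looping forever
        ready_ids = {t["id"] for t in ready}
        waves.append(sorted(ready_ids))
        done_ids |= ready_ids
        pending = [t for t in pending if t["id"] not in ready_ids]
    return waves


pizza_plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1]},
    {"task": "visual-question-answering", "id": 1, "dep": [0]},
]
waves = execution_order(pizza_plan)
print(waves)
```

Tasks that land in the same wave are the ones that could run concurrently, which is what the threads in the notebook's loop are for.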

Here we can see the results of the inference, all combined together.

In [None]:
print(results)

{0: {'task': {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg'}}, 'inference result': {'generated text': 'a pepperoni pizza on a wooden table '}, 'choose model result': {'id': 'nlpconnect/vit-gpt2-image-captioning', 'reason': "The 'nlpconnect/vit-gpt2-image-captioning' model is specifically designed for image captioning tasks, which is exactly what we need for this task. It has a high number of likes indicating its popularity and effectiveness. Moreover, it has a local inference endpoint which ensures speed and stability."}}, 1: {'task': {'task': 'visual-question-answering', 'id': 1, 'dep': [0], 'args': {'image': '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg', 'text': 'what topping does it have?'}}, 'inference result': [{'score': 0.9694024920463562, 'answer': 'pepperoni'}, {'score': 0.1693362146615982, 'answer': 'cheese'}, {'score': 0.009121456183493137, 'answer': 'tomato'}, {'score': 0.0062710
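
The visual-question-answering result is a list of candidate answers with scores. In the notebook the LLM reads these scores itself during response generation; the sketch below only illustrates how the top candidate could be picked programmatically. `top_answer` is a hypothetical helper, and the scores are rounded copies of the first three entries printed above.

```python
def top_answer(candidates):
    """Return the answer string with the highest score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])["answer"]


vqa_result = [
    {"score": 0.969, "answer": "pepperoni"},
    {"score": 0.169, "answer": "cheese"},
    {"score": 0.009, "answer": "tomato"},
]
best = top_answer(vqa_result)
print(best)
```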

Finally, we build another prompt using the same technique to get a final answer back from our LLM agent.

In [None]:
response = response_results(input, results, OPENAI_KEY, openai_endpoint).strip()

Response prompt we will be sending:
Yes. Please first think carefully and directly answer my request based on the inference results. Some of the inferences may not always turn out to be correct and require you to make careful consideration in making decisions. Then please detail your workflow including the used models and inference results for my request in your friendly tone. Please filter out information that is not relevant to my request. Tell me the complete path or urls of files in inference results. If there is nothing in the results, please tell me you can't make it. }

All messages we will be sending:
[{'role': 'system', 'content': '#4 Response Generation Stage: With the task execution logs, the AI assistant needs to describe the process and inference results.'}, {'role': 'user', 'content': "There is a picture of some pizza located in '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg' what topping does it have?"}, {'role': 'assistant', 'content': "Before give you a r

This is the final response from our LLM agent. As it says, the pizza does indeed have pepperoni.

In [None]:
print(response)

Sure, based on the inference results, the pizza in the image located at '/content/drive/MyDrive/Colab Notebooks/HuggingGPT/food.jpeg' appears to have pepperoni as a topping.

Here's a detailed explanation of how I arrived at this conclusion:

1. First, I used the 'nlpconnect/vit-gpt2-image-captioning' model to generate a description of the image. This model is specifically designed for image captioning tasks and is popular and effective. The model described the image as 'a pepperoni pizza on a wooden table'.

2. Next, to answer your specific question about the pizza topping, I used the 'dandelin/vilt-b32-finetuned-vqa' model. This model is a Vision-and-Language Transformer (ViLT) that has been fine-tuned on the VQAv2 dataset, making it highly suitable for visual question answering tasks. When asked 'what topping does it have?', the model provided several potential answers with corresponding confidence scores. The highest score was for 'pepperoni' (0.969), followed by 'cheese' (0.169), 

# Conclusion and Future Direction

HuggingGPT has established itself as an efficient and successful controller for complex tasks spanning several modalities, even when compared with counterparts like Alpaca-7b. The paper's research and our own experimentation indicate that models like GPT-3.5 and GPT-4 hold great promise as components of a wider system that serves more intricate and expansive use cases.

LLMs, in combination with a set of well-engineered models, can now tackle tasks that were previously too difficult for any single AI capability. Using Hugging Face as the source of models and HuggingGPT as the controller pushes forward the potential scalability of what already exists today. AI is moving from being an assistive tool to being able to see a project or goal through from start to end.

When LLMs are exposed to a network as vast as Hugging Face and are given a well-engineered controller like HuggingGPT to integrate these models under one combined goal and directive (which we were also able to implement), the sky is the limit.

Some of the limitations that were identified are not specific to HuggingGPT but apply to LLMs as a whole. For example, the token-length limit and the maximum efficiency of the LLM being employed can act as a ceiling on what the autonomous AI system is able to achieve altogether. These two cannot really be addressed within HuggingGPT itself, so only with the passage of time will HuggingGPT be able to employ more powerful LLMs.

Finally, a more significant limitation of HuggingGPT is that there are undoubtedly gaps in efficiency caused by the complexity of its structure, which integrates multiple touchpoints with the LLM for any given request. This is more worth addressing, since it points to an opportunity for a more streamlined implementation that makes things seamless from a network point of view.



# References

[1]: Alayrac, Jean-Baptiste, et al. "Flamingo: A visual language model for few-shot learning." Advances in Neural Information Processing Systems 35 (2022): 23716-23736.

[2]: Huang, Shaohan, et al. "Language is not all you need: Aligning perception with language models." Advances in Neural Information Processing Systems 36 (2024).

[3]: Shen, Yongliang, et al. "HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face." Advances in Neural Information Processing Systems 36 (2024).