
[feature] Run gorilla locally without GPUs 🦍 #77

Closed
ShishirPatil opened this issue Aug 1, 2023 · 11 comments · Fixed by #160
Labels
enhancement New feature or request

Comments

@ShishirPatil
Owner

ShishirPatil commented Aug 1, 2023

Today, Gorilla end-points run on UC Berkeley-hosted servers 🐻 When you try our Colab, our chat-completion API, or the CLI tool, it hits our GPUs for inference. A popular ask among our users is to run Gorilla locally on MacBooks/Linux/WSL.

Describe the solution you'd like:
Have the model(s) running locally on MPS/CPU/GPU and listening on a port. All the current Gorilla end-points can then just hit localhost to get the response to any given prompt.
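
A rough sketch of what such a localhost endpoint could look like (purely illustrative: it assumes a llama.cpp-quantized model wrapped with llama-cpp-python and Flask; the model path and route name are placeholders, not a settled design):

# illustrative_server.py -- tiny localhost endpoint around a quantized Gorilla model.
# Assumes `pip install flask llama-cpp-python` and a GGML/GGUF file on disk.
from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="./gorilla-7b-q5_0.bin", n_threads=8)  # placeholder path; runs on CPU/MPS

@app.post("/completions")
def completions():
    prompt = request.json["prompt"]
    out = llm(prompt, max_tokens=128)
    return jsonify({"text": out["choices"][0]["text"]})

if __name__ == "__main__":
    # Existing Gorilla clients (Colab, CLI) would then point at http://localhost:8000
    app.run(host="127.0.0.1", port=8000)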

Additional context:
Here is an application that would immediately use it: https://github.com/gorilla-llm/gorilla-cli
Given that we have LLaMA models, these should be plug-and-play: ggerganov/llama.cpp and karpathy/llama2.c
Also relevant: https://huggingface.co/TheBloke/gorilla-7B-GPTQ

Update 1: If you happen to have an RTX, V100, A100, or H100, you can use Gorilla today without any latency hit. The goal of this enhancement is to help those who may not have access to the latest and greatest GPUs.

ShishirPatil added the enhancement (New feature or request) label on Aug 1, 2023
@fire

fire commented Aug 1, 2023

I am excited about a possible integration using ggml and mpt.

https://github.com/ggerganov/ggml/tree/master/examples/mpt

How much Gorilla-specific code needs to be ported from Python to C++?

How much of the functionality is fine-tuning the LLM?

@ShishirPatil
Owner Author

Hey @fire, for the first cut we don't have to use any Gorilla-specific code or any fine-tuning. It would just be inference, and there is no change in the architecture of either LLaMA or MPT, so the port should be pretty straightforward. The model weights are here: https://huggingface.co/gorilla-llm
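
For reference, plain local inference against those hosted weights might look roughly like this with transformers (an untested sketch; the prompt is illustrative and may not match Gorilla's exact prompt template):

# Sketch: CPU inference with the hosted Gorilla MPT weights via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gorilla-llm/gorilla-mpt-7b-hf-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # full precision on CPU; use float16 on GPU/MPS
    trust_remote_code=True,      # MPT repos ship custom modeling code
)

prompt = "I would like to translate 'I feel very good today.' from English to Chinese."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))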

@fire

fire commented Aug 1, 2023

As of today, which model (weights) should I be using?

@fire

fire commented Aug 2, 2023

Can someone help me quantize? I'm currently using mobile internet.

# get the repo and build it
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j

# get the model from HuggingFace
# be sure to have git-lfs installed
git clone https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0

# convert model to FP16
python3 ../examples/mpt/convert-h5-to-ggml.py ./gorilla-mpt-7b-hf-v0 1

# run inference using FP16 precision
./bin/mpt -m ./gorilla-mpt-7b-hf-v0/ggml-model-f16.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

# quantize the model to 5-bits using Q5_0 quantization
./bin/mpt-quantize ./gorilla-mpt-7b-hf-v0/ggml-model-f16.bin ./gorilla-mpt-7b-hf-v0/ggml-model-q5_0.bin q5_0

# run inference using the quantized Q5_0 model
./bin/mpt -m ./gorilla-mpt-7b-hf-v0/ggml-model-q5_0.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

ShishirPatil changed the title from "[feature] Run gorilla locally 🦍" to "[feature] Run gorilla locally without GPUs 🦍" on Aug 2, 2023
@ShishirPatil
Owner Author

ShishirPatil commented Aug 2, 2023

@fire good question re: models. gorilla-7b-hf-delta-v1 and gorilla-mpt-7b-hf-v0 are good models to get started with. The first is a diff against the LLaMA base, and the second is MPT-based.

re: quantize. How do you want to access the quantized model? 👀

@fire

fire commented Aug 2, 2023

I was expecting them to be put next to https://huggingface.co/gorilla-llm, but with a ggml tag and a q5 tag, I think.

I've also written up instructions for the LLaMA-based model

gorilla-llm/gorilla-7b-hf-delta-v1

# get the repo and build it
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake ..
make -j

# get the model from HuggingFace
# be sure to have git-lfs installed
git clone https://huggingface.co/gorilla-llm/gorilla-7b-hf-delta-v1

# convert model to FP16
python3 ../convert.py ./gorilla-7b-hf-delta-v1 --outtype f16

# run inference using FP16 precision
./bin/main -m ./gorilla-7b-hf-delta-v1/ggml-model-f16.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

# quantize the model to 5-bits using Q5_0 quantization
./bin/quantize ./gorilla-7b-hf-delta-v1/ggml-model-f16.bin ./gorilla-7b-hf-delta-v1/ggml-model-q5_0.bin q5_0

# run inference using the quantized Q5_0 model
./bin/main -m ./gorilla-7b-hf-delta-v1/ggml-model-q5_0.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

Llama evaluates poorly. No idea why.

@ShishirPatil
Owner Author

@fire we have the mpt-ggml and the llama-ggml models up on Hugging Face!
gorilla-llm/gorilla-7b-hf-v1-ggml
gorilla-llm/gorilla-mpt-7b-hf-v0-ggml
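
(An alternative to git-cloning the full repos is huggingface_hub's snapshot_download; a minimal sketch, assuming the repos are publicly readable:)

# Sketch: fetch the quantized GGML weights locally via huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="gorilla-llm/gorilla-7b-hf-v1-ggml")
print("Model files downloaded to:", local_dir)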

@fire

fire commented Aug 7, 2023

The links are 404'ing (not found).

@ShishirPatil
Owner Author

Yikes, I think they were private! Made them public. Let me know if it works! Also, feel free to raise a PR for updates to the README or anything you want to put into the HF model repos!

@CHIRU98

CHIRU98 commented Dec 1, 2023

Hi @ShishirPatil, "gorilla-llm/gorilla-7b-hf-v1-ggml" still seems to be private. Can you check once? I'm still getting a 401 Client Error.

@pranramesh
Contributor

pranramesh commented Jan 30, 2024

@ShishirPatil I believe the model here (https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-ggml) is a quantized version of the delta-weights model (if it is the model quantized from the llama.cpp script posted above by @fire). I tried running inference and the results were poor, probably because the deltas weren't merged with the LLaMA base first.
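
For context, merging the deltas into a LLaMA-7B base before converting/quantizing would look roughly like the FastChat-style recipe below (a sketch with placeholder paths; it assumes matching parameter names and vocab sizes, and any official delta-apply script in the Gorilla repo should be preferred if it differs):

# Sketch: apply gorilla-7b-hf-delta-v1 on top of a LLaMA-7B base, then save the
# merged model; the merged directory is what should go through llama.cpp convert/quantize.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(
    "gorilla-llm/gorilla-7b-hf-delta-v1", torch_dtype=torch.float16
)

base_state = base.state_dict()
for name, param in delta.state_dict().items():
    param.data += base_state[name]   # delta + base = recovered Gorilla weights

delta.save_pretrained("./gorilla-7b-hf-v1-merged")
AutoTokenizer.from_pretrained("gorilla-llm/gorilla-7b-hf-delta-v1").save_pretrained(
    "./gorilla-7b-hf-v1-merged"
)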

ShishirPatil pushed a commit that referenced this issue Feb 4, 2024
Resolved #77, demo displaying local inference with text-generation-webui.

K-quantized Gorilla models can be found on
[Hugging Face](https://huggingface.co/gorilla-llm):
[LLaMA-based](https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-gguf),
[MPT-based](https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0-gguf),
[Falcon-based](https://huggingface.co/gorilla-llm/gorilla-falcon-7b-hf-v0-gguf),
[`gorilla-openfunctions-v0-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v0-gguf),
[`gorilla-openfunctions-v1-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v1-gguf)

A tutorial walkthrough on how to quantize a model using llama.cpp with
different quantization methods is documented in
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JP_MN-J1rODo9k_-dR_9c9EnZRCfcVNe?usp=sharing).

Running local inference with Gorilla on a clean interface is simple: the demo
uses [text-generation-webui](https://github.com/oobabooga/text-generation-webui);
add your desired models and run inference.

More details in `/inference` README

---------

Co-authored-by: Pranav Ramesh <89561107+pranramesh@users.noreply.github.com>
Co-authored-by: Pranav Ramesh <pranramesh@users.noreply.github.com>
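
With the GGUF repos above published, local CPU inference outside of text-generation-webui can also be done directly with llama-cpp-python; a small sketch (the .gguf filename below is a placeholder, check the actual file names listed in each repo):

# Sketch: run a K-quantized Gorilla GGUF model locally on CPU with llama-cpp-python.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Placeholder filename: substitute the real .gguf file listed in the repo.
model_path = hf_hub_download(
    repo_id="gorilla-llm/gorilla-openfunctions-v1-gguf",
    filename="gorilla-openfunctions-v1.Q4_K_M.gguf",
)

llm = Llama(model_path=model_path, n_ctx=2048, n_threads=8)
out = llm("I would like to translate 'I feel very good today.' from English to Chinese.",
          max_tokens=64)
print(out["choices"][0]["text"])
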
devanshamin pushed a commit to devanshamin/gorilla that referenced this issue Jul 9, 2024