
[feature] Run gorilla locally without GPUs 🦍 #77

Closed
ShishirPatil opened this issue Aug 1, 2023 · 11 comments · Fixed by #160
Labels
enhancement New feature or request

Comments

@ShishirPatil
Owner

ShishirPatil commented Aug 1, 2023

Today, Gorilla end-points run on UC Berkeley-hosted servers 🐻 When you try our Colab, our chat-completion API, or the CLI tool, it hits our GPUs for inference. A popular ask among our users is to run Gorilla locally on MacBooks/Linux/WSL.

Describe the solution you'd like:
Have the model(s) running locally on MPS/CPU/GPU and listening on a port. All the current Gorilla end-points can then just hit localhost to get the response to any given prompt.
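
A rough sketch of what such a localhost endpoint could look like (purely illustrative: it assumes a llama.cpp-quantized model wrapped with llama-cpp-python and Flask; the model path and route name are placeholders, not a settled design):

# illustrative_server.py -- tiny localhost endpoint around a quantized Gorilla model.
# Assumes `pip install flask llama-cpp-python` and a GGML/GGUF file on disk.
from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="./gorilla-7b-q5_0.bin", n_threads=8)  # placeholder path; runs on CPU/MPS

@app.post("/completions")
def completions():
    prompt = request.json["prompt"]
    out = llm(prompt, max_tokens=128)
    return jsonify({"text": out["choices"][0]["text"]})

if __name__ == "__main__":
    # Existing Gorilla clients (Colab, CLI) would then point at http://localhost:8000
    app.run(host="127.0.0.1", port=8000)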

Additional context:
Here is an application that would immediately use it: https://github.com/gorilla-llm/gorilla-cli
Given that we have LLaMA models, these should be plug-and-play: ggerganov/llama.cpp and karpathy/llama2.c
Also relevant: https://huggingface.co/TheBloke/gorilla-7B-GPTQ

Update 1: If you happen to have an RTX, V100, A100, or H100, you can use Gorilla today without any latency hit. The goal of this enhancement is to help those who may not have access to the latest and greatest GPUs.

ShishirPatil added the enhancement (New feature or request) label on Aug 1, 2023
@fire

fire commented Aug 1, 2023

I am excited about a possible integration using ggml and mpt.

https://github.com/ggerganov/ggml/tree/master/examples/mpt

How much Gorilla-specific code needs to be ported from Python to C++?

How much of the functionality is fine-tuning the LLM?

@ShishirPatil
Owner Author

Hey @fire, for the first cut we don't have to use any Gorilla-specific code or any fine-tuning. It would just be inference, and there is no change in the architecture of either LLaMA or MPT, so the port should be pretty straightforward. The model weights are here: https://huggingface.co/gorilla-llm
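
For reference, plain local inference against those hosted weights might look roughly like this with transformers (an untested sketch; the prompt is illustrative and may not match Gorilla's exact prompt template):

# Sketch: CPU inference with the hosted Gorilla MPT weights via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gorilla-llm/gorilla-mpt-7b-hf-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # full precision on CPU; use float16 on GPU/MPS
    trust_remote_code=True,      # MPT repos ship custom modeling code
)

prompt = "I would like to translate 'I feel very good today.' from English to Chinese."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))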

@fire

fire commented Aug 1, 2023

As of today, which model (weights) should I be using?

@fire

fire commented Aug 2, 2023

Can someone help me quantize? I'm currently using mobile internet.

# get the repo and build it
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j

# get the model from HuggingFace
# be sure to have git-lfs installed
git clone https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0

# convert model to FP16
python3 ../examples/mpt/convert-h5-to-ggml.py ./gorilla-mpt-7b-hf-v0 1

# run inference using FP16 precision
./bin/mpt -m ./gorilla-mpt-7b-hf-v0/ggml-model-f16.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

# quantize the model to 5-bits using Q5_0 quantization
./bin/mpt-quantize ./gorilla-mpt-7b-hf-v0/ggml-model-f16.bin ./gorilla-mpt-7b-hf-v0/ggml-model-q5_0.bin q5_0

# run inference using the quantized Q5_0 model
./bin/mpt -m ./gorilla-mpt-7b-hf-v0/ggml-model-q5_0.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

ShishirPatil changed the title from "[feature] Run gorilla locally 🦍" to "[feature] Run gorilla locally without GPUs 🦍" on Aug 2, 2023
@ShishirPatil
Owner Author

ShishirPatil commented Aug 2, 2023

@fire good question re: models. gorilla-7b-hf-delta-v1 and gorilla-mpt-7b-hf-v0 are good models to get started with. The first is a diff against the LLaMA base, and the second is MPT-based.

re: quantize. How do you want to access the quantized model? 👀

@fire

fire commented Aug 2, 2023

I was expecting them to be put next to https://huggingface.co/gorilla-llm, but with a ggml tag and a q5 tag, I think.

I've also written up instructions for the LLaMA-based model

gorilla-llm/gorilla-7b-hf-delta-v1

# get the repo and build it
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake ..
make -j

# get the model from HuggingFace
# be sure to have git-lfs installed
git clone https://huggingface.co/gorilla-llm/gorilla-7b-hf-delta-v1

# convert model to FP16
python3 ../convert.py ./gorilla-7b-hf-delta-v1 --outtype f16

# run inference using FP16 precision
./bin/main -m ./gorilla-7b-hf-delta-v1/ggml-model-f16.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

# quantize the model to 5-bits using Q5_0 quantization
./bin/quantize ./gorilla-7b-hf-delta-v1/ggml-model-f16.bin ./gorilla-7b-hf-delta-v1/ggml-model-q5_0.bin q5_0

# run inference using the quantized Q5_0 model
./bin/main -m ./gorilla-7b-hf-delta-v1/ggml-model-q5_0.bin -p "I would like to translate 'I feel very good today.' from English to Chinese." -t 8 -n 64

Llama evaluates poorly. No idea why.

@ShishirPatil
Owner Author

@fire we have the mpt-ggml and the llama-ggml models up on Hugging Face!
gorilla-llm/gorilla-7b-hf-v1-ggml
gorilla-llm/gorilla-mpt-7b-hf-v0-ggml
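
(An alternative to git-cloning the full repos is huggingface_hub's snapshot_download; a minimal sketch, assuming the repos are publicly readable:)

# Sketch: fetch the quantized GGML weights locally via huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="gorilla-llm/gorilla-7b-hf-v1-ggml")
print("Model files downloaded to:", local_dir)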

@fire

fire commented Aug 7, 2023

The links are 404'ing (not found).

@ShishirPatil
Owner Author

Yikes, I think they were private! Made them public. Let me know if it works! Also, feel free to raise a PR for updates to the README or anything you want to put into the HF model repos!

@CHIRU98

CHIRU98 commented Dec 1, 2023

Hi @ShishirPatil, "gorilla-llm/gorilla-7b-hf-v1-ggml" still seems to be private. Can you check once? I'm still getting a 401 Client Error.

@pranramesh
Contributor

pranramesh commented Jan 30, 2024

@ShishirPatil I believe the model here (https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-ggml) is a quantized version of the delta-weights model (if it is the model quantized from the llama.cpp script posted above by @fire). I tried running inference and the results were poor, probably because the deltas weren't merged with the LLaMA base first.
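
For context, merging the deltas into a LLaMA-7B base before converting/quantizing would look roughly like the FastChat-style recipe below (a sketch with placeholder paths; it assumes matching parameter names and vocab sizes, and any official delta-apply script in the Gorilla repo should be preferred if it differs):

# Sketch: apply gorilla-7b-hf-delta-v1 on top of a LLaMA-7B base, then save the
# merged model; the merged directory is what should go through llama.cpp convert/quantize.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(
    "gorilla-llm/gorilla-7b-hf-delta-v1", torch_dtype=torch.float16
)

base_state = base.state_dict()
for name, param in delta.state_dict().items():
    param.data += base_state[name]   # delta + base = recovered Gorilla weights

delta.save_pretrained("./gorilla-7b-hf-v1-merged")
AutoTokenizer.from_pretrained("gorilla-llm/gorilla-7b-hf-delta-v1").save_pretrained(
    "./gorilla-7b-hf-v1-merged"
)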

ShishirPatil pushed a commit that referenced this issue Feb 4, 2024
Resolved #77, demo displaying local inference with text-generation-webui.

K-quantized Gorilla models can be found on
[Hugging Face](https://huggingface.co/gorilla-llm):
[LLaMA-based](https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-gguf),
[MPT-based](https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0-gguf),
[Falcon-based](https://huggingface.co/gorilla-llm/gorilla-falcon-7b-hf-v0-gguf),
[`gorilla-openfunctions-v0-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v0-gguf),
[`gorilla-openfunctions-v1-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v1-gguf)

A tutorial walkthrough on how to quantize a model using llama.cpp with
different quantization methods is documented in
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JP_MN-J1rODo9k_-dR_9c9EnZRCfcVNe?usp=sharing).

Running local inference with Gorilla on a clean interface is simple: the demo
uses [text-generation-webui](https://github.com/oobabooga/text-generation-webui);
add your desired models and run inference.

More details in `/inference` README

---------

Co-authored-by: Pranav Ramesh <89561107+pranramesh@users.noreply.github.com>
Co-authored-by: Pranav Ramesh <pranramesh@users.noreply.github.com>
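
With the GGUF repos above published, local CPU inference outside of text-generation-webui can also be done directly with llama-cpp-python; a small sketch (the .gguf filename below is a placeholder, check the actual file names listed in each repo):

# Sketch: run a K-quantized Gorilla GGUF model locally on CPU with llama-cpp-python.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Placeholder filename: substitute the real .gguf file listed in the repo.
model_path = hf_hub_download(
    repo_id="gorilla-llm/gorilla-openfunctions-v1-gguf",
    filename="gorilla-openfunctions-v1.Q4_K_M.gguf",
)

llm = Llama(model_path=model_path, n_ctx=2048, n_threads=8)
out = llm("I would like to translate 'I feel very good today.' from English to Chinese.",
          max_tokens=64)
print(out["choices"][0]["text"])
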
devanshamin pushed a commit to devanshamin/gorilla that referenced this issue Jul 9, 2024