
ggml-old-vic13b-q5_1.bin not supported #567

Closed
DjToMeK30 opened this issue May 31, 2023 · 16 comments

@DjToMeK30

Where can I see supported models?
I tried these Vicuna models and none of them work properly. I always get an error like:
error loading model: this format is no longer supported (see ggerganov/llama.cpp#1305)

DjToMeK30 added the bug (Something isn't working) label on May 31, 2023
@shaggy2626

> Where can I see supported models? I tried these Vicuna models and none of them work properly. I always get an error like: error loading model: this format is no longer supported (see ggerganov/llama.cpp#1305)

I got the same error with GPT4All-13B-snoozy.ggmlv3.q8_0.bin.

@DjToMeK30
Author

@shaggy2626 I used this method
#220 (comment)
pip install llama-cpp-python==0.1.48

Then I changed
MODEL_N_CTX=2048

I also applied this method
#517 (comment)

Now it works better, but I think only with PDFs. CSV ingestion finds only one row, and HTML pages are no good either. I am exporting a Google spreadsheet (Excel) to PDF.
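
For anyone following along, here is roughly what that looked like on my side. Treat it as a sketch rather than an exact copy of my files: the .env keys below are the stock ones from the primordial PrivateGPT example.env, and the paths are examples.

```bash
# Pin llama-cpp-python to a release that still reads the old GGML format
pip install llama-cpp-python==0.1.48

# .env (stock primordial PrivateGPT keys; adjust paths to your setup)
# MODEL_TYPE=LlamaCpp
# MODEL_PATH=models/ggml-old-vic13b-q5_1.bin
# MODEL_N_CTX=2048
# EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2

# Re-ingest the documents and query as usual
python ingest.py
python privateGPT.py
```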

@shaggy2626

shaggy2626 commented May 31, 2023

> @shaggy2626 I used this method #220 (comment): pip install llama-cpp-python==0.1.48
>
> Then I changed MODEL_N_CTX=2048
>
> I also applied this method #517 (comment)
>
> Now it works better, but I think only with PDFs. CSV ingestion finds only one row, and HTML pages are no good either. I am exporting a Google spreadsheet (Excel) to PDF.

How long does it take yours to provide a response? I have 64 GB of RAM and it takes an average of 2 minutes per query. I'm trying to use this model https://huggingface.co/TheBloke/GPT4All-13B-snoozy-GGML/blob/main/GPT4All-13B-snoozy.ggmlv3.q8_0.bin together with all-mpnet-base-v2 for the embeddings.
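
For reference, the relevant .env entries on my side are roughly these (key names assumed to be the stock primordial PrivateGPT ones; the model path is just wherever the file was downloaded):

```bash
# .env (assuming the stock primordial PrivateGPT keys)
MODEL_PATH=models/GPT4All-13B-snoozy.ggmlv3.q8_0.bin
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
```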

@DjToMeK30
Author

I use
MODEL_TYPE=LlamaCpp
MODEL_PATH=ggml-old-vic13b-q5_1.bin

But yeah, it takes about 2-3 minutes for a response; it starts by loading the model into memory, and I have 32 GB of RAM.
The responses themselves are poor on my side, though. I haven't found it useful for my scenario yet.
Maybe it will get better once CSV ingestion is fixed, because saving an Excel/Google spreadsheet as PDF is not really practical.

@jackfood

jackfood commented Jun 1, 2023

Agreed that the returned results are not quite there yet in terms of completeness. Either the embedding technique is not good enough, or the LLM is not up to the standard of what I am asking.

@StephenDWright

Just for comparison, I am using Wizard-Vicuna 13B GGML, but with the GPU implementation, where some of the work gets offloaded. Answers take about 4-5 seconds to start generating, 2-3 seconds when asking several back to back. Answers are pretty good: if an OpenAI implementation were a 10, the answers I get based on the context are a solid 8. Fairly complete and accurate, with sometimes a minor mix-up due to the order it states the information in. Depending on the question ("Tell me about topic" vs. "What is specific thing in topic") it can give pretty lengthy and rich answers.

@DjToMeK30
Author

> Just for comparison, I am using Wizard-Vicuna 13B GGML, but with the GPU implementation, where some of the work gets offloaded. Answers take about 4-5 seconds to start generating, 2-3 seconds when asking several back to back. Answers are pretty good: if an OpenAI implementation were a 10, the answers I get based on the context are a solid 8. Fairly complete and accurate, with sometimes a minor mix-up due to the order it states the information in. Depending on the question ("Tell me about topic" vs. "What is specific thing in topic") it can give pretty lengthy and rich answers.

How did you enable the GPU, and which GPU do you have? Are you running this on your own files? What type?

@StephenDWright

> Just for comparison, I am using Wizard-Vicuna 13B GGML, but with the GPU implementation, where some of the work gets offloaded. Answers take about 4-5 seconds to start generating, 2-3 seconds when asking several back to back. Answers are pretty good: if an OpenAI implementation were a 10, the answers I get based on the context are a solid 8. Fairly complete and accurate, with sometimes a minor mix-up due to the order it states the information in. Depending on the question ("Tell me about topic" vs. "What is specific thing in topic") it can give pretty lengthy and rich answers.
>
> How did you enable the GPU, and which GPU do you have? Are you running this on your own files? What type?

I followed the instructions in the pull request that enables GPU support. It is a bit touchy, but I got it to work, and it really makes it fly. I also had to make a change from the pull request that improves performance. The GPU I am using is a 12 GB RTX 3060. I am currently using it on documentation about labor laws for the teaching profession in my country.

@GuoChang2032

@StephenDWright I also tried to enable GPU acceleration, but it was not successful. Can you give me more details? Thank you.

@StephenDWright

> @StephenDWright I also tried to enable GPU acceleration, but it was not successful. Can you give me more details? Thank you.

I followed the instructions in the thread. In order for it to work, the installed llama-cpp-python version needs to be at least 0.1.54; I think the one in the requirements file is 0.1.50. If you installed 0.1.50, you have to uninstall it and reinstall with a flag that makes sure pip doesn't pull it back from its cache.

I also had to install PyTorch, the NVIDIA CUDA Toolkit 11.8, and CMake, which I installed through Visual Studio. Then you follow the instructions in the thread to build using the flags provided. Even then I had an issue where I ended up with two llama-cpp directories: I had to manually remove one from the build directory and move the 0.1.54 directory into the root of the build directory. It's a little scrappy and, to be honest, I'm not sure I could repeat the install. I removed and re-cloned this repository many times before I got it to work. As far as I could tell, though, my issue came down to getting the right version of llama-cpp-python. By the way, I am using a venv in Visual Studio Code.
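
The llama-cpp-python part boiled down to roughly the following. This is a bash sketch of the idea rather than the exact commands I ran (I was in PowerShell, and versions/paths may differ on your machine):

```bash
# Remove any existing CPU-only build so pip can't silently reuse it
pip uninstall -y llama-cpp-python

# Rebuild with cuBLAS enabled; needs the CUDA Toolkit and CMake installed.
# GPU offloading needs llama-cpp-python >= 0.1.54, and --no-cache-dir stops
# pip from reinstalling the cached CPU-only wheel.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install llama-cpp-python==0.1.54 --no-cache-dir
```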

You should read through the thread, though. What error are you getting?

@GuoChang2032

@StephenDWright I didn't encounter any errors, but the GPU just doesn't work, and I feel like my CUDA toolkit hasn't been fixed properly :(

@StephenDWright

> @StephenDWright I didn't encounter any errors, but the GPU just doesn't work, and I feel like my CUDA toolkit hasn't been fixed properly :(

What kind of GPU is it, and when the model loads, what does it output? It should output something like this near the end:

llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 6656 MB

When you ran

$Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; py ./setup.py install

Does it say it found the CUDA toolkit, show the path to the toolkit, and also report that cuBLAS was found?

@GuoChang2032

> @StephenDWright I didn't encounter any errors, but the GPU just doesn't work, and I feel like my CUDA toolkit hasn't been fixed properly :(
>
> What kind of GPU is it, and when the model loads, what does it output? It should output something like this near the end:
>
> llama_model_load_internal: [cublas] offloading 32 layers to GPU
> llama_model_load_internal: [cublas] total VRAM used: 6656 MB
>
> When you ran
>
> $Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; py ./setup.py install
>
> Does it say it found the CUDA toolkit, show the path to the toolkit, and also report that cuBLAS was found?

I installed it in a Conda environment and that resolved this issue, but now I have run into other problems. Thank you very much, though. I salute you.

@DjToMeK30
Author

@StephenDWright thanks for the follow-up. I was wondering what type of graphics card I would need to make this somewhat usable; I don't think my GPU would handle it like yours does :)
A follow-up question: if you wanted 10 clients asking questions of your local AI, how good a graphics card would your PC need in order to handle 10 different conversations about your documents?

@StephenDWright

> @StephenDWright thanks for the follow-up. I was wondering what type of graphics card I would need to make this somewhat usable; I don't think my GPU would handle it like yours does :)
> A follow-up question: if you wanted 10 clients asking questions of your local AI, how good a graphics card would your PC need in order to handle 10 different conversations about your documents?

I have no idea how it would work with 10 different clients at the same time. I know that when I was trying to make a GUI for it, I had to be careful to load the model once up front rather than on every question, because otherwise it just keeps allocating VRAM and runs out. It will probably also run very slowly if you try to generate tokens for multiple clients at the same time. 🤷🏽‍♂️ You would need a pretty beefy card, and probably more than a few of them, to make it usable for multiple clients, even if that number is only 10. At that point, you may as well just use OpenAI's API.

@DjToMeK30
Author

DjToMeK30 commented Jun 5, 2023

> @StephenDWright thanks for the follow-up. I was wondering what type of graphics card I would need to make this somewhat usable; I don't think my GPU would handle it like yours does :)
> A follow-up question: if you wanted 10 clients asking questions of your local AI, how good a graphics card would your PC need in order to handle 10 different conversations about your documents?
>
> I have no idea how it would work with 10 different clients at the same time. I know that when I was trying to make a GUI for it, I had to be careful to load the model once up front rather than on every question, because otherwise it just keeps allocating VRAM and runs out. It will probably also run very slowly if you try to generate tokens for multiple clients at the same time. 🤷🏽‍♂️ You would need a pretty beefy card, and probably more than a few of them, to make it usable for multiple clients, even if that number is only 10. At that point, you may as well just use OpenAI's API.

I know that I would need a much stronger GPU, or a stronger system in general. I'm just wondering how much you would need to accomplish something like that, regardless of the price. And of course you would then also need to save the model state for each client and load it when needed.

mikepsinn added a commit to mikepsinn/privateGPT that referenced this issue Jun 11, 2023
imartinez added the primordial (Related to the primordial version of PrivateGPT, which is now frozen in favour of the new PrivateGPT) label on Oct 19, 2023