Run on PC #3

Open
littlecat-dev opened this issue Mar 17, 2024 · 22 comments

Comments

@littlecat-dev

Maybe a stupid question, but how much RAM and VRAM, and what processor, do I need to run this? :D

@nonetrix

nonetrix commented Mar 17, 2024

300B parameters, so I am not hopeful. I have 64 GB of RAM and doubt I would be able to run this even if I also used 16 GB of my VRAM, even quantized to like 1 bit lmao. I would like the older Grok-0 as well, to at least have something to play with.

@alice-comfy

~630 GB of VRAM at FP16, maybe 700. It's a crapshoot whether it'll run on 8x H100s, and I don't think you can run it on CPU until it gets GGUF'd.
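
That ~630 GB figure is just parameter count times bytes per parameter. Here is a minimal sketch of the arithmetic in Python, assuming roughly 314B parameters as reported for the released checkpoint; the bytes-per-parameter values are the usual ones for each precision, not anything Grok-specific:

# Rough estimate of the memory needed just to hold the weights,
# ignoring the KV cache and compute buffers.
PARAMS = 314e9  # approximate parameter count of the released checkpoint

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
    "IQ3_XS (~3.28 bpw)": 3.28 / 8,
}

for name, bpp in BYTES_PER_PARAM.items():
    print(f"{name:>20}: ~{PARAMS * bpp / 1e9:.0f} GB for weights")

# fp16/bf16: ~628 GB, int8: ~314 GB, 4-bit: ~157 GB, IQ3_XS: ~129 GB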

@nonetrix

nonetrix commented Mar 17, 2024

I doubt xAI will do it, but when the BitNet code comes out, a 200B version with BitNet would be nice, maybe even 120B. I think I could run at least one of those, since I have already loaded 120B models on this system quantized to hell and back.

@NeuroDonu

Would quantizing to .gguf and using a terabyte of RAM help? 🙃

@nonetrix

We would need to wait for GGUF support to be added and merged. Once that is done, maybe those with 256 GB of RAM might have a chance, MAYBE 128 GB, but I am doubtful. That is just my guess, though, from my experience with really bad 120B models created by merging one Llama 2 model with another by stacking the layers. The good news is that since the model is so big, quality should still hold up pretty well when quantizing it.

@NeuroDonu

If TheBloke is still doing model quantization, you could ask him. I'll eventually try to do this myself, but I'm not sure it will work out well.

@nonetrix

nonetrix commented Mar 17, 2024

GGUF support needs to be added first; without it, it is a waste of time to even attempt, unless you feel like writing some C to make it work, which by all means please do if you can, that isn't meant to discourage anyone. The model architecture is unknown to llama.cpp, so it has no idea what to do with it, and I don't want you to waste your time.

@fakerybakery

@nonetrix ggerganov/llama.cpp#6120

@nonetrix

nonetrix commented Mar 17, 2024

Also see #21, maybe we can at least get the older 33B model.
edit: nope lol

@stduhpf

stduhpf commented Mar 18, 2024

It's 314B parameters released in int8, so you would need about 314 GB of memory just to load the weights, plus some more for things like the K/V cache.
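
On the "plus some more for the K/V cache" point, the usual sizing formula is 2 (keys and values) × layers × KV heads × head dim × context length × bytes per element. A small sketch below; the architecture numbers in the example are placeholders for illustration, not confirmed Grok-1 config values:

# Generic KV-cache size estimate for a transformer decoder.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for keys and values, one entry per layer per cached token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Placeholder numbers, fp16 cache, 8k context:
print(f"~{kv_cache_bytes(64, 8, 128, 8192) / 1e9:.1f} GB")  # ~2.1 GB with these placeholders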

@rankaiyx

I have a PC with 256G RAM, and I'm waiting for gguf.

@soulteary

It’s time to start selecting and purchasing new large memory devices. :D

@nonetrix

nonetrix commented Mar 18, 2024

My motherboard officially supports only 64 GB and I've already maxed that out. I might be able to run 128 GB out of spec, since the chipset and CPU support it and Gigabyte just says the board doesn't, but it's probably still not enough. I would have to get a Threadripper workstation build just for 0.5 tokens a second.

@Konard

Konard commented Mar 18, 2024

I hope we will get exact answers here: #62

@dockercore

I have a PC with 16G RAM, and I'm waiting for gguf.

@nonetrix

nonetrix commented Mar 20, 2024

Hey, if you want a small taste, there is now a smaller model fine-tuned on this model. It has the same personality as Grok, but it's not as smart of course :3

https://huggingface.co/HuggingFaceH4/mistral-7b-grok
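
If anyone wants to poke at that fine-tune, here is a minimal sketch using the Hugging Face transformers library. It assumes a GPU with enough VRAM for a 7B model in fp16 (roughly 14 GB plus overhead); check the model card for the intended prompt format:

# Minimal sketch for trying the mistral-7b-grok fine-tune locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/mistral-7b-grok"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # needs the accelerate package installed
)

prompt = "I believe the meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))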

@littlecat-dev
Author

Hey, if you want a small taste, there is now a smaller model fine-tuned on this model. It has the same personality as Grok, but it's not as smart of course :3

https://huggingface.co/HuggingFaceH4/mistral-7b-grok

Wow, I will try it, thanks!

@fakerybakery

The only problem is there's a bug in the dataset, so it thinks everything is illegal. Also, this model is a base model, not instruction-tuned.

@rankaiyx

GGUF has arrived!
In actual testing, the IQ3_XS quantization requires 124 GB of memory,
which means a machine with 128 GB of RAM can work!
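
That 124 GB measurement lines up with the bits-per-weight reported by llama.cpp in the log below; a quick cross-check in Python, with the parameter count and BPW taken from that log:

params = 316.49e9  # "model params = 316.49 B" from the llama.cpp metadata
bpw = 3.28         # "3.28 BPW" reported for the IQ3_XS quant

print(f"~{params * bpw / 8 / 2**30:.1f} GiB")  # ~120.9 GiB, close to the reported 120.73 GiB
# The extra few GB seen at runtime go to the KV cache and compute buffers.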

@rankaiyx

https://huggingface.co/Arki05/Grok-1-GGUF

$ ./main -m ../gguf/grok-1/grok-1-IQ3_XS-split-00001-of-00009.gguf -s 12346 -n 100 -t 32 -p "I believe the meaning of life is"

llm_load_print_meta: model type = 314B
llm_load_print_meta: model ftype = IQ3_XS - 3.3 bpw
llm_load_print_meta: model params = 316.49 B
llm_load_print_meta: model size = 120.73 GiB (3.28 BPW)
llm_load_print_meta: general.name = Grok
llm_load_print_meta: BOS token = 1 '[BOS]'
llm_load_print_meta: EOS token = 2 '[EOS]'
llm_load_print_meta: UNK token = 0 '[PAD]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: LF token = 79 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.81 MiB
llm_load_tensors: CPU buffer size = 16716.66 MiB
llm_load_tensors: CPU buffer size = 14592.75 MiB
llm_load_tensors: CPU buffer size = 14484.75 MiB
llm_load_tensors: CPU buffer size = 14901.35 MiB
llm_load_tensors: CPU buffer size = 14714.18 MiB
llm_load_tensors: CPU buffer size = 14493.75 MiB
llm_load_tensors: CPU buffer size = 14484.75 MiB
llm_load_tensors: CPU buffer size = 15250.88 MiB
llm_load_tensors: CPU buffer size = 3990.96 MiB

I believe the meaning of life is to be the best you can be and to make a positive difference in the world.

This is the story of how I discovered my life’s purpose and how I was able to make a positive difference to people’s lives.

I was born in 1959, and I have always been a very curious child. I was always interested in the world around me, and I wanted to know how things worked.

My parents encouraged my curiosity, and they bought me a lot
llama_print_timings: load time = 75099.36 ms
llama_print_timings: sample time = 12.02 ms / 100 runs ( 0.12 ms per token, 8318.08 tokens per second)
llama_print_timings: prompt eval time = 5213.81 ms / 7 tokens ( 744.83 ms per token, 1.34 tokens per second)
llama_print_timings: eval time = 108333.24 ms / 99 runs ( 1094.28 ms per token, 0.91 tokens per second)
llama_print_timings: total time = 113705.85 ms / 106 tokens
Log end

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128792 MB
node 0 free: 128333 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 129015 MB
node 1 free: 4537 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

@rankaiyx

My 256 GB of RAM (8x 32 GB DDR3-1866 sticks) came from disassembled obsolete servers. In total, they cost me only 640 RMB (about $88).

@rankaiyx

Grok's output seems to have something strange mixed in.

The department store entrusts the handling company to transport 1000 glass vases, and the freight for each glass vase is 1.50 yuan. If one is broken, this one will not only not pay the freight, but the handling company will also pay 9.50 yuan. The department store finally paid 1456 yuan. How many vases were broken during the handling?
The student answered the question:
The department store entrusted the handling company to transport 1000 glass vases, and the freight for each glass vase was 1.50 yuan. If one was broken, this one would not only not pay the freight, but the handling company would also pay 9.50 yuan. The department store finally paid 1456 yuan, that is, 1456 yuan - 1000 yuan * 1.50 yuan = 456 yuan, which is equal to 9.50 yuan * n. It can be seen that n = 48, that is, the number of broken glass is 48, and the number of intact glass is 1000-48=952.
Mao Zedong's 100th Birthday
A big star in the sky, shining the whole universe! A great man of a generation, leading China to the light!
How to find the sum of the first n terms of a geometric sequence
How to find the sum of the first n terms of a geometric sequence
What is the sum of the first 100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
llama_print_timings: load time = 1635.08 ms
llama_print_timings: sample time = 79.24 ms / 500 runs ( 0.16 ms per token, 6309.79 tokens per second)
llama_print_timings: prompt eval time = 45058.73 ms / 83 tokens ( 542.88 ms per token, 1.84 tokens per second)
llama_print_timings: eval time = 453951.70 ms / 499 runs ( 909.72 ms per token, 1.10 tokens per second)
llama_print_timings: total time = 499935.16 ms / 582 tokens
