Allows CPU-based execution #235

Open · louiehelm wants to merge 1 commit into main
Conversation

louiehelm

Adds CPU execution to grok-1 model demo

VERY SLOW!

No one should process real world workloads this way.

This is only meant for early dev work by those who don't have 8 x 40GB GPUs.

pip install -r requirements-cpu.txt
sed -i 's/USE_CPU_ONLY = False/USE_CPU_ONLY = True/' run.py
python run.py
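For reference, here is a minimal sketch of how a USE_CPU_ONLY switch can force JAX onto its CPU backend while still exposing 8 host devices for the existing 8-way sharding; the actual change to run.py in this PR may be wired differently:

# Hypothetical sketch; the real run.py diff may differ.
import os

USE_CPU_ONLY = True  # toggled by the sed command above

if USE_CPU_ONLY:
    # Both settings must be applied before jax initializes its backend.
    os.environ["JAX_PLATFORMS"] = "cpu"
    # Emulate 8 devices on the host so the checkpoint's 8-way sharding still maps cleanly.
    os.environ["XLA_FLAGS"] = (
        os.environ.get("XLA_FLAGS", "") + " --xla_force_host_platform_device_count=8"
    )

import jax  # noqa: E402  (imported after the environment is configured)

print(jax.devices())  # expect 8 CPU devices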

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Even on a 72-core Xeon server, these runtimes can require monk-like patience.

So the point isn't to run this end-to-end all day.

It's for developers with high-memory workstations who would rather get this code running slowly than not at all.

Hopefully someone uses this CPU-only workaround early on to bootstrap grok-1 into a more performant model that eventually becomes accessible to a larger pool of devs.

Note: Executing this on most CPUs will emit a series of false warnings about the 8 CPU sub-processes being "stuck". These error messages come from a hardcoded warning inside TensorFlow that doesn't appear to be tunable or suppressible.

Note 2: If memory usage swells too high, comment out the copy_to_shm line in checkpoint.py shown below (and point the inner open() at path instead of tmp_path). This reduces peak memory usage from over 600GB to roughly 320GB. The downside is a slightly slower initial load. The "copy_to_shm" load strategy is likely a good time-to-memory trade-off on xAI's servers, but may not be on your workstation if it triggers an OOM.

def fast_unpickle(path: str) -> Any:
    # with copy_to_shm(path) as tmp_path:
        with open(path, "rb") as f:
            return pickle.load(f)
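An alternative to keeping a commented-out line (just a sketch, not what this PR does) is to gate the shared-memory copy behind an environment variable, so the same checkpoint.py works on both big servers and RAM-constrained workstations:

# Hypothetical variant of fast_unpickle inside checkpoint.py, where os, pickle,
# Any, and copy_to_shm are already in scope. GROK_COPY_TO_SHM is an assumed
# env var name, not part of the repo.
def fast_unpickle(path: str) -> Any:
    if os.environ.get("GROK_COPY_TO_SHM", "1") == "1":
        # Faster load, but roughly doubles peak memory while the copy exists.
        with copy_to_shm(path) as tmp_path:
            with open(tmp_path, "rb") as f:
                return pickle.load(f)
    # Lower peak memory: read the checkpoint in place, at the cost of a slower load.
    with open(path, "rb") as f:
        return pickle.load(f)

On a memory-constrained machine you would then run it as GROK_COPY_TO_SHM=0 python run.py.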

@trholding

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Could you add your systems specs here?

I'll add it to: #42 and #183

@louiehelm (Author)

Could you add your systems specs here? I'll add it to: #42 and #183

CPU: 2 x Intel Xeon E5-2697 v4
Total RAM: 1.5TB

louiehelm requested a review from robvdl on March 27, 2024

inkoil commented Apr 3, 2024

I'm not sure why I got this error:
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
INFO:rank:(1, 8192, 6144)
INFO:rank:(1, 8192, 131072)
Output for prompt: The answer to life the universe and everything is of course
INFO:runners:Precompile 1024
INFO:rank:(1, 1, 6144)
INFO:rank:(1, 1, 131072)
INFO:runners:Compiling...
INFO:rank:(1, 1, 6144)
INFO:rank:(1, 1, 131072)
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot

I'm using a Xeon 5320 + 1TB RAM.
I installed the software using requirements-cpu.txt.

@louiehelm (Author)

I'm not sure why I got this error:

...

jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot

I'm using a Xeon 5320 + 1TB RAM. I installed the software using requirements-cpu.txt.

I assume you included my changes in run.py too? And changed "USE_CPU_ONLY = False" to "USE_CPU_ONLY = True"?

Hopefully this repository isn't abandoned, but it doesn't seem like anyone is maintaining it anymore.

You might be better off running grok-1 in llama.cpp if JAX is crashing for you.
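If you do stay on JAX, it can help to rule out the jax/jaxlib CPU build itself with a bare bf16 dot (a quick check, independent of grok-1):

# Minimal check: can this jaxlib CPU backend run a bf16 dot at all?
# If this raises the same "unsupported operand type BF16 in op dot" error,
# the problem is the jax/jaxlib install (e.g. versions other than the ones
# pinned in requirements-cpu.txt), not the grok-1 code.
import jax
import jax.numpy as jnp

x = jnp.ones((4, 4), dtype=jnp.bfloat16)
y = jnp.ones((4, 4), dtype=jnp.bfloat16)
print(jax.jit(jnp.dot)(x, y))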


pafend commented Apr 7, 2024

For all those who are reading this and struggling but want to run this model once: here is an article on how I managed to get it running for less than $10.

If you want to test things, you might be better off using the more expensive GCP version, because it can be stopped, and then you only pay for storage.

I hope someone finds it helpful.

Article:
https://twitter.com/PascalBauerDE/status/1776792056452546822
Fork:
https://github.com/pafend/grok-1-brev
