
Seriously, convert ggml to ggjt v1 #10

Closed
alxspiker opened this issue May 12, 2023 · 9 comments
@alxspiker
Contributor

> This sounds promising. I was asking myself what can be done by playing around with the LlamaCppEmbeddings. Keep me posted.
>
> A change in models would be the first; then we should tweak the argument.

Originally posted by @su77ungr in #8 (comment)

Okay, not kidding, I've been digging and trying so many things, and I've been learning a lot about how binary files are handled and loaded into memory. Still working on it, but here's another find: I converted my Alpaca 7B model from ggml to ggjt v1 using convert.py from the llama.cpp repo. Instead of using mlock every time, the model is now loaded with mmap, so it seems to only load what it needs, and it has produced slower results:

llama.cpp: loading model from ./models/new.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Starting to index  1  documents @  729  bytes in Qdrant
File ingestion start time: 1683859305.4884982

llama_print_timings:        load time =  7616.03 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  7615.40 ms /     6 tokens ( 1269.23 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  7660.61 ms

llama_print_timings:        load time =  7616.03 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 14750.81 ms /     6 tokens ( 2458.47 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 14821.94 ms
Time to ingest files: 24.345433473587036 seconds

I was confused at first because LlamaCppEmbeddings() doesn't support the use_mmap argument but LlamaCpp() does. I haven't messed with LlamaCpp() yet, but I changed use_mlock to True in LlamaCppEmbeddings() and got the quick results back.

llama.cpp: loading model from ./models/new.bin                                                                 
llama_model_load_internal: format     = ggjt v1 (latest)                                                       
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Starting to index  1  documents @  729  bytes in Qdrant
File ingestion start time: 1683859472.9084902

llama_print_timings:        load time =  4136.82 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  4128.81 ms /     6 tokens (  688.14 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  4172.68 ms

llama_print_timings:        load time =  4136.82 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  3408.32 ms /     6 tokens (  568.05 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  3423.27 ms
Time to ingest files: 9.958016633987427 seconds

But then...

I realized that, because the converted model didn't have to load completely into memory when use_mlock was left at its default of False, the initial load only seemed instant. So I needed to measure the entire script time, including model loading, rather than just the ingestion time, to get accurate speed results.
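As a reference, here is a minimal sketch of that measurement, assuming a hypothetical ingest_documents() helper in place of the project's actual ingestion step (the model path is also just an example):

# Sketch: measure total run time (including model load) vs. ingestion time only.
# ingest_documents() is a hypothetical placeholder for the real ingestion step.
import time
from langchain.embeddings import LlamaCppEmbeddings

script_start = time.time()

llama = LlamaCppEmbeddings(use_mlock=False, model_path="./models/new.bin")

ingest_start = time.time()
ingest_documents(llama)  # placeholder: index documents into Qdrant here
print(f"Time to ingest files: {time.time() - ingest_start} seconds")

print(f"Total run time: {time.time() - script_start} seconds")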

Results

# Here is use_mlock=True on ggjt v1 model after using 
# convert.py from llamacpp repo to convert my Alpaca7b ggml model
llama = LlamaCppEmbeddings(use_mlock=True, model_path="./models/new.bin")

Time to ingest files: 7.395503520965576 seconds
Total run time: 18.770099639892578 seconds
# Here is use_mlock=False on ggjt v1 model after using 
# convert.py from llamacpp repo to convert my Alpaca7b ggml model
llama = LlamaCppEmbeddings(use_mlock=False, model_path="./models/new.bin")

Time to ingest files: 15.162402868270874 seconds
Total run time: 16.933820724487305 seconds

So for a small ingestion, the converted model doesn't impact performance as much as I thought, and it DOES INSANELY REDUCE MEMORY USAGE. I might be able to load way bigger models now (lord have mercy on my RAM). That minor improvement might add up with bigger documents, though; I just don't have the time to test large files.

@alxspiker
Contributor Author

alxspiker commented May 12, 2023

P.S. I got tokenizer.model from Hugging Face and convert.py from llama.cpp, put them in the parent folder of my Alpaca 7B ggml model (named model.bin), and ran this from the shell: python .\convert.py .\models\ --outfile new.bin
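A rough sketch of that setup; the layout is inferred from the description above and the paths are purely illustrative:

# Inferred layout (illustrative, not required):
#   .\convert.py          <- from the llama.cpp repo
#   .\tokenizer.model     <- downloaded from Hugging Face
#   .\models\model.bin    <- the original Alpaca 7B ggml model
# Running the same conversion from Python instead of the shell:
import subprocess

subprocess.run(
    ["python", r".\convert.py", ".\\models\\", "--outfile", "new.bin"],
    check=True,
)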

@su77ungr
Owner

Awesome. Looks like a weekend without any sleep again, haha. I think Vicuna 13B should be our goal, since it's the best-performing model at this point. It might also be worth taking a look at FastChat.

If you could craft a routine to convert ggml models, it would increase accessibility and keep things bootstrapped and simple.

Also, feel free to commit your benchmark .txt file // I'm using the default demo files.

I'm around 108 ms per token with vic7b @ i5-9600k.

@alxspiker
Contributor Author

This is starLLM, automated to ask "What is my name?" against the text I had ingested into it.

# use_mmap=True
llm = LlamaCpp(use_mmap=True, model_path=local_path, callbacks=callbacks, verbose=True)

llama_print_timings:        load time =  8441.23 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  8440.31 ms /     6 tokens ( 1406.72 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  8500.26 ms
 It sounds like your name is Alex.

> Question:
What is my name?

> Answer:
 It sounds like your name is Alex.

> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 47.66585969924927 seconds

and

# use_mmap=False
llm = LlamaCpp(use_mmap=False, model_path=local_path, callbacks=callbacks, verbose=True)

llama_print_timings:        load time =  6395.35 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  6394.58 ms /     6 tokens ( 1065.76 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  6507.05 ms
 Your name is Alexandra.

> Question:
What is my name?

> Answer:
 Your name is Alexandra.

> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 42.63529133796692 seconds

So I'm not sure mmap does much here; I'm also not sure why yet, or how LangChain integrates that argument.
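For a fairer comparison, here is a minimal sketch of timing one full run per setting, assuming local_path points at the converted model (LangChain's LlamaCpp wrapper should forward use_mmap to llama.cpp, but that's worth verifying in your version):

# Sketch: time one full run (model load + prompt eval) for each use_mmap setting.
import time
from langchain.llms import LlamaCpp

local_path = "./models/new.bin"  # example path to the converted ggjt v1 model

def timed_run(use_mmap: bool) -> float:
    start = time.time()
    llm = LlamaCpp(model_path=local_path, use_mmap=use_mmap, verbose=True)
    llm("What is my name?")  # same automated question as above
    return time.time() - start

for flag in (True, False):
    print(f"use_mmap={flag}: {timed_run(flag):.2f} seconds total")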

@alxspiker
Contributor Author

> Awesome. Looks like a weekend without any sleep again, haha. I think Vicuna 13B should be our goal, since it's the best-performing model at this point. It might also be worth taking a look at FastChat.
>
> If you could craft a routine to convert ggml models, it would increase accessibility and keep things bootstrapped and simple.
>
> Also, feel free to commit your benchmark .txt file // I'm using the default demo files.
>
> I'm around 108 ms per token with vic7b @ i5-9600k.

I'm gonna craft an auto-convert for when your model shows up as an older format like ggml. I could probably even support .pth and such. People will be thankful; I can't believe the performance difference. I'll also work on / look into Vicuna if you can test it. I'll try to download the model, but my area's internet is slow and not stable.
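A minimal sketch of how that detection step might look, assuming the llama.cpp file-magic constants of the time; this is an illustration, not the routine that ends up in the repo:

# Sketch: detect an older ggml/ggmf file by its 4-byte magic and decide whether
# to run convert.py on it. Magic constants mirror llama.cpp's header values at
# the time; double-check them against the llama.cpp source you are using.
import struct

GGML_MAGIC = 0x67676D6C  # unversioned ggml
GGMF_MAGIC = 0x67676D66  # ggmf v1
GGJT_MAGIC = 0x67676A74  # ggjt v1 (mmap-able)

def needs_conversion(model_path: str) -> bool:
    with open(model_path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return magic in (GGML_MAGIC, GGMF_MAGIC)

if needs_conversion("./models/model.bin"):
    print("Old ggml/ggmf format detected; run convert.py to get ggjt v1.")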

@su77ungr
Owner

su77ungr commented May 12, 2023

Why are your runtimes at 1000 ms per token? Can you shoot me your hardware specs, please?

Also, are you using :memory: for testing?

Then we'd be able to craft a benchmark script. Yep, auto-convert seems reasonable.
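For context, :memory: here means Qdrant's in-process local mode, so benchmark numbers aren't skewed by disk I/O. A minimal sketch, assuming a langchain/qdrant-client version with local-mode support and with docs standing in for the loaded documents:

# Sketch: build the vector store in Qdrant's in-memory local mode for benchmarking.
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Qdrant

embeddings = LlamaCppEmbeddings(use_mlock=True, model_path="./models/new.bin")
db = Qdrant.from_documents(
    docs,                    # placeholder: your loaded and split documents
    embeddings,
    location=":memory:",     # in-process only, nothing written to disk
    collection_name="benchmark",
)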

@alxspiker
Contributor Author

No, I haven't messed around with that yet; I'm just using the DB from the SSD.

System Manufacturer	LENOVO
System Model	81EM
System Type	x64-based PC
System SKU	LENOVO_MT_81EM_BU_idea_FM_ideapad FLEX 6-14IKB
Processor	Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 1992 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Mode	UEFI
Platform Role	Mobile
Installed Physical Memory (RAM)	8.00 GB
Available Virtual Memory	21.1 GB

IDK if that's what you need?

@su77ungr
Owner

su77ungr commented May 12, 2023

I'm getting >60 ms per token hits. Running six threads.

Haven't touched ggml conversion yet. Also did not force RAM since I'm only at 16 GiB.

@alxspiker did you try f16_kv=True?

Also, ggml-vic7b-uncensored-q4 has format=ggjt baked in. This might be a reason for its speed.
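For reference, a one-line sketch of trying that flag (f16_kv keeps the KV cache in half precision; the model path is just an example):

# Sketch: enable the half-precision KV cache alongside mlock; path is an example.
from langchain.embeddings import LlamaCppEmbeddings

llama = LlamaCppEmbeddings(model_path="./models/new.bin", use_mlock=True, f16_kv=True)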

@alxspiker
Contributor Author

> I'm getting >60 ms per token hits. Running six threads.
>
> Haven't touched ggml conversion yet. Also did not force RAM since I'm only at 16 GiB.
>
> @alxspiker did you try f16_kv=True?
>
> Also, ggml-vic7b-uncensored-q4 has format=ggjt baked in. This might be a reason for its speed.

823.11 ms per token

@su77ungr
Owner

su77ungr commented May 12, 2023

Your issue changed my life. My terminal session is close to real time. This is incredible. I'm going to upload the converted ggjt-v1 models to Hugging Face so it's way easier for people to interact with them.
converted vic-7b here
