Max Threads = Poor Performance on 8-thread processor and GGJT model after convert.py
TL;DR: Try setting n_threads to 6 instead of 8 if you have an 8-thread processor. I'm getting consistently faster results than when using all 8 of my threads.
I've been doing some testing with a GGJT model to get the best performance out of a small laptop. I ran 2 tests for each n_threads value, and the tests were conducted with nothing else open.
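In llama-cpp-python the thread count is just the n_threads constructor argument, so a "leave some headroom" default can be derived at runtime. This is only a sketch: the minus-two heuristic is my own assumption based on the numbers below, not a documented rule, and `os.cpu_count()` reports logical threads, not physical cores.

```python
import os

# os.cpu_count() reports logical threads (8 on this laptop).
# Leaving a couple of threads free for the OS and sampling loop is
# an assumption drawn from the benchmarks below, not a documented rule.
logical_threads = os.cpu_count() or 1
n_threads = max(1, logical_threads - 2)
print(n_threads)  # 6 on an 8-thread CPU
```

On an 8-thread machine this yields 6, the setting that performed best in my tests.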
Results on an 8-thread CPU
n_threads=1
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 14464.13 ms
llama_print_timings: sample time = 20.63 ms / 40 runs ( 0.52 ms per run)
llama_print_timings: prompt eval time = 14463.85 ms / 19 tokens ( 761.26 ms per token)
llama_print_timings: eval time = 38962.48 ms / 39 runs ( 999.04 ms per run)
llama_print_timings: total time = 57510.54 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 14054.52 ms
llama_print_timings: sample time = 24.77 ms / 40 runs ( 0.62 ms per run)
llama_print_timings: prompt eval time = 14054.15 ms / 19 tokens ( 739.69 ms per token)
llama_print_timings: eval time = 50090.37 ms / 39 runs ( 1284.37 ms per run)
llama_print_timings: total time = 69022.43 ms
n_threads=2
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 9662.71 ms
llama_print_timings: sample time = 22.36 ms / 40 runs ( 0.56 ms per run)
llama_print_timings: prompt eval time = 9662.48 ms / 19 tokens ( 508.55 ms per token)
llama_print_timings: eval time = 25339.74 ms / 39 runs ( 649.74 ms per run)
llama_print_timings: total time = 39422.48 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 13699.18 ms
llama_print_timings: sample time = 27.64 ms / 40 runs ( 0.69 ms per run)
llama_print_timings: prompt eval time = 13698.78 ms / 19 tokens ( 720.99 ms per token)
llama_print_timings: eval time = 27051.24 ms / 39 runs ( 693.62 ms per run)
llama_print_timings: total time = 46124.61 ms
n_threads=4
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 9804.36 ms
llama_print_timings: sample time = 29.62 ms / 40 runs ( 0.74 ms per run)
llama_print_timings: prompt eval time = 9803.58 ms / 19 tokens ( 515.98 ms per token)
llama_print_timings: eval time = 22367.64 ms / 39 runs ( 573.53 ms per run)
llama_print_timings: total time = 38015.92 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 7894.51 ms
llama_print_timings: sample time = 23.41 ms / 40 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 7894.35 ms / 19 tokens ( 415.49 ms per token)
llama_print_timings: eval time = 17166.80 ms / 39 runs ( 440.17 ms per run)
llama_print_timings: total time = 29655.03 ms
n_threads=6
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 8732.21 ms
llama_print_timings: sample time = 29.93 ms / 40 runs ( 0.75 ms per run)
llama_print_timings: prompt eval time = 8731.88 ms / 19 tokens ( 459.57 ms per token)
llama_print_timings: eval time = 26798.23 ms / 39 runs ( 687.13 ms per run)
llama_print_timings: total time = 41384.27 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 4623.47 ms
llama_print_timings: sample time = 21.79 ms / 40 runs ( 0.54 ms per run)
llama_print_timings: prompt eval time = 4623.19 ms / 19 tokens ( 243.33 ms per token)
llama_print_timings: eval time = 17870.62 ms / 39 runs ( 458.22 ms per run)
llama_print_timings: total time = 26962.23 ms
n_threads=7 (Seems better than 8, but not as good as 6)
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 13266.94 ms
llama_print_timings: sample time = 22.37 ms / 40 runs ( 0.56 ms per run)
llama_print_timings: prompt eval time = 13266.64 ms / 19 tokens ( 698.24 ms per token)
llama_print_timings: eval time = 31370.05 ms / 39 runs ( 804.36 ms per run)
llama_print_timings: total time = 49092.33 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 9676.00 ms
llama_print_timings: sample time = 30.28 ms / 40 runs ( 0.76 ms per run)
llama_print_timings: prompt eval time = 9675.46 ms / 19 tokens ( 509.23 ms per token)
llama_print_timings: eval time = 51035.98 ms / 39 runs ( 1308.61 ms per run)
llama_print_timings: total time = 66633.10 ms
n_threads=8 (Max threads)
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 31573.62 ms
llama_print_timings: sample time = 23.12 ms / 40 runs ( 0.58 ms per run)
llama_print_timings: prompt eval time = 31573.35 ms / 19 tokens ( 1661.76 ms per token)
llama_print_timings: eval time = 80649.37 ms / 39 runs ( 2067.93 ms per run)
llama_print_timings: total time = 119573.09 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 31926.09 ms
llama_print_timings: sample time = 22.00 ms / 40 runs ( 0.55 ms per run)
llama_print_timings: prompt eval time = 31925.73 ms / 19 tokens ( 1680.30 ms per token)
llama_print_timings: eval time = 67654.42 ms / 39 runs ( 1734.73 ms per run)
llama_print_timings: total time = 103776.36 ms
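For a quick side-by-side, the per-run eval times from the logs above can be averaged with a short snippet. The numbers are copied verbatim from the two tests per setting; with only two fairly noisy runs each, treat the averages as rough.

```python
# Average eval time per run (ms) for each n_threads value,
# taken from the two llama_print_timings logs per setting above.
eval_ms = {
    1: (999.04, 1284.37),
    2: (649.74, 693.62),
    4: (573.53, 440.17),
    6: (687.13, 458.22),
    7: (804.36, 1308.61),
    8: (2067.93, 1734.73),
}
for n, runs in sorted(eval_ms.items()):
    print(f"n_threads={n}: {sum(runs) / len(runs):8.2f} ms/run avg")
```

Whatever the exact sweet spot on a given machine, n_threads=8 is clearly the worst setting here, at roughly three to four times the eval time of the mid-range values.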
alxspiker changed the title from "Performance Suggestion" to "Performance Suggestion / Benchmarks" on May 16, 2023.
Script used for benchmarking:
Requires llama-cpp-python==0.1.49
import json
import argparse
from llama_cpp import Llama

# Usage (script name is arbitrary): python benchmark.py -m ./newggjt.bin
parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="./newggjt.bin")
args = parser.parse_args()

# n_threads=6 gave the best results on this 8-thread CPU
llm = Llama(model_path=args.model, n_threads=6)
stream = llm(
    "Question: What are the names of the planets in the solar system? Answer: ",
    max_tokens=48,
    stop=["Q:", "\n"],
    stream=True,
)
for output in stream:
    print(output["choices"][0]["text"], end="")
    # print(json.dumps(output, indent=2))