# Install guidellm

In [1]:
# Eventually switch to the 0.3 release once it is available; until then, pin a known commit
%pip install --quiet git+https://github.com/vllm-project/guidellm.git@1261fe8

Note: you may need to restart the kernel to use updated packages.



# Time for the benchmark
Running the next cell creates a synchronous_benchmarks.csv file in the current directory. You should be able to click on it in the file listing on the left.

If you have a spreadsheet program on your laptop, right-click the file and "Download" it, then open it locally. We'll explore it in the next section.

This benchmark should take only about 30 seconds to run (`--max-seconds 30`). It sends a single request at a time ("synchronous"), so the metrics approximate what a single user of the endpoint would see.

In [4]:
!guidellm benchmark \
  --target "http://localhost:8080" \
  --rate-type synchronous \
  --max-seconds 30 \
  --data "prompt_tokens=125,output_tokens=56" \
  --output-path=synchronous_benchmarks.csv

Creating backend...
Backend openai_http connected to http://localhost:8080 for model                
TinyLlama/TinyLlama-1.1B-Chat-v1.0.                                             
Creating request loader...
Created loader with 1000 unique requests from                                   
prompt_tokens=129,output_tokens=56.                                             
                                                                                
                                                                                

# Exploration of the synchronous_benchmarks.csv file

In column EV (keep scrolling to the right!) you should see the mean Time to First Token (TTFT). This is the average time it took to generate the first token, which is mostly the time spent processing the prompt.

Look in column EY to see how much this varied.

In the EZ column you can see the Total Time for Output Token Mean.  This is the average time it took to generate each token AFTER the prompt was processed and the first token was generated.

Column FH shows the Total Output Tokens per Second Mean (which is related to the Time per Output Token). Column FK shows the variation. We are only running the tests for 30 seconds, so some of the queries probably get cut off, which would explain the zeros.
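Scrolling to column EV by hand is error-prone, so it can help to read the file programmatically. The sketch below uses only Python's standard library. The helper names `column_letter_to_index` and `read_column` are made up for this illustration, and it assumes that the first CSV row is a header and that the spreadsheet letters above match your version of the file.

```python
import csv


def column_letter_to_index(letters: str) -> int:
    """Convert a spreadsheet column label such as 'EV' to a zero-based index."""
    label = letters.strip().upper()
    if not label:
        raise ValueError("empty column label")
    index = 0
    for ch in label:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        # Spreadsheet labels are a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def read_column(path: str, letters: str) -> tuple[str, list[str]]:
    """Return (header, values) for one spreadsheet-style column of a CSV file.

    Assumes the first row is a header row (an assumption about the file layout).
    """
    col = column_letter_to_index(letters)
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    header, *data = rows
    # Skip rows that are too short to contain the requested column.
    return header[col], [row[col] for row in data if len(row) > col]
```

For example, `read_column("synchronous_benchmarks.csv", "EV")` returns the header of column EV together with its values as strings, so you can confirm that the header really is the TTFT mean before trusting the letter.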

# Running some tests

Did you wonder why the prompt tokens were set to 125? Change the value to 129 and run the benchmark again. (This overwrites the synchronous_benchmarks.csv file, but you already have the original copy locally.)

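You should have originally measured an average TTFT (column EV) of about 15.6 milliseconds.

After you change to 129, you should see column EV jump to over 17 milliseconds. Why such an increase from just a few tokens? Buckets! The likely explanation is that the server pads each prompt up to the next fixed-size bucket, so a prompt that crosses a bucket boundary is processed as a noticeably longer sequence.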
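One way to picture the bucket effect: some inference servers pad each prompt up to the next size in a fixed list of sequence-length buckets, so a prompt that barely crosses a boundary is processed as if it were much longer. The sketch below is only an illustration. The bucket sizes and the `pick_bucket` helper are hypothetical and are not read from this server's configuration.

```python
from bisect import bisect_left

# Hypothetical bucket sizes (sorted ascending), chosen only for illustration.
BUCKETS = [128, 256, 512, 1024]


def pick_bucket(prompt_tokens: int, buckets: list[int] = BUCKETS) -> int:
    """Return the smallest bucket that can hold the prompt."""
    position = bisect_left(buckets, prompt_tokens)
    if position == len(buckets):
        raise ValueError(f"prompt of {prompt_tokens} tokens exceeds the largest bucket")
    return buckets[position]


for tokens in (125, 129):
    bucket = pick_bucket(tokens)
    print(f"{tokens} prompt tokens -> bucket {bucket} ({bucket - tokens} padding tokens)")
```

With these assumed sizes, 125 tokens fit in the first bucket while 129 spill into the next one, which is the kind of step change the TTFT numbers hint at.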

# Throughput tests

If you run the throughput benchmark cell below against the server, you should see the Total Output Tokens per Second median in column FI of throughput_benchmarks.csv (the file name changed!). Columns FJ and FK show the standard deviation and outliers, and FH is the mean. If you needed to rely on these numbers, you should DEFINITELY run the tests for longer than 120 seconds, because we expect some variability.
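To see why the mean (FH) and the median (FI) can disagree, the sketch below summarizes a small list of made-up per-request throughput values with Python's `statistics` module. The numbers are invented for illustration, including the single zero that stands in for a request that was cut off. They are not measurements from this benchmark.

```python
import statistics


def summarize(values: list[float]) -> dict[str, float]:
    """Return the mean, median, and sample standard deviation of the values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


# Invented tokens-per-second values; the 0.0 mimics a request that was cut off.
sample = [100.0, 102.0, 98.0, 101.0, 0.0]
summary = summarize(sample)
print(summary)
```

A single zero pulls the mean well below the median here, which is one reason to look at both columns and to prefer longer runs.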

# More changes!

Now we are going to experiment with the batch size. In the other notebook, press stop (the square) next to the cell, change the line that says `--max-num-seqs=1 \` to the batch size assigned by your instructor, and share your results from columns FI and FH.

If you are doing this on your own, increase output_tokens and try max-num-seqs values from 2 to 128.
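If you are sweeping several batch sizes on your own, it helps to give each run its own output file so that results are not overwritten. The sketch below only builds the command line, reusing the flags from the cells in this notebook. The `build_benchmark_command` helper and the file-naming scheme are assumptions made for this example, and the server still has to be reconfigured with the matching `--max-num-seqs` value in the other notebook before each run.

```python
import shlex


def build_benchmark_command(max_num_seqs: int, output_tokens: int = 120) -> list[str]:
    """Build a guidellm throughput command whose output file is tagged by batch size."""
    return [
        "guidellm", "benchmark",
        "--target", "http://localhost:8080",
        "--rate-type", "throughput",
        "--max-seconds", "120",
        "--data", f"prompt_tokens=125,output_tokens={output_tokens}",
        # Hypothetical naming scheme: one CSV per batch size.
        f"--output-path=throughput_benchmarks_seqs{max_num_seqs}.csv",
    ]


for batch in (2, 8, 32, 128):
    # Print the command instead of running it; run it once the server uses this batch size.
    print(shlex.join(build_benchmark_command(batch)))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; paste each line into a cell (prefixed with `!`) after the server has been switched to the corresponding batch size.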


In [13]:
!guidellm benchmark \
  --target "http://localhost:8080" \
  --rate-type throughput \
  --max-seconds 120 \
  --data "prompt_tokens=125,output_tokens=120" \
  --output-path=throughput_benchmarks.csv

Creating backend...
Backend openai_http connected to http://localhost:8080 for model                
TinyLlama/TinyLlama-1.1B-Chat-v1.0.                                             
Creating request loader...
Created loader with 1000 unique requests from                                   
prompt_tokens=125,output_tokens=120.                                            
                                                                                
                                                                                