
BabyLlama with CPP backend #2544

Conversation

@shrinath-suresh (Contributor) commented Aug 28, 2023

Description

Benchmarking BabyLlama deployment with the CPP backend

Setup and Test

  1. Follow the instructions from README.md to set up the cpp backend environment

  2. Download the stories model using

wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin

Download the tokenizer.bin file

Create a config file named config.json with the paths of the downloaded model and tokenizer.

{
"checkpoint_path" : "/home/ubuntu/serve/cpp/stories15M.bin",
"tokenizer_path" : "/home/ubuntu/serve/cpp/src/examples/babyllama/tokenizer.bin"
}
  3. Run the build
cd serve/cpp
./build.sh

Once the build is successful, the libllm_handler.so shared object file will be generated in the serve/cpp/test/resources/torchscript_model/babyllama/llm_handler folder.

  4. Copy the dummy.pt file to the llm_handler folder.
  5. Move to the handler folder and run the following command to generate the mar file
cd serve/cpp/test/resources/torchscript_model/babyllama/babyllama_handler
torch-model-archiver --model-name llm --version 1.0 --serialized-file dummy.pt --handler libbabyllama_handler:BabyLlamaHandler --runtime LSP --extra-files config.json
  6. Move the llm.mar to model_store
mkdir model_store
mv llm.mar model_store/llm.mar
  7. Create a new config.properties file and paste the following content.
default_response_timeout=300000

The default timeout is 120000 ms. When the context size is large, LLM generation takes longer to complete a request on a single-GPU machine.

  8. Start TorchServe
torchserve --start --ncs --ts-config config.properties --model-store model_store/
  9. Register the model using a curl command
curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar"
  10. Update the input in prompt.txt if needed and run
curl http://localhost:8080/predictions/llm -T prompt.txt

Sample response

Hello my name is Daisy. Daisy is three years old. She loves to play with her toys.
One day, Daisy's mommy said, "Daisy, it's time to go to the store." Daisy was so excited! She ran to the store with her mommy.
At the store, Daisy saw a big, red balloon. She wanted it so badly! She asked her mommy, "Can I have the balloon, please?"
Mommy said, "No, Daisy. We don't have enough money for that balloon."
Daisy was sad. She wanted the balloon so much. She started to cry.
Mommy said, "Daisy, don't cry. We can get the balloon. We can buy it and take it home."
Daisy smiled. She was so happy. She hugged her mommy and said, "Thank you, mommy!"
<s>

Benchmarking

Clone the llama2.c repo

git clone https://github.com/karpathy/llama2.c.git

Move to the folder and compile. An executable named run will be generated.

cd llama2.c
make -j

Download the model

wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin

Run inference using the following command

./run stories15M.bin -n 256 -i "Hello my name is" -t 1.0 -p 0.9

Attachments: babyllama_gcc_standalone, babyllama_gcc_standalone.txt

The standalone version generates output at 55.29 tokens per second. The variation is due to compiler options.

See karpathy/llama2.c#116 for CMake build support.

Clone the krrishnarraj/llama2.c fork from that pull request

git clone https://github.com/krrishnarraj/llama2.c.git

Follow the build instructions or run the following commands

mkdir build
cd build
cmake ..
cmake --build .
cp ../tokenizer.bin .
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin

Once the build succeeds, run the standalone binary (output captured in the babyllama_cmake_standalone attachment).

The standalone CMake version generates 147.39 tokens per second.

TorchServe curl request

curl http://localhost:8080/predictions/llm -T prompt.txt

Attachments: babyllama_torchserve, ts_log.txt

BabyLlama with the CPP backend generates 172.3 tokens per second.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

…rsion errors.

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@chauhang added the c++ label Aug 29, 2023
build_transformer(&transformer, checkpoint_path);

char tokenizer_path[] =
"/home/ubuntu/serve/cpp/src/examples/image_classifier/babyllama/"
Contributor:

Path is hard coded at present -- read from config file

Collaborator:

@shrinath-suresh you can also add the tokenizer.bin as an additional file when creating the mar file and set the filename as load_model_request->model_dir + "tokenizer.bin"
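A minimal sketch of that suggestion (assuming run.c's build_tokenizer API as used in this PR; not the exact code from the change):

// Resolve tokenizer.bin relative to the extracted mar directory instead of hard-coding it.
std::string tokenizer_path = load_model_request->model_dir + "tokenizer.bin";
// run.c's build_tokenizer takes a char*, so hand it a mutable buffer.
build_tokenizer(&tokenizer, tokenizer_path.data(), transformer.config.vocab_size);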

Contributor Author:

Updated the code to read the tokenizer and model path from a config file
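For reference, a rough sketch of reading those paths from the packaged config.json (assuming folly for JSON parsing; the PR's actual parsing code may differ):

#include <folly/json.h>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>

// Returns {checkpoint_path, tokenizer_path} from a config.json like the one in the description.
std::pair<std::string, std::string> ReadLlmConfig(const std::string& config_path) {
  std::ifstream in(config_path);
  std::stringstream buffer;
  buffer << in.rdbuf();
  folly::dynamic config = folly::parseJson(buffer.str());
  return {config["checkpoint_path"].asString(), config["tokenizer_path"].asString()};
}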

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@mreso (Collaborator) left a comment:

Hi, thanks for this great contribution! I've left some comments on the PR.

#ifndef LLM_HANDLER_HH_
#define LLM_HANDLER_HH_

#include "run.c"
Collaborator:

Is there a reason why run.c gets included here? It's also listed in the cmake file as a source file, which probably did not work since there is no header file to declare the content. I would recommend removing it from the cmake file and including it in the .cc instead to localize visibility.

Contributor Author:

Moved run.c import to the .cc file

Collaborator:

run.c was published under the MIT license, which requires that we include the copyright notice and license. My proposal is to create a subfolder and include run.c plus the original license file. @chauhang Is that a viable way to proceed?

Contributor Author:

Created a subdirectory named llama2.c and copied the run.c with the license file
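A brief sketch of the resulting include, per the discussion above (handler file name assumed; folder name taken from the comment):

// baby_llama_handler.cc (file name assumed)
// run.c now lives in a llama2.c/ subfolder alongside its MIT license file and is
// included from the .cc (not listed as a CMake source) to keep its symbols local.
#include "llama2.c/run.c"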


float topp = 0.9f; // top-p in nucleus sampling. 1.0 = off. 0.9 works well,
// but slower
int steps = 256; // number of steps to run for
unsigned long long rng_seed = 0;
Collaborator:

Initializing an RNG with 0 (all bits zero) can be problematic in some cases.

Contributor Author:

Removed the initialization


std::string msg = torchserve::Converter::VectorToStr(data_it->second);

char* msgCStr = new char[msg.size() + 1]; // +1 for the null terminator
Collaborator:

Please use smart pointers when allocating dynamic memory and prefer new over malloc.
Something like

std::unique_ptr<int[]> prompt_tokens(new int[(strlen(msgCStr) + 3) * sizeof(int)]);

should work as well.

Contributor Author:

Updated code to use smart pointers in necessary places
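A sketch of the same idea applied to both buffers above (sizing follows the original snippet, but as an element count rather than bytes, since new int[n] already allocates n ints):

#include <cstring>
#include <memory>

std::string msg = torchserve::Converter::VectorToStr(data_it->second);
// RAII-owned copies; no manual delete[] needed on any return path.
std::unique_ptr<char[]> msg_cstr(new char[msg.size() + 1]);
std::strcpy(msg_cstr.get(), msg.c_str());
std::unique_ptr<int[]> prompt_tokens(new int[msg.size() + 3]);  // +3 for '\0', ?BOS, ?EOS, as in run.c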

long_vector.push_back(data_ptr[i]);
}

int* prompt_tokens = new int[num_elements];
Collaborator:

see above

Contributor Author:

Updated code to use a smart pointer


int* prompt_tokens = new int[num_elements];
for (int64_t i = 0; i < num_elements; ++i) {
prompt_tokens[i] = static_cast<int>(long_vector[i]);
Collaborator:

Why can't we just copy the data from the tensor instead of going through long_vector?

Contributor Author:

Updated the logic to directly copy the data from tensor
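A minimal sketch of that change (assuming the incoming tokens tensor is 1-D with an integral dtype):

// Convert once to int32 and copy the contiguous buffer directly; no intermediate long_vector.
torch::Tensor tokens = tokens_list_tensor.to(torch::kInt).contiguous();
std::vector<int> prompt_tokens(tokens.data_ptr<int>(),
                               tokens.data_ptr<int>() + tokens.numel());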

std::pair<std::string&, std::map<uint8_t, std::string>&>& idx_to_req_id,
std::shared_ptr<torchserve::InferenceResponseBatch>& response_batch) {
std::vector<torch::Tensor> tensor_vector;
auto tokens_list_tensor = inputs[0].toTensor();
Collaborator:

Can we extend this to batched processing or at least process all entries in the batch?

Contributor Author:

Working on the batch processing part. Will keep you posted once it is done

@shrinath-suresh (Contributor Author):

@mreso Thanks for your comments. I will address them.

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
… upfront

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@shrinath-suresh (Contributor Author):

@mreso I have addressed most of your comments. Working on updating the code to enable batch processing. Once it is done, I will run a sanity test and update the steps/README with the details. Will keep you posted on this.

@shrinath-suresh changed the title from "[WIP] BabyLlama with CPP backend" to "BabyLlama with CPP backend" on Sep 5, 2023
return batch_ivalue;
}

torch::Tensor LlmHandler::Inference(
@lxning (Collaborator) commented Sep 5, 2023:

Could you add with torch.inference_mode(): ?

Contributor Author:

Is it c10::InferenceMode guard; in cpp? - https://pytorch.org/cppdocs/notes/inference_mode.html

Collaborator:

torch::InferenceMode is a high-level API, c10::InferenceMode is a low-level API. According to the libtorch docs, they are trying to use torch::xxx to unify the low-level APIs.
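For reference, a tiny self-contained example of the guard discussed here (generate_step is a hypothetical helper, not the PR's code; the PR places the guard inside LlmHandler::Inference):

#include <torch/torch.h>

torch::Tensor generate_step(const torch::Tensor& logits) {
  c10::InferenceMode guard;   // no autograd graph or version counters inside this scope
  return logits.argmax(-1);   // any tensor op here runs in inference mode
}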

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@shrinath-suresh (Contributor Author):

@mreso I have addressed most of your comments. Working on updating the code to enable batch processing. Once it is done, I will run a sanity test and update the steps/README with the details. Will keep you posted on this.

Updated code to process all the items in the batch. Please review when you find time and let me know if any other changes are needed.

@shrinath-suresh (Contributor Author):

@lxning Should we add the model and tokenizer download steps to the test script? Do we have any concept of setup and teardown in the cpp backend for each test case? For the unit tests to pass, these files are mandatory.

Comment on lines +23 to +25
manifest_->GetModel().serialized_file),
*device));

Collaborator:

The current cpp backend only supports one device id, which means there is no partitioning across GPU devices.

I assume this example only works on a single GPU.

std::shared_ptr<torchserve::InferenceResponseBatch>& response_batch) {
c10::InferenceMode guard;
std::vector<torch::Tensor> batch_output_vector;
for (const torch::jit::IValue& input : inputs) {
Collaborator:

This for loop predicts each inference request one by one. Can we optimize this section to leverage either C++ multithreading or GPU batching?
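One possible shape for the multithreading suggestion (a sketch only: generate_for_request is a hypothetical per-request function, and run.c's shared Transformer/RunState would need to be made thread-safe or duplicated per worker first):

#include <future>
#include <vector>

std::vector<torch::Tensor> batch_outputs;
std::vector<std::future<torch::Tensor>> futures;
for (const torch::jit::IValue& input : inputs) {
  // Copy the IValue into the task (it is cheaply ref-counted) and run requests concurrently.
  futures.push_back(std::async(std::launch::async,
                               [input] { return generate_for_request(input); }));
}
for (auto& f : futures) {
  batch_outputs.push_back(f.get());
}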

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@mreso (Collaborator) left a comment:

When we remove the absolute paths (and the tests pass) this will be ready to go for now. We can add batched processing in a follow-up PR, @lxning.

@@ -0,0 +1,5 @@
{
"checkpoint_path" : "/home/ubuntu/serve/cpp/stories15M.bin",
Collaborator:

Would be good to make these relative instead of absolute.

@shrinath-suresh (Contributor Author) commented Sep 15, 2023:

@mreso

-rw-rw-r-- 1 ubuntu ubuntu 424K Sep  5 23:55 tokenizer.bin
-rw-rw-r-- 1 ubuntu ubuntu 58M Jul 27 04:09 stories15M.bin

These two files are needed at runtime for the unit test case. I can think of two approaches:

  1. We can download these files in build.sh and remove them once the unit test cases pass.
  2. Add the download logic in cpp - serve/cpp/test/backends/torch_scripted/torch_scripted_backend_test.cc - when executing the test case.

The second option seems more reasonable. What's your opinion?
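A rough sketch of what option 2 could look like (the test fixture name and the wget fallback are illustrative assumptions, not the PR's code):

#include <cstdlib>
#include <filesystem>
#include <gtest/gtest.h>

class BabyLlamaBackendTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // Fetch the checkpoint once if it is not already present next to the test binary.
    if (!std::filesystem::exists("stories15M.bin")) {
      ASSERT_EQ(0, std::system("wget -q https://huggingface.co/karpathy/tinyllamas/"
                               "resolve/main/stories15M.bin"));
    }
  }
};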

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@mreso mentioned this pull request Jan 24, 2024
@mreso (Collaborator) commented Jan 25, 2024:

Closing this in favor of #2903, which picks up all changes and adds adjustments.

@mreso closed this Jan 25, 2024