## Serving Llama-2 with TorchServe Continuous Batching on Inferentia-2
This notebook demonstrates how to serve Llama-2-13b with TorchServe continuous batching on an Inferentia-2 `inf2.24xlarge` instance.
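
To make the term concrete: with continuous batching, scheduling happens per decoding step rather than per request, so a sequence that finishes frees its batch slot right away and a waiting request can take it. The toy simulation below is only a sketch of that scheduling idea under simplified assumptions (one token per request per step, no prefill cost). It does not use any TorchServe or Neuron API, and `simulate`, `token_counts`, and `batch_size` are hypothetical names chosen for illustration.

```python
from collections import deque


def simulate(token_counts, batch_size, continuous=True):
    """Return the number of decode steps needed to finish all requests.

    token_counts: tokens each request must generate (hypothetical workload).
    batch_size:   number of sequences decoded together in one step.
    continuous:   if True, refill free slots after every step; if False,
                  wait until the whole batch has finished (static batching).
    """
    waiting = deque(token_counts)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Admit requests: at every step when continuous, otherwise only
        # when the previous batch has fully drained.
        if continuous or not active:
            while waiting and len(active) < batch_size:
                active.append(waiting.popleft())
        # One decode step produces one token for every active request.
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Finished requests leave the batch and free their slots.
        active = [remaining for remaining in active if remaining > 0]
    return steps


if __name__ == "__main__":
    workload = [2, 8, 2, 2]
    print("static steps:    ", simulate(workload, batch_size=2, continuous=False))
    print("continuous steps:", simulate(workload, batch_size=2, continuous=True))
```

In this toy workload the short requests no longer wait behind the long one, so the continuous schedule needs fewer decode steps than the static one.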

### Build a custom Docker container that includes the code changes from this [PR](https://github.com/pytorch/serve/pull/2803).
You can skip this section once the [Neuron DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) ships the latest TorchServe release.

In [None]:
!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
!docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuronx:1.13.1-neuronx-py310-sdk2.15.0-ubuntu20.04

In [None]:
!cat Dockerfile

In [None]:
!docker build -t neuron-sdk-215:torchserve-cb .

In [None]:
# Create the model store directory, then open a shell inside the Docker container
!mkdir model_store

!docker run -it -v model_store:/home/model-server/model_store --device /dev/neuron0:/dev/neuron0  --device /dev/neuron1:/dev/neuron1  --device /dev/neuron2:/dev/neuron2  --device /dev/neuron3:/dev/neuron3  --device /dev/neuron4:/dev/neuron4  --device /dev/neuron5:/dev/neuron5 neuron-sdk-215:torchserve-cb  bash

In [None]:
# Log in to the Hugging Face Hub and download the model
!huggingface-cli login --token $HUGGINGFACE_TOKEN
!python ~/serve/examples/large_models/utils/Download_model.py --model_path model --model_name models--meta-llama--Llama-2-13b-hf --use_auth_token

# Create TorchServe model artifacts
!torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
!mkdir -p /home/model-server/model_store
!mv llama-2-13b /home/model-server/model_store

# Precompilation is complete once the log shows "Model llama-2-13b loaded successfully"
!torchserve --ncs --start --model-store /home/model-server/model_store --models llama-2-13b --ts-config ../config.properties

# Exit the container

### Run inference

In [None]:
# Start the container
!docker run -it -v model_store:/opt/ml/model --device /dev/neuron0:/dev/neuron0  --device /dev/neuron1:/dev/neuron1  --device /dev/neuron2:/dev/neuron2  --device /dev/neuron3:/dev/neuron3  --device /dev/neuron4:/dev/neuron4  --device /dev/neuron5:/dev/neuron5 -p 8080:8080 -p 8081:8081 -p 8082:8082 neuron-sdk-215:torchserve-cb
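
Before sending requests, it can help to confirm that the model's workers have finished loading. The sketch below polls the TorchServe management API (`GET /models/{model_name}` on port 8081, published by the `docker run` command above). It assumes the response is a JSON list whose entries carry a `workers` list with a `status` field; treat that shape, the helper names `workers_ready` and `wait_until_ready`, and the timeout values as assumptions to check against your TorchServe version.

```python
import json
import time
import urllib.error
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"  # assumption: management port published as 8081
MODEL_NAME = "llama-2-13b"


def workers_ready(describe_payload):
    """Return True if every worker in a describe-model response reports READY."""
    if not isinstance(describe_payload, list):
        return False  # e.g. an error object instead of a model description
    workers = [
        worker
        for entry in describe_payload
        if isinstance(entry, dict)
        for worker in entry.get("workers", [])
    ]
    return bool(workers) and all(w.get("status") == "READY" for w in workers)


def wait_until_ready(timeout_s=1800, poll_s=10):
    """Poll the management API until the model is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = f"{MANAGEMENT_URL}/models/{MODEL_NAME}"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if workers_ready(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not reachable yet or response not parseable; retry
        time.sleep(poll_s)
    return False
```

Calling `wait_until_ready()` from a separate cell or terminal returns `True` once all workers report `READY`, or `False` if the timeout expires first.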

In [None]:
# Run single inference request
!python test_stream_response.py
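
For orientation, a streaming client can look roughly like the following sketch. It is not the content of `test_stream_response.py`; it only assumes the standard TorchServe inference endpoint `POST /predictions/{model_name}` on port 8080. The JSON field names `prompt` and `max_new_tokens` are assumptions about what the handler expects, and `build_request` and `stream_response` are hypothetical helper names.

```python
import codecs
import json
import urllib.request

INFERENCE_URL = "http://localhost:8080/predictions/llama-2-13b"


def build_request(prompt, max_new_tokens=50):
    """Build the POST request; the JSON field names are assumptions about the handler."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens}).encode("utf-8")
    return urllib.request.Request(
        INFERENCE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_response(prompt, chunk_size=64):
    """Yield decoded text pieces as soon as the server sends them."""
    # An incremental decoder keeps multi-byte characters intact across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        while True:
            chunk = resp.read1(chunk_size)  # returns what is available, b"" at EOF
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                yield text
```

With the server running, `for piece in stream_response("Today the weather is"): print(piece, end="", flush=True)` prints the generated text as it arrives.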

In [None]:
# Run multiple inference requests concurrently
!./test.sh
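
Concurrent requests are what let continuous batching show its effect, because several sequences are then in flight at once. As an alternative illustration (not a reproduction of the shell script above), the sketch below sends several prompts in parallel from Python. The payload fields `prompt` and `max_new_tokens` are assumptions about the handler, and `send_prompt` and `run_concurrently` are hypothetical helper names.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

INFERENCE_URL = "http://localhost:8080/predictions/llama-2-13b"


def send_prompt(prompt):
    """POST one prompt and return the complete response text.

    The JSON payload shape is an assumption about the handler.
    """
    body = json.dumps({"prompt": prompt, "max_new_tokens": 50}).encode("utf-8")
    req = urllib.request.Request(
        INFERENCE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_concurrently(prompts, send=send_prompt, max_workers=4):
    """Send all prompts from parallel threads; results keep the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

With the server running, `run_concurrently(["Today the weather is", "Once upon a time"])` returns one response string per prompt. Passing a different `send` callable makes it easy to try the helper without a server.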