## Serving Llama-2-70B with TorchServe Continuous Batching on Inferentia-2
This notebook demonstrates how to serve Llama-2-70B with TorchServe continuous batching on an Inferentia-2 `inf2.48xlarge` instance, using the DLAMI "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226".

### Installation
Note: This section can be skipped once Neuron DLC 2.16 is released with the latest TorchServe version.

In [None]:
# Install Python venv
!sudo apt-get install -y python3.9-venv g++

# Create Python venv
!python3.9 -m venv aws_neuron_venv_pytorch

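# Activate Python venv
# Each `!` command runs in its own subshell, so `!source .../activate` would not persist.
# Prepending the venv's bin directory to PATH has the same effect for later `!` commands.
import os
os.environ["PATH"] = os.path.abspath("aws_neuron_venv_pytorch/bin") + os.pathsep + os.environ["PATH"]
!python -m pip install -U pip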

# Clone Torchserve git repository
!git clone https://github.com/pytorch/serve.git

# Install dependencies; all following commands run from the serve directory
# (`%cd` persists across cells, unlike `!cd`)
%cd serve
!git checkout feat/inf2_cb
!python ts_scripts/install_dependencies.py --neuronx --environment=dev

# Install torchserve and torch-model-archiver
!python ts_scripts/install_from_src.py
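
Before moving on, it can help to confirm that the two command-line tools installed above are visible to the notebook. The following is a minimal sketch, not part of the TorchServe scripts; the helper name `missing_tools` is hypothetical, and it assumes the notebook kernel uses the same `PATH` as the install commands.

```python
import shutil


def missing_tools(tools=("torchserve", "torch-model-archiver")):
    """Return the names in `tools` that cannot be found on PATH."""
    # shutil.which returns None when no executable with that name is on PATH.
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", missing if missing else "none")
```

An empty result suggests both executables are reachable; a non-empty one usually means the venv's `bin` directory is not on `PATH`.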

### Create model artifacts

Note: If the Neuron SDK does not support safetensors, rename the safetensors index file by running `mv model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json.bkp`.
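
The rename in the note above can also be done from Python once the download has finished, which makes it safe to re-run. This is a minimal sketch under that assumption; the helper name `backup_safetensors_index` is hypothetical, and the snapshot path is the one quoted in the note.

```python
from pathlib import Path


def backup_safetensors_index(snapshot_dir):
    """Rename model.safetensors.index.json to *.bkp; return the new path, or None if absent."""
    index = Path(snapshot_dir) / "model.safetensors.index.json"
    if not index.exists():
        # Nothing to do: not downloaded yet, or already renamed.
        return None
    backup = index.with_name(index.name + ".bkp")
    index.rename(backup)
    return backup


if __name__ == "__main__":
    # Snapshot directory taken from the note above.
    snapshot = (
        "model/models--meta-llama--Llama-2-70b-hf/snapshots/"
        "90052941a64de02075ca800b09fcea1bdaacb939"
    )
    print(backup_safetensors_index(snapshot))
```

Returning `None` when the file is missing keeps the cell idempotent, so running it twice does not raise.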

In [None]:
# Log in to the Hugging Face Hub
!huggingface-cli login --token $HUGGINGFACE_TOKEN
!python examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-70b-hf --use_auth_token True

# Create TorchServe model artifacts
!torch-model-archiver --model-name llama-2-70b --version 1.0 --handler ts/torch_handler/distributed/base_neuronx_continuous_batching_handler.py -r examples/large_models/inferentia2/llama2/requirements.txt --config-file examples/large_models/inferentia2/llama2/continuous_batching/model-config.yaml --archive-format no-archive

!mkdir -p model_store
!mv llama-2-70b model_store
!mv model model_store/llama-2-70b

### Start TorchServe

In [None]:
!torchserve --ncs --start --model-store model_store --models llama-2-70b --ts-config examples/large_models/inferentia2/llama2/config.properties
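
Loading a model of this size can take a while, so it may be useful to wait until the server reports healthy before sending requests. The sketch below polls TorchServe's `/ping` endpoint; it assumes the default inference address `http://localhost:8080`, which the `config.properties` used above may override, and the helper name and timeout values are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_healthy(url="http://localhost:8080/ping", timeout_s=600, interval_s=5):
    """Poll `url` until it returns {"status": "Healthy"} or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = json.loads(resp.read().decode("utf-8"))
                if body.get("status") == "Healthy":
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            # Server not up yet, still loading, or returned a non-JSON body.
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("healthy:", wait_until_healthy())
```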

### Run inference

In [None]:
# Run streaming inference requests with the test client
!python examples/large_models/utils/test_llm_streaming_response.py -m llama-2-70b -o 50 -t 2 -n 4 --prompt-text "Today the weather is really nice and I am planning on " --prompt-randomize
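
For a single request without the test script, a plain HTTP client is enough. The following sketch posts to TorchServe's `/predictions/{model_name}` endpoint and prints the response as it arrives. It assumes the default inference address `http://localhost:8080`; the JSON payload keys `prompt` and `max_new_tokens` are an assumption about what the handler expects and should be checked against the handler source, and both helper names are hypothetical.

```python
import json
import urllib.request


def build_request(model_name, prompt, max_new_tokens=50, base_url="http://localhost:8080"):
    """Build a POST request for the TorchServe predictions endpoint."""
    # Payload keys are an assumption; adjust them to match the handler.
    payload = {"prompt": prompt, "max_new_tokens": max_new_tokens}
    return urllib.request.Request(
        f"{base_url}/predictions/{model_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_response(request, chunk_size=1024):
    """Yield decoded chunks of the response body as they become available."""
    with urllib.request.urlopen(request) as resp:
        while True:
            # read1 returns whatever is available instead of waiting for a full buffer.
            chunk = resp.read1(chunk_size)
            if not chunk:
                break
            yield chunk.decode("utf-8", errors="replace")


if __name__ == "__main__":
    req = build_request("llama-2-70b", "Today the weather is really nice and I am planning on ")
    for piece in stream_response(req):
        print(piece, end="", flush=True)
```

Separating request construction from streaming makes it easy to inspect the URL and payload before anything is sent.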