## Serving Llama-2 with TorchServe Continuous Batching on Inferentia-2
This notebook demonstrates serving Llama-2-13b with TorchServe continuous batching on an Inferentia-2 `inf2.24xlarge` instance, using the DLAMI "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226".

### Installation
Note: This section can be skipped once the [Neuron DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) releases the latest TorchServe version.

In [None]:
# Install Python venv
!sudo apt-get install -y python3.9-venv g++

# Create Python venv
!python3.9 -m venv aws_neuron_venv_pytorch

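# Activate Python venv and upgrade pip in the same shell
# (each `!` line runs in its own subshell, so activation does not carry over to later lines)
!source aws_neuron_venv_pytorch/bin/activate && python -m pip install -U pip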

# Clone the TorchServe git repository into ~/serve, the path used by the commands below
!git clone https://github.com/pytorch/serve.git ~/serve

# Install dependencies
!python ~/serve/ts_scripts/install_dependencies.py --neuronx --environment=dev

# Install torchserve and torch-model-archiver from source
!cd ~/serve && python ts_scripts/install_from_src.py

### Create model artifacts

Note: If the Neuron SDK does not support safetensors, run `mv model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json model/models--meta-llama--Llama-2-13b-hf/snapshots/dc1d3b3bfdb69df26f8fc966c16353274b138c55/model.safetensors.index.json.bkp` after downloading the model, so that the safetensors index file is not picked up.
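The rename in the note above can also be done without hard-coding the snapshot hash. The following is a minimal sketch, assuming the model was downloaded into a local `model` directory as in this notebook; `hide_safetensors_index` is a hypothetical helper name, not part of TorchServe or the Neuron SDK.

```python
from pathlib import Path


def hide_safetensors_index(model_dir):
    """Rename every model.safetensors.index.json under model_dir to *.bkp.

    Hypothetical helper: it mirrors the `mv` command from the note, but
    searches the directory tree instead of spelling out the snapshot hash.
    """
    renamed = []
    # Materialize the matches first so renaming does not disturb the walk.
    for index_file in sorted(Path(model_dir).rglob("model.safetensors.index.json")):
        backup = index_file.with_name(index_file.name + ".bkp")
        index_file.rename(backup)
        renamed.append(backup)
    return renamed
```

Calling `hide_safetensors_index("model")` after the download step returns the list of renamed paths; an empty list means no index file was found.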

In [None]:
# Log in to the Hugging Face Hub
!huggingface-cli login --token $HUGGINGFACE_TOKEN
!python ~/serve/examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-13b-hf --use_auth_token True

# Create TorchServe model artifacts
!torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
!mv model llama-2-13b
!mkdir -p /home/model-server/model_store
!mv ~/serve/llama-2-13b /home/model-server/model_store

# Start TorchServe; precompilation is complete once the log shows "Model llama-2-13b loaded successfully"
!torchserve --ncs --start --model-store /home/model-server/model_store --models llama-2-13b --ts-config ../config.properties
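Instead of watching the log, readiness can be polled through the TorchServe management API. The following is a minimal sketch, assuming the management API listens on the default `http://localhost:8081` (the `config.properties` used here may override it); `workers_ready` and `wait_until_ready` are hypothetical helper names.

```python
import json
import time
import urllib.request

# Assumption: TorchServe's default management address; config.properties may change it.
MANAGEMENT_URL = "http://localhost:8081"


def workers_ready(describe_response):
    """Return True if the describe-model response has workers and all are READY."""
    workers = [w for model in describe_response for w in model.get("workers", [])]
    return bool(workers) and all(w.get("status") == "READY" for w in workers)


def wait_until_ready(model_name, timeout_s=3600, poll_s=30):
    """Poll GET /models/{model_name} until every worker is READY or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{MANAGEMENT_URL}/models/{model_name}") as resp:
                if workers_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            # Server not reachable yet, model not registered yet, or non-JSON body.
            pass
        time.sleep(poll_s)
    return False
```

`wait_until_ready("llama-2-13b")` returns `True` once the workers report `READY`. The timeout and polling interval are illustrative defaults; compilation time on your instance may differ.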

### Run inference

In [None]:
# Run inference requests with streamed responses
!python ~/serve/examples/large_models/utils/test_llm_streaming_response.py -m llama-2-13b -o 50 -t 2 -n 4 --prompt-text "Today the weather is really nice and I am planning on " --prompt-randomize
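For readers who want to call the endpoint directly rather than through the test script, the following is a minimal sketch of a streaming client. It assumes the inference API listens on the default `http://localhost:8080`, and that the handler accepts a JSON body with `prompt` and `max_new_tokens` keys; that payload format is an assumption, so check `inf2_handler.py` for the fields it actually reads. `build_request` and `stream_completion` are hypothetical helper names.

```python
import codecs
import json
import urllib.request

# Assumption: TorchServe's default inference address.
INFERENCE_URL = "http://localhost:8080"


def build_request(model_name, prompt, max_new_tokens):
    """Build a POST request for /predictions/{model_name}.

    Assumption: the handler expects a JSON body with these two keys.
    """
    payload = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens})
    return urllib.request.Request(
        f"{INFERENCE_URL}/predictions/{model_name}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_completion(model_name, prompt, max_new_tokens=50, chunk_size=64):
    """Yield text pieces as the server streams them back."""
    # An incremental decoder keeps multi-byte characters intact across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(build_request(model_name, prompt, max_new_tokens)) as resp:
        while True:
            chunk = resp.read1(chunk_size)  # return whatever has arrived so far
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                yield text
```

With the server running, `for piece in stream_completion("llama-2-13b", "Today the weather is really nice and I am planning on "): print(piece, end="")` prints the output as it arrives.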