# Install vLLM

A prior version of the Neuron SDK integration was upstreamed into the main vLLM repository. However, most of the time we want to install from source using the aws-neuron fork.

Instructions are available here:  https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html#nxdi-vllm-user-guide

The steps are also reproduced below. Run the next three cells. The pip installs could take about 5 minutes.

The AWS workshop environment deploys using a Neuron DLAMI with a recent SDK.  If you are deploying this in your own environment, you may need to match the branch to your SDK version or follow the latest instructions at the link above.
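
If you are deploying in your own environment, a quick way to see which SDK components are present is to list the installed package versions. The following is a minimal sketch; the package names are assumptions about a typical Neuron SDK environment, and the mapping from these versions to a branch of the fork is documented at the link above, not derived here.

```python
from importlib import metadata

# Assumed package names for a typical Neuron SDK install; adjust to your environment.
NEURON_PACKAGES = (
    "neuronx-cc",
    "torch-neuronx",
    "neuronx-distributed",
    "neuronx-distributed-inference",
)


def installed_versions(packages=NEURON_PACKAGES):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Compare the reported versions against the SDK release notes before choosing a branch to clone.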

In [None]:
%%bash
git clone -b 2.25.0 https://github.com/aws-neuron/upstreaming-to-vllm.git


In [3]:
%pip install --quiet -r /home/ubuntu/environment/vLLM/upstreaming-to-vllm/requirements/neuron.txt
# Expected to produce no output for 4 or 5 minutes. Remove the --quiet flag if you want to see all the packages being installed, or look in the neuron.txt requirements file.

Note: you may need to restart the kernel to use updated packages.


In [4]:
!VLLM_TARGET_DEVICE="neuron" pip install --quiet -e /home/ubuntu/environment/vLLM/upstreaming-to-vllm/.
# Expected to produce no output for 5 or 6 minutes.

# Download copies of the model to deploy
We download two things from Hugging Face: the precompiled, presharded version of Qwen3-8B, and a copy of the stock Qwen3-8B model without its weights.


In [None]:
!hf download aws-neuron/Qwen3-8BSharded --local-dir /home/ubuntu/environment/qwen3
#this could take 3-4 minutes

In [None]:
!hf download Qwen/Qwen3-8B --local-dir /home/ubuntu/environment/Qwen3-8B --exclude "*.safetensors"
#This is the stock model.  It will only take seconds because we don't need to download the weights.
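
To confirm that the stock model directory holds the configuration but no weight files, you can inspect it after the download finishes. This is a minimal sketch; `summarize_model_dir` is a hypothetical helper, and the presence of a `config.json` file is an assumption based on the usual Hugging Face repository layout.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Report whether a config file exists and which weight files are present."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()] if root.is_dir() else []
    return {
        # config.json is assumed to be the model configuration file name.
        "has_config": (root / "config.json").is_file(),
        "weight_files": sorted(p.name for p in files if p.suffix == ".safetensors"),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


print(summarize_model_dir("/home/ubuntu/environment/Qwen3-8B"))
```

An empty `weight_files` list for the stock model directory is what the `--exclude "*.safetensors"` option is expected to produce.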

# Make sure you restart your kernel
If you get an error saying that vllm could not be found, it is because you did not restart your kernel after installing it above.
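
A quick check after restarting is to ask Python whether the package can be located, without actually importing it. This is a small sketch using only the standard library; `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current kernel can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("vllm"):
    print("vllm is importable in this kernel")
else:
    print("vllm not found: restart the kernel, then re-run the install cells if needed")
```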

# Offline inference example

In this example, we load the precompiled Qwen3 model artifacts (NEFF files) together with the model weights presharded for two Neuron cores. We do this because of the system memory limitations of the trn1.2xlarge (32GB of system RAM). The trn1.2xlarge also has 32GB of device RAM on its Trainium1 device (which has two Neuron cores), but system RAM is usually the limiting factor when compiling.
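
For a sense of scale, the arithmetic below estimates how much memory the weights alone occupy. It is a rough sketch under stated assumptions: about 8 billion parameters (taken from the model name), 2 bytes per parameter (a 16-bit data type), and an even split across the two cores. It ignores the KV cache, activations, and whatever the compiler itself needs.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(num_params, bytes_per_param=2):
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / GIB


# Assumptions: ~8e9 parameters, 16-bit weights, even split over 2 Neuron cores.
total = weight_gib(8e9)
per_core = total / 2
print(f"weights: {total:.1f} GiB total, {per_core:.1f} GiB per core")
```

Under these assumptions the weights take roughly half of the 32GB on either side, which shows why there is limited headroom on this instance size.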

This cell may take about 8 minutes to run.

In [None]:
import os
from vllm import LLM, SamplingParams
os.environ['VLLM_NEURON_FRAMEWORK'] = "neuronx-distributed-inference"
os.environ['NEURON_COMPILED_ARTIFACTS'] = "/home/ubuntu/environment/qwen3"
#os.environ['BASE_COMPILE_WORK_DIR'] = "/home/ubuntu/qwen3/"
llm = LLM(
    model="/home/ubuntu/environment/Qwen3-8B", #model weights
    max_num_seqs=1,
    max_model_len=1024,
    device="neuron",
    tensor_parallel_size=2,
    override_neuron_config={})
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# note that top_k must be set to lower than the global_top_k defined in
# the neuronx_distributed_inference.models.config.OnDeviceSamplingConfig
sampling_params = SamplingParams(top_k=10, temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Free up the Neuron Cores for the next step -- in production, keep the object around to avoid load times and warmup times
del llm

# Online inference example
In this case, we load the model directly from Hugging Face and compile what we need as we go. This is a 1.1B parameter model, so it needs less system RAM to compile than the Qwen3-8B example above.
It may take 5 minutes for the model to download, compile, and run.

# Restart your kernel!!
Restart your kernel before you run the next cell. This stops the Python process from the previous example and releases anything it has loaded onto the Neuron devices.


Because you are running this in a Jupyter notebook, this cell will keep running until you stop it.  The server should remain available (and using the Neuron cores) until you stop it.  

You will run this cell several times with different parameters that your instructor will discuss, and use the guidellm tool in the Benchmark.ipynb notebook to run benchmarks against this server.

Run the next cell and wait until you see something like this (it should take about 5 minutes):
```
INFO:     Started server process [21298]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

In [None]:
!VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
    --model="TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --max-num-seqs=1 \
    --max-model-len=1024 \
    --tensor-parallel-size=2 \
    --port=8080 \
    --device "neuron" 
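
Once the startup messages appear, you can send a test request from a second notebook or a terminal. The sketch below posts to the server's OpenAI-compatible completions endpoint using only the standard library. The `localhost` host, the `max_tokens` value, and the helper names are assumptions for illustration; the port and model name come from the command above.

```python
import json
import urllib.request

# Port 8080 matches the server command; localhost is an assumption.
BASE_URL = "http://localhost:8080"
MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"


def build_completion_request(prompt, model=MODEL, max_tokens=64, temperature=0.8):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=120):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Run this only while the server cell is active:
# print(complete("The capital of France is"))
```

Because the server cell blocks its own notebook, the request has to come from a separate kernel or shell.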