# Intro
This notebook runs training and inference directly on the local EC2 instance, whereas the other notebooks use SageMaker.

# Install requirements

We'll use the same files as the SageMaker training, so we'll first move to the assets directory and run our scripts from there.

In [1]:
%cd /home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets

/home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets




In [2]:
%pip install -r requirements.txt


Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting optimum-neuron==0.3.0 (from -r requirements.txt (line 1))
  Downloading optimum_neuron-0.3.0-py3-none-any.whl.metadata (16 kB)
Collecting peft==0.16.0 (from -r requirements.txt (line 2))
  Downloading peft-0.16.0-py3-none-any.whl.metadata (14 kB)
Collecting trl==0.11.4 (from -r requirements.txt (line 3))
  Downloading trl-0.11.4-py3-none-any.whl.metadata (12 kB)
Collecting huggingface_hub==0.33.4 (from -r requirements.txt (line 4))
  Downloading huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Collecting datasets==3.6.0 (from -r requirements.txt (line 5))
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting transformers~=4.51.0 (from optimum-neuron==0.3.0->-r requirements.txt (line 1))
  Using cached transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting accelerate==1.8.1 (from
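The pip output above shows the versions that requirements.txt pins. A minimal sketch for confirming that the running kernel actually sees those versions is given here. `parse_pins` and `check_pins` are hypothetical helper names, and the pins are copied from the log lines above rather than read from the real requirements.txt, which may list more entries.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(requirements_text):
    """Return {package: pinned_version} for lines of the form 'name==version'."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Map each package to (pinned, installed, matches); installed is None if absent."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed, installed == pinned)
    return report


# Assumption: these pins are taken from the pip log above, not from the file itself.
example_requirements = """\
optimum-neuron==0.3.0
peft==0.16.0
trl==0.11.4
huggingface_hub==0.33.4
datasets==3.6.0
"""

for name, (pinned, installed, ok) in check_pins(parse_pins(example_requirements)).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: pinned {pinned}, installed {installed} -> {status}")
```

The sketch only handles exact `==` pins. Range specifiers such as `~=` and environment markers are skipped.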

# Training

We use the same training script as the SageMaker examples. Here we launch it directly with torchrun and pass the same parameters that we would have passed to the SageMaker job. See the Finetune-TinyLlama-1.1B notebook for more information on the parameters.

Additionally, this example uses a Qwen model (Qwen/Qwen3-1.7B).



In [3]:
!torchrun --nnodes 1 --nproc_per_node 2 \
finetune_llama.py \
--bf16 True --dataloader_drop_last True --disable_tqdm True --gradient_accumulation_steps 1 \
--gradient_checkpointing True --learning_rate 5e-05 --logging_steps 10 --lora_alpha 32 \
--lora_dropout 0.05 --lora_r 16 --max_steps 1000 \
--model_id Qwen/Qwen3-1.7B --output_dir ~/environment/ml/qwen \
--per_device_train_batch_size 2 --tensor_parallel_size 2 \
--tokenizer_id Qwen/Qwen3-1.7B

W1004 21:15:25.793000 27737 torch/distributed/run.py:774] 
W1004 21:15:25.793000 27737 torch/distributed/run.py:774] *****************************************
W1004 21:15:25.793000 27737 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1004 21:15:25.793000 27737 torch/distributed/run.py:774] *****************************************
Traceback (most recent call last):
  File "/home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets/finetune_llama.py", line 15, in <module>
    from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/__init__.py", line 19, in <module>
    from .utils.training_utils import patch_transformers_for_neuron_sdk
  File "/home/ubuntu/.local/lib/python3.10/site-packages/opt
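The torchrun cell above passes every training parameter as a `--name value` pair. As a rough sketch, the same command can be assembled from a dictionary, which makes it easier to change one value between runs. `build_torchrun_command` is a hypothetical helper, not part of the training script. The parameter names and values are copied from the cell above.

```python
import os
import shlex


def build_torchrun_command(script, script_args, nnodes=1, nproc_per_node=2):
    """Return the torchrun invocation as an argument list (hypothetical helper)."""
    command = [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
        script,
    ]
    # Each entry becomes "--name value", the form used in the cell above.
    for name, value in script_args.items():
        command += [f"--{name}", str(value)]
    return command


training_args = {
    "bf16": True,
    "dataloader_drop_last": True,
    "disable_tqdm": True,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": True,
    "learning_rate": 5e-05,
    "logging_steps": 10,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_r": 16,
    "max_steps": 1000,
    "model_id": "Qwen/Qwen3-1.7B",
    # Expand "~" here because no shell will do it when the list is run directly.
    "output_dir": os.path.expanduser("~/environment/ml/qwen"),
    "per_device_train_batch_size": 2,
    "tensor_parallel_size": 2,
    "tokenizer_id": "Qwen/Qwen3-1.7B",
}

command = build_torchrun_command("finetune_llama.py", training_args)
print(shlex.join(command))
# To launch it from Python instead of the shell:
#   import subprocess; subprocess.run(command, check=True)
```

Printing the command before running it gives a quick way to compare it with the shell version.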

# Compilation

Since everything is installed locally, we don't need a training job as we do on SageMaker. We can call the optimum-cli command directly.

The training process runs a merge script at the end, so the export reads the merged model from the merged_model directory under output_dir and saves the compiled model to the compiled_model directory.
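The relationship between the training output directory and the two model directories can be sketched with pathlib. The directory names and export flags are the ones used in this notebook. `build_export_command` is a hypothetical helper introduced only for illustration.

```python
import shlex
from pathlib import Path

# --output_dir from the training run, written out in full.
output_dir = Path("/home/ubuntu/environment/ml/qwen")
merged_model_dir = output_dir / "merged_model"      # produced by the merge step
compiled_model_dir = output_dir / "compiled_model"  # destination of the export


def build_export_command(model_dir, compiled_dir, sequence_length=512,
                         batch_size=1, num_cores=2):
    """Return the optimum-cli export invocation as an argument list."""
    return [
        "optimum-cli", "export", "neuron",
        "--model", str(model_dir),
        "--task", "text-generation",
        "--sequence_length", str(sequence_length),
        "--batch_size", str(batch_size),
        "--num_cores", str(num_cores),
        str(compiled_dir),
    ]


export_command = build_export_command(merged_model_dir, compiled_model_dir)
print(shlex.join(export_command))

# A missing merged_model directory usually means training did not finish.
if not merged_model_dir.is_dir():
    print(f"Note: {merged_model_dir} does not exist yet; run training first.")
```

Checking for the merged_model directory before exporting gives a clearer message than a failed export when training has not completed.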

In [4]:
!optimum-cli export neuron --model /home/ubuntu/environment/ml/qwen/merged_model --task text-generation --sequence_length 512 --batch_size 1 --num_cores 2 /home/ubuntu/environment/ml/qwen/compiled_model


Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/optimum-cli", line 3, in <module>
    from optimum.commands.optimum_cli import main
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/commands/__init__.py", line 16, in <module>
    from .env import EnvironmentCommand
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/commands/env.py", line 22, in <module>
    from ..neuron.utils import is_neuron_available, is_neuronx_available
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/__init__.py", line 19, in <module>
    from .utils.training_utils import patch_transformers_for_neuron_sdk
  File "/home/ubuntu/.local/lib/python3.10/site-packages/optimum/neuron/utils/training_utils.py", line 23, in <module>
    from neuronx_distributed.pipeline import NxDPPModel
ModuleNotFoundError: No module named 'neuronx_distributed'
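The traceback above ends with a ModuleNotFoundError for neuronx_distributed, so the kernel cannot import one of the packages that optimum.neuron depends on. One possible cause, which the output does not confirm, is that the notebook kernel is not the Python environment where the Neuron SDK is installed. A small preflight check along these lines can surface missing modules before launching a long command. `missing_modules` is a hypothetical helper, and the module list is only illustrative.

```python
import sys
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that the current interpreter cannot find."""
    missing = []
    for name in module_names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised modules.
            found = False
        if not found:
            missing.append(name)
    return missing


# Illustrative list: neuronx_distributed comes from the traceback above,
# the others are the top-level packages this notebook imports or launches.
required = ["torch", "optimum", "neuronx_distributed", "vllm"]

print(f"Interpreter: {sys.executable}")
absent = missing_modules(required)
if absent:
    print("Missing modules:", ", ".join(absent))
else:
    print("All listed modules can be located.")
```

Printing `sys.executable` alongside the result shows which Python installation the kernel is using, which helps when several environments exist on the instance.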


# Inference

First we install the vllm extra of Optimum Neuron, then we run inference with the compiled model.

In [5]:
%pip install optimum-neuron[vllm]


Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting vllm==0.9.2 (from optimum-neuron[vllm])
  Downloading vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl.metadata (15 kB)
Collecting cachetools (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading cachetools-6.2.0-py3-none-any.whl.metadata (5.4 kB)
Collecting sentencepiece (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (10 kB)
Collecting blake3 (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading blake3-1.0.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (217 bytes)
Collecting py-cpuinfo (from vllm==0.9.2->optimum-neuron[vllm])
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting fastapi>=0.115.0 (from fastapi[standard]>=0.115.0->vllm==0.9.2->optimum-neuron[vllm])
  Downloadin

In [6]:
from vllm import LLM, SamplingParams

llm = LLM(
    model="/home/ubuntu/environment/ml/qwen/compiled_model",  # local compiled model
    max_num_seqs=1,          # same as --batch_size 1 used at export
    max_model_len=512,       # same as --sequence_length 512 used at export
    device="neuron",
    tensor_parallel_size=2,  # same as --num_cores 2 used at export
    override_neuron_config={})
example1="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)<|im_end|>
<|im_start|>user
How many departments are led by heads who are not mentioned?<|im_end|>
<|im_start|>assistant
"""
example2="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)<|im_end|>
<|im_start|>user
What are the ids of all students for courses and what are the names of those courses?<|im_end|>
<|im_start|>assistant
"""
example3="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE table_name_9 (wins INTEGER, year VARCHAR, team VARCHAR, points VARCHAR)<|im_end|>
<|im_start|>user
Which highest wins number had Kawasaki as a team, 95 points, and a year prior to 1981?<|im_end|>
<|im_start|>assistant
"""

prompts = [
    example1,
    example2,
    example3
]

# Prompt plus generated tokens must fit in the 512-token sequence length used at export.
sampling_params = SamplingParams(max_tokens=256, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)

print("#########################################################")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\n Generated text: {generated_text!r} \n")



INFO 10-04 21:18:16 [__init__.py:39] Available plugins for group vllm.platform_plugins:
INFO 10-04 21:18:16 [__init__.py:41] - optimum_neuron -> optimum.neuron.vllm.plugin:register
INFO 10-04 21:18:16 [__init__.py:44] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
ERROR 10-04 21:18:16 [__init__.py:57] Failed to load plugin optimum_neuron
ERROR 10-04 21:18:16 [__init__.py:57] Traceback (most recent call last):
ERROR 10-04 21:18:16 [__init__.py:57]   File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/plugins/__init__.py", line 54, in load_plugins_by_group
ERROR 10-04 21:18:16 [__init__.py:57]     func = plugin.load()
ERROR 10-04 21:18:16 [__init__.py:57]   File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
ERROR 10-04 21:18:16 [__init__.py:57]     module = import_module(match.group('module'))
ERROR 10-04 21:18:16 [__init__.py:57]   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_modul

RuntimeError: Device string must not be empty