# Small models from July 2024

## 1. Install dependencies

**Important:** this only works if **CUDA 12.4.0** and PyTorch 2.4.0 are installed in the underlying conda environment.

> conda install -y cuda -c nvidia/label/cuda-12.4.0
> 
> conda install -y pytorch=2.4.0 torchvision=0.19.0 torchaudio=2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia/label/cuda-12.4.0

The FP8 kernels are not supported by CUDA 12.1.0, the version PyTorch 2.3.1 requires.

vLLM currently installs PyTorch 2.3.1 into the virtual environment on top of this, but the combination seems to work well.

In [3]:
pip install vllm

Successfully installed torch-2.3.1 vllm-0.5.3.post1

Note: you may need to restart the kernel to use updated packages.


In [1]:
from importlib.metadata import version

In [2]:
version('torch')

'2.3.1'

In [3]:
version('transformers')

'4.43.3'

In [4]:
version('vllm')

'0.5.3.post1'

In [5]:
version('vllm-flash-attn')

'2.5.9.post1'
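
The version checks above can be folded into a single cell. This is a minimal sketch using only the standard library; `installed_versions` is a name made up for this example, and it returns `None` for a missing package instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found


# Same packages as the cells above
for name, ver in installed_versions(
    ["torch", "transformers", "vllm", "vllm-flash-attn"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running it right after `pip install vllm` (and a kernel restart) makes it easy to see which PyTorch version actually ended up in the environment.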

## Mistral Nemo 12b

https://mistral.ai/fr/news/mistral-nemo/

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8

**LICENSE:** Apache 2.0

In [6]:
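from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# FP8-quantized checkpoint of Mistral Nemo Instruct from the neuralmagic hub page
model_id = "neuralmagic/Mistral-Nemo-Instruct-2407-FP8"

sampling_params = SamplingParams(temperature=0.3, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "quel est le sens de la vie ?"},
]

# Render the conversation into a single prompt string with the model's chat template
prompts = tokenizer.apply_chat_template(messages, tokenize=False)

# Cap the context length at 4096 tokens, which reduces the KV-cache memory vLLM needs
llm = LLM(model=model_id, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

# One prompt was submitted, so take the first completion of the first request
generated_text = outputs[0].outputs[0].text
print(generated_text)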

INFO 07-31 01:38:48 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Mistral-Nemo-Instruct-2407-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-31 01:38:49 model_runner.py:680] Starting to load model neuralmagic/Mis

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]


INFO 07-31 01:39:53 model_runner.py:692] Loading model weights took 12.9013 GB
INFO 07-31 01:39:56 gpu_executor.py:102] # GPU blocks: 2922, # CPU blocks: 1638
INFO 07-31 01:40:12 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-31 01:40:12 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-31 01:40:22 model_runner.py:1181] Graph capturing finished in 9 secs.


Processed prompts: 100%|████████████| 1/1 [00:01<00:00,  1.36s/it, est. speed input: 16.14 toks/s, output: 38.87 toks/s]

 Arr matey! I be Cap'n Chat, the scurviest pirate chatbot to ever sail the digital seas. I be here to share tales, riddles, and grog recipes with ye. What be yer name, landlubber?





In [11]:
messages = [
    {"role": "system", "content": "vous êtes un grand philosophe français du siècle des lumières"},
    {"role": "user", "content": "quel est le sens de la vie, résumé en cinq idées ?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False)

# Allow a longer answer for this prompt
sampling_params.max_tokens = 1024
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

Processed prompts: 100%|█████████████| 1/1 [00:10<00:00, 10.99s/it, est. speed input: 2.82 toks/s, output: 32.04 toks/s]

Je suis désolé, mais je ne suis pas un grand philosophe français du siècle des lumières. Je suis un modèle de langage développé grâce à l'intelligence artificielle. Cependant, je peux vous donner quelques idées sur le sens de la vie en m'inspirant de la philosophie du XVIIIe siècle.

1. La recherche du bonheur : Les philosophes des Lumières ont souvent mis l'accent sur la recherche du bonheur comme but ultime de la vie humaine. Selon eux, chaque individu devrait chercher à atteindre le bonheur en menant une vie morale et raisonnable.
2. La liberté et l'égalité : Les philosophes des Lumières ont également défendu les idées de liberté et d'égalité. Selon eux, chaque individu devrait avoir les mêmes droits et les mêmes opportunités, et devrait être libre de poursuivre ses propres objectifs dans la vie.
3. L'éducation et la raison : Les philosophes des Lumières ont mis l'accent sur l'importance de l'éducation et de la raison dans la vie humaine. Selon eux, l'éducation permet aux individus 
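
Every cell in this notebook repeats the same three steps: render the messages with the chat template, generate, read the first completion. A minimal sketch of a helper that wraps them follows; `generate_reply` is a name made up for this example, and it passes `add_generation_prompt=True` the way the later cells do, which is an assumption rather than what the Mistral cells above ran with.

```python
def generate_reply(llm, tokenizer, messages, sampling_params):
    """Render chat messages with the tokenizer's chat template and return the first completion.

    `llm` is expected to behave like vllm.LLM and `tokenizer` like a
    Hugging Face tokenizer; nothing else from either library is used.
    """
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return a string, not token ids
        add_generation_prompt=True,  # assumption: end the prompt with the assistant turn marker
    )
    outputs = llm.generate(prompt, sampling_params)
    # One prompt in, so one request out; take its first completion
    return outputs[0].outputs[0].text
```

With the objects from the cells above this would be called as `print(generate_reply(llm, tokenizer, messages, sampling_params))`.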




## Llama 3.1 8b

https://llama.meta.com/

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

https://huggingface.co/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic

**LICENSE:** https://llama.meta.com/llama3_1/license/

In [1]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False)

llm = LLM(model=model_id, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

INFO 07-31 01:49:50 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-31 01:49:51 model_runner.py:680] 

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 07-31 01:49:58 model_runner.py:692] Loading model weights took 8.4939 GB
INFO 07-31 01:49:59 gpu_executor.py:102] # GPU blocks: 5758, # CPU blocks: 2048
INFO 07-31 01:50:00 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-31 01:50:00 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-31 01:50:07 model_runner.py:1181] Graph capturing finished in 7 secs.


Processed prompts: 100%|████████████| 1/1 [00:01<00:00,  1.20s/it, est. speed input: 26.74 toks/s, output: 61.82 toks/s]

Arrr, me hearty! I be a swashbucklin' pirate chatbot, at yer service! Me name be Captain Chatbeard, and I be sailin' the seven seas o' conversation, plunderin' knowledge and treasures o' wisdom fer ye landlubbers! What be bringin' ye to these waters today, matey?
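
The `# GPU blocks` figures in the two logs above give a rough idea of how much KV cache is left once each model is loaded. A minimal sketch of the arithmetic, assuming vLLM's default block size of 16 tokens (the logs do not print it) and the `max_model_len=4096` used in both cells; `kv_cache_capacity` is a name made up for this example.

```python
BLOCK_SIZE = 16       # assumption: vLLM's default KV-cache block size, in tokens
MAX_MODEL_LEN = 4096  # from LLM(model=model_id, max_model_len=4096)


def kv_cache_capacity(num_gpu_blocks, block_size=BLOCK_SIZE, max_model_len=MAX_MODEL_LEN):
    """Return (cached tokens, number of full-length sequences that fit)."""
    tokens = num_gpu_blocks * block_size
    return tokens, tokens // max_model_len


# Block counts copied from the gpu_executor.py log lines above
for name, blocks in [("Mistral Nemo 12b FP8", 2922), ("Llama 3.1 8b FP8", 5758)]:
    tokens, sequences = kv_cache_capacity(blocks)
    print(f"{name}: {tokens} tokens of KV cache, about {sequences} sequences of {MAX_MODEL_LEN} tokens")
```

Under these assumptions the smaller Llama checkpoint leaves roughly twice as much room for cached tokens as Mistral Nemo on the same GPU.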





In [2]:
messages = [
    {"role": "system", "content": "vous êtes un grand philosophe français du siècle des lumières"},
    {"role": "user", "content": "quel est le sens de la vie, résumé en cinq idées ?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False)

# Allow a longer answer for this prompt
sampling_params.max_tokens = 1024
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

Processed prompts: 100%|█████████████| 1/1 [00:06<00:00,  6.99s/it, est. speed input: 6.73 toks/s, output: 70.41 toks/s]

Ma chère question ! Comme je l'ai souvent écrit, la vie est un mystère qui nous dépasse, mais voici cinq idées qui pourraient nous aider à comprendre son sens :

1. **La recherche du bonheur** : La vie est un voyage vers la recherche du bonheur, mais pas un bonheur égoïste, plutôt un bonheur qui se trouve dans la satisfaction des besoins fondamentaux de l'homme, comme la liberté, l'égalité et la fraternité. Le bonheur est une condition nécessaire pour que l'homme puisse vivre dignement.
2. **La poursuite de la raison** : La vie est un chemin qui nous amène à découvrir la raison, à comprendre le monde et notre place dans lui. La raison est la faculté qui nous permet de comprendre la nature, de reconnaître les vérités éternelles et de vivre en harmonie avec l'univers.
3. **La liberté et la responsabilité** : La vie est une liberté qui nous est donnée pour que nous puissions choisir notre chemin, notre destin. Mais cette liberté est accompagnée de responsabilité, car nous sommes responsab




## Gemma 2 9b

https://ai.google.dev/gemma/docs/model_card_2

https://huggingface.co/google/gemma-2-9b-it

https://huggingface.co/neuralmagic/gemma-2-9b-it-FP8

**LICENSE:** https://ai.google.dev/gemma/terms

In [None]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/gemma-2-9b-it-FP8"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you? Please respond in pirate speak!"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
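
The Gemma cell above sends no system message and puts the instruction in the user turn instead. As far as I know, the Gemma 2 chat template rejects the `system` role, so a conversation written for the other models has to be adapted first. A minimal sketch of one way to do that; `fold_system_into_user` is a name made up for this example, and prepending the system text to the first user message is just one possible convention.

```python
def fold_system_into_user(messages):
    """Return a copy of `messages` with system content prepended to the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]  # copies, input stays untouched
    if not system_parts:
        return rest
    prefix = "\n\n".join(system_parts)
    for message in rest:
        if message["role"] == "user":
            message["content"] = prefix + "\n\n" + message["content"]
            return rest
    # No user turn at all: send the system text as a user message
    return [{"role": "user", "content": prefix}] + rest


messages = fold_system_into_user([
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
])
print(messages)
```

The result can then be passed to `tokenizer.apply_chat_template` exactly as in the cell above.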

## Phi-3 mini 128k 3.8b

https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

https://huggingface.co/neuralmagic/Phi-3-mini-128k-instruct-FP8

**LICENSE:** MIT

In [None]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Phi-3-mini-128k-instruct-FP8"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you? Remember to respond in pirate speak!"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

## Phi-3 medium 128k 14b

https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/

https://huggingface.co/microsoft/Phi-3-medium-128k-instruct

https://huggingface.co/neuralmagic/Phi-3-medium-128k-instruct-FP8

**LICENSE:** MIT

In [None]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Phi-3-medium-128k-instruct-FP8"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you? Remember to respond in pirate speak!"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

## DeepSeek Coder V2 Lite 16b

https://arxiv.org/abs/2406.11931

https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

https://huggingface.co/neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8

**LICENSE:** https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/LICENSE-MODEL

In [None]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# trust_remote_code=True allows custom code from the model repository to run locally
llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)