# Basic Test of Llama and OPT on vLLM

## Environment Setup

Set up the virtual environment with the following commands:
```bash
conda create -n vllm python=3.9
conda activate vllm
python -m pip install vllm==0.5.1
# python -m pip install torchvision==0.18.0  # or whichever version matches your torch.__version__
# Preferably clone into a separate directory:
git clone https://github.com/EleutherAI/lm-evaluation-harness # (use kkgithub or gitee if you can't access GitHub)
cd lm-evaluation-harness
python -m pip install -e ".[vllm]"
python -m pip install "lm_eval[vllm]"
# Optional: use wandb (and weave) for logging and visualization
python -m pip install weave llmuses "lm_eval[wandb]"
```
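
After installation it can help to confirm which versions actually ended up in the environment, since the commented-out torchvision line depends on the installed torch version. The following is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper name, and the package list is an assumption based on the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Distribution names assumed from the install commands above.
    found = installed_versions(["vllm", "torch", "torchvision", "lm_eval"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```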

### Model Download

Put the following at the end of your `~/.bashrc` file:

```bash

# Settings for Hugging Face
export HF_ENDPOINT=https://hf-mirror.com # use the mirror
export HF_HUB_ENABLE_HF_TRANSFER=1 # use hf_transfer to download models faster (requires `pip install hf_transfer`)

```
Alternatively, (uncomment and) run the following commands in your terminal:

In [6]:
# echo "export HF_ENDPOINT=https://hf-mirror.com" >> ~/.bashrc
# echo "export HF_HUB_ENABLE_HF_TRANSFER=1" >> ~/.bashrc
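
When working from a notebook, the same settings can also be applied to the current process only. To be safe, run this before importing `huggingface_hub`, so that the library sees the values when it loads. This is a minimal sketch that leaves any value already exported in the shell untouched.

```python
import os

# Same values as the ~/.bashrc snippet above; setdefault keeps any
# value that is already defined in the environment.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")

print(os.environ["HF_ENDPOINT"], os.environ["HF_HUB_ENABLE_HF_TRANSFER"])
```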

Then run the following commands to download the LLaMA-7b, Llama-2-7b, opt-1.3b, and opt-2.7b models:
- [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b)
- [TheBloke/Llama-2-7B-fp16](https://huggingface.co/TheBloke/Llama-2-7B-fp16/tree/refs%2Fpr%2F6)
- [facebook/opt-1.3b](https://huggingface.co/facebook/opt-1.3b)
- [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)

```bash
huggingface-cli download huggyllama/llama-7b --local-dir /data/llmQuantModels/LLaMA-7b
```

```bash
huggingface-cli download TheBloke/Llama-2-7B-fp16 --local-dir /data/llmQuantModels/Llama-2-7b --revision refs/pr/6
```

```bash
huggingface-cli download facebook/opt-1.3b --local-dir /data/llmQuantModels/opt-1.3b
```

```bash
huggingface-cli download facebook/opt-2.7b --local-dir /data/llmQuantModels/opt-2.7b
```

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct --local-dir /data/llmQuantModels/phi-3-mini
```
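
The individual download commands above follow one pattern, so they can be generated from a table of models. The sketch below only builds and prints the commands; `download_command` and `MODELS` are hypothetical names, and the repository IDs, target directories, and revision are copied from the commands above.

```python
import shlex

# (repo_id, local_dir, revision) taken from the commands above.
MODELS = [
    ("huggyllama/llama-7b", "/data/llmQuantModels/LLaMA-7b", None),
    ("TheBloke/Llama-2-7B-fp16", "/data/llmQuantModels/Llama-2-7b", "refs/pr/6"),
    ("facebook/opt-1.3b", "/data/llmQuantModels/opt-1.3b", None),
    ("facebook/opt-2.7b", "/data/llmQuantModels/opt-2.7b", None),
    ("microsoft/Phi-3-mini-4k-instruct", "/data/llmQuantModels/phi-3-mini", None),
]


def download_command(repo_id, local_dir, revision=None):
    """Build the huggingface-cli argument list for one model."""
    cmd = ["huggingface-cli", "download", repo_id, "--local-dir", local_dir]
    if revision is not None:
        cmd += ["--revision", revision]
    return cmd


for repo_id, local_dir, revision in MODELS:
    # Print only; pass the list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(download_command(repo_id, local_dir, revision)))
```

Keeping the models in one list makes it easy to add or skip a checkpoint without editing several separate cells.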


For the quantized versions:

- [OmniQuant](https://huggingface.co/ChenMnZ/OmniQuant/tree/main)

```bash
huggingface-cli download ChenMnZ/OmniQuant Llama-2-7b-w4a16g128.pth --local-dir /data/llmQuantModels/Llama-2-7b-w4a16g128/
cd utils
python safetensor_converter.py /data/llmQuantModels/Llama-2-7b-w4a16g128/

```


- [TheBloke/LLaMa-7B-GPTQ](https://huggingface.co/TheBloke/LLaMa-7B-GPTQ)
- [TheBloke/LLaMa-7B-GGML](https://huggingface.co/TheBloke/LLaMa-7B-GGML)
- [TheBloke/Llama-2-7B-GGUF](https://huggingface.co/TheBloke/Llama-2-7B-GGUF)
- [TheBloke/Llama-2-7B-GGML](https://huggingface.co/TheBloke/Llama-2-7B-GGML)
- [TheBloke/Llama-2-7B-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ)
- [TheBloke/Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ)

```bash
huggingface-cli download TheBloke/Llama-2-7B-AWQ --local-dir /data/llmQuantModels/Llama-2-7B-AWQ
```

For further compression:
- [PowerInfer/prosparse-llama-2-7b-gguf](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf)

## Basic inference

In [4]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import utils
import torch
import gc
import os

In [18]:
import weave
import wandb

wandb.login(relogin=True)



True

In [3]:
torch.__version__

'2.3.0+cu121'

In [None]:
# Load a model. For a plain LLaMA checkpoint, for example:
# model = LLM("/data/llmQuantModels/Llama-2-7b")
# To use 2 GPUs, uncomment tensor_parallel_size=2 below.

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

model = LLM(
    "/data/llmQuantModels/gemma-2-9b-it-GPTQ",
    # tensor_parallel_size=2,
    trust_remote_code=True,
    quantization="gptq",
    gpu_memory_utilization=0.45,
    dtype=torch.float16,
    # max_model_len=2048,
    enforce_eager=True,
    # load_format="bitsandbytes",
)
# , gpu_memory_utilization=0.5

In [10]:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
tokenizer = AutoTokenizer.from_pretrained("/data/llmQuantModels/gemma-2-9b-it")
messages = [
    {
        "role": "user",
        "content": "which is bigger, 9.9 or 9.11?",
    }
]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

with torch.no_grad():  # gradients are not needed for inference
    outputs = model.generate(formatted_prompt, sampling_params)

print(outputs)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\nGenerated text: {generated_text!r}")

    data = {
        "arrival_time": output.metrics.arrival_time,
        "last_token_time": output.metrics.last_token_time,
        "first_scheduled_time": output.metrics.first_scheduled_time,
        "first_token_time": output.metrics.first_token_time,
        "time_in_queue": output.metrics.time_in_queue,
        "finished_time": output.metrics.finished_time,
    }
    # utils.plot.show_timeline(data)

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.20it/s, est. speed input: 55.07 toks/s, output: 35.25 toks/s]

[RequestOutput(request_id=4, prompt='<bos><start_of_turn>user\nwhich is bigger, 9.9 or 9.11?<end_of_turn>\n<start_of_turn>model\n', prompt_token_ids=[2, 2, 106, 1645, 108, 7769, 603, 14739, 235269, 235248, 235315, 235265, 235315, 689, 235248, 235315, 235265, 235274, 235274, 235336, 107, 108, 106, 2516, 108], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='9.9 is bigger than 9.11. \n', token_ids=(235315, 235265, 235315, 603, 14739, 1178, 235248, 235315, 235265, 235274, 235274, 235265, 235248, 108, 107, 1), cumulative_logprob=-0.33067090436816216, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721714817.44014, last_token_time=1721714817.44014, first_scheduled_time=1721714817.4425955, first_token_time=1721714817.4810452, time_in_queue=0.002455472946166992, finished_time=1721714817.896371), lora_request=None)]
Prompt: '<bos><start_of_turn>user\nwhich is bigger, 9.9 or 9.11?<end_of_turn>\n<start_of_turn>model\n',
Gen
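
The timestamps collected in `data` are absolute Unix times, so the useful quantities are their differences. The following is a minimal sketch using the values from the output above; `latency_summary` is a hypothetical helper, not part of vLLM.

```python
def latency_summary(metrics):
    """Derive durations (in seconds) from RequestMetrics-style timestamps."""
    return {
        "queue_time": metrics["first_scheduled_time"] - metrics["arrival_time"],
        "time_to_first_token": metrics["first_token_time"] - metrics["arrival_time"],
        "decode_time": metrics["finished_time"] - metrics["first_token_time"],
        "total_latency": metrics["finished_time"] - metrics["arrival_time"],
    }


# Timestamps copied from the RequestOutput printed above.
data = {
    "arrival_time": 1721714817.44014,
    "first_scheduled_time": 1721714817.4425955,
    "first_token_time": 1721714817.4810452,
    "finished_time": 1721714817.896371,
}

for name, seconds in latency_summary(data).items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

The computed `queue_time` should agree with the `time_in_queue` value that vLLM reports in the same output.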




### Memory Usage

While the model is running, GPU memory usage can be monitored with the following command:

```bash
watch -n 1 nvidia-smi
```

In [3]:
import torch

print(
    "Memory allocated on CUDA device:",
    torch.cuda.memory_allocated() / 1024 / 1024 / 1024,
)
print(torch.cuda.memory_summary())

Memory allocated on CUDA device: 16.08598518371582
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  16472 MiB |  19380 MiB |   1806 GiB |   1790 GiB |
|       from large pool |  16470 MiB |  19378 MiB |   1800 GiB |   1784 GiB |
|       from small pool |      1 MiB |      5 MiB |      5 GiB |      5 GiB |
|---------------------------------------------------------------------------|
| Active memory         |  16472 MiB |  19380 MiB |   1806 GiB |   1790 GiB |
|       from large pool |  16470 MiB |  19378 MiB |   1800 GiB |   1784 GiB |
|       from small pool |      1 MiB |      5 MiB |      5 GiB |      5 GiB |
|------------

In [1]:
# free memory
# try:
#     del model.llm_engine.model_executor
#     del model
# except Exception as e:
#     print(e)
import gc
import torch

gc.collect()
torch.cuda.set_device(0)
torch.cuda.empty_cache()

## Evaluation Harness

### Basic Examples
Example: evaluating Llama-2-7b on `lambada_openai`:
```bash
lm_eval --model vllm \
    --model_args pretrained="/data/llmQuantModels/Llama-2-7b",tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.7 \
    --tasks lambada_openai \
    --batch_size auto
```
Results:
vllm (pretrained=./models/Llama-2,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.7), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|    Tasks     |Version|Filter|n-shot|  Metric  |   |Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  |0.7407|±  |0.0061|
|              |       |none  |     0|perplexity|↓  |3.3953|±  |0.0669|

Example: evaluating Llama-2-7b on `wikitext`:
```bash
lm_eval --model vllm \
    --model_args pretrained="/data/llmQuantModels/Llama-2-7b",tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.7 \
    --tasks wikitext \
    --batch_size auto
```
Results:
vllm (pretrained=./models/Llama-2,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.7), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
| Tasks  |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.5866|±  |N/A   |
|        |       |none  |     0|byte_perplexity|↓  |1.5017|±  |N/A   |
|        |       |none  |     0|word_perplexity|↓  |8.7942|±  |N/A   |
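
The `--model_args` value in the commands above is a comma-separated list of `key=value` pairs. When sweeping over settings, it can be convenient to build that string from keyword arguments. This is a minimal sketch: `build_model_args` is a hypothetical helper, and the commented-out call to `lm_eval.simple_evaluate` assumes the harness is installed and GPUs are available.

```python
def build_model_args(**kwargs):
    """Join keyword arguments into lm_eval's comma-separated key=value form."""
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


model_args = build_model_args(
    pretrained="/data/llmQuantModels/Llama-2-7b",
    tensor_parallel_size=2,
    dtype="auto",
    gpu_memory_utilization=0.7,
)
print(model_args)

# With lm_eval installed, the string can be passed to the Python API
# (left commented out because it loads the model and needs GPUs):
# import lm_eval
# results = lm_eval.simple_evaluate(
#     model="vllm", model_args=model_args,
#     tasks=["lambada_openai"], batch_size="auto",
# )
```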


In [3]:
from utils import eval
import os
import time
import torch  # type: ignore
import gc
import wandb

gc.collect()
torch.cuda.set_device(0)
torch.cuda.empty_cache()
time.sleep(2)

sets = [
    {"task": "mmlu", "enforce_eager": False, "gpu_memory_utilization": 0.6},
    {"task": "wikitext", "enforce_eager": True, "gpu_memory_utilization": 0.45},
]

set_0 = sets[1]
model_name = "gemma-2-9b-it-GPTQ"
quantization = "GPTQ"

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
# torch.cuda.set_device(1)
# os.environ["WANDB_NOTEBOOK_NAME"] = "basic_vllm.ipynb"

res = eval.eval(
    model_name=model_name,
    enforce_eager=set_0["enforce_eager"],
    tensor_parallel=False,
    tasks=set_0["task"],
    # device="cuda:1",
    gpu_memory_utilization=set_0["gpu_memory_utilization"],
    # max_model_len=2048,
    quantization=quantization,
    infer_dtye="float16",
    # kv_cache_dtype="fp8",
    # use_wandb=False,
)

INFO 07-24 21:22:01 gptq_marlin.py:145] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
INFO 07-24 21:22:01 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/data/llmQuantModels/gemma-2-9b-it-GPTQ', speculative_config=None, tokenizer='/data/llmQuantModels/gemma-2-9b-it-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=1234,

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 

In [5]:
res

{'results': {'gsm8k': {'alias': 'gsm8k',
   'exact_match,strict-match': 0.7065959059893859,
   'exact_match_stderr,strict-match': 0.01254183081546149,
   'exact_match,flexible-extract': 0.7210007581501138,
   'exact_match_stderr,flexible-extract': 0.012354115779970322}},
 'group_subtasks': {'gsm8k': []},
 'configs': {'gsm8k': {'task': 'gsm8k',
   'tag': ['math_word_problems'],
   'dataset_path': 'gsm8k',
   'dataset_name': 'main',
   'training_split': 'train',
   'test_split': 'test',
   'fewshot_split': 'train',
   'doc_to_text': 'Question: {{question}}\nAnswer:',
   'doc_to_target': '{{answer}}',
   'description': '',
   'target_delimiter': ' ',
   'fewshot_delimiter': '\n\n',
   'num_fewshot': 5,
   'metric_list': [{'metric': 'exact_match',
     'aggregation': 'mean',
     'higher_is_better': True,
     'ignore_case': True,
     'ignore_punctuation': False,
     'regexes_to_ignore': [',', '\\$', '(?s).*#### ', '\\.$']}],
   'output_type': 'generate_until',
   'generation_kwargs': {'
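
The full result dictionary is verbose, since it also carries the task configuration. A small helper can reduce it to the headline numbers per task. This is a minimal sketch that assumes the `results` layout shown above; `headline_metrics` is a hypothetical name, and the example input is a subset of the printed dictionary.

```python
def headline_metrics(res):
    """Return {task: {metric: value}} without aliases and stderr entries."""
    summary = {}
    for task, metrics in res["results"].items():
        summary[task] = {
            name: value
            for name, value in metrics.items()
            if name != "alias" and "_stderr" not in name
        }
    return summary


# Subset of the dictionary printed above.
res_example = {
    "results": {
        "gsm8k": {
            "alias": "gsm8k",
            "exact_match,strict-match": 0.7065959059893859,
            "exact_match_stderr,strict-match": 0.01254183081546149,
            "exact_match,flexible-extract": 0.7210007581501138,
            "exact_match_stderr,flexible-extract": 0.012354115779970322,
        }
    }
}

print(headline_metrics(res_example))
```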

In [12]:
import os
import torch


print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))
    torch.cuda.get_device_properties(i)

2.3.0+cu121
True
2
NVIDIA GeForce RTX 4090
NVIDIA GeForce RTX 4090


In [5]:
(16972316672 + 24844107776) / 1024 / 1024 / 1024

38.944580078125
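
The division above converts a byte count to GiB (1 GiB = 1024³ bytes). A small helper makes the unit explicit. `bytes_to_gib` is a hypothetical name, and the two byte counts are simply those from the cell above.

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


total = bytes_to_gib(16972316672 + 24844107776)
print(f"{total:.2f} GiB")  # prints "38.94 GiB"
```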