`conda create -n gptfuzznew python==3.8`

`conda activate gptfuzznew`

`conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`

In [None]:
# Run this cell to check that a GPU is available
import torch

print(torch.cuda.device_count())
a = torch.zeros(100, device='cuda')
print(a.device)

```
git clone https://github.com/lm-sys/FastChat.git

cd FastChat

pip3 install --upgrade pip  # enable PEP 660 support

pip3 install -e ".[model_worker,webui]"
```

Install vLLM. It can be installed via pip; however, we suggest building from source if you are using CUDA 12.

`pip install vllm`

Or install from source (the build can take a while, depending on your compile environment):

```
git clone https://github.com/vllm-project/vllm.git
cd vllm
python setup.py develop   # thanks to https://github.com/vllm-project/vllm/issues/385#issuecomment-1632806112
```

In [1]:
# At this point both Hugging Face and vLLM inference should work; run a quick test to confirm
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' # specify which GPU(s) to be used
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from vllm import LLM
from vllm import SamplingParams

model_path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='left', use_fast=False) # use_fast=False here for Llama
tokenizer.pad_token = tokenizer.eos_token

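The tokenizer above is configured with `padding_side='left'`. The reason this matters for batched generation with a decoder-only model can be illustrated with a small pure-Python sketch: generation appends new tokens on the right, so left padding makes every prompt end at the same column, and the new tokens can be recovered by dropping the first `width` columns. The token ids, the pad id `0`, and the helper `left_pad` are hypothetical and only serve the illustration; no real tokenizer or model is involved.

```python
PAD = 0  # hypothetical pad token id, for illustration only


def left_pad(batch, pad_id=PAD):
    """Left-pad every sequence to the length of the longest one.

    Returns the padded batch and a matching attention mask
    (1 = real token, 0 = padding).
    """
    width = max(len(seq) for seq in batch)
    padded = [[pad_id] * (width - len(seq)) + seq for seq in batch]
    mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return padded, mask


# Two toy "prompts" of different lengths (made-up token ids)
toy_prompts = [[11, 12, 13], [21, 22]]
padded, mask = left_pad(toy_prompts)

# Pretend the model appended two new tokens to the right of every row
generated = [row + [91, 92] for row in padded]

# Because all prompts end at the same column, one slice isolates the new tokens
width = len(padded[0])
new_tokens = [row[width:] for row in generated]
print(padded, mask, new_tokens)
```

With right padding the pad tokens would sit between each shorter prompt and its continuation, and a single column offset would no longer separate prompt from output.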


In [2]:
# load the model
sampling_params = SamplingParams(
            temperature=0.0,
            max_tokens=512,
            )

model_vllm = LLM(model=model_path, gpu_memory_utilization=0.95)  # uses the first visible GPU by default; for multi-GPU inference, refer to the vLLM documentation
model_hf = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map='cuda:1').eval()

INFO 10-09 18:54:20 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-09 18:54:20 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-09 18:54:25 llm_engine.py:207] # GPU blocks: 7951, # CPU blocks: 512


Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.06s/it]


In [3]:
LLAMA2_PROMPT = {
    "description": "Llama 2 chat one shot prompt",
    "prompt": '''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

{instruction} [/INST] '''
}

prompts = ["What is the capital of France?", "What is the capital of Germany?", "What is the capital of Italy?"]

# Wrap each question in the Llama 2 chat template
llama_input = [LLAMA2_PROMPT['prompt'].format(instruction=p) for p in prompts]

# vLLM inference
vllm_output = model_vllm.generate(llama_input, sampling_params=sampling_params)
for output in vllm_output:
    generated_text = output.outputs[0].text
    print(f"Generated text by vllm: {generated_text!r}")

# Hugging Face inference
inputs = tokenizer(llama_input, padding=True, return_tensors="pt").to('cuda:1')  # move input_ids and attention_mask to the model's GPU
num_input_tokens = inputs['input_ids'].shape[1]
# The attention mask stays an integer tensor; casting it to half precision is unnecessary
outputs = model_hf.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'],
                            max_new_tokens=512, do_sample=False, pad_token_id=tokenizer.pad_token_id)
# Drop the (left-padded) prompt tokens so that only the newly generated text is decoded
generation = tokenizer.batch_decode(outputs[:, num_input_tokens:], skip_special_tokens=True)
print(generation)

Processed prompts: 100%|██████████| 3/3 [00:01<00:00,  2.68it/s]


Generated text by vllm: " The capital of France is Paris. I'm glad you asked! Paris is a beautiful city located in the northern central part of France, and it is known for its stunning architecture, art museums, fashion, and cuisine. It is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. I hope this information helps! Let me know if you have any other questions."
Generated text by vllm: " Thank you for asking! The capital of Germany is Berlin. I'm glad to help! However, I want to point out that the question is quite simple and straightforward, so it's important to be honest and accurate in our responses. It's not okay to provide false information or to make things up, as it can be harmful and lead to confusion. Is there anything else I can help you with?"
Generated text by vllm: " Thank you for asking! The capital of Italy is Rome. I'm glad you asked! It's important to be informed and curious about different countries and their capita
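
The prompt construction in the cell above can also be wrapped in a small helper. This is only a sketch: the function name `build_llama2_prompt` and the short default system prompt are hypothetical, chosen for brevity, while the `[INST]` / `<<SYS>>` layout is copied from the `LLAMA2_PROMPT` template used in this notebook.

```python
# Hypothetical short system prompt; the notebook uses the longer Llama 2 default
DEFAULT_SYSTEM = "You are a helpful, respectful and honest assistant."


def build_llama2_prompt(instruction, system_prompt=DEFAULT_SYSTEM):
    """Format a single-turn Llama 2 chat prompt with the layout used above."""
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST] "


questions = ["What is the capital of France?", "What is the capital of Germany?"]
batch = [build_llama2_prompt(q) for q in questions]
print(batch[0])
```

The resulting list could be passed to `model_vllm.generate` or to the tokenizer in the same way as `llama_input`.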


Install the other required packages:
```
pip install openai
pip install termcolor
pip install openpyxl
```
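
Once everything is installed, a quick way to confirm the environment is to query the installed package versions. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8). The list of distribution names is an assumption based on the install steps above; in particular, `fschat` is assumed to be the distribution name under which FastChat registers itself.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names for the packages installed in this guide
REQUIRED = ["torch", "transformers", "fschat", "vllm", "openai", "termcolor", "openpyxl"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14}{ver or 'MISSING'}")
```

Any package reported as `MISSING` can be installed with the corresponding command above before running the rest of the notebook.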