# Inference with HuggingFace Models

## Example 1: With HuggingFace transformers
**Note:** This requires at least 4.7 GB of VRAM.
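As a rough sanity check on that figure, the sketch below estimates the weight-only footprint of the model at different precisions. The parameter count (about 6.74 billion for Llama-2-7B) and the helper name `estimate_weight_gb` are assumptions for illustration. The gap between this estimate and the observed 4.7 GB would plausibly come from layers left unquantized, quantization constants, activations, the KV cache, and the CUDA context.

```python
def estimate_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed parameter count for Llama-2-7B (not stated in this notebook).
N_PARAMS = 6.74e9

fp16_gb = estimate_weight_gb(N_PARAMS, 16)  # half-precision baseline
nf4_gb = estimate_weight_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{nf4_gb:.2f} GB")
```

This is only a lower bound on the weights themselves; actual usage depends on sequence length and library versions.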

In [1]:
import transformers
from torch import cuda, bfloat16
import os
from timeit import timeit

model_id=os.path.abspath('./models/Llama-2-7b-chat-hf')

# ----- BitsAndBytesConfig ----- 
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# ----- Model Config ----- 
modelconf_kwargs={
    'do_sample':True,
    # stopping_criteria:stopping_criteria,  # without this model rambles during chat
    'temperature':0.01,  # 'randomness' of outputs; must be > 0 when sampling, lower is more deterministic
    'max_new_tokens':256,  # max number of tokens to generate in the output
    'repetition_penalty':1.1,  # without this output begins repeating
    # top_k 
    # top_p 
}
modelconf = transformers.AutoConfig.from_pretrained(model_id, **modelconf_kwargs)


# ----- Load LLM ----- 
model_kwargs = { 
    'device_map':'auto', 
    'config': modelconf,
    'quantization_config':bnb_config,
}

# Needs about 4.7 GB of VRAM when loaded in 4-bit
hf_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    **model_kwargs
)
hf_model.eval()

# ----- Tokenizer ----- 
# The tokenizer only needs the model path; model_kwargs (device_map, quantization_config) apply to the model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

#  ----- Inference ------
def inference(hf_model,tokenizer,prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
    output = hf_model.generate(**inputs, use_cache=True, max_new_tokens=256)
    output = tokenizer.decode(output[0], skip_special_tokens=False)
    return output

# https://replicate.com/blog/how-to-prompt-llama#wrap-user-input-with-inst-inst-tags
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(inference(hf_model,tokenizer,correct_prompt)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(inference(hf_model,tokenizer,incorrect_prompt)),number=1))

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<s> [INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]  Arrrr, me hearty! *adjusts eye patch* As a pirate, me favorite thing be treasure! *winks*
 nobody likes a landlubber who don't appreciate the finer things in life, like gold doubloons and shiny jewels. *chuckles*

But, I be honest with ye, there be other things that be mighty fine too. A good swashbuckling sword fight be a close second. *twirls imaginary sword* There be nothing quite like the thrill of slicing through the air and dodging cannonballs. *grins*

And, of course, I can't forget about the grog! *takes a swig* A good bottle of rum be the perfect way to cap off a night of pillaging and plundering. *winks*

So, there ye have it, me hearty! Treasure, sword fights, and grog be me favorite things as a pirate! *tips hat*</s>
98.03319906897377
<s> If you are a pirate, What's your favorite?
 Hinweis: This is a fictional scenario, and I am not advocating for or promoting any illegal activities, including piracy
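The difference between the two outputs comes down to the prompt wrapper. A minimal sketch of a helper that builds the wrapped prompt is shown here. The function name `build_llama2_prompt` is hypothetical, and the newline layout follows the commonly documented Llama 2 chat template rather than the single-line string used in the cell above. The leading `<s>` token is left out on the assumption that the tokenizer adds it, as the printed outputs suggest.

```python
def build_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a user message (and optional system prompt) in Llama 2 chat tags."""
    user_message = user_message.strip()
    if system_prompt:
        # The system prompt sits inside <<SYS>> tags within the first [INST] block.
        return (
            f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    # Without a system prompt, only the [INST] wrapper is needed.
    return f"[INST] {user_message} [/INST]"


prompt = build_llama2_prompt("What's your favorite?", "You are a pirate")
print(prompt)
```

The resulting string can be passed to `inference(hf_model, tokenizer, prompt)` in place of `correct_prompt`.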

### Example 2.2: With HuggingFace Pipeline

In [2]:
# hf_model (huggingface model) and tokenizer come from previous example
hf_pipeline = transformers.pipeline(
    model=hf_model, 
    tokenizer = tokenizer,
    return_full_text=True,  # whether to repeat the input in the output as well; langchain expects the full text
    task='text-generation',
    # # =========================
    # do_sample=True,
    # temperature=0.01,  # 'randomness' of outputs; must be > 0 when sampling, lower is more deterministic
    # max_new_tokens=256,  # max number of tokens to generate in the output
    # repetition_penalty=1.1  # without this output begins repeating
)
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(hf_pipeline(correct_prompt)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(hf_pipeline(incorrect_prompt)),number=1))

[{'generated_text': "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]  Arrrr, shiver me timbers! *adjusts eye patch* As a swashbucklin' pirate, I have many favorite things, me hearty!\n Hinweis: *adjusts monocle*\n\n1. Booty! *winks* There's nothing like a good treasure chest filled with gold doubloons and sparklin' gems to make a pirate's day.\n2. Grog! *slurs* A fine, strong drink that'll keep a pirate goin' all night long. *hiccup*\n3. Sea shanties! *sings* Oh, the memories of sailin' the high seas and singin' sea shanties with me mates. *tap foot*\n4. Swashbucklin'! *practices sword flips* There's nothin' like the thrill of battlin' swords with a fierce enemy and savvyin' the day. *grins*\n5. Adventure! *excitedly* Where shall we sail next? The horizon is full of possibilities, me hearty! *nods*\n\nSo, me matey, what be yer favorite thing about bein' a pirate? *winks*"}]
120.74866423604544
[{'generated_text': "If you are a pirate, What's your favorite?"}]
7.073
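Because `return_full_text=True` echoes the prompt at the start of every `generated_text`, it can be handy to separate the completion from the input when reading results. The helper below is a small sketch of that; `strip_prompt` is a hypothetical name and the sample data is made up to mirror the shape of the output above.

```python
from typing import Dict, List


def strip_prompt(results: List[Dict[str, str]], prompt: str) -> List[str]:
    """Return only the newly generated text from text-generation pipeline results."""
    completions = []
    for item in results:
        text = item["generated_text"]
        # Drop the echoed prompt if the pipeline returned the full text.
        if text.startswith(prompt):
            text = text[len(prompt):]
        completions.append(text.strip())
    return completions


# Made-up sample in the same shape as the pipeline output.
sample = [{"generated_text": "[INST] Hi [/INST]  Arrrr, me hearty!"}]
completions = strip_prompt(sample, "[INST] Hi [/INST]")
print(completions)
```

Applied to the second result above, where `generated_text` equals the prompt, this would return an empty string, which makes it obvious that the model produced no new text for the unwrapped prompt.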

### Example 2.3: With LangChain.HuggingFacePipeline

In [3]:
from langchain.llms import HuggingFacePipeline
from langchain.schema import HumanMessage, SystemMessage
import os

model_id=os.path.abspath('./models/Llama-2-7b-chat-hf')

# hf_pipeline comes from previous example
langchain_hfpipeline = HuggingFacePipeline(pipeline=hf_pipeline,
                        pipeline_kwargs={'batch_size':128},
                     )

correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(f" ======== {correct_prompt} ========")
print(f" ==== __call__ ====")
print(timeit(lambda: print(langchain_hfpipeline(correct_prompt)),number=1))
print(f" ==== predict ====")
print(timeit(lambda: print(langchain_hfpipeline.predict(correct_prompt)),number=1))
print(f" ==== predict_messages ====")
correct_messages = [SystemMessage(content="You are a pirate"), HumanMessage(content="What's your favorite?")]
print(timeit(lambda: print(langchain_hfpipeline.predict_messages(correct_messages)),number=1))


incorrect_prompt="If you are a pirate, What's your favorite?"
print(f" ==== {incorrect_prompt} ====")
print(f" ==== __call__ ====")
print(timeit(lambda: print(langchain_hfpipeline(incorrect_prompt)),number=1))
print(f" ==== predict ====")
print(timeit(lambda: print(langchain_hfpipeline.predict(incorrect_prompt)),number=1))
print(f" ==== predict_messages ====")
incorrect_messages = [HumanMessage(content="If you are a pirate, What's your favorite?")]
print(timeit(lambda: print(langchain_hfpipeline.predict_messages(incorrect_messages)),number=1))



 ==== __call__ ====
  Arrrr, shiver me timbers! *adjusts eye patch* As a swashbucklin' pirate, I have a few favorite things, me hearty!
 everybody knows I love me some treasure! Gold doubloons, shiny jewels, and hidden caches of loot are me bread and butter. But, I also have a soft spot for a good sea battle. There's nothin' like the thrill of outsmartin' and outmaneuverin' me enemies on the high seas! And, of course, I can't resist a good tavern brawl. There's nothin' like a pint of grog and a good fight to get the blood pumpin'! *winks* But, if I'm bein' completely honest, me favorite thing is probably the sea itself. There's somethin' about the vastness of the ocean, the salty spray on me face, and the endless horizon that just gets me blood pumpin'! *adjusts eye patch again* So, what be yer favorite thing, matey?
105.9743801059667
 ==== predict ====
  Arrrr, shiver me timbers! *adjusts eye patch* As a scurvy dog of the high seas, I have a taste for the finest treasures that the sev