# Inference with a GGUF Model
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF


## Example 1: With CTransformers 

Install CTransformers
```sh
# CPU only
pip install "ctransformers>=0.2.24"
# with CUDA GPU support
pip install "ctransformers[cuda]>=0.2.24"
```

### Example 1.1: With CTransformers Class

In [1]:
from ctransformers import AutoModelForCausalLM
import os
from timeit import timeit

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# https://github.com/marella/ctransformers#documentation
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1}
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(model_path_or_repo_id=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", model_type="llama", 
                                           # stream=True,
                                           gpu_layers=50, **config)


# https://replicate.com/blog/how-to-prompt-llama#wrap-user-input-with-inst-inst-tags
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)),number=1)) # In my case: GPU:24s/CPU:42s

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)),number=1))


  Arrrr, me hearty! *adjusts eye patch* As a scurvy dog of the high seas, I have many favorite things, but if I had to choose just one... *crackin' knuckles* it would be... *pauses for dramatic effect* rum!
 Eeee, there's nothing like a good swig o' rum to warm the cockles of me heart and ease the pain o' a long day o' plunderin' the innocent. Mmmmph, it's like liquid gold in me belly! *hiccup*
But wait, there be more! *winks* I also love a good sea shanty to sing with me mates while we're sailin' the seven seas. *busts into a rousing chorus* "What do ye say, me hearties? Should we pillage and plunder, or just drift along, la-dee-dah?"
And o' course, no pirate's life be complete without a trusty parrot on me shoulder. *adjusts feathered friend* Me matey here be a bit
484.46392114899936


Pirate Name:                   Barnaby Blackheart

Favorite Food:                   Fish (loves it raw and fresh)

Favorite Drink:                Grog (a strong drink made from rum and fruit juice)

Fa
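
The two prompts above differ only in whether the Llama 2 chat tags are present. As a minimal sketch, the helper below wraps a system message and a user message in the same `[INST]` / `<<SYS>>` layout as `correct_prompt`. The name `build_llama2_prompt` is hypothetical, and the spacing mirrors the string used in this notebook rather than any official template.

```python
def build_llama2_prompt(user_message: str, system_message: str = "") -> str:
    """Wrap messages in the [INST] / <<SYS>> tags used by correct_prompt above."""
    if system_message:
        # Same single-line layout as the notebook's correct_prompt string.
        return f"[INST] <<SYS>>{system_message} <</SYS>> {user_message} [/INST]"
    # Without a system message, only the instruction tags are needed.
    return f"[INST] {user_message} [/INST]"


prompt = build_llama2_prompt("What's your favorite?", "You are a pirate")
print(prompt)
```

The resulting string can be passed to `llm(...)` in place of the hand-written `correct_prompt`.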

### Example 1.2: With LangChain.CTransformers
https://python.langchain.com/docs/integrations/llms/ctransformers  
https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html  

In [2]:
from langchain.llms import CTransformers
from timeit import timeit
import os

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# https://github.com/marella/ctransformers#config
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1, 'temperature':0.9}
# https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html
llm = CTransformers(model=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", config=config)

correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
# config was already passed to the constructor, so it is not repeated here
print(timeit(lambda: print(llm(correct_prompt)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt)),number=1))

  Arrrr, me hearty! *adjusts eye patch* As a pirate, me favorites be treasure, of course! There's nothing like the thrill of finding a great big pile o' gold doubloons or a chest overflowin' with glitterin' gems. It's like findin' a pot o' gold at the end o' the rainbow! *winks*

But me favorite treasure be the one that's hidden deep within the belly o' a fierce sea monster. *gulps* Those be the most dangerous and excitin' treasures o' all, don't ye think? The thrill o' battle, the rush o' victory, and the bounty o' loot! *flexes sword*
Now, I know what ye be thinkin', "Pirate, how do ye manage to find these hidden treasures?" Well, me lad/lass, it be a combination o' luck, cunning, and a good ol' map. *taps chest* Me and me crew have spent years chartin' the waters o' the seven seas, searchin'
43.56051090999972


1. Drink: Grog (a rum-based drink)
2. Food: Seafood (especially fish and chips)
3. Hobbies: Singing sea shanties, playing the accordion, or juggling cutlasses
4. Place to vis
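
The cells above time each call with `timeit(lambda: print(...), number=1)`, which prints the response but discards it. A possible alternative, sketched below, returns both the response and the elapsed seconds. The helper name `timed` and the stand-in `fake_llm` are hypothetical and exist only so the example runs without loading a model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for the model so the sketch runs without loading any weights.
def fake_llm(prompt: str) -> str:
    return prompt.upper()


text, seconds = timed(fake_llm, "what's your favorite?")
print(text)
print(f"{seconds:.3f} s")
```

With a real model, `timed(llm, correct_prompt)` would keep the generated text available for later inspection.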

## Example 2: With Llama.Cpp

https://python.langchain.com/docs/integrations/llms/llamacpp
```sh
# CPU
pip install llama-cpp-python
```
__OR__
```sh
# GPU
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```

### Example 2.1: With Llama.Cpp
https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_inference.py


In [3]:
import os
from timeit import timeit

from llama_cpp import Llama

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L209
# n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(model_path=model_id, n_gpu_layers=-1, verbose=False)



correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1409
print(timeit(lambda: print(llm(correct_prompt,
                            max_tokens=256,
                            # stop=["Q:", "\n"],
                            echo=True,
                        )),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt,
                            max_tokens=256,
                            # stop=["Q:", "\n"],
                            echo=True,
                        )),number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

{'id': 'cmpl-7eb6dadc-4259-4ddc-a421-7a30e057f1c9', 'object': 'text_completion', 'created': 1695721391, 'model': '/app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf', 'choices': [{'text': "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]  Arrrr, me hearty! *adjusts eye patch* As a pirate, I have many fond favorites, but if I had to choose just one, it'd be... (pauses for dramatic effect) ...rum! *winks*\n\nThere's nothing like a good ol' fashioned swig of rum to warm the bones and lift the spirits. And let me tell ye, I've had me share o' rum in me day. *chuckles* From the sweet, smooth island brews to the fiery, spicy varieties, there's a rum for every pirate's taste.\nBut me favorite has to be... (pauses again) ...Blackbeard's Blend! *excitedly* It's a special recipe I learned from the great Blackbeard himself, and it's got just the right amount of kick and flavor. *takes a dram* Mmmm, just thinking about it makes me feel like setting sail for a
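
The call returns a completion dictionary rather than a plain string, and with `echo=True` the prompt is repeated at the start of the generated text. The sketch below pulls out only the answer. The helper name `completion_text` is hypothetical, and the sample dictionary is hand-written to match the shape printed above, not real model output.

```python
def completion_text(response: dict, prompt: str = "") -> str:
    """Return the generated text from a completion dict shaped like the one above."""
    text = response["choices"][0]["text"]
    # With echo=True the prompt is prepended to the output; drop it if present.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hand-written sample shaped like the printed response (not real model output).
sample = {
    "object": "text_completion",
    "choices": [{"text": "[INST] Hi [/INST]  Arrrr, me hearty!"}],
}
print(completion_text(sample, prompt="[INST] Hi [/INST]"))
```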

### Example 2.2: With LangChain.LlamaCpp
https://python.langchain.com/docs/integrations/llms/llamacpp

In [4]:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from timeit import timeit
import os


model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html
llm = LlamaCpp(
    model_path=model_id,
    max_tokens=256,
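    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]), # streams tokens to stdout as they are generated
    verbose=True, # required so that output is passed to the callback manager
)
# Timings in my case: CPU 42 s / GPU 35 s.
# Each response appears twice in the output below: once streamed by the callback handler, once by print().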
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt)),number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

  Arrrr, shiver me timbers! *adjusts eye patch* As a scurvy dog of the seven seas, I've got me heart set on a few things, matey.
First and foremost, I love a good swashbucklin' adventure! There's nothin' like sailin' the high seas, battlin' against the wind and waves, and searchin' fer hidden treasure. *winks* And of course, there's nothing better than a good fight with me trusty cutlass. *giggles*
But enough about that! What be yer favorite thing to do on the ocean, matey? 😄  Arrrr, shiver me timbers! *adjusts eye patch* As a scurvy dog of the seven seas, I've got me heart set on a few things, matey.
First and foremost, I love a good swashbucklin' adventure! There's nothin' like sailin' the high seas, battlin' against the wind and waves, and searchin' fer hidden treasure. *winks* And of course, there's nothing better than a good fight with me trusty cutlass. *giggles*
But enough about that! What be yer favorite thing to do on the ocean, matey? 😄
26.759195778984576



llama_print_timings:        load time =   724.62 ms
llama_print_timings:      sample time =    71.19 ms /   154 runs   (    0.46 ms per token,  2163.32 tokens per second)
llama_print_timings: prompt eval time =  2377.54 ms /    28 tokens (   84.91 ms per token,    11.78 tokens per second)
llama_print_timings:        eval time = 23888.43 ms /   153 runs   (  156.13 ms per token,     6.40 tokens per second)
llama_print_timings:       total time = 26757.62 ms
Llama.generate: prefix-match hit




Ahoy matey! I be askin' ye what yer favorite treasure be. Do ye have a hankerin' for gold doubloons or maybe a fine ship to sail the seven seas? Or perhaps ye prefer somethin' a bit more...exotic? Share yer favored treasure with me and I'll make sure to keep it safe from any landlubbers! Arrrr!

Ahoy matey! I be askin' ye what yer favorite treasure be. Do ye have a hankerin' for gold doubloons or maybe a fine ship to sail the seven seas? Or perhaps ye prefer somethin' a bit more...exotic? Share yer favored treasure with me and I'll make sure to keep it safe from any landlubbers! Arrrr!
15.645501707011135



llama_print_timings:        load time =   724.62 ms
llama_print_timings:      sample time =    43.20 ms /    93 runs   (    0.46 ms per token,  2152.98 tokens per second)
llama_print_timings: prompt eval time =  1069.70 ms /    13 tokens (   82.28 ms per token,    12.15 tokens per second)
llama_print_timings:        eval time = 14281.87 ms /    92 runs   (  155.24 ms per token,     6.44 tokens per second)
llama_print_timings:       total time = 15644.08 ms
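
The throughput figures in the `llama_print_timings` lines can be recomputed from the raw millisecond and token counts. The sketch below parses a few of the lines shown above with a regular expression and derives tokens per second for the `eval` phase. The names `parse_timings` and `tokens_per_second` are hypothetical, and the pattern assumes the log format stays exactly as printed here.

```python
import re

# Matches lines such as:
# llama_print_timings:        eval time = 23888.43 ms /   153 runs   (...)
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Map each timing name to (milliseconds, count or None)."""
    timings = {}
    for match in TIMING_RE.finditer(log):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return timings


def tokens_per_second(ms: float, count: int) -> float:
    """Convert a duration in milliseconds and a token count to tokens/s."""
    return count / (ms / 1000.0)


# Lines copied from the first run's output above.
log = """
llama_print_timings:        load time =   724.62 ms
llama_print_timings: prompt eval time =  2377.54 ms /    28 tokens (   84.91 ms per token,    11.78 tokens per second)
llama_print_timings:        eval time = 23888.43 ms /   153 runs   (  156.13 ms per token,     6.40 tokens per second)
"""
timings = parse_timings(log)
ms, count = timings["eval"]
print(f"{tokens_per_second(ms, count):.2f} tokens/s")
```

For the first run this reproduces the logged 6.40 tokens per second (153 tokens in about 23.9 s).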
