# Inference with a GGUF Model
Model: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF


## Example 1: With CTransformers 

Install CTransformers
```sh
# CPU only (quote the requirement so the shell does not treat ">" as a redirect)
pip install "ctransformers>=0.2.24"
# with CUDA GPU support
pip install "ctransformers[cuda]>=0.2.24"
```
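
The examples below assume the GGUF file has already been downloaded into `./models/Llama-2-7b-Chat-GGUF`. A minimal sketch of a pre-flight check, so that a missing file fails with a clear message before any library tries to load it. The helper name `resolve_model_file` is hypothetical and is not part of ctransformers or llama-cpp-python:

```python
import os


def resolve_model_file(model_dir: str, model_file: str) -> str:
    """Return the absolute path of model_file inside model_dir, or raise if it is missing."""
    path = os.path.abspath(os.path.join(model_dir, model_file))
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"GGUF file not found: {path}. Download it from the model page first."
        )
    return path


if __name__ == "__main__":
    try:
        # Directory and file name as used in the examples of this notebook
        print(resolve_model_file("./models/Llama-2-7b-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf"))
    except FileNotFoundError as err:
        print(err)
```

The returned path can be passed directly as `model_path` in the Llama.Cpp examples, which expect the full path to the `.gguf` file.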

### Example 1.1: With CTransformers Class

In [1]:
from ctransformers import AutoModelForCausalLM
import os
from timeit import timeit

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# https://github.com/marella/ctransformers#documentation
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1}
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(model_path_or_repo_id=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", model_type="llama", 
                                           # stream=True,
                                           gpu_layers=50, **config)


# https://replicate.com/blog/how-to-prompt-llama#wrap-user-input-with-inst-inst-tags
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)),number=1)) # In my case: GPU:24s/CPU:42s

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)),number=1))


  Arrrr, shiver me timbers! *adjusts eye patch* As a swashbucklin' pirate, I have many fond favorites, but if I had to choose just one, it'd be... *crackin' knuckles* the lootin'!
Aye, there's nothin' quite like the thrill of raidin' a ship and bringin' home the booty. The shiny gold doubloons, the glitterin' jewels, and of course, the rare and valuable treasures that only come from the deepest, darkest waters. *winks*
But me favorite thing about bein' a pirate? *leanin' in* it's the freedom! The open sea, the endless horizons, and the ability to make yer own rules. Aye, there's nothin' quite like the life of a pirate, matey! *raises mug of grog* Here's to the next great adventure!
13.091385690000607


Ahoy matey! 🎉

Yo ho ho and a bottle of rum! 🍻

What be yer favorite thing to do on the high seas? 🌊

Share with yer mateys what brings ye the greatest delight. 💖 #pirate #highseas #rum
4.991972674994031
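
The two outputs above differ mainly because the first prompt is wrapped in the `[INST]` and `<<SYS>>` tags that the chat model was trained on. A minimal sketch of a helper that builds such a single-turn prompt. The function name `build_llama2_prompt` is hypothetical, and the newlines around the system prompt are an assumption based on the commonly documented Llama 2 chat template (the `correct_prompt` string in this notebook uses spaces instead):

```python
def build_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a single-turn message in Llama 2 chat tags."""
    if system_prompt:
        # The system prompt sits inside <<SYS>> tags, within the first [INST] block
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"


prompt = build_llama2_prompt("What's your favorite?", system_prompt="You are a pirate")
print(prompt)
```

The resulting string can be passed to `llm(...)` in place of `correct_prompt`.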


### Example 1.2: With LangChain.CTransformers
https://python.langchain.com/docs/integrations/llms/ctransformers  
https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html  

In [2]:
from langchain.llms import CTransformers
import os
from timeit import timeit

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

config = {'max_new_tokens': 256, 'repetition_penalty': 1.1}
# https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html
llm = CTransformers(model=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", config=config)

correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)),number=1))

  Arrrr, shiver me timbers! *adjusts eye patch* As a swashbucklin' pirate, I have many favorite things, but if I must choose only one, it be... *takes a sip of grog* ...rum!
 Aye, there be no finer drink than a good rum. It warms the bones and puts a spring in me step, especially after a long day of plunderin' and pillagin' on the high seas. Me favorite is a nice, rich, dark rum, preferably aged for at least 10 years or more. A good rum can make any pirate feel like a king (or queen) of the seven seas! *winks*
But wait, there be other favorites too! I also love me some fine treasure, like gold doubloons and sparklin' gems. There's nothin' better than findin' a good stash of loot after a successful raid on a merchant ship or a hidden island paradise. And of course, there be no better feeling than settling into me trusty ol' chest, filled to the brim with all me treasure and
42.50303835600789


Pirates are known for their love of treasure and adventure. They sail the high seas in search 
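
The `timeit(lambda: print(...), number=1)` pattern used in these cells measures a single call, but the completion is only printed and cannot be reused afterwards. If the text is needed later, a small wrapper around `time.perf_counter` can return both the result and the elapsed time. This is a sketch; `timed` is a hypothetical helper, and `str.upper` stands in for a real model call:

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for an LLM call; with a loaded model this would be timed(llm, correct_prompt)
text, seconds = timed(str.upper, "ahoy")
print(f"{text!r} took {seconds:.4f} s")
```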

## Example 2: With Llama.Cpp

https://python.langchain.com/docs/integrations/llms/llamacpp
```sh
# CPU
pip install llama-cpp-python
```
__OR__
```sh
# GPU
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```

### Example 2.1: With Llama.Cpp
https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_inference.py


In [3]:
import os
from timeit import timeit

from llama_cpp import Llama

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L209
# n_gpu_layers=-1 offloads all layers to the GPU; use 0 for CPU-only inference
llm = Llama(model_path=model_id, n_gpu_layers=-1, verbose=False)



correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1409
print(timeit(lambda: print(llm(correct_prompt,
                            max_tokens=256,
                            # stop=["Q:", "\n"],
                            echo=True,
                        )),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt,
                            max_tokens=256,
                            # stop=["Q:", "\n"],
                            echo=True,
                        )),number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

{'id': 'cmpl-8ea8ad16-9575-4fee-944a-0b459129d86e', 'object': 'text_completion', 'created': 1695617218, 'model': '/app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf', 'choices': [{'text': "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]  Arrrr, shiver me timbers! *adjusts eye patch*\n\nWell, matey, as a swashbucklin' pirate, I have many favored things. But if I had to choose just one... *scratches chin*\n\nI'd have to go with me trusty cutlass, me hearty! There's nothin' like the feel of a sharp blade in yer hand when ye be battlin' against the scurvy dogs on the high seas. And don't get me started on the satisfaction of sinkin' a ship or two... *cackles*\nBut I also have a soft spot for me grog, matey. There's nothin' like a good ol' fashioned rum to keep yer spirits up after a long day of pillagin' and plunderin'. And don't ye worry about the hangover in the mornin', 'cause we pirates be known for our liver of steel! *winks*\nSo there ye have i
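
Unlike the CTransformers examples, the call above returns a dictionary rather than a plain string, and with `echo=True` the prompt is repeated at the start of `choices[0]['text']`. A minimal sketch of pulling out only the generated text. The helper name `completion_text` is hypothetical, and the `example` dictionary is a shortened stand-in, not real model output:

```python
def completion_text(response: dict, prompt: str = "") -> str:
    """Return the generated text of the first choice, without the echoed prompt."""
    text = response["choices"][0]["text"]
    if prompt and text.startswith(prompt):
        # echo=True prepends the prompt to the completion, so cut it off
        text = text[len(prompt):]
    return text.strip()


# Shortened stand-in for the dictionary printed above
example = {
    "object": "text_completion",
    "choices": [{"text": "[INST] Hi [/INST]  Arrrr, shiver me timbers!"}],
}
print(completion_text(example, prompt="[INST] Hi [/INST]"))
```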

### Example 2.2: With LangChain.LlamaCpp
https://python.langchain.com/docs/integrations/llms/llamacpp

In [4]:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from timeit import timeit
import os


model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html
llm = LlamaCpp(
    model_path=model_id,
    max_tokens=256,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]), # Callbacks support token-wise streaming
    verbose=True, # Verbose is required to pass to the callback manager
)
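# In my case: CPU: 42 s / GPU: 35 s
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
# The streaming handler already writes each token to stdout, so print() shows the completion a second time
print(timeit(lambda: print(llm(correct_prompt)),number=1))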

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt)),number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

  Arrrr, shiver me timbers! *adjusts eye patch* 'Tis a fine question ye be askin', matey! As a proper pirate, I have many favorites, but if I had to choose just one, it would be... (leaning in close)
 PIRATE'S JUICE! 🍹🏴‍☠️
There's nothing like a good swig of grog to keep me going all day long. And don't even get me started on the treasure it brings! *winks* It be the perfect mix of rum, lime, and a wee bit o' magic. Makes a pirate feel like he's sailin' the high seas and searchin' for hidden loot! 🏴‍☠️🌅
But wait, there be more! *grins mischievously* I also have a soft spot for... 
1. FLAGON O' FIGHTIN' FLAGONS! 🎭🏴‍☠️ These are me favorite grog-fueled festiv  Arrrr, shiver me timbers! *adjusts eye patch* 'Tis a fine question ye be askin', matey! As a proper pirate, I have many favorites, but if I had to choose just one, it would be... (leaning in close)
 PIRATE'S JUICE! 🍹🏴‍☠️
There's nothing like a good swig of grog to keep me going all day long. And don't even get me started on the tre


llama_print_timings:        load time =   732.15 ms
llama_print_timings:      sample time =   118.93 ms /   256 runs   (    0.46 ms per token,  2152.54 tokens per second)
llama_print_timings: prompt eval time =  2418.20 ms /    28 tokens (   86.36 ms per token,    11.58 tokens per second)
llama_print_timings:        eval time = 39144.17 ms /   255 runs   (  153.51 ms per token,     6.51 tokens per second)
llama_print_timings:       total time = 42377.12 ms
Llama.generate: prefix-match hit



[You see a table of treasure in front of you.]
Pirate Captain: "Arrgh! Shiver me timbers! There be treasure upon that table, matey! *takes a seat* Now, what be yer favorite among all o' this booty?"

Please respond with your choice of treasure from the following options:
A) A golden goblet filled with sparkling jewels and coins.
B) A chest overflowing with glittering gold doubloons.
C) A rare and mysterious artifact with strange powers.
D) A fine and luxurious diamond-encrusted peg leg.
E) A map to a hidden treasure that only reveals itself once every hundred years.
[You see a table of treasure in front of you.]
Pirate Captain: "Arrgh! Shiver me timbers! There be treasure upon that table, matey! *takes a seat* Now, what be yer favorite among all o' this booty?"

Please respond with your choice of treasure from the following options:
A) A golden goblet filled with sparkling jewels and coins.
B) A chest overflowing with glittering gold doubloons.
C) A rare and mysterious artifact with s


llama_print_timings:        load time =   732.15 ms
llama_print_timings:      sample time =    77.48 ms /   167 runs   (    0.46 ms per token,  2155.39 tokens per second)
llama_print_timings: prompt eval time =  1063.35 ms /    13 tokens (   81.80 ms per token,    12.23 tokens per second)
llama_print_timings:        eval time = 25513.95 ms /   166 runs   (  153.70 ms per token,     6.51 tokens per second)
llama_print_timings:       total time = 27102.67 ms
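
The throughput figures in the `llama_print_timings` lines follow directly from the total time and the token count, for example 255 tokens in 39144.17 ms gives about 6.51 tokens per second. A minimal sketch that recomputes this from a log line, which can be handy when comparing CPU and GPU runs. The helper `tokens_per_second` and the regular expression are illustrative and assume the log format shown above:

```python
import re

# Matches e.g. "eval time = 39144.17 ms /   255 runs"
TIMING_RE = re.compile(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)")


def tokens_per_second(line: str) -> float:
    """Compute tokens per second from one llama_print_timings line."""
    match = TIMING_RE.search(line)
    if match is None:
        raise ValueError(f"not a per-token timing line: {line!r}")
    total_ms, count = float(match.group(1)), int(match.group(2))
    return count / (total_ms / 1000.0)


line = "llama_print_timings:        eval time = 39144.17 ms /   255 runs   (  153.51 ms per token,     6.51 tokens per second)"
print(f"{tokens_per_second(line):.2f} tokens per second")  # prints 6.51 tokens per second
```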
