# Inference with a GGUF Model
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF


## Example 1: With CTransformers 


Install CTransformers (the version specifier is quoted so the shell does not treat `>=` as a redirect):
```sh
# CPU only
pip install "ctransformers>=0.2.24"
# with CUDA GPU
pip install "ctransformers[cuda]>=0.2.24"
```

In [1]:
!pip install ctransformers[cuda]
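
After installing, it can help to confirm that the installed version satisfies the `>=0.2.24` requirement above. The following is a minimal sketch using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, and the comparison only looks at the leading numeric part of each version component.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.2.24' into (0, 2, 24); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(package: str, minimum: str) -> bool:
    """Return True if `package` is installed with a version >= `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        return False
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("ctransformers", "0.2.24"))
```

The call prints `False` both when the package is missing and when it is too old, so a `False` result means the install step should be repeated.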



### Example 1.1: With CTransformers Class

In [2]:
from ctransformers import AutoModelForCausalLM
import os
from timeit import timeit

model_id = os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# Generation settings; see https://github.com/marella/ctransformers#documentation
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1}
# gpu_layers is the number of layers to offload to the GPU. Set it to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    model_path_or_repo_id=model_id,
    model_file="llama-2-7b-chat.Q4_K_M.gguf",
    model_type="llama",
    # stream=True,
    gpu_layers=50,
    **config,
)

# Llama 2 chat models expect the input wrapped in [INST] and <<SYS>> tags:
# https://replicate.com/blog/how-to-prompt-llama#wrap-user-input-with-inst-inst-tags
correct_prompt = "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)), number=1))  # In my case: GPU: 24 s / CPU: 42 s

# The same question without the tags, for comparison
incorrect_prompt = "If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)), number=1))


  Arrrr, shiver me timbers! *adjusts eye patch* As a scurvy dog of the high seas, I have many fond favorites, me hearty. 'Tis a treacherous job narrowin' it down, but here be me top picks:

1. Grog: A fine swill o' rum, grog be me lifeblood! It warms the bones and dulls the senses, perfect for a good ol' fashioned pirate's life. *slurs*
2. Booty: Ah, the spoils o' war! There be nothing like comin' back to the ship with a hold full o' gold doubloons, jewels, and fine silks. It's like a treasure chest overflowin' with treasures! *grins*
3. Sea shanties: Oh, how I love a good sea shanty! The rhythm o' the waves and the singin' o' the crew be a mighty fine thing indeed. *humms* "What Shall We Do with a Drunken Sailor?"
4. Swashbucklin':
338.697631601


Pirate name: Captain Blackbeak
Favorite food: Rum! I love me some good ol' rum! It's the best part of being a pirate, after all. And it pairs perfectly with... *ahem* other things. *wink wink*
Favorite drink: Arrrr, I be lovin' a good grog! 
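
The two prompts above differ only in whether the input is wrapped in the Llama 2 chat tags. Rather than typing the tags by hand, the prompt can be assembled by a small helper. This is a sketch, not part of either library: `build_llama2_prompt` is a hypothetical name, and the newline placement follows the template as I understand it from the Replicate post linked in the cell (the notebook's own prompt omits the newlines and still produces a sensible reply).

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system message and a user message in Llama 2 chat tags."""
    # Assumed layout: [INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"


prompt = build_llama2_prompt("You are a pirate", "What's your favorite?")
print(prompt)
```

The resulting string can be passed to `llm(...)` in place of `correct_prompt`. The beginning-of-sequence token is left out on the assumption that the library adds it during tokenization.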

### Example 1.2: With LangChain.CTransformers
https://python.langchain.com/docs/integrations/llms/ctransformers  
https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html  

In [3]:
from langchain.llms import CTransformers
from timeit import timeit
import os

model_id = os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# https://github.com/marella/ctransformers#config
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1, 'temperature': 0.9}
# https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html
llm = CTransformers(model=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", config=config)

correct_prompt = "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)), number=1))

incorrect_prompt = "If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)), number=1))

  Arrrr, shiver me timbers! *adjusts eye patch* As a swashbucklin' pirate, I've got plenty of faves, matey!

First off, I loves me some good ol' fashioned treasure huntin'. There's nothin' like the thrill of searchin' for hidden riches and valuable booty on the high seas. *winks*

But I reckon my absolute favorite thing is a good ol' fashioned sea battle! *cracks knuckles* Nothin' gets me goin' like the sound of cannon fire and the smell of gunpowder in the air. There's nothin' quite like the rush of fightin' for me life and me ship against a pack of scurvy dogs! *grins*

Of course, I also enjoys me some good food and drink. There's nothin' better than a hearty bowl of sea dog stew after a long day of plunderin', or a mug o' grog to take the edge off after a long battle. *chuckles*

So there ye have it
43.52970352600005


Pirate Name:                   Ahoy matey! Me name be Captain Blackbeak.
Favorite Drink:               Arrrr, me hearty! Me favorite drink be grog! It be made o' rum,
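
The cells above time each call with `timeit(lambda: print(...), number=1)`, so the reported seconds also include printing and the generated text is not kept. A minimal alternative sketch using `time.perf_counter` returns both the result and the elapsed time. `timed_call` and `fake_llm` are hypothetical names; `fake_llm` only stands in for the real `llm` object so the example runs without a model.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_llm(prompt, max_new_tokens=256):
    # Hypothetical stand-in for a real model call.
    time.sleep(0.01)
    return f"(reply to: {prompt})"


text, seconds = timed_call(fake_llm, "What's your favorite?", max_new_tokens=16)
print(text)
print(f"{seconds:.3f} s")
```

With a real model, `timed_call(llm, correct_prompt)` would be used the same way.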

## Example 2: With Llama.Cpp

https://python.langchain.com/docs/integrations/llms/llamacpp
```sh
# CPU
pip install llama-cpp-python
```
__OR__
```sh
# GPU
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```

In [4]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.11.tar.gz (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 4.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
  Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/24/21/7d397a4b7934ff4028987914ac1044d3b7d52712f30e2ac7a2ae5bc86dd0/typing_extensions-4.8.0-py3-none-any.whl.metadata
  Downloading typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
  Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/9b/5a/f265a1ba3641d16b5480a217a6aed08cceef09

### Example 2.1: With Llama.Cpp
https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_inference.py


In [5]:
from llama_cpp import Llama
import os
from timeit import timeit
model_id = os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L209
# n_gpu_layers=-1 requests that all layers be offloaded to the GPU.
llm = Llama(model_path=model_id, n_gpu_layers=-1, verbose=False)

correct_prompt = "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1409
# echo=True includes the prompt at the start of the returned text.
print(timeit(lambda: print(llm(correct_prompt,
                               max_tokens=256,
                               # stop=["Q:", "\n"],
                               echo=True,
                               )), number=1))

incorrect_prompt = "If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt,
                               max_tokens=256,
                               # stop=["Q:", "\n"],
                               echo=True,
                               )), number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

{'id': 'cmpl-f95fc705-2653-42ba-b2ed-3abb491362df', 'object': 'text_completion', 'created': 1696308318, 'model': '/app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf', 'choices': [{'text': "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]  Arrrr, me hearty! *adjusts eye patch* Now that be a fine question.\nMe favorite thing about bein' a pirate? *leans in conspiratorially* It be the treasure, of course! There's nothin' quite like the thrill of findin' a stash of gold doubloons or a chest overflowin' with sparklin' gems. And the best part be, there's always more where that came from! *winks*\nBut me favorite treasure of all? That be the booty I found on me last adventure. *chuckles* It were a great big barrel of grog, straight from the finest tavern in Tortuga! *slurs* It be the best thing I've tasted since... well, since the last time I had some more grog! *laughs*\nSo, matey, what be yer favorite treasure? Be it gold, silver, or even just a good o
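
As the output above shows, the call returns a dictionary whose generated text sits under `choices[0]['text']`, and with `echo=True` that text begins with the prompt. The sketch below pulls out just the reply. `completion_text` is a hypothetical helper, and `sample` is a shortened, made-up response that only mirrors the keys visible in the output above.

```python
def completion_text(response: dict, prompt: str = "") -> str:
    """Return the generated text, dropping the echoed prompt if present."""
    text = response["choices"][0]["text"]
    if prompt and text.startswith(prompt):
        # echo=True prepends the prompt to the completion.
        text = text[len(prompt):]
    return text.strip()


# Shortened, hypothetical response with the same shape as the real one.
sample = {
    "object": "text_completion",
    "choices": [{"text": "[INST] Hi [/INST]  Arrrr, me hearty!"}],
}
print(completion_text(sample, prompt="[INST] Hi [/INST]"))
```

With the real model this would be `completion_text(llm(correct_prompt, max_tokens=256, echo=True), correct_prompt)`.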

### Example 2.2: With LangChain.LlamaCpp
https://python.langchain.com/docs/integrations/llms/llamacpp

In [6]:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from timeit import timeit
import os


model_id = os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html
llm = LlamaCpp(
    model_path=model_id,
    max_tokens=256,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),  # Callbacks support token-wise streaming
    verbose=True,  # Verbose is required to pass to the callback manager
)
# In my case: CPU: 42 s / GPU: 35 s
# The streaming handler already writes each token to stdout, so print() shows the reply a second time.
correct_prompt = "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt)), number=1))

incorrect_prompt = "If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt)), number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

  Shiver me timbers! *adjusts eye patch* Arrrr, as a pirate, I have many fond preferences, but if I had to choose just one, it be... *exaggerated dramatic pause* ...arrrrrrum!
The sweet, sweet taste of rum is the one thing that keeps me going after all these years on the high seas. A good swig of grog can cure any scurvy, and I've got a stash hidden away in me chest just for emergencies. *winks*
But don't be tellin' anyone, matey, or I'll have to make ye walk the plank! Arrrr!  Shiver me timbers! *adjusts eye patch* Arrrr, as a pirate, I have many fond preferences, but if I had to choose just one, it be... *exaggerated dramatic pause* ...arrrrrrum!
The sweet, sweet taste of rum is the one thing that keeps me going after all these years on the high seas. A good swig of grog can cure any scurvy, and I've got a stash hidden away in me chest just for emergencies. *winks*
But don't be tellin' anyone, matey, or I'll have to make ye walk the plank! Arrrr!
26.78718819000005



llama_print_timings:        load time =   710.83 ms
llama_print_timings:      sample time =    72.96 ms /   153 runs   (    0.48 ms per token,  2097.01 tokens per second)
llama_print_timings: prompt eval time =  2357.19 ms /    28 tokens (   84.19 ms per token,    11.88 tokens per second)
llama_print_timings:        eval time = 23942.54 ms /   152 runs   (  157.52 ms per token,     6.35 tokens per second)
llama_print_timings:       total time = 26785.29 ms
Llama.generate: prefix-match hit




1. Treasure: Plundering the riches of the seven seas is what being a pirate is all about!
2. Sailing the High Seas: There's nothing like the thrill of setting sail on the open ocean, with the wind in your hair and the spray of the sea on your face.
3. Battleship Diplomacy: There's no better way to resolve disputes than through the tried-and-true method of cannon fire and boarding actions!
4. Drinking Grog: A good swashbuckler needs a reliable supply of rum to keep him in fighting form!
5. The Code of Conduct: What's a pirate without his code of conduct? It's all about respect, loyalty, and the occasional mutiny!

1. Treasure: Plundering the riches of the seven seas is what being a pirate is all about!
2. Sailing the High Seas: There's nothing like the thrill of setting sail on the open ocean, with the wind in your hair and the spray of the sea on your face.
3. Battleship Diplomacy: There's no better way to resolve disputes than through the tried-and-true method of cannon fire and boa


llama_print_timings:        load time =   710.83 ms
llama_print_timings:      sample time =    81.73 ms /   176 runs   (    0.46 ms per token,  2153.43 tokens per second)
llama_print_timings: prompt eval time =  1076.03 ms /    13 tokens (   82.77 ms per token,    12.08 tokens per second)
llama_print_timings:        eval time = 27196.40 ms /   175 runs   (  155.41 ms per token,     6.43 tokens per second)
llama_print_timings:       total time = 28823.42 ms
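
The `llama_print_timings` lines above report milliseconds and token counts per phase, which makes it easy to compare runs programmatically. The following sketch parses those lines with a regular expression and recomputes tokens per second. `parse_timings` and `tokens_per_second` are hypothetical helpers written against the log format shown above, which may differ in other llama.cpp versions.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 23942.54 ms / 152 runs".
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Map each phase name to its duration in ms and its token count (or None)."""
    timings = {}
    for match in TIMING_RE.finditer(log):
        count = match.group("count")
        timings[match.group("name")] = {
            "ms": float(match.group("ms")),
            "count": int(count) if count else None,
        }
    return timings


def tokens_per_second(entry: dict) -> float:
    """Throughput for a phase that has a token count (not 'load' or 'total')."""
    return entry["count"] / (entry["ms"] / 1000.0)


# Log lines copied from the first run above.
log = """\
llama_print_timings:        load time =   710.83 ms
llama_print_timings:      sample time =    72.96 ms /   153 runs   (    0.48 ms per token,  2097.01 tokens per second)
llama_print_timings: prompt eval time =  2357.19 ms /    28 tokens (   84.19 ms per token,    11.88 tokens per second)
llama_print_timings:        eval time = 23942.54 ms /   152 runs   (  157.52 ms per token,     6.35 tokens per second)
llama_print_timings:       total time = 26785.29 ms
"""
timings = parse_timings(log)
print(round(tokens_per_second(timings["eval"]), 2))  # 6.35, matching the log
```

The `eval` phase dominates the total time in both runs, so its tokens-per-second figure is the one most worth comparing between CPU and GPU.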
