# Inference with a GGUF Model
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF


## Example 1: With CTransformers 


Install CTransformers
```sh
# CPU only (quote the requirement so the shell does not treat ">" as a redirect)
pip install "ctransformers>=0.2.24"
# With CUDA GPU support
pip install "ctransformers[cuda]>=0.2.24"
```
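
To confirm that the installed package meets the `0.2.24` minimum, a check along these lines may help. This is a minimal sketch: `installed_version` and `meets_minimum` are hypothetical helper names, and the comparison assumes plain numeric versions such as `0.2.27`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_minimum(installed, minimum):
    """Compare dotted numeric versions, e.g. '0.2.27' >= '0.2.24'.

    Assumption: numeric release segments only (no 'rc1' or '.post1' suffixes).
    """
    def as_tuple(v):
        return tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)


found = installed_version("ctransformers")
if found is None:
    print("ctransformers is not installed")
else:
    print(found, "OK" if meets_minimum(found, "0.2.24") else "is older than 0.2.24")
```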

In [1]:
!pip install ctransformers[cuda]

Collecting nvidia-cublas-cu12 (from ctransformers[cuda])
  Obtaining dependency information for nvidia-cublas-cu12 from https://files.pythonhosted.org/packages/b6/6a/e8cca34f85b18a0280e3a19faca1923f6a04e7d587e9d8e33bc295a52b6d/nvidia_cublas_cu12-12.2.5.6-py3-none-manylinux1_x86_64.whl.metadata
  Downloading nvidia_cublas_cu12-12.2.5.6-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12 (from ctransformers[cuda])
  Obtaining dependency information for nvidia-cuda-runtime-cu12 from https://files.pythonhosted.org/packages/95/46/6361d45c7a6fe3b3bb8d5fa35eb43c1dcd12d14799a0dc6faef3d76eaf41/nvidia_cuda_runtime_cu12-12.2.140-py3-none-manylinux1_x86_64.whl.metadata
  Downloading nvidia_cuda_runtime_cu12-12.2.140-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Downloading nvidia_cublas_cu12-12.2.5.6-py3-none-manylinux1_x86_64.whl (417.8 MB)
   417.8/417.8 MB 627.1 kB/s eta 0:00:00

### Example 1.1: With CTransformers Class

In [2]:
from ctransformers import AutoModelForCausalLM
import os
from timeit import timeit

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# https://github.com/marella/ctransformers#documentation
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1}
# gpu_layers is the number of layers to offload to the GPU. Leave it at 0 (the default) if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(model_path_or_repo_id=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", model_type="llama",
                                           # gpu_layers=50,
                                           # stream=True,
                                           **config)


# https://replicate.com/blog/how-to-prompt-llama#wrap-user-input-with-inst-inst-tags
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)),number=1)) # In my case: GPU:24s/CPU:42s

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)),number=1))


  Arrrr, shiver me timbers! *adjusts eye patch* As a scurvy dog of the high seas, I have many fond preferences. But if I had to narrow it down to just one, I'd have to go with... (leans in close and winks)

ARRRGH! *crackles tea mug* The treasure! Oh, the sweet, sweet treasure! *pirate eyes light up* There's nothing like the thrill of finding a good cache of gold doubloons or a fine piece of jewelry. And don't even get me started on the bounty of booty from them there ships! *chuckles wickedly* The sound of clinking coins and the smell of saltwater... it's like music to me ears, matey!
But I suppose ye want to know about me other favorite things? (grin) Well, I do enjoy a good swashbuckling adventure on the high seas. There's nothing quite like the rush of battle or the thrill of outwitting the enemy. And of course, there's no better feeling
49.829341862001456

I be lovin' me some treasure! The glint of gold in the sunlight, the shine of silver, and the sparkle of jewels. There be noth
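
The difference between the two prompts is the `[INST]` / `<<SYS>>` wrapping. A minimal sketch of a helper that builds such a prompt is shown below. `build_llama2_prompt` is a hypothetical name, and the newline placement follows the commonly documented Llama 2 chat layout rather than the single-line variant used in the cell above, so outputs may differ slightly.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Wrap a user message (and optional system prompt) in Llama 2 chat tags.

    Assumption: the beginning-of-sequence token is added by the tokenizer,
    so it is not included here.
    """
    user_message = user_message.strip()
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


prompt = build_llama2_prompt("What's your favorite?", "You are a pirate")
print(prompt)
```

The resulting string can be passed to `llm(...)` in place of `correct_prompt`.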

### Example 1.2: With LangChain.CTransformers
https://python.langchain.com/docs/integrations/llms/ctransformers  
https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html  

In [3]:
from langchain.llms import CTransformers
from timeit import timeit
import os

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# https://github.com/marella/ctransformers#config
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1, 'temperature':0.9}
# https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html
llm = CTransformers(model=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", config=config)

correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt, **config)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt, **config)),number=1))

  Arrrr, me hearty! *adjusts eye patch*

Oh, there be many things that a scurvy dog like meself loves about being a pirate. But if I had to narrow it down to just one favorite thing... *winks*

It'd have ta be the treasure huntin', of course! *drools* There be nothin' quite like the thrill o' searchin' for hidden booty, and the satisfaction o' findin' a fine haul. Whether it be gold doubloons, shiny jewels, or ancient relics, there's nothin' that gets me blood pumpin' like the hunt for treasure! *pirate laugh*

But don't get me wrong, matey... I also enjoy a good swashbucklin' adventure every now and again. Sword fights with giant sea monsters? *cackles* Sign me up! And if any o' ye landlubbers be lookin' for a proper pirate's life lesson, just remember: the code of conduct is simple... "Raise yer cup o' gro
46.85710486903554


1. Treasure: Ahoy matey! I love me some treasure! There's nothing quite like the thrill of finding a chest filled with gold doubloons and shiny jewels. It's lik
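
The cells above call `print` inside `timeit`, so the generated text is printed but not kept. If both the text and the elapsed time are needed, a small wrapper like the following could be used instead. It is only a sketch: `timed` is a hypothetical helper and `fake_llm` is a stand-in for the real model call so the example runs without a model file.

```python
from time import perf_counter


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


def fake_llm(prompt, **config):
    """Stand-in for llm(prompt, **config); simply upper-cases the prompt."""
    return prompt.upper()


text, seconds = timed(fake_llm, "ahoy", max_new_tokens=256)
print(text)
print(f"{seconds:.3f} s")
```

With a real model, `timed(llm, correct_prompt)` would return the completion and the wall-clock time in seconds.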

## Example 2: With Llama.Cpp

https://python.langchain.com/docs/integrations/llms/llamacpp
```sh
# CPU
pip install llama-cpp-python
```
__OR__
```sh
# GPU
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```

In [4]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.7.tar.gz (1.6 MB)
     1.6/1.6 MB 616.4 kB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
  Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/24/21/7d397a4b7934ff4028987914ac1044d3b7d52712f30e2ac7a2ae5bc86dd0/typing_extensions-4.8.0-py3-none-any.whl.metadata
  Downloading typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
  Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/9b/5a/f265a1ba3641d16b5480a217a6aed08cceef0

### Example 2.1: With Llama.Cpp
https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_inference.py


In [5]:
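import os
from timeit import timeit

from llama_cpp import Llama

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L209
# n_gpu_layers=-1 offloads all layers to the GPU (needs a GPU-enabled build); use 0 for CPU only.
llm = Llama(model_path=model_id, n_gpu_layers=-1, verbose=False)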



correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
# https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L1409
print(timeit(lambda: print(llm(correct_prompt,
                            max_tokens=256,
                            # stop=["Q:", "\n"],
                            echo=True,
                        )),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt,
                            max_tokens=256,
                            # stop=["Q:", "\n"],
                            echo=True,
                        )),number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

{'id': 'cmpl-122ca1fe-64b4-4a67-8bbe-2d888df669f2', 'object': 'text_completion', 'created': 1695796313, 'model': '/app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf', 'choices': [{'text': "[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]  Arrrr, shiver me timbers! *adjusts eye patch* 'Tis a fine question ye be askin', matey! As a swashbucklin' pirate, I have many favorite things. But if I be forced to choose just one... *crackin' knuckles*\n\nMine favorite thing be the treasure! *wink* Oh, the glitterin' gold, the shiny jewels, the ancient artifacts from far-off lands... *swoon* They be worth more than all the rum in the Caribbean! *hiccup* And I be collectin' them for me own private stash. *giggles*\nBut close second be me trusty cutlass! *adjusts sword belt* It be the best weapon for choppin' through mutiny and fightin' off landlubbers! *wink* And it be a fine companion for singin' sea shanties while sailin' the high seas. *hummin' tune*\nAnd la
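
As the output above shows, calling `Llama` returns a dictionary rather than a string, and with `echo=True` the prompt is repeated at the start of `choices[0]['text']`. A minimal sketch for pulling out only the generated text follows. `completion_text` is a hypothetical helper, and `example` is a shortened stand-in that uses only the `choices` and `text` keys visible in the output.

```python
def completion_text(response, prompt=None):
    """Return the generated text from a llama-cpp-python completion dict.

    If `prompt` is given and the text starts with it (as with echo=True),
    the echoed prompt is removed.
    """
    text = response["choices"][0]["text"]
    if prompt is not None and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Shortened stand-in for a real response
example = {
    "object": "text_completion",
    "choices": [{"text": "[INST] Hi [/INST]  Arrrr, matey!"}],
}
print(completion_text(example, prompt="[INST] Hi [/INST]"))
```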

### Example 2.2: With LangChain.LlamaCpp
https://python.langchain.com/docs/integrations/llms/llamacpp

In [6]:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from timeit import timeit
import os


model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf')

# https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html
llm = LlamaCpp(
    model_path=model_id,
    max_tokens=256,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]), # Callbacks support token-wise streaming
    verbose=True, # verbose=True is required so that tokens are passed to the callback manager
)
# Timing in my case: CPU 42 s / GPU 35 s
correct_prompt="[INST] <<SYS>>You are a pirate <</SYS>> What's your favorite? [/INST]"
print(timeit(lambda: print(llm(correct_prompt)),number=1))

incorrect_prompt="If you are a pirate, What's your favorite?"
print(timeit(lambda: print(llm(incorrect_prompt)),number=1))

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/project/models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weigh

  Shiver me timbers! As a swashbucklin' pirate, I have many favorite things. But if I had to choose just one... arrr, it would have to be me trusty cutlass! There's nothin' like the feel of this fine blade in me hand, ready to take on any scurvy dog who dares get in me way. It's a fine weapon, don't ye know? And it's served me well over the years, from battlin' against the Royal Navy to singin' sea shanties with me mates. So if ye be askin', this is me favorite thing... arrr!


llama_print_timings:        load time =   678.27 ms
llama_print_timings:      sample time =    68.48 ms /   144 runs   (    0.48 ms per token,  2102.83 tokens per second)
llama_print_timings: prompt eval time =  2438.44 ms /    28 tokens (   87.09 ms per token,    11.48 tokens per second)
llama_print_timings:        eval time = 24801.03 ms /   143 runs   (  173.43 ms per token,     5.77 tokens per second)
llama_print_timings:       total time = 27743.25 ms
Llama.generate: prefix-match hit


  Shiver me timbers! As a swashbucklin' pirate, I have many favorite things. But if I had to choose just one... arrr, it would have to be me trusty cutlass! There's nothin' like the feel of this fine blade in me hand, ready to take on any scurvy dog who dares get in me way. It's a fine weapon, don't ye know? And it's served me well over the years, from battlin' against the Royal Navy to singin' sea shanties with me mates. So if ye be askin', this is me favorite thing... arrr!
27.74501868704101


What's your favorite type of pirate ship?

Ahoy matey! I be Captain Blackbeak, the most feared pirate on the seven seas. Me favorite thing be me trusty ol' galleon, the "Black Swan". She be fast and sturdy, with three masts and a crew of 200 swashbuckling buccaneers. We sail the Caribbean seas, plundering ships and treasure wherever we go. Yarrr!

What's your favorite type of treasure to find on the high seas?

Ahoy matey! I be Captain Blackbeak again, still sailin' the seven seas in search of 


llama_print_timings:        load time =   678.27 ms
llama_print_timings:      sample time =   113.30 ms /   239 runs   (    0.47 ms per token,  2109.54 tokens per second)
llama_print_timings: prompt eval time =  1241.91 ms /    13 tokens (   95.53 ms per token,    10.47 tokens per second)
llama_print_timings:        eval time = 41061.44 ms /   238 runs   (  172.53 ms per token,     5.80 tokens per second)
llama_print_timings:       total time = 43104.69 ms
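
The `llama_print_timings` lines can be turned into numbers for comparing runs. The sketch below parses the format shown above and recomputes throughput as count / (ms / 1000). `parse_timings` and `TIMING_RE` are hypothetical names, and the pattern is an assumption based only on the log lines in this notebook, so other llama.cpp versions may print a different layout.

```python
import re

# Matches e.g. "llama_print_timings:  eval time = 24801.03 ms /   143 runs"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {stage: {'ms': ..., 'count': ..., 'per_second': ...}} from a timing log."""
    stats = {}
    for match in TIMING_RE.finditer(log_text):
        entry = {"ms": float(match.group("ms"))}
        if match.group("count") is not None:
            entry["count"] = int(match.group("count"))
            entry["per_second"] = entry["count"] / (entry["ms"] / 1000.0)
        stats[match.group("name")] = entry
    return stats


# Lines copied from the first run's output
log = """
llama_print_timings:        load time =   678.27 ms
llama_print_timings:      sample time =    68.48 ms /   144 runs   (    0.48 ms per token,  2102.83 tokens per second)
llama_print_timings: prompt eval time =  2438.44 ms /    28 tokens (   87.09 ms per token,    11.48 tokens per second)
llama_print_timings:        eval time = 24801.03 ms /   143 runs   (  173.43 ms per token,     5.77 tokens per second)
llama_print_timings:       total time = 27743.25 ms
"""

stats = parse_timings(log)
for stage, entry in stats.items():
    rate = entry.get("per_second")
    suffix = f", {rate:.2f} per second" if rate is not None else ""
    print(f"{stage}: {entry['ms']:.2f} ms{suffix}")
```

For the first run this gives roughly 5.77 generated tokens per second (143 tokens in 24.8 s), which agrees with the figure printed in the log.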
