# LLM usage - locally

## Setup and prerequisites

**Python/Jupyter notebook setup**
1. Install Python from [python.org](https://www.python.org/downloads/)
2. Create a virtual environment `python3.11 -m venv my_virtual_environment`
3. Activate virtual environment `source my_virtual_environment/bin/activate`
4. Install dependencies with pip `python3.11 -m pip install llama-cpp-python ipython ipykernel jupyter`
5. Create a kernel for this virtual environment `python3.11 -m ipykernel install --user --name my_virtual_environment --display-name "Python kernel display name"`
6. Start up `jupyter notebook`
(When you are done you can stop the notebook and deactivate the virtual environment with `deactivate`)

**LLM model download**
1. Download a Llama 2 language model in GGUF format from [huggingface](https://huggingface.co/TheBloke/Llama-2-7B-GGUF). The cells below load 13B files, so adjust `model_path` to match the file you downloaded.


**Documentation**
1. [llama-cpp-python API reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama)

## Completion


In [None]:
from llama_cpp import Llama

llm = Llama(model_path="../llama-2-13b-gguf/llama-2-13b.Q5_K_M.gguf")

In [17]:
text_completion=llm.create_completion("My favorite food is")
print(text_completion)

Llama.generate: prefix-match hit


{'id': 'cmpl-418fb14f-a2e7-4e79-8b1f-c9da42500604', 'object': 'text_completion', 'created': 1701874542, 'model': '../llama-2-13b-gguf/llama-2-13b.Q5_K_M.gguf', 'choices': [{'text': ' spaghetti. It’s good with cheese, sauce and meat', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 5, 'completion_tokens': 16, 'total_tokens': 21}}



llama_print_timings:        load time =     756.93 ms
llama_print_timings:      sample time =       1.35 ms /    16 runs   (    0.08 ms per token, 11808.12 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1943.72 ms /    16 runs   (  121.48 ms per token,     8.23 tokens per second)
llama_print_timings:       total time =    1963.15 ms


### Analyze
- Execution stats
- Output structure
- Token count - https://platform.openai.com/tokenizer
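
The three points above can be read straight off the returned dictionary and the timing log. The following is a minimal sketch that uses the sample output from the first cell; `summarize_completion` and `tokens_per_second` are hypothetical helper names, not part of llama-cpp-python.

```python
def summarize_completion(completion):
    """Pull the generated text, stop reason, and token counts out of a completion dict."""
    choice = completion["choices"][0]
    usage = completion["usage"]
    return {
        "text": choice["text"],
        "finish_reason": choice["finish_reason"],  # 'length': the token limit was reached
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput in the same unit as the llama_print_timings lines."""
    return n_tokens / (elapsed_ms / 1000.0)


# Response copied from the cell output above.
sample = {
    "id": "cmpl-418fb14f-a2e7-4e79-8b1f-c9da42500604",
    "object": "text_completion",
    "created": 1701874542,
    "model": "../llama-2-13b-gguf/llama-2-13b.Q5_K_M.gguf",
    "choices": [{"text": " spaghetti. It’s good with cheese, sauce and meat",
                 "index": 0, "logprobs": None, "finish_reason": "length"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 16, "total_tokens": 21},
}

summary = summarize_completion(sample)
print(summary["finish_reason"], summary["total_tokens"])
# Eval time from the log above: 16 runs in 1943.72 ms.
print(round(tokens_per_second(16, 1943.72), 2))
```

The computed throughput matches the `8.23 tokens per second` reported on the eval time line.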

In [18]:
print(text_completion['choices'][0]['text'])


 spaghetti. It’s good with cheese, sauce and meat


In [19]:
prompt="It was a bright sunny day, when suddenly"
print(prompt)
text_completion=llm.create_completion(prompt)
print(text_completion['choices'][0]['text'])



It was a bright sunny day, when suddenly


Llama.generate: prefix-match hit


 a little girl’s voice could be heard coming from the middle of the park



llama_print_timings:        load time =     756.93 ms
llama_print_timings:      sample time =       1.27 ms /    16 runs   (    0.08 ms per token, 12598.43 tokens per second)
llama_print_timings: prompt eval time =    1109.09 ms /    10 tokens (  110.91 ms per token,     9.02 tokens per second)
llama_print_timings:        eval time =    1745.97 ms /    15 runs   (  116.40 ms per token,     8.59 tokens per second)
llama_print_timings:       total time =    2874.70 ms



### Parameters - part 1

- max_tokens
- stop
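
To make the two parameters concrete, here is a small pure-Python sketch of how they cut off generation. It is an illustration only, not the library's implementation: `apply_limits` is a hypothetical helper, the token split in `pieces` is made up, and dropping the stop sequence from the returned text is an assumption based on the usual OpenAI-style convention.

```python
def apply_limits(pieces, max_tokens=None, stop=None):
    """Cut a list of generated text pieces according to max_tokens and stop.

    Returns (text, finish_reason).
    """
    stop = stop or []
    text = ""
    for count, piece in enumerate(pieces, start=1):
        text += piece
        # A stop sequence ends generation; the sequence itself is not returned.
        hits = [text.find(s) for s in stop if s in text]
        if hits:
            return text[:min(hits)], "stop"
        # max_tokens caps the number of generated tokens.
        if max_tokens is not None and count >= max_tokens:
            return text, "length"
    # Running out of pieces stands in for the model emitting end-of-sequence.
    return text, "stop"


# Hypothetical token split of a continuation of "Click here".
pieces = [" to", " view", " the", " ", "2", "0", "2", "3"]

print(apply_limits(pieces, max_tokens=5))
print(apply_limits(pieces, stop=["to"]))
print(apply_limits(pieces, stop=["a"], max_tokens=500))
```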

In [23]:
llm.verbose = False
text_completion = llm.create_completion("Click here", max_tokens=5)
#text_completion = llm.create_completion("Click here", stop=['to'])
#text_completion = llm.create_completion("Click here", stop=['a'], max_tokens=500)

print(text_completion['choices'][0]['text'])

 to view the 2


### Vocabulary

In [34]:
llm.n_vocab()

32000

In [47]:
tokenizer = llm.tokenizer()
print(tokenizer.encode("Hello there, General Kenobi!"))
print(tokenizer.encode("I love apply pie"))

[1, 15043, 727, 29892, 4593, 10015, 15647, 29991]
[1, 306, 5360, 3394, 5036]


In [50]:
tokenizer.decode([])

''

In [65]:
from random import randrange

random_tokens = [(randrange(llm.n_vocab())) for i in range(20)]
print(random_tokens)


[22886, 17801, 30456, 17165, 19405, 19390, 16351, 17059, 12428, 8902, 24145, 30124, 18477, 3526, 18345, 26035, 13564, 28927, 9937, 16615]


In [66]:
# Decode each random token id on its own to see the text piece it represents
[tokenizer.decode([i]) for i in random_tokens]

[' Aw',
 'boBox',
 'რ',
 'ahlen',
 ' license',
 ' expensive',
 ' века',
 'sleep',
 ' req',
 'nica',
 ' notre',
 '₂',
 'yna',
 ' în',
 ' grey',
 ' Botan',
 ' Mun',
 'emor',
 ' Jon',
 ' gentleman']

In [None]:
print(tokenizer.encode('September'))
print(tokenizer.decode(tokenizer.encode('September')))

# print(tokenizer.encode('szeptember'))
# print(tokenizer.decode(tokenizer.encode('szeptember')))

# print(tokenizer.encode('Szeptember'))
# print([tokenizer.decode([i]) for i in tokenizer.encode('Szeptember')])


### Streaming

In [132]:
class bcolors:
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    ENDC = '\033[0m'

def color_print(text, even):
    if (even % 2 == 0):
        print(f"{bcolors.OKBLUE}{text}{bcolors.ENDC}", end='', flush=True)
    else:
        print(f"{bcolors.OKGREEN}{text}{bcolors.ENDC}", end='', flush=True)

def consume_stream_response(stream_response, color=True):
    # Print each streamed chunk; alternating colors make token boundaries visible
    even = 0
    for response in stream_response:
        if 'choices' in response:
            color_print(response['choices'][0]['text'], even if color else 0)
        else:
            print(f'\n{response}')
        even += 1

def consume_stream_chat_response(stream_response, color=True):
    # Chat chunks carry their text under 'delta'; chunks without 'content' are skipped
    even = 0
    for response in stream_response:
        if 'choices' in response:
            if 'delta' in response['choices'][0] and 'content' in response['choices'][0]['delta']:
                color_print(response['choices'][0]['delta']['content'], even if color else 0)
            else:
                continue
        else:
            print(f'\n{response}')
        even += 1


In [None]:
consume_stream_response(llm.create_completion("Click here", stream=True, max_tokens=500))
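
When the full text is needed after streaming, the chunks can be joined instead of printed. This is a minimal sketch: `collect_stream_text` and `fake_stream` are hypothetical helpers, and the fake generator only imitates the chunk shape that `consume_stream_response` reads, so the example runs without a model.

```python
def collect_stream_text(stream_response):
    """Join the text pieces of a streamed completion into one string."""
    parts = []
    for chunk in stream_response:
        if "choices" in chunk:
            parts.append(chunk["choices"][0]["text"])
    return "".join(parts)


def fake_stream(pieces):
    """Hypothetical stand-in for llm.create_completion(..., stream=True)."""
    for piece in pieces:
        yield {"choices": [{"text": piece, "index": 0, "finish_reason": None}]}


full_text = collect_stream_text(fake_stream([" to", " view", " the", " 2"]))
print(repr(full_text))
```

With a loaded model, the same function could be called as `collect_stream_text(llm.create_completion("Click here", stream=True, max_tokens=5))`.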

### Parameters - Part 2

- temperature, top_k, top_p
- deterministic / repeatable
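
The sketch below shows, on a toy distribution, what these sampling parameters do and why a fixed seed makes runs repeatable. It is a simplified illustration, not the llama.cpp sampler: `sample_token` is a hypothetical function, the logits are invented, and `temperature` must be greater than zero here.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token from a {token: logit} dict using temperature, top-k and top-p."""
    # Temperature rescales logits: small values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # top_k keeps only the k most likely tokens (0 disables the filter in this sketch).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for _, tok in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


# Invented logits for the next token after "my favorite food is".
toy_logits = {" pizza": 3.0, " pasta": 2.0, " soup": 1.0, " corn": 0.5}

# Near-zero temperature: effectively always the most likely token.
print(sample_token(toy_logits, temperature=0.001))

# Same seed, same sequence: sampling becomes repeatable.
first = random.Random(42)
second = random.Random(42)
run_a = [sample_token(toy_logits, temperature=0.999, top_p=0.99, rng=first) for _ in range(5)]
run_b = [sample_token(toy_logits, temperature=0.999, top_p=0.99, rng=second) for _ in range(5)]
print(run_a == run_b)
```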

In [85]:
# temperature/randomness of the model - my favorite food is
prompt='my favorite food is'
print(llm.create_completion(prompt, temperature=0.001, top_k=100, max_tokens=25)['choices'][0]['text'])
print('----------')
print(llm.create_completion(prompt, temperature=0.999, top_p=0.99 ,max_tokens=25)['choices'][0]['text'])

 pizza and i love to eat it.
I like to eat pizza too!
My favorite food is pizza
----------

Sweet corn, potato, and broccoli soup
My favorite music is
I like pop music best.


In [84]:
# temperature/randomness of the model - plumbing company motto
prompt='The new motto for my plumbing company is:'
print(llm.create_completion(prompt, temperature=0.001, max_tokens=25)['choices'][0]['text'])
print('----------')
print(llm.create_completion(prompt, temperature=0.999,top_p=0.99, max_tokens=25)['choices'][0]['text'])


"We're not just a plumber, we're a plumber who cares."
I think it
----------

"I can do it right, the first time! You won't have to call me back. That makes your


## Instruction/Chat completion

In [None]:
llm.create_completion("Which one is the largest planet in our solar system?", max_tokens=25)['choices'][0]['text']

In [None]:
llm_chat=Llama(model_path="../llama-2-13b-chat-gguf/llama-2-13b-chat.Q5_K_M.gguf")
#llm_chat.verbose=False

In [98]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user","content": "Which one is the largest planet in our solar system?"}
]

llm_chat.create_chat_completion(messages=messages)

Llama.generate: prefix-match hit



{'id': 'chatcmpl-9e014e54-e530-4309-9e09-52899030d69f',
 'object': 'chat.completion',
 'created': 1701880651,
 'model': '../llama-2-13b-chat-gguf/llama-2-13b-chat.Q5_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  Hello! I'm here to help! The largest planet in our solar system is Jupiter. It has a diameter of approximately 142,984 kilometers (88,846 miles) and is more than 300 times more massive than Earth."},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 36, 'completion_tokens': 60, 'total_tokens': 96}}

llama_print_timings:        load time =   20407.28 ms
llama_print_timings:      sample time =       6.21 ms /    61 runs   (    0.10 ms per token,  9821.28 tokens per second)
llama_print_timings: prompt eval time =   16174.50 ms /    24 tokens (  673.94 ms per token,     1.48 tokens per second)
llama_print_timings:        eval time =   11172.59 ms /    60 runs   (  186.21 ms per token,     5.37 tokens per second)
llama_print_timings:       total time =   27432.25 ms


In [101]:
messages = [
    {"role": "system", "content": "You are an aggresive teacher who tries to lecture their students."},
    {"role": "user","content": "Which one is the largest planet in our solar system?"}
]

llm_chat.create_chat_completion(messages=messages, max_tokens=50)

Llama.generate: prefix-match hit

llama_print_timings:        load time =   20407.28 ms
llama_print_timings:      sample time =       4.45 ms /    50 runs   (    0.09 ms per token, 11225.86 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    6879.54 ms /    50 runs   (  137.59 ms per token,     7.27 tokens per second)
llama_print_timings:       total time =    6945.40 ms


{'id': 'chatcmpl-02d02d51-4d3d-4b20-b3a2-c3e493811895',
 'object': 'chat.completion',
 'created': 1701880777,
 'model': '../llama-2-13b-chat-gguf/llama-2-13b-chat.Q5_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  OH HO HO! Look who's asking questions now! You think you can just waltz into my classroom and ask me a question without doing your homework first? Well, let me tell you something, youngster! I"},
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 45, 'completion_tokens': 50, 'total_tokens': 95}}
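
A chat model is still a completion model underneath: the role/content messages are flattened into one prompt string before generation. The sketch below approximates the prompt template published for Llama 2 chat models (`[INST]`, `<<SYS>>`). It is an assumption for illustration; `build_llama2_prompt` is a hypothetical helper, and the formatter that llama-cpp-python actually applies may differ in whitespace and special-token handling.

```python
def build_llama2_prompt(messages):
    """Flatten role/content messages into a single Llama-2-style prompt string."""
    system = ""
    turns = []  # each entry is [user_text, assistant_text_or_None]
    for message in messages:
        role, content = message["role"], message["content"].strip()
        if role == "system":
            system = f"<<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == "user":
            turns.append([content, None])
        elif role == "assistant" and turns:
            turns[-1][1] = content
    prompt = ""
    for index, (user, assistant) in enumerate(turns):
        # The system block is folded into the first user turn.
        user_text = (system + user) if index == 0 else user
        prompt += f"<s>[INST] {user_text} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Which one is the largest planet in our solar system?"},
]
prompt = build_llama2_prompt(messages)
print(prompt)
```

The base model never saw this wrapping during training as an instruction format, which is one way to think about why plain `create_completion` tends to continue the text while the chat model tends to answer the question.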

### Continuous chat - context

In [139]:
messages = [
    {"role": "system", "content": "Act as a helpful assistant, called Jason."},
    {"role": "user","content": "Hello my name is Endre! Who are you?"}
]

chat_completion=llm_chat.create_chat_completion(messages=messages, max_tokens=20)
chat_completion['choices'][0]

{'index': 0,
 'message': {'role': 'assistant', 'content': "  Hey there, Endre! My name'"},
 'finish_reason': 'length'}

In [144]:
messages = [
    {"role": "system", "content": "Act as a helpful assistant, called Jason."},
    {"role": "user","content": "I'm sorry what is my name?"}
]

chat_completion=llm_chat.create_chat_completion(messages=messages, max_tokens=20)
print(chat_completion['choices'][0]['message']['content'])
chat_completion

  Oh ho ho! Don't worry, my dear, I remember your name perfectly well!


{'id': 'chatcmpl-efe984f7-77e4-4322-afb4-f3f874c51134',
 'object': 'chat.completion',
 'created': 1701887140,
 'model': '../llama-2-13b-chat-gguf/llama-2-13b-chat.Q5_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  Oh ho ho! Don't worry, my dear, I remember your name perfectly well!"},
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 38, 'completion_tokens': 20, 'total_tokens': 58}}

In [146]:
messages = [
    {"role": "system", "content": "Act as a helpful assistant, called Jason."},
    {"role": "user","content": "Hello my name is Endre! Who are you?"},
    {'role': 'assistant', 'content': "  Hey there, Endre!"},
    {"role": "user","content": "I'm sorry what is my name?"},
]

chat_completion=llm_chat.create_chat_completion(messages=messages, max_tokens=50)
print(chat_completion['choices'][0]['message']['content'])
chat_completion

  Oh ho ho! Don't worry, I got ya! Your name is Endre! *giggle* What can I help you with today, my fabulous friend? 😄


{'id': 'chatcmpl-16d19596-a841-4625-8fac-2f636ec30a08',
 'object': 'chat.completion',
 'created': 1701887212,
 'model': '../llama-2-13b-chat-gguf/llama-2-13b-chat.Q5_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  Oh ho ho! Don't worry, I got ya! Your name is Endre! *giggle* What can I help you with today, my fabulous friend? 😄"},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 67, 'completion_tokens': 44, 'total_tokens': 111}}
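
The cells above show that the model only "remembers" what is passed in `messages`, so a continuous chat means resending the growing history on every turn. Here is a minimal sketch of that loop: `chat_turn` is a hypothetical helper, and `fake_chat_completion` is a stand-in with the same response shape as `llm_chat.create_chat_completion` so the example runs without a model.

```python
def chat_turn(messages, user_text, complete_fn, **kwargs):
    """Append the user message, ask the model, and append its reply to the history."""
    messages.append({"role": "user", "content": user_text})
    response = complete_fn(messages=messages, **kwargs)
    reply = response["choices"][0]["message"]
    messages.append({"role": reply["role"], "content": reply["content"]})
    return reply["content"]


def fake_chat_completion(messages, **kwargs):
    """Hypothetical stand-in for llm_chat.create_chat_completion, for a dry run."""
    content = f"(reply to message {len(messages)})"
    return {"choices": [{"index": 0,
                         "message": {"role": "assistant", "content": content},
                         "finish_reason": "stop"}]}


history = [{"role": "system", "content": "Act as a helpful assistant, called Jason."}]
chat_turn(history, "Hello my name is Endre! Who are you?", fake_chat_completion)
last_reply = chat_turn(history, "I'm sorry what is my name?", fake_chat_completion, max_tokens=50)
print([m["role"] for m in history])
```

With a loaded model, `llm_chat.create_chat_completion` could be passed as `complete_fn` in place of the fake.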

### Context Length

In [162]:
# Raise the context window from the default of 512 tokens (the maximum for Llama 2 is about 4096)
llm = Llama(model_path="../llama-2-13b-gguf/llama-2-13b.Q5_K_M.gguf", n_ctx=1024)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../llama-2-13b-gguf/llama-2-13b.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q5_K     [  5120,  5120,     1, 

In [165]:
# random long context

from random import randrange

random_tokens = [(randrange(llm.n_vocab())) for i in range(950)]
long_prompt = tokenizer.decode(random_tokens)

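# With n_ctx=1024 this prompt fits; with the default n_ctx=512 it would exceed the context window and raise an error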
llm.create_completion(long_prompt)


llama_print_timings:        load time =   70241.26 ms
llama_print_timings:      sample time =       1.58 ms /    16 runs   (    0.10 ms per token, 10145.85 tokens per second)
llama_print_timings: prompt eval time =  138120.23 ms /  1003 tokens (  137.71 ms per token,     7.26 tokens per second)
llama_print_timings:        eval time =    2176.70 ms /    15 runs   (  145.11 ms per token,     6.89 tokens per second)
llama_print_timings:       total time =  140325.02 ms


{'id': 'cmpl-c84e1106-da70-4a69-a1fd-c26e4ef1eb18',
 'object': 'text_completion',
 'created': 1701900743,
 'model': '../llama-2-13b-gguf/llama-2-13b.Q5_K_M.gguf',
 'choices': [{'text': 'ersЂедэикатсемьсхновил',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 1003,
  'completion_tokens': 16,
  'total_tokens': 1019}}
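
A quick budget check helps avoid such errors before calling the model. This sketch assumes that the prompt tokens plus the requested completion tokens must fit inside `n_ctx`; `fits_context` and `max_completion_budget` are hypothetical helper names.

```python
def fits_context(prompt_tokens, max_tokens, n_ctx):
    """True when the prompt plus the requested completion fits in the context window."""
    return prompt_tokens + max_tokens <= n_ctx


def max_completion_budget(prompt_tokens, n_ctx):
    """Largest max_tokens that still fits (0 if the prompt alone is too long)."""
    return max(0, n_ctx - prompt_tokens)


# Numbers taken from the run above: 1003 prompt tokens, 16 generated tokens.
print(fits_context(1003, 16, n_ctx=1024))
print(fits_context(1003, 16, n_ctx=512))
print(max_completion_budget(1003, n_ctx=1024))
```

Note that the 950 random token ids decoded to text that re-encoded as 1003 prompt tokens, so counting tokens after encoding is safer than counting before decoding.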

### Hungarian (HU) tests

In [136]:
llm_chat.verbose=False

In [112]:
# Prompt is the start of a Hungarian tongue twister: "The priest of Ipafa..."
consume_stream_response(llm.create_completion("Az ipafai papnak", stream=True, max_tokens=100), color=False)

 ki ne isa 3. Apr 2019 07:53
Walton, Ky 41094
The IPAFAI PAPNAK KI NE ISA (hereinafter referred to as the "Company") was registered on 12/29/2014 in state of Kentucky. The IPAFAI PAPNAK KI NE I

In [133]:
# System: "You are a helpful assistant." User: "Does the priest of Ipafa have a wooden pipe?"
messages = [
    {"role": "system", "content": "Egy segítőkész asszisztens vagy."},
    {"role": "user","content": "Az ipafai papnak van fapipája?"}
]
consume_stream_chat_response(llm_chat.create_chat_completion(messages, stream=True, max_tokens=100))


Llama.generate: prefix-match hit


[94m [0m[92m Ah[0m[94m,[0m[92m az[0m[94m ip[0m[92maf[0m[94mai[0m[92m pap[0m[94mnak[0m[92m van[0m[94m f[0m[92map[0m[94mip[0m[92mája[0m[94m![0m[92m [0m[94m😄[0m[92m Well[0m[94m,[0m[92m I[0m[94m'[0m[92mm[0m[94m not[0m[92m sure[0m[94m if[0m[92m I[0m[94m can[0m[92m help[0m[94m you[0m[92m with[0m[94m that[0m[92m,[0m[94m but[0m[92m I[0m[94m can[0m[92m certainly[0m[94m try[0m[92m my[0m[94m best[0m[92m to[0m[94m assist[0m[92m you[0m[94m.[0m[92m What[0m[94m do[0m[92m you[0m[94m need[0m[92m help[0m[94m with[0m[92m?[0m[94m Do[0m[92m you[0m[94m have[0m[92m a[0m[94m specific[0m[92m question[0m[94m or[0m[92m task[0m[94m in[0m[92m mind[0m[94m?[0m[92m Please[0m[94m feel[0m[92m free[0m[94m to[0m[92m ask[0m[94m,[0m[92m and[0m[94m I[0m[92m'[0m[94mll[0m[92m do[0m[94m my[0m[92m best[0m[94m to[0m[92m provide[0m[94m a[0m[92m helpful[0m[94m response[0m[92


llama_print_timings:        load time =   20407.28 ms
llama_print_timings:      sample time =       7.77 ms /    84 runs   (    0.09 ms per token, 10812.20 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   11628.83 ms /    84 runs   (  138.44 ms per token,     7.22 tokens per second)
llama_print_timings:       total time =   11782.82 ms


In [137]:
# System: "You are a helpful assistant. Answer only in Hungarian!" User: same question as above
messages = [
    {"role": "system", "content": "Egy segítőkész asszisztens vagy. Csak magyarul válaszolj!"},
    {"role": "user","content": "Az ipafai papnak van fapipája?"}
]
consume_stream_chat_response(llm_chat.create_chat_completion(messages, stream=True, max_tokens=100), color=False)


[94m [0m[94m Ah[0m[94mogy[0m[94m az[0m[94m ass[0m[94mz[0m[94miszt[0m[94mens[0m[94m,[0m[94m ú[0m[94mgy[0m[94m a[0m[94m pap[0m[94m is[0m[94m van[0m[94m f[0m[94map[0m[94mip[0m[94mája[0m[94m![0m[94m [0m[94m😄[0m