In [1]:
!pip install einops

Collecting einops
  Downloading einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Downloading einops-0.7.0-py3-none-any.whl (44 kB)
Installing collected packages: einops
Successfully installed einops-0.7.0


In [1]:
### imports
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer
)
from peft import LoraConfig, peft_model, get_peft_model, AutoPeftModelForCausalLM
from peft.tuners.lora import LoraLayer
# from trl import SFTTrainer # this is only needed when we Tune
from datasets import load_dataset

In [2]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:
torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False, add_special_tokens=True)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
inputs_summary = tokenizer("""###Instruction: You are an AI assistant that will summarise the correspondence in ###Input as ###Summary\n\n###Input: #Person1#: What are the main differences between this country and your country?\n#Person2#: Well, in Russia, everything happens very fast. People talk quickly, they drive their cars too fast, the good deals go by really quickly...but here in Canada, it seems like people are a little more relaxed.\n#Person1#: Is that true for everything?\n#Person2#: No, of course not. In Russia, going to the bank can take hours. The same is true for the post office and the supermarket. In Canada, however, these places are pretty easy to get through quickly.\n#Person1#: So, what is it that makes some things go either faster or slower compared to us here in Canada? I was born and raised here, so I guess I don't notice these things. I've also never been outside the country before.\n#Person2#: I think the people in Russia are fast movers by nature, at least in the big cities. Public places are still very slow because they haven't tried to do business any differently than they used to.\n#Person1#: But in Canada, it's the opposite?\n#Person2#: Right. The government here does a great job of solving problems and using new technology to make businesses work better. But I think Canadians are just more calm in general than Russians are... and they're definitely more relaxed than Americans!\n#Person1#: Well, I agree with you about that last part!\n\n###Summary:""", 
                           return_tensors="pt", return_attention_mask=False)
inputs_summary

{'input_ids': tensor([[21017,  6310,  2762,    25,   921,   389,   281,  9552,  8796,   326,
           481, 15676,   786,   262, 22440,   287, 44386, 20560,   355, 44386,
         22093,   198,   198, 21017, 20560,    25,  1303, 15439,    16,     2,
            25,  1867,   389,   262,  1388,  5400,  1022,   428,  1499,   290,
           534,  1499,    30,   198,     2, 15439,    17,     2,    25,  3894,
            11,   287,  3284,    11,  2279,  4325,   845,  3049,    13,  4380,
          1561,  2952,    11,   484,  3708,   511,  5006,  1165,  3049,    11,
           262,   922,  7529,   467,   416,  1107,  2952,   986,  4360,   994,
           287,  3340,    11,   340,  2331,   588,   661,   389,   257,  1310,
           517, 18397,    13,   198,     2, 15439,    16,     2,    25,  1148,
           326,  2081,   329,  2279,    30,   198,     2, 15439,    17,     2,
            25,  1400,    11,   286,  1781,   407,    13,   554,  3284,    11,
          1016,   284,   262,  3331,  

In [4]:
outputs = model.generate(**inputs_summary, max_new_tokens=128)
text = tokenizer.batch_decode(outputs)[0]
print(text)

###Instruction: You are an AI assistant that will summarise the correspondence in ###Input as ###Summary

###Input: #Person1#: What are the main differences between this country and your country?
#Person2#: Well, in Russia, everything happens very fast. People talk quickly, they drive their cars too fast, the good deals go by really quickly...but here in Canada, it seems like people are a little more relaxed.
#Person1#: Is that true for everything?
#Person2#: No, of course not. In Russia, going to the bank can take hours. The same is true for the post office and the supermarket. In Canada, however, these places are pretty easy to get through quickly.
#Person1#: So, what is it that makes some things go either faster or slower compared to us here in Canada? I was born and raised here, so I guess I don't notice these things. I've also never been outside the country before.
#Person2#: I think the people in Russia are fast movers by nature, at least in the big cities. Public places are stil

In [9]:
print_trainable_parameters(model)

trainable params: 2779683840 || all params: 2779683840 || trainable%: 100.0


In [10]:
model

PhiForCausalLM(
  (transformer): PhiModel(
    (embd): Embedding(
      (wte): Embedding(51200, 2560)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (h): ModuleList(
      (0-31): 32 x ParallelBlock(
        (ln): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): Linear(in_features=2560, out_features=7680, bias=True)
          (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (inner_attn): SelfAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
          (inner_cross_attn): CrossAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
        )
        (mlp): MLP(
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
          (act): NewGELUActivation()
        )
      )
    )
  )
  (lm

In [4]:
model_name = "./data/finetuned_Falcon7b_summary_8bit"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)



In [6]:
print_trainable_parameters(model_8bit)

trainable params: 295768960 || all params: 6921720704 || trainable%: 4.273055395446363
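
The parameter counts printed above also give a rough idea of how much GPU memory the weights alone need at each precision. The sketch below redoes the percentage from the cell output and adds a nominal weights-only estimate. The bytes-per-parameter figures (2 for fp16, 1 for 8-bit, 0.5 for 4-bit) are assumptions that ignore quantisation metadata, activations and the KV cache, and `weight_memory_gib` is a hypothetical helper name.

```python
# Parameter counts copied from the print_trainable_parameters outputs above.
PHI2_ALL = 2_779_683_840
FALCON_TRAINABLE = 295_768_960
FALCON_ALL = 6_921_720_704

# Nominal bytes per parameter (assumption: weights only, no quantisation overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def trainable_percent(trainable, total):
    """Same formula as print_trainable_parameters."""
    return 100 * trainable / total


def weight_memory_gib(n_params, precision):
    """Weights-only memory estimate in GiB for a nominal precision."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


print(f"trainable%: {trainable_percent(FALCON_TRAINABLE, FALCON_ALL):.2f}")
for name, n_params in [("phi-2", PHI2_ALL), ("Falcon7b checkpoint", FALCON_ALL)]:
    for precision in BYTES_PER_PARAM:
        print(f"{name} {precision}: {weight_memory_gib(n_params, precision):.2f} GiB")
```

Actual usage on the GPU will be higher than these figures, so they are best read as a lower bound when deciding whether a model fits.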


In [5]:
summaryless = "###Instruction: You are an AI assistant that will summarise the correspondence in ###Input as ###Summary\n\n###Input: #Person1#: What are the main differences between this country and your country?\n#Person2#: Well, in Russia, everything happens very fast. People talk quickly, they drive their cars too fast, the good deals go by really quickly...but here in Canada, it seems like people are a little more relaxed.\n#Person1#: Is that true for everything?\n#Person2#: No, of course not. In Russia, going to the bank can take hours. The same is true for the post office and the supermarket. In Canada, however, these places are pretty easy to get through quickly.\n#Person1#: So, what is it that makes some things go either faster or slower compared to us here in Canada? I was born and raised here, so I guess I don't notice these things. I've also never been outside the country before.\n#Person2#: I think the people in Russia are fast movers by nature, at least in the big cities. Public places are still very slow because they haven't tried to do business any differently than they used to.\n#Person1#: But in Canada, it's the opposite?\n#Person2#: Right. The government here does a great job of solving problems and using new technology to make businesses work better. But I think Canadians are just more calm in general than Russians are... and they're definitely more relaxed than Americans!\n#Person1#: Well, I agree with you about that last part!\n\n###Summary:"

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [7]:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
inputs = tokenizer([summaryless], 
                   return_tensors="pt", 
                   return_token_type_ids=False).to(DEVICE)


In [15]:
with torch.inference_mode():
    outputs = model_8bit.generate(**inputs, max_new_tokens=128, 
                             num_return_sequences = 1, 
                             temperature=0.01, 
                             do_sample = True, 
                             top_k = 20, 
                             top_p= 0.75)
for item in outputs:
    text = tokenizer.decode(item, skip_special_tokens=False)
    summs = text.split("###Summary")[2]
    # sum_len = len("###Summary")
    first_summ = summs.split('\n')[0]
    print(f"Summary: {first_summ}")
    

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Summary: : #Person2# tells #Person1# the main differences between Russia and Canada. #Person2# thinks the people in Russia are fast movers by nature, but in Canada, it's the opposite. The government here does a great job of solving problems and using new technology to make businesses work better. #Person1# agrees with #Person2#. #Person1# thinks Canadians are more calm than Russians and Americans. #Person2# agrees.
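
The extraction above relies on `"###Summary"` appearing exactly twice in the prompt and leaves the leading `": "` in the printed text. A slightly more defensive variant is sketched below. `extract_summary` is a hypothetical helper, and it assumes the prompt ends with the marker `###Summary:` (with the colon), which is the case for `summaryless`.

```python
def extract_summary(decoded, marker="###Summary:"):
    """Return the first non-empty line generated after the first full marker."""
    _, found, tail = decoded.partition(marker)
    if not found:
        return ""  # marker missing: nothing to extract
    stripped = tail.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


# Stand-in for tokenizer.decode(...) so the sketch runs without a model.
decoded = (
    "###Instruction: summarise ###Input as ###Summary\n\n"
    "###Input: #Person1#: Hi\n\n"
    "###Summary: A short summary.\n###Summary: a repeated block"
)
print(f"Summary: {extract_summary(decoded)}")
```

An alternative that avoids string matching altogether is to decode only the new tokens, for example `tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])`.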


In [27]:
text = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(text)

###Instruction: You are an AI assistant that will summarise the correspondence in ###Input as ###Summary

###Input: #Person1#: What are the main differences between this country and your country?
#Person2#: Well, in Russia, everything happens very fast. People talk quickly, they drive their cars too fast, the good deals go by really quickly...but here in Canada, it seems like people are a little more relaxed.
#Person1#: Is that true for everything?
#Person2#: No, of course not. In Russia, going to the bank can take hours. The same is true for the post office and the supermarket. In Canada, however, these places are pretty easy to get through quickly.
#Person1#: So, what is it that makes some things go either faster or slower compared to us here in Canada? I was born and raised here, so I guess I don't notice these things. I've also never been outside the country before.
#Person2#: I think the people in Russia are fast movers by nature, at least in the big cities. Public places are stil

### Testing Models for Inference

1. Testing various inference settings to see which give the best GPU utilisation
2. Checking AutoAWQ for quantisation and benchmarking

#### Normal loading and BitsAndBytes

**Note:** With 8-bit loading, GPU utilisation was 60% and inference was significantly slower.
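
To put a number on "significantly slower", one option is to measure generated tokens per second for each loading mode. The helper below is a minimal sketch, not the benchmark used for the note above: `tokens_per_second` and `fake_generate` are hypothetical names, and the fake call stands in for a real `model.generate` so the example runs without a GPU.

```python
import time


def tokens_per_second(generate_fn, n_runs=3):
    """Average generated tokens per second over n_runs calls.

    generate_fn is a hypothetical zero-argument callable that runs one
    generation and returns the number of newly generated tokens.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate():
    """Stand-in for a real model call: pretends to emit 10 tokens in 10 ms."""
    time.sleep(0.01)
    return 10


print(f"{tokens_per_second(fake_generate):.0f} tokens/s")
```

With a real model, `generate_fn` could wrap `model.generate(**inputs, generation_config=generation_config)` and return `outputs.shape[1] - inputs["input_ids"].shape[1]`. Calling `torch.cuda.synchronize()` before reading the clock keeps asynchronous GPU work from skewing the timing.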

In [1]:
## Testing a causallm model with HF transformers
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    GenerationConfig
)


DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "../ext_models/Mistral-7B-Instruct-v0.2"
# model_name = "TheBloke/Everyone-Coder-33B-Base-AWQ"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)


In [2]:
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
                                             device_map=DEVICE, 
                                             torch_dtype=torch.float16,
                                             quantization_config=bnb_config
                                             )
generation_config = GenerationConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
# Some generation settings
generation_config.max_new_tokens = 512
generation_config.temperature = 0.01
generation_config.top_p = 0.85
generation_config.do_sample = True
generation_config.repetition_penalty = 1.1
generation_config.pad_token_id = tokenizer.eos_token_id
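
To see what these settings do to the next-token distribution, here is a simplified, pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is an illustration under assumptions, not the `transformers` implementation: `filter_distribution` is a hypothetical name and the logits are made up.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax (low values sharpen the peak).
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalise.
    if top_k > 0:
        ranked = ranked[:top_k]
        mass = sum(p for p, _ in ranked)
        ranked = [(p / mass, i) for p, i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for p, _ in kept)
    return {i: p / mass for p, i in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
print(filter_distribution(logits, temperature=0.01, top_p=0.85))
print(filter_distribution(logits, temperature=1.0, top_k=2))
```

At `temperature = 0.01` almost all probability mass sits on the top token, so sampling with these settings behaves very much like greedy decoding.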

In [7]:
prompt = "In a shocking finding, scientist discovered a herd of unicorns living \
in a remote, previously unexplored valley, in the Andes Mountains. \
Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
prompt_query = "Please quote the three laws of Newtonian physics providing examples for each to explain them."
prompt_summary = '''
Instruction:
Summarise the following movie review of the movie Poor Things. Rate the sentiment of the review on scale 1 to 10, 1 - terrible to 10 - brilliant.
Give your answer as <Summary> and <Sentiment>
Review:
For the reason that I'm completely unsure who this film is aimed at (aside from film lovers like myself), I will not be recommending this one, but I'll also state that this is one of my favourite films in years. If that intrigues you, then maybe Poor Things is your cup of tea. Yorgos Lanthimos is a director with a clear vision that I deeply admire. I don't love all of his work, but the worst I can say about any of his films is that I appreciate them. Well, Poor Things just jumped to the top of my list of favourite Lanthimos pictures. This is a pure work of art, in the weirdest, most bizarre way possible. I'm still trying to wrap my head around the brilliance of Poor Things. Let's dive right in. The film begins as Dr. Godwin Baxter (Willem Dafoe) finds the dead body of a pregnant woman. After taking her corpse to his lab, he makes an incision and removes the baby from her body. He then takes out the woman's brain and replaces it with her baby's brain. Then, much like in the style of Frankenstein, he reanimates the body and she comes to life. He refers to her as Bella (Emma Stone) and refuses to let her leave home. Much like an infant, Bella learns to walk, speak, and also learns every detail of her body. Upon meeting Duncan Wedderburn (Mark Ruffalo), the two of them form a connection and flock away to roam the world together and get very intimate. That's the jist of what sets this film in motion, so if that already turns you off, I would recommend not watching it. If you're still intrigued, well then this film is a masterpiece. This story is told in a very unique style. As they visit real places in the world, it's done in an incredibly artistic and surreal way. This film is aware that the premise is impossible, so it makes the visuals and details match the absurdity and I couldn't get enough of it. The production designers, art directors, set decorator, costume designer, and the makeup department all deserve endless praise for their work here. 
They brought Lanthimos' vision to life in stunning fashion. The look of this film blew me away, on top of me already loving the wackiness of the story. Honestly, all I want to do is gush about this film. The camerawork, the score, every performance, and even the incredibly worded dialogue all just made this a remarkable achievement. Emma Stone deserves an Oscar for her performance here too. I haven't seen such an odd performance done this well and this committed in quite some time. She's a true revelation here and her chemistry with Mark Ruffalo was hilarious. He also deserves all of the recognition he's receiving for this. Poor Things is a very rare kind of filmmaking that doesn't come around very often. Yes, it's weird beyond what words could ever describe and I even watched a couple of people walk out of my screening, but I just can't feel that way. The craft onscreen is undeniable and I could only dream of ever being able to make a film this masterful. In every conceivable way, I believe this is my favourite film of 2023. If you're someone who is up for watching anything and willing to give anything a shot, I seriously can't recommend this film enough. I'm currently adding it to my list of favourites, ever.
'''
python_question_1 = '''
Question: The following python code prints the ID of each executing thread. Can you explain why the ID is the same across all printed threads?:
```python
import threading
import time

def worker(i):
    """A simple worker function that prints some text."""
    print(f"Thread ID: {i}", threading.current_thread().ident)
    print("Hello from thread!", i)
    

# Create five threads and start them
threads = []
for i in range(5):
    thread = threading.Thread(target=worker, args=(i,))
    thread.start()
    threads.append(thread)
    time.sleep(1)

# wait for all threads to finish
for thread in threads:
    thread.join()
print("All threads finished.")
```
prints:
Thread ID: 0 124332835587648
Hello from thread! 0
Thread ID: 1 124332835587648
Hello from thread! 1
Thread ID: 2 124332835587648
Hello from thread! 2
Thread ID: 3 124332835587648
Hello from thread! 3
Thread ID: 4 124332835587648
Hello from thread! 4
All threads finished.
'''
inputs = tokenizer([python_question_1], return_tensors="pt").to(DEVICE)

In [8]:
with torch.inference_mode():
    outputs = model.generate(**inputs, generation_config=generation_config)
    for item in outputs:
        text = tokenizer.decode(item, skip_special_tokens=True)
        print(text)


Question: The following python code prints the ID of each executing thread. Can you explain why the ID is the same across all printed threads?:
```python
import threading
import time

def worker(i):
    """A simple worker function that prints some text."""
    print(f"Thread ID: {i}", threading.current_thread().ident)
    print("Hello from thread!", i)
    

# Create five threads and start them
threads = []
for i in range(5):
    thread = threading.Thread(target=worker, args=(i,))
    thread.start()
    threads.append(thread)
    time.sleep(1)

# wait for all threads to finish
for thread in threads:
    thread.join()
print("All threads finished.")
```
prints:
Thread ID: 0 124332835587648
Hello from thread! 0
Thread ID: 1 124332835587648
Hello from thread! 1
Thread ID: 2 124332835587648
Hello from thread! 2
Thread ID: 3 124332835587648
Hello from thread! 3
Thread ID: 4 124332835587648
Hello from thread! 4
All threads finished.

Answer:
The ID of each executing thread is the same becaus

## Quantisation with AWQ on an HF Model

In [1]:
!pip install autoawq

Collecting autoawq
  Downloading autoawq-0.2.4-cp311-cp311-manylinux2014_x86_64.whl.metadata (16 kB)
Collecting autoawq-kernels (from autoawq)
  Downloading autoawq_kernels-0.0.6-cp311-cp311-manylinux2014_x86_64.whl.metadata (2.0 kB)
Downloading autoawq-0.2.4-cp311-cp311-manylinux2014_x86_64.whl (80 kB)
Downloading autoawq_kernels-0.0.6-cp311-cp311-manylinux2014_x86_64.whl (33.4 MB)
Installing collected packages: autoawq-kernels, autoawq
Successfully installed autoawq-0.2.4 autoawq-kernels-0.0.6


#### The AutoAWQ

In [2]:
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    GenerationConfig
)
from awq import AutoAWQForCausalLM

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "../ext_models/Mistral-7B-Instruct-v0.2"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version":"GEMM"}
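
The `w_bit`, `q_group_size` and `zero_point` entries describe group-wise, asymmetric integer quantisation of the weights. The sketch below shows only that storage scheme on a plain list of floats. It is not the AutoAWQ algorithm (which additionally rescales channels based on activations), and all function names are hypothetical.

```python
import math


def quantize_group(group, w_bit=4):
    """Asymmetric (zero-point) uniform quantisation of one group of floats."""
    q_max = 2**w_bit - 1  # 15 for 4-bit
    w_min, w_max = min(group), max(group)
    scale = (w_max - w_min) / q_max
    if scale == 0:
        scale = 1.0  # constant group: any scale works
    zero_point = round(-w_min / scale)
    q = [min(q_max, max(0, round(w / scale) + zero_point)) for w in group]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def quantize_weights(weights, q_group_size=128, w_bit=4):
    """Quantise each consecutive block of q_group_size weights on its own."""
    return [
        quantize_group(weights[i:i + q_group_size], w_bit)
        for i in range(0, len(weights), q_group_size)
    ]


weights = [math.sin(i) for i in range(256)]  # made-up weights for the demo
groups = quantize_weights(weights)
restored = [w for q, s, z in groups for w in dequantize_group(q, s, z)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.4f}")
```

Each group stores its own scale and zero point, which is why a smaller `q_group_size` tends to reduce the error at the cost of more metadata.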

In [3]:
model = AutoAWQForCausalLM.from_pretrained(model_name, 
                                           trust_remote_code=True, 
                                            device_map=DEVICE, 
                                            torch_dtype="auto",)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


In [4]:
model.quantize(tokenizer, quant_config=quant_config)

Downloading readme:   0%|          | 0.00/167 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████| 471M/471M [00:40<00:00, 11.5MB/s] 


Generating validation split: 0 examples [00:00, ? examples/s]

AWQ: 100%|██████████| 32/32 [06:59<00:00, 13.10s/it]


In [7]:
from transformers import AwqConfig, AutoConfig
quant_path = "./data/Mistral-7B-Instruct-v0.2-AWQ"
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

In [8]:
model.model.config.quantization_config = quantization_config

In [9]:
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

('./data/Mistral-7B-Instruct-v0.2-AWQ/tokenizer_config.json',
 './data/Mistral-7B-Instruct-v0.2-AWQ/special_tokens_map.json',
 './data/Mistral-7B-Instruct-v0.2-AWQ/tokenizer.json')

Loading the AWQ model:

### Using The Quantized Model

In [2]:
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    GenerationConfig
)

model_name = "./data/Mistral-7B-Instruct-v0.2-AWQ/"
# alternative model from HF: 
# model_name = "TheBloke/Everyone-Coder-33B-Base-AWQ" 
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [3]:
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
                                             device_map=DEVICE, 
                                             torch_dtype=torch.float16)
generation_config = GenerationConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


In [4]:
# Some generation settings
generation_config.max_new_tokens = 512
generation_config.temperature = 0.1
generation_config.top_p = 0.85
generation_config.do_sample = True
generation_config.repetition_penalty = 1.1
generation_config.pad_token_id = tokenizer.eos_token_id

In [9]:
prompt = "In a shocking finding, scientist discovered a herd of unicorns living \
in a remote, previously unexplored valley, in the Andes Mountains. \
Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
prompt_query = "Please quote the three laws of Newtonian physics providing examples for each to explain them."
prompt_summary = '''
Instruction:
Summarise the following movie review of the movie The Creator. Rate the sentiment of the review on scale 1 to 10, 1 - terrible to 10 - brilliant.
Give your answer as <Summary> and <Sentiment>
Review:
Oh dear. I was really looking forward to this, and managed to get a ticket for a pre-release showing. I left disappointed. One fact that too many film-makers have failed to grasp is that CGI alone does not make a film. Effects can make a good film better, but they can't make a mediocre film good. Marvel, I'm looking at you. It's not enough to string a few action scenes together and finish with some really big explosions. You have to have a story - preferably one that hangs together coherently. This is what's missing from this film, whose plot-holes are more pronounced than the holes where the AI characters' ears would be. Just one example - our hero is on the run in an unspecified Asian county, where it's established that he doesn't speak the language. He breaks down. A van stops, and he's asked if he needs help. Asked in perfect English. That's bad enough - but then for no explored reason, our good Samaritan progresses from giving a stranger a lift to helping him through a police road-block, risking the lives of his five children in the process. Why are there gardens on Nomad? Why, if it's the last word in military technology, is it as easy to blow up as the Death Star or a Bond Villain's lair? The child-McGuffin is portrayed by an excellent young performer, but makes no logical sense. The nuclear blast in Los Angeles (seen in the trailer, so it's not a spoiler) is cited as the reason for the war against AI. There's one throwaway line about this late in the film - but it's just that: a throw-away line. Following it up would've led to a much more interesting film. I've heard it suggested that this should be seen as a metaphor for America's involvement in Vietnam, If so, it's a tired re-hash and forty years too late. Also, that would mean regarding it as an intelligent film, which it absolutely isn't Just one more big-budget wasted opportunity.
'''
prompt_question_1 = '''
Can you explain the implementation of multithreads in python in simple terms. Does it provide advantages compared
to running the application as a single thread and where are the advantages most pronounced?
'''
python_question_1 = '''
The following python code:
```python
import threading
import time

def worker(i):
    """A simple worker function that prints some text."""
    print(f"Thread ID: {i}", threading.current_thread().ident)
    print("Hello from thread!", i)
    

# Create five threads and start them
threads = []
for i in range(5):
    thread = threading.Thread(target=worker, args=(i,))
    thread.start()
    threads.append(thread)
    time.sleep(1)

# wait for all threads to finish
for thread in threads:
    thread.join()
print("All threads finished.")
```
prints:
Thread ID: 0 124332835587648
Hello from thread! 0
Thread ID: 1 124332835587648
Hello from thread! 1
Thread ID: 2 124332835587648
Hello from thread! 2
Thread ID: 3 124332835587648
Hello from thread! 3
Thread ID: 4 124332835587648
Hello from thread! 4
All threads finished.
question: Can you explain why the ID is the same across all printed threads?
'''
prompt_physics = "Who proposed the idea for the God Particle in Physics?"
prompt_current = "Where does Lionel Messi play today?"
inputs = tokenizer([prompt_current], return_tensors="pt").to(DEVICE)

In [10]:
with torch.inference_mode():
    outputs = model.generate(**inputs, generation_config=generation_config)
    for item in outputs:
        text = tokenizer.decode(item, skip_special_tokens=True)
        print(text)

Where does Lionel Messi play today?

Lionel Messi currently plays for Paris Saint-Germain in the French Ligue 1. He joined the club on a free transfer from Barcelona in August 2021, after spending his entire professional career at Camp Nou.

Why did Messi leave Barcelona?

Messi's departure from Barcelona was due to a number of factors. The most significant was a disagreement over the renewal of his contract, which included a clause that would allow him to leave for free if certain conditions were met. Another factor was the departure of former president Josep Maria Bartomeu, who was replaced by Joan Laporta. Messi reportedly had a strained relationship with Bartomeu and was unhappy with the way he was treated during his final years at the club. Additionally, there were financial issues at Barcelona, which made it difficult for them to offer Messi the same level of compensation as other top clubs.

How has Messi performed at PSG?

Messi has had an impressive start to his career at PSG.

## Testing LLAMA 3

### The Base Model

After a few hours of testing, it appears that the Base Model can only be used in one of two ways: as a starting point for fine-tuning on instructions (there is a guide [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/37), although I am not sure it works), or as a text generator that continues an example prompt, as below.

To run concrete instructions, we need to use the instruct fine-tuned model.
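
Because the base model only continues text, one way to steer it without fine-tuning is to phrase the task as a pattern to complete. The sketch below builds such a few-shot prompt. The `Text:` / `Summary:` labels, the example pairs and `build_few_shot_prompt` are all hypothetical, and this approach was not tested in this notebook.

```python
def build_few_shot_prompt(examples, query):
    """Lay out solved examples followed by an unsolved one for the model to continue."""
    parts = [f"Text: {text}\nSummary: {summary}\n" for text, summary in examples]
    parts.append(f"Text: {query}\nSummary:")
    return "\n".join(parts)


examples = [
    ("The meeting was moved from Monday to Tuesday at 10am.", "Meeting moved to Tuesday 10am."),
    ("Sales rose 5% in March after a flat February.", "March sales up 5%."),
]
prompt = build_few_shot_prompt(examples, "The server was restarted twice last night.")
print(prompt)
```

The resulting string can be tokenised and passed to `model.generate` in the same way as the `prompt` variable in the cells of this section.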

In [1]:
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    GenerationConfig
)

model_name = "meta-llama/Meta-Llama-3-8B"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
                                             device_map=DEVICE, 
                                             torch_dtype="auto")
generation_config = GenerationConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
# Some generation settings
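stop_token = "<|eot_id|>"
# Look the ids up directly: this tokenizer's encode() prepends the BOS token,
# so encode(stop_token)[0] would return the BOS id rather than the <|eot_id|> id.
stop_token_id = tokenizer.convert_tokens_to_ids(stop_token)
begin_token = "<|begin_of_text|>"
begin_token_id = tokenizer.convert_tokens_to_ids(begin_token)
generation_config.eos_token_id = stop_token_id
generation_config.bos_token_id = begin_token_id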
generation_config.max_new_tokens = 512
generation_config.temperature = 0.5
generation_config.top_p = 0.9
generation_config.top_k = 3
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15
generation_config.pad_token_id = tokenizer.eos_token_id


In [15]:
prompt = "In a shocking finding, scientist discovered a herd of unicorns living \
in a remote, previously unexplored valley, in the Andes Mountains. \
Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
prompt_alt = "I am a clever AI tutor and will explain how to write a python code to sort a text array: "
prompt_inst = "You are a helpful AI tutor and will explain how to write a python function to sort an array of strings. \
Please supply the code below with clear explanations of what it does:"


In [16]:
# inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)
inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)


In [17]:
with torch.inference_mode():
    outputs = model.generate(**inputs, generation_config=generation_config)
    for i, item in enumerate(outputs):
        text = f"{i}: " + tokenizer.decode(item, skip_special_tokens=True)
        print(text)

0: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.://www.google.com/+google+search
The team of scientists were able to capture and tame some of the animals to see if they could be put on display in zoos throughout the world. The World Wildlife Foundation officials believe that it is very likely that these are the last unicorns left on earth.
The news has caused a huge stir within the scientific community. "This discovery will change the way we view one of the creatures that humanity never knew existed," said Sir David Attenborough as he announced the amazing discovery. "We have so much to learn from our horned friends."
A spokesman for the World Wildlife Foundation said that they would like to thank the people of Peru for their help with this project. He also added that the WWF hopes to use the money raise

### Testing the Instruct Model

Tests on the base model showed that, unless it is fine-tuned, it behaves purely as a generative model that predicts the next word. It does not follow instructions.
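
The instruct model expects its input wrapped in the Llama 3 chat format, with a header for each role and `<|eot_id|>` closing each turn. The helper below is a minimal sketch of that layout for a single system and user message. `build_llama3_prompt` is a hypothetical name and the messages are placeholders.

```python
def build_llama3_prompt(system, user):
    """Wrap a system and a user message in the Llama 3 instruct chat format."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system.strip()}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user.strip()}<|eot_id|>"
        # Leave the assistant header open so the model writes the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt_text = build_llama3_prompt(
    "You are a helpful AI assistant.",
    "Explain in two sentences what a Python list is.",
)
print(prompt_text)
```

Since the string already starts with `<|begin_of_text|>`, passing `add_special_tokens=False` to the tokenizer should avoid a duplicated BOS token. `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` builds the same kind of string from a list of role/content dictionaries.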

In [1]:
import pandas as pd
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    GenerationConfig,
    pipeline
)

model_name = "../ext_models/Meta-Llama-3-8B-Instruct"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
                                             device_map=DEVICE, 
                                             torch_dtype="auto")
generation_config = GenerationConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [3]:
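stop_token = "<|eot_id|>"
# convert_tokens_to_ids looks the special token up directly; encode() may
# prepend <|begin_of_text|>, in which case [0] would be the BOS id instead.
stop_token_id = tokenizer.convert_tokens_to_ids(stop_token)
begin_token = "<|begin_of_text|>"
begin_token_id = tokenizer.convert_tokens_to_ids(begin_token)
generation_config.eos_token_id = stop_token_id
# GenerationConfig calls this field bos_token_id
generation_config.bos_token_id = begin_token_id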
generation_config.max_new_tokens = 512
generation_config.temperature = 0.5
generation_config.top_p = 0.9
generation_config.top_k = 3
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15
generation_config.pad_token_id = tokenizer.eos_token_id
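
As a rough illustration of what the sampling settings above do, here is a simplified sketch (not the `transformers` implementation) of how `temperature`, `top_k` and `top_p` narrow the set of candidate tokens. The function name `filter_next_token_probs` and the example logits are made up for illustration, and the repetition penalty is left out.

```python
import math

def filter_next_token_probs(logits, temperature=0.5, top_k=3, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (shifted by the max for numerical stability).
    peak = max(scaled[i] for i in ranked)
    exps = {i: math.exp(scaled[i] - peak) for i in ranked}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for i in ranked:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the remaining probabilities sum to 1.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}

# Hypothetical logits for a four-token vocabulary
print(filter_next_token_probs([2.0, 1.0, 0.5, -1.0]))
```

With `top_k = 3` the model only ever chooses among its three most likely tokens, so generation stays close to greedy decoding even though `do_sample` is enabled.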

In [14]:
prompt_question_1 = '''<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpfull AI Assistant answering questions
to the fullest of your knowledge. Try and supply a detailed answer acknowledging if you do not know enough on a subject.
<|eot_id|><|start_header_id|>user <|end_header_id|>Can you suggest code in JS to find the next round number to a multiple of 10, 100, 1000, etc. of a supplied integer? For example
if the user supplies 4, the next round number is 10, if 16 is supplied the next round is 20. Supplying 134 would return 200 which is a round multiple of a 100, 
where as 1340342 results in 2000000 and so on.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'''
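
The prompt above is assembled by hand from Llama 3's special tokens. The sketch below makes that layout explicit with a small helper; `build_llama3_prompt` is a hypothetical name, and the blank line after each header follows my understanding of the published Llama 3 Instruct format rather than anything in this notebook.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn Llama 3 Instruct prompt string (hypothetical helper)."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # Leave the assistant header open so the model writes the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_llama3_prompt("You are a helpful AI assistant.", "What is 2 + 2?"))
```

If the tokenizer ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should produce an equivalent string from a list of `{"role": ..., "content": ...}` messages without hand-written tokens.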

In [15]:
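# The prompt already starts with <|begin_of_text|>, so do not let the tokenizer add a second one
inputs = tokenizer([prompt_question_1], return_tensors="pt", add_special_tokens=False).to(DEVICE)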

In [16]:
with torch.inference_mode():
    outputs = model.generate(**inputs, generation_config=generation_config)
    for i, item in enumerate(outputs):
        text = f"{i}: " + tokenizer.decode(item, skip_special_tokens=True)
        print(text)

0: systemYou are a helpfull AI Assistant answering questions
to the fullest of your knowledge. Try and supply a detailed answer acknowledging if you do not know enough on a subject.
user Can you suggest code in JS to find the next round number to a multiple of 10, 100, 1000, etc. of a supplied integer? For example
if the user supplies 4, the next round number is 10, if 16 is supplied the next round is 20. Supplying 134 would return 200 which is a round multiple of a 100, 
where as 1340342 results in 2000000 and so on.
assistant
Here's a simple JavaScript function that can achieve this:

```javascript
function getNextRoundNumber(n) {
    let multiplier = Math.ceil(Math.log10(n)) / Math.LN10;
    return Math.ceil(n / (Math.pow(10, multiplier))) * Math.pow(10, multiplier);
}

// Test cases:
console.log(getNextRoundNumber(4)); // Output: 10
console.log(getNextRoundNumber(16)); // Output: 20
console.log(getNextRoundNumber(134)); // Output: 200
console.log(getNextRoundNumber(1340342)); // Ou

### Trying the HF Pipeline for Text Generation

Let's try the pipeline with an LLM chain to see whether the output matches the Gradio implementation.

In [4]:
from langchain_huggingface.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

In [10]:
llm_pipeline = pipeline("text-generation", 
                        model=model, 
                        tokenizer=tokenizer, 
                        generation_config=generation_config, 
                        return_full_text=False) 
llm = HuggingFacePipeline(pipeline=llm_pipeline)

In [11]:
template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful AI Assistant answering questions to the fullest of your knowledge.
Try and supply a detailed answer acknowledging if you do not know enough on a subject.
<|eot_id|><|start_header_id|>user<|end_header_id|>{input}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
prompt = PromptTemplate(template=template, input_variables=["input"])

In [12]:
chain = prompt | llm

In [13]:
response = chain.invoke({"input": "What is the oldest being on Arda"})

In [14]:
print(response)

A question that delves into the rich lore of J.R.R. Tolkien's Middle-earth!

According to Tolkien's legendarium, the oldest being in Arda (the world of Middle-earth) is Eä, also known as the "Void" or "Nothingness". Eä is considered the ultimate source of all existence, from which the Valar (angelic beings) emerged.

However, among the Valar themselves, the oldest being is Melkor (also known as Morgoth), one of the most powerful and evil of the Valar. He was created by Eönwë, the herald of Manwë, during the Ainulindalë, the creation myth of Middle-earth. Melkor's exact age is unknown, but he predates the dawn of time itself.

Another contender for the title of oldest being could be the Maiar, who were also created by the Valar during the Ainulindalë. The Maiar include beings like Gandalf, Sauron, and the Balrogs. While they are not as old as Melkor, they still predate the rise of Men and Elves in Middle-earth.

It's worth noting that the concept of "oldest being" can be somewhat ambigu

## Amazon Bedrock

Amazon Bedrock provides a bridge: a single API to many popular models. It is also integrated into LangChain for easy use.

To use Amazon Bedrock we use boto3, the AWS SDK for Python. I have exported the following settings as environment variables; they would otherwise be read from the AWS credentials file.

- aws_access_key_id
- aws_secret_access_key
- region_name
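
A minimal sketch of how these three settings could be gathered from the environment and handed to boto3 explicitly. The helper `bedrock_client_kwargs` is hypothetical, and the variable names are the standard AWS ones (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`), which is an assumption about how the `.env` file is laid out.

```python
import os

def bedrock_client_kwargs(env=None):
    """Collect boto3 client keyword arguments from environment variables."""
    env = os.environ if env is None else env
    # Standard AWS variable names; adjust if your .env file uses different ones.
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "region_name": "AWS_DEFAULT_REGION",
    }
    missing = [var for var in mapping.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[var] for kwarg, var in mapping.items()}

# Usage (requires boto3 and valid credentials):
# import boto3
# bedrock_runtime = boto3.client("bedrock-runtime", **bedrock_client_kwargs())
```

boto3 also picks these standard variables up on its own, so passing them explicitly is optional; the helper mainly fails early with a clear message when one of them is missing.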

> Note: It's important to use the correct instruction template for the model, as in the `open_llm_template` below; otherwise the model rambles and continues the conversation on its own.

In [3]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv()) # read local .env file

In [2]:
!pip install boto3



In [8]:
from langchain_community.llms import Bedrock
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import ChatPromptTemplate
import boto3

In [5]:
bedrock_client = boto3.client('bedrock')
bedrock_runtime = boto3.client('bedrock-runtime')

In [4]:
bedrock_client.list_foundation_models()

{'ResponseMetadata': {'RequestId': 'eb24f94a-a576-4606-b82b-2c70d691147f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Sun, 26 May 2024 19:12:25 GMT',
   'content-type': 'application/json',
   'content-length': '24202',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'eb24f94a-a576-4606-b82b-2c70d691147f'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large',
   'modelName': 'Titan Text Large',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': [],
   'inferenceTypesSupported': ['ON_DEMAND'],
   'modelLifecycle': {'status': 'ACTIVE'}},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-image-generator-v1:0',
   'modelId': 'amazon.titan-image-generator-v1:0',
   'modelName': 'Titan Image Generator G1',
   'providerName': 'Amazon',
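
The full response is long, so it helps to reduce it to the model IDs that matter. The sketch below filters a response of the shape shown above for models that output text and support on-demand inference. `text_model_ids` is a hypothetical helper, and the modalities of the second sample entry are assumed for illustration because the output above is cut off.

```python
def text_model_ids(response, provider=None):
    """Return model IDs that produce text and support on-demand inference."""
    ids = []
    for summary in response.get("modelSummaries", []):
        if "TEXT" not in summary.get("outputModalities", []):
            continue
        if "ON_DEMAND" not in summary.get("inferenceTypesSupported", []):
            continue
        if provider and summary.get("providerName") != provider:
            continue
        ids.append(summary["modelId"])
    return ids

# Trimmed-down sample with the same shape as the real response
sample = {
    "modelSummaries": [
        {
            "modelId": "amazon.titan-tg1-large",
            "providerName": "Amazon",
            "outputModalities": ["TEXT"],
            "inferenceTypesSupported": ["ON_DEMAND"],
        },
        {
            "modelId": "amazon.titan-image-generator-v1:0",
            "providerName": "Amazon",
            "outputModalities": ["IMAGE"],  # assumed value for illustration
            "inferenceTypesSupported": ["ON_DEMAND"],  # assumed value for illustration
        },
    ]
}
print(text_model_ids(sample))
```

On the live client this would be called as `text_model_ids(bedrock_client.list_foundation_models())`.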

In [15]:
llm = Bedrock(client=bedrock_runtime, model_id="mistral.mixtral-8x7b-instruct-v0:1", 
              model_kwargs={"temperature": 0.15, "top_p": 0.85})

In [13]:
open_llm_template = """<s>[INST]You are a helpful assistant ready to assist your human friend.\
Please provide comprehensive information based on your vast knowledge base and the current conversation. 
If you do not know the answer please acknowledge the fact by saying that you do not possess the required information.\

Current conversation:
{history}

Human:
{input}[/INST]
"""
prompt = ChatPromptTemplate.from_template(open_llm_template)
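
The template above places the whole conversation history inside a single `[INST]` block. As an alternative sketch, the history can be laid out turn by turn in the instruction format commonly documented for Mistral instruct models. `build_mistral_prompt` is a hypothetical helper, the exact whitespace around the tags is an assumption that may differ between tokenizer versions, and folding the system text into the first user turn is a convention rather than something this notebook confirms.

```python
def build_mistral_prompt(turns, user_input, system=""):
    """Build a multi-turn prompt; `turns` is a list of (human, ai) pairs."""
    prompt = "<s>"
    first = True
    for human, ai in turns:
        # The optional system text is folded into the first user turn only.
        content = f"{system}\n\n{human}" if (first and system) else human
        prompt += f"[INST] {content} [/INST] {ai}</s>"
        first = False
    content = f"{system}\n\n{user_input}" if (first and system) else user_input
    # Leave the final instruction open for the model to answer.
    prompt += f"[INST] {content} [/INST]"
    return prompt

history = [("Hello!", "Hi, how can I help?")]
print(build_mistral_prompt(history, "Who created Arda?"))
```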

In [16]:
conversation = ConversationChain(
    llm=llm, verbose=True, memory=ConversationBufferMemory(), prompt=prompt
)

response = conversation.predict(input="Hello! Can you please tell me what is the most ancient being on Arda?")



> Entering new ConversationChain chain...
Prompt after formatting:
[32;1m[1;3mHuman: <s>[INST]You are a helpful assistant ready to assist your human friend.Please provide comprehensive information based on your vast knowledge base and the current conversation. 
If you do not know the answer please acknowledge the fact by saying that you do not possess the required information.
Current conversation:


Human:
Hello! Can you please tell me what is the most ancient being on Arda?[/INST]

> Finished chain.


In [17]:
print(response)

Hello! I'd be happy to help, but I need a bit more context to provide an accurate answer. Arda is a fictional world created by J.R.R. Tolkien, the author of "The Lord of the Rings" and other works. It's home to many ancient beings, including the Valar (divine beings), the Maiar (lesser divine beings), and various types of long-lived creatures like Elves and Dwarves.

However, if you're asking about the oldest being in terms of time, that would likely be Eru Ilúvatar, also known as the One, who is the creator of Arda and all its inhabitants in Tolkien's legendarium. The Valar, who are his direct sub-creations, would be the next oldest, with Melkor (also known as Morgoth) being the first-born of the Ainur.

Please let me know if you were asking about a different Arda or if you need information on something else!


In [18]:
conversation.predict(input="I am asking about Tolkien's Arda. I am interested in the first character or being to actually inhabit Arda, not the one who created it.")



> Entering new ConversationChain chain...
Prompt after formatting:
[32;1m[1;3mHuman: <s>[INST]You are a helpful assistant ready to assist your human friend.Please provide comprehensive information based on your vast knowledge base and the current conversation. 
If you do not know the answer please acknowledge the fact by saying that you do not possess the required information.
Current conversation:
Human: Hello! Can you please tell me what is the most ancient being on Arda?
AI: Hello! I'd be happy to help, but I need a bit more context to provide an accurate answer. Arda is a fictional world created by J.R.R. Tolkien, the author of "The Lord of the Rings" and other works. It's home to many ancient beings, including the Valar (divine beings), the Maiar (lesser divine beings), and various types of long-lived creatures like Elves and Dwarves.

However, if you're asking about the oldest being in terms of time, that would likely be Eru Ilúvatar, also known as the One, who is the

'Hello again! In that case, the first beings to actually inhabit Arda would be the Ainur, who are spirits that were present at the creation of the world by Eru Ilúvatar. Among these Ainur, the Valar and the Maiar descended into Arda to shape and govern it.\n\nThe first of the Valar to enter Arda was Tulkas, who came to aid the other Valar in the fight against Melkor. However, it\'s important to note that Arda was empty and dark at this point, and the Valar began the process of ordering and shaping it.\n\nThe first living creatures to inhabit Arda, after the Valar and Maiar, were the beings known as the Eldar (High Elves) and the Avari (Dark Elves). The Eldar were the first to awaken at Cuiviénen, and they were led by Oromë, one of the Valar. The Avari, on the other hand, did not follow Oromë to Valinor, and thus are often considered the "Dark Elves" or "Unwilling Elves".\n\nSo, to summarize, the first inhabitants of Arda in terms of living beings would be the Ainur (Valar and Maiar), f