<a href="https://colab.research.google.com/github/vtecftwy/utseus-dives/blob/main/nbs/dive_2_1_llms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Diving into LLMs</h1>

In [1]:
try:
    from nb_guided_dive.llm import *
except ModuleNotFoundError:
    print('installing nb_guided_dive')
    !pip install -Uqq git+https://github.com/vtecftwy/nb-guided-dive.git
    print('nb_guided_dive installed')
    from nb_guided_dive.llm import *

try:
    from transformers import pipeline
except ModuleNotFoundError as e:
    print('Must install transformers and accelerate before running')
    !pip -qq install transformers accelerate
    print(('transformers and accelerate installed'))

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

import random
import torch
from pprint import pprint
from IPython.display import Markdown, display

In [2]:
def print_bert_results(res):
    for i, item in enumerate(res):
        masked = item['token_str']
        mask_s = item['sequence'].find(masked)
        mask_e = mask_s + len(masked)
        str_md = f"- " + item['sequence'][:mask_s] + '... **' + item['sequence'][mask_s:mask_e] + '** ...' + item['sequence'][mask_e:] + f"(*Prob: {item['score']:.3f}*)"
        display(Markdown(str_md))

def print_gpt2_results(res):
    display(Markdown(res[0]['generated_text']))

def md(text):
    display(Markdown(text))


We will explore several open source LLMs of various capabilities:

In [3]:
bert_checkpoint = 'google-bert/bert-base-uncased'
gpt2_checkpoint = 'openai-community/gpt2'
qwen_checkpoint = 'Qwen/Qwen2.5-1.5B-Instruct'
deepseek_checkpoint = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'

# Masked Models (Encoder Only): BERT

In [4]:
bert_ppl = pipeline('fill-mask', model=bert_checkpoint);

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From ðŸ‘‰v4.50ðŸ‘ˆ onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architectur

In [5]:
children_names(bert_ppl.model)

We have 2 top-level modules:
 - bert
 - cls


BERT is trained to predict which token was masked in a sentence.

The masked token to predict is represented by the special token `[MASK]` (id: 103).

In [6]:
text = "Hi Andrej Karpathy, are you [MASK] by the recent evolution of large language models."
res = bert_ppl([text])
print_bert_results(res)

- hi andrej karpathy, are you ... **surprised** ... by the recent evolution of large language models.(*Prob: 0.232*)

- hi andrej karpathy, are you ... **intrigued** ... by the recent evolution of large language models.(*Prob: 0.071*)

- hi andrej karpathy, are you ... **bothered** ... by the recent evolution of large language models.(*Prob: 0.062*)

- hi andrej karpathy, are you ... **impressed** ... by the recent evolution of large language models.(*Prob: 0.050*)

- hi andrej karpathy, are you ... **puzzled** ... by the recent evolution of large language models.(*Prob: 0.041*)

# Explore BERT

<img src="https://www.charlier-tang.com/guided-dive/resources/img/LLM-Inside.jpg"/>

## The Tokenizer

In [7]:
bert_tok = bert_ppl.tokenizer

When we apply the tokenizer to our text, it splits it into tokens and convert each token into an integer

In [8]:
token_ids = bert_tok(text)
pos_mask = token_ids['input_ids'].index(103)
token_ids['input_ids']

[101,
 7632,
 7213,
 3501,
 10556,
 14536,
 17308,
 1010,
 2024,
 2017,
 103,
 2011,
 1996,
 3522,
 6622,
 1997,
 2312,
 2653,
 4275,
 1012,
 102]

We can see the split in text format as well

In [9]:
tokens_txt = bert_tok.tokenize(text)
tokens_txt

['hi',
 'andre',
 '##j',
 'ka',
 '##rp',
 '##athy',
 ',',
 'are',
 'you',
 '[MASK]',
 'by',
 'the',
 'recent',
 'evolution',
 'of',
 'large',
 'language',
 'models',
 '.']

Each model has its own tokenizer, defining a list of tokens (vocab) and the mapping between text form and numeric form

In [10]:
md(f"BERT has a vocabulary ot **{bert_tok.vocab_size:,d}** tokens.")

BERT has a vocabulary ot **30,522** tokens.

We can randomly pick a sample of this vocabulary (text token and numerical token)

In [11]:
items = list(bert_tok.vocab.items())
selection = sorted(random.sample(items, 20))
pprint({id:token for id, token in selection})

{'##bruck': 28985,
 '##ey': 3240,
 '##jun': 19792,
 '##ntly': 20630,
 '[unused541]': 546,
 '[unused588]': 593,
 'avid': 18568,
 'botanist': 17098,
 'chopper': 28057,
 'kept': 2921,
 'manually': 21118,
 'olympus': 26742,
 'physician': 7522,
 'registrar': 24580,
 'section': 2930,
 'sierra': 7838,
 'spilled': 13439,
 'subfamily': 10946,
 'translations': 11913,
 'wally': 18202}


## The Embeddings

After the tokenizer, each token is converted into a vector (embedding) capturing the semantic meaning of the token.

In [12]:
bert_model, bert_cls = get_children(bert_ppl.model)
embs, encoder = get_children(bert_model)

Steps to get to the embedding:
- text to token_ids with the tokenizer
- token_ids to embeddings (one vector per token) with the embedding layer


In [13]:
md(f"'*{text}*'")
encoded_input = bert_tok(text, return_tensors='pt')['input_ids'].to(bert_ppl.model.device)
print(encoded_input)
md(f"Text encoded into **{encoded_input.shape[1]} tokens**")


'*Hi Andrej Karpathy, are you [MASK] by the recent evolution of large language models.*'

tensor([[  101,  7632,  7213,  3501, 10556, 14536, 17308,  1010,  2024,  2017,
           103,  2011,  1996,  3522,  6622,  1997,  2312,  2653,  4275,  1012,
           102]], device='cuda:0')


Text encoded into **21 tokens**

In [14]:
embeddings = embs(encoded_input)
md(f"Embedding layer outputs **{embeddings.shape[1]:,d} embedding vectors**. One per token")
md(f"Each embedding vector has **{embeddings.shape[2]:,d}** values. The embedding space has {embeddings.shape[1]} dimensions.")

Embedding layer outputs **21 embedding vectors**. One per token

Each embedding vector has **768** values. The embedding space has 21 dimensions.

In [15]:
embeddings

tensor([[[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
         [ 0.2743, -0.8239,  0.8169,  ..., -0.3081, -0.8084, -0.6725],
         [ 0.2850,  0.1554, -0.1726,  ...,  0.8624,  0.9615,  0.5178],
         ...,
         [ 0.0755,  0.9946, -0.3133,  ..., -0.3525, -0.4831,  0.3692],
         [-0.1950,  0.3005, -0.2888,  ...,  0.5752,  0.4682,  0.5196],
         [-0.0907,  0.1288, -0.2127,  ..., -0.2871,  0.1035, -0.1262]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>)

In [16]:
embeddings.shape

torch.Size([1, 21, 768])

## The Classifier

Remember, LLM are essentially token classifiers:

<img src="https://www.charlier-tang.com/guided-dive/resources/img/llm-as-token-classifier.jpg"/>

In [17]:
children_names(bert_ppl.model)
print()
children_names(bert_model)

We have 2 top-level modules:
 - bert
 - cls

We have 2 top-level modules:
 - embeddings
 - encoder


In [18]:
bert_cls

BertOnlyMLMHead(
  (predictions): BertLMPredictionHead(
    (transform): BertPredictionHeadTransform(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (transform_act_fn): GELUActivation()
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    )
    (decoder): Linear(in_features=768, out_features=30522, bias=True)
  )
)

The last layer of BERT takes the final embedding vector (768 values) and outputs "probabilities" of being the masked token for each of the tokens in the vocab (30,522)

In [19]:
pred_probs = bert_ppl.model(encoded_input).logits[0, pos_mask, :]
pred_probs.shape

torch.Size([30522])

In [20]:
md(text)
v,idx = torch.topk(pred_probs, k=20, dim=0)
for pred in bert_tok.decode(idx).split(' '):
    md(f"- {pred}")

Hi Andrej Karpathy, are you [MASK] by the recent evolution of large language models.

- surprised

- intrigued

- bothered

- impressed

- puzzled

- confused

- amused

- concerned

- shocked

- disappointed

- disturbed

- influenced

- fascinated

- interested

- excited

- inspired

- amazed

- startled

- attracted

- offended

In [21]:
v,idx = torch.topk(pred_probs, k=40, dim=0, largest=False)
for pred in bert_tok.decode(idx).split(' '):
    md(f"- {pred}")

- ##encyignort

- shortestncylchpregramility

- luxrialishlyrchllisildentlypictryittously

- longest

- mutuallyrgh

- sac

- titular

- federallylgearmllarticknch

- pillow

- facto

- hashndenhom

- bark

- lifetimerumlessly

# Causal - Autoregressive Model (Decoder only): GPT2

In [22]:
gpt2_ppl = pipeline('text-generation', model=gpt2_checkpoint);

Device set to use cuda:0


In [23]:
children_names(gpt2_ppl.model)

We have 2 top-level modules:
 - transformer
 - lm_head


In [24]:
text = 'Hi Andrej Karpathy, what do you think of the recent evolution of large language models? I am absolutely '

res = gpt2_ppl(text, max_new_tokens=30)
print_gpt2_results(res)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hi Andrej Karpathy, what do you think of the recent evolution of large language models? I am absolutely icty-dazed because I was not expecting to see so many new research coming out about languages from different parts of the world. So I thought

# Cleanup Memory

In [25]:
bert_model.cpu()
gpt2_ppl.model.cpu()
clean_up(bert_model, bert_tok, gpt2_ppl, encoded_input)
torch.cuda.empty_cache()

# Fine Tuned LLM: DeepSeek R1

In [26]:
deepseek_ppl = pipeline("text-generation", model=deepseek_checkpoint)

Device set to use cuda:0


## Create a conversation prompt

Fine tuned model expect structured prompts indicating a "role" and the "content".

In [27]:
prompt = "Explain quantum entanglement as it you were a teacher"
messages = [{"role": "user", "content": prompt}]

## Generate response

In [28]:
response = deepseek_ppl(messages, max_new_tokens=800)
md(f"{len(response)} response of {len(response[0]['generated_text'][1]['content'])} characters")

(1, 3919)

In [29]:
md(response[0]['generated_text'][1]['content'])

Okay, so I'm trying to understand quantum entanglement. I remember hearing the term in a science class, but I'm not really clear on what it means. Let me think about this step by step.

First, I know that entanglement is something related to particles, probably light particles like photons. But wait, isn't that true for all particles, or is it a special kind? I think it's a phenomenon where particles become interconnected, so their properties are linked, even when they're far apart. But how exactly does that work?

I remember something about particles having states. Like, maybe each particle can be in multiple states at the same time. For example, an electron could be in an up spin or down spin state. But if two particles are entangled, their spins are linked. So if one is up, the other must be down, and vice versa. That makes sense, but how does that actually happen?

I think it's about the quantum state of the system. When two particles are entangled, their quantum states are correlated. So if I measure one, the other instantly gives a result based on that measurement. That seems spooky because it's instantaneous, not just due to distance. But how is this possible? Is it because of something like a wave function that's not just a simple probability distribution?

I'm also thinking about the EPR paradox, which was about whether quantum entanglement was real. Einstein, Podolsky, and Rosen thought it was, but later experiments suggested it's more than that. Maybe it's about how the universe doesn't know which state the particles are in until you measure them. So the act of measurement affects the other particle, even if they're far apart.

There's this Bell's theorem thing too. I think it's about whether local hidden variables (which are like classical theories) could explain entanglement. But experiments show that no local hidden variables can explain the results, so entanglement must be a real quantum effect. That's pretty mind-blowing.

But wait, how does this apply in real life? I know that quantum computing uses entangled particles to process information faster. So if we can entangle particles, we can do things like teleporting one particle's state to another, which would be useful for secure communication. But I'm not sure how practical that is yet. Also, the faster-than-light communication is still a big mystery. We don't know if it's possible or how it would work.

I'm also confused about the difference between quantum entanglement and classical correlations. I think entanglement is stronger than classical correlations because it's due to the quantum state being entangled, not just classical probabilities. So it's not just about correlations between two particles but something more fundamental about the system.

Another thing I'm trying to grasp is the concept of "spooky action at a distance." Einstein associated this with the non-local nature of quantum mechanics, implying that particles could influence each other regardless of distance. But I'm not entirely sure if that's the same as entanglement or just a different term for the same phenomenon.

I'm also wondering about the mathematical description of entanglement. The wave function of the system must be a superposition of states. When you measure one particle, the wave function collapses, and the other particle's state becomes determined. But how does this collapse work exactly? Is it deterministic, or is it probabilistic?

I think it's deterministic in the sense that the wave function is a fixed state, and the measurement outcome is determined by the quantum state. But since we don't know the exact state until we measure, it's like we're just learning about it. So it's not like a hidden variable theory where you can predict with certainty, but rather a probabilistic reality where the outcome is determined by the quantum state, but we don't know what it is until we observe it.

I'm also

In [30]:
prompt = "Explain quantum entanglement like if you were a teacher. Provide the answer in markdown format"
messages = [{"role": "user", "content": prompt}]
response = deepseek_ppl(messages, max_new_tokens=800)
md(f"{len(response)} response of {len(response[0]['generated_text'][1]['content'])} characters")

1 response of 3926 characters

In [None]:
md(response[0]['generated_text'][1]['content'])

Okay, so I need to explain quantum entanglement like a teacher. Hmm, where do I start? I remember learning about entanglement in my science class, but I'm a bit fuzzy on the details. Let me try to piece this together.

First, I think entanglement has something to do with particles being connected in a way that the state of one instantly influences the state of another, no matter the distance between them. But how exactly does that work? I remember something about particles being in multiple states at once, like superposition. So maybe when you measure one, it affects the other.

Wait, so particles like electrons can be in a superposition of states, right? That means they can be in two places at once. But how does that lead to them being connected? I think it's because the act of measuring one affects the other, but I'm not entirely sure how that happens.

I remember something about the EPR paradox, which involves Einstein, Podolsky, and Rosen. They proposed that entanglement is a real phenomenon that Einstein didn't accept because he thought it contradicts relativity. But now, experiments have shown that entanglement does happen, so maybe it's more about the statistical correlation rather than violating relativity.

So, quantum entanglement is when two particles share a common state. When you measure one, the other's state collapses instantly, even if they're far apart. But how does this work with probabilities? I think it's about the correlation between the probabilities of the outcomes. For example, if you measure one particle and get a certain result, the other has a high probability of matching that result.

Wait, but how does this work probabilistically? If I measure one particle and get heads, does the other always get heads too? Or is there a chance it could get tails? I think it's the latter. So, there's a probability distribution between them, and that's what's being referred to as entanglement.

I'm also a bit confused about the difference between quantum entanglement and classical correlation. I think quantum entanglement is stronger because it's based on the principles of quantum mechanics, whereas classical correlation is more about classical physics, like the correlation in everyday phenomena.

Another thing I'm not sure about is the role of the observer. Does the act of measuring affect the state of the particle? I think yes. So, the observer collapses the wavefunction, and that affects the other particle. But I'm not entirely clear on how that leads to the connection.

I also recall that entanglement is a key part of quantum computing and quantum communication. So, it's not just a theoretical concept but has practical applications. But why is it important? Maybe it's because it allows for faster and more secure communication, which is useful in fields like cryptography.

Wait, but I'm still not clear on the mathematical representation. How do we describe entangled states? I think it's using vectors and matrices, something like the Bell states. Each Bell state represents a possible entangled state between two qubits. But I'm not exactly sure how these states are constructed or what they signify.

Also, I'm trying to remember if there are any real-world applications beyond quantum computing. I think it's used in quantum cryptography for secure communication, like quantum key distribution, which relies on the principles of entanglement. But I'm not sure how that works exactly.

I'm getting a bit overwhelmed with all these concepts. Maybe I should break it down into simpler parts. Start with the basics of quantum mechanics, then move on to entanglement, and then discuss its implications and applications.

So, to summarize my thoughts: Entanglement is when quantum particles share a common state, and the measurement of one affects the other instantaneously. This leads to correlations in their states, which can be stronger than classical correlations. It's a phenomenon supported by experiments and has applications

In [32]:
deepseek_ppl.model.cpu()
clean_up(deepseek_ppl)
torch.cuda.empty_cache()

# Fine Tuned LLS: Alibaba Qwen

In [4]:
qwen_ppl = pipeline("text-generation",model=qwen_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0


In [5]:
# Create conversation prompt
messages = [{"role": "user", "content": "Explain quantum entanglement to me as if you were a teacher. Format reply in markdown"}]

# Generate response
response = qwen_ppl(messages, max_new_tokens=800)
print(response[0]['generated_text'])


[{'role': 'user', 'content': 'Explain quantum entanglement to me as if you were a teacher. Format reply in markdown'}, {'role': 'assistant', 'content': '# Quantum Entanglement Explained\n\n## Introduction\nQuantum entanglement is a phenomenon that occurs when pairs or groups of particles interact so strongly that the state of one particle cannot be described independently of the others. This means that the state of each particle depends on the states of all other particles in the system.\n\n## Basic Concepts\n1. **Particles**: In quantum mechanics, every particle has an associated wave function, which describes its possible positions and momenta.\n2. **Entangled Particles**: When two or more particles become entangled, their combined wave functions describe not just the individual states but also the correlations between them.\n\n## The Principle of Non-Locality\nThe most famous example of quantum entanglement is Einstein\'s "spooky action at a distance." According to special relativit

In [6]:
Markdown(response[0]['generated_text'][1]['content'])

# Quantum Entanglement Explained

## Introduction
Quantum entanglement is a phenomenon that occurs when pairs or groups of particles interact so strongly that the state of one particle cannot be described independently of the others. This means that the state of each particle depends on the states of all other particles in the system.

## Basic Concepts
1. **Particles**: In quantum mechanics, every particle has an associated wave function, which describes its possible positions and momenta.
2. **Entangled Particles**: When two or more particles become entangled, their combined wave functions describe not just the individual states but also the correlations between them.

## The Principle of Non-Locality
The most famous example of quantum entanglement is Einstein's "spooky action at a distance." According to special relativity, information can't travel faster than light; thus, it seems paradoxical for particles connected through entanglement to instantaneously influence each other over large distances.

3. **Bell's Theorem**: John Stewart Bell formulated a theorem that showed how measurements on entangled particles must always give correlated results regardless of the distance separating them, even when measured simultaneously. This directly contradicts classical physics' view of causality and locality.

4. **EPR Paradox (Einstein-Podolsky-Rosen)**: A thought experiment involving entangled particles where measurements on one particle seem to instantly affect the outcome of another, no matter how far apart they are.

5. **No-Signaling Condition**: Another way to express this non-locality is through a condition known as the no-signaling principle, which requires that the measurement outcomes of entangled particles should not depend on whether signals have been sent between them before making a measurement.

6. **Uncertainty Principle**: Unlike classical systems, entangled particles exhibit phenomena like superposition and interference, allowing us to perform tasks impossible with classical bits. This is because these properties donâ€™t commute, meaning that measuring one part affects the probability amplitudes of the other parts.

7. **Quantum Teleportation**: An early demonstration of entanglement using photons was used to transmit quantum information from one location to another without physically sending any particles themselves.

8. **Quantum Information Processing**: Today, entanglement plays a crucial role in many advanced technologies including quantum computing, cryptography, and teleportation, providing unprecedented computational power beyond what is achievable with classical computers.

9. **Quantum Cryptography**: Uses entangled particles to create unbreakable codes based on the principles of quantum mechanics, ensuring secure communication channels.

10. **Quantum Key Distribution (QKD)**: A method of transmitting cryptographic keys securely using quantum mechanics, enabling encrypted communications protected against eavesdropping.

## Conclusion
Quantum entanglement represents a fundamental aspect of quantum mechanics that challenges our understanding of space-time and causality. It opens up new possibilities for technology and scientific research, pushing the boundaries of what we know about the universe at the smallest scales.