# Models

This notebook looks at the lower-level API of Transformers: the model classes that wrap the PyTorch code for the transformers themselves.

This notebook can run on a low-cost or free T4 runtime.


In [19]:
!pip install -q --upgrade bitsandbytes accelerate

In [20]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc

In [21]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

### Accessing Llama



You should already be approved for Llama 3.1:

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

You can use that model today, but the smaller Llama 3.2 model runs faster if you get approval for it too:

https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

Open this link to see whether you need to request approval. Then pick the version of Llama you want in the next cell by uncommenting one of the `LLAMA` lines. Or skip Llama altogether.

In [22]:
# Instruct models, plus one reasoning model (DeepSeek)

# Llama 3.1 is larger and you should already be approved
# see here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

# LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Llama 3.2 is smaller but you might need to request access again
# see here: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

# LLAMA = "meta-llama/Llama-3.2-1B-Instruct"

PHI = "microsoft/Phi-4-mini-instruct"
GEMMA = "google/gemma-3-270m-it"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

In [23]:
messages = [
    {"role": "user", "content": "Tell a joke for a room of Data Scientists"}
  ]

In [24]:
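# Quantization config: load the weights in 4-bit precision so the model needs far less GPU memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the linear-layer weights in 4 bits
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matrix multiplications in bfloat16
    bnb_4bit_quant_type="nf4"               # 4-bit NormalFloat data type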
)
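
To build some intuition for what 4-bit loading does, here is a minimal pure-Python sketch of block-wise absmax quantization. It is a simplification under stated assumptions: the real `nf4` format uses a non-uniform set of 16 levels and bitsandbytes does the work in optimized GPU kernels, whereas this toy uses evenly spaced levels. The function names and sample weights are hypothetical.

```python
# Toy illustration only: uniform 4-bit absmax quantization of one block of weights.
# Assumption: evenly spaced levels; the real nf4 format uses non-uniform levels.

def quantize_4bit(block):
    """Map each float weight to an integer code in [-7, 7] (fits in 4 bits) plus one scale."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(w / scale * 7) for w in block]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Recover approximate float weights from the codes and the per-block scale."""
    return [c / 7 * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.02, -0.07, 0.41, -0.29, 0.0]  # hypothetical sample block
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"largest rounding error: {max_error:.4f}")
```

Each weight now needs 4 bits instead of 16 or 32, at the cost of a small rounding error. `bnb_4bit_use_double_quant` goes a step further by also quantizing the per-block scales.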

In [25]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(PHI)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

In [26]:
inputs

tensor([[200021,  60751,    261,  41751,    395,    261,   3435,    328,   4833,
         103481, 200020, 199999]], device='cuda:0')

In [27]:
# The model

model = AutoModelForCausalLM.from_pretrained(PHI, device_map="auto", quantization_config=quant_config)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [28]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 2,840.2 MB


## Looking under the hood at the Transformer model

The next cell prints the HuggingFace `model` object for Phi.

This model object is a Neural Network, implemented with the Python framework PyTorch. The Neural Network uses the architecture invented by Google scientists in 2017: the Transformer architecture.

While we're not going to go deep into the theory, this is an opportunity to get some intuition for what the Transformer actually is.


Now take a look at the layers of the Neural Network that get printed in the next cell. Look out for this:

- It consists of layers
- There's something called "embedding" - this takes tokens and turns them into 3,072 dimensional vectors for this model (4,096 for Llama 3.1 8B). We'll learn more about this in Week 5.
- There are then 32 groups of layers called "Decoder layers" (Llama 3.1 8B also has 32; Llama 3.2 1B has 16). Each Decoder layer contains three types of layer: (a) self-attention layers (b) multi-layer perceptron (MLP) layers (c) normalization layers (RMSNorm).
- There is an LM Head layer at the end; this produces the output
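
The normalization layers show up in the printout in the next cell as `Phi3RMSNorm`. As a rough sketch of what such a layer computes, here is root-mean-square normalization in plain Python. The function name and sample values are illustrative; the real module works on PyTorch tensors and its weight vector is learned during training.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x so its root-mean-square is about 1, then apply a per-dimension weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

hidden = [3.0, -4.0, 0.0, 5.0]   # hypothetical 4-dimensional hidden state
weight = [1.0] * len(hidden)      # learned in the real model; all ones here
normed = rms_norm(hidden, weight)
rms_after = math.sqrt(sum(v * v for v in normed) / len(normed))
print(f"RMS after normalization: {rms_after:.3f}")
```

Unlike batch norm, this uses only the values within a single token's vector, so it behaves the same for any batch size.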

Notice the mention that the model has been quantized to 4 bits.

It's not required to go any deeper into the theory at this point, but if you'd like to, I've asked our mutual friend to take this printout and make a tutorial to walk through each layer. This also looks at the dimensions at each point. If you're interested, work through this tutorial after running the next cell:

https://chatgpt.com/canvas/shared/680cbea6de688191a20f350a2293c76b

In [29]:
# Execute this cell and look at what gets printed; investigate the layers

model

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(200064, 3072, padding_idx=199999)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear4bit(in_features=3072, out_features=5120, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear4bit(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLUActivation()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_feature
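
The shapes in this printout are enough to reproduce the memory footprint reported earlier. The sketch below is back-of-the-envelope arithmetic under assumptions that the printout does not confirm: each `Linear4bit` weight costs half a byte per parameter, the unquantized embedding and norm weights are stored in 16 bits, and the LM head shares its weights with the embedding so it is not counted separately.

```python
# Shapes copied from the Phi-4-mini printout above
VOCAB, HIDDEN, LAYERS = 200064, 3072, 32

# Linear4bit layers inside one decoder layer: (in_features, out_features)
linear_shapes = [(3072, 3072), (3072, 5120), (3072, 16384), (8192, 3072)]
linear_params = LAYERS * sum(i * o for i, o in linear_shapes)

embedding_params = VOCAB * HIDDEN
norm_params = (2 * LAYERS + 1) * HIDDEN  # two RMSNorms per layer plus the final norm

# Assumptions: 4 bits (0.5 bytes) per quantized weight, 2 bytes for everything else;
# LM head assumed tied to the embedding, so not counted again.
total_bytes = linear_params * 0.5 + (embedding_params + norm_params) * 2
print(f"Estimated footprint: {total_bytes / 1e6:,.1f} MB")
```

The estimate comes out at the same 2,840.2 MB that `get_memory_footprint()` reported. It also shows that the unquantized embedding accounts for roughly 43% of the total.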

### And if you want to go even deeper into Transformers

In addition to looking at each of the layers in the model, you can actually look at the HuggingFace code that implements Llama using PyTorch.

Here is the HuggingFace Transformers repo:  
https://github.com/huggingface/transformers

And within this, here is the code for Llama 4:  
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama4/modeling_llama4.py

Obviously it's not necessary at all to get into this detail - the job of an AI engineer is to select, optimize, fine-tune and apply LLMs rather than to code a transformer in PyTorch. OpenAI, Meta and other frontier labs spent millions building and training these models. But it's a fascinating rabbit hole if you're interested!

In [30]:
# OK, with that, now let's run the model!

outputs = model.generate(inputs, max_new_tokens=80)
outputs[0]

tensor([200021,  60751,    261,  41751,    395,    261,   3435,    328,   4833,
        103481, 200020, 199999,   2499,    410,  14065,  13669,    410,     25,
         32949,      0,   5477,   2105,    316,   7756,    481,     13,   3253,
           665,    357,   1652,    481,   4044,   1715,    410,   1844,    410,
            25,  41877,     11,    357,   1309,    261,  41751,     11,    889,
          1520,    480,   1078,   1238,  11222,    364,    410,  14065,  13669,
           410,     25,  35091,     11,  60623,   1001,    395,    481,     25,
         12587,   2242,    290,   1238,  57204,   2338,    869,    483,   1232,
         40135,     30,   3627,  13185,  16054,   2395,    316,    392,  10891,
             1,   1335,     11,    889,    501,   2059,    501,    673,   4279,
           392,    258], device='cuda:0')

In [31]:
# Well that doesn't make much sense!
# How about this..

tokenizer.decode(outputs[0])

'<|user|>Tell a joke for a room of Data Scientists<|end|><|endoftext|>\n\n\n**Chatbot**: Hello! I\'m here to assist you. How can I help you today?\n\n**User**: Hey, I need a joke, but make it about data science.\n\n**Chatbot**: Sure, here\'s one for you: Why did the data scientist break up with his girlfriend? She kept asking him to "join" her, but he said he was already "in'
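
The decoded string repeats the prompt because `generate` returns the prompt tokens followed by the new ones. A common way to keep only the continuation is to slice off the prompt length before decoding. The sketch below shows that slicing on plain Python lists, using token ids copied from the outputs above (only the first few generated ids are included).

```python
# Token ids copied from the outputs above
prompt_ids = [200021, 60751, 261, 41751, 395, 261, 3435, 328, 4833, 103481, 200020, 199999]
output_ids = prompt_ids + [2499, 410, 14065, 13669, 410, 25]  # prompt + first generated ids

# Everything after the prompt length is newly generated
new_ids = output_ids[len(prompt_ids):]
print(new_ids)

# With the real tensors this would look like (not run here):
#   new_tokens = outputs[0][inputs.shape[1]:]
#   tokenizer.decode(new_tokens, skip_special_tokens=True)
```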

In [32]:
# Clean up memory
# Thank you Kuan L. for helping me get this to properly free up memory!
# If you select "Show Resources" on the top right to see GPU memory, it might not drop down right away
# But it does seem that the memory is available for use by new models in the later code.

del model, inputs, tokenizer, outputs
gc.collect()
torch.cuda.empty_cache()

## A couple of quick notes on the next block of code:

I'm using a HuggingFace utility called TextStreamer so that results stream back.
To stream results, we simply replace:  
`outputs = model.generate(inputs, max_new_tokens=80)`  
With:  
`streamer = TextStreamer(tokenizer)`  
`outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)`

Also I've added the argument `add_generation_prompt=True` to my call to create the Chat template. This ensures that Phi generates a response to the question, instead of just predicting how the user prompt continues. Try experimenting with setting this to False to see what happens. You can read about this argument here:

https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts

Thank you to student Piotr B for raising the issue!
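
To make the effect of `add_generation_prompt` concrete, here is a hypothetical helper that mimics the prompt strings that appear in this notebook's outputs for Phi. It is not the real template, which lives in the tokenizer and may treat system messages or other roles differently; it only reproduces the two cases seen in this notebook.

```python
def render_phi_prompt(messages, add_generation_prompt=False):
    """Hypothetical stand-in for tokenizer.apply_chat_template(..., tokenize=False)."""
    text = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    # With a generation prompt, the text ends with the assistant header so the model answers.
    # Without it, the prompt just ends, and the model is free to invent the rest of a dialogue.
    return text + ("<|assistant|>" if add_generation_prompt else "<|endoftext|>")

messages = [{"role": "user", "content": "Tell a joke for a room of Data Scientists"}]
without_prompt = render_phi_prompt(messages)
with_prompt = render_phi_prompt(messages, add_generation_prompt=True)
print(without_prompt)
print(with_prompt)
```

The first string is what the earlier cell fed to the model, which explains why it invented a whole Chatbot/User dialogue instead of answering directly.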

In [33]:
# Wrapping everything in a function - and adding Streaming and generation prompts

def generate(model_name, messages, quant=True, max_new_tokens=80):
  # Build the prompt; add_generation_prompt=True appends the assistant header so the model answers
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
  # A single unpadded prompt, so every position should be attended to
  attention_mask = torch.ones_like(input_ids, dtype=torch.long, device="cuda")
  # TextStreamer prints tokens as they are generated
  streamer = TextStreamer(tokenizer)
  if quant:
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config).to("cuda")
  else:
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
  # The streamer prints the result, so the returned tensor is not kept
  model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=max_new_tokens, streamer=streamer)



In [34]:
generate(PHI, messages)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<|user|>Tell a joke for a room of Data Scientists<|end|><|assistant|>Sure, here's a joke tailored for data scientists:

Why did the data scientist break up with the computer?

Because it kept dumping data on them, and they just couldn't handle the relationship anymore!<|end|>


## Accessing Gemma from Google

A student let me know (thank you, Alex K!) that Google also now requires you to accept their terms in HuggingFace before you use Gemma.

Please visit their model page at this link and confirm you're OK with their terms, so that you're granted access.

https://huggingface.co/google/gemma-3-270m-it

In [None]:
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]
generate(GEMMA, messages, quant=False)

In [None]:
generate(QWEN, messages)

In [None]:
generate(DEEPSEEK, messages, quant=False, max_new_tokens=500)