In [11]:
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

In [12]:
LLAMA = "meta-llama/Llama-3.2-1B-Instruct"
PHI35 = "microsoft/Phi-3.5-mini-instruct"

In [13]:
messages = [
    {"role": "system", "content": "You are a standup comedian"},
    {"role": "user", "content": "Tell a funny-technical joke for a crowd of IT people"}
  ]

In [14]:
# Quantization lets us load the model using less memory by lowering the precision of the weights from their
# original 32-bit float values to 8 or even 4 bits. This dramatically reduces the memory footprint and can speed up
# loading and computation, without severely degrading the final result.

# Not used here because my AMD GPU cannot run CUDA.
#quant_config = BitsAndBytesConfig(
#    load_in_4bit=True,
#    bnb_4bit_use_double_quant=True,
#    bnb_4bit_compute_dtype=torch.bfloat16,
#    bnb_4bit_quant_type="nf4"
#)
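
To make the idea in the comment above concrete, here is a minimal pure-Python sketch of quantizing a handful of weights to 8-bit integers with a single shared scale (absmax quantization). This is a deliberately simple scheme chosen for illustration, not the NF4 method that BitsAndBytesConfig would apply, and the helper names and sample weights are made up.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale (absmax quantization)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    largest = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero for all-zero input
    scale = largest / qmax
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]   # illustrative values, not real model weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale.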

In [15]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
# Use the end-of-sentence token as the padding token wherever padding is needed. If we don't, we just get an error message, but execution continues.
# For quantization, append .to("cuda") below
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")#.to("cuda")

In [16]:
#tokenizer.batch_decode(inputs)

In [17]:
# The model
# To load with quantization (when CUDA is available), use the line below instead
# model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)
model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto")

In [18]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 4,943.3 MB
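
The reported footprint is consistent with every weight being stored as a 32-bit float. As a rough back-of-the-envelope sketch, the same number of weights at lower precision would need approximately the amounts computed below. It assumes 4 bytes per weight in the measurement above, and it ignores that quantization libraries typically keep some layers at higher precision, so real quantized footprints will likely be somewhat larger.

```python
measured_mb = 4943.3           # footprint printed above
bytes_per_fp32_weight = 4      # assumption: the weights were loaded as float32
n_weights = measured_mb * 1e6 / bytes_per_fp32_weight

def estimate_mb(n_weights, bits):
    """Idealized size in MB if every weight used the given number of bits."""
    return n_weights * bits / 8 / 1e6

estimates = {bits: estimate_mb(n_weights, bits) for bits in (32, 16, 8, 4)}
for bits, mb in estimates.items():
    print(f"{bits:>2}-bit: {mb:,.1f} MB")
```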


In [20]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):
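
As a sanity check, the layer shapes printed above can be turned into a parameter count by hand. The sketch below assumes the output head shares its weights with embed_tokens (the printout is cut off before that point, so this is an assumption) and that none of the linear layers has a bias, as shown.

```python
# Dimensions read off the printed architecture
vocab, hidden, kv_dim, mlp_dim, n_layers = 128256, 2048, 512, 8192, 16

embedding = vocab * hidden
attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj and o_proj, plus k_proj and v_proj
mlp = 3 * hidden * mlp_dim                              # gate_proj, up_proj, down_proj
norms = 2 * hidden                                      # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

total = embedding + n_layers * per_layer + hidden       # final term is the last RMSNorm
print(f"{total:,} parameters")
print(f"{total * 4 / 1e6:,.1f} MB at 4 bytes per weight")
```

At 4 bytes per parameter this comes to about 4,943 MB, in line with the value reported by get_memory_footprint().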

In [21]:
outputs = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
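
The last warning is easier to see with a toy example. An attention mask marks which positions are real tokens (1) and which are padding (0). Without an explicit mask, one fallback is to guess it by comparing each id with the pad id, but once the pad token is the same as the end-of-sequence token that guess also hides genuine end-of-sequence tokens. The sketch below uses made-up token ids and hypothetical helper names; it illustrates the ambiguity and is not how transformers is implemented internally.

```python
EOS_ID = 128001      # id reported in the warning above
PAD_ID = EOS_ID      # mirrors tokenizer.pad_token = tokenizer.eos_token

def pad_batch(sequences, pad_id):
    """Right-pad sequences to equal length and build the true attention mask."""
    width = max(len(s) for s in sequences)
    ids = [s + [pad_id] * (width - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return ids, mask

def guess_mask(ids, pad_id):
    """Naive inference: treat every occurrence of pad_id as padding."""
    return [[int(tok != pad_id) for tok in row] for row in ids]

batch = [[11, 12, EOS_ID, 13, 14], [21, 22, 23]]   # made-up ids; the first row contains a real EOS
ids, true_mask = pad_batch(batch, PAD_ID)
guessed = guess_mask(ids, PAD_ID)

print(true_mask)
print(guessed)
```

The guessed mask wrongly zeroes out the real end-of-sequence token in the first row, which is why the library asks for the mask to be passed explicitly.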


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 08 Feb 2025

You are a standup comedian<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a funny-technical joke for a crowd of IT people<|eot_id|><|start_header_id|>assistant<|end_header_id|>

(clears throat) Alright, folks, gather 'round. So, I was talking to my friend the other day, and he's an IT guy. I asked him, "How's your network?" And he said, "Oh, it's just peachy." (pauses for laughter)

But seriously, have you ever tried to troubleshoot a network issue when you


In [22]:
# Clean up

del inputs, outputs, model
torch.cuda.empty_cache()

In [24]:
# Wrapping everything in a function - and adding Streaming and generation prompts

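def generate(model_name, messages):
  # Load the tokenizer and use the end-of-sentence token for padding
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  # add_generation_prompt=True appends the assistant header so the model starts its own reply
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)#.to("cuda")
  # TextStreamer prints the text to stdout as it is generated
  streamer = TextStreamer(tokenizer)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")#, quantization_config=quant_config)
  outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
  # Drop the references and release any cached GPU memory
  del tokenizer, streamer, model, inputs, outputs
  torch.cuda.empty_cache()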
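
To see what add_generation_prompt=True changes, here is a simplified, hand-written rendering of the Llama-style format visible in the earlier output. It is only an illustration: the real template ships with the tokenizer and also injects the date lines seen in that output, and render_llama_style is a hypothetical helper.

```python
def render_llama_style(messages, add_generation_prompt=False):
    """Simplified stand-in for the tokenizer's chat template (illustration only)."""
    text = "<|begin_of_text|>"
    for m in messages:
        text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply itself
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are a standup comedian"},
    {"role": "user", "content": "Tell a funny-technical joke for a crowd of IT people"},
]
without_prompt = render_llama_style(messages)
with_prompt = render_llama_style(messages, add_generation_prompt=True)

# Show only the part that the generation prompt adds
print(repr(with_prompt[len(without_prompt):]))
```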

In [25]:
generate(PHI35, messages)

Some parameters are on the meta device because they were offloaded to the disk and cpu.


<|system|> You are a standup comedian<|end|><|user|> Tell a funny-technical joke for a crowd of IT people<|end|><|assistant|> Sure, here's a tech-themed stand-up joke for the IT crowd:

"Why do programmers always mix up Halloween and Christmas?

Because Oct 31 equals Dec 25 in their world!

But seriously, I've seen some code that's so complex, it's like trying to decipher


In [26]:
generate(LLAMA, messages)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 08 Feb 2025

You are a standup comedian<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a funny-technical joke for a crowd of IT people<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here's one:

"You know what's wild about computers? They're like relationships – except instead of 'I love you', it's 'I hate you'. (pause)

But seriously, have you ever tried to troubleshoot a printer that won't print? It's like trying to find the 'print' button on a spaceship. 'Where is it? It's not here, it's not
