<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/02-running-inference/01_running_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GPT-Neo inference with the Hugging Face Transformers library

This notebook introduces inference (text generation) with the [GPT-Neo model](https://github.com/EleutherAI/gpt-neo) using Hugging Face's [Transformers library](https://github.com/huggingface/transformers). It runs on the Colab free tier with hardware acceleration (GPU) enabled.

Install the only requirement missing from the Colab VM: Hugging Face's Accelerate.

In [None]:
!pip install accelerate

Download the GPT-Neo 2.7B model and its tokenizer from the Hugging Face Hub. The model is loaded in full precision and then placed on the GPU.

In [None]:
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "EleutherAI/gpt-neo-2.7B"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
# device_map="auto" lets Accelerate decide where to place the weights.
model = GPTNeoForCausalLM.from_pretrained(model_id, device_map="auto")
model.to(device)

```log
GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 2560)
    (wpe): Embedding(2048, 2560)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-31): 32 x GPTNeoBlock(
        (ln_1): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=2560, out_features=2560, bias=False)
            (v_proj): Linear(in_features=2560, out_features=2560, bias=False)
            (q_proj): Linear(in_features=2560, out_features=2560, bias=False)
            (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
          )
        )
        (ln_2): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=2560, out_features=10240, bias=True)
          (c_proj): Linear(in_features=10240, out_features=2560, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=50257, bias=False)
)
```
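
The layer shapes printed above are enough to estimate the parameter count and the memory the weights need. The sketch below redoes that arithmetic. It assumes that `lm_head` shares its weights with the `wte` embedding (so it is not counted twice) and that full precision means 4 bytes per parameter; the function name is illustrative.

```python
def gpt_neo_param_count(vocab=50257, max_pos=2048, hidden=2560,
                        ffn=10240, layers=32):
    """Count parameters from the layer shapes in the printed architecture."""
    embeddings = vocab * hidden + max_pos * hidden      # wte + wpe
    layer_norm = 2 * hidden                             # weight + bias
    attention = 3 * hidden * hidden                     # k/v/q_proj, no bias
    attention += hidden * hidden + hidden               # out_proj with bias
    mlp = hidden * ffn + ffn                            # c_fc with bias
    mlp += ffn * hidden + hidden                        # c_proj with bias
    block = 2 * layer_norm + attention + mlp            # ln_1, ln_2, attn, mlp
    # Assumption: lm_head is tied to wte, so it adds no parameters.
    return embeddings + layers * block + layer_norm     # + ln_f


total = gpt_neo_param_count()
print(f"{total:,} parameters")
print(f"{total * 4 / 1024**3:.2f} GiB in float32 (assumed 4 bytes each)")
```

The result is only a lower bound on GPU memory use: activations and the KV cache built during generation come on top of the weights.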

Check where the model layers were placed: entirely in GPU memory, or partly offloaded to RAM and/or disk.

In [None]:
model.hf_device_map

{'': 0}
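
`hf_device_map` maps module names to devices. The single entry `{'': 0}` uses the empty string (the root module) as its key, meaning the whole model sits on GPU 0. A small helper like the one below (a hypothetical name, not part of Transformers or Accelerate) can make offloading easier to spot when the map has many entries. The second map is made up for illustration.

```python
from collections import Counter

def summarize_device_map(device_map):
    """Count how many entries of a device map land on each device."""
    return dict(Counter(str(device) for device in device_map.values()))

# The map printed above: one entry covering the whole model, on GPU 0.
print(summarize_device_map({"": 0}))

# A made-up map where some blocks were offloaded to CPU RAM and disk.
example = {"transformer.wte": 0, "transformer.h.0": 0,
           "transformer.h.1": "cpu", "lm_head": "disk"}
print(summarize_device_map(example))
```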

Perform standard inference (text completion).

In [None]:
prompt = "The story so far: in the beginning, the universe was created."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generated_ids = model.generate(input_ids,
                               do_sample=True,
                               temperature=0.9,
                               max_length=200,
                               pad_token_id=50256)
generated_text = tokenizer.decode(generated_ids[0])
print(generated_text)

The story so far: in the beginning, the universe was created. It had size, shape and form.

Then one day, a huge explosion occurred that left the universe in the state we know it now.

The beginning of the universe – in its current form

So how did the universe start?

Well, let’s take a closer look.

We’ve been told that before the universe exploded, it was created.

What we don’t know

But where did the universe come from?

We don’t know – we just know that there had to have been something before all this.

How big was the universe at the time of the big bang?

We don’t know. We know that it had size, shape and form – but nothing else.

We see pictures of the early universe with these three characteristics – size, shape and form – but we don
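
The `temperature=0.9` argument used above rescales the model's logits before a token is sampled: values below 1 sharpen the distribution, values above 1 flatten it. A minimal sketch of that step in plain Python, with made-up logits for a three-token vocabulary (the real `generate` call works on tensors and may apply further processing):

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                      # made-up scores for 3 tokens
for t in (0.5, 0.9, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling then draws one token index according to these probabilities.
probs = softmax_with_temperature(logits, 0.9)
token = random.choices(range(len(logits)), weights=probs, k=1)[0]
```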


## Few-shot learning

Perform few-shot text classification: the model can generalize from a few new, previously unseen examples supplied in the prompt.

In [None]:
prompt = """
Sentence: This movie is very nice.
Sentiment: positive

#####

Sentence: I hated this movie, it sucks.
Sentiment: negative

#####

Sentence: This movie was actually pretty funny.
Sentiment: positive

#####

Sentence: This movie could have been better.
Sentiment: neutral
"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generated_ids = model.generate(input_ids,
                               do_sample=True,
                               temperature=0.9,
                               max_length=200,
                               pad_token_id=50256)
generated_text = tokenizer.decode(generated_ids[0])
print(generated_text)


Sentence: This movie is very nice.
Sentiment: positive

#####

Sentence: I hated this movie, it sucks.
Sentiment: negative

#####

Sentence: This movie was actually pretty funny.
Sentiment: positive

#####

Sentence: This movie could have been better.
Sentiment: neutral

#####

Sentence: I liked this movie, it was actually kinda fun.
Sentiment: neutral

#####

Sentence: I liked the movie.
Sentiment: neutral.

#####

Sentence: This movie did not really make me laugh.
Sentiment: neutral

#####

Sentence: This movie is alright.
Sentiment: neutral

#####

Sentence: I really enjoyed this movie.
Sentiment: neutral

#####

Sentence: This movie was actually pretty
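
In the run above the prompt ends after a complete example, so the model continues by inventing new sentences and labels. To classify a specific sentence, the prompt would instead end with that sentence and an open `Sentiment:` field, and only the first generated line would be read as the label. A sketch of that idea, using the same `#####` separator; the helper names are hypothetical and the completion string is a stand-in for real model output:

```python
SEPARATOR = "\n\n#####\n\n"

def build_few_shot_prompt(examples, query):
    """Format labelled examples, then leave the label of `query` open."""
    shots = [f"Sentence: {s}\nSentiment: {label}" for s, label in examples]
    shots.append(f"Sentence: {query}\nSentiment:")
    return SEPARATOR.join(shots)

def extract_label(completion):
    """Keep only the first non-empty line generated after the prompt."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip().rstrip(".").lower()
    return ""

examples = [
    ("This movie is very nice.", "positive"),
    ("I hated this movie, it sucks.", "negative"),
    ("This movie could have been better.", "neutral"),
]
prompt = build_few_shot_prompt(examples, "I really enjoyed this movie.")
print(prompt)

# Stand-in for the text the model would append to the prompt.
fake_completion = " positive\n\n#####\n\nSentence: ..."
print(extract_label(fake_completion))
```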


## Code generation

In [None]:
prompt = """Instruction: Generate a Python function that lets you reverse a list of integers.

Answer: """
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generated_ids = model.generate(input_ids,
                               do_sample=True,
                               temperature=0.9,
                               max_length=200,
                               pad_token_id=50256
                               )
generated_text = tokenizer.decode(generated_ids[0])
print(generated_text)

Instruction: Generate a Python function that lets you reverse a list of integers.

Answer:  To reverse a list, you can use the function *reverse*:
import copy

def reverse():    
    numbers = [123, 321, 456, 888, 321, 888, 888, 888]   # the list 
    numbers[::-1] = [555, 444, 222, 555]
    return numbers

print(reverse())

Output: 
[999, 666, 465, 496, 6666, 666, 999, 222, 456, 555, 722, 556, 222, 555, 722, 555, 444, 888, 888, 888]

<|endoftext|>
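
The generated answer above is not a working solution: the function takes no argument, operates on a hard-coded list, and the printed output was invented by the model rather than produced by running the code. For comparison, a correct function for this instruction could look like the following (written by hand here, not produced by GPT-Neo):

```python
def reverse_list(numbers):
    """Return a new list with the elements of `numbers` in reverse order."""
    return numbers[::-1]

print(reverse_list([123, 321, 456, 888]))
```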


## Batch Prompting

Run text completion on a batch of prompts.

In [None]:
texts = ["Once there was a man ", "The weather today will be ", "A great soccer player must "]

tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding,
                                   do_sample=True,
                                   temperature=0.9,
                                   max_length=50,
                                   pad_token_id=50256)
generated_texts = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True)

for text in generated_texts:
  print("---------")
  print(text)

---------
Once there was a man  
Who was given to drink  
And as he was lying  
In bed he said to a friend,  
“I think I shall drown myself.” “No,
---------
The weather today will be 
hot with a high of 
86 and a low of 72. 
The high will last all day long and 
the low will start after sunset. 
Temperatures will be in the low
---------
A great soccer player must 
always have confidence on the field,
and that’s what we’re trying to 
figure out right now.
>> Our next speaker, 
she is a professional athlete.
>>
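
The batch above is padded on the left (`tokenizer.padding_side = "left"`) because a decoder-only model continues from the last position of each row; with right padding, shorter prompts would be continued from pad tokens. The sketch below illustrates both layouts and the matching attention mask on made-up token ids. `pad_batch` is a hypothetical helper, not the tokenizer's implementation.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length; return (input_ids, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        real = [1] * len(seq)
        zeros = [0] * len(pad)
        if side == "left":
            input_ids.append(pad + seq)
            attention_mask.append(zeros + real)
        else:
            input_ids.append(seq + pad)
            attention_mask.append(real + zeros)
    return input_ids, attention_mask

batch = [[11, 12, 13, 14], [21, 22]]          # made-up token ids
PAD = 50256                                   # GPT-Neo's EOS id, reused as pad

left_ids, left_mask = pad_batch(batch, PAD, side="left")
right_ids, _ = pad_batch(batch, PAD, side="right")
print(left_ids)    # every row ends with a real token
print(left_mask)   # zeros mark the padded positions to ignore
print(right_ids)   # the shorter row ends with padding
```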


## Evaluating the LLM

In [None]:
%%shell

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

In [None]:
!lm_eval \
  --model hf-auto \
  --model_args pretrained=EleutherAI/gpt-neo-2.7B,dtype="float16" \
  --tasks wikitext \
  --device cuda:0

2025-09-23 05:57:25.102717: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1758607045.133734    9967 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1758607045.143366    9967 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1758607045.166206    9967 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1758607045.166239    9967 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1758607045.166248    9967 computation_placer.cc:177] computation placer alr
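
The `wikitext` task scores the model by perplexity, which is derived from the log-likelihood the model assigns to the reference text. As a rough sketch of the underlying formula (the harness computes its own, normalized variants of it; the function below is illustrative and the log-probabilities are made up):

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Made-up log-probabilities of four consecutive tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(log_probs))
```

Lower values are better: a perplexity of `k` roughly means the model is as uncertain as if it chose uniformly among `k` tokens at each step.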

## KV Caching

Benchmark the model on text completion, comparing generation with the KV cache enabled and disabled.

In [None]:
import time
import numpy as np

prompt = "The story so far: in the beginning, the universe was created."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

for use_cache in (True, False):
  times = []
  for _ in range(20):
    start = time.time()
    generated_ids = model.generate(input_ids,
                                  do_sample=True,
                                  temperature=0.9,
                                  max_length=200,
                                  pad_token_id=50256,
                                  use_cache=use_cache)
    times.append(time.time() - start)
  print(f"{'Using' if use_cache else 'No'} KV cache: {round(np.mean(times), 3)} +- {round(np.std(times), 3)} seconds")
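
The gap between the two settings comes from how much work each decoding step repeats. With the cache, the keys and values of earlier tokens are stored, so each step only runs the newly generated token through the model; without it, every step reprocesses the whole sequence so far. A back-of-the-envelope sketch that counts token positions processed under that simplified model (the prompt length of 14 tokens is an assumption, not a measured value):

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating."""
    if use_cache:
        # The prompt is processed once; each later step feeds one token.
        return prompt_len + (new_tokens - 1)
    # Step i re-runs the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

prompt_len = 14                      # assumed prompt length in tokens
new_tokens = 200 - prompt_len        # max_length counts the prompt too

cached = positions_processed(prompt_len, new_tokens, use_cache=True)
uncached = positions_processed(prompt_len, new_tokens, use_cache=False)
print(cached, uncached, round(uncached / cached, 1))
```

The count grows linearly with the cache and quadratically without it. Measured wall-clock ratios will usually be smaller than the ratio of these counts, since GPUs process many positions in parallel.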

## Estimating the generation time

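Benchmark the model's total generation time. The first of the 21 runs is dropped from the statistics, so the mean and standard deviation cover 20 runs.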

In [None]:
import time
import numpy as np

prompt = "The story so far: in the beginning, the universe was created."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

max_length = 300
times = []
inference_runs = 21
for _ in range(inference_runs):
  start = time.time()
  generated_ids = model.generate(input_ids,
                                do_sample=True,
                                temperature=0.9,
                                max_length=max_length,
                                pad_token_id=50256,
                                )
  times.append(time.time() - start)
print(f"Average Total Generation time: {round(np.mean(times[1:]), 3)} +- {round(np.std(times[1:]), 3)} seconds")
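
Total time alone is hard to compare across different `max_length` values, so it can be convenient to also report throughput in tokens per second. A sketch of such a summary, mirroring the loop above by dropping the first run. The helper name and the timing values are made up, and the token count assumes every run generates up to `max_length` (sampling can stop earlier at an end-of-text token).

```python
import statistics

def summarize_timings(times, new_tokens, warmup=1):
    """Mean, population std and tokens/second, skipping warm-up runs."""
    kept = times[warmup:]
    mean = statistics.mean(kept)
    std = statistics.pstdev(kept)    # same convention as np.std (ddof=0)
    return {"mean_s": mean, "std_s": std, "tokens_per_s": new_tokens / mean}

# Made-up timings in seconds; the first run is slower and gets dropped.
times = [12.0, 8.0, 10.0, 9.0, 9.0]
summary = summarize_timings(times, new_tokens=286)
print(summary)
```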