<a href="https://colab.research.google.com/github/sulbhajain/Gen-AI/blob/main/NB_01_Gemma_270m_huggingface_prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From Zero to Prompts: Gemma 270M in Colab

---


### Summary

In this notebook, you’ll learn how to load and interact with **Gemma 270M**, a small open-weight language model, using Hugging Face in Colab. We’ll start by setting up the environment, authenticating with the Hugging Face Hub, and running prompts through the model. Along the way, you’ll:

- Practice **basic prompting** with Hugging Face pipelines  
- Build helper functions for smoother experimentation  
- Explore two test cases:  
  - ⚽ **Football classification** (Australian vs. American teams)  
  - 👽 **Alien speech translation** (synthetic Martian dataset)  
- Compare how the model responds to prompts with and without system instructions  

This notebook sets the foundation for later steps in the series, where we’ll fine-tune the model and evaluate its performance.

## Presetup Instructions
1. Get a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens); read-only access is enough.
2. Accept the Gemma license on the model page: https://huggingface.co/google/gemma-3-270m
3. Click **Connect** in the top right. You may need to open the drop-down and select a GPU or TPU runtime. We'll talk more about the differences between them in the session.

### Environment Setup

Before we can run the model, we need to install a few libraries:

- **PyTorch** (for deep learning computations)  
- **TensorBoard** (for tracking training and metrics)  
- **Hugging Face libraries** (transformers, datasets, etc.) for loading and working with models  
- *(Optional)* **flash-attn** for faster training if your GPU supports it (e.g. NVIDIA L4 or A100)

Run the following cell to install everything. If you’re in Colab, this will only take a minute.

In [None]:
# Install Pytorch & other libraries
%pip install torch tensorboard

# Install Hugging Face libraries
%pip install transformers datasets accelerate evaluate trl protobuf sentencepiece

# Optional: flash-attn only helps on GPUs that support the BF16 data type and flash attention, such as NVIDIA L4 or NVIDIA A100. Comment out the next line on other hardware.
%pip install flash-attn

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting trl
  Downloading trl-0.23.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading trl-0.23.0-py3-none-any.whl (564 kB)
Installing collected packages: trl, evaluate
Successfully installed evaluate-0.4.6 trl-0.23.0
Collecting flash-attn
  Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: fi

### Authenticate with Hugging Face

To download and run Gemma, we need to log in to the **Hugging Face Hub**.  
This gives us secure access to the model weights and lets us load them directly in Colab.

- If you’re running this in Colab, you can store your token in `userdata` for convenience.  
- If you don’t already have one, create a free Hugging Face account and [generate an access token](https://huggingface.co/settings/tokens).  
- Replace `'HF_TOKEN'` in the cell below with the name of your own stored secret if it differs.

In [None]:
from google.colab import userdata
from huggingface_hub import login

# Log in to the Hugging Face Hub
hf_token = userdata.get('HF_TOKEN') # Reads the secret named HF_TOKEN; works only inside Google Colab
login(hf_token)
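
Outside Colab, `google.colab` cannot be imported, so the cell above will fail. Here is a minimal sketch of a fallback, assuming you have exported the token in an environment variable (the helper name `get_hf_token` and the choice of environment variable are illustrative, not part of the original notebook):

```python
import os

def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Not in Colab: read an environment variable instead (None if it is unset).
        return os.environ.get(name)
    return userdata.get(name)

# Usage (hypothetical): hf_token = get_hf_token(); login(hf_token)
```

Keeping the lookup in one function means the rest of the notebook can use `hf_token` without caring where it came from.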

In [None]:
base_model = "google/gemma-3-270m-it" # @param ["google/gemma-3-270m-it","google/gemma-3-1b-it","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it"] {"allow-input":true}

### Choose a Base Model

Next, we’ll pick the base model to work with. In this notebook, we’ll use **Gemma 270M**, a small open-weight model that runs comfortably on free Colab GPUs.  

For most of this code, the details are just **model-loading boilerplate** — you won’t need to change much. The main part to pay attention to is the `base_model` name (you could swap it out for another Hugging Face model that fits on your hardware).  

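Note: for Gemma we set `attn_implementation="eager"`, the straightforward, unoptimized attention path. It is the setting the Transformers library recommends when training Gemma models, and it is chosen for reliability rather than speed; optimized kernels such as flash attention are usually faster. Don't worry if this is unfamiliar. We'll explain more about performance trade-offs later.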

### Load the Model and Tokenizer

Now we’ll load the **Gemma 270M model** and its tokenizer from Hugging Face.  
This step may take a minute the first time, since Colab will download the model weights and config files.  

When it finishes, you should see:  
- ✅ Download progress bars for model/tokenizer files  
- ✅ Confirmation of the device (e.g., `cuda:0` for GPU)  
- ✅ The data type being used (often `bfloat16` on Colab GPUs)

Don’t worry about warnings (e.g., about `torch_dtype` being deprecated) — they won’t affect running the notebook.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(base_model, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(base_model, token=hf_token,
                                             torch_dtype="auto",
                                             device_map="auto",
                                             attn_implementation="eager")


print(f"Device: {model.device}")
print(f"DType: {model.dtype}")

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/536M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

Device: cuda:0
DType: torch.bfloat16
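
The `bfloat16` dtype and the 536M `model.safetensors` download shown above are consistent with each other. As a back-of-the-envelope check, assuming a nominal count of 270 million parameters (taken from the model name, not measured) and 2 bytes per `bfloat16` weight:

```python
# Bytes used by one parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_size_mb(num_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

nominal_params = 270_000_000  # assumption: rounded count implied by "270M"
print(weight_size_mb(nominal_params, "bfloat16"))  # 540.0
print(weight_size_mb(nominal_params, "float32"))   # 1080.0
```

This counts weights only; activations and the generation cache add to it at run time. Once the model is loaded, `sum(p.numel() for p in model.parameters())` gives the exact parameter count.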


## Generate Your First Response

With the model and tokenizer ready, we can now send it a prompt.  
Here we’re using Hugging Face’s `pipeline` API, which wraps the model and handles the details for us:  

- Format the input into a simple “chat” message  
- Apply Gemma’s chat template so the model knows how to read it  
- Ask the model to generate a reply  

This is your first end-to-end test that everything is working.

In [None]:
from transformers import pipeline

from random import randint
import re

# Load the model and tokenizer into the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "user", "content": "What is your name"}
]

# Convert the messages into a prompt string using Gemma's chat template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# return_full_text=False makes the pipeline return only the newly generated reply
outputs = pipe(prompt, max_new_tokens=256,
               disable_compile=True,
               return_full_text=False)

# outputs
outputs[0]["generated_text"]

Device set to use cuda:0


'I am Gemma, a large language model created by the Gemma team at Google DeepMind. I am an open-weights model, meaning I am publicly available.\n'
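
To make the "apply Gemma's chat template" step less opaque, here is a hand-written approximation of the string it produces for user and assistant turns. It is a sketch based on Gemma's turn markers, not the tokenizer's actual template (system messages and multimodal content are left out), and the function name `render_gemma_style` is hypothetical. Printing `prompt` in the cell above shows the authoritative version.

```python
def render_gemma_style(messages, add_generation_prompt=True):
    """Approximate a Gemma-style chat prompt for user/assistant turns only."""
    role_names = {"user": "user", "assistant": "model"}  # Gemma labels the assistant "model"
    text = "<bos>"
    for m in messages:
        role = role_names[m["role"]]
        text += f"<start_of_turn>{role}\n{m['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # An open model turn tells the model it is its turn to write.
        text += "<start_of_turn>model\n"
    return text

print(render_gemma_style([{"role": "user", "content": "What is your name"}]))
```

The trailing open `model` turn is what `add_generation_prompt=True` adds: without it, the model may continue the user's text instead of answering.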

In [None]:
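def single_turn(prompt, generator=None):
  """Send one user message through the pipeline and return only the reply text."""
  # Use the pipeline that was passed in; otherwise (or if the argument is not a
  # pipeline object) fall back to the global `pipe` created above.
  if not hasattr(generator, "tokenizer"):
    generator = pipe

  messages = [
    {"role": "user", "content": prompt}
  ]

  # Render the chat template to a plain string
  chat_prompt = generator.tokenizer.apply_chat_template(messages,
                                                        tokenize=False,
                                                        add_generation_prompt=True)
  # return_full_text=False strips the prompt from the returned text
  outputs = generator(chat_prompt,
                      max_new_tokens=256,
                      disable_compile=True,
                      return_full_text=False)

  return outputs[0]["generated_text"]

single_turn("What is your name?", pipe)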

'I am Gemma, a large language model created by the Gemma team at Google DeepMind. I am an open-weights AI assistant.\n'

## In-Context Learning for Classification  

One of the coolest uses of generative models isn’t just text generation — it’s doing classic machine learning tasks *in context*.  

Here, we'll see if the model can classify football teams by league (Australian AFL vs. American NFL) from the prompt alone, without any fine-tuning. The prompt used below contains an instruction but no labeled examples (zero-shot), so the result reflects what the model already knows about these team names and how well it follows the output format.

In [None]:
afl_clubs = [
    "Adelaide Crows",
    "Brisbane Lions",
    "Carlton Blues",
    "Collingwood Magpies",
    "Essendon Bombers",
    "Fremantle Dockers",
    "Geelong Cats",
    "Gold Coast Suns",
    "Greater Western Sydney (GWS) Giants",
    "Hawthorn Hawks",
    "Melbourne Demons",
    "North Melbourne Kangaroos",
    "Port Adelaide Power",
    "Richmond Tigers",
    "St Kilda Saints",
    "Sydney Swans",
    "West Coast Eagles",
    "Western Bulldogs"
]

nfl_teams = [
    "Arizona Cardinals",
    "Atlanta Falcons",
    "Baltimore Ravens",
    "Buffalo Bills",
    "Carolina Panthers",
    "Chicago Bears",
    "Cincinnati Bengals",
    "Cleveland Browns",
    "Dallas Cowboys",
    "Denver Broncos",
    "Detroit Lions",
    "Green Bay Packers",
    "Houston Texans",
    "Indianapolis Colts",
    "Jacksonville Jaguars",
    "Kansas City Chiefs",
    "Las Vegas Raiders",
    "Los Angeles Chargers",
    "Los Angeles Rams",
    "Miami Dolphins",
    "Minnesota Vikings",
    "New England Patriots",
    "New Orleans Saints",
    "New York Giants",
    "New York Jets",
    "Philadelphia Eagles",
    "Pittsburgh Steelers",
    "San Francisco 49ers",
    "Seattle Seahawks",
    "Tampa Bay Buccaneers",
    "Tennessee Titans",
    "Washington Commanders"
]
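
If you want to give the model labeled examples in the prompt (few-shot), one way to lay it out is sketched below. The helper name `build_few_shot_prompt` and the prompt wording are assumptions for illustration, not something Gemma requires.

```python
def build_few_shot_prompt(examples, team):
    """examples: list of (team_name, label) pairs shown to the model before the query."""
    parts = ["Classify each football team as australian or american."]
    for name, label in examples:
        parts.append(f"Team: {name}\nAnswer: {label}")
    # Leave the final answer blank for the model to fill in.
    parts.append(f"Team: {team}\nAnswer:")
    return "\n\n".join(parts)

shots = [("Sydney Swans", "australian"), ("Dallas Cowboys", "american")]
print(build_few_shot_prompt(shots, "Geelong Cats"))
```

Any team used as a shot should be left out of the scored set, otherwise the model is graded on an answer it was just shown.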

## Putting In-Context Learning to the Test  

Now that we’ve framed the task, let’s actually test the model.  
We’ll prompt it with each team name and ask it to decide whether the team is *Australian* or *American*.  

If the model can tell the two leagues apart and follows the output instruction, it should separate AFL and NFL clubs correctly most of the time. This gives us a quick, intuitive way to measure how well prompting alone works in practice.

### Note: this is an eval!

This step doubles as a simple **Eval**.  
We’re prompting the model with each team name and checking if it classifies it as *Australian* or *American*.  
The scoring is deliberately simple: a reply counts as correct only if, once stripped and lowercased, it exactly matches the expected label. The two team lists act as the test set.  

If you wanted to go further, you could turn this notebook code into a `.py` script and plug it into an evaluation harness with the same golden dataset. For now, we’ll keep it lightweight to show how in-context learning can be evaluated directly inside a notebook.
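
As a rough idea of what that script could look like, here is a minimal sketch of an evaluation function that accepts any classifier callable and a golden dataset. The names `evaluate` and `fake_classify` are hypothetical, and the stand-in classifier exists only so the example runs without a model.

```python
def evaluate(classify, golden):
    """Run `classify` on every item; return (accuracy, per-item records)."""
    records = []
    for label, items in golden.items():
        for item in items:
            prediction = classify(item).strip().lower()
            records.append({
                "item": item,
                "expected": label,
                "predicted": prediction,
                "correct": prediction == label,  # exact-match scoring
            })
    accuracy = sum(r["correct"] for r in records) / len(records)
    return accuracy, records

# Stand-in classifier (a made-up rule) so the sketch runs without a model.
def fake_classify(team):
    return "australian" if "Sydney" in team else "american"

golden = {"australian": ["Sydney Swans", "Geelong Cats"], "american": ["Dallas Cowboys"]}
accuracy, records = evaluate(fake_classify, golden)
print(accuracy)
print([r["item"] for r in records if not r["correct"]])
```

Passing the classifier in as a function keeps the scoring code independent of the model, so the same golden dataset can be reused after fine-tuning.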




In [None]:
import numpy as np
eval_map = {"australian": afl_clubs, "american": nfl_teams}

score = []
for nationality, teams in eval_map.items():
    for team in teams:
        prompt = f"Output if this is an australian or american team, only print australian or american no other output: {team}"
        response = single_turn(prompt, pipe).strip()
        score.append(response.lower() == nationality)
        print(f"{team}: {response}")

print(np.array(score).mean())

Adelaide Crows: Adelaide Crows
Brisbane Lions: Brisbane Lions
Carlton Blues: 
Collingwood Magpies: Collingwood Magpies
Essendon Bombers: Essendon Bombers
Fremantle Dockers: Fremantle Dockers
Geelong Cats: Geelong Cats
Gold Coast Suns: Gold Coast Suns
Greater Western Sydney (GWS) Giants: Greater Western Sydney (GWS) Giants
Hawthorn Hawks: Hawthorn Hawks
Melbourne Demons: Melbourne Demons
North Melbourne Kangaroos: 
Port Adelaide Power: Port Adelaide Power
Richmond Tigers: Richmond Tigers
St Kilda Saints: St Kilda Saints
Sydney Swans: Sydney Swans
West Coast Eagles: West Coast Eagles
Western Bulldogs: Western Bulldogs
Arizona Cardinals: Arizona Cardinals
Atlanta Falcons: Atlanta Falcons
Baltimore Ravens: Baltimore Ravens
Buffalo Bills: Buffalo Bills
Carolina Panthers: Carolina Panthers
Chicago Bears: Chicago Bears
Cincinnati Bengals: Cincinnati Bengals
Cleveland Browns: Cleveland Browns
Dallas Cowboys: Dallas Cowboys
Denver Broncos: Denver Broncos
Detroit Lions: Detroit Lions
Green Bay P
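
In the output above, the model mostly echoes the team name, so exact matching marks those replies wrong. Exact matching would also reject a reply such as "This is an American team." A more lenient parser is sketched below. It is one possible design rather than a standard one, and the name `extract_label` is hypothetical.

```python
import re

LABELS = ("australian", "american")

def extract_label(response):
    """Return the single label mentioned in the response, or None if zero or both appear."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    found = [label for label in LABELS if label in words]
    return found[0] if len(found) == 1 else None

print(extract_label("This is an American team."))  # american
print(extract_label("Adelaide Crows"))             # None
print(extract_label("australian or american"))     # None
```

Returning `None` for ambiguous or off-format replies lets you count format failures separately from wrong answers.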

# Can Your LLM Talk Like An Alien?

So far, we’ve been using our models for classification and simple generation. Let’s push things in a different direction: style transfer through prompting. A fun way to explore this is by getting the model to role-play as an alien, mimicking the quirky conversational data we’ve pulled from a Hugging Face dataset of NPC dialogue.

We won’t fine-tune just yet—instead, we’ll see how far careful prompting can take us. Later on, we’ll revisit this example when we move into fine-tuning, and compare how the model behaves with actual training.

To teach our model how aliens talk, we’ll load a dataset of NPC dialogues and reshape it into a chat format.


In [None]:
from datasets import load_dataset

def create_conversation(sample):
  return {
      "messages": [
          {"role": "user", "content": sample["player"]},
          {"role": "assistant", "content": sample["alien"]}
      ]
  }

npc_type = "martian" #@param ["martian", "venusian"]

# Load the selected NPC configuration from the Hub (the token authenticates the download)
dataset = load_dataset("bebechien/MobileGameNPC", npc_type, split="train", token=hf_token)

# Convert dataset to conversational format
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)

# Split dataset into 80% training samples and 20% test samples
dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

# Print formatted user prompt
print(dataset["train"][0]["messages"])

[{'content': 'Hello there.', 'role': 'user'}, {'content': "Gree-tongs, Terran. You'z a long way from da Blue-Sphere, yez?", 'role': 'assistant'}]
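
With `shuffle=False`, the split keeps the dataset's original order: the leading rows become the training set and the trailing rows become the test set. The plain-Python sketch below illustrates the idea. It is not the library's implementation, and rounding the test count up is an assumption.

```python
import math

def ordered_split(rows, test_size=0.2):
    """Keep the original order: leading rows train, trailing rows test."""
    n_test = math.ceil(len(rows) * test_size)  # assumption: test count rounds up
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

train, test = ordered_split(list(range(10)))
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

An unshuffled split is reproducible without a seed, but if the dataset is sorted by topic the test set may not be representative.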


Time to see how our base model stacks up against the alien dataset: let’s pick a random question and check both answers side by side.

In [None]:
# from transformers import pipeline

# from random import randint
# import re

# Load the model and tokenizer into the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Load a random sample from the test dataset
rand_idx = randint(0, len(dataset["test"])-1)
test_sample = dataset["test"][rand_idx]

# Convert the test example's user turn into a prompt with the Gemma template
prompt = pipe.tokenizer.apply_chat_template(test_sample["messages"][:1], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, disable_compile=True)

# Extract the user query and original answer
print(f"Question:\n{test_sample['messages'][0]['content']}\n")
print(f"Original Answer:\n{test_sample['messages'][1]['content']}\n")
print(f"Generated Answer (base model):\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Device set to use cuda:0


Question:
It's raining.

Original Answer:
Gah! Da zky iz leaking again! Zorp will be in da zhelter until it ztopz being zo... wet. Diz iz no good for my jointz.

Generated Answer (base model):
I understand you're feeling frustrated. I'm here to listen and offer support.
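
Eyeballing the two answers works for one sample, but a crude number helps when comparing many. The sketch below scores how "Martian" a reply looks by counting words that contain `z` or match a marker word. The marker list and the name `accent_score` are illustrative assumptions, and the score is a rough heuristic rather than a real style metric.

```python
import re

# Marker words without a 'z' that appear in the Martian answers (illustrative, not exhaustive).
MARKER_WORDS = {"da"}

def accent_score(text):
    """Fraction of words that contain 'z' or are a known marker word."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if "z" in w or w in MARKER_WORDS)
    return hits / len(words)

original = "Gah! Da zky iz leaking again!"
generated = "I understand you're feeling frustrated."
print(accent_score(original))   # 0.5
print(accent_score(generated))  # 0.0
```

A score like this can be averaged over the whole test split to compare the base model with later prompted or fine-tuned versions.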


So far, we’ve just been throwing examples at the model and seeing how it responds. That works, but it doesn’t give us much control. Another powerful way to steer behavior is through system prompts. System prompts let us set the stage: we can give the model a persona, define how it should talk, or even specify quirks in its voice. This isn’t fancy tooling; it’s just part of the standard chat format. Let’s see how this works by giving our model a Martian NPC persona with a distinctive accent.

In [None]:
message = [
    # System prompt: give the model a persona
    {"role": "system", "content": "You are a Martian NPC with a unique speaking style. Use an accent that replaces 's' sounds with 'z', uses 'da' for 'the', 'diz' for 'this', and includes occasional clicks like *k'tak*."},
]

# Few-shot examples: each test-set exchange becomes a user/assistant turn pair
for item in dataset['test']:
  message.append(
      {"role": "user", "content": item["messages"][0]["content"]}
  )
  message.append(
      {"role": "assistant", "content": item["messages"][1]["content"]}
  )

# The actual question
message.append(
    {"role": "user", "content": "What is this place?"}
)

outputs = pipe(message, max_new_tokens=256, disable_compile=True)
# Full conversation, including the newly generated assistant turn
print(outputs[0]['generated_text'])
print("-"*80)
# Only the model's reply
print(outputs[0]['generated_text'][-1]['content'])

The system prompt didn’t succeed — the model output ignored the Martian accent.  
This highlights some of the limits of system prompting. A few things to note:

- The cell above combines the system prompt with **few-shot examples** (earlier user/assistant turns). Try each on its own and compare results.  
- We’re using a **small model** here; larger models sometimes follow style instructions more reliably.  
- Even if system prompting does work, users can often **override or forget** those instructions mid-conversation.  
- With **fine-tuning**, by contrast, you bake the behavior into the model itself, making it much harder to bypass.  
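
To compare system-only, few-shot-only, and combined prompts, it helps to build the message list with a small helper and check it before calling the pipeline. This is a sketch with hypothetical names (`build_messages`, `validate_messages`), and the default cap of three shots is an arbitrary choice.

```python
def build_messages(system_prompt, shots, question, max_shots=3):
    """Assemble a chat: optional system prompt, a capped number of example turns, the question."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in shots[:max_shots]:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages

def validate_messages(messages):
    """Fail early on malformed entries, such as a misspelled 'content' key."""
    for i, m in enumerate(messages):
        if set(m) != {"role", "content"}:
            raise ValueError(f"message {i} has keys {sorted(m)}; expected 'content' and 'role'")
    return messages

shots = [("Hello there.", "Gree-tongs, Terran.")]
combined = validate_messages(build_messages("You are a Martian NPC.", shots, "What is this place?"))
system_only = build_messages("You are a Martian NPC.", [], "What is this place?")
few_shot_only = build_messages(None, shots, "What is this place?")
print(len(combined), len(system_only), len(few_shot_only))  # 4 2 3
```

Capping the number of shots keeps the prompt short, which matters for a small model with limited capacity to attend over long contexts.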

---

## Wrap-up

In this notebook, you:

- Set up your environment in **Google Colab** and connected to GPUs/TPUs  
- Authenticated with the **Hugging Face Hub** and worked with an **open-weight model** (Gemma 270M)  
- Practiced **basic prompting** with Hugging Face pipelines  
- Explored two test cases:  
  - ⚽ Football team classification (Australian vs. American)  
  - 👽 Alien speech translation (synthetic Martian dataset)  
- Saw the limits of **system prompting**, where style instructions weren’t always followed  
- Learned why **fine-tuning** can lock in behaviors that prompting alone can’t guarantee  

This sets the stage for the next notebook, where we’ll fine-tune Gemma on a custom dataset and evaluate its performance more systematically.