# Understanding Large Language Models - Lab 1: Setting Up Your Ecosystem

## Introduction
This notebook guides you through setting up your environment for the course.
We will install the necessary dependencies, download a pre-trained T5 model, and run a simple text-to-text prediction.

In [1]:
!pip install "transformers[torch]==4.37.2" torch sentencepiece datasets accelerate==0.28.0

Collecting transformers==4.37.2
  Obtaining dependency information for transformers==4.37.2 from https://files.pythonhosted.org/packages/85/f6/c5065913119c41ecad148c34e3a861f719e16b89a522287213698da911fc/transformers-4.37.2-py3-none-any.whl.metadata
  Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
Collecting accelerate==0.28.0
  Obtaining dependency information for accelerate==0.28.0 from https://files.pythonhosted.org/packages/a0/11/9bfcf765e71a2c84bbf715719ba520aeacb2ad84113f14803ff1947ddf69/accelerate-0.28.0-py3-none-any.whl.metadata
  Downloading accelerate-0.28.0-py3-none-any.whl.metadata (18 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.37
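
After installation, it can help to confirm that the pinned versions are the ones actually active in the kernel. The following is a minimal sketch using only the standard library; the `EXPECTED` table and the `check_environment` helper are names invented for this example, and the pins are copied from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above; unpinned packages map to None.
# EXPECTED and check_environment are hypothetical names for this sketch.
EXPECTED = {
    "transformers": "4.37.2",
    "accelerate": "0.28.0",
    "torch": None,
    "sentencepiece": None,
    "datasets": None,
}

def check_environment(expected):
    """Return {package: (installed_version_or_None, ok_flag)}."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # A package passes if it is installed and matches its pin (if any).
        ok = installed is not None and (pinned is None or installed == pinned)
        report[name] = (installed, ok)
    return report

for name, (installed, ok) in check_environment(EXPECTED).items():
    status = "OK" if ok else "CHECK"
    print(f"{name:15s} {installed or 'not installed':15s} {status}")
```

Any row marked `CHECK` suggests the install cell should be re-run, or the kernel restarted so that the new versions are picked up.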

In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

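def generate_text(input_text, max_length=50):
    """Generate text with T5, printing the token IDs before and after generation."""
    # Tokenize the prompt into a batch containing one sequence of token IDs.
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids

    # Show the input IDs and their decoded form, including the closing </s> token.
    print(input_ids)
    print(tokenizer.decode(input_ids[0]))

    # Generate at most max_length output tokens.
    output_ids = model.generate(input_ids, max_length=max_length)

    # Show the output IDs and their decoded form, including <pad> and </s>.
    print(output_ids)
    print(tokenizer.decode(output_ids[0]))

    # Return the decoded text with the special tokens removed.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)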


example_input = "translate English to French: How are you?"
output_text = generate_text(example_input)

print("Input:", example_input)
print("Output:", output_text)




To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


tensor([[13959,  1566,    12,  2379,    10,   571,    33,    25,    58,     1]])
translate English to French: How are you?</s>
tensor([[   0, 5257,    3, 6738,   18, 3249,   58,    1]])
<pad> Comment êtes-vous?</s>
Input: translate English to French: How are you?
Output: Comment êtes-vous?


In [3]:
### Try at least 5 other example inputs (this time without a task prefix)
example_inputs = [
    "How are you?",
    "What is your name?",
    "Where is the nearest restaurant?",
    "I love learning new things.",
    "This is a beautiful day.",
]

for example_input in example_inputs:
    output_text = generate_text(example_input)
    print(output_text)

tensor([[571,  33,  25,  58,   1]])
How are you?</s>
tensor([[    0,  2739, 15840,   146,    58,     1]])
<pad> Wie bist du?</s>
Wie bist du?
tensor([[363,  19,  39, 564,  58,   1]])
What is your name?</s>
tensor([[   0, 2751,   19,   39,  564,   58,    1]])
<pad> Was is your name?</s>
Was is your name?
tensor([[ 2840,    19,     8, 13012,  2062,    58,     1]])
Where is the nearest restaurant?</s>
tensor([[   0, 3488,  229,  211, 6233,    3,   58,    1]])
<pad> Wo ist das Restaurant?</s>
Wo ist das Restaurant?
tensor([[  27,  333, 1036,  126,  378,    5,    1]])
I love learning new things.</s>
tensor([[    0,  1674, 23803,     3,    15,     7,     6,  2802, 13367,   170,
         14891,   170, 14891,     5,     1]])
<pad> Ich liebe es, neue Dinge zu lernen zu lernen.</s>
Ich liebe es, neue Dinge zu lernen zu lernen.
tensor([[100,  19,   3,   9, 786, 239,   5,   1]])
This is a beautiful day.</s>
tensor([[    0,   644,   229,   266, 11878,  1743,     5,     1]])
<pad> Das ist eine schön

# T5 and the Prefix + Input Structure

T5 (Text-to-Text Transfer Transformer) treats every NLP task as text in, text out. It was trained on inputs in a **prefix + input** format, where the prefix tells the model which task to perform.

## Why Prefixes Matter
T5 was trained using structured prompts like the following (the outputs shown are illustrative):
- `translate English to French: How are you?` → `Comment êtes-vous?`
- `summarize: The Eiffel Tower is in Paris.` → `Eiffel Tower is in Paris.`
- `question: Who discovered gravity? context: Isaac Newton discovered gravity.` → `Isaac Newton`
- `sst2 sentence: I love this movie!` → `positive`
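
The prefix + input pattern is plain string construction, so it can be sketched without loading the model. This is a minimal illustration, not part of the `transformers` API: the `TASK_PREFIXES` keys and the helpers `build_prompt` and `build_qa_prompt` are hypothetical names chosen for this example, and only the prefix strings themselves come from the list above.

```python
# Prefix strings taken from the list above. The dictionary keys are
# hypothetical labels for this sketch, not identifiers known to T5.
TASK_PREFIXES = {
    "translate_en_fr": "translate English to French: ",
    "summarize": "summarize: ",
}

def build_prompt(task, text):
    """Prepend the task prefix to the raw input text."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PREFIXES[task] + text

def build_qa_prompt(question, context):
    """Question answering uses two labeled fields instead of a single prefix."""
    return f"question: {question} context: {context}"

print(build_prompt("translate_en_fr", "How are you?"))
print(build_qa_prompt("Who discovered gravity?", "Isaac Newton discovered gravity."))
```

The resulting strings can be passed directly to `generate_text` from the earlier cell.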

## Without a Prefix?
❌ `How are you?` → unpredictable output (in the run above, t5-small replied in German: `Wie bist du?`)  
✅ `translate English to French: How are you?` → `Comment êtes-vous?`

## Custom Prefixes
Fine-tune T5 with your own prefixes:
- `explain: What is...`
- `medical diagnosis: Patient has high fever...`
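
A custom prefix only has an effect if it appears consistently in the fine-tuning data. The sketch below shows one way such a dataset could be assembled, assuming a CSV with `Input` and `Response` columns (the column names the fine-tuning cell in this lab reads). The rows in `RAW_PAIRS`, the file name `explain_dataset_sample.csv`, and both helper functions are hypothetical; the contents of the real `explain_dataset.csv` are not shown in this lab.

```python
import csv
import tempfile
from pathlib import Path

PREFIX = "explain: "

# Hypothetical question/answer pairs, for illustration only.
RAW_PAIRS = [
    ("What is machine learning?", "Placeholder explanation of machine learning."),
    ("What is a neural network?", "Placeholder explanation of neural networks."),
]

def write_prefixed_csv(pairs, path, prefix=PREFIX):
    """Write (question, answer) pairs as Input/Response rows with the prefix applied."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Input", "Response"])
        writer.writeheader()
        for question, answer in pairs:
            writer.writerow({"Input": prefix + question, "Response": answer})

def read_rows(path):
    """Read the CSV back as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Write to a temporary directory so no existing dataset file is overwritten.
out_path = Path(tempfile.mkdtemp()) / "explain_dataset_sample.csv"
write_prefixed_csv(RAW_PAIRS, out_path)
rows = read_rows(out_path)
print(rows[0]["Input"])
```

Applying the prefix in one place when the file is written keeps training inputs and later test prompts in the same format.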

In [4]:
import torch
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset

csv_filename = "explain_dataset.csv"

df = pd.read_csv(csv_filename)
dataset = Dataset.from_pandas(df)

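def preprocess_function(examples):
    """Tokenize a batch of Input/Response pairs into fixed-length inputs and labels."""
    inputs = examples["Input"]
    targets = examples["Response"]

    # Pad or truncate every prompt to exactly 64 tokens.
    model_inputs = tokenizer(inputs, max_length=64, truncation=True, padding="max_length")

    # Tokenize the targets the same way. Padding positions keep the pad token ID,
    # so they are counted in the training loss.
    labels = tokenizer(targets, max_length=64, truncation=True, padding="max_length").input_ids

    model_inputs["labels"] = labels
    return model_inputs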

tokenized_dataset = dataset.map(preprocess_function, batched=True)

dataset_split = tokenized_dataset.train_test_split(test_size=0.2)

train_dataset = dataset_split["train"]
eval_dataset = dataset_split["test"]

training_args = TrainingArguments(
    output_dir="./t5-fine-tuned",
    evaluation_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_strategy="epoch",
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

# Save the fine-tuned model
model.save_pretrained("./t5-custom-response")
tokenizer.save_pretrained("./t5-custom-response")

print("Fine-tuning complete! Model saved to ./t5-custom-response")



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.004157        |
| 2     | No log        | 0.000495        |
| 3     | 0.234400      | 0.000281        |


Fine-tuning complete! Model saved to ./t5-custom-response
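
One detail of the preprocessing step is worth noting: the labels are padded to 64 tokens with the pad token ID (0 for T5, as the `<pad>` entries in the earlier tensors show), so padding positions contribute to the loss. A common adjustment is to replace those positions with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates the idea on plain Python lists; `mask_padding` is a hypothetical helper, and the example IDs are borrowed from the translation output earlier in this lab and padded by hand.

```python
PAD_TOKEN_ID = 0      # T5's <pad> ID, as seen in the generated tensors earlier
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace pad IDs with ignore_index in a batch of label sequences."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# IDs for "Comment êtes-vous?</s>" from the earlier output, hand-padded to length 10.
example_labels = [[5257, 3, 6738, 18, 3249, 58, 1, 0, 0, 0]]
print(mask_padding(example_labels))
# → [[5257, 3, 6738, 18, 3249, 58, 1, -100, -100, -100]]
```

In `preprocess_function`, this would be applied to `labels` before they are assigned to `model_inputs["labels"]`. The logged run above did not use this masking, so its loss values include the padded positions.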


In [5]:
original_model = T5ForConditionalGeneration.from_pretrained("t5-small")
fine_tuned_model = T5ForConditionalGeneration.from_pretrained("./t5-custom-response")

def generate_response(model, input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

test_question = "explain: What is machine learning?"

original_output = generate_response(original_model, test_question)
fine_tuned_output = generate_response(fine_tuned_model, test_question)

print("Original Model Output:")
print(original_output)
print("\n\n\nFine-Tuned Model Output:")
print(fine_tuned_output)



Original Model Output:
Warum ist die Frage, wie machmach learning?



Fine-Tuned Model Output:
Machine learning is a subset of AI that enables systems to learn from data and improve without explicit programming.
