<a href="https://colab.research.google.com/github/yellowssnake/221_zlotin/blob/main/assets/Notebooks/TinyStories_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q transformers datasets accelerate nvidia-ml-py3



# Description

In this assignment, you will train a language model (LM) on the TinyStories dataset, focusing on optimizing model performance within the constraints of Google Colab's hardware. For the sake of speed, we will train on only a part of the dataset.

```
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun.
Beep was a healthy car because he always had good fuel....
```

Your objective is to maximize the size of the model without exceeding the available computational resources (~ 16GB VRAM). You could start with the Hugging Face Transformers library and experiment with various memory optimization techniques, such as (but not limited to):

 * Different batch size
 * Different optimizer
 * Gradient accumulation
 * Activation checkpointing
 * CPU offloading
 * 8-bit optimizers
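
Several of the techniques above correspond to `TrainingArguments` options in Transformers. The following is a minimal sketch rather than a tuned configuration: the values are illustrative assumptions, `memory_saving_kwargs` and `effective_batch_size` are hypothetical names introduced here, and `optim="adamw_bnb_8bit"` additionally assumes that the `bitsandbytes` package is installed.

```python
# Illustrative keyword arguments for transformers.TrainingArguments.
# The values are assumptions chosen for demonstration, not recommendations.
memory_saving_kwargs = {
    "per_device_train_batch_size": 2,  # smaller micro-batch, fewer activations held in memory
    "gradient_accumulation_steps": 4,  # accumulate gradients over 4 micro-batches per optimizer step
    "gradient_checkpointing": True,    # activation checkpointing: recompute activations in backward
    "optim": "adamw_bnb_8bit",         # 8-bit AdamW (assumes bitsandbytes is installed)
}

def effective_batch_size(kwargs, num_devices=1):
    """Number of samples that contribute to each optimizer step."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)

print(effective_batch_size(memory_saving_kwargs))  # 2 * 4 * 1 = 8, same as the baseline batch size
```

These could be passed to the trainer as `TrainingArguments(output_dir="./results", **memory_saving_kwargs)`.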

A baseline that trains a GPT-2 model is provided in this Colab notebook. You can easily switch it to opt-350m, opt-1.3b, a larger GPT-2 variant, etc. You can find a great beginner-level guide on the topic [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance).

```
A long time ago in a galaxy far far away... a little girl named Lily was playing in the garden. She was so excited! She wanted to explore the garden and see what was around her.
Suddenly, she heard a loud noise. Lily looked up and saw a big, hairy creature. Lily was so excited! She ran to the creature and grabbed it by the arm. The creature was so big and hairy that Lily couldn't help but laugh.
```

![](https://hse24.fmin.xyz/gpt2_generation.jpeg)

Fill in this table with your setups and observations.

| Setup | # of parameters | GPU peak memory, MB | Final eval loss | Batch Size | Time to run 5 epochs, s | Generation example | Comment |
|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|:---------:|
| Baseline (OPT-125M) | 125 M | 9044 | 1.928 | 8 | 442.34 | `A long time ago in a galaxy far far away... there was a little girl named Lily. She was three years old and loved to explore. One day, she decided to go for a walk in the park. Lily was so excited to go for a walk. She asked her mom, "What do you want to do?" Her mom smiled and said, "I want to explore the galaxy." Lily was so excited to explore the galaxy.` |  |
| Baseline (GPT2-S) | 124 M | 13016 | 2.001 | 8 | 487.75 | `A long time ago in a galaxy far far away... a little girl named Lily was playing in the garden. She was so excited! She wanted to explore the garden and see what was around her. Suddenly, she heard a loud noise. Lily looked up and saw a big, hairy creature. Lily was so excited! She ran to the creature and grabbed it by the arm. The creature was so big and hairy that Lily couldn't help but laugh.` | The generation seems more interesting, even though the eval loss is higher. |
|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |
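
One way to keep new rows consistent with the table is to generate them from the measured values. This is a small sketch: `format_row` is a hypothetical helper, and the example values are copied from the OPT-125M baseline row above, with the generation and comment cells left empty.

```python
def format_row(setup, params, peak_mem_mb, eval_loss, batch_size, time_s,
               generation="", comment=""):
    """Format one Markdown row in the column order of the results table."""
    cells = [
        setup,
        params,
        str(peak_mem_mb),
        f"{eval_loss:.3f}",
        str(batch_size),
        f"{time_s:.2f}",
        f"`{generation}`" if generation else "",  # generation text is shown as inline code
        comment,
    ]
    return "| " + " | ".join(cells) + " |"

print(format_row("Baseline (OPT-125M)", "125 M", 9044, 1.928, 8, 442.34))
```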

For each unique memory optimization trick, you will get 4 points (maximum 20 points). A combination of tricks does not count as a unique trick, but will probably be necessary to train big models. The maximum grade is bounded by the size of the trained model:

* If the model you train has <= 125M parameters, you can get a maximum of 8 points.
* If the model you train has between 126M and 350M parameters, you can get a maximum of 12 points.
* If the model you train has more than 350M but fewer than 1B parameters, you can get a maximum of 16 points.
* If you fit a model with 1B parameters or more, you can get a maximum of 20 points.
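
To see why the larger tiers require memory tricks, it helps to estimate the static training memory before any activations are stored. The sketch below assumes plain fp32 training with AdamW, i.e. roughly 16 bytes per parameter (4 for weights, 4 for gradients, 8 for the two optimizer moments). Activations, the CUDA context, and framework overhead are ignored, so real usage is higher. `static_training_memory_gb` is a hypothetical helper name.

```python
# Assumption: full fp32 training with AdamW.
BYTES_PER_PARAM_FP32_ADAMW = 4 + 4 + 8  # weights + gradients + two AdamW moments

def static_training_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP32_ADAMW):
    """Rough lower bound on training memory in GB (10**9 bytes), excluding activations."""
    return num_params * bytes_per_param / 1e9

for name, n in [("opt-125m", 125e6), ("opt-350m", 350e6), ("opt-1.3b", 1.3e9)]:
    print(f"{name}: ~{static_training_memory_gb(n):.1f} GB")
```

Under these assumptions, a 1.3B-parameter model needs about 20.8 GB for weights, gradients, and optimizer state alone, which already exceeds ~16 GB. The baseline's measured peak of 9044 MB for OPT-125M, compared with the ~2 GB estimated here, suggests that activations account for much of the memory at batch size 8 and sequence length 512.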

# Baseline

In [None]:
import torch

torch.cuda.synchronize()
torch.cuda.empty_cache()  # Clears the cache
torch.cuda.reset_peak_memory_stats()  # Resets the peak memory statistics

from transformers import AutoModelForCausalLM, AutoTokenizer, \
    TrainingArguments, Trainer, logging, DataCollatorForLanguageModeling
from datasets import load_dataset
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

print("💎 Before training:")
print_gpu_utilization()

# Suppress less critical logs
logging.set_verbosity_error()

# Load the dataset with both training and evaluation splits
train_dataset = load_dataset("roneneldan/TinyStories", split="train[:500]")
eval_dataset = load_dataset("roneneldan/TinyStories", split="train[500:1000]")

HF_cardname = "openai-community/gpt2"
# HF_cardname = "openai-community/gpt2-medium"
# HF_cardname = "openai-community/gpt2-large"
# HF_cardname = "openai-community/gpt2-xl"
# HF_cardname = "facebook/opt-125m"
# HF_cardname = "facebook/opt-350m"
# HF_cardname = "facebook/opt-1.3b"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(HF_cardname, use_fast=False)
EOS_TOKEN = tokenizer.eos_token

# Ensure the tokenizer has a padding token, set EOS_TOKEN as padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = EOS_TOKEN

# Function to process the dataset
def formatting_func(examples):
    # Append EOS to every story, then truncate/pad each one to exactly 512 tokens
    encodings = [
        tokenizer(text + EOS_TOKEN, truncation=True, max_length=512,
                  padding="max_length", return_tensors="pt")
        for text in examples['text']
    ]
    input_ids = [enc['input_ids'].squeeze() for enc in encodings]
    # For causal LM training, the labels are a copy of the input ids
    return {'input_ids': input_ids, 'labels': input_ids}

# Process the datasets
processed_train_dataset = train_dataset.map(formatting_func, batched=True, remove_columns=["text"])
processed_eval_dataset = eval_dataset.map(formatting_func, batched=True, remove_columns=["text"])

# Initialize Data Collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

print("💎 Dataset loaded")
print_gpu_utilization()

# Define and load the model
model = AutoModelForCausalLM.from_pretrained(HF_cardname)

print("💎 Model loaded")
print_gpu_utilization()

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
)

# Initialize the trainer with the data collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_eval_dataset,
    data_collator=data_collator
)

# Start training
trainer.train()

def print_summary(trainer):
    # Access training result metrics directly from the trainer state
    print(f"💎 Training time: {trainer.state.log_history[-1]['train_runtime']:.2f} seconds")
    print(f"Samples/second: {trainer.state.log_history[-1]['train_samples_per_second']:.2f}")
    print_gpu_utilization()
    num_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
    print(f"Total Trainable Parameters: {num_params}")

print_summary(trainer)

💎 Before training:
GPU memory occupied: 1978 MB.


Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.


💎 Dataset loaded
GPU memory occupied: 1978 MB.
💎 Model loaded
GPU memory occupied: 1978 MB.
{'eval_loss': 2.0613200664520264, 'eval_runtime': 24.2955, 'eval_samples_per_second': 20.58, 'eval_steps_per_second': 2.593, 'epoch': 1.0}
{'eval_loss': 2.0225725173950195, 'eval_runtime': 24.3054, 'eval_samples_per_second': 20.572, 'eval_steps_per_second': 2.592, 'epoch': 2.0}
{'eval_loss': 2.006747007369995, 'eval_runtime': 24.1332, 'eval_samples_per_second': 20.718, 'eval_steps_per_second': 2.611, 'epoch': 3.0}
{'eval_loss': 2.0053534507751465, 'eval_runtime': 24.5421, 'eval_samples_per_second': 20.373, 'eval_steps_per_second': 2.567, 'epoch': 4.0}
{'eval_loss': 2.0014288425445557, 'eval_runtime': 24.1794, 'eval_samples_per_second': 20.679, 'eval_steps_per_second': 2.606, 'epoch': 5.0}
{'train_runtime': 487.7478, 'train_samples_per_second': 5.126, 'train_steps_per_second': 0.646, 'train_loss': 2.1369303385416667, 'epoch': 5.0}
💎 Training time: 487.75 seconds
Samples/second: 5.13
GPU memory oc
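
The "Final eval loss" column of the table can be read from `trainer.state.log_history`, the same list that `print_summary` indexes. A minimal sketch, assuming the history is a list of dicts shaped like the log lines above; `final_eval_loss` is a hypothetical helper, and the sample entries are abbreviated copies of that output.

```python
def final_eval_loss(log_history):
    """Return the eval_loss of the most recent evaluation entry, or None if there is none."""
    for entry in reversed(log_history):
        if "eval_loss" in entry:
            return entry["eval_loss"]
    return None

# Abbreviated copies of the last three log entries printed above.
sample_history = [
    {"eval_loss": 2.0053534507751465, "epoch": 4.0},
    {"eval_loss": 2.0014288425445557, "epoch": 5.0},
    {"train_runtime": 487.7478, "train_loss": 2.1369303385416667, "epoch": 5.0},
]

print(f"{final_eval_loss(sample_history):.3f}")  # 2.001
```

With a live trainer, the call would be `final_eval_loss(trainer.state.log_history)`.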

In [None]:
# Encode the prompt text
input_ids = tokenizer.encode(
    "A long time ago in a galaxy far far away...",
    return_tensors="pt").cuda()

# Generate text using the model
output_ids = model.generate(input_ids, max_length=100)

# Decode the generated ids to text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)

A long time ago in a galaxy far far away... a little girl named Lily was playing in the garden. She was so excited! She wanted to explore the garden and see what was around her.

Suddenly, she heard a loud noise. Lily looked up and saw a big, hairy creature. Lily was so excited! She ran to the creature and grabbed it by the arm. The creature was so big and hairy that Lily couldn't help but laugh. 

Lily was so
