# Lesson 5: Training Cycle for Large Language Models (LLMs)
This lesson covers the complete training cycle for LLMs, from preparing data to monitoring a run.

*  Pretraining is very expensive! Check costs carefully before starting a pretraining project.
*  You can get a rough estimate of your training job cost using [this calculator](https://huggingface.co/training-cluster) from Hugging Face. For training on other infrastructure, e.g. AWS or Google Cloud, consult those providers for up-to-date cost estimates.

## Training Cycle Overview:
The training process can be visualized as a four-step cycle:
1. **Data Preparation**: Organize and preprocess datasets. This was done in an earlier lesson, and the dataset was saved.
2. **Hyperparameter Configuration**: Define and tune the training parameters for the model created earlier.
3. **Training**: Execute the training loop using the `Trainer` class.
4. **Monitoring**: Evaluate the model and use logging tools to monitor and debug training performance.


    
### **1. Data Preparation**

The first step in training a language model is preparing your dataset. Hugging Face's `Dataset` class provides tools to handle and preprocess data effectively.

- **Key Functions:**
  - `__len__()`: Returns the number of samples in the dataset (this is what `len()` calls).
  - `__getitem__()`: Returns an individual sample by index, typically containing `input_ids` (tokenized text) and `labels`.

- **Steps:**
  - Load your dataset using a library like Hugging Face's `datasets`.
  - Tokenize the text using a tokenizer compatible with your model architecture (e.g., GPT or LLaMA).
  - Format the dataset to include input features (`input_ids`) and labels for supervised learning tasks.
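
The formatting step above can be sketched with plain Python lists instead of the `datasets` library. The helper `pack_tokens`, the class `ToyPackedDataset`, and the token IDs are all hypothetical and only illustrate how a flat token stream becomes fixed-length `input_ids` with matching `labels`.

```python
from typing import Dict, List


def pack_tokens(token_ids: List[int], max_seq_length: int) -> List[List[int]]:
    """Split a flat token stream into full fixed-length chunks.

    The trailing remainder is dropped so that every sample has the same length.
    """
    n_full = len(token_ids) // max_seq_length
    return [
        token_ids[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(n_full)
    ]


class ToyPackedDataset:
    """Hypothetical stand-in for a map-style dataset."""

    def __init__(self, token_ids: List[int], max_seq_length: int):
        self.samples = pack_tokens(token_ids, max_seq_length)

    def __len__(self) -> int:
        # Number of packed samples
        return len(self.samples)

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        ids = self.samples[idx]
        # For next-word prediction the labels are a copy of the inputs
        return {"input_ids": ids, "labels": list(ids)}


# Made-up token IDs 0..9 packed into sequences of length 4
toy = ToyPackedDataset(token_ids=list(range(10)), max_seq_length=4)
print(len(toy))  # 2 full samples; the last two tokens are dropped
print(toy[1])
```

Dropping the remainder keeps every sample the same length, which avoids padding; a real pipeline may prefer to carry the leftover tokens into the next chunk.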

---

### **2. Hyperparameter Configuration**

Hyperparameters define how your training process operates and significantly influence model performance.
- Configure these parameters in the `TrainingArguments` class.
- **Key Hyperparameters:**
  - **Batch Size:** Number of samples processed together in one forward/backward pass.
  - **Learning Rate (LR):** Determines how much to adjust weights during optimization.
  - **Warmup Steps:** Gradually increases the learning rate at the start of training to stabilize updates.
  - **Weight Decay:** Regularization term to prevent overfitting.
  - **Learning Rate Scheduler:** Dynamically adjusts the learning rate during training.

- **Configuration Considerations:**
  - Adjust hyperparameters based on hardware limitations and model size.
  - Use pre-defined configurations from academic literature or previous experiments as a starting point.
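
To make the interaction between warmup steps and a linear scheduler concrete, here is a minimal sketch in plain Python. The function name `linear_schedule_lr` is hypothetical and the numbers are illustrative; it assumes the learning rate rises linearly from zero during warmup and then decays linearly to zero at the last step.

```python
def linear_schedule_lr(step: int, base_lr: float,
                       warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale up from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative values: peak LR 5e-5, 10 warmup steps, 30 steps in total
for step in (0, 5, 10, 20, 30):
    lr = linear_schedule_lr(step, base_lr=5e-5, warmup_steps=10, total_steps=30)
    print(f"step {step:2d}: lr = {lr:.2e}")
```

The peak learning rate is reached exactly at the end of warmup, so very short runs spend a large share of their steps below the configured value.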

---

### **3. Model Training**

The Hugging Face `Trainer` class simplifies the training process by integrating data, models, and hyperparameters.

- **Inputs:**
  - **Data:** Tokenized datasets for training and evaluation.
  - **Model:** Pre-trained or randomly initialized transformer model.
  - **Config:** Hyperparameter configurations defined earlier.

- **Steps:**
  - Initialize your model using pre-trained weights or random initialization.
  - Define a `Trainer` instance with your model, datasets, and training arguments.
  - Begin training while monitoring metrics like loss and validation accuracy.

---

### **4. Monitoring**

Monitoring is essential to ensure that your training process is progressing as expected and to identify potential issues early.

- **Metrics to Monitor:**
  - **Loss:** Indicates how well the model is learning.
  - **Validation Accuracy:** Measures generalization capability.
  - **Resource Usage:** Tracks GPU/CPU utilization and memory consumption.

- **Tools for Monitoring:**
  - Built-in logging features in frameworks like Hugging Face's `Trainer`.
  - External tools like TensorBoard or Weights & Biases for advanced visualization and analysis.
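
Raw loss values can be hard to interpret on their own. One common companion metric is perplexity, the exponential of the mean cross-entropy loss (measured in nats). The sketch below uses made-up loss values, and the helper name `perplexity` is hypothetical.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_loss)


# Made-up loss values, for illustration only
made_up_losses = [4.0, 3.5, 3.0]
for step, loss in enumerate(made_up_losses, start=1):
    print(f"step {step}: loss = {loss:.2f}, perplexity = {perplexity(loss):.1f}")
```

A falling loss always means a falling perplexity; the conversion only changes the scale, which some people find easier to compare across runs.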

---
**Next Steps**:
- Experiment with different hyperparameter configurations to improve model performance.
- Explore advanced optimization techniques such as mixed precision (`fp16`) and gradient checkpointing for memory efficiency.



In [None]:
import os
import subprocess
from typing import Tuple

def setup_environment(curr_proj_folder: str = "pretraining-llms",
                      google_drive_base_folder: str = "Colab Notebooks",
                      run_remote: bool = True, use_gpu: bool = True) -> Tuple[str, bool]:
    """
    Sets up the environment for running code, handling local and remote execution.

    Args:
        curr_proj_folder (str, optional): Folder name of the current project. Defaults to "pretraining-llms".
        google_drive_base_folder (str, optional): Folder name of the Google Drive base folder. Defaults to "Colab Notebooks".
        run_remote (bool, optional): Whether the code runs remotely in Google Colab (mounts Google Drive). Defaults to True.
        use_gpu (bool, optional): Whether to use GPU if available. Defaults to True.

    Returns:
        Tuple[str, bool]: (computed path_to_scripts, mount_success status)
    """
    # Initialize mount status for Colab
    mount_success = False
    # Remote run code
    if run_remote:
      from google.colab import drive
      # Mount Google Drive
      drive.mount('/content/drive')
      # Check if the mount was successful
      if os.path.ismount('/content/drive'):
        print("Google Drive mounted successfully!")
        mount_success = True
      else:
        print("Drive mount failed.")
      # By Default, this is complete mount path
      mount_path = '/content/drive/MyDrive'

      # complete path to current files
      path_to_scripts = os.path.join(mount_path, google_drive_base_folder,curr_proj_folder)
      # Create the directory if it doesn't exist
      if not os.path.exists(path_to_scripts):
        os.makedirs(path_to_scripts)
      # Change to the project path
      os.chdir(path_to_scripts)
      print(f'Running code in path {os.getcwd()}')
    # Local Run
    else:
      path_to_scripts  = os.getcwd()
      # folder name provided as argument should match the one existing
      assert os.path.basename(path_to_scripts ) == curr_proj_folder, \
          f"Folder Name Mismatch: {os.path.basename(path_to_scripts )} != {curr_proj_folder}"
      print(f'Running code in path {path_to_scripts }')
    # check GPU usage
    if use_gpu:
      try:
        gpu_info = subprocess.check_output("nvidia-smi", shell=True).decode('utf-8')
        print("******GPU is available and will be used:**********")
        print(gpu_info)
      except subprocess.SubprocessError:
        print("GPU check failed (nvidia-smi not found or no GPU available). Falling back to CPU.")
        use_gpu = False  # Force CPU usage if GPU check fails
    else:
        print("******use_gpu is set to False. Using CPU******")
    return  path_to_scripts,mount_success

Always set the following parameters as needed before each run:

In [None]:
# Project-specific configuration parameters
# Specifies the current project folder name
curr_proj_folder = "pretraining-llms"
# Base folder name in Google Drive where notebooks are stored
google_drive_base_folder = "Colab Notebooks"
# Flag to determine whether to use GPU for computations
use_gpu = True
# Flag to indicate remote execution environment
run_remote = True
# Flag to control model loading from a specific folder or through URL
# NOTE: we will be using pretrained model saved locally
load_model_from_folder = True

if run_remote:
  run_local = False
  run_local_usingColab = False
else:
  run_local = False
  run_local_usingColab = not run_local

# call method to setup environment
path_to_scripts,mount_success = setup_environment(curr_proj_folder = curr_proj_folder, \
                                   google_drive_base_folder =  google_drive_base_folder,\
                                    run_remote = run_remote, use_gpu = use_gpu)

In [None]:
from IPython.display import display
from PIL import Image
image_path = os.path.join(path_to_scripts,"images","lesson5_train_cycle.jpg")
img = Image.open(image_path)
display(img)

In [None]:
import warnings
warnings.filterwarnings('ignore')
# check torch versions installed
import torch
print("PyTorch version:", torch.__version__)
if use_gpu:
    print("CUDA runtime version:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())

### 1. Load the model to be trained

Load the upscaled model we saved in **//saved_model//TinySolar-308m-4k-init** from the previous lesson:


In [None]:
from transformers import AutoModelForCausalLM
model_name = "TinySolar-308m-4k-init"
# Load the model from the local folder where it was saved
model_path_or_name = os.path.join(path_to_scripts ,"savedModel",model_name)
device_map = "auto" if torch.cuda.is_available() else "cpu"
pretrained_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)

In [None]:
pretrained_model

### 2. Load dataset

* Here we implement two methods on a `Dataset` subclass so that it can interface with the trainer.

* These are used when we specify the dataset created in Lesson 3 as the training data, stored in **.//saved_pretrain_cleaned_data//packaged_pretrained_dataset.json**.



In [None]:
!pip install datasets
import datasets
from torch.utils.data import Dataset
dataset_path = ".//saved_pretrain_cleaned_data//packaged_pretrained_dataset.json"

class CustomDataset(Dataset):
    def __init__(self, args, split="train"):
        """Initializes the custom dataset object."""
        self.args = args
        # Load the dataset from a JSON file
        self.dataset = datasets.load_dataset(
            "json", data_files=args.dataset_name, split=split)

    def __len__(self):
        """Returns the number of samples in the dataset."""
        # PyTorch uses this method to determine how many batches are needed during training.
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Returns the input and label tokens for the example at the given index
        as a dictionary, which lets PyTorch access individual data points by index.
        """
        # Convert the lists to LongTensors for PyTorch.
        # In next-word prediction (language modeling), the labels are the same as the input IDs.
        # "input_ids": The tokenized input sequence.
        # "labels": The same input sequence, used as the target output for training.
        input_ids = torch.LongTensor(self.dataset[idx]["input_ids"])
        labels = torch.LongTensor(self.dataset[idx]["input_ids"])
        # Return the sample as a dictionary

        return {"input_ids": input_ids, "labels": labels}

### 3. Configure Training Arguments

Here, we set up the training run. The training dataset you created in Lesson 3 is specified in the dataset configuration section.
Some common parameters to adjust are:
* **optim**: The training optimizer. `adamw_torch` is commonly used for LLMs.
* **max_steps**: The maximum number of training steps.
  In pretraining, it is common practice to set `num_train_epochs = 1` instead of setting `max_steps`, so that all of the data is processed exactly once.
* **per_device_train_batch_size**: **A rule of thumb is to use the largest batch size that fits in memory.** For example,
  the dataset created earlier has a maximum sequence length of 32 tokens, so a batch size of 2 gives 2 × 32 = 64 tokens per device per step.
  With 8 GPUs available, that is 8 × 64 = 512 tokens per training step.
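
The batch-size arithmetic above can be checked with a few lines of Python. The helper name `tokens_per_step` is hypothetical, and the `gradient_accumulation_steps` factor is an added assumption: it counts tokens per optimizer update when gradients are accumulated over several forward/backward passes.

```python
def tokens_per_step(max_seq_length: int, per_device_batch_size: int,
                    num_devices: int = 1,
                    gradient_accumulation_steps: int = 1) -> int:
    """Tokens consumed per optimizer update across all devices."""
    return (max_seq_length * per_device_batch_size
            * num_devices * gradient_accumulation_steps)


# One device: 32 tokens per sequence, 2 sequences per batch
print(tokens_per_step(32, 2))                 # 64
# Eight devices with the same per-device batch size
print(tokens_per_step(32, 2, num_devices=8))  # 512
```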

In [None]:
from dataclasses import dataclass, field
import transformers

@dataclass
class CustomArguments(transformers.TrainingArguments):
    dataset_name: str = field(                           # Dataset configuration
        default="./packaged_pretrained_dataset.json")
    num_proc: int = field(default=1)                     # Number of subprocesses for data preprocessing
    max_seq_length: int = field(default=32)              # Maximum sequence length

    # Core training configurations
    seed: int = field(default=0)                         # Random seed for initialization, ensuring reproducibility
    optim: str = field(default="adamw_torch")            # Optimizer, here it's AdamW implemented in PyTorch
    max_steps: int = field(default=30)                   # Number of maximum training steps
    per_device_train_batch_size: int = field(default=2)  # Batch size per device during training

    # Other training configurations
    learning_rate: float = field(default=5e-5)           # Initial learning rate for the optimizer
    weight_decay: float = field(default=0)               # Weight decay
    warmup_steps: int = field(default=10)                # Number of steps for the learning rate warmup phase
    lr_scheduler_type: str = field(default="linear")     # Type of learning rate scheduler
    gradient_checkpointing: bool = field(default=True)   # Enable gradient checkpointing to save memory
    dataloader_num_workers: int = field(default=2)       # Number of subprocesses for data loading
    bf16: bool = field(default=True)                     # Use bfloat16 precision for training on supported hardware
    gradient_accumulation_steps: int = field(default=1)  # Number of steps to accumulate gradients before updating model weights

    # Logging configuration
    logging_steps: int = field(default=3)                # Frequency of logging training information
    report_to: str = field(default="none")               # Destination for logging (e.g., WandB, TensorBoard)

    # Saving configuration: for intermediate checkpoints
    save_strategy: str = field(default="steps")          # Can be replaced with "epoch"
    save_steps: int = field(default=3)                   # Frequency of saving training checkpoint
    save_total_limit: int = field(default=2)             # The total number of checkpoints to be saved

### Create the Hugging Face Argument Parser

The parser reads the input arguments, including the output directory where the model will be saved. The `HfArgumentParser` class automatically generates the necessary `argparse` arguments from the dataclass structure.

`args, = parser.parse_args_into_dataclasses(args=["--output_dir", "output"])` does the following:  
Instead of reading arguments from the command line (`sys.argv`), it uses the explicitly provided list `["--output_dir", "output"]`.  
`--output_dir` is needed because the `CustomArguments` class inherits from `transformers.TrainingArguments`, which requires an `output_dir` parameter.  

It parses these arguments and converts them into an instance of your `CustomArguments` dataclass. The trailing comma in `args, =` unpacks the single-element tuple that the method returns.
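
To illustrate the idea of generating `argparse` arguments from a dataclass, here is a simplified sketch using only the standard library. It is not how `HfArgumentParser` is implemented; `ToyArguments` and `parse_into_dataclass` are hypothetical names, and only simple `str` and `int` fields are handled.

```python
import argparse
from dataclasses import MISSING, dataclass, fields
from typing import List


@dataclass
class ToyArguments:
    output_dir: str                    # No default, so it must be supplied
    dataset_name: str = "./data.json"  # Hypothetical default path
    max_steps: int = 30


def parse_into_dataclass(cls, args: List[str]):
    """Build one --flag per dataclass field, then parse an explicit list."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        if f.default is MISSING:
            # Fields without a default become required flags
            parser.add_argument(f"--{f.name}", type=f.type, required=True)
        else:
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    # Passing a list means sys.argv is not consulted
    namespace = parser.parse_args(args)
    return cls(**vars(namespace))


parsed = parse_into_dataclass(
    ToyArguments, ["--output_dir", "output", "--max_steps", "10"])
print(parsed)
```

Parsing from an explicit list is what makes this pattern usable inside a notebook, where `sys.argv` contains the kernel's own arguments.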

In [None]:
parser = transformers.HfArgumentParser(CustomArguments)
# Pass the dataset path to the CustomArguments object, overriding the default
args, = parser.parse_args_into_dataclasses(
    args=["--output_dir", "output",
          "--dataset_name", dataset_path])

In [None]:
print(f"Dataset path: {args.dataset_name}")

As we pass arguments to the custom dataset, it will be configured as needed for the trainer.

In [None]:
train_dataset = CustomDataset(args=args)

Check the shape of the dataset:

In [None]:
print("Input shape: ", train_dataset[0]['input_ids'].shape)

### 4. Run the trainer and monitor the loss

First, set up a callback to log the loss values during training (note this cell is not shown in the video):

In [None]:
from transformers import Trainer, TrainingArguments, TrainerCallback

# Define a custom callback to log the loss values
class LossLoggingCallback(TrainerCallback):
    def __init__(self):
        # Collected log dictionaries, one per logging event
        self.logs = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Called by the Trainer each time it logs metrics
        if logs is not None:
            self.logs.append(logs)

# Initialize the callback
loss_logging_callback = LossLoggingCallback()

Then, create an instance of the Hugging Face `Trainer` object from the `transformers` library. Call the `train()` method of the trainer to start the training run:

In [None]:
import os
from tqdm import tqdm

# Update training arguments for a short demonstration run
args.num_train_epochs = 5  # Number of epochs (ignored while max_steps is positive)
args.max_steps = 10  # Maximum number of steps; takes precedence over num_train_epochs

trainer = Trainer(
    model=pretrained_model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=None,
    callbacks=[loss_logging_callback]
)

# Instead of manual loop, call the train method directly:
trainer.train() # This will handle the training loop internally
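
After training, the dictionaries collected by a logging callback can be summarized in plain Python. The sketch below assumes each training log entry stores the loss under a `"loss"` key and that summary entries lack it; the exact keys depend on the installed `transformers` version. The log entries are made up, and the helper names are hypothetical.

```python
from typing import Dict, List


def extract_losses(logs: List[Dict[str, float]]) -> List[float]:
    """Keep only entries that report a training loss."""
    return [entry["loss"] for entry in logs if "loss" in entry]


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; early points use the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up log entries; the last one mimics an end-of-training summary
example_logs = [
    {"loss": 4.0, "epoch": 0.1},
    {"loss": 3.0, "epoch": 0.2},
    {"loss": 2.0, "epoch": 0.3},
    {"train_runtime": 12.3},
]
losses = extract_losses(example_logs)
print(losses)                            # [4.0, 3.0, 2.0]
print(moving_average(losses, window=2))  # [4.0, 3.5, 2.5]
```

In the notebook, the same helpers could be applied to `loss_logging_callback.logs` once `trainer.train()` has finished.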

You can use the code below to save intermediate model checkpoints in your own training run:

In [None]:
# Saving configuration
    # save_strategy: str = field(default="steps")          # Can be replaced with "epoch"
    # save_steps: int = field(default=3)                   # Frequency of saving training checkpoint
    # save_total_limit: int = field(default=2)             # The total number of checkpoints to be saved
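
To reason about which checkpoint folders remain on disk, the following simplified model of step-based saving with a total limit may help. It is a sketch and not the `Trainer` implementation: the function name `surviving_checkpoints` is hypothetical, and whether a final checkpoint is written at the last step is treated as an assumption controlled by the `save_at_end` flag.

```python
from typing import List


def surviving_checkpoints(max_steps: int, save_steps: int,
                          save_total_limit: int,
                          save_at_end: bool = False) -> List[str]:
    """Names of the checkpoints left after older ones are rotated out.

    Assumes save_total_limit >= 1.
    """
    # A checkpoint is written every save_steps steps
    saved = list(range(save_steps, max_steps + 1, save_steps))
    # Assumption: optionally write one more checkpoint at the final step
    if save_at_end and (not saved or saved[-1] != max_steps):
        saved.append(max_steps)
    # Only the most recent save_total_limit checkpoints are kept
    kept = saved[-save_total_limit:]
    return [f"checkpoint-{step}" for step in kept]


print(surviving_checkpoints(max_steps=10, save_steps=3, save_total_limit=2))
print(surviving_checkpoints(max_steps=10, save_steps=3, save_total_limit=2,
                            save_at_end=True))
```

Checking the contents of the output folder after a run is the reliable way to confirm which checkpoints exist.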

### Checking the performance of an intermediate checkpoint

Below, you can try generating text using an intermediate checkpoint of the model. This checkpoint was saved after 10,000 training steps.  
As in previous lessons, we use the Solar tokenizer and then set up a `TextStreamer` object to display the text as it is generated:

In [None]:
# load tokenizer
from transformers import AutoTokenizer, TextStreamer
model_name = "TinySolar-248m-4k"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/')
# Use the local copy if load_model_from_folder is set; otherwise use the Hugging Face Hub name
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)

In [None]:
# load checkpoint
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

checkpoint_name = "checkpoint-10"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage","output",model_name).replace('\\', '/')
# Use the local checkpoint if load_model_from_folder is set; otherwise fall back to the remote path
checkPoint_path_or_name = os.path.join(path_to_scripts ,"output",checkpoint_name) if load_model_from_folder else upstage_path
print(f'Using checkpoint from path {checkPoint_path_or_name}')
model = AutoModelForCausalLM.from_pretrained(
    checkPoint_path_or_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)


In [None]:
# run inference
prompt = "I am an engineer. I love"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=64,
    do_sample=True,
    temperature=1.0,
)

### Result
`when people ask me anything like that, so, for me, my passion to become the next leader and not just of mine was really important.
One question I have for...`

Notes:
* **It's better than before** in that it is **not repeating itself again and again**.
* It doesn't make a ton of sense, as this small model cannot generate long stretches of coherent text.


In [None]:
if run_remote:
  from google.colab import drive
  drive.flush_and_unmount()
  print('All changes made in this colab session should now be visible in Drive.')