# Lesson 4: Preparing Your Model for Training

This lesson focuses on **configuring and initializing language models for training**, with an emphasis on weight initialization strategies and model architecture manipulation.

## Key Objectives:
1. **Environment Configuration**: Set up a local or remote (Colab) run, with or without a GPU
2. **Model Configuration**
   - Use `LlamaConfig` to set up model architecture
   - Customize parameters like hidden layers, hidden size, and attention heads
3. **Weight Initialization Strategies**
   - Random initialization
   - Reusing pretrained model weights
   - Downscaling from larger models
   - Depth upscaling of smaller models
4. **Model Manipulation Techniques**
   - Layer removal for downscaling
   - Layer duplication and concatenation for upscaling
5. **Practical Implementation**
   - Loading and saving models using Hugging Face's Transformers library
   - Handling GPU allocation and memory management
   - Tokenizer configuration and text generation
6. **Transfer Learning Approaches**
   - Continued pre-training
   - Fine-tuning strategies

## Next Steps:
- Implement fine-tuning on task-specific datasets.
- Explore advanced techniques like mixed-precision training and gradient checkpointing.


In [None]:
import os
import subprocess
from typing import Tuple

def setup_environment(curr_proj_folder: str = "pretraining-llms", google_drive_base_folder: str = "Colab Notebooks",\
                      run_remote: bool= True, use_gpu: bool = True) -> Tuple[str, bool]:
    """
    Sets up the environment for running code, handling local and remote execution.

    Args:
        curr_proj_folder (str, optional): Folder name of the current project. Defaults to "pretraining-llms".
        google_drive_base_folder (str, optional): Folder name of the Google Drive base folder. Defaults to "Colab Notebooks".
        run_remote (bool, optional): Whether to run remotely in Colab and mount Google Drive. Defaults to True.
        use_gpu (bool, optional): Whether to use GPU if available. Defaults to True.

    Returns:
        Tuple[str,bool]: (computed path_to_scripts,mount_success status)
    """
    # Initialize mount status for Colab
    mount_success = False
    # Remote run code
    if run_remote:
      from google.colab import drive
      # Mount Google Drive
      drive.mount('/content/drive')
      # Check if the mount was successful
      if os.path.ismount('/content/drive'):
        print("Google Drive mounted successfully!")
        mount_success = True
      else:
        print("Drive mount failed.")
      # By Default, this is complete mount path
      mount_path = '/content/drive/MyDrive'

      # complete path to current files
      path_to_scripts = os.path.join(mount_path, google_drive_base_folder,curr_proj_folder)
      # Create the directory if it doesn't exist
      if not os.path.exists(path_to_scripts):
        os.makedirs(path_to_scripts)
        # change to the path
      os.chdir(path_to_scripts)
      print(f'Running code in path {os.getcwd()}')
    # Local Run
    else:
      path_to_scripts  = os.getcwd()
      # folder name provided as argument should match the one existing
      assert os.path.basename(path_to_scripts ) == curr_proj_folder, \
          f"Folder Name Mismatch: {os.path.basename(path_to_scripts )} != {curr_proj_folder}"
      print(f'Running code in path {path_to_scripts }')
    # check GPU usage
    if use_gpu:
      try:
        gpu_info = subprocess.check_output("nvidia-smi", shell=True).decode('utf-8')
        print("******GPU is available and will be used:**********")
        print(gpu_info)
      except subprocess.SubprocessError:
        print("GPU check failed (nvidia-smi not found or no GPU available). Falling back to CPU.")
        use_gpu = False  # Force CPU usage if GPU check fails
    else:
        print("******use_gpu is set to False. Using CPU******")
    return  path_to_scripts,mount_success

Always set the following parameters as needed before each run:

In [None]:
# Project-specific configuration parameters
# Specifies the current project folder name
curr_proj_folder = "pretraining-llms"
# Base folder name in Google Drive where notebooks are stored
google_drive_base_folder = "Colab Notebooks"
# Flag to determine whether to use GPU for computations
use_gpu = True
# Flag to indicate remote execution environment
run_remote = False
# Flag to control model loading from a specific folder or through URL
load_model_from_folder = False

if run_remote:
  run_local = False
  run_local_usingColab = False
else:
  run_local = False
  run_local_usingColab = not run_local

# call method to setup environment
path_to_scripts,mount_success = setup_environment(curr_proj_folder = curr_proj_folder, \
                                   google_drive_base_folder =  google_drive_base_folder,\
                                    run_remote = run_remote, use_gpu = use_gpu)

### Decoder Only Transformer Architecture

* The decoder-only architecture is used in models like Llama and is autoregressive: each token is predicted from the tokens before it.




In [None]:
from IPython.display import display
from PIL import Image
image_path = os.path.join(path_to_scripts,"images","decoderOnly.jpg")
img = Image.open(image_path)
display(img)

Each layer processes the input sequence and passes information to the next layer. The final layer produces output probabilities for the next token.



## Key Components

- **Stack of Decoder Blocks**:  
  The model is composed of multiple stacked decoder layers.

Each decoder block contains the following elements:

1. **Linear Transformations**:  
   Used to project input data into different feature spaces for better representation.

2. **Self-attention Mechanism**:  
   - Computes attention scores to focus on relevant parts of the input.
   - Uses masked self-attention to prevent looking at future tokens, ensuring autoregressive behavior.

3. **Layer Normalization**:  
   - Stabilizes training and speeds up convergence by normalizing activations.

4. **Feed-forward Layers**:  
   - Consist of fully connected layers that apply non-linear transformations to enhance the model’s expressiveness.


- **Classifier Layer**:  
  The output of the last decoder block is passed through a classifier layer to predict the most probable next token.

- **Number of Layers (n)**:  
  The architecture can have multiple decoder blocks, where "n" determines the depth of the model.
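
To make the masked self-attention item concrete, here is a minimal pure-Python sketch of a causal mask and a masked softmax over one row of attention scores. The function names `causal_mask` and `masked_softmax` and the toy scores are illustrative assumptions, not part of any library.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to masked positions."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Toy attention scores for the token at position 1 (hypothetical values).
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[1])
print(mask[1])   # [True, True, False, False]
print(weights)   # positions 2 and 3 receive zero weight
```

Because future positions get zero weight, the prediction at position 1 depends only on positions 0 and 1, which is what makes the model autoregressive.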




In [None]:
image_path = os.path.join(path_to_scripts,"images","solar10_7_B_Pretrain.jpg")
img = Image.open(image_path)
display(img)

* Here, the starting point was the English-language SOLAR model, which was **further pretrained to extend it to Korean**.
* **The 200B training tokens were a mix of Korean and English.**
* The price is 0.2M: expensive, but much cheaper than training from scratch.
* **The model has 10.7B parameters, up from the 7B-parameter English-language model.**


In [None]:
# Ignore insignificant warnings (ex: deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

# Set a seed value for reproducibility
import torch

def fix_torch_seed(seed=42):
    """
    Fix random seed for reproducibility across all PyTorch operations
    Args:
    seed (int): Seed value to use for random number generation
    """
    # Set seed for CPU-based random number generation
    # Affects:
    # - Initial weight randomization in model layers
    # - Data shuffling operations
    # - Any CPU-based random number generation in PyTorch
    torch.manual_seed(seed)
    # Set seed for CUDA (GPU) random number generation
    # Important for:
    # - GPU-based dropout layers
    # - GPU-accelerated matrix operations with randomness
    # - Ensures reproducibility when using CUDA-enabled devices
    torch.cuda.manual_seed(seed)
    # Force cuDNN to use deterministic algorithms
    # Tradeoffs:
    # - May slightly reduce performance (~10-20%)
    # - Ensures reproducibility in convolution operations
    # - Required for exact reproducibility of results
    torch.backends.cudnn.deterministic = True
    # Disable cuDNN benchmarking optimization
    # Why:
    # - Benchmarking automatically selects fastest algorithms
    # - Different algorithms may be selected across runs
    # - Disabling ensures consistent algorithm selection
    torch.backends.cudnn.benchmark = False
# Initialize the seed configuration with default seed value
# Note: Must be called before any model initialization or data loading
fix_torch_seed()

## 1. Model configuration

*  Configure models based on Meta's Llama family.

Let's create a `LlamaConfig` object to configure the architecture:

In [None]:
from transformers import LlamaConfig
config = LlamaConfig()
print(config)

Let's update parameters to create a smaller, more manageable model:

In [None]:
config.num_hidden_layers = 12      # reduced from 32 to 12
config.hidden_size = 1024          # reduced to 1/4 (from 4096 to 1024)
config.intermediate_size = 4096    # reduced to about 1/3 (from 11008 to 4096; dimension of MLP representations)
config.num_key_value_heads = 8     # reduced to 1/4 (from 32 to 8; defaults to num_attention_heads=32)
config.torch_dtype = "bfloat16"    # for half-precision training
config.use_cache = False           # `True` is incompatible w/ gradient checkpointing
print(config)

## 2. Weight initialization

There are four different ways to initialize the weights of a model for training:
1. **Random weight initialization**
2. **Using an existing model** for continued pre-training
3. **Downscaling** an existing model  
4. **Upscaling** an existing model

### 2.1 Random weight initialization

* Weights are drawn from a **truncated normal distribution** with mean 0 and standard deviation 0.02.
* Values more than 2σ from the mean (outside ±2σ) are set to 0.
* We will import and use **`LlamaForCausalLM`**, the class designed specifically for autoregressive text generation.

**NOTE:**
Local output shows:

LlamaAttention with projection dimensions:

q_proj: 1024 → 4096

k_proj: 1024 → 1024

v_proj: 1024 → 1024

o_proj: 4096 → 1024

**DeepLearning.AI course** output  shows:

LlamaSdpaAttention with projection dimensions:

q_proj: 1024 → 1024

k_proj: 1024 → 256

v_proj: 1024 → 256

o_proj: 1024 → 1024
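
This difference is likely due to version changes in the **Transformers library**. A plausible cause: recent versions store `head_dim` on the config when `LlamaConfig()` is created (4096 / 32 = 128), so changing `hidden_size` afterwards leaves `head_dim` at 128, which gives q_proj = 32 × 128 = 4096 and k_proj = v_proj = 8 × 128 = 1024. Older versions derive the head size from the updated `hidden_size` when the model is built (1024 / 32 = 32).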
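
The arithmetic behind both sets of shapes can be reproduced with a small sketch, assuming the q/k/v output sizes are the number of heads times the per-head dimension. `attention_projection_shapes` is a hypothetical helper written for this note, and the `head_dim=128` case assumes the head size was fixed from the default hidden size of 4096 before `hidden_size` was changed.

```python
def attention_projection_shapes(hidden_size, num_attention_heads,
                                num_key_value_heads, head_dim=None):
    """Return (in_features, out_features) for the q/k/v/o projections.

    If head_dim is not given, it is derived as hidden_size // num_attention_heads.
    """
    if head_dim is None:
        head_dim = hidden_size // num_attention_heads
    q_out = num_attention_heads * head_dim
    kv_out = num_key_value_heads * head_dim
    return {
        "q_proj": (hidden_size, q_out),
        "k_proj": (hidden_size, kv_out),
        "v_proj": (hidden_size, kv_out),
        "o_proj": (q_out, hidden_size),
    }

# Head size derived from the updated hidden size: 1024 // 32 = 32
course_shapes = attention_projection_shapes(1024, 32, 8)
# Head size kept from the default hidden size: 4096 // 32 = 128 (assumption)
local_shapes = attention_projection_shapes(1024, 32, 8, head_dim=128)
print(course_shapes)
print(local_shapes)
```

If the local shapes are not wanted, building the config in one call with the target sizes, rather than mutating a default config, is one way to keep the head size consistent with `hidden_size`.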

In [None]:
from transformers import LlamaForCausalLM
# Use the LlamaConfig object defined above to create  architecture
model = LlamaForCausalLM(config)

# Move the model to the GPU if one is available, otherwise keep it on the CPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
# Cast weights to bfloat16
model = model.to(torch.bfloat16)
print(model)
def print_nparams(model):
    """
    Calculate the total number of model parameters
    Args:
        model: The PyTorch model to analyze
    Returns:
        None, prints the total parameter count
    """
    nparams = sum(p.numel() for p in model.parameters())
    print(f"The total number of parameters is: {nparams}")
print_nparams(model)

Show sample of the weights in a single layer:
* In transformer models like Llama, the **self-attention mechanism projects input embeddings into query (q), key (k), and value (v) representations**. The **q_proj** specifically handles the transformation of input embeddings into query vectors that are used to determine which parts of the sequence the model should focus on.

In [None]:
# from the LLama model layers , choose layer 0, and extract the weight parameters
# for Query Projection in the self attention mechanism
layer_name = "model.layers.0.self_attn.q_proj.weight"

# go through each layer, and extract the one chosen
for name, param in model.named_parameters():
    if name == layer_name:
        print(f"First 30 weights of layer '{layer_name}':")
        print(param.data.view(-1)[:30])
        break

Use this model for inference with the randomly initialized weights from above.  
**Result: random gibberish output, as the model is not trained yet.**

In [None]:
# Load a tokenizer from Upstage SOLAR, which is compatible with the Llama-2 tokenizer
model_name = "SOLAR-10.7B-v1.0"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_path_or_name)
# Run simple inference with prompt
from transformers import TextStreamer

# Set generate() parameters
# Maximum number of tokens to generate
max_new_tokens = 128
# Use greedy decoding
do_sample = False
# The temperature parameter only applies when do_sample=True
temperature = 0.7 if do_sample else None
repetition_penalty = 1.1

prompt = "I am an engineer. I love"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Configure text streaming for the  model
streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

# Generate text with specific parameters
outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=max_new_tokens,
    do_sample=do_sample,
    temperature=temperature,
    repetition_penalty=repetition_penalty
)
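
The two decoding settings used above, greedy selection (`do_sample=False`) and `repetition_penalty`, can be illustrated on a toy vocabulary. This is a simplified sketch of the commonly used penalty scheme (divide positive logits, multiply negative ones), not the library's actual implementation; the function names and logits are hypothetical.

```python
def apply_repetition_penalty(logits, seen_ids, penalty):
    """Penalise tokens that already appear in the sequence.

    Positive logits are divided by the penalty and negative ones are
    multiplied, so a seen token becomes less likely when penalty > 1.
    """
    adjusted = list(logits)
    for token_id in set(seen_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

def greedy_pick(logits):
    """Greedy decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits over a 4-token vocabulary; token 2 was already generated.
logits = [1.0, 2.0, 2.1, -1.0]
without_penalty = greedy_pick(logits)
with_penalty = greedy_pick(apply_repetition_penalty(logits, [2], penalty=1.1))
print(without_penalty, with_penalty)  # 2 1
```

With the penalty, the logit of token 2 drops from 2.1 to about 1.91, so the greedy choice moves to token 1.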

Remove the model from memory to avoid crashing the kernel:

In [None]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del model
del streamer
del outputs
gc.collect()

### Reuse general pretrained model weights

Load an existing pretrained model (**whose weights were pretrained from random initialization**) and either use it as-is or
**retrain it with new data: continued pre-training**.

The output is:

```
to travel and have a great time, but I'm not sure if I can do it all again.
I've been working on my first book for the last 10 years, and I've always  
 wanted to write about something that has happened in my life.             
  It's been a long journey, but I've finally found my voice.  I've written a lot of books, and I've had some really good ones.             
   I've also written a few short stories, and I've done a                  
    couple of short story collections.                                       
             I've also written a few short stories, and I
```



In [None]:
from transformers import AutoModelForCausalLM
model_name = "TinySolar-248m-4k"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path

device_map = "auto" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_path_or_name)

# Run simple inference with prompt
from transformers import TextStreamer

prompt = "I am an engineer. I love"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

# Generate text with specific parameters
outputs = model.generate(
    **inputs,
    streamer=streamer,
     # Enable caching for faster generation
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.1
)
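
The notebook builds the model location the same way in several cells. A small helper can capture that logic once; `resolve_model_source` and its `hub_org` parameter are hypothetical names introduced for this sketch, and the local layout (`<project>/models/<model_name>`) follows the cells above.

```python
import os

def resolve_model_source(model_name, path_to_scripts, load_model_from_folder,
                         hub_org="upstage"):
    """Return a local folder path or a Hugging Face Hub id for `model_name`."""
    if load_model_from_folder:
        # Local copy expected under <project>/models/<model_name>
        return os.path.join(path_to_scripts, "models", model_name)
    # Hub ids always use forward slashes, regardless of the operating system
    return f"{hub_org}/{model_name}"

hub_id = resolve_model_source("TinySolar-248m-4k", os.getcwd(),
                              load_model_from_folder=False)
local_dir = resolve_model_source("TinySolar-248m-4k", os.getcwd(),
                                 load_model_from_folder=True)
print(hub_id)     # upstage/TinySolar-248m-4k
print(local_dir)
```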

Remove the model from memory to avoid crashing the kernel:

In [None]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
del model
gc.collect()

### Downscaling from a general pretrained model

**Downscaling: remove layers, usually from the middle, and retrain.**

Example shown below: remove 2 layers from the TinySolar-248m-4k model, turning a 12-layer model into a 10-layer model.

In [None]:
image_path = os.path.join(path_to_scripts,"images","downScale_model.jpg")
img = Image.open(image_path)
display(img)


###  Why Remove Layers from the Middle?
1.  **Middle layers are less specialized:** In transformer architectures, the **lower layers tend to extract basic features (e.g., syntax)**, while the upper layers focus on task-specific or high-level features (e.g., semantics).

2.  Middle layers often act as intermediaries, which makes them good candidates for removal **without significantly impacting performance.**

**Preserve Model Functionality:**

By keeping the first few and last few layers intact, **the model retains its ability to process input embeddings (early layers) and generate meaningful outputs (later layers).**

**Why Retrain After Downscaling?**

* **Restore coherence and performance:**
The remaining parameters **need to adjust to compensate for the removed layers**. Retraining helps align weights across layers for smooth forward propagation.

* **Adapt to the new architecture:** The reduced model may require fine-tuning on a large corpus of text so that it can **re-learn intermediate representations.**

In [None]:
# Load the TinySolar-248m-4k model and corresponding tokenizer again
from transformers import AutoTokenizer, AutoConfig

model_name = "TinySolar-248m-4k"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path

device_map = "auto" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)

In [None]:
print(model)
print_nparams(model)

Remove the middle two layers (layers 5 and 6) and update the configuration:

In [None]:
# Extract the decoder layers
layers = model.model.layers
# Keep the first 5 layers (indices 0-4) and the last 5 layers (indices 7-11)
model.model.layers = layers[:5] + layers[-5:]

# Respecify the configuration
config = AutoConfig.from_pretrained(
    model_path_or_name,
    num_hidden_layers=len(model.model.layers),
)
model.config = config

print_nparams(model)
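
The slicing used above can be checked on plain indices, which makes it easy to see which layers survive. This is a minimal sketch on a list of integers rather than real model layers; `keep_outer_layers` is a hypothetical helper.

```python
def keep_outer_layers(num_layers, keep_first, keep_last):
    """Return the indices kept when the middle of a layer stack is dropped.

    keep_last must be at least 1, because a slice starting at -0 would
    return the whole list.
    """
    indices = list(range(num_layers))
    return indices[:keep_first] + indices[-keep_last:]

kept = keep_outer_layers(num_layers=12, keep_first=5, keep_last=5)
removed = [i for i in range(12) if i not in kept]
print(kept)     # [0, 1, 2, 3, 4, 7, 8, 9, 10, 11]
print(removed)  # [5, 6]
```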

Clear the memory to avoid crashing the kernel:

In [None]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del model
gc.collect()

### Depth Upscaling from a general pretrained model

**Depth Upscaling** increases the number of layers in a pretrained model while preserving and leveraging the knowledge of the original model.



In [None]:
image_path = os.path.join(path_to_scripts,"images","depth_upscaling.jpg")
img = Image.open(image_path)
display(img)

###  How the Process Works
* **Start with a base model** (Model A) **with pretrained weights** and **N layers** (N = 4 here).
* **Duplicate this model**, then remove the last N1 layers (N1 = 1 here) from the original and the first N1 layers from the duplicate. This yields two models with N − N1 layers each (4 − 1 = 3).

* Concatenate these models to form a new model with 2 × (N − N1) layers.
* Removing N1 layers from each copy **discards the layers that would sit in the 'middle' of the upscaled model.** This **reduces the layer distance at the seam**, instead of connecting layer N directly to layer 1, **as would be the case with simple duplication.**

* In the second stage, **continued pre-training**, the upscaled model is trained further to **recover and potentially surpass the performance of the base LLM.**
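
The duplicate, trim and concatenate steps can be sketched on a list of layer labels. `depth_upscale` is a hypothetical helper that works on plain Python lists, not on real model layers.

```python
def depth_upscale(layers, n_removed):
    """Depth upscaling on a list of layer labels.

    Drops the last `n_removed` layers from the first copy and the first
    `n_removed` layers from the second copy, then concatenates the two.
    """
    first_copy = layers[:len(layers) - n_removed]
    second_copy = layers[n_removed:]
    return first_copy + second_copy

base = [f"A{i}" for i in range(1, 5)]         # N = 4 layers
upscaled = depth_upscale(base, n_removed=1)   # N1 = 1
print(upscaled)       # ['A1', 'A2', 'A3', 'A2', 'A3', 'A4']
print(len(upscaled))  # 6, i.e. 2 * (4 - 1)
```

The same function with 12 layers and `n_removed=4` gives the 16-layer arrangement used for TinySolar later in this lesson.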

### Why Depth Upscaling?
1.  **Increase Model Capacity:** Adding more layers allows the model to capture more complex patterns and representations.

2.  **Leverage pretrained Knowledge:** Instead of training a larger model from scratch, you can **reuse and expand** an existing smaller model.

3.  **Efficient Transfer Learning:** By copying weights from a smaller pretrained model, it **reduces the amount of training required** for the larger model.

### Comparison with Mixture of Experts
1.  Methods like **mixture-of-experts (MoE)** may involve more intricate modifications to the model architecture, **including the introduction of expert layers and gating mechanisms**, which can make **training and integration more challenging.**

2.  **DUS** (depth upscaling) excels at **capturing long-range dependencies and handling complex representations**, although it may incur **higher computational costs** and the risk of **overfitting**. MoE, on the other hand, **employs a gating network to dynamically allocate specific "sub-experts" for different inputs, enhancing efficiency and robustness while reducing computational demands**. However, it may not be as effective for tasks requiring broad global context.

Here you will upscale the TinySolar-248m-4k model from 12 layers to 16 layers. These are the steps we'll take:
1. Configure a 16-layer model and initialize it with random weights
2. Load the 12-layer TinySolar-248m-4k model into memory
3. Copy the bottom 8 and top 8 layers from the 12-layer model and use them to overwrite the random weights of the 16-layer model
4. Copy over the embedding and classifier layers to replace the randomly initialized counterparts in the 16-layer model

### Example Steps in Depth Upscaling
1. ***Configure a Larger Model with Random Weights***.
Start by creating a configuration for the larger model (in this case, 16 layers). Initialize the new model with random weights.

In [None]:
from transformers import LlamaConfig, LlamaForCausalLM

# Create configuration for a 16-layer model
config = LlamaConfig(
    num_hidden_layers=16,  # Target number of layers
    hidden_size=1024,
    intermediate_size=4096,
    num_attention_heads=32,
    num_key_value_heads=8,
    torch_dtype="bfloat16",
    use_cache=False
)

# Initialize the larger model with random weights
upscaled_model = LlamaForCausalLM(config)
upscaled_model = upscaled_model.to(dtype=torch.bfloat16)
print_nparams(upscaled_model)

 2. ***Load the Pretrained Smaller Model***: Load the pretrained smaller model (12 layers) that you want to upscale.



In [None]:
model_name = "TinySolar-248m-4k"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path

device_map = "auto" if torch.cuda.is_available() else "cpu"
pretrained_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)
print_nparams(pretrained_model)

3.  ***Copy Layers from Pretrained Model to Larger Model***  
Specifically:

* Take the first 8 and last 8 layers of the smaller model (the middle layers appear in both slices).

* Use them to overwrite the random weights of the larger model.



In [None]:
from copy import deepcopy

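# Combine the first 8 layers (indices 0-7) and the last 8 layers (indices 4-11)
# of the pretrained 12-layer model
upscaled_model.model.layers = deepcopy(pretrained_model.model.layers[:8]) + \
                     deepcopy(pretrained_model.model.layers[4:])
# This creates a 16-layer model as:
# A1, A2, A3, A4, A5, A6, A7, A8, A5, A6, A7, A8, A9, A10, A11, A12

# Assumption: recent Transformers versions tag each attention module with a
# `layer_idx` used to index the KV cache; renumber so duplicated layers do not collide
for idx, layer in enumerate(upscaled_model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = idx

# Copy the token embedding layer (input side of the model)
upscaled_model.model.embed_tokens = deepcopy(pretrained_model.model.embed_tokens)

# Copy the language modeling head (final output layer)
upscaled_model.lm_head = deepcopy(pretrained_model.lm_head)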

Check that the number of parameters is unchanged after copying the layers:

In [None]:
print_nparams(upscaled_model)  # 308839424 => 308M
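
The 308M figure can also be checked by hand from the configuration used for the 16-layer model. The sketch below assumes a Llama-style layout with no bias terms, untied input and output embeddings, and a vocabulary of 32,000 tokens (the `LlamaConfig` default); `llama_param_count` is a hypothetical helper.

```python
def llama_param_count(num_layers, hidden_size=1024, intermediate_size=4096,
                      num_attention_heads=32, num_key_value_heads=8,
                      vocab_size=32000):
    """Count parameters of a Llama-style model (no biases, untied embeddings)."""
    head_dim = hidden_size // num_attention_heads
    q_and_o = 2 * hidden_size * (num_attention_heads * head_dim)
    k_and_v = 2 * hidden_size * (num_key_value_heads * head_dim)
    mlp = 3 * hidden_size * intermediate_size   # gate, up and down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per block
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = vocab_size * hidden_size       # input embedding table
    lm_head = vocab_size * hidden_size          # output projection
    final_norm = hidden_size
    return num_layers * per_layer + embeddings + lm_head + final_norm

print(llama_param_count(16))  # 308839424
print(llama_param_count(12))  # 248013824
```

Each decoder block contributes about 15.2M parameters, so going from 12 to 16 layers adds roughly 60M on top of the 12-layer model.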

 4. **Fine-Tune or Retrain on New Data**
The upscaled model now needs to be fine-tuned or retrained on a large corpus of text to adapt its weights and fully utilize its increased capacity.

*Example Output*

Before fine-tuning, we can test the upscaled model with simple inference:

In [None]:
# Run simple inference on the upscaled model before any further training
prompt = "I am an engineer. I love"

inputs = tokenizer(prompt, return_tensors="pt").to(upscaled_model.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)
# Move the upscaled_model to the same device as the inputs
upscaled_model = upscaled_model.to(inputs.input_ids.device)

outputs = upscaled_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.1
)

### Save the model to disk

Note the new model name here which reflects the 308 million parameters of the new, upscaled model.

In [None]:
save_dir = "./savedModel"
os.makedirs(save_dir, exist_ok=True)
# Use forward slashes so the path looks the same on every OS
model_path = os.path.join(save_dir, "TinySolar-308m-4k-init").replace('\\', '/')
upscaled_model.save_pretrained(model_path)

In [None]:
# Check that the model files were created successfully
def check_files_in_folder(folder_path):
    # Get list of files in the folder
    files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
    
    # Check if folder is empty
    if not files:
        print(f"The folder '{folder_path}' is empty.")
        return False
    
    # Print filenames
    print(f"Files found in '{folder_path}':")
    for file in files:
        print(f"- {file}")
    
    return True
check_files_in_folder(model_path)

In [None]:
if run_remote  :
  from google.colab import drive
  drive.flush_and_unmount()
  print('All changes made in this colab session should now be visible in Drive.')