# Lesson 1: Pre-Training and Model Performance
### Compare how text generation varies between a **base general model**, a **fine-tuned model**, and a **specialized pre-trained model**.

## Key Objectives:
1. Compare text generation between:
   - Base general model
   - Fine-tuned model
   - Specialized pre-trained model

## Core Components:
1. **Environment Setup**: Configures local or remote (Colab) runtime with GPU support
2. **Model Loading**: Initializes different model types for comparison
3. **Text Generation**: Implements and compares generation across model variants
4. **Performance Analysis**: Evaluates and contrasts output quality and task-specific performance

## Key Concepts Covered:
- Transfer learning in NLP
- Impact of domain-specific pre-training
- Fine-tuning vs. specialized pre-training trade-offs

## Practical Outcomes:
- Understanding the effects of pre-training on model performance
- Insights into choosing appropriate model types for specific tasks
- Hands-on experience with different LLM variants



In [None]:
import os
import subprocess
from typing import Tuple

def setup_environment(curr_proj_folder: str = "pretraining-llms", google_drive_base_folder: str = "Colab Notebooks",\
                      run_remote: bool= True, use_gpu: bool = True) -> Tuple[str, bool]:
    """
    Sets up the environment for running code, handling local and remote execution.

    Args:
        curr_proj_folder (str, optional): Folder name of the current project. Defaults to "pretraining-llms".
        google_drive_base_folder (str, optional): Folder name of the Google Drive base folder. Defaults to "Colab Notebooks".
        run_remote (bool, optional): Whether the code runs remotely on Google Colab (mounts Google Drive). Defaults to True.
        use_gpu (bool, optional): Whether to use GPU if available. Defaults to True.

    Returns:
        Tuple[str, bool]: (computed path_to_scripts, mount_success status)
    """
    # Initialize mount status for Colab
    mount_success = False
    # Remote run code
    if run_remote:
      from google.colab import drive
      # Mount Google Drive
      drive.mount('/content/drive')
      # Check if the mount was successful
      if os.path.ismount('/content/drive'):
        print("Google Drive mounted successfully!")
        mount_success = True
      else:
        print("Drive mount failed.")
      # By Default, this is complete mount path
      mount_path = '/content/drive/MyDrive'

      # complete path to current files
      path_to_scripts = os.path.join(mount_path, google_drive_base_folder,curr_proj_folder)
      # Create the directory if it doesn't exist
      os.makedirs(path_to_scripts, exist_ok=True)
      # Change the working directory to the project path
      os.chdir(path_to_scripts)
      print(f'Running code in path {os.getcwd()}')
    # Local Run
    else:
      path_to_scripts  = os.getcwd()
      # folder name provided as argument should match the one existing
      assert os.path.basename(path_to_scripts ) == curr_proj_folder, \
          f"Folder Name Mismatch: {os.path.basename(path_to_scripts )} != {curr_proj_folder}"
      print(f'Running code in path {path_to_scripts }')
    # check GPU usage
    if use_gpu:
      try:
        gpu_info = subprocess.check_output("nvidia-smi", shell=True).decode('utf-8')
        print("******GPU is available and will be used:**********")
        print(gpu_info)
      except subprocess.SubprocessError:
        print("GPU check failed (nvidia-smi not found or no GPU available). Falling back to CPU.")
        use_gpu = False  # Force CPU usage if GPU check fails
    else:
        print("******use_gpu is set to False. Using CPU******")
    return  path_to_scripts,mount_success


### Always set the following parameters before each run

In [None]:
# Project-specific configuration parameters
# Specifies the current project folder name
curr_proj_folder = "pretraining-llms"
# Base folder name in Google Drive where notebooks are stored
google_drive_base_folder = "Colab Notebooks"
# Flag to determine whether to use GPU for computations
use_gpu = True
# Flag to indicate remote execution environment
run_remote = False
# Flag to control model loading from a specific folder or through URL
load_model_from_folder = False

## Possible Scenarios
1. **Running on Google Colab** (remote mode active):
   - `run_remote = True`
   - `run_local = False`
   - `run_local_usingColab = False`
2. **Running on a local computer with Jupyter Lab** (remote mode disabled):
   - `run_remote = False`
   - `run_local = True`
   - `run_local_usingColab = False`
3. **Running on Google Colab using local PC compute resources** (remote mode disabled):
   - `run_remote = False`
   - `run_local = False`
   - `run_local_usingColab = True`

In [None]:
if run_remote:
  run_local = False
  run_local_usingColab = False
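else:
  # Defaults to Colab connected to local PC compute resources;
  # set run_local = True here when running in local Jupyter Lab instead
  run_local = False
  run_local_usingColab = not run_local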

# call method to setup environment
path_to_scripts,mount_success = setup_environment(curr_proj_folder = curr_proj_folder, \
                                   google_drive_base_folder =  google_drive_base_folder,\
                                    run_remote = run_remote, use_gpu = use_gpu)

In [None]:
from IPython.display import display
from PIL import Image
import os

image_path = os.path.join(path_to_scripts,"images","pretrain_diag.jpg")
img = Image.open(image_path)
display(img)

# Lesson Summary: Pre-Training and Model Performance

## What is Pre-Training?
- Pre-training is the process of taking a model, **generally a transformer neural network,** and training it **from scratch on a large corpus of text** using supervised learning so that it learns to **repeatedly predict the next token given an input prompt.** The labels are simply the next tokens of the text itself, which is why this setup is often called self-supervised learning.  
  As the diagram above shows, **each text sample is turned into many input-output pairs**. Over time, the model learns to correctly predict the next word.

- It is called pre-training because it is the **first step of training an LLM**, carried out before any fine-tuning to follow instructions or any further alignment to human preferences.
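
As a rough illustration of how one text sample becomes many input-output pairs, the sketch below splits a sentence on whitespace and pairs every prefix with the token that follows it. Whitespace splitting is only a stand-in for a real tokenizer (an assumption made for readability), and the helper name `make_next_token_pairs` is hypothetical.

```python
from typing import List, Tuple


def make_next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Pair every non-empty prefix of `tokens` with the token that follows it."""
    pairs = []
    for i in range(1, len(tokens)):
        # Input: the first i tokens; target: the token at position i
        pairs.append((tokens[:i], tokens[i]))
    return pairs


# Whitespace splitting stands in for a real subword tokenizer (assumption)
sample_tokens = "I am an engineer".split()
for context, target in make_next_token_pairs(sample_tokens):
    print(f"{' '.join(context)!r} -> {target!r}")
```

A sample with N tokens yields N - 1 training pairs, which is why even a modest corpus produces a very large number of prediction targets.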

**Output of Pre-Training**
- Produces a **base model**. A base model for your task can be obtained in two ways:
1.  **Training from scratch** (can be costly)
2.  **Retraining** or **fine-tuning** an existing model on a specific task


**Where Pre-Training excels:**
*  Building models for tasks in specific domains
*  Giving a model stronger ability in other languages

**Depth Upscaling**
*  Creates a new LLM by duplicating layers of a smaller pre-trained model.
*  The upscaled model is then further pre-trained, resulting in a larger, better model.
*  Can be 70% less costly than traditional pre-training.
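
The layer duplication step of depth upscaling can be sketched with plain Python lists. This is only a toy illustration: the layers are labelled dictionaries instead of real transformer blocks, and the function name `depth_upscale` and its `overlap` parameter (how many layers to trim where the two copies meet) are assumptions, not details given in this lesson.

```python
import copy
from typing import Dict, List


def depth_upscale(layers: List[Dict], overlap: int) -> List[Dict]:
    """Build a deeper stack from two copies of `layers`.

    The first copy drops its last `overlap` layers and the second copy
    drops its first `overlap` layers before the two are concatenated.
    """
    if not 0 <= overlap < len(layers):
        raise ValueError("overlap must be in [0, len(layers))")
    first = copy.deepcopy(layers[: len(layers) - overlap])
    second = copy.deepcopy(layers[overlap:])
    return first + second


# Toy "pre-trained" model with 8 layers (hypothetical size)
small_model = [{"name": f"layer_{i}"} for i in range(8)]
upscaled = depth_upscale(small_model, overlap=2)
print(len(small_model), "->", len(upscaled), "layers")
print([layer["name"] for layer in upscaled])
```

The deeper stack would then be further pre-trained; that step is not shown here.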



# Detailed explanation of Pre-training methods
The figures below are taken from the paper

[Continual Learning for Large Language Models: A Survey](https://arxiv.org/abs/2402.01364)  

In [None]:
image_path = os.path.join(path_to_scripts,"images","lesson1_llmpaper1.jpg")
img = Image.open(image_path)
display(img)
image_path = os.path.join(path_to_scripts,"images","lesson1_llmpaper2.jpg")
img = Image.open(image_path)
display(img)

### Start Processing


In [None]:
# check torch versions installed
import torch
print("PyTorch version:", torch.__version__)
if use_gpu:
    print("CUDA runtime version:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())

In [None]:
# Suppress warning messages for cleaner output
import warnings
warnings.filterwarnings('ignore')

**Setting Random Seed**  
Setting a random seed ensures reproducibility of results across different runs.
The number 42 is commonly used as a default seed (reference to "The Hitchhiker's Guide to the Galaxy")

In [None]:
def fix_torch_seed(seed=42):
    """
    This function sets various random seeds in PyTorch to ensure reproducible results
    The default seed is 42, but can be changed by passing a different value
    """

    # Sets the seed for generating random numbers for CPU operations
    # This affects all random number generation in PyTorch for CPU computations
    torch.manual_seed(seed)

    # Sets the seed for generating random numbers on CUDA (GPU) operations
    # This ensures reproducibility when using GPU acceleration
    torch.cuda.manual_seed(seed)

    # When True, makes cuDNN use only deterministic algorithms
    # This may slow down performance but improves reproducibility
    torch.backends.cudnn.deterministic = True

    # When False, stops cuDNN from benchmarking and auto-selecting
    # the fastest algorithm, a choice that could change between runs
    # and affect reproducibility
    torch.backends.cudnn.benchmark = False

# Call the function with default seed value (42)
fix_torch_seed()
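
The effect of a fixed seed can be seen without a GPU. The sketch below uses Python's built-in `random` module as a stand-in for PyTorch's generators (an assumption made so the example runs anywhere). The principle is the same: seeding with the same value replays the same sequence of random numbers. The helper name `draw_numbers` is hypothetical.

```python
import random
from typing import List


def draw_numbers(seed: int, count: int = 3) -> List[float]:
    """Seed a private generator and draw `count` pseudo-random floats."""
    rng = random.Random(seed)  # independent generator, global state untouched
    return [rng.random() for _ in range(count)]


first_run = draw_numbers(seed=42)
second_run = draw_numbers(seed=42)
other_seed = draw_numbers(seed=7)

print("Same seed gives same numbers:", first_run == second_run)
print("Different seed gives same numbers:", first_run == other_seed)
```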

## 2.1 Load a general pretrained model

This course will work with small models that fit within the memory of the learning platform.
* **TinySolar-248m-4k is a small decoder-only model** with **248M parameters** (similar in scale to GPT-2) and a **4096-token context window**. 
   You can find the model on the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k).

Load the model in three steps:
1. Specify the path to the model in the Hugging Face model library
2. Load the model using `AutoModelForCausalLM` in the `transformers` library
3. Load the tokenizer for the model from the same model path  
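
To get a feel for why a model of this size fits comfortably in memory, here is a back-of-the-envelope estimate of the weight storage. It assumes 2 bytes per parameter for `bfloat16` and 4 bytes for `float32`, and it ignores activations, the key-value cache, and framework overhead, so treat the numbers as a lower bound. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 248_000_000  # parameter count quoted for TinySolar-248m-4k

print(f"bfloat16: {weight_memory_gb(num_params, 2):.2f} GB")
print(f"float32:  {weight_memory_gb(num_params, 4):.2f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is the main reason for loading small demo models in a 16-bit format.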

### Notes:  
### AutoModelForCausalLM
`AutoModelForCausalLM` is a specialized class in the Hugging Face Transformers library designed for causal (autoregressive) language modeling.

* The class **automatically selects the appropriate model architecture** (e.g., GPT, GPT-2, GPT-Neo) based on the provided **model identifier.**

* Unlike **AutoModel**, which outputs only the hidden states,   
    **AutoModelForCausalLM** adds a linear layer (a language modeling head) on top of the base model.   
    This **head maps each hidden state to a vector of logits, one score per token in the vocabulary**, which a softmax turns into next-token probabilities.   
  This extra component is essential for generating text, since it converts the model's internal representation into actual token predictions.  
   As an analogy, for a classification task with BERT, `AutoModelForSequenceClassification` loads the BERT model  
   and an additional layer that maps the BERT embeddings to one of your classification labels.  
   If you load BERT with `AutoModel` alone, the output is just the raw BERT embeddings.

* The class is suited to tasks such as **auto-completion, creative writing, dialogue generation**
  and other applications where you need to *predict subsequent tokens* from a given prompt.
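
What a language modeling head does can be sketched in a few lines of plain Python: a linear layer turns one hidden state into one score (logit) per vocabulary token, and a softmax turns those scores into probabilities. The tiny vocabulary, the hidden size of 3, and all weight values below are made up for illustration; a real model learns the weights and works on tensors rather than lists.

```python
import math
from typing import List


def lm_head(hidden: List[float], weights: List[List[float]]) -> List[float]:
    """Linear layer without bias: one logit per vocabulary row in `weights`."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["code", "music", "travel", "pizza"]  # toy vocabulary (assumption)
hidden_state = [0.2, -1.0, 0.5]               # toy hidden state (assumption)
head_weights = [                              # one row per vocabulary token
    [1.0, 0.0, 2.0],
    [0.5, 0.5, 0.5],
    [0.0, -1.0, 1.0],
    [-1.0, 1.0, 0.0],
]

probs = softmax(lm_head(hidden_state, head_weights))
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token:>7}: {p:.3f}")
```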
### AutoTokenizer
`AutoTokenizer` is a high-level class in the Hugging Face Transformers library that automatically **selects and loads the appropriate tokenizer for a given pretrained model**.

**Key Features and Functionality**  
1. **Automatic Selection of the Right Tokenizer**  
When initialized with a model name or path (using the **from_pretrained method**), it inspects the model configuration and automatically  
    instantiates the correct subclass (for example, a BERT-based tokenizer, a GPT-2 tokenizer, etc.).

2. **Uniform Interface for Different Tokenizers**  
Regardless of whether the underlying tokenizer is a **traditional Python implementation or the faster Rust-based “Fast” version**,   
AutoTokenizer **exposes a consistent and easy-to-use interface**. You can call methods like `__call__`, `encode`, and `decode` without worrying about the underlying implementation details.

3. **Handling Special Tokens and Configuration**  
It automatically handles special tokens (such as [CLS], [SEP], or padding tokens) that a model needs during training   
and inference, ensuring that the inputs are correctly formatted for the model.

4. **Efficient Tokenization**  
By default, if a **_Fast tokenizer_** is available, AutoTokenizer instantiates that version.   
Fast tokenizers are **built on top of the Hugging Face Tokenizers library**,   
which is written in Rust and highly optimized for batch tokenization, and they provide additional utilities such as **_mapping between tokens and characters_**.
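
To make the encode/decode round trip and the role of special tokens concrete, here is a toy whitespace tokenizer. It is not how Hugging Face tokenizers work internally (they use learned subword vocabularies); the class `ToyTokenizer` and its `<bos>` and `<unk>` tokens are hypothetical and only mimic the shape of the interface described above.

```python
from typing import Dict, List


class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary and two special tokens."""

    def __init__(self, words: List[str]):
        self.special_tokens = ["<bos>", "<unk>"]
        self.id_to_token = self.special_tokens + sorted(set(words))
        self.token_to_id: Dict[str, int] = {
            tok: idx for idx, tok in enumerate(self.id_to_token)
        }

    def encode(self, text: str) -> List[int]:
        """Prepend <bos> and map unknown words to <unk>."""
        unk = self.token_to_id["<unk>"]
        ids = [self.token_to_id["<bos>"]]
        ids += [self.token_to_id.get(word, unk) for word in text.split()]
        return ids

    def decode(self, ids: List[int], skip_special_tokens: bool = False) -> str:
        """Map IDs back to text, optionally dropping special tokens."""
        tokens = [self.id_to_token[i] for i in ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t not in self.special_tokens]
        return " ".join(tokens)


tokenizer = ToyTokenizer("I am an engineer . I love".split())
ids = tokenizer.encode("I love pizza")  # "pizza" is not in the vocabulary
print(ids)
print(tokenizer.decode(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))
```

The `skip_special_tokens` flag here plays the same role as the option of the same name used later when streaming generated text.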

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "TinySolar-248m-4k"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path
# choose device map as CPU or Auto for GPU support
device_map = "auto" if torch.cuda.is_available() else "cpu"
# Load this model: generic and small
tiny_general_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    # Specify precision for efficiency
    torch_dtype=torch.bfloat16
)

# Load the tokenizer by specifying just model path/name
tiny_general_tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)

## 2.2 Generate text samples

Here we will try autocompleting some text with the model. We will set a prompt, instantiate a text streamer, and then have the model complete the prompt:  
  
**TextStreamer** captures tokens as they are generated by a model's `generate()` method. As new tokens are produced,   
  they are **immediately converted to human-readable text** and pushed to the output (typically standard output).   
**Advantage**: Instead of waiting for the entire output to be produced, it lets you see the generated text in real time as the model produces each token.

The **model.generate()** method supports various parameters such as:

1.  **max_new_tokens**: Determines how many tokens will be generated beyond the prompt.

2.  **do_sample**: If set to True, it **enables sampling strategies that introduce randomness**. If False, as in this notebook, the model uses **greedy decoding**, where it always selects the token with the highest probability at each step. This makes the **output deterministic**.

3.  **temperature**: Only applies when sampling. Lower values (close to 0) make the output more deterministic, while higher values introduce more diversity in the generated text.

4.  **repetition_penalty**: Helps reduce repetitiveness in the output by **penalizing repeated tokens**.
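
How these parameters interact can be sketched on a toy distribution. The code below is a simplified illustration in plain Python, not the `transformers` implementation: the logits and token names are invented, all function names are hypothetical, and the repetition penalty follows the common rule of dividing positive logits (and multiplying negative ones) by the penalty for tokens that were already generated.

```python
import math
import random
from typing import Dict, List, Optional


def apply_repetition_penalty(
    logits: Dict[str, float], generated: List[str], penalty: float
) -> Dict[str, float]:
    """Lower the score of every token that already appears in `generated`."""
    adjusted = dict(logits)
    for token in set(generated):
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(
    logits: Dict[str, float], temperature: float
) -> Dict[str, float]:
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = {t: s / temperature for t, s in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(s - peak) for t, s in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def pick_next_token(
    logits: Dict[str, float],
    do_sample: bool,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Greedy choice when do_sample is False, otherwise a weighted random draw."""
    if not do_sample:
        return max(logits, key=logits.get)
    probs = softmax_with_temperature(logits, temperature)
    rng = rng or random.Random()
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


toy_logits = {"coding": 2.0, "travel": 1.9, "pizza": 0.5}  # invented scores

print("Greedy:", pick_next_token(toy_logits, do_sample=False))

penalized = apply_repetition_penalty(toy_logits, generated=["coding"], penalty=1.1)
print("Greedy after penalizing 'coding':", pick_next_token(penalized, do_sample=False))

for temp in (0.5, 1.5):
    top = softmax_with_temperature(toy_logits, temp)["coding"]
    print(f"P('coding') at temperature {temp}: {top:.3f}")

print("Sampled:", pick_next_token(toy_logits, do_sample=True, temperature=0.7,
                                  rng=random.Random(42)))
```

Greedy decoding ignores the temperature entirely, which is why the notebook only sets it when `do_sample` is True.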

**This flow is shown below**

In [None]:
image_path = os.path.join(path_to_scripts,"images","lesson1_img1.jpg")
img = Image.open(image_path)
display(img)


In [None]:
from transformers import TextStreamer

prompt = "I am an engineer. I love"

# set generate() params
# Maximum number of tokens to generate
max_new_tokens = 128
# Use greedy decoding
do_sample = False
# The temperature parameter only applies when do_sample=True.
temperature = 0.7 if do_sample else None
repetition_penalty = 1.1


# Tokenize input ("pt" returns PyTorch tensors)
model_inputs = tiny_general_tokenizer(prompt, return_tensors="pt")
# Move the inputs to the device where the model (tiny_general_model) is located,
# so that the inputs and the model are both on the GPU or both on the CPU
model_inputs = model_inputs.to(tiny_general_model.device)
# Configure text streaming
streamer = TextStreamer(
    # tokenizer to stream
    tiny_general_tokenizer,
     #exclude the original prompt from the output, displaying only the newly generated text.
    skip_prompt=True,
    #   Remove special tokens from output
    skip_special_tokens=True
)

# Generate text with specific parameters
outputs = tiny_general_model.generate(
    **model_inputs,
    streamer=streamer,
     # Enable caching for faster generation
    use_cache=True,
    max_new_tokens=max_new_tokens,
    do_sample=do_sample,
   # temperature=temperature,
    repetition_penalty=repetition_penalty
)


## 2.3 Result:
The model generates text that follows common patterns in natural language.  
In this case, after the prompt "***I am an engineer. I love***", it predicts phrases like "***to travel and have a great time***" because these are common continuations in general language usage.

In [None]:
# `outputs` is a tensor of token IDs (prompt plus generated tokens), not decoded text
print(outputs)


## 3.1. Generate Python code completion with the same pretrained general model

Use the model to write a Python function called `find_max()` that finds the maximum value in a list of numbers:

In [None]:
# change the prompt
prompt = "def find_max(numbers):"
# the rest of the code is the same
model_inputs = tiny_general_tokenizer(prompt, return_tensors="pt").to(tiny_general_model.device)

# the streamer object stays the same

In [None]:
# Generate text with specific parameters
outputs = tiny_general_model.generate(
    **model_inputs,
    streamer=streamer,
     # Enable caching for faster generation
    use_cache=True,
    max_new_tokens=max_new_tokens,
    do_sample=do_sample,
   # temperature=temperature,
    repetition_penalty=repetition_penalty
)

## 3.2 Result
*  The model **fails miserably, as it was trained on English text and not on code samples**
*  It generates some comments, but no logic for finding the maximum

## 4.1. Generate Python code completion with a fine-tuned Python model

* Fine-tuning involves further training the model on a small amount of task-specific data.

* Let's use one such model from the Hugging Face model library at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-code-instruct).

*  This model has been **fine-tuned on Python code**
*  Note that we will also use the **corresponding tokenizer**

In [None]:
model_name = "TinySolar-248m-4k-code-instruct"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path
# choose device map as CPU or Auto for GPU support
device_map = "auto" if torch.cuda.is_available() else "cpu"
# Load this model: Fine tuned
tiny_finetuned_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    # Specify precision for efficiency
    torch_dtype=torch.bfloat16
)

# Load the Fine tuned tokenizer by specifying just model path/name
tiny_finetuned_tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)

In [None]:
# set the prompt for creating python method
prompt =  "def find_max(numbers):"

model_inputs = tiny_finetuned_tokenizer(
    prompt, return_tensors="pt"
).to(tiny_finetuned_model.device)

# Configure text streaming for finetuned model
streamer_finetuned = TextStreamer(
    # Fine tuned tokenizer to stream
    tiny_finetuned_tokenizer,
     #exclude the original prompt from the output
    skip_prompt=True,
    #   Remove special tokens from output
    skip_special_tokens=True
)

# Generate text with specific parameters
outputs_fineTuned = tiny_finetuned_model.generate(
    **model_inputs,
    streamer=streamer_finetuned,
     # Enable caching for faster generation
    use_cache=True,
    max_new_tokens=max_new_tokens,
    do_sample=do_sample,
   # temperature=temperature,
    repetition_penalty=repetition_penalty
)


## 4.2 Result:

*  The model produces some operations, but **the results could be better**

## 5.1 Generate Python code completion with a pre-trained Python model

* Here we will use a version of **TinySolar-248m-4k**, **TinySolar-248m-4k-py**, that has been further pretrained (**continued pretraining**) on a large selection  
   of Python code samples. You can find the model on Hugging Face at [this link](https://huggingface.co/upstage/TinySolar-248m-4k-py).


In [None]:
model_name = "TinySolar-248m-4k-py"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# download model if not present  locally
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path
# choose device map as CPU or Auto for GPU support
device_map = "auto" if torch.cuda.is_available() else "cpu"
# Load this model: further pretrained on Python code
tiny_custom_model = AutoModelForCausalLM.from_pretrained(
    model_path_or_name,
    device_map=device_map,
    # Specify precision for efficiency
    torch_dtype=torch.bfloat16
)

# Load the tokenizer by specifying just model path/name
tiny_custom_tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)

In [None]:
prompt = "def find_max(numbers):"

model_inputs = tiny_custom_tokenizer(prompt, return_tensors="pt").to(tiny_custom_model.device)

# Configure text streaming for Pretrained model
streamer_custom = TextStreamer(
    # Tokenizer of the further pretrained Python model to stream
    tiny_custom_tokenizer,
     #exclude the original prompt from the output
    skip_prompt=True,
    #   Remove special tokens from output
    skip_special_tokens=True
)

# Generate text with specific parameters
outputs_Custom = tiny_custom_model.generate(
    **model_inputs,
    streamer=streamer_custom,
     # Enable caching for faster generation
    use_cache=True,
    max_new_tokens=max_new_tokens,
    do_sample=do_sample,
   # temperature=temperature,
    repetition_penalty=repetition_penalty
)

## 5.2 Result: this is much better
The model completes the prompt with:

    Find the maximum number of numbers in a list."""
       max = 0
       for num in numbers:
           if num > max:
               max = num
       return max

In [None]:
# Verify the function generated from the prompt
def find_max(numbers):
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max

In [None]:
find_max([1,3,5,1,6,7,2])
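
One caveat worth checking when verifying generated code: because the function above starts from `max = 0`, it returns 0 for a list that contains only negative numbers, and it also shadows Python's built-in `max`. The sketch below reproduces the generated logic under the hypothetical name `find_max_generated` and compares it with one possible fix, `find_max_fixed`, which starts from the first element instead.

```python
from typing import List


def find_max_generated(numbers: List[float]) -> float:
    """The model's logic, reproduced with a renamed accumulator variable."""
    largest = 0
    for num in numbers:
        if num > largest:
            largest = num
    return largest


def find_max_fixed(numbers: List[float]) -> float:
    """Start from the first element so negative-only lists work too."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    largest = numbers[0]
    for num in numbers[1:]:
        if num > largest:
            largest = num
    return largest


for sample in ([1, 3, 5, 1, 6, 7, 2], [-5, -2, -9]):
    print(sample,
          "generated:", find_max_generated(sample),
          "fixed:", find_max_fixed(sample))
```

Both versions agree on the positive test list used above; they differ only on inputs the original check does not cover.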

In [None]:
if run_remote:
  from google.colab import drive
  drive.flush_and_unmount()
  print('All changes made in this colab session should now be visible in Drive.')