# Lesson 3: LLM Data Packaging Tutorial: From Tokens to Training Sequences

This notebook demonstrates the complete workflow for preparing tokenized data for LLM training, focusing on efficient sequence packing:

## Core Processes:
**Environment Configuration**: Sets up a local or remote (Google Colab) run, with or without a GPU

**Data Preparation**:
- Loads preprocessed JSON data
- Implements dataset sharding (10 shards)
- Uses Hugging Face `datasets` library

**Tokenization Workflow**:
- Initializes SOLAR-10.7B tokenizer
- Implements BOS/EOS token insertion
- Processes text through parallel mapping
- Analyzes token distributions (millions per shard)

**Sequence Packaging**:
- Concatenates all token IDs
- Sets max sequence length (32 for demo)
- Reshapes into fixed-length sequences
- Handles length divisibility constraints

## Key Technical Components:
**Special Tokens**: Adds beginning/end markers  
**Memory Optimization**:
- Numpy array manipulation
- Sequence length vs batch size tradeoffs  

**Efficiency Features**:
- Dataset sharding for manageable processing
- Cached mapping operations
- GPU availability checks

## Output Artifacts:
- Packaged dataset in Hugging Face format
- JSON serialization of token sequences
- Complete pipeline from raw text to training-ready batches

## Visual Guides Included:
1. Tokenization process diagram  
2. Sequence packing diagram



In [None]:
import os
import subprocess
from typing import Tuple

def setup_environment(curr_proj_folder: str = "pretraining-llms", google_drive_base_folder: str = "Colab Notebooks",
                      run_remote: bool = True, use_gpu: bool = True) -> Tuple[str, bool]:
    """
    Sets up the environment for running code, handling local and remote execution.

    Args:
        curr_proj_folder (str, optional): Folder name of the current project. Defaults to "pretraining-llms".
        google_drive_base_folder (str, optional): Folder name of the Google Drive base folder. Defaults to "Colab Notebooks".
        run_remote (bool, optional): Whether to run remotely on Google Colab and mount Google Drive. Defaults to True.
        use_gpu (bool, optional): Whether to use GPU if available. Defaults to True.

    Returns:
        Tuple[str, bool]: (computed path_to_scripts, mount_success status)
    """
    # Initialize mount status for Colab
    mount_success = False
    # Remote run code
    if run_remote:
      from google.colab import drive
      # Mount Google Drive
      drive.mount('/content/drive')
      # Check if the mount was successful
      if os.path.ismount('/content/drive'):
        print("Google Drive mounted successfully!")
        mount_success = True
      else:
        print("Drive mount failed.")
      # By Default, this is complete mount path
      mount_path = '/content/drive/MyDrive'

      # complete path to current files
      path_to_scripts = os.path.join(mount_path, google_drive_base_folder,curr_proj_folder)
      # Create the directory if it doesn't exist
      os.makedirs(path_to_scripts, exist_ok=True)
      # Change the working directory to the project path
      os.chdir(path_to_scripts)
      print(f'Running code in path {os.getcwd()}')
    # Local Run
    else:
      path_to_scripts  = os.getcwd()
      # folder name provided as argument should match the one existing
      assert os.path.basename(path_to_scripts ) == curr_proj_folder, \
          f"Folder Name Mismatch: {os.path.basename(path_to_scripts )} != {curr_proj_folder}"
      print(f'Running code in path {path_to_scripts }')
    # check GPU usage
    if use_gpu:
      try:
        gpu_info = subprocess.check_output("nvidia-smi", shell=True).decode('utf-8')
        print("******GPU is available and will be used:**********")
        print(gpu_info)
      except subprocess.SubprocessError:
        print("GPU check failed (nvidia-smi not found or no GPU available). Falling back to CPU.")
        use_gpu = False  # Force CPU usage if GPU check fails
    else:
        print("******use_gpu is set to False. Using CPU******")
    return  path_to_scripts,mount_success

Always set the following parameters before each run.

In [None]:
# Project-specific configuration parameters
# Specifies the current project folder name
curr_proj_folder = "pretraining-llms"
# Base folder name in Google Drive where notebooks are stored
google_drive_base_folder = "Colab Notebooks"
# Flag to determine whether to use GPU for computations
use_gpu = True
# Flag to indicate remote execution environment
run_remote = False
# Flag to control model loading from a specific folder or through URL
load_model_from_folder = False


if run_remote:
  run_local = False
  run_local_usingColab = False
else:
  run_local = False
  run_local_usingColab = not run_local

# call method to setup environment
path_to_scripts,mount_success = setup_environment(curr_proj_folder = curr_proj_folder, \
                                   google_drive_base_folder =  google_drive_base_folder,\
                                    run_remote = run_remote, use_gpu = use_gpu)

### Tokenization:
LLMs don't process raw text directly - they work with numerical representations called tokens. Before training an LLM, we need to:

- **Tokenize our text data**: convert text into numbers using a specific tokenizer.

- **Pack these tokens** into fixed-length sequences for efficient training.

This process is crucial because it:

1.   **Transforms** human-readable text into machine-processable numbers.
2.   **Ensures** all training examples have **consistent dimensions.**
3.   **Optimizes memory usage** during training.
4.   **Adds special tokens** to help the **model understand sequence boundaries.**
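
As a rough illustration of the tokenization step, the sketch below maps text to numbers with a made-up whitespace tokenizer and wraps the result in BOS/EOS markers. `TOY_VOCAB` and `toy_encode` are hypothetical names and the IDs are invented; the real SOLAR tokenizer splits text into subword units and uses its own vocabulary.

```python
from typing import List

# Toy vocabulary (assumption for illustration only, not the SOLAR vocabulary)
TOY_VOCAB = {"<unk>": 0, "<s>": 1, "</s>": 2, "i'm": 3, "a": 4, "short": 5, "sentence": 6}


def toy_encode(text: str) -> List[int]:
    """Map whitespace-separated words to IDs and add BOS/EOS markers."""
    tokens = text.lower().split()
    # Unknown words fall back to the <unk> ID
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]
    return [TOY_VOCAB["<s>"]] + ids + [TOY_VOCAB["</s>"]]


encoded = toy_encode("I'm a short sentence")
print(encoded)  # [1, 3, 4, 5, 6, 2]
```

The first and last IDs mark where the example starts and ends, which becomes important once several examples are packed into the same sequence.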

### 3.1 Loading and Preparing the Dataset
First, we load our preprocessed dataset. We then keep only a small portion of it so that the notebook runs faster.

In [None]:
import datasets
# Path to the cleaned pretraining data
load_dir = "./saved_pretrain_cleaned_data"
file_load_path = os.path.join(load_dir, "preprocessed_dataset.json")
# Load the dataset from a JSON file
dataset = datasets.load_dataset("json", data_files=file_load_path)
print(dataset)

**Sharding:**
Use the `shard` method of the Hugging Face `Dataset` object to split the dataset into 10 smaller pieces, or *shards* (think shards of broken glass). You can read more about sharding at [this link](https://huggingface.co/docs/datasets/en/process#shard).
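
To make the idea of a shard concrete, here is a minimal sketch of which row indices could end up in one shard. It assumes the two layouts that `shard` selects between through its `contiguous` argument: contiguous blocks, or every `num_shards`-th row. The helper `shard_indices` is hypothetical and is not the library's code; which layout is the default depends on the installed `datasets` version, so check its documentation.

```python
from typing import List


def shard_indices(num_rows: int, num_shards: int, index: int, contiguous: bool) -> List[int]:
    """Return the row indices assigned to shard `index` (illustrative helper)."""
    if contiguous:
        # Contiguous blocks; the first (num_rows % num_shards) shards get one extra row
        div, mod = divmod(num_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided layout: every num_shards-th row, starting at `index`
    return list(range(index, num_rows, num_shards))


print(shard_indices(23, 10, 4, contiguous=True))   # [11, 12]
print(shard_indices(23, 10, 4, contiguous=False))  # [4, 14]
```

Either way, each row lands in exactly one shard, and shard sizes differ by at most one row when the row count is not divisible by the number of shards.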

In [None]:
# Split the dataset into 10 shards and keep only one of them
num_shards = 10
# Access the 'train' split of the dataset
train_dataset = dataset['train']
# Approximate number of rows per shard (shard sizes can differ by one row)
each_shard_len = train_dataset.num_rows // num_shards
# Index of the shard to keep (0-based)
index_shard = 4
print(f'Original data with {train_dataset.num_rows} rows is split into {num_shards} shards, each with about {each_shard_len} rows')
# Apply shard to the 'train' split
dataset = train_dataset.shard(num_shards=num_shards, index=index_shard)
print(dataset)

### Tokenization
  

In [None]:
from IPython.display import display
from PIL import Image
image_path = os.path.join(path_to_scripts,"images","lesson3_tokenization.jpg")
img = Image.open(image_path)
display(img)

### 3.2 Loading a Tokenizer
We use a pre-trained tokenizer from the Hugging Face Transformers library.  
Note: we set `use_fast=False` because the fast Rust implementation sometimes has issues with very long text samples.  
Instead, we use the Python implementation and process the data with the `map` function of the `datasets` library, which can run in parallel when given a `num_proc` argument.




In [None]:
# requires conda install conda-forge::sentencepiece
from transformers import AutoTokenizer
model_name = "SOLAR-10.7B-v1.0"
# Force forward slashes for Hugging Face compatibility
upstage_path = os.path.join("upstage",model_name).replace('\\', '/') 
# Load from the local "models" folder if requested; otherwise use the Hub ID (downloaded on first use)
model_path_or_name = os.path.join(path_to_scripts ,"models",model_name) if load_model_from_folder else upstage_path

tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name,
    # Use the Python implementation instead of the Rust one
    use_fast=False
)
# Test the tokenizer on a simple sentence
tokenizer.tokenize("I'm a short sentence")

Create a helper function that tokenizes one example, adds the BOS/EOS tokens, and records the token count:

In [None]:
def tokenization(example):
    # Tokenize the text into tokens
    tokens = tokenizer.tokenize(example["text"])

    # Convert tokens to numerical IDs
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Add special tokens: beginning-of-sequence (BOS) and
    # end-of-sequence (EOS).
    # These help the model recognize where sequences start and end
    token_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.eos_token_id]

    # Store the token IDs in the example
    example["input_ids"] = token_ids

    # Count the number of tokens for later analysis
    example["num_tokens"] = len(token_ids)

    return example

### 3.3 Applying Tokenization to the Dataset
Let's apply the tokenization function to the sharded dataset:

In [None]:
# Apply the tokenization function to data
dataset = dataset.map(tokenization, load_from_cache_file=True)
print(dataset)
# Look at one example row
sample = dataset[3]
# First 30 characters of the text
print("text", sample["text"][:30])
# First 30 token IDs
print("\ninput_ids", sample["input_ids"][:30])
print("\nnum_tokens", sample["num_tokens"])

### 3.4 Calculating Total Tokens
Even with our small shard of data (about 4,000 text samples), we have **millions of tokens.**


In [None]:
import numpy as np
total_tokens = np.sum(dataset["num_tokens"])
print(f"Total number of tokens in the dataset: {total_tokens}")
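
Beyond the total, the per-example counts stored in `num_tokens` can be summarized to see how uneven the example lengths are. The sketch below uses made-up counts (not values from the actual dataset) and also estimates how many full sequences a given length would yield; the sequence length of 32 matches the demo value used later in this notebook.

```python
import statistics

# Made-up per-example token counts (assumption for illustration only)
num_tokens = [120, 87, 1530, 342, 95]

total = sum(num_tokens)
mean = statistics.mean(num_tokens)
median = statistics.median(num_tokens)
longest = max(num_tokens)
print(f"total={total} mean={mean:.1f} median={median} max={longest}")

# Number of full fixed-length sequences the concatenated tokens would fill
max_seq_length = 32
num_packed = total // max_seq_length
print(f"full sequences of length {max_seq_length}: {num_packed}")
```

A mean far above the median, as here, indicates a few very long examples, which is one reason padding every example to a common length would waste space.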

### Packing the Data
Our dataset currently contains **examples of variable lengths**, but for efficient training, we need fixed-length sequences. We'll now **pack our tokens into uniform-length sequences.**






In [None]:
image_path = os.path.join(path_to_scripts,"images","lesson3_packing.jpg")
img = Image.open(image_path)
display(img)


### 4.1 Concatenate All Tokens and Set Maximum Sequence Length
* Concatenate all token IDs into one long list.
* Decide on a **maximum sequence length** for our model. This parameter determines the fixed length of the token sequences that will be fed into the model during training.

**Key considerations when setting the maximum sequence length**

* **Memory constraints**: **Longer sequences require more GPU/TPU memory** during training.

* **Model architecture**: Different model architectures support different context lengths. Modern models like SOLAR and Llama 2 typically use 4096 tokens or longer as their maximum sequence length.

* **Task requirements**: The nature of the downstream tasks influences the ideal sequence length. Tasks requiring **long-range understanding (like summarizing books)** benefit from longer sequences.

* **Training efficiency**: Shorter sequences allow for **larger batch sizes**, which can speed up training, but at the **cost of potentially limiting the model's ability to capture long-range dependencies.**
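
The batch-size tradeoff can be seen with simple arithmetic. The sketch below assumes a fixed budget of tokens per batch (the value `131072` and the helper name `batch_size_for_budget` are hypothetical) and shows how many sequences fit as the sequence length grows. It deliberately ignores other memory effects, such as the cost of self-attention growing faster than linearly with sequence length.

```python
def batch_size_for_budget(tokens_per_batch: int, max_seq_length: int) -> int:
    """Number of full sequences that fit in a fixed per-batch token budget."""
    return tokens_per_batch // max_seq_length


token_budget = 131072  # hypothetical tokens per batch
for seq_len in (32, 512, 4096):
    print(seq_len, batch_size_for_budget(token_budget, seq_len))
# 32 4096
# 512 256
# 4096 32
```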

In [None]:
# Concatenate all input_ids into a single long array
input_ids = np.concatenate(dataset["input_ids"])
print(f"Total number of concatenated tokens: {len(input_ids)}")

# Set the maximum sequence length
# In practice, modern LLMs like SOLAR and Llama 2 use 4096 or longer
max_seq_length = 32  # Using a small value for demonstration


### 4.2 Reshape into Fixed-Length Sequences
Reshape the token list into fixed-length sequences.

In [None]:
# Calculate the total length that's divisible by max_seq_length
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(f"Adjusted total length: {total_length}")

# Discard extra tokens from the end to make it evenly divisible
input_ids = input_ids[:total_length]
print(f"Shape after truncation: {input_ids.shape}")

# Reshape into a 2D array with dimensions [num_sequences, max_seq_length]
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
print(f"Shape after reshaping: {input_ids_reshaped.shape}")
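
The same concatenate, truncate, and reshape logic can be traced by hand on a toy input. This sketch uses plain Python lists instead of NumPy and invented token IDs (`BOS = 1`, `EOS = 2`, and the helper `pack` are assumptions for illustration).

```python
from typing import List

BOS, EOS = 1, 2  # assumed special-token IDs for this toy example


def pack(examples: List[List[int]], max_seq_length: int) -> List[List[int]]:
    """Concatenate examples and cut the stream into fixed-length rows."""
    # Flatten every example into one long stream of token IDs
    flat = [tok for ex in examples for tok in ex]
    # Keep the largest prefix whose length is a multiple of max_seq_length
    usable = len(flat) - len(flat) % max_seq_length
    return [flat[i:i + max_seq_length] for i in range(0, usable, max_seq_length)]


docs = [[BOS, 10, 11, EOS], [BOS, 20, 21, 22, EOS], [BOS, 30, EOS]]
packed = pack(docs, max_seq_length=5)
dropped = sum(len(d) for d in docs) - sum(len(row) for row in packed)

print(packed)   # [[1, 10, 11, 2, 1], [20, 21, 22, 2, 1]]
print(dropped)  # 2
```

Note that rows do not line up with the original examples: the first row ends with the BOS token of the second example, and the last two tokens are discarded. The BOS/EOS markers are what let the model tell where one example ends and the next begins inside a packed row.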


### 4.3 Convert to a Hugging Face Dataset and Save It

In [None]:
# Convert numpy array to list for Dataset creation
input_ids_list = input_ids_reshaped.tolist()

# Create a new Dataset with the packed sequences
packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {"input_ids": input_ids_list}
)
print(packaged_pretrain_dataset)
# Save the packed dataset as a JSON file
save_dir = load_dir
file_path = os.path.join(save_dir, "packaged_pretrained_dataset.json")
packaged_pretrain_dataset.to_json(file_path)
if os.path.exists(file_path):
    print(f"File '{file_path}' created successfully.")
else:
    print(f"File '{file_path}' creation failed.")
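
To my understanding, `Dataset.to_json` writes JSON Lines by default, that is, one JSON object per row. The sketch below mimics that layout with the standard library only, to show what the saved file roughly looks like and how it could be read back. The file name `packed_demo.json` and the sample rows are made up.

```python
import json
import os
import tempfile

# Two toy packed sequences (invented values)
packed = [[1, 10, 11, 2], [1, 20, 21, 2]]

# Write one JSON object per line into a temporary directory
path = os.path.join(tempfile.mkdtemp(), "packed_demo.json")
with open(path, "w", encoding="utf-8") as f:
    for row in packed:
        f.write(json.dumps({"input_ids": row}) + "\n")

# Read the file back line by line
with open(path, encoding="utf-8") as f:
    restored = [json.loads(line)["input_ids"] for line in f if line.strip()]

print(restored == packed)  # True
```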

In [None]:
if run_remote:
  from google.colab import drive
  drive.flush_and_unmount()
  print('All changes made in this colab session should now be visible in Drive.')