# Processing Data to Keep GPUs Busy

This part of the script focuses on preparing a dataset for efficient processing by the GPU. The data preparation steps ensure the GPUs remain fully utilized during training by dividing tasks into manageable chunks and keeping data pipelines optimized.

The pipeline has four steps:
1) Get the dataset
2) Store and load the data
3) Process/tokenize the data
4) Load a batch of data and send it to the GPU

![Architecture](../Images/image-2.png)

![Data processing pipeline](../Images/image-3.png)

### The Point: Make Training Fast
Your goal: time(data processing pipeline) <= time(GPU iteration)
- The bottleneck is the slowest part of your training loop
- Identify the bottleneck, remove it, repeat
- The GPU should be your bottleneck (flops or bandwidth cost) in most cases because GPUs are the more expensive and limited resource. GPU optimization is its own topic :)
- Therefore the CPU data processing pipeline should not be the bottleneck
- Your goal is to load a batch of data faster than the GPU can process it. If you can do this, then you don't need to optimize the data processing pipeline any further: optimizing it will have no effect on training speed
- If your GPU is sitting idle waiting for a new batch of data, then your data processing pipeline is a bottleneck and you should optimize it
- Optimize the data processing pipeline by profiling, accurately identifying which step is the bottleneck, and applying optimizations specific to that part of the pipeline
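
One way to check which side is the bottleneck is to time the wait for each batch separately from the training step. The sketch below is only an illustration: `profile_loop`, `slow_batches`, and the sleep duration are hypothetical stand-ins, not part of this notebook. On a real GPU, CUDA kernels run asynchronously, so an accurate step time would also need a `torch.cuda.synchronize()` call before reading the clock.

```python
import time

def profile_loop(batch_iter, step_fn, num_steps):
    """Return (seconds spent waiting for data, seconds spent in step_fn)."""
    data_time, step_time = 0.0, 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(batch_iter)   # time spent here is time the GPU would sit idle
        t1 = time.perf_counter()
        step_fn(batch)             # forward/backward/optimizer step in a real loop
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time

def slow_batches():
    # Hypothetical slow pipeline: each batch takes about 2 ms to prepare.
    while True:
        time.sleep(0.002)
        yield [0] * 8

data_time, step_time = profile_loop(slow_batches(), lambda b: sum(b), num_steps=20)
print(f"waiting for data: {data_time:.4f}s, training step: {step_time:.4f}s")
```

If the data wait dominates, as it does in this toy setup, the pipeline is the part worth optimizing.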

### Libraries and Dataset Loading

In [1]:
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

!pip install datasets
import datasets


##### Purpose:
- datasets is used to load a large dataset for text summarization.
- The dataset is streamed and reduced to 3,500 examples to limit memory usage.

##### Streaming
Streaming loads and processes a dataset incrementally, rather than loading the entire dataset into memory at once. This is particularly useful for large datasets that cannot fit into memory. The `streaming=True` option of `datasets.load_dataset` in the Hugging Face Datasets library enables this behavior.

##### How Streaming Works in Hugging Face Datasets:
- On-Demand Loading: Data is loaded one sample or batch at a time, directly from the source (e.g., a remote server or file storage) without needing to download and store the entire dataset locally.
- Efficient Resource Usage: It minimizes memory usage, as only the required data points are kept in memory during processing.
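
The on-demand behavior can be illustrated without any network access. In this sketch, `stream_records` is a hypothetical generator standing in for a remote dataset, and `itertools.islice` plays a role similar to taking the first few examples of a stream. It is an analogy for the idea, not the library's implementation.

```python
from itertools import islice

def stream_records(counter):
    """Hypothetical stand-in for a remote dataset: yields one record at a time."""
    i = 0
    while True:
        counter["produced"] += 1   # count how many records were actually materialized
        yield {"article": f"document {i}"}
        i += 1

counter = {"produced": 0}
first_five = list(islice(stream_records(counter), 5))  # take only the first 5 records

print(len(first_five), counter["produced"])
```

Only the records that were requested are ever produced, even though the source itself is unbounded.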

In [16]:
dataset = datasets.load_dataset("ccdv/arxiv-summarization", split='train', streaming=True)
raw_dataset = list(dataset.take(3500))

### Chunking and Segmentation

Why chunking?
- Articles are too long to process directly.
- Splitting into fixed-size chunks allows efficient batching and computation.

Suppose we have a dataset of **scientific articles**, each with different lengths. For this example, assume:
- The articles vary in length from roughly 3,000 to just over 10,000 tokens.
- **Segment length** is set to `512` tokens.
- **Chunk size** (total sequence length) is `5120` tokens (`10 segments × 512 tokens`).
- Each sequence (chunk) is split into **segments** of size 512.
- A **batch** consists of `8` such sequences.

#### **Documents (Dataset Level)**
These are the raw articles in the dataset. For simplicity:
- Document 1: 3500 tokens
- Document 2: 7200 tokens
- Document 3: 10240 tokens
- ... (more documents)

#### **Chunks**
Each document is divided into chunks of **5120 tokens** (if long enough). Documents shorter than 5120 tokens, and leftover tokens past the last full chunk, can be discarded, padded, or processed differently depending on the task. This notebook discards them.

- Document 1 (3500 tokens) → shorter than one chunk, so it is discarded.
- Document 2 (7200 tokens) → 1 full chunk of 5120 tokens + 1 leftover chunk of 2080 tokens (discarded or padded).
- Document 3 (10240 tokens) → 2 full chunks of 5120 tokens.

#### **Segments**
Each **chunk** is further divided into smaller **segments** of `512 tokens` for processing. For example:
- Document 3, Chunk 1 (5120 tokens) → split into 10 segments:
  - Segment 1: Tokens 0–511
  - Segment 2: Tokens 512–1023
  - ...
  - Segment 10: Tokens 4608–5119

These **segments** allow processing piece-by-piece, reducing memory usage.
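
The chunk and segment arithmetic above can be checked with a few lines of plain Python. The helper names `chunk_plan` and `segment_bounds` are hypothetical and exist only for this illustration.

```python
segment_length = 512
segments_per_chunk = 10
chunk_size = segment_length * segments_per_chunk  # 5120

def chunk_plan(num_tokens):
    """Return (full_chunks, leftover_tokens) for a document of num_tokens tokens."""
    return divmod(num_tokens, chunk_size)

def segment_bounds(segment_index):
    """Return the inclusive (first, last) token positions of a 1-based segment in a chunk."""
    start = (segment_index - 1) * segment_length
    return start, start + segment_length - 1

for doc_len in (3500, 7200, 10240):
    print(doc_len, chunk_plan(doc_len))
print(segment_bounds(1), segment_bounds(10))
```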

#### **Batches**
A **batch** consists of multiple sequences (chunks from different documents) that are processed together. With a batch size of `8`, a single batch might look like this:
- Sequence 1: Chunk 1 of Document 2
- Sequence 2: Chunk 2 of Document 3
- Sequence 3: Chunk 1 of Document 4
- ...
- Sequence 8: Chunk 1 of Document 9

Each batch is passed through the model in parallel to maximize training efficiency.

---

### Visualization

| **Level**       | **Example**                                |
|------------------|--------------------------------------------|
| **Document**     | Entire raw text of a scientific article.   |
| **Chunk**        | A slice of the document (5120 tokens).     |
| **Segment**      | Smaller parts of a chunk (512 tokens each).|
| **Batch**        | 8 sequences (chunks) processed together.   |

---

### Summary
- **Segments** are subdivisions of a single chunk (smaller pieces of data for processing sequentially within one sequence).
- **Chunks** are subdivisions of documents to create manageable sequences of fixed length (for memory constraints).
- **Batches** combine multiple sequences (from chunks of different documents) for parallel processing.

This hierarchy enables efficient memory management and parallelism during training.

In [None]:
# BATCH SIZE: 4 (papers)
# CHUNK SIZE: 5 (each paper broken into 5 chunks of n tokens each)


#        forward pass 1 | FP 2    | FP 3    | FP 4    | FP 5    |
#
# paper 1:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
# paper 2:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
# paper 3:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
# paper 4:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
#
#
#
#        forward pass 6 | FP 7    | FP 8    | FP 9    | FP 10   |
#
# paper 5:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
# paper 6:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
# paper 7:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |
# paper 8:      chunk 1 | chunk 2 | chunk 3 | chunk 4 | chunk 5 |


In [17]:
segments = 10
segment_length = 512
chunk_size = segments * segment_length
chunk_size

5120

Articles are filtered so that each one has more than 5120 characters (one character is one token in this notebook), which is enough for at least one chunk of 10 segments × 512 tokens.

In [18]:
raw_articles = [x['article'] for x in raw_dataset]
raw_articles = [x for x in raw_articles if len(x) > 5120]
print ("number of articles", len(raw_articles))

number of articles 3401


In [19]:
unique_chars = set(''.join([i for i in raw_articles]))
print ("character set length", len(unique_chars))
print ("character set", ''.join(sorted(unique_chars)))

character set length 70
character set 
 !"#$%&'()*+,-./0123456789:;<=>?@[\]^_`abcdefghijklmnopqrstuvwxyz{|}~


The next cell converts the first raw article (a string) into an array of 8-bit unsigned integers (`dtype=np.uint8`), where each integer is the ASCII code of one character, and then takes the first 512 tokens (characters) of that array.

In [20]:
# np.fromstring's binary mode is deprecated; np.frombuffer on the encoded bytes is the replacement
np.frombuffer(raw_articles[0].encode("ascii"), dtype=np.uint8)[:512]

array([ 97, 100, 100, 105, 116, 105, 118, 101,  32, 109, 111, 100, 101,
       108, 115,  32,  64, 120,  99, 105, 116, 101,  32, 112, 114, 111,
       118, 105, 100, 101,  32,  97, 110,  32, 105, 109, 112, 111, 114,
       116,  97, 110, 116,  32, 102,  97, 109, 105, 108, 121,  32, 111,
       102,  32, 109, 111, 100, 101, 108, 115,  32, 102, 111, 114,  32,
       115, 101, 109, 105, 112,  97, 114,  97, 109, 101, 116, 114, 105,
        99,  32, 114, 101, 103, 114, 101, 115, 115, 105, 111, 110,  32,
       111, 114,  32,  99, 108,  97, 115, 115, 105, 102, 105,  99,  97,
       116, 105, 111, 110,  32,  46,  32, 115, 111, 109, 101,  32, 114,
       101,  97, 115, 111, 110, 115,  32, 102, 111, 114,  32, 116, 104,
       101,  32, 115, 117,  99,  99, 101, 115, 115,  32, 111, 102,  32,
        97, 100, 100, 105, 116, 105, 118, 101,  32, 109, 111, 100, 101,
       108, 115,  32,  97, 114, 101,  32, 116, 104, 101, 105, 114,  32,
       105, 110,  99, 114, 101,  97, 115, 101, 100,  32, 102, 10

The function decode_text takes a sequence of tokens (numerical values) and converts them into text by interpreting each token as an ASCII code. Here's what it does step by step:

- Input: It accepts a sequence of integers (tokens).
- Conversion to Characters: It uses the chr() function to convert each integer into its corresponding ASCII character.
- Concatenation: The resulting characters are joined into a single string using ''.join().

In [21]:
def decode_text(tokens):
    return ''.join([chr(i) for i in tokens])

decode_text(np.frombuffer(raw_articles[0].encode("ascii"), dtype=np.uint8)[:512])

'additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models . \n it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models . \n many ex'

In [8]:
raw_articles[0][:512]

'additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models . \n it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models . \n many ex'
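
The two outputs above match, which shows that encoding and decoding are inverses of each other. A small self-contained round trip makes the same point. Here `encode_text` is a hypothetical helper (the notebook inlines this step), and the sketch assumes the text is pure ASCII, as the character set printed earlier suggests.

```python
import numpy as np

def encode_text(text):
    # Hypothetical helper: one uint8 character code per ASCII character.
    return np.frombuffer(text.encode("ascii"), dtype=np.uint8)

def decode_text(tokens):
    return ''.join(chr(i) for i in tokens)

sample = "additive models @xcite"
tokens = encode_text(sample)

print(tokens[:8])
print(decode_text(tokens) == sample)
```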

In [22]:
converted = [np.frombuffer(doc.encode("ascii"), dtype=np.uint8) for doc in raw_articles]

### Text Preprocessing

Articles are converted into a numerical format suitable for PyTorch tensors:
- Encoding: Converts text to `np.uint8` arrays of character codes.
- Clipping: Trims each article so that its length is a multiple of the chunk size.
- Chunking: Divides articles into smaller arrays of fixed size.

Clipping ensures that documents can be split into equal-sized chunks of 5120 tokens. For example, if a document's original length is 5157 tokens, clipping removes the trailing 37 tokens, leaving 5120 tokens.
This avoids issues with uneven chunking during further processing (e.g., when reshaping or dividing the data into smaller segments for batching or training).

In [23]:
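def clip_article(doc, chunk_size):
    # Keep the longest prefix whose length is a multiple of chunk_size.
    # An explicit end index avoids doc[:-0], which would return an empty array
    # when the length is already an exact multiple.
    remainder = len(doc) % chunk_size
    return doc[:len(doc) - remainder]

clipped = [clip_article(doc, 5120) for doc in converted]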

In [25]:
clipped = [clip_article(doc, 5120) for doc in converted if len(doc) >= 5120]

In [11]:
clipped[1].shape[0] / 5120

3.0
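
One edge case is worth checking when clipping with a negative slice: a remainder of zero turns `doc[:-remainder]` into `doc[:-0]`, which is an empty slice. The sketch below uses a hypothetical helper, `clip_to_multiple`, and tiny arrays to show the pitfall and one way to avoid it.

```python
import numpy as np

def clip_to_multiple(doc, chunk_size):
    """Keep the longest prefix of doc whose length is a multiple of chunk_size."""
    usable = (len(doc) // chunk_size) * chunk_size
    return doc[:usable]

exact = np.arange(10)    # length is already a multiple of 5
ragged = np.arange(12)   # 2 trailing tokens to drop
short = np.arange(3)     # shorter than one chunk

print(len(exact[:-0]))   # the pitfall: slicing with -0 yields an empty array
print(len(clip_to_multiple(exact, 5)),
      len(clip_to_multiple(ragged, 5)),
      len(clip_to_multiple(short, 5)))
```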

The cell below splits each clipped article into equally sized chunks (of size `chunk_size`) and stacks the resulting chunks into a new NumPy array.

**Reshape Each Document**:
   - For each document (`doc`) in `clipped`, `doc.reshape(-1, chunk_size)` splits the document into multiple chunks of size `chunk_size`.
   - The `-1` in `.reshape(-1, chunk_size)` automatically determines the number of chunks required based on the document length.

### Example
Suppose:
- `chunk_size = 5120`.
- A document in `clipped` has 10240 tokens.

The reshaping:
- Splits the document into `10240 / 5120 = 2` chunks.
- The resulting array for this document would have a shape of `(2, 5120)`.

If `clipped` contains 3 such documents, concatenating their reshaped arrays along axis 0 gives `chunked` a 2D shape:
- Shape: `(6, 5120)` (3 documents × 2 chunks each, with each chunk having 5120 tokens).

### Purpose
The reshaping ensures that the data is ready for batch processing or sequential input into a machine learning model, where fixed-sized chunks are often required. It simplifies further operations, such as:
- Combining chunks into batches for training.
- Processing data sequentially in smaller segments.

In [30]:
chunked = np.concatenate([doc.reshape(-1, chunk_size) for doc in clipped], axis=0)

`np.array()` would require all reshaped documents to have the same shape in order to form a homogeneous array. Documents yield different numbers of chunks, so `np.concatenate` along axis 0 is used instead.

`chunked` is a single 2D NumPy array in which all reshaped documents are stacked together. Its shape is `(total_chunks, 5120)`, where `total_chunks` is the sum of the rows across all reshaped documents.
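
A toy version of this step, with a hypothetical `chunk_size` of 4 instead of 5120 so the shapes are easy to read:

```python
import numpy as np

chunk_size = 4  # small hypothetical value; the notebook uses 5120
docs = [
    np.arange(8, dtype=np.uint8),    # reshapes into 2 chunks
    np.arange(12, dtype=np.uint8),   # reshapes into 3 chunks
]

reshaped = [doc.reshape(-1, chunk_size) for doc in docs]
chunked = np.concatenate(reshaped, axis=0)   # stack rows from all documents

print([r.shape for r in reshaped], chunked.shape)
```

The per-document arrays have different numbers of rows, but they share the same row length, which is all that concatenation along axis 0 requires.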

- Tensors are an extension of arrays with additional features, commonly used in deep learning and PyTorch (but also found in TensorFlow and other ML libraries).
- A tensor is essentially a multi-dimensional array that allows efficient operations on GPUs, which are crucial for training models.
- Tensors can also track gradients for automatic differentiation, which is key for backpropagation in neural networks.
- In short, PyTorch tensors are similar to NumPy arrays, with added support for GPU computation and automatic differentiation.

In [31]:
# converts the list or array chunked into a PyTorch tensor with a data type of torch.long, which is commonly used for storing integer values.
processed_data = torch.tensor(chunked, dtype=torch.long)
processed_data.shape

torch.Size([20853, 5120])
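
The cell above stores every token as `torch.long`, which takes 8 bytes per token. One possible alternative, sketched here as an assumption rather than as this notebook's approach, is to keep the data as `uint8` and widen each batch only after it has been moved to the training device. The tensor `compact` and the helper `to_model_batch` are hypothetical, and the code falls back to the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small stand-in for the chunked data: 16 chunks of 32 character codes.
compact = torch.randint(0, 128, (16, 32), dtype=torch.uint8)

def to_model_batch(batch, device):
    """Move a uint8 batch to the device, then widen to int64 for nn.Embedding."""
    return batch.to(device, non_blocking=True).long()

batch = to_model_batch(compact[:8], device)
print(batch.dtype, tuple(batch.shape), batch.device.type)
print(compact.element_size(), batch.element_size())  # bytes per token before and after widening
```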

### DataLoader Setup

- The dataset is split into 80% train, 10% validation, and 10% test subsets.
- PyTorch DataLoader is used to create shuffled batches for each split.

In [33]:
eighty_split = int(processed_data.shape[0] * .8)
ninety_split = int(processed_data.shape[0] * .9)
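
With the 20853 chunks reported above, the two split points give the following subset sizes. The sketch repeats the same arithmetic on a plain integer so it runs on its own.

```python
num_chunks = 20853  # number of rows reported for processed_data

eighty_split = int(num_chunks * .8)
ninety_split = int(num_chunks * .9)

train_size = eighty_split
val_size = ninety_split - eighty_split
test_size = num_chunks - ninety_split

print(train_size, val_size, test_size)
```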

DataLoader is a convenient way to handle large datasets efficiently by batching, shuffling, and providing data in a form that is ready to be fed into your model for training or evaluation.
- For the training set, `batch_size=8` means each batch contains 8 data samples, and `shuffle=True` ensures the data is shuffled before being grouped into batches.
- Wrapping the DataLoader in `iter()` returns an iterator, so batches can be pulled one at a time with `next()`. A DataLoader can already be used directly in a `for` loop; the explicit iterator is only needed for `next()`, and it raises `StopIteration` once every batch in the split has been consumed.

In [34]:
train_loader = iter(DataLoader(processed_data[:eighty_split], batch_size = 8, shuffle = True))
val_loader = iter(DataLoader(processed_data[eighty_split:ninety_split], batch_size = 8, shuffle = True))
test_loader = iter(DataLoader(processed_data[ninety_split:], batch_size = 8, shuffle = True))

In [36]:
example = next(train_loader)
example.shape

torch.Size([8, 5120])
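
The difference between looping over a DataLoader and calling `next()` on its iterator can be seen on a tiny tensor. The data and sizes here are hypothetical, and shuffling is turned off so the result is deterministic.

```python
import torch
from torch.utils.data import DataLoader

data = torch.arange(20 * 6).reshape(20, 6)   # hypothetical: 20 "chunks" of 6 tokens
loader = DataLoader(data, batch_size=8, shuffle=False)

# A DataLoader is iterable, so a for loop works directly and can be repeated each epoch.
shapes = [tuple(batch.shape) for batch in loader]

# iter() gives a one-shot iterator for use with next().
it = iter(loader)
first = next(it)
next(it)
next(it)
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True   # all three batches have been consumed

print(shapes, exhausted)
```

The last batch is smaller because 20 is not a multiple of 8. Passing `drop_last=True` to `DataLoader` would discard it.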

Shifting the batch by one token is a common technique in sequence-based tasks like language modeling. It produces two views of the same data: the input sequence (`seq`) and the target labels (`labels`), so that the model is trained to predict the next token in the sequence.
- `example`: A batch of token sequences with shape `(8, 5120)`. Each element is the numerical representation of a token (here, the code of one character).
- `[:, :-1]`: For every sequence in the batch, this keeps everything from the start (index 0) up to, but not including, the last element. This is the input sequence (`seq`) that the model uses to predict the next token.
- `[:, 1:]`: For every sequence in the batch, this keeps everything from the second element (index 1) to the end. This is the target sequence (`labels`), which is used as the ground truth for training. The model tries to predict the token at index i+1 given the token at index i.

In [37]:
seq, labels = example[:, :-1], example[:, 1:]
print(seq.shape)
print(seq[0][:15])
print(labels.shape)
print(labels[0][:15])

torch.Size([8, 5119])
tensor([ 32, 108, 101,  97, 115, 116,  32, 116, 119, 111,  32, 118, 101, 114,
        116])
torch.Size([8, 5119])
tensor([108, 101,  97, 115, 116,  32, 116, 119, 111,  32, 118, 101, 114, 116,
        105])
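
The same shift is easier to see on a short string. This sketch uses plain Python lists for a single sequence. The slicing matches `example[:, :-1]` and `example[:, 1:]` without the batch dimension.

```python
text = "hello"
tokens = [ord(c) for c in text]

seq, labels = tokens[:-1], tokens[1:]   # inputs and next-token targets

# Each input character is paired with the character that follows it.
pairs = [(chr(s), chr(t)) for s, t in zip(seq, labels)]
print(pairs)
```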


In [38]:
seq.chunk(10, dim=-1)[0].shape

torch.Size([8, 512])

The next cell iterates over matching segments of the inputs (`seq`) and the targets (`labels`), each split into 10 parts by `chunk(10, dim=-1)`, and prints the decoded text of the second sequence in the batch (index 1) for every segment.
- `chunk(10, dim=-1)` splits the `seq` tensor into 10 chunks along the last dimension. The `-1` refers to the last axis, so for a 2D tensor of shape `(batch_size, sequence_length)` this operation splits each sequence into 10 consecutive subsequences of about `sequence_length / 10` tokens.
- `zip` pairs each chunk of `seq` with the corresponding chunk of `labels`. This creates an iterable where each element is a tuple containing a `seq_segment` and a `labels_segment`.
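
Because `seq` has 5119 tokens per sequence after the shift, the 10 parts are not all the same length. A quick check on a zero tensor of the same shape:

```python
import torch

seq = torch.zeros(8, 5119, dtype=torch.long)   # same shape as seq in the notebook
parts = seq.chunk(10, dim=-1)                  # the argument is the number of chunks, not their size

sizes = [p.shape[-1] for p in parts]
print(len(parts), sizes)
```

The first nine segments have 512 tokens and the last one has 511.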

In [39]:
for seq_segment, labels_segment in zip(seq.chunk(10, dim = -1), labels.chunk(10, dim = -1)):
    print(decode_text(seq_segment[1]), "\n *********** \n")

 carried out a literature search in order to identify known objects associated with the infrared point source centroids . 
 table  [ targets ] lists the targeted point sources from this study , corresponding labels from @xcite and @xcite , and additional information from the literature . 
 all point sources are identified as sites of recent or ongoing star formation , and are typically young protostars or young stellar clusters . a number of the point sources exhibit interesting spectral features that are w 
 *********** 

orth a closer look . 
 we describe these point sources in detail below . 
 unresolved emission from two bright star clusters in n66 exhibit silicate emission features in their spectra : ngc  346 ( ps9 ) and n66b ( ps6 ) ; see figure  [ starclusters ] . 
 both of these point sources are bright h@xmath9  sources @xcite , contain ` blue ' stars @xcite , and have been modeled as @xmath0  3  myr old with _ hubble _ color - magnitude diagrams @xcite . 
 this age is consist

### Example Training Loop

Model Description:
- An embedding layer maps each input token to a 16-dimensional vector.
- Hidden layers apply linear transformations with ReLU activation.
- The output layer produces a score for each of the 128 possible next tokens.

Training Loop:
- Each batch is divided into 10 smaller segments to limit the memory needed per forward and backward pass.
- Backpropagation and an optimizer step occur for each segment separately.

Validation and Testing:
- Validation runs on one validation batch every 50 iterations to monitor model performance, with gradient tracking disabled so no training takes place.

In [40]:
model = nn.Sequential(
    nn.Embedding(128, 16),  # (vocab_size, embedding_dim): embedding layer
    nn.Linear(16, 150),
    nn.ReLU(),
    nn.Linear(150, 150),
    nn.ReLU(),
    nn.Linear(150, 128),  # (hidden_dim, vocab_size): one logit per possible next token
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
model.train()

Sequential(
  (0): Embedding(128, 16)
  (1): Linear(in_features=16, out_features=150, bias=True)
  (2): ReLU()
  (3): Linear(in_features=150, out_features=150, bias=True)
  (4): ReLU()
  (5): Linear(in_features=150, out_features=128, bias=True)
)
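
The model returns logits of shape `(batch, sequence, vocab)`, while `nn.CrossEntropyLoss` expects the class dimension in position 1. The training loop below handles this with a transpose. The sketch here, using small hypothetical sizes and random values, shows that the transpose gives the same loss as flattening the batch and sequence dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 128              # small hypothetical sizes
logits = torch.randn(batch, seq_len, vocab)    # what the model returns: (batch, seq, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

loss_fn = nn.CrossEntropyLoss()

# Option 1: move the class dimension to position 1 -> (batch, vocab, seq).
loss_transposed = loss_fn(logits.transpose(2, 1), labels)

# Option 2: flatten batch and sequence into one dimension -> (batch * seq, vocab).
loss_flat = loss_fn(logits.reshape(-1, vocab), labels.reshape(-1))

print(loss_transposed.item(), loss_flat.item())
```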

In [41]:
segments = 10
for i in range(300):

    data = next(train_loader) # (batch_size, sequence_length) # (8, 5120)
    seq, labels = data[:, :-1], data[:, 1:]
    train_loss = 0.
    model.train()

    for seq_segment, labels_segment in zip(seq.chunk(segments, dim = -1), labels.chunk(segments, dim = -1)): # ten passes of (8, 512)
        optimizer.zero_grad()
        y_pred = model(seq_segment)
        y_pred = y_pred.transpose(2,1) # (batch, seq, vocab) -> (batch, vocab, seq), the layout CrossEntropyLoss expects
        loss = loss_fn(y_pred, labels_segment)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    if i % 5 == 0:
        print (train_loss / segments)

    if i > 0 and i % 50 == 0:
        val_data = next(val_loader)
        seq, labels = val_data[:, :-1], val_data[:, 1:]
        eval_loss = 0.
        model.eval()
        for seq_segment, labels_segment in zip(seq.chunk(segments, dim = -1), labels.chunk(segments, dim = -1)): # ten passes of (8, 512)
            with torch.no_grad():
                y_pred = model(seq_segment)
                y_pred = y_pred.transpose(2,1)
                loss = loss_fn(y_pred, labels_segment)
                eval_loss += loss.item()

        print ("VALIDATION LOSS", (eval_loss / segments))

4.814817905426025
3.471447801589966
3.1637600898742675
2.9949301719665526
3.086688208580017
2.7922783136367797
2.8021355867385864
2.7291361331939696
2.9221490621566772
2.6424169540405273
2.6004881143569945
VALIDATION LOSS 2.771096420288086
2.5817805528640747
2.5987738132476808
2.6307487726211547
2.694489097595215
2.5678778648376466
2.4874101638793946
2.5722203254699707
2.5212898015975953
2.7539854288101195
2.4800504207611085
VALIDATION LOSS 2.565450644493103
2.5101852416992188
2.6612972974777223
2.4917702674865723
2.5158987998962403
2.450358510017395
2.4149893045425417
2.460058665275574
2.66586594581604
2.5224470138549804
2.465202474594116
VALIDATION LOSS 2.730003571510315
2.4767351388931274
2.5565998792648315
2.5127267360687258
2.4015066385269166
2.3902504444122314
2.4590445041656492
2.424598789215088
2.4134016752243044
2.390960693359375
2.584371018409729
VALIDATION LOSS 2.51204776763916
2.6035442113876344
2.417050862312317
2.4082767963409424
2.582948160171509
2.520162510871887
2.4698

In [42]:
test_data = next(test_loader)
seq, labels = test_data[:, :-1], test_data[:, 1:]
test_loss = 0.
model.eval()
for seq_segment, labels_segment in zip(seq.chunk(segments, dim = -1), labels.chunk(segments, dim = -1)): # ten passes of (8, 512)
    with torch.no_grad():
        y_pred = model(seq_segment)
        y_pred = y_pred.transpose(2,1)
        loss = loss_fn(y_pred, labels_segment)
        test_loss += loss.item()

print ("TEST LOSS", (test_loss / segments))

TEST LOSS 2.447762060165405
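
To put these loss values in context, `nn.CrossEntropyLoss` uses the natural logarithm, so a model that guesses uniformly over 128 tokens would score ln(128). The first training loss printed above (about 4.81) is close to that baseline. The sketch below also converts the reported test loss into a per-character perplexity; this interpretation is added for illustration and is not computed in the notebook.

```python
import math

vocab_size = 128
uniform_loss = math.log(vocab_size)   # loss of a model that guesses uniformly

test_loss = 2.447762060165405         # value reported above
perplexity = math.exp(test_loss)      # effective number of equally likely next characters

print(round(uniform_loss, 3), round(perplexity, 2))
```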


### Key Takeaways:

- Efficient Data Processing: Splitting articles into fixed-size chunks gives every batch the same shape, which keeps batching simple and memory use predictable.
- Iterative Approach: The training loop processes each batch in smaller segments sequentially to limit memory overhead.
- Dynamic Testing: Validation during training and a final test pass check how well the model generalizes to unseen data.