# Lesson 3: Data Packaging
## 1. Tokenizing and creating input_ids

Start by loading the dataset from the previous lesson:

In [1]:
import datasets

dataset = datasets.load_dataset(
    "parquet", 
    data_files="./data/preprocessed_dataset.parquet", 
    split="train"
)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 40474
})


Use the `shard` method of the Hugging Face `Dataset` object to split the dataset into 10 smaller pieces, or *shards* (think shards of broken glass), and keep only the first one (`index=0`), which leaves roughly a tenth of the rows. You can read more about sharding at [this link](https://huggingface.co/docs/datasets/en/process#shard).

In [2]:
dataset = dataset.shard(num_shards=10, index=0)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 4048
})
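
To make the row count above easier to follow, here is a minimal pure-Python sketch of two common ways to split rows into shards. `shard_indices` is a hypothetical helper written for this illustration, not part of `datasets`, and it is not the library's implementation. `Dataset.shard` takes a `contiguous` argument that chooses between layouts like these; its default may depend on the installed version, so check the linked documentation.

```python
def shard_indices(num_rows, num_shards, index, contiguous):
    """Return the row indices of shard `index` (illustrative helper only)."""
    if contiguous:
        # Consecutive blocks; the first `num_rows % num_shards` shards get one extra row.
        div, mod = divmod(num_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return range(start, end)
    # Strided layout: every `num_shards`-th row, starting at `index`.
    return range(index, num_rows, num_shards)


num_rows = 40474  # size of the dataset loaded above
for contiguous in (True, False):
    rows = shard_indices(num_rows, num_shards=10, index=0, contiguous=contiguous)
    print(contiguous, len(rows), list(rows[:3]))
```

Either layout gives 4048 rows for shard 0 of this dataset, which matches the `num_rows` printed above.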


Load the tokenizer and try it out:

In [21]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path_or_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path_or_name)

print("Tokenizer and Model loaded successfully!")




Tokenizer and Model loaded successfully!


In [25]:
tokenizer.tokenize("I'm a short sentence")

['I', "'m", 'Ġa', 'Ġshort', 'Ġsentence']

In [23]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path_or_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path_or_name)

print("Tokenizer and Model loaded successfully!")


Tokenizer and Model loaded successfully!


In [26]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

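input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")
# Note: `temperature` only takes effect when sampling (`do_sample=True`);
# with the default settings used here, `generate` decodes greedily.
output = model.generate(inputs.input_ids, max_length=50, num_return_sequences=1, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))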


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The future of AI is not yet clear.

In [27]:
tokenizer.tokenize("Samridhi likes cats")

['Sam', 'rid', 'hi', 'Ġlikes', 'Ġcats']

In [28]:
# The 'Ġ' prefix is how this tokenizer represents a leading space
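
The `Ġ` marker in the tokenizer outputs above can be read back into plain text by hand. The snippet below is a small sketch that assumes plain ASCII input. For arbitrary text the byte-level mapping is more involved, and the tokenizer's own `convert_tokens_to_string` method is the safer choice.

```python
# Tokens copied from the tokenizer output above.
tokens = ["I", "'m", "Ġa", "Ġshort", "Ġsentence"]

# In GPT-2's byte-level BPE vocabulary, "Ġ" stands in for a leading space.
text = "".join(tokens).replace("Ġ", " ")
print(text)

# Tokens that carry the marker start a new word; the others continue the previous one.
word_starts = [t for t in tokens if t.startswith("Ġ")]
print(word_starts)
```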

Create a helper function:

In [29]:
def tokenization(example):
    # Tokenize
    tokens = tokenizer.tokenize(example["text"])

    # Convert tokens to ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Add the <bos> and <eos> token ids to the front and back of token_ids
    # bos: beginning of sequence, eos: end of sequence
    token_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.eos_token_id]
    example["input_ids"] = token_ids

    # We will be using this column to count the total number of tokens 
    # in the final dataset
    example["num_tokens"] = len(token_ids)
    return example
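
To see what the helper adds to each example without loading a real tokenizer, here is a toy version. The vocabulary, the whitespace splitting, and the name `toy_tokenization` are all assumptions made for this illustration. Only the wrapping with boundary ids and the `num_tokens` count mirror the helper above.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"<bos>": 0, "<eos>": 1, "cats": 2, "like": 3, "naps": 4}
BOS_ID, EOS_ID = vocab["<bos>"], vocab["<eos>"]


def toy_tokenization(example):
    # Whitespace splitting stands in for the real subword tokenizer.
    tokens = example["text"].split()
    token_ids = [vocab[t] for t in tokens]
    # Wrap the ids with the boundary markers, as in the helper above.
    token_ids = [BOS_ID] + token_ids + [EOS_ID]
    return {**example, "input_ids": token_ids, "num_tokens": len(token_ids)}


result = toy_tokenization({"text": "cats like naps"})
print(result["input_ids"])   # [0, 2, 3, 4, 1]
print(result["num_tokens"])  # 5
```

Note that `num_tokens` counts the two boundary ids as well as the word tokens.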

Tokenize all the examples in the pretraining dataset:

In [30]:
dataset = dataset.map(tokenization, load_from_cache_file=False)
print(dataset)

Token indices sequence length is longer than the specified maximum sequence length for this model (2000 > 1024). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['text', 'input_ids', 'num_tokens'],
    num_rows: 4048
})


In [33]:
sample = dataset[1]

print("text", sample["text"][:30])
print("\ninput_ids", sample["input_ids"][:30])
print("\nnum_tokens", sample["num_tokens"])

text I recently upgraded to iTunes 

input_ids [50256, 40, 2904, 17955, 284, 4830, 1105, 13, 19, 319, 616, 670, 13224, 357, 11209, 838, 8, 290, 6810, 617, 7650, 4069, 351, 617, 711, 20713, 290, 7259, 13, 1081]

num_tokens 2002


Check the total number of tokens in the dataset:

In [34]:
import numpy as np
np.sum(dataset["num_tokens"])

4624174

## 2. Packing the data

![Packing data for training](./data_packing.png)

Concatenate the `input_ids` of all examples into a single one-dimensional array:

In [None]:
# Serialization is the first step of packing: flatten all examples into one token stream

In [35]:
input_ids = np.concatenate(dataset["input_ids"])
print(len(input_ids))

4624174


In [37]:
max_seq_length = 32  # number of tokens in each packed training sequence

In [38]:
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(total_length)

4624160


Discard the extra tokens from the end of the array so that the number of tokens is exactly divisible by `max_seq_length`:

In [39]:
input_ids = input_ids[:total_length]
print(input_ids.shape)

(4624160,)


In [40]:
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape  

(144505, 32)
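
The numbers printed above can be checked with plain integer arithmetic. This short sketch uses only the token count and `max_seq_length` from the cells above; the variable names `num_sequences`, `leftover`, and `kept_tokens` are introduced here for illustration.

```python
total_tokens = 4624174  # length of the concatenated input_ids above
max_seq_length = 32

# Quotient = number of full rows, remainder = tokens that get discarded.
num_sequences, leftover = divmod(total_tokens, max_seq_length)
kept_tokens = num_sequences * max_seq_length

print(num_sequences)  # 144505
print(leftover)       # 14
print(kept_tokens)    # 4624160
```

Only 14 tokens are dropped, a negligible loss compared with the 4.6 million that remain.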

In [41]:
type(input_ids_reshaped)

numpy.ndarray

Convert to Hugging Face dataset:

In [42]:
input_ids_list = input_ids_reshaped.tolist()
packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {"input_ids": input_ids_list}
)
print(packaged_pretrain_dataset)

Dataset({
    features: ['input_ids'],
    num_rows: 144505
})
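
The whole packing procedure can also be followed on a toy example small enough to read by eye. The sketch below uses plain Python lists and three made-up documents. The id `0` is an assumption standing in for both the begin and end markers, since GPT-2's tokenizer uses the same id (50256) for both.

```python
# Three hypothetical tokenized documents; 0 plays the role of both <bos> and <eos>.
docs = [[0, 11, 12, 13, 0], [0, 21, 22, 0], [0, 31, 32, 33, 34, 0]]
max_seq_length = 4

# 1. Concatenate all documents into one token stream.
stream = [tok for doc in docs for tok in doc]

# 2. Drop the tail that does not fill a whole row.
total_length = len(stream) - len(stream) % max_seq_length
stream = stream[:total_length]

# 3. Cut the stream into rows of max_seq_length tokens.
rows = [stream[i:i + max_seq_length] for i in range(0, total_length, max_seq_length)]
for row in rows:
    print(row)
```

The rows come out as `[0, 11, 12, 13]`, `[0, 0, 21, 22]`, and `[0, 0, 31, 32]`. Documents are free to straddle row boundaries, and the marker ids are the only signal of where one document ends and the next begins.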


## 3. Save the packed dataset to disk

In [43]:
packaged_pretrain_dataset.to_parquet("./data/packaged_pretrain_dataset.parquet")

19074660
