# Lecture 2: Data Preparation

*This notebook also covers Lecture 3.*

In this notebook, we'll carry out some of the data cleaning steps required to prepare data for pretraining.

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Sourcing datasets for pretraining

In this section, we'll see two ways to source data for training:
1. Download an existing dataset from Hugging Face
2. Create a dataset of Python scripts sourced from GitHub

### Download data from Hugging Face

The dataset we'll download is a subset of a much larger dataset called **Red Pajama**. The full 1-trillion-token dataset is available on Hugging Face at [this link](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).

In [2]:
import datasets

pretraining_dataset = datasets.load_dataset(
    "upstage/Pretraining_Dataset",
    split="train"
)

pretraining_dataset.parquet:   0%|          | 0.00/150M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/60000 [00:00<?, ? examples/s]

In [3]:
print(pretraining_dataset)

Dataset({
    features: ['text', 'meta'],
    num_rows: 60000
})


In [4]:
# We only need the "text" column
pretraining_dataset = pretraining_dataset.select_columns(
    ["text"]
)

In [5]:
# Let's see what it looks like
print(pretraining_dataset[0]["text"][:500])

In 1793 Zaman Shah, a grandson of Ahmad Shah Durrani, won a brief war of succession to become ruler of Afghanistan. The support of Painda Khan, chief of the Baraksai branch of the Durrani tribe, was decisive in his victory. In the next fifty year., the brothers of Zaman shah and the sons of Painda Khan were to dominate the affairs of Afghanistan. The Durrani tribe was very large with several branches and numerous clans. 1 Abmad Shah and his successors belonged to the Sadozai clan, but other clan


### Compare pretraining and fine-tuning datasets

Now, we'll download a fine-tuning dataset to contrast with the pretraining dataset loaded above. 

***Note***: For background, see the [Alpaca model and instruction tuning dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html).

In [6]:
instruction_dataset = datasets.load_dataset(
    "c-s-ale/alpaca-gpt4-data",
    split='train'
)
print(instruction_dataset)

README.md:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

alpaca_gpt4_data.json:   0%|          | 0.00/43.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 52002
})


In [7]:
i=21
print("Instruction: " + instruction_dataset[i]["instruction"] 
      + "\nInput: " + instruction_dataset[i]["input"] 
      + "\nOutput: " + instruction_dataset[i]["output"])

Instruction: Compare and contrast the Cuban Missile Crisis and the Vietnam War.
Input: 
Output: The Cuban Missile Crisis (1962) and the Vietnam War (1955-1975) were two pivotal events in US history that occurred during the Cold War. Although both events were characterized by heightened tensions between the United States and Communist powers, they had some key differences that set them apart.

The Cuban Missile Crisis was a 13-day political and military standoff between the United States and the Soviet Union over the installation of Soviet nuclear missiles in Cuba, just 90 miles from US shores. In contrast, the Vietnam War was a prolonged conflict that lasted for almost two decades, involving the US, South Vietnam, North Vietnam, and other Communist allies. The war was a result of the US attempting to contain the spread of communism in Southeast Asia.

The Cuban Missile Crisis is often considered the closest the world ever came to nuclear war, with the US and the Soviet Union coming to 

In contrast to the pretraining data, fine-tuning datasets are structured into question-answer pairs or instruction-response sets that can include additional input context if required.
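
To make the contrast concrete, here is a minimal sketch of how each record type might be turned into a single training string. The prompt layout in `format_instruction_example` is a hypothetical Alpaca-style template chosen for illustration, not necessarily the one used to train any particular model, and the sample records are made up.

```python
def format_pretraining_example(record):
    # Pretraining records are raw text and are used as-is.
    return record["text"]


def format_instruction_example(record):
    # Hypothetical prompt template (an assumption for illustration).
    parts = ["### Instruction:\n" + record["instruction"]]
    if record["input"]:  # the input field is optional context
        parts.append("### Input:\n" + record["input"])
    parts.append("### Response:\n" + record["output"])
    return "\n\n".join(parts)


# Made-up sample records.
pretraining_record = {"text": "In 1793 Zaman Shah won a brief war of succession."}
instruction_record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}

print(format_pretraining_example(pretraining_record))
print(format_instruction_example(instruction_record))
```

The pretraining record has no roles or fields to arrange, while the instruction record needs a template that decides where the instruction, optional input, and response go.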

### Scrape Python code from GitHub

Let's download a selection of Python scripts from GitHub and then prepare them as a Hugging Face `Dataset` object to use in training. 

The same pattern here will work for preparing any text scraped from the web.

In [None]:
# Import some required packages
import os
import requests

# Path to the directory where the Python scripts will be stored
code_dir = "./working/code/"

In [9]:
# if code_dir does not exist, then create it

if not os.path.exists(code_dir):
    os.makedirs(code_dir)

In [10]:
urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py",
    "https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py",
    "https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py",
    "https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py",
    "https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
    "https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/django/contrib/messages/__init__.py",
    "https://raw.githubusercontent.com/PaliC/pytorch/master/test/fx/test_subgraph_rewriter.py"
]

In [11]:
# Retrieve the Python scripts
from tqdm.auto import tqdm

pbar = tqdm(total=len(urls))

for url in urls:
    print(f"Working on url: {url}")
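    response = requests.get(url)
    # Note: the response body is saved even if the request failed (e.g. a 404),
    # so check response.status_code if the file contents matter.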
    file_name = os.path.basename(url)
    file_path = os.path.join(code_dir, file_name)
    
    with open(file_path, "wb") as file:
        file.write(response.content)
        
    pbar.update(1)

  0%|          | 0/9 [00:00<?, ?it/s]

Working on url: https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py
Working on url: https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py
Working on url: https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py
Working on url: https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py
Working on url: https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py
Working on url: https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py
Working on url: https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py
Working on url: https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/djan
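
One thing to watch when scraping more files: `os.path.basename(url)` keeps only the last path component, so two URLs that end in the same name (for example two `__init__.py` files) would overwrite each other in `code_dir`. The sketch below shows one possible way around that, by prefixing the name with a short hash of the full URL. `unique_file_name` is a hypothetical helper and the URLs are placeholders, not part of the original notebook.

```python
import hashlib
import os
from urllib.parse import urlparse


def unique_file_name(url):
    # Hypothetical helper: prefix the base name with a short hash of the
    # full URL so that identical base names from different repos differ.
    base_name = os.path.basename(urlparse(url).path)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{base_name}"


# Placeholder URLs, used only to illustrate the collision.
url_a = "https://example.com/repo_a/pkg/__init__.py"
url_b = "https://example.com/repo_b/pkg/__init__.py"

print(os.path.basename(url_a) == os.path.basename(url_b))  # plain names collide
print(unique_file_name(url_a), unique_file_name(url_b))    # hashed names differ
```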

In [12]:
files = os.listdir(code_dir)
for file in files:
    print(file)

module_util.py
numpy_mlp.py
test_subgraph_rewriter.py
double_linear_search_recursion.py
distribute_coordinator_context.py
__init__.py
version.py
values.py
visualize.py


In [13]:
code_dataset = []

for file in os.listdir(code_dir):
    # Use a context manager so each file handle is closed after reading
    with open(os.path.join(code_dir, file), 'r') as f:
        code_dataset.append({'text': f.read()})

In [14]:
for i, elem in enumerate(code_dataset):
    for k, v in elem.items():
        print("{} element has {} of length {}.".format(i+1, 
                                                       k, 
                                                       len(v)
                                                      )
             )

1 element has text of length 1905.
2 element has text of length 14.
3 element has text of length 15037.
4 element has text of length 911.
5 element has text of length 14.
6 element has text of length 106.
7 element has text of length 68.
8 element has text of length 17041.
9 element has text of length 16631.


In [15]:
# Convert list to Hugging Face `Dataset` object
code_dataset = datasets.Dataset.from_list(code_dataset)
print(code_dataset)

Dataset({
    features: ['text'],
    num_rows: 9
})


In [16]:
# Combine the python code dataset with the pretraining dataset

dataset = datasets.concatenate_datasets(
    [pretraining_dataset, code_dataset]
)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 60009
})


## Data cleaning

Let's apply the following cleaning steps:
1. Filter out samples that are too short
2. Remove repetitions within a single text example
3. Remove duplicated documents
4. Apply a quality filter to remove non-English texts

In [17]:
memory_numrows = [dataset.num_rows]
dataset.num_rows

60009

### Remove examples that are too short

In [18]:
import heapq

def paragraph_length_filter(x):
    """
    The function checks if a text is long enough both in number of lines 
    and in line length:
          1. At least 3 lines
          2. The 3 longest lines must each be at least 3 characters
    
    Returns False if a page has too few lines or lines are too short.
    """
    lines = x['text'].split('\n')
    if (
        len(lines) < 3  # reject texts with fewer than 3 lines
        # Take the 3 longest lines; if the shortest of them has fewer
        # than 3 characters, the page isn't useful, so reject it.
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True

In [19]:
dataset = dataset.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/60009 [00:00<?, ? examples/s]

In [20]:
memory_numrows.append(dataset.num_rows)
print("Dataset now has {} rows, which is {} rows fewer than the previous version.".
      format(dataset.num_rows, memory_numrows[-2]-dataset.num_rows))

Dataset now has 52356 rows, which is 7653 rows fewer than the previous version.
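
To see what the length filter above accepts and rejects, here is a small self-contained sketch that applies the same rule to three made-up samples. The function is restated so the block runs on its own; the sample texts are illustrative only.

```python
import heapq


def paragraph_length_filter(x):
    # Same rule as the filter above: at least 3 lines, and the 3 longest
    # lines must each have at least 3 characters.
    lines = x["text"].split("\n")
    if len(lines) < 3:
        return False
    return min(heapq.nlargest(3, [len(line) for line in lines])) >= 3


# Made-up samples for illustration.
samples = {
    "one_line": "A single line of text, however long, is rejected.",
    "short_lines": "a\nb\nc",
    "ok": "First line\nSecond line\nThird line",
}

results = {name: paragraph_length_filter({"text": text}) for name, text in samples.items()}
for name, keep in results.items():
    print(name, keep)
```

Only the `ok` sample passes: `one_line` has too few lines, and `short_lines` has three lines but none of them reaches 3 characters.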


### Remove repeated text within training examples

In [21]:
def find_duplicates(paragraphs):
    """
    Count repeated elements in a list of paragraphs.

    Returns the number of duplicate elements and the total number of
    characters they contain.
    """
    # store unique elements that we've seen
    unique_x = set()

    # track:
    #   1. total character count of duplicate elements
    #   2. number of duplicate elements
    duplicate_chars = 0
    duplicate_elements = 0

    # Iterate over the paragraphs. If an element is already in unique_x,
    # it is a duplicate: add its length to duplicate_chars and 1 to
    # duplicate_elements.
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars

In [22]:
import re

def paragraph_repetition_filter(x):
    """
    Returns False if a page has too many repetitions, i.e.
        1. more than 30% of its paragraphs are duplicates, or
        2. more than 20% of its characters belong to duplicated paragraphs
    """
    # get the text
    text = x['text']
    
    # Split by paragraphs (2 or more newlines)
    paragraphs = re.compile(r"\n{2,}").split(text.strip()) 

    # Find number of duplicates in paragraphs
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
    if paragraphs_duplicates / len(paragraphs) > 0.3: # more than 30% of paragraphs are duplicates
        return False
    if char_duplicates / len(text) > 0.2: # more than 20% of the characters
                                          # belong to duplicated paragraphs
        return False
    return True

In [23]:
dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/52356 [00:00<?, ? examples/s]

In [24]:
memory_numrows.append(dataset.num_rows)
print("Dataset now has {} rows, which is {} rows fewer than the previous version.".
      format(dataset.num_rows, memory_numrows[-2]-dataset.num_rows))

Dataset now has 52326 rows, which is 30 rows fewer than the previous version.
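
The two thresholds in the repetition filter are easier to follow on a tiny example. The sketch below restates the duplicate counter and runs it on a made-up page in which a banner paragraph appears three times; the text is illustrative only.

```python
import re


def find_duplicates(paragraphs):
    # Count repeated paragraphs and the characters they contribute.
    seen = set()
    duplicate_elements = 0
    duplicate_chars = 0
    for paragraph in paragraphs:
        if paragraph in seen:
            duplicate_elements += 1
            duplicate_chars += len(paragraph)
        else:
            seen.add(paragraph)
    return duplicate_elements, duplicate_chars


# Made-up page: the banner paragraph appears three times.
text = "Subscribe now!\n\nReal content here.\n\nSubscribe now!\n\nSubscribe now!"
paragraphs = re.split(r"\n{2,}", text.strip())

dup_paragraphs, dup_chars = find_duplicates(paragraphs)
paragraph_ratio = dup_paragraphs / len(paragraphs)  # share of repeated paragraphs
char_ratio = dup_chars / len(text)                  # share of repeated characters

print(dup_paragraphs, dup_chars)
print(paragraph_ratio > 0.3 or char_ratio > 0.2)  # True means the page is rejected
```

Two of the four paragraphs are repeats, so the paragraph ratio is 0.5 and the page would be dropped by the 30% rule alone.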


### Deduplication

In [25]:
def deduplication(ds):
    def dedup_func(x):
        """Use this function to remove duplicate entries"""
        if x['text'] in unique_text:
            return False
        else:
            unique_text.add(x['text'])
            return True

    unique_text = set()

    # num_proc=1: the shared `unique_text` set only works within a single process
    ds = ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)
    return ds

dataset = deduplication(dataset)

Filter:   0%|          | 0/52326 [00:00<?, ? examples/s]

In [26]:
memory_numrows.append(dataset.num_rows)
print("Dataset now has {} rows, which is {} rows fewer than the previous version.".
      format(dataset.num_rows, memory_numrows[-2]-dataset.num_rows))

Dataset now has 43597 rows, which is 8729 rows fewer than the previous version.
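
The deduplication above compares exact strings, so two documents that differ only in whitespace or letter case count as different. As a possible variation (not what this notebook does), the sketch below compares a hash of a normalized copy of each text instead. `normalize` and `dedup_normalized` are hypothetical helpers, and the sample documents are made up.

```python
import hashlib


def normalize(text):
    # Lowercase and collapse all whitespace runs to a single space.
    return " ".join(text.lower().split())


def dedup_normalized(texts):
    # Keep the first occurrence of each normalized text, unmodified.
    seen_hashes = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(text)
    return kept


docs = ["Hello  world", "hello world\n", "Something else"]
print(dedup_normalized(docs))  # ['Hello  world', 'Something else']
```

Storing hashes rather than full texts also keeps the `seen` set small when documents are long.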


### Quality filter - Language

In [27]:
!pip install fasttext



In [28]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

--2025-06-07 07:22:58--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.14, 3.163.189.96, 3.163.189.108, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: ‘lid.176.bin’


2025-06-07 07:22:59 (200 MB/s) - ‘lid.176.bin’ saved [131266198/131266198]



In [29]:
import fasttext

# Load the model
model = fasttext.load_model('lid.176.bin')

# Predict the language of a text
text = "This is a sample sentence."
predictions = model.predict(text)
print(predictions)


(('__label__en',), array([0.98663521]))
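
The prediction printed above is a pair of (labels, scores), with each label prefixed by `__label__`. The sketch below shows one way to turn that pair into a plain language code and a float. `parse_prediction` is a hypothetical helper, and a plain list stands in for the NumPy array so the block runs without fastText installed.

```python
def parse_prediction(prediction, prefix="__label__"):
    # `prediction` has the shape shown above: (labels, scores).
    labels, scores = prediction
    top_label = labels[0]
    # Strip the label prefix if present, leaving only the language code.
    language = top_label[len(prefix):] if top_label.startswith(prefix) else top_label
    return language, float(scores[0])


# Stand-in for the model output above, with a list instead of a NumPy array.
example_prediction = (("__label__en",), [0.98663521])
language, score = parse_prediction(example_prediction)
print(language, score)
```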


In [None]:
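import fasttext

def english_language_filter(ds):
    # load language detection model
    # (the path must point to the lid.176.bin file downloaded above)
    model = fasttext.load_model("./working/lid.176.bin")
    def is_english(x):
        # Predict language of the text and probability
        # (newlines are removed because fastText predicts on a single line)
        labels, scores = model.predict(x['text'].replace("\n", ""))

        # labels look like ('__label__en',); keep only the language code
        language = labels[0].split("__")[2]
        # True only if the top label is English with a score above 0.4
        return bool(scores[0] > 0.4 and language == "en")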
        
    ds = ds.filter(is_english, load_from_cache_file=False, num_proc=1)
    return ds

dataset = english_language_filter(dataset)



Filter:   0%|          | 0/43597 [00:00<?, ? examples/s]

In [31]:
memory_numrows.append(dataset.num_rows)
print("Dataset now has {} rows, which is {} rows fewer than the previous version.".
      format(dataset.num_rows, memory_numrows[-2]-dataset.num_rows))

Dataset now has 40473 rows, which is 3124 rows fewer than the previous version.
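
With all four cleaning steps applied, it can help to see how much each one removed. The sketch below recomputes the per-step drops from the row counts reported in this notebook; the step names are just labels for the sections above.

```python
# Row counts reported after each step in this notebook.
steps = ["initial", "length filter", "repetition filter", "deduplication", "language filter"]
row_counts = [60009, 52356, 52326, 43597, 40473]

# Rows removed by each step: count before the step minus count after it.
removed_per_step = [before - after for before, after in zip(row_counts, row_counts[1:])]

for step, before, removed in zip(steps[1:], row_counts, removed_per_step):
    print(f"{step:<18} removed {removed:>5} rows ({removed / before:.1%})")

retained = row_counts[-1] / row_counts[0]
print(f"retained overall: {retained:.1%}")
```

On this dataset, the length filter and exact deduplication account for most of the removed rows, while the repetition filter removes only a handful.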


## Save the dataset to disk

In [None]:
file_path = "./working/data/preprocessed_dataset.parquet"
dataset.to_parquet(file_path)

Creating parquet from Arrow format:   0%|          | 0/41 [00:00<?, ?ba/s]

197100832

# Lecture 3: Data Packaging

## Load the stored dataset

In [None]:
import datasets

dataset = datasets.load_dataset(
    "parquet", 
    data_files="./working/data/preprocessed_dataset.parquet", 
    split="train"
)

print(dataset)

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['text'],
    num_rows: 40473
})


In [34]:
dataset = dataset.shard(num_shards=10, 
                        index=0)
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 4048
})
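
Keeping one shard out of 10 leaves roughly a tenth of the rows. Whether rows are assigned to shards in contiguous blocks or with a stride depends on the `contiguous` argument of `shard`, but for index 0 both schemes give the same size here. The two helpers below are hypothetical and only reproduce the arithmetic, not the library's implementation.

```python
def strided_shard_size(num_rows, num_shards, index):
    # Rows index, index + num_shards, index + 2 * num_shards, ...
    return len(range(index, num_rows, num_shards))


def contiguous_shard_size(num_rows, num_shards, index):
    # Even split; the first (num_rows % num_shards) shards get one extra row.
    base, extra = divmod(num_rows, num_shards)
    return base + (1 if index < extra else 0)


print(strided_shard_size(40473, 10, 0))     # 4048
print(contiguous_shard_size(40473, 10, 0))  # 4048
```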


In [35]:
# let's load a tokenizer 

from transformers import AutoTokenizer

checkpoint_solar = "upstage/SOLAR-10.7B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint_solar, 
    use_fast=False
)

tokenizer_config.json:   0%|          | 0.00/966 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

In [36]:
tokenizer.tokenize("I'm a short sentence")

['▁I', "'", 'm', '▁a', '▁short', '▁sentence']

In [37]:
def tokenization(example):
    # Tokenize
    tokens = tokenizer.tokenize(example["text"])

    # Convert tokens to ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Add <bos> and <eos> tokens to the front and back of token_ids
    # bos: beginning of sequence, eos: end of sequence
    token_ids = (
        [tokenizer.bos_token_id]
        + token_ids
        + [tokenizer.eos_token_id]
    )
    example["input_ids"] = token_ids

    # We will be using this column to count the total number of tokens 
    # in the final dataset
    example["num_tokens"] = len(token_ids)
    return example

In [38]:
dataset = dataset.map(tokenization, 
                      load_from_cache_file=False)
print(dataset)

Map:   0%|          | 0/4048 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'input_ids', 'num_tokens'],
    num_rows: 4048
})


In [39]:
for example in dataset.select(range(4)):
    print("len(text): {} \t len(input_ids): {}".
          format(
              len(example['text']),
              len(example['input_ids'])
          )
         )


len(text): 2441 	 len(input_ids): 644
len(text): 19254 	 len(input_ids): 5459
len(text): 1327 	 len(input_ids): 358
len(text): 1320 	 len(input_ids): 279
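
The character and token counts printed above give a rough sense of how many characters one token covers with this tokenizer. The sketch below computes that ratio from the four pairs shown; note that each token count includes the added `<bos>` and `<eos>` tokens.

```python
# (len(text), len(input_ids)) pairs printed above.
lengths = [(2441, 644), (19254, 5459), (1327, 358), (1320, 279)]

ratios = [n_chars / n_tokens for n_chars, n_tokens in lengths]
for (n_chars, n_tokens), ratio in zip(lengths, ratios):
    print(f"{n_chars:>6} chars / {n_tokens:>5} tokens = {ratio:.2f} chars per token")
```

For these four examples, the ratio stays between about 3.5 and 4.8 characters per token.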


## Packing the dataset

In [40]:
# concatenate input_ids into one single list
import numpy as np

input_ids = np.concatenate(dataset["input_ids"])
print(len(input_ids))

4924245


In [41]:
# max_seq_length is the number of columns used when reshaping the 1D
# token array above into a 2D array (one row per training sequence).
max_seq_length = 32

In [42]:
# Splitting input_ids into rows of max_seq_length leaves a remainder of
# len(input_ids) % max_seq_length tokens. Subtracting that remainder gives
# the largest length that divides evenly into rows.
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(total_length)

4924224


In [43]:
# discarding the extra tokens now
input_ids = input_ids[:total_length]
print(input_ids.shape)

(4924224,)


In [44]:
# and now, 1D --> 2D
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape

(153882, 32)

In [45]:
# convert to a Hugging Face dataset
input_ids_list = input_ids_reshaped.tolist()
packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {"input_ids": input_ids_list}
)
print(packaged_pretrain_dataset)

Dataset({
    features: ['input_ids'],
    num_rows: 153882
})
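
The packing steps above (concatenate, trim, reshape) can be traced by hand on a toy example. The sketch below uses plain Python lists as a stand-in for the NumPy operations; the token ids are made up, with 1 standing in for `<bos>` and 2 for `<eos>`.

```python
from itertools import chain

# Made-up token ids; 1 stands in for <bos> and 2 for <eos>.
tokenized_docs = [
    [1, 11, 12, 13, 2],
    [1, 21, 22, 2],
    [1, 31, 32, 33, 34, 2],
]
max_seq_length = 4

# 1. Concatenate all documents into one flat stream.
stream = list(chain.from_iterable(tokenized_docs))

# 2. Drop the tail that does not fill a complete row.
total_length = len(stream) - len(stream) % max_seq_length
stream = stream[:total_length]

# 3. Cut the stream into rows of max_seq_length tokens.
packed = [stream[i:i + max_seq_length] for i in range(0, total_length, max_seq_length)]

for row in packed:
    print(row)
```

Rows can start or end in the middle of a document, which is why the `<bos>` and `<eos>` markers matter: they are the only record of where one document stops and the next begins.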


## Saving to disk

In [None]:
packaged_pretrain_dataset.to_parquet("./working/data/packaged_pretrain_dataset.parquet")

Creating parquet from Arrow format:   0%|          | 0/154 [00:00<?, ?ba/s]

20312424