# Data Processing for Megatron Bridge LLMs with the DCLM Dataset

This notebook provides a step-by-step guide to preprocessing a DCLM subdataset for use with Megatron Bridge. In this example, the subdataset is 68 GB in its compressed form and expands to 199 GB once decompressed. For more information about the dataset, check out the [README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/tutorials/data/dclm/README.md).

## Download Data

The example below demonstrates how to download the `global-shard_01_of_10/local-shard_0_of_10` subdataset of the DCLM dataset from Hugging Face. To download the full dataset, set `allow_patterns` to `*.jsonl.zst`.

In [None]:
from huggingface_hub import login, snapshot_download

# login to HF using token
login(token="input your HF token")

# download dataset
snapshot_download(
    repo_id="mlfoundations/dclm-baseline-1.0",
    repo_type="dataset",
    local_dir="/data/dclm",
    allow_patterns="global-shard_01_of_10/local-shard_0_of_10/**", # set to '*.jsonl.zst' to download full dataset
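    resume_download=True,  # deprecated in recent huggingface_hub releases, where downloads always resume; drop it if it raises an error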
    max_workers=32,  # Don't hesitate to increase this number to lower the download time
)
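
Before decompressing, it can help to confirm that the download looks complete. The sketch below counts the compressed files under a directory and sums their sizes, which can be compared against the roughly 68 GB quoted above. The helper name `summarize_downloads` is an illustrative assumption and is not part of `huggingface_hub`.

```python
from pathlib import Path


def summarize_downloads(root, pattern="*.jsonl.zst"):
    """Return (file_count, total_bytes) for files under `root` matching `pattern`."""
    # rglob searches `root` recursively, so nested shard directories are included.
    files = [p for p in Path(root).rglob(pattern) if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

For example, `summarize_downloads("/data/dclm/global-shard_01_of_10/local-shard_0_of_10")` returns the number of `.jsonl.zst` files and their combined size in bytes for the subdataset downloaded above.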

## Decompress files

The dataset is hosted on Hugging Face in compressed format. To work with the `.jsonl` files, the downloaded files need to be decompressed first.

**Note:** This script requires the GNU `parallel` and `zstd` command-line tools to be installed.

In [None]:
%%bash

# decompress files
PATH_TO_SAVE=/data/dclm/decompressed
SOURCE_DIR=/data/dclm/global-shard_01_of_10/local-shard_0_of_10
NUM_WORKERS=32

mkdir -p ${PATH_TO_SAVE}
cd ${SOURCE_DIR}
find . -name "*.zst" | parallel -j${NUM_WORKERS} "zstd -d {} -o ${PATH_TO_SAVE}/{.}"
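
After decompression, a quick sanity check on one of the resulting files can catch truncated or malformed output early. The sketch below parses the first lines of a `.jsonl` file and counts records that are not valid JSON objects with a string field. It assumes the field of interest is named `text`, which is the key `preprocess_data.py` reads by default; the helper `check_jsonl` is hypothetical.

```python
import json
from itertools import islice


def check_jsonl(path, key="text", max_lines=1000):
    """Parse up to `max_lines` lines of `path`; return (n_checked, n_bad)."""
    checked = bad = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in islice(f, max_lines):
            checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Not parseable as JSON (this also covers blank lines).
                bad += 1
                continue
            # A usable record is an object whose `key` field holds a string.
            if not isinstance(record, dict) or not isinstance(record.get(key), str):
                bad += 1
    return checked, bad
```

A result whose second element is `0` means every sampled line parsed and carried the expected field.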

## Merge files

Each DCLM subdataset contains hundreds of small `.jsonl` files. To simplify handling, we merge all `.jsonl` files from the current subdataset into a single `.jsonl` file.

In [None]:
%%bash

# merge files
SOURCE_DIR=/data/dclm/decompressed
PATH_TO_SAVE=/data/dclm/decompressed/merged.jsonl

cd ${SOURCE_DIR}
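# match only the shard files so that merged.jsonl, which lives in the same directory, is never read as input on a re-run
awk '1' shard_*.jsonl > ${PATH_TO_SAVE}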

# remove small .jsonl files
rm shard_*
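
The merge uses `awk '1'` rather than a plain `cat` because `awk` writes every record followed by a newline, so a file that lacks a trailing newline does not have its last record fused with the first record of the next file. The sketch below shows the same behavior in Python as an illustration only; `merge_jsonl` is a hypothetical helper, not part of the tutorial's tooling.

```python
def merge_jsonl(sources, destination):
    """Concatenate `sources` into `destination`, one record per line.

    Returns the number of lines written.
    """
    written = 0
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                for line in f:
                    # Add the newline that a file's final record may be missing.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
                    written += 1
    return written
```

The return value can be compared with the summed line counts of the input files to confirm that no records were lost or joined.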

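## Shuffle data

This example demonstrates how to shuffle the data. First, we split the merged `.jsonl` file into smaller chunks so that shuffling can run in parallel. Each chunk is then shuffled independently and the chunks are concatenated in their original order, so records are shuffled within each chunk of `LINES_PER_SPLIT` lines rather than across the whole file.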

In [None]:
%%bash

# split file into chunks before shuffling
LINES_PER_SPLIT=1000000
SOURCE_FILE=/data/dclm/decompressed/merged.jsonl
CHUNKS_DIR=/data/dclm/decompressed/chunks

mkdir -p ${CHUNKS_DIR}
split -l ${LINES_PER_SPLIT} ${SOURCE_FILE} ${CHUNKS_DIR}/chunk_

# shuffle files
NUM_WORKERS=16
SHUFFLE_CHUNKS_DIR=/data/dclm/decompressed/shuffled_chunks
PATH_TO_SAVE=/data/dclm/decompressed/shuffled.jsonl

mkdir -p ${SHUFFLE_CHUNKS_DIR}
ls "${CHUNKS_DIR}"/chunk_* | parallel -j"${NUM_WORKERS}" 'shuf {} -o '"${SHUFFLE_CHUNKS_DIR}"'/$(basename {})_shuf'

# remove unshuffled chunks
rm -rf ${CHUNKS_DIR}
# merge shuffled chunks into single .jsonl file
awk '1' ${SHUFFLE_CHUNKS_DIR}/chunk_* > ${PATH_TO_SAVE}
# remove shuffled chunks
rm -rf ${SHUFFLE_CHUNKS_DIR}
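
The split-then-shuffle procedure above can be sketched in a few lines of Python to make its effect concrete: records move only within their own chunk, and chunk order is preserved. The function `shuffle_in_chunks` and its `seed` parameter are illustrative assumptions; the `shuf` calls in the cell above are not seeded, so their output differs from run to run.

```python
import random


def shuffle_in_chunks(lines, lines_per_chunk, seed=0):
    """Shuffle `lines` within consecutive chunks of `lines_per_chunk` items."""
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    shuffled = []
    for start in range(0, len(lines), lines_per_chunk):
        chunk = list(lines[start:start + lines_per_chunk])
        rng.shuffle(chunk)  # permute this chunk only
        shuffled.extend(chunk)  # chunks keep their original order
    return shuffled
```

If the source data is ordered in a way that matters for training, a smaller chunk count (larger `LINES_PER_SPLIT`) or an additional shuffle of the chunk order brings the result closer to a global shuffle, at the cost of more memory per worker.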


## Preprocess Data to bin/idx format

This step uses the data preprocessing [script](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py) from Megatron-LM, so Megatron-LM must be installed.

In [None]:
%%bash

# install Megatron-LM

# Install Megatron Core with required dependencies
pip install megatron-core
# quote the extra so the shell does not treat the brackets as a glob pattern
pip install --no-build-isolation "transformer-engine[pytorch]"

# Clone the repository to get tools/preprocess_data.py
git clone https://github.com/NVIDIA/Megatron-LM.git

In [None]:
%%bash

# run data preprocessing
python3 Megatron-LM/tools/preprocess_data.py \
    --input /data/dclm/decompressed/shuffled.jsonl \
    --output-prefix /data/dclm/preprocessed \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --log-interval 10000 \
    --workers 32 \
    --append-eod
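
Once preprocessing finishes, the script writes a `.bin` file with the token data and an `.idx` file with the index, both named from the output prefix, the JSON key, and the tokenization level. The sketch below builds the expected paths for the command above and reports any that are missing. It assumes the default `text` key and document-level output; the helper names are hypothetical.

```python
from pathlib import Path


def expected_outputs(output_prefix, json_key="text", level="document"):
    """Return the .bin and .idx paths expected for `output_prefix`."""
    base = f"{output_prefix}_{json_key}_{level}"
    return [Path(base + ".bin"), Path(base + ".idx")]


def missing_outputs(output_prefix):
    """Return the expected output files that do not exist on disk."""
    return [p for p in expected_outputs(output_prefix) if not p.is_file()]
```

With the prefix used above, `expected_outputs("/data/dclm/preprocessed")` yields `/data/dclm/preprocessed_text_document.bin` and `/data/dclm/preprocessed_text_document.idx`; an empty list from `missing_outputs` indicates both files were written.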