# Pretraining using SlimPajama

This notebook shows how to use the data generated by the [data pipeline notebook](./data_pipeline.ipynb) to pretrain a model. All we need to do is define a data module based on the generated data and use it to replace the mock data module that the [NeMo LLM recipes](../../../nemo/collections/llm/recipes/__init__.py) provide by default.

In [None]:
import nemo_run as run
from typing import Optional
import lightning.pytorch as pl
from nemo.collections import llm
from nemo.collections.common.tokenizers import SentencePieceTokenizer

## Define the data module
To define the data module, we can use `llm.PreTrainingDataModule` and pass in the data paths and the tokenizer. If you don't have either of these yet, refer to the [data pipeline notebook](./data_pipeline.ipynb). The definition of the data module lists the other supported parameters, such as `split`, `num_workers`, and `index_mapping_dir`.
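
Before building the data module, it can help to confirm that both inputs are actually on disk. The sketch below is a hypothetical helper (`missing_inputs` is not part of NeMo). It assumes each entry in `paths` is a prefix for a Megatron-style `.bin`/`.idx` pair, and it uses the example paths from this notebook.

```python
from pathlib import Path


def missing_inputs(data_prefix: str, tokenizer_model: str) -> list[str]:
    """Return the expected input files that do not exist on disk."""
    # Assumption: the dataset prefix expands to a `.bin` and an `.idx` file.
    # The suffix is appended as a string because the prefix itself contains dots.
    expected = [
        Path(data_prefix + ".bin"),
        Path(data_prefix + ".idx"),
        Path(tokenizer_model),
    ]
    return [str(p) for p in expected if not p.is_file()]


missing = missing_inputs(
    "/data/slimpajama_megatron/concatenated_chunk1.jsonl_text_document",
    "/data/tokenizer/tokenizer.model",
)
print("Missing inputs:", missing or "none")
```

If anything is reported as missing, rerun the relevant step of the data pipeline notebook before continuing.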

In [2]:
def slimpajama(
    gbs: int = 256,
    mbs: int = 4,
    seq_length: int = 8192,
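) -> run.Config[pl.LightningDataModule]:
    """Return a data module config for the preprocessed SlimPajama data."""
    return run.Config(
        llm.PreTrainingDataModule,
        # Path prefix of the preprocessed dataset, without a file extension
        paths=["/data/slimpajama_megatron/concatenated_chunk1.jsonl_text_document"],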
        seq_length=seq_length,
        global_batch_size=gbs,
        micro_batch_size=mbs,
        tokenizer=run.Config(SentencePieceTokenizer, model_path="/data/tokenizer/tokenizer.model"),
        split="99,8,2",
        num_workers=2,
        index_mapping_dir="/data/index_mapping",
    )
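
The `split` string `"99,8,2"` does not sum to 100. Assuming the values are treated as relative weights and normalized to fractions, as Megatron-style dataset builders typically do, the resulting proportions can be sketched as follows. `split_fractions` is a hypothetical helper written for illustration, not a NeMo function.

```python
def split_fractions(split: str) -> list[float]:
    """Turn a comma-separated weight string into fractions that sum to 1."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


train, valid, test = split_fractions("99,8,2")
print(f"train={train:.3f} valid={valid:.3f} test={test:.3f}")
# train=0.908 valid=0.073 test=0.018
```

Under that assumption, roughly 90.8% of the samples go to training, 7.3% to validation, and 1.8% to testing.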

## Configure the recipe and launch pretraining
Once the data module is defined, you can use an existing recipe and replace the data module as shown below.
To learn more about the recipes, refer to the [quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html).

In [3]:
def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.pretrain_recipe(
        dir="/checkpoints/llama-new", # Path to store checkpoints
        name="llama_pretraining",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

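    # Shrink the model so the example runs quickly on a single GPU
    recipe.model.config.num_layers = 1
    recipe.model.config.hidden_size = 128
    recipe.trainer.max_steps = 30
    # Swap the default mock data module for the SlimPajama data module
    recipe.data = slimpajama(
        gbs=32,
        mbs=1,
    )
    recipe.trainer.val_check_interval = 20  # Run validation every 20 training steps
    recipe.trainer.strategy.context_parallel_size = 1  # Single GPU, so no context parallelism
    recipe.log.ckpt.save_optim_on_train_end = True  # Include optimizer state in the final checkpoint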
    return recipe
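
The batch sizes passed to `slimpajama` interact with the number of GPUs. In Megatron-style training, the global batch size equals the micro batch size times the data-parallel size times the number of gradient accumulation steps. The sketch below works through that arithmetic for the values used here, assuming one GPU and tensor, pipeline, and context parallel sizes of 1. `accumulation_steps` is a hypothetical helper, not a NeMo API.

```python
def accumulation_steps(gbs: int, mbs: int, data_parallel_size: int) -> int:
    """Micro-batches each data-parallel rank processes per optimizer step."""
    per_pass = mbs * data_parallel_size
    if gbs % per_pass != 0:
        raise ValueError(
            f"gbs={gbs} is not divisible by mbs * data_parallel_size={per_pass}"
        )
    return gbs // per_pass


# Values from configure_recipe with its defaults: 1 node x 1 GPU.
steps = accumulation_steps(gbs=32, mbs=1, data_parallel_size=1)
tokens_per_step = 32 * 8192  # global batch size x default seq_length
total_tokens = tokens_per_step * 30  # 30 = recipe.trainer.max_steps
print(steps, tokens_per_step, total_tokens)
# 32 262144 7864320
```

With these settings, each optimizer step accumulates 32 micro-batches, and the 30-step run consumes about 7.9 million tokens.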

In [4]:
def local_executor_torchrun(nodes: int = 1, devices: int = 1) -> run.LocalExecutor:
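    # Environment variables passed to the launched job.
    # Note: `nodes` is currently unused; a local executor runs on this machine only.
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NEMO_ENV_VARNAME_TESTING": "1",
        "CUDA_VISIBLE_DEVICES": "0",  # Expose only GPU 0; extend this when devices > 1
    }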

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)
    return executor
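
The executor pins `CUDA_VISIBLE_DEVICES` to `"0"`, which matches the single-GPU default. If you raise `gpus_per_node`, the job would likely see fewer GPUs than the processes torchrun launches. One possible adjustment, shown here as a sketch with a hypothetical `visible_devices` helper, is to derive the value from the device count. It assumes you want GPUs numbered from 0 upward.

```python
def visible_devices(devices: int) -> str:
    """Build a CUDA_VISIBLE_DEVICES value exposing GPUs 0..devices-1."""
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return ",".join(str(i) for i in range(devices))


env_vars = {
    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
    "CUDA_VISIBLE_DEVICES": visible_devices(2),  # Two GPUs: "0,1"
}
print(env_vars["CUDA_VISIBLE_DEVICES"])
# 0,1
```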


In [5]:
def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor)

## Run Pretraining
Now call the `run_pretraining` function to start pretraining on your local machine using torchrun.

In [None]:
run_pretraining()