# Introduction

This notebook walks you through the end-to-end workflow for domain-adaptive pretraining (DAPT) and supervised fine-tuning (SFT) of large language models. Using NVIDIA NeMo Curator for data curation and the NeMo Framework for training, you will curate domain-specific datasets, adapt a tokenizer, pretrain and fine-tune the model, evaluate performance, and deploy with NVIDIA NIM. The result is a complete, adaptable pipeline for building domain-specialized LLMs.

# Table of Contents

1. [Data Curation](#1-data-curation)  
2. [Custom Tokenizer Training](#2-custom-tokenizer-training)  
3. [Domain-Adaptive Pretraining](#3-domain-adaptive-pretraining)  
4. [Supervised Fine-Tuning (SFT)](#4-supervised-fine-tuning-sft)  
5. [Optimized Deployment](#5-optimized-deployment)  

## Objectives

#### 1. Data Curation  
- **Efficient DAPT Data Curation** – Best practices for building high-quality domain corpora using open-source datasets.  
- **Scalable Processing Pipeline** – Text extraction, filtering, deduplication, and data blending with NeMo Curator for large-scale datasets.  

#### 2. Custom Tokenizer Training  
- **Domain-Specific Tokenization** – Improve efficiency on specialized data.  
- **Low-Overhead Adaptation** – Minimize retraining and fine-tuning effort.  
- **Balanced Performance** – Optimize for domain data while preserving general-purpose capability.  

#### 3. Domain-Adaptive Pretraining  
- **Data Preprocessing** – Prepare curated text for pretraining.  
- **Optimized Training** – Efficiently adapt LLMs to new domains.  
- **Evaluation** – Compare baseline and domain-adapted models.  

#### 4. Supervised Fine-Tuning (SFT)  
- **Custom DataModules** – Build DataModule classes for SFT datasets.  
- **Task Adaptation** – Fine-tune LLMs to improve task performance.  
- **Inference** – Generate outputs with fine-tuned models.  
- **Evaluation** – Assess model quality on curated benchmarks. 

#### 5. Optimized Deployment  
- **Checkpoint Conversion** – Export to Hugging Face–compatible format.  
- **NIM Deployment** – Build optimized inference engines and deploy with NVIDIA NIM.  


# Getting Started
1. [Prerequisites](#prerequisites)
2. [Launch NeMo Framework Docker Container](#launch-nemo-framework-docker-container)
3. [Data Curation](#data-curation)
4. [Custom Tokenizer Training](#custom-tokenizer-training)
5. [Domain Adaptive Continued Training](#domain-adaptive-continued-training)
6. [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
7. [NIM Deployment](#nim-deployment)

### Prerequisites

#### Hardware

[TO DO] GPU, CPU, system memory, disk space



#### Clone repository and install software

1. **Clone** Git Repository

```bash
git clone https://gitlab-master.nvidia.com/swastikad/dapt_bp_mirror.git
```

2. **Install** [Docker](https://docs.docker.com/engine/install/ubuntu/)

**Tip:** Ensure the Docker Compose plugin version is 2.29.1 or higher. Run `docker compose version` to confirm. Refer to *Install the Compose plugin* in the Docker documentation for more information.

3. **Install** [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-the-nvidia-container-toolkit) to configure Docker for GPU-accelerated containers such as NVIDIA NIM and the NeMo Framework container. If you are using a system deployed with Brev, you can skip this step because Brev systems come with the NVIDIA Container Toolkit preinstalled.

**Note:** After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.
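
The Compose version requirement from the tip above can also be checked in code. The following is a minimal sketch, assuming that `docker compose version` prints a line containing a version number such as `v2.29.1`. The helper names `parse_compose_version`, `meets_minimum`, and `installed_compose_output` are hypothetical and not part of this repository.

```python
import re
import subprocess

MIN_COMPOSE_VERSION = (2, 29, 1)  # minimum version stated in the tip above


def parse_compose_version(output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from text such as 'Docker Compose version v2.29.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(output: str, minimum: tuple[int, int, int] = MIN_COMPOSE_VERSION) -> bool:
    """Compare versions numerically, so that 2.9.x is correctly older than 2.29.x."""
    return parse_compose_version(output) >= minimum


def installed_compose_output() -> str:
    """Run `docker compose version`; requires Docker with the Compose plugin on PATH."""
    result = subprocess.run(
        ["docker", "compose", "version"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Usage on a machine with Docker installed:
# print(meets_minimum(installed_compose_output()))
```

Comparing integer tuples rather than strings avoids the pitfall where `"2.9.3"` sorts after `"2.29.1"` lexicographically.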

#### Get API Keys

**Let's start by logging into the NVIDIA Container Registry.**

An NVIDIA NGC API key is required to use this blueprint. It is needed to log in to the NVIDIA container registry, `nvcr.io`, and to pull the secure container images used in this NVIDIA NIM Blueprint. Refer to [Generating NGC API Keys](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) in the NVIDIA NGC User Guide for more information.

Authenticate with the NVIDIA Container Registry with the following command:

```bash
docker login nvcr.io
```

**Note:** Use `$oauthtoken` as the username and your API key as the password. The `$oauthtoken` username is a special name indicating that you are authenticating with an API key rather than a username and password.
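
For unattended setups, the login can be scripted instead of typed interactively. The sketch below is one possible approach, assuming the API key has been exported in an environment variable named `NGC_API_KEY` (a hypothetical name chosen for this example). It relies on the Docker CLI options `--username` and `--password-stdin`; the helper names are illustrative.

```python
import os
import subprocess

REGISTRY = "nvcr.io"
USERNAME = "$oauthtoken"  # literal string, not a shell variable to be expanded


def build_login_command(registry: str = REGISTRY) -> list[str]:
    """Return the docker login argv; the key is supplied on stdin, never in argv."""
    return ["docker", "login", registry, "--username", USERNAME, "--password-stdin"]


def login(api_key: str) -> None:
    """Log in to the registry, passing the API key through standard input."""
    # An argument list (no shell) keeps "$oauthtoken" from being expanded.
    subprocess.run(build_login_command(), input=api_key, text=True, check=True)


# Usage (assumes the key was exported as NGC_API_KEY, a hypothetical variable name):
# login(os.environ["NGC_API_KEY"])
```

Passing the key on standard input keeps it out of the process list and shell history.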

### Launch NeMo Framework Docker Container

Run the following commands in your terminal to launch the NeMo Framework container.  
All objectives from Sections **1–4** (Data Curation through Supervised Fine-Tuning) will be executed inside this container.  

```
docker pull nvcr.io/nvidia/nemo:25.04
docker run -it --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.04
```
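
The `--gpus` value needs care with quoting: when several devices are listed, Docker expects the `device=...` value itself to be wrapped in double quotes, which is why the shell command nests quotes. The following sketch builds the same `docker run` arguments from Python, where no shell quoting layer is involved. The helper names `gpus_value` and `build_run_command` are hypothetical.

```python
IMAGE = "nvcr.io/nvidia/nemo:25.04"


def gpus_value(device_ids: list[int]) -> str:
    """Return the --gpus value with the inner double quotes Docker expects for a device list."""
    if not device_ids:
        raise ValueError("at least one GPU id is required")
    return '"device=' + ",".join(str(i) for i in device_ids) + '"'


def build_run_command(device_ids: list[int], workdir: str) -> list[str]:
    """Assemble the docker run argv that mounts workdir at /workspace."""
    return [
        "docker", "run", "-it", "--rm",
        "--gpus", gpus_value(device_ids),
        "--ipc=host", "--network", "host",
        "-v", f"{workdir}:/workspace",
        IMAGE,
    ]


print(gpus_value([0, 1]))  # "device=0,1"
```

To launch the container from a script, the resulting list could be passed to `subprocess.run(build_run_command([0, 1], os.getcwd()))`.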

### Data Curation
NeMo Curator provides a scalable pipeline for text extraction, filtering, deduplication, and data blending.

We will use the datasets in the `dapt-curation/code/data` folder to illustrate data curation with this pipeline. Sample data is collected from:  

- **GitHub** – `/domain-specific-llms/data/dapt-sources/raw/github`  
  (Clone repositories, extract text from source files, and convert to JSONL.)  
- **ArXiv PDFs** – `/domain-specific-llms/data/dapt-sources/raw/arxiv_pdfs`  
  (Extract text from PDFs, convert to TXT, and store as JSONL.)  
- **Wikipedia** – `/domain-specific-llms/data/dapt-sources/raw/wikipedia`  
  (Parse HTML pages, convert to TXT, and store as JSONL.)  
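
Each source above ends up as JSONL, that is, one JSON object per line. The sketch below shows that conversion for a folder of extracted `.txt` files. The field names (`id`, `text`, `source`, `file_name`) and the function name are assumptions made for illustration; the schema used by the tutorial scripts may differ.

```python
import json
from pathlib import Path


def text_files_to_jsonl(input_dir: Path, output_file: Path, source: str) -> int:
    """Write one JSON record per .txt file in input_dir and return the record count."""
    count = 0
    with output_file.open("w", encoding="utf-8") as out:
        for path in sorted(input_dir.glob("*.txt")):
            record = {
                "id": f"{source}-{count}",  # hypothetical id scheme
                "text": path.read_text(encoding="utf-8"),
                "source": source,
                "file_name": path.name,
            }
            # json.dumps escapes newlines inside the text, so each record stays on one line
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because every record occupies exactly one line, downstream tools can stream and shard the file without parsing it as a whole.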

---

#### Tutorial Steps  

1. **Install requirements and import libraries**  
2. **Download raw data** from GitHub repos, Wikipedia URLs, and ArXiv PDFs; extract metadata and convert to JSONL  
3. **Load datasets** into the workspace  
4. *(Optional)* **Inspect file types and sizes**  
5. **Run the NeMo Curator pipeline:**  
   - File type identification and separation  
   - Document-level exact deduplication  
   - Heuristic-based quality filtering (line count, word count, frequent N-grams, etc.)  
   - Unicode error correction with *ftfy*  
   - PII redaction  
   - GPU-accelerated fuzzy and semantic deduplication  
6. **Save filtered and curated data**  
7. **Blend and shuffle datasets**  
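
The exact-deduplication and heuristic-filtering steps listed above can be illustrated in plain Python. This is a conceptual sketch only: it does not use the NeMo Curator API, and the thresholds and function names are illustrative assumptions rather than the values used by the pipeline.

```python
import hashlib


def exact_dedup(docs: list[dict]) -> list[dict]:
    """Keep the first document for each distinct text, compared by SHA-256 hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def passes_heuristics(text: str, min_words: int = 5, min_lines: int = 1) -> bool:
    """Illustrative thresholds: require a minimum word count and non-empty line count."""
    words = text.split()
    lines = [line for line in text.splitlines() if line.strip()]
    return len(words) >= min_words and len(lines) >= min_lines


def curate(docs: list[dict]) -> list[dict]:
    """Deduplicate first, then drop documents that fail the heuristics."""
    return [doc for doc in exact_dedup(docs) if passes_heuristics(doc["text"])]
```

Running deduplication before filtering means each distinct document is scored only once.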

---

#### Usage

For a custom NeMo Curator installation inside the container, run:

```bash
pip uninstall --yes nemo-curator
rm -r /opt/NeMo-Curator
git clone --branch v0.9.0 https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com "/opt/NeMo-Curator[all]"
```

Then, install the following dependencies for running the DAPT tutorial:

```bash
# System packages for PDF text extraction and OCR
apt-get update
apt-get install -y poppler-utils tesseract-ocr libtesseract-dev

# Python dependencies for the DAPT curation tutorial
pip install -r /workspace/domain-specific-llms/src/dapt_curation/requirements.txt

# Replace any installed OpenCV with a headless build, since we want to use numpy<2
pip uninstall --yes $(pip list --format=freeze | grep opencv)
rm -rf /usr/local/lib/python3.10/dist-packages/cv2/
pip install "opencv-python-headless<4.9.0"

# Download NLTK resources
python -c "import nltk; nltk.download('punkt_tab')"
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"

# Run the data curation pipeline
cd ./domain-specific-llms
python ./deploy/01_data_curation/01_dapt_data_curation.py --device "gpu"
```

This downloads chip-design-related datasets and starts the data curation pipeline.

Use `--device "gpu"` to enable semantic and fuzzy deduplication, which require a GPU.

---

#### What Happens During Execution

- Ingestion: Raw documents are loaded into a distributed Dask pipeline.
- Normalization: Content is standardized (stripped boilerplate, converted to plain text).
- Filtering: Each document is scored and selectively removed based on chosen filters.
- Deduplication: Near-duplicate documents are detected and dropped to reduce redundancy.
- Packaging: The curated dataset is written to disk in shard format for downstream use.
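
To make the near-duplicate step more concrete, the sketch below compares documents by the Jaccard similarity of their word shingles. It is a small, quadratic-time illustration of the idea and does not reproduce the GPU-accelerated implementation used by the pipeline; the shingle size, threshold, and function names are assumptions.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles of a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a text only if it stays below the threshold against every text kept so far."""
    kept: list[tuple[str, set]] = []
    for text in texts:
        signature = shingles(text)
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((text, signature))
    return [text for text, _ in kept]
```

Pairwise comparison like this does not scale to large corpora, which is one reason the pipeline runs fuzzy and semantic deduplication on the GPU.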

### Custom Tokenizer Training

Custom tokenizers are trained or adapted to better capture domain-specific vocabulary, acronyms, and expressions. This preserves key terms as single tokens, reducing fragmentation and improving representational efficiency.  
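
The effect of fragmentation can be seen with a toy example. The sketch below uses a greedy longest-match tokenizer over a small vocabulary with a single-character fallback. It is not the tokenizer or the adaptation method used in the notebook, and the vocabulary entries are made up for illustration.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization; unknown characters become single-character tokens."""
    max_len = max((len(tok) for tok in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match or a single character
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


base_vocab = {"net", "list", "flip", "flop"}
domain_vocab = base_vocab | {"netlist", "flipflop"}  # hypothetical domain additions

print(greedy_tokenize("netlist", base_vocab))    # ['net', 'list']
print(greedy_tokenize("netlist", domain_vocab))  # ['netlist']
```

With the domain term in the vocabulary, the same string is covered by one token instead of two, which shortens sequences on domain text.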

To get started, navigate to the tokenizer training directory:

```bash
cd dapt_bp_mirror/step2-custom-token-pretraining/02_custom_tokenizer_training
```
Then open and follow the notebook:

`custom_tokenization_llama7b.ipynb`

### Domain Adaptive Continued Training

**Domain-Adaptive Pretraining (DAPT)** specializes a general-purpose LLM (e.g., Llama-2-7B) on domain-specific text, improving accuracy and context-awareness for specialized tasks while retaining general language capabilities.  
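
One common way to compare a baseline model with a domain-adapted one is perplexity on held-out domain text: the exponential of the negative mean per-token log-probability. The sketch below assumes you already have per-token natural-log probabilities from each model. The numbers are made up for illustration and are not results from this pipeline.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities for the same held-out tokens under two models
baseline_logprobs = [-2.0, -2.0, -2.0]
adapted_logprobs = [-1.0, -1.0, -1.0]

print(perplexity(baseline_logprobs) > perplexity(adapted_logprobs))  # True
```

Lower perplexity means the model assigns higher probability to the held-out text. Per-token perplexities are directly comparable only when both models use the same tokenizer, which matters here if a custom tokenizer was trained.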

To get started, navigate to the DAPT training directory:

```bash
cd dapt_bp_mirror/step3-domain-adaptive-pretraining/03_domain_adaptive_pretraining
```
Then open and follow the notebook:

`domain_adaptive_pretraining_nemo2.0.ipynb`

### Supervised Fine-Tuning (SFT)

**Supervised Fine-Tuning (SFT)** further customizes a pretrained or domain-adapted model using curated, high-quality labeled data to align performance with specific tasks or human preferences.  
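
SFT data usually consists of pairs: a prompt and the desired response. The sketch below formats labeled examples as JSONL records with `input` and `output` keys. The key names, the prompt template, and the function names are assumptions for illustration; check the notebook's DataModule for the schema it actually expects.

```python
import json

PROMPT_TEMPLATE = "Question: {question}\nAnswer:"  # hypothetical template


def to_sft_record(question: str, answer: str) -> dict:
    """Format one labeled example as a prompt/response record."""
    return {
        "input": PROMPT_TEMPLATE.format(question=question.strip()),
        "output": answer.strip(),
    }


def to_jsonl(examples: list[tuple[str, str]]) -> str:
    """Serialize examples as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_sft_record(q, a), ensure_ascii=False) for q, a in examples
    )
```

Keeping the template in one constant makes it easier to apply the same prompt format at inference time as during fine-tuning.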


To get started, navigate to the SFT training directory:

```bash
cd dapt_bp_mirror/step4-supervised-fine-tuning
```

Then open and follow the notebook:

`supervised_fine_tuning.ipynb`


### NIM Deployment
