# NeMo Curator Download and Extract Tutorial

## About This Notebook
NeMo Curator provides pre-built download and extract pipelines for the **Common Crawl**, **Wikipedia**, and **ArXiv** datasets. This tutorial shows how to run each of them.

## Prerequisites
`wget` will be needed in the tutorial for data downloading.

In [None]:
!apt update
!apt install -y wget

To run a pipeline in NeMo Curator, we must start a Ray cluster. This can be done manually (see the [Ray documentation](https://docs.ray.io/en/latest/ray-core/starting-ray.html)) or with Curator's `RayClient`:

In [None]:
from nemo_curator.core.client import RayClient

try:
    ray_client = RayClient()
    ray_client.start()
except Exception as e:
    msg = f"Error initializing Ray client: {e}"
    raise RuntimeError(msg) from e
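
In a standalone script, as opposed to a notebook, it can be convenient to guarantee that the cluster is stopped even when a pipeline raises. One way to sketch this is a small context manager around any object that exposes `start()` and `stop()`, as `ray_client` does in this tutorial. The `running` helper and the `FakeClient` stand-in are hypothetical and exist only to keep the example self-contained; they are not part of NeMo Curator.

```python
from contextlib import contextmanager


@contextmanager
def running(client):
    """Start `client`, yield it, and always stop it afterwards (hypothetical helper)."""
    client.start()
    try:
        yield client
    finally:
        # Runs whether the body succeeded or raised.
        client.stop()


class FakeClient:
    """Stand-in with the same start()/stop() shape; records calls for the demo."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


fake = FakeClient()
try:
    with running(fake):
        raise ValueError("simulated pipeline failure")
except ValueError:
    pass

print(fake.events)  # ['start', 'stop']
```

In a script, the same `with running(...)` block could wrap the pipeline runs shown in the rest of this tutorial.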

## Common Crawl Download and Extract

### About Common Crawl Dataset
The [Common Crawl](https://commoncrawl.org/) dataset is a massive, openly available archive of web data collected by the nonprofit Common Crawl organization since 2008. It consists of petabytes of raw web page data, metadata extracts, and text extracts, providing a valuable resource for anyone interested in large-scale web analysis or AI research. Common Crawl currently stores the crawl data in the [Web ARChive (WARC)](https://en.wikipedia.org/wiki/WARC_(file_format)) format. In this section, we use NeMo Curator to download the raw Common Crawl data (WARC files) and extract the text from it.

In [None]:
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage

In [None]:
!mkdir -p data/common_crawl

In [None]:
stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2025-30",
    end_snapshot="2025-30",
    download_dir="data/common_crawl",
    crawl_type="main",
    url_limit=2,
    record_limit=1000,
)

`CommonCrawlDownloadExtractStage` is the pre-built download and extract pipeline for the Common Crawl dataset. It consists of four core components:
1. URL Generator: gets the URLs for Common Crawl data
2. Downloader: downloads the WARC files from the Common Crawl to a local directory
3. Iterator: extracts contents from the downloaded WARC files
4. Extractor: extracts text from HTML content

Core arguments for this class:
* `start_snapshot`: the first Common Crawl snapshot that will be included in the download. For CC-MAIN datasets, use the format 'YYYY-WeekNumber' (e.g., '2020-50' or '2021-04'); for CC-NEWS datasets (when `crawl_type="news"`), use the 'YYYY-MM' (Year-Month) format. A list of valid Common Crawl snapshots can be found [here](https://data.commoncrawl.org/).
* `end_snapshot`: the last Common Crawl snapshot that will be included in the download. It must use the same format as `start_snapshot`.
* `download_dir`: the directory where the downloaded snapshots will be stored.
* `crawl_type`: optional; the type of Common Crawl dataset, either `"main"` (default) or `"news"`.
* `url_limit`: optional; the maximum number of WARC files to download from the snapshot range.
* `record_limit`: optional; the maximum number of records to extract from each WARC file.

Because this tutorial was tested on a single-node machine with limited resources, we set `url_limit` and `record_limit` to small values for demonstration purposes.
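
The two snapshot formats described above are easy to mix up, so it can help to check a value before launching a download. The following is a minimal sketch using only the Python standard library. `is_valid_snapshot` is a hypothetical helper, not part of NeMo Curator, and it only checks the shape of the string (a week number from 01 to 53 for `"main"`, a month from 01 to 12 for `"news"`), not whether the snapshot actually exists.

```python
import re


def is_valid_snapshot(snapshot: str, crawl_type: str = "main") -> bool:
    """Check that a snapshot string has the expected shape (hypothetical helper)."""
    if crawl_type not in ("main", "news"):
        raise ValueError(f"Unknown crawl_type: {crawl_type!r}")

    # Both formats are four digits, a hyphen, then two digits.
    match = re.fullmatch(r"(\d{4})-(\d{2})", snapshot)
    if match is None:
        return False

    # "main" uses a week number, "news" uses a month number.
    number = int(match.group(2))
    upper_bound = 53 if crawl_type == "main" else 12
    return 1 <= number <= upper_bound


print(is_valid_snapshot("2025-30"))          # True: week 30 of 2025
print(is_valid_snapshot("2025-30", "news"))  # False: there is no month 30
print(is_valid_snapshot("2021-04", "news"))  # True: April 2021
```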

In [None]:
pipeline = Pipeline("common_crawl")
pipeline.add_stage(stage)
results = pipeline.run()

After running the code above, the Common Crawl `warc.gz` files should be downloaded to the `download_dir`.

The extracted texts are stored in memory in the `results` variable, which is a list of `nemo_curator.tasks.document.DocumentBatch` objects.
The length of `results` equals `url_limit` (i.e., the number of WARC files downloaded).

`results[i].data` is a pandas DataFrame holding the extracted texts and related metadata.

In [None]:
print(f"len(results): {len(results)}")

In [None]:
results[0].data

We can use `JsonlWriter` to save each pandas DataFrame as a `jsonl` file.

In [None]:
!mkdir -p out_jsonl/common_crawl

In [None]:
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

output_dir = "out_jsonl/common_crawl"
jsonl_writer = JsonlWriter(path=output_dir, write_kwargs={"force_ascii": False})

for idx, result in enumerate(results):
    jsonl_writer.write_data(result, f"{output_dir}/{idx}.jsonl")

At this point, the `jsonl` files should be located in `out_jsonl/common_crawl`.
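
A quick way to sanity-check the output is to count the records in each file. The sketch below uses only the standard library and assumes each output file holds one JSON object per line, which is the usual JSON Lines convention. `count_jsonl_records` is a hypothetical helper, not a NeMo Curator API. The demo runs against a temporary directory with made-up sample data so that it is self-contained.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(directory):
    """Return {file name: record count} for every .jsonl file in `directory`."""
    directory = Path(directory)
    counts = {}
    if not directory.is_dir():
        return counts
    for path in sorted(directory.glob("*.jsonl")):
        n_records = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                json.loads(line)  # raises ValueError if the line is not valid JSON
                n_records += 1
        counts[path.name] = n_records
    return counts


# Self-contained demo with made-up data (not real pipeline output).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "0.jsonl"
    sample.write_text('{"text": "bonjour"}\n{"text": "café"}\n', encoding="utf-8")
    demo_counts = count_jsonl_records(tmp_dir)

print(demo_counts)  # {'0.jsonl': 2}
```

To check the real output, call `count_jsonl_records("out_jsonl/common_crawl")` after the writer cell has run.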

## Wikipedia Download and Extract

### About Wikipedia Dataset
The [Wikipedia](https://dumps.wikimedia.org/backup-index.html) dataset contains a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. These snapshots are provided at least monthly and usually twice a month. In this section, we show how to download the Wikipedia dump files (`.bz2`) and extract the text from these raw files using NeMo Curator.

In [None]:
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.text.download.wikipedia.stage import WikipediaDownloadExtractStage

In [None]:
!mkdir -p data/wikipedia

In [None]:
stage = WikipediaDownloadExtractStage(language="fr", download_dir="data/wikipedia", url_limit=2, record_limit=1000)

Similar to `CommonCrawlDownloadExtractStage`, `WikipediaDownloadExtractStage` is the pre-built download and extract pipeline for the Wikipedia dataset. It consists of four core components:
1. URL Generator: gets the URLs for Wikipedia dump files
2. Downloader: downloads the Wikipedia dump files (`.bz2`) from wikimedia.org to a local directory
3. Iterator: processes downloaded Wikipedia dump files and extracts article content
4. Extractor: extracts text from the Wikipedia articles in the MediaWiki XML dumps

Core arguments for this class:
* `language`: Language code for the Wikipedia dump (e.g., "en", "es", "fr"); a full list of language codes can be found [here](https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all)
* `download_dir`: Directory to store downloaded `.bz2` files
* `dump_date`: Specific dump date in "YYYYMMDD" format (if None, uses latest)
* `wikidumps_index_prefix`: Base URL for Wikipedia dumps
* `verbose`: If True, enables verbose logging
* `url_limit`: Maximum number of dump URLs to process
* `record_limit`: Maximum number of articles to extract per file
* `add_filename_column`: Whether to add filename column to output
* `log_frequency`: How often to log progress during iteration

A list of the latest Wikipedia dump files can be found [here](https://dumps.wikimedia.org/backup-index-bydb.html).
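
Because `dump_date` is a plain string, a malformed value is easy to pass by accident. As a small sketch using only the standard library, the hypothetical helpers below (not part of NeMo Curator) convert between `datetime.date` objects and the "YYYYMMDD" form. The example date is arbitrary and may not correspond to an actual dump.

```python
from datetime import date, datetime


def to_dump_date(day: date) -> str:
    """Format a date as 'YYYYMMDD' (hypothetical helper)."""
    return day.strftime("%Y%m%d")


def parse_dump_date(text: str) -> date:
    """Parse 'YYYYMMDD' into a date, raising ValueError on anything else."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"Expected 8 digits in YYYYMMDD form, got {text!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(text, "%Y%m%d").date()


print(to_dump_date(date(2025, 7, 1)))  # 20250701
print(parse_dump_date("20250701"))     # 2025-07-01
```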

Because this tutorial was tested on a single-node machine with limited resources, we set `url_limit` and `record_limit` to small values for demonstration purposes.

In [None]:
pipeline = Pipeline("wikipedia")
pipeline.add_stage(stage)
results = pipeline.run()

After running the code above, the Wikipedia `.bz2` dump files should be downloaded to the `download_dir`.

`results[i].data` is a pandas DataFrame holding the extracted texts and related metadata.

In [None]:
print(f"len(results): {len(results)}")

In [None]:
results[0].data

We can use `JsonlWriter` to save each pandas DataFrame as a `jsonl` file.

In [None]:
!mkdir -p out_jsonl/wikipedia

In [None]:
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

output_dir = "out_jsonl/wikipedia"
jsonl_writer = JsonlWriter(path=output_dir, write_kwargs={"force_ascii": False})

for idx, result in enumerate(results):
    jsonl_writer.write_data(result, f"{output_dir}/{idx}.jsonl")

At this point, the `jsonl` files should be located in `out_jsonl/wikipedia`.

## ArXiv Download and Extract

### About ArXiv Dataset
[ArXiv](https://info.arxiv.org/help/bulk_data_s3.html) is an open-access platform for sharing research, and bulk access to its data is also open, with certain stipulations. The ArXiv source files (mostly TeX/LaTeX with figures, in `tar.gz` format) are available from Amazon S3 in requester pays buckets, which means the downloader pays Amazon for the download based on the bandwidth used. In this section, we walk through how to use NeMo Curator to download and extract the ArXiv dataset.

### Prerequisites
Before proceeding, we need to set **AWS credentials** in the environment because the data is stored in AWS S3 [requester pays buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html) (pricing details can be found [here](https://aws.amazon.com/s3/pricing/)).

#### Set your AWS Credentials

In [None]:
%env AWS_ACCESS_KEY_ID=
%env AWS_SECRET_ACCESS_KEY=
%env AWS_SESSION_TOKEN=
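
Before moving on, it can be worth confirming that the variables are actually set. The sketch below reads only `os.environ`; `missing_aws_credentials` is a hypothetical helper, not part of NeMo Curator. It treats `AWS_SESSION_TOKEN` as optional, on the assumption that you may be using long-term credentials, which do not need a session token.

```python
import os

# Assumption: only these two are strictly required; the session token
# is needed only for temporary credentials.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print(f"Missing or empty: {', '.join(missing)}")
else:
    print("AWS credentials found in the environment.")
```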

#### Verify that the ArXiv S3 bucket is accessible

In [None]:
!s5cmd --request-payer=requester ls s3://arxiv/src/ | head | grep -F '.tar'

If everything is set up correctly (including `s5cmd` itself being installed), the command above should list the `.tar` files in the S3 bucket, for example:
```
2010/12/23 05:14:01         227420160  arXiv_src_0001_001.tar
2010/12/23 05:18:10         228853760  arXiv_src_0002_001.tar
2010/12/23 05:22:17         232980480  arXiv_src_0003_001.tar
2010/12/23 05:26:33         193167360  arXiv_src_0004_001.tar
2010/12/23 05:30:12         257617920  arXiv_src_0005_001.tar
2010/12/23 05:34:58         244418560  arXiv_src_0006_001.tar
2010/12/23 05:39:30         247439360  arXiv_src_0007_001.tar
2010/12/23 05:44:19         289003520  arXiv_src_0008_001.tar
2010/12/23 05:49:21         232693760  arXiv_src_0009_001.tar
2010/12/23 05:53:40         280913920  arXiv_src_0010_001.tar
```
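
Since the bucket is requester pays, it can be useful to know how much data a download will transfer. The following sketch parses listing lines in the format shown above and sums their sizes. `parse_listing_line` is a hypothetical helper, and the assumption that the third whitespace-separated field is the size in bytes is inferred from the sample output rather than taken from the `s5cmd` documentation.

```python
def parse_listing_line(line):
    """Split one listing line into (file name, size); assumes 'date time size name'."""
    _date, _time, size, name = line.strip().split(maxsplit=3)
    return name, int(size)


# First two lines of the sample listing above.
SAMPLE_LISTING = """\
2010/12/23 05:14:01         227420160  arXiv_src_0001_001.tar
2010/12/23 05:18:10         228853760  arXiv_src_0002_001.tar
"""

entries = [parse_listing_line(line) for line in SAMPLE_LISTING.splitlines() if line.strip()]
total_bytes = sum(size for _name, size in entries)

for name, size in entries:
    print(f"{name}: {size} bytes")
print(f"Total: {total_bytes} bytes ({total_bytes / 1024**2:.1f} MiB)")
```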

### Run ArXiv Download and Extract Pipeline

In [None]:
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.text.download.arxiv.stage import ArxivDownloadExtractStage

In [None]:
!mkdir -p data/arxiv

In [None]:
stage = ArxivDownloadExtractStage(download_dir="data/arxiv", url_limit=2, record_limit=1000)

Similar to previous examples, the `ArxivDownloadExtractStage` pipeline obtains a list of ArXiv tar file URLs, downloads the tar files, and then extracts the contained LaTeX source files. 

Core arguments for this class:
* `download_dir` (str, optional): The directory where the raw downloaded tar files will be kept. Defaults to "./arxiv_downloads".
* `url_limit` (Optional[int], optional): Limits the maximum number of ArXiv tar file URLs to download and process. If None, all available URLs (from get_arxiv_urls) are processed.
* `record_limit` (Optional[int], optional): Limits the maximum number of records to extract from each tar file. If None, all available records are extracted.
* `add_filename_column` (bool | str, optional): If True, adds a column to the output DataFrame with the filename of the tar file. If a string, adds a column with the specified name. Defaults to True.
* `log_frequency` (int, optional): How often to log progress. Defaults to 1000.
* `verbose` (bool, optional): If True, prints verbose output. Defaults to False.

Because this tutorial was tested on a single-node machine with limited resources, we set `url_limit` and `record_limit` to small values for demonstration purposes.

In [None]:
pipeline = Pipeline("arxiv")
pipeline.add_stage(stage)
results = pipeline.run()

After running the code above, the ArXiv `.tar` files should be downloaded to the `download_dir`.

`results[i].data` is a pandas DataFrame holding the extracted texts and related metadata.

In [None]:
print(f"len(results): {len(results)}")

In [None]:
results[0].data

We can use `JsonlWriter` to save each pandas DataFrame as a `jsonl` file.

In [None]:
!mkdir -p out_jsonl/arxiv

In [None]:
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

output_dir = "out_jsonl/arxiv"
jsonl_writer = JsonlWriter(path=output_dir, write_kwargs={"force_ascii": False})

for idx, result in enumerate(results):
    jsonl_writer.write_data(result, f"{output_dir}/{idx}.jsonl")

At this point, the `jsonl` files should be located in `out_jsonl/arxiv`.

## Conclusion

Now that the pipelines have run to completion and the results have been written to JSONL files, we can shut down the Ray cluster:

In [None]:
try:
    ray_client.stop()
except Exception as e:  # noqa: BLE001
    print(f"Error stopping Ray client: {e}")

In this tutorial, we showed how to download and extract the Common Crawl, Wikipedia, and ArXiv datasets with NeMo Curator. Beyond the built-in pipelines, developers can also create custom download and extract pipelines for other data sources. The framework follows a 4-step pattern in which each step is defined by an abstract base class and wrapped in a corresponding stage:

```
1. URLGenerator → URLGenerationStage    (URLs from config/input)
2. DocumentDownloader → DocumentDownloadStage    (local files from URLs)
3. DocumentIterator → DocumentIterateStage    (raw records from files)
4. DocumentExtractor → DocumentExtractStage    (structured data from records)
```
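
To make the pattern concrete, here is a toy sketch of the four steps using plain Python functions and in-memory data. It does not use the NeMo Curator base classes listed above. Every name in it (`FAKE_REMOTE`, `generate_urls`, `download`, `iterate_records`, `extract`, `run`) is hypothetical, and the `mem://` URLs are placeholders rather than real locations.

```python
import re

# Hypothetical in-memory "remote storage" standing in for real download URLs.
FAKE_REMOTE = {
    "mem://dump-0": "<doc><p>Hello</p></doc><doc><p>World</p></doc>",
    "mem://dump-1": "<doc><p>Third record</p></doc>",
}


def generate_urls(url_limit=None):
    """Step 1: produce the list of URLs to fetch."""
    return sorted(FAKE_REMOTE)[:url_limit]


def download(url):
    """Step 2: fetch the raw file for a URL (here, a dictionary lookup)."""
    return FAKE_REMOTE[url]


def iterate_records(raw, record_limit=None):
    """Step 3: split a raw file into individual records."""
    records = re.findall(r"<doc>(.*?)</doc>", raw, flags=re.DOTALL)
    return records[:record_limit]


def extract(record):
    """Step 4: turn a raw record into structured data by stripping markup."""
    return {"text": re.sub(r"<[^>]+>", "", record).strip()}


def run(url_limit=None, record_limit=None):
    """Chain the four steps; the limits mirror `url_limit` and `record_limit`."""
    rows = []
    for url in generate_urls(url_limit):
        for record in iterate_records(download(url), record_limit):
            rows.append({"url": url, **extract(record)})
    return rows


for row in run():
    print(row)
```

Calling `run(url_limit=1, record_limit=1)` keeps only the first record of the first file, in the same spirit as the small limits used throughout this tutorial.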

For more details, you can refer to the source code of these pipelines and the README file in `nemo_curator/stages/text/download/`.