<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/huggingface-transformers/huggingface-course/05-dataset-library/3_big_data_handling_with_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Big data handling with 🤗 Datasets to the rescue!

Nowadays it is not uncommon to find yourself working with multi-gigabyte datasets, especially if you’re planning to pretrain a transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the WebText corpus used to pretrain GPT-2 consists of over 8 million documents and 40 GB of text — loading this into your laptop’s RAM is likely to give it a heart attack!

Fortunately, 🤗 Datasets has been designed to overcome these limitations. It frees you from memory management problems by treating datasets as memory-mapped files, and from hard drive limits by streaming the entries in a corpus.
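
To make the memory-mapping idea concrete, here is a minimal sketch using only Python's standard `mmap` module. It is not how 🤗 Datasets is implemented internally (the library memory-maps Apache Arrow files); the temporary file and its contents are made up for illustration.

```python
import mmap
import os
import tempfile

# Write a small file to disk as a stand-in for a large dataset file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)  # 10,000 bytes in total
    path = tmp.name

with open(path, "rb") as f:
    # Map the whole file (length 0) into the process's address space, read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total_size = len(mm)
        # The OS loads pages on demand, so slicing here does not first
        # copy the entire file into a Python object.
        chunk = mm[5000:5010]

os.remove(path)
print(total_size, chunk)
```

The same principle lets a process index into a file far larger than available RAM, since only the pages that are actually touched need to be resident.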

**References**:

[Big data? 🤗 Datasets to the rescue!](https://huggingface.co/course/chapter5/4?fw=pt)

[Apache Arrow: Read DataFrame With Zero Memory](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a)

## Setup

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install zstandard
!pip install psutil

In [3]:
from datasets import load_dataset
from datasets import interleave_datasets

from transformers import AutoTokenizer

import psutil
import timeit
from itertools import islice

## Load dataset

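Next, we load the dataset straight from its remote URL by passing that URL to the `data_files` argument of `load_dataset()` together with the generic `json` loader. The file is a Zstandard-compressed JSON Lines file (`.jsonl.zst`), which is why `zstandard` was installed above.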

In [None]:
# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
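
The remote file is in JSON Lines format, that is, one JSON object per line. The following is a minimal standard-library sketch of reading such a file lazily, one record at a time. The records and the `iter_jsonl` helper are hypothetical and only illustrate the format; this is not what `load_dataset()` does internally.

```python
import json
import os
import tempfile
from itertools import islice

# Made-up records for illustration; not taken from the real corpus.
records = [{"id": i, "text": f"abstract number {i}"} for i in range(5)]

with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    path = tmp.name


def iter_jsonl(file_path):
    """Yield one parsed record at a time instead of loading the whole file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


reader = iter_jsonl(path)
first_two = list(islice(reader, 2))  # only the first two lines are parsed
reader.close()  # closes the underlying file before it is deleted
os.remove(path)
print(first_two)
```

Because the generator parses lines only when asked, `islice` can peek at the first few records without touching the rest of the file.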

In [None]:
pubmed_dataset

In [None]:
pubmed_dataset[0]

## The magic of memory mapping

In [7]:
# memory_info().rss (resident set size) is reported in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 1171.57 MB
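
To put this figure in context, it helps to print sizes in a consistent unit. The sketch below assumes a hypothetical helper, `format_size`, which is not part of 🤗 Datasets or `psutil`; it applies the same 1024-based conversion as the cell above to a made-up byte count.

```python
def format_size(num_bytes: float) -> str:
    """Format a byte count using 1024-based units, as in the cell above."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we support.
        if num_bytes < 1024 or unit == "TB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024


# Made-up byte count for illustration.
formatted = format_size(5 * 1024 * 1024)
print(formatted)
```

With the dataset loaded, one could compare `format_size(psutil.Process().memory_info().rss)` against `format_size(pubmed_dataset.dataset_size)`, assuming the loaded `Dataset` reports its on-disk size through the `dataset_size` attribute.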
