### Loading and processing the IMDB movie review dataset
In this example, we load the IMDB dataset from Hugging Face, then
use `torchdata.nodes` to process it and generate training batches.

In [1]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification

In [2]:
import torch
from torch.utils.data import default_collate, RandomSampler, SequentialSampler

In [3]:
# Load the IMDB dataset from Hugging Face datasets and select the "train" split
dataset = load_dataset("imdb", streaming=False)
dataset = dataset["train"]
# Since dataset is a map-style dataset, we can set up a sampler to shuffle the data
# Please refer to the migration guide here https://pytorch.org/data/main/migrate_to_nodes_from_utils.html
# to migrate from torch.utils.data to torchdata.nodes

sampler = RandomSampler(dataset)
# Use a standard bert tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Now we can set up some torchdata.nodes to create our pre-proc pipeline

All `torchdata.nodes.BaseNode` implementations are Iterators.
`MapStyleWrapper` combines a sampler and a map-style dataset into a single Iterator:
it draws indices from the sampler and looks up the corresponding items in the dataset.
Under the hood, `MapStyleWrapper` just does:
```python
node = IterableWrapper(sampler)
node = Mapper(node, map_fn=dataset.__getitem__)  # You can parallelize this with ParallelMapper
```
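
To make the idea concrete, here is a minimal pure-Python sketch of what combining a sampler with a map-style dataset amounts to. It does not use `torchdata`; `toy_dataset`, `toy_sampler`, and `iterate_map_style` are hypothetical names introduced only for illustration.

```python
import random


def iterate_map_style(dataset, sampler):
    """Yield dataset[i] for every index i produced by the sampler."""
    for index in sampler:
        yield dataset[index]


# Hypothetical stand-ins for the IMDB dataset and RandomSampler
toy_dataset = [{"text": f"review {i}", "label": i % 2} for i in range(5)]
rng = random.Random(0)
toy_sampler = rng.sample(range(len(toy_dataset)), k=len(toy_dataset))

# Every item is visited exactly once, in the shuffled order given by the sampler
items = list(iterate_map_style(toy_dataset, toy_sampler))
print(len(items))
```

The sampler only decides the order of indices; the dataset lookup is a separate step, which is why that step can be parallelized independently.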

In [4]:
from torchdata.nodes import MapStyleWrapper, ParallelMapper, Batcher, PinMemory, Loader
node = MapStyleWrapper(map_dataset=dataset, sampler=sampler)

# Now we want to transform the raw inputs. We can just use another Mapper with
# a custom map_fn to perform this. Using ParallelMapper allows us to use multiple
# threads (or processes) to parallelize this work and have it run in the background
max_len = 512
batch_size = 2
def bert_transform(item):
    encoding = tokenizer.encode_plus(
        item["text"],
        add_special_tokens=True,
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="pt",
    )
    return {
        "input_ids": encoding["input_ids"].flatten(),
        "attention_mask": encoding["attention_mask"].flatten(),
        "labels": torch.tensor(item["label"], dtype=torch.long),
    }
node = ParallelMapper(node, map_fn=bert_transform, num_workers=2) # output items are Dict[str, tensor]

# Next we batch the inputs, and then apply a collate_fn with another Mapper
# to stack the tensors across the samples in each batch. We use torch.utils.data.default_collate for this
node = Batcher(node, batch_size=batch_size) # output items are List[Dict[str, tensor]]
node = ParallelMapper(node, map_fn=default_collate, num_workers=2) # outputs are Dict[str, tensor]

# we can optionally apply pin_memory to the batches
if torch.cuda.is_available():
    node = PinMemory(node)

# Since nodes are iterators, they need to be manually .reset() between epochs.
# We can wrap the root node in Loader to convert it to a more conventional Iterable.
loader = Loader(node)
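
The difference between an iterator and an iterable can be sketched in plain Python. The `CountingIterator` and `ReiterableLoader` classes below are hypothetical stand-ins, not `torchdata` classes; they only illustrate why an exhausted iterator needs a reset and how a wrapper can perform that reset each time iteration starts.

```python
class CountingIterator:
    """A toy iterator that must be reset before it can be consumed again."""

    def __init__(self, n):
        self.n = n
        self.i = 0

    def reset(self):
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        value = self.i
        self.i += 1
        return value


class ReiterableLoader:
    """A toy wrapper that resets the wrapped iterator whenever iter() is called."""

    def __init__(self, iterator):
        self.iterator = iterator

    def __iter__(self):
        self.iterator.reset()
        return self.iterator


it = CountingIterator(3)
first_pass = list(it)
second_pass = list(it)  # already exhausted, so nothing is yielded

toy_loader = ReiterableLoader(CountingIterator(3))
epochs = [list(toy_loader) for _ in range(2)]  # each epoch starts fresh
```

With the wrapper, a plain `for batch in toy_loader:` loop can be repeated once per epoch without an explicit reset call.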

In [5]:
# Inspect a batch
batch = next(iter(loader))
print(batch)
# Each batch has three keys, as defined in the function `bert_transform`.
# Since the batch size is 2, two samples are stacked together for each key.

{'input_ids': tensor([[ 101, 1045, 2572,  ..., 2143, 2000,  102],
        [ 101, 2004, 1037,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([0, 1])}
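
The shape of this output follows from the batch-then-collate steps. As a rough illustration, the sketch below groups items into lists of `batch_size` and then turns each list of dicts into a dict of lists, using plain Python lists in place of tensors. The helpers `batch_items` and `collate_dicts` are hypothetical and are not how `Batcher` or `default_collate` are implemented.

```python
def batch_items(items, batch_size):
    """Group an iterable of items into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Assumption for this sketch: keep the final, possibly smaller, batch.
    # A real batching node may drop it depending on its configuration.
    if batch:
        yield batch


def collate_dicts(batch):
    """Turn a list of dicts into a dict of lists, one entry per key."""
    return {key: [item[key] for item in batch] for key in batch[0]}


# Toy items shaped like the output of a per-sample transform
toy_items = [
    {"input_ids": [101, 7, 102], "labels": 0},
    {"input_ids": [101, 9, 102], "labels": 1},
    {"input_ids": [101, 4, 102], "labels": 1},
]
toy_batches = [collate_dicts(b) for b in batch_items(toy_items, batch_size=2)]
print(toy_batches[0])
```

Each key in the collated batch holds one entry per sample, which mirrors the two rows under `input_ids` and `attention_mask` and the two values under `labels` in the output above.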
