<a href="https://colab.research.google.com/github/simecek/dspracticum2024/blob/main/lesson05/Datasets_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HuggingFace `datasets` package

Key Features of the `datasets` package:
 * **Access to a wide variety of datasets**: From standard NLP benchmarks to specialized datasets.
 * **Efficient data loading and handling**: It uses Apache Arrow under the hood, allowing efficient memory usage and processing speed.
 * **Integration with Hugging Face Hub**: It allows users to upload and download datasets to/from the Hugging Face Hub.
 * **Support for streaming datasets**: For very large datasets that may not fit into memory.

In [1]:
# on colab uncomment this to install the package
# !pip install -qq datasets

## Loading a dataset

The `load_dataset` function is the main entry point for loading datasets. It can load a dataset from the Hugging Face Hub, from files on your local disk, or from a remote location.

Here’s an example of loading a dataset of 138,830 arXiv papers converted to multi-markdown (.mmd) format.

In [None]:
from datasets import load_dataset

ds = load_dataset("neuralwork/arxiver")

ds

In [None]:
ds['train']['title'][:5]

In [None]:
ds['train'].features

In [None]:
# Count the titles that mention "neural network" (case-insensitive)
sum("neural network" in title.lower() for title in ds['train']['title'])

## Accessing the Dataset Splits, Columns and Data Points

In [6]:
train_data = ds['train']

In [None]:
train_data.select(range(5))

In [None]:
train_data['title'][:5]

In [None]:
shuffled_train_data = train_data.shuffle(seed=42)
shuffled_train_data['title'][:5]

## Filtering and Transformations

In [10]:
neural_network_papers = ds['train'].filter(lambda x: 'neural network' in x['title'].lower())

In [None]:
neural_network_papers

In [12]:
# Define a function that creates a new column 'title_length'
def title_length(x):
    return {'title_length': len(x['title'])}

In [None]:
neural_network_papers.map(title_length)

## Streaming Large Datasets

For datasets too large to download or hold in memory, the `datasets` library supports streaming. With `streaming=True`, `load_dataset` returns an iterable dataset that fetches examples lazily as you iterate over it, instead of downloading everything up front.

In [None]:
streamed_dataset = load_dataset("neuralwork/arxiver", split="train", streaming=True)

# Iterate over the first 5 examples in the streamed dataset
for i, x in enumerate(streamed_dataset):
    if i == 5:
        break
    print(x)

In [None]:
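# `x` still holds the last example fetched by the loop above
# (the sixth one, fetched just before the `break`)
x['markdown']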

## Saving and Loading Datasets Locally

You can save datasets locally and load them later to avoid redownloading or reprocessing.

In [None]:
# Save the filtered dataset
neural_network_papers.save_to_disk("neural_network_papers")

# Load the dataset from the saved location
from datasets import load_from_disk
loaded_dataset = load_from_disk("neural_network_papers")

# Check loaded dataset
print(loaded_dataset[0])

## Uploading to Hugging Face Hub


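Log in first with a token that has write access, so that `push_to_hub` can authenticate. Then push the dataset to a repository under your own namespace.

In [None]:
from huggingface_hub import login
login("PUT YOUR TOKEN HERE")

In [None]:
# Replace `simecek` with your own username or organization
neural_network_papers.push_to_hub("simecek/neural_network_papers")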
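
Pasting a token directly into a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read the token from the `HF_TOKEN` environment variable. The helper `read_hf_token` is hypothetical and not part of `huggingface_hub`; the `login` call is left commented out because it needs a real token.

```python
import os

def read_hf_token(env=os.environ):
    """Return the token stored in HF_TOKEN, or raise if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable first.")
    return token

# Usage with a real token in the environment:
# from huggingface_hub import login
# login(token=read_hf_token())
```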