<a href="https://colab.research.google.com/github/simecek/dspracticum2024/blob/main/lesson05/Datasets_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face `datasets` package

Key Features of the `datasets` package:
 * **Access to a wide variety of datasets**: From standard NLP benchmarks to specialized datasets.
 * **Efficient data loading and handling**: It uses Apache Arrow under the hood, allowing efficient memory usage and processing speed.
 * **Integration with Hugging Face Hub**: It allows users to upload and download datasets to/from the Hugging Face Hub.
 * **Support for streaming datasets**: For very large datasets that may not fit into memory.

In [1]:
# on colab uncomment this to install the package
# !pip install -qq datasets

## Loading a dataset

The `load_dataset` function is the main entry point for loading datasets. It can load a dataset from the Hugging Face Hub, from files on your local disk, or from a remote URL.

Here’s an example of loading a dataset of 138,380 arXiv papers converted to multi-markdown (.mmd) format.

In [2]:
from datasets import load_dataset

ds = load_dataset("neuralwork/arxiver")

ds

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'abstract', 'authors', 'published_date', 'link', 'markdown'],
        num_rows: 138380
    })
})

In [3]:
ds['train']['title'][:5]

['Image Completion via Dual-path Cooperative Filtering',
 "High Sensitivity Beamformed Observations of the Crab Pulsar's Radio\n  Emission",
 'kNN-Res: Residual Neural Network with kNN-Graph coherence for point\n  cloud registration',
 'On the origin of the evolution of the halo occupation distribution',
 'PROSE: Predicting Operators and Symbolic Expressions using Multimodal\n  Transformers']

In [4]:
ds['train'].features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'abstract': Value(dtype='string', id=None),
 'authors': Value(dtype='string', id=None),
 'published_date': Value(dtype='string', id=None),
 'link': Value(dtype='string', id=None),
 'markdown': Value(dtype='string', id=None)}

In [5]:
sum(["neural network" in title.lower() for title in ds['train']['title']])

2558

## Accessing the Dataset Splits, Columns and Data Points

In [6]:
train_data = ds['train']

In [7]:
train_data.select(range(5))

Dataset({
    features: ['id', 'title', 'abstract', 'authors', 'published_date', 'link', 'markdown'],
    num_rows: 5
})

In [8]:
train_data['title'][:5]

['Image Completion via Dual-path Cooperative Filtering',
 "High Sensitivity Beamformed Observations of the Crab Pulsar's Radio\n  Emission",
 'kNN-Res: Residual Neural Network with kNN-Graph coherence for point\n  cloud registration',
 'On the origin of the evolution of the halo occupation distribution',
 'PROSE: Predicting Operators and Symbolic Expressions using Multimodal\n  Transformers']

In [9]:
shuffled_train_data = train_data.shuffle(seed=42)
shuffled_train_data['title'][:5]

['Some Results on Zumkeller Numbers',
 'Editable-DeepSC: Cross-Modal Editable Semantic Communication Systems',
 'Accelerating Kaluza-Klein Universe in Modified Theory of Gravitation',
 'Connecting Speech Encoder and Large Language Model for ASR',
 'Initial On-Sky Performance testing of the Single-Photon Imager for\n  Nanosecond Astrophysics (SPINA) system']

## Filtering and Transformations

In [10]:
neural_network_papers = ds['train'].filter(lambda x: 'neural network' in x['title'].lower())

In [11]:
neural_network_papers

Dataset({
    features: ['id', 'title', 'abstract', 'authors', 'published_date', 'link', 'markdown'],
    num_rows: 2558
})

In [12]:
# Define a function that creates a new column 'title_length'
def title_length(x):
    return {'title_length': len(x['title'])}

In [13]:
# map returns a new dataset; neural_network_papers itself is left unchanged
neural_network_papers.map(title_length)

Dataset({
    features: ['id', 'title', 'abstract', 'authors', 'published_date', 'link', 'markdown', 'title_length'],
    num_rows: 2558
})

## Streaming Large Datasets

For very large datasets that are impractical to download in full, the `datasets` library supports streaming. With `streaming=True`, `load_dataset` returns an iterable dataset that fetches examples as you iterate over it, instead of downloading the whole dataset first.

In [14]:
streamed_dataset = load_dataset("neuralwork/arxiver", split="train", streaming=True)

# Iterate over the first 5 examples in the streamed dataset
for i, x in enumerate(streamed_dataset):
    if i == 5:
        break
    print(x)

{'id': '2305.00379', 'title': 'Image Completion via Dual-path Cooperative Filtering', 'abstract': 'Given the recent advances with image-generating algorithms, deep image\ncompletion methods have made significant progress. However, state-of-art\nmethods typically provide poor cross-scene generalization, and generated masked\nareas often contain blurry artifacts. Predictive filtering is a method for\nrestoring images, which predicts the most effective kernels based on the input\nscene. Motivated by this approach, we address image completion as a filtering\nproblem. Deep feature-level semantic filtering is introduced to fill in missing\ninformation, while preserving local structure and generating visually realistic\ncontent. In particular, a Dual-path Cooperative Filtering (DCF) model is\nproposed, where one path predicts dynamic kernels, and the other path extracts\nmulti-level features by using Fast Fourier Convolution to yield semantically\ncoherent reconstructions. Experiments on thre

In [15]:
# x still holds the last example fetched by the loop above (the sixth one, which was not printed)
x['markdown']

"# A Concise Overview of Safety Aspects\n\n###### Abstract\n\nAs of today, robots exhibit impressive agility but also pose potential hazards to humans using/collaborating with them. Consequently, safety is considered the most paramount factor in human-robot interaction (HRI). This paper presents a multi-layered safety architecture, integrating both physical and cognitive aspects for effective HRI. We outline critical requirements for physical safety layers as service modules that can be arbitrarily queried. Further, we showcase an HRI scheme that addresses human factors and perceived safety as high-level constraints on a validated impact safety paradigm. The aim is to enable safety certification of human-friendly robots across various HRI scenarios.\n\nKeywords:Human-robot interaction, gracefulness, safety\n\n## 1 Introduction\n\nHuman-friendly robots are distinguished by their ability to delicately react and physically interact with the world through compliant hardware and adaptive co

## Saving and Loading Datasets Locally

You can save datasets locally and load them later to avoid redownloading or reprocessing.

In [16]:
# Save the filtered dataset
neural_network_papers.save_to_disk("neural_network_papers")

# Load the dataset from the saved location
from datasets import load_from_disk
loaded_dataset = load_from_disk("neural_network_papers")

# Check loaded dataset
print(loaded_dataset[0])

{'id': '2304.00050', 'title': 'kNN-Res: Residual Neural Network with kNN-Graph coherence for point\n  cloud registration', 'abstract': 'In this paper, we present a residual neural network-based method for point\nset registration that preserves the topological structure of the target point\nset. Similar to coherent point drift (CPD), the registration (alignment)\nproblem is viewed as the movement of data points sampled from a target\ndistribution along a regularized displacement vector field. While the coherence\nconstraint in CPD is stated in terms of local motion coherence, the proposed\nregularization term relies on a global smoothness constraint as a proxy for\npreserving local topology. This makes CPD less flexible when the deformation is\nlocally rigid but globally non-rigid as in the case of multiple objects and\narticulate pose registration. A Jacobian-based cost function and\ngeometric-aware statistical distances are proposed to mitigate these issues.\nThe latter allows for mea

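## Uploading to Hugging Face Hub

`push_to_hub` uploads a dataset to a repository on the Hugging Face Hub. It needs write access to that repository, so authenticate first, for example with `login` from `huggingface_hub` (see the last cell).
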
In [17]:
neural_network_papers.push_to_hub("simecek/neural_network_papers")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/simecek/neural_network_papers/commit/028c161f576f2e078b75ced241ef41b04dfdb542', commit_message='Upload dataset', commit_description='', oid='028c161f576f2e078b75ced241ef41b04dfdb542', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
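# Authenticate with the Hub; run this before calling push_to_hub
from huggingface_hub import login
login("PUT YOUR TOKEN HERE")  # replace with a token that has write access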