# Introduction to Datasets

The Hugging Face [Datasets](https://huggingface.co/datasets) library provides a simple, unified API to download and access datasets for machine learning research, so you do not need to write custom data-loading code for each one. This notebook focuses on key NLP datasets for tasks like text classification, question answering, and sentence similarity.

In this tutorial, you will learn how to:

- Load a dataset and its subsets
- Use the `load_dataset_builder()` function to load a dataset builder and inspect a dataset's attributes
- Use the `get_dataset_config_names()` function to list every configuration available for a dataset



This code installs the required packages `datasets`, `evaluate`, `transformers[torch]`, and `rich` with the pip package manager. The leading exclamation mark tells Jupyter or IPython to run the line as a shell command rather than as Python code.

In [None]:
!pip install datasets evaluate transformers[torch] rich

The code imports the `load_dataset` function from the `datasets` module and the `print` function from the `rich` package. `load_dataset` loads datasets, and the `rich` version of `print` replaces the built-in `print` to pretty-print objects.

In [None]:
from datasets import load_dataset
from rich import print


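This code loads the `mrpc` configuration (Microsoft Research Paraphrase Corpus) of the GLUE benchmark using the `load_dataset` function. Because no split is specified, the object stored in the variable `dataset` is a `DatasetDict` containing every available split.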

In [None]:
dataset = load_dataset("glue", "mrpc")

In [None]:
print(dataset)
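
Printing the object shows one entry per split. As a minimal sketch, the hypothetical helper `summarize_splits` below (not part of `datasets`) reports the row count of each split. It assumes only that its argument behaves like a mapping from split names to sized collections, which holds for a `DatasetDict`; the toy data is made up so the cell runs without a download.

In [None]:
```python
def summarize_splits(splits):
    """Map each split name to its number of rows.

    `splits` is any mapping from split name to a sized collection.
    """
    return {name: len(rows) for name, rows in splits.items()}


# Stand-in data so the sketch runs offline; the values are made up.
toy_splits = {"train": ["a", "b", "c"], "validation": ["d"], "test": ["e", "f"]}
print(summarize_splits(toy_splits))
```

With the cells above executed, `summarize_splits(dataset)` should report the sizes of the MRPC splits.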

This code loads only the train split of the MRPC dataset from GLUE and assigns it to the variable `train_dataset`. With `split="train"`, `load_dataset` returns a single `Dataset` rather than a `DatasetDict`.

In [None]:
train_dataset = load_dataset("glue", "mrpc", split="train")

In [None]:
print(train_dataset)
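
A single split can be indexed by column name, so a quick check of class balance is possible. The sketch below assumes the split has a `label` column, as MRPC does; `label_counts` is a hypothetical helper, and the toy labels are made up so the cell runs offline.

In [None]:
```python
from collections import Counter


def label_counts(labels):
    """Count how often each label value occurs in an iterable of labels."""
    return dict(Counter(labels))


# Made-up labels standing in for train_dataset["label"].
toy_labels = [1, 0, 1, 1, 0]
print(label_counts(toy_labels))
```

With the split loaded, `label_counts(train_dataset["label"])` should give the same kind of summary for the real data.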

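The code loads the training split of the SQuAD dataset using the `load_dataset` function from the `datasets` library. SQuAD has a single configuration, so no configuration name is needed. The dataset is stored in the `squad_train` variable, and the following cell prints the example at index 10.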

In [None]:
squad_train = load_dataset("squad", split="train")

In [None]:
print(squad_train[10])
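
Each SQuAD example is a dictionary whose `answers` field holds parallel lists `text` and `answer_start`, where each start value is a character offset into `context`. As a rough sketch, the hypothetical helper `answer_spans` checks that the offsets line up with the answer text. The toy example is hand-written in the SQuAD layout and is not taken from the dataset.

In [None]:
```python
def answer_spans(example):
    """Return (start, end, text) for each answer in a SQuAD-style example.

    Raises ValueError if an offset does not point at the answer text.
    """
    context = example["context"]
    answers = example["answers"]
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        if context[start:end] != text:
            raise ValueError(f"offset {start} does not match answer {text!r}")
        spans.append((start, end, text))
    return spans


# Hand-written example in the SQuAD layout; not a real dataset row.
toy_example = {
    "context": "The Eiffel Tower is in Paris.",
    "question": "Where is the Eiffel Tower?",
    "answers": {"text": ["Paris"], "answer_start": [23]},
}
print(answer_spans(toy_example))
```

With the split loaded, `answer_spans(squad_train[10])` applies the same check to a real example.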

The code imports the `load_dataset_builder` function from the `datasets` module. This function returns a dataset builder object, which describes a dataset without downloading its data.

In [None]:
from datasets import load_dataset_builder

This code loads a dataset builder for the MRPC task from the GLUE benchmark. The builder's `info` attribute exposes metadata such as the dataset's description and features, so you can inspect MRPC before committing to downloading it.

In [None]:
ds_builder = load_dataset_builder("glue", "mrpc")

In [None]:
print(ds_builder.info)
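
The full `info` printout can be long. The sketch below pulls out just the feature names and per-split example counts. It assumes `info.features` is a mapping keyed by feature name and `info.splits` maps split names to objects with a `num_examples` attribute, as on a `DatasetInfo`. It also assumes either may be `None` when that metadata is unavailable. `summarize_info` is a hypothetical helper, and the stand-in object uses made-up values.

In [None]:
```python
from types import SimpleNamespace


def summarize_info(info):
    """Collect feature names and per-split example counts from an info object."""
    features = info.features or {}
    splits = info.splits or {}
    return {
        "features": list(features),
        "splits": {name: split.num_examples for name, split in splits.items()},
    }


# Stand-in object with made-up values so the sketch runs offline.
toy_info = SimpleNamespace(
    features={"sentence1": "string", "sentence2": "string", "label": "int"},
    splits={"train": SimpleNamespace(num_examples=3)},
)
print(summarize_info(toy_info))
```

With the builder loaded, `summarize_info(ds_builder.info)` should give a compact view of the same metadata.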

The code imports the `get_dataset_config_names` function from the `datasets` module. This function is used to retrieve the names of the available dataset configurations.

In [None]:
from datasets import get_dataset_config_names

This code calls the `get_dataset_config_names()` function with the argument `"glue"`. It returns a list of configuration names for the GLUE dataset; each name, such as `mrpc`, can be passed as the second argument to `load_dataset`.

In [None]:
configs = get_dataset_config_names("glue")

In [None]:
print(configs)
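
One use for this list is validating a configuration name before loading it. The sketch below uses the standard library's `difflib` to suggest near matches for a mistyped name. `check_config` is a hypothetical helper, not part of `datasets`, and `toy_configs` is a hand-written subset standing in for the list returned above.

In [None]:
```python
import difflib


def check_config(name, configs):
    """Return `name` if it is a known configuration, else raise ValueError."""
    if name in configs:
        return name
    close = difflib.get_close_matches(name, configs, n=3)
    hint = f" Did you mean {', '.join(close)}?" if close else ""
    raise ValueError(f"Unknown configuration {name!r}.{hint}")


# Hand-written subset standing in for the real list of configurations.
toy_configs = ["cola", "sst2", "mrpc", "qqp"]
print(check_config("mrpc", toy_configs))
```

With the real list available, `load_dataset("glue", check_config("mrpc", configs))` would fail early with a suggestion if the name were misspelled.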