# Examples of Using the Hugging Face `datasets` Library

This downloads the IMDB dataset from Hugging Face:

In [1]:
from datasets import load_dataset
# Download (and cache) the IMDB dataset from the Hugging Face Hub
dataset = load_dataset("imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can also build a `Dataset` from a local CSV file. The `names` argument supplies the column names:

In [17]:
# CSV > Dataset
from datasets import Dataset

# `names` sets the column names for the loaded data
ds = Dataset.from_csv("/home/users/testuser/courses/ml-course/Data/imdb_train.csv",
                      names=['label','text'])
ds

Dataset({
    features: ['label', 'text'],
    num_rows: 1000
})

`Dataset` is backed by Apache Arrow, but its CSV loader reads files with pandas, so reading the CSV into a DataFrame first and converting it gives the same result:

In [10]:
# CSV > Pandas > Dataset
from datasets import Dataset
import pandas as pd

# Read the CSV with pandas, then convert the DataFrame to a Dataset
df = pd.read_csv("~/courses/ml-course/Data/imdb_train.csv", names=['label','text'])
dataset = Dataset.from_pandas(df)
dataset


Dataset({
    features: ['label', 'text'],
    num_rows: 1000
})

Finally, we can put multiple datasets inside a `DatasetDict`, which is what we get when we download the data directly from Hugging Face:

In [2]:
# CSV > Dataset > DatasetDict
import os
from datasets import Dataset, DatasetDict

train_data_path = "../Data/imdb_train.csv"
test_data_path = "../Data/imdb_test.csv"

dataset = DatasetDict()
dataset["train"] = Dataset.from_csv(os.path.abspath(train_data_path),
                                    names=['label','text'])
dataset["test"] = Dataset.from_csv(os.path.abspath(test_data_path),
                                   names=['label','text'])
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})

A list of dictionaries, one per row, can be converted to a `Dataset`:

In [None]:
# List of Dict > Dataset
from datasets import Dataset
# Each dictionary is one row; its keys become the column names
rows = [{'text': "This wine is really good."} for _ in range(1000)]
dataset = Dataset.from_list(rows)
dataset

Dataset({
    features: ['text'],
    num_rows: 1000
})