# 1. Preparing and loading local data

> This section draws on the following documentation:
> 
> https://huggingface.co/docs/datasets/loading
> 
> https://huggingface.co/docs/datasets/nlp_load

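Local data can be prepared in either CSV or JSON format. A CSV file already stores one record per line. A JSON file should be arranged the same way, with each line holding one complete JSON object (a layout often called JSON Lines), for example: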

```json
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}

```
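
If the raw data starts out as a list of Python dictionaries, a file in this one-object-per-line layout can be produced with the standard `json` module. The sketch below is only an illustration: the records and the temporary file name `sample.jsonl` are made up for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; in practice these come from your own data source.
records = [
    {"a": 1, "b": 2.0, "c": "foo", "d": False},
    {"a": 4, "b": -5.5, "c": None, "d": True},
]

# Write one JSON object per line (the layout described above).
path = Path(tempfile.mkdtemp()) / "sample.jsonl"
with path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back line by line to confirm each line parses on its own.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```

Writing each record with its own `json.dumps` call, rather than dumping the whole list at once, is what keeps every line independently parseable.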


The following code loads local data in `csv` and `json` format:

In [2]:
from datasets import load_dataset

# Load data in csv format
dataset_csv = load_dataset("csv", data_files="dataset/test.csv")

# Load data in json format
dataset_json = load_dataset("json", data_files="dataset/test.json")

Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 1002.46it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 249.36it/s]
Generating train split: 3 examples [00:00, 70.71 examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 142.83it/s]
Generating train split: 1 examples [00:00, 133.03 examples/s]


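The training and test splits can also be specified at load time. In that case the `data_files` argument takes a dictionary that maps each split name to a single file or a list of files, as in the following example: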

In [3]:
dataset_train_test = load_dataset("json", data_files={"train":["dataset/test_1.json", "dataset/test_2.json"], "test":"dataset/test.json"})

Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 1983.12it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 569.92it/s]
Generating train split: 2 examples [00:00, 153.38 examples/s]
Generating test split: 1 examples [00:00, 95.02 examples/s]
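
When there are many files, the split-to-files dictionary can be assembled programmatically instead of being typed out. The following is a minimal sketch using only the standard library; the directory and the file names `train_1.json`, `train_2.json`, and `test.json` are hypothetical, and the final `load_dataset` call is shown only as a comment.

```python
import tempfile
from pathlib import Path

# Create a throwaway directory with hypothetical file names for the demo.
data_dir = Path(tempfile.mkdtemp())
for name in ["train_1.json", "train_2.json", "test.json"]:
    (data_dir / name).write_text('{"a": 1}\n', encoding="utf-8")

# Map each split name to the file or files that belong to it.
data_files = {
    "train": sorted(str(p) for p in data_dir.glob("train_*.json")),
    "test": str(data_dir / "test.json"),
}
print(data_files)

# The mapping can then be passed on, e.g.:
# load_dataset("json", data_files=data_files)
```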


# 2. Accessing a dataset

> This section draws on the following documentation:
> 
> https://huggingface.co/docs/datasets/access

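The rotten_tomatoes dataset is used as the example here and is loaded with the code below. Because `split="train"` is passed, loading returns a `Dataset` object; the rest of this section shows how to use that object.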

In [4]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")

Downloading builder script: 100%|██████████| 5.03k/5.03k [00:00<?, ?B/s]
Downloading metadata: 100%|██████████| 2.02k/2.02k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 7.25k/7.25k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 488k/488k [00:00<00:00, 1.26MB/s]
Generating train split: 100%|██████████| 8530/8530 [00:00<00:00, 31398.76 examples/s]
Generating validation split: 100%|██████████| 1066/1066 [00:00<00:00, 21079.39 examples/s]
Generating test split: 100%|██████████| 1066/1066 [00:00<00:00, 22167.00 examples/s]


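## 2.1 Accessing by index or column name

Note: when combining an index with a column name, index first and select the column second. The reverse order returns the same value but is much slower, because selecting the column first works on the whole column before a single element is picked out.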



In [5]:
# Access by index
dataset[0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

In [None]:
# Access by column name
dataset["text"]

In [7]:
# Combine an index with a column name
dataset[0]["text"]

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

## 2.2 Slicing

In [8]:
# Get the first three rows
dataset[:3]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}

In [9]:
# Get rows between three and six
dataset[3:6]

{'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'],
 'label': [1, 1, 1]}
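
A slice comes back as one dictionary of columns rather than a list of rows. When row-wise records are more convenient, the columns can be zipped back together in plain Python. The sketch below hard-codes a small dictionary with the same shape as a slice result, using placeholder texts, so it runs without downloading anything; `batch` and `rows` are names chosen for the illustration.

```python
# Same shape as the result of a slice: column name -> list of values.
# The texts are placeholders, not the real dataset contents.
batch = {
    "text": ["effective but too-tepid biopic", "a second review", "a third review"],
    "label": [1, 1, 1],
}

# Rebuild one dictionary per row by zipping the column lists together.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row)
```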

# 3. Tokenizing the data

> This section draws on the following documentation:
> 
> https://huggingface.co/docs/datasets/use_dataset

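## 3.1 Passing text directly to the tokenizer

This section covers only the simplest use of a tokenizer; for details, see the documentation of the Tokenizers library.

In what follows, the tokenizer comes from the bert-base-uncased model and the dataset is rotten_tomatoes. They are loaded as follows: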

In [11]:
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")

tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.19MB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


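The code below tokenizes a single example from the dataset. The return value has three fields: `input_ids`, `token_type_ids`, and `attention_mask`. Note the difference in shape: when the input is a single text, each field holds one list of integers; when the input is a list of texts, each field holds a list of such lists, one per text.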

In [None]:
tokenizer(dataset[0]["text"])

In [17]:
tokenizer(dataset["text"])

{'input_ids': [[101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], [101, 1996, 9882, 2135, 9603, 13633, 1997, 1000, 1996, 2935, 1997, 1996, 7635, 1000, 11544, 2003, 2061, 4121, 2008, 1037, 5930, 1997, 2616, 3685, 23613, 6235, 2522, 1011, 3213, 1013, 2472, 2848, 4027, 1005, 1055, 4423, 4432, 1997, 1046, 1012, 1054, 1012, 1054, 1012, 23602, 1005, 1055, 2690, 1011, 3011, 1012, 102], [101, 4621, 2021, 2205, 1011, 8915, 23267, 16012, 24330, 102], [101, 2065, 2017, 2823, 2066, 2000, 2175, 2000, 1996, 5691, 2000, 2031, 4569, 1010, 2001, 28518, 2003, 1037, 2204, 2173, 2000, 2707, 1012, 102], [101, 19391, 2004, 2242, 4678, 1010, 2019, 3277, 3185, 2008, 1005, 1055, 2061, 7481, 1998, 10326, 2135, 5159, 2008, 2009, 2987, 1005, 1056, 2514, 2066, 2028, 1012, 102], [

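## 3.2 Using the map function

To use `map`, first define a function that tokenizes the data, such as `tokenization` in the example below. When `map` is called with `batched=True`, that function receives a batch of examples rather than a single one. The batch can be used in almost exactly the same way as a `Dataset` object; for example, `example["text"]` is the list of texts in the batch.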

In [18]:
def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(tokenization, batched=True)

Map: 100%|██████████| 8530/8530 [00:00<00:00, 11501.64 examples/s]


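The example above shows only the simplest use of `map`. More advanced uses exist as well; for instance, with `batched=True` the function may return a different number of rows than it received.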
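
As one illustration of an output with a different number of rows, a batched function can split every text into fixed-size chunks and so return more rows than it received. The function below is a sketch of that idea: `chunk_examples`, the `chunks` column, and the chunk size of 50 characters are choices made for this example. It is exercised on a plain dictionary here so that it runs on its own.

```python
def chunk_examples(batch, size=50):
    """Split every text in the batch into pieces of at most `size` characters."""
    chunks = []
    for text in batch["text"]:
        chunks += [text[i:i + size] for i in range(0, len(text), size)]
    # Only the new column is returned; its length may differ from the input.
    return {"chunks": chunks}


# A hand-written batch in the same shape that map passes when batched=True.
batch = {"text": ["a" * 120, "b" * 30], "label": [1, 0]}
result = chunk_examples(batch)
print(len(batch["text"]), "->", len(result["chunks"]))
```

With a real `Dataset`, the call would look like `dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names)`; the original columns have to be removed because their lengths no longer match the new column.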

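## 3.3 Setting the dataset format

Hugging Face's datasets library supports TensorFlow as well as PyTorch, so after tokenizing a dataset you still need to tell it which framework's format to return. The dataset may also contain fields that are not tensors (the raw `text` column, for example); these are best excluded before training, otherwise the training loop may raise errors. Both steps can be done in a single line: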

In [19]:
# The label column in rotten_tomatoes is named "label". Asking for "labels" raises
# ValueError: Columns ['labels'] not in the dataset.
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])