<a href="https://colab.research.google.com/github/weihanchen/google-colab-python-learn/blob/main/jupyter-examples/huggingface/hugging_face_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

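# [Hugging Face] Ep.3 Digging for Gold in Datasets
Data is the most important fuel for training AI models, and Hugging Face Datasets hosts many public datasets that are well suited for practice. Let's start with the datasets part.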

## Install the package

In [None]:
!pip install datasets

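## Inspect dataset information
Use [load_dataset_builder()](https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/loading_methods#datasets.load_dataset_builder) to inspect a dataset's metadata before committing to downloading the data itself. The dataset inspected here is [imdb](https://huggingface.co/datasets/imdb).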

In [None]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("imdb")


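### Basic information
The description and features show that this is a movie review dataset for binary sentiment classification, with `neg` and `pos` labels.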

In [None]:
print(ds_builder.info.description)

print(ds_builder.info.features)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


## Indexing

In [None]:
from datasets import load_dataset

# Load the training split
ds = load_dataset("imdb", split='train')



In [None]:
# First row
ds[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [None]:
# Last row
ds[-1]

{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.',
 'label': 1}

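## Shuffling
A neatly ordered dataset is convenient to read, but for training we usually want the rows mixed. When rows with similar properties sit next to each other, any stretch of the data is a skewed sample of the whole, so shuffling spreads the classes more evenly through the dataset.

The first five rows all have the label `0`, so reading the data in order lacks randomness. Let's shuffle it.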

In [None]:
ds['label'][:5]

[0, 0, 0, 0, 0]

Now shuffle. The `seed` argument is the random seed; fixing it makes every run produce the same random order, which is useful for testing and debugging.

In [None]:
shuffled_ds = ds.shuffle(seed=42)

shuffled_ds["label"][:5]



[1, 1, 0, 1, 0]
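To illustrate what a fixed seed buys, here is a minimal sketch that uses Python's standard `random` module rather than `datasets`, so it shows the idea only and not how the library implements `shuffle`. The helper `shuffled` and the toy label list are assumptions made for this example.

```python
import random

# Toy labels sorted by class, similar to the first rows of the imdb train split.
labels = [0] * 5 + [1] * 5


def shuffled(values, seed):
    """Return a shuffled copy of values, using a dedicated seeded generator."""
    rng = random.Random(seed)
    out = list(values)
    rng.shuffle(out)
    return out


run_a = shuffled(labels, seed=42)
run_b = shuffled(labels, seed=42)
run_c = shuffled(labels, seed=7)

# Same seed gives the same order; shuffling only reorders the rows.
print(run_a == run_b)
print(sorted(run_a) == sorted(labels))
print(run_a, run_c)
```

The first two checks print `True`: repeating the shuffle with the same seed reproduces the order, and the set of labels is unchanged.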

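## Filtering
A dataset may be worth its weight in gold, but it also carries plenty of noise. Filtering lets us keep only the rows we need. For example, to train only on reviews that contain the string "U.S" and are shorter than 500 characters, we can do the following.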

In [None]:
ds1 = ds.filter(lambda x: 'U.S' in x['text'] and len(x['text']) < 500)

ds1[:3]



{'text': ['It is not un-common to see U.S. re-makes of foreign movies that fall flat on their face, but here is the flip side!!! This is an awful re-make of the U.S. movie "Wide Awake" by the British!<br /><br />"Wide Awake" is strange but entertaining and funny! "Liam" on the other hand is just strange. I must give credit to "Liam" for one thing, and that is making it clear that I made the right choice in changing my religion!',
  'I saw this movie on Comedy Central a few times. This movie was pretty good. It\'s an interesting adventure with the life of Sunny Davis, who is arranged to marry the king of Ohtar, so that the U.S. can get an army base there to balance power in the Middle East. Some good jokes, including "Sunnygate." I also just loved the ending theme. It gave me great political spirit. Ten out of ten was my rating for this movie.',
  '"Antwone Fisher" tells of a young black U.S. Navy enlisted man and product of childhood abuse and neglect (Luke) whose hostility toward othe
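A lambda is fine for a one-off filter, but a named predicate is easier to test before running it over the full dataset. The sketch below is one possible way to do that; `keep_row`, its parameters, and the sample rows are hypothetical and not part of the original notebook.

```python
def keep_row(row, keyword="U.S", max_chars=500):
    """Return True if the row's text mentions keyword and is shorter than max_chars."""
    text = row["text"]
    return keyword in text and len(text) < max_chars


# Made-up rows in the same {'text', 'label'} shape as imdb.
rows = [
    {"text": "A U.S. remake that works.", "label": 1},
    {"text": "A quiet French drama.", "label": 1},
    {"text": "U.S. " + "x" * 600, "label": 0},
]

# Only the first row both mentions the keyword and is short enough.
kept = [row for row in rows if keep_row(row)]
print(len(kept))
```

Because `filter` calls its function with one example dict at a time, the same predicate could then be passed directly as `ds.filter(keep_row)`.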

## More operations

See [datasets/process](https://huggingface.co/docs/datasets/process).