<a href="https://colab.research.google.com/github/veager/StudyNotes/blob/new/Codes/PyTorch-Tutorial/torch_utils_data%E6%A8%A1%E5%9D%97_Dataset%E5%92%8CDataLoader%E7%B1%BB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The `torch.utils.data` module in the `PyTorch` library

- The `Dataset` class

- The `DataLoader` class

Blog post: [link](https://www.cnblogs.com/veager/articles/16297540.html)

# 0. Loading the iris data

In [23]:
# Load the iris data
from sklearn.datasets import load_iris
data = load_iris()
X, Y = data.data, data.target

print(X.shape, Y.shape)

print(X[:5])
print(Y[:5])

(150, 4) (150,)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]


In [24]:
import torch
import torch.utils.data as tud

X, Y = torch.tensor(X), torch.tensor(Y, dtype=torch.long)  # Convert the NumPy arrays to torch.Tensor (X stays float64)

# 1. The `Dataset` class

## 1.1 The `TensorDataset` class

In [25]:
mydataset = tud.TensorDataset(X, Y)

print(mydataset)

# Number of samples in the Dataset; the built-in len() works as well
print(mydataset.__len__())  
print(mydataset.__getitem__(0)) 
print(mydataset.__getitem__([1, 2, 3]))

print(len(mydataset))
print(mydataset[0])
print(mydataset[[1, 2, 3]])
print(mydataset[1: 4])

<torch.utils.data.dataset.TensorDataset object at 0x7f211fa81e90>
150
(tensor([5.1000, 3.5000, 1.4000, 0.2000], dtype=torch.float64), tensor(0))
(tensor([[4.9000, 3.0000, 1.4000, 0.2000],
        [4.7000, 3.2000, 1.3000, 0.2000],
        [4.6000, 3.1000, 1.5000, 0.2000]], dtype=torch.float64), tensor([0, 0, 0]))
150
(tensor([5.1000, 3.5000, 1.4000, 0.2000], dtype=torch.float64), tensor(0))
(tensor([[4.9000, 3.0000, 1.4000, 0.2000],
        [4.7000, 3.2000, 1.3000, 0.2000],
        [4.6000, 3.1000, 1.5000, 0.2000]], dtype=torch.float64), tensor([0, 0, 0]))
(tensor([[4.9000, 3.0000, 1.4000, 0.2000],
        [4.7000, 3.2000, 1.3000, 0.2000],
        [4.6000, 3.1000, 1.5000, 0.2000]], dtype=torch.float64), tensor([0, 0, 0]))
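`TensorDataset` is one ready-made `Dataset`. As a rough illustration of what the class does, a map-style dataset only needs `__len__` and `__getitem__`. The sketch below uses a hypothetical class name (`PairDataset`) and synthetic tensors in place of the iris data, so it is an assumption-laden example rather than the notebook's own code.

```python
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical map-style dataset wrapping a feature tensor and a label tensor."""

    def __init__(self, features, labels):
        # Both tensors must share the same first (sample) dimension.
        assert features.size(0) == labels.size(0)
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples; this is what len(dataset) returns.
        return self.features.size(0)

    def __getitem__(self, index):
        # Return one (features, label) pair.
        # Tensor indexing also accepts lists and slices.
        return self.features[index], self.labels[index]


# Synthetic stand-in for the iris tensors: 150 samples, 4 features, 3 classes.
features = torch.randn(150, 4)
labels = torch.randint(0, 3, (150,))

pair_dataset = PairDataset(features, labels)
x0, y0 = pair_dataset[0]
xs, ys = pair_dataset[1:4]
print(len(pair_dataset), x0.shape, xs.shape, ys.shape)
```

Because indexing is delegated to the underlying tensors, list and slice indices behave the same way as in the `TensorDataset` cell above.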


# 2. The `DataLoader` class

In [26]:
n_sample = len(mydataset)
batch_size = 16

dataloader = tud.DataLoader(mydataset, batch_size=batch_size)
for i, (X_, Y_) in enumerate(dataloader):
    # print(X_, Y_)
    print(i, X_.size(), Y_.size())

# Equivalent to the loop above, but manual slicing cannot shuffle
for i in range((n_sample - 1) // batch_size + 1):
    X_, Y_ = mydataset[i * batch_size : (i + 1) * batch_size]
    print(i, X_.size(), Y_.size())

0 torch.Size([16, 4]) torch.Size([16])
1 torch.Size([16, 4]) torch.Size([16])
2 torch.Size([16, 4]) torch.Size([16])
3 torch.Size([16, 4]) torch.Size([16])
4 torch.Size([16, 4]) torch.Size([16])
5 torch.Size([16, 4]) torch.Size([16])
6 torch.Size([16, 4]) torch.Size([16])
7 torch.Size([16, 4]) torch.Size([16])
8 torch.Size([16, 4]) torch.Size([16])
9 torch.Size([6, 4]) torch.Size([6])
0 torch.Size([16, 4]) torch.Size([16])
1 torch.Size([16, 4]) torch.Size([16])
2 torch.Size([16, 4]) torch.Size([16])
3 torch.Size([16, 4]) torch.Size([16])
4 torch.Size([16, 4]) torch.Size([16])
5 torch.Size([16, 4]) torch.Size([16])
6 torch.Size([16, 4]) torch.Size([16])
7 torch.Size([16, 4]) torch.Size([16])
8 torch.Size([16, 4]) torch.Size([16])
9 torch.Size([6, 4]) torch.Size([6])
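The last batch above holds only 6 samples because 150 is not a multiple of 16. A minimal sketch of two related `DataLoader` options follows: `shuffle=True` with a seeded `generator` for a repeatable order, and `drop_last=True` to discard the incomplete batch. The synthetic tensors and the seed value are assumptions made for illustration.

```python
import torch
import torch.utils.data as tud

# Synthetic stand-in for the iris tensors (150 samples, 4 features).
features = torch.randn(150, 4)
labels = torch.randint(0, 3, (150,))
dataset = tud.TensorDataset(features, labels)

# A seeded generator makes the shuffled order repeatable across runs.
g = torch.Generator().manual_seed(0)

loader = tud.DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,     # draw samples in random order each epoch
    drop_last=True,   # discard the final batch of 150 % 16 = 6 samples
    generator=g,
)

batch_sizes = [X_.size(0) for X_, Y_ in loader]
print(len(loader), batch_sizes)
```

With `drop_last=True`, `len(loader)` is `150 // 16 = 9` and every batch has exactly 16 samples.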


## 2.1 GPU acceleration

In [27]:
import torch
from torch.utils.data.dataloader import default_collate

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

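# default_collate stacks the samples into batch tensors;
# the lambda then moves every tensor of the batch to `device`
dataloader = tud.DataLoader(
    mydataset, batch_size=16, shuffle=True,
    collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))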

cpu
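Moving tensors inside `collate_fn` works with the default `num_workers=0`. With worker processes, `collate_fn` runs inside the workers, where using CUDA may fail and a lambda may not be picklable on some platforms. A commonly used alternative, sketched here with synthetic data as an assumption, is to keep the loader on the CPU and move each batch inside the training loop.

```python
import torch
import torch.utils.data as tud

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Synthetic stand-in for the iris tensors.
dataset = tud.TensorDataset(torch.randn(150, 4), torch.randint(0, 3, (150,)))

# pin_memory only helps when a GPU is present, so enable it conditionally.
loader = tud.DataLoader(
    dataset, batch_size=16, shuffle=True,
    pin_memory=torch.cuda.is_available())

seen_devices = set()
for X_, Y_ in loader:
    # non_blocking=True allows an asynchronous copy from pinned memory;
    # on a CPU-only machine the call simply returns the same tensor.
    X_ = X_.to(device, non_blocking=True)
    Y_ = Y_.to(device, non_blocking=True)
    seen_devices.add(X_.device.type)

print(device, seen_devices)
```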


# 3. Basic utilities

## 3.1 Extracting subsets

### (1) The `random_split()` function

In [28]:
n_sample = X.size()[0]
n_train = int(n_sample*0.7)
n_valid = int(n_sample*0.2)
n_test = n_sample - n_train - n_valid
lens = [n_train, n_valid, n_test]
print(n_sample, lens)

# Split the samples in random order
d1, d2, d3 = tud.random_split(mydataset, lens, generator=None)
print(d1.__len__(), d2.__len__(), d3.__len__())

# Split the samples in their original order
d1 = tud.Subset(mydataset, range(n_train))
d2 = tud.Subset(mydataset, range(n_train, n_train + n_valid))
d3 = tud.Subset(mydataset, range(n_train + n_valid, n_sample))

print(len(d1), len(d2), len(d3))

150 [105, 30, 15]
105 30 15
105 30 15
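The random split above differs from run to run. If a reproducible split is wanted, one option is to pass a seeded `torch.Generator`. The sketch below uses synthetic data and an arbitrary seed (both assumptions) and checks that two splits with the same seed select the same indices.

```python
import torch
import torch.utils.data as tud

# Synthetic stand-in for the iris tensors.
dataset = tud.TensorDataset(torch.randn(150, 4), torch.randint(0, 3, (150,)))
lens = [105, 30, 15]

# Two splits driven by generators with the same seed.
split_a = tud.random_split(dataset, lens, generator=torch.Generator().manual_seed(42))
split_b = tud.random_split(dataset, lens, generator=torch.Generator().manual_seed(42))

# Each returned Subset stores the indices it selected from the parent dataset.
idx_a = [[int(i) for i in part.indices] for part in split_a]
idx_b = [[int(i) for i in part.indices] for part in split_b]
same_split = idx_a == idx_b

# Together the three subsets cover every index exactly once.
all_indices = sorted(i for part in idx_a for i in part)
is_partition = all_indices == list(range(150))

print(same_split, is_partition)
```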


### (2) The `ConcatDataset` class

In [29]:
# Option 1: the + operator
dc1 = d1 + d2 + d3

# Option 2: equivalent to option 1
dc2 = tud.ConcatDataset([d1, d2, d3])

print(dc1.__len__(), dc2.__len__())

150 150
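To illustrate how a concatenated dataset resolves an index, the following sketch joins two tiny datasets with made-up values (an assumption for readability) and looks at the items on either side of the boundary.

```python
import torch
import torch.utils.data as tud

# Two small datasets with easily recognisable values.
a = tud.TensorDataset(torch.arange(0, 5))      # values 0..4
b = tud.TensorDataset(torch.arange(100, 103))  # values 100..102

cat = tud.ConcatDataset([a, b])

# cumulative_sizes holds the running total of the dataset lengths.
print(len(cat), cat.cumulative_sizes)

# Index 4 is the last item of `a`; index 5 is the first item of `b`.
last_of_a = cat[4][0].item()
first_of_b = cat[5][0].item()
print(last_of_a, first_of_b)

# The + operator on datasets also produces a ConcatDataset.
print(isinstance(a + b, tud.ConcatDataset))
```

Chaining `+` across three datasets nests one `ConcatDataset` inside another, while passing a list creates a single flat one. Length and indexing are the same either way.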


### (3) The `Subset` class

In [30]:
d4 = tud.Subset(mydataset, [1,2,3,4])
print(d4.__len__())

4
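A `Subset` does not copy data. It keeps a list of indices and forwards each lookup to the parent dataset. The sketch below uses a small made-up dataset (an assumption) to show that position 0 of the subset maps to index 1 of the parent.

```python
import torch
import torch.utils.data as tud

# Parent dataset with values 0, 10, 20, ..., 90.
base = tud.TensorDataset(torch.arange(10) * 10)

# Keep only the samples at parent indices 1 to 4.
sub = tud.Subset(base, [1, 2, 3, 4])

first_in_sub = sub[0][0].item()     # looked up as base[1]
same_in_base = base[1][0].item()
print(len(sub), first_in_sub, same_in_base)

# The stored indices and the parent dataset stay accessible.
print(list(sub.indices), sub.dataset is base)
```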
