<a href="https://colab.research.google.com/github/veager/StudyNotes/blob/new/Codes/PyTorch-Tutorial/torch_utils_data%E6%A8%A1%E5%9D%97_Dataset%E5%92%8CDataLoader%E7%B1%BB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The `torch.utils.data` module of the `PyTorch` library

- `Dataset` class

- `DataLoader` class

Blog post: [link](https://www.cnblogs.com/veager/articles/16297540.html)

# 0. Loading the iris data

In [None]:
# Load the iris data
from sklearn.datasets import load_iris
data = load_iris()
X, Y = data.data, data.target

print(X.shape, Y.shape)

print(X[:5])
print(Y[:5])

(150, 4) (150,)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]


In [None]:
import torch
import torch.utils.data as tud

# Convert the NumPy arrays to torch.Tensor (X stays float64, Y becomes int64)
X, Y = torch.tensor(X), torch.tensor(Y, dtype=torch.long)

# 1. The `Dataset` class

## 1.1 The `TensorDataset` class

In [None]:
mydataset = tud.TensorDataset(X, Y)

print(mydataset)

# Total number of samples in the Dataset; the built-in len() works as well
print(mydataset.__len__(), len(mydataset))  

print(mydataset.__getitem__(0))
print(mydataset.__getitem__([1, 2]))

<torch.utils.data.dataset.TensorDataset object at 0x7fd714fef310>
150 150
(tensor([5.1000, 3.5000, 1.4000, 0.2000], dtype=torch.float64), tensor(0))
(tensor([[4.9000, 3.0000, 1.4000, 0.2000],
        [4.7000, 3.2000, 1.3000, 0.2000]], dtype=torch.float64), tensor([0, 0]))
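
`TensorDataset` is one ready-made `Dataset`. When samples need custom handling, a map-style dataset can be written by subclassing `Dataset` and defining `__len__` and `__getitem__`. The following is a minimal sketch under stated assumptions: the class name `PairDataset` and the toy tensors are hypothetical and are not part of the original notebook.

```python
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):  # hypothetical name, for illustration only
    """Map-style dataset wrapping a feature tensor and a label tensor."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples; this is what len(dataset) returns
        return len(self.features)

    def __getitem__(self, idx):
        # Return one (feature, label) pair, like TensorDataset does
        return self.features[idx], self.labels[idx]


# Toy data (an assumption; any tensors of equal length would do)
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

pair_ds = PairDataset(features, labels)
print(len(pair_ds))
print(pair_ds[0])
```

Such a class can be passed to `DataLoader` in the same way as a `TensorDataset`.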


# 2. The `DataLoader` class

In [None]:
dataloader = tud.DataLoader(mydataset, batch_size=16)
for i, (X_, Y_) in enumerate(dataloader):
    # print(X_, Y_)
    print(X_.size(), Y_.size())

torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([16, 4]) torch.Size([16])
torch.Size([6, 4]) torch.Size([6])
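
The last batch above holds only 6 samples because 150 = 9 × 16 + 6. If a fixed batch size is required, `drop_last=True` discards the incomplete batch, and `shuffle=True` reshuffles the samples every epoch. A small sketch on synthetic data (the `toy_*` names are assumptions, used so that the block runs on its own):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in with the same shape as the iris tensors
toy_X = torch.randn(150, 4)
toy_Y = torch.randint(0, 3, (150,))
toy_dataset = TensorDataset(toy_X, toy_Y)

# Default behaviour: the final, smaller batch is kept
full_loader = DataLoader(toy_dataset, batch_size=16)

# Shuffle each epoch and drop the incomplete final batch
trimmed_loader = DataLoader(toy_dataset, batch_size=16,
                            shuffle=True, drop_last=True)

# len() of a DataLoader is the number of batches per epoch
print(len(full_loader), len(trimmed_loader))

batch_sizes = [xb.size(0) for xb, _ in trimmed_loader]
print(batch_sizes)
```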


## 2.1 GPU acceleration

In [None]:
import torch
from torch.utils.data.dataloader import default_collate

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

dataloader = tud.DataLoader(
    mydataset, batch_size=16, shuffle=True,
    collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))

cpu
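
The cell above moves each batch to `device` inside `collate_fn`. A common alternative is to keep the loader on the CPU and move each batch inside the loop; the PyTorch documentation advises against returning CUDA tensors from worker processes when `num_workers > 0`, so this variant may be the safer one in that setting. The sketch below is an assumption-laden illustration on synthetic data, not the notebook's own method.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Synthetic data (an assumption) so the block is self-contained
toy_dataset = TensorDataset(torch.randn(40, 4), torch.randint(0, 3, (40,)))

# Pinned host memory only helps when a GPU is present
loader = DataLoader(toy_dataset, batch_size=16, shuffle=True,
                    pin_memory=torch.cuda.is_available())

batch_devices = []
for xb, yb in loader:
    # non_blocking=True allows asynchronous copies from pinned memory;
    # on a CPU-only machine these calls leave the tensors on the CPU
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    batch_devices.append((xb.device.type, yb.device.type))

print(batch_devices)
```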


# 3. Basic utilities

## 3.1 Extracting subsets

### (1) The `random_split()` function

In [None]:
N = X.size()[0]
lens = [int(N*0.7), int(N*0.2), int(N*0.1)]
print(N, lens)

d1, d2, d3 = tud.random_split(mydataset, lens, generator=None)
print(d1.__len__(), d2.__len__(), d3.__len__())

150 [105, 30, 15]
105 30 15
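
With `generator=None` the split differs from run to run. Passing a seeded `torch.Generator` should make it reproducible. The sketch below also computes the last length by subtraction, so that the lengths always sum to the dataset size even when `int()` truncates. The toy dataset and variable names are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 150 samples (stand-in for the iris dataset)
toy_dataset = TensorDataset(torch.arange(150).reshape(150, 1))

n_total = len(toy_dataset)
n_train = int(n_total * 0.7)
n_val = int(n_total * 0.2)
n_test = n_total - n_train - n_val  # remainder, so the lengths sum to n_total

# Two splits drawn with the same seed
split_a = random_split(toy_dataset, [n_train, n_val, n_test],
                       generator=torch.Generator().manual_seed(0))
split_b = random_split(toy_dataset, [n_train, n_val, n_test],
                       generator=torch.Generator().manual_seed(0))

print([len(s) for s in split_a])
# Each part is a Subset; its .indices attribute points into the parent dataset
same_indices = list(split_a[0].indices) == list(split_b[0].indices)
print(same_indices)
```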


### (2) The `ConcatDataset` class

In [None]:
# Option 1: the + operator on Dataset objects returns a ConcatDataset
dc1 = d1 + d2 + d3

# Option 2: equivalent to option 1
dc2 = tud.ConcatDataset([d1, d2, d3])

print(dc1.__len__(), dc2.__len__())

150 150
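
Indexing a `ConcatDataset` walks through the parts in order: indices beyond the first part fall through to the next one. A short sketch on two toy datasets (the names `part_a` and `part_b` are assumptions) shows this, together with the `cumulative_sizes` attribute that records the boundaries:

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

part_a = TensorDataset(torch.tensor([0, 1, 2]))
part_b = TensorDataset(torch.tensor([10, 11]))

joined = ConcatDataset([part_a, part_b])

print(len(joined))              # 5
print(joined.cumulative_sizes)  # [3, 5]
print(joined[3])                # first sample of part_b: (tensor(10),)

# The + operator builds a ConcatDataset as well
print(type(part_a + part_b).__name__)
```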


### (3) The `Subset` class

In [None]:
d4 = tud.Subset(mydataset, [1,2,3,4])
print(d4.__len__())

4
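
A `Subset` does not copy data; it stores the index list and looks each sample up in the parent dataset. The sketch below, on a toy parent dataset (an assumption for illustration), checks that position `i` of the subset maps to `indices[i]` of the parent:

```python
import torch
from torch.utils.data import Subset, TensorDataset

parent = TensorDataset(torch.arange(10) * 10)  # samples 0, 10, ..., 90
picked = [1, 2, 3, 4]
subset = Subset(parent, picked)

print(len(subset))
print(subset[0])        # same sample as parent[1]: (tensor(10),)
print(subset.indices)   # the stored index list

# Every subset item matches the parent item at the stored index
matches = all(
    torch.equal(subset[i][0], parent[picked[i]][0])
    for i in range(len(subset))
)
print(matches)
```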
