# Data

## data.TensorDataset

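`data.TensorDataset` combines several tensors into one dataset. **It accepts any number of tensors as arguments**, but they must all have **the same size in the first dimension** (the same number of samples). The resulting dataset pairs the tensors up by index: item `i` is a tuple holding the `i`-th entry of each tensor.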
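A minimal sketch of the two properties described above, using small made-up tensors (`xs`, `ys`, and `toy` are hypothetical names, not part of the original notebook). Indexing returns a tuple with one element per input tensor, and tensors whose first dimensions differ are rejected. The `AssertionError` reflects how current PyTorch versions perform this check, which is an assumption about the installed version.

```python
import torch
from torch.utils import data

xs = torch.arange(6, dtype=torch.float32).reshape(3, 2)  # 3 samples, 2 features
ys = torch.tensor([[10.0], [20.0], [30.0]])              # 3 labels

toy = data.TensorDataset(xs, ys)
n_samples = len(toy)        # size of the shared first dimension
first_x, first_y = toy[0]   # tuple of (xs[0], ys[0])

# Tensors with different first dimensions are rejected at construction time
# (assumed to surface as an AssertionError, as in current PyTorch releases).
try:
    data.TensorDataset(xs, ys[:2])
    mismatch_rejected = False
except AssertionError:
    mismatch_rejected = True
```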

Import the required libraries.

In [33]:
import numpy as np
import torch
from torch.utils import data

Generate a synthetic dataset.

In [34]:
def synthetic_data(w, b, m, n):
    """
    Generate Y = Xw + b + noise.

    :param w: weight vector
    :param b: bias scalar
    :param m: number of samples
    :param n: number of features
    :return: (X, Y), the m*n feature matrix and the m*1 label vector
    """
    # Draw an m*n feature matrix from a normal distribution with mean 0 and standard deviation 1
    X = torch.normal(0, 1, (m, n))
    Y = X @ w.reshape(n, 1) + b

    # Add Gaussian noise with standard deviation 0.1
    Y += torch.normal(0, 0.1, (m, 1))

    return X, Y.reshape(-1, 1)

w = torch.tensor([2, -3.4])
b = 4.2

features, labels = synthetic_data(w, b, 200, len(w))
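
As a sanity check on the generator, a least-squares fit on data built the same way should recover values close to `w` and `b`. The sketch below rebuilds a dataset inline so it can run on its own. The seed, the sample count of 1000, the names `w_true`, `b_true`, `A`, and `estimate`, and the use of `torch.linalg.lstsq` are all choices made for this illustration, not part of the original notebook.

```python
import torch

torch.manual_seed(0)
w_true = torch.tensor([2.0, -3.4])
b_true = 4.2
m = 1000

# Same recipe as synthetic_data: standard normal features, noise with std 0.1.
X = torch.normal(0, 1, (m, 2))
Y = X @ w_true.reshape(-1, 1) + b_true + torch.normal(0, 0.1, (m, 1))

# Append a column of ones so the bias is estimated together with the weights.
A = torch.cat([X, torch.ones(m, 1)], dim=1)
estimate = torch.linalg.lstsq(A, Y).solution.flatten()  # [w1, w2, b]
print(estimate)
```

The three entries of `estimate` should land near 2, -3.4, and 4.2, since the noise is small relative to the signal.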

Build a dataset from the features `features` and the labels `labels`.

In [35]:
dataset = data.TensorDataset(features, labels)

for pair in dataset:
    print(pair)

(tensor([1.4950, 0.0587]), tensor([7.2415]))
(tensor([ 0.2207, -1.0449]), tensor([8.2712]))
(tensor([-0.2932,  1.4716]), tensor([-1.3809]))
(tensor([-0.8460,  0.2562]), tensor([1.7191]))
(tensor([ 0.3670, -1.5124]), tensor([10.0870]))
(tensor([ 0.9160, -0.1174]), tensor([6.4876]))
(tensor([-0.1704,  0.0618]), tensor([3.4861]))
(tensor([0.3976, 1.0506]), tensor([1.4095]))
(tensor([0.3924, 0.7528]), tensor([2.4889]))
(tensor([ 0.0940, -1.3543]), tensor([8.9968]))
(tensor([ 0.4798, -0.4883]), tensor([6.7201]))
(tensor([0.9718, 1.1525]), tensor([2.2297]))
(tensor([-0.5423,  0.3668]), tensor([1.9711]))
(tensor([-1.3049,  1.7295]), tensor([-4.2141]))
(tensor([-2.1158, -1.4262]), tensor([4.7655]))
(tensor([-0.4838, -0.9356]), tensor([6.3588]))
(tensor([-0.9606,  0.9473]), tensor([-1.2523]))
(tensor([ 0.4634, -0.7921]), tensor([7.7563]))
(tensor([ 0.2095, -0.5693]), tensor([6.5598]))
(tensor([-0.1995,  1.1571]), tensor([-0.3136]))
(tensor([-0.9703, -0.3003]), tensor([3.3520]))
(tensor([-0.7258

## data.DataLoader

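`data.DataLoader` loads data in batches and provides batching, shuffling, and multi-process loading, so mini-batches can be drawn from a dataset efficiently. It takes a Dataset object; the remaining parameters control the batch size, shuffling, the number of worker processes, and so on. A `DataLoader` is an iterable: loop over it with `for`, or call `next(iter(dataloader))` to get a single batch.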

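Create a dataloader.

- dataset: the Dataset object
- batch_size: number of samples per batch
- shuffle: whether to shuffle the data
- drop_last: whether to drop the final batch when it has fewer than `batch_size` samples
- num_workers: number of worker processes used for loading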
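
The effect of `batch_size` and `drop_last` can be checked with a quick count. This is a sketch on a stand-in dataset of 200 random samples (`toy`, `keep_tail`, and `drop_tail` are hypothetical names), with `num_workers` left at its default of 0 so everything runs in the main process.

```python
import torch
from torch.utils import data

# Stand-in dataset with the same sizes as the notebook's: 200 samples, 2 features.
toy = data.TensorDataset(torch.randn(200, 2), torch.randn(200, 1))

keep_tail = data.DataLoader(toy, batch_size=15, shuffle=True, drop_last=False)
drop_tail = data.DataLoader(toy, batch_size=15, shuffle=True, drop_last=True)

n_keep = len(keep_tail)  # ceil(200 / 15) = 14 batches
n_drop = len(drop_tail)  # floor(200 / 15) = 13 batches

# Without drop_last, the final batch holds the 200 - 13 * 15 = 5 leftover samples.
last_sizes = [xb.shape[0] for xb, yb in keep_tail]

# A single batch can also be pulled with next(iter(...)).
xb, yb = next(iter(drop_tail))  # xb has shape (15, 2), yb has shape (15, 1)
```

When `num_workers` is greater than 0 in a standalone script, on platforms that start worker processes by spawning (Windows, and macOS by default), the loading loop generally needs to sit under an `if __name__ == "__main__":` guard.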

In [36]:
dataloader = data.DataLoader(dataset, batch_size=15, shuffle=True, drop_last=True, num_workers=3)

for batch in dataloader:
    print(batch)

[tensor([[ 1.3558,  0.8292],
        [-0.2102, -0.0858],
        [-0.1874, -0.9614],
        [-0.3694, -0.2711],
        [-1.0749, -0.5412],
        [-0.0551,  1.8919],
        [-0.8716,  1.0865],
        [-0.1498, -0.4730],
        [ 1.2825, -0.9060],
        [ 0.7776, -1.1440],
        [ 0.7336, -0.4600],
        [ 0.0403,  0.2541],
        [ 0.5916,  0.5946],
        [ 0.5926, -0.3433],
        [-0.2401,  1.1554]]), tensor([[ 3.8128],
        [ 3.9762],
        [ 7.0837],
        [ 4.4053],
        [ 3.7498],
        [-2.2782],
        [-1.3417],
        [ 5.4869],
        [ 9.6842],
        [ 9.6277],
        [ 7.3649],
        [ 3.6250],
        [ 3.4779],
        [ 6.3189],
        [-0.4262]])]
[tensor([[ 0.1852,  0.7636],
        [ 0.9565,  0.0851],
        [ 0.6234, -2.0083],
        [ 0.8078,  0.4124],
        [-1.1240,  0.7222],
        [-2.1158, -1.4262],
        [ 1.2341,  0.3414],
        [ 0.4141,  0.7711],
        [ 0.6196,  0.6529],
        [ 1.3843, -0.5777],
        [