The following additional libraries are needed to run this
notebook. Note that running on Colab is experimental; please report a GitHub
issue if you have any problems.

In [1]:
!pip install d2l==1.0.3



# Synthetic Regression Data
:label:`sec_synthetic-regression-data`


Machine learning is all about extracting information from data.
So you might wonder, what could we possibly learn from synthetic data?
While we might not care intrinsically about the patterns
that we ourselves baked into an artificial data generating model,
such datasets are nevertheless useful for didactic purposes,
helping us to evaluate the properties of our learning
algorithms and to confirm that our implementations work as expected.
For example, if we create data for which the correct parameters are known *a priori*,
then we can check that our model can in fact recover them.


In [1]:
%matplotlib inline
import random
import torch
from d2l import torch as d2l

## Generating the Dataset

For this example, we will work in low dimension
for succinctness.
The following code snippet generates 1000 examples
with 2-dimensional features drawn
from a standard normal distribution.
The resulting design matrix $\mathbf{X}$
belongs to $\mathbb{R}^{1000 \times 2}$.
We generate each label by applying
a *ground truth* linear function,
corrupting the result via additive noise $\boldsymbol{\epsilon}$,
drawn independently and identically for each example:

(**$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}.$$**)

For convenience we assume that $\boldsymbol{\epsilon}$ is drawn
from a normal distribution with mean $\mu= 0$
and standard deviation $\sigma = 0.01$.
In keeping with our object-oriented design,
we add the code to the `__init__` method of a subclass of `d2l.DataModule` (introduced in :numref:`oo-design-data`).
It is good practice to allow the setting of any additional hyperparameters.
We accomplish this with `save_hyperparameters()`.
The `batch_size` will be determined later.


In [2]:
class SyntheticRegressionData(d2l.DataModule):
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise

Below, we set the true parameters to $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.
Later, we can check our estimated parameters against these *ground truth* values.


In [3]:
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)

[**Each row in `data.X` (the features) consists of a vector in $\mathbb{R}^2$ and each row in `data.y` (the labels) is a scalar.**] Let's have a look at the first entry.


In [4]:
print('features:', data.X[0],'\nlabel:', data.y[0])

features: tensor([-0.6454, -1.3908]) 
label: tensor([7.6257])
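
Because the ground truth is known, we can already sanity-check that the parameters are recoverable from such data. The following is a minimal sketch, not the training procedure used later in the book: it regenerates data with plain PyTorch (independently of `SyntheticRegressionData`) and solves an ordinary least-squares problem with `torch.linalg.lstsq`. The fixed seed and the variable names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)  # assumed seed, only to make the sketch repeatable

# Same ground truth as in the text: w = [2, -3.4], b = 4.2, noise std 0.01
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
n = 1000

X = torch.randn(n, 2)
y = X @ true_w.reshape(-1, 1) + true_b + 0.01 * torch.randn(n, 1)

# Append a column of ones so that the bias is estimated as a third coefficient
A = torch.cat([X, torch.ones(n, 1)], dim=1)
solution = torch.linalg.lstsq(A, y).solution.flatten()

w_hat, b_hat = solution[:2], solution[2]
print('estimated w:', w_hat, '\nestimated b:', b_hat)
```

With noise this small and 1000 examples, the estimates should land very close to the true values.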


## Reading the Dataset

Training machine learning models often requires multiple passes over a dataset,
grabbing one minibatch of examples at a time.
This data is then used to update the model.
To illustrate how this works, we
[**implement the `get_dataloader` method,**]
registering it in the `SyntheticRegressionData` class via `add_to_class` (introduced in :numref:`oo-design-utilities`).
It (**takes a batch size, a matrix of features,
and a vector of labels, and generates minibatches of size `batch_size`.**)
As such, each minibatch consists of a tuple of features and labels.
Note that we need to be mindful of whether we're in training or validation mode:
in the former, we will want to read the data in random order,
whereas for the latter, being able to read data in a pre-defined order
may be important for debugging purposes.


In [5]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]

To build some intuition, let's inspect the first minibatch of
data. Each minibatch of features provides us with both its size and the dimensionality of input features.
Likewise, our minibatch of labels will have a matching shape given by `batch_size`.


In [6]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


While seemingly innocuous, the invocation
of `iter(data.train_dataloader())`
illustrates the power of Python's object-oriented design.
Note that we added a method to the `SyntheticRegressionData` class
*after* creating the `data` object.
Nonetheless, the object benefits from
the *ex post facto* addition of functionality to the class.
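
To make this mechanism concrete, here is a sketch of how such a decorator could be written with `setattr`. It is a stand-in for illustration and not necessarily identical to `d2l.add_to_class`; the `Greeter` class and `greet` method are hypothetical.

```python
def add_to_class(Class):
    """Register the decorated function as a method of an existing class."""
    def wrapper(func):
        setattr(Class, func.__name__, func)
        return func
    return wrapper

class Greeter:  # hypothetical toy class
    def __init__(self, name):
        self.name = name

g = Greeter('d2l')  # the instance is created first

@add_to_class(Greeter)  # the method is attached to the class afterwards
def greet(self):
    return f'hello, {self.name}'

print(g.greet())  # hello, d2l
```

The call works because Python looks up methods on the class at call time, so existing instances see attributes added to the class later.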

Throughout the iteration we obtain distinct minibatches
until the entire dataset has been exhausted (try this).
While the iteration implemented above is good for didactic purposes,
it is inefficient in ways that might get us into trouble with real problems.
For example, it requires that we load all the data in memory
and that we perform lots of random memory access.
The built-in iterators implemented in a deep learning framework
are considerably more efficient and they can deal
with sources such as data stored in files,
data received via a stream,
and data generated or processed on the fly.
Next let's try to implement the same method using built-in iterators.
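
Before moving on, the claim that iteration yields distinct minibatches until the dataset is exhausted can be checked with a short sketch. It mirrors the hand-written loop above as a standalone function (so it does not depend on `d2l`) and additionally yields the indices of each batch, which is an assumption made purely for bookkeeping.

```python
import random
import torch

def minibatches(X, y, batch_size, shuffle=True):
    """Yield (features, labels, indices) for each minibatch."""
    indices = list(range(len(X)))
    if shuffle:
        random.shuffle(indices)
    for i in range(0, len(indices), batch_size):
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield X[batch_indices], y[batch_indices], batch_indices

X = torch.randn(1000, 2)
y = torch.randn(1000, 1)

seen = []
num_batches = 0
for X_batch, y_batch, idx in minibatches(X, y, batch_size=32):
    seen.extend(idx.tolist())
    num_batches += 1

print('batches:', num_batches, 'examples seen:', len(seen),
      'all distinct:', len(set(seen)) == len(seen))
print('last batch size:', len(idx))
```

Since 1000 is not a multiple of 32, the final minibatch is smaller than the rest: it holds the remaining 8 examples.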

## Concise Implementation of the Data Loader

Rather than writing our own iterator,
we can [**call the existing API in a framework to load data.**]
As before, we need a dataset with features `X` and labels `y`.
Beyond that, we set `batch_size` in the built-in data loader
and let it take care of shuffling examples efficiently.


In [7]:
@d2l.add_to_class(d2l.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train)

In [8]:
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)

The new data loader behaves just like the previous one, except that it is more efficient and has some added functionality.


In [9]:
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: torch.Size([32, 2]) 
y shape: torch.Size([32, 1])


For instance, the data loader provided by the framework API
supports the built-in `__len__` method,
so we can query its length,
i.e., the number of batches.


In [10]:
len(data.train_dataloader())

32
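
The value 32 is simply $\lceil 1000 / 32 \rceil$: 31 full minibatches plus one final minibatch with the remaining 8 examples. A small standalone sketch (using random stand-in tensors rather than the `data` object) shows how the count changes with the `drop_last` argument of `torch.utils.data.DataLoader`:

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 2), torch.randn(1000, 1))

keep_last = DataLoader(dataset, batch_size=32)  # default: drop_last=False
drop_last = DataLoader(dataset, batch_size=32, drop_last=True)

print(len(keep_last), math.ceil(1000 / 32))  # 32 32
print(len(drop_last), 1000 // 32)            # 31 31
```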

## Summary

Data loaders are a convenient way of abstracting out
the process of loading and manipulating data.
This way the same machine learning *algorithm*
is capable of processing many different types and sources of data
without the need for modification.
One of the nice things about data loaders
is that they can be composed.
For instance, we might be loading images
and then have a postprocessing filter
that crops them or modifies them in other ways.
As such, data loaders can be used
to describe an entire data processing pipeline.
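
As a rough illustration of such composition, the sketch below wraps a data loader in generator stages that each postprocess the features of a minibatch. The helper `map_batches` and the two transforms are hypothetical names introduced here, not part of PyTorch or `d2l`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def map_batches(loader, transform):
    """Lazily apply `transform` to the features of every minibatch."""
    for X, y in loader:
        yield transform(X), y

def standardize(X):
    # Hypothetical postprocessing step: zero mean, unit variance per feature
    return (X - X.mean(dim=0)) / X.std(dim=0)

def clip(X):
    # Second hypothetical step: clamp outliers to [-2, 2]
    return X.clamp(-2.0, 2.0)

dataset = TensorDataset(5 + 3 * torch.randn(256, 2), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Stages stack on top of one another to form a pipeline
pipeline = map_batches(map_batches(loader, standardize), clip)

for X, y in pipeline:
    print(X.shape, float(X.min()), float(X.max()))
```

Each stage only needs to know that it receives an iterable of `(X, y)` pairs, which is what makes the stages interchangeable.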

As for the model itself, the two-dimensional linear model
is about the simplest we might encounter.
It lets us test out the accuracy of regression models
without worrying about having insufficient amounts of data
or an underdetermined system of equations.
We will put this to good use in the next section.  


## Exercises

1. What will happen if the number of examples cannot be divided by the batch size? How would you change this behavior by specifying a different argument in the framework's API?
1. Suppose that we want to generate a huge dataset, where both the size of the parameter vector `w` and the number of examples `num_examples` are large.
    1. What happens if we cannot hold all data in memory?
    1. How would you shuffle the data if it is held on disk? Your task is to design an *efficient* algorithm that does not require too many random reads or writes. Hint: [pseudorandom permutation generators](https://en.wikipedia.org/wiki/Pseudorandom_permutation) allow you to design a reshuffle without the need to store the permutation table explicitly :cite:`Naor.Reingold.1999`.
1. Implement a data generator that produces new data on the fly, every time the iterator is called.
1. How would you design a random data generator that generates *the same* data each time it is called?


[Discussions](https://discuss.d2l.ai/t/6663)


In [11]:
# Exercise 1
import torch
from torch.utils.data import TensorDataset, DataLoader

# Build a small example dataset: 11 examples do not divide evenly by 3
x = torch.randn(11, 5)  # example features
y = torch.randint(0, 2, (11,))  # example labels
dataset = TensorDataset(x, y)

# Default behavior (drop_last=False): the final, smaller batch is kept
default_loader = DataLoader(dataset, batch_size=3)
for batch_x, batch_y in default_loader:
    print(batch_x.shape, batch_y.shape)

# Changed behavior (drop_last=True): the incomplete final batch is dropped
changed_loader = DataLoader(dataset, batch_size=3, drop_last=True)
for batch_x, batch_y in changed_loader:
    print(batch_x.shape, batch_y.shape)

torch.Size([3, 5]) torch.Size([3])
torch.Size([3, 5]) torch.Size([3])
torch.Size([3, 5]) torch.Size([3])
torch.Size([2, 5]) torch.Size([2])
torch.Size([3, 5]) torch.Size([3])
torch.Size([3, 5]) torch.Size([3])
torch.Size([3, 5]) torch.Size([3])


In [16]:
# Exercise 2
def generate_and_load_data_in_chunks(chunk_size=1000):
    num_samples = 5000  # assumed total number of examples
    for start in range(0, num_samples, chunk_size):
        end = min(start + chunk_size, num_samples)
        # Dummy data for illustration; replace with real data generation logic
        data_chunk = [(i, f"example_{i}") for i in range(start, end)]
        yield data_chunk

# Usage example
for data_chunk in generate_and_load_data_in_chunks():
    print(len(data_chunk))
    print(data_chunk)

1000
[(0, 'example_0'), (1, 'example_1'), (2, 'example_2'), (3, 'example_3'), (4, 'example_4'), ... (output truncated)

In [17]:
import hashlib

def shuffle_virtual_data(num_examples):
    # Pseudorandom index function: hash the position, then reduce it
    # modulo num_examples.
    # Note: this mapping is not a bijection, so some indices can repeat
    # while others never appear; it is not a true permutation.
    def pseudo_random_index(i):
        hash_object = hashlib.sha256(str(i).encode())
        hash_int = int(hash_object.hexdigest(), 16)
        return hash_int % num_examples

    virtual_data = [(i, f"virtual_example_{i}") for i in range(num_examples)]
    shuffled_data = []
    for i in range(num_examples):
        j = pseudo_random_index(i)
        shuffled_data.append(virtual_data[j])
    return shuffled_data

# Usage example
num_examples = 100
shuffled_result = shuffle_virtual_data(num_examples)
for element in shuffled_result:
    print(element)

(5, 'virtual_example_5')
(15, 'virtual_example_15')
(61, 'virtual_example_61')
(78, 'virtual_example_78')
(22, 'virtual_example_22')
(53, 'virtual_example_53')
(3, 'virtual_example_3')
(49, 'virtual_example_49')
(59, 'virtual_example_59')
(87, 'virtual_example_87')
(17, 'virtual_example_17')
(48, 'virtual_example_48')
(36, 'virtual_example_36')
(32, 'virtual_example_32')
(65, 'virtual_example_65')
(63, 'virtual_example_63')
(73, 'virtual_example_73')
(31, 'virtual_example_31')
(70, 'virtual_example_70')
(83, 'virtual_example_83')
(23, 'virtual_example_23')
(87, 'virtual_example_87')
(13, 'virtual_example_13')
(48, 'virtual_example_48')
(55, 'virtual_example_55')
(77, 'virtual_example_77')
(2, 'virtual_example_2')
(31, 'virtual_example_31')
(46, 'virtual_example_46')
(4, 'virtual_example_4')
(24, 'virtual_example_24')
(75, 'virtual_example_75')
(27, 'virtual_example_27')
(68, 'virtual_example_68')
(27, 'virtual_example_27')
(47, 'virtual_example_47')
(13, 'virtual_example_13')
(53, 'vir
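
Note that the output above contains repeated entries (for example, index 87 appears twice), because hashing followed by a modulo is not one-to-one. A true shuffle needs a bijection on $\{0, \ldots, n-1\}$. The sketch below shows one possible way to obtain such a bijection without storing a permutation table: a small Feistel network over a power-of-two domain, combined with cycle walking to stay inside $[0, n)$. The function name, the `key` string, and the number of rounds are illustrative assumptions; this is not claimed to be the construction from the cited paper, nor to be cryptographically vetted.

```python
import hashlib

def make_permutation(n, key, rounds=4):
    """Return a function mapping {0, ..., n-1} onto itself, table-free."""
    # Split indices into two halves of half_bits bits each, enough to cover n
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1

    def round_fn(value, r):
        # Keyed pseudorandom round function built from SHA-256
        digest = hashlib.sha256(f'{key}:{r}:{value}'.encode()).digest()
        return int.from_bytes(digest[:8], 'big') & mask

    def feistel(x):
        # Each round (L, R) -> (R, L xor F(R)) is invertible, so the whole
        # network is a bijection on the 2 * half_bits bit domain
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            left, right = right, left ^ round_fn(right, r)
        return (left << half_bits) | right

    def permute(i):
        # Cycle walking: reapply the network until we land back in [0, n)
        x = feistel(i)
        while x >= n:
            x = feistel(x)
        return x

    return permute

# Usage example
n = 100
permute = make_permutation(n, key='epoch-0')
order = [permute(i) for i in range(n)]
print(order[:10])
print('is a permutation:', sorted(order) == list(range(n)))
```

Since `permute(i)` is computed on demand, the read order for data on disk can be derived position by position, and changing `key` (say, once per epoch) gives a different order.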

In [14]:
# Exercise 3
import numpy as np

def data_generator():
    while True:
        # Simulate a 3-channel image of size 32x32
        image_data = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
        yield image_data

# Usage example
gen = data_generator()
for _ in range(3):
    new_image = next(gen)
    print(new_image.shape)

(32, 32, 3)
(32, 32, 3)
(32, 32, 3)
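
The same idea can be tied back to the regression setting of this section. The following sketch, under the assumption that we want fresh $(\mathbf{x}, y)$ pairs from the linear model each time an iterator is requested, uses PyTorch's `IterableDataset`. The class name `StreamingRegressionData` is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class StreamingRegressionData(IterableDataset):  # hypothetical name
    """Generate (x, y) pairs on the fly instead of storing them."""
    def __init__(self, w, b, noise=0.01, num_examples=1000):
        super().__init__()
        self.w, self.b, self.noise = w, b, noise
        self.num_examples = num_examples

    def __iter__(self):
        # Fresh samples are drawn every time a new iterator is requested
        for _ in range(self.num_examples):
            x = torch.randn(len(self.w))
            y = x @ self.w + self.b + self.noise * torch.randn(1)
            yield x, y

stream = StreamingRegressionData(w=torch.tensor([2.0, -3.4]), b=4.2)
loader = DataLoader(stream, batch_size=32)

X, y = next(iter(loader))
print('X shape:', X.shape, '\ny shape:', y.shape)
```

No shuffling is requested here: the samples are independent draws, and `DataLoader` does not support `shuffle=True` for iterable-style datasets.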


In [15]:
# Exercise 4
import numpy as np

def fixed_random_data_generator():
    np.random.seed(42)  # fix the random seed
    fixed_data = np.random.rand(5)  # generate the fixed random data once
    while True:
        yield fixed_data

# Usage example
gen = fixed_random_data_generator()
for _ in range(3):
    same_data = next(gen)
    print(same_data)

[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
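
An alternative sketch, assuming that "the same data" means the same *sequence* of batches on every call, seeds a local `torch.Generator` instead of the global random state. The function name and its default arguments are illustrative.

```python
import torch

def reproducible_batches(seed, num_batches=3, batch_size=4, dim=2):
    """Yield the same sequence of random batches for a given seed."""
    # A local generator avoids touching PyTorch's global random state
    gen = torch.Generator().manual_seed(seed)
    for _ in range(num_batches):
        yield torch.randn(batch_size, dim, generator=gen)

first = list(reproducible_batches(seed=42))
second = list(reproducible_batches(seed=42))
print(all(torch.equal(a, b) for a, b in zip(first, second)))  # True
```

Unlike a generator that yields one cached array forever, the batches within a single pass still differ from one another, while two passes with the same seed match exactly.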
