# CS224N: PyTorch Tutorial (Winter '21)

### Author: Dilara Soylu

In this notebook, we give a basic introduction to `PyTorch` and work on a toy NLP task. The following resources were used in preparing this notebook:
* ["Word Window Classification" tutorial notebook](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/materials/ww_classifier.ipynb) by Matt Lamm, from the Winter 2020 offering of CS224N
* Official PyTorch Documentation on [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) by Soumith Chintala
* PyTorch Tutorial Notebook, [Build Basic Generative Adversarial Networks (GANs) | Coursera](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans) by Sharon Zhou, offered on Coursera

Many thanks to Angelica Sun and John Hewitt for their feedback.

## Introduction
[PyTorch](https://pytorch.org/) is a machine learning framework that is used in both academia and industry for various applications. PyTorch started off as a more flexible alternative to [TensorFlow](https://www.tensorflow.org/), which is another popular machine learning framework. At the time of its release, `PyTorch` appealed to users because of its user-friendly nature: as opposed to defining static graphs before performing an operation as in `TensorFlow`, `PyTorch` allowed users to define their operations as they go, which is also the approach `TensorFlow` adopted in its later releases. Although `TensorFlow` is more widely preferred in industry, `PyTorch` is often the preferred machine learning framework for researchers. If you would like to learn more about the differences between the two, you can check out [this](https://blog.udacity.com/2020/05/pytorch-vs-tensorflow-what-you-need-to-know.html) blog post.

Now that we have learned enough about the background of `PyTorch`, let's start by importing it into our notebook. To install `PyTorch`, you can follow the installation instructions on the official `PyTorch` website. Alternatively, you can open this notebook using `Google Colab`, which already has `PyTorch` installed in its base kernel. Once you are done with the installation process, run the following cell:

In [65]:
import torch
import torch.nn as nn

# Import pprint, a module we use to make our print statements prettier
import pprint
pp = pprint.PrettyPrinter()

We are all set to start our tutorial. Let's dive in!

## Tensors

Tensors are the most basic building blocks in `PyTorch`. Tensors are similar to matrices, but they have extra properties and can represent higher dimensions. For example, a square image with 256 pixels on each side can be represented by a `3x256x256` tensor, where the first dimension, of size 3, represents the color channels: red, green and blue.

### Tensor Initialization
There are several ways to instantiate tensors in `PyTorch`, which we will go through next.

#### **From a Python List**

We can initialize a tensor from a `Python` list, which could include sublists. The dimensions and the data types will be automatically inferred by `PyTorch` when we use [`torch.tensor()`](https://pytorch.org/docs/stable/generated/torch.tensor.html).

In [66]:
# Initialize a tensor from a Python List
data = [
        [0, 1],
        [2, 3],
        [4, 5]
       ]
x_python = torch.tensor(data)

# Print the tensor
x_python

tensor([[0, 1],
        [2, 3],
        [4, 5]])

We can also call `torch.tensor()` with the optional `dtype` parameter, which will set the data type. Some useful datatypes to be familiar with are: `torch.bool`, `torch.float`, and `torch.long`.

In [67]:
# We are using the dtype to create a tensor of particular type
x_float = torch.tensor(data, dtype=torch.float)
x_float

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])

In [68]:
# We are using the dtype to create a tensor of particular type
x_bool = torch.tensor(data, dtype=torch.bool)
x_bool

tensor([[False,  True],
        [ True,  True],
        [ True,  True]])

We can also get the same tensor in our specified data type using methods such as `float()`, `long()` etc.

In [69]:
x_python.float()

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])

We can also use the `torch.FloatTensor`, `torch.LongTensor` and `torch.Tensor` classes to instantiate a tensor of a particular type. `LongTensor`s are particularly important in NLP, as many methods that deal with indices require the indices to be passed as a `LongTensor`, which holds 64-bit integers.

In [70]:
# `torch.Tensor` defaults to float
# Same as torch.FloatTensor(data)
x = torch.Tensor(data)
x

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])
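To illustrate why integer (long) indices matter, here is a minimal sketch that looks up rows of an embedding table with `nn.Embedding`. The vocabulary size, embedding size, and index values are made up for illustration and are not part of the original notebook.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for this illustration
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

# Integer data passed to torch.tensor() is inferred as torch.int64 (torch.long)
word_indices = torch.tensor([1, 5, 7])
print(word_indices.dtype)

# Each index selects one row of the embedding table
vectors = embedding(word_indices)
print(vectors.shape)

# Indices stored as floats need to be converted with .long() before the lookup
float_indices = torch.tensor([1., 5., 7.])
vectors_again = embedding(float_indices.long())
```

Both lookups return the same rows, since the indices are identical after conversion.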

#### **From a NumPy Array**
We can also initialize a tensor from a `NumPy` array.

In [71]:
import numpy as np

# Initialize a tensor from a NumPy array
ndarray = np.array(data)
x_numpy = torch.from_numpy(ndarray)

# Print the tensor
x_numpy

tensor([[0, 1],
        [2, 3],
        [4, 5]], dtype=torch.int32)
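The tensor returned by `torch.from_numpy()` takes its data type from the `NumPy` array, which is likely why the output above shows `torch.int32` (the default integer type of `NumPy` on some platforms). It also shares memory with the array. The following sketch, with made-up values, contrasts this with `torch.tensor()`, which copies the data.

```python
import numpy as np
import torch

ndarray = np.array([[0, 1], [2, 3]])
shared = torch.from_numpy(ndarray)   # shares memory with ndarray
copied = torch.tensor(ndarray)       # makes an independent copy

# Modify the NumPy array in place
ndarray[0, 0] = 100

print(shared[0, 0].item())  # reflects the change made to ndarray
print(copied[0, 0].item())  # unaffected by the change
```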

#### **From a Tensor**
We can also initialize a tensor from another tensor, using the following methods:

* `torch.ones_like(old_tensor)`: Initializes a tensor of `1s`.
* `torch.zeros_like(old_tensor)`: Initializes a tensor of `0s`.
* `torch.rand_like(old_tensor)`: Initializes a tensor where all the elements are sampled from a uniform distribution between `0` and `1`.
* `torch.randn_like(old_tensor)`: Initializes a tensor where all the elements are sampled from a normal distribution.

All of these methods preserve the tensor properties of the original tensor passed in, such as the `shape` and `device`, which we will cover in a bit.

In [72]:
# Initialize a base tensor
x = torch.tensor([[1., 2], [3, 4]])
x

tensor([[1., 2.],
        [3., 4.]])

In [73]:
# Initialize a tensor of 0s
x_zeros = torch.zeros_like(x)
x_zeros

tensor([[0., 0.],
        [0., 0.]])

In [74]:
# Initialize a tensor of 1s
x_ones = torch.ones_like(x)
x_ones

tensor([[1., 1.],
        [1., 1.]])

In [75]:
# Initialize a tensor where each element is sampled from a uniform distribution
# between 0 and 1
x_rand = torch.rand_like(x)
x_rand

tensor([[0.1556, 0.8162],
        [0.6770, 0.8598]])

In [76]:
# Initialize a tensor where each element is sampled from a normal distribution
x_randn = torch.randn_like(x)
x_randn

tensor([[-1.8337,  1.0947],
        [ 1.7748,  2.2634]])

#### **By Specifying a Shape**
We can also instantiate tensors by specifying their shapes (which we will cover in more detail in a bit). The methods we can use mirror the ones in the previous section:

* `torch.zeros()`
* `torch.ones()`
* `torch.rand()`
* `torch.randn()`

In [77]:
# Initialize a 4x2x2 tensor of 0s
shape = (4, 2, 2)
x_zeros = torch.zeros(shape) # x_zeros = torch.zeros(4, 2, 2) is an alternative
x_zeros

tensor([[[0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.]]])

#### **With `torch.arange()`**
We can also create a tensor with `torch.arange(end)`, which returns a `1-D` tensor with elements ranging from `0` to `end-1`. We can use the optional `start` and `step` parameters to create tensors with different ranges.  

In [78]:
# Create a tensor with values 0-9
x = torch.arange(10)
x

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
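As a small sketch of the optional `start` and `step` parameters, with values chosen only for illustration:

```python
import torch

# Values from 2 (inclusive) to 10 (exclusive), in steps of 3
x = torch.arange(2, 10, 3)
print(x)

# Floating point steps work as well
y = torch.arange(0, 1, 0.25)
print(y)
```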

### Tensor Properties

Tensors have a few properties that are important for us to cover, namely the `dtype`, `shape`, and `device` properties.

#### Data Type

The `dtype` property lets us see the data type of a tensor.

In [79]:
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.ones(3, 2)
print(x.dtype)
print(x)

torch.float32
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])


#### Shape

The `shape` property tells us the shape of our tensor. This can help us identify how many dimensions our tensor has as well as how many elements exist in each dimension.

In [80]:
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.Tensor([[1, 2], [3, 4], [5, 6]])
x

tensor([[1., 2.],
        [3., 4.],
        [5., 6.]])

In [81]:
# Print out its shape
# Same as x.size()
x.shape

torch.Size([3, 2])

In [82]:
# Print out the number of elements in a particular dimension
# 0th dimension corresponds to the rows
x.shape[0]

3

We can also get the size of a particular dimension with the `size()` method.


In [83]:
# Get the size of the 0th dimension
x.size(0)

3

We can change the shape of a tensor with the `view()` method.

In [84]:
# Example use of view()
# x_view shares the same memory as x, so changing one changes the other
x_view = x.view(2, 3)
x_view

tensor([[1., 2., 3.],
        [4., 5., 6.]])

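We can use `-1` to ask `PyTorch` to infer the size of a dimension. When reshaping a tensor, that is, rearranging its elements into a new shape, we can pass `-1` for a dimension whose size we do not want to work out by hand, and `PyTorch` computes it from the total number of elements and the sizes of the other dimensions.

In the example below, `x.view(3, -1)` reshapes `x` so that the new tensor has 3 rows, and the number of elements in each row is inferred from the total number of elements in `x`.

As another example, suppose `x` had shape `(6, 4)`, that is, 24 elements. Then `x.view(3, -1)` would return a tensor of shape `(3, 8)`, since 3 * 8 = 24.

Reshaping this way does not change the order of the data, which makes it convenient for working with high-dimensional tensors.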

In [85]:
# We can ask PyTorch to infer the size of a dimension with -1
x_view = x.view(3, -1)
x_view

tensor([[1., 2.],
        [3., 4.],
        [5., 6.]])
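The following sketch, using a made-up `6x4` tensor, shows both the inference of a dimension with `-1` and the fact that a view shares memory with the tensor it was created from.

```python
import torch

# A 6x4 tensor holding the values 0..23
x = torch.arange(24.).view(6, 4)

# -1 is inferred as 24 / 3 = 8
x_view = x.view(3, -1)
print(x_view.shape)

# The view shares memory with x: writing through one is visible in the other
x_view[0, 0] = -1.
print(x[0, 0].item())
```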

We can also use the `torch.reshape()` method for a similar purpose. There is a subtle difference between `reshape()` and `view()`: `view()` requires the data to be stored contiguously in memory. You can refer to [this](https://stackoverflow.com/questions/49643225/whats-the-difference-between-reshape-and-view-in-pytorch) StackOverflow answer for more information. In simple terms, contiguous means that the way our data is laid out in memory is the same as the order in which we would read elements from it. A tensor can become non-contiguous because some methods, such as `transpose()` and `view()`, do not actually change how our data is stored in memory. They just change the meta information about our tensor, so that when we use it we see the elements in the order we expect.

`reshape()` calls `view()` internally if the data is stored contiguously; if not, it returns a copy. The difference here isn't too important for basic tensors, but if you perform operations that make the underlying storage of the data non-contiguous (such as taking a transpose), you will have issues using `view()`. If you would like to match the way your tensor is stored in memory to how it is used, you can use the `contiguous()` method.

In [86]:
# Change the shape of x to be 2x3
# x_reshaped could be a reference to or a copy of x
x_reshaped = torch.reshape(x, (2, 3))
x_reshaped

tensor([[1., 2., 3.],
        [4., 5., 6.]])
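To make the difference between `view()` and `reshape()` concrete, here is a minimal sketch on a small made-up tensor. After a transpose the tensor is no longer contiguous, so `view()` raises an error while `reshape()` and `contiguous().view()` still work.

```python
import torch

x = torch.arange(6).reshape(2, 3)
x_t = x.t()                  # transpose: same storage, different strides
print(x_t.is_contiguous())

# view() cannot flatten the non-contiguous tensor
try:
    x_t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)

# reshape() falls back to copying the data when it has to
flat_reshape = x_t.reshape(6)

# contiguous() rearranges the data in memory, after which view() works
flat_view = x_t.contiguous().view(6)
print(flat_reshape)
```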

We can use the `torch.unsqueeze(x, dim)` function to add a dimension of size `1` at the provided `dim`, where `x` is the tensor. We can also use its counterpart, `torch.squeeze(x)`, which removes the dimensions of size `1`.

In [87]:
# Initialize a 5x2 tensor, with 5 rows and 2 columns
x = torch.arange(10).reshape(5, 2)
x

tensor([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]])

In [88]:
# Add a new dimension of size 1 at the 1st dimension
# The argument passed to unsqueeze() is the position of the new dimension
x = x.unsqueeze(1)
x.shape

torch.Size([5, 1, 2])

In [89]:
# Squeeze the dimensions of x by getting rid of all the dimensions with 1 element
x = x.squeeze()
x.shape

torch.Size([5, 2])
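`squeeze()` also accepts an optional `dim` argument to remove only one specific dimension. The sketch below uses a made-up `1x5x1` tensor to show the resulting shapes.

```python
import torch

x = torch.zeros(1, 5, 1)

all_squeezed = x.squeeze()     # all size-1 dimensions removed
first_squeezed = x.squeeze(0)  # only dimension 0 removed
unchanged = x.squeeze(1)       # dimension 1 has size 5, so nothing changes
expanded = x.unsqueeze(-1)     # negative indices count from the end

print(all_squeezed.shape)
print(first_squeezed.shape)
print(unchanged.shape)
print(expanded.shape)
```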

If we want to get the total number of elements in a tensor, we can use the `numel()` method.

In [90]:
x

tensor([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]])

In [91]:
# Get the number of elements in tensor.
x.numel()

10

#### **Device**
The `device` property tells `PyTorch` where to store our tensor. Where a tensor is stored determines which device, `GPU` or `CPU`, handles the computations involving it. We can find the device of a tensor with the `device` property.

In [92]:
# Initialize an example tensor
x = torch.Tensor([[1, 2], [3, 4]])
x

tensor([[1., 2.],
        [3., 4.]])

In [93]:
# Get the device of the tensor
x.device

device(type='cpu')

We can move a tensor from one device to another with the method `to(device)`.

In [94]:
# Check if a GPU is available, if so, move the tensor to the GPU
if torch.cuda.is_available():
  x = x.to('cuda')

print(x.device)

cuda:0
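A common pattern, sketched here as one possible approach, is to pick the device once and then create or move tensors onto it, so that the same code runs with or without a `GPU`.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device
x = torch.ones(2, 2, device=device)

# Move an existing tensor to the same device before combining the two
y = torch.zeros(2, 2).to(device)
z = x + y
print(z.device)
```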


### Tensor Indexing
In `PyTorch` we can index tensors, similar to `NumPy`.

In [95]:
# Initialize an example tensor
x = torch.Tensor([
                  [[1, 2], [3, 4]],
                  [[5, 6], [7, 8]],
                  [[9, 10], [11, 12]]
                 ])
x

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])

In [96]:
x.shape

torch.Size([3, 2, 2])

In [97]:
# Access the 0th element along the first dimension, which is the first 2x2 matrix
x[0] # Equivalent to x[0, :]

tensor([[1., 2.],
        [3., 4.]])

We can also index into multiple dimensions with `:`.

In [98]:
# Get the top left element of each element in our tensor
x[:, 0, 0]

tensor([1., 5., 9.])

We can also access arbitrary elements in each dimension.

In [99]:
# Print x again to see our tensor
x

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])

In [100]:
# Let's access the 0th and 1st elements, each twice
i = torch.tensor([0, 0, 1, 1])
x[i]

tensor([[[1., 2.],
         [3., 4.]],

        [[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]],

        [[5., 6.],
         [7., 8.]]])

In [101]:
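# A plain Python list of indices also works; this one reverses the order
# of the elements along the 0th dimension
i = [2 - i for i in range(3)]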
print(i)
print(x.shape)
print(x[i])


[2, 1, 0]
torch.Size([3, 2, 2])
tensor([[[ 9., 10.],
         [11., 12.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 1.,  2.],
         [ 3.,  4.]]])


In [102]:
# Let's access the 0th elements of the 1st and 2nd elements
i = torch.tensor([1, 2])
j = torch.tensor([0])
print(x)
x[i, j]

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])


tensor([[ 5.,  6.],
        [ 9., 10.]])
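When we index with two tensors as in `x[i, j]`, the index tensors are broadcast against each other and then paired up element by element. The sketch below rebuilds the same `3x2x2` tensor and checks that `x[i, j]` picks out `x[1, 0]` and `x[2, 0]`.

```python
import torch

# Same values as the example tensor: 1..12 arranged as 3x2x2
x = torch.arange(1., 13.).view(3, 2, 2)
i = torch.tensor([1, 2])
j = torch.tensor([0])

# j is broadcast to [0, 0], so the pairs are (1, 0) and (2, 0)
result = x[i, j]
expected = torch.stack([x[1, 0], x[2, 0]])
print(torch.equal(result, expected))
```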

We can get a `Python` scalar value from a tensor with `item()`.

In [103]:
x[0, 0, 0]

tensor(1.)

In [104]:
x[0, 0, 0].item()

1.0

### Operations
PyTorch operations are very similar to those of `NumPy`. We can work with both scalars and other tensors.


In [105]:
# Create an example tensor
x = torch.ones((3,2,2))
x

tensor([[[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]]])

In [106]:
# Perform elementwise addition
# Use - for subtraction
x + 2

tensor([[[3., 3.],
         [3., 3.]],

        [[3., 3.],
         [3., 3.]],

        [[3., 3.],
         [3., 3.]]])

In [107]:
# Perform elementwise multiplication
# Use / for division
x * 2

tensor([[[2., 2.],
         [2., 2.]],

        [[2., 2.],
         [2., 2.]],

        [[2., 2.],
         [2., 2.]]])

We can apply the same operations between different tensors of compatible sizes.


In [108]:
# Create a 4x3 tensor of 6s
a = torch.ones((4,3)) * 6
a

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])

In [109]:
# Create a 1D tensor of 2s
b = torch.ones(3) * 2
b

tensor([2., 2., 2.])

In [110]:
# Divide a by b
a / b

tensor([[3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.]])
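Sizes are "compatible" here in the sense of broadcasting: shapes are compared starting from the last dimension, and each pair of sizes must either match or contain a `1`. The sketch below, with made-up tensors, shows a shape that fails and how adding a dimension of size `1` fixes it.

```python
import torch

a = torch.ones(4, 3) * 6
b = torch.ones(3) * 2      # shape (3,) lines up with the last dimension of a
c = torch.arange(1., 5.)   # shape (4,) does not line up with (4, 3)

print((a / b).shape)

try:
    a / c
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)

# A trailing dimension of size 1 lets c broadcast across the columns of a
result = a / c.unsqueeze(1)
print(result)
```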

We can use `tensor.matmul(other_tensor)` for matrix multiplication and `tensor.T` for transpose. Matrix multiplication can also be performed with `@`.

In [111]:
# Alternative to a.matmul(b)
# a @ b.T returns the same result since b is 1D tensor and the 2nd dimension
# is inferred
a @ b

tensor([36., 36., 36., 36.])

In [112]:
pp.pprint(a.shape)
pp.pprint(a.T.shape)

torch.Size([4, 3])
torch.Size([3, 4])


We can take the mean and standard deviation along a certain dimension with the methods `mean(dim)` and `std(dim)`. That is, if we want to get the mean `3x2` matrix in a `4x3x2` tensor, we would set `dim` to 0. We can call these methods with no parameter to get the mean and standard deviation for the whole tensor. To use `mean` and `std`, our tensor should be of a floating point type.

In [113]:
# Create an example tensor
m = torch.tensor(
    [
     [1., 1.],
     [2., 2.],
     [3., 3.],
     [4., 4.]
    ]
)

pp.pprint("Mean: {}".format(m.mean()))
pp.pprint("Mean in the 0th dimension: {}".format(m.mean(0)))
pp.pprint("Mean in the 1st dimension: {}".format(m.mean(1)))

'Mean: 2.5'
'Mean in the 0th dimension: tensor([2.5000, 2.5000])'
'Mean in the 1st dimension: tensor([1., 2., 3., 4.])'
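The same pattern applies to `std()`. The sketch below also uses the `keepdim` argument, which keeps the reduced dimension with size `1`, and converts an integer tensor with `float()` before taking its mean.

```python
import torch

m = torch.tensor([[1., 1.], [2., 2.], [3., 3.], [4., 4.]])

col_std = m.std(0)                  # standard deviation of each column
row_mean = m.mean(1, keepdim=True)  # keepdim keeps the reduced dimension
print(col_std)
print(row_mean.shape)

# Integer tensors are converted to a floating point type first
counts = torch.arange(4)
counts_mean = counts.float().mean()
print(counts_mean)
```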


We can concatenate tensors using `torch.cat`.

In [114]:
# Concatenate in dimension 0 and 1
print(a)
a_cat0 = torch.cat([a, a, a], dim=0)
a_cat1 = torch.cat([a, a, a], dim=1)

print("Initial shape: {}".format(a.shape))
print("Shape after concatenation in dimension 0: {}".format(a_cat0.shape))
print("Shape after concatenation in dimension 1: {}".format(a_cat1.shape))

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])
Initial shape: torch.Size([4, 3])
Shape after concatenation in dimension 0: torch.Size([12, 3])
Shape after concatenation in dimension 1: torch.Size([4, 9])


Most of the operations in `PyTorch` are not in place. However, `PyTorch` offers the in place versions of operations available by adding an underscore (`_`) at the end of the method name.

In [115]:
# Print our tensor
a

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])

In [116]:
# add() is not in place
a.add(a)
a

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])

In [117]:
# add_() is in place
print(a)
a.add_(a)
a

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])


tensor([[12., 12., 12.],
        [12., 12., 12.],
        [12., 12., 12.],
        [12., 12., 12.]])

## Autograd
`PyTorch` and other machine learning libraries are known for their automatic differentiation feature. That is, given that we have defined the set of operations that need to be performed, the framework itself can figure out how to compute the gradients. We can call the `backward()` method to ask `PyTorch` to calculate the gradients, which are then stored in the `grad` attribute.

In [118]:
# Create an example tensor
# The requires_grad parameter tells PyTorch to track operations on x so that
# gradients can be computed and stored
x = torch.tensor([2.], requires_grad=True)

# Print the gradient if it is calculated
# Currently None since backward() has not been called yet
pp.pprint(x.grad)

None


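By default, every call to `backward()` accumulates gradients into the `grad` attribute of a tensor. This means that if a tensor already has a non-zero gradient and `backward()` is called again, the new gradient is added to the existing one instead of replacing it.

This behavior is useful when training neural networks, since the gradients from the losses of several samples in a batch are typically summed before the model parameters are updated. If we do not want gradients to accumulate, we can set the `grad` attribute to zero before each call to `backward()`.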

In [119]:
# Calculating the gradient of y with respect to x
y = x * x * 3 # 3x^2
# x.grad.zero_()
y.backward()
print(x)
pp.pprint(x.grad) # d(y)/d(x) = d(3x^2)/d(x) = 6x = 12

tensor([2.], requires_grad=True)
tensor([12.])


Let's run backprop from a different tensor again to see what happens.

In [120]:
z = x * x * 3 # 3x^2
z.backward()
print(x)
pp.pprint(x.grad)

tensor([2.], requires_grad=True)
tensor([24.])


We can see that the `x.grad` is updated to be the sum of the gradients calculated so far. When we run backprop in a neural network, we sum up all the gradients for a particular neuron before making an update. This is exactly what is happening here! This is also the reason why we need to run `zero_grad()` in every training iteration (more on this later). Otherwise our gradients would keep building up from one training iteration to the other, which would cause our updates to be wrong.
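As a minimal sketch of this accumulation and of resetting it by hand, the following repeats the computation three times and zeroes the gradient with the in-place `zero_()` method before the last pass.

```python
import torch

x = torch.tensor([2.], requires_grad=True)

y = x * x * 3                 # 3x^2
y.backward()
first = x.grad.clone()        # 6x = 12

z = x * x * 3
z.backward()
accumulated = x.grad.clone()  # 12 + 12 = 24

x.grad.zero_()                # reset the accumulated gradient in place
w = x * x * 3
w.backward()
after_reset = x.grad.clone()  # 12 again

print(first, accumulated, after_reset)
```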


## Neural Network Module

So far we have looked into the tensors, their properties and basic operations on tensors. These are especially useful to get familiar with if we are building the layers of our network from scratch. We will utilize these in Assignment 3, but moving forward, we will use predefined blocks in the `torch.nn` module of `PyTorch`. We will then put together these blocks to create complex networks. Let's start by importing this module with an alias so that we don't have to type `torch` every time we use it.

In [121]:
import torch.nn as nn

### **Linear Layer**
We can use `nn.Linear(H_in, H_out)` to create a linear layer. This will take a tensor of `(N, *, H_in)` dimensions and output a tensor of `(N, *, H_out)`. The `*` denotes that there could be an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b`, where `A` and `b` are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

In [122]:
# Create the inputs
input = torch.ones(2,3,4)

# Make a linear layer transforming (N, *, H_in) dimensional inputs to
# (N, *, H_out) dimensional outputs
linear = nn.Linear(4, 2)
# To drop the bias term, use: linear = nn.Linear(4, 2, bias=False)
linear_output = linear(input)
print("the input is {}".format(input))
print(linear_output)
print("the input's shape is {}".format(input.shape))
print("the output's shape is {}".format(linear_output.shape))

the input is tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])
tensor([[[-0.2490, -0.0327],
         [-0.2490, -0.0327],
         [-0.2490, -0.0327]],

        [[-0.2490, -0.0327],
         [-0.2490, -0.0327],
         [-0.2490, -0.0327]]], grad_fn=<ViewBackward0>)
the input's shape is torch.Size([2, 3, 4])
the output's shape is torch.Size([2, 3, 2])


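Calling `list(linear.parameters())`, as in the next cell, returns a list of two elements. If we call this list `params`:

`params[0]` is the weight matrix `A`, with dimensions `(H_out, H_in)`, which is `(2, 4)` for the example above.

`params[1]` is the bias vector `b`, with dimensions `(H_out,)`, which is `(2,)` for the example above.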

In [123]:
list(linear.parameters()) # Ax + b

[Parameter containing:
 tensor([[-0.4654, -0.1241,  0.1760,  0.2889],
         [-0.4991,  0.3690, -0.3155, -0.0010]], requires_grad=True),
 Parameter containing:
 tensor([-0.1244,  0.4139], requires_grad=True)]
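Since the weight is stored with shape `(H_out, H_in)`, the layer applies it to the last dimension of the input as `x @ A.T + b`. The following sketch checks this by hand on a freshly created layer (its random values differ from the ones printed above).

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 2)
x = torch.ones(2, 3, 4)

# weight has shape (H_out, H_in) and bias has shape (H_out,)
print(linear.weight.shape, linear.bias.shape)

# Recompute the output of the layer manually
manual = x @ linear.weight.T + linear.bias
print(torch.allclose(linear(x), manual, atol=1e-6))
```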

### **Other Module Layers**
There are several other preconfigured layers in the `nn` module. Some commonly used examples are `nn.Conv2d`, `nn.ConvTranspose2d`, `nn.BatchNorm1d`, `nn.BatchNorm2d`, `nn.Upsample` and `nn.MaxPool2d` among many others. We will learn more about these as we progress in the course. For now, the only important thing to remember is that we can treat each of these layers as plug and play components: we will be providing the required dimensions and `PyTorch` will take care of setting them up.

在 `nn` 模块中还有几个预配置的层。一些常用的示例包括 `nn.Conv2d`、`nn.ConvTranspose2d`、`nn.BatchNorm1d`、`nn.BatchNorm2d`、`nn.Upsample` 和 `nn.MaxPool2d` 等等。随着课程的进展，我们会更多地了解这些层。目前，唯一需要记住的重要事情是，我们可以将每个这些层视为即插即用的组件：我们只需提供所需的维度，`PyTorch` 就会负责设置它们。

### **Activation Function Layer**
We can also use the `nn` module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples of activation functions are `nn.ReLU()`, `nn.Sigmoid()` and `nn.LeakyReLU()`. Activation functions operate on each element separately, so the tensors we get as output have the same shape as the ones we pass in.

In [124]:
linear_output

tensor([[[-0.2490, -0.0327],
         [-0.2490, -0.0327],
         [-0.2490, -0.0327]],

        [[-0.2490, -0.0327],
         [-0.2490, -0.0327],
         [-0.2490, -0.0327]]], grad_fn=<ViewBackward0>)

In [125]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
print("the input of sigmoid's shape is {}".format(linear_output.shape))
print("the output of sigmoid's shape is {}".format(output.shape))
output

the input of sigmoid's shape is torch.Size([2, 3, 2])
the output of sigmoid's shape is torch.Size([2, 3, 2])


tensor([[[0.4381, 0.4918],
         [0.4381, 0.4918],
         [0.4381, 0.4918]],

        [[0.4381, 0.4918],
         [0.4381, 0.4918],
         [0.4381, 0.4918]]], grad_fn=<SigmoidBackward0>)
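To see what the three activation functions mentioned above do to individual elements, here is a small sketch on a made-up input. The slope of `0.1` for `nn.LeakyReLU` is chosen only for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[-2., 0., 3.]])

relu_out = nn.ReLU()(x)           # negative values become 0
sigmoid_out = nn.Sigmoid()(x)     # values are squashed into (0, 1)
leaky_out = nn.LeakyReLU(0.1)(x)  # negative values are scaled by 0.1

print(relu_out)
print(sigmoid_out)
print(leaky_out)
```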

### **Putting the Layers Together**
So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use `nn.Sequential`, which does exactly that.

In [126]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

input = torch.ones(2,3,4)
output = block(input)
output

tensor([[[0.3418, 0.3809],
         [0.3418, 0.3809],
         [0.3418, 0.3809]],

        [[0.3418, 0.3809],
         [0.3418, 0.3809],
         [0.3418, 0.3809]]], grad_fn=<SigmoidBackward0>)

### Custom Modules

Instead of using the predefined modules, we can also build our own by extending the `nn.Module` class. For example, we can build `nn.Linear` (which also extends `nn.Module`) on our own using the tensors introduced earlier! We can also build new, more complex modules, such as a custom neural network. You will be practicing these in a later assignment.

To create a custom module, the first thing we have to do is to extend `nn.Module`. We can then initialize our parameters in the `__init__` function, starting with a call to the `__init__` function of the super class. All the class attributes we define that are `nn` module objects are registered as submodules, and their parameters can be learned during training. Tensors are not parameters, but they can be turned into parameters if they are wrapped in the `nn.Parameter` class.

All classes extending `nn.Module` are also expected to implement a `forward(x)` function, where `x` is a tensor. This is the function that is called when an input is passed to our module, such as in `model(x)`.

In [127]:
class MultilayerPerceptron(nn.Module):

  def __init__(self, input_size, hidden_size):
    # Call to the __init__ function of the super class, so that the
    # initialization logic of nn.Module runs
    super(MultilayerPerceptron, self).__init__()

    # Bookkeeping: Saving the initialization parameters as instance
    # attributes, so that we can use them when defining the model
    self.input_size = input_size
    self.hidden_size = hidden_size

    # Defining of our model
    # There isn't anything specific about the naming of `self.model`. It could
    # be something arbitrary.
    self.model = nn.Sequential(
        nn.Linear(self.input_size, self.hidden_size),
        nn.ReLU(),
        nn.Linear(self.hidden_size, self.input_size),
        nn.Sigmoid()
    )

  def forward(self, x):
    output = self.model(x)
    return output

Here is an alternative way to define the same class. You can see that we can replace `nn.Sequential` by defining the individual layers in the `__init__` method and connecting them in the `forward` method.

In [128]:
class MultilayerPerceptron(nn.Module):

  def __init__(self, input_size, hidden_size):
    # Call to the __init__ function of the super class
    super(MultilayerPerceptron, self).__init__()

    # Bookkeeping: Saving the initialization parameters
    self.input_size = input_size
    self.hidden_size = hidden_size

    # Defining of our layers
    self.linear = nn.Linear(self.input_size, self.hidden_size)
    self.relu = nn.ReLU()
    self.linear2 = nn.Linear(self.hidden_size, self.input_size)
    self.sigmoid = nn.Sigmoid()

  def forward(self, x):
    linear = self.linear(x)
    relu = self.relu(linear)
    linear2 = self.linear2(relu)
    output = self.sigmoid(linear2)
    print("forward() was called automatically")
    return output

Now that we have defined our class, we can instantiate it and see what it does.

In [129]:
# Make a sample input
input = torch.randn(2, 5)

# Create our model
model = MultilayerPerceptron(5, 3)

print("the input is : \n {}".format(input))
# Pass our input through our model
model(input)

the input is : 
 tensor([[ 1.3909, -0.5912, -0.5251,  1.0540, -0.6228],
        [ 1.0283,  0.0940, -0.1146,  0.2216,  0.2519]])
forward() was called automatically


tensor([[0.5363, 0.4418, 0.5460, 0.5666, 0.5043],
        [0.5305, 0.4534, 0.5292, 0.5707, 0.4894]], grad_fn=<SigmoidBackward0>)

We can inspect the parameters of our model with `named_parameters()` and `parameters()` methods.

In [130]:
print(list(model.named_parameters()))
print("-" * 80)
print(list(model.parameters()))

[('linear.weight', Parameter containing:
tensor([[ 0.2844,  0.4322, -0.1225,  0.4251,  0.0993],
        [-0.4250, -0.2914,  0.3961, -0.0591,  0.0406],
        [-0.0911, -0.3171,  0.2475,  0.1379, -0.2612]], requires_grad=True)), ('linear.bias', Parameter containing:
tensor([-0.3084, -0.0295, -0.3946], requires_grad=True)), ('linear2.weight', Parameter containing:
tensor([[ 0.1882, -0.1808,  0.4383],
        [-0.3787,  0.3676,  0.5601],
        [ 0.5456,  0.0542,  0.5238],
        [-0.1340,  0.0587,  0.2232],
        [ 0.4793,  0.4065,  0.4155]], requires_grad=True)), ('linear2.bias', Parameter containing:
tensor([ 0.0923, -0.1271,  0.0307,  0.3059, -0.1182], requires_grad=True))]
--------------------------------------------------------------------------------
[Parameter containing:
tensor([[ 0.2844,  0.4322, -0.1225,  0.4251,  0.0993],
        [-0.4250, -0.2914,  0.3961, -0.0591,  0.0406],
        [-0.0911, -0.3171,  0.2475,  0.1379, -0.2612]], requires_grad=True), Parameter containing
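
As a quick sanity check on a parameter listing like the one above, we can count the trainable parameters. This is a small sketch that rebuilds the same layer sizes with `nn.Sequential` so that it runs on its own; the name `n_params` is just illustrative.

```python
import torch.nn as nn

# Same layer sizes as MultilayerPerceptron(5, 3)
model = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU(),
    nn.Linear(3, 5),
    nn.Sigmoid(),
)

# Each nn.Linear contributes a weight matrix and a bias vector
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 5*3 + 3 + 3*5 + 5 = 38
```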

## Optimization
We have shown how gradients are calculated with the `backward()` function. Having the gradients isn't enough for our models to learn; we also need to know how to update the parameters of our models. This is where optimizers come in. The `torch.optim` module contains several optimizers that we can use. Some popular examples are `optim.SGD` and `optim.Adam`. When initializing an optimizer, we pass it our model parameters, which can be accessed with `model.parameters()`, telling the optimizer which values it will be optimizing. Optimizers also have a learning rate (`lr`) parameter, which determines how big an update is made in every step. Different optimizers have additional hyperparameters of their own.

In [131]:
import torch.optim as optim
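
To make the role of an optimizer concrete, here is a minimal sketch comparing one step of `optim.SGD` (without momentum or weight decay) against the plain update rule `w <- w - lr * grad`. The toy loss and the variable names are assumptions made for illustration.

```python
import torch
import torch.optim as optim

# A single parameter tensor and a toy loss whose gradient is 2 * w
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()

lr = 0.1
# What plain gradient descent should produce
expected = w.detach() - lr * w.grad

# Let the optimizer perform the same update in place
sgd = optim.SGD([w], lr=lr)
sgd.step()

print(torch.allclose(w.detach(), expected))
```

Optimizers such as `optim.Adam` keep extra running statistics per parameter, so their updates do not reduce to this one-line rule.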

Once we have chosen our optimizer, we can define the `loss` that we want to optimize. We can either define the loss ourselves or use one of the predefined loss functions in `PyTorch`, such as `nn.BCELoss()`. Let's put everything together now! We will start by creating some dummy data.

In [132]:
# Create the y data
y = torch.ones(10, 5)

# Add some noise to our goal y to generate our x
# We want our model to predict our original data, despite the noise
x = y + torch.randn_like(y)
x

tensor([[-0.5732,  0.3369,  1.5707, -0.6915, -0.0843],
        [ 0.2842,  2.0114,  1.6029,  2.1764,  0.5551],
        [ 1.3770,  0.2411,  2.5890,  1.6288,  0.8746],
        [-0.6140,  2.5605,  0.2022,  0.1500,  0.3114],
        [ 0.2285,  0.8403, -0.2322, -0.2084, -1.3668],
        [ 0.8640,  0.4660,  1.3431,  0.2402,  0.6711],
        [ 0.1234,  1.4809,  0.3616,  2.0273,  0.1224],
        [ 1.0940, -0.2841,  2.4534,  0.7063,  1.2543],
        [ 0.3050,  0.9035,  2.5951,  2.2284,  0.0268],
        [-0.1267,  1.8280,  1.9639, -0.1851,  1.0359]])

Now, we can define our model, optimizer and the loss function.

In [133]:
# Instantiate the model
model = MultilayerPerceptron(5, 3)

# Define the optimizer
adam = optim.Adam(model.parameters(), lr=1e-1)

# Define loss using a predefined loss function
# nn.BCELoss() is binary cross entropy, typically used for binary classification;
# it measures the gap between the predicted probabilities and the true labels
loss_function = nn.BCELoss()

# Calculate how our model is doing now
# Run the model on the input x to get the predictions y_pred
y_pred = model(x)
print("the prediction value is :\n {}".format(y_pred))

# Compute the loss of y_pred against the targets y; .item() converts the
# resulting one-element tensor to a plain Python number for printing or logging
print(loss_function(y_pred, y).item())

forward function被自动执行了
the prediction value is :
 tensor([[0.3628, 0.4871, 0.4378, 0.4830, 0.4526],
        [0.4081, 0.3948, 0.4663, 0.5676, 0.3965],
        [0.3641, 0.4616, 0.4342, 0.5136, 0.4214],
        [0.4125, 0.3839, 0.4686, 0.5785, 0.3882],
        [0.4268, 0.4636, 0.4103, 0.5309, 0.4319],
        [0.4057, 0.3973, 0.4645, 0.5660, 0.3966],
        [0.4170, 0.3712, 0.4706, 0.5917, 0.3775],
        [0.3458, 0.4832, 0.4191, 0.4997, 0.4223],
        [0.3522, 0.4982, 0.4288, 0.4767, 0.4513],
        [0.3931, 0.4277, 0.4576, 0.5361, 0.4189]], grad_fn=<SigmoidBackward0>)
0.8197198510169983
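
For reference, with its default mean reduction `nn.BCELoss` averages `-(y * log(p) + (1 - y) * log(1 - p))` over all elements. The sketch below checks this on a few made-up probabilities and targets; the tensors `p` and `y` here are illustrative and unrelated to the data above.

```python
import torch
import torch.nn as nn

# Made-up predicted probabilities and binary targets
p = torch.tensor([[0.9, 0.2], [0.6, 0.7]])
y = torch.tensor([[1.0, 0.0], [1.0, 1.0]])

# Binary cross entropy written out by hand, averaged over all elements
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# The predefined loss function
builtin = nn.BCELoss()(p, y)

print(manual.item(), builtin.item())
```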


Let's see if we can have our model achieve a smaller loss. Now that we have everything we need, we can set up our training loop.

In [134]:
# Set the number of epochs, which determines the number of training iterations
n_epoch = 10

for epoch in range(n_epoch):
  # Set the gradients to 0
  # PyTorch accumulates gradients across backward() calls, so we clear the
  # gradients from the previous iteration before computing new ones
  adam.zero_grad()

  # Get the model predictions
  y_pred = model(x)

  # Get the loss
  loss = loss_function(y_pred, y)

  # Print stats
  print(f"Epoch {epoch}: training loss: {loss}")

  # Compute the gradients via backpropagation
  loss.backward()

  # Take a step to optimize the weights,
  # i.e. update the model parameters using the computed gradients
  adam.step()


forward function被自动执行了
Epoch 0: training loss: 0.8197198510169983
forward function被自动执行了
Epoch 1: training loss: 0.6946194171905518
forward function被自动执行了
Epoch 2: training loss: 0.5555264353752136
forward function被自动执行了
Epoch 3: training loss: 0.4089398980140686
forward function被自动执行了
Epoch 4: training loss: 0.2753055691719055
forward function被自动执行了
Epoch 5: training loss: 0.17014183104038239
forward function被自动执行了
Epoch 6: training loss: 0.09768878668546677
forward function被自动执行了
Epoch 7: training loss: 0.051497675478458405
forward function被自动执行了
Epoch 8: training loss: 0.025548718869686127
forward function被自动执行了
Epoch 9: training loss: 0.012314668856561184
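
The reason for zeroing the gradients in every iteration is that `backward()` adds to the `.grad` attribute of each parameter instead of overwriting it. A minimal sketch on a single scalar parameter (the names are illustrative):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(2w)/dw = 2
(2 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added on top
(2 * w).backward()
accumulated = w.grad.item()

# Clearing the gradient, which is what optimizer.zero_grad() does for every parameter
w.grad.zero_()
cleared = w.grad.item()

print(first, accumulated, cleared)  # 2.0 4.0 0.0
```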


In [135]:
list(model.named_parameters())

[('linear.weight',
  Parameter containing:
  tensor([[ 0.7638,  0.9341,  1.3337,  0.7523,  0.4563],
          [-0.0690,  1.0464, -0.1449, -0.7780, -1.3294],
          [-0.7790, -0.7387, -0.2126, -0.2951, -0.1450]], requires_grad=True)),
 ('linear.bias',
  Parameter containing:
  tensor([ 1.0171,  0.5077, -0.3609], requires_grad=True)),
 ('linear2.weight',
  Parameter containing:
  tensor([[0.8198, 1.2320, 0.0357],
          [1.2306, 1.3758, 0.9619],
          [0.8857, 0.4290, 0.1428],
          [0.6065, 0.8632, 0.0661],
          [1.1845, 1.0837, 0.5626]], requires_grad=True)),
 ('linear2.bias',
  Parameter containing:
  tensor([0.5321, 0.2765, 0.7522, 1.2645, 0.3139], requires_grad=True))]

You can see that our loss is decreasing. Let's check the predictions of our model now and see if they are close to our original `y`, which was all `1s`.

In [136]:
# See how our model performs on the training data
y_pred = model(x)
y_pred

forward function被自动执行了


tensor([[0.9845, 0.9939, 0.9699, 0.9798, 0.9903],
        [0.9983, 0.9999, 0.9992, 0.9963, 0.9998],
        [0.9986, 0.9999, 0.9993, 0.9968, 0.9999],
        [0.9987, 0.9997, 0.9931, 0.9966, 0.9993],
        [0.9956, 0.9976, 0.9521, 0.9912, 0.9935],
        [0.9842, 0.9966, 0.9904, 0.9807, 0.9960],
        [0.9899, 0.9981, 0.9926, 0.9859, 0.9975],
        [0.9956, 0.9995, 0.9976, 0.9925, 0.9994],
        [0.9985, 0.9999, 0.9992, 0.9965, 0.9999],
        [0.9980, 0.9998, 0.9977, 0.9957, 0.9996]], grad_fn=<SigmoidBackward0>)

In [137]:
# Create test data and check how our model performs on it
x2 = y + torch.randn_like(y)
y_pred = model(x2)
print("the x2 loss is {}".format(loss_function(y_pred, y).item()))
y_pred

forward function被自动执行了
the x2 loss is 0.008661006577312946


tensor([[0.9936, 0.9991, 0.9964, 0.9900, 0.9989],
        [0.9998, 1.0000, 0.9996, 0.9992, 1.0000],
        [0.9218, 0.9601, 0.9449, 0.9368, 0.9573],
        [0.9980, 0.9999, 0.9990, 0.9958, 0.9998],
        [0.9910, 0.9975, 0.9854, 0.9865, 0.9960],
        [0.9976, 0.9997, 0.9970, 0.9950, 0.9995],
        [0.9941, 0.9991, 0.9956, 0.9905, 0.9988],
        [0.9990, 0.9999, 0.9995, 0.9975, 0.9999],
        [0.9975, 0.9995, 0.9926, 0.9945, 0.9989],
        [0.9954, 0.9995, 0.9975, 0.9923, 0.9993]], grad_fn=<SigmoidBackward0>)

Great! Looks like our model almost perfectly learned to filter out the noise from the `x` that we passed in!
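
When we only need predictions and no gradients, as in the two evaluation cells above, a common pattern is to wrap the forward pass in `torch.no_grad()`. This is a sketch with a freshly built stand-in model, not the trained model from the notebook:

```python
import torch
import torch.nn as nn

# Stand-in model with the same layer sizes as MultilayerPerceptron(5, 3)
model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 5), nn.Sigmoid())
x = torch.randn(4, 5)

# No computation graph is recorded inside this block
with torch.no_grad():
    y_pred = model(x)

print(y_pred.requires_grad)  # False
```

Predictions computed this way carry no `grad_fn`, which saves memory during evaluation.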

## Demo: Word Window Classification

So far in this notebook, we have learned the fundamentals of PyTorch and built a basic network solving a toy task. Now we will attempt to solve an example NLP task. Here are the things we will learn:

1. Data: Creating a Dataset of Batched Tensors
2. Modeling
3. Training
4. Prediction

In this section, our goal will be to train a model that finds the words in a sentence corresponding to a `LOCATION`, which will always be of span `1` (meaning that `San Francisco` won't be recognized as a `LOCATION`). Our task is called `Word Window Classification` for a reason. Instead of letting our model look at only one word in each forward pass, we would like it to be able to consider the context of the word in question. That is, for each word, we want our model to be aware of the surrounding words. Let's dive in!

### Data

The very first task of any machine learning project is to set up our training set. Usually, there will be a training corpus we will be utilizing. In NLP tasks, the corpus would generally be a `.txt` or `.csv` file where each row corresponds to a sentence or a tabular datapoint. In our toy task, we will assume that we have already read our data and the corresponding labels into a `Python` list.

In [138]:
# Our raw data, which consists of sentences
corpus = [
          "We always come to Paris",
          "The professor is from Australia",
          "I live in Stanford",
          "He comes from Taiwan",
          "The capital of Turkey is Ankara"
         ]

#### Preprocessing

To make it easier for our models to learn, we usually apply a few preprocessing steps to our data. This is especially important when dealing with text data. Here are some examples of text preprocessing:
* **Tokenization**: Tokenizing the sentences into words.
* **Lowercasing**: Changing all the letters to be lowercase.
* **Noise removal**: Removing special characters (such as punctuation).
* **Stop words removal**: Removing commonly used words.

Which preprocessing steps are necessary is determined by the task at hand. For example, although it is useful to remove special characters in some tasks, for others they may be important (for example, if we are dealing with multiple languages). For our task, we will lowercase our words and tokenize.

In [139]:
# The preprocessing function we will use to generate our training examples
# Our function is a simple one, we lowercase the letters
# and then tokenize the words.
def preprocess_sentence(sentence):
  return sentence.lower().split()

# Create our training set
train_sentences = [preprocess_sentence(sent) for sent in corpus]
train_sentences

[['we', 'always', 'come', 'to', 'paris'],
 ['the', 'professor', 'is', 'from', 'australia'],
 ['i', 'live', 'in', 'stanford'],
 ['he', 'comes', 'from', 'taiwan'],
 ['the', 'capital', 'of', 'turkey', 'is', 'ankara']]
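
Our corpus contains no punctuation, so lowercasing and splitting is enough here. For text that does, a noise removal step could be added before tokenizing. The function below is a hypothetical variant for illustration and is not used in the rest of the notebook:

```python
import re

def preprocess_with_noise_removal(sentence):
    # Lowercase, drop every character that is neither a word character
    # nor whitespace, then split on whitespace
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    return cleaned.split()

tokens = preprocess_with_noise_removal("We always come to Paris!")
print(tokens)  # ['we', 'always', 'come', 'to', 'paris']
```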

For each training example we have, we should also have a corresponding label. Recall that the goal of our model was to determine which words correspond to a `LOCATION`. That is, we want our model to output `0` for all the words that are not `LOCATION`s and `1` for the ones that are `LOCATION`s.

In [140]:
# Set of locations that appear in our corpus
locations = set(["australia", "ankara", "paris", "stanford", "taiwan", "turkey"])

# Our train labels
train_labels = [[1 if word in locations else 0 for word in sent] for sent in train_sentences]
train_labels

[[0, 0, 0, 0, 1],
 [0, 0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1, 0, 1]]

#### Converting Words to Embeddings

Let's look at our training data a little more closely. Each datapoint we have is a sequence of words. On the other hand, we know that machine learning models work with numbers in vectors. How are we going to turn words into numbers? You may be thinking embeddings and you are right!

Imagine that we have an embedding lookup table `E`, where each row corresponds to an embedding. That is, each word in our vocabulary would have a corresponding embedding row `i` in this table. Whenever we want to find an embedding for a word, we will follow these steps:
1. Find the corresponding index `i` of the word in the embedding table: `word->index`.
2. Index into the embedding table and get the embedding: `index->embedding`.

Let's look at the first step. We should assign all the words in our vocabulary to a corresponding index. We can do it as follows:
1. Find all the unique words in our corpus.
2. Assign an index to each.

In [141]:
# Find all the unique words in our corpus
vocabulary = set(w for s in train_sentences for w in s)
vocabulary

{'always',
 'ankara',
 'australia',
 'capital',
 'come',
 'comes',
 'from',
 'he',
 'i',
 'in',
 'is',
 'live',
 'of',
 'paris',
 'professor',
 'stanford',
 'taiwan',
 'the',
 'to',
 'turkey',
 'we'}

`vocabulary` now contains all the words in our corpus. At test time, however, we may see words that are not contained in our vocabulary. If we can figure out a way to represent the unknown words, our model can still reason about whether they are a `LOCATION` or not, since we are also looking at the neighboring words for each prediction.

We introduce a special token, `<unk>`, to handle words that are out of vocabulary. We could pick another string for our unknown token if we wanted. The only requirement here is that our token should be unique: we should only be using this token for unknown words. We will also add this special token to our vocabulary.

In [142]:
# Add the unknown token to our vocabulary
vocabulary.add("<unk>")

Earlier we mentioned that our task is called `Word Window Classification` because our model looks at the surrounding words in addition to the given word when it needs to make a prediction.

For example, let's take the sentence "We always come to Paris". The corresponding training label for this sentence is `0, 0, 0, 0, 1` since only Paris, the last word, is a `LOCATION`. In one pass (meaning a call to `forward()`), our model will try to generate the correct label for one word. Let's say our model is trying to generate the correct label `1` for `Paris`. If we only allow our model to see `Paris`, but nothing else, we will miss out on the important information that the word `to` often appears with `LOCATION`s.

Word windows allow our model to consider the `N` words before and the `N` words after each word when making a prediction. In our earlier example for `Paris`, if we have a window size of 1, our model will look at the words that come immediately before and after `Paris`, which are `to`, and, well, nothing. This raises another issue. `Paris` is at the end of our sentence, so there isn't another word following it. Remember that we define the input dimensions of our `PyTorch` models when we are initializing them. If we set the window size to be `1`, our model will be accepting `3` words in every pass. We cannot have our model expect `2` words from time to time.

The solution is to introduce a special token, such as `<pad>`, that will be added to our sentences to make sure that every word has a valid window around it. Similar to the `<unk>` token, we could pick another string for our pad token if we wanted, as long as we make sure it is used for a unique purpose.

In [143]:
# Add the <pad> token to our vocabulary
vocabulary.add("<pad>")

# Function that pads the given sentence
# We are introducing this function here as an example
# We will be utilizing it later in the tutorial
def pad_window(sentence, window_size, pad_token="<pad>"):
  window = [pad_token] * window_size
  return window + sentence + window

# Show padding example
window_size = 2
pad_window(train_sentences[0], window_size=window_size)

['<pad>', '<pad>', 'we', 'always', 'come', 'to', 'paris', '<pad>', '<pad>']
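
To make the idea of a window concrete, here is a sketch that lists the window the model would see for each word of a padded sentence. The helper `word_windows` is hypothetical and only for illustration; `pad_window` is repeated so that the snippet runs on its own.

```python
def pad_window(sentence, window_size, pad_token="<pad>"):
    window = [pad_token] * window_size
    return window + sentence + window

def word_windows(sentence, window_size):
    # One window of 2 * window_size + 1 tokens per original word,
    # with the word itself in the middle
    padded = pad_window(sentence, window_size)
    width = 2 * window_size + 1
    return [padded[i:i + width] for i in range(len(sentence))]

windows = word_windows(["we", "always", "come", "to", "paris"], window_size=1)
for w in windows:
    print(w)
```

With a window size of `1`, the window for `paris` is `['to', 'paris', '<pad>']`, so every word gets exactly `3` tokens.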

Now that our vocabulary is ready, let's assign an index to each of our words.

In [144]:
# We are just converting our vocabulary to a list to be able to index into it
# Sorting is not necessary, we sort to show an ordered word_to_ix dictionary
# That being said, we will see that having the index for the padding token
# be 0 is convenient as some PyTorch functions use it as a default value
# such as nn.utils.rnn.pad_sequence, which we will cover in a bit
ix_to_word = sorted(list(vocabulary))

# Creating a dictionary to find the index of a given word
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
word_to_ix

{'<pad>': 0,
 '<unk>': 1,
 'always': 2,
 'ankara': 3,
 'australia': 4,
 'capital': 5,
 'come': 6,
 'comes': 7,
 'from': 8,
 'he': 9,
 'i': 10,
 'in': 11,
 'is': 12,
 'live': 13,
 'of': 14,
 'paris': 15,
 'professor': 16,
 'stanford': 17,
 'taiwan': 18,
 'the': 19,
 'to': 20,
 'turkey': 21,
 'we': 22}
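
Here `<pad>` ends up at index `0` only because `<` happens to sort before lowercase letters. A vocabulary containing digits or characters such as `!` would push it further down. If we wanted to guarantee the indices of the special tokens, one option is to place them first explicitly. This is a sketch on a small made-up word list, not the notebook's approach:

```python
# Special tokens go first so that their indices are fixed
specials = ["<pad>", "<unk>"]
words = sorted({"we", "always", "come", "to", "paris", "2021"})

ix_to_word = specials + words
word_to_ix = {word: ix for ix, word in enumerate(ix_to_word)}

print(word_to_ix["<pad>"], word_to_ix["<unk>"], word_to_ix["2021"])  # 0 1 2
```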

In [145]:
ix_to_word[1]

'<unk>'

Great! We are ready to convert our training sentences into a sequence of indices corresponding to each token.

In [162]:
# Given a sentence of tokens, return the corresponding indices
def convert_token_to_indices(sentence, word_to_ix):
  indices = []
  for token in sentence:
    # Check if the token is in our vocabulary. If it is, get its index.
    # If not, get the index for the unknown token.
    if token in word_to_ix:
      index = word_to_ix[token]
    else:
      index = word_to_ix["<unk>"]
    indices.append(index)
  return indices

# More compact version of the same function
def _convert_token_to_indices(sentence, word_to_ix):
  # dict.get() returns the index of the token if it is in word_to_ix,
  # and falls back to the index of "<unk>" otherwise
  return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]

# Show an example
example_sentence = ["we", "always", "come", "to", "kuwait"]
example_indices = convert_token_to_indices(example_sentence, word_to_ix)
example_indices_test = _convert_token_to_indices(example_sentence, word_to_ix)
restored_example = [ix_to_word[ind] for ind in example_indices]

print(f"Original sentence is: {example_sentence}")
print(f"Going from words to indices: {example_indices}")
print(f"Going from indices to words: {restored_example}")
print(f"the compacted function has same output: {example_indices_test}")

Original sentence is: ['we', 'always', 'come', 'to', 'kuwait']
Going from words to indices: [22, 2, 6, 20, 1]
Going from indices to words: ['we', 'always', 'come', 'to', '<unk>']
the compacted function has same output: [22, 2, 6, 20, 1]


In the example above, `kuwait` shows up as `<unk>`, because it is not included in our vocabulary. Let's convert our `train_sentences` to `example_padded_indices`.

In [147]:
# Converting our sentences to indices
example_padded_indices = [convert_token_to_indices(s, word_to_ix) for s in train_sentences]
example_padded_indices

[[22, 2, 6, 20, 15],
 [19, 16, 12, 8, 4],
 [10, 13, 11, 17],
 [9, 7, 8, 18],
 [19, 5, 14, 21, 12, 3]]

Now that we have an index for each word in our vocabulary, we can create an embedding table with the `nn.Embedding` class in `PyTorch`. It is called as `nn.Embedding(num_words, embedding_dimension)`, where `num_words` is the number of words in our vocabulary and `embedding_dimension` is the dimension of the embeddings we want to have. There is nothing fancy about `nn.Embedding`: it is just a wrapper class around a trainable `NxE` dimensional tensor, where `N` is the number of words in our vocabulary and `E` is the number of embedding dimensions. This table is initially random, but it will change over time. As we train our network, the gradients will be backpropagated all the way to the embedding layer, and hence our word embeddings will be updated. We will initialize the embedding layer for our model inside the model itself, but we are showing an example here.

In [148]:
# Creating an embedding table for our words
embedding_dim = 5
embeds = nn.Embedding(len(vocabulary), embedding_dim)

# Printing the parameters in our embedding table
list(embeds.parameters())

[Parameter containing:
 tensor([[-8.8797e-01,  9.0129e-02,  4.9432e-01, -5.7977e-02, -2.4530e-01],
         [ 7.2132e-01, -9.1917e-01,  5.2815e-04, -2.8196e-01,  1.6774e+00],
         [ 1.1514e-02, -8.3316e-02,  1.1639e+00, -1.4331e-01, -1.4484e+00],
         [-1.5556e+00, -7.0344e-01, -7.2735e-01,  3.9447e-02, -2.9022e-01],
         [-1.4843e+00, -8.6632e-02, -2.0215e+00, -1.0391e-01,  2.5616e-01],
         [ 6.4172e-01,  2.9437e-01, -5.7398e-01,  2.0036e+00, -4.8637e-01],
         [-9.3919e-01,  1.1492e-01, -2.0537e-01,  5.4826e-01, -1.3616e-01],
         [-4.4233e-02, -8.7528e-01,  1.6569e+00, -4.6451e-01, -6.2155e-01],
         [ 3.1305e-01, -1.5734e+00,  2.0768e-01,  7.9895e-01, -5.2259e-02],
         [ 1.1877e-03,  1.0866e+00, -1.0282e+00,  9.7133e-01, -1.8694e+00],
         [-8.0738e-01,  7.7805e-01, -4.0332e-01,  1.0427e+00,  1.1000e+00],
         [ 3.6213e-01, -1.3958e-02,  1.2187e+00, -1.0414e+00, -7.5985e-02],
         [-9.9113e-01, -1.3071e+00,  2.4603e+00, -2.1908e+00, -1.

To get the word embedding for a word in our vocabulary, all we need to do is to create a lookup tensor. The lookup tensor is just a tensor containing the index we want to look up. The `nn.Embedding` class expects an index tensor of type Long Tensor, so we should create our tensor accordingly.

Once instantiated, an `nn.Embedding` object such as `embeds` is callable: it has a `__call__` method, so it can be used like a function. Calling `embeds(index_tensor)` invokes `__call__`, which takes the index tensor `index_tensor` as input and returns the embedding vectors at those indices.

More specifically, `nn.Embedding` maintains an embedding table internally. This is a learnable parameter matrix of shape `(num_embeddings, embedding_dim)`, where `num_embeddings` is the size of the vocabulary and `embedding_dim` is the dimension of each embedding vector. When you call `embeds(index_tensor)`, PyTorch looks up the embedding vector for each index in `index_tensor` from this table.

In [160]:
# Get the embedding for the word Paris
# embeds was randomly initialized with one row per vocabulary entry; as long
# as we keep using it to look up words from this vocabulary, it serves as
# the embedding table dedicated to this vocabulary
index = word_to_ix["paris"]
index_tensor = torch.tensor(index, dtype=torch.long)
print(index_tensor)
# Use index_tensor to look up the embedding vector for "paris"
paris_embed = embeds(index_tensor)
paris_embed

tensor(15)


tensor([-0.0437, -0.6241,  0.0575, -1.1279,  0.0849],
       grad_fn=<EmbeddingBackward0>)

In [161]:
# We can also get multiple embeddings at once
index_paris = word_to_ix["paris"]
index_ankara = word_to_ix["ankara"]
indices = [index_paris, index_ankara]
indices_tensor = torch.tensor(indices, dtype=torch.long)
embeddings = embeds(indices_tensor)
embeddings

tensor([[-0.0437, -0.6241,  0.0575, -1.1279,  0.0849],
        [-1.5556, -0.7034, -0.7274,  0.0394, -0.2902]],
       grad_fn=<EmbeddingBackward0>)
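
Since the embedding table is just a trainable matrix, looking up indices through the layer gives the same values as indexing its `weight` parameter directly. A minimal sketch with a freshly created table (the sizes `23` and `5` mirror the vocabulary size and embedding dimension above; the indices `15` and `3` are the ones shown for `paris` and `ankara`):

```python
import torch
import torch.nn as nn

embeds = nn.Embedding(23, 5)
indices_tensor = torch.tensor([15, 3], dtype=torch.long)

# Lookup through the layer versus direct row indexing into the table
looked_up = embeds(indices_tensor)
indexed = embeds.weight[indices_tensor]

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```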

Usually, we define the embedding layer as part of our model, which you will see in the later sections of our notebook.

#### Batching Sentences

We have learned about batches in class. Waiting for our whole training corpus to be processed before making an update is costly. On the other hand, updating the parameters after every training example causes the loss to be less stable between updates. To combat these issues, we instead update our parameters after training on a batch of data. This allows us to get a better estimate of the gradient of the global loss. In this section, we will learn how to structure our data into batches using the `torch.utils.data.DataLoader` class.

We will be calling the `DataLoader` class as follows: `DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)`. The `batch_size` parameter determines the number of examples per batch. In every epoch, we will be iterating over all the batches using the `DataLoader`. The order of batches is deterministic by default, but we can ask `DataLoader` to shuffle the data by setting the `shuffle` parameter to `True`. With `shuffle=True`, the examples in the dataset are randomly reordered at the start of every epoch, so the composition of each batch also changes from epoch to epoch. This way we ensure that we don't encounter a bad batch multiple times.
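
As a small illustration of how `batch_size` splits the data, here is a sketch that batches a toy list of integers with shuffling turned off. The toy data is an assumption made for illustration. No `collate_fn` is given, so the default collation is used, which stacks the integers of each batch into a tensor.

```python
from torch.utils.data import DataLoader

# Toy dataset: five examples that are just integers
data = list(range(5))
loader = DataLoader(data, batch_size=2, shuffle=False)

# Five examples with batch_size=2 give two full batches and one smaller one
batches = [batch.tolist() for batch in loader]
print(batches)  # [[0, 1], [2, 3], [4]]
```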

If provided, `DataLoader` passes the list of examples that make up each batch to the `collate_fn`. We can write a custom function to pass to the `collate_fn` parameter in order to print stats about our batch or perform extra processing. In our case, we will use the `collate_fn` to:
1. Window pad our train sentences.
2. Convert the words in the training examples to indices.
3. Pad the training examples so that all the sentences in a batch have the same length. We also need to pad the labels in the same way. This creates an issue because when calculating the loss, we need to know the actual number of words in a given example. We will therefore also keep track of this number in the function we pass to the `collate_fn` parameter.

Because our version of the `collate_fn` function will need access to our `word_to_ix` dictionary (so that it can turn words into indices), we will make use of the `partial` function from `Python`'s `functools` module, which returns a new function with the arguments we give it already filled in.
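
Here is a minimal sketch of what `partial` does, using a made-up function `scale` in place of our collate function:

```python
from functools import partial

def scale(x, factor):
    return x * factor

# Fix the factor argument; the new function only needs x
double = partial(scale, factor=2)

print(double(5))  # 10
```

In the same way, fixing `window_size` and `word_to_ix` leaves a function that takes only the batch, which is the form `DataLoader` expects for `collate_fn`.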

In [163]:
from torch.utils.data import DataLoader
from functools import partial

def custom_collate_fn(batch, window_size, word_to_ix):
  # Break our batch into the training examples (x) and labels (y)
  # The * operator unpacks the batch list so that every (sentence, labels)
  # pair is passed to zip as a separate argument. zip then groups the
  # corresponding elements together, so x is a tuple of all the sentences
  # and y is a tuple of all the label lists.
  x, y = zip(*batch)

  # Now we need to window pad our training examples. We have already defined a
  # function to handle window padding. We are including it here again so that
  # everything is in one place.
  # Padding makes sure the first and last words also have a full window.
  def pad_window(sentence, window_size, pad_token="<pad>"):
    window = [pad_token] * window_size
    return window + sentence + window

  # Pad the train examples.
  x = [pad_window(s, window_size=window_size) for s in x]

  # Now we need to turn words in our training examples to indices. We are
  # copying the function defined earlier for the same reason as above.
  # dict.get() falls back to the index of "<unk>" for unknown tokens.
  def convert_tokens_to_indices(sentence, word_to_ix):
    return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]

  # Convert the train examples into indices.
  x = [convert_tokens_to_indices(s, word_to_ix) for s in x]

  # We will now pad the examples so that the lengths of all the examples in
  # one batch are the same, making it possible to do matrix operations.
  # We set the batch_first parameter to True so that the returned matrix has
  # the batch as the first dimension.
  pad_token_ix = word_to_ix["<pad>"]

  # pad_sequence expects a list of tensors, so we turn each example in x into
  # one. This is also useful since our model will be expecting tensor inputs.
  # Relevant pad_sequence parameters:
  #   sequences: list of tensors to pad to a common length
  #   batch_first: if True the result has shape (batch_size, max_length, *),
  #     otherwise (max_length, batch_size, *); defaults to False
  #   padding_value: value used for padding; defaults to 0
  x = [torch.LongTensor(x_i) for x_i in x]
  x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

  # We will also pad the labels. Before doing so, we will record the number
  # of labels so that we know how many words existed in each example.
  lengths = [len(label) for label in y]
  lengths = torch.LongTensor(lengths)

  y = [torch.LongTensor(y_i) for y_i in y]
  y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

  # We are now ready to return our variables. The order we return our variables
  # here will match the order we read them in our training loop.
  return x_padded, y_padded, lengths

This function seems long, but it really doesn't have to be. Check out the alternative version below where we remove the extra function declarations and comments.

In [164]:
def _custom_collate_fn(batch, window_size, word_to_ix):
  # Prepare the datapoints
  x, y = zip(*batch)
  # 加pad字符放置window超出
  # 字符变数字,以便运算
  x = [pad_window(s, window_size=window_size) for s in x]
  x = [convert_tokens_to_indices(s, word_to_ix) for s in x]

  # Pad x so that all the examples in the batch have the same size and can
  # be stacked into a single matrix
  pad_token_ix = word_to_ix["<pad>"]
  x = [torch.LongTensor(x_i) for x_i in x]
  x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

  # Pad y the same way and record the original lengths
  lengths = [len(label) for label in y]
  lengths = torch.LongTensor(lengths)
  y = [torch.LongTensor(y_i) for y_i in y]
  y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

  return x_padded, y_padded, lengths
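
To make the effect of `batch_first` and `padding_value` concrete, here is a minimal sketch on two made-up index sequences. The values and the pad index `0` are illustrative assumptions and are not taken from the training data.

```python
import torch
from torch import nn

# Two toy index sequences of different lengths (hypothetical values).
seqs = [torch.LongTensor([5, 6, 7]), torch.LongTensor([8, 9])]

# batch_first=True gives a tensor of shape (batch_size, max_length).
batch_major = nn.utils.rnn.pad_sequence(seqs, batch_first=True, padding_value=0)

# batch_first=False (the default) gives shape (max_length, batch_size).
time_major = nn.utils.rnn.pad_sequence(seqs, batch_first=False, padding_value=0)

print(batch_major.shape)  # torch.Size([2, 3])
print(time_major.shape)   # torch.Size([3, 2])
print(batch_major)        # the shorter sequence is padded with 0 at the end
```

With `batch_first=True`, each row of the result is one example, which matches how the batches are printed in this notebook.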

Now, we can see the `DataLoader` in action.

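`zip()` packs several sequences together element by element into tuples, so the sequences can be iterated over in parallel.

Our `collate_fn` returns three values: `batched_x` (the processed inputs), `batched_y` (the processed labels), and `batched_lengths` (the length of each example). Because the `DataLoader` passes along whatever the `collate_fn` returns, it can handle data of different types and shapes.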
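
The two plain-Python tools the collate function relies on, `zip(*batch)` and `functools.partial`, can be sketched on their own. The helper `add_window_padding` below is a hypothetical stand-in that plays the role of `pad_window`, and the toy batch is made up for illustration.

```python
from functools import partial

# A toy batch: each item is a (tokens, labels) pair, like the items of
# list(zip(sentences, labels)).
batch = [(["we", "live", "in", "paris"], [0, 0, 0, 1]),
         (["hello", "world"], [0, 0])]

# zip(*batch) "unzips" the pairs into one tuple of inputs and one of labels.
x, y = zip(*batch)

# Hypothetical helper: add window_size pad tokens to both ends of a sentence.
def add_window_padding(tokens, window_size, pad_token="<pad>"):
    return [pad_token] * window_size + tokens + [pad_token] * window_size

# partial fixes window_size and returns a function of the remaining argument.
pad_two = partial(add_window_padding, window_size=2)
padded = [pad_two(s) for s in x]

print(y)          # ([0, 0, 0, 1], [0, 0])
print(padded[1])  # ['<pad>', '<pad>', 'hello', 'world', '<pad>', '<pad>']
```

The `DataLoader` calls its `collate_fn` with a single argument, the list of examples in the batch, which is why any extra arguments have to be fixed in advance with `partial`.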

In [165]:
# Parameters to be passed to the DataLoader
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
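# partial performs partial application: it fixes some of a function's arguments
# and returns a new function that takes the remaining ones.
# Here, collate_fn takes a batch, splits it into inputs and labels, and turns
# them into padded tensors for training; that is, it does the data preprocessing.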
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate the DataLoader
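# The DataLoader splits the dataset into batches, each containing the specified
# number of examples, which makes training more efficient, especially on large datasets.
# collate_fn (default None) customizes how a batch is assembled: if provided,
# the DataLoader calls it on the list of examples in each batch, so it is the
# place for steps such as padding and conversion to tensors.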
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Go through one loop
counter = 0
# collate_fn returns three values, so each iteration unpacks into three variables
for batched_x, batched_y, batched_lengths in loader:
  print(f"Iteration {counter}")
  print("Batched Input:")
  print(batched_x)
  print("Batched Labels:")
  print(batched_y)
  print("Batched Lengths:")
  print(batched_lengths)
  print("")
  counter += 1

Iteration 0
Batched Input:
tensor([[ 0,  0, 22,  2,  6, 20, 15,  0,  0],
        [ 0,  0, 10, 13, 11, 17,  0,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0]])
Batched Lengths:
tensor([5, 4])

Iteration 1
Batched Input:
tensor([[ 0,  0,  9,  7,  8, 18,  0,  0,  0],
        [ 0,  0, 19, 16, 12,  8,  4,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]])
Batched Lengths:
tensor([4, 5])

Iteration 2
Batched Input:
tensor([[ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 1, 0, 1]])
Batched Lengths:
tensor([6])



The batched input tensors you see above will be passed into our model. However, we started off saying that our model will be a window classifier, and the way our input tensors are currently formatted, each datapoint contains all the words in a sentence. When we pass this input to our model, it needs to create the windows for each word, predict for each window whether the center word is a `LOCATION` or not, then put the predictions together and return them.

We could avoid this problem if we formatted our data by breaking it into windows beforehand. In this example, we will instead have our model take care of the formatting.

Given that our `window_size` is `N`, we want our model to make a prediction for every window of `2N+1` tokens. That is, if we have an input with `9` tokens and a `window_size` of `2`, we want our model to return `5` predictions. This makes sense because before we padded it with `2` tokens on each side, our input also had `5` tokens in it!
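
The counting argument can be checked with plain Python lists. This is a small sketch with made-up tokens `a` to `e`; it only illustrates the arithmetic and is not part of the model.

```python
window_size = 2                    # N
full_window = 2 * window_size + 1  # 2N + 1 = 5

# Five real tokens, padded with window_size pad tokens on each side: 9 in total.
tokens = ["<pad>", "<pad>", "a", "b", "c", "d", "e", "<pad>", "<pad>"]

# Slide a window of 2N + 1 tokens over the padded sentence, one step at a time.
windows = [tokens[i:i + full_window]
           for i in range(len(tokens) - full_window + 1)]

# The center word of each window is the word we make a prediction for.
centers = [w[window_size] for w in windows]

print(len(windows))  # 5
print(centers)       # ['a', 'b', 'c', 'd', 'e']
```

Each original token ends up as the center of exactly one window, so the number of predictions equals the unpadded sentence length.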

We can create these windows by using for loops, but there is a faster `PyTorch` alternative, which is the `unfold(dimension, size, step)` method. We can create the windows we need using this method as follows:

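The parameters of `unfold(dimension, size, step)`:

`dimension`: the dimension along which to create the sliding windows. For a two-dimensional tensor, for example, we can slide along the rows (`0`) or along the columns (`1`).

`size`: the size of each window, that is, the number of elements each slice takes along the given dimension.

`step`: the stride between the starts of consecutive windows along the given dimension. A step of `1` moves the window one element at a time; a larger integer skips further ahead.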

In [171]:
# Print the original tensor
print(f"Original Tensor: ")
print(batched_x)
print("")

# Create the 2 * 2 + 1 chunks
chunk = batched_x.unfold(1, window_size*2 + 1, 1)
print(f"Windows: ")
print(chunk)

Original Tensor: 
tensor([[ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]])

Windows: 
tensor([[[ 0,  0, 19,  5, 14],
         [ 0, 19,  5, 14, 21],
         [19,  5, 14, 21, 12],
         [ 5, 14, 21, 12,  3],
         [14, 21, 12,  3,  0],
         [21, 12,  3,  0,  0]]])


Let's try `unfold` on a small tensor to get a feel for it:

In [170]:
tensor = torch.tensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])

tensor_try_window_a = tensor.unfold(0, 2, 1)
tensor_try_window_b = tensor.unfold(1, 2, 1)
print(tensor_try_window_a)
print(tensor_try_window_b)

tensor([[[1, 4],
         [2, 5],
         [3, 6]],

        [[4, 7],
         [5, 8],
         [6, 9]]])
tensor([[[1, 2],
         [2, 3]],

        [[4, 5],
         [5, 6]],

        [[7, 8],
         [8, 9]]])
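
For comparison, the for-loop alternative mentioned earlier can be sketched as follows. The input reuses the batch printed above, and the loop construction is just one illustrative way of writing it; the point is that it produces the same windows as `unfold`.

```python
import torch

window_size = 2
full_window = 2 * window_size + 1

# The single-sentence batch printed earlier, shape (1, 10).
x = torch.tensor([[0, 0, 19, 5, 14, 21, 12, 3, 0, 0]])

# Loop version: slice out each window and stack the slices along a new dim 1.
loop_windows = torch.stack(
    [x[:, i:i + full_window] for i in range(x.size(1) - full_window + 1)],
    dim=1,
)

# unfold version: a single call, with no Python-level loop.
unfold_windows = x.unfold(1, full_window, 1)

print(loop_windows.shape)                         # torch.Size([1, 6, 5])
print(torch.equal(loop_windows, unfold_windows))  # True
```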


### Model

Now that we have prepared our data, we are ready to build our model. We have learned how to write custom `nn.Module` classes. We will do the same here and put everything we have learned so far together.

In [172]:
class WordWindowClassifier(nn.Module):

  def __init__(self, hyperparameters, vocab_size, pad_ix=0):
    super(WordWindowClassifier, self).__init__()

    """ Instance variables """
    self.window_size = hyperparameters["window_size"]
    self.embed_dim = hyperparameters["embed_dim"]
    self.hidden_dim = hyperparameters["hidden_dim"]
    self.freeze_embeddings = hyperparameters["freeze_embeddings"]

    """ Embedding Layer
    Takes in a tensor containing embedding indices, and returns the
    corresponding embeddings. The output is of dim
    (number_of_indices * embedding_dim).

    If freeze_embeddings is True, set the embedding layer parameters to be
    non-trainable. This is useful if we only want the parameters other than the
    embeddings parameters to change.

    """
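    # nn.Embedding is the PyTorch class that implements word embeddings.
    # A word embedding maps each word to a vector in a continuous space, and is
    # commonly used in NLP tasks such as text classification and named entity recognition.
    # vocab_size is the size of the vocabulary.
    # self.embed_dim is the dimension of the vector each word is mapped to.
    self.embeds = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_ix)
    if self.freeze_embeddings:
      self.embeds.weight.requires_grad = False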

    """ Hidden Layer
    """
    full_window_size = 2 * self.window_size + 1
    self.hidden_layer = nn.Sequential(
      nn.Linear(full_window_size * self.embed_dim, self.hidden_dim),
      nn.Tanh()
    )

    """ Output Layer
    """
    self.output_layer = nn.Linear(self.hidden_dim, 1)

    """ Probabilities
    """
    self.probabilities = nn.Sigmoid()

  # The layers defined in __init__ are put to use in forward.
  def forward(self, inputs):
    """
    Let B:= batch_size
        L:= window-padded sentence length
        D:= self.embed_dim
        S:= full window size, 2 * self.window_size + 1
        H:= self.hidden_dim

    inputs: a (B, L) tensor of token indices
    """
    B, L = inputs.size()

    """
    Reshaping.
    Takes in a (B, L) LongTensor
    Outputs a (B, L~, S) LongTensor
    """
    # First, get our word windows for each word in our input.
    token_windows = inputs.unfold(1, 2 * self.window_size + 1, 1)
    _, adjusted_length, _ = token_windows.size()

    # It is a good idea to do internal tensor-size sanity checks, at the least
    # in comments! Here we check that the windows have the expected shape.
    assert token_windows.size() == (B, adjusted_length, 2 * self.window_size + 1)

    """
    Embedding.
    Takes in a torch.LongTensor of size (B, L~, S)
    Outputs a (B, L~, S, D) FloatTensor.
    """
    embedded_windows = self.embeds(token_windows)

    """
    Reshaping.
    Takes in a (B, L~, S, D) FloatTensor.
    Resizes it into a (B, L~, S*D) FloatTensor.
    -1 argument "infers" what the last dimension should be based on leftover axes.
    """
    embedded_windows = embedded_windows.view(B, adjusted_length, -1)

    """
    Layer 1.
    Takes in a (B, L~, S*D) FloatTensor.
    Maps it to a (B, L~, H) FloatTensor.
    """
    layer_1 = self.hidden_layer(embedded_windows)

    """
    Layer 2
    Takes in a (B, L~, H) FloatTensor.
    Maps it to a (B, L~, 1) FloatTensor.
    """
    output = self.output_layer(layer_1)

    """
    Sigmoid.
    Takes in a (B, L~, 1) FloatTensor of unnormalized scores.
    Outputs a (B, L~, 1) FloatTensor of probabilities in (0, 1).
    """
    output = self.probabilities(output)
    output = output.view(B, -1)

    return output
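
To see how the tensor shapes change through the forward pass, here is a standalone sketch that repeats the same steps outside the class. The sizes (`B=2`, `L=9`, `D=4`, `H=3`, a vocabulary of 10) and the random inputs are assumptions chosen only to keep the shapes small.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumptions for illustration only).
B, L, window_size, D, H = 2, 9, 2, 4, 3
S = 2 * window_size + 1          # full window size
vocab_size, pad_ix = 10, 0

inputs = torch.randint(1, vocab_size, (B, L))   # (B, L) token indices

embeds = nn.Embedding(vocab_size, D, padding_idx=pad_ix)
hidden = nn.Sequential(nn.Linear(S * D, H), nn.Tanh())
out_layer = nn.Linear(H, 1)

windows = inputs.unfold(1, S, 1)                # (B, L~, S)
embedded = embeds(windows)                      # (B, L~, S, D)
flat = embedded.view(B, windows.size(1), -1)    # (B, L~, S*D)
scores = out_layer(hidden(flat))                # (B, L~, 1)
probs = torch.sigmoid(scores).view(B, -1)       # (B, L~)

print(windows.shape, embedded.shape, flat.shape, probs.shape)
# The embedding row at padding_idx is initialized to zeros.
print(embeds.weight[pad_ix])
```

Here `L~` is `L - 2 * window_size`, so a padded length of `9` gives `5` predictions per sentence.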

### Training

We are now ready to put everything together. Let's start with preparing our data and initializing our model. We can then initialize our optimizer and define our loss function. This time, instead of using one of the predefined loss functions as we did before, we will define our own loss function.

In [173]:
# Prepare the data
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate a DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Initialize a model
# It is useful to put all the model hyperparameters in a dictionary
model_hyperparameters = {
    "batch_size": 4,
    "window_size": 2,
    "embed_dim": 25,
    "hidden_dim": 25,
    "freeze_embeddings": False,
}

vocab_size = len(word_to_ix)
model = WordWindowClassifier(model_hyperparameters, vocab_size)

# Define an optimizer
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Define a loss function, which computes the binary cross entropy loss
def loss_function(batch_outputs, batch_labels, batch_lengths):
    # Calculate the loss for the whole batch
    # BCELoss is a loss function commonly used for binary classification
    bceloss = nn.BCELoss()
    loss = bceloss(batch_outputs, batch_labels.float())

    # Rescale the loss. Remember that we have used lengths to store the
    # number of words in each training example
    loss = loss / batch_lengths.sum().float()

    return loss
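
As a quick check of what `nn.BCELoss` computes, the sketch below compares it with the binary cross entropy formula written out by hand. The probabilities and labels are made-up values; with the default settings, `nn.BCELoss` averages the per-element losses.

```python
import torch
from torch import nn

# Made-up predicted probabilities and gold labels for one short sentence.
probs = torch.tensor([[0.9, 0.2, 0.6]])
labels = torch.tensor([[1, 0, 1]])

# Built-in version (default reduction is the mean over all elements).
builtin = nn.BCELoss()(probs, labels.float())

# Manual version: -[y * log(p) + (1 - y) * log(1 - p)], averaged.
y = labels.float()
manual = -(y * torch.log(probs) + (1 - y) * torch.log(1 - probs)).mean()

print(builtin.item(), manual.item())
print(torch.allclose(builtin, manual))  # True
```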

Unlike our earlier example, this time instead of passing all of our training data to the model at once in each epoch, we will be utilizing batches. Hence, in each training epoch iteration, we also iterate over the batches.

In [174]:
# Function that will be called in every epoch
def train_epoch(loss_function, optimizer, model, loader):

  # Keep track of the total loss for the epoch
  total_loss = 0
  for batch_inputs, batch_labels, batch_lengths in loader:
    # Clear the gradients
    optimizer.zero_grad()
    # Run a forward pass
    outputs = model(batch_inputs)
    # Compute the batch loss
    loss = loss_function(outputs, batch_labels, batch_lengths)
    # Calculate the gradients
    loss.backward()
    # Update the parameters
    optimizer.step()
    total_loss += loss.item()

  return total_loss


# Function containing our main training loop
def train(loss_function, optimizer, model, loader, num_epochs=10000):

  # Iterate through each epoch and call our train_epoch function
  for epoch in range(num_epochs):
    epoch_loss = train_epoch(loss_function, optimizer, model, loader)
    if epoch % 100 == 0: print(epoch_loss)

Let's start training!

In [175]:
num_epochs = 1000
train(loss_function, optimizer, model, loader, num_epochs=num_epochs)

0.24125424027442932
0.22220438718795776
0.17085469886660576
0.11967884749174118
0.10180189833045006
0.0804689358919859
0.057760629802942276
0.051059434190392494
0.03829230275005102
0.04034239985048771


### Prediction

Let's see how good our model is at making predictions. We can start by creating our test data.

In [181]:
# Create test sentences
test_corpus = ["She comes from Paris",
               "She has a pretty dog",
               "He travels to China"
               ]
test_sentences = [s.lower().split() for s in test_corpus]
test_labels = [[0, 0, 0, 1],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 1]]

# Create a test loader
test_data = list(zip(test_sentences, test_labels))
batch_size = 1
shuffle = False
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)
test_loader = DataLoader(test_data,
                         batch_size=batch_size,
                         shuffle=shuffle,
                         collate_fn=collate_fn)
print(test_data)

[(['she', 'comes', 'from', 'paris'], [0, 0, 0, 1]), (['she', 'has', 'a', 'pretty', 'dog'], [0, 0, 0, 0, 0]), (['he', 'travels', 'to', 'china'], [0, 0, 0, 1])]


Let's loop over our test examples to see how well we are doing.

In [182]:
for test_instance, labels, _ in test_loader:
  outputs = model(test_instance)
  print(labels)
  print(outputs)

tensor([[0, 0, 0, 1]])
tensor([[0.1098, 0.0611, 0.0427, 0.8780]], grad_fn=<ViewBackward0>)
tensor([[0, 0, 0, 0, 0]])
tensor([[0.2230, 0.2685, 0.2744, 0.3572, 0.4100]], grad_fn=<ViewBackward0>)
tensor([[0, 0, 0, 1]])
tensor([[0.0521, 0.2016, 0.2370, 0.6765]], grad_fn=<ViewBackward0>)
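
The outputs are probabilities, so one way to turn them into hard predictions is to threshold them. The sketch below applies a threshold of `0.5`, which is an assumption rather than something the notebook prescribes, to the first test output printed above.

```python
import torch

# Probabilities and labels copied from the first test example above.
outputs = torch.tensor([[0.1098, 0.0611, 0.0427, 0.8780]])
labels = torch.tensor([[0, 0, 0, 1]])

threshold = 0.5  # assumed decision threshold

# A word is predicted to be a LOCATION if its probability exceeds the threshold.
preds = (outputs > threshold).long()
accuracy = (preds == labels).float().mean().item()

print(preds)     # tensor([[0, 0, 0, 1]])
print(accuracy)  # 1.0
```

When only predictions are needed, the forward pass can also be wrapped in `with torch.no_grad():`, which skips gradient tracking, so the outputs carry no `grad_fn`.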
