# Preliminaries

To prepare for your dive into deep learning, you will need a few survival skills: 
- (i) techniques for **storing** and **manipulating** data;
- (ii) libraries for **ingesting** and **preprocessing** data from a variety of sources;
- (iii) knowledge of the basic linear algebraic operations that we apply to high-dimensional data elements;
- (iv) just enough calculus to determine which direction to adjust each parameter in order to decrease the loss function;
- (v) the ability to automatically compute derivatives so that you can forget much of the calculus you just learned;
- (vi) some basic fluency in probability, our primary language for reasoning under uncertainty;
- (vii) some aptitude for finding answers in the official documentation when you get stuck.

## Data Manipulation

In order to get anything done, we need some **way to store** and **manipulate data**. Generally, there are two important things we need to do with data: 
- (i) acquire them;
- (ii) process them once they are inside the computer.

There is no point in acquiring data without some way to store it, so to start, let’s get our hands dirty with **n-dimensional arrays**, which we also call **tensors**. If you already know the NumPy scientific computing package, this will be a breeze. For all modern deep learning frameworks, the tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) resembles NumPy’s ndarray, with a few killer features added. 
- First, the **tensor class supports automatic differentiation**.
- Second, it **leverages GPUs to accelerate numerical computation**, whereas NumPy only runs on CPUs.

These properties make neural networks both easy to code and fast to run.

To start, we import the **PyTorch** library. Note that the package name is **torch**.

In [1]:
import torch

A tensor represents a (possibly multidimensional) array of numerical values. In the one-dimensional case, i.e., when only one axis is needed for the data, a tensor is called a **vector**. With two axes, a tensor is called a **matrix**. With $k > 2$ axes, we drop the specialized names and just refer to the object as a $k^\mathrm{th}$**-order tensor**.
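
As a quick illustration of this terminology, the sketch below builds tensors with one, two, and three axes and reads the number of axes from the `ndim` attribute. The variable names are chosen here purely for illustration, and `torch.zeros` is covered in more detail later in this section.

```python
import torch

# Tensors with one, two, and three axes (illustrative names).
vector = torch.zeros(5)
matrix = torch.zeros(2, 3)
third_order = torch.zeros(2, 3, 4)

# ndim reports the number of axes, i.e. the order of the tensor.
for name, t in [("vector", vector), ("matrix", matrix), ("third_order", third_order)]:
    print(name, t.ndim, tuple(t.shape))
```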

### arange

PyTorch provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking **arange(n)**, we can create a vector of evenly spaced values, starting at 0 (included) and ending at n (not included). By default, the interval size is 1. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.

In [2]:
x = torch.arange(12, dtype=torch.float32)
x

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])
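
The starting value and the interval size can also be given explicitly. The following sketch, with arbitrarily chosen arguments, shows the `arange(start, end, step)` form.

```python
import torch

# arange(start, end, step): start is included, end is excluded.
evens = torch.arange(2, 10, 2)
print(evens)   # tensor([2, 4, 6, 8])

# A fractional step produces a floating-point tensor.
halves = torch.arange(0, 2, 0.5)
print(halves)  # tensor([0.0000, 0.5000, 1.0000, 1.5000])
```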

### numel

Each of these values is called an **element** of the tensor. The tensor x contains 12 elements. We can inspect the total number of elements in a tensor via its **`numel`** method.

In [3]:
x.numel()

12

### shape

We can access a tensor’s shape (the length along each axis) by inspecting its **shape** attribute. Because we are dealing with a vector here, the shape contains just a single element and is identical to the size.

In [4]:
x.shape

torch.Size([12])

### reshape

We can change the **shape** of a tensor **without altering its size or values**, by invoking reshape. For example, we can transform our vector x whose shape is (12,) to a matrix X with shape (3, 4). This new tensor retains all elements but reconfigures them into a matrix.

In [5]:
X = x.reshape(3, 4)
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

Note that specifying every shape component to reshape is redundant. Because we already know our tensor’s size, we can work out one component of the shape given the rest. For example, given a tensor of size `n` and target shape `(h,w)`, we know that `w=n/h`. To automatically infer one component of the **shape**, we can place a **-1** for the shape component that should be **inferred automatically**. In our case, instead of calling x.reshape(3, 4), we could have equivalently called **x.reshape(-1, 4)** or **x.reshape(3, -1)**.

In [6]:
X = x.reshape(3, -1)
print(X.shape)

X = x.reshape(-1, 4)
print(X.shape)

torch.Size([3, 4])
torch.Size([3, 4])
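
The same inference works with more than two axes, and it fails when the tensor's size is not divisible by the components that were specified. The sketch below illustrates both cases; the target shapes are arbitrary examples.

```python
import torch

x = torch.arange(12, dtype=torch.float32)

# With three axes, one component can still be inferred: 12 / (2 * 3) = 2.
T = x.reshape(2, -1, 3)
print(T.shape)  # torch.Size([2, 2, 3])

# 12 is not divisible by 5, so no valid second component exists.
reshape_failed = False
try:
    x.reshape(5, -1)
except RuntimeError as err:
    reshape_failed = True
    print("reshape failed:", err)
```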


### zeros

Practitioners often need to work with tensors initialized to contain all 0s or 1s. We can construct a tensor with all elements set to 0 and a shape of (2, 3, 4) via the **zeros** function.

In [7]:
torch.zeros((2, 3, 4))

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

### ones

Similarly, we can create a tensor with all 1s by invoking ones.

In [8]:
torch.ones((2, 3, 4))

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

### sampling from probability distribution

We often wish to sample each element randomly (and independently) from a given probability distribution. For example, the parameters of neural networks are often initialized randomly. The following snippet creates a tensor with elements drawn from a **standard Gaussian (normal) distribution** with **mean 0** and **standard deviation 1**.

In [10]:
torch.randn(3, 4)

tensor([[-0.0972, -1.2942,  1.1896,  1.0699],
        [-0.7763,  0.0304,  2.1304, -0.1287],
        [ 1.1961,  0.9533, -0.6336, -0.4243]])
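
Because the values are random, each run prints different numbers. As a sketch of how to make such draws repeatable and to check the claimed mean and standard deviation empirically, one can fix the seed with `torch.manual_seed`; the seed value and sample size below are arbitrary choices.

```python
import torch

torch.manual_seed(0)  # fix the generator state (seed value is arbitrary)
sample = torch.randn(1000, 1000)

# With a million draws, the empirical statistics should be close to 0 and 1.
print(sample.mean(), sample.std())

# Resetting the same seed reproduces exactly the same values.
torch.manual_seed(0)
repeat = torch.randn(1000, 1000)
print(torch.equal(sample, repeat))
```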

### create tensors from list

Finally, we can construct tensors by supplying the exact value of each element in (possibly nested) Python lists of numerical literals. For example, a **matrix** can be built from a **list of lists**, where the **outermost list** corresponds to **axis 0** and each **inner list** corresponds to **axis 1**.
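
A minimal sketch of this construction follows. The values mirror the matrix `Y` used later in this document, written here as integers.

```python
import torch

# Outer list -> axis 0 (rows); each inner list -> axis 1 (columns).
M = torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
print(M)
print(M.shape)  # torch.Size([3, 4])

# Integer literals give an integer tensor; including a float literal
# (e.g. 2.0) would instead give the default floating-point dtype.
print(M.dtype)  # torch.int64
```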

## Indexing and Slicing

As with Python lists, we can access tensor elements by indexing (starting with 0). To access an element based on its position relative to the end of the list, we can use **negative indexing**. We can also access whole ranges of indices via slicing (e.g., **`X[start:stop]`**), where the returned value includes the first index (start) but not the last (stop). Finally, when only one index (or slice) is specified for a $k^\mathrm{th}$-order tensor, it is applied along axis 0. Thus, in the following code, **[-1]** selects the last row and **[1:3]** selects the second and third rows.

In [12]:
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

In [13]:
X[-1]

tensor([ 8.,  9., 10., 11.])

In [14]:
X[1:3]

tensor([[ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

Beyond reading them, we can also write elements of a matrix by specifying indices.

In [15]:
X[1, 2] = 17
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5., 17.,  7.],
        [ 8.,  9., 10., 11.]])
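
Indices and slices can be combined across axes, and a slice on the left-hand side assigns the same value to every selected element. The sketch below uses a fresh copy of the matrix; the value 12 is an arbitrary choice.

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape(3, 4)

# Index along both axes: every row, column 1.
print(X[:, 1])  # tensor([1., 5., 9.])

# Assign one value to every element of the first two rows.
X[:2, :] = 12
print(X)
```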

## Operations

Now that we know how to construct tensors and how to read from and write to their elements, we can begin to manipulate them with various mathematical operations. Among the most useful of these are the **`elementwise operations`**. These apply a standard scalar operation to each element of a tensor. For functions that take two tensors as inputs, elementwise operations apply some standard binary operator on each pair of corresponding elements. We can create an elementwise function from any function that maps from **a scalar to a scalar**.

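In mathematical notation, we denote such **`unary scalar operators`** (taking one input) by the signature $f:\mathbb{R}\to\mathbb{R}$. This just means that **the function maps any real number to some other real number**. Most standard operators, including **unary** ones like $e^{x}$, can be applied elementwise. Note that X was created by reshaping x and shares its memory, so the earlier assignment X[1, 2] = 17 also changed x[6]; this is why the seventh entry of the output below is $e^{17}$ rather than $e^{6}$.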

In [16]:
torch.exp(x)

tensor([1.0000e+00, 2.7183e+00, 7.3891e+00, 2.0086e+01, 5.4598e+01, 1.4841e+02,
        2.4155e+07, 1.0966e+03, 2.9810e+03, 8.1031e+03, 2.2026e+04, 5.9874e+04])

<img src="binary elementwise operations.png" />

In [18]:
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y

tensor([ 3.,  4.,  6., 10.])

In [19]:
x - y

tensor([-1.,  0.,  2.,  6.])

In [20]:
x * y

tensor([ 2.,  4.,  8., 16.])

In [21]:
x / y

tensor([0.5000, 1.0000, 2.0000, 4.0000])

In [22]:
x ** y

tensor([ 1.,  4., 16., 64.])

In addition to elementwise computations, we can also perform **linear algebraic operations**, such as dot products and matrix multiplications.
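
As a brief sketch of these two operations, the example below uses `torch.dot` for a dot product and the `@` operator for a matrix product. The tensors are arbitrary illustrative values.

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.ones(3)

# Dot product: sum of elementwise products, 1 + 2 + 3.
print(torch.dot(u, v))  # tensor(6.)

# Matrix product of a (2, 3) matrix with a (3, 2) matrix gives a (2, 2) matrix.
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.ones(3, 2)
print(A @ B)
```

Unlike `A * B`, which multiplies corresponding elements and requires compatible shapes, `A @ B` contracts the last axis of `A` with the first axis of `B`.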

## Tensor Concatenation

We can also concatenate multiple tensors, stacking them end-to-end to form a larger one. We just need to provide a list of tensors and tell the system along which axis to concatenate. The example below shows what happens when we concatenate two matrices along **rows (axis 0)** and along **columns (axis 1)**. We can see that the first output’s **axis-0 length (6)** is the sum of the two input tensors’ **axis-0 lengths (3+3)** while the second output’s **axis-1 length (8)** is the sum of the two input tensors’ **axis-1 lengths (4+4)**.

In [25]:
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

print(X)
print(Y)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([[2., 1., 4., 3.],
        [1., 2., 3., 4.],
        [4., 3., 2., 1.]])


In [26]:
torch.cat((X, Y), dim=0)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]])

In [24]:
torch.cat((X, Y), dim=1)

tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
        [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
        [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]])

## Binary Tensors

Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example. For each position i, j, if X[i, j] and Y[i, j] are equal, the corresponding entry in the result is True; otherwise it is False. In PyTorch the result has dtype `torch.bool`, with True and False playing the roles of 1 and 0.

In [27]:
X==Y

tensor([[False,  True, False,  True],
        [False, False, False, False],
        [False, False, False, False]])
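
If explicit 0s and 1s are needed, the boolean result can be converted. The sketch below also counts the matching positions by summing the mask; the name `mask` is introduced here for illustration.

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape(3, 4)
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

mask = X == Y
print(mask.dtype)  # torch.bool
print(mask.int())  # the same pattern written as 0s and 1s
print(mask.sum())  # tensor(2): number of positions where X and Y agree
```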

## Sum all the elements in a Tensor

Summing all the elements in the tensor yields a tensor with only one element.

In [29]:
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

In [28]:
X.sum()

tensor(66.)
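
The one-element result can be turned into a plain Python number with `item()`, and `sum` also accepts a `dim` argument to sum along a single axis instead of over all elements. A small sketch with the same matrix:

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape(3, 4)

total = X.sum().item()  # Python float rather than a one-element tensor
print(total)            # 66.0

# Summing along axis 0 collapses the rows; along axis 1, the columns.
print(X.sum(dim=0))     # tensor([12., 15., 18., 21.])
print(X.sum(dim=1))     # tensor([ 6., 22., 38.])
```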

## Broadcasting

By now, you know how to perform **elementwise binary operations** on two tensors of the same shape. Under certain conditions, even when shapes differ, we can still perform elementwise binary operations by invoking the broadcasting mechanism. Broadcasting works according to the following two-step procedure: 
* (i) expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape.
* (ii) perform an elementwise operation on the resulting arrays.

In [31]:
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

print(a)
print(b)

tensor([[0],
        [1],
        [2]])
tensor([[0, 1]])


Since a and b are $3 \times 1$ and $1 \times 2$ matrices, respectively, their shapes do not match up. Broadcasting produces a larger $3 \times 2$ matrix by replicating matrix a along the columns and matrix b along the rows before adding them elementwise.

In [32]:
a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])
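
The two-step procedure can be made explicit. The sketch below uses `torch.broadcast_tensors` to perform the expansion step on its own, then checks that adding the expanded tensors matches `a + b`. It also shows that shapes which cannot be aligned raise an error.

```python
import torch

a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)

# Step (i): expand both tensors to the common shape (3, 2).
a_big, b_big = torch.broadcast_tensors(a, b)
print(a_big)
print(b_big)

# Step (ii): ordinary elementwise addition on the expanded tensors.
print(torch.equal(a_big + b_big, a + b))  # True

# Lengths 3 and 2 differ and neither is 1, so broadcasting is impossible.
broadcast_failed = False
try:
    torch.arange(3) + torch.arange(2)
except RuntimeError as err:
    broadcast_failed = True
    print("broadcast failed:", err)
```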

## Saving Memory

Running operations can cause new memory to be allocated to hold results. For example, if we write **Y = Y + X**, the name Y stops referring to its original tensor and is pointed at newly allocated memory instead. We can demonstrate this with Python’s **`id()`** function, which returns an identifier that is unique to an object for its lifetime (in CPython, its memory address). Note that after we run **Y = Y + X**, **`id(Y)`** differs from its earlier value. That is because Python first evaluates Y + X, allocating new memory for the result, and then points Y at this new location in memory.

In [34]:
before = id(Y)
print(before)

Y = Y + X

after = id(Y)
print(after)

before == after

4744075600
4744013344


False

This might be **undesirable** for two reasons: 
* First, we do not want to run around **allocating memory unnecessarily** all the time. In machine learning, we often have hundreds of megabytes of parameters and update all of them multiple times per second. Whenever possible, we want to **perform these updates `in place`**.
* Second, we might point at the same parameters from multiple variables. If we do not update in place, we must be careful to update all of these references, lest we spring a **memory leak** or inadvertently refer to **stale parameters**.

1. **Memory Leak:** A memory leak occurs when a program allocates memory but fails to release it when it's no longer needed. In the context of machine learning parameters:
   - If we create new copies of parameters instead of updating in place, we might forget to delete the old versions.
   - Over time, this can lead to accumulation of unused memory, causing the program to consume more and more resources.

2. **Stale Parameters:** "Stale" refers to outdated or obsolete data. In this context:
   - If multiple variables point to the same parameters, and we update by creating a new copy instead of modifying the original, some variables might still reference the old (stale) version.
   - This can lead to inconsistencies in the model, where different parts of the program are using different versions of the same parameters.

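Fortunately, in-place updates are easy to express. Assigning through slice notation, as in `Z[:] = <expression>`, writes the result into the memory that Z already occupies rather than rebinding the name. Below, Z is first allocated with `zeros_like` so that it has the same shape as Y, and its `id` is unchanged after the assignment.

In [37]: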
Z = torch.zeros_like(Y)
print(Z)
print('id(Z):', id(Z))

Z[:] = X + Y
print(Z)
print('id(Z):', id(Z))

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
id(Z): 4744074640
tensor([[ 2.,  3.,  8.,  9.],
        [ 9., 12., 15., 18.],
        [20., 21., 22., 23.]])
id(Z): 4744074640


If the value of **X** is not reused in subsequent computations, we can also use **`X[:] = X + Y`** or **`X += Y`** to reduce the memory overhead of the operation.

In [40]:
before = id(X)
X += Y
after = id(X)
after == before

True
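
The stale-reference problem described earlier can be reproduced in a few lines. In this sketch, all variable names are hypothetical: a second name aliases a tensor, which is then updated once out of place and once in place.

```python
import torch

# Out-of-place update: the name is rebound, the alias keeps the old tensor.
p_out = torch.ones(3)
alias_out = p_out
p_out = p_out + 1
print(alias_out)  # tensor([1., 1., 1.])  (stale)
print(p_out)      # tensor([2., 2., 2.])

# In-place update: both names still refer to the same, updated tensor.
p_in = torch.ones(3)
alias_in = p_in
p_in += 1
print(alias_in)          # tensor([2., 2., 2.])
print(alias_in is p_in)  # True
```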

## Conversion to Other Python Objects