<a href="https://colab.research.google.com/github/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/2-mathematical-building-blocks/1_gears_of_neural_networks_tensor_operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##The gears of neural networks: Tensor operations

Much as any computer program can be ultimately reduced to a small set of binary
operations on binary inputs (AND, OR, NOR, and so on), all transformations learned
by deep neural networks can be reduced to a handful of tensor operations (or tensor functions)
applied to tensors of numeric data. For instance, it’s possible to add tensors,
multiply tensors, and so on.

A Keras layer instance looks like this:

```python
keras.layers.Dense(512, activation="relu")
```

This layer can be interpreted as a function, which takes as input a matrix and returns
another matrix—a new representation for the input tensor. 

Specifically, the function
is as follows (where W is a matrix and b is a vector, both attributes of the layer):

```python
output = relu(dot(input, W) + b)
```

Let’s unpack this. We have three tensor operations here:

- A dot product (dot) between the input tensor and a tensor named W
- An addition (+) between the resulting matrix and a vector b
- A relu operation: relu(x) is max(x, 0); “relu” stands for “rectified linear unit”
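The three operations above can be sketched in plain NumPy. This is only an illustration, not how Keras implements `Dense` internally; the helper names `relu` and `dense_forward`, the shapes, and the random initialization are assumptions made for the example.

```python
import numpy as np

def relu(x):
  # Element-wise max(x, 0)
  return np.maximum(x, 0.0)

def dense_forward(inputs, W, b):
  # inputs: (batch, in_features), W: (in_features, units), b: (units,)
  return relu(np.dot(inputs, W) + b)

rng = np.random.default_rng(0)
inputs = rng.random((4, 8))        # a batch of 4 samples with 8 features each
W = rng.standard_normal((8, 3))    # hypothetical weights for a 3-unit layer
b = np.zeros(3)                    # hypothetical bias vector

output = dense_forward(inputs, W, b)
print(output.shape)                # (4, 3)
```

Each row of `output` is the new representation of the corresponding input row, and every entry is non-negative because of the relu.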

##Element-wise operations

The relu operation and addition are element-wise operations: operations that are
applied independently to each entry in the tensors being considered. This means
these operations are highly amenable to massively parallel implementations.

A naive Python implementation of an element-wise operation uses a for loop over
every entry, as in this element-wise relu:

In [1]:
def naive_relu(x):
  # x is a rank-2 NumPy tensor
  assert len(x.shape) == 2

  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] = max(x[i, j], 0)
  return x

You could do the same for addition:

In [2]:
def naive_add(x, y):
  # x and y are rank-2 NumPy tensors
  assert len(x.shape) == 2
  assert x.shape == y.shape
  
  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] += y[i, j]
  return x

On the same principle, you can do element-wise multiplication, subtraction, and so on.
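For instance, element-wise multiplication might look like the sketch below. The name `naive_multiply` is hypothetical and follows the same pattern as `naive_add`; the result is checked against NumPy's built-in `*` operator.

```python
import numpy as np

def naive_multiply(x, y):
  # x and y are rank-2 NumPy tensors with identical shapes
  assert len(x.shape) == 2
  assert x.shape == y.shape

  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] *= y[i, j]
  return x

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

product = naive_multiply(a, b)
print(np.array_equal(product, a * b))  # True
```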

In practice, when dealing with NumPy arrays, these operations are available as well-optimized
built-in NumPy functions, which themselves delegate the heavy lifting to a
Basic Linear Algebra Subprograms (BLAS) implementation. BLAS are low-level,
highly parallel, efficient tensor-manipulation routines that are typically implemented
in Fortran or C.

So, in NumPy, you can do the following element-wise operation, and it will be blazing
fast:

In [3]:
import numpy as np

In [4]:
x = np.random.random((20, 100))
y = np.random.random((20, 100))

In [5]:
z = x + y
z

array([[0.5833715 , 1.64167321, 0.47598757, ..., 0.94894288, 1.06589035,
        0.57077926],
       [0.79683924, 1.69161099, 0.78681001, ..., 1.02353888, 1.45680047,
        0.93536571],
       [1.5425921 , 1.03753149, 1.65542218, ..., 1.56734636, 0.75605085,
        0.39763872],
       ...,
       [0.87791879, 1.24838634, 0.94751821, ..., 0.47714805, 1.28274589,
        1.19634836],
       [1.79288733, 0.97149845, 0.96372108, ..., 1.14259272, 1.05272603,
        0.70999261],
       [0.73725283, 1.00935213, 1.97691635, ..., 1.08769461, 1.00669283,
        1.40341021]])

In [6]:
z = np.maximum(z, 0)
z

array([[0.5833715 , 1.64167321, 0.47598757, ..., 0.94894288, 1.06589035,
        0.57077926],
       [0.79683924, 1.69161099, 0.78681001, ..., 1.02353888, 1.45680047,
        0.93536571],
       [1.5425921 , 1.03753149, 1.65542218, ..., 1.56734636, 0.75605085,
        0.39763872],
       ...,
       [0.87791879, 1.24838634, 0.94751821, ..., 0.47714805, 1.28274589,
        1.19634836],
       [1.79288733, 0.97149845, 0.96372108, ..., 1.14259272, 1.05272603,
        0.70999261],
       [0.73725283, 1.00935213, 1.97691635, ..., 1.08769461, 1.00669283,
        1.40341021]])

Let’s actually time the difference:

In [7]:
import time

In [8]:
t0 = time.time()

for _ in range(1000):
  z = x + y
  z = np.maximum(z, 0.0)
print("Took: {0:.2f} s".format(time.time() - t0))

Took: 0.01 s


This takes 0.01 s. Meanwhile, the naive version takes a stunning 2.70 s:

In [9]:
t0 = time.time()

for _ in range(1000):
  z = naive_add(x, y)
  z = naive_relu(z)
print("Took: {0:.2f} s".format(time.time() - t0))

Took: 2.70 s


Likewise, when running TensorFlow code on a GPU, element-wise operations are executed
via fully vectorized CUDA implementations that can best utilize the highly parallel
GPU chip architecture.

##Broadcasting

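The naive implementation of `naive_add` shown earlier only supports rank-2 tensors with identical shapes, yet the `Dense` layer adds a rank-2 tensor to a vector. When the shapes of the two tensors differ, and if there’s no ambiguity, the smaller tensor is broadcast to match the shape of the larger tensor.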

Broadcasting consists of two steps:

1. Axes (called broadcast axes) are added to the smaller tensor to match the `ndim` of the larger tensor.
2. The smaller tensor is repeated along these new axes to match the full shape of the larger tensor.

Let’s look at a concrete example. Consider X with shape `(32, 10)` and y with shape `(10,)`:

In [14]:
X = np.random.random((32, 10))    # X is a random matrix with shape (32, 10).
y = np.random.random((10, ))      # y is a random vector with shape (10,).

First, we add an empty first axis to y, whose shape becomes `(1, 10)`:

In [15]:
y = np.expand_dims(y, axis=0)    # The shape of y is now (1, 10).
y.shape

(1, 10)

Then, we repeat y 32 times along this new axis, so that we end up with a tensor Y with shape `(32, 10)`, where `Y[i, :] == y` for `i in range(0, 32)`:

In [31]:
# Repeat y 32 times along axis 0 to obtain Y, which has shape (32, 10).
Y = np.concatenate([y] * 32, axis=0)
Y.shape

(32, 10)

At this point, we can proceed to add X and Y, because they have the same shape.
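As a quick check, the explicit repeat-then-add route gives the same result as letting NumPy broadcast `y` on its own. This is a minimal sketch; the variable names `explicit` and `implicit` are introduced only for illustration.

```python
import numpy as np

X = np.random.random((32, 10))
y = np.random.random((10,))

# Explicit version: add an axis, materialize 32 copies of y, then add.
Y = np.concatenate([np.expand_dims(y, axis=0)] * 32, axis=0)
explicit = X + Y

# Implicit version: NumPy broadcasts y against X automatically.
implicit = X + y

print(np.array_equal(explicit, implicit))  # True
```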

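In terms of implementation, no new rank-2 tensor is created: the repetition operation is entirely virtual. It happens at the
algorithmic level rather than at the memory level. Still, thinking of the vector being repeated 32 times along a new axis is a helpful mental model. Here’s what a naive implementation would look like: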

In [33]:
def naive_add_matrix_and_vector(x, y):
  assert len(x.shape) == 2   # x is a rank-2 NumPy tensor
  assert len(y.shape) == 1   # y is a NumPy vector
  assert x.shape[1] == y.shape[0]   # the last axis of x must match the length of y

  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] += y[j]
  return x            
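Like the naive loop above, NumPy's own broadcasting never builds the repeated matrix. One way to see this, offered here as an illustration rather than as part of the original notebook, is `np.broadcast_to`, which returns a read-only view over the original vector:

```python
import numpy as np

y = np.random.random((10,))

# np.broadcast_to returns a read-only view; the data of y is not copied.
Y_view = np.broadcast_to(y, (32, 10))

print(Y_view.shape)                  # (32, 10)
print(Y_view.strides[0])             # 0: every row points at the same memory
print(np.shares_memory(Y_view, y))   # True
```

A stride of 0 along the first axis means that stepping from one row to the next does not move in memory, which is what "virtual" repetition amounts to.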

With broadcasting, you can generally perform element-wise operations that take two input tensors if one tensor has shape `(a, b, … n, n + 1, … m)` and the other has shape `(n, n + 1, … m)`. The broadcasting will then automatically happen for axes `a` through `n - 1`.
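The trailing-axes rule can be checked without allocating any data. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; the shapes are chosen purely for illustration.

```python
import numpy as np

# Trailing axes match, so the shapes are compatible.
result = np.broadcast_shapes((64, 3, 32, 10), (32, 10))
print(result)  # (64, 3, 32, 10)

# Here only the leading axes match, so broadcasting fails.
failed = False
try:
  np.broadcast_shapes((64, 3, 32, 10), (64, 3))
except ValueError as err:
  failed = True
  print("incompatible shapes:", err)
```

Shapes are aligned from the last axis backwards, which is why `(64, 3)` cannot be broadcast against `(64, 3, 32, 10)` even though its axes match the leading ones.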

The following example applies the element-wise maximum operation to two tensors
of different shapes via broadcasting:

In [34]:
x = np.random.random((64, 3, 32, 10))   # x is a random tensor with shape (64, 3, 32, 10).
y = np.random.random((32, 10))          # y is a random tensor with shape (32, 10).

z = np.maximum(x, y)                    # The output z has shape (64, 3, 32, 10) like x
z.shape

(64, 3, 32, 10)

In [37]:
x = np.random.random((64, 3, 32, 10))   # x is a random tensor with shape (64, 3, 32, 10).
y = np.random.random((3, 32, 10))       # y is a random tensor with shape (3, 32, 10).

z = np.maximum(x, y)                    # The output z has shape (64, 3, 32, 10) like x
z.shape

(64, 3, 32, 10)

##Tensor product