<a href="https://colab.research.google.com/github/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/2-mathematical-building-blocks/1_gears_of_neural_networks_tensor_operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##The gears of neural networks: Tensor operations

Much as any computer program can be ultimately reduced to a small set of binary
operations on binary inputs (AND, OR, NOR, and so on), all transformations learned
by deep neural networks can be reduced to a handful of tensor operations (or tensor functions)
applied to tensors of numeric data. For instance, it’s possible to add tensors,
multiply tensors, and so on.

A Keras layer instance looks like this:

```python
keras.layers.Dense(512, activation="relu")
```

This layer can be interpreted as a function that takes a matrix as input and returns
another matrix, a new representation for the input tensor. Specifically, the function
is as follows (where `W` is a matrix and `b` is a vector, both attributes of the layer):

```python
output = relu(dot(input, W) + b)
```

Let’s unpack this. We have three tensor operations here:

- A dot product (dot) between the input tensor and a tensor named W
- An addition (+) between the resulting matrix and a vector b
- A relu operation: relu(x) is max(x, 0); “relu” stands for “rectified linear unit”
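
To make the three operations concrete, here is a minimal NumPy sketch of the same computation. The helper names `relu` and `dense_forward`, the input shape, and the random initialization are illustrative assumptions; this is not how Keras implements `Dense` internally.

```python
import numpy as np

def relu(x):
  # Element-wise max(x, 0)
  return np.maximum(x, 0)

def dense_forward(inputs, W, b):
  # inputs: (batch, in_features), W: (in_features, units), b: (units,)
  # The vector b is added to every row of the dot-product result
  return relu(np.dot(inputs, W) + b)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 8))   # hypothetical batch: 4 samples, 8 features
W = rng.normal(size=(8, 512))      # 512 units, matching the Dense layer above
b = np.zeros(512)

output = dense_forward(inputs, W, b)
print(output.shape)  # (4, 512)
```

The output has one row per input sample and one column per unit, and every entry is non-negative because of the final relu.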

##Element-wise operations

The relu operation and addition are element-wise operations: operations that are
applied independently to each entry in the tensors being considered. This means
these operations are highly amenable to massively parallel implementations.
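
One way to see why independence matters for parallelism: splitting a tensor into chunks, applying the operation to each chunk separately, and stitching the results back together gives the same answer as processing the whole tensor at once. The sketch below illustrates this with relu on a small hand-written array; the two-chunk split is an arbitrary choice for illustration, and the chunks are processed sequentially here rather than on separate workers.

```python
import numpy as np

x = np.array([[-1.0, 2.0, -3.0],
              [4.0, -5.0, 6.0],
              [-7.0, 8.0, -9.0],
              [10.0, -11.0, 12.0]])

# Apply relu to the whole tensor at once
full = np.maximum(x, 0)

# Apply relu to two row chunks separately, as two workers could
chunks = np.array_split(x, 2, axis=0)
pieces = [np.maximum(chunk, 0) for chunk in chunks]
stitched = np.concatenate(pieces, axis=0)

print(np.array_equal(full, stitched))  # True
```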

If you want to write a naive Python implementation of
an element-wise operation, you use a for loop, as in this naive implementation of an
element-wise relu operation:

```python
def naive_relu(x):
  # x is a rank-2 NumPy tensor
  assert len(x.shape) == 2

  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] = max(x[i, j], 0)
  return x
```

You could do the same for addition:

```python
def naive_add(x, y):
  # x and y are rank-2 NumPy tensors of the same shape
  assert len(x.shape) == 2
  assert x.shape == y.shape

  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] += y[i, j]
  return x
```

On the same principle, you can do element-wise multiplication, subtraction, and so on.
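
For instance, a naive element-wise multiplication could follow the same pattern. The function name `naive_multiply` and the small test arrays are illustrative assumptions; the result is checked against NumPy's built-in `*` operator.

```python
import numpy as np

def naive_multiply(x, y):
  # x and y are rank-2 NumPy tensors of the same shape
  assert len(x.shape) == 2
  assert x.shape == y.shape

  # Avoid overwriting the input tensor
  x = x.copy()
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x[i, j] *= y[i, j]
  return x

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# The loop-based version agrees with NumPy's element-wise product
print(np.array_equal(naive_multiply(a, b), a * b))  # True
```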

In practice, when dealing with NumPy arrays, these operations are available as well-optimized
built-in NumPy functions implemented in compiled code. For linear-algebra routines such as
the dot product, NumPy delegates the heavy lifting to a
Basic Linear Algebra Subprograms (BLAS) implementation. BLAS are low-level,
highly parallel, efficient tensor-manipulation routines that are typically implemented
in Fortran or C.

So, in NumPy, you can do the following element-wise operation, and it will be blazing
fast:

```python
import numpy as np

x = np.random.random((20, 100))
y = np.random.random((20, 100))

# Element-wise addition
z = x + y
z
```

```
array([[1.01274148, 1.61155269, 1.53795073, ..., 0.77090228, 1.54068284,
        1.09296087],
       [1.73714791, 0.40342784, 1.33469845, ..., 0.54809805, 0.34937233,
        1.55131   ],
       [0.39046875, 0.80679872, 0.99828881, ..., 1.04383479, 0.90647764,
        1.37171019],
       ...,
       [1.25587013, 1.55416512, 0.8519494 , ..., 0.59000461, 1.33213851,
        1.63504904],
       [1.41566618, 1.70147724, 1.67182684, ..., 0.27070605, 0.34975674,
        0.86700909],
       [0.60044965, 0.19018441, 1.08574581, ..., 0.43884102, 0.47323858,
        1.10698798]])
```

```python
# Element-wise relu
z = np.maximum(z, 0)
z
```

```
array([[1.01274148, 1.61155269, 1.53795073, ..., 0.77090228, 1.54068284,
        1.09296087],
       [1.73714791, 0.40342784, 1.33469845, ..., 0.54809805, 0.34937233,
        1.55131   ],
       [0.39046875, 0.80679872, 0.99828881, ..., 1.04383479, 0.90647764,
        1.37171019],
       ...,
       [1.25587013, 1.55416512, 0.8519494 , ..., 0.59000461, 1.33213851,
        1.63504904],
       [1.41566618, 1.70147724, 1.67182684, ..., 0.27070605, 0.34975674,
        0.86700909],
       [0.60044965, 0.19018441, 1.08574581, ..., 0.43884102, 0.47323858,
        1.10698798]])
```

Let’s actually time the difference:

```python
import time

t0 = time.time()

for _ in range(1000):
  z = x + y
  z = np.maximum(z, 0.0)
print("Took: {0:.2f} s".format(time.time() - t0))
```

```
Took: 0.01 s
```


This takes 0.01 s. Meanwhile, the naive version takes a stunning 2.73 s:

```python
t0 = time.time()

for _ in range(1000):
  z = naive_add(x, y)
  z = naive_relu(z)
print("Took: {0:.2f} s".format(time.time() - t0))
```

```
Took: 2.73 s
```
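
A single wall-clock measurement can be noisy, so a slightly more careful comparison might use the standard library's `timeit` module and keep the best of several repeats. The sketch below is one way to do that. The helper names `vectorized` and `naive`, the fused loop, and the repeat counts are assumptions chosen for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

x = np.random.random((20, 100))
y = np.random.random((20, 100))

def vectorized(x, y):
  # Element-wise add followed by relu, using NumPy built-ins
  return np.maximum(x + y, 0.0)

def naive(x, y):
  # Same computation with explicit Python loops
  z = x.copy()
  for i in range(z.shape[0]):
    for j in range(z.shape[1]):
      z[i, j] = max(z[i, j] + y[i, j], 0.0)
  return z

# Both versions should agree before we compare their speed
assert np.allclose(vectorized(x, y), naive(x, y))

# Take the best of several repeats to reduce timing noise
t_vec = min(timeit.repeat(lambda: vectorized(x, y), number=100, repeat=3)) / 100
t_naive = min(timeit.repeat(lambda: naive(x, y), number=10, repeat=3)) / 10

print("vectorized: {0:.6f} s per call".format(t_vec))
print("naive:      {0:.6f} s per call".format(t_naive))
```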


Likewise, when running TensorFlow code on a GPU, element-wise operations are executed
via fully vectorized CUDA implementations that can best utilize the highly parallel
GPU chip architecture.

##Broadcasting