
# Hidden State Activation : Ungraded Lecture Notebook

In this notebook you'll take another look at the hidden state activation function. It can be written in two different ways. 

I'll show you, step by step, how to implement each of them and then how to verify that both produce the same result.

## Background

![vanilla rnn](https://github.com/amanjeetsahu/Natural-Language-Processing-Specialization/raw/d562105e68a0b85012ad3ebbb29b2af6344ad4e5/Natural%20Language%20Processing%20with%20Sequence%20Models/Week%202/vanilla_rnn.PNG)


This is the hidden state activation function for a vanilla RNN.

$h^{<t>}=g(W_{h}[h^{<t-1>},x^{<t>}] + b_h)$                                                    

Which is another way of writing this:         

$h^{<t>}=g(W_{hh}h^{<t-1>} \oplus W_{hx}x^{<t>} + b_h)$                                        

Where 

- $W_{h}$ in the first formula denotes the *horizontal* concatenation of $W_{hh}$ and $W_{hx}$ from the second formula.

- $W_{h}$ in the first formula is then multiplied by $[h^{<t-1>},x^{<t>}]$, another concatenation of terms from the second formula, but this time in a different direction, i.e. *vertical*.
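
The reason the two forms agree can be sketched with block matrix multiplication. This is a short derivation, not part of the original lecture; it assumes the shapes are compatible and reads $\oplus$ as element-wise addition, which is how the code in this notebook treats it:

$$W_{h}[h^{<t-1>},x^{<t>}] = \begin{bmatrix} W_{hh} & W_{hx} \end{bmatrix} \begin{bmatrix} h^{<t-1>} \\ x^{<t>} \end{bmatrix} = W_{hh}h^{<t-1>} + W_{hx}x^{<t>}$$

Each column block of $W_h$ only ever meets the matching row block of the stacked vector, so the single large product splits into the sum of the two smaller ones.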

Let us see what this means computationally.

## Imports

In [2]:
import numpy as np


## Joining (Concatenation)

### Weights

A join along the vertical boundary is called a *horizontal concatenation* or *horizontal stack*. 

Visually, it looks like this:- $W_h = \left [ W_{hh} \ | \ W_{hx} \right ]$

I'll show you two different ways to achieve this using numpy.

__Note: The values used to populate the arrays, below, have been chosen to aid in visual illustration only. They are NOT what you'd expect to use building a model, which would typically be random variables instead.__

* Try using random initializations for the weight arrays.

In [3]:
# Create some dummy data

# w_hh = np.full((3, 2), 1)  # illustration purposes only, returns an array of size 3x2 filled with all 1s
# w_hx = np.full((3, 3), 9)  # illustration purposes only, returns an array of size 3x3 filled with all 9s


### START CODE HERE ###
# Try using some random initializations, though it will obfuscate the join. eg: uncomment these lines
w_hh = np.random.standard_normal((3,2))
w_hx = np.random.standard_normal((3,3))
### END CODE HERE ###

print("-- Data --\n")
print("w_hh :")
print(w_hh)
print("w_hh shape :", w_hh.shape, "\n")
print("w_hx :")
print(w_hx)
print("w_hx shape :", w_hx.shape, "\n")

# Joining the arrays
print("-- Joining --\n")
# Option 1: concatenate - horizontal
w_h1 = np.concatenate((w_hh, w_hx), axis=1)
print("option 1 : concatenate\n")
print("w_h :")
print(w_h1)
print("w_h shape :", w_h1.shape, "\n")

# Option 2: hstack
w_h2 = np.hstack((w_hh, w_hx))
print("option 2 : hstack\n")
print("w_h :")
print(w_h2)
print("w_h shape :", w_h2.shape)

-- Data --

w_hh :
[[-1.1602095   1.07698639]
 [ 0.53377345  0.37280108]
 [-1.69456594  0.39087425]]
w_hh shape : (3, 2) 

w_hx :
[[-1.50859014 -0.55515867 -0.38592884]
 [-0.42146834  0.93811223 -0.54972085]
 [ 0.70142032  0.41936974  0.65612773]]
w_hx shape : (3, 3) 

-- Joining --

option 1 : concatenate

w_h :
[[-1.1602095   1.07698639 -1.50859014 -0.55515867 -0.38592884]
 [ 0.53377345  0.37280108 -0.42146834  0.93811223 -0.54972085]
 [-1.69456594  0.39087425  0.70142032  0.41936974  0.65612773]]
w_h shape : (3, 5) 

option 2 : hstack

w_h :
[[-1.1602095   1.07698639 -1.50859014 -0.55515867 -0.38592884]
 [ 0.53377345  0.37280108 -0.42146834  0.93811223 -0.54972085]
 [-1.69456594  0.39087425  0.70142032  0.41936974  0.65612773]]
w_h shape : (3, 5)
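
As a quick sanity check on the printout above, the two joins can also be compared programmatically. This is a minimal sketch; the seeded generator `rng` and the variable names are illustrative choices, not part of the original notebook:

```python
import numpy as np

# Illustrative data: a seeded generator keeps the example reproducible
rng = np.random.default_rng(0)
w_hh = rng.standard_normal((3, 2))
w_hx = rng.standard_normal((3, 3))

# Two ways of joining side by side
w_h_concat = np.concatenate((w_hh, w_hx), axis=1)
w_h_hstack = np.hstack((w_hh, w_hx))

# Both joins give the same array, and the original blocks can be sliced back out
same = np.array_equal(w_h_concat, w_h_hstack)
left_block_ok = np.array_equal(w_h_concat[:, :2], w_hh)
right_block_ok = np.array_equal(w_h_concat[:, 2:], w_hx)
print("identical joins:", same)
print("blocks recovered:", left_block_ok and right_block_ok)
```

Slicing the joined array back into its two blocks is a handy way to confirm which columns belong to $W_{hh}$ and which to $W_{hx}$ once the values are random.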


### Hidden State & Inputs
Joining along a horizontal boundary is called a vertical concatenation or vertical stack. Visually it looks like this:

$[h^{<t-1>},x^{<t>}] = \left[ \frac{h^{<t-1>}}{x^{<t>}} \right]$


I'll show you two different ways to achieve this using numpy.

*Try using random initializations for the hidden state and input matrices.*


In [4]:
# Create some more dummy data
h_t_prev = np.full((2, 1), 1)  # illustration purposes only, returns an array of size 2x1 filled with all 1s
x_t = np.full((3, 1), 9)       # illustration purposes only, returns an array of size 3x1 filled with all 9s

# Try using some random initializations, though it will obfuscate the join. eg: uncomment these lines

### START CODE HERE ###
# h_t_prev = np.random.standard_normal((2,1))
# x_t = np.random.standard_normal((3,1))
### END CODE HERE ###

print("-- Data --\n")
print("h_t_prev :")
print(h_t_prev)
print("h_t_prev shape :", h_t_prev.shape, "\n")
print("x_t :")
print(x_t)
print("x_t shape :", x_t.shape, "\n")

# Joining the arrays
print("-- Joining --\n")

# Option 1: concatenate - vertical
ax_1 = np.concatenate(
    (h_t_prev, x_t), axis=0
)  # note the difference in axis parameter vs earlier
print("option 1 : concatenate\n")
print("ax_1 :")
print(ax_1)
print("ax_1 shape :", ax_1.shape, "\n")

# Option 2: vstack
ax_2 = np.vstack((h_t_prev, x_t))
print("option 2 : vstack\n")
print("ax_2 :")
print(ax_2)
print("ax_2 shape :", ax_2.shape)

-- Data --

h_t_prev :
[[1]
 [1]]
h_t_prev shape : (2, 1) 

x_t :
[[9]
 [9]
 [9]]
x_t shape : (3, 1) 

-- Joining --

option 1 : concatenate

ax_1 :
[[1]
 [1]
 [9]
 [9]
 [9]]
ax_1 shape : (5, 1) 

option 2 : vstack

ax_2 :
[[1]
 [1]
 [9]
 [9]
 [9]]
ax_2 shape : (5, 1)
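
The choice of axis matters here because the two column vectors have different numbers of rows. The following sketch, which reuses the dummy shapes from the cell above, shows that joining them along `axis=1` is rejected by numpy:

```python
import numpy as np

h_t_prev = np.full((2, 1), 1)  # 2x1 column vector
x_t = np.full((3, 1), 9)       # 3x1 column vector

# Vertical join: only the number of columns has to match
stacked = np.concatenate((h_t_prev, x_t), axis=0)
print("axis=0 shape:", stacked.shape)

# Horizontal join: the row counts (2 vs 3) differ, so numpy raises ValueError
try:
    np.concatenate((h_t_prev, x_t), axis=1)
    wrong_axis_failed = False
except ValueError as err:
    wrong_axis_failed = True
    print("axis=1 fails:", err)
```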


## Verify Formulas
Now that you know how to do the concatenations, horizontal and vertical, let's verify that the two formulas produce the same result.

__Formula 1:__ $h^{<t>}=g(W_{h}[h^{<t-1>},x^{<t>}] + b_h)$ 

__Formula 2:__ $h^{<t>}=g(W_{hh}h^{<t-1>} \oplus W_{hx}x^{<t>} + b_h)$


To prove:- __Formula 1__ $\Leftrightarrow$ __Formula 2__

We will ignore the bias term $b_h$ and the activation function $g(\ )$ because the transformation will be identical for each formula. So what we really want to compare is the result of the following parameters inside each formula:

$W_{h}[h^{<t-1>},x^{<t>}] \quad \Leftrightarrow \quad W_{hh}h^{<t-1>} \oplus W_{hx}x^{<t>} $

We'll see how to do this using matrix multiplication combined with the data and techniques (stacking/concatenating) from above.

* Try adding a sigmoid activation function and bias term to the checks for completeness.


In [5]:
# Data

w_hh = np.full((3, 2), 1)  # returns an array of size 3x2 filled with all 1s
w_hx = np.full((3, 3), 9)  # returns an array of size 3x3 filled with all 9s
h_t_prev = np.full((2, 1), 1)  # returns an array of size 2x1 filled with all 1s
x_t = np.full((3, 1), 9)       # returns an array of size 3x1 filled with all 9s


# If you want to randomize the values, uncomment the next 4 lines

# w_hh = np.random.standard_normal((3,2))
# w_hx = np.random.standard_normal((3,3))
# h_t_prev = np.random.standard_normal((2,1))
# x_t = np.random.standard_normal((3,1))

# Results
print("-- Results --")
# Formula 1
stack_1 = np.hstack((w_hh, w_hx))
stack_2 = np.vstack((h_t_prev, x_t))

print("\nFormula 1")
print("Term1:\n",stack_1)
print("Term2:\n",stack_2)
formula_1 = np.matmul(np.hstack((w_hh, w_hx)), np.vstack((h_t_prev, x_t)))
print("Output:")
print(formula_1)

# Formula 2
mul_1 = np.matmul(w_hh, h_t_prev)
mul_2 = np.matmul(w_hx, x_t)
print("\nFormula 2")
print("Term1:\n",mul_1)
print("Term2:\n",mul_2)

formula_2 = np.matmul(w_hh, h_t_prev) + np.matmul(w_hx, x_t)
print("\nOutput:")
print(formula_2, "\n")

# Verification 
# np.allclose checks whether two arrays are element-wise equal within a tolerance; see
# https://numpy.org/doc/stable/reference/generated/numpy.allclose.html

print("-- Verify --")
print("Results are the same :", np.allclose(formula_1, formula_2))

### START CODE HERE ###
# # Try adding a sigmoid activation function and bias term as a final check
# # Activation
# def sigmoid(x):
#     return 1 / (1 + np.exp(-x))

# # Bias and check
# b = np.random.standard_normal((formula_1.shape[0],1))
# print("Formula 1 Output:\n",sigmoid(formula_1+b))
# print("Formula 2 Output:\n",sigmoid(formula_2+b))

# all_close = np.allclose(sigmoid(formula_1+b), sigmoid(formula_2+b))
# print("Results after activation are the same :",all_close)
### END CODE HERE ###

-- Results --

Formula 1
Term1:
 [[1 1 9 9 9]
 [1 1 9 9 9]
 [1 1 9 9 9]]
Term2:
 [[1]
 [1]
 [9]
 [9]
 [9]]
Output:
[[245]
 [245]
 [245]]

Formula 2
Term1:
 [[2]
 [2]
 [2]]
Term2:
 [[243]
 [243]
 [243]]

Output:
[[245]
 [245]
 [245]] 

-- Verify --
Results are the same : True
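
The same comparison can be packaged as two versions of a single RNN step. This is a sketch under stated assumptions: the function names `rnn_step_concat` and `rnn_step_split` are hypothetical, `tanh` stands in for the activation $g$, and $W_{hh}$ is taken to be square so that the output can be fed back in as the next hidden state:

```python
import numpy as np

def rnn_step_concat(w_hh, w_hx, b_h, h_prev, x_t):
    """Formula 1: one matrix product on the concatenated arrays."""
    w_h = np.hstack((w_hh, w_hx))   # (n_h, n_h + n_x)
    hx = np.vstack((h_prev, x_t))   # (n_h + n_x, 1)
    return np.tanh(w_h @ hx + b_h)

def rnn_step_split(w_hh, w_hx, b_h, h_prev, x_t):
    """Formula 2: two separate matrix products, then element-wise addition."""
    return np.tanh(w_hh @ h_prev + w_hx @ x_t + b_h)

# Illustrative sizes and random parameters
rng = np.random.default_rng(42)
n_h, n_x = 4, 3
w_hh = rng.standard_normal((n_h, n_h))
w_hx = rng.standard_normal((n_h, n_x))
b_h = rng.standard_normal((n_h, 1))
h_prev = rng.standard_normal((n_h, 1))
x_t = rng.standard_normal((n_x, 1))

h_concat = rnn_step_concat(w_hh, w_hx, b_h, h_prev, x_t)
h_split = rnn_step_split(w_hh, w_hx, b_h, h_prev, x_t)
print("Both forms agree:", np.allclose(h_concat, h_split))
```

`np.allclose` is used rather than exact equality because the two forms add floating-point terms in a different order, which can produce tiny rounding differences.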


# Working with JAX numpy and calculating perplexity: Ungraded Lecture Notebook

Normally you would import `numpy` and rename it as `np`. 

However in this week's assignment you will notice that this convention has been changed. 

Now standard `numpy` is not renamed and `trax.fastmath.numpy` is renamed as `np`. 

The rationale behind this change is that you will be using Trax's numpy (which is compatible with JAX) far more often. Trax's numpy supports most of the same functions as the regular numpy so the change won't be noticeable in most cases.


In [6]:
!pip install -q -U trax
import numpy
import trax
import trax.fastmath.numpy as np

# Setting random seeds
#trax.supervised.trainer_lib.init_random_number_generators(32)
numpy.random.seed(32)

One important change to take into consideration is that the types of the resulting objects will be different depending on the version of numpy. With regular numpy you get `numpy.ndarray` but with Trax's numpy you will get `jax.interpreters.xla.DeviceArray`. These two types map to each other. So if you find some error logs mentioning DeviceArray type, don't worry about it, treat it like you would treat an ndarray and march ahead.

You can get a randomized numpy array by using the `numpy.random.random()` function.

This is one of the functionalities that Trax's numpy does not currently support in the same way as the regular numpy. 

In [7]:
numpy_array = numpy.random.random((5,10))
print(f"The regular numpy array looks like this:\n\n {numpy_array}\n")
print(f"It is of type: {type(numpy_array)}")

The regular numpy array looks like this:

 [[0.85888927 0.37271115 0.55512878 0.95565655 0.7366696  0.81620514
  0.10108656 0.92848807 0.60910917 0.59655344]
 [0.09178413 0.34518624 0.66275252 0.44171349 0.55148779 0.70371249
  0.58940123 0.04993276 0.56179184 0.76635847]
 [0.91090833 0.09290995 0.90252139 0.46096041 0.45201847 0.99942549
  0.16242374 0.70937058 0.16062408 0.81077677]
 [0.03514717 0.53488673 0.16650012 0.30841038 0.04506241 0.23857613
  0.67483453 0.78238275 0.69520163 0.32895445]
 [0.49403187 0.52412136 0.29854125 0.46310814 0.98478429 0.50113492
  0.39807245 0.72790532 0.86333097 0.02616954]]

It is of type: <class 'numpy.ndarray'>


You can easily cast regular numpy arrays or lists into trax numpy arrays using the `trax.fastmath.numpy.array()` function:

In [8]:
trax_numpy_array = np.array(numpy_array)
print(f"The trax numpy array looks like this:\n\n {trax_numpy_array}\n")
print(f"It is of type: {type(trax_numpy_array)}")



The trax numpy array looks like this:

 [[0.8588893  0.37271115 0.55512875 0.9556565  0.7366696  0.81620514
  0.10108656 0.9284881  0.60910916 0.59655344]
 [0.09178413 0.34518623 0.6627525  0.44171348 0.5514878  0.70371246
  0.58940125 0.04993276 0.56179184 0.7663585 ]
 [0.91090834 0.09290995 0.9025214  0.46096042 0.45201847 0.9994255
  0.16242374 0.7093706  0.16062407 0.81077677]
 [0.03514718 0.5348867  0.16650012 0.30841038 0.04506241 0.23857613
  0.67483455 0.7823827  0.69520164 0.32895446]
 [0.49403188 0.52412134 0.29854125 0.46310815 0.9847843  0.50113493
  0.39807245 0.72790533 0.86333096 0.02616954]]

It is of type: <class 'jaxlib.xla_extension.DeviceArray'>


Hopefully you now understand the differences (and similarities) between these two versions of numpy. **Great!**

The previous section was a quick look at Trax's numpy. However this notebook also aims to teach you how you can calculate the perplexity of a trained model.


## Calculating Perplexity

Perplexity is a metric that measures how well a probability model predicts a sample, and it is commonly used to evaluate language models. It is defined as:

$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}$$

As an implementation trick, you would usually take the log of that formula. This lets you use the log probabilities that the `RNN` outputs directly, and it turns the $N$-th root into a factor of $\frac{1}{N}$ and the product into a sum, which makes the computation less complicated and more efficient. You should also exclude the padding when calculating the perplexity, because including it could make the measure look artificially good. The algebra behind this process is explained next:


$$\log P(W) = \log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\right)$$

$$ = \log\left(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}\right)^{\frac{1}{N}}$$

$$ = \log\left(\prod_{i=1}^{N} P(w_i| w_1,...,w_{i-1})\right)^{-\frac{1}{N}}$$

$$ = -\frac{1}{N}\log\left(\prod_{i=1}^{N} P(w_i| w_1,...,w_{i-1})\right)$$

$$ = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i| w_1,...,w_{i-1})$$
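
To make the algebra concrete, here is a small numeric sketch. The three probabilities are made-up values standing in for $P(w_i| w_1,...,w_{i-1})$, and regular `numpy` is imported without an alias to match the convention of this notebook:

```python
import numpy

# Hypothetical conditional probabilities of three target tokens
probs = numpy.array([0.5, 0.25, 0.125])
n = probs.size

# Direct definition: N-th root of the product of inverse probabilities
ppx_direct = numpy.prod(1.0 / probs) ** (1.0 / n)

# Log form: negative mean of the log probabilities, then exponentiate
log_ppx = -numpy.sum(numpy.log(probs)) / n
ppx_from_log = numpy.exp(log_ppx)

print(f"direct: {ppx_direct}, via logs: {ppx_from_log}")
```

Both routes give a perplexity of about 4, since the product of the inverse probabilities is $2 \cdot 4 \cdot 8 = 64$ and its cube root is 4. On long sequences the direct product can underflow or overflow, which is one more practical reason to work with the sum of logs.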

You will be working with a real example from this week's assignment. The example is made up of:
   - `predictions` : batch of tensors corresponding to lines of text predicted by the model.
   - `targets` : batch of actual tensors corresponding to lines of text.

In [9]:
!git clone https://github.com/amanjeetsahu/Natural-Language-Processing-Specialization.git

Cloning into 'Natural-Language-Processing-Specialization'...
remote: Enumerating objects: 586, done.
remote: Total 586 (delta 0), reused 0 (delta 0), pack-reused 586
Receiving objects: 100% (586/586), 292.43 MiB | 36.11 MiB/s, done.
Resolving deltas: 100% (75/75), done.
Checking out files: 100% (496/496), done.


In [10]:
from trax import layers as tl

# Load from .npy files
predictions = numpy.load('/content/Natural-Language-Processing-Specialization/Natural Language Processing with Sequence Models/Week 2/predictions.npy')
targets = numpy.load('/content/Natural-Language-Processing-Specialization/Natural Language Processing with Sequence Models/Week 2/targets.npy')
# Cast to jax.interpreters.xla.DeviceArray
predictions = np.array(predictions)
targets = np.array(targets)

# Print shapes
print(f'predictions has shape: {predictions.shape}')
print(f'targets has shape: {targets.shape}')



predictions has shape: (32, 64, 256)
targets has shape: (32, 64)


Notice that the predictions have an extra dimension with the same length as the size of the vocabulary used.

Because of this you will need a way of reshaping `targets` to match this shape. For this you can use `trax.layers.one_hot()`.

Notice that `predictions.shape[-1]` will return the size of the last dimension of `predictions`.

In [11]:
reshaped_targets = tl.one_hot(targets, predictions.shape[-1]) #trax's one_hot function takes the input as one_hot(x, n_categories, dtype=optional)
print(f'reshaped_targets has shape: {reshaped_targets.shape}')

reshaped_targets has shape: (32, 64, 256)


By multiplying the predictions by the reshaped targets and summing across the last dimension, you pick out the log probability that the model assigned to each target token. These per-token values are the terms that make up the total log perplexity:

In [12]:
total_log_ppx = np.sum(predictions * reshaped_targets, axis= -1)

In [13]:
total_log_ppx.shape

(32, 64)
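
The one-hot product works because every entry of the one-hot vector is zero except at the target index, so the sum keeps exactly one log probability per position. The sketch below illustrates this in regular `numpy` on toy data. It is not the Trax implementation: `numpy.eye(vocab)[targets]` is used as a stand-in for `tl.one_hot`, and the shapes and random values are made up:

```python
import numpy

rng = numpy.random.default_rng(0)
batch, seq_len, vocab = 2, 4, 5

# Toy log probabilities: a log-softmax over random scores
scores = rng.standard_normal((batch, seq_len, vocab))
log_probs = scores - numpy.log(numpy.sum(numpy.exp(scores), axis=-1, keepdims=True))
targets = rng.integers(0, vocab, size=(batch, seq_len))

# One-hot route: indexing an identity matrix gives shape (batch, seq_len, vocab)
one_hot = numpy.eye(vocab)[targets]
via_one_hot = numpy.sum(log_probs * one_hot, axis=-1)

# Gather route: index the last axis directly with the target ids
via_gather = numpy.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]

print("shapes:", via_one_hot.shape, via_gather.shape)
print("same values:", numpy.allclose(via_one_hot, via_gather))
```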

Now you will need to account for the padding so that it does not skew this metric (remember that a lower perplexity means a better model). To identify which elements are padding, you can use `np.equal()` to compare `targets` with the padding value `0`. Subtracting that result from `1.0` gives a tensor with `1s` in the positions of actual values and `0s` where there is padding.

In [16]:
#test
np.equal(np.array([0, 1, 3]), 1)

DeviceArray([False,  True, False], dtype=bool)

In [17]:
targets

DeviceArray([[105, 110,  32, ...,   0,   0,   0],
             [ 97, 110, 110, ...,   0,   0,   0],
             [111, 102,  32, ...,   0,   0,   0],
             ...,
             [105,  32,  97, ...,   0,   0,   0],
             [101, 100, 103, ...,   0,   0,   0],
             [121, 111, 117, ...,   0,   0,   0]], dtype=int32)

In [18]:
non_pad = 1.0 - np.equal(targets, 0)
print(f'non_pad has shape: {non_pad.shape}\n')
print(f'non_pad looks like this: \n\n {non_pad}')

non_pad has shape: (32, 64)

non_pad looks like this: 

 [[1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 ...
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]]


By computing the product of the total log perplexity and the non_pad tensor we remove the effect of padding on the metric:

In [20]:
real_log_ppx = total_log_ppx * non_pad
print(f'real perplexity still has shape: {real_log_ppx.shape}')

real perplexity still has shape: (32, 64)


You can check the effect of filtering out the padding by looking at the two log perplexity tensors:

In [21]:
print(f'log perplexity tensor before filtering padding: \n\n {total_log_ppx}\n')
print(f'log perplexity tensor after filtering padding: \n\n {real_log_ppx}')

log perplexity tensor before filtering padding: 

 [[ -5.396545    -1.0311184   -0.66916656 ... -22.37673    -23.18771
  -21.843483  ]
 [ -4.5857706   -1.1341286   -8.538033   ... -20.15686    -26.837097
  -23.57502   ]
 [ -5.2223887   -1.2824144   -0.17312431 ... -21.328228   -19.854412
  -33.88444   ]
 ...
 [ -5.396545   -17.291681    -4.360766   ... -20.825802   -21.065838
  -22.443115  ]
 [ -5.9313164  -14.247417    -0.2637329  ... -26.743248   -18.38433
  -22.355278  ]
 [ -5.670536    -0.10595131   0.         ... -23.332523   -28.087376
  -23.878807  ]]

log perplexity tensor after filtering padding: 

 [[ -5.396545    -1.0311184   -0.66916656 ...  -0.          -0.
   -0.        ]
 [ -4.5857706   -1.1341286   -8.538033   ...  -0.          -0.
   -0.        ]
 [ -5.2223887   -1.2824144   -0.17312431 ...  -0.          -0.
   -0.        ]
 ...
 [ -5.396545   -17.291681    -4.360766   ...  -0.          -0.
   -0.        ]
 [ -5.9313164  -14.247417    -0.2637329  ...  -0.          -0.
   -0.        ]
 [ -5.670536    -0.10595131   0.         ...  -0.          -0.
   -0.        ]]


To get a single average log perplexity across all the elements in the batch, you can sum across both dimensions and divide by the number of non-padding elements. Notice that this average is the negative of the log perplexity of the model, so you need to flip its sign:

In [22]:
log_ppx = np.sum(real_log_ppx) / np.sum(non_pad)
log_ppx = -log_ppx
print(f'The log perplexity and perplexity of the model are respectively: {log_ppx} and {np.exp(log_ppx)}')

The log perplexity and perplexity of the model are respectively: 2.3281209468841553 and 10.258646965026855
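
The steps of this section can be collected into one helper. This is a minimal sketch in regular `numpy`, not the assignment's solution: the function name `log_perplexity`, the `pad_id` parameter, and the toy data are all hypothetical, and the one-hot encoding is built with `numpy.eye` instead of `tl.one_hot`:

```python
import numpy

def log_perplexity(log_probs, targets, pad_id=0):
    """Average negative log probability of the non-padding target tokens."""
    vocab_size = log_probs.shape[-1]
    one_hot = numpy.eye(vocab_size)[targets]                    # (batch, seq, vocab)
    token_log_probs = numpy.sum(log_probs * one_hot, axis=-1)   # (batch, seq)
    non_pad = (targets != pad_id).astype(float)                 # 1 for real tokens, 0 for padding
    return -numpy.sum(token_log_probs * non_pad) / numpy.sum(non_pad)

# Toy batch: one sequence, three positions, vocabulary of two tokens (0 is padding)
probs = numpy.array([[[0.5, 0.5],
                      [0.75, 0.25],
                      [0.1, 0.9]]])
targets = numpy.array([[1, 1, 0]])  # the last position is padding

log_ppx = log_perplexity(numpy.log(probs), targets)
print(f"log perplexity: {log_ppx}, perplexity: {numpy.exp(log_ppx)}")
```

Only the first two positions count here, with target probabilities 0.5 and 0.25, so the perplexity should come out close to $\sqrt{2 \cdot 4} \approx 2.83$ no matter what the model predicted at the padded position.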
