In [55]:
!wget https://github.com/werowe/HypatiaAcademy/raw/refs/heads/master/ml/my_array.npy


--2025-09-07 15:05:22--  https://github.com/werowe/HypatiaAcademy/raw/refs/heads/master/ml/my_array.npy
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/werowe/HypatiaAcademy/refs/heads/master/ml/my_array.npy [following]
--2025-09-07 15:05:22--  https://raw.githubusercontent.com/werowe/HypatiaAcademy/refs/heads/master/ml/my_array.npy
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44528 (43K) [application/octet-stream]
Saving to: ‘my_array.npy’


2025-09-07 15:05:23 (1.03 MB/s) - ‘my_array.npy’ saved [44528/44528]



In [56]:
import numpy as np

X=np.load('my_array.npy')



# How to Compute Attention Over Bi-LSTM Hidden States

## Steps for Computing Attention

### 1. Gather Bi-LSTM Hidden States

Stack the Bi-LSTM hidden states for all input timesteps:  
$ a^{\langle 1 \rangle}, a^{\langle 2 \rangle}, \ldots, a^{\langle T_x \rangle} $  
These are typically concatenated forward and backward states per timestep.
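
To make the shapes concrete, here is a minimal NumPy sketch of the stacking step. The arrays `a_fw` and `a_bw` are random stand-ins for real LSTM outputs, and the sizes (batch 5, 30 timesteps, 16 units per direction) are assumptions chosen to match the shapes that appear later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, T_x, units = 5, 30, 16  # assumed sizes, for illustration only

# Hypothetical stand-ins for the per-timestep outputs of each direction.
a_fw = rng.normal(size=(batch, T_x, units))  # forward LSTM states
a_bw = rng.normal(size=(batch, T_x, units))  # backward LSTM states

# a[:, i, :] plays the role of a^{<i>}: forward and backward states
# joined along the feature axis.
a = np.concatenate([a_fw, a_bw], axis=-1)
print(a.shape)  # (5, 30, 32)
```

The feature dimension doubles (16 to 32) while the time axis is unchanged, so there is still one vector per input timestep.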

### 2. Use Decoder State

Take the decoder hidden state from the previous output step:  
$ s^{\langle t-1 \rangle} $  
This is the most recent decoder state available when attention is computed for output step $t$.

### 3. Combine for “Energy” Scores

For each encoder timestep $ i $, concatenate $ a^{\langle i \rangle} $ and $ s^{\langle t-1 \rangle} $.  
Feed this combined vector into a small neural network (“energy” function), typically a dense layer with tanh/relu activation:

$$
e_i = f_{\text{energy}}(a^{\langle i \rangle}, s^{\langle t-1 \rangle})
$$

This yields a scalar energy score for each encoder position.
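
As a rough illustration of the energy step, the NumPy sketch below scores every encoder position with one dense layer followed by tanh. The weights `W` and `b`, the random inputs, and all sizes are hypothetical; in a real model the weights are learned and the energy network may be deeper.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, T_x, enc_dim, dec_dim = 5, 30, 32, 32  # assumed sizes

a = rng.normal(size=(batch, T_x, enc_dim))  # encoder states a^{<i>}
s_prev = rng.normal(size=(batch, dec_dim))  # decoder state s^{<t-1>}

# Hypothetical parameters of a dense layer with a single output unit.
W = rng.normal(scale=0.1, size=(enc_dim + dec_dim, 1))
b = np.zeros(1)

# Repeat the decoder state along the time axis so it can be
# concatenated with the encoder state at every position.
s_rep = np.repeat(s_prev[:, None, :], T_x, axis=1)  # (batch, T_x, dec_dim)
concat = np.concatenate([a, s_rep], axis=-1)        # (batch, T_x, enc_dim + dec_dim)

# One scalar energy per encoder position.
e = np.tanh(concat @ W + b)[..., 0]                 # (batch, T_x)
print(e.shape)  # (5, 30)
```

With tanh as the final activation, every energy lies in $[-1, 1]$, which limits how peaked the attention weights can become after the softmax.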

### 4. Compute Attention Weights

Collect all $ e_i $ and apply softmax across all encoder positions to get attention weights:

$$
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{T_x} \exp(e_j)}
$$
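
The softmax above can be sketched in NumPy as follows. The helper name `softmax_over_positions` and the sample energies are made up for illustration. Subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax_over_positions(e):
    """Softmax along the last axis, i.e. across encoder positions."""
    shifted = e - e.max(axis=-1, keepdims=True)  # avoids overflow in exp
    w = np.exp(shifted)
    return w / w.sum(axis=-1, keepdims=True)

# Two example sequences with three encoder positions each.
e = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
alpha = softmax_over_positions(e)
print(alpha)
```

Each row of `alpha` is non-negative and sums to 1. Equal energies (second row) give uniform weights, and a larger energy always gets a larger weight.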

### 5. Calculate Context Vector

Use the attention weights to compute a weighted sum over the Bi-LSTM hidden states:

$$
\text{context}^{\langle t \rangle} = \sum_{i=1}^{T_x} \alpha_i \cdot a^{\langle i \rangle}
$$

This context vector is fed into the decoder LSTM for generating output at time $t$.
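
A small NumPy sketch of the weighted sum, assuming random encoder states and arbitrary normalized weights (both are placeholders, not values from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, T_x, enc_dim = 5, 30, 32  # assumed sizes

a = rng.normal(size=(batch, T_x, enc_dim))  # encoder states

# Placeholder attention weights: positive, normalized over positions.
alpha = rng.random(size=(batch, T_x))
alpha /= alpha.sum(axis=1, keepdims=True)

# context[b] = sum_i alpha[b, i] * a[b, i, :]
context = np.einsum('bt,btd->bd', alpha, a)       # (batch, enc_dim)

# The same computation written with broadcasting.
context_alt = (alpha[..., None] * a).sum(axis=1)  # (batch, enc_dim)

# Sanity check: one-hot weights simply select a single encoder state.
one_hot = np.zeros((batch, T_x))
one_hot[:, 7] = 1.0
selected = np.einsum('bt,btd->bd', one_hot, a)
print(context.shape)  # (5, 32)
```

The context vector has the same dimensionality as a single encoder state, whatever the sequence length.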

---

## Illustrated Flow

| Step             | Inputs                       | Operation                   | Output             |
|------------------|-----------------------------|-----------------------------|--------------------|
| Energy scoring   | $(a^{\langle i \rangle}, s^{\langle t-1 \rangle})$ | Dense network               | $e_i$ per position |
| Softmax          | $[e_1, \ldots, e_{T_x}]$     | Normalization               | $[\alpha_1, \ldots, \alpha_{T_x}]$ |
| Weighted sum     | $(a^{\langle i \rangle}, \alpha_i)$ | Dot product       | context vector     |

---

## Notes

- The process is the same for a Bi-LSTM as for a standard LSTM, except that each $a^{\langle i \rangle}$ has twice the dimensionality because the forward and backward states are concatenated.
- This setup allows the decoder to dynamically "attend" to different parts of the input at every output step, based on both encoder features and current decoding state.

This is the standard attention mechanism used in classic sequence-to-sequence models with Bi-LSTM encoders.

In [49]:
X.shape

(5, 30, 37)

`model = tf.keras.Model(inputs, [forward_seq, backward_seq, h_fw, h_bw])`

Defines a Keras model with four outputs, passed as a list in the second (`outputs`) argument:

1. The time series of forward hidden states (all timesteps)
2. The time series of backward hidden states (all timesteps)
3. The final forward hidden state (after the forward LSTM has read the last input timestep)
4. The final backward hidden state (after the backward LSTM, which reads the sequence in reverse, has reached the first input timestep)


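With `return_sequences=True` and `return_state=True`, the Bi-LSTM layer returns five outputs: `sequence, h_fw, c_fw, h_bw, c_bw`

* sequence: The hidden states at every timestep, with forward and backward states concatenated. Shape: (batch_size, timesteps, 2 * units) = (5, 30, 32)

* h_fw: The final hidden state (output at the last timestep) from the forward LSTM. Shape: (batch_size, units) = (5, 16)

* c_fw: The final cell state from the forward LSTM.

* h_bw: The final hidden state from the backward LSTM.

* c_bw: The final cell state from the backward LSTM.
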

In [50]:
sequence.shape

(None, 30, 32)

**Typical process per decoder step:**

- For each position $i$, **concatenate** the encoder hidden state at $i$ ($a^{\langle i \rangle}$, from `sequence[:, i, :]`) with the decoder's previous hidden state ($s^{\langle t-1 \rangle}$).
- Pass each concatenated vector through a small dense network (often a single Dense layer with tanh, sometimes a deeper network) to get one energy score per position.
- **Collect the energy scores** for all positions into a vector, then apply softmax over encoder positions (across timesteps) to get the attention weights ($\alpha_i$).
- Compute the **context vector** as the sum of the encoder hidden states weighted by $\alpha_i$.
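
The per-position concatenation does not have to be an explicit Python loop. Before its activation, a dense layer is linear, so applying it to the concatenation $[a^{\langle i \rangle}; s^{\langle t-1 \rangle}]$ equals applying one block of the weights to $a^{\langle i \rangle}$, another block to $s^{\langle t-1 \rangle}$, and adding the results. The decoder term can then be computed once and broadcast over all positions. The NumPy sketch below checks this equivalence; the weight matrix `W`, the random inputs, and the sizes are illustrative assumptions, and the bias is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, T_x, enc_dim, dec_dim = 5, 30, 32, 32  # assumed sizes

a = rng.normal(size=(batch, T_x, enc_dim))    # encoder states
s_prev = rng.normal(size=(batch, dec_dim))    # previous decoder state
W = rng.normal(scale=0.1, size=(enc_dim + dec_dim, 1))  # hypothetical weights

# Loop version: concatenate and score one position at a time.
e_loop = np.stack(
    [np.tanh(np.concatenate([a[:, i, :], s_prev], axis=-1) @ W)[:, 0]
     for i in range(T_x)],
    axis=1,
)  # (batch, T_x)

# Split version: separate weight blocks for encoder and decoder inputs.
W_a, W_s = W[:enc_dim], W[enc_dim:]
decoder_term = (s_prev @ W_s)[:, None, :]         # (batch, 1, 1), computed once
e_split = np.tanh(a @ W_a + decoder_term)[..., 0]  # (batch, T_x)

print(np.allclose(e_loop, e_split))
```

Both versions produce the same energies. The split form does less redundant work when the number of encoder positions is large.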


The variable `s_prev`, the decoder's previous hidden state, **does not come directly from your Bi-LSTM encoder** at every attention step. Here is how it is initialized and updated in encoder-decoder (seq2seq) architectures:

## Where Does `s_prev` Come From?

- **At the first decoder step (t=1):**
  - The decoder's initial hidden state (`s_prev`) is typically set using the **final hidden states of the encoder**.
  - With a Bi-LSTM encoder, this is often done by **concatenating or transforming** the final forward and backward hidden states (e.g., from `h_fw` and `h_bw`).
  - If your decoder’s hidden state dimension is not the same as the concatenated encoder states, use a Dense (fully connected) layer to project them to the correct dimension.

- **For subsequent decoding steps (t > 1):**
  - The decoder LSTM updates its hidden state internally, based on its previous output and the context vector from attention. The resulting state (`s_t`) is then passed as `s_prev` into the next step.

## Typical Initialization Summary
- **s_prev at t=1:**  
  Concatenate `h_fw` and `h_bw` (shape `(batch, 32)`), then possibly pass through a Dense layer to match decoder hidden size (shape `(batch, n_s)`).
- **s_prev at t > 1:**  
  The decoder hidden state produced at the previous step.
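
The initialization at t=1 might look like the following NumPy sketch. The decoder size `n_s = 64`, the projection weights `W_init` and `b_init`, and the tanh activation are all assumptions made for illustration; in a Keras model the projection would be a trainable Dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, units, n_s = 5, 16, 64  # n_s: hypothetical decoder hidden size

# Random stand-ins for the encoder's final hidden states.
h_fw = rng.normal(size=(batch, units))
h_bw = rng.normal(size=(batch, units))

# Concatenate the two directions: (batch, 32).
h_cat = np.concatenate([h_fw, h_bw], axis=-1)

# Hypothetical projection to the decoder size; learned in a real model.
W_init = rng.normal(scale=0.1, size=(2 * units, n_s))
b_init = np.zeros(n_s)
s_prev = np.tanh(h_cat @ W_init + b_init)  # (batch, n_s)

print(h_cat.shape, s_prev.shape)  # (5, 32) (5, 64)
```

If the decoder hidden size already equals `2 * units`, the projection can be skipped and `h_cat` used directly as the initial `s_prev`.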

## Example for First Step:



In [57]:
import tensorflow as tf
from tensorflow.keras.layers import Layer, Concatenate, Dense
from tensorflow.keras import Input, Model

# 1. Define the input tensor for the model: shape (sequence_length, feature_dim)
inputs = Input(shape=(30, 37))  # (batch_size, 30, 37)

# 2. Create a bidirectional LSTM layer.
#    - `units=16` means each LSTM (forward and backward) has 16 hidden units.
#    - `return_sequences=True` so we get output at every timestep (not just last).
#    - `return_state=True` so we get the final h, c for both directions.
units = 16
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
)

# 3. Pass the inputs through the Bi-LSTM.
#    - 'sequence' is the full output (batch, time, features).
#    - h_fw, c_fw: final hidden + cell for the forward LSTM.
#    - h_bw, c_bw: final hidden + cell for the backward LSTM.
sequence, h_fw, c_fw, h_bw, c_bw = bilstm(inputs)
fw_last = h_fw  # Forward direction last output (batch, units)
bw_last = h_bw  # Backward direction last output (batch, units)

# 4. Prepare initial decoder hidden state for attention/decoding.
#    - Here we concatenate both last forward and backward states.
#    - Shape: (batch, 32)
s_prev = Concatenate()([fw_last, bw_last])

# 5. Define a manual attention layer as a custom Keras Layer.
class ManualAttention(Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        # Concatenation layer to combine encoder hidden state and decoder state
        self.concat = Concatenate(axis=-1)
        # Dense layer to score (energy) each encoder timestep with respect to decoder state
        self.energy_fc = Dense(1, activation='tanh')
        self.units = units

    def call(self, sequence, s_prev):
        # sequence: shape (batch, timesteps, 2*units)
        # s_prev: shape (batch, decoder_state_dim)
        energies = []
        # Iterate over each encoder timestep to compute attention "energy"
        for i in range(sequence.shape[1]):  # For every time step
            a_i = sequence[:, i, :]            # Get encoder state at timestep i
            concat = self.concat([a_i, s_prev])# Concatenate with decoder state
            e_i = self.energy_fc(concat)       # Pass through energy scoring dense layer
            energies.append(e_i)               # Collect score for this timestep
        # energies: list of (batch,1) => stack into (batch, timesteps, 1)
        energies = tf.stack(energies, axis=1)  
        energies = tf.squeeze(energies, axis=-1) # (batch, timesteps)
        # Apply softmax over all encoder timesteps to get attention weights
        alphas = tf.nn.softmax(energies, axis=1) 
        alphas_expanded = tf.expand_dims(alphas, axis=-1) # (batch, timesteps, 1)
        # Compute the context vector as weighted sum of encoder states
        context = tf.reduce_sum(alphas_expanded * sequence, axis=1)  # (batch, 2*units)
        return context, alphas        # context: summary vector; alphas: attention weights

# 6. Create an instance of the ManualAttention layer
#    - Here, units*2 since encoder output per timestep is concatenated (fw+bw)
manual_attention = ManualAttention(units*2)

# 7. Use the ManualAttention layer to compute context and attention weights.
#    - context: weighted sum of encoder outputs
#    - alphas: attention weights (can be interpreted as where the model "looked")
context, alphas = manual_attention(sequence, s_prev)
