
### The Need for LSTM: Addressing RNN Limitations
While **Recurrent Neural Networks (RNNs)** are designed for sequential data, in practice they retain mostly short-term context. This limitation arises from two closely related challenges:

**Long-Term Dependencies:** RNNs struggle to remember relationships between elements separated by long distances in a sequence. For example, understanding a sentence's meaning might require connecting its first and last words, which traditional RNNs often fail to achieve.

**Vanishing Gradients:** During training with backpropagation through time, the gradient is multiplied by the recurrent weight and the activation derivative at every time step, so the gradients that reach early time steps can become extremely small. This "vanishing gradient" problem prevents the network from learning how early inputs affect later outputs, hindering the RNN's ability to capture long-term dependencies.

These shortcomings motivated the development of Long Short-Term Memory (LSTM) networks. LSTMs are specifically designed to address these issues and effectively model long-range dependencies in sequential data, making them a powerful tool for various text processing tasks.
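
The vanishing-gradient effect described above can be observed directly with autograd. The following is a minimal sketch, assuming a scalar tanh RNN with no input and a fixed recurrent weight of 0.5; the function name `gradient_through_time` and all values are illustrative and not part of the original notebook.

```python
import torch

def gradient_through_time(steps, w_hh=0.5):
    """Gradient of the final hidden state w.r.t. the initial one in a scalar tanh RNN."""
    h0 = torch.tensor(0.5, requires_grad=True)
    h = h0
    for _ in range(steps):
        h = torch.tanh(w_hh * h)  # the same recurrent weight is reused at every step
    h.backward()
    return h0.grad.item()

# The gradient shrinks roughly geometrically as the sequence gets longer
for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(steps))
```

Each step multiplies the gradient by `w_hh * (1 - tanh(...)**2)`, which is below 0.5 here, so after 50 steps almost no signal reaches the first time step.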

### How LSTMs Handle RNN Limitations
LSTMs tackle the challenges of traditional RNNs through their unique architecture:

**Memory Cells:** LSTMs introduce memory cells that act like information storage units. These cells can retain crucial information over long sequences, enabling the network to "remember" important context from earlier inputs.

**Gating Mechanisms:** LSTMs utilize gates to regulate the flow of information into and out of the memory cells. These gates control what information gets stored, discarded, and utilized, allowing the network to selectively retain and access relevant information over extended periods.

By incorporating memory cells and gating mechanisms, LSTMs effectively address the following:

**Long-Term Dependencies:** Because the memory cells can store information and make it available many steps later, LSTMs can capture relationships between elements separated by significant distances in a sequence.

**Vanishing Gradients:** The cell state is updated additively, and the forget gate scales how much of it carries over, so gradients can flow back through many time steps without shrinking as rapidly as in a plain RNN. This mitigates the vanishing gradient problem, although it does not eliminate it entirely.
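
To illustrate how the gated cell-state path affects gradient flow, here is a simplified sketch. It assumes every forget gate takes the same fixed value and the gated input is a constant 0.1; in a real LSTM both depend on the inputs. The helper `cell_state_gradient` is hypothetical.

```python
import torch

def cell_state_gradient(steps, forget_value):
    """Gradient of the final cell state w.r.t. the initial one when every
    forget gate equals `forget_value` and the gated input is a constant."""
    c0 = torch.tensor(1.0, requires_grad=True)
    c = c0
    for _ in range(steps):
        # c_t = f_t * c_{t-1} + (gated input, held at 0.1 for illustration)
        c = forget_value * c + 0.1
    c.backward()
    return c0.grad.item()

print(cell_state_gradient(50, forget_value=0.99))  # forget gate mostly open
print(cell_state_gradient(50, forget_value=0.5))   # forget gate half closed
```

Along this path the gradient is simply the product of the forget gate values, so a gate that stays close to 1 preserves the gradient over many steps, while a gate that stays well below 1 still lets it decay.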

In [None]:
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install numpy==1.26.0


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

### LSTM: A Deep Dive into Long-Term Memory
In RNNs, we typically maintain a single hidden state to capture information from previous inputs. LSTMs introduce an additional state, the cell state, which acts as a long-term memory. The cell state is what allows important information to be retained over extended sequences.
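
The difference between one state and two is visible in PyTorch's built-in layers. This is a small sketch using `nn.RNN` and `nn.LSTM`; the sizes (3 input features, 4 hidden units, sequences of length 5) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes chosen only for illustration
rnn = nn.RNN(input_size=3, hidden_size=4, batch_first=True)
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)

x = torch.randn(2, 5, 3)  # (batch, sequence length, features)

_, h_rnn = rnn(x)              # the RNN returns only a final hidden state
_, (h_lstm, c_lstm) = lstm(x)  # the LSTM returns a final hidden state and a final cell state

print(h_rnn.shape)
print(h_lstm.shape, c_lstm.shape)
```

Both final states of the LSTM have the same shape as the RNN's hidden state; the cell state is an extra tensor carried alongside it.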

Along with the cell state, LSTMs incorporate three essential gates that regulate the flow of information into and out of the cell state:

**Forget Gate:** This gate decides what information to discard from the cell state. It analyzes the previous hidden state and the current input, assigning a value between 0 and 1 to each element in the cell state. A value of 0 indicates complete forgetting, while 1 signifies retention.

**Input Gate:** This gate determines what new information to store in the cell state. From the previous hidden state and the current input, it computes a value between 0 and 1 for each element of a candidate cell state, so that only the relevant parts of the candidate are added to the cell state.

**Output Gate:** This gate controls what information from the cell state is passed to the hidden state. It is computed from the previous hidden state and the current input, and its values (between 0 and 1) scale the tanh of the updated cell state to produce the new hidden state.

These three gates work together to regulate the flow of information through the LSTM cell, enabling it to capture long-term dependencies and mitigate the vanishing gradient problem. We will discuss each gate in detail in the following sections.

I hope this provides a clear introduction to the core components of LSTM networks, setting the stage for a deeper exploration of each gate's functionality.
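
For reference, the gates described above are commonly written as the equations below. This is the standard textbook formulation rather than anything specific to this notebook, and the weight names ($W_{xf}$, $W_{hf}$, and so on) are an assumed naming chosen to resemble the variables in the code cells; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$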


In [29]:
# This cell sets up the short-term memory (RNN-style) parameters; see my RNN model notebook for details
X = [1.0, 5.0, 7.0, 4.0]  # Input values, processed sequentially
W_xh = torch.tensor([-10.0], requires_grad=True)  # Weight applied to the current input (input-to-hidden)
W_hh = torch.tensor([10.0], requires_grad=True)  # Weight applied to the previous hidden state (hidden-to-hidden)
b_h = torch.tensor([0.0], requires_grad=True)  # Bias term for the hidden state calculation
x_t = 1  # Current input value (placeholder)
h_prev = torch.tensor([-1.0], requires_grad=True)  # Initial hidden state (responsible for short term memory)
W_hy = torch.tensor([4.0], requires_grad=True)  # Weight applied to the hidden state for output (hidden-to-output)
b_y = torch.tensor([5.0], requires_grad=True)  # Bias term for the output calculation
y_hat_t = torch.tensor([15.0], requires_grad=True)  # Target output (expected value)
ct_prev = torch.tensor([0.0], requires_grad=True) # Initial cell state (responsible for long term memory)


### Forget Gate
When we multiply the $f_t$ vector with $c_{t-1}$, this vector is responsible for removing the unimportant information from the long-term memory $c_t$.



In [30]:
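# Preview of a sigmoid gate activation computed from the RNN weights; this value is not used later
f_t = torch.sigmoid(W_xh * x_t + W_hh * h_prev)
W_hf = torch.tensor([0.5], requires_grad=True)  # Forget gate weight applied to the previous hidden state
b_f = torch.tensor([0.5], requires_grad=True)  # Forget gate bias

def forget_gate(W_xh, W_hf, b_f, x_t, h_prev):
  # f_t = sigmoid(x_t . W_xh + h_prev . W_hf + b_f); every entry lies strictly between 0 and 1
  # The loop further down passes the RNN input weight W_xh as this gate's input weight
  f_t = torch.sigmoid(torch.matmul(x_t, W_xh) + torch.matmul(h_prev, W_hf) + b_f)
  return f_t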
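
To make the effect of the forget gate concrete, here is a small worked example with hand-picked numbers. The tensors `c_prev` and `f_t` are hypothetical values, not outputs of the notebook's model.

```python
import torch

c_prev = torch.tensor([2.0, -1.5, 0.8])  # hypothetical long-term memory from the previous step
f_t = torch.tensor([1.0, 0.0, 0.5])      # hypothetical forget gate values

kept = f_t * c_prev  # element-wise product: what survives from the old memory
print(kept)
```

The first entry is kept in full, the second is erased, and the third is halved, which is exactly the "value between 0 and 1" behaviour described for the gate.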

### Input Gate
When we multiply the $i_t$ vector with $\tilde{c}_t$, this vector is responsible for deciding what new information should be added to the long-term memory $c_t$.

In [35]:
W_xc = torch.tensor([0.5], requires_grad=True)  # Candidate cell state: weight on the current input
W_hc = torch.tensor([0.5], requires_grad=True)  # Candidate cell state: weight on the previous hidden state
b_c = torch.tensor([0.5], requires_grad=True)  # Candidate cell state: bias
W_xi = torch.tensor([0.5], requires_grad=True)  # Input gate: weight on the current input
W_hi = torch.tensor([0.5], requires_grad=True)  # Input gate: weight on the previous hidden state
b_i = torch.tensor([0.5], requires_grad=True)  # Input gate: bias

def calculate_candidate_cell_state(W_xc, W_hc, b_c, x_t, h_prev):
  # Candidate values in (-1, 1) that could be written to the cell state
  c_tilde = torch.tanh(torch.matmul(x_t, W_xc) + torch.matmul(h_prev, W_hc) + b_c)
  return c_tilde

def input_gate(W_xi, W_hi, b_i, x_t, h_prev):
  # i_t in (0, 1) decides how much of each candidate value is written
  i_t = torch.sigmoid(torch.matmul(x_t, W_xi) + torch.matmul(h_prev, W_hi) + b_i)
  # Returns the gated candidate i_t * c_tilde; W_xc, W_hc and b_c are read from the global scope
  return i_t * calculate_candidate_cell_state(W_xc, W_hc, b_c, x_t, h_prev)
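
The following sketch shows the input step on its own with hand-picked numbers. All three tensors are hypothetical values chosen for illustration.

```python
import torch

c_after_forget = torch.tensor([0.0, 1.0])       # hypothetical cell state after the forget step
i_t = torch.tensor([1.0, 0.0])                  # hypothetical input gate values
c_tilde = torch.tanh(torch.tensor([2.0, 2.0]))  # candidate values, squashed into (-1, 1)

c_t = c_after_forget + i_t * c_tilde  # only the gated part of the candidate is added
print(c_t)
```

Where the input gate is 1 the candidate is written into the memory, and where it is 0 the existing memory is left unchanged.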

### Output Gate
When we multiply the $o_t$ vector with $c_t$, this vector is responsible for determining which part of the long-term memory $c_t$ should be outputted at the current timestep.

In [36]:
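W_ho = torch.tensor([0.5], requires_grad=True)  # Output gate weight; output_gate applies it to the current input x_t
W_oh = torch.tensor([0.5], requires_grad=True)  # Output gate weight (defined here but not used by the code in this notebook)
b_o = torch.tensor([0.5], requires_grad=True)  # Output gate bias

def output_gate(W_ho, W_hh, b_o, x_t, h_prev):
  # o_t in (0, 1) decides how much of the cell state is exposed as the hidden state
  o_t = torch.sigmoid(torch.matmul(x_t, W_ho) + torch.matmul(h_prev, W_hh) + b_o)
  # ct_prev is read from the global scope; when called from the loop it already holds the updated cell state
  return o_t * torch.tanh(ct_prev)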
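
As a quick illustration of the output step, the sketch below applies hand-picked gate values to a hand-picked cell state. Both tensors are hypothetical.

```python
import torch

c_t = torch.tensor([3.0, -3.0, 0.5])  # hypothetical updated cell state
o_t = torch.tensor([1.0, 0.5, 0.0])   # hypothetical output gate values

h_t = o_t * torch.tanh(c_t)  # tanh squashes the memory into (-1, 1) before gating
print(h_t)
```

Because of the tanh and the gate, every entry of the hidden state stays between -1 and 1, and an output gate value of 0 hides that part of the memory completely.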

In [37]:
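for x in X:
  x_t = torch.tensor([float(x)], requires_grad=True)

  # Forget step: scale the old cell state by f_t
  ct_prev = ct_prev * forget_gate(W_xh, W_hf, b_f, x_t, h_prev)
  # Input step: add the gated candidate i_t * c_tilde; ct_prev now holds the updated cell state c_t
  ct_prev = ct_prev + input_gate(W_xi, W_hi, b_i, x_t, h_prev)
  # Output step: h_t = o_t * tanh(c_t)
  h_t = output_gate(W_ho, W_hh, b_o, x_t, h_prev)
  h_prev = h_t

print(h_t)  # Hidden state after the last input
y_t = torch.sigmoid(W_hy * h_t + b_y)  # Output computed from the final hidden state
print(y_t)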

tensor([0.7355], grad_fn=<MulBackward0>)
tensor([0.9996], grad_fn=<SigmoidBackward0>)
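
As a cross-check of the gate logic, the sketch below writes one LSTM step by hand and compares it with PyTorch's `nn.LSTMCell`. It is an illustration, not the notebook's model: it uses vector-valued states, random weights taken from the `nn.LSTMCell` instance, and a hypothetical helper named `manual_lstm_step`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

def manual_lstm_step(x_t, h_prev, c_prev, cell):
    """One LSTM step written out by hand, reading the weights from an nn.LSTMCell."""
    # PyTorch stacks the four gate blocks in the order: input, forget, candidate, output
    gates = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i_t, f_t, o_t = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    c_tilde = torch.tanh(g)
    c_t = f_t * c_prev + i_t * c_tilde  # forget old memory, then add gated candidate
    h_t = o_t * torch.tanh(c_t)         # expose part of the memory as the hidden state
    return h_t, c_t

x_t = torch.randn(2, input_size)      # batch of 2 inputs
h_prev = torch.randn(2, hidden_size)  # arbitrary previous hidden state
c_prev = torch.randn(2, hidden_size)  # arbitrary previous cell state

h_manual, c_manual = manual_lstm_step(x_t, h_prev, c_prev, cell)
h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_manual, h_ref, atol=1e-5), torch.allclose(c_manual, c_ref, atol=1e-5))
```

Unlike the loop above, this version gives every gate its own input and hidden weights, which is how the built-in layer is parameterised.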
