# NLP with DL

# RNN (Recurrent Neural Networks) with Natural Language Processing

## Recurrent Neural Networks (RNN)

RNNs are a kind of deep learning architecture that is generally used to predict the next step in a sequence. **The biggest difference between them and other deep learning structures is that they remember. Another difference is that, while in other neural networks each input is independent of the others, in RNNs the inputs are related to each other**. RNNs make associations between inputs to predict the next step, and they retain those associations while they are being trained.

![image.png](attachment:ef164f5d-5c47-4a8a-8266-09b983b84913.png)

**Two important problems: vanishing gradients vs. exploding gradients**

However, vanishing and exploding gradients are two frequent issues that arise during RNN backpropagation. In the **vanishing gradient problem, the gradient values shrink rapidly, and training effectively stops**. The model can't capture the relationships between the beginning and the end of long sentences. In the **exploding gradient problem, the gradient values grow exponentially, and the model weights can become NaN because the updates are numerically unstable**. The model can't learn anything from the training data. Vanishing gradients are also the main cause of the short-term memory problem.
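Both problems come from the same mechanism: backpropagation through time multiplies the gradient by a similar factor at every time step. The sketch below is a deliberately simplified illustration that treats this factor as a single constant. The helper `backpropagated_scale` and the factors 0.9 and 1.1 are assumptions chosen for the demonstration, not values taken from a real network.

```python
def backpropagated_scale(factor: float, steps: int) -> float:
    """Scale of a gradient after being multiplied by `factor` once per time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


for steps in (10, 50, 100):
    vanishing = backpropagated_scale(0.9, steps)  # factor below 1: shrinks toward 0
    exploding = backpropagated_scale(1.1, steps)  # factor above 1: grows without bound
    print(f"{steps:>3} steps: 0.9 -> {vanishing:.3e}   1.1 -> {exploding:.3e}")
```

A factor only slightly below or above 1 is enough: over 100 steps the first gradient becomes negligible while the second grows by several orders of magnitude.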

**Short-term Memory**

Short-term memory is a problem for recurrent neural networks due to vanishing gradient issues. They'll have difficulty transferring information from earlier time steps to later ones if the sequence is lengthy enough. If you're attempting to predict anything from a paragraph of text, RNNs may leave out essential information from the beginning of the sequence.

**The vanishing gradient problem affects recurrent neural networks during backpropagation**. Gradients are values that are used to update the weights of a neural network. When a gradient diminishes as it backpropagates through time, this is known as the vanishing gradient issue. When a gradient value falls below a certain threshold, it no longer contributes much to learning.

- updated new weight = weight – learning rate * gradient
- 1.01  = 1.01001 – 0.00001
- 1.00999 = 1.01 – 0.00001
- 1.00998 = 1.00999 – 0.00001

(As can be seen above, when the gradient value is too small, the update in the weights almost comes to a halt.)
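The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `sgd_update` and the specific learning rate and gradient are assumptions, chosen so that their product equals the 0.00001 step used in the example.

```python
def sgd_update(weight: float, gradient: float, learning_rate: float) -> float:
    """One plain gradient-descent step: new weight = weight - learning rate * gradient."""
    return weight - learning_rate * gradient


weight = 1.01001
learning_rate = 0.001  # assumed value
tiny_gradient = 0.01   # assumed value; learning_rate * tiny_gradient = 0.00001

for step in range(1, 4):
    weight = sgd_update(weight, tiny_gradient, learning_rate)
    print(f"step {step}: weight = {weight:.5f}")
```

After three updates the weight has moved by only about 0.00003, which is why layers receiving such small gradients effectively stop learning.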

Layers that get a tiny gradient update in recurrent neural networks stop learning. These are generally the earliest layers, which in an unrolled RNN correspond to the earliest time steps. Because these layers don't learn, RNNs can forget what they've seen earlier in longer sequences, resulting in short-term memory.

**RNNs work very well on short sequences (like short sentences) and have a lower computational cost**. However, they can't maintain this performance on long sequences. **Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)**, which are improved variants of RNN, are used as a solution to this problem.

**How does RNN work?**

**Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to process sequential data by maintaining internal memory**. They are particularly effective in tasks that involve sequential or time-dependent data, such as natural language processing and speech recognition.

In an RNN, the words are first converted into numeric vectors. The RNN then processes the vectors in the sequence one at a time. During processing, it passes the previous hidden state from one step to the next. The hidden state acts as the neural network's memory: it stores information from the previous steps.
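The loop described above can be sketched in NumPy as follows. This is an illustrative toy, not a trained model: the sizes, the random weights, the random "word vectors", and the names `W_xh`, `W_hh`, `b_h`, and `rnn_step` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5  # toy sizes (assumed)

# Randomly initialised parameters; a real model would learn these.
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)


def rnn_step(x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)


# Stand-in for a sentence of 5 words, each already converted to a vector.
word_vectors = rng.normal(size=(seq_len, input_size))

h = np.zeros(hidden_size)  # the memory starts empty
states = []
for x_t in word_vectors:
    h = rnn_step(x_t, h)   # the new state depends on the input and the old state
    states.append(h)

hidden_states = np.stack(states)
print(hidden_states.shape)
```

The same weights are reused at every step; only the hidden state `h` changes, which is what lets it carry information forward through the sequence.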

![image.png](attachment:5a4ab4f1-d59d-440f-8f95-49967eca1b6f.png)

**By updating the hidden state at each time step and leveraging the information from past inputs, RNNs can model sequential relationships and, in principle, capture long-term dependencies in the data. However, traditional RNNs suffer from the "vanishing gradient" problem, so in practice their ability to capture a dependency diminishes as the distance between the related time steps grows.**

## Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)

![image.png](attachment:78744448-3efc-44d3-9e2f-ccd517883b24.png)

**Internal processes called gates in the LSTM and GRU regulate the flow of information**. These gates figure out which data in a sequence should be kept and which should be discarded. They can then send important information along long sequences to create predictions.

Before diving into the details of LSTM and GRU, here is a brief explanation of the tanh and sigmoid activation functions used in the gates:

**Tanh Activations**:

When vectors pass through a neural network, they undergo many transformations as a result of various math operations. Without regulation, some values could grow very large and make the other values insignificant. The tanh activation prevents this by squashing its input, so that **all vector values stay between -1 and +1**.

**Sigmoid Activations**:

Sigmoid activations are used in the gates. The sigmoid is similar to tanh, but **instead of squishing values into the range -1 to 1, it squishes them into the range 0 to 1**. Because any number multiplied by 0 equals 0, values multiplied by a gate output near 0 vanish, or are "forgotten", while values multiplied by a gate output near 1 are kept. This is useful for updating or forgetting data.
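A short NumPy sketch makes the two ranges and the gating effect concrete. The sample values and the names `sigmoid`, `gate`, and `memory` are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("tanh   :", np.round(np.tanh(values), 3))  # stays within (-1, 1)
print("sigmoid:", np.round(sigmoid(values), 3))  # stays within (0, 1)

# Using sigmoid-like outputs as a gate: element-wise multiplication
# keeps, halves, or erases each entry of the memory vector.
gate = np.array([0.0, 0.5, 1.0])
memory = np.array([2.0, 2.0, 2.0])
print("gated  :", gate * memory)
```

The element-wise product is the whole trick behind a gate: a value near 0 erases the corresponding entry, and a value near 1 lets it through.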

![image.png](attachment:537d4ea3-f383-495d-b542-df1772155c1c.png)

Gates are fundamental components in LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) architectures that allow these models to selectively control the flow of information through the recurrent neural network. They enable the models to capture and preserve relevant information over long sequences and mitigate the vanishing gradient problem. The function of the gates in LSTM and GRU is explained below:

### Long Short-Term Memory (LSTM)

In an LSTM cell, three separate gates (forget gate, input gate, and output gate) control the flow of information into and out of the cell state.

1) **Forget gate**: This gate determines how much of the information in the previous cell state to retain. The previous hidden state and the current input are passed through a sigmoid function for this purpose. The closer the resulting value is to 0, the more of the information is forgotten, and the closer it is to 1, the more of it is retained.

2) **Input gate**: The input gate determines which new values, computed from the current input and the previous hidden state, should be added to the cell state. It computes a sigmoid activation that ranges between 0 and 1 for each element. A value of 0 means "discard," and a value of 1 means "keep."

3) **Cell state**: The cell state carries relevant information throughout the sequence. Thus, previous information isn't forgotten and the model can make more accurate predictions. Therefore, you can think of it as the memory of the network.

As the cell state moves along the sequence, information is added to or removed from it via the gates. The gates are small neural networks of their own that determine which information is allowed into the cell state. During training, the gates learn which information is necessary to retain and which to discard.

4) **Output gate**: It is used to determine the hidden state to be passed to the next step. Since the hidden state contains information about previous inputs (words, letters, etc.), it is used for the predictions that the model makes. The current input and the previous hidden state are passed through the sigmoid function. Then the current cell state is passed through the tanh function, and the two results are multiplied. The output is the new hidden state.

**To summarize briefly, the forget gate decides whether to continue transferring the information from the previous steps to the cell state. The input gate decides whether or not to add to the cell state a piece of new information. The output gate determines the next hidden state.**
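The gate descriptions above can be put together into a single LSTM step. The NumPy sketch below follows the commonly used formulation, in which each gate sees the concatenation of the previous hidden state and the current input. The parameter names (`W_f`, `W_i`, `W_c`, `W_o` and their biases), the toy sizes, and the random weights are assumptions for illustration; a real model would learn the weights.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x_t])                  # previous hidden state + current input
    f = sigmoid(params["W_f"] @ z + params["b_f"])     # forget gate: what to keep from c_prev
    i = sigmoid(params["W_i"] @ z + params["b_i"])     # input gate: what new information to add
    c_new = np.tanh(params["W_c"] @ z + params["b_c"]) # candidate values for the cell state
    o = sigmoid(params["W_o"] @ z + params["b_o"])     # output gate: what to expose as hidden state
    c_t = f * c_prev + i * c_new                       # updated cell state (the memory)
    h_t = o * np.tanh(c_t)                             # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # toy sizes (assumed)
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # stand-in for 5 word vectors
    h, c = lstm_step(x_t, h, c, params)

print("hidden state:", np.round(h, 3))
print("cell state  :", np.round(c, 3))
```

Note that the cell state is updated only by element-wise multiplication and addition, which is what allows information to travel across many time steps with little change.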

![image.png](attachment:505054cc-130b-480d-98fd-d1f962a75355.png)

### Gated Recurrent Units (GRU)

GRU was introduced in 2014 and is a newer RNN variant than LSTM. Its working logic is very similar to that of LSTM. Unlike LSTM, GRU combines the cell state and the hidden state into a single state. It also has only 2 gates: the update gate and the reset gate. Let us now briefly examine them.

- **Update gate**: It decides what information to discard and what new information to include, similar to the forget and input gates in LSTM.

- **Reset gate**: It decides how much of the information from previous steps should be forgotten.
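The two gates can be sketched as a single GRU step in NumPy. This is one common formulation under stated assumptions: the parameter names (`W_z`, `W_r`, `W_h` and their biases), the toy sizes, and the random weights are illustrative, and the roles of `z` and `1 - z` in the final blend are swapped in some papers and libraries.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, params):
    """One GRU time step; returns the new hidden state (there is no separate cell state)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(params["W_z"] @ hx + params["b_z"])  # update gate: how much to overwrite
    r = sigmoid(params["W_r"] @ hx + params["b_r"])  # reset gate: how much past to use
    # The candidate state sees the past only through the reset gate.
    h_new = np.tanh(params["W_h"] @ np.concatenate([r * h_prev, x_t]) + params["b_h"])
    # Blend old and candidate state (convention assumed here: z weights the candidate).
    return (1.0 - z) * h_prev + z * h_new


rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # toy sizes (assumed)
params = {}
for name in ("z", "r", "h"):
    params[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # stand-in for 5 word vectors
    h = gru_step(x_t, h, params)

print("hidden state:", np.round(h, 3))
```

Because the update gate does the work of both keeping old information and admitting new information, a GRU needs only three weight blocks where an LSTM needs four.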

GRU has fewer parameters than LSTM and is usually faster to train, but both LSTM and GRU give very good results. Which one works better depends on the task, so it's up to the user to try both and decide.
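One way to see where the speed difference comes from is to count parameters. The sketch below assumes the textbook formulation with one weight matrix and one bias vector per block (four blocks for LSTM, three for GRU). The helper `recurrent_param_count` and the sizes 100 and 128 are assumptions for illustration, and library implementations may report slightly different totals because some use additional bias terms.

```python
def recurrent_param_count(input_size: int, hidden_size: int, num_blocks: int) -> int:
    """Parameters in a gated recurrent layer with `num_blocks` weight blocks.

    Each block has a weight matrix over [hidden state, input] plus a bias vector.
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block


input_size, hidden_size = 100, 128  # assumed embedding and hidden sizes

lstm_params = recurrent_param_count(input_size, hidden_size, num_blocks=4)
gru_params = recurrent_param_count(input_size, hidden_size, num_blocks=3)

print("LSTM parameters:", lstm_params)
print("GRU parameters :", gru_params)
print("GRU / LSTM     :", gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, so each training step does correspondingly less work.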

![image.png](attachment:7f4720a4-98f9-4445-9a0f-3bf84a9d9a91.png)

![image.png](attachment:f0893d43-93a1-40e0-a6b7-5e9a05b45952.png)

![image.png](attachment:e03ffe9a-29ef-49cb-92ef-15ee76873757.png)

## Working Logic of LSTM and GRU

Suppose the following comments are made on a shopping site.

**Negative Comments:**

I've been saving for 6 months just to get this phone. I bought it very lovingly and willingly, but the product is **worthless**.

The laptop that I dreamed about for months, arrived yesterday. The shipping was very fast but the laptop was a **complete disappointment**.

The phone is a gaming beast in terms of both screen quality and processor power, but the charging time is very bad. You have to constantly charge. Frankly, I'm **not satisfied** at all.

**Positive Comments**:

Although the TV has minor shortcomings, I would **definitely recommend** it to everyone.

**Awesome, a complete price performance product** that everyone should have at home.

Shipping was very late but the product is **great**.



Ideally, LSTM and GRU learn to focus on the bold words and give little weight to the others. From the bold words, the model infers whether a comment is positive or negative. The more comments we train the model on, the better it will generally perform.