## Recurrent Neural Networks (RNNs)
RNNs are networks with loops in them, allowing information to persist. These are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a 'memory' that captures information about what has been calculated so far.

In theory, RNNs can make use of information in arbitrarily long sequences, but in practice, they are limited to looking back only a few steps due to what’s called the vanishing gradient problem. We'll talk more about this issue shortly.

Here is a simple implementation of a basic RNN in PyTorch:

In [None]:
import torch
from torch import nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()

        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

rnn = SimpleRNN(input_size=10, hidden_size=20, output_size=10)

In this code, input_size is the number of input features per time step, hidden_size is the number of features in the hidden state, and output_size is the number of output features.
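
To see how a hidden state is carried from one time step to the next, here is a minimal sketch that drives PyTorch's built-in `nn.RNNCell` over a short random sequence. The sizes and the random input are assumptions chosen for illustration. Unlike `SimpleRNN` above, `nn.RNNCell` applies a `tanh` nonlinearity and has no output layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

input_size, hidden_size, seq_len = 10, 20, 5    # illustrative sizes
cell = nn.RNNCell(input_size, hidden_size)

sequence = torch.randn(seq_len, 1, input_size)  # (seq_len, batch, input_size)
hidden = torch.zeros(1, hidden_size)            # initial hidden state

hidden_states = []
for x_t in sequence:                            # x_t has shape (batch, input_size)
    # The same cell (same weights) is applied at every step;
    # only the hidden state changes.
    hidden = cell(x_t, hidden)
    hidden_states.append(hidden)

hidden_states = torch.stack(hidden_states)
print(hidden_states.shape)                      # torch.Size([5, 1, 20])
```

Each hidden state depends on the current input and on the previous hidden state, which is what gives the network its "memory".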

### Issues in Training RNNs
One of the main issues when training RNNs is the vanishing gradient problem: the gradients of the loss function decay exponentially as they are propagated back in time, which makes the network hard to train. This happens because backpropagation through time multiplies the gradient by a similar factor at every time step, so the product shrinks toward zero when those factors have magnitude less than one.
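
The effect of repeated multiplication can be illustrated with a deliberately simplified model. The sketch below assumes a scalar linear recurrence `h_t = w * h_{t-1}` (not a real RNN) and uses autograd to measure how much the final state depends on the initial one. The helper name `gradient_through_time` and the value `w = 0.9` are hypothetical choices for this example.

```python
import torch

def gradient_through_time(w, steps):
    """Return d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    h0 = torch.tensor(1.0, requires_grad=True)
    h = h0
    for _ in range(steps):
        h = w * h            # one "time step"
    h.backward()
    return h0.grad.item()    # analytically equal to w ** steps

for steps in (1, 10, 50):
    print(steps, gradient_through_time(0.9, steps))
```

With a factor of 0.9 the gradient is already tiny after 50 steps, so inputs that far back contribute almost nothing to the weight update.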

The exploding gradients problem is another issue, which is the opposite of the vanishing gradients problem. Here, the gradient gets too large, resulting in an unstable network. In practice, this can be dealt with by a technique called gradient clipping, which essentially involves scaling down the gradients when they get too large.
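
Gradient clipping can be sketched with `torch.nn.utils.clip_grad_norm_`. In this example the tiny network, the random input, the artificially scaled loss, and `max_norm = 1.0` are all assumptions made to force large gradients; they are not recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=8)
x = torch.randn(30, 1, 4)             # (seq_len, batch, input_size)

output, _ = rnn(x)
loss = 1000.0 * output.pow(2).sum()   # scaled up only to force large gradients
loss.backward()

max_norm = 1.0
# Rescales all gradients in place so their combined L2 norm is at most
# max_norm; returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm)

norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in rnn.parameters()))
print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

In a training loop, the clipping call goes between `loss.backward()` and `optimizer.step()`.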

Two very popular RNN variants that mitigate the vanishing gradient problem are the Long Short-Term Memory (LSTM) unit and the Gated Recurrent Unit (GRU).

### Long Short-Term Memory (LSTM)
LSTMs are a special type of RNN specifically designed to avoid the vanishing gradient problem. LSTMs do not really have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state.

The key to LSTMs is the cell state, which runs straight down the entire chain, with only some minor linear interactions. It’s kind of like a conveyor belt. Information can be easily placed onto the cell state and easily taken off, without any "turbulence" caused by the RNN's complex functions.

This is an oversimplified view of the LSTM, but the main point is that LSTMs have a way to remove or add information to the cell state, carefully regulated by structures called gates.
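
To make the role of the gates concrete, the following sketch computes a single LSTM step by hand and compares it with `nn.LSTMCell`. The sizes and random tensors are assumptions for illustration; the weights are read from the PyTorch cell so that both computations use the same parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)

input_size, hidden_size = 10, 20
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

# nn.LSTMCell stores the four transforms stacked in the order:
# input gate, forget gate, candidate values, output gate.
pre = x @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = pre.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)

c_next = f * c_prev + i * g        # forget part of the old cell state, add new information
h_next = o * torch.tanh(c_next)    # the output gate decides what to expose

h_ref, c_ref = cell(x, (h_prev, c_prev))
print(torch.allclose(h_next, h_ref, atol=1e-5),
      torch.allclose(c_next, c_ref, atol=1e-5))
```

The cell state update is a gated sum rather than a full matrix multiplication followed by a nonlinearity, which is why gradients can flow along it more easily.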

Here is a basic implementation of an LSTM in PyTorch:

In [None]:
class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)  
        self.hidden2out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output, hidden = self.lstm(input.view(1, 1, -1), hidden)
        output = self.hidden2out(output.view(1, -1))
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

lstm = SimpleLSTM(input_size=10, hidden_size=20, output_size=10)

### Gated Recurrent Unit (GRU)
The GRU is a more recent recurrent unit that is quite similar to the LSTM. However, a GRU has two gates (reset and update) compared to the three gates (input, forget, and output) of an LSTM, and it has no separate cell state. Having fewer gates means fewer parameters, which makes a GRU somewhat cheaper to compute than an LSTM of the same hidden size.

Here is a basic implementation of a GRU in PyTorch:

In [None]:
class SimpleGRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleGRU, self).__init__()

        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size, hidden_size)
        self.hidden2out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output, hidden = self.gru(input.view(1, 1, -1), hidden)
        output = self.hidden2out(output.view(1, -1))
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)

gru = SimpleGRU(input_size=10, hidden_size=20, output_size=10)
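
As a rough check of the efficiency claim, the sketch below counts the parameters of a single-layer `nn.LSTM` and `nn.GRU` with the same sizes used above. The helper `count_parameters` is a hypothetical name introduced for this example.

```python
from torch import nn

def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 10, 20
lstm_params = count_parameters(nn.LSTM(input_size, hidden_size))
gru_params = count_parameters(nn.GRU(input_size, hidden_size))

print(lstm_params, gru_params)  # 2560 1920
```

The LSTM stacks four weight blocks per layer and the GRU three, so the GRU has three quarters as many recurrent parameters.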

### Implementing LSTMs and GRUs in PyTorch
Let's implement these models in PyTorch and use them on a simple task.

Before we start, remember that by default `nn.LSTM` and `nn.GRU` expect input data in a specific format: (seq_len, batch, input_size). Passing `batch_first=True` changes this to (batch, seq_len, input_size). Here,
* seq_len is the length of the sequence,
* batch is the batch size (number of sequences), and
* input_size is the number of features in the input.
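
The shape conventions can be checked directly. This is a small sketch with arbitrary sizes (assumptions for illustration) that shows the default layout and the `batch_first=True` layout side by side.

```python
import torch
from torch import nn

torch.manual_seed(0)

seq_len, batch, input_size, hidden_size = 10, 3, 5, 8   # illustrative sizes

x = torch.randn(seq_len, batch, input_size)

lstm = nn.LSTM(input_size, hidden_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)     # torch.Size([10, 3, 8]): one hidden state per time step
print(h_n.shape)     # torch.Size([1, 3, 8]): final hidden state per sequence

# With batch_first=True the input and output put the batch dimension first,
# but h_n keeps its (num_layers, batch, hidden_size) layout.
lstm_bf = nn.LSTM(input_size, hidden_size, batch_first=True)
out_bf, (h_bf, _) = lstm_bf(x.transpose(0, 1))
print(out_bf.shape)  # torch.Size([3, 10, 8])
print(h_bf.shape)    # torch.Size([1, 3, 8])
```

For a single-layer, unidirectional network, the last time step of `out` is the same tensor of values as `h_n[0]`.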

For this example, let's create an LSTM network that takes sequences of length 10, each with 5 features, and returns a single output value.

#### Implementing LSTM

In [None]:
import torch
from torch import nn

class LSTMNetwork(nn.Module):
    def __init__(self, input_size=5, hidden_layer_size=10, output_size=1):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size

        self.lstm = nn.LSTM(input_size, hidden_layer_size)

        self.linear = nn.Linear(hidden_layer_size, output_size)

    def forward(self, input_seq):
        # Reshape to (seq_len, batch=1, input_size). No state is passed in, so
        # nn.LSTM starts from zeros on every call. Caching the state on the
        # module would leak it between unrelated sequences and would make a
        # second backward() call fail.
        lstm_out, _ = self.lstm(input_seq.view(len(input_seq), 1, -1))
        predictions = self.linear(lstm_out.view(len(input_seq), -1))
        # Return the prediction for the last time step only
        return predictions[-1]

This code defines an LSTM module that accepts an input sequence, applies LSTM computations to it, and then passes it through a linear layer to obtain the output.

#### Implementing GRU
Now let's implement a GRU model, which is simpler than LSTM since it has fewer gates.

In [None]:
class GRUNetwork(nn.Module):
    def __init__(self, input_size=5, hidden_layer_size=10, output_size=1):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size

        self.gru = nn.GRU(input_size, hidden_layer_size)

        self.linear = nn.Linear(hidden_layer_size, output_size)

    def forward(self, input_seq):
        # Reshape to (seq_len, batch=1, input_size). No state is passed in, so
        # nn.GRU starts from a zero hidden state on every call.
        gru_out, _ = self.gru(input_seq.view(len(input_seq), 1, -1))
        predictions = self.linear(gru_out.view(len(input_seq), -1))
        # Return the prediction for the last time step only
        return predictions[-1]

### Sequence to Sequence Models, Language Modelling
Sequence-to-Sequence (Seq2Seq) models are a type of model designed to handle sequence inputs and sequence outputs, where the length of the output sequence may not be equal to the length of the input sequence. They are popularly used in tasks like machine translation, chatbots, and any application that requires generating text output of arbitrary length.

Language modelling, on the other hand, is the task of predicting the next word (or character) in a sequence given the previous words (or characters). Both Seq2Seq models and language models are built using the recurrent networks (RNN, LSTM, GRU) we discussed earlier.

Let's dive in!

#### Sequence to Sequence Model
A basic Seq2Seq model is composed of two main components: an encoder and a decoder. Both are recurrent networks (like LSTMs or GRUs) but they serve different purposes. The encoder takes the input sequence and compresses the information into a context vector (usually the final hidden state of the encoder). The decoder takes this context vector and generates the output sequence.

Here's a simple implementation of a Seq2Seq model using PyTorch:

In [None]:
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Seq2Seq, self).__init__()
        self.hidden_dim = hidden_dim

        self.encoder = nn.GRU(input_dim, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, output_dim)

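    def forward(self, src):
        # hidden is the context vector, shape (1, batch, hidden_dim)
        outputs, hidden = self.encoder(src)
        # Feed the context vector to the decoder at every step so that it
        # emits one output per input step: (seq_len, batch, output_dim)
        decoder_input = hidden.repeat(src.size(0), 1, 1)
        output, hidden = self.decoder(decoder_input)
        return output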

To use the Seq2Seq model, you'll need to provide it with input sequences. Let's create some dummy data to pass to the model:

In [None]:
import torch

# Assume we have 10 sequences, each of length 5, and each element in the sequence has 20 features
input_dim = 20
hidden_dim = 30
output_dim = 10
seq_len = 5
n_seqs = 10

model = Seq2Seq(input_dim, hidden_dim, output_dim)
input_data = torch.randn(seq_len, n_seqs, input_dim)  # Random input data

output = model(input_data)

print(output.shape)  # Should be [seq_len, n_seqs, output_dim]

This model takes an input_dim sized input and uses an encoder to create a hidden_dim sized context vector. The decoder then takes this context vector and generates an output_dim sized output.
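
Since the output length of a Seq2Seq model need not match the input length, here is a hedged sketch of a decoder that is unrolled for a chosen number of steps and feeds each output back in as the next input. The class name `TinySeq2Seq`, the `target_len` argument, and the all-zeros start input are assumptions for illustration; real systems typically start from a start-of-sequence token and stop at an end token.

```python
import torch
from torch import nn

class TinySeq2Seq(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.output_dim = output_dim
        self.encoder = nn.GRU(input_dim, hidden_dim)
        self.decoder_cell = nn.GRUCell(output_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, src, target_len):
        _, hidden = self.encoder(src)       # (1, batch, hidden_dim)
        hidden = hidden[0]                  # context vector: (batch, hidden_dim)
        # Hypothetical start input: zeros instead of a start-of-sequence token
        step_input = src.new_zeros(src.size(1), self.output_dim)
        outputs = []
        for _ in range(target_len):
            hidden = self.decoder_cell(step_input, hidden)
            step_output = self.out(hidden)
            outputs.append(step_output)
            step_input = step_output        # feed the prediction back in
        return torch.stack(outputs)         # (target_len, batch, output_dim)

torch.manual_seed(0)
model = TinySeq2Seq(input_dim=20, hidden_dim=30, output_dim=10)
output = model(torch.randn(5, 10, 20), target_len=7)
print(output.shape)  # torch.Size([7, 10, 10])
```

Here an input of length 5 produces an output of length 7, because the decoder's length is controlled independently of the encoder's.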

#### Language Modelling
In a language model, we're given a sequence of words (or characters), and our task is to predict the next word. In this case, both the input and output sequences have the same length. Here is a simple implementation of a character-level language model using LSTM and PyTorch:

In [None]:
class CharRNN(nn.Module):
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr

        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}

        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        self.dropout = nn.Dropout(drop_prob)
        
        self.fc = nn.Linear(n_hidden, len(self.chars))

    def forward(self, x, hidden):
        r_output, hidden = self.lstm(x, hidden)
        
        out = self.dropout(r_output)
        
        out = out.contiguous().view(-1, self.n_hidden)
        
        out = self.fc(out)
        
        return out, hidden

To use the CharRNN model for language modeling, you'll need a sequence of characters. Let's create some then see what the model outputs:

In [None]:
import torch

# Define set of all possible characters
tokens = ['a', 'b', 'c', 'd', 'e']

# Convert tokens to integers
token_to_int = {ch: ii for ii, ch in enumerate(tokens)}

# Assume we have a text sequence
text = "abcde"

# Convert characters to integers
input_data = [token_to_int[ch] for ch in text]

# Initialize the model
model = CharRNN(tokens, n_hidden=256, n_layers=2, drop_prob=0.5)

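# Convert list of integers to PyTorch tensor
input_data = torch.tensor(input_data)

# Add batch dimension: (1, len(text))
input_data = input_data.unsqueeze(0)

# One-hot encode, since the LSTM expects float inputs of size len(tokens)
input_data = torch.nn.functional.one_hot(input_data, num_classes=len(tokens)).float()

# Forward pass through the model (hidden=None starts from a zero state)
output, hidden = model(input_data, None)

print(output.shape)  # [1 * len(text), len(tokens)]: forward() flattens batch and time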
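
For training a language model, the targets are simply the inputs shifted by one position. The sketch below shows this with a small stand-alone LSTM rather than `CharRNN`; the example text, the hidden size of 16, and the variable names are assumptions for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

tokens = ['a', 'b', 'c', 'd', 'e']
char2int = {ch: i for i, ch in enumerate(tokens)}
text = "abcdeabcde"
encoded = torch.tensor([char2int[ch] for ch in text])

inputs = encoded[:-1]    # characters 0 .. n-2
targets = encoded[1:]    # characters 1 .. n-1: the next character at each position

# One-hot encode and add a batch dimension: (1, seq_len, vocab_size)
x = F.one_hot(inputs, num_classes=len(tokens)).float().unsqueeze(0)

lstm = nn.LSTM(len(tokens), 16, batch_first=True)
fc = nn.Linear(16, len(tokens))

out, _ = lstm(x)                  # (1, seq_len, 16)
logits = fc(out).squeeze(0)       # (seq_len, vocab_size)
loss = F.cross_entropy(logits, targets)
print(logits.shape, loss.item())
```

Minimizing this cross-entropy loss trains the network to assign high probability to the character that actually follows each prefix.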

### Implementing LSTMs and GRUs on real-world data
So far, we've learned the basic theory behind LSTMs and GRUs, as well as how to implement them using PyTorch. Now, we're going to learn how to apply these techniques to real-world data.

One common application for LSTMs and GRUs is in natural language processing (NLP) tasks, where we have sequences of words or characters as input. A common task in NLP is text classification, where the goal is to assign a label to a piece of text. Let's go through an example of using LSTMs for sentiment analysis on movie reviews, using the IMDB dataset available in torchtext.

Firstly, you'll need to download and preprocess the data:

In [None]:
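import torch
# Note: torchtext.legacy exists only in torchtext 0.9 through 0.11
from torchtext.legacy import data, datasets

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
# LabelField avoids an <unk> entry; float labels match BCEWithLogitsLoss
LABEL = data.LabelField(dtype=torch.float)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors='glove.6B.100d')
LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=64,
    sort_within_batch=True,  # pack_padded_sequence expects length-sorted batches
    device=device)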

Next, let's define our model. We'll use an LSTM followed by a fully connected layer for classification:

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        
        embedded = self.dropout(self.embedding(text))
        
        # lengths must live on the CPU for pack_padded_sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
        return self.fc(hidden)

Now we can create an instance of our RNN class, and define our training loop:

In [None]:
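# create an RNN (vocab size, embedding_dim=100, hidden_dim=256, output_dim=1,
# n_layers=2, bidirectional=True, dropout=0.5, padding index)
model = RNN(len(TEXT.vocab), 100, 256, 1, 2, True, 0.5, TEXT.vocab.stoi[TEXT.pad_token])
# start from the GloVe vectors loaded by TEXT.build_vocab (100-d, matching embedding_dim)
model.embedding.weight.data.copy_(TEXT.vocab.vectors)
model = model.to(device)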

# define optimizer and loss
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

# define metric
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

# training loop
model.train()
for epoch in range(10):
    for batch in train_iter:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        print(f'Epoch: {epoch+1}, Loss: {loss.item()}, Acc: {acc.item()}')

This will train the LSTM on the IMDB dataset, and print the training loss and accuracy at each step. Note that we're using packed padded sequences here to efficiently handle sequences of different lengths.

We are only showing the training loop here, but in a real scenario, you would also want to include a validation loop to monitor the model's performance on unseen data, and prevent overfitting. You'd also want to save the model's parameters after training.
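
A validation pass and a parameter checkpoint might look like the following sketch. To keep it self-contained it uses a toy linear model and random data as stand-ins for the LSTM and the IMDB iterator; the function name `evaluate` and all sizes are assumptions for illustration.

```python
import io
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def evaluate(model, loader, criterion):
    """Average loss and accuracy over a loader, without updating the model."""
    model.eval()                       # disables dropout
    total_loss, total_correct, total_count = 0.0, 0, 0
    with torch.no_grad():              # no gradients needed for evaluation
        for features, labels in loader:
            logits = model(features).squeeze(1)
            total_loss += criterion(logits, labels).item() * len(labels)
            preds = torch.round(torch.sigmoid(logits))
            total_correct += (preds == labels).sum().item()
            total_count += len(labels)
    return total_loss / total_count, total_correct / total_count

# Toy stand-ins for the real model and validation set
model = nn.Linear(8, 1)
valid_loader = DataLoader(
    TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)).float()),
    batch_size=16)

valid_loss, valid_acc = evaluate(model, valid_loader, nn.BCEWithLogitsLoss())
print(f'Val loss: {valid_loss:.3f}, Val acc: {valid_acc:.3f}')

# Save only the parameters; an in-memory buffer is used here,
# but a file path works the same way.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
```

Remember to call `model.train()` again before resuming training, since `evaluate` leaves the model in evaluation mode.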

More advanced techniques and different applications can be found in my pytorch-intermediate, pytorch-advanced and pytorch-rnn repos.

### Applications and Limitations of RNNs
Recurrent Neural Networks (RNNs), including their variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), are powerful tools for modeling sequence data. Let's discuss some of their real-world applications, and also their limitations.

#### Real-world Applications of RNNs
* Natural Language Processing (NLP): RNNs are heavily used in NLP tasks because of their effectiveness in handling sequence data. Some applications in NLP include:
    * Text Generation: Given a sequence of words (or characters), predict the next word (or character) in the sequence. This could be used to generate entirely new text that matches the style of the input.
    * Machine Translation: Translate a text from one language to another. A Seq2Seq model is commonly used for this task, where the source language text is input to the encoder, and the decoder generates the text in the target language.
    * Sentiment Analysis: Determine the sentiment expressed in a piece of text. The text is input to the RNN, which then outputs a sentiment value.
    * Speech Recognition: Convert spoken language into written text. The sequence of audio data is input to the RNN, which outputs the corresponding text.
* Time Series Prediction: RNNs can be used to predict future values in a time series, given past values. This is useful in many fields such as finance (e.g., stock prices), weather forecasting, and health (e.g., heart rate data).
* Music Generation: Similar to text generation, RNNs can be used to generate entirely new pieces of music that match the style of a given input sequence.
* Video Processing: RNNs can also be used in video processing tasks like activity recognition, where the goal is to understand what's happening in a video sequence.

#### Limitations of RNNs
While RNNs are powerful, they do come with some limitations:
* Difficulty with long sequences: While LSTMs and GRUs alleviate this issue to some extent, RNNs in general struggle with very long sequences due to the vanishing gradient problem: when gradients are backpropagated through many time steps, they can become very small (vanish), which makes the network harder to train.
* Sequential computation: RNNs must process sequences one element at a time, which makes it hard to fully utilize modern hardware that is capable of parallel computations.
* Memory of past information: In theory, RNNs should be able to remember information from the distant past, but in practice they tend to forget earlier inputs when trained with standard optimization techniques.

There are different ways to address these limitations, including using attention mechanisms (which we'll cover later), or using architectures like the Transformer, which are designed to overcome these issues.