<a href="https://colab.research.google.com/github/sum-coderepo/DeepLearning-Pytorch/blob/master/LSTM/LSTMBasic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Taken from here:


In [None]:
# https://galhever.medium.com/sentiment-analysis-with-pytorch-part-4-lstm-bilstm-model-84447f6c4525

The hidden state acts as the neural network's memory. It holds information about the data the network has already seen.

Operations on this information are controlled by three gates:


*   Forget gate: Controls which content from prior steps to keep and which to forget.
*   Input gate: Controls which information from the current step is relevant to carry forward to the next steps.
*   Output gate: Controls what the next hidden state should be, i.e. the output of the current step.



**What Is a BiLSTM Model?** </br></br>
A bidirectional LSTM (BiLSTM) model maintains two separate states, one for the forward input and one for the backward input, generated by two different LSTMs. The first LSTM reads the sequence in its regular order, starting from the beginning of the sentence, while the second LSTM is fed the input sequence in the opposite order. The idea behind a bidirectional network is to capture information from the inputs surrounding each position. It usually learns faster than the one-directional approach, although this depends on the task.
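The two directions can be inspected directly on a small `nn.LSTM`. The sketch below uses made-up sizes (they are assumptions for illustration, not the notebook's hyperparameters) and checks that the forward half of the output ends at the last time step while the backward half ends at the first one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
batch_size, seq_len, input_size, hidden_size = 2, 5, 8, 4

bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, input_size)

output, (h_n, c_n) = bilstm(x)

# Forward and backward hidden states are concatenated on the last dimension
print(output.shape)  # torch.Size([2, 5, 8])
print(h_n.shape)     # torch.Size([2, 2, 4]) -> (num_directions, batch, hidden)

forward_out = output[:, :, :hidden_size]
backward_out = output[:, :, hidden_size:]

# The forward LSTM finishes at the last time step, the backward LSTM at the first
fwd_matches = torch.allclose(forward_out[:, -1, :], h_n[0])
bwd_matches = torch.allclose(backward_out[:, 0, :], h_n[1])
print(fwd_matches, bwd_matches)
```

This is why a BiLSTM classifier typically concatenates `h_n[-2]` and `h_n[-1]` rather than slicing a single time step of the output.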

In [1]:
lr = 1e-4
batch_size = 50
dropout_keep_prob = 0.5
embedding_size = 300
max_document_length = 100  # each sentence has at most 100 words
dev_size = 0.8 # fraction used for the train/validation split
max_size = 5000 # maximum vocabulary size
seed = 1
num_classes = 3
num_hidden_nodes = 93
hidden_dim2 = 128
num_layers = 2  # LSTM layers
bi_directional = False 
num_epochs = 7

In [2]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class LSTM(nn.Module):

    # define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, lstm_units, hidden_dim , num_classes, lstm_layers,
                 bidirectional, dropout, pad_index, batch_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_index)
        self.lstm = nn.LSTM(embedding_dim,
                            lstm_units,
                            num_layers=lstm_layers,
                            bidirectional=bidirectional,
                            batch_first=True)
        num_directions = 2 if bidirectional else 1
        self.fc1 = nn.Linear(lstm_units * num_directions, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.lstm_layers = lstm_layers
        self.num_directions = num_directions
        self.lstm_units = lstm_units


    def init_hidden(self, batch_size):
        # Zero initial hidden and cell states, created on the same device as the model weights
        # (torch.autograd.Variable is deprecated; plain tensors are enough)
        shape = (self.lstm_layers * self.num_directions, batch_size, self.lstm_units)
        device = self.embedding.weight.device
        h = torch.zeros(shape, device=device)
        c = torch.zeros(shape, device=device)
        return h, c

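    def forward(self, text, text_lengths):
        batch_size = text.shape[0]
        h_0, c_0 = self.init_hidden(batch_size)

        embedded = self.embedding(text)  # [batch_size, sentence_length, embedding_dim]
        # text_lengths must be a list or CPU tensor, sorted in decreasing order
        # (pack_padded_sequence defaults to enforce_sorted=True)
        packed_embedded = pack_padded_sequence(embedded, text_lengths, batch_first=True)
        output, (h_n, c_n) = self.lstm(packed_embedded, (h_0, c_0))
        output_unpacked, output_lengths = pad_packed_sequence(output, batch_first=True)
        # Last position of the padded output; for sequences shorter than the
        # longest one in the batch this position holds padding zeros
        out = output_unpacked[:, -1, :]
        rel = self.relu(out)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        preds = self.fc2(drop)
        return preds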

`pack_padded_sequence` converts a padded batch into a packed format that lets the model ignore the padded elements. On its own, an LSTM does not distinguish between padded and regular elements. With packed input it processes each sequence only up to its true length, so the padded values take part in neither the forward computation nor the gradient calculation during backpropagation, which saves unnecessary work. `pad_packed_sequence` is the inverse operation of `pack_padded_sequence` and brings the output back to the familiar format `[batch_size, sentence_length, hidden_features]`.
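A minimal sketch of the round trip, using a tiny hand-written batch (the values and lengths are made up for illustration). The packed data is stored time step by time step, and `batch_sizes` records how many sequences are still active at each step.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two hypothetical sequences with one feature per step; the second is padded with zeros
padded = torch.tensor([[[1.0], [2.0], [3.0]],
                       [[4.0], [0.0], [0.0]]])  # [batch_size=2, sentence_length=3, features=1]
lengths = torch.tensor([3, 1])                  # true lengths, sorted in decreasing order

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data.squeeze(1))  # tensor([1., 4., 2., 3.]) -> padding is gone
print(packed.batch_sizes)      # tensor([2, 1, 1]) -> active sequences per time step

# Inverse operation: back to [batch_size, sentence_length, features]
unpacked, unpacked_lengths = pad_packed_sequence(packed, batch_first=True)
print(unpacked_lengths)        # tensor([3, 1])
```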
</br>
</br>

 **Init_hidden Function** </br>
At the start of each batch, the hidden and cell states need to be initialized to zero and fed to the LSTM layer, so we use a function that does this for each batch separately.
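For reference, `nn.LSTM` also falls back to zero states when no initial state is passed, so an explicit zero initialization mainly makes the shapes visible. The sketch below, with made-up sizes, compares the two calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
num_layers, batch_size, input_size, hidden_size = 2, 3, 6, 4

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
x = torch.randn(batch_size, 5, input_size)

# Explicit zero states: (num_layers * num_directions, batch_size, hidden_size)
h_0 = torch.zeros(num_layers, batch_size, hidden_size)
c_0 = torch.zeros(num_layers, batch_size, hidden_size)

out_explicit, _ = lstm(x, (h_0, c_0))
out_default, _ = lstm(x)  # no initial state given

same = torch.allclose(out_explicit, out_default)
print(same)  # True
```

Note that the state tensors are not batch-first, even when `batch_first=True` is set for the input.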


 **LSTM Layer**</br>
PyTorch’s nn.LSTM expects a 3D tensor as input; with batch_first=True its shape is `[batch_size, sentence_length, embedding_dim]`.</br>

For each word in the sentence, each layer computes the input gate i, the forget gate f, the output gate o, and the new cell content c’ (the new content that should be written to the cell). It then computes the current cell state and the hidden state.
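The gate computations for a single time step can be written out by hand and compared against `nn.LSTMCell`. This is a sketch with made-up sizes; it relies on PyTorch stacking the gate weights in the order input, forget, cell content, output, and the variable `g` below plays the role of c’.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
batch_size, input_size, hidden_size = 2, 4, 3

cell = nn.LSTMCell(input_size, hidden_size)
x = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)
c_prev = torch.randn(batch_size, hidden_size)

# All four gates are computed with one matrix product each for x and h_prev
gates = x @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)

i = torch.sigmoid(i)  # input gate
f = torch.sigmoid(f)  # forget gate
g = torch.tanh(g)     # new cell content c'
o = torch.sigmoid(o)  # output gate

c_new = f * c_prev + i * g       # current cell state
h_new = o * torch.tanh(c_new)    # current hidden state

h_ref, c_ref = cell(x, (h_prev, c_prev))
print(torch.allclose(h_new, h_ref, atol=1e-5), torch.allclose(c_new, c_ref, atol=1e-5))
```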



**Parameters for LSTM Layer:**</br></br>
**Input_size:** The number of features of each element in the input. In our case each element (word) has 300 features, which is the embedding_dim.</br></br>
**Hidden_size:** The number of LSTM hidden units.</br></br>
**Num_layers:** The number of stacked LSTM layers in a multi-layer LSTM. In our case we set lstm_layers=2, which means that the input x at time t of the second layer is the hidden state h at time t of the first layer (with dropout applied to it when the dropout argument is greater than zero).</br></br>
**Batch_first:** By default, nn.LSTM expects the input as [sentence_length, batch_size, embedding_dim]. Setting batch_first=True lets us provide the batch dimension first, as [batch_size, sentence_length, embedding_dim].</br></br>
**Dropout:** If this argument is greater than zero, a Dropout layer with that dropout probability is applied to the outputs of each LSTM layer except the last one.</br></br>
**Bidirectional:** Switches the model type (False = LSTM, True = BiLSTM).</br></br>
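To see how these arguments determine the tensor shapes, here is a small sketch that builds a layer with the notebook's sizes (embedding size 300, 93 hidden units, 2 layers, one direction). The batch size, sentence length, and the `dropout=0.5` passed to `nn.LSTM` are assumptions made for the example.

```python
import torch
import torch.nn as nn

# batch_size and sentence_length are hypothetical; the other sizes follow the notebook
batch_size, sentence_length, embedding_dim = 5, 10, 300
hidden_size, num_layers = 93, 2

lstm = nn.LSTM(input_size=embedding_dim,
               hidden_size=hidden_size,
               num_layers=num_layers,
               batch_first=True,
               dropout=0.5,          # applied between layer 1 and layer 2 only
               bidirectional=False)

x = torch.randn(batch_size, sentence_length, embedding_dim)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([5, 10, 93]) -> last layer, every time step
print(h_n.shape)     # torch.Size([2, 5, 93])  -> one final hidden state per layer
print(c_n.shape)     # torch.Size([2, 5, 93])  -> one final cell state per layer
```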

The inputs and output for the LSTM Layer can be explained by the diagram below (w represents the number of LSTM layers, in our case it’s equal to 2):

The input of the LSTM Layer:</br>
**Input:** In our case it’s a packed input, but it can also be the original padded sequence, where each Xi represents a word in the sentence (including padding elements).</br></br>
**h_0:** The initial hidden state that we feed to the model.</br></br>
**c_0:** The initial cell state that we feed to the model.</br></br>
The output of the LSTM Layer:</br>
**Output:** The first value returned by the LSTM contains the hidden states of the last layer for every time step in the sequence.</br></br>
**h_n:** The second output is the final hidden state of each LSTM layer.</br></br>
**c_n:** The third output is the final cell state of each LSTM layer.</br></br>
To get the hidden state of the last time step we use output_unpacked[:, -1, :] and feed it to the next fully-connected layer. Note that this indexes the last position of the padded output, so for sequences shorter than the longest one in the batch it returns padding zeros; h_n holds the true final hidden state of every sequence.</br></br>
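One caveat with `output_unpacked[:, -1, :]` is worth checking when the sequences in a batch have different lengths. The sketch below uses made-up sizes and compares that slice with the hidden state at each sequence's true last step and with `h_n`. It is an illustration of the behaviour, not a change to the model above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)

x = torch.randn(2, 5, 3)
lengths = torch.tensor([5, 2])  # second sequence has only 2 real steps
x[1, 2:] = 0.0                  # the rest of it is padding

packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
output_unpacked, _ = pad_packed_sequence(packed_out, batch_first=True)

# Slice used in the model: last position of the padded output
last_column = output_unpacked[:, -1, :]

# Hidden state at each sequence's true last step (index lengths - 1)
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, output_unpacked.size(2))
true_last = output_unpacked.gather(1, idx).squeeze(1)

print(torch.allclose(true_last, h_n[-1]))  # True: h_n holds the true last states
print(last_column[1])                      # all zeros: padding for the short sequence
```

For a one-directional model, feeding `h_n[-1]` to the fully-connected layer is therefore one way to avoid the padded positions.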

In [6]:
from IPython.display import Image
Image(url= "https://miro.medium.com/max/640/0*UOHtKtIqmTTGdUm-.png")

In [None]:
import os

# Note: Text, Label, train_data, valid_data, test_data, create_iterator, run_train,
# evaluate and to_train are not defined in this notebook; they are assumed to come
# from the earlier parts of the article series linked above.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
path = 'C:/'
path_data = os.path.join(path, "data")

# parameters
model_type = "LSTM"
data_type = "token" # or: "morph"

char_based = True
if char_based:
    tokenizer = lambda s: list(s) # char-based
else:
    tokenizer = lambda s: s.split() # word-based

Text.build_vocab(train_data, max_size=max_size)
Label.build_vocab(train_data)
vocab_size = len(Text.vocab)
# Assumption: Text is a torchtext Field, so the padding index can be looked up in its vocabulary
pad_index = Text.vocab.stoi[Text.pad_token]

train_iterator, valid_iterator, test_iterator = create_iterator(train_data, valid_data, test_data, batch_size, device)

# loss function
loss_func = nn.CrossEntropyLoss()
# arguments follow the signature of the LSTM class defined above
lstm_model = LSTM(vocab_size, embedding_size, num_hidden_nodes, hidden_dim2, num_classes, num_layers,
                  bi_directional, dropout_keep_prob, pad_index, batch_size)
lstm_model = lstm_model.to(device)  # keep the model on the same device as the batches

# optimization algorithm
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=lr)
# train and evaluation
if (to_train):
    run_train(num_epochs, lstm_model, train_iterator, valid_iterator, optimizer, loss_func, model_type)

# load weights
lstm_model.load_state_dict(torch.load(os.path.join(path, "saved_weights_LSTM.pt"), map_location=device))
# predict
test_loss, test_acc = evaluate(lstm_model, test_iterator, loss_func)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc * 100:.2f}%')

By default (batch_first=False), the input of nn.LSTM is "input of shape (seq_len, batch, input_size)", with "input_size – The number of expected features in the input x".

And the output is: "output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t."



In [None]:
import torch
import torch.nn as nn

input_size = 5
hidden_size = 10
num_layers = 1
output_size = 1

lstm = nn.LSTM(input_size, hidden_size, num_layers)
fc = nn.Linear(hidden_size, output_size)

X = [
    [[1,2,3,4,5]],
    [[1,2,3,4,5]],
    [[1,2,3,4,5]],
    [[1,2,3,4,5]],
    [[1,2,3,4,5]],
    [[1,2,3,4,5]],
    [[1,2,3,4,5]],
]

X = torch.tensor(X, dtype=torch.float32)

print(X.shape)         # (seq_len, batch_size, input_size) = (7, 1, 5)
out, hidden = lstm(X)  # Where X's shape is ([7,1,5])
print(out.shape)       # (seq_len, batch_size, hidden_size) = (7, 1, 10)
out = out[-1]          # Get output of last step
print(out.shape)       # (batch, hidden_size) = (1, 10)
out = fc(out)          # Push through linear layer
print(out.shape)       # (batch_size, output_size) = (1, 1)

In this example batch_size = 1 and output_size = 1, so the final output has shape (1, 1).
 