<p style="text-align: center; font-size:50px;">About LSTM</p>

#### Let's quickly recap RNNs before moving on, since the LSTM is an improved version of the RNN.
#### An RNN processes its inputs sequentially.
#### It can also take information from previous time steps into account through its hidden state.
#### The hidden state is updated at every time step and fed back in as an additional input at the next step.



<p style="text-align: center; font-size:30px;">Limitations of RNNs</p>

#### However, there are limitations to RNNs.
#### The biggest drawback among them is the problem of vanishing or exploding gradients.
#### Backpropagation is the main mechanism through which deep learning models learn.
#### However, as the input sequence gets longer, the model has to backpropagate through more time steps and perform more matrix multiplications because of the chain rule.
#### Because of this repetition, small gradients that are multiplied over and over keep shrinking until they 'vanish', while large gradients keep growing until they 'explode'.
#### The former causes the model to gradually ignore old information and can leave it unable to train its parameters.
#### The latter makes the model extremely unstable during training.
#### Vanishing gradients are what cause RNNs to suffer from 'short-term memory': they can only take relatively recent information into account and ignore older information.
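
#### To get a feel for how quickly this happens, here is a small sketch that is not part of the original walkthrough.
#### It multiplies a starting gradient of 1.0 by the same factor at every time step; the factors 0.9 and 1.1 and the 100 steps are arbitrary values assumed purely for illustration.

```python
def repeated_product(factor, steps):
    """Multiply a starting gradient of 1.0 by `factor` once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

# Hypothetical per-step factors, chosen only for illustration
vanishing = repeated_product(0.9, 100)  # shrinks towards 0
exploding = repeated_product(1.1, 100)  # keeps growing

print(f"0.9 multiplied 100 times: {vanishing:.2e}")
print(f"1.1 multiplied 100 times: {exploding:.2e}")
```

#### Real gradients are products of matrices rather than single numbers, but the same compounding effect applies.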

<p style="text-align: center; font-size:30px;">What are LSTMs?</p>

#### LSTMs introduce a gating mechanism to combat the problem of 'short-term memory'.
#### This mechanism allows an LSTM to hold on to contextual information for longer; in NLP, for example, this allows LSTMs to carry out a more human-like conversation.
#### So how does an LSTM manage this?
#### An RNN takes in the previous hidden state and the current input and passes them through a tanh activation function to produce an output and a new hidden state.

![image](https://blog.floydhub.com/content/images/2019/06/Slide18.JPG)<br></br>
taken from [website](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)
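
#### The RNN step pictured above can be sketched in a few lines of NumPy.
#### This is only a minimal illustration: the weight names (`W_x`, `W_h`, `b`), the random weights and the sizes are assumptions, and the output of each step is taken to be the hidden state itself, as in a simple RNN cell.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: mix the input with the previous hidden state, then apply tanh."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 10                  # illustrative sizes
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                       # initial hidden state
sequence = rng.normal(size=(3, input_dim))     # 3 time steps of made-up input
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)          # the new hidden state feeds the next step

print(h.shape)
```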

#### An LSTM, however, takes in 3 sets of information at each time step.
#### It takes in the current input, the short-term memory (similar to the hidden state), and the long-term memory (**cell state**).
#### These inputs are then passed through **gates**, which can be seen as filters.
#### They decide which information is kept and carried forward and which is discarded.
#### There are 3 gates in total: the input gate, the forget gate and the output gate.
#### The parameters of these gates are what get trained so that they filter optimally.

![img](https://blog.floydhub.com/content/images/2019/06/Slide19.JPG)
<br></br>
taken from [website](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)

<p style="text-align: center; font-size:30px;">Input Gate</p>

![img](https://blog.floydhub.com/content/images/2019/06/Slide20.JPG)
<br></br>
taken from [website](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)

#### The input gate deals with the current input and the hidden state / short-term memory.
#### The **first layer** is the *sigmoid* function.
#### This layer squashes the combined inputs into values between 0 and 1, with 0 meaning unimportant and 1 meaning important.
#### Through backpropagation, this layer will ideally learn which information in the inputs is important and which is not.
#### The **second layer** takes both inputs and passes them through the tanh function.
#### The outputs from these two layers are then multiplied element-wise.
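
#### As a rough sketch of the two layers described above, the input gate could be written as follows in NumPy.
#### The weight matrices `W_i` and `W_g` are hypothetical random stand-ins for the learned parameters, and biases are left out to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 10

x_t = rng.normal(size=input_dim)          # current input
h_prev = rng.normal(size=hidden_dim)      # previous hidden state (short-term memory)
combined = np.concatenate([h_prev, x_t])  # both inputs, side by side

# Hypothetical weights; in a real LSTM these are learned by backpropagation
W_i = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
W_g = rng.normal(size=(hidden_dim, hidden_dim + input_dim))

i_t = sigmoid(W_i @ combined)   # first layer: importance filter between 0 and 1
g_t = np.tanh(W_g @ combined)   # second layer: candidate values between -1 and 1
input_gate_out = i_t * g_t      # element-wise product of the two layers

print(input_gate_out.shape)
```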

<p style="text-align: center; font-size:30px;">Forget Gate</p>

![img](https://blog.floydhub.com/content/images/2019/06/Slide21.JPG)
<br></br>
taken from [website](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)

#### This gate decides which information is kept or discarded from the long-term memory.
#### This is done by multiplying the incoming long-term memory by a forget vector generated from the current input and the incoming short-term memory.
#### This forget vector is also a filter.
#### It is obtained by passing the current input and hidden state through a sigmoid function, just like in the input gate.
#### The output of the forget gate and the output of the input gate described above are then added pointwise to produce the new long-term memory.
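
#### A tiny worked example of this update, using hand-picked numbers rather than values from a trained model, might look like this.
#### All three vectors below are made up purely for illustration.

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])          # incoming long-term memory
forget_vector = np.array([1.0, 0.0, 0.5])    # sigmoid output: 1 keeps, 0 discards
input_gate_out = np.array([0.0, 0.3, -0.2])  # output of the input gate

# Forget part of the old memory, then add the new information pointwise
c_new = forget_vector * c_prev + input_gate_out
print(c_new)
```

#### The first entry is kept untouched, the second is wiped and replaced by new information, and the third is halved before the new information is added.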

<p style="text-align: center; font-size:30px;">Output Gate</p>

![img](https://blog.floydhub.com/content/images/2019/06/Slide22.JPG)
<br></br>
taken from [website](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)

#### The output gate takes in the new cell state, the current input and the hidden state to produce the new hidden state (short-term memory) and the output.
#### As we can see above, the hidden state and the current input are passed into yet another sigmoid function, while the new cell state (long-term memory) is passed into the tanh function.
#### The 2 outputs are then multiplied to produce the new hidden state (short-term memory).
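
#### Putting the three gates together, one full LSTM time step could be sketched as below.
#### This is a simplified NumPy illustration, not PyTorch's actual implementation: the dictionary of weights `W` is a hypothetical stand-in for the learned parameters, and biases are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time step. `W` maps a gate name to a (hidden, hidden + input) matrix."""
    combined = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ combined)   # input gate filter
    g_t = np.tanh(W["g"] @ combined)   # candidate values
    f_t = sigmoid(W["f"] @ combined)   # forget vector
    o_t = sigmoid(W["o"] @ combined)   # output gate filter
    c_new = f_t * c_prev + i_t * g_t   # new long-term memory (cell state)
    h_new = o_t * np.tanh(c_new)       # new short-term memory (hidden state)
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 10
W = {name: rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for name in "igfo"}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(3, input_dim)):   # a made-up sequence of 3 time steps
    h, c = lstm_step(x_t, h, c, W)

print(h.shape, c.shape)
```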

#### These gates and the way they are wired together are what make up an LSTM!
#### It doesn't seem so scary now that we have broken down the 3 different gates, right?
#### Let's see what this all looks like in code.

<p style="text-align: center; font-size:30px;">Code Implementation</p>

#### With PyTorch, all of the above is succinctly packaged in the `nn.LSTM` module.
#### All we need to provide is the input dimension, the hidden dimension and the number of layers.
#### More information can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html).

In [1]:
import torch 
import torch.nn as nn 

input_dim = 5
hidden_dim = 10 
n_layers = 2

lstm_layer = nn.LSTM(input_dim, hidden_dim, n_layers, batch_first=True)  # batch_first=True: inputs are shaped (batch_size, seq_len, input_dim)

#### Our dummy input needs 5 features per time step, so it will have a shape like (1, 1, 5), which translates to (batch_size, sequence_length, input_dim).
#### We also have to initialize the hidden state and the cell state before running the model.
#### We will store these two in a tuple: (hidden state, cell state).

In [2]:
batch_size = 1
seq_len = 1

inp = torch.randn(batch_size, seq_len, input_dim)
hidden_state = torch.randn(n_layers, batch_size, hidden_dim)
cell_state = torch.randn(n_layers, batch_size, hidden_dim)
hidden = (hidden_state, cell_state)

print(inp)
print("\n")
print(hidden)

tensor([[[ 1.4733,  2.9324, -0.9992, -1.3117, -0.4922]]])


(tensor([[[ 0.6269, -0.8276,  1.0036, -1.2189, -0.8868, -0.6501,  0.0609,
          -0.4498,  0.1131, -0.7280]],

        [[ 0.4134, -0.0928, -0.3454, -0.6712,  0.3524, -0.5298, -0.3610,
           1.8115, -0.5859, -1.4539]]]), tensor([[[ 1.3978, -1.0708,  0.0432,  0.5728, -1.0821, -0.2057, -1.2685,
          -0.5819, -0.5274,  2.8776]],

        [[ 0.2657, -1.1280, -2.4326,  1.2444,  1.4502, -1.9727,  1.1661,
           0.9338,  0.3324, -0.0380]]]))


#### Let's feed in the input and hidden states to the LSTM layer and see what we get out of it.

In [3]:
out, hidden = lstm_layer(inp, hidden)
print("Output shape: ", out.shape)
print("Hidden: ", hidden)

Output shape:  torch.Size([1, 1, 10])
Hidden:  (tensor([[[ 0.2829, -0.1436, -0.0019,  0.5156, -0.5277, -0.0451, -0.3547,
          -0.2026, -0.2404,  0.5355]],

        [[ 0.0401, -0.4179, -0.3990,  0.2737,  0.2545, -0.4048,  0.3471,
           0.1079, -0.0328, -0.1444]]], grad_fn=<StackBackward0>), tensor([[[ 0.8280, -0.5253, -0.0040,  0.9147, -0.8501, -0.1955, -0.5973,
          -0.8007, -0.9457,  1.3947]],

        [[ 0.0735, -0.8427, -1.6736,  0.4338,  1.1018, -1.0239,  0.4952,
           0.2497, -0.0474, -0.2649]]], grad_fn=<StackBackward0>))


#### In this scenario, the input only had a sequence length of 1. 
#### However, in most cases, our input would have a sequence length that is way longer, especially for NLP tasks. 
#### In that case, we just need to change the sequence length and we would obtain an output with that sequence length. 

In [4]:
seq_len = 3
inp = torch.randn(batch_size, seq_len, input_dim)
out, hidden = lstm_layer(inp, hidden)
print(out.shape)

torch.Size([1, 3, 10])


#### As we can see, the output shows a sequence length of 3 after taking in an input of sequence length 3. 
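
#### To make the relationship between the two return values more concrete, here is a small self-contained check.
#### It is a sketch on random data, separate from the cells above: `out` holds the last layer's hidden state for every time step, so its final time step should match the last layer's entry in the returned hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=10, num_layers=2, batch_first=True)

inp = torch.randn(1, 3, 5)     # (batch_size, seq_len, input_dim)
out, (h_n, c_n) = lstm(inp)    # states that are not passed in default to zeros

print(out.shape)   # one output per time step, taken from the last layer
print(h_n.shape)   # final hidden state of every layer
print(torch.allclose(out[:, -1, :], h_n[-1]))
```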

#### Now that we know how an LSTM layer in PyTorch works, let's put this to practical use.
#### We will be utilizing a dataset from Kaggle that contains Amazon reviews that are classified as either positive or negative. 
#### Let's see what the data looks like.

In [5]:
import bz2
from collections import Counter
import re
import nltk
import numpy as np
from dotenv import load_dotenv
import os
from tqdm import tqdm

load_dotenv()
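# Path prefix of the dataset files, read from .env (it is concatenated directly, so it must end with a path separator)
data_path = os.getenv("AMAZON_REVIEWS_PATH")

train_file = bz2.BZ2File(data_path + 'train.ft.txt.bz2', "r")
test_file = bz2.BZ2File(data_path + 'test.ft.txt.bz2', "r")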

train_file = train_file.readlines()
test_file = test_file.readlines()

num_train = 3600000//4  # We're training on the first 900,000 reviews in the dataset
num_test = 400000//4  # Using 100,000 reviews from the test set
 
train_file = [x.decode('utf-8') for x in train_file[:num_train]]
test_file = [x.decode('utf-8') for x in test_file[:num_test]]

In [6]:
train_file[0]

'__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^\n'

#### Each line is in the format `__label__<class> <review text>`.
#### Positive sentiment is stored as 2 and negative sentiment as 1.
#### However, we will change these to 1 for positive and 0 for negative.
#### We will also replace every digit with 0.
#### Some reviews also contain URLs, which are irrelevant in our case, so we will replace them with a standard <url\> token.

In [7]:
# Extracting labels from sentences
train_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in train_file]
train_sentences = [x.split(' ', 1)[1][:-1].lower() for x in train_file]

test_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test_file]
test_sentences = [x.split(' ', 1)[1][:-1].lower() for x in test_file]

# Some simple cleaning of data: replace every digit with 0
for i in range(len(train_sentences)):
    train_sentences[i] = re.sub(r'\d', '0', train_sentences[i])

for i in range(len(test_sentences)):
    test_sentences[i] = re.sub(r'\d', '0', test_sentences[i])

# Modify URLs to <url>
for i in range(len(train_sentences)):
    if 'www.' in train_sentences[i] or 'http:' in train_sentences[i] or 'https:' in train_sentences[i] or '.com' in train_sentences[i]:
        train_sentences[i] = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", train_sentences[i])
        
for i in range(len(test_sentences)):
    if 'www.' in test_sentences[i] or 'http:' in test_sentences[i] or 'https:' in test_sentences[i] or '.com' in test_sentences[i]:
        test_sentences[i] = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", test_sentences[i])
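
#### To see what these steps do to a single line, here is a small check on a made-up review (the text and the URL are invented for illustration).
#### Note that the pattern matches any run of non-space characters ending in a dot followed by three lowercase letters, so it is a heuristic rather than a strict URL matcher.

```python
import re

line = "__label__2 Great read: see www.example.com for more\n"  # made-up review

label_token, text = line.split(" ", 1)
label = 0 if label_token == "__label__1" else 1  # 1 = positive, 0 = negative
text = text[:-1].lower()                         # drop the trailing newline

# Same pattern as above: a run of non-space characters that ends in ".xyz"
text = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", text)

print(label, text)
```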

#### Now that the cleaning is done, we have to perform **tokenization**.
#### Tokenization is the act of splitting a sentence up into tokens, be it words or punctuation marks.

In [8]:
words = Counter()  # Dictionary that will map a word to the number of times it appeared in all the training sentences
for i, sentence in enumerate(tqdm(train_sentences)):
    # The sentences will be stored as a list of words/tokens
    train_sentences[i] = []
    for word in nltk.word_tokenize(sentence):  # Tokenizing the words
        words.update([word.lower()])  # Converting all the words to lowercase
        train_sentences[i].append(word)

100%|██████████| 900000/900000 [04:15<00:00, 3520.05it/s]
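
#### To make the idea of tokenization concrete without any NLTK data, here is a rough regex-based stand-in.
#### The helper `simple_tokenize` is hypothetical and much cruder than `nltk.word_tokenize`, which handles contractions and many other cases and may split some text differently.

```python
import re

def simple_tokenize(sentence):
    """Rough stand-in for a tokenizer: words and single punctuation marks become tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("this sound track was beautiful!")
print(tokens)
```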


#### Additionally, to remove very rare words, we will drop words that occur only once.
#### To account for unknown words and padding, we will add special tokens for them to our vocabulary too.

In [9]:
# Removing the words that only appear once
words = {k:v for k,v in words.items() if v>1}
# Sorting the words according to the number of appearances, with the most common word being first
words = sorted(words, key=words.get, reverse=True)
# Adding padding and unknown to our vocabulary so that they will be assigned an index
words = ['_PAD','_UNK'] + words
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i,o in enumerate(words)}
idx2word = {i:o for i,o in enumerate(words)}
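
#### The same steps can be followed on a toy corpus, which makes the resulting indices easy to inspect.
#### The sentences and the `toy_` names below are made up for illustration and do not touch the real vocabulary.

```python
from collections import Counter

toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]

counts = Counter(word for sentence in toy_sentences for word in sentence)
kept = {w: c for w, c in counts.items() if c > 1}  # drop words seen only once
toy_vocab = ["_PAD", "_UNK"] + sorted(kept, key=kept.get, reverse=True)
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

# Words that are not in the vocabulary fall back to the _UNK index
encoded = [toy_word2idx.get(w, toy_word2idx["_UNK"]) for w in ["the", "fish", "sat"]]

print(toy_vocab)
print(encoded)
```

#### Because `_PAD` comes first, it receives index 0, which is why padding with 0s later on is consistent with the vocabulary.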

#### Now that we have the tokenization and indexing in place, let us apply them to our sentences.

In [10]:
for i, sentence in enumerate(tqdm(train_sentences)):
    # Looking up the mapping dictionary and assigning the index to the respective words
    train_sentences[i] = [word2idx[word] if word in word2idx else word2idx['_UNK'] for word in sentence]

100%|██████████| 900000/900000 [00:05<00:00, 155450.00it/s]


In [11]:
for i, sentence in enumerate(tqdm(test_sentences)):
    # For test sentences, we have to tokenize the sentences as well
    test_sentences[i] = [word2idx[word.lower()] if word.lower() in word2idx else word2idx['_UNK'] for word in nltk.word_tokenize(sentence)]

100%|██████████| 100000/100000 [00:27<00:00, 3633.09it/s]


#### Now onto the last preprocessing step. 
#### We'll be padding the sentences with 0s and shortening the lengthy sentences so that the data can be trained in batches to speed things up.

In [12]:
# Defining a function that either shortens sentences or pads sentences with 0 to a fixed length
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            # Padding will be applied to the front
            # If the length of the sentence is shorter than the seq_len, it will be padded with 0s from the start
            features[ii, -len(review):] = np.array(review)[:seq_len] 
    return features

seq_len = 200  # The length that the sentences will be padded/shortened to

train_sentences = pad_input(train_sentences, seq_len)
test_sentences = pad_input(test_sentences, seq_len)

# Converting our labels into numpy arrays
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)
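
#### A quick check on two made-up index sequences shows both behaviours of this function.
#### The function body is repeated here only so that the snippet runs on its own; the toy inputs and the length of 4 are arbitrary.

```python
import numpy as np

def pad_input(sentences, seq_len):
    # Same logic as the function above, repeated so this snippet is self-contained
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]  # made-up index sequences
padded = pad_input(toy, 4)
print(padded)
```

#### The short sequence is padded with 0s on the left, while the long one keeps only its first 4 tokens.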

#### Let's see what a padded sentence looks like.

In [13]:
train_sentences[0]

array([     0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
      

#### Now that that's done, let us go ahead and divide our test set into validation and test sets.

In [14]:
split_frac = 0.5 # 50% validation, 50% test
split_id = int(split_frac * len(test_sentences))
val_sentences, test_sentences = test_sentences[:split_id], test_sentences[split_id:]
val_labels, test_labels = test_labels[:split_id], test_labels[split_id:]

#### Let's now feed this data into DataLoaders using PyTorch.

In [15]:
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn

train_data = TensorDataset(torch.from_numpy(train_sentences), torch.from_numpy(train_labels))
val_data = TensorDataset(torch.from_numpy(val_sentences), torch.from_numpy(val_labels))
test_data = TensorDataset(torch.from_numpy(test_sentences), torch.from_numpy(test_labels))

batch_size = 100

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

#### Let's set up our device too (here `mps`, PyTorch's backend for Apple GPUs) so that we won't be held back by our CPU.

In [16]:
device = torch.device('mps')
device

device(type='mps')
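
#### The `mps` device only exists on Apple hardware, so on other machines a fallback is needed.
#### One possible sketch, assuming PyTorch 1.12 or newer (where `torch.backends.mps` is available), is shown below; the helper name `pick_device` is made up for this example.

```python
import torch

def pick_device():
    """Prefer Apple's MPS backend, then CUDA, and fall back to the CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
print(device)
```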

#### Now onto the model architecture.

In [17]:
class SentimentNet(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super(SentimentNet, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, hidden):
        batch_size = x.size(0)
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        out = self.dropout(lstm_out)
        out = self.fc(out)
        out = self.sigmoid(out)
        
        out = out.view(batch_size, -1)
        out = out[:,-1]
        return out, hidden
    
    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().to(device),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().to(device))
        return hidden
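
#### The reshaping in `forward` can be hard to follow, so here is a small sketch of what it does to the shapes.
#### It uses a random tensor as a stand-in for the LSTM output, small made-up sizes, and leaves out dropout; under those assumptions, scoring every time step and keeping the last one gives the same result as scoring only the last time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_dim = 4, 6, 8                # small made-up sizes
lstm_out = torch.randn(batch_size, seq_len, hidden_dim)  # stand-in for the LSTM output
fc = nn.Linear(hidden_dim, 1)

# Path used in forward(): score every time step, then keep the last one
flat = lstm_out.contiguous().view(-1, hidden_dim)        # (batch_size * seq_len, hidden_dim)
scores = torch.sigmoid(fc(flat)).view(batch_size, -1)    # (batch_size, seq_len)
last_from_all = scores[:, -1]                            # (batch_size,)

# Equivalent shortcut: score only the last time step
last_only = torch.sigmoid(fc(lstm_out[:, -1, :])).squeeze(1)

print(last_from_all.shape)
print(torch.allclose(last_from_all, last_only))
```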

In [18]:
vocab_size = len(word2idx) + 1
output_size = 1
embedding_dim = 400
hidden_dim = 512
n_layers = 2

model = SentimentNet(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
model.to(device)

lr=0.005
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

#### Training our model.

In [19]:
epochs = 2
counter = 0
print_every = 100
clip = 5
valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0

model.train()
for i in range(epochs):
    h = model.init_hidden(batch_size)
    
    for inputs, labels in tqdm(train_loader):
        counter += 1
        h = tuple([e.data for e in h])
        inputs, labels = inputs.to(device), labels.to(device)
        model.zero_grad()
        output, h = model(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        if counter%print_every == 0:
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inp, lab in val_loader:
                val_h = tuple([each.data for each in val_h])
                inp, lab = inp.to(device), lab.to(device)
                out, val_h = model(inp, val_h)
                val_loss = criterion(out.squeeze(), lab.float())
                val_losses.append(val_loss.item())
                
            model.train()
            print("Epoch: {}/{}...".format(i+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            if np.mean(val_losses) <= valid_loss_min:
                torch.save(model.state_dict(), './state_dict.pt')
                print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,np.mean(val_losses)))
                valid_loss_min = np.mean(val_losses)

  1%|          | 99/9000 [00:50<1:18:38,  1.89it/s]

Epoch: 1/2... Step: 100... Loss: 0.486787... Val Loss: 0.502854


  1%|          | 100/9000 [02:06<56:45:35, 22.96s/it]

Validation loss decreased (inf --> 0.502854).  Saving model ...


  2%|▏         | 199/9000 [03:12<1:36:47,  1.52it/s] 

Epoch: 1/2... Step: 200... Loss: 0.417106... Val Loss: 0.370372


  2%|▏         | 200/9000 [04:58<78:44:59, 32.22s/it]

Validation loss decreased (0.502854 --> 0.370372).  Saving model ...


  3%|▎         | 299/9000 [06:27<2:04:21,  1.17it/s] 

Epoch: 1/2... Step: 300... Loss: 0.295222... Val Loss: 0.294911


  3%|▎         | 300/9000 [08:10<76:36:17, 31.70s/it]

Validation loss decreased (0.370372 --> 0.294911).  Saving model ...


  4%|▍         | 399/9000 [09:30<1:38:37,  1.45it/s] 

Epoch: 1/2... Step: 400... Loss: 0.189338... Val Loss: 0.286997


  4%|▍         | 400/9000 [11:01<66:31:54, 27.85s/it]

Validation loss decreased (0.294911 --> 0.286997).  Saving model ...


  6%|▌         | 499/9000 [12:13<1:25:44,  1.65it/s] 

Epoch: 1/2... Step: 500... Loss: 0.331447... Val Loss: 0.253947


  6%|▌         | 500/9000 [13:33<58:07:17, 24.62s/it]

Validation loss decreased (0.286997 --> 0.253947).  Saving model ...


  7%|▋         | 599/9000 [14:38<1:31:10,  1.54it/s] 

Epoch: 1/2... Step: 600... Loss: 0.231808... Val Loss: 0.248325


  7%|▋         | 600/9000 [15:53<53:34:41, 22.96s/it]

Validation loss decreased (0.253947 --> 0.248325).  Saving model ...


  8%|▊         | 699/9000 [16:54<1:23:48,  1.65it/s] 

Epoch: 1/2... Step: 700... Loss: 0.240730... Val Loss: 0.239384


  8%|▊         | 700/9000 [18:12<55:23:18, 24.02s/it]

Validation loss decreased (0.248325 --> 0.239384).  Saving model ...


  9%|▉         | 800/9000 [20:33<54:48:18, 24.06s/it]

Epoch: 1/2... Step: 800... Loss: 0.340206... Val Loss: 0.240664


 10%|█         | 900/9000 [22:53<54:16:57, 24.13s/it]

Epoch: 1/2... Step: 900... Loss: 0.271002... Val Loss: 0.245783


 11%|█         | 999/9000 [23:59<1:22:42,  1.61it/s] 

Epoch: 1/2... Step: 1000... Loss: 0.248659... Val Loss: 0.230929


 11%|█         | 1000/9000 [25:16<52:27:25, 23.61s/it]

Validation loss decreased (0.239384 --> 0.230929).  Saving model ...


 12%|█▏        | 1100/9000 [27:39<50:19:52, 22.94s/it]

Epoch: 1/2... Step: 1100... Loss: 0.217460... Val Loss: 0.244217


 13%|█▎        | 1199/9000 [28:42<1:23:51,  1.55it/s] 

Epoch: 1/2... Step: 1200... Loss: 0.199470... Val Loss: 0.222822


 13%|█▎        | 1200/9000 [30:02<53:03:13, 24.49s/it]

Validation loss decreased (0.230929 --> 0.222822).  Saving model ...


 14%|█▍        | 1299/9000 [31:02<1:17:09,  1.66it/s] 

Epoch: 1/2... Step: 1300... Loss: 0.187632... Val Loss: 0.222632


 14%|█▍        | 1300/9000 [32:23<52:17:48, 24.45s/it]

Validation loss decreased (0.222822 --> 0.222632).  Saving model ...


 16%|█▌        | 1399/9000 [33:25<1:25:19,  1.48it/s] 

Epoch: 1/2... Step: 1400... Loss: 0.153965... Val Loss: 0.219430


 16%|█▌        | 1400/9000 [34:44<51:12:58, 24.26s/it]

Validation loss decreased (0.222632 --> 0.219430).  Saving model ...


 17%|█▋        | 1499/9000 [35:46<1:20:44,  1.55it/s] 

Epoch: 1/2... Step: 1500... Loss: 0.189915... Val Loss: 0.214531


 17%|█▋        | 1500/9000 [37:04<49:07:47, 23.58s/it]

Validation loss decreased (0.219430 --> 0.214531).  Saving model ...


 18%|█▊        | 1600/9000 [39:24<48:19:03, 23.51s/it]

Epoch: 1/2... Step: 1600... Loss: 0.279372... Val Loss: 0.216673


 19%|█▉        | 1700/9000 [41:43<47:55:27, 23.63s/it]

Epoch: 1/2... Step: 1700... Loss: 0.163903... Val Loss: 0.217184


 20%|██        | 1800/9000 [44:00<45:47:52, 22.90s/it]

Epoch: 1/2... Step: 1800... Loss: 0.247262... Val Loss: 0.214748


 21%|██        | 1899/9000 [45:05<1:13:39,  1.61it/s] 

Epoch: 1/2... Step: 1900... Loss: 0.208919... Val Loss: 0.213868


 21%|██        | 1900/9000 [46:25<48:15:02, 24.47s/it]

Validation loss decreased (0.214531 --> 0.213868).  Saving model ...


 22%|██▏       | 1999/9000 [47:29<1:20:29,  1.45it/s] 

Epoch: 1/2... Step: 2000... Loss: 0.148321... Val Loss: 0.210958


 22%|██▏       | 2000/9000 [48:57<52:09:31, 26.82s/it]

Validation loss decreased (0.213868 --> 0.210958).  Saving model ...


 23%|██▎       | 2099/9000 [49:59<1:11:10,  1.62it/s] 

Epoch: 1/2... Step: 2100... Loss: 0.260450... Val Loss: 0.207000


 23%|██▎       | 2100/9000 [51:17<45:31:33, 23.75s/it]

Validation loss decreased (0.210958 --> 0.207000).  Saving model ...


 24%|██▍       | 2200/9000 [53:39<43:53:48, 23.24s/it]

Epoch: 1/2... Step: 2200... Loss: 0.304374... Val Loss: 0.208405


 26%|██▌       | 2300/9000 [55:59<44:28:06, 23.89s/it]

Epoch: 1/2... Step: 2300... Loss: 0.163608... Val Loss: 0.207344


 27%|██▋       | 2400/9000 [58:19<42:39:06, 23.26s/it]

Epoch: 1/2... Step: 2400... Loss: 0.241463... Val Loss: 0.208968


 28%|██▊       | 2500/9000 [1:01:04<53:36:08, 29.69s/it]

Epoch: 1/2... Step: 2500... Loss: 0.185054... Val Loss: 0.213986


 29%|██▉       | 2600/9000 [1:03:32<43:22:39, 24.40s/it]

Epoch: 1/2... Step: 2600... Loss: 0.181295... Val Loss: 0.209748


 30%|███       | 2700/9000 [1:06:02<46:21:48, 26.49s/it]

Epoch: 1/2... Step: 2700... Loss: 0.142324... Val Loss: 0.215918


 31%|███       | 2799/9000 [1:07:22<1:13:41,  1.40it/s] 

Epoch: 1/2... Step: 2800... Loss: 0.215754... Val Loss: 0.205298


 31%|███       | 2800/9000 [1:08:46<43:52:50, 25.48s/it]

Validation loss decreased (0.207000 --> 0.205298).  Saving model ...


 32%|███▏      | 2900/9000 [1:11:10<42:25:28, 25.04s/it]

Epoch: 1/2... Step: 2900... Loss: 0.191231... Val Loss: 0.209452


 33%|███▎      | 2999/9000 [1:12:17<1:02:24,  1.60it/s] 

Epoch: 1/2... Step: 3000... Loss: 0.079199... Val Loss: 0.203712


 33%|███▎      | 3000/9000 [1:13:52<48:20:45, 29.01s/it]

Validation loss decreased (0.205298 --> 0.203712).  Saving model ...


 34%|███▍      | 3100/9000 [1:16:58<52:22:17, 31.96s/it]

Epoch: 1/2... Step: 3100... Loss: 0.292360... Val Loss: 0.222340


 36%|███▌      | 3199/9000 [1:18:18<1:16:12,  1.27it/s] 

Epoch: 1/2... Step: 3200... Loss: 0.170858... Val Loss: 0.199280


 36%|███▌      | 3200/9000 [1:19:44<42:39:39, 26.48s/it]

Validation loss decreased (0.203712 --> 0.199280).  Saving model ...


 37%|███▋      | 3300/9000 [1:22:02<37:45:17, 23.85s/it]

Epoch: 1/2... Step: 3300... Loss: 0.191370... Val Loss: 0.200201


 38%|███▊      | 3400/9000 [1:24:48<48:09:02, 30.95s/it]

Epoch: 1/2... Step: 3400... Loss: 0.193973... Val Loss: 0.209092


 39%|███▉      | 3500/9000 [1:27:18<39:03:22, 25.56s/it]

Epoch: 1/2... Step: 3500... Loss: 0.227910... Val Loss: 0.211358


 40%|████      | 3600/9000 [1:30:15<45:23:17, 30.26s/it]

Epoch: 1/2... Step: 3600... Loss: 0.109052... Val Loss: 0.208596


 41%|████      | 3700/9000 [1:32:39<34:06:53, 23.17s/it]

Epoch: 1/2... Step: 3700... Loss: 0.197191... Val Loss: 0.206733


 42%|████▏     | 3799/9000 [1:33:41<56:50,  1.53it/s]   

Epoch: 1/2... Step: 3800... Loss: 0.153083... Val Loss: 0.197325


 42%|████▏     | 3800/9000 [1:34:57<33:25:40, 23.14s/it]

Validation loss decreased (0.199280 --> 0.197325).  Saving model ...


 43%|████▎     | 3899/9000 [1:36:00<56:07,  1.51it/s]   

Epoch: 1/2... Step: 3900... Loss: 0.236235... Val Loss: 0.193792


 43%|████▎     | 3900/9000 [1:37:19<34:09:31, 24.11s/it]

Validation loss decreased (0.197325 --> 0.193792).  Saving model ...


 44%|████▍     | 4000/9000 [1:39:36<31:56:38, 23.00s/it]

Epoch: 1/2... Step: 4000... Loss: 0.272521... Val Loss: 0.195973


 46%|████▌     | 4100/9000 [1:41:54<31:42:14, 23.29s/it]

Epoch: 1/2... Step: 4100... Loss: 0.093463... Val Loss: 0.196697


 47%|████▋     | 4200/9000 [1:44:12<31:19:41, 23.50s/it]

Epoch: 1/2... Step: 4200... Loss: 0.208734... Val Loss: 0.198449


 48%|████▊     | 4300/9000 [1:46:27<30:13:06, 23.15s/it]

Epoch: 1/2... Step: 4300... Loss: 0.236781... Val Loss: 0.197796


 49%|████▉     | 4400/9000 [1:48:46<30:46:56, 24.09s/it]

Epoch: 1/2... Step: 4400... Loss: 0.235728... Val Loss: 0.198216


 50%|█████     | 4500/9000 [1:51:04<29:21:07, 23.48s/it]

Epoch: 1/2... Step: 4500... Loss: 0.195045... Val Loss: 0.197779


 51%|█████     | 4600/9000 [1:53:25<30:04:09, 24.60s/it]

Epoch: 1/2... Step: 4600... Loss: 0.139202... Val Loss: 0.215025


 52%|█████▏    | 4700/9000 [1:55:55<30:22:48, 25.43s/it]

Epoch: 1/2... Step: 4700... Loss: 0.175416... Val Loss: 0.202637


 53%|█████▎    | 4800/9000 [1:58:24<28:08:23, 24.12s/it]

Epoch: 1/2... Step: 4800... Loss: 0.156954... Val Loss: 0.200486


 54%|█████▍    | 4901/9000 [2:01:31<30:51:55, 27.11s/it]

Epoch: 1/2... Step: 4900... Loss: 0.185270... Val Loss: 0.198494


 56%|█████▌    | 5000/9000 [2:04:17<28:17:32, 25.46s/it]

Epoch: 1/2... Step: 5000... Loss: 0.274219... Val Loss: 0.195871


 57%|█████▋    | 5100/9000 [2:06:47<27:57:36, 25.81s/it]

Epoch: 1/2... Step: 5100... Loss: 0.309938... Val Loss: 0.203488


 58%|█████▊    | 5200/9000 [2:09:08<24:39:59, 23.37s/it]

Epoch: 1/2... Step: 5200... Loss: 0.288327... Val Loss: 0.198016


 59%|█████▉    | 5300/9000 [2:11:31<25:21:50, 24.68s/it]

Epoch: 1/2... Step: 5300... Loss: 0.211069... Val Loss: 0.209740


 60%|██████    | 5400/9000 [2:13:50<23:39:57, 23.67s/it]

Epoch: 1/2... Step: 5400... Loss: 0.171950... Val Loss: 0.199787


 61%|██████    | 5500/9000 [2:16:07<21:43:00, 22.34s/it]

Epoch: 1/2... Step: 5500... Loss: 0.155911... Val Loss: 0.203747


 62%|██████▏   | 5600/9000 [2:18:19<21:07:09, 22.36s/it]

Epoch: 1/2... Step: 5600... Loss: 0.141213... Val Loss: 0.200105


 63%|██████▎   | 5700/9000 [2:20:32<20:31:20, 22.39s/it]

Epoch: 1/2... Step: 5700... Loss: 0.162184... Val Loss: 0.204469


 64%|██████▍   | 5800/9000 [2:22:45<20:01:38, 22.53s/it]

Epoch: 1/2... Step: 5800... Loss: 0.267422... Val Loss: 0.202936


 65%|██████▌   | 5888/9000 [2:23:37<30:46,  1.69it/s]   

#### Let's see how well our model has performed on the test set.

In [20]:
# Loading the best model
model.load_state_dict(torch.load('./state_dict.pt'))

test_losses = []
num_correct = 0
h = model.init_hidden(batch_size)

model.eval()
for inputs, labels in tqdm(test_loader):
    h = tuple([each.data for each in h])
    inputs, labels = inputs.to(device), labels.to(device)
    output, h = model(inputs, h)
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    pred = torch.round(output.squeeze())  # Rounds the output to 0/1
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

print("Test loss: {:.3f}".format(np.mean(test_losses)))
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}%".format(test_acc*100))

100%|██████████| 500/500 [00:59<00:00,  8.47it/s]

Test loss: 0.192
Test accuracy: 92.560%





#### A test accuracy of about 92.6%! Impressive!
#### I hope this shows how powerful an LSTM can be.