# Sentiment Analysis with an RNN

In this notebook, I implement a recurrent neural network (RNN) that performs sentiment analysis.
>Using an RNN rather than a strictly feedforward network can be more accurate because it lets us include information about the *sequence* of words.

Here I'll use a dataset of Amazon baby product reviews, accompanied by product names and ratings.


### Network Architecture

The architecture for this network is shown below.

<img src="assets/network_diagram_product.png" width=40%>

>**First, I'll pass in words to an embedding layer.** We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You can actually train an embedding with the Skip-gram Word2Vec model and use those embeddings as input, here. However, it's good enough to just have an embedding layer and let the network learn a different embedding table on its own. *In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.*

>**After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells.** The LSTM cells will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in the product review data. 

>**Finally, the LSTM outputs will go to a sigmoid output layer.** We're using a sigmoid function because positive and negative sentiment are labeled 1 and 0, respectively, and a sigmoid outputs predicted sentiment values between 0 and 1.

We don't care about the sigmoid outputs except for the **very last one**; we can ignore the rest. We'll calculate the loss by comparing the output at the last time step and the training label (pos or neg).
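
To make the last-step idea concrete, here is a minimal pure-Python sketch. The per-step scores are made-up numbers, not outputs of the trained network, and `sigmoid` and `bce` are small helpers written only for this illustration; the notebook itself relies on PyTorch for both.

```python
import math

def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def bce(prob, label):
    """Binary cross-entropy for one prediction and a 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Hypothetical raw scores, one per time step (word) of a single review
step_scores = [-0.3, 0.1, 0.8, 2.0]

# The sigmoid is applied at every step, but only the last output is kept
step_probs = [sigmoid(z) for z in step_scores]
last_prob = step_probs[-1]

label = 1  # 1 = positive, 0 = negative
loss = bce(last_prob, label)

print(f"Last-step probability: {last_prob:.3f}")
print(f"Loss against label {label}: {loss:.3f}")
```

The earlier steps still matter indirectly: the LSTM carries their information forward in its hidden state, so the last output summarizes the whole review.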


#### Outline:

* [Load in and visualize the data](#1)
* [Data pre-processing](#2)
    * [Tokenizing the words](#2_1)
    * [Encoding the labels](#2_2)
    * [Removing Outliers](#2_3)
* [Padding sequences](#3)
* [Training, Validation, Test split](#4)
* [DataLoaders and Batching](#5)
* [Sentiment Network with PyTorch](#6)
    * [Instantiate the network](#6_1)
    * [Training](#6_2)
    * [Testing](#6_3)

---
<a id ="1"></a>
## Load in and visualize the data

In [1]:
import numpy as np
import pandas as pd

products = pd.read_csv("data/amazon_baby.csv")
products.head()

| | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase.  I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


In [2]:
products["review"][0]

'These flannel wipes are OK, but in my opinion not worth keeping.  I also ordered someImse Vimse Cloth Wipes-Ocean Blue-12 countwhich are larger, had a nicer, softer texture and just seemed higher quality.  I use cloth wipes for hands and faces and have been usingThirsties 6 Pack Fab Wipes, Boyfor about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'

In [3]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
name      183213 non-null object
review    182702 non-null object
rating    183531 non-null int64
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


<a id="2"></a>
## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since I'm using embedding layers, I'll encode each word with an integer. I'll also want to clean it up a bit.

You can see an example of the review data above. Here are the processing steps I'll take:
> * I'll remove all products without a review or name.
> * I'll also remove all the reviews whose rating is 3, because a rating of 3 can't be clearly labeled as positive or negative.
> * I'll make all the characters lowercase.
> * I'll get rid of periods and extraneous punctuation.
> * Then I'll combine all the reviews together into one big string.

First, let's drop the incomplete and neutral rows. Then I'll remove all punctuation, gather all the text, and split it into individual words.
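
The cleaning steps can be tried on a single string first. The sketch below uses `str.translate`, an alternative to the character-by-character filter used in this notebook; `clean_review` and the sample sentence are made up for illustration.

```python
from string import punctuation

def clean_review(text):
    """Lowercase a review and strip ASCII punctuation."""
    # With a third argument, str.maketrans maps those characters to None
    table = str.maketrans("", "", punctuation)
    return text.lower().translate(table)

sample = "It came early, and was NOT disappointed!"
cleaned = clean_review(sample)

print(cleaned)          # it came early and was not disappointed
print(cleaned.split())
```

Note that stripping punctuation also merges contractions: "I'd" becomes "id", which the vocabulary then treats as its own word.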

In [4]:
# Remove all the products without a review or name
print(products.isna().sum())
print()
products.dropna(axis = 0, inplace = True)
print(products.isna().sum())

name      318
review    829
rating      0
dtype: int64

name      0
review    0
rating    0
dtype: int64


In [5]:
# Remove all the reviews whose rating is 3
print(len(products[products["rating"] == 3]))
products = products[products["rating"] != 3].copy()  # copy so later column assignments act on an independent frame

assert len(products[products["rating"] == 3]) == 0               

16705


In [6]:
# Reset index after removing some rows
products.reset_index(drop = True, inplace=True)

In [7]:
# Check the number of remaining reviews and unique products
print(f"Number of the reviews : {products.shape[0]}")
print(f"Number of unique products : {products['name'].nunique()}")

Number of the reviews : 165679
Number of unique products : 30629


>As you can see, we have `165679` reviews for `30629` unique products.

In [8]:
from string import punctuation

print(punctuation)

# Make all the characters lowercase
products["review"] = products["review"].str.lower()

# Remove all punctuation characters
products["review"] = products["review"].apply(lambda review: ''.join([c for c in review if c not in punctuation]))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [9]:
#  Combine all the reviews together into one big string.
reviews = list(products["review"])
all_text = " ".join(reviews)

# Create a list of the words
words = all_text.split()

In [10]:
words[:20]

['it',
 'came',
 'early',
 'and',
 'was',
 'not',
 'disappointed',
 'i',
 'love',
 'planet',
 'wise',
 'bags',
 'and',
 'now',
 'my',
 'wipe',
 'holder',
 'it',
 'keps',
 'my']

In [11]:
len(words)

13326115

<a id = "2_1"></a>
### Tokenizing the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> * Now I'm going to encode the words with integers. Later I'm going to pad the input vectors with zeros, so the integers **start at 1, not 0**.
> * Also, I'm going to convert the reviews to integers and store them in a new list called `reviews_ints`.
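
Here is the same idea on a toy corpus, small enough to check by hand. The three reviews and the `toy_` names are invented for this sketch so they don't clash with the notebook's own variables.

```python
from collections import Counter

toy_reviews = ["i love it", "i love this wipe holder", "not worth it"]

# Count words across all reviews, most frequent first
counts = Counter(" ".join(toy_reviews).split())
vocab_sorted = [word for word, _ in counts.most_common()]

# Start at 1 so that 0 stays free for padding
toy_vocab_to_int = {word: i for i, word in enumerate(vocab_sorted, start=1)}

# Replace every word by its integer id
toy_reviews_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]

print(toy_vocab_to_int)
print(toy_reviews_ints)  # [[1, 2, 3], [1, 2, 4, 5, 6], [7, 8, 3]]
```

Because ids follow frequency order, the most common words get the smallest integers.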

In [12]:
from collections import Counter

# Count of each word and sort from most frequent to least
vocab_count = Counter(words)
vocab_sorted = [vocab for vocab,_ in vocab_count.most_common()]

# Build a dictionary that maps words to integers
vocab_to_int = {vocab: i+1 for i, vocab in enumerate(vocab_sorted)}

# Store the tokenized reviews in reviews_ints
reviews_ints = [[vocab_to_int[word] for word in review.split()] for review in reviews]

In [13]:
# Stats about vocabulary
print('Unique words: ', len(vocab_to_int))
print()

# Print the first four tokenized reviews
print('Tokenized review: \n', reviews_ints[:4])

Unique words:  141028

Tokenized review: 
 [[3, 254, 1057, 2, 16, 21, 481, 4, 50, 3511, 2460, 360, 2, 79, 10, 754, 679, 3, 48247, 10, 5472, 455, 4140, 2, 117, 21, 467, 245, 100, 3], [28, 148, 2, 196, 2, 829, 73, 3, 48248, 1, 505, 144, 256, 48249, 100, 5, 380, 213, 9, 8, 759, 11, 2817], [8, 7, 6, 66, 58, 205, 1, 221, 4, 17, 21, 218, 328, 506, 45, 8, 2, 3, 7, 6, 1597, 5130, 6734, 5, 2510, 1, 3165, 106, 4, 50, 220, 67, 8, 66, 7, 121, 75, 11178, 10, 86, 52, 12, 225, 2074, 11, 1, 3165, 41, 7, 20, 3538, 11, 845, 2, 94, 48, 55, 7070, 4, 50, 1, 8344, 1, 3186, 12, 1, 93, 2, 1, 3819, 6734, 11, 8, 1945], [42, 11, 10, 192, 17, 2672, 5017, 29, 4, 206, 5, 11179, 47, 99, 169, 665, 295, 4, 218, 48250, 5, 94, 3165, 7070, 6735, 3, 7, 81, 54, 133, 5, 137, 14, 82, 192, 5, 901, 47, 5, 1176, 263, 169, 665, 7, 212, 2, 338, 47, 286, 57, 5258, 7, 6, 386, 123, 798, 2, 6, 32, 249, 9, 893, 493, 23, 61, 774, 47, 3717, 198, 48251, 9, 8, 798, 23, 42, 1408]]


<a id = "2_2"></a>
### Encoding the labels


Our labels are product ratings. To use these labels in our network, we need to convert them to 0 and 1.
I'm going to convert ratings below 3 to `0` (negative) and ratings above 3 to `1` (positive).

**Note that I already removed all the reviews whose rating was equal to 3.**


In [14]:
# 1=positive (rating>3), 0=negative (rating<3) label conversion
encoded_labels = [int(rating > 3) for rating in products["rating"]]

In [15]:
encoded_labels[108:120]

[1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1]

In [16]:
list(products["rating"][108:120])

[4, 2, 5, 5, 5, 5, 5, 4, 4, 1, 4, 5]

<a id = "2_3"></a>
### Removing Outliers

As an additional pre-processing step, I want to make sure that the reviews are in good shape for standard processing. That is, the network will expect a standard input text size, and so, I'll want to shape the reviews into a specific length. I'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
2. Padding/truncating the remaining data so that we have reviews of the same length.

<img src="assets/outliers_padding_ex.png" width=40%>

Before I pad the review text, I should check for reviews of extremely short or long lengths; outliers that may mess with the training.

In [17]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print(f"Zero-length reviews: {review_lens[0]}")
print(f"Maximum review length: {max(review_lens)}")

Zero-length reviews: 1
Maximum review length: 2699


>So, a couple of issues here. We have one review with zero length, and the maximum review length is far too many steps for the RNN. I'll remove the zero-length review here and truncate the very long reviews in the padding step. This removes outliers and should allow the model to train more efficiently.

In [18]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

# Get the indices of reviews with length greater than 0
non_zero_ind = [i for i, review in enumerate(reviews_ints) if len(review) > 0]

# Remove any reviews/labels with zero length.
# The reviews have different lengths, so keep them in a plain list:
# recent NumPy versions refuse to build an array from ragged sequences.
reviews_ints = [reviews_ints[i] for i in non_zero_ind]
encoded_labels = np.array([encoded_labels[i] for i in non_zero_ind])

print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  165679
Number of reviews after removing outliers:  165678


---
<a id ="3"></a>
## Padding sequences

To deal with both short and very long reviews, I'll pad or truncate all the reviews to a specific length. For reviews shorter than some `seq_length`, I'll pad with 0s on the left. For reviews longer than `seq_length`, I'll truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

As a small example, if the `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

>The final `features` array is going to be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.
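
Before moving to the NumPy version, the rule can be checked on plain lists. `pad_left` is a hypothetical helper written for this sketch; it covers both the padding and the truncation case.

```python
def pad_left(tokens, seq_length):
    """Left-pad with 0s, or keep only the first seq_length tokens."""
    truncated = tokens[:seq_length]
    return [0] * (seq_length - len(truncated)) + truncated

# Short review: zeros are added on the left
print(pad_left([117, 18, 128], 10))
# [0, 0, 0, 0, 0, 0, 0, 117, 18, 128]

# Long review: only the first seq_length tokens survive
print(pad_left(list(range(1, 13)), 10))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Left-padding puts the real words at the end of the sequence, right before the time step whose output the network uses.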


In [19]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's 
        or truncated to the input seq_length.
    '''
    
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    
    for i, review in enumerate(reviews_ints):
        features[i, -len(review):] = np.array(review)[:seq_length]
    
    return features

In [20]:
seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements 
assert len(features)==len(reviews_ints), "Features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print columns 100-109 of the first 30 reviews
print(features[:30,100:110])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0    29     1  3165]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [ 1315  2188   316   967    68    52  1021  3840    24    41]
 [    0     0     0     0     0     0     0     0     0     0]
 [    6  1188     9  1087 12208     2     6  1716    36   303]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     4    69     8 14178     9   592     9    10]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0

<a id = "4"></a>
## Training, Validation, Test split

With the data in nice shape, I'll split it into training, validation, and test sets.
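
The split is positional: the first 90% of rows go to training and the remainder is cut in half. The arithmetic below works out the resulting set sizes for the 165,678 reviews kept after outlier removal; the variable names are chosen for this sketch only.

```python
n_reviews = 165678   # reviews left after removing the zero-length one
split_frac = 0.9

n_train = int(split_frac * n_reviews)
n_rest = n_reviews - n_train
n_valid = int(0.5 * n_rest)
n_test = n_rest - n_valid

print(n_train, n_valid, n_test)  # 149110 8284 8284
```

Since the rows are not shuffled before slicing, the three sets follow the original order of the CSV file.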

In [21]:
split_frac = 0.9


# Split data into training, validation, and test data (features and labels, x and y)
train_ind = int(split_frac*len(features))
train_x, train_y = features[:train_ind], encoded_labels[:train_ind]
others_x, others_y = features[train_ind:], encoded_labels[train_ind:]

valid_ind = int(0.5*len(others_x))
valid_x, valid_y = others_x[:valid_ind], others_y[:valid_ind]
test_x, test_y = others_x[valid_ind:], others_y[valid_ind:]

# Print out the shapes of resultant feature data
print("".ljust(20), "Feature Shapes:".ljust(20), "Label shape" )
print("Train set:".ljust(20), f"{train_x.shape}".ljust(20), train_y.shape)
print("Validation set:".ljust(20), f"{valid_x.shape}".ljust(20), valid_y.shape)
print("Test set:".ljust(20), f"{test_x.shape}".ljust(20), test_y.shape)

                     Feature Shapes:      Label shape
Train set:           (149110, 200)        (149110,)
Validation set:      (8284, 200)          (8284,)
Test set:            (8284, 200)          (8284,)


---
<a id= "5"></a>
## DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

This is an alternative to creating a generator function for batching our data into full batches.
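
For comparison, a hand-written generator that yields only full batches could look like the sketch below. `full_batches` and the toy data are hypothetical; unlike a `DataLoader` with `shuffle=True`, this version does not shuffle.

```python
def full_batches(features, labels, batch_size):
    """Yield (features, labels) slices, skipping a final partial batch."""
    n_full = len(features) // batch_size
    for b in range(n_full):
        start = b * batch_size
        yield features[start:start + batch_size], labels[start:start + batch_size]

# Toy data: 10 samples, batch size 4 -> 2 full batches, 2 samples dropped
toy_x = [[i, i + 1] for i in range(10)]
toy_y = [i % 2 for i in range(10)]

batches = list(full_batches(toy_x, toy_y, batch_size=4))
print(len(batches))   # 2
print(batches[0][1])  # [0, 1, 0, 1]
```

Dropping the last partial batch keeps every batch the same size, which matters here because the hidden state is created for a fixed `batch_size`.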

In [22]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

batch_size = 64

# Create data loaders
train_loader = DataLoader(train_data, batch_size = batch_size, shuffle = True, drop_last = True)
valid_loader = DataLoader(valid_data, batch_size = batch_size, shuffle = True, drop_last = True)
test_loader = DataLoader(test_data, batch_size = batch_size, shuffle = True, drop_last = True)

In [23]:
# obtain one batch of training data
train_iter = iter(train_loader)
sample_reviews, sample_labels = next(train_iter)

print('Sample input size: ', sample_reviews.size()) # batch_size, seq_length
print('Sample input: \n', sample_reviews)
print()
print('Sample label size: ', sample_labels.size()) # batch_size
print('Sample label: \n', sample_labels)

Sample input size:  torch.Size([64, 200])
Sample input: 
 tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  7.8100e+02,
          5.6780e+03,  1.7580e+03],
        [ 1.3510e+03,  3.7270e+03,  6.6900e+02,  ...,  7.0900e+02,
          2.0000e+00,  1.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.0500e+02,
          1.6400e+02,  1.1870e+03],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  6.0000e+00,
          5.4300e+02,  6.8000e+01],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.0000e+00,
          1.4180e+03,  1.2500e+02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.0100e+02,
          1.0520e+03,  6.3000e+01]])

Sample label size:  torch.Size([64])
Sample label: 
 tensor([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  0,  1,  1,  0,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  0,  1,  1,  0,  1,  1,  0,  1,  1,  1,  1,  1,  0,


---
<a id = "6"></a>
# Sentiment Network with PyTorch

The network architecture:

<img src="assets/network_diagram_product.png" width=40%>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden state size and a number of layers.
3. A fully-connected output layer that maps the LSTM layer outputs to a desired `output_size`.
4. A sigmoid activation layer that turns each output into a value between 0 and 1; the network returns **only the last sigmoid output**.
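
It can help to follow the tensor shapes through these four layers. The sketch below does only the shape bookkeeping, with no actual tensors, using the batch size, sequence length, and layer sizes chosen elsewhere in this notebook; the `shapes` dictionary is just an illustration device.

```python
batch_size, seq_length = 64, 200
embedding_dim, hidden_dim, output_size = 400, 256, 1

shapes = {}
shapes["input tokens"] = (batch_size, seq_length)
shapes["embeddings"] = (batch_size, seq_length, embedding_dim)
shapes["lstm output"] = (batch_size, seq_length, hidden_dim)
# All time steps are stacked so the linear layer sees one row per step
shapes["stacked"] = (batch_size * seq_length, hidden_dim)
shapes["linear + sigmoid"] = (batch_size * seq_length, output_size)
# Reshape to batch-first, then keep only the last time step
shapes["reshaped"] = (batch_size, seq_length * output_size)
shapes["last step only"] = (batch_size,)

for name, shape in shapes.items():
    print(f"{name:<18} {shape}")
```

The final shape has one value per review, which lines up with the one label per review in each batch.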

#### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are more than 141,000 words in the vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec, then load it here. But it's fine to just make a new layer, use it only for dimensionality reduction, and let the network learn the weights.
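
The lookup-table view can be checked with a tiny example: multiplying a one-hot vector by a weight matrix selects exactly one row, so indexing that row directly gives the same result without the wasted multiplications. The table values below are arbitrary, and `one_hot` and `matvec` are helpers written for this sketch.

```python
vocab_size, embedding_dim = 5, 3

# Hypothetical embedding table: one row of weights per token id
table = [[0.1 * (row + 1) + 0.01 * col for col in range(embedding_dim)]
         for row in range(vocab_size)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vec, matrix))
            for c in range(len(matrix[0]))]

token = 3
via_one_hot = matvec(one_hot(token, vocab_size), table)
via_lookup = table[token]

print(via_one_hot)
print(via_lookup)
```

With a vocabulary of this notebook's size, the one-hot route would multiply by more than 141,000 zeros per word, while the lookup touches a single row.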


#### The LSTM Layer(s)

I'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in the recurrent network, which takes in an `input_size`, a `hidden_dim`, a number of layers, a dropout probability (for dropout between multiple layers), and a `batch_first` parameter.

Stacking a few LSTM layers, typically 2 or 3, often improves performance, because additional layers let the network learn more complex relationships.

In [24]:
train_on_gpu = torch.cuda.is_available()

print("Training on GPU" if train_on_gpu else "Training on CPU. No GPU is available.")

Training on GPU


In [25]:
from torch import nn

class RNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob = 0.5):
        """
        Initialize the model by setting up the layers.
        """
        super().__init__()
        
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # layers
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, self.n_layers, 
                            dropout = drop_prob, 
                            batch_first = True)
        self.output = nn.Linear(self.hidden_dim, self.output_size)
        self.dropout = nn.Dropout(drop_prob)
        
    def forward(self, x, hidden):
        """
        Perform a forward pass of the model on some input and hidden state.
        """
        batch_size = x.size(0)
        
        # embeddings and lstm_out
        embeds = self.embed(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.output(out)
        sig_out = torch.sigmoid(out)  # nn.functional.sigmoid is deprecated
        
        # reshape to be batch_size first
        sig_out = sig_out.contiguous().view(batch_size, -1)
        sig_out = sig_out[:, -1]
        
        return sig_out, hidden
        
        
    def init_hidden(self, batch_size):
        """Initializes hidden state"""
        weight = next(self.parameters()).data
        
        # Initialized to zero, for hidden state and cell state of LSTM
        if train_on_gpu:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else :
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
            
        return hidden

<a id = "6_1"></a>
## Instantiate the network

Here, I'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of the embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better, at a higher computational cost.
* `n_layers`: Number of LSTM layers in the network. Typically between 1 and 3.
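
To get a feel for what these hyperparameters cost, the sketch below counts trainable parameters for the values used in this notebook. It assumes PyTorch's `nn.LSTM` layout, where each of the four gates has input weights, recurrent weights, and two bias vectors; `lstm_layer_params` is a helper written for this estimate.

```python
vocab_size = 141028 + 1          # unique words plus the padding token 0
embedding_dim, hidden_dim = 400, 256
n_layers, output_size = 3, 1

embedding_params = vocab_size * embedding_dim

def lstm_layer_params(input_dim, hidden_dim):
    # Four gates, each with input weights, recurrent weights and two biases
    return 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

# The first layer reads embeddings; deeper layers read the previous layer's output
lstm_params = lstm_layer_params(embedding_dim, hidden_dim)
lstm_params += (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)

linear_params = hidden_dim * output_size + output_size

print(f"Embedding: {embedding_params:,}")   # 56,411,600
print(f"LSTM:      {lstm_params:,}")        # 1,726,464
print(f"Linear:    {linear_params:,}")      # 257
```

Under these assumptions the embedding table dominates the model size, so `embedding_dim` and the vocabulary size are the main levers if memory is a concern.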



In [29]:
vocab_size = len(vocab_to_int) + 1  # +1 for the 0 used as padding
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 3

net = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

RNN(
  (embed): Embedding(141029, 400)
  (lstm): LSTM(400, 256, num_layers=3, batch_first=True, dropout=0.5)
  (output): Linear(in_features=256, out_features=1, bias=True)
  (dropout): Dropout(p=0.5)
)


<a id ="6_2"></a>
## Training

In [30]:
# loss and optimizer
lr = 0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr = lr)

In [31]:
epochs = 3

min_val_loss = np.inf # Lowest validation loss seen so far (np.Inf was removed in NumPy 2.0)
counter = 0
print_every = 100
clip = 5

if train_on_gpu:
    net.cuda()

net.train()
for e in range(epochs):
    hidden = net.init_hidden(batch_size)
    
    for in_reviews, labels in train_loader:
        counter += 1
        
        if train_on_gpu:
            in_reviews, labels = in_reviews.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history            
        hidden = tuple([each.data for each in hidden])
        
        # Zero accumulated gradients               
        optimizer.zero_grad()
        
        output, hidden = net(in_reviews, hidden)
        
        # Calculate the loss and perform backprop        
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        
        if counter%print_every == 0:
            val_hidden = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            
            for val_reviews, val_labels in valid_loader:
                if train_on_gpu:
                    val_reviews, val_labels = val_reviews.cuda(), val_labels.cuda()
                    
                val_hidden = tuple([each.data for each in val_hidden])
                
                val_output, val_hidden = net(val_reviews, val_hidden)
                
                val_loss = criterion(val_output.squeeze(), val_labels.float())
                
                val_losses.append(val_loss.item())
            
            val_loss_mean_batch = np.mean(val_losses)
            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(val_loss_mean_batch))
            
            # Save the model with the best validation loss
            if val_loss_mean_batch < min_val_loss:
                min_val_loss = val_loss_mean_batch
                torch.save(net.state_dict(), "model.pt")

Epoch: 1/3... Step: 100... Loss: 0.337720... Val Loss: 0.319798
Epoch: 1/3... Step: 200... Loss: 0.306408... Val Loss: 0.266114
Epoch: 1/3... Step: 300... Loss: 0.319464... Val Loss: 0.332644
Epoch: 1/3... Step: 400... Loss: 0.247424... Val Loss: 0.242880
Epoch: 1/3... Step: 500... Loss: 0.351944... Val Loss: 0.317771
Epoch: 1/3... Step: 600... Loss: 0.298553... Val Loss: 0.212778
Epoch: 1/3... Step: 700... Loss: 0.195063... Val Loss: 0.201189
Epoch: 1/3... Step: 800... Loss: 0.186409... Val Loss: 0.191201
Epoch: 1/3... Step: 900... Loss: 0.213330... Val Loss: 0.184585
Epoch: 1/3... Step: 1000... Loss: 0.123002... Val Loss: 0.211778
Epoch: 1/3... Step: 1100... Loss: 0.246231... Val Loss: 0.174623
Epoch: 1/3... Step: 1200... Loss: 0.185962... Val Loss: 0.166628
Epoch: 1/3... Step: 1300... Loss: 0.096090... Val Loss: 0.164244
Epoch: 1/3... Step: 1400... Loss: 0.211272... Val Loss: 0.163486
Epoch: 1/3... Step: 1500... Loss: 0.159955... Val Loss: 0.158895
Epoch: 1/3... Step: 1600... Loss: 

---
<a id = "6_3"></a>
## Testing

There are a few ways to test the network.

* **Test data performance:** First, I'll see how the trained model performs on all of the defined test_data, above. I'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, I'll see if I can input just one example review at a time (without a label), and see what the trained model predicts.
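
The accuracy computation boils down to thresholding the sigmoid outputs at 0.5 and counting matches. Here is a minimal sketch on made-up probabilities and labels; the `toy_` names are hypothetical, and the notebook does the thresholding with `torch.round`.

```python
# Hypothetical sigmoid outputs and true labels for six reviews
toy_probs = [0.92, 0.13, 0.51, 0.49, 0.88, 0.30]
toy_labels = [1, 0, 1, 1, 1, 0]

# Threshold at 0.5 to turn probabilities into class predictions
toy_preds = [1 if p > 0.5 else 0 for p in toy_probs]

toy_correct = sum(int(p == y) for p, y in zip(toy_preds, toy_labels))
toy_accuracy = toy_correct / len(toy_labels)

print(toy_preds)                # [1, 0, 1, 0, 1, 0]
print(f"{toy_accuracy:.3f}")    # 0.833
```

Keep in mind that accuracy alone can look flattering when most labels are positive, as they are in the sample batch shown earlier.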

### Test set loss and accuracy

In [33]:
# Get the best model for testing
device = "cuda:0" if torch.cuda.is_available() else "cpu"
net.load_state_dict(torch.load('model.pt', map_location=device ))

In [34]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Detach the hidden state from the previous batch's graph
    # (no backprop happens here, but this keeps the graph from growing)
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # Get predicted outputs
    output, h = net(inputs, h)
    
    # Calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # Convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # Compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())  # .cpu() is a no-op for CPU tensors
    num_correct += np.sum(correct)


# Avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

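# Accuracy over all test data
# Note: test_loader uses drop_last=True, so the final partial batch is never
# evaluated; dividing by the full dataset size slightly understates accuracy.
test_acc = num_correct/len(test_loader.dataset)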
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.124
Test accuracy: 0.951


### Inference on a test review

In [35]:
# negative test review
test_review_neg = "It was nothing special, and to be honest, I want my money back!!!"

In [36]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be 
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    #### Preprocessing steps ####
    
    test_review = test_review.lower()
    
    # Get rid of punctuation
    test_review = "".join([c for c in test_review if c not in punctuation])
    
    # Splitting to words
    test_review = test_review.split()
    
    # Tokenize the review, skipping words that are not in the vocabulary
    test_review = [[vocab_to_int[word] for word in test_review if word in vocab_to_int]]

    # Pad tokenized sequence
    test_review = pad_features(test_review, seq_length = sequence_length)
    
    #### Predict ####
    
    # Make sure dropout is disabled for inference
    net.eval()
    
    # Convert to Tensor
    test_review = torch.from_numpy(test_review)
    
    hidden = net.init_hidden(test_review.size(0))
    
    if(train_on_gpu):
        test_review = test_review.cuda()
        
    output, hidden = net(test_review, hidden)
    pred = torch.round(output.squeeze())
    
    # print custom response based on whether test_review is pos/neg
    print(f"The review is predicted as {'POSITIVE' if pred.item()==1 else 'NEGATIVE'}")
    

In [37]:
# positive test review
test_review_pos = "This product was better than I'd expected. I loved it."

In [38]:
# call function
# try negative and positive reviews!
seq_length=200
predict(net, test_review_neg, seq_length)

The review is predicted as NEGATIVE


In [39]:
predict(net, test_review_pos, seq_length)

The review is predicted as POSITIVE
