In [0]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [0]:
!ls '/content/drive/My Drive/data/'


In [0]:
PATH = 'data/'

In [0]:
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from string import punctuation
from collections import Counter
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

In [0]:
# read data from text files
with open(PATH+'reviews.txt', 'r') as file:
    reviews = file.read()
with open(PATH+'labels.txt', 'r') as file:
    labels = file.read()

FileNotFoundError: ignored

In [0]:
print(reviews[:1000])
print()
print(labels[:26])

## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:
>* We'll want to get rid of periods and other punctuation.
* Also, you might notice that the reviews are delimited with newline characters `\n`. To deal with those, I'm going to split the text into individual reviews using `\n` as the delimiter.
* Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.
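
To make these cleaning steps concrete, here is a minimal sketch on a made-up two-review string. `toy_text` is a hypothetical stand-in for the contents of `reviews.txt`, not the real data.

```python
from string import punctuation

# Hypothetical two-review sample; reviews are separated by a newline
toy_text = "Best movie ever!\nNot good, not bad."

# lowercase, then drop every punctuation character
cleaned = ''.join(ch for ch in toy_text.lower() if ch not in punctuation)

# split into individual reviews, then rejoin and split into words
toy_reviews = cleaned.split('\n')
toy_words = ' '.join(toy_reviews).split()

print(toy_reviews)  # ['best movie ever', 'not good not bad']
print(toy_words)    # ['best', 'movie', 'ever', 'not', 'good', 'not', 'bad']
```

Note that the per-review list is kept as well as the flat word list: the words are used to build the vocabulary, while the reviews are what actually get tokenized.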

In [0]:
# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
# print(reviews)
all_text = ''.join([text for text in reviews if text not in punctuation])

# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

In [0]:
words[:30]

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.
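
As a small illustration of this mapping, the sketch below builds a word-to-integer dictionary from a hypothetical word list (`toy_words` and `word_to_int` are illustrative names). The most frequent word gets index 1, and index 0 is deliberately left unused so that it can serve as the padding value later.

```python
from collections import Counter

# Hypothetical word list standing in for `words`
toy_words = ['not', 'good', 'not', 'bad', 'good', 'not']

counts = Counter(toy_words)
# most frequent word first; indices start at 1 so that 0 stays free for padding
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: i for i, word in enumerate(vocab, 1)}

print(word_to_int)  # {'not': 1, 'good': 2, 'bad': 3}

# encode a short review with the dictionary
encoded = [word_to_int[w] for w in 'not bad'.split()]
print(encoded)      # [1, 3]
```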


In [0]:
## Build a dictionary that maps words to integers
counts = Counter(words)
# print(counts)
vocab = sorted(counts, key=counts.get, reverse=True)
# print(vocab)
vocab_to_integer = {word: integer for integer, word in enumerate(vocab, 1)}
# print(vocab_to_integer)

## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_integer[word] for word in review.split()])

**Let's test our code**<br>
Let's print out the number of unique words in our vocabulary and the contents of the first tokenized review.

In [0]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_integer))  # should be roughly 74,000+
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.
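
One detail worth checking when splitting on `\n`: if the labels file ends with a newline (an assumption here, since the file itself isn't shown), the split leaves a trailing empty string, and the encoding turns it into a 0. The sketch below shows this on a hypothetical three-label string.

```python
# Hypothetical contents of a labels file that ends with a newline
toy_labels = "positive\nnegative\npositive\n"

split_labels = toy_labels.split('\n')
print(split_labels)  # ['positive', 'negative', 'positive', '']

encoded = [1 if label == 'positive' else 0 for label in split_labels]
print(encoded)       # [1, 0, 1, 0]  <- the final 0 comes from the empty string
```

So it is a good idea to compare the number of labels with the number of reviews before training.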

In [0]:
# 1=positive, 0=negative
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
encoded_labels

In [0]:
encoded_labels.shape

### Removing Outliers

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
2. Padding/truncating the remaining data so that we have reviews of the same length.

Before we pad our review text, we should check for reviews of extremely short or long lengths; outliers that may mess with our training.

In [0]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Okay, a couple issues here. We seem to have one review with zero length. And, the maximum review length is way too many steps for our RNN. We'll have to remove any super short reviews and truncate super long reviews. This removes outliers and should allow our model to train more efficiently.
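
When dropping empty reviews, the labels have to be filtered with the same indices, otherwise reviews and labels drift out of alignment. A minimal sketch on hypothetical toy data (the `toy_` names are illustrative):

```python
import numpy as np

# Hypothetical tokenized reviews; the last one is empty
toy_reviews_ints = [[5, 2, 9], [7], []]
toy_labels = np.array([1, 0, 0])

# indices of the reviews we keep (non-zero length)
keep_idx = [i for i, review in enumerate(toy_reviews_ints) if len(review) != 0]

toy_reviews_ints = [toy_reviews_ints[i] for i in keep_idx]
toy_labels = toy_labels[keep_idx]  # same indices, so labels stay aligned

print(toy_reviews_ints)     # [[5, 2, 9], [7]]
print(toy_labels.tolist())  # [1, 0]
```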

>We first remove *any* reviews with zero length from the `reviews_ints` list and their corresponding label in `encoded_labels`.

In [0]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.

# get indices of all reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

## Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `seq_length`, we'll pad with 0s. For reviews longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

> We define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `reviews_ints`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```
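
The same rule can be checked on a single review with a few lines of NumPy. `pad_left` is a hypothetical single-row helper written for this illustration; it also guards against empty reviews, which the rule above does not cover.

```python
import numpy as np

def pad_left(review, seq_length):
    """Hypothetical helper: left-pad with 0s or keep the first seq_length tokens."""
    row = np.zeros(seq_length, dtype=int)
    kept = review[:seq_length]
    if kept:  # guard against empty reviews
        row[-len(kept):] = kept
    return row

# short review: zeros go on the left
print(pad_left([117, 18, 128], 10).tolist())      # [0, 0, 0, 0, 0, 0, 0, 117, 18, 128]

# long review (12 tokens): only the first 10 survive
print(pad_left(list(range(1, 13)), 10).tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```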

In [0]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is padded with 0's 
        or truncated to the input seq_length.
    '''
    
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    # for each review, left-pad with 0s or keep only the first seq_length tokens
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

In [0]:
# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 reviews
print(features[:30,:10])

## Training, Validation, Test

With our data in good shape, we'll split it into training, validation, and test sets containing 80%, 10%, and 10% of the data, respectively.
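
The index arithmetic behind such a split is easy to verify on a made-up dataset size. In this sketch `n_samples` is a hypothetical number chosen only for illustration.

```python
# Hypothetical dataset size, for illustration only
n_samples = 1000
split_frac = 0.8

split_idx = int(n_samples * split_frac)   # first 80% -> training
remaining = n_samples - split_idx         # last 20%
val_size = int(remaining * 0.5)           # half of the remainder -> validation
test_size = remaining - val_size          # the rest -> test

print(split_idx, val_size, test_size)     # 800 100 100
```

Because the slices are taken in file order, this kind of split quietly assumes the reviews are not sorted by label; if they were, shuffling before splitting would be the safer choice.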

In [0]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

## DataLoaders and Batching

After creating training, test, and validation data, we then create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

In [0]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data  = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader  = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [0]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

In [0]:
# First checking if GPU is available
if torch.cuda.is_available():
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

In [0]:
class RNNModel(nn.Module):
    
    # The RNN model that will be used to perform Sentiment analysis.

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        
        # Initialize the model by setting up the layers.
        
        super(RNNModel, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
    
        # Perform a forward pass of our model on some input and hidden state.
        
        batch_size = x.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # keep only the output of the last time step
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if torch.cuda.is_available():
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better, at the cost of slower training. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1 and 3.
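
To get a feel for how these hyperparameters drive model size, the sketch below counts parameters by hand. It assumes the standard layout of a stacked, unidirectional LSTM (four gates per layer, each with input weights, recurrent weights, and two bias vectors), and `vocab_size` is a hypothetical round number rather than the exact vocabulary size.

```python
def lstm_param_count(input_dim, hidden_dim, n_layers):
    """Count weights and biases of a stacked, unidirectional LSTM."""
    total = 0
    for layer in range(n_layers):
        # the first layer reads embeddings, later layers read hidden states
        in_dim = input_dim if layer == 0 else hidden_dim
        # 4 gates: input weights + recurrent weights + two bias vectors
        total += 4 * hidden_dim * (in_dim + hidden_dim) + 2 * 4 * hidden_dim
    return total

vocab_size = 74_000  # hypothetical round number, not the exact vocabulary size
embedding_dim, hidden_dim, n_layers, output_size = 400, 256, 3, 1

embedding_params = vocab_size * embedding_dim
lstm_params = lstm_param_count(embedding_dim, hidden_dim, n_layers)
fc_params = hidden_dim * output_size + output_size

print(embedding_params)  # 29600000
print(lstm_params)       # 1726464
print(fc_params)         # 257
```

Under these assumptions the embedding table dominates the parameter count, so `vocab_size` and `embedding_dim` matter more for memory than `hidden_dim` or `n_layers`.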

> Define the model hyperparameters.


In [0]:
# Instantiate the model 
vocab_size = len(vocab_to_integer)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 3

model = RNNModel(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(model)

## Model training

With the data and model ready, we set the training hyperparameters:

* `lr`:  Learning rate for our optimizer.
* `epochs`:  Number of times to iterate through the training dataset.
* `clip`:  The maximum gradient norm to clip at (to prevent exploding gradients).
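
The training loop below uses `clip_grad_norm_`, which limits the combined norm of all gradients rather than each individual value. The plain-Python sketch that follows shows roughly what that rescaling does; `clip_by_total_norm` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math

def clip_by_total_norm(grads, max_norm):
    """Rescale gradient vectors so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # never scale up; the small epsilon avoids division by zero
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Hypothetical gradients for two parameter tensors
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_total_norm(grads, max_norm=5)

norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before)           # 13.0
print(round(norm_after, 3))  # 5.0
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length shrinks.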

In [0]:
# loss and optimization functions

lr=0.001
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

In [0]:
# training parameters
epochs = 10 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip = 5 # gradient clipping
valid_loss_min = np.inf

# move model to GPU, if available
if torch.cuda.is_available():
    model.cuda()

model.train()

# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    hidden = model.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if torch.cuda.is_available():
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        hidden = tuple([each.data for each in hidden])

        # zero accumulated gradients
        model.zero_grad()

        # get the output from the model
        output, hidden = model(inputs, hidden)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if torch.cuda.is_available():
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = model(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            model.train()
            print("Epoch: {}/{}\t".format(e+1, epochs),
                  "step: {}\t".format(counter),
                  "train loss: {:.3f}\t".format(loss.item()),
                  "val loss: {:.3f}".format(np.mean(val_losses)))
            
            # save the model whenever the mean validation loss improves
            valid_loss = np.mean(val_losses)
            if valid_loss <= valid_loss_min:
                torch.save(model.state_dict(), PATH+'model.pt')
                valid_loss_min = valid_loss

Load model with lowest validation loss

In [0]:
# load_state_dict updates the model in place; do not reassign its return value
model.load_state_dict(torch.load(PATH+'model.pt'))

### Model testing

* **Test data performance:**  We'll see how our trained model performs on our defined test_data. We'll calculate accuracy over the test data.

* **Inference on user-generated data:** We'll see if we can input just one example review at a time (without a label), and see what the trained model will predict. <br>*Inference:* Look at new, user input data and predict its label.
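
The accuracy calculation boils down to thresholding the sigmoid outputs and counting matches. Here is a minimal sketch on hypothetical values, using a plain 0.5 threshold (this agrees with rounding everywhere except possibly at exactly 0.5):

```python
# Hypothetical sigmoid outputs and true labels for one small batch
probs = [0.91, 0.12, 0.55, 0.40]
labels = [1, 0, 0, 0]

# probabilities -> predicted classes
preds = [1 if p >= 0.5 else 0 for p in probs]

# count matches and divide by the number of examples
num_correct = sum(int(p == y) for p, y in zip(preds, labels))
accuracy = num_correct / len(labels)

print(preds)     # [1, 0, 1, 0]
print(accuracy)  # 0.75
```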

In [0]:
# Get test data accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
hidden = model.init_hidden(batch_size)

model.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    hidden = tuple([each.data for each in hidden])

    if torch.cuda.is_available():
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, hidden = model(inputs, hidden)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    prediction = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = prediction.eq(labels.float().view_as(prediction))
    correct = np.squeeze(correct_tensor.numpy()) if not torch.cuda.is_available() else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# accuracy over all test data
test_accuracy = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.2f}".format(test_accuracy))

### Inference on a test review

> We'll write a `predict` function that takes in a trained model, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review! This task will be completed with the help of the `tokenize_review` helper function.

In [0]:
# negative test review
test_review_negative = 'The worst movie I have seen; acting was terrible and I want my money back. \
                   This movie had bad acting and the dialogue was slow.'

In [0]:
def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

    # tokens
    test_ints = []
    test_ints.append([vocab_to_integer[word] for word in test_words])

    return test_ints

# test code and generate tokenized review
test_ints = tokenize_review(test_review_negative)
print(test_ints)
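
One thing to keep in mind: `tokenize_review` looks up every word directly in `vocab_to_integer`, so a word that never appeared in the training data raises a `KeyError`. A possible workaround, sketched here with a hypothetical tiny vocabulary and a hypothetical helper name, is to skip unknown words.

```python
from string import punctuation

# Hypothetical tiny vocabulary; the real one is `vocab_to_integer`
toy_vocab = {'the': 1, 'movie': 2, 'was': 3, 'bad': 4}

def tokenize_known_words(review, vocab):
    """Lowercase, strip punctuation, and keep only words present in vocab."""
    text = ''.join(ch for ch in review.lower() if ch not in punctuation)
    return [[vocab[word] for word in text.split() if word in vocab]]

# 'unwatchably' is not in the vocabulary, so it is dropped
print(tokenize_known_words('The movie was unwatchably bad!', toy_vocab))  # [[1, 2, 3, 4]]
```

Dropping words changes the input the model sees, so this is a convenience for inference rather than a substitute for a larger vocabulary.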

In [0]:
# test sequence padding
seq_length = 200
features = pad_features(test_ints, seq_length)

print(features)

In [0]:
# test conversion to tensor and pass into your model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())

In [0]:
def predict(model, test_review, sequence_length=200):
    
    model.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    hidden = model.init_hidden(batch_size)
    
    if torch.cuda.is_available():
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, hidden = model(feature_tensor, hidden)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print('Prediction value: {:.2f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("Positive review!")
    else:
        print("Negative review!")
        

#### Test negative review

In [0]:
# call function
seq_length = 200 # good to use the length that was trained on

predict(model, test_review_negative, seq_length)

#### Test positive review

In [0]:
test_review_pos = 'I do not love deep learning, it\'s very bad'

In [0]:
seq_length = 200 # good to use the length that was trained on

predict(model, test_review_pos, seq_length)