---
### Load in and visualize the data

In [3]:
import numpy as np

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [4]:
print(reviews[:2000])
print()
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

## Data pre-processing


In [5]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [6]:
# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
# create a list of words
words = all_text.split()

In [7]:
reviews_split[0]

'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   '

In [8]:
words[:30]

len(reviews_split)

25001

### Encoding the words

In [9]:
# feel free to use this import 
from collections import Counter

def int_encoding(text, reviews_split):
    """
    Return a list of integer-encoded reviews: build a word-to-integer dictionary
    from `text` (a list of words) and use it to tokenize each review in `reviews_split`.
    """
    ## Build a dictionary that maps words to integers
    ## (integers start at 0, the same value that is later used for padding)
    vocab_to_int = {word: integer for integer, word in enumerate(set(text))}

    ## use the dict to tokenize each review in reviews_split
    ## store the tokenized reviews in reviews_ints
    reviews_ints = []
    for review in reviews_split:
        temp = []
        for word in review.split():
            temp.append(vocab_to_int[word])
        reviews_ints.append(temp)
    return reviews_ints

reviews_ints = int_encoding(words, reviews_split)
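
The `Counter` import above suggests a frequency-based vocabulary. As a minimal sketch of that alternative (not the encoding used in this notebook), the hypothetical helper `build_vocab` below numbers words from the most frequent to the least frequent and starts at 1, so that 0 stays free for padding. The toy sentence is made up for illustration. With this variant the embedding layer would need `len(vocab) + 1` rows.

```python
from collections import Counter

def build_vocab(words):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(words)
    # sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}

# toy data, invented for this example
toy_words = "the movie was good the plot was thin the end".split()
toy_vocab = build_vocab(toy_words)

# encode a short review with the toy vocabulary
toy_review = [toy_vocab[w] for w in "the plot was good".split()]
print(toy_vocab)
print(toy_review)
```

Unlike `enumerate(set(text))`, this ordering does not depend on set iteration order, so the same text always produces the same integers.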

## testing

In [10]:
# stats about vocabulary
#print('Unique words: ', len((vocab_to_int)))  # should ~ 74000+
#print()

# print tokens in the third review (index 2)
print('Tokenized review: \n', reviews_ints[2][:1000])

Tokenized review: 
 [45107, 31207, 49935, 22462, 66089, 57726, 23584, 40766, 67955, 623, 73828, 31616, 17718, 14031, 55189, 3521, 12766, 5174, 4002, 63657, 42878, 9480, 10817, 257, 72669, 68400, 79, 13601, 20292, 46935, 57507, 22154, 70785, 5174, 73863, 38594, 31207, 35167, 31616, 9480, 71547, 63249, 36239, 46549, 18114, 9480, 64442, 22462, 38572, 3521, 8720, 52523, 22711, 60525, 53735, 14064, 19695, 22462, 69879, 9480, 9122, 42878, 49245, 24270, 66787, 5174, 20856, 57159, 9480, 25546, 36664, 31207, 60525, 8339, 11212, 57297, 24586, 11553, 5174, 2574, 27367, 42878, 9480, 55167, 13372, 13372, 14031, 25008, 8339, 14479, 72669, 26993, 3521, 9064, 5174, 51480, 42878, 9480, 55167, 31616, 3521, 19030, 19628, 9480, 45109, 14479, 68400, 1377, 22154, 3521, 31828, 9480, 45223, 23125, 3521, 73503, 20896, 42878, 9480, 11139, 3521, 54467, 16784, 57507, 14479, 68400, 23410, 5174, 34428, 25008, 42341, 29503, 56645, 5174, 24586, 64442, 257, 70832, 62580, 70141, 29503, 63775, 13372, 13372, 36467, 31833

### Encoding the labels


In [11]:
# 1=positive, 0=negative label conversion
l = lambda x: 1 if x=='positive' else 0
encoded_labels = list(map(l, labels.split('\n')))

### Removing Outliers

The network expects every input sequence to have the same length, so we reshape the reviews in two steps:

1. Getting rid of extremely long or short reviews
2. Padding the remaining data so that we have reviews of the same length.
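
The two steps can be sketched on a toy list of already-encoded reviews. This is an illustration only: `left_pad` and the toy data are hypothetical names, and the sketch assumes padding on the left and keeping the last `seq_length` tokens of long reviews, which matches the feature matrix printed later in this notebook.

```python
def left_pad(tokens, seq_length, pad_value=0):
    """Keep the last seq_length tokens, or pad on the left up to seq_length.

    Assumes seq_length > 0.
    """
    tail = tokens[-seq_length:]
    return [pad_value] * (seq_length - len(tail)) + tail

# toy data: one short review, one empty review, one long review
toy_reviews = [[5, 8, 2], [], [7, 1, 4, 9, 3, 6]]

# step 1: drop zero-length reviews
kept = [r for r in toy_reviews if len(r) > 0]

# step 2: pad or truncate every remaining review to the same length
padded = [left_pad(r, seq_length=4) for r in kept]
print(padded)
```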

In [12]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


In [13]:
len(encoded_labels)

25001
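
One zero-length review and 25001 entries in both lists are what we would expect if `reviews.txt` and `labels.txt` each end with a newline character. The files are not shown here, so that is an assumption. The toy string below only illustrates how `split('\n')` behaves in that case.

```python
# toy label text, assumed to end with a newline like the real file
toy_labels = "positive\nnegative\npositive\n"

parts = toy_labels.split('\n')
print(parts)  # the trailing newline produces one extra empty string

# the same conversion rule as in the notebook: anything but 'positive' becomes 0
encoded = [1 if p == 'positive' else 0 for p in parts]
print(len(encoded), encoded)
```

The empty string is silently encoded as 0, which is why the extra label has to be removed together with the empty review.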

In [14]:
print('Number of reviews before removing outliers: ', len(reviews_ints))
print('Number of labels before removing outliers: ', len(encoded_labels))

## removing any reviews/labels with 0 length from the reviews_ints list.
## (new lists are built instead of popping inside the loop, which would shift the indices)

non_zero_idx = [i for i, review in enumerate(reviews_ints) if len(review) != 0]
reviews_ints = [reviews_ints[i] for i in non_zero_idx]
encoded_labels = [encoded_labels[i] for i in non_zero_idx]

print('Number of reviews after removing outliers: ', len(reviews_ints))
print('Number of labels after removing outliers: ', len(encoded_labels))

Number of reviews before removing outliers:  25001
Number of labels before removing outliers:  25001
Number of reviews after removing outliers:  25000
Number of labels after removing outliers:  25000


## Defining additional functions

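`pad_features` left-pads each review with 0's, or truncates it to its last `seq_length` tokens, so that every row has exactly `seq_length` values.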

In [15]:
def cut_add(l, seq_length):
    '''
    Return a copy of l cut, or padded with 0's at the end, to exactly seq_length items.
    '''
    if len(l) >= seq_length:
        return l[:seq_length]
    return l + [0] * (seq_length - len(l))

def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is left-padded with 0's
        or truncated to its last seq_length tokens.
    '''
    features = np.zeros([len(reviews_ints), seq_length])
    for i, review in enumerate(reviews_ints):
        # reverse, cut/pad at the end, then reverse back: this pads on the left
        # and keeps the last seq_length tokens without modifying reviews_ints
        features[i] = cut_add(review[::-1], seq_length)[::-1]

    return features

In [16]:
# Testing

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)
encoded_labels = np.array(encoded_labels)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print the feature matrix (NumPy abbreviates large arrays)
print(features)

[[    0.     0.     0. ... 42341. 35817. 51156.]
 [    0.     0.     0. ... 24586. 57633.  6842.]
 [29503. 12358. 42878. ...  5174.  4002. 35341.]
 ...
 [29781. 18114. 42315. ... 52281. 28439. 35235.]
 [    0.     0.     0. ... 35522. 69570. 22154.]
 [    0.     0.     0. ... 49904. 70832. 48038.]]


## Creating training, validation and test sets

In [17]:
split_frac = 0.8
split_int = int(len(features)*split_frac)
## split data into training, validation, and test data (features and labels, x and y)
train_x = features[:split_int]
train_y = encoded_labels[:split_int]

valid_x = features[split_int:split_int+int((len(features)-split_int)/2)]
valid_y = encoded_labels[split_int:split_int+int((len(features)-split_int)/2)]

test_x = features[split_int+int((len(features)-split_int)/2):]
test_y = encoded_labels[split_int+int((len(features)-split_int)/2):]

## print out the shapes of our feature data
print('Train set:', train_x.shape)
print('Validation set:', valid_x.shape)
print('Test set:', test_x.shape)

Train set: (20000, 200)
Validation set: (2500, 200)
Test set: (2500, 200)
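
The slicing above repeats the same index expression several times. As a small sketch, the split points can be computed once. `split_bounds` is a hypothetical helper, not part of the notebook, and it reproduces the same 80/10/10 arithmetic.

```python
def split_bounds(n_rows, train_frac=0.8):
    """Return the end index of the training set and of the validation set."""
    train_end = int(n_rows * train_frac)
    # the remaining rows are divided equally between validation and test
    valid_end = train_end + (n_rows - train_end) // 2
    return train_end, valid_end

n_rows = 25000  # number of reviews after removing the empty one
train_end, valid_end = split_bounds(n_rows)
split_sizes = (train_end, valid_end - train_end, n_rows - valid_end)
print(split_sizes)
```

With these two indices the three sets are `features[:train_end]`, `features[train_end:valid_end]` and `features[valid_end:]`, and the labels are sliced the same way.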


---
## Making dataloaders

In [18]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [19]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # built-in next(); iterator.next() is not available in recent PyTorch versions

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)


torch.from_numpy(np.array(sample_x)).shape

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[    0.,     0.,     0.,  ...,  5174.,  9480., 40762.],
        [  257., 56645.,  3521.,  ..., 51156., 72990., 42341.],
        [46093., 70832., 41314.,  ...,  9480., 66586., 11488.],
        ...,
        [    0.,     0.,     0.,  ...,  3521., 66586., 44891.],
        [    0.,     0.,     0.,  ..., 52283., 58659., 46567.],
        [ 5174.,  9480., 56705.,  ..., 71357., 49904., 54693.]],
       dtype=torch.float64)

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
        0, 1], dtype=torch.int32)


torch.Size([50, 200])

In [20]:
# Checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


## Defining network

In [21]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()
        #for backprop/init_hidden
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        #for saving model
        self.embedding_dim= embedding_dim
        self.vocab_size = vocab_size
        
        # define all layers
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(drop_prob)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True, dropout=drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """

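        # cast token ids to integers for the embedding lookup; the training and
        # test loops already move the inputs to the right device
        x = x.long()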
        
        x = self.embed(x)
        lstm_out, hidden = self.lstm(x, hidden)
        lstm_out = self.dropout(lstm_out)
        
        lstm_out = lstm_out[:, -1, :]
        #flatting
        fc_input = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        fc_out = self.fc(fc_input)
        out = self.sigmoid(fc_out)
        
        # return last sigmoid output and hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM.
        # new_zeros keeps the dtype and device of the model weights, so the
        # same code works on CPU and GPU
        weight = next(self.parameters()).data
        hidden = (weight.new_zeros((self.n_layers, batch_size, self.hidden_dim)),
                  weight.new_zeros((self.n_layers, batch_size, self.hidden_dim)))

        return hidden
        

## Instantiating the network

Here, we'll instantiate the network, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; here 1, a single sigmoid value read as the probability of a positive review.
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells.
* `n_layers`: Number of LSTM layers in the network. 
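
To get a feeling for how these hyperparameters drive model size, the sketch below counts trainable parameters in plain Python. The helper names are hypothetical. The LSTM formula assumes the usual PyTorch layout of four gates, each with input weights, recurrent weights and two bias vectors. The numbers plugged in are the ones used in this notebook (a vocabulary of 74072 words, 500-dimensional embeddings, 128 hidden units, 2 layers).

```python
def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def count_params(vocab_size, output_size, embedding_dim, hidden_dim, n_layers):
    """Return the parameter count of each block of the network."""
    embed = vocab_size * embedding_dim
    # the first LSTM layer reads embeddings, later layers read hidden states
    lstm = sum(
        lstm_layer_params(embedding_dim if layer == 0 else hidden_dim, hidden_dim)
        for layer in range(n_layers)
    )
    fc = hidden_dim * output_size + output_size
    return {'embed': embed, 'lstm': lstm, 'fc': fc}

param_counts = count_params(vocab_size=74072, output_size=1,
                            embedding_dim=500, hidden_dim=128, n_layers=2)
print(param_counts)
print('total:', sum(param_counts.values()))
```

Under these assumptions almost all parameters sit in the embedding table, so `vocab_size` and `embedding_dim` matter far more for memory than `hidden_dim` or `n_layers`.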

In [22]:
# Instantiate the model w/ hyperparams
vocab_size = len(set(words))
output_size = 1
embedding_dim = 500
hidden_dim = 128
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embed): Embedding(74072, 500)
  (dropout): Dropout(p=0.5, inplace=False)
  (lstm): LSTM(500, 128, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)


In [23]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
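
`nn.BCELoss` compares the sigmoid output with the 0/1 label. The plain-Python sketch below shows the underlying formula on made-up numbers. `bce_loss` is a hypothetical helper, and it leaves out the clamping of the log terms that PyTorch applies for numerical stability.

```python
import math

def bce_loss(probs, targets):
    """Mean binary cross-entropy for probabilities in (0, 1) and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# toy predictions for one positive and one negative review
confident_right = bce_loss([0.9, 0.1], [1, 0])
undecided = bce_loss([0.5, 0.5], [1, 0])
confident_wrong = bce_loss([0.1, 0.9], [1, 0])
print(confident_right, undecided, confident_wrong)
```

A model that always outputs 0.5 scores ln 2, about 0.693, which is a useful baseline when reading the loss values printed during training.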

In [24]:
# training params

epochs = 3 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)
        # calculate the loss and perform backprop
        
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))


Epoch: 1/3... Step: 100... Loss: 0.593892... Val Loss: 0.645778
Epoch: 1/3... Step: 200... Loss: 0.542995... Val Loss: 0.560587
Epoch: 1/3... Step: 300... Loss: 0.549729... Val Loss: 0.508724
Epoch: 1/3... Step: 400... Loss: 0.428433... Val Loss: 0.581501
Epoch: 2/3... Step: 500... Loss: 0.511221... Val Loss: 0.525026
Epoch: 2/3... Step: 600... Loss: 0.329264... Val Loss: 0.468024
Epoch: 2/3... Step: 700... Loss: 0.492686... Val Loss: 0.457073
Epoch: 2/3... Step: 800... Loss: 0.571418... Val Loss: 0.485198
Epoch: 3/3... Step: 900... Loss: 0.295330... Val Loss: 0.534054
Epoch: 3/3... Step: 1000... Loss: 0.513462... Val Loss: 0.444375
Epoch: 3/3... Step: 1100... Loss: 0.397177... Val Loss: 0.419812
Epoch: 3/3... Step: 1200... Loss: 0.404741... Val Loss: 0.410731
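
The training loop clips gradients with `nn.utils.clip_grad_norm_(net.parameters(), clip)`. The sketch below illustrates the idea of clipping by global norm on plain Python lists. It is not PyTorch's implementation: `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption modelled on the constant PyTorch adds to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # never scale up: small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# toy gradients of two "parameters"; global norm = sqrt(9 + 16 + 144) = 13
toy_grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)

# gradients that are already small pass through untouched
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=5)
print(small)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only the step size shrinks.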


## Saving model

In [53]:
model_name = '3_epochs_128hidden_sentiment_LSTM.net'
# 'vocab_size' and 'embedding_dim' are stored as well, so the network can be rebuilt from the checkpoint
checkpoint = {'hidden_dim': net.hidden_dim,
              'vocab_size':net.vocab_size,
              'embedding_dim': net.embedding_dim,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'output_size': net.output_size
             }

with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)

## Testing model on test set

In [25]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())  # .cpu() is a no-op when already on the CPU
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))


Test loss: 0.433
Test accuracy: 0.811
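
The accuracy above comes from rounding each sigmoid output to 0 or 1 and comparing it with the label. The sketch below shows the same computation on made-up numbers. `accuracy` is a hypothetical helper, and the strict `>` comparison assumes that an output of exactly 0.5 is rounded down to 0.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of predictions that match the 0/1 labels after thresholding."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)

# toy sigmoid outputs and true labels, invented for this example
toy_probs = [0.91, 0.35, 0.62, 0.08]
toy_targets = [1, 0, 0, 0]
toy_acc = accuracy(toy_probs, toy_targets)
print(toy_acc)  # the third review is a false positive
```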
