The dataset for this exercise is from 

https://github.com/udacity/deep-learning-v2-pytorch/tree/master/sentiment-rnn/data


## Data Preprocessing

In [1]:
with open('data/reviews.txt') as f:
  reviews = f.read()
with open('data/labels.txt') as f:
  labels = f.read()

In [2]:
print(reviews[:1000])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

In [3]:
print(labels[:20])

positive
negative
po


### Remove Punctuation

In [4]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
reviews = reviews.lower()
reviews = ''.join([char for char in reviews if char not in punctuation])

In [6]:
print(reviews[:1000])

bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   
story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mo

In [7]:
reviews = reviews.split('\n')
labels = labels.split('\n')
len(reviews), len(labels)

(25001, 25001)

In [8]:
reviews[0]

'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   '

In [9]:
words = ' '.join(reviews)
words = words.split()

### Word to Vector

In [10]:
from collections import Counter

In [11]:
word_counts = Counter(words)
len(word_counts)

74072

In [12]:
word_sorted_by_desc_cnt = sorted(word_counts, key=word_counts.get, reverse=True)
print(word_sorted_by_desc_cnt[:10])

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i']


In [13]:
# indexing the word from 1, later we will use leading 0s to pad sequence 
# under a fixed length
word2int = {word: idx+1 for idx, word in enumerate(word_sorted_by_desc_cnt)}

In [14]:
reviews_int = [
  [word2int[word] for word in review.split()] for review in reviews]

print(reviews_int[0])

[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]


### Check for Outliers

In [15]:
import pandas as pd

In [16]:
reviews_len = pd.DataFrame([len(review) for review in reviews_int], columns=['review_len'])
reviews_len['review_len'].describe()

count    25001.000000
mean       240.798208
std        179.020628
min          0.000000
25%        130.000000
50%        179.000000
75%        293.000000
max       2514.000000
Name: review_len, dtype: float64

In [17]:
reviews_len.groupby('review_len').size()[:10]

review_len
0     1
10    1
11    1
12    2
13    1
14    1
15    2
16    2
17    2
19    2
dtype: int64

In [18]:
# remove the empty review and its corresponding label from dataset
zero_indices = [idx for idx, review in enumerate(reviews_int) if len(review)==0]

reviews_int = [review for idx, review in enumerate(reviews_int) if idx not in zero_indices]

In [19]:
import numpy as np

In [20]:
labels = [label for idx, label in enumerate(labels) if idx not in zero_indices]

### One-hot Encode Labels

In [21]:
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels])

In [22]:
len(encoded_labels)

25000

### Check Unique Label Counts

We want to check if our dataset is balanced before moving on to the next step

In [23]:
Counter(encoded_labels)

Counter({0: 12500, 1: 12500})

### Pad Short Reviews and Truncate Long Reviews

We will set the max length of the sequence to 200. Review under this length will be padded with leading zeros. This is the reason we start word2int at 1. We also need to remember that our vocab_size is (len(word2int) + 1).

In [24]:
def pad_sequence(reviews, sequence_len):
  
  num_rows = len(reviews)
  padded_reviews = np.zeros((num_rows, sequence_len), dtype = int)
  for idx, review in enumerate(reviews):
    padded_reviews[idx, -len(review):] = review[: sequence_len]
  
  return padded_reviews

In [25]:
# pad sequence shorter than 200 characters with leading 0s
# cut sequence longer than 200 characters at 200

sequence_len = 200

reviews_array = pad_sequence(reviews_int, sequence_len=sequence_len)

### Train-Test-Validation Split

We will use 80% of the data as training set, 10% of the data as validation set, and the rest as testing set.

In [26]:
train_test_split_ratio = 0.8

In [27]:
np.random.seed(42)

In [28]:
shuffled_idx = np.arange(len(reviews_array))

np.random.shuffle(shuffled_idx)
shuffled_idx

array([ 6868, 24016,  9668, ...,   860, 15795, 23654])

In [29]:
train_size = int(len(reviews_array) * train_test_split_ratio)
train_indices = shuffled_idx[:train_size]

remaining_indices = shuffled_idx[train_size: ]
val_size = (len(reviews_array) - train_size) // 2
val_indices =  remaining_indices[val_size:]
test_indices =  remaining_indices[:val_size]

train_x = reviews_array[train_indices, :]
train_y = encoded_labels[train_indices]

val_x = reviews_array[val_indices, :]
val_y = encoded_labels[val_indices]

test_x = reviews_array[test_indices, :]
test_y = encoded_labels[test_indices]

print(f"{'train_x:': <8} {train_x.shape}, {'train_y:': <8} {train_y.shape}")
print(f"{'val_x:': <8} {val_x.shape}, {'val_y:': <8} {val_y.shape}")
print(f"{'test_x:': <8} {test_x.shape}, {'test_y:': <8} {test_y.shape}")

train_x: (20000, 200), train_y: (20000,)
val_x:   (2500, 200), val_y:   (2500,)
test_x:  (2500, 200), test_y:  (2500,)


### Batching

In [30]:
import torch
from torch.utils.data import DataLoader, TensorDataset

In [31]:
train_ds = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
val_ds = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_ds = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

In [32]:
batch_size = 50

# If observations in dataloader cannot be divided by batch_size
# we will need to drop the last incomplete batch as it will cause error 
# in LSTM layer later. 
# We also need to turn shuffle off for val_dl and test_dl, since we 
# want to concat the predictions and compare with the entire list of targets

train_dl = DataLoader(train_ds, batch_size, shuffle=True, drop_last=True)
val_dl = DataLoader(val_ds, batch_size, shuffle=False, drop_last=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=False, drop_last=True)

In [33]:
for batch_x, batch_y in train_dl:
  print(batch_x.shape)
  print(batch_y.shape)
  break

torch.Size([50, 200])
torch.Size([50])


## Sentiment Network using LSTM

In [34]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"Device is {device}")

Device is cuda


In [35]:
import torch.nn as nn

In [36]:
class SentimentNet(nn.Module):
  def __init__(self, vocab_size, embed_dim, hidden_size, num_lstm_layers, dropout_prob, out_size):
    super().__init__()
    self.vocab_size = vocab_size
    self.embed_dim = embed_dim
    self.hidden_size = hidden_size
    self.num_lstm_layers = num_lstm_layers
    self.dropout_prob = dropout_prob
    self.out_size = out_size

    self.embed = nn.Embedding(vocab_size, embed_dim)
    self.lstm = nn.LSTM(
        input_size=embed_dim, hidden_size=hidden_size, 
        num_layers=num_lstm_layers, batch_first=True, dropout=dropout_prob)
    self.dropout = nn.Dropout(0.3)
    self.fc = nn.Linear(in_features=hidden_size, out_features=out_size)
    self.sigmoid = nn.Sigmoid()
  
  def forward(self, x, hidden):
    batch_size = x.size(0)
    out = self.embed(x)

    out, hidden = self.lstm(out, hidden)

    out = out.contiguous().view(-1, self.hidden_size)
    out = self.dropout(out)
    out = self.fc(out)

    out = out.view(batch_size, -1)
    out = self.sigmoid(out)
    
    # we only need the output at the end of the sequence
    out = out[:, -1]

    return out, hidden
  
  def init_hidden(self, batch_size):
    # we use weight to keep the data type of the model parameters
    weight = next(self.parameters()).data

    hidden_state = (
      weight.new(self.num_lstm_layers, batch_size, self.hidden_size).zero_().to(device),
      weight.new(self.num_lstm_layers, batch_size, self.hidden_size).zero_().to(device))
    
    return hidden_state


Before starting training, we want to make sure the dimensions in the model is correct. Let's instantiate a model and feed some data through it.

In [37]:
# vacab_size should be word2int + 1 so that it includes the paddings
vocab_size = len(word2int) + 1
embed_dim = 3
hidden_size = 4
num_lstm_layers = 2
dropout_prob = 0.3
out_size = 1

model = SentimentNet(
    vocab_size, embed_dim, hidden_size, 
    num_lstm_layers, dropout_prob, out_size).to(device)

model

SentimentNet(
  (embed): Embedding(74073, 3)
  (lstm): LSTM(3, 4, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=4, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [38]:
hidden = model.init_hidden(batch_size)

for batch_x, batch_y in train_dl:
  out, hidden = model(batch_x.to(device), hidden)
  break
  
print(out.shape, hidden[0].shape)

torch.Size([50]) torch.Size([2, 50, 4])


### Training

In [39]:
PATH = 'model_checkpoint'

In [40]:
# vacab_size should be word2int + 1 so that it includes the paddings
vocab_size = len(word2int) + 1
embed_dim = 500
hidden_size = 256
num_lstm_layers = 2
dropout_prob = 0.3
out_size = 1

model = SentimentNet(
    vocab_size, embed_dim, hidden_size, 
    num_lstm_layers, dropout_prob, out_size).to(device)

lr = 0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [41]:
epochs = 10

grad_clip_thresh = 5
log_every = 100

counter = 0
min_val_loss = np.infty

for epoch in range(epochs):
  
  
  hidden = model.init_hidden(batch_size)

  for batch_x, batch_y in train_dl:
    # train
    model.train()
    
    counter += 1
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)

    optimizer.zero_grad()
    # detach hidden state before feeding it in to the model
    hidden = tuple(state.data for state in hidden)
    out, hidden = model(batch_x, hidden)

    loss = criterion(out, batch_y.float())
    loss.backward()

    # clip gradient before updating the weight
    nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    optimizer.step()
    # print(f"\r{counter}: {out.shape}", end="", flush=True)
    print(f"\rEpoch: {epoch+1}, batch: {counter}, train_loss: {loss.item():.6f}", end="", flush=True)

    
    if counter % log_every == 0:
      # validation mode
      model.eval()

      val_hidden = model.init_hidden(batch_size)
      val_losses = []
      for batch_x, batch_y in val_dl:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        val_hidden = tuple(state.data for state in val_hidden)
        with torch.no_grad():
          out, val_hidden = model(batch_x, val_hidden)
          val_loss = criterion(out, batch_y.float())
          val_losses.append(val_loss.item())
      if np.mean(val_losses) < min_val_loss:
        min_val_loss = val_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'train_loss': loss.item(),
            'val_loss': val_loss.item(),
            'model_hyperparameters': {
                'vocab_size': model.vocab_size,
                'embed_dim': model.embed_dim,
                'hidden_size': model.hidden_size,
                'num_lstm_layers': model.num_lstm_layers,
                'dropout_prob': model.dropout_prob,
                'out_size': model.out_size,
            },
        }, PATH)
      print(f"\rEpoch: {epoch+1}, batch: {counter}, train_loss: {loss.item():.6f}, val_loss: {np.mean(val_losses):.6f}")

Epoch: 1, batch: 100, train_loss: 0.663304, val_loss: 0.661472
Epoch: 1, batch: 200, train_loss: 0.585354, val_loss: 0.580288
Epoch: 1, batch: 300, train_loss: 0.635410, val_loss: 0.649840
Epoch: 1, batch: 400, train_loss: 0.573247, val_loss: 0.515238
Epoch: 2, batch: 500, train_loss: 0.454927, val_loss: 0.518009
Epoch: 2, batch: 600, train_loss: 0.370448, val_loss: 0.430542
Epoch: 2, batch: 700, train_loss: 0.301400, val_loss: 0.408504
Epoch: 2, batch: 800, train_loss: 0.277392, val_loss: 0.392027
Epoch: 3, batch: 900, train_loss: 0.243589, val_loss: 0.410152
Epoch: 3, batch: 1000, train_loss: 0.240738, val_loss: 0.447372
Epoch: 3, batch: 1100, train_loss: 0.296274, val_loss: 0.426252
Epoch: 3, batch: 1200, train_loss: 0.340190, val_loss: 0.377071
Epoch: 4, batch: 1300, train_loss: 0.091623, val_loss: 0.461042
Epoch: 4, batch: 1400, train_loss: 0.195940, val_loss: 0.506260
Epoch: 4, batch: 1500, train_loss: 0.399545, val_loss: 0.512712
Epoch: 4, batch: 1600, train_loss: 0.151209, val_

### Testing

In [42]:
checkpoint = torch.load(PATH)
print(f"checkpoint val_loss: {val_loss}")
model = SentimentNet(**checkpoint['model_hyperparameters'])
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model

checkpoint val_loss: 0.9944574236869812


SentimentNet(
  (embed): Embedding(74073, 500)
  (lstm): LSTM(500, 256, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [43]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [44]:
# validation mode
model.eval()

test_hidden = model.init_hidden(batch_size)
test_losses = []
test_preds = torch.tensor([])
for batch_x, batch_y in test_dl:
  batch_x, batch_y = batch_x.to(device), batch_y.to(device)
  test_hidden = tuple(state.data for state in test_hidden)
  
  with torch.no_grad():
    out, test_hidden = model(batch_x, test_hidden)
    test_loss = criterion(out, batch_y.float())
    test_losses.append(test_loss.item())
    batch_pred = torch.round(out).cpu()
    test_preds = torch.cat((test_preds, batch_pred), dim=0)

   
print(f"test_loss: {np.mean(test_losses):.6f}")

cm = confusion_matrix(
    test_ds.tensors[1].to('cpu').numpy(), test_preds.to('cpu').numpy())
# TN, FN, TP, FP = cm[0, 0], cm[1, 0], cm[1, 1], cm[0, 1]
accuracy = accuracy_score(
    test_ds.tensors[1].to('cpu').numpy(), test_preds.to('cpu').numpy())
print(f"test_accuracy: {accuracy:.6f}")
print(f"confusion matrix:\n{cm}")

test_loss: 0.371326
test_accuracy: 0.844800
confusion matrix:
[[1047  200]
 [ 188 1065]]


### Prediction

Let's use our model to predict the sentiment of [this positive review](https://readysteadycut.com/2020/05/28/dorohedoro-season-1-netflix-review/) of [Netflix's Dorohedoro](https://www.netflix.com/title/80991903#:~:text=2020%20%7C%20TV%2DMA%20%7C%201,Takagi%2C%20Reina%20Kondo%2C%20Kenyu%20Horiuchi)

In [45]:
# https://readysteadycut.com/2020/05/28/dorohedoro-season-1-netflix-review/

new_review = """Netflix has a winner on its hands with new original anime Dorohedoro, a 12-part medley of violence, eccentricity, and class-conscious world-building that bows out unflatteringly begging for a sequel but does a fine job of making a case for one throughout its run.


Snappy in terms of both its dialogue and its dinosaur-headed hero, Caiman (Wataru Takagi), an amnesiac with a particular grudge against the oppressive Sorcerers who meddle with the lives and biology of the citizens in the decrepit “Hole”, Dorohedoro does an admirable job of fleshing out its world and cast while maintaining a steady drip-feed of dramatic secrets and blood-soaked violence.


There’s plenty to like here in terms of characterization, world-building, and plotting, even if some of the mysteries are necessarily left unresolved. The CG animation, too, works well in its bloody broad strokes, and while narratively and aesthetically Dorohedoro lacks a degree of subtlety, you can’t help but imagine that’s very much the point.

Perhaps, though, that subtlety is to be found in the characterization, which treats figures on both sides of the class divide to rounded personalities or at least entertainingly eccentric traits. While Caiman is inarguably the lead, the focus is divided evenly enough that your favorite character could conceivably be anyone.

Dorohedoro’s long-term success on Netflix will depend, one assumes, on how people take to its wide-open ending, which might not work as a satisfying conclusion. Whether that puts people off or entices them to the follow-up remains to be seen, but all the requisite elements are here for a well-liked original anime that’ll attract an enthusiastic fan base over the coming weeks.
"""

In [46]:
def tokenize_review(text_review):
  # review to lower case, remove punctuation, and to list of words
  review_cleaned = ''.join([char for char in text_review.lower() if char not in punctuation]).split()
  # map word to integer
  review_int = [word2int[word] for word in review_cleaned if word in word2int]
  return review_int

In [47]:
new_review_int = tokenize_review(new_review)

In [48]:
new_reviews_array = pad_sequence([new_review_int], sequence_len)

In [49]:
new_reviews_tensors = torch.from_numpy(new_reviews_array)

In [50]:
new_reviews_tensors.shape

torch.Size([1, 200])

In [51]:
model.eval()
with torch.no_grad():
  hidden = model.init_hidden(1)
  out, hidden = model(new_reviews_tensors.to(device), hidden)
print(f"out: {out.cpu().item():.6f}")
print(f"pred: {torch.round(out).cpu().item():.0f}")

out: 0.967452
pred: 1
