Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.
- Author: Sebastian Raschka
- GitHub Repository: https://github.com/rasbt/deeplearning-models

# Bidirectional Multi-layer RNN with LSTM with Own Dataset in CSV Format (Yelp Review Polarity)

Dataset Description

```
Yelp Review Polarity Dataset

Version 1, Updated 09/09/2015

ORIGIN

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset_challenge

The Yelp reviews polarity dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).


DESCRIPTION

The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity 280,000 training samples and 19,000 testing samples are take randomly. In total there are 560,000 trainig samples and 38,000 testing samples. Negative polarity is class 1, and positive class 2.

The files train.csv and test.csv contain all the training samples as comma-sparated values. There are 2 columns in them, corresponding to class index (1 and 2) and review text. The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".backslash followed with an "n" character, that is "\n".
```

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p torch


import torch
import torch.nn.functional as F
from torchtext import data
from torchtext import datasets
import time
import random
import pandas as pd
import numpy as np

torch.backends.cudnn.deterministic = True

Sebastian Raschka 

CPython 3.7.3
IPython 7.9.0

torch 1.3.0


## General Settings

In [2]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)

VOCABULARY_SIZE = 5000
LEARNING_RATE = 1e-3
BATCH_SIZE = 128
NUM_EPOCHS = 50
DROPOUT = 0.5
DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')

EMBEDDING_DIM = 128
BIDIRECTIONAL = True
HIDDEN_DIM = 256
NUM_LAYERS = 2
OUTPUT_DIM = 2

## Dataset

The Yelp Review Polarity dataset is available from Xiang Zhang's Google Drive folder at

https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M

From the Google Drive folder, download the file 

- `yelp_review_polarity_csv.tar.gz`

In [3]:
!tar xvzf  yelp_review_polarity_csv.tar.gz

yelp_review_polarity_csv/
yelp_review_polarity_csv/readme.txt
yelp_review_polarity_csv/test.csv
yelp_review_polarity_csv/train.csv


Check that the dataset looks okay:

In [4]:
df = pd.read_csv('yelp_review_polarity_csv/train.csv', header=None, index_col=None)
df.columns = ['classlabel', 'content']
df['classlabel'] = df['classlabel']-1
df.head()

Unnamed: 0,classlabel,content
0,0,"Unfortunately, the frustration of being Dr. Go..."
1,1,Been going to Dr. Goldberg for over 10 years. ...
2,0,I don't know what Dr. Goldberg was like before...
3,0,I'm writing this review to give you a heads up...
4,1,All the food is great here. But the best thing...


In [5]:
np.unique(df['classlabel'].values)

array([0, 1])

In [6]:
np.bincount(df['classlabel'])

array([280000, 280000])

In [7]:
df[['classlabel', 'content']].to_csv('yelp_review_polarity_csv/train_prepocessed.csv', index=None)

In [8]:
df = pd.read_csv('yelp_review_polarity_csv/test.csv', header=None, index_col=None)
df.columns = ['classlabel', 'content']
df['classlabel'] = df['classlabel']-1
df.head()

Unnamed: 0,classlabel,content
0,1,"Contrary to other reviews, I have zero complai..."
1,0,Last summer I had an appointment to get new ti...
2,1,"Friendly staff, same starbucks fair you get an..."
3,0,The food is good. Unfortunately the service is...
4,1,Even when we didn't have a car Filene's Baseme...


In [9]:
np.unique(df['classlabel'].values)

array([0, 1])

In [10]:
np.bincount(df['classlabel'])

array([19000, 19000])

In [11]:
df[['classlabel', 'content']].to_csv('yelp_review_polarity_csv/test_prepocessed.csv', index=None)

In [12]:
del df

Define the Label and Text field formatters:

In [13]:
TEXT = data.Field(sequential=True,
                  tokenize='spacy',
                  include_lengths=True) # necessary for packed_padded_sequence

LABEL = data.LabelField(dtype=torch.float)


# If you get an error [E050] Can't find model 'en'
# you need to run the following on your command line:
#  python -m spacy download en

Process the dataset:

In [14]:
fields = [('classlabel', LABEL), ('content', TEXT)]

train_dataset = data.TabularDataset(
    path="yelp_review_polarity_csv/train_prepocessed.csv", format='csv',
    skip_header=True, fields=fields)

test_dataset = data.TabularDataset(
    path="yelp_review_polarity_csv/test_prepocessed.csv", format='csv',
    skip_header=True, fields=fields)

Split the training dataset into training and validation:

In [15]:
train_data, valid_data = train_dataset.split(
    split_ratio=[0.95, 0.05],
    random_state=random.seed(RANDOM_SEED))

print(f'Num Train: {len(train_data)}')
print(f'Num Valid: {len(valid_data)}')

Num Train: 532000
Num Valid: 28000


Build the vocabulary based on the top "VOCABULARY_SIZE" words:

In [16]:
TEXT.build_vocab(train_data,
                 max_size=VOCABULARY_SIZE,
                 vectors='glove.6B.100d',
                 unk_init=torch.Tensor.normal_)

LABEL.build_vocab(train_data)

print(f'Vocabulary size: {len(TEXT.vocab)}')
print(f'Number of classes: {len(LABEL.vocab)}')

Vocabulary size: 5002
Number of classes: 2


In [17]:
list(LABEL.vocab.freqs)[-10:]

['1', '0']

The TEXT.vocab dictionary will contain the word counts and indices. The reason why the number of words is VOCABULARY_SIZE + 2 is that it contains to special tokens for padding and unknown words: `<unk>` and `<pad>`.

Make dataset iterators:

In [18]:
train_loader, valid_loader, test_loader = data.BucketIterator.splits(
    (train_data, valid_data, test_dataset), 
    batch_size=BATCH_SIZE,
    sort_within_batch=True, # necessary for packed_padded_sequence
    sort_key=lambda x: len(x.content),
    device=DEVICE)

Testing the iterators (note that the number of rows depends on the longest document in the respective batch):

In [19]:
print('Train')
for batch in train_loader:
    print(f'Text matrix size: {batch.content[0].size()}')
    print(f'Target vector size: {batch.classlabel.size()}')
    break
    
print('\nValid:')
for batch in valid_loader:
    print(f'Text matrix size: {batch.content[0].size()}')
    print(f'Target vector size: {batch.classlabel.size()}')
    break
    
print('\nTest:')
for batch in test_loader:
    print(f'Text matrix size: {batch.content[0].size()}')
    print(f'Target vector size: {batch.classlabel.size()}')
    break

Train
Text matrix size: torch.Size([113, 128])
Target vector size: torch.Size([128])

Valid:
Text matrix size: torch.Size([6, 128])
Target vector size: torch.Size([128])

Test:
Text matrix size: torch.Size([5, 128])
Target vector size: torch.Size([128])


## Model

In [20]:
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, bidirectional, hidden_dim, num_layers, output_dim, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim,
                           num_layers=num_layers,
                           bidirectional=bidirectional, 
                           dropout=dropout)
        self.fc1 = nn.Linear(hidden_dim * num_layers, 64)
        self.fc2 = nn.Linear(64, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_length):

        embedded = self.dropout(self.embedding(text))
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_length)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        hidden = self.fc1(hidden)
        return hidden

In [21]:
INPUT_DIM = len(TEXT.vocab)

PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

torch.manual_seed(RANDOM_SEED)
model = RNN(INPUT_DIM, EMBEDDING_DIM, BIDIRECTIONAL, HIDDEN_DIM, NUM_LAYERS, OUTPUT_DIM, DROPOUT, PAD_IDX)
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

## Training

In [22]:
def compute_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples = 0, 0
    with torch.no_grad():
        for batch_idx, batch_data in enumerate(data_loader):
            text, text_lengths = batch_data.content
            logits = model(text, text_lengths).squeeze(1)
            _, predicted_labels = torch.max(logits, 1)
            num_examples += batch_data.classlabel.size(0)
            correct_pred += (predicted_labels.long() == batch_data.classlabel.long()).sum()
        return correct_pred.float()/num_examples * 100

In [23]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, batch_data in enumerate(train_loader):
        
        text, text_lengths = batch_data.content
        
        ### FORWARD AND BACK PROP
        logits = model(text, text_lengths).squeeze(1)
        cost = F.cross_entropy(logits, batch_data.classlabel.long())
        optimizer.zero_grad()
        
        cost.backward()
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        ### LOGGING
        if not batch_idx % 1000:
            print (f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                   f'Batch {batch_idx:03d}/{len(train_loader):03d} | '
                   f'Cost: {cost:.4f}')

    with torch.set_grad_enabled(False):
        print(f'training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')
        
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')
    
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 001/050 | Batch 000/4157 | Cost: 4.1925
Epoch: 001/050 | Batch 1000/4157 | Cost: 0.3392
Epoch: 001/050 | Batch 2000/4157 | Cost: 0.3254
Epoch: 001/050 | Batch 3000/4157 | Cost: 0.3263
Epoch: 001/050 | Batch 4000/4157 | Cost: 0.1488
training accuracy: 94.50%
valid accuracy: 94.12%
Time elapsed: 8.57 min
Epoch: 002/050 | Batch 000/4157 | Cost: 0.2246
Epoch: 002/050 | Batch 1000/4157 | Cost: 0.1248
Epoch: 002/050 | Batch 2000/4157 | Cost: 0.1107
Epoch: 002/050 | Batch 3000/4157 | Cost: 0.1820
Epoch: 002/050 | Batch 4000/4157 | Cost: 0.0808
training accuracy: 95.75%
valid accuracy: 95.35%
Time elapsed: 17.23 min
Epoch: 003/050 | Batch 000/4157 | Cost: 0.0877
Epoch: 003/050 | Batch 1000/4157 | Cost: 0.0720
Epoch: 003/050 | Batch 2000/4157 | Cost: 0.0770
Epoch: 003/050 | Batch 3000/4157 | Cost: 0.0876
Epoch: 003/050 | Batch 4000/4157 | Cost: 0.0851
training accuracy: 96.15%
valid accuracy: 95.62%
Time elapsed: 25.90 min
Epoch: 004/050 | Batch 000/4157 | Cost: 0.1596
Epoch: 004/050 | B

## Evaluation

Evaluating on some new text that has been collected from recent Yelp reviews and are not part of the training or test sets.

In [24]:
import spacy
nlp = spacy.load('en')


map_dictionary = {
    0: "negative",
    1: "positive"
}


def predict_class(model, sentence, min_len=4):
    # Somewhat based on
    # https://github.com/bentrevett/pytorch-sentiment-analysis/
    # blob/master/5%20-%20Multi-class%20Sentiment%20Analysis.ipynb
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(DEVICE)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    preds = model(tensor, length_tensor)
    preds = torch.softmax(preds, dim=1)

    proba, class_label = preds.max(dim=1)
    return proba.item(), class_label.item()

In [28]:
text = """
I have returned many times since my original review, and I can attest to the fact that, indeed, 
the plethora of books she provides does not disappoint. Although under new ownership, 
the vibe and the focus remains unchanged. 

I still collect Kobayashi poetry anytime I stumble upon it.

My absolute favorite bookshop, card vendor, and truth teller. 

Until next time.
"""

proba, pred_label = predict_class(model, text)

print(f'Class Label: {pred_label} -> {map_dictionary[pred_label]}')
print(f'Probability: {proba}')

torch.Size([1, 64])
Class Label: 1 -> positive
Probability: 0.9960760474205017


In [29]:
text = """
Horrible customer service experience!!

Why I even bothered to go here is beyond me.. 
My wife asked me to get some gift cards and my dad 
mentioned that he would give me a yearly membership as a present.  
I made the mistake of not listening to that little voice in my head 
screaming "DON'T!!!!".  I got the gift cards and asked for the membership 
and then realized that they hadn't given me the membership.  So I go in the 
next day and asked someone in customer service if I could get the membership 
and then have them apply the discount to the previous purchases and some new 
purchases and their response was "Of course..  Talk to Scott, our head cashier, 
and he will gladly take care of this".  I go to Scott and he tells me "I've never 
done that, we would never do that and whoever told you that was obviously 
wrong"  Needless to say, I did not make any new purchases and I will promptly 
return any of the previous purchases and give my hard-earned money to someone who deserves it.

Bottom line..  Overpriced lousy customer service is not for me.  In this day
and age they should know better than that and you should use your buying power to show them. Stay away..
"""

proba, pred_label = predict_class(model, text)

print(f'Class Label: {pred_label} -> {map_dictionary[pred_label]}')
print(f'Probability: {proba}')

torch.Size([1, 64])
Class Label: 0 -> negative
Probability: 0.999991774559021


In [27]:
%watermark -iv

pandas    0.24.2
torch     1.3.0
numpy     1.17.2
spacy     2.2.3
torchtext 0.4.0



In [32]:
torch.save(model.state_dict(), 'rnn_bi_multilayer_lstm_own_csv_yelp-polarity.pt')