# PyTorch Sentiment Analysis
Code example adapted from:
https://github.com/bentrevett/pytorch-sentiment-analysis

### Installations
Use the following command to install PyTorch (a virtualenv is recommended).  
Further support can be found at https://pytorch.org/
```bash
pip3 install torch torchvision torchtext spacy
python3 -m spacy download en
```

In [0]:
!pip3 install torch torchvision torchtext spacy
!python3 -m spacy download en

Collecting pillow>=4.1.1 (from torchvision)
  Downloading https://files.pythonhosted.org/packages/85/5e/e91792f198bbc5a0d7d3055ad552bc4062942d27eaf75c3e2783cf64eae5/Pillow-5.4.1-cp36-cp36m-manylinux1_x86_64.whl (2.0MB)
Collecting wrapt<1.11.0,>=1.10.0 (from thinc<6.13.0,>=6.12.1->spacy)
  Downloading https://files.pythonhosted.org/packages/a0/47/66897906448185fcb77fc3c2b1bc20ed0ecca81a0f2f88eda3fc5a34fc3d/wrapt-1.10.11.tar.gz
Building wheels for collected packages: wrapt
  Building wheel for wrapt (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/48/5d/04/22361a593e70d23b1f7746d932802efe1f0e523376a74f321e
Successfully built wrapt
spacy 2.0.18 has requirement numpy>=1.15.0, but you'll have numpy 1.14.6 which is incompatible.
Installing collected packages: pillow, wrapt
  Found existing installation: Pillow 4.0.0
    Uninstalling Pillow-4.0.0:
      Successfully uninstalled Pillow-4.


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')



### Objectives
This notebook walks you through building a model that predicts sentiment (i.e. positive or negative) using PyTorch and its text-processing library, TorchText.  
We will use the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/), a widely used sentiment analysis benchmark, for our sentiment classification task.

In [0]:
# a bit of setup
import random

import torch
import torch.nn as nn
from torchtext import data
from torchtext import datasets
import torch.optim as optim

seed = 1234

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

### Data Preprocessing
TorchText provides a class, `Field`, which defines how the data is processed.  
Our data consists of both the raw text of the review and the labeled sentiment, either "pos" or "neg".  

For natural languages, we usually need to "tokenize" a sentence into separate words (tokens).  
Here our tokenization is done with the [spaCy](https://spacy.io) tokenizer, selected with `tokenize='spacy'`.  
If no tokenizer is given, the default is to split the text on whitespace.  
`LabelField` is a `Field` variant used specifically for handling labels.  

The TorchText source for further reading can be found [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).
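
To see why a dedicated tokenizer can help, the sketch below compares plain whitespace splitting with a small regex-based tokenizer. The `simple_tokenize` helper is a hypothetical stand-in used only for illustration; it is not spaCy, and its output will not match spaCy's in general.

```python
import re

def simple_tokenize(sentence):
    # Hypothetical stand-in for a real tokenizer: keeps runs of word
    # characters together and emits each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "This film is great!"

# Whitespace splitting leaves the punctuation attached to the last word.
whitespace_tokens = sentence.split()

# The regex tokenizer separates the word from the exclamation mark.
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `great!` and `great` would count as two different vocabulary entries, which is one reason to prefer a tokenizer that separates punctuation.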

In [0]:
text = data.Field(tokenize='spacy')
label = data.LabelField(dtype=torch.float)

In [0]:
# We will use the IMDb dataset
train_data, test_data = datasets.IMDB.splits(text, label)
print('Number of training examples: {}'.format(len(train_data)))
print('Number of testing examples: {}'.format(len(test_data)))

Let's take a look at one example from the training data.

In [0]:
print(vars(train_data.examples[0]))

Split the training data into a training set and a validation set.

In [0]:
# random.seed() returns None, so seed first and then pass the generator state
random.seed(seed)
train_data, valid_data = train_data.split(random_state=random.getstate())
print('Number of training examples: {}'.format(len(train_data)))
print('Number of validation examples: {}'.format(len(valid_data)))
print('Number of testing examples: {}'.format(len(test_data)))

### Word Representations
The vocabulary of a natural language is usually represented with one-hot vectors: each word is assigned an index, and its vector is all zeros except for a one at that index.  
An illustration of the one-hot representation:

![](https://i.imgur.com/0o5Gdar.png)


Words that are not in the vocabulary are replaced with a special _unknown_ token, `<unk>`.

The following builds the vocabulary, keeping only the `max_size` most common tokens.

In [0]:
text.build_vocab(train_data, max_size=25000)
label.build_vocab(train_data)
print("Unique tokens in TEXT vocabulary: {}".format(len(text.vocab)))
print("Unique tokens in LABEL vocabulary: {}".format(len(label.vocab)))
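
As a conceptual illustration of what capping the vocabulary does, here is a minimal pure-Python sketch. It is not TorchText's implementation: `build_vocab`, `numericalize`, and `one_hot` are hypothetical helpers, and placing `<unk>` and `<pad>` at indexes 0 and 1 is an assumption made for the example.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    # Count every token, then keep only the max_size most frequent ones.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Assumption for this sketch: two special tokens come first.
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens that did not make it into the vocabulary map to <unk>.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

def one_hot(index, size):
    # All zeros except for a single one at the token's index.
    vec = [0] * size
    vec[index] = 1
    return vec

corpus = [
    ["the", "film", "was", "great"],
    ["the", "plot", "was", "the", "problem"],
]
itos, stoi = build_vocab(corpus, max_size=2)
ids = numericalize(["the", "film", "was"], stoi)

print(itos)
print(ids)
print(one_hot(ids[0], len(itos)))
```

In this toy corpus only `the` and `was` survive the cap, so `film` is mapped to `<unk>`. The special tokens are also why a vocabulary built this way has `max_size + 2` entries rather than `max_size`.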

Most frequent words

In [0]:
for k, v in text.vocab.freqs.most_common(20):
    print(k, ':', v)

In [0]:
print(text.vocab.itos[:10])
print(label.vocab.stoi)

#### Prepare torch device and training data loader

In [0]:
batch_size = 64

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cpu')

train_dataloader, valid_dataloader, test_dataloader = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=batch_size,
    device=device)

### The LSTM Model

We will use a **recurrent neural network** (RNN), specifically its variant **Long Short-Term Memory** (LSTM), as our model architecture.  
Below is an illustration in which the model predicts zero, indicating a negative sentiment.  
The initial hidden state, $h_0$, is initialized as a zero tensor. 

![](assets/sentiment1.png)

The `forward` method is called when we feed examples into our model.

`x` is a tensor of size _**[sentence length, batch size]**_.

You may notice that this tensor should have another dimension due to the one-hot vectors. However, PyTorch conveniently stores a one-hot vector as its index value, i.e. the tensor representing a sentence is just a tensor of the indexes of the tokens in that sentence.

A **word embedding** layer is needed to get `word_embd`, which gives us a dense vector representation of our sentences. `word_embd` is a tensor of size _**[sentence length, batch size, embedding dim]**_. `word_embd` is then fed into the LSTM.
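
The following sketch illustrates why storing indexes instead of one-hot vectors loses nothing: looking up rows of an `nn.Embedding` by index gives the same result as multiplying explicit one-hot vectors by the embedding matrix. The sizes are toy values chosen for the example, not the ones used in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embd_dim = 10, 4    # toy sizes (assumption for this sketch)
sent_len, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embd_dim)

# x holds token indexes, shape [sentence length, batch size]
x = torch.randint(0, vocab_size, (sent_len, batch_size))

# Index lookup: shape [sentence length, batch size, embedding dim]
word_embd = embedding(x)

# The same result via explicit one-hot vectors and a matrix product
one_hot = F.one_hot(x, num_classes=vocab_size).float()
word_embd_via_one_hot = one_hot @ embedding.weight

print(word_embd.shape)
print(torch.allclose(word_embd, word_embd_via_one_hot))
```

The index lookup avoids materializing a `[sentence length, batch size, vocab size]` tensor, which matters when the vocabulary has tens of thousands of entries.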

The LSTM returns two values: `output`, a tensor of size _**[sentence length, batch size, hidden dim]**_, and `hidden`, a tuple of the final hidden `state` and `cell` state.

We feed the last output tensor to a linear classification layer, `fc`, to produce a prediction.
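
To make these shapes concrete, here is a small sketch that runs random embeddings through `nn.LSTM` and a linear layer. It assumes a single-layer, unidirectional LSTM and toy sizes; the variable names follow the text above but are otherwise illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sent_len, batch_size, embd_dim, rnn_size = 5, 3, 4, 8   # toy sizes

lstm = nn.LSTM(embd_dim, rnn_size)   # single layer, unidirectional
fc = nn.Linear(rnn_size, 1)

# Stand-in for the embedding output: [sentence length, batch size, embedding dim]
word_embd = torch.randn(sent_len, batch_size, embd_dim)

# output: [sentence length, batch size, hidden dim]
# state, cell: [1, batch size, hidden dim] each
output, (state, cell) = lstm(word_embd)

# The last time step of `output` summarizes the whole sentence.
last_output = output[-1]
logits = fc(last_output)             # [batch size, 1]

print(output.shape, state.shape, cell.shape, logits.shape)
```

For this configuration the last time step of `output` coincides with the final hidden state, so either could be passed to `fc`.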

In [0]:
class Model(nn.Module):
    def __init__(self, input_dim, embd_dim, rnn_size, out_proj_dim):
        super().__init__()
        ############################
        #### Start of Your Code ####
        ############################
        pass
        ############################
        ##### End of Your Code #####
        ############################
        
    def forward(self, x):
        ############################
        #### Start of Your Code ####
        ############################
        pass
        ############################
        ##### End of Your Code #####
        ############################
        return None
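
If you get stuck, the sketch below shows one possible way to fill in the skeleton. It is only an illustration under the description given earlier (embedding, then LSTM, then a linear layer on the last output), not the reference solution. The class name `SketchModel` and the attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class SketchModel(nn.Module):
    def __init__(self, input_dim, embd_dim, rnn_size, out_proj_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embd_dim)
        self.rnn = nn.LSTM(embd_dim, rnn_size)
        self.fc = nn.Linear(rnn_size, out_proj_dim)

    def forward(self, x):
        # x: [sentence length, batch size]
        word_embd = self.embedding(x)
        # output: [sentence length, batch size, rnn_size]
        output, (state, cell) = self.rnn(word_embd)
        # last time step -> [batch size, out_proj_dim] -> [batch size]
        return self.fc(output[-1]).squeeze(1)

# Quick shape check on random token indexes (toy sizes).
sketch = SketchModel(input_dim=50, embd_dim=8, rnn_size=16, out_proj_dim=1)
x = torch.randint(0, 50, (7, 4))
logits = sketch(x)
print(logits.shape)
```

The final `squeeze(1)` assumes `out_proj_dim` is 1, so that the predictions have the same shape as a batch of float labels.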

Construct our model

In [0]:
input_dim = len(text.vocab)
embd_dim = 100
rnn_size = 256
out_proj_dim = 1

model = Model(input_dim, embd_dim, rnn_size, out_proj_dim)

Define the optimizer and the training objective

In [0]:
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
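
`nn.BCEWithLogitsLoss` applies the sigmoid internally, so the model should output raw scores (logits) rather than probabilities. A minimal check on made-up values, comparing it with `nn.BCELoss` applied to explicitly squashed outputs:

```python
import torch
import torch.nn as nn

# Made-up logits and binary targets for illustration only.
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 0.0])

# Sigmoid and binary cross-entropy fused into one loss
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# The same quantity computed in two explicit steps
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_with_logits, loss_manual))
```

The fused version is generally preferred because it is more numerically stable for large positive or negative logits.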

Move them to the device using `.to`

In [0]:
model = model.to(device)
criterion = criterion.to(device)

### Essential functions for training

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() # convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
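
A short usage sketch for `binary_accuracy` on made-up logits and labels. The function is repeated here only so that the snippet runs on its own.

```python
import torch

def binary_accuracy(preds, y):
    # Squash logits to (0, 1), round to 0 or 1, and compare with the labels.
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

# Made-up values: the third logit is positive but its label is 0,
# so three of the four predictions are correct.
preds = torch.tensor([2.0, -1.0, 0.5, -3.0])
y = torch.tensor([1.0, 0.0, 0.0, 0.0])

acc = binary_accuracy(preds, y)
print(acc.item())
```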

In [0]:
def train(model, dataloader, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in dataloader:
        ############################
        #### Start of Your Code ####
        ############################
        # important! to make all the gradients zero
                
        # forward pass
        
        # compute the loss
        
        # compute the accuracy
        
        # backward pass
        
        # optimizer makes one gradient update step
        
        # aggregate training statistics
        pass  # placeholder so the loop body is valid Python until you add your code
        ############################
        ##### End of Your Code #####
        ############################
    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)


def evaluate(model, dataloader, criterion):
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
        for batch in dataloader:
            ############################
            #### Start of Your Code ####
            ############################
            pass
            ############################
            ##### End of Your Code #####
            ############################
    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)
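
As a hint for the two skeletons above, here is a sketch of the usual training and evaluation steps on synthetic data. Everything in it is an assumption made for illustration: `TinyModel` is a hypothetical stand-in rather than the LSTM, the batches are fake objects exposing `.text` and `.label` attributes, and `train_sketch` / `evaluate_sketch` are not the reference solution.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class TinyModel(nn.Module):
    # Hypothetical stand-in: averaged embeddings followed by a linear layer.
    def __init__(self, vocab_size=20, embd_dim=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embd_dim)
        self.fc = nn.Linear(embd_dim, 1)

    def forward(self, x):
        return self.fc(self.embedding(x).mean(dim=0)).squeeze(1)

def accuracy(preds, y):
    return (torch.round(torch.sigmoid(preds)) == y).float().mean()

def train_sketch(model, dataloader, optimizer, criterion):
    epoch_loss, epoch_acc = 0.0, 0.0
    model.train()
    for batch in dataloader:
        optimizer.zero_grad()                 # reset accumulated gradients
        preds = model(batch.text)             # forward pass
        loss = criterion(preds, batch.label)  # compute the loss
        acc = accuracy(preds, batch.label)    # compute the accuracy
        loss.backward()                       # backward pass
        optimizer.step()                      # one gradient update step
        epoch_loss += loss.item()             # aggregate statistics
        epoch_acc += acc.item()
    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)

def evaluate_sketch(model, dataloader, criterion):
    epoch_loss, epoch_acc = 0.0, 0.0
    model.eval()
    with torch.no_grad():                     # no gradients during evaluation
        for batch in dataloader:
            preds = model(batch.text)
            epoch_loss += criterion(preds, batch.label).item()
            epoch_acc += accuracy(preds, batch.label).item()
    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)

# Synthetic batches: text is [sentence length, batch size], label is [batch size].
batches = [
    SimpleNamespace(text=torch.randint(0, 20, (6, 4)),
                    label=torch.randint(0, 2, (4,)).float())
    for _ in range(5)
]

toy_model = TinyModel()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=1e-3)
toy_criterion = nn.BCEWithLogitsLoss()

train_loss, train_acc = train_sketch(toy_model, batches, toy_optimizer, toy_criterion)
valid_loss, valid_acc = evaluate_sketch(toy_model, batches, toy_criterion)
print(train_loss, train_acc, valid_loss, valid_acc)
```

The evaluation function mirrors the training one but skips `zero_grad`, `backward`, and `step`, since no parameters are updated.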

### Let's train the model now!

In [0]:
max_epochs = 5

for epoch in range(max_epochs):

    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_dataloader, criterion)
    
    print('| Epoch: {} | Train Loss: {:.3f} | Train Acc: {:.2f}% | Val. Loss: {:.3f} | Val. Acc: {:.2f}% |'.format(
           epoch+1, train_loss, train_acc*100, valid_loss, valid_acc*100))

In [0]:
test_loss, test_acc = evaluate(model, test_dataloader, criterion)

print('| Test Loss: {:.3f} | Test Acc: {:.2f}% |'.format(
      test_loss, test_acc*100))