# Sentiment Analysis using a pre-trained embedding layer

The purpose of this project is to combine the contents of the previous two notebooks and to implement an improved version of the RNN for sentiment analysis on the movie reviews dataset. To that end, I will pre-train the embedding layer using *word2vec* (skip-gram architecture) in a separate network.

(Note: this is not a guided project or walkthrough. Some explanations can be found in the other notebooks.)

**Data Source**
- A dataset of movie reviews, accompanied by sentiment labels: positive or negative.

**Project Log**
- 19-12-05: Start project

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries,-load-data" data-toc-modified-id="Import-libraries,-load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries, load data</a></span></li><li><span><a href="#Data-pre-processing" data-toc-modified-id="Data-pre-processing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data pre-processing</a></span><ul class="toc-item"><li><span><a href="#Clean-text" data-toc-modified-id="Clean-text-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Clean text</a></span></li><li><span><a href="#Encode" data-toc-modified-id="Encode-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Encode</a></span></li><li><span><a href="#Subsample" data-toc-modified-id="Subsample-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Subsample</a></span></li><li><span><a href="#Create-batches" data-toc-modified-id="Create-batches-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Create batches</a></span></li></ul></li><li><span><a href="#Build-Neural-Net-for-Word2Vec" data-toc-modified-id="Build-Neural-Net-for-Word2Vec-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Build Neural Net for Word2Vec</a></span><ul class="toc-item"><li><span><a href="#Define-Architecture-and-Loss" data-toc-modified-id="Define-Architecture-and-Loss-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Define Architecture and Loss</a></span></li><li><span><a href="#Define-validation-function" data-toc-modified-id="Define-validation-function-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Define validation function</a></span></li></ul></li><li><span><a href="#Training-NN" data-toc-modified-id="Training-NN-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Training NN</a></span><ul class="toc-item"><li><span><a href="#Create-features:-Pad-/-truncate-reviews" data-toc-modified-id="Create-features:-Pad-/-truncate-reviews-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Create features: Pad / truncate 
reviews</a></span></li><li><span><a href="#Split-into-Training,-Validation,-Test" data-toc-modified-id="Split-into-Training,-Validation,-Test-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Split into Training, Validation, Test</a></span></li><li><span><a href="#DataLoaders-and-Batching" data-toc-modified-id="DataLoaders-and-Batching-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>DataLoaders and Batching</a></span></li></ul></li><li><span><a href="#Instantiate-the-network" data-toc-modified-id="Instantiate-the-network-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Instantiate the network</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Testing" data-toc-modified-id="Testing-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Testing</a></span><ul class="toc-item"><li><span><a href="#Inference-on-a-test-review" data-toc-modified-id="Inference-on-a-test-review-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Inference on a test review</a></span></li></ul></li></ul></div>

## Import libraries, load data

In [1]:
from collections import Counter
from random import choices, random
import numpy as np

import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_style('whitegrid')
%matplotlib inline

In [2]:
# Read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [3]:
# Print some stats about the reviews
reviews_list = reviews.split('\n')
labels_list = labels.split('\n')
print("Number of reviews:", len(reviews_list))
print("Number of unique words in reviews:", len(set(reviews.split())))

Number of reviews: 25001
Number of unique words in reviews: 74073


## Data pre-processing

### Clean text
- Normalize (all lowercase)
- Replace punctuation (Note: punctuation has actually already been removed from this text, except for periods (padded with spaces) and newlines.)
- Remove all words that occur 2 times or fewer
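
The trimming step can be illustrated on a toy string. This is only a minimal sketch: the helper name `trim_rare_words` and the sample sentence are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

def trim_rare_words(text, trim_threshold=2):
    """Drop every word that occurs `trim_threshold` times or fewer."""
    words = text.split()
    counts = Counter(words)
    return [w for w in words if counts[w] > trim_threshold]

# Hypothetical sample: only 'good' and 'bad' occur more than twice
sample = "good film good cast good plot bad bad bad ending"
kept = trim_rare_words(sample, trim_threshold=2)
print(kept)
```

Words are filtered by their count over the whole text, so a rare word is removed from every review it appears in.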

In [4]:
def clean_text(text, trim_threshold=5):
    """Clean the input text (replace punctuation, remove infrequent words) 
    and split it into individual reviews. Return cleaned text as list of
    reviews and as one large string.
    
    Arguments:
    ----------
    - text: string, input text with a delimiter '\n' for individual elements
    - trim_threshold: int, all words with <= this frequency will be removed
    
    Returns:
    --------
    - text_complete: str, cleaned text 
    - reviews_list: list of strings, cleaned text split into individual reviews
    """
    
    # Replace punctuation with tokens so we can use them in our model
    text = text.lower()
    text = text.replace('.', '<PERIOD>')  # watch whitespaces
    text = text.replace(',', ' <COMMA> ')
    text = text.replace('"', ' <QUOTATION_Double> ')
    text = text.replace("''", ' <QUOTATION> ')
    text = text.replace(';', ' <SEMICOLON> ')
    text = text.replace('!', ' <EXCLAMATION_MARK> ')
    text = text.replace('?', ' <QUESTION_MARK> ')
    text = text.replace('(', ' <LEFT_PAREN> ')
    text = text.replace(')', ' <RIGHT_PAREN> ')
    text = text.replace('--', ' <HYPHENS> ')
    text = text.replace('\n', ' <NEW_LINE> ')  # watch whitespaces
    text = text.replace(':', ' <COLON> ')
      
    # Remove all words with `trim_threshold` or fewer occurrences
    words_all = text.split(" ")
    word_counts = Counter(words_all)
    words_trimmed = [word for word in words_all if word_counts[word] > trim_threshold]
    text_trimmed = ' '.join(words_trimmed)

    # Split by new lines, reassemble (without new lines)
    reviews_list = text_trimmed.split(' <NEW_LINE>')
    text_complete = ' '.join(reviews_list)

    return text_complete, reviews_list

In [5]:
text_complete, reviews_list = clean_text(reviews, trim_threshold=2)

In [6]:
# Check results
vocab_list = list(set(text_complete.split()))
print("Number of reviews:", len(reviews_list))
print("Number of unique words in text_complete:", len(vocab_list))

Number of reviews: 25001
Number of unique words in text_complete: 37442


**Note:** Removing words with 2 or fewer occurrences roughly halves the vocabulary (from 74,073 to 37,442 unique words)!

One more thing: according to the docs there should be only 25,000 reviews. We have one too many, probably an empty one caused by a final '\n' (see next cell).

In [7]:
labels[-20:]

'e\npositive\nnegative\n'

In [8]:
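# Remove the leftover review of length 1 (caused by the trailing '\n') from reviews_list.
idx_to_remove = [len(x) for x in reviews_list].index(1)
reviews_list.pop(idx_to_remove)

# labels.split() already drops the empty trailing entry, so only remove a
# label if one actually exists at that index.
labels_list = labels.split()
if idx_to_remove < len(labels_list):
    labels_list.pop(idx_to_remove)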

print('Number of reviews after removing outliers: ', len(reviews_list))
assert len(reviews_list) == len(labels_list)

Number of reviews after removing outliers:  25000


### Encode

Create 2 dicts mapping words in vocabulary to integers and vice-versa. 
- make sure that most frequent words get lowest int representation
- leave pos [0] for padding (see next section)
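
As a small illustration of the two points above, here is a sketch on a toy sentence. The sentence and the names `word_to_int` / `int_to_word` are invented for this example.

```python
from collections import Counter

# Hypothetical toy corpus
words = "the film was the best film the end".split()
counts = Counter(words)

# Most frequent words first; start counting at 1 so that 0 stays free for padding
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: i for i, word in enumerate(vocab, start=1)}
int_to_word = {i: word for word, i in word_to_int.items()}

print(word_to_int)
```

Because no word is mapped to 0, that value can later be used to pad short reviews without colliding with a real token.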

In [9]:
def encode_text(text, start_pos=0):
    """Encode words to integers (and vice versa), so that most frequent words
    get lowest int representations. Return two mapping dictionaries.
    
    Arguments:
    ----------
    - text: string, input text with words delimited by whitespaces
    - start_pos: int, value for first integer representation (default=0)
    
    Returns:
    --------
    - vocab_to_int: dict, mapping of words to ints (most frequent come first)
    - int_to_vocab: dict, mapping of ints to words
    """

    word_count_dict = Counter(text.split())
    vocab = sorted(word_count_dict, key=word_count_dict.get, reverse=True)
    vocab_to_int = {word: i for i, word in enumerate(vocab, start_pos)}
    int_to_vocab = {i: word for i, word in enumerate(vocab, start_pos)}

    return vocab_to_int, int_to_vocab

In [10]:
vocab_to_int, int_to_vocab = encode_text(text_complete, start_pos=1)

In [11]:
# Check results
print('Unique words: ', len((vocab_to_int)))
assert len(vocab_to_int) == len(vocab_list)
print(list(vocab_to_int.items())[:5])
print(list(int_to_vocab.items())[:5])

Unique words:  37442
[('the', 1), ('<PERIOD>', 2), ('and', 3), ('a', 4), ('of', 5)]
[(1, 'the'), (2, '<PERIOD>'), (3, 'and'), (4, 'a'), (5, 'of')]


### Subsample

Discard some high-frequency words from the data to get faster training and better representations (noise reduction). (According to the [Neural Information Processing Systems paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) by Mikolov et al.)
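
The subsampling rule from that paper discards a word $w_i$ with probability $P(w_i) = 1 - \sqrt{t / f(w_i)}$, where $f(w_i)$ is the word's relative frequency and $t$ is a threshold. The sketch below evaluates this for two made-up frequencies; the helper name `drop_probability` and the clipping at 0 are additions for illustration (the notebook's function leaves negative values as they are, which has the same effect of always keeping the word).

```python
import math

def drop_probability(freq, threshold=1e-5):
    """Probability of discarding a word with relative frequency `freq`.
    Clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))

p_common = drop_probability(0.05)  # a very frequent word, mostly discarded
p_rare = drop_probability(1e-6)    # rarer than the threshold, never discarded
print(p_common, p_rare)
```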


In [12]:
def subsample(text, vocab_to_int, threshold=1e-5):
    """Discard some frequent words, according to the subsampling equation.
    Return a new (reduced) list of words for training. 
    
    Arguments:
    ----------
    - text: string, input text with words delimited by whitespaces
    - vocab_to_int: dict, mapping of words to ints (most frequent come first)
    - threshold: float, threshold for 'random discarding' a word (default=1e-5)
    
    Returns:
    --------
    - words_train: list, subsampled text represented in ints instead of words
    """ 
    
    # Encode the whole text into a list of ints using a mapping dict
    int_words = [vocab_to_int[word] for word in text.split()]
    # Create dictionary of int_words, showing their frequencies
    word_counts = Counter(int_words)  
    
    total_count = len(int_words)
    freqs = {word: count / total_count for word, count in word_counts.items()}
    p_drop = {word: 1 - np.sqrt(threshold / freqs[word]) for word in word_counts}

    words_train = [word for word in int_words if random() < (1 - p_drop[word])]
    
    return words_train

In [14]:
words_train = subsample(text_complete, vocab_to_int)

In [15]:
# Check results
print("Length train text:", len(words_train))
print("Length initial text:", len(text_complete.split()))

Length train text: 1333008
Length initial text: 6301782


In [16]:
print(words_train[:10])

[21026, 309, 2139, 5786, 383, 5195, 61, 4976, 5853, 21026]


### Create batches

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $[ 1: C ]$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."

(According to [First Word2Vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.)

In [17]:
# This function will be called within the next function

def get_target(words, idx, window_size=5):
    """ Get a random-length list of target words (ints) in a window around 
    an input index (the input word).
    
    Arguments:
    ----------
    - words: list, text represented as list of words / ints
    - idx: int, the index of the input word in words
    - window_size: max window to and from idx to get target words
    
    Returns:
    --------
    - target: list, of words near the idx, the 'label' for our input
    
    """
    
    R = np.random.randint(1, window_size+1)
    min_val = np.max([idx-R, 0])  # make sure no neg index occurs, not necessary for values at end
    target = words[min_val : idx] + words[idx+1 : idx+R+1]
    
    return target

In [18]:
def get_batches(words, batch_size, window_size=5):
    """Create a generator of word batches as a tuple (inputs, targets). It
    grabs `batch_size` words from a words list. Then for each of those 
    batches, it gets the target words in a window.
    
    Arguments:
    ----------
    - words: list, text represented as list of words / ints
    - batch_size: int, number of inputs to form one batch
    - window_size: max window to and from idx to get target words
    
    Returns:
    --------
    - x, y: lists, inputs and corresponding labels for one batch at a time
    
    """
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx : idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)  # each batch x and y will be one row of values
            x.extend([batch_x]*len(batch_y))
        yield x, y

In [19]:
# Check results
test_text = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
x, y = next(get_batches(test_text, batch_size=4, window_size=5))

print('x\n', x)
print('y\n', y)

x
 ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D']
y
 ['B', 'C', 'A', 'C', 'D', 'A', 'B', 'D', 'A', 'B', 'C']


## Build Neural Net for Word2Vec

### Define Architecture and Loss

In [20]:
class SkipGramNeg(nn.Module):
    def __init__(self, n_vocab, n_embed, noise_dist=None):
        super().__init__()
        
        self.n_vocab = n_vocab
        self.n_embed = n_embed
        self.noise_dist = noise_dist
        
        # Define embedding layers for input and output words
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        self.out_embed = nn.Embedding(n_vocab, n_embed)
        
        # Initialize both embedding tables with uniform distribution
        # (this may help with convergence)
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)
        
    def forward_input(self, input_words):
        input_vectors = self.in_embed(input_words)
        return input_vectors
    
    def forward_output(self, output_words):
        output_vectors = self.out_embed(output_words)
        return output_vectors
    
    def forward_noise(self, batch_size, n_samples):
        """ Generate noise vectors with shape (batch_size, n_samples, n_embed)"""
        if self.noise_dist is None:
            # Sample words uniformly
            noise_dist = torch.ones(self.n_vocab)
        else:
            noise_dist = self.noise_dist
            
        # Sample words from our noise distribution
        noise_words = torch.multinomial(noise_dist,
                                        batch_size * n_samples,
                                        replacement=True)
        
        # Use the module's own device instead of relying on a global `model`
        device = self.out_embed.weight.device
        noise_words = noise_words.to(device)
        noise_vectors = self.out_embed(noise_words).view(batch_size, n_samples, self.n_embed)      
        
        return noise_vectors

In [21]:
class NegativeSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input_vectors, output_vectors, noise_vectors):
        
        batch_size, embed_size = input_vectors.shape
        
        # Input vectors should be a batch of column vectors
        input_vectors = input_vectors.view(batch_size, embed_size, 1)
        
        # Output vectors should be a batch of row vectors
        output_vectors = output_vectors.view(batch_size, 1, embed_size)
        
        # bmm = batch matrix multiplication
        # Correct log-sigmoid loss
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log()
        out_loss = out_loss.squeeze()
        
        # Incorrect log-sigmoid loss
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        noise_loss = noise_loss.squeeze().sum(1)  # sum the losses over the sample of noise vectors

        # Negate and sum correct and noisy log-sigmoid losses
        # Return average batch loss
        return -(out_loss + noise_loss).mean()

### Define validation function

This function helps to observe the model during training. It will print out the closest words to some input words using the _cosine similarity_. We will input a mix of a few common words and a few uncommon words.


We can encode the validation words as vectors $\vec{a}$ using the embedding table, then calculate the similarity with each word vector $\vec{b}$ in the embedding table. With the similarities, we can print out the validation words and words in our embedding table semantically similar to those words. 
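
For reference, the cosine similarity is $\frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|}$. The sketch below computes it in plain Python for a tiny, made-up embedding table (the words and 3-dimensional vectors are hypothetical) and ranks the entries by similarity to a query word.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embedding table
table = {
    "good": [1.0, 0.5, 0.0],
    "great": [0.9, 0.6, 0.1],
    "boring": [-1.0, 0.2, 0.3],
}

query = table["good"]
ranked = sorted(table, key=lambda w: cosine_sim(query, table[w]), reverse=True)
print(ranked)
```

The query word itself always ranks first, which is why the closest match is usually skipped when printing neighbours.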

In [22]:
def cosine_similarity(embedding, valid_size=16, valid_window=100, device='cpu'):
    """ Returns the cosine similarity of validation words with words in the 
    embedding matrix.
    
    Arguments:
    ----------
    - embedding: a PyTorch embedding module
    - ...    
    """
    
    # Calculate the cosine similarity between some random words and the embedding vectors. 
    # With the similarities, we can look at what words are close to our random words.
    # sim = (a . b) / |a||b|
    
    embed_vectors = embedding.weight
    
    # Calculate magnitude of embedding vectors, |b|
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)
    
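    # Pick N words from our ranges [1, window] and [1000, 1000+window).
    # A lower id implies a more frequent word, a higher id a more uncommon word.
    # Ids start at 1 because 0 is reserved for padding.
    # Note: `random` is imported as a function above, so sample with NumPy instead.
    valid_examples = np.random.choice(np.arange(1, valid_window + 1),
                                      valid_size // 2, replace=False)
    valid_examples = np.append(valid_examples,
                               np.random.choice(np.arange(1000, 1000 + valid_window),
                                                valid_size // 2, replace=False))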
    valid_examples = torch.LongTensor(valid_examples).to(device)
    
    valid_vectors = embedding(valid_examples)
    similarities = torch.mm(valid_vectors, embed_vectors.t()) / magnitudes
        
    return valid_examples, similarities

## Training NN

(Training on GPU recommended, if available.)

In [34]:
# Check for a GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if not torch.cuda.is_available():
    print('No GPU found. Please use a GPU to train your neural network.')

No GPU found. Please use a GPU to train your neural network.


In [41]:
def train_w2v(model, words, batch_size, optimizer, criterion, epochs, print_every=1500):
    """Train loop with forward and backward propagation. Returns the trained model."""
    
    model.train()
    steps = 0
    for e in range(epochs):

        # Get our input, target batches
        for input_words, target_words in get_batches(words, batch_size):
            steps += 1
            inputs, targets = torch.LongTensor(input_words), \
                              torch.LongTensor(target_words)
            inputs, targets = inputs.to(device), targets.to(device)

            # input, output, and noise vectors
            input_vectors = model.forward_input(inputs)
            output_vectors = model.forward_output(targets)
            noise_vectors = model.forward_noise(inputs.shape[0], 5)

            # negative sampling loss
            loss = criterion(input_vectors, output_vectors, noise_vectors)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # loss stats
            if steps % print_every == 0:
                print("Epoch: {}/{}".format(e + 1, epochs))
                print("Loss: ", loss.item()) # avg batch loss at this point
                valid_examples, valid_similarities = cosine_similarity(
                        model.in_embed, device=device)
                _, closest_idxs = valid_similarities.topk(6)

                valid_examples, closest_idxs = valid_examples.to('cpu'), \
                                               closest_idxs.to('cpu')
                for ii, valid_idx in enumerate(valid_examples):
                    closest_words = [int_to_vocab[idx.item()] \
                                     for idx in closest_idxs[ii]][1:]
                    print(int_to_vocab[valid_idx.item()] \
                          + " | " + ', '.join(closest_words))
                print("...\n")
    
    # Return trained model
    return model

In [42]:
# Get noise distribution
# Calculate new freqs after subsampling - this was not done in orig project!
word_counts = Counter(words_train)
total_count = len(words_train)
# Index i holds the frequency of word id i, so sampled indices match the
# embedding rows; index 0 (padding) keeps probability 0.
n_vocab = len(vocab_to_int) + 1
word_freqs = np.zeros(n_vocab)
for word, count in word_counts.items():
    word_freqs[word] = count / total_count
unigram_dist = word_freqs / word_freqs.sum()
noise_dist = torch.from_numpy(unigram_dist**(0.75)/np.sum(unigram_dist**(0.75)))

In [43]:
# Set net parameters
embedding_dim = 300

# Set training parameters
print_every = 3000
batch_size = 512
epochs = 5

criterion = NegativeSamplingLoss() 

In [46]:
# Instantiate model and move to GPU if available
# (word ids start at 1, so the embedding needs one extra row for id 0)
w2v_net = SkipGramNeg(n_vocab, embedding_dim, noise_dist=noise_dist)
w2v_net.to(device)

# The optimizer needs the parameters of the instantiated model
optimizer = optim.Adam(w2v_net.parameters(), lr=0.003)

# Train model
trained_w2v_net = train_w2v(w2v_net, words_train, batch_size, 
                            optimizer, criterion, epochs, print_every)

KeyboardInterrupt: 

In [None]:
# Tokenize each review in reviews_list
reviews_ints = []
for review in reviews_list:
    review_int = [vocab_to_int[word] for word in review.split()]
    reviews_ints.append(review_int)

In [None]:
# Check results
print("Tokenized review:", reviews_ints[:1])
assert len(reviews_ints[0]) == len(reviews_list[0].split())

Next: Encode the labels (1 = positive, 0 = negative).

In [None]:
# Use labels_list (empty trailing entry already removed) so labels align with reviews
labels_array = np.array([1 if label == 'positive' else 0 for label in labels_list])

In [None]:
# Check results
labels_array[:5]

### Create features: Pad / truncate reviews

To have a uniform input format, I'll pad or truncate all reviews to a specific length. For reviews shorter than some `seq_length`, I'll pad with 0s (from the start). For reviews longer than `seq_length`, I'll truncate them to the first `seq_length` words. 

> **Exercise:** Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `reviews_ints`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**
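
The padding and truncation rules for a single review can be sketched in plain Python as follows. The helper `left_pad` is a hypothetical stand-in that works on lists rather than NumPy arrays.

```python
def left_pad(review, seq_length):
    """Left-pad a list of ints with 0s, or truncate it to the first seq_length items."""
    review = review[:seq_length]
    return [0] * (seq_length - len(review)) + review

short = left_pad([117, 18, 128], seq_length=10)
long = left_pad(list(range(1, 13)), seq_length=10)
print(short)
print(long)
```

Left padding keeps the actual words at the end of the sequence, closest to the last LSTM step whose output is used for the prediction.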

In [None]:
# Outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

plt.figure()
plt.hist([len(x) for x in reviews_ints], bins=100, color='rebeccapurple');

In [None]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is left-padded with 0's 
        or truncated to the input seq_length.
    '''
       
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    
    for i, review in enumerate(reviews_ints):
        review = review[:seq_length]  # truncate long reviews
        if review:  # an empty review stays all zeros
            features[i, -len(review):] = np.array(review)
    
    return features

In [None]:
# Test implementation
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)

assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# Print first 10 values of the first 5 reviews 
print(features[:5,:10])

### Split into Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.
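
To make the split boundaries concrete, here is a sketch of the index arithmetic for an 80/10/10 split, assuming 25,000 samples as in this dataset. The helper name `split_indices` is made up for illustration.

```python
def split_indices(n_samples, split_frac=0.8):
    """Return the upper index of the training and of the validation slice."""
    upper_train = int(n_samples * split_frac)
    # The remaining samples are divided equally between validation and test
    upper_val = int(upper_train + n_samples * (1 - split_frac) / 2)
    return upper_train, upper_val

upper_train, upper_val = split_indices(25000)
n_train = upper_train
n_val = upper_val - upper_train
n_test = 25000 - upper_val
print(n_train, n_val, n_test)
```

Note that this slices the data in its stored order, so it implicitly assumes the positive and negative reviews are not grouped in blocks.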


In [None]:
split_frac = 0.8

# Split data into training, validation, and test data (features and labels, x and y)
upper_train = int(len(features) * split_frac)
upper_val = int(upper_train + (len(features) * (1-split_frac) / 2))


train_x = features[:upper_train, :]
val_x = features[upper_train : upper_val, :]
test_x = features[upper_val: , :]

train_y = labels_array[:upper_train]
val_y = labels_array[upper_train : upper_val]
test_y = labels_array[upper_val:]

# Print out the shapes of resultant feature data
print(train_x.shape)
print(val_x.shape)
print(test_x.shape)

### DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.
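
For comparison, the generator-based alternative mentioned above could look roughly like this. It is a plain-Python sketch with hypothetical names (`batch_generator`, `toy_x`, `toy_y`), not the DataLoader API, and it only yields full batches.

```python
import random

def batch_generator(features, labels, batch_size, shuffle=True, seed=None):
    """Yield (x_batch, y_batch) lists; incomplete final batches are dropped."""
    indices = list(range(len(features)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_full = len(indices) // batch_size * batch_size
    for start in range(0, n_full, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield [features[i] for i in batch_idx], [labels[i] for i in batch_idx]

# Hypothetical toy data: 7 samples, so one sample is left out with batch_size=3
toy_x = [[0, 0, i] for i in range(7)]
toy_y = [i % 2 for i in range(7)]
batches = list(batch_generator(toy_x, toy_y, batch_size=3, seed=0))
print(len(batches))
```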

In [None]:
# Create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# Dataloaders
batch_size = 50
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size) # make sure to SHUFFLE your data
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [None]:
# Obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

---
# Sentiment Network with PyTorch

Below is the basic network architecture:

<img src="assets/network_diagram.png" width=40%>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size (dimensionality reduction).
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; we return **only the last sigmoid output** as the output of this network.

**The Embedding Layer:** We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are 37,000+ words in our vocabulary (after trimming). It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. We could train an embedding layer using Word2Vec (see last project), then load it here. But for this implementation, it's fine to just make a new layer, using it for only dimensionality reduction, and let the network learn the weights.

**The LSTM Layer(s):** We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

In [None]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, 
                 hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # Define all layers
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True, dropout=drop_prob)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
            
        embeddings = self.embed(x)
        lstm_out, hidden = self.lstm(embeddings, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim) # stack-up LSTM output
        out = self.dropout(lstm_out) 
        out = self.fc(out)
        sig_out = self.sig(out)
        
        
        # Reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # keep only the output of the last time step
        
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        """
        Initializes hidden state to all zeros and move to GPU if available.
        """
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

`__init__` **explanation:**
- First I have an embedding layer, which should take in the size of our vocabulary (our number of integer tokens) and produce an embedding of embedding_dim size. So, as this model trains, this is going to create an embedding lookup table that has as many rows as we have word integers, and as many columns as the embedding dimension.

- Then, I have an LSTM layer, which takes in inputs of embedding_dim size. So, it's accepting embeddings as inputs, and producing an output and hidden state of a hidden size. I am also specifying a number of layers, and a dropout value, and finally, I’m setting batch_first to True because we are using DataLoaders to batch our data like that!

- Then, the LSTM outputs are passed to a dropout layer and then a fully-connected, linear layer that will produce output_size number of outputs. And finally, I’ve defined a sigmoid layer to convert the output to a value between 0-1.


`forward` **explanation:**
- First, I'm getting the batch_size of my input x, which I’ll use for shaping my data. Then, I'm passing x through the embedding layer first, to get my embeddings as output.

- These embeddings are passed to the lstm layer, alongside a hidden state, and this returns an lstm_output and a new hidden state. Then I'm going to stack up the outputs of my LSTM to pass to my last linear layer.

- Then I keep going, passing the reshaped lstm_output to a dropout layer and my linear layer, which should return a specified number of outputs that I will pass to my sigmoid activation function.

- Finally, I want to return only the last of these sigmoid outputs for each review in the batch, so I reshape the outputs to be batch_size first. Then I select the output of the last time step for every sequence by calling `sig_out[:, -1]`.

`init_hidden` **explanation:** An LSTM carries two states, the hidden state and the cell state, which are passed around as a tuple. Each of the two tensors has shape (n_layers, batch_size, hidden_dim). I’m initializing both to all zeros and moving them to the GPU if one is available.
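
The shapes described in the `forward` and `init_hidden` explanations can be traced with a small stand-alone sketch. It does not reuse `SentimentRNN`; the layer names and the tiny sizes below are assumptions chosen only to keep the shapes easy to read, and dropout is left out.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes, chosen only for readability
vocab_size, embedding_dim, hidden_dim, n_layers = 50, 8, 16, 2
batch_size, seq_len = 4, 10

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

# A batch of integer-encoded "reviews"
x = torch.randint(0, vocab_size, (batch_size, seq_len))

# Zero-initialized (hidden state, cell state) tuple
hidden = (torch.zeros(n_layers, batch_size, hidden_dim),
          torch.zeros(n_layers, batch_size, hidden_dim))

embeds = embedding(x)                                    # (batch, seq_len, embedding_dim)
lstm_out, hidden = lstm(embeds, hidden)                  # (batch, seq_len, hidden_dim)
flat = lstm_out.contiguous().view(-1, hidden_dim)        # (batch * seq_len, hidden_dim)
sig_out = torch.sigmoid(fc(flat)).view(batch_size, -1)   # (batch, seq_len)
last = sig_out[:, -1]                                    # (batch,), one value per review

for name, t in [("embeds", embeds), ("lstm_out", lstm_out),
                ("flat", flat), ("sig_out", sig_out), ("last", last)]:
    print(name, tuple(t.shape))
```

Indexing with `[:, -1]` keeps every row of the batch and picks the final time step, which is why the result has one prediction per review.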

## Instantiate the network

Hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values give the network more capacity, which often helps performance but also increases training time and the risk of overfitting. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1-3


In [None]:
# Instantiate the model with hyperparams
vocab_size = len(vocab_to_int) + 1  # add one slot for our '0' paddings!
output_size = 1  # a sigmoid value between 0 and 1
embedding_dim = 400 
hidden_dim = 256 
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

---
## Training

We use a kind of cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.
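
To make the loss concrete: for a label `y` (0 or 1) and a predicted probability `p`, binary cross entropy is `-(y * log(p) + (1 - y) * log(1 - p))`, averaged over the batch. The sketch below computes this by hand in plain Python; the helper name `binary_cross_entropy` is an illustrative assumption and not part of the notebook.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean binary cross entropy for probabilities in (0, 1) and 0/1 targets."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss ...
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
# ... while confident but wrong predictions are penalized heavily
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])

print(round(confident_right, 4), round(confident_wrong, 4))
```

The asymmetry is the point of the loss: a prediction of 0.1 for a positive review costs far more than a prediction of 0.9 gains.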

We also have some data and training hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum norm of the gradients; if the overall gradient norm exceeds this value, the gradients are scaled down (to prevent exploding gradients).
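
As a rough illustration of what clipping by norm does, the sketch below rescales a list of gradient values so that their combined L2 norm does not exceed a maximum. It is a simplified plain-Python stand-in for the idea behind `nn.utils.clip_grad_norm_`, not its implementation; the helper name `clip_by_total_norm` is an assumption.

```python
import math

def clip_by_total_norm(grads, max_norm):
    """Scale all gradients by one common factor so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1.0 so that small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# A large gradient (norm 50) is shrunk to roughly norm 5, keeping its direction
clipped, norm_before = clip_by_total_norm([30.0, 40.0], max_norm=5)
print(norm_before, clipped)

# A small gradient (norm 0.5) passes through unchanged
unchanged, _ = clip_by_total_norm([0.3, 0.4], max_norm=5)
print(unchanged)
```

Because every value is multiplied by the same factor, the direction of the update is preserved and only its size is limited.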

In [None]:
# Loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

**Note on output and target format:** In the training loop, we squeeze the outputs with `output.squeeze()` so that they have no extra dimension of size 1, and we convert the labels to float tensors with `labels.float()`, since `BCELoss` expects float targets with the same shape as the outputs. Then we perform backpropagation as usual.
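
A small stand-alone sketch of this shape and dtype handling follows. The tensors are made-up values, and the `(batch, 1)` output shape is an assumption used to show what `squeeze()` removes; the model in this notebook may already return a one-dimensional tensor, in which case `squeeze()` leaves it as is.

```python
import torch
import torch.nn as nn

# Hypothetical sigmoid outputs with a trailing dimension of size 1
output = torch.tensor([[0.8], [0.3], [0.6]])   # shape (3, 1), dtype float32
# Integer labels, as they typically come out of a DataLoader built from int arrays
labels = torch.tensor([1, 0, 1])               # shape (3,), dtype int64

criterion = nn.BCELoss()

squeezed = output.squeeze()       # shape (3,), now matches the labels
float_labels = labels.float()     # BCELoss needs float targets

loss = criterion(squeezed, float_labels)
print(tuple(squeezed.shape), float_labels.dtype, loss.item())
```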

In [None]:
# Training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip = 5  # gradient clipping

# Move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# Train for some number of epochs
for e in range(epochs):
    # Initialize hidden state
    h = net.init_hidden(batch_size)

    # Batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # Zero accumulated gradients
        net.zero_grad()

        # Get the output from the model
        output, h = net(inputs, h)

        # Calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # Loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            # No gradients are needed for validation, so skip building the graph
            with torch.no_grad():
                for inputs, labels in valid_loader:

                    if(train_on_gpu):
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output.squeeze(), labels.float())

                    val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

---
## Testing

There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on the held-out test data defined above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.
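
The accuracy calculation boils down to rounding each predicted probability to 0 or 1 and counting matches with the labels. A minimal sketch with made-up probabilities and labels (not results from this model):

```python
import torch

# Hypothetical sigmoid outputs and true labels for four reviews
probs = torch.tensor([0.91, 0.12, 0.55, 0.40])
labels = torch.tensor([1, 0, 0, 1])

pred = torch.round(probs)                        # 0.5 threshold -> predicted class 0 or 1
num_correct = pred.eq(labels.float()).sum().item()
accuracy = num_correct / len(labels)

print(pred.tolist(), num_correct, accuracy)
```

Over a full test set the same count is accumulated batch by batch and divided by the number of test examples at the end.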

In [None]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

### Inference on a test review

You can change this test_review to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts correctly!
    
> **Exercise:** Write a `predict` function that takes in a trained net, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review!
* You can use any functions that you've already defined or define any helper functions you want to complete `predict`, but it should just take in a trained net, a text review, and a sequence length.
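
One step any such `predict` function needs is turning a list of word integers into a fixed-length feature vector. The sketch below left-pads short reviews with zeros and truncates long ones; the helper name `pad_left` is an assumption for illustration.

```python
import numpy as np

def pad_left(tokens, sequence_length):
    """Return exactly sequence_length integers: zeros on the left, tokens on the right."""
    tokens = tokens[:sequence_length]              # truncate reviews that are too long
    features = np.zeros(sequence_length, dtype=int)
    if tokens:                                     # an empty review stays all zeros
        features[-len(tokens):] = tokens
    return features

short = pad_left([7, 3, 9], 5)
long = pad_left([1, 2, 3, 4, 5, 6, 7], 5)
empty = pad_left([], 5)

print(short, long, empty)
```

The guard for an empty token list matters because `features[-0:]` selects the whole array, so assigning an empty list to it would raise an error.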


In [None]:
# Test reviews
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. \
                   This movie had bad acting and the dialogue was slow.'
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'
test_review_raph = 'I am not so sure if I can recommend this movie, it was more or less ok, \
                    but that is all I can say about it.'

In [None]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be 
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    
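    # Lowercase the review and strip punctuation
    review_clean = ''.join([c for c in test_review.lower() if c not in punctuation])
    # Map words to integers; words missing from the vocabulary are skipped
    # instead of raising a KeyError
    review_int = [vocab_to_int[word] for word in review_clean.split() if word in vocab_to_int]
    review_int = review_int[:sequence_length]  # truncate long reviews
    # Left-pad with zeros up to sequence_length
    features = np.zeros(sequence_length, dtype=int)
    if review_int:
        features[-len(review_int):] = review_int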
    features = features.reshape(1, -1)  # transform to 2D
    feature_tensor = torch.from_numpy(features)  # convert to tensor to pass into model
    print(feature_tensor.size())
    net.eval()
    
    batch_size = feature_tensor.size(0)
    h = net.init_hidden(batch_size)

    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()

    # Get the output from the model
    output, h = net(feature_tensor, h)
    pred = torch.round(output.squeeze())  # convert output probabilities to predicted class (0 or 1)
    
    
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")       

In [None]:
# Call function
seq_length=200
predict(net, test_review_neg, seq_length)

In [None]:
# Call function
seq_length=200
predict(net, test_review_pos, seq_length)

In [None]:
# Call function
seq_length=200
predict(net, test_review_raph, seq_length)

---