<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/2-first-nlp-application/sst_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Introducing sentiment analysis

Suppose you want to extract users’ subjective opinions
from online survey results. You have a collection of textual data written in response to a
free-response question, but the answers to the “How do you like our
product?” question are missing, and you’d like to recover them from the text.

This task is called sentiment
analysis, a text analytics technique for automatically identifying
and categorizing subjective information in text. The technique is widely used
to quantify opinions, emotions, and similar information that is written in an unstructured way
and is therefore hard to quantify otherwise. Sentiment analysis is applied to a wide variety of
textual resources, such as surveys, reviews, and social media posts.

In machine learning, classification means categorizing something into a set of predefined,
discrete categories. One of the most basic tasks in sentiment analysis is the
classification of polarity, that is, to classify whether the expressed opinion is positive,
negative, or neutral.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/2-first-nlp-application/images/1.png?raw=1' width='800'/>



##Setup

In [None]:
!pip -q install allennlp==2.5.0
!pip -q install allennlp-models==2.5.0
!git clone https://github.com/mhagiwara/realworldnlp.git
%cd realworldnlp

In [3]:
from itertools import chain
from typing import Dict

import numpy as np
import torch
import torch.optim as optim
from allennlp.data.data_loaders import MultiProcessDataLoader
from allennlp.data.samplers import BucketBatchSampler
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder, PytorchSeq2VecWrapper
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.nn.util import get_text_field_mask
from allennlp.training import GradientDescentTrainer
from allennlp.training.metrics import CategoricalAccuracy, F1Measure
from allennlp_models.classification.dataset_readers.stanford_sentiment_tree_bank import StanfordSentimentTreeBankDatasetReader

from realworldnlp.predictors import SentenceClassifierPredictor

In [4]:
EMBEDDING_DIM = 128
HIDDEN_DIM = 128

##What is a dataset?

In NLP, the records in a dataset are usually linguistic units of some kind, such as words,
sentences, or documents. A dataset of natural language texts is called a corpus (plural: corpora).

If a dataset contains a collection of sentences annotated
with their parse trees, it is called a treebank. The most famous example
is the [Penn Treebank (PTB)](http://realworldnlpbook.com/ch2.html#ptb), which
has served as the de facto standard dataset for training and evaluating models for NLP tasks
such as part-of-speech (POS) tagging and parsing.

A term closely related to record is instance. In machine learning, an instance is
the basic unit for which a prediction is made.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/2-first-nlp-application/images/2.png?raw=1' width='800'/>

Finally, a label is a piece of information
attached to some linguistic unit in a dataset.

Labels are usually used as training signals (i.e., answers for
the training algorithm) in a supervised machine learning setting.
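
To make these terms concrete, here is a minimal sketch of a tiny labeled corpus in plain Python. The `LabeledSentence` class and the three example sentences are made up for illustration; this is not AllenNLP's own `Instance` type.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledSentence:
    """One instance: the unit a prediction is made for, plus its label."""
    tokens: List[str]
    label: str


# A toy corpus of three instances (hypothetical survey responses).
corpus = [
    LabeledSentence("I love this product".split(), "positive"),
    LabeledSentence("It broke after a week".split(), "negative"),
    LabeledSentence("It arrived on Tuesday".split(), "neutral"),
]

# Labels are the "answers" a supervised learner is trained to reproduce.
label_counts = Counter(instance.label for instance in corpus)
print(label_counts)
```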



###Train, validation, and test sets

A train (or training) set is the main dataset used to train the NLP/ML models.
Instances from the train set are usually fed to the ML training pipeline directly and
used to learn parameters of the model.

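A validation set (also called a dev or development set) is used for model selection. Model
selection is the process of choosing appropriate NLP/ML models among all the
models that can be trained using the train set. It is necessary because a model that scores
best on the train set may simply have memorized it (overfitting), so candidates have to be
compared on data they were not trained on.

A test set is held out until the end and used to evaluate the selected model on data that
influenced neither training nor model selection.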

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/2-first-nlp-application/images/3.png?raw=1' width='800'/>

In summary, when training NLP models, use a train set to train your model candidates,
use a validation set to choose good ones, and use a test set to evaluate them.
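
The three-way split can be sketched in a few lines of plain Python. This is only an illustration: the `split_dataset` helper and the 80/10/10 proportions are assumptions, and SST itself ships with predefined train, dev, and test files, so no manual split is needed for it.

```python
import random


def split_dataset(instances, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle instances and split them into train, validation, and test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains is held out for testing
    return train, dev, test


# Integers stand in for real instances here.
train_set, dev_set, test_set = split_dataset(range(100))
print(len(train_set), len(dev_set), len(test_set))
```

The important property is that the three sets are disjoint, so a model is never selected or evaluated on an instance it was trained on.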

###Loading SST datasets using AllenNLP

AllenNLP provides an abstraction called DatasetReader, which takes care of
reading a dataset from the original format (be it raw text or some exotic XML-based
format) and returns it as a collection of instances.

We are going to use StanfordSentimentTreeBankDatasetReader(), which is a type of
DatasetReader that specifically deals with Stanford Sentiment Treebank (SST) datasets,
as shown here:

In [5]:
reader = StanfordSentimentTreeBankDatasetReader()
train_path = 'https://s3.amazonaws.com/realworldnlpbook/data/stanfordSentimentTreebank/trees/train.txt'
dev_path = 'https://s3.amazonaws.com/realworldnlpbook/data/stanfordSentimentTreebank/trees/dev.txt'
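
To get a feel for what the reader has to do, here is a rough sketch of turning one tree line into a token list and a sentence label. It assumes the files use a PTB-style bracketed format in which every node starts with a sentiment label from 0 (very negative) to 4 (very positive); the `parse_sst_line` helper and the example line are made up for illustration, and the real reader does more than this.

```python
import re


def parse_sst_line(line):
    """Return (tokens, root_label) for one bracketed tree line."""
    # The label of the whole sentence is the digit right after the first "(".
    root_label = int(line[1])
    # Leaf nodes look like "(2 word)": a label, a space, and a token with no brackets.
    tokens = re.findall(r"\(\d ([^()\s]+)\)", line)
    return tokens, root_label


example_line = "(4 (2 A) (4 (3 lovely) (2 film)))"  # hypothetical example
tokens, label = parse_sst_line(example_line)
print(tokens, label)
```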

##Using word embeddings

Word embeddings are one of the most important concepts in modern NLP. Technically,
an embedding is a continuous vector representation of something that is usually discrete.
A word embedding is a continuous vector representation of a word.

In simpler terms, word embeddings are a way to represent
each word with a 300-element array (or an array of any other size) filled with
nonzero float numbers.

Can we think of some sort of numerical scale where words are represented as points, so that semantically closer words (e.g., “dog” and “cat,” which are both animals) are also geometrically closer?

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/2-first-nlp-application/images/4.png?raw=1' width='800'/>

Because computers are really good at dealing with multidimensional
spaces (because you can just represent points by arrays), you can simply keep doing
this until you have a sufficient number of dimensions. 

Let’s have three dimensions. In
this 3-D space, you can represent those three words as follows:

```python
vec = {
    "cat": [0.7, 0.5, 0.1],
    "dog": [0.8, 0.3, 0.1],
    "pizza": [0.1, 0.2, 0.8],
}
```

The x-axis (the first element) here represents some concept of “animal-ness” and
the z-axis (the third element) corresponds to “food-ness.” (I’m making these numbers
up, but you get the point.)

This is essentially what word embeddings are. You just
embedded those words in a three-dimensional space. By using those vectors, you
already “know” how the basic building blocks of the language work.
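
One common way to check that semantically closer words are also geometrically closer is cosine similarity. The following sketch applies it to the made-up three-dimensional vectors for "cat", "dog", and "pizza"; the choice of cosine similarity is an assumption for illustration, since the text only speaks of geometric closeness in general.

```python
import math

# The made-up 3-D vectors from the text.
vectors = {
    "cat": [0.7, 0.5, 0.1],
    "dog": [0.8, 0.3, 0.1],
    "pizza": [0.1, 0.2, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_pizza = cosine_similarity(vectors["cat"], vectors["pizza"])
print(round(cat_dog, 2), round(cat_pizza, 2))
```

With these numbers, "cat" comes out much closer to "dog" than to "pizza", which matches the intuition the vectors were designed to capture.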

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/2-first-nlp-application/images/5.png?raw=1' width='800'/>

###Using word embeddings for sentiment analysis

First, we create dataset loaders that take care of loading data and passing it to the training pipeline.

In [6]:
sampler = BucketBatchSampler(batch_size=32, sorting_keys=["tokens"])
train_data_loader = MultiProcessDataLoader(reader, train_path, batch_sampler=sampler)
dev_data_loader = MultiProcessDataLoader(reader, dev_path, batch_sampler=sampler)

loading instances: 8544it [00:03, 2305.34it/s]
loading instances: 1101it [00:01, 782.93it/s]
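
The `BucketBatchSampler` used above groups instances of similar length into the same batch, so that less padding is needed when a batch is turned into a tensor. Here is a simplified sketch of that idea in plain Python. The `bucket_batches` and `pad_batch` helpers are hypothetical, and the sketch leaves out the shuffling and randomization that a real sampler applies.

```python
def bucket_batches(sentences, batch_size):
    """Sort sentences by length, then cut the sorted list into batches."""
    by_length = sorted(sentences, key=len)
    return [by_length[i:i + batch_size]
            for i in range(0, len(by_length), batch_size)]


def pad_batch(batch, pad_token="<pad>"):
    """Pad every sentence in a batch to the length of the longest one."""
    width = max(len(sentence) for sentence in batch)
    return [sentence + [pad_token] * (width - len(sentence)) for sentence in batch]


sentences = [
    ["great"],
    ["one", "of", "the", "best", "films"],
    ["not", "bad"],
    ["i", "did", "not", "enjoy", "this", "movie"],
]

batches = [pad_batch(b) for b in bucket_batches(sentences, batch_size=2)]
for batch in batches:
    print(batch)
```

Because short sentences end up together, each batch here needs only one padding token, whereas batching the sentences in their original order would need many more.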


AllenNLP provides a useful Vocabulary class that manages mappings from some linguistic
units (such as characters, words, and labels) to their IDs.

In [12]:
# You can optionally specify the minimum count of tokens/labels.
# `min_count={'tokens':3}` here means that any token that appears fewer than three times is ignored and not included in the vocabulary.
vocab = Vocabulary.from_instances(chain(train_data_loader.iter_instances(), dev_data_loader.iter_instances()),
                                  min_count={"tokens": 3})

building vocab: 9645it [00:00, 57063.47it/s]


In [13]:
train_data_loader.index_with(vocab)
dev_data_loader.index_with(vocab)
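
Conceptually, a vocabulary is a mapping from tokens to integer IDs, with rare tokens falling back to a shared "unknown" ID. The sketch below shows that idea in plain Python. The `build_vocab` helper and the `<pad>` and `<unk>` token names are assumptions for illustration; AllenNLP's `Vocabulary` uses its own special tokens and also manages separate namespaces such as `tokens` and `labels`.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count=1):
    """Map each sufficiently frequent token to an integer ID."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    token_to_id = {"<pad>": 0, "<unk>": 1}  # reserved IDs (hypothetical names)
    for token, count in sorted(counts.items()):
        if count >= min_count:
            token_to_id[token] = len(token_to_id)
    return token_to_id


toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "bad"],
    ["great", "movie"],
]
toy_vocab = build_vocab(toy_sentences, min_count=2)

# "bad" occurs only once and "awful" never, so both map to the <unk> ID.
query = ["the", "movie", "was", "awful"]
ids = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in query]
print(toy_vocab)
print(ids)
```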

Then, you need to initialize an Embedding instance, which takes care of converting IDs to embeddings. The size (dimension) of the
embeddings is determined by `EMBEDDING_DIM`:

In [14]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size("tokens"), embedding_dim=EMBEDDING_DIM)

Finally, you need to specify which index names correspond to which embeddings and pass that mapping to `BasicTextFieldEmbedder` as follows:

In [15]:
# BasicTextFieldEmbedder takes a dict - we need an embedding just for tokens, not for labels, which are used as-is as the "answer" of the sentence classification
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

Now you can use word_embeddings to convert words to their embeddings.
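
Under the hood, converting IDs to embeddings is essentially a table lookup: the embedding layer holds one vector per vocabulary entry, and each token ID selects a row. The following plain-Python sketch illustrates that idea with hypothetical helpers (`make_embedding_table`, `embed`) and random values; a real embedding layer stores the table as a trainable tensor that is updated during training.

```python
import random


def make_embedding_table(num_embeddings, embedding_dim, seed=0):
    """Create one randomly initialized vector per vocabulary ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)]


def embed(token_ids, table):
    """Look up the vector for each token ID."""
    return [table[token_id] for token_id in token_ids]


toy_table = make_embedding_table(num_embeddings=6, embedding_dim=4)
toy_ids = [4, 3, 5, 1]  # a sentence already converted to IDs
embedded = embed(toy_ids, toy_table)

# One vector per token, each of size embedding_dim.
print(len(embedded), len(embedded[0]))
```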

##Neural networks

In [None]:
# Seq2VecEncoder is a neural network abstraction that takes a sequence of something
# (usually a sequence of embedded word vectors), processes it, and returns a single
# vector. Oftentimes this is an RNN-based architecture (e.g., LSTM or GRU), but
# AllenNLP also supports CNNs and other simple architectures (for example,
# just averaging over the input vectors).
encoder = PytorchSeq2VecWrapper(
    torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

In [None]:
model = LstmClassifier(word_embeddings, encoder, vocab)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

In [None]:
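# Train on the CPU (cuda_device=-1) for at most 20 epochs, stopping early if
# the validation metric does not improve for 10 consecutive epochs (patience).
trainer = GradientDescentTrainer(
    model=model,
    optimizer=optimizer,
    data_loader=train_data_loader,
    validation_data_loader=dev_data_loader,
    patience=10,
    num_epochs=20,
    cuda_device=-1)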

trainer.train()

In [None]:
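predictor = SentenceClassifierPredictor(model, dataset_reader=reader)
# The predictor returns one logit (raw score) per label; pick the highest one.
logits = predictor.predict('This is the best movie ever!')['logits']
label_id = np.argmax(logits)

# Map the winning index back to its label string in the 'labels' namespace.
print(model.vocab.get_token_from_index(label_id, 'labels'))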
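
The final cell picks the label with the highest logit. Logits are unnormalized scores; if probabilities are wanted, a softmax can be applied first, and it does not change which label wins. Here is a small sketch with made-up logits for five classes (the values are invented and are not output from the model above):

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, -0.3, 0.1, 0.8, 2.0]  # hypothetical scores for 5 labels
probs = softmax(example_logits)

# The index with the highest probability is also the index of the highest logit.
best_index = max(range(len(probs)), key=probs.__getitem__)
print(best_index, [round(p, 3) for p in probs])
```

Note that `best_index` is an index into the label vocabulary, not the label itself, which is why the cell above converts it back with `get_token_from_index`.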