# NATURAL LANGUAGE PROCESSING

Where deep learning has had the most impact on NLP:
* neural word embeddings: increase the accuracy of NLP algorithms
* recurrent neural networks (RNNs): effective at predicting across sequences

# Supervised NLP: Sentiment classification with IMDB movie reviews

[Corpus page](http://ai.stanford.edu/%7Eamaas/data/sentiment/)

## Encoding strategy

* **output dataset** (target predictions): already a number (1 to 5)
  * adjust the range to be between 0 and 1, so that we can use a **binary `softmax`** (equivalent to a `sigmoid` output)
* **input dataset**: bag of words
  * given a review's vocabulary, predict its sentiment
  * matrix (reviews * words): each column a word, cell $(i,j)$ tells if review $i$ contains word $j$
  * so each row is a vector of 1s (for words in the reviews) and 0s (everywhere else)
  * = *one-hot encoding*: most common format for encoding binary data (presence or absence of a feature)
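
The binary bag-of-words encoding described above can be sketched on a toy corpus. The three reviews and the helper name `encode_bag_of_words` are invented for illustration; the real dataset is far larger and would call for a sparse representation rather than a dense matrix.

```python
import numpy as np

def encode_bag_of_words(reviews):
    """Return (matrix, vocab) where matrix[i, j] == 1 if review i contains word j."""
    tokenized = [set(review.lower().split()) for review in reviews]
    vocab = sorted(set().union(*tokenized))
    word_to_col = {word: col for col, word in enumerate(vocab)}
    matrix = np.zeros((len(reviews), len(vocab)), dtype=int)
    for row, words in enumerate(tokenized):
        for word in words:
            matrix[row, word_to_col[word]] = 1   # presence only, counts are ignored
    return matrix, vocab

toy_reviews = ["great movie", "terrible movie", "great great acting"]
matrix, vocab = encode_bag_of_words(toy_reviews)
print(vocab)
print(matrix)
```

Note that the repeated word in the third review still produces a single 1: the encoding records presence, not frequency.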

## Neural network architecture

* `layer_0`: first layer
* `weights_0_1`: linear layer -> replaced with an **embedding layer** (shortcut to `layer_1`)
  * select the rows of `weights_0_1` corresponding to each word in a review, and sum (or average) them
  * this avoids a big vector-matrix multiplication that would mostly multiply by 0s (vocab size is 131094)
  * the result is the same; the only difference is that the embedding lookup is much faster
* `layer_1`: hidden layer with a nonlinearity (`relu` in these notes; the code below uses `sigmoid`)
* `weights_1_2`: linear layer
* `layer_2`: prediction layer
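
The embedding shortcut can be checked numerically: summing the selected rows gives the same result as multiplying the binary input vector by the full weight matrix. This is a minimal sketch with toy sizes and arbitrary word indices, not the notebook's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4          # toy sizes, far smaller than the real vocabulary
weights_0_1 = rng.normal(size=(vocab_size, hidden_size))

word_indices = [2, 5, 7]                 # indices of the words present in one review

# Full approach: build the binary input vector, then multiply by the weight matrix
layer_0 = np.zeros(vocab_size)
layer_0[word_indices] = 1
dense_result = layer_0.dot(weights_0_1)

# Embedding shortcut: select the rows for the words present and sum them
embedding_result = weights_0_1[word_indices].sum(axis=0)

print(np.allclose(dense_result, embedding_result))
```

The shortcut touches only three rows, while the dense product touches every row of the matrix.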

#### Predictions

* sigmoid:
  * gives an output **between 0 and 1** that can be interpreted as a **probability**
  * output should be:
    * **_close to 1_** if target label is 1
    * **_close to 0_** if target label is 0
    * condition for a correct prediction: `np.abs(layer_2 - target_label) < 0.5`
    
    
```
print(layer_2, target_label, np.abs(layer_2 - target_label), np.abs(layer_2 - target_label) < 0.5)
# [0.90341038] 1 [0.09658962] [ True]
# [0.00025796] 1 [0.99974204] [False]
# [0.00053331] 0 [0.00053331] [ True]
# [0.81981521] 0 [0.81981521] [False]
```
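
A `sigmoid` output can be read as a softmax over two classes in which the second class has a fixed score of 0. The check below is a small sketch of that equivalence; the `softmax` helper and the test scores are written for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(scores):
    shifted = scores - np.max(scores)    # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

for score in (-3.0, 0.0, 2.5):
    two_class = softmax(np.array([score, 0.0]))  # second class fixed at score 0
    print(score, sigmoid(score), two_class[0])
```

Each printed line shows the same probability twice, once from `sigmoid` and once from the two-class softmax.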

## The hidden layer

* What does the **hidden layer** learn?
  * hidden layers group datapoints from a previous layer into $n$ groups ($n$ the number of neurons)
  * each hidden neuron takes in a datapoint and answers the question: **_Is this datapoint in my group?_**
  * each hidden neuron classifies a datapoint as either *subscribing* or *not subscribing* to its group
    * **_similar datapoints_** (layers) subscribe to many of the **_same groups_**
    * **_similar inputs_** (words) have **_similar weights_** linking them to various hidden neurons
    * hidden neurons are a measure of each word's group affinity
  * hidden layer searches for useful groupings of its input
  * powerful groupings for the next layer to use to make its predictions
* What are **useful groupings**? They do two things:
  * must be useful to the prediction of an output label
  * must be an actual phenomenon in the data:
    * bad groupings just memorize data
    * good groupings pick up linguistically useful phenomena:   
      ex. a neuron that turns **_off_** when it sees *awful* and **_on_** when it sees *nice*
* Problems:
  * *it was great, not terrible* creates the same `layer_1` value as *it was terrible, not great*
  * network is very unlikely to create a hidden neuron that understands negation
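
The word-order problem follows directly from the encoding: both sentences contain the same set of words, so they select and sum the same rows. A minimal sketch, with a hypothetical `bag_of_words` helper:

```python
def bag_of_words(text):
    """Lowercase, strip commas, and return the set of words (order is discarded)."""
    return set(text.lower().replace(',', ' ').split())

a = bag_of_words("it was great, not terrible")
b = bag_of_words("it was terrible, not great")
print(a == b)
```

Since the two sets are equal, the two reviews produce identical inputs and therefore identical `layer_1` values.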

## The weights connecting words and hidden neurons

* All the weights for "good" form the embedding for "good"   
  They reflect how much the term "good" is a member of each group (hidden neuron)
* Words with similar predictive power have **similar word embeddings** (weight values)
* Words that correlate with similar labels have similar weights connecting them to various hidden neurons,
  * because the hidden layer groups them into similar hidden neurons,
  * so that the final layer (`weights_1_2`) can make correct predictions
* We can see it by taking a word and searching for other words with similar weight values connecting them to each hidden neuron (group)
* A neuron has similar meaning to other neurons in the same layer if and only if it has similar weights connecting it to the next and/or previous layers
* The meaning of a neuron entirely **depends on the target labels**


In [1]:
import sys
import os
import re
import itertools
import numpy as np

#### Loading data

In [2]:
punc = '''[!()-[]{};:'\"\, <>./?@#$%^&*_~]'''

train_dir_pos = '/Users/macbook/code/_dataset_imdb/aclImdb/train/pos'
train_dir_neg = '/Users/macbook/code/_dataset_imdb/aclImdb/train/neg'
test_dir_pos = '/Users/macbook/code/_dataset_imdb/aclImdb/test/pos'
test_dir_neg = '/Users/macbook/code/_dataset_imdb/aclImdb/test/neg'

# 12500 files each
train_files_pos = os.listdir(train_dir_pos)
train_files_neg = os.listdir(train_dir_neg)
test_files_pos = os.listdir(test_dir_pos)
test_files_neg = os.listdir(test_dir_neg)

## TRAINING DATA

# labels
train_labels = ([1] * len(train_files_pos)) + ([0] * len(train_files_neg))

# reviews
train_raw_reviews = list()

for filename in train_files_pos:
    filepath = os.path.join(train_dir_pos, filename)
    with open(filepath, 'r') as f:
        rev = ' '.join(f.readlines()).strip().lower()
        review = re.sub(r'[^\w\s]', ' ', rev)
        train_raw_reviews.append(review)

for filename in train_files_neg:
    filepath = os.path.join(train_dir_neg, filename)
    with open(filepath, 'r') as f:
        rev = ' '.join(f.readlines()).strip().lower()
        review = re.sub(r'[^\w\s]', ' ', rev)
        train_raw_reviews.append(review)

## TESTING DATA

# labels
# Folder '/Users/macbook/code/_dataset_imdb/aclImdb/test/pos' contains negative reviews !!!
# test_labels = ([0] * len(test_files_pos)) + ([0] * len(test_files_neg))
test_labels = ([1] * len(test_files_pos)) + ([0] * len(test_files_neg))

# reviews
test_raw_reviews = list()

for filename in test_files_pos:
    filepath = os.path.join(test_dir_pos, filename)
    with open(filepath, 'r') as f:
        rev = ' '.join(f.readlines()).strip().lower()
        review = re.sub(r'[^\w\s]', ' ', rev)
        test_raw_reviews.append(review)

for filename in test_files_neg:
    filepath = os.path.join(test_dir_neg, filename)
    with open(filepath, 'r') as f:
        rev = ' '.join(f.readlines()).strip().lower()
        review = re.sub(r'[^\w\s]', ' ', rev)
        test_raw_reviews.append(review)

print(f'training: {len(train_raw_reviews)} reviews, {len(train_labels)} labels')
print(f'testing : {len(test_raw_reviews)} reviews, {len(test_labels)} labels')

training: 25000 reviews, 25000 labels
testing : 25000 reviews, 25000 labels


#### Encoding the input and target datasets

In [3]:
## VOCABULARY INDEX
# Built from both training and test data (see below), so every test word has an index
# train vocab = 94 463 - train+test vocab = 131 093

def tokenizer(text):
    # Unique tokens of a review (word order and repetitions are discarded)
    return list(set(text.split(' ')))

train_sentences = list(map(tokenizer, train_raw_reviews))
test_sentences = list(map(tokenizer, test_raw_reviews))

# Listing all words in training and test data

vocab = set()

for sentence in train_sentences:
    for word in sentence:
        vocab.add(word)

for sentence in test_sentences:
    for word in sentence:
        vocab.add(word)
            
vocab = list(vocab)
print(f'vocab size: {len(vocab)}')

vocab_index = {}
for index, word in enumerate(vocab):
    vocab_index[word] = index

## DATASETS
# Converting each sentence into a sequence (list of word indices)

train_seqs = list()
for sentence in train_sentences:
    sequence = np.array([vocab_index.get(word) for word in sentence if len(word) > 0], dtype=int)
    train_seqs.append(sequence)
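# dtype=object: reviews have different lengths (ragged), which recent NumPy versions reject otherwise
train_sequences = np.array(train_seqs, dtype=object)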

test_seqs = list()
for sentence in test_sentences:
    sequence = np.array([vocab_index.get(word) for word in sentence if len(word) > 0], dtype=int)
    test_seqs.append(sequence)
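# dtype=object: reviews have different lengths (ragged), which recent NumPy versions reject otherwise
test_sequences = np.array(test_seqs, dtype=object)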


vocab size: 102935


# Embeddings

[Wikipedia: Word_embedding](https://en.wikipedia.org/wiki/Word_embedding)

<p style="background:#DDEEEE;padding:15px;">
    Vectors whose relative similarities correlate with semantic similarity based on distributional properties
    <br/>
    Firth: <i>a word is characterized by the company it keeps</i>
</p>

* Research area of **distributional semantics**
* *aims to quantify and categorize **semantic similarities** between linguistic items based on their **distributional properties** in large samples of language data* (Wikipedia)
* Representing words as vectors started in the 1960s with the development of the vector space model
* In 2013, a team at Google led by Tomas Mikolov created **`word2vec`**, a word embedding toolkit which can train vector space models faster than the previous approaches

### Limitations

Polysemy and homonymy are not handled properly:   
words with **multiple meanings** are conflated into a **single representation** (a single vector in the semantic space)   
  -> hence the need for **multi-sense embeddings**
>Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., **unsupervised** and **knowledge-based**.[23] Based on word2vec skip-gram, **Multi-Sense Skip-Gram (MSSG)**[24] performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the **Non-Parametric Multi-Sense Skip-Gram (NP-MSSG)** this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., WordNet, ConceptNet, BabelNet), word embeddings and word sense disambiguation, **Most Suitable Sense Annotation (MSSA)[25]** labels word-senses through an unsupervised and knowledge-based approach considering a word’s context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embeddings technique, so multi-sense embeddings are produced. MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.
>
>The use of multi-sense embeddings is known to **improve performance in several NLP tasks**, such as part-of-speech tagging, semantic relation identification, and semantic relatedness. However, tasks involving named entity recognition and sentiment analysis seem not to benefit from a multiple vector representation.[26]
   
   
>Software for training and using word embeddings includes Tomas Mikolov's [Word2vec](https://en.wikipedia.org/wiki/Word2vec), Stanford University's [GloVe](https://en.wikipedia.org/wiki/GloVe_(machine_learning)), AllenNLP's [ELMo](https://en.wikipedia.org/wiki/ELMo), [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), [fastText](https://en.wikipedia.org/wiki/FastText), [Gensim](https://en.wikipedia.org/wiki/Gensim), Indra[33] and [Deeplearning4j](https://en.wikipedia.org/wiki/Deeplearning4j). Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.

#### Network parameters

In [4]:
np.random.seed(1)

sigmoid = lambda x: 1 / (1 + np.exp(-x))
def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=0, keepdims=True)

alpha = 0.01
iterations = 2
hidden_size = 100

train_data_size = len(train_sequences)
test_data_size = len(test_sequences)

weights_0_1 = 0.2 * np.random.random((len(vocab), hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, 1)) - 0.1

#### Learning and predicting with training data

In [5]:
train_correct_preds, train_total_preds = 0, 0
train_acc = 0.0

for it in range(iterations):
    
    for i in range(train_data_size):
        
        sequence, target_label = train_sequences[i], train_labels[i]
        
        # weights_0_1[x]: extracting vectors of all words in sequence (indexing with a list)
        word_vectors = weights_0_1[sequence]
        layer_1 = sigmoid(np.sum(word_vectors, axis=0))  # embed + sigmoid
        layer_2 = sigmoid(np.dot(layer_1, weights_1_2))  # linear + sigmoid
        
        layer_2_delta = layer_2 - target_label  # pred - target_pred
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)
        
        weights_0_1[sequence] -= alpha * layer_1_delta  # updating weights of the words in sequence
        weights_1_2 -= alpha * np.outer(layer_1, layer_2_delta)
        
        if np.abs(layer_2_delta) < 0.5:
            train_correct_preds += 1
        train_total_preds += 1
        train_acc = train_correct_preds / float(train_total_preds)
        
        if (i % 10 == 9):
            progress = str(i / float(train_data_size))
            sys.stdout.write(f'\rIter: {it}' \
                           + f' Progress: {progress}'\
                           + f' Train-Acc: {train_acc:.5f}')
    print()
    

Iter: 0 Progress: 0.99996 Train-Acc: 0.99884
Iter: 1 Progress: 0.99996 Train-Acc: 0.99822


#### Predicting with test data

In [6]:
test_correct_preds, test_total_preds = 0, 0
test_acc = 0.0

for j in range(test_data_size):
    
    sequence, target_label = test_sequences[j], test_labels[j]
    
    word_vectors = weights_0_1[sequence]
    layer_1 = sigmoid(np.sum(word_vectors, axis=0))
    layer_2 = sigmoid(np.dot(layer_1, weights_1_2))
    
    if np.abs(layer_2 - target_label) < 0.5:
        test_correct_preds += 1
    test_total_preds += 1
    
test_acc = test_correct_preds / float(test_total_preds)
print(f'Test-Acc {test_acc}')

# !! Folder '/Users/macbook/code/_dataset_imdb/aclImdb/test/pos' contains negative reviews.
# So all reviews with 'target_label == 1' are predicted as 0 (negative)

Test-Acc 0.5
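
One possible explanation for the 0.5 test accuracy, offered as an assumption that has not been verified against this run, is the order of the training data: the training loop visits all positive reviews and then all negative ones, so the final updates push the network toward predicting 0 for every input. The very high training accuracy would be consistent with this, since the network only has to repeat the label of the current block. Visiting the examples in a shuffled order on each iteration is a common remedy. A sketch, where `shuffled_indices` is a hypothetical helper and the labels are toy data:

```python
import numpy as np

def shuffled_indices(n_examples, rng):
    """Return the indices 0..n_examples-1 in a random order."""
    return rng.permutation(n_examples)

rng = np.random.default_rng(1)
labels = [1] * 5 + [0] * 5               # toy labels sorted by class, like train_labels

order = shuffled_indices(len(labels), rng)
print([labels[i] for i in order])        # labels in the shuffled visiting order
```

In the training cell, this would amount to iterating with `for i in shuffled_indices(train_data_size, rng):` instead of `for i in range(train_data_size):`.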


## Comparing word embeddings: visualizing weight similarity

* input word: select its corresponding row in `weights_0_1`
* each entry in that row is the weight connecting that word to one hidden neuron

In [7]:
from collections import Counter
import math

def similar(target):
    target_index = vocab_index.get(target)
    scores = Counter()
    for word,index in vocab_index.items():
        raw_difference = weights_0_1[index] - weights_0_1[target_index]
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    
    return scores.most_common(20)

print(similar('terrible'))

[('amazing', -0.0), ('haschiguchi', -0.6317337806357755), ('shortsightedness', -0.6376352345185063), ('sapir', -0.6448039135797875), ('abner', -0.6448376806024975), ('chekhov', -0.6496655227088957), ('kehna', -0.6513570764963157), ('alsion', -0.6527235465071505), ('6100', -0.6547546690316058), ('80yr', -0.6571166527124129), ('schwartzenegger', -0.6574348458112316), ('larroquette', -0.6594053730151328), ('paleontology', -0.6603381885943841), ('1967', -0.6631118085294591), ('35c', -0.6648377121735045), ('uproots', -0.665032801359282), ('margraet', -0.6652087394618698), ('roberto', -0.6657892334805985), ('chjaractor', -0.6677867319114845), ('berti', -0.6687005558229965)]
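
The same Euclidean-distance search can be written with vectorized NumPy operations instead of a Python loop over the vocabulary. This is a sketch only: `most_similar`, the toy vocabulary, and the toy weights are invented for illustration and are not the notebook's trained values.

```python
import numpy as np

def most_similar(target, weights, vocab_index, top_k=3):
    """Rank words by Euclidean distance between their weight rows and the target's."""
    if target not in vocab_index:
        raise KeyError(f'{target!r} is not in the vocabulary')
    words = list(vocab_index)                          # dict insertion order
    rows = np.array([vocab_index[w] for w in words])
    distances = np.linalg.norm(weights[rows] - weights[vocab_index[target]], axis=1)
    closest = np.argsort(distances)[:top_k]
    return [(words[i], float(distances[i])) for i in closest]

toy_index = {'good': 0, 'nice': 1, 'awful': 2}
toy_weights = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
result = most_similar('good', toy_weights, toy_index)
print(result)
```

With this kind of search the query word itself always ranks first, at distance 0, so the first entry of any result identifies the word that was actually queried.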


# Meaning is derived from loss

<p style="background:#DDEEEE;padding:15px;">
    <b>Learning</b> = Adjust each weight in the <b>correct direction</b> by the <b>correct amount</b> so <code>error</code> reduces to 0
</p>
<br/>

<div style="background:#DDEEEE;padding:15px;">
    <p>
        <b>The secret</b>: For any <code>input</code> and <code>goal_pred</code>, an exact relationship is defined between <code>error</code> and <code>weight</code>, found by combining the <code>prediction</code> and <code>error</code> formula.
    </p>
    <p style="text-align:center">
        <code>error = ((0.5 * weight) - goal_pred) ** 2</code>
    </p>
    <p>
        <code>(0.5 * weight)</code> is the forward propagation part (the <code>prediction</code>), with <code>0.5</code> as the <code>input</code>
    </p>
</div>
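
The relationship between `error` and `weight` can be followed step by step with a single weight. In this sketch the input `0.5` comes from the formula above, while the goal of `0.8`, the starting weight, and the number of steps are assumed values chosen for illustration.

```python
input_value, goal_pred = 0.5, 0.8        # 0.8 is an assumed goal, not taken from the notes
weight = 0.0

for step in range(100):
    prediction = input_value * weight                        # forward propagation
    error = (prediction - goal_pred) ** 2
    weight_delta = (prediction - goal_pred) * input_value    # proportional to d(error)/d(weight)
    weight -= weight_delta                                   # step against the slope

print(round(weight, 4), error)
```

The weight settles near `1.6`, the value for which `0.5 * weight` equals the goal and the error vanishes.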

* NNs don't really learn data, they minimize the loss function (including forward propagation)
* the choice of loss function determines the neural network's knowledge   
<br/>   

* If a network is overfitting, you can **augment the loss function** by:
  * choosing simpler nonlinearities
  * smaller layer sizes
  * shallower architectures
  * larger datasets
  * or more aggressive regularization techniques
* All have a similar effect on the loss function and a similar consequence for the behavior of the network

# Word analogies

* If we train the previous network on a large enough corpus, we'll be able to:
  * take the vector for `king`
  * subtract from it the vector for `man`
  * add in the vector for `woman`
  * then search for the most similar vector (other than those in the query)
  * most often it is the vector for `queen`
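
The steps above can be sketched with hand-made vectors. The 2-D vectors below are invented so that the arithmetic works out exactly; trained embeddings only approximate this behavior, and `toy_analogy` is a hypothetical helper that uses cosine similarity rather than the Euclidean distance used elsewhere in these notes.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only)
toy_vectors = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def toy_analogy(positive, negative, vectors):
    """Return the word closest (cosine similarity) to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue                               # skip the query words themselves
        score = vec.dot(query) / (np.linalg.norm(vec) * np.linalg.norm(query))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(toy_analogy(['king', 'woman'], ['man'], toy_vectors))   # king - man + woman
```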

In [20]:
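def analogy(positive=('terrible', 'good'), negative=('bad',)):

    # Note: `norms` holds the squared L2 norm of each row, and the rows are
    # multiplied by it (not divided by the norm), so they are rescaled rather
    # than normalized to unit length.
    norms = np.sum(weights_0_1 * weights_0_1, axis=1)
    norms.resize(norms.shape[0], 1)
    
    normed_weights = weights_0_1 * norms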
    print(norms.shape, weights_0_1.shape, normed_weights.shape)
    
    query_vect = np.zeros(len(weights_0_1[0]))
    for word in positive:
        query_vect += normed_weights[vocab_index.get(word)]
    for word in negative:
        query_vect -= normed_weights[vocab_index.get(word)]
    
    scores = Counter()
    for word, index in vocab_index.items():
        raw_difference = weights_0_1[index] - query_vect
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    
    return scores.most_common(10)

print(analogy(['terrible', 'good'], ['bad']))   # terrible - bad + good = ?
print(analogy(['elizabeth', 'he'], ['she']))    # elizabeth - she + he = ?

(102935, 1) (102935, 100) (102935, 100)
[('terrible', -0.4905085610937908), ('good', -0.5026199224223888), ('salò', -0.5218130501909815), ('sbd', -0.531838353839204), ('loans', -0.5422102403450381), ('giles', -0.5428412871190563), ('withholding', -0.5446938718395303), ('discretionary', -0.5481240906829359), ('assessing', -0.5500615956821244), ('carville', -0.5505426847949003)]


#### For the next chapter

In [34]:
import numpy as np

norms = np.sum(weights_0_1 * weights_0_1, axis=1)
norms.resize(norms.shape[0], 1)
normed_weights = weights_0_1 * norms

def make_sentence_vector(words):
    # Keep only the words that are in the vocabulary and look up their indices
    indices = [vocab_index[word] for word in words if word in vocab_index]
    return np.mean(normed_weights[indices], axis=0)

reviews2vectors = list()
for review in train_sentences:
    reviews2vectors.append(make_sentence_vector(review))
reviews2vectors = np.array(reviews2vectors)

def most_similar_reviews(review):
    v = make_sentence_vector(review)
    
    scores = Counter()
    for index, value in enumerate(reviews2vectors.dot(v)):
        scores[index] = value
    
    most_similar = list()
    for index, score in scores.most_common(3):
        most_similar.append(train_raw_reviews[index][0:80])
    
    return most_similar

print(most_similar_reviews(['boring', 'awful']))
print()
print(most_similar_reviews(['nice', 'good']))
    

['comment this movie is impossible  is terrible  very improbable  bad interpretati', 'horrible waste of time   bad acting  plot  directing  this is the most boring mo', 'this movie stinks  the stench resembles bad cowpies that sat in the sun too long']

['this is actually one of my favorite films  i would recommend that everyone watch', 'this movie is terrible but it has some good effects ', 'malcolm mcdowell has not had too many good movies lately and this is no differen']
