# General Applications of Neural Networks <br> Session 6: Natural Language Processing I

**Instructor**: Wesley Beckner

**Contact**: wesleybeckner@gmail.com

---

<br>

In this session we will introduce tactics for processing, analyzing, and making predictions from text, a field known as Natural Language Processing, or NLP.
<br>

---

<br>

<a name='top'></a>

## 🙋 Question 1: What is NLP?

Where are some places we experience NLP?

To start us off here are some I'd like to mention:

* The Alexa device on my bedside table
* Gmail sentence completion
* Advertisement suggestions from Facebook listening to my conversations from my iPhone 😵‍💫

# Word Embeddings

[back to top](#top)

It is important to note that, while a lot of NLP involves machine learning, a great deal of it does not. As a first order of business, we need to convert the raw text data into a machine-readable format, i.e. a numeric or vector representation that is meaningful.

## Definitions

[back to top](#top)

* Corpus: the documents that serve as the data for an analysis or for training a machine learning model
* Vocabulary: the set of unique words contained in the corpus

Let's take a very simple example:



In [None]:
corpus = 'DREAM SEQUENCE: EXT. ARRAKIS - END OF DAY 1\
The planet Arrakis, as seen from space. \
Track across its endless windswept terrain. \
We glide into a low-hanging dark cloud that’s generated by a\
massive mining vehicle, a HARVESTER, kicking up glowing\
flecks of SPICE. We PUSH through the SPICE, creating a\
dreamlike swirl of orange flakes. \
Through the swirl WE REVEAL a SECOND HARVESTER airborne,\
being hauled by a powerful CARRYALL. \
ON THE GROUND - HARKONNEN SOLDIERS flanking the harvester,\
leading the industrial nightmare through the darkness. One of\
them holds a massive flag bearing the HARKONNEN EMBLEM. \
Now these soldiers are observed through the P.O.V. of a\
thermal scope. Reveal that this scope is attached to a\
strange MISSLE LAUNCHER, one of multiple cloth-shrouded\
weapons being wielded by a small band of blue-eye FREMEN\
FIGHTERS taking cover behind a sprawling black rock. A young\
female fighter, CHANI, is among them; along with a'

We grab the unique words in the corpus to build the vocabulary:

In [None]:
vocabulary = []
for word in corpus.split(' '):
  if word not in vocabulary:
    vocabulary.append(word)

and then we can confirm that the vocabulary is smaller than the corpus, since each repeated word is only stored once:

In [None]:
len(corpus.split(' '))

137

In [None]:
len(vocabulary)

112

Some extra things to note from this example: your corpus could also be defined as a *collection* of texts, and the vocabulary could be *character* based instead of *word* based, although this is much less common.
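
To make those two variations concrete, here is a minimal sketch. The two short documents are made up for illustration and are not part of the Dune corpus above:

```python
# a corpus defined as a collection of texts (two short, made-up documents)
documents = [
    "the planet arrakis",
    "the spice of arrakis",
]

# word-based vocabulary: the unique words across every document
word_vocabulary = sorted({word for doc in documents for word in doc.split(' ')})

# character-based vocabulary: the unique characters across every document
# (note that the space character counts as a member too)
char_vocabulary = sorted({char for doc in documents for char in doc})

print(word_vocabulary)
print(char_vocabulary)
```

A character-based vocabulary stays small no matter how large the corpus grows, but each individual symbol carries far less meaning than a word does.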

## Tokenization <br> *🧹cleaning that corpus*

[back to top](#top)

The first step in preparing our corpus for analysis is to tokenize it, that is, to break it into its interesting components (i.e. words) and store them as objects. To do this we typically filter out punctuation, special characters, and other dandruff, and organize the words into lists, or lists of lists if our corpus is a collection.

In [None]:
# first split according to sentences
processed = corpus.split('. ')

# then according to words
processed = [i.split(' ') for i in processed]

# show the first three sentences
processed[:3]

[['DREAM', 'SEQUENCE:', 'EXT'],
 ['ARRAKIS',
  '-',
  'END',
  'OF',
  'DAY',
  '1The',
  'planet',
  'Arrakis,',
  'as',
  'seen',
  'from',
  'space'],
 ['Track', 'across', 'its', 'endless', 'windswept', 'terrain']]
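
Notice that plain `split` leaves punctuation attached to the words (`'SEQUENCE:'`, `'Arrakis,'`). A minimal sketch of the filtering step using Python's built-in `re` module is shown below. The `sample` string is a shortened copy of the corpus, and the two regular expressions are one possible choice, not the only way to clean text:

```python
import re

# shortened copy of the corpus, used only for this illustration
sample = ('DREAM SEQUENCE: EXT. ARRAKIS - END OF DAY 1 The planet Arrakis, '
          'as seen from space. Track across its endless windswept terrain.')

# split into sentences on a period followed by whitespace
sentences = re.split(r'\.\s+', sample)

# lowercase each sentence, then keep only runs of letters, digits, and apostrophes;
# punctuation such as ':' ',' '-' and the trailing '.' is discarded
cleaned = [re.findall(r"[a-z0-9']+", sentence.lower()) for sentence in sentences]

print(cleaned)
```

Each inner list now holds lowercase words only, which is close to what the TensorFlow tokenizer produces by default.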

We can also, preferably, perform this with TensorFlow:

In [None]:
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# note that fit_on_texts iterates over the top-level elements it is given,
# i.e. if corpus is passed as a single string and not a list of strings,
# it will treat each character as its own text and tokenize
# the letters instead of the words.
tokenizer.fit_on_texts(corpus.split('. '))

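We like using TensorFlow because it takes care of a lot of the dirty work of removing punctuation and converting to lowercase. Fused tokens such as `1the`, `amassive`, and `glowingflecks` in the output come from the corpus string itself: a backslash line continuation joins two lines without inserting a space.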

In [None]:
print(tokenizer.word_index)

{'a': 1, 'the': 2, 'of': 3, 'through': 4, 'we': 5, 'by': 6, 'harvester': 7, 'arrakis': 8, 'spice': 9, 'swirl': 10, 'reveal': 11, 'being': 12, 'harkonnen': 13, 'soldiers': 14, 'one': 15, 'scope': 16, 'is': 17, 'dream': 18, 'sequence': 19, 'ext': 20, 'end': 21, 'day': 22, '1the': 23, 'planet': 24, 'as': 25, 'seen': 26, 'from': 27, 'space': 28, 'track': 29, 'across': 30, 'its': 31, 'endless': 32, 'windswept': 33, 'terrain': 34, 'glide': 35, 'into': 36, 'low': 37, 'hanging': 38, 'dark': 39, 'cloud': 40, 'that’s': 41, 'generated': 42, 'amassive': 43, 'mining': 44, 'vehicle': 45, 'kicking': 46, 'up': 47, 'glowingflecks': 48, 'push': 49, 'creating': 50, 'adreamlike': 51, 'orange': 52, 'flakes': 53, 'second': 54, 'airborne': 55, 'hauled': 56, 'powerful': 57, 'carryall': 58, 'on': 59, 'ground': 60, 'flanking': 61, 'leading': 62, 'industrial': 63, 'nightmare': 64, 'darkness': 65, 'ofthem': 66, 'holds': 67, 'massive': 68, 'flag': 69, 'bearing': 70, 'emblem': 71, 'now': 72, 'these': 73, 'are': 74,

Note that if the tokenizer sees a word that it was not fit on (i.e. one that was not in the original corpus), it will not know what number to represent the word by, and by default the word is dropped from the output.

For example, 'brutal' was not included in the original corpus:

In [None]:
tokenizer.texts_to_sequences(['The harkonnen are brutal'])

[[2, 13, 74]]

## Tokenizer Parameters

[back to top](#top)

Now we'll discuss variations we may want to make to the initialized Tokenizer object.

### Out of Vocabulary

[back to top](#top)

An important setting determines what the tokenizer does when it encounters a word it hasn't seen in the corpus. Setting `oov_token` to a string value reserves an index for that string, and every unseen word is mapped to it during `texts_to_sequences`.

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(
  oov_token='OOV')
tokenizer.fit_on_texts(corpus.split('. '))
print(tokenizer.texts_to_sequences(['The harkonnen are brutal']))

[[3, 14, 75, 1]]


In [None]:
print(tokenizer.word_index)

{'OOV': 1, 'a': 2, 'the': 3, 'of': 4, 'through': 5, 'we': 6, 'by': 7, 'harvester': 8, 'arrakis': 9, 'spice': 10, 'swirl': 11, 'reveal': 12, 'being': 13, 'harkonnen': 14, 'soldiers': 15, 'one': 16, 'scope': 17, 'is': 18, 'dream': 19, 'sequence': 20, 'ext': 21, 'end': 22, 'day': 23, '1the': 24, 'planet': 25, 'as': 26, 'seen': 27, 'from': 28, 'space': 29, 'track': 30, 'across': 31, 'its': 32, 'endless': 33, 'windswept': 34, 'terrain': 35, 'glide': 36, 'into': 37, 'low': 38, 'hanging': 39, 'dark': 40, 'cloud': 41, 'that’s': 42, 'generated': 43, 'amassive': 44, 'mining': 45, 'vehicle': 46, 'kicking': 47, 'up': 48, 'glowingflecks': 49, 'push': 50, 'creating': 51, 'adreamlike': 52, 'orange': 53, 'flakes': 54, 'second': 55, 'airborne': 56, 'hauled': 57, 'powerful': 58, 'carryall': 59, 'on': 60, 'ground': 61, 'flanking': 62, 'leading': 63, 'industrial': 64, 'nightmare': 65, 'darkness': 66, 'ofthem': 67, 'holds': 68, 'massive': 69, 'flag': 70, 'bearing': 71, 'emblem': 72, 'now': 73, 'these': 74,
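
To see the idea without any library, here is a rough pure-Python sketch of the two behaviors. The helpers `build_word_index` and `to_sequence` are hypothetical names written for this illustration; they imitate the observed output above and are not the Keras implementation:

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Hypothetical helper: map each word to an integer, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    words = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        words = [oov_token] + words  # reserve index 1 for unseen words
    return {word: index for index, word in enumerate(words, start=1)}

def to_sequence(text, word_index, oov_token=None):
    """Hypothetical helper: unseen words are dropped unless oov_token is set."""
    sequence = []
    for word in text.lower().split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            sequence.append(word_index[oov_token])
    return sequence

# two short, made-up texts to fit on
texts = ['the harkonnen soldiers', 'the harkonnen emblem']

plain_index = build_word_index(texts)
oov_index = build_word_index(texts, oov_token='OOV')

# 'are' and 'brutal' are unseen: dropped in the first case, mapped to 1 in the second
without_oov = to_sequence('The harkonnen are brutal', plain_index)
with_oov = to_sequence('The harkonnen are brutal', oov_index, oov_token='OOV')
print(without_oov)
print(with_oov)
```

Keeping a placeholder for unseen words preserves the length and word positions of the input, which matters once word order is fed to a sequence model.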

### Maximum Vocabulary

[back to top](#top)

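The `num_words` parameter limits the number of words the Tokenizer object uses from the corpus: only the most common `num_words - 1` words are kept when text is converted to sequences. This may be useful for corpora that contain many, many words (when we would like to speed up our pipeline) or when we would like to avoid overfitting to overly specific words within the training data. Note that `word_index` still lists every word in the corpus; the limit is applied in `texts_to_sequences`.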

In [None]:
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(
  oov_token='OOV', num_words=10)
tokenizer.fit_on_texts(corpus.split('. '))
print(tokenizer.texts_to_sequences(['The harkonnen are brutal']))

[[3, 1, 1, 1]]


In [None]:
print(tokenizer.word_index)

{'OOV': 1, 'a': 2, 'the': 3, 'of': 4, 'through': 5, 'we': 6, 'by': 7, 'harvester': 8, 'arrakis': 9, 'spice': 10, 'swirl': 11, 'reveal': 12, 'being': 13, 'harkonnen': 14, 'soldiers': 15, 'one': 16, 'scope': 17, 'is': 18, 'dream': 19, 'sequence': 20, 'ext': 21, 'end': 22, 'day': 23, '1the': 24, 'planet': 25, 'as': 26, 'seen': 27, 'from': 28, 'space': 29, 'track': 30, 'across': 31, 'its': 32, 'endless': 33, 'windswept': 34, 'terrain': 35, 'glide': 36, 'into': 37, 'low': 38, 'hanging': 39, 'dark': 40, 'cloud': 41, 'that’s': 42, 'generated': 43, 'amassive': 44, 'mining': 45, 'vehicle': 46, 'kicking': 47, 'up': 48, 'glowingflecks': 49, 'push': 50, 'creating': 51, 'adreamlike': 52, 'orange': 53, 'flakes': 54, 'second': 55, 'airborne': 56, 'hauled': 57, 'powerful': 58, 'carryall': 59, 'on': 60, 'ground': 61, 'flanking': 62, 'leading': 63, 'industrial': 64, 'nightmare': 65, 'darkness': 66, 'ofthem': 67, 'holds': 68, 'massive': 69, 'flag': 70, 'bearing': 71, 'emblem': 72, 'now': 73, 'these': 74,
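
The cutoff can be sketched in a few lines of plain Python. The `word_index` below is a small hand-written dictionary and `to_sequence` is a hypothetical helper that imitates the behavior shown above; it is not the Keras source:

```python
# a small hand-written word index, ordered from most to least frequent
word_index = {'OOV': 1, 'a': 2, 'the': 3, 'of': 4, 'harkonnen': 5, 'are': 6}

def to_sequence(text, word_index, num_words=None, oov_token='OOV'):
    """Hypothetical helper: indices at or above num_words are replaced by OOV."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        # unseen words and words past the cutoff both map to the OOV index
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence

unlimited = to_sequence('The harkonnen are brutal', word_index)
limited = to_sequence('The harkonnen are brutal', word_index, num_words=5)
print(unlimited)
print(limited)
```

With `num_words=5` only indices 1 through 4 survive, so `harkonnen` and `are` are treated exactly like the unseen word `brutal`.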

## Context Windows for Embeddings

One of the problems with representing our words with simple integers is that we don't have any meaningful *relationships* between our embeddings. For instance, if we have

* `apple = 1`
* `pear = 2`
* `toaster = 3`

we might expect that apple and pear should have a *similar* representation in their encoding space, but with this simple tokenization process, they do not. Furthermore, the arithmetic relationships between these representations are troublesome. For instance:

`1 + 2 = 3`

but

`'apple' + 'pear' != 'toaster'`

Troublesome indeed!

To overcome this challenge, data scientists use vector representations of their words for all but the simplest of analyses. 
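
As a toy illustration of why vectors help, the sketch below compares words with cosine similarity. The 2-element vectors are hand-picked for this example only; real embeddings are learned from data:

```python
import math

# hand-picked 2-element vectors, invented purely for illustration
vectors = {
    'apple': [0.9, 0.1],
    'pear': [0.8, 0.2],
    'toaster': [0.1, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

fruit_similarity = cosine_similarity(vectors['apple'], vectors['pear'])
appliance_similarity = cosine_similarity(vectors['apple'], vectors['toaster'])

print(fruit_similarity)
print(appliance_similarity)
```

With these made-up vectors, apple and pear point in nearly the same direction while apple and toaster do not, which is the kind of relationship a single integer per word cannot express.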

> In practice, a common rule of thumb is to use the fourth root of the number of words as the length of the embedded vector, i.e. a corpus with 10,000 words would represent each word with a 10-element vector. Before we get into the actual algorithms for determining the vector, we need to build the training set that will be used by the algorithm.
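
The arithmetic behind that rule of thumb is quick to check. `embedding_length` is a hypothetical helper name, and rounding to the nearest integer is an assumption made here for convenience:

```python
def embedding_length(number_of_words):
    """Fourth-root rule of thumb, rounded to the nearest integer."""
    return round(number_of_words ** 0.25)

# the 10,000-word example from the text
print(embedding_length(10_000))

# the 112-word vocabulary we built earlier in this session
print(embedding_length(112))
```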

# Text Classification

## Bi-Directional LSTMs