#  How NLP Pipelines Work
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

1. Text Processing: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.

2. Feature Extraction: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
3. Modeling: Design a statistical or machine learning model, fit its parameters to training data using an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.

# Why Do We Need to Process Text?

1. Extracting plain text: Textual data can come from a wide variety of sources: the web, PDFs, Word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source-specific markup or constructs that are not relevant to your task.
2. Reducing complexity: Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

## What Text Processing Will You Do in This Lesson?
You'll prepare text data from different sources with the following text processing steps:

1. Cleaning to remove irrelevant items, such as HTML tags
2. Normalizing by converting to all lowercase and removing punctuation
3. Splitting text into words or tokens
4. Removing words that are too common, also known as stop words
5. Identifying different parts of speech and named entities
6. Converting words into their dictionary forms, using stemming and lemmatization

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.
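As a rough illustration of the cleaning step, the sketch below strips HTML tags from a snippet using only Python's standard library. The helper name clean_html and the sample markup are made up for this example; real pages are usually messier and are often handled with a dedicated HTML parser instead of a regular expression.

```python
import html
import re

def clean_html(raw):
    """Strip tags, decode entities, and collapse whitespace (illustrative helper)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)        # replace every <...> tag with a space
    decoded = html.unescape(no_tags)              # turn &amp; into &, &quot; into ", etc.
    return re.sub(r"\s+", " ", decoded).strip()   # collapse runs of whitespace

sample = "<p>Is <b>AI</b> a bad thing?</p><p>Tom &amp; Jerry</p>"
print(clean_html(sample))
```

The tags are replaced with spaces rather than removed outright so that words on either side of a tag do not get glued together.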

## Normalization

Plain text is still human language, with all its variations and bells and whistles, so in normalization we try to reduce some of that complexity.

### Capitalization Removal
In the English language, the starting letter of the first word in any sentence is usually capitalized. All caps are sometimes used for emphasis and for stylistic reasons. While this is convenient for a human reader, from the standpoint of a machine learning algorithm it does not make sense to differentiate between variations that mean the same thing:

- Car
- car
- CAR

Therefore, we usually convert every letter in our text to a common case, typically lowercase, so that each word is represented by a unique token.

Here's some sample text from a movie review:

> The first time you see The Second Renaissance it may look boring. Look at it at least twice and definetly watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?

If we have the review stored in a variable called text, converting it to lowercase is a simple call to the lower method in Python.

```
# Convert to lowercase
text = text.lower()
print(text)
```

> Output
>
> the first time you see the second renaissance it may look boring. look at it at least twice and definetly watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?

Note all the letters that were changed.

### Punctuation Removal

Other languages may or may not have a case equivalent, but similar principles may apply. Depending on your NLP task, you may also want to remove special characters like periods, question marks, and exclamation points from the text and keep only letters of the alphabet and maybe numbers.

This is useful when looking at text documents as a whole in applications like document classification and clustering, where the low-level details don't affect the application.

To do this, we can use a regular expression that matches everything that is not a lowercase a to z, an uppercase A to Z, or a digit zero to nine, and replaces each match with a space.

```
import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # Anything that isn't a letter or a digit is replaced by a space
print(text)
```

> Output
> 
> the first time you see the second renaissance it may look boring look at it at least twice and definetly watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing

This approach avoids having to specify all punctuation characters, but you can use other regular expressions as well.

Lowercase conversion and punctuation removal are the two most common text normalization steps. If and when you apply these steps depends on your end goal and how you design your pipeline.
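The two normalization steps can be combined into one small helper. This is a minimal sketch: the function name normalize is made up for this example, and the final whitespace cleanup is an extra step that the text above does not require.

```python
import re

def normalize(text):
    """Lowercase the text, then replace anything that is not a letter or digit with a space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    return re.sub(r" +", " ", text).strip()  # extra step: tidy up repeated spaces

print(normalize("Are the human people the ones who started the war ? Is AI a bad thing ?"))
```

Because the text is lowercased first, the character class only needs to list a-z and 0-9.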


## Tokenization

Token is a fancy term for a symbol that holds some meaning and is not typically split up any further.

In natural language processing, our tokens are usually individual words. This means that the process of tokenization is simply splitting a sentence into a sequence of words. The simplest way to do this is using the split method which returns a list of words.

> Input
>
> the first time you see the second renaissance it may look boring look at it at least twice and definetly watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing

```
# Split text into tokens (words)
words = text.split()
print(words)
```

> Output
>
> ['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definetly', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']

Notice that it splits on whitespace characters (spaces, tabs, new lines, etc.) and automatically treats two or more whitespace characters in a row as a single separator, so it doesn't return blank strings. This can be further adjusted using optional parameters.
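The short example below illustrates the whitespace behaviour of split and its two optional parameters, the separator and maxsplit. The sample string is made up for this illustration.

```python
text = "the  first\ttime\nyou see"

# No argument: split on any run of whitespace, never returning empty strings
default_split = text.split()
print(default_split)

# Explicit separator: every single space is a boundary, so empty strings can appear
space_split = text.split(" ")
print(space_split)

# maxsplit limits the number of splits performed
limited_split = text.split(maxsplit=1)
print(limited_split)
```

Passing an explicit separator changes the rules: tabs and newlines are no longer treated as boundaries, and consecutive separators produce empty strings.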

### Natural Language Toolkit (NLTK)

So far, we've only been using Python's built-in functionality, but some of these operations are much easier to perform using a library like Natural Language Toolkit (NLTK).

The most common approach for splitting up text in NLTK is to use the word_tokenize function from nltk.tokenize.

```
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)
```

This performs the same task as split but offers a few more features.

For example, if we gave it:

> Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.

it would return the following:

> ['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']

You'll notice that punctuation marks are treated differently based on their position. For example, 'Dr.' has been tokenized as one word rather than being split into two separate tokens, 'Dr' and '.'. NLTK uses rules or patterns to decide what to do with each punctuation mark.

### NLTK's Sentence Tokenization

There are instances where you may need to split a longer document into sentences, for example when translating it. You can achieve this with NLTK using sent_tokenize.

```
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences) 
```

> Output
>
> ['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']

Each sentence could now be tokenized into words if needed.

NLTK provides several other tokenizers, including:

- a regular expression based tokenizer that can remove punctuation and perform tokenization in a single step
- a tweet tokenizer that is aware of Twitter handles, hashtags, and emoticons

Reference:

[nltk.tokenize package](http://www.nltk.org/api/nltk.tokenize.html)
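To give a feel for what a regular expression based tokenizer does, here is a minimal sketch that uses only Python's re module rather than NLTK itself. The pattern \w+ is an assumption chosen for this example: it keeps runs of letters, digits, and underscores and drops everything else, so punctuation removal and tokenization happen in one step.

```python
import re

text = "Dr. Smith graduated from the University of Washington."

# Find every run of word characters; punctuation is never part of a match
tokens = re.findall(r"\w+", text.lower())
print(tokens)
```

Note the trade-off compared with word_tokenize: the period in "Dr." is simply discarded, so abbreviations lose their punctuation.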

## Stop Word Removal

**Stop words** are words that don't add a lot of meaning to a sentence or phrase (e.g., is, the, in, at) and are often very common words.

We want to remove them to simplify procedures down the pipeline.

For example, you may have the statement:

> Dogs are the best

Even after removing "are" and "the", the positive sentiment about dogs is still conveyed.

A common package that has a pre-set list of stop words is NLTK.

```
# List stop words from NLTK
from nltk.corpus import stopwords
print(stopwords.words("english"))
```

The NLTK stop word list can then be used to filter a list of words.

```
words = ['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definetly', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']

# Remove stop words (build the set once for fast membership tests)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)
```
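The same filtering idea can be tried without downloading the NLTK corpus. The sketch below uses a tiny hand-made stop word set, which is an assumption for illustration only and is far smaller than NLTK's English list.

```python
# A tiny hand-made stop word list, for illustration only
stop_words = {"a", "an", "and", "are", "is", "of", "the", "to"}

words = "dogs are the best".split()
filtered = [w for w in words if w not in stop_words]
print(filtered)
```

Only "dogs" and "best" survive, which are the words that carry the sentiment.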

## Part-of-Speech Tagging
**Note**: Part-of-speech tagging using a predefined grammar, like the custom grammar shown later in this section, is a simple but limited solution. It can be very tedious and error-prone for a large corpus of text, since you have to account for all possible sentence structures and tags!

There are other, more advanced forms of POS tagging that can learn sentence structures and tags from given data, including Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs).

NLTK has the ability to label the parts of speech of the given words.

```
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tag parts of speech (PoS)
sentence = word_tokenize("I always lie down to tell a lie.")
pos_tag(sentence)
```

Output:
```
[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]
```

A custom grammar can be used to parse an ambiguous sentence; the parser returns every way the sentence could be read.

```
import nltk
from nltk.tokenize import word_tokenize

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
  print(tree)
```

The parse trees can also be visualized by calling the draw function on each tree.
```
# Visualize parse trees
for tree in parser.parse(sentence):
  tree.draw()
```
To learn more about PoS tagging in NLTK:

- NLTK documentation on pos_tag: [Chapter 5. Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)
- Stack Overflow thread on the tags returned by pos_tag: [What are all possible pos tags of NLTK?](https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk)

## Named Entity Recognition

Named entities are nouns or noun phrases that refer to a specific object, person, or place.

To label these we can use the ne_chunk function in NLTK.

```
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Recognize named entities in a tagged sentence
ne_chunk(pos_tag(word_tokenize("Antonio joined Udacity Inc. in California.")))
```

## Stemming 

Stemming is the process of reducing a word to its stem or root form. For example, branching, branched, and branches all stem from the word branch.

This is a very quick and rough process, so sometimes the result isn't a complete word. For example, caching, cached, and caches would all result in the stem "cach", which isn't a word. But as long as all words related to cache result in the same stem, that stem still captures the common idea.

NLTK offers a few stemmers, but in this example we will look at the Porter stemmer.

```
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)
```

## Lemmatization

Lemmatization is the process of mapping words back to their root form using a dictionary. For example, is, was, and were would all be lemmatized to "be".

The default lemmatizer in NLTK uses the WordNet database.

```
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmed)
```

A lemmatizer needs to know the part of speech of each word. The WordNet lemmatizer defaults to nouns, but we can pass the pos parameter to change which part of speech it uses.

```
# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)
```

Let's go over a summary of all the text processing we just covered.

| Text Processing        | Example                                       |
|------------------------|-----------------------------------------------|
| Given                  | Jenna went back to University                 |
| Normalized             | jenna went back to university                 |
| Tokenized              | <"jenna", "went", "back", "to", "university"> |
| Stop Word Removal      | <"jenna", "went", "university">               |
| Stemmed and Lemmatized | <"jenna", "go", "univers">                    |

## Feature Extraction
Computers represent each letter with an encoding like ASCII or Unicode, which maps the letter to a number that is then stored or transmitted in binary (0s and 1s).

Words, rather than letters themselves, hold meaning. But computers don't have a standard representation for words. In practice, a word is stored as a sequence of ASCII or Unicode values, but the meaning of words and the relationships between them are not easily captured this way.

In comparison, an image's pixel value contains the relative intensity of light. For a color image, we keep a value for each of the primary colors (red, green, and blue), which carry relevant information. This means that pixels with similar values are also visually similar, so pixel values can be used directly in a numerical model for images.

### How can we do the same thing for language that we do for images?

It depends on the model and its goal.

For example, for a graph-based model used to extract insights, you might represent words as a web of nodes.

But if you want to use a statistical model, you will need a numerical representation.

- If you are working at the document level (for spam detection or sentiment analysis of a document), you could use bag-of-words or doc2vec.
- If you are working with individual words and phrases (for text generation or machine translation), you could use word2vec or GloVe.

With practice, it becomes easier to determine which method is best for your use case.

[WordNet visualization tool](http://mateogianolio.com/wordnet-visualization/)

## Bag of Words
Each document is turned into an unordered collection of words. What counts as a document depends on the task: for a plagiarism check in a class, each student submission or report could be considered a document, but if you are looking at sentiment in tweets, each tweet would be considered a document.

The first step is text processing. The table below shows each given document and the result after text processing.


| Given                       | Text Processed                |
|-----------------------------|-------------------------------|
| Little House on the Prairie | {"littl", "hous", "prairi"}   |
| Mary had a Little Lamb      | {"mari", "littl", "lamb"}     |
| The Silence of the Lambs    | {"silenc", "lamb"}            |
| Twinkle Twinkle Little Star | {"twinkl", "littl", "star"}   |

The above table is a good start, but the result doesn't capture that there were two "Twinkle"s in "Twinkle Twinkle Little Star". A better way to do this is with a Document-Term Matrix.

This is usually done with a set of documents, known as a corpus (D).

|                             | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|-----------------------------|-------|------|--------|------|------|--------|--------|------|
| Little House on the Prairie | 1     | 1    | 1      | 0    | 0    | 0      | 0      | 0    |
| Mary had a Little Lamb      | 1     | 0    | 0      | 1    | 1    | 0      | 0      | 0    |
| The Silence of the Lambs    | 0     | 0    | 0      | 0    | 1    | 1      | 0      | 0    |
| Twinkle Twinkle Little Star | 1     | 0    | 0      | 0    | 0    | 0      | 2      | 1    |

Each number in the vector is called a term frequency.
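A document-term matrix like the one above can be built with collections.Counter. This is a minimal sketch: the token lists are written out by hand to match the matrix (with repeated tokens kept), and the vocabulary order is simply the order of first appearance, which is one choice among several.

```python
from collections import Counter

# Processed token lists for each document, with repeats kept
docs = {
    "Little House on the Prairie": ["littl", "hous", "prairi"],
    "Mary had a Little Lamb": ["mari", "littl", "lamb"],
    "The Silence of the Lambs": ["silenc", "lamb"],
    "Twinkle Twinkle Little Star": ["twinkl", "twinkl", "littl", "star"],
}

# Vocabulary in order of first appearance across the corpus
vocab = []
for tokens in docs.values():
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# One row of term frequencies per document
matrix = {}
for title, tokens in docs.items():
    counts = Counter(tokens)
    matrix[title] = [counts[term] for term in vocab]

print(vocab)
for title, row in matrix.items():
    print(title, row)
```

Counter returns 0 for terms it has not seen, so terms missing from a document need no special handling.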

### Comparing Documents

We can compare two documents based on how many words they have in common or how similar their term frequencies are.

|   |                             | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|---|-----------------------------|-------|------|--------|------|------|--------|--------|------|
| a | Little House on the Prairie | 1     | 1    | 1      | 0    | 0    | 0      | 0      | 0    |
| b | Mary had a Little Lamb      | 1     | 0    | 0      | 1    | 1    | 0      | 0      | 0    |

This is done mathematically with the dot product, which is the sum of the products of corresponding elements. The larger the dot product, the more similar the two vectors are.

$$
  a\cdot b = \sum_{n} a_n b_n
$$

|                          |     |     |     |     |     |     |     |     |
|--------------------------|-----|-----|-----|-----|-----|-----|-----|-----|
| dot product of a and b = | 1*1 | 1*0 | 1*0 | 0*1 | 0*1 | 0*0 | 0*0 | 0*0 |
| =                        | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| =                        | 1   |     |     |     |     |     |     |     |

The dot product only captures the overlap and doesn't take into account the values that don't overlap. As a result, a pair of very different documents can end up with the same dot product as a pair of identical documents.

The way to get around this is to use cosine similarity, which still uses the dot product as the numerator but divides it by the product of the vectors' magnitudes (Euclidean norms).

$$
  \cos\theta = \frac{a\cdot b}{\|a\| \times \|b\|}
$$

This essentially treats each vector as an arrow pointing in a direction and then calculates the cosine of the angle theta between the arrows for a and b. The table below shows the calculation for documents a and b.

| cos(theta) = dot(a, b) / (\|\|a\|\| x \|\|b\|\|) = | 1/3     |
|----------------------------------------------------|---------|
| dot product of a and b =                           | 1       |
| \|\|a\|\|                                          | sqrt(3) |
| \|\|b\|\|                                          | sqrt(3) |

Identical documents have a cosine similarity of 1, and documents that share no terms have orthogonal vectors with a cosine similarity of 0. In general, cosine similarity ranges from -1 (vectors pointing in opposite directions) to 1, but term frequency vectors have no negative entries, so their similarity is never below 0.
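The calculation for documents a and b can be checked with a few lines of plain Python. The helper names dot and cosine_similarity are chosen for this sketch, and the third vector, c, is the row for The Silence of the Lambs from the document-term matrix.

```python
import math

def dot(a, b):
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1, 1, 1, 0, 0, 0, 0, 0]  # Little House on the Prairie
b = [1, 0, 0, 1, 1, 0, 0, 0]  # Mary had a Little Lamb
c = [0, 0, 0, 0, 1, 1, 0, 0]  # The Silence of the Lambs

print(dot(a, b))                # 1
print(cosine_similarity(a, b))  # approximately 1/3
print(cosine_similarity(a, a))  # approximately 1 (identical documents)
print(cosine_similarity(a, c))  # 0.0 (no shared terms)
```

The norm of a vector is the square root of its dot product with itself, which is why dot is reused in the denominator.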

## Document Frequency
Bag of words treats every word as equally important. But intuitively, we know that some words occur frequently throughout a corpus. For example, in a corpus of financial documents, terms like cost or price may have a high term frequency in almost every document. To compensate for this, we can count the number of documents in which each term occurs.

|                             | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|-----------------------------|-------|------|--------|------|------|--------|--------|------|
| Little House on the Prairie | 1     | 1    | 1      | 0    | 0    | 0      | 0      | 0    |
| Mary had a Little Lamb      | 1     | 0    | 0      | 1    | 1    | 0      | 0      | 0    |
| The Silence of the Lambs    | 0     | 0    | 0      | 0    | 1    | 1      | 0      | 0    |
| Twinkle Twinkle Little Star | 1     | 0    | 0      | 0    | 0    | 0      | 2      | 1    |
| Document Frequency          | 3     | 1    | 1      | 1    | 2    | 1      | 1      | 1    |

Then divide each term frequency by that term's document frequency. The resulting value is proportional to the term frequency but inversely proportional to the number of documents the term appears in.

|                             | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|-----------------------------|-------|------|--------|------|------|--------|--------|------|
| Little House on the Prairie | 1/3   | 1    | 1      | 0    | 0    | 0      | 0      | 0    |
| Mary had a Little Lamb      | 1/3   | 0    | 0      | 1    | 1/2  | 0      | 0      | 0    |
| The Silence of the Lambs    | 0     | 0    | 0      | 0    | 1/2  | 1      | 0      | 0    |
| Twinkle Twinkle Little Star | 1/3   | 0    | 0      | 0    | 0    | 0      | 2      | 1    |
| Document Frequency          | 3     | 1    | 1      | 1    | 2    | 1      | 1      | 1    |

Terms with higher values (e.g., "Mary" and "Silence") are unique to a particular document, while smaller values mean the term is used frequently throughout the corpus (e.g., "Little" or "Lamb"). This allows for better characterization of each document.
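The document frequency row and the weighted matrix above can be reproduced as follows. This sketch uses fractions.Fraction so that the printed values match the table exactly (1/3, 1/2); that choice, and the variable names, are assumptions made for readability.

```python
from fractions import Fraction

vocab = ["littl", "hous", "prairi", "mari", "lamb", "silenc", "twinkl", "star"]
matrix = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # Little House on the Prairie
    [1, 0, 0, 1, 1, 0, 0, 0],  # Mary had a Little Lamb
    [0, 0, 0, 0, 1, 1, 0, 0],  # The Silence of the Lambs
    [1, 0, 0, 0, 0, 0, 2, 1],  # Twinkle Twinkle Little Star
]

# Document frequency: in how many documents does each term appear?
doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(len(vocab))]
print(doc_freq)

# Divide each term frequency by the document frequency of its term
weighted = [[Fraction(tf, df) for tf, df in zip(row, doc_freq)] for row in matrix]
for row in weighted:
    print([str(value) for value in row])
```

Every term in the vocabulary appears in at least one document, so the division never hits a zero document frequency here.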

## Term Frequency - Inverse Document Frequency (TF-IDF) Transform

TF-IDF is the product of two weights:

- Term Frequency (tf)
- Inverse Document Frequency (idf)

### Term Frequency
Term frequency is mathematically defined as the count of a term (t) in a document (d) divided by the total number of terms in the document.

$$
  tf(t,d) = \frac{count(t,d)}{|d|}
$$

Term Frequency Mathematical Definition

### Inverse Document Frequency
Inverse document frequency is the logarithm of the total number of documents in the corpus (D) divided by the number of documents in which the term (t) appears.

$$
  idf(t,D) = \log\left(\frac{|D|}{|\{d\in D : t\in d\}|}\right)
$$

Inverse Document Frequency Mathematical Definition

### Resultant Equation of the TF-IDF
These come together into the following mathematical formula.

$$
  tfidf(t,d,D) = tf(t,d) \times idf(t,D)
$$

Term Frequency - Inverse Document Frequency Mathematical Definition

There are many variations that try to smooth or normalize the results or try to prevent edge cases and division by zero errors.

But ultimately this is a good way to assign weight to words and indicate their relevance in a given document.
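The three formulas translate almost directly into Python. This is a minimal sketch without any smoothing: it assumes the natural logarithm (the formula above does not fix a base) and assumes every queried term appears in at least one document, so there is no division by zero.

```python
import math

# Processed token lists for the four example documents
docs = [
    ["littl", "hous", "prairi"],
    ["mari", "littl", "lamb"],
    ["silenc", "lamb"],
    ["twinkl", "twinkl", "littl", "star"],
]

def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of corpus size over the number of documents containing the term."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)  # assumes containing > 0

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("littl", docs[0], docs))   # common across the corpus, so a low weight
print(tfidf("prairi", docs[0], docs))  # unique to this document, so a higher weight
```

Both terms occur once in the first document, so the difference in weight comes entirely from the idf factor.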

## One-Hot Encodings
One-hot encodings provide a numerical representation for each word in a document. They do this by treating each word as a class and assigning it a vector that has a one in the position for that word and a zero in all other positions.

|        | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|--------|-------|------|--------|------|------|--------|--------|------|
| hous   | 0     | 1    | 0      | 0    | 0    | 0      | 0      | 0    |
| lamb   | 0     | 0    | 0      | 0    | 1    | 0      | 0      | 0    |
| silenc | 0     | 0    | 0      | 0    | 0    | 1      | 0      | 0    |
| twinkl | 0     | 0    | 0      | 0    | 0    | 0      | 1      | 0    |

This may look similar to bag of words, but here each bag holds just a single word, and we build a vector for it.
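A one-hot vector is straightforward to build by hand. The helper name one_hot is chosen for this sketch, and it makes no attempt to handle words outside the vocabulary.

```python
vocab = ["littl", "hous", "prairi", "mari", "lamb", "silenc", "twinkl", "star"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1  # raises ValueError for out-of-vocabulary words
    return vector

print(one_hot("lamb", vocab))
print(one_hot("twinkl", vocab))
```

The length of every vector equals the size of the vocabulary, which is the property that becomes a problem as the vocabulary grows.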

## Word Embeddings

As the number of words in a dataset grows, one-hot encoding becomes less and less sustainable because the size of each word representation grows with the size of the vocabulary.

This is where word embeddings come in: they limit the word representation to a fixed-size vector. For each word, we want to find an embedding in a vector space that exhibits some desired properties.

For example, words with similar meanings, such as kid and child, should be closer together than words with disparate meanings (e.g., kid and rock).

Another example is words that differ in similar ways, like man, king, woman, and queen. The distance between man and woman should be similar to the distance between king and queen.
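These two properties can be pictured with a toy example. The 2-D vectors below are hand-picked for illustration and are NOT learned embeddings; real embeddings have many more dimensions and only approximately satisfy these relationships.

```python
import math

# Hand-picked 2-D toy vectors, NOT learned embeddings
vectors = {
    "man": (1.0, 1.0), "woman": (1.0, 3.0),
    "king": (4.0, 1.0), "queen": (4.0, 3.0),
    "kid": (2.0, 2.0), "child": (2.2, 2.1), "rock": (9.0, -4.0),
}

def distance(w1, w2):
    """Euclidean distance between two word vectors."""
    return math.dist(vectors[w1], vectors[w2])

# Similar meanings sit closer together than disparate ones
print(distance("kid", "child") < distance("kid", "rock"))

# Words that differ in the same way are the same distance apart
print(distance("man", "woman"), distance("king", "queen"))
```

In this toy layout, the second coordinate separates man from woman and king from queen, while the first separates the royal pair from the non-royal pair.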



# Modeling
The final stage of the *NLP pipeline* is **modeling**, which includes designing a statistical or machine learning model, fitting its parameters to training data using an optimization procedure, and then using it to make predictions about unseen data.

The nice thing about working with numerical features is that it allows you to choose from all machine learning models or even a combination of them.

Once you have a working model, you can deploy it as a web app, mobile app, or integrate it with other products and services. The possibilities are endless!

## Word Embedding - Word2Vec

Word2Vec is one of the most popular word embeddings. As the name indicates, it transforms words into vectors, but let's look at how that transformation is done.

The core idea is to predict a given word using its neighboring words, or to use a word to predict its neighboring words. A model that can do this well is likely to have a strong grasp of the contextual meaning of the words.

There are 2 main cases:

1. You are given a word and the model predicts the neighboring words. This is called Continuous Skip-gram.
2. You are given the neighboring words and the model predicts the word they surround. This is called Continuous Bag of Words (CBoW).

### Case 1: Skip-gram Model
In the Skip-gram model, a word is chosen from a sentence. This word is converted into a one-hot encoded vector and fed into a neural network or probabilistic model. The model is designed to predict a few surrounding words, its context. We then optimize the model's weights or parameters and repeat until it predicts the surrounding words as well as it can.

Now, take an intermediate representation like a hidden layer in a neural network. The outputs of that layer for a given word become the corresponding word vector.

The figure below shows how the Skip-gram model is given the word "jumps", converts it into a vector, passes it through a neural network whose hidden layer provides the word vector, and returns "brown", "fox", "over", and "the" as the neighboring words.

![Skip Gram model](skip-gram-model.png)
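The training data for a Skip-gram model consists of (word, neighboring word) pairs. The sketch below only generates those pairs; it does not train a model. The function name skipgram_pairs, the window size of 2, and the example sentence are assumptions chosen so that the pairs for "jumps" match the neighbors in the figure.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with each neighbor at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words the model would be trained to predict for "jumps"
jumps_context = [context for target, context in pairs if target == "jumps"]
print(jumps_context)
```

Words near the edges of the sentence simply get fewer pairs, since the window is clipped at the sentence boundaries.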

### Case 2: Continuous Bag of Words (CBoW)
In the CBoW model, the direction is reversed: the neighboring words are the input and the model predicts the word they surround.

This yields a very robust representation of words because the meaning of each word is distributed throughout the vector. The size of the word vector depends on how you want to trade off performance against complexity. Unlike BoW, CBoW's vector size remains constant no matter how many words are in the vocabulary. Once you have trained a large set of word vectors, you can store them in a lookup table for future use.

Once the vectors are in a lookup table, they can be used in deep learning architectures. For example, they can be used as the input vectors for recurrent neural networks (RNNs). It is also possible to use RNNs to learn even better word embeddings. Other optimizations can further reduce the model and training complexity, such as representing the output words using Hierarchical Softmax and computing the loss using Sparse Cross Entropy.

## Global Vectors for Word Representation (GloVe)
GloVe, or Global Vectors for Word Representation, is an embedding approach that tries to directly optimize the vector representation of each word using co-occurrence statistics.

First, we compute the probability that word j appears in the context of word i. For example, what is the probability that the word "cup" appears in the context (within 1-2 neighboring words) of the word "coffee"? The words "cup" and "coffee" are often related, so we would expect a relatively high probability.

Then, we count all such occurrences of i and j in our text collection and normalize the counts to get probabilities. Two random vectors are initialized for each word:

1. Word as a context
2. Word as the target

Now, for any pair of words i and j, we want the dot product of their word vectors to be equal to their co-occurrence probability.

$$
  P(j|i) = w_i \cdot w_j
$$

Co-occurrence Probability Equation

Using this as our goal and a suitable loss function, we can iteratively optimize these word vectors. The result should be a set of vectors that capture the similarities and differences between individual words. If you look at it from another point of view, we are essentially factorizing the co-occurrence probability matrix into two smaller matrices. This is the basic idea behind GloVe. All that sounds good, but why co-occurrence probabilities?

Consider this table and probabilities:

|       | solid           | water           |
|-------|-----------------|-----------------|
| ice   | P(solid\|ice)   | P(water\|ice)   |
| steam | P(solid\|steam) | P(water\|steam) |

Intuitively, one would come across "solid" more often in the context of "ice" than in the context of "steam", while "water" could occur in either context with roughly equal probability. And that is what we see in the co-occurrence probabilities.

$$
  \frac{P(solid|ice)}{P(solid|steam)} \gg 1
$$

$$
  \frac{P(water|ice)}{P(water|steam)} \approx 1
$$

Probability comparison of ice, steam, solid, and water

Given a large corpus, you'll find that the ratio of P(solid|ice) to P(solid|steam) is much greater than one, while the ratio of P(water|ice) to P(water|steam) is close to one.

Thus, we see that co-occurrence probabilities already exhibit some of the properties we want to capture. In fact, one refinement over using raw probability values is to optimize for the ratio of probabilities. The co-occurrence probability matrix is huge and the co-occurrence probability values are typically very low, so it makes sense to work with the log of these values.
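Counting co-occurrences and normalizing them into probabilities can be sketched in a few lines. This only covers the statistics that GloVe starts from, not the optimization itself. The four-sentence corpus, the window size of 4, and the function name are all made up for this illustration.

```python
from collections import Counter, defaultdict

# A made-up toy corpus; a real one would contain millions of sentences
corpus = [
    "ice is a cold solid",
    "ice is frozen water",
    "steam is hot water",
    "steam is a hot gas",
]

def cooccurrence_probabilities(sentences, window=4):
    """Estimate P(j | i): how often word j appears within `window` words of word i."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    # Normalize each word's context counts so they sum to 1
    probs = {}
    for word, ctx_counts in counts.items():
        total = sum(ctx_counts.values())
        probs[word] = {ctx: n / total for ctx, n in ctx_counts.items()}
    return probs

probs = cooccurrence_probabilities(corpus)
print(probs["ice"].get("water", 0.0), probs["steam"].get("water", 0.0))
print(probs["ice"].get("solid", 0.0), probs["steam"].get("solid", 0.0))
```

In this tiny corpus "solid" never appears near "steam", so its probability is exactly zero and the ratio is undefined. In a large corpus the value would be small but non-zero, which is one reason practical methods apply smoothing and work with logs.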

I encourage you to read the paper that introduced GloVe to get a better understanding of this technique: [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).

### Embedding for Deep Learning

Word embeddings are fast becoming the de facto choice for representing words, especially for use in deep neural networks. The distributional hypothesis states that words that occur in the same contexts tend to have similar meanings. For example, consider these sentences:

***A**: Would you like to have a cup of \<blank\>?*

***B**: I like my \<blank\> black.*

***C**: I need my morning \<blank\> before I can do anything.*

By now you probably have a word to fill in the \<blank\>. Let's look at some follow-up questions:

1. What would the blank be? "Tea" or "Coffee"
2. What words in the sentence gave you the context clue for the word?
    - "Cup"
    - "Black"
    - "Morning"

Either "Tea" or "Coffee" could fill in the blanks and make sense. In these contexts, tea and coffee are actually similar. Therefore, when a large collection of sentences is used to learn an embedding, words with common context words tend to get pulled closer and closer together. Of course, there could also be contexts in which tea and coffee are dissimilar.

For example:

***A**: \<blank\> grounds are great for composting.*

***B**: I prefer loose leaf \<blank\>.*

A is clearly talking about "coffee grounds", while B is talking about "loose leaf tea".

We can capture these similarities and differences in the same embedding by adding another dimension. Words can be close along one dimension but separated along another. For example, "tea" and "coffee" are both beverages, so they are close along one dimension, but they differ in other ways, and another dimension could capture the variability among beverages.

In a human language, there are many more dimensions along which word meanings can vary. The more dimensions you can capture in your word vector, the more expressive that representation will be.

### How many dimensions do you really need?
For example, a typical neural network architecture designed for an NLP task like word prediction could have a few hundred dimensions in its word embedding layer. This might seem large, but remember that a one-hot encoding is as large as the vocabulary, which sometimes contains tens of thousands of words.

You can also learn the embedding as part of the model training process and obtain a representation that captures the dimensions most relevant to your task. This often adds complexity, so we often use pre-trained embeddings (Word2Vec or GloVe) as a lookup, unless the use case is very narrow, like one for medical terminology. This lets you train only the layers specific to your task.

Compare this with the network architecture for a computer vision task, say, image classification. The raw input here is also very high dimensional. For example, even a 128 by 128 image contains over 16 thousand pixels. We typically use convolutional layers to exploit the spatial relationships in image data and reduce this dimensionality. Early stages of visual processing are often transferable across tasks, so it is common to use some pre-trained layers from an existing network, like AlexNet or VGG16, and only learn the later layers. Come to think of it, using an embedding lookup for NLP is not unlike using pre-trained layers for computer vision. Both are great examples of transfer learning.

# t-SNE
Word embeddings need to have high dimensionality to capture sufficient variations in natural language, but this makes them hard to visualize.

**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a dimensionality reduction technique that can map high dimensional vectors to a lower dimensional space. It's kind of like Principal Component Analysis (PCA), but it tries to maintain relative distances between objects, so that similar ones stay closer together while dissimilar objects stay further apart.

If we look at the larger vector space, we can discover meaningful groups of related words. Sometimes it takes a while to realize why certain clusters are formed, but most of the groupings are very intuitive.

t-SNE also works on other kinds of data, such as images. For example, pictures from the Caltech 101 dataset are organized into clusters that roughly correspond to class labels:

- airplanes with blue sky
- sailboats of different shapes and sizes
- human faces

This is a very useful tool for better understanding the representation that a network learns and for identifying any bugs or other issues.