# Introduction to Natural Language Processing (NLP)

## NLP Pipelines

NLP pipelines in general consist of three stages: 

- Text Processing
- Feature Extraction
- Modeling

## Text Processing

- Source text such as HTML needs to be prepared first, i.e. the text has to be extracted from the web page.
 - Example: [Kingfisher](https://en.wikipedia.org/wiki/Kingfisher)
- Goal: Extract plain text

Other text processing steps:

- Normalize capitalization, e.g. convert everything to lowercase (capitalization generally does not change meaning).
- Remove punctuation.
- Filter out common words that add little meaning (stop words).
- Remove word endings, i.e. reduce each word to its stem (stemming).
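The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the stop-word list, the suffix list, and the helper names (`extract_text`, `naive_stem`, `preprocess`) are assumptions made for this example, and the suffix stripping is only a crude stand-in for a real stemming algorithm.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "to", "on"}

# Naive suffix list used as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


class TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return the plain text of an HTML snippet (tags are dropped)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and stem the rest."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


html = "<p>The <b>kingfishers</b> are diving, and the river is calm.</p>"
print(preprocess(extract_text(html)))
```

Note that the naive stemmer turns "diving" into "div", which shows why stems are not always real words.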

## Feature Extraction

#### Bag of Words

- Treats each document as an unordered collection, or bag, of words.
- Each observation consists of the words of the text as features.

To obtain a bag of words we need to apply appropriate text processing steps. The resulting tokens are then treated as an unordered collection in which word order is ignored but repeated words still count.

- Vectorize all words that were used, i.e. collect the vocabulary of all unique tokens in the corpus.
- Represent occurrences using a document-term matrix: one row per document, one column per term, each entry a count.
- Compare different observations using this representation.
 - Mathematical way to express similarity: the dot product of the two row vectors.
 - The dot product is the sum of the products of corresponding elements.

$$
a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n
$$

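- Better measure: **Cosine similarity**, which normalizes the dot product by the vector lengths so that long documents do not score higher merely because they contain more words.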

$$
\cos(\theta) = \frac{a \cdot b}{\|a\| \cdot \|b\|}
$$

where $\|a\| = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2}$ and $\|b\|$ are the magnitudes, or Euclidean norms, of the two vectors.

If you think of the two vectors as arrows in some n-dimensional space, then this is equal to the cosine of the angle $\theta$ between them.

- Identical vectors have cosine of 1
- Orthogonal vectors have cosine of 0
- Opposite vectors have cosine of -1
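As a rough sketch of the bag-of-words workflow, the following builds a document-term matrix from three toy documents and compares two rows with the dot product and the cosine similarity. The documents and the helper names (`to_vector`, `dot`, `cosine_similarity`) are made up for this example.

```python
import math
from collections import Counter

# Toy documents, already tokenized (hypothetical example data).
docs = [
    ["little", "house", "prairie"],
    ["mary", "had", "little", "lamb"],
    ["little", "house", "little", "lamb"],
]

# The sorted vocabulary fixes the column order of the document-term matrix.
vocab = sorted({token for doc in docs for token in doc})


def to_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def dot(a, b):
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# One row per document, one column per vocabulary term.
dtm = [to_vector(doc) for doc in docs]

print(vocab)
for row in dtm:
    print(row)
print("dot:", dot(dtm[0], dtm[2]))
print("cosine:", round(cosine_similarity(dtm[0], dtm[2]), 3))
```

Because counts are never negative, cosine similarities between bag-of-words vectors fall between 0 and 1; the value of -1 only occurs for vectors with negative components.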


#### TF-IDF

One limitation of bag of words: it treats every word as being equally important.

We can address this by counting, for each term, the number of documents it appears in (its document frequency), and then dividing the term frequencies by the document frequency of that term to get a relative measure.

Properties of this metric:

- Proportional to the frequency of occurrence of a term in a document.
- Inversely proportional to the number of documents it appears in.
- Highlights the words that are more unique to a document.

> The TF-IDF transform is simply the product of two weights,


$$
tfidf(t, d, D) = tf(t, d) \cdot idf(t, D),
$$

with $tf(t,d)$ as the term frequency and $idf(t, D)$ being the inverse document frequency.

$$
tf(t,d) = \frac{count(t, d)}{|d|}
$$

is the raw count of a term t in a document d divided by the total number of terms in d.

$$
idf(t,D) = \log \left( \frac{|D|}{|\{d \in D : t \in d\}|} \right)
$$

is the logarithm of the total number of documents in the collection divided by the number of documents in which t is present.

Several variations exist. 
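A minimal sketch of the three formulas above, applied to a toy corpus. The corpus is invented for illustration, and `idf` assumes the term occurs in at least one document (otherwise the division fails); libraries typically use smoothed variants instead.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (hypothetical example data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "home"],
]


def tf(term, doc):
    """Raw count of the term in the document, divided by the document length."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


for term, doc in [("the", docs[0]), ("cat", docs[0]), ("dog", docs[1])]:
    print(term, round(tfidf(term, doc, docs), 4))
```

The word "the" appears in every document, so its inverse document frequency is log(1) = 0 and its TF-IDF weight vanishes, while "dog", which appears in only one document, receives the largest weight of the three.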

#### One-Hot Encoding

In the context of language processing, one-hot encoding means:

> Treat each word like a class and assign it a vector with one entry per word in the vocabulary. The entry belonging to the word is one and all other entries are zero (dummy variables).
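A small sketch of one-hot encoding over a four-word vocabulary. The vocabulary and the `one_hot` helper are made up for this example.

```python
# Hypothetical toy vocabulary; each word gets a fixed position.
vocab = ["coffee", "cup", "morning", "tea"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a single 1 at the position of the word."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


coffee, tea = one_hot("coffee"), one_hot("tea")
print(coffee)
print(tea)

# The dot product of two different one-hot vectors is always zero,
# so this encoding carries no notion of similarity between words.
print(sum(x * y for x, y in zip(coffee, tea)))
```

The vector length equals the vocabulary size, which is what makes this representation impractical for large vocabularies.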

#### Word Embeddings

Problem with one-hot encoding: it breaks down when we have a large vocabulary to deal with, because the vector length grows with the vocabulary size.

- We need a way to control the size of our word representation by limiting it to a fixed-size vector.
- We want to find an embedding for each word in some vector space, and we want it to exhibit some desired properties.
 - E.g. if words are close in meaning, they should be closer to each other in the vector space than words that are not.
 - Also, if two pairs of words have a similar difference in their meanings, they should be approximately equally separated in the embedded space.

#### Word2Vec

Word2Vec is perhaps one of the most popular examples of word embeddings used in practice. As the name suggests, it transforms words into vectors.

Core idea: 

> A model that is able to predict a word, given neighboring words, or vice versa, is likely to capture the contextual meanings of words very well. 

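- Two variants: continuous bag of words (CBoW), which predicts a word from its neighboring words, and Skip-gram, which predicts the neighboring words from a word.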

Skip-gram: 

- Pick any word from a sentence.
- Convert it into a one-hot encoded vector and feed it into a neural network (or some other probabilistic model).
- Train the model to predict the context words as well as it can.
- Take an intermediate representation, such as a hidden layer in a neural network.
- The outputs of that layer for a given word become the corresponding word vector.
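Training the network itself is beyond a short example, but the data preparation behind Skip-gram can be sketched: every word is paired with each neighbor inside a context window, and these (center, context) pairs are what the model learns to predict. The function name `skipgram_pairs` and the `window` parameter are assumptions for this illustration, not part of any particular library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["i", "like", "my", "coffee", "black"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

In a full implementation, the center word of each pair would be the (one-hot encoded) input and the context word the prediction target.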

Properties of Word2Vec:

- Robust, distributed representation
- Vector size independent of vocabulary.
- Train once, store in lookup table. 
- Deep Learning ready!

#### GloVe

Global Vectors for Word Representation (GloVe) is another approach.

- Compute the probability that word j appears in the context of word i.
- To do so, count all occurrences of i and j together in our text collection and then normalize the count to get a probability.
- Each word gets two vectors: one for when it acts as the context word and one for when it acts as the target word.
- For any pair of words i, j, we want the dot product of their word vectors to approximate their co-occurrence probability (in the paper, the logarithm of the co-occurrence count, up to bias terms).
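The counting and normalization step can be sketched as follows. This covers only the co-occurrence statistics, not the fitting of the word vectors; the context window, the toy sentences, and the function names are assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word j appears within the window around word i."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cooccurrence_prob(counts, i, j):
    """Estimate P(j | i): the count of (i, j) normalized by all counts for i."""
    row = counts.get(i, {})
    total = sum(row.values())
    return row.get(j, 0) / total if total else 0.0


# Hypothetical toy corpus.
sentences = [
    ["ice", "is", "solid"],
    ["steam", "is", "gas"],
    ["ice", "is", "cold"],
]
counts = cooccurrence_counts(sentences, window=1)
print(cooccurrence_prob(counts, "is", "ice"))
print(cooccurrence_prob(counts, "is", "steam"))
```

With this corpus, "ice" is a more probable context of "is" than "steam", simply because it co-occurs twice rather than once.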

Paper: [GloVe (PDF)](https://www.aclweb.org/anthology/D14-1162)

#### Embeddings for Deep Learning

> **Distributional Hypothesis:** Words that occur in the same contexts tend to have similar meanings.

Example: Which word are we looking for?

- Would you like a cup of ___ ? 
- I like my ___ black. 
- I need my morning ___ before I can do anything.

The point here is that in these contexts tea and coffee are actually very similar. Therefore, when a large collection of sentences is used to learn an embedding, words with common context words tend to get pulled closer and closer together.

Adding more dimensions makes it possible to capture both similarities and differences in the same embedding.

For example, coffee and tea are close along a dimension that captures being a beverage, while they may still differ along other dimensions.

A pre-trained embedding can be used as a lookup table if the application is not too specific. Using an embedding lookup for NLP is analogous to using pre-trained networks (e.g. AlexNet, VGG16) in computer vision and only training the later layers.
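A minimal sketch of an embedding lookup. The three-dimensional vectors below are hand-written for illustration only and were not produced by any trained model; real pre-trained embeddings typically have a few hundred dimensions. Mapping unknown words to a zero vector is likewise just one possible convention, assumed here for simplicity.

```python
import math

# Hypothetical embeddings, hand-written for illustration (not trained).
embeddings = {
    "coffee": [0.9, 0.1, 0.8],
    "tea": [0.8, 0.2, 0.6],
    "car": [0.1, 0.9, 0.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary words


def embed(tokens):
    """Look up one fixed-size vector per token."""
    return [embeddings.get(token, UNKNOWN) for token in tokens]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


coffee, tea, car = embed(["coffee", "tea", "car"])
print("coffee vs tea:", round(cosine(coffee, tea), 3))
print("coffee vs car:", round(cosine(coffee, car), 3))
```

In these made-up vectors, coffee and tea point in nearly the same direction while coffee and car do not, which mirrors the behavior expected from a learned embedding.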

## Modeling

The final stage includes:

- Designing a model (a machine learning or statistical one)
- Fitting its parameters to training data using an optimization procedure
- Using it to make predictions

Because we vectorized the text, we could in principle use pretty much any machine learning model we like.
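To illustrate the three steps on vectorized text, here is a deliberately simple sketch. It uses a nearest-centroid classifier on bag-of-words counts, a technique chosen for brevity and not prescribed by these notes; the training sentences and labels are invented.

```python
from collections import Counter

# Hypothetical labeled training data, already tokenized.
train = [
    (["great", "movie", "loved", "it"], "pos"),
    (["loved", "the", "acting"], "pos"),
    (["terrible", "movie", "hated", "it"], "neg"),
    (["hated", "the", "plot"], "neg"),
]

vocab = sorted({token for tokens, _ in train for token in tokens})


def vectorize(tokens):
    """Bag-of-words counts over the training vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def fit(examples):
    """Fit step: average the vectors of each class into one centroid."""
    sums, sizes = {}, Counter()
    for tokens, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vocab))
        for k, value in enumerate(vectorize(tokens)):
            acc[k] += value
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, tokens):
    """Predict step: return the label of the closest centroid."""
    vec = vectorize(tokens)

    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(vec, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


centroids = fit(train)
print(predict(centroids, ["loved", "the", "movie"]))
print(predict(centroids, ["hated", "the", "movie"]))
```

Any other model that accepts numeric feature vectors could be swapped in for the centroid step without changing the vectorization.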