## Word vectors

**Computers only know how to deal with numbers.**

We can represent images as numbers too... but how do they deal with text?

A computer can tell you whether two strings are different, but how can it understand that `football` and `Ronaldo` are related to `Messi`, or that `Apple` in `Apple is not an orange` is not the company?

We need a numerical representation of natural language that captures the meaning of words, their semantic relationships, and the different contexts they are used in. Such a numerical representation of a word is called a `word embedding`.

In [None]:
sentence = "Georgetown is a great university, nay a fantastic university!"

# Number of words = 9
# Vocabulary = ["a", "fantastic", "Georgetown", "great", 
#               "is", "nay", "university"]
# Length of vocabulary = 7
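
The counts in the comments above can be reproduced with a few lines of Python. This is only a sketch: it assumes that a token is a run of letters and that punctuation is dropped, which is one of many possible tokenization rules.

```python
import re

sentence = "Georgetown is a great university, nay a fantastic university!"

# Assumption: a token is a maximal run of letters; punctuation is ignored.
tokens = re.findall(r"[A-Za-z]+", sentence)

# Sort case-insensitively so "Georgetown" lands between "fantastic" and "great".
vocabulary = sorted(set(tokens), key=str.lower)

print(len(tokens))      # number of words
print(vocabulary)       # unique words
print(len(vocabulary))  # length of vocabulary
```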

In [None]:
# A vector representation of a word may be a `one-hot encoded` vector

one_hot = {
    "a":          [1, 0, 0, 0, 0, 0, 0], 
    "fantastic":  [0, 1, 0, 0, 0, 0, 0], 
    "Georgetown": [0, 0, 1, 0, 0, 0, 0], 
    "great":      [0, 0, 0, 1, 0, 0, 0], 
    "is":         [0, 0, 0, 0, 1, 0, 0], 
    "nay":        [0, 0, 0, 0, 0, 1, 0], 
    "university": [0, 0, 0, 0, 0, 0, 1]
}
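
The dictionary above does not have to be written by hand. The sketch below builds it from the vocabulary list; `one_hot_encode` and `dot` are illustrative helper names, not part of any library. The dot product at the end hints at the limitation of this encoding: any two different words are orthogonal, so the vectors carry no notion of similarity.

```python
vocabulary = ["a", "fantastic", "Georgetown", "great", "is", "nay", "university"]

def one_hot_encode(vocabulary):
    """Map each word to a vector with a single 1 at the word's index."""
    size = len(vocabulary)
    return {
        word: [1 if i == index else 0 for i in range(size)]
        for index, word in enumerate(vocabulary)
    }

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

one_hot = one_hot_encode(vocabulary)

print(one_hot["great"])
# "great" and "fantastic" are near-synonyms, yet their vectors share nothing.
print(dot(one_hot["great"], one_hot["fantastic"]))
```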

In [None]:
# Consider a corpus of D documents and N unique tokens. 
# The size of the Count Matrix is D x N. 
# Each row (i) in the matrix contains the frequency of tokens in document D(i).

d1 = "Georgetown is a great university, nay is a fantastic university!"
d2 = "The best university is Trump University"
vocabulary = ["a", "best", "fantastic", "Georgetown", "great", 
              "is", "nay", "the", "Trump", "university"]

# Here D=2, N=10

CV = [
    [2, 0, 1, 1, 1, 2, 1, 0, 0, 2],
    [0, 1, 0, 0, 0, 1, 0, 1, 1, 2],
]

# The sum of each row should be equal to the number of words in the document
# The sum of each column should be equal to the number of times the word
# appears in the corpus
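
As a rough sketch, the count matrix can also be computed rather than typed in. The `tokenize` and `count_vector` helpers are made up for this example, and the code assumes case-insensitive matching (so `University` counts as `university`) with punctuation ignored.

```python
import re
from collections import Counter

d1 = "Georgetown is a great university, nay is a fantastic university!"
d2 = "The best university is Trump University"
vocabulary = ["a", "best", "fantastic", "Georgetown", "great",
              "is", "nay", "the", "Trump", "university"]

def tokenize(text):
    # Assumption: lowercase everything and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def count_vector(text, vocabulary):
    # Counter returns 0 for words that never occur in the text.
    counts = Counter(tokenize(text))
    return [counts[word.lower()] for word in vocabulary]

CV = [count_vector(doc, vocabulary) for doc in (d1, d2)]

row_sums = [sum(row) for row in CV]              # words per document
column_sums = [sum(col) for col in zip(*CV)]     # occurrences per word

print(CV)
print(row_sums)
print(column_sums)
```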

There are a few common variations when preparing the matrix above:
- In real-world applications we might have a corpus with millions of unique words, so the count matrix will be very sparse. An alternative is to keep only the most frequent words, for example the top 10,000.
- We can either take the frequency (number of times a word appears in the document) or the presence (does the word appear in the document?) to be the entries in the Count Matrix.
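
Both variations can be sketched directly on the small `CV` matrix from earlier. The cutoff `k = 3` is an arbitrary stand-in for the 10,000 mentioned above, and the variable names are illustrative only.

```python
vocabulary = ["a", "best", "fantastic", "Georgetown", "great",
              "is", "nay", "the", "Trump", "university"]
CV = [
    [2, 0, 1, 1, 1, 2, 1, 0, 0, 2],
    [0, 1, 0, 0, 0, 1, 0, 1, 1, 2],
]

# Variation 1: presence instead of frequency (1 if the word occurs at all).
presence = [[1 if count > 0 else 0 for count in row] for row in CV]

# Variation 2: keep only the k most frequent words in the whole corpus.
k = 3  # assumption: tiny cutoff chosen for illustration
corpus_counts = [sum(col) for col in zip(*CV)]
ranked = sorted(range(len(vocabulary)),
                key=lambda i: corpus_counts[i], reverse=True)
keep = sorted(ranked[:k])  # restore the original column order

small_vocabulary = [vocabulary[i] for i in keep]
small_CV = [[row[i] for i in keep] for row in CV]

print(presence)
print(small_vocabulary)
print(small_CV)
```

Note that on this corpus the most frequent words are mostly function words such as `a` and `is`.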

Common words like `is`, `the`, `a` etc. tend to appear far more often than the words that are actually important to a document. For example, a document on `Georgetown` is going to contain more occurrences of the word `Georgetown` than other documents do, while common words are going to have a high frequency in every document.

`TF-IDF` penalizes common words, and gives more importance to rare words that appear in a subset of documents.

`TF = (Number of times term t appears in a document)/(Number of terms in the document)`

e.g. `TF("university", d1) = 0.2`

It quantifies the contribution of that specific word to the document: words relevant to the document should be frequent.
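
A minimal sketch of the TF formula, applied to `d1` from the count-matrix example. `term_frequency` is a hypothetical helper, and the tokenization (lowercase, letters only) is an assumption.

```python
import re

d1 = "Georgetown is a great university, nay is a fantastic university!"

def tokenize(text):
    # Assumption: lowercase everything and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(term, document):
    """Occurrences of `term` divided by the number of terms in the document."""
    tokens = tokenize(document)
    return tokens.count(term.lower()) / len(tokens)

# "university" appears 2 times among the 10 words of d1.
print(term_frequency("university", d1))
```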

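`IDF = log(N/n)`, where `N` is the number of documents in the corpus (called `D` in the count-matrix example, not the vocabulary size), and `n` is the number of documents in which term t appears. The examples here use base-10 logarithms.

e.g. for a word `This` that appears in both documents of a two-document corpus, `IDF(This) = log(2/2) = 0`

If a word appears in every document, it probably says little about any particular one.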
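
The IDF formula can be sketched the same way on the two Georgetown documents. The helper name is illustrative, and the code assumes a base-10 logarithm and a term that occurs in at least one document (otherwise the division fails).

```python
import math
import re

d1 = "Georgetown is a great university, nay is a fantastic university!"
d2 = "The best university is Trump University"
documents = [d1, d2]

def tokenize(text):
    # Assumption: lowercase everything and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def inverse_document_frequency(term, documents):
    """log10(number of documents / number of documents containing `term`)."""
    containing = sum(
        1 for doc in documents if term.lower() in set(tokenize(doc))
    )
    # Assumption: `containing` is never 0 for the terms we query.
    return math.log10(len(documents) / containing)

print(inverse_document_frequency("university", documents))  # in both documents
print(inverse_document_frequency("Georgetown", documents))  # only in d1
```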

Let us compute the IDF for the word `Messi`, using a separate two-document example (not the Georgetown corpus above) in which Document1 has 8 terms and Document2 has 5. `This` appears once in each document, while `Messi` appears 4 times in Document1 and not at all in Document2.

`IDF(Messi) = log(2/1) = 0.301`

Now, let us compare the TF-IDF of the common word `This` with that of `Messi`, which seems to be relevant to Document1.

`TF-IDF(This, Document1) = (1/8) * 0 = 0`

`TF-IDF(This, Document2) = (1/5) * 0 = 0`

`TF-IDF(Messi, Document1) = (4/8) * 0.301 = 0.15`

As you can see, for Document1 the TF-IDF method heavily penalises the word `This` but assigns a greater weight to `Messi`. In other words, `Messi` is an important word for Document1 in the context of the entire corpus.
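
The arithmetic above can be checked with a short sketch. The function works from raw counts only, since the text of Document1 and Document2 is not shown here; `tf_idf` and its parameter names are made up for this example, and the logarithm is assumed to be base 10.

```python
import math

def tf_idf(term_count, doc_length, n_docs, n_docs_with_term):
    """TF-IDF from raw counts: (term_count / doc_length) * log10(N / n)."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / n_docs_with_term)
    return tf * idf

# Counts taken from the worked example above.
this_doc1 = tf_idf(term_count=1, doc_length=8, n_docs=2, n_docs_with_term=2)
this_doc2 = tf_idf(term_count=1, doc_length=5, n_docs=2, n_docs_with_term=2)
messi_doc1 = tf_idf(term_count=4, doc_length=8, n_docs=2, n_docs_with_term=1)

print(this_doc1, this_doc2, round(messi_doc1, 2))
```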

I was about to talk about `word2vec`, but then last month Google published their
[sentence encoder](https://arxiv.org/pdf/1803.11175.pdf). Embedding the whole sentence instead of single words tends to work better for sentence-level tasks, so that's what we'll do.