# Terminology in NLP

## Corpus

A corpus (plural corpora) is a representative collection of texts large enough to observe statistically meaningful frequencies of words and word sequences.

## Word Embedding

Word embedding is the process of mapping words or phrases to vectors of real numbers. The term "word embedding" is also used for the resulting word vector itself.

## Affinity Matrix

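An affinity matrix, also called a similarity matrix, contains pairwise similarity scores between the data points in a dataset: entry $(i, j)$ measures how similar points $i$ and $j$ are. It is the counterpart of a distance matrix, in which larger values mean the points are further apart.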
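As a minimal sketch, one common way to build an affinity matrix is to turn pairwise Euclidean distances into similarities with a Gaussian (RBF) kernel. The function name `rbf_affinity`, the bandwidth `sigma` and the sample points are illustrative assumptions, not something the notes above prescribe.

```python
import math

def rbf_affinity(points, sigma=1.0):
    """Pairwise Gaussian-kernel similarities: exp(-||p - q||^2 / (2 * sigma^2))."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        [math.exp(-squared_distance(p, q) / (2 * sigma ** 2)) for q in points]
        for p in points
    ]

# Hypothetical 2-D data points: the first two are close, the third is far away.
points = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
A = rbf_affinity(points)

for row in A:
    print(["%.4f" % value for value in row])
```

The diagonal is 1 (each point is maximally similar to itself), the matrix is symmetric, and nearby points receive a higher score than distant ones.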

## Part of Speech tagging

Part-of-Speech (POS) tagging, also called grammatical tagging, labels each word in a document with its word class, e.g. noun, verb or adjective.

## Distributional semantics

Distributional semantics is a theory about the semantics of a word. Essentially, it says that we can describe the meaning of a word by looking at the contexts (neighbouring words) in which it appears: "You shall know a word by the company it keeps" (J. R. Firth, 1957).
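A rough sketch of this idea is to count, for every word, which words occur within a fixed window around it. The function name `cooccurrence_counts`, the window size, the whitespace tokenisation and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` tokens of another word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

sentences = [
    "the cat drinks milk",
    "the dog drinks water",
]
counts = cooccurrence_counts(sentences)

# Words that share contexts are candidates for having related meanings.
shared = set(counts["cat"]) & set(counts["dog"])
print(sorted(shared))
```

Here "cat" and "dog" never occur together, yet they share the context words "the" and "drinks", which is the kind of evidence distributional methods build on.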

## Language Model

A Language Model is a probabilistic model that assigns a probability to a sequence of tokens, i.e. it estimates how likely the sequence is to occur in a language. The probabilities returned by a language model are mostly useful for comparing how likely different sentences are to be "good sentences".
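To make this concrete, here is a small sketch of a bigram language model with add-one smoothing, which is only one of many possible model types. The training sentences, the function names and the `<s>` / `</s>` boundary markers are assumptions chosen for illustration.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Collect context and bigram counts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])             # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams, len(vocab)

def sentence_log_prob(sentence, contexts, bigrams, vocab_size):
    """Log-probability of a sentence under add-one (Laplace) smoothing."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(prob)
    return log_prob

corpus = ["the cat sat", "the dog sat", "the cat ran"]
model = train_bigram_model(corpus)

good = sentence_log_prob("the cat sat", *model)
bad = sentence_log_prob("sat the cat", *model)
print(good > bad)
```

The individual probabilities are tiny and not very meaningful on their own; what matters is that the well-formed word order scores higher than the scrambled one.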

## Cosine Similarity vs L1/L2 Distances

The L1 and L2 distance functions quantify how far "we must travel" to get from one point to another. Another approach is to examine the angle between two vectors. When we map words or phrases to vectors of real numbers, we get high-dimensional dense vectors, and cosine similarity often works better than L1/L2 distance for judging whether two words are similar. Cosine similarity has the advantage of being norm-invariant, i.e. it depends only on the direction of the vectors and not on their lengths:

$$
\text{cosine}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}\cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert}
$$
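The norm invariance can be checked with a short sketch. The vectors below are made up for illustration: `v` points in the same direction as `u` but is twice as long, while `w` has the same length as `u` but a different direction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the norm
w = [3.0, 2.0, 1.0]  # same norm as u, different direction

print("L2:     u-v = %.3f, u-w = %.3f" % (l2_distance(u, v), l2_distance(u, w)))
print("cosine: u-v = %.3f, u-w = %.3f" % (cosine_similarity(u, v), cosine_similarity(u, w)))
```

By L2 distance `w` is closer to `u` than `v` is, whereas cosine similarity treats `u` and `v` as pointing in exactly the same direction and ranks `w` as less similar.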

## Downstream Task

A downstream task is a supervised learning task that utilises a pre-trained model or component. Examples of downstream tasks include machine translation, syntactic parsing, text classification and machine comprehension.

## TF*IDF

Short for Term Frequency – Inverse Document Frequency. It is the product of two statistics: the term frequency (TF) and the inverse document frequency (IDF). There are different ways to compute both statistics.

The Term Frequency is the frequency of a word in a document, i.e. the number of times a word $w$ appears in a document $d$.

The Inverse Document Frequency (IDF) is a measure of how significant a word is or how much information the word provides in the whole corpus.





$$
W_{w,d} = \text{tf}(w,d) \log\left[ \frac{N}{D_w}  \right]
$$
where
- $W_{w,d}$ is the weight of the word $w$ in document $d$
- $\text{tf}(w,d)$ is the word/term frequency, i.e. the number of times the word $w$ appears in document $d$
- $N$ is the total number of documents in the corpus
- $D_w$ is the number of documents containing the word $w$


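The weight indicates how characteristic a word is of a particular document. $W_{w,d}$ is high when the word occurs often in document $d$ but in few other documents of the corpus, and it is low (zero if the word appears in every document) for terms that are common across the corpus.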
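The formula above can be sketched in a few lines. This is an illustration under stated assumptions: raw counts for the term frequency, the natural logarithm (the formula does not fix a base), naive whitespace tokenisation, and a made-up three-document corpus; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = tf * log(N / D_w)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)  # N

    # D_w: number of documents containing each word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append(
            {word: count * math.log(n_docs / doc_freq[word]) for word, count in tf.items()}
        )
    return weights

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(documents)

for word, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print("%-4s %.3f" % (word, weight))
```

In the first document "mat" gets the highest weight because it occurs in no other document, while "the" gets a weight of zero because it occurs in all three.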

## Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic modelling method. Topic modelling is the process of identifying the abstract topics that best describe a collection of documents.

Termite: Visualization Techniques for Assessing Textual Topic Models: http://vis.stanford.edu/papers/termite

<img src="figures/termite.png" />



## Retraining Word Vectors

Pretrained word vectors can be trained further as part of training the NLP system. However, retraining should only be considered when the training dataset is large enough to cover most words in the vocabulary. For small datasets, retraining word vectors will likely worsen performance, because only the words seen during training are updated and they drift away from the unseen words whose vectors stay unchanged.