# Gaia NLP

## Summary

Gaia NLP is a paragraph embedding on a node graph that provides a summary and answers the question "How similar is A to B?", where A and B are thoughts identified in a single corpus. It uses [fuzzy concepts](https://en.wikipedia.org/wiki/Fuzzy_concept) to build a [latent space](https://en.wikipedia.org/wiki/Latent_space) of **thoughts**, which are, technically, paragraph vectors with some secret sauce. The website [knowing-gaia.net](https://knowing-gaia.net/) is used as the example corpus because its repetitive narrative structure, with labeled topics and lenses, is well suited to testing clustering and classification models.
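
One common way to answer "How similar is A to B?" for embedded vectors is cosine similarity. The sketch below assumes that metric, which this README does not itself specify, and uses made-up three-dimensional vectors (`thought_a`, `thought_b`, `thought_c` are hypothetical) in place of real thought vectors.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical thought vectors; real ones would come from the trained model.
thought_a = [0.9, 0.1, 0.3]
thought_b = [0.8, 0.2, 0.4]
thought_c = [-0.7, 0.6, -0.2]

sim_ab = cosine_similarity(thought_a, thought_b)
sim_ac = cosine_similarity(thought_a, thought_c)
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

Values near 1 mean the two thoughts point in nearly the same direction in the latent space; values near 0 or below mean they are unrelated or opposed.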

## Document structure
1. Intro
2. The 10 prompts
  - 
3. 

# Latent Semantic Analysis

## Background


Essentially, gaia-nlp is a mashup of LDA (latent Dirichlet allocation) and UMAP, with the constraint that model inputs are always two sentences long and labeled with parts of speech. Thus, correlations in the latent space are between complete thoughts rather than short phrases (n-grams) or long documents (which overfeed doc2vec).

## Corpus structure
<img src="../assets/clippings_page001.svg" type="image/svg+xml" width="200" >

Gaia NLP augments Mikolov's word + document scheme with SpaCy's token annotations while also insisting on inputs of exactly two sentences. This is to establish an arbitrary, but human-sized, sweet spot between conceptual fidelity and inferential generalization.

This annotated two-sentence sample is called a **thought**, and its [embedding](https://en.wikipedia.org/wiki/Embedding) is called the **thought manifold**. The shape and resolution of the [manifold](https://en.wikipedia.org/wiki/Manifold) are influenced by three hyperparameters: shape ($\omega$), resolution ($\sigma$), and scale ($s$).

## Adding structure with SpaCy Doc

### Structurally, a Gaia Thought is a list of SpaCy tokens spanning two sentences.

In general, doc2vec and dimensionality-reduction methods over a latent space can only handle short 'documents'. It's important to remember that doc2vec augments the original word2vec scheme: the intent is to give the word embeddings more oomph, not to map and recall the gist of an entire 'document'.

Said another way, in the doc2vec genre of NLP, a **document is closer to a sentence than an article.** 

A short paragraph? Yes, that's an NLP document. A long sentence? Definitely. Two long sentences? Okay. Four long sentences? Pushing it. Two paragraphs? No way! That would be two documents.

## Gaia-nlp maps annotated paragraphs into a single latent space of individual words and thoughts.

Thoughts are subject to the following constraints:

- Exactly 2 sentences
- Overlapping sentence input
  - first sentence at $t_1$ will be second sentence at $t_2$
- Annotated with part-of-speech (POS) and named-entity (NER) labels
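
The two-sentence, overlapping constraint can be sketched as a sliding window over a list of sentences. This is a minimal illustration, not the project's implementation: the function name `overlapping_thoughts` and the sample sentences are hypothetical, and the sentences are plain strings here, whereas in practice they would come from a SpaCy `Doc` (for example via `doc.sents`) with POS and NER annotations attached. It also assumes the window slides forward through the text, so the sentence shared by two consecutive thoughts closes one and opens the next; reverse the input list if the opposite convention is intended.

```python
from typing import List, Tuple

def overlapping_thoughts(sentences: List[str]) -> List[Tuple[str, str]]:
    """Pair each sentence with its successor, so consecutive thoughts share one sentence."""
    return [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]

# Hypothetical input; real sentences would come from the corpus.
sentences = [
    "The river rose overnight.",
    "The bridge was closed by morning.",
    "Commuters took the ferry instead.",
]

thoughts = overlapping_thoughts(sentences)
for t, (first, second) in enumerate(thoughts, start=1):
    print(f"t_{t}: {first} {second}")
```

A corpus of $n$ sentences yields $n - 1$ thoughts under this scheme, and every interior sentence appears in exactly two of them.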

A thought can do one of the following:
1. equate a sentiment with a statement ([FuzzySet](https://en.wikipedia.org/wiki/Fuzzy_set))
1. describe a relation between two nouns ([FinSet](https://en.wikipedia.org/wiki/FinSet))
1. describe a relation between two verbs ([functor](https://en.wikipedia.org/wiki/Functor))
1. describe a relation between a verb and a noun ([simplicial set](https://en.wikipedia.org/wiki/Simplicial_set))

You may think of a thought vector as a short paragraph vector with syntax annotations that help identify the thought as either an attribute of fuzzy sentiment or a declaration of correlation with some probability of causation.

### Thoughts can be related to topics
As a corpus grows, so too can the document (chunk) size. But as the chunk size grows, more data is needed either to 'fill out' a latent space spanning more concepts or to 'fill in' nuances about a limited number of topics. In the second case, we are better off using LDA for topic classification. Let's take a look at LDA of the data now.

### A Thought is two sentences.

Let's say a thought is composed of two sentences that you can speak in 4 seconds. So, rhythmically, at 120bpm (average walking tempo) one can express one thought in two measures (bars). This means a rhyme over 8 bars would be four complete thoughts. Let's call that a [concept](https://en.wikipedia.org/wiki/Concept).
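
The arithmetic behind these numbers can be checked in a few lines. The sketch assumes 4/4 time (four beats per measure), which the text implies but does not state; the constant names are illustrative.

```python
BPM = 120                 # tempo from the text
BEATS_PER_MEASURE = 4     # assumption: 4/4 time, not stated in the text
MEASURES_PER_THOUGHT = 2  # from the text

seconds_per_beat = 60 / BPM
seconds_per_measure = seconds_per_beat * BEATS_PER_MEASURE
seconds_per_thought = seconds_per_measure * MEASURES_PER_THOUGHT
thoughts_in_8_bars = 8 // MEASURES_PER_THOUGHT

print(seconds_per_thought)  # 4.0
print(thoughts_in_8_bars)   # 4
```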

## A Section is a semi-ordered collection of Concepts.

Because we define a thought with two sentences, we assume it is rich enough to stand on 
its own and that the order of **thoughts** in a **concept** does not always matter and the 
order of concepts in a section matters even less. That is to say, the coarser the grain, the less order matters. This tracks with intuition: statements of causality naturally become ambiguous as the set of considerations and concepts expands.

A [Simplicial set](https://en.wikipedia.org/wiki/Simplicial_set) is an appropriate structure to represent a section.

### Encoding causality

To encode causal inference, a partial order must be honored. 
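
As a minimal sketch of what "honoring a partial order" could mean in practice, the function below checks whether a sequence of thoughts respects a set of cause-before-effect pairs while leaving unconstrained thoughts free to appear anywhere. The function name `honors_partial_order` and the thought labels are hypothetical, chosen only for illustration.

```python
from typing import Hashable, Iterable, Sequence, Tuple

def honors_partial_order(
    sequence: Sequence[Hashable],
    must_precede: Iterable[Tuple[Hashable, Hashable]],
) -> bool:
    """Return True if every (cause, effect) pair appears in that order in the sequence."""
    position = {item: i for i, item in enumerate(sequence)}
    return all(
        position[cause] < position[effect]
        for cause, effect in must_precede
        # Pairs that mention a thought absent from the sequence impose no constraint.
        if cause in position and effect in position
    )

# Hypothetical causal constraints between labeled thoughts.
must_precede = {("rain", "flood"), ("flood", "evacuation")}

ok_order = ["rain", "ferry", "flood", "evacuation"]   # "ferry" is unconstrained
bad_order = ["flood", "rain", "evacuation", "ferry"]  # effect placed before its cause

print(honors_partial_order(ok_order, must_precede))
print(honors_partial_order(bad_order, must_precede))
```

Only the constrained pairs are ordered; everything else may be shuffled, which matches the idea that order matters less as the grain gets coarser.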

[The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/) is a good summary of the famous paper "Attention Is All You Need", which describes temporal (positional) encoding. Research on the value of retaining sequence information is warranted.
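
For reference, the sinusoidal positional encoding defined in "Attention Is All You Need" is shown below, where $pos$ is the token position, $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding width. Whether gaia-nlp would adopt this particular scheme is an open question, not something this README establishes.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$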

### The syntax of a paragraph.

Quoc Le and Tomas Mikolov's [Distributed Representations of Sentences and Documents](https://github.com/study-groups/nlp-study-group#papers) is considered 'the' doc2vec paper. They use an aggregation abstraction called a Paragraph Vector. We will do the same. We use SpaCy's Doc structure to encode paragraphs as a sequence of tokens. The encoding scheme is determined by the model.

### Coarse Grain / Fine Grain

Similar to choosing bin sizes for a histogram of stationary data, one must decide how much detail the embedded space should capture. And, as with bin size, more resolution requires more training data.  
Unlike choosing a bin size over the domain of a single random variable, the resolution of an embedded space is multidimensional, with each dimension having its own arbitrary resolution. Furthermore, the number of dimensions of the space is left to the discretion of the model designer.
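
To make the histogram analogy concrete, the sketch below counts the cells of a regular grid when each dimension gets its own number of bins. Treating the embedded space as a grid is a simplifying assumption made for illustration, and `cell_count` is a hypothetical helper, but it shows why the demand for training data climbs so quickly with resolution and dimension.

```python
import math

def cell_count(bins_per_dimension):
    """Number of grid cells when dimension k is split into bins_per_dimension[k] bins."""
    return math.prod(bins_per_dimension)

print(cell_count([10]))         # 10
print(cell_count([10, 10]))     # 100
print(cell_count([10, 5, 20]))  # 1000
```

If every cell needs at least a few samples to be estimated, the required data grows multiplicatively with each added dimension.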

## Latent space consideration

A 45-page document that covers 8 topics with 10 probes covers more [latent space](https://en.wikipedia.org/wiki/Latent_space) when the topics are dissimilar (more bins).

One of the first things we'll do is look at the chapter headings for an intuition of similarities and differences. This is a subjective activity and requires the participant to instantiate sentiments from their own latent space (brain).

### The goal is understanding Entropy

In thermodynamics, the internal energy of a system is expressed in terms of pairs of conjugate variables such as temperature and entropy or pressure and volume.
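
For a simple closed system, the standard textbook form of this statement is the fundamental thermodynamic relation, in which each term pairs an intensive variable with the differential of its conjugate extensive variable:

$$
dU = T\,dS - P\,dV
$$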

This brings us to the goal of this notebook: to show that the size (temperature) of a paragraph is 
in conjugate relation with the descriptive power (entropy) of the model. Furthermore, unlike the physical sciences, sentiment analysis takes place in single latent spaces of people's min

## Sentiment Analysis

https://realpython.com/sentiment-analysis-python/