### Latent semantic analysis
Note that so far our assumptions have been limited. We developed a method for representing the text of Carmilla as a set of vectors, which collectively formed a vector-space. Using principal component analysis, we reduced that vector-space to three dimensions. In this truncated representation, we examined the graph and looked for meaning both in the points and in the dimensions. Methods in which order is found in the data through a mathematical process, and meaning is then ascribed post hoc, are described as "unsupervised learning".

The value of these unsupervised methods comes from this reversal. We are neither proving nor modeling a preconceived idea, but rather examining order already present in the data. When the quest is to determine "what we don't know", this shift in reference frame is a valuable attribute.

Latent semantic analysis (LSA) assumes that the vector-space is separable into two ordered matrices: one relating documents to topics and one relating topics to words. Originally used as a process for noise reduction in signals, the decomposition has an implication for a corpus that is simple but profound. Every document is constructed from topics, and every word in it is drawn from one of those topics. Put another way, the author chose topics to cover in each document, and each word came from those selected topics. In customer support, for example, a customer may make contact to "complain" about their "billing", and both terms are labels that someone in customer support might use to classify the contact.
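
To make the document-topic and topic-word split concrete, here is a minimal sketch using plain NumPy. The vocabulary and counts are made up for illustration (they are not taken from Carmilla), and the names `doc_topic` and `topic_term` are only labels chosen for this example.

```python
import numpy as np

# Hypothetical document-term counts: 4 documents (rows) x 5 terms (columns).
terms = ["billing", "invoice", "refund", "login", "password"]
A = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 2],
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest "topics".
k = 2
doc_topic = U[:, :k] * s[:k]   # documents x topics
topic_term = Vt[:k, :]         # topics x terms

# The product of the two matrices is a rank-k approximation of A.
approx = doc_topic @ topic_term
print(doc_topic.shape, topic_term.shape)
print(np.round(approx, 1))
```

Each row of `topic_term` weights the terms that make up one topic, and each row of `doc_topic` says how strongly a document loads on each topic. Folding the singular values into the document side is one convention among several.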

Unlike PCA, where the vector-space is centered and the covariance is calculated before the singular value decomposition (SVD) is computed, in LSA we calculate the SVD directly from the word vectors. This means that the original sense of the dimensions is retained and the words that make up the "topics" (one of the decomposed matrices) can be examined directly.
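
The difference between the two routes can be sketched with NumPy alone. The data below is randomly generated for illustration, and the variable names are hypothetical. The sketch shows only that the PCA-style route centers the columns first (which is equivalent to decomposing the covariance matrix), while the LSA-style route decomposes the raw counts.

```python
import numpy as np

# Hypothetical toy data: 6 "documents" x 4 "terms" of non-negative counts.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(6, 4)).astype(float)

# LSA-style: SVD applied directly to the raw counts.
_, s_raw, Vt_raw = np.linalg.svd(counts, full_matrices=False)

# PCA-style: subtract each column's mean first, then take the SVD.
centered = counts - counts.mean(axis=0)
_, s_cen, Vt_cen = np.linalg.svd(centered, full_matrices=False)

# The centered SVD matches an eigen-decomposition of the covariance matrix:
# its eigenvalues equal the squared singular values divided by (n - 1).
n_docs = counts.shape[0]
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(counts, rowvar=False)))[::-1]
print(np.allclose(cov_eigvals, s_cen ** 2 / (n_docs - 1)))

print("raw singular values:     ", np.round(s_raw, 3))
print("centered singular values:", np.round(s_cen, 3))
```

Skipping the centering step also keeps a sparse bag-of-words matrix sparse, which is one practical reason `TruncatedSVD` is commonly used for text.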

Rather than view the LSA output here, let's stay with scikit-learn and review another unsupervised method, k-means clustering. Afterwards, we will switch over to the gensim library to go deeper into LSA and then on to Latent Dirichlet Allocation (LDA). I expect that LSA and LDA will form a significant component of the final "Arion" package.

In [1]:
# import useful stuff
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition

In [2]:
with open('carmilla.txt', 'r') as f:
    corpus = f.read()
    
from utils.cheaters import dctConstr

dct = dctConstr(stop_words=["i", "you", "a"], ignore_case=True)
dct.constructor(corpus)
# dct.trimmer(top=5, bottom=10)
# dct.build_tfidf(corpus)

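def split_by_paragraphs(data: str) -> list:
    """Lowercase the text and split it into paragraphs on blank lines."""
    processed = data.lower()
    # collapse runs of blank lines so every paragraph break is exactly '\n\n'
    while '\n\n\n' in processed:
        processed = processed.replace('\n\n\n', '\n\n')
    out = processed.split('\n\n')
    # join the wrapped lines inside each paragraph into a single line
    return [o.replace("\n", " ") for o in out]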

pcorp = split_by_paragraphs(corpus)
pbow = [dct(para) for para in pcorp]
ptfidf = [dct.tfidf(para) for para in pcorp]
pvec = [dct.bow_to_vec(p) for p in pbow]

idx_to_terms = {i:j for j, i in dct.terms.items()} # flip dictionary for reversal
print(len(pvec[0]))

3988


In [3]:
lsa = decomposition.TruncatedSVD(n_components=100, n_iter=10, random_state=42)
lsa.fit(pvec)
X = lsa.transform(pvec)

At this point, we have taken the corpus vector-space and calculated the common "directions" within that space, expressed as relationships between documents and topics and between topics and words. When we transform the original corpus, we represent each document in terms of its weight along each topic.

Note that the SVD places the orthonormal vectors in order from most explained content to least explained. So by taking the original 3988 word dimensions and keeping only the top 100, we are reducing the "noise" in the signal. In this reduced space, we can once again compare documents. This time, however, synonyms and other related words tend to fall along the same axes, bringing similar documents closer together.
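
The effect on document comparison can be sketched on a toy corpus. The vocabulary and counts below are invented for illustration, and `cosine` is a small helper written for this example. Documents 3 and 4 share no words, but each uses words that co-occur with the other's words elsewhere in the corpus.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical vocabulary and bag-of-words counts, made up for illustration.
terms = ["car", "automobile", "engine", "wheel", "flower", "petal"]
docs = np.array([
    [1, 0, 1, 1, 0, 0],  # car engine wheel
    [0, 1, 1, 1, 0, 0],  # automobile engine wheel
    [0, 0, 0, 0, 1, 1],  # flower petal
    [1, 0, 1, 0, 0, 0],  # car engine
    [0, 1, 0, 1, 0, 0],  # automobile wheel
    [0, 0, 0, 0, 1, 1],  # flower petal
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Keep the two strongest directions, as the notebook does with 100.
svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
reduced = svd.fit_transform(docs)

# Singular values come back in descending order.
print("singular values:", np.round(svd.singular_values_, 3))
print("raw cosine(doc 3, doc 4):    ", round(cosine(docs[3], docs[4]), 3))
print("reduced cosine(doc 3, doc 4):", round(cosine(reduced[3], reduced[4]), 3))
print("reduced cosine(doc 3, doc 2):", round(cosine(reduced[3], reduced[2]), 3))
```

In the raw space the "car engine" and "automobile wheel" documents are orthogonal. In the reduced space they land on the same axis, while the "flower petal" document stays apart.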

### K-means clustering
The [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html) provides a good introduction. It also discusses the expectation-maximization algorithm, which is key to solving many ML problems.
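
As a rough illustration of the expectation-maximization idea behind k-means, here is a minimal NumPy sketch. It is not scikit-learn's implementation: `kmeans_sketch` is a hypothetical helper, the initial centers are picked at random rather than with k-means++, and the data is two synthetic blobs.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Alternate assignment (E-step) and centroid update (M-step)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so the labels match the returned centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = kmeans_sketch(points, k=2)
print(labels)
print(np.round(centers, 2))
```

The result depends on the starting centers, which is why `KMeans` offers `init='k-means++'` and the `n_init` parameter for repeated restarts.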

In [4]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

KMeans(max_iter=100, n_clusters=4, n_init=1)

In [6]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
print(np.shape(order_centroids))

cluster1 = order_centroids[0,10:30]
print([idx_to_terms[i] for i in cluster1])

cluster2 = order_centroids[1,10:30]
print([idx_to_terms[i] for i in cluster2])

(4, 100)
['narrative', 'hesselius', 'illuminates', 'mysterious', 'and', 'it', 'learning', 'publish', 'relates', 'forestall', 'condensation', 'lady', 'précis', 'will', 'elaborate', 'paper', 'in', 'treats', 'papers', 'learned']
['narrative', 'to', 'written', 'with', 'after', 'remarkable', 'subject', 'the', 'prologue', 'elaborate', 'upon', 'ms', 'usual', 'arcana', 'interest', 'paper', 'accompanies', 'nothing', 'on', 'but']


What have these in common?

### Exercise 04
- Printing the terms for cluster 3 will result in an error; fix it
- Cluster the LSA representation using 7 centers
- Plot the first three dimensions of the LSA for all documents
- Color the points according to the clusters