# Dendrogram Tutorial

This notebook is a test showing how a dendrogram can be constructed from three novels. You can substitute any data you like.

## Import Lexos Modules

In [None]:
import re
from lexos.io.smart import Loader
from lexos import tokenizer
from lexos.dtm import DTM

## Load Data

We will load _Pride and Prejudice_ and _Sense and Sensibility_ from the repository's test data, but we'll also download _Frankenstein_ from Project Gutenberg.

For quick run times, we'll take only the first 10000 characters from each novel. We're going to run a quick function called `clean_text()` to remove unwanted line breaks and spaces. If you are using your own data, you may want to do some preprocessing with `Scrubber`.

In [None]:
data = [
    "../test_data/txt/Austen_Pride.txt",
    "../test_data/txt/Austen_Sense.txt",
    "https://www.gutenberg.org/files/84/84-0.txt"
]

# Create the loader and load the data
loader = Loader()
loader.load(data)

# Shorten the texts
texts = [text[0:10000] for text in loader.texts]

# We'll do a little cleanup to get rid of line breaks
def clean_text(text: str) -> str:
    """Convert line breaks to spaces and strip leading and trailing whitespace."""
    # A character class needs no "|"; the original pattern also matched literal pipes.
    return re.sub(r"[\r\n]+", " ", text).strip()

texts = [clean_text(text) for text in texts]


## Make spaCy Docs

Processing full novels can take a while, although the 10,000-character excerpts used here should run quickly. If you are using a language model, it is recommended that you disable any pipeline components you are not using. The default multilanguage model should work fairly quickly.

In [None]:
labels = ["Pride_and_Prejudice", "Sense_and_Sensibility", "Frankenstein"]
docs = tokenizer.make_docs(texts, labels)


## Generate the Document-Term Matrix and Show the DTM Table

In [None]:
dtm = DTM(docs, labels)
df = dtm.get_table()
df.head(20)

### Yikes!

There are a lot of spaces, punctuation marks, and digits in our table. Do we _really_ want those in our analysis? Maybe we should filter them out and then re-generate the DTM. 

In [None]:
filtered_texts = []
for doc in docs:
    filtered_tokens = [
        token.text for token in doc
        if not token.is_space
        and not token.is_punct
        and not token.is_digit
    ]
    filtered_texts.append(" ".join(filtered_tokens))

docs = tokenizer.make_docs(filtered_texts)

You can experiment with the cell above to try different filters. For instance, try removing stop words with `token.is_stop`. You can also convert to lower case by changing `token.text` to `token.lower_` (`token.norm_` usually gives the same result but may apply additional normalization).

In the next cell, you'll rebuild the DTM based on your new spaCy docs. If you're satisfied, you can move on to generate your dendrogram.

In [None]:
dtm = DTM(docs, labels)
df = dtm.get_table()
df.head(20)

## Make a Dendrogram

A dendrogram is a graph based on a form of cluster analysis called hierarchical agglomerative clustering. This technique measures the term vectors for each document and determines the "distance" between them. Based on this distance, the algorithm assigns documents to clusters. In the dendrogram, the "leaves" (document labels) that are most directly connected by "branches" have the closest vectors. The height of branches indicates how close different clusters of connected leaves are to other clusters.
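
The merging process described above can be sketched in plain Python. This is a minimal illustration, not the Lexos or SciPy implementation: the three term-count vectors and the helper names `average_distance` and `agglomerate` are hypothetical, and the sketch assumes Euclidean distance with average linkage.

```python
from itertools import combinations
from math import dist

# Hypothetical term-count vectors: one entry per document, one number per term.
vectors = {
    "doc_a": [4.0, 0.0, 1.0],
    "doc_b": [3.0, 1.0, 1.0],
    "doc_c": [0.0, 5.0, 2.0],
}


def average_distance(cluster_a, cluster_b):
    """Mean Euclidean distance over all cross-cluster document pairs."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(dist(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def agglomerate(labels):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(label,) for label in labels]  # every document starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = agglomerate(list(vectors))
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

Each recorded merge corresponds to one branch junction in a dendrogram, and the recorded distance corresponds to the height at which the branches join.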

There are a number of ways to determine how to measure the distance between documents and how to assign documents to clusters (known as "linkage"). In this example, we will use the default settings of Euclidean distance and average linkage to generate a dendrogram.

We start by importing the `Dendrogram` class from the Lexos `cluster` module.

In [None]:
from lexos.cluster.dendrogram import Dendrogram

We are now ready to build the dendrogram by passing it our DTM. The `show=True` flag displays the dendrogram in the notebook. For some applications, you may wish to suppress the display and do something else with the dendrogram object, which we assign here to the `dendrogram` variable.

In [None]:
dendrogram = Dendrogram(dtm, show=True)

If you have followed the default settings, _Pride and Prejudice_ groups with _Frankenstein_, rather than _Sense and Sensibility_. This is surprising since _Pride and Prejudice_ and _Sense and Sensibility_ are by Jane Austen, whereas _Frankenstein_ is by Mary Shelley. We need to decide if this is a meaningful result or if it has something to do with our data. We may want to go back and reconsider how we have preprocessed the data. Have we done anything that might have influenced the result?

Even if we are satisfied with our data, we may also have influenced the result by our choice of distance metric and linkage method. It is a good idea to try other options to see how robust the result is. Valid <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist" target="_blank">distance metrics</a> and <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage" target="_blank">linkage methods</a>, along with technical explanations, can be found in the SciPy documentation. However, the following guide may be useful.

One of the most important (and least well-documented) aspects of the hierarchical clustering method is the distance metric. Since we are representing texts as document vectors, it makes sense to define document similarity by comparing the vectors. One way to do this is to measure the distance between each pair of vectors. For example, if two vectors are drawn as lines from a common origin, the length of the straight line connecting their end points can be used as a measure of the distance between the two documents. This method of measuring how far apart two documents are is known as Euclidean distance, which is the default distance metric used in Lexos. The table below provides some suggestions for distance metrics to try with different types of data.
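
To make the idea of distance between vectors concrete, here is a small sketch in plain Python. The two term-count vectors and the function names are hypothetical and chosen only for illustration. The example contrasts Euclidean distance with cosine distance for two documents that use the same terms in the same proportions but differ in length.

```python
from math import dist, sqrt


def euclidean_distance(u, v):
    """Straight-line distance between the end points of two vectors."""
    return dist(u, v)


def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical counts for two terms in a short and a long document.
short_doc = [3.0, 4.0]
long_doc = [30.0, 40.0]  # same proportions, ten times as many tokens

euclid = euclidean_distance(short_doc, long_doc)
cosine = cosine_distance(short_doc, long_doc)
print("Euclidean:", euclid)
print("Cosine:", round(cosine, 6))
```

Euclidean distance treats the two documents as far apart because their raw counts differ, while cosine distance treats them as essentially identical because it compares only the direction of the vectors. This is one reason results can change when you switch metrics.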


| Vocabulary size | Small number of terms per document | Large number of terms per document |
|-----------------|------------------------------------|------------------------------------|
| Small | `Bray-Curtis`, `Hamming` | `Chebyshev`, `Euclidean`, `Standardized Euclidean` |
| Large | `Correlation`, `Jaccard`, `Squared Euclidean` | `Canberra`, `Cosine`, `Manhattan` |
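
The table above uses display names, whereas SciPy's `pdist` function expects lowercase keyword strings. The lookup below is a convenience sketch: the dictionary `SCIPY_METRIC_NAMES` and the helper `metric_keyword` are hypothetical names, and it assumes that Lexos passes the `metric` argument through to SciPy unchanged, which you should confirm against the Lexos documentation.

```python
# Display name (as in the table above) -> `metric` string accepted by
# scipy.spatial.distance.pdist.
SCIPY_METRIC_NAMES = {
    "Bray-Curtis": "braycurtis",
    "Canberra": "canberra",
    "Chebyshev": "chebyshev",
    "Correlation": "correlation",
    "Cosine": "cosine",
    "Euclidean": "euclidean",
    "Hamming": "hamming",
    "Jaccard": "jaccard",
    "Manhattan": "cityblock",
    "Squared Euclidean": "sqeuclidean",
    "Standardized Euclidean": "seuclidean",
}


def metric_keyword(display_name: str) -> str:
    """Return the SciPy keyword for a metric named in the table."""
    return SCIPY_METRIC_NAMES[display_name]


print(metric_keyword("Manhattan"))
```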

At each stage of the clustering process, a choice must be made about whether two clusters should be joined (a single document itself forms a cluster at the lowest level of the hierarchy). You may choose any of the linkage methods listed below:

- `average`: Average linkage is a compromise between single and complete linkage. It takes the average distance of all the points in each cluster and uses the shortest average distance for deciding which cluster should be joined to the current one. This is the default linkage method in Lexos.
- `single`: Single linkage joins the cluster containing a point (e.g. a term frequency) closest to the current cluster. Single linkage joins clusters based on only a single point and does not take into account the rest of the points in the cluster. The resulting dendrograms tend to have spread out clusters. This process is called "chaining".
- `complete`: Complete linkage uses the opposite approach to single linkage. It takes the two points furthest apart between the current cluster and the others. The cluster with the shortest distance to the current cluster is joined to it. Complete linkage thus takes into account all the points on the vector that come before the one with the maximum distance. It tends to produce compact, evenly distributed clusters in the resulting dendrograms.
- `weighted`: Weighted average linkage (WPGMA in the SciPy documentation) is a variant of average linkage. When two clusters merge, the distance from the new cluster to any other cluster is the simple mean of the two merged clusters' distances, so each merged cluster counts equally regardless of how many documents it contains. It may, therefore, be worth trying when clusters differ greatly in size.
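
The difference between the first three linkage methods can be sketched with two small clusters. The points and function names below are hypothetical, and the sketch only computes the distance between one pair of clusters under each rule. It does not perform the full clustering.

```python
from math import dist

# Hypothetical 2-D points standing in for document vectors.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]


def pairwise(cluster_x, cluster_y):
    """Euclidean distances between every cross-cluster pair of points."""
    return [dist(p, q) for p in cluster_x for q in cluster_y]


def single_linkage(cluster_x, cluster_y):
    """Distance between the two closest points."""
    return min(pairwise(cluster_x, cluster_y))


def complete_linkage(cluster_x, cluster_y):
    """Distance between the two points furthest apart."""
    return max(pairwise(cluster_x, cluster_y))


def average_linkage(cluster_x, cluster_y):
    """Mean distance over all cross-cluster pairs."""
    distances = pairwise(cluster_x, cluster_y)
    return sum(distances) / len(distances)


single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(round(single, 3), round(average, 3), round(complete, 3))
```

The average value always falls between the single and complete values, which is why average linkage is described as a compromise between the two.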

In [None]:
dendrogram = Dendrogram(dtm, show=True, metric="cosine", method="average")

You can change the orientation of the dendrogram or the angle of the labels, which is especially useful if you have a dendrogram with a lot of leaves.

In [None]:
dendrogram = Dendrogram(dtm, show=True, orientation="left", leaf_rotation=-45)

You can also save the dendrogram to a file. In the cell below, replace "dendrogram.png" with a filepath of your choice. If you change the file extension to ".jpg" or ".pdf", the file will be saved in that format.

In [None]:
dendrogram.savefig("dendrogram.png")