# Lexos Dendrogram Tutorial

A **dendrogram** is a visual representation of a hierarchical cluster analysis, showing how your documents are joined into clusters based on their similarity.

A dendrogram can allow you to make observations about:

- Stylistic Similarities: Are authors from the same period or literary movement clustering together?
- Genre Influence: Do texts of a particular genre (e.g., Gothic horror) form a distinct cluster?
- Thematic Patterns: If you used POS filtering or focused on specific terms, does the tree reveal groupings based on shared themes or subject matter?

We'll begin by importing some data and creating a DTM.

In [None]:
from lexos.dtm import DTM
from lexos.io.loader import Loader
from lexos.tokenizer import Tokenizer

# Load some text files and set their names
files = [
    "FilesToUse/Poe_FallOfHouseUsher_1839.txt",
    "FilesToUse/Lippard_BelOfPrairieEden.txt",
    "FilesToUse/Irving_RipVanWInkle.txt",
    "FilesToUse/HenryWP_ThePirate.txt",
]
loader = Loader()
loader.load(files)
loader.names = ["Poe", "Lippard", "Irving", "Henry"]

# Tokenize the loaded documents
tokenizer = Tokenizer()
docs = list(tokenizer.make_docs(texts=loader.texts))
labels = loader.names

print(f"Loaded {len(docs)} documents with labels: {labels}")

# Create a Document-Term Matrix (DTM)
dtm = DTM()
dtm(docs=docs, labels=labels)

print(f"DTM created with {dtm.to_df().shape[1]} documents and {dtm.to_df().shape[0]} unique terms.")

### Generating the Dendrogram 

Now for the exciting part: generating and displaying your Dendrogram! This tree visually represents the relationships between your documents based on the numerical data in our DTM.
 
When we create the `Dendrogram`, we need to tell it how to measure document similarity and how to connect those similarities into a tree. Here are the key parameters you can adjust:
 
- `dtm`: This is our "linguistic spreadsheet" (`dtm`) that we created in the previous step. It's the essential input for the tree.
- `metric`: This tells the dendrogram how to measure the "distance" or dissimilarity between your documents. Shorter distances mean more similar documents.
    - `"euclidean"` (default): Think of this as the "straight-line" distance between two points on a graph. It's good for general comparisons but can be sensitive to the overall length of documents (longer documents might naturally have larger term counts, increasing their "distance").
    - `"cosine"`: Imagine each document as an arrow pointing in a specific linguistic "direction." Cosine similarity measures how much these arrows point in the same direction. If they point almost identically, the documents are very similar, even if one document is much longer than another. This is often an excellent choice for text analysis as it focuses on stylistic or thematic *direction* rather than raw word counts.
    - `"cityblock"` (also called Manhattan distance): Imagine moving on a city grid where you can only go along streets (no diagonal shortcuts). This distance is the sum of the absolute differences for each term between two documents. Useful when the individual differences in term counts are important.
    - Many other metrics are available (e.g., `"jaccard"`, `"chebyshev"`). You can find a full list in the SciPy documentation for `scipy.spatial.distance.pdist`.

- `method`: Once we've measured distances, this method determines how individual documents (or existing clusters of documents) are joined together to form larger branches and clusters in the tree.
    - `"average"` (default): When combining two clusters, this method considers the average distance between *all* pairs of documents in the two clusters. It tends to produce well-balanced clusters.
    - `"single"`: Joins clusters based on the *closest* pair of documents between them. This can sometimes lead to "chaining," where documents connect one after another, forming long, straggly branches.
    - `"complete"`: Joins clusters based on the *farthest* pair of documents between them. This tends to produce more compact, spherical clusters, ensuring all documents within a cluster are relatively similar to each other.
    - `"ward"`: This method aims to minimize the increase in "variance" (or spread) within clusters when they are merged. It tries to make clusters that are as "tight" and internally similar as possible. It often produces intuitive and well-structured clusters, but note that SciPy documents Ward linkage as correctly defined only for Euclidean distances.
    - Many other methods are available. You can find a full list in the SciPy documentation for `scipy.cluster.hierarchy.linkage`.
- `labels`: This is simply the list of descriptive names for your documents (e.g., "Poe", "Lippard") that we defined earlier. These will appear as the leaves (endpoints) on your tree.
- `orientation`: Controls the direction of the dendrogram.
    - `"top"` (default): Branches extend downwards from the top.
    - `"bottom"`: Branches extend upwards from the bottom.
    - `"left"`: Branches extend rightwards from the left.
    - `"right"`: Branches extend leftwards from the right.
- `color_threshold`: If set, branches with a distance below this threshold will be colored differently from those above it. This helps visualize clusters at a certain distance level. You can try a number like `1.0` or `1.5` to see its effect.
- `show`: Controls whether the generated tree figure is displayed directly in this Jupyter Notebook cell.
    - `True`: The tree will appear right below the code cell.
    - `False` (default): The tree will not be shown immediately. This is useful if you just want to save the figure to a file without displaying it in the notebook. If you set `show=False`, remember to call `dendrogram.showfig()` later to display it.
- `title`: Adds a title to your dendrogram plot.
- `figsize`: A tuple `(width, height)` in inches to set the size of the overall figure. For example, `(12, 8)` for a wider and taller plot.
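
To make the difference between the `metric` options concrete, here is a minimal sketch in plain Python rather than through Lexos. The term counts are invented for illustration, and the three helper functions are hypothetical stand-ins that follow the standard definitions of Euclidean, city-block, and cosine distance (one minus cosine similarity); they are not part of the Lexos API.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    """Sum of absolute per-term differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical counts for three terms, e.g. ["the", "house", "ship"].
short_doc = [4, 2, 0]
long_doc = [8, 4, 0]   # same proportions as short_doc, twice the length
other_doc = [4, 0, 2]  # same length as short_doc, different vocabulary

for name, fn in [("euclidean", euclidean),
                 ("cityblock", cityblock),
                 ("cosine", cosine_distance)]:
    print(f"{name:>10}: short vs long = {fn(short_doc, long_doc):.3f}, "
          f"short vs other = {fn(short_doc, other_doc):.3f}")
```

With these made-up counts, Euclidean and city-block distance both place `short_doc` closer to `other_doc` than to `long_doc`, while cosine distance treats `short_doc` and `long_doc` as essentially identical because their term proportions match. This is the document-length sensitivity described in the `metric` entries above.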

Let's generate our first dendrogram! We'll start with common parameters, but feel free to come back and experiment with them.


In [None]:
# Import the Dendrogram class
from lexos.cluster import Dendrogram

# Create an instance of the Dendrogram object (feel free to change the parameters)
dendrogram = Dendrogram(
    dtm=dtm,
    labels=labels,
    metric="euclidean",
    method="average",
    orientation="top",
    # color_threshold=1.5, # Uncomment to color branches
    title="My First Dendrogram",
    figsize=(10, 8),
    show=True
)

# Show the dendrogram
dendrogram.show()

### Interpreting Your Dendrogram

Now that you've generated the tree, let's talk about how to read it!

The tree you see is a **dendrogram**, a type of tree diagram that shows hierarchical clustering. Each line is a **branch** on the tree and connects either to another branch or to a **leaf** (a terminal branch). Each leaf represents an individual document and is labelled with that document's name. Leaves connected to other leaves form clusters, or **clades**, and these may be joined to other clades, forming "super-clusters" at the next level in the tree's hierarchy. Branch length indicates the **distance** or **dissimilarity** between the documents in a clade and those that form part of the same "super-cluster". Shorter branches mean that the documents or clusters are more similar to each other based on their linguistic features. Longer branches mean that the clades are further apart. Branch length is measured along the distance axis, which is vertical when `orientation` is "top" or "bottom" and horizontal when it is "left" or "right".

The following procedure is useful in reading the tree.

1. Start from the leaves (your document names) and move up the hierarchy towards the root of the tree. Look for the first merges.
2. As you move further up the hierarchy, you'll see larger branches forming, grouping together sets of documents or smaller clusters. These broader groupings represent larger patterns of similarity.

For example, in the dendrogram above, you might observe that "Henry" and "Irving" are clustered together with a relatively short branch length, suggesting a linguistic similarity. Then, perhaps "Poe" joins that cluster, indicating it's somewhat similar to the Henry/Irving group but less so than they are to each other. "Lippard" might be an outlier, joining the main cluster with a much longer branch, indicating it's quite distinct linguistically from the others.
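
The reading order described above can also be traced in code. The following is a minimal sketch of agglomerative clustering with average linkage in plain Python; it does not use Lexos, the pairwise distances are invented so that the result matches the hypothetical Henry/Irving/Poe/Lippard reading, and `average_linkage` is an illustrative helper rather than a library function.

```python
from itertools import combinations

# Hypothetical pairwise distances between the four documents (made up).
distances = {
    frozenset(["Henry", "Irving"]): 2.0,
    frozenset(["Henry", "Poe"]): 4.0,
    frozenset(["Irving", "Poe"]): 5.0,
    frozenset(["Henry", "Lippard"]): 9.0,
    frozenset(["Irving", "Lippard"]): 10.0,
    frozenset(["Poe", "Lippard"]): 8.0,
}


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs of documents."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(distances[frozenset(pair)] for pair in pairs) / len(pairs)


# Every document starts as its own cluster (a leaf of the tree).
clusters = [("Henry",), ("Irving",), ("Lippard",), ("Poe",)]
merges = []

while len(clusters) > 1:
    # Join the two clusters that are currently closest to each other.
    a, b = min(combinations(clusters, 2),
               key=lambda pair: average_linkage(*pair))
    merges.append((a, b, average_linkage(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]

for a, b, height in merges:
    print(f"{' + '.join(a)} joins {' + '.join(b)} at height {height:.2f}")
```

In this toy data the merges happen at heights 2.0, 4.5, and 9.0: Henry and Irving join first, Poe joins that pair next, and Lippard joins last at the greatest height, which is what a long branch to an outlier looks like in the plot.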


### Customizing Your Visualization
 
The Lexos `Dendrogram` class uses `matplotlib` to create the plot, which means you have a lot of control over its appearance. While Lexos sets sensible defaults, you can easily tweak things like font size, leaf rotation, and overall figure size for better readability or presentation.

#### Adjusting Font Size and Leaf Rotation

Sometimes, the labels on the tree can be small or overlap. You can adjust the font size of the leaf labels (your document names) and rotate them for better readability.

- `leaf_font_size`: Sets the font size for the document labels.
- `leaf_rotation`: Rotates the leaf labels. A value of `90` is common for vertical labels when the `orientation` is "top" or "bottom".


In [None]:
# You can generate a new dendrogram with adjusted settings
dendrogram = Dendrogram(
    dtm=dtm,
    labels=labels,
    metric="cosine",
    method="ward",
    orientation="top",
    leaf_font_size=10,    # Make labels slightly larger
    leaf_rotation=45,     # Rotate labels for better fit
    title="Dendrogram with Custom Leaf Settings",
    figsize=(12, 8),
    show=True
)

#### Adjusting Figure Size
 
If your tree has many documents or long labels, it might feel cramped. You can make the overall figure wider or taller to give it more space using the `figsize` parameter directly when you create the `Dendrogram` instance.


In [None]:
# Example: Make the figure 15 inches wide and 10 inches tall
dendrogram = Dendrogram(
    dtm=dtm,
    labels=labels,
    metric="euclidean",
    method="average",
    title="Dendrogram with Larger Figure Size",
    figsize=(15, 10), # Set width and height here
    show=True
)

Another approach to dealing with crowded dendrograms is to use Lexos' `PlotlyDendrogram` class, which uses the Python Plotly library to produce an interactive diagram with pan and zoom features.

### Saving Your Tree
 
Once you're happy with your dendrogram, you'll likely want to save it as an image file for reports, presentations, or simply for your records.

The `Dendrogram.save()` method allows you to do this easily. Just provide a file name (including the extension), and it will save the image. You can choose different file formats by changing the extension (e.g., `.png`, `.jpg`, `.pdf`, `.svg`). PNG is generally a good choice for web or documents, while SVG provides a scalable vector graphic useful for high-quality printing.


In [None]:
# Save the last generated dendrogram figure
# dendrogram.save("my_dendrogram_analysis_result.png")

### Generating Dendrograms with Plotly

The `Dendrogram` class uses Python's `matplotlib` library to produce static images. However, in very large dendrograms, there is a danger of leaf labels overlapping, making the plot unreadable. In this case, you can use the Plotly plotter, which provides the ability to pan and zoom around the dendrogram, making it more readable. The Plotly plotter is also ideal if you are including the dendrogram in a web app.

Here are the available settings:

- `dtm`: This is our "linguistic spreadsheet" (`dtm`) that we created in the previous step. It's the essential input for the tree.
- `metric`: This tells the dendrogram how to measure the "distance" or dissimilarity between your documents. Shorter distances mean more similar documents.
    - `"euclidean"` (default): Think of this as the "straight-line" distance between two points on a graph. It's good for general comparisons but can be sensitive to the overall length of documents (longer documents might naturally have larger term counts, increasing their "distance").
    - `"cosine"`: Imagine each document as an arrow pointing in a specific linguistic "direction." Cosine similarity measures how much these arrows point in the same direction. If they point almost identically, the documents are very similar, even if one document is much longer than another. This is often an excellent choice for text analysis as it focuses on stylistic or thematic *direction* rather than raw word counts.
    - `"cityblock"` (also called Manhattan distance): Imagine moving on a city grid where you can only go along streets (no diagonal shortcuts). This distance is the sum of the absolute differences for each term between two documents. Useful when the individual differences in term counts are important.
    - Many other metrics are available (e.g., `"jaccard"`, `"chebyshev"`). You can find a full list in the SciPy documentation for `scipy.spatial.distance.pdist`.

- `method`: Once we've measured distances, this method determines how individual documents (or existing clusters of documents) are joined together to form larger branches and clusters in the tree.
    - `"average"` (default): When combining two clusters, this method considers the average distance between *all* pairs of documents in the two clusters. It tends to produce well-balanced clusters.
    - `"single"`: Joins clusters based on the *closest* pair of documents between them. This can sometimes lead to "chaining," where documents connect one after another, forming long, straggly branches.
    - `"complete"`: Joins clusters based on the *farthest* pair of documents between them. This tends to produce more compact, spherical clusters, ensuring all documents within a cluster are relatively similar to each other.
    - `"ward"`: This method aims to minimize the increase in "variance" (or spread) within clusters when they are merged. It tries to make clusters that are as "tight" and internally similar as possible. It often produces intuitive and well-structured clusters, but note that SciPy documents Ward linkage as correctly defined only for Euclidean distances.
    - Many other methods are available. You can find a full list in the SciPy documentation for `scipy.cluster.hierarchy.linkage`.
- `labels`: This is simply the list of descriptive names for your documents (e.g., "Poe", "Lippard") that we defined earlier. These will appear as the leaves (endpoints) on your tree.
- `orientation`: Controls the direction of the dendrogram.
    - `"top"` (default): Branches extend downwards from the top.
    - `"bottom"`: Branches extend upwards from the bottom.
    - `"left"`: Branches extend rightwards from the left.
    - `"right"`: Branches extend leftwards from the right.
- `truncate_mode`: If your dendrogram has too many branches and is hard to read, you can "truncate" it to show only the most important parts. The "lastp" setting shows only the last few merges (which represent the largest clusters). The "level" setting shows all merges up to a certain hierarchical level.
- `color_threshold`: If set, branches with a distance below this threshold will be colored differently from those above it. This helps visualize clusters at a certain distance level. You can try a number like `1.0` or `1.5` to see its effect.
- `show`: Controls whether the generated tree figure is displayed directly in this Jupyter Notebook cell.
    - `True`: The tree will appear right below the code cell.
    - `False` (default): The tree will not be shown immediately. This is useful if you just want to save the figure to a file without displaying it in the notebook. If you set `show=False`, remember to call `dendrogram.showfig()` later to display it.
- `title`: Adds a title to your dendrogram plot.
- `figsize`: A tuple `(width, height)` in inches to set the size of the overall figure. For example, `(12, 8)` for a wider and taller plot.
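
To see what `truncate_mode` and `color_threshold` do, it can help to look at the equivalent SciPy calls. The sketch below uses SciPy directly, not Lexos; it assumes (without confirmation from the Lexos source) that these settings behave like the SciPy parameters of the same name. The term counts and document names are invented, and `no_plot=True` is used so that only the computed layout is returned.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical term counts: six documents (rows) by three terms (columns).
counts = np.array([
    [10, 0, 1],
    [9, 1, 0],
    [0, 8, 2],
    [1, 9, 1],
    [5, 5, 9],
    [4, 6, 10],
], dtype=float)
names = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Pairwise distances, then the sequence of merges that defines the tree.
pairwise = pdist(counts, metric="euclidean")
merge_table = linkage(pairwise, method="average")

# Full tree: one leaf per document.
full = dendrogram(merge_table, labels=names, no_plot=True)

# truncate_mode="lastp" keeps only the last p merges, so the tree has p
# leaves; a leaf that stands for several documents is labelled "(count)".
truncated = dendrogram(merge_table, labels=names,
                       truncate_mode="lastp", p=3, no_plot=True)

# Cutting the tree at a distance gives flat clusters; this is roughly the
# grouping that a color threshold at the same distance would highlight.
flat = fcluster(merge_table, t=5.0, criterion="distance")

print("Full leaves:     ", full["ivl"])
print("Truncated leaves:", truncated["ivl"])
print("Clusters at 5.0: ", dict(zip(names, flat.tolist())))
```

Here the six toy documents form three tight pairs, so truncating to the last three merges collapses each pair into a single leaf, and cutting at a distance of 5.0 yields the same three groups.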

To use the Plotly plotter, import the `PlotlyDendrogram` class, create an instance, and use it as above.

In [None]:
# Import the PlotlyDendrogram class
from lexos.cluster import PlotlyDendrogram

# Create an instance of the PlotlyDendrogram object
dendrogram = PlotlyDendrogram(
    dtm=dtm,
    labels=labels,
    metric="euclidean",
    method="average",
    orientation="bottom",
    title="Document Similarity Dendrogram",
)

# Show the dendrogram using Plotly
dendrogram.show()

### Customizing Your Visualization

The Lexos `PlotlyDendrogram` class generates an interactive Plotly figure, giving you a lot of flexibility for customization beyond the initial parameters. You can adjust dimensions, orientation, and even control how clusters are colored or truncated.

#### Adjusting Parameters and Truncation

Let's try some different settings to see how they change the dendrogram.

In [None]:
dendrogram = PlotlyDendrogram(
    dtm=dtm,
    labels=labels,
    metric="cosine",      # Using cosine for a different perspective
    method="ward",        # Using ward for compact clusters
    orientation="right",  # Try a different orientation
    color_threshold=0.5,  # Set a threshold to color clusters (experiment with values)
    truncate_mode="lastp",  # Show only the last 'p' mergers (default p is often 30 for 'lastp')
    figsize=(6, 4),  # Set the figure size for better visibility
    title="Dendrogram: Cosine Distance, Ward Linkage (Truncated, Right Orientation)",
)

dendrogram.show()

### Saving Your Dendrogram
 
Once you're happy with your Plotly dendrogram, you'll likely want to save it as an interactive HTML file or a static image for reports, presentations, or simply for your records.

You have several options. In the Plotly toolbar, there is a "Download plot as png" option to save the plot as a static `.png` file. You can also save the image to a static file programmatically by calling `PlotlyDendrogram.write_image()`. Just provide a file name (including the extension), and it will save the image. You can choose different file formats by changing the extension (e.g., `.png`, `.jpg`, `.pdf`, `.svg`). This is a wrapper around Plotly's [`write_image()`](https://plotly.github.io/plotly.py-docs/generated/plotly.io.write_image.html) function and accepts all the same arguments.

Plotly figures are highly interactive when saved as HTML, allowing you to zoom, pan, and hover over data points in your saved file. If you wish to save your diagram as an HTML file, call `PlotlyDendrogram.write_html()`. This is a wrapper around Plotly's [`write_html()`](https://plotly.github.io/plotly.py-docs/generated/plotly.io.write_html.html) function and accepts all the same arguments.

Note that `write_image()` and `write_html()` have parallel `to_image()` and `to_html()` methods that allow you to assign the results to a variable, rather than saving to a file. 

An example is given below:

In [None]:
# Save as an interactive HTML file
# dendrogram.write_html("my_dendrogram_analysis.html")

### Troubleshooting 

**"My tree looks blank or empty!"**

- **Check `show`:** `show` defaults to `False`, so the figure is not displayed when the `Dendrogram` is created. Set `show=True`, or display the figure afterwards.
- **Verify DTM:** Ensure your `dtm` was created successfully and contains terms. If your documents are very short or very similar, the DTM might be sparse.

**"My documents don't cluster the way I expected!"**

- **Experiment with `metric` and `method`**: Different distance metrics and linkage methods reveal different aspects of similarity. `"cosine"` often works well for stylistic analysis of texts. `"ward"` or `"average"` are generally good linkage methods.
- **Adjust DTM pre-processing**: Try different combinations of `stop_words`, `lemmatize`, `pos_filter`, `min_freq`, and `ngrams`. The choice of "terms" is paramount to the relationships you uncover. For example, if you're interested in character interactions, perhaps filtering for proper nouns and verbs would be revealing.