# Lexos Clustermap Tutorial

Unlike a simple dendrogram that only clusters documents, a Clustermap allows you to simultaneously cluster both your documents and the terms within them. This provides a rich, two-dimensional view of where specific terms are concentrated across your documents and how those concentrations lead to natural groupings.

We'll begin by importing some data and creating a DTM.

In [None]:
from lexos.dtm import DTM
from lexos.io.loader import Loader
from lexos.tokenizer import Tokenizer

# Load some text files and set their names
files = [
    "FilesToUse/Poe_FallOfHouseUsher_1839.txt",
    "FilesToUse/Lippard_BelOfPrairieEden.txt",
    "FilesToUse/Irving_RipVanWInkle.txt",
    "FilesToUse/HenryWP_ThePirate.txt",
]
loader = Loader()
loader.load(files)
loader.names = ["Poe", "Lippard", "Irving", "Henry"]

# Tokenize the loaded documents
tokenizer = Tokenizer(model="en_core_web_sm")
docs = list(tokenizer.make_docs(texts=loader.texts))
labels = loader.names

print(f"Loaded {len(docs)} documents with labels: {labels}")

In [None]:
# Create lists of doc tokens, removing spaces, punctuation, digits, and stop words
doc_tokens = []
for doc in docs:
    tokens = [token.text for token in doc if not (token.is_space or token.is_punct or token.is_digit or token.is_stop)]
    doc_tokens.append(tokens)

# Create a Document-Term Matrix (DTM)
dtm = DTM(max_n_terms=50)
dtm(docs=doc_tokens, labels=labels)
num_terms, num_docs = dtm.to_df().shape

print(f"DTM created with {num_docs} documents and {num_terms} unique terms.")

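To make the idea of a document-term matrix concrete, here is a minimal sketch in plain Python that does not use Lexos. The toy token lists are hypothetical, and the way the vocabulary is cut down (keeping the most frequent terms across the whole collection) is only an assumption about what a setting like `max_n_terms` does.

```python
from collections import Counter

# Hypothetical token lists for two toy documents
toy_docs = {
    "DocA": ["house", "dark", "house", "storm"],
    "DocB": ["ship", "storm", "ship", "sea", "ship"],
}

# Count how often each term occurs in each document
counts = {label: Counter(tokens) for label, tokens in toy_docs.items()}

# Keep only the most frequent terms across the whole collection
max_n_terms = 3
totals = Counter()
for doc_counts in counts.values():
    totals.update(doc_counts)
vocab = [term for term, _ in totals.most_common(max_n_terms)]

# Build the matrix: one row per term, one column per document
matrix = {term: [counts[label][term] for label in toy_docs] for term in vocab}

for term, row in matrix.items():
    print(f"{term:>6}: {row}")
```

Each row holds one term's counts across the documents, which is the kind of table the clustermap colours and reorders.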
### Generating the Clustermap 

A clustermap is produced using Seaborn, a Python data visualization library built on top of `matplotlib`.

When we create the clustermap, we need to provide it with our document-term matrix and tell it how to measure distances and how to arrange the clusters. Here are some of the key parameters you can adjust:
 
- `metric`: This tells the clustermap how to measure the "distance" or dissimilarity between your documents. The default "euclidean" is good for general comparisons but can be sensitive to the overall length of documents (longer documents naturally have larger term counts, which increases their "distance" from shorter ones). The "cosine" metric is commonly used for documents of different lengths because it focuses on stylistic or thematic *direction* rather than raw word counts. The "cityblock" metric is also commonly used when individual differences in term counts are important. See the Lexos User Guide for further discussion.
- `method`: The linkage method determines how individual documents (or existing clusters of documents) are joined together to form larger branches and clusters in the tree. The default "average" linkage setting considers the average distance between *all* pairs of documents in the two clusters. The "single" linkage setting joins clusters based on the *closest* pair of documents between them. It can sometimes lead to "chaining," where documents connect one after another, forming long, straggly branches. By contrast, "complete" linkage joins clusters based on the *furthest* pair of documents between them. This tends to produce more compact clusters, ensuring all documents within a cluster are relatively similar to each other. The "ward" linkage setting aims to minimize the increase in "variance" (or spread) within clusters when they are merged. It tries to make clusters that are as "tight" and internally similar as possible. See the Lexos User Guide for further discussion.
- `labels`: A list of descriptive names for your documents (e.g., "Poe", "Lippard").
- `z_score`: Standardizes the values within each row or each column. In the clustermaps in this tutorial, rows are terms and columns are documents. If the value is set to `None`, the heatmap shows raw frequencies (or whatever your DTM contains). The setting `0` standardizes each row (term) by subtracting its mean and dividing by its standard deviation. This highlights how *a single term's frequency varies across different documents* relative to that term's average frequency, which is useful for seeing which documents use a term more or less than average. The setting `1` standardizes each column (document) by subtracting its mean and dividing by its standard deviation. This highlights how *terms vary within a single document* relative to that document's average term frequency, which is useful for comparing patterns across documents regardless of their length.
- `standard_scale`: Similar to `z_score`, but rescales the values to the range 0 to 1. The setting `0` scales each row (term) so its minimum value is 0 and its maximum is 1. The setting `1` scales each column (document) so its minimum is 0 and its maximum is 1.
- `title`: Adds a title to your clustermap.
- `figsize`: A tuple `(width, height)` in inches to set the size of the overall figure. For example, `(12, 8)` for a wider and taller plot.

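The differences between the `metric` and `method` options described above can be sketched in plain Python. This is an illustration only, not the code Lexos or Seaborn runs: the three toy count vectors and the helper `linkage_distance` are hypothetical, and Ward linkage is left out because it is defined by variance rather than by pairwise distances. Note how `long_doc` has the same proportions as `short_doc` but twice the counts, so its cosine distance from `short_doc` is effectively zero while the Euclidean and cityblock distances are not.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    """Sum of absolute differences, term by term."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters under single, complete, or average linkage."""
    pairs = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":
        return min(pairs)  # closest pair
    if method == "complete":
        return max(pairs)  # furthest pair
    if method == "average":
        return sum(pairs) / len(pairs)  # mean over all pairs
    raise ValueError(f"unsupported method: {method}")

# Hypothetical term counts for three toy documents over the same four terms
short_doc = [2, 1, 0, 3]
long_doc = [4, 2, 0, 6]  # same proportions as short_doc, but twice as long
other_doc = [0, 3, 2, 1]

metrics = [("euclidean", euclidean), ("cityblock", cityblock), ("cosine", cosine_distance)]
for name, dist in metrics:
    print(
        f"{name:>9}: short vs long = {dist(short_doc, long_doc):.3f}, "
        f"short vs other = {dist(short_doc, other_doc):.3f}"
    )

# Distance from the cluster {short_doc, long_doc} to the cluster {other_doc}
for method in ("single", "complete", "average"):
    d = linkage_distance([short_doc, long_doc], [other_doc], euclidean, method)
    print(f"{method:>8} linkage: {d:.3f}")
```
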
Let's generate our first clustermap! We'll start with default parameters, but feel free to run the cell below multiple times with different settings.

In [None]:
# Import the Clustermap class
from lexos.cluster import Clustermap

cm = Clustermap(dtm=dtm, title="Top 50 Terms", figsize=(5, 4))
cm.show()

The clustermap provides a wealth of information at a glance. Let's break down how to interpret it.

- In a clustermap, the rows (terms) and columns (documents) are reordered based on the similarity of their values.
- The dendrogram on the left side of the heatmap shows how the terms are grouped based on their similarity, and the dendrogram along the top shows how the documents are grouped.
- The colors in the heatmap represent the frequency (or scaled frequency if you used `z_score` or `standard_scale`) of a term in a particular document. Darker colors typically indicate higher frequencies, though this depends on the colormap.
 
So we can see which terms are most strongly correlated with which documents. This is useful because it is otherwise not immediately clear which terms determine the clusters. The clustermap allows you to go beyond simple similarity and pinpoint *which specific terms* are driving those similarities and differences.

The clustermap may be more or less readable, depending on the nature of your data. You may have to change the parameters in order to generate a more useful clustermap. (Note also that Seaborn does not display every label.) Lexos provides access to a variety of parameters for changing the colours, hiding dendrograms, and otherwise adjusting the clustermap's appearance. For instance, you can use the `cmap` parameter to find a colormap that best highlights your data's patterns.

- Diverging (good for z_score data): "vlag", "coolwarm", "RdBu" (red-blue)
- Sequential (good for raw counts, high to low): "viridis", "plasma", "YlGnBu" (yellow-green-blue)

You can try this out in the cell below.

See the User Guide for an account of all available settings.

In [None]:
cm = Clustermap(dtm=dtm, title="Top 50 Words", cmap="viridis", figsize=(5, 4))
cm.show()

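The pairing of colormap families with scaling options listed above follows from what each kind of scaling does to the numbers. The sketch below is plain Python rather than Lexos or Seaborn code. The helper names `z_score_values` and `standard_scale_values` and the toy counts are hypothetical, and it assumes the sample standard deviation (the pandas default) is used for the z-score.

```python
from statistics import mean, stdev

def z_score_values(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def standard_scale_values(values):
    """Rescale so the minimum becomes 0 and the maximum becomes 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical counts from a single row or column of a DTM
counts = [2, 4, 4, 10]

z_scored = z_score_values(counts)
scaled = standard_scale_values(counts)

print("raw counts:    ", counts)
print("z-scored:      ", [round(v, 3) for v in z_scored])
print("standard scale:", scaled)
```

The z-scored values fall on both sides of zero, which is why a diverging colormap with a neutral midpoint suits them. Raw counts and standard-scaled values run upward from zero, which suits a sequential colormap.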
### Adding Row/Column Colors

Let's say you wanted to colour code the documents by genre, say, purple for Gothic and orange for Adventure. You can make a list of genres matching the order of your labels and then feed that to the clustermap `col_colors` parameter. If you provide a Pandas Series or Pandas DataFrame, Seaborn extracts a label from your Series name or DataFrame column. The categories must be indexed to your DTM's labels, as demonstrated below:

!!! Note: This works, but you can't get the label. Seaborn says that it uses a DataFrame column or Series name as the label, which works, but it does not extract the colours. The same is true if you just call vanilla Seaborn `clustermap`.

In [None]:

# Import pandas so we can create a Series or DataFrame
import pandas as pd

# Create a color map for genres
color_map = {"Gothic": "purple", "Adventure": "orange"}

# Create a list of genres matching the order of the labels
genres = ["Gothic", "Gothic", "Adventure", "Adventure"]

# Create a list of colors corresponding to the genres
col_colors = [color_map[genre] for genre in genres]

# Use a Pandas Series to provide the colors
col_colors = pd.Series(col_colors, name="Genre", index=dtm.labels)

# Or uncomment to use a Pandas DataFrame to provide the colors
# col_colors = pd.DataFrame(col_colors, columns=["Genre"], index=dtm.labels)

# Create a clustermap with column colors
cm = Clustermap(dtm=dtm, title="Test Clustermap", col_colors=col_colors, figsize=(5, 4))
cm.show()


### Hiding Dendrograms

If you prefer a simpler visual, you can hide the dendrograms.

!!! Note: Hiding the upper dendrogram may affect the height of the title above the heatmap. You can adjust it with `cm.ax.set_title("New Title", pad=20)` where `pad` is the distance in points from the top of the axes to the title.

In [None]:

# Example: Hiding both the upper (column) and side (row) dendrograms
cm = Clustermap(dtm=dtm, title="Test Clustermap", hide_upper=True, hide_side=True, figsize=(5, 4))

# If hiding the upper dendrogram affects the title position, you can adjust it
# using matplotlib, as shown below. The example puts the title below the heatmap.
# cm.fig.suptitle(t="Test Clustermap", y=-.05)

# Show the clustermap
cm.show()


### Saving Your Clustermap

After you've generated your clustermap, you'll likely want to save it as an image for reports or presentations. The `save()` method lets you do this easily. Just provide a file path, and it will save the image. You can specify different file formats by changing the extension (e.g., `.png`, `.jpg`, `.pdf`, `.svg`).

- `dpi` is a helpful argument that allows you to change the resolution of the image. A dpi of 300 is considered high resolution.

In [None]:
# Save the clustermap with a higher resolution
# cm.save("my_clustermap_high_res.png", dpi=300)

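As a rough guide to choosing a `dpi` value, the pixel dimensions of a saved raster image are the figure size in inches multiplied by the dpi. The sketch below only does that arithmetic. The helper `pixel_size` is hypothetical, and it assumes the image is written at the full figure size with no cropping, which may not match exactly what the save step produces.

```python
def pixel_size(figsize, dpi):
    """Return the approximate (width, height) in pixels for a figure size in inches."""
    width_in, height_in = figsize
    return round(width_in * dpi), round(height_in * dpi)

figsize = (5, 4)  # the figure size used in the cells above
for dpi in (100, 300):
    width_px, height_px = pixel_size(figsize, dpi)
    print(f"dpi={dpi}: about {width_px} x {height_px} pixels")
```

If the saved image looks blurry in print, raising `dpi` increases the pixel count without changing the layout of the figure.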
## Plotly Clustermaps

Plotly clustermaps are somewhat experimental and may or may not render plots that are as informative as Seaborn clustermaps. One advantage they have is that, instead of providing labels for each document at the bottom of the graph, they provide the document labels on the `x` and `y` axes, as well as the `z` (distance) score in the hovertext. This allows you to mouse over individual sections of the heatmap to see which documents are represented by that particular section. Also, by adjusting the height of the heatmap, it can be easier to see the labels for all terms. Note that the dendrograms and heatmap are separate subplots, so panning and zooming occur separately for each. Panning and zooming are therefore unlikely to be very useful for this sort of diagram.

With that said, let's generate a Plotly clustermap with some default settings.

In [None]:
from lexos.cluster import PlotlyClustermap

cm = PlotlyClustermap(
    dtm=dtm,
    labels=labels,
    metric="euclidean",  # Distance metric; try "cosine" for a different perspective
    method="average",  # Linkage method; try "ward" for compact clusters
    width=700,
    height=600,
    hide_upper=False,  # Set to True to hide the top dendrogram
    hide_side=False,  # Set to True to hide the side dendrogram
    colorscale="Blues",  # Plotly colorscale used for the heatmap
    title="Top 50 Terms",
)
cm.show()

Note that panning and zooming can cause the heatmap and dendrograms to become unsynced. There is currently no way to maintain the syncing in pure Python. If you need to zoom in on particular sections of the plot, you may be able to achieve the effect you are looking for by saving the plot as an HTML file with the _experimental_ `include_sync` parameter:

```python
html = cm.to_html(include_sync=True)
with open("filename.html", "w") as f:
    f.write(html)
```

Open the HTML file in a web browser, and you may get the behaviour you need. See below for other options for saving your Plotly clustermaps.

### Customizing Your Visualization
 
The Lexos `PlotlyClustermap` class generates an interactive Plotly figure, giving you a lot of flexibility for customization beyond the initial parameters. You can adjust dimensions, hide dendrograms, and even control the colorscale.
 
You can change the `width` and `height` of the plot to make it larger or smaller. You can also hide the top or side dendrograms if you only want to focus on the heatmap or a single clustering view.

In [None]:
# Generate the clustermap with the top dendrogram hidden
cm = PlotlyClustermap(
    dtm=dtm,
    labels=labels,
    metric="cosine",  # Using cosine for a different perspective
    method="ward",  # Using ward for compact clusters
    width=700,
    height=600,
    hide_upper=True,  # Set to True to hide the top dendrogram
    hide_side=False,  # Set to True to hide the side dendrogram
    colorscale="Blues",  # Try a different colorscale
    title="Clustermap: Cosine Distance, Ward Linkage (Blue Scale)",
)

# Show the clustermap
cm.show()

### Saving Your Clustermap
 
Once you're happy with your Plotly clustermap, you'll likely want to save it as an interactive HTML file or a static image for reports, presentations, or simply for your records.

You have several options. In the Plotly toolbar, there is a "Download plot as png" option to save the plot as a static `.png` file. You can also save the image to a static file programmatically by calling `PlotlyClustermap.write_image()`. Just provide a file name (including the extension), and it will save the image. You can choose different file formats by changing the extension (e.g., `.png`, `.jpg`, `.pdf`, `.svg`). This is a wrapper around Plotly's [`write_image()`](https://plotly.github.io/plotly.py-docs/generated/plotly.io.write_image.html) function and accepts all the same arguments.

Plotly figures are highly interactive when saved as HTML, allowing you to zoom, pan, and hover over data points in your saved file. If you wish to save your diagram as an HTML file, call `PlotlyClustermap.write_html()`. This is a wrapper around Plotly's [`write_html()`](https://plotly.github.io/plotly.py-docs/generated/plotly.io.write_html.html) function and accepts all the same arguments.

Note that `write_image()` and `write_html()` have parallel `to_image()` and `to_html()` methods that allow you to assign the results to a variable, rather than saving to a file. 

An example is given below:

In [None]:
# Save as an interactive HTML file
# cm.write_html("my_clustermap_analysis.html")