# Lexos Bootstrap Consensus Tree Tutorial

A **Bootstrap Consensus Tree** is particularly robust because it doesn't just build one tree. Instead, it builds many, many trees by randomly sampling portions of your DTM. It then finds the "consensus": the most consistently appearing relationships across all those individual trees.

A Bootstrap Consensus Tree lets you make observations about:

- Stylistic Similarities: Are authors from the same period or literary movement clustering together?
- Genre Influence: Do texts of a particular genre (e.g., Gothic horror) form a distinct cluster?
- Thematic Patterns: If you used POS filtering or focused on specific terms, does the tree reveal groupings based on shared themes or subject matter?

We'll begin by importing some data and creating a DTM.

In [None]:
from lexos.dtm import DTM
from lexos.io.loader import Loader
from lexos.tokenizer import Tokenizer

# Load some text files and set their names
files = [
    "FilesToUse/Poe_FallOfHouseUsher_1839.txt",
    "FilesToUse/Lippard_BelOfPrairieEden.txt",
    "FilesToUse/Irving_RipVanWInkle.txt",
    "FilesToUse/HenryWP_ThePirate.txt",
]
loader = Loader()
loader.load(files)
loader.names = ["Poe", "Lippard", "Irving", "Henry"]

# Tokenize the loaded documents
tokenizer = Tokenizer()
docs = list(tokenizer.make_docs(texts=loader.texts))
labels = loader.names

print(f"Loaded {len(docs)} documents with labels: {labels}")

# Create a Document-Term Matrix (DTM)
dtm = DTM()
dtm(docs=docs, labels=labels)

print(f"DTM created with {dtm.to_df().shape[1]} documents and {dtm.to_df().shape[0]} unique terms.")

### Generating the Bootstrap Consensus Tree (BCT)

When we create the BCT, we need to tell it how to build and combine these trees. Here are the key parameters you can adjust:

- `dtm`: The "linguistic spreadsheet" (`dtm`) that we created in the previous step.
- `metric`: This tells the tree how to measure the "distance" or dissimilarity between your documents. Shorter distances mean more similar documents. The default value "euclidean" is good for general comparisons but can be sensitive to the overall length of documents. Other common values are "cosine", often a good choice when documents vary considerably in length, and "cityblock" (Manhattan distance), which is useful when the individual differences in term counts are important. For further discussion, see the documentation on Hierarchical Agglomerative Clustering.
- `method`: The criterion used to join clusters to form larger branches and clusters in the tree. Possible values are "average" (the default), "single", "complete", and "ward". The default "average" tends to produce well-balanced clusters. Single linkage often produces "chaining" of adjacent documents, whereas complete linkage tends to ensure that all documents within a cluster are relatively similar. Ward linkage tries to make clusters as "tight" and internally similar as possible. For further discussion of these options, see the documentation on Hierarchical Agglomerative Clustering.
- `cutoff`: This is a confidence threshold. Since a BCT is built from many individual "bootstrap" trees, a `cutoff` of `0.5` (50%) means that a specific grouping of documents (a branch on the tree) must appear in at least 50% of the trees generated across all iterations to be considered reliable enough to show up in the final consensus tree. Higher `cutoff` values will result in a "sparser" tree, showing only the most robust and consistent relationships. Lower `cutoff` values will show more relationships, but some of these might be less statistically reliable.
- `iterations`: The number of "bootstrap resampling" rounds. In each round, Lexos takes a random 80% sample of the terms (columns) from your DTM and builds a tree from that sample. More iterations will make the consensus tree more statistically reliable and representative of the underlying relationships in your documents, as it averages out more variations. However, it will take longer to compute. Since fewer iterations will be faster, a lower number is good for quick testing or initial explorations. Setting `iterations` to 100 or higher is recommended for final research results if computation time allows.
- `replace`: This determines how the terms are sampled during each iteration. If set to "with", a term column can be selected multiple times within a single 80% sample (which allows for more randomness). The default setting "without" means that each term column can only be selected once per 80% sample (which is more stable).
- `labels`: A list of labels to use for the leaves (endpoints) on your tree. The list should have the same order as the names in your DTM.
- `text_color`: Sets the color for all text on the plot (axis labels, branch lengths, and document labels). You can use "rgb(R, G, B)" format, such as `"rgb(0, 0, 0)"` (black) or `"rgb(255, 0, 0)"` (red).
- `title`: A title to place at the top of the plot.
- `layout`: Sets the layout style of the dendrogram. Possible values are "rectangular" (the default) and "fan".
- `showfig`: Controls whether the generated tree figure is displayed automatically, directly in this Jupyter Notebook cell. If set to `False`, the diagram will not be displayed immediately. The default is `True`.
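
To make the difference between the `metric` options more concrete, here is a minimal sketch in plain Python. It does not use Lexos; the helper functions and the two toy count vectors are assumptions made up for illustration. The second "document" has the same term proportions as the first but is ten times longer.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    """Manhattan distance: the sum of absolute differences per term."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical counts for three made-up terms
short_doc = [2, 1, 0]
long_doc = [20, 10, 0]  # same proportions, ten times longer

print(f"euclidean: {euclidean(short_doc, long_doc):.2f}")
print(f"cityblock: {cityblock(short_doc, long_doc)}")
print(f"cosine:    {cosine_distance(short_doc, long_doc):.2f}")
```

In this toy case the Euclidean and cityblock distances are large purely because of the length difference, while the cosine distance is effectively zero. That is the sense in which "cosine" is less sensitive to document length.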


In [None]:
# Import the BCT class for Bootstrap Consensus Tree
from lexos.cluster import BCT

# Create an instance of the BCT object (feel free to adjust parameters)
bct = BCT(
    dtm=dtm,
    metric="euclidean",          # Try "cosine" for stylistic comparisons, or "cityblock"
    method="average",            # Try "ward" for compact clusters, or "complete"
    cutoff=0.5,                  # Only show relationships present in at least 50% of trees
    iterations=10,               # Set higher for slower but more reliable results
    replace="without",           # Do not repeat terms in each sample
    labels=labels,               # Use the labels we defined for our documents
    text_color="rgb(0, 0, 0)",   # Set the text color to black
    layout="rectangular",        # Alternatively, use "fan"
    title="Bootstrap Consensus Tree (Rectangular Layout)"
)

# Show the figure
bct.show()

**Note:** If your tree does not look right, see the **Troubleshooting** section at the end of this tutorial.
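
If you are curious how `iterations`, `replace`, and `cutoff` interact, the sketch below imitates the idea in plain Python. It is a deliberately simplified illustration, not the Lexos implementation: instead of building a full tree in each round, it only records which pair of documents is closest, and the toy matrix, labels, and helper names are all assumptions invented for this example.

```python
import math
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy counts: one row per document, one column per term
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [5, 0, 3, 1, 0, 2, 4, 0, 1, 3],
    [4, 1, 3, 0, 0, 2, 5, 0, 1, 2],
    [0, 6, 0, 4, 5, 0, 0, 3, 2, 0],
]


def sample_columns(n_terms, replace, rng, fraction=0.8):
    """Pick 80% of the term columns, with or without repetition."""
    k = int(n_terms * fraction)
    if replace == "with":
        return rng.choices(range(n_terms), k=k)
    return rng.sample(range(n_terms), k)


def closest_pair(matrix, columns):
    """Return the two row indices with the smallest Euclidean distance."""
    def dist(i, j):
        return math.sqrt(sum((matrix[i][c] - matrix[j][c]) ** 2 for c in columns))
    return min(combinations(range(len(matrix)), 2), key=lambda pair: dist(*pair))


def pair_support(matrix, iterations, replace="without", seed=0):
    """Share of resampling rounds in which each pair was the closest one."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        columns = sample_columns(len(matrix[0]), replace, rng)
        counts[closest_pair(matrix, columns)] += 1
    return {pair: n / iterations for pair, n in counts.items()}


support = pair_support(toy_matrix, iterations=100)
cutoff = 0.5
for (i, j), share in support.items():
    verdict = "kept" if share >= cutoff else "dropped"
    print(f"{toy_labels[i]}+{toy_labels[j]}: {share:.2f} ({verdict})")
```

With this toy data, A and B are the closest pair in every round, so their grouping has a support of 1.00 and survives any `cutoff`. With real texts the support is usually split across competing groupings, which is where the `cutoff` starts to matter.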

### Interpreting Your Bootstrap Consensus Tree

Now that you've generated the tree, let's talk about how to read it!

The tree you see is a **dendrogram**, a type of tree diagram that shows hierarchical clustering. Each line is a **branch** on the tree and connects either to another branch or to a **leaf** (a terminal branch). Each leaf represents an individual document and is labelled with that document's name. Leaves connected to other leaves form clusters, or **clades**, and these may be joined to other clades, forming "super-clusters" at the next level in the tree's hierarchy. In the rectangular layout, branch length indicates the **distance** or **dissimilarity** between the documents in a clade and those that form part of the same "super-cluster". Shorter branches mean that the documents or clusters are more similar to each other based on their linguistic features. Longer horizontal branches mean that the clades are further apart. In the fan layout, branch length is irrelevant since it is scaled to allow all leaves to end at the circumference of a circle.

The following procedure is useful in reading the tree.

1. Start from the leaves (your document names) and move up the hierarchy towards the root of the tree. Look for the first merges.
2. As you move further up the hierarchy, you'll see larger branches forming, grouping together sets of documents or smaller clusters. These broader groupings represent larger patterns of similarity.

For example, in the dendrogram above, you might observe that "Henry" and "Irving" are clustered together with a relatively short branch length, suggesting a linguistic similarity. Then, perhaps "Poe" joins that cluster, indicating it's somewhat similar to the Henry/Irving group but less so than they are to each other. "Lippard" might be an outlier, joining the main cluster with a much longer branch, indicating it's quite distinct linguistically from the others.
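
The reading procedure above can be mimicked in a few lines of plain Python. This is only a sketch: the pairwise distances are invented to match the hypothetical scenario just described (they are not computed from the tutorial texts), and the helper functions are illustrative, not part of Lexos. The code repeatedly merges the two closest clusters using average linkage and prints the merges in order, which is the same order in which you would read the tree from the leaves towards the root.

```python
from itertools import combinations

# Hypothetical pairwise distances, invented to match the narrative above
distances = {
    frozenset({"Henry", "Irving"}): 2.0,
    frozenset({"Henry", "Poe"}): 4.0,
    frozenset({"Irving", "Poe"}): 5.0,
    frozenset({"Henry", "Lippard"}): 9.0,
    frozenset({"Irving", "Lippard"}): 10.0,
    frozenset({"Lippard", "Poe"}): 8.0,
}


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all cross-cluster document pairs."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(distances[frozenset(pair)] for pair in pairs) / len(pairs)


def merge_order(names):
    """Merge the two closest clusters until one remains; record each merge."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        merges.append((a, b, average_linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = merge_order(["Poe", "Lippard", "Irving", "Henry"])
for left, right, height in merges:
    print(f"{' + '.join(left)} joins {' + '.join(right)} at distance {height:.1f}")
```

The first merge (Irving and Henry at 2.0) corresponds to the shortest branches, Poe joins that pair at 4.5, and Lippard joins last at 9.0, which is what marks it as the outlier.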

### Customizing Your Visualization

The Lexos `BCT` class uses `matplotlib` to create the plot, which means you have a lot of control over its appearance. While Lexos sets sensible defaults, you can easily tweak things!

#### Adjusting Font Size

You can adjust the default font size for your plots using `plt.rcParams.update`. This needs to be run *before* you generate the tree, or you can re-run the tree generation cell after making this change.

To test this, uncomment and run the cell below. Then re-run the BCT generation cell above.

In [None]:
# Set default font size to 12 point for all matplotlib plots
# import matplotlib.pyplot as plt
# plt.rcParams.update({'font.size': 12})

#### Adjusting Figure Size

Currently, the Lexos BCT automatically adjusts height based on the number of documents. The default width is 9.5 inches.

If your tree has many documents or long labels, it might feel cramped. You can make the overall figure wider or taller to give it more space. This is done by setting the `w` (width) and `h` (height) parameters in `plt.gcf().set_size_inches()`. This adjustment is already part of the Lexos code for the tree, but you can override it if you need a specific size.

You can manually adjust the figure size *after* the tree has been generated but *before* you save or show it, if the automatic sizing isn't perfect for you. An example is given below (run the cell to try it out):

In [None]:
# Make the figure 12 inches wide and 8 inches tall
bct.fig.set_size_inches(w=12, h=8)
bct.show()

### Fan Layout

Bootstrap consensus trees can often be easier to read with a fan-shaped layout.

In [None]:
bct = BCT(
    dtm=dtm,
    metric="euclidean",
    method="average",
    cutoff=0.5,
    iterations=10,
    replace="without",
    labels=labels,
    text_color="rgb(0, 0, 0)",
    layout="fan",
    title="Bootstrap Consensus Tree (Fan Layout)"
)
bct.show()

### Saving Your Tree

Once you're happy with your Bootstrap Consensus Tree, you'll likely want to save it as an image file for reports, presentations, or simply for your records.

The `bct.save()` method allows you to do this easily. Just provide a file name, and it will save the image in the same directory where your Jupyter Notebook is located. You can choose different file formats by changing the extension (e.g., `.png`, `.jpg`, `.pdf`, `.svg`). PNG is generally a good choice for web or documents, while SVG provides a scalable vector graphic useful for high-quality printing.

In [None]:
# Save the figure to a file
# bct.save("bct_test_result.png")

### Troubleshooting

**"My tree looks blank or empty!"**

- **Check `cutoff`:** If your `cutoff` is too high (e.g., 0.9 or 1.0), it might be too strict, and no common groupings meet that threshold. Try lowering it (e.g., to 0.5 or 0.3).
- **Check `iterations`:** If `iterations` is very low (e.g., 1 or 2), there might not be enough "sampling" to find a stable consensus. Increase it.
- **Verify DTM:** Ensure your `dtm` was created successfully and contains terms. If your documents are very short or very similar, the DTM might be sparse.
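
As a rough aid for the "Verify DTM" check, the sketch below summarizes a count matrix in plain Python. It assumes you have already exported your counts into a plain list of rows (one row per document, one column per term); how you do that depends on your Lexos version, so that step is not shown. The function name and the toy data are hypothetical.

```python
def summarize_counts(rows, labels):
    """Report basic health indicators for a document-by-term count matrix."""
    n_cells = sum(len(row) for row in rows)
    zero_cells = sum(1 for row in rows for value in row if value == 0)
    # Documents whose counts are all zero contribute nothing to the tree
    empty = [label for label, row in zip(labels, rows) if sum(row) == 0]
    # Documents with identical counts have a distance of zero to each other
    seen, duplicates = {}, []
    for label, row in zip(labels, rows):
        key = tuple(row)
        if key in seen:
            duplicates.append((seen[key], label))
        else:
            seen[key] = label
    return {
        "n_documents": len(rows),
        "n_terms": len(rows[0]) if rows else 0,
        "sparsity": zero_cells / n_cells if n_cells else 0.0,
        "empty_documents": empty,
        "duplicate_documents": duplicates,
    }


# Hypothetical toy data: B duplicates A, and C is empty
toy_rows = [[3, 0, 1], [3, 0, 1], [0, 0, 0]]
summary = summarize_counts(toy_rows, ["A", "B", "C"])
for key, value in summary.items():
    print(f"{key}: {value}")
```

A high sparsity value, empty documents, or exact duplicates are all signs that the problem lies in the input data rather than in the BCT parameters.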

**"My documents don't cluster the way I expected!"**

- **Experiment with `metric` and `method`**: Different distance metrics and linkage methods reveal different aspects of similarity. `"cosine"` often works well for stylistic analysis of texts. `"ward"` or `"average"` are generally good linkage methods.
- **Adjust DTM pre-processing**: Try different combinations of `stop_words`, `lemmatize`, `pos_filter`, `min_freq`, and `ngrams`. The choice of "terms" is paramount to the relationships you uncover. For example, if you're interested in character interactions, perhaps filtering for proper nouns and verbs would be revealing.
