# Topwords Tutorial

This tutorial shows how to extract keyterms and distinctive terms using the `KeyTerms`, `ZTest`, and `MannWhitney` classes of the `topwords` module.

## What's the Difference Between `KeyTerms`, `ZTest`, and `MannWhitney`?

Imagine you have a large collection of books. You can use `KeyTerms` like an index generator for a single book. It looks inside the book, identifies the most important words or phrases and highlights which ones are most statistically significant in the book, which may correlate to significant topics or themes. The book is not compared to any other book.

By contrast, you can use `ZTest` to examine which terms distinguish one group of books from another. One group of books represents your area of interest (your "target" documents), and the rest, or a portion of the rest, represents the "comparison" documents against which you will compare them. `ZTest` compares words and phrases in your target documents to those in the comparison documents. Its goal is to find words and phrases that are unusually common in your target group relative to the comparison group, or vice versa. This helps to identify which terms mark your target documents as unique.

The `MannWhitney` class implements a procedure similar to `ZTest` but provides a measure of confidence in *how* significant the differences between documents are. More information about when to use `MannWhitney` over `ZTest` is provided below.

In short, use `KeyTerms` to identify what an individual document is about. Use `ZTest` or `MannWhitney` to identify what terms make a specific group of documents different from another group of documents.

## Data Preparation

Run the cell below to load the sample data.

In [None]:
from lexos.io.loader import Loader
from lexos.tokenizer import Tokenizer

paths = [
    "txt_files/Dracula by Bram Stoker.txt",
    "txt_files/Alices Adventures in Wonderland by Lewis Carroll.txt",
    "txt_files/Romeo and Juliet by William Shakespeare.txt",
]
loader = Loader()
loader.load(paths)

# We take only the beginning of the text to speed up processing for this tutorial
texts = [text[0:10000] for text in loader.texts]
tokenizer = Tokenizer(model="en_core_web_sm")
docs = list(tokenizer.make_docs(texts))

### The `KeyTerms` Class

This class wraps [Textacy's `extract.keyterms`](https://textacy.readthedocs.io/en/latest/api_reference/extract.html#keyterms) functionality, allowing you to extract representative keyterms from a single document using one of the library's algorithms: `textrank`, `sgrank`, `scake`, or `yake`. Choose your algorithm with the `method` parameter. See the Lexos documentation for advice on which algorithm to choose.

The `topn` parameter is used to configure the number of topwords to output (the default is 10). You can also choose to count ngrams by setting `ngrams=2` (or 3, 4, etc.). If you wish to count multiple types of ngrams, use a tuple like `ngrams=(1, 2)`, which will count single tokens and bigrams.
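
To make the `ngrams` setting concrete, here is a minimal pure-Python sketch of what counting single tokens and bigrams together means. The `extract_ngrams` helper is hypothetical and is not how Lexos or Textacy builds ngrams internally; it only illustrates the idea behind `ngrams=(1, 2)`.

```python
from collections import Counter


def extract_ngrams(tokens, sizes):
    """Return all ngrams of the requested sizes as space-joined strings."""
    # Accept a single integer (e.g. 2) or a tuple of sizes (e.g. (1, 2))
    if isinstance(sizes, int):
        sizes = (sizes,)
    ngrams = []
    for n in sizes:
        # Slide a window of length n across the token list
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i : i + n]))
    return ngrams


tokens = ["the", "old", "man", "and", "the", "old", "dog"]
counts = Counter(extract_ngrams(tokens, (1, 2)))
print(counts.most_common(3))
```

With a tuple, unigrams and bigrams land in the same frequency table, so a repeated phrase such as "the old" competes directly with single words.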

The `normalize` parameter is used to determine whether variant forms of tokens should be counted separately. The default setting "lemma" counts together all morphological forms of a word. You can turn this off by setting `normalize="orth"`. You can also make counts case insensitive by setting `normalize="lower"`.

Note that you need to preprocess your documents into spaCy `Doc` objects with a language model that generates lemmas for your tokens (e.g. `en_core_web_sm` for English). You may also pass such a model to the `KeyTerms` class to preprocess string inputs in place, but this will slow down processing.

Run the cell below to perform a basic experiment on *Dracula*.

In [None]:
from lexos.topwords.keyterms import KeyTerms

kt = KeyTerms(
    document=docs[0],
    method="textrank",
    topn=5,
    ngrams=1,
    normalize="lemma",
)
kt.keyterms

You can output the results as a list of tuples or pandas DataFrame with the helper methods shown below.

In [None]:
# Output a list of tuples
print(kt.to_list())

# Output a pandas DataFrame
kt.to_df()

### Understanding `KeyTerms` Output
The `KeyTerms` output, whether as a DataFrame or a list of tuples, provides a term and a score.

For example, say you had the following output:

```txt
             term     score
0    poor old man  0.003206
1    dear old man  0.003025
2        good one  0.002898
3      little one  0.002791
4  good brave man  0.002725
```

The `term` represents the keyterm or keyphrase extracted from the document. `KeyTerms` uses algorithms like TextRank, which are graph-based methods that identify important words or phrases by analyzing their relationships within the text. This is why you often see multi-word phrases like "poor old man" or "good brave man" as keyterms.

The `score` is a numerical value indicating the estimated importance or salience of the term within the single document being analyzed. A higher score means the algorithm considers that term more central or relevant to the document's content. In the output above, "poor old man" is scored as more relevant than "good brave man". The exact range and meaning of the score depend on the specific Textacy algorithm used (`textrank`, `sgrank`, `scake`, or `yake`), but generally the scores are normalized values, not counts or frequencies. They reflect the term's position and connections in the text's semantic network.
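
To give a feel for where graph-based scores come from, the following is a heavily simplified, single-word version of the TextRank idea: words that co-occur within a small window are linked, and PageRank-style iteration spreads importance along those links. This is a sketch under stated assumptions, not Textacy's implementation, which does considerably more; the function name, window size, and sample tokens are all illustrative.

```python
from collections import defaultdict


def textrank_scores(tokens, window=2, damping=0.85, iterations=50):
    """Score words with PageRank over an undirected co-occurrence graph."""
    # Link each word to the words that follow it within `window` positions
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start from a uniform score, then repeatedly redistribute it along edges
    words = list(neighbors)
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) / len(words)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in words
        }
    return scores


tokens = "count castle night count coach castle count wolves night".split()
scores = textrank_scores(tokens)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}\t{score:.4f}")
```

The scores sum to 1 across the vocabulary, which is why they look like small decimals rather than counts: a word scores highly because it is well connected, not simply because it is frequent.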

### The `ZTest` Class

`ZTest` identifies statistically over-represented words in target documents compared to comparison documents. It calculates the proportion of each term in target vs. comparison documents, computes a z-score for the difference in proportions, and returns terms with the highest z-scores. These terms are deemed to be the most distinctive of the target documents.

In the process, it also sets a `._.topwords` attribute on all spaCy `Doc` objects so that the topwords for that document can be retrieved easily using `doc._.topwords`.
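
The z-score calculation described above can be sketched with the standard pooled two-proportion z-test. This is an assumption about the general technique; the exact formula used inside `ZTest` may differ in its details. The function name and the counts are hypothetical.

```python
import math


def two_proportion_z(target_count, target_total, comp_count, comp_total):
    """Pooled two-proportion z-score for a single term."""
    p_target = target_count / target_total
    p_comp = comp_count / comp_total
    # Pooled proportion under the assumption that both sets share one rate
    p_pooled = (target_count + comp_count) / (target_total + comp_total)
    std_err = math.sqrt(
        p_pooled * (1 - p_pooled) * (1 / target_total + 1 / comp_total)
    )
    if std_err == 0:
        return 0.0  # the term's rate is identical (0 or 1) in both sets
    return (p_target - p_comp) / std_err


# Hypothetical counts: a term occurs 30 times in 1,000 target tokens
# and 10 times in 1,000 comparison tokens
z = two_proportion_z(30, 1000, 10, 1000)
print(round(z, 2))
```

Swapping the two sets flips the sign of the result, which is why the same term can appear with a positive score from one side of a comparison and a negative score from the other.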

In [None]:
from lexos.topwords.ztest import ZTest

target_docs = [docs[0]] # Dracula
comparison_docs = [docs[1], docs[2]]  # Alice and Romeo

ztest = ZTest(target_docs=target_docs, comparison_docs=comparison_docs, topn=5)

# Print the top words as a list of tuples
print("All Topwords from ZTest:")
print(ztest.topwords)

print()

print("Each doc in docs now has doc._.topwords set to the top words:")
print(f"Topwords for *Dracula*: {docs[0]._.topwords}")

The following keyword parameters are useful for configuring the output:

- `topn`: Number of top words to return (default: 10).
- `ngrams`: An integer representing the number of ngrams to count or a tuple specifying more than one type, e.g., `(1, 2)` for single tokens and bigrams.
- `case_sensitive`, `remove_stopwords`, `remove_punct`, `remove_digits`: Preprocessing options. Each defaults to `True` except `remove_digits`, which defaults to `False`.
- `model`: spaCy model name to use for tokenization if the input documents are strings. The default is `"xx_sent_ud_sm"`.

You can try these out in the code above to see how the output changes.
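
If it helps to see what the preprocessing flags do to the counts before any statistics are computed, here is a rough standard-library sketch. Lexos tokenizes with spaCy, so the regular-expression tokenizer, the tiny stopword list, and the `count_terms` function below are hypothetical stand-ins used only for illustration.

```python
import re
import string
from collections import Counter

# A tiny stand-in stopword list; spaCy models ship much larger ones
STOPWORDS = {"the", "a", "and", "of", "in"}


def count_terms(text, case_sensitive=True, remove_stopwords=True,
                remove_punct=True, remove_digits=False):
    """Count tokens after applying simple, illustrative filters."""
    # Split into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = []
    for tok in tokens:
        if remove_punct and all(ch in string.punctuation for ch in tok):
            continue
        if remove_digits and tok.isdigit():
            continue
        if remove_stopwords and tok.lower() in STOPWORDS:
            continue
        kept.append(tok if case_sensitive else tok.lower())
    return Counter(kept)


text = "The Count and the count of 3 castles, in 1897."
print(count_terms(text))
print(count_terms(text, case_sensitive=False, remove_digits=True))
```

With case sensitivity on, "Count" and "count" are tallied separately; turning it off merges them, which changes the proportions that feed into the z-scores.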

Additionally, there are `to_dict()`, `to_df()`, and `to_list_of_dicts()` methods, which allow you to output the data as a dictionary, a pandas DataFrame, or a list of dictionaries. In the example below, we display the results as a pandas DataFrame.

In [None]:
ztest.to_df()

### Understanding `ZTest` Output
The `ZTest` output provides a `term` and a `z_score`.

For example:

```txt
      term     z_score
0    Alice  -30.694384
1     thou  -23.555131
2    ROMEO  -19.807056
3    Romeo  -18.895006
4        O  -18.895006
5      thy  -18.832643
6     thee  -17.936969
7   JULIET  -16.925139
8  CAPULET  -15.321415
9    NURSE  -14.614962
```
The `term` is a word or phrase (n-gram) that has a statistically significant difference in its relative frequency between your target documents and your comparison documents.

The `z_score` is the result of a z-test, which measures the statistical significance of the difference in proportions of a given term between the two document sets. The further a z-score is from zero, in either the positive or the negative direction, the more statistically significant the difference in frequency.

A negative z-score (e.g., "Alice", "ROMEO", "thou") indicates that the term is more characteristic of the comparison documents (in this case, *Alice's Adventures in Wonderland* and *Romeo and Juliet*) than of the target documents (in this case, *Dracula*). The larger the absolute value of a negative score, the more strongly that term belongs to the comparison set.

A positive z-score would indicate that the term is more characteristic of the target documents than of the comparison documents. This is often what users are looking for when identifying "top distinguishing words." (Note: You might not see many in this example, where the largest differences come from terms that are frequent in the comparison documents but rare in *Dracula*.)
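
Because the sign carries the direction of the difference, it can be handy to split a list of `(term, z_score)` tuples into the two sides. The sketch below assumes the list-of-tuples shape described earlier; the two negative values are copied from the sample output, while the two positive entries are invented purely for illustration.

```python
# (term, z_score) pairs: negative values come from the sample output above,
# positive values are hypothetical and only illustrate the other direction
topwords = [
    ("Alice", -30.694384),
    ("Harker", 9.75),      # hypothetical value
    ("thou", -23.555131),
    ("Count", 12.5),       # hypothetical value
]

# Positive scores lean toward the target set, negative toward the comparison set
target_terms = [(term, z) for term, z in topwords if z > 0]
comparison_terms = [(term, z) for term, z in topwords if z < 0]

# Rank each side by strength of association (largest absolute value first)
target_terms.sort(key=lambda pair: pair[1], reverse=True)
comparison_terms.sort(key=lambda pair: pair[1])

print("Target:", target_terms)
print("Comparison:", comparison_terms)
```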

### The `MannWhitney` Class

The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a statistical method that compares two groups to determine if they differ significantly. Unlike the z-test, it is non-parametric: it does not assume that the data follow a normal distribution.

Instead of comparing proportions directly (like the z-test), the Mann-Whitney test:

1. Ranks all term frequencies across both document sets
2. Compares the ranks between target and comparison documents
3. Calculates a U-statistic that measures how different the rankings are
4. Provides a p-value: the probability of seeing a difference at least this large if both sets of documents came from the same distribution
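
The four steps above can be sketched for a single term in plain Python. This is an illustrative version that assumes a normal approximation for the p-value, with no tie or continuity correction, so it is only rough for very small samples; it is not the code used by the `MannWhitney` class. The function names and frequencies are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(target, comparison):
    """Return (U statistic for the target group, two-sided p-value)."""
    n1, n2 = len(target), len(comparison)
    ranks = average_ranks(list(target) + list(comparison))  # step 1
    rank_sum = sum(ranks[:n1])                               # step 2
    u = rank_sum - n1 * (n1 + 1) / 2                         # step 3
    # Step 4: normal approximation of U under "no difference"
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-document frequencies of one term in each group
u, p = mann_whitney_u([5, 6, 7], [1, 2, 3])
print(u, round(p, 3))
```

For real analyses, a tested implementation such as `scipy.stats.mannwhitneyu` is preferable, since it handles exact p-values and tie corrections.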

The `MannWhitney` class takes as its input a pandas DataFrame of term frequencies with docs in rows and terms in columns. Any filtering of your terms must be done in advance. The easiest way to produce the input DataFrames is with the Lexos DTM module. Here is an example:

In [None]:
from lexos.dtm import DTM
from lexos.tokenizer import Tokenizer
from lexos.topwords.mann_whitney import MannWhitney

texts = [
    "This is a sample text for testing.",
    "Here is another example of a text to analyze.",
    "This text is different from the others.",
    "Yet another sample text for comparison.",
    "This text is similar to the first one.",
    "A completely different text for the analysis.",
]

# Process the sample texts with spaCy to create documents
tokenizer = Tokenizer(model="en_core_web_sm")
docs = list(tokenizer.make_docs(texts))

# Create labels for the documents
labels = [f"Doc{i + 1}" for i in range(len(docs))]

# Create a Document-Term Matrix (DTM) using the sample documents
# Limit to terms occurring in at least 2 documents
dtm = DTM()
dtm(docs=docs, labels=labels, min_df=2)
df = dtm.to_df(transpose=True)
df.head()

In the example below, we split the DataFrame into two groups based on whether the document label has an even or odd number. We'll use the even-numbered documents as the target data and the odd-numbered documents as the comparison data.

In [None]:
# Split the DataFrame into target and comparison groups based on even/odd document labels
even_df = df[df.index.isin(["Doc2", "Doc4", "Doc6"])]
odd_df = df[df.index.isin(["Doc1", "Doc3", "Doc5"])]

mw = MannWhitney(target=even_df, comparison=odd_df)
mw.to_df()

The output will show the terms ranked by their distinctiveness, along with their U statistic and p-value.

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the two samples come from the same distribution. A small p-value (typically less than 0.05) indicates that the observed difference between the two samples is statistically significant, so we reject the assumption that the two samples come from the same distribution.

By default, the table displays the average frequency of terms in the control group along with the increase in frequency in the comparison group. This provides us with another view of how important the word is to the sample and its relative over- or under-usage in comparison to the other sample. You can suppress the average frequency and difference columns with `add_freq=False`.

The following points provide a useful guide to interpreting the results:

- **u_statistic**: Higher values indicate the term appears more in target documents
- **p_value**: Lower values (< 0.05) indicate statistically significant differences
  - p < 0.05: Statistically significant (at the 5% level)
  - p < 0.01: Highly significant (at the 1% level)
  - p < 0.001: Very highly significant (at the 0.1% level)

#### When to Use Mann-Whitney vs. Z-Test

There are a number of key differences between the two types of tests:

- Z-test assumes terms are normally distributed (that is, data has a rough bell curve shape with frequencies decreasing evenly on both sides of the central mean). Mann-Whitney makes no such assumption.
- Z-test is more powerful with large, well-behaved data; Mann-Whitney works better with small or irregular data.
- Z-test gives a z-score (can be positive or negative); Mann-Whitney gives a U-statistic and p-value. Lower p-values (< 0.05) indicate more significant differences.

Here are some rules of thumb for choosing a method:

- **Large corpus (100+ documents)** → Use `ZTest` for faster, more powerful results
- **Small corpus (< 30 documents)** → Use `MannWhitney` for more reliable results
- **Unsure about your data** → Use `MannWhitney` to be safe
- **Need fast computation** → Use `ZTest` (it's computationally simpler)

### The `Compare` Class

The `Compare` class provides powerful methods for analyzing and comparing documents using statistical measures. This class wraps around `ZTest` or `MannWhitney` to enable three comparison strategies:

1. **`document_to_corpus()`** - Compare each document to all other documents. Use this method when you want to find what terms make each document unique.
2. **`documents_to_classes()`** - Compare each document to documents in other classes. Use this method when you want to find outliers or representative terms within classes.
3. **`classes_to_classes()`** - Compare entire classes to each other. Use this method when you want to find the signature vocabulary of particular categories.

All methods support three output formats: `dict`, `dataframe`, and `list_of_dicts`.

As a basic example, we will take four short texts. Although you can perform experiments with raw strings, they will generally be converted to spaCy `Doc` objects internally. So, for efficiency, we will preprocess the texts into spaCy `Docs`. We'll then create an instance of the `ZTest` class for our example. We provide it with no docs because these will be passed to it when we choose what we want to compare. The `ZTest` instance is our calculator. We can swap it out for other classes in the `topwords` module or with our own custom classes. Finally, we create an instance of the `Compare` class and pass it our calculator.

In [None]:
from lexos.tokenizer import Tokenizer
from lexos.topwords.ztest import ZTest
from lexos.topwords.compare import Compare

# Load spaCy model
tokenizer = Tokenizer(model="en_core_web_sm")

# Prepare sample documents
docs = list(tokenizer.make_docs([
    "Dracula was a vampire who lived in Transylvania. He had sharp fangs and drank blood.",
    "Frankenstein created a monster in his laboratory. The creature was terrifying and misunderstood.",
    "Alice fell down the rabbit hole into Wonderland. She met the Cheshire Cat and Mad Hatter.",
    "Peter Pan could fly and never wanted to grow up. He lived in Neverland with the Lost Boys."
]))

# Create a calculator instance (ZTest in this example)
calculator = ZTest(target_docs=[], comparison_docs=[])

# Create Compare instance
compare = Compare(calculator=calculator)

By creating the calculator class first, we are also able to configure it with any parameters relevant to the class. For instance, we might want to set the `case_sensitive` parameter with `calculator = ZTest(target_docs=[], comparison_docs=[], case_sensitive=False)`.

We're now ready to perform our comparison. We'll start by using the `document_to_corpus()` method to find what terms make each document unique compared to all other documents.

In [None]:
# Compare each document to the rest of the corpus
results = compare.document_to_corpus(docs)
print(results)

Running one of the `Compare` class methods populates its `data` and `results` attributes, which you can access separately.

In [None]:
# View the data
print("Topwords Data:")
print(compare.data)

# View the results
print("\nTopwords Results:")
print(compare.results)

You can change the output with the `output_format` parameter (the default is "dict"). To return a pandas DataFrame or a list of dicts, set this parameter to "dataframe" or "list_of_dicts" respectively.

In [None]:
results = compare.document_to_corpus(docs, output_format="dataframe")
display(results)

results = compare.document_to_corpus(docs, output_format="list_of_dicts")
print(results)

The `documents_to_classes()` method is used to find what makes each document distinctive compared to documents in other categories. In addition to `docs` and `doc_labels`, it accepts a `class_labels` list that supplies a category for each document (class label indices must correspond to document indices).

When you call `documents_to_classes()`, each document is compared to **all documents in other classes**:

- "Dracula" (gothic) is compared to ["Alice", "Peter Pan"] (whimsy)
- "Frankenstein" (gothic) is compared to ["Alice", "Peter Pan"] (whimsy)
- "Alice" (whimsy) is compared to ["Dracula", "Frankenstein"] (gothic)
- "Peter Pan" (whimsy) is compared to ["Dracula", "Frankenstein"] (gothic)
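
The grouping listed above can be reproduced with a few lines of plain Python. This sketch only shows how the comparison set for each document is derived from the labels; the `other_class_docs` helper is hypothetical and does not reflect the internals of `Compare`.

```python
doc_labels = ["Dracula", "Frankenstein", "Alice", "Peter Pan"]
class_labels = ["gothic", "gothic", "whimsy", "whimsy"]


def other_class_docs(doc_labels, class_labels):
    """Map each document to the documents belonging to any other class."""
    pairs = list(zip(doc_labels, class_labels))
    return {
        doc: [other for other, cls in pairs if cls != doc_cls]
        for doc, doc_cls in pairs
    }


groupings = other_class_docs(doc_labels, class_labels)
for doc, comparison in groupings.items():
    print(f"{doc} vs. {comparison}")
```

Note that a document is never compared with members of its own class, so "Dracula" and "Frankenstein" share the same comparison set.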

In [None]:
# Define class labels for each document
doc_labels = ["Dracula", "Frankenstein", "Alice", "Peter Pan"]
class_labels = ["gothic", "gothic", "whimsy", "whimsy"]

results = compare.documents_to_classes(
    docs=docs,
    doc_labels=doc_labels,
    class_labels=class_labels
)
print(results['Dracula'])


You can access data and results with the `data` and `results` attributes (class labels are added to the `data` dict). You can also output DataFrames and lists of dicts as shown above.

The `classes_to_classes()` method is used to find what terms characterize entire categories or genres compared to other categories. It also takes a `class_labels` parameter.

Each class is treated as a unified group:

- "gothic" class: ["Dracula", "Frankenstein"] combined vs. ["Alice", "Peter Pan"] combined
- "whimsy" class: ["Alice", "Peter Pan"] combined vs. ["Dracula", "Frankenstein"] combined

This is different from `documents_to_classes()` which compares individual documents.
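
To see the difference in grouping, the sketch below builds one target and comparison pair per class rather than per document. As before, this is an illustration of the partitioning only; the `class_comparisons` helper is hypothetical and is not part of Lexos.

```python
from collections import defaultdict


def class_comparisons(doc_labels, class_labels):
    """Map each class to a (members, non-members) pair of document lists."""
    members = defaultdict(list)
    for doc, cls in zip(doc_labels, class_labels):
        members[cls].append(doc)
    return {
        cls: (
            docs,
            # Every document that belongs to some other class
            [d for other, ds in members.items() if other != cls for d in ds],
        )
        for cls, docs in members.items()
    }


pairs = class_comparisons(
    ["Dracula", "Frankenstein", "Alice", "Peter Pan"],
    ["gothic", "gothic", "whimsy", "whimsy"],
)
for cls, (target, comparison) in pairs.items():
    print(f"{cls}: {target} vs. {comparison}")
```

Four documents in two classes therefore yield two comparisons here, where `documents_to_classes()` would yield four.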

In [None]:
results = compare.classes_to_classes(
    docs=docs,
    doc_labels=doc_labels,
    class_labels=class_labels,
    output_format="dict"
)
print(results)

As with the other methods, you can output the results as a pandas DataFrame or as a list of dicts.

#### Using Dictionary Input

Instead of passing separate `doc_labels` and `class_labels` lists, you can pass documents as a list of dictionaries if you already have your data in that format. Just pass the list with the `docs` parameter:

In [None]:
doc_dicts = [
    {"doc": docs[0], "doc_label": "Dracula", "class_label": "gothic"},
    {"doc": docs[1], "doc_label": "Frankenstein", "class_label": "gothic"},
    {"doc": docs[2], "doc_label": "Alice", "class_label": "whimsy"},
    {"doc": docs[3], "doc_label": "Peter Pan", "class_label": "whimsy"}
]

results = compare.documents_to_classes(docs=doc_dicts)
results


Note that only the `doc` and `doc_label` keys are used for `document_to_corpus()`. The `class_label` key (if present) will be ignored.

If `doc_label` values are not available in the dict, `documents_to_classes()` and `classes_to_classes()` will supply "Doc 1", "Doc 2", "Doc 3", etc.

#### Using spaCy `Doc` Extensions

If given a list of spaCy `Doc` objects, the `Compare` class will attempt to extract class values from custom extensions before trying other methods. For instance, if you supply the class label "author", `Compare` will first try to assign values for each `Doc` from its `._.author` extension. If that fails, the value "author" will be assigned as the class label for the doc.

!!! Note
    The class does not support nested dictionaries like `{"metadata": {"author": "Shakespeare", "language": "en"}}`. If you have metadata in this form, you can convert it to a class instance. Here is a simple way to do this:

    ```python
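    from dataclasses import dataclass

    from spacy.tokens import Doc

    @dataclass
    class Metadata:
        author: str
        language: str

    # Register the custom extension before assigning to it
    Doc.set_extension("metadata", default=None, force=True)
    doc._.metadata = Metadata("Shakespeare", "en")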
    ```

    You can now use dot notation for the nested attributes:

    ```python
    results = compare.documents_to_classes(
        docs=docs,
        class_labels=["_.metadata.author"]
    )
    ```
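
The lookup-with-fallback behaviour described in this section can be pictured with the following sketch. It uses `SimpleNamespace` objects as stand-ins for a spaCy `Doc` and its `._` namespace so that it runs without spaCy; the `resolve_class_label` function is hypothetical and is not the actual `Compare` implementation.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_class_label(doc, label):
    """Look up `label` as a (possibly dotted) attribute path on `doc._`.

    Falls back to the literal label string when the path cannot be resolved.
    """
    # Allow both "author" and "_.metadata.author" style labels
    path = label[2:] if label.startswith("_.") else label
    try:
        return reduce(getattr, path.split("."), doc._)
    except AttributeError:
        return label


# Stand-in for a Doc with `author` and `metadata` extensions set
doc = SimpleNamespace(
    _=SimpleNamespace(
        author="Shakespeare",
        metadata=SimpleNamespace(author="Shakespeare", language="en"),
    )
)

print(resolve_class_label(doc, "author"))
print(resolve_class_label(doc, "_.metadata.language"))
print(resolve_class_label(doc, "genre"))
```

The last call has no matching attribute, so the label string itself is returned, mirroring the fallback described above.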

#### Helper Methods

The `Compare` class provides two helper methods for getting information about your data once it has been populated.

The `get_class()` method takes a document label and returns the name of the class to which the document belongs (if available).

The `get_docs_by_class()` method returns a dict containing all documents grouped by class (the dictionary key). If you supply a `class_label` the output will be restricted to only documents with that label.

You can run some examples below:

In [None]:
# Get the class for a specific document
doc_class = compare.get_class("Dracula")
print(f"Class for Dracula: {doc_class}")

# Get all documents grouped by class
docs_by_class = compare.get_docs_by_class()
print(f"All classes: {list(docs_by_class.keys())}")

# Get documents for a specific class
gothic_docs = compare.get_docs_by_class(class_label="gothic")
print(f"Documents in 'gothic' class: {gothic_docs}")

### Example Experiments with the `topwords` Module


- **Extracting representative keyterms from a single literary work**  
    Use `KeyTerms` to identify the most important terms or phrases in a novel, article, or essay.

- **Comparing distinguishing words between two or more texts**  
    Use `ZTest` to find words that are statistically significant in one document (or set) compared to a background corpus.

- **Analyzing topic drift across chapters or sections**  
    Apply `KeyTerms` or `ZTest` to different segments of a text to observe how key terms change over time.

- **Identifying authorial style or signature vocabulary**  
    Compare works by different authors to find words or phrases that are characteristic of each.

- **Studying the effect of preprocessing choices**  
    Experiment with parameters like `remove_stopwords`, `remove_punct`, `ngrams`, and `case_sensitive` to see how they affect the output.

- **Building custom keyterm extraction pipelines**  
    Integrate the module with other NLP tools (e.g., sentiment analysis, entity recognition) for richer text analysis.

- **Exploring genre or period-specific vocabulary**  
    Use the Z-test approach to compare texts from different genres or historical periods.

- **Evaluating the impact of different spaCy models**  
    Swap out the `model` parameter to see how different language models affect tokenization and keyterm extraction.
<a id='exp'></a>

### Visualizing Topwords with Word Clouds 

A word cloud visually represents the frequency or importance of words. Larger words indicate higher significance. This is particularly useful for quickly grasping the most prominent keyterms from `KeyTerms` or the distinguishing terms from `ZTest`. The code samples below use the Lexos `cloud` module to generate word clouds of terms calculated by Topwords.

#### Word Cloud from `KeyTerms` Output

You can use the output from `KeyTerms` (a list of tuples with term and score) to generate a word cloud. The example below converts this list to a pandas DataFrame and passes it to the `WordCloud` class. The score from Textacy determines the size of the words in the cloud. See the Lexos `cloud` module documentation for information on how to customize the word cloud.

In [None]:
# We reload some resources in case they are not available
import pandas as pd
from lexos.io.loader import Loader
from lexos.tokenizer import Tokenizer
from lexos.topwords.keyterms import KeyTerms
from lexos.visualization.cloud import WordCloud

paths = ["txt_files/Dracula by Bram Stoker.txt"]
loader = Loader()
loader.load(paths)

# We take only the beginning of the text to speed up processing for this tutorial
texts = [text[0:10000] for text in loader.texts]

# Tokenize the text
tokenizer = Tokenizer(model="en_core_web_sm")
docs = list(tokenizer.make_docs(texts))

# Generate the key terms
kt = KeyTerms(
    document=docs[0],
    method="textrank",
    topn=10,
    ngrams=1,
    normalize="lemma",
)

# Convert the key terms to a pandas DataFrame
df = pd.DataFrame(kt.to_list(), columns=["term", "score"]).set_index(
    "term"
)

wc = WordCloud(data=df, title="Key Terms from Dracula", height=200, width=200, round=150)
wc.show()

#### Word Cloud from `ZTest` Output

Similarly, you can create a word cloud from the `ZTest` output. In this case, the absolute value of the Z-score can determine the size of the words, indicating how strongly the term distinguishes the target from the background corpus.

Note that a couple of hacks are required in the code below. First, we convert the z-scores to absolute values (no negative numbers), which the word cloud function needs to generate term sizes. Second, we strip any line breaks that inadvertently crept into our docs since these generate an error in the word cloud function. For convenience, we perform both of these hacks when converting the z-test output to a pandas DataFrame.

In [None]:
# We reload some resources in case they are not available
import pandas as pd
from lexos.io.loader import Loader
from lexos.tokenizer import Tokenizer
from lexos.topwords.ztest import ZTest
from lexos.visualization.cloud import WordCloud

# Load all three texts
loader = Loader()
loader.load([
    "txt_files/Dracula by Bram Stoker.txt",
    "txt_files/Alices Adventures in Wonderland by Lewis Carroll.txt",
    "txt_files/Romeo and Juliet by William Shakespeare.txt",
])

# We take only the beginning of the text to speed up processing for this tutorial
texts = [text[0:10000] for text in loader.texts]

# Tokenize the texts
tokenizer = Tokenizer(model="en_core_web_sm")
docs = list(tokenizer.make_docs(texts))

# Split the docs into target and comparison sets
target_docs = [docs[0]]
comparison_docs = docs[1:]

# Perform the Z-Test
ztest = ZTest(target_docs=target_docs, comparison_docs=comparison_docs, topn=15)

# Convert list of tuples to a Pandas DataFrame
df = pd.DataFrame(
    [(term, abs(score)) for term, score in ztest.topwords if '\n' not in term],
    columns=["term", "abs_z_score"],
).set_index("term")

# Create and show the word cloud
wc = WordCloud(data=df, title="Distinctive Terms for Dracula", height=200, width=200, round=150)
wc.show()