# Tutorial for `DTM` Module

A document-term matrix (DTM) is a standard format for analysing and extracting information from document data. It consists of term (unique token) counts for each document, arranged as a matrix. In many real-world text collections, most documents contain only a small fraction of all possible terms, so the matrix is sparse (most of its values are zero), which allows for highly efficient storage and processing. The DTM is fundamental in text analysis because it transforms a collection of documents into a structured, numerical format. This makes it possible to apply statistical and algorithmic techniques to text, such as identifying common terms, comparing document similarity, and providing input for machine learning models.

In this tutorial, we will learn what a document-term matrix is, what it is used for, and what the `dtm` module can do. Along the way, we will use real texts to create our own DTM and explore what can be done with it.
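Before turning to Lexos, it may help to see the idea of sparsity in miniature. The sketch below uses only the Python standard library and a made-up set of three tiny documents (the data and all variable names are illustrative assumptions, not part of Lexos). It stores only the non-zero cells of a DTM as `(row, column, count)` triples and compares that with the number of cells a full matrix would need.

```python
from collections import Counter

# Three tiny, already-tokenised documents (hypothetical example data)
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "dog"],
]

# Assign each unique term a column index
vocabulary = sorted({term for tokens in docs for term in tokens})
column = {term: j for j, term in enumerate(vocabulary)}

# Sparse storage: keep only the non-zero cells as (row, column, count) triples
triples = [
    (i, column[term], count)
    for i, tokens in enumerate(docs)
    for term, count in Counter(tokens).items()
]

# Compare with the number of cells a dense matrix would need
dense_cells = len(docs) * len(vocabulary)
print(f"{len(triples)} non-zero cells out of {dense_cells}")
```

Even in this toy case fewer than half of the cells hold a count, and the gap widens quickly as the vocabulary grows.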

## Preparing Your Documents

A document-term matrix is constructed by counting how many times each unique token form (or term) occurs in each document. This procedure requires the input documents to be tokenised already. To illustrate, we'll use a toy dataset consisting of the first sentence in each of three novels.

In [None]:
toy_data = {
    "Metamorphosis": "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.",
    "Jekyll_and_Hyde": "Mr. Utterson the lawyer was a man of a rugged countenance that was never lighted by a smile; cold, scanty and embarrassed in discourse; backward in sentiment; lean, long, dusty, dreary and yet somehow lovable.",
    "Alice_in_Wonderland": "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”",
}

# Get a list of labels and texts from the toy dataset
labels = list(toy_data.keys())
texts = list(toy_data.values())

# Import the Lexos Tokenizer class
from lexos.tokenizer import Tokenizer

# Tokenize the texts
tokenizer = Tokenizer(model="en_core_web_sm")
docs = tokenizer.make_docs(texts)

# Create lists of doc tokens, removing spaces, punctuation, and digits
doc_tokens = []
for doc in docs:
    tokens = [token.text for token in doc if not (token.is_space or token.is_punct or token.is_digit)]
    doc_tokens.append(tokens)

# Print the cleaned documents with their labels
for i, doc in enumerate(doc_tokens):
    print(f"{labels[i]}:")
    print(f"{doc}\n")
print(labels)

We're now ready to build a DTM.

In [None]:
# Import the DTM class
from lexos.dtm import DTM

# Initialize the DTM model
dtm = DTM() # vts are converted to a mapping
# Build the DTM using the cleaned documents and labels
dtm(doc_tokens, labels)

# Print the shape of the document-term matrix
print(dtm.shape)
print(f"{dtm.shape[0]} rows (docs), {dtm.shape[1]} columns (terms)")

*Voilà*! We now have a DTM. By default, `DTM` creates a document-term matrix where each column represents a single term and each cell contains the **raw count** of how many times that term appears in the document.

The output of `dtm.shape` shows us that we now have a matrix with 3 rows, corresponding to our 3 documents, and 80 columns, corresponding to the 80 unique terms found in the collection.

If we wish to see the terms, we can call `dtm.sorted_terms_list`.

In [None]:
print("List of Terms:\n")
print(dtm.sorted_terms_list)

This terms list is also known as the **vocabulary**. If we wish to see the counts for each item in the vocabulary, we can call `dtm.sorted_term_counts`.

In [None]:
print("Term Counts:\n")
for k, v in dtm.sorted_term_counts.items():
    print(f"{k}: {v}")

These properties can be useful for a quick overview, but it is even more useful to convert the document-term matrix to a pandas DataFrame with the `to_df()` method:

In [None]:
df = dtm.to_df()

print("First Ten Rows of the Document-Term Matrix:\n")
df.head(10)

Nice! We can already make some observations. `Alice_in_Wonderland` (or at least its first sentence) has a much larger term count than the other texts, probably because it is longer. In the first ten rows of the matrix, `Jekyll_and_Hyde` shares only two items of vocabulary with it, and `Metamorphosis` shares none. If these observations seem odd, keep in mind that we have only the first sentence of each text and are viewing only the first ten rows of our DTM -- so we shouldn't expect very meaningful results.

The `to_df()` method takes a number of parameters that allow us to manipulate the DataFrame.

- **Sort by a specific column:**  
    Use the `by` parameter to sort the DataFrame by a particular document label (e.g., `by="Metamorphosis"`). By default, the DataFrame is sorted by the first label.

- **Control sort order:**  
    Set `ascending=False` to sort in descending order (largest to smallest values). The default is `ascending=True` for ascending order.

- **Display percentages instead of raw counts:**  
    Set `as_percent=True` to show term frequencies as percentages of the total terms in each document, rather than raw counts. This is useful for comparing documents of different lengths.

- **Adjust decimal precision:**  
    Use the `rounding` parameter to specify the number of decimal places for percentages (e.g., `rounding=2` for two decimals). The default is 1 decimal place.

- **Add row statistics:**  
    Include summary statistics for each term across all documents by setting `sum=True` (total count), `mean=True` (average count), or `median=True` (median count). These columns help you quickly identify the most common or distinctive terms.

- **Transpose the matrix:**  
    Set `transpose=True` to swap rows and columns, so that documents become rows and terms become columns. This can be helpful for certain types of analysis or visualization.

Once the DataFrame has been created, it can be further manipulated using the pandas interface. **Important:** the `transpose` parameter is applied *before* the other parameters, which may cause errors when they are combined. If you need to transpose a DataFrame after it has been created, it is best to use the pandas `transpose()` method (or the alternative `T` property), as shown in the example below.

For this example, we'll display term frequencies as percentages, rounded to two decimal places, and we'll include a "Total" count column for each term. This format makes it easy to compare term usage across documents of different lengths and quickly identify the most frequent terms in the corpus. We'll use pandas to transpose the DataFrame at the end.


In [None]:
# Create the DataFrame with percentages and totals
df_pct = dtm.to_df(
    as_percent=True,
    rounding=2,
    sum=True,
)

# Transpose the DataFrame using Pandas
df_pct = df_pct.transpose() # Or use df_pct = df_pct.T

# Display the DataFrame with percentages and totals
df_pct
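To see why percentages make documents of different lengths easier to compare, here is a small standard-library sketch. It is not how Lexos computes its output. The two documents, the variable names, and the choice to divide by each document's token count are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical token lists standing in for two documents of different lengths
docs = {
    "short_doc": ["the", "cat", "sat", "the"],
    "long_doc": ["the", "dog", "and", "the", "cat", "and", "the", "bird"],
}

counts = {label: Counter(tokens) for label, tokens in docs.items()}
vocabulary = sorted({term for tokens in docs.values() for term in tokens})

# One row per term: its share of each document's tokens, as a percentage
# rounded to two decimal places
rows = {
    term: {
        label: round(100 * counts[label][term] / len(docs[label]), 2)
        for label in docs
    }
    for term in vocabulary
}

# Sort the rows by one document's column, largest values first
ranked = sorted(rows.items(), key=lambda item: item[1]["long_doc"], reverse=True)
for term, row in ranked:
    print(f"{term:>5}  {row}")
```

Here "the" occurs more often in `long_doc` by raw count (3 versus 2), yet it makes up a larger share of `short_doc` (50.0 versus 37.5 percent), which is the kind of difference raw counts can hide.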

## Using the Vectorizer

We've now seen how to create a DTM and to display the information generated. But what is really happening under the hood?

Lexos feeds your documents to a **vectorizer**, a function that counts the number of times each item in the vocabulary occurs in each document. The vectorizer performs three tasks:

- Transform textual data into numerical representations called **vectors**.
- Create vector representations that capture the meaning or characteristics of the original data.
- Prepare the data for machine learning tasks which require numerical, rather than textual, input.
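The tasks listed above can be sketched in a few lines of plain Python. This is a simplified illustration only, not Textacy's or Lexos's implementation. The function names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter


def fit_vocabulary(token_lists):
    """Map each unique term to a column index (sorted for reproducibility)."""
    terms = sorted({term for tokens in token_lists for term in tokens})
    return {term: index for index, term in enumerate(terms)}


def transform(token_lists, vocabulary):
    """Turn each token list into a count vector aligned with the vocabulary."""
    vectors = []
    for tokens in token_lists:
        vector = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            # Terms that are not in the vocabulary are ignored
            if term in vocabulary:
                vector[vocabulary[term]] = count
        vectors.append(vector)
    return vectors


token_lists = [["to", "be", "or", "not", "to", "be"], ["to", "do", "or", "not"]]
vocabulary = fit_vocabulary(token_lists)
vectors = transform(token_lists, vocabulary)

print(vocabulary)
print(vectors)

# A new document is described only in terms of the existing vocabulary
print(transform([["to", "sleep"]], vocabulary))
```

Separating the vocabulary-building step from the counting step is a common design, because it lets new documents be vectorized against a vocabulary that has already been fixed.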

By default, the Lexos `DTM` class borrows its vectorizer from the Python Textacy library. Although you can substitute another vectorizer or your own custom vectorizer, doing so goes beyond the scope of this tutorial. Instead, we will look at how to make use of Textacy's vectorizer in a Lexos DTM. You can access Textacy's `Vectorizer` class as an attribute of a `DTM` instance:

In [None]:
print(f"Vectorizer Weighting Scheme: '{dtm.vectorizer.weighting}'")

The weighting scheme "tf" stands for "term frequency". We'll discuss the weighting scheme further below. Other useful `Vectorizer` properties are `terms_list` and `vocabulary_terms`, which output unsorted versions of the `DTM` class's `sorted_terms_list` and `sorted_term_counts` properties.

In [None]:
print("Vectorizer Terms List (first 10 items):\n")
print(dtm.vectorizer.terms_list[:10])

print()

print("Vectorizer Vocabulary Terms:\n")
print(dtm.vectorizer.vocabulary_terms)


### Normalization and Weighting

The real power of the vectorizer is in its ability to apply normalization and weighting functions to the raw counts. **Normalization** refers to a calculation that scales the count against some reference quantity, for instance, expressing a term's count as a percentage of the total number of terms in the document. **Weighting** refers to a calculation that makes the count more or less prominent based on some criterion. For instance, we have seen above that _Alice_in_Wonderland_ is longer than our other two documents. It therefore might seem reasonable to adjust the numbers representing term frequencies to account for document length.

A common method of combining these actions is to calculate the term frequency-inverse document frequency, or TF-IDF. We can do this by adjusting the weighting and normalization parameters available in Textacy's `Vectorizer` class. Let's try this by adjusting our vectorizer settings and then re-building our DTM:

In [None]:
# Create a new DTM instance with different weighting and normalization settings
dtm = DTM(tf_type="linear", idf_type="smooth", dl_type="log", norm="l2")

# Generate the DTM again with the new settings
dtm(doc_tokens, labels)

# Show a mathematical representation of the new weighting scheme
print(f"Vectorizer Weighting Scheme: '{dtm.vectorizer.weighting}'")

# Display the DTM as a DataFrame
dtm.to_df(sum=True, by="Total", ascending=False).head(10)

There are two ways you can achieve the same effect without creating a new `DTM` instance. The first method is to modify the `Vectorizer` attributes directly:

```python
dtm.vectorizer.tf_type="linear"
dtm.vectorizer.idf_type="smooth"
dtm.vectorizer.dl_type="log"
dtm.vectorizer.norm="l2"
```

The second method is to use the same keywords when you generate the DTM:

```python
dtm(doc_tokens, labels, tf_type="linear", idf_type="smooth", dl_type="log", norm="l2")
```

It's really up to you which one you prefer.


The Textacy `Vectorizer` class has four parameters for controlling weighting and normalization:

- `tf_type`: Controls how term frequencies are calculated (e.g., raw counts, log-scaled, binary presence/absence).
- `idf_type`: Controls how inverse document frequency scaling is applied (for TF-IDF weighting).
- `dl_type`: Controls normalization based on document length.
- `norm`: Applies vector normalization, scaling the term frequencies in each document so that either the sum of their absolute values equals 1 (`"l1"`) or the sum of their squares equals 1 (`"l2"`).
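To give a feel for how these pieces combine, the following sketch computes TF-IDF weights by hand using only the standard library. It assumes linear term frequency, one commonly used smoothed IDF formula, and L2 normalization. It omits the document-length component, and Textacy's exact formulas may differ, so treat it as an illustration of the idea rather than a reproduction of the library's output. The documents and names are made up.

```python
import math
from collections import Counter

# Hypothetical tokenised documents
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "played"],
]
n_docs = len(docs)
vocabulary = sorted({term for tokens in docs for term in tokens})

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in tokens for tokens in docs) for term in vocabulary}

# Smoothed inverse document frequency (one common formulation, assumed here)
idf = {
    term: math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1
    for term in vocabulary
}


def tfidf_vector(tokens):
    """Weight linear term frequencies by IDF, then scale the row to unit L2 length."""
    tf = Counter(tokens)
    weights = [tf[term] * idf[term] for term in vocabulary]
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]


matrix = [tfidf_vector(tokens) for tokens in docs]

print(vocabulary)
for row in matrix:
    print([round(w, 3) for w in row])
```

Terms that occur in fewer documents (such as "mat") receive a larger IDF than terms shared across documents (such as "cat"), and after L2 normalization the squares of each row's values sum to 1.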

A discussion of the available values for these parameters is beyond the scope of this tutorial. You are encouraged to consult the Lexos or Textacy documentation for information on the full range of options. However, Textacy's documentation provides the following useful advice for applying weighting and normalization:

> In general, weights may consist of a local component (term frequency), a global component (inverse document frequency), and a normalization component (document length). Individual components may be modified: they may have different scaling (e.g. tf vs. sqrt(tf)) or different behaviors (e.g. "standard" idf vs bm25's version). There are *many* possible weightings, and some may be better for particular use cases than others. When in doubt, though, just go with something standard.

### Culling

Another common use of the vectorizer is to cull the dataset. **Culling** refers to reducing the size of the matrix to include only part of the data. In many cases, you will want to cull terms from your DTM in order to reduce the size of the data or to remove terms which you think might not be meaningful for your research. A common form of culling is to restrict the data to the *n* most-frequently occurring terms. You can do this with the `max_n_terms` parameter. You can also restrict your data to terms occurring in a minimum number of documents with `min_df` or a maximum number of documents with `max_df`. Here is an example using all three:

```python
dtm = DTM(max_n_terms=100, min_df=2, max_df=5)
```

As with the weighting and normalization parameters, you can set culling attributes directly on the vectorizer or when you generate the DTM. Here are some quick examples:

```python
# Configure the DTM vectorizer directly
dtm.vectorizer.min_df=2

# Set the parameters when calling the DTM instance
dtm(docs=docs, labels=labels, max_df=5)
```

Feel free to use whichever approach you find most comfortable.

In the cell below, you can try out culling, although there are limitations to how much we can cull with a small dataset (try changing `min_df` to 3 to see what happens).

In [None]:
dtm = DTM()
dtm(doc_tokens, labels, max_n_terms=10, min_df=2)
dtm.to_df(sum=True, by="Total", ascending=False)
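If it is unclear how the document-frequency limits interact, the following standard-library sketch may help. The `cull` function is hypothetical and is not part of Lexos or Textacy. In particular, ranking the top *n* terms by total frequency (with ties broken alphabetically) is an assumption of this sketch, and the library may rank terms differently.

```python
from collections import Counter


def cull(token_lists, min_df=1, max_df=None, max_n_terms=None):
    """Return the set of terms kept after document-frequency and top-n culling."""
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Total frequency: number of occurrences across the whole collection
    total_freq = Counter(term for tokens in token_lists for term in tokens)

    upper = len(token_lists) if max_df is None else max_df
    kept = [term for term, df in doc_freq.items() if min_df <= df <= upper]

    if max_n_terms is not None:
        # Keep the most frequent terms; ties are broken alphabetically
        kept = sorted(kept, key=lambda term: (-total_freq[term], term))[:max_n_terms]
    return set(kept)


# Hypothetical tokenised documents
token_lists = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["a", "cat", "and", "a", "dog"],
]

print(sorted(cull(token_lists, min_df=2)))
print(sorted(cull(token_lists, min_df=3)))
print(sorted(cull(token_lists, min_df=2, max_n_terms=2)))
```

With only three short documents, raising `min_df` to 3 leaves nothing in this sketch, because no term occurs in every document. Small datasets leave little room for culling.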

Another method of culling is to specify the exact vocabulary you wish to appear in the matrix with the `vocabulary_terms` parameter. Here's how to use it:

In [None]:
vocabulary_terms = ["a", "an", "the"]

# You can also pass a mapping of terms to indices like this:
# {"a": 0, "an": 1, "the": 2}

dtm = DTM(vocabulary_terms=vocabulary_terms)
dtm(doc_tokens, labels)

dtm.to_df(sum=True, by="Total", ascending=False)

## Visualizing the DTM

The DataFrame output from `to_df()` makes it easy to visualize your document-term matrix using standard pandas plotting methods. The bar chart below displays the top 20 most frequent terms across our entire corpus of classical texts. Each bar represents a term, and its height corresponds to the total number of times that term appears in all documents combined.

In [None]:
dtm = DTM()
dtm(doc_tokens, labels)

# Get the first 20 rows of the DTM as a DataFrame sorted by sum
df = dtm.to_df(sum=True, by="Total", ascending=False)[0:20]

# Plot the DataFrame
df.Total.plot(
    kind="bar",
    title="Top 20 Most Frequent Terms",
    xlabel="Terms",
    ylabel="Frequency"
)

We can even use Lexos' own `cloud` module to generate a word cloud of the most frequent terms in our corpus. In this visualization, each word's size is proportional to its overall frequency in the corpus—the larger the word appears, the more often it occurs.

In [None]:
from lexos.visualization.cloud import WordCloud

wc = WordCloud(data=dtm)
wc.show()

## Workflow

As a final example, let's load a larger dataset consisting of 11 literary texts acquired from the [Project Gutenberg](https://www.gutenberg.org/) website and saved locally in our `txt_files` directory. Tokenizing novels can take a lot of time, so we will work with just 5000 characters from each text (starting at character 2000 to avoid the Project Gutenberg headers).

In [None]:
# Import the Loader class
from lexos.io.loader import Loader

# We loaded the Tokenizer class above, but this is in case you are skipping ahead and need to import it again
from lexos.tokenizer import Tokenizer

# Load the files
loader = Loader()
loader.load(["txt_files"])

# Get labels for each text
labels = loader.names

# Limit each text to characters 2000 to 7000
# This is to avoid memory issues with very large texts
texts = [text[2000:7000] for text in loader.texts]

# Tokenize the documents - this may take a while
tokenizer = Tokenizer(model="en_core_web_sm")
docs = list(tokenizer.make_docs(texts))

Now we'll make the DTM. Notice that, since spaCy docs are iterable, they can be treated just like lists of tokens. In other words, you can submit a list of spaCy docs to the `DTM` class. But the results may not be what you want.

In [None]:
dtm = DTM()
dtm(docs, labels)
dtm.to_df(sum=True, by="Total", ascending=False).head(10)

OK, there are some definite problems here. For one thing, our labels are pretty unwieldy. We should probably create shorter versions, such as using the author's name.

We can also see lots of "junk" in our terms list. This includes spaces, line breaks, punctuation marks, and probably digits. Looking at the most frequently occurring words, we can see that they are all typical stop words. Let's clean up the data to remove these unwanted items.

In [None]:
# Rename the labels for clarity
labels = ["Kafka", "Stevenson", "Carroll", "Stoker", "Nietzsche", "Austen", "Doyle", "Fitzgerald", "Shelley", "Shakespeare", "Wilde"]

# Clean the docs
cleaned_docs = []
for doc in docs:
    tokens = [token.text for token in doc if not (token.is_space or token.is_punct or token.is_digit or token.is_stop)]
    cleaned_docs.append(tokens)

dtm = DTM()
dtm(cleaned_docs, labels)
dtm.to_df(sum=True, by="Total", ascending=False).head(10)

Now that's more like it. It generally pays to create token lists for each document. This technique also allows you to use n-grams. For instance, here is how we might make a DTM of bigrams. We'll only display the first eight documents so that the DataFrame is not too wide for the screen.

In [None]:
# Import the Ngrams class for creating n-grams
from lexos.tokenizer.ngrams import Ngrams

# Create n-grams from our documents
ng = Ngrams()
ngrams = ng.from_docs(docs=docs)

# Convert the n-grams to lists of tokens
ngrams = [list(doc) for doc in ngrams]

# Generate a DTM using the n-grams
dtm = DTM()
dtm(ngrams[0:8], labels[0:8])
dtm.to_df(sum=True, by="Total", ascending=False).head(10)
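For readers who want to see what an n-gram is without relying on the `Ngrams` class, here is a minimal standard-library sketch. The `make_ngrams` function is hypothetical, and joining tokens with a space is an assumption. The Lexos class may represent n-grams differently.

```python
def make_ngrams(tokens, n=2):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# The opening tokens of our first toy sentence, with punctuation removed
tokens = ["One", "morning", "when", "Gregor", "Samsa", "woke"]

bigrams = make_ngrams(tokens, n=2)
trigrams = make_ngrams(tokens, n=3)

print(bigrams)
print(trigrams)
```

Each n-gram can then be treated as a single term, so a list of n-grams per document can be counted in the same way as a list of individual tokens.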
