# `DTM` Tutorial
   
This notebook shows examples of how to use the `dtm` module.

A "document" is a generic term for any text, but in the context of this module, it refers to a tokenised text. Once a document has been tokenised, you can count how many times each term (unique token) occurs in it. You can list these counts in a table, with each term as a column header and its count as the value:

| Doc                   | a    | abandoned | abatement | abbeyland | ... |
|-----------------------|------|-----------|-----------|-----------|-----|
| Pride_and_Prejudice   | 1940 | 0         | 1         | 0         | ... |

If you have more than one document, you can put each one on a separate row:

| Doc                   | a    | abandoned | abatement | abbeyland | ... |
|-----------------------|------|-----------|-----------|-----------|-----|
| Pride_and_Prejudice   | 1940 | 0         | 1         | 0         | ... |
| Sense_and_Sensibility | 2042 | 1         | 1         | 1         | ... |

This is called a document-term matrix, or DTM for short. DTMs are typically a starting point for computational forms of text analysis. In this tutorial, we will learn how to use Lexos to generate a DTM and obtain statistics from its matrix.

It is also important to note that each term is represented by a number, in this case, the number of times each term occurs. A list of such numeric representations is called a _vector_, and the task of transforming a list of document tokens into a DTM is performed by a _vectorizer_. The tables above use a very simple vectorizer to identify raw term counts, but a vectorizer's algorithm can be modified with various weighting and other properties to take account of specific circumstances such as variations in document length. 
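
To make the idea concrete, here is a minimal sketch of a raw-count vectorizer written in plain Python. It is only an illustration, not how Lexos or Textacy build their matrices, and the two toy documents and all the `toy_` names are invented for this example.

```python
from collections import Counter

# Two toy "documents", already tokenised (hypothetical data for illustration)
toy_docs = {
    "doc1": ["the", "cat", "sat", "on", "the", "mat"],
    "doc2": ["the", "dog", "sat"],
}

# Count how often each token occurs in each document
toy_counts = {label: Counter(tokens) for label, tokens in toy_docs.items()}

# The vocabulary is the sorted set of all terms across the documents
toy_vocabulary = sorted(set().union(*toy_docs.values()))

# Each row is one document's vector of raw term counts
toy_matrix = [
    [toy_counts[label][term] for term in toy_vocabulary] for label in toy_docs
]

print(toy_vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for label, row in zip(toy_docs, toy_matrix):
    print(label, row)
```

Each printed row corresponds to one line of the tables above: one document, with one count per term in the shared vocabulary.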


## Prepare Sample Data

For this tutorial, we will load Jane Austen's _Pride and Prejudice_, tokenise it, and then cut it into ten segments, which we'll treat as ten separate documents.

In [None]:
import re
from lexos.io.smart import Loader
from lexos import tokenizer
from lexos.cutter.ginsu import Ginsu

loader = Loader()
loader.load("../test_data/txt/Austen_Pride.txt")
# Replace line breaks (Windows or Unix style) with spaces
text = re.sub(r"\r\n|\n", " ", loader.texts[0]).strip()
doc = tokenizer.make_doc(text)

cutter = Ginsu()
docs = cutter.splitn(doc, n=10)

## Initialise the `DTM` Class

**Arguments:**

- `docs`: A list of spaCy docs or a list of token lists.
- `labels`: A list of string labels for the documents. If not provided, the labels "doc1", "doc2", etc. will be used.

The `DTM` class uses the Textacy package's `Vectorizer` to build a matrix of vectors. The default settings are as follows:

- `tf_type`: "linear"
- `idf_type`: None
- `dl_type`: None
- `norm`: None
- `min_df`: 1
- `max_df`: 1.0
- `max_n_terms`: None
- `vocabulary_terms`: None

For explanations of the various parameters, see the <a href="https://textacy.readthedocs.io/en/latest/api_reference/representations.html#vectorizers" target="_blank">Vectorizers</a> in the Textacy documentation.

In [None]:
from lexos.dtm import DTM

# Without labels
dtm = DTM(docs)
print(f"Default labels: {dtm.labels}")

# With labels
labels = ["Pride1", "Pride2", "Pride3", "Pride4", "Pride5", "Pride6", "Pride7", "Pride8", "Pride9", "Pride10"]
dtm = DTM(docs, labels=labels)
print(f"Assigned labels: {dtm.labels}")


### `DTM.set_vectorizer()`

Any of the vectorizer settings above can be changed with the `DTM.set_vectorizer()` method. However, it is important to rebuild the matrix after any change using `DTM.build()` to ensure that your `DTM` instance is updated.

In [None]:
dtm.set_vectorizer(tf_type="sqrt", max_n_terms=100)
dtm.build()
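
As a rough illustration of what a setting like `tf_type="sqrt"` does to the values in the matrix, the sketch below applies a square root to some raw counts in plain Python. It assumes that "sqrt" weighting means taking the square root of each raw term count, which is how the Textacy documentation describes it. The counts and the `toy_` names are made up for this example.

```python
import math

# Hypothetical raw counts for three terms in one document
toy_raw_counts = {"a": 100, "abatement": 4, "abandoned": 0}

# Square-root weighting dampens the influence of very frequent terms
toy_sqrt_weights = {term: math.sqrt(n) for term, n in toy_raw_counts.items()}

print(toy_sqrt_weights)  # {'a': 10.0, 'abatement': 2.0, 'abandoned': 0.0}
```

Under raw counts, "a" is 25 times as heavy as "abatement". After the square root, it is only 5 times as heavy.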

### `DTM.get_table()`

Once you have a DTM, you can access the matrix conveniently (no pills required, Neo!) as a pandas dataframe. With the default vectorizer settings, this displays a table of raw token counts.


**Arguments:**

- `transpose`: If `True`, terms are columns and docs are rows. Default = False.

In [None]:
df = dtm.get_table()
df.head()

### Basic Dataframe Usage

This is a quick cheat sheet for accessing information in a pandas dataframe (referenced as `df`). It will be useful in following the cells below.

- Get a single column as a list: `df["terms"].values.tolist()`
- Get a dataframe with only a few columns: `small_df = df[["terms", "Pride1"]]`
- Get the top 10 rows: `df.head(10)`
- Get the bottom 10 rows: `df.tail(10)`
- Get rows 5-10 in the table: `df[4:10]`
- Sort by specific columns: `sorted_df = df.sort_values(by=["terms", "Pride1"], ascending=False)`
- Save to a CSV file (without the row indices): `df.to_csv("filename.csv", index=False)`
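
If you want to try these operations without building a DTM first, the following sketch runs some of them on a small hand-made dataframe. The data is invented, and the `toy_df` name is only used here so that it does not overwrite the `df` used elsewhere in the notebook.

```python
import pandas as pd

# Toy table shaped like the one described above: a "terms" column
# plus one column of counts per document (values are made up)
toy_df = pd.DataFrame(
    {
        "terms": ["a", "abandoned", "abatement", "abbeyland"],
        "Pride1": [194, 0, 1, 0],
        "Pride2": [201, 1, 0, 0],
    }
)

toy_terms = toy_df["terms"].values.tolist()  # a single column as a list
toy_small_df = toy_df[["terms", "Pride1"]]   # only a few columns
toy_middle = toy_df[1:3]                     # second and third rows
toy_sorted_df = toy_df.sort_values(by=["Pride1", "terms"], ascending=False)

print(toy_sorted_df)
```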

### `DTM.get_freq_table()`

If you wish to show relative frequencies, use `DTM.get_freq_table()`.


**Arguments:**

- `rounding`: The number of digits to round floats. Default = 3.
- `as_percent`: Whether to format the frequencies as percentages. Default = False.

In [None]:
df = dtm.get_freq_table()
df.head()
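
The sketch below shows one common way to compute relative frequencies by hand. It assumes that a relative frequency is a term's count divided by the total number of tokens in the same document, which may not match the Lexos implementation exactly. The `relative_frequencies` helper is hypothetical: it borrows the `rounding` and `as_percent` parameter names only for readability and returns numbers rather than formatted strings.

```python
# Hypothetical raw counts per document
toy_counts = {
    "doc1": {"the": 2, "cat": 1, "sat": 1},
    "doc2": {"the": 1, "dog": 3},
}


def relative_frequencies(term_counts, rounding=3, as_percent=False):
    """Divide each count by the document's token total (illustrative helper)."""
    total = sum(term_counts.values())
    scale = 100 if as_percent else 1
    return {
        term: round(n / total * scale, rounding) for term, n in term_counts.items()
    }


toy_freqs = {label: relative_frequencies(c) for label, c in toy_counts.items()}
print(toy_freqs["doc1"])  # {'the': 0.5, 'cat': 0.25, 'sat': 0.25}
print(relative_frequencies(toy_counts["doc2"], as_percent=True))
```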

### `DTM.get_stats_table()`

If you wish to calculate the sum, mean, and/or median for each row, use `DTM.get_stats_table()`.

**Arguments:**

- `stats`: One or more of "sum", "mean", and/or "median" (use a list if more than one). Default = "sum".
- `rounding`: The number of digits to round floats. Default = 3.

In [None]:
df = dtm.get_stats_table("sum")
df.head()
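
For reference, the three statistics can be worked out by hand for a single row. The sketch below uses Python's standard `statistics` module on invented counts, where each list holds one term's counts across three documents. It illustrates the arithmetic only and is not the code Lexos uses.

```python
from statistics import mean, median

# Hypothetical counts: one list per term, one value per document
toy_table = {
    "a": [4, 2, 6],
    "abandoned": [0, 1, 0],
}

toy_stats = {
    term: {
        "sum": sum(counts),
        "mean": round(mean(counts), 3),
        "median": median(counts),
    }
    for term, counts in toy_table.items()
}

for term, stats in toy_stats.items():
    print(term, stats)
```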

### `DTM.get_terms()`

If you wish to get an alphabetical list of terms in the DTM, use `DTM.get_terms()`. Note that this function returns a generator that provides one value at a time. If you want the whole list, you will need to convert it to a list as shown below.

In [None]:
print("First 20 Terms:\n")
list(dtm.get_terms())[0:20]

### `DTM.get_term_counts()`

If you wish to get a list of terms with their counts, use `DTM.get_term_counts()`. The output is a list of term-count tuples, but you can easily convert this to a pandas dataframe.

**Arguments:**

- `sort_by`: The column(s) to sort by in order of preference. Default = `["terms", "sum"]`
- `ascending`: Whether to sort values in ascending or descending order. Default = True.
- `alg`: The algorithm used for sorting.

Note that `DTM.get_term_counts()` generates a table using `DTM.get_stats_table()`, so it may sort by any column available in that table.

By default, terms are sorted according to the locale of the user's operating system, but this can be changed by setting `alg` to one of the locales used by the Python `natsort` package. A list of locales can be found in the <a href="https://natsort.readthedocs.io/en/stable/api.html#natsort.ns" target="_blank">natsort documentation</a>.
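
To see why the choice of sorting algorithm matters, and what sorting "in order of preference" means, consider the plain-Python sketch below. It uses only the built-in `sorted()` function rather than `natsort`, so it approximates the idea without reproducing the behaviour of `DTM.get_term_counts()`. The terms and counts are invented.

```python
toy_terms = ["banana", "Zebra", "apple"]

# Plain code-point order puts all capitalised terms first
codepoint_order = sorted(toy_terms)
print(codepoint_order)  # ['Zebra', 'apple', 'banana']

# A case-insensitive key gives the order most readers expect
casefold_order = sorted(toy_terms, key=str.casefold)
print(casefold_order)  # ['apple', 'banana', 'Zebra']

# Sorting by several columns: the count first, then the term to break ties
toy_pairs = [("the", 5), ("dog", 2), ("cat", 2)]
by_sum_then_term = sorted(toy_pairs, key=lambda pair: (pair[1], pair[0]))
print(by_sum_then_term)  # [('cat', 2), ('dog', 2), ('the', 5)]
```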


In [None]:
import pandas as pd

print("First 5 terms as a list of tuples:\n")
print(dtm.get_term_counts()[0:5])

print("\nFirst 20 terms in a dataframe:\n")
df = pd.DataFrame(dtm.get_term_counts(), columns=["Term", "Count"])
df.head(20)

### Getting the Most Frequent and Least Frequent Terms

Probably the easiest way to identify the most and least frequent terms is to get their sums using `DTM.get_stats_table()`, sort the resulting table, and get the table's "head" or "tail". This is demonstrated in the cells below.

Be aware that, if you have set the `min_df`, `max_df`, or `max_n_terms` parameters in your vectorizer, you may have already filtered some terms from your DTM.
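
The sketch below illustrates how document-frequency filtering of this kind can remove terms before you ever see them. It assumes the convention described in the Textacy documentation, where an integer threshold is an absolute number of documents and a float is a proportion of all documents. The documents, thresholds, and `toy_` names are invented, and the code is a simplified stand-in for the real vectorizer.

```python
from collections import Counter

# Three tiny tokenised documents (hypothetical data)
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

# Document frequency: the number of documents in which each term appears
toy_doc_freq = Counter(term for tokens in toy_docs for term in set(tokens))

n_docs = len(toy_docs)
toy_min_df = 2    # integer: keep terms found in at least 2 documents
toy_max_df = 0.9  # float: keep terms found in at most 90% of documents

toy_kept = sorted(
    term
    for term, n in toy_doc_freq.items()
    if n >= toy_min_df and n / n_docs <= toy_max_df
)
print(toy_kept)  # ['sat']
```

Here "the" is dropped because it appears in every document, and the terms that occur in only one document are dropped by the minimum threshold, so neither would show up in a later list of most or least frequent terms.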


#### Most Frequent Terms

In [None]:
df = dtm.get_stats_table("sum").sort_values(by=["sum", "terms"], ascending=False)
df.head(10)

#### Least Frequent Terms

In [None]:
df = dtm.get_stats_table("sum").sort_values(by=["sum", "terms"], ascending=False)
df.tail(10)