# Test DTM
   
This notebook shows examples of how to use the `dtm` module.

## Add Lexos to the Jupyter `sys.path`

In [1]:
%run jupyter_local_setup.py ../../../lexos

System path set to `../../../lexos`.


## Import Lexos Modules

In [2]:
from lexos.io.smart import Loader
from lexos import tokenizer
from lexos.cutter.ginsu import Ginsu

## Load data

In [21]:
data = "../test_data/txt/Austen_Pride.txt"
loader = Loader()
loader.load(data)
doc = tokenizer.make_doc(loader.texts[0])

## Cut into Ten Segments

In [61]:
cutter = Ginsu()
pride_segments = cutter.splitn(doc, n=10)
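
The idea of cutting a document into a fixed number of segments can be sketched in plain Python. This is only an illustration of near-equal splitting on a list of tokens. The helper `split_into_n` is hypothetical and is not how `Ginsu.splitn()` is implemented, which may handle remainders differently.

```python
def split_into_n(tokens, n):
    """Split a list of tokens into n contiguous segments of near-equal size."""
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(len(tokens), n)
    segments = []
    start = 0
    for i in range(n):
        # The first `extra` segments take one additional token each.
        end = start + base + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


tokens = "it is a truth universally acknowledged that a single man".split()
segments = split_into_n(tokens, 3)
print([len(segment) for segment in segments])
```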

## Initialise the `DTM` Class

**Arguments:**

- `docs`: A list of spaCy docs or a list of token lists.
- `labels`: A list of string labels for the documents. If not provided, the labels "doc1", "doc2", etc. will be used.

The `DTM` class uses the Textacy `Vectorizer` to build a matrix of vectors. The default settings are as follows:

- `tf_type`: (default = "linear")
- `idf_type`: (default = None)
- `dl_type`: (default = None)
- `norm`: (default = None)
- `min_df`: (default = 1)
- `max_df`: (default = 1.0)
- `max_n_terms`: (default = None)
- `vocabulary_terms`: (default = None)

For explanations of the various parameters, see the <a href="https://textacy.readthedocs.io/en/latest/api_reference/representations.html#vectorizers" target="_blank">Vectorizers</a> in the Textacy documentation.

In [None]:
from lexos.dtm import DTM

dtm = DTM(pride_segments)
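
To make the default settings more concrete, here is a minimal sketch of what a raw-count ("linear") document-term matrix with a `min_df` threshold looks like when built by hand from token lists. The function `build_counts` is a hypothetical stand-in for illustration. It does not use Lexos or Textacy, and it treats `min_df` only as an absolute number of documents.

```python
from collections import Counter


def build_counts(docs, min_df=1):
    """Return (terms, matrix), where matrix[i][j] is the count of terms[j] in docs[i]."""
    counts = [Counter(doc) for doc in docs]
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    terms = sorted(t for t, n in doc_freq.items() if n >= min_df)
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


docs = [["a", "b", "a"], ["b", "c"], ["a", "b", "d"]]

terms, matrix = build_counts(docs)                       # keep every term
common_terms, common_matrix = build_counts(docs, min_df=2)  # drop terms found in one doc only
print(terms, matrix)
print(common_terms, common_matrix)
```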

### `DTM.set_vectorizer()`

Any of the vectorizer settings above can be changed with the `DTM.set_vectorizer()` method. However, it is important to rebuild the matrix after any change using `DTM.build()`.

Note: It has not yet been tested whether this works for changing labels.

In [None]:
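# Textacy's `tf_type` accepts "linear", "sqrt", "log", or "binary";
# tf-idf weighting is requested through `idf_type` instead.
dtm.set_vectorizer(idf_type="smooth", max_n_terms=100)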
dtm.build()
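
The effect of an inverse document frequency setting can be illustrated with a small sketch. The formula below is one common smoothed formulation and is an assumption for illustration. The exact weighting Textacy applies for each `idf_type` should be checked in its documentation. The helper names are hypothetical.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency (one common formulation)."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0


def tfidf_row(counts, doc_freqs, n_docs):
    """Weight one document's raw counts by each term's idf."""
    return {t: c * smooth_idf(n_docs, doc_freqs[t]) for t, c in counts.items()}


n_docs = 3
doc_freqs = {"the": 3, "entail": 1}  # number of documents containing each term
counts = {"the": 10, "entail": 2}    # raw counts in a single document

weighted = tfidf_row(counts, doc_freqs, n_docs)
print(weighted)
```

A term that occurs in every document keeps its raw count under this formula, while a term confined to a single document is weighted upward.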

### `DTM.get_table()`

Once you have a document-term matrix (DTM), you can access it conveniently as a pandas dataframe. With the default vectorizer settings, this displays a table of raw token counts.


**Arguments:**

- `transpose`: If `True`, terms are columns and docs are rows (default = False).

In [None]:
df = dtm.get_table()
df.head()
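
The two orientations controlled by `transpose` can be pictured with a tiny hand-built table. This is a sketch on plain lists rather than the dataframe Lexos returns. The terms and counts are made up for illustration.

```python
terms = ["darcy", "elizabeth"]
labels = ["doc1", "doc2", "doc3"]

# One row per term, one column per document.
rows = [[3, 0, 6], [4, 5, 9]]

# Transposed: one row per document, one column per term.
transposed = [list(column) for column in zip(*rows)]

for label, row in zip(labels, transposed):
    print(label, dict(zip(terms, row)))
```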

### `DTM.get_frequency_table()`

If you wish to show relative frequencies, use `DTM.get_frequency_table()`.


**Arguments:**

- `rounding`: The number of digits to round floats (default = 3).
- `as_percent`: Whether to format the frequencies as percentages (default = False).

In [None]:
df = dtm.get_frequency_table()
df.head()
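
As a rough illustration of what a relative frequency table contains, the sketch below divides each count by the total for its document and then applies the `rounding` and `as_percent` options. It assumes frequencies are computed per document. The helper `relative_frequencies` is hypothetical and is not the Lexos implementation.

```python
def relative_frequencies(counts, rounding=3, as_percent=False):
    """Convert one document's raw counts to relative frequencies."""
    total = sum(counts.values())
    result = {}
    for term, count in counts.items():
        freq = count / total if total else 0.0
        if as_percent:
            # Format as a percentage string, e.g. "60.0%".
            result[term] = f"{round(freq * 100, rounding)}%"
        else:
            result[term] = round(freq, rounding)
    return result


counts = {"elizabeth": 6, "darcy": 3, "pemberley": 1}
print(relative_frequencies(counts))
print(relative_frequencies(counts, as_percent=True))
```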

### `DTM.get_stats_table()`

Gets a table with the sum, mean, and/or median calculated for each row.

**Arguments:**

- `stats`: One or more of "sum", "mean", and/or "median" (default = "sum").
- `rounding`: The number of digits to round floats (default = 3).

In [None]:
df = dtm.get_stats_table("sum")
df.head()
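
The per-row statistics can be sketched with the standard library. The example assumes each row holds one term's counts across the documents. The function `row_stats` and the sample counts are hypothetical and only mirror the `stats` and `rounding` arguments described above.

```python
from statistics import mean, median

STAT_FUNCS = {"sum": sum, "mean": mean, "median": median}


def row_stats(table, stats="sum", rounding=3):
    """Compute the requested statistics for each row of a {term: counts} table."""
    if isinstance(stats, str):
        stats = [stats]
    return {
        term: {name: round(STAT_FUNCS[name](counts), rounding) for name in stats}
        for term, counts in table.items()
    }


table = {"darcy": [3, 0, 6], "elizabeth": [4, 5, 9]}
result = row_stats(table, ["sum", "mean", "median"])
print(result)
```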

### `DTM.get_terms()`

Gets an alphabetical list of terms in the DTM.

In [60]:
dtm.get_terms()

### `DTM.get_term_counts()`

Gets a list of the terms in the DTM together with their counts, sorted alphabetically by term by default.

**Arguments:**

- `sort_by`: The column(s) to sort by in order of preference (default = `["terms", "sum"]`).
- `ascending`: Whether to sort values in ascending or descending order (default = True).
- `alg`: The algorithm used for sorting.

By default, terms are sorted according to the locale set in the user's operating system, but this can be changed by setting `alg` to one of the sorting options defined by the Python `natsort` package. A list of options can be found in the <a href="https://natsort.readthedocs.io/en/stable/api.html#natsort.ns" target="_blank">natsort documentation</a>.


In [None]:
dtm.get_term_counts()
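
The interaction between `sort_by` and `ascending` can be sketched with the built-in `sorted()` function. This is an illustration only. It uses `str.casefold()` as a simple stand-in for locale-aware ordering and does not call `natsort`. The sample counts are made up.

```python
term_counts = [("pride", 12), ("Darcy", 40), ("and", 95), ("prejudice", 7)]

# Sort by term first, then by count, in ascending order.
by_term = sorted(term_counts, key=lambda tc: (tc[0].casefold(), tc[1]))

# Sort by count first, then by term, in descending order.
by_count_desc = sorted(
    term_counts, key=lambda tc: (tc[1], tc[0].casefold()), reverse=True
)

print(by_term)
print(by_count_desc)
```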

### `DTM.least_frequent()`

Gets a list of the least frequent terms in the DTM.

**Arguments:**

- `max_n_terms`: The number of terms to return (default = 100).
- `start`: The start index in the DTM table (default = -1).


**Important:** This function should not be used if `min_df` or `max_df` is set in the vectorizer because the table will be cut twice.


In [None]:
dtm.least_frequent(max_n_terms=20)

### `DTM.most_frequent()`

Gets a list of the most frequent terms in the DTM.

**Arguments:**

- `max_n_terms`: The number of terms to return (default = 100).
- `start`: The start index in the DTM table (default = 0).


**Important:** This function should not be used if `min_df` or `max_df` is set in the vectorizer because the table will be cut twice.


In [None]:
dtm.most_frequent(max_n_terms=20)
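
The warning about the table being cut twice can be illustrated with a small sketch. The functions below are hypothetical simplifications that rank a dictionary of term counts. They are not the Lexos methods, and the `start` argument of `least_frequent()` is left out. The `min_df` filter is simulated by hand from made-up document frequencies.

```python
def most_frequent(term_counts, max_n_terms=100, start=0):
    """Return (term, count) pairs ranked from highest to lowest count."""
    ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[start:start + max_n_terms]


def least_frequent(term_counts, max_n_terms=100):
    """Return (term, count) pairs ranked from lowest to highest count."""
    ranked = sorted(term_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return ranked[:max_n_terms]


counts = {"the": 50, "of": 30, "darcy": 12, "ball": 4, "entail": 1}
doc_freq = {"the": 10, "of": 10, "darcy": 8, "ball": 2, "entail": 1}

# First cut: simulate a vectorizer with min_df=2, which drops "entail".
kept = {t: c for t, c in counts.items() if doc_freq[t] >= 2}

# Second cut: the rarest terms are now taken from the already reduced table.
print(least_frequent(counts, max_n_terms=2))
print(least_frequent(kept, max_n_terms=2))
```

In this sketch the rarest term overall never appears in the second result, because the first cut removed it before the ranking was made.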