# `Wordcloud` Tutorial
   
This notebook shows how to produce word cloud visualisations of a document-term matrix (DTM). It begins with a condensed version of the tutorial for the `dtm` module.

## Load Data and Construct a DTM

For this tutorial, we will load Jane Austen's _Pride and Prejudice_, tokenise it, and then cut it into ten segments, which we'll treat as ten separate documents.

In [None]:
# Python imports
import re
from lexos.io.smart import Loader
from lexos import tokenizer
from lexos.cutter.ginsu import Ginsu
from lexos.dtm import DTM

# Load the data
loader = Loader()
loader.load("../test_data/txt/Austen_Pride.txt")
# Replace line breaks (Windows or Unix style) with spaces
text = re.sub(r"\r?\n", " ", loader.texts[0]).strip()

# Make a doc
doc = tokenizer.make_doc(text)

# Cut the doc into 10 segments
cutter = Ginsu()
docs = cutter.splitn(doc, n=10)

# Build a DTM with labels
labels = ["Pride1", "Pride2", "Pride3", "Pride4", "Pride5", "Pride6", "Pride7", "Pride8", "Pride9", "Pride10"]
dtm = DTM(docs, labels=labels)
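
To make the idea of segmenting a text and counting terms per segment concrete, here is a minimal sketch in plain Python. It is not how `Ginsu` or `DTM` are implemented; the helper `split_into_segments()` and the names `toy_tokens`, `toy_labels`, and `toy_dtm` are hypothetical and exist only for illustration. It assumes whitespace tokenisation and near-equal segment sizes, which may differ from what Lexos does with leftover tokens.

```python
from collections import Counter


def split_into_segments(tokens, n):
    """Split a list of tokens into n contiguous segments of near-equal size."""
    size, remainder = divmod(len(tokens), n)
    segments = []
    start = 0
    for i in range(n):
        # The first `remainder` segments each receive one extra token
        end = start + size + (1 if i < remainder else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


# A short sample sentence, tokenised naively on whitespace
toy_tokens = (
    "it is a truth universally acknowledged that a single man in possession "
    "of a good fortune must be in want of a wife"
).split()

toy_segments = split_into_segments(toy_tokens, 3)
toy_labels = [f"Seg{i + 1}" for i in range(len(toy_segments))]

# A toy document-term matrix: one Counter of term counts per labelled segment
toy_dtm = {label: Counter(seg) for label, seg in zip(toy_labels, toy_segments)}

for label, counts in toy_dtm.items():
    print(label, counts.most_common(3))
```

Each labelled segment plays the role of one document, which is the same relationship the ten `Pride` segments have to the DTM built above.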

## Create a Word Cloud

The word cloud is generated using the Python <a href="https://amueller.github.io/word_cloud/" target="_blank">Wordcloud</a> package. Any options used by that library can be passed to the Lexos `wordcloud()` function through the `opts` parameter, as shown below.

The resulting figure is plotted using <a href="https://matplotlib.org/" target="_blank">matplotlib</a>. Its options can be passed to the Lexos `wordcloud()` function with the `figure_opts` parameter.

By default, word clouds are square, but you can round the corners to various degrees using the `round` parameter. You can also choose whether or not to display the word cloud when it is generated using the `show` parameter. If you supply a filename using the `filename` parameter, the word cloud will be saved to an image file in a format determined by the file extension.

In [None]:
from lexos.visualization.cloud.wordcloud import wordcloud

# Options for the Python Wordcloud package
opts = {
    "max_words": 2000,
    "background_color": "white",
    "contour_width": 0,
    "contour_color": "steelblue"
}

# Options for controlling the matplotlib figure
figure_opts = {"figsize": (15, 8)}

# Generate the word cloud
# cloud = wordcloud(
#     dtm,
#     opts=opts,
#     figure_opts=figure_opts,
#     round=150,
#     show=True,
#     # filename="wordcloud.png"
# )

By default, `wordcloud()` creates a word cloud based on the total term counts for all documents. If you wish to use a single document or a subset of documents, pass the `docs` parameter a list of the labels of the documents you wish to use.

In [None]:
cloud = wordcloud(
    dtm,
    opts=opts,
    figure_opts=figure_opts,
    round=150,
    show=True,
    docs=["Pride1", "Pride2", "Pride3", "Pride4"],
)
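
The difference between totalling term counts over all documents and over a chosen subset can be sketched in plain Python. This is only an illustration of the idea, not the Lexos implementation: the `doc_counts` data is invented, and `total_counts()` is a hypothetical helper whose `docs` argument merely mirrors the parameter described above.

```python
from collections import Counter

# Invented per-document term counts standing in for the rows of a DTM
doc_counts = {
    "Pride1": Counter({"elizabeth": 3, "darcy": 1}),
    "Pride2": Counter({"elizabeth": 2, "bingley": 4}),
    "Pride3": Counter({"darcy": 5, "wickham": 2}),
}


def total_counts(doc_counts, docs=None):
    """Sum term counts over all documents, or only those labelled in `docs`."""
    selected = doc_counts if docs is None else {d: doc_counts[d] for d in docs}
    totals = Counter()
    for counts in selected.values():
        totals.update(counts)  # adds counts term by term
    return totals


all_terms = total_counts(doc_counts)
subset_terms = total_counts(doc_counts, docs=["Pride1", "Pride2"])

print(all_terms.most_common())
print(subset_terms.most_common())
```

Terms that are prominent only in the excluded documents shrink or disappear from the subset totals, which is why a cloud built from a subset can look quite different from one built from the whole text.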

In addition to a DTM, you can also provide `wordcloud()` with a raw text string (which will be tokenised by the `Wordcloud` package's internal tokenizer), as shown below.

You can also provide a list of token lists, a dict with the terms as keys and the counts/frequencies as values, or a dataframe with "term" and "count" or "frequency" columns.

In [None]:
cloud = wordcloud(
    loader.texts[0],
    opts=opts,
    figure_opts=figure_opts,
    round=150,
    show=True,
)
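
For reference, here is a rough sketch of how two of the alternative input shapes relate to each other: a list of token lists and a dict of term counts or frequencies. The sample tokens are made up, and the conversion shown is an assumption about what such inputs represent, not a description of how `wordcloud()` processes them internally.

```python
from collections import Counter
from itertools import chain

# A list of token lists, one inner list per document (invented sample data)
token_lists = [
    ["pride", "and", "prejudice"],
    ["sense", "and", "sensibility"],
]

# Flatten the token lists and count each term to get a term -> count dict
term_counts = dict(Counter(chain.from_iterable(token_lists)))

# Convert raw counts to relative frequencies (term -> share of all tokens)
total = sum(term_counts.values())
term_freqs = {term: count / total for term, count in term_counts.items()}

print(term_counts)
print(term_freqs)
```

Either dict has terms as keys and numeric weights as values, which is the shape described for the dict input.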

## Create a Multicloud

A multicloud is simply a series of word clouds (one for each document), arranged in a grid.

Since `multicloud()` produces multiple subplots, there is a `title` parameter to give the entire figure a title and a `labels` parameter, which takes a list of labels to be assigned to each subplot. The function takes the same `opts` and `figure_opts` parameters as `wordcloud()`.

If a `filename` is provided, the entire plot will be saved. If `show=False`, the `multicloud()` function returns a list of word clouds. These can be saved individually by calling `to_file()` on them.

As with `wordcloud()`, the `multicloud()` function accepts a list of raw text strings, a list of dicts with the terms as keys and the counts/frequencies as values, or a list of dataframes with "term" and "count" or "frequency" columns.


In [None]:
# Import multicloud
from lexos.visualization.cloud.wordcloud import multicloud

# Generate the multicloud
multicloud(
    dtm,
    title="My Multicloud",
    labels=labels,
    ncols=3,
    show=True,
    opts=opts,
    figure_opts=figure_opts,
    round=150,
    # filename="multicloud.png"
)

# To save an individual cloud, use the following:

# multi_cloud = multicloud(
#     dtm,
#     title="My Multicloud",
#     labels=labels,
#     ncols=3,
#     show=False,
#     opts=opts,
#     figure_opts=figure_opts,
#     round=150,
# )
# multi_cloud[0].to_file("multicloud1.png")
