# Word Clouds

"Word clouds" is a cover term for a family of visualisations that enable you to examine term frequency in a compact way. The Lexos `visualization` module has two basic types: traditional word clouds and bubble charts. We'll start by defining a simple set of data for use in our initial examples.

In [None]:
# Text taken from the Wikipedia article on Digital Humanities
text = """
Digital humanities (DH) is an area of scholarly activity at the intersection of computing or digital technologies and the disciplines of the humanities. It includes the systematic use of digital resources in the humanities, as well as the analysis of their application. DH can be defined as new ways of doing scholarship that involve collaborative, transdisciplinary, and computationally engaged research, teaching, and publishing. It brings digital tools and methods to the study of the humanities with the recognition that the printed word is no longer the main medium for knowledge production and distribution.

By producing and using new applications and techniques, DH makes new kinds of teaching possible, while at the same time studying and critiquing how these impact cultural heritage and digital culture. A distinctive feature of DH is its cultivation of a two-way relationship between the humanities and the digital: the field both employs technology in the pursuit of humanities research and subjects technology to humanistic questioning and interrogation.
"""

## Basic Word Clouds

Let's start with a basic word cloud. Note that, when fed a raw string, Lexos will simply tokenise the text on whitespace. We'll look at some more sophisticated ways of submitting data below.

In [None]:
# Import the WordCloud class
from lexos.visualization.cloud import WordCloud

wc = WordCloud(data=text, title="My Word Cloud")
wc.show()
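To see why whitespace tokenisation gives noisy results, here is a minimal sketch using only the Python standard library. It is a stand-in for illustration, not the code Lexos runs internally, and the names `sample` and `tokens_ws` are made up for this example.

```python
from collections import Counter

sample = "The humanities, and the digital humanities."

# str.split() with no arguments splits on any run of whitespace
tokens_ws = sample.split()

# Punctuation stays attached to the words, and case is preserved,
# so "humanities," and "humanities." are counted as different terms
counts_ws = Counter(tokens_ws)
print(counts_ws)
```

Every term in this sample ends up with a count of 1, even though "humanities" and "the" each occur twice to a human reader.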

Unlike many other word cloud generators, Lexos prefers that you pre-tokenise your data *before* you pass it to the `WordCloud` class, such as by converting it to a spaCy `Doc`. You *can* pass a spaCy `Doc` object directly, but it is better to convert it to a list of token strings. This allows you to perform other types of pre-processing, such as removing punctuation and stop words (you could also generate a list of token strings by some other means). In the cell below, we'll convert our text to a spaCy `Doc`, filter out punctuation, stop words, and white space, then generate our word cloud.

In [None]:
# Import the Lexos Tokenizer class
from lexos.tokenizer import Tokenizer

# Create an instance of the Tokenizer class and make a spaCy doc
tokenizer = Tokenizer(model="en_core_web_sm")
doc = tokenizer.make_doc(text)

# Generate a filtered list of tokens
tokens = [
    token.text for token in doc
    if not token.is_punct
    and not token.is_stop
    and not token.is_space
]

# Create a new word cloud
wc = WordCloud(data=tokens, title="Digital Humanities Word Cloud")
wc.show()

That's much nicer! You can also pass a list of documents. By default, they will be merged into one document before the terms are counted. In the example below, we'll combine the document we created above with another document consisting of three additional tokens. Because we use only the first 10 tokens of the first document, you'll be able to see whether the second document is being added.

In [None]:
# Create a list of documents
doc1 = tokens[:10]
doc2 = ["some", "additional", "tokens"]
multiple_docs = [doc1, doc2]

# Create a new word cloud
wc = WordCloud(data=multiple_docs, title="Digital Humanities Word Cloud")
wc.show()
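The merging step can be pictured with a short standard-library sketch. This is an assumption about the general idea (flatten, then count), not a copy of the Lexos implementation, and `docs_example` is a made-up variable.

```python
from collections import Counter
from itertools import chain

docs_example = [
    ["digital", "humanities", "digital"],
    ["some", "additional", "tokens"],
]

# Flatten the list of documents into a single stream of tokens, then count
merged_counts = Counter(chain.from_iterable(docs_example))
print(merged_counts)
```

After merging, the counts no longer record which document a term came from, which is why the multiclouds described later are useful for comparisons.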

Now we can explore how to change the appearance of the word cloud using some keyword parameters. We've already seen the use of the `title` keyword. We can also define the `height` and `width` (in pixels) of the word cloud and create a `round` mask to confine the words to a circular shape.

We can also `limit` the number of words to 20. This is different from the slice we took in the previous cell because the `limit` parameter selects the 20 most common terms, not the first 20 tokens in the token list.

In [None]:
# Create and display a word cloud
wc = WordCloud(data=tokens, title="Digital Humanities Word Cloud", height=200, width=200, round=100, limit=20)
wc.show()
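The difference between slicing a token list and limiting by frequency can be shown with a small sketch. The token list here is invented for the example, and `Counter.most_common` is used only to illustrate the idea of "most common terms"; it is not necessarily how Lexos applies `limit`.

```python
from collections import Counter

example_tokens = [
    "field", "digital", "tools", "digital",
    "humanities", "digital", "humanities",
]

# Slicing keeps the first tokens in document order, whatever their frequency
first_two = example_tokens[:2]

# most_common keeps the highest-frequency terms instead
top_two = [term for term, _ in Counter(example_tokens).most_common(2)]

print(first_two)
print(top_two)
```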

Under the hood, Lexos uses the Python [WordCloud](https://amueller.github.io/word_cloud/) and [matplotlib](https://matplotlib.org/) libraries to create the word cloud. You can pass options to the `WordCloud` library with the `opts` keyword, the value of which should be a dictionary of options and their values. You can pass options to `matplotlib` with the `figure_opts` keyword, which also takes a dictionary. A full discussion of the available options is beyond the scope of this tutorial, and you are encouraged to consult the `WordCloud` and `matplotlib` documentation for ways to customise your word cloud. Here, we'll just provide a simple example showing how to change the background colour.

In [None]:
# Define an options dictionary
opts = {"background_color": "lightblue"}

# Create the word cloud
wc = WordCloud(data=text, title="My Blue Word Cloud", opts=opts)
wc.show()

Advanced users can manipulate the plot directly using `matplotlib`, as shown below:

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Create a pyplot figure object with the specified options
fig = plt.figure(**wc.figure_opts)

# Modify the figure
fig.set_facecolor("lightgreen")
fig.suptitle("My Light Green Word Cloud")

# Hide the axis lines and labels
plt.axis("off")

# Create the image
# The semicolon prevents display of the object in Jupyter notebooks, or you can add plt.show()
plt.imshow(wc.cloud, interpolation="bilinear");

### Saving Your Word Cloud Image

To save your image, use the `save` method. The file type will be determined by your file suffix (`.png` or `.jpg`).

In [None]:
# Change the file path to the location where you wish to save the image file
wc.save("wordcloud.png")

# Or save as a jpg file
# wc.save("wordcloud.jpg")

## Bubble Charts

Now we're ready to produce a bubble chart with the `BubbleChart` class.

In [None]:
# Import the BubbleChart class
from lexos.visualization.bubbleviz import BubbleChart

bc = BubbleChart(data=tokens, title="Digital Humanities Bubble Chart")
bc.show()

As with word clouds, we can apply a `limit` to the number of terms.

One important difference from the `WordCloud` class is that the figure dimensions are supplied with the `figsize` keyword, the value of which should be an integer or float (decimal number) corresponding to the size in inches. Unlike word clouds, bubble charts must have the same height and width.

In [None]:
bc = BubbleChart(data=tokens, title="Digital Humanities Bubble Chart", figsize=10, limit=20)
bc.show()

If you want to manipulate the figure, you must access the image data through the `fig` attribute. Here is an example where we replace the title and change the image size. Note that you may have to do additional adjustments to padding and margins in `matplotlib` if you want the image to appear the same as the one produced by Lexos.

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Get the term counts from the BubbleChart instance
data = list(bc.counts.keys())

# Create a new figure and axis, setting the figure size
fig, ax = plt.subplots(subplot_kw=dict(aspect="equal"), figsize=(10, 10))

# Call the internal _plot method to create the bubble chart
bc._plot(ax, data)

# Turn off the axis lines and labels
plt.axis("off")

# Recompute the data limits
ax.relim()

# Autoscale the view limits using the data limits
ax.autoscale_view()

# Add a new title
ax.set_title("A different title")

# Display the plot
plt.show()

### Saving Your Bubble Chart Image

To save your image, use the `save` method. The file type will be determined by your file suffix (`.png` or `.jpg`).

In [None]:
# Change the file path to the location where you wish to save the image file
bc.save("bubble_chart.png")

# Or save as a jpg file
# bc.save("bubble_chart.jpg")

As with word clouds, you can pass additional arguments to `matplotlib`'s `savefig` method.

## Multiclouds

Multiclouds are grids of word clouds that allow you to compare the term counts in multiple documents. We'll show how this works below using a list of texts as our data.

In [None]:
# Import the MultiCloud class
from lexos.visualization.cloud import MultiCloud

texts = [
    "Natural language processing is a fascinating field that combines linguistics, computer science, and artificial intelligence.",
    "Text analysis, sentiment analysis, and language modeling are key components of modern NLP systems.",
    "Machine learning algorithms help computers understand and process human language effectively.",
    "Natural language processing is a fascinating field that combines linguistics, computer science, and artificial intelligence.",
    "Text analysis, sentiment analysis, and language modeling are key components of modern NLP systems.",
    "Machine learning algorithms help computers understand and process human language effectively."
]

# Create and display a MultiCloud chart
mc = MultiCloud(data=texts, round=150, title="Sample Multiclouds")
mc.show()
# mc.save("multiclouds.png")

If you want to change the labels ("Doc 1", "Doc 2", etc.), you can pass a list to the `labels` parameter:

```python
labels = ["A", "B", "C", "D", "E", "F"]
mc = MultiCloud(data=texts, labels=labels, round=150, title="Sample Multiclouds")
```

You can change the number of columns with the `ncols` parameter. The default is 3.

The `height` and `width` parameters will be applied to each individual word cloud. You can actually use all the parameters available in the `WordCloud` class.
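As a rough illustration of how the column count shapes the grid, the sketch below works out the number of rows and a cell position for each document. It assumes the grid is filled row by row from left to right, which is a guess about the layout rather than something taken from the Lexos source.

```python
import math

n_docs = 6   # number of documents in `texts`
ncols = 3    # the default number of columns

# Enough rows to hold every document
nrows = math.ceil(n_docs / ncols)

# (row, column) position of each document, assuming row-by-row filling
positions = [divmod(i, ncols) for i in range(n_docs)]

print(nrows)
print(positions)
```

With `ncols=2`, the same six documents would need three rows.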

If you want to use `matplotlib` to modify the multicloud after it has been created, do it like this:

In [None]:
# Create the MultiCloud chart
mc = MultiCloud(data=texts, title="Starting Title", round=150)

# Draw the canvas
mc.fig.canvas.draw()

# Show the figure data as a matplotlib image
plt.imshow(mc.fig.canvas.buffer_rgba(), interpolation="bilinear")

# Change the title of the figure
plt.suptitle("A Different Title")

# Turn off the axis lines and labels
plt.axis("off")

# Display the plot
plt.show()

## Using a Document-Term Matrix (DTM)

Frequently your data will take the form of a document-term matrix (DTM), which is a matrix of counts for each term in a collection of documents. Lexos provides the `DTM` class for producing document-term matrices. Here is an example using the texts used to generate the multiclouds above.

In [None]:
# Import the DTM class
from lexos.dtm import DTM

# Convert the texts to spaCy documents
docs = list(tokenizer.make_docs(texts))

# Tokenize the documents and filter out stop words, punctuation, and spaces, then convert to lower case
tokenized_docs = []
for doc in docs:
    tokenized_docs.append([token.lower_ for token in doc if not token.is_stop and not token.is_punct and not token.is_space])

# Create a DTM instance and call it with the tokenized documents
dtm = DTM()
dtm(docs=tokenized_docs, labels=["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5", "Doc 6"])

# Display the start of the DTM in a pandas DataFrame
df = dtm.to_df().head()
df
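If the idea of a document-term matrix is new to you, the following sketch builds a tiny one by hand with the standard library. It is not how the Lexos `DTM` class works internally, the toy data is invented, and the orientation chosen here (one row per term, one column per document) is just one possible convention.

```python
from collections import Counter

toy_docs = {
    "Doc 1": ["language", "processing", "language"],
    "Doc 2": ["text", "analysis", "language"],
}

# Count the terms in each document separately
per_doc = {label: Counter(toks) for label, toks in toy_docs.items()}

# Build a shared, sorted vocabulary across all documents
vocab = sorted(set().union(*per_doc.values()))

# One row per term, one count per document (0 if the term is absent)
matrix = {term: [per_doc[label][term] for label in toy_docs] for term in vocab}

for term, row in matrix.items():
    print(f"{term:<12}{row}")
```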

Now that we have a DTM, we can submit it directly to the `MultiCloud` class.

In [None]:
# Create and display a MultiCloud chart from a DTM
mc = MultiCloud(data=dtm, round=150, title="Sample Multiclouds")
mc.show()

You can also pass the output of `dtm.to_df()`. The `WordCloud` and `BubbleChart` classes also accept data in these formats, but the documents will be merged by default.

## Filtering Document Collections

If your data consists of a list of documents or a document-term matrix, you can choose which documents to use with the `docs` parameter. Simply pass a list of document indices (starting with 0) specifying which documents you wish to include. Here is an example using a DTM.

In [None]:
# Create a MultiCloud chart from a DTM using only selected documents
mc = MultiCloud(
    data=dtm,
    docs=[1, 2, 3, 4],
    labels=["Doc 2", "Doc 3", "Doc 4", "Doc 5"],
    round=150,
    ncols=2,
    title="Multiclouds of Filtered Docs"
)
mc.show()

You can also use the `docs` parameter in the `WordCloud` and `BubbleChart` classes.
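Zero-based indices are easy to get off by one, so here is a small sketch of what selecting documents by index amounts to. The list `all_docs` is a placeholder, and the label format simply mirrors the "Doc N" labels used above.

```python
all_docs = [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]
selected = [1, 2, 3, 4]

# Index 0 is the first document, so index 1 picks out "Doc 2"
subset = [all_docs[i] for i in selected]

# Human-readable labels are one higher than the zero-based indices
subset_labels = [f"Doc {i + 1}" for i in selected]

print(subset)
print(subset_labels)
```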

## Generating Dynamic Images

The static images produced by `WordCloud`, `BubbleChart`, and `MultiCloud` are very good for presentations, but they have their limitations, especially for more cluttered data. Because of this, Lexos offers alternative versions that use the JavaScript [D3.js](https://d3js.org/) library. This allows you to produce interactive features such as the ability to hover over the terms in your word cloud to see their counts. D3 visualisations are beautiful and useful for exploring data when static images are hard to read. They are also ideal for embedding in web applications.

The cells below demonstrate how to generate D3 versions of word clouds and bubble charts.

> **Important:**
>
> Because D3 is a JavaScript library, it processes data into charts in the web browser. As a result, the charts will probably not display in a Jupyter notebook. Instead, you have to save your chart as an HTML page and open it separately in the web browser.
>
> For most system configurations, the Lexos D3 visualisations will automatically open a web browser when you generate your chart. If it does not, save the file as shown below and use your operating system to locate and open it in your browser.

In [None]:
# Import the D3WordCloud class
from lexos.visualization.d3_wordcloud import D3WordCloud

# A web browser will open by default. If you do not want this,
# set auto_open=False.
wc = D3WordCloud(data=text, title="D3 Word Cloud", width=300, height=300)

# Set `path_to_save` to the location where you wish to save your word cloud
path_to_save = "d3_wordcloud.html"
wc.save(path_to_save)

print("Nothing will display in the notebook, but a web browser should open automatically. If it doesn't, try saving the file and opening it manually.")

Try hovering over the words to see further information. You can do the same on D3 bubble charts such as the one we'll generate in the next cell.

In [None]:
# Import the D3BubbleChart class
from lexos.visualization.d3_bubbleviz import D3BubbleChart

# A web browser will open by default. If you do not want this,
# set auto_open=False.
bc = D3BubbleChart(data=text, auto_open=False)

# Set `path_to_save` to the location where you wish to save your bubble chart
path_to_save = "bubblechart.html"
bc.save(path_to_save)

print("Nothing will display in the notebook, and no browser will open because auto_open=False. Open the saved file manually in your browser.")

Finally, let's generate a multicloud visualisation using D3. Note that the height and width of individual clouds are measured in pixels and controlled by the `cloud_height` and `cloud_width` keywords.

In [None]:
# Import the D3MultiCloud class
from lexos.visualization.d3_wordcloud import D3MultiCloud

# Create multi-cloud
# A web browser will open by default. If you do not want this,
# set auto_open=False.
mc = D3MultiCloud(
    data_sources=texts,
    labels=None,
    cloud_width=250,
    cloud_height=250,
    columns=3,
    title="Technology Word Clouds",
    auto_open=False
)

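# Set `path_to_save` to the location where you wish to save your multicloud,
# then uncomment the save call (this assumes D3MultiCloud provides the same
# save method as D3WordCloud and D3BubbleChart)
path_to_save = "multiclouds.html"
# mc.save(path_to_save)

print("Nothing will display in the notebook, and no browser will open because auto_open=False. Save the file and open it manually in your browser.")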


D3 word clouds, bubble charts, and multiclouds can be modified with titles, limits, and settings for height and width (in pixels) using the same keywords available in the pure Python implementations we have seen above. There is an `include_js` keyword that allows you to specify whether you would like to load the D3 JavaScript over the internet or include it in your file. If you do the latter, you will have a bigger file. If you choose the former, you will not be able to view your chart without an internet connection. Generally, you can leave the default settings and read the documentation if you need to make changes.

In either case, the D3 library code will be "minified" and not readable by humans. However, the main logic is left unminified, along with the HTML and styling. If you open a chart file in an editor, you can make modifications that aren't enabled directly by the Lexos programming interface. However, this requires knowledge of HTML, CSS, and D3, so we won't go into any further details in this tutorial.
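The trade-off between linking to D3 and embedding it can be pictured with a small sketch. The function `script_tag` and its `embed` parameter are hypothetical and are not part of the Lexos API; they only illustrate the two kinds of `<script>` tag an HTML file might contain. The URL is the public D3 version 7 build, used here as an example.

```python
def script_tag(embed: bool, d3_source: str = "") -> str:
    """Return a <script> tag that either embeds D3 or links to it (hypothetical helper)."""
    if embed:
        # The whole library is written into the file: bigger, but works offline
        return f"<script>{d3_source}</script>"
    # Only a link is written: smaller, but needs an internet connection to render
    return '<script src="https://d3js.org/d3.v7.min.js"></script>'


linked = script_tag(embed=False)
embedded = script_tag(embed=True, d3_source="/* minified D3 code would go here */")
print(linked)
print(embedded)
```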