# Create a "sequence" with word clouds

A single word cloud of an entire text rarely conveys the changes that occur over the course of a conversation or a narrative.
Splitting the text into segments and generating a word cloud for each one gives a better sense of these changes.

## Load a text file
Open a plain text file that you have saved locally. Update the file path below so that it points to your file.

In [None]:
# Assumes the file is UTF-8 encoded; change the encoding if yours differs.
with open('./data/gutenberg/carroll-alice.txt', 'r', encoding='utf-8') as fo:
    text = fo.readlines()

If you downloaded the book from Project Gutenberg, you might have text that you don't need. Use the code below to figure out where the actual text starts.

In [None]:
for index, line in enumerate(text[0:30]):
    print("%2d : %s" % (index, line))

----
Note the index of the line where the actual text begins, and discard everything before it.

You might also see a number of newline characters (`\n`) and empty lines, which you can also remove if you prefer.

In [None]:
import pprint

# Drop the front matter (e.g. the table of contents).
# Adjust 23 to the index you noted for your own file.
text = text[23:]

# Remove the newline character at the end of each line.
text = [line.replace('\n', '') for line in text]

# Remove all empty lines.
text = [line for line in text if len(line) > 0]

pprint.pp(text[0:15])

---
## Tokenize the text

The output above should now be clean (no empty lines, no table of contents, etc.).
Now we can collapse the text into one long string and start separating the words.
We use NLTK's `word_tokenize`, which first splits the text into sentences with the [Punkt tokenizer](https://www.nltk.org/api/nltk.tokenize.punkt.html) and then splits each sentence into word and punctuation tokens.
Punkt uses an unsupervised algorithm (trained on large amounts of plain text) to build a model of abbreviations, collocations (words that often go together), and words that start sentences.

In [None]:
import nltk
from nltk import word_tokenize

# Uncomment the below line the first time you run this code.
# nltk.download('punkt_tab')  

text_str = ' '.join(text)
tokens = word_tokenize(text_str)
print("%d tokens in text, including punctuation" % (len(tokens)))

### Remove punctuation
We have not yet removed any punctuation, so let's do that.

In [None]:
# The standard ASCII punctuation characters ship with Python's string module.
import string

# Add punctuation marks you might spot in the text
# that are not in the standard set (here, the curly apostrophe).
punctuations = string.punctuation + '’'

tokens_without_puncts = [w for w in tokens if w not in punctuations]

print("%d words in text, excluding punctuation"
      % (len(tokens_without_puncts)))
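
One caveat with the filter above: because `punctuations` is a string, `w not in punctuations` is a substring test. Multi-character tokens such as `--`, or the two-character quote tokens that NLTK's tokenizer emits for double quotes, are not substrings of that string and can slip through. A minimal sketch of a stricter filter, assuming we want to drop any token made up entirely of punctuation characters (the helper name `is_punct_token`, the extra curly quotes, and the sample tokens are hypothetical):

```python
import string

# Assumption: curly quotes are added by hand, as in the cell above.
PUNCT_CHARS = set(string.punctuation) | {'’', '‘', '“', '”'}

def is_punct_token(token):
    """Return True if every character of the token is punctuation."""
    return len(token) > 0 and all(ch in PUNCT_CHARS for ch in token)

# Made-up tokens, similar in shape to tokenizer output.
sample_tokens = ['``', 'Oh', 'dear', '!', "''", 'said', 'Alice', '--', "n't"]
kept = [t for t in sample_tokens if not is_punct_token(t)]
print(kept)
```

Tokens that mix letters and punctuation, such as `n't`, are kept, which is usually what you want for a word cloud.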

## Segment the text for the sequential word cloud
Split the text into a number of segments, say N.
We want N to be adjustable, so we make it a parameter of the function.

In [None]:
def segment_text(tokens_list, num_segments):
    # Identify the closest number of tokens to place in each segment.
    segment_size = int(round(len(tokens_list) / num_segments))
    # If the rounding went down, the segments would not cover the
    # entire text. To make up for this case, increase the number of
    # tokens in each segment by one.
    if segment_size * num_segments < len(tokens_list):
        segment_size += 1

    # Allocate the tokens sequentially to each segment.
    list_of_segments = []
    for ind in range(num_segments):
        start_of_segment = ind * segment_size
        # The end of a slice is exclusive, so no "- 1" is needed here;
        # subtracting one would drop the last token of every segment.
        end_of_segment = (ind + 1) * segment_size
        segment_tokens = tokens_list[start_of_segment:end_of_segment]
        list_of_segments.append(segment_tokens)
    return segment_size, list_of_segments
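
A quick way to sanity-check any segmentation routine is to confirm that no tokens are lost: joining the segments back together should reproduce the original list. The sketch below is a stand-alone illustration rather than the function above; `split_evenly` is a hypothetical helper that uses `math.ceil` for the segment size, and the tokens are made up.

```python
import math

def split_evenly(items, num_segments):
    """Split items into num_segments consecutive slices of (almost) equal size."""
    size = math.ceil(len(items) / num_segments)
    return [items[i * size:(i + 1) * size] for i in range(num_segments)]

# 23 made-up tokens split into 5 segments.
tokens = [f"w{i}" for i in range(23)]
segments = split_evenly(tokens, 5)

# Flatten the segments and compare with the original token list.
rejoined = [tok for seg in segments for tok in seg]
print([len(seg) for seg in segments])
print(rejoined == tokens)
```

The same round-trip comparison can be applied to the output of `segment_text` on your own tokens.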


In [None]:
num_segments = 10
words_per_wc, wc_token_lists = segment_text(tokens_without_puncts,
                                            num_segments)
print("%d words segmented into %d segments of up to %d words each"
      % (len(tokens_without_puncts), num_segments, words_per_wc))
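
Before plotting, it can help to look at the raw counts behind a word cloud, since a cloud sizes words by how often they occur. A rough sketch using `collections.Counter` on made-up segments; the tiny stopword set is an assumption for illustration and is far smaller than any real stopword list:

```python
from collections import Counter

# Hypothetical, tiny stopword set for illustration only.
stopwords = {'the', 'a', 'and', 'to', 'of', 'was', 'she', 'it'}

# Made-up token lists standing in for two segments of a text.
toy_segments = [
    ['Alice', 'was', 'tired', 'and', 'the', 'Rabbit', 'ran',
     'Rabbit', 'Rabbit', 'Alice'],
    ['the', 'Queen', 'shouted', 'Alice', 'and', 'the', 'Queen',
     'Alice', 'Queen'],
]

# Count the non-stopword tokens in each segment, ignoring case.
top_words = []
for seg in toy_segments:
    counts = Counter(w.lower() for w in seg if w.lower() not in stopwords)
    top_words.append(counts.most_common(2))

for ind, top in enumerate(top_words):
    print(ind, top)
```

Comparing the top words from one segment to the next gives a numerical preview of how the clouds should differ.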

### Plot word cloud for each segment
We use the existing [WordCloud](https://github.com/amueller/word_cloud) library for this task.

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

fig = plt.figure(constrained_layout=True, dpi=300)
widths = [1] * num_segments

color_func = lambda *args, **kwargs: 'black'

# We make one figure to contain all N subfigures :
spec = fig.add_gridspec(ncols=num_segments,
                        width_ratios=widths,
                        wspace=0.0, hspace=0.0)

# Then, we iterate over our segmented lists of words, 
# and generate word clouds using the library we imported.

for ind, words_list in enumerate(wc_token_lists):
    words_str = ' '.join(words_list)
    ax = fig.add_subplot(spec[ind])
    wc = WordCloud(width=500, height=800,
                   background_color="white",
                   color_func=color_func,
                   max_words=100,
                   stopwords=STOPWORDS)
    wc.generate_from_text(words_str)
    ax.imshow(wc)
    ax.axis("off")

# Save the figure (the ./plots directory must already exist).
plt.savefig('./plots/sequence_wordcloud.pdf', bbox_inches='tight', dpi=600)

If you are interested in the use of `lambda` in the above code, read on.

A lambda is a small anonymous function defined inline. Because functions in Python are objects, a lambda can be stored in a variable and passed to another function, as we do here with `color_func`.

We use a lambda function here because of how color is specified
in the WordCloud library (see [this example](https://amueller.github.io/word_cloud/auto_examples/a_new_hope.html)).
We define it to accept any number of positional arguments (`*args`), e.g. (1, 2) or (1, 2, 3, ...),
and keyword arguments (`**kwargs`), e.g. (a=1, b=1) or (a=1, b=1, c=2, ...), and to return **'black'** whatever it is given, so every word is drawn in black.
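
To make this concrete, the sketch below shows that the lambda is interchangeable with an ordinary named function that ignores its arguments. The arguments in the calls are made up for illustration; they are not necessarily the ones the WordCloud library passes.

```python
# The lambda from the plotting cell: accepts anything, always returns 'black'.
color_func = lambda *args, **kwargs: 'black'

# An equivalent named function.
def always_black(*args, **kwargs):
    return 'black'

# Both can be called with any mix of positional and keyword arguments.
results = [
    color_func(),
    color_func('alice', 42),
    always_black('rabbit', font_size=12, position=(0, 0)),
]
print(results)
```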

Lambda functions are not essential for this workshop, but they are an interesting concept. If you are curious, read [this article](https://realpython.com/python-lambda/) for more detailed information.

### Keyword in context analysis
Did a particular word in the text or the word clouds catch your attention?
You can use the code below to examine the contexts in which it appears in the text.

In [None]:
from nltk.text import Text
textList = Text(tokens_without_puncts)
textList.concordance('gryphon', width=85, lines=25)

---
Try the option below if you are interested in a sequence of words.

In [None]:
textList.concordance(['Mock', 'Turtle'], width=85, lines=25)
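
For intuition about what a concordance shows, here is a simplified keyword-in-context sketch in plain Python. It is not how NLTK implements `concordance`; the function name `kwic` and its `window` parameter are hypothetical, the sample sentence is made up, and matching is case-insensitive on single tokens only.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Take up to `window` tokens on either side of the match.
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

sample = ('the Gryphon sat up and the Mock Turtle sighed '
          'while the gryphon laughed').split()
hits = kwic(sample, 'gryphon', window=2)
for left, word, right in hits:
    print(f'{left:>12} [{word}] {right}')
```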

## Exercise for you!
Instead of splitting the text uniformly, can you draw a similar word cloud for each chapter?

Additional challenges:

  - [ ] Scale each word cloud segment to match the size of the chapter it represents. Longer chapters get bigger word clouds and vice versa.
  - [ ] Analyse each segment for its tone, sentiment, or a related LIWC-like psycholinguistic aspect. Can you draw the word cloud sequence as a "bar chart" with the heights representing that value?

Here is a bit of code to start you off.

In [None]:
# Use a separate loop variable so that it is not confused with the `text` list above.
chapters = [chapter for chapter in text_str.strip().split("CHAPTER") if len(chapter) > 0]
len(chapters)
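
As a hint for the first challenge, one possible approach is to make each subplot's width proportional to its chapter's word count; the resulting list could then be passed as `width_ratios` to `fig.add_gridspec`. A minimal sketch on a made-up string, assuming chapters are introduced by the word `CHAPTER` followed by a Roman numeral (check this against your own file):

```python
import re

# Made-up text standing in for the full book.
toy_text = ("CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired "
            "CHAPTER II. The Pool of Tears Curiouser and curiouser "
            "CHAPTER III. A Caucus-Race")

# Split on chapter headings such as "CHAPTER IV." and drop empty pieces.
pieces = re.split(r'CHAPTER\s+[IVXLC]+\.?', toy_text)
chapters = [p.strip() for p in pieces if p.strip()]

# Word counts per chapter, scaled so the shortest chapter has width 1.
word_counts = [len(ch.split()) for ch in chapters]
widths = [count / min(word_counts) for count in word_counts]
print(word_counts)
print(widths)
```

Splitting with a regular expression also removes the chapter numeral, which a plain `split("CHAPTER")` leaves at the start of each piece.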