In the Read Me file, I explained how word_tokenize tokenizes our text into words (more specifically, it splits a string into a list of substrings). Passing that list of words to the NLTK pos tagger produces a list of tuples, 'tagged' (in this explanation, I've labeled it 'sample_tagged'). Each tuple that the pos tagger produces is composed of a tokenized word and its associated part of speech. Here is an example of how this works:

In [103]:
import nltk
from nltk.tokenize import word_tokenize

sample_text = "The dog sleepily walked away."
sample_words = word_tokenize(sample_text)
sample_tagged = nltk.pos_tag(sample_words)
print('Tokenized text',sample_words)
print('Tagged text',sample_tagged)
print('List type: ', type(sample_tagged))
print('List item type: ', type(sample_tagged[1]))


Tokenized text ['The', 'dog', 'sleepily', 'walked', 'away', '.']
Tagged text [('The', 'DT'), ('dog', 'NN'), ('sleepily', 'RB'), ('walked', 'VBD'), ('away', 'RB'), ('.', '.')]
List type:  <class 'list'>
List item type:  <class 'tuple'>


At first glance, one might ask why I did not convert this list into a dictionary. After all, after using the NLTK pos tagger, many go on to change the data structure for further processing, such as graph representations (for understanding semantic relationships), objects and classes (for N.E.R.), and more. I originally thought that a dictionary with pos keys and word values would be the most efficient option for this project, and I was hesitant about simply replacing tuples in a list. However, the key to successful word games like mad libs is maintaining each word's position in the sentence. This is why I decided to stick with a list. In a dictionary, when several values are associated with one key, they have to be stored in a container, like a list. You then have to iterate over all the values in that list before moving on to the next key (pos), which disturbs the order of the sentence. Have a look at the output in the cell below, which compares the original text with the modified text when we use a dictionary and the sentence contains two adverbs.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from collections import OrderedDict

sample_text = "The dog sleepily walked away."
sample_words = word_tokenize(sample_text)
sample_tagged = nltk.pos_tag(sample_words)

sample_dict = OrderedDict()
for word, pos in sample_tagged:
    if pos not in sample_dict:
        sample_dict[pos] = [word]
    else:
        sample_dict[pos].append(word)

print("Original Dictionary:")
print(sample_dict)
print('')

for pos_tag, words in sample_dict.items():
    for i in range(len(words)):
        new_word = input(f"{pos_tag}: {words[i]}: ")
        words[i] = new_word

print("\nModified Dictionary:")
print('')
print(sample_dict)

modified_sentence = ' '.join(word for words in sample_dict.values() for word in words)
print('')
print('Original Sentence:', sample_text)
print('')
print('Modified Sentence', modified_sentence)

Original Dictionary:
OrderedDict([('DT', ['The']), ('NN', ['dog']), ('RB', ['sleepily', 'away']), ('VBD', ['walked']), ('.', ['.'])])

DT: The: The
NN: dog: dog
RB: sleepily: sleepily
RB: away: away
VBD: walked: walked
.: .: .

Modified Dictionary:

OrderedDict([('DT', ['The']), ('NN', ['dog']), ('RB', ['sleepily', 'away']), ('VBD', ['walked']), ('.', ['.'])])

Original Sentence: The dog sleepily walked away.

Modified Sentence The dog sleepily away walked .


On the other hand, with a list of tuples, the order of the words in every sentence is maintained. We can replace the tuple at any particular index with ease.
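To make the point about indices concrete, here is a minimal sketch. It assumes a hard-coded list that mirrors the pos tagger output shown earlier (so it runs without NLTK); the names `tagged`, `in_place_edit_worked`, and `words_in_order` are my own, for illustration only.

```python
# Hard-coded stand-in for the nltk.pos_tag output shown earlier (an assumption,
# so that this sketch runs without NLTK installed).
tagged = [('The', 'DT'), ('dog', 'NN'), ('sleepily', 'RB'),
          ('walked', 'VBD'), ('away', 'RB'), ('.', '.')]

# Tuples are immutable: assigning to one of their elements raises TypeError.
try:
    tagged[1][0] = 'cat'
    in_place_edit_worked = True
except TypeError:
    in_place_edit_worked = False

# The list itself is mutable: swap in a whole new tuple at index 1,
# keeping the original part-of-speech tag.
_, pos = tagged[1]
tagged[1] = ('cat', pos)

# Every other word stays at its original position.
words_in_order = [word for word, _ in tagged]
print(in_place_edit_worked)
print(words_in_order)
```

Only index 1 changes, so the sentence order is untouched.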

After instantiating 'tagged', I made a list called pos_list containing the relevant pos's (those I wanted to extract for my project implementation). You'll notice below that I chose only 5 pos's: nouns, adjectives, verb-ings, past-tense verbs, and adverbs, respectively. I did so because having Player 1 create the story already allows for great variability in sentence structure. In short, it already has the potential to get pretty wacky, so I didn't want to implement too many possible changes at the expense of interpretability.

In [3]:
pos_list_sample = ['NN', 'JJ', 'VBG', 'VBD', 'RB']

The next step is to iterate through 'tagged' with enumerate, which yields each item as (index, (word, pos)). The application looks for pos's in the 'tagged' list that match those in the pos list. For those matches, we create user-friendly labels, so that the user is prompted with "Noun", "Adjective", etc., instead of Penn Treebank tags such as "NN" and "JJ". The last line inside the if block replaces the tuple at the appropriate index with a new tuple holding the user's input and the original pos. This is another benefit of the list in Python, and another reason why I chose it: it is mutable and dynamic. So, while we can't change the items within a tuple without replacing it entirely, we can replace, add, and remove tuples in the list as we please.

In [6]:
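# Walk the tagged list by index so a matching tuple can be replaced in place.
for i, (word_sample, pos_sample) in enumerate(sample_tagged):
    if pos_sample in pos_list_sample:
        # Map the Penn Treebank tag to a user-friendly prompt label.
        label_sample = (
            "Noun" if pos_sample == 'NN' else
            "Adjective" if pos_sample == 'JJ' else
            "Adverb" if pos_sample == 'RB' else
            "Past-Tense Verb" if pos_sample == 'VBD' else
            "Verb-ING" if pos_sample == 'VBG' else
            "Unknown"
        )
        user_input_sample = input(f'{label_sample}: ')
        # Tuples are immutable, so swap in a new (word, pos) tuple.
        sample_tagged[i] = (user_input_sample, pos_sample)
    # Print every tuple, replaced or not, to show the list's current state.
    print(f'\n{sample_tagged[i]}')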


('The', 'DT')
Noun: cat

('cat', 'NN')
Adverb: happily

('happily', 'RB')
Past-Tense Verb: pranced

('pranced', 'VBD')
Adverb: along

('along', 'RB')

('.', '.')
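
The chained conditional in the cell above could also be written as a dictionary lookup. The sketch below is one possible variant, not the project's actual code: `POS_LABELS`, `replace_words`, and the `ask` parameter are hypothetical names, and the tagged list is hard-coded to mirror the earlier output so the example runs without NLTK. Passing the prompt function as an argument lets the loop run with canned answers instead of `input`.

```python
# Hypothetical lookup table; the tag-to-label pairs mirror the prompts above.
POS_LABELS = {
    'NN': 'Noun',
    'JJ': 'Adjective',
    'VBG': 'Verb-ING',
    'VBD': 'Past-Tense Verb',
    'RB': 'Adverb',
}

def replace_words(tagged, ask=input):
    """Return a new list of (word, pos) tuples in the original order.

    `ask` receives the prompt string and returns the replacement word.
    """
    result = []
    for word, pos in tagged:
        label = POS_LABELS.get(pos)
        if label is not None:
            word = ask(f'{label}: ')
        result.append((word, pos))
    return result

# Hard-coded stand-in for the pos tagger output (an assumption).
tagged = [('The', 'DT'), ('dog', 'NN'), ('sleepily', 'RB'),
          ('walked', 'VBD'), ('away', 'RB'), ('.', '.')]

# Canned answers stand in for a player typing at the prompt.
answers = iter(['cat', 'happily', 'pranced', 'along'])
prompts = []

def canned(prompt):
    prompts.append(prompt)
    return next(answers)

new_tagged = replace_words(tagged, ask=canned)
print(new_tagged)
```

Because the dictionary is keyed by tag only for the labels, and the words themselves stay in the list, word order is still preserved.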


Then, all that is left to do is rebuild the text! This is yet another reason why the list is integral. We can use "join" to join the words from the list's tuples, separated by a space, which allows us to read the output as a sequence of sentences.

In [94]:
reconstructed_sentence_sample = ' '.join([word_sample for word_sample, pos_sample in sample_tagged])
print("Original Sentence:", sample_text)
print("Modified Sentence:", reconstructed_sentence_sample)

Original Sentence: The dog sleepily walked away.
Modified Sentence: The cat happily pranced along .
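
The modified sentence above has a space before the final period, because the tokenizer makes punctuation its own token and "join" puts a space between every token. One way to tidy this up is sketched below. It is a rough, standard-library-only approach under my own assumptions: `join_tokens` is a hypothetical helper, and it only handles a handful of closing punctuation marks (not quotes or contractions). NLTK's TreebankWordDetokenizer may be a more complete option.

```python
import re

def join_tokens(tokens):
    """Join tokens with spaces, then close the gap before punctuation."""
    text = ' '.join(tokens)
    # Remove whitespace that join leaves before common closing punctuation.
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

# Hard-coded words matching the modified sentence above.
tokens = ['The', 'cat', 'happily', 'pranced', 'along', '.']
rebuilt = join_tokens(tokens)
print(rebuilt)
```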


As you can probably tell, the NLTK pos tagger is not perfect. It doesn't totally capture information about plurality, among other limitations. I think integrating an LLM would be great for this project. You could have it generate stories based on an idea you have, and its understanding of pos's has the potential to be more robust.