# Implementing Tokenization

Tokenizers are essential tools in natural language processing that break down text into smaller units called tokens. These tokens can be words, characters, or subwords, making complex text understandable to computers. By dividing text into manageable pieces, tokenizers enable machines to process and analyze human language, powering various language-related applications like translation, sentiment analysis, and chatbots. Essentially, tokenizers bridge the gap between human language and machine understanding.


This lab uses the following libraries:

* [`nltk`](https://www.nltk.org/), the Natural Language Toolkit, is used here for word-level tokenization. It offers comprehensive tools and resources for processing natural language text, making it a valuable choice for tasks such as text preprocessing and analysis.

* [`spaCy`](https://spacy.io/) is an open-source software library for advanced natural language processing in Python. spaCy is renowned for its speed and accuracy in processing large volumes of text data.

* [`BertTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#berttokenizer) is part of the Hugging Face Transformers library, a popular library for working with state-of-the-art pre-trained language models. BertTokenizer is specifically designed for tokenizing text according to the BERT model's specifications.

* [`XLNetTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#xlnettokenizer) is another component of the Hugging Face Transformers library. It is tailored for tokenizing text in alignment with the XLNet model's requirements.

* [`torchtext`](https://pytorch.org/text/stable/index.html) is part of the PyTorch ecosystem and handles various natural language processing tasks. It simplifies working with text data and provides functionality for data preprocessing, tokenization, vocabulary management, and batching.

In [5]:
import ssl
import certifi
import urllib.request

opener = urllib.request.build_opener(
    urllib.request.HTTPSHandler(context=ssl.create_default_context(cafile=certifi.where()))
)
urllib.request.install_opener(opener)

In [None]:
import spacy
import subprocess
import sys

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
# from nltk.probability import FreqDist
# from nltk.util import ngrams

from transformers import BertTokenizer
from transformers import XLNetTokenizer

import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import warnings
def warn(*args, **kwargs):
  pass
warnings.warn = warn
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/victorcata/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/victorcata/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## What is a tokenizer and why do we use it?

Tokenizers play a pivotal role in natural language processing, segmenting text into smaller units known as tokens. These tokens are subsequently transformed into numerical representations called token indices, which are directly employed by deep learning algorithms.
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/images/Tokenization%20lab%20Diagram%201.png" width="50%" alt="Image Description">
</center>


## Types of tokenizer

The meaningful representation can vary depending on the model in use. Various models employ distinct tokenization algorithms, and you will broadly cover the following approaches. Transforming text into numerical values might appear straightforward initially, but it encompasses several considerations that must be kept in mind.
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/images/Tokenization%20lab%20Diagram%202.png" width="50%" alt="Image Description">
</center>


## Word-based tokenizer

###  nltk

As the name suggests, a word-based tokenizer splits text into words. Different word-based tokenizers follow different rules, such as splitting on spaces or splitting on punctuation, and each resulting word is assigned its own ID. Here you use nltk's ```word_tokenize```.


In [9]:
text = 'This is a sample sentence for word tokenization.'
tokens = word_tokenize(text)
print(tokens)

['This', 'is', 'a', 'sample', 'sentence', 'for', 'word', 'tokenization', '.']


General-purpose libraries like nltk and spaCy often split contractions such as "don't" and "couldn't" into separate tokens. There's no universal rule, and each library has its own tokenization rules for word-based tokenizers. However, the general guideline is to tokenize the input the same way the text was tokenized when the model was trained.


In [10]:
text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."
tokens = word_tokenize(text)
print(tokens)

['I', 'could', "n't", 'help', 'the', 'dog', '.', 'Ca', "n't", 'you', 'do', 'it', '?', 'Do', "n't", 'be', 'afraid', 'if', 'you', 'are', '.']


In [None]:
# Install spaCy model if not already installed
try:
  nlp = spacy.load('en_core_web_sm')
  print("spaCy model 'en_core_web_sm' is already installed")
except OSError:
  print("Installing spaCy model 'en_core_web_sm'...")
  subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
  print("Model installed successfully!")

Installing spaCy model 'en_core_web_sm'...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Model installed successfully!


In [20]:
text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

token_list = [token.text for token in doc]
print('Tokens:', token_list)

for token in doc:
  print(f"Token: {token.text}, POS(part of speech): {token.pos_}, Dependency: {token.dep_}")

Tokens: ['I', 'could', "n't", 'help', 'the', 'dog', '.', 'Ca', "n't", 'you', 'do', 'it', '?', 'Do', "n't", 'be', 'afraid', 'if', 'you', 'are', '.']
Token: I, POS(part of speech): PRON, Dependency: nsubj
Token: could, POS(part of speech): AUX, Dependency: aux
Token: n't, POS(part of speech): PART, Dependency: neg
Token: help, POS(part of speech): VERB, Dependency: ROOT
Token: the, POS(part of speech): DET, Dependency: det
Token: dog, POS(part of speech): NOUN, Dependency: dobj
Token: ., POS(part of speech): PUNCT, Dependency: punct
Token: Ca, POS(part of speech): AUX, Dependency: aux
Token: n't, POS(part of speech): PART, Dependency: neg
Token: you, POS(part of speech): PRON, Dependency: nsubj
Token: do, POS(part of speech): VERB, Dependency: ROOT
Token: it, POS(part of speech): PRON, Dependency: dobj
Token: ?, POS(part of speech): PUNCT, Dependency: punct
Token: Do, POS(part of speech): AUX, Dependency: aux
Token: n't, POS(part of speech): PART, Dependency: neg
Token: be, POS(part of s

The problem with this algorithm is that words with similar meanings will be assigned different IDs, resulting in them being treated as entirely separate words with distinct meanings. For example, `Unicorns` is the plural form of `Unicorn`, but a word-based tokenizer would tokenize them as two separate words, potentially causing the model to miss their semantic relationship.


In [21]:
text = "Unicorns are real. I saw a unicorn yesterday."
token = word_tokenize(text)
print(token)

['Unicorns', 'are', 'real', '.', 'I', 'saw', 'a', 'unicorn', 'yesterday', '.']


Each distinct word becomes its own token, which significantly increases the model's overall vocabulary. Each token is mapped to a large vector that encodes the word's meaning, so a larger vocabulary means more model parameters.
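
To make the vocabulary growth concrete, here is a minimal sketch in plain Python. Whitespace splitting stands in for a real word-based tokenizer, and the three sentences are made up for illustration. Every surface form, including plurals and possessives, takes its own slot in the vocabulary.

```python
# Toy corpus (hypothetical sentences, for illustration only)
corpus = [
    "Unicorns are real",
    "I saw a unicorn yesterday",
    "The unicorn's horn glows",
]

# Whitespace splitting stands in for a real word-based tokenizer
word_to_id = {}
for sentence in corpus:
    for word in sentence.split():
        # Assign the next free ID the first time a surface form appears
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

# Related forms end up with unrelated IDs
for form in ("Unicorns", "unicorn", "unicorn's"):
    print(form, "->", word_to_id[form])

print("Vocabulary size:", len(word_to_id))
```

Nothing in the IDs tells a model that these three forms are related; it has to learn that from data.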


## Character-based tokenizer

As the name suggests, character-based tokenization involves splitting text into individual characters. The advantage of using this approach is that the resulting vocabularies are inherently small. Furthermore, since languages have a limited set of characters, the number of out-of-vocabulary tokens is also limited, reducing token wastage.

For example:
Input text: `This is a sample sentence for tokenization.`

Character-based tokenization output: `['T', 'h', 'i', 's', 'i', 's', 'a', 's', 'a', 'm', 'p', 'l', 'e', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 'f', 'o', 'r', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '.']`

However, it's important to note that character-based tokenization has its limitations. Single characters may not convey the same information as entire words, and the overall token length increases significantly, potentially causing issues with model size and a loss of performance.
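
The trade-off can be seen in a few lines of plain Python. This is only a sketch: it drops whitespace to mirror the example output above, and it uses a whitespace split as a rough stand-in for word-level tokenization.

```python
text = "This is a sample sentence for tokenization."

# Word-level view (whitespace split, for comparison only)
word_tokens = text.split()

# Character-level view; spaces are dropped to mirror the example output above
char_tokens = [ch for ch in text if not ch.isspace()]

print(char_tokens[:8])
print("Word tokens:", len(word_tokens), "| unique:", len(set(word_tokens)))
print("Char tokens:", len(char_tokens), "| unique:", len(set(char_tokens)))
```

The character sequence is several times longer than the word sequence, while the set of distinct characters stays small no matter how much text is added.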


## Subword-based tokenizer

The subword-based tokenizer allows frequently used words to remain unsplit while breaking down infrequent words into meaningful subwords. Techniques such as SentencePiece and WordPiece are commonly used for subword tokenization. These methods learn subword units from a given text corpus, identifying common prefixes, suffixes, and root words as subword tokens based on their frequency of occurrence. This approach offers the advantage of representing a broader range of words and adapting to the specific language patterns within a text corpus.

In both examples below, words are split into subwords, which helps preserve the semantic information associated with the overall word. For instance, 'Unhappiness' is split into 'un' and 'happiness,' both of which can appear as stand-alone subwords. When we combine these individual subwords, they form 'unhappiness,' which retains its meaningful context. This approach aids in maintaining the overall information and semantic meaning of words.

<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/images/Tokenization%20lab%20Diagram%203.png" width="50%" alt="Image Description">
</center>


### WordPiece

WordPiece initializes its vocabulary with every character present in the training data and then progressively learns a specified number of merge rules. WordPiece doesn't select the most frequent symbol pair but rather the one that maximizes the likelihood of the training data when added to the vocabulary. In essence, WordPiece evaluates what it sacrifices by merging two symbols to ensure the merge is worthwhile.
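
The difference between "most frequent pair" and "pair that most improves likelihood" can be illustrated with a scoring rule that is commonly used to describe WordPiece: `score = freq(pair) / (freq(first) * freq(second))`. The sketch below assumes that rule and a made-up table of word counts; it is not the original WordPiece training code.

```python
from collections import Counter

# Toy word frequencies (hypothetical numbers, for illustration only)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Split each word into characters; non-initial pieces carry the "##" prefix
splits = {
    word: [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]
    for word in word_freqs
}

symbol_freq = Counter()
pair_freq = Counter()
for word, freq in word_freqs.items():
    pieces = splits[word]
    for piece in pieces:
        symbol_freq[piece] += freq
    for left, right in zip(pieces, pieces[1:]):
        pair_freq[(left, right)] += freq

# Assumed scoring rule: freq(pair) / (freq(left) * freq(right))
scores = {
    pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
    for pair, freq in pair_freq.items()
}

best_pair = max(scores, key=scores.get)
most_frequent_pair = max(pair_freq, key=pair_freq.get)
print("Highest-scoring pair:", best_pair)
print("Most frequent pair:  ", most_frequent_pair)
```

With these counts the two criteria disagree: the most frequent pair is made of symbols that are common everywhere, while the highest-scoring pair is made of symbols that rarely appear apart.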

The WordPiece tokenizer is implemented in `BertTokenizer`.
Note that `BertTokenizer` splits composite words into separate tokens and marks each continuation piece with a `##` prefix.


In [22]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.tokenize("IBM taught me tokenization.")

['ibm', 'taught', 'me', 'token', '##ization', '.']
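
The `##` prefix marks a piece that continues a word. A minimal sketch of greedy longest-match-first lookup, the strategy BERT's WordPiece tokenizer is generally described as using at tokenization time, shows how `tokenization` can become `token` + `##ization`. The function name and the tiny vocabulary are hypothetical, and the real tokenizer also handles casing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "##ize", "taught", "me", "ibm"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokenize", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```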

### Unigram and SentencePiece

Unigram is a subword tokenization method. It starts with a large list of candidate pieces and gradually narrows it down, removing the pieces that contribute least to modeling the training text until the vocabulary reaches the desired size. This approach aids in efficient text tokenization.

SentencePiece is a tool that takes text, divides it into smaller, more manageable parts, assigns IDs to these segments, and ensures that it does so consistently. Consequently, if you use SentencePiece on the same text repeatedly, you will consistently obtain the same subwords and IDs.

Unigram and SentencePiece work together by implementing Unigram's subword tokenization method within the SentencePiece framework. SentencePiece handles subword segmentation and ID assignment, while Unigram's principles guide the vocabulary reduction process to create a more efficient representation of the text data. This combination is particularly valuable for various NLP tasks in which subword tokenization can enhance the performance of language models.
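
Once a Unigram vocabulary exists, each piece has a probability, and a word is split into the sequence of pieces with the highest combined probability. The sketch below illustrates that selection step with dynamic programming. The probabilities are invented for illustration (a trained model would learn them), and the helper name `best_segmentation` is hypothetical rather than part of SentencePiece.

```python
import math

# Hypothetical piece probabilities; a trained Unigram model would learn these
piece_probs = {
    "token": 0.05, "ization": 0.02, "iz": 0.01, "ation": 0.03,
    "t": 0.02, "o": 0.03, "k": 0.01, "e": 0.04, "n": 0.03,
    "i": 0.03, "z": 0.005, "a": 0.04,
}


def best_segmentation(word, probs):
    """Return the split of `word` with the highest product of piece probabilities."""
    n = len(word)
    # best[i] holds (log-probability, pieces) for the best split of word[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]


print(best_segmentation("tokenization", piece_probs))
```

With these made-up numbers, two longer pieces beat any split into many short ones, because every extra piece multiplies in another small probability.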


In [24]:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
tokenizer.tokenize("IBM taught me tokenization.")

['▁IBM', '▁taught', '▁me', '▁token', 'ization', '.']

## Tokenization with PyTorch

In PyTorch, especially with the `torchtext` library, the tokenizer breaks down text from a dataset into individual words or subwords so that they can be converted into a numerical format. After tokenization, the vocab (vocabulary) maps these tokens to unique integers, allowing them to be fed into neural networks. This process is vital because deep learning models operate on numerical data and cannot process raw text directly. Thus, tokenization and vocabulary mapping serve as a bridge between human-readable text and machine-operable numerical data. Consider the dataset:


In [25]:
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1," NLP Named Entity,Sentiment Analysis,Machine Translation "),
    (1," Machine Translation with NLP "),
    (1," Named Entity vs Sentiment Analysis  NLP ")]

In [None]:
# fetch a tokenizer by name. Provides support for a range of tokenization methods, 
# including basic string splitting, and returns various tokenizers based on the argument passed to it.
tokenizer = get_tokenizer('basic_english')

You apply the tokenizer to the dataset. Note: If ```basic_english``` is selected, it returns the ```_basic_english_normalize()``` function, which normalizes the string first and then splits it by space.


In [37]:
tokenizer(dataset[0][1])

['introduction', 'to', 'nlp']

## Token indices

Words are represented as numbers because NLP algorithms can process and manipulate numbers far more efficiently than raw text. You use the function **```build_vocab_from_iterator```** to build this mapping; its output is typically referred to as 'token indices' or simply 'indices'. These indices are the numeric representations of the tokens in the vocabulary.

The **```build_vocab_from_iterator```** function, when applied to a list of tokens, assigns a unique index to each token based on its position in the vocabulary. These indices serve as a way to represent the tokens in a numerical format that can be easily processed by machine learning models.

For example, given a vocabulary with tokens ["apple", "banana", "orange"], the corresponding indices might be [0, 1, 2], where "apple" is represented by index 0, "banana" by index 1, and "orange" by index 2.
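
The mapping in this example can be sketched in plain Python. The names `stoi` and `itos` follow a common convention (string-to-index and index-to-string); the ordering here is simply list order, which is an assumption for illustration and may differ from the order torchtext chooses.

```python
tokens = ["apple", "banana", "orange"]

# token -> index ("stoi": string to index)
stoi = {token: index for index, token in enumerate(tokens)}
# index -> token ("itos": index to string)
itos = list(tokens)

sentence = ["banana", "apple", "banana"]

# Encode tokens as indices, then decode them back
indices = [stoi[token] for token in sentence]
decoded = [itos[i] for i in indices]

print(indices)
print(decoded)
```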

**```dataset```** is an iterable. Therefore, you use a generator function yield_tokens to apply the **```tokenizer```**. The purpose of the generator function **```yield_tokens```** is to yield tokenized texts one at a time. Instead of processing the entire dataset and returning all the tokenized texts in one go, the generator function processes and yields each tokenized text individually as it is requested. The tokenization process is performed lazily, which means the next tokenized text is generated only when needed, saving memory and computational resources.


In [38]:
def yield_tokens(data_iter):
  for _,text in data_iter:
    yield tokenizer(text)

my_iterator = yield_tokens(dataset)

This creates an iterator called **```my_iterator```** using the generator. To begin the evaluation of the generator and retrieve the values, you can iterate over **```my_iterator```** using a for loop or retrieve values from it using the **```next()```** function.


In [39]:
next(my_iterator)

['introduction', 'to', 'nlp']

You build a vocabulary from the tokenized texts generated by the **```yield_tokens```** generator function, which processes the dataset. The **```build_vocab_from_iterator()```** function constructs the vocabulary, including a special token `<unk>` to represent out-of-vocabulary words.

### Out-of-vocabulary (OOV)

When text data is tokenized, there may be words that are not present in the vocabulary because they are rare or unseen during the vocabulary building process. When encountering such OOV words during actual language processing tasks like text generation or language modeling, the model can use the ```<unk>``` token to represent them.

For example, if the word "apple" is present in the vocabulary, but "pineapple" is not, "apple" will be used normally in the text, but "pineapple" (being an OOV word) would be replaced by the ```<unk>``` token.

By including the `<unk>` token in the vocabulary, you provide a consistent way to handle out-of-vocabulary words in your language model or other natural language processing tasks.
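
The "apple" versus "pineapple" case can be reproduced with a dictionary lookup that falls back to the `<unk>` index. This is a library-free sketch with a hypothetical four-entry vocabulary, not the torchtext implementation.

```python
# Hypothetical vocabulary with the special token in slot 0
stoi = {"<unk>": 0, "i": 1, "like": 2, "apple": 3}
UNK_ID = stoi["<unk>"]


def encode(tokens):
    # dict.get falls back to the <unk> index for tokens not in the vocabulary
    return [stoi.get(token, UNK_ID) for token in tokens]


in_vocab_ids = encode(["i", "like", "apple"])
oov_ids = encode(["i", "like", "pineapple"])

print(in_vocab_ids)
print(oov_ids)
```

Every unseen word collapses onto the same index, so the model can still process the sentence but loses the identity of the unknown word.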

In [58]:
# Version torchtext 0.10.0
# vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=["<unk>"])
# vocab.set_default_index(vocab["<unk>"])

# Shim for 0.6.0
from collections import Counter
from torchtext.vocab import Vocab

def build_vocab_from_iterator(iterator, specials=("<unk>",), specials_first=True):
  counter = Counter()
  for tokens in iterator:
    counter.update(tokens)
  return Vocab(counter, specials=list(specials), specials_first=specials_first)

vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=["<unk>"])
UNK_IDX = vocab.stoi["<unk>"]

# drop-in helpers to “act like” set_default_index + lookup
def lookup_token(token: str) -> int:
  return vocab.stoi.get(token, UNK_IDX)

def tokens_to_ids(tokens):
  return [lookup_token(t) for t in tokens]

In [57]:
def get_tokenized_sentence_and_indices(iterator):
  tokenized_sentence = next(iterator)
  token_indices = [vocab[token] for token in tokenized_sentence]
  return tokenized_sentence, token_indices

tokenized_sentence, token_indices = get_tokenized_sentence_and_indices(my_iterator)
next(my_iterator)  # note: advances the generator once more, so the sentence after this one is skipped

print('Tokenized Sentence:', tokenized_sentence)
print('Token Indices:', token_indices)

Tokenized Sentence: ['named', 'entity', 'recognition', 'with', 'pytorch']
Token Indices: [6, 4, 16, 9, 2]


In [77]:
lines = ["IBM taught me tokenization",
         "Special tokenizers are ready and they will blow your mind",
         "just saying hi!"]

special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

tokenizer_en = get_tokenizer('spacy', language='en_core_web_sm')

tokens = []
max_length = 0

for line in lines:
  tokenized_line = tokenizer_en(line)
  tokenized_line = ['<bos>'] + tokenized_line + ['<eos>']
  tokens.append(tokenized_line)
  max_length = max(max_length, len(tokenized_line))

for i in range(len(tokens)):
  tokens[i] = tokens[i] + ['<pad>'] * (max_length - len(tokens[i]))

print("Lines after adding special tokens:\n", tokens)

# For torchtext 0.10.0
# vocab = build_vocab_from_iterator(tokens, specials=['<unk>'])
# vocab.set_default_index(vocab['<unk>'])

# print('Vocabulary:', vocab.get_itos())
# print('Token IDs for tokenization:', vocab.get_stoi())

# ---- Build vocab the 0.6.0 way ----
counter = Counter()
for seq in tokens:
  counter.update(seq)

def tokens_to_ids(seq):
  return [vocab.stoi.get(t, UNK_IDX) for t in seq]

# 0.6 exposes .itos list (no get_itos())
def ids_to_tokens(id_seq):
  return [vocab.itos[i] if 0 <= i < len(vocab.itos) else '<unk>' for i in id_seq]

# Put specials first so <unk> gets a stable, low index (usually 0)
vocab = Vocab(counter, specials=special_symbols, specials_first=True)

# 0.6 has no set_default_index; keep unk index and use .get(...) when mapping
UNK_IDX = vocab.stoi['<unk>']

# show vocab structures (no get_itos()/get_stoi() in 0.6)
print('Vocabulary:', vocab.itos)   # list -> index -> token
print('Token IDs for tokenization:\n', list(vocab.stoi.items())[:12], '...')  # dict -> token -> index

# # example: encode/decode first line
ids = tokens_to_ids(tokens[0])
print('Encoded:', ids)
print('Decoded:', ids_to_tokens(ids))

Lines after adding special tokens:
 [['<bos>', 'IBM', 'taught', 'me', 'tokenization', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'Special', 'tokenizers', 'are', 'ready', 'and', 'they', 'will', 'blow', 'your', 'mind', '<eos>'], ['<bos>', 'just', 'saying', 'hi', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]
Vocabulary: ['<unk>', '<pad>', '<bos>', '<eos>', '!', 'IBM', 'Special', 'and', 'are', 'blow', 'hi', 'just', 'me', 'mind', 'ready', 'saying', 'taught', 'they', 'tokenization', 'tokenizers', 'will', 'your']
Token IDs for tokenization:
 [('<unk>', 0), ('<pad>', 1), ('<bos>', 2), ('<eos>', 3), ('!', 4), ('IBM', 5), ('Special', 6), ('and', 7), ('are', 8), ('blow', 9), ('hi', 10), ('just', 11)] ...
Encoded: [2, 5, 16, 12, 18, 3, 1, 1, 1, 1, 1, 1]
Decoded: ['<bos>', 'IBM', 'taught', 'me', 'tokenization', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


In [76]:
new_line = "I learned about embeddings and attention mechanisms."

tokenized_new_line = tokenizer_en(new_line)
tokenized_new_line = ['<bos>'] + tokenized_new_line + ['<eos>']

new_line_padded = tokenized_new_line + ['<pad>'] * (max_length - len(tokenized_new_line))

# For 0.10.0
# new_line_ids = [vocab[token] if token in vocab else vocab['<unk>'] for token in new_line_padded]

# For 0.6.0
new_line_ids = [vocab.stoi.get(tok, UNK_IDX) for tok in new_line_padded]

print("Token IDs for new line:", new_line_ids)
print(new_line_padded)

Token IDs for new line: [2, 0, 0, 0, 0, 7, 0, 0, 0, 3, 1, 1]
['<bos>', 'I', 'learned', 'about', 'embeddings', 'and', 'attention', 'mechanisms', '.', '<eos>', '<pad>', '<pad>']


## Exercise

- Objective: Evaluate and compare the tokenization capabilities of four different NLP libraries (`nltk`, `spaCy`, `BertTokenizer`, and `XLNetTokenizer`) by analyzing the frequency of tokenized words and measuring the processing time for each tool using `datetime`.
- Text for tokenization is as below:


In [82]:
from collections import Counter
from datetime import datetime

text = """
Going through the world of tokenization has been like walking through a huge maze made of words, symbols, and meanings. Each turn shows a bit more about the cool ways computers learn to understand our language. And while I'm still finding my way through it, the journey's been enlightening and, honestly, a bunch of fun.
Eager to see where this learning path takes me next!"
"""

def show_frequencies(tokens, method_name):
  print(f"{method_name} Token Frequencies: {dict(Counter(tokens))}\n")

# NLTK Tokenization
start_time = datetime.now()
nltk_tokens = nltk.word_tokenize(text)
nltk_time = datetime.now() - start_time
print(f"NLTK tokens: {nltk_tokens}\nTime Taken: {nltk_time} seconds")
show_frequencies(nltk_tokens, 'NLTK')

# SpaCy Tokenization
nlp = spacy.load('en_core_web_sm')
start_time = datetime.now()
spacy_tokens = [token.text for token in nlp(text)]
spacy_time = datetime.now() - start_time
print(f"SpaCy tokens: {spacy_tokens}\nTime Taken: {spacy_time} seconds")
show_frequencies(spacy_tokens, 'SpaCy')

# BertTokenizer Tokenization
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
start_time = datetime.now()
bert_tokens = bert_tokenizer.tokenize(text)
bert_time = datetime.now() - start_time
print(f"Bert tokens: {bert_tokens}\nTime Taken: {bert_time} seconds")
show_frequencies(bert_tokens, 'Bert')

# XLNetTokenizer Tokenization
xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
start_time = datetime.now()
xlnet_tokens = xlnet_tokenizer.tokenize(text)
xlnet_time = datetime.now() - start_time
print(f"XLNet tokens: {xlnet_tokens}\nTime Taken: {xlnet_time} seconds")
show_frequencies(xlnet_tokens, 'XLNet')



NLTK tokens: ['Going', 'through', 'the', 'world', 'of', 'tokenization', 'has', 'been', 'like', 'walking', 'through', 'a', 'huge', 'maze', 'made', 'of', 'words', ',', 'symbols', ',', 'and', 'meanings', '.', 'Each', 'turn', 'shows', 'a', 'bit', 'more', 'about', 'the', 'cool', 'ways', 'computers', 'learn', 'to', 'understand', 'our', 'language', '.', 'And', 'while', 'I', "'m", 'still', 'finding', 'my', 'way', 'through', 'it', ',', 'the', 'journey', "'s", 'been', 'enlightening', 'and', ',', 'honestly', ',', 'a', 'bunch', 'of', 'fun', '.', 'Eager', 'to', 'see', 'where', 'this', 'learning', 'path', 'takes', 'me', 'next', '!', "''"]
Time Taken: 0:00:00.000424 seconds
NLTK Token Frequencies: {'Going': 1, 'through': 3, 'the': 3, 'world': 1, 'of': 3, 'tokenization': 1, 'has': 1, 'been': 2, 'like': 1, 'walking': 1, 'a': 3, 'huge': 1, 'maze': 1, 'made': 1, 'words': 1, ',': 5, 'symbols': 1, 'and': 2, 'meanings': 1, '.': 3, 'Each': 1, 'turn': 1, 'shows': 1, 'bit': 1, 'more': 1, 'about': 1, 'cool': 1,