About
-----

One of the things that makes BERT so flexible is its ability to handle out-of-vocabulary (OOV) words. When the tokenizer comes across a word that isn't in its vocabulary, it breaks that word into smaller "subwords" that are in the vocabulary. These subwords become the tokenized representation of the word.

But how many ways can a word be chopped up? What if we prevented BERT from ever using whole words? In what ways might a word's subwords differ or relate to one another in the embedding space? What, in short, would the embedding space of subwords look like?


In [1]:
%%capture
!pip install transformers
!pip install bertviz

In [2]:
import numpy as np
import re
import string
from itertools import combinations, chain

from transformers import BertTokenizer, BertModel
import torch
from bertviz import head_view, model_view

Load and initialize the tokenizer and model. Get all subwords from the tokenizer vocabulary.

In [3]:
%%capture
model_info = {'type': 'bert', 'version': 'bert-base-uncased'}

model = BertModel.from_pretrained(model_info['version'], output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_info['version'], do_lower_case=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
vocab = list(tokenizer.get_vocab().keys())
# Continuation subwords are the entries that start with "##"
subword_vocab = [token for token in vocab if token.startswith("##")]

In [5]:
tokenizer.tokenize("programmatic tokenization")

['program', '##matic', 'token', '##ization']
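
The split above is what WordPiece's greedy longest-match-first lookup produces: starting from the left, it takes the longest vocabulary entry that matches, marks every non-initial piece with `##`, and repeats on the remainder. The sketch below imitates that procedure on a tiny hand-made vocabulary. `wordpiece_sketch` and `toy_vocab` are illustrative names that do not appear elsewhere in this notebook, and the real tokenizer additionally handles normalization, punctuation, and a maximum word length.

```python
def wordpiece_sketch(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (toy version)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hand-made vocabulary (an assumption for illustration, not BERT's real one)
toy_vocab = {"program", "##matic", "token", "##ization", "pro", "##gram"}

print(wordpiece_sketch("programmatic", toy_vocab))
print(wordpiece_sketch("tokenization", toy_vocab))
print(wordpiece_sketch("xyz", toy_vocab))
```

Because the lookup is greedy, `program` wins over the shorter `pro` even though both are in the toy vocabulary. The rest of this notebook sidesteps that greedy choice and enumerates every possible split instead.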

Set up the functions for finding and encoding every possible subword combination of a given word.

In [6]:
def partition(word):
  # Length of word
  n = len(word)
  # The first and last index position, along with the interior cut points
  beg, mid, end = [0], list(range(1, n)), [n]
  # Build a generator over every choice of i cut points, for i = 0 .. n-1
  splits = (split for i in range(n) for split in combinations(mid, i))
  partitions = []
  # For every split in the generator...
  for split in splits:
    # ...pair each start index with the next end index and slice the word
    pieces = [word[sl] for sl in map(slice, chain(beg, split), chain(split, end))]
    # ...append to the partitions list
    partitions.append(pieces)
  return partitions

def get_subwords(word, subword_vocab):
  subwords = []
  # For every word in the subword vocabulary...
  for subword in subword_vocab:
    # ...strip its hashes
    no_hash = re.sub(r'#{1,3}', '', subword)
    # If the unhashed subword is in the input word...
    if (no_hash in word) and (no_hash != ''):
      # ...append it to the subwords list
      subwords.append(subword)
  return subwords

def in_vocab(partitions, subwords):
  # Build a dictionary of unhashed subwords
  hash_dict = {re.sub(r'#{1,3}', '', subword): subword for subword in subwords}
  result = []
  # For each partition....
  for partition in partitions:
    # ...if there is a corresponding subword for each of its segments
    try:
      # ...append the special start and end tokens BERT requires, then add
      # the result to the final list
      partition = [hash_dict[spl] for spl in partition]
      partition = ['[CLS]'] + partition + ['[SEP]']
      result.append(partition)
    # ...otherwise, continue
    except KeyError:
      continue
  return result

def to_ids(valid_tokens, tokenizer, to_pad = False, pad_len = 15):
  result = []
  tokens = []
  # For every list of subword partitions...
  for seq in valid_tokens:
    # ...get the token ids from the tokenizer
    token_ids = [tokenizer.convert_tokens_to_ids(token) for token in seq]
    # ...attend to every real token
    attention_mask = [1] * len(token_ids)
    # If padding is requested and the sequence is shorter than the padding length...
    if to_pad and len(seq) < pad_len:
      n_pad = pad_len - len(seq)
      # ...pad the ids with the tokenizer's padding id and mask the padding out
      token_ids = token_ids + [tokenizer.pad_token_id] * n_pad
      attention_mask = attention_mask + [0] * n_pad
      # ...pad the tokens as well
      seq = seq + [tokenizer.pad_token] * n_pad
    # Token type ids are not used (single segment), so they are all zeros
    token_type_ids = [0] * len(token_ids)
    # Then combine everything into a dictionary, converting the integer lists to
    # tensors
    seq_dict = {
        'input_ids': torch.tensor([token_ids]),
        'token_type_ids': torch.tensor([token_type_ids]),
        'attention_mask': torch.tensor([attention_mask])
        }
    # ...append the dictionary to the final list
    result.append(seq_dict)
    # ...append the (possibly padded) tokens to a list as well
    tokens.append(seq)
  return result, tokens
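
To make the first steps of this pipeline concrete, here is a small self-contained sketch on a toy vocabulary: enumerate every contiguous split of a word, then keep only the splits whose pieces all exist as `##`-prefixed entries. The names `all_splits`, `keep_valid`, and `toy_subwords` are illustrative and not part of the notebook, and the toy vocabulary is made up rather than taken from BERT.

```python
from itertools import combinations

def all_splits(word):
    """Return every way to cut `word` into contiguous, non-empty pieces."""
    n = len(word)
    result = []
    # Choose k interior cut points, for k = 0 .. n-1
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return result

def keep_valid(splits, subwords):
    """Keep splits whose pieces all have a '##' entry; return the '##' forms."""
    lookup = {s[2:]: s for s in subwords if s.startswith("##")}
    return [[lookup[p] for p in split]
            for split in splits
            if all(p in lookup for p in split)]

# Made-up subword vocabulary (assumption for illustration only)
toy_subwords = ["##c", "##a", "##t", "##at"]

splits = all_splits("cat")
valid = keep_valid(splits, toy_subwords)

print(splits)
print(valid)
print(len(all_splits("tokenization")))
```

A word of n characters has n - 1 places where a cut can go, so it has 2^(n-1) possible splits: 4 for "cat" and 2,048 for the 12-letter "tokenization". Only a fraction of those survive the vocabulary filter.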
  

Encode all the valid partitions of a word. Randomly select one of them and view its attention.

In [7]:
def view_subwords(word, rand_select = True, idx = None):
  partitions = partition(word)
  subwords = get_subwords(word, subword_vocab)
  valid_tokens = in_vocab(partitions, subwords)
  valid_tokens, token_strings = to_ids(valid_tokens, tokenizer)

  # Either use a randomly selected index from the list of partitions
  if rand_select:
    idx = np.random.choice(range(len(valid_tokens)))

  # Or select one yourself, but make sure it's valid
  if idx is None or not 0 <= idx < len(valid_tokens):
    raise IndexError("Index not in list of partitions")

  attention = model(**valid_tokens[idx])[-1]
  model_view(attention, token_strings[idx])

view_subwords("tokenization")
