# Implementation of Byte Pair Encoding (BPE)
Source code: https://github.com/neubig/anlp-code/blob/main/02-subwords/bpe.ipynb


Byte Pair Encoding (BPE) addresses the problem of conjugation in language processing. In the bag-of-words method, a text is represented as a sparse vector of word frequencies, and training adjusts a weight for each exact word form. This means the weight for a conjugation or a different tense of the same word is not updated unless that exact form appears in the training data.

To help models recognize conjugations, one approach, used by Sennrich et al. (2015), is to **split** a word into subwords:
  
*ex: expanding --> expand_ing*

Splitting text into such units is known as tokenization.


**Methodology**

BPE takes in a dictionary of words and their corresponding frequencies, and finds the most frequent pairs of adjacent symbols (hence the name Pair Encoding).

1. Loop through the input dictionary and break each word down into characters
2. Count every pair of adjacent characters, adding the word's frequency to that pair's count in the output dictionary
3. Sort the pairs by frequency in decreasing order
4. Merge the most frequent pair into a single symbol everywhere it occurs in the vocabulary
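
Step 1 assumes the input dictionary already exists. As a minimal sketch of how such a dictionary could be built from raw text: `build_vocab` is a hypothetical helper, not part of the original notebook, and it assumes words are separated by whitespace. The `</w>` end-of-word marker follows the format of the test vocabulary used in this notebook.

```python
from collections import Counter

def build_vocab(text):
    """Hypothetical helper: map each word to a space-separated symbol string and its count."""
    counts = Counter(text.split())  # assumes whitespace-separated words
    # 'low' -> 'l o w </w>': one symbol per character, plus an end-of-word marker
    return {" ".join(word) + " </w>": freq for word, freq in counts.items()}

vocab = build_vocab("low low lower wide")
print(vocab)
# {'l o w </w>': 2, 'l o w e r </w>': 1, 'w i d e </w>': 1}
```

Keeping the symbols separated by spaces means a later `word.split()` recovers the current symbols of a word, whether they are single characters or merged units.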

## Intuition

Conjugated words are similar to their non-conjugated forms. If the parts they share (the more common, non-conjugated stems) can be grouped into single units, the machine can process a conjugated word much like its non-conjugated version. BPE tries to accomplish that.

Start by breaking every word into characters and finding the most frequent pair. Merge that pair into a new symbol (this is how the vocabulary is formed), then repeat the process until there are no more pairs to merge or a chosen threshold is reached.

By then, conjugated words share subword units with their non-conjugated forms, so the machine can treat them as related rather than as unrelated words.
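
The repeat-until-done loop described above can be sketched compactly. This is an illustration under stated assumptions, not the notebook's implementation: `learn_bpe` and `max_merges` are hypothetical names, words are stored as tuples of symbols instead of space-separated strings, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def learn_bpe(word_freqs, max_merges):
    """Hypothetical helper: return the merges learned, in the order they were made."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(max_merges):           # stop at a fixed threshold ...
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += freq
        if not pairs:                      # ... or when no pair is left to merge
            break
        best = max(pairs, key=pairs.get)   # ties: first pair seen wins (assumption)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2, "lowly": 5, "wide": 2}, max_merges=3)
print(merges)
# [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

After two merges the stem `low` is a single symbol shared by `low`, `lower`, `lowest`, and `lowly`, which is the grouping the intuition describes.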

In [None]:
import collections  # provides defaultdict, used for the pair-count dictionary
import re  # used to build the pattern that merges a pair

**get_pair** counts the frequency of every pair of adjacent symbols. A word that has already been merged into a single symbol contributes no pairs.

In [None]:
def get_pair(vocab: dict[str,int]) -> dict[tuple[str,str],int]:
  pairs = collections.defaultdict(int)  # output dictionary: pair -> frequency
  for word, freq in vocab.items():
    chars = word.split()  # split the word into its current symbols
    for i in range(len(chars) - 1):
      pairs[chars[i], chars[i+1]] += freq  # add the word's frequency to the pair's count
  return pairs
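
To make the frequency weighting concrete, here is a small self-contained check. `count_pairs` is a hypothetical stand-in that uses `Counter` and `zip` but is intended to count the same way as `get_pair`; the two-word vocabulary is made up for illustration.

```python
from collections import Counter

def count_pairs(vocab):
    """Hypothetical stand-in for get_pair, written with Counter and zip."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):  # adjacent symbols only
            pairs[pair] += freq
    return pairs

pairs = count_pairs({"l o w </w>": 5, "l o w e r </w>": 2})
print(pairs[("l", "o")])   # 7: the pair occurs in both words (5 + 2)
print(pairs[("w", "e")])   # 2: the pair occurs only in 'l o w e r </w>'

merged = count_pairs({"low</w>": 5})
print(len(merged))         # 0: a fully merged word is one symbol, so it has no pairs
```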

**merge_vocab** merges the most frequent pair in every word. Words that contain the pair have it combined into a single symbol; words without the pair are left unchanged.

In [None]:
def merge_vocab(best: tuple[str,str], v_in: dict[str,int]) -> dict[str,int]:
  v_out = {}  # output dictionary: merged word -> frequency
  bigram = re.escape(' '.join(best))  # the pair as one escaped string, symbols separated by ' '
  # match the pair only when both symbols are whole, i.e. bounded by whitespace or the string ends
  p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
  for word in v_in:
    word_out = p.sub(''.join(best), word)  # replace 'a b' with 'ab' wherever the pair occurs
    v_out[word_out] = v_in[word]
  return v_out
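
The lookarounds `(?<!\S)` and `(?!\S)` in the pattern are easy to overlook, so here is a short sketch of what they guard against. The words `'al o w </w>'` and `'l ow </w>'` are made-up examples in which one neighbouring symbol has already been merged.

```python
import re

pair = ("l", "o")
bigram = re.escape(" ".join(pair))
naive = re.compile(bigram)                             # no symbol-boundary check
bounded = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")  # same pattern as merge_vocab

left = "al o w </w>"   # the 'l' here is the tail of the symbol 'al'
print(naive.sub("".join(pair), left))     # 'alo w </w>'  (wrongly fuses 'al' and 'o')
print(bounded.sub("".join(pair), left))   # 'al o w </w>' (left unchanged)

right = "l ow </w>"    # the 'o' here is the head of the symbol 'ow'
print(naive.sub("".join(pair), right))    # 'low </w>'    (wrongly fuses 'l' and 'ow')
print(bounded.sub("".join(pair), right))  # 'l ow </w>'   (left unchanged)
```

The negative lookbehind requires that the first symbol is not preceded by a non-space character, and the negative lookahead requires the same after the second symbol, so only whole symbols are merged.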

Testing the code

In [None]:
vocab = {
    'l o w </w>' : 5,
    'l o w e r </w>' : 2,
    'l o w e s t </w>':2,
    'l o w l y </w>':5,
    'w i d e </w>':2
}

num_ite = 10
for i in range(num_ite):
  # print the current vocabulary
  print(f"{vocab=}")
  pair = get_pair(vocab)
  # keep the five most frequent pairs, sorted by frequency in decreasing order
  top_pair = sorted(pair.items(), key=lambda x: x[1], reverse=True)[:5]
  print(f"{top_pair=}")
  if not top_pair:  # every word is a single symbol, so nothing is left to merge
    break
  vocab = merge_vocab(top_pair[0][0], vocab)  # merge the most frequent pair
  print("Merge done\n")

vocab={'l o w </w>': 5, 'l o w e r </w>': 2, 'l o w e s t </w>': 2, 'l o w l y </w>': 5, 'w i d e </w>': 2}
top_pair=[(('l', 'o'), 14), (('o', 'w'), 14), (('w', '</w>'), 5), (('w', 'l'), 5), (('l', 'y'), 5)]
Merge done

vocab={'lo w </w>': 5, 'lo w e r </w>': 2, 'lo w e s t </w>': 2, 'lo w l y </w>': 5, 'w i d e </w>': 2}
top_pair=[(('lo', 'w'), 14), (('w', '</w>'), 5), (('w', 'l'), 5), (('l', 'y'), 5), (('y', '</w>'), 5)]
Merge done

vocab={'low </w>': 5, 'low e r </w>': 2, 'low e s t </w>': 2, 'low l y </w>': 5, 'w i d e </w>': 2}
top_pair=[(('low', '</w>'), 5), (('low', 'l'), 5), (('l', 'y'), 5), (('y', '</w>'), 5), (('low', 'e'), 4)]
Merge done

vocab={'low</w>': 5, 'low e r </w>': 2, 'low e s t </w>': 2, 'low l y </w>': 5, 'w i d e </w>': 2}
top_pair=[(('low', 'l'), 5), (('l', 'y'), 5), (('y', '</w>'), 5), (('low', 'e'), 4), (('e', 'r'), 2)]
Merge done

vocab={'low</w>': 5, 'low e r </w>': 2, 'low e s t </w>': 2, 'lowl y </w>': 5, 'w i d e </w>': 2}
top_pair=[(('lowl', 'y'), 5), (