<a href="https://colab.research.google.com/github/tomhar92/NLP-Experiments/blob/master/Word_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Tokenization

The task of word tokenization is all about segmenting running text into words. These tokens are very useful for finding patterns in the text. They are also considered to be a base step for stemming and lemmatization.

# Byte Pair Encoding

Instead of tokenizing words, we can use our text to automatically define what size our tokens should be. This is useful to breakdown words into sub-words because our algorithms will know how to handle unknown words. 

At each step of the algorithm we count the number of symbol pairs, and replace it with the new merged symbol. We continue to count and merge, creating new longer and longer character strings, until we have done K merges. The resulting symbol set will consist of the original set of characters plus k new symbols.

In [3]:
!pip install wikipedia

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-cp36-none-any.whl size=11686 sha256=3d2ef481a55a7a8dbba742d3480242a96d574d41568a3f39af0feb4bf2313157
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [0]:
import wikipedia
import re
import random

In [0]:
def get_word_tokens(words):
  word_tokens = {}
  for word in words:
    new_word = ""
    for char in re.findall("[a-zA-Z]", word):
      if char.isupper():
        new_word = new_word + char.lower();
      else:
        new_word = new_word + char;
    if len(new_word) > 0:
      new_word = new_word + "_";  
    if new_word in word_tokens:
      word_tokens[new_word] = word_tokens[new_word] + 1;
    else:
      word_tokens[new_word] = 1;   
  return word_tokens;

In [0]:
def get_pairs_from_page(text_content, length_of_pairs):
  words = text_content.split(" ");
  word_tokens = get_word_tokens(words);
  common_pairs = {}
  for key in word_tokens:
    symbols = list(key);
    limit = 1;
    while limit <= length_of_pairs:
      for i in range(0, len(symbols) - limit):
        pair = '';
        for j in range(limit):
          pair = pair + symbols[i+j];
        if pair in common_pairs:
          common_pairs[pair] += 1
        else:
          common_pairs[pair] = 1
      limit = limit + 1;
  return common_pairs; 
      

In [0]:
def get_page_contents(number_of_wiki_pages):
  page_contents = [];
  while len(page_contents) < number_of_wiki_pages:
    page_title = wikipedia.random(pages = 1);
    try:  
      text_content = wikipedia.WikipediaPage(title=page_title).content;
    except wikipedia.exceptions.DisambiguationError as e:
      try:
        page_title = e.options[0];
        text_content = wikipedia.WikipediaPage(title=page_title).content;
      except wikipedia.exceptions.WikipediaException as e:
        continue;
    print("Page Number "+str(len(page_contents) + 1)+" is: "+page_title);
    page_contents.append(text_content);
  return page_contents;

In [75]:
print("Welcome to Byte Pair Generator!")
number_of_pairs = int(input("Please enter the required number of Pairs"))
number_of_wiki_pages = int(input("Please enter the number of Wiki pages you would like to scan"))

Welcome to Byte Pair Generator!
Please enter the required number of Pairs1000
Please enter the number of Wiki pages you would like to scan5


In [100]:
common_pairs = {}
pair_length = 1;
page_contents = get_page_contents(number_of_wiki_pages);
while len(common_pairs) < number_of_pairs:
  pair_length = pair_length + 1;
  print("Pair Length: "+str(pair_length))
  for i in range(len(page_contents)):
    pairs_from_page = get_pairs_from_page(page_contents[i], pair_length);
    for pair in pairs_from_page.items():
      if pair[0] in common_pairs:
        common_pairs[pair[0]] += pair[1];
      else:
        common_pairs[pair[0]] = pair[1];
    print("Length of pairs: "+str(len(common_pairs)))
sorted_pairs = sorted(common_pairs.items(), key=lambda item: item[1], reverse=True)
result_chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'];
for i in range(number_of_pairs):
  result_chars.append(sorted_pairs[i][0]);
print("The resulted vocabulary of characters:")
print(result_chars)

Page Number 1 is: Bill Weigand
Page Number 2 is: Caledonomorpha
Page Number 3 is: Gary Kuo




  lis = BeautifulSoup(html).find_all('li')


Page Number 4 is: Bojan Dimitrijević (actor)
Page Number 5 is: Fengguiwei Fort
Pair Length: 2
Length of pairs: 196
Length of pairs: 230
Length of pairs: 308
Length of pairs: 331
Length of pairs: 354
Pair Length: 3
Length of pairs: 677
Length of pairs: 771
Length of pairs: 1111
Length of pairs: 1239
Length of pairs: 1469
The resulted vocabulary of characters:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'e', 'i', 'a', 'n', 'r', 'o', 't', 's', 'l', 'd', 'c', 'm', 'h', 'u', 'g', 'er', 'in', 'f', 'w', 'p', 'an', 'b', 're', 'or', 'ed', 'on', 'y', 'k', 'v', 'te', 'es', 'at', 'ng', 'en', 'ar', 'ti', 'he', 'j', 'th', 'li', 'se', 'al', 'le', 'is', 'it', 'ic', 'om', 'st', 'si', 'll', 'ri', 'mi', 'io', 'ca', 'na', 've', 'me', 'ro', 'ch', 'ra', 'ce', 'ie', 'nd', 'as', 'ma', 'to', 'el', 'de', 'hi', 'co', 'di', 'ne', 'ni', 'no', 'ia', 'fo', 'la', 'ac', 'il', 'ol', 'mo', 'ta', 'ea', 'am', 'rn', 'rt', 'vi', 'ge', 'da