<a href="https://colab.research.google.com/github/tomhar92/NLP-Experiments/blob/master/Word_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Tokenization

The task of word tokenization is all about segmenting running text into words. These tokens are very useful for finding patterns in the text. They are also considered to be a base step for stemming and lemmatization.

# Byte Pair Encoding

Instead of tokenizing words, we can use our text to automatically define what size our tokens should be. This is useful to breakdown words into sub-words because our algorithms will know how to handle unknown words. 

At each step of the algorithm we count the number of symbol pairs, and replace it with the new merged symbol. We continue to count and merge, creating new longer and longer character strings, until we have done K merges. The resulting symbol set will consist of the original set of characters plus k new symbols.

In [3]:
!pip install wikipedia

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-cp36-none-any.whl size=11686 sha256=3d2ef481a55a7a8dbba742d3480242a96d574d41568a3f39af0feb4bf2313157
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [0]:
import wikipedia
import re
import random

In [0]:
def get_word_tokens(words):
  word_tokens = {}
  for word in words:
    new_word = ""
    for char in re.findall("[a-zA-Z]", word):
      if char.isupper():
        new_word = new_word + char.lower();
      else:
        new_word = new_word + char;
    if len(new_word) > 0:
      new_word = new_word + "_";  
    if new_word in word_tokens:
      word_tokens[new_word] = word_tokens[new_word] + 1;
    else:
      word_tokens[new_word] = 1;   
  return word_tokens;

In [0]:
def get_pairs_from_page(text_content):
  words = text_content.split(" ");
  word_tokens = get_word_tokens(words);
  common_pairs = {}
  for key in word_tokens:
    symbols = list(key);
    for i in range(0, len(symbols) - 1):
      if symbols[i]+symbols[i+1] in common_pairs:
        common_pairs[symbols[i]+symbols[i+1]] += 1
      else:
        common_pairs[symbols[i]+symbols[i+1]] = 1
  return common_pairs;       
      

In [63]:
print("Welcome to Byte Pair Generator!")
number_of_pairs = int(input("Please enter the required number of Pairs"))
number_of_wiki_pages = int(input("Please enter the number of Wiki pages you would like to scan"))

Welcome to Byte Pair Generator!
Please enter the required number of Pairs10
Please enter the number of Wiki pages you would like to scan10


In [66]:
common_pairs = {}
page_counter = 1;
while page_counter <= number_of_wiki_pages:
  page_title = wikipedia.random(pages = 1);
  try:  
    text_content = wikipedia.WikipediaPage(title=page_title).content;
  except wikipedia.exceptions.DisambiguationError as e:
    page_title = e.options[0];
    text_content = wikipedia.WikipediaPage(title=page_title).content;
  print("Page Number "+str(page_counter)+" is: "+page_title);
  pairs_from_page = get_pairs_from_page(text_content);
  for pair in pairs_from_page.items():
    if pair[0] in common_pairs:
      common_pairs[pair[0]] += pair[1];
    else:
      common_pairs[pair[0]] = pair[1];
  page_counter += 1;
print("The "+str(number_of_pairs)+ " most common pairs are:")
sorted_pairs = sorted(common_pairs.items(), key=lambda item: item[1], reverse=True)
result_chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'];
for i in range(number_of_pairs):
    print(sorted_pairs[i]);
    result_chars.append(sorted_pairs[i][0]);
print("The resulted vocabulary of characters:")
print(result_chars)

Page Number 1 is: Downtown LaPorte Historic District
Page Number 2 is: Maud Mary Brindley
Page Number 3 is: Sanju Yadav
Page Number 4 is: Lenny Walls
Page Number 5 is: Lene Mykjåland
Page Number 6 is: Jordan Brown (footballer, born 1991)
Page Number 7 is: Zeta Phi Eta
Page Number 8 is: Tsuruga Nursing University
Page Number 9 is: Westminster School District
Page Number 10 is: Lime kilns, Oeiras, Portugal
The 10 most common pairs are:
('e_', 258)
('s_', 237)
('er', 232)
('in', 227)
('d_', 206)
('n_', 186)
('on', 164)
('re', 163)
('ed', 150)
('te', 145)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'e_', 's_', 'er', 'in', 'd_', 'n_', 'on', 're', 'ed', 'te']
