## Tokenize using NLTK

NLTK (Natural Language Toolkit) is a Python library: http://www.nltk.org/.

*Tokenization*, or text segmentation, is the process of transforming a text into a list of words; these words are called *tokens*.

In [30]:
# The simplest way to divide a text into a list of words is to use "split"
text = "Let's to see the TV, please!"
print(text.split())

["Let's", 'to', 'see', 'the', 'TV,', 'please!']


With this approach, contractions are not handled and punctuation marks stay attached to the nearest word.
A more reliable way to tokenize is to use a dedicated tokenizer. Most NLP libraries offer their own tokenizers.
Here we are going to use NLTK.

In [31]:
# !pip install nltk

## Apply the WordPunctTokenizer

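With the `WordPunctTokenizer` we obtain a different result: punctuation marks are now handled as separate tokens, and the contraction `Let's` is split into `Let`, `'`, and `s`.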

In [32]:
from nltk.tokenize import WordPunctTokenizer

tokens = WordPunctTokenizer().tokenize("Let's to see the film on the TV, Please Mumy!.")

print(tokens)

['Let', "'", 's', 'to', 'see', 'the', 'film', 'on', 'the', 'TV', ',', 'Please', 'Mumy', '!.']
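
NLTK documents `WordPunctTokenizer` as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. As a minimal sketch, the same split can be reproduced with the standard `re` module alone. The helper name `word_punct_tokenize` is made up for this illustration and is not part of NLTK.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return WORD_PUNCT_PATTERN.findall(text)

tokens = word_punct_tokenize("Let's to see the film on the TV, Please Mumy!.")
print(tokens)
```

Because consecutive punctuation characters form a single run, `!.` stays one token, which matches the output above.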


## Tokenize a text from Wikipedia

In [33]:
!pip install wordcloud



In [34]:
import requests

def wikipedia_page(title):
  '''
  This function returns the raw text of a wikipedia page
  given a wikipedia page title
  '''
  params = {
      'action': 'query',
      'format': 'json', # request json formatted content
      'titles': title, #title of the wikipedia page
      'prop': 'extracts',
      'explaintext': True
  }
  headers = {"User-Agent": ""}
  # send a request to the wikipedia api
  response = requests.get(
      'https://en.wikipedia.org/w/api.php',  
      params= params, headers = headers
  ).json()

  #Parse the result
  page = next(iter(response['query']['pages'].values()))
  # return the page content
  if 'extract' in page:
    return page['extract']
  else:
    return "Page not found"

# we lowercase the text to avoid having to deal with uppercase and capitalized words
text = wikipedia_page('Carbon capture and storage').lower()
print(text)

carbon capture and storage (ccs) or carbon capture and sequestration is the process of capturing carbon dioxide (co2) before it enters the atmosphere, transporting it, and storing it (carbon sequestration) for centuries or millennia. usually the co2 is captured from large point sources, such as coal-fired power plant, a chemical plant or biomass power plant, and then stored in an underground geological formation. the aim is to prevent the release of co2 from heavy industry with the intent of mitigating the effects of climate change. although co2 has been injected into geological formations for several decades for various purposes, including enhanced oil recovery, the long-term storage of co2 is a relatively new concept. carbon capture and utilization (ccu) and ccs are sometimes discussed collectively as carbon capture, utilization, and sequestration (ccus). this is because ccs is a relatively expensive process yielding a product with an intrinsic low value (i.e. co2). hence, carbon cap
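
The parsing step in `wikipedia_page` can be tried without network access. The sketch below feeds it two hand-written dictionaries that imitate the shape of the API's JSON reply. The page ids, the `missing` key, and the helper name `extract_text` are assumptions made for illustration, not captured API output.

```python
def extract_text(response):
    """Hypothetical helper mirroring the parsing done in wikipedia_page."""
    # 'pages' maps a page id to a page record; take the first (only) record
    page = next(iter(response['query']['pages'].values()))
    # Fall back to the same message as wikipedia_page when no text is present
    return page.get('extract', "Page not found")

# Hand-written stand-ins for API replies (assumed shape, not real output)
found = {'query': {'pages': {'123': {'title': 'Example', 'extract': 'Some text.'}}}}
missing = {'query': {'pages': {'-1': {'title': 'Nope', 'missing': ''}}}}

print(extract_text(found))    # text of the page
print(extract_text(missing))  # fallback message
```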

In [35]:
from collections import Counter
from nltk.tokenize import WordPunctTokenizer

text = wikipedia_page('Carbon capture and storage').lower()
tokens = WordPunctTokenizer().tokenize(text)
print(Counter(tokens).most_common(20))

[('the', 482), ('.', 438), (',', 352), ('of', 250), ('and', 232), ('in', 209), ('to', 208), ('co2', 191), ('a', 177), ('is', 160), ('for', 113), ('-', 113), ('capture', 101), ('ccs', 92), ('from', 73), ('carbon', 72), ('(', 72), ('gas', 69), ('be', 66), ('plant', 63)]


## Tokenize on Characters or Syllables

Tokenization is not restricted to words and punctuation. It can also split a text into subwords or into the individual characters of each word.

In [36]:
# Example of character tokenization
char_tokens = list(text)
print("Most common characters in the text")
print(Counter(char_tokens).most_common(20))

print()
print("All characters in the text:")
print(set(char_tokens))

Most common characters in the text
[(' ', 8620), ('e', 5365), ('t', 3958), ('o', 3717), ('a', 3655), ('i', 3392), ('n', 3257), ('r', 3093), ('s', 2968), ('c', 2393), ('l', 1945), ('d', 1527), ('h', 1370), ('u', 1316), ('p', 1309), ('m', 1112), ('f', 942), ('g', 923), ('b', 678), ('y', 608)]

All characters in the text:
{'g', 'l', 'p', ']', 'e', 'u', 'j', '(', ')', '—', 'n', ':', '6', "'", 'ß', '8', '7', '0', '4', 'ü', '.', 'z', '1', 'ö', 't', 'm', '°', 'k', '”', '3', '€', '+', '$', 's', 'o', ',', '“', '9', 'i', '\n', 'y', 'b', 'w', 'h', 'd', '=', ';', 'c', 'q', 'f', '5', '–', '%', 'r', 'ø', 'a', '’', '"', 'v', '″', '[', '‘', '&', '-', '/', 'x', ' ', '2'}


* **Character tokenization** is useful for tasks such as spell checking.
* **Subword tokenization** is used in recent NLP models such as BERT.
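
To make subword tokenization concrete, here is a toy sketch of greedy longest-match splitting in the style of WordPiece, the scheme used by BERT. The vocabulary, the `##` continuation prefix handling, and the function name are illustrative assumptions. A real model ships its own learned vocabulary and applies extra normalization steps that are not shown here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Toy greedy longest-match-first subword split (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical miniature vocabulary
toy_vocab = {"token", "##ization", "##ize", "wood", "##chuck", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("woodchucks", toy_vocab))
print(wordpiece_tokenize("carbon", toy_vocab))
```

With this vocabulary, `tokenization` splits into `token` and `##ization`, while `carbon` has no matching pieces and falls back to the unknown token.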

## Tokenize on N-Grams

It is sometimes useful to consider groups of two words (*bigrams*), three words (*trigrams*), and so on. In general, groups of consecutive words treated as a single token are called **n-grams**.

In [37]:
from nltk import ngrams
from nltk.tokenize import WordPunctTokenizer

text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Tokenize
tokens = WordPunctTokenizer().tokenize(text)

# Bigrams
bigrams = list(ngrams(tokens, n=2))
print(bigrams)

[('How', 'much'), ('much', 'wood'), ('wood', 'would'), ('would', 'a'), ('a', 'woodchuck'), ('woodchuck', 'chuck'), ('chuck', 'if'), ('if', 'a'), ('a', 'woodchuck'), ('woodchuck', 'could'), ('could', 'chuck'), ('chuck', 'wood'), ('wood', '?')]


In [38]:
# trigrams
trigrams = ['_'.join(w) for w in ngrams(tokens, n=3)]
print(trigrams)

['How_much_wood', 'much_wood_would', 'wood_would_a', 'would_a_woodchuck', 'a_woodchuck_chuck', 'woodchuck_chuck_if', 'chuck_if_a', 'if_a_woodchuck', 'a_woodchuck_could', 'woodchuck_could_chuck', 'could_chuck_wood', 'chuck_wood_?']
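
An n-gram extractor is essentially a sliding window over the token list. As a rough plain-Python sketch of that idea (the helper name `sliding_ngrams` is hypothetical, and this is not how NLTK implements `ngrams` internally):

```python
def sliding_ngrams(tokens, n):
    """Hypothetical helper: return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest shifted copy, so N tokens give N - n + 1 n-grams
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["How", "much", "wood", "would", "a", "woodchuck", "chuck"]
bigrams = sliding_ngrams(tokens, 2)
trigrams = ["_".join(w) for w in sliding_ngrams(tokens, 3)]

print(bigrams[:2])
print(trigrams[:2])
print(len(tokens), len(bigrams), len(trigrams))
```

A list of N tokens therefore yields N - 1 bigrams and N - 2 trigrams.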


## Add ngrams to list of tokens

Add the bigrams and trigrams to the list of tokens from the Wikipedia 'Carbon capture and storage' page and look at the n-gram frequencies.

In [43]:
text = wikipedia_page('Carbon capture and storage').lower()
unigrams = WordPunctTokenizer().tokenize(text)
bigrams = ['_'.join(w) for w in ngrams(unigrams, n=2)]
trigrams = ['_'.join(w) for w in ngrams(unigrams, n=3)]

In [44]:
tokens = unigrams + bigrams + trigrams

In [45]:
print(f"we have a total of {len(tokens)} tokens, including: \n- {len(unigrams)} unigrams \n- {len(bigrams)} bigrams \n- {len(trigrams)} trigrams.")

we have a total of 30795 tokens, including: 
- 10266 unigrams 
- 10265 bigrams 
- 10264 trigrams.


In [46]:
Counter(tokens).most_common(50)

[('the', 482),
 ('.', 438),
 (',', 352),
 ('of', 250),
 ('and', 232),
 ('in', 209),
 ('to', 208),
 ('co2', 191),
 ('a', 177),
 ('is', 160),
 ('for', 113),
 ('-', 113),
 ('capture', 101),
 ('ccs', 92),
 ('._the', 83),
 ('from', 73),
 ('carbon', 72),
 ('(', 72),
 ('gas', 69),
 ('be', 66),
 ('plant', 63),
 ('with', 59),
 ('project', 59),
 ('===', 58),
 ('as', 57),
 ('storage', 54),
 ('that', 54),
 ('by', 52),
 ('energy', 49),
 ('at', 49),
 ('it', 47),
 ('power', 47),
 ('are', 47),
 ('was', 44),
 (')', 43),
 ('====', 42),
 ('in_the', 42),
 ('oil', 41),
 ('or', 40),
 ('coal', 40),
 ('carbon_capture', 40),
 ('can', 38),
 ('of_co2', 38),
 ('on', 37),
 (',_and', 36),
 ('this', 35),
 ('of_the', 35),
 (',_the', 33),
 ('==', 32),
 ('s', 32)]
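
In the combined ranking above, single words and punctuation crowd out most n-grams. One possible way to look at n-gram frequencies directly is to count the bigram list on its own. The sketch below does this on a short made-up sentence so that it runs without downloading the Wikipedia page. The sample text and variable names are illustrative assumptions.

```python
from collections import Counter

# Made-up sample standing in for the tokenized Wikipedia page
sample = "the capture of co2 and the storage of co2".split()

# Build bigrams the same way as above: join adjacent tokens with "_"
bigrams = ["_".join(pair) for pair in zip(sample, sample[1:])]

# Count bigrams separately so unigrams cannot dominate the ranking
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))
```

The same pattern applies to the `bigrams` and `trigrams` lists built from the full page.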