# Tokenization using NLTK

The goal of this notebook is to get a basic idea of the different tokenization methods available in NLTK: sent_tokenize, word_tokenize, wordpunct_tokenize, and TreebankWordTokenizer.

In [1]:
!pip install nltk



In [32]:
import nltk

# Creating the corpus

In [41]:
corpus = """This is an example corpus. A corpus is a collection of authentic, machine-readable texts that can be used to train AI and machine learning systems.

Next Paragraph. Corpora can be made up of a variety of materials, including:
Newspapers, Novels, Recipes, Radio broadcasts, Television shows, Movies, and Tweets.

Adding words with apostrophes to see how it works - it's can't haven't.
Adding words with paranthesis - (inside)"""

# Decompose corpus into a list of sentences using sent_tokenize

In [42]:
from nltk.tokenize import sent_tokenize

In [43]:
sentences = sent_tokenize(corpus)
sentences

['This is an example corpus.',
 'A corpus is a collection of authentic, machine-readable texts that can be used to train AI and machine learning systems.',
 'Next Paragraph.',
 'Corpora can be made up of a variety of materials, including:\nNewspapers, Novels, Recipes, Radio broadcasts, Television shows, Movies, and Tweets.',
 "Adding words with apostrophes to see how it works - it's can't haven't.",
 'Adding words with paranthesis - (inside)']

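sent_tokenize uses the pre-trained Punkt model to find sentence boundaries. In this corpus every boundary falls after a period, but the line break after "including:" does not start a new sentence, and the final sentence is returned even though it has no closing period.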
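
For contrast, here is a minimal sketch of a naive splitter, which shows why splitting on periods alone is not enough. The `naive_sentence_split` helper is hypothetical (it is not part of NLTK) and simply breaks after `.`, `!`, or `?` followed by whitespace. It wrongly breaks after the abbreviation "Dr.", which is the kind of case a trained model such as Punkt is meant to handle.

```python
import re

def naive_sentence_split(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sentence_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```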

# Decompose corpus into a list of words using TreebankWordTokenizer

### TreebankWordTokenizer

Origin: Based on the Penn Treebank tokenization conventions, which are widely used in parsing and tagging corpora.

Rules: It applies a sequence of regular-expression rules designed to produce tokens compatible with the Penn Treebank corpus, so punctuation and contractions are split the way the treebank format expects. It assumes the input has already been split into sentences.

Features:

* Splits standard contractions (e.g., "can't" -> ["ca", "n't"]).
* Separates punctuation from words (e.g., "Hello!" -> ["Hello", "!"]).
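* Separates a period only at the very end of the input, so periods inside abbreviations and numbers stay attached to the preceding token (as do sentence-final periods in the middle of a multi-sentence string).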
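
To give a feel for how this kind of rule-based splitting works, here is a minimal sketch of two such rules written with the standard `re` module. The `treebank_like_split` function and both patterns are simplified assumptions made for illustration. They are not NLTK's actual rules, which are more numerous and more careful.

```python
import re

# Simplified, hypothetical rules (NOT the ones NLTK uses internally).
# Rule 1: put a space before common English contraction suffixes.
CONTRACTION = re.compile(r"(\w)(n't|'s|'re|'ve|'ll|'d|'m)\b", re.IGNORECASE)
# Rule 2: surround a few punctuation marks with spaces.
PUNCTUATION = re.compile(r"([!?,;:])")

def treebank_like_split(text):
    """Apply the two rules in order, then split on whitespace."""
    text = CONTRACTION.sub(r"\1 \2", text)
    text = PUNCTUATION.sub(r" \1 ", text)
    return text.split()

print(treebank_like_split("I can't go"))         # ['I', 'ca', "n't", 'go']
print(treebank_like_split("Hello!"))             # ['Hello', '!']
print(treebank_like_split("it's fine, really"))  # ['it', "'s", 'fine', ',', 'really']
```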

In [44]:
from nltk.tokenize.treebank import TreebankWordTokenizer

In [45]:
tokenizer = TreebankWordTokenizer()

treebank_words = tokenizer.tokenize(text=corpus, convert_parentheses=True)
print(treebank_words)

['This', 'is', 'an', 'example', 'corpus.', 'A', 'corpus', 'is', 'a', 'collection', 'of', 'authentic', ',', 'machine-readable', 'texts', 'that', 'can', 'be', 'used', 'to', 'train', 'AI', 'and', 'machine', 'learning', 'systems.', 'Next', 'Paragraph.', 'Corpora', 'can', 'be', 'made', 'up', 'of', 'a', 'variety', 'of', 'materials', ',', 'including', ':', 'Newspapers', ',', 'Novels', ',', 'Recipes', ',', 'Radio', 'broadcasts', ',', 'Television', 'shows', ',', 'Movies', ',', 'and', 'Tweets.', 'Adding', 'words', 'with', 'apostrophes', 'to', 'see', 'how', 'it', 'works', '-', 'it', "'s", 'ca', "n't", "haven't.", 'Adding', 'words', 'with', 'paranthesis', '-', '-LRB-', 'inside', '-RRB-']

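"it's" has become "it" and "'s", and "can't" has become "ca" and "n't". The parentheses have been converted to -LRB- and -RRB- tokens because convert_parentheses=True was passed. Periods inside the text stay attached to the preceding word ("corpus.", "systems.", "haven't."), because the tokenizer only separates a period at the very end of its input; this is also why "haven't." was not split into "have" and "n't".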

# Decompose corpus into a list of words using word_tokenize

### word_tokenize

* Basis: Uses the 'punkt' tokenizer for sentence splitting and then tokenizes sentences into words.
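* Complexity: Builds on Treebank-style rules rather than replacing them. Because each sentence is tokenized separately, sentence-final periods are split off, which TreebankWordTokenizer alone does not do on multi-sentence input.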
* General Use: Suitable for general-purpose tokenization where standard NLP processing is needed.
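
The effect of tokenizing sentence by sentence can be sketched without the Punkt data by splitting two sentences by hand and applying TreebankWordTokenizer to each one. This only approximates what word_tokenize does (the hand-made `sentences` list stands in for the Punkt step, which is an assumption of this sketch), but it shows why sentence-final periods come out as separate tokens.

```python
from nltk.tokenize.treebank import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Stand-in for the sentence-splitting step (split by hand, not by Punkt).
sentences = ["This is an example corpus.", "Next Paragraph."]

# Whole text at once: only the period at the very end is separated.
whole = tokenizer.tokenize(" ".join(sentences))

# One sentence at a time: every sentence-final period is separated.
per_sentence = [tok for sent in sentences for tok in tokenizer.tokenize(sent)]

print(whole)         # ['This', 'is', 'an', 'example', 'corpus.', 'Next', 'Paragraph', '.']
print(per_sentence)  # ['This', 'is', 'an', 'example', 'corpus', '.', 'Next', 'Paragraph', '.']
```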

In [46]:
from nltk.tokenize import word_tokenize
# word_tokenize and sent_tokenize both rely on the pre-trained Punkt sentence-boundary data.
# Recent NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'.
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sharo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [48]:
words = word_tokenize(corpus, language='english', preserve_line=False)
print(words)

['This', 'is', 'an', 'example', 'corpus', '.', 'A', 'corpus', 'is', 'a', 'collection', 'of', 'authentic', ',', 'machine-readable', 'texts', 'that', 'can', 'be', 'used', 'to', 'train', 'AI', 'and', 'machine', 'learning', 'systems', '.', 'Next', 'Paragraph', '.', 'Corpora', 'can', 'be', 'made', 'up', 'of', 'a', 'variety', 'of', 'materials', ',', 'including', ':', 'Newspapers', ',', 'Novels', ',', 'Recipes', ',', 'Radio', 'broadcasts', ',', 'Television', 'shows', ',', 'Movies', ',', 'and', 'Tweets', '.', 'Adding', 'words', 'with', 'apostrophes', 'to', 'see', 'how', 'it', 'works', '-', 'it', "'s", 'ca', "n't", 'have', "n't", '.', 'Adding', 'words', 'with', 'paranthesis', '-', '(', 'inside', ')']

"it's" has become "it" and "'s", "can't" has become "ca" and "n't", and "haven't" has become "have" and "n't". Sentence-final periods are now separate tokens, and the parentheses are left unchanged.

# Decompose corpus into a list of words using wordpunct_tokenize

### wordpunct_tokenize

* Simple approach: Splits text into runs of alphanumeric characters and runs of punctuation, so punctuation never stays attached to a word.
* Use case: Useful where simple, fast tokenization is sufficient. It does not treat contractions, hyphenated words, or abbreviations specially.
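
The behaviour described above can be reproduced with the standard `re` module. The pattern `\w+|[^\w\s]+` is believed to be the one behind wordpunct_tokenize, but treat that as an assumption of this sketch; the `wordpunct_like` name is hypothetical.

```python
import re

# Assumed pattern: a run of word characters, or a run of non-space punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Hypothetical stand-in for wordpunct_tokenize using only `re`."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_like("can't"))             # ['can', "'", 't']
print(wordpunct_like("machine-readable"))  # ['machine', '-', 'readable']
print(wordpunct_like("Wait... really?!"))  # ['Wait', '...', 'really', '?!']
```

Note that consecutive punctuation marks such as `...` or `?!` stay together as a single token under this pattern.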

In [49]:
from nltk.tokenize import wordpunct_tokenize

In [50]:
wordspunct = wordpunct_tokenize(corpus)
print(wordspunct)

['This', 'is', 'an', 'example', 'corpus', '.', 'A', 'corpus', 'is', 'a', 'collection', 'of', 'authentic', ',', 'machine', '-', 'readable', 'texts', 'that', 'can', 'be', 'used', 'to', 'train', 'AI', 'and', 'machine', 'learning', 'systems', '.', 'Next', 'Paragraph', '.', 'Corpora', 'can', 'be', 'made', 'up', 'of', 'a', 'variety', 'of', 'materials', ',', 'including', ':', 'Newspapers', ',', 'Novels', ',', 'Recipes', ',', 'Radio', 'broadcasts', ',', 'Television', 'shows', ',', 'Movies', ',', 'and', 'Tweets', '.', 'Adding', 'words', 'with', 'apostrophes', 'to', 'see', 'how', 'it', 'works', '-', 'it', "'", 's', 'can', "'", 't', 'haven', "'", 't', '.', 'Adding', 'words', 'with', 'paranthesis', '-', '(', 'inside', ')']


Here punctuation is always split off into separate tokens: "can't" has become "can", "'", "t", and "machine-readable" has become "machine", "-", "readable".