In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
corpus = """Welcome to Ribhav Jain's NLP Tutorials.
Please do watch the entire course to become expert in NLP.
"""

In [3]:
print(corpus)

Welcome to Ribhav Jain's NLP Tutorials.
Please do watch the entire course to become expert in NLP.



## 🧠 Tokenizers in NLTK

**Tokenization** is the process of breaking text into smaller units called **tokens**, such as **sentences**, **words**, or **subwords**. NLTK provides several tokenizers, each with a different splitting strategy.
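
As a rough baseline (not one of the NLTK tokenizers covered here), plain whitespace splitting shows why dedicated tokenizers are useful: punctuation and possessives stay attached to the neighbouring word. The variable names are illustrative only.

```python
# Naive baseline: split on whitespace only (standard library, no NLTK).
sentence = "Welcome to Ribhav Jain's NLP Tutorials."
naive_tokens = sentence.split()
print(naive_tokens)
# ['Welcome', 'to', 'Ribhav', "Jain's", 'NLP', 'Tutorials.']
```

Note that `Jain's` and `Tutorials.` each come back as a single token, which is what the tokenizers in this section handle differently.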

---

### 🔹 1. `word_tokenize`

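- Splits text into sentences with the Punkt sentence tokenizer, then splits each sentence into words using Treebank-style rules.
- Handles contractions and punctuation well (e.g., `Jain's` becomes `Jain` and `'s`).
- Requires the Punkt data to be downloaded (`punkt`, or `punkt_tab` in recent NLTK releases).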
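
The notebook cells in this tutorial assume the Punkt data is already installed. A minimal sketch of fetching it follows; `ensure_tokenizer_data` is a hypothetical helper name, and which resource is needed (`punkt` or `punkt_tab`) depends on the installed NLTK version, so the sketch simply tries both.

```python
import nltk

def ensure_tokenizer_data(resources=("punkt", "punkt_tab")):
    """Try to download each named NLTK resource.

    Returns a dict mapping resource name -> True if the download call
    reported success, False otherwise (for example, when offline).
    """
    return {name: bool(nltk.download(name, quiet=True)) for name in resources}
```

Calling `ensure_tokenizer_data()` once, before the first `sent_tokenize` or `word_tokenize` call, should be enough; a `False` entry for one of the two names is not necessarily a problem as long as the tokenizers run.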

---

### 🔹 2. `wordpunct_tokenize`

- Splits text into runs of word characters and runs of punctuation, so every punctuation mark becomes a token boundary.
- Simpler and faster, but can over-split tokens (e.g., splits contractions like `Let's` into `Let`, `'`, and `s`).
- Does not require additional downloads.
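
The splitting behaviour described above can be approximated with a single regular expression. This is a standard-library sketch of the idea, not NLTK's own code; `wordpunct_sketch` and `WORDPUNCT_PATTERN` are illustrative names.

```python
import re

# Assumed pattern: a run of word characters, OR a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return all word runs and punctuation runs, in order."""
    return WORDPUNCT_PATTERN.findall(text)

tokens = wordpunct_sketch("Let's watch the course.")
print(tokens)
# ['Let', "'", 's', 'watch', 'the', 'course', '.']
```

Because the apostrophe is punctuation, the contraction is cut into three tokens, which is the over-splitting mentioned in the list above.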

---

### 🔹 3. `TreebankWordTokenizer`

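- Rule-based tokenizer; `word_tokenize` applies the same style of rules to each sentence.
- Follows the Penn Treebank tokenization conventions.
- Expects one sentence at a time: only a period at the very end of the input is split off, so run it on sentence-segmented text for consistent results.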

---

## ✅ Comparison

| Tokenizer               | Pros                   | Cons                           | Best For                   |
| ----------------------- | ---------------------- | ------------------------------ | -------------------------- |
| `word_tokenize`         | Accurate, standard     | Slower, needs `punkt` download | Most NLP tasks             |
| `wordpunct_tokenize`    | Fast, no dependencies  | Over-splits contractions       | Quick & rough tokenization |
| `TreebankWordTokenizer` | Rule-based, consistent | Expects pre-split sentences    | Custom pipelines           |

> ✅ **Recommended:** Use `word_tokenize` for most use cases due to its accuracy and robustness.


In [4]:
from nltk.tokenize import sent_tokenize

documents = sent_tokenize(corpus)
print(documents)

["Welcome to Ribhav Jain's NLP Tutorials.", 'Please do watch the entire course to become expert in NLP.']


In [5]:
type(documents)

list

In [6]:
for sentence in documents:
    print(sentence)

Welcome to Ribhav Jain's NLP Tutorials.
Please do watch the entire course to become expert in NLP.


In [7]:
# Word tokenization
# paragraph --> words
# sentence --> words
from nltk.tokenize import word_tokenize

In [8]:
word_tokenize(corpus)

['Welcome',
 'to',
 'Ribhav',
 'Jain',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [9]:
for sentence in documents:
    print(word_tokenize(sentence))

['Welcome', 'to', 'Ribhav', 'Jain', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'expert', 'in', 'NLP', '.']


In [10]:
from nltk.tokenize import wordpunct_tokenize

In [11]:
wordpunct_tokenize(corpus)

['Welcome',
 'to',
 'Ribhav',
 'Jain',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [12]:
from nltk.tokenize import TreebankWordTokenizer

In [13]:
tokenizer = TreebankWordTokenizer()

In [14]:
tokenizer.tokenize(corpus)

['Welcome',
 'to',
 'Ribhav',
 'Jain',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
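
In the output above, `Tutorials.` keeps its period because `TreebankWordTokenizer` only separates a period at the very end of the string it receives. A sketch of one way to get per-sentence behaviour, assuming the sentences have already been split (here they are hard-coded from the `sent_tokenize` output shown earlier):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Sentences copied from the sent_tokenize output earlier in the notebook.
sentences = [
    "Welcome to Ribhav Jain's NLP Tutorials.",
    "Please do watch the entire course to become expert in NLP.",
]

# Tokenize one sentence at a time so each final period is split off.
tokens_per_sentence = [tokenizer.tokenize(s) for s in sentences]
for tokens in tokens_per_sentence:
    print(tokens)
```

With this approach the first sentence ends in the two tokens `Tutorials` and `.`, in line with what `word_tokenize` printed for the same sentence.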