In [8]:
corpus = """ Hello Welcome, to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

In [9]:
corpus

" Hello Welcome, to Krish Naik's NLP Tutorials.\nPlease do watch the entire course! to become expert in NLP.\n"

In [10]:
print(corpus)

 Hello Welcome, to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



In [11]:
## Tokenisation
## Paragraph --> Sentences
## Needs the Punkt model once: nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
from nltk.tokenize import sent_tokenize

In [13]:
documents = sent_tokenize(corpus)

In [14]:
type(documents)

list

In [15]:
for sentence in documents:
    print(sentence)

 Hello Welcome, to Krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.


In [16]:
## Tokenisation
## Paragraph --> Words
## Sentence --> Words
from nltk.tokenize import word_tokenize

In [17]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [18]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [20]:
from nltk.tokenize import wordpunct_tokenize

In [21]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

# Comparison of `wordpunct_tokenize` and `word_tokenize`

## Overview
Both `wordpunct_tokenize` and `word_tokenize` are NLTK functions that split text into word-level tokens. They differ in how they split around punctuation, most visibly in possessives and contractions such as `Naik's`.

| **Feature**                    | **`word_tokenize`**                                      | **`wordpunct_tokenize`**                                   |
|--------------------------------|---------------------------------------------------------|-----------------------------------------------------------|
| **Purpose**                    | Splits text into word and punctuation tokens using English-oriented rules. | Splits text into runs of word characters and runs of punctuation. |
| **Punctuation Handling**       | Emits punctuation as separate tokens but keeps clitics such as `'s` together. No punctuation is dropped. | Emits every run of punctuation as its own token, so `'s` becomes `'` and `s`. |
| **Token Types**                | Words, punctuation marks, and clitics such as `'s`. | Alphanumeric runs and punctuation runs only. |
| **Complexity**                 | Splits sentences with the pre-trained Punkt model, then applies Treebank-style word rules to each sentence. | A single regular expression; no trained model is involved. |
| **Example** (from the corpus above) | `Naik's` is tokenized as `'Naik'`, `"'s"`. | `Naik's` is tokenized as `'Naik'`, `"'"`, `'s'`. |

## Conclusion
- **`word_tokenize`** applies English-specific rules (for example, keeping `'s` as one token), which makes it a sensible default for general English text.
- **`wordpunct_tokenize`** is a plain regular-expression split that needs no trained model; it is useful when a fast, predictable split at every punctuation boundary is acceptable.
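
The simpler behaviour of `wordpunct_tokenize` can be approximated with the standard library alone. The sketch below assumes the pattern `\w+|[^\w\s]+`, which is the regular expression NLTK's `WordPunctTokenizer` is built on; the helper name `wordpunct_sketch` is hypothetical and not part of NLTK.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text):
    """Hypothetical helper approximating wordpunct_tokenize with plain re."""
    return WORDPUNCT_PATTERN.findall(text)


sample = "Krish Naik's NLP Tutorials."
tokens = wordpunct_sketch(sample)
print(tokens)
# ['Krish', 'Naik', "'", 's', 'NLP', 'Tutorials', '.']
```

Because the apostrophe is not a word character, it always ends up as its own token, which is why `Naik's` is split into three pieces here but only two by `word_tokenize`.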



In [28]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()


In [29]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
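
Note the `'Tutorials.'` token in the output above: the period stays attached. This appears to follow from how `TreebankWordTokenizer` is designed. It assumes the text has already been split into sentences and only detaches a period that sits at the very end of the string it receives. A minimal sketch of the usual workaround is to tokenize sentence by sentence. The sentences are hard-coded from the earlier `sent_tokenize` output (leading space dropped) so the example does not depend on the Punkt model.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Sentences copied from the sent_tokenize output earlier in this notebook,
# hard-coded here so the sketch needs no Punkt model download.
sentences = [
    "Hello Welcome, to Krish Naik's NLP Tutorials.",
    "Please do watch the entire course!",
    "to become expert in NLP.",
]

# Tokenize one sentence at a time so that each final period sits at the
# end of the string handed to the tokenizer.
tokens_per_sentence = [tokenizer.tokenize(s) for s in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
```

With this approach the first sentence should end in separate `'Tutorials'` and `'.'` tokens, matching what `word_tokenize` produced for the same text.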

In [30]:
print("The End")

The End
