**Text Processing:** **NLTK** and **spaCy** are popular choices for tokenization, stemming, and other pre-processing tasks.

In [1]:
!pip install nltk



### Defining a Corpus

A **corpus** is a collection of text. In Python, you can define a multi-line string using triple quotes. This will be our sample text for tokenization.

In [2]:
corpus = '''Hello.
Welcome,Hello welcome to the Krish's NLP tutorials.
Please do watch the entire course to become expert in NLP!
Thank you so much!
'''

### Sentence Tokenization

To split the corpus into sentences, we'll use the `sent_tokenize` function from NLTK. This function identifies sentence boundaries, typically indicated by punctuation like periods, exclamation marks, and question marks.

In [5]:
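import nltk
nltk.download('punkt')  # tokenizer models used by sent_tokenize and word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead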

In [6]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(corpus)
print(sentences)


This will output a list of sentences, where each sentence is a string.

```
['Hello.', "Welcome,Hello welcome to the Krish's NLP tutorials.", 'Please do watch the entire course to become expert in NLP!', 'Thank you so much!']
```
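To make the idea of a punctuation-based sentence boundary concrete, here is a minimal sketch that uses only the standard library. The helper `naive_sent_split` is a hypothetical name, and it is not how `sent_tokenize` works internally: NLTK relies on the pre-trained Punkt model, which is built to cope with cases such as abbreviations that a plain rule gets wrong.

```python
import re

corpus = '''Hello.
Welcome,Hello welcome to the Krish's NLP tutorials.
Please do watch the entire course to become expert in NLP!
Thank you so much!
'''

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# On this simple corpus the naive rule finds the same four sentences.
naive_sentences = naive_sent_split(corpus)
print(naive_sentences)

# The same rule over-splits at an abbreviation such as "Dr.".
abbrev_sentences = naive_sent_split("Dr. Smith teaches NLP. Join us!")
print(abbrev_sentences)
```

The second call breaks the text after "Dr.", which is the kind of mistake that motivates using a trained tokenizer instead of a single regular expression.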

### Word Tokenization

To break down the text into individual words, we use the `word_tokenize` function. This function splits the text into words and punctuation marks.

In [7]:
from nltk.tokenize import word_tokenize

words = word_tokenize(corpus)
print(words)


The output will be a list of words and punctuation.

```
['Hello', '.', 'Welcome', ',', 'Hello', 'welcome', 'to', 'the', 'Krish', "'s", 'NLP', 'tutorials', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'expert', 'in', 'NLP', '!', 'Thank', 'you', 'so', 'much', '!']
```
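Because punctuation marks come back as tokens too, a common next step is to filter them out before counting words. The sketch below is one possible way to do that with the standard library. The token list is copied from the output above so the snippet runs without NLTK, and the variable names `alpha_words` and `counts` are illustrative.

```python
from collections import Counter

# Tokens copied from the word_tokenize output shown above.
words = ['Hello', '.', 'Welcome', ',', 'Hello', 'welcome', 'to', 'the',
         'Krish', "'s", 'NLP', 'tutorials', '.', 'Please', 'do', 'watch',
         'the', 'entire', 'course', 'to', 'become', 'expert', 'in', 'NLP',
         '!', 'Thank', 'you', 'so', 'much', '!']

# Keep purely alphabetic tokens and lowercase them.
# This drops '.', ',', '!' and also the clitic "'s".
alpha_words = [w.lower() for w in words if w.isalpha()]

# Count how often each remaining word occurs.
counts = Counter(alpha_words)
print(len(words), len(alpha_words))
print(counts['hello'], counts['nlp'])
```

Note that `str.isalpha()` also discards tokens such as `"'s"`. Whether that is desirable depends on the task, so treat this filter as an assumption rather than a rule.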

### Exploring Other Tokenizers

NLTK provides several other tokenizers that handle specific cases differently.

#### **WordPunct Tokenizer**

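The `WordPunctTokenizer` class, available through the `wordpunct_tokenize` function, splits text into runs of alphanumeric characters and runs of punctuation. Unlike `word_tokenize`, it treats every punctuation character as a separator, including apostrophes: "Krish's" becomes three tokens, "Krish", "'", and "s", instead of "Krish" and "'s".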

In [8]:
from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize(corpus))


The output highlights how punctuation is separated:

```
['Hello', '.', 'Welcome', ',', 'Hello', 'welcome', 'to', 'the', 'Krish', "'", 's', 'NLP', 'tutorials', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'expert', 'in', 'NLP', '!', 'Thank', 'you', 'so', 'much', '!']
```
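The behaviour above can be reproduced with a single regular expression, since `WordPunctTokenizer` is a regex-based tokenizer built on the pattern `\w+|[^\w\s]+`. The following sketch applies that pattern with the standard library; `regex_wordpunct` is a hypothetical helper name used only for illustration.

```python
import re

def regex_wordpunct(text):
    """Hypothetical helper: match runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

# The apostrophe is punctuation, so "Krish's" yields three tokens.
possessive_tokens = regex_wordpunct("Krish's NLP tutorials.")
print(possessive_tokens)

# Consecutive punctuation characters stay together as one token.
run_tokens = regex_wordpunct("Wait...what?!")
print(run_tokens)
```

Seeing the pattern makes the trade-off clear: the rule is simple and predictable, but it has no notion of contractions or possessives.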

#### **Treebank Word Tokenizer**

The `TreebankWordTokenizer` is designed to handle contractions and other special cases in a more linguistically aware way. For example, it splits contractions like "don't" into "do" and "n't".

In [9]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(corpus))


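This tokenizer's output differs from that of `word_tokenize`: here `'Hello.'` and `'tutorials.'` keep their periods. `TreebankWordTokenizer` assumes its input is a single sentence, so it only splits off a period at the very end of the text, whereas `word_tokenize` first splits the text into sentences and then tokenizes each one.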

Output:

```
['Hello.', 'Welcome', ',', 'Hello', 'welcome', 'to', 'the', 'Krish', "'s", 'NLP', 'tutorials.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'expert', 'in', 'NLP', '!', 'Thank', 'you', 'so', 'much', '!']
```
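To illustrate the contraction handling described for the Treebank tokenizer, here is a deliberately simplified sketch with the standard library. It is not NLTK's actual rule set: `split_contractions` is a hypothetical helper that only inserts a space before a few common English clitics and then splits on whitespace, ignoring punctuation, quotes, and the other cases the real tokenizer covers.

```python
import re

def split_contractions(text):
    """Hypothetical helper: separate n't, 's, 're, 've, 'll, 'd and 'm from the word before them."""
    # (\w) is the last character of the stem; group 2 is the clitic to detach.
    spaced = re.sub(r"(?i)(\w)(n't|'(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    return spaced.split()

contraction_tokens = split_contractions("I don't think Krish's course can't help")
print(contraction_tokens)
```

As in the Treebank convention, "don't" becomes "do" and "n't", and "can't" becomes "ca" and "n't". For real text, prefer `TreebankWordTokenizer` or `word_tokenize` over a hand-written pattern like this one.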