# Tokenization 

- Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenization method. 
- Tokenization is a key step in Natural Language Processing (NLP) tasks because it converts raw text into a structured format that models can understand.

## Types of Tokenization:
1. Word Tokenization: Splits text into words.
   - *Example: "I love Python" → ["I", "love", "Python"]*

2. Character Tokenization: Breaks text into individual characters. 
   - *Example: "Python" → ["P", "y", "t", "h", "o", "n"]*

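3. Subword Tokenization: Breaks words into smaller units drawn from a learned vocabulary, as in the WordPiece tokenizer used by BERT and the byte-pair encoding (BPE) used by GPT models.
   - *Example: "unbelievable" → ["un", "believable"], or ["un", "##believable"] in WordPiece notation, where "##" marks a piece that continues a word. The actual split depends on the learned vocabulary.*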
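
The three granularities above can be sketched in plain Python. This is only an illustration: `word_tokens`, `char_tokens`, `subword_tokens`, and `toy_vocab` are hypothetical names, the vocabulary is hand-written rather than learned, and the subword function is a toy greedy longest-match split in the style of WordPiece, not the implementation any particular library uses.

```python
def word_tokens(text):
    # Whitespace split: the simplest possible word tokenizer.
    return text.split()


def char_tokens(text):
    # Every character becomes its own token.
    return list(text)


def subword_tokens(word, vocab):
    # Toy greedy longest-match-first split. Pieces that continue a word
    # carry a "##" prefix, following WordPiece notation.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No vocabulary entry matches at this position.
            return ["[UNK]"]
        start = end
    return pieces


# Hand-written vocabulary, assumed for illustration only.
toy_vocab = {"un", "##believable", "##believ", "##able"}

print(word_tokens("I love Python"))
print(char_tokens("Python"))
print(subword_tokens("unbelievable", toy_vocab))
```

With this toy vocabulary the last call yields `['un', '##believable']`; removing `"##believable"` from the set would make the same function fall back to `['un', '##believ', '##able']`.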

## Importance:
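- Tokenization turns text of any length into a sequence of discrete units that a model can map to vocabulary entries.
- The choice of tokenizer determines which distinctions survive into later processing, such as contractions, abbreviations, URLs, and punctuation.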

Common libraries for tokenization include **spaCy**, **NLTK**, and **Hugging Face's tokenizers**.

In [18]:
corpus = "Mr. Satish website is https://bigdataplaybook.wordpress.com, and Satish's email is sateeshfrnd@gmail.com, and he said, 'I'm visting kudramuk this weekend!'.Its takes an overnight journey of 335km from Bangalore to reach here. "
print(corpus)

Mr. Satish website is https://bigdataplaybook.wordpress.com, and Satish's email is sateeshfrnd@gmail.com, and he said, 'I'm visting kudramuk this weekend!'.Its takes an overnight journey of 335km from Bangalore to reach here. 


In [2]:
import nltk

In [5]:
# Downloading the required resources for tokenization
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\satee\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

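Punkt Tokenizer: Punkt is the pre-trained sentence tokenizer that NLTK uses. It relies on an unsupervised algorithm that learns abbreviations, collocations, and sentence-starting words, which lets it tell a sentence-ending period from one inside an abbreviation such as "Mr.". Pre-trained models are available for several languages. sent_tokenize uses Punkt directly, and word_tokenize uses it to split the text into sentences before splitting each sentence into words.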
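
To see why abbreviation handling matters, here is a minimal sketch of a naive sentence splitter that breaks at any whitespace following ".", "!" or "?". The function name `naive_sentence_split` and the sample sentence are made up for illustration; this is not how Punkt works, only the baseline it improves on.

```python
import re


def naive_sentence_split(text):
    # Split at whitespace that directly follows '.', '!' or '?'.
    # There is no notion of abbreviations, so "Mr." ends a "sentence".
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Mr. Smith arrived late. He left early."
print(naive_sentence_split(sample))
```

The naive rule returns three pieces, `['Mr.', 'Smith arrived late.', 'He left early.']`, where a reader would expect two sentences.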

In [19]:
# Tokenize into sentences
from nltk.tokenize import sent_tokenize
documents = sent_tokenize(corpus, language="english")

In [20]:
type(documents)

list

In [21]:
for sentence in documents:
    print(sentence)

Mr. Satish website is https://bigdataplaybook.wordpress.com, and Satish's email is sateeshfrnd@gmail.com, and he said, 'I'm visting kudramuk this weekend!
'.Its takes an overnight journey of 335km from Bangalore to reach here.


In [22]:
# Tokenize into words
from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)
words

['Mr.',
 'Satish',
 'website',
 'is',
 'https',
 ':',
 '//bigdataplaybook.wordpress.com',
 ',',
 'and',
 'Satish',
 "'s",
 'email',
 'is',
 'sateeshfrnd',
 '@',
 'gmail.com',
 ',',
 'and',
 'he',
 'said',
 ',',
 "'",
 'I',
 "'m",
 'visting',
 'kudramuk',
 'this',
 'weekend',
 '!',
 "'.Its",
 'takes',
 'an',
 'overnight',
 'journey',
 'of',
 '335km',
 'from',
 'Bangalore',
 'to',
 'reach',
 'here',
 '.']

In [23]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)

['Mr',
 '.',
 'Satish',
 'website',
 'is',
 'https',
 '://',
 'bigdataplaybook',
 '.',
 'wordpress',
 '.',
 'com',
 ',',
 'and',
 'Satish',
 "'",
 's',
 'email',
 'is',
 'sateeshfrnd',
 '@',
 'gmail',
 '.',
 'com',
 ',',
 'and',
 'he',
 'said',
 ',',
 "'",
 'I',
 "'",
 'm',
 'visting',
 'kudramuk',
 'this',
 'weekend',
 "!'.",
 'Its',
 'takes',
 'an',
 'overnight',
 'journey',
 'of',
 '335km',
 'from',
 'Bangalore',
 'to',
 'reach',
 'here',
 '.']

**word_tokenize:**
- Keeps abbreviations such as "Mr." as a single token, and splits the URL and the email address into only a few pieces (for example 'sateeshfrnd', '@', 'gmail.com').
- Splits possessives and contractions into a stem plus a clitic, so "Satish's" becomes ['Satish', "'s"] and "I'm" becomes ['I', "'m"].

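**wordpunct_tokenize:**
- Splits text into alternating runs of word characters and punctuation characters, so every period, apostrophe, and "@" starts a new token, while adjacent punctuation marks stay together (e.g., '://' and "!'.").
- Separates abbreviations ("Mr." becomes ['Mr', '.']) and breaks URLs and email addresses into their individual components.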
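
The behaviour of wordpunct_tokenize can be reproduced with the standard `re` module, since NLTK documents it as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below is an approximation for illustration; `wordpunct_like` is a hypothetical helper name and the sample text is made up.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    return WORDPUNCT.findall(text)


print(wordpunct_like("Mr. Satish's site is https://example.com!'."))
```

This prints `['Mr', '.', 'Satish', "'", 's', 'site', 'is', 'https', '://', 'example', '.', 'com', "!'."]`, which shows both effects at once: single punctuation marks are isolated, and consecutive ones are grouped into one token.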

**When to Use:**
- Use word_tokenize for general-purpose text where abbreviations and contractions should be split along linguistic lines. Note that it still breaks URLs and email addresses apart.
- Use wordpunct_tokenize when every run of punctuation must be isolated from the surrounding words, for example when processing technical documents.
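
Neither output above keeps the URL or the email address whole. If that matters, one option is a custom regular expression that tries URL and email patterns before falling back to words and punctuation. The following is a rough sketch under simplifying assumptions: the pattern and the name `tokenize_keep_links` are hypothetical, and the URL and email rules are far looser than the real specifications.

```python
import re

LINK_AWARE = re.compile(
    r"https?://[^\s,]+"              # URL: runs until whitespace or a comma
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # simplified email address
    r"|\w+"                           # ordinary word
    r"|[^\w\s]+"                      # run of punctuation
)


def tokenize_keep_links(text):
    # Alternatives are tried left to right, so URLs and emails win
    # over the generic word and punctuation rules.
    return LINK_AWARE.findall(text)


print(tokenize_keep_links("Site: https://example.com, mail me@example.org now!"))
```

For this sample the result is `['Site', ':', 'https://example.com', ',', 'mail', 'me@example.org', 'now', '!']`. The order of the alternatives is the design choice that matters: placing `\w+` first would split the links again.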

### TreebankWordTokenizer
The TreebankWordTokenizer in NLTK tokenizes text according to the Penn Treebank conventions. It applies a sequence of regular-expression rules to split contractions, punctuation, and special characters, rather than splitting on whitespace or punctuation alone. It assumes the text has already been segmented into sentences, so it only separates a period that appears at the end of the input.

Key Features of TreebankWordTokenizer:
- Handles contractions by splitting them into separate tokens (e.g., "don't" becomes ["do", "n't"]).
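- Separates most punctuation from words: commas and single quotes are split off when followed by whitespace, and periods only at the end of the input.
- Treats quotes and brackets as separate tokens, rewriting double quotation marks as Treebank-style opening and closing quote tokens.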
- Based on the rules used in the Penn Treebank corpus, making it a good option for parsing natural language.
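
The contraction rule can be illustrated with a small standard-library sketch. It is a simplified approximation, not the rule list TreebankWordTokenizer actually applies: the pattern covers only a handful of common English clitics, and `split_clitics` is a hypothetical helper name.

```python
import re

# A few common English clitics; the real tokenizer uses a longer rule list.
CLITIC = re.compile(r"(?i)(?<=\w)(n't|'s|'m|'re|'ve|'ll|'d)\b")


def split_clitics(text):
    # Insert a space before each clitic, then split on whitespace.
    spaced = CLITIC.sub(r" \1", text)
    return spaced.split()


print(split_clitics("I don't think Satish's plan works"))
```

The call prints `['I', 'do', "n't", 'think', 'Satish', "'s", 'plan', 'works']`, mirroring the "do" + "n't" convention described in the list above.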

In [24]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Mr.',
 'Satish',
 'website',
 'is',
 'https',
 ':',
 '//bigdataplaybook.wordpress.com',
 ',',
 'and',
 'Satish',
 "'s",
 'email',
 'is',
 'sateeshfrnd',
 '@',
 'gmail.com',
 ',',
 'and',
 'he',
 'said',
 ',',
 "'I",
 "'m",
 'visting',
 'kudramuk',
 'this',
 'weekend',
 '!',
 "'.Its",
 'takes',
 'an',
 'overnight',
 'journey',
 'of',
 '335km',
 'from',
 'Bangalore',
 'to',
 'reach',
 'here',
 '.']

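**Comparison to word_tokenize:**
- word_tokenize first splits the text into sentences with Punkt and then applies an improved Treebank-style tokenizer to each sentence, so both follow the same contraction and punctuation conventions.
- On this corpus the two outputs differ in only one place: TreebankWordTokenizer keeps the opening quote attached ("'I"), while word_tokenize splits it off ("'", 'I').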

**When to Use TreebankWordTokenizer:**
- When the text is already split into sentences and you want to apply the Treebank rules to each one directly, for example when working with formal text such as academic papers or linguistic corpora.
- When you want to handle contractions and punctuation according to Treebank conventions, for instance to match corpora or tools that follow them.