### Dictionary-Based Tokenization
Dictionary-based tokenization splits text into tokens by matching it against a predefined dictionary of words or phrases.

***Why use dictionary-based tokenization when there are many other ways to create tokens?***

Suppose your text contains a multi-word name, such as:

"I live in San Francisco", or "United Nations".

If you use simple whitespace tokenization, "San Francisco" will be split into two separate tokens: "San" and "Francisco". This can be a problem because:

- The original meaning of the named entity (a city name) gets broken.
- The number of unique tokens increases (data sparsity), which raises the computational cost.

To solve such issues, dictionary-based tokenization keeps important multi-word expressions together as a single token. For example, "San Francisco" becomes "San_Francisco".
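The difference can be seen in a few lines of plain Python. This is a minimal sketch, not a library API: `whitespace_tokenize` and `merge_phrase` are hypothetical helper names, and the `_` separator is an assumption chosen to match the "San_Francisco" example above.

```python
def whitespace_tokenize(text):
    """Split on whitespace only."""
    return text.split()

def merge_phrase(tokens, phrase, separator="_"):
    """Join every occurrence of `phrase` (a tuple of words) into one token."""
    merged, i, n = [], 0, len(phrase)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == phrase:
            merged.append(separator.join(phrase))
            i += n  # skip past the whole phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = whitespace_tokenize("I live in San Francisco")
print(tokens)                                       # ['I', 'live', 'in', 'San', 'Francisco']
print(merge_phrase(tokens, ("San", "Francisco")))   # ['I', 'live', 'in', 'San_Francisco']
```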

***How to perform dictionary-based tokenization?***

Create a predefined dictionary of important words or phrases (e.g., place names, organizations, domain-specific terms).

Preprocess the text (lowercasing, cleaning, etc.) and apply simple tokenization as the first step. Normalize the dictionary entries the same way, since a lowercased text will never match capitalized entries.

Match tokens against the dictionary: if a sequence matches, treat it as a single token and assign a specific token ID or representation.

Handle unmatched words: for words not found in the dictionary, apply subword or character-level tokenization to handle unknowns.

***How can I do dictionary tokenization? (recap)***
- Step 1: create the predefined dictionary.
- Step 2: preprocess the text and apply simple tokenization.
- Step 3: if a sequence of words matches the dictionary, give those words a specific value (a single token).
- Step 4: if a word does not match the dictionary, apply subword or character tokenization.
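The four steps can be sketched end to end without any library. This is only an illustration under stated assumptions: `merge_longest_match` and `encode` are hypothetical names, the known-token vocabulary is a toy set made up for the example, and the fallback for unknown words is plain character splitting (token ID assignment is left out).

```python
def merge_longest_match(tokens, dictionary, separator="_"):
    """Step 3: greedy left-to-right merge, preferring the longest dictionary entry."""
    max_len = max((len(entry) for entry in dictionary), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in dictionary:
                merged.append(separator.join(candidate))
                i += n
                break
        else:
            # No phrase starts here; keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

def encode(text, dictionary, vocab):
    tokens = text.lower().split()                      # Step 2: preprocess + simple tokenization
    merged = merge_longest_match(tokens, dictionary)   # Step 3: dictionary matching
    pieces = []
    for token in merged:
        if token in vocab:
            pieces.append(token)
        else:
            pieces.extend(token)                       # Step 4: character-level fallback
    return pieces

# Step 1: a toy dictionary (lowercased to match the preprocessing) and a toy vocabulary
dictionary = {("san", "francisco"), ("united", "nations")}
vocab = {"san_francisco", "united_nations", "i", "live", "in"}

print(encode("I live in San Francisco now", dictionary, vocab))
# ['i', 'live', 'in', 'san_francisco', 'n', 'o', 'w']
```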

In [7]:
## Preparing the dictionary of multi-word expressions
predefined = [('San', 'Francisco'), ('United', 'Nations'), ('New', 'York'), ('Google', 'Colab')]

In [6]:
## Importing the Libraries
import nltk
from nltk.tokenize import MWETokenizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
# Preprocessing the Text
def preprocess_text(text):
  text = text.lower()
  return text

sample_text = 'San Francisco is a beautiful city. The United Nations meets regularly.'
cleaned_text = preprocess_text(sample_text)

In [11]:
# Tokenizing the text
def tokenizing(text):
  # Split the text into word and punctuation tokens
  return word_tokenize(text)

tokens = tokenizing(cleaned_text)
print(tokens)

['san', 'francisco', 'is', 'a', 'beautiful', 'city', '.', 'the', 'united', 'nations', 'meets', 'regularly', '.']


In [10]:
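# Applying dictionary-based tokenization
# The text was lowercased and matching is case-sensitive, so lowercase the dictionary entries too
lowercase_mwes = [tuple(word.lower() for word in mwe) for mwe in predefined]
tokenizer = MWETokenizer(lowercase_mwes)
tokenized_text = tokenizer.tokenize(tokens)
print("Dictionary tokenized : ", tokenized_text)

Dictionary tokenized :  ['san_francisco', 'is', 'a', 'beautiful', 'city', '.', 'the', 'united_nations', 'meets', 'regularly', '.']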


In [12]:
# Keeping only the tokens that are not dictionary words
# The tokens are lowercase, so compare against lowercase words
dictionary_words = {'san', 'francisco', 'united', 'nations'}
unmatched_tokens = []
for token in tokens:
  if token not in dictionary_words:
    unmatched_tokens.append(token)

print("Unmatched tokens: ", unmatched_tokens)

Unmatched tokens:  ['is', 'a', 'beautiful', 'city', '.', 'the', 'meets', 'regularly', '.']


In [16]:
def word_tokenizing(tokens):
  unmatched_tokens = []
  for token in tokens:
    if token not in ['San', 'Francisco', 'United', 'Nations']:
      unmatched_tokens.append(token)

  return unmatched_tokens

In [15]:
# Example: removing dictionary words from a sentence (the match is case-sensitive, so the text is not lowercased here)
sentence = 'San Francisco is a part of the United Nations'
tokens = word_tokenize(sentence)
tokenized_sentence = word_tokenizing(tokens)
print(tokenized_sentence)

['is', 'a', 'part', 'of', 'the']
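A hard-coded word list has to be kept in sync with the dictionary by hand. As an alternative sketch, the matched and unmatched tokens can be separated from the merged output itself. The inputs below are hypothetical (a lowercased dictionary and an already merged token list), and `split_matched` is an illustrative helper name, not an NLTK function.

```python
# Hypothetical inputs: a lowercased dictionary and tokens already merged with "_"
dictionary = [("san", "francisco"), ("united", "nations")]
merged_tokens = ["san_francisco", "is", "a", "part", "of", "the", "united_nations"]

def split_matched(tokens, dictionary, separator="_"):
    """Separate tokens that came from the dictionary from all other tokens."""
    dictionary_tokens = {separator.join(entry) for entry in dictionary}
    matched = [t for t in tokens if t in dictionary_tokens]
    unmatched = [t for t in tokens if t not in dictionary_tokens]
    return matched, unmatched

matched, unmatched = split_matched(merged_tokens, dictionary)
print(matched)    # ['san_francisco', 'united_nations']
print(unmatched)  # ['is', 'a', 'part', 'of', 'the']
```

The unmatched list is what would be passed on to subword or character-level tokenization in step 4.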


In [17]:
# Customizing the dictionary
predefined.extend([('Machine', 'Learning'), ('Natural', 'Language', 'Processing')])
tokenizer = MWETokenizer(predefined)
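To see why extending the dictionary changes the result, it can help to look at one common way of storing multi-word entries: a trie keyed by word, searched for the longest match. The sketch below is a plain-Python illustration under that assumption; `add_phrase`, `longest_match`, and `trie_tokenize` are hypothetical names, and entries are lowercased on insertion so they match lowercased text.

```python
END = object()  # sentinel key marking the end of a complete phrase

def add_phrase(trie, phrase):
    """Insert a phrase (tuple of words) into the nested-dict trie, lowercased."""
    node = trie
    for word in phrase:
        node = node.setdefault(word.lower(), {})
    node[END] = True

def longest_match(trie, tokens, start):
    """Length of the longest phrase starting at tokens[start], or 0 if none."""
    node, best = trie, 0
    for offset in range(start, len(tokens)):
        node = node.get(tokens[offset])
        if node is None:
            break
        if END in node:
            best = offset - start + 1
    return best

def trie_tokenize(trie, tokens, separator="_"):
    out, i = [], 0
    while i < len(tokens):
        n = longest_match(trie, tokens, i)
        if n == 0:
            out.append(tokens[i])
            i += 1
        else:
            out.append(separator.join(tokens[i:i + n]))
            i += n
    return out

trie = {}
add_phrase(trie, ("New", "York"))
tokens = "i love new york city".split()
print(trie_tokenize(trie, tokens))   # ['i', 'love', 'new_york', 'city']

# Extending the dictionary with a longer entry changes the match
add_phrase(trie, ("New", "York", "City"))
print(trie_tokenize(trie, tokens))   # ['i', 'love', 'new_york_city']
```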

In [21]:
# Visualizing tokenization output
sentences = [
    'San Francisco is a beautiful place.',
    'The United Nations is headquartered in New York.',
    'Machine Learning is a subset of Artificial.'
]

# The sentences are lowercased below, so rebuild the tokenizer with lowercased dictionary entries
tokenizer = MWETokenizer([tuple(word.lower() for word in mwe) for mwe in predefined])

for sentence in sentences:
  cleaned_sentence = preprocess_text(sentence)
  tokens = word_tokenize(cleaned_sentence)
  tokenized_sentence = tokenizer.tokenize(tokens)
  print('Original: ', sentence)
  print('Tokenized: ', tokenized_sentence)
  print('\n')

Original:  San Francisco is a beautiful place.
Tokenized:  ['san_francisco', 'is', 'a', 'beautiful', 'place', '.']


Original:  The United Nations is headquartered in New York.
Tokenized:  ['the', 'united_nations', 'is', 'headquartered', 'in', 'new_york', '.']


Original:  Machine Learning is a subset of Artificial.
Tokenized:  ['machine_learning', 'is', 'a', 'subset', 'of', 'artificial', '.']


