<table align="left" width=100%>
    <tr>
        <td width="10%">
            <img src="../images/RA_Logo.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> 1. Tokenization </b>
                </font>
            </div>
        </td>
    </tr>
</table>

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/01_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/01_Tokenization.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

In [16]:
#!pip install spacy

In [17]:
#!python -m spacy download en_core_web_sm

In [18]:
import re
import spacy

## Basic Tokenization

The simplest way to tokenize text is to split it at spaces and punctuation marks. This works for plain sentences, but it breaks down on contractions, hyphenated compounds, and multi-word names.

In [27]:
# Tokenization using Regular Expression

text = "Radhika saved the puppy"
tokens = re.findall(r'\b\w+\b', text)
print(tokens)

['Radhika', 'saved', 'the', 'puppy']
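
For comparison, here is a minimal sketch of what plain whitespace splitting does once punctuation is present. The sample sentence is an illustrative assumption, not part of the original notebook.

```python
import re

# Illustrative sentence (an assumption for this sketch)
sample = "Radhika saved the puppy, and the puppy was happy."

# Whitespace splitting leaves punctuation glued to the neighbouring word
whitespace_tokens = sample.split()
print(whitespace_tokens)

# The word-character pattern drops punctuation entirely
regex_tokens = re.findall(r'\b\w+\b', sample)
print(regex_tokens)
```

The first list contains `'puppy,'` and `'happy.'`, while the second contains only the bare words.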


In [28]:
# Tokenization using spacy

#import spacy
nlp = spacy.load("en_core_web_sm")
text = "Radhika saved the puppy"
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

['Radhika', 'saved', 'the', 'puppy']


In [32]:
for token in doc:
    print(token)

Radhika
saved
the
puppy


In [33]:
# `token` still holds the last item from the loop above
token.text

'puppy'

In [38]:
[ token.text  for token in doc]

['Radhika', 'saved', 'the', 'puppy']

In [39]:
# Tokenization using nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vidyadharbendre/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vidyadharbendre/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
import nltk
from nltk.tokenize import word_tokenize

# Example text
text = "Radhika saved the puppy"

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

Tokens: ['Radhika', 'saved', 'the', 'puppy']


In [41]:
# Normalization
import string
from nltk.corpus import stopwords

# Convert to lower case
tokens = [token.lower() for token in tokens]

# Remove punctuation
tokens = [token for token in tokens if token not in string.punctuation]

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

print("Normalized Tokens:", tokens)

Normalized Tokens: ['radhika', 'saved', 'puppy']
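
The order of the normalization steps matters: stop-word lists such as NLTK's are lowercase, so lowercasing has to come before the stop-word filter. The sketch below illustrates this with the standard library only. `STOP_WORDS` and `normalize` are hypothetical names, and the four-word stop list is a stand-in for a real one.

```python
import string

# Hypothetical, tiny stop-word set for illustration; a real list is much larger
STOP_WORDS = {"the", "a", "an", "is"}

def normalize(tokens, stop_words=STOP_WORDS):
    """Lowercase first, then drop punctuation marks and stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered
            if t not in string.punctuation and t not in stop_words]

raw = ["The", "puppy", "is", "safe", "."]
print(normalize(raw))

# Filtering without lowercasing first misses the capitalized "The"
unlowered = [t for t in raw if t not in STOP_WORDS]
print(unlowered)
```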


## Challenges in Tokenization

### Non-Alphanumeric Characters
Splitting text at every non-alphanumeric character handles ordinary punctuation, but it also breaks contractions apart: in the example below, `It's` becomes `It` and `s`.

In [20]:
text = "Hello, world! It's a beautiful day."
tokens = re.findall(r'\b\w+\b', text)
print(tokens)

['Hello', 'world', 'It', 's', 'a', 'beautiful', 'day']
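
The pattern above also discards the punctuation itself. If punctuation should survive as separate tokens, one possible variation (a sketch, not the notebook's own method) is to match either a run of word characters or a single non-space symbol:

```python
import re

text = "Hello, world! It's a beautiful day."

# Match either a run of word characters or one non-word, non-space symbol
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Each punctuation mark now appears as its own token, although the contraction is still split at the apostrophe.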


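### Apostrophes
Apostrophes are tricky because they can mark either contractions (`It's`) or possessives (`Radhika's`). The pattern below keeps an apostrophe and the letters after it attached to the preceding word.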

In [21]:
text = "It's Radhika's puppy."
tokens = re.findall(r"\b\w+(?:'\w+)?\b", text)
print(tokens)

["It's", "Radhika's", 'puppy']
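
Keeping the whole word together is only one convention. Some tokenizers instead split off the apostrophe suffix as its own token. A minimal regex sketch of that alternative, assuming straight (ASCII) apostrophes only:

```python
import re

text = "It's Radhika's puppy."

# Try an apostrophe suffix first, then fall back to a plain word
tokens = re.findall(r"'\w+|\w+", text)
print(tokens)
```

Here `It's` yields `It` and `'s`, so the suffix stays recognizable instead of collapsing to a bare `s`.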


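### Two-Word Entities
Recognizing multi-word entities such as `New Delhi` takes more than tokenization. A tokenizer on its own splits the name into separate tokens, as the output below shows, so grouping them requires a more advanced technique such as Named Entity Recognition (NER).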


In [22]:
#import spacy

nlp = spacy.load("en_core_web_sm")
text = "Radhika lives in New Delhi."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

['Radhika', 'lives', 'in', 'New', 'Delhi', '.']
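
A statistical NER model is the robust way to group such tokens. As a much simpler stand-in, the sketch below merges adjacent token pairs found in a hand-written lookup list. This is a dictionary lookup, not NER, and `KNOWN_ENTITIES` and `merge_entities` are hypothetical names introduced only for illustration.

```python
# Hypothetical lookup list of known two-word names (an assumption for this sketch)
KNOWN_ENTITIES = {("New", "Delhi"), ("New", "York")}

def merge_entities(tokens, entities=KNOWN_ENTITIES):
    """Join adjacent token pairs that appear in the entity list."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in entities:
            merged.append(" ".join(pair))
            i += 2  # skip both halves of the entity
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['Radhika', 'lives', 'in', 'New', 'Delhi', '.']
merged_tokens = merge_entities(tokens)
print(merged_tokens)
```

A lookup list only covers names it already contains, which is the gap a trained NER component is meant to close.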


### Compound Words
Compound words, especially in languages like Sanskrit and German, need careful handling to preserve their meaning. A plain word-character pattern splits a hyphenated compound at the hyphen, as shown below.

In [24]:
text = "Bhagavad-gita"
tokens = re.findall(r'\b\w+\b', text)
print(tokens)

['Bhagavad', 'gita']
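
One way to keep a hyphenated compound intact is to allow internal hyphens in the pattern. The following is a sketch under that assumption. It does not address unhyphenated compounds, which need language-specific handling.

```python
import re

text = "Bhagavad-gita"

# Allow internal hyphens so the compound stays a single token
tokens = re.findall(r"\w+(?:-\w+)*", text)
print(tokens)
```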
