# 2.5 Tokenizing Text

A fundamental step in NLP is converting text into smaller units through a process known as tokenization. These smaller units are called tokens. Word tokenization is the most common form, where each individual word in the text becomes a token, but tokens can also be sentences, subwords, or individual characters, depending on your use case.
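To make the idea of different token granularities concrete, here is a minimal sketch using only built-in Python. The sample text and variable names are illustrative assumptions, and whitespace splitting is only a rough stand-in for a real word tokenizer.

```python
# Illustrative sample text (not from the course data)
text = "Luna naps."

# Crude word-level tokens: split on whitespace (punctuation stays attached)
word_tokens = text.split()

# Character-level tokens: every character of a word becomes a token
char_tokens = list("Luna")

print(word_tokens)  # ['Luna', 'naps.']
print(char_tokens)  # ['L', 'u', 'n', 'a']
```

Notice that the whitespace split leaves the full stop attached to "naps.", which is one reason a dedicated tokenizer is usually preferred.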

Why do we do this? The meaning of the overall text is easier to understand if we can analyse the individual parts as well as the whole. It's also an important step before we vectorize our data, which we'll cover in more detail in the next section of this course.

Now let's look at some examples of sentence and word tokenization using the nltk package.

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

### Sentence tokenization

In [3]:
# sent_tokenize splits the text into a list of sentences based on punctuation like '.'
sentences = "Her cat's name is Luna. Her dog's name is max"
sent_tokenize(sentences)

["Her cat's name is Luna.", "Her dog's name is max"]

### Word tokenization

In [4]:
# word_tokenize breaks the sentence into words and punctuation; note that a possessive like "cat's" is split into 'cat' and "'s"
sentence = "Her cat's name is Luna"
word_tokenize(sentence)

['Her', 'cat', "'s", 'name', 'is', 'Luna']

Notice how "cat's" has been split into two tokens. This may be fine for your task, but it is worth keeping in mind when you are preprocessing any text data: you might want to remove punctuation or expand contractions before tokenizing.
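If the split is not what you want, two simple workarounds can be sketched with Python's standard library instead of nltk. The regex pattern and the variable names here are assumptions chosen for illustration, not part of the nltk API.

```python
import re
import string

sentence = "Her cat's name is Luna"

# Option 1: treat letters and apostrophes as one token, so "cat's" stays whole
tokens_keep = re.findall(r"[A-Za-z']+", sentence)
print(tokens_keep)  # ['Her', "cat's", 'name', 'is', 'Luna']

# Option 2: delete all punctuation first, then split on whitespace
no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
tokens_stripped = no_punct.split()
print(tokens_stripped)  # ['Her', 'cats', 'name', 'is', 'Luna']
```

Option 2 turns "cat's" into "cats", which changes the apparent meaning, so which approach is appropriate depends on the task.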

In [5]:
sentence_2 = "Her cat's name is Luna and her dog's name is max"
word_tokenize(sentence_2)

['Her',
 'cat',
 "'s",
 'name',
 'is',
 'Luna',
 'and',
 'her',
 'dog',
 "'s",
 'name',
 'is',
 'max']

These tokens illustrate what we learned in the last lesson about the importance of lowercasing. We have two instances of the word 'her', one of which is capitalised. The resulting tokens are different strings and will be treated as different words in most analyses.
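A quick way to see the effect is to count the tokens before and after lowercasing. This is a minimal sketch that copies the token list from the output above rather than calling nltk again; `raw_counts` and `lower_counts` are illustrative names.

```python
from collections import Counter

# Token list copied from the word_tokenize output above
tokens = ['Her', 'cat', "'s", 'name', 'is', 'Luna', 'and',
          'her', 'dog', "'s", 'name', 'is', 'max']

# Without lowercasing, 'Her' and 'her' are counted separately
raw_counts = Counter(tokens)
print(raw_counts['Her'], raw_counts['her'])  # 1 1

# After lowercasing, both occurrences collapse into one token
lower_counts = Counter(t.lower() for t in tokens)
print(lower_counts['her'])  # 2
```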

## Another example

In [None]:
text = "John's bike is red. Sarah's car is blue."
print(sent_tokenize(text))


print(word_tokenize("Sarah's car is blue"))

["John's bike is red.", "Sarah's car is blue."]
['Sarah', "'s", 'car', 'is', 'blue']


## What I Learned

- sent_tokenize() splits text into sentences at sentence-ending punctuation.

- word_tokenize() splits contractions and possessives, so a suffix like 's becomes its own token.

- Both are useful for preprocessing in NLP tasks like Named Entity Recognition or POS tagging.
