# 3.1 Tokenization

In the context of natural language processing (NLP) and text analysis, a `"token"` refers to a single, indivisible unit of text. 

*  Tokens are the building blocks of a text, and they can be individual words or even smaller units, such as punctuation marks or subword units in some cases. 

*  The process of breaking down a text into its constituent tokens is called **"tokenization."**

Some key points about tokens:

1.  **Words as Tokens**: In most NLP tasks, words are the most common type of tokens. 
    *  For example, the sentence `"The quick brown fox"` contains four tokens: `"The"`, `"quick"`, `"brown"`, and `"fox"`.

2.  **Punctuation and Symbols**: Punctuation marks, symbols, and special characters can also be treated as tokens. 
    *   For instance, in the sentence `"Hello, world!"`, the tokens are `"Hello"`, `","`, `"world"`, and `"!"`.


3.  **Subword Units**: In some NLP applications, text is tokenized into subword units, especially in languages with complex morphology or for tasks like machine translation.  Subword tokenization breaks words into smaller units, like `syllables` or character `n-grams`. 
    *   For example, `"unhappiness"` might be tokenized into `"un"`, `"happi"`, and `"ness"`.

4.  **Tokenization Rules**: Tokenization rules can vary depending on the language and the specific NLP task. 
    *   For example, in English, contractions like `"I'm"` are often tokenized into two tokens: `"I"` and `"'m"`.

5.  **Whitespace and Delimiters**: Tokenization is typically based on whitespace (`space`, `tab`, `newline`) as a delimiter, but it can also consider other delimiters or language-specific rules.
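
The word and punctuation rules above can be sketched with Python's built-in `re` module. This is a minimal illustration rather than a full tokenizer: the pattern `\w+|[^\w\s]` and the helper name `simple_tokenize` are assumptions chosen for this example. Every run of word characters becomes one token, and every other non-space character becomes a token of its own.

```python
import re

# Assumed pattern for illustration: a run of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = r"\w+|[^\w\s]"

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(TOKEN_PATTERN, text)

print(simple_tokenize("The quick brown fox"))  # ['The', 'quick', 'brown', 'fox']
print(simple_tokenize("Hello, world!"))        # ['Hello', ',', 'world', '!']
```

A pattern this simple would split `I'm` into `I`, `'`, and `m`, which is one reason practical tokenizers add language-specific rules.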


We are going to explore two approaches to tokenization:

1. **Regular expressions**
2. **Tokenization with NLTK**
    * `word_tokenize`
    * `sent_tokenize`
    * `regexp_tokenize`
    *  and other NLTK tokenizers

## 3.1.1  Regular expressions

Regular expressions are strings with a special syntax that lets us describe *patterns* and find the text that matches them.

*   A *pattern* is a sequence of letters and symbols that can match actual text, such as words or punctuation.

Regular expressions can be used to do things like
*   find links in a webpage,
*   parse email addresses, and
*   remove unwanted strings or characters.

Regular expressions are often referred to as **regex** and are available in Python through the built-in `re` module.

In [1]:
import numpy
import pandas
import re

In [2]:
print(re.match('abc', 'abcdefg'))

<re.Match object; span=(0, 3), match='abc'>


In [3]:
print(re.match('cde', 'abcdefg'))

None


In [4]:
print(re.search('cde', 'abcdefg'))

<re.Match object; span=(2, 5), match='cde'>


In [5]:
word_regex = r'\w+'
re.match(word_regex, 'One big fight!')

<re.Match object; span=(0, 3), match='One'>

In [6]:
digit_regex = r'\d+'
print(re.search(digit_regex, 'Magbubukas ngayon ng 12 pm at bukas ng 10 am.'))


<re.Match object; span=(21, 23), match='12'>


In [7]:
re.findall(word_regex, 'One big fight!')

['One', 'big', 'fight']

In [8]:
print(re.findall(digit_regex, 'Magbubukas ngayon ng 12 pm at bukas ng 10 am.'))

['12', '10']


**Python's re module**

*   `match`: match a pattern at the beginning of a string
*   `search`: search the whole string for the first occurrence of a pattern
*   `split`: split a string on a regex
*   `findall`: find all occurrences of a pattern in a string
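
The difference between these functions is easiest to see side by side. The sketch below uses a short made-up string; `re.fullmatch` is not in the list above and is included only as a contrast, since it requires the pattern to cover the entire string.

```python
import re

text = "12 pm bukas"

# match succeeds only if the pattern matches at the start of the string.
at_start = re.match(r"\d+", text)
# search scans the string and returns the first match found anywhere.
anywhere = re.search(r"[a-z]+", text)
# fullmatch requires the pattern to match the whole string.
whole = re.fullmatch(r"\d+", text)

print(at_start.group())  # 12
print(anywhere.group())  # pm
print(whole)             # None
```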


In [9]:
re.split(r'\s+', 'One big fight!')

['One', 'big', 'fight!']

**Regex groups**

*   OR is represented using `|`
*   Define a group using `()`
*   Define explicit character ranges using `[]`
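
A short sketch of the three constructs, using a made-up sentence. The non-capturing group `(?:...)` is an addition for this example; it groups alternatives without changing what `findall` returns.

```python
import re

text = "Classes start at 9 am and end at 4 pm."

# OR: match either alternative.
am_pm = re.findall(r"am|pm", text)
# Group: with one capturing group, findall returns only the captured part.
hours = re.findall(r"(\d+) (?:am|pm)", text)
# Character range: an uppercase letter followed by word characters.
capitalized = re.findall(r"[A-Z]\w+", text)

print(am_pm)        # ['am', 'pm']
print(hours)        # ['9', '4']
print(capitalized)  # ['Classes']
```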

**More on regular expressions**

[The Complete Guide to Regular Expressions](https://coderpad.io/blog/development/the-complete-guide-to-regular-expressions-regex/)

In [10]:
match_digits_and_words = r'(\d+|\w+)'
amds = """At the end of first year, a grade of at least C in MATH 71.1 and an average grade of at least C (2.00) in MATH 31.1 and MATH 31.2"""


In [11]:
#import re
print(re.findall(match_digits_and_words,amds))

['At', 'the', 'end', 'of', 'first', 'year', 'a', 'grade', 'of', 'at', 'least', 'C', 'in', 'MATH', '71', '1', 'and', 'an', 'average', 'grade', 'of', 'at', 'least', 'C', '2', '00', 'in', 'MATH', '31', '1', 'and', 'MATH', '31', '2']


In [12]:
print(re.split(r"\s+", amds))

['At', 'the', 'end', 'of', 'first', 'year,', 'a', 'grade', 'of', 'at', 'least', 'C', 'in', 'MATH', '71.1', 'and', 'an', 'average', 'grade', 'of', 'at', 'least', 'C', '(2.00)', 'in', 'MATH', '31.1', 'and', 'MATH', '31.2']
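
Neither result keeps a number such as `71.1` intact while also dropping the surrounding punctuation: `findall` breaks it into `71` and `1`, and `split` leaves `(2.00)` wrapped in parentheses. One possible fix, sketched here under the assumption that a decimal number is digits, a period, and more digits, is to try the number alternative before the word alternative.

```python
import re

# Assumed pattern: a number with an optional decimal part, or else a plain word.
number_or_word = r"\d+(?:\.\d+)?|\w+"

sample = "a grade of at least C (2.00) in MATH 31.1 and MATH 31.2"
tokens = re.findall(number_or_word, sample)
print(tokens)
# ['a', 'grade', 'of', 'at', 'least', 'C', '2.00', 'in', 'MATH', '31.1', 'and', 'MATH', '31.2']
```

The decimal part is wrapped in a non-capturing group so that `findall` still returns whole matches.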


## 3.1.2  Tokenization with NLTK

The Natural Language Toolkit, more commonly known as NLTK, is a suite of Python libraries and programs for symbolic and statistical natural language processing (NLP), aimed mainly at English.

It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.



In [13]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
# Fairy tales by Hans Christian Andersen
hca_fairy_tales = pd.read_csv("datasets/hca_fairytales.csv")
hca_fairy_tales.shape

(126102, 3)

In [15]:
hca_fairy_tales.head()

Unnamed: 0,text,book,language
0,Der kom en soldat marcherende hen ad landeveje...,The tinder-box,Danish
1,"tornyster på ryggen og en sabel ved siden, for...",The tinder-box,Danish
2,skulle han hjem. Så mødte han en gammel heks p...,The tinder-box,Danish
3,hendes underlæbe hang hende lige ned på bryste...,The tinder-box,Danish
4,Hvor du har en pæn sabel og et stort tornyster...,The tinder-box,Danish


In [16]:
hca_fairy_tales["language"].unique()

array(['Danish', 'German', 'English', 'Spanish', 'French'], dtype=object)

In [17]:
andersen_en = hca_fairy_tales[hca_fairy_tales["language"]== "English"]
andersen_en.shape

(31380, 3)

In [18]:
len(andersen_en["book"].unique())

156

In [19]:
#import numpy as np
np.where(andersen_en["book"].unique() == "The fir tree")

(array([25]),)

In [20]:
andersen_en["book"].unique()[25]

'The fir tree'

In [21]:
the_fir_tree = andersen_en[andersen_en["book"]== "The fir tree"]
thefirtree_book = " ".join(the_fir_tree["text"])

In [22]:
type(thefirtree_book)


str

### Splitting into sentences with `sent_tokenize()`

In [23]:
from nltk.tokenize import sent_tokenize

In [24]:
# Split the book into sentences
thefirtree_sentences = sent_tokenize(thefirtree_book)
len(thefirtree_sentences)

186

In [25]:
# First sentence of the book
thefirtree_sentences[0]

'Far down in the forest, where the warm sun and the fresh air made a sweet resting-place, grew a pretty little fir-tree; and yet it was not happy, it wished so much to be tall like its companions– the pines and firs which grew around it.'
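
`sent_tokenize` relies on the pretrained Punkt model downloaded earlier. To see why a model helps, compare it with a naive rule, shown here as a rough sketch on a made-up sample: split wherever sentence-ending punctuation is followed by whitespace. The helper name `naive_sent_split` is hypothetical.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "The tree was not happy. It wished to be tall! Dr. Fir disagreed."
pieces = naive_sent_split(sample)
print(pieces)
# ['The tree was not happy.', 'It wished to be tall!', 'Dr.', 'Fir disagreed.']
```

The naive rule wrongly breaks after the abbreviation `Dr.`, which is the kind of case a trained sentence tokenizer is designed to handle.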

### Splitting into words with `word_tokenize()`

In [26]:
# import pandas as pd
from nltk.tokenize import word_tokenize

# Tokenize the first sentence into words
tokenized_1st_sent = word_tokenize(thefirtree_sentences[0].lower())

In [27]:
tokenized_sent1_series = pd.Series(tokenized_1st_sent)
tokenized_sent1_df = pd.DataFrame({"Token": tokenized_sent1_series})
tokenized_sent1_counts_df = tokenized_sent1_df.value_counts().reset_index()
tokenized_sent1_counts_df.columns = ["Token", "Count"]
tokenized_sent1_counts_df

Unnamed: 0,Token,Count
0,the,4
1,",",3
2,and,3
3,it,3
4,a,2
5,grew,2
6,sun,1
7,not,1
8,pines,1
9,pretty,1
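
For a quick frequency check that does not need a dataframe, the standard library's `collections.Counter` gives the same kind of count. The token list below is a hypothetical stand-in for a tokenized sentence, not the actual output of `word_tokenize`.

```python
from collections import Counter

# Hypothetical token list standing in for a tokenized sentence.
tokens = ["the", "pines", "and", "the", "firs", "and", "the", "tree"]

token_counts = Counter(tokens)
top_two = token_counts.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```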


In [28]:
tokenized_sentences = [word_tokenize(s) for s in thefirtree_sentences]
len(tokenized_sentences)

186

In [29]:
# Create an empty list to store the dataframes
dfs = []

# Iterate over the indices of tokenized_sentences
for i in range(len(tokenized_sentences)):
    # Create a dataframe with two columns
    df = pd.DataFrame({'Sentence': [i]*len(tokenized_sentences[i]), 
                       'Token': [token.lower() for token in tokenized_sentences[i]]})
    # Append the dataframe to the list
    dfs.append(df)

# Concatenate all the dataframes in the list
tokenized_df = pd.concat(dfs, ignore_index=True)

tokenized_df

Unnamed: 0,Sentence,Token
0,0,far
1,0,down
2,0,in
3,0,the
4,0,forest
...,...,...
3960,185,an
3961,185,end
3962,185,at
3963,185,last
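
The loop above builds one small dataframe per sentence and concatenates them. The same long-format rows can also be produced with a single comprehension, sketched here in plain Python on two hypothetical tokenized sentences.

```python
# Hypothetical sample: two already-tokenized sentences.
sample_sentences = [
    ["Far", "down", "in", "the", "forest"],
    ["It", "was", "not", "happy"],
]

# One (sentence index, lowercased token) pair per token.
rows = [
    (i, token.lower())
    for i, tokens in enumerate(sample_sentences)
    for token in tokens
]
print(rows[:3])  # [(0, 'far'), (0, 'down'), (0, 'in')]
```

A list like this could be passed to `pd.DataFrame(rows, columns=["Sentence", "Token"])` in one step, which avoids creating many intermediate dataframes.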


In [30]:
# Count the frequency of a token in each sentence
tokenized_count_df = pd.DataFrame(tokenized_df.groupby(["Sentence", "Token"]).size().reset_index(name="Count"))
tokenized_count_df.sort_values(by=["Sentence", "Count"], ascending=[True,False])

Unnamed: 0,Sentence,Token,Count
32,0,the,4
0,0,",",3
5,0,and,3
18,0,it,3
3,0,a,2
...,...,...,...
3342,185,stories,1
3343,185,story,1
3345,185,to,1
3346,185,tree,1


In [31]:
unique_count_df = pd.DataFrame(tokenized_count_df.value_counts("Sentence").reset_index(name="UniqueWordCount"))
unique_count_df

Unnamed: 0,Sentence,UniqueWordCount
0,68,45
1,115,42
2,183,41
3,150,41
4,0,40
...,...,...
181,80,3
182,81,3
183,122,2
184,46,2


In [32]:
# The number of tokens per sentence
group_sum_df = tokenized_count_df.groupby("Sentence")["Count"].sum().reset_index()
# Rename the "Count" column to "TotalCount"
group_sum_df.rename(columns={"Count": "TotalCount"}, inplace=True)
group_sum_df

Unnamed: 0,Sentence,TotalCount
0,0,51
1,1,30
2,2,38
3,3,9
4,4,34
...,...,...
181,181,32
182,182,32
183,183,57
184,184,34


In [33]:
thefirtree_count_df = unique_count_df.merge(group_sum_df, on="Sentence", how="left")
thefirtree_count_df.sort_values("Sentence")

Unnamed: 0,Sentence,UniqueWordCount,TotalCount
4,0,40,51
59,1,23,30
20,2,31,38
140,3,9,9
27,4,30,34
...,...,...,...
33,181,28,32
62,182,23,32
2,183,41,57
37,184,28,34
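
As a sanity check on what the two columns mean, the same quantities can be computed directly from token lists: `UniqueWordCount` is the number of distinct tokens in a sentence and `TotalCount` is the number of tokens. The sketch below uses hypothetical token lists, not the fairy-tale data.

```python
# Hypothetical tokenized sentences (already lowercased).
sample_sentences = [
    ["the", "tree", "and", "the", "sun"],
    ["it", "grew", "and", "grew"],
]

summary = [
    {
        "Sentence": i,
        "UniqueWordCount": len(set(tokens)),  # distinct tokens
        "TotalCount": len(tokens),            # all tokens
    }
    for i, tokens in enumerate(sample_sentences)
]
print(summary)
```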


### Using NLTK's `regexp_tokenize()`

In [34]:
# import pandas as pd

# Load the file
scene_one = pd.read_csv("datasets/grail.txt", sep="\t", header=None)

# Display the first few rows of the dataframe
scene_one.head(10)

Unnamed: 0,0
0,SCENE 1: [wind] [clop clop clop]
1,KING ARTHUR: Whoa there! [clop clop clop]
2,SOLDIER #1: Halt! Who goes there?
3,"ARTHUR: It is I, Arthur, son of Uther Pendrago..."
4,SOLDIER #1: Pull the other one!
5,"ARTHUR: I am, ... and this is my trusty serva..."
6,SOLDIER #1: What? Ridden on a horse?
7,ARTHUR: Yes!
8,SOLDIER #1: You're using coconuts!
9,ARTHUR: What?


In [35]:
scene_one.rename(columns={0: "Text"}, inplace=True)
scene_one.head()

Unnamed: 0,Text
0,SCENE 1: [wind] [clop clop clop]
1,KING ARTHUR: Whoa there! [clop clop clop]
2,SOLDIER #1: Halt! Who goes there?
3,"ARTHUR: It is I, Arthur, son of Uther Pendrago..."
4,SOLDIER #1: Pull the other one!


In [36]:
# import re

scene_one_df = pd.DataFrame()
# Apply the re.sub() function to the "Text" column in scene_one
pattern = r"[A-Z]{2,}(\s)?(#\d)?(\d)?([A-Z]{2,})?:"
scene_one_df["Text"] = scene_one["Text"].apply(lambda x: re.sub(pattern, '', str(x)))
scene_one_df.head(10)

Unnamed: 0,Text
0,[wind] [clop clop clop]
1,Whoa there! [clop clop clop]
2,Halt! Who goes there?
3,"It is I, Arthur, son of Uther Pendragon, from..."
4,Pull the other one!
5,"I am, ... and this is my trusty servant Pats..."
6,What? Ridden on a horse?
7,Yes!
8,You're using coconuts!
9,What?
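
To see what the substitution does without loading the file, the sketch below applies the same speaker-label pattern to three lines copied from the output above. The `.strip()` call is an addition for this example; it removes the space left behind after the colon.

```python
import re

# Same speaker-label pattern, written as a raw string so the
# backslashes reach the regex engine unchanged.
speaker_pattern = r"[A-Z]{2,}(\s)?(#\d)?(\d)?([A-Z]{2,})?:"

sample_lines = [
    "SCENE 1: [wind] [clop clop clop]",
    "KING ARTHUR: Whoa there! [clop clop clop]",
    "SOLDIER #1: Halt! Who goes there?",
]

cleaned = [re.sub(speaker_pattern, "", line).strip() for line in sample_lines]
print(cleaned)
```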


In [37]:
from nltk.tokenize import regexp_tokenize
scene_one_df["Tokens"] = scene_one_df["Text"].apply(lambda x: regexp_tokenize(str(x), r"\w+"))
scene_one_df.head(10)

Unnamed: 0,Text,Tokens
0,[wind] [clop clop clop],"[wind, clop, clop, clop]"
1,Whoa there! [clop clop clop],"[Whoa, there, clop, clop, clop]"
2,Halt! Who goes there?,"[Halt, Who, goes, there]"
3,"It is I, Arthur, son of Uther Pendragon, from...","[It, is, I, Arthur, son, of, Uther, Pendragon,..."
4,Pull the other one!,"[Pull, the, other, one]"
5,"I am, ... and this is my trusty servant Pats...","[I, am, and, this, is, my, trusty, servant, Pa..."
6,What? Ridden on a horse?,"[What, Ridden, on, a, horse]"
7,Yes!,[Yes]
8,You're using coconuts!,"[You, re, using, coconuts]"
9,What?,[What]
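
With the pattern `\w+`, the contraction `You're` is split into `You` and `re` because the apostrophe is not a word character. If contractions should stay whole, one option is to allow an apostrophe inside a word. The sketch below uses `re.findall`, which should behave like `regexp_tokenize` for a pattern without capturing groups; the pattern itself is an assumption for this example.

```python
import re

# Assumed pattern: a word, optionally followed by an apostrophe and more word characters.
word_with_apostrophe = r"\w+(?:'\w+)?"

tokens = re.findall(word_with_apostrophe, "You're using coconuts!")
print(tokens)  # ["You're", 'using', 'coconuts']
```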


In [38]:
scene_one_tokenized_series = scene_one_df["Tokens"]

# Create an empty list to store the dataframes
dfs = []

# Iterate over the indices of scene_one_tokenized_series
for i in range(len(scene_one_tokenized_series)):
    # Create a dataframe with two columns
    df = pd.DataFrame({'Line': [i]*len(scene_one_tokenized_series[i]), 
                       'Token': [token.lower() for token in scene_one_tokenized_series[i]]})
    # Append the dataframe to the list
    dfs.append(df)

# Concatenate all the dataframes in the list
scene_tokenized_df = pd.concat(dfs, ignore_index=True)

scene_tokenized_df

Unnamed: 0,Line,Token
0,0.0,wind
1,0.0,clop
2,0.0,clop
3,0.0,clop
4,1.0,whoa
...,...,...
10115,1189.0,pack
10116,1189.0,that
10117,1189.0,in
10118,1189.0,crash


In [39]:
# Count the frequency of a token in each line
lines_wordcount = pd.DataFrame(scene_tokenized_df.groupby(["Line", "Token"]).size().reset_index(name="Count"))
lines_wordcount.sort_values(by=["Line", "Count"], ascending=[True,False])

Unnamed: 0,Line,Token,Count
0,0.0,clop,3
1,0.0,wind,1
2,1.0,clop,3
3,1.0,there,1
4,1.0,whoa,1
...,...,...,...
8482,1189.0,pack,1
8483,1189.0,right,1
8484,1189.0,s,1
8485,1189.0,sonny,1


In [40]:
lines_unique_wordcount = pd.DataFrame(lines_wordcount.value_counts("Line").reset_index(name="UniqueWordCount"))
lines_unique_wordcount

Unnamed: 0,Line,UniqueWordCount
0,285.0,67
1,460.0,61
2,960.0,59
3,609.0,57
4,368.0,57
...,...,...
1185,628.0,1
1186,623.0,1
1187,616.0,1
1188,1031.0,1


In [41]:
lines_sum = lines_wordcount.groupby("Line")["Count"].sum().reset_index()
# Rename the "Count" column to "TotalCount"
lines_sum.rename(columns={"Count": "TotalCount"}, inplace=True)
lines_sum

Unnamed: 0,Line,TotalCount
0,0.0,4
1,1.0,5
2,2.0,4
3,3.0,25
4,4.0,4
...,...,...
1185,1186.0,11
1186,1187.0,9
1187,1188.0,2
1188,1189.0,11


In [42]:
sceneone_count_df = lines_unique_wordcount.merge(lines_sum, on="Line", how="left")
sceneone_count_df.sort_values("Line")

Unnamed: 0,Line,UniqueWordCount,TotalCount
911,0.0,2,4
822,1.0,3,5
632,2.0,4,4
77,3.0,19,25
642,4.0,4,4
...,...,...,...
285,1186.0,9,11
408,1187.0,7,9
883,1188.0,2,2
240,1189.0,10,11


In [43]:
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

In [44]:
# Tokenize and print only capital words (1st letter is capitalized)
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

['Wann', 'Pizza', 'Und', 'Über']
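
One detail worth checking in character classes: inside `[...]`, the character `|` is a literal pipe rather than an OR operator. The sketch below, using a made-up sample string, compares a class that includes `|` with one that lists only the intended characters.

```python
import re

text = "Wann |fährst du mit Über?"

# Inside [...], '|' is a literal character, so '|fährst' is matched too.
with_pipe = re.findall(r"[A-Z|Ü]\w+", text)
# Listing only the allowed first characters avoids matching '|'.
without_pipe = re.findall(r"[A-ZÜ]\w+", text)

print(with_pipe)     # ['Wann', '|fährst', 'Über']
print(without_pipe)  # ['Wann', 'Über']
```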


In [45]:
# Tokenize and print only emoji
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

['🍕', '🚕']


In [46]:
#HangzhouAsianGames #AsianGames
message1 = 'Congratulations 🎊 Kabayan,,,,Go for Gold..'
print(regexp_tokenize(message1, emoji))

['🎊']


In [47]:
print(regexp_tokenize(message1, capital_words))

['Congratulations', 'Kabayan', 'Go', 'Gold']


In [48]:
print(word_tokenize(message1))

['Congratulations', '🎊', 'Kabayan', ',', ',', ',', ',Go', 'for', 'Gold', '..']


In [49]:
# Ernest Obiena's Facebook page. Posted by Djundi Biñas
message2 = """Congratulations Ernest Obiena EJ Obiena - Ernest Obiena for Winning the 1st Gold 🥇Medal for the Philippines 🇵🇭 in the 19th Asian Games and breaking the Championship Record! 💪🏼💯 Boom! 💥Salamat EJ 
#ParaSaBayan 
#PoleVault 
#PhilippinePoleVault 
#TheEJEffectCongratulations Ernest Obiena EJ Obiena - Ernest Obiena for Winning the 1st Gold 🥇Medal for the Philippines 🇵🇭 in the 19th Asian Games and breaking the Championship Record! 💪🏼💯 Boom! 💥Salamat EJ 🫰🏼🙌🏼
#ParaSaBayan 
#PoleVault 
#PhilippinePoleVault 
#TheEJEffect"""
# message2

In [50]:
print(regexp_tokenize(message2, emoji))

['💪', '🏼', '💯', '💥', '💪', '🏼', '💯', '💥', '🏼', '🙌', '🏼']
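
In this output the skin-tone modifier `🏼` appears as a separate token because it is a code point of its own (U+1F3FC) that falls inside the first range of the pattern. If an emoji and its modifier should stay together, one option is to allow an optional modifier after each base emoji. This is only a sketch: the ranges below are assumed and not a complete definition of emoji.

```python
import re

# Base emoji ranges (assumed, not exhaustive) and the five skin-tone modifiers.
base = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]"
modifier = "[\U0001F3FB-\U0001F3FF]"
emoji_with_modifier = base + modifier + "?"

text = "Record! 💪🏼💯 Boom! 🙌🏼"
emoji_tokens = re.findall(emoji_with_modifier, text)
print(emoji_tokens)  # ['💪🏼', '💯', '🙌🏼']
```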


In [51]:
# all hashtags in the message 
regexp_tokenize(message2, r"#\w+")

['#ParaSaBayan',
 '#PoleVault',
 '#PhilippinePoleVault',
 '#TheEJEffectCongratulations',
 '#ParaSaBayan',
 '#PoleVault',
 '#PhilippinePoleVault',
 '#TheEJEffect']

In [52]:
print(word_tokenize(message2))

['Congratulations', 'Ernest', 'Obiena', 'EJ', 'Obiena', '-', 'Ernest', 'Obiena', 'for', 'Winning', 'the', '1st', 'Gold', '🥇Medal', 'for', 'the', 'Philippines', '🇵🇭', 'in', 'the', '19th', 'Asian', 'Games', 'and', 'breaking', 'the', 'Championship', 'Record', '!', '💪🏼💯', 'Boom', '!', '💥Salamat', 'EJ', '#', 'ParaSaBayan', '#', 'PoleVault', '#', 'PhilippinePoleVault', '#', 'TheEJEffectCongratulations', 'Ernest', 'Obiena', 'EJ', 'Obiena', '-', 'Ernest', 'Obiena', 'for', 'Winning', 'the', '1st', 'Gold', '🥇Medal', 'for', 'the', 'Philippines', '🇵🇭', 'in', 'the', '19th', 'Asian', 'Games', 'and', 'breaking', 'the', 'Championship', 'Record', '!', '💪🏼💯', 'Boom', '!', '💥Salamat', 'EJ', '\U0001faf0🏼🙌🏼', '#', 'ParaSaBayan', '#', 'PoleVault', '#', 'PhilippinePoleVault', '#', 'TheEJEffect']


### Using NLTK's `TweetTokenizer`

In [53]:
# from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

In [54]:
tweets = ['Shape-changing, free-roaming soft #robot created','#Mathematics needed in preparation for an introductory class in #MachineLearning:  https://bit.ly/2NtsX8yÂ','IBM fairness toolkit aims to eliminate bias in data sets -  https://bit.ly/2BFbRQDÂ  - thanks @RichardEudes #DataScience #DS,#MachineLearning,#ArtificialIntelligence,#DataScience']

In [55]:
# All hashtags in tweets
[regexp_tokenize(twit, r"#\w+") for twit in tweets]

[['#robot'],
 ['#Mathematics', '#MachineLearning'],
 ['#DataScience',
  '#DS',
  '#MachineLearning',
  '#ArtificialIntelligence',
  '#DataScience']]

In [56]:
tweets_hashtags = list(map(lambda twit: regexp_tokenize(twit, r"#\w+"), tweets))
tweets_hashtags

[['#robot'],
 ['#Mathematics', '#MachineLearning'],
 ['#DataScience',
  '#DS',
  '#MachineLearning',
  '#ArtificialIntelligence',
  '#DataScience']]

In [57]:
# Write a pattern that matches both mentions (@) and hashtags
pattern = r"[@#]\w+"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern)
print(mentions_hashtags)

['@RichardEudes', '#DataScience', '#DS', '#MachineLearning', '#ArtificialIntelligence', '#DataScience']


In [58]:
# from nltk.tokenize import regexp_tokenize
# from nltk.tokenize import TweetTokenizer

# Use the TweetTokenizer to tokenize every tweet (one list of tokens per tweet)
tknzr = TweetTokenizer()
# all_tokens = [tknzr.tokenize(t) for t in tweets]
all_tokens = list(map(lambda twit: tknzr.tokenize(twit), tweets))
print(all_tokens)

[['Shape-changing', ',', 'free-roaming', 'soft', '#robot', 'created'], ['#Mathematics', 'needed', 'in', 'preparation', 'for', 'an', 'introductory', 'class', 'in', '#MachineLearning', ':', 'https://bit.ly/2NtsX8yÂ'], ['IBM', 'fairness', 'toolkit', 'aims', 'to', 'eliminate', 'bias', 'in', 'data', 'sets', '-', 'https://bit.ly/2BFbRQDÂ', '-', 'thanks', '@RichardEudes', '#DataScience', '#DS', ',', '#MachineLearning', ',', '#ArtificialIntelligence', ',', '#DataScience']]
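
To get a feel for the kind of rules a tweet-aware tokenizer needs, the sketch below approximates this behaviour with a single regular expression. It is not NLTK's actual pattern: the alternatives (URLs, then mentions and hashtags, then hyphenated words, then single punctuation marks) are assumptions chosen to reproduce the output above on simple inputs.

```python
import re

# Assumed alternatives, tried in order: URL, @mention or #hashtag,
# word (hyphens allowed inside), any single punctuation character.
tweet_pattern = r"https?://\S+|[@#]\w+|\w+(?:-\w+)*|[^\w\s]"

def rough_tweet_tokenize(text):
    """Very rough regex approximation of a tweet tokenizer (illustrative only)."""
    return re.findall(tweet_pattern, text)

first = rough_tweet_tokenize("Shape-changing, free-roaming soft #robot created")
second = rough_tweet_tokenize("thanks @RichardEudes #DataScience #DS,#MachineLearning")
print(first)
print(second)
```

The order of the alternatives matters: the URL rule comes first so that a link is not broken apart at `:` and `/` by the later rules.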
