## Text Tokenization
<i>Tokens</i> are independent, minimal textual components that have a definite syntax and semantics. A paragraph or a text document consists of several components: sentences, which can be further broken down into clauses, phrases, and words. The most popular tokenization techniques are sentence and word tokenization, which break a text corpus into sentences and each sentence into words. Tokenization can thus be defined as the process of splitting textual data into smaller, meaningful components called tokens.
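
As a rough illustration of this two-level breakdown, the sketch below splits a small corpus into sentences and then splits each sentence into word and punctuation tokens, using only Python's `re` module. The function name `tokenize_corpus` and both regular expressions are assumptions made for this example and are not part of any library; they are deliberately naive.

```python
import re

def tokenize_corpus(text):
    """Split text into sentences, then each sentence into tokens (naive sketch)."""
    # Level 1: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Level 2: a token is a run of word characters or a single punctuation mark.
    return [re.findall(r'\w+|[^\w\s]', sentence) for sentence in sentences]

corpus = "Tokens are small units. Words are tokens too!"
for sentence_tokens in tokenize_corpus(corpus):
    print(sentence_tokens)
```

The result is a list of sentences, each represented as a list of its tokens.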

### Sentence Tokenization
<i>Sentence tokenization</i> is the process of splitting a text corpus into sentences, which act as
the first level of tokens that make up the corpus. This is also known as <i>sentence
segmentation</i>, because we try to segment the text into meaningful sentences.

There are various ways of performing sentence tokenization. Basic techniques look for specific delimiters between sentences, such as a period (.), a newline character (\n), or sometimes a semicolon (;).
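
A delimiter-based splitter of this kind can be sketched in a few lines with Python's `re` module. The helper `naive_sentence_split` and the pattern `DELIMITER_PATTERN` are hypothetical names chosen for this example; the pattern splits after a period or semicolon that is followed by whitespace, and on newlines.

```python
import re

# Hypothetical pattern: whitespace after '.' or ';', or one or more newlines.
DELIMITER_PATTERN = r'(?<=[.;])\s+|\n+'

def naive_sentence_split(text):
    """Split text on simple sentence delimiters and drop empty pieces."""
    parts = re.split(DELIMITER_PATTERN, text)
    return [part.strip() for part in parts if part.strip()]

text = "Tokens are small units. Sentences come first; words come next.\nDr. Smith agrees."
for sentence in naive_sentence_split(text):
    print(sentence)
```

In this run the abbreviation "Dr." is treated as the end of a sentence, which is the kind of case that pure delimiter rules get wrong.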

In [11]:
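# Sentence tokenization with NLTK
# Requires the Punkt models: run nltk.download('punkt') once
# ('punkt_tab' on newer NLTK releases).

import nltk
from nltk import sent_tokenize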

sample_text = """We will discuss briefly about the basic syntax, structure and \
design philosophies. There is a defined hierarchical syntax for Python code \
which you should remember when writing code! Python is a really powerful \
programming language!"""

sentences = sent_tokenize(text=sample_text)
for sent in sentences:
    print(sent)

We will discuss briefly about the basic syntax, structure and design philosophies.
There is a defined hierarchical syntax for Python code which you should remember when writing code!
Python is a really powerful programming language!


In [13]:
german_text = """Mit der Aktion will Trump nach eigenen Worten seinen loyalsten Fans eine \
weitere Chance bieten, ihre volle Unterstützung seiner “Save America”-Bewegung zum Ausdruck \
zu bringen. Wozu genau die Mitgliedschaftskarten ihre jeweiligen Inhaber berechtigen, konnte \
sein Team bislang nicht erklären. Bereits in der Vorwoche verkaufte der ehemalige Präsident auf \
seiner Webseite signierte Fotos von sich selbst für 45 Dollar. Seine Anhänger zahlen den Preis \
gern und hoffen inständig auf seine Wiederkandidatur 2024."""

# Select the German Punkt model via the language argument
sentences = sent_tokenize(text=german_text, language='german')
for sent in sentences:
    print(sent)

Mit der Aktion will Trump nach eigenen Worten seinen loyalsten Fans eine weitere Chance bieten, ihre volle Unterstützung seiner “Save America”-Bewegung zum Ausdruck zu bringen.
Wozu genau die Mitgliedschaftskarten ihre jeweiligen Inhaber berechtigen, konnte sein Team bislang nicht erklären.
Bereits in der Vorwoche verkaufte der ehemalige Präsident auf seiner Webseite signierte Fotos von sich selbst für 45 Dollar.
Seine Anhänger zahlen den Preis gern und hoffen inständig auf seine Wiederkandidatur 2024.


In [42]:
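# Regex-based sentence tokenizer
# Split on a whitespace character that follows '.', '?' or '!', unless the
# preceding text looks like an abbreviation such as "e.g.", "Dr." or "A.".
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
    pattern=SENTENCE_TOKENS_PATTERN,
    gaps=True)  # the pattern matches the gaps between sentences
sample_sentences = regex_st.tokenize(german_text)
for sent in sample_sentences:
    print(sent)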

Mit der Aktion will Trump nach eigenen Worten seinen loyalsten Fans eine weitere Chance bieten, ihre volle Unterstützung seiner “Save America”-Bewegung zum Ausdruck zu bringen.
Wozu genau die Mitgliedschaftskarten ihre jeweiligen Inhaber berechtigen, konnte sein Team bislang nicht erklären.
Bereits in der Vorwoche verkaufte der ehemalige Präsident auf seiner Webseite signierte Fotos von sich selbst für 45 Dollar.
Seine Anhänger zahlen den Preis gern und hoffen inständig auf seine Wiederkandidatur 2024.
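
The German sample contains no abbreviations, so it does not exercise the negative lookbehinds in the pattern. The sketch below applies the same pattern with plain `re.split`, which stands in for the gap-based tokenizer here, to a made-up English sentence with abbreviations. The sample text is an assumption chosen for illustration.

```python
import re

SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# "Mr." is protected by (?<![A-Z][a-z]\.) and "p.m." by (?<!\w\.\w.)
text = "Mr. Smith arrived at 5 p.m. sharp. He was late! Was anyone surprised?"
sentences = re.split(SENTENCE_TOKENS_PATTERN, text)
for sent in sentences:
    print(sent)
```

This yields three sentences, with "Mr." and "p.m." left intact. One limitation worth noting: because of `(?<![A-Z]\.)`, a sentence that ends in a capital letter followed by a period (for example one ending in "NLTK.") is not split from the next one.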


In [54]:
# Using spaCy Framework
import spacy
nlp = spacy.load('en_core_web_md')

sample_text = """We will discuss briefly about the basic syntax, structure and \
design philosophies. There is a defined hierarchical syntax for Python code \
which you should remember when writing code! Python is a really powerful \
programming language!"""

text_spacy = nlp(sample_text)

sentences = list(text_spacy.sents)
for sent in sentences:
    print(sent)

We will discuss briefly about the basic syntax, structure and design philosophies.
There is a defined hierarchical syntax for Python code which you should remember when writing code!
Python is a really powerful programming language!


### Word Tokenization

In [35]:
# NLTK word tokenizer
from nltk import word_tokenize

sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""
words = word_tokenize(sample_text)
print(words)

['Hours', 'later', ',', 'rumors', 'began', 'to', 'spread', 'in', 'the', 'close-knit', 'community', 'that', 'something', 'had', 'happened', 'to', 'her', 'son', 'in', 'the', 'nearby', 'river', '.', 'Kapessa', '’', 's', 'older', 'sister', 'heard', 'from', 'friends', 'that', 'Kapessa', ',', 'who', 'could', 'not', 'swim', ',', 'may', 'have', 'jumped', 'into', 'the', 'water', ';', 'others', 'alleged', 'that', 'he', 'had', 'been', 'pushed', '.']


In [39]:
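# Treebank word tokenizer
# Expects one sentence at a time: only the final period is split off, and the
# curly apostrophe in "Kapessa’s" is not treated as a contraction marker.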

sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""

tokenizer = nltk.TreebankWordTokenizer()
words = tokenizer.tokenize(sample_text)
print(words)


['Hours', 'later', ',', 'rumors', 'began', 'to', 'spread', 'in', 'the', 'close-knit', 'community', 'that', 'something', 'had', 'happened', 'to', 'her', 'son', 'in', 'the', 'nearby', 'river.', 'Kapessa’s', 'older', 'sister', 'heard', 'from', 'friends', 'that', 'Kapessa', ',', 'who', 'could', 'not', 'swim', ',', 'may', 'have', 'jumped', 'into', 'the', 'water', ';', 'others', 'alleged', 'that', 'he', 'had', 'been', 'pushed', '.']


In [41]:
# Regexp Word Tokenizer
sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""

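# gaps=False: the pattern describes the tokens themselves (runs of word characters)
TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,
                                gaps=False)
words = regex_wt.tokenize(sample_text)
print(words)

# gaps=True: the pattern describes the separators between tokens (whitespace)
GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                gaps=True)
words = regex_wt.tokenize(sample_text)
print(words)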

['Hours', 'later', 'rumors', 'began', 'to', 'spread', 'in', 'the', 'close', 'knit', 'community', 'that', 'something', 'had', 'happened', 'to', 'her', 'son', 'in', 'the', 'nearby', 'river', 'Kapessa', 's', 'older', 'sister', 'heard', 'from', 'friends', 'that', 'Kapessa', 'who', 'could', 'not', 'swim', 'may', 'have', 'jumped', 'into', 'the', 'water', 'others', 'alleged', 'that', 'he', 'had', 'been', 'pushed']
['Hours', 'later,', 'rumors', 'began', 'to', 'spread', 'in', 'the', 'close-knit', 'community', 'that', 'something', 'had', 'happened', 'to', 'her', 'son', 'in', 'the', 'nearby', 'river.', 'Kapessa’s', 'older', 'sister', 'heard', 'from', 'friends', 'that', 'Kapessa,', 'who', 'could', 'not', 'swim,', 'may', 'have', 'jumped', 'into', 'the', 'water;', 'others', 'alleged', 'that', 'he', 'had', 'been', 'pushed.']
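
The difference between the two outputs comes down to whether the pattern matches the tokens or the gaps between them. A minimal sketch with Python's `re` module shows the same contrast on a short fragment: `re.findall` collects what the pattern matches, while `re.split` keeps what lies between the matches. The variable names and the sample fragment are assumptions for this example.

```python
import re

sample = "close-knit community; others alleged."

# Pattern matches the tokens: punctuation and hyphens are dropped.
tokens_from_matches = re.findall(r'\w+', sample)

# Pattern matches the gaps: punctuation stays attached to the words.
tokens_from_gaps = [t for t in re.split(r'\s+', sample) if t]

print(tokens_from_matches)  # ['close', 'knit', 'community', 'others', 'alleged']
print(tokens_from_gaps)     # ['close-knit', 'community;', 'others', 'alleged.']
```

Matching tokens with `\w+` discards all punctuation, whereas splitting on whitespace preserves it but leaves it glued to neighbouring words; which behaviour is preferable depends on the downstream task.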


In [57]:
# spaCy tokenizer (blank English pipeline: tokenizer only, no trained model needed)

from spacy.lang.en import English

nlp = English()
sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""

doc = nlp(sample_text)
tokens = list(doc)  # spaCy Token objects; use token.text for plain strings
print(tokens)

[Hours, later, ,, rumors, began, to, spread, in, the, close, -, knit, community, that, something, had, happened, to, her, son, in, the, nearby, river, ., Kapessa, ’s, older, sister, heard, from, friends, that, Kapessa, ,, who, could, not, swim, ,, may, have, jumped, into, the, water, ;, others, alleged, that, he, had, been, pushed, .]
