 <h4>Unit 1</h4> <h1 style="text-align:center"> Chapter 2</h1>
 
 ---

###### Words
---

Take a look at this sentence:

'The quick brown fox jumps over the lazy fox, and took his meal.'

* The sentence has 13 _words_ if you don't count punctuation, and 15 if you do.

* Whether to count punctuation as a word depends on the task at hand.

* For some tasks, such as part-of-speech (POS) tagging and speech synthesis, punctuation marks are treated as words. ("Hello!" and "Hello?" are spoken differently in speech synthesis.)

In [4]:
len('The quick brown fox jumps over the lazy fox, and took his meal.'.split())

13
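
The two counts above can be reproduced with a small regular-expression sketch. It assumes that a "word" is a run of letters or digits and that each punctuation mark counts as its own token. The variable names are illustrative only.

```python
import re

sentence = 'The quick brown fox jumps over the lazy fox, and took his meal.'

# Runs of word characters only: punctuation is ignored
words_only = re.findall(r"\w+", sentence)

# Runs of word characters, or any single non-space, non-word character
words_and_punctuation = re.findall(r"\w+|[^\w\s]", sentence)

print(len(words_only))             # 13
print(len(words_and_punctuation))  # 15
```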

##### Utterance

> An utterance is the spoken correlate of a sentence. (Speaking a sentence produces an utterance.)

Take a look at this sentence:

'I am goi- going to the market to buy ummm fruits.'

* This utterance has two kinds of <strong>disfluencies</strong> (disruptions in the smooth flow of speech).

1. Fragment - The broken-off word 'goi-' is a fragment.
2. Fillers - Words like 'ummm' and 'uhhh' are called fillers or filled pauses.
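
A minimal sketch of how the two disfluency types could be spotted in a transcript. It assumes that fragments are written with a trailing hyphen, as in the example above, and that fillers come from a small hand-written list. `FILLERS` and `find_disfluencies` are hypothetical names, not part of any library.

```python
# Hypothetical filler list; a real system would need a much larger one
FILLERS = {"um", "umm", "ummm", "uh", "uhh", "uhhh"}

def find_disfluencies(utterance):
    """Split an utterance into fragments, fillers, and the remaining fluent words."""
    fragments, fillers, fluent = [], [], []
    for token in utterance.rstrip('.').split():
        if token.endswith('-'):
            # A trailing hyphen marks a broken-off word
            fragments.append(token.rstrip('-'))
        elif token.lower() in FILLERS:
            fillers.append(token)
        else:
            fluent.append(token)
    return fragments, fillers, fluent

fragments, fillers, fluent = find_disfluencies('I am goi- going to the market to buy ummm fruits.')
print("Fragments = {}".format(fragments))
print("Fillers = {}".format(fillers))
print("Fluent words = {}".format(' '.join(fluent)))
```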


##### Lemma

> A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.

* A wordform is the full inflected or derived form of a word.

Example,

Wordforms - cats, cat

Lemma - cat

Wordforms - moving, move

Lemma - move
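
The mapping from wordforms to lemmas can be pictured as a lookup. The sketch below uses a tiny hand-written table, which is purely an assumption for illustration. Real lemmatizers rely on a dictionary and morphological rules.

```python
# Hypothetical lookup table covering only the examples above
LEMMA_TABLE = {"cats": "cat", "cat": "cat", "moving": "move", "move": "move"}

def lemmatize(wordform):
    """Return the lemma of a known wordform, or the lowercased wordform otherwise."""
    key = wordform.lower()
    return LEMMA_TABLE.get(key, key)

for wordform in ["cats", "cat", "Moving", "move"]:
    print("{} -> {}".format(wordform, lemmatize(wordform)))
```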

##### Vocabulary, Wordtypes, and Wordtokens

* Vocabulary - The set of distinct words in a corpus.

* Wordtypes - The number of distinct words in the corpus, i.e. the vocabulary size |V|.

* Wordtokens - The total number of running words in the corpus.

Take a look at this sentence:

'They picnicked by the pool, then lay back on the grass and looked at the stars.'

Here,

* Vocabulary = V = {They, picnicked, by, the, pool, then, lay, back, on, grass, and, looked, at, stars}

* Wordtypes = |V| = 14

* Wordtokens (ignoring punctuation) = 16

In [27]:
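def vocTypeToken(sentence):
    """Print the tokens, vocabulary, wordtype count, and wordtoken count of a sentence."""
    # split() breaks on whitespace only, so punctuation stays attached ('pool,', 'stars.')
    tokens = sentence.split()
    # The vocabulary is the set of distinct tokens
    vocabulary = list(set(tokens))
    wordtypes = len(vocabulary)
    wordtokens = len(tokens)
    print("Sentence = {}\n".format(sentence))
    print("Tokens = {}\n".format(tokens))
    print("Vocabulary = {}\n".format(sorted(vocabulary)))
    print("Wordtypes = {}\n".format(wordtypes))
    print("Wordtokens = {}".format(wordtokens))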

In [28]:
sentence = 'They picnicked by the pool, then lay back on the grass and looked at the stars.'
vocTypeToken(sentence)

Sentence = They picnicked by the pool, then lay back on the grass and looked at the stars.

Tokens = ['They', 'picnicked', 'by', 'the', 'pool,', 'then', 'lay', 'back', 'on', 'the', 'grass', 'and', 'looked', 'at', 'the', 'stars.']

Vocabulary = ['They', 'and', 'at', 'back', 'by', 'grass', 'lay', 'looked', 'on', 'picnicked', 'pool,', 'stars.', 'the', 'then']

Wordtypes = 14

Wordtokens = 16
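
In the output above the punctuation stays attached to 'pool,' and 'stars.'. A small variant, sketched here with `re` and `collections.Counter`, strips punctuation first and also reports how often each wordtype occurs. The counts are unchanged because no word appears both with and without punctuation.

```python
import re
from collections import Counter

sentence = 'They picnicked by the pool, then lay back on the grass and looked at the stars.'

# Keep only runs of word characters, so 'pool,' becomes 'pool'
tokens = re.findall(r"\w+", sentence)
counts = Counter(tokens)

print("Wordtokens = {}".format(len(tokens)))              # 16
print("Wordtypes = {}".format(len(counts)))               # 14
print("Most common = {}".format(counts.most_common(1)))   # [('the', 3)]
```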


###### Herdan's Law

> The larger the corpora we look at, the more wordtypes we find. The relationship between the number of wordtypes and the number of wordtokens is called <strong>Herdan's Law</strong>

\begin{equation*}
|V| = kN^\beta 
\end{equation*}
where N is the number of wordtokens, and k and \\(\beta\\) are positive constants.

The value of \\(\beta\\) depends on the corpus size and lies between 0 and 1.

* When \\(\beta\\) is greater than 0.5, as is typical for large corpora, the vocabulary size of a text grows significantly faster than the square root of its length in words.
---
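
A short numerical sketch of Herdan's Law. The values of `K` and `BETA` are hypothetical and chosen only for illustration. `estimate_beta` shows how \\(\beta\\) could be recovered from two measurements by taking logarithms of both sides of the equation.

```python
import math

def herdan_vocab_size(n_tokens, k, beta):
    """Predicted vocabulary size |V| = k * N**beta."""
    return k * n_tokens ** beta

def estimate_beta(n1, v1, n2, v2):
    """Estimate beta from two (N, |V|) pairs: the slope of log|V| against log N."""
    return math.log(v2 / v1) / math.log(n2 / n1)

K, BETA = 10.0, 0.7  # hypothetical constants, not measured on any corpus

for n_tokens in (10_000, 1_000_000, 100_000_000):
    predicted = herdan_vocab_size(n_tokens, K, BETA)
    print("N = {:>11,}  |V| = {:>12,.0f}  sqrt(N) = {:>8,.0f}".format(
        n_tokens, predicted, math.sqrt(n_tokens)))
```

With these assumed constants the predicted vocabulary stays well above the square root of N and the gap widens as N grows.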

- Another rough measure of the number of words in a corpus is the number of lemmas.

##### Code switching

> The phenomenon of switching languages within a single spoken or written communication is called code switching.

Example (Hindi and English),

'Tu mera dost hai or rahega, don't worry.' (roughly: 'You are my friend and will remain one, don't worry.')

---

## Text Normalization
---

Before any kind of natural language processing, the text has to be brought to a normal condition or state.

The following three tasks are common to almost every normalization process.

1. Tokenizing (segmenting text into words)
2. Normalizing word formats
3. Segmenting sentences
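
The three steps can be sketched as a toy pipeline. This is a deliberately naive illustration, assuming that sentences end at '.', '!' or '?' followed by whitespace and that lowercasing is the only word normalization needed. `normalize` is a hypothetical helper name.

```python
import re

def normalize(text):
    """Segment text into sentences, tokenize each one, and lowercase the tokens."""
    # Step 3 (segmenting sentences): split after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    normalized = []
    for sentence in sentences:
        # Step 1 (tokenizing): words and single punctuation marks
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        # Step 2 (normalizing word formats): here, just case folding
        normalized.append([token.lower() for token in tokens])
    return normalized

print(normalize("The text is messy. Normalize it first!"))
```

This naive splitter would also break a sentence after a title such as 'Mr.', which is one reason dedicated tokenizers and sentence segmenters are used in practice.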

###  Word tokenization
---

> The task of segmenting text into words.

<p style="color:red">Why you should not use split() for tokenization.</p>

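Python's split() only breaks on whitespace, so punctuation stays glued to words (like 'pool,' and 'stars.' in the output above). Splitting on punctuation as well does not fix this: a name like 'Mr. Randolf' may be broken down as ['Mr', '.', 'Randolf'], and an email like 'hello@internet.com' may be broken down as ['hello', '@', 'internet', '.', 'com'].

This is not what we generally want, so dedicated tokenization algorithms must be used.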

* Commas generally mark word boundaries, but they also appear inside large numbers (540,000).
* Periods generally mark sentence boundaries, but they also appear in emails, URLs, and titles such as 'Mr.'.
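
One way to respect these exceptions is a regular expression whose alternatives are tried in order, most specific first. The pattern below is a minimal sketch covering only the cases mentioned here. The list of titles and the email pattern are simplifying assumptions, not a complete tokenizer.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email addresses
  | \d+(?:,\d{3})*(?:\.\d+)?       # numbers such as 540,000 or 3.14
  | (?:Mr|Mrs|Ms|Dr)\.             # a few titles that end in a period
  | \w+(?:-\w+)*                   # words, optionally hyphenated
  | [^\w\s]                        # any other single punctuation mark
""", re.VERBOSE)

def regex_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, scanning left to right."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Mr. Randolf paid 540,000 to hello@internet.com."))
```

The title, the number, and the email address each survive as a single token, while the final period is split off as the sentence boundary.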

##### Clitic

> A clitic is a part of a word that can't stand on its own; it only occurs attached to another word. A tokenizer can be used to expand clitics (for example, what're into what are).

Examples of words containing clitics,

What're, Who's, You're.
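
Clitic expansion can be sketched as a lookup over already-split tokens. The table below is a hypothetical, hand-written assumption. Note that 's is ambiguous (is, has, or possessive), so mapping "who's" to "who is" is only one possible reading.

```python
# Hypothetical expansion table covering only the examples above
CONTRACTIONS = {
    "what're": "what are",
    "who's": "who is",    # could also mean "who has"
    "you're": "you are",
}

def expand_clitics(tokens):
    """Replace known contractions with their expanded, lowercased words."""
    expanded = []
    for token in tokens:
        replacement = CONTRACTIONS.get(token.lower())
        if replacement is None:
            expanded.append(token)
        else:
            expanded.extend(replacement.split())
    return expanded

print(expand_clitics("You're late and who's there".split()))
```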


- Tokenization algorithms can also treat multiword expressions like 'New York' or 'rock N roll' as single tokens.

This kind of tokenization is used in conjunction with <strong>Named Entity Detection</strong> (the task of detecting names, places, dates, and organizations).
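
A minimal sketch of multiword tokenization, assuming the multiword expressions are already known and stored in a small lexicon. `MWES` and `merge_mwes` are hypothetical names. The function greedily merges the longest matching run of tokens at each position.

```python
# Hypothetical lexicon of multiword expressions, stored as lowercase tuples
MWES = {("new", "york"), ("rock", "n", "roll")}

def merge_mwes(tokens, mwes=MWES, separator="_"):
    """Join known multiword expressions into single tokens, longest match first."""
    longest = max(len(mwe) for mwe in mwes)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens
        for n in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(token.lower() for token in tokens[i:i + n])
            if candidate in mwes:
                merged.append(separator.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword expression starts here: keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_mwes("I love New York and rock N roll".split()))
```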


The Python code below tokenizes a sentence with NLTK.

In [30]:
from nltk.tokenize import word_tokenize

In [31]:
text = 'The San Francisco-based restaurant," they said, "doesn’t charge $10".'

In [33]:
print(word_tokenize(text))

['The', 'San', 'Francisco-based', 'restaurant', ',', "''", 'they', 'said', ',', '``', 'doesn', '’', 't', 'charge', '$', '10', "''", '.']


In [34]:
from nltk.tokenize import wordpunct_tokenize

In [36]:
print(wordpunct_tokenize(text))

['The', 'San', 'Francisco', '-', 'based', 'restaurant', ',"', 'they', 'said', ',', '"', 'doesn', '’', 't', 'charge', '$', '10', '".']


Since tokenization runs before any other language processing, it needs to be fast.

Regex-based tokenization is fast, but it is not very smart about punctuation and language-specific ambiguities.
Many other tokenizers exist, for example ByteLevelBPETokenizer, CharBPETokenizer, and SentencePieceBPETokenizer.

The exercise below is a step-by-step guide to a modern way of tokenizing, using [Hugging Face's](https://huggingface.co/) fast tokenization library, [tokenizers](https://github.com/huggingface/tokenizers).
                             
---

#### Compare the speed of the Hugging Face tokenizer and the NLTK tokenizer

In [87]:
!python3 -m pip install tokenizers #install tokenizer



In [12]:
from tokenizers import BertWordPieceTokenizer
# Assumes bert-base-uncased-vocab.txt has already been downloaded to the working directory
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

In [13]:
from datetime import datetime
def textTokenizer(text):
    start = (datetime.now())
    print(tokenizer.encode(text).tokens)
    end = (datetime.now())
    print("Time taken - {}".format((end-start).total_seconds()))
    
    

In [14]:
textTokenizer('Number expressions introduce other complications as well; while commas nor- mally appear at word boundaries, commas are used inside numbers in English, every three digits.')

['[CLS]', 'number', 'expressions', 'introduce', 'other', 'complications', 'as', 'well', ';', 'while', 'com', '##mas', 'nor', '-', 'mall', '##y', 'appear', 'at', 'word', 'boundaries', ',', 'com', '##mas', 'are', 'used', 'inside', 'numbers', 'in', 'english', ',', 'every', 'three', 'digits', '.', '[SEP]']
Time taken - 0.00062


* [CLS] and [SEP] are special tokens that BERT adds at the start and end of the input. We will discuss them later.
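
In the output above, a piece that starts with '##' continues the preceding piece of the same word ('com' + '##mas'). WordPiece tokenizers split a word by greedy longest-match-first lookup against a vocabulary. The sketch below illustrates that lookup with a tiny hypothetical vocabulary. It is not the library's implementation, and the real BERT vocabulary is far larger.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
TOY_VOCAB = {"com", "##mas", "mall", "##y", "word"}

for word in ["commas", "mally", "word", "xyz"]:
    print("{} -> {}".format(word, wordpiece(word, TOY_VOCAB)))
```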

In [92]:
from datetime import datetime
def nltkTokenizer(text):
    start = (datetime.now())
    print(word_tokenize(text))
    end = (datetime.now())
    print("Time taken - {}".format((end-start).total_seconds()))
    
    

In [93]:
nltkTokenizer('Number expressions introduce other complications as well; while commas nor- mally appear at word boundaries, commas are used inside numbers in English, every three digits.')

['Number', 'expressions', 'introduce', 'other', 'complications', 'as', 'well', ';', 'while', 'commas', 'nor-', 'mally', 'appear', 'at', 'word', 'boundaries', ',', 'commas', 'are', 'used', 'inside', 'numbers', 'in', 'English', ',', 'every', 'three', 'digits', '.']
Time taken - 0.002232
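
Both timings above come from a single run and include the time spent printing, so they are only a rough indication. A more robust measurement repeats the call many times with the standard `timeit` module. The sketch below shows the idea with two stand-in tokenizers (`whitespace_tokenize` and `regex_tokenize` are hypothetical helpers). Either tokenizer from this chapter could be passed in the same way.

```python
import re
import timeit

TEXT = ('Number expressions introduce other complications as well; '
        'commas are used inside numbers in English, every three digits.')

def whitespace_tokenize(text):
    """Stand-in tokenizer: split on whitespace only."""
    return text.split()

def regex_tokenize(text):
    """Stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def best_time(func, text, number=1000, repeat=5):
    """Return the best average time per call, in seconds."""
    runs = timeit.repeat(lambda: func(text), number=number, repeat=repeat)
    return min(runs) / number

for func in (whitespace_tokenize, regex_tokenize):
    print("{} - {:.8f} seconds per call".format(func.__name__, best_time(func, TEXT)))
```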


##### Word segmentation

> Some languages (like Chinese) don't separate words with spaces, so tokenization is not easily done. Word segmentation is therefore done using sequence models trained on hand-segmented datasets.
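
To see why segmentation is hard, consider a much simpler baseline than the sequence models mentioned above: greedy maximum matching against a dictionary. This is a different technique, shown only as an illustration. The dictionary and the space-free English string are hypothetical stand-ins for a real lexicon and real unsegmented text.

```python
def max_match(text, dictionary):
    """Greedily take the longest dictionary word at each position."""
    longest = max(len(word) for word in dictionary)
    words, start = [], 0
    while start < len(text):
        # Try the longest possible candidate first, then shrink it
        for end in range(min(len(text), start + longest), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # no match: emit a single character
        words.append(text[start:end])
        start = end
    return words

# Hypothetical toy dictionary
DICTIONARY = {"the", "table", "down", "there"}

print(max_match("thetabledownthere", DICTIONARY))
```

Because the choice at each position is greedy, one wrong long match can derail the rest of the string, which is one motivation for trained sequence models.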
