## Let's learn a few terms we will be using in this notebook:
- Corpus: The full body of raw text (one or more paragraphs); the NLP counterpart of a dataset in machine learning
- Documents: The individual units the corpus is split into; in this notebook, sentences
- Vocabulary: The set of unique words in the corpus
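
To make the three terms concrete, here is a minimal sketch in plain Python. The toy corpus and the naive split on periods are assumptions made purely for illustration; they are not how the rest of the notebook tokenizes text.

```python
# Toy corpus (hypothetical text, used only to illustrate the terms)
corpus = "The cat sat on the mat. The dog sat on the log."

# Documents: here, one sentence each (naive split on '.', for illustration only)
documents = [s.strip() for s in corpus.split(".") if s.strip()]

# Vocabulary: the set of unique (lower-cased) words across all documents
vocabulary = sorted({word.lower() for doc in documents for word in doc.split()})

print(documents)
print(vocabulary)
```

Lower-casing before building the set is a design choice: without it, "The" and "the" would count as two vocabulary entries.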

## What is Text Normalization
### Note: The following definitions are specific to the English language. They may vary for other languages depending on their writing conventions.
Text normalization is the process of converting raw text data into a standard form. The following are some ways to normalize text:
- Tokenization: Splitting text into smaller units called tokens (usually words); the simplest approach splits on spaces.
- Lemmatization: A method of converting a word to its root word, called the lemma. For example, sing is the lemma of sang, sung and sings.
- Stemming: A simpler, rule-based alternative to lemmatization in which the suffix of a word is stripped. For example, going is stemmed to go by removing 'ing'.

Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like periods or exclamation points. Note that sentence segmentation is a type of tokenization (the tokens are whole sentences), but not vice versa.
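
The punctuation cues mentioned above can be sketched with a single regular expression. This is a deliberately naive splitter written for illustration (the function name `naive_sentence_split` and the example text are assumptions, not part of NLTK); it shows why periods alone are an unreliable cue.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]

easy = "It works. Does it? Yes!"
hard = "Dr. Smith lives in the U.S.A. today. He teaches NLP!"

print(naive_sentence_split(easy))
print(naive_sentence_split(hard))  # abbreviations are wrongly treated as sentence ends
```

The second call breaks after "Dr." and "U.S.A.", which is the kind of ambiguity a trained sentence tokenizer is meant to handle better.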

We will be using the NLTK library for this notebook, so let's install it!
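
A possible setup sketch follows. It assumes NLTK is installed with `pip install nltk` and that the tokenizers and the WordNet lemmatizer need the data packages listed in `NLTK_RESOURCES`; the exact package names can differ between NLTK versions, so treat the list as an assumption. The helper names are hypothetical.

```python
import importlib.util

# Assumed resource names: 'punkt' / 'punkt_tab' for the tokenizers, 'wordnet' for the lemmatizer
NLTK_RESOURCES = ["punkt", "punkt_tab", "wordnet"]

def nltk_is_installed() -> bool:
    """Return True if the nltk package can be imported in this environment."""
    return importlib.util.find_spec("nltk") is not None

def download_resources(resources=NLTK_RESOURCES) -> None:
    """Fetch the NLTK data packages (requires network access)."""
    import nltk
    for name in resources:
        nltk.download(name, quiet=True)

if nltk_is_installed():
    print("NLTK found; call download_resources() once to fetch the data packages.")
else:
    print("NLTK not found; install it first, e.g. with: pip install nltk")
```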

## Text Tokenization 

#### Type 1] Corpus to sentences

In [54]:
from nltk.tokenize import sent_tokenize

In [55]:
corpus = "Sentence segmentation is the process of determining the longer processing units consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences. Since most written languages have punctuation marks which occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection, sentence boundary disambiguation, or sentence boundary recognition. All these terms refer to the same task: determining how a text should be divided into sentences for further processing."
print(corpus)

Sentence segmentation is the process of determining the longer processing units consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences. Since most written languages have punctuation marks which occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection, sentence boundary disambiguation, or sentence boundary recognition. All these terms refer to the same task: determining how a text should be divided into sentences for further processing.


In [56]:
i = 1
for document in sent_tokenize(corpus):
    print(f'{i}]{document}')
    i += 1

1]Sentence segmentation is the process of determining the longer processing units consisting of one or more words.
2]This task involves identifying sentence boundaries between words in different sentences.
3]Since most written languages have punctuation marks which occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection, sentence boundary disambiguation, or sentence boundary recognition.
4]All these terms refer to the same task: determining how a text should be divided into sentences for further processing.


#### Type 2] Paragraph to word 
Observe that punctuation marks such as '.', ',' and ':' are returned as separate tokens, just like words.

In [57]:
from nltk.tokenize import word_tokenize

In [58]:
words = word_tokenize(corpus)
print(words)

['Sentence', 'segmentation', 'is', 'the', 'process', 'of', 'determining', 'the', 'longer', 'processing', 'units', 'consisting', 'of', 'one', 'or', 'more', 'words', '.', 'This', 'task', 'involves', 'identifying', 'sentence', 'boundaries', 'between', 'words', 'in', 'different', 'sentences', '.', 'Since', 'most', 'written', 'languages', 'have', 'punctuation', 'marks', 'which', 'occur', 'at', 'sentence', 'boundaries', ',', 'sentence', 'segmentation', 'is', 'frequently', 'referred', 'to', 'as', 'sentence', 'boundary', 'detection', ',', 'sentence', 'boundary', 'disambiguation', ',', 'or', 'sentence', 'boundary', 'recognition', '.', 'All', 'these', 'terms', 'refer', 'to', 'the', 'same', 'task', ':', 'determining', 'how', 'a', 'text', 'should', 'be', 'divided', 'into', 'sentences', 'for', 'further', 'processing', '.']
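
For contrast, the sketch below compares a plain whitespace split with a simple punctuation-aware regular expression. It uses only the standard library and is not what `word_tokenize` does internally; the pattern is an assumption chosen to mimic the behaviour seen in the output above, where punctuation becomes its own token.

```python
import re

text = "All these terms refer to the same task: sentence segmentation."

# Splitting on whitespace leaves punctuation glued to the neighbouring word
whitespace_tokens = text.split()

# Runs of word characters, or any single non-space, non-word character
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(punct_aware_tokens)
```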


#### Type 3] Sentences to word 

In [59]:
for document in sent_tokenize(corpus):
    print(word_tokenize(document))

['Sentence', 'segmentation', 'is', 'the', 'process', 'of', 'determining', 'the', 'longer', 'processing', 'units', 'consisting', 'of', 'one', 'or', 'more', 'words', '.']
['This', 'task', 'involves', 'identifying', 'sentence', 'boundaries', 'between', 'words', 'in', 'different', 'sentences', '.']
['Since', 'most', 'written', 'languages', 'have', 'punctuation', 'marks', 'which', 'occur', 'at', 'sentence', 'boundaries', ',', 'sentence', 'segmentation', 'is', 'frequently', 'referred', 'to', 'as', 'sentence', 'boundary', 'detection', ',', 'sentence', 'boundary', 'disambiguation', ',', 'or', 'sentence', 'boundary', 'recognition', '.']
['All', 'these', 'terms', 'refer', 'to', 'the', 'same', 'task', ':', 'determining', 'how', 'a', 'text', 'should', 'be', 'divided', 'into', 'sentences', 'for', 'further', 'processing', '.']


#### Type 4] Regex Tokenizer
Build a basic understanding of regular expressions from this URL: https://www.w3schools.com/python/python_regex.asp. <br>These expressions are useful for harder tokenization problems, such as keeping abbreviations like U.S.A. or multi-word names like New York as one token, or handling numerical values such as decimals and currency. Note that the pattern used below keeps U.S.A. and $1.28 intact but still splits New York into two tokens.
<br> Refer to this page for more information: https://www.nltk.org/api/nltk.tokenize.regexp.html

In [60]:
from nltk.tokenize import RegexpTokenizer 

In [61]:
# The regex pattern has been taken from the book Speech and Language Processing (Jurafsky and Martin)
sentence = "New York is located in the U.S.A. that sells the best biscuits for $1.28"
regex_tokenizer = RegexpTokenizer(r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    ''')
regex_tokenizer.tokenize(sentence)

['New',
 'York',
 'is',
 'located',
 'in',
 'the',
 'U.S.A.',
 'that',
 'sells',
 'the',
 'best',
 'biscuits',
 'for',
 '$1.28']
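
In the output above, New York is still split into two tokens. One way to keep it together is to add an explicit alternative for it at the front of the pattern. The sketch below does this with the standard `re` module rather than `RegexpTokenizer`, so it runs without NLTK; the `New[ ]York` alternative is a hypothetical addition for illustration and would not scale to arbitrary multi-word names.

```python
import re

pattern = re.compile(r'''
        New[ ]York          # hypothetical multi-word entry, listed first so it is tried first
      | (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    ''', re.VERBOSE)

sentence = "New York is located in the U.S.A. that sells the best biscuits for $1.28"
tokens = pattern.findall(sentence)
print(tokens)
```

Because verbose mode ignores literal whitespace in the pattern, the space inside the name is written as `[ ]`.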

### Some advanced tokenization topics to be covered as I build my understanding of NLP
<br> There are roughly two classes of tokenization algorithms: <br>**Top-down algorithms (rule-based) and bottom-up algorithms (e.g. Byte-Pair Encoding)**
<br> In a top-down algorithm, we decide how to separate the tokens based on symbols, regular expressions or the words themselves. In Byte-Pair Encoding, by contrast, the tokens are learned automatically from the training data set by repeatedly merging the most frequent pair of adjacent symbols. This approach is heavily used in LLMs.
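
The bottom-up idea can be sketched as a toy Byte-Pair Encoding learner. This is a simplified illustration, not the implementation used by any particular LLM tokenizer: the word frequencies, the `_` end-of-word marker and the function names are all assumptions.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word -> frequency mapping."""
    # Start from single characters plus an end-of-word marker
    vocab = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Hypothetical word frequencies standing in for a training corpus
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(list(vocab))
```

Each learned merge becomes a new token, so frequent character sequences such as common suffixes end up as single units while rare words stay split into smaller pieces.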

## Text Stemming

In [62]:
from nltk.stem import PorterStemmer

In [63]:
words = ["eating", "jumps", "goes", "running", "fairly"]

In [64]:
# Apply stemming to each word
porter_stemmer = PorterStemmer()
stemmed_words = [porter_stemmer.stem(word) for word in words]
print(stemmed_words)

['eat', 'jump', 'goe', 'run', 'fairli']


As you can see from the output above, the words goes and fairly have not been stemmed properly: they become 'goe' and 'fairli', which are not real words. Such changes in the form (and sometimes the meaning) of a word are a major drawback of this technique.

In [65]:
from nltk.stem import RegexpStemmer

In [66]:
regex_stemmer = RegexpStemmer('ing$|s$|e$|able$')
regex_stemmer.stem('works')

'work'
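
The same suffix pattern can be tried with the standard `re` module to see how crude regex stemming is. The sketch below is a stand-in written for illustration, not NLTK's code; the function name `regex_stem` and its `min_length` guard are assumptions.

```python
import re

# Same suffix pattern as the RegexpStemmer cell above
SUFFIXES = re.compile(r"ing$|s$|e$|able$")

def regex_stem(word, min_length=0):
    """Strip a matching suffix unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return SUFFIXES.sub("", word)

for w in ["works", "running", "bus", "able"]:
    print(w, "->", repr(regex_stem(w)))

# A minimum length protects very short words from being stripped
print(regex_stem("bus", min_length=4))
```

Words such as bus and able are over-stemmed because the pattern knows nothing about the word beyond its ending.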

In [67]:
from nltk.stem import SnowballStemmer

In [68]:
snowball_stemmer = SnowballStemmer('english')
stemmed_words = [snowball_stemmer.stem(word) for word in words]
print(stemmed_words)

['eat', 'jump', 'goe', 'run', 'fair']


Observe that even though Snowball performs better than PorterStemmer here (fairly becomes 'fair' instead of 'fairli'), it still does not handle every word properly: goes is still stemmed to 'goe'. Hence these techniques are useful only for simpler language processing tasks. To overcome these disadvantages, we use lemmatization.

## Text Lemmatization
<br> Lemmatization is a more advanced alternative to stemming which finds the root word, called the *"lemma"*, of the given word.
<br> We will use a few basic lemmatizers in this notebook.

### WordNetLemmatizer
<br> It is built on the morphy function of the WordNet library and uses NLTK's WordNetCorpusReader class to search for the lemma. In simple terms, the function looks up the base form of the input word in WordNet and returns the word unchanged if no base form is found.
<br> Reference link for more on morphy: https://wordnet.princeton.edu/documentation/morphy7wn#:~:text=A%20set%20of%20morphology%20functions,found%20in%20the%20WordNet%20database.
<br> Two important parameters should be passed: the word and its part-of-speech (POS) tag. Valid options are “n” for nouns, “v” for verbs, “a” for adjectives, “r” for adverbs and “s” for satellite adjectives. By default, the POS is noun.

In [69]:
from nltk.stem import WordNetLemmatizer

In [70]:
lemmatizer = WordNetLemmatizer()

In [72]:
lemmatizer.lemmatize("going") # since the POS is by default set as 'n', the output is unchanged

'going'

In [74]:
lemmatizer.lemmatize("going",pos='v') # since 'going' is a verb, passing pos='v' now gives the correct lemma

'go'

In [76]:
words = ["eating", "jumps", "goes", "running", "fairly"]
[lemmatizer.lemmatize(w,pos='v') for w in words]

['eat', 'jump', 'go', 'run', 'fairly']
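
Passing `pos='v'` for every word only works when all the words are verbs. A common pattern is to tag each word first and map the tag to one of the POS letters listed earlier. The sketch below shows only the mapping step; it assumes Penn Treebank style tags (the kind a tagger such as `nltk.pos_tag` returns), and the function name and the hand-written tagged examples are hypothetical.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

# Hand-written (word, tag) pairs standing in for real tagger output
tagged = [("eating", "VBG"), ("jumps", "VBZ"), ("fairly", "RB"), ("biscuits", "NNS")]
wordnet_tags = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=letter)` so that every word is lemmatized with its own part of speech.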