# A Quick Tour of Traditional NLP

> Natural language processing (NLP) is the sub-field of computer science that deals with methods to analyze, model, and understand human language.

# What is NLP?

- Natural language processing (evolved from computational linguistics) uses methods from various disciplines, such as computer science, artificial intelligence, linguistics, and data science, to enable computers to understand human language in both written and verbal forms. 

### Difference between NLP and Computational Linguistics

> Natural language processing emphasizes the use of machine learning and deep learning techniques to complete tasks such as language translation or question answering.

> Computational linguistics, by contrast, focuses more on aspects of language itself, such as syntax, semantics, and grammatical structure.

## **NLP vs NLU vs NLG**

- While natural language processing (NLP), natural language understanding (NLU), and natural language generation (NLG) are all related topics, they are distinct ones.

- At a high level, NLU and NLG are just components of NLP.



<img src="NLP_NLU.png" alt="Drawing" style="width: 500px;"/>





##  **Natural Language Understanding** (When you want to understand the meaning of a sentence)


- Natural language understanding is a subset of natural language processing, which uses syntactic and semantic analysis of text and speech to determine the meaning of a sentence.


Example 1:

- Alice is swimming against the current.

- The current version of the report is in the folder.


Example 2:

- I will give you a ring tomorrow.

- The ring is in the folder.


Example 3:


- The profits increase by 10%.

- The pains increase day by day.   

##  **Natural Language Generation** (When computers write language)


- While natural language understanding focuses on computer reading comprehension, natural language generation enables computers to write. 

- NLG is the process of producing a human language text response based on some data input. This text can also be converted into a speech format through text-to-speech services.

- NLG tasks include:

  - generating text,

  - generating speech,

  - and generating images and videos.

![](npl.png)

# Approaches to NLP

- Heuristics-based NLP (Rule-based NLP)

- Machine Learning NLP

- Deep Learning for NLP

## Heuristics-based NLP Examples:

- Dictionary-based sentiment analysis (Lexicon-based SA)

- WordNet for lexical relations

- Regular Expressions

- Context-free grammar
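As a rough illustration of the first item, dictionary-based sentiment analysis can be sketched with a word-to-score lookup and a simple negation rule. The lexicon, the negator list, and the function name `lexicon_sentiment` are all made up for this example; a real system would use a curated sentiment lexicon.

```python
import re

# Hypothetical toy lexicon: word -> polarity score (not a published resource).
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
# Hypothetical list of negation words.
NEGATORS = {"not", "never", "no"}


def lexicon_sentiment(text):
    """Sum lexicon scores over the tokens, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
        negate = False  # a negator only affects the token right after it
    return score


print(lexicon_sentiment("The food was great"))        # 2
print(lexicon_sentiment("The service was not good"))  # -1
```

The rules are easy to read and to adjust by hand, which is the appeal of heuristics, but every new word or negation pattern has to be added manually.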

### Strengths:

- Rules based on domain-specific knowledge can efficiently reduce mistakes that would otherwise be very costly.

### Disadvantages:

- Manual curation of features



## Machine Learning for NLP


### Common methods for machine learning:

- Naive Bayes
  
- Logistic Regression

- Support Vector Machine

- Hidden Markov Model

- Conditional Random Field


### Three common steps for machine learning

- Extracting features from texts

- Using the feature representation to learn a model

- Evaluating and improving the model
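The three steps can be sketched with a small Naive Bayes classifier written from scratch. This is a minimal illustration under stated assumptions: the training and test sentences are invented, and the helper names (`extract_features`, `train_naive_bayes`, `predict`) are not from any library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy dataset of (text, label) pairs.
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot terrible acting", "neg"),
]
test = [("great plot", "pos"), ("boring movie", "neg")]


def extract_features(text):
    """Step 1: represent a text as bag-of-words counts."""
    return Counter(text.lower().split())


def train_naive_bayes(data):
    """Step 2: learn class counts and per-class word counts from the features."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        feats = extract_features(text)
        class_counts[label] += 1
        word_counts[label].update(feats)
        vocab.update(feats)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word, count in extract_features(text).items():
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                score += count * math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(train)

# Step 3: evaluate the model on held-out examples.
accuracy = sum(predict(model, text) == label for text, label in test) / len(test)
print(accuracy)
```

With only a handful of sentences the accuracy figure means little; the point is the shape of the workflow, where improving the model usually means going back to step 1 or 2.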


## Deep Learning for NLP

Convolutional Neural Network (CNN)
  
Sequence Models
  
  - Recurrent Neural Network (RNN)
  
  - Long Short-Term Memory (LSTM)


Strengths of Sequence Models

- They reflect the fact that a sentence flows in one direction, from beginning to end.

- The model can progressively read an input text from one end to the other.

- The model has neural units capable of remembering what it has processed so far.
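The idea of reading step by step while carrying a memory can be sketched with a scalar recurrence, h_t = tanh(w_x * x_t + w_h * h_{t-1}). This is a deliberately tiny stand-in for a real RNN: the weights are arbitrary illustrative values rather than trained parameters, and `rnn_read` is a hypothetical name.

```python
import math


def rnn_read(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Read a sequence left to right, carrying a hidden state forward.

    The weights are arbitrary illustrative values, not trained ones.
    """
    h = h0
    states = []
    for x in inputs:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


# Hypothetical numeric encoding of a three-token input: a signal, then silence.
states = rnn_read([1.0, 0.0, 0.0])
print(states)
```

The second and third states stay above zero even though their inputs are zero, which is the "memory" of the first token fading over time.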

Transfer Learning

- It is a technique in AI where the knowledge gained while solving one problem is applied to a different but related problem.

- We can use unsupervised methods to train a transformer-based model for predicting a part of a sentence given the rest of the content.

- This model can encode high-level nuances of the language, which can be applied to other relevant downstream tasks.
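One way to picture the "predict a part of a sentence given the rest" objective is to see how unlabeled text turns into training pairs. The sketch below is an assumption-laden simplification: it masks every word in turn and splits on whitespace, whereas real pretraining setups typically mask a random subset of subword tokens. The `[MASK]` placeholder and the function name are illustrative.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked input, target word) pairs."""
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


examples = make_masked_examples("Alice is swimming")
for masked, target in examples:
    print(masked, "->", target)
```

No human annotation is needed: the text itself supplies both the input and the label, which is why large corpora can be used for this kind of training.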


Transformers

- The state-of-the-art model in major NLP tasks
  
- It models the textual context in a non-sequential manner.

- Given a word in the input, the model looks at all the words around it and represents each word with respect to its context. This is referred to as self-attention.
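A minimal sketch of self-attention, assuming scaled dot-product scoring and leaving out the learned query, key, and value projections that a real transformer layer would apply. The two-dimensional embeddings are invented for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Each output is a weighted average of all input vectors, where the
    weights come from comparing one word's vector with every other one.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a three-word input.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Every word attends to every other word in a single step, regardless of distance, which is what "non-sequential" means here.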

# Corpora, Tokens, and Types

All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).

> A corpus is a representative sample of actual language production within a meaningful context and with a general purpose. 


> A dataset is a representative sample of a specific linguistic phenomenon in a restricted context and with annotations that relate to a specific research question.

Both terms are used interchangeably in the NLP literature.

<img src="corpus_dataset.png" alt="Drawing" style="width: 500px;"/>


The metadata could be any auxiliary piece of information associated with the text, such as identifiers, labels, and timestamps. In machine learning parlance, the text along with its metadata is called an instance or data point.

- We freely interchange the terms corpus and dataset throughout these notes.

<img src="corpus.png" alt="Drawing" style="width: 500px;"/>
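To make the vocabulary concrete, here is a small sketch of a corpus as a list of instances, each holding a text and its metadata. It also counts tokens and types, where types are the unique tokens in the corpus. The `Instance` class, its field names, and the two sentences are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One data point: a text plus its metadata (field names are illustrative)."""
    text: str
    metadata: dict = field(default_factory=dict)


# Hypothetical two-instance corpus.
corpus = [
    Instance("the cat sat on the mat", {"id": 1, "label": "animal"}),
    Instance("the dog barked", {"id": 2, "label": "animal"}),
]

# Tokens are every word occurrence; types are the distinct words.
tokens = [tok for inst in corpus for tok in inst.text.split()]
types = set(tokens)

print(len(tokens), len(types))         # 9 7
print(Counter(tokens).most_common(1))  # [('the', 3)]
```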




# A General NLP Pipeline

Variations of the NLP Pipeline

- The process may not always be linear.
  
- There are loops in between.

- These procedures may depend on the specific task at hand.



![](./nlp_pipe_line.png)

## Data Collection

- Ideal Setting: We have everything needed.

- Labels and Annotations

- Very often we are dealing with a less-than-ideal scenario (scraping the data, using public datasets)

- Initial datasets with limited annotations/labels (one solution: data augmentation)

**Data augmentation:** a technique that exploits language properties to create texts that are syntactically similar to the source text data.

Types of strategies:

 - Synonym replacement
 
 - Related word replacement (based on association metrics)
 
 - Back translation
 
 - Replacing entities
 
 - Adding noise to data (e.g. spelling errors, random words) 
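Two of these strategies, synonym replacement and spelling noise, can be sketched with the standard library alone. The synonym table is hand-written and hypothetical (a real system might draw on a resource such as WordNet), and the function names and the noise rate are assumptions.

```python
import random

# Hypothetical hand-written synonym table.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}


def synonym_replace(tokens, rng):
    """Replace every token that has an entry in the table with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def add_spelling_noise(tokens, rng, rate=0.3):
    """Swap two adjacent characters in a random fraction of the longer tokens."""
    noisy = []
    for t in tokens:
        if len(t) > 3 and rng.random() < rate:
            i = rng.randrange(len(t) - 1)
            t = t[:i] + t[i + 1] + t[i] + t[i + 2:]
        noisy.append(t)
    return noisy


rng = random.Random(0)  # fixed seed so runs are repeatable
source = "the quick dog is happy".split()
augmented = synonym_replace(source, rng)
noisy = add_spelling_noise(source, rng)
print(" ".join(augmented))
print(" ".join(noisy))
```

Each augmented sentence keeps the label of its source sentence, so a small annotated dataset can be stretched into a larger one.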



## Data Cleaning

- Relevant vs. irrelevant information

- Non-textual information

- Markup

- Metadata


In [73]:
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)
text2 = text.encode('utf-8') # encode the strings in bytes
print(text2)

I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'


In [74]:
import unicodedata
# Decompose accented characters (NFKD), then drop everything that is not ASCII
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

'I feel really . GOGOGO!!    ADG'
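Markup, another item in the cleaning list, can be handled in a similar spirit. The sketch below strips HTML tags with a regular expression and decodes entities; it is a rough heuristic for simple snippets, and a proper HTML parser would be the safer choice for real web pages. The function name `strip_markup` and the sample string are assumptions.

```python
import html
import re


def strip_markup(raw):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)   # replace each tag with a space
    decoded = html.unescape(no_tags)          # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", decoded).strip()


raw = "<p>Fish &amp; chips<br/>are <b>great</b></p>"
print(strip_markup(raw))  # Fish & chips are great
```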

## Data preprocessing

Preliminaries

- Sentence segmentation

- Word tokenization

> The process of breaking a text down into tokens is called tokenization. For example, there are six tokens in the Esperanto sentence “Maria frapis la verda sorĉistino.” 


<img src="tokenization.png" alt="Drawing" style="width: 700px;"/>

<img src="tokenization_general.png" alt="Drawing" style="width: 700px;"/>


![](./tokenization_example.png)


In [30]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))


Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']



### Frequent preprocessing

- Stopword removal

- Stemming and/or lemmatization

- Digits/Punctuation removal

- Case normalization


#### Removing stopwords, punctuation, digits



In [42]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import nltk
nltk.download('stopwords')  # fetch the stopword lists on first use

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuations, digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)


['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.
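The membership test `w not in eng_stopwords` above is case-sensitive, and tokens such as `Mr.` and `CA.` keep their trailing periods. A possible refinement, sketched here with the standard library only, lowercases each token and strips punctuation from its edges before filtering. The mini stop list and the function name `normalize_tokens` are hypothetical stand-ins, not NLTK resources.

```python
import re

# Hypothetical mini stop list standing in for a full stopword resource.
STOPWORDS = {"the", "at", "is", "a", "in"}


def normalize_tokens(tokens):
    """Lowercase, strip punctuation at token edges, drop stopwords and digits."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"^\W+|\W+$", "", tok.lower())
        if tok and tok not in STOPWORDS and not tok.isdigit():
            cleaned.append(tok)
    return cleaned


tokens = ["The", "office", "is", "at", "245", "Goleta", "Avenue", ",", "CA."]
print(normalize_tokens(tokens))  # ['office', 'goleta', 'avenue', 'ca']
```

Whether lowercasing is desirable depends on the task: it helps stopword matching but discards capitalization cues that named-entity recognition may need.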



## Lemmas and Stems

Lemmas are root forms of words. Consider the verb fly. It can be inflected into many different words (flew, flies, flown, flying, and so on), and **fly** is the lemma for all of these seemingly different words.

#### Stemming Algorithm

- Stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.

- This indiscriminate cutting succeeds on some occasions but not always, which is the main limitation of the approach.

#### Lemmatization Algorithm

- Lemmatization algorithms, on the other hand, take the morphological analysis of the words into consideration.

- To do so, they need detailed dictionaries that the algorithm can look through to link an inflected form back to its lemma.
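The contrast between the two approaches can be sketched with toy versions of each. Neither function below is a real algorithm: the suffix list is a tiny illustrative rule set (far simpler than Porter's), and the lemma dictionary is a hypothetical four-entry lookup table.

```python
# Illustrative suffix rules, checked in order; far from a complete rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini dictionary mapping inflected forms to their lemma.
LEMMA_DICT = {"flies": "fly", "flew": "fly", "flown": "fly", "flying": "fly"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ["flying", "flies", "flew"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer gets `flying` right, turns `flies` into the non-word `fli`, and cannot touch the irregular `flew`, while the dictionary lookup maps all three to `fly` at the cost of needing an entry for every form.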


Sometimes, it might be useful to reduce the tokens to their lemmas to keep the dimensionality of the vector representation low.

![](./lemma.png)


> Stemming is the poor man’s lemmatization. It involves the use of handcrafted rules to strip endings of words to reduce them to a common form called stems.

Popular stemmers often implemented in open source packages include the Porter and Snowball stemmers.

###  Stemming



In [59]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])


['car', 'revolut', 'better']


### Lemmatization

In [60]:
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## Wordnet requires POS of words
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))

car
revolution
good


### Important Reminders for Preprocessing

- Not all steps are necessary

- These steps are NOT sequential

- These steps are task-dependent

Goals

- Text Normalization

- Text Tokenization

- Text Enrichment/Annotation


# Tokenization

> The process of breaking a text down into tokens is called tokenization. For example, there are six tokens in the Esperanto sentence “Maria frapis la verda sorĉistino.” 


In [29]:
# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()

# Creating a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)
tokens = tokenizer("Let's go to N.Y.")
print("Blank tokenizer",end=" : ")
for token in tokens:
    print(token,end=', ')
 
# Construction 2
from spacy.lang.en import English
nlp = English()

# Creating a Tokenizer with the default settings for English
tokenizer = nlp.tokenizer
tokens = tokenizer("Let's go to N.Y.")
print("\nDefault tokenizer",end=' : ')
for token in tokens:
    print(token,end=', ')

Blank tokenizer : Let's, go, to, N.Y., 
Default tokenizer : Let, 's, go, to, N.Y., 

- Tokenization can become more complicated than simply splitting text based on non-alphanumeric characters.

- For agglutinative languages like Turkish, splitting on whitespace and punctuation might not be sufficient and more specialized techniques might be needed (chap 5 and 6).

- It may be possible to entirely circumvent the issue of tokenization in some neural network models by representing text as a stream of bytes; this becomes very important for agglutinative languages. 
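Representing text as a stream of bytes needs no tokenizer at all, as this short sketch suggests. It assumes UTF-8 encoding and reuses the Esperanto sentence from earlier; how a model would consume these integers is outside the scope of the example.

```python
text = "Maria frapis la verda sorĉistino."

# Each UTF-8 byte becomes an integer in the range 0-255.
byte_ids = list(text.encode("utf-8"))

print(len(text), len(byte_ids))  # the non-ASCII letter takes two bytes
print(byte_ids[:5])              # [77, 97, 114, 105, 97]

# The mapping is lossless: the original string can be recovered exactly.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)
```

The vocabulary is fixed at 256 symbols for any language, at the price of longer input sequences.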

![](chapters/Chapter_2/aglunative_hungarian.png)

# Feature Engineering

- It refers to the process of turning the extracted and preprocessed texts into input for a machine-learning algorithm.

- It aims at capturing the characteristics of the text into a numeric vector that can be understood by the ML algorithms. 
  
- In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.


## Feature Engineering for Classical ML

- Word-based frequency lists

- Bag-of-words representations

- Domain-specific word frequency lists

- Handcrafted features based on domain-specific knowledge
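A bag-of-words representation can be sketched in a few lines: build a vocabulary from the corpus, then describe each document by how often each vocabulary word occurs in it. The two documents and the helper name `bow_vector` are made up for the example.

```python
from collections import Counter

# Hypothetical two-document corpus.
docs = ["the cat sat", "the dog sat on the mat"]

# The vocabulary fixes the meaning of each vector position.
vocab = sorted({tok for doc in docs for tok in doc.split()})


def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]


print(vocab)                # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for doc in docs:
    print(bow_vector(doc))  # [1, 0, 0, 0, 1, 1] then [0, 1, 1, 1, 1, 2]
```

Word order is discarded, which is the "bag" in the name; n-grams are one way to recover some of it.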

### Unigrams, Bigrams, Trigrams, ..., N-grams

- N-grams are fixed-length (n) consecutive token sequences occurring in the text. A unigram has one token, a bigram two, and a trigram three.

- Generating n-grams from a text is straightforward.

In [56]:
import nltk
from nltk.util import ngrams

def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]


My_text = 'Jack is very good in mathematics but he is not that much good in science'


In [58]:
print("1-gram ", extract_ngrams(My_text, 1), '\n')
print("2-gram ", extract_ngrams(My_text, 2), '\n')
print("3-gram: ", extract_ngrams(My_text, 3), '\n')
print("4-gram: ", extract_ngrams(My_text, 4), '\n')

1-gram  ['Jack', 'is', 'very', 'good', 'in', 'mathematics', 'but', 'he', 'is', 'not', 'that', 'much', 'good', 'in', 'science'] 

2-gram  ['Jack is', 'is very', 'very good', 'good in', 'in mathematics', 'mathematics but', 'but he', 'he is', 'is not', 'not that', 'that much', 'much good', 'good in', 'in science'] 

3-gram:  ['Jack is very', 'is very good', 'very good in', 'good in mathematics', 'in mathematics but', 'mathematics but he', 'but he is', 'he is not', 'is not that', 'not that much', 'that much good', 'much good in', 'good in science'] 

4-gram:  ['Jack is very good', 'is very good in', 'very good in mathematics', 'good in mathematics but', 'in mathematics but he', 'mathematics but he is', 'but he is not', 'he is not that', 'is not that much', 'not that much good', 'that much good in', 'much good in science'] 



### POS Tagging

In [65]:
from nltk import pos_tag
text ="learn php from guru99 and make study easy".split()
print("After Split:",text)
tokens_tag = pos_tag(text)
print("After Token:",tokens_tag)


After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]


## Feature Engineering for DL

- DL directly takes the texts as inputs to the model.

- The DL model is capable of learning features from the texts (e.g., embeddings)

-  The price is that the model is often less interpretable.
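What "learning features" starts from can be sketched as an embedding lookup: each token is mapped to an index, and each index to a vector of numbers. In this illustration the vectors are randomly initialised and never trained; in an actual model, training would adjust them. The vocabulary, the `<unk>` placeholder, and the dimension are assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the table is repeatable

# Hypothetical vocabulary; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
EMBED_DIM = 4

# One randomly initialised vector per vocabulary entry.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def embed(sentence):
    """Map each token to its index, then to its embedding vector."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]
    return [embedding_table[i] for i in ids]


vectors = embed("The cat sat down")
print(len(vectors), len(vectors[0]))  # 4 4
```

Unlike a bag-of-words count, the individual numbers in such a vector have no direct reading, which is one source of the reduced interpretability mentioned above.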


# Modeling

## From Simple to Complex

Start with heuristics or rules

Experiment with different ML models

- From heuristics to features

- From manual annotation to automatic extraction

Find the optimal model

- Ensemble and stacking
  
- Redo feature engineering
  
- Transfer learning
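As one simple form of ensembling, the predictions of several models can be combined by majority vote. The sketch assumes each model has already produced one label per instance; the three prediction lists are invented, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists by majority vote for each instance."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


# Hypothetical predictions from three models on the same three instances.
model_a = ["pos", "neg", "pos"]
model_b = ["pos", "pos", "neg"]
model_c = ["neg", "neg", "pos"]

combined = majority_vote([model_a, model_b, model_c])
print(combined)  # ['pos', 'neg', 'pos']
```

With an odd number of models and two labels there are no ties; other setups would need an explicit tie-breaking rule.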


# Resources

1. NLTK Tutorials: https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/

![](chapters/Chapter_2/aglunativelanguage.jpeg)