# Introduction to Natural Language Processing (NLP)


_Summarized by QH_  
_First version: 2023-07-16_  
_Last updated on: 2023-07-23_  

## What is NLP?
NLP is a field that uses statistics and computation to process and understand human language. Typical tasks include identifying the topic of a text and classifying text.

We can use NLP in the following applications:
* Chatbots
* Translation
* Sentiment analysis


## Preparation for NLP

### Data Preprocessing
#### Tokenization
Tokens are smaller chunks of text taken from a string or document, such as words or sentences. Working with tokens makes the text easier to analyze (e.g., by excluding unwanted or uninformative chunks).

The process of breaking text into tokens is called _tokenization_.
Regular expressions can help with this task, e.g.:
* Breaking out words or sentences
* Separating punctuation
* Separating all hashtags in a tweet
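
The three regular-expression tasks listed above can be sketched with the standard `re` module. This is a minimal illustration, not a robust tokenizer; the sample text and the patterns are assumptions chosen for the example.

```python
import re

# Hypothetical sample text for illustration.
text = "Wow!! NLP is fun, right? #nlp #python"

# Words: runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# Punctuation: every character that is neither a word character nor whitespace.
punctuation = re.findall(r"[^\w\s]", text)

# Hashtags: a '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", text)

print(words)
print(punctuation)
print(hashtags)
```

Note that the simple word pattern also picks up the text of each hashtag (`nlp`, `python`) without its `#`.
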
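#### Stemming and Lemmatization
* Lemmatization: reducing a word to its root (dictionary) form, e.g., "studies" becomes "study".
* Stemming: a simpler, rule-based version of lemmatization that strips suffixes from the end of the word; the result is not always a real word.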
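
To make the difference concrete, here is a toy sketch in plain Python. The suffix list and the lemma lookup table are hypothetical and far smaller than what a real stemmer or lemmatizer uses; they only illustrate the idea of rule-based stripping versus dictionary lookup.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"studies": "study", "studying": "study", "mice": "mouse"}


def naive_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


words = ["studies", "studying", "jumped", "mice"]
stems = [naive_stem(w) for w in words]
lemmas = [naive_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

The stem of "studies" comes out as "stud", which is not a word, while the lookup returns "study". Conversely, the tiny lookup table does not know "jumped", which the suffix rule handles.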

#### Sentence segmentation
Sentence segmentation: breaking up a text into individual sentences, using cues such as periods or exclamation points.
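
A minimal sketch of punctuation-based segmentation, assuming that every `.`, `!` or `?` followed by whitespace ends a sentence. The helper name `split_sentences` is made up for this example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split at whitespace that directly follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sentences = split_sentences("NLP is fun! Is it hard? Not really. Try it.")
print(sentences)
```

This rule is too simple for real text: an abbreviation such as "Dr. Smith" would be split in the middle, which is why dedicated sentence tokenizers exist.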

#### Stop word removal
Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction.
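
Stop word removal is essentially a filter over the tokens. The sketch below assumes a tiny hand-written stop word set (hypothetical; real lists contain a few hundred words) and case-insensitive matching.

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"and", "the", "him", "a", "is", "to"}


def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["The", "cat", "and", "the", "dog", "ran", "to", "him"]
filtered = remove_stop_words(tokens)
print(filtered)
```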

Several Python packages help with tokenization:
* `re`
* `tokenize` from `nltk`
    * `word_tokenize`: tokenize a document into words.
    * `sent_tokenize`: tokenize a document into sentences.
    * `regexp_tokenize`: tokenize a string or document based on a regular expression pattern
    * `TweetTokenizer`: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points
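
A short usage sketch for two of the `nltk` helpers listed above, assuming `nltk` is installed. The sample tweet is made up. `word_tokenize` and `sent_tokenize` are left out here because they typically need NLTK's tokenizer data to be downloaded first.

```python
from nltk.tokenize import TweetTokenizer, regexp_tokenize

# Hypothetical sample tweet.
tweet = "Learning #nlp and #python with @nltk_org is fun!!!"

# Pattern-based tokenization: keep only the hashtags.
hashtags = regexp_tokenize(tweet, r"#\w+")

# Tweet-aware tokenization: hashtags and mentions stay in one piece.
tweet_tokens = TweetTokenizer().tokenize(tweet)

print(hashtags)
print(tweet_tokens)
```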

### Feature extraction

#### Bag-of-words (BOW)
When doing text analysis, strings need to be transformed into numeric values to be fed to the algorithms. The "bag-of-words" (or "bag of n-grams") representation is a strategy for transforming a text document into numeric features: it builds a vocabulary of the words in the corpus and records a measure of each word's presence in every document.

The process has three steps:
* __tokenizing__ strings and giving an integer id to each possible token (for instance, using whitespace and punctuation as token separators).
* __counting__ the occurrences of tokens in each document.
* __normalizing__ and weighting with diminishing importance the tokens that occur in the majority of samples/documents.

As a result of this process (we call it __vectorization__):
* each __individual token occurrence frequency__ (normalized or not) is treated as a __feature__
* the vector of all the token frequencies for a given __document__ is considered a multivariate __sample__.
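
The tokenizing and counting steps can be sketched by hand in a few lines. This is a simplified illustration with a made-up two-document corpus; it skips the normalizing step and is not how any particular library implements vectorization.

```python
import re
from collections import Counter

# Hypothetical toy corpus.
corpus = ["the cat sat", "the cat saw the dog"]

# Step 1: tokenize each document into lowercase words.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]

# Give every distinct token an integer id (alphabetical order).
all_tokens = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}


def vectorize(tokens):
    """Step 2: count occurrences, one feature per vocabulary entry."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocabulary]


# One row (sample) per document, one column (feature) per token.
matrix = [vectorize(doc) for doc in tokenized]

print(vocabulary)
print(matrix)
```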

##### Sparsity
As most documents will typically use a very small subset of the words used in the corpus, most feature values in the resulting matrix will be zeros (typically more than 99% of them).

To store such a matrix in memory and to speed up matrix/vector algebraic operations, implementations typically use a sparse representation.
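
The idea behind a sparse representation can be sketched in plain Python: store only the non-zero entries and skip the zeros during arithmetic. The dictionary-per-row layout and the helper `sparse_dot` are assumptions made for this example; real libraries use more compact formats.

```python
# Hypothetical small count matrix with mostly zero entries.
dense = [
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [2, 0, 0, 0, 0, 0],
]

# Keep only the non-zero entries of each row as {column index: value}.
sparse_rows = [{j: v for j, v in enumerate(row) if v != 0} for row in dense]

# Fraction of entries that are zero.
total = sum(len(row) for row in dense)
nonzero = sum(len(row) for row in sparse_rows)
sparsity = 1 - nonzero / total


def sparse_dot(row, vector):
    """Dot product that only touches the stored (non-zero) entries."""
    return sum(value * vector[j] for j, value in row.items())


vector = [1, 2, 3, 4, 5, 6]
products = [sparse_dot(row, vector) for row in sparse_rows]

print(sparse_rows)
print(round(sparsity, 3))
print(products)
```
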
##### n-grams
Bag-of-words counts ignore word order. The two sentences below contain exactly the same words with the same counts, yet their meanings differ, which shows that context matters:
* "I am happy, not sad."
* "I am sad, not happy."

n-grams capture some of this context by grouping adjacent tokens:
1. Unigrams: single tokens
2. Bigrams: pairs of tokens
3. Trigrams: triples of tokens
4. n-grams: sequences of n tokens

"The weather today is wonderful."
* Unigrams: {The, weather, today, is, wonderful}
* Bigrams: {The weather, weather today, today is, is wonderful}
* Trigrams: {The weather today, weather today is, today is wonderful}
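
The sets above can be reproduced with a sliding window over the token list. The function name `ngrams` is chosen for this sketch, and tokenization is simplified to splitting on spaces.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```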

```python
# CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# display() is built into Jupyter; this import makes it available elsewhere
from IPython.display import display

# initialize the vectorizer (unigrams only)
vectorizer = CountVectorizer()
# unigrams and bigrams
bigram_vect = CountVectorizer(ngram_range=(1, 2))
# create the corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'This is the last document.'
]
print("Unigram:\n")
X = vectorizer.fit_transform(corpus)
display(X)
# get a glimpse of X as a dense array
display(X.toarray())
# get the token names (one per column of X)
token_name = vectorizer.get_feature_names_out()
display(token_name)

print('\nBi-grams:\n')
Y = bigram_vect.fit_transform(corpus)
display(Y)
display(bigram_vect.get_feature_names_out())
```

Output:

```
Unigram:

<5x10 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]])

array(['and', 'document', 'first', 'is', 'last', 'one', 'second', 'the',
       'third', 'this'], dtype=object)

Bi-grams:

<5x24 sparse matrix of type '<class 'numpy.int64'>'
	with 44 stored elements in Compressed Sparse Row format>

array(['and', 'and the', 'document', 'first', 'first document', 'is',
       'is the', 'is this', 'last', 'last document', 'one', 'second',
       'second document', 'second second', 'the', 'the first', 'the last',
       'the second', 'the third', 'third', 'third one', 'this', 'this is',
       'this the'], dtype=object)
```

#### Term Frequency - Inverse Document Frequency (TF-IDF)
In a large text corpus, some words are used very often (e.g., "the", "a", "is") and therefore carry very little meaningful information. If we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms. TF-IDF is a transformation that addresses this: it re-weights each count so that terms occurring in many documents receive less weight.
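
A minimal sketch of the re-weighting, assuming one common textbook variant: the weight of term t in document d is tf(t, d) * log(N / df(t)), where tf is the term's relative frequency in d, N is the number of documents and df(t) is the number of documents containing t. Libraries may use smoothed variants of this formula, so their numbers can differ. The toy corpus is made up.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in corpus for term in set(doc))


def tf_idf(doc):
    """Weight each term by relative frequency times log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = [tf_idf(doc) for doc in corpus]
print(weights[0])
```

In the first document, "the" appears in every document and gets weight 0, while "sat" (one document) outweighs "cat" (two documents).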

#### Word2Vec
tba.
#### GloVe
tba.