# Natural Language Processing

<div class="slide-title">

# Natural Language Processing
    
</div>

Notes: What is natural language (NL)? What is the ultimate goal? Natural human-to-computer communication. NLP is a sub-field of Artificial Intelligence, but very interdisciplinary: computer science, human-computer interaction (HCI), linguistics, cognitive psychology, speech signal processing (EE), ...

<center>
    <img src="../images/nlp/img_p0_1.png" width=1000>
</center>


## Why care about text?

<div class="group">
    <div class="text">
        
* Chat bots
* Spell checking
* Speech recognition
* Sentiment analysis
* Book recommender
* Translators
* ...
    </div>
    <div class="images">
        <img src="../images/nlp/img_p1_1.png">
    </div>
</div>


## Working with text data

* Algorithms work well with numbers
* working with text = meaningfully transforming your data into numbers
* meaningful = depends on your application


Notes: How is language text different from numerical/categorical data?  
* It's a string, so mathematical operations on it are not meaningful. ('tree' > 'blade of grass' only compares characters, not meaning.)  
* It can be ambiguous. (A good life depends on a liver. – Liver may be an organ or simply a living person.)  
* It can come in a wide range of formats / files. (oral, UTF-8, handwritten, Farsi, ...)  
* Has a wide granularity level. (file, paragraph, sentence, word, character)  
* It is unstructured. (no mandatory constraint in word order)  
* ...

## Converting text into numbers

* this is also called **text preprocessing**



Notes: Text preprocessing also includes cleaning of the raw text before converting into numbers

### Text processing → text to numbers

Local representations
* Encoding with a unique number
* Statistical Encodings

Distributed Representations
* Word Embeddings


Notes:  
The term local refers to the uniqueness of a single word in vector space. Words and vectors have a 1-to-1 relationship (e.g. bag of words).  
In distributed representations a word can be represented by several vectors and a vector can stand for different (but semantically similar) words (or phrases; in general: entities), i.e. a many-to-many relationship.

### Text processing → text to numbers

<div class="group">
    <div class="text">
        
**Encoding with a unique number**

        
Easy to create, but the numbers carry no relational information
- the relationship between words is not captured
- models cannot interpret these representations well
        
    </div>
    <div class="images">
        <img src="../images/nlp/img_p6_1.png">
    </div>
</div>

Notes:  Drawback of using unique numbers: words with similar meanings can have completely different numbers.
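
As a minimal sketch of this encoding (the sentence and the variable names are made up for illustration), each distinct word can be mapped to an integer id. The ids here are assigned alphabetically, so they say nothing about meaning: "cold" and "cream" get neighbouring numbers although they are unrelated.

```python
# Toy sentence, chosen only for illustration
sentence = "ice cream is cold and ice is cold"
tokens = sentence.split()

# Assign one integer id per distinct word (alphabetical order)
vocabulary = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}

# Replace every token by its id
encoded = [word_to_id[token] for token in tokens]

print(word_to_id)
print(encoded)
```

The same word always gets the same number, but the distance between two numbers carries no information about how related the words are.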


### Text processing → text to numbers

<div class="group">
    <div class="text">
        
**Statistical Encodings**

Creating vectors of the size of the vocabulary
- leads to a large, sparse feature space
- not very efficient
        
    </div>
    <div class="images">
        <img src="../images/nlp/img_p7_1.png">
    </div>
</div>
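
A small sketch of what "vectors of the size of the vocabulary" means, assuming a hypothetical five-word vocabulary. A single word becomes a one-hot vector, and a document becomes the sum of the one-hot vectors of its words (a bag-of-words count vector). The helper names `one_hot` and `bag_of_words` are illustrative, not part of any library.

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have many thousands of entries
vocabulary = ["and", "cold", "cream", "ice", "is"]
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector with a single 1 at the position of the word."""
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[word_to_id[word]] = 1
    return vec


def bag_of_words(tokens):
    """Sum the one-hot vectors of all tokens to get per-word counts."""
    return np.sum([one_hot(token) for token in tokens], axis=0)


ice_vec = one_hot("ice")
doc_vec = bag_of_words("ice cream is cold".split())

print(ice_vec)
print(doc_vec)
```

With a realistic vocabulary almost all entries of such vectors are zero, which is the sparsity mentioned on the slide.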

### Text processing → text to numbers

<div class="group">
    <div class="text">
        
**Word Embeddings**

embedding = new latent space    
* properties and relationships between items are preserved
* fewer dimensions
* less sparsity
        
    </div>
    <div class="images">
        <img src="../images/nlp/img_p8_1.png">
    </div>
</div>

Notes: "ice" and "cream" are closer to each other than to "Mia".  
No further remarks on "encoding with unique number" here.

## Statistical Encodings

## Text Preprocessing

* Tokenization
* CountVectorizer
* TF-IDF
* N-grams
* Normalization
* Stemming
* Lemmatization

### Tokenization

<center>
    <img src="../images/nlp/img_p11_1.png">
</center>


Notes: A token can be a word, a subword or even a character.  
The term "corpus" (collection of documents) is the equivalent of a data set.

In [None]:
import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("punkt_tab")

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

Notes:  NLTK is the Natural Language Toolkit, developed since 2001, originally for academic purposes.

In [None]:
text = "Let us learn some NLP. NLP is amazing!"

In [None]:
word_tokenize(text)

In [None]:
sent_tokenize(text)

### CountVectorizer

Converting a collection of text documents to a matrix of token counts

[sklearn's CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

### CountVectorizer

<center>
    <img src="../images/nlp/img_p13_4.png">
    <br>
    <img src="../images/nlp/img_p13_5.png">
</center>

<div class="alert alert-block alert-info">
<b>Note:</b> 

Gives a lot of weight to frequent (and maybe not so informative) words... → TF-IDF fixes this
</div>

In [None]:
corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

In [None]:
cv = CountVectorizer()

X = cv.fit_transform(corpus)

In [None]:
features = cv.get_feature_names_out()
print(f"Features - {features}")
 
output = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
print("\n",output)

Notes:  This (sparse) matrix is called the "document-term matrix".  
Ways to shorten it:  
- remove so-called "stop words" (see below), which appear very often (e.g. in more than 75% of the documents) and probably carry no relevant meaning.  
- lowercase all words (with some information loss)  

Purpose of this matrix:  
It can be used, for example, to classify which document a sentence belongs to (e.g. via logistic regression: every word column is a feature, while the document name is the label).

In [None]:
from sklearn.linear_model import LogisticRegression

y = ['document 1', 'document 2', 'document 3', 'document 4']
model = LogisticRegression().fit(X, y)

In [None]:
query = ['What is about second document?']

query_transformed = cv.transform(query)

model.predict(query_transformed)[0]
#model.predict_proba(query_transformed)[0]


### TF-IDF

**TF-IDF**: Term Frequency * Inverse Document Frequency

→ measure how important a word is to a document in a corpus

<div class="alert alert-block alert-info">
<b>Note:</b> 

A frequent word in a document that is also frequent in the corpus is less important to a document than a frequent word in a document that is not frequent in the corpus.
</div>

Notes:  Why is TF-IDF needed if the basic document-term matrix already works? Because down-weighting words that are frequent across the whole corpus often improves prediction.

### TF-IDF

**TF**: 

$$\text{tf}(t, d)=\frac{f_{t,d}}{\sum_{t'\in{d}} f_{t', d}} $$

**IDF**:

$$\text{idf}(t, D)= \log\frac{N}{|\{d\in{D}:t\in{d}\}|}$$

**TF-IDF**:

$$\text{tfidf}(t, d, D)=\text{tf}(t, d) \cdot \text{idf}(t, D)$$

Notes:  
TF  
Objective: The more frequently a term appears in a document, the more characteristic it is for this document.  
TF = occurrences of the term in the document / total number of terms in the document.  
There is a TF for every term in every document.  

IDF  
Objective: If a term appears in only a few documents, it has high explanatory power.  
IDF = log (number of documents / number of documents containing the term in question)  
There is one IDF for every unique term of the corpus.  

TF-IDF  
Objective: statistical heuristic that takes both into account.  
TF-IDF = TF * IDF  
There is one TF-IDF value for every term in every document.
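
The three formulas can be checked by hand. The following is a minimal sketch in plain Python that follows the slide formulas literally; the simple regex tokenizer and the function names are assumptions made for this illustration. sklearn's `TfidfVectorizer` will give different numbers, because by default it uses raw counts for TF, a smoothed IDF, and L2-normalizes each document vector.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Very simple tokenizer: lowercase and keep runs of word characters
tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]


def tf(term, tokens):
    """Occurrences of the term divided by the total number of terms."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, docs):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / n_containing)


def tfidf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)


second_doc = tokenized[1]
score_second = tfidf("second", second_doc, tokenized)
score_document = tfidf("document", second_doc, tokenized)
score_this = tfidf("this", second_doc, tokenized)

print(score_second, score_document, score_this)
```

In the second sentence, "document" occurs twice but is common in the corpus, so it scores lower than "second", which occurs once but only in this sentence. "this" appears in every document, so its IDF and therefore its TF-IDF is zero.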


### TF-IDF

<center>
    <img src="../images/nlp/img_p13_4.png">
    <br>
    <img src="../images/nlp/img_p16_3.png">
</center>

[sklearn's TF-IDF](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)

A detailed article on how [TF-IDF](https://medium.com/analytics-vidhya/demonstrating-calculation-of-tf-idf-from-sklearn-4f9526e7e78b) is calculated in sklearn.


In [None]:
corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()

X = tfidf.fit_transform(corpus)

X.toarray()

In [None]:
df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
df

### N-grams

<div class="group">
    <div class="text_70">
        
N-grams model sequences of words. For example, "ice" and "cream" make more sense as a 2-gram when they appear together.

N-grams can be built at word level or at character level.


[n-grams](https://books.google.com/ngrams)
    </div>
    <div class="images_30">
        <img src="../images/nlp/img_p17_2.png">
    </div>
</div>

In [None]:
from nltk import ngrams

In [None]:
n = 4

for i in range(1, n):
    print(f"{i} gram\n")
    ngram = ngrams(text.split(), i)
    for gram in ngram:
        print(gram)
    print("-"*10)

Notes: Using n-grams with larger n captures some local context, but it dramatically enlarges the document-term matrix.

### Normalization

[‘List’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→  [‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]


Do we want to distinguish between “List” and “list”?

Sometimes we do: “White House” vs. “white house”


Notes: Normalization is the process of converting text data into a standardized form to reduce complexity and improve the efficiency of machine learning models. This can include lowercasing, stemming/lemmatization, ...

### Stemming

[‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘list’, ‘list’, ‘list’, ‘list’, ‘.’]


<div class="alert alert-block alert-info">
<b>Note:</b> 

Stemming reduces words to a shorter form, a form that might have no meaning.
</div>

### Lemmatization

[‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘listed’, ‘list’, ‘listing’, ‘listing’, ‘.’]

<div class="alert alert-block alert-info">
<b>Note:</b> 

Lemmatization uses a language dictionary to map a word to its base form (lemma).
</div>

In [None]:
stemmer = nltk.PorterStemmer()

text = "We are learning how a stemmer works"
text1 = "People are running so fast." 

In [None]:
tokenized_text = word_tokenize(text)
stem = [stemmer.stem(word) for word in tokenized_text]
stem

In [None]:
lemmatizer = nltk.WordNetLemmatizer()

In [None]:
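tokenized_text = word_tokenize(text)
# Without a POS argument, WordNetLemmatizer treats every word as a noun (pos="n"),
# so verb forms such as "learning" stay unchanged.
lemm = [lemmatizer.lemmatize(word) for word in tokenized_text]
lemm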

### Stemming or Lemmatization?

It depends...
* Stemming is faster
* Lemmatization preserves more information

Notes:  
Stemming works by applying heuristic rules rather than using a lookup table. Pro: it doesn't need a lookup table. Con: stemmed words often don't have any meaning.  
Lemmatization usually needs a language-specific lookup table.
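
The difference between rules and lookup can be illustrated with a deliberately tiny toy example. This is not the Porter algorithm and not WordNet; the suffix list and the lemma table below are made up for illustration only.

```python
# Toy suffix rules, checked in this order (NOT the Porter stemmer)
SUFFIXES = ("ings", "ing", "ed", "s")

# Toy lookup table (NOT WordNet); real tables are language specific and large
LEMMA_TABLE = {"lists": "list", "listings": "listing", "ran": "run"}


def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)


words = ["list", "listed", "lists", "listing", "listings", "running", "ran"]
stems = [toy_stem(word) for word in words]
lemmas = [toy_lemmatize(word) for word in words]

print(stems)
print(lemmas)
```

The rule-based version needs no data but produces the non-word "runn" and cannot handle the irregular form "ran". The lookup version handles "ran" but only knows the words that are in its table.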

## But what about meanings?

## Advanced Text Preprocessing

Notes: BREAK HERE

### Stopwords

* some words do not provide meaningful information ... they are not “content words”
* the list of non-content words is language specific and corpus specific

What would you say are stop words in this text?

"Apple is looking at buying U.K. startup for $1 billion"


### Stopwords

* some words do not provide meaningful information ... they are not “content words”
* the list of non-content words is language specific and corpus specific

What would you say are stop words in this text?

"Apple **is** looking **at** buying U.K. startup **for** $1 billion"


In [None]:
nltk.download("stopwords")
from nltk.corpus import stopwords

print(stopwords.words('english'))
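
Removing stop words is then a simple filter. The sketch below uses a small hand-written stop word set so that it runs without any download; this set is an assumption for illustration, and in practice the list from `stopwords.words('english')` could be used instead.

```python
sentence = "Apple is looking at buying U.K. startup for $1 billion"

# Hand-written toy stop word set (an assumption, not NLTK's list)
toy_stop_words = {"is", "at", "for"}

# Whitespace tokenization keeps "U.K." and "$1" intact
tokens = sentence.split()
content_tokens = [tok for tok in tokens if tok.lower() not in toy_stop_words]

print(content_tokens)
```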

### POS Tagging

* **P**art **O**f **S**peech tagging - assigning grammatical annotations
    * ADJ - adjective
    * NOUN
    * VERB
    * ...
    
Which are verbs and nouns here?

"Apple is looking at buying U.K. startup for $1 billion"

[universaldependencies](https://universaldependencies.org/docs/u/pos/)


### POS Tagging

* **P**art **O**f **S**peech tagging - assigning grammatical annotations
    * ADJ - adjective
    * NOUN
    * VERB
    * ...
    
Which are **verbs** and *nouns* here?

"*Apple* is  **looking** at **buying** U.K. *startup* for $1 billion"

[universaldependencies POS](https://universaldependencies.org/docs/u/pos/)


In [None]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
tokenized_text = word_tokenize(text)
tag = pos_tag(tokenized_text)
tag

* VBP - Verb, non-3rd person singular present
* VBG - Verb, ending in '-ing' or present participle
* VBZ - Verb, 3rd person singular present
* WRB - Wh-adverb

### POS Tagging using spaCy

Notes: spaCy is a more modern NLP library built for production use. It overlaps with NLTK in some functionality.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
import spacy
  
nlp = spacy.load("en_core_web_sm")

In [None]:
# new_text = "The car is blue"
doc = nlp(text)
  
# Token and Tag
for token in doc:
    print(token, token.pos_)

### Named Entities 


* Named Entities are real-world objects that are assigned a name: person, country, book, product, ...
* The recognition of entities is based on training data so it's not perfect.

What entities do you think are in this text?

"Apple is looking at buying U.K. startup for $1 billion"


### Named Entities 


* Named Entities are real-world objects that are assigned a name: person, country, book, product, ...
* The recognition of entities is based on training data so it's not perfect.

What entities do you think are in this text?

"**Apple** is looking at buying **U.K.** startup for $1 billion"


In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text," - ", ent.label_)

In [None]:
# GPE above stands for geopolitical entity (countries, cities, states)

In [None]:
from spacy import displacy

displacy.render(doc, style="ent")

In [None]:
for token in doc:
    print(token, token.pos_)

In [None]:
displacy.render(doc, style="dep")

## So.. what do we do with all that?

* document similarity
* text classification
* ...


### Text similarity or Document Similarity

Each document is a vector of features. 

Similarity between documents is the similarity between vectors

Usage:
* search engines: query to document
* clustering of documents: document to document
* Question & Answering platforms: query to query
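
A minimal sketch of document similarity, assuming TF-IDF vectors as features and cosine similarity as the similarity measure (one common choice among several). It reuses the four-sentence corpus from the TF-IDF section and compares documents with each other as well as a query with all documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Document to document: 4 x 4 matrix of pairwise similarities
doc_similarities = cosine_similarity(X)

# Query to document: transform the query with the fitted vectorizer
query_vec = tfidf.transform(["the second document"])
query_similarities = cosine_similarity(query_vec, X)[0]
best_match = int(query_similarities.argmax())

print(doc_similarities.round(2))
print(best_match, corpus[best_match])
```

The first and the fourth sentence contain exactly the same words, so their bag-of-words vectors are identical and their similarity is 1, even though one is a statement and the other a question. Word order is lost in this representation.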


### Text classification
You can use your favourite classifier with text
* Logistic Regression provides a nice baseline
* AUC score as a performance metric

Some applications:
* spam detection
* sentiment analysis
* hate speech analysis
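
A baseline along these lines could look as follows. This is only a sketch on a made-up toy data set (eight short messages, label 1 = spam); the AUC is computed on the training data purely to show the call, whereas a real evaluation needs a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Made-up toy data: 1 = spam, 0 = not spam
texts = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "cheap offer click here",
    "meeting moved to monday",
    "see you at lunch tomorrow",
    "please review the report",
    "notes from the meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so raw text goes in directly
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Probability of the positive class (spam) for every training text
spam_probabilities = clf.predict_proba(texts)[:, 1]
train_auc = roc_auc_score(labels, spam_probabilities)

new_prediction = clf.predict(["free prize inside"])[0]

print(train_auc, new_prediction)
```

Putting the vectorizer inside the pipeline ensures that new texts are transformed with the vocabulary learned during training.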


## Word Embeddings

### Word Embeddings
* Represent feature space in smaller dimension
* Similar words are near in embedding space
* Trained by using neural networks  
    &emsp;&rarr; Use those trained weights as first layer in your NLP neural network.


Notes: A common dimensionality of word embeddings is, for example, 300, while English has around 170,000 words.  

Using word embeddings transforms words into sensible numerical values.

### Word similarity
Is “St Pauli” more similar to:

* De Wallen → Similar type

or

* HSV → Similar topic?

Result depends on the context ... or on the feature space / embedding you chose


Notes:  
St. Pauli is a district in Hamburg and a name of a Hamburg football club.  
De Wallen is a district in Amsterdam.  
HSV is a Hamburg football club.  

### Using Embeddings

<div class="group">
    <div class="text">
        
Relevant items for your task should be similar in the embedding space, i.e. close to each other.

    </div>
    <div class="images">
        <img src="../images/nlp/img_p8_1.png">
    </div>
</div>

### How do we get Word Embeddings

<div class="group">
    <div class="text">
        
Having lots of data and:
* Read the text
* Process text
* Create x, y data points, for example from pairs of words that appear close together in the text
* Create one hot encodings
* Train a neural network
* Extract the weights from the input layer

[Example 1](https://github.com/Eligijus112/word-embedding-creation), 
[Example 2](https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8)
    </div>
    <div class="text">
    </div>
</div>
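
The data preparation steps listed above can be sketched as follows, assuming a skip-gram style setup in which every word is paired with its neighbours inside a small window. The sentence, the window size and the helper names are made up for illustration; training the network itself is left out.

```python
import numpy as np

# Read and process the text (here: one toy sentence, whitespace tokens)
tokens = "ice cream is cold".split()
vocabulary = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}


def make_pairs(tokens, window=1):
    """Create (target, context) pairs for words at most `window` apart."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def one_hot(word):
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[word_to_id[word]] = 1
    return vec


# x = one-hot target word, y = one-hot context word
pairs = make_pairs(tokens, window=1)
X = np.array([one_hot(target) for target, _ in pairs])
y = np.array([one_hot(context) for _, context in pairs])

print(pairs)
print(X.shape, y.shape)
```

A network trained to predict `y` from `X` has an input weight matrix with one row per vocabulary word, and these rows are what is later extracted as embeddings.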

### How do we get Word Embeddings

<div class="group">
    <div class="text">
CBOW - Continuous Bag of Words
    </div>
    <div class="images">
        <img src="../images/nlp/CBOW.png">    
    </div>
</div>

Notes:  Once training has finished on all context windows (here: 4) of all documents in the corpus, the weights of the hidden layer (n weights per unique word) are the vectors of the unique words.  
(The number of nodes in the input layer, as well as in the output layer, equals the number of unique words/tokens.)
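
Why the hidden-layer weights can serve as word vectors becomes clear from a small numerical sketch: multiplying a one-hot input by the weight matrix simply selects one row. The matrix below is random and only stands in for trained weights; vocabulary and dimensions are made up.

```python
import numpy as np

vocabulary = ["cold", "cream", "ice", "is"]
embedding_dim = 3  # n weights per unique word

# Random stand-in for the trained input-to-hidden weight matrix
rng = np.random.default_rng(seed=0)
W = rng.normal(size=(len(vocabulary), embedding_dim))

# One-hot input for the word "ice"
idx = vocabulary.index("ice")
one_hot = np.zeros(len(vocabulary))
one_hot[idx] = 1.0

# The hidden layer activation is exactly row `idx` of W
hidden = one_hot @ W

print(hidden)
print(W[idx])
```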

### How do we get Word Embeddings

<div class="group">
    <div class="text">
Skip-Gram
    </div>
    <div class="images">
        <img src="../images/nlp/SkipGram.png">    
    </div>
</div>

Notes:  
Skip-gram often works better than CBOW, especially for rare words, but it is slower to train.  

### How do we get Word Embeddings

<div class="group">
    <div class="images">
        <img src="../images/nlp/img_p39_3.png">    
    </div>
    <div class="images">
        <img src="../images/nlp/img_p39_4.png">
    </div>
</div>

[Creating word embeddings](https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8)

In the list of sentences above, words that appear together in the sentences also appear near each other in the embedding graph.
Masculine words form one cluster and feminine words form another.
This small example shows that words appearing together in a context end up closer to each other in the n-dimensional embedding space.

### Using pre-trained embeddings
Most of the time you do not have enough data to train good word embeddings for your task. Instead, you can use pre-trained word embeddings.  

There are different kinds of word embeddings:  
- static word embeddings: Word2vec (Google), GloVe (Stanford University), fastText (Facebook), ...  
- contextual word embeddings: ELMo, BERT (Google), GPT-2/3/4 (OpenAI), ...  


example: [pretrained word embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/)

Notes:  
gensim is not a word embedding but a library for handling word embeddings.  
word2vec and fastText use skip-gram / CBOW.  
fastText uses parts of words (subwords) instead of whole words.  
GloVe works on a co-occurrence matrix, not on CBOW or skip-gram.  

### Word Embeddings

In [None]:
!pip install gensim
!pip install scipy==1.12

In [None]:
import gensim.downloader as api

# List available embeddings
info = api.info()

for model_name, model_data in sorted(info['models'].items()):
    print(model_name)

In [None]:
# caveat: If you don't have enough RAM, this cell can crash your kernel

wv = api.load("word2vec-google-news-300")
glove = api.load("glove-twitter-100")
fasttext = api.load("fasttext-wiki-news-subwords-300")

In [None]:
from gensim.models import KeyedVectors

# Alternative: load only the first 200,000 words from the downloaded file
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)

Notes:  Download 'GoogleNews-vectors-negative300.bin.gz' first.

In [None]:
wv.most_similar("coffee")

In [None]:
wv.get_vector("coffee")

In [None]:
glove.most_similar("coffee")

In [None]:
fasttext.most_similar("coffee")

In [None]:
wv.distance("coffee", "tea")
# wv.distance("coffee","coffees")

In [None]:
wv.distance("coffee", "onion")

In [None]:
wv.most_similar(positive=["king", "woman"], negative=["man"])

Notes: The method is called "most_similar" because it is very unlikely that any word has exactly the 300 components of the resulting vector (king - man + woman), so the nearest word vectors are returned instead.
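
The vector arithmetic behind this query can be sketched with hand-made 2-dimensional toy vectors (first component roughly "royalty", second roughly "gender"). The numbers are invented, and the function is a simplified stand-in for the real method: gensim's implementation differs in details such as normalising the vectors first. Like gensim, the sketch skips the query words themselves in the result.

```python
import numpy as np

# Invented toy vectors: [royalty, gender]
vectors = {
    "king": np.array([0.90, 0.80]),
    "queen": np.array([0.95, -0.75]),
    "man": np.array([0.10, 0.90]),
    "woman": np.array([0.05, -0.85]),
    "apple": np.array([-0.80, 0.10]),
}


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_most_similar(positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return [(w, cosine(vectors[w], target)) for w in ranked[:topn]]


result = toy_most_similar(positive=["king", "woman"], negative=["man"])
print(result)
```

The computed target vector is close to "queen" but not identical to it, which is why the nearest neighbour is reported rather than an exact match.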

In [None]:
wv.most_similar(positive=["restaurant", "coffee"], negative=["dinner"])

In [None]:
wv.most_similar(positive=["Berlin", "France"], negative=["Germany"])

In [None]:
wv.doesnt_match(["sklearn","numpy","python","pandas"])

In [None]:
# find out which other methods there are and test their function
dir(wv)

### Visualize Semantics with Graphs 

[TensorFlow projector](https://projector.tensorflow.org)

<center>
    <img src="../images/nlp/img_p41_2.png" width=1100>
</center>


## Hugging Face & Transformers

<center>
    <img src="../images/nlp/img_p42_1.png">
</center>


### Hugging Face

~ 7k pre-trained NLP models on [huggingface.co](https://huggingface.co)

<center>
    <img src="../images/nlp/img_p43_1.png" width=800>
</center>


### Zero Shot Learning 

When you have little data.

**Zero-shot learning (ZSL)** is a problem setup in machine learning where, at test time, a learner
observes samples from classes that were not observed during training and needs to predict the
class they belong to.

(see notebook 2 in workbooks)

Notes:  
Zero-shot classification is the task of predicting a class that the model did not see during training. In zero-shot text classification, a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes.  
It's also called heterogeneous transfer learning.  
This works because those models use auxiliary information, e.g. from a text corpus (keyword: multimodal inputs).

### Resources

- [Getting started with NLP (Pyladies)](https://github.com/pyladieshamburg/getting-started-with-nlp)
- [NGram Loader](https://pypi.org/project/google-ngram-downloader/)
- [spaCy](https://spacy.io/)
- [Text similarities](https://medium.com/@adriensieg/text-similarities-da019229c894)
- [Neural models for information retrieval](https://www.microsoft.com/en-us/research/video/neural-models-information-retrieval-video/)
- [Glove](https://nlp.stanford.edu/projects/glove/)
- [What is a transformer? (3blue1brown)](https://www.youtube.com/watch?v=wjZofJX0v4M)
- How does zero shot learning work [[video](https://blog.roboflow.com/zero-shot-learning-computer-vision/), [text](https://www.kdnuggets.com/2022/12/zeroshot-learning-explained.html)]?
- Sentiment Analysis with VADER [[stand alone](https://vadersentiment.readthedocs.io/en/latest/), [using nltk](https://www.nltk.org/howto/sentiment.html)]



Notes:  
Maybe show simple example of sentiment analysis (instead of some other slides)