# Natural Language Processing (NLP)

## Motivation

🗣 How can we incorporate textual data in Machine Learning Algorithms?

🤔 What are the Machine Learning models dedicated to language-related tasks?

Examples:

- E-mail filtering (legitimate e-mail vs. spam)
- Sentiment analysis
- Chatbots
- Voice/speech recognition
- Smart assistants
- Language translation

**What is NLP?**

Natural Language Processing (NLP) is a subfield of linguistics, computer science and artificial intelligence concerned with the interactions between computers and human language - in particular how to program computers to process and analyze large amounts of natural language data (speech and text).

Source: [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing)

**Plan**
1. Text Preprocessing
2. Vectorizing
3. NLP Modeling: Naive Bayes Classifier
4. Topic Modeling: the Latent Dirichlet Allocation Algorithm (LDA) (Unsupervised)

![plan](pics/plan.png)

## 1. Text Preprocessing
For any Machine Learning algorithm, data preprocessing is crucial, and this remains true for algorithms dealing with text.

✍️ Text preprocessing is quite different from numerical preprocessing. The most common preprocessing tasks for textual data are:

- lowercase
- dealing with numbers, punctuation, and symbols
- splitting
- tokenizing
- removing "stopwords"
- lemmatizing

<br>

### 💻 🧹 Basic cleaning with Python core string operations
When you have some unstructured text, you can already clean it with some Python built-in string operations

<br>

💻 ✂️ [strip](https://docs.python.org/3/library/stdtypes.html#str.strip) (1/2)

`strip` removes all leading and trailing whitespace from a string

In [1]:
texts = [
    '   Bonjour, comment ca va ?     ',
    '    Heyyyyy, how are you doing ?   ',
    '        Hallo, wie gehts ?     '
]
texts

['   Bonjour, comment ca va ?     ',
 '    Heyyyyy, how are you doing ?   ',
 '        Hallo, wie gehts ?     ']

In [2]:
[text.strip() for text in texts]

['Bonjour, comment ca va ?',
 'Heyyyyy, how are you doing ?',
 'Hallo, wie gehts ?']

💻 ✂️ strip (2/2)

You can also specify a "list" of characters (in the form of a single and unordered string) to be removed at the beginning and at the end of a string

In [3]:
text = "abcd Who is abcd ? That's not a real name!!! abcd"
text

"abcd Who is abcd ? That's not a real name!!! abcd"

In [4]:
text.strip('bdac')

" Who is abcd ? That's not a real name!!! "

💻 🔄 [replace](https://docs.python.org/3/library/stdtypes.html#str.replace)

In [5]:
text = "I love koalas, koalas are the cutest animals on Earth."
text

'I love koalas, koalas are the cutest animals on Earth.'

In [6]:
text.replace("koala", "panda")

'I love pandas, pandas are the cutest animals on Earth.'

💻 📏 [split](https://docs.python.org/3/library/stdtypes.html#str.split)

In [7]:
text = "linkin park / metallica /red hot chili peppers"

In [8]:
text.split("/")

['linkin park ', ' metallica ', 'red hot chili peppers']

💻 🔡 Lowercase

Text modeling algorithms are case-sensitive: two words need to have the same casing to be considered equal. Keep in mind that capitalization can carry meaning and context, so lowercasing is a trade-off.

In [9]:
text = "i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?"
text

'i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?'

In [10]:
text.lower()

'i love football so much. football is my passion. who else loves football ?'

💻 🔢 Numbers

✅ We can (and often should) remove numbers during the text preprocessing steps, especially for:

- text clustering
- collecting keyphrases

In [11]:
text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"
text

'i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous'

In [12]:
cleaned_text = ''.join(char for char in text if not char.isdigit())
cleaned_text

'i do not recommend this restaurant, we waited for so long, like  minutes, this is ridiculous'
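
Dropping the digits leaves a double space where `30` used to be. A minimal sketch of one way to tidy this up, using only built-in string methods (the variable names `no_digits` and `collapsed` are illustrative):

```python
text = "we waited for so long, like 30 minutes, this is ridiculous"

# Remove every digit character
no_digits = ''.join(char for char in text if not char.isdigit())

# split() without arguments splits on any run of whitespace,
# so re-joining with a single space collapses the leftover gap
collapsed = ' '.join(no_digits.split())
print(collapsed)
```

This also strips leading and trailing whitespace as a side effect, which is usually what we want at this stage.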

💻 ❗️❓Punctuation and Symbols

Punctuation like ".?!" and symbols like "@#$" are not useful for topic modeling.

Punctuation is rarely used properly on social media platforms.

Warning: you might want to keep punctuation and symbols for authorship attribution, i.e. the task of identifying the author of a given text!

In [13]:
text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "
text

'I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ '

In [14]:
import string # "string" module is already installed with Python
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
for punctuation in string.punctuation:
    text = text.replace(punctuation, '')

text

'I love bubble tea OMG so tasty channel XOXO   '
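
As an alternative to the loop, the same result can be obtained in a single pass with `str.translate`. This is only a sketch of an equivalent approach; the names `table` and `no_punct` are illustrative:

```python
import string

text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "

# Build a translation table that maps every punctuation character to None
table = str.maketrans('', '', string.punctuation)

# translate() deletes all of them in one pass over the string
no_punct = text.translate(table)
print(repr(no_punct))
```

On long texts this tends to be faster than calling `replace` once per punctuation character, since the string is only scanned once.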

💻 Combination: `strip` + `lowercase` + `numbers` + `punctuation/symbols`

In [16]:
sentences = [
    "   I LOVE Pizza 999 @^_^",
    "  Study is amazing, take care - 666"
]

In [17]:
def basic_cleaning(sentence):
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())

    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    sentence = sentence.strip()

    return sentence

In [18]:
cleaned = [basic_cleaning(sentence) for sentence in sentences]
cleaned

['i love pizza', 'study is amazing take care']

💻 🔍 Removing Tags with [RegEx](https://regexr.com/)

We can remove HTML tags using RegEx:

In [19]:
import re

text = """<head><body>Hello HSLU!</body></head>"""
cleaned_text = re.sub(r'<[^<]+?>', '', text)

print(cleaned_text)

Hello HSLU!


We can also extract e-mail addresses from a text:

In [20]:
import re

txt = '''
    This is a random text, authored by darkvador@gmail.com
    and batman@outlook.com, WOW!
'''

# A raw string (r'...') avoids "invalid escape sequence" warnings for \w and \.
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', txt)

['darkvador@gmail.com', 'batman@outlook.com']

## 💻 Cleaning with NLTK

Natural Language Toolkit (NLTK) is an NLP library that provides preprocessing and modeling tools for text data

📚 [NLTK official website](https://www.nltk.org/)

🛠 [Installation Documentation](https://www.nltk.org/install.html)

### 💻 🧩 Tokenizing

Tokenizing is essentially splitting a sentence, a paragraph, or even an entire piece of text into smaller chunks such as individual words called tokens.

"Natural Language Processing"  →   ["Natural","Language","Processing"]

📚 [nltk.tokenize](https://www.nltk.org/api/nltk.tokenize.html)

🔅 Here is a quote often attributed to Aristotle:

In [21]:
text = 'It is during our darkest moments that we must focus to see the light'

text

'It is during our darkest moments that we must focus to see the light'

In [22]:
# When using nltk for the first time, we also need to download a few resources (corpora and models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arnaud/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/arnaud/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/arnaud/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/arnaud/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [23]:
from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(text)
print(word_tokens) # print displays the words in one line

['It', 'is', 'during', 'our', 'darkest', 'moments', 'that', 'we', 'must', 'focus', 'to', 'see', 'the', 'light']


💻 🛑 Stopwords

Stopwords are words that are used so frequently that they don't carry much information, especially for topic modeling

NLTK has a built-in corpus of English stopwords that can be loaded and used

In [24]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) # you can also choose other languages

Here is an example of a tokenized sentence:

In [25]:
tokens = ["i", "am", "going", "to", "go", "to", "the",
        "club", "and", "party", "all", "night", "long"]

❓ What stopwords could be removed ❓

In [26]:
stopwords_removed = [w for w in tokens if w in stop_words]
stopwords_removed

['i', 'am', 'to', 'to', 'the', 'and', 'all']

❓ What are the meaningful words in this sentence ❓

In [27]:
tokens_cleaned = [w for w in tokens if w not in stop_words]
tokens_cleaned

['going', 'go', 'club', 'party', 'night', 'long']

👉 What if you are not going to the party?

😱 "not" is also considered as a stopword

✅ Removing stopwords is useful for:

- topic modeling

❌ Dangerous for:

- sentiment analysis
- authorship attribution
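
For sentiment analysis, one possible workaround is to keep negation words out of the stopword set. The sketch below uses a small hand-written set as a stand-in for NLTK's list; both `stop_words` and `negations` here are illustrative assumptions, not NLTK's actual content:

```python
# Hypothetical, hand-written stand-in for a stopword list
stop_words = {"i", "am", "not", "to", "the", "and", "all"}

# Negation words we want to keep because they flip the meaning
negations = {"not", "no", "nor"}
sentiment_stop_words = stop_words - negations

tokens = ["i", "am", "not", "going", "to", "the", "party"]
tokens_cleaned = [w for w in tokens if w not in sentiment_stop_words]
print(tokens_cleaned)
```

The same set difference can be applied to the set built from `stopwords.words('english')`.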

## 💻 🧬 Lemmatizing
Lemmatizing is a technique used to reduce words to their dictionary form (their lemma), in order to group them by their meaning rather than by their exact form

![lemmatizing](pics/stem_lemma.png)

📚 [nltk.stem - WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

👇 Look at the following sentence:

In [28]:
sentence = 'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

In [29]:
sentence

'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

🗓 Let's apply the following steps:

- Basic cleaning
- Tokenizing
- Removing stopwords (if not doing sentiment analysis!)
- Lemmatizing


🧹 Step 1: Basic Cleaning

In [30]:
cleaned_sentence = basic_cleaning(sentence)
cleaned_sentence

'he was running and eating at the same time  he has a bad habit of swimming after playing  hours in the sun'

💻 🧩 Step 2 : Tokenize

In [31]:
tokenized_sentence = word_tokenize(cleaned_sentence)
print(tokenized_sentence)

['he', 'was', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', 'he', 'has', 'a', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'hours', 'in', 'the', 'sun']


🛑 Step 3: Remove Stopwords

In [32]:
tokenized_sentence_no_stopwords = [w for w in tokenized_sentence if w not in stop_words]
print(tokenized_sentence_no_stopwords)

['running', 'eating', 'time', 'bad', 'habit', 'swimming', 'playing', 'hours', 'sun']


💻 🧬 Step 4: Lemmatizing

📚 [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

[Lemmatization with NLTK](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/)

In [33]:
from nltk.stem import WordNetLemmatizer

# Instantiate the lemmatizer once and reuse it for every word
lemmatizer = WordNetLemmatizer()

# 1 - Lemmatizing the verbs
verb_lemmatized = [
    lemmatizer.lemmatize(word, pos = "v") # v --> verbs
    for word in tokenized_sentence_no_stopwords
]

# 2 - Lemmatizing the nouns
noun_lemmatized = [
    lemmatizer.lemmatize(word, pos = "n") # n --> nouns
    for word in verb_lemmatized
]

✅ Lemmatizing is useful for:

- topic modeling
- sentiment analysis

## Preprocessing Text - Takeaways

First of all, we can perform some pre-cleaning operations on the pieces of text of a corpus using Python built-in functions such as:

- ✂️ strip
- 🔄 replace
- 📏 split
- 🔡 lowercase
- 🔢 removing numbers
- ❗️ removing punctuation and symbols

Next, we can apply preprocessing techniques to prepare the pieces of text for NLP algorithms

- 🧩 Tokenizing
- 🛑 Removing stopwords
- 🧬 Lemmatizing


🤔 Now that the text is preprocessed, how can it be analyzed by Machine Learning algorithms?

## 2. Vectorizing

🤖 Machine Learning algorithms cannot process raw text: it needs to be converted into numbers first

**Vectorizing** = the process of converting raw text into a numerical representation

There are multiple vectorizing techniques. Among them, we will present:

- `Bag-of-Words`
- `Tf-idf`
- `N-grams`


![vectorization](pics/vectorization.png)

### 2.1. Bag-of-Words representation

**Bag-of-Words (BoW) representation** is one of the simplest and most effective ways to represent text for Machine Learning models.

When using this representation, we are simply counting how often each word appears in each document of a corpus. 

The count for each word becomes a feature:

![example_bow](pics/example_bow.png)
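
To make the idea concrete, here is a minimal hand-rolled sketch using only the standard library. The two-document corpus `docs` is made up for illustration, and splitting on whitespace is a simplification of real tokenization:

```python
from collections import Counter

docs = ["your cat is young", "young young cat"]

# Vocabulary: every distinct word in the corpus, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

Each row is the numerical representation of one document; a word absent from a document simply gets a count of 0.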

💻 `CountVectorizer`

In Scikit-Learn, there is a tool called `CountVectorizer` to generate bag-of-words representations of a set of texts

👉 `CountVectorizer` converts a collection of text documents into a matrix of token counts

📚 [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

👇 Look at the following sentences:

In [34]:
texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

Let's apply the CountVectorizer to generate a Bag-of-Words representation of these four sentences

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)
X.toarray()

array([[1, 1, 0, 0, 0, 1, 1, 2, 1, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0]])

🤔 Can you guess which column represents which word?

🔥 As soon as the `CountVectorizer` is fitted to the text, you can retrieve all the words seen with `get_feature_names_out()`:

In [36]:
count_vectorizer.get_feature_names_out()

array(['cat', 'dog', 'for', 'good', 'health', 'is', 'running', 'the',
       'with', 'young', 'your'], dtype=object)

In [37]:
import pandas as pd

vectorized_texts = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

display(vectorized_texts)

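| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |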


Be aware that there are some limitations when it comes to the bag-of-words representation:

❌ A BoW does NOT take into account the order of the words  →   hence the name `"Bag of Words"`

❌ A BoW does NOT take into account a document's length  →   `Tf-idf` to the rescue

❌ A BoW does NOT capture document context  →   `N-gram` to the rescue

### 2.2. `Tf-idf` Representation

Term Frequency (`tf`) & `CountVectorizer`

*Idea: The more often a word appears in a document relative to others, the more likely it is that it will be important to this document*

Example: if the word "elections" appears relatively frequently in a document, this document most likely deals with politics.



The frequency of a word $x$ in a document $d$ is called **term frequency**, and is denoted by:

$
tf_{x,d} = \dfrac{\text{Number of times term } x \text{ appears in document } d}{\text{Total number of terms in document } d}
$


❓ In our last example, could we compute $tf_{young, document4}$ ❓

In [38]:
vectorized_texts

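| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |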


$tf_{young, document4} = \dfrac{5 \text{ counts of "young"}}{8 \text{ total words}} = 0.625 $ 
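
The same computation can be sketched in a few lines of plain Python (the variable names are illustrative, and whitespace splitting stands in for a real tokenizer):

```python
from collections import Counter

document4 = "young young young young young cat cat cat"
words = document4.split()
counts = Counter(words)

# Term frequency = occurrences of the term / total number of terms
tf = {term: count / len(words) for term, count in counts.items()}
print(tf)
```

The term frequencies of a document always sum to 1, which is what makes them comparable across documents of different lengths.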

Document Frequency (`df`)

*Idea: However, if a word appears in many documents of a corpus, it is not that useful for understanding a particular document.*

Example: on eurosport.com/football, the word "football" appears in every article, so on this website it is an unimportant word!

The number of documents in a corpus that contain the word $x$ is called the **document frequency** (df) of $x$, and is denoted by $df_{x}$

❓ In our last example, could we compute $df_{cat}$, $df_{young}$, $df_{the}$ ❓

In [39]:
# Compute document frequency (DF)
document_frequency = (vectorized_texts > 0).sum(axis=0)

# Convert DF into a DataFrame format
document_frequency = pd.DataFrame([document_frequency], index=["Document Frequency"])

# Display the DataFrame
display(document_frequency)


| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document Frequency | 3 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 3 | 2 |


If a word $x$ appears in too many documents of a corpus - i.e. if the document frequency $df_{x}$ is too high - the word $x$ won't help us with topic modeling and should be considered irrelevant.

Example: on eurosport.com/football/, the word "football" won't help us distinguish two articles, one dealing mainly with strategy and another one talking about referee best practices!

What if we considered the **relative document frequency** of a word $x$, which can be computed as:

$
\dfrac{df_x}{N}
$

where:
- $df_x$ is the number of documents $d$ containing the word $x$,
- $N$ is the total number of documents in a corpus.

For the word "football" on Eurosport, we would expect this formula to be close to 1 since the number of docs containing the word "football" will probably only be slightly less than the total number of docs (out of 100 maybe only 5 don't have the word "football", so we get 95/100).

Idea: A word $x$ in a corpus of texts will be considered important when its **(relative) document frequency** is **low** ⇔ its inverse document frequency $\dfrac{N}{df_x}$ is high.

Again, if the word "football" appears in all the articles, it is not very useful for distinguishing between two articles. But if only a few documents contain words like "concussion" or "wellbeing" (e.g. they appear in 2/100 articles), these words will be much more useful in determining the topic of those articles (they are probably specifically about player welfare).
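
To make the relative and inverse document frequencies concrete, here is a small sketch on the four toy sentences used earlier. It relies only on built-in Python, and the dictionary names (`df`, `relative_df`, `inverse_df`) are illustrative:

```python
texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]
N = len(texts)

# Set of distinct words for each document
doc_words = [set(text.split()) for text in texts]
vocabulary = sorted(set().union(*doc_words))

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in words for words in doc_words) for w in vocabulary}
relative_df = {w: df[w] / N for w in vocabulary}
inverse_df = {w: N / df[w] for w in vocabulary}

for w in ("cat", "dog"):
    print(w, df[w], relative_df[w], round(inverse_df[w], 2))
```

"cat" appears in 3 of the 4 documents and gets a low inverse document frequency, while "dog" appears in only one and gets a high one.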

**Tf-idf Formula**

💡 Thus the intuition of the `term frequency - inverse document frequency` approach is to give a high weight to any term which appears frequently in a single document, but not in too many documents of the corpus.

The weight of a word $x$ in a document $d$ is given by:

$$
w_{x,d} = tf_{x,d} \times \left[ \log \left( \frac{N + 1}{df_x + 1} \right) + 1 \right]
$$

where:

- $tf_{x,d} = \dfrac{\text{Number of occurrences of word } x \text{ in document } d}{\text{Total number of words in document } d}$

- $df_x$ = Number of documents $d$ containing the word $x$
- $N$ = Total number of documents in a corpus
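
The formula above can be sketched as a small function. This is a direct transcription under one assumption: $\log$ is taken to be the natural logarithm. The function name `tfidf_weight` and its argument names are made up for illustration:

```python
import math

def tfidf_weight(term_count, doc_length, df, n_docs):
    """Weight of a term in a document, following the formula above."""
    tf = term_count / doc_length
    # Assumption: natural logarithm
    idf = math.log((n_docs + 1) / (df + 1)) + 1
    return tf * idf

# "young" in the fourth toy document: 5 occurrences out of 8 words,
# present in 3 of the 4 documents
w = tfidf_weight(term_count=5, doc_length=8, df=3, n_docs=4)
print(round(w, 4))
```

The `+ 1` terms inside the logarithm avoid a division by zero, and the `+ 1` outside keeps a non-zero weight for words that appear in every document.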

### 2.3. 💻 TfidfVectorizer

`raw documents`  →   `matrix of tf-idf features`

📚 [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [40]:
texts

['the young dog is running with the cat',
 'running is good for your health',
 'your cat is young',
 'young young young young young cat cat cat']

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [42]:
# Instantiating the TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer()

# Training it on the texts
weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),
                 columns = tf_idf_vectorizer.get_feature_names_out())

weighted_words

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.227904 | 0.357056 | 0.0 | 0.0 | 0.0 | 0.227904 | 0.281507 | 0.714112 | 0.357056 | 0.227904 | 0.0 |
| 1 | 0.0 | 0.0 | 0.463709 | 0.463709 | 0.463709 | 0.29598 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.365594 |
| 2 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.580622 |
| 3 | 0.514496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.857493 | 0.0 |
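
These weights do not match a direct application of the formula with a relative term frequency: for "young" in the last document, $0.625 \times 1.223 \approx 0.764$, whereas the output shows 0.857. With its default settings, `TfidfVectorizer` uses the raw count as term frequency and then rescales each row to unit Euclidean length (L2 normalization). A minimal sketch reproducing the last row by hand, with illustrative variable names:

```python
import math

N = 4                              # documents in the corpus
counts = {"cat": 3, "young": 5}    # raw counts in the last document
df = {"cat": 3, "young": 3}        # document frequencies

# Unnormalized weights: raw count times smoothed idf
raw = {t: counts[t] * (math.log((N + 1) / (df[t] + 1)) + 1) for t in counts}

# L2 normalization: divide by the Euclidean length of the row
norm = math.sqrt(sum(v ** 2 for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}
print(weights)
```

Because of this normalization, dividing the counts by the document length beforehand would not change the final row: the constant factor cancels out.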


**Controlling the vocabulary size**:

In every language, there are many words used in everyday vocabulary:

- 🇬🇧 English: ~20,000 words
- 🇫🇷 French: ~20,000 words
- 🇩🇪 German: ~20,000 words

In a document, we can't afford to vectorize every word!

We can, however, control the number of words to be vectorized (*curse of dimensionality*!):

👉 Scikit-Learn allows us to customize the `CountVectorizer` and `TfidfVectorizer` with key parameters to control vocabulary size.

💻 Key parameters of `TfidfVectorizer` (and `CountVectorizer`)
- `max_df/min_df`
- `max_features`

💻 `max_df` (resp. `min_df`)

*When building the vocabulary, `CountVectorizer` and `TfidfVectorizer` will remove terms which have a document frequency strictly higher (resp. lower) than the given threshold. `max_df` and `min_df` help us build corpus-specific stopwords.*

Example: when classifying pieces of text into "basketball" or "football", the word "ball" would appear too often and would be useless for this classification, it would be better to filter it out using `max_df`

**How to use these parameters in practice?**

`max_df` (`min_df`) can be either a float between 0.0 and 1.0 or an integer

- `max_df` (`min_df`) = 0.5  ⇔   "ignore terms that appear in more (fewer) than 50% of the documents"
- `max_df` (`min_df`) = 20  ⇔   "ignore terms that appear in more (fewer) than 20 documents"

By default, `max_df` = 1.0  ⇔  no "frequent" word will be removed

By default, `min_df` = 1  ⇔   a word only needs to appear in one document to be kept, so no "infrequent" word will be removed
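
The filtering rule described above can be sketched in plain Python. This is a simplified illustration of the behaviour, not Scikit-Learn's actual code; the helper `keep_term` is hypothetical, and the document frequencies are those of the four toy sentences:

```python
document_frequency = {
    "cat": 3, "dog": 1, "for": 1, "good": 1, "health": 1, "is": 3,
    "running": 2, "the": 1, "with": 1, "young": 3, "your": 2,
}
n_docs = 4

def keep_term(df, n_docs, min_df=1, max_df=1.0):
    """Return True if a term with document frequency df survives both filters."""
    # Floats are proportions of the corpus, integers are absolute counts
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return low <= df <= high

kept = [t for t, df in document_frequency.items() if keep_term(df, n_docs, max_df=2)]
print(kept)
```

With `max_df=2`, the words "cat", "is" and "young" (each present in 3 documents) are dropped. On this four-document corpus, `max_df=0.5` would have the same effect.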

In [43]:
# Number of documents containing each word (document frequency)
document_frequency

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document Frequency | 3 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 3 | 2 |


In [44]:
# Instantiate the CountVectorizer with max_df = 2
count_vectorizer = CountVectorizer(max_df = 2) # removing "cat", "is", "young"

# Train it
X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

X

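| | dog | for | good | health | running | the | with | your |
|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| running is good for your health | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| your cat is young | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| young young young young young cat cat cat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |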


💻 max_features

By specifying `max_features` = $k$ (k being an integer), the `CountVectorizer` (or the `TfidfVectorizer`) will build a vocabulary that only considers the top $k$ tokens ordered by term frequency across the corpus.

**How to use "max_features" in practice?**

In [45]:
# CountVectorizer with the 3 most frequent words
count_vectorizer = CountVectorizer(max_features = 3)

X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

X

| | cat | is | young |
|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 1 |
| running is good for your health | 0 | 1 | 0 |
| your cat is young | 1 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 5 |


✅ Advantages of the `Tf-idf` representation:

- Using relative frequency rather than count is robust to document length
- Takes into account the context of the whole corpus

❌ Disadvantages of the `Tf-idf` representation:

- Like the `BoW`, `Tf-idf` does NOT capture the **within-document context** →  `N-gram` helps here
- Like the `BoW`, the word order is completely disregarded

### 2.4. `N-grams`

Example: the two following sentences have the exact same bag-of-words representation:

In [46]:
actors_movie = [
    "I like the movie but NOT the actors",
    "I like the actors but NOT the movie"
]

In [47]:
# Vectorize the sentences
count_vectorizer = CountVectorizer()
actors_movie_vectorized = count_vectorizer.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized = pd.DataFrame(
    actors_movie_vectorized.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


When using a `bag-of-words` representation, an efficient way to capture context is to consider:

- the count of single tokens (unigrams)
- the count of pairs (bigrams), triplets (trigrams), and more generally sequences of $n$ words, also known as `n-grams`

Examples:

- "mathematics" is a unigram (n = 1)
- "machine learning" is a bigram (n = 2)
- "natural language processing" is a trigram (n = 3)
- "deep convolutional neural networks" is a 4-gram (n = 4)

💻 `ngram_range`

In both `CountVectorizer` and `TfidfVectorizer`, you can specify the length of your sequences with the parameter `ngram_range` = (`min_n`, `max_n`).

Examples:

- ngram_range = (1, 1) 👉 (by default) will only capture the unigrams (single words)
- ngram_range = (1, 2) 👉 will capture the unigrams, and the bigrams
- ngram_range = (1, 3) 👉 will capture the unigrams, the bigrams, and the trigrams
- ngram_range = (2, 3) 👉 will capture the bigrams, and the trigrams but not the unigrams
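
To see what an n-gram is independently of any library, here is a minimal sketch that builds them from a list of tokens. The helper `ngrams` is written for illustration and is not part of Scikit-Learn:

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "like the movie but not the actors".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence of $m$ tokens yields $m - n + 1$ n-grams, so the vocabulary grows quickly when `max_n` increases.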

With a unigram vectorization, we couldn't distinguish two sentences with the same words.

In [48]:
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


 What about a **bigram vectorization**?

In [49]:
# Vectorize the sentences
count_vectorizer_n_gram = CountVectorizer(ngram_range = (2,2)) # BI-GRAMS
actors_movie_vectorized_n_gram = count_vectorizer_n_gram.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized_n_gram = pd.DataFrame(
    actors_movie_vectorized_n_gram.toarray(),
    columns = count_vectorizer_n_gram.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies with bigrams
actors_movie_vectorized_n_gram

| | actors but | but not | like the | movie but | not the | the actors | the movie |
|---|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 0 | 1 | 1 | 1 |


👍 The two sentences are now distinguishable

#### **Vectorizing - Takeaways**

We have seen two vectorizers:
- `CountVectorizer` (counting)
- `TfidfVectorizer` (weighting: takes the document length into consideration)

The most important parameters of these vectorizers are:
- `min_df` (infrequent words)
- `max_df` (frequent words)
- `max_features` (curse of dimensionality)
- `ngram_range` = (`min_n`, `max_n`) (capturing the context of the words)

## 3. (Multinomial) Naive Bayes Algorithm

The Multinomial Naive Bayes algorithm is a classification algorithm based on Bayes' Theorem in probability theory

### 3.1. ✉️ The E-mail Classification Problem

🎯 We want to classify e-mails based on their content:
- ✅ Normal (N)
- 📩 Spam (S)

🤔 What is the probability that an e-mail containing some specific words is spam?


### Mathematical Approach

Mathematically speaking, the probability that an e-mail containing specific words is spam can be denoted by:

$
P(\textcolor{red}{S} \mid x_1, x_2, \dots, x_k)
$

where:

- $\textcolor{red}{S}$ = "this e-mail is spam"
- $x_k$ = "the word $x_k$ appears in this e-mail"

![bayes_theorem](pics/bayes_theorem.png)

$
P(\textcolor{red}{S} \mid x_1, x_2, \dots, x_k) = \dfrac{P(x_1, x_2, \dots, x_k \mid \textcolor{red}{S}) \times P(\textcolor{red}{S})}{P(x_1, x_2, \dots, x_k)}
$
  
(Bayes' Theorem)


$
P(\textcolor{red}{S} \mid x_1, x_2, \dots, x_k) =
\dfrac{P(x_1, x_2, \dots, x_k \mid \textcolor{red}{S}) \times P(\textcolor{red}{S})}
{P(x_1, x_2, \dots, x_k \cap \textcolor{red}{S}) + P(x_1, x_2, \dots, x_k \cap \textcolor{cyan}{N})}
$

(Law of Total Probabilities)



![total_prob](pics/total_prob.png)

$
P(\textcolor{red}{S} \mid x_1, x_2, \dots, x_k) =
\dfrac{P(x_1, x_2, \dots, x_k \mid \textcolor{red}{S}) \times P(\textcolor{red}{S})}
{P(x_1, x_2, \dots, x_k \mid \textcolor{red}{S}) \times P(\textcolor{red}{S}) + P(x_1, x_2, \dots, x_k \mid \textcolor{cyan}{N}) \times P(\textcolor{cyan}{N})}
$

(Conditional Probability)

<br>

👉 Let's focus on a specific term:

$
P(x_1, x_2, \dots, x_k \mid \textcolor{red}{S})
$

<br>

The Naive Bayes algorithm makes the strong ("naive") assumption that the words in an e-mail are **conditionally independent** given the class (spam or normal)

By applying the independence property:

$
P(x_1, x_2, \dots, x_k \mid \textcolor{red}{S}) =
P(x_1 \mid \textcolor{red}{S}) \times P(x_2 \mid \textcolor{red}{S}) \times \dots \times P(x_k \mid \textcolor{red}{S})
$

$
= \prod_{i=1}^{k} P(x_i \mid \textcolor{red}{S})
$

<br>

🧨 In the **Naïve Bayes** algorithm, the probability that an e-mail is spam if it contains certain words is given by:

**Spam Formula**:

$
P(\textcolor{red}{S} \mid x_1, x_2, \dots, x_k) =
\dfrac{
P(\textcolor{red}{S}) \times \prod_{i=1}^{k} P(x_i \mid \textcolor{red}{S})
}{
P(\textcolor{red}{S}) \times \prod_{i=1}^{k} P(x_i \mid \textcolor{red}{S}) +
P(\textcolor{cyan}{N}) \times \prod_{i=1}^{k} P(x_i \mid \textcolor{cyan}{N})
}
$


#### 💻 **Computational Approach**  

Imagine that you have an e-mail inbox with:  

- **8** normal e-mails  
- **4** spam e-mails  

❓ What is the probability that an e-mail containing *"Dear Friend"* is spam? ❓  

$
P(\textcolor{red}{S} \mid \text{"Dear"}, \text{"Friend"})
$


![bayes_example](pics/bayes_compute.png)

#### Probability of an E-mail Being Spam if it Contains "Dear Friend"

The probability of an e-mail being spam given that it contains *"Dear Friend"* is:

$
P(\textcolor{red}{S} \mid \text{"Dear"}, \text{"Friend"}) = $

$\dfrac{
P(\textcolor{red}{S}) \times P(\text{"Dear"} \mid \textcolor{red}{S}) \times P(\text{"Friend"} \mid \textcolor{red}{S})
}{
P(\textcolor{red}{S}) \times P(\text{"Dear"} \mid \textcolor{red}{S}) \times P(\text{"Friend"} \mid \textcolor{red}{S}) +
P(\textcolor{cyan}{N}) \times P(\text{"Dear"} \mid \textcolor{cyan}{N}) \times P(\text{"Friend"} \mid \textcolor{cyan}{N})
}
$

<br>

Substituting values:

$
= \dfrac{\dfrac{4}{8+4} \times \dfrac{2}{7} \times \dfrac{1}{7}}
{\dfrac{4}{8+4} \times \dfrac{2}{7} \times \dfrac{1}{7} + \dfrac{8}{8+4} \times \dfrac{8}{17} \times \dfrac{5}{17}}
$

$
= \dfrac{0.0136}{0.0136 + 0.0923} = 0.129 (= 12.9\%)$
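
The arithmetic above can be checked with a short sketch using exact fractions. The function `posterior_spam` is a hypothetical helper written for this example:

```python
from fractions import Fraction

def posterior_spam(prior_spam, likelihoods_spam, prior_normal, likelihoods_normal):
    """Naive Bayes posterior P(S | words) from class priors and per-word likelihoods."""
    score_spam = prior_spam
    for p in likelihoods_spam:
        score_spam *= p
    score_normal = prior_normal
    for p in likelihoods_normal:
        score_normal *= p
    # Normalize so that the spam and normal posteriors sum to 1
    return score_spam / (score_spam + score_normal)

p = posterior_spam(
    Fraction(4, 12), [Fraction(2, 7), Fraction(1, 7)],    # spam: prior, "Dear", "Friend"
    Fraction(8, 12), [Fraction(8, 17), Fraction(5, 17)],  # normal: prior, "Dear", "Friend"
)
print(p, float(p))
```

Since the posterior is below 0.5, an e-mail containing "Dear Friend" would be classified as normal.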


#### Smoothing

Imagine that you want to compute:

$
P(\textcolor{red}{S} \mid \text{"Dear"}, \text{"Lunch"})
$

**You will be in trouble**  

$
P(\text{"Dear"} \mid \textcolor{red}{S}) = \frac{2}{7} = 0.29
$

$
P(\text{"Lunch"} \mid \textcolor{red}{S}) = \frac{0}{7} = 0.00
$

And when multiplying these probabilities, you will have a **null probability**!  

🤔 **How do we overcome this problem with words that don’t appear in spam e-mails?**  

💡 **We can add +1 (or** $\alpha > 0$ **) to term frequencies.**  

**This is called smoothing, and** $\alpha$ **is the smoothing parameter.**

![smooth_example](pics/example_smooth.png)
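
A minimal sketch of additive smoothing, assuming a vocabulary of four words. The counts for "Dear", "Friend" and "Lunch" follow the example above; "Money" is a hypothetical word added so that the spam counts sum to 7. The helper `smoothed_likelihoods` is illustrative:

```python
# Word counts in spam e-mails; "Money" is a hypothetical filler word
spam_counts = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}

def smoothed_likelihoods(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in counts.items()
    }

probs = smoothed_likelihoods(spam_counts, alpha=1.0)
print(probs["Lunch"])
```

With $\alpha = 1$, "Lunch" now has probability $1/11$ instead of $0$, and the smoothed probabilities still sum to 1.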


### 3.2. Pros and Cons of the NB Algorithm

✅ Pros:

- Easy to implement
- Not an iterative learning process: fast!
- Works particularly well on text data because it can handle a large vocabulary
- The probabilities are estimated by simple counting (no $β$ to learn, no loss function to minimize)

❌ Cons:

- Assumes that the words of a document are conditionally independent given its class, so word order and dependencies on previous words are ignored

### 3.3. 💻 Implementation of the Naive Bayes Algorithm

📚 [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

✉️ Let's have a look at a dataset with thousands of e-mails classified either as spam or as a normal e-mail.

In [50]:
import pandas as pd

data = pd.read_csv("data/ham_spam_emails.csv")
data.head()

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


In [51]:
data.shape

(5728, 2)

In [52]:
round(data["spam"].value_counts(normalize = True), 2)

spam
0    0.76
1    0.24
Name: proportion, dtype: float64

In [53]:
import numpy as np

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Feature/Target
X = data["text"]
y = data["spam"]

# Pipeline vectorizer + Naive Bayes
pipeline_naive_bayes = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)

# Cross-validation
cv_results = cross_validate(pipeline_naive_bayes, X, y, cv = 5, scoring = ["recall"])
average_recall = cv_results["test_recall"].mean()
np.round(average_recall,2)

0.45

👍 On average, the Naive Bayes pipeline with default hyperparameters catches about 45% of the spam e-mails (recall = 0.45), which is a decent starting point for a "naive" model!

### 3.4. 💻 Tuning the Vectorizer and the Naive Bayes Algorithm Simultaneously

🚨 Different vectorizing hyperparameters will affect the performance of the model. As such, it is important to simultaneously tune the hyperparameters of both the vectorizer and the Naive Bayes model.

💡 Remember that all the transformers and estimators of Scikit-Learn can be pipelined!
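
A short sketch of how hyperparameters are addressed inside a pipeline: `make_pipeline` names each step after its class in lowercase, and a hyperparameter is then reached with the `<step name>__<parameter>` syntax. The variable names below are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Step names generated automatically by make_pipeline
step_names = list(pipe.named_steps.keys())
print(step_names)

# Every hyperparameter that can be placed in a grid, as "<step>__<parameter>"
tunable_parameters = [name for name in pipe.get_params() if "__" in name]
print(tunable_parameters)
```

These are the keys that a parameter grid for `GridSearchCV` has to use.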

In [54]:
from sklearn.model_selection import GridSearchCV

# Define the grid of parameters
parameters = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2)),
    'multinomialnb__alpha': (0.1,1)
}

# Perform Grid Search
grid_search = GridSearchCV(
    pipeline_naive_bayes,
    parameters,
    scoring = "recall",
    cv = 5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X, y)

# Best score
print(f"Best Score = {grid_search.best_score_}")

# Best params
print(f"Best params = {grid_search.best_params_}")

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best Score = 0.9524932488436137
Best params = {'multinomialnb__alpha': 0.1, 'tfidfvectorizer__ngram_range': (1, 1)}


## 4. Topic Modeling and Latent Dirichlet Allocation 🔥

📚 This section is based on a research paper called [Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) [diʀiˈkleː] published in 2003 in the Journal of Machine Learning Research by David M. Blei (Columbia), Andrew Y. Ng (Stanford), Michael I. Jordan (Berkeley).


📚 If you read [Wikipedia/Latent_Dirichlet_Model](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Model), you will see that there are plenty of parameters and complex probability distributions to deal with

🐣 Consider this section an introduction to topic modeling. We will give you:

- an intuition about how LDA works
- a way to apply it to some texts


### 4.1. What is LDA?

`Latent Dirichlet Allocation` is an unsupervised algorithm for finding topics in documents

- "Latent" = hidden (topics)
- "Dirichlet" = type of probability distribution
    - Document  →   collection of topics
    - Topic  →   collection of tokens/words

📚 [sklearn.decomposition.LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

👇 Consider the following documents:

In [55]:
documents = pd.DataFrame(['I like mangos and oranges', 'Frogs and turtles live in ponds',
                          'Kittems and pippies are fluffy', 'I had a spincah and kiwi smoothie',
                          'My kitten loves sttrawberries'], columns=['documents'])

documents

Unnamed: 0,documents
0,I like mangos and oranges
1,Frogs and turtles live in ponds
2,Kittems and pippies are fluffy
3,I had a spincah and kiwi smoothie
4,My kitten loves sttrawberries


📑 **Inputs**:

- Document-term matrix: documents to be converted using a vectorizer
- Number of topics: number of topics to be discovered within the documents
    - Each "topic" consists of a set of unordered words  →   **bag-of-words** format
- Number of iterations  →  LDA is an unsupervised iterative process

🎯 **Output**:

- Topics across different documents/pieces of text
    - These topics can be interpreted as "non-linear Principal Components" of the documents in the corpus

### 4.2. 💻 Implementation of the LDA

👇 Remember our original documents?

In [56]:
documents

Unnamed: 0,documents
0,I like mangos and oranges
1,Frogs and turtles live in ponds
2,Kittems and pippies are fluffy
3,I had a spincah and kiwi smoothie
4,My kitten loves sttrawberries


### 4.2.1. 💻 Cleaning the dataset

In [57]:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def cleaning(sentence):

    # Basic cleaning
    sentence = sentence.strip() ## remove whitespaces
    sentence = sentence.lower() ## lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit()) ## remove numbers

    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') ## remove punctuation

    tokenized_sentence = word_tokenize(sentence) ## tokenize
    stop_words = set(stopwords.words('english')) ## define stopwords

    tokenized_sentence_cleaned = [ ## remove stopwords
        w for w in tokenized_sentence if w not in stop_words
    ]

    lemmatizer = WordNetLemmatizer() ## instantiate the lemmatizer once
    lemmatized = [ ## lemmatize each token as a verb
        lemmatizer.lemmatize(word, pos = "v")
        for word in tokenized_sentence_cleaned
    ]

    cleaned_sentence = ' '.join(lemmatized)

    return cleaned_sentence

In [58]:
cleaned_documents = documents["documents"].apply(cleaning)
cleaned_documents.head()

0          like mangos oranges
1       frog turtle live ponds
2       kittems pippies fluffy
3        spincah kiwi smoothie
4    kitten love sttrawberries
Name: documents, dtype: object

### 4.2.2. 💻 Vectorizing

In [59]:
vectorizer = TfidfVectorizer()

vectorized_documents = vectorizer.fit_transform(cleaned_documents)
vectorized_documents = pd.DataFrame(
    vectorized_documents.toarray(),
    columns = vectorizer.get_feature_names_out()
)

vectorized_documents

Unnamed: 0,fluffy,frog,kittems,kitten,kiwi,like,live,love,mangos,oranges,pippies,ponds,smoothie,spincah,sttrawberries,turtle
0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.57735,0.57735,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.5
2,0.57735,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.0,0.0
4,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0
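
The values in this matrix can be checked by hand. A sketch, assuming the default `TfidfVectorizer` settings (each row is L2-normalized): every word of this tiny corpus appears in exactly one document, so all the words of a document receive the same weight, and normalizing a row with $n$ equal weights gives $1/\sqrt{n}$ for each of them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned documents, copied from the output of the cleaning step
cleaned = [
    "like mangos oranges",
    "frog turtle live ponds",
    "kittems pippies fluffy",
    "spincah kiwi smoothie",
    "kitten love sttrawberries",
]

tfidf = TfidfVectorizer().fit_transform(cleaned).toarray()

# Non-zero weights of the 3-word document and of the 4-word document
weights_doc0 = tfidf[0][tfidf[0] > 0]
weights_doc1 = tfidf[1][tfidf[1] > 0]

print(weights_doc0, 1 / np.sqrt(3))
print(weights_doc1, 1 / np.sqrt(4))
```

This is where the 0.57735 (three words) and the 0.5 (four words) come from.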


### 4.2.3. 💻 Finding the topics

In [60]:
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate the LDA
n_components = 2
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)

# Fit the LDA on the vectorized documents
lda_model.fit(vectorized_documents)

**Document Mixture (of Topics)**

In [61]:
document_topic_mixture = lda_model.transform(vectorized_documents)

In [62]:
document_topic_mixture

array([[0.20123894, 0.79876106],
       [0.18588725, 0.81411275],
       [0.20123763, 0.79876237],
       [0.8049096 , 0.1950904 ],
       [0.80490962, 0.19509038]])
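
One way to read this matrix, sketched below: each row holds the topic proportions of one document and sums to 1, so the most likely topic of a document is the column with the largest value. The array is copied from the output above; since the LDA starts from a random initialization, another run may produce different numbers or swap the topic labels.

```python
import numpy as np

# Values copied from the document mixture printed above
mixture_from_output = np.array([
    [0.20123894, 0.79876106],
    [0.18588725, 0.81411275],
    [0.20123763, 0.79876237],
    [0.8049096, 0.1950904],
    [0.80490962, 0.19509038],
])

# Each row is a distribution over the topics
row_sums = mixture_from_output.sum(axis=1)

# Index of the most likely topic for each document
dominant_topic = mixture_from_output.argmax(axis=1)
print(dominant_topic)
```

In this run, documents 0, 1 and 2 lean towards topic 1, while documents 3 and 4 lean towards topic 0.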

🤔 How could our topic modeling be improved?

- by increasing the number of sentences
- by increasing the number of iterations
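
Two further options, offered as a sketch rather than as part of the lecture's pipeline: fixing `random_state` makes the random initialization repeatable, so that two runs return the same topics, and feeding raw word counts (`CountVectorizer`) instead of TF-IDF weights is a common choice, since LDA is formulated as a model of word counts. The helper `fit_lda` is a hypothetical name.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Cleaned documents, copied from the output of the cleaning step
cleaned = [
    "like mangos oranges",
    "frog turtle live ponds",
    "kittems pippies fluffy",
    "spincah kiwi smoothie",
    "kitten love sttrawberries",
]

# Raw word counts instead of TF-IDF weights
counts = CountVectorizer().fit_transform(cleaned)

def fit_lda(seed):
    """Fit a 2-topic LDA with a fixed seed and return the document mixture."""
    lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=seed)
    return lda.fit_transform(counts)

first_run = fit_lda(seed=0)
second_run = fit_lda(seed=0)

# Same seed, same data: the two document mixtures coincide
print(np.allclose(first_run, second_run))
```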

**Topic Mixture (of Words)**

In [63]:
topic_word_mixture = pd.DataFrame(
    lda_model.components_,
    columns = vectorizer.get_feature_names_out()
)

In [64]:
topic_word_mixture

Unnamed: 0,fluffy,frog,kittems,kitten,kiwi,like,live,love,mangos,oranges,pippies,ponds,smoothie,spincah,sttrawberries,turtle
0,0.51658,0.514375,0.51658,1.066362,1.066362,0.516589,0.514375,1.066362,0.516589,0.516589,0.51658,0.514375,1.066362,1.066362,1.066362,0.514375
1,1.06077,0.985625,1.06077,0.510988,0.510988,1.060761,0.985625,0.510988,1.060761,1.060761,1.06077,0.985625,0.510988,0.510988,0.510988,0.985625


What are the five most relevant words for each topic?

In [65]:
def print_topics(lda_model, vectorizer, top_words):
    # 1. TOPIC MIXTURE OF WORDS FOR EACH TOPIC
    topic_mixture = pd.DataFrame(
        lda_model.components_,
        columns = vectorizer.get_feature_names_out()
    )

    # 2. FINDING THE TOP WORDS FOR EACH TOPIC
    ## Number of topics
    n_components = topic_mixture.shape[0]

    ## Top words for each topic
    for topic in range(n_components):
        print("-"*10)
        print(f"For topic {topic}, here are the top {top_words} words with weights:")

        topic_df = topic_mixture.iloc[topic]\
            .sort_values(ascending = False).head(top_words)

        print(round(topic_df,3))

In [66]:
print_topics(lda_model, vectorizer, 5)

----------
For topic 0, here are the top 5 words with weights:
kitten           1.066
love             1.066
sttrawberries    1.066
kiwi             1.066
smoothie         1.066
Name: 0, dtype: float64
----------
For topic 1, here are the top 5 words with weights:
fluffy     1.061
kittems    1.061
pippies    1.061
like       1.061
mangos     1.061
Name: 1, dtype: float64


### 4.3. Bonus: LDA Under the Hood

🎯 The goal of an LDA is to find topics across documents.

The LDA converts the **vectorized documents** (= `document_term_matrix`) into two matrices:

- `document_topic_mixture`
- `topic_word_mixture`
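
A sketch of this decomposition in terms of shapes, using the five cleaned documents of the example (the variable names and `random_state=0` are choices made for this illustration). In Scikit-Learn, the rows returned by `transform` or `fit_transform` already sum to 1, whereas `components_` holds unnormalized weights: dividing each row by its sum turns it into a distribution over the words.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned documents, copied from the output of the cleaning step
cleaned = [
    "like mangos oranges",
    "frog turtle live ponds",
    "kittems pippies fluffy",
    "spincah kiwi smoothie",
    "kitten love sttrawberries",
]

document_term_matrix = TfidfVectorizer().fit_transform(cleaned)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(document_term_matrix)  # documents x topics
topic_word = lda.components_                         # topics x words

# Normalize each topic so that its word weights sum to 1
topic_word_probabilities = topic_word / topic_word.sum(axis=1, keepdims=True)

print(document_term_matrix.shape, doc_topic.shape, topic_word.shape)
```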

<br>

0️⃣ Choose the number of topics you want to detect in your corpus of documents

Example: $n_{components} = 2$  →   $\text{Topic 0}$  and $\text{Topic 1}$

<br>
 
1️⃣ Randomly assign each word in each document to one of the topics

Example: The word "mangos" in $\text{Document 0}$ is randomly assigned to $\text{Topic 1}$

<br>
 
2️⃣ Go through every word and its topic assignment in each document

(1) Document Mixture $p(topic \ t | document \ d)$ → how often the topic $t$ occurs in the document $d$

(2) Topic Mixture $p(word \ w | topic \ t)$ → how often the word $w$ occurs in the topic $t$

(3) Update $p(word \ w \ with \ topic \ t) = p(t | d) \times p(w | t)$

<br>

🔁 Go through multiple iterations of step 2️⃣

<br>

🚀 Eventually, the topics will start making sense
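
<br>

The steps above can be sketched as a simplified Gibbs sampler. This is only an illustration of the intuition: the function name `toy_lda_gibbs` and the smoothing constants `alpha` and `beta` are assumptions made for this sketch, and Scikit-Learn's `LatentDirichletAllocation` relies on a different inference method (variational Bayes).

```python
import numpy as np

def toy_lda_gibbs(docs, n_topics=2, n_iter=50, alpha=0.1, beta=0.1, seed=0):
    """Toy topic model: docs is a list of token lists (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}

    # Counters: words of document d in topic t, and word w in topic t
    doc_topic = np.zeros((len(docs), n_topics))
    topic_word = np.zeros((n_topics, len(vocab)))

    # Step 1: randomly assign each word of each document to a topic
    assignments = []
    for d, doc in enumerate(docs):
        topics = rng.integers(n_topics, size=len(doc))
        assignments.append(topics)
        for word, t in zip(doc, topics):
            doc_topic[d, t] += 1
            topic_word[t, word_id[word]] += 1

    # Step 2, repeated: revisit every word and resample its topic
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w, t_old = word_id[word], assignments[d][i]
                # Remove the current assignment from the counters
                doc_topic[d, t_old] -= 1
                topic_word[t_old, w] -= 1

                # (1) proportional to p(topic t | document d)
                p_topic_doc = doc_topic[d] + alpha
                # (2) p(word w | topic t)
                p_word_topic = (topic_word[:, w] + beta) / (
                    topic_word.sum(axis=1) + beta * len(vocab)
                )
                # (3) combine both and draw the new topic of the word
                weights = p_topic_doc * p_word_topic
                t_new = rng.choice(n_topics, p=weights / weights.sum())

                assignments[d][i] = t_new
                doc_topic[d, t_new] += 1
                topic_word[t_new, w] += 1

    # Turn the smoothed counters into the two mixtures
    doc_mixture = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    topic_mixture = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return doc_mixture, topic_mixture, vocab

# Cleaned documents of the example, split into tokens
docs = [
    "like mangos oranges".split(),
    "frog turtle live ponds".split(),
    "kittems pippies fluffy".split(),
    "spincah kiwi smoothie".split(),
    "kitten love sttrawberries".split(),
]

doc_mixture, topic_mixture, vocab = toy_lda_gibbs(docs)
print(np.round(doc_mixture, 2))
```

With only five short documents that share no words, the topics found this way are not expected to be meaningful; the point is the mechanics of the loop.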

<br>

**Document Mixture (of Topics)**

- Computing $p(topic \ t | document \ d)$ for every topic and every document is called **document mixture**

- The ideal document mixture for our example would be the following (*assuming, without verification, that topic 0 = food and topic 1 = animals*):

In [67]:
# Use a dedicated name so that the e-mail DataFrame `data` is not overwritten
ideal_document_topic = {
    "documents": [
        "I like mangos and oranges",
        "Frogs and turtles live in ponds",
        "Kittens and puppies are fluffy",
        "I had a spinach and kiwi smoothie",
        "My kitten loves strawberries"
    ],
    "topic_food": [1.0, 0.0, 0.0, 1.0, 0.5],
    "topic_animals": [0.0, 1.0, 1.0, 0.0, 0.5]
}

# Convert to DataFrame
document_topic_matrix_ideal = pd.DataFrame(ideal_document_topic)

document_topic_matrix_ideal

Unnamed: 0,documents,topic_food,topic_animals
0,I like mangos and oranges,1.0,0.0
1,Frogs and turtles live in ponds,0.0,1.0
2,Kittens and puppies are fluffy,0.0,1.0
3,I had a spinach and kiwi smoothie,1.0,0.0
4,My kitten loves strawberries,0.5,0.5


**Topic Mixture (of Words)**

- Computing $p(word \ w | topic \ t)$ for every word and every topic is called **topic mixture**
- The ideal topic mixture for our example would be the following:

In [68]:

# Use a dedicated name so that the e-mail DataFrame `data` is not overwritten
ideal_topic_word = {
    "topic": ["topic_food", "topic_animals"],
    "like": [1, 1],
    "mangos": [1, 0],
    "oranges": [1, 0],
    "frog": [0, 1],
    "turtle": [0, 1],
    "live": [0, 1],
    "ponds": [0, 1],
    "kitten": [0, 1],
    "puppies": [0, 1],
    "fluffy": [0, 1],
    "spinach": [1, 0],
    "kiwi": [1, 0],
    "smoothie": [1, 0],
    "love": [1, 0],
    "strawberry": [1, 0]
}

# Convert to DataFrame
topic_word_matrix_ideal = pd.DataFrame(ideal_topic_word)

# Display the DataFrame
topic_word_matrix_ideal


Unnamed: 0,topic,like,mangos,oranges,frog,turtle,live,ponds,kitten,puppies,fluffy,spinach,kiwi,smoothie,love,strawberry
0,topic_food,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1
1,topic_animals,1,0,0,1,1,1,1,1,1,1,0,0,0,0,0


Hands-On! 🚀