# Natural Language Processing

Welcome, everyone, to NLP. Today we are going to talk about it from an ML perspective.

# Motivation

#### So far, we have been running Machine Learning algorithms with:
1. Numerical Inputs:

Example: Predicting House Prices

- Algorithm: Linear Regression, Random Forest Regression, Gradient Boosting Regression, etc.
- Features: Square footage, number of bedrooms, number of bathrooms, surface.
- Target: Sale price of the house (numerical)

2. Encoded Categorical Inputs:

Example: Customer Churn Prediction (refers to the task of predicting which customers are likely to stop using a product or service, i.e., churn. Churn prediction is crucial for businesses because retaining existing customers is generally more cost-effective than acquiring new ones.)

- Algorithm: Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier, etc.
- Features: Customer demographics (encoded as categorical variables), subscription plan, usage patterns, etc.
- Target: Binary variable indicating whether the customer churned or not (1 for churn, 0 for no churn)


#### How can we incorporate textual data into these Machine Learning algorithms, and which ML models are dedicated to language-related tasks?

Incorporating textual data into machine learning algorithms relies on techniques from the field of natural language processing (NLP). NLP focuses on enabling computers to understand, interpret, and generate human language. Two steps are usually involved:

- Preprocessing: before text can be fed to a machine learning algorithm, it usually goes through steps such as tokenization (splitting text into words or subwords), lowercasing, removing punctuation and stop words, and stemming or lemmatization (we will see how to do these in class today).
- Feature extraction: once the text is preprocessed, it must be converted into a numerical form that machine learning algorithms can work with. Common techniques include bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), word embeddings (e.g., Word2Vec), and contextual embeddings (e.g., BERT, GPT).

# 🚀 Thanks to the development of NLP libraries, NLP is finding applications on an industrial level

- Email Filtering: NLP algorithms can distinguish between legitimate emails and spam by analyzing the content, sender information, and other metadata.
- Sentiment Analysis: Companies use sentiment analysis to gauge public opinion on their products, services, or brands by analyzing social media posts, customer reviews, and feedback.
- Chatbots: Chatbots powered by NLP can provide customer support, answer queries, and perform tasks like scheduling appointments or making reservations through natural language interactions.
- Voice/Speech Recognition: NLP is fundamental to voice and speech recognition systems, allowing devices to understand and interpret spoken language commands for tasks like voice searches, dictation, and controlling smart home devices.
- Smart Assistants: Virtual assistants like Siri, Alexa, and Google Assistant utilize NLP to understand user requests, execute commands, and provide personalized responses or recommendations.
- Language Translation: NLP facilitates automatic language translation, allowing businesses to translate documents, websites, or communication in real-time to communicate with a global audience effectively.

# What is NLP?

# Plan

The process for text classification or topic modeling involves several key steps:

- Raw Text Input: Raw text data is collected from various sources such as customer reviews, social media posts, emails, or news articles.

- Text Cleaning and Preprocessing: Text cleaning involves removing noise and irrelevant information from the raw text. This may include removing punctuation, special characters, numbers, and stopwords (commonly occurring words like "and," "the," "is," etc.). Additionally, techniques like stemming or lemmatization may be applied to reduce words to their base or root forms. Preprocessing may also involve handling issues like lowercase conversion, spell checking, and handling rare or misspelled words.
- Text Vectorization: Once the text is cleaned and preprocessed, it needs to be converted into a numerical format that machine learning algorithms can understand. This process is called text vectorization. 
- Modeling: After vectorization, the processed text data can be fed into various NLP models for classification or topic modeling, for example:
    - Naive Bayes Classifier: A simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between features. It's commonly used for text classification tasks due to its simplicity and efficiency.
    - Latent Dirichlet Allocation (LDA): A generative probabilistic model that represents documents as mixtures of topics and topics as mixtures of words. LDA is often used for topic modeling to discover latent topics within a corpus of text documents.

- Model Training and Evaluation: The model is trained (on a labeled dataset in the case of classification) and evaluated using appropriate metrics such as accuracy, precision, recall, and F1-score, to determine how effectively it classifies texts or models topics.
- Deployment and Monitoring: Once the model is trained and evaluated, it can be deployed in production environments to make predictions or extract topics from new text data. The model's performance should be monitored over time, and it may be retrained or fine-tuned periodically to maintain its accuracy and effectiveness. You will learn more about this during the MLOps modules.
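
The plan above can be sketched end to end with scikit-learn. This is only a minimal illustration, not the pipeline we build later: the tiny `reviews` dataset and its labels are made up for the example, and a real project would evaluate the model on a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy dataset: 1 = positive review, 0 = negative review
reviews = [
    "great food and great service",
    "amazing pizza, lovely staff",
    "terrible food, awful service",
    "awful pizza and rude staff",
]
labels = [1, 1, 0, 0]

# Vectorization and classification chained into a single estimator
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reviews, labels)

# New text goes through the same fitted vectorizer before being classified
prediction = model.predict(["the service was great"])
print(prediction)
```

Chaining the vectorizer and the classifier in one pipeline guarantees that new text is vectorized exactly like the training text.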

# 1. Text Preprocessing

# 👨🏻‍🏫 For any Machine Learning algorithm, data preprocessing is crucial, and this remains true for algorithms dealing with text - read

# 💻 🧹 Basic cleaning with Python core string operations - read

# 💻 ✂️ strip (1/2)

In [88]:
texts = [
    '   Bonjour, comment ca va ?     ',
    '    Heyyyyy, how are you doing ?   ',
    '        Hallo, wie gehts ?     '
]
texts

['   Bonjour, comment ca va ?     ',
 '    Heyyyyy, how are you doing ?   ',
 '        Hallo, wie gehts ?     ']

In [89]:
[text.strip() for text in texts]

['Bonjour, comment ca va ?',
 'Heyyyyy, how are you doing ?',
 'Hallo, wie gehts ?']

# 💻 ✂️ strip (2/2)

In [90]:
text = "abcd Who is abcd ? That's not a real name!!! abcd"
text

"abcd Who is abcd ? That's not a real name!!! abcd"

Here, the strip() method is used to remove leading and trailing characters from the string. In this case, it removes any occurrences of the characters 'b', 'd', 'a', or 'c' from the beginning and end of the string. 

In [91]:
text.strip('bdac')

" Who is abcd ? That's not a real name!!! "

# 💻 👥 replace

In [92]:
text = "I love koalas, koalas are the cutest animals on Earth."
text

'I love koalas, koalas are the cutest animals on Earth.'

In [93]:
text.replace("koala", "panda")

'I love pandas, pandas are the cutest animals on Earth.'

# 💻 🪚 split

In [94]:
text = "linkin park / metallica /red hot chili peppers"

In [95]:
text.split("/")

['linkin park ', ' metallica ', 'red hot chili peppers']
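
Notice the leftover spaces around some of the items. A small sketch of how `split` and `strip` can be combined to get clean items (the variable name `bands` is just for illustration):

```python
text = "linkin park / metallica /red hot chili peppers"

# Split on the separator, then strip the leftover whitespace around each item
bands = [band.strip() for band in text.split("/")]
print(bands)
# ['linkin park', 'metallica', 'red hot chili peppers']
```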

# 💻 🔡 Lowercase

In [96]:
text = "i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?"
text

'i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?'

In [97]:
text.lower() 

'i love football so much. football is my passion. who else loves football ?'

In [98]:
text.upper() 

'I LOVE FOOTBALL SO MUCH. FOOTBALL IS MY PASSION. WHO ELSE LOVES FOOTBALL ?'

# 💻 🔢 Numbers

Removing numbers during text preprocessing is often beneficial, especially for tasks like text clustering and keyphrase extraction. Here's why:

- Text Clustering: Clustering algorithms, such as K-means or hierarchical clustering, group similar documents together based on their features. Including numbers in the text can introduce noise and hinder the clustering process because numbers typically do not carry semantic meaning or contribute significantly to the similarity between documents. By removing numbers, the clustering algorithm can focus on the meaningful textual content, leading to more accurate clusters that reflect the data.

- Collecting Keyphrases: Keyphrase extraction involves identifying the most important phrases or terms in a document that capture its main topics or concepts. Including numbers in the text can lead to irrelevant or nonsensical keyphrases being extracted, because numbers are rarely informative in this context. Removing numbers helps ensure that the keyphrases extracted from the text are relevant and representative of its content, improving the quality of the extracted information.

In [99]:
text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"
text

'i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous'

In [100]:
cleaned_text = ''.join(char for char in text if not char.isdigit())
cleaned_text

'i do not recommend this restaurant, we waited for so long, like  minutes, this is ridiculous'

In [101]:
print('a'.isdigit())
print('5'.isdigit())

False
True
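
The character-by-character approach leaves a double space where the number used to be ("like  minutes"). As an alternative sketch, assuming we are happy to use the standard `re` module, we can remove runs of digits and then collapse the extra whitespace:

```python
import re

text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"

# Remove runs of digits, then collapse repeated whitespace into a single space
no_digits = re.sub(r"\d+", "", text)
cleaned_text = re.sub(r"\s+", " ", no_digits).strip()
print(cleaned_text)
```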


# 💻 ❗️❓Punctuation and Symbols

#### Warning: you might want to keep punctuation and symbols for authorship attribution!

Writing Style Identification:

Punctuation and symbols play a crucial role in shaping an author's writing style. Factors such as the frequency and placement of commas, dashes, exclamation marks, and other symbols can be distinctive features of an author's writing.
Retaining punctuation and symbols allows the model to capture these nuances in writing style accurately, improving the accuracy of authorship attribution.

In [102]:
text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "
text

'I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ '

In [103]:
import string # "string" module is already installed with Python
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

`string.punctuation` is a string containing all ASCII punctuation characters. 
<br>
These characters include common symbols such as exclamation marks, double quotes, hash symbols, percent signs, ampersands, apostrophes, parentheses, asterisks, plus signs, commas, hyphens, periods, slashes, colons, semicolons, less than signs, equals signs, greater than signs, question marks, at symbols, square brackets, backslashes, circumflex accents, underscores, grave accents, curly braces, vertical bars, and tildes.

In [104]:
for punctuation in string.punctuation:
    text = text.replace(punctuation, '') 
    
text

'I love bubble tea OMG so tasty channel XOXO   '

In [105]:
text.strip()

'I love bubble tea OMG so tasty channel XOXO'
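
The loop above calls `replace` once per punctuation character. An equivalent sketch using `str.maketrans` and `str.translate` does the same job in a single pass over the text (the name `table` is just for illustration):

```python
import string

text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "

# Build a translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)
cleaned_text = text.translate(table).strip()
print(cleaned_text)
# I love bubble tea OMG so tasty channel XOXO
```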

# 💻 💪 Combo: strip + lowercase + numbers + punctuation/symbols

In [106]:
sentences = [
    "   I LOVE Pizza 999 @^_^", 
    "  Le Wagon is amazing, take care - 666"
]

In [107]:
def basic_cleaning(sentence):
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') 
    
    sentence = sentence.strip()
    
    return sentence

In [108]:
cleaned = [basic_cleaning(sentence) for sentence in sentences]
cleaned

['i love pizza', 'le wagon is amazing take care']

#### Join


The line `''.join(char for char in sentence if not char.isdigit())` removes every digit (numeric character) from the variable `sentence`. Let's break it down step by step:

Iteration over Characters:

`for char in sentence`
This part iterates over each character of the `sentence` string, one by one.

Conditional Filtering:

`if not char.isdigit()`
For each character (`char`), this condition checks that the character is not a digit. `isdigit()` is a built-in string method that returns True if the string is non-empty and all of its characters are digits, and False otherwise. Since `char` is a single character here, it simply tells us whether that character is a digit. The `not` keyword negates the result, so the condition is True when the character is not a digit.

Joining Characters:

`''.join(...)`
The `join()` method concatenates the characters back together into a new string. It takes an iterable (in this case, a generator expression) as input and joins its elements using the string it is called on as the separator. Here the separator is the empty string `''`, so the characters are joined without anything between them.

Generator Expression:

`(char for char in sentence if not char.isdigit())`
Inside `join()`, there is a generator expression. It iterates over each character of `sentence` and yields only the characters that are not digits, which filters the digits out of the original string.

Result:

The final result is a new string containing only the non-digit characters of the original `sentence`, in their original order.

In [109]:
sentence = ''.join(char for char in sentences[0] if not char.isdigit())
sentence

'   I LOVE Pizza  @^_^'

# 💻 🔍 Removing Tags with RegEx

In [110]:
import re

text = """<head><body>Hello Le Wagon!</body></head>"""
cleaned_text = re.sub(r'<[^<]+?>', '', text)  # remove anything that looks like a tag

print(cleaned_text)

Hello Le Wagon!


In [111]:
import re

txt = '''
    This is a random text, authored by darkvader@gmail.com 
    and batman@outlook.com, WOW!
'''

# raw string (r'...') so that \w and \. reach the regex engine unchanged
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', txt)

['darkvader@gmail.com', 'batman@outlook.com']

# 💻 Cleaning with NLTK

Natural Language Toolkit (NLTK) is an NLP library that provides preprocessing and modeling tools for text data.

# 💻 🌲 Tokenizing - read

In [112]:
text = 'It is during our darkest moments that we must focus to see the light'

text

'It is during our darkest moments that we must focus to see the light'

In [113]:
from nltk.tokenize import word_tokenize
import nltk
# nltk.download('punkt')

word_tokens = word_tokenize(text)
print(word_tokens) # print displays the words in one line

['It', 'is', 'during', 'our', 'darkest', 'moments', 'that', 'we', 'must', 'focus', 'to', 'see', 'the', 'light']
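
To get an intuition for what a tokenizer does, here is a simplified stand-in written with a regular expression. This is not how `word_tokenize` works internally (NLTK handles many more cases, such as contractions); it is only a sketch of the idea of splitting text into word and punctuation tokens.

```python
import re

text = "It is during our darkest moments that we must focus to see the light."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```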


# 💻 🛑 Stopwords

In [114]:
from nltk.corpus import stopwords 
import nltk
# nltk.download('stopwords')
# set(...): The list of stopwords is converted into a set. Using a set ensures that duplicate
# stopwords are removed, and it allows for efficient membership checks.
stop_words = set(stopwords.words('english')) # you can also choose other languages
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

🕺🏻 Here is an example of a tokenized sentence:

In [115]:
tokens = ["i", "am", "going", "to", "go", "to", "the", 
        "club", "and", "party", "all", "night", "long"]

#### ❓ What stopwords could be removed ❓

In [116]:
stopwords_removed = [w for w in tokens if w in stop_words] 
stopwords_removed

['i', 'am', 'to', 'to', 'the', 'and', 'all']

❓ What are the meaningful words in this sentence ❓

#### 👉 What if you are not going to the party?

😱 "not" is also considered as a stopword

So guys, the slide says here that removing stopwords can be dangerous for sentiment analysis and authorship attribution.

We have to be careful with this statement. Why?

When it comes to sentiment analysis:

Stopword removal is not always harmful for sentiment analysis. It can even be beneficial by reducing noise and focusing on sentiment-carrying words. However, its impact depends on the context and on the stopword list used. Some stopwords carry sentiment information themselves: removing "not" turns "not going" into "going" and "not good" into "good". So it's essential to review which stopwords are removed, and to keep negations when polarity matters.

And when it comes to authorship attribution, more caution is needed. Stopwords are mostly function words ("the", "of", "but", ...), and how often an author uses them is a well-known stylistic marker, alongside vocabulary choice, sentence structure, and punctuation. Removing them can therefore discard part of the signal an attribution model relies on, so it is usually safer to keep them for this task.
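
One way to act on the "not" problem is to remove negations from the stopword list before filtering. The sketch below uses a small hand-written `stop_words` set as a stand-in for `stopwords.words('english')`, so that it runs without downloading anything; the `negations` set is an assumption you would adapt to your task.

```python
# Hypothetical mini stopword list, standing in for stopwords.words('english')
stop_words = {"i", "am", "not", "to", "the", "and", "all"}

# Keep negations, since they flip the meaning of the sentence
negations = {"not", "no", "nor"}
custom_stop_words = stop_words - negations

tokens = ["i", "am", "not", "going", "to", "the", "party"]
kept = [w for w in tokens if w not in custom_stop_words]
print(kept)
# ['not', 'going', 'party']
```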

# 💻 🌐 Lemmatizing - read

#### 👇 Look at the following sentence:

In [117]:
sentence = 'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'
sentence

'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

#### 🧹 Step 1: Basic Cleaning (the function we created)

In [118]:
sentence

'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

In [119]:
cleaned_sentence = basic_cleaning(sentence)
cleaned_sentence

'he was running and eating at the same time  he has a bad habit of swimming after playing  hours in the sun'

# 🎄 Step 2 : Tokenize

So what is tokenizing again folks?

In [120]:
tokenized_sentence = word_tokenize(cleaned_sentence)
print(tokenized_sentence)

['he', 'was', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', 'he', 'has', 'a', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'hours', 'in', 'the', 'sun']


# 🛑 Step 3: Remove Stopwords

In [121]:
tokenized_sentence_no_stopwords = [w for w in tokenized_sentence if w not in stop_words]
print(tokenized_sentence_no_stopwords)

['running', 'eating', 'time', 'bad', 'habit', 'swimming', 'playing', 'hours', 'sun']


# 🌐 Step 4: Lemmatizing

What does lemmatizing do exactly?

It reduces words to their base or canonical form, known as the lemma. The lemma represents the dictionary form or root word of a given word, which allows different inflected forms of the word to be treated as a single item. For example, the lemma of "running" is "run," and the lemma of "better" is "good."

In [122]:
from nltk.stem import WordNetLemmatizer
import pandas as pd
import nltk
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Lemmatizing the verbs
verb_lemmatized = [
    WordNetLemmatizer().lemmatize(word, pos="v")  # v --> verbs
    for word in tokenized_sentence_no_stopwords
]

Here, a list comprehension iterates over each word in the `tokenized_sentence_no_stopwords` list.
For each word, `WordNetLemmatizer().lemmatize(word, pos="v")` is called. It lemmatizes the word using the given part-of-speech (POS) tag, here "v" for verb. Words for which no verb lemma is found (e.g., "hours") are returned unchanged.
The results are stored in the `verb_lemmatized` list.

In the same way, a second list comprehension iterates over each word in `verb_lemmatized`.
For each word, `WordNetLemmatizer().lemmatize(word, pos="n")` is called with the POS tag "n", for noun this time.
The results are stored in the `noun_lemmatized` list.

In [122]:
# Lemmatizing the nouns
noun_lemmatized = [
    WordNetLemmatizer().lemmatize(word, pos="n")  # n --> nouns
    for word in verb_lemmatized
]

Here we create a DataFrame with three columns: the original tokens, the tokens after verb lemmatization, and the tokens after noun lemmatization. Displaying it lets us compare the original and lemmatized forms side by side.

In [123]:
# Create a DataFrame
df = pd.DataFrame({
    'Original Verb': tokenized_sentence_no_stopwords,
    'Verb Lemmatized': verb_lemmatized,
    'Noun Lemmatized': noun_lemmatized
})

# Display the DataFrame
df

| | Original Verb | Verb Lemmatized | Noun Lemmatized |
|---|---|---|---|
| 0 | running | run | run |
| 1 | eating | eat | eat |
| 2 | time | time | time |
| 3 | bad | bad | bad |
| 4 | habit | habit | habit |
| 5 | swimming | swim | swim |
| 6 | playing | play | play |
| 7 | hours | hours | hour |
| 8 | sun | sun | sun |


 Lemmatizing is useful for:

- topic modeling
- sentiment analysis

- Topic Modeling: In topic modeling, the goal is to identify themes or topics within a collection of documents. Lemmatization helps by reducing words to their base or canonical form, which helps in consolidating different inflected forms of words into a single representation. This reduces the vocabulary size and helps in identifying topics more accurately by treating different forms of the same word as the same token. For example, "run", "running", and "ran" would all be lemmatized to "run", making it easier for topic modeling algorithms to recognize the underlying topic related to running or physical activity.

- Sentiment Analysis: In sentiment analysis, the goal is to determine the sentiment or opinion expressed in a piece of text. Lemmatization can be beneficial in sentiment analysis by standardizing words and reducing word variations. This helps in capturing the sentiment conveyed by words more accurately, regardless of their inflected forms. For example, lemmatizing "better" to "good" ensures that variations like "better" and "best" are treated similarly in sentiment analysis, as they convey similar positive sentiment.

# 🥡 Preprocessing Text - Takeaways - read

# 🤔 Now that the text is preprocessed, how can it be analyzed by Machine Learning algorithms?

# 2. Vectorizing

# 🤖 Machine Learning algorithms cannot process raw text, as it needs to be converted into numbers first

So we have the texts. 

- Vectorize Texts: Use techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) to convert the text data into numerical vectors. TF-IDF represents each document as a vector of word frequencies weighted by how rare each word is across the corpus, while word embeddings represent each word as a dense vector in a continuous space. You can use libraries like scikit-learn (for TF-IDF) or TensorFlow/Keras (for word embeddings) to perform text vectorization.

- Encode Target Variable: Encode the target variable (e.g., "normal email" vs. "spam") into numerical labels. For binary classification tasks like spam detection, you can use label encoding to convert categorical labels into numerical values (e.g., 0 for "normal email" and 1 for "spam").

- Train the Model: Fit a classifier on the vectorized texts and the encoded labels.

- Make Predictions: Once the model is trained, use it to make predictions on new text data. Vectorize the new text data with the vectorizer that was fitted during training, and then feed the vectorized data into the trained model to get predictions. The predictions will be numerical labels representing the predicted class (e.g., 0 for "normal email" and 1 for "spam").
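
The vectorizing and encoding steps can be sketched as follows. The training texts and labels are made up for the example. The key point is that new text is passed through `transform` (not `fit_transform`), so that its columns match the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data
train_texts = ["win a free prize now", "meeting moved to monday", "free prize inside"]
train_labels = ["spam", "normal email", "spam"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # learns the vocabulary

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_labels)     # classes are sorted alphabetically

# New text: transform only, so the columns match the training matrix
X_new = vectorizer.transform(["claim your free prize"])

print(encoder.classes_, y_train)
print(X_train.shape, X_new.shape)
```

Words of the new text that were never seen during training (here "claim" and "your") are simply ignored by the fitted vectorizer.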

# 2.1. Bag-of-Words representation

# 👩🏻‍🏫 Bag-of-Words representation(BoW)  - read

# 💻 CountVectorizer - Read

#### 👇 Look at the following sentences:

In [124]:
texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

#### 👩🏻‍🔬 Let's apply the CountVectorizer to generate a Bag-of-Words representation of these four sentences

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)
X.toarray()

array([[1, 1, 0, 0, 0, 1, 1, 2, 1, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0]])

So what is going on in this code?

`count_vectorizer = CountVectorizer()`
CountVectorizer is a class provided by scikit-learn for converting a collection of text documents into a matrix of token counts. Each row of the matrix represents a document, and each column represents a unique word (or token) in the entire corpus of documents.
Calling `CountVectorizer()` creates a CountVectorizer object with default parameters. You can customize parameters such as tokenization rules, stopword removal, and n-gram range, but in this case we use the default settings. `fit_transform(texts)` then learns the vocabulary from the texts and returns the count matrix `X` as a sparse matrix.

In the context of natural language processing (NLP) and text analysis, a document refers to a single unit of text data. Here, each of the four sentences is a document.

The `toarray()` method converts the sparse matrix X into a dense array format. Sparse matrices store only non-zero entries, which is efficient for memory usage when dealing with large matrices where most entries are zero. However, dense arrays store all entries, including zeros, which makes them more memory-intensive but easier to work with for certain operations.
By calling `toarray()`, the sparse matrix X is converted into a 2D NumPy array, where each row corresponds to a document and each column corresponds to a word, with the entry representing the count of that word in the document.
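
To make the sparse versus dense distinction concrete, here is a small sketch on the same four sentences. It only uses attributes that SciPy sparse matrices expose (`shape`, `nnz`); the variable name `dense` is just for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

X = CountVectorizer().fit_transform(texts)

print(sparse.issparse(X))  # fit_transform returns a SciPy sparse matrix
print(X.shape)             # (number of documents, vocabulary size)
print(X.nnz)               # number of non-zero entries actually stored

dense = X.toarray()
print(dense.size)          # total number of cells in the dense array
```

On a corpus this small the difference is negligible, but with thousands of documents and tens of thousands of words, most cells are zero and the sparse format saves a lot of memory.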

🤔 Can you guess which column represents which word?

# 🔥 As soon as the CountVectorizer is fitted to the text, you can retrieve all the words seen with get_feature_names_out():

In [126]:
count_vectorizer.get_feature_names_out()

array(['cat', 'dog', 'for', 'good', 'health', 'is', 'running', 'the',
       'with', 'young', 'your'], dtype=object)

In [129]:
# here we turn out results into a dataframe
import pandas as pd

vectorized_texts = pd.DataFrame(
    X.toarray(), 
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

vectorized_texts

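| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |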


# Be aware that there are some limitations when it comes to the bag-of-words representation - read

While the Bag-of-Words (BoW) representation is effective at capturing the frequency of individual words in a document, it ignores word order, so it cannot capture context or the sequential relationships between words. For example, "the dog chased the cat" and "the cat chased the dog" get exactly the same representation.

This is when we call N-grams to the rescue!

N-grams are contiguous sequences of n items (words in the context of NLP), where n refers to the number of words in the sequence. By considering sequences of words instead of individual words, N-grams capture more contextual information from the text data.

# 2.2. Tf-idf Representation - read

# Term Frequency (tf) & CountVectorizer - read

In [130]:
vectorized_texts

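| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |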


Here we calculate the term frequency (TF) of the word "young" in a document. Term frequency measures how often a term (word) appears in a document relative to the total number of words in that document.

In our example, take the last document, 'young young young young young cat cat cat':

The word "young" appears 5 times in the document.

The total number of words in the document is 8.

To calculate the term frequency (TF) for "young", you divide the number of occurrences of "young" by the total number of words in the document:

TF("young") = (Number of occurrences of "young") / (Total number of words in the document)
= 5 / 8
= 0.625

So, the term frequency (TF) for the word "young" in the document is 0.625. This means that "young" accounts for 62.5% of the total words in the document. TF is often used as a feature in text analysis tasks such as information retrieval, document classification, and sentiment analysis.
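
The same calculation can be sketched in a few lines of plain Python, using a simple whitespace split as the tokenizer (an assumption that is fine for this already-clean sentence):

```python
from collections import Counter

document = "young young young young young cat cat cat"
tokens = document.split()
counts = Counter(tokens)

# Term frequency = occurrences of the word / total number of words
tf = {word: count / len(tokens) for word, count in counts.items()}
print(tf)
# {'young': 0.625, 'cat': 0.375}
```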

# Document Frequency (df) - read

❓ In our last example, could we compute $df_{cat}$, $df_{young}$ and $df_{the}$ ❓

In [133]:
vectorized_texts

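| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |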


# What if we considered the relative document frequency of a word x, computed as $df_x / N$?

If we consider the relative document frequency (DF) of a word, denoted as DF(x), it represents the proportion of documents in a corpus that contain the word x. The formula for computing the relative document frequency of a word x is:

DF(x) = Number of documents containing word x / Total number of documents in the corpus

Here:
- Number of documents containing word x is the count of documents in which the word x appears at least once.
- Total number of documents in the corpus is the total count of all documents.

Relative document frequency provides insight into how widespread or common a word is across the entire corpus. Words with high DF values are likely to be common terms, while words with low DF values are likely to be rare or specialized terms.

It's worth noting that relative document frequency is often used in conjunction with term frequency-inverse document frequency (TF-IDF) weighting to assign weights to terms in a document. TF-IDF considers both the term frequency (TF) within a document and the inverse document frequency (IDF) across the corpus to determine the importance of a term in a document relative to the entire corpus.
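
As a sketch, the document frequencies from our example can be read off the count matrix with NumPy. The `counts` array below copies the "cat", "young" and "the" columns of the table by hand.

```python
import numpy as np

# Count matrix from the CountVectorizer example, restricted to three columns
words = ["cat", "young", "the"]
counts = np.array([
    [1, 1, 2],   # the young dog is running with the cat
    [0, 0, 0],   # running is good for your health
    [1, 1, 0],   # your cat is young
    [3, 5, 0],   # young young young young young cat cat cat
])

n_documents = counts.shape[0]
df = (counts > 0).sum(axis=0)      # number of documents containing each word
relative_df = df / n_documents     # proportion of documents containing it

for word, d, r in zip(words, df, relative_df):
    print(word, d, r)
```

Note that "young" appears 5 times in the last document but still counts only once towards its document frequency.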

# 👩🏻‍🏫 A word x

The inverse document frequency (IDF) is a measure used in information retrieval and text mining to determine the importance of a term within a corpus of documents. It is often used in conjunction with term frequency (TF) to calculate TF-IDF scores, which weigh the importance of terms in a document relative to the entire corpus.

The IDF of a term x is calculated as follows:

IDF(x) = log(N / df(x))

Where:

- N is the total number of documents in the corpus.
- df(x) is the document frequency of the term x, i.e., the number of documents in the corpus that contain the term x.

Since the relative document frequency is DF(x) = df(x) / N, IDF(x) is simply the logarithm of its inverse: log(1 / DF(x)).

When the relative document frequency of a word x is low, the word appears in a small proportion of the documents in the corpus, which leads to a high IDF value. Conversely, when a word has a high relative document frequency, it appears in many documents across the corpus, resulting in a low IDF value.

Basically, in the context of TF-IDF weighting:

- Words with low relative document frequency (high IDF) are considered important: they are rare across the corpus, so their presence says something specific about the few documents in which they occur.
- Words with high relative document frequency (low IDF) are common terms that occur across many documents, making them less informative for distinguishing individual documents.
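
Here is a small worked example on our four documents. It assumes the natural logarithm (the formula above does not fix the base) and takes the document frequencies from the count table: "cat" and "young" appear in 3 documents, "the" in 1.

```python
import math

n_documents = 4
document_frequency = {"cat": 3, "young": 3, "the": 1}

# IDF(x) = log(N / df(x)), using the natural logarithm (an assumption)
idf = {word: math.log(n_documents / df) for word, df in document_frequency.items()}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
# cat: 0.288
# young: 0.288
# the: 1.386
```

Within this tiny corpus, "the" is the rarest of the three words, so it gets the highest IDF. On a realistic English corpus it would appear in almost every document and get an IDF close to 0.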

# next slide -read

# Tf-idf Formula - read

# 👩🏻‍🏫 Weight of a word x in a document d - read

# Summarizing

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects the importance of a term (word) within a document relative to a collection of documents (corpus). It is calculated by multiplying the term frequency (TF), which measures how often a term appears in a document, by the inverse document frequency (IDF), which measures how rare a term is across the entire corpus. TF-IDF assigns higher weights to terms that are frequent within a document but rare across the corpus, highlighting terms that are both relevant and discriminative for characterizing the content of individual documents. By considering both local (within-document) and global (corpus-wide) term characteristics, TF-IDF is widely used in information retrieval, text mining, and natural language processing tasks to improve the accuracy and effectiveness of document analysis, search, and classification.

# 2.3. 💻 TfidfVectorizer

The TfidfVectorizer is a feature extraction method provided by the scikit-learn library in Python, which converts a collection of raw documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. This method combines the functionality of both CountVectorizer and TfidfTransformer into a single step, making it convenient for transforming text data into a numerical representation suitable for machine learning algorithms.

In [134]:
texts

['the young dog is running with the cat',
 'running is good for your health',
 'your cat is young',
 'young young young young young cat cat cat']

In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [136]:
# Instantiating the TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer()

# Training it on the texts
weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),
                 columns = tf_idf_vectorizer.get_feature_names_out())

weighted_words

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.227904 | 0.357056 | 0.0 | 0.0 | 0.0 | 0.227904 | 0.281507 | 0.714112 | 0.357056 | 0.227904 | 0.0 |
| 1 | 0.0 | 0.0 | 0.463709 | 0.463709 | 0.463709 | 0.29598 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.365594 |
| 2 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.580622 |
| 3 | 0.514496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.857493 | 0.0 |


The code above demonstrates how to use the TfidfVectorizer from the scikit-learn library to transform a collection of raw text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features and then convert it into a DataFrame for further analysis.

`weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),`<br>
`columns = tf_idf_vectorizer.get_feature_names_out())`

We call the fit_transform() method of the TfidfVectorizer object on the texts input. This method tokenizes the input texts, calculates the TF-IDF scores for each word in each document, and returns a sparse matrix representation of the TF-IDF features.
We convert the sparse matrix to a dense array format using .toarray().
Then, we create a DataFrame named weighted_words from the dense array. Each column in the DataFrame corresponds to a unique word (feature) extracted from the texts, and each row represents a document. The cell values are the corresponding TF-IDF scores for each word in each document.
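
The values in the table can be reproduced by hand. The sketch below assumes scikit-learn's default settings, which (according to its documentation) use a smoothed IDF, ln((1 + N) / (1 + DF(x))) + 1, and then L2-normalize each row. This differs slightly from the plain formula, so the numbers will not match a naive TF × log(N / DF) computation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat',
]

# Raw term counts (TF): one row per document, one column per word
counts = CountVectorizer().fit_transform(texts).toarray()

# N: number of documents; DF(x): number of documents containing word x
n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)

# Smoothed IDF (assumed to be scikit-learn's default, smooth_idf=True)
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# TF x IDF, then L2-normalize each row (assumed default, norm='l2')
tf_idf = counts * idf
manual = tf_idf / np.linalg.norm(tf_idf, axis=1, keepdims=True)

# Compare with the library's result
from_sklearn = TfidfVectorizer().fit_transform(texts).toarray()
print(np.allclose(manual, from_sklearn))
```

The L2 normalization explains why the weights in each row of the table are all below 1, even for a word repeated five times.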

# Controlling the vocabulary size:

The Curse of Dimensionality refers to various challenges and phenomena that arise when working with high-dimensional data in machine learning and data analysis. As the number of dimensions (features) in the dataset increases, the volume of the data space grows exponentially, leading to several consequences and difficulties. In text vectorization, every distinct word becomes a feature, so controlling the vocabulary size directly controls the dimensionality.

# 💻 Key parameters of TfidfVectorizer (and CountVectorizer)

# 💻 max_df (resp. min_df)

# How to use these parameters in practice?

- max_df (Maximum Document Frequency): This parameter specifies the threshold for the maximum document frequency of terms. Terms that appear in more documents than the specified threshold will be ignored. If max_df is a float between 0.0 and 1.0, it represents the maximum proportion of documents a term may appear in to be kept. For example, max_df = 0.5 means to ignore terms that appear in more than 50% of the documents. If max_df is an integer, it represents an absolute count of documents. For example, max_df = 20 means to ignore terms that appear in more than 20 documents.

- min_df (Minimum Document Frequency): This parameter specifies the threshold for the minimum document frequency of terms. Terms that appear in fewer documents than the specified threshold will be ignored. If min_df is a float between 0.0 and 1.0, it represents the proportion of documents in which a term must appear in order to be considered. For example, min_df = 0.1 means to ignore terms that appear in less than 10% of the documents. If min_df is an integer, it represents the absolute count of documents. For example, min_df = 5 means to ignore terms that appear in fewer than 5 documents.

- Defaults:
    - By default, max_df = 1.0, meaning no "frequent" word will be removed.
    - By default, min_df = 1, meaning no "infrequent" word will be removed (every word in the vocabulary appears in at least one document).

In [137]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example usage with specified max_df and min_df
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)

In [138]:
tfidf_vectorizer

This creates a TfidfVectorizer object with a maximum document frequency of 50% and a minimum document frequency of 2 documents.

Adjusting these parameters allows you to control the size and quality of the vocabulary used for text analysis tasks.

In [140]:
# TF-IDF weight of each word in each document
weighted_words

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.227904 | 0.357056 | 0.0 | 0.0 | 0.0 | 0.227904 | 0.281507 | 0.714112 | 0.357056 | 0.227904 | 0.0 |
| 1 | 0.0 | 0.0 | 0.463709 | 0.463709 | 0.463709 | 0.29598 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.365594 |
| 2 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.580622 |
| 3 | 0.514496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.857493 | 0.0 |


- X.toarray() converts the sparse matrix X (which likely represents a document-term matrix with TF-IDF or word counts) into a dense NumPy array. This operation converts the sparse matrix into a format that can be easily converted into a DataFrame.
- columns=count_vectorizer.get_feature_names_out() retrieves the feature names (i.e., the terms or words) from the CountVectorizer object count_vectorizer. These feature names are used as column labels in the DataFrame.
- index=texts sets the index of the DataFrame to be the texts variable. Presumably, texts contains the original raw text data used to create the document-term matrix. Each row in the DataFrame corresponds to a document, and the index labels each row with the corresponding raw text data.

In [145]:
# Instantiate the CountVectorizer with max_df = 2
count_vectorizer = CountVectorizer(max_df = 2) # removing "cat", "is", "young"

# Train it
X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    # A sparse matrix is a matrix that contains a large number of zero elements relative
    # to its total size. In other words, most of the entries in a sparse matrix are zero.
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

X

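| | dog | for | good | health | running | the | with | your |
|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| running is good for your health | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| your cat is young | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| young young young young young cat cat cat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |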


In [146]:
texts

['the young dog is running with the cat',
 'running is good for your health',
 'your cat is young',
 'young young young young young cat cat cat']

# 💻 max_features

# How to use "max_features" in practice?

Here, count_vectorizer is an instance of the CountVectorizer class from the scikit-learn library, which is used to convert a collection of text documents into a matrix representing the count of each word (term) in each document.
fit_transform(texts) method fits the count_vectorizer to the texts data and transforms the text data into a document-term matrix. Each row of the matrix corresponds to a document, and each column corresponds to a unique word in the vocabulary. The values in the matrix represent the count of each word in each document.


In [147]:
# CountVectorizer with the 3 most frequent words
count_vectorizer = CountVectorizer(max_features = 3)

X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    X.toarray(),
     columns = count_vectorizer.get_feature_names_out(),
     index = texts
)

X

| | cat | is | young |
|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 1 |
| running is good for your health | 0 | 1 | 0 |
| your cat is young | 1 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 5 |


#  Advantages of the Tf-idf representation - read

# 2.4. N-grams

N-grams are contiguous sequences of n items (or words) from a given text or speech sample. These items can be characters, syllables, words, or even other linguistic units like morphemes or phonemes. N-grams are widely used in natural language processing (NLP) and text analysis tasks to capture local patterns and dependencies between adjacent elements in a sequence of text.

In [148]:
actors_movie = [
    "I like the movie but NOT the actors",
    "I like the actors but NOT the movie"
]

In [149]:
# Vectorize the sentences
count_vectorizer = CountVectorizer()
actors_movie_vectorized = count_vectorizer.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized = pd.DataFrame(
    actors_movie_vectorized.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


# 🧑🏻‍🏫 When using a bag-of-words representation, an efficient way to capture context is to consider:

# 💻 ngram_range

The ngram_range parameter takes a tuple (min_n, max_n):

- min_n is the minimum length of the N-grams to be considered.
- max_n is the maximum length of the N-grams to be considered.

For example:

- ngram_range=(1, 1) considers only unigrams (single words).
- ngram_range=(1, 2) considers both unigrams and bigrams.
- ngram_range=(2, 2) considers only bigrams.
- ngram_range=(1, 3) considers unigrams, bigrams, and trigrams.

# 😥 With a unigram vectorization, we couldn't distinguish two sentences with the same words.

In [150]:
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


While unigram vectorization is useful for capturing the occurrence of individual words in each document, it does not consider the order or sequence of words within the document. Therefore, two sentences with the same words but in different orders will have identical unigram representations.

For example, consider the following two sentences:

"The quick brown fox jumps over the lazy dog."<br>
"The lazy dog jumps over the quick brown fox."
<br>
If we use unigram vectorization, both sentences will have the same vector representation because they contain the same words, regardless of the word order. This lack of consideration for word order means that unigram vectorization cannot distinguish between sentences that have the same words but different meanings or contexts.

# 👩🏻‍🔬 What about a bigram vectorization?

In [151]:
# Vectorize the sentences
count_vectorizer_n_gram = CountVectorizer(ngram_range = (2,2)) # BI-GRAMS
actors_movie_vectorized_n_gram = count_vectorizer_n_gram.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized_n_gram = pd.DataFrame(
    actors_movie_vectorized_n_gram.toarray(),
    columns = count_vectorizer_n_gram.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies with bigrams
actors_movie_vectorized_n_gram

| | actors but | but not | like the | movie but | not the | the actors | the movie |
|---|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 0 | 1 | 1 | 1 |


😄 The two sentences are now distinguishable

To overcome this limitation and capture the sequence of words, we can use techniques such as bigram or n-gram vectorization, which consider sequences of words (e.g., pairs of consecutive words). By incorporating the order of words into the vectorization process, these techniques can capture more detailed information about the structure and semantics of the text, allowing for better differentiation between sentences with similar word compositions but different meanings.

# 🥡 Vectorizing - Takeaways

# 🚀 Let's discover two NLP algorithms:

# (Multinomial) Naive Bayes Algorithm

Bayes' Theorem is a fundamental concept in probability theory that describes the probability of an event based on prior knowledge or conditions related to the event. It provides a way to update our beliefs or probabilities about an event in light of new evidence or information.
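
In symbols, for two events A and B with P(B) > 0, the theorem can be written as follows (standard notation, which may differ from the one on the slides). P(A) is the prior belief and P(A | B) is the updated belief after observing B:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```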

# ✉️ The E-mail Classification Problem


The email classification problem involves categorizing emails into different classes or categories based on their content, subject, sender, or other relevant features. This task is often tackled using machine learning and natural language processing techniques to automatically classify incoming emails into predefined categories. The goal is to efficiently handle and organize large volumes of emails by automatically routing them to the appropriate folders or personnel.

# 👩🏻‍🏫 Mathematical Approach

![2024-02-20_06-36-29.png](attachment:9d469412-ebb6-408c-89f4-36d5a9820f07.png)

# Law of total probabilities

The law of total probability allows us to calculate the probability of an event by summing over a set of disjoint cases that cover all possibilities, weighting the probability of the event in each case by the probability of that case.
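
Written out for the spam setting, with the two disjoint cases "spam" and "not spam" (event names chosen here for illustration), the law gives the probability of observing a given e-mail content:

```latex
P(\text{words}) = P(\text{words} \mid \text{spam})\, P(\text{spam})
                + P(\text{words} \mid \text{not spam})\, P(\text{not spam})
```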

# Conditional Probability

![2024-02-20_06-43-17.png](attachment:4639b74f-208e-4314-ac3b-47c94fcdbd15.png)

# 👉 Let's focus on a specific term:

![2024-02-20_06-45-50.png](attachment:318089a4-c0f0-455f-8d9c-ed63d464edc7.png)

# By applying the independence property:

![2024-02-20_06-47-30.png](attachment:c5ba26f1-8e8b-4007-9385-c24bbfa30e62.png)

# Spam Formula

![2024-02-20_06-49-14.png](attachment:9024fed7-ac45-4aab-9312-162ed3444227.png)

# 💻 Computational Approach

# Imagine that you have an e-mail inbox with:

![2024-02-20_06-52-30.png](attachment:d7747c73-7e10-44da-a0a8-51f9f7dbca60.png)

# Probability of being spam if the e-mail contains Dear Friend

# 🏂 Smoothing

![2024-02-20_06-55-18.png](attachment:2b2d1235-d84c-4541-a35d-718ae87e95ea.png)

# We can add +1 (or α>0) to term frequencies.

![2024-02-20_07-04-03.png](attachment:c8305290-bbb5-463d-8dbe-4246ed08be45.png)
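
The effect of adding α can be sketched in plain Python. The word counts below are made up for illustration (they are not the inbox from the slides), and `word_likelihoods` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def word_likelihoods(word_counts, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    Each count gets +alpha, and the denominator grows by alpha * |vocabulary|
    so that the probabilities still sum to 1.
    """
    total = sum(word_counts.get(word, 0) for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {
        word: (word_counts.get(word, 0) + alpha) / denominator
        for word in vocabulary
    }


vocabulary = ["dear", "friend", "lunch", "money"]

# Hypothetical counts of each word in spam e-mails; "lunch" never appears in spam
spam_counts = Counter({"dear": 2, "friend": 1, "money": 4})

unsmoothed = word_likelihoods(spam_counts, vocabulary, alpha=0.0)
smoothed = word_likelihoods(spam_counts, vocabulary, alpha=1.0)

print(unsmoothed["lunch"])  # zero: would wipe out the whole product of probabilities
print(smoothed["lunch"])    # small but non-zero
```

Without smoothing, a single unseen word forces the probability of the whole e-mail under that class to zero, no matter what the other words say.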

# 3.2. Pros and Cons of the NB Algorithm

# 3.3. 💻 Implementation of the Naive Bayes Algorithm

# 3.4. 💻 Tuning the Vectorizer and the Naive Bayes Algorithm Simultaneously


This section uses GridSearchCV from scikit-learn to tune the hyperparameters of a machine learning pipeline that combines text vectorization and classification, here for spam detection. Let's break down the key components and explain how they work together:

GridSearchCV
GridSearchCV is a powerful tool for automating the process of tuning hyperparameters to find the best possible model performance. It systematically works through multiple combinations of parameter options, cross-validating as it goes to determine which parameters give the best performance.

Parameters

In [154]:
parameters = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2)),
    'multinomialnb__alpha': (0.1,1)
}

This dictionary defines the grid of hyperparameters to be tested.
'tfidfvectorizer__ngram_range': ((1,1), (2,2)) specifies the n-gram range for the TfidfVectorizer step of the pipeline. An n-gram range of (1,1) means only unigrams (single words) are considered, while (2,2) means only bigrams (pairs of consecutive words).
'multinomialnb__alpha': (0.1,1) sets the alpha parameter for the MultinomialNB (Naive Bayes) classifier. Alpha is the smoothing parameter: 0.1 and 1 are the values to be tested.


#### Performing Grid Search

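- pipeline_naive_bayes is the estimator being tuned. Its definition is not shown here; it is assumed to be a Pipeline object that includes at least a TfidfVectorizer step and a MultinomialNB classifier step. The parameter names in the grid (tfidfvectorizer__..., multinomialnb__...) match the step names that make_pipeline generates automatically.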
- scoring = "recall" indicates that recall is the metric used to evaluate the performance of the model for each parameter combination. Recall is particularly important in applications like spam detection where missing a positive case (e.g., failing to identify a spam email) can be more problematic than falsely identifying a negative case as positive.
- cv = 5 specifies that 5-fold cross-validation is used. This means the data is split into 5 parts; in each iteration, 4 parts are used for training and 1 part is used for testing, cycling through all parts.
- n_jobs=-1 tells GridSearchCV to use all available CPU cores to perform the computations in parallel, speeding up the grid search process.
- verbose=1 prints basic progress information during the grid search, such as the number of candidates and fits. Higher values print more detail.

# Fitting the Model

The fit step trains the GridSearchCV instance on the dataset, where data.text contains the text to be classified and data.spam contains the class labels (e.g., spam or not spam).

# Results

- After fitting, grid_search.best_score_ provides the best recall score achieved across all parameter combinations.
- grid_search.best_params_ shows the parameters that led to the best recall score, helping you understand which configuration of ngram_range and alpha is most effective for your classification task.
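
Putting the pieces together, a self-contained sketch might look like this. The toy `data` DataFrame is invented, and the pipeline is built with make_pipeline (an assumption, chosen because it yields step names matching the parameter grid). cv=3 and n_jobs=1 replace the values discussed above only because the toy dataset is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy dataset: 6 spam and 6 non-spam messages
data = pd.DataFrame({
    "text": [
        "win free money now", "free prize claim now", "cheap money offer today",
        "claim your free offer", "win a cheap prize", "free money free prize",
        "lunch with the team today", "meeting notes for the project",
        "see you at lunch", "project deadline is friday",
        "team meeting moved to friday", "notes from the project meeting",
    ],
    "spam": [1] * 6 + [0] * 6,
})

# make_pipeline names the steps "tfidfvectorizer" and "multinomialnb"
pipeline_naive_bayes = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Grid of hyperparameters, addressed as <step name>__<parameter name>
parameters = {
    'tfidfvectorizer__ngram_range': ((1, 1), (2, 2)),
    'multinomialnb__alpha': (0.1, 1),
}

grid_search = GridSearchCV(
    pipeline_naive_bayes,
    parameters,
    scoring="recall",
    cv=3,        # 3 folds here because the toy dataset is very small
    n_jobs=1,    # n_jobs=-1 would use all CPU cores
    verbose=1,
)

grid_search.fit(data.text, data.spam)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

With 2 values for each of the 2 hyperparameters, the search evaluates 4 candidate configurations, each one cross-validated.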

# 4. Topic Modeling and Latent Dirichlet Allocation 🔥

# 4.1. What is LDA?

Latent Dirichlet Allocation is a generative statistical model that is used to discover abstract topics within a collection of documents. It belongs to the field of natural language processing and text mining. LDA assumes that documents are produced from a mixture of topics, and those topics generate words based on their probability distribution.

Topic Modeling: LDA is used for topic modeling, where the goal is to identify the underlying themes or topics that pervade a large collection of documents. It helps in discovering the hidden thematic structure in a large corpus of text, making it easier to manage, organize, and provide recommendations based on content.

Generative Process: In LDA, each document can be seen as a mixture of various topics, and each topic is characterized by a distribution over words. LDA models the generative process of documents, where it assumes a document is created by first choosing a distribution over topics, and then for each word in the document, a topic is chosen based on this distribution, and finally, a word is selected from the chosen topic.
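
The generative story can be simulated in a few lines of NumPy. This is only a sketch of the process LDA assumes, not of how LDA is fitted. The vocabulary, the two topic-word distributions, and the helper `generate_document` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = np.array(["kitten", "puppy", "fluffy", "kiwi", "spinach", "smoothie"])

# Hypothetical topics: each row is a probability distribution over the vocabulary
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # an "animals" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "healthy food" topic
])


def generate_document(n_words, alpha=0.5):
    """Generate one document following the LDA generative process."""
    n_topics = topic_word.shape[0]
    # 1. Choose the document's distribution over topics
    topic_proportions = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # 2. For each word position, choose a topic from that distribution
        topic = rng.choice(n_topics, p=topic_proportions)
        # 3. Choose a word from the chosen topic's distribution over words
        words.append(str(rng.choice(vocabulary, p=topic_word[topic])))
    return topic_proportions, words


topic_proportions, words = generate_document(n_words=8)
print(topic_proportions)
print(" ".join(words))
```

Fitting LDA amounts to running this story in reverse: given only the words, infer plausible topic-word distributions and per-document topic proportions.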

# LDA is an unsupervised...

# 👇 Consider the following documents:

# Input and Output

**Latent Dirichlet Allocation (LDA)** is a type of topic modeling used in natural language processing and text mining to uncover hidden thematic structures within a collection of documents. It is an unsupervised learning algorithm that identifies topics based on the distribution of words across a set of documents. Here is a breakdown of the inputs and outputs of the LDA process:

### Inputs for LDA:

1. **Document-term matrix**: This is a matrix representation of the corpus, where each row corresponds to a document and each column represents a unique word in the corpus. The values in the matrix typically represent the frequency of each word in each document, though they can also be binary (denoting the presence or absence of a word) or TF-IDF (Term Frequency-Inverse Document Frequency) scores.

2. **Number of topics**: This is a predefined number indicating how many distinct topics you expect the algorithm to discover within the corpus. The choice of the number of topics can significantly affect the granularity of the topics found by LDA.

3. **Bag-of-words format**: The document-term matrix is based on the bag-of-words model, which treats each document as a collection of words without considering the order of words. This simplification is crucial for LDA's operation, focusing on the occurrence of words to infer topics.

4. **Number of iterations**: LDA is an iterative algorithm that alternates between two steps: assigning topics to documents and updating the distribution of words associated with each topic. The number of iterations is a parameter that controls how long the algorithm runs, allowing the model to refine its estimates of word-topic and document-topic distributions.

### Output of LDA:

- **Topics across different documents**: LDA outputs a set of topics, each represented as a distribution over words in the corpus. These topics are essentially clusters of words that frequently occur together across the documents.

- **Interpretation as "non-linear Principal Components"**: While not a technically precise analogy, thinking of the topics as "non-linear Principal Components" can be helpful for intuition. Just as Principal Component Analysis (PCA) identifies the axes (principal components) that maximize the variance in the data, LDA identifies themes (topics) that best capture the distribution of words across documents. However, unlike PCA, which is a linear method, LDA does not rely on linear combinations of original variables (words) and works on the probability distribution over a fixed set of topics.

Each document in the corpus is then represented as a mixture of these topics, where the contribution of each topic to a document is expressed as a probability. This allows for a nuanced understanding of the themes that pervade the corpus and how each document relates to those themes.

# 4.2. 💻 Implementation of the LDA

# 👇 Remember our original documents?

# 4.2.1. 💻 Cleaning the dataset

# apply

- .apply(cleaning): The .apply() method is used to apply a function along an axis of the DataFrame or Series. In this case, it's applied to the Series of documents. The cleaning function is a user-defined function that you're applying to each document in the Series. This function is expected to perform certain cleaning operations on the text data, such as:
    - Removing special characters, punctuation, or numbers.
    - Lowercasing all the text to maintain consistency.
    - Removing stopwords (common words that add little value to text analysis, like "the", "is", "in", etc.).
    - Stemming or lemmatization (reducing words to their root form or base form).
- cleaned_documents = ...: The result of applying the cleaning function to each document is assigned back to a variable named cleaned_documents. This variable now holds a Series where each document has been processed by the cleaning function.
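
As a rough sketch of what such a `cleaning` function could look like (the actual function used in class is not shown here), the version below uses only the standard library and a tiny hand-written stopword list. Stemming and lemmatization are left out to keep it dependency-free, and the two example documents are made up.

```python
import re
import string

import pandas as pd

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "i"}


def cleaning(sentence):
    """Lowercase, strip digits and punctuation, and drop stopwords."""
    sentence = sentence.lower()
    sentence = re.sub(r"\d+", "", sentence)
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    words = [word for word in sentence.split() if word not in STOPWORDS]
    return " ".join(words)


documents = pd.Series([
    "The 2 kittens are FLUFFY!",
    "I love spinach, kiwi and smoothies.",
])

# Apply the cleaning function to every document in the Series
cleaned_documents = documents.apply(cleaning)
print(cleaned_documents.tolist())
```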

# 4.2.2. 💻 Vectorizing

# 4.2.3. 💻 Finding the topics

`from sklearn.decomposition import LatentDirichletAllocation`
- This line imports the LatentDirichletAllocation class from the decomposition module of the scikit-learn library. Scikit-learn is a popular machine learning library in Python that provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction, including LDA for topic modeling.

`n_components = 2`<br>
`lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)`

- n_components: This parameter specifies the number of topics to extract from the documents. In this case, n_components = 2 means the model will identify 2 distinct topics.
- max_iter: This parameter defines the maximum number of iterations the algorithm will run to converge towards the optimal solution. Here, max_iter = 100 sets a limit of 100 iterations.
- An instance of LatentDirichletAllocation is created with these parameters, and it's stored in the variable lda_model. This instance is configured to identify 2 topics from the data and will iterate up to 100 times to optimize its internal topic distribution models.


`lda_model.fit(vectorized_documents)`<br>
Before fitting the LDA model, the documents must be converted into a numerical format that the algorithm can process. This is typically done using vectorization techniques like Count Vectorizer or TF-IDF Vectorizer, which transform the text documents into a document-term matrix.

- vectorized_documents: This variable is expected to contain the vectorized form of the documents, representing the document-term matrix. Each row in this matrix corresponds to a document, and each column represents a term (or word) from the corpus. The values in the matrix might be raw term frequencies or TF-IDF scores, depending on the vectorization technique used.
- The .fit() method is then called on lda_model with vectorized_documents as its argument. This method fits the LDA model to the document-term matrix, allowing the model to learn the distribution of topics across documents and the distribution of words across topics.

After fitting, the lda_model object can be used to inspect the topics discovered in the documents, the distribution of topics in each document, and the words that are most representative of each topic. This is invaluable for understanding the latent thematic structure of a large corpus, summarizing content, and organizing or categorizing documents based on their dominant topics.

# Document Mixture of topics

This step calls the transform method of the fitted LatentDirichletAllocation (LDA) model from scikit-learn on the set of vectorized documents. It comes after the LDA model has been fitted to the document-term matrix. Here's what happens during this step:

Transform Method in LDA
Purpose: The transform method is used to infer the topic distribution for each document in the dataset based on the LDA model learned from the data. It essentially assigns each document a mixture of topics, where each topic contributes a certain proportion to the document.

Process: The LDA model, which has already learned the topic-word distributions (how words are distributed across topics) during the fitting process, now uses this information to determine the distribution of topics in each document. It does this by examining the words in each document and their corresponding weights in the context of the learned topic-word distributions.

Output: Document-Topic Mixture
document_topic_mixture: This variable will contain the output from the transform method, which is a matrix where each row corresponds to a document in the original dataset, and each column represents a topic. The values in this matrix are proportions that sum to 1 for each row, indicating the weight or contribution of each topic to the corresponding document.

For example, if there are 2 topics (as specified by n_components=2 in your LDA model), and you have 100 documents, document_topic_mixture will be a 100x2 matrix. Each row of this matrix will have two values that sum to 1, indicating the proportion of each of the two topics in each document.
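
The shape and the row sums can be checked directly. The sketch below refits a small model on hypothetical documents so that it runs on its own; the documents and `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned documents
cleaned_documents = [
    "kitten fluffy love",
    "puppies fluffy love kitten",
    "love kitten puppies",
    "kiwi spinach smoothie",
    "smoothie kiwi healthy",
    "spinach smoothie healthy kiwi",
]

vectorizer = CountVectorizer()
vectorized_documents = vectorizer.fit_transform(cleaned_documents)

lda_model = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=0)
lda_model.fit(vectorized_documents)

# Infer the topic proportions of each document
document_topic_mixture = lda_model.transform(vectorized_documents)

print(document_topic_mixture.shape)        # (number of documents, number of topics)
print(document_topic_mixture.sum(axis=1))  # each row sums to 1
print(np.round(document_topic_mixture, 3))
```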

# Topic Mixture of words

Improving topic modeling, especially when using methods like Latent Dirichlet Allocation (LDA), can depend on several factors, including the quality and quantity of your data, the parameters you choose for the model, and how you preprocess your text data.

1. Increasing the Number of Sentences
Increasing the number of sentences, or more broadly, the amount of text data available for topic modeling, can significantly improve the model's performance. More data provides the model with a better opportunity to learn the distribution of words across topics and the distribution of topics across documents. However, the benefits of adding more data may plateau beyond a certain point, especially if the additional data does not introduce new information or variations in topics.

2. Increasing the Number of Iterations
The number of iterations refers to how many times the LDA algorithm will cycle through the entire dataset to adjust the topic assignments based on the data and the model's current state. Increasing the number of iterations can lead to a more stable and accurate model, as it allows more opportunities for the model to refine its understanding of how words are associated with topics. However, there's a trade-off: more iterations also mean more computational time and resources, and beyond a certain point, additional iterations may not yield significant improvements.

**Additional Ways to Improve Topic Modeling**

Optimizing Model Parameters: Beyond just increasing iterations, tuning other model parameters (like n_components for the number of topics, learning_decay for controlling the learning rate, etc.) can also improve model performance.

Better Preprocessing: Cleaning the text data more effectively (removing stop words, applying stemming or lemmatization, and excluding too frequent or too rare words) can help the model focus on more meaningful patterns.

Feature Selection: Using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) instead of raw counts for vectorization can help emphasize more informative words.

Evaluating Topic Coherence: Using metrics like topic coherence to evaluate the quality of the topics generated by the model can guide adjustments to the preprocessing steps or model parameters for better interpretability and relevance of topics.

# 🕵🏻 What are the five most relevant words for each topic?

The `print_topics` function displays the top words associated with each topic identified by an LDA model, along with their corresponding weights. It is useful for interpreting the results of topic modeling by identifying which words are most characteristic of each topic. Here's a breakdown of how the function works:

### Steps in the `print_topics` Function

1. **Topic Mixture of Words for Each Topic**:
   - The function begins by creating a DataFrame called `topic_mixture` from `lda_model.components_`. This attribute of the LDA model contains the topic-word matrix, where each entry (i, j) represents the importance of word j in topic i.
   - The columns of the DataFrame are set to the names of the features (words) in the document-term matrix, retrieved using `vectorizer.get_feature_names_out()`.

2. **Finding the Top Words for Each Topic**:
   - The function then calculates the number of topics (`n_components`) by getting the number of rows in `topic_mixture`, which corresponds to the number of topics the LDA model was asked to identify.
   - It iterates over each topic, printing a header for the topic and then finding the top words for that topic.
     - For each topic, it selects the corresponding row from `topic_mixture` and sorts it in descending order to bring the most relevant (or highest weighted) words to the top.
     - It uses `.head(top_words)` to select the specified number of `top_words` from this sorted list.
   - Finally, it prints the top words for the topic, along with their weights, rounded to three decimal places for readability.

### Usage

To use this function, you simply need to pass in the fitted LDA model (`lda_model`), the vectorizer used to transform your documents (`vectorizer`), and the number of top words you wish to display for each topic (`top_words`). For example, if you want to see the five most relevant words for each topic, you would call `print_topics(lda_model, vectorizer, 5)`.

### Output

The output will consist of a series of blocks, each corresponding to one of the topics identified by the LDA model. Within each block, you'll see the top words for that topic along with their weights, indicating the relative importance or contribution of each word to the topic. This information can be invaluable for understanding the thematic focus of each topic and for interpreting the overall results of the topic modeling process.

# `print_topics(lda_model, vectorizer, 5)`

The output of `print_topics` shows the composition of the topics identified by the Latent Dirichlet Allocation (LDA) model applied to the collection of documents. It displays the five most relevant words for each of the two topics, along with their corresponding weights. Here's what this tells us about the topics:

### Interpretation of the Output

- **Topic 0**: This topic seems to be related to animals and perhaps affection or cuteness, as indicated by words like "kitten," "fluffy," "puppies," and "love." The word "strawberries" might seem out of place in this context, suggesting that the documents might have mixed content or that there's some association in the data between animals and strawberries (perhaps in the context of things people love). The weights indicate the relative importance of these words within the topic, with "kitten" being the most significant.

- **Topic 1**: This topic appears to focus on a healthy or green lifestyle, possibly smoothies or health foods, given words like "kiwi," "smoothie," and "spinach." The presence of "frog" and "live" might suggest content related to nature or living things, or it could reflect a less cohesive topic if the documents contain varied content. As with Topic 0, the weights provide a sense of how central each word is to the theme of the topic.

### Understanding the Weights

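The weights next to each word represent the word's importance or contribution to the topic. Assuming the weights come from the fitted model's `components_` attribute, scikit-learn describes them as pseudo-counts: roughly, how many times each word was assigned to the topic across the corpus. They are not adjusted for how common a word is across all documents (that kind of adjustment is what TF-IDF does), so a word that is frequent everywhere can rank highly in more than one topic. Dividing each topic's row by its sum turns the weights into a probability distribution over words. A higher weight means the word is more probable under that topic, and therefore more characteristic of it within this dataset.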

### Implications

This output is crucial for understanding what the LDA model has learned from the data:

- **Topic Coherence**: The relevance and coherence of the words within each topic can give you an idea of how well the LDA model is performing. Coherent topics have top words that make sense together, suggesting the model is effectively capturing meaningful patterns in the data.

- **Dataset Insights**: By examining the top words for each topic, you can gain insights into the underlying themes or subjects present in your dataset. This can be particularly valuable for exploratory data analysis, content categorization, or feature engineering for further machine learning tasks.

- **Model Tuning**: If the topics seem incoherent or not useful, it might indicate a need to improve data preprocessing (e.g., removing irrelevant words), collect more data, or adjust the model parameters. In the LDA model setup, the two most direct parameters to try are `n_components` (the number of topics) and `max_iter` (the maximum number of passes over the training data).

The `print_topics` function thus serves as a valuable tool for both qualitative evaluation of the LDA model's performance and for gaining thematic insights into the text corpus being analyzed.
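One way to act on the model-tuning point above is to fit the model with several values of `n_components` and compare a numeric score alongside the printed topics. The sketch below uses scikit-learn's `perplexity` method on an invented toy corpus. Perplexity computed on the training data is only a rough signal (held-out documents and a manual look at the top words are usually more informative), so treat this as an illustration rather than a recommended procedure.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this sketch.
docs = [
    "I love my fluffy kitten",
    "puppies and kittens are fluffy",
    "spinach and kiwi smoothie",
    "I love a kiwi and strawberry smoothie",
    "my kitten loves strawberries",
    "a frog lives near the spinach",
]

X = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, max_iter=20, random_state=0)
    lda.fit(X)
    scores[k] = lda.perplexity(X)  # lower is better

for k, p in scores.items():
    print(f"n_components={k}: perplexity={p:.1f}")
```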

# 2️⃣ Go through every word and its topic assignment in each document

This step is part of the inner workings of Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model that assumes each document in a corpus can be represented as a mixture of topics, and each topic, in turn, is characterized by a distribution over words. The goal of LDA is to learn these distributions: the distribution of topics in each document and the distribution of words in each topic. Let's break down the steps:

![2024-02-20_08-09-57.png](attachment:52f1e902-bc6c-4976-9686-48d3ceea592e.png)
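The generative assumption described above (a document is a mixture of topics, and each topic is a distribution over words) can be sketched in a few lines of plain Python. The two topics, their word probabilities, and the Dirichlet parameter `alpha` are hypothetical values chosen for illustration. Fitting LDA amounts to running this story in reverse: starting from the observed words and inferring the mixtures.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions (each sums to 1), invented for illustration.
topics = {
    0: {"kitten": 0.4, "fluffy": 0.3, "puppies": 0.2, "love": 0.1},
    1: {"kiwi": 0.4, "smoothie": 0.3, "spinach": 0.2, "love": 0.1},
}


def sample_dirichlet(alpha):
    # A Dirichlet draw is a vector of independent Gamma draws, normalized to sum to 1.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=(0.5, 0.5)):
    # Topic proportions for this document (the "document mixture").
    doc_mixture = sample_dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 1) Pick a topic for this word position from the document's mixture.
        topic = random.choices(list(topics), weights=doc_mixture)[0]
        # 2) Pick a word from that topic's word distribution (the "topic mixture").
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(random.choices(vocab, weights=weights)[0])
    return doc_mixture, words


doc_mixture, words = generate_document(8)
print([round(p, 3) for p in doc_mixture])
print(" ".join(words))
```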

# Document Mixture (of Topics)

![2024-02-20_08-15-28.png](attachment:0a25abba-7f86-4a2a-bf87-760b096b7e53.png)
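In scikit-learn, the document mixture can be obtained from a fitted model with `transform`, which returns one row per document and one column per topic. The sketch below uses an invented toy corpus, and the column labels are an illustrative choice.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this sketch.
docs = [
    "I love my fluffy kitten",
    "puppies and kittens are fluffy",
    "spinach and kiwi smoothie",
    "I love a kiwi and strawberry smoothie",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda_model = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is one document's distribution over topics.
doc_topics = lda_model.transform(X)

document_mixture = pd.DataFrame(
    doc_topics,
    index=docs,
    columns=[f"Topic {i}" for i in range(doc_topics.shape[1])],
)
print(document_mixture.round(3))
```

Each row sums to 1, so the values can be read as the share of each topic in that document.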

# Topic Mixture (of Words)

![2024-02-20_08-20-27.png](attachment:92e8f0cb-6353-4ba5-9162-671a8db98ffd.png)