> # **Context**
>
> ### **Objectives**
>
> In this section, we will explore the fundamental concept of text vectorization. We'll discover why and how text, naturally a form of unstructured data, can be transformed into a structured numerical representation, usable for various applications in Natural Language Processing (NLP).
>
> ### **What Is Text Vectorization?**
>
> Text vectorization is the process of converting text into numerical vectors. In NLP, this technique is crucial because it allows computers to "understand" and process text using mathematical and statistical methods. Raw text, composed of words, sentences, and syntax, isn't directly interpretable by algorithms. We therefore need to convert it into a format that machines can process, typically numerical vectors.
>
> ### **Why Vectorize Text?**
>
> Vectorization is an essential step for performing tasks such as text classification, spam detection, sentiment analysis, and many other NLP applications. Without a proper numerical representation, it would be challenging, if not impossible, for machine learning models to effectively process and analyze text. By converting words and sentences into vectors, we can harness the power of mathematical algorithms to infer meanings, recognize patterns, and even generate text.
>
> In the following sections, we will examine two popular methods of text vectorization: TF-IDF (Term Frequency-Inverse Document Frequency) and Word2Vec. Each of these techniques offers a unique approach to tackling vectorization and is widely used in various NLP applications.
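
To make the idea of "text as numbers" concrete before moving on, here is a minimal sketch of the simplest possible vectorization: counting words. The toy corpus and the helper name `count_vector` are invented for illustration and are not part of the course dataset.

```python
from collections import Counter

# Toy corpus (made up for illustration, not taken from the course dataset)
docs = ["the service is great", "the price is too high", "great price great service"]

# Build a sorted vocabulary from every distinct word in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each document becomes a list with one number per vocabulary word. TF-IDF and Word2Vec, covered next, are more refined ways of filling in such vectors.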

## Load Data

>**What Are Dataframes in Python?**
>
> In Python, a dataframe is a table-like data structure provided by the **pandas** library that helps us organize and work with data. Think of it as a neat and organized way to store information, much like a spreadsheet you'd use in Excel. Each row in a dataframe represents a single "thing" (like a person, a product, or a date), and each column stores a specific piece of information about that "thing" (like a name, age, or price).
>
>**Why Use Dataframes?**
>
>Dataframes are super helpful because they make it easy to:
>
>* **Organize Data** : They help us keep our data structured and tidy, making it easier to understand and work with.
>* **Manipulate Data** : We can change, filter, or calculate things with our data easily, just like in Excel.
>* **Analyze Data** : The pandas library provides powerful tools to analyze and make sense of the data stored in a dataframe.
>* **Visualize Data** : We can create charts and graphs to visualize our data for better insights.
>* **Handle Large Datasets** : Dataframes can process large amounts of data efficiently, as long as the data fits in memory.
>
>So, in a nutshell, dataframes help us manage and make the most of our data in Python, making it a go-to tool for data analysis and manipulation.
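
As a quick illustration of the points above, the sketch below builds a tiny dataframe and filters, extends, and samples it. The rows are invented; the column names `text` and `note` are assumptions chosen to mirror the columns used later in this notebook.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions
df_demo = pd.DataFrame({
    "text": ["great service", "too expensive", "fast reply", "never again"],
    "note": [5, 2, 4, 1],
})

# Filter: keep only rows whose rating is at least 4
positive = df_demo[df_demo["note"] >= 4]

# Calculate: add a column holding the number of words in each comment
df_demo["n_words"] = df_demo["text"].str.split().str.len()

# Sample: draw 2 rows at random; random_state makes the draw repeatable
subset = df_demo.sample(2, random_state=0)

print(positive)
print(df_demo["n_words"].tolist())
```

Passing `random_state` to `sample` is optional, but it means that rerunning the notebook selects the same rows.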
>
>We will use the **pandas** library to load comments extracted by a team of data scientists. Here's an example of how to load a CSV file:
>```python
> import pandas as pd
>
> df = pd.read_csv("my_path/to/my/file.csv")
> ```
>
> In the context of this course, you can choose to consider only 1000 comments. Here's an example using the **sample** method:
>```python
> df_sample = df.sample(1000)
> ```
> **Instruction : Load the cleaned dataset named fournisseurs_energie_top5_forums_clean provided by the pedagogical team**

In [None]:
# In order to connect your drive to your env

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Insert your code here

import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Bootcamp_x_hec/Cours-2 Cleaning Embedding/fournisseurs_energie_top5_forums_clean.csv", index_col=0).dropna(subset="text")
df.head()

> # **TF-IDF**
>
> TF-IDF is a statistical technique used in Natural Language Processing (NLP) to quantify the importance of a word in a document, relative to a set of documents. The principle is based on two concepts:
>
> * **TF** (Term Frequency): This measures how frequently a word appears in a given document. The more often a word appears in a document, the higher its TF value.
>
> * **IDF** (Inverse Document Frequency): This assesses the importance of a word by considering the number of documents in which it appears. If a word appears in many documents, it is considered less significant and receives a lower IDF value.
>
> The combination of these two measures gives the TF-IDF score of a word. A high TF-IDF score indicates that a word is both frequent in a particular document and rare across the entire document corpus. This suggests that the word is potentially a good indicator of the subject or important content of the document.
>
> TF-IDF is widely used for information retrieval and topic modeling, helping systems understand and categorize documents based on their content.
>
> ### **TF-IDF Formula**
>
> The TF-IDF weights of a word in a document are calculated using the following two measures:
>
> 1. **Term Frequency (TF)**:
>    TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
>
> 2. **Inverse Document Frequency (IDF)**:
>    IDF(t, D) = log(Total number of documents in the corpus D / Number of documents containing term t)
>
> The final formula for the TF-IDF weight of a term is the product of these two measures:
>
> TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
>
> where:
> - t is a specific term.
> - d is the concerned document.
> - D represents the entire set of documents.
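
The formulas above can be checked by hand on a tiny corpus. The sketch below is a direct transcription of them in plain Python, using the natural logarithm; the toy corpus and the function names `tf`, `idf`, and `tf_idf` are invented for illustration. It assumes the term appears in at least one document, since otherwise the IDF would divide by zero.

```python
import math

# Toy corpus of tokenized documents (invented for illustration)
corpus = [
    ["cheap", "energy", "cheap"],
    ["energy", "contract"],
    ["slow", "customer", "service"],
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`, relative to `corpus`."""
    return tf(term, doc) * idf(term, corpus)

first_doc = corpus[0]
print("cheap :", tf_idf("cheap", first_doc, corpus))
print("energy:", tf_idf("energy", first_doc, corpus))
```

"cheap" scores higher than "energy" in the first document because it is both more frequent there and absent from the other documents. Note that scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers: by default it uses a smoothed IDF and L2-normalizes each document vector.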
>
> **Instruction : Check how to implement tf-idf using scikit-learn framework : https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html**
>
> **Instruction : Implement tf-idf method on your clean dataset**


In [None]:
# Insert your code here

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = df["text"].values

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)

> # Two-Dimensional Document Visualization with t-SNE
>
> ## What is t-SNE?
>
> t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful dimensionality reduction technique that is widely used to visualize high-dimensional data in a lower-dimensional space, such as two or three dimensions. It is particularly effective for visualizing datasets such as text vectors obtained from TF-IDF or Word2Vec in NLP tasks.
>
> ## Why Visualize Documents in Two Dimensions?
>
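> **Understanding Relationships Between Documents:** Visualizing documents in two dimensions with t-SNE allows for a better understanding of the relationships and similarities between documents. Similar documents tend to cluster together, while dissimilar ones tend to end up further apart. Keep in mind that t-SNE mainly preserves local neighborhoods, so the distance between two distant clusters should not be over-interpreted.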
>
> ## Discovery of Clusters and Themes:
>
> t-SNE helps in identifying clusters and themes in a dataset, which can be especially useful for discovering dominant topics in a collection of documents.
>
> ## Simplification of Complex Data:
>
> By reducing the number of dimensions, t-SNE simplifies the complexity inherent in textual data, preserving the important relationships between documents and making the data more accessible for analysis.
>
> # Implementing t-SNE
>
> ## To apply t-SNE:
>
> - **Data Preparation:** Begin by converting your textual documents into numerical vectors, such as those derived from TF-IDF or Word2Vec representations.
>
> - **Applying t-SNE:** Utilize the t-SNE algorithm to reduce these high-dimensional data vectors into a two-dimensional space.
>
> - **Visualization:** Create a scatter plot where each point represents a document. This allows for the visualization of the dataset's overall structure and the relationships between individual documents.
>
> In summary, t-SNE is a valuable tool for visualizing documents in two dimensions, offering insights into the underlying structure and relationships within text data sets.
>
> ## Code Example:
>
> ```python
> from sklearn.manifold import TSNE
> import matplotlib.pyplot as plt
>
> # Assumes 'tfidf_matrix' is your TF-IDF matrix (one row per document) and
> # 'document_labels' is a sequence of numeric labels, one per document.
>
> # Initialize t-SNE (init="random" because PCA initialization does not support sparse input)
> tsne_model = TSNE(n_components=2, init="random", verbose=1, random_state=0)
>
> # Apply t-SNE transformation
> tsne_embedding = tsne_model.fit_transform(tfidf_matrix)
>
> # Plotting: color each point by its label so that the colorbar is meaningful
> plt.figure(figsize=(10, 8))
> scatter = plt.scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c=document_labels, cmap='viridis')
> plt.gca().set_aspect('equal', 'datalim')
> plt.title('t-SNE projection of the TF-IDF dataset')
> plt.xlabel('t-SNE Dimension 1')
> plt.ylabel('t-SNE Dimension 2')
> plt.colorbar(scatter, label='Document label')
> plt.show()
> ```
> **Instruction : Implement the TSNE method to visualize the first 5000 reviews and their associated ratings in 2 dimensions**


In [None]:
# Insert your code here

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

document_labels = df["note"].values[:5000]

# Initialize t-SNE
tsne_model = TSNE(init="random", n_components=2, verbose=1, random_state=0)

# Apply t-SNE transformation
tsne_embedding = tsne_model.fit_transform(tfidf_matrix[:5000])

# Creating a color palette with 6 distinct shades
colors = plt.cm.jet(np.linspace(0, 1, 6))  # 'jet' can be replaced with any colormap

# Visualization
for i in range(6):
    indices = document_labels == i
    plt.scatter(tsne_embedding[indices, 0], tsne_embedding[indices, 1], color=colors[i], label=f'Score {i}')

plt.legend()
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.title('Visualization of Documents with t-SNE')
plt.show()

> # **Word2Vec**
>
> **Word2Vec** is an innovative approach in the field of Natural Language Processing (NLP) for transforming words into numerical form, more specifically, into vectors. Developed by a team at Google, it revolutionized how machines interpret the nuances of human language. The core idea behind Word2Vec is not just to turn words into vectors but to capture the context and semantic relationships between them.
>
> ## **How Does Word2Vec Work?**
>
> Word2Vec uses a neural network model to learn word associations from a large corpus of text. Once trained, the model can detect synonymous words or suggest additional words for a partial sentence. The two main architectures in Word2Vec are:
>
> **Continuous Bag of Words (CBOW)**: This model predicts a target word based on its context.
>
> **Skip-gram**: It works inversely compared to CBOW. Skip-gram predicts the surrounding context words from a target word. This model performs well with small datasets and can represent rare words effectively.
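
To see what "predicting the context from a target word" means in practice, the sketch below lists the (target, context) training pairs that skip-gram would derive from one sentence. The function name `skipgram_pairs` and the example sentence are invented for illustration. This only shows how the pairs are formed; real implementations add refinements such as negative sampling and subsampling of frequent words, which are ignored here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "contract", "was", "cancelled"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

CBOW uses the same windows the other way around: all context words of a position are combined to predict the single word in the middle.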
>
> ## **Why is Word2Vec Important?**
>
> Word2Vec is crucial for several reasons:
>
> * Semantic Understanding: It helps machines understand the semantic relationships between words, such as similarity and analogy.
> * Compact Representation: Word2Vec maps each word to a dense vector with a few hundred dimensions, far fewer than a one-hot encoding over the whole vocabulary, making it easier for algorithms to process.
> * Improving NLP Applications: It enhances the performance of various NLP tasks like sentiment analysis, translation, and text summarization.
>
> ## **Applications of Word2Vec**
>
> Word2Vec has wide applications in sentiment analysis, machine translation, named entity recognition, and more. Its ability to understand the context and meaning of words has made it a cornerstone technique in NLP.
>
> Code example with gensim:
> ```python
> from gensim.models import Word2Vec
>
> # Prepare the data for Word2Vec: a list of tokenized sentences (lists of words),
> # for example [["first", "sentence"], ["second", "sentence"]]
> sentences = [text.split() for text in df["text"]]
>
> # Training the Word2Vec model (sg=1 selects skip-gram)
> model = Word2Vec(sentences, vector_size=256, window=5, min_count=1, sg=1)
>
> # Example of how to use the model: words closest to 'pertinent'
> # (raises a KeyError if the word is not in the vocabulary)
> word_vectors = model.wv
> print(word_vectors.similar_by_word('pertinent'))
> ```
> ## Parameters of the Word2Vec Class
>
> The `Word2Vec` class from Gensim is used for training word embedding models. Below are the most commonly used parameters:
>
> - `sentences`: The training sentences for Word2Vec. This can be a list of word lists or an iterable returning lists of words.
> - `vector_size`: The dimensionality of the word vectors. It is the number of neurons in the hidden layer of the neural network used in Word2Vec. A larger number can capture more detailed word representations but requires more training data and computational power.
> - `window`: The maximum distance between the current word and predicted words in a sentence. For example, `window=2` means only words that are two or fewer words away from the current word in the sentence are considered.
> - `min_count`: Ignores all words with total frequency lower than this number. This helps to eliminate rare words and errors in the data.
> - `workers`: The number of worker threads used for training. This accelerates training, especially on machines with multiple cores.
> - `sg`: If 1, uses the skip-gram training method; if 0, uses CBOW (Continuous Bag of Words). Skip-gram works well with less frequent data, while CBOW is faster and performs well with frequent data.
>
> Other less common parameters include:
>
> - `hs`: If 1, hierarchical softmax will be used for model training. If 0 (default), and `negative` is non-zero, negative sampling will be used.
> - `negative`: If greater than 0, negative sampling will be used. The number indicates how many "noise words" should be drawn (usually between 5-20).
> - `alpha`: The initial learning rate.
> - `min_alpha`: The learning rate will linearly drop to this value during training.
> - `seed`: The seed for random number generation. This enables reproducibility of results using the same seed.
> - `max_vocab_size`: Limits the RAM during vocabulary building; if more unique words are found, the least frequent ones are dropped.
>
> **Instruction : Implement Word2vec skipgram method on reviews you have**

In [None]:
# Insert your code here

import gensim
from gensim.models import Word2Vec

# Prepare the data for Word2Vec
# split() without arguments also collapses repeated spaces, avoiding empty tokens
sentences = df['text'].apply(lambda x: x.split()).tolist()

# Training the Word2Vec model
model = Word2Vec(sentences, vector_size=256, window=5, min_count=1, sg=1)

# Example of how to use the model
word_vectors = model.wv
print(word_vectors.similar_by_word('pertinent'))


**Instruction : Using TSNE, visualize in 2 dimensions the first 200 words of your vocabulary**

In [None]:
# Insert your code here

# Assume you have already trained your Word2Vec model and its vectors are stored in 'word_vectors'
# Step 1: Retrieve the vectors of the first 200 words (gensim orders the vocabulary by decreasing frequency)
words = word_vectors.index_to_key[:200]  # list of the first 200 words
vectors = word_vectors.vectors[:200]  # vectors of the first 200 words

# Step 2: Reduce dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=0)
vectors_2d = tsne.fit_transform(vectors)

# Step 3: Visualize with Matplotlib
plt.figure(figsize=(16, 16))
for i, word in enumerate(words):
    plt.scatter(vectors_2d[i, 0], vectors_2d[i, 1])
    plt.annotate(word, xy=(vectors_2d[i, 0], vectors_2d[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.show()


> To obtain a sentence embedding from the embeddings of individual words, you can follow a straightforward but effective approach. This process typically involves averaging the word embeddings of all the words in a sentence. Here's a brief explanation of the steps involved:
>
> * Word Embeddings: First, you need word embeddings, which are vector representations of words. Each word in your vocabulary is represented by a vector in a high-dimensional space (e.g., 256 dimensions).
>
> * Vector Representation of Words: For each word in the sentence, retrieve its corresponding word embedding, typically by looking it up in the trained Word2Vec model. Words that are missing from the model's vocabulary are skipped.
>
> * Averaging the Vectors: Compute the average of these word vectors. This is done by summing up all the word vectors in the sentence and then dividing by the number of words. The result is a single vector that represents the entire sentence.
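
The averaging step can be sketched on a toy example. The 3-dimensional embeddings below are hypothetical values invented for illustration, and `average_embedding` is an illustrative helper name; plain Python lists are used so that the arithmetic stays visible.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration
embeddings = {
    "good":    [1.0, 0.0, 2.0],
    "service": [3.0, 2.0, 0.0],
}

def average_embedding(tokens, embeddings, dim=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values dimension by dimension
    return [sum(values) / len(known) for values in zip(*known)]

print(average_embedding(["good", "service", "unknownword"], embeddings))
print(average_embedding(["unknownword"], embeddings))
```

With real Word2Vec vectors stored as NumPy arrays, the same computation is a single `np.mean(..., axis=0)` over the stacked word vectors.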
>
> **Instruction: Using the Word2Vec model you trained, create a dataframe with an associated vector for each document**

In [None]:
# Insert your code here

import numpy as np

def preprocess(sentence_tokenized, model):
    # Keep only the words known to the model; key_to_index is a dict, so each lookup is fast
    return [word for word in sentence_tokenized if word in model.wv.key_to_index]

# Calculating document vectors
def average_pooling(doc, model):
    words = preprocess(doc, model)
    if len(words) == 0:
        # No known word in the document: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(model.wv[words], axis=0)

vectors = [average_pooling(sentence, model) for sentence in sentences]
sentence_vector = pd.DataFrame(vectors)

> **Bonus : Instruction : Implement the TSNE method to visualize the first 5000 reviews and their associated ratings in 2 dimensions**

In [None]:
# Insert your code here

# Reduce dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=0)
tsne_embedding = tsne.fit_transform(sentence_vector[:5000])

# Creating a color palette with 6 distinct shades
colors = plt.cm.jet(np.linspace(0, 1, 6))  # 'jet' can be replaced with any colormap

# Visualization
for i in range(6):
    indices = document_labels == i
    plt.scatter(tsne_embedding[indices, 0], tsne_embedding[indices, 1], color=colors[i], label=f'Score {i}')

plt.legend()
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.title('Visualization of Documents with t-SNE')
plt.show()

> **Bonus: Instruction: Redo the two previous questions but with an open source Word2vec model trained on a very large corpus. You can find a pre-trained model at the following link: http://nlp.polytechnique.fr/word2vec#french**

In [None]:
# Insert your code here

import requests, zipfile, io
from gensim.models import KeyedVectors
from tqdm import tqdm


# NOTE: this solution uses English GloVe vectors (GloVe 6B, 100d) as a stand-in.
# If your reviews are in French, prefer a French model from the link in the instruction.
glove_url = 'http://nlp.stanford.edu/data/glove.6B.zip'

# Download the archive (a large file, held in memory) and extract the 100d vectors
response = requests.get(glove_url)
response.raise_for_status()
zip_file = zipfile.ZipFile(io.BytesIO(response.content))
zip_file.extract('glove.6B.100d.txt')

# GloVe text files have no header line; no_header=True (gensim >= 4.0) loads them
# directly, so the deprecated glove2word2vec conversion is not needed
glove_input_file = 'glove.6B.100d.txt'
glove_model = KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True)

In [None]:
def preprocess_glove(sentence_tokenized, model):
    # Keep only the words known to the model; key_to_index is a dict, so each lookup is fast
    return [word for word in sentence_tokenized if word in model.key_to_index]

# Calculating document vectors
def average_pooling_glove(doc, model):
    words = preprocess_glove(doc, model)
    if len(words) == 0:
        # No known word in the document: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(model[words], axis=0)

glove_sentence_vectors = [average_pooling_glove(sentence, glove_model) for sentence in tqdm(sentences[:500])]
glove_sentence_vectors = pd.DataFrame(glove_sentence_vectors)

# Reduce dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=0)
tsne_embedding = tsne.fit_transform(glove_sentence_vectors)

document_labels = document_labels[:500]

# Creating a color palette with 6 distinct shades
colors = plt.cm.jet(np.linspace(0, 1, 6))  # 'jet' can be replaced with any colormap

# Visualization
for i in range(6):
    indices = document_labels == i
    plt.scatter(tsne_embedding[indices, 0], tsne_embedding[indices, 1], color=colors[i], label=f'Score {i}')

plt.legend()
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.title('Visualization of Documents with t-SNE')
plt.show()