# **Embed documents using watsonx's embedding model**


## Overview


Imagine you work in a company that handles a vast amount of text data, including documents, emails, and reports. Your task is to build an intelligent search system that can quickly and accurately retrieve relevant documents based on user queries. Traditional keyword-based search methods often fail to understand the context and semantics of the queries, leading to poor search results.

To address this challenge, you can use embedding models to convert documents into numerical vectors that capture the semantic meaning of the text. Comparing these vectors enables more accurate, context-aware search, and the same vectors can be reused for downstream tasks such as classification and clustering.


In this lab, you will learn how to use embedding models from watsonx.ai and Hugging Face to embed documents. By the end of this lab, you will be able to effectively use these embedding models to transform and utilize textual data in your projects.


## 📘 Table of Contents

1. [Objectives](#objectives)
2. [Setup](#setup)  
    1. [Installing required libraries](#installing-required-libraries)  
    2. [Load data](#load-data)  
    3. [Split data](#split-data)  
3. [Watsonx embedding model](#watsonx-embedding-model)  
    1. [Model description](#model-description)  
    2. [Build model](#build-model)  
    3. [Query embeddings](#query-embeddings)  
    4. [Document embeddings](#document-embeddings)  
4. [Hugging Face embedding model](#hugging-face-embedding-model)  
    1. [Model description](#model-description-1)  
    2. [Build model](#build-model-1)  
    3. [Query embeddings](#query-embeddings-1)  
    4. [Document embeddings](#document-embeddings-1)  




## Objectives

After completing this lab, you will be able to:

 - Prepare and preprocess documents for embedding
 - Use watsonx.ai and Hugging Face embedding models to generate embeddings for your documents


----


## Setup


For this lab, you will use the following libraries:

* [`ibm-watsonx-ai`](https://ibm.github.io/watsonx-ai-python-sdk/fm_embeddings.html#EmbeddingModels) for using embedding models from IBM's watsonx.ai.
* [`langchain`, `langchain-ibm`, `langchain-community`](https://www.langchain.com/) for using relevant features from LangChain.
* [`sentence-transformers`](https://huggingface.co/sentence-transformers) for using embedding models from Hugging Face.


### Installing required libraries

The following required libraries are __not__ preinstalled in the Skills Network Labs environment. __You need to run the following cell__ to install them:

**Note:** The library versions are pinned so that this lab keeps working even if newer releases change their behavior. It's recommended that you pin versions in your own projects as well.

This might take around 1-2 minutes. 

Because `%%capture` suppresses the installation output, you won't see any progress messages. When the installation completes, a number appears beside the cell.


In [None]:
%%capture
# After executing this cell, please RESTART the kernel and run all the cells.
!pip install --user "ibm-watsonx-ai==1.1.2"
!pip install --user "langchain==0.2.11"
!pip install --user "langchain-ibm==0.1.11"
!pip install --user "langchain-community==0.2.10"
!pip install --user "sentence-transformers==3.0.1"


### Load data


A text file has been prepared as the source document for the downstream embedding task.

Now, let's download and load it using LangChain's `TextLoader`.


In [2]:

import requests

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/i5V3ACEyz6hnYpVq6MTSvg/state-of-the-union.txt"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed

# Save the downloaded text locally so that TextLoader can read it
with open("state-of-the-union.txt", "w", encoding="utf-8") as f:
    f.write(response.text)
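
Text downloaded this way can come back with garbled punctuation. When a server sends `text/*` content without declaring a charset, `requests` falls back to ISO-8859-1 for `response.text`, so a UTF-8 character such as `’` shows up as `â\x80\x99`. The following sketch reproduces the effect with the standard library only and shows that it is reversible. The sample string is made up for illustration.

```python
# A word containing a right single quotation mark (U+2019).
original = "Russia\u2019s"

# The bytes a server would send for UTF-8 encoded text.
raw_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec produces mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# Reversing the wrong decoding step recovers the original text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

With `requests`, one way to avoid the problem is to set `response.encoding = "utf-8"` before reading `response.text`, assuming the file really is UTF-8. Clean text is a few characters shorter, so chunk boundaries and counts may then differ slightly from the ones shown in this lab.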


In [3]:
from langchain_community.document_loaders import TextLoader

In [5]:
loader = TextLoader("state-of-the-union.txt", encoding="utf-8")
data = loader.load()

Let's take a look at the document.


In [6]:
data

[Document(metadata={'source': 'state-of-the-union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russiaâ\x80\x99s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their 

### Split data


Since the embedding model has a maximum input token limit, you cannot input the entire document at once. Instead, you need to split it into chunks.

The following code shows how to use LangChain's `RecursiveCharacterTextSplitter` to split the document into chunks.
- Use the default separator list, which is `["\n\n", "\n", " ", ""]`.
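- Chunk size is set to `100`. Because the length function is `len`, this is measured in characters, and it should be small enough that each chunk stays below the model's maximum number of input tokens.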
- Chunk overlap is set to `20`.
- The length function is `len`.
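
To make the roles of chunk size and chunk overlap concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` additionally tries to break on the separators listed above, so its chunks end at paragraph or word boundaries. The function name `split_with_overlap` and the sample string are hypothetical.

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces


# A 250-character sample made of repeating digits.
sample = "".join(str(i % 10) for i in range(250))
parts = split_with_overlap(sample)
print(len(parts), [len(p) for p in parts])
```

The overlap means that a sentence cut at a chunk boundary still appears partly in the neighboring chunk, at the cost of embedding some text twice.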


In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

In [9]:
chunks = text_splitter.split_text(data[0].page_content)

Let's see how many chunks you get.


In [10]:
len(chunks)

574

Let's also see what these chunks look like.


In [11]:
chunks

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.',
 'Last year COVID-19 kept us apart. This year we are finally together again.',
 'Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.',
 'With a duty to one another to the American people to the Constitution.',
 'And with an unwavering resolve that freedom will always triumph over tyranny.',
 'Six days ago, Russiaâ\x80\x99s Vladimir Putin sought to shake the foundations of the free world thinking',
 'free world thinking he could make it bend to his menacing ways. But he badly miscalculated.',
 'He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of',
 'he met a wall of strength he never imagined.',
 'He met the Ukrainian people.',
 'From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their',
 'c

## Watsonx embedding model


### Model description


In this section, you will use IBM's `slate-125m-english-rtrvr` model as an example embedding model.

The `slate-125m-english-rtrvr` model is a standard [sentence transformers](https://www.sbert.net/) model based on bi-encoders. The model produces an embedding for a given input, e.g., a query, passage, or document. At a high level, the model is trained to maximize the cosine similarity between two related pieces of text, e.g., text A (query text) and text B (passage text), which are encoded into the sentence embeddings q and p. These sentence embeddings can then be compared using cosine similarity, which measures how closely two embeddings point in the same direction: the higher the value, the more semantically similar the texts.
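
As a small illustration of the comparison step, the sketch below computes cosine similarity for toy three-dimensional vectors using only the standard library. The vectors `q`, `p_related`, and `p_unrelated` are made up; real embeddings from this model have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


q = [1.0, 2.0, 0.0]            # toy query embedding
p_related = [2.0, 4.0, 0.0]    # same direction as q, different length
p_unrelated = [0.0, 0.0, 3.0]  # orthogonal to q

print(cosine_similarity(q, p_related))
print(cosine_similarity(q, p_unrelated))
```

Because the measure depends only on direction, `p_related` scores as highly as possible even though it is twice as long as `q`.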


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/NDCHhZfcC96jggb2hMdJhg/fm-slate-125m-english-rtrvr-cosine.jpg" width="50%">


### Build model


The following code shows how to build the `slate-125m-english-rtrvr` model through the IBM watsonx.ai API.


First, import the necessary dependencies.
- `WatsonxEmbeddings` is a class that creates an embedding model object.
- `EmbedTextParamsMetaNames` provides the parameter names that control how text is embedded.


In [15]:
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_ibm import WatsonxEmbeddings


In [None]:
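embed_params = {
    # Truncate inputs longer than this many tokens. 512 is assumed to be the
    # model's limit; a very small value such as 3 would discard most of each chunk.
    EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 512,
    # Ask the service to return the input text alongside each embedding
    EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
}

watsonx_embedding = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="skills-network",
    params=embed_params,
)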

### Query embeddings


Now, create an embedding based on a single sentence, which can be treated as a query.


Use the `embed_query` method.


In [None]:
query = "How are you?"

query_result = watsonx_embedding.embed_query(query)

Let's see the length/dimension of this embedding.


In [None]:
len(query_result)

The embedding has `768` dimensions, which is the output size of this model.


Next, take a look at the first five values of the embedding.


In [None]:
query_result[:5]

### Document embeddings


After creating the query embedding, you will now create embeddings from documents, which here are a list of text chunks.


Use the `embed_documents` method. Its argument should be a list of strings. Here, `chunks` is the list you obtained earlier by splitting the whole document.


In [None]:
doc_result = watsonx_embedding.embed_documents(chunks)

Because each piece of text is embedded into its own vector, the length of `doc_result` should equal the length of `chunks`.


In [None]:
len(doc_result)

Now, take a look at the first five values of the embedding of the first chunk.


In [None]:
doc_result[0][:5]

Check the embedding dimension to see if it is also 768.


In [None]:
len(doc_result[0])
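
The individual checks above can be bundled into one helper that fails loudly when something is off. This is a sketch with a hypothetical function name, `check_embeddings`, and placeholder vectors; in the lab you would pass `chunks` and `doc_result` instead.

```python
def check_embeddings(texts, vectors, expected_dim):
    """Verify there is one vector per text and that every vector has expected_dim values."""
    if len(vectors) != len(texts):
        raise ValueError(f"expected {len(texts)} vectors, got {len(vectors)}")
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, expected {expected_dim}"
            )
    return len(vectors), expected_dim


# Placeholder data standing in for real chunks and their embeddings.
sample_texts = ["first chunk", "second chunk"]
sample_vectors = [[0.0] * 768, [0.1] * 768]

shape = check_embeddings(sample_texts, sample_vectors, expected_dim=768)
print(shape)
```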

## Hugging Face embedding model


### Model description


In this section, you will use the `all-mpnet-base-v2` model from Hugging Face as an example embedding model.

It is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. It is based on the pre-trained `microsoft/mpnet-base` model, fine-tuned on a dataset of 1B sentence pairs. For more information, see the [model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).


### Build model


To build the model, first import the `HuggingFaceEmbeddings` class.


In [16]:
from langchain_community.embeddings import HuggingFaceEmbeddings

Then, you specify the model name.


In [17]:
model_name = "sentence-transformers/all-mpnet-base-v2"

Here, you create an embedding model object.


In [18]:
huggingface_embedding = HuggingFaceEmbeddings(model_name=model_name)

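**Note:** If this cell raises `ImportError: Could not import sentence_transformers python package`, the kernel has not picked up the newly installed libraries. Restart the kernel after running the installation cell, and then run all the cells again.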

### Query embeddings


Let's create the embeddings from the same sentence, but using the Hugging Face embedding model. 


In [None]:
query = "How are you?"

In [None]:
query_result = huggingface_embedding.embed_query(query)

In [None]:
query_result[:5]

Do you see the differences between embeddings that are created by the watsonx embedding model and the Hugging Face embedding model?


### Document embeddings


Next, you can do the same for creating embeddings from documents.


In [None]:
doc_result = huggingface_embedding.embed_documents(chunks)
doc_result[0][:5]

In [None]:
len(doc_result[0])
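
To connect the embeddings back to the search use case from the overview, the sketch below ranks chunks against a query by normalizing every vector and taking dot products, which is equivalent to ranking by cosine similarity. The helper names `normalize` and `top_k`, the toy chunks, and the three-dimensional vectors are all hypothetical stand-ins. With real data you would use the outputs of `embed_query` and `embed_documents`, making sure both come from the same model, because vectors from different models are not comparable.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k(query_vector, doc_vectors, texts, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    unit_query = normalize(query_vector)
    scored = []
    for vector, text in zip(doc_vectors, texts):
        unit_doc = normalize(vector)
        # The dot product of unit vectors equals their cosine similarity.
        score = sum(x * y for x, y in zip(unit_query, unit_doc))
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins for chunks and their embeddings.
toy_chunks = ["chunk about the economy", "chunk about Ukraine", "chunk about healthcare"]
toy_doc_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
toy_query_vector = [0.2, 0.8, 0.0]

results = top_k(toy_query_vector, toy_doc_vectors, toy_chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

In practice, document vectors are usually normalized once and stored, so that each new query only requires dot products.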

