# Create Text Embeddings for a Vector Store using LangChain

In [None]:
%%capture --no-stderr
!pip install --quiet langchain chromadb==0.5.3
!pip install langchain-community
!pip install langchain-google-vertexai
!pip install --upgrade --quiet langchain-google-genai

In [2]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)


{'status': 'ok', 'restart': True}

In [None]:
# Define project information
import sys

PROJECT_ID = "PROJECT_ID"  # @param {type:"string"} Please set your PROJECT_ID
LOCATION = "us-central1"  # @param {type:"string"}

# if not running on colab, try to get the PROJECT_ID automatically
if "google.colab" not in sys.modules:
    import subprocess

    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

print(f"Your project ID is: {PROJECT_ID}")



In [2]:
# Import from the current package locations; the older `langchain.*` paths for these names are deprecated
from langchain import hub
from langchain_core.prompts import PromptTemplate, format_document
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA, ConversationalRetrievalChain


USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
# Task 1. Use the WebBaseLoader to load documents related to queries
loader = WebBaseLoader("https://blog.google/technology/ai/google-gemini-ai/")
docs = loader.load()


### Split text into chunks
Now that we have the documents, we will split them into chunks. Each chunk becomes one vector in the vector store. To do this we define a chunk size (number of characters) and a chunk overlap (the number of characters shared between consecutive chunks, like a sliding window). The ideal chunk size can be difficult to determine. If chunks are too large, each one covers too much information and is not specific enough; if they are too small, each one carries too little information. In both cases, nearest-neighbor lookup with a query/question embedding may struggle to retrieve the relevant chunks, and chunks that are too large may not fit as context in an LLM query at all.

In this notebook we will use a chunk size of 800 characters and a chunk overlap of 100 characters, but feel free to experiment with other sizes! Note: you can pass a custom `length_function` to `RecursiveCharacterTextSplitter` if you want chunk size and overlap to be measured by something other than Python's `len` function. In addition to `RecursiveCharacterTextSplitter`, there are [other text splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token) you can consider.
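To make the roles of chunk size and chunk overlap concrete, here is a minimal pure-Python sketch of a fixed-width sliding window. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks). The function name `sliding_window_chunks` and the dummy text are illustrative assumptions.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 2,000-character dummy text, using the sizes chosen in this notebook
sample = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_window_chunks(sample, chunk_size=800, chunk_overlap=100)
chunk_lengths = [len(c) for c in demo_chunks]
print(chunk_lengths)
```

With an overlap of 100, the last 100 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.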

In [4]:
# Task 2. Use the RecursiveCharacterTextSplitter class to split the documents into chunks for embedding
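text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # maximum characters per chunk
    chunk_overlap=100,    # characters shared between consecutive chunks
    length_function=len,  # measure chunk length in characters
)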

chunks = text_splitter.split_documents(docs)

# Look at the first two chunks 
chunks[0:2]

[Document(metadata={'source': 'https://blog.google/technology/ai/google-gemini-ai/', 'title': 'Introducing Gemini: Google’s most capable AI model yet', 'description': 'Gemini is our most capable and general model, built to be multimodal and optimized for three different sizes: Ultra, Pro and Nano.', 'language': 'en-us'}, page_content='Introducing Gemini: Google’s most capable AI model yet'),
 Document(metadata={'source': 'https://blog.google/technology/ai/google-gemini-ai/', 'title': 'Introducing Gemini: Google’s most capable AI model yet', 'description': 'Gemini is our most capable and general model, built to be multimodal and optimized for three different sizes: Ultra, Pro and Nano.', 'language': 'en-us'}, page_content='[{"model": "blogsurvey.survey", "pk": 3, "fields": {"name": "General Article Sentiment", "survey_id": "general-article-sentiment_240906", "scroll_depth_trigger": 50, "previous_survey": null, "display_rate": 50, "thank_message": "Thank you!", "thank_emoji": "✅", "quest

### Vectorize/Embed Document Chunks
Now we need to embed the document chunks (turn them into vectors) and store them in a vector store. We can use any text embedding model for this, but we must use the same model when we embed our queries/questions at prediction time, because vectors from different models are not comparable. To keep things simple we will use the Vertex AI text embeddings API with the `text-embedding-004` model. The LangChain library provides a wrapper class around this API, `VertexAIEmbeddings()`.

For the purposes of this lab, you will use [Chroma](https://www.trychroma.com/) as the vector store for simplicity. In a real-world scenario with a large private knowledge base, you may not be able to fit everything in memory. LangChain has a wrapper class for Chroma that accepts a list of documents and an embedding class and creates the vector store from them.
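As a rough illustration of what a vector store does at query time, the sketch below ranks stored chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk labels are made up for the example; real embeddings have hundreds of dimensions, and Chroma may use a different distance metric and an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    """Return the text of the k stored chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,  # most similar first
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for embedded document chunks
toy_store = [
    ("chunk about benchmarks", [0.9, 0.1, 0.0]),
    ("chunk about hardware", [0.1, 0.9, 0.0]),
    ("chunk about safety", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # hypothetical embedding of a question about benchmarks
top_two = nearest_chunks(toy_query, toy_store, k=2)
print(top_two)
```

This also shows why the query must be embedded with the same model as the chunks: the comparison is only meaningful when both vectors live in the same space.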

In [5]:
# Task 3. Create vector store using embeddings
# VertexAIEmbeddings was imported earlier; text-embedding-004 is the embedding model used for both chunks and queries
gemini_embeddings = VertexAIEmbeddings(model="text-embedding-004")


In [6]:
# Save to disk: create a persistent Chroma store and add the chunks to it
vectorstore = Chroma(embedding_function=gemini_embeddings, persist_directory="./vectorstore")

# Add chunks one at a time; add_documents also accepts the whole list in a single call
for chunk in chunks:
    vectorstore.add_documents([chunk])


In [7]:
# Load from disk
vectorstore_disk = Chroma(
    embedding_function=gemini_embeddings,   # Same embedding model used when saving
    persist_directory="./vectorstore"       # Directory the embeddings were saved to
)

retriever = vectorstore_disk.as_retriever(search_kwargs={"k": 1})
print(len(retriever.invoke("MMLU")))


1


### Putting it all together
Now that everything is in place, we can tie it all together with a LangChain chain. A chain orchestrates the multiple steps required to use an LLM for a specific use case. Here, the chain first embeds the query/question, then performs a nearest-neighbor lookup to find the relevant chunks, then uses those chunks to formulate a response with an LLM. We will use the Chroma database as our vector store and PaLM as our LLM. LangChain provides a wrapper around PaLM, `VertexAI()`.

For this simple Q/A use case we can use LangChain's `RetrievalQA` to link the steps together.
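The retrieve-then-answer flow can be sketched without any library. The example below is an assumption-laden illustration of the "stuff" idea, where all retrieved chunks are placed into a single prompt; the prompt wording, the function names, and the stub retriever and model are hypothetical and are not LangChain's actual template or API.

```python
from typing import Callable


def build_stuffed_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Put every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> dict:
    """Retrieve relevant chunks, stuff them into a prompt, and call the model."""
    chunks = retrieve(question)
    prompt = build_stuffed_prompt(question, chunks)
    return {"result": generate(prompt), "source_documents": chunks, "prompt": prompt}


# Stubs standing in for the vector store retriever and the LLM
def stub_retrieve(question: str) -> list[str]:
    return ["First relevant chunk.", "Second relevant chunk."]


def stub_generate(prompt: str) -> str:
    return "stub answer"


stub_response = answer_question("What is MMLU?", stub_retrieve, stub_generate)
print(stub_response["prompt"])
```

The returned dictionary mirrors the shape used later in this notebook (`result` and `source_documents`), which is why source chunks can be cited alongside the answer.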

In [8]:
# Retriever over the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2}  # number of nearest neighbors to retrieve
)

# You can also set temperature, top_p, top_k 
llm = VertexAI(
    model_name="text-bison",
    max_output_tokens=1024
)

# q/a chain 
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

### Query 
Now that everything is "chained" together using LangChain we can send queries and get answers! 

In [9]:
def ask_question(question: str):
    response = qa.invoke({"query": question})
    print(f"Response: {response['result']}\n")

    citations = {doc.metadata['source'] for doc in response['source_documents']}
    print(f"Citations: {citations}\n")

    # uncomment below to print source chunks used  
    # print(f"Source Chunks Used: {response['source_documents']}")

In [10]:
ask_question("What is MMLU?")

Response:  MMLU stands for massive multitask language understanding. It is a benchmark that uses a combination of 57 subjects such as math, physics, history, law, medicine, and ethics for testing both world knowledge and problem-solving abilities.

Citations: {'https://blog.google/technology/ai/google-gemini-ai/'}



In [11]:
ask_question("What is a TPU?")

Response:  A TPU is a Tensor Processing Unit, a custom-designed AI accelerator developed by Google.

Citations: {'https://blog.google/technology/ai/google-gemini-ai/'}

