### Wednesday, November 22, 2023

https://github.com/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/10-langchain-multi-query.ipynb

docker container start hfpt_Nov20

Start => OpenAI usage is at $1.47

Finish => OpenAI usage is at $1.63

This all runs.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/10-langchain-multi-query.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/10-langchain-multi-query.ipynb)

#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# LangChain Multi-Query for RAG

In [1]:
# !pip install -qU \
#   pinecone-client==2.2.4 \
#   langchain==0.0.321 \
#   datasets==2.14.6 \
#   openai==0.28.1 \
#   tiktoken==0.5.1

## Getting Data

We will download an existing dataset from Hugging Face Datasets.

In [1]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

# 2m 10.3s

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

In [2]:
from langchain.docstore.document import Document

docs = []

for row in data:
    doc = Document(
        page_content=row["chunk"],
        metadata={
            "title": row["title"],
            "source": row["source"],
            "id": row["id"],
            "chunk-id": row["chunk-id"],
            "text": row["chunk"]
        }
    )
    docs.append(doc)

# 7.9s

## Embedding and Vector DB Setup

Initialize our embedding model:

In [3]:
import os
from getpass import getpass
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = "text-embedding-ada-002"


In [5]:
# get openai api key from platform.openai.com
# OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass("OpenAI API Key: ")
OPENAI_API_KEY = getpass("OpenAI API Key: ")

In [6]:
embed = OpenAIEmbeddings(
    model=model_name, openai_api_key=OPENAI_API_KEY, disallowed_special=()
)

Create our Pinecone index:

In [7]:
import pinecone
import time

index_name = "langchain-multi-query-demo"

# # find API key in console at app.pinecone.io
# YOUR_API_KEY = os.getenv('PINECONE_API_KEY') or getpass("Pinecone API Key: ")
# # find ENV (cloud region) next to API key in console
# YOUR_ENV = os.getenv('PINECONE_ENVIRONMENT') or input("Pinecone Env: ")

# find API key in console at app.pinecone.io
YOUR_API_KEY = getpass("Pinecone API Key: ")
# find ENV (cloud region) next to API key in console
YOUR_ENV = input("Pinecone Env: ")


In [8]:
pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)

In [9]:
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536  # 1536 dim of text-embedding-ada-002
    )
    # wait for index to be initialized
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

# 15.8s
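The index is created with `metric='cosine'` and `dimension=1536`, the output size of `text-embedding-ada-002`. To illustrate what that metric scores (this is not Pinecone's internal implementation), here is a minimal sketch of cosine similarity in plain Python on toy 3-dimensional vectors:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors; real ada-002 embeddings have 1536 dimensions.
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # shares no direction with query_vec

print(cosine_similarity(query_vec, same_direction))  # ≈ 1.0
print(cosine_similarity(query_vec, orthogonal))      # 0.0
```

Only the direction of the vectors matters, not their length, which is why a chunk's score does not depend on how long its embedding vector happens to be.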

In [10]:
# now connect to index
index = pinecone.Index(index_name)

Populate our index:

In [11]:
len(docs)

41584

In [12]:
# Yes, we want to do this: index only the first 5,000 chunks
# to speed things up (and cut embedding cost) if you're following along
docs = docs[:5000]

In [13]:
%%time 

from tqdm.auto import tqdm
from uuid import uuid4

batch_size = 100

for i in tqdm(range(0, len(docs), batch_size)):
    i_end = min(len(docs), i+batch_size)
    docs_batch = docs[i:i_end]
    # get IDs
    ids = [f"{doc.metadata['id']}-{doc.metadata['chunk-id']}" for doc in docs_batch]
    # get text and embed
    texts = [d.page_content for d in docs_batch]
    embeds = embed.embed_documents(texts=texts)
    # get metadata
    metadata = [d.metadata for d in docs_batch]
    to_upsert = zip(ids, embeds, metadata)
    index.upsert(vectors=to_upsert)

# CPU times: user 10.3 s, sys: 253 ms, total: 10.6 s
# Wall time: 27min 15s

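The upsert loop walks the documents in slices of `batch_size`; with 5,000 docs and `batch_size = 100` that is 50 batches, matching the `0/50` progress bar. Below is a dependency-free sketch of the same slicing and ID scheme (`{id}-{chunk-id}`). The embedding call is replaced by a hypothetical `fake_embed` stub so it runs offline, and the docs are plain dicts rather than LangChain `Document` objects:

```python
from typing import Dict, Iterator, List, Tuple

# (vector ID, embedding, metadata) as passed to index.upsert
Record = Tuple[str, List[float], Dict[str, str]]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for embed.embed_documents(): one tiny vector per text.
    return [[float(len(t))] for t in texts]


def batched_records(docs: List[Dict[str, str]], batch_size: int = 100) -> Iterator[List[Record]]:
    """Yield lists of (id, vector, metadata) tuples, batch_size docs at a time."""
    for i in range(0, len(docs), batch_size):
        i_end = min(len(docs), i + batch_size)   # last batch may be shorter
        batch = docs[i:i_end]
        ids = [f"{d['id']}-{d['chunk-id']}" for d in batch]
        embeds = fake_embed([d["text"] for d in batch])
        yield list(zip(ids, embeds, batch))      # materialize the zip per batch


# 250 toy docs with batch_size=100 -> batches of 100, 100 and 50.
toy_docs = [{"id": "paper-0", "chunk-id": str(n), "text": "chunk " * n} for n in range(250)]
print([len(b) for b in batched_records(toy_docs)])  # [100, 100, 50]
```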


## Multi-Query with LangChain

Now we switch to using our populated index as a vector store in LangChain.

In [14]:
from langchain.vectorstores import Pinecone

text_field = "text"

vectorstore = Pinecone(index, embed.embed_query, text_field)
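`text_field = "text"` names the metadata key that holds the raw chunk text, which is why the metadata built earlier repeats `row["chunk"]` under `"text"`. Conceptually, the wrapper turns each Pinecone match back into a document by pulling that key out of the metadata. A minimal plain-Python sketch of the idea, assuming a match shaped like `{"id", "score", "metadata"}`; `SimpleDoc` and `match_to_doc` are hypothetical names, not LangChain's code:

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDoc:
    # Hypothetical stand-in for langchain's Document.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def match_to_doc(match: Dict[str, Any], text_field: str = "text") -> SimpleDoc:
    """Move metadata[text_field] into page_content; keep the rest as metadata."""
    metadata = dict(match["metadata"])  # copy so the original match is untouched
    if text_field not in metadata:
        # This sketch raises; the real wrapper may handle a missing field differently.
        raise KeyError(f"match {match['id']!r} has no {text_field!r} metadata field")
    text = metadata.pop(text_field)
    return SimpleDoc(page_content=text, metadata=metadata)


example_match = {
    "id": "paper-0-3",
    "score": 0.87,
    "metadata": {"title": "Some paper", "chunk-id": "3", "text": "Large language models ..."},
}
example_doc = match_to_doc(example_match)
print(example_doc.page_content)  # Large language models ...
print(example_doc.metadata)      # {'title': 'Some paper', 'chunk-id': '3'}
```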



In [15]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

We initialize the `MultiQueryRetriever`:

In [16]:
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

We set logging so that we can see the queries as they're generated by our LLM.

In [17]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

To query with our multi-query retriever we call the `get_relevant_documents` method.

In [18]:
%%time

question = "tell me about llama 2?"

docs = retriever.get_relevant_documents(query=question)
len(docs)

# CPU times: user 44.6 ms, sys: 0 ns, total: 44.6 ms
# Wall time: 7.37 s

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What information can you provide about llama 2?', '2. Could you give me some details about llama 2?', '3. I would like to learn more about llama 2. Can you help me with that?']




7

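From this we get a variety of docs retrieved by each of our queries independently. By default, `vectorstore.as_retriever()` returns the top `4` docs for each query (LangChain's default `k=4`), so the three generated queries retrieve `12` documents in total. Because some of these overlap, the `MultiQueryRetriever` keeps only the unique ones, and we end up with `7` docs.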
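The retrieve-then-deduplicate step can be sketched without any external services: run each generated query through a retriever that returns `k` hits, concatenate the results, and keep the first occurrence of each chunk. The `toy_retrieve` function and its canned results are hypothetical, and deduplicating on the chunk text alone is a simplification of what `MultiQueryRetriever` does internally:

```python
from typing import Callable, List


def multi_query_retrieve(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Union of results across queries, order-preserving, duplicates removed."""
    seen = set()
    unique: List[str] = []
    for q in queries:
        for chunk in retrieve(q):
            if chunk not in seen:
                seen.add(chunk)
                unique.append(chunk)
    return unique


# Hypothetical retriever: each query returns k=4 chunks, with some overlap between queries.
TOY_RESULTS = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["a", "b", "e", "f"],
    "q3": ["a", "c", "g", "h"],
}


def toy_retrieve(query: str) -> List[str]:
    return TOY_RESULTS[query]


unique_docs = multi_query_retrieve(["q1", "q2", "q3"], toy_retrieve)
print(len(unique_docs))  # 8 unique out of 12 retrieved
```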

In [19]:
docs

[Document(page_content='Q:Yes or no: Could a llama birth twice during War in Vietnam (1945-46)?\nA:TheWar inVietnam was6months. Thegestationperiod forallama is11months, which ismore than 6\nmonths. Thus, allama could notgive birth twice duringtheWar inVietnam. So the answer is no.\nQ:Yes or no: Would a pear sink in water?\nA:Thedensityofapear isabout 0:6g=cm3,which islessthan water.Objects lessdense than waterﬂoat. Thus,\napear would ﬂoat. So the answer is no.\nTable 26: Few-shot exemplars for full chain of thought prompt for Date Understanding.\nPROMPT FOR DATE UNDERSTANDING\nQ:2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?\nA:If2015 iscomingin36hours, then itiscomingin2days. 2days before01/01/2015 is12/30/2014, sotoday\nis12/30/2014. Sooneweek from todaywillbe01/05/2015. So the answer is 01/05/2015.', metadata={'chunk-id': '137', 'id': datetime.date(2201, 11, 22), 'source': 'http://arxiv.org/pdf/2201.11903', 'title': 'Chain-of-Thought Prompting Elicit

## Adding the Generation in RAG

So far we've built a multi-query-powered **R**etrieval **A**ugmentation chain. Now we need to add **G**eneration.

In [20]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template="""You are a helpful assistant who answers user queries using the
    contexts provided. If the question cannot be answered using the information
    provided say "I don't know".

    Contexts:
    {contexts}

    Question: {query}""",
)

# Chain
qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

In [21]:
out = qa_chain(
    inputs={
        "query": question,
        "contexts": "\n---\n".join([d.page_content for d in docs])
    }
)
out["text"]

'I don\'t have any information about "llama 2" in the provided contexts.'
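What the LLM actually receives is the template with `{contexts}` replaced by the retrieved chunks joined with `\n---\n`. The sketch below reproduces that with plain `str.format`, roughly what `PromptTemplate` does with its default f-string format. The two contexts are made-up placeholders, and the template's indentation is dropped for readability:

```python
# Same wording as QA_PROMPT above (indentation removed); contexts are placeholders.
TEMPLATE = """You are a helpful assistant who answers user queries using the
contexts provided. If the question cannot be answered using the information
provided say "I don't know".

Contexts:
{contexts}

Question: {query}"""

contexts = ["Chunk one text.", "Chunk two text."]
prompt_text = TEMPLATE.format(
    query="tell me about llama 2?",
    contexts="\n---\n".join(contexts),  # same separator as the qa_chain call
)
print(prompt_text)
```

Printing the filled prompt like this is a quick way to check whether the retrieved contexts mention the subject of the question at all before blaming the LLM for an "I don't know".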

## Chaining Everything with a SequentialChain

We could pull the logic above together into a function or a set of methods, whichever is preferred. However, if we'd like to follow LangChain's approach, we must "chain" together multiple chains. The retrieval component is (1) not a chain per se, and (2) requires some processing of its output. To handle this and fit LangChain's "chaining chains" approach, we set up the _retrieval_ component within a `TransformChain`:

In [22]:
from langchain.chains import TransformChain

def retrieval_transform(inputs: dict) -> dict:
    docs = retriever.get_relevant_documents(query=inputs["question"])
    docs = [d.page_content for d in docs]
    docs_dict = {
        "query": inputs["question"],
        "contexts": "\n---\n".join(docs)
    }
    return docs_dict

retrieval_chain = TransformChain(
    input_variables=["question"],
    output_variables=["query", "contexts"],
    transform=retrieval_transform
)

Now we chain this with our generation step using the `SequentialChain`:

In [23]:
from langchain.chains import SequentialChain

rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],  # named "question" because retrieval_chain outputs "query"
    output_variables=["query", "contexts", "text"]
)
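Conceptually, `SequentialChain` runs each chain in order over a shared dictionary of variables, adding each chain's outputs to it. The plain-Python sketch below models that, plus a check that a step may not overwrite a key that already exists, which (as we understand LangChain's validation) is why the input is called `"question"` rather than `"query"`. `Step`, `run_sequential` and the two stub steps are hypothetical, not LangChain classes:

```python
from typing import Callable, Dict, List, Tuple

# A step: (input keys, output keys, function from inputs to outputs).
Step = Tuple[List[str], List[str], Callable[[Dict[str, str]], Dict[str, str]]]


def run_sequential(steps: List[Step], inputs: Dict[str, str], output_keys: List[str]) -> Dict[str, str]:
    known = dict(inputs)  # shared pool of variables
    for in_keys, out_keys, fn in steps:
        clash = set(out_keys) & set(known)
        if clash:
            raise ValueError(f"step would overwrite existing keys: {clash}")
        result = fn({k: known[k] for k in in_keys})
        known.update({k: result[k] for k in out_keys})
    return {k: known[k] for k in output_keys}


def retrieval_step(x: Dict[str, str]) -> Dict[str, str]:
    # Stub for retrieval_transform: pretend we found two chunks.
    return {"query": x["question"], "contexts": "chunk A\n---\nchunk B"}


def qa_step(x: Dict[str, str]) -> Dict[str, str]:
    # Stub for qa_chain: a real chain would call the LLM here.
    n_chunks = x["contexts"].count("---") + 1
    return {"text": f"Answer to {x['query']!r} from {n_chunks} chunks"}


steps: List[Step] = [
    (["question"], ["query", "contexts"], retrieval_step),
    (["query", "contexts"], ["text"], qa_step),
]
sketch_out = run_sequential(steps, {"question": "tell me about llama 2?"}, ["query", "contexts", "text"])
print(sketch_out["text"])  # Answer to 'tell me about llama 2?' from 2 chunks
```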

Then we perform the full RAG pipeline:

In [24]:
%%time

out = rag_chain({"question": question})
out["text"]

# CPU times: user 64.4 ms, sys: 646 µs, total: 65.1 ms
# Wall time: 10min 18s

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What information can you provide about llama 2?', '2. Could you give me some details about llama 2?', '3. I would like to learn more about llama 2. Can you help me with that?']




'I don\'t have any information about "llama 2" in the provided contexts.'

---

## Custom Multi-Query

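We'll try this with two prompts, both of which encourage more variety in the search queries. The code below uses **Prompt B**, which also tells the LLM which domain the questions come from.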

**Prompt A**
```
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
Each query MUST tackle the question from a different viewpoint,
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
```


**Prompt B**
```
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. The user questions
are focused on Large Language Models, Machine Learning, and related
disciplines.
Each query MUST tackle the question from a different viewpoint, we
want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
```

In [25]:
from typing import List
from langchain.chains import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser


# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

template = """
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. The user questions
are focused on Large Language Models, Machine Learning, and related
disciplines.
Each query MUST tackle the question from a different viewpoint, we
want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
"""

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template=template,
)
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

# Chain
llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)
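`LineListOutputParser.parse` simply splits the LLM's reply on newlines, so the numbering the model adds (visible as `'1. ...'` in the `Generated queries` log lines) goes straight into the vector search, and a blank line between queries would become an empty query. Here is a hypothetical variant, `parse_queries`, that also drops blank lines and leading list numbers; it is plain Python so it can be tested without an API key:

```python
import re
from typing import List

# Leading "1." / "2)" list marker followed by whitespace. Requiring the whitespace
# avoids stripping numbers that are part of the query, e.g. "3.5 turbo".
_LIST_MARKER = re.compile(r"^\s*\d+[.)]\s+")


def parse_queries(text: str) -> List[str]:
    """Split LLM output into queries, dropping blank lines and list numbering."""
    queries = []
    for line in text.strip().split("\n"):
        cleaned = _LIST_MARKER.sub("", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


raw = "1. What is Llama 2?\n\n2. How does Llama 2 compare to GPT-3?\n3) Where is Llama 2 used?"
print(parse_queries(raw))
# ['What is Llama 2?', 'How does Llama 2 compare to GPT-3?', 'Where is Llama 2 used?']
```

To try it in the notebook, the `parse` method above could return `LineList(lines=parse_queries(text))` instead of splitting the raw text.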

In [26]:
%%time

# Run
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

# Results
docs = retriever.get_relevant_documents(
    query=question
)
len(docs)

# CPU times: user 69.5 ms, sys: 266 µs, total: 69.8 ms
# Wall time: 10min 14s

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the key features and capabilities of Large Language Model Llama 2?', '2. How does Llama 2 compare to other Large Language Models in terms of performance and efficiency?', '3. What are the applications and use cases of Llama 2 in the field of Machine Learning and Natural Language Processing?']




12

In [27]:
docs

[Document(page_content='Gall², Suzana Ili\x01 c, Yacine Jernite, Younes Belkada, Thomas Wolf\nAbstract\nLarge language models (LLMs) have been shown to be able to perform new tasks based on\na few demonstrations or natural language instructions. While these capabilities have led to\nwidespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we\npresent BLOOM, a 176B-parameter open-access language model designed and built thanks\nto a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of\nsources in 46 natural and 13 programming languages (59 in total). We find that BLOOM\nachieves competitive performance on a wide variety of benchmarks, with stronger results\nafter undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs,

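Putting this together in another `SequentialChain`. Strictly speaking, `retrieval_transform` looks up the global `retriever` each time it runs, so it would already use the new custom retriever; rebuilding the chains just makes that explicit: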

In [28]:
retrieval_chain = TransformChain(
    input_variables=["question"],
    output_variables=["query", "contexts"],
    transform=retrieval_transform
)

rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],  # named "question" because retrieval_chain outputs "query"
    output_variables=["query", "contexts", "text"]
)

And asking again:

In [29]:
%%time

out = rag_chain({"question": question})
out["text"]

# CPU times: user 56 ms, sys: 840 µs, total: 56.9 ms
# Wall time: 6min 23s

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the key features and capabilities of Large Language Model Llama 2?', '2. How does Llama 2 compare to other Large Language Models in terms of performance and efficiency?', '3. What are the applications and use cases of Llama 2 in the field of Machine Learning and Natural Language Processing?']




'I don\'t have information about "llama 2" in the provided contexts.'

After finishing, delete your Pinecone index to save resources:

In [30]:
pinecone.delete_index(index_name)

---