<a href="https://colab.research.google.com/github/sugarforever/LangChain-Advanced/blob/main/06_Web_Research_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 06 Web Research Retriever


## Example

In [None]:
!pip install -q -U langchain openai chromadb tiktoken

In [None]:
import os
os.environ['OPENAI_API_KEY'] = "your valid openai api key"

In [None]:
from langchain.retrievers.web_research import WebResearchRetriever

In [None]:
import os
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models.openai import ChatOpenAI
from langchain.utilities import GoogleSearchAPIWrapper

# Vectorstore
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_oai")

# LLM
llm = ChatOpenAI(temperature=0)

# Create a Programmable Search Engine ID at https://programmablesearchengine.google.com/controlpanel/all
os.environ["GOOGLE_CSE_ID"] = "your cse id"
# Get a Custom Search JSON API key via https://developers.google.com/custom-search/v1/introduction
os.environ["GOOGLE_API_KEY"] = "your google api key"
search = GoogleSearchAPIWrapper()

In [18]:
search.run("What is vitamin?")

"Dec 16, 2020 ... A vitamin is an organic compound, which means that it contains carbon. It is also an essential nutrient that the body may need to get from food. Sep 18, 2023 ... Total vitamin D intakes were three times higher with supplement use than with diet alone; the mean intake from foods and beverages alone for\xa0... The two main forms of vitamin A in the human diet are preformed vitamin A (retinol, retinyl esters), and provitamin A carotenoids such as alpha-carotene and\xa0... Nov 8, 2022 ... Vitamin D helps maintain strong bones. Learn how much you need, good sources, deficiency symptoms, and health effects here. Jan 19, 2023 ... Vitamins are a group of substances that are needed for normal cell function, growth, and development. There are 13 essential vitamins. This\xa0... Aug 12, 2022 ... Vitamin A also helps your heart, lungs, and other organs work properly. Carotenoids are pigments that give yellow, orange, and red fruits and\xa0... Aug 2, 2022 ... Vitamin D deficiency m

In [None]:
web_research_retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore,
    llm=llm,
    search=search,
)

In [None]:
!pip install -q -U html2text

In [19]:
from langchain.chains import RetrievalQAWithSourcesChain

# Answer a question from pages fetched by the web research retriever,
# and return the source URLs alongside the answer.
user_input = "Who is the winner of FIFA world cup 2002?"
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm, retriever=web_research_retriever)
result = qa_chain({"question": user_input})
result

INFO:langchain.retrievers.web_research:Generating questions for Google Search ...
INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'Who is the winner of FIFA world cup 2002?', 'text': LineList(lines=['1. "Which country won the FIFA World Cup in 2002?"\n', '2. "Who was the champion of the FIFA World Cup held in 2002?"\n', '3. "Who emerged as the victor in the FIFA World Cup 2002?"'])}
INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. "Which country won the FIFA World Cup in 2002?"\n', '2. "Who was the champion of the FIFA World Cup held in 2002?"\n', '3. "Who emerged as the victor in the FIFA World Cup 2002?"']
INFO:langchain.retrievers.web_research:Searching for relevant urls...
INFO:langchain.retrievers.web_research:Searching for relevant urls...
INFO:langchain.retrievers.web_research:Search results: [{'title': '2002 FIFA World Cup - Wikipedia', 'link': 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup', 'snippet': 'However

{'question': 'Who is the winner of FIFA world cup 2002?',
 'answer': 'Brazil is the winner of the FIFA World Cup 2002.\n',
 'sources': 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup'}
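
The logged questions above still carry their list numbering and surrounding quotes. A minimal sketch of how such lines could be normalized before being used as search queries is shown below. The helper name `strip_numbering_and_quotes` is hypothetical and introduced only for illustration; it is not claimed to be how `WebResearchRetriever` handles these lines internally.

```python
import re


def strip_numbering_and_quotes(line: str) -> str:
    """Hypothetical helper: remove a leading 'N. ' prefix, then outer whitespace and double quotes."""
    without_number = re.sub(r"^\s*\d+\.\s*", "", line)
    return without_number.strip().strip('"')


# Lines copied from the log output above.
raw_lines = [
    '1. "Which country won the FIFA World Cup in 2002?"\n',
    '2. "Who was the champion of the FIFA World Cup held in 2002?"\n',
    '3. "Who emerged as the victor in the FIFA World Cup 2002?"',
]

queries = [strip_numbering_and_quotes(line) for line in raw_lines]
for query in queries:
    print(query)
```

Each cleaned string could then be passed to `search.run(...)` on its own if you want to inspect the results per query.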

In [20]:
import logging

# Show the retriever's INFO-level logs (generated questions, search results, URLs).
logging.basicConfig()
logging.getLogger("langchain.retrievers.web_research").setLevel(logging.INFO)

user_input = "What is Task Decomposition in LLM Powered Autonomous Agents?"
docs = web_research_retriever.get_relevant_documents(user_input)

INFO:langchain.retrievers.web_research:Generating questions for Google Search ...
INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?', 'text': LineList(lines=['1. How does task decomposition work in LLM powered autonomous agents?\n', '2. What is the role of task decomposition in LLM powered autonomous agents?\n', '3. Can you explain the concept of task decomposition in LLM powered autonomous agents?'])}
INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How does task decomposition work in LLM powered autonomous agents?\n', '2. What is the role of task decomposition in LLM powered autonomous agents?\n', '3. Can you explain the concept of task decomposition in LLM powered autonomous agents?']
INFO:langchain.retrievers.web_research:Searching for relevant urls...
INFO:langchain.retrievers.web_research:Searching for relevant urls...
INFO:langchain.retrievers.web_research:S

In [None]:
import os
import re
from typing import List
from langchain.chains import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers.pydantic import PydanticOutputParser

# LLMChain
search_prompt = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search
    results. Generate 5 Google search queries that are similar to
    this question. The output should be a numbered list of questions and each
    should have a question mark at the end: {question}""",
)

class LineList(BaseModel):
    """List of questions."""

    lines: List[str] = Field(description="Questions")

class QuestionListOutputParser(PydanticOutputParser):
    """Output parser for a list of numbered questions."""

    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = re.findall(r"\d+\..*?\n", text)
        return LineList(lines=lines)

llm_chain = LLMChain(llm=llm, prompt=search_prompt, output_parser=QuestionListOutputParser())
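
Note that the pattern in `parse` only matches lines that end with a newline, so a final question without a trailing `\n` is not captured. This may be why the run below logs four questions although the prompt asks for five. The sketch below compares the pattern with a more tolerant variant that also accepts end of string; the sample text and the name `TOLERANT_PATTERN` are assumptions made for illustration.

```python
import re

# Pattern used in QuestionListOutputParser.parse above.
ORIGINAL_PATTERN = r"\d+\..*?\n"
# Assumed alternative: accept either a newline or the end of the text.
TOLERANT_PATTERN = r"\d+\..*?(?:\n|$)"

# Sample model output whose last line has no trailing newline.
llm_output = (
    "1. How can I recycle plastics effectively?\n"
    "2. What are the best practices for recycling plastics?\n"
    "3. Which methods are recommended for recycling plastics?"
)

strict_lines = re.findall(ORIGINAL_PATTERN, llm_output)
tolerant_lines = re.findall(TOLERANT_PATTERN, llm_output)

print(len(strict_lines), "lines matched by the original pattern")
print(len(tolerant_lines), "lines matched by the tolerant pattern")
```

If you want every generated question to reach the search step, swapping the tolerant pattern into `parse` is one option; appending a newline to `text` before matching would be another.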

In [21]:
# Initialize
web_research_retriever_llm_chain = WebResearchRetriever(vectorstore=vectorstore, llm_chain=llm_chain, search=search)

# Run
docs = web_research_retriever_llm_chain.get_relevant_documents("What is the recommended way to recycle plastics?")

INFO:langchain.retrievers.web_research:Generating questions for Google Search ...
INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is the recommended way to recycle plastics?', 'text': LineList(lines=['1. How can I recycle plastics effectively?\n', '2. What are the best practices for recycling plastics?\n', '3. Which methods are recommended for recycling plastics?\n', '4. What is the most efficient way to recycle plastics?\n'])}
INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How can I recycle plastics effectively?\n', '2. What are the best practices for recycling plastics?\n', '3. Which methods are recommended for recycling plastics?\n', '4. What is the most efficient way to recycle plastics?\n']
INFO:langchain.retrievers.web_research:Searching for relevant urls...
INFO:langchain.retrievers.web_research:Searching for relevant urls...
INFO:langchain.retrievers.web_research:Search results: [{'title': '7 Tips to Recycle

In [22]:
docs

[Document(page_content='_Solution: Just as the rule states, make sure your recyclables are clean,\nempty and dry. It’ll take seconds and if everyone did it, it would save tons\nof recyclables going to the landfill._\n\n## **4\\. Combined materials are trash**\n\nRecycling only works when like materials are together. Unfortunately, items\nlike plastic-coated coffee cups, laminated paper and paper-bubble wrap\nenvelopes from the mail can’t ever be separated, which means they’re trash.\n\n_Solution: Try to avoid buying nonrecyclable materials that can’t be\nseparated. And when you can, shop local to cut down on the carbon footprint of\nyour products._\n\n## **5\\. Know your plastics**\n\nNot all plastics are treated equally. Rigid plastics are recyclable, labeled\nby resin codes 1 through 7. Generally, the higher the number, the less\nrecyclable it is. Most recycling centers will recycle plastics 1 and 2 without\na problem. Past that, it gets tricky.\n\nFurthermore, a lot of plastic just 