## **Combine Internal Data with Web Data**

### **Introduction**

In this tutorial, you'll discover how to build a **hybrid agent** that combines real-time web data with your own internal knowledge sources. For enterprises, the true power of agents comes from integrating proprietary internal data with information from the public web.

By the end, you'll be able to:
- **Integrate** web search results and private documents into a single agent workflow
- **Enable** your agent to compare, contrast and synthesize information from both public and internal sources
- **Build** more accurate, up-to-date, and context-aware applications for research, analysis, and decision-making

> **Why is this approach valuable?**
> 
> It empowers your AI to answer questions with the most current information available online, while also leveraging your proprietary data - just like a human expert would.
>
> You'll unlock new capabilities for market research, competitive intelligence, and any scenario where both internal and external knowledge matter.

### **Setting Up**

In [1]:
# install dependencies
!uv add tavily-python langchain-google-genai langchain-chroma langchain langchain-tavily langgraph python-dotenv --quiet

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

tavily_api_key = os.getenv("TAVILY_API_KEY")
gg_api_key = os.getenv("GOOGLE_API_KEY")

if tavily_api_key is None or gg_api_key is None:
    print("Please set up your API keys in the .env file")
else:
    print("Get API_KEY successfully")

Get API_KEY successfully
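
The check above reports only that something is missing, not which key. A small helper can name the missing variables. This is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper name, not part of any package installed above.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["TAVILY_API_KEY", "GOOGLE_API_KEY"])
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("All API keys loaded")
```

Passing an explicit mapping as `env` makes the helper easy to exercise without touching the real environment.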


In [2]:
from tavily import TavilyClient
from langchain_google_genai import ChatGoogleGenerativeAI

tavily_client = TavilyClient(api_key=tavily_api_key)
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", api_key=gg_api_key)

# Test llm setup
response = llm.invoke(input="Hi")
print(response.content)

Hello! How can I help you today?


### **Vector Database Set Up**

In this tutorial, we'll use a pre-built vector database to simulate an internal knowledge store. The data is based on synthetically generated mock CRM documents, designed to support a CRM and sales insights use case.

To keep things simple, we've prepared a vector store using Chroma and Gemini embeddings, built from the mock CRM data.
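
For intuition about what a retriever does at query time, here is a toy stand-in that uses only the standard library. It does not reflect how Chroma or the Gemini embedding model work internally: word counts replace learned embeddings, and the `embed`, `cosine`, and `top_k` helpers and the sample snippets are illustrative assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (real models use dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(
        documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True
    )
    return ranked[:k]


# Made-up snippets standing in for CRM notes.
docs = [
    "Predictive maintenance alerts for factory robotics arms",
    "Dashboard integrations and drift alerting for models",
    "Compliance workshop and audit alignment",
]
best = top_k("robotics use case", docs, k=1)
print(best)
```

The real vector store follows the same pattern (embed the query, score it against stored vectors, return the closest documents), but with semantic embeddings, so a query can match text that shares no exact words with it.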

In [3]:
from langchain_chroma import Chroma
from langchain_google_genai.embeddings import GoogleGenerativeAIEmbeddings

# Initialize the embedding model
embedding_model = GoogleGenerativeAIEmbeddings(api_key=gg_api_key, model="text-embedding-004")

# Initialize the Chroma vector store
vector_store = Chroma(
    embedding_function=embedding_model,
    persist_directory="db",
    collection_name="crm"
)

In [4]:
# Test setup database

# Initialize the retriever
retriever = vector_store.as_retriever()

# Query the vector index
search_results = retriever.invoke("robotics use case")

In [5]:
for result in search_results:
    print(result.page_content)
    print("\n")

Tesla
Contact: Nathan Ash – Head of Software Integration
Sales Stage: Early Demo
Deal Size: $600K (initial scope)
Meetings:
May 10: Robotics analytics walkthrough.
May 14: Discussion with manufacturing systems team.
Needs:
Predictive maintenance alerts for Gigafactory robotics arms.
Offline deployment (low latency / no external traffic).
Objections:
Concern about vendor lock-in; prefers modular system.
Internal Notes:
Use on-prem reference from Vantage Energy rollout.
Next Steps:
Share local agent deployment option + performance benchmarks.


Microsoft
Contact: Jane Doe – Head of AI Platform Partnerships
Sales Stage: Technical Evaluation (Late Stage)
Deal Size: $720K ARR
Meetings:
April 4: Architecture deep dive with Azure AI team.
May 2: Compliance & audit alignment workshop.
Needs:
Integration with Azure AI Studio for LLM monitoring.
Compliance alerts for multi-tenant AI deployments.
Objections:
Wants FedRAMP Moderate within 6 months.
Internal Notes:
They’re replacing an internal too

In [6]:
from langchain_tavily import TavilySearch, TavilyExtract, TavilyCrawl

vector_search_tool = retriever.as_tool(
    name="vector_search",
    description="""
    Perform a vector search on our company's CRM data.
    """
)

search = TavilySearch(max_results=5, topic="general")
extract = TavilyExtract(extract_depth="advanced")
crawl = TavilyCrawl()



In [7]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model = "gemini-2.5-flash",
    api_key = gg_api_key
)

### **Hybrid Agent**

We'll now define our hybrid ReAct agent using LangGraph, illustrated in the diagram below.

The system prompt is a crucial component in guiding the agent's behavior, as it sets the context, boundaries, and instructions for how the agent should interact with available tools and data sources. A well-crafted system prompt ensures the agent understands what information it can access, how to use each tool effectively, and what standards to follow when generating responses.

The system prompt below explicitly outlines the agent's capabilities, including access to both public web tools (search, extract, crawl) and an internal vector search tool for proprietary CRM data. It details the purpose and best practices for each tool, emphasizes the importance of grounding answers in retrieved information, and prohibits the agent from making up facts or relying on prior knowledge. This comprehensive prompt helps the agent make informed decisions about when and how to use each tool, resulting in more accurate and reliable outputs.

The prompt clearly describes what information is available in the vector store, so the agent knows how to use it alongside the web tools.
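
The Thought/Action/Observation cycle that the prompt asks for can be sketched as a plain loop. This is a conceptual illustration only, not LangGraph's implementation: `scripted_policy` is a hypothetical stand-in for the LLM's decision step, and both tools return canned strings instead of calling Tavily or the vector store.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the real tools; each returns a canned observation.
TOOLS: dict[str, Callable[[str], str]] = {
    "vector_search": lambda q: f"[internal CRM notes matching '{q}']",
    "web_search": lambda q: f"[public web results for '{q}']",
}


def scripted_policy(
    question: str, observations: list[str]
) -> Optional[tuple[str, str]]:
    """Stand-in for the LLM: choose the next (tool, query), or None when done."""
    if len(observations) == 0:
        return ("vector_search", question)  # consult internal data first
    if len(observations) == 1:
        return ("web_search", question)  # then look at the public web
    return None  # enough information gathered


def run_react_loop(question: str, max_steps: int = 5) -> dict:
    """Repeat action/observation cycles until the policy stops or steps run out."""
    observations: list[str] = []
    trace: list[str] = []
    for _ in range(max_steps):
        action = scripted_policy(question, observations)
        if action is None:
            break
        tool_name, query = action
        observations.append(TOOLS[tool_name](query))
        trace.append(tool_name)
    # A real agent would have the LLM synthesize the answer from observations.
    return {"tools_used": trace, "answer": " + ".join(observations)}


result = run_react_loop("Google")
print(result["tools_used"])
```

In the real agent the model decides which tool to call next based on the system prompt and the observations so far, which is why the prompt spells out what each tool is for.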

In [8]:
import datetime
from langgraph.prebuilt import create_react_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

today = datetime.datetime.today().strftime("%A, %B %d, %Y")

system_prompt = f"""
You are a ReAct-style research agent equipped with the following tools:

        * **Tavily Web Search**
        * **Tavily Web Extract**
        * **Tavily Web Crawl**
        * **Internal Vector Search** (for proprietary CRM data)

        Your mission is to conduct comprehensive, accurate, and up-to-date research, using a combination of real-time public web information and internal enterprise knowledge.  All answers must be grounded in retrieved information from the tools provided. You are not permitted to make up information or rely on prior knowledge.

        **Today's Date:** {today}

        **Available Tools:**

        1. **Tavily Web Search**

        * **Purpose:** Retrieve relevant web pages based on a query.
        * **Usage:** Provide a search query to receive up to 10 semantically ranked results, each containing the title, URL, and a content snippet.
        * **Best Practices:**
            * Use specific queries to narrow down results.
                * Example: "OpenAI's latest product release" and "OpenAI's headquarters location" rather than "OpenAI's latest product release and headquarters location"
            * Optimize searches using parameters such as `search_depth`, `time_range`, `include_domains`, and `include_raw_content`.
            * Break down complex queries into specific, focused sub-queries.

        2. **Tavily Web Crawl**

        * **Purpose:** Explore a website's structure and gather content from linked pages for deep research and information discovery from a single source.
        * **Usage:** Input a base URL to crawl, specifying parameters such as `max_depth`, `max_breadth`, and `extract_depth`.
        * **Best Practices:**

            * Begin with shallow crawls and progressively increase depth.
            * Utilize `select_paths` or `exclude_paths` to focus the crawl.
            * Set `extract_depth` to "advanced" for comprehensive extraction.

        3. **Tavily Web Extract**

        * **Purpose:** Extract the full content from specific web pages.
        * **Usage:** Provide URLs to retrieve detailed content.
        * **Best Practices:**

            * Set `extract_depth` to "advanced" for detailed content, including tables and embedded media.
            * Enable `include_images` if image data is necessary.

        4. **Internal Vector Search**

        * **Purpose:** Search the proprietary CRM knowledge base.
        * **Usage:** Submit a natural language query to retrieve relevant context from the CRM vector store. Contains context about key accounts like Meta, Apple, Google, Amazon, Microsoft, and Tesla.
        * **Best Practices:**
            * When possible, refer to specific information, such as names, dates, product usage, or meetings.
            

        **Guidelines for Conducting Research:**

        * You may not answer questions based on prior knowledge or assumptions.
        * **Citations:** Always support findings with source URLs, clearly provided as in-text citations.
        * **Accuracy:** Rely solely on data obtained via provided tools (web or vector store); never fabricate information.
        * **If none of the tools return useful information, respond:**
            * "I'm sorry, but none of the available tools provided sufficient information to answer this question."

        **Research Workflow:**

        * **Thought:** Consider necessary information and next steps.
        * **Action:** Select and execute appropriate tools.
        * **Observation:** Analyze obtained results.
        **IMPORTANT: Repeat Thought/Action/Observation cycles as needed. You MUST only respond to the user once you have gathered all the information you need.**
        * **Final Answer:** Synthesize and present findings with citations in markdown format.
        
        **Example Workflow:**

        **Workflow 4: Combine Web Tools and Vector Search**

        **Question:** What has Apple publicly shared about their AI strategy, and do we have any internal notes on recent meetings with them?

        * **Thought:** I’ll start by looking for Apple’s public statements on AI using a search.
        * **Action:** Tavily Web Search with the query "Apple AI strategy 2024" and `time_range` set to "month".
        * **Observation:** Found a recent article from Apple’s newsroom and a keynote transcript.
        * **Thought:** I want to read the full content of Apple’s keynote.
        * **Action:** Tavily Web Extract on the URL from the search result.
        * **Observation:** Extracted detailed content from Apple’s announcement.
        * **Thought:** Now, I’ll check our internal CRM data for recent meeting notes with Apple.
        * **Action:** Internal Vector Search with the query "recent Apple meetings AI strategy".
        * **Observation:** Retrieved notes from a Q2 strategy sync with Apple’s enterprise team.
        * **Final Answer:** Synthesized response combining public and internal data, with citations.
        ---

        You will now receive a research question from the user:
"""

hybrid_agent = create_react_agent(
    model=llm,
    tools=[search, crawl, extract, vector_search_tool],
    prompt=ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            MessagesPlaceholder(variable_name="messages")
        ]
    ),
    name="hybrid_agent"
)

In [9]:
from langchain_core.messages import HumanMessage

inputs = {
    "messages": [
        HumanMessage(
            content="Search for the latest news on google relevant to our current CRM data on them"
        )
    ]
}

for s in hybrid_agent.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    if isinstance(message, tuple):
        print(message)
    else: 
        message.pretty_print()


Search for the latest news on google relevant to our current CRM data on them
Name: hybrid_agent
Tool Calls:
  vector_search (85057a0d-e31b-474b-8fbd-b653c69bed8b)
 Call ID: 85057a0d-e31b-474b-8fbd-b653c69bed8b
  Args:
    query: CRM data on Google
Name: vector_search

[Document(id='b8374cc4-3326-41a0-bf03-ff08336251a8', metadata={'page_label': '1', 'source': './docs\\google.pdf', 'title': 'google', 'producer': 'macOS Version 15.1 (Build 24B2083) Quartz PDFContext', 'page': 0, 'creationdate': "D:20250526163349Z00'00'", 'creator': 'TextEdit', 'moddate': "D:20250526163349Z00'00'", 'total_pages': 1}, page_content='Google\nContact: Alex Zhang – Sr. Director, Cloud ML Ops\nSales Stage: Closed Won\nDeal Size: $310K ARR\nMeetings:\nMarch 18: Risk monitoring POC kickoff.\nApril 10: Training session with model reliability team.\nNeeds:\nAlerting on drift in high-traffic models.\nDashboard integrations with Looker.\nObjections:\nNone — deployment complete, expanding to Ads org.\nInternal Notes:

In [10]:
from IPython.display import Markdown

Markdown(message.content)

Our CRM data indicates Google's interest in "Alerting on drift in high-traffic models" and "Dashboard integrations with Looker," along with a request for a "user-level event tagging feature from roadmap."

Here's the latest news relevant to these areas:

*   **Looker Integration:** Google is enabling the use of Gemini and open-source text embedding models directly within BigQuery and Looker, enhancing dashboard integrations [1].
*   **Model Drift Alerting:** Forbes highlighted the importance of tracking AI model versioning, usage logging, and drift monitoring, along with embedding AI observability tools with alerts to ensure performance in cloud observability [3].
*   **AI Agents for Automation:** Google has released five new AI agents designed to automate data pipelines, migrations, notebooks, dashboards, and GitHub reviews, which could indirectly contribute to improved MLOps and alerting capabilities [4].
*   **General AI/ML Advancements:** Google Cloud's blog features "101 real-world gen AI use cases with technical blueprints," showcasing their continuous development in AI and Machine Learning [2].

While there isn't specific news about a "user-level event tagging feature from roadmap," the broader advancements in AI, MLOps, and Looker integration align with Google's stated needs.

**Citations:**
[1] https://www.linkedin.com/posts/rob-staples-3468b613_new-way-now-how-nabc-is-growing-the-blueberry-activity-7363902671482728448-KEFk
[2] https://cloud.google.com/blog/products/ai-machine-learning/real-world-gen-ai-use-cases-with-technical-blueprints
[3] https://www.forbes.com/councils/forbestechcouncil/2025/10/10/cloud-observability-challenges-at-scale-and-how-to-solve-them/
[4] https://www.linkedin.com/posts/jafarnajafov_google-just-dropped-5-new-ai-agents-that-activity-7365049973668851713-if4J
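
Two of the four citations above are LinkedIn posts, so it can be worth inspecting which domains an answer cites before relying on it. The following is a minimal sketch using only the standard library; `cited_domains` is a hypothetical helper, the sample URLs are made up, and the regular expression is a simplification that may mishandle unusual URLs or trailing punctuation.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified URL matcher: stops at whitespace, ")" or "]".
URL_PATTERN = re.compile(r"https?://[^\s)\]]+")


def cited_domains(answer: str) -> Counter:
    """Count distinct cited URLs per domain in `answer`."""
    domains = []
    # De-duplicate first so markdown links like [url](url) count once.
    for url in sorted(set(URL_PATTERN.findall(answer))):
        host = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same source.
        domains.append(host[4:] if host.startswith("www.") else host)
    return Counter(domains)


# Made-up citation list in the same shape as the agent's output.
sample = """
[1] https://www.linkedin.com/posts/example-post
[2] https://cloud.google.com/blog/example
[3] https://www.linkedin.com/posts/another-post
"""
counts = cited_domains(sample)
print(counts.most_common())
```

With the agent, `cited_domains(message.content)` would give the same breakdown for a real answer, assuming `message.content` is a string.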

In [11]:

# Test the hybrid agent
inputs = {
    "messages": [
        HumanMessage(
            content="Check for the Google Deal size and find the latest earnings report for them to validate if they are in spending spree"
        )
    ]
}

# Stream the agent's response
for s in hybrid_agent.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    if isinstance(message, tuple):
        print(message)
    else:
        message.pretty_print()


Check for the Google Deal size and find the latest earnings report for them to validate if they are in spending spree
Name: hybrid_agent
Tool Calls:
  tavily_search (e1762a04-f5f0-4ed7-9c8e-e97260ea25f5)
 Call ID: e1762a04-f5f0-4ed7-9c8e-e97260ea25f5
  Args:
    search_depth: advanced
    query: Google deal size OR Google acquisitions
Name: tavily_search

{"query": "Google deal size OR Google acquisitions", "follow_up_questions": null, "answer": null, "images": [], "results": [{"url": "https://www.forex.com/en-us/trading-guides/google-acquisition-history/", "title": "Google's Biggest Acquisitions: What Does Google Own? - FOREX.com", "content": "Google’s biggest acquisition to date is Motorola Mobility, which it bought for $12.5 billion but was later sold to Lenovo for just a quarter of the price ($2.9 billion). Of Google’s current holdings, the largest acquisition deal was Nest for $3.2 billion.\n\nHere are Google’s ten biggest acquisitions. [...] Google owns approximately 256 compani

In [12]:
Markdown(message.content)

Google has made a number of significant acquisitions throughout its history. Its largest acquisition to date is the cloud security company Wiz for $33 billion in March 2025. Other notable acquisitions include Motorola Mobility for $12.5 billion (later sold), Nest for $3.2 billion, DoubleClick for $3.1 billion, and YouTube for $1.65 billion. Google has acquired approximately 256 companies for a total cost of about $20.89 billion (excluding the Wiz acquisition). In the last five years (2019-2024), Google averaged 5 acquisitions per year, with 2 acquisitions completed in 2025 so far (as of September 2025). This includes Galileo AI and Wiz in 2025. [https://www.forex.com/en-us/trading-guides/google-acquisition-history/](https://www.forex.com/en-us/trading-guides/google-acquisition-history/) [https://www.cbinsights.com/research/google-biggest-acquisitions-infographic/](https://www.cbinsights.com/research/google-biggest-acquisitions-infographic/) [https://tracxn.com/d/acquisitions/acquisitions-by-google/__8zKOTB9XR934x_3BSnEVSrmsu3RZtEv3AorQtuBb2Yk](https://tracxn.com/d/acquisitions/acquisitions-by-google/__8zKOTB9XR934x_3BSnEVSrmsu3RZtEv3AorQtuBb2Yk)

In terms of their latest earnings, Alphabet Inc. (Google's parent company) reported a consolidated revenue of $96.4 billion in Q2 2025, which is a 14% increase compared to Q2 2024. Google Services operating income increased 17% to $32.7 billion, and the operating margin increased from 39.6% to 42.3%. [https://finance.yahoo.com/news/alphabet-inc-goog-q2-2025-071346968.html](https://finance.yahoo.com/news/alphabet-inc-goog-q2-2025-071346968.html) [https://abc.xyz/investor/events/event-details/2025/2025-Q1-Earnings-Call/](https://abc.xyz/investor/events/event-details/2025/2025-Q1-Earnings-Call/)

Given the significant acquisition of Wiz for $33 billion in March 2025, along with other acquisitions and strong revenue growth reported in Q2 2025, it suggests that Google is actively investing and expanding, which could be interpreted as a "spending spree" in strategic areas, particularly in cloud security.
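
The answer above discusses acquisitions rather than the deal size recorded in the CRM, and the truncated trace shows only a `tavily_search` call. One way to check which tools a run actually invoked is to scan the message history for tool calls. This is a minimal sketch: `tools_called` is a hypothetical helper, it assumes each AI message exposes a `tool_calls` list of dicts with a `name` key (as LangChain's `AIMessage` does), and the demo uses `SimpleNamespace` objects in place of real messages.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def tools_called(messages: Iterable[Any]) -> list[str]:
    """Return tool names in call order, read from each message's `tool_calls`."""
    names: list[str] = []
    for message in messages:
        # Messages without tool calls (human, tool, final AI) contribute nothing.
        for call in getattr(message, "tool_calls", None) or []:
            names.append(call["name"])
    return names


# Stand-in messages; with the real agent, pass the final state's "messages" list.
history = [
    SimpleNamespace(content="question"),
    SimpleNamespace(content="", tool_calls=[{"name": "tavily_search", "args": {}}]),
    SimpleNamespace(content="final answer", tool_calls=[]),
]
used = tools_called(history)
needs_review = "vector_search" not in used
print(used, needs_review)
```

With the real agent, the `s["messages"]` list from the last streamed state can be passed in. If `vector_search` is absent, rephrasing the question to mention the CRM explicitly may steer the agent toward the internal data.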

### **Conclusion**

Throughout these tutorials, you've learned how to:
- **Set up the Tavily API** for real-time web search, extraction and crawling.
- **Design agent architectures** that combine both web and internal knowledge sources.
- **Integrate multiple tools** - including web search, extract, crawling and vector search - into a unified, powerful agent system.
- **Apply these agents** to real-world scenarios, such as researching companies and analyzing news.

By leveraging Tavily's web API, your agents can now access up-to-date information beyond their training data, making them more effective and relevant in fast-changing business environments.