### 🚀 WaterCrawl × FLARE — the perfect duo for RAG playgrounds!

Welcome to this **step‑by‑step Jupyter Notebook** where we:

1. 🕷 **Crawl & clean** any website with **WaterCrawl** – turning raw HTML into markdown/JSON that’s ready for embeddings.   
2. 🔍 **Retrieve on‑the‑fly** with **FLARE (Forward‑Looking Active REtrieval)** – an “always‑be‑fact‑checking” wrapper that pulls extra docs *only* when the LLM shows low confidence.  
3. 🛠 **Tie it all together** with **LangChain**, **Tavily Search API** & a few helper utils so you can remix the pipeline to your heart’s content.

---

#### What’s inside?

| 🔧 Component | 💡 Why we’re using it |
|--------------|----------------------|
| **WaterCrawl** | Point‑&‑shoot crawling with sitemap visualizer, duplicate detection, and markdown/JSON exports – perfect for vector DB ingestion. |
| **LangChain** | Glue layer that lets us chain the crawl → embed → FLARE retrieval steps with a few lines of code. |
| **Tavily Search API** | Fast, inexpensive web search that slots into `TavilyRetriever`; great complement to your own crawled corpora. |
| **FLARE** | Re‑checks the model’s “next sentence” for shaky tokens; if confidence is low, it auto‑generates a smart query and fetches fresh docs before writing. |

---

#### Notebook flow 🗺️

1. **Setup**: grab your API keys from https://watercrawl.dev/, or spin up your own `watercrawl` instance from https://github.com/watercrawl/watercrawl. Calling the WaterCrawl API from Python requires the Python SDK, which we install in the next step.
2. **FLARE chain**: initialize `FlareChain.from_llm(llm, retriever=retriever)` with a custom retriever that searches the web (WaterCrawl Search or **Tavily**) and scrapes the hits with WaterCrawl.  
3. **Ask away!**: watch FLARE pause, retrieve, and resume writing—as many times as needed—to give rock‑solid answers.  
4. **Extras**: show off the visual sitemap PNG WaterCrawl generated and link each node to its vector IDs.  

---

### Why you’ll ❤️ this combo

- **Less hallucination, more citation**: WaterCrawl hands FLARE pristine, source‑mapped text, so every sentence can be traced back to a URL.  
- **Pay only for what you need**: FLARE calls Tavily *selectively*, not on every token—so your search bill stays tiny.  
- **Drop‑in for any stack**: swap Tavily for your own BM25/Elastic/Weaviate retriever, or point WaterCrawl at authenticated intranet sites.  
- **Open‑source all the way**: MIT‑style licences on both projects mean you can fork, tweak, and ship to prod.

> **Tip:** if you’re new to WaterCrawl, follow the quick start at https://github.com/watercrawl/watercrawl?tab=readme-ov-file#-quick-start, then open `http://localhost` after `docker compose up -d` and explore the Playground UI (selector testing & screenshot capture included)! 🎨

---

Ready? Let’s spin up containers and start crawling! 🏁


##### ➡️ **Let’s install 📦 all the dependencies:** 


In [None]:
!pip install --upgrade pip
!pip install langchain langchain-community langchain-core langchain-openai notebook watercrawl-py tavily-python python-dotenv requests
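
Before moving on, it can help to confirm that the packages really landed in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are taken from the install command in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as they appear in the pip command.
wanted = ["langchain-community", "langchain-core", "langchain-openai", "watercrawl-py"]
for name, ver in installed_versions(wanted).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any line that reports `NOT INSTALLED` usually means the notebook kernel is running in a different environment than the one `pip` installed into.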


### ➡️ 🔑 **API keys you’ll need (grab these first!)** 

| Service | What it’s for | Where to generate |
|---------|---------------|-------------------|
| **WaterCrawl** | Auth for crawling endpoints | <https://app.watercrawl.dev/dashboard/api-keys> |
| **OpenAI** | LLM + embeddings | <https://platform.openai.com/api-keys> |
| **WaterCrawl Search or Tavily Search** | Web search for FLARE | <https://app.watercrawl.dev/dashboard/search> or <https://app.tavily.com/home> |

---

Option 1 – keep it clean: use a `.env` file ✅


Create the file **once**, store your keys, and everything else “just works”.

```python
# ── create_env.py ──
# Writes a template .env file; replace the placeholders with your real keys.
env_text = """
OPENAI_API_KEY=put-your-api-key-here
WATERCRAWL_API_KEY=put-your-api-key-here
# Only needed if you are using Tavily Search:
TAVILY_API_KEY=put-your-api-key-here
""".strip()

with open(".env", "w") as f:
    f.write(env_text + "\n")
print(".env file created. Now edit it with your real keys ✏️")
```

---

Option 2 – quick‑and‑dirty: hard‑code in the notebook ⚠️

```python
OPENAI_API_KEY = "put-your-api-key-here"
WATERCRAWL_API_KEY = "put-your-api-key-here"
TAVILY_API_KEY = "put-your-api-key-here"  # only if you are using Tavily Search
```

Not recommended: anyone who sees or commits the notebook can read your keys.





##### ➡️ **If you’re using a `.env` file, load the API keys with dotenv** 


In [18]:

from dotenv import load_dotenv
import os

load_dotenv()  # pulls everything from .env

OPENAI_API_KEY   = os.environ.get("OPENAI_API_KEY")
WATERCRAWL_API_KEY = os.environ.get("WATERCRAWL_API_KEY")
# If you are using Tavily Search: 
TAVILY_API_KEY   = os.environ.get("TAVILY_API_KEY") 
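
Because `os.environ.get` silently returns `None` for a missing variable, a quick check right after loading can save a confusing authentication error later. This is a small sketch, not part of any SDK; `missing_keys` is a hypothetical helper, and `TAVILY_API_KEY` is deliberately left out because it is only needed for Tavily Search.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_keys(["OPENAI_API_KEY", "WATERCRAWL_API_KEY"])
if missing:
    print("Missing keys: " + ", ".join(missing))
else:
    print("All required keys are set.")
```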


##### ➡️ **Import our packages**:

In [3]:
from typing import Any, Callable, List, Optional

import requests
from langchain.callbacks.manager import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)
from langchain.chains import FlareChain
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI, OpenAI
from pydantic import BaseModel, Field
from watercrawl import WaterCrawlAPIClient



#### ➡️ **Let’s build our search functions**:

In [None]:

def tavily_search_tool(
    query: str,
    api_key: str,
    max_results: int = 3,
    *,
    topic: str = "general",
    depth: str = "basic"          # "basic" | "advanced"
) -> List[str]:
    """
    Search the web with Tavily and return a list of result URLs.

    Parameters
    ----------
    query : str
        The search query.
    api_key : str
        Your Tavily API key.
    max_results : int, default 3
        Maximum number of links to return.
    topic : str, default "general"
        Topical preset that adjusts result ranking.
    depth : str, default "basic"
        Search depth ("basic" is fastest; "advanced" gathers more context).
    """
    url = "https://api.tavily.com/search"
    payload = {
        "query": query,
        "topic": topic,
        "search_depth": depth,
        "max_results": max_results,
        "include_answer": False,
        "include_raw_content": False,
        "include_domains": [],
        "exclude_domains": [],
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    try:
        response = requests.post(url, json=payload, headers=headers, timeout=15)
        response.raise_for_status()
        data = response.json()

        # `data["results"]` is a list of hits.
        urls = [
            hit["url"] for hit in data.get("results", []) if isinstance(hit, dict) and hit.get("url")
        ][:max_results]

        print(urls)  # optional debug print
        return urls

    except Exception as e:
        print(f"⚠️  Tavily search error: {e}")
        return []


def watercrawl_search_tool(
    query: str,
    api_key: str,
    max_results: int = 3,
    *,
    language: str = "en",
    country: str = "us",
    time_range: str = "month",   # "day" | "week" | "month" | "year" | "all"
    depth: str = "basic"         # "basic" | "deep"
) -> List[str]:
    """
    Search the web with WaterCrawl and return a list of result URLs.

    Parameters
    ----------
    query : str
        The search query.
    api_key : str
        Your WaterCrawl API key.
    max_results : int, default 3
        Maximum number of links to return.
    language : str, default "en"
        ISO‑639‑1 language code for the search.
    country : str, default "us"
        ISO‑3166‑1 alpha‑2 country code used to geotarget the search.
    time_range : str, default "month"
        How recent the results should be.
    depth : str, default "basic"
        Search depth ("basic" is fastest; "deep" crawls additional links).
    """
    client = WaterCrawlAPIClient(api_key)

    try:
        search_request = client.create_search_request(
            query=query,
            search_options={
                "depth": depth,
                "language": language,
                "country": country,
                "time_range": time_range,
                "search_type": "web",
            },
            result_limit=max_results,
            sync=True,     # wait for results
            download=True,  # return the hits inline so "result" can be iterated below
        )

        # `search_request["result"]` is a list of hits.
        urls = [hit["url"] for hit in search_request["result"] if "url" in hit][:max_results]

        print(urls)  # optional debug print
        return urls

    except Exception as e:
        print(f"⚠️  WaterCrawl search error: {e}")
        return []
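
Both helpers end by reducing a search response to a plain list of URLs. That step can be isolated into a small pure function, which makes it testable without network access. A minimal sketch, assuming each hit is a dictionary with a `url` key as in the code above; the name `extract_urls` and the de-duplication are illustrative additions, not part of either API.

```python
from typing import Any, Iterable, List


def extract_urls(hits: Iterable[Any], max_results: int = 3) -> List[str]:
    """Collect up to `max_results` unique, non-empty URLs, keeping hit order."""
    urls: List[str] = []
    seen = set()
    for hit in hits:
        if len(urls) >= max_results:
            break
        if not isinstance(hit, dict):
            continue  # ignore malformed entries
        url = hit.get("url")
        if isinstance(url, str) and url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample_hits = [
    {"url": "https://watercrawl.dev/"},
    {"title": "no url here"},
    {"url": "https://watercrawl.dev/"},  # duplicate, skipped
    {"url": "https://docs.watercrawl.dev/intro"},
]
print(extract_urls(sample_hits))
```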



#### ➡️ **Let’s build the WaterCrawl retriever**:

In [15]:
class WaterCrawlRetriever(BaseRetriever, BaseModel):
    """
    Retrieve web pages with either *WaterCrawl* or *Tavily* search, then scrape
    the HTML via WaterCrawl for LangChain ingestion.

    The default search engine is **WaterCrawl**; switch to Tavily by passing
    ``search_tool=tavily_search_tool`` at init time.
    """

    client: WaterCrawlAPIClient
    tavily_api_key: Optional[str] = None
    watercrawl_api_key: Optional[str] = None
    search_tool: Callable[[str, str, int], List[str]] = Field(
        default_factory=lambda: watercrawl_search_tool
    )

    # Scraper config
    page_options: dict = {
        "exclude_tags": ["nav", "footer", "aside"],
        "include_tags": ["article", "main"],
        "wait_time": 100,
        "include_html": False,
        "only_main_content": True,
        "include_links": False,
    }

    # --------------------------------------------------------------------- #
    # Sync                                                                   #
    # --------------------------------------------------------------------- #
    def _get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
        **kwargs: Any,
    ) -> List[Document]:
        # Choose the right API key for the selected search function
        if self.search_tool is tavily_search_tool:
            api_key = self.tavily_api_key
        else:  # default: WaterCrawl search
            api_key = (
                self.watercrawl_api_key
                or getattr(self.client, "api_key", None)  # fallback to client key
            )

        documents: List[Document] = []

        try:
            urls = self.search_tool(query, api_key, max_results=3)
            for url in urls:
                try:
                    result = self.client.scrape_url(
                        url=url,
                        page_options=self.page_options,
                        sync=True,
                        download=True,
                    )
                    content = result.get("content", "")
                    if content:
                        documents.append(
                            Document(page_content=content, metadata={"source": url})
                        )
                except Exception as e:  # pragma: no cover
                    print(f"⚠️  Scrape failed for {url}: {e}")
        except Exception as e:  # pragma: no cover
            print(f"⚠️  Search failed: {e}")

        return documents

    # --------------------------------------------------------------------- #
    # Async                                                                  #
    # --------------------------------------------------------------------- #
    async def _aget_relevant_documents(
        self,
        query: str,
        *,
        run_manager: AsyncCallbackManagerForRetrieverRun,
        **kwargs: Any,
    ) -> List[Document]:
        # simple fall‑back to sync in worker thread
        from asyncio import to_thread

        return await to_thread(self._get_relevant_documents, query, run_manager=run_manager)
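
The `Scrape failed ... 'NoneType' object has no attribute 'get'` lines in the run output further down suggest that a scrape call can come back empty for some URLs. One way to handle that without relying on the `except` branch is a small guard like the sketch below. `content_from_result` is a hypothetical helper, and the `"content"` key simply mirrors the retriever above; the actual response shape of the WaterCrawl SDK is not verified here.

```python
from typing import Any, Optional


def content_from_result(result: Optional[Any], key: str = "content") -> str:
    """Return the text stored under `key`, or "" when the result is unusable."""
    if not isinstance(result, dict):
        return ""  # covers None and any unexpected type
    value = result.get(key)
    return value.strip() if isinstance(value, str) else ""


print(repr(content_from_result(None)))
print(repr(content_from_result({"content": "  # Title\nBody  "})))
```

Inside the retriever, a guard like this would let the loop skip empty pages quietly and reserve the warning message for genuine request errors.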

In [16]:
watercrawl_search_tool(query, WATERCRAWL_API_KEY, max_results=3)

['https://github.com/langgenius/dify/releases', 'https://liduos.com/en/ai-develope-tools-series-3-open-source-ai-web-crawler-frameworks.html', 'https://github.com/langgenius/dify-plugins']


['https://github.com/langgenius/dify/releases',
 'https://liduos.com/en/ai-develope-tools-series-3-open-source-ai-web-crawler-frameworks.html',
 'https://github.com/langgenius/dify-plugins']

#### ➡️ **Create the LangChain retriever object using the `WaterCrawlRetriever` we built above**:

In [25]:

retriever = WaterCrawlRetriever(client=WaterCrawlAPIClient(api_key=WATERCRAWL_API_KEY), watercrawl_api_key=WATERCRAWL_API_KEY)

# If you have a Tavily API key, also pass `search_tool` so Tavily is actually used for the search step:
# retriever = WaterCrawlRetriever(client=WaterCrawlAPIClient(api_key=WATERCRAWL_API_KEY), watercrawl_api_key=WATERCRAWL_API_KEY,
#     tavily_api_key=TAVILY_API_KEY, search_tool=tavily_search_tool)
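
The retriever calls its `search_tool` as `search_tool(query, api_key, max_results=3)` and expects a list of URLs back, so any callable with that shape can stand in for a real search engine. The sketch below builds an offline stand-in for experiments and tests. It is hypothetical throughout: `make_static_search_tool` and the canned URLs are not part of WaterCrawl, Tavily, or LangChain.

```python
from typing import Callable, Dict, List, Optional


def make_static_search_tool(canned: Dict[str, List[str]]) -> Callable[..., List[str]]:
    """Build a search function that answers from a fixed keyword -> URLs table."""

    def static_search_tool(
        query: str, api_key: Optional[str] = None, max_results: int = 3
    ) -> List[str]:
        # `api_key` is accepted only to match the other search helpers.
        lowered = query.lower()
        urls: List[str] = []
        for keyword, hits in canned.items():
            if keyword.lower() not in lowered:
                continue
            for url in hits:
                if url not in urls:
                    urls.append(url)
        return urls[:max_results]

    return static_search_tool


offline_search = make_static_search_tool(
    {"watercrawl": ["https://watercrawl.dev/", "https://docs.watercrawl.dev/intro"]}
)
print(offline_search("What is WaterCrawl?", None, max_results=1))
```

Assuming the retriever's `search_tool` field accepts arbitrary callables, as its type hint suggests, it could be passed as `search_tool=offline_search` when constructing `WaterCrawlRetriever`.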

#### ➡️ **FLARE Chain**

In [20]:
# We set this so we can see what exactly is going on
from langchain.globals import set_verbose
set_verbose(True)

In [26]:
llm = ChatOpenAI(model="gpt-4o", temperature=0)
flare = FlareChain.from_llm(
    llm,
    retriever=retriever,
    max_generation_len=164,
    min_prob=0.3,
)

In [27]:
query = "Explain what is watercrawl tool and how I can improve the LLM performance?"

In [28]:
flare.invoke(query)



> Entering new FlareChain chain...
Current Response: 
Generated Questions: ['What type of software is the Watercrawl tool?', 'What type of models does the Watercrawl tool analyze and optimize for performance?', 'How can you use the Watercrawl tool to identify areas where the LLM may be underperforming and provide suggestions for improvement?']
['https://github.com/watercrawl/watercrawl-dify-plugin/blob/main/README.md', 'https://watercrawl.dev/', 'https://www.reddit.com/r/estimators/comments/10y3a0q/best_large_commercial_estimating_software/']
['https://www.usna.edu/AeroDept/_files/AY21_Capstone.pdf', 'https://github.com/langgenius/dify/releases', 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9100143/']
⚠️  Scrape failed for https://www.usna.edu/AeroDept/_files/AY21_Capstone.pdf: 'NoneType' object has no attribute 'get'
⚠️  Scrape failed for https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9100143/: 'NoneType' object has no attribute 'get'
[]

{'user_input': 'Explain what is watercrawl tool and how I can improve the LLM performance?',
 'response': "The Watercrawl tool is a software program used for web crawling and data extraction. It helps gather information from websites and analyze it for various purposes. To improve the LLM (Large Language Model) performance, you can optimize the data extraction process by using Watercrawl to efficiently collect relevant data that can be used to train and fine-tune the LLM model. This will ensure that the model has high-quality input data to learn from, leading to better performance. Additionally, you can also consider fine-tuning the hyperparameters of the LLM model based on the insights gained from the data extracted using Watercrawl. This can help enhance the model's accuracy and efficiency. "}

#### ➡️ **Now let’s run a plain OpenAI call so we can see the value of the FLARE chain**
#### For the test query we provided, the same LLM without retrieval gives a completely wrong answer!

In [24]:
llm.invoke(query)

"\n\nWatercrawl is a web performance testing tool that helps in analyzing the load and stress on a website or web application. It simulates real-world user traffic and measures the website's response time, throughput, and server performance under different load conditions.\n\nTo improve the LLM (Load, Latency, and Memory) performance using Watercrawl, the following steps can be taken:\n\n1. Identify bottlenecks: Watercrawl helps in identifying the areas of the website that are causing performance issues. It provides detailed reports on page load times, HTTP requests, and server response times, which can help in identifying the bottlenecks.\n\n2. Optimize website code: Based on the reports generated by Watercrawl, developers can optimize the website's code to reduce page load times and improve server response times. This can include techniques like minimizing HTTP requests, optimizing images, and using caching mechanisms.\n\n3. Test under different load conditions: Watercrawl allows tes

##### 🚨⚠️ As you can see, for the test query we provided, the answer from the **same LLM** is **completely wrong** ❌🤯‼️

> 💬 It confidently gives a **wrong answer** — showing **why refinement and retrieval matter** so much in real-world usage.


#### ➡️ **For further information**:
##### 📘 Introduction to FlareChain in LangChain

**FlareChain** is an advanced chain in the LangChain framework 🧠⚙️ designed to *iteratively refine answers* from a language model. It improves response quality by:

🔍 Identifying **low-confidence** spans  
❓ Generating **clarifying questions**  
📚 Retrieving **relevant context**  
🔁 Updating the answer in a loop
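
The loop described above can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: `generate`, `make_questions`, and `retrieve` are hypothetical callables supplied by the caller, and a draft is modeled as a list of `(token, probability)` pairs.

```python
from typing import Callable, List, Tuple

Draft = List[Tuple[str, float]]  # (token, probability) pairs


def flare_loop(
    question: str,
    generate: Callable[[str, str, str], Draft],
    make_questions: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    min_prob: float = 0.2,
    max_iter: int = 10,
) -> str:
    """Grow an answer chunk by chunk, retrieving only when a chunk looks shaky."""
    answer = ""
    for _ in range(max_iter):
        draft = generate(question, "", answer)
        if not draft:
            break  # the generator signals completion with an empty draft
        shaky = [tok for tok, prob in draft if prob < min_prob]
        if shaky:
            # Low confidence: ask targeted questions, fetch context, redraft.
            text = "".join(tok for tok, _ in draft)
            context = "\n".join(
                doc for q in make_questions(text, shaky) for doc in retrieve(q)
            )
            draft = generate(question, context, answer)
        answer += "".join(tok for tok, _ in draft)
    return answer


def toy_generate(question: str, context: str, answer: str) -> Draft:
    """Toy model: unsure without context, confident once context is supplied."""
    if answer:
        return []  # one chunk is enough for the demo
    if context:
        return [("WaterCrawl ", 0.9), ("is a web crawler.", 0.9)]
    return [("WaterCrawl ", 0.9), ("is a load tester.", 0.1)]


print(
    flare_loop(
        "What is WaterCrawl?",
        generate=toy_generate,
        make_questions=lambda text, shaky: ["What kind of tool is WaterCrawl?"],
        retrieve=lambda q: ["WaterCrawl crawls websites into markdown/JSON."],
    )
)
```

In the toy run the first draft contains a low-probability token, so the loop retrieves context once and redrafts; a confident first draft would skip retrieval entirely.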

---

#### 🧩 Key Arguments of `FlareChain`

##### ❓ 1. `question_generator_chain`
Generates questions from the low-confidence spans; these questions are what gets sent to the retriever.

##### 🗣 2. `response_chain`
Generates the actual response using user input + context.

##### 🧾 3. `output_parser`
Checks whether the current answer is “good enough” to stop refinement.

##### 📡 4. `retriever`
Fetches documents to provide factual backup for refining the answer.

##### 📉 5. `min_prob`
Low-confidence threshold (default: `0.2`) – tokens below this are flagged for review.

##### ↔️ 6. `min_token_gap`
Ensures separation between two flagged spans (default: `5` tokens).

##### 🧷 7. `num_pad_tokens`
Adds context tokens around flagged spans (default: `2`).

##### 🔁 8. `max_iter`
Max number of refinement cycles (default: `10`).

##### 🧭 9. `start_with_retrieval`
If `True`, starts by retrieving context even before generating the first draft.
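
How `min_prob`, `min_token_gap`, and `num_pad_tokens` interact is easier to see in code. The following is a simplified sketch of span detection, not LangChain's source: the function name `low_confidence_spans` is hypothetical, it assumes one log-probability per token, and in this sketch the padding tokens are appended after each flagged token.

```python
import math
import re
from typing import List, Sequence


def low_confidence_spans(
    tokens: Sequence[str],
    log_probs: Sequence[float],
    min_prob: float = 0.2,
    min_token_gap: int = 5,
    num_pad_tokens: int = 2,
) -> List[str]:
    """Return text spans built around tokens whose probability is below min_prob."""
    low_idx = [
        i
        for i, (tok, lp) in enumerate(zip(tokens, log_probs))
        if math.exp(lp) < min_prob and re.search(r"\w", tok)  # skip pure punctuation
    ]
    if not low_idx:
        return []

    spans = [[low_idx[0], low_idx[0] + num_pad_tokens + 1]]
    for prev, idx in zip(low_idx, low_idx[1:]):
        end = idx + num_pad_tokens + 1
        if idx - prev < min_token_gap:
            spans[-1][1] = end  # close to the previous flag: extend that span
        else:
            spans.append([idx, end])  # far enough away: start a new span
    return ["".join(tokens[start:end]) for start, end in spans]


tokens = ["Water", "Crawl", " is", " a", " load", " tester", "."]
probs = [0.9, 0.9, 0.95, 0.9, 0.1, 0.15, 0.99]
log_probs = [math.log(p) for p in probs]
print(low_confidence_spans(tokens, log_probs))
```

Here the two adjacent low-probability tokens fall within `min_token_gap` of each other, so they are merged into a single span rather than triggering two separate retrievals.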

---

#### 🧾 Inputs and Outputs

- 📥 **Input Key**: `user_input`  
- 📤 **Output Key**: `response`  

💡 The chain processes a single user prompt and returns an *improved, confident, and context-aware response*.

---

#### 📚 References

- [LangChain FlareChain Documentation](https://api.python.langchain.com/en/latest/langchain/chains/langchain.chains.flare.base.FlareChain.html)
- [WaterCrawl Documentation](https://docs.watercrawl.dev/intro)
- [WaterCrawl scrape Documentation](https://docs.watercrawl.dev/api/scrape-url)
- [WaterCrawl search Documentation](https://docs.watercrawl.dev/api/get-search)
