# 🚀 DeepSeek R1 × WaterCrawl — Live Chat with Any Website URL

Welcome to this **step‑by‑step Jupyter Notebook tutorial**. By the end, you’ll be able to:
1. **Crawl & clean** any webpage with **WaterCrawl**.
2. **Summarize** the scraped content with **DeepSeek R1**.
3. **Chat interactively** about the page — all within a single Python class.

> **Important:** This demo *does not* store data in a vector database or perform similarity‑based retrieval. Everything happens on‑the‑fly in memory — perfect for quick analyses, but not for large‑scale production RAG pipelines.

## 🗝️ Key Features
- **One‑click crawling**: extract clean Markdown from any URL.
- **AI summaries**: condense long pages into factual bullet points.
- **Context‑aware answers**: ask questions and get answers grounded in the crawled page content.
- **Multi‑site sessions**: load several pages and keep chatting about them.

## ⚙️ Setup & Installation
```bash
# In a fresh environment ⬇️
pip install watercrawl-py python-dotenv rich requests
```
Create a `.env` file in the same folder:
```env
WATERCRAWL_API_KEY="your_watercrawl_key"
DEEPSEEK_API_KEY="your_deepseek_key"
```
*(Free WaterCrawl keys are available at [app.watercrawl.dev](https://app.watercrawl.dev). DeepSeek API keys are created on the DeepSeek platform at [platform.deepseek.com](https://platform.deepseek.com).)*

In [2]:
# !pip install watercrawl-py python-dotenv rich requests


In [3]:
# ▶️ Run this cell after adding your keys to .env
from dotenv import load_dotenv
import os
load_dotenv()

WATERCARWL_KEY = os.getenv("WATERCRAWL_API_KEY")
DEEPSEEK_KEY   = os.getenv("DEEPSEEK_API_KEY")

assert WATERCARWL_KEY and DEEPSEEK_KEY, "🔑 Add your API keys to the .env file first!"

## 🤖 DeepSeek R1 Lightweight Client
A minimal wrapper around the HTTP API that needs only `requests`, with no DeepSeek SDK.

In [4]:
import logging
import requests

class DeepSeekClient:
    """Tiny helper for DeepSeek chat completions."""
    def __init__(self, api_key, base_url="https://api.deepseek.com"):
        # One Session reuses the connection and sends the auth header on every call
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        })
        self.base_url = base_url

    def chat(self, messages, model="deepseek-reasoner", **kwargs):
        """POST a chat completion and return the parsed JSON response."""
        url = f"{self.base_url}/v1/chat/completions"
        # Extra keyword arguments (e.g. temperature, max_tokens) go straight into the payload
        payload = {"model": model, "messages": messages, **kwargs}
        resp = self.session.post(url, json=payload, timeout=60)
        resp.raise_for_status()  # raise on 4xx/5xx instead of returning an error body
        return resp.json()
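
`chat` returns the raw JSON, so callers have to dig the text out themselves. With `deepseek-reasoner`, the assistant message is expected to carry the model's reasoning in a separate `reasoning_content` field next to `content`. The sketch below shows one way to separate the two. `split_reasoning` is a hypothetical helper name, not part of this notebook, and the sample dict is hand-written rather than real model output.

```python
from typing import Any, Dict, Tuple

def split_reasoning(resp: Dict[str, Any]) -> Tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response dict."""
    message = resp["choices"][0]["message"]
    # Assumption: `reasoning_content` appears only for reasoning models,
    # so fall back to an empty string when it is missing or null.
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer

# Hand-written sample shaped like an API response (not real model output)
sample = {"choices": [{"message": {"role": "assistant",
                                   "reasoning_content": "The page describes a crawler.",
                                   "content": "It is a crawling tool."}}]}
reasoning, answer = split_reasoning(sample)
print(answer)
```

Keeping the reasoning separate makes it easy to log it for debugging while showing only the final answer to the user.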

## 📝 Logging Configuration

In [5]:
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("LiveChat")

## 💬 Chat Message Schema

In [6]:
from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Any, List

class MessageType(Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    WEBSITE = "website"

@dataclass
class ChatMessage:
    role: MessageType
    content: str
    timestamp: datetime = field(default_factory=datetime.now)
    metadata: Dict[str, Any] = field(default_factory=dict)
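
The schema stores roles as enum members, while the chat API expects plain `{"role": ..., "content": ...}` dicts. The sketch below shows one possible conversion. `to_api_messages` is a hypothetical helper, and mapping the local `website` role to `system` is an assumption made here, since the notebook does not say how that role should be sent. The schema is repeated so the block runs on its own.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Dict, List

class MessageType(Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    WEBSITE = "website"

@dataclass
class ChatMessage:
    role: MessageType
    content: str
    timestamp: datetime = field(default_factory=datetime.now)
    metadata: Dict[str, Any] = field(default_factory=dict)

def to_api_messages(history: List[ChatMessage]) -> List[Dict[str, str]]:
    """Convert ChatMessage objects into role/content dicts for the chat API."""
    payload = []
    for msg in history:
        # Assumption: "website" is a local-only role, so it is sent as "system"
        role = "system" if msg.role is MessageType.WEBSITE else msg.role.value
        payload.append({"role": role, "content": msg.content})
    return payload

history = [
    ChatMessage(MessageType.SYSTEM, "Answer from the page only."),
    ChatMessage(MessageType.WEBSITE, "Page text goes here."),
    ChatMessage(MessageType.USER, "What is this page about?"),
]
print(to_api_messages(history))
```

Timestamps and metadata stay local; only the role and content are sent.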

## 🤖 WebsiteChatBot Class
Handles crawling, summarization, and conversation flow.

In [7]:
from watercrawl import WaterCrawlAPIClient
from rich.progress import Progress, SpinnerColumn, TextColumn

class WebsiteChatBot:
    def __init__(self, watercrawl_key, deepseek_key):
        self.watercrawl = WaterCrawlAPIClient(api_key=watercrawl_key)
        self.deepseek  = DeepSeekClient(api_key=deepseek_key)
        self.chat_history: List[ChatMessage] = []
        self.website_content = {}
        self._system("You are a helpful AI assistant…")

    # ——— Utility helpers ———
    def _system(self, text):
        self.chat_history.append(ChatMessage(MessageType.SYSTEM, text))
    def _user(self, text):
        self.chat_history.append(ChatMessage(MessageType.USER, text))
    def _assistant(self, text):
        self.chat_history.append(ChatMessage(MessageType.ASSISTANT, text))

    # ——— Crawling ———
    def crawl(self, url):
        result = self.watercrawl.scrape_url(url=url,
            page_options={"exclude_tags":["nav","footer","header"],
                          "include_tags":["article","main","section"],
                          "include_html":False})
        markdown = result['result']['markdown']
        self.website_content[url] = markdown
        return markdown

    # ——— Summarize ———
    def summarize(self, content):
        # Only the first 4,000 characters are sent, so long pages are
        # summarized from their opening part
        messages=[{"role":"system","content":"Summarize this page."},
                  {"role":"user","content":content[:4000]}]
        resp = self.deepseek.chat(messages, temperature=0.3)
        return resp['choices'][0]['message']['content']

    # ——— Q&A ———
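    def ask(self, question):
        # Record the question so chat_history holds both sides of the exchange
        self._user(question)
        # All crawled pages are joined, then cut to 5,000 characters to bound prompt size
        ctx = "\n\n".join(self.website_content.values())[:5000]
        messages=[{"role":"system","content":"Context:"+ctx},
                  {"role":"user","content":question}]
        resp = self.deepseek.chat(messages, max_tokens=800)
        answer = resp['choices'][0]['message']['content']
        self._assistant(answer)
        return answer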
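
`ask` joins every crawled page and then keeps the first 5,000 characters, so once one long page is loaded, later pages may not reach the model at all. The sketch below is one possible alternative that gives each page an equal share of the budget. `build_context`, the `Source:` header, and the equal split are all assumptions made for illustration, not part of the notebook.

```python
from typing import Dict

def build_context(pages: Dict[str, str], budget: int = 5000) -> str:
    """Join pages into one context string, giving each page an equal share."""
    if not pages:
        return ""
    sep = "\n\n"
    # Subtract the separators first so the joined result never exceeds the budget
    usable = budget - len(sep) * (len(pages) - 1)
    per_page = max(usable // len(pages), 0)
    parts = []
    for url, text in pages.items():
        # Label each chunk with its URL, then cut label plus text to the page's share
        parts.append((f"Source: {url}\n" + text)[:per_page])
    return sep.join(parts)

pages = {
    "https://a.example": "A" * 10000,
    "https://b.example": "B" * 10000,
}
ctx = build_context(pages, budget=200)
print(len(ctx))
```

An equal split is the simplest policy; weighting by page length or relevance would need extra logic.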

## 🔍 Quick Demo 

In [8]:
bot = WebsiteChatBot(WATERCARWL_KEY, DEEPSEEK_KEY)

In [9]:
page = bot.crawl("https://watercrawl.dev/")
print(bot.summarize(page))


**Summary of the Page:**

The page highlights **WaterCrawl v0.7.1**, a web crawling tool update focused on smarter search capabilities, Google Custom Search integration, real-time status tracking, and transparent credit management. Released on May 3, 2025, by Amir Asaran, it emphasizes transforming web content into structured data for LLM training and analysis.

**Featured Articles:**  
1. **AI Communication Protocols**: Alireza Mofidi explains MCP (Model Context Protocol) and A2A (Agent-to-Agent Protocol), frameworks enabling AI systems to interact contextually and autonomously.  
2. **LLM Serving Frameworks**: A guide comparing on-premises deployment tools like vLLM and TGI for optimizing GPU performance and scalability, also by Alireza Mofidi.  

**Key Features of WaterCrawl:**  
- **Smart Crawling**: Control depth, domains, and paths for targeted extraction.  
- **Precision Extraction**: Customizable selectors to filter unwanted content.  
- **AI Integration**: OpenAI-powered proce

In [10]:
bot.ask("What is the main topic of the page?")

"The main topic of the page is **transforming web content into structured, LLM-ready data using WaterCrawl**. It positions WaterCrawl as a tool designed to:\n\n1. **Convert websites into knowledge bases** for:\n   - Training LLMs (Large Language Models)\n   - Content analysis\n   - Data-driven applications\n\n2. **Provide web crawling infrastructure** with features like:\n   - Smart content extraction (filtering ads/noise)\n   - AI-powered processing (OpenAI integration)\n   - JavaScript rendering\n   - Customizable plugins\n   - Integration with popular AI/LLM stacks (Dify, n8n, Langchain, etc.)\n\nThe page emphasizes WaterCrawl's role in bridging raw web content and AI applications, particularly highlighting its utility for developers and teams working with LLMs. Supporting articles about AI protocols (MCP/A2A) and LLM serving frameworks further contextualize its focus on AI/LLM infrastructure."

## 🚦 Next Steps
- Plug this notebook into LangChain or LlamaIndex for vector‑based retrieval.
- Cache crawled pages locally to avoid re‑scraping.
- Add citation snippets to answers for greater transparency.
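
The caching idea above could be sketched as a small disk cache keyed by a hash of the URL. `cached_crawl` and its JSON file layout are hypothetical choices, and `fake_fetch` stands in for `bot.crawl` so the block runs without API keys.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

def cached_crawl(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return cached Markdown for `url`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["markdown"]
    markdown = fetch(url)
    path.write_text(json.dumps({"url": url, "markdown": markdown}), encoding="utf-8")
    return markdown

# Stand-in for bot.crawl that records how often it is called
calls = []
def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"# Page at {url}"

cache = Path(tempfile.mkdtemp())
first = cached_crawl("https://watercrawl.dev/", fake_fetch, cache)
second = cached_crawl("https://watercrawl.dev/", fake_fetch, cache)
print(len(calls))  # the second call is served from disk
```

In the notebook, `fetch` would be `bot.crawl`. This sketch never expires entries, so a changed page keeps returning its old copy until the cache file is deleted.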

---
© 2025 – Feel free to remix and share!