{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "website_crawler_with_diff_detection.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
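    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github"
      },
      "source": [
        "# Website Crawler with Diff Detection (Google Colab Edition)\n",
        "\n",
        "This notebook crawls every page within the same domain, starting from a specified URL, and writes the result as Markdown. It also detects differences from the previous crawl and generates a report."
      ]
    },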
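    {
      "cell_type": "markdown",
      "metadata": {
        "id": "diff-detection-sketch"
      },
      "source": [
        "The diff detection mentioned above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crawler's actual implementation: the helper names `content_hash` and `describe_change` and the sample strings are hypothetical. Each version is hashed with SHA-256 to decide whether a page changed, and `difflib.unified_diff` then shows what changed."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "diff-detection-sketch-code"
      },
      "source": [
        "import difflib\n",
        "import hashlib\n",
        "\n",
        "\n",
        "def content_hash(text: str) -> str:\n",
        "    # Hypothetical helper: SHA-256 hex digest of the UTF-8 encoded text\n",
        "    return hashlib.sha256(text.encode('utf-8')).hexdigest()\n",
        "\n",
        "\n",
        "def describe_change(old: str, new: str) -> str:\n",
        "    # Hypothetical helper: unified diff, or an empty string if the hashes match\n",
        "    if content_hash(old) == content_hash(new):\n",
        "        return ''\n",
        "    diff = difflib.unified_diff(\n",
        "        old.splitlines(),\n",
        "        new.splitlines(),\n",
        "        fromfile='previous',\n",
        "        tofile='current',\n",
        "        lineterm='',\n",
        "    )\n",
        "    return '\\n'.join(diff)\n",
        "\n",
        "\n",
        "# Sample data invented for illustration\n",
        "previous = '# Title\\n\\nHello'\n",
        "current = '# Title\\n\\nHello, world'\n",
        "print(describe_change(previous, current))"
      ],
      "execution_count": null,
      "outputs": []
    },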
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "setup-section"
      },
      "source": [
        "## 1. Installing the Required Libraries\n",
        "\n",
        "First, install the required libraries. This also installs wkhtmltopdf, which is needed for PDF generation."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "install-libraries"
      },
      "source": [
        "!apt-get update\n",
        "!apt-get install -y wkhtmltopdf\n",
        "!pip install requests html2text lxml markdown pdfkit discord-webhook"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "drive-mount"
      },
      "source": [
        "## 2. Mounting Google Drive\n",
        "\n",
        "Mount Google Drive so that crawl results and the cache are stored persistently."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mount-drive-code"
      },
      "source": [
        "import os\n",
        "\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "# Create the directories used by the crawler\n",
        "crawler_dir = '/content/drive/MyDrive/website_crawler'\n",
        "output_dir = os.path.join(crawler_dir, 'output')\n",
        "cache_dir = os.path.join(crawler_dir, 'cache')\n",
        "\n",
        "os.makedirs(crawler_dir, exist_ok=True)\n",
        "os.makedirs(output_dir, exist_ok=True)\n",
        "os.makedirs(cache_dir, exist_ok=True)\n",
        "\n",
        "print(f\"Created crawler directory: {crawler_dir}\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "crawler-code"
      },
      "source": [
        "## 3. Website Crawler Code\n",
        "\n",
        "The crawler code is listed below."
      ]
    },
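    {
      "cell_type": "markdown",
      "metadata": {
        "id": "same-domain-sketch"
      },
      "source": [
        "As a warm-up, here is a minimal sketch of a same-domain check of the kind a crawler like this relies on, using only `urllib.parse`. The helper name `is_same_domain` and the example.com URLs are hypothetical and chosen purely for illustration. Each link is resolved against the base URL, its fragment is dropped, and the host is compared with the host of the base URL."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "same-domain-sketch-code"
      },
      "source": [
        "from urllib.parse import urljoin, urlparse\n",
        "\n",
        "\n",
        "def is_same_domain(base_url: str, link: str) -> bool:\n",
        "    # Hypothetical helper: resolve the link, drop the fragment, compare hosts\n",
        "    absolute = urljoin(base_url, link).split('#')[0]\n",
        "    return urlparse(absolute).netloc == urlparse(base_url).netloc\n",
        "\n",
        "\n",
        "# Sample links invented for illustration\n",
        "base = 'https://example.com/docs/'\n",
        "samples = ['guide.html#intro', '/about', 'https://other.example.org/', 'mailto:someone@example.com']\n",
        "for link in samples:\n",
        "    print(link, is_same_domain(base, link))"
      ],
      "execution_count": null,
      "outputs": []
    },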
    {
      "cell_type": "code",
      "metadata": {
        "id": "code-cell"
      },
      "source": [
        "#!/usr/bin/env python3\n",
        "# -*- coding: utf-8 -*-\n",
        "\n",
        "\"\"\"\n",
        "Webサイトクローラー：同一ドメイン内のすべてのコンテンツをMarkdown形式で出力し、\n",
        "完了後にDiscordに通知を送信するプログラム\n",
        "前回からの差分検知機能付き\n",
        "\"\"\"\n",
        "\n",
        "import requests\n",
        "import html2text\n",
        "from urllib.parse import urlparse, urljoin\n",
        "import time\n",
        "import os\n",
        "import logging\n",
        "import re\n",
        "import json\n",
        "import hashlib\n",
        "from collections import deque\n",
        "from typing import Set, Dict, List, Optional, Tuple, Any\n",
        "import markdown\n",
        "import pdfkit\n",
        "from discord_webhook import DiscordWebhook, DiscordEmbed\n",
        "import lxml.html\n",
        "import argparse\n",
        "import sqlite3\n",
        "from datetime import datetime\n",
        "import difflib\n",
        "\n",
        "\n",
        "class UrlFilter:\n",
        "    \"\"\"URLをフィルタリングして、同一ドメイン内のURLのみを許可するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, base_url: str):\n",
        "        \"\"\"\n",
        "        URLフィルタークラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            base_url (str): クロールする基本URL\n",
        "        \"\"\"\n",
        "        self.base_domain = urlparse(base_url).netloc\n",
        "        self.base_url = base_url\n",
        "        self.static_extensions = {\n",
        "            '.jpg', '.jpeg', '.png', '.gif', '.svg', '.css', \n",
        "            '.js', '.pdf', '.zip', '.tar', '.gz', '.mp3', \n",
        "            '.mp4', '.avi', '.mov', '.webm', '.webp', '.ico'\n",
        "        }\n",
        "    \n",
        "    def normalize_url(self, url: str) -> str:\n",
        "        \"\"\"\n",
        "        URLを正規化する（相対URLを絶対URLに変換、フラグメントの削除等）\n",
        "        \n",
        "        Args:\n",
        "            url (str): 正規化する URL\n",
        "        \n",
        "        Returns:\n",
        "            str: 正規化された URL\n",
        "        \"\"\"\n",
        "        # 相対URLを絶対URLに変換\n",
        "        normalized_url = urljoin(self.base_url, url)\n",
        "        \n",
        "        # フラグメント (#) を削除\n",
        "        normalized_url = normalized_url.split('#')[0]\n",
        "        \n",
        "        # トレーリングスラッシュを統一\n",
        "        if normalized_url.endswith('/'):\n",
        "            normalized_url = normalized_url[:-1]\n",
        "            \n",
        "        return normalized_url\n",
        "    \n",
        "    def should_crawl(self, url: str) -> bool:\n",
        "        \"\"\"\n",
        "        URLがクロール対象かどうかを判定する\n",
        "        \n",
        "        Args:\n",
        "            url (str): 判定する URL\n",
        "        \n",
        "        Returns:\n",
        "            bool: クロール対象の場合は True、そうでない場合は False\n",
        "        \"\"\"\n",
        "        # 空のURLはクロールしない\n",
        "        if not url:\n",
        "            return False\n",
        "        \n",
        "        # URLを正規化\n",
        "        url = self.normalize_url(url)\n",
        "        \n",
        "        # URLのドメインを取得\n",
        "        parsed_url = urlparse(url)\n",
        "        domain = parsed_url.netloc\n",
        "        \n",
        "        # 同一ドメインでない場合はクロールしない\n",
        "        if domain != self.base_domain:\n",
        "            return False\n",
        "        \n",
        "        # 静的ファイルはクロールしない\n",
        "        path = parsed_url.path.lower()\n",
        "        if any(path.endswith(ext) for ext in self.static_extensions):\n",
        "            return False\n",
        "        \n",
        "        # メールアドレスリンクはクロールしない\n",
        "        if url.startswith('mailto:'):\n",
        "            return False\n",
        "        \n",
        "        # 電話番号リンクはクロールしない\n",
        "        if url.startswith('tel:'):\n",
        "            return False\n",
        "            \n",
        "        return True\n",
        "\n",
        "\n",
        "class Fetcher:\n",
        "    \"\"\"指定されたURLからHTMLコンテンツを取得するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, delay: float = 1.0, max_retries: int = 3, timeout: int = 10):\n",
        "        \"\"\"\n",
        "        Fetcherクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            delay (float): リクエスト間の遅延秒数（サーバー負荷軽減のため）\n",
        "            max_retries (int): 最大再試行回数\n",
        "            timeout (int): リクエストタイムアウト秒数\n",
        "        \"\"\"\n",
        "        self.delay = delay\n",
        "        self.max_retries = max_retries\n",
        "        self.timeout = timeout\n",
        "        self.last_request_time = 0\n",
        "        self.headers = {\n",
        "            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',\n",
        "            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',\n",
        "            'Accept-Language': 'en-US,en;q=0.5',\n",
        "        }\n",
        "        \n",
        "    def fetch(self, url: str, etag: Optional[str] = None, last_modified: Optional[str] = None) -> Tuple[Optional[str], Dict[str, str]]:\n",
        "        \"\"\"\n",
        "        URLからHTMLコンテンツを取得する\n",
        "        \n",
        "        Args:\n",
        "            url (str): コンテンツを取得するURL\n",
        "            etag (Optional[str]): 前回取得時のETag\n",
        "            last_modified (Optional[str]): 前回取得時のLast-Modified\n",
        "        \n",
        "        Returns:\n",
        "            Tuple[Optional[str], Dict[str, str]]: (取得したHTMLコンテンツ, レスポンスヘッダー情報)\n",
        "                                                 取得失敗時はコンテンツはNone\n",
        "        \"\"\"\n",
        "        # リクエスト間隔を確保する\n",
        "        elapsed = time.time() - self.last_request_time\n",
        "        if elapsed < self.delay:\n",
        "            time.sleep(self.delay - elapsed)\n",
        "        \n",
        "        # 条件付きリクエスト用ヘッダーを準備\n",
        "        headers = self.headers.copy()\n",
        "        if etag:\n",
        "            headers['If-None-Match'] = etag\n",
        "        if last_modified:\n",
        "            headers['If-Modified-Since'] = last_modified\n",
        "            \n",
        "        retries = 0\n",
        "        while retries <= self.max_retries:\n",
        "            try:\n",
        "                self.last_request_time = time.time()\n",
        "                response = requests.get(url, headers=headers, timeout=self.timeout)\n",
        "                \n",
        "                # 304 Not Modified の場合、コンテンツは変更されていない\n",
        "                if response.status_code == 304:\n",
        "                    logging.info(f\"Content not modified: {url}\")\n",
        "                    return None, {\n",
        "                        'etag': etag,\n",
        "                        'last_modified': last_modified,\n",
        "                        'status_code': 304\n",
        "                    }\n",
        "                \n",
        "                # ステータスコードが200以外の場合は失敗とみなす\n",
        "                if response.status_code != 200:\n",
        "                    logging.warning(f\"Failed to fetch {url}: status code {response.status_code}\")\n",
        "                    retries += 1\n",
        "                    time.sleep(self.delay * (2 ** retries))  # 指数バックオフ\n",
        "                    continue\n",
        "                \n",
        "                # content-typeがHTMLでない場合はスキップ\n",
        "                content_type = response.headers.get('Content-Type', '')\n",
        "                if 'text/html' not in content_type.lower():\n",
        "                    logging.info(f\"Skipping non-HTML content: {url}, Content-Type: {content_type}\")\n",
        "                    return None, {'status_code': response.status_code, 'content_type': content_type}\n",
        "                \n",
        "                # ヘッダー情報を取得\n",
        "                headers_info = {\n",
        "                    'etag': response.headers.get('ETag'),\n",
        "                    'last_modified': response.headers.get('Last-Modified'),\n",
        "                    'content_type': content_type,\n",
        "                    'status_code': response.status_code\n",
        "                }\n",
        "                \n",
        "                return response.text, headers_info\n",
        "                \n",
        "            except requests.RequestException as e:\n",
        "                logging.error(f\"Error fetching {url}: {e}\")\n",
        "                retries += 1\n",
        "                if retries <= self.max_retries:\n",
        "                    time.sleep(self.delay * (2 ** retries))  # 指数バックオフ\n",
        "                else:\n",
        "                    return None, {'status_code': 0, 'error': str(e)}\n",
        "        \n",
        "        return None, {'status_code': 0, 'error': 'Max retries exceeded'}\n",
        "\n",
        "\n",
        "class Parser:\n",
        "    \"\"\"HTMLコンテンツを解析し、コンテンツとリンクを抽出するコンポーネント（BeautifulSoup非使用）\"\"\"\n",
        "    \n",
        "    def __init__(self, url_filter: UrlFilter):\n",
        "        \"\"\"\n",
        "        Parserクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            url_filter (UrlFilter): URLフィルターインスタンス\n",
        "        \"\"\"\n",
        "        self.url_filter = url_filter\n",
        "    \n",
        "    def parse(self, html: str, url: str) -> Tuple[Dict, List[str]]:\n",
        "        \"\"\"\n",
        "        HTMLからコンテンツとリンクを抽出する\n",
        "        \n",
        "        Args:\n",
        "            html (str): 解析するHTMLコンテンツ\n",
        "            url (str): HTMLのURL（リンクの絶対URL化に使用）\n",
        "        \n",
        "        Returns:\n",
        "            Tuple[Dict, List[str]]: (抽出したコンテンツ, 抽出したリンクのリスト)\n",
        "        \"\"\"\n",
        "        try:\n",
        "            # lxmlを使用してHTMLを解析\n",
        "            doc = lxml.html.fromstring(html)\n",
        "            \n",
        "            # タイトルを抽出\n",
        "            title_elem = doc.xpath('//title')\n",
        "            title = title_elem[0].text_content().strip() if title_elem else \"No Title\"\n",
        "            \n",
        "            # メインコンテンツを抽出 (lxmlのXPath機能を使用)\n",
        "            content_selectors = [\n",
        "                '//main', '//article', \n",
        "                '//div[@class=\"content\"]', '//div[@id=\"content\"]', \n",
        "                '//div[@class=\"post-content\"]'\n",
        "            ]\n",
        "            \n",
        "            content_elem = None\n",
        "            for selector in content_selectors:\n",
        "                elements = doc.xpath(selector)\n",
        "                if elements:\n",
        "                    content_elem = elements[0]\n",
        "                    break\n",
        "            \n",
        "            # メインコンテンツが見つからない場合はbody全体を使用\n",
        "            if not content_elem:\n",
        "                body_elem = doc.xpath('//body')\n",
        "                content_elem = body_elem[0] if body_elem else doc\n",
        "            \n",
        "            # HTMLコンテンツを取得（lxml.html.tostring を使用）\n",
        "            html_content = lxml.html.tostring(content_elem, encoding='unicode')\n",
        "            \n",
        "            # リンクを抽出\n",
        "            links = []\n",
        "            for a_tag in doc.xpath('//a[@href]'):\n",
        "                href = a_tag.get('href')\n",
        "                if self.url_filter.should_crawl(href):\n",
        "                    normalized_url = self.url_filter.normalize_url(href)\n",
        "                    links.append(normalized_url)\n",
        "            \n",
        "            # ページ情報の辞書を作成\n",
        "            page_data = {\n",
        "                'url': url,\n",
        "                'title': title,\n",
        "                'html_content': html_content,\n",
        "            }\n",
        "            \n",
        "            return page_data, links\n",
        "            \n",
        "        except Exception as e:\n",
        "            logging.error(f\"Error parsing HTML from {url}: {e}\")\n",
        "            # エラー時は空のデータと空のリンクリストを返す\n",
        "            return {'url': url, 'title': 'Error', 'html_content': ''}, []\n",
        "\n",
        "\n",
        "class MarkdownConverter:\n",
        "    \"\"\"HTMLコンテンツをMarkdown形式に変換するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self):\n",
        "        \"\"\"\n",
        "        MarkdownConverterクラスの初期化\n",
        "        \"\"\"\n",
        "        self.converter = html2text.HTML2Text()\n",
        "        self.converter.ignore_links = False\n",
        "        self.converter.ignore_images = False\n",
        "        self.converter.ignore_tables = False\n",
        "        self.converter.body_width = 0  # 行の折り返しを無効化\n",
        "        self.converter.unicode_snob = True  # Unicode文字を維持\n",
        "        self.converter.single_line_break = True  # 単一の改行を維持\n",
        "        \n",
        "    def convert(self, page_data: Dict) -> Dict:\n",
        "        \"\"\"\n",
        "        HTMLをMarkdownに変換する\n",
        "        \n",
        "        Args:\n",
        "            page_data (Dict): 変換するページデータ\n",
        "        \n",
        "        Returns:\n",
        "            Dict: Markdownに変換されたページデータ\n",
        "        \"\"\"\n",
        "        title = page_data['title']\n",
        "        html_content = page_data['html_content']\n",
        "        url = page_data['url']\n",
        "        \n",
        "        # HTMLをMarkdownに変換\n",
        "        markdown_content = self.converter.handle(html_content)\n",
        "        \n",
        "        # Markdownタイトルを作成\n",
        "        markdown_title = f\"# {title}\\n\\n\"\n",
        "        \n",
        "        # URL情報を追加\n",
        "        url_info = f\"*Source: {url}*\\n\\n\"\n",
        "        \n",
        "        # 最終的なMarkdownコンテンツを組み立て\n",
        "        full_markdown = markdown_title + url_info + markdown_content\n",
        "        \n",
        "        # 結果を返す\n",
        "        result = page_data.copy()\n",
        "        result['markdown_content'] = full_markdown\n",
        "        \n",
        "        return result\n",
        "\n",
        "\n",
        "class CrawlCache:\n",
        "    \"\"\"クロール結果を永続的に保存し、差分検知に使用するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, domain: str, cache_dir: str = \"cache\"):\n",
        "        \"\"\"\n",
        "        CrawlCacheクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            domain (str): キャッシュを保存するドメイン名\n",
        "            cache_dir (str): キャッシュディレクトリ\n",
        "        \"\"\"\n",
        "        self.domain = domain\n",
        "        self.cache_dir = cache_dir\n",
        "        os.makedirs(cache_dir, exist_ok=True)\n",
        "        \n",
        "        self.db_path = os.path.join(cache_dir, f\"{domain}.db\")\n",
        "        self._initialize_db()\n",
        "        \n",
        "    def _initialize_db(self):\n",
        "        \"\"\"データベースを初期化する\"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        # pages テーブルを作成\n",
        "        cursor.execute('''\n",
        "        CREATE TABLE IF NOT EXISTS pages (\n",
        "            url TEXT PRIMARY KEY,\n",
        "            title TEXT,\n",
        "            content_hash TEXT,\n",
        "            etag TEXT,\n",
        "            last_modified TEXT,\n",
        "            last_crawled TEXT,\n",
        "            markdown_content TEXT\n",
        "        )\n",
        "        ''')\n",
        "        \n",
        "        # crawl_history テーブルを作成\n",
        "        cursor.execute('''\n",
        "        CREATE TABLE IF NOT EXISTS crawl_history (\n",
        "            id INTEGER PRIMARY KEY AUTOINCREMENT,\n",
        "            crawl_date TEXT,\n",
        "            page_count INTEGER,\n",
        "            new_count INTEGER,\n",
        "            updated_count INTEGER,\n",
        "            deleted_count INTEGER\n",
        "        )\n",
        "        ''')\n",
        "        \n",
        "        conn.commit()\n",
        "        conn.close()\n",
        "    \n",
        "    def get_page(self, url: str) -> Optional[Dict]:\n",
        "        \"\"\"\n",
        "        URLに対応するキャッシュされたページ情報を取得する\n",
        "        \n",
        "        Args:\n",
        "            url (str): 取得するページのURL\n",
        "        \n",
        "        Returns:\n",
        "            Optional[Dict]: キャッシュされたページ情報、存在しない場合はNone\n",
        "        \"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        conn.row_factory = sqlite3.Row\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        cursor.execute('SELECT * FROM pages WHERE url = ?', (url,))\n",
        "        row = cursor.fetchone()\n",
        "        \n",
        "        conn.close()\n",
        "        \n",
        "        if row:\n",
        "            return dict(row)\n",
        "        return None\n",
        "    \n",
        "    def add_or_update_page(self, page_data: Dict) -> bool:\n",
        "        \"\"\"\n",
        "        ページ情報をキャッシュに追加または更新する\n",
        "        \n",
        "        Args:\n",
        "            page_data (Dict): 追加/更新するページデータ\n",
        "        \n",
        "        Returns:\n",
        "            bool: 更新された場合はTrue、新規追加の場合はFalse\n",
        "        \"\"\"\n",
        "        url = page_data['url']\n",
        "        title = page_data['title']\n",
        "        markdown_content = page_data.get('markdown_content', '')\n",
        "        content_hash = self._compute_hash(markdown_content)\n",
        "        etag = page_data.get('etag')\n",
        "        last_modified = page_data.get('last_modified')\n",
        "        last_crawled = datetime.now().isoformat()\n",
        "        \n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        # 既存のページかチェック\n",
        "        cursor.execute('SELECT content_hash FROM pages WHERE url = ?', (url,))\n",
        "        row = cursor.fetchone()\n",
        "        \n",
        "        is_update = row is not None\n",
        "        \n",
        "        if is_update:\n",
        "            # 更新\n",
        "            cursor.execute('''\n",
        "            UPDATE pages \n",
        "            SET title = ?, content_hash = ?, etag = ?, last_modified = ?, \n",
        "                last_crawled = ?, markdown_content = ?\n",
        "            WHERE url = ?\n",
        "            ''', (title, content_hash, etag, last_modified, last_crawled, markdown_content, url))\n",
        "        else:\n",
        "            # 新規追加\n",
        "            cursor.execute('''\n",
        "            INSERT INTO pages \n",
        "            (url, title, content_hash, etag, last_modified, last_crawled, markdown_content)\n",
        "            VALUES (?, ?, ?, ?, ?, ?, ?)\n",
        "            ''', (url, title, content_hash, etag, last_modified, last_crawled, markdown_content))\n",
        "        \n",
        "        conn.commit()\n",
        "        conn.close()\n",
        "        \n",
        "        return is_update\n",
        "    \n",
        "    def get_all_urls(self) -> Set[str]:\n",
        "        \"\"\"\n",
        "        キャッシュに保存されているすべてのURLを取得する\n",
        "        \n",
        "        Returns:\n",
        "            Set[str]: キャッシュされているすべてのURL\n",
        "        \"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        cursor.execute('SELECT url FROM pages')\n",
        "        urls = {row[0] for row in cursor.fetchall()}\n",
        "        \n",
        "        conn.close()\n",
        "        \n",
        "        return urls\n",
        "    \n",
        "    def delete_urls(self, urls: List[str]) -> int:\n",
        "        \"\"\"\n",
        "        指定されたURLをキャッシュから削除する\n",
        "        \n",
        "        Args:\n",
        "            urls (List[str]): 削除するURLのリスト\n",
        "        \n",
        "        Returns:\n",
        "            int: 削除されたURLの数\n",
        "        \"\"\"\n",
        "        if not urls:\n",
        "            return 0\n",
        "            \n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        placeholders = ', '.join(['?'] * len(urls))\n",
        "        cursor.execute(f'DELETE FROM pages WHERE url IN ({placeholders})', urls)\n",
        "        \n",
        "        deleted_count = cursor.rowcount\n",
        "        \n",
        "        conn.commit()\n",
        "        conn.close()\n",
        "        \n",
        "        return deleted_count\n",
        "    \n",
        "    def save_crawl_history(self, page_count: int, new_count: int, updated_count: int, deleted_count: int) -> int:\n",
        "        \"\"\"\n",
        "        クロール履歴を保存する\n",
        "        \n",
        "        Args:\n",
        "            page_count (int): クロールしたページの総数\n",
        "            new_count (int): 新規追加されたページ数\n",
        "            updated_count (int): 更新されたページ数\n",
        "            deleted_count (int): 削除されたページ数\n",
        "        \n",
        "        Returns:\n",
        "            int: 履歴のID\n",
        "        \"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        crawl_date = datetime.now().isoformat()\n",
        "        \n",
        "        cursor.execute('''\n",
        "        INSERT INTO crawl_history \n",
        "        (crawl_date, page_count, new_count, updated_count, deleted_count)\n",
        "        VALUES (?, ?, ?, ?, ?)\n",
        "        ''', (crawl_date, page_count, new_count, updated_count, deleted_count))\n",
        "        \n",
        "        history_id = cursor.lastrowid\n",
        "        \n",
        "        conn.commit()\n",
        "        conn.close()\n",
        "        \n",
        "        return history_id\n",
        "    \n",
        "    def get_latest_crawl_history(self) -> Optional[Dict]:\n",
        "        \"\"\"\n",
        "        最新のクロール履歴を取得する\n",
        "        \n",
        "        Returns:\n",
        "            Optional[Dict]: 最新のクロール履歴、存在しない場合はNone\n",
        "        \"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        conn.row_factory = sqlite3.Row\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        cursor.execute('SELECT * FROM crawl_history ORDER BY id DESC LIMIT 1')\n",
        "        row = cursor.fetchone()\n",
        "        \n",
        "        conn.close()\n",
        "        \n",
        "        if row:\n",
        "            return dict(row)\n",
        "        return None\n",
        "    \n",
        "    def get_all_pages(self) -> List[Dict]:\n",
        "        \"\"\"\n",
        "        すべてのキャッシュされたページ情報を取得する\n",
        "        \n",
        "        Returns:\n",
        "            List[Dict]: キャッシュされたすべてのページ情報\n",
        "        \"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        conn.row_factory = sqlite3.Row\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        cursor.execute('SELECT * FROM pages')\n",
        "        rows = cursor.fetchall()\n",
        "        \n",
        "        conn.close()\n",
        "        \n",
        "        return [dict(row) for row in rows]\n",
        "    \n",
        "    def is_content_changed(self, url: str, markdown_content: str) -> bool:\n",
        "        \"\"\"\n",
        "        ページのコンテンツが前回のクロール時から変更されているかどうかを確認する\n",
        "        \n",
        "        Args:\n",
        "            url (str): チェックするページのURL\n",
        "            markdown_content (str): 現在のMarkdownコンテンツ\n",
        "        \n",
        "        Returns:\n",
        "            bool: コンテンツが変更されている場合はTrue、変更がない場合はFalse\n",
        "        \"\"\"\n",
        "        current_hash = self._compute_hash(markdown_content)\n",
        "        \n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        cursor.execute('SELECT content_hash FROM pages WHERE url = ?', (url,))\n",
        "        row = cursor.fetchone()\n",
        "        \n",
        "        conn.close()\n",
        "        \n",
        "        if not row:\n",
        "            return True  # 新規ページなので変更ありとみなす\n",
        "        \n",
        "        return current_hash != row[0]\n",
        "    \n",
        "    def _compute_hash(self, content: str) -> str:\n",
        "        \"\"\"\n",
        "        コンテンツのハッシュ値を計算する\n",
        "        \n",
        "        Args:\n",
        "            content (str): ハッシュ値を計算するコンテンツ\n",
        "        \n",
        "        Returns:\n",
        "            str: コンテンツのSHA256ハッシュ値\n",
        "        \"\"\"\n",
        "        return hashlib.sha256(content.encode('utf-8')).hexdigest()\n",
        "    \n",
        "    def get_diff(self, url: str, current_content: str) -> str:\n",
        "        \"\"\"\n",
        "        前回のコンテンツとの差分を取得する\n",
        "        \n",
        "        Args:\n",
        "            url (str): チェックするページのURL\n",
        "            current_content (str): 現在のMarkdownコンテンツ\n",
        "        \n",
        "        Returns:\n",
        "            str: 差分情報（unified diff形式）\n",
        "        \"\"\"\n",
        "        conn = sqlite3.connect(self.db_path)\n",
        "        cursor = conn.cursor()\n",
        "        \n",
        "        cursor.execute('SELECT markdown_content FROM pages WHERE url = ?', (url,))\n",
        "        row = cursor.fetchone()\n",
        "        \n",
        "        conn.close()\n",
        "        \n",
        "        if not row:\n",
        "            return \"新規ページ\"\n",
        "            \n",
        "        old_content = row[0]\n",
        "        if not old_content:\n",
        "            return \"前回のコンテンツが空\"\n",
        "            \n",
        "        # 差分を計算\n",
        "        diff = difflib.unified_diff(\n",
        "            old_content.splitlines(),\n",
        "            current_content.splitlines(),\n",
        "            fromfile=\"前回のバージョン\",\n",
        "            tofile=\"現在のバージョン\",\n",
        "            lineterm=''\n",
        "        )\n",
        "        \n",
        "        return '\\n'.join(diff)\n",
        "\n",
        "\n",
        "class ContentRepository:\n",
        "    \"\"\"クロールしたコンテンツを管理するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self):\n",
        "        \"\"\"\n",
        "        ContentRepositoryクラスの初期化\n",
        "        \"\"\"\n",
        "        self.contents = {}  # URLをキーとしたコンテンツ辞書\n",
        "        \n",
        "    def add(self, page_data: Dict) -> None:\n",
        "        \"\"\"\n",
        "        コンテンツを追加する\n",
        "        \n",
        "        Args:\n",
        "            page_data (Dict): 追加するページデータ\n",
        "        \"\"\"\n",
        "        url = page_data['url']\n",
        "        self.contents[url] = page_data\n",
        "        \n",
        "    def get(self, url: str) -> Optional[Dict]:\n",
        "        \"\"\"\n",
        "        URLに対応するコンテンツを取得する\n",
        "        \n",
        "        Args:\n",
        "            url (str): 取得するコンテンツのURL\n",
        "        \n",
        "        Returns:\n",
        "            Optional[Dict]: 取得したコンテンツ、存在しない場合はNone\n",
        "        \"\"\"\n",
        "        return self.contents.get(url)\n",
        "    \n",
        "    def get_all(self) -> Dict[str, Dict]:\n",
        "        \"\"\"\n",
        "        すべてのコンテンツを取得する\n",
        "        \n",
        "        Returns:\n",
        "            Dict[str, Dict]: すべてのコンテンツ（URLをキーとする辞書）\n",
        "        \"\"\"\n",
        "        return self.contents\n",
        "    \n",
        "    def count(self) -> int:\n",
        "        \"\"\"\n",
        "        コンテンツの数を取得する\n",
        "        \n",
        "        Returns:\n",
        "            int: コンテンツの数\n",
        "        \"\"\"\n",
        "        return len(self.contents)\n",
        "\n",
        "\n",
        "class FileExporter:\n",
        "    \"\"\"クロールしたコンテンツをファイルに出力するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, output_dir: str = \"output\"):\n",
        "        \"\"\"\n",
        "        FileExporterクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            output_dir (str): 出力ディレクトリ\n",
        "        \"\"\"\n",
        "        self.output_dir = output_dir\n",
        "        os.makedirs(output_dir, exist_ok=True)\n",
        "        \n",
        "    def export_markdown(self, repository: ContentRepository, filename: str) -> str:\n",
        "        \"\"\"\n",
        "        コンテンツをMarkdownファイルとしてエクスポートする\n",
        "        \n",
        "        Args:\n",
        "            repository (ContentRepository): コンテンツリポジトリ\n",
        "            filename (str): 出力ファイル名\n",
        "        \n",
        "        Returns:\n",
        "            str: 出力したファイルのパス\n",
        "        \"\"\"\n",
        "        contents = repository.get_all()\n",
        "        \n",
        "        # 出力ファイルのパス\n",
        "        output_path = os.path.join(self.output_dir, filename)\n",
        "        \n",
        "        # コンテンツをリストにまとめる\n",
        "        markdown_contents = []\n",
        "        for url, page_data in sorted(contents.items()):\n",
        "            if 'markdown_content' in page_data:\n",
        "                markdown_contents.append(page_data['markdown_content'])\n",
        "            \n",
        "        # ファイルに書き込む\n",
        "        with open(output_path, 'w', encoding='utf-8') as f:\n",
        "            f.write('\\n\\n---\\n\\n'.join(markdown_contents))\n",
        "            \n",
        "        return output_path\n",
        "    \n",
        "    def export_diff_report(self, diff_data: Dict, filename: str) -> str:\n",
        "        \"\"\"\n",
        "        差分レポートをMarkdownファイルとして出力する\n",
        "        \n",
        "        Args:\n",
        "            diff_data (Dict): 差分データ\n",
        "            filename (str): 出力ファイル名\n",
        "        \n",
        "        Returns:\n",
        "            str: 出力したファイルのパス\n",
        "        \"\"\"\n",
        "        output_path = os.path.join(self.output_dir, filename)\n",
        "        \n",
        "        with open(output_path, 'w', encoding='utf-8') as f:\n",
        "            f.write(f\"# 差分レポート - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\\n\\n\")\n",
        "            \n",
        "            # 概要情報\n",
        "            f.write(\"## 概要\\n\\n\")\n",
        "            f.write(f\"- 合計ページ数: {diff_data['total']}\\n\")\n",
        "            f.write(f\"- 新規ページ: {len(diff_data['new_pages'])}\\n\")\n",
        "            f.write(f\"- 更新ページ: {len(diff_data['updated_pages'])}\\n\")\n",
        "            f.write(f\"- 削除ページ: {len(diff_data['deleted_pages'])}\\n\\n\")\n",
        "            \n",
        "            # 新規ページ\n",
        "            if diff_data['new_pages']:\n",
        "                f.write(\"## 新規ページ\\n\\n\")\n",
        "                for url in diff_data['new_pages']:\n",
        "                    f.write(f\"- [{url}]({url})\\n\")\n",
        "                f.write(\"\\n\")\n",
        "            \n",
        "            # 更新ページ\n",
        "            if diff_data['updated_pages']:\n",
        "                f.write(\"## 更新ページ\\n\\n\")\n",
        "                for url in diff_data['updated_pages']:\n",
        "                    f.write(f\"- [{url}]({url})\\n\")\n",
        "                f.write(\"\\n\")\n",
        "            \n",
        "            # 削除ページ\n",
        "            if diff_data['deleted_pages']:\n",
        "                f.write(\"## 削除ページ\\n\\n\")\n",
        "                for url in diff_data['deleted_pages']:\n",
        "                    f.write(f\"- {url}\\n\")\n",
        "                f.write(\"\\n\")\n",
        "            \n",
        "            # 詳細な差分情報 (オプション)\n",
        "            if diff_data.get('diffs'):\n",
        "                f.write(\"## 詳細な差分\\n\\n\")\n",
        "                for url, diff in diff_data['diffs'].items():\n",
        "                    f.write(f\"### {url}\\n\\n\")\n",
        "                    f.write(\"```diff\\n\")\n",
        "                    f.write(diff)\n",
        "                    f.write(\"\\n```\\n\\n\")\n",
        "        \n",
        "        return output_path\n",
        "\n",
        "\n",
        "class PdfConverter:\n",
        "    \"\"\"MarkdownファイルをPDF形式に変換するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, output_dir: str = \"output\"):\n",
        "        \"\"\"\n",
        "        PdfConverterクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            output_dir (str): 出力ディレクトリ\n",
        "        \"\"\"\n",
        "        self.output_dir = output_dir\n",
        "        os.makedirs(output_dir, exist_ok=True)\n",
        "        \n",
        "    def convert(self, markdown_path: str) -> str:\n",
        "        \"\"\"\n",
        "        MarkdownファイルをPDFに変換する\n",
        "        \n",
        "        Args:\n",
        "            markdown_path (str): Markdownファイルのパス\n",
        "        \n",
        "        Returns:\n",
        "            str: 出力したPDFファイルのパス\n",
        "        \"\"\"\n",
        "        # 入力ファイル名からPDFファイル名を生成\n",
        "        pdf_filename = os.path.basename(markdown_path).replace('.md', '.pdf')\n",
        "        pdf_path = os.path.join(self.output_dir, pdf_filename)\n",
        "        \n",
        "        try:\n",
        "            # Markdownを読み込む\n",
        "            with open(markdown_path, 'r', encoding='utf-8') as md_file:\n",
        "                md_content = md_file.read()\n",
        "            \n",
        "            # MarkdownをHTML形式に変換\n",
        "            html_content = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])\n",
        "            \n",
        "            # HTMLをPDFに変換\n",
        "            html_path = os.path.join(self.output_dir, \"temp.html\")\n",
        "            with open(html_path, 'w', encoding='utf-8') as f:\n",
        "                f.write(f\"<html><head><meta charset='utf-8'></head><body>{html_content}</body></html>\")\n",
        "            \n",
        "            # Google Colab用の設定（パスを指定）\n",
        "            config = pdfkit.configuration(wkhtmltopdf='/usr/bin/wkhtmltopdf')\n",
        "            \n",
        "            # wkhtmltopdfを使用してPDFに変換\n",
        "            pdfkit.from_file(html_path, pdf_path, configuration=config)\n",
        "            \n",
        "            # 一時ファイルを削除\n",
        "            if os.path.exists(html_path):\n",
        "                os.remove(html_path)\n",
        "                \n",
        "            return pdf_path\n",
        "            \n",
        "        except Exception as e:\n",
        "            logging.error(f\"Error converting to PDF: {e}\")\n",
        "            return None\n",
        "\n",
        "\n",
        "class DiscordNotifier:\n",
        "    \"\"\"Discordに通知を送信するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, webhook_url: str):\n",
        "        \"\"\"\n",
        "        DiscordNotifierクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            webhook_url (str): Discord Webhook URL\n",
        "        \"\"\"\n",
        "        self.webhook_url = webhook_url\n",
        "        \n",
        "    def notify(self, message: str, markdown_path: Optional[str] = None, pdf_path: Optional[str] = None) -> bool:\n",
        "        \"\"\"\n",
        "        Discord通知を送信する\n",
        "        \n",
        "        Args:\n",
        "            message (str): 通知メッセージ\n",
        "            markdown_path (Optional[str]): 添付するMarkdownファイルのパス\n",
        "            pdf_path (Optional[str]): 添付するPDFファイルのパス\n",
        "        \n",
        "        Returns:\n",
        "            bool: 通知が成功した場合はTrue、失敗した場合はFalse\n",
        "        \"\"\"\n",
        "        try:\n",
        "            # Webhookインスタンスを作成\n",
        "            webhook = DiscordWebhook(url=self.webhook_url, content=message)\n",
        "            \n",
        "            # Markdownファイルを添付\n",
        "            if markdown_path and os.path.exists(markdown_path):\n",
        "                with open(markdown_path, 'rb') as f:\n",
        "                    webhook.add_file(file=f.read(), filename=os.path.basename(markdown_path))\n",
        "            \n",
        "            # PDFファイルを添付\n",
        "            if pdf_path and os.path.exists(pdf_path):\n",
        "                with open(pdf_path, 'rb') as f:\n",
        "                    webhook.add_file(file=f.read(), filename=os.path.basename(pdf_path))\n",
        "            \n",
        "            # 通知を送信\n",
        "            response = webhook.execute()\n",
        "            \n",
        "            # レスポンスコードをチェック\n",
        "            if response and 200 <= response.status_code < 300:\n",
        "                logging.info(\"Discord notification sent successfully\")\n",
        "                return True\n",
        "            else:\n",
        "                logging.error(f\"Failed to send Discord notification: {response.status_code if response else 'No response'}\")\n",
        "                return False\n",
        "                \n",
        "        except Exception as e:\n",
        "            logging.error(f\"Error sending Discord notification: {e}\")\n",
        "            return False\n",
        "\n",
        "\n",
        "class RobotsTxtParser:\n",
        "    \"\"\"robots.txtを解析してクロール許可を確認するコンポーネント\"\"\"\n",
        "    \n",
        "    def __init__(self, base_url: str, user_agent: str = \"*\"):\n",
        "        \"\"\"\n",
        "        RobotsTxtParserクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            base_url (str): ベースURL\n",
        "            user_agent (str): User-Agent文字列\n",
        "        \"\"\"\n",
        "        self.base_url = base_url\n",
        "        self.user_agent = user_agent\n",
        "        self.disallowed_paths = []\n",
        "        self.crawl_delay = 0\n",
        "        \n",
        "        parsed_url = urlparse(base_url)\n",
        "        robots_url = f\"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt\"\n",
        "        \n",
        "        try:\n",
        "            response = requests.get(robots_url, timeout=10)\n",
        "            if response.status_code == 200:\n",
        "                self._parse_robots_txt(response.text)\n",
        "            else:\n",
        "                logging.warning(f\"Could not fetch robots.txt: {response.status_code}\")\n",
        "        except requests.RequestException as e:\n",
        "            logging.error(f\"Error fetching robots.txt: {e}\")\n",
        "    \n",
        "    def _parse_robots_txt(self, robots_txt: str) -> None:\n",
        "        \"\"\"\n",
        "        robots.txtの内容を解析する\n",
        "        \n",
        "        Args:\n",
        "            robots_txt (str): robots.txtの内容\n",
        "        \"\"\"\n",
        "        current_agent = None\n",
        "        \n",
        "        for line in robots_txt.split('\\n'):\n",
        "            line = line.strip().lower()\n",
        "            \n",
        "            if not line or line.startswith('#'):\n",
        "                continue\n",
        "                \n",
        "            parts = line.split(':', 1)\n",
        "            if len(parts) != 2:\n",
        "                continue\n",
        "                \n",
        "            directive, value = parts\n",
        "            directive = directive.strip()\n",
        "            value = value.strip()\n",
        "            \n",
        "            if directive == 'user-agent':\n",
        "                current_agent = value\n",
        "            elif current_agent in (self.user_agent, '*') and directive == 'disallow' and value:\n",
        "                self.disallowed_paths.append(value)\n",
        "            elif current_agent in (self.user_agent, '*') and directive == 'crawl-delay':\n",
        "                try:\n",
        "                    self.crawl_delay = float(value)\n",
        "                except ValueError:\n",
        "                    pass\n",
        "    \n",
        "    def is_allowed(self, url: str) -> bool:\n",
        "        \"\"\"\n",
        "        URLがrobots.txtによりクロールを許可されているかを確認する\n",
        "        \n",
        "        Args:\n",
        "            url (str): 確認するURL\n",
        "        \n",
        "        Returns:\n",
        "            bool: クロールが許可されている場合はTrue、禁止されている場合はFalse\n",
        "        \"\"\"\n",
        "        parsed_url = urlparse(url)\n",
        "        path = parsed_url.path\n",
        "        \n",
        "        for disallowed in self.disallowed_paths:\n",
        "            if path.startswith(disallowed):\n",
        "                return False\n",
        "                \n",
        "        return True\n",
        "\n",
        "\n",
        "class WebCrawler:\n",
        "    \"\"\"Webクローラーのメインコントローラー\"\"\"\n",
        "    \n",
        "    def __init__(self, base_url: str, max_pages: int = 100, delay: float = 1.0, diff_detection: bool = True, cache_dir: str = \"cache\"):\n",
        "        \"\"\"\n",
        "        WebCrawlerクラスの初期化\n",
        "        \n",
        "        Args:\n",
        "            base_url (str): クロールを開始するURL\n",
        "            max_pages (int): クロールする最大ページ数\n",
        "            delay (float): リクエスト間の遅延秒数\n",
        "            diff_detection (bool): 差分検知を有効にするかどうか\n",
        "            cache_dir (str): キャッシュディレクトリ\n",
        "        \"\"\"\n",
        "        self.base_url = base_url\n",
        "        self.max_pages = max_pages\n",
        "        self.diff_detection = diff_detection\n",
        "        \n",
        "        # ドメイン名を取得\n",
        "        self.domain = urlparse(base_url).netloc\n",
        "        \n",
        "        # 各コンポーネントの初期化\n",
        "        self.url_filter = UrlFilter(base_url)\n",
        "        self.robots_parser = RobotsTxtParser(base_url)\n",
        "        self.fetcher = Fetcher(delay=max(delay, self.robots_parser.crawl_delay))\n",
        "        self.parser = Parser(self.url_filter)\n",
        "        self.markdown_converter = MarkdownConverter()\n",
        "        self.repository = ContentRepository()\n",
        "        \n",
        "        if self.diff_detection:\n",
        "            self.cache = CrawlCache(self.domain, cache_dir)\n",
        "        \n",
        "        # クロール状態の追跡\n",
        "        self.visited_urls = set()\n",
        "        self.queue = deque([base_url])\n",
        "        \n",
        "        # 差分情報の追跡\n",
        "        self.new_pages = []\n",
        "        self.updated_pages = []\n",
        "        self.deleted_pages = []\n",
        "        self.page_diffs = {}\n",
        "        \n",
        "    def crawl(self) -> Tuple[ContentRepository, Dict]:\n",
        "        \"\"\"\n",
        "        Webサイトをクロールする\n",
        "        \n",
        "        Returns:\n",
        "            Tuple[ContentRepository, Dict]: (クロールしたコンテンツのリポジトリ, 差分情報)\n",
        "        \"\"\"\n",
        "        count = 0\n",
        "        \n",
        "        while self.queue and count < self.max_pages:\n",
        "            # キューからURLを取得\n",
        "            url = self.queue.popleft()\n",
        "            \n",
        "            # 既に訪問済みのURLはスキップ\n",
        "            if url in self.visited_urls:\n",
        "                continue\n",
        "            \n",
        "            # robots.txtで禁止されているURLはスキップ\n",
        "            if not self.robots_parser.is_allowed(url):\n",
        "                logging.info(f\"Skipping URL disallowed by robots.txt: {url}\")\n",
        "                self.visited_urls.add(url)\n",
        "                continue\n",
        "            \n",
        "            logging.info(f\"Crawling {url} ({count + 1}/{self.max_pages})\")\n",
        "            \n",
        "            # キャッシュからページ情報を取得\n",
        "            cached_page = None\n",
        "            if self.diff_detection:\n",
        "                cached_page = self.cache.get_page(url)\n",
        "            \n",
        "            # ページのHTMLを取得（条件付きリクエスト）\n",
        "            etag = cached_page.get('etag') if cached_page else None\n",
        "            last_modified = cached_page.get('last_modified') if cached_page else None\n",
        "            \n",
        "            html, headers_info = self.fetcher.fetch(url, etag, last_modified)\n",
        "            \n",
        "            # 304 Not Modified の場合、キャッシュから前回のコンテンツを使用\n",
        "            if headers_info.get('status_code') == 304 and cached_page:\n",
        "                logging.info(f\"Using cached content for {url}\")\n",
        "                page_data = {\n",
        "                    'url': url,\n",
        "                    'title': cached_page['title'],\n",
        "                    'html_content': '', # HTMLは保存不要\n",
        "                    'markdown_content': cached_page['markdown_content'],\n",
        "                    'etag': cached_page['etag'],\n",
        "                    'last_modified': cached_page['last_modified'],\n",
        "                }\n",
        "                self.repository.add(page_data)\n",
        "                self.visited_urls.add(url)\n",
        "                count += 1\n",
        "                continue\n",
        "            \n",
        "            # HTMLが取得できなかった場合はスキップ\n",
        "            if html is None:\n",
        "                self.visited_urls.add(url)\n",
        "                continue\n",
        "            \n",
        "            # HTMLを解析してコンテンツとリンクを抽出\n",
        "            page_data, links = self.parser.parse(html, url)\n",
        "            \n",
        "            # コンテンツがない場合はスキップ\n",
        "            if not page_data.get('html_content'):\n",
        "                self.visited_urls.add(url)\n",
        "                continue\n",
        "            \n",
        "            # ヘッダー情報を追加\n",
        "            page_data['etag'] = headers_info.get('etag')\n",
        "            page_data['last_modified'] = headers_info.get('last_modified')\n",
        "            \n",
        "            # HTMLをMarkdownに変換\n",
        "            page_data = self.markdown_converter.convert(page_data)\n",
        "            \n",
        "            # 差分検知（有効な場合）\n",
        "            if self.diff_detection:\n",
        "                markdown_content = page_data.get('markdown_content', '')\n",
        "                \n",
        "                # キャッシュに追加または更新\n",
        "                is_update = self.cache.add_or_update_page(page_data)\n",
        "                \n",
        "                if is_update:\n",
        "                    # コンテンツが変更されている場合のみ更新ページとしてマーク\n",
        "                    if self.cache.is_content_changed(url, markdown_content):\n",
        "                        self.updated_pages.append(url)\n",
        "                        self.page_diffs[url] = self.cache.get_diff(url, markdown_content)\n",
        "                else:\n",
        "                    # 新規ページ\n",
        "                    self.new_pages.append(url)\n",
        "            \n",
        "            # コンテンツを保存\n",
        "            self.repository.add(page_data)\n",
        "            \n",
        "            # 訪問済みとしてマーク\n",
        "            self.visited_urls.add(url)\n",
        "            count += 1\n",
        "            \n",
        "            # 新しいリンクをキューに追加\n",
        "            for link in links:\n",
        "                if link not in self.visited_urls and link not in self.queue:\n",
        "                    self.queue.append(link)\n",
        "        \n",
        "        # 削除されたページを特定（差分検知が有効な場合）\n",
        "        if self.diff_detection:\n",
        "            cached_urls = self.cache.get_all_urls()\n",
        "            current_urls = set(self.repository.get_all().keys())\n",
        "            self.deleted_pages = list(cached_urls - current_urls)\n",
        "            \n",
        "            # 削除されたページをキャッシュから削除\n",
        "            if self.deleted_pages:\n",
        "                self.cache.delete_urls(self.deleted_pages)\n",
        "            \n",
        "            # クロール履歴を保存\n",
        "            self.cache.save_crawl_history(\n",
        "                page_count=self.repository.count(),\n",
        "                new_count=len(self.new_pages),\n",
        "                updated_count=len(self.updated_pages),\n",
        "                deleted_count=len(self.deleted_pages)\n",
        "            )\n",
        "        \n",
        "        # 差分情報を作成\n",
        "        diff_data = {\n",
        "            'total': self.repository.count(),\n",
        "            'new_pages': self.new_pages,\n",
        "            'updated_pages': self.updated_pages,\n",
        "            'deleted_pages': self.deleted_pages,\n",
        "            'diffs': self.page_diffs,\n",
        "            'has_changes': bool(self.new_pages or self.updated_pages or self.deleted_pages)\n",
        "        }\n",
        "        \n",
        "        logging.info(f\"Crawling completed. Visited {len(self.visited_urls)} URLs, stored {self.repository.count()} pages.\")\n",
        "        logging.info(f\"Changes detected: {len(self.new_pages)} new, {len(self.updated_pages)} updated, {len(self.deleted_pages)} deleted.\")\n",
        "        \n",
        "        return self.repository, diff_data\n",
        "\n",
        "\n",
        "def run_crawler(url, max_pages=100, delay=1.0, output_dir=\"output\", cache_dir=\"cache\", discord_webhook=None, no_diff=False, skip_no_changes=False):\n",
        "    \"\"\"Google Colab向けのクローラー実行関数\"\"\"\n",
        "    # ロガーの設定\n",
        "    logging.basicConfig(\n",
        "        level=logging.INFO,\n",
        "        format='%(asctime)s - %(levelname)s - %(message)s',\n",
        "        handlers=[\n",
        "            logging.FileHandler(os.path.join(output_dir, \"crawler.log\")),\n",
        "            logging.StreamHandler()\n",
        "        ]\n",
        "    )\n",
        "    \n",
        "    try:\n",
        "        # 出力ディレクトリを作成\n",
        "        os.makedirs(output_dir, exist_ok=True)\n",
        "        \n",
        "        # クローラーの初期化と実行\n",
        "        crawler = WebCrawler(url, max_pages, delay, diff_detection=not no_diff, cache_dir=cache_dir)\n",
        "        repository, diff_data = crawler.crawl()\n",
        "        \n",
        "        # 結果がない場合はエラー\n",
        "        if repository.count() == 0:\n",
        "            logging.error(\"No content was crawled.\")\n",
        "            if discord_webhook:\n",
        "                notifier = DiscordNotifier(discord_webhook)\n",
        "                notifier.notify(message=f\"Webサイトのクロールが完了しましたが、コンテンツは取得できませんでした。\\n**URL**: {url}\")\n",
        "            return None, None, None\n",
        "        \n",
        "        # 変更がなく、スキップオプションが有効な場合はスキップ\n",
        "        if skip_no_changes and not diff_data['has_changes']:\n",
        "            logging.info(\"No changes detected. Skipping file generation and notification.\")\n",
        "            if discord_webhook:\n",
        "                notifier = DiscordNotifier(discord_webhook)\n",
        "                notifier.notify(message=f\"Webサイトのクロールが完了しましたが、前回から変更はありませんでした。\\n**URL**: {url}\\n**取得ページ数**: {repository.count()}\")\n",
        "            return None, None, None\n",
        "        \n",
        "        # ドメイン名をファイル名として使用\n",
        "        domain = urlparse(url).netloc\n",
        "        markdown_filename = f\"{domain}.md\"\n",
        "        \n",
        "        # Markdownファイルとして出力\n",
        "        exporter = FileExporter(output_dir)\n",
        "        markdown_path = exporter.export_markdown(repository, markdown_filename)\n",
        "        logging.info(f\"Exported Markdown to {markdown_path}\")\n",
        "        \n",
        "        # 差分レポートを出力（差分検知が有効な場合）\n",
        "        diff_report_path = None\n",
        "        if not no_diff and diff_data['has_changes']:\n",
        "            diff_report_filename = f\"{domain}_diff_report.md\"\n",
        "            diff_report_path = exporter.export_diff_report(diff_data, diff_report_filename)\n",
        "            logging.info(f\"Exported diff report to {diff_report_path}\")\n",
        "        \n",
        "        # PDFファイルとして出力\n",
        "        pdf_converter = PdfConverter(output_dir)\n",
        "        pdf_path = pdf_converter.convert(markdown_path)\n",
        "        if pdf_path:\n",
        "            logging.info(f\"Exported PDF to {pdf_path}\")\n",
        "        \n",
        "        # 差分レポートのPDFを生成（差分がある場合）\n",
        "        diff_report_pdf_path = None\n",
        "        if diff_report_path:\n",
        "            diff_report_pdf_path = pdf_converter.convert(diff_report_path)\n",
        "            if diff_report_pdf_path:\n",
        "                logging.info(f\"Exported diff report PDF to {diff_report_pdf_path}\")\n",
        "        \n",
        "        # Discord通知\n",
        "        if discord_webhook:\n",
        "            notifier = DiscordNotifier(discord_webhook)\n",
        "            \n",
        "            # 差分検知が有効かつ変更がある場合\n",
        "            if not no_diff and diff_data['has_changes']:\n",
        "                message = f\"Webサイトのクロールが完了しました。**変更が検出されました**。\\n\"\n",
        "                message += f\"**URL**: {url}\\n\"\n",
        "                message += f\"**取得ページ数**: {diff_data['total']}\\n\"\n",
        "                message += f\"**新規ページ**: {len(diff_data['new_pages'])}\\n\"\n",
        "                message += f\"**更新ページ**: {len(diff_data['updated_pages'])}\\n\"\n",
        "                message += f\"**削除ページ**: {len(diff_data['deleted_pages'])}\"\n",
        "                \n",
        "                # 差分レポートを添付\n",
        "                success = notifier.notify(\n",
        "                    message=message,\n",
        "                    markdown_path=diff_report_path,\n",
        "                    pdf_path=diff_report_pdf_path or pdf_path\n",
        "                )\n",
        "            else:\n",
        "                # 変更がない場合または差分検知が無効の場合\n",
        "                message = f\"Webサイトのクロールが完了しました。\\n**URL**: {url}\\n**取得ページ数**: {repository.count()}\"\n",
        "                success = notifier.notify(\n",
        "                    message=message,\n",
        "                    markdown_path=markdown_path,\n",
        "                    pdf_path=pdf_path\n",
        "                )\n",
        "                \n",
        "            if success:\n",
        "                logging.info(\"Discord notification sent successfully\")\n",
        "            else:\n",
        "                logging.error(\"Failed to send Discord notification\")\n",
        "        \n",
        "        logging.info(\"Process completed successfully\")\n",
        "        \n",
        "        return markdown_path, pdf_path, diff_report_path\n",
        "        \n",
        "    except Exception as e:\n",
        "        logging.error(f\"An error occurred during execution: {e}\")\n",
        "        if discord_webhook:\n",
        "            notifier = DiscordNotifier(discord_webhook)\n",
        "            notifier.notify(message=f\"Webサイトのクロール中にエラーが発生しました。\\n**URL**: {url}\\n**エラー**: {str(e)}\")\n",
        "        return None, None, None"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "user-interface"
      },
      "source": [
        "## 4. Google Colab用のユーザーインターフェース\n",
        "\n",
        "以下のセルを実行して、クローラーの実行パラメータを設定します。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "form-cell"
      },
      "source": [
        "from IPython.display import display, HTML, Javascript, FileLink\n",
        "import ipywidgets as widgets\n",
        "\n",
        "# URL入力フィールド\n",
        "url_input = widgets.Text(\n",
        "    value='https://example.com',\n",
        "    placeholder='クロールするWebサイトのURLを入力してください',\n",
        "    description='URL:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "# 最大ページ数スライダー\n",
        "max_pages_slider = widgets.IntSlider(\n",
        "    value=100,\n",
        "    min=10,\n",
        "    max=500,\n",
        "    step=10,\n",
        "    description='最大ページ数:',\n",
        "    disabled=False,\n",
        "    continuous_update=False,\n",
        "    orientation='horizontal',\n",
        "    readout=True,\n",
        "    readout_format='d'\n",
        ")\n",
        "\n",
        "# 遅延時間スライダー\n",
        "delay_slider = widgets.FloatSlider(\n",
        "    value=1.0,\n",
        "    min=0.5,\n",
        "    max=5.0,\n",
        "    step=0.5,\n",
        "    description='遅延時間(秒):',\n",
        "    disabled=False,\n",
        "    continuous_update=False,\n",
        "    orientation='horizontal',\n",
        "    readout=True,\n",
        "    readout_format='.1f'\n",
        ")\n",
        "\n",
        "# Discord Webhook URL入力\n",
        "discord_webhook_input = widgets.Text(\n",
        "    value='',\n",
        "    placeholder='Discord Webhook URLを入力してください（オプション）',\n",
        "    description='Discord:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "# オプションチェックボックス\n",
        "diff_checkbox = widgets.Checkbox(\n",
        "    value=True,\n",
        "    description='差分検知を有効にする',\n",
        "    disabled=False\n",
        ")\n",
        "\n",
        "skip_no_changes_checkbox = widgets.Checkbox(\n",
        "    value=True,\n",
        "    description='変更がない場合はスキップ',\n",
        "    disabled=False\n",
        ")\n",
        "\n",
        "# パス設定\n",
        "output_dir = widgets.Text(\n",
        "    value='/content/drive/MyDrive/website_crawler/output',\n",
        "    description='出力ディレクトリ:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "cache_dir = widgets.Text(\n",
        "    value='/content/drive/MyDrive/website_crawler/cache',\n",
        "    description='キャッシュディレクトリ:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "# 実行ボタン\n",
        "run_button = widgets.Button(\n",
        "    description='クローラーを実行',\n",
        "    disabled=False,\n",
        "    button_style='success',\n",
        "    tooltip='クリックしてクローラーを実行',\n",
        "    icon='play'\n",
        ")\n",
        "\n",
        "# 出力エリア\n",
        "output = widgets.Output()\n",
        "\n",
        "# ボタンクリックイベント\n",
        "def on_run_button_clicked(b):\n",
        "    with output:\n",
        "        output.clear_output()\n",
        "        print(\"クローラーを実行中...\")\n",
        "        \n",
        "        url = url_input.value\n",
        "        max_pages = max_pages_slider.value\n",
        "        delay = delay_slider.value\n",
        "        discord_webhook = discord_webhook_input.value if discord_webhook_input.value else None\n",
        "        no_diff = not diff_checkbox.value\n",
        "        skip_no_changes = skip_no_changes_checkbox.value\n",
        "        \n",
        "        # クローラーを実行\n",
        "        markdown_path, pdf_path, diff_path = run_crawler(\n",
        "            url=url,\n",
        "            max_pages=max_pages,\n",
        "            delay=delay,\n",
        "            output_dir=output_dir.value,\n",
        "            cache_dir=cache_dir.value,\n",
        "            discord_webhook=discord_webhook,\n",
        "            no_diff=no_diff,\n",
        "            skip_no_changes=skip_no_changes\n",
        "        )\n",
        "        \n",
        "        # 結果を表示\n",
        "        if markdown_path:\n",
        "            print(f\"\\n処理が完了しました！\")\n",
        "            print(f\"\\nMarkdownファイル: {markdown_path}\")\n",
        "            if pdf_path:\n",
        "                print(f\"PDFファイル: {pdf_path}\")\n",
        "            if diff_path:\n",
        "                print(f\"差分レポート: {diff_path}\")\n",
        "                \n",
        "            # ファイルへのリンクを表示\n",
        "            if os.path.exists(markdown_path):\n",
        "                print(\"\\nファイルをダウンロード:\")\n",
        "                display(FileLink(markdown_path))\n",
        "                if pdf_path and os.path.exists(pdf_path):\n",
        "                    display(FileLink(pdf_path))\n",
        "                if diff_path and os.path.exists(diff_path):\n",
        "                    display(FileLink(diff_path))\n",
        "        else:\n",
        "            print(\"\\nエラーが発生したかクロールをスキップしました。ログを確認してください。\")\n",
        "\n",
        "run_button.on_click(on_run_button_clicked)\n",
        "\n",
        "# UIを表示\n",
        "display(widgets.HTML(\"<h3>Webサイトクローラー設定</h3>\"))\n",
        "display(url_input)\n",
        "display(widgets.HBox([max_pages_slider, delay_slider]))\n",
        "display(discord_webhook_input)\n",
        "display(widgets.HBox([diff_checkbox, skip_no_changes_checkbox]))\n",
        "display(output_dir)\n",
        "display(cache_dir)\n",
        "display(run_button)\n",
        "display(output)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "manual-run"
      },
      "source": [
        "## 5. 手動でクローラーを実行する（オプション）\n",
        "\n",
        "必要に応じて、以下のセルでクローラーを直接呼び出すこともできます。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "manual-run-code"
      },
      "source": [
        "# 手動実行の例\n",
        "# markdown_path, pdf_path, diff_path = run_crawler(\n",
        "#     url=\"https://example.com\",\n",
        "#     max_pages=100,\n",
        "#     delay=1.0,\n",
        "#     output_dir=crawler_dir + \"/output\",\n",
        "#     cache_dir=crawler_dir + \"/cache\",\n",
        "#     discord_webhook=None,  # ここにWebhook URLを入力\n",
        "#     no_diff=False,\n",
        "#     skip_no_changes=True\n",
        "# )\n",
        "# \n",
        "# print(\"処理結果:\")\n",
        "# print(f\"Markdown: {markdown_path}\")\n",
        "# print(f\"PDF: {pdf_path}\")\n",
        "# print(f\"差分レポート: {diff_path}\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "schedule-note"
      },
      "source": [
        "## 6. 定期実行について\n",
        "\n",
        "Google Colabでは、セッションが一定時間後に切断されるため、長時間の定期実行には向いていません。\n",
        "定期的なクロールを行いたい場合は、以下の選択肢があります：\n",
        "\n",
        "1. ローカルマシンでPythonスクリプトとして実行する\n",
        "2. Google Cloud FunctionsやCloud Runなどのサーバーレスサービスで定期実行する\n",
        "3. GitHub ActionsやCircle CIなどのCI/CDサービスを利用して定期実行する\n",
        "\n",
        "このノートブックはあくまで手動実行用として使用してください。"
      ]
    }
  ]
}