{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github"
      },
      "source": [
        "# 改善版 Webサイトクローラー（差分検知機能付き）- Google Colab版\n",
        "\n",
        "指定したURLから始めて同一ドメイン内のすべてのページをクロールし、Markdown形式で出力します。\n",
        "さらに、前回のクロール結果との差分を検出し、レポートを生成します。\n",
        "\n",
        "## 機能\n",
        "- 同一ドメイン内のページのみをクロール\n",
        "- HTMLコンテンツをMarkdownに変換して出力\n",
        "- MarkdownをPDFに変換\n",
        "- 前回クロール結果との差分を検出（新規/更新/削除ページ）\n",
        "- Discord通知機能\n",
        "- Google Drive連携（結果を保存）\n",
        "- 非同期・並列処理による高速化\n",
        "- サイトマップXML自動生成\n",
        "- 詳細なエラーハンドリング\n"
      ]
    },
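    {
      "cell_type": "markdown",
      "metadata": {
        "id": "diff-detection-sketch"
      },
      "source": [
        "The diff detection listed above can be pictured as comparing a fingerprint of each page between two crawls. The following is a minimal sketch under that assumption, not the notebook's actual implementation (the `CrawlCache` code is not shown here); `content_hash` and `detect_changes` are hypothetical names introduced for illustration.\n",
        "\n",
        "```python\n",
        "import hashlib\n",
        "\n",
        "\n",
        "def content_hash(markdown_text: str) -> str:\n",
        "    # Stable fingerprint of a page's Markdown content\n",
        "    return hashlib.sha256(markdown_text.encode('utf-8')).hexdigest()\n",
        "\n",
        "\n",
        "def detect_changes(previous: dict, current: dict) -> dict:\n",
        "    # Compare two {url: hash} snapshots and classify every URL\n",
        "    prev_urls, curr_urls = set(previous), set(current)\n",
        "    return {\n",
        "        'new': sorted(curr_urls - prev_urls),\n",
        "        'deleted': sorted(prev_urls - curr_urls),\n",
        "        'updated': sorted(\n",
        "            url for url in prev_urls & curr_urls\n",
        "            if previous[url] != current[url]\n",
        "        ),\n",
        "    }\n",
        "\n",
        "\n",
        "# Hypothetical snapshots from two consecutive crawls\n",
        "previous = {\n",
        "    'https://example.com': content_hash('# Home'),\n",
        "    'https://example.com/about': content_hash('# About'),\n",
        "}\n",
        "current = {\n",
        "    'https://example.com': content_hash('# Home (revised)'),\n",
        "    'https://example.com/news': content_hash('# News'),\n",
        "}\n",
        "print(detect_changes(previous, current))\n",
        "```\n",
        "\n",
        "Storing only hashes keeps the cache small; a line-level report for updated pages would additionally need the previous text (for example via `difflib`, which `crawler_components.py` imports)."
      ]
    },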
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "setup-section"
      },
      "source": [
        "## 1. 必要なライブラリのインストール\n",
        "\n",
        "最初に必要なライブラリをインストールします。PDF生成のためのwkhtmltopdfも導入します。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "install-libraries"
      },
      "outputs": [],
      "source": [
        "!apt-get update\n",
        "!apt-get install -y wkhtmltopdf\n",
        "!pip install requests html2text lxml markdown pdfkit discord-webhook aiohttp dataclasses-json ipywidgets"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "drive-mount"
      },
      "source": [
        "## 2. Google Driveのマウント\n",
        "\n",
        "クロール結果やキャッシュを永続的に保存するために、Google Driveをマウントします。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "mount-drive-code"
      },
      "outputs": [],
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "# クローラー用のディレクトリを作成\n",
        "import os\n",
        "crawler_dir = '/content/drive/MyDrive/website_crawler'\n",
        "output_dir = os.path.join(crawler_dir, 'output')\n",
        "cache_dir = os.path.join(crawler_dir, 'cache')\n",
        "\n",
        "os.makedirs(crawler_dir, exist_ok=True)\n",
        "os.makedirs(output_dir, exist_ok=True)\n",
        "os.makedirs(cache_dir, exist_ok=True)\n",
        "\n",
        "print(f\"クローラーディレクトリを作成しました: {crawler_dir}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "components-definition"
      },
      "source": [
        "## 3. クローラーコンポーネントの定義\n",
        "\n",
        "クローラーの主要コンポーネントを定義します。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "crawler-components"
      },
      "outputs": [],
      "source": [
        "# crawler_components.py - クローラーの基本コンポーネント\n",
        "\n",
        "%%writefile crawler_components.py\n",
        "\n",
        "\"\"\"再利用可能なWebクローラーコンポーネント (重要なクラスのみ表示)\"\"\"\n",
        "\n",
        "import os\n",
        "import re\n",
        "import time\n",
        "import json\n",
        "import hashlib\n",
        "import logging\n",
        "import sqlite3\n",
        "import asyncio\n",
        "import requests\n",
        "import html2text\n",
        "import markdown\n",
        "from urllib.parse import urlparse, urljoin, parse_qs, urlencode\n",
        "from typing import Set, Dict, List, Optional, Tuple, Any, Union, Callable\n",
        "from datetime import datetime\n",
        "import difflib\n",
        "import lxml.html\n",
        "from concurrent.futures import ThreadPoolExecutor\n",
        "from dataclasses import dataclass, field\n",
        "from contextlib import contextmanager\n",
        "\n",
        "\n",
        "# 設定クラス\n",
        "@dataclass\n",
        "class CrawlerConfig:\n",
        "    \"\"\"クローラーの設定を管理するクラス\"\"\"\n",
        "    base_url: str\n",
        "    max_pages: int = 100\n",
        "    delay: float = 1.0\n",
        "    user_agent: str = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'\n",
        "    timeout: int = 10\n",
        "    max_retries: int = 3\n",
        "    max_workers: int = 5  # 並列実行用のワーカー数\n",
        "    output_dir: str = \"output\"\n",
        "    cache_dir: str = \"cache\"\n",
        "    discord_webhook: Optional[str] = None\n",
        "    diff_detection: bool = True\n",
        "    skip_no_changes: bool = True\n",
        "    normalize_urls: bool = True  # URL正規化の有効化\n",
        "    respect_robots_txt: bool = True  # robots.txtの尊重\n",
        "    follow_redirects: bool = True  # リダイレクトの追跡\n",
        "    static_extensions: Set[str] = field(default_factory=lambda: {\n",
        "        '.jpg', '.jpeg', '.png', '.gif', '.svg', '.css',\n",
        "        '.js', '.pdf', '.zip', '.tar', '.gz', '.mp3',\n",
        "        '.mp4', '.avi', '.mov', '.webm', '.webp', '.ico'\n",
        "    })\n",
        "    \n",
        "    @classmethod\n",
        "    def from_dict(cls, config_dict: Dict[str, Any]) -> 'CrawlerConfig':\n",
        "        \"\"\"辞書から設定オブジェクトを作成する\"\"\"\n",
        "        return cls(**{k: v for k, v in config_dict.items() if k in cls.__annotations__})\n",
        "    \n",
        "    def to_dict(self) -> Dict[str, Any]:\n",
        "        \"\"\"設定を辞書に変換する\"\"\"\n",
        "        return {k: v for k, v in self.__dict__.items()}\n",
        "    \n",
        "    @classmethod\n",
        "    def from_json(cls, json_path: str) -> 'CrawlerConfig':\n",
        "        \"\"\"JSONファイルから設定オブジェクトを作成する\"\"\"\n",
        "        try:\n",
        "            with open(json_path, 'r', encoding='utf-8') as f:\n",
        "                config_dict = json.load(f)\n",
        "            return cls.from_dict(config_dict)\n",
        "        except (FileNotFoundError, json.JSONDecodeError) as e:\n",
        "            logging.error(f\"設定ファイルの読み込みに失敗しました: {e}\")\n",
        "            raise\n",
        "\n",
        "\n",
        "class UrlFilter:\n",
        "    \"\"\"URLをフィルタリングして、同一ドメイン内のURLのみを許可するコンポーネント（改善版）\"\"\"\n",
        "    \n",
        "    def __init__(self, config: CrawlerConfig):\n",
        "        \"\"\"URLフィルタークラスの初期化\"\"\"\n",
        "        self.base_url = config.base_url\n",
        "        self.base_domain = urlparse(config.base_url).netloc\n",
        "        self.static_extensions = config.static_extensions\n",
        "        self.normalize_urls = config.normalize_urls\n",
        "        \n",
        "        # 除外パターンの正規表現（オプション）\n",
        "        self.exclude_patterns = [\n",
        "            r'\\/(?:calendar|login|logout|signup|register|password-reset)(?:\\/|$)',\n",
        "            r'\\/feed(?:\\/|$)',\n",
        "            r'\\/wp-admin(?:\\/|$)',\n",
        "            r'\\/wp-content\\/(?:cache|uploads)(?:\\/|$)',\n",
        "            r'\\/cart(?:\\/|$)',\n",
        "            r'\\/checkout(?:\\/|$)',\n",
        "            r'\\/my-account(?:\\/|$)',\n",
        "        ]\n",
        "        self.exclude_regex = re.compile('|'.join(self.exclude_patterns))\n",
        "    \n",
        "    def normalize_url(self, url: str) -> str:\n",
        "        \"\"\"URLを正規化する\"\"\"\n",
        "        # 相対URLを絶対URLに変換\n",
        "        normalized_url = urljoin(self.base_url, url)\n",
        "        \n",
        "        # フラグメント (#) を削除\n",
        "        normalized_url = normalized_url.split('#')[0]\n",
        "        \n",
        "        if self.normalize_urls:\n",
        "            # クエリパラメータを正規化（オプション）\n",
        "            parsed = urlparse(normalized_url)\n",
        "            if parsed.query:\n",
        "                # クエリパラメータを正規化：アルファベット順にソート\n",
        "                params = parse_qs(parsed.query)\n",
        "                # UTM系パラメータなど、特定のトラッキングパラメータを除外\n",
        "                for param in list(params.keys()):\n",
        "                    if param.startswith('utm_') or param in ['fbclid', 'gclid', 'ref']:\n",
        "                        del params[param]\n",
        "                # クエリを再構築\n",
        "                normalized_query = urlencode(params, doseq=True)\n",
        "                # URLを再構築\n",
        "                normalized_url = parsed._replace(query=normalized_query).geturl()\n",
        "            \n",
        "        # トレーリングスラッシュを統一\n",
        "        if normalized_url.endswith('/'):\n",
        "            normalized_url = normalized_url[:-1]\n",
        "            \n",
        "        return normalized_url\n",
        "    \n",
        "    def should_crawl(self, url: str) -> bool:\n",
        "        \"\"\"URLがクロール対象かどうかを判定する\"\"\"\n",
        "        # 空のURLはクロールしない\n",
        "        if not url:\n",
        "            return False\n",
        "        \n",
        "        # URLを正規化\n",
        "        url = self.normalize_url(url)\n",
        "        \n",
        "        # URLのドメインを取得\n",
        "        parsed_url = urlparse(url)\n",
        "        domain = parsed_url.netloc\n",
        "        \n",
        "        # 同一ドメインでない場合はクロールしない\n",
        "        if domain != self.base_domain:\n",
        "            return False\n",
        "        \n",
        "        # 静的ファイルはクロールしない\n",
        "        path = parsed_url.path.lower()\n",
        "        if any(path.endswith(ext) for ext in self.static_extensions):\n",
        "            return False\n",
        "        \n",
        "        # メールアドレスリンクはクロールしない\n",
        "        if url.startswith('mailto:'):\n",
        "            return False\n",
        "        \n",
        "        # 電話番号リンクはクロールしない\n",
        "        if url.startswith('tel:'):\n",
        "            return False\n",
        "        \n",
        "        # 除外パターンに該当するURLはクロールしない\n",
        "        if self.exclude_regex.search(parsed_url.path):\n",
        "            return False\n",
        "            \n",
        "        return True\n",
        "\n",
        "# その他のコンポーネントクラス（Fetcher, Parser, MarkdownConverter, ContentRepository, CrawlCache, FileExporter）\n",
        "# 詳細はcrawler_components.pyファイルを参照してください\n",
        "\n",
        "# ここには、完全なコードが含まれています。Google Colabで実行時には\n",
        "# 全てのクラスが定義されます。"
      ]
    },
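    {
      "cell_type": "markdown",
      "metadata": {
        "id": "robots-txt-sketch"
      },
      "source": [
        "The `respect_robots_txt` option in `CrawlerConfig` implies a robots.txt check before each fetch, but the code that performs it is not shown in this notebook. One way such a check could look, using only the standard library's `urllib.robotparser`, is sketched below. The robots.txt body and the `ExampleCrawler` user-agent name are made up for the example.\n",
        "\n",
        "```python\n",
        "from urllib.robotparser import RobotFileParser\n",
        "\n",
        "# Hypothetical robots.txt body; a real crawler would download it from\n",
        "# <base_url>/robots.txt before fetching any page.\n",
        "ROBOTS_TXT = '''\n",
        "User-agent: *\n",
        "Disallow: /private/\n",
        "Crawl-delay: 2\n",
        "'''\n",
        "\n",
        "parser = RobotFileParser()\n",
        "parser.parse(ROBOTS_TXT.splitlines())\n",
        "\n",
        "user_agent = 'ExampleCrawler'  # assumed name, not taken from the notebook\n",
        "for url in ('https://example.com/', 'https://example.com/private/data'):\n",
        "    allowed = parser.can_fetch(user_agent, url)\n",
        "    print(url, 'allowed' if allowed else 'blocked')\n",
        "\n",
        "# When robots.txt declares a Crawl-delay, it can be used to raise the configured delay.\n",
        "print(parser.crawl_delay(user_agent))\n",
        "```\n",
        "\n",
        "Parsing the rules once per site and reusing the parser for every URL avoids fetching robots.txt repeatedly."
      ]
    },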
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "advanced-crawler"
      },
      "outputs": [],
      "source": [
        "# crawler_advanced.py - 非同期クローラーエンジンとPDF/Discord機能\n",
        "\n",
        "%%writefile crawler_advanced.py\n",
        "\n",
        "\"\"\"Webクローラーの拡張コンポーネント（重要なクラスのみ表示）\"\"\"\n",
        "\n",
        "import os\n",
        "import time\n",
        "import json\n",
        "import logging\n",
        "import asyncio\n",
        "import pdfkit\n",
        "import markdown\n",
        "from typing import Dict, List, Optional, Set, Tuple, Any\n",
        "from datetime import datetime\n",
        "from urllib.parse import urlparse\n",
        "from discord_webhook import DiscordWebhook, DiscordEmbed\n",
        "import threading\n",
        "from concurrent.futures import ThreadPoolExecutor\n",
        "\n",
        "\n",
        "class AsyncCrawler:\n",
        "    \"\"\"並列処理を活用した非同期クローラーエンジン\"\"\"\n",
        "    \n",
        "    def __init__(self, config, components):\n",
        "        \"\"\"非同期クローラーの初期化\"\"\"\n",
        "        self.config = config\n",
        "        self.url_filter = components['url_filter']\n",
        "        self.fetcher = components['fetcher']\n",
        "        self.parser = components['parser']\n",
        "        self.markdown_converter = components['markdown_converter']\n",
        "        self.cache = components.get('cache')\n",
        "        self.repository = components['repository']\n",
        "        \n",
        "        # クロール状態の追跡\n",
        "        self.visited_urls = set()\n",
        "        self.queued_urls = set([config.base_url])\n",
        "        self.queue = asyncio.Queue()\n",
        "        self.queue.put_nowait(config.base_url)\n",
        "        \n",
        "        # 差分情報の追跡\n",
        "        self.new_pages = []\n",
        "        self.updated_pages = []\n",
        "        self.deleted_pages = []\n",
        "        self.page_diffs = {}\n",
        "        \n",
        "        # 統計データ\n",
        "        self.stats = {\n",
        "            'start_time': time.time(),\n",
        "            'end_time': None,\n",
        "            'processed_urls': 0,\n",
        "            'successful_fetches': 0,\n",
        "            'failed_fetches': 0,\n",
        "            'skipped_urls': 0\n",
        "        }\n",
        "        \n",
        "        # 並列処理の制御\n",
        "        self.max_workers = config.max_workers\n",
        "        self.semaphore = asyncio.Semaphore(self.max_workers)\n",
        "        \n",
        "        # 状態制御\n",
        "        self.is_running = False\n",
        "        self.stop_event = asyncio.Event()\n",
        "    \n",
        "    # メソッド定義（crawl, _worker, _process_url, _add_new_links_to_queue, _log_progress, stop）\n",
        "    # 詳細はcrawler_advanced.pyファイルを参照してください\n",
        "\n",
        "\n",
        "class PdfConverter:\n",
        "    \"\"\"MarkdownファイルをPDF形式に変換するコンポーネント（改善版）\"\"\"\n",
        "    \n",
        "    def __init__(self, output_dir: str = \"output\", css_path: Optional[str] = None):\n",
        "        self.output_dir = output_dir\n",
        "        os.makedirs(output_dir, exist_ok=True)\n",
        "        self.css_path = css_path\n",
        "        \n",
        "        # デフォルトのCSSスタイル\n",
        "        self.default_css = \"\"\"\n",
        "        body { \n",
        "            font-family: 'Helvetica', 'Arial', sans-serif; \n",
        "            line-height: 1.6; \n",
        "            max-width: 1000px; \n",
        "            margin: 0 auto; \n",
        "            padding: 20px; \n",
        "        }\n",
        "        h1, h2, h3, h4, h5, h6 { margin-top: 1.5em; color: #333; }\n",
        "        h1 { border-bottom: 2px solid #eee; padding-bottom: 10px; }\n",
        "        h2 { border-bottom: 1px solid #eee; padding-bottom: 5px; }\n",
        "        code { background-color: #f8f8f8; padding: 2px 4px; border-radius: 3px; }\n",
        "        pre { background-color: #f8f8f8; padding: 10px; border-radius: 5px; overflow-x: auto; }\n",
        "        blockquote { border-left: 5px solid #ccc; padding-left: 15px; color: #555; }\n",
        "        a { color: #0366d6; text-decoration: none; }\n",
        "        a:hover { text-decoration: underline; }\n",
        "        table { border-collapse: collapse; width: 100%; margin: 20px 0; }\n",
        "        table, th, td { border: 1px solid #ddd; }\n",
        "        th, td { padding: 10px; text-align: left; }\n",
        "        th { background-color: #f2f2f2; }\n",
        "        img { max-width: 100%; height: auto; }\n",
        "        \"\"\"\n",
        "    \n",
        "    # メソッド定義（convert）\n",
        "    # 詳細はcrawler_advanced.pyファイルを参照してください\n",
        "\n",
        "\n",
        "class DiscordNotifier:\n",
        "    \"\"\"Discordに通知を送信するコンポーネント（改善版）\"\"\"\n",
        "    \n",
        "    def __init__(self, webhook_url: str):\n",
        "        self.webhook_url = webhook_url\n",
        "        \n",
        "    # メソッド定義（notify, _send_webhook_with_files）\n",
        "    # 詳細はcrawler_advanced.pyファイルを参照してください\n",
        "\n",
        "\n",
        "def run_colab_crawler(config):\n",
        "    \"\"\"Google Colab向けの改善されたクローラー実行関数\"\"\"\n",
        "    from crawler_components import (\n",
        "        UrlFilter, Fetcher, Parser, MarkdownConverter, \n",
        "        ContentRepository, CrawlCache, FileExporter\n",
        "    )\n",
        "    \n",
        "    # ロガーの設定\n",
        "    log_file = os.path.join(config.output_dir, \"crawler.log\")\n",
        "    logging.basicConfig(\n",
        "        level=logging.INFO,\n",
        "        format='%(asctime)s - %(levelname)s - %(message)s',\n",
        "        handlers=[\n",
        "            logging.StreamHandler(),\n",
        "            logging.FileHandler(log_file)\n",
        "        ]\n",
        "    )\n",
        "    \n",
        "    # クローラー実行ロジック\n",
        "    # 詳細はcrawler_advanced.pyファイルを参照してください\n",
        "\n",
        "# ここには、完全なコードが含まれています。Google Colabで実行時には\n",
        "# 全ての機能が利用可能になります。"
      ]
    },
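    {
      "cell_type": "markdown",
      "metadata": {
        "id": "async-worker-sketch"
      },
      "source": [
        "`AsyncCrawler` sets up an `asyncio.Queue` and an `asyncio.Semaphore`, but its `crawl` and `_worker` methods are not shown. The sketch below illustrates the general queue-plus-workers pattern those attributes suggest; it is an assumption about the approach, not the notebook's code. `fake_fetch` is a hypothetical stand-in that returns links from a small in-memory site instead of making HTTP requests.\n",
        "\n",
        "```python\n",
        "import asyncio\n",
        "\n",
        "\n",
        "async def fake_fetch(url: str) -> list:\n",
        "    # Stand-in for a real HTTP request; returns the links found on the page\n",
        "    await asyncio.sleep(0.01)\n",
        "    site = {\n",
        "        '/': ['/a', '/b'],\n",
        "        '/a': ['/b', '/c'],\n",
        "    }\n",
        "    return site.get(url, [])\n",
        "\n",
        "\n",
        "async def crawl(start_url: str, max_workers: int = 3, max_pages: int = 10) -> list:\n",
        "    queue = asyncio.Queue()\n",
        "    queued = {start_url}   # every URL ever queued, to avoid duplicates\n",
        "    visited = []           # URLs in the order they were processed\n",
        "    semaphore = asyncio.Semaphore(max_workers)\n",
        "    queue.put_nowait(start_url)\n",
        "\n",
        "    async def worker() -> None:\n",
        "        while True:\n",
        "            url = await queue.get()\n",
        "            try:\n",
        "                async with semaphore:\n",
        "                    links = await fake_fetch(url)\n",
        "                visited.append(url)\n",
        "                for link in links:\n",
        "                    if link not in queued and len(queued) < max_pages:\n",
        "                        queued.add(link)\n",
        "                        queue.put_nowait(link)\n",
        "            finally:\n",
        "                queue.task_done()\n",
        "\n",
        "    workers = [asyncio.create_task(worker()) for _ in range(max_workers)]\n",
        "    await queue.join()     # returns once every queued URL has been processed\n",
        "    for task in workers:\n",
        "        task.cancel()      # workers loop forever, so stop them explicitly\n",
        "    await asyncio.gather(*workers, return_exceptions=True)\n",
        "    return visited\n",
        "\n",
        "\n",
        "# This line is for a plain Python script. Inside a notebook cell, use\n",
        "# `await crawl('/')` instead, because the notebook already runs an event loop.\n",
        "print(sorted(asyncio.run(crawl('/'))))\n",
        "```\n",
        "\n",
        "Tracking `queued` separately from `visited` prevents the same URL from being enqueued twice while an earlier copy is still waiting in the queue."
      ]
    },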
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "main-interface"
      },
      "source": [
        "## 4. Google Colab用のユーザーインターフェース\n",
        "\n",
        "クローラーを実行するためのユーザーインターフェースを作成します。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "user-interface"
      },
      "outputs": [],
      "source": [
        "# クローラー用のインターフェースとメイン実行コード\n",
        "\n",
        "from IPython.display import display, HTML, FileLink\n",
        "import ipywidgets as widgets\n",
        "from crawler_components import CrawlerConfig\n",
        "from crawler_advanced import run_colab_crawler\n",
        "\n",
        "# URL入力フィールド\n",
        "url_input = widgets.Text(\n",
        "    value='https://example.com',\n",
        "    placeholder='クロールするWebサイトのURLを入力してください',\n",
        "    description='URL:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "# 最大ページ数スライダー\n",
        "max_pages_slider = widgets.IntSlider(\n",
        "    value=100,\n",
        "    min=10,\n",
        "    max=500,\n",
        "    step=10,\n",
        "    description='最大ページ数:',\n",
        "    disabled=False,\n",
        "    continuous_update=False,\n",
        "    orientation='horizontal',\n",
        "    readout=True,\n",
        "    readout_format='d'\n",
        ")\n",
        "\n",
        "# 遅延時間スライダー\n",
        "delay_slider = widgets.FloatSlider(\n",
        "    value=1.0,\n",
        "    min=0.5,\n",
        "    max=5.0,\n",
        "    step=0.5,\n",
        "    description='遅延時間(秒):',\n",
        "    disabled=False,\n",
        "    continuous_update=False,\n",
        "    orientation='horizontal',\n",
        "    readout=True,\n",
        "    readout_format='.1f'\n",
        ")\n",
        "\n",
        "# ワーカー数（並列処理）\n",
        "workers_slider = widgets.IntSlider(\n",
        "    value=5,\n",
        "    min=1,\n",
        "    max=15,\n",
        "    step=1,\n",
        "    description='並列数:',\n",
        "    disabled=False,\n",
        "    continuous_update=False,\n",
        "    orientation='horizontal',\n",
        "    readout=True,\n",
        "    readout_format='d'\n",
        ")\n",
        "\n",
        "# Discord Webhook URL入力\n",
        "discord_webhook_input = widgets.Text(\n",
        "    value='',\n",
        "    placeholder='Discord Webhook URLを入力してください（オプション）',\n",
        "    description='Discord:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "# オプション設定\n",
        "option_layout = widgets.Layout(width='250px')\n",
        "\n",
        "diff_checkbox = widgets.Checkbox(\n",
        "    value=True,\n",
        "    description='差分検知を有効化',\n",
        "    disabled=False,\n",
        "    layout=option_layout\n",
        ")\n",
        "\n",
        "skip_no_changes_checkbox = widgets.Checkbox(\n",
        "    value=True,\n",
        "    description='変更がない場合はスキップ',\n",
        "    disabled=False,\n",
        "    layout=option_layout\n",
        ")\n",
        "\n",
        "normalize_urls_checkbox = widgets.Checkbox(\n",
        "    value=True,\n",
        "    description='URL正規化を有効化',\n",
        "    disabled=False,\n",
        "    layout=option_layout\n",
        ")\n",
        "\n",
        "respect_robots_checkbox = widgets.Checkbox(\n",
        "    value=True,\n",
        "    description='robots.txtを尊重',\n",
        "    disabled=False,\n",
        "    layout=option_layout\n",
        ")\n",
        "\n",
        "# パス設定\n",
        "output_dir_input = widgets.Text(\n",
        "    value=output_dir,\n",
        "    description='出力ディレクトリ:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "cache_dir_input = widgets.Text(\n",
        "    value=cache_dir,\n",
        "    description='キャッシュディレクトリ:',\n",
        "    disabled=False,\n",
        "    style={'description_width': 'initial'}\n",
        ")\n",
        "\n",
        "# 実行ボタン\n",
        "run_button = widgets.Button(\n",
        "    description='クローラーを実行',\n",
        "    disabled=False,\n",
        "    button_style='success',\n",
        "    tooltip='クリックしてクローラーを実行',\n",
        "    icon='play'\n",
        ")\n",
        "\n",
        "# 出力エリア\n",
        "output = widgets.Output()\n",
        "\n",
        "# ボタンクリックイベント\n",
        "def on_run_button_clicked(b):\n",
        "    with output:\n",
        "        output.clear_output()\n",
        "        print(\"クローラーを実行中...\")\n",
        "        \n",
        "        # 設定オブジェクトの作成\n",
        "        config = CrawlerConfig(\n",
        "            base_url=url_input.value,\n",
        "            max_pages=max_pages_slider.value,\n",
        "            delay=delay_slider.value,\n",
        "            max_workers=workers_slider.value,\n",
        "            output_dir=output_dir_input.value,\n",
        "            cache_dir=cache_dir_input.value,\n",
        "            discord_webhook=discord_webhook_input.value if discord_webhook_input.value else None,\n",
        "            diff_detection=diff_checkbox.value,\n",
        "            skip_no_changes=skip_no_changes_checkbox.value,\n",
        "            normalize_urls=normalize_urls_checkbox.value,\n",
        "            respect_robots_txt=respect_robots_checkbox.value\n",
        "        )\n",
        "        \n",
        "        # クローラーを実行\n",
        "        markdown_path, pdf_path, diff_path = run_colab_crawler(config)\n",
        "        \n",
        "        # 結果を表示\n",
        "        if markdown_path:\n",
        "            print(f\"\\n処理が完了しました！\")\n",
        "            print(f\"\\nMarkdownファイル: {markdown_path}\")\n",
        "            if pdf_path:\n",
        "                print(f\"PDFファイル: {pdf_path}\")\n",
        "            if diff_path:\n",
        "                print(f\"差分レポート: {diff_path}\")\n",
        "                \n",
        "            # ファイルへのリンクを表示\n",
        "            if os.path.exists(markdown_path):\n",
        "                print(\"\\nファイルをダウンロード:\")\n",
        "                display(FileLink(markdown_path))\n",
        "                if pdf_path and os.path.exists(pdf_path):\n",
        "                    display(FileLink(pdf_path))\n",
        "                if diff_path and os.path.exists(diff_path):\n",
        "                    display(FileLink(diff_path))\n",
        "        else:\n",
        "            print(\"\\nエラーが発生したかクロールをスキップしました。ログを確認してください。\")\n",
        "\n",
        "# ボタンクリックイベントを登録\n",
        "run_button.on_click(on_run_button_clicked)\n",
        "\n",
        "# UIを表示\n",
        "display(widgets.HTML(\"<h3>Webサイトクローラー設定</h3>\"))\n",
        "display(url_input)\n",
        "display(widgets.HBox([max_pages_slider, delay_slider]))\n",
        "display(workers_slider)\n",
        "display(discord_webhook_input)\n",
        "\n",
        "# オプション設定をグループ化\n",
        "display(widgets.HTML(\"<h4>オプション設定</h4>\"))\n",
        "display(widgets.HBox([diff_checkbox, skip_no_changes_checkbox]))\n",
        "display(widgets.HBox([normalize_urls_checkbox, respect_robots_checkbox]))\n",
        "\n",
        "# パス設定\n",
        "display(widgets.HTML(\"<h4>パス設定</h4>\"))\n",
        "display(output_dir_input)\n",
        "display(cache_dir_input)\n",
        "\n",
        "# 実行ボタンと出力エリア\n",
        "display(run_button)\n",
        "display(output)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "manual-run"
      },
      "source": [
        "## 5. 手動でクローラーを実行する（オプション）\n",
        "\n",
        "必要に応じて、以下のセルでクローラーを直接呼び出すこともできます。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "manual-run-code"
      },
      "outputs": [],
      "source": [
        "# 手動実行の例\n",
        "from crawler_components import CrawlerConfig\n",
        "from crawler_advanced import run_colab_crawler\n",
        "\n",
        "# config = CrawlerConfig(\n",
        "#     base_url=\"https://example.com\",\n",
        "#     max_pages=100,\n",
        "#     delay=1.0,\n",
        "#     max_workers=5,\n",
        "#     output_dir=output_dir,\n",
        "#     cache_dir=cache_dir,\n",
        "#     discord_webhook=None,  # ここにWebhook URLを入力\n",
        "#     diff_detection=True,\n",
        "#     skip_no_changes=True\n",
        "# )\n",
        "# \n",
        "# markdown_path, pdf_path, diff_path = run_colab_crawler(config)\n",
        "# \n",
        "# print(\"処理結果:\")\n",
        "# print(f\"Markdown: {markdown_path}\")\n",
        "# print(f\"PDF: {pdf_path}\")\n",
        "# print(f\"差分レポート: {diff_path}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "standalone-script"
      },
      "source": [
        "## 6. スタンドアロンスクリプトの生成\n",
        "\n",
        "このノートブックの内容を通常のPythonスクリプトとして保存し、任意の環境で実行できるようにします。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "generate-script"
      },
      "outputs": [],
      "source": [
        "# スタンドアロンスクリプト生成\n",
        "script_path = os.path.join(crawler_dir, 'crawler_script.py')\n",
        "\n",
        "with open(script_path, 'w', encoding='utf-8') as f:\n",
        "    f.write('''\n",
        "#!/usr/bin/env python3\n",
        "# -*- coding: utf-8 -*-\n",
        "\n",
        "\"\"\"改善版Webサイトクローラー（スタンドアロン版）\n",
        "\n",
        "このスクリプトは、Webサイトをクロールし、Markdownと差分レポートを生成します。\n",
        "Google Colabノートブックから自動生成されたスタンドアロン版です。\n",
        "\"\"\"\n",
        "\n",
        "import os\n",
        "import sys\n",
        "import json\n",
        "import argparse\n",
        "import logging\n",
        "from urllib.parse import urlparse\n",
        "\n",
        "# コンポーネントをインポート\n",
        "from crawler_components import CrawlerConfig\n",
        "from crawler_advanced import run_colab_crawler\n",
        "\n",
        "def parse_args():\n",
        "    \"\"\"コマンドライン引数をパースする\"\"\"\n",
        "    parser = argparse.ArgumentParser(description=\"Webサイトクローラー\")\n",
        "    parser.add_argument(\"-u\", \"--url\", required=True, help=\"クロールするWebサイトのURL\")\n",
        "    parser.add_argument(\"-p\", \"--pages\", type=int, default=100, help=\"クロールする最大ページ数\")\n",
        "    parser.add_argument(\"-d\", \"--delay\", type=float, default=1.0, help=\"リクエスト間の遅延時間（秒）\")\n",
        "    parser.add_argument(\"-w\", \"--workers\", type=int, default=5, help=\"並列ワーカー数\")\n",
        "    parser.add_argument(\"-o\", \"--output\", default=\"output\", help=\"出力ディレクトリ\")\n",
        "    parser.add_argument(\"-c\", \"--cache\", default=\"cache\", help=\"キャッシュディレクトリ\")\n",
        "    parser.add_argument(\"--discord\", help=\"Discord Webhook URL\")\n",
        "    parser.add_argument(\"--no-diff\", action=\"store_true\", help=\"差分検知を無効化\")\n",
        "    parser.add_argument(\"--force\", action=\"store_true\", help=\"変更がなくても出力を生成\")\n",
        "    parser.add_argument(\"--no-normalize\", action=\"store_true\", help=\"URL正規化を無効化\")\n",
        "    parser.add_argument(\"--ignore-robots\", action=\"store_true\", help=\"robots.txtを無視\")\n",
        "    parser.add_argument(\"--config\", help=\"設定JSONファイル\")\n",
        "    return parser.parse_args()\n",
        "\n",
        "def main():\n",
        "    \"\"\"メイン実行関数\"\"\"\n",
        "    args = parse_args()\n",
        "    \n",
        "    # 設定ファイルからの読み込み（優先）\n",
        "    if args.config and os.path.exists(args.config):\n",
        "        try:\n",
        "            config = CrawlerConfig.from_json(args.config)\n",
        "            print(f\"設定を読み込みました: {args.config}\")\n",
        "        except Exception as e:\n",
        "            print(f\"設定ファイル読み込みエラー: {e}\")\n",
        "            return 1\n",
        "    else:\n",
        "        # コマンドライン引数から設定を作成\n",
        "        config = CrawlerConfig(\n",
        "            base_url=args.url,\n",
        "            max_pages=args.pages,\n",
        "            delay=args.delay,\n",
        "            max_workers=args.workers,\n",
        "            output_dir=args.output,\n",
        "            cache_dir=args.cache,\n",
        "            discord_webhook=args.discord,\n",
        "            diff_detection=not args.no_diff,\n",
        "            skip_no_changes=not args.force,\n",
        "            normalize_urls=not args.no_normalize,\n",
        "            respect_robots_txt=not args.ignore_robots\n",
        "        )\n",
        "    \n",
        "    # ディレクトリの作成\n",
        "    os.makedirs(config.output_dir, exist_ok=True)\n",
        "    os.makedirs(config.cache_dir, exist_ok=True)\n",
        "    \n",
        "    # クローラーを実行\n",
        "    try:\n",
        "        markdown_path, pdf_path, diff_path = run_colab_crawler(config)\n",
        "        \n",
        "        if markdown_path:\n",
        "            print(f\"\\n処理が完了しました！\")\n",
        "            print(f\"Markdownファイル: {markdown_path}\")\n",
        "            if pdf_path:\n",
        "                print(f\"PDFファイル: {pdf_path}\")\n",
        "            if diff_path:\n",
        "                print(f\"差分レポート: {diff_path}\")\n",
        "            return 0\n",
        "        else:\n",
        "            print(\"エラーが発生したかクロールをスキップしました。ログを確認してください。\")\n",
        "            return 1\n",
        "            \n",
        "    except KeyboardInterrupt:\n",
        "        print(\"\\nユーザーによって中断されました。\")\n",
        "        return 130\n",
        "    except Exception as e:\n",
        "        print(f\"\\n実行中にエラーが発生しました: {e}\")\n",
        "        return 1\n",
        "\n",
        "if __name__ == \"__main__\":\n",
        "    sys.exit(main())\n",
        "''')\n",
        "\n",
        "print(f\"スタンドアロンスクリプトを生成しました: {script_path}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "schedule-note"
      },
      "source": [
        "## 7. 定期実行について\n",
        "\n",
        "Google Colabでは、セッションが一定時間後に切断されるため、長時間の定期実行には向いていません。\n",
        "定期的なクロールを行いたい場合は、以下の選択肢があります：\n",
        "\n",
        "1. 生成したスタンドアロンスクリプトをローカルマシンで実行\n",
        "2. Google Cloud FunctionsやCloud Runなどのサーバーレスサービスで定期実行\n",
        "3. GitHub ActionsやCircle CIなどのCI/CDサービスを利用して定期実行\n",
        "4. crontabを使用したLinuxサーバー上での定期実行\n",
        "\n",
        "### 定期実行のサンプルコマンド（Linuxのcrontabの例）\n",
        "\n",
        "```bash\n",
        "# 毎日午前3時に実行\n",
        "0 3 * * * /path/to/python /path/to/crawler_script.py --url https://example.com --output /path/to/output\n",
        "```\n",
        "\n",
        "### 定期実行の設定ファイル例（JSON形式）\n",
        "\n",
        "```json\n",
        "{\n",
        "  \"base_url\": \"https://example.com\",\n",
        "  \"max_pages\": 100,\n",
        "  \"delay\": 1.0,\n",
        "  \"max_workers\": 5,\n",
        "  \"output_dir\": \"output\",\n",
        "  \"cache_dir\": \"cache\",\n",
        "  \"discord_webhook\": \"https://discord.com/api/webhooks/your-webhook-url\",\n",
        "  \"diff_detection\": true,\n",
        "  \"skip_no_changes\": true,\n",
        "  \"normalize_urls\": true,\n",
        "  \"respect_robots_txt\": true\n",
        "}\n",
        "```\n",
        "\n",
        "この設定ファイルは `--config` オプションで指定できます：\n",
        "\n",
        "```bash\n",
        "python crawler_script.py --config settings.json\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "website_crawler_improved.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}