feat: Source Intake / Fetch Pipeline v0.1 (URL extraction, fetch, cache, provider integration) #23
Conversation
Walkthrough
This PR implements a source intake pipeline covering URL extraction, normalization, caching, and safety validation, and produces a compacted source candidate list that is injected into the providers. The mock and OpenAI providers are updated to accept the new input shape, and CI adds pnpm version pinning and Prisma migration validation.
Changes
Source intake and fetching pipeline
CI environment and toolchain
Estimated code review effort: 🎯 3 (moderate) | ⏱️ ~25 minutes
Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Warning: the review ran into problems — Git: failed to clone the repository.
Actionable comments posted: 8
🧹 Nitpick comments (4)
src/server/analysis/providers/mock-answer-graph-provider.ts (1)
176-200: ⚡ Quick win
mapSourceCandidates is called twice. Line 176 invokes mapSourceCandidates(normalizedInput) a second time redundantly; store the result in a variable once.
♻️ Suggested fix
+ const mappedSources = mapSourceCandidates(normalizedInput);
  return { kind: "success", payload: { ...
-   sources: mapSourceCandidates(normalizedInput).length > 0 ? mapSourceCandidates(normalizedInput) : [
+   sources: mappedSources.length > 0 ? mappedSources : [
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/analysis/providers/mock-answer-graph-provider.ts` around lines 176 - 200, The code calls mapSourceCandidates(normalizedInput) twice when building the sources array; compute it once into a local variable (e.g., const mapped = mapSourceCandidates(normalizedInput)) before the object construction and then use that variable in the ternary expression for sources to avoid redundant work and ensure consistency when referencing mapped candidates across the sources logic.
src/server/analysis/providers/openai-answer-graph-provider.ts (1)
207-212: 💤 Low value
The return statement's indentation does not match the function body. The return on line 212 sits at column 0, out of line with the rest of buildSourceCandidateContext.
🔧 Suggested fix
-return `\n\nAvailable source candidates:\n${lines.join("\n")}`;
+  return `\n\nAvailable source candidates:\n${lines.join("\n")}`;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/analysis/providers/openai-answer-graph-provider.ts` around lines 207 - 212, The closing return in buildSourceCandidateContext is mis-indented (it's at column 0) causing inconsistent function body formatting; align the return that produces the "Available source candidates" string with the rest of the function block so it sits at the same indentation level as other statements in buildSourceCandidateContext (ensure the `return \n\nAvailable source candidates:\n${lines.join("\n")}` line is indented inside the function scope and matches the indentation of the preceding `return `${index + 1}...` mapping or the function's other statements).
specs/source-intake-and-fetching.md (2)
60-65: ⚡ Quick win — add cache-related test requirements.
The Test requirements section does not mention tests for the cache behavior. Since the source cache strategy (lines 38-43) is a key part of the spec, the following test cases are recommended:
- cache hit / cache miss scenarios
- behavior on TTL expiry (re-fetch triggered by a stale cache)
- correct persistence to the SourceCacheEntry and SourceFetchSnapshot models
- behavior of the TRACEMAP_SOURCE_CACHE_TTL_DAYS environment variable
This ensures the cache strategy is implemented as intended.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/source-intake-and-fetching.md` around lines 60 - 65, Test requirements are missing cache-related cases: add unit/integration tests covering cache hit and cache miss flows for fetch-source, TTL expiry behavior that forces re-fetch from stale cache, persistence and shape checks for the SourceCacheEntry and SourceFetchSnapshot models, and behavior driven by the TRACEMAP_SOURCE_CACHE_TTL_DAYS env var to ensure TTL is applied; reference the fetch-source code paths and the SourceCacheEntry/SourceFetchSnapshot types when implementing these tests so they validate caching, stale/refresh logic, and env-driven TTL changes.
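The TTL/staleness rule those tests would pin down might look like the sketch below; the entry shape and the meaning of a zero TTL are assumptions here, not the actual SourceCacheEntry model.

```typescript
// Assumed shape: only the fetch timestamp matters for staleness here.
interface CacheEntryLike {
  fetchedAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when the entry is older than the configured TTL in days.
function isCacheEntryStale(entry: CacheEntryLike, ttlDays: number, now: Date): boolean {
  if (ttlDays <= 0) return true; // a zero TTL disables caching entirely (assumption)
  return now.getTime() - entry.fetchedAt.getTime() > ttlDays * MS_PER_DAY;
}
```

Tests for cache hit, miss, and TTL expiry would then assert against this predicate at fixed timestamps rather than wall-clock time.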
47-47: ⚡ Quick win — define constraints for the OpenAI provider's source context.
When the OpenAI provider appends the compact source context to the prompt, the spec does not cover:
- the maximum number of sources included in the prompt (and the handling when the token limit would be exceeded)
- priority and ordering when there are multiple sources
- a truncation strategy when the source context does not fit within the token limit
Without these, the implementation may hit unexpected token-limit errors or make poor source selections.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specs/source-intake-and-fetching.md` at line 47, Specify exact constraints for the OpenAI provider when appending compact source context (label/url/content_type/excerpt) to prompts: define a maximum number of sources to include and a token budget per prompt (e.g., max_sources and max_context_tokens), describe the ordering/priority rule for multiple sources (e.g., sort by relevance score then recency using the same ranking codepath that produces source relevance), and add a truncation strategy when the combined source context exceeds the token budget (e.g., drop lowest-priority sources first, then truncate excerpts with an ellipsis preserving label/url/content_type, and include a warning token like “[TRUNCATED]”); document how to compute token usage (using the tokenizer used by OpenAI) and fallback behavior (if even one source excerpt exceeds per-source token limit, truncate the excerpt to per-source limit and still include label/url/content_type).
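One way these constraints could look once pinned down is sketched below; maxSources, maxExcerptChars, and the Candidate shape are illustrative, and the input is assumed to be pre-ranked by priority.

```typescript
interface Candidate {
  label: string;
  url: string;
  contentType: string;
  excerpt: string;
}

// Caps the number of sources and truncates over-long excerpts while always
// preserving label/url/contentType, marking cuts with "[TRUNCATED]".
function buildSourceContext(
  candidates: Candidate[],
  maxSources: number,
  maxExcerptChars: number,
): string {
  return candidates
    .slice(0, maxSources) // drop lowest-priority sources first
    .map((c) => {
      const excerpt =
        c.excerpt.length > maxExcerptChars
          ? `${c.excerpt.slice(0, maxExcerptChars)}… [TRUNCATED]`
          : c.excerpt;
      return `${c.label} <${c.url}> (${c.contentType}): ${excerpt}`;
    })
    .join("\n");
}
```

A character cap is a crude stand-in for a real token budget; a production rule would measure with the model's actual tokenizer.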
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@acceptance/source-intake-and-fetching.feature`:
- Line 27: The acceptance scenario uses a mixed Japanese/English phrase ("And
raw HTML全文 is not passed to provider"); replace it with a consistent English
phrase such as "And raw full HTML is not passed to provider" (or "And the raw
full HTML is not passed to the provider") in the acceptance scenario text, and
mirror the same wording change in the corresponding spec under specs/ so both
specs/ and acceptance/ are kept in sync per the guideline to start from spec and
update both directories.
In `@specs/source-intake-and-fetching.md`:
- Line 36: Model names are used inconsistently in the document. Check the places where it is unclear whether "SourceSnapshot" or "SourceFetchSnapshot" is meant (especially lines 35-36 and the Data model strategy section) and unify them to the correct model name. Search for every occurrence of the names in question; if the two are distinct models, spell out each definition (purpose, fields, storage location); if it is a typo, replace all occurrences with "SourceFetchSnapshot" (or whichever name is correct) for consistency.
- Line 42: The criterion for "fetch success history" in the spec is ambiguous; add a concrete definition to the "fetch only when stale / missing" clause: state that only HTTP status 200-299 counts as success (e.g., the condition satisfied by isFetchSuccess), spell out the retry policy when a 404 or 5xx response has been cached (e.g., a cached 404 is never re-fetched, while a 5xx is retried after a short interval or with a shortened TTL), and define how a present-but-empty body (empty or zero-length) is treated (counted as success, or re-fetched), then record these rules in the spec as the "fetch success history" criteria.
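Under those rules (2xx-only success, empty bodies re-fetched), the predicate this comment names could be sketched as follows; the signature is hypothetical.

```typescript
// Hypothetical sketch of isFetchSuccess: only 2xx responses with a
// non-empty body count as a successful fetch for the cache history.
function isFetchSuccess(status: number, bodyLength: number): boolean {
  if (status < 200 || status > 299) return false; // 404/5xx are failures
  return bodyLength > 0; // empty bodies are treated as re-fetch candidates
}
```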
- Line 13: The spec line "URL 正規化(tracking params 除去、hash 除去、host/protocol
小文字化)" lacks a concrete definition of which query parameters count as "tracking
params"; update the spec to list the exact removal rules by name (e.g.,
utm_source, utm_medium, utm_campaign, utm_term, utm_content, gclid, fbclid, _ga,
_gid, _gat, mc_cid, mc_eid, _openstat, etc.) and/or a pattern-based rule (e.g.,
remove keys matching /^utm_/i or known analytics keys), state whether matching
is case-insensitive, clarify that keys are removed entirely (key and value),
define behavior for repeated keys and ordering, and include a canonical set (or
allow configurable list) referenced in the "URL 正規化" section so implementations
(URL 正規化, tracking params 除去, hash 除去, host/protocol 小文字化) can apply the same
deterministic filter.
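A deterministic filter along those lines could be sketched as below; the key list and the /^utm_/i pattern are the examples from the comment, not the project's actual rule set.

```typescript
// Known analytics keys removed by exact (case-insensitive) match.
const TRACKING_KEYS = new Set([
  "gclid", "fbclid", "_ga", "_gid", "_gat", "mc_cid", "mc_eid", "_openstat",
]);

function normalizeSourceUrl(raw: string): string {
  const url = new URL(raw); // WHATWG parsing already lowercases host and protocol
  url.hash = ""; // hash removal
  // Remove key and value for every tracking parameter; delete() drops
  // repeated keys too, keeping the remaining parameter order stable.
  for (const key of [...url.searchParams.keys()]) {
    if (/^utm_/i.test(key) || TRACKING_KEYS.has(key.toLowerCase())) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}
```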
- Line 40: Handling of TRACEMAP_SOURCE_CACHE_TTL_DAYS is undefined; add "validation rules and fallback" to the spec: the value must be an integer with a minimum of 0 (0 days disables the cache) and a maximum of, say, 3650 (a 10-year cap); for non-numeric, fractional, negative, or over-limit values, fall back to a default constant (e.g., DEFAULT_TRACEMAP_SOURCE_CACHE_TTL_DAYS) and emit a warning log. Also note that this validation belongs where the setting is read (the environment variable parsing / configuration initialization logic).
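A sketch of that validation follows; the bounds come from the comment's examples, and the default value of 7 is purely an assumption.

```typescript
const DEFAULT_TRACEMAP_SOURCE_CACHE_TTL_DAYS = 7; // assumed default, not from the spec

// Parses the env value; integers in [0, 3650] pass through, everything
// else (unset, empty, non-numeric, fractional, negative, too large)
// falls back to the default with a warning.
function parseCacheTtlDays(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_TRACEMAP_SOURCE_CACHE_TTL_DAYS;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0 || value > 3650) {
    console.warn(`Invalid TRACEMAP_SOURCE_CACHE_TTL_DAYS=${raw}; using default`);
    return DEFAULT_TRACEMAP_SOURCE_CACHE_TTL_DAYS;
  }
  return value;
}
```

Note the explicit empty-string guard: Number("") is 0, so without it an empty env var would silently disable the cache.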
- Around lines 54-59: The Security constraints lack validation rules for HTTP redirect targets. Add to the spec: a redirect policy (whether redirects are followed at all), a requirement that every redirect target passes the same Security constraints validation, an explicit maximum redirect count (e.g., 5), and DNS rebinding countermeasures. Concretely, add to the Security constraints section: when following redirects, verify for each hop URL that the scheme and the resolved IP are not on the block list (localhost / private / link-local / metadata, etc.) and abort the redirect if validation fails; re-resolve DNS on every redirect and re-check the resulting IP; set a maximum redirect count (e.g., 5); and, where possible, add a consistency check between the hostname and the final resolved IP, or a requirement to use public DNS.
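The per-hop IP check could be sketched as a pure predicate like this; the range list mirrors the comment's examples (loopback, private, link-local/metadata) and is not exhaustive.

```typescript
// Returns true when an IPv4 address must not be fetched; unparseable
// input is rejected outright rather than allowed through.
function isBlockedIpv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return true;
  }
  const [a, b] = parts;
  return (
    a === 0 ||
    a === 127 || // loopback
    a === 10 || // private 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // private 172.16.0.0/12
    (a === 192 && b === 168) || // private 192.168.0.0/16
    (a === 169 && b === 254) // link-local, incl. the 169.254.169.254 metadata IP
  );
}
```

To address DNS rebinding, this check must run on the freshly re-resolved address at every redirect hop, not just on the first request.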
In `@src/server/analysis/create-analysis-run-from-provider.ts`:
- Around lines 65-69: buildSourceIntakeFromQuestion sits outside the try/catch, so a failure in resolveSourceCacheForUrl (or similar) can propagate upward and leave analysis_runs stuck in "processing". Wrap the buildSourceIntakeFromQuestion call in a try/catch; on failure, treat source intake as auxiliary and fall back to a default SourceIntakeResult with empty candidates (e.g., { candidates: [] }) so that processing continues and provider.generateAnswerGraph({ question, sourceCandidates: sourceIntake.candidates }) is always executed. Also import the required type/name (SourceIntakeResult).
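Sketched synchronously for brevity (the real intake call is awaited), the degradation path could look like this; the builder parameter stands in for the real buildSourceIntakeFromQuestion.

```typescript
interface SourceIntakeResult {
  candidates: string[];
}

// Wraps a (possibly throwing) intake builder so a cache or fetch failure
// degrades to an empty candidate list instead of aborting the run.
function buildSourceIntakeSafely(
  build: (question: string) => SourceIntakeResult,
  question: string,
): SourceIntakeResult {
  try {
    return build(question);
  } catch {
    // Source intake is auxiliary; the provider call must still happen.
    return { candidates: [] };
  }
}
```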
In `@src/server/analysis/source-intake/source-intake-service.ts`:
- Around line 11-16: The call to resolveSourceCacheForUrl inside the rawUrls
loop can throw and will abort buildSourceIntakeFromQuestion; wrap each call to
resolveSourceCacheForUrl(rawUrl) in a try/catch inside the loop (the loop that
builds ignoredUrls) so failures for a single URL are caught, push an entry into
ignoredUrls with { url: rawUrl, reason: (error.message || String(error)) } on
catch, and continue to the next rawUrl; keep the existing handling for
result.kind === "invalid" unchanged.
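The per-URL isolation this comment describes can be sketched as below; resolve() stands in for resolveSourceCacheForUrl, and the result shape is simplified.

```typescript
interface IgnoredUrl {
  url: string;
  reason: string;
}

type ResolveResult = { kind: "ok"; value: string } | { kind: "invalid" };

// Resolves each URL independently: a throw for one URL is recorded in
// ignoredUrls and the loop continues with the next URL.
function collectCandidates(
  rawUrls: string[],
  resolve: (url: string) => ResolveResult,
): { candidates: string[]; ignoredUrls: IgnoredUrl[] } {
  const candidates: string[] = [];
  const ignoredUrls: IgnoredUrl[] = [];
  for (const rawUrl of rawUrls) {
    try {
      const result = resolve(rawUrl);
      if (result.kind === "invalid") {
        ignoredUrls.push({ url: rawUrl, reason: "invalid" }); // existing handling, unchanged
        continue;
      }
      candidates.push(result.value);
    } catch (error) {
      ignoredUrls.push({
        url: rawUrl,
        reason: error instanceof Error ? error.message : String(error),
      });
    }
  }
  return { candidates, ignoredUrls };
}
```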
---
Nitpick comments:
In `@specs/source-intake-and-fetching.md`:
- Around line 60-65: Test requirements are missing cache-related cases: add
unit/integration tests covering cache hit and cache miss flows for fetch-source,
TTL expiry behavior that forces re-fetch from stale cache, persistence and shape
checks for the SourceCacheEntry and SourceFetchSnapshot models, and behavior
driven by the TRACEMAP_SOURCE_CACHE_TTL_DAYS env var to ensure TTL is applied;
reference the fetch-source code paths and the
SourceCacheEntry/SourceFetchSnapshot types when implementing these tests so they
validate caching, stale/refresh logic, and env-driven TTL changes.
- Line 47: Specify exact constraints for the OpenAI provider when appending
compact source context (label/url/content_type/excerpt) to prompts: define a
maximum number of sources to include and a token budget per prompt (e.g.,
max_sources and max_context_tokens), describe the ordering/priority rule for
multiple sources (e.g., sort by relevance score then recency using the same
ranking codepath that produces source relevance), and add a truncation strategy
when the combined source context exceeds the token budget (e.g., drop
lowest-priority sources first, then truncate excerpts with an ellipsis
preserving label/url/content_type, and include a warning token like
“[TRUNCATED]”); document how to compute token usage (using the tokenizer used by
OpenAI) and fallback behavior (if even one source excerpt exceeds per-source
token limit, truncate the excerpt to per-source limit and still include
label/url/content_type).
In `@src/server/analysis/providers/mock-answer-graph-provider.ts`:
- Around line 176-200: The code calls mapSourceCandidates(normalizedInput) twice
when building the sources array; compute it once into a local variable (e.g.,
const mapped = mapSourceCandidates(normalizedInput)) before the object
construction and then use that variable in the ternary expression for sources to
avoid redundant work and ensure consistency when referencing mapped candidates
across the sources logic.
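A minimal sketch of that single-call refactor, using stand-in types since the real provider input and mapping helper are not shown here.

```typescript
// Stand-ins for the provider's real types; only the call pattern matters.
type SourceCandidate = { label: string; url: string };
type NormalizedInput = { urls: string[] };

// Hypothetical mapping helper in the spirit of mapSourceCandidates.
function mapSourceCandidates(input: NormalizedInput): SourceCandidate[] {
  return input.urls.map((url, i) => ({ label: `source-${i + 1}`, url }));
}

const FALLBACK_SOURCES: SourceCandidate[] = [
  { label: "mock-source", url: "https://example.com/mock" },
];

function buildSources(normalizedInput: NormalizedInput): SourceCandidate[] {
  // Compute once; reuse in both the emptiness check and the result.
  const mappedSources = mapSourceCandidates(normalizedInput);
  return mappedSources.length > 0 ? mappedSources : FALLBACK_SOURCES;
}
```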
In `@src/server/analysis/providers/openai-answer-graph-provider.ts`:
- Around line 207-212: The closing return in buildSourceCandidateContext is
mis-indented (it's at column 0) causing inconsistent function body formatting;
align the return that produces the "Available source candidates" string with the
rest of the function block so it sits at the same indentation level as other
statements in buildSourceCandidateContext (ensure the `return \n\nAvailable
source candidates:\n${lines.join("\n")}` line is indented inside the function
scope and matches the indentation of the preceding `return `${index + 1}...`
mapping or the function's other statements).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 34e41816-1526-41b5-9f0c-142ff2fe2c16
📒 Files selected for processing (17)
.github/workflows/ci.yml
acceptance/README.md
acceptance/source-intake-and-fetching.feature
specs/README.md
specs/source-intake-and-fetching.md
src/server/analysis/create-analysis-run-from-provider.ts
src/server/analysis/fetch-source-snapshot.ts
src/server/analysis/providers/mock-answer-graph-provider.ts
src/server/analysis/providers/openai-answer-graph-provider.ts
src/server/analysis/source-intake/extract-urls.ts
src/server/analysis/source-intake/source-intake-service.ts
src/types/answer-graph-generation.ts
src/types/source-intake.ts
tests/create-mock-run.test.ts
tests/mock-answer-graph-provider.test.ts
tests/openai-answer-graph-provider.test.ts
tests/source-intake.test.ts
Motivation
SourceCacheEntry / SourceFetchSnapshot models without DB migrations.
Description
- specs/source-intake-and-fetching.md and acceptance/source-intake-and-fetching.feature, and update specs/README.md and acceptance/README.md.
- src/types/source-intake.ts, src/server/analysis/source-intake/extract-urls.ts, and src/server/analysis/source-intake/source-intake-service.ts that extract http/https URLs, trim trailing punctuation, normalize/validate, dedupe by normalized URL, call resolveSourceCacheForUrl, and build SourceCandidate[].
- buildSourceIntakeFromQuestion(question) in createAnalysisRunFromProvider and pass sourceCandidates via the non-breaking addition GenerateAnswerGraphInput.sourceCandidates?: SourceCandidate[].
- sourceCandidates as sources when present; the OpenAI provider injects a compact source candidate context into the user prompt (label/url/content_type/excerpt) while preserving the structured output schema.
- fetch-source-snapshot.ts now strips noscript as well as script/style and raises the excerpt cap to 2000 chars; fetch behavior, safety checks, snapshot creation, and SourceCacheEntry.latest* updates reuse existing code paths.
- SourceCacheEntry / SourceFetchSnapshot models.
Testing
- pnpm lint (passed) and pnpm typecheck (passed) to ensure code quality and type safety.
- pnpm test where all tests passed (142 passed, 0 failed).
- pnpm build which completed successfully; pnpm exec prisma validate failed in this CI run due to a missing DATABASE_URL environment variable (local/CI env constraint), not schema issues.
- sourceCandidates and the OpenAI provider prompt includes compact source context in tests added/updated under tests/.
Codex Task
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests
Chores