Add USJ book parsing pipeline and interlinearizer display WebView #30
alex-rawlings-yyc merged 32 commits into main from
Conversation
- Add usjBookExtractor and bookTokenizer parsers that transform a Platform.Bible USJ book into the interlinear Book/Segment/Token model
- Replace placeholder WebView with SegmentView showing tokenized word and punctuation chips
- Add BookChapterControl verse picker wired to the scroll-group scrRef
- Add unit tests for both parsers and the full WebView component (100% coverage); fix mock signatures for `useLocalizedStrings` (tuple) and `useRecentScriptureRefs` (object)
- Add Jest mocks for lucide-react and platform-bible-react to unblock ESM imports in test environment
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough
Refactors Jest mocks and mappings; adds USJ→RawBook extractor and Unicode-aware tokenizer; updates web view to render tokenized segments and controls with stable scrRef handling; expands parser/tokenizer tests; changes PT9 parser validation to throw SyntaxError; updates docs, config, and JSDoc.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant BookCtrl as BookChapterControl
    participant WebView as InterlinearizerWebView
    participant Hooks as ProjectDataHooks
    participant Extractor as extractBookFromUsj
    participant Tokenizer as tokenizeBook
    participant SegmentView
    User->>BookCtrl: submit scrRef
    BookCtrl->>WebView: setScrRef(newRef)
    WebView->>Hooks: fetch USJ & writingSystem
    Hooks-->>WebView: usj, writingSystem
    WebView->>Extractor: extractBookFromUsj(usj, writingSystem)
    Extractor-->>WebView: RawBook
    WebView->>Tokenizer: tokenizeBook(RawBook)
    Tokenizer-->>WebView: Book (segments & tokens)
    WebView->>WebView: filter segments by book/chapter
    WebView->>SegmentView: render segments (clickable)
    SegmentView-->>WebView: onSegmentClick(segmentRef)
    WebView->>Hooks: setScrRef(segmentRef) / addRecentScriptureRef
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
…o deps
- Flush open verse at chapter boundaries and traverse chapter/verse inline content
- Add inter-paragraph spacing in usjBookExtractor
- Use stableStringify for deterministic content hashing
- Fix bookTokenizer to test full surfaceText (not first code unit) so astral-plane letters tokenize as words
- Narrow scrRef memo deps to primitives to avoid stale re-renders
- Add aria-current to active segment button
- Upgrade cn mock to handle object-form class arguments
…le-dependence
- bookTokenizer.test: extend Unicode test to assert '.' is punctuation; add combining-mark (n\u0303) test covering the \p{M} TOKEN_RE branch
- usjBookExtractor: guard para space-append with endsWith(' ') to prevent double spaces across consecutive empty paragraphs
- usjBookExtractor: replace localeCompare with code-point sort in stableStringify so FNV-1a input is deterministic regardless of engine locale
- interlinearizer.web-view: log parse/tokenize errors via papi logger instead of silently swallowing them
- Expand JSDoc on stableStringify and fnv1a32
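The `stableStringify`/`fnv1a32` pair referenced in these commits isn't shown in this thread. As a rough sketch of what a 32-bit FNV-1a over a string might look like (hypothetical; the actual `fnv1a32` in `usjBookExtractor` may differ):

```typescript
// Sketch of a 32-bit FNV-1a hash over a string's UTF-16 code units.
// Hypothetical implementation; the extension's fnv1a32 may differ.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep uint32
  }
  return hash;
}

console.log(fnv1a32('a').toString(16)); // 'e40c292c' — standard FNV-1a test vector
```

Because the hash is computed over a deterministically serialized string, two clients hashing the same source text get the same value regardless of locale.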
alex-rawlings-yyc
left a comment
@alex-rawlings-yyc partially reviewed 8 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on imnasnainaec).
src/parsers/papi/bookTokenizer.ts line 12 at r21 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
I asked Gemini about how to only write the LNM set once, and it also suggested including Join_Control (for the sake of Arabic, Farsi, and Indic scripts) and replacing `u` with `v`, à la:

```ts
const CHAR_SET = '\\p{L}\\p{N}\\p{M}\\p{Join_Control}';
const TOKEN_RE = new RegExp(`[${CHAR_SET}]+|[^${CHAR_SET}\\s]`, 'gv');
const WORD_START_RE = new RegExp(`[${CHAR_SET}]`, 'v');
```

Made the change. Devin suggested that apostrophes and dashes be included in tokens rather than splitting on them, and I tend to agree, so I made that change too. Let me know if you can think of any reason to split words along those characters.
src/parsers/papi/usjBookExtractor.ts line 227 at r21 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
⛏️ How about `.sort(([a], [b]) => a.localeCompare(b))`?

Using localeCompare() runs the risk of producing different orderings for people in different regions. Hash generation and comparison is done client-side, so I'm concerned that a translator and a consultant may end up getting different hashes for the same source text. I'm not totally sure what sorts of locale differences the method produces. For now, replacing with the boolean-subtraction pattern to allow returning 0 without a nested ternary.
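The boolean-subtraction pattern mentioned here can be sketched as a standalone snippet (hypothetical; `stableStringify` itself isn't shown in this thread):

```typescript
// Code-point ordering via boolean subtraction: -1 / 0 / +1 without a nested
// ternary, and deterministic across engine locales, unlike localeCompare.
const entries: [string, number][] = [
  ['zeta', 1],
  ['alpha', 2],
  ['Beta', 3], // uppercase sorts before lowercase in code-point order
];
entries.sort(([a], [b]) => Number(a > b) - Number(a < b));
console.log(entries.map(([key]) => key)); // [ 'Beta', 'alpha', 'zeta' ]
```

Since `<`/`>` compare strings by UTF-16 code units, the resulting order (and therefore any hash derived from the serialized output) is the same on every client.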
…act`, add clarifying comment to method result to ensure it isn't flagged as problematic in the future
- Previously only covered parsers, main.ts, and the WebView component.
- Now includes any new source files without manual config updates.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main.ts (1)
90-96: ⚠️ Potential issue | 🟡 Minor

Avoid prompting the project picker when `webViewId` is stale/invalid.

If Line 95 returns no definition (or no `projectId`), Line 96 currently falls back to `openInterlinearizer(undefined)`, which opens the picker unexpectedly in a context-menu flow.

Suggested fix

```diff
 async function openInterlinearizerForWebView(webViewId?: string): Promise<string | undefined> {
   if (!webViewId) return openInterlinearizer();
   const webViewDefinition = await papi.webViews.getOpenWebViewDefinition(webViewId);
-  return openInterlinearizer(webViewDefinition?.projectId);
+  if (!webViewDefinition?.projectId) {
+    logger.warn(
+      `Could not resolve projectId from webViewId "${webViewId}" while opening Interlinearizer`,
+    );
+    return undefined;
+  }
+  return openInterlinearizer(webViewDefinition.projectId);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.ts` around lines 90 - 96, openInterlinearizerForWebView currently calls openInterlinearizer(undefined) when a provided webViewId resolves to no definition or missing projectId, which triggers the project picker unexpectedly; update the function so that after awaiting papi.webViews.getOpenWebViewDefinition(webViewId) it checks that webViewDefinition and webViewDefinition.projectId exist and if not returns undefined (do not call openInterlinearizer), otherwise call openInterlinearizer(webViewDefinition.projectId); the change should be applied within openInterlinearizerForWebView around the papi.webViews.getOpenWebViewDefinition and openInterlinearizer call sites.
🧹 Nitpick comments (1)
src/parsers/pt9/interlinearXmlParser.ts (1)
224-227: Clarify error text for invalid numeric attributes.

At Line 227, the thrown message says attributes are “missing”, but this branch also triggers for present-but-invalid values (e.g., non-integer strings). Making the message explicit will reduce debugging ambiguity.

Suggested patch

```diff
-  if (index === undefined || length === undefined) {
-    throw new SyntaxError('Invalid XML: Range missing required Index or Length attributes');
-  }
+  if (index === undefined || length === undefined) {
+    throw new SyntaxError(
+      'Invalid XML: Range missing or invalid Index/Length attributes (must be non-negative integers)',
+    );
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parsers/pt9/interlinearXmlParser.ts` around lines 224 - 227, Update the thrown SyntaxError in the range parsing branch to indicate invalid numeric values (not just missing attributes): when parseStrictNumber returns undefined for rangeElement['@_Index'] or rangeElement['@_Length'] in interlinearXmlParser.ts (the logic using parseStrictNumber for Index and Length), throw a clearer message such as "Invalid XML: Range has missing or non-numeric Index/Length attributes" so callers know the attributes may be present but malformed; keep this change localized to the code that validates index and length in the range parsing logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 32ea5924-5506-42c5-b24b-4f325a424b0e
📒 Files selected for processing (4)
- `__mocks__/papi-frontend-react.ts`
- `jest.config.ts`
- `src/main.ts`
- `src/parsers/pt9/interlinearXmlParser.ts`
🚧 Files skipped from review as they are similar to previous changes (1)
- jest.config.ts
Co-authored-by: alex-rawlings-yyc <alex.rawlings@wycliffe.ca>
🧹 Nitpick comments (2)
src/parsers/papi/bookTokenizer.ts (2)
116-124: Consider enforcing `sid`/`bookCode` consistency here.

Right now a bad `RawBook` can yield `bookRef: rawBook.bookCode` while `startRef.book`/`endRef.book` come from a different SID. Guarding that invariant would keep the returned `Book` internally consistent.

Diff

```diff
 export function tokenizeBook(rawBook: RawBook): Book {
   const segments: Segment[] = rawBook.verses.map(({ sid, text }) => {
     const ref = parseSid(sid);
+    if (ref.book !== rawBook.bookCode) {
+      throw new SyntaxError(
+        `Verse SID "${sid}" does not match book code "${rawBook.bookCode}"`,
+      );
+    }
     return {
       id: sid,
       startRef: { ...ref },
       endRef: { ...ref },
       baselineText: text,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parsers/papi/bookTokenizer.ts` around lines 116 - 124, The segments construction can produce inconsistent Book data when a verse's parsed SID book (from parseSid(sid)) differs from rawBook.bookCode; update the logic in the segments creation (where rawBook.verses are mapped and parseSid, sid, startRef, endRef are used) to enforce consistency by either validating that parsedRef.book === rawBook.bookCode and throwing/logging an error on mismatch, or by overriding parsedRef.book to rawBook.bookCode when building startRef/endRef so the returned Book and its segments share the same book code; pick one approach and apply it consistently in the map that builds segments.
91-98: Use `Array.from(..., mapper)` to skip the extra allocation.

This path runs for every verse, so it’s worth avoiding the intermediate array of matches.

Diff

```diff
 function tokenizeVerse(text: string, sid: string, writingSystem: string): Token[] {
-  return Array.from(text.matchAll(TOKEN_RE)).map((match) => {
+  return Array.from(text.matchAll(TOKEN_RE), (match) => {
     const surfaceText = match[0];
     const charStart = match.index;
     const charEnd = charStart + surfaceText.length;
     const type: TokenType = WORD_CONTAIN_RE.test(surfaceText) ? 'word' : 'punctuation';
     return { id: `${sid}:${charStart}`, surfaceText, writingSystem, type, charStart, charEnd };
   });
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parsers/papi/bookTokenizer.ts` around lines 91 - 98, tokenizeVerse currently calls Array.from(text.matchAll(TOKEN_RE)).map(...) which creates an intermediate array; change it to use Array.from with the optional mapper: Array.from(text.matchAll(TOKEN_RE), match => ...) to produce tokens in one allocation, keeping the same fields (surfaceText from match[0], charStart from match.index, charEnd = charStart + surfaceText.length, type using WORD_CONTAIN_RE.test(surfaceText), id `${sid}:${charStart}`, and include writingSystem). Ensure you handle the possibility of match.index being null (use a non-null assertion or safe guard) when computing charStart and charEnd.
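The allocation difference this comment describes can be seen in a standalone sketch (not the extension's actual `tokenizeVerse`; the regex and token shape here are simplified for illustration):

```typescript
// Array.from with a mapper consumes the matchAll iterator directly,
// skipping the intermediate array that .map() would require.
const toTokens = (text: string) =>
  Array.from(text.matchAll(/\S+/g), (m) => ({ surfaceText: m[0], charStart: m.index }));

console.log(toTokens('in the beginning'));
// [ { surfaceText: 'in', charStart: 0 },
//   { surfaceText: 'the', charStart: 3 },
//   { surfaceText: 'beginning', charStart: 7 } ]
```

Both forms produce identical output; the two-argument `Array.from` just maps each match as the iterator yields it.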
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 915905cb-3c0c-4f9d-89a2-99ffc8bcf11b
📒 Files selected for processing (1)
src/parsers/papi/bookTokenizer.ts
… error handling when attempting to open Interlinearizer WebView when `projectId` can't be resolved from `webViewId`
…ct picker for when `undefined` is returned, add `user-event` package for better react testing, other minor adjustments
alex-rawlings-yyc
left a comment
@alex-rawlings-yyc partially reviewed 9 files and all commit messages, made 2 comments, and resolved 1 discussion.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on imnasnainaec).
src/__tests__/interlinearizer.web-view.test.tsx line 352 at r21 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
⛏️ I'm under the impression that using

`import userEvent from '@testing-library/user-event';` […] `await userEvent.click(`

is generally recommended as more closely mimicking user interaction than `fireEvent.click(`. Though doing so would require:

- `npm i -D @testing-library/user-event`
- Making these two test functions `async`
I think this is a worthy change to make.
src/parsers/pt9/interlinearXmlParser.ts line 170 at r27 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
❓ Does this intentionally exclude numbers like `1.` and `2.0`?

Yes and no. I hadn't considered numbers like that, but PT9 shouldn't generate XML with numbers like that anyway, afaik. Do we want to account for edge cases like `1.` and `2.0`?
alex-rawlings-yyc
left a comment
@alex-rawlings-yyc made 1 comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on imnasnainaec).
src/parsers/papi/bookTokenizer.ts line 26 at r27 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
On reflection, why is this only for word-initial? E.g., Hebrew words can end with aleph.
Adjusted so these characters are always captured as part of tokens. Though do we want to account for single quotes being used simply as quotation marks? I think this is a question that can be resolved later, unless we already know how glottal characters and quotation marks are generally handled in Platform.Bible projects.
alex-rawlings-yyc
left a comment
@alex-rawlings-yyc reviewed 1 file and all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on imnasnainaec).
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/__tests__/interlinearizer.web-view.test.tsx`:
- Around line 177-192: Tests leak mock call history because module mocks are
only re-mocked, not cleared; add a call to jest.clearAllMocks() at the start of
the beforeEach so call history for the long-lived mocks (mockBookData,
mockWritingSystem, jest.mocked(useLocalizedStrings),
jest.mocked(useRecentScriptureRefs), jest.mocked(extractBookFromUsj),
jest.mocked(tokenizeBook)) is reset before each test run.
In `@src/parsers/papi/bookTokenizer.ts`:
- Around line 17-18: CHAR_SET currently includes \p{Join_Control}, which causes
isolated joiner characters to be tokenized as words; remove \p{Join_Control}
from CHAR_SET and instead allow join controls only as internal separators in the
word regex: update WORD_CONTAIN_RE to require at least one letter/number/mark
(from CHAR_SET without join controls) and permit \u200C/\u200D between those
characters (e.g., allow join-control sequences between valid codepoints), and
apply the same change where CHAR_SET/WORD_CONTAIN_RE are reused (the other
occurrence around the second reference).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88618db8-ffbe-4841-b1fb-4624d976f46c
⛔ Files ignored due to path filters (1)
`package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (7)
- `package.json`
- `src/__tests__/interlinearizer.web-view.test.tsx`
- `src/__tests__/parsers/papi/bookTokenizer.test.ts`
- `src/__tests__/parsers/pt9/interlinearXmlParser.test.ts`
- `src/interlinearizer.web-view.tsx`
- `src/parsers/papi/bookTokenizer.ts`
- `src/parsers/pt9/interlinearXmlParser.ts`
✅ Files skipped from review due to trivial changes (1)
- package.json
🚧 Files skipped from review as they are similar to previous changes (3)
- `src/__tests__/parsers/papi/bookTokenizer.test.ts`
- src/parsers/pt9/interlinearXmlParser.ts
- `src/__tests__/parsers/pt9/interlinearXmlParser.test.ts`
imnasnainaec
left a comment
@imnasnainaec partially reviewed 9 files and all commit messages, made 1 comment, and resolved 3 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on alex-rawlings-yyc).
src/parsers/pt9/interlinearXmlParser.ts line 170 at r27 (raw file):
Previously, alex-rawlings-yyc (Alex Rawlings) wrote…
Yes and no. I hadn't considered numbers like that, but PT9 shouldn't generate XML with numbers like that anyway, afaik. Do we want to account for edge cases like `1.` and `2.0`?
Not necessary until it is.
alex-rawlings-yyc
left a comment
@alex-rawlings-yyc reviewed all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on alex-rawlings-yyc).
alex-rawlings-yyc
left a comment
@alex-rawlings-yyc reviewed 1 file.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on alex-rawlings-yyc).
jasonleenaylor
left a comment
@jasonleenaylor partially reviewed 4 files and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on alex-rawlings-yyc).