Cody: Handle inline code blocks and use marked to lex markdown #51576

philipp-spiess · 2023-05-08T15:02:42Z

We found more issues where the naive markdown parser would not work well, including inline code blocks. To fix this, this PR uses the marked lexer to parse Markdown instead of doing our own thing. This comes with some other issues, though:

For some reason, tilde-fenced blocks do not work if we use the inline lexer on the whole text. Instead, we have to use a combination of the normal lexer (for root blocks) and the inline lexer for detecting nested layouts like inline code blocks. We only use the inline lexer for escaping and not for the hallucination detection, which means that inline code blocks can have hallucination spans. From my testing, this actually seems desired.
The marked lexer produces different results for documents that aren't fully streamed yet. In general, we already have some issues where applying the markdown formatter for every token that is streamed in can cause layout shifts. I observed another issue where a top-line code block was not detected as one by the lexer if the ending fence was missing. What's odd is that marked's renderer does seem to parse the markdown in this case. This resulted in a very noticeable layout shift from >html ... to <html ... which was super annoying. To fix this, we now skip HTML escaping while the message is in flight. This means that if you let Cody generate some HTML (without using markdown), you'd see it rendered as HTML first, until the full message is received and we can properly escape it.

IMO this is still desired, as there is no harm in rendering the HTML (especially when we do more aggressive deny-listing here). However, you still want the HTML tags shown to you at the end because, after all, we generate code, and sometimes Cody does not use markdown properly.
There are some other changes to the hallucination detection that I'll mark inline.
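
The escaping strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchToken`, `escapeHTML`, and `escapeOutsideCode` are hypothetical names, and the token list is written by hand in the `{ type, raw }` shape that marked's lexer tokens carry instead of being produced by calling marked. It also assumes that escaping should apply only outside code, which is how the description reads.

```typescript
// Hypothetical token shape; marked's lexer tokens expose `type` and `raw` among other fields.
interface SketchToken {
    type: string
    raw: string
}

// Assumed helper: escape the characters that would otherwise be read as HTML.
function escapeHTML(text: string): string {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

// Leave fenced blocks ('code') and inline code ('codespan') untouched; escape everything else.
function escapeOutsideCode(tokens: SketchToken[]): string {
    return tokens
        .map(token => (token.type === 'code' || token.type === 'codespan' ? token.raw : escapeHTML(token.raw)))
        .join('')
}

// Hand-written sample tokens standing in for lexer output.
const tokens: SketchToken[] = [
    { type: 'text', raw: 'Use <b> or ' },
    { type: 'codespan', raw: '`<strong>`' },
    { type: 'text', raw: ' here.' },
]

console.log(escapeOutsideCode(tokens))
```

With these sample tokens, the plain-text `<b>` is escaped while the backtick-wrapped `<strong>` is passed through unchanged for the markdown renderer to handle.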

Test plan

Did a bunch of different prompts to ensure it kinda works. Also piped the test cases through the right functions and added a case for inline code snippets.

sourcegraph-bot · 2023-05-08T15:05:11Z

Codenotify: Notifying subscribers in OWNERS files for diff 5eeb210...e257ca5.

No notifications.

philipp-spiess · 2023-05-08T15:03:45Z

client/cody-shared/src/hallucinations-detector/index.ts

 }

 function highlightLine(line: string, tokens: HighlightedToken[]): string {
    let highlightedLine = line
    for (const token of tokens) {
-        highlightedLine = highlightedLine.replaceAll(token.outerValue, getHighlightedTokenHTML(token))
+        highlightedLine = highlightedLine.replaceAll(token.innerValue, getHighlightedTokenHTML(token))


@novoselrok Do you have some context on why we used the outerValue here? It was screwing up formatting in some cases, e.g. the boundary could start with a comma (,), which would then be put inside the highlighted line for some reason.

I think it was because the Markdown backticks styling was messing up something with the highlights underline. If everything works and looks correct, then go ahead with your changes.

Cool I can test specifically regarding backticks to make sure, ty!

Yeah this is still an issue, reverting.
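
To illustrate the comma problem raised in this thread, here is a small sketch. The `HighlightedToken` shape is reduced to two fields, `wrap` is a hypothetical stand-in for `getHighlightedTokenHTML`, and the sample line is invented; the real helper may build its markup differently.

```typescript
// Reduced, assumed shape: only the two fields discussed in the thread.
interface HighlightedToken {
    innerValue: string // the token itself, e.g. a file path
    outerValue: string // the token plus its surrounding boundary characters
}

// Hypothetical stand-in for getHighlightedTokenHTML.
function wrap(text: string): string {
    return `<span class="highlight">${text}</span>`
}

// Literal replace-all via split/join, equivalent to replaceAll for plain strings.
function replaceEvery(line: string, search: string, replacement: string): string {
    return line.split(search).join(replacement)
}

const line = 'See main.go, util.go'
const token: HighlightedToken = { innerValue: 'util.go', outerValue: ', util.go' }

// Replacing the outer value pulls the boundary (", ") into the highlight.
const viaOuter = replaceEvery(line, token.outerValue, wrap(token.outerValue))

// Replacing the inner value highlights only the token.
const viaInner = replaceEvery(line, token.innerValue, wrap(token.innerValue))

console.log(viaOuter)
console.log(viaInner)
```

Per the last comment in the thread, the switch to `innerValue` was reverted because of the backtick styling issue, so this only shows why the change was tempting.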

sourcegraph-buildkite · 2023-05-08T15:09:57Z

Bundle size report 📦

Initial size	Total size	Async size	Modules
0.00% (0.00 kb)	0.00% (+0.15 kb)	0.00% (+0.15 kb)	0.00% (0)

Look at the Statoscope report for a full comparison between the commits e257ca5 and 5eeb210 or learn more.

Open explanation

Initial size is the size of the initial bundle (the one that is loaded when you open the page)
Total size is the size of the initial bundle + all the async loaded chunks
Async size is the size of all the async loaded chunks
Modules is the number of modules in the initial bundle

sourcegraph-bot · 2023-05-08T15:10:43Z

📖 Storybook live preview

mrnugget

Thank you!


mrnugget · 2023-05-09T08:09:18Z

client/cody-shared/src/chat/markdown.ts

+    if (isStreaming) {
+        return markdown
+    }


Would we need to write our own parser if we want to do what ChatGPT does and stream into a code block? That's what I mean:

No, that's the odd thing: it already visually renders a code block like this, but when I use the lexer rather than the render feature, it does not detect the in-progress code block as such. That's so odd :(

This PR visually works the way you describe, but it does so by not escaping HTML while the message is still in flight. So it might render a <h1>Title</h1> as a title and then shift to showing the <h1> tags when done. I found this to be better (and less common) than the inverse, where the encoding inside the code block is wrong.
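
A minimal sketch of the trade-off discussed in this thread, assuming a hypothetical `prepareMarkdown` wrapper and `escapeHTML` helper. Only the `isStreaming` early return comes from the diff above; everything else is invented for illustration.

```typescript
// Assumed helper, not taken from the PR.
function escapeHTML(text: string): string {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
}

// Hypothetical wrapper around the guard shown in the diff.
function prepareMarkdown(markdown: string, isStreaming: boolean): string {
    if (isStreaming) {
        // Message still in flight: return it as is, so raw HTML may briefly render as markup.
        return markdown
    }
    // Message complete: escape so the tags are shown as text.
    // For brevity this sketch escapes the whole string, not only the parts outside code.
    return escapeHTML(markdown)
}

console.log(prepareMarkdown('<h1>Title</h1>', true))
console.log(prepareMarkdown('<h1>Title</h1>', false))
```

The first call returns the input untouched, which is where the brief "rendered as a title" state comes from; the second returns the escaped form that shows the tags once the message is done.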


Fixes #51768

When working on #51576 I was trying to fix some issues where the hallucination detection logic highlighting was not working well and later, based on Rok's input, noticed that this was causing a regression. The problem? I forgot to push my revert 😭

Screenshot 2023-05-11 at 12 29 47: https://github.com/sourcegraph/sourcegraph/assets/458591/af334243-7a30-4f4b-af90-ec8f5e2bc86e

So here's the revert for the changes in #51576

Test plan

Screenshot 2023-05-11 at 12 31 03: https://github.com/sourcegraph/sourcegraph/assets/458591/c7ce8465-2ac7-4b49-b205-3555c433942a

Cody: Handle inline code blocks and use marked to lex markdown

7f1af4a

philipp-spiess requested review from novoselrok and a team May 8, 2023 15:02

philipp-spiess self-assigned this May 8, 2023

cla-bot bot added the cla-signed label May 8, 2023

github-actions bot added the team/code-exploration Issues owned by the Code Exploration team label May 8, 2023

philipp-spiess commented May 8, 2023


Add PR number

8166ac7

philipp-spiess added 2 commits May 8, 2023 18:37

Fix test

92f0e5a

Add buildfiles

94549d2

philipp-spiess requested a review from vdavid May 8, 2023 18:56

mrnugget approved these changes May 9, 2023


philipp-spiess and others added 2 commits May 9, 2023 10:37

Update client/cody-shared/src/chat/markdown.ts

58731ba

Co-authored-by: Thorsten Ball <mrnugget@gmail.com>

Merge remote-tracking branch 'origin/main' into ps/cody-inline-code-b…

e257ca5

…locks-escaping

philipp-spiess merged commit 4feda61 into main May 9, 2023
17 checks passed

philipp-spiess deleted the ps/cody-inline-code-blocks-escaping branch May 9, 2023 11:18

This was referenced May 11, 2023

cody bug: links to files are escaped after being fully rendered and show up as HTML #51768

Closed

Revert hallucination detector changes #51785

Merged
