Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cody: Handle inline code blocks and use marked to lex markdown #51576

Merged
merged 6 commits into from
May 9, 2023

Conversation

philipp-spiess
Copy link
Contributor

Closes #51528

We found more issues where the naive markdown parser would not work well, including inline code blocks. To fix this, this PR uses the marked lexer to parse Markdown instead of doing our own thing. This comes with some other issues, though:

  • For some reason, tilde fenced blocks are not working if we use the inline lexer on the whole text. Instead, we have to use a combination of the normal lexer (for root blocks) and inline lexer for detecting nested layouts like inline code blocks. We only use the inline lexer for escaping and not the hallucination detection which means that inline code blocks can have hallucination spans which actually seems desired from my testing

  • The marked lexer produces different results for documents that aren't fully streamed yet. In general, we have some issues already where applying the markdown formatter for every token that is streamed in, can cause layout shifts and I observed another issue where a top-line code block was not detected as one by the lexer, if the ending fence was missing. What's odd is that it does seem to parse the markdown in this case. This resulted in a very noticeable layout shift from &gt;html ... to <html ... which was super annoying. To fix this, we now do not escape HTML while the message is in flight instead. This means that if you let Cody generate some HTML (without using markdown) you'd see it formatted first before the full message is received and we can properly replace.

    IMO this is still desired as there is no harm in rendering the HTML (especially when we do more aggressive deny-listing here) however you still want the html tags shown to you at the end because we are, after all, we generate code and sometimes Cody is not using the markdown properly.

  • There are some other changes to the hallucination detection that I'll mark inline

Test plan

Did a bunch of different prompts to ensure it kinda works. Also piped the test cases through the right functions and added a case for inline code snippets.

Screenshot 2023-05-08 at 16 31 44

Screenshot 2023-05-08 at 15 42 26

Screenshot 2023-05-08 at 15 40 11

@philipp-spiess philipp-spiess requested review from novoselrok and a team May 8, 2023 15:02
@philipp-spiess philipp-spiess self-assigned this May 8, 2023
@cla-bot cla-bot bot added the cla-signed label May 8, 2023
@github-actions github-actions bot added the team/code-exploration Issues owned by the Code Exploration team label May 8, 2023
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented May 8, 2023

Codenotify: Notifying subscribers in OWNERS files for diff 5eeb210...e257ca5.

No notifications.

}

function highlightLine(line: string, tokens: HighlightedToken[]): string {
let highlightedLine = line
for (const token of tokens) {
highlightedLine = highlightedLine.replaceAll(token.outerValue, getHighlightedTokenHTML(token))
highlightedLine = highlightedLine.replaceAll(token.innerValue, getHighlightedTokenHTML(token))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@novoselrok Do you have some context why we used the outerValue here? it was screwing up formatting in some cases, e.g. the boundary could start with a comma (,) which would be put inside the highlighted line for some reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was because the Markdown backticks styling was messing up something with the highlights underline. If everything works and looks correct, then go ahead with your changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I can test specifically regarding backticks to make sure, ty!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is still an issue, reverting.

@sourcegraph-buildkite
Copy link
Collaborator

sourcegraph-buildkite commented May 8, 2023

Bundle size report 📦

Initial size Total size Async size Modules
0.00% (0.00 kb) 0.00% (+0.15 kb) 0.00% (+0.15 kb) 0.00% (0)

Look at the Statoscope report for a full comparison between the commits e257ca5 and 5eeb210 or learn more.

Open explanation
  • Initial size is the size of the initial bundle (the one that is loaded when you open the page)
  • Total size is the size of the initial bundle + all the async loaded chunks
  • Async size is the size of all the async loaded chunks
  • Modules is the number of modules in the initial bundle

@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented May 8, 2023

📖 Storybook live preview

@philipp-spiess philipp-spiess requested a review from vdavid May 8, 2023 18:56
Copy link
Contributor

@mrnugget mrnugget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

client/cody-shared/src/chat/markdown.ts Outdated Show resolved Hide resolved
Comment on lines +41 to +43
if (isStreaming) {
return markdown
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we need to write our own parser if we want to do what ChatGPT does and stream into a code block? That's what I mean:

screenshot_2023-05-09_10 00 30

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's the odd thing, it already visually renders a codeblock like this, but when I use the lexer not the render feature, it would not detect the in-progress code block as such. That's so odd :(

This PR visually works the way you describe but by basically not escaping HTML while the message is not done. so it might render a <h1>Title</h1> as a title and then shift to showing the <h1> tags when done. I found this to be better (and less common) than the inverse where the encoding inside the code block is wrong.

Copy link
Contributor

@mrnugget mrnugget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@philipp-spiess philipp-spiess merged commit 4feda61 into main May 9, 2023
17 checks passed
@philipp-spiess philipp-spiess deleted the ps/cody-inline-code-blocks-escaping branch May 9, 2023 11:18
philipp-spiess added a commit that referenced this pull request May 11, 2023
Fixes #51768

When working on #51576 I was trying to fix some issues where the
hallucination detection logic highlighting was not working well and
later, based on Rok's input, noticed that this was causing a regression.

The problem? I forgot to push my revert 😭 

<img width="896" alt="Screenshot 2023-05-11 at 12 29 47"
src="https://github.com/sourcegraph/sourcegraph/assets/458591/af334243-7a30-4f4b-af90-ec8f5e2bc86e">

So here's the revert for the changes in #51576

## Test plan

<img width="556" alt="Screenshot 2023-05-11 at 12 31 03"
src="https://github.com/sourcegraph/sourcegraph/assets/458591/c7ce8465-2ac7-4b49-b205-3555c433942a">


<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed team/code-exploration Issues owned by the Code Exploration team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ampersands in code are escaped
5 participants