Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separators: Implement html chunking strategy. #23

Merged
merged 1 commit into from
May 14, 2024

Conversation

cpursley
Copy link
Contributor

No description provided.

@cpursley cpursley mentioned this pull request May 11, 2024
Copy link
Collaborator

@stuartjohnpage stuartjohnpage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great: one question, but no blockers. Thank you again!

:rtf,
:ruby,
:typescript,
:vue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for organising these!

"<article",
"<section",
"<table"
] ++
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, why is it splitting on <h1 rather than <h1>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just being lazy and following the other examples (vue).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just struck me while driving that this will also split on any HMTL tags that have attrs! So it makes way more sense than adding in the closing >.

end

def get_separators(:typescript), do: get_separators(:javascript)

def get_separators(format) when format in @plaintext_formats, do: get_separators(:plaintext)

defp empty_and_new_line_separators do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link

@grossvogel grossvogel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you!

My only inclination to nit-pick is around whether <article and <section should be higher-up in the separator list. But the more I think about it, I'm concluding that documents that use those elements according to the spec are most likely also going to be using the <h... tags in a way that gives us better information about how to group the content.

tl;dr; LGTM

@stuartjohnpage stuartjohnpage merged commit 6ec07ff into revelrylabs:main May 14, 2024
@cpursley
Copy link
Contributor Author

cpursley commented May 15, 2024

I actually thought about not including article and section - perhaps they are actually unnecessary. Thoughts?

If you think we should keep, moving up prob makes sense. If so - where?

@grossvogel
Copy link

@cpursley I can't convince myself that moving (or removing) them would be a definite improvement, so I'm happy with this until somebody comes along with a stronger argument 😄

Thanks so much for this contribution!

@cpursley
Copy link
Contributor Author

cpursley commented May 15, 2024

You're welcome.

Y'all should do a detailed blog post about RAG stuff in Elixir (how you're working with the chunked data, etc)!

@cpursley cpursley deleted the separators/html branch May 15, 2024 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants