Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support HTML Parsing #151

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

amenk
Copy link

@amenk amenk commented Jun 14, 2024

This adds support for HTML parsing.

@MaximeThoonsen
Copy link
Collaborator

Great idea @amenk

Copy link
Contributor

@f-lombardo f-lombardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add some html file to the test suite and run it on those too.

"predis/predis": "This is required for the RedisVectoreStore.",
"elasticsearch/elasticsearch": "This is required for the ElasticsearchVectoreStore."
"elasticsearch/elasticsearch": "This is required for the ElasticsearchVectoreStore.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comma

@@ -184,7 +184,8 @@ The first part of the flow is to read data from a source.
This can be a database, a csv file, a json file, a text file, a website, a pdf, a word document, an excel file, ...
The only requirement is that you can read the data and that you can extract the text from it.

For now we only support text files, pdf and docx but we plan to support other data type in the future.
For now we only support text files, PDF, DOCX and HTML. but we plan to support other data type in the future.
To only supports HTML files, if you `composer require html2text/html2text`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could add this library as a full dependency, what do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you like?

@f-lombardo
Copy link
Contributor

@amenk You can try to fix formatting problems with:
composer fix-lint
Try also

composer refactor && composer test

@amenk
Copy link
Author

amenk commented Jul 17, 2024

@f-lombardo thanks, currently not working on this

Will hopefully pick up later.

@MaximeThoonsen
Copy link
Collaborator

@amenk @f-lombardo what about just use WebPageTextGetter class in Tool? To get the text from html should be quite enough no? (even if quite messy ^^)

@f-lombardo
Copy link
Contributor

@amenk @f-lombardo what about just use WebPageTextGetter class in Tool? To get the text from html should be quite enough no? (even if quite messy ^^)

Well, it's an option, even if whe should change it a bit in order to parse also HTML coming from a file.
Another thing to consider is if LLPhant should handle the parsing of complex HTML pages by itself or it has to delegate that to an external specialized library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants