Support HTML Parsing #151
Great idea @amenk
Maybe we could add some HTML files to the test suite and run the parser on those too.
"predis/predis": "This is required for the RedisVectoreStore.",
"elasticsearch/elasticsearch": "This is required for the ElasticsearchVectoreStore."
"elasticsearch/elasticsearch": "This is required for the ElasticsearchVectoreStore.",
Remove the trailing comma.
@@ -184,7 +184,8 @@ The first part of the flow is to read data from a source.
This can be a database, a csv file, a json file, a text file, a website, a pdf, a word document, an excel file, ...
The only requirement is that you can read the data and that you can extract the text from it.

For now we only support text files, pdf and docx but we plan to support other data type in the future.
For now we only support text files, PDF, DOCX and HTML. but we plan to support other data type in the future.
To only supports HTML files, if you `composer require html2text/html2text`.
I think we could add this library as a full dependency, what do you think?
As you like?
@amenk You can try to fix formatting problems with:
@f-lombardo thanks, I'm currently not working on this. Will hopefully pick it up later.
@amenk @f-lombardo what about just using the WebPageTextGetter class in Tool? It should be enough to get the text from HTML, no? (even if quite messy ^^)
Well, it's an option, even if we would need to change it a bit in order to also parse HTML coming from a file.
This adds support for HTML parsing.