-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: LinkContentFetcher - replace requests with httpx, add async and http/2 #9034
base: main
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 13929227857Details
💛 - Coveralls |
@julian-risch @dfokina I messed up a bit #9032 so this one is superseding it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good to me already! I suggest merging the two test files. Other than that just minor comments. And I was wondering what you think about customizable headers as a separate issue?
|
||
from haystack.components.fetchers.link_content import LinkContentFetcher, DEFAULT_USER_AGENT | ||
|
||
HTML_URL = "https://docs.haystack.deepset.ai/docs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we merge the two files into one.
It's more consistent with the rest of the code base (one file with all test cases per component) and less code duplication (HTML_URL, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahaha I wanted to have one to begin with but I made two files because I saw that we have separate files for asynchronous and synchronous in openai chat generator. See https://github.com/deepset-ai/haystack/tree/main/test/components/generators/chat Perhaps we can chat internally about it and whatever we decide - I'll adjust
|
||
while attempt <= self.retry_attempts: | ||
try: | ||
headers = REQUEST_HEADERS.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that users reported issues using LinkContentFetcher, I think the component would benefit from making REQUEST_HEADERS
customizable. Separate issue and PR maybe? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes let's go for separate issue in the next sprint
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only one small change request remaining: merging the two test files. Unless you see a good reason to keep them separate. https://github.com/deepset-ai/haystack/pull/9034/files#r1995051947
Why:
Enhances the
LinkContentFetcher
functionality by introducing asynchronous content fetching capabilities and optional HTTP/2 support, replacingrequests
withhttpx
to improve performance and reliability.What:
httpx
with lazy loading for theh2
library.requests
withhttpx
for both synchronous and asynchronous HTTP operations, including custom client configurations.How can it be used:
How did you test it:
Implemented and utilized comprehensive unit and integration tests using
pytest
, focusing on both synchronous and asynchronous fetching paths, various content types, and error handling scenarios.Notes for the reviewer:
Pay attention to the lazy import of HTTP/2 support and verify if error handling is comprehensive for network failures and import issues with
h2
. Review the updated test cases to ensure all edge cases are covered.