Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: LinkContentFetcher - replace requests with httpx, add async and http/2 #9034

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

vblagoje
Copy link
Member

Why:

Enhances the LinkContentFetcher functionality by introducing asynchronous content fetching capabilities and optional HTTP/2 support, replacing requests with httpx to improve performance and reliability.

What:

  • Added asynchronous methods for fetching web content, significantly improving efficiency for parallel requests.
  • Incorporated support for HTTP/2 using httpx with lazy loading for the h2 library.
  • Replaced requests with httpx for both synchronous and asynchronous HTTP operations, including custom client configurations.

How can it be used:

  • To fetch web content asynchronously:
    async def fetch_async():
        fetcher = LinkContentFetcher(http2=True)
        results = await fetcher.run_async(urls=["https://example.com"])
        return results["streams"]
    streams = asyncio.run(fetch_async())
  • To enable HTTP/2:
    fetcher = LinkContentFetcher(http2=True)

How did you test it:

Implemented and utilized comprehensive unit and integration tests using pytest, focusing on both synchronous and asynchronous fetching paths, various content types, and error handling scenarios.

Notes for the reviewer:

Pay attention to the lazy import of HTTP/2 support and verify if error handling is comprehensive for network failures and import issues with h2. Review the updated test cases to ensure all edge cases are covered.

@coveralls
Copy link
Collaborator

coveralls commented Mar 13, 2025

Pull Request Test Coverage Report for Build 13929227857

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 25 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 90.092%

Files with Coverage Reduction New Missed Lines %
components/fetchers/link_content.py 25 86.03%
Totals Coverage Status
Change from base Build 13920793695: 0.02%
Covered Lines: 9802
Relevant Lines: 10880

💛 - Coveralls

@vblagoje vblagoje marked this pull request as ready for review March 13, 2025 01:21
@vblagoje vblagoje requested review from a team as code owners March 13, 2025 01:21
@vblagoje vblagoje requested review from anakin87 and removed request for a team and anakin87 March 13, 2025 01:21
@vblagoje
Copy link
Member Author

@julian-risch @dfokina I messed up a bit #9032 so this one is superseding it

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good to me already! I suggest merging the two test files. Other than that just minor comments. And I was wondering what you think about customizable headers as a separate issue?


from haystack.components.fetchers.link_content import LinkContentFetcher, DEFAULT_USER_AGENT

HTML_URL = "https://docs.haystack.deepset.ai/docs"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we merge the two files into one.
It's more consistent with the rest of the code base (one file with all test cases per component) and less code duplication (HTML_URL, etc.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahaha I wanted to have one to begin with but I made two files because I saw that we have separate files for asynchronous and synchronous in openai chat generator. See https://github.com/deepset-ai/haystack/tree/main/test/components/generators/chat Perhaps we can chat internally about it and whatever we decide - I'll adjust


while attempt <= self.retry_attempts:
try:
headers = REQUEST_HEADERS.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that users reported issues using LinkContentFetcher, I think the component would benefit from making REQUEST_HEADERS customizable. Separate issue and PR maybe? What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes let's go for separate issue in the next sprint

vblagoje and others added 4 commits March 18, 2025 17:42
@vblagoje vblagoje requested a review from julian-risch March 18, 2025 17:20
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one small change request remaining: merging the two test files. Unless you see a good reason to keep them separate. https://github.com/deepset-ai/haystack/pull/9034/files#r1995051947

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants