feat: LinkContentFetcher - replace requests with httpx, add async and http/2 #9034

vblagoje · 2025-03-13T01:13:06Z

Why:

Adds asynchronous content fetching and optional HTTP/2 support to LinkContentFetcher, and replaces requests with httpx for both the synchronous and asynchronous code paths to improve performance and reliability.

fixes Migrate LinkContentFetcher from requests to httpx and add async run #9000

What:

Added asynchronous methods for fetching web content, so that multiple URLs can be requested concurrently instead of one after another.
Incorporated support for HTTP/2 using httpx with lazy loading for the h2 library.
Replaced requests with httpx for both synchronous and asynchronous HTTP operations, including custom client configurations.
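As a rough illustration of the concurrent fetching described above (not the code from this PR), the sketch below gathers several coroutines at once and keeps per-URL failures instead of aborting the whole batch. The names `fetch_all` and `fetch_urls_with_httpx` are hypothetical; the httpx variant assumes httpx is installed and imports it lazily.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Union

FetchOne = Callable[[str], Awaitable[bytes]]


async def fetch_all(urls: List[str], fetch_one: FetchOne) -> Dict[str, Union[bytes, BaseException]]:
    """Run fetch_one for every URL concurrently; map each URL to its body or its exception."""
    # return_exceptions=True keeps one failing URL from cancelling the others.
    results = await asyncio.gather(*(fetch_one(url) for url in urls), return_exceptions=True)
    return dict(zip(urls, results))


async def fetch_urls_with_httpx(urls: List[str], http2: bool = False):
    """Hypothetical helper: share one httpx.AsyncClient across all requests."""
    import httpx  # lazy import, so the sketch loads even without httpx installed

    async with httpx.AsyncClient(http2=http2, follow_redirects=True, timeout=3.0) as client:

        async def fetch_one(url: str) -> bytes:
            response = await client.get(url)
            response.raise_for_status()
            return response.content

        return await fetch_all(urls, fetch_one)
```

Passing the per-URL coroutine in as a parameter keeps the gathering logic testable without network access.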

How can it be used:

To fetch web content asynchronously:

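import asyncio

from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    # http2=True requests HTTP/2, which relies on the lazily imported h2 package
    fetcher = LinkContentFetcher(http2=True)
    results = await fetcher.run_async(urls=["https://example.com"])
    return results["streams"]

streams = asyncio.run(fetch_async())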

To enable HTTP/2:

fetcher = LinkContentFetcher(http2=True)

How did you test it:

Added unit and integration tests with pytest that cover the synchronous and asynchronous fetching paths, various content types, and error handling scenarios.

Notes for the reviewer:

Pay attention to the lazy import of HTTP/2 support, and verify that error handling covers network failures and a missing h2 package. Review the updated test cases to ensure all edge cases are covered.
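One way to picture the lazy h2 check is the sketch below. It is an assumption about the general approach, not the component's actual code: the helper `resolve_http2` and its `module_name` parameter are hypothetical, and falling back to HTTP/1.1 with a warning is only one possible policy (raising an error is another).

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def resolve_http2(requested: bool, module_name: str = "h2") -> bool:
    """Return True only if HTTP/2 was requested and its backing package can be imported."""
    if not requested:
        return False
    # find_spec looks the package up without importing it, so nothing is loaded
    # unless HTTP/2 is actually used later.
    if importlib.util.find_spec(module_name) is None:
        logger.warning("HTTP/2 requested but '%s' is not installed; using HTTP/1.1.", module_name)
        return False
    return True
```

The result could then be passed on as `httpx.Client(http2=resolve_http2(True))`, so that a missing h2 package degrades the connection instead of failing at client construction.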

coveralls · 2025-03-13T01:18:28Z

Pull Request Test Coverage Report for Build 14043178735

Details

0 of 0 changed or added relevant lines in 0 files are covered.
25 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.02%) to 90.153%

Files with Coverage Reduction	New Missed Lines	%
components/fetchers/link_content.py	25	86.03%

Totals
Change from base Build 14043167436:	0.02%
Covered Lines:	9934
Relevant Lines:	11019

💛 - Coveralls

vblagoje · 2025-03-13T01:22:08Z

@julian-risch @dfokina I messed up #9032 a bit, so this PR supersedes it.

julian-risch

Looks quite good to me already! I suggest merging the two test files. Other than that just minor comments. And I was wondering what you think about customizable headers as a separate issue?

haystack/components/fetchers/link_content.py

julian-risch · 2025-03-14T07:58:59Z

test/components/fetchers/test_link_content_fetcher_async.py

+
+from haystack.components.fetchers.link_content import LinkContentFetcher, DEFAULT_USER_AGENT
+
+HTML_URL = "https://docs.haystack.deepset.ai/docs"


I suggest we merge the two files into one.
It's more consistent with the rest of the code base (one file with all test cases per component) and less code duplication (HTML_URL, etc.)

vblagoje · 2025-03-20

Ha, I wanted to have one file to begin with, but I made two because I saw that we have separate files for asynchronous and synchronous tests for the OpenAI chat generator (see https://github.com/deepset-ai/haystack/tree/main/test/components/generators/chat). Perhaps we can discuss it internally, and I'll adjust to whatever we decide.

haystack/components/fetchers/link_content.py

julian-risch · 2025-03-14T08:59:21Z

haystack/components/fetchers/link_content.py

+
+        while attempt <= self.retry_attempts:
+            try:
+                headers = REQUEST_HEADERS.copy()


Given that users reported issues using LinkContentFetcher, I think the component would benefit from making REQUEST_HEADERS customizable. Separate issue and PR maybe? What do you think?

vblagoje · 2025-03-18

Yes, let's handle that in a separate issue in the next sprint.
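To make the follow-up idea concrete, here is a minimal sketch of how user-supplied headers could be layered over defaults. It is not the design chosen for the follow-up issue; `merge_headers` is a hypothetical helper, and the default values in the usage note are placeholders rather than the component's real REQUEST_HEADERS.

```python
from typing import Dict, Optional


def merge_headers(defaults: Dict[str, str], overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a new dict in which overrides replace defaults, matching names case-insensitively."""
    # HTTP header names are case-insensitive, so key the merge on the lowercased name.
    merged = {name.lower(): (name, value) for name, value in defaults.items()}
    for name, value in (overrides or {}).items():
        merged[name.lower()] = (name, value)
    # Rebuild a plain dict from the (name, value) pairs; the inputs are left untouched.
    return dict(merged.values())
```

For example, `merge_headers({"User-Agent": "placeholder-agent", "accept": "*/*"}, {"user-agent": "my-bot/1.0"})` keeps the `accept` default and replaces the user agent without producing two differently cased entries.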

test/components/fetchers/test_link_content_fetcher_async.py

Co-authored-by: Julian Risch <julian.risch@deepset.ai>

julian-risch

only one small change request remaining: merging the two test files. Unless you see a good reason to keep them separate. https://github.com/deepset-ai/haystack/pull/9034/files#r1995051947

vblagoje · 2025-03-24T19:07:40Z

Ok nice, all the tests pass now @julian-risch and they've been merged

julian-risch

LGTM! 👍

LinkContentFetcher - replace requests with httpx, add async and http/2

3cc8416

github-actions bot added the topic:tests, topic:build/distribution, and type:documentation labels Mar 13, 2025

vblagoje requested review from julian-risch and dfokina March 13, 2025 01:17

vblagoje marked this pull request as ready for review March 13, 2025 01:21

vblagoje requested review from a team as code owners March 13, 2025 01:21

vblagoje requested review from anakin87 and removed request for a team and anakin87 March 13, 2025 01:21

julian-risch requested changes Mar 14, 2025

vblagoje and others added 4 commits March 18, 2025 17:42

Update haystack/components/fetchers/link_content.py

4679843

Co-authored-by: Julian Risch <julian.risch@deepset.ai>

Update haystack/components/fetchers/link_content.py

ebd8f11

Co-authored-by: Julian Risch <julian.risch@deepset.ai>

PR feedback

55e9461

Merge branch 'main' into link_fetcher

597851a

vblagoje requested a review from julian-risch March 18, 2025 17:20

julian-risch requested changes Mar 19, 2025

julian-risch mentioned this pull request Mar 20, 2025

Make REQUEST_HEADERS in LinkContentFetcher customizable #9064

Open

vblagoje added 3 commits March 24, 2025 10:05

Merge sync and async

75702b0

Merge branch 'main' into link_fetcher

7a2c388

Merge branch 'main' into link_fetcher

e34544d

julian-risch approved these changes Mar 25, 2025

vblagoje merged commit 13941d8 into main Mar 26, 2025
18 checks passed

vblagoje deleted the link_fetcher branch March 26, 2025 13:55

vblagoje mentioned this pull request Mar 26, 2025

Migrate LinkContentFetcher from requests to httpx and add async run #9000

Closed
