Skip to content

fix: Do not raise an error to check 'same-domain' if there is no hostname in the url #1251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 19, 2025

Conversation

Mantisus
Copy link
Collaborator

Description

  • The need for PR is due to the fact that now the enqueue strategy check, is used in extract_links and can be applied to invalid urls.

@Mantisus Mantisus requested review from Copilot and vdusek June 16, 2025 13:31
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR changes the behavior of the same-domain enqueue strategy to return False instead of raising an error when either URL lacks a hostname, preventing exceptions on invalid URLs.

  • Replace ValueError with a silent skip (return False) for missing hostnames in same-domain checks.
  • Ensures extract_links can apply the enqueue strategy without erroring on malformed URLs.
Comments suppressed due to low confidence (2)

src/crawlee/crawlers/_basic/_basic_crawler.py:970

  • Add a unit test for the case where origin_url.hostname or target_url.hostname is None with the same-domain strategy to verify that it returns False and does not enqueue.
return False

src/crawlee/crawlers/_basic/_basic_crawler.py:967

  • Update the method docstring for _check_enqueue_strategy to document that URLs without hostnames will simply return False for the same-domain strategy instead of raising an error.
def _check_enqueue_strategy(

@Mantisus Mantisus self-assigned this Jun 16, 2025
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable 👍

@vdusek vdusek changed the title fix: Do not raise an error to check ‘same-domain’ if there is no hostname in the url fix: Do not raise an error to check 'same-domain' if there is no hostname in the url Jun 17, 2025
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vdusek vdusek merged commit a6c3aab into apify:master Jun 19, 2025
44 of 45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants