fix crawler issues #25

Open
tacman opened this issue Sep 4, 2023 · 2 comments
tacman (Collaborator) commented Sep 4, 2023

https://mus.wip/en/crawlerdata

The crawler should take an optional --locale argument that limits the crawl to pages in that locale. Otherwise, we're overwhelmed with duplicate links.

_profiler/ links should be excluded from the JSON; they should never even reach the link list.
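
A minimal sketch of both rules as a single link filter. It is written in Python purely for illustration and is not the bundle's code; `keep_link` is a hypothetical name, the `es` URL and the `abc123` profiler token are made-up samples, and the sketch assumes the locale is always the first path segment.

```python
from typing import Optional
from urllib.parse import urlparse


def keep_link(url: str, locale: Optional[str] = None) -> bool:
    """Return True if the crawler should add this URL to its link list."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Profiler pages are debug-only, so drop them before they reach the list.
    if "_profiler" in segments:
        return False
    # With a locale set, keep only URLs whose first path segment matches it.
    # Assumption: the locale is always the first path segment.
    if locale is not None and (not segments or segments[0] != locale):
        return False
    return True


links = [
    "https://mus.wip/en/crawlerdata",
    "https://mus.wip/es/crawlerdata",       # hypothetical second locale
    "https://mus.wip/_profiler/abc123",     # hypothetical profiler token
    "https://www.mus.wip/en/mus/countries",
]
kept = [u for u in links if keep_link(u, locale="en")]
```

Filtering at the point where links are discovered, rather than when the JSON is written, keeps the profiler and other-locale URLs out of the queue entirely.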

There are still pages missing, like https://www.mus.wip/en/mus/countries, which is linked in the footer. Maybe that's because there are so many language and profiler links.

tacman (Collaborator, Author) commented Nov 27, 2023

Specifically, the developer experience (DX) for the crawler should be:

  • Require the bundle.
  • Configure survos_crawler.yaml (users: routes_to_ignore, paths_to_ignore).
  • Run the crawler against the dev environment and fix any errors. (The web profiler toolbar should be turned off dynamically.)
  • Configure the test environment and run the crawler with --env=test.
  • Run PHPUnit with the routes generated by the --env=test run.
  • Run code coverage (php-pcov seems to work in PHP 8.3; this is already done in a few applications).
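
The ignore lists from the configuration step could be applied along these lines. This is an illustrative Python sketch, not the bundle's implementation: `is_ignored` is a hypothetical helper, the sample patterns and route names are invented, and it assumes that `paths_to_ignore` entries are regular expressions and `routes_to_ignore` entries are exact route names. The issue does not specify either.

```python
import re
from typing import Iterable, Optional


def is_ignored(path: str, route: Optional[str],
               paths_to_ignore: Iterable[str],
               routes_to_ignore: Iterable[str]) -> bool:
    """Return True if the crawler should skip this page."""
    # Assumption: route names are compared exactly.
    if route is not None and route in set(routes_to_ignore):
        return True
    # Assumption: each path entry is a regular expression searched in the path.
    return any(re.search(pattern, path) for pattern in paths_to_ignore)


# Hypothetical values, standing in for what survos_crawler.yaml might hold.
paths_to_ignore = [r"^/_profiler", r"/logout$"]
routes_to_ignore = ["app_logout"]

checks = {
    "/_profiler/abc123": is_ignored("/_profiler/abc123", None,
                                    paths_to_ignore, routes_to_ignore),
    "/en/mus/countries": is_ignored("/en/mus/countries", "mus_countries",
                                    paths_to_ignore, routes_to_ignore),
    "/en/logout": is_ignored("/en/logout", "app_logout",
                             paths_to_ignore, routes_to_ignore),
}
```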

tacman (Collaborator, Author) commented Nov 29, 2023

  • mailto: links should be ignored.
  • If a link has already been seen, we don't need to write it again unless we have new data, such as the number of times it has been seen.
    Some 403s are wrong.
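
One way the first two points might look, sketched in Python for illustration only. `LinkRegistry` and its `writes` list are hypothetical stand-ins for whatever storage the crawler uses; the sketch assumes the seen-count can be kept in memory and flushed once at the end of a run, which the issue does not state.

```python
from typing import Dict, List


class LinkRegistry:
    """Track seen links and report when a record actually needs writing."""

    def __init__(self) -> None:
        self.seen_count: Dict[str, int] = {}
        self.writes: List[str] = []  # stands in for the real storage layer

    def record(self, url: str) -> bool:
        """Register a link; return True if it was written."""
        # mailto: links are not crawlable pages, so skip them outright.
        if url.lower().startswith("mailto:"):
            return False
        first_time = url not in self.seen_count
        self.seen_count[url] = self.seen_count.get(url, 0) + 1
        # Only the first sighting triggers a write; repeats just bump the
        # in-memory counter (assumed to be flushed once at the end of a run).
        if first_time:
            self.writes.append(url)
        return first_time


registry = LinkRegistry()
for found in [
    "https://mus.wip/en/crawlerdata",
    "mailto:someone@example.com",           # hypothetical address
    "https://mus.wip/en/crawlerdata",
    "https://www.mus.wip/en/mus/countries",
]:
    registry.record(found)
```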

Switch User links should be excluded; they may be part of the problem above.
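
A sketch of how such links might be detected, in Python for illustration. It assumes the Switch User links are Symfony impersonation links carrying the default `_switch_user` query parameter; the issue does not say so, and the parameter name is configurable per firewall. `is_switch_user_link` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


def is_switch_user_link(url: str, parameter: str = "_switch_user") -> bool:
    """Return True if the URL carries the impersonation query parameter."""
    # keep_blank_values=True so that "?_switch_user=" is still detected.
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return parameter in query


candidates = [
    "https://mus.wip/en/?_switch_user=someone",   # hypothetical username
    "https://mus.wip/en/?_switch_user=_exit",
    "https://www.mus.wip/en/mus/countries",
]
crawlable = [u for u in candidates if not is_switch_user_link(u)]
```

Following an impersonation link changes the logged-in user for the rest of the session, which is one plausible source of unexpected 403 responses on later pages.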
