Consider separating out the various parts of *URL Hacks* into separate options #271

Open
AnyOldName3 opened this issue Jan 19, 2024 · 1 comment

@AnyOldName3

A site mirror I made didn't work properly until I commented out the body of jump_normalized_const so that it no longer skipped over a www. prefix; after that change it worked. (This is a simplification: to get the project to build I ended up using https://github.com/mitchcapper/httrack, and then had to patch out a couple of regressions that fork had relative to this version.)

If the options to treat http:// and https:// URLs as the same, treat www.thing.com and thing.com URLs as the same, and to remove redundant slashes were separate instead of under one umbrella URL Hacks setting, I could have just enabled and disabled the bits I needed.
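To illustrate what I mean by separate options, here is a minimal Python sketch, not HTTrack's actual C code. The function name `normalize_url` and the three flag names are hypothetical; each flag switches on exactly one of the equivalences listed above.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, *, merge_scheme=False, merge_www=False, collapse_slashes=False):
    """Return a comparison key for url, applying only the requested rules.

    Hypothetical sketch: the flag names are illustrative, not HTTrack options.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    netloc = netloc.lower()
    if merge_scheme and scheme in ("http", "https"):
        # Treat http:// and https:// URLs as the same.
        scheme = "http"
    if merge_www and netloc.startswith("www."):
        # Treat www.thing.com and thing.com as the same host.
        netloc = netloc[len("www."):]
    if collapse_slashes:
        # Remove redundant slashes in the path only.
        path = re.sub(r"/{2,}", "/", path)
    return urlunsplit((scheme, netloc, path, query, fragment))


# Only the www. rule is enabled; scheme and slashes are left alone.
print(normalize_url("https://www.thing.com//a//b", merge_www=True))
```

With flags like these, the mirror in question could have kept the scheme and slash handling while leaving the www. rule off.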

@AnyOldName3

I determined that this particular site could in principle have worked with both the http:// / https:// equivalence and the www.domain / domain equivalence enabled, but the system that detects when one URL redirects to another failed when the redirect took more than one step. The http:// URLs redirected to the https:// URLs, which HTTrack handled sensibly, but then the non-www. URLs redirected to the www. ones, which HTTrack didn't fetch. My guess is that this was misinterpreted as a redirect loop because the URLs were identical after normalisation, even though they differed before it.
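The guess about the redirect loop can be shown with a toy model in Python. This is a sketch of the suspected failure mode, not HTTrack's real logic: `follow_redirects`, `strip_scheme_and_www`, and the `redirects` table are all hypothetical. The only thing that varies is whether "have we seen this URL?" is checked on raw URLs or on normalised ones.

```python
def strip_scheme_and_www(url):
    """Toy stand-in for URL Hacks normalisation (hypothetical, not HTTrack's code)."""
    rest = url.split("://", 1)[-1]
    return rest[4:] if rest.startswith("www.") else rest


def follow_redirects(start, redirects, key=lambda u: u, max_hops=10):
    """Follow redirects (a dict of url -> target); return (last_url, looped).

    key maps a URL to the value used for the "already seen" check.
    """
    seen = {key(start)}
    url = start
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None:
            return url, False  # no further redirect: this is the final URL
        if key(target) in seen:
            return url, True  # target looks already visited: report a loop
        seen.add(key(target))
        url = target
    return url, True  # too many hops is treated as a loop as well


# The two-step chain described above: http -> https, then non-www -> www.
redirects = {
    "http://thing.com/": "https://thing.com/",
    "https://thing.com/": "https://www.thing.com/",
}

print(follow_redirects("http://thing.com/", redirects))
print(follow_redirects("http://thing.com/", redirects, key=strip_scheme_and_www))
```

Comparing raw URLs follows the chain to https://www.thing.com/, while comparing normalised URLs reports a loop. In this toy model the normalised check already trips on the first hop, whereas HTTrack handled the first hop and only stopped at the second, so the real logic evidently differs in detail; the sketch only shows why checking after normalisation can mistake a legitimate chain for a loop.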
