Scraping from archives feature #336

Open
catfromplan9 opened this issue May 7, 2023 · 0 comments
Add a feature to scrape from archive sites. Using that flag would detect archive.today (there are a few backup domains people use, so don't hardcode the domain) and, if it finds it, edit the HTML and remove the divs that contain the scraper stuff, leaving behind just the site contents. I did this manually and I'm sure it could be automated. For archive.org, you can parse out an HTML field on the page that contains a link to the un-archive.org-ified webpage, just as it was originally.
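A rough sketch of the div-removal step, in Python just for illustration and using only the standard library. The id `TOOLBAR` in the usage note is a made-up placeholder, not the real id of any archive site's wrapper; the actual ids would have to be looked up on the archived page. `strip_elements` and `WrapperStripper` are hypothetical names, and the depth counting assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class WrapperStripper(HTMLParser):
    """Re-emit HTML while dropping every element whose id is in remove_ids."""

    def __init__(self, remove_ids):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.remove_ids = set(remove_ids)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element being removed

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if dict(attrs).get("id") in self.remove_ids:
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> never open a nested scope.
        if self.skip_depth or dict(attrs).get("id") in self.remove_ids:
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")


def strip_elements(html, remove_ids):
    """Return html with the elements carrying the given ids removed."""
    parser = WrapperStripper(remove_ids)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `strip_elements(page, {"TOOLBAR"})` would drop a `<div id="TOOLBAR">` and everything nested inside it while leaving the rest of the page as it was.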

Also, add another flag to disable the link conversion on the page when this archive option is on. Link conversion can work by looking for a second https:// or http:// after the start of the link.
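The "second https://" idea could look something like this. It is a minimal Python sketch, not a proposal for how monolith should implement it; `unwrap_archived_link` is a hypothetical name and the example URLs are only illustrative.

```python
import re

# Matches the start of an http or https URL, in any letter case.
_SCHEME = re.compile(r"https?://", re.IGNORECASE)


def unwrap_archived_link(url: str) -> str:
    """Return the URL embedded inside an archive link, or url unchanged.

    The embedded URL is taken to start at the first "http://" or "https://"
    that appears after the very beginning of the link.
    """
    for match in _SCHEME.finditer(url):
        if match.start() > 0:
            return url[match.start():]
    return url


print(unwrap_archived_link(
    "https://web.archive.org/web/20230507000000/https://example.com/page"))
print(unwrap_archived_link("https://example.com/page"))
```

One caveat: a normal link that carries another URL in its query string (say `?next=https://...`) would also get unwrapped, so this should only run on pages already detected as archive pages, which is also why a flag to turn it off is useful.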

You could support other archive sites with this feature, but I only know of these two. I did this manually with a site I archived using monolith, and I haven't seen any tool for parsing archive.org or archive.today pages back into their original format.
