Unable to download text with id 5001 #22
Comments
Hmmmmm, I am not sure why, but it looks like Project Gutenberg no longer has the plain text version of this book available. If you click through to I'll need to change this section to use a book that is available as plain text from Project Gutenberg. In the meantime, if you'd like to pick another book to work through the examples, I might suggest something else physics-related. Just make sure the plain text version is available and that it's in English (or else tf-idf isn't a meaningful statistic when compared against the other English texts).
Perfect. Would it be useful to be able to download old versions via that 'old' directory? I could write something up and make a pull request.
The gutenbergr package does not do web scraping per se; it actually follows Project Gutenberg's rules for robot access. You can check out the R code for this robot access in the gutenbergr package. We want to be careful to follow Project Gutenberg's own rules for automated traffic, which look like they preclude digging around in the
@juliasilge Does this mean PR #20 should be closed because of Project Gutenberg's rules?
I believe so, yes. @dgrtwo
Hmm, I'm not as confident that Project Gutenberg would have an issue with someone downloading the old/ directory. We're still downloading from the aleph.gutenberg.org mirror and from the same The web scraping rules are meant to be strict about accessing the Project Gutenberg site itself, but I don't think taking an archived version in
Silly suggestion: do you know someone at Project Gutenberg who could answer this question?
Ah, I see what you mean, @dgrtwo! Perhaps I was being too literal there; if the mirror has that directory, then why not download it in the same way? Unfortunately, I don't know anybody to ask, and I am skeptical about how useful it would be to just email one of their general lists. Maybe I am wrong, though. 🤔
The same book is available under another number; try 30155. http://www.gutenberg.org/ebooks/30155
You can try this mirror website: http://mirror.csclub.uwaterloo.ca/gutenberg/
Since this has a new ID and a lot has changed since this issue was opened, we won't address this directly. Thanks for letting us know about the confusion!
I am trying to download Einstein's book on Relativity from the website: it has id '5001'.
Trying to download it using the command:
From looking around the site, it looks like the file has moved to 5001-h.zip, but I am not sure how to modify the URL to do this properly.
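For anyone with the same question about how the URL is put together, here is a minimal sketch of the directory layout that Project Gutenberg mirrors appear to use: every digit of the ID except the last becomes a directory level, followed by a directory named after the full ID. This is written in Python only to illustrate the string logic; it is not gutenbergr's code. The function names `gutenberg_dir` and `gutenberg_zip_url` are hypothetical, the default mirror is the aleph.gutenberg.org host mentioned in this thread, and whether a given file (for example `5001-h.zip`) actually exists on a mirror is an assumption you would need to check yourself.

```python
def gutenberg_dir(book_id: int) -> str:
    """Return the mirror directory for a book ID, e.g. 5001 -> '5/0/0/5001'.

    Assumed layout: all digits but the last, one per directory level,
    then the full ID. Single-digit IDs are assumed to live under '0/'.
    """
    if book_id < 1:
        raise ValueError("book_id must be a positive integer")
    digits = str(book_id)
    # "500" -> "5/0/0"; a single-digit ID has no leading digits, so use "0".
    prefix = "/".join(digits[:-1]) if len(digits) > 1 else "0"
    return f"{prefix}/{digits}"


def gutenberg_zip_url(book_id: int,
                      mirror: str = "http://aleph.gutenberg.org",
                      suffix: str = "") -> str:
    """Build a candidate zip URL; suffix='-h' targets the HTML bundle."""
    base = mirror.rstrip("/")
    return f"{base}/{gutenberg_dir(book_id)}/{book_id}{suffix}.zip"


if __name__ == "__main__":
    # Plain-text zip vs. the HTML zip mentioned in the question above.
    print(gutenberg_zip_url(5001))
    print(gutenberg_zip_url(5001, suffix="-h"))
```

Note that the `-h` archive holds HTML rather than plain text, so even if the URL resolves, the contents would need the markup stripped before any text analysis.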