Unable to download text with id 5001 #22

Closed
kaybenleroll opened this issue Apr 21, 2019 · 11 comments

@kaybenleroll

I am trying to download Einstein's book on relativity from Project Gutenberg; it has ID 5001.

I tried to download it with the command below, which produced the warnings that follow:

gutenberg_download('5001')

Warning in .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/5/0/0/5001/5001.zip
Warning: Unknown or uninitialised column: 'text'.

Looking around the site, it seems the file has moved to 5001-h.zip, but I am not sure how to modify the URL to fetch that file instead.
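For anyone reading along, here is a rough sketch of how the URL in that warning appears to be put together from the book ID: every digit except the last becomes a directory level, followed by the ID itself. `gutenberg_zip_url()` is a hypothetical helper written for illustration, not a gutenbergr function, and the handling of single-digit IDs is an assumption.

```r
# Hypothetical helper (not part of gutenbergr): build a mirror URL for a book ID.
gutenberg_zip_url <- function(id, mirror = "http://aleph.gutenberg.org", suffix = "") {
  id <- as.character(id)
  digits <- strsplit(id, "")[[1]]
  # Directory path uses every digit except the last, e.g. "5001" -> "5/0/0".
  # Assumption: single-digit IDs live under "0".
  dir_path <- if (length(digits) > 1) {
    paste(digits[-length(digits)], collapse = "/")
  } else {
    "0"
  }
  sprintf("%s/%s/%s/%s%s.zip", mirror, dir_path, id, id, suffix)
}

gutenberg_zip_url(5001)
gutenberg_zip_url(5001, suffix = "-h")
```

The first call reproduces the URL from the warning. The second points at 5001-h.zip, but the -h files hold the HTML edition, so swapping the suffix alone would not give the plain text that the examples expect.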

@juliasilge

juliasilge commented May 4, 2019

Hmmmmm, I am not sure why, but it looks like Project Gutenberg no longer has the plain text version of this book available. If you click through to old/ you'll see that the file is there, but it is no longer in the main directory.

I'll need to change this section to use a book that is available via plain text from Project Gutenberg.

In the meantime, if you'd like to pick another book to work through the examples, I might suggest something else physics-related. Just make sure the plain text version is available, and that it's in English (or else tf-idf isn't a meaningful statistic).
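One way to shortlist candidates is to filter the metadata that ships with gutenbergr. The sketch below assumes the `gutenberg_metadata` dataset and its `gutenberg_id`, `title`, `language`, and `has_text` columns; `find_plain_text_works()` is a hypothetical helper, and `gutenberg_works()` offers similar filtering out of the box.

```r
# Hypothetical helper: keep English works that are flagged as having plain text
# and whose title matches a pattern.
find_plain_text_works <- function(meta, pattern) {
  keep <- !is.na(meta$title) &
    grepl(pattern, meta$title, ignore.case = TRUE) &
    meta$language == "en" &
    meta$has_text
  # which() drops rows where the condition is NA (e.g. missing language).
  meta[which(keep), c("gutenberg_id", "title")]
}

# Only runs if gutenbergr is installed; uses the bundled metadata, no network access.
if (requireNamespace("gutenbergr", quietly = TRUE)) {
  print(find_plain_text_works(gutenbergr::gutenberg_metadata, "relativity"))
}
```

The bundled metadata is a snapshot, so it can lag behind the live site, as it apparently did for ID 5001. It is still worth confirming on the book's Project Gutenberg page that a plain text file is actually there.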

@kaybenleroll
Author

Perfect - would it be useful to be able to download old versions via that 'old' directory? I could write something up and make a pull request?

@juliasilge

The gutenbergr package does not do web scraping per se; it actually follows Project Gutenberg's rules for robot access. You can check out the R code for this robot access in the gutenbergr package.

We want to be careful to follow Project Gutenberg's own rules for automated traffic, which look like they preclude digging around in the old/ directory. If you are interested, a PR to gutenbergr for a more informative error message could be helpful, though!
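As a starting point for such a PR, a more informative failure message might look something like the sketch below. This is not how gutenbergr reports errors; `download_or_explain()` and its `fetch` argument are hypothetical names, and the message wording is only a suggestion.

```r
# Hypothetical sketch: wrap a download and explain likely causes when it fails.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
download_or_explain <- function(url,
                                destfile = tempfile(fileext = ".zip"),
                                fetch = utils::download.file) {
  ok <- tryCatch({
    fetch(url, destfile, mode = "wb", quiet = TRUE)
    TRUE
  }, error = function(e) FALSE)

  if (!ok) {
    warning(
      sprintf("Could not download a book at %s. ", url),
      "The plain text file may have been moved or withdrawn; ",
      "check the book's page on Project Gutenberg for other formats or a newer ID.",
      call. = FALSE
    )
    return(NULL)
  }
  destfile
}
```

Returning `NULL` on failure lets the caller skip the book without the follow-on "Unknown or uninitialised column" warning seen in the original report.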

@maelle
Member

maelle commented May 22, 2019

@juliasilge does this mean PR #20 should be closed because of Project Gutenberg's rules?

@juliasilge

I believe so, yes. @dgrtwo

@dgrtwo
Collaborator

dgrtwo commented May 22, 2019

Hmm, I'm not as confident that Project Gutenberg would have an issue with someone downloading the old/ directory.

We're still downloading from the aleph.gutenberg.org mirror and from the same 5/0/0/etc folder as the other accessible files. Indeed, the script already tries adding -0 and -8 as suffixes since some books have those; I don't think trying old/ as a prefix is meaningfully different.
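To make the comparison concrete, here is a minimal sketch of what the list of URLs to try could look like with old/ added as a fallback. `candidate_urls()` is a hypothetical helper, not the code in gutenbergr or in #20; the order of the suffixes and the exact location of old/ inside the book's folder are assumptions.

```r
# Hypothetical helper: list the zip URLs to try for one book, in order.
# `book_dir` is the book's folder on the mirror, without a trailing slash.
candidate_urls <- function(book_dir, id,
                           suffixes = c("", "-8", "-0"),
                           include_old = FALSE) {
  files <- paste0(id, suffixes, ".zip")
  urls <- paste(book_dir, files, sep = "/")
  if (include_old) {
    # Assumption: archived files sit in an old/ subfolder of the same directory.
    urls <- c(urls, paste(book_dir, "old", files, sep = "/"))
  }
  urls
}

candidate_urls("http://aleph.gutenberg.org/5/0/0/5001", 5001, include_old = TRUE)
```

With `include_old = TRUE` the archived locations are only reached after the regular ones fail, so books that still have a current plain text file are unaffected.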

The web scraping rules are meant to be strict about accessing the Project Gutenberg site itself, but I don't think taking an archived version in old/ is any different or something they'd have a problem with. I'm inclined to accept #20, what do you think?

@maelle
Member

maelle commented May 22, 2019

Silly suggestion: do you know someone at Project Gutenberg who could answer this question?

@juliasilge

Ah, I see what you mean @dgrtwo! Perhaps I was being too literal there and if the mirror has that directory, then why not download it in the same way?

I unfortunately don't know anybody to ask, and I have some skepticism about how useful it would be to just email one of their general lists. Maybe I am wrong, though. 🤔

@harryrampr

The same book is available under another ID; try 30155: http://www.gutenberg.org/ebooks/30155

@RunjiGao

You can try this mirror website: http://mirror.csclub.uwaterloo.ca/gutenberg/

@jonthegeek
Collaborator

Since this book has a new ID and a lot has changed since this issue was opened, we won't address this directly. Thanks for letting us know about the confusion!

@jonthegeek closed this as not planned on Nov 29, 2023

7 participants