Unable to download text with id 5001 #22

kaybenleroll · 2019-04-21T11:49:58Z

I am trying to download Einstein's book on Relativity from the website: it has id '5001'.

Trying to download it using the command:

gutenberg_download('5001')

Warning in .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/5/0/0/5001/5001.zip
Warning: Unknown or uninitialised column: 'text'.

From looking around the site, it looks like the file has moved to 5001-h.zip, but I am not sure how to modify the URL to do this properly.


juliasilge · 2019-05-04T17:10:02Z

Hmmmmm, I am not sure why but it looks like Project Gutenberg no longer has the plain text version of this book available. If you click through to old/ you'll see that it is there but no longer in the main directory.

I'll need to change this section to use a book that is available via plain text from Project Gutenberg.

In the meantime, if you'd like to pick another book to work through the examples, I might suggest something else physics-related. Just make sure the plain text version is available, and that it's in English (or else tf-idf isn't a meaningful statistic).

kaybenleroll · 2019-05-09T13:51:36Z

Perfect - would it be useful to be able to download old versions via that 'old' directory? I could write something up and make a pull request?

juliasilge · 2019-05-16T01:44:58Z

The gutenbergr package does not do web scraping per se; it actually follows Project Gutenberg's rules for robot access. You can check out the R code for this robot access in the gutenbergr package.

We want to be careful to follow Project Gutenberg's own rules for automated traffic, which look like they preclude digging around in the old/ directory. If you are interested, a PR to gutenbergr for a more informative error message could be helpful, though!

maelle · 2019-05-22T07:49:43Z

@juliasilge does this mean the PR #20 should be closed because of Project Gutenberg's rules?

juliasilge · 2019-05-22T13:18:14Z

I believe so, yes. @dgrtwo

dgrtwo · 2019-05-22T16:48:45Z

Hmm, I'm not as confident that Project Gutenberg would have an issue with someone downloading the old/ directory.

We're still downloading from the aleph.gutenberg.org mirror and from the same 5/0/0/etc folder as the other accessible files. Indeed, the script already tries adding -0 and -8 as a suffix since some books have that; I don't think trying out adding old/ as a prefix is meaningfully different.
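To make the comparison concrete, here is a rough sketch of the kind of URL list being discussed. This is not the gutenbergr source: `candidate_urls` is a hypothetical helper, the folder layout is inferred from the URL in the warning above, and the exact placement of `old/` inside the book's folder is an assumption.

```r
# Hypothetical helper, not part of gutenbergr: list the URLs one might try
# for a given book ID on a mirror.
candidate_urls <- function(id, mirror = "http://aleph.gutenberg.org") {
  id <- as.character(id)
  digits <- strsplit(id, "")[[1]]
  # This sketch only handles IDs with at least two digits.
  stopifnot(length(digits) >= 2)
  # The folder path uses every digit but the last: 5001 -> "5/0/0"
  folder <- paste(digits[-length(digits)], collapse = "/")
  base <- paste(mirror, folder, id, sep = "/")
  # Plain name first, then the -0 and -8 suffixes mentioned above
  files <- paste0(id, c("", "-0", "-8"), ".zip")
  # Assumption: archived copies sit in an old/ subfolder of the book's folder
  c(paste(base, files, sep = "/"),
    paste(base, "old", files, sep = "/"))
}

urls <- candidate_urls(5001)
print(urls)
```

The first three entries follow the same pattern as the request that failed in the original report; the last three only differ by the `old/` path segment, on the same mirror and in the same book folder.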

The web scraping rules are meant to be strict about accessing the Project Gutenberg site itself, but I don't think taking an archived version in old/ is any different or something they'd have a problem with. I'm inclined to accept #20, what do you think?

maelle · 2019-05-22T16:58:16Z

Silly suggestion, do you know someone at Project Gutenberg who could answer this question?

juliasilge · 2019-05-25T23:24:04Z

Ah, I see what you mean @dgrtwo! Perhaps I was being too literal there and if the mirror has that directory, then why not download it in the same way?

I unfortunately don't know anybody to ask, and I have some skepticism about how useful it would be to just email one of their general lists. Maybe I am wrong, though. 🤔

harryrampr · 2019-08-25T04:03:35Z

The same book is available under another number; try 30155: http://www.gutenberg.org/ebooks/30155

RunjiGao · 2023-11-12T15:56:19Z

You can try this mirror website: http://mirror.csclub.uwaterloo.ca/gutenberg/
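For anyone landing here later, a minimal sketch of pointing `gutenberg_download()` at a specific mirror through its `mirror` argument. It assumes the gutenbergr package is installed; `download_from_mirror` is a hypothetical wrapper name, the default ID is the alternate one suggested earlier in this thread, and whether that ID is actually served as plain text by this mirror has not been checked here.

```r
# Hypothetical wrapper, not part of gutenbergr.
download_from_mirror <- function(id = 30155,
                                 mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg/") {
  if (!requireNamespace("gutenbergr", quietly = TRUE)) {
    stop("This sketch needs the gutenbergr package installed.")
  }
  # Drop any trailing slash so the mirror URL joins cleanly with the book path
  mirror <- sub("/+$", "", mirror)
  gutenbergr::gutenberg_download(id, mirror = mirror)
}
```

Calling `download_from_mirror()` needs network access. If the book is not available on that mirror, expect the same kind of "Could not download a book" warning as in the original report.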

jonthegeek · 2023-11-29T13:28:44Z

Since this has a new ID and a lot has changed since this issue was opened, we won't address this directly. Thanks for letting us know about the confusion!

juliasilge mentioned this issue Aug 31, 2019

Switch out Project Gutenberg ID for Einstein dgrtwo/tidy-text-mining#66

Closed

jonthegeek closed this as not planned Nov 29, 2023

andrewheiss mentioned this issue May 22, 2024

Many common titles cannot be found on any mirror #55

Open
