This repository has been archived by the owner on May 10, 2022. It is now read-only.
NEW FEATURES
- crm_pdf() and crm_text() lose the cache parameter, which toggled whether or not to use caching. Those functions now always cache requests (#37)
- crm_extract() gains parameter try_ocr (logical, default: FALSE) to optionally try Optical Character Recognition (OCR) to extract pdf pages if the pdf consists of scanned images. Extraction can take a while, but the result is cached, so subsequent requests for the same article will be very fast (#37)
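
As a rough illustration of the new try_ocr parameter, the sketch below calls crm_extract() on a local file. The file name article.pdf is a hypothetical placeholder, and the call only runs if crminer is installed and the file exists; OCR may also need additional packages that are not covered here.

```r
# Sketch only: "article.pdf" is a hypothetical path to a PDF already on disk.
pdf_path <- "article.pdf"

if (requireNamespace("crminer", quietly = TRUE) && file.exists(pdf_path)) {
  # try_ocr defaults to FALSE; TRUE asks crm_extract() to attempt OCR
  # when the pdf turns out to be scanned page images
  res <- crminer::crm_extract(path = pdf_path, try_ocr = TRUE)
  # Inspect the top-level structure of whatever was extracted
  str(res, max.level = 1)
}
```

Because the result is cached, repeating the same call should be much quicker than the first run.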
MINOR IMPROVEMENTS
- crm_plain(), crm_xml(), crm_html(), and crm_text() now cache articles, as crm_pdf() has for a while. Along with this change, caching is now split into separate folders for pdf, txt (for plain), xml, and html (#17)
- internally force Pensoft publisher urls to https from http (#48)
- added docs section User-agent to crm_html(), crm_pdf(), crm_plain(), crm_xml(), and crm_text() detailing how users can set a user agent string with the useragent curl option (#41) (#42)
- fix a link in the README (#47) thanks @salim-b
BUG FIXES
- for Wiley articles, replace the url part "pdf" with "pdfdirect" for better access (#40)
- initially for Wiley specific errors, extracted out internal function try_extract_pdf_errors() to attempt to extract various errors that occur when trying to download and extract text from pdfs (#40)
- eLife specific url fix in crm_links(); the older url was leading to article landing pages (#6)
- fix for cases in which Elsevier returns just the first page of a pdf instead of the whole article. We show the user a warning when this occurs and delete the 1 page pdf file (#43)
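
The Wiley url change in the first bug fix can be pictured with a small string substitution. This is only a sketch: wiley_pdfdirect() is a hypothetical helper and not a crminer function, the url is made up, and the package's internal matching rule may differ from the "/pdf/" path segment assumed here.

```r
# Hypothetical helper (not part of crminer): swap the "pdf" path segment
# for "pdfdirect" in a Wiley-style link.
wiley_pdfdirect <- function(url) {
  # fixed = TRUE treats the pattern as a literal string, not a regex;
  # sub() replaces only the first match
  sub("/pdf/", "/pdfdirect/", url, fixed = TRUE)
}

# Made-up url used purely for illustration
example_url <- "https://onlinelibrary.wiley.com/doi/pdf/10.1000/example"
wiley_pdfdirect(example_url)
```

Urls without a "/pdf/" segment pass through unchanged, so the helper is safe to apply to links from other publishers.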