Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading from more general inputs #278

Open
jennybc opened this issue Mar 1, 2017 · 23 comments
Open

Support reading from more general inputs #278

jennybc opened this issue Mar 1, 2017 · 23 comments
Labels
feature a feature request or enhancement

Comments

@jennybc
Copy link
Member

jennybc commented Mar 1, 2017

Right now readxl reads only from xlsx and xls files.

In the fullness of time, some functionality for supporting more general inputs will be pulled out of readr, at which point readxl can exploit that. We might implement an interim solution for some of these in the meantime.

This issue will cover all related feature requests (all of which I'm now closing):

@jjcad
Copy link

jjcad commented Jul 31, 2017

Question: Along with this support for more general inputs, will that also signal a transition to readr style col_types? Having that kind of unity between the packages would be welcome.

@jennybc
Copy link
Member Author

jennybc commented Jul 31, 2017

The col_types transition is #198. I can't think of any reason that implementation of these two things would be inextricably tied together. They'd more likely happen around the same time because it meant I was working on a new release.

@MichaelChirico
Copy link
Contributor

fread's approach is to download the file & proceed to use plaid-old fread on that.

@jennybc If you're happy with that logic, I'd be happy to file a PR.

@jennybc
Copy link
Member Author

jennybc commented Feb 19, 2018

@MichaelChirico Yeah, that is an option we've contemplated re: "faking" read from a URL. I'm definitely interested in a PR that downloads to a temp file, then reads that.

MichaelChirico pushed a commit to MichaelChirico/readxl that referenced this issue Feb 20, 2018
MichaelChirico pushed a commit to MichaelChirico/readxl that referenced this issue Feb 20, 2018
MichaelChirico pushed a commit to MichaelChirico/readxl that referenced this issue Feb 21, 2018
MichaelChirico pushed a commit to MichaelChirico/readxl that referenced this issue Feb 21, 2018
@jknowles
Copy link

Does this include reading files from within a zip file using the unz() command like below:

zzz <- read_xlsx(unz("data/myzip.zip", "one_of_many.xlsx"))

This would be a handy option for allowing users to access compressed files.

@jennybc
Copy link
Member Author

jennybc commented Mar 22, 2018

@jknowles I rather doubt it. It's especially hard to see that working for xlsx, where we explicitly unpack that into individual XML files.

jennybc added a commit that referenced this issue Apr 16, 2018
@jennybc
Copy link
Member Author

jennybc commented Apr 16, 2018

I started to tackle the URL piece of this as part of #454 and there is also another separate PR for URL download in #426. But Slack discussion has convinced me I really should tackle all of these together and soon-ish. But not for this week's release, which gets the security-patched libxls out there.

Notes from team discussion:

  • Even though download.file() is better behaved now wrt https, it may still be wise to prefer curl::curl_download().
  • Switch over to accepting a connection vs. just a file. Consult what readr and xml2 do: "if given a connection they just use that, otherwise they create the appropriate connection first based on the protocol / file extension(s)". Do it like so so that curl can go in Suggests.
  • This probably means that curl will handle file:// URLs, so can adopt this suggestion at same time.
  • Relevant code in the IDE
  • R source re: reading from file://

@MichaelChirico
Copy link
Contributor

Would appreciate if you could reproduce/summarize the part of the discussion regarding why to keep using curl for https as this might inform the choice in data.table::fread as well.

Thanks again!

@jennybc
Copy link
Member Author

jennybc commented Apr 17, 2018

why to keep using curl for https

It was a rather vague suggestion, based on past experience with things "just working" as expected and consistently across OSes.

@electricstorm
Copy link

electricstorm commented Nov 13, 2018

Has there been any update here?
What's the typical workaround people are using? (not considering writing to disk)

[Update, for others who land here after getting stuck]
A possible workaround (if files are NOT large) is to create a temporary file as defined in one of the comments here.

tmp <- tempfile(fileext = ".xlsx")

httr::GET(url = "url/to/file",
            write_disk( tmp) )

xljh <- read_excel(tmp, skip = n)

@jennybc jennybc added feature a feature request or enhancement and removed future labels Dec 11, 2018
@igordot
Copy link

igordot commented Oct 31, 2019

Sorry. I am a little confused. I saw a commit that added the ability to read from URL in April 2018:

readxl/R/utils.R

Lines 6 to 10 in 839d023

if (grepl("^((http|ftp)s?|sftp)://", path)) {
tmp_file <- tempfile()
utils::download.file(path, tmp_file, mode = "wb")
path <- tmp_file
}

Somehow that edit never got integrated. Was it taken out for a specific reason?

@jennybc
Copy link
Member Author

jennybc commented Oct 31, 2019

@igordot That commit was in a PR that I ultimately did not merge #454.

@igordot
Copy link

igordot commented Oct 31, 2019

@jennybc Thanks for clarifying. I am just not sure if you are postponing URL handling or giving up on it entirely.

@jennybc
Copy link
Member Author

jennybc commented Oct 31, 2019

No, definitely not giving up on it. Looking back at that, it looks like I did not want it wrapped up with some of the other things I was doing in that PR.

@jennybc
Copy link
Member Author

jennybc commented Oct 31, 2019

I think this is the deal: to do the temp file thing to "fake" reading from URL is easy and perhaps I should just do that. But the declared goal here is noticeably higher, which is why it hasn't been handled yet.

@igordot
Copy link

igordot commented Oct 31, 2019

@jennybc Thank you for clarifying again. Dealing with remote data is complicated, so I understand the concern. If my opinion makes any difference, you can add the quick temp file fix for now. If anyone runs into any issues with it, they can always just not use a URL (same as they would now).

@mkoohafkan
Copy link

mkoohafkan commented Feb 26, 2020

right now I'm reading excel files from a URL using

  tf = tempfile(fileext = ".xlsx")
  curl::curl_download(url, tf)
  readxl::read_excel(tf, ...)

But I could also see a more generic approach using a .con method

con = curl::curl(url)
read_excel.con(con, ...)
close(con)

or a .raw method

raw = curl::curl_fetch_memory(url)$content
read_excel.raw(raw, ...)

@gbganalyst
Copy link

@jennybc
Copy link
Member Author

jennybc commented Mar 10, 2020

If you are reading a public Google Sheet, you can also use googlesheets4 without having it transit through the xlsx format. Or a private Sheet, for that matter.

@gbganalyst
Copy link

gbganalyst commented Mar 11, 2020

@jennybc thank you. It works when I converted the Excel file to google sheets format.

I used googlesheets4 package

googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1wyhnEPoRB-JMssCWD_xslh_9G1JQiPXRuT_I0ibW600/edit#gid=74876023")

@sbihorel
Copy link

Hi,

Based upon the thread content, I am not sure if anything has been done to allow input via rawConnection objects. Can you please confirm if that is the case or not?

If it is possible to read from raw connections, can you please indicate how to properly do this?

Thanks

@jennybc
Copy link
Member Author

jennybc commented Aug 11, 2020

readxl cannot read from rawConnection objects.

@stephematician
Copy link

Am I correct that the main roadblock to reading from, say, rawConnection (i.e. from memory), is the unzip() which only works with file paths? It looks like libxls can read from memory via xls_open_buffer while rapidxml is already reading from memory after unzip() has done its work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

10 participants