Duplicating values for all previously merged cells, for .xlsx formats #220

Closed
wants to merge 88 commits into from

burchill

Currently, when merged cells in the .xlsx format become an R object, all individual cells that made up the merged cell are assigned NA, except for the top-left-most cell, which retains the original value. Something like the following:

Column1    Column2    Column3
OneAndTwo             Three
One        Two

becomes:

Column1    Column2    Column3
OneAndTwo  NA         Three
One        Two        NA

I don't think this is appropriate behavior: almost all information about where the merged cells were is lost, and at the very least, it seems less intuitive and less aesthetically pleasing.

To me, the preferred behavior should result in:

Column1    Column2    Column3
OneAndTwo  OneAndTwo  Three
One        Two        Three

In this pull request, I've added code that produces that behavior for .xlsx files. To keep as much of the original code unaltered as possible, the function I added duplicates the attributes and value of the node that holds the merged cell's original value and clones it into the empty cells that were part of the same merged cell, keeping their original references or whatever they're called (e.g. "B3", "F9", etc.).
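To make the intended result concrete, here is a minimal sketch in plain R of the same idea applied to an already-read character matrix. It is not the code in this pull request (which works on the XML nodes); `fill_merged()` and the row/column description of the merged ranges are hypothetical names made up for illustration.

```r
# Hypothetical helper, not part of readxl: copy the top-left value of each
# merged range into every cell of that range.
fill_merged <- function(cells, merged) {
  for (rng in merged) {
    rows <- rng$rows
    cols <- rng$cols
    # The top-left cell is the only one that kept the original value.
    cells[rows, cols] <- cells[rows[1], cols[1]]
  }
  cells
}

# The example from above, as it is currently read (NA in the merged-away cells).
cells <- matrix(
  c("OneAndTwo", NA,    "Three",
    "One",       "Two", NA),
  nrow = 2, byrow = TRUE
)

# Assumed description of the merged ranges (row and column indices).
merged <- list(
  list(rows = 1,   cols = 1:2),  # "OneAndTwo" spans Column1 and Column2
  list(rows = 1:2, cols = 3)     # "Three" spans both rows of Column3
)

filled <- fill_merged(cells, merged)
print(filled)
```

After the call, every cell of a merged range holds the value of its top-left cell, which is the preferred table shown above.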

If people think it would be better to have this behavior be optional, I can add that in. Additionally, I can add tests if need be.

Also, I was testing/editing the code on GitHub, so I have a lot of commits. I don't know what people's philosophy is about that, but if it's preferable, I can go back and get rid of a lot of the annoying commits I made.

@burchill
Author

Added test for merged cells!

@burchill
Author

burchill commented Dec 13, 2016

Can someone try installing this and testing it out on the 'tests/testthat/merge_test.xlsx' file? When I install and run it, it works fine, but Travis CI seems to be not replicating my results.

EDIT: I've run it on two different versions of OSX and it seems fine. No idea why Travis fails the tests I've made.

@jennybc
Member

jennybc commented Jan 19, 2017

readxl is focused on reading rectangular data and the scope does not include dealing with merged cells. For dealing with such headaches, you may want to check out:

https://github.com/nacnudus/tidyxl#readme
https://github.com/rsheets/rexcel#readme

Thanks.

@jennybc jennybc closed this Jan 19, 2017
@burchill
Author

Thanks for the info!

@jameshowison

Apologies if this is the wrong place to continue this discussion but I'm also facing this issue, dealing with data with multiple headers. In the past I've handled it first in python, but I'd like a tidyverse solution. See https://howisonlab.github.io/datawrangling/Handling_multi_indexes.html#a-tidyverse-solution

Could this be helped by a tidyr::fill with direction="right"?

@jennybc
Member

jennybc commented May 1, 2017

I think this sort of table remains squarely NOT in the target zone for readxl. It is in the target zone for packages like tidyxl and jailbreakr, however.

If I were to process this with readxl, which is not crazy, I would use skip and col_names = FALSE to omit all the header rows, so that they don't junk up my data rectangle and mess with the column typing.

Then I would read the header rows in separately, using n_max. I'd probably convert to character matrix and do some filling, collapsing, etc. to get column names. Then apply to the data post hoc.
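A rough sketch of that two-pass approach might look like the following. The helper names `collapse_headers()` and `read_multi_header()` are made up for illustration, and the sketch assumes the header rows and the data rows come back with the same number of columns. `read_multi_header()` is only defined here, not called, since it needs a real file.

```r
# Fill NA header cells from the cell to their left (fine here because the
# header block is all character), then paste the rows together per column.
collapse_headers <- function(hdr) {
  hdr <- as.matrix(hdr)
  for (i in seq_len(nrow(hdr))) {
    for (j in seq_len(ncol(hdr))[-1]) {
      if (is.na(hdr[i, j])) hdr[i, j] <- hdr[i, j - 1]
    }
  }
  unname(apply(hdr, 2, function(x) paste(x[!is.na(x)], collapse = "_")))
}

# Hypothetical wrapper around the two reads described above.
read_multi_header <- function(path, n_header = 2) {
  # Header rows only, all as text so nothing gets type-guessed.
  hdr <- readxl::read_excel(path, col_names = FALSE, n_max = n_header,
                            col_types = "text")
  # Data rectangle only, so header junk cannot affect column typing.
  dat <- readxl::read_excel(path, col_names = FALSE, skip = n_header)
  names(dat) <- collapse_headers(hdr)
  dat
}

# In-memory stand-in for two header rows with merged top-level cells.
hdr <- matrix(
  c("Population", NA,     "GDP",  NA,
    "2000",       "2001", "2000", "2001"),
  nrow = 2, byrow = TRUE
)
new_names <- collapse_headers(hdr)
print(new_names)
```

Note that the fill also runs on the last header row, so a genuinely empty cell there would pick up its left neighbour; restrict the loop to the upper rows if that matters for your sheet.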

You're unlikely to ever see tidyr::fill() with direction right because this does not make sense in general. One of the defining features of a data frame is that each column can be of disparate type. So filling up and down makes sense: you are in the same column. Filling left and right does not make sense. There would have to be lots of checks to make sure the filling doesn't do weird things across variable types.
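As a small illustration of the type problem, here is a base R sketch with a made-up two-column data frame, in which a gap in a numeric column is "filled right" by hand from the character column to its left:

```r
# Made-up data: a character column next to a numeric column with a gap.
df <- data.frame(
  country = c("A", "B"),
  gdp = c(NA, 2.5),
  stringsAsFactors = FALSE
)

# "Filling right" by hand: copy the left neighbour into the gap.
gap <- is.na(df$gdp)
df$gdp[gap] <- df$country[gap]

# The numeric column has been silently coerced to character.
print(class(df$gdp))
print(df)
```

The whole `gdp` column is now character, which is the kind of weirdness a general-purpose rightward fill would have to guard against.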

One minor comment on the code at your link: Since you are setting the NA value here: world_bank <- read_excel(filename, col_names = FALSE,na=".."), you don't need the next line: world_bank[world_bank==".."] <- NA.

I'd be pleased if you want to open a new issue requesting merged cell support, i.e. repeating the value instead of filling with NA. I can imagine how that would work.
