Duplicating values for all previously merged cells, for .xlsx formats #220

burchill · 2016-12-11T21:21:35Z

Currently, when merged cells in the .xlsx format become an R object, all individual cells that made up the merged cell are assigned NA, except for the top-left-most cell, which retains the original value. Something like the following:

Column1	Column2	Column3
OneAndTwo		Three
One	Two	Three

becomes:

Column1	Column2	Column3
OneAndTwo	NA	Three
One	Two	NA

I don't think this is the appropriate behavior: almost all information about where the merged cells were is lost, and at the very least the result seems less intuitive and less aesthetically pleasing.

To me, the preferred behavior should result in:

Column1	Column2	Column3
OneAndTwo	OneAndTwo	Three
One	Two	Three

In this pull request, I've added code that produces that behavior for .xlsx files. To keep as much of the original code unaltered as possible, the function I added duplicates the attributes and value of the node that holds the merged cell's original value and clones it into the empty cells that were once part of the same merged cell, keeping their original cell references (e.g. "B3", "F9", etc.).
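The duplication idea described above can be sketched in isolation. This is a minimal illustration, not the code in this pull request: it assumes the sheet is a plain map from cell reference to value rather than rapidxml nodes, and the names `Sheet`, `parseRef`, `columnIndex`, `columnName`, and `fillMergedRange` are hypothetical. It takes a merge range such as "A2:B2", reads the value of the top-left cell, and writes it to every cell reference in the range.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a worksheet: cell reference -> cell value.
using Sheet = std::map<std::string, std::string>;

struct CellPos { int row; int col; };  // both 1-based

// Convert upper-case column letters ("A", "AB") to a 1-based column index.
int columnIndex(const std::string& letters) {
  int col = 0;
  for (char c : letters) col = col * 26 + (c - 'A' + 1);
  return col;
}

// Convert a 1-based column index back to column letters.
std::string columnName(int col) {
  std::string name;
  while (col > 0) {
    name.insert(name.begin(), static_cast<char>('A' + (col - 1) % 26));
    col = (col - 1) / 26;
  }
  return name;
}

// Split a reference such as "B3" into its row and column.
CellPos parseRef(const std::string& ref) {
  std::size_t i = 0;
  while (i < ref.size() && std::isalpha(static_cast<unsigned char>(ref[i]))) ++i;
  assert(i > 0 && i < ref.size());  // expects letters followed by digits
  return {std::stoi(ref.substr(i)), columnIndex(ref.substr(0, i))};
}

// Copy the top-left value of a merge range (e.g. "A2:B2") into every cell
// of that range, keeping each cell's own reference as the key.
void fillMergedRange(Sheet& sheet, const std::string& range) {
  const std::size_t colon = range.find(':');
  if (colon == std::string::npos) return;  // a single cell is not a merge
  const std::string topLeft = range.substr(0, colon);
  const auto anchor = sheet.find(topLeft);
  if (anchor == sheet.end()) return;  // nothing to duplicate
  const std::string value = anchor->second;
  const CellPos first = parseRef(topLeft);
  const CellPos last = parseRef(range.substr(colon + 1));
  for (int r = first.row; r <= last.row; ++r) {
    for (int c = first.col; c <= last.col; ++c) {
      sheet[columnName(c) + std::to_string(r)] = value;
    }
  }
}
```

With the example table stored in rows 2 and 3, calling `fillMergedRange(sheet, "A2:B2")` and `fillMergedRange(sheet, "C2:C3")` would give B2 and C3 the values of A2 and C2 respectively.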

If people think it would be better to have this behavior be optional, I can add that in. Additionally, I can add tests if need be.

Also, I was testing/editing the code on GitHub, so I have a lot of commits. I don't know what people's philosophy is about that, but if it's preferable, I can go back and get rid of a lot of the annoying commits I made.

burchill · 2016-12-12T21:32:42Z

Added test for merged cells!

burchill · 2016-12-13T18:25:39Z

Can someone try installing this and testing it on the 'tests/testthat/merge_test.xlsx' file? When I install and run it, it works fine, but Travis CI does not seem to reproduce my results.

EDIT: I've run it on two different versions of OSX and it seems fine. No idea why Travis fails the tests I've made.

jennybc · 2017-01-19T04:29:34Z

readxl is focused on reading rectangular data and the scope does not include dealing with merged cells. For dealing with such headaches, you may want to check out:

https://github.com/nacnudus/tidyxl#readme
https://github.com/rsheets/rexcel#readme

Thanks.

burchill · 2017-01-19T17:11:20Z

Thanks for the info!

jameshowison · 2017-05-01T18:00:46Z

Apologies if this is the wrong place to continue this discussion, but I'm also facing this issue while dealing with data with multiple headers. In the past I've handled it first in Python, but I'd like a tidyverse solution. See https://howisonlab.github.io/datawrangling/Handling_multi_indexes.html#a-tidyverse-solution

Could this be helped by a tidyr::fill with direction="right"?

jennybc · 2017-05-01T18:25:15Z

I think this sort of table remains squarely NOT in the target zone for readxl. It is in the target zone for packages like tidyxl and jailbreakr, however.

If I were to process this with readxl, which is not crazy, I would use skip and col_names = FALSE to omit all the header rows, so that they don't junk up my data rectangle and mess with the column typing.

Then I would read the header rows in separately, using n_max. I'd probably convert to character matrix and do some filling, collapsing, etc. to get column names. Then apply to the data post hoc.
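The filling and collapsing of header rows mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not readxl or tidyr code: the header rows are assumed to already be available as a rectangular matrix of strings in which an empty string marks a blank (formerly merged) cell, and the names `HeaderRows`, `fillRight`, and `collapseNames` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation: one inner vector per header row,
// with "" standing in for a blank cell.
using HeaderRows = std::vector<std::vector<std::string>>;

// Within each header row, replace every blank cell with the nearest
// non-blank value to its left.
void fillRight(HeaderRows& rows) {
  for (auto& row : rows) {
    for (std::size_t j = 1; j < row.size(); ++j) {
      if (row[j].empty()) row[j] = row[j - 1];
    }
  }
}

// Collapse the header rows into one name per column by joining the
// non-blank parts from top to bottom with a separator.
std::vector<std::string> collapseNames(const HeaderRows& rows,
                                       const std::string& sep = "_") {
  const std::size_t ncol = rows.empty() ? 0 : rows.front().size();
  std::vector<std::string> names(ncol);
  for (const auto& row : rows) {
    assert(row.size() == ncol);  // header matrix must be rectangular
    for (std::size_t j = 0; j < ncol; ++j) {
      if (row[j].empty()) continue;
      if (!names[j].empty()) names[j] += sep;
      names[j] += row[j];
    }
  }
  return names;
}
```

For a made-up pair of header rows such as {"Population", "", "GDP", ""} over {"2015", "2016", "2015", "2016"}, calling `fillRight` and then `collapseNames` would yield names like "Population_2015" and "GDP_2016", which could then be applied to the separately read data.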

You're unlikely to ever see tidyr::fill() with direction right because this does not make sense in general. One of the defining features of a data frame is that each column can be of disparate type. So filling up and down makes sense: you are in the same column. Filling left and right does not make sense. There would have to be lots of checks to make sure the filling doesn't do weird things across variable types.

One minor comment on the code at your link: Since you are setting the NA value here: world_bank <- read_excel(filename, col_names = FALSE,na=".."), you don't need the next line: world_bank[world_bank==".."] <- NA.

I'd be pleased if you want to open a new issue requesting merged cell support, i.e. repeating the value instead of filling with NA. I can imagine how that would work.

burchill added 30 commits December 8, 2016 22:51

started function to duplicate merged cells

fd608f2

added getColumnName function

bbc5a8e

added getColumn function

b2b5fd8

slight update, need to check tests

082f521

changed readme so I can see the build status

ec798f6

fixing typos

e164f84

first pre_test of duplicateMergedCells

41ebacc

further testinf for duplicateMergedCells

47956bd

more testing

488a140

testing duplicateMergedCells

0275f9e

may god have mercy on my soul

testing how to clone nodes

32a9882

Fatally aborts on merged data

another approach

7f7c1b8

added semicolon

f342312

more typos

d92b74c

one last test

0626ea8

fixed typo, intentionally wrong

51adf1d

another fixing of typos

56bda21

even more testing--new approach

807c03e

more typos

576c97d

technically wrong but maybe wont crash

b1cd6b9

Update XlsxWorkSheet.h

a1e1c7f

testing print

fbdae3f

pray for me fam

851f0f4

revert to before this commit, testing temp printing

1ddc712

added #include rapidxml_print.h

39c3ec3

trying to get it to use rapidxml print functions

b97fcc9

trying yet another print method

2af1366

maybe a pointer was the problem

d91eceb

prints finally working, added before/after

06f4dd3

accidentally adding no attributes

319cbd1

burchill and others added 12 commits December 11, 2016 14:29

Update XlsxWorkSheet.h

827ba52

Update XlsxWorkSheet.h

001bc2c

Update XlsxWorkSheet.h

99bb6a0

Update XlsxWorkSheet.h

3134fbb

Update XlsxWorkSheet.h

8c8ebcc

cleaned it up a bit

af2da13

more cleanup

22f7bd3

Update XlsxWorkSheet.h

10ef6a1

finished cleaning up i think

35f6dfd

changed the build status back

4ad0ee2

reverting whitespace

ab42878

adding tests for merged cells

55d76e8

burchill mentioned this pull request Dec 12, 2016

Reading in merged cells - need to know number of columns they span #166

Closed

burchill and others added 8 commits December 13, 2016 12:19

testing travis thing

8228bc6

fixed testthat errors

9c51372

hmm, maybe this will work?

6b41987

checking what travis doesnt work

d9ac8ed

revert before this point, trying other debugging method

b7dba7e

oops forgot to document

8d44391

hmmm, continuing trying to figure out travis bug

6e305e1

going back, trying to find out why travis acts weird

57f6198

hadley assigned jennybc Jan 18, 2017

jennybc closed this Jan 19, 2017

jameshowison mentioned this pull request May 1, 2017

option to repeat value in merged cells #355

Open
