New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty cells when parsing certain kind of XLS files #225

Closed
namelivia opened this Issue Nov 14, 2018 · 9 comments

Comments

Projects
None yet
2 participants
@namelivia

namelivia commented Nov 14, 2018

I'm trying to parse files provided by a Spanish bank backend using a script and when reading it's contents I've observed that cells come empty. I have also tried opening these files using Libreoffice and the cell contents are correctly displayed, also saving these files with Libreoffice as .XLS fixes them.

However I wanted to investigate further to see what's wrong with the files and to see if I could parse them correctly. I added lots of prints and followed the execution and this is what os making the rows to appear empty for these files:

While reading the worksheets using the method read_worksheet from reader on a "fixed" file :row ops are found first and the worksheet @offsets hash is filled with one entry for each row by the set_row_address method.
However when reading the "non-fixed" file this is not happening (yet).

After that the cell data ops are read while reading the worksheet and as these are included in ROW_BLOCK_OPS the method set_missing_row_address will be executed. Now what I've seen this method does is to create entries on the @offsets hash for those who have indexes not already present on it. In the case of the "non-fixed" file this will populate the hash, on the "fixed file" it will just leave it as it is.

Finally on the "non-fixed" file it will read the :row ops and by calling set_row_address it will override the @offsets hash with different values. I have tried commenting the call to set_row_address and it actually makes the hash values to stay, read_row to be called and correctly parse my "non-fixed" files.

To be clear, on a "fixed" file :row ops come first, then cell data ops.
On the "non-fixed" files the cell data ops come first, and then the :row ops.

My "solution" is pure coincidence and will probably break other files because I'm just focusing on this particular case, plus I have no Excel format knowledge, but if you consider this an issue that could be fixed I hope this explanation could be a starting point for a discussion on how to do it.

@zdavatz

This comment has been minimized.

Owner

zdavatz commented Nov 15, 2018

Can you provide a sample file?

@namelivia

This comment has been minimized.

namelivia commented Nov 15, 2018

Thanks for the fast reply.
Sure! Here is a sample of an original "unfixed" file:
original.zip

@zdavatz

This comment has been minimized.

Owner

zdavatz commented Nov 16, 2018

Can you please also provide the script with witch you try to open and edit the files on you local computer.

Then, can you try the following please:

  1. Open the files you receive from the Spanish Bank via LibreOffice.
  2. Save the files with LibreOffice to a new Filename.
  3. Now try to read the file with Spreadsheet gem. Is the content of the cells now displayed correctly?
@namelivia

This comment has been minimized.

namelivia commented Nov 16, 2018

Yes, opening with Libreoffice and saving them fixes the files, sorry if it wasn't clear on the issue description. When traversing this new files rows are checked first so the problem I described is not happening.
The script is here:
input_file.rb opens the file
converter.rb traverses the rows
transaction.rb reads the cells
But is not doing anything more than opening the file accessing to the first worksheet and iterating the rows and accessing their cells using the [] operator.

@zdavatz

This comment has been minimized.

Owner

zdavatz commented Nov 16, 2018

Yeah, I would be interested to know, what software creates those exported XLS files in the first place. Some binary formatting is not properly done there, I think.

@namelivia

This comment has been minimized.

namelivia commented Dec 3, 2018

I did alter some of the information there using a hex editor trying to preserve the structure in order to protect the data privacy.
I had no time to work on this but I'm still interested in taking a look at it, I assumed the problem was that the file was not properly generated by the bank backend (I have no info on what tool do they use) but the fact that libreoffice or google sheets open the file with no errors makes me think there is something more. I will take a look myself as soon as I have some spare time for this. Thanks for the patience!

@zdavatz

This comment has been minimized.

Owner

zdavatz commented Dec 4, 2018

Google or LibreOffice have more resources. If you can edit - with spreadsheet gem - the file generated by the bank after you saved it with LibreOffice, then it is the Software that generates the file that has the issue. You are not the first person having this issue.

@namelivia

This comment has been minimized.

namelivia commented Dec 4, 2018

I see your point and agree with you then, in that case is better not to alter the gem code. I will patch the method in the tool used to parse this specific kind of files as it is a specific file issue, it seems more appropriate.
I'll close the issue now! Thank you for your interest and fast responses!

@namelivia namelivia closed this Dec 4, 2018

@zdavatz

This comment has been minimized.

Owner

zdavatz commented Dec 5, 2018

Thank you too!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment