Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get many _x000d if cell contain \n #16

Closed
Jack-Myth opened this issue Aug 2, 2019 · 6 comments
Closed

get many _x000d if cell contain \n #16

Jack-Myth opened this issue Aug 2, 2019 · 6 comments
Labels
ready to close Pull request has been answered or implemented & is pending closure

Comments

@Jack-Myth
Copy link

Jack-Myth commented Aug 2, 2019

I'm trying to use OpenXLSX to read XLSX. It's a great library but I find something strange. If a cell's value contains multiple lines, the "\n" will be read as x000d.
"abc
abc" will become "abc_x000d__000d
..abc"
any idea?

@Jack-Myth Jack-Myth changed the title get many _x000d if cell contain get many _x000d if cell contain \n Aug 2, 2019
@troldal
Copy link
Owner

troldal commented Aug 13, 2019

Hi @Jack-Myth

I'm afraid I can't recreate your issue. Can you show the code you are using as well an Excel spreadsheet? Also, can you Tell what system you are using (Operating system, version etc.)?

I made a small test myself; here are my results:

I made a spreadsheet called "TestBook.xlsx", with one sheet called "Sheet1" with the following contents:

Screenshot 2019-08-13 at 17 59 04

Then I ran the following code:

#include <iostream>
#include <iomanip>
#include <OpenXLSX/OpenXLSX.h>

using namespace std;
using namespace OpenXLSX;

int main() {

    XLDocument doc;
    doc.OpenDocument("./TestBook.xlsx");
    auto wks = doc.Workbook().Worksheet("Sheet1");

    cout << "In TestBook:" << endl;
    cout << wks.Cell("A1").Value().Get<std::string>() << endl;
    cout << wks.Cell("A2").Value().Get<std::string>() << endl;
    cout << wks.Cell("A3").Value().Get<std::string>() << endl;

    wks.Cell("B1").Value() = "B1: Line1";
    wks.Cell("B2").Value() = "B2: Line1\n B2: Line 2";
    wks.Cell("B3").Value() = "B1: Line1\n B2: Line 2\n B2: Line 3";

    doc.SaveDocument();
    doc.CloseDocument();

    return 0;
}

The resulting console output:
Screenshot 2019-08-13 at 18 11 41

...and the "TestBook.xlsx" after running the code:
Screenshot 2019-08-13 at 18 02 15
(Note that when you write multiple lines in an Excel cell, the cell is automatically word wrapped; the same does not happen in OpenXLSX. Therefore, the image above shows the cell contents of the B-column as one long line)

@Jack-Myth
Copy link
Author

Thanks for your reply.
I think it maybe not caused by the '\n', but in some situation it will do return many x000d.
Actually I can give you an xlsx file, you may find something in it.
test.xlsx
It's a part of a huge xlsx.
And also I write a test program, here is the code:

#include <stdio.h>
#include "OpenXLSX/OpenXLSX.h"
#pragma comment(lib,"OpenXLSX.lib")

int main(int argc, char* argv[])
{
	if (argc == 0)
		return -1;
	OpenXLSX::XLDocument doc;
	doc.OpenDocument(argv[1]);
	auto TargetWorkSheet = doc.Workbook().Worksheet(doc.Workbook().WorksheetNames()[0]);
	for (unsigned int i = 1; i <= TargetWorkSheet.ColumnCount(); i++)
	{
		printf("%s\n", TargetWorkSheet.Row(1).Cell(i).Value().AsString().c_str());
	}
}

It's very simple and just print the first row, but with it you can see the error:
image

And, by the way, I find the xlsx saved by OpenXLSX can't be read by a python library named "xlrd", But it can be read by "openpyxl". so now I use openpyxl load the xlsx and resave it so the xlrd can read the file, but openpyxl is very slow , it caused many time to resave the xlsx. so do you mind spend a little bit time on this problem?

@troldal
Copy link
Owner

troldal commented Aug 16, 2019

I looked into this, and it turns out that x000D is the UTF-8 code for "\r" (carriage return), not "\n". The difference is that "\n" jumps to the beginning of the next line, whereas "\r" jumps to the beginning of the current line.

The problem actually lies in the spreadsheet you use, not in OpenXLSX, because Excel actually stores the "\r" character as a string with the UTF-8 code (i.e. "_x000D_"), not the "\r" character itself. So it is difficult to come up with good solution to the problem. Indeed, other Excel libraries have had the same problem (for example: jmcnamara/libxlsxwriter#189).

Also, when I open the test.xlsx file in excel, it behaves strangely. When I select the cell, the contents is reformatted, as if the "\r" characters are replaced with "\n".

Solution

If you are not able to fix the spreadsheet itself (maybe you imported data from a text file?), then you can do something like the following:

#include <iostream>
#include <regex>
#include <OpenXLSX/OpenXLSX.h>

using namespace std;
using namespace OpenXLSX;

int main() {

    XLDocument doc;
    doc.OpenDocument("./test.xlsx");
    auto sht = doc.Workbook().Worksheet("Sheet1");
    auto text = std::regex_replace(sht.Cell("A1").Value().AsString(), regex("_x000D_"), "\n");

    cout << text << endl;

    return 0;
}

This will take the contents of a cell and replace all instances of the substring "_x000D_" with "\n". You can also replace it with "\r", but that is almost certainly not what you want to do (actually, Python/xlrd does that, resulting in only the last line of the cell contents being printed; this makes it very difficult to understand what is going on).

Regarding your question about opening .xlsx files made with OpenXLSX with Python/xlrd, I have fixed that now.

Please let me know if everything works out for you.

@troldal troldal added the ready to close Pull request has been answered or implemented & is pending closure label Aug 18, 2019
@troldal troldal closed this as completed Aug 26, 2019
@Jack-Myth
Copy link
Author

Sorry for my late reply, I have some personal problem last week and can't test it in time.
Thanks for your work, I have test it just now and find the xlsx can be read by xlrd, it help me saved a lot of time.
But,There still have a issue, I have fix it by myself but I think it better to let you know.
This issue seems not due to the OpenXLSX, but the other library named Zippy.
The issue is: When I trying to save the document,It bring the temp file to the target folder,but doesn't change it's name.So it cause this situation after save:
0
the target file "test.xlsx" doesn't have anything, It's an empty xlsx. and the temp file "***.tmp" is the actual xlsx.
After read the source code I find it cause by the strange behavior of these two code:
1
code is at Zippy/ZipArchive.h :442
I replace these two code to:
image
It load the tmp file and write it back to the target file,and it works fine.
I don't know why the std::rename broken but it do have some problems.

@troldal
Copy link
Owner

troldal commented Oct 12, 2019

Sorry for the delay. The last few months have been really busy!

I am unable ro reproduce the error you mention. I have tested it with Visual Studio 2019 Community Edition, using both the 32- and 64-bit compiler, with no problems.

According to these references, the std::remove and std::rename functions are available in Visual Studio:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rename-wrename?view=vs-2019
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/remove-wremove?view=vs-2019

I will update the library to check the return value of std::remove and std::rename. If an error code is returned, the library will throw an exception.

@MarvinXu
Copy link

@Jack-Myth I came across the same issue. I've read your test.xlsx using pandas, and I can see 'x00D' symbols. Then I opened your file in Excel and click "Save as", the 'x00D' symbol became '\r'...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready to close Pull request has been answered or implemented & is pending closure
Projects
None yet
Development

No branches or pull requests

3 participants