Cell.Value is corrupted with furigana #25

Closed
udaken opened this issue Mar 29, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@udaken

udaken commented Mar 29, 2022

An xlsx file created with the Japanese version of Excel 2019 contains an annotation called Furigana.
When that xlsx file is read with NanoXLSX, the furigana is mixed into Cell.Value, resulting in an incorrect value.

Cell.Value is expected to ignore the furigana.

Steps to reproduce the problem:

  1. Use tokyo.xlsx.
  2. Execute the following code.
     using System;
     using NanoXLSX;

     // Load the workbook and print the value of cell A1
     var wb = Workbook.Load("tokyo.xlsx");
     Console.WriteLine(wb.Worksheets[0].Cells["A1"].Value);

     // Expected output: "ここは東京"
     // Actual output:   "ここは東京トウキョウ"
@udaken
Author

udaken commented Mar 29, 2022

My guess is that we can just ignore the rPh tag.
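
To illustrate that guess, here is a minimal sketch in C# using System.Xml.Linq. It is not NanoXLSX's reader code. The XML fragment is a hand-written assumption of what the shared string entry for A1 might look like (it was not extracted from tokyo.xlsx), and the class and method names are hypothetical. It contrasts concatenating every text node with skipping text nested inside rPh.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SharedStringSketch
{
    // Main SpreadsheetML namespace used by sharedStrings.xml
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Naive reading: concatenate every <t> below <si>, including those inside <rPh>.
    public static string AllText(XElement si) =>
        string.Concat(si.Descendants(Ns + "t").Select(t => t.Value));

    // Reading that ignores phonetic runs: skip any <t> that has an <rPh> ancestor.
    public static string BaseText(XElement si) =>
        string.Concat(si.Descendants(Ns + "t")
            .Where(t => !t.Ancestors(Ns + "rPh").Any())
            .Select(t => t.Value));

    public static void Main()
    {
        // Assumed shape of a shared string item with furigana (illustrative only)
        XElement si = XElement.Parse(
            "<si xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">" +
            "<t>ここは東京</t>" +
            "<rPh sb=\"3\" eb=\"5\"><t>トウキョウ</t></rPh>" +
            "<phoneticPr fontId=\"1\"/>" +
            "</si>");

        Console.WriteLine(AllText(si));  // prints "ここは東京トウキョウ"
        Console.WriteLine(BaseText(si)); // prints "ここは東京"
    }
}
```

Filtering on the ancestor rather than on direct children also covers text inside rich-text runs, as long as those runs are not themselves inside an rPh element.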

@rabanti-github
Owner

Hello

Thank you for the report.

So, have I understood the issue correctly?
The Furigana part above the Kanji should never be visible in Excel, but it should be preserved when you copy the value of a cell to a program that handles Furigana?
I am asking because if we ignore rPh, the Furigana characters will be lost.

I will have a look at the formatting in the shared string table. Inline formatting is at the top of my to-do backlog.

Thank you in advance for clarification.

@udaken
Author

udaken commented Mar 29, 2022

Yes, Excel does not display the furigana part.

I have been using files containing Japanese for over 10 years and have never needed furigana.
Even when I edit files in Excel for the web, the furigana is lost.

Therefore, my conclusion is that there is no problem with lost furigana.

For a program to preserve the furigana, you generally need a different object model than a single value.
VBA provides Characters.PhoneticCharacters.

@rabanti-github
Owner

rabanti-github commented Mar 29, 2022

OK, thank you.

When I implemented the shared string table, I decided to remove all paragraphs and keep all enclosed text, to preserve any content. However, since Furigana is only a transcription of the Kanji, its presence breaks the sentence/word structure, as you already stated.

Let me look for the best solution to preserve formatted text and remove transcriptions. I have to analyze the rPh tag (and possibly others) a little. Maybe it can always be ignored, or there will be an import option if there are use cases where rPh content is meaningful.

Please give me some time to do this. Currently, I am working on a big refactoring (dev branch), and I have to apply any change to several branches and projects.

Thank you very much.

rabanti-github added the bug and investigate labels Mar 29, 2022
@rabanti-github
Owner

I have some good news.
The rPh tag seems to be designated only for transcriptions (such as Furigana) of Han-originated characters, so it can be removed from the original text without side effects.
I will do this by default, with an opt-in import option to keep the transcriptions, enclosed in brackets, if desired.
Please give me a few days to implement and test this feature.
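
The plan above (drop rPh content by default, optionally keep it in brackets) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that went into NanoXLSX. It assumes that the sb and eb attributes of rPh are zero-based indices into the base text with eb exclusive, and that round brackets are used. The class name, method name, and keepPhonetics parameter are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class PhoneticSketch
{
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Reads one <si> element. With keepPhonetics == false the phonetic runs are
    // dropped; with true, each transcription is appended in brackets after the
    // base characters it belongs to.
    public static string Read(XElement si, bool keepPhonetics)
    {
        // Base text: every <t> that is not nested inside an <rPh> run
        var text = new StringBuilder(string.Concat(
            si.Descendants(Ns + "t")
              .Where(t => !t.Ancestors(Ns + "rPh").Any())
              .Select(t => t.Value)));
        if (!keepPhonetics)
        {
            return text.ToString();
        }

        // Insert from the back so that earlier indices stay valid
        foreach (XElement run in si.Elements(Ns + "rPh")
                     .OrderByDescending(r => (int)r.Attribute("eb")))
        {
            // Assumption: eb is the exclusive end index of the base characters
            int end = Math.Min((int)run.Attribute("eb"), text.Length);
            string reading = string.Concat(run.Elements(Ns + "t").Select(t => t.Value));
            text.Insert(end, "(" + reading + ")");
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Hand-written fragment for illustration, not taken from a real workbook
        XElement si = XElement.Parse(
            "<si xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">" +
            "<t>お元気ですか</t>" +
            "<rPh sb=\"1\" eb=\"3\"><t>ゲンキ</t></rPh>" +
            "</si>");

        Console.WriteLine(Read(si, false)); // prints "お元気ですか"
        Console.WriteLine(Read(si, true));  // prints "お元気(ゲンキ)ですか"
    }
}
```

Processing the runs in descending order of eb avoids having to track an offset while inserting into the string.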

rabanti-github added the pending label and removed the investigate label Mar 29, 2022
@rabanti-github
Owner

rabanti-github commented Mar 30, 2022

Hello
Technically, I already have a solution.
However, I have insufficient test data, and it looks like my western Office refuses to carry Furigana at all. I tried to copy-paste from a web browser and Word, but Excel never created an rPh tag.

Could you provide me with some additional samples, where Kanji characters are transcribed by Furigana? I would suggest something like this:

  • お元気ですか --> Two connected Kanji surrounded by Hiragana
  • この所は賑やかです。 --> Two disconnected Kanji, surrounded by Hiragana
  • 金 --> One single Kanji
  • 富士山は高いです。 --> Several Kanji at the beginning of the text and Hiragana at the end
  • この儘 --> Hiragana with Kanji at the end of the text

If you could provide me an Excel file with these (or similar) use cases, I could test my patch properly. Currently, I cannot predict the right tag structure.

Thank you very much in advance.

@udaken
Author

udaken commented Mar 31, 2022

Thank you for your quick response.

We will gladly provide sample data. Please wait a moment.

@udaken
Author

udaken commented Mar 31, 2022

Sample data attached.
I hope this will be useful to you.

furigana.xlsx

@rabanti-github
Owner

rabanti-github commented Apr 1, 2022

I'm back.
Thank you for the samples. They helped me a lot.
I just released v1.8.6. The NuGet package will be available in a few minutes.
By default, Furigana will now be discarded. If the transcription is wanted, there is an import option "EnforcePhoneticCharacterImport" that can be set to true. In this case, Furigana will be appended to the Kanji characters:

Default | With Import Option
お元気ですか | お元気(ゲンキ)ですか
この所は賑やかです | この所(トコロ)は賑(ニギ)やかです
金 | 金(カネ)
富士山は高いです。 | 富士山(フジサン)は高(タカ)いです。
この儘 | この儘(ママ)
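
A usage sketch for the option named above, assuming NanoXLSX v1.8.6 or later. Two things are assumptions: that EnforcePhoneticCharacterImport is a property of the ImportOptions class, and that an ImportOptions instance can be passed as the second argument of Workbook.Load. The cell address A1 is also assumed, since the layout of furigana.xlsx is not described here.

```csharp
using System;
using NanoXLSX;

public static class FuriganaImportExample
{
    public static void Main()
    {
        // Default behavior since v1.8.6: phonetic runs (Furigana) are discarded
        Workbook plain = Workbook.Load("furigana.xlsx");
        Console.WriteLine(plain.Worksheets[0].Cells["A1"].Value);

        // Opt in to keep the transcription, appended in brackets after the Kanji
        // (assumption: the option lives on ImportOptions and is passed to Load)
        var options = new ImportOptions { EnforcePhoneticCharacterImport = true };
        Workbook annotated = Workbook.Load("furigana.xlsx", options);
        Console.WriteLine(annotated.Worksheets[0].Cells["A1"].Value);
    }
}
```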

Please let me know if something is not working as expected.
Thank you again for the report.

rabanti-github removed the pending label Apr 2, 2022
@rabanti-github
Owner

I will close this issue for now. If you try out the fix and find something that is not as you expect, please do not hesitate to re-open it or file a new issue.
Thank you

@udaken
Author

udaken commented Apr 13, 2022

Thanks for fixing it.
The new version works correctly.
If I find any problems, I will report them.
