Gibberish output in tabula-java for Japanese PDF but works in Tabula #513

zwong · 2023-01-07T13:18:05Z

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish.

However, when using the standalone Tabula tool, the text is encoded properly.

After searching online, I tried the following with no success:

Setting -Dfile.encoding=utf8
Setting chcp 65001
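
For context on why those two settings may not be enough on their own: `-Dfile.encoding` sets the JVM's default charset and `chcp 65001` sets the console code page, while Python picks its own defaults separately. A minimal sketch for checking what the Python side is using; the helper name `report_encodings` is hypothetical and not part of tabula-py.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python would use by default in this process."""
    return {
        # Encoding used when print() writes to the terminal or a pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default for open() and text-mode subprocess pipes when no encoding is passed.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

If any of these report a legacy code page rather than UTF-8, that is one place where correctly encoded output from Java could be decoded incorrectly.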

I understand Tabula and tabula-java use the same library, but is there something different between the two that would explain the difference in output?

zwong · 2023-01-10T23:36:19Z

After further testing with output to CSV, I found that the gibberish results only happen in tabula-py. tabula-java appears to produce a CSV that is properly encoded. Closing this issue.

jeremybmerrill · 2023-01-12T15:03:59Z

@zwong glad you figured it out. The combination of Windows, Command Prompt, Java and tabula-py is a complicated one! I no longer really remember the wizardry needed to make the Windows Command Prompt cooperate. Have you tried having tabula-py output a CSV? I wonder if the CSV is correct, but the program you use to open it (e.g. Excel) is guessing the encoding incorrectly?
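
To illustrate the "incorrectly guessing the encoding" scenario, here is a minimal sketch showing how correct UTF-8 bytes turn into gibberish when decoded with a legacy code page, plus a strict check for whether raw bytes are valid UTF-8. The sample string and the helper names `decode_as` and `is_valid_utf8` are made up for illustration; cp1252 stands in for whatever "ANSI" code page a given Windows machine uses.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode bytes with the given encoding, replacing undecodable bytes."""
    return raw.decode(encoding, errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "日本語の表"           # illustrative text, not taken from the PDF
raw = sample.encode("utf-8")    # what a correctly encoded CSV holds on disk

right = decode_as(raw, "utf-8")   # same text as `sample`
wrong = decode_as(raw, "cp1252")  # wrong guess: the bytes are fine, the view is not

if __name__ == "__main__":
    print(right)
    print(wrong)
    print(is_valid_utf8(raw))
```

Running `is_valid_utf8` on the bytes of the generated CSV (read with `open(path, "rb")`) is a quick way to tell a bad file from a bad viewer.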

zwong · 2023-01-13T13:39:55Z

Thank you. To work around the issue in tabula-py, I ended up doing something similar to what you suggested: I output a CSV and then read it into Python. The encoding is correct, and this sidesteps a lot of the issues I faced when trying to import the PDF data directly. With Excel, I learned that I had to set the encoding explicitly; otherwise it would read the data as ANSI (I'm guessing). The next issue is importing the data into the correct columns so I can start processing it!
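
A rough sketch of that workaround, assuming the CSV produced by tabula is UTF-8. `tabula.convert_into` is tabula-py's function for writing a CSV file directly; the helper names and the file paths are hypothetical. The last helper re-saves the rows with a byte order mark, which Excel generally uses to recognize UTF-8 without being told the encoding.

```python
import csv


def extract_pdf_to_csv(pdf_path, csv_path):
    """Have tabula write the tables straight to a CSV file."""
    # Imported lazily so the CSV helpers below work without tabula-py or Java.
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")


def read_csv_rows(csv_path, encoding="utf-8"):
    """Read the CSV with an explicit encoding instead of the platform default."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


def write_csv_for_excel(rows, csv_path):
    """Write rows as UTF-8 with a BOM so Excel can detect the encoding."""
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as handle:
        csv.writer(handle).writerows(rows)


if __name__ == "__main__":
    # Hypothetical file names; requires tabula-py and Java to be installed.
    extract_pdf_to_csv("input.pdf", "tables.csv")
    rows = read_csv_rows("tables.csv")
    write_csv_for_excel(rows, "tables_excel.csv")
```

Passing `encoding` explicitly matters most on Windows, where `open()` otherwise falls back to the locale's code page.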

zwong closed this as completed Jan 10, 2023
