Gibberish output in tabula-java for Japanese PDF but works in Tabula #513

Closed
zwong opened this issue Jan 7, 2023 · 3 comments

Comments

@zwong

zwong commented Jan 7, 2023

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish.
image

However, when using the standalone Tabula tool, the text is encoded properly:
image

After searching online, I tried the following, with no success:

  1. Setting the Java option -Dfile.encoding=utf8
  2. Running chcp 65001 in the Command Prompt

I understand Tabula and tabula-java use the same library, but is there something different between the two that would explain the difference in output?
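
For context, gibberish of this kind often comes from text being written in one encoding and read back in another. The minimal Python sketch below illustrates that general mechanism only. The sample string and the choice of cp1252 as the "wrong" code page are assumptions for illustration, not a confirmed diagnosis of the PDF in this issue.

```python
# Hypothetical sample text; it is not taken from the PDF in this issue.
text = "日本語の表"

# The bytes a UTF-8 aware program would write.
utf8_bytes = text.encode("utf-8")

# Reading those bytes with a legacy Windows code page (cp1252 is assumed here)
# produces mojibake. errors="replace" is needed because some byte values are
# undefined in cp1252.
garbled = utf8_bytes.decode("cp1252", errors="replace")

# Reading them with the encoding they were written in restores the text.
restored = utf8_bytes.decode("utf-8")

print("garbled: ", garbled)
print("restored:", restored)
```

If the mismatch happens only at display time (for example in a console), the underlying bytes may still be fine, which is why checking a file written to disk can be more informative than looking at printed output.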

@zwong
Author

zwong commented Jan 10, 2023

After further testing with output to CSV, I found that the gibberish results only happen in tabula-py. tabula-java appears to output a CSV that is properly encoded. Closing this issue.

zwong closed this as completed Jan 10, 2023
@jeremybmerrill
Member

@zwong glad you figured it out. The combination of Windows, Command Prompt, Java and tabula-py is a complicated one! I no longer remember the wizardry needed to make the Windows Command Prompt cooperate. Have you tried having tabula-py output a CSV? I wonder if the CSV is correct, but the program you use to open it (e.g. Excel) is guessing the encoding incorrectly.
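
One way to separate "the CSV is wrong" from "the viewer is guessing wrong" is to look at the file's raw bytes instead of opening it in Excel. The sketch below is a rough check under that assumption; `decodes_as` and `has_utf8_bom` are hypothetical helper names, and the path passed in would be whatever CSV the extraction produced.

```python
import codecs
from pathlib import Path


def decodes_as(path, encoding):
    """Return True if the file's raw bytes decode cleanly with `encoding`."""
    data = Path(path).read_bytes()
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)
```

If `decodes_as(path, "utf-8")` is true and the decoded text looks right in a UTF-8 aware editor, the file itself is probably fine and the problem is more likely in how it is being opened. A clean decode is not proof on its own, since some byte sequences are valid in more than one encoding.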

@zwong
Author

zwong commented Jan 13, 2023

> @zwong glad you figured it out. The combination of Windows, Command Prompt, Java and tabula-py is a complicated one! I no longer remember the wizardry needed to make the Windows Command Prompt cooperate. Have you tried having tabula-py output a CSV? I wonder if the CSV is correct, but the program you use to open it (e.g. Excel) is guessing the encoding incorrectly.

Thank you. To work around the issue in tabula-py, I ended up doing something similar to what you suggested: I output a CSV and then read it into Python. The encoding is correct, and this sidesteps a lot of the issues I faced when trying to import the PDF data directly. With Excel, I learned that I had to set the encoding explicitly; otherwise it would read the data as ANSI (I'm guessing). The next issue is importing the data into the correct columns so I can start processing it!
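
A minimal sketch of that workaround, assuming tabula-py is installed and Java is available. The function names and file paths are hypothetical; `tabula.convert_into` is the tabula-py call that writes extracted tables to a file. The CSV is then read with an explicit encoding, and a copy can be written as `utf-8-sig` (UTF-8 with a byte order mark), which Excel generally uses to recognize UTF-8.

```python
import csv


def pdf_to_csv(pdf_path, csv_path):
    """Write the tables of a PDF to a CSV file via tabula-py."""
    # Imported lazily so the CSV helpers below work without tabula-py installed.
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")


def read_csv_rows(path, encoding="utf-8"):
    """Read a CSV into a list of rows, with the encoding stated explicitly."""
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


def write_csv_for_excel(rows, path):
    """Write rows as UTF-8 with a BOM so Excel is less likely to misdetect it."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)
```

Usage would be along the lines of `pdf_to_csv("input.pdf", "out.csv")` followed by `rows = read_csv_rows("out.csv")`. Stating the encoding explicitly avoids depending on the platform default, which on Windows is often a legacy code page.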
