Multiple table in 1 page #22

Open
leeper opened this issue Sep 22, 2016 · 4 comments

@leeper
Member

leeper commented Sep 22, 2016

Migrated from ropensci/tabulizerjars#1 (@khun84)

Is there a parameter I can pass in to extract more than 1 table per page?

I have a pdf page with 2 tables:

  • table 1 has 2 columns and multiple rows
  • table 2 has 2 columns and multiple rows (but some of the cells are merged).

I use the extract_tables() function with default parameters and the output only has 1 table (table 1).

What I can think of is to set method = 'asis', but I do not know how to proceed with the output Java object. Is there any documentation I can refer to?

@leeper
Member Author

leeper commented Sep 22, 2016

@khun84 Yes, you can specify the page number twice, along with the area (or use the extract_areas() function to specify those areas interactively).

So something like extract_areas(file, pages = c(1,1)). This will give you the chance to extract two different areas from a given page.
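For the non-interactive route, a minimal sketch might look like the following. It assumes extract_tables() accepts a list of areas, one per entry in pages, with each area given as c(top, left, bottom, right), and a guess argument that switches off automatic detection. The file name and all coordinates are hypothetical placeholders, not values from this thread.

```r
library(tabulizer)

# Hypothetical file path; replace with your own PDF.
pdf_file <- "report.pdf"

# One area per table, each as c(top, left, bottom, right).
# These coordinates are made up for illustration.
areas <- list(
  c(50, 30, 300, 560),   # assumed region of table 1
  c(320, 30, 700, 560)   # assumed region of table 2
)

tables <- extract_tables(
  pdf_file,
  pages = c(1, 1),  # page 1 listed once per area
  area  = areas,
  guess = FALSE     # use the supplied areas rather than auto-detection
)

# tables should then be a list with one element per requested area.
```

Hard-coded areas only work while the tables stay in the same position on the page, so this suits documents with a fixed layout.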

You can pursue the Java approach, but it's really only useful if you know the underlying tabula Java library well; and that is not very well documented anywhere.

@khun84

khun84 commented Sep 22, 2016

Thanks for the clarification. I've tried extract_areas(file, c(1, 1)), but it returns the same table twice. If I have to explicitly define the area for both tables, then my code will break when the position of the tables changes.

Is there any function that can return the entire content of the PDF in a DOM-like format? In that case, I can traverse the DOM tree and extract what I want.

@SteveLane

Hi @leeper, I've recently run into similar issues, but with multi-page documents and a random number of tables per page. I found that the 'spreadsheet' method, on the command line and/or via Tabula's interface, will pull them all out. The write_csv function writes them all out correctly (at least in the cases I've tested), but the list_matrices function doesn't.

I've edited the list_matrices function, if you're happy to take a pull request.

@leeper
Member Author

leeper commented Dec 21, 2016

Yes, please send a PR!
