Pdf ideas for examples #6

Open
maelle opened this issue May 2, 2016 · 8 comments

Comments

@maelle
Member

maelle commented May 2, 2016

@psychemedia

The area argument is available. For example:

extract_tables("Lap Analysis.pdf", pages = 8, guess = FALSE, area = list(c(178, 10, 550, 50)))

The area parameter appears to take co-ordinates in the form: top, left, width, height.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.

However, the tabula app console output gives co-ordinates in the form top, left, bottom, right, so you need to do some sums to convert these numbers to the arguments that the tabulizer area parameter wants.
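For anyone wanting to script those sums, here is a minimal sketch in base R. It assumes the behaviour described in this comment (area read as top, left, width, height), which the maintainer's reply further down calls a bug, so the conversion should not be needed once that fix is in. The helper name `tabula_to_tlwh` is made up for illustration and is not part of tabulizer.

```r
# Hypothetical helper (not part of tabulizer): converts Tabula-style
# coordinates (top, left, bottom, right) into the (top, left, width, height)
# form described in the comment above.
tabula_to_tlwh <- function(top, left, bottom, right) {
  # Guard against swapped corners, which would give negative sizes
  stopifnot(bottom >= top, right >= left)
  c(top = top, left = left, width = right - left, height = bottom - top)
}

# A Tabula selection of top = 178, left = 10, bottom = 228, right = 560
# maps back to the area vector used in the example call.
area <- tabula_to_tlwh(178, 10, 228, 560)
print(area)
```

The result could then be passed as `area = list(unname(area))`, again only under the width/height reading described here.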

/via

@leeper
Member

leeper commented May 2, 2016

@psychemedia The area specification is a bug in my code. I'm pushing a fix for it right now. It should be top,left,bottom,right just like in Tabula.

@maelle
Member Author

maelle commented May 2, 2016

@leeper new terrible example, http://photos.state.gov/libraries/india/231771/PDFs/jan-dec_2015.pdf (the csv here being incomplete). It's US data, 187 pages, I'll report tomorrow once I've scraped it. Have I already said your pkg is awesome? 😁

leeper added a commit that referenced this issue May 2, 2016
@leeper leeper added this to the CRAN Release milestone May 2, 2016
@maelle
Member Author

maelle commented May 5, 2016

I have used the tabulizer package here https://github.com/masalmon/usaqmindia/blob/master/inst/pm25_consulate.R but it's a pretty boring example.

@leeper
Member

leeper commented May 7, 2016

This IRS document might work well as an example: https://www.irs.gov/pub/irs-soi/14databk.pdf

> extract_areas(tmp, pages = c(14, 15, 17, 18), method = "data.frame")
> str(.Last.value)
List of 6
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X   : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.1.: chr [1:54] "239,874,741 " "3,074,293 " "584,480 " "4,485,975 " ...
  ..$ X.2.: chr [1:54] "2,220,921 " "17,613 " "3,362 " "33,844 " ...
  ..$ X.3.: chr [1:54] "4,642,817 " "50,438 " "9,160 " "83,945 " ...
  ..$ X.4.: chr [1:54] "3,799,428 " "45,905 " "7,383 " "84,956 " ...
  ..$ X.5.: chr [1:54] "147,444,789 " "2,048,463 " "357,733 " "2,805,861 " ...
  ..$ X.6.: chr [1:54] "23,608,340 " "252,431 " "47,482 " "430,138 " ...
  ..$ X.7.: chr [1:54] "3,205,595" "29,602" "4,178" "49,609" ...
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X    : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.8. : chr [1:54] "617,649 " "5,365 " "1,067 " "7,563 " ...
  ..$ X.9. : chr [1:54] "30,065,749 " "353,564 " "79,939 " "508,257 " ...
  ..$ X.10.: chr [1:54] "34,132 " "255 " "38 " "410 " ...
  ..$ X.11.: chr [1:54] "334,641 " "3,163 " "567 " "4,626 " ...
  ..$ X.12.: chr [1:54] "987,238 " "15,016 " "3,433 " "9,225 " ...
  ..$ X.13.: chr [1:54] "1,467,402 " "16,792 " "4,682 " "19,344 " ...
  ..$ X.14.: chr [1:54] "21,446,040" "235,686" "65,456" "448,197" ...
 $ :'data.frame':       54 obs. of  7 variables:
  ..$ X   : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.1.: chr [1:54] "157,187,971 " "2,122,412 " "371,057 " "2,939,657 " ...
  ..$ X.2.: chr [1:54] "1,173,505 " "10,456 " "1,524 " "12,059 " ...
  ..$ X.3.: chr [1:54] "3,439,645 " "40,500 " "6,851 " "50,573 " ...
  ..$ X.4.: chr [1:54] "2,813,102 " "36,809 " "5,205 " "49,203 " ...
  ..$ X.5.: chr [1:54] "124,585,594 " "1,785,868 " "301,830 " "2,339,074 " ...
  ..$ X.6.: chr [1:54] "47,309,667" "612,321" "151,349" "977,840" ...
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X    : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.7. : chr [1:54] "3,261,248 " "39,515 " "6,909 " "64,940 " ...
  ..$ X.8. : chr [1:54] "77,275,927 " "1,173,547 " "150,481 " "1,361,234 " ...
  ..$ X.9. : chr [1:54] "2,334,249 " "21,674 " "2,840 " "35,140 " ...
  ..$ X.10.: chr [1:54] "9,615,578 " "66,424 " "11,088 " "186,577 " ...
  ..$ X.11.: chr [1:54] "253,158 " "4,431 " "258 " "2,748 " ...
  ..$ X.12.: chr [1:54] "837,997 " "11,547 " "2,966 " "11,454 " ...
  ..$ X.13.: chr [1:54] "12,135,143" "144,703" "38,495" "252,829" ...
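
Since every column comes back as character with thousands separators and trailing spaces, an example built on this document would probably want a clean-up step. A rough sketch in base R, assuming the first column holds the labels and all other columns hold counts (the function name `clean_counts` and the two-row sample are illustrative only, with values copied from the output above):

```r
# Hypothetical clean-up helper: trims labels and converts count columns
# such as "239,874,741 " to numeric values.
clean_counts <- function(df) {
  # First column: row labels, strip surrounding whitespace
  df[[1]] <- trimws(df[[1]])
  # Remaining columns: drop commas and whitespace, then convert to numeric
  df[-1] <- lapply(df[-1], function(x) {
    as.numeric(gsub(",", "", trimws(x), fixed = TRUE))
  })
  df
}

# Small sample shaped like the first extracted table
example <- data.frame(
  X = c("United States, total ", "Alabama "),
  X.1. = c("239,874,741 ", "3,074,293 "),
  stringsAsFactors = FALSE
)
cleaned <- clean_counts(example)
str(cleaned)
```

To apply it to the whole result, something like `lapply(tables, clean_counts)` should work, provided each element really is a data.frame laid out this way.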

@maelle
Member Author

maelle commented May 7, 2016

I like that it's called data book, hehe.

BTW do you think there would be a way to automatically recognize all tables in a pdf?

@leeper
Member

leeper commented May 7, 2016

@masalmon The default behavior of extract_tables() should do this, as long as guess = TRUE.
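
A minimal sketch of what that might look like, assuming tabulizer and a working Java setup are installed. Only `extract_tables()` and its `guess` argument come from this thread; the wrapper name `extract_all_tables` and the file path in the usage lines are placeholders.

```r
# Hypothetical wrapper around tabulizer::extract_tables(); the wrapper
# itself is not part of the package.
extract_all_tables <- function(pdf_path) {
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("The tabulizer package (and a working Java setup) is required.")
  }
  # guess = TRUE lets tabulizer try to detect the table regions itself,
  # so no pages or area need to be supplied
  tabulizer::extract_tables(pdf_path, guess = TRUE)
}

# Usage (the path is a placeholder for a locally saved PDF):
# tables <- extract_all_tables("14databk.pdf")
# length(tables)
```

How well the detection works will depend on the PDF, so it is probably worth checking the result against a page or two by eye.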

@maelle
Member Author

maelle commented May 7, 2016

Ah cool -- sorry I had missed that.
