Pdf ideas for examples #6
The area parameter appears to take co-ordinates in the form: top, left, width, height. You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are shown in the browser developer tools console. However, the tabula app's console output gives co-ordinates in the form top, left, bottom, right, so you need to do some sums to convert these numbers into the arguments that the tabulizer area parameter wants.
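The "sums" mentioned above can be sketched in a few lines of base R. This is only an illustration of the arithmetic under the behaviour described in this comment (area given as top, left, width, height); the reply that follows says that behaviour is a bug being fixed, so the conversion may be unnecessary in later versions. The helper name `tabula_to_tlwh` and the sample co-ordinates are hypothetical, not part of tabulizer.

```r
# Hypothetical helper (not part of tabulizer): convert the tabula app's
# console co-ordinates (top, left, bottom, right) into the
# (top, left, width, height) form described in the comment above.
tabula_to_tlwh <- function(top, left, bottom, right) {
  # The bottom/right edges must not lie above/left of the top/left edges.
  stopifnot(bottom >= top, right >= left)
  c(top = top, left = left, width = right - left, height = bottom - top)
}

# Made-up example co-ordinates, purely for illustration.
converted <- tabula_to_tlwh(top = 100, left = 50, bottom = 400, right = 550)
print(converted)
```

Width is right minus left and height is bottom minus top; top and left pass through unchanged.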
@psychemedia The area specification is a bug in my code. I'm pushing a fix for it right now. It should be
@leeper new terrible example, http://photos.state.gov/libraries/india/231771/PDFs/jan-dec_2015.pdf (the csv here being incomplete). It's US data, 187 pages, I'll report tomorrow once I've scraped it. Have I already said your pkg is awesome? 😁
I have used the tabulizer package here https://github.com/masalmon/usaqmindia/blob/master/inst/pm25_consulate.R but it's a pretty boring example.
This IRS document might work well as an example: https://www.irs.gov/pub/irs-soi/14databk.pdf
> extract_areas(tmp, pages = c(14, 15, 17, 18), method = "data.frame")
> str(.Last.value)
List of 6
$ :'data.frame': 54 obs. of 8 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.1.: chr [1:54] "239,874,741 " "3,074,293 " "584,480 " "4,485,975 " ...
..$ X.2.: chr [1:54] "2,220,921 " "17,613 " "3,362 " "33,844 " ...
..$ X.3.: chr [1:54] "4,642,817 " "50,438 " "9,160 " "83,945 " ...
..$ X.4.: chr [1:54] "3,799,428 " "45,905 " "7,383 " "84,956 " ...
..$ X.5.: chr [1:54] "147,444,789 " "2,048,463 " "357,733 " "2,805,861 " ...
..$ X.6.: chr [1:54] "23,608,340 " "252,431 " "47,482 " "430,138 " ...
..$ X.7.: chr [1:54] "3,205,595" "29,602" "4,178" "49,609" ...
$ :'data.frame': 54 obs. of 8 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.8. : chr [1:54] "617,649 " "5,365 " "1,067 " "7,563 " ...
..$ X.9. : chr [1:54] "30,065,749 " "353,564 " "79,939 " "508,257 " ...
..$ X.10.: chr [1:54] "34,132 " "255 " "38 " "410 " ...
..$ X.11.: chr [1:54] "334,641 " "3,163 " "567 " "4,626 " ...
..$ X.12.: chr [1:54] "987,238 " "15,016 " "3,433 " "9,225 " ...
..$ X.13.: chr [1:54] "1,467,402 " "16,792 " "4,682 " "19,344 " ...
..$ X.14.: chr [1:54] "21,446,040" "235,686" "65,456" "448,197" ...
$ :'data.frame': 54 obs. of 7 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.1.: chr [1:54] "157,187,971 " "2,122,412 " "371,057 " "2,939,657 " ...
..$ X.2.: chr [1:54] "1,173,505 " "10,456 " "1,524 " "12,059 " ...
..$ X.3.: chr [1:54] "3,439,645 " "40,500 " "6,851 " "50,573 " ...
..$ X.4.: chr [1:54] "2,813,102 " "36,809 " "5,205 " "49,203 " ...
..$ X.5.: chr [1:54] "124,585,594 " "1,785,868 " "301,830 " "2,339,074 " ...
..$ X.6.: chr [1:54] "47,309,667" "612,321" "151,349" "977,840" ...
$ :'data.frame': 54 obs. of 8 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.7. : chr [1:54] "3,261,248 " "39,515 " "6,909 " "64,940 " ...
..$ X.8. : chr [1:54] "77,275,927 " "1,173,547 " "150,481 " "1,361,234 " ...
..$ X.9. : chr [1:54] "2,334,249 " "21,674 " "2,840 " "35,140 " ...
..$ X.10.: chr [1:54] "9,615,578 " "66,424 " "11,088 " "186,577 " ...
..$ X.11.: chr [1:54] "253,158 " "4,431 " "258 " "2,748 " ...
..$ X.12.: chr [1:54] "837,997 " "11,547 " "2,966 " "11,454 " ...
..$ X.13.: chr [1:54] "12,135,143" "144,703" "38,495" "252,829" ...
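In the output above every column comes back as character, with thousands separators and trailing spaces. A minimal sketch of one way to tidy such columns in base R follows. The toy data frame just copies a few values from the output shown, and `to_number` is a hypothetical helper rather than anything tabulizer provides.

```r
# Toy data frame mimicking the shape of the extracted tables
# (values copied from the first two columns of the output above).
df <- data.frame(
  X    = c("United States, total ", "Alabama ", "Alaska ", "Arizona "),
  X.1. = c("239,874,741 ", "3,074,293 ", "584,480 ", "4,485,975 "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace, drop thousands separators,
# then convert to numeric.
to_number <- function(x) as.numeric(gsub(",", "", trimws(x), fixed = TRUE))

df$X <- trimws(df$X)                 # tidy the label column
df[-1] <- lapply(df[-1], to_number)  # convert every remaining column
str(df)
```

The same `lapply()` call could be mapped over each data frame in the returned list, assuming all non-label columns hold numbers formatted this way.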
I like that it's called data book, hehe. BTW do you think there would be a way to automatically recognize all tables in a pdf?
@masalmon The default behavior of
Ah cool, sorry I had missed that.