Pdf ideas for examples #6

maelle · 2016-05-02T12:58:27Z

Scientific papers often have tables, and one would surely like to use the area argument to pick them out.
Bus timetables could also work, e.g. http://www.apsrtc.gov.in/Airport%20Liner%20Timings.pdf or http://www.morbihan.fr/fileadmin/Les_services/Vos_deplacements/Transports_collectifs/Fiches_horaires_TIM/TIM7-Hiver-Printemps-2016.pdf (p. 3).


psychemedia · 2016-05-02T13:17:33Z

The area argument is available. For example:

extract_tables('Lap Analysis.pdf', pages = 8, guess = FALSE, area = list(c(178, 10, 550, 50)))

The area parameter appears to take co-ordinates in the form: top, left, width, height.

You can find the necessary co-ordinates using the Tabula app: if you select an area and preview the data, the selected co-ordinates are shown in the console of the browser developer tools.

However, the Tabula app console output gives co-ordinates in the form top, left, bottom, right, so you need to do some sums to convert these numbers into the form that the tabulizer area parameter wants.
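
The "sums" mentioned here can be sketched as below. This is only an illustration of the arithmetic for the behaviour described in this comment (top, left, width, height). `tabula_to_tlwh` is a hypothetical helper, not a tabulizer function, and the input co-ordinates are made up so that the result matches the vector used in the example call.

```r
# Hypothetical helper (not part of tabulizer): convert one Tabula selection,
# given as top, left, bottom, right, into top, left, width, height.
tabula_to_tlwh <- function(top, left, bottom, right) {
  # A selection must not have negative width or height
  stopifnot(bottom >= top, right >= left)
  c(top = top, left = left, width = right - left, height = bottom - top)
}

# Made-up Tabula co-ordinates, for illustration only
area <- tabula_to_tlwh(top = 178, left = 10, bottom = 228, right = 560)
print(area)
```

Wrapping `unname(area)` in `list()` would then give a value in the shape that the example call passes to `area`.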

leeper · 2016-05-02T14:44:50Z

@psychemedia The area specification is a bug in my code. I'm pushing a fix for it right now. It should be top, left, bottom, right, just like in Tabula.
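
With the top, left, bottom, right order, co-ordinates copied from Tabula could be passed through unchanged. The sketch below assumes that order. `tabula_area` is a hypothetical helper, the co-ordinates are illustrative, and the `extract_tables()` call only runs if tabulizer is installed and the PDF from the earlier example exists locally.

```r
# Hypothetical helper: bundle one selection in top, left, bottom, right order
# into the list-of-vectors shape used for the `area` argument.
tabula_area <- function(top, left, bottom, right) {
  # Guard against co-ordinates supplied in the wrong order
  stopifnot(bottom > top, right > left)
  list(c(top, left, bottom, right))
}

# Illustrative co-ordinates only
area <- tabula_area(top = 178, left = 10, bottom = 228, right = 560)

# Assumption: tabulizer is installed and "Lap Analysis.pdf" is in the
# working directory; otherwise this block is skipped.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists("Lap Analysis.pdf")) {
  tabs <- tabulizer::extract_tables("Lap Analysis.pdf", pages = 8,
                                    guess = FALSE, area = area)
  str(tabs)
}
```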

maelle · 2016-05-02T14:51:04Z

@leeper new terrible example, http://photos.state.gov/libraries/india/231771/PDFs/jan-dec_2015.pdf (the csv here being incomplete). It's US data, 187 pages, I'll report tomorrow once I've scraped it. Have I already said your pkg is awesome? 😁

maelle · 2016-05-05T15:54:57Z

I have used the tabulizer package here https://github.com/masalmon/usaqmindia/blob/master/inst/pm25_consulate.R but it's a pretty boring example.

leeper · 2016-05-07T15:22:51Z

This IRS document might work well as an example: https://www.irs.gov/pub/irs-soi/14databk.pdf

> extract_areas(tmp, pages = c(14, 15, 17, 18), method = "data.frame")
> str(.Last.value)
List of 6
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X   : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.1.: chr [1:54] "239,874,741 " "3,074,293 " "584,480 " "4,485,975 " ...
  ..$ X.2.: chr [1:54] "2,220,921 " "17,613 " "3,362 " "33,844 " ...
  ..$ X.3.: chr [1:54] "4,642,817 " "50,438 " "9,160 " "83,945 " ...
  ..$ X.4.: chr [1:54] "3,799,428 " "45,905 " "7,383 " "84,956 " ...
  ..$ X.5.: chr [1:54] "147,444,789 " "2,048,463 " "357,733 " "2,805,861 " ...
  ..$ X.6.: chr [1:54] "23,608,340 " "252,431 " "47,482 " "430,138 " ...
  ..$ X.7.: chr [1:54] "3,205,595" "29,602" "4,178" "49,609" ...
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X    : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.8. : chr [1:54] "617,649 " "5,365 " "1,067 " "7,563 " ...
  ..$ X.9. : chr [1:54] "30,065,749 " "353,564 " "79,939 " "508,257 " ...
  ..$ X.10.: chr [1:54] "34,132 " "255 " "38 " "410 " ...
  ..$ X.11.: chr [1:54] "334,641 " "3,163 " "567 " "4,626 " ...
  ..$ X.12.: chr [1:54] "987,238 " "15,016 " "3,433 " "9,225 " ...
  ..$ X.13.: chr [1:54] "1,467,402 " "16,792 " "4,682 " "19,344 " ...
  ..$ X.14.: chr [1:54] "21,446,040" "235,686" "65,456" "448,197" ...
 $ :'data.frame':       54 obs. of  7 variables:
  ..$ X   : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.1.: chr [1:54] "157,187,971 " "2,122,412 " "371,057 " "2,939,657 " ...
  ..$ X.2.: chr [1:54] "1,173,505 " "10,456 " "1,524 " "12,059 " ...
  ..$ X.3.: chr [1:54] "3,439,645 " "40,500 " "6,851 " "50,573 " ...
  ..$ X.4.: chr [1:54] "2,813,102 " "36,809 " "5,205 " "49,203 " ...
  ..$ X.5.: chr [1:54] "124,585,594 " "1,785,868 " "301,830 " "2,339,074 " ...
  ..$ X.6.: chr [1:54] "47,309,667" "612,321" "151,349" "977,840" ...
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X    : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.7. : chr [1:54] "3,261,248 " "39,515 " "6,909 " "64,940 " ...
  ..$ X.8. : chr [1:54] "77,275,927 " "1,173,547 " "150,481 " "1,361,234 " ...
  ..$ X.9. : chr [1:54] "2,334,249 " "21,674 " "2,840 " "35,140 " ...
  ..$ X.10.: chr [1:54] "9,615,578 " "66,424 " "11,088 " "186,577 " ...
  ..$ X.11.: chr [1:54] "253,158 " "4,431 " "258 " "2,748 " ...
  ..$ X.12.: chr [1:54] "837,997 " "11,547 " "2,966 " "11,454 " ...
  ..$ X.13.: chr [1:54] "12,135,143" "144,703" "38,495" "252,829" ...
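
The columns in this output come back as character strings with thousands separators and trailing spaces. A minimal base R sketch for turning them into numbers is below. `to_number` is a hypothetical helper, and the small data frame is a stand-in built from a few of the values shown above, not the full extraction result.

```r
# Hypothetical helper: trim whitespace, drop thousands separators, convert
to_number <- function(x) as.numeric(gsub(",", "", trimws(x), fixed = TRUE))

# Stand-in for one extracted table, using a few values from the output above
tab <- data.frame(
  X    = c("United States, total ", "Alabama ", "Alaska "),
  X.1. = c("239,874,741 ", "3,074,293 ", "584,480 "),
  X.2. = c("2,220,921 ", "17,613 ", "3,362 "),
  stringsAsFactors = FALSE
)

tab$X <- trimws(tab$X)                 # keep the label column as text
tab[-1] <- lapply(tab[-1], to_number)  # convert every other column
str(tab)
```

The same two lines could be applied to each element of the returned list with `lapply()`.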

maelle · 2016-05-07T15:25:42Z

I like that it's called data book, hehe.

BTW do you think there would be a way to automatically recognize all tables in a PDF?

leeper · 2016-05-07T15:36:20Z

@masalmon The default behavior of extract_tables() should do this, as long as guess = TRUE.

maelle · 2016-05-07T15:37:52Z

Ah cool -- sorry I had missed that.

leeper added a commit that referenced this issue May 2, 2016

fix handling of area parameter (#5,#6)

221bbaa

leeper added the enhancement label May 2, 2016

leeper added this to the CRAN Release milestone May 2, 2016

leeper added the documentation label May 2, 2016
