<font size="6"><b>EXTRACTING TABLES FROM PDF FILES</b></font>

In [None]:
library(tabulizer)
library(data.table)
library(tidyverse)

In [None]:
options(repr.matrix.max.rows = 100, repr.matrix.max.cols = 40) # cap the number of rows and columns shown when a table is printed

![xkcd](../imagesbb/preprint.png)

(https://xkcd.com/2304)

Let's extract the tables from the YKS report by OSYM:

https://dokuman.osym.gov.tr/pdfdokuman/2021/GENEL/yksdegrapor24122021.pdf

We previously extracted unformatted text from the YKS report PDF file using the `textreadr::read_pdf` function.

Now we will extract the tabular data as matrices, one per page. Page 103 is a little more complex, so we skip that page.
For such complex cases, you can provide the geometry of the table or the column positions, in points, through the `area` or `columns` arguments. Pages 104 to 110 can be extracted with automatic table detection:

In [None]:
yksreport <- extract_tables("~/databb/pdf/yksdegrapor24122021.pdf",
                              pages = 104:110,
                              guess = TRUE,
                              method = "lattice", output = "matrix")
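For a page such as 103, where automatic detection is not enough, a minimal sketch of passing an explicit table area could look like the following. The helper name `extract_table_area` and the coordinates in the commented call are hypothetical placeholders, not values measured from the report; the real bounding box has to be read off the PDF, for example interactively with `tabulizer::locate_areas()`.

```r
# Sketch only: wraps tabulizer::extract_tables() with an explicit area.
# `extract_table_area` is a hypothetical helper, not part of tabulizer.
extract_table_area <- function(pdf_path, page, area_pts) {
  # area_pts is c(top, left, bottom, right), measured in points
  stopifnot(is.numeric(area_pts), length(area_pts) == 4)
  tabulizer::extract_tables(
    pdf_path,
    pages  = page,
    area   = list(area_pts),  # one coordinate vector per requested page
    guess  = FALSE,           # switch off automatic detection so `area` is used
    output = "matrix"
  )
}

# Hypothetical call; the numbers are placeholders, not the real table position:
# page103 <- extract_table_area("~/databb/pdf/yksdegrapor24122021.pdf",
#                               page = 103,
#                               area_pts = c(100, 40, 750, 560))
```

Setting `guess = FALSE` matters here: with guessing switched on, the package looks for tables on its own instead of using the supplied region.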

Look at the first five rows of each page:

In [None]:
yksreport %>% lapply(function(x) x[1:5, ])

Now extract the column names. The first row supplies the names of the first two columns, and the second row supplies the names of the remaining columns:

In [None]:
headx <- c(yksreport[[1]][1, 1:2], yksreport[[1]][2, -(1:2)])

In [None]:
headx
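To see why the header is stitched together from two rows, here is a small sketch on a made-up matrix with the same kind of two-row header. The object `toy_page` and its contents are invented for illustration and are not taken from the report.

```r
# Made-up matrix imitating a page with a two-row header:
# row 1 names the first two columns, row 2 names the rest.
toy_page <- matrix(
  c("Code", "Name",  "Rank band", "",
    "",     "",      "1-100",     "101-500",
    "1001", "Uni A", "12",        "34"),
  nrow = 3, byrow = TRUE
)

# First two names from row 1, all remaining names from row 2
toy_head <- c(toy_page[1, 1:2], toy_page[2, -(1:2)])
toy_head
```

The negative index `-(1:2)` means "every column except the first two", so the same line works however many rank columns the page has.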

Replace the carriage return characters (`\r`) with spaces and remove the unnecessary space after each hyphen:

In [None]:
headx <- headx %>% str_replace_all("\\r", " ") %>% str_replace_all("- ", "-")

In [None]:
headx
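The effect of the two replacements can be checked on a pair of made-up header strings. The vector `raw_head` is an assumption about what broken headers might look like, not the actual content of `headx`.

```r
library(stringr)

# Made-up headers: a carriage return splits each name across two lines
raw_head <- c("University\rName", "1-\r100")

# Step 1: carriage return -> space; step 2: drop the space left after a hyphen
step1 <- str_replace_all(raw_head, "\\r", " ")
clean_head <- str_replace_all(step1, "- ", "-")
clean_head
```

The second replacement is needed only because the first one leaves a space behind wherever the line break followed a hyphen.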

Convert each page to a data.table, drop the first two (header) rows, and combine the pages into one large table:

In [None]:
yksreport2 <- lapply(yksreport, as.data.table) %>% lapply(tail, -2) %>% rbindlist

In [None]:
yksreport2 %>% head
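The same convert, drop, and bind pattern can be sketched on two made-up pages. The list `toy_pages` is invented for illustration; each matrix repeats a two-row header, as the report pages do.

```r
library(data.table)

# Two made-up "pages", each starting with the same two header rows
toy_pages <- list(
  matrix(c("h1", "h2", "sub1", "sub2", "a", "1", "b", "2"), ncol = 2, byrow = TRUE),
  matrix(c("h1", "h2", "sub1", "sub2", "c", "3"), ncol = 2, byrow = TRUE)
)

# tail(x, -2) keeps everything except the first two rows
pieces <- lapply(toy_pages, function(m) tail(as.data.table(m), -2))

# Stack the per-page tables into one
toy_combined <- rbindlist(pieces)
toy_combined
```

Because `as.data.table` gives a matrix the default column names `V1`, `V2`, and so on, every page ends up with matching names and `rbindlist` can stack them directly.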

Set the column names:

In [None]:
setnames(yksreport2, headx)

In [None]:
yksreport2 %>% head

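Replace the `-` placeholders with `NA`, then convert the columns whose headers start with a digit to numeric values. The numbers use a comma as the decimal mark and a period as the grouping mark: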

In [None]:
yksreport3 <- yksreport2 %>%
    mutate_all(na_if, "-") %>%
    mutate_at(vars(matches("^\\d")), parse_number,
              locale = locale(decimal_mark = ",", grouping_mark = "."))
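As a sketch of what this conversion does, here is the same idea on a made-up table, written with `across()`, which current dplyr versions recommend over `mutate_all` and `mutate_at`. The tibble `toy` and the name `tr_locale` are invented for illustration.

```r
library(dplyr)
library(readr)

# Made-up table: "-" marks a missing value, "." is a thousands separator
toy <- tibble(
  `University Name` = c("A", "B"),
  `1-100`           = c("1.234", "-"),
  `101-1000`        = c("56", "7.890")
)

# Number format with "," as decimal mark and "." as grouping mark
tr_locale <- locale(decimal_mark = ",", grouping_mark = ".")

toy_clean <- toy %>%
  mutate(across(everything(), ~ na_if(.x, "-"))) %>%
  mutate(across(matches("^\\d"), ~ parse_number(.x, locale = tr_locale)))
toy_clean
```

Without the locale, `parse_number` would read `"1.234"` as one and a fraction rather than as one thousand two hundred thirty-four, so the grouping mark has to be declared explicitly.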

Now we have a complete and clean table:

In [None]:
yksreport3 %>% head

In [None]:
yksreport3 %>% DT::datatable(filter = "top")

Let's do some filtering and ordering:

In [None]:
yksreport3[str_detect(`Üniversite Adı`, "Boğaziçi")]
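The bracket syntax used for filtering and ordering comes from data.table. A minimal sketch on a made-up table follows; `toy_uni` and its values are invented, and base R's `grepl` stands in for `str_detect` to keep the example free of extra packages.

```r
library(data.table)

# Made-up table with a name column and a count column containing an NA
toy_uni <- data.table(
  name   = c("Alpha University", "Beta Institute", "Gamma University"),
  top100 = c(5, NA, 12)
)

# Keep rows whose name matches a pattern
matched <- toy_uni[grepl("University", name)]

# Drop rows with a missing count, then sort in descending order;
# each pair of brackets works on the result of the previous one
ranked <- toy_uni[!is.na(top100)][order(-top100)]
ranked
```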

In [None]:
yksreport3[!is.na(`1-100`)][order(-`1-100`)]

Boğaziçi rules!