Accelerate rvest::html_table #237

@georgevbsantiago

I currently use the XML package only for its XML::readHTMLTable function. As far as I can tell, the xml2 package has no function for reading tables from HTML files; is that correct? I have already tried rvest::html_table, but readHTMLTable is roughly 10x faster and produces a cleaner data table.

In the example below, rvest::html_table creates an extra column, includes the table footer, and fills rows with NA, whereas XML::readHTMLTable captures only the header and the body of the table.
However, the biggest problem for me is execution speed.

tab_example.html.gz

library(magrittr)
library(XML)
library(xml2)

# HTML - Table example
html <- "tab_example.html"


# Using the XML package
get_tab_XML <- html %>% 
        XML::htmlParse(encoding = "UTF-8") %>%
        XML::readHTMLTable(stringsAsFactors = FALSE, which = 2)

# Using xml2 + rvest
get_tab_xml2 <- html %>% 
        xml2::read_html() %>%
        rvest::html_node("#tabelaResultado") %>% 
        rvest::html_table(fill = TRUE)

Unit: seconds
 expr       min        lq      mean    median        uq       max neval
  XML  3.816312  3.955153  4.173987  4.093994  4.352824  4.611654     3
 xml2 33.720705 34.495118 35.144829 35.269531 35.856891 36.444251     3
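
The call that produced the timings above is not shown. As a rough sketch, a comparison like this could be reproduced with the microbenchmark package, which is an assumption based on the output format. The helper names read_with_XML and read_with_xml2 are hypothetical wrappers around the two pipelines from the example, and the benchmark only runs if tab_example.html is present in the working directory.

```r
library(magrittr)

# Path to the attached example table (assumed to be unpacked locally)
html <- "tab_example.html"

# Hypothetical wrapper: XML pipeline from the example above
read_with_XML <- function(path) {
  path %>%
    XML::htmlParse(encoding = "UTF-8") %>%
    XML::readHTMLTable(stringsAsFactors = FALSE, which = 2)
}

# Hypothetical wrapper: xml2 + rvest pipeline from the example above
read_with_xml2 <- function(path) {
  path %>%
    xml2::read_html() %>%
    rvest::html_node("#tabelaResultado") %>%
    rvest::html_table(fill = TRUE)
}

# Time both readers three times each and report in seconds
if (file.exists(html)) {
  print(microbenchmark::microbenchmark(
    XML  = read_with_XML(html),
    xml2 = read_with_xml2(html),
    times = 3,
    unit = "s"
  ))
}
```

Wrapping each pipeline in a function keeps parsing and table extraction inside the timed expression, so both packages are measured from file to data frame.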
