Use insights from htmltab package #63

@hadley

  1. Expansion of row and column spans: I am confident that this part of the code (span_body, span_header) works as expected for spans in both the header and the body. For example:

    library(XML)
    library(stringr)
    library(magrittr)

    # Cell function for the body: strip a trailing '%' and turn empty cells into NA
    bFun <- function(node) {
      xmlValue(node) %>% str_replace('%$', '') %>% ifelse(equals(., ''), NA, .)
    }

    # Select the table with an XPath expression
    doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
    htmltable(doc = doc, which = "//table[5]", bodyFun = bFun)

    # Select the table by its position in the page
    url <- "http://de.wikipedia.org/wiki/Bundestagswahlkreis_Frankfurt_am_Main_II"
    htmltable(doc = url, which = 14, encoding = "UTF-8")

    I also managed to make these functions quite robust to certain misspecifications in the HTML code. For example, in this table the last column has a span of 8, but it should be 6:

    htmltable(doc = "http://en.wikipedia.org/wiki/Jamie_xx", 
              which = "/html/body/div[3]/div[3]/div[4]/table[2]", 
              encoding = "UTF-8", 
              header = 1:2, body = "tr[./td[not(@colspan = '9')]]")
  2. Identification of header and body elements: This is an issue that I still have not completely solved, and it is the cause of nearly all the failures right now. I tried to come up with some reasonable heuristics for identifying these elements, but they are not finished and still need more work, as well as more testing with 'real-life' HTML tables. Ultimately, I think it's necessary to give users more control over identifying these elements, or to just fall back on very simple decision rules.
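
To make the span expansion in point 1 concrete, here is a minimal base-R sketch of the general idea. It is not the htmltab implementation: `make_cell()` and `expand_spans()` are hypothetical helpers, the table content is made up, and clamping spans that overshoot the table to its known width is only one plausible way to tolerate misspecified markup such as a colspan of 8 where 6 was meant.

```r
# Hypothetical helper (not part of htmltab): one table cell with its spans.
make_cell <- function(value, rowspan = 1, colspan = 1) {
  list(value = value, rowspan = rowspan, colspan = colspan)
}

# Hypothetical helper (not part of htmltab): expand rowspans and colspans
# into a rectangular character matrix with n_cols columns.
# `rows` is a list of rows; each row is a list of cells from make_cell().
expand_spans <- function(rows, n_cols) {
  n_rows <- length(rows)
  grid   <- matrix(NA_character_, nrow = n_rows, ncol = n_cols)
  filled <- matrix(FALSE, nrow = n_rows, ncol = n_cols)

  for (i in seq_len(n_rows)) {
    j <- 1
    for (cl in rows[[i]]) {
      # Skip slots already occupied by a rowspan from a row above
      while (j <= n_cols && filled[i, j]) j <- j + 1
      if (j > n_cols) break

      # Assumption: treat spans below 1 as 1 and clamp spans that
      # run past the edge of the table
      r_end <- min(i + max(cl$rowspan, 1) - 1, n_rows)
      c_end <- min(j + max(cl$colspan, 1) - 1, n_cols)

      grid[i:r_end, j:c_end]   <- cl$value
      filled[i:r_end, j:c_end] <- TRUE
      j <- c_end + 1
    }
  }
  grid
}

# Made-up example: a two-row header with spans, plus a body row whose
# last cell declares a colspan that is too large for the table
rows <- list(
  list(make_cell("Browser", rowspan = 2), make_cell("Share", colspan = 2)),
  list(make_cell("2013"), make_cell("2014")),
  list(make_cell("Chrome"), make_cell("40"), make_cell("45", colspan = 8))
)
grid <- expand_spans(rows, n_cols = 3)
print(grid)
```

Filling a fixed-size grid keeps the rowspan bookkeeping simple, but it assumes the number of columns is known up front; in practice that width would itself have to be inferred from the table.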
