New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor XmlSeeker to support more types of return values #55

Merged
merged 24 commits into from Nov 12, 2015

Conversation

Projects
None yet
2 participants
@jimhester
Member

jimhester commented Nov 5, 2015

This refactors the XmlSeeker class to return RObjects rather than strictly XPtrNode or Lists. It provides a general search() method that directly returns the result RObject, and an exported xpath_search() function which provides a superset of the behavior of the previous node_find_all(), node_find_one() functions. It also allows one to return a user defined subset of the results (1:max found), rather than just 1 or all.

The existing behavior and user facing API is preserved, however you can also use xpath_search() to return booleans, numbers, characters, e.g.

x <- read_xml("<blah></blah>")
xpath_search(x$node, x$doc, "1+1", nsMap = character(), num_results = Inf)
#> [1] 2
xpath_search(x$node, x$doc, "1=1", nsMap = character(), num_results = Inf)
#> [1] TRUE
xpath_search(x$node, x$doc, "2=1", nsMap = character(), num_results = Inf)
#> [1] FALSE
xpath_search(x$node, x$doc, "string(5)", nsMap = character(), num_results = Inf)
#> [1] "5"

I figured we should discuss the R API for this, so I did not flesh that out at all or add tests, but it is a functional first pass.

@jimhester jimhester changed the title from Refactor XmlSeeker to support More types of return values to Refactor XmlSeeker to support more types of return values Nov 6, 2015

@jimhester

This comment has been minimized.

Member

jimhester commented Nov 7, 2015

Character returns also seem to be working fine, I edited the above comment respectively.

@jimhester

This comment has been minimized.

Member

jimhester commented Nov 7, 2015

Some libxml2 tests for return types

}
return out;
}
case XPATH_NUMBER: { return Rcpp::NumericVector(1, result_->floatval); }

This comment has been minimized.

@hadley

hadley Nov 9, 2015

Member

You can use Rcpp::NumericVector::create(result_->floatval)

}
case XPATH_NUMBER: { return Rcpp::NumericVector(1, result_->floatval); }
case XPATH_BOOLEAN: { return Rcpp::LogicalVector(1, result_->boolval); }
case XPATH_STRING: { return Rcpp::CharacterVector(1, reinterpret_cast<const char*>(result_->stringval)); }

This comment has been minimized.

@hadley

hadley Nov 9, 2015

Member

Need to make sure this gets correctly encoded

@hadley

This comment has been minimized.

Member

hadley commented Nov 9, 2015

I prefer type-stable functions, so maybe the R interface should be: xml_find_num(), xml_find_chr(), and xml_find_lgl(). They'd throw an error if the input wasn't the correct type.

@jimhester

This comment has been minimized.

Member

jimhester commented Nov 10, 2015

@hadley I think this is ready for review, I added the type specific functions as well as tests for them. The only thing remaining would be tests for the missing functionality if you think we need them.

I was getting segfaults when I tried to use the Xml2String() class to convert the XmlChar (see jimhester@142c95e). I do not get them with the reinterpret_cast. Maybe Xml2String is not doing a deep copy of the data and the character array is being freed when the XmlSeeker destructor is called? What is the best way to force a deep copy with Rcpp? Also Xml2String::asRString() seems to return a CHARXSP rather than a CharacterVector, is using Rcpp::as<CharacterVector> the best way to convert that?

@jimhester jimhester referenced this pull request Nov 10, 2015

Closed

Handle missing nodes #57

@hadley

This comment has been minimized.

Member

hadley commented Nov 10, 2015

I think it's because that object should actually be a const xmlChar* - Xml2String assumes if you pass it in xmlChar* it needs to free it.

A slightly better fix is CharacterVector::create(Rf_mkCharCE((char*) result_->stringval, CE_UTF8)), which ensures that the string is always flagged as UTF-8 in R. I'll make the change

@hadley

This comment has been minimized.

Member

hadley commented Nov 10, 2015

Hmmm, I think there's something wrong in the logic for nodes_duplicated:

x <- read_xml("<body>
  <p>Some <b>text</b>.</p>
  <p>Some <b>other</b> <b>text</b>.</p>
  <p>No bold text</p>
</body>")
para <- xml_find_all(x, ".//p")

xml_find_one(para, ".//b")
# Error: expecting an external pointer

I'm not sure about the logic in there - should it be checking for xml_missing? What does as<XPtrNode>(List(nodes[i])["node"]) do?

@jimhester

This comment has been minimized.

Member

jimhester commented Nov 11, 2015

Re: nodes_duplicated(), you are correct, it needs to check for xml_missing as well. I wrote that code before merging the two pull requests and did not update it properly. I will add a test case for this.

@jimhester

This comment has been minimized.

Member

jimhester commented Nov 11, 2015

0a12a80 adds tests for all the methods on xml_missing objects and contains the fix for the xml_find_one error you mentioned. Also added support for xml_missing to nodes_duplicated.

@hadley

This comment has been minimized.

Member

hadley commented Nov 11, 2015

Ok, this looks great :)

Two last things:

  • Can you please add a couple of bullets to NEWS
  • Can you please update the xml_find_all() example to the code below:
#' # Find all vs find one -----------------------------------------------------
#' x <- read_xml("<body>
#'   <p>Some <b>text</b>.</p>
#'   <p>Some <b>other</b> <b>text</b>.</p>
#'   <p>No bold here!</p>
#' </body>")
#' para <- xml_find_all(x, ".//p")
#'
#' # If you apply xml_find_all to a nodeset, it finds all matches,
#' # de-duplicates them, and returns as a single list. This means you
#' # never know how many results you'll get
#' xml_find_all(para, ".//b")
#'
#' # xml_find_one only returns one match per input node. If there are 0
#' # matches it will return a missing node; if there are more than one it picks
#' # the first with a warning
#' xml_find_one(para, ".//b")
#' xml_text(xml_find_one(para, ".//b"))
@jimhester

This comment has been minimized.

Member

jimhester commented Nov 11, 2015

I missing committing the new test file (whoops) in the previous round, but it is now added, as well as the documentation and notes to NEWS.

hadley added a commit that referenced this pull request Nov 12, 2015

Merge pull request #55 from jimhester/numeric
Refactor XmlSeeker to support more types of return values

@hadley hadley merged commit 59c2c2f into r-lib:master Nov 12, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@hadley

This comment has been minimized.

Member

hadley commented Nov 12, 2015

Thanks Jim!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment