Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor XmlSeeker to support more types of return values #55

Merged
merged 24 commits into from Nov 12, 2015

Conversation

jimhester
Copy link
Member

@jimhester jimhester commented Nov 5, 2015

This refactors the XmlSeeker class to return RObjects rather than strictly XPtrNode or Lists. It provides a general search() method that directly returns the result RObject, and an exported xpath_search() function which provides a superset of the behavior of the previous node_find_all(), node_find_one() functions. It also allows one to return a user defined subset of the results (1:max found), rather than just 1 or all.

The existing behavior and user facing API is preserved, however you can also use xpath_search() to return booleans, numbers, characters, e.g.

x <- read_xml("<blah></blah>")
xpath_search(x$node, x$doc, "1+1", nsMap = character(), num_results = Inf)
#> [1] 2
xpath_search(x$node, x$doc, "1=1", nsMap = character(), num_results = Inf)
#> [1] TRUE
xpath_search(x$node, x$doc, "2=1", nsMap = character(), num_results = Inf)
#> [1] FALSE
xpath_search(x$node, x$doc, "string(5)", nsMap = character(), num_results = Inf)
#> [1] "5"

I figured we should discuss the R API for this, so I did not flesh that out at all or add tests, but it is a functional first pass.

@jimhester jimhester changed the title Refactor XmlSeeker to support More types of return values Refactor XmlSeeker to support more types of return values Nov 6, 2015
@jimhester
Copy link
Member Author

@jimhester jimhester commented Nov 7, 2015

Character returns also seem to be working fine, I edited the above comment respectively.

@jimhester
Copy link
Member Author

@jimhester jimhester commented Nov 7, 2015

Some libxml2 tests for return types

}
return out;
}
case XPATH_NUMBER: { return Rcpp::NumericVector(1, result_->floatval); }
Copy link
Member

@hadley hadley Nov 9, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Rcpp::NumericVector::create(result_->floatval)

@hadley
Copy link
Member

@hadley hadley commented Nov 9, 2015

I prefer type-stable functions, so maybe the R interface should be: xml_find_num(), xml_find_chr(), and xml_find_lgl(). They'd throw an error if the input wasn't the correct type.

@jimhester
Copy link
Member Author

@jimhester jimhester commented Nov 10, 2015

@hadley I think this is ready for review, I added the type specific functions as well as tests for them. The only thing remaining would be tests for the missing functionality if you think we need them.

I was getting segfaults when I tried to use the Xml2String() class to convert the XmlChar (see jimhester@142c95e). I do not get them with the reinterpret_cast. Maybe Xml2String is not doing a deep copy of the data and the character array is being freed when the XmlSeeker destructor is called? What is the best way to force a deep copy with Rcpp? Also Xml2String::asRString() seems to return a CHARXSP rather than a CharacterVector, is using Rcpp::as<CharacterVector> the best way to convert that?

@jimhester jimhester mentioned this pull request Nov 10, 2015
@hadley
Copy link
Member

@hadley hadley commented Nov 10, 2015

I think it's because that object should actually be a const xmlChar* - Xml2String assumes if you pass it in xmlChar* it needs to free it.

A slightly better fix is CharacterVector::create(Rf_mkCharCE((char*) result_->stringval, CE_UTF8)), which ensures that the string is always flagged as UTF-8 in R. I'll make the change

@hadley
Copy link
Member

@hadley hadley commented Nov 10, 2015

Hmmm, I think there's something wrong in the logic for nodes_duplicated:

x <- read_xml("<body>
  <p>Some <b>text</b>.</p>
  <p>Some <b>other</b> <b>text</b>.</p>
  <p>No bold text</p>
</body>")
para <- xml_find_all(x, ".//p")

xml_find_one(para, ".//b")
# Error: expecting an external pointer

I'm not sure about the logic in there - should it be checking for xml_missing? What does as<XPtrNode>(List(nodes[i])["node"]) do?

@jimhester
Copy link
Member Author

@jimhester jimhester commented Nov 11, 2015

Re: nodes_duplicated(), you are correct, it needs to check for xml_missing as well. I wrote that code before merging the two pull requests and did not update it properly. I will add a test case for this.

@jimhester
Copy link
Member Author

@jimhester jimhester commented Nov 11, 2015

0a12a80 adds tests for all the methods on xml_missing objects and contains the fix for the xml_find_one error you mentioned. Also added support for xml_missing to nodes_duplicated.

@hadley
Copy link
Member

@hadley hadley commented Nov 11, 2015

Ok, this looks great :)

Two last things:

  • Can you please add a couple of bullets to NEWS
  • Can you please update the xml_find_all() example to the code below:
#' # Find all vs find one -----------------------------------------------------
#' x <- read_xml("<body>
#'   <p>Some <b>text</b>.</p>
#'   <p>Some <b>other</b> <b>text</b>.</p>
#'   <p>No bold here!</p>
#' </body>")
#' para <- xml_find_all(x, ".//p")
#'
#' # If you apply xml_find_all to a nodeset, it finds all matches,
#' # de-duplicates them, and returns as a single list. This means you
#' # never know how many results you'll get
#' xml_find_all(para, ".//b")
#'
#' # xml_find_one only returns one match per input node. If there are 0
#' # matches it will return a missing node; if there are more than one it picks
#' # the first with a warning
#' xml_find_one(para, ".//b")
#' xml_text(xml_find_one(para, ".//b"))

@jimhester
Copy link
Member Author

@jimhester jimhester commented Nov 11, 2015

I missing committing the new test file (whoops) in the previous round, but it is now added, as well as the documentation and notes to NEWS.

hadley added a commit that referenced this issue Nov 12, 2015
Refactor XmlSeeker to support more types of return values
@hadley hadley merged commit 59c2c2f into r-lib:master Nov 12, 2015
1 check passed
@hadley
Copy link
Member

@hadley hadley commented Nov 12, 2015

Thanks Jim!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants