
Any ideas how to iterate faster using multi core? #96

Closed · koustuvsinha opened this issue Apr 13, 2016 · 8 comments
@koustuvsinha (Author) commented Apr 13, 2016

Hi! This is an amazing library, and the only one that works for me while using the least RAM. I have a question about faster iteration. Currently I am iterating through a 2 GB XML file (5,000,000+ nodes) using the built-in iterator. Since it is taking quite a lot of time, I want to use multi-core support to iterate faster. I tried using OpenMP, but it's not working for me. Any ideas?

@zeux (Owner) commented Apr 13, 2016

  1. You may want to try compact mode (pugiconfig.hpp ships with a commented-out define PUGIXML_COMPACT; uncomment it, as shown in the snippet after this list). The performance effects of this are not fully characterized yet, but it may improve traversal performance of large documents due to better locality, given a fast CPU.
  2. I'd like to know a bit more about your XML. Is it a root node with lots (millions) of children, or is the number of children of the root relatively small? And how are you trying to use OpenMP?
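
For reference, enabling compact mode is a one-line edit in pugiconfig.hpp (the define ships commented out; this is just that line uncommented, not new API):

// In pugiconfig.hpp: uncomment this define to enable compact mode,
// which makes the document tree use substantially less memory.
#define PUGIXML_COMPACT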

@koustuvsinha (Author) commented:

@zeux the dataset is DBLP (the Computer Science Bibliography). It's mostly a flat structure, i.e. a root with millions of children, with depth up to 3 or 4. I am trying to use OpenMP by putting the pragma above the loop, like this:

#pragma omp parallel for
for (int n = 0; n < node.size(); ++n) {
    // ... per-node work ...
}

but for some strange reason it's not working. I will check out compact mode as you mentioned.

@zeux (Owner) commented Apr 14, 2016

This is a neat data set to test with!

I wrote a simple test program that loads this XML into memory and then runs a simple XPath query to get all the article nodes:

pugi::xpath_node_set ns = doc.select_nodes("dblp/article");
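
A minimal self-contained version of such a test might look like the following sketch (the file path and the timing code here are assumptions, not part of the original test):

#include <chrono>
#include <cstdio>
#include "pugixml.hpp"

int main()
{
    // Load the whole DBLP dump into memory (the file name is an assumption).
    auto t0 = std::chrono::steady_clock::now();

    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file("dblp.xml");
    if (!result)
    {
        printf("parse error: %s\n", result.description());
        return 1;
    }

    auto t1 = std::chrono::steady_clock::now();

    // Select all article nodes with a simple XPath query.
    pugi::xpath_node_set ns = doc.select_nodes("dblp/article");

    auto t2 = std::chrono::steady_clock::now();

    printf("parse: %.1f s, query: %.1f s, articles: %zu\n",
        std::chrono::duration<double>(t1 - t0).count(),
        std::chrono::duration<double>(t2 - t1).count(),
        ns.size());
}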

Without compact mode I'm getting 5.7 seconds to parse the file and 0.9 seconds to run the query, as well as 6.98 GB memory consumption.

With compact mode I'm getting 4.8 seconds to parse the file and 0.3 seconds to run the query; memory consumption is 2.77 GB.

Now, these timings are pretty reasonable, but of course my test is trivial: the XPath query is simple and I'm not doing any processing on its results.

In your application, what's the distribution of time between parsing, populating the node variable (which I'm assuming is an XPath node set), and processing the results? And how heavy is that processing: what kind of work do you do inside your loop?

@koustuvsinha (Author) commented:

Yeah, actually I figured out that the processing inside the loop is what's taking time. I am trying to index the dataset with a Dewey-based coding scheme, so within each iteration I have to retrieve the child nodes, do some computation to get the computed index label, and move on. So faster iteration would have helped.

On that note, is there any API to directly get the child nodes of an element? Since I am using a custom method to retrieve them, the complexity rises to O(n^2). I guess that's my bottleneck.

@zeux (Owner) commented Apr 14, 2016

Since I am using a custom method to retrieve them, the complexity rises to O(n^2).

So you have a function that, given an xml_node and a child index, iterates through the children and returns that child?

There's no way to provide faster access: you can think of the child container as a linked list, so index-based operations are naturally slow.

I'd suggest grabbing the children of the root element once and storing them in a vector, e.g.:

pugi::xml_node root = doc.child("dblp");
std::vector<pugi::xml_node> children(root.begin(), root.end());

And using that for indexing instead.
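
That cached vector also gives the original OpenMP idea something to parallelize over. A sketch (the helper name process_items is hypothetical, and it assumes the per-item work is strictly read-only; pugixml permits concurrent read-only access to a tree, but check the thread-safety notes in the manual before relying on this):

#include <vector>
#include "pugixml.hpp"

void process_items(pugi::xml_document& doc)
{
    pugi::xml_node root = doc.child("dblp");

    // Materialize the child list once; xml_node handles are cheap to copy.
    std::vector<pugi::xml_node> children(root.begin(), root.end());

    // Parallel loop over random-access storage; each iteration must only
    // read from the tree, never mutate it.
    #pragma omp parallel for
    for (long i = 0; i < (long)children.size(); ++i)
    {
        pugi::xml_node item = children[i];
        // ... per-item, read-only computation goes here ...
        (void)item;
    }
}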

If you're talking about access to the children of nodes like article/phdthesis/etc., then ideally you should not need index-based access, since those children just represent properties of the item. Posting a code example may help me understand the use case better.
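
For instance, the fields of a single article can be read without any indices (a sketch; the helper name dump_article_fields is hypothetical, and it uses the children() range plus child_value() for named fields):

#include <cstdio>
#include "pugixml.hpp"

void dump_article_fields(pugi::xml_document& doc)
{
    pugi::xml_node article = doc.child("dblp").child("article");

    // Named access to a single field:
    printf("title: %s\n", article.child_value("title"));

    // Or iterate all fields in document order, no index-based access needed:
    for (pugi::xml_node field : article.children())
        printf("%s = %s\n", field.name(), field.child_value());
}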

@koustuvsinha (Author) commented:

As per my requirement I have to perform a DFS on the tree, get every child of each element, and perform some computation, kind of like this:

struct mdc_faster_calc : xml_tree_walker
{
    long t = 0;

    virtual bool for_each(xml_node& node)
    {
        cout << "Iteration : " << t++ << endl;

        if (node_types[node.type()] == "element")
        {
            // if there is no parent, assign label "0" to self
            if (string(node.parent().name()).size() == 0)
            {
                cout << "----------Parent Node-----------" << endl;
                label_hash.insert({node, "0"});
                m_hash.insert({node, 0});
            }
            else
            {
                // not a parent:
                // retrieve the parent's children tags & nodes
                node_info ni = getChildrenNodesWithTags(node.parent());
                // .... some linear computation with the retrieved nodes
            }
        }

        return true; // keep traversing
    }
};

This getChildrenNodesWithTags(node.parent()) call is where the bottleneck is; it is what creates the O(n^2) complexity. As for your suggestion of grabbing the children with XPath, here I have to do it at every step, so going that way I would have to generate an XPath query for every node.

@zeux (Owner) commented Apr 14, 2016

If getChildrenNodesWithTags just iterates through the children of a given node and returns them, then the bigger problem seems to be the fact that you call this function at all, not the implementation of the function.

Basically, regardless of how getChildrenNodesWithTags is implemented, it has to be linear, since it returns all the nodes in the parent. Plus you mention a linear computation with the retrieved nodes.

It seems like you're doing this processing redundantly, though: that code executes for every child of dblp (e.g. article), gets the parent (which is dblp, the root), and then fetches all 5M children of that node again and again. Is it possible to cache the results of this processing?
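
One way to cache it, reusing the node_info and getChildrenNodesWithTags names from the walker above (a sketch; the cache names are hypothetical, and it relies on pugi::xml_node providing comparison operators so it can serve as a std::map key):

#include <map>
#include "pugixml.hpp"

// Memoize the per-parent child scan so each distinct parent (e.g. the dblp
// root) is traversed once instead of once per child.
std::map<pugi::xml_node, node_info> info_cache;

const node_info& cached_children(pugi::xml_node parent)
{
    auto it = info_cache.find(parent);
    if (it == info_cache.end())
        it = info_cache.emplace(parent, getChildrenNodesWithTags(parent)).first;
    return it->second;
}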

@koustuvsinha (Author) commented:

Right, good idea. I will try to add a separate processing loop to cache the results and then reuse them. Thanks! Will keep you posted here.

zeux added the question label Jun 13, 2016
zeux closed this as completed Jun 13, 2016