Any ideas how to iterate faster using multi core? #96
Comments
@zeux the dataset is DBLP - Computer Science Bibliography. It's mostly a flat structure, i.e. a root with millions of children; depth is up to 3 or 4. I am trying to use OpenMP by putting the pragma above the loop, like this:

```cpp
#pragma omp parallel for
for (int n = 0; n < node.size(); ++n) { ... }
```

but for some strange reason it's not working. Will check out the compact mode as you mentioned.
This is a neat data set to test with! I wrote a simple test program that loads this XML into memory and then does a simple XPath query to get all nodes:
Without compact mode I'm getting 5.7 seconds to parse the file and 0.9 seconds to run the query, with 6.98 GB memory consumption. With compact mode I'm getting 4.8 seconds to parse the file and 0.3 seconds to run the query; memory consumption is 2.77 GB. These timings are pretty reasonable, but of course my test is very simple - the XPath query is trivial and I'm not doing any processing on its results. In your application, what's the distribution of time between parsing, populating the …
Yeah, actually I figured out that the processing inside the loop is what's taking time. I am trying to index the dataset with a Dewey-based coding scheme, so within each iteration I have to retrieve the child nodes, do some computation, get the computed index label, and go on. A faster iteration would have helped. On that note, is there any API to directly get the child nodes of any element? Since I am using a custom method to retrieve them, the complexity rises to O(n^2). I guess that's my bottleneck.
So you have a function that, given an xml_node and a child index, iterates through the children and returns the child? There's no way to provide faster access - you can think of the child container as a linked list, so indexing operations are naturally slow. I'd suggest grabbing the children of the root element once and storing them in a vector, e.g.:
And using that for indexing instead. If you're talking about access to children of nodes like article/phdthesis/etc., then ideally you should not need index-based access, since those just represent properties of the item? Posting a code example may help me understand the use case better.
As per my requirement I have to perform a DFS on the tree, get every child of each element, and perform some computation. Kind of like this:

```cpp
struct mdc_faster_calc : xml_tree_walker
{
    long t = 0;

    virtual bool for_each(xml_node& node)
    {
        cout << "Iteration : " << t++ << endl;
        if (node_types[node.type()] == "element")
        {
            // if no parent then assign 0 to self
            if (string(node.parent().name()).size() == 0)
            {
                cout << "----------Parent Node-----------" << endl;
                label_hash.insert({node, "0"});
                m_hash.insert({node, 0});
            }
            else
            {
                // not a parent
                // retrieve the parent's children tags & nodes
                node_info ni = getChildrenNodesWithTags(node.parent());
                // .... some linear computation with retrieved nodes
```
If getChildrenNodesWithTags just iterates through the children of a given node and returns them, then the bigger problem seems to be the fact that you call this function, not its implementation. Regardless of how fast getChildrenNodesWithTags is, it has to be linear since it returns all children of the parent. Plus you mention linear computation with the retrieved nodes. It also seems like you're doing this processing redundantly: that code would execute for every child of dblp (e.g. article), get the parent (which is dblp - the root), and fetch all 5M children of that node again and again. Is it possible to cache the results of this processing?
Right, good idea - I'll try adding a separate processing loop to cache the results and then reuse them. Thanks! Will keep you posted here.
Hi! This is an amazing library, and the only one that works for me while using the least RAM. I had a doubt regarding faster iteration. Currently I am iterating through a 2 GB XML file (5,000,000+ nodes) using the built-in iterator. Since it is taking quite a lot of time, I want to use multi-core support to iterate faster. I tried using OpenMP but it's not working for me. Any ideas?