root_numpy in rootpy? #13

Closed
ndawe opened this issue Mar 28, 2012 · 14 comments

Comments


ndawe commented Mar 28, 2012

Check this out:

https://github.com/piti118/root_numpy


piti118 commented Mar 28, 2012

There is a name clash, but that's a minor problem.
I looked through your code, and it seems to have a slightly different signature and a subtle difference in limitations.

tree_to_recarray(trees, branches=None,
                 use_cache=False, cache_size=1000000,
                 include_weight=False,
                 weight_name='weight',
                 weight_dtype='f4')

vs

root2array(fnames, treename, branches=None)

The reason I chose this signature is that I have absolutely no idea how to get a C++ TTree from a PyROOT TTree object (so that I can avoid all Python function calls). Do you happen to know how? So the subtle difference is that root_numpy.root2array needs a tree from a file, while tree_to_recarray can probably read an in-memory TTree as well.

How would you like to include it?

If you want to include it as it is, just copy the files over, change setup.py, and rename the methods. You will then have two methods that serve very similar purposes, which is kind of odd, but I think people can probably live with it. Bridging them will require knowing how to get a C++ TTree from a PyROOT TTree object. The PyROOT header doesn't seem to provide such a thing. Without it, going back through Python calls just defeats the purpose of having a C extension.

I'm not even sure whether it compiles under Windows. My setup.py depends on a call to root-config, so users may have trouble installing rootpy because of this extension.

One more thing: the numpy dtype string for booleans is actually "bool". "b" means a signed byte, not a boolean. This took me a good chunk of time to figure out. I know the documentation says "b", but that doesn't work.
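To make the "b" versus "bool" point concrete, here is a small sketch using only standard numpy calls. The field names `passed` and `charge` are made up for illustration. One likely source of the confusion is that a boolean dtype reports its `kind` as "b", while the type character "b" passed as a dtype string means a signed 8-bit integer.

```python
import numpy as np

# "b" as a dtype string is a signed 8-bit integer, not a boolean.
byte_dtype = np.dtype("b")

# "bool" (or the one-character code "?") gives the boolean dtype.
bool_dtype = np.dtype("bool")
short_bool_dtype = np.dtype("?")

# The kind codes differ from the type characters:
# a boolean dtype has kind "b", a signed byte has kind "i".
kinds = (bool_dtype.kind, byte_dtype.kind)

# Hypothetical record layout with one boolean and one byte column.
rec = np.zeros(3, dtype=[("passed", "bool"), ("charge", "b")])
```

So when building a structured dtype for a recarray, spelling out "bool" avoids silently getting an int8 column.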


jklukas commented Mar 28, 2012

This is very cool, especially that it avoids pyroot altogether, which I
would imagine makes it much more portable. It's not clear to me yet what
parts of rootpy this could be useful for.



piti118 commented Mar 28, 2012

That's one of the issues as well.

The philosophy behind root_numpy is to avoid PyROOT altogether and have the user do everything in standard numpy/matplotlib while still being able to read data from ROOT files.

I even patched numpy so that the column names of a recarray auto-complete in IPython. I found that my workflow is greatly improved with root_numpy/numpy/matplotlib/IPython notebook.

But having said that, feel free to take the code.


ndawe commented Mar 28, 2012

@piti118 thanks again for this great package. The idea is that root_numpy could be used by tree_to_recarray to convert a TTree into a numpy array and avoid all the overhead of looping in Python (currently making this method slow).

Yes, tree_to_recarray currently reads the tree in memory. You might be able to pass a TTree (or a Tree in the rootpy framework, which inherits from TTree) to the C extension with ROOT.AsCObject(t), where t is a TTree, which gives you a PyCObject. The C extension can then cast it to a TTree:

>>> from rootpy.io import File
>>> from rootpy.tree import Tree
>>> a = File('test.root', 'recreate')
>>> t = Tree()
>>> t
Tree('44e831cd3e7b4a228ddf279aa285b0d9')
>>> import ROOT
>>> ROOT.AsCObject(t)
<PyCObject object at 0x378b0a8>

No worries about API differences. These things can be sorted out later and rootpy's API isn't written in stone yet.

One of the major goals of rootpy is to provide a way to easily integrate ROOT within the vast ecosystem of scientific Python packages like numpy, scipy, matplotlib, pytables, scikit-learn, etc... and root_numpy would be a very nice improvement to rootpy's numpy interface. I fear that too many physicists are limiting themselves to ROOT and have no idea that any of these other very powerful frameworks exist.

Would you be interested in maintaining root_numpy as a subpackage of rootpy? Or, if you prefer to continue maintaining it as a separate package, we could simply add your git repository as a subrepository (submodule) of rootpy's repository.


piti118 commented Mar 28, 2012

I'll try PyCObject.

In that case, I think maintaining it as a subrepository or under the rootpy organization is a better idea, in case people have trouble compiling the C extension.

Then tree_to_recarray can first check whether it can import root_numpy (a failed import raises ImportError). If it can, it calls the faster C extension; otherwise it falls back to the pure Python code.
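The import-then-fall-back idea could be sketched roughly as follows. This is only an illustration: the `convert` helper and its parameters are hypothetical, and calling `pyroot2rec` with a single tree argument is an assumption rather than root_numpy's documented signature.

```python
import importlib


def convert(tree, pure_python, module_name="root_numpy", fast_attr="pyroot2rec"):
    """Use the compiled extension when it imports, else the pure Python path.

    `pure_python` is the slow fallback callable. `module_name` and
    `fast_attr` are illustrative defaults; the real call signature of the
    fast function is assumed to be f(tree).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Extension not installed or failed to compile: use the slow path.
        return pure_python(tree)
    return getattr(module, fast_attr)(tree)
```

With this shape, users who cannot build the extension still get a working (if slower) conversion without changing their code.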


piti118 commented Mar 28, 2012

Done. Check out the head:
root_numpy.pyroot2rec
and
root_numpy.pyroot2array

It's more subtle than I thought because PyCObject is deprecated in 2.7 and 3.1, and removed in Python 3.2. It's being replaced with capsules, but capsules don't exist in Python 2.6, and ROOT doesn't provide an AsCapsule interface yet -_-". But it's all good now with a C preprocessor directive.

See: http://docs.python.org/howto/cporting.html


ndawe commented Mar 29, 2012

Thanks! I can try integrating root_numpy into rootpy tomorrow. Thanks again, that was fast.


piti118 commented Mar 29, 2012

It would also be nice if you could benchmark the pure Python version against this one.


ndawe commented Mar 30, 2012

A huge improvement (see rootpy/benchmarks/tree/root2array):

./test.py 
Using pure Python method...
         4400080 function calls in 19.807 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   19.807   19.807 <string>:1(<module>)
   800000    1.999    0.000    1.999    0.000 records.py:223(__getattribute__)
        1    0.000    0.000    0.000    0.000 records.py:390(__new__)
        1    0.000    0.000    0.000    0.000 records.py:407(__getattribute__)
   800000    2.806    0.000    7.594    0.000 records.py:456(__getitem__)
        1    5.339    5.339   19.807   19.807 root2array.py:37(tree_to_recarray)
   100001    0.334    0.000    0.463    0.000 tree.py:1025(reset_collections)
        1    0.000    0.000    0.000    0.000 tree.py:210(use_cache)
   100001    1.757    0.000    2.220    0.000 tree.py:243(GetEntry)
   100001    0.589    0.000    2.808    0.000 tree.py:248(__iter__)
        1    0.000    0.000    0.000    0.000 tree.py:274(__setattr__)
        1    0.000    0.000    0.000    0.000 tree.py:448(GetEntries)
        1    0.000    0.000    0.000    0.000 tree.py:968(set_tree)
        3    0.000    0.000    0.000    0.000 tree.py:980(__setattr__)
        8    0.000    0.000    0.000    0.000 types.py:541(convert)
   800000    2.491    0.000    4.066    0.000 types.py:64(value)
   800000    1.575    0.000    1.575    0.000 types.py:85(__getitem__)
        1    0.000    0.000    0.000    0.000 utils.py:14(asrootpy)
        1    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x84e840}
   800014    2.789    0.000    4.788    0.000 {isinstance}
        8    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        6    0.000    0.000    0.000    0.000 {method 'has_key' of 'dict' objects}
        8    0.000    0.000    0.000    0.000 {method 'index' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
   100001    0.129    0.000    0.129    0.000 {method 'iterkeys' of 'dict' objects}
       16    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {sum}


time without profiler overhead:
8.278806 seconds
========================================
Using compiled C extension...
Warning: unknown root type: vector<float> skip 
Warning: unknown root type: TLorentzVector skip 
Warning: unknown root type: vector<float> skip 
Warning: unknown root type: vector<float> skip 
         16 function calls in 0.459 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.459    0.459 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 ROOT.py:416(__getattr2)
        1    0.000    0.000    0.455    0.455 __init__.py:44(pyroot2array)
        1    0.000    0.000    0.455    0.455 __init__.py:66(pyroot2rec)
        2    0.000    0.000    0.000    0.000 records.py:407(__getattribute__)
        1    0.000    0.000    0.459    0.459 root2array.py:20(tree_to_recarray_c)
        1    0.000    0.000    0.000    0.000 {dir}
        2    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {libPyROOT.AsCObject}
        1    0.000    0.000    0.000    0.000 {libPyROOT.LookupRootEntity}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.004    0.004    0.004    0.004 {numpy.core.multiarray.concatenate}
        1    0.455    0.455    0.455    0.455 {rootpy.root2array.root_numpy.src.croot_numpy.root2array_from_cobj}


time without profiler overhead:
Warning: unknown root type: vector<float> skip 
Warning: unknown root type: TLorentzVector skip 
Warning: unknown root type: vector<float> skip 
Warning: unknown root type: vector<float> skip 
0.453431 seconds
========================================
Comparison of output:
[ (8, 0, 52.7789192199707, 3.788414478302002, -0.9572163224220276, -1.6373909711837769, 2.4526565074920654, -22.792055130004883)
 (1, 1, -10.197294235229492, -0.13024774193763733, 1.648690104484558, 0.7045357823371887, 1.670132040977478, 64.8749771118164)
 (8, 2, 58.47273254394531, 1.918209195137024, 1.3651632070541382, -0.28253698348999023, 2.201446771621704, 16.73053550720215)
 ...,
 (4, 99997, 110.29533386230469, -0.5117789506912231, -0.01048351638019085, 0.11074820905923843, -2.3150572776794434, -10.907495498657227)
 (6, 99998, -33.58203125, -2.158933401107788, 1.2362463474273682, -0.6049940586090088, 4.583289623260498, 1.942242980003357)
 (1, 99999, 22.445335388183594, 3.3832831382751465, -0.26965275406837463, -0.7595866322517395, -2.5078225135803223, 16.156869888305664)]
[ (8, 0, 52.7789192199707, 3.788414478302002, -0.9572163224220276, -1.6373909711837769, 2.4526565074920654, -22.792055130004883)
 (1, 1, -10.197294235229492, -0.13024774193763733, 1.648690104484558, 0.7045357823371887, 1.670132040977478, 64.8749771118164)
 (8, 2, 58.47273254394531, 1.918209195137024, 1.3651632070541382, -0.28253698348999023, 2.201446771621704, 16.73053550720215)
 ...,
 (4, 99997, 110.29533386230469, -0.5117789506912231, -0.01048351638019085, 0.11074820905923843, -2.3150572776794434, -10.907495498657227)
 (6, 99998, -33.58203125, -2.158933401107788, 1.2362463474273682, -0.6049940586090088, 4.583289623260498, 1.942242980003357)
 (1, 99999, 22.445335388183594, 3.3832831382751465, -0.26965275406837463, -0.7595866322517395, -2.5078225135803223, 16.156869888305664)]
[-0.95721632  1.6486901   1.36516321 ..., -0.01048352  1.23624635
 -0.26965275]
[-0.95721632  1.6486901   1.36516321 ..., -0.01048352  1.23624635
 -0.26965275]
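The output above reports both a profiled run and a "time without profiler overhead" (8.278806 s against 0.453431 s, roughly an 18-fold difference). A harness producing that kind of report might look like the sketch below. It is not the actual rootpy benchmark script; the `benchmark` and `slow_sum` names are made up for illustration, and it uses only the standard library.

```python
import cProfile
import io
import pstats
import time


def benchmark(func, *args):
    """Profile one call of func, then time a second call without the profiler."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)

    # Render the profile sorted by "standard name", as in the output above.
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("stdname").print_stats()

    # Second run, wall-clock only, so profiler overhead is excluded.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, stream.getvalue()


def slow_sum(n):
    """Stand-in workload: a plain Python loop."""
    total = 0
    for i in range(n):
        total += i
    return total


result, elapsed, report = benchmark(slow_sum, 1000)
```

Timing twice matters here because the pure Python path makes millions of small calls, so the profiler inflates it far more than the single call into the C extension.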


piti118 commented Mar 30, 2012

Awesome


ndawe commented Apr 1, 2012

root_numpy has been working very well! Already using it in my analysis...

Just some comments: it's fine that root_numpy skips branches that are not basic types when branches is not specified, but I think it should raise a TypeError if the user explicitly lists a branch in branches that is not of a basic type.

There also seems to be a problem if a branch in branches does not exist (I got a segfault). In this case maybe raising a ValueError is best.

I made a few modifications on our rootpy branch of root_numpy. One is to allow empty trees. I think it's fine to return an empty array in this case.
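The proposed behaviour for the branches argument could be sketched like this. It is a pure Python illustration, not root_numpy's code: `select_branches`, the `available` mapping of branch name to ROOT type name, and the `BASIC_TYPES` subset are all hypothetical.

```python
# Illustrative subset of ROOT basic type names; the real list is longer.
BASIC_TYPES = {"Int_t", "UInt_t", "Float_t", "Double_t", "Bool_t"}


def select_branches(available, requested=None):
    """Return the branch names to convert.

    available: dict mapping branch name -> ROOT type name (assumed input).
    requested: None to take every basic-type branch, or an explicit list.
    """
    if requested is None:
        # Nothing requested: silently skip non-basic branches.
        return [name for name, typ in available.items() if typ in BASIC_TYPES]
    for name in requested:
        if name not in available:
            raise ValueError("branch %r does not exist" % name)
        if available[name] not in BASIC_TYPES:
            raise TypeError(
                "branch %r has unsupported type %s" % (name, available[name])
            )
    return list(requested)
```

Checking names before touching the tree is also what would turn the segfault on a missing branch into an ordinary exception.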


piti118 commented Apr 1, 2012

Fixed the segfault (I believe). Check the head.


piti118 commented Apr 1, 2012

The rationale for raising an error on an empty tree is that it's usually caused by a typo in the filename, and I aim for this to be used in an interactive environment. Maybe I should raise something else, like a file-not-found exception.
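One way to separate the two cases is to check the file names up front, so a typo raises a file-not-found style error while a genuinely empty tree can still return an empty array. A minimal sketch, assuming the names are literal paths rather than glob patterns (`check_input_files` is a hypothetical helper, not part of root_numpy):

```python
import os


def check_input_files(fnames):
    """Raise IOError for any path that does not point to an existing file."""
    if isinstance(fnames, str):
        fnames = [fnames]
    missing = [f for f in fnames if not os.path.isfile(f)]
    if missing:
        raise IOError("file(s) not found: " + ", ".join(missing))
    return list(fnames)
```

In an interactive session this gives an immediate, readable message naming the mistyped file.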


pwaller commented Nov 22, 2012

Can this issue be closed and/or broken into new issues? @ndawe @piti118

piti118 closed this as completed Nov 30, 2012