Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak performance of quantify lib #32

Open
cblock opened this issue Dec 18, 2012 · 6 comments
Open

Tweak performance of quantify lib #32

cblock opened this issue Dec 18, 2012 · 6 comments

Comments

@cblock
Copy link
Contributor

cblock commented Dec 18, 2012

Hi Andrew,

currently I'm in the process of performance tweaking quantify_mongoid. That's when I noticed that Quantity#to_si and Unit#for operations are very slow. Tracking it down further (see https://github.com/cblock/quantify/tree/perf/perf) I found out that Dimensions#to_hash is both expensive in terms of cpu cycles and frequently called when invoking #to_si on a quantity.
Same goes for Module#escape_multi_word_units when invoking Unit#for on complex units.

My feeling is that, e.g. #to_hash could be momoized but when I started refactoring I also noticed that you allow reconfiguration of dimensions and units during runtime. This in turn makes it necessary to invalidate memoized values for #to_hash and other operations.

So my question is: Is it really necessary to support reconfiguration during runtime? I found that you use that in quantify_amee to unregister some units. But wouldn't it be better to load everything once during initialization (e.g. into Singletons) and then not change it until app shuts down?

Also: I'm happy to help out refactoring quantify lib into that direction but I'm not too sure where to start. Any suggestions are greatly appreciated.

@cblock
Copy link
Contributor Author

cblock commented Dec 18, 2012

Just a thought: Why not defining all units an app may use in a yml file, which gets parsed exactly once during startup. The parser then creates Unit singletons for all defined units...

@cblock
Copy link
Contributor Author

cblock commented Dec 19, 2012

Just finished refactoring Dimensions class with its notoriously slow #to_hash method: https://github.com/cblock/quantify/

This brings down spec runtime from ~16 secs to ~5 secs already. The benchmark is even better (Quantity#to_si uses #to_hash internally):

       user     system      total        real
[ Quantity#to_si ]
  11.860000   0.100000  11.960000 ( 11.969939)

         user     system      total        real
[ Quantity#to_si ]
   0.970000   0.000000   0.970000 (  0.972983)

Let me know what you think. Should I send you a pull request for the main branch?

@spatchcock
Copy link
Owner

Hey Carsten. Sounds great, certainly send a pull request.

I like the idea of specifying units once on initialization. That was my intention really when using the config files, although given the 'default' configuration (currently, all units), a custom configuration might involve dumping a load of previously defined units which I don't think is very nice. Trouble is, there has to be some default units (and dimensions), for example, the SI units against which all other units are referenced.

I think I would prefer a system where a smaller set of required units is always instantiated and others can be pulled in if desired, perhaps in a modular fashion (require 'cgs', require 'imperial', etc.), or individually defined. YAML format sounds fine. Currently, you can override the default config by running additional config in an initializer (or elsewhere). I've basically never been able to decide on the best solution or anticipate what would suit others best (convention or configuration?).

The Dimension#to_hash method is mostly used in operations (multiply, divide, etc) between two dimensions. I think in most cases it is passed in to the private #enumerate_base_quantities method which alters the base quantities on the receiver. The memoized hash would probably need refreshing at the end of that method.

I have thought about using singletons to represent units but never got to a satisfactory point. At present, an instance of each method is held in the Unit.units 'module' variable and then cloned anytime it is requested. So we get many instances of each unit existing. Are you suggesting that we have only one (singleton) instance of, say, a kg unit, which is used in all operations involving a kg? That would be nice, but I think there would be complications where compound units and unit operations come in to play, so it might take a bit of work.

There is a lot of fairly messy code in Unit.rb which provides the versatile functionality for accessing units via names, pluralised names and symbols for single and compound units and units within strings. Unit::escape_multi_word_units is part of this functionality. It basically allow multi word units to be recognised within larger strings. I have never been very happy with most of this parsing and regex code. Although it works fairly well, it just seems way to messy. One implication of removing the ability to configure units at runtime would be that a number of these methods (which basically build lists and regexen) would be static and could therefore be memoized.

Cheers

Andrew

@cblock
Copy link
Contributor Author

cblock commented Dec 19, 2012

Hi Andrew,

refactoring dimensions.rb was quite a challenge for me because there are so many back and forth references between Unit, BaseUnit, Dimensions etc. With that experience in mind and after having read your comments above I would suggest the following:

1.) I like your idea of always loading a minimal set of default units (convention) and optionally adding more units through configuration. I guess the minimal set should be unity + si_units, right? Then lets make these ones (immutable) singletons.

2.) Make all units immutable. A unit instance should never change its dimensions during its life.
If we multiply two units the result should be a new (immutable) unit and not an updated original unit. That shouldn't be too hard to realize as you already have this nice concept of compound units and you mostly clone the units right now already. Note: Consolidating units (e.g. m^3/m^2) may then result in a reference to the original base unit instance again. I don't see any reason why this could be a problem.

3.) Rethink some of the class concepts: For example, Dimensions.rb mixes the concept of a single dimension with that of a set of dimensions. Breaking this down into a Dimension.rb and a Dimensions.rb would make both comprehension and testing easier for outsiders like me. Another thing could be the extraction of unit name parsing into a parser class on its own. This would concentrate all parsing code into one class and make Unit.rb more lightweight as it then only has to represent 'a unit'.

For my current web project (where quantify_mongoid is used) I suffer heavily from Unit.for() being really, really slow. The scenario is a batch import of thousands of records, each holding several quantities. To ensure that all imported quantities come with valid units, I have a Unit.for() call for each entry. My idea right now is to speed things up using (cached) Unit#unit_label_regex, Unit#unit_symbol_regex etc. If that's not enough I will digg into Unit.rb next to speed it up too.

Cheers,
Carsten

@cblock
Copy link
Contributor Author

cblock commented Dec 20, 2012

One more thing that comes to my mind: If there's a UnitParser class this class could also maintain a cache of recently parsed units. Like this, complex unit parsing (think 'm² kg/s³ A') is costly only once.

@cblock
Copy link
Contributor Author

cblock commented Dec 20, 2012

And another thing:
Why not bootstrap quantify lib through a configurator class, This is responsible for reading (yml) config and to instantiate the system of known units. After having initialized all units it then initializes (resets) the UnitParser, which triggers (re-)building of the complete set of regexes required to validate and parse units. Like this building regexes (currently one of the most expensive operations - see https://github.com/cblock/quantify/blob/f99b93128ec38ab99c87b25cd432c3fb6b80c1e3/perf/quantity_to_si.profile.pdf) is done only once during (re-) initialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants