Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Koala is getting increasingly difficult to manage -- re-write? #249

Open
bradbase opened this issue Mar 6, 2020 · 30 comments
Open

Koala is getting increasingly difficult to manage -- re-write? #249

bradbase opened this issue Mar 6, 2020 · 30 comments

Comments

@bradbase
Copy link

bradbase commented Mar 6, 2020

Hi Everyone,

I have a vested interest in increasing Koala's maintainability. My project FlyingKoala relies on Koala and I am trying to start a business off the back of it. Some of changes to Koala in the last 12 months, which have helped move Kola forward, have removed features I rely upon and have made it very difficult to re-instate the features I need.

I am in a situation where I have been putting effort into re-writing Koala so we (/I) can have a modernized code base to work from while supporting the features we all want and making the code maintainable.

The original intent was to just remove all the things that are in there for no reason, and keep as much as possible. This was quickly proven not realistically possible. So I've used the "good" bits -- the stuff that does the AST and evaluation -- and have been slowly re-writing the surrounding code.

I have a version now which supports:

  • parsing cells, formulas and defined names into a dataclass representing a spreadsheet ("model")
  • persisting and loading the model dataclass
  • evaluation of cell values (eg; no formulas), and formulas which use cells

I am about 50% through supporting ranges. Ranges are being recognized in defined names and formulas and are being resolved to collections of cell addresses. I am today beginning work on having those collections of cell addresses packaged and passed to the formula functions.

Once Ranges are supported, the basics of Koala's features will have been successfully re-written. I have ignored OFFSET and INDEX for now. I am yet to test (or consciously support) unicode.

The ignore, input/output are not yet supported but need to be for feature parity (and are unbelievably useful).

And I'm yet to implement a method to set value.

Besides re-using a small amount of code for parsing formulas and executing the subsequent Python code, this is essentially a completely new codebase.

Is there an appetite to try and use a re-write for Koala moving forward or is it best this becomes a new project?

I'm happy to make the code available for critique.

@bradbase
Copy link
Author

bradbase commented Mar 6, 2020

The current example of use:

import logging
from new_koala import ModelCompiler
from new_koala import Model
from new_koala import Evaluator

logging.basicConfig(level=logging.DEBUG)

filename = 'reader.xlsm'
compiler = ModelCompiler()
new_model = compiler.parse_excel_file(filename)

new_model.persist_to_json_file('brad_graph.json.gzip')

reconstituted_model = Model()
reconstituted_model.construct_from_json_file('brad_graph.json.gzip')
# reconstituted_model.draw_graph()
reconstituted_model.build_code()

evaluator = Evaluator(reconstituted_model)
val1 = evaluator.evaluate('First!A2')
print("value 'evaluated' for First!A2 without a formula:", val1)
val2 = evaluator.evaluate('Seventh!C1')
print("value 'evaluated' for Seventh!C1 with a formula:", val2)
val3 = evaluator.evaluate('Ninth!B1')
print("value 'evaluated' for Ninth!B1 with a defined name:", val3)
val4 = evaluator.evaluate('Hundred')
print("value 'evaluated' for Hundred with a defined name:", val4)
val5 = evaluator.evaluate('Tenth!C1')
print("value 'evaluated' for Tenth!C1 with a defined name:", val5)
val6 = evaluator.evaluate('Tenth!C2')
print("value 'evaluated' for Tenth!C2 with a defined name:", val6)
val7 = evaluator.evaluate('Tenth!C3')
print("value 'evaluated' for Tenth!C3 with a defined name:", val7)

Current output (defined name "Hundred" is cell Eighth!B1):

Cell First!A2 has no formula
but its value is 0.1

value 'evaluated' for First!A2 without a formula: 0.1
Evaluator.apply("add",self.eval_ref("Seventh!A1", ref = None),self.eval_ref("Seventh!B1", ref = None),None)
DEBUG:root:
Cell Seventh!A1 has no formula
but its value is 1

DEBUG:root:
Cell Seventh!B1 has no formula
but its value is 2

func add first 1 second 2 ref None
DEBUG:root:
Cell Seventh!C1 has a formula, A1+B1
which has been translated to Python as Evaluator.apply("add",self.eval_ref("Seventh!A1", ref = None),self.eval_ref("Seventh!B1", ref = None),None) which evaluates to 3

value 'evaluated' for Seventh!C1 with a formula: 3
Evaluator.apply("add",self.eval_ref("Ninth!A1", ref = None),self.eval_ref("Eighth!B1", ref = None),None)
DEBUG:root:
Cell Ninth!A1 has no formula
but its value is 1

DEBUG:root:
Cell Eighth!B1 has no formula
but its value is 100

func add first 1 second 100 ref None
DEBUG:root:
Cell Ninth!B1 has a formula, A1+Hundred
which has been translated to Python as Evaluator.apply("add",self.eval_ref("Ninth!A1", ref = None),self.eval_ref("Eighth!B1", ref = None),None) which evaluates to 101

value 'evaluated' for Ninth!B1 with a defined name: 101
DEBUG:root:
Cell Eighth!B1 has no formula
but its value is 100

value 'evaluated' for Hundred with a defined name: 100
xSum.xsum(self.eval_ref("Tenth!A1:B1", ref = None))
ERROR:root:Empty cell at Tenth!A1:B1
ERROR:root:ExcelError, #NULL, Cell Tenth!A1:B1 is empty
DEBUG:root:
Cell Tenth!C1 has a formula, SUM(A1:B1)
which has been translated to Python as xSum.xsum(self.eval_ref("Tenth!A1:B1", ref = None)) which evaluates to 0

value 'evaluated' for Tenth!C1 with a defined name: 0
xSum.xsum(self.eval_ref("Tenth!A2:B2", ref = None))
ERROR:root:Empty cell at Tenth!A2:B2
ERROR:root:ExcelError, #NULL, Cell Tenth!A2:B2 is empty
DEBUG:root:
Cell Tenth!C2 has a formula, SUM(A2:B2)
which has been translated to Python as xSum.xsum(self.eval_ref("Tenth!A2:B2", ref = None)) which evaluates to 0

value 'evaluated' for Tenth!C2 with a defined name: 0
xSum.xsum(self.eval_ref("Tenth!A3", ref = None),self.eval_ref("Eighth!B1", ref = None))
DEBUG:root:
Cell Tenth!A3 has no formula
but its value is 2

DEBUG:root:
Cell Eighth!B1 has no formula
but its value is 100

DEBUG:root:
Cell Tenth!C3 has a formula, SUM(A3,Hundred)
which has been translated to Python as xSum.xsum(self.eval_ref("Tenth!A3", ref = None),self.eval_ref("Eighth!B1", ref = None)) which evaluates to 100

value 'evaluated' for Tenth!C3 with a defined name: 100

@bradbase
Copy link
Author

bradbase commented Mar 6, 2020

Features I can see a pathway to support in the new approach:

  • persisting different formats from the model (compressed/not compressed, GraphML, MS Excel, etc...)
  • being able to evaluate ad-hoc formulas, so they don't have to be in Excel at all... eg: evaluator.evaluate_formula("=AVERAGE(Sheet1!A1:A5)")
  • More simplified pathways to adding Excel function support

@brianmay
Copy link
Collaborator

brianmay commented Mar 6, 2020 via email

@bradbase
Copy link
Author

bradbase commented Mar 6, 2020

Thanks for the input Brian.

In short, I'd be leaning toward something like "option 1" as you've described. But my needs and intentions for a Koala style solution appear to be different from others here which is why I'm asking the question.

It depends on how committed people are to the current use cases and what Koala is expected to become.

I think I have been observing a movement toward viewing Koala as "replacing a spreadsheet" eg; openpyxl with a "calculate" button. But this is not the direction I've taken. I have been viewing Excel workbooks as definitions of a calculation (a model or part thereof). Where the model is an abstraction that just so happens to be expressed in the form of an Excel spreadsheet. So, in that sense, (as I see it) Koala isn't simply replacing the Excel calculation engine (eg; openpyxl with a "calculate" button) but is more a toolkit which uses Excel as an interface for defining parts or whole mathematical systems which can use Python as a gateway to evaluation. If used in one way it would achieve the purposes of openpyxl with a "go" button. But would offer significantly more for those who are inextricably bound to Excel but need more from their math. Now that I've written that, it really sounds like I've got an independent project... (again, depending on what Koala is understood to be and which direction it is heading)

To be 100% honest, I've not been measuring for speed. My sole measure has been to get a solution which satisfies the features of roughly Koala 0.0.31 which can actually be worked on. My assumption has been that increasing speed (and other code progression) will be easier to achieve if;

  • there isn't code with comments like "does anyone know what this does?", and
  • where methods are not duplicated multiple times across the solution,
  • where we aren't using lru caches in place of hashmaps /dictionaries (because we can't figure out WTF a method does or are scared/unable to implement a purpose built cache)

The question of speed is interesting, though. Which element of the solution needs to be faster Depends on what you need to be faster, and what is meant by fast.

I have broken the process down into steps.

  1. load the Excel file into a Python compatible state, I call a model. (ModelCompiler and parse_excel_file())
  2. interpret the model into Python executable code (Model and build_code())
  3. evaluation. executing the Python code (Evaluator and evaluate())

After either step 1 and/or 2, the Model can be persisted to/from JSON files. So these steps only really need to be done once (unless the spreadsheet changes). And if we can implement Evaluator.evaluate_function("=Average(Sheet1!A1:A5)"), it may only ever need to be done once because this function could "freestyle"/ad-hoc/"interactively" calculate against the pre-existing model.

Step 3, I'm guessing, is actually where you're asking for "speed".

Step 1: Reading and translating Excel into Python

I am now using xml.etree.ElementTree to read the interesting parts of the Excel file rather than using custom regex and string processing. Also, providing you don't change the spreadsheet, translating the Excel workbook into a Python compatible state only has to happen once. You can persist the Python compatible state as JSON (I'm using JSON and jsonpickle). And, equally (obvjously?), load the Python compatible state from the JSON. So if you've got a very complex spreadsheet that took ages to translate, I expect it'll be easier/quicker to load from JSON. I'm hoping (for my needs), once the input and output concepts are implemented, the library will be able to load everything associated with one formula (ergo, one specified cell). This will optimize the Python compatible state. I can't see an easy way to optimize reading the Excel file in that situation, but the resulting Python compatible state would be optimized to contain only the cells, defined names, ranges and formulas associated in calculating that specified cell. Maybe this has to happen in tandem with "step 2"..?

Step 2: Converting the formulas into Python

This is reasonably quick because Step 1 has already parsed cells and recorded all the discovered formulas. I'm not convinced the process to tokenize and convert to Python code is a particularly bad bottleneck.

Step 3: Evaluating the results

I have broken the Excellib functions out into their own objects so it's possible to have multiple ways to define a function. There is a trap, Excel is occasionally incorrect in its calculations and people now count on those bugs so we may want strict Excel compatible modes along side mathematically correct modes. Many of the existing Excellib definitions ignore the faults in Excel and so, pedantically, may produce results different to Excel. We may also want implementations in vanilla Python, numpy/pandas, wrapped C++ libraries, and Jython. Currently this element of the process is going to rely on dictionary lookups to extract the values from cells and then package them in a way which the Excellib function can handle (this makes set_value() really easy to implement). As far as I'm aware, dictionaries in Python are optimized. I'm happy if someone wants to alter this paradigm to optimize for speed. But my current focus is on getting a maintainable project running so dicts are well good enough for me ATM. I'm expecting to use dataframes as the structures representing ranges to pass to the Excellib functions.

Which aspect of speed are you interested in?

@bradbase
Copy link
Author

bradbase commented Mar 6, 2020

Equally... There's no reason why the AST interpreting the formulas needs to write Python code.

It can write any code you like. BUT the code it writes does need align with a Model object in the target language, the "Excellib" function library in the target language, and have a Evaluator object written in the target language.

@vallettea
Copy link
Owner

hello @bradbase,

nice recap.
I see your point when you say "Excel workbooks as definitions of a calculation" and it is in fact a bit different of what koala was coded for. Koala was coded because we needed to do enormous amount of calculation from gigantic excel workbooks and we had no time to translate the models of these workbooks into a well written model.
From my point of view (of someone who hates excel), defining a model through excel is not the way to go if you know maths and code because you always have to deal with this tabular structure that creates a bias in the way you have to lay out things. However, for people who don't code (and they are may) your approach may be interesting.

As you can noticed, i'm not really active on koala and I'm really open to how it should be re-thought and maintained.

Would your project be open-source ?

@bradbase
Copy link
Author

bradbase commented Mar 9, 2020

Hi @vallettea,

Thank you for your input. It is most appreciated. Hopefully my discussion here will answer questions from others as well.

My understanding of koala has come from the doco. The presentation, the abstract found on the anthill site and the other writing in the docs directory along side reading the code. Knowing the history, and having faced the same challenges, explains everything. And as I've formed an understanding I've seen great inspiration.

I absolutely agree with you about Excel being sub-optimal for defining a mathematical model. The limitations are excruciating. I hate it too. Finding ways around Excel has lead me to become an "expert"(/?) in it.

Over the last 10 years I have been working as an analyst is the energy sector, programmer and a consulting to a demographics company with some time spent in telephone billing (accounts receivable - read big data). The skills these companies have valued the most has been in the space of mathematical modelling and communicating mathematical modelling.

What I have observed is; each case the domain expert is usually not skilled in programming. This causes some challenges;

  • intra-company communications surrounding calculations. Especially between engineer(/s), data analyist/programmer(/s) and manager(/s),
  • inter-company communications surrounding calculations as company A may need to check the workings of company B which requires both sides have comparable expertise in a specific language (which is never trusted),
  • Excel is ubiquitous and is the first line to express a mathematical model as other tools have traditionally been expensive and require specialist knowledge to install, support and/or use. This causes networks of spreadsheets with 100+ tabs each and endless copy+pasted poorly understood VBA which, often enough, cannot be understood thoroughly enough by anyone,
  • Key person risk on these mega spreadsheets causes a multitude of severe problems.

These challenges are compounded by the domain expert's time being too valuable for them to sufficiently learn programming. So companies turn to hiring analysts, programmers and contractors to get what they need from the investment they've made in the math expressed in Excel. When they learn it'll take 3 years to translate Excel to code which will tie them to a new world of technical debt and still not assure them of correctness and accuracy, it's deemed unworthy.

Microsoft are trying to alleviate this problem taking Excel to the cloud. This may (or may not) eventuate in helping. Faster processing of Excel is only part of the answer. What is really useful is a common language to express math which can be understood by engineers, managers and programmers both within and across companies. The most important of all these positions, however, are auditors. The question of proof that the audited and approved Excel model has been precisely translated into another language so often comes down to "trust me" which is obscene. Accepting the significant deficiencies, currently Excel is the one language of expression which "everyone" knows.

So, I have seen that using Koala can resolve the above challenges. They keys are:

  • keeping the domain expert working in a familiar environment,
  • removing the need for people (programmers) in language translation,
  • facilitating the writing of Python math "objects" (can't think of a better name) without the use of a programmer or analyst. If you do have a programmer handy the Python math "objects" could then be run in places like PLPython. Good use of defined names helps automate the koala "object" to database field name resolution. I have committed code (Tokenizer.py:620) code which helps software outside koala "autodetect"/query the input names to an equation system,
  • If you couple koala with other tools (xlwings), you can define User Defined Functions (UDF) that use the Python math "objects" from koala directly in Excel formulas which helps leverage the domain expert's time. In the case of xlwings, UDF definition can be easy enough for non-programmers to write. Again, the code I committed (Tokenizer.py:620) helps software like xlwings "autodetect"/query the inputs to an equation system. See FlyingKoala for examples.

The advantage of above has already been exampled in my project FlyingKoala.

I can accept koala has been written to essentially replace the Excel calculation engine to gain speed and remove the translation problem. Eg; feature compatible with Excel, just faster -- openpyxl with a "go" button. But the features koala had (around 0.0.31 and before) were properly genius. Where you could;

  • save an equation and its state,
  • load an equation and its state,
  • "focus" koala on only certain outputs (eg; ignore everything except a given equation system)

These are the things that turn koala from "Excel but faster" into a really powerful tool to leverage domain experts and ease communication between all stakeholders. Sadly much of this was broken post version 0.0.31.

My interests are such:

  • an open source tool which services the "keys", as mentioned above (I'd like it to be Koala, but I need it in any case),
  • my project (FlyingKoala) is expected to remain open source (hopefully leveraging knowledge domains other people are in),
  • Flyingkoala already has an Excel add-in, but I am writing a commercial "one-click installer" which will appear in the Excel in-app store and is likely to have some features which support a value add on top of simply installing an open source toolkit.

The best worked example of the principles I have documented: https://flyingkoala.readthedocs.io/en/latest/worked_example_horticulture.html

Thanks again for your questions and input.

@bradbase
Copy link
Author

bradbase commented Mar 9, 2020

Once the one-click installer is up, I'm will develop commercial solutions for companies to leverage FlyingKoala with PLPython and Python within MS products (eg; MSSQL Server and PowerBI).

@bradbase
Copy link
Author

For those following:

First order approximation of Range support using Dataframes as the unit for calculation in the "Excel_lib" functions (currently only SUM).

Cell First!A2 has no formula
but its value is 0.1
value 'evaluated' for First!A2 without a formula: 0.1
ERROR:root:ExcelError, #N/A, 'DataFrame' object has no attribute 'value'
DEBUG:root:
Cell Seventh!C1 has a formula, A1+B1
which has been translated to Python as Evaluator.apply("add",self.eval_ref("Seventh!A1"),self.eval_ref("Seventh!B1"),None) which evaluates to #N/A

value 'evaluated' for Seventh!C1 with a formula: #N/A
ERROR:root:ExcelError, #N/A, 'DataFrame' object has no attribute 'value'
DEBUG:root:
Cell Ninth!B1 has a formula, A1+Hundred
which has been translated to Python as Evaluator.apply("add",self.eval_ref("Ninth!A1"),self.eval_ref("Eighth!B1"),None) which evaluates to #N/A

value 'evaluated' for Ninth!B1 with a defined name: #N/A
DEBUG:root:
Cell Eighth!B1 has no formula
but its value is 100
value 'evaluated' for Hundred with a defined name: 100
DEBUG:root:
Cell Tenth!C1 has a formula, SUM(A1:B1)
which has been translated to Python as xSum.xsum(self.eval_ref("Tenth!A1:B1")) which evaluates to 102

value 'evaluated' for Tenth!C1 with a defined name: 102
DEBUG:root:
Cell Tenth!C2 has a formula, SUM(A2:B2)
which has been translated to Python as xSum.xsum(self.eval_ref("Tenth!A2:B2")) which evaluates to 102

value 'evaluated' for Tenth!C2 with a defined name: 102
DEBUG:root:
Cell Tenth!C3 has a formula, SUM(A3,Hundred)
which has been translated to Python as xSum.xsum(self.eval_ref("Tenth!A3"),self.eval_ref("Eighth!B1")) which evaluates to 102.0

value 'evaluated' for Tenth!C3 with a defined name: 102.0

@bradbase
Copy link
Author

bradbase commented Mar 12, 2020

@vallettea and @brianmay and anyone lurking, the code I've written so far;

https://github.com/bradbase/koala_xlcalculator

It's not yet feature compatible with koala, but will need to be for it to be useful to anyone.

EDIT: koala_xlcalculator is now known as xlcalculator and can be found here:
https://github.com/bradbase/xlcalculator

@vallettea
Copy link
Owner

Hello @bradbase
sorry i take years to answer but I running another company and I have no time for koala as people may have noticed ;)

I'm quite impressed by your work and I really think that there is an opportunity in this kind of product especially with someone like you who knows the internal needs of the sector.

I haven't had time to benchmark flyingkoala but i think that if is does what does koala2 with pretty good performances, we should consider telling people to use flyingkoala instead.

@bradbase
Copy link
Author

Hello @vallettea

No problem on timing. The world is in considerable turmoil at the moment and these projects are the least of anyone's worries. The time you've taken to respond is especially appreciated. (for posterity COVID-19 era)

Your feedback means a great deal. Thank you.

In case our timing doesn't align and to be a pedant: FlyingKoala 0.0.3b is currently associated with Koala2 0.0.31.

The code I've put forward as potential inspiration for "Koala2 re-write" is currently found as koala_xlcalcualtor and intended for eventual use with FlyingKoala . It is currently a long way from being feature complete against Koala2. I am working on that as a priority.

For those who tread here: It's sounding like koala_xlcalcualtor may be on collision course to be re-named Koala3 once I (or the royal "we"?) reach feature parity.

@bradbase
Copy link
Author

For those lurking: I have just release FlyingKoala 0.0.4b0. It's got a bunch of bugfixes and changes to make the add-in optional and actually compatible with Koala2 0.0.31.

@LFoglietti
Copy link

Hi Guys,
my start up wants to build an easy UI to make algotrading, not HF, easy to be implemented by investor with no coding skills.

We will start with Cryptomarkets but the idea could be easily translated for traditional markets.
Let s assume the all infrastructure is in place. We just need the rule from the investor:

Ex. Buy Bitcoin with USD if BTC price is Up 2% within 45 Min...then...

which MUST be written in a language compatible with the rest of the infrastucture.
If we want to give to the final user a good product, he has to ba able to build his own rule with a great deal of flexibility :

Ex. Do the previus buy only if the price is above the Max price of last 3 days. Do not execute the rule on friday...

We end up with the only option: The final user needs to know at least basic excel logic.
If he does he can use the UI to build an excel file without coding.
Excel is definetly the solution. we all know it.
Just think about the amount of strategy already implemented in excel by the financial industry.

Problem is... can excel be wrapped in our infrustructure as it is ???
As you can undesrtand, it would be difficoult and extremly expensive to implement such a solution with thousand of excel instances opend toghether. ( not impossible, Microsoft is working on it..)
This is why we need to translate it in a -- Python math "objects" -- and connect our strategy to the market for real time action.
I 100 % agree with bradbase.
Your project is a powerfoul tool and it opens up to lots of intersting commercial use.

Ciao

@bradbase
Copy link
Author

@LFoglietti, Thank you for your support and validation.

I have a feeling the mechanics for the solution you are talking about may be encapsulated in FlyingKoala as it is a combination of two projects, xlwings and Koala2. The intent of FlyingKoala is to enable people like your target market to put systems like this together for themselves. Maybe the scale you are talking about requires assistance but I like to think the basics of defining the individual algorithms and setting up a set of rules would be achievable by the traders and analysts themselves. A worked example using two formula from horticulture can illustrate the concepts.

If you are in financial analytics and don't know about xlwings it's worth becoming familiar. Here is their website.

xlwings is a solution coming from the financial trading analytics world. At its core it is a COM wrapper but is incredibly inspired. The practicality of the solution provides two-way communication between Python and Excel on all important facets. I have found great use for xlwings well beyond finance and even well beyond quantitative analytics into qualitative analytics and even commercial data management.

When coupling xlwings with Koala... Things become surprisingly powerful.

Disclaimer: I am not affiliated with Zoomer Analytics or xlwings. The only involvement I have had with xlwings to date (besides using it) is I pioneered Anaconda support in the xlwings project.

@LFoglietti
Copy link

@bradbase,
FlyingKoala and xlwings might be very useful.

I looked into your worked example. Let's assume my analysis' result is:
IF degreeday() is >= 10 AND previus degreeday() is <= 5 THEN True ->: open water.
I hard code this simple if in Excel for every row without using array. Now I want to integrate this logic in my Py garden where I already have a device sending the temperature through a websocket written in Python, in real time.

In terms of excel, every time I get a new data, it would be:
Write Temperatures in first empty row,
Copy&Paste formulas from previous row
Calculate formulas
Check condition cell : If true tell Python to Open Water

This would be the difficult part because:
I do not want any excel to be opened or even installed.
I receive new data frequently ( fast is 3 times a second) when they are available from the websocket;
I need to open water very fast;
I have many happy plants, all well watered... :-) I hope;

Devil is often in details: like copy and pasting array formulas...

Pm me lucianofoglietti@gmail.com if you are interested in the project.

Ciao

@bradbase
Copy link
Author

@LFoglietti Thank you for the extension of the example.

I have used Koala2 <= 0.0.31 in the way you describe.

Using Koala2 version <= 0.0.31, or when feature compatible the re-write koala_xlcalculator, you can save the state. In Koala2 0.0.31 you can specify to "focus" on a particular output cell so the resulting state only relates to the cells involved with the formula you are interested in. This provides you with the ability to read the definition of a formula, convert it to Python and persist that as a "math object". The input terms are still referred by their cell address (or defined names), but they are the inputs to the equation which gets evaluated in Python. This "math object" can then be used either as a lambda (in a programming sense) or in a lambda (in a cloud processing sense).

All the above can happen completely independently of a Microsoft environment.

The FlyingKoala solution joins Koala2 and xlwings. xlwings is there to to provide;

  • the ability to get data from Excel into a Python environment (uses COM)
  • the ability to get data from Python into a user friendly form (Excel) (uses COM)
  • an Python environment to run the converted formula

FlyingKoala actually uses the "math object" essentially as a lambda (in the programmatic sense).

There is absolutely no reason why you need Excel except for defining the formulas. Once the formulas are defined you can then apply the formula to any data you have.

I have put introspection code into Koala2 so that a "math object" can be queried for its terms (input parameters). The method is ExcelParser.getOperandRanges().

The code in FlyingKoala.py demonstrates the use of this. FlyingKoala is literally only 185 lines of code. It's easy to see how it works. The "money-shot" is in a combination of the methods generate_model_graph() and evaluate_koala_model().

** I keep specifying Koala2 0.0.31 because after that version there was a shift in the code which broke the ability to persist the model and to properly use the output cell mechanism. **

Also, I have been remiss to mention other solutions. If you don't already know about them, checkout PyCell, PyXll and the pair of solutions Formula and Schedula.

@LFoglietti
Copy link

@bradbase
Thank you for all your valuable information.
It' s clear you know a lot about this issue.
We both understand the need, in an Object Oriented Programming world, of this functionality.

I agree with you. In my opinion the problem can be divided in to:

  1. Pyton Excel comunications : Get(data), Set(data), manipulate data mehods ( Ranges, CopyAndPast, Wiew/Graph ).

  2. Excel formulas and logic are translated in to Python.
    The calculation engine is now a python object wich mirrors the one Excel has ( written in C++ by the way). It returns the same results.

Just to give you an idea: our main evaluation guidelines are:

  1. VBA comparison.
    If I can do that in VBA I want to be able to do that in Python (at least same speed and same flexibility). This part is usually handled by COM quite well. When we deal with large file we measure the cost of calling Excel ( Get, Set, Manipulate ) and performance degradation (not linear). We are not concerned about decorating the file which, in our case, is not an UI.

  2. Calculation Efficiency.
    Assuming Python and Excel give the same results ( which does not consider the excel bugs' problems ), Python MUST be considerably faster.
    This is because Excel is optimized for interactive use, but more important, because while Excel itself is fast ( if data and formulas are well structured ), calling Excel from OLE Automation is not.
    All specific calculation method like Cell, Range, Dirty, can improve efficiency.

  3. Reliability.
    Is the code stable, actively supported, well documented, already used in a consistent number of applications? Is it free? is it OpenSource?

At the moment we have a Beta working in c# .Net.
The job is now handled by Flexcell
We are considering moving to python for further development.

@bradbase
Copy link
Author

@LFoglietti It is marvelous communicating with you. So few people understand these things at such a level. I will PM you with respect to this as I know I'm losing some detail.

As we have gone so deep I think it may be good to have some of this conversation in the open for those lurking. We are genuinely at the core.

I've had a cursory look at FlexCel user guide. It's currently late where I am and I've spent a lot of hours today coding functions and tests (which is absolutely boring and mind-numbing) so I can't take it in as I'd like. I'll look again after some rest.

That said; Koala2 in its "pure" form (or, maybe the concept) might be a great start as a Python answer to FlexCel. In my opinion (happy to be corrected by the Koala gurus here) Koala2 currently focuses on the idea of an Excel replacement. There is focus on being able to read and write Excel files as well as calculate them. The library being used for managing Excel files is OpenPyXL which is a gold standard for Excel files in Python (there are some others which specifically define for reading or writing). Openpyxl, however, has two deficiencies that I've identified (there may be more);

  • It can't do pivot tables
  • It can't calculate

In reality, unless you are interactive with the spreadsheet, who cares about pivot tables...
In any case there's nothing we can do about them. They need an object found within Excel. COM can assist but that requires an Excel install and an OS that supports COM. We know COM is managed by:

  • xlwings (essentially just a COM wrapper - an astoundingly good one)
  • PyXll (I've had no contact as it's commercial but they seem to have great RibbonMenu flexibility)

Calculation has been solved by Koala, PyCell and Formula/Schedula and soon to be Koala_xlcalculator.

  • PyCel focuses on being able to visualize the formula networks (and is associated with PyXll)
  • Formula/Schedula are tightly focused on big system modelling (Automotive industry)

Koala is in the sweet spot for Excel file management + calculation. My needs are different from that of Koala's direction in that I don't care to manage Excel files. I'm trying to get the math from a domain expert to transposing data without need of a developer. Whether that data is on the desktop, on a DB or in the cloud.

I find one major challenge with Koala and that is the code is very difficult to manage. It's completely understandable due to the heritage of the project, but there's a great deal of clean-up to be done. My figuring is a clean calculator module will assist and may well make integration with openpyxl, xlrd or similar easier to achieve. And that feeling seems to be validated. I've spent less time writing a "new" Koala (leaning heavily on Koala) than I have trying to fix things in the existing code and I've gotten much further. The reality is there is need for both Koala and FlyingKoala as they address different yet heavily related problem domains. I'm yet to fully get my head around FlexCel to appreciate where it lives but the quickstart introduction appears to show intent to be similar to Koala2.

@LFoglietti
Copy link

In Flexcell there is also a tool for automatically showing needed Delphi/C++ code for generating specific .XLS/.XLSX file cells.

It reminds the macro recorder in Excel. You upload a file and you get the C code to replicate it.

My considerations, so far, are that Python, at this stage, canNOT compete with other languages/Environment like C#.Net. as far as transforming Excel in to a "Python Math object" is concerned (2).

Also considering Java, you can find excellent commercial solution like the Grapecity family products.
Excel, once loaded, is abstracted to a Java class.
You do not need to have it on your machine.
The calculation happens in Java.
You can show your clients a java UI which is exactly like Excel.

My question is.
Why is Pyhon not interesting in pushing integration in the above direction?
Possible answers are:

  1. Technical. Excel interacts better, at an highest level, with C and Java.
  2. There is no request of such integration by Python community,
    Why do you want to run a time series analysys in a spreadsheet when you can do that that directly in Python.. Code your own logic there or use a Library. No need of Excel.
    I am Python. I am perfect to sort this math problem. Why do you want me to learn Excel engine calculation difficoult logic. :-)
    Pythons already communicates with Excel, you can also write powerful UDF.
  3. I am missing something in my analysis.

Contributions will be appreciated.

@bradbase
Copy link
Author

@LFoglietti

Outside technology... My strength as a software developer is to promote to non-developers capabilities that they need but can't get access to without my skill. The more I empower non-developers the more value I am to them. I am of most value if I can facilitate them to empower themselves without need to come into direct contact with me.

The solution I'm championing is not solely addressing a technical problem. The problem space is a combination of personal skills and inter-human communications.

That said, from a technical perspective;

You are 100% correct in suggesting other languages may be better at various things with respect to interacting with Excel files and replicating Excel's functionality. It would be marvelous if those other languages had some of the strengths of Python but each language has succeeded for their own reasons.

I disagree with the statement that there's no request for such an ability in Python. Koala exists. It had a reason. With its existence there's a sustained enthusiastic following. PyXll is making money from this. xlwings exists. PyCel, Formula/Schedula. Pyopenxl and xlrd both exist because people are trying to interact with Excel files in Python. The reason the libraries you speak of exist in those other languages is the same which demand one in Python.

What happens when your workforce doesn't have the skills in Java or C#.net? Even if they do, what happens when development is not the core purpose of your business and Java or C#.net environments are not going to be authorized?

  • Maybe you are an engineering company who hires engineers (non-coders) and you've got the luxury of having employed one data analyst (maybe not a full stack dev, but can code). Your engineers are making mathematical models. They use Excel.

  • Maybe you are in billing and your billing analysts are using Excel but need more power in terms of time series tools. Do you train them up as developers in Python so they can become Python data analysts? Do you need to contract a developer to write time series tools for you? Do you need to buy some?

  • Maybe you are in accounts and you need to create an integration between different accounting systems. Often timesheets, accounts receivable and general ledger are in different applications. You may have a data analyst in the office. If you do, Python it likely already on the approved list of company software.

  • What happens when the model coded in Java, C#.net, Python, or essentially any language, needs the be audited? Banks don't like handing over tens of millions of dollars without seeing inside the black box. Do you insist that the auditor is familiar enough with ALL your tech and how the solution has been put together before you can get them to understand it? Or do you want to operate in an environment which is independent of computer language? This aspect actually strengthens the need to have this kind of tool in all languages... Auditors can work in the language they are most comfortable be it Excel, VBA, VBA.net, Java, C#.net, Python, blah.

There are many, many reasons why Python would want this capability. The strongest need is outside the scope of math literate people who can code. I am finding the struggle is the only people who understand the technology are math literate analysts and developers. Obviously they don't need it. The people who can benefit from this tech have their hands tied behind their back because they aren't necessarily code literate.

Do you (or I) want to learn how Excel does its calculations and knowingly implement incorrect calculation..? Absolutely not. But there have been people who have done that for the libraries and tools you've spoken about. Why does Python need to miss out?

Although it may not get much traction I see there is value in having a library of objects which calculate in the same way as Excel. Again with auditing in mind. If your Python code (or even Java, C#.net) can't produce the same answers as the Excel model which has been authorized by the bank you are introducing risk of liability. Currently I don't know of a single tool which can categorically prove a complex mathematical calculation differs between Excel and a programming language due to Excel's deficiencies. For that you need excel compatible/non-compatible "modes". The way to do that is write objects which implement multiple methods addressing "Excel compatibility" or straight/correct calculation and switch between them. There is opportunity for that in the way I've structured koala_xlcalculator.

@LFoglietti
Copy link

@bradbase ,
100% with you. Your project goes in the right direction.
it makes sense and there is a general need for it, outside my sector indeed.
I am supporting it as I can.

Having said that.
Do you really think I can replace a tool like Flexcell in my C# . Net architecture with one of the library you mentioned??
Formula is asking donations to develop new functions compatible with Excel...

I am talking about an essential piece of software in a commercial app. Used by thousands of people.
I still do not know the answer. I want to look better in the Python' libraries involved.
As you said.. something is moving.
Maybe in a couple of year for the all object engine part. Manipulation is already there and it s good.
Thanks

@bradbase
Copy link
Author

This is probably hijacking the intent of this thread. To find the answer it's a case of reading the documentation of each of the projects and seeing for yourself whether they align.

I've read the doco for FlexCel and am confused as to where it fits.

I'm happy to continue this conversation, but off-channel somewhere.

@LFoglietti
Copy link

It was a very nice conversation and you are right: we are getting out of topic.
Let me end this post with a general question for everybody, not only top skilled nerd, who might be reading it.

Assuming I can manipualte Excel using another language ( C#, Java, Python...) we have 3 main scenarios for a simple worksheet where A1=5, A2=10 A3=Sum(A1:A2) =15

1) Read & Write
Python Writes A1=20.
Excel Calculate A3.
Python Reads A3=30.
Excel Must be installed.

2) Read & Write & UDF
I write a Python function that sum 2 numbers. PySum(x. y)
A3 now becomes PySum(A1:A2) =15 as if I were using and UDF.
Python Writes A1=20.
Excel calculates A3 which contains a pythonic UDF.
Excel calls Python passing A1, A2.
Python calculates PySum().
Python reads A3=30. This can be skipped because Python already knows Pysum() = 30.
Even if I have only PythonicUDF in my sheet, the Excel engine is responsible for building a calculation tree.
Pythonic Udf calculation will then be in Python ( c# java ecc..) leveraging speed, library and so on.
Excel Must be installed.

3) Complete Integration
Load Excel file in to Python.
Excel is now a Python Object which behaves exactly as Excel. The sum function is automatically transformed in a Pysum(). I can now tell to my Python object to add a row between 1 and 2. Pysum will become PySum(A1:A3) and the result will be placed in Pythonic A4 as it would be in case 1 and 2 in Excel A4.
You can think of your Excel file as a pithon ( or C# or java) 2D array where you can use Excel sintax and get the same Excel result.
From now on everything is handled by Pithon which will clone and write its object to Excel if asked.
Python changes its A1 array value to 20.
Pithon calculates its A3 array value = 30.
Excel does not need be installed.
We used excel only to upload data and formula but we could have called a blank Python object and build it as our excel file as we could do with Read and Write in (1) and (2).
Excel does not need to be installed.

In Python: Koala, Koala2, FlyingKoala, Formula are library going towards a complete integration.
For other languages you can look at: EasyXls, Grapecity, Flexcell, Spreadsheetgear just to mention some.

Will Python continue its efforts towards a complete integration or it s happy to stay in scenario 2

@bradbase
Copy link
Author

Koala is striving for option 3 and, to large extent, already demonstrates the achievement of this goal. Will it continue to..? I expect so.

Python cannot really achieve options 1 or 2 as it's not possible to integrate the language at a deep enough level. Currently those options are faked using COM. In the future we are likely to be able to integrate at that level through the web based API (REST?). Excel is heading toward the cloud in a very serious way and that is the interaction pathway becoming available.

@LFoglietti
Copy link

I forgot to mention the following for Python.
For a better explanation of the problem you can see this excellent blog.
Pycel: Compiling Excel spreadsheets to Python and making pretty pictures
and the resoulting library:
Pycel
also look at
Dirigibile

Thankyou @bradbase for your analisys and good luck with your project.

@bradbase
Copy link
Author

bradbase commented May 3, 2020

@vallettea, @brianmay, @strichter, @kmonson and others,

I have written and now released xlcalculator. It is heavily based on Koala2 and is written for Python 3.
https://github.com/bradbase/xlcalculator

I have done a performance comparison between PyCel, Koala2 0.0.33 and xlcalculator. The performance scripts can be found in this repo.
https://github.com/bradbase/Perfomance_testing_Python_Excel_calculators

I intend to write a software review to comprehensively explore at least PyCel, Koala2 and xlcalculator but in short what I've been seeing;

  • Koala2 is rather slow up-front. It takes a long time to load a file, but then is fast enough to evaluate. Takes 25 mins to load a file that PyCel does in less than a second, and xlcalculator does in 13 seconds. But will eval one line of nested equations in less than half a second.
  • PyCel is lightning fast. I'm not certain it will win in every situation, but for the simple test of loading a moderate sized file, evaluating a single line of nested formulas and recursing them rapidly, it's very fast. about 1 second to do the whole lot.
  • xlcalculator is reasonably quick, loading a file in 8 seconds, taking until 13 seconds to build the Python code and about half a second to eval a single line of nested formulas.

xlcalculator is now the back-end calculator for FlyingKoala and is exceptionally performant.
https://github.com/bradbase/flyingkoala

A note: I have extracted the Excel function implementations into a library on their own, so anyone can use xlfunctions to run Excel compatible calculations.
https://github.com/bradbase/xlfunctions

The only thing now is to get people to use it so I can find all the bugs :D

@vallettea
Copy link
Owner

Congrats ! Looks awesome.

@charlesdwright
Copy link

charlesdwright commented Jun 10, 2022

@LFoglietti Has Tradestation the trade rule interface you're looking for?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants