
Data transformer update #2182

Open
jkiviluo opened this issue Jun 28, 2023 · 1 comment
Labels: Data, enhancement (Enhancement of existing feature), Feature (Possible new feature)
Milestone: V1.0

Comments

@jkiviluo
Member

jkiviluo commented Jun 28, 2023

In the Stockholm meeting, we discussed adding more manipulation functionality to the data transformer. First, these issues are linked:

To begin with, we should enable all the different use cases for the data transformer:

  • data store / importer / transformer --> transformer
  • transformer --> data store / exporter / transformer

Then there is the manipulation chain:

  • filter to pick only necessary data (could be in the arrow)
  • set operations (joins etc.) that create the data sets used in the next step
  • math operations that modify and/or create new sets from the data sets in the previous step
  • allow naming and renaming. Possibly also deleting based on a set.

The set selection should create sets of "entities" with their parameters that can then be used in the math operations done in the next step. The most common selection is simply to take all the entities of a class with selected parameters. There should be a possibility to filter entities (using regex). However, there can also be a need to filter using simple set operations like intersection, union, difference and symmetric difference (typically this filtering would be done on entities from the same class, but using a different regex filter for each selection of entities).
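As a rough sketch of the regex filtering and set operations described above (the in-memory dict representation and the entity names are illustrative assumptions, not the actual data transformer API):

```python
import re

# Hypothetical in-memory view of a class: entity name -> parameters.
units = {
    "coal_plant": {"capacity": 500},
    "gas_plant": {"capacity": 300},
    "gas_storage": {"capacity": 80},
    "wind_farm": {"capacity": 120},
}

def select(entities, pattern):
    """Pick the entity names that match the regex."""
    rx = re.compile(pattern)
    return {name for name in entities if rx.search(name)}

plants = select(units, r"_plant$")      # {'coal_plant', 'gas_plant'}
fossil = select(units, r"^(coal|gas)")  # also catches 'gas_storage'

# The simple set operations then combine the two selections.
both = plants & fossil          # intersection
either = plants | fossil        # union
storage_only = fossil - plants  # difference
exclusive = plants ^ fossil     # symmetric difference
```

Each selection stays within one class here, as the text suggests; only the regex differs between selections.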

A more complicated selection is to link entities from two classes using a third, multidimensional class to establish the link (entities from A and B based on C, which is actually an A__B relationship; the resulting set gets its name from A__B, and the parameters from all of A, B and C can be available for the math operations). This could be extended to more than two source classes. It could also be done without C by assuming that the operation is a cartesian product of A and B resulting in A__B, where A1__B1 gets parameters both from A1 and from B1 (to be used in math operations).
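A minimal sketch of both variants, linking through a relationship class and, without one, taking the cartesian product (the dict layout, names and delimiter handling are assumptions for illustration):

```python
from itertools import product

# Hypothetical data: object classes A and B with parameters, and a
# relationship class A__B that links some of their entities.
A = {"a1": {"p": 1}, "a2": {"p": 2}}
B = {"b1": {"q": 10}, "b2": {"q": 20}}
A__B = {("a1", "b1"): {"r": 100}, ("a2", "b2"): {"r": 200}}

def link(a, b, rel, delim="__"):
    """Combined entities get parameters from A, B and the relationship."""
    out = {}
    for (ai, bi), rel_params in rel.items():
        out[f"{ai}{delim}{bi}"] = {**a[ai], **b[bi], **rel_params}
    return out

def cartesian(a, b, delim="__"):
    """Without a linking class: the cartesian product of A and B."""
    return {f"{ai}{delim}{bi}": {**a[ai], **b[bi]}
            for ai, bi in product(a, b)}

linked = link(A, B, A__B)   # only the pairs present in A__B
full = cartesian(A, B)      # all four combinations
```

Note how `A1__B1` in the cartesian case carries parameters from both members, matching the description above.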

Similarly, there could be splitting operations: C, which is a subset of A, is used to pick elements from A__B. (A lot of this can be done with filters, but formalized splitting would allow hand-made selections.)
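A toy sketch of such a splitting operation, with a hand-made subset C of A picking elements from A__B (the tuple-based representation is an assumption):

```python
# Hypothetical: A__B entities are name tuples; C is a hand-made
# subset of A used to split A__B.
A__B = {("a1", "b1"): {}, ("a1", "b2"): {}, ("a2", "b1"): {}}
C = {"a1"}

def split(rel, subset, dim=0):
    """Keep only the relationship entities whose member in the
    given dimension belongs to the subset."""
    return {k: v for k, v in rel.items() if k[dim] in subset}

picked = split(A__B, C)   # only the a1 rows remain
```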

A third category of selection would include aggregation: A__B is used to create a set of Bs where all As are aggregated. For any parameter of A there should then be an aggregation function: sum, average, count, max, min, median, first, last and maybe more.
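The aggregation step above can be sketched as a group-by over the B dimension, with the listed aggregation functions as a lookup table (names and data layout are illustrative assumptions):

```python
from statistics import mean, median

# Hypothetical: aggregate A__B over its A members, yielding one
# entity per B with an aggregation function applied to a parameter
# of A.
A__B = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
A_params = {"a1": 5, "a2": 7, "a3": 3}

AGG = {"sum": sum, "average": mean, "count": len,
       "max": max, "min": min, "median": median,
       "first": lambda vals: vals[0], "last": lambda vals: vals[-1]}

def aggregate(rel, params, func):
    """Group the A-side parameter values by B and reduce each group."""
    groups = {}
    for a, b in rel:
        groups.setdefault(b, []).append(params[a])
    return {b: AGG[func](vals) for b, vals in groups.items()}
```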

A fourth category of selection is not really selection but reordering, adding and removing dimensions of the entity. It might actually be better considered a preliminary step of the set selection. Cartesian product is one form of adding dimensions. A new dimension could also be pulled out of multi-dimensional parameter values. Similarly, a dimension could be pushed inside a parameter by adding that dimension there.
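A minimal sketch of these dimension manipulations, assuming entities are name tuples and multi-dimensional parameter values are nested dicts:

```python
def reorder(entity, order):
    """Permute the dimensions of a multi-dimensional entity name."""
    return tuple(entity[i] for i in order)

# A time-indexed parameter value on entity "a1".
values = {"a1": {"t1": 1.0, "t2": 2.0}}

def pull_dimension(values):
    """Pull the inner parameter index out into a new entity dimension."""
    return {(e, idx): v for e, series in values.items()
            for idx, v in series.items()}

def push_dimension(flat):
    """Inverse: push the extra dimension back inside the parameter."""
    out = {}
    for (e, idx), v in flat.items():
        out.setdefault(e, {})[idx] = v
    return out

flat = pull_dimension(values)  # {("a1", "t1"): 1.0, ("a1", "t2"): 2.0}
```

Pulling and pushing are inverses here, which is what lets them serve as preliminary reshaping steps before set selection.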

Possibly helpful clarifications:

  • intersection (entities that are both in A and B)
  • union (entities that are either in A or B)
  • difference (entities that are in B but not in A)
  • symmetric difference (entities that are either in A or B but not in both)
  • cartesian product (combination of As and Bs with a user chosen delimiter)

Math operations

As a result of the set selections, there should be several sets of entities and their associated parameter data. The math operations would not take place between entity sets (because the sets can contain different entities, in which case the operations would not be defined). So, each set should have all the parameters needed for the math operation stage.

Math operations should be expressed as a formula over parameters. In addition to regular math operations (add, multiply, etc.) there could be functions like sin, cos, exp, log, ln and mod.

There is a further complication when the parameter data is multi-dimensional. The math formula should be able to handle that. If the data always has a particular dimensionality, a single formula should be able to do it (e.g. [annual_demand] x [hourly_demand(time)], where time is a dimension name).
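A toy sketch of how such a formula could broadcast a scalar parameter over a time-indexed one, as in [annual_demand] x [hourly_demand(time)] (the dict-based representation and the parameter values are assumptions):

```python
# Hypothetical parameters: one scalar, one indexed by time.
annual_demand = 2.0
hourly_demand = {"t1": 0.5, "t2": 0.25}

def broadcast_multiply(scalar, series):
    """Apply the scalar to every index of the indexed parameter,
    so the result keeps the time dimension."""
    return {idx: scalar * v for idx, v in series.items()}

result = broadcast_multiply(annual_demand, hourly_demand)
# result is still indexed by time: {"t1": 1.0, "t2": 0.5}
```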

Naming and renaming

This would be for giving names to newly created dimensions and parameters, or renaming existing ones. Maybe this should already happen in the set selection step ("SELECT * FROM UNIT AS PLANT").
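A minimal sketch of such renaming over a nested dict representation, analogous to the SQL alias above (the class/parameter names and the mapping-based approach are assumptions for illustration):

```python
# Hypothetical data: class -> entity -> parameter -> value.
entities = {"unit": {"u1": {"cap": 500}}}

# Rename maps; anything not listed keeps its name.
rename_classes = {"unit": "plant"}
rename_params = {"cap": "capacity"}

def rename(data, class_map, param_map):
    """Apply the rename maps to classes and parameters."""
    return {class_map.get(cls, cls):
            {e: {param_map.get(p, p): v for p, v in params.items()}
             for e, params in ents.items()}
            for cls, ents in data.items()}

renamed = rename(entities, rename_classes, rename_params)
# renamed == {"plant": {"u1": {"capacity": 500}}}
```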

@jkiviluo jkiviluo added this to the V1.0 milestone Jun 28, 2023
@jkiviluo jkiviluo added the Data, enhancement (Enhancement of existing feature) and Feature (Possible new feature) labels Jun 28, 2023
@LouisFliche

  • During an attempt to translate the RTS-GMLC data into SpineOpt, I found myself needing a tool to do what is described above. To try to solve this problem, I built the beginnings of a tool to allow such database modifications. It is by no means a tool that aims to become the future data transformer, but it can give an idea of the functionality that is useful in practice.

  • First of all, I created an initial tool (cf. the Spine Toolbox project RTS_To_SpineOpt_Basis) enabling systematic operations on parameters of a SpineOpt database: renaming, multiplying by a constant factor, and operations between parameters (creating a parameter equal to the sum of two others, multiplication, division, the maximum of two parameters, etc.). In an Excel file, which acts as the user interface, you enter the names of the parameters you want to create or modify, the parameters you are using, and the operation you want to perform on them. The Python tool in Spine Toolbox evaluates the expressions in the Excel file, performs the operations on the specified input database, and creates the output database. In this way, all the mathematical operations of Python (and NumPy) can easily be applied to parameters in a SpineOpt database.

  • However, these operations are performed each time on all the values of the specified parameters. Translating RTS into a SpineOpt-format database also required the ability to partially filter/select the input data, for example selecting parameters corresponding only to certain objects (i.e. selecting a parameter according to the value of another parameter), and to perform operations across the values of a single parameter: the maximum of the values taken by that parameter, etc. To do this I added a 'Filters and column operations' section to the previous workflow, upstream of the operations between parameters described above (see the Spine Toolbox project RTS_To_SpineOpt_with_filters).

  • However, the tool in question was very limited. Firstly, it used an Excel file as the user interface, which is undesirable. It also relied on string evaluation in the code, which was impractical, not very clean, and error-prone (typically, the system was not at all resilient to the slightest typo). Finally, and most importantly, the tool didn't allow us to perform all the operations we would have liked, and which were necessary to translate the database in question. In fact, we would have liked to be able to perform all the basic database operations: unions, intersections, etc. (all the operations described above). We therefore had to make a real decision about how to go about it (recode all the possible SQL operations in a dedicated tool, use database-management modules in Python or another language, etc.). Finally, the workflow in two distinct parts (essentially: selections, then operations) was not practical.

  • In conclusion, this attempt to translate the RTS database (for which you can find the code here) is an example that shows the importance of such a translation tool, its necessary versatility (it must be capable of performing all the basic operations that can be performed in SQL) and the importance of having a single tool with a dedicated user interface.
