
Local datasets scaling graph v2 #77

Merged
merged 14 commits into local-datasets from local-datasets-scaling_graph on Jan 12, 2017

Conversation


@ploh ploh commented Jan 6, 2017

Scale the graph and time curves of a derived dataset alongside the area attributes - and refactor scaling of the latter.

(Follow-up PR, since #73 was accidentally closed and could not be reopened.)

@ploh ploh mentioned this pull request Jan 6, 2017

codecov-io commented Jan 6, 2017

Current coverage is 93.18% (diff: 100%)

Merging #77 into local-datasets will increase coverage by 0.27%

@@           local-datasets        #77   diff @@
================================================
  Files                  62         64     +2   
  Lines                1609       1674    +65   
  Methods                 0          0          
  Messages                0          0          
  Branches                0          0          
================================================
+ Hits                 1495       1560    +65   
  Misses                114        114          
  Partials                0          0          

Powered by Codecov. Last update dbd83a0...6f47a35

Contributor

antw commented Jan 6, 2017

I pushed a fix for the incorrect energy_regasification_lng demand. You needed to scale edge demand in addition to node demand. 😃

Author

ploh commented Jan 9, 2017

I pushed a fix for the incorrect energy_regasification_lng demand. You needed to scale edge demand in addition to node demand. 😃

@antw Thank you very much 👍
Now my tests pass for all node demands as well as final demand CO2 emissions.

By the way: I did not know that edges could have an explicitly set demand. I checked and found out that only these two do:

energy_regasification_bio_lng-energy_national_gas_network_natural_gas@greengas
energy_regasification_lng-energy_national_gas_network_natural_gas@natural_gas

What is so special about them?

Author

ploh commented Jan 9, 2017

I still have some open questions:

  1. demands/metal_demands.csv, primary_production.csv and central_producers.csv all have a demand column. These files are certainly used by dynamic node attributes, which are calculated in Atlas::Runner, and the computed values are stored in graph.yml for a derived dataset. Hence, we might not need to scale the CSV files, since the values in graph.yml will be scaled themselves.
    But... are these files maybe also directly accessed by ETEngine?
    My source code search found only one such place, namely in Etsource::Dataset::Import. There, only the full_load_hours attribute is read from the central_producers.csv.
    @antw How certain are you that there are no direct uses of demand (or max_demand) from one of these files?

  2. To ensure the correctness of the scaling, I successfully checked all node demands as well as all final demand CO2 emissions. But I do not think this is sufficient.
    I would like to (automatically) check all gqueries. (This can then also be set up as an automated CI task.)
    @ChaelKruip Do you think this is a good way of validating the scaling?
    @antw Is there some sort of "query rake task" in etengine which I could use for this?

@ChaelKruip

@ChaelKruip Do you think this is a good way of validating the scaling?
@antw Is there some sort of "query rake task" in etengine which I could use for this?

Yes.

I think there is already quite a bit of machinery to view the outcomes of all gqueries in the ETE back-end. For example, this list of all outcomes of queries.

antw added a commit that referenced this pull request Jan 9, 2017
Contributor

antw commented Jan 9, 2017

But... are these files maybe also directly accessed by ETEngine?
My source code search found only one such place, namely in Etsource::Dataset::Import. There, only the full_load_hours attribute is read from the central_producers.csv.

In hindsight, I think setting full_load_hours like this is wrong, and that it should be done with a dynamic attribute (like demand or share). This way, it would benefit from being scaled correctly without having to duplicate the CSV.

Ideally this would be handled automatically by Atlas (so that you didn't have to put the query in every central producer node document), but presently queries have to be saved in the file.

I pushed Atlas and ETSource branches with experimental changes, which should allow that code to be removed from ETEngine and the scaling to function correctly:

Author

ploh commented Jan 9, 2017

This way, it would benefit from being scaled correctly without having to duplicate the CSV.
...
and the scaling to function correctly

I think I should have been more precise above: the full_load_hours attribute is not a problem (at least in the outcome), since it does not need to be scaled. I was only wondering whether there might be similar direct references to the demand column as well.

Ideally this would be handled automatically by Atlas (so that you didn't have to put the query in every central producer node document), but presently queries have to be saved in the file.

As for the demand: it seems to be set via CENTRAL_PRODUCTION() for only some of the *.central_producer.ad nodes as well, which seems strange... but that is also off topic here 😜

I pushed Atlas and ETSource branches with experimental changes

👍
I will merge those changes into our local-datasets branches. Do you agree that the code in ETEngine can then simply be removed, since the full_load_hours for central producers will now work the same way as for other converters?

ploh pushed a commit that referenced this pull request Jan 9, 2017
ploh pushed a commit to quintel/etsource that referenced this pull request Jan 9, 2017
Contributor

antw commented Jan 9, 2017

I was only wondering whether there might be similar direct references to the demand column as well.

Not that I'm aware of.

The full_load_hours is not a problem (at least in the outcome) since it does not need to be scaled. [...] I will merge those changes into our local-datasets branches.

If it doesn't need to be scaled, no need to merge the branches. 😃 I think full_load_hours should be scaled though; @ChaelKruip what do you think?

Author

ploh commented Jan 9, 2017

If it doesn't need to be scaled, no need to merge the branches.

Well, as you said above...

In hindsight, I think setting full_load_hours like this is wrong, and that it should be done with a dynamic attribute (like demand or share)

... so I think we should merge it anyway, since it improves the system design. @antw What do you think?

I think full_load_hours should be scaled though; @ChaelKruip what do you think?

As far as I can tell, the numbers are "per unit per year", i.e. the hours for one power plant, and thus should not be scaled.


ChaelKruip commented Jan 9, 2017

I think full_load_hours should be scaled though; @ChaelKruip what do you think?

full_load_hours are fixed for most technologies but can be changed by ETE (for heating technologies in households) or by the Merit module (for the central producers). I don't think they need to be scaled, as they just express how many hours a technology typically runs during a year.

The energy (demand) of technologies can be scaled in many ways. The input_capacity, demand, full_load_hours and number_of_units are linked by:

demand [Joule] = input_capacity [Watt] * number_of_units [#] * full_load_seconds [s]

Either demand or number_of_units could be used as the scaling variable, to which the other of the two would then need to react. I think that currently demand is used. input_capacity is kept constant for all technologies except the ones mentioned in #73 (comment); full_load_seconds (or full_load_hours) needs to stay fixed as well.
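
To make this concrete with purely made-up numbers (a hypothetical illustration, not values from any dataset):

input_capacity    = 800.0e6   # Watt per unit, kept constant
full_load_hours   = 7_000.0   # hours per unit per year, kept constant
number_of_units   = 3.0

full_load_seconds = full_load_hours * 3600
demand            = input_capacity * number_of_units * full_load_seconds  # Joule

scaling_factor    = 0.01  # e.g. a region holding 1% of the national demand
scaled_demand     = demand * scaling_factor
scaled_units      = number_of_units * scaling_factor  # the attribute that reacts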

Author

ploh commented Jan 9, 2017

Partial answer to my own question about the two explicit-demand edges:

quintel/etsource@81463bdb3d0

ploh pushed a commit to quintel/etengine that referenced this pull request Jan 10, 2017
Peter Lohmann and others added 6 commits January 10, 2017 19:15
- ScaledAttributes -> AreaAttributesScaler
- Compute scaling_factor in top-level class Scaler
- Add a export_modifier hook to GraphPersistor
- Scale after conversion to a hash
- White-list of node attributes that need scaling
  - Most of the node attributes occurring in graph.yml are scaled
@ploh ploh force-pushed the local-datasets-scaling_graph branch from 8d19108 to 6f47a35 on January 10, 2017 18:24
@ploh ploh requested a review from grdw January 10, 2017 18:25
Author

ploh commented Jan 10, 2017

@grdw I think this is finally ready to be reviewed and merged 😄
@antw I would appreciate your feedback as well if your time permits


# Delete the header row for the internal representation -
# will be dynamically (re-)created when outputting
table.delete(0)
Contributor

I'm a bit confused by this line. On line 53 you specify that the CSV should return headers only for them to be deleted a few lines down. Could you explain why you are doing this?

Author

Because otherwise the @table will not contain the headers. So I return headers, retrieve them and then delete the superfluous line.
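
A minimal standalone illustration of the idea (this is just Ruby's CSV behaviour, not the actual CSVDocument code):

require 'csv'

# With return_headers: true the header line becomes row 0 of the table, so the
# headers can be read even when the file contains no data rows.
table = CSV.read('example.csv', headers: true, return_headers: true)

headers = table.headers  # read the headers while the header row is still present
table.delete(0)          # then drop the header row from the in-memory table;
                         # it is re-created when the table is written out again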

Author

I wrote comments in the source code for this. Do you have any suggestions how to make them clearer?

Contributor

Because otherwise the @table will not contain the headers

Just to be clear: line 56 will return [] instead?

Author

Just to be clear: line 56 will return [] instead?

Yes - only if the table is empty except for the header row, of course.

.merge(new_attributes))

derived_dataset.save!

GraphPersistor.call(@base_dataset, derived_dataset.graph_path)
GraphPersistor.call(@base_dataset, derived_dataset.graph_path, export_modifier: Scaler::GraphScaler.new(scaling_factor))
Contributor

Can't we call export_modifier just scaler for now? I understand that it's practical to keep it abstract; there might be a reason to one day support (multiple?) "export modifiers". But I'm not so sure we should do this already, considering that only 'scaling' is relevant for local datasets.

Author

The thing is that I would like to avoid the word "scaling" completely inside graph_persistor.rb since the latter has nothing to do with the former.

end

def persist!
data = EssentialExporter.dump(refinery_graph)
@export_modifier.call(data) if @export_modifier
Contributor

In which situation will export_modifier be blank?

Author

Well, not at all at the moment, but since it is an optional attribute, the code should be able to handle the blank case.

File.open(@path, 'w') do |f|
f.write EssentialExporter.dump(refinery_graph).to_yaml
f.write data.to_yaml
Contributor

Maybe data can be a separate method to keep the code clean.

For example:

def data
  @export_modifier.call(EssentialExporter.dump(refinery_graph))
end

Contributor

Nitpicking, but can this File.open block not be simplified to a one-liner?

File.write(@path, data.to_yaml)

I'm not sure I see the value in extracting data out to a separate method, unless it is something which is used in more than one place (the persist! method is already very short).

Author

Also done in 62cffb5

@@ -9,16 +9,30 @@ def initialize(base_dataset_key, derived_dataset_name, number_of_residences)
def create_scaled_dataset
derived_dataset = Dataset::DerivedDataset.new(
@base_dataset.attributes
.merge(scaled_attributes)
Contributor

Nitpicking: can this be indented two spaces further for the sake of readability? I don't care too strongly about it though, so I'm fine either way.

Author

Done in commit 8f85716

if value = @base_dataset[attr]
[attr, Util::round_computation_errors(value * @scaling_factor)]
end
end.compact
Contributor

Calling a method on an end breaks the style guide a little. You can use a tool like RuboCop to validate your code.

Edit: you might solve it like:

def scale
  @base_dataset.attributes.slice(*SCALEABLE_AREA_ATTRS).map do |key, value|
    [key, ... ]
  end
end

Author

But that way I would not avoid the attributes that are set to nil. I found another solution though.
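
For example, something along these lines skips the nil attributes without chaining a method onto end (just a sketch, not necessarily what the commit does):

def scale
  SCALEABLE_AREA_ATTRS.each_with_object({}) do |attr, scaled|
    value = @base_dataset[attr]
    scaled[attr] = Util.round_computation_errors(value * @scaling_factor) if value
  end
end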

Author

Solved in commit fdc9e3f

scaled_csv.set(row_key, column_key, base_value * @scaling_factor)
end
end
scaled_csv.save
Contributor

This method is a bit too long.

Author

Solved in 9e6e6e0


@scaling_factor = scaling_factor
end

# Public: Scales the demands in the graph - modifying the original graph!
Contributor

I think this documentation is out-of-date. It looks like graph is actually the hash of node and edge data, and it doesn't return the graph object anymore.

Author

Solved in commit a3ea7ce

@@ -0,0 +1,29 @@
module Atlas
class Scaler::TimeCurveScaler
Contributor

It would be nice to have some documentation for this class. The method signatures of call and initialize don't give much clue as to what the parameters should be, or what the result values are (if any).

For example: are base_dataset and derived_dataset keys for datasets (no), or actual dataset objects (yes). Is it acceptable for both to be any Dataset subclass, or does base have to be a Full and derived a Derived?

Does scale return the scaled CSV file when it has completed, or is the return value nothing usable?
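
For example, something roughly like the following would already answer those questions (the wording, parameter names and method layout here are only an assumption made for illustration, not the actual file):

module Atlas
  # Scales the time curves of a base dataset and saves the scaled copies
  # alongside a derived dataset.
  class Scaler::TimeCurveScaler
    # base_dataset    - the (full) Dataset whose time curve CSVs are read
    # derived_dataset - the DerivedDataset to which the scaled curves are saved
    def initialize(base_dataset, derived_dataset)
      @base_dataset    = base_dataset
      @derived_dataset = derived_dataset
    end

    # Public: Multiplies every value in each time curve by the scaling factor
    # and saves the result with the derived dataset.
    #
    # Returns nothing useful; the scaled CSVs are written as a side effect.
    def scale
      # ...
    end
  end
end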

Author

Fixed in 065345b


def initialize(dataset, path)
@dataset = dataset
@path = path
def initialize(dataset, path, export_modifier: nil)
Contributor

Nitpicking: I don't think path needs to be provided to initialize; it is only used in the persist! method. This is somewhat true of export_modifier too, but since that param forms part of the class "identity" I think it's fair to provide that when creating the class.

How about instead giving the destination path to persist?

GraphPersistor.new(dataset, modifier).persist!(path)

At the moment, the whole class feels a bit overkill to me; it can be substituted for a three-line lambda:

module Atlas
  GraphPersistor = lambda do |dataset, path, export_modifier: nil|
    data = EssentialExporter.dump(Runner.new(dataset).refinery_graph(:export))
    export_modifier.call(data) if export_modifier
    File.write(path, data.to_yaml)
  end
end

It could also use some documentation to explain what the params and return values are. export_modifier in particular is completely opaque to any new reader of this code. What values are yielded to the modifier? What should the modifier return?

Author

Solved in commit 62cffb5

initial_table = CSV::Table.new([CSV::Row.new(headers, headers, true)])

write(initial_table, Pathname.new(path))

Contributor

The current use of this method (in TimeCurveScaler) looks like: CSVDocument.create → add data → save.

That involves two writes to disk (in create and again in save), and if something goes wrong in "add data" or the second save, you end up with a half-complete CSV file on disk. Perhaps create could yield itself prior to writing, so that the user can add their initial data? (like File.open(path, 'w') { |f| ... })

CSVDocument.create(path, headers) do |doc|
  doc.set(:a, :b, 1)
  doc.set(:c, :d, 2)
end
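
One possible shape for this (a sketch only, assuming the constructor can build the in-memory table without writing it, and that saving stays in save):

def self.create(path, headers)
  doc = new(path, headers)   # build the in-memory table only
  yield doc if block_given?  # the caller adds its initial data here
  doc.save                   # a single write to disk once everything is set
  doc
end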

Author

Yes, you are so right 😔
I did it this way because Ruby's CSV class is not really good for read-write access and because I did not want to mess up the current CSVDocument.new signature.
But I found an okish way to build only the document and not save it until later.

Author

This will be commit 310ab25

Author

ploh commented Jan 12, 2017

@grdw @antw I addressed all your remarks and replied to each one with the corresponding commit hash for my fix, in case you want to track my latest changes.

I will merge this branch now and fix possible follow-up remarks in another PR.

@ploh ploh merged commit eafba60 into local-datasets Jan 12, 2017
@ploh ploh deleted the local-datasets-scaling_graph branch January 12, 2017 20:42
grdw pushed a commit to quintel/etsource that referenced this pull request Jan 27, 2017
grdw pushed a commit to quintel/etengine that referenced this pull request Jan 30, 2017