-
Notifications
You must be signed in to change notification settings - Fork 1.3k
perf: optimize yaml parsing for stages #2775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@Suor Any story on this? Could you elaborate on what you are doing and why? |
|
This is after #2671. We discussed it on planning I believe. |
|
@Suor Still, could you summarize what is going on this this PR? It is not clear to me what this is about and why. |
|
And I don't think we've discussed this PR specifically during the planning. Might be forgetting something though. |
|
We've discussed it during Demo. Mostly me and @shcheklein. The thing is that the majority of the time to collect stages is spent to parse YAML. We still need |
|
The graph collection for 21k add stages takes 23.6s instead of 69.6s for me after this optimization. The optimization strategy is also described here. |
| from ruamel.yaml.error import YAMLError | ||
|
|
||
| try: | ||
| from yaml import CSafeLoader as SafeLoader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it installed automatically on most systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyYAML for Windows has wheels, which include binaries. I didn't test how it works though. For other systems you will need libyaml-dev installed to use C. Was not installed in my Ubuntu, so one needs to:
sudo apt install libyaml-devBefore installing PyYAML, otherwise it won't be linked.
Pure python PyYAML still works 2x fast as ruamel.yaml. We might also poke PyYAML author to create all the wheels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is quite a requirement, and won't be easy to explain to the users. This will result in dvc stage collection staying slow for pretty much everyone, except a few guys who will discover that they need to install libyam-dev. Though 2x is still pretty good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poking PyYAML guy and helping him is still an option. Will work on top of this seamlessly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2x sounds good and we can def ping the guys (let's do this right now?)
| if self._stage_text is not None: | ||
| saved_state = parse_stage_for_update(self._stage_text, fname) | ||
| if "meta" in saved_state: | ||
| state["meta"] = saved_state["meta"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is some special handling required for meta here? (vs just a single apply_diff before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because state, which we get from self.dumpd() does not have meta key. So apply_diff() will remove it from the target structure. The alternative would be to store meta in some key and then include it into .dumpd(), which is more hustle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why didn't we have the same (seemingly very ad-hoc and fragile logic) before? We were using dump before right? and were applying diff on top of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Before it was part of the dumpd. I missed that a line was removed in dumpd as well.
Still looks fragile that we have one more place that manipulates with all these attributes. It feels that it would be better to follow the "regular" way in this file and make the meta a regular stage attribute. So, that we don't have a separate path to handle just it.
Also, wondering if is_cached can be optimized considering that we are saving the state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regular way now will mean we will need to store meta on some attribute, then return it in .dumpd() then let it be applied to the same meta, but with comments etc inside apply_diff().
I am not sure that is a good idea, I don't want to add an another attribute, which is never used in fact. And even less I want to add another constructor param.
|
@Suor I would put some comments there. I'll be very hard to understand the motivation behind using two parsers. Also, would like to know how often will it use the fast one? will it try to build if from sources (and expect gcc to be installed for example)? how slow (compared to the current implementation) the one that is not backed by the implementation in C. |
shcheklein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, see some comments around
|
@shcheklein it always uses the fast one when reading stage, it only uses slow one to update a stage file. The supposition is that the overall number of stages might grow faster then the number of stages changed by a single command. PyYAML will always use binaries under Windows, will automatically link to Pure python PyYAML parser is 2x faster than I will add some motivation comments into code. |
|
So I looked through PyYAML issues, there are a couple of occasions where people ask for non-Windows wheels - yaml/pyyaml#43, yaml/pyyaml#303, yaml/pyyaml#227, which don't get any traction though. I created a new issue there to ask whether they are interested and offer our help -yaml/pyyaml#346. |
|
@efiop @shcheklein Updated an explanation and added a docstring to |
|
The reason why it was not explained before is that target functionality (retaining comments, formatting and order) was effectively an invisible side-effect of using |
No description provided.