Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for an ecl race in maxima init #35195

Merged
merged 2 commits into from
Mar 26, 2023
Merged

Conversation

tornaria
Copy link
Contributor

📚 Description

When maxima is initialized a bug in ecl implementation of
ensure-directories-exist might result in a runtime error.

As a workaround, in case we get a runtime error we use python to create
the directory and continue with maxima initialization.

Note that for normal usage the directory will already exist within the
user's DOT_SAGE so this code will almost never run. However, when
running doctests on CI this occasionally triggers.

New doctest

The first commit introduces a doctest to try to catch this race.
We run a new instance of sage in a subprocess to ensure maxima is
not already initialized. We use a temporary MAXIMA_USERDIR so its empty,
and we try to initialize maxima twice in parallel to entice the race.

This temporary dir is placed within DOT_SAGE so it is easy
to try different filesystems.

The bug triggers more frequently if DOT_SAGE is in a high
latency filesystem (e.g. sshfs on a non-local host).

Closes #26968.

📝 Checklist

  • I have made sure that the title is self-explanatory and the description concisely explains the PR.
  • I have linked an issue or discussion.
  • I have created tests covering the changes.

@tornaria
Copy link
Contributor Author

@mkoeppe : I've just realized that I misplaced the # long time. However my first "physical line" ends with a triple quote. What do you suggest as a way to format this?

I marked this long time since it takes ~ 2s in my (fast) machine with DOT_SAGE on a nvme (and ~5s with DOT_SAGE on sshfs). I don't know how to make this faster. Sage startup takes too long; ideally I would run this on bare python instead, however import sage.interfaces.maxima_lib fails miserably without import sage.all...)

@mkoeppe
Copy link
Member

mkoeppe commented Feb 25, 2023

I've just realized that I misplaced the # long time. However my first "physical line" ends with a triple quote. What do you suggest as a way to format this?

Extra line break before f''', I'd say

@mkoeppe
Copy link
Member

mkoeppe commented Feb 25, 2023

Sage startup takes too long; ideally I would run this on bare python instead, however import sage.interfaces.maxima_lib fails miserably without import sage.all..

You could try if import sage.all__sagemath_categories; import sage.interfaces.maxima_lib does the job

We run a new instance of sage in a subprocess to ensure maxima is
not already initialized. We use a temporary MAXIMA_USERDIR so its empty,
and we try to initialize maxima twice in parallel to entice the race.

This temporary dir is placed within `DOT_SAGE` so it is easy
to try different filesystems.

The bug triggers more frequently if `DOT_SAGE` is in a high
latency filesystem (e.g. sshfs on a non-local host).

The next commit introduces a workaround for the bug.
@tornaria
Copy link
Contributor Author

Sugestion for testing:

  • get the new version of maxima_lib.py and run sage -t built without this PR on this new file to make sure the race is triggered (if I set DOT_SAGE to a directory within a sshfs this triggers the race every time).
  • once you are confident the bug triggers, rebuild sage with this PR (or just replace maxima_lib.py in your build/lib... directory if you know what you are doing, and see how it won't break anymore.
  • 🎉

@tornaria
Copy link
Contributor Author

I rebased on 10.0.beta2 and fixed the # long time marker as suggested by @mkoeppe.

@@ -116,7 +136,19 @@
ecl_eval("(setq $nolabels t))")
ecl_eval("(defvar *MAXIMA-LANG-SUBDIR* NIL)")
ecl_eval("(set-locale-subdir)")
ecl_eval("(set-pathnames)")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same could also be done on the CL side using something like (handler-case (set-pathnames) (file-error () (progn (ensure-directories-exist *maxima-objdir*) (set-pathnames)))
(untested)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing a fix in lisp, but that would be prone to the same race.

Bear in mind that *maxima-objdir* has multiple components, here you restart once but the race might hit again in a following component. OTOH there might be a real reason for ensure-directories-exist to fail (e.g. permissions, or maybe there is a file or other non-directory thing with a name clash with *maxima-objdir*, etc) so we don't want to retry forever.

I think the only proper way to fix this is to catch the error in the implementation of ensure-directories-exist. I.e. when mkdir fails, test again if the directory exists and if so ignore the error and continue to the next component. Idealy this has to be done in ECL, or else include a correct implementation in maxima itself and use it instead.

Note that this calling of ensure-directories-exists in the initialization of maxima is only done for ecl; based on the comment in maxima source maybe this is so because of this race.

I have a PoC patch to ecl to fix this, I'm yet to test it thoroughly and submit to ecl. But even if we patch ecl in sage and this is eventually fixed upstream, we'd still need a workaround for all the system ecl in the wild. A workaround in maxima itself may work better since sage-the-dist doesn't accept system maxima but that is still a step backwards IMO (it would be nice if #33718 and #32867 moved forward).

--- src/lsp/mislib.lsp.orig	2021-02-01 09:59:46.000000000 -0300
+++ src/lsp/mislib.lsp	2023-02-15 21:10:54.498524116 -0300
@@ -302,7 +302,14 @@
             (let ((ps (namestring p)))
               (when verbose
                 (format t "~%;;; Making directory ~A" ps))
-              (si::mkdir ps mode)))))
+              (handler-case
+                (si::mkdir ps mode)
+                (error (e)
+                  (if (si::file-kind p nil)
+                    (when verbose
+                      (format t "~%;;; Warning: ~A already exists" ps))
+                    (error e)))
+              )))))
       (values pathname created))))
 
 (defmacro with-hash-table-iterator ((iterator package) &body body)

In any case my point is that this workaround may be necessary for a while nevertheless. The code I want to push does nothing unless an exception is raised, which should almost never happen in normal usage and never happen with a fixed ecl).

The advantage of using python os.makedirs() is that its implementation is immune to this race.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point in favor of the Python solution. Probably best to include this as a comment in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point in favor of the Python solution. Probably best to include this as a comment in the code.

I added something to the comment, no other change. See if you like it or feel free to suggest any wording for the comment.

@codecov-commenter
Copy link

codecov-commenter commented Feb 25, 2023

Codecov Report

Base: 88.60% // Head: 88.59% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (0b13dfc) compared to base (8f5bbd2).
Patch coverage: 28.57% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #35195      +/-   ##
===========================================
- Coverage    88.60%   88.59%   -0.02%     
===========================================
  Files         2140     2140              
  Lines       397273   397279       +6     
===========================================
- Hits        352004   351951      -53     
- Misses       45269    45328      +59     
Impacted Files Coverage Δ
src/sage/interfaces/maxima_lib.py 92.54% <28.57%> (-1.19%) ⬇️
src/sage/interfaces/qsieve.py 71.30% <0.00%> (-2.61%) ⬇️
src/sage/groups/additive_abelian/qmodnz.py 91.48% <0.00%> (-2.13%) ⬇️
src/sage/modular/local_comp/liftings.py 98.33% <0.00%> (-1.67%) ⬇️
src/sage/graphs/generators/random.py 91.26% <0.00%> (-1.37%) ⬇️
src/sage/homology/matrix_utils.py 87.15% <0.00%> (-0.92%) ⬇️
src/sage/cpython/_py2_random.py 75.61% <0.00%> (-0.83%) ⬇️
src/sage/schemes/elliptic_curves/hom_velusqrt.py 94.48% <0.00%> (-0.79%) ⬇️
src/sage/coding/channel.py 97.10% <0.00%> (-0.73%) ⬇️
...sage/geometry/hyperbolic_space/hyperbolic_model.py 88.95% <0.00%> (-0.62%) ⬇️
... and 20 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

When maxima is initialized a bug in ecl implementation of
`ensure-directories-exist` might result in a runtime error.

As a workaround, in case we get a runtime error we use python to create
the directory and then continue with maxima initialization.

Note that for normal usage the directory will already exist within the
user's `DOT_SAGE` so this code will almost never run. However, when
running doctests on CI this occasionally triggers.
Copy link
Member

@mkoeppe mkoeppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Haven't spent much time trying to repro the bug, but the test looks plausible.

@tornaria
Copy link
Contributor Author

I hate this: "This branch is out-of-date with the base branch".

This is awful because it triggers people to click on a button doing a stupid innecessary merge. I hate the way sage tree is littered with merges. Release manager merges are awesome because it gives a clean straightforward history combined with --first-parent but all the unnecessary merges are confusing. And this is pushing us to add even more unnecessary merges.

I'd rather prefer to get this message "This branch has no conflicts with the base branch" as in e.g. leahneukirchen/xtools#266.

Moreover, sometimes adding a merge on a PR branch confuses github which then starts claiming one has many more commits and files changed than there really are.

@mkoeppe
Copy link
Member

mkoeppe commented Feb 25, 2023

This is awful because it triggers people to click on a button doing a stupid innecessary merge.

At least only maintainers can push this button on another person's PR (I think)

@github-actions
Copy link

Documentation preview for this PR is ready! 🎉
Built with commit: 0b13dfc

@tornaria
Copy link
Contributor Author

This is awful because it triggers people to click on a button doing a stupid innecessary merge.

At least only maintainers can push this button on another person's PR (I think)

I guess so. But on other repos I don't see the button on my own PR. It seems unnecessary: either it can be merged without conflicts (if so, why do it? the release manager can merge it) or there are conflicts (then you have to do it on a checkout).

IMO what it is useful is that I get a banner telling me either

  • merge will be ok without conflicts, nothing to do
  • merge will give conflicts, so I know I have to do something

Also: rebase FTW.

/rant

@vbraun vbraun merged commit e1cb236 into sagemath:develop Mar 26, 2023
@mkoeppe mkoeppe added this to the sage-10.0 milestone Mar 26, 2023
@tornaria tornaria deleted the ecl_race branch April 4, 2023 21:46
tornaria added a commit to tornaria/sage that referenced this pull request Jun 25, 2023
Calling initialize-runtime-globals will run set-pathnames and be subject
to the issue described in sagemath#26968.

Thus the workaround introduced in sagemath#35195 has to be done before anything
that may call set-pathnames (e.g. initialize-runtime-globals).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ECL test sometimes fails to create maxima directory
5 participants