Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a file handle or Path object to be sent to the outfile parameter of fit #2054

Merged
merged 11 commits into from
Jun 27, 2024

Conversation

DougBurke
Copy link
Contributor

@DougBurke DougBurke commented May 29, 2024

Summary

The fit call can now be sent a Path object or a file handle, as well as a string. Fix issue #2063 (ensure consistency when changing the statistic object used in a fit; this is not an issue for the UI code).

Details

Partly taken from #2015, #2022.

The idea is for a user at the ui layer to be able to say things like

fit(outfile=Path(...))

from io import StringIO
store = StringIO()
fit(outfile=store)
out = store,getvalue()

with open("txt.txt") as fh:
    fit(outfile=fh)
    # this is not really sensible, but you can do it
    fit(outfile=fh)

To do this we need to be able to send in a Path or file-handle-like object to the Fit.fit method. This does change the IterFit interface but this is an internal class which users do not interact with directly.

To do this I've made some minor clean up changes. This is part of trying to address #2007 and came out of #2022 - I am trying to separate things to make the PRs easier to understand and review.

One of the fun things is in trying to add typing statements for this. We could use typing.TextIO for this but there's actually some "interesting" discussion about this on the internet, since it's often too "large" a requirement. So I follow one piece of advice from the typing world and use a Protocol instead to indicate the functionality we need (close and write).

There is a similar issue here to #1929 (comment) in that I have had to add a close_on_exit flag to decide whether to call fh.close. Normally you would say "just use a context manager" but, as with the IO case, it's not that simple. There may be a better way to do this, and if anyone has an idea please holler, but the current approach - whilst ugly - seems to work.

I have then made some changes to address review comments, including discovering and "fixing" #2063.

@DougBurke DougBurke marked this pull request as draft May 29, 2024 14:20
@DougBurke DougBurke force-pushed the fit-cleanup branch 3 times, most recently from 72a4c67 to 0a206c4 Compare June 4, 2024 13:22
@DougBurke DougBurke force-pushed the fit-cleanup branch 2 times, most recently from 81ef507 to e0234a6 Compare June 8, 2024 18:20
@DougBurke DougBurke marked this pull request as ready for review June 8, 2024 18:29
@DougBurke DougBurke changed the title Fit cleanup Allow a file handle or Path object to be sent to the outfile parameter of fit Jun 12, 2024
@DougBurke DougBurke requested a review from hamogu June 12, 2024 18:44
sherpa/fit.py Outdated Show resolved Hide resolved
sherpa/fit.py Outdated Show resolved Hide resolved
@anetasie
Copy link
Contributor

@DougBurke What would be a new behavior?

Current option fit(outfile="path/filename") generates an output file with the record of the search for all the parameters.

sherpa In [2]: fit("out.txt")
IdentifierErr: data set out.txt has not been set

sherpa In [3]: fit(outfile="out.txt")
Dataset               = 1
Method                = neldermead
Statistic             = wstat
Initial fit statistic = 1.88051e+08
Final fit statistic   = 475.847 at function evaluation 339
Data points           = 446
Degrees of freedom    = 444
Probability [Q-value] = 0.143116
Reduced statistic     = 1.07173
Change in statistic   = 1.88051e+08
   p1.gamma       1.23288     
   p1.ampl        9.07174e-06 

sherpa In [4]: !more out.txt
# nfev statistic p1.gamma p1.ampl
0.000000e+00 1.880512e+08 1.000000e+00 1.000000e+00
1.000000e+00 7.535066e+07 2.200000e+00 1.000000e+00
2.000000e+00 4.137324e+08 1.000000e+00 2.200000e+00
3.000000e+00 2.309309e+08 1.300000e+00 1.600000e+00
4.000000e+00 3.656551e+07 1.900000e+00 4.000000e-01
5.000000e+00 1.906931e+07 3.100000e+00 4.000000e-01
6.000000e+00 3.386327e+06 4.150000e+00 1.000000e-01
7.000000e+00 3.736637e+07 2.612500e+00 6.250000e-01
8.000000e+00 2.362581e+07 2.818750e+00 4.375000e-01
9.000000e+00 3.975639e+06 5.068750e+00 1.375000e-01
'''

sherpa In [5]: fit(outfile="../out.txt")
Dataset = 1
Method = neldermead
Statistic = wstat
Initial fit statistic = 475.847
Final fit statistic = 475.847 at function evaluation 302
Data points = 446
Degrees of freedom = 444
Probability [Q-value] = 0.143116
Reduced statistic = 1.07173
Change in statistic = 0
p1.gamma 1.23288
p1.ampl 9.07174e-06
'''

@hamogu
Copy link
Contributor

hamogu commented Jun 17, 2024

@anetasie : @DougBurke gave an example of the new behavior in #2054 (comment): Instead of a string "path/filename", you can now give it some other object that supports the same interface, for example you can capture the output into a variable using the StringIO module so you don't have to write it to a file first and then read it back in. (In the UI, that might not be what most users do, but it's useful when e/g/ you use Sherpa to build some pipeline and don't know what system it runs on or what directory you can use to write temporary files or you are multi-threading and don't want to bother to make code such that each output filename in unique etc. or other cases like that).

@DougBurke
Copy link
Contributor Author

@anetasie

So, setting outfile to a string has not changed (i.e. this does not change existing behaviour).

The new behaviour is if you set outfile to a Path object or a file-handle. The Path version seems like it's just more to type than a string, but Path is meant to be the "better approach" to specifying file locations, so we may as well support it - e.g.

from pathlib import Path

out = Path("/foo/bar/out.dat")
fit(outfile=out)

You can instead pass it a file-handle, which is again a specialized feature. The one I want it for is because it makes testing things easier since you can say

from io import StringIO

buffer = StringO()
fit(outfile=buffer)
out = buffer.getvalue()
print(out)

[ie you don't need to write to a file just to get the data].

You can also send in an actual filehandle - e.g.

with open("/foo/bar/baz.dat", "wt") as fh:
    fit(outfile=fh)
    fh.write("YOU CAN ADD EXTRA INFO TO THE FILE IF YOU REALLY WANT\n")

but again this is likely only useful for specialized cases.

  • you can set outfile to something like

@DougBurke DougBurke linked an issue Jun 17, 2024 that may be closed by this pull request
Copy link
Contributor

@hamogu hamogu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of this chain from Fit to IterFit to IterFitCallback. It seems convoluted and I'm sure there must be a simpler way of organizing it, but that's not for this PR to litigate.

@anetasie
Copy link
Contributor

@DougBurke thanks for the description.

Rather than 'from numpy import x, y, z' use np.x, np.y, np.z
directly. This is a style choice, but I think makes the code
clearer.
- use super().__init__() rather than hard-code the super-class
  name
- add or remove the NoNewAttributesAfterInit class (for
  example, IterFit is removed because
  - we never called the super-class so never actually made
    use of it
  - it's an internal class we don't expect the user to come
    across
  - a subsequent change is awkward if we actually initialize
    the NewAttributesAfterInit class
- avoid re-creating the list that get_thawed_pars() creates for
  us (this is due to a clean up in sherpa#1974 that created this
  routine)
- create a n-element tuple by appending to a list and then
  converting that to a tuple, rather than just adding a single
  tuple (since mypy doesn't like this)
Use isinstance rather than type. There are other uses of type
in this module but it't tricky because they need checking to
make sure we don't accidentally change behaviour (as some
classes may be sub-classed).
We know explicitly allow the output file for fit calls to be
a Path object (previously we allowed it but not intentionally)
and also allow a file-handle like object to be used. This then
lets you say

    out = String()
    fit(outfile=out)
    txt = out.getvalue()

without having to create an on-disk file.

In doing this the internals of the IterFit class were changed,
and the callback routine it uses was converted from a local
definition to a top-level class. This change I think better
separates the code logic, but is also to help address
issue sherpa#2007 (where the multiprocessing code doesn't seem to like
our use of local function definitions when using anything but
the "fork" method).

As part of this work some basic typing statements have been
added. There is some fun in how best to indicate the type of
a "file handle": the easy solution is typing.TextIO but this
is "too large" as all we need are write() and close() methods,
so we use a Protocol. See the discussion at
python/typing#829
Now, this may not be ideal - as we don't currently export the
protocol so how will it work with the documentation - but it
would be easy to change to use TextIO.
Clarify the internal fitting code, where a tuple was being
converted to a list to be able to edit an element. We now explicitly
decompose the tuple, so it is more obvious what is going on.

I had hoped to be able to simplify this routine even more, but
the difference between the statistic object stored locally and
the one stored in the callback means it is not as simple as I
had hoped.
The class used a boolean to determine whether to call a function
when we can just check if the function is defined.
The issue at hand is that if you *change* the stat method of the
Fit structure then you will get a mis-match with the actual
fitted statistic, which remains using the _iterfit.stat method
(i.e. the original statistic).

We only have two test cases that expose this difference, so take
advantage of the new "write the fit results to StingIO" capability
and show that the stat value has changed (i.e. all but the last row
use the original statistic, in this case Chi2DataVar, so the stat
value is ~ 8500, and only the last row uses the Cash or CStat
value).

This is intended as a regression test.
We can't make use of a simple context manager for handling
the fit output since

a) we may not have a handle/filename
b) even if we do, we may not want to close it (i.e. if sent in by
   the user)

We can at least try to emulate the "close even on an error" handling
of a context manager with try/finally, as suggested by @hamogu

This is a bit invasive visually, but it's not clear to me that
breaking the code up into smaller chunks is worth it here.
We can make sure that changing the Fit stat also changes the
associated _iterfit.stat field, which seems like it should be
the correct thing to do.

There are only two tests where this is an issue, and it's unclear
whether the changes are good, bad, or not significant, since

a) the tests appear to be regression tests
b) we don't know how the test data was created, so we don't know
   what the true values are

   Note that the parameter values do not appear to have changed
   "hugely", but I haven't looked into the results too deeply.

The tests do show the change - in that the statistic value reported
in the "fit(outfile=...)" option now remains consistent (i.e.
matches the expected value, assuming our new interpretation of
the stat field of the Fit object is correct [which it should be,
I claim]).
We can just call the callback routne after updating the model
values, rather than manually setting the thawed pars,
calculating the stat, and writing out the values to a file.
This is ony possible now that sherpa#2063 has been addressed.
Although this is primarily used in the UI code it is used often
enough that it's useful to have in sherpa.utils.types.
@DougBurke
Copy link
Contributor Author

Rebased

There's been no other changes to the code in the rebase.

Copy link

codecov bot commented Jun 27, 2024

Codecov Report

Attention: Patch coverage is 97.32143% with 3 lines in your changes missing coverage. Please review.

Project coverage is 86.68%. Comparing base (44442ee) to head (c32c81c).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2054      +/-   ##
==========================================
+ Coverage   86.64%   86.68%   +0.03%     
==========================================
  Files          82       82              
  Lines       29766    29786      +20     
  Branches     4487     4492       +5     
==========================================
+ Hits        25792    25821      +29     
+ Misses       3863     3854       -9     
  Partials      111      111              
Files Coverage Δ
sherpa/astro/ui/serialize.py 89.67% <100.00%> (ø)
sherpa/astro/ui/utils.py 98.15% <100.00%> (+<0.01%) ⬆️
sherpa/ui/utils.py 97.25% <100.00%> (ø)
sherpa/utils/types.py 100.00% <100.00%> (ø)
sherpa/fit.py 83.12% <97.08%> (+0.42%) ⬆️

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44442ee...c32c81c. Read the comment docs.

@wmclaugh wmclaugh merged commit 4e85332 into sherpa:main Jun 27, 2024
17 checks passed
@DougBurke DougBurke deleted the fit-cleanup branch June 27, 2024 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fit class: what to do about changing the stats class?
4 participants