Rethink UIDArray #456
Conversation
@cliffckerr In the process of going over the changes and trying them out with the HIV module we're currently implementing, I did encounter some odd behaviour, though.
I haven't had a chance to test the performance with HIV yet with some of those indexing edge cases (and actually converting over existing modules requires a fair bit of work...!), but the interface is broadly looking good! Some edge cases are noted in specific comments.
starsim/settings.py
Outdated
INT_NAN = 2147483647 # From np.iinfo(np.int32).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
intnan = -32767 # From -np.iinfo(np.int16).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
Do we have a style guide for these kinds of constants wrt. capitalization?
I don't think a formal one; I used lowercase to be consistent with `np.nan`. In general, NumPy doesn't use uppercase for constants (e.g. `np.pi`), nor does SciPy. For internal-use constants (e.g. `np.lib.format.BUFFER_SIZE`), I agree we should use the convention of uppercase.
Removed in latest revision
starsim/settings.py
Outdated
INT_NAN = 2147483647 # From np.iinfo(np.int32).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
intnan = -32767 # From -np.iinfo(np.int16).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
Using a negative value here can be problematic for the pattern where we have a time index (like `ti_dead`) and at each timestep we check which agents now meet the time criterion with something like `self.ti_dead <= sim.ti`. A negative value always meets that criterion, so it would need some different logic.
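To make the concern concrete, here is a minimal NumPy sketch (illustrative names only, not Starsim code) comparing the two candidate sentinels against the typical "has this agent's death time arrived?" check:

```python
import numpy as np

NEG_NAN = -32767        # small negative sentinel
BIG_NAN = 2147483647    # large positive sentinel (np.iinfo(np.int32).max)

ti = 5  # current timestep
ti_dead_neg = np.array([3, 10, NEG_NAN])  # last agent should never die
ti_dead_big = np.array([3, 10, BIG_NAN])

print(ti_dead_neg <= ti)  # [ True False  True] -- the sentinel wrongly matches
print(ti_dead_big <= ti)  # [ True False False] -- the sentinel is safely excluded
```

With a negative sentinel, the raw comparison silently treats "never" as "already happened", which is the failure mode described above.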
`self.ti_dead <= sim.ti` still works because of the `self.notnan()` call. It won't work to do `self.ti_dead.raw <= sim.ti`, though (see comment below).
I agree that for int arrays, which tend to be time indices, we use `<` much more often than `>`. I see two conflicting arguments here:
1. We should use a very large value to represent NaN, so if a mistake is made and it isn't getting filtered out as it should, it isn't likely to matter.
2. We should use a negative value to represent NaN, so mistakes are less likely to go unnoticed and thus get fixed.
Ironically, I feel I'm usually on the side of (1) and you're usually on the side of (2), but it seems we're swapped here :)
My personal reason for preferring a negative value is perhaps no longer relevant, but it's that a few times when I was looking at the arrays (especially derived quantities like `self.ti_dead.mean()`), a few large positive values mixed in with many "correct" values didn't immediately jump out at me as an error. To me, negative values more clearly indicate that something's not right (and I do like @robynstuart's suggestion of just using -1, but that feels restrictive for a general `IntArr` class).
I had a look at pandas' nullable integer series -- it looks like exactly what we need, but I think performance is considerably worse than plain NumPy arrays.
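For reference, a quick look at what pandas' nullable integers do here (an illustration of the behaviour, not a proposed change; the performance concern above still applies):

```python
import pandas as pd

# pandas' nullable Int64 keeps a separate validity mask, so missing entries
# propagate as <NA> through comparisons instead of needing a sentinel value
s = pd.Series([3, 10, None], dtype='Int64')
r = s <= 5
print(r)
# 0     True
# 1    False
# 2     <NA>
```

Missing entries never silently satisfy the comparison, which is exactly the semantics being sought with the sentinel values.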
I don't think we should rely too much on the `notnan()` call, because indexing the `IntArr` turns it back into a normal array, right? So not just using `.raw`, but operating on a collection of UIDs would also behave the same (like `self.ti_dead[uids] <= sim.ti`)?
My thinking originally was that just writing `<= sim.ti` would be the intended use case and not a mistake, hence it should 'just work'. Aside from that, it wasn't obvious to me whether a large positive or a large negative value would be more noticeable as a mistake in general; it's probably context dependent, although I'd agree that in typical `ti_*` usages a negative value would be more noticeable. The magnitude also matters: I suspect `-32767` might be too small. If the user is taking a sum/average, it might not be big enough to outweigh valid entries, and the aggregate might still look OK. So whichever sign we go with, a larger magnitude would decrease (but not eliminate) the chances of that happening.
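The magnitude point can be illustrated with a quick NumPy sketch (hypothetical data, not Starsim code): a small-magnitude sentinel can hide inside an aggregate, while a huge one makes the mistake obvious.

```python
import numpy as np

ages = np.array([30.0, 45.0, 60.0] * 1000)  # 3000 valid entries
small = np.append(ages, -32767)             # one small-magnitude sentinel leaks in
huge = np.append(ages, 2147483647)          # one huge sentinel leaks in

print(ages.mean())   # 45.0
print(small.mean())  # ~34.1 -- wrong, but could pass a casual glance
print(huge.mean())   # ~7.2e5 -- obviously broken
```

Both aggregates are wrong, but only the large-magnitude sentinel produces a value implausible enough to jump out.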
Agree -- removed `IntArr` entirely so we can just use `np.nan`.
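With float arrays and `np.nan`, the comparison "just works" without any sentinel bookkeeping, since NaN compares as False against everything (a plain-NumPy sketch with illustrative names):

```python
import numpy as np

ti = 5
ti_dead = np.array([3.0, 10.0, np.nan])  # nan = agent never scheduled to die

print(ti_dead <= ti)      # [ True False False] -- nan never matches
print(np.isnan(ti_dead))  # [False False  True] -- and is easy to detect explicitly
```

This is the behaviour the integer sentinels were trying to emulate, at the cost of storing time indices as floats.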
@@ -71,7 +71,7 @@ def __init__(self, pars=None, key_dict=None, vertical=False, *args, **kwargs):
p2 = ss_int_,
It does kind of feel odd that there is a 'UID type' for the purpose of indexing (with `ss.uids`) but there is no corresponding scalar type. How would it look if there was a UID type inheriting from `ss_int_`, and then instead of `ss.uids` being a subclass of `np.ndarray`, it could just cast an array of `int` type to an array of `uid` type?
You can use a single scalar value:
import starsim as ss
sim = ss.demo(plot=False)
sim.people.age[ss.uids(9999)]
126.41934324595512 # need to fix the age bug lol
We could implement it this way, but I prefer having a `ss.uids` class to support other methods; e.g. I like being able to do `self.auids = self.auids.remove(uids)`. But there is an annoyance here, namely that because `ss.uids` is implemented as a view of an array, you can't modify the array in-place (because it doesn't own its own data, or there are multiple references to the data). If we could get around this by subclassing int instead, that might be preferable!
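The in-place limitation can be reproduced with plain NumPy views (a minimal illustration of the general mechanism, not `ss.uids` itself):

```python
import numpy as np

base = np.arange(5)
view = base[1:]  # a view: shares base's buffer, doesn't own its data

print(view.flags.owndata)  # False
try:
    view.resize(2)  # in-place resize fails for non-owning arrays
except ValueError as err:
    print(err)  # cannot resize this array: it does not own its data
```

Any array (or ndarray subclass) that is a view over another array's buffer hits this restriction, so in-place removal of UIDs can't work without reallocating.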
Could `self.auids.remove(uids)` be implemented as `Arr.remove()` (and the same for intersect) to provide similar convenience? I think having the notion of a 'UID' being a thing (via subclassing int) will be clearer than having UID-ness be a property of a collection of ints.
I think we could implement it this way, but I prefer the current implementation for a couple of reasons:
1. It matches how `pd.Series` is implemented, e.g.:
import pandas as pd
s = pd.Series([1,2,None,4], dtype='Int64')
type(s[0])
Out[10]: numpy.int64 # but this is not the Series dtype, which is pd.types.Int64
2. This way it's easy to check the type (`isinstance(key, ss.uids)`), whereas otherwise it gets complicated. In particular, NumPy seems to jam int-like things into `np.int64`, e.g.:
import numpy as np
class UID(np.int64): pass
a = np.arange(5, dtype=UID)
assert a.dtype == UID # passes
assert a.dtype != np.int64 # fails
AssertionError
def __invert__(self): return self.asnew(~self.values)

@property
def uids(self):
I like having a method to do this rather than `ss.true`, but can we call the method `true` instead of `uids`, and have a corresponding `false` method as well? That would also then allow `auids` to be renamed `uids`.
The rationale for this implementation is:
- `Arr.auids` maps directly to `sim.people.auids` (which is an array of active UID indices), which we need to distinguish from `sim.people.uid` (which is an Arr containing all UIDs ever created).
- I thought it could get confusing which methods return UIDs/inds and which return booleans, which is why I thought `.uids` clarified that it was returning a list of UIDs in a way that `true()` wouldn't.
- The ufuncs should be working, so `arr.false()` would here be `(~arr).uids`. I like this better because it clarifies that you can invert an `Arr` and use it as an `Arr` (`~arr`), or just pull out the False UIDs (`(~arr).uids`). With `false()`, it always just returned the UIDs rather than giving this option.
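The inversion idea can be sketched with a toy class (hypothetical names, not the actual Starsim implementation) whose `~` returns another array while `.uids` extracts the True indices:

```python
import numpy as np

class ToyBoolArr:
    """Hypothetical stand-in for a BoolArr: a boolean mask over agent UIDs."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=bool)

    def __invert__(self):
        # Inverting yields another array, not UIDs, so it stays composable
        return ToyBoolArr(~self.values)

    @property
    def uids(self):
        # UIDs (here, positions) where the mask is True
        return np.nonzero(self.values)[0]

arr = ToyBoolArr([True, False, True, False])
print(arr.uids)     # [0 2] -- the True UIDs
print((~arr).uids)  # [1 3] -- "false" expressed via inversion
```

This shows the design choice described above: a single `uids` accessor plus composable inversion, rather than separate `true()`/`false()` methods that always collapse to UIDs.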
I think I was coming at it with the notion that an `Arr` corresponded to a subset of the UIDs, and then there was ambiguity as to whether `BoolArr.uids` was the UIDs contained in the `Arr`, or the UIDs for which it was true. I was expecting `uids` to be the former, although maybe that's just something that people need to learn (i.e., to use `auids` for that).
But I wonder if that's making too much of `Arr.auids` mapping directly to `sim.people.auids`. It would be cool eventually to be able to filter the `People` in different ways by having a different mask on the `Arr` instance, in which case `Arr.auids` might not match `sim.people.auids`. That's the scenario I was thinking of in having `Arr.uids` mean 'the UIDs that appear in this `Arr`' and `Arr.true`/`Arr.false` clearly referring to the true/false entries.
I don't think that having `true` and `false` return the UIDs would be too confusing, though, since that's what `cv.true()` and `cv.false()` did (and what `ss.true()` and `ss.false()` currently do)?
key = key.uids

if isinstance(key, (int, uids, slice)):
    self.raw[key] = value
Same issue as in `__getitem__`: raising an error with an empty array (`ss.uids([])`). This was happening a bit when trying to convert the HIV code over, where we unconditionally read/write from the arrays with a collection of UIDs, but if nobody is sampled on that timestep then that collection is empty.
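For reference, plain NumPy treats an empty integer index array as a valid no-op, which is presumably the behaviour wanted here (a minimal check with illustrative data):

```python
import numpy as np

arr = np.zeros(5)
empty = np.array([], dtype=np.int64)  # e.g. no agents sampled this timestep

arr[empty] = 1.0   # writes nothing, raises nothing
print(arr[empty])  # []
print(arr)         # unchanged: [0. 0. 0. 0. 0.]
```

So unconditional read/write with a possibly-empty UID collection works without any special-casing at the call site.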
Fixed
starsim/states.py
Outdated
if isinstance(key, ss.BoolArr):
    key = key.uids

if isinstance(key, (int, uids, slice)):
I was wondering about this not being symmetric with `__getitem__` with respect to slices: here, the slice acts on `.raw`, whereas in `__getitem__` the slice acts on `.values`. This can lead to some weird-looking behaviour once some of the agents have died, e.g.:
sim.people.age[0:5]
Out[14]: array([22.08333333, 42.08333333, 64.08333333, 28.08333333, 33.08333333])
sim.people.age[0:5] = 1
sim.people.age[0:5]
Out[16]: array([ 1. , 1. , 1. , 28.08333333, 33.08333333])
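The asymmetry can be reproduced with plain NumPy (hypothetical `raw`/`alive` names mirroring the description above): reading a slice through the compressed values and writing a slice through the raw storage touch different agents once any interior agent has died.

```python
import numpy as np

raw = np.array([22., 42., 64., 28., 33., 50.])           # full backing storage
alive = np.array([True, False, True, True, True, True])  # agent 1 has died

before = raw[alive][0:3]  # read like __getitem__: slice the compressed values
raw[0:3] = 1.0            # write like __setitem__ here: slice the raw storage
after = raw[alive][0:3]

print(before)  # [22. 64. 28.]
print(after)   # [ 1.  1. 28.] -- only two of the three read positions changed
```

The write landed partly on the dead agent's slot, so the visible values only partially update, matching the session output above.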
Fixed
starsim/states.py
Outdated
elif idx >= len(vals):
    raise Exception('Attempted to write to a non-existent index - this can happen if attempting to write to new entries that have not yet been allocated via grow()')
vals[idx] = value

def __getitem__(self, key):
I was expecting that this would work:
sim.people.ti_dead[sim.people.uid]
Traceback (most recent call last):
File "C:\Users\romesh\miniconda3\envs\atomica312\Lib\site-packages\IPython\core\interactiveshell.py", line 3577, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-29-e17a9156e088>", line 1, in <module>
sim.people.ti_dead[sim.people.uid]
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "C:\Users\romesh\projects\idm\starsim\starsim\states.py", line 97, in __getitem__
return self.values[key]
~~~~~~~~~~~^^^^^
IndexError: index 11794 is out of bounds for axis 0 with size 11793
but since `IndexArr` is not a `uids` instance, it does something different? If you put in a scalar UID type, it could be used here to have `IndexArr` also do lookups by UID.
Fixed
I think this should be fixed now, but I'm not sure how to reproduce it, so not 100% sure!
self.screens = ss.State('screens', int, 0)
self.ti_screened = ss.State('ti_screened', int, ss.INT_NAN)
self.screened = ss.BoolArr('screened')
self.screens = ss.FloatArr('screens', default=0)
Seems like `screens` should be an `IntArr`?
Discussed on the call -- no more `IntArr` due to `np.nan` support.
self.ti_screened = ss.State('ti_screened', int, ss.INT_NAN)
self.screened = ss.BoolArr('screened')
self.screens = ss.FloatArr('screens', default=0)
self.ti_screened = ss.FloatArr('ti_screened')
Also, I would expect the `ti_*` states to be `IntArr` as well.
Discussed on the call -- no more `IntArr` due to `np.nan` support.
starsim/network.py
Outdated
loc = self.pars.embedding_func.rvs(available)
if np.isnan(loc).sum():
    raise Exception('WARNING, should not happen!')
loc = sc.rmnans(loc, replacenans=9999)
Looks like temporary code?
Removed
starsim/network.py
Outdated
p1 = available_m[ind_m]
p2 = available_f[ind_f]
beta = np.ones(n_pairs) # TODO: fix
What remains TODO / fix here?
Updated the comment -- it would be nice to eventually have the ability to set beta for new connections. Currently this ends up being confusing with beta-change interventions, which must get reapplied at each timestep to modify new connections, rather than there being a way to just set the default beta for new connections. I'm imagining something like `beta = np.full(n_pairs, self.beta)`.
Update 2024-04-23: This PR is now ready to review + merge!
Replaces `UIDArray`, `ArrayView`, and `State` with `Arr`.

Dead agents
Dead agents are no longer removed, so `uid` always corresponds to the position in the array. This means that no remapping is necessary, which has a significant performance benefit; running this script: for me, this takes about 2.7 s on `main` and about 1.8 s on this branch.

Instead, each `Arr` (formerly known as `State`) carries around a copy of the `People` object, from which it can access `aliveinds`, which is used to filter the arrays by default (only if UIDs are not provided: if they are, they can be used without any extra filtering). When an agent dies, all of their properties are set to `nan` (see below), and their UID is removed from `aliveinds`.

What else has changed?
`Arr` has type-specific subclasses, like `FloatArr` and `BoolArr`. These classes have sensible defaults defined, so e.g. `ss.State('ti_dead', int, ss.INT_NAN)` can be replaced with `ss.IntArr('ti_dead')`.

These classes also have a defined `nan` value (for float and int, not for boolean), and have only appropriate operations defined on them, e.g. on `main`:

On this branch:
Outstanding tasks

Other than fixing the failing tests, a major outstanding question is how chaining should work. For performance and simplicity, an operation on an `Arr` returns a `np.array`. However, in some cases, it's necessary to do operations on an `Arr` before indexing it with `uids`. We could just do the operation on the whole array (rather than just `aliveinds`), but this is inefficient.

The solution is to define a `people.remap_uids()` function that will "expand" the sparse array (of length `len(people.aliveinds)`) back into the underlying array (of length `max(people.uid)`). This is pretty fast, and is actually only used in one place currently (`make_new_cases`), but is ugly.

Closes #296, #304, #400, #407, #423
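As a footnote on the `remap_uids()` idea described above, the "expansion" step can be sketched in plain NumPy (illustrative names and data, not the actual implementation):

```python
import numpy as np

n_ever = 8                                     # total UIDs ever created
aliveinds = np.array([0, 2, 3, 6, 7])          # UIDs of currently alive agents
sparse = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one result per alive agent

# Expand the compressed per-alive-agent values back onto the full
# UID-indexed array, leaving dead agents' slots as nan
full = np.full(n_ever, np.nan)
full[aliveinds] = sparse
print(full)  # [ 1. nan  2.  3. nan nan  4.  5.]
```

This is a single fancy-indexed assignment, which is why the expansion is fast relative to remapping UIDs on every access.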