
Rethink UIDArray #456

Merged
92 commits merged into main on May 1, 2024

Conversation

cliffckerr (Contributor) commented Apr 13, 2024:

Update 2024-04-23: This PR is now ready to review + merge!

Replaces UIDArray, ArrayView, and State with Arr.

Dead agents

Dead agents are no longer removed, so uid always corresponds to the position in the array. This means that no remapping is necessary, which has a significant performance benefit; running this script:

import sciris as sc
import starsim as ss

kw = dict(n_agents=100e3, start=2000, end=2100, diseases='sir', plot=False)

with sc.cprofile(sort='selfpct', mintime=1e-1) as cpr:
    sim = ss.demo(**kw)

for me takes about 2.7 s on main and about 1.8 s on this branch.

Instead, each Arr (formerly known as State) carries around a copy of the People object, from which it can access aliveinds, which is used to filter the arrays by default (only if UIDs are not provided: if they are, they can be used without any extra filtering). When an agent dies, all of their properties are set to nan (see below), and their UID is removed from aliveinds.
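The default filtering described above can be sketched with a minimal stand-in class (MiniArr and its attribute names are illustrative, not the actual starsim implementation):

```python
import numpy as np

# Illustrative sketch: an array-like class that stores one slot per UID ever
# created in `raw`, and by default exposes only the values for currently-alive
# UIDs via an `auids` index array. Explicit UIDs skip the filtering.
class MiniArr:
    def __init__(self, raw, auids):
        self.raw = np.asarray(raw)      # one slot per UID ever created
        self.auids = np.asarray(auids)  # UIDs of agents still alive

    @property
    def values(self):
        # Default view: filter down to alive agents only
        return self.raw[self.auids]

    def __getitem__(self, uids):
        # Explicit UIDs index the raw array directly, no extra filtering
        return self.raw[uids]

raw = np.array([10.0, 20.0, 30.0, 40.0])
arr = MiniArr(raw, auids=[0, 2, 3])  # agent with UID 1 has died
print(arr.values)          # [10. 30. 40.]
print(arr[np.array([1])])  # [20.] -- dead agents stay addressable by UID
```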

What else has changed?

Arr has type-specific subclasses, like FloatArr and BoolArr. These classes have sensible defaults defined, so e.g. ss.State('ti_dead', int, ss.INT_NAN) can be replaced with ss.IntArr('ti_dead').

These classes also have a defined nan value (for float and int, though not for boolean). They also define only the operations appropriate to their type; e.g., on main:

>>> sim = ss.demo()
>>> sim.diseases.sir.ti_infected
      Quantity
UID           
0            1
1            2
2            1
3            2
4            2
...        ...
9995         2
9996         2
9997         2
9998         2
9999         0

[9806 rows x 1 columns]
>>> ~sim.diseases.sir.ti_infected # uh oh, bitwise inverse of an int
      Quantity
UID           
0           -2
1           -3
2           -2
3           -3
4           -3
...        ...
9995        -3
9996        -3
9997        -3
9998        -3
9999        -1

[9806 rows x 1 columns]

On this branch:

>>> sim = ss.demo()
>>> sim.diseases.sir.ti_infected
<IntArr "ti_infected", len=4064, [1 2 1 ... 2 2 0]>
>>> ~sim.diseases.sir.ti_infected
Traceback (most recent call last):

  Cell In[2], line 1
    ~sim.diseases.sir.ti_infected

  File ~/idm/starsim/starsim/states.py:119 in __invert__
    def __invert__(self):     raise BooleanOperationError(self)

BooleanOperationError: Logical operations are only valid on Boolean arrays, not <class 'numpy.int64'>
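The type-restricted operations can be sketched as follows (class names are made up; the real code raises a custom BooleanOperationError, while TypeError is used here to keep the sketch self-contained):

```python
import numpy as np

# Only the boolean subclass defines logical inversion; the base class raises
# instead of silently doing a bitwise invert on integers.
class SketchArr:
    def __init__(self, values):
        self.values = np.asarray(values)

    def __invert__(self):
        raise TypeError(f'Logical operations are only valid on Boolean arrays, not {self.values.dtype}')

class SketchBoolArr(SketchArr):
    def __invert__(self):
        return SketchBoolArr(~self.values)

ints = SketchArr([1, 2, 0])
bools = SketchBoolArr([True, False, True])
print((~bools).values)  # [False  True False]
try:
    ~ints
except TypeError as e:
    print(e)  # refuses the bitwise-invert-of-int footgun
```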

Outstanding tasks

Other than fixing the failing tests, a major outstanding question is how chaining should work. For performance and simplicity, an operation on an Arr returns a np.array. However, in some cases, it's necessary to do operations on an Arr before indexing it with uids. We could just do the operation on the whole array (rather than just aliveinds), but this is inefficient.

The solution is to define a people.remap_uids() function that will "expand" the sparse array (of length len(people.aliveinds)) back into the underlying array (of length max(people.uid)). This is pretty fast, and is actually only used in one place currently (make_new_cases), but is ugly.
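The expansion step can be sketched in plain NumPy (`expand` is a hypothetical helper, not the actual `people.remap_uids()` signature):

```python
import numpy as np

# Scatter a dense result (one entry per alive agent) back into the
# full-length underlying array (one slot per UID ever created).
def expand(dense, auids, n_total, fill=np.nan):
    full = np.full(n_total, fill)
    full[auids] = dense
    return full

auids = np.array([0, 2, 3])        # UIDs of alive agents
dense = np.array([1.0, 2.0, 3.0])  # one value per alive agent
print(expand(dense, auids, n_total=5))  # [ 1. nan  2.  3. nan]
```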

Closes #296, #304, #400, #407, #423

@cliffckerr cliffckerr requested a review from RomeshA April 22, 2024 22:54
RomeshA (Contributor) commented Apr 23, 2024:

@cliffckerr While going over the changes and trying them out with the HIV module we're currently implementing, I encountered some odd behaviour:

self.partners.values
Out[13]: array([-32767, -32767, -32767, ..., -32767, -32767, -32767], dtype=int64)
self.concurrency.values
Out[14]: array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
self.partners.values < self.concurrency.values
Out[15]: array([ True,  True,  True, ...,  True,  True,  True])
self.partners < self.concurrency
Out[16]: <BoolArr "partners", len=10001, [False False False ... False False False]>

RomeshA (Contributor) left a comment:

I haven't had a chance to test the performance with HIV yet, including some of those indexing edge cases (and actually converting over existing modules requires a fair bit of work...!), but the interface is broadly looking good. Some edge cases are noted in specific comments.


INT_NAN = 2147483647 # From np.iinfo(np.int32).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
intnan = -32767 # From np.iinfo(np.int16).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
Contributor:

Do we have a style guide for these kinds of constants with respect to capitalization?

Contributor Author:

I don't think a formal one; I used lowercase to be consistent with np.nan. In general, Numpy doesn't use uppercase for constants (e.g. np.pi), nor does SciPy.

For internal-use constants (e.g. np.lib.format.BUFFER_SIZE), I agree we should use the convention of uppercase.

Contributor Author:

Removed in latest revision


INT_NAN = 2147483647 # From np.iinfo(np.int32).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
intnan = -32767 # From np.iinfo(np.int16).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
Contributor:

Using a negative value here can be problematic for the pattern where we have a time index (like ti_dead) and at each timestep we check which agents now meet the time criterion with something like self.ti_dead <= sim.ti. A negative value always meets that criterion, so this would need some different logic.
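The concern can be demonstrated in plain NumPy:

```python
import numpy as np

# Why a negative integer "NaN" sentinel is risky for time-index checks: the
# sentinel always satisfies `<= ti`, so an unfiltered comparison selects
# agents that were never scheduled to die.
intnan = -32767
ti_dead = np.array([intnan, 3, intnan, 7])
ti = 5
print(ti_dead <= ti)  # [ True  True  True False] -- sentinels wrongly match

# Filtering the sentinel entries out first fixes this
valid = ti_dead != intnan
print(valid & (ti_dead <= ti))  # [False  True False False]
```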

Contributor Author:

self.ti_dead <= sim.ti still works because of the self.notnan() call. It won't work to do self.ti_dead.raw <= sim.ti, though (see comment below).

I agree that for int arrays, which tend to be time indices, we use < much more often than >. I see two conflicting arguments here:

  1. We should use a very large value to represent NaN, so if a mistake is made and it isn't getting filtered out as it should, it isn't likely to matter.
  2. We should use a negative value to represent NaN, so mistakes are less likely to go unnoticed and thus get fixed.

Ironically I feel I'm usually on the side of (1) and you're usually on the side of (2), but it seems we're swapped here :)

My personal reason for preferring a negative value is perhaps no longer relevant, but it's that a few times when I was looking at the arrays (especially derived quantities like self.ti_dead.mean()), a few large positive values mixed in with many "correct" values didn't immediately jump out to me as an error. To me negative values more clearly indicate that something's not right (and I do like @robynstuart's suggestion of just using -1, but that feels restrictive for a general IntArr class).

I had a look at pandas' nullable integer series -- it looks like exactly what we need, but I think performance is considerably worse than plain Numpy arrays.

Contributor:

I don't think we should rely too much on the notnan() call, because indexing the IntArr turns it back into a normal array, right? So it's not just .raw: operating on a collection of UIDs (like self.ti_dead[uids] <= sim.ti) would behave the same way, right?

My thinking originally was that just writing <= sim.ti would be the intended use case and not a mistake, hence it should 'just work'. Aside from that, it wasn't obvious to me whether a large positive or a large negative value would be more noticeable as a mistake in general; that's probably context dependent, although I'd agree that in typical ti_* usages a negative value would be more noticeable. The magnitude also matters: I suspect -32767 might be too small. If the user is taking a sum or average, it might not be big enough to outweigh valid entries, and the aggregate might still look OK. So whichever sign we go with, a larger magnitude would decrease (but not eliminate) the chances of that happening.
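A quick NumPy illustration of the magnitude point:

```python
import numpy as np

# One small-magnitude sentinel among many valid entries barely shifts an
# aggregate, so the mistake can slip past a sanity check.
ti_vals = np.full(10000, 50.0)  # 10,000 agents with plausible time indices
ti_vals[0] = -32767             # a single unfiltered sentinel "NaN"
print(ti_vals.mean())           # 46.7183 -- still looks plausible at a glance
```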

Contributor Author:

Agree, removed IntArr entirely so we can just use np.nan
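A quick NumPy illustration of why float storage with np.nan sidesteps the sentinel problem entirely:

```python
import numpy as np

# Comparisons with NaN are always False, so unfiltered time-index checks do
# the right thing with no sentinel logic at all.
ti_dead = np.array([np.nan, 3.0, np.nan, 7.0])
ti = 5
print(ti_dead <= ti)      # [False  True False False] -- NaN never matches
print(np.isnan(ti_dead))  # [ True False  True False] -- explicit "unset" test
```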

@@ -71,7 +71,7 @@ def __init__(self, pars=None, key_dict=None, vertical=False, *args, **kwargs):
p2 = ss_int_,
Contributor:

It does kind of feel odd that there is a 'UID type' for the purpose of indexing (with ss.uids) but there is no corresponding scalar type. How would it look if there was a UID type inheriting from ss_int_ and then instead of ss.uids being a subclass of np.ndarray it could just cast an array of int type to an array of uid type?

Contributor Author:

You can use a single scalar value:

import starsim as ss
sim = ss.demo(plot=False)
sim.people.age[ss.uids(9999)]
126.41934324595512 # need to fix the age bug lol

We could implement it this way, I prefer having a ss.uids class to support other methods, e.g. I like being able to do self.auids = self.auids.remove(uids). But there is an annoyance here, namely that because ss.uids is implemented as a view of an array, you can't modify the array in-place (because it doesn't own its own data, or there are multiple references to the data). If we could get around this by subclassing int instead, that might be preferable!

Contributor:

Could self.auids.remove(uids) be implemented as Arr.remove() (and same for intersect) to provide similar convenience? I think having the notion of a 'UID' being a thing (via subclassing int) will be clearer than having the UID-ness be a property of a collection of ints

Contributor Author:

I think we could implement it this way, but I prefer the current implementation for a couple reasons:

  • It matches how pd.Series is implemented, e.g.:
import pandas as pd
s = pd.Series([1,2,None,4], dtype='Int64')
type(s[0])
Out[10]: numpy.int64 # but this is not the Series dtype, which is pd.types.Int64
  • This way it's easy to check the type (isinstance(key, ss.uids)), whereas otherwise it gets complicated. In particular, Numpy seems to jam int-like things into np.int64, e.g.
import numpy as np
class UID(np.int64): pass
a = np.arange(5, dtype=UID)
assert a.dtype == UID # passes
assert a.dtype != np.int64 # fails
AssertionError
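The view-based approach can be sketched as follows (Uids here is a simplified illustrative stand-in for ss.uids, not the actual implementation):

```python
import numpy as np

# A subclass of np.ndarray used as a type tag for indexing, with convenience
# methods attached; view-casting means no data copy on construction.
class Uids(np.ndarray):
    def __new__(cls, arr):
        # Cast any int-like input to int64, then view as the subclass
        return np.asarray(arr, dtype=np.int64).view(cls)

    def remove(self, other):
        # Return a new Uids with the given UIDs removed
        return Uids(np.setdiff1d(np.asarray(self), np.asarray(other)))

u = Uids([0, 1, 2, 3, 4])
print(isinstance(u, Uids))  # True -- the type check stays simple
print(u.remove([1, 3]))     # [0 2 4]
print(u.dtype)              # int64 -- element dtype is plain int64
```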

starsim/states.py: three review threads (outdated, resolved)
def __invert__(self): return self.asnew(~self.values)

@property
def uids(self):
Contributor:

I like having a method to do this rather than ss.true but can we call the method true instead of uids and have a corresponding false method as well? That would also then allow auids to be renamed uids

Contributor Author:

The rationale for this implementation is:

  • Arr.auids maps directly to sim.people.auids (which is an array of active UID indices), which we need to distinguish from sim.people.uid (which is an Arr containing all UIDs ever created)
  • I thought it could get confusing which methods return UIDs/inds and which return booleans, which is why I thought .uids clarified that it was returning a list of UIDs in a way that true() wouldn't
  • The ufuncs should be working, so arr.false() would here be ~arr.uids. I like this better because it clarifies that you can invert an Arr and use it as an Arr (~arr), or you can also just pull out the False UIDs (~arr.uids). With false(), it always just returned the UIDs rather than giving this option.
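The distinction can be sketched with a minimal stand-in (MiniBoolArr is illustrative, not the actual BoolArr implementation):

```python
import numpy as np

# Inverting returns another boolean array-like (usable as an Arr), while the
# .uids property extracts the UIDs where the array is True.
class MiniBoolArr:
    def __init__(self, values, auids):
        self.values = np.asarray(values)
        self.auids = np.asarray(auids)

    def __invert__(self):
        return MiniBoolArr(~self.values, self.auids)

    @property
    def uids(self):
        # UIDs of alive agents for which this array is True
        return self.auids[self.values]

infected = MiniBoolArr([True, False, True], auids=[0, 2, 3])
print(infected.uids)     # [0 3] -- like arr.uids
print((~infected).uids)  # [2]   -- like (~arr).uids, i.e. the False UIDs
```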

Contributor:

I think I was coming at it with the notion that an Arr corresponded to a subset of the UIDs, and then the ambiguity as to whether BoolArr.uids was the UIDs contained in the Arr, or the UIDs for which it was true. I was expecting uids to be the former, although maybe that's just something that people need to learn (to use auids for that).

But I wonder if that's making too much of Arr.auids mapping directly to sim.people.auids - it would be cool eventually to be able to filter the People in different ways by having a different mask on the Arr instance, in which case Arr.auids might not match sim.people.auids. That's the scenario I was thinking of in having Arr.uids mean 'the UIDs that appear in this Arr' and Arr.true/Arr.false clearly referring to the true/false entries.

I don't think having true and false return the UIDs would be too confusing, though, since that's what cv.true() and cv.false() did (and what ss.true() and ss.false() currently do)?

key = key.uids

if isinstance(key, (int, uids, slice)):
self.raw[key] = value
Contributor:

Same issue as in __getitem__ raising an error with an empty array (ss.uids([])). This was happening a bit when trying to convert the HIV code over where we unconditionally read/write from the arrays with a collection of UIDs but if nobody is sampled on that timestep then that collection is empty.
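The underlying NumPy behaviour behind this edge case:

```python
import numpy as np

# Indexing with an empty array works only if the index array has an integer
# dtype; a bare np.array([]) defaults to float64 and raises. A uids-style
# class should therefore cast to int on construction.
x = np.arange(5.0)

empty_int = np.array([], dtype=np.int64)
print(x[empty_int])  # [] -- reading with no UIDs is a harmless no-op
x[empty_int] = 99    # writing with no UIDs is also a no-op
print(x)             # [0. 1. 2. 3. 4.]

try:
    x[np.array([])]  # float64 empty index
except IndexError as e:
    print('raises:', e)
```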

Contributor Author:

Fixed

if isinstance(key, ss.BoolArr):
key = key.uids

if isinstance(key, (int, uids, slice)):
Contributor:

I was wondering about this not being symmetric with __getitem__ with respect to slices: here, the slice acts on .raw, whereas in __getitem__ the slice acts on .values. This can lead to some weird-looking behaviour once some of the agents have died, e.g.:

sim.people.age[0:5]
Out[14]: array([22.08333333, 42.08333333, 64.08333333, 28.08333333, 33.08333333])
sim.people.age[0:5] = 1
sim.people.age[0:5]
Out[16]: array([ 1.        ,  1.        ,  1.        , 28.08333333, 33.08333333])
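The asymmetry, and one consistent alternative, sketched in plain NumPy:

```python
import numpy as np

# If reads slice the filtered `values` but writes slice the underlying `raw`,
# the same slice touches different agents once anyone has died.
raw = np.array([22.0, 42.0, 64.0, 28.0, 33.0, 50.0])
auids = np.array([0, 2, 3, 4, 5])  # agent with UID 1 has died

values = raw[auids]
print(values[0:3])      # [22. 64. 28.] -- first three *alive* agents

raw[0:3] = 1.0          # slicing raw instead touches the dead agent's slot
print(raw[auids][0:3])  # [ 1.  1. 28.] -- not the three agents read above

# Symmetric behaviour: apply the slice to auids first, then index raw
raw2 = np.array([22.0, 42.0, 64.0, 28.0, 33.0, 50.0])
raw2[auids[0:3]] = 1.0
print(raw2[auids][0:3])  # [1. 1. 1.]
```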

Contributor Author:

Fixed

elif idx >= len(vals):
raise Exception('Attempted to write to a non-existant index - this can happen if attempting to write to new entries that have not yet been allocated via grow()')
vals[idx] = value

def __getitem__(self, key):
Contributor:

I was expecting that this would work

sim.people.ti_dead[sim.people.uid]
Traceback (most recent call last):
  File "C:\Users\romesh\miniconda3\envs\atomica312\Lib\site-packages\IPython\core\interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-29-e17a9156e088>", line 1, in <module>
    sim.people.ti_dead[sim.people.uid]
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\Users\romesh\projects\idm\starsim\starsim\states.py", line 97, in __getitem__
    return self.values[key]
           ~~~~~~~~~~~^^^^^
IndexError: index 11794 is out of bounds for axis 0 with size 11793

but since IndexArr is not a uids instance, it does something different? If you put in a scalar UID type, it could be used here to have IndexArr also do lookups by UID.

Contributor Author:

Fixed

cliffckerr (Contributor Author), quoting RomeshA's comment above:
@cliffckerr While going over the changes and trying them out with the HIV module we're currently implementing, I encountered some odd behaviour:

self.partners.values
Out[13]: array([-32767, -32767, -32767, ..., -32767, -32767, -32767], dtype=int64)
self.concurrency.values
Out[14]: array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
self.partners.values < self.concurrency.values
Out[15]: array([ True,  True,  True, ...,  True,  True,  True])
self.partners < self.concurrency
Out[16]: <BoolArr "partners", len=10001, [False False False ... False False False]>

I think this should be fixed now, but since I'm not sure how to reproduce it, I can't be 100% certain!

@cliffckerr cliffckerr mentioned this pull request Apr 27, 2024
self.screens = ss.State('screens', int, 0)
self.ti_screened = ss.State('ti_screened', int, ss.INT_NAN)
self.screened = ss.BoolArr('screened')
self.screens = ss.FloatArr('screens', default=0)
Contributor:

Seems like screens should be an IntArr?

Contributor Author:

Discussed on the call -- no more IntArr due to np.nan support.

self.ti_screened = ss.State('ti_screened', int, ss.INT_NAN)
self.screened = ss.BoolArr('screened')
self.screens = ss.FloatArr('screens', default=0)
self.ti_screened = ss.FloatArr('ti_screened')
Contributor:

Also, I would expect that the ti_* states are IntArr as well.

Contributor Author:

Discussed on the call -- no more IntArr due to np.nan support.

loc = self.pars.embedding_func.rvs(available)
if np.isnan(loc).sum():
raise Exception('WARNING, should not happen!')
loc = sc.rmnans(loc, replacenans=9999)
Contributor:

Looks like temporary code?

Contributor Author:

Removed

p1 = available_m[ind_m]
p2 = available_f[ind_f]
beta = np.ones(n_pairs) # TODO: fix
Contributor:

What remains TODO / fix here?

Contributor Author:

Updated the comment. It would be nice to eventually have the ability to set beta for new connections. As it stands, this is confusing with beta-change interventions, which must be reapplied at each timestep to modify new connections, rather than there being a way to just set the default beta for new connections. I'm imagining something like beta = np.full(n_pairs, self.beta).

@kaa2102 kaa2102 merged commit 69a5238 into main May 1, 2024
3 checks passed
@cliffckerr cliffckerr deleted the rethink-uidarray branch May 2, 2024 22:23

Successfully merging this pull request may close these issues.

Check that dead-but-not-removed agents are being handled sensibly
5 participants