
Rethink UIDArray #456

Merged
92 commits merged into main on May 1, 2024

Conversation

cliffckerr (Contributor) commented Apr 13, 2024:

Update 2024-04-23: This PR is now ready to review + merge!

Replaces UIDArray, ArrayView, and State with Arr.

Dead agents

Dead agents are no longer removed, so uid always corresponds to the position in the array. This means that no remapping is necessary, which has a significant performance benefit; running this script:

import sciris as sc
import starsim as ss

kw = dict(n_agents=100e3, start=2000, end=2100, diseases='sir', plot=False)

with sc.cprofile(sort='selfpct', mintime=1e-1) as cpr:
    sim = ss.demo(**kw)

for me takes about 2.7 s on main and about 1.8 s on this branch.

Instead, each Arr (formerly known as State) carries around a copy of the People object, from which it can access aliveinds, which is used to filter the arrays by default (only if UIDs are not provided: if they are, they can be used without any extra filtering). When an agent dies, all of their properties are set to nan (see below), and their UID is removed from aliveinds.
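The default filtering described above can be sketched with a minimal stand-in class (MiniArr and its attribute names are illustrative, not the actual starsim implementation):

```python
import numpy as np

# Illustrative sketch: an array-like class that stores one slot per UID ever
# created in `raw`, and by default exposes only the values for currently-alive
# UIDs via an `auids` index array. Explicit UIDs skip the filtering.
class MiniArr:
    def __init__(self, raw, auids):
        self.raw = np.asarray(raw)      # one slot per UID ever created
        self.auids = np.asarray(auids)  # UIDs of agents still alive

    @property
    def values(self):
        # Default view: filter down to alive agents only
        return self.raw[self.auids]

    def __getitem__(self, uids):
        # Explicit UIDs index the raw array directly, no extra filtering
        return self.raw[uids]

raw = np.array([10.0, 20.0, 30.0, 40.0])
arr = MiniArr(raw, auids=[0, 2, 3])  # agent with UID 1 has died
print(arr.values)          # [10. 30. 40.]
print(arr[np.array([1])])  # [20.] -- dead agents stay addressable by UID
```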

What else has changed?

Arr has type-specific subclasses, like FloatArr and BoolArr. These classes have sensible defaults defined, so e.g. ss.State('ti_dead', int, ss.INT_NAN) can be replaced with ss.IntArr('ti_dead').

These classes also have a defined nan value (for float and int, though not for boolean). They also define only the operations appropriate to their type; e.g., on main:

>>> sim = ss.demo()
>>> sim.diseases.sir.ti_infected
      Quantity
UID           
0            1
1            2
2            1
3            2
4            2
...        ...
9995         2
9996         2
9997         2
9998         2
9999         0

[9806 rows x 1 columns]
>>> ~sim.diseases.sir.ti_infected # uh oh, bitwise inverse of an int
      Quantity
UID           
0           -2
1           -3
2           -2
3           -3
4           -3
...        ...
9995        -3
9996        -3
9997        -3
9998        -3
9999        -1

[9806 rows x 1 columns]

On this branch:

>>> sim = ss.demo()
>>> sim.diseases.sir.ti_infected
<IntArr "ti_infected", len=4064, [1 2 1 ... 2 2 0]>
>>> ~sim.diseases.sir.ti_infected
Traceback (most recent call last):

  Cell In[2], line 1
    ~sim.diseases.sir.ti_infected

  File ~/idm/starsim/starsim/states.py:119 in __invert__
    def __invert__(self):     raise BooleanOperationError(self)

BooleanOperationError: Logical operations are only valid on Boolean arrays, not <class 'numpy.int64'>
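The type-restricted operations can be sketched as follows (class names are made up; the real code raises a custom BooleanOperationError, while TypeError is used here to keep the sketch self-contained):

```python
import numpy as np

# Only the boolean subclass defines logical inversion; the base class raises
# instead of silently doing a bitwise invert on integers.
class SketchArr:
    def __init__(self, values):
        self.values = np.asarray(values)

    def __invert__(self):
        raise TypeError(f'Logical operations are only valid on Boolean arrays, not {self.values.dtype}')

class SketchBoolArr(SketchArr):
    def __invert__(self):
        return SketchBoolArr(~self.values)

ints = SketchArr([1, 2, 0])
bools = SketchBoolArr([True, False, True])
print((~bools).values)  # [False  True False]
try:
    ~ints
except TypeError as e:
    print(e)  # refuses the bitwise-invert-of-int footgun
```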

Outstanding tasks

Other than fixing the failing tests, a major outstanding question is how chaining should work. For performance and simplicity, an operation on an Arr returns a np.array. However, in some cases, it's necessary to do operations on an Arr before indexing it with uids. We could just do the operation on the whole array (rather than just aliveinds), but this is inefficient.

The solution is to define a people.remap_uids() function that will "expand" the sparse array (of length len(people.aliveinds)) back into the underlying array (of length max(people.uid)). This is pretty fast, and is actually only used in one place currently (make_new_cases), but is ugly.
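The expansion step can be sketched in plain NumPy (`expand` is a hypothetical helper, not the actual `people.remap_uids()` signature):

```python
import numpy as np

# Scatter a dense result (one entry per alive agent) back into the
# full-length underlying array (one slot per UID ever created).
def expand(dense, auids, n_total, fill=np.nan):
    full = np.full(n_total, fill)
    full[auids] = dense
    return full

auids = np.array([0, 2, 3])        # UIDs of alive agents
dense = np.array([1.0, 2.0, 3.0])  # one value per alive agent
print(expand(dense, auids, n_total=5))  # [ 1. nan  2.  3. nan]
```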

Closes #296, #304, #400, #407, #423

@cliffckerr cliffckerr requested a review from RomeshA April 22, 2024 22:54
RomeshA (Contributor) commented Apr 23, 2024:

@cliffckerr While going over the changes and trying them out with the HIV module we're currently implementing, I encountered some odd behaviour:

self.partners.values
Out[13]: array([-32767, -32767, -32767, ..., -32767, -32767, -32767], dtype=int64)
self.concurrency.values
Out[14]: array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
self.partners.values < self.concurrency.values
Out[15]: array([ True,  True,  True, ...,  True,  True,  True])
self.partners < self.concurrency
Out[16]: <BoolArr "partners", len=10001, [False False False ... False False False]>

RomeshA (Contributor) left a comment:

I haven't had a chance to test the performance with HIV yet, including some of those indexing edge cases (and actually converting over existing modules requires a fair bit of work...!), but the interface is broadly looking good. Some edge cases are noted in specific comments.


INT_NAN = 2147483647 # From np.iinfo(np.int32).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
intnan = -32767 # From np.iinfo(np.int16).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
Contributor:

Do we have a style guide for these kinds of constants with respect to capitalization?

Contributor Author:

I don't think a formal one; I used lowercase to be consistent with np.nan. In general, Numpy doesn't use uppercase for constants (e.g. np.pi), nor does SciPy.

For internal-use constants (e.g. np.lib.format.BUFFER_SIZE), I agree we should use the convention of uppercase.

Contributor Author:

Removed in latest revision


INT_NAN = 2147483647 # From np.iinfo(np.int32).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
intnan = -32767 # From np.iinfo(np.int16).max: value to use to flag invalid content (i.e., an integer value we are treating like NaN, since NaN can't be stored in an integer array)
Contributor:

Using a negative value here can be problematic for the pattern where we have a time index (like ti_dead) and at each timestep we check which agents now meet the time criterion with something like self.ti_dead <= sim.ti. A negative value always meets that criterion, so this would need some different logic.
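The concern can be demonstrated in plain NumPy:

```python
import numpy as np

# Why a negative integer "NaN" sentinel is risky for time-index checks: the
# sentinel always satisfies `<= ti`, so an unfiltered comparison selects
# agents that were never scheduled to die.
intnan = -32767
ti_dead = np.array([intnan, 3, intnan, 7])
ti = 5
print(ti_dead <= ti)  # [ True  True  True False] -- sentinels wrongly match

# Filtering the sentinel entries out first fixes this
valid = ti_dead != intnan
print(valid & (ti_dead <= ti))  # [False  True False False]
```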

Contributor Author:

self.ti_dead <= sim.ti still works because of the self.notnan() call. It won't work to do self.ti_dead.raw <= sim.ti, though (see comment below).

I agree that for int arrays, which tend to be time indices, we use < much more often than >. I see two conflicting arguments here:

  1. We should use a very large value to represent NaN, so if a mistake is made and it isn't getting filtered out as it should, it isn't likely to matter.
  2. We should use a negative value to represent NaN, so mistakes are less likely to go unnoticed and thus get fixed.

Ironically I feel I'm usually on the side of (1) and you're usually on the side of (2), but it seems we're swapped here :)

My personal reason for preferring a negative value is perhaps no longer relevant, but it's that a few times when I was looking at the arrays (especially derived quantities like self.ti_dead.mean()), a few large positive values mixed in with many "correct" values didn't immediately jump out to me as an error. To me negative values more clearly indicate that something's not right (and I do like @robynstuart's suggestion of just using -1, but that feels restrictive for a general IntArr class).

I had a look at pandas' nullable integer series -- it looks like exactly what we need, but I think performance is considerably worse than plain Numpy arrays.

Contributor:

I don't think we should rely too much on the notnan() call, because indexing the IntArr turns it back into a normal array, right? So it's not just .raw: operating on a collection of UIDs (like self.ti_dead[uids] <= sim.ti) would behave the same way, right?

My thinking originally was that just writing <= sim.ti would be the intended use case and not a mistake, hence it should 'just work'. Aside from that, it wasn't obvious to me whether a large positive or a large negative value would be more noticeable as a mistake in general; that's probably context dependent, although I'd agree that in typical ti_* usages a negative value would be more noticeable. The magnitude also matters: I suspect -32767 might be too small. If the user is taking a sum or average, it might not be big enough to outweigh valid entries, and the aggregate might still look OK. So whichever sign we go with, a larger magnitude would decrease (but not eliminate) the chances of that happening.
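A quick NumPy illustration of the magnitude point:

```python
import numpy as np

# One small-magnitude sentinel among many valid entries barely shifts an
# aggregate, so the mistake can slip past a sanity check.
ti_vals = np.full(10000, 50.0)  # 10,000 agents with plausible time indices
ti_vals[0] = -32767             # a single unfiltered sentinel "NaN"
print(ti_vals.mean())           # 46.7183 -- still looks plausible at a glance
```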

Contributor Author:

Agree, removed IntArr entirely so we can just use np.nan
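A quick NumPy illustration of why float storage with np.nan sidesteps the sentinel problem entirely:

```python
import numpy as np

# Comparisons with NaN are always False, so unfiltered time-index checks do
# the right thing with no sentinel logic at all.
ti_dead = np.array([np.nan, 3.0, np.nan, 7.0])
ti = 5
print(ti_dead <= ti)      # [False  True False False] -- NaN never matches
print(np.isnan(ti_dead))  # [ True False  True False] -- explicit "unset" test
```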

@@ -71,7 +71,7 @@ def __init__(self, pars=None, key_dict=None, vertical=False, *args, **kwargs):
p2 = ss_int_,
Contributor:

It does kind of feel odd that there is a 'UID type' for the purpose of indexing (with ss.uids) but there is no corresponding scalar type. How would it look if there was a UID type inheriting from ss_int_ and then instead of ss.uids being a subclass of np.ndarray it could just cast an array of int type to an array of uid type?

Contributor Author:

You can use a single scalar value:

import starsim as ss
sim = ss.demo(plot=False)
sim.people.age[ss.uids(9999)]
126.41934324595512 # need to fix the age bug lol

We could implement it this way, I prefer having a ss.uids class to support other methods, e.g. I like being able to do self.auids = self.auids.remove(uids). But there is an annoyance here, namely that because ss.uids is implemented as a view of an array, you can't modify the array in-place (because it doesn't own its own data, or there are multiple references to the data). If we could get around this by subclassing int instead, that might be preferable!

Contributor:

Could self.auids.remove(uids) be implemented as Arr.remove() (and same for intersect) to provide similar convenience? I think having the notion of a 'UID' being a thing (via subclassing int) will be clearer than having the UID-ness be a property of a collection of ints

Contributor Author:

I think we could implement it this way, but I prefer the current implementation for a couple reasons:

  • It matches how pd.Series is implemented, e.g.:
import pandas as pd
s = pd.Series([1,2,None,4], dtype='Int64')
type(s[0])
Out[10]: numpy.int64 # but this is not the Series dtype, which is pd.types.Int64
  • This way it's easy to check the type (isinstance(key, ss.uids)), whereas otherwise it gets complicated. In particular, Numpy seems to jam int-like things into np.int64, e.g.
import numpy as np
class UID(np.int64): pass
a = np.arange(5, dtype=UID)
assert a.dtype == UID # passes
assert a.dtype != np.int64 # fails
AssertionError
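The view-based approach can be sketched as follows (Uids here is a simplified illustrative stand-in for ss.uids, not the actual implementation):

```python
import numpy as np

# A subclass of np.ndarray used as a type tag for indexing, with convenience
# methods attached; view-casting means no data copy on construction.
class Uids(np.ndarray):
    def __new__(cls, arr):
        # Cast any int-like input to int64, then view as the subclass
        return np.asarray(arr, dtype=np.int64).view(cls)

    def remove(self, other):
        # Return a new Uids with the given UIDs removed
        return Uids(np.setdiff1d(np.asarray(self), np.asarray(other)))

u = Uids([0, 1, 2, 3, 4])
print(isinstance(u, Uids))  # True -- the type check stays simple
print(u.remove([1, 3]))     # [0 2 4]
print(u.dtype)              # int64 -- element dtype is plain int64
```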

starsim/states.py: three review threads (outdated, resolved)
def __invert__(self): return self.asnew(~self.values)

@property
def uids(self):
Contributor:

I like having a method to do this rather than ss.true but can we call the method true instead of uids and have a corresponding false method as well? That would also then allow auids to be renamed uids

Contributor Author:

The rationale for this implementation is:

  • Arr.auids maps directly to sim.people.auids (which is an array of active UID indices), which we need to distinguish from sim.people.uid (which is an Arr containing all UIDs ever created)
  • I thought it could get confusing which methods return UIDs/inds and which return booleans, which is why I thought .uids clarified that it was returning a list of UIDs in a way that true() wouldn't
  • The ufuncs should be working, so arr.false() would here be ~arr.uids. I like this better because it clarifies that you can invert an Arr and use it as an Arr (~arr), or you can also just pull out the False UIDs (~arr.uids). With false(), it always just returned the UIDs rather than giving this option.
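The distinction can be sketched with a minimal stand-in (MiniBoolArr is illustrative, not the actual BoolArr implementation):

```python
import numpy as np

# Inverting returns another boolean array-like (usable as an Arr), while the
# .uids property extracts the UIDs where the array is True.
class MiniBoolArr:
    def __init__(self, values, auids):
        self.values = np.asarray(values)
        self.auids = np.asarray(auids)

    def __invert__(self):
        return MiniBoolArr(~self.values, self.auids)

    @property
    def uids(self):
        # UIDs of alive agents for which this array is True
        return self.auids[self.values]

infected = MiniBoolArr([True, False, True], auids=[0, 2, 3])
print(infected.uids)     # [0 3] -- like arr.uids
print((~infected).uids)  # [2]   -- like (~arr).uids, i.e. the False UIDs
```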

Contributor:

I think I was coming at it with the notion that an Arr corresponded to a subset of the UIDs, and then the ambiguity as to whether BoolArr.uids was the UIDs contained in the Arr, or the UIDs for which it was true. I was expecting uids to be the former, although maybe that's just something that people need to learn (to use auids for that).

But I wonder if that's making too much of Arr.auids mapping directly to sim.people.auids - it would be cool eventually to be able to filter the People in different ways by having a different mask on the Arr instance, in which case Arr.auids might not match sim.people.auids. That's the scenario I was thinking of in having Arr.uids mean 'the UIDs that appear in this Arr' and Arr.true/Arr.false clearly referring to the true/false entries.

I don't think having true and false return the UIDs would be too confusing, though, since that's what cv.true() and cv.false() did (and what ss.true() and ss.false() currently do)?

key = key.uids

if isinstance(key, (int, uids, slice)):
self.raw[key] = value
Contributor:

Same issue as in __getitem__ raising an error with an empty array (ss.uids([])). This was happening a bit when trying to convert the HIV code over where we unconditionally read/write from the arrays with a collection of UIDs but if nobody is sampled on that timestep then that collection is empty.
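The underlying NumPy behaviour behind this edge case:

```python
import numpy as np

# Indexing with an empty array works only if the index array has an integer
# dtype; a bare np.array([]) defaults to float64 and raises. A uids-style
# class should therefore cast to int on construction.
x = np.arange(5.0)

empty_int = np.array([], dtype=np.int64)
print(x[empty_int])  # [] -- reading with no UIDs is a harmless no-op
x[empty_int] = 99    # writing with no UIDs is also a no-op
print(x)             # [0. 1. 2. 3. 4.]

try:
    x[np.array([])]  # float64 empty index
except IndexError as e:
    print('raises:', e)
```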

Contributor Author:

Fixed

if isinstance(key, ss.BoolArr):
key = key.uids

if isinstance(key, (int, uids, slice)):
Contributor:

I was wondering about this not being symmetric with __getitem__ with respect to slices: here, the slice acts on .raw, whereas in __getitem__ the slice acts on .values. This can lead to some weird-looking behaviour once some of the agents have died, e.g.:

sim.people.age[0:5]
Out[14]: array([22.08333333, 42.08333333, 64.08333333, 28.08333333, 33.08333333])
sim.people.age[0:5] = 1
sim.people.age[0:5]
Out[16]: array([ 1.        ,  1.        ,  1.        , 28.08333333, 33.08333333])
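The asymmetry, and one consistent alternative, sketched in plain NumPy:

```python
import numpy as np

# If reads slice the filtered `values` but writes slice the underlying `raw`,
# the same slice touches different agents once anyone has died.
raw = np.array([22.0, 42.0, 64.0, 28.0, 33.0, 50.0])
auids = np.array([0, 2, 3, 4, 5])  # agent with UID 1 has died

values = raw[auids]
print(values[0:3])      # [22. 64. 28.] -- first three *alive* agents

raw[0:3] = 1.0          # slicing raw instead touches the dead agent's slot
print(raw[auids][0:3])  # [ 1.  1. 28.] -- not the three agents read above

# Symmetric behaviour: apply the slice to auids first, then index raw
raw2 = np.array([22.0, 42.0, 64.0, 28.0, 33.0, 50.0])
raw2[auids[0:3]] = 1.0
print(raw2[auids][0:3])  # [1. 1. 1.]
```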

Contributor Author:

Fixed

elif idx >= len(vals):
raise Exception('Attempted to write to a non-existant index - this can happen if attempting to write to new entries that have not yet been allocated via grow()')
vals[idx] = value

def __getitem__(self, key):
Contributor:

I was expecting that this would work

sim.people.ti_dead[sim.people.uid]
Traceback (most recent call last):
  File "C:\Users\romesh\miniconda3\envs\atomica312\Lib\site-packages\IPython\core\interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-29-e17a9156e088>", line 1, in <module>
    sim.people.ti_dead[sim.people.uid]
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\Users\romesh\projects\idm\starsim\starsim\states.py", line 97, in __getitem__
    return self.values[key]
           ~~~~~~~~~~~^^^^^
IndexError: index 11794 is out of bounds for axis 0 with size 11793

but since IndexArr is not a uids instance, it does something different? If you put in a scalar UID type, it could be used here to have IndexArr also do lookups by UID.

Contributor Author:

Fixed

cliffckerr (Contributor Author), quoting RomeshA's comment above:
@cliffckerr While going over the changes and trying them out with the HIV module we're currently implementing, I encountered some odd behaviour:

self.partners.values
Out[13]: array([-32767, -32767, -32767, ..., -32767, -32767, -32767], dtype=int64)
self.concurrency.values
Out[14]: array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
self.partners.values < self.concurrency.values
Out[15]: array([ True,  True,  True, ...,  True,  True,  True])
self.partners < self.concurrency
Out[16]: <BoolArr "partners", len=10001, [False False False ... False False False]>

I think this should be fixed now, but since I'm not sure how to reproduce it, I can't be 100% certain!

@cliffckerr cliffckerr mentioned this pull request Apr 27, 2024
self.screens = ss.State('screens', int, 0)
self.ti_screened = ss.State('ti_screened', int, ss.INT_NAN)
self.screened = ss.BoolArr('screened')
self.screens = ss.FloatArr('screens', default=0)
Contributor:

Seems like screens should be an IntArr?

Contributor Author:

Discussed on the call -- no more IntArr due to np.nan support.

self.ti_screened = ss.State('ti_screened', int, ss.INT_NAN)
self.screened = ss.BoolArr('screened')
self.screens = ss.FloatArr('screens', default=0)
self.ti_screened = ss.FloatArr('ti_screened')
Contributor:

Also, I would expect that the ti_* states are IntArr as well.

Contributor Author:

Discussed on the call -- no more IntArr due to np.nan support.

loc = self.pars.embedding_func.rvs(available)
if np.isnan(loc).sum():
raise Exception('WARNING, should not happen!')
loc = sc.rmnans(loc, replacenans=9999)
Contributor:

Looks like temporary code?

Contributor Author:

Removed

p1 = available_m[ind_m]
p2 = available_f[ind_f]
beta = np.ones(n_pairs) # TODO: fix
Contributor:

What remains TODO / fix here?

Contributor Author:

Updated the comment. It would be nice to eventually have the ability to set beta for new connections. As it stands, this is confusing with beta-change interventions, which must be reapplied at each timestep to modify new connections, rather than there being a way to just set the default beta for new connections. I'm imagining something like beta = np.full(n_pairs, self.beta).

@kaa2102 kaa2102 merged commit 69a5238 into main May 1, 2024
3 checks passed
@cliffckerr cliffckerr deleted the rethink-uidarray branch May 2, 2024 22:23

Successfully merging this pull request may close these issues.

Check that dead-but-not-removed agents are being handled sensibly
5 participants