Remove iter protocol from TableCollection #694

jeromekelleher · 2020-06-29T16:27:53Z

Closes #500

This removes the __iter__ protocol from TableCollection and replaces it with two properties on TableCollection and TreeSequence which do the same thing, more directly. This was done for a couple of reasons:

Implementing __iter__ without also implementing __getitem__, __len__ etc isn't a great practise. We will probably want to implement __getitem__ at some point, and it would probably not be compatible with this definition of iter, so we don't want to be constrained by that.
A list of (key, value) tuples is more naturally a dict, so it makes sense to be explicity about this.

The behaviour was undocumented, so I don't think we're doing to break much code with this change.

The names of the new attributes could probably be improved, so open to ideas there. You could also argue that they should be methods rather than properties - I don't have a strong opinion here.

@hyanwong, I think you're the main user of this - is it going to affect you much?

hyanwong · 2020-06-29T17:05:32Z

I agree that a dict is more natural if we want to return names and tables, and it's trivial to iterate over the dict, especially now that order is guaranteed (as of python 3.7). I don't think this should affect my code, and if it does, it should be easy to change. But in fact, when I iterate over all the tables, I usually don't actually need the names: when I want to check the type of table, I just check whether the class is e.g. tskit.ProvenanceTable, which seems less fragile than doing a string comparison. So I'm wondering if we should have a method that returns the tables in an array or list instead, and simply drop the table names (but perhaps this is the point of an iterator!)?

OTOH it is useful to have a function which returns the names of the tables, and the name_map method doubles up for that purpose.

Re naming, it does seem weird to have the tree sequence method called "tables_dict" and the tables method to be "name_map". I see that we already have an asdict() method that serialises each table into a dictionary, so we can't use that.

hyanwong · 2020-06-29T17:11:36Z

Sorry - following my thought above to the natural conclusion. Why not just implement __getitem__, and then we don't need to support the additional name_map function? Is it terribly hard to think of what we want __getitem__ to return? Surely just individual tables in the collection? The only other think that's stored is the sequence_length.

jeromekelleher · 2020-06-29T17:12:35Z

So I'm wondering if we should have a method that returns the tables in an array or list instead, and simply drop the table names (but perhaps this is the point of an iterator!)?

tables.name_map.values() does this pretty well

Re naming, it does seem weird to have the tree sequence method called "tables_dict" and the tables method to be "name_map". I see that we already have an asdict() method that serialises each table into a dictionary, so we can't use that.

I started out with them both being called tables_dict, but it looked weird to write self.tables.tables_dict. I think the name_map conveys the point pretty well, but tables_dict less so. TreeSequence.name_map doesn't seem right to me though. I guess we could drop the TreeSequence form, it's not that much of a chore to write ts.tables.name_map?

jeromekelleher · 2020-06-29T17:18:51Z

Sorry - following my thought above to the natural conclusion. Why not just implement getitem, and then we don't need to support the additional name_map function? Is it terribly hard to think of what we want getitem to return? Surely just individual tables in the collection? The only other think that's stored is the sequence_length.

Well there's metadata and stuff as well. If we're happy with us viewing the TableCollection as a dict-like mapping table names to tables, then yeah, we should definitely just do that. I'm uncertain about whether dropping sequence_length and metadata is the correct thing to do - but I'm 90% leaning towards just implementing __getitem__ as you say.

hyanwong · 2020-06-29T17:28:23Z

So I'm wondering if we should have a method that returns the tables in an array or list instead, and simply drop the table names (but perhaps this is the point of an iterator!)?

tables.name_map.values() does this pretty well

I was really meaning that it's simpler to just return the list of tables, if you don't actually need the names at all. I guess this ties in with the point below. I agree about dropping the ts form if we do keep a dict-like mapping function, though.

hyanwong · 2020-06-29T17:33:19Z

Well there's metadata and stuff as well. If we're happy with us viewing the TableCollection as a dict-like mapping table names to tables, then yeah, we should definitely just do that. I'm uncertain about whether dropping sequence_length and metadata is the correct thing to do - but I'm 90% leaning towards just implementing __getitem__ as you say.

ISTR I just wanted a way to check equality without checking provenance. It occurs to me that we could do this even if the items returned by __getitem__ are not of the same class, as long as an equality operator is defined on each of them. I.e. it would be useful to do something like

all(x=y for x, y in zip(tc1, tc2) if not isinstance(tc1, tskit.ProvenanceTable))

codecov · 2020-06-29T17:38:54Z

Codecov Report

Merging #694 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #694      +/-   ##
==========================================
- Coverage   87.70%   87.69%   -0.01%     
==========================================
  Files          24       24              
  Lines       19328    19323       -5     
  Branches     3618     3618              
==========================================
- Hits        16951    16946       -5     
  Misses       1292     1292              
  Partials     1085     1085

Flag	Coverage Δ
#c_tests	`85.33% <ø> (ø)`
#python_c_tests	`90.12% <100.00%> (-0.01%)`	⬇️
#python_tests	`99.01% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/tskit/tables.py	`99.68% <100.00%> (-0.01%)`	⬇️
python/tskit/trees.py	`98.25% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9398b9a...bd4ad7e. Read the comment docs.

jeromekelleher · 2020-06-29T17:44:51Z

ISTR I just wanted a way to check equality without checking provenance. It occurs to me that we could do this even if the items returned by getitem are not of the same class, as long as an equality operator is defined on each of them. I.e. it would be useful to do something like

This is a pretty niche operation, and I don't think we should bias the design of these really high-level operations just to make it a tiny bit more convenient. If we do implement a dictionary or sequence protocol, then we want to be sure that this is generally useful and doesn't lead to nasty gotchas for users. I can't see tables[0] being equal to the EdgeTable instance ever being useful, but maybe tables["edges"] is. Or, maybe it's just confusing for users and we stick with the simpler and more descriptive tables.name_map["edges"]. Probably, some user calling list(tables) would be better off getting an error than ["edges, ...]``. Implementing special methods has far reaching consequences which need to be thought through carefully.

hyanwong · 2020-06-29T18:05:44Z

I completely agree about not making a special case. It's more to do with whether there's an obvious set of things that a user would expect to iterate over. When I first implemented the __iter__ method it seems as if there was (after all, it is called a "TableCollection", so the things in it are "obviously" the tables), but perhaps that's changed with the introduction of metadata?

I can't see why you would use tables.name_map["edges"] rather than tables.edges? To my mind, the only reason to have the dictionary-like accessor is so that you can iterate over all the "things" (presumably tables) in a table collection without having to know what they are all called. I would be worried about writing them all out, as I might miss one by accident, or it might be that a new table is introduced in a later version, and I still want my code to iterate over every "thing". If you only want a specific table, you might as well just use the attributes.

benjeffery · 2020-06-29T23:25:04Z

My thoughts on this:
TableCollection is a essentially a dict of tables and a dict of attributes.

tc[0] makes no sense, leads to unreadable code and is ambigous.

As iteration or length is also ambiguous (did you mean the tables or the attributes?) I would error on both. The error would point to using methods analogous to dict.items: tc.tables and tc.attributes or tc.iter_tables and tc.iter_attributes if you wanted to be explicit.

tc.edges and tc['edges'] should both work for both tables and attributes as they have their uses and avoid the less readable getattr.

hyanwong · 2020-06-29T23:27:52Z

tc.tables and tc.attributes or tc.iter_tables and tc.iter_attributes if you wanted to be explicit.

Yes, I like this. Much clearer. Thanks @benjeffery .

jeromekelleher · 2020-06-30T07:20:37Z

Good call @benjeffery, I'll update the PR with a proposal.

hyanwong · 2020-06-30T08:01:45Z

Good call @benjeffery, I'll update the PR with a proposal.

Thanks @jeromekelleher . If we use tc.tables() as an iterator we would to be able to call ts.tables.tables() which seems a bit weird. So I think I'd go for tc.iter_tables(). If necessary we can also implement a tree sequence wrapper so we can do ts.iter_tables() (which simply calls ts.tables.iter_tables())

jeromekelleher · 2020-06-30T08:16:05Z

Thanks @jeromekelleher . If we use tc.tables() as an iterator we would to be able to call ts.tables.tables() which seems a bit weird. So I think I'd go for tc.iter_tables(). If necessary we can also implement a tree sequence wrapper so we can do ts.iter_tables() (which simply calls ts.tables.iter_tables())

I think tables_dict is more useful than iter_tables though, because we don't just want the tables we want their names (and a list of (key, value) tuples is naturally a dict). So, we've kind of gone around in a circle here! Having iter_tables return a dict is just wrong IMO.

hyanwong · 2020-06-30T08:23:51Z

I think tables_dict is more useful than iter_tables though, because we don't just want the tables we want their names (and a list of (key, value) tuples is naturally a dict). So, we've kind of gone around in a circle here! Having iter_tables return a dict is just wrong IMO.

Hmm. ISWYM. For my use case (which is as you say, not a compelling argument), I'm not sure I want the table names, though? What's the use-case for wanting the names, rather than (say) having a separate tc.table_names() method?

Or indeed, could we just rely on table_names() instead of a dict, so you would be expected to use the names directly on the table collection: [tc[name] for name in ts.table_names()]?

[edit - tc.table_names is actually a property of the class, not the instance, so I guess this should be a static attribute: tskit.TableCollection.table_names()?]

hyanwong · 2020-06-30T08:34:29Z

Sorry - an additional (possibly alternative) thought: should the table names be a property of the table, not so much one returned by the table collection directly? Then if we do want something to iterate over the tables, we use tc.iter_tables() and get the names from each table: [t.name for t in tc.iter_tables()]? I guess it depends what you want to use the names for? If checking on what sort of table you are getting in this particular instance, I'd usually prefer to check using is instance rather than a string comparison (which I generally find a bit fragile).

Closes tskit-dev#500

jeromekelleher · 2020-07-24T13:44:20Z

@benjeffery and I decided to merge this one in the interest of getting the 0.3.0 release done soon.

jeromekelleher requested review from benjeffery and hyanwong and removed request for benjeffery June 29, 2020 16:34

jeromekelleher force-pushed the tables-dict-iter branch from b14e3b9 to ae58bc3 Compare June 29, 2020 17:14

jeromekelleher marked this pull request as draft June 30, 2020 07:20

Remove __iter__ protocol from TableCollection

bd4ad7e

Closes tskit-dev#500

jeromekelleher force-pushed the tables-dict-iter branch from ae58bc3 to bd4ad7e Compare July 24, 2020 13:39

jeromekelleher marked this pull request as ready for review July 24, 2020 13:39

benjeffery approved these changes Jul 24, 2020

View reviewed changes

benjeffery added the AUTOMERGE-REQUESTED label Jul 24, 2020

mergify bot merged commit 3b78da2 into tskit-dev:master Jul 24, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Jul 24, 2020

jeromekelleher deleted the tables-dict-iter branch July 24, 2020 15:32

Remove __iter__ protocol from TableCollection #694

Remove __iter__ protocol from TableCollection #694

Uh oh!

Conversation

jeromekelleher commented Jun 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hyanwong commented Jun 29, 2020

Uh oh!

hyanwong commented Jun 29, 2020

Uh oh!

jeromekelleher commented Jun 29, 2020

Uh oh!

jeromekelleher commented Jun 29, 2020

Uh oh!

hyanwong commented Jun 29, 2020

Uh oh!

hyanwong commented Jun 29, 2020

Uh oh!

codecov bot commented Jun 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Jun 29, 2020

Uh oh!

hyanwong commented Jun 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benjeffery commented Jun 29, 2020

Uh oh!

hyanwong commented Jun 29, 2020

Uh oh!

jeromekelleher commented Jun 30, 2020

Uh oh!

hyanwong commented Jun 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Jun 30, 2020

Uh oh!

hyanwong commented Jun 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hyanwong commented Jun 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Jul 24, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Remove iter protocol from TableCollection #694

Remove iter protocol from TableCollection #694

jeromekelleher commented Jun 29, 2020 •

edited

Loading

codecov bot commented Jun 29, 2020 •

edited

Loading

hyanwong commented Jun 29, 2020 •

edited

Loading

hyanwong commented Jun 30, 2020 •

edited

Loading

hyanwong commented Jun 30, 2020 •

edited

Loading

hyanwong commented Jun 30, 2020 •

edited

Loading