Skip to content

Conversation

@jeromekelleher
Copy link
Member

@jeromekelleher jeromekelleher commented Jun 29, 2020

Closes #500

This removes the __iter__ protocol from TableCollection and replaces it with two properties on TableCollection and TreeSequence which do the same thing, more directly. This was done for a couple of reasons:

  • Implementing __iter__ without also implementing __getitem__, __len__ etc isn't a great practise. We will probably want to implement __getitem__ at some point, and it would probably not be compatible with this definition of iter, so we don't want to be constrained by that.
  • A list of (key, value) tuples is more naturally a dict, so it makes sense to be explicity about this.

The behaviour was undocumented, so I don't think we're doing to break much code with this change.

The names of the new attributes could probably be improved, so open to ideas there. You could also argue that they should be methods rather than properties - I don't have a strong opinion here.

@hyanwong, I think you're the main user of this - is it going to affect you much?

@jeromekelleher jeromekelleher requested review from benjeffery and hyanwong and removed request for benjeffery June 29, 2020 16:34
@hyanwong
Copy link
Member

I agree that a dict is more natural if we want to return names and tables, and it's trivial to iterate over the dict, especially now that order is guaranteed (as of python 3.7). I don't think this should affect my code, and if it does, it should be easy to change. But in fact, when I iterate over all the tables, I usually don't actually need the names: when I want to check the type of table, I just check whether the class is e.g. tskit.ProvenanceTable, which seems less fragile than doing a string comparison. So I'm wondering if we should have a method that returns the tables in an array or list instead, and simply drop the table names (but perhaps this is the point of an iterator!)?

OTOH it is useful to have a function which returns the names of the tables, and the name_map method doubles up for that purpose.

Re naming, it does seem weird to have the tree sequence method called "tables_dict" and the tables method to be "name_map". I see that we already have an asdict() method that serialises each table into a dictionary, so we can't use that.

@hyanwong
Copy link
Member

Sorry - following my thought above to the natural conclusion. Why not just implement __getitem__, and then we don't need to support the additional name_map function? Is it terribly hard to think of what we want __getitem__ to return? Surely just individual tables in the collection? The only other think that's stored is the sequence_length.

@jeromekelleher
Copy link
Member Author

So I'm wondering if we should have a method that returns the tables in an array or list instead, and simply drop the table names (but perhaps this is the point of an iterator!)?

tables.name_map.values() does this pretty well

Re naming, it does seem weird to have the tree sequence method called "tables_dict" and the tables method to be "name_map". I see that we already have an asdict() method that serialises each table into a dictionary, so we can't use that.

I started out with them both being called tables_dict, but it looked weird to write self.tables.tables_dict. I think the name_map conveys the point pretty well, but tables_dict less so. TreeSequence.name_map doesn't seem right to me though. I guess we could drop the TreeSequence form, it's not that much of a chore to write ts.tables.name_map?

@jeromekelleher
Copy link
Member Author

Sorry - following my thought above to the natural conclusion. Why not just implement getitem, and then we don't need to support the additional name_map function? Is it terribly hard to think of what we want getitem to return? Surely just individual tables in the collection? The only other think that's stored is the sequence_length.

Well there's metadata and stuff as well. If we're happy with us viewing the TableCollection as a dict-like mapping table names to tables, then yeah, we should definitely just do that. I'm uncertain about whether dropping sequence_length and metadata is the correct thing to do - but I'm 90% leaning towards just implementing __getitem__ as you say.

@hyanwong
Copy link
Member

So I'm wondering if we should have a method that returns the tables in an array or list instead, and simply drop the table names (but perhaps this is the point of an iterator!)?

tables.name_map.values() does this pretty well

I was really meaning that it's simpler to just return the list of tables, if you don't actually need the names at all. I guess this ties in with the point below. I agree about dropping the ts form if we do keep a dict-like mapping function, though.

@hyanwong
Copy link
Member

Well there's metadata and stuff as well. If we're happy with us viewing the TableCollection as a dict-like mapping table names to tables, then yeah, we should definitely just do that. I'm uncertain about whether dropping sequence_length and metadata is the correct thing to do - but I'm 90% leaning towards just implementing __getitem__ as you say.

ISTR I just wanted a way to check equality without checking provenance. It occurs to me that we could do this even if the items returned by __getitem__ are not of the same class, as long as an equality operator is defined on each of them. I.e. it would be useful to do something like

all(x=y for x, y in zip(tc1, tc2) if not isinstance(tc1, tskit.ProvenanceTable))

@codecov
Copy link

codecov bot commented Jun 29, 2020

Codecov Report

Merging #694 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #694      +/-   ##
==========================================
- Coverage   87.70%   87.69%   -0.01%     
==========================================
  Files          24       24              
  Lines       19328    19323       -5     
  Branches     3618     3618              
==========================================
- Hits        16951    16946       -5     
  Misses       1292     1292              
  Partials     1085     1085              
Flag Coverage Δ
#c_tests 85.33% <ø> (ø)
#python_c_tests 90.12% <100.00%> (-0.01%) ⬇️
#python_tests 99.01% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/tables.py 99.68% <100.00%> (-0.01%) ⬇️
python/tskit/trees.py 98.25% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9398b9a...bd4ad7e. Read the comment docs.

@jeromekelleher
Copy link
Member Author

ISTR I just wanted a way to check equality without checking provenance. It occurs to me that we could do this even if the items returned by getitem are not of the same class, as long as an equality operator is defined on each of them. I.e. it would be useful to do something like

This is a pretty niche operation, and I don't think we should bias the design of these really high-level operations just to make it a tiny bit more convenient. If we do implement a dictionary or sequence protocol, then we want to be sure that this is generally useful and doesn't lead to nasty gotchas for users. I can't see tables[0] being equal to the EdgeTable instance ever being useful, but maybe tables["edges"] is. Or, maybe it's just confusing for users and we stick with the simpler and more descriptive tables.name_map["edges"]. Probably, some user calling list(tables) would be better off getting an error than ["edges, ...]``. Implementing special methods has far reaching consequences which need to be thought through carefully.

@hyanwong
Copy link
Member

hyanwong commented Jun 29, 2020

I completely agree about not making a special case. It's more to do with whether there's an obvious set of things that a user would expect to iterate over. When I first implemented the __iter__ method it seems as if there was (after all, it is called a "TableCollection", so the things in it are "obviously" the tables), but perhaps that's changed with the introduction of metadata?

I can't see why you would use tables.name_map["edges"] rather than tables.edges? To my mind, the only reason to have the dictionary-like accessor is so that you can iterate over all the "things" (presumably tables) in a table collection without having to know what they are all called. I would be worried about writing them all out, as I might miss one by accident, or it might be that a new table is introduced in a later version, and I still want my code to iterate over every "thing". If you only want a specific table, you might as well just use the attributes.

@benjeffery
Copy link
Member

My thoughts on this:
TableCollection is a essentially a dict of tables and a dict of attributes.

tc[0] makes no sense, leads to unreadable code and is ambigous.

As iteration or length is also ambiguous (did you mean the tables or the attributes?) I would error on both. The error would point to using methods analogous to dict.items: tc.tables and tc.attributes or tc.iter_tables and tc.iter_attributes if you wanted to be explicit.

tc.edges and tc['edges'] should both work for both tables and attributes as they have their uses and avoid the less readable getattr.

@hyanwong
Copy link
Member

tc.tables and tc.attributes or tc.iter_tables and tc.iter_attributes if you wanted to be explicit.

Yes, I like this. Much clearer. Thanks @benjeffery .

@jeromekelleher
Copy link
Member Author

Good call @benjeffery, I'll update the PR with a proposal.

@jeromekelleher jeromekelleher marked this pull request as draft June 30, 2020 07:20
@hyanwong
Copy link
Member

hyanwong commented Jun 30, 2020

Good call @benjeffery, I'll update the PR with a proposal.

Thanks @jeromekelleher . If we use tc.tables() as an iterator we would to be able to call ts.tables.tables() which seems a bit weird. So I think I'd go for tc.iter_tables(). If necessary we can also implement a tree sequence wrapper so we can do ts.iter_tables() (which simply calls ts.tables.iter_tables())

@jeromekelleher
Copy link
Member Author

Thanks @jeromekelleher . If we use tc.tables() as an iterator we would to be able to call ts.tables.tables() which seems a bit weird. So I think I'd go for tc.iter_tables(). If necessary we can also implement a tree sequence wrapper so we can do ts.iter_tables() (which simply calls ts.tables.iter_tables())

I think tables_dict is more useful than iter_tables though, because we don't just want the tables we want their names (and a list of (key, value) tuples is naturally a dict). So, we've kind of gone around in a circle here! Having iter_tables return a dict is just wrong IMO.

@hyanwong
Copy link
Member

hyanwong commented Jun 30, 2020

I think tables_dict is more useful than iter_tables though, because we don't just want the tables we want their names (and a list of (key, value) tuples is naturally a dict). So, we've kind of gone around in a circle here! Having iter_tables return a dict is just wrong IMO.

Hmm. ISWYM. For my use case (which is as you say, not a compelling argument), I'm not sure I want the table names, though? What's the use-case for wanting the names, rather than (say) having a separate tc.table_names() method?

Or indeed, could we just rely on table_names() instead of a dict, so you would be expected to use the names directly on the table collection: [tc[name] for name in ts.table_names()]?

[edit - tc.table_names is actually a property of the class, not the instance, so I guess this should be a static attribute: tskit.TableCollection.table_names()?]

@hyanwong
Copy link
Member

hyanwong commented Jun 30, 2020

Sorry - an additional (possibly alternative) thought: should the table names be a property of the table, not so much one returned by the table collection directly? Then if we do want something to iterate over the tables, we use tc.iter_tables() and get the names from each table: [t.name for t in tc.iter_tables()]? I guess it depends what you want to use the names for? If checking on what sort of table you are getting in this particular instance, I'd usually prefer to check using is instance rather than a string comparison (which I generally find a bit fragile).

@jeromekelleher jeromekelleher marked this pull request as ready for review July 24, 2020 13:39
@jeromekelleher
Copy link
Member Author

@benjeffery and I decided to merge this one in the interest of getting the 0.3.0 release done soon.

@mergify mergify bot merged commit 3b78da2 into tskit-dev:master Jul 24, 2020
@jeromekelleher jeromekelleher deleted the tables-dict-iter branch July 24, 2020 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Change TableCollection to be a mapping tablename -> table?

3 participants