Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace inheritance of NetworkX by encapsulation #501

Merged
merged 28 commits into from Dec 12, 2019

Conversation

geoffj-d61
Copy link
Contributor

Under this revision, StellarGraph is no longer also a NetworkX - it now has its own interface. The NetworkXStellarGraph implementation class (formerly StellarGraphBase) now wraps the supplied NetworkX object.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

node_type_attribute="label",
seed=None,
):
def run(self, nodes=None, n=None, length=None, metapaths=None, seed=None):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor this function to reduce its Cognitive Complexity from 16 to the 15 allowed.

seed=None,
weighted=False,
edge_weight_label="weight",
self, nodes=None, n=None, p=1.0, q=1.0, length=None, seed=None, weighted=False
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method "run" has 8 parameters, which is greater than the 7 authorized.

)

# XXX This has not yet been standardised in the interface.
def adjacency_types(self, graph_schema: GraphSchema):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cyclomatic complexity is too high in method adjacency_types. (7)

Implementation based on encapsulating a NetworkX graph.
"""

def __init__(self, graph=None, is_directed=False, **attr):

This comment was marked as off-topic.

Implementation based on encapsulating a NetworkX graph.
"""

def __init__(self, graph=None, is_directed=False, **attr):

This comment was marked as off-topic.

@codeclimate
Copy link

codeclimate bot commented Oct 10, 2019

Code Climate has analyzed commit e70e637 and detected 23 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 18
Security 5

Note: there are 2 critical issues.

View more on Code Climate.

Returns:
bool: The graph directedness status.
"""
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are essentially "abstract" methods for the StellarGraph class, right? Is it worth using something more specific to convey this, like the abc module https://docs.python.org/3/library/abc.html?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely abstract, yes. I would have to look more closely at ABC classes, but I have no objection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Using the abc module and decorating the functions with @abstractclass would be better.

return type.__call__(cls, *args, **kwargs)


class StellarGraph(metaclass=StellarGraphFactory):
Copy link
Member

@huonw huonw Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this be an abstract class, and using a metaclass here, did you consider storing the implementation graph internally? This is thinking along the lines of "Composition over inheritance".

This might look like:

class StellarGraph:
    def __init__(self, *args, **kwargs):
        if self.__class__ is StellarGraph:
            self._graph = NetworkXStellarGraph(*args, is_directed=False, **kwargs)
        elif self.__class__ is StellarDiGraph:
            self._graph = NetworkXStellarGraph(*args, is_directed=False, **kwargs)
        else:
            ...

   def is_directed(self):
       return self._graph.is_directed()

   ...

This seems to simplify the relationship between things, and means users only see StellarGraph (or StellarDiGraph) as the class, rather than being able to observe the NetworkXStellarGraph class values directly.

Any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I like the ability to have different backends to support different underlying data types, I think that it is confusing that instantiating the StellarGraph class gives back a different class e.g. NetworkXStellarGraph.

Options for encapsulation:

  1. Subclassing StellarGraph:
    • NetworkXStellarGraph
    • PandasStellarGraph
  2. Use graph class as member
    StellarGraph.graph = NetworkXGraphInterface(G)
    StellarGraph.graph = PandasGraphInterface(nodes_pd, edges_pd)
  3. Convert all inputs to an underlying format:
    StellarGraph.graph = convert_from_nx(G)
    StellarGraph.graph = convert_from_pandas(nodes_pd, edges_pd)

I tend to think 2 & 3 are best for users, they see a consistent StellarGraph object. I favour option 3 myself.

I'm happy to go ahead with the MetaClass style for now on the develop branch but I think we should think about this carefully from a user perspective before we release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced composition would buy us anything here. Firstly, every implemented method would have to delegate to the underlying implementation. Secondly, if the user instantiates StellarGraph with a NetworkX graph, then I don't see a particular problem if they get a NetworkXStellarGraph back.

Don't forget here that we still do not know exactly what methods the StellarGraph interface requires. This is being found by trial and error. Thus, the initial NetworkXStellarGraph keeps accessible implementation details that are not necessarily yet part of StellarGraph. I know this is a hack, but the point of the ticket was simply to make sure NetworkXStellarGraph no longer inherited from NetworkX graph types, without breaking existing functionality (where possible).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced composition would buy us anything here

It buys a simpler relationship between the various classes, because there's no metaclass trying to swap in different implementations.

But fair enough, about the rest of your points.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have said before, there have been no firm requirements for this work, and we can treat this code as purely exploratory if we want. If the requirement is that users instantiating StellarGraph get StellarGraph, we can work to that, but it will mean a performance hit due to delegating all calls to the underlying implementation.

On the other hand, Java programmers don't seem to have a problem asking for a class implementing interface X and actually getting Y. Given that that g=StellarGraph() gives back something that satisfies isinstance(g,StellarGraph), I'm not sure that casual users will notice, and expert users shouldn't be surprised.

The only odd thing about my implementation is that g=StellarDiGraph() satisfies isinstance(g,StellarGraph) by design, which might be a bit unexpected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that it's not that unusual to have classes change, but the ramifications of this logic are more than just the types that get returned; e.g. the user has to be able to work out what arguments get passed to the StellarGraph constructor: documenting this usefully is much harder when there isn't an __init__ that lists them all, and tooling like PyCharm & mypy likely doesn't work that great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question once again whether it is better to document the interface in the class or the __init__. In this case, the NetworkX-y interface is documented in the class, leaving the init free. Answers on the web suggest that people are more likely to type help(StellarGraph) than help(StellarGraph.__init__), so class docs seem the best place.

Again, the problem we have is that we do not really have fully-defined arguments, due to organic growth of functionality. Multiply this by the polymorphism inherent in trying to support multiple interfaces, and I don't really see a neat encapsulation to be able to be explicit about all arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question once again whether it is better to document the interface in the class or the init. In this case, the NetworkX-y interface is documented in the class, leaving the init free. Answers on the web suggest that people are more likely to type help(StellarGraph) than help(StellarGraph.init), so class docs seem the best place.

I think the location of the documentation is an orthogonal issue (also, help(StellarGraph) includes the documentation for __init__, and, in particular, its signature, which is what I care most about).

In particular, I'm getting at a more fundamental issue: having a computer-understandable definition of how a StellarGraph can be constructed. The best case for a library (and, indeed, anything) is no documentation needed, because the interface is "perfectly intuitive", meaning a user is guided down the right path automatically without having to read about how to use it ("the easy path is the right path"). Obviously, doing that entirely is unattainable for anything non-trivial, but it's still something that can be kept as a goal.

IME, a critical part of this is making it easy for computers to understand the easy path, because then the user's computer can tell the user when they've gone wrong, as soon as possible. In this specific case, this means having the expected arguments visible to the user and their editor/IDE/tooling. As this code is written, the only way to even know the names of the arguments is to read the documentation, which is something a computer cannot easily do.

Another part of it is making it easy for library maintainers to keep the documentation up to date, so when a user is forced to consult the docs, they are useful. With highly dynamic approaches, more documentation is required (i.e. more to maintain), and, there's less chance of tooling helping maintain it (e.g. if we remove or rename an argument, remind us to update the docs for it).

For instance, you may already know that IntelliJ/pycharm can be helpful about constructing things and calling functions:

image

image

It can even show type annotations, if they exist on the function in question:

image

However, with the dynamic metaclasses, it can't show anything like this:

image

There's no hint that I might want to be passing two arguments in there.

(NB. my example of IntelliJ/PyCharm is just a proxy for any tooling that might want to reason about our python code, other examples are mypy, pylint and even documentation generators.)

Again, the problem we have is that we do not really have fully-defined arguments, due to organic growth of functionality. Multiply this by the polymorphism inherent in trying to support multiple interfaces, and I don't really see a neat encapsulation to be able to be explicit about all arguments.

This is a bit unfortunate, but it seems to me we should still think quite carefully about steps that make it harder to have great tooling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that, and agree with everything you said. BUT in this case we still have the problem of having two incompatible class argument signatures, and will continue to have until major rationalisation of the NetworkX-based version can be done.

That said, the NodeData, EdgeData and EdgeCache interfaces could all be simplified back to standard init arguments.

Copy link
Contributor

@adocherty adocherty Nov 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving forward, I think we can drop the NetworkX backend. I don't see it as useful moving forward. This will mean we can control our class arguments more easily. I see this PR is a temporary exploration and test for isolating our API from NetworkX.

self._graph = graph

# Name of optional attribute for edge weights
self._edge_weight_label = attr.get("edge_weight_label", "weight")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to require that there's a single notion of "edge weight" for any given graph, but it seems to me that's not quite true. E.g. in a transport network one could weight edges (roads/paths) by:

  • distance
  • time while driving
  • time while cycling
  • time while walking
  • cost
  • elevation change
  • arbitrary combinations of the above

Do you think having a single one is an appropriate trade-off?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing use-cases seem to pick exactly one attribute, typically weight by default, to be the edge weight. I agree that there can be multiple types of edge weights, but if these exist at the same time, I imagine they would become edge features (not yet implemented). Again, I chose the simplest interface to match actual usage. I have no objection to this changing in future if necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand why this had to move into the StellarGraph class instead of remaining in the algorithms as it was before. It seems more flexible to be part of the algorithm, because different invocations can chose different attributes, if they're available, and, on the surface, it seems like this change isn't related to breaking the networkx-inheritance.

Could you expand?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an intermediate solution for now. In the future we need to think about something more general that will store arbitrary node & edge features in a memory and performance efficient way.

Also, I think that edge weights should be "just" a special edge feature expected by the random walk algorithms, I don't think that the StellarGraph class needs to know that it is special.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, line 489 and down (edge_weight_label, etc.) were existing code, I think (been a while). What part of this has moved into StellarGraph? Ok, maybe you are talking about why this has to be defined up front instead of passed to each algorithm individually? Future creep, I guess.

A fundamental assertion of the subsequent branch (which includes the new StandardStellarGraph implementation) is that node data and edge data need to be fully encapsulated up front into standardised interfaces. Whilst it would be possible to include a method giving back data for an arbitrary 'column', I instead made it part of the design to pre-designate all necessary columns.

Based on the usages I have seen, it didn't seem problematic to just have different StellarGraph instances if one wanted to experiment with different edge weight variables. We can rethink this if it is or becomes a major decision point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What part of this has moved into StellarGraph? Ok, maybe you are talking about why this has to be defined up front instead of passed to each algorithm individually?

Yes, this edge_weight_label argument and the self._edge_weight_label processing has moved from each algorithm into the StellarGraph class.

A fundamental assertion of the subsequent branch (which includes the new StandardStellarGraph implementation) is that node data and edge data need to be fully encapsulated up front into standardised interfaces

Nodes support arbitrary attributes in some sort of standardised interface, I imagine?

Whilst it would be possible to include a method giving back data for an arbitrary 'column', I instead made it part of the design to pre-designate all necessary columns.

I don't think edge weight is a necessary column (it doesn't seem to even be necessary for most of the algorithms that support obeying weights). How does this model an unweighted graph? Does it force the user to manually specify "weight": 1 for every edge?

From a neon/platform perspective, the more generic approach of arbitrary edge attributes is the right one: e.g. the bitcoin graph has timestamps on the edges, which don't make sense to pretend to be weights (and, some feature engineering uses these timestamps, although that processing is currently done purely in spark).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan is to implement edge features along the lines of node features (see #520). Whether or not this solves the problem of 'arbitrary' columns, I don't know. By default, if you don't specify an edge weight, it is usually assumed to be 1 - typically this would be a virtual value.

If we can come up with some simple use-cases for different sorts of edge information, then I am happy to have a (group) rethink about the design. I've been purely guessing so far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether or not this solves the problem of 'arbitrary' columns, I don't know

I would think so; in my mind, edge features and columns of data on edges are synonymous (but differently presented views) of the same concept?

By default, if you don't specify an edge weight, it is usually assumed to be 1 - typically this would be a virtual value

I think the piece I was missing is that all of the downstream code handles this defaulting already (given I didn't notice any changes along these lines already), e.g. to_scipy_sparse_matrix handles seems to ignore edges that don't have the specified weight property, and BiasedRandomWalk seems to just branch on if weighted: at a very early point.

If we can come up with some simple use-cases for different sorts of edge information, then I am happy to have a (group) rethink about the design. I've been purely guessing so far.

One example: the feature engineering done on bitcoin uses edge attributes; you can see some of this done in PySpark in https://github.com/stellargraph/neon/blob/develop/scenarios/scenario-2/Bitcoin_1_Features_Engineering.ipynb in "Add transaction features" where it aggregates the transactions (edges) per day, using the timestamp associated with those edges.

stellargraph/data/converter.py Outdated Show resolved Hide resolved
seed=seed,
weighted="unknown",
edge_weight_label="weight",
nodes=nodes, n=n, p=p, q=q, length=length, seed=seed, weighted="unknown"
)

with pytest.raises(ValueError):
# edge weight labels are by default called weight as is in networkx but they can be any string value if user specified
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this part of the test is no longer relevant, now that edge weight labels aren't specified in run.

(Also, I think it was never testing what this comment implies it was testing: ValueError will be thrown because of the weighted="unknown", not because of edge_weight_label=None.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. SEP if you read HHGTTG.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely the problem of a PR that deletes/moves some functionality to remove outdated tests, otherwise there's just an endless accrual of weird and wacky tests that aren't doing anything useful. Fortunately this particular one isn't too entangled and we can just delete the lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if this is a hang-over from me shifting arguments out of methods into the constructor, then I agree it is my problem. Will review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not convinced I made any changes relevant to this code, but I removed the duplicate test and reworded the comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 (You deleted the edge_weight_label argument, which this second test was attempting to validate.)

tests/mapper/test_node_mappers.py Outdated Show resolved Hide resolved
Copy link
Contributor

@adocherty adocherty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments in code.

return type.__call__(cls, *args, **kwargs)


class StellarGraph(metaclass=StellarGraphFactory):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I like the ability to have different backends to support different underlying data types, I think that it is confusing that instantiating the StellarGraph class gives back a different class e.g. NetworkXStellarGraph.

Options for encapsulation:

  1. Subclassing StellarGraph:
    • NetworkXStellarGraph
    • PandasStellarGraph
  2. Use graph class as member
    StellarGraph.graph = NetworkXGraphInterface(G)
    StellarGraph.graph = PandasGraphInterface(nodes_pd, edges_pd)
  3. Convert all inputs to an underlying format:
    StellarGraph.graph = convert_from_nx(G)
    StellarGraph.graph = convert_from_pandas(nodes_pd, edges_pd)

I tend to think 2 & 3 are best for users, they see a consistent StellarGraph object. I favour option 3 myself.

I'm happy to go ahead with the MetaClass style for now on the develop branch but I think we should think about this carefully from a user perspective before we release.

Returns:
bool: The graph directedness status.
"""
raise NotImplementedError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Using the abc module and decorating the functions with @abstractclass would be better.


def __call__(cls, *args, **kwargs):
if cls is StellarGraph:
return NetworkXStellarGraph(*args, is_directed=False, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user specifies is_directed in kwargs then this will fail.

e.g.

    def test_directed_graph_from_nx():
        Gnx = nx.karate_club_graph()
>       sg = StellarGraph(Gnx, is_directed=False)

cls = <class 'stellargraph.core.graph.StellarGraph'>, args = (<networkx.classes.graph.Graph object at 0x13c65ef28>,)
kwargs = {'is_directed': False}

    def __call__(cls, *args, **kwargs):
        if cls is StellarGraph:
>           return NetworkXStellarGraph(*args, is_directed=False, **kwargs)
E           TypeError: StellarGraphFactory object got multiple values for keyword argument 'is_directed'

stellargraph/core/graph.py:249: TypeError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I left it up to the Python infrastructure to complain, since is_directed was left undocumented. The other thing we could do is raise an error saying to not explicitly set is_directed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think raising an error if is_directed and the class-specified directedness don't agree would be good.

self._graph = graph

# Name of optional attribute for edge weights
self._edge_weight_label = attr.get("edge_weight_label", "weight")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an intermediate solution for now. In the future we need to think about something more general that will store arbitrary node & edge features in a memory and performance efficient way.

Also, I think that edge weights should be "just" a special edge feature expected by the random walk algorithms, I don't think that the StellarGraph class needs to know that it is special.

def node_degrees(self) -> Mapping[Any, int]:
return self._graph.degree()

def adjacency_weights(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be better called to_adjacency_matrix, and have an option of using edge weights or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a conceptual problem here in that this ticket is (supposedly) about making minimal changes to stop StellarGraph inheriting from NetworkX, whereas the subsequent ticket (see #520) is about distilling the StellarGraph interface and making a time- and memory-efficient implementation that does not use NetworkX, including encapsulating edge and node data in a standard way. So some of the concerns, and some of my answers, keep crossing between these two tickets.

This is one of the reasons I've been trying to push ahead with crystallising what the StellarGraph interface currently supports (i.e. everything that NetworkXStellarGraph can do, rightly or wrongly, including create_graph_schema()), in advance of trying to optimise what it actually should do. And I wholeheartedly agree that this approach runs the risk of implementing a suboptimal design, but we're working in the dark otherwise. We all need a good think about this, because I don't have any ready answers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your points, but as I understand it this function was introduced to the class in this ticket, or am I mistaken?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual function adjacency_weights() was created in this ticket, but the functionality itself was already in use in several places. This change just hid away the fact that the code was using NetworkX-specific structures and calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so can we can change the name to to_adjacency_matrix?

# even when the seed is set.
adj[et][n1] = sorted(neigh_et, key=str)
self.adj_types = adj
self.adj_types = adj = self.graph.adjacency_types(self.graph_schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this object should be replaced with functionality inside the stellargraph class, this is just used for efficiency in sampling therefore we can use a typed adjacency list (as proposed as MDAL in this document https://docs.google.com/document/d/1ztdDOmIUFerSVu-R0fgKP3vYTTdZBmY9qbSoIXb3NWQ)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which it is in #520 (see my previous comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's move that to be decided in the next ticket and in future API discussions.

self._raise_error("node {} not in graph".format(node))
return list(nx.neighbors(self.graph, node))
return list(self.graph.neighbour_nodes(node))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This involves accessing the graph twice (once to check, once to get neighbours). Better to use a try/except:

        try:
            neighbours = list(self.graph.neighbour_nodes(node))
        except KeyError:
             self._raise_error("node {} not in graph".format(node))

if isinstance(G, StellarGraph):
nodes = list(G.nodes())
else:
nodes = list(G)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't list(G.nodes()) work for NetworkX objects too? I don't think we need the test, just the single line:

nodes = list(G.nodes())

Comment on lines 3 to 16
#
# Copyright 2019 Data61, CSIRO
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch .. how did we miss this?!

def number_of_edges(self) -> int:
return self._graph.number_of_edges()

def nodes(self) -> Iterable[Any]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have list(G.nodes()) everywhere we need to get the nodes, as NetworkX returns an iterator. Can we just convert to a list in this method, rather than in most calls to it? This can be part of the API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer is two-fold: 1. Always converting to a list is potentially a waste for a big graph if all we want to do is iterate over the nodes. 2. If we have a complex data-type (e.g. a node-type to node-data mapping), using itertools.chain() or generators is very convenient (see #520).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also an example of the design principle to prefer abstract types (e.g. Iterable) over concrete implementations (e.g. list). Part of the 'D' in 'SOLID', I think.

The user having to do list(g.nodes()) is an indicator that they understand they will be allocating additional memory to store the nodes, rather than that choice being forced on them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Although somehow we always end up converting to a list!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually because we want to sub-sample the nodes. But I think we occasionally just iterate through the nodes.

@geoffj-d61
Copy link
Contributor Author

geoffj-d61 commented Dec 11, 2019

@geoffj-d61

I think we should merge this ASAP. To wrap this up I think we should do the following:

  • Rename the graph .neighbour_nodes method as .neighbors.

Done.

  • Merge with develop and fix merge conflicts (mainly around the time/memory benchmarks @huonw has written)

Done.

  • Change the name of adjacency_weights to to_adjacency_matrix

Done.

  • Standardize the return type of the .neighbors method. Currently it seems to return a set if it's directed and an iterator otherwise.

Not sure about this one. I created both StellarGraph(gnx) and StellarDiGraph(gnx), and .nodes() gave me a NodeView (which is Iterable) in each case. Where does the set value come into play? Addendum: Doh! It's neighbors() not nodes() you were talking about - will recheck.

Okay, so my observation still stands. For an undirected graph, NetworkX .neighbors() returns a dict_keyiterator that is Iterable, so NetworkXStellarGraph just returns this. For a directed graph, NetworkX returns another dict_keyiterator that only contains the out-nodes, so NetworkXStellarGraph corrects this by also including the in-nodes in a set (to deduplicate in the case of self edges). Since a set is also Iterable, both outputs obey the Iterable interface defined in StellarGraph - the underlying type of the return value is an implementation issue, and users should neither care nor rely on a specific type.

  • Check all demos to ensure they are working.

Checked - okay.

  • Check the benchmarks checked in by @huonw on develop and on this branch to check for any major performance issues.

Checked - not happy with results.
I've run the benchmarks for both develop and this branch, and this branch seems slower (don't know why). I will rerun both tests after lunch, first shutting down everything non-essential, and put the comparison in here.

After this, I'm good with merging.

@geoffj-d61
Copy link
Contributor Author

geoffj-d61 commented Dec 11, 2019

         develop                       drop-networkx
       Mean      (StdDev)            Mean      (StdDev)
-------------------------------------------------------------------------------------------------------------------------------
test_benchmark_biasedrandomwalk
           63.2991 (    9.0200)          75.2602 (    14.4769)
test_benchmark_uniformrandomwalk
           70.4323 (   14.1472)          78.3540 (    10.7312)
test_benchmark_uniformrandommetapathwalk
           97.3489 (   20.6781)         128.1268 (    16.6644)
test_benchmark_bfs_walk
          164.7183 (   17.4904)         179.8372 (    17.1532)
test_benchmark_biasedweightedrandomwalk
          304.9791 (   33.4000)         421.5909 (    64.5393)
test_benchmark_sampledheterogeneousbreadthfirstwalk
          406.8566 (   43.5902)         409.1014 (    43.4972)
test_benchmark_bfs_walk
        1,471.4640 (   97.9828)       1,330.2763 (   150.8902)
-------------------------------------------------------------------------------------------------------------------------------
test_allocation_benchmark_creation_from_networkx[None-0-0]
        1,260.8000 (  288.3553)       1,688.4000 (   264.1782)
test_allocation_benchmark_creation_from_networkx[100-0-0]
        1,529.2000 (  112.6978)       1,872.8000 (    75.1319)
test_allocation_benchmark_creation_from_networkx[None-100-200]
      137,684.8000 (4,196.4479)     138,900.8000 ( 4,166.7269)
test_allocation_benchmark_creation_from_networkx[100-100-200]
      227,676.4000 (4,211.7674)     230,750.0000 ( 4,075.8318)
test_allocation_benchmark_creation_from_networkx[None-1000-5000]
    3,171,509.6000 (4,103.9143    3,178,549.6000 ( 4,099.7130)
test_allocation_benchmark_creation_from_networkx[100-1000-5000]
    4,010,263.2000 (4,106.4258)   4,011,485.6000 ( 3,996.6412)
-------------------------------------------------------------------------------------------------------------------------------
test_benchmark_get_neighbours
           21.7508 (    6.3989)          19.0489 (     3.5435)
-------------------------------------------------------------------------------------------------------------------------------
test_benchmark_get_features[specify-4]
           26.0074 (    6.4494)          25.8620 (     3.3190)
test_benchmark_get_features[specify-1]
           28.2127 (    6.8341)          28.5375 (     8.4647)
test_benchmark_get_features[infer-4]
           67.2492 (   17.3558           68.9817 (    16.6034)
test_benchmark_get_features[infer-1]
           67.2902 (   10.9388)          73.6866 (    20.1374)
-------------------------------------------------------------------------------------------------------------------------------
test_benchmark_setup_generator_small
            2.6805 (    0.2293)           2.6566 (     0.1555)
test_benchmark_node_generator_small
           94.3042 (    2.9019           97.3943 (     3.4163)
test_benchmark_link_generator_small
           94.6126 (    0.7385)          98.4902 (     0.5390)
test_benchmark_setup_generator_large
          121.5673 (    2.2757)         130.5395 (    13.5090)
test_benchmark_node_generator_large
          120.6086 (    0.7242)         126.7655 (     1.2720)
test_benchmark_link_generator_large
          124.5168 (    0.1740)         130.4071 (     0.1578)
-------------------------------------------------------------------------------------------------------------------------------

@geoffj-d61 geoffj-d61 requested review from adocherty and removed request for adocherty, kieranricardo and daokunzhang December 11, 2019 05:52
@geoffj-d61
Copy link
Contributor Author

@adocherty , still not sure about the Set vs Iterable problem (note that a set is also iterable).
Have added the benchmark comparisons - could you see if you are okay with them?

@huonw
Copy link
Member

huonw commented Dec 11, 2019

The deltas seem like pretty close (within a standard deviation, at least), except for some of the node_generator and link_generator ones. Do you suspect there's something deeper going on?

@geoffj-d61
Copy link
Contributor Author

The deltas seem like pretty close (within a standard deviation, at least), except for some of the node_generator and link_generator ones. Do you suspect there's something deeper going on?

My first runs had every timing benchmark and memory allocation higher for drop-networkx than for develop. Which is why I shut down everything else to rerun the tests. Plus I did some twice just in case switching between branches was causing Python recompilation.

These comparisons don't seem too bad, though still disappointingly slow in some cases.

@huonw
Copy link
Member

huonw commented Dec 11, 2019

My first runs had every timing benchmark and memory allocation higher for drop-networkx than for develop

Ah, I see. I'm glad the quiet run was better 👍

These comparisons don't seem too bad, though still disappointingly slow in some cases.

Could you clarify what you mean by this? My understanding is that this patch is not expected to make any performance difference, other than potentially introducing fractionally more time overhead due to having to do some extra function calls and member look-ups, and fractionally more memory use due to having the extra wrapper object.

@geoffj-d61
Copy link
Contributor Author

As far as I recall, this branch mostly just shifted existing functionality from an implementation class called StellarGraph to a new implementation class called NetworkXStellarGraph. As a consequence, I wouldn't have expected timing benchmarks to change. Allocation I expected to maybe be a little higher due to the new interface, etc.

So my concern is I don't know why some tests seem slower on average.

@adocherty
Copy link
Contributor

adocherty commented Dec 12, 2019

@adocherty , still not sure about the Set vs Iterable problem (note that a set is also iterable).
Have added the benchmark comparisons - could you see if you are okay with them?

Sorry, not iteratable but iterator. Undirected graphs return an iterator from NetworkX; however, for directed graphs a set is returned (which is iterable but not an iterator). This is done when you collect the set of in and out nodes over the edges for directed graphs. Note that using iter on a set will return an iterator similar to what NetworkX returns:

import stellargraph as sg
import networkx as nx

g = nx.erdos_renyi_graph(100, 0.1)

Undirected graph:

gs=sg.StellarGraph(g)
gs.neighbors(2)

output:

<dict_keyiterator object at 0x137d15638>

Directed graph:

gs=sg.StellarDiGraph(g)
print(gs.neighbors(2))
type(gs.neighbors(2))

output:

{97, 98, 3, 41, 78, 51, 60}
set

I think what you want to return to have more consistency in return types is this:

iter(gs.neighbors(2))

output:

<set_iterator at 0x137c63af8>

@adocherty
Copy link
Contributor

As far as I recall, this branch mostly just shifted existing functionality from an implementation class called StellarGraph to a new implementation class called NetworkXStellarGraph. As a consequence, I wouldn't have expected timing benchmarks to change. Allocation I expected to maybe be a little higher due to the new interface, etc.

So my concern is I don't know why some tests seem slower on average.

Most of the tests seem within the margin of variation expected by the std dev. There are a few that seem slower. I assume this is due to the abstraction overhead caused by the new class performing more tests and then calling NetworkX. I'm not sure they are that much worse that we need to worry.

@geoffj-d61
Copy link
Contributor Author

geoffj-d61 commented Dec 12, 2019

In my tests, isinstance(g.neighbors(node), Iterable) returns True for both directed and undirected graphs. Specifically, dict_keyiterator is also iterable, as a for n in g.neighbors(node) also verifies.

@adocherty
Copy link
Contributor

adocherty commented Dec 12, 2019

In my tests, isinstance(g.neighbors(node), Iterable) returns True for both directed and undirected graphs. Specifically, dict_keyiterator is also iterable, as a for n in g.neighbors(node) also verifies.

They are still different return types even though they are both iterable. It would be better to return the same object type regardless of if it's directed or not. Specifically, isinstance(g.neighbors(node), Iterator) will not be True for both. The fact that sets and iterators can both be iterated over is not the point.

My point here is: wrap sets returned in out_nodes and in_nodes with the iter function then we will have the same return type (Iterator) for both directed and undirected graphs.

@geoffj-d61
Copy link
Contributor Author

They are still different return types even though they are both iterable. It would be better to return the same object type regardless of if it's directed or not. Specifically, isinstance(g.neighbors(node), Iterator) will not be True for both. The fact that sets and iterators can both be iterated over is not the point.

I disagree, I'm afraid. If the interface says you will get something that can be iterated over, and you get something back that can be iterated over, the the interface is satisfied; the actual return type is irrelevant as far as functionality is concerned. I don't see that consistency of return type matters.
If you really want it we can have it, but I really don't want to change the interface from Iterable to Iterator to match - that would be a retrograde step in my book (as it reduces the level of abstraction).

@youph , you're the product owner, so you can make the call.

My point here is: wrap sets returned in out_nodes and in_nodes with the iter function then we will have the same return type (Iterator) for both directed and undirected graphs.

Not sure if you mean that in_nodes() and out_nodes() should return iter(set) or that the union of the sets for a directed graph should be wrapped with the iter? The latter is doable; the former would mess up doing unions.

@geoffj-d61
Copy link
Contributor Author

@youph , if you want a check added that a NetworkX graph has been supplied, then this logically should go into the implementation NetworkXStellarGraph. It doesn't make sense to be in the StellarGraphFactory, otherwise we have to have an extra dependency in the latter on NetworkX, even if (e.g. in the future) we never supply a NetworkX object for initialisation.

So, I can add a check in NetworkXStellarGraph if you think that is useful, otherwise initialisation will just break on its own.

@geoffj-d61
Copy link
Contributor Author

@youph , @adocherty , I've added a note in CHANGELOG.md about this being a breaking change - is that the right place to put it?

@adocherty
Copy link
Contributor

adocherty commented Dec 12, 2019

I disagree, I'm afraid. If the interface says you will get something that can be iterated over, and you get something back that can be iterated over, the the interface is satisfied; the actual return type is irrelevant as far as functionality is concerned. I don't see that consistency of return type matters.
If you really want it we can have it, but I really don't want to change the interface from Iterable to Iterator to match - that would be a retrograde step in my book (as it reduces the level of abstraction).

This was a pretty minor point, but somehow we have ended up arguing about it for far too long.

Not sure if you mean that in_nodes() and out_nodes() should return iter(set) or that the union of the sets for a directed graph should be wrapped with the iter? The latter is doable; the former would mess up doing unions.

You are right – it's difficult to return an Iterator for in and out nodes with the current implementation.

You have addressed my other issues, and this seems to be an important point for you, so I'm OK to go ahead with it this way.

@geoffj-d61
Copy link
Contributor Author

You have addressed my other issues, and this seems to be an important point for you, so I'm OK to go ahead with it this way.

Cheers!

@geoffj-d61 geoffj-d61 merged commit fba856c into develop Dec 12, 2019
@youph
Copy link
Contributor

youph commented Dec 19, 2019

@geoff-d61 @adocherty I just discovered that this PR breaks the interpretability demos, e.g., this one: stellargraph/demos/interpretability/gcn/node-link-importance-demo-gcn.ipynb, since they call nx.ego_graph method on the SG object, resulting in this error:
G_ego = nx.ego_graph(G, target_nid, radius=len(gcn.activations))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-3e6b0d6ab545> in <module>
----> 1 G_ego = nx.ego_graph(G, target_nid, radius=len(gcn.activations))

~/anaconda3/envs/stellargraph/lib/python3.6/site-packages/networkx/generators/ego.py in ego_graph(G, n, radius, center, undirected, distance)
     64                                               weight=distance)
     65         else:
---> 66             sp = dict(nx.single_source_shortest_path_length(G, n, cutoff=radius))
     67 
     68     H = G.subgraph(sp).copy()

~/anaconda3/envs/stellargraph/lib/python3.6/site-packages/networkx/algorithms/shortest_paths/unweighted.py in single_source_shortest_path_length(G, source, cutoff)
     59     shortest_path_length
     60     """
---> 61     if source not in G:
     62         raise nx.NodeNotFound('Source {} is not in G'.format(source))
     63     if cutoff is None:

TypeError: argument of type 'NetworkXStellarGraph' is not iterable

How do we fix this?
Currently the interpretability demos don't work.
One way to fix this would be to replace G with Gnx in the above call.

UPDATE: issue #570, PR #571

@youph youph mentioned this pull request Dec 19, 2019
5 tasks
@huonw huonw deleted the feature/297-drop-networkx-inheritance branch January 3, 2020 06:17
huonw added a commit that referenced this pull request Jan 9, 2020
These were missed in #501 (and #502, after merging with #501) which renamed the
`StellarGraphBase` type to `StellarGraph`.

See: #632
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants