# Exploratory Workflow

If you are looking at network data for the first time, use this workflow.

If you are updating a network diagram based on changes made from an earlier session, use the [Iteration Workflow](./iteration_workflow.ipynb).

In [None]:
from py2cytoscape.data.cynetwork import CyNetwork
from py2cytoscape.data.cyrest_client import CyRestClient
from py2cytoscape.data.style import StyleUtil

import py2cytoscape.util.cytoscapejs as cyjs
import py2cytoscape.cytoscapejs as renderer

import igraph
import pandas as pd
import json
import numpy as np

import sand.io as io
import sand.graph as sg
import sand.cytoscape.positions as scp

In [487]:
commit = "exercise"
network_collection_name = "process"
# You might also want to use %cd magic to change to a different working directory.
data_path = "./data/" + network_collection_name + "-" + commit
edge_file = data_path + ".csv"
positions_file = data_path + "-positions.csv"

In [502]:
%%writefile $edge_file
LeadGen/Attend Meetups,LeadGen/Business with Software Problem,1
LeadActivation/Gathering leads from website,LeadGen/Attend Meetups,1
Discovery/Gather info during assessment,LeadActivation/Gathering leads from website,1

Overwriting ./data/process-exercise.csv


In [503]:
edgelist = io.csv_to_edgelist(edge_file)
g = sg.edgelist_to_igraph(edgelist)

In [504]:
g.summary()

'IGRAPH D-W- 4 3 -- \n+ attr: label (v), weight (e)'

In [505]:
# Access vertex attributes
g.vs[0]['label']

'LeadGen/Attend Meetups'

In [None]:
g.is_weighted()

A loop is an edge for which both ends connect to a single vertex.
A pair of vertices with more than one edge between them is a multi-edge.

A multi-graph is a graph with loops or multi-edges.

A graph that is not a multi-graph is called a simple graph, and its edges are referred to as proper edges.

Checking whether or not a network is simple is an important preliminary step in doing a typical network analysis, as many models and methods assume the input graph to be simple or behave differently if it is not.
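
To make these definitions concrete, here is a minimal sketch that checks an edge list for loops and multi-edges without igraph. The function names and the sample edges are made up for illustration, and the sketch treats edges as directed, so `a -> b` and `b -> a` count as two distinct edges rather than a multi-edge.

```python
from collections import Counter


def find_loops_and_multi_edges(edges):
    """Return (loops, multi_edges) for a list of directed (source, target) pairs."""
    # A loop starts and ends at the same vertex.
    loops = [(s, t) for s, t in edges if s == t]
    # A multi-edge is a (source, target) pair that occurs more than once.
    counts = Counter(edges)
    multi_edges = [edge for edge, n in counts.items() if n > 1]
    return loops, multi_edges


def is_simple(edges):
    """A graph is simple when it has neither loops nor multi-edges."""
    loops, multi_edges = find_loops_and_multi_edges(edges)
    return not loops and not multi_edges


# Hypothetical sample data, not taken from the notebook's CSV file.
sample = [("a", "b"), ("b", "c")]
with_loop = sample + [("c", "c")]
with_multi_edge = sample + [("a", "b")]
mutual = [("a", "b"), ("b", "a")]
```

Here `is_simple(sample)` and `is_simple(mutual)` hold, while the other two lists fail the check.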

In [None]:
g.is_simple()

In [None]:
g.is_directed()

A DAG is a directed graph with no directed cycles.
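
One way to test this property by hand is Kahn's algorithm: repeatedly remove vertices that have no incoming edges, and if every vertex can be removed, there is no directed cycle. The sketch below is a dependency-free illustration, not a description of how igraph implements `is_dag()`. The vertex names are shorthand placeholders.

```python
from collections import defaultdict, deque


def is_dag(edges):
    """Return True if the directed edge list contains no directed cycle."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    vertices = set()
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
        vertices.update((source, target))

    # Start with every vertex that has no incoming edge.
    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)

    # Vertices on a directed cycle never reach indegree 0, so they are never removed.
    return removed == len(vertices)


# Hypothetical sample data: a chain, and the same chain closed into a cycle.
chain = [("D", "C"), ("C", "A"), ("A", "B")]
cycle = chain + [("B", "D")]
```

With these inputs, `is_dag(chain)` is true and `is_dag(cycle)` is false.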

In [None]:
g.is_dag()

A vertex $v$ in a graph $G$ is said to be _reachable_ from another vertex $u$ if there exists a walk from $u$ to $v$. A graph is said to be _connected_ if every vertex is reachable from every other. [SANDR - p 23] A connected graph with no cycles is called a _tree_.

A digraph $G$ is _weakly connected_ if its underlying graph (i.e., the result of stripping away the labels 'source' and 'target' from $G$, so that edge directions are ignored) is connected.
It is called _strongly connected_ if every vertex $v$ is reachable from every other vertex $u$ by a directed walk.
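
The difference between the two notions can be sketched with plain graph traversal. The code below is an illustrative, igraph-free version with made-up function names. It tests weak connectivity by ignoring edge direction, and strong connectivity by checking that one root vertex reaches every vertex and is reached by every vertex.

```python
from collections import defaultdict


def reachable(adjacency, start):
    """Return the set of vertices reachable from start (including start)."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def connectivity(edges):
    """Return (weakly_connected, strongly_connected) for a non-empty edge list."""
    if not edges:
        raise ValueError("edge list is empty")
    forward = defaultdict(set)     # follows edge direction
    backward = defaultdict(set)    # follows edges in reverse
    undirected = defaultdict(set)  # ignores direction
    vertices = set()
    for s, t in edges:
        forward[s].add(t)
        backward[t].add(s)
        undirected[s].add(t)
        undirected[t].add(s)
        vertices.update((s, t))

    root = next(iter(vertices))
    weak = reachable(undirected, root) == vertices
    # Strongly connected: root reaches everyone, and everyone reaches root.
    strong = (reachable(forward, root) == vertices
              and reachable(backward, root) == vertices)
    return weak, strong


# Hypothetical sample data.
chain = [("D", "C"), ("C", "A"), ("A", "B")]
cycle = chain + [("B", "D")]
```

A directed chain is weakly but not strongly connected, while closing it into a cycle makes it strongly connected as well.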

In [None]:
g.is_connected(mode="weak")

In [None]:
g.is_connected(mode="strong")

A common notion of distance between vertices on a graph is defined as the length of the shortest path(s) between the vertices. The value of the longest distance in a graph is called the _diameter_ of the graph.
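
As a rough illustration of these quantities, the sketch below runs a breadth-first search from every vertex of a small directed chain and derives the diameter and the average path length from the resulting hop counts. It ignores edge weights and unreachable pairs, which is an assumption of this sketch. igraph's `diameter()` and `average_path_length()` have their own parameters for direction, weights, and disconnected graphs, so their results can differ.

```python
from collections import defaultdict, deque


def shortest_path_lengths(edges):
    """Return {(u, v): hops} for every ordered pair where v is reachable from u."""
    adjacency = defaultdict(list)
    vertices = set()
    for s, t in edges:
        adjacency[s].append(t)
        vertices.update((s, t))

    lengths = {}
    for start in vertices:
        # Breadth-first search visits vertices in order of increasing distance.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        for v, d in dist.items():
            if v != start:
                lengths[(start, v)] = d
    return lengths


# Hypothetical four-vertex chain: D -> C -> A -> B.
chain = [("D", "C"), ("C", "A"), ("A", "B")]
lengths = shortest_path_lengths(chain)
diameter = max(lengths.values())                 # longest shortest path
average = sum(lengths.values()) / len(lengths)   # mean over reachable pairs
```

For this chain there are six reachable ordered pairs with lengths 1, 1, 1, 2, 2, and 3, so the diameter is 3 and the average path length is 10/6.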

In [None]:
g.diameter()

In [None]:
g.average_path_length()

The following are vertex attributes computed from the network structure that we want to save and use in the visualization and analysis:

In [None]:
indegree  = g.degree(mode="in")
g.vs['indegree'] = indegree
outdegree = g.degree(mode="out")
g.vs['outdegree'] = outdegree

In [None]:
degrees = {v['label']: {'indegree': v['indegree'], 'outdegree': v['outdegree']} for v in g.vs}

## Properties of namespaces

### What is the degree distribution among vertex namespaces?

In [None]:
from itertools import groupby

sorted_degrees = sorted(degrees.items(), key=lambda tup: tup[0])

ns_degrees = {}
for key, group in groupby(sorted_degrees, lambda x: x[0].split('/')[0]):
    outdegrees = []
    indegrees = []
    for entry in group:
        outdegrees.append(entry[1]['outdegree'])
        indegrees.append(entry[1]['indegree'])
    ns_degrees[key] = {'indegree': sum(indegrees), 'outdegree': sum(outdegrees)}

In [None]:
[[key,value['outdegree']] for (key, value) in ns_degrees.items()]

In [None]:
[[key,value['indegree']] for (key, value) in ns_degrees.items()]
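
As a cross-check on the grouping above, the per-namespace totals can also be counted directly from the edges. This sketch uses the three rows written to the edge file earlier and assumes the CSV columns are source, target, weight. The `namespace` helper is introduced here for illustration only.

```python
from collections import Counter

# The three sample edges from the edge file, as (source, target) pairs.
edges = [
    ("LeadGen/Attend Meetups", "LeadGen/Business with Software Problem"),
    ("LeadActivation/Gathering leads from website", "LeadGen/Attend Meetups"),
    ("Discovery/Gather info during assessment",
     "LeadActivation/Gathering leads from website"),
]


def namespace(label):
    """Return the part of a vertex label before the first '/'."""
    return label.split("/")[0]


# Each edge adds one to the outdegree of its source namespace
# and one to the indegree of its target namespace.
ns_outdegree = Counter(namespace(source) for source, _ in edges)
ns_indegree = Counter(namespace(target) for _, target in edges)
```

A `Counter` returns 0 for missing keys, so `ns_indegree["Discovery"]` is 0 even though no edge points into that namespace.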

## Explore the network of namespaces

In [None]:
ns_edges = {}

from collections import namedtuple
Edge = namedtuple('Edge', ['source', 'target'])

def fqn_to_ns(fqn):
    return fqn.split('/')[0]

In [None]:
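for e in edgelist:
    k = Edge(source=fqn_to_ns(e['source']), target=fqn_to_ns(e['target']))
    # Sum the edge weights for each (source, target) namespace pair.
    # dict.get() replaces dict.has_key(), which does not exist in Python 3.
    ns_edges[k] = ns_edges.get(k, 0) + int(e['weight'])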

In [None]:
list(ns_edges.items())[0:5]

In [None]:
ns_vertices = set()
for e in ns_edges.keys():
    ns_vertices.add(e.source)
    ns_vertices.add(e.target)

In [None]:
ns_vertices
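
To make the aggregation above concrete, here is a small self-contained sketch that collapses the three sample edges into namespace-level edges. It assumes the CSV columns are source, target, weight, and it uses plain tuples instead of the `edgelist` structure returned by `sand.io`, whose exact shape is not shown here.

```python
from collections import Counter, namedtuple

Edge = namedtuple("Edge", ["source", "target"])

# The three sample rows from the edge file, as (source, target, weight).
rows = [
    ("LeadGen/Attend Meetups", "LeadGen/Business with Software Problem", 1),
    ("LeadActivation/Gathering leads from website", "LeadGen/Attend Meetups", 1),
    ("Discovery/Gather info during assessment",
     "LeadActivation/Gathering leads from website", 1),
]


def fqn_to_ns(fqn):
    """Return the namespace part of a fully qualified name."""
    return fqn.split("/")[0]


# Sum the weights of all vertex-level edges that map to the same namespace pair.
ns_edge_weights = Counter()
for source, target, weight in rows:
    ns_edge_weights[Edge(fqn_to_ns(source), fqn_to_ns(target))] += weight

# Every namespace that appears at either end of an edge becomes a vertex.
ns_vertex_names = {ns for edge in ns_edge_weights for ns in edge}
```

Note that the first row connects two vertices in the same namespace, so it collapses into a `LeadGen` to `LeadGen` loop in the namespace graph.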

In [None]:
ns_graph = igraph.Graph(directed=True)
ns_graph.add_vertices(list(ns_vertices))

# The graph is already directed; extra keyword arguments to add_edge become edge attributes.
for edge, weight in ns_edges.items():
    ns_graph.add_edge(edge.source, edge.target, weight=weight)

In [None]:
ns_graph.summary()

In [None]:
# Create a list of patterns for all namespaces that we want to keep:
ns_names_to_keep = ['topology', 'clojure.java.io', 'clojure.repl', 'clojure.tools.namespace.file', 'clojure.zip']

ns_interest = ns_graph.vs(lambda v: any(match in v['name'] for match in ns_names_to_keep))

ns_subgraph = ns_graph.subgraph(ns_interest)
ns_subgraph.simplify()  # Remove loops and collapse multi-edges
ns_subgraph.summary()

## Load into Cytoscape with a default layout

In [506]:
# Create py2cytoscape client
cy = CyRestClient()

In [493]:
cy.session.delete()

In [507]:
ns_network = cy.network.create_from_igraph(g, name="namespaces", collection=network_collection_name)

In [508]:
ns_network_id = ns_network.get_id()
ns_network_id

5642

In [509]:
# Apply layout
cy.layout.apply(name='force-directed', network=ns_network)

## Apply Style

In [510]:
# Get a reference to the existing style
curved = cy.style.create('Curved_1')

In [511]:
# Map the label property in the igraph data to Cytoscape's NODE_LABEL visual property
curved.create_passthrough_mapping(column='label', vp='NODE_LABEL', col_type='String')

cy.style.apply(curved, ns_network)

## Explore in Cytoscape

At this point, we have a list of interesting namespaces laid out in Cytoscape. We can now start visually exploring to look for interesting features to dig into more deeply.

When this step is done, we can move on to looking at the function network beyond just the namespaces.

## Extract the subgraph of local namespaces from the full graph

There are some analyses where it will be useful to see all the vertices. For the high-level architecture diagram, we can focus on the library's namespaces.

In [None]:
# List all patterns of vertex names that we want to keep:
names_to_keep = ['topology', 'clojure.core/*err*', 'clojure.core/println']

In [None]:
lv = g.vs(lambda v: any(match in v['label'] for match in names_to_keep))

# lg...the local graph
lg = g.subgraph(lv, implementation='copy_and_delete')

In [None]:
# Copy the label attribute to name so that Cytoscape will pick it up without extra mapping
lg.vs['name'] = lg.vs['label']
lg.summary()

In [None]:
# Visualize in Cytoscape
fn_network = cy.network.create_from_igraph(lg, name=commit, collection=network_collection_name)
cy.layout.apply(name='force-directed', network=fn_network)
cy.style.apply(curved, fn_network)

### Map attributes to visual properties

In [None]:
# degrees = fn_network.get_node_column('outdegree')

# Scale color of nodes
# color_gradient = StyleUtil.create_2_color_gradient(min=1, max=degrees.max(), colors=('white', '#FFCC00'))
# curved.create_continuous_mapping(column='outdegree', vp='NODE_FILL_COLOR', col_type='Double', points=color_gradient)

# Scale size of nodes
# degree_to_size = StyleUtil.create_slope(min=0, max=degrees.max(), values=(30, 80))
# curved.create_continuous_mapping(column='outdegree', vp='NODE_HEIGHT', col_type='Double', points=degree_to_size)
# curved.create_continuous_mapping(column='outdegree', vp='NODE_WIDTH', col_type='Double', points=degree_to_size)
# curved.create_continuous_mapping(column='outdegree', vp='NODE_LABEL_FONT_SIZE', col_type='Double', points=degree_to_size)

In [None]:
# BUG: All weights are 1 because igraph's subgraph method loses the weight attribute.
# weights = fn_network.get_edge_column('weight')

# weight_to_size = StyleUtil.create_slope(min=weights.min(), max=weights.max(), values=(2,10))
# curved.create_continuous_mapping(column='weight', vp="EDGE_WIDTH", col_type='Double', points=weight_to_size)

In [None]:
cy.style.apply(curved, fn_network)

# Note: a couple of manual steps remain in Cytoscape when using the default Curved style:
#   1. Uncheck 'Lock node width and height'
#   2. Remove the default size mapping

## Save the updated layout coordinates after making changes

One benefit of this workflow over solutions that just render static diagrams is the ability to make changes manually to the network layout in Cytoscape.

After making changes, save the coordinates for a later session using the [Iteration Workflow](./iteration_workflow.ipynb).

In [None]:
positions_file = data_path + "-positions.csv"

# scp.positions_to_csv(network=fn_network, path=positions_file)
scp.positions_to_csv(network=ns_network, path=positions_file)

You can now safely close Cytoscape.

## Future Work

### Interpretation of articulation points

A vertex is an articulation point if its removal increases the number of connected components in the graph:

In [None]:
g.vs(g.articulation_points())['label']
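
The definition can be checked by brute force on small graphs: remove each vertex in turn and count the connected components that remain. The sketch below does exactly that on an undirected view of a hypothetical four-vertex chain. It is meant only to illustrate the definition and is far slower than the linear-time algorithms a graph library would normally use.

```python
from collections import defaultdict


def count_components(vertices, edges):
    """Count connected components, ignoring edge direction."""
    adjacency = defaultdict(set)
    for s, t in edges:
        adjacency[s].add(t)
        adjacency[t].add(s)
    seen = set()
    components = 0
    for start in vertices:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components


def articulation_points(edges):
    """Return vertices whose removal increases the number of components."""
    vertices = {v for edge in edges for v in edge}
    baseline = count_components(vertices, edges)
    points = set()
    for candidate in vertices:
        remaining_vertices = vertices - {candidate}
        remaining_edges = [(s, t) for s, t in edges if candidate not in (s, t)]
        if count_components(remaining_vertices, remaining_edges) > baseline:
            points.add(candidate)
    return points


# Hypothetical sample data: a chain D - C - A - B and a triangle.
chain = [("D", "C"), ("C", "A"), ("A", "B")]
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
```

In the chain, the two interior vertices `C` and `A` are articulation points, while the triangle has none because every vertex has an alternative route around it.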

What are the implications of these articulation points in the domain of function dependency graphs? We'd need to look at several examples across multiple applications. In the context of a larger system architecture, these might indicate single points of failure that would partition the system.