# Introduction to graphs with Python and its module networkx

* Dataset: [TUDataset](https://chrsmrrs.github.io/datasets/docs/datasets/)
* Tools: [pandas](https://pandas.pydata.org), [numpy](http://www.numpy.org), [networkx](https://networkx.github.io)

## Importing packages

By convention, the first lines of code are always about importing the packages we'll use.

In [None]:
import pandas as pd
import numpy as np
import networkx as nx

Tutorials on pandas can be found at:
* https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
* https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

Tutorials on numpy can be found at:
* https://numpy.org/doc/stable/user/quickstart.html
* <http://www.scipy-lectures.org/intro/numpy/index.html>
* <http://www.scipy-lectures.org/advanced/advanced_numpy/index.html>

A tutorial on networkx can be found at:
* https://networkx.org/documentation/stable/tutorial.html

## Import the data

We will play with a collection of protein graphs available on the [TU Dortmund Dataset repository](https://chrsmrrs.github.io/datasets/). For simplicity, the zipped protein dataset has been added to the tutorial repository, in the `data/` folder, and will be loaded from there. This dataset contains a collection of 1113 graphs with a binary label for each. Each node has a single attribute and a label.
![Proteins as graphs, visualized with Gephi](Proteinsgraph.png)


In [None]:
# The data is compressed in a zip file
import zipfile
archive = zipfile.ZipFile('data/PROTEINS.zip', 'r')

# We use pandas to read csv files (after being uncompressed by zipfile)
# There are 5 data files, each one containing different information about the graphs

# Connections between nodes
edges = pd.read_csv(archive.open('PROTEINS/PROTEINS_A.txt'), names=['Source','Target'])
# Indicator associated to each node: indicates the graph it belongs to (from the 1113 possible graphs)
graph_ind = pd.read_csv(archive.open('PROTEINS/PROTEINS_graph_indicator.txt'), names=['Graph_indicator'])
# Label for each graph
graph_labels = pd.read_csv(archive.open('PROTEINS/PROTEINS_graph_labels.txt'), names=['Label'])
# Attribute for each node of the dataset
node_attributes = pd.read_csv(archive.open('PROTEINS/PROTEINS_node_attributes.txt'), names=['Node_attribute'])
# Label for each node
node_labels = pd.read_csv(archive.open('PROTEINS/PROTEINS_node_labels.txt'), names=['Label'])

## Explore the data

In [None]:
edges

In [None]:
graph_ind

We can see we have 1113 different graphs and 43471 nodes in total (the rows are indexed from 0 to 43470).

In [None]:
print(graph_labels)
print('How many unique graph labels are there? What are their values?')
print(graph_labels['Label'].unique())
print(graph_labels['Label'].value_counts())

In [None]:
print(node_labels)
print('How many unique node labels are there? What are their values?')
print(node_labels['Label'].value_counts())

In [None]:
print(node_attributes)
print('How many unique node attributes are there? What are their values?')
print(node_attributes['Node_attribute'].value_counts())

Individual values can be accessed by position with `iloc` or by label with `loc`:

In [None]:
edges.iloc[0, :]

In [None]:
edges.loc[0, 'Target']
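
The difference between the two accessors is easier to see when the row labels are not simply 0, 1, 2, ... The following minimal sketch uses a small toy dataframe (`toy_edges` and its index values are invented for illustration, they are not part of the dataset):

```python
import pandas as pd

# Toy edge list with the same column names as the tutorial's edge dataframe.
# The index 10, 20, 30 is an arbitrary choice made to separate labels from positions.
toy_edges = pd.DataFrame({'Source': [1, 2, 3], 'Target': [2, 3, 1]},
                         index=[10, 20, 30])

first_row = toy_edges.iloc[0, :]        # position-based: first row, all columns
by_label = toy_edges.loc[10, 'Target']  # label-based: row labelled 10, column 'Target'
by_position = toy_edges.iloc[0, 1]      # the same cell, addressed by position

print(first_row.to_dict(), by_label, by_position)
```

With the default index used in this tutorial, row labels and row positions coincide, so `loc[0, ...]` and `iloc[0, ...]` point to the same row.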

Select a single graph and display its node labels:

In [None]:
node_labels[graph_ind['Graph_indicator'] == 1]
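
The selection above relies on a boolean mask: the comparison produces one True/False value per node, and indexing with it keeps only the matching rows. Here is a small sketch on toy data (the two toy dataframes are invented stand-ins, assuming one row per node in the same order in both):

```python
import pandas as pd

# Toy stand-ins for the graph indicator and node label dataframes.
toy_graph_ind = pd.DataFrame({'Graph_indicator': [1, 1, 2, 2, 2]})
toy_node_labels = pd.DataFrame({'Label': [0, 1, 0, 0, 2]})

# Boolean Series: True for the nodes that belong to graph 2
mask = toy_graph_ind['Graph_indicator'] == 2
# Keep only the rows of the label dataframe where the mask is True
labels_of_graph_2 = toy_node_labels[mask]

# Number of nodes in each graph, computed from the indicator column
nodes_per_graph = toy_graph_ind['Graph_indicator'].value_counts().sort_index()

print(labels_of_graph_2['Label'].tolist())
print(nodes_per_graph.to_dict())
```

This works because both dataframes share the same index, so the mask built from one can be applied to the other.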

## The graph
Now that the data has the appropriate shape, we may build the graph using `networkx`. It is a simple iteration over the rows of the dataframe, using the `add_edge` method of the graph. Alternatively, you may use the `add_edges_from` method with a list of edges as input.

In [None]:
# A simple command to create the graph from the edge list.
graph = nx.Graph() # Graph class is for undirected graph
_ = [graph.add_edge(source, target) for source, target in zip(edges['Source'], edges['Target'])]

We can also use the `add_edges_from` function instead of a list comprehension

In [None]:
graph = nx.Graph()
graph.add_edges_from(edges[['Source', 'Target']].itertuples(name=None, index=False))

And finally, the dataframe can be used directly to create the graph thanks to the `from_pandas_edgelist` function.

In [None]:
graph = nx.from_pandas_edgelist(edges, source='Source', target='Target', create_using=nx.Graph())
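
As a quick check, the three constructions can be compared on a toy edge list. This is only a sketch: `toy_edges` is invented for illustration and deliberately lists each edge in both directions, to show that an undirected `nx.Graph` stores such a pair only once.

```python
import pandas as pd
import networkx as nx

# Toy edge list: edges 1-2 and 2-3, each written in both directions
toy_edges = pd.DataFrame({'Source': [1, 2, 2, 3], 'Target': [2, 1, 3, 2]})

# 1. One edge at a time
g_loop = nx.Graph()
for source, target in zip(toy_edges['Source'], toy_edges['Target']):
    g_loop.add_edge(source, target)

# 2. All edges at once, from an iterable of (source, target) tuples
g_bulk = nx.Graph()
g_bulk.add_edges_from(toy_edges[['Source', 'Target']].itertuples(index=False, name=None))

# 3. Directly from the dataframe
g_pandas = nx.from_pandas_edgelist(toy_edges, source='Source', target='Target')


def edge_set(g):
    """Return the edges as a set of unordered pairs, to compare undirected graphs."""
    return {frozenset(edge) for edge in g.edges}


same_edges = edge_set(g_loop) == edge_set(g_bulk) == edge_set(g_pandas)
print(same_edges, g_loop.number_of_edges())
```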

In addition, let us add some attributes to the nodes:

In [None]:
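# The dataframe index starts at 0 but the node ids of the edge list start at 1:
# shift the index by one so that each row is attached to the right node.
node_props = node_attributes.rename(index=lambda i: i + 1).to_dict()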

In [None]:
for key in node_props:
    nx.set_node_attributes(graph, node_props[key], key)

Let us check if it is correctly recorded:

In [None]:
print(graph.nodes[1], graph.nodes[2])
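
`nx.set_node_attributes` expects a dictionary whose keys are node ids, so the dataframe index must match the node ids of the graph. The sketch below illustrates this on toy data, assuming (as in the dataset files) that node ids start at 1 and that the rows are given in node-id order; `toy_graph` and `toy_attr` are invented names.

```python
import pandas as pd
import networkx as nx

toy_graph = nx.Graph([(1, 2), (2, 3)])  # node ids start at 1
toy_attr = pd.DataFrame({'Node_attribute': [10.0, 20.0, 30.0]})  # default index: 0, 1, 2

# Shift the index so that row i describes node i + 1
toy_attr.index = toy_attr.index + 1

for column, values in toy_attr.to_dict().items():
    # `values` is a dictionary mapping node id -> attribute value
    nx.set_node_attributes(toy_graph, values, column)

print(toy_graph.nodes[1], toy_graph.nodes[3])
```

Without the index shift, the value 10.0 would be keyed by 0 and the value intended for node 2 would end up on node 1.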

In [None]:
# Let us add the node labels and the graph indicator as node properties too.
# The dataframe index starts at 0 but node ids start at 1, so the index is shifted by one.
node_l = node_labels.rename(index=lambda i: i + 1).to_dict()
for key in node_l:
    nx.set_node_attributes(graph, node_l[key], key)
node_gi = graph_ind.rename(index=lambda i: i + 1).to_dict()
for key in node_gi:
    nx.set_node_attributes(graph, node_gi[key], key)

# Exercise: write a function containing this for loop to load attributes from dataframes into the networkx graph

**Exercise:** 
* Have a look at the [networkx documentation](https://networkx.org/documentation/stable/tutorial.html) and display the number of nodes and edges of the graph.
* Display the neighbors of node `1` (node ids start at 1 in this dataset), then its 2-hop neighbors.

In [None]:
# your code here

Let us use some convenient functions of networkx. Our global graph contains all the subgraphs of the dataset. Let us check that `graph` is a disconnected graph, i.e. that it is made of subgraphs that are not connected to each other.

In [None]:
print('Is the graph connected?')
nx.is_connected(graph)

We extract each connected component as a separate small graph and collect them in a list.

In [None]:
S = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]

In [None]:
idx = 0
Gs = S[idx]
print('The graph with index {} in the list has {} nodes.'.format(idx, Gs.number_of_nodes()))
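
To see what the component extraction does, here is a minimal sketch on a toy graph made of three separate pieces (the toy graph is invented for illustration; sorting the components by size is an extra step not used above):

```python
import networkx as nx

# Toy graph: a path 1-2-3, a single edge 4-5, and an isolated node 6
toy = nx.Graph([(1, 2), (2, 3), (4, 5)])
toy.add_node(6)

connected = nx.is_connected(toy)

# Each component is a set of node ids; sort them from largest to smallest
components = sorted(nx.connected_components(toy), key=len, reverse=True)

# `subgraph` returns a read-only view, `.copy()` turns it into an independent graph
subgraphs = [toy.subgraph(c).copy() for c in components]
sizes = [sg.number_of_nodes() for sg in subgraphs]

print(connected, sizes)
```

Note that a connected component is not necessarily a whole graph of the dataset: if one of the original graphs is itself disconnected, it is split into several components.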

## Graph visualization

Let us now visualize the small graph `Gs` extracted above. We will use networkx for that.

The following line is a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html). It enables plotting inside the notebook.

In [None]:
%matplotlib inline

You may also try `%matplotlib notebook` for a zoomable version of plots.

Let us draw the graph with two different [layout algorithms](https://en.wikipedia.org/wiki/Graph_drawing#Layout_methods). As you will see, networkx and matplotlib are not very convenient for plotting graphs. We will see other visualization tools later on.

In [None]:
nx.draw_spectral(Gs)

In [None]:
nx.draw_spring(Gs)
# You may also display the node ids with the following command,
# but the labels overlap when the graph is too big:
#
# nx.draw_spring(Gs, with_labels=True)

## Saving the graph
Save the graph to disk in the `gexf` format, readable by Gephi and other tools that manipulate graphs. You may now explore the graph using [Gephi](https://gephi.org/) and compare the visualizations.

In [None]:
# Global graph
nx.write_gexf(graph, 'data/Gprotein.gexf')
# first subgraph
nx.write_gexf(Gs, 'data/Gproteinsubgraph.gexf')

Note: the `gexf` format allows one to save node and edge properties, except if the properties have a complex structure such as python lists or dictionaries. In that case, these structures must be converted to strings (using json) before saving the graph.
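
The conversion mentioned in the note can be sketched as follows, on a toy graph with a hypothetical list-valued node property called `tags` (the protein graph used in this tutorial has no such property). The file is written to a temporary folder, chosen here only to keep the example self-contained.

```python
import json
import os
import tempfile

import networkx as nx

# Toy graph with a list-valued node property (hypothetical 'tags' attribute)
toy = nx.Graph()
toy.add_node(1, tags=['a', 'b'])
toy.add_node(2, tags=[])
toy.add_edge(1, 2)

# Work on a copy so that the original graph keeps its Python lists
export = toy.copy()
for _, data in export.nodes(data=True):
    data['tags'] = json.dumps(data['tags'])  # list -> JSON string

path = os.path.join(tempfile.mkdtemp(), 'toy.gexf')
nx.write_gexf(export, path)

# Reading back: node ids are converted to int, the property is decoded from JSON
reloaded = nx.read_gexf(path, node_type=int)
restored_tags = json.loads(reloaded.nodes[1]['tags'])

print(restored_tags)
```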