## Preliminaries

We need to install the Python package for working with networks, 'networkx', and then tell Python we're going to use it along with a few other pieces of useful code:

In [None]:
!pip install networkx

In [None]:
import requests
import networkx
import networkx as nx  # most of the code below uses the short nickname 'nx'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Grab the Data!

The dataset page describing the metadata is at https://purl.stanford.edu/mn425tz9757. That page also shows us two files, one with the orbis edges: https://stacks.stanford.edu/file/mn425tz9757/orbis_edges_0514.csv and one with the orbis nodes: https://stacks.stanford.edu/file/mn425tz9757/orbis_nodes_0514.csv.

We'll download that data and turn both files into a 'dataframe' that Python can manipulate. Once you've got the data, though, double-click on the csv files and have a look at what you've obtained.

In [None]:
# let's make a nice function to retrieve the data we want
def save_csv(url, filename):
    response = requests.get(url)
    if response.status_code == 200:  # 200 means the request succeeded and the server sent the file back
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Data saved to {filename}")
    else:
        print(f"Error: {response.status_code}")


In [None]:
url_edges = 'https://stacks.stanford.edu/file/mn425tz9757/orbis_edges_0514.csv'
url_nodes = 'https://stacks.stanford.edu/file/mn425tz9757/orbis_nodes_0514.csv'

filename1 = 'edges.csv'
filename2 = 'nodes.csv'

save_csv(url_edges, filename1)
save_csv(url_nodes, filename2)

## Explore the Data

First we turn the csv files into dataframes that Python can manipulate.

In [None]:
edges_df = pd.read_csv('edges.csv')
nodes_df = pd.read_csv('nodes.csv')
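
Before building anything, it can help to check that the two tables agree with each other. The sketch below is not part of the original workflow: it uses tiny made-up tables (not the real ORBIS data) and assumes the nodes table has `id`, `label`, `x` and `y` columns. It shows one way to spot ids that appear in the edge list but not in the nodes table, and to count missing values.

```python
import io
import pandas as pd

# Toy stand-ins for edges.csv and nodes.csv (made-up values, not the real data)
toy_edges_csv = io.StringIO("source,target,days\n1,2,1.5\n2,3,0.7\n3,4,2.0\n")
toy_nodes_csv = io.StringIO("id,label,x,y\n1,A,10.0,40.0\n2,B,11.0,41.0\n3,C,,\n")

toy_edges = pd.read_csv(toy_edges_csv)
toy_nodes = pd.read_csv(toy_nodes_csv)

# How big is each table, and what do the first rows look like?
print(toy_edges.shape, toy_nodes.shape)
print(toy_edges.head())

# Which node ids are used in the edge list but have no row in the nodes table?
edge_ids = set(toy_edges['source']) | set(toy_edges['target'])
missing_ids = edge_ids - set(toy_nodes['id'])
print("Ids used in edges but absent from nodes:", sorted(int(i) for i in missing_ids))

# How many missing values are there in each column of the nodes table?
missing_per_column = toy_nodes.isna().sum()
print(missing_per_column)
```

To try this on the real data, swap `toy_edges` and `toy_nodes` for `edges_df` and `nodes_df`.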

The next block of code builds up a network one step at a time. We create a variable called `G`, and we tell Python:

1. 'hey, use the networkx package's function `from_pandas_edgelist` to load up edges_df, and create a network where the source node for a relationship is in the 'source' column, the target for the relationship is in the 'target' column, and the strength of the relationship will be however many days it takes to get from source -> target'.
2. 'hey, these nodes are all specified by numbers, so go look at the nodes table and create a dictionary where we can look up any id number and find the actual label/name of the settlement'.
3. 'hey, use the `set_node_attributes` function from networkx with that dictionary, `id_to_label`, so that in our graph we now have the right label for each node'.

The next bit uses functions from matplotlib (via its nickname, plt) to create a new figure. We tell networkx to use an algorithm called 'spring layout' to figure out the position of each node for visualization. With spring layout, you imagine each edge in the graph as a spring whose strength can be set by an edge attribute (by default networkx looks for an attribute called 'weight'; our edges carry 'days' instead, so here every spring is equally strong). This pushes/pulls the nodes so that you can see something of the structure. There are many different layouts possible, but it's important to remember: UNLESS YOU'RE ACTUALLY USING GEOGRAPHIC COORDINATES for positioning, the x and y layout itself carries no meaning - something higher up the page isn't 'more important', for instance.

Then, finally, we use networkx's 'draw' functions to build our visualization of our graph, G.

In [None]:
# Create the graph from edges
G = nx.from_pandas_edgelist(edges_df, source='source', target='target', edge_attr='days')

# Create a mapping from node id to label
id_to_label = dict(zip(nodes_df['id'], nodes_df['label']))

# Add node attributes to the graph
nx.set_node_attributes(G, id_to_label, 'label')

# Create the plot
plt.figure(figsize=(12, 8))

# Choose a layout (you can experiment with different ones)
pos = nx.spring_layout(G, k=1, iterations=50)

# Draw the network
nx.draw_networkx_edges(G, pos, alpha=0.5, edge_color='gray')
nx.draw_networkx_nodes(G, pos, node_color='lightblue', 
                       node_size=50, alpha=0.8)

# Draw labels using the label attribute
labels = nx.get_node_attributes(G, 'label')
nx.draw_networkx_labels(G, pos, labels, font_size=5)

plt.title("Network Graph")
plt.axis('off')
plt.tight_layout()
plt.show()
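
Spring layout starts from random positions, so the picture will look different every time you run the cell. If you want a layout you can reproduce, `spring_layout` accepts a `seed` argument. Here is a minimal sketch on a small built-in example graph rather than our data; the seed value 42 is arbitrary.

```python
import networkx as nx
import numpy as np

# A small built-in example graph, standing in for our network
toy_graph = nx.path_graph(6)

# Same seed -> same random starting positions -> same final layout
pos_a = nx.spring_layout(toy_graph, k=1, iterations=50, seed=42)
pos_b = nx.spring_layout(toy_graph, k=1, iterations=50, seed=42)

same_layout = all(np.allclose(pos_a[node], pos_b[node]) for node in toy_graph.nodes())
print(f"Identical layouts with the same seed: {same_layout}")
```

Adding `seed=...` to the `nx.spring_layout(G, ...)` call in the cell above should have the same effect there.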

We *do* have the X and Y positioning for the nodes, so we *could* lay this network out geographically. However, if you check nodes.csv, you'll see that there are some errors and missing data in the positioning values. We would have to do a bunch of data cleaning to fix that, either finding the correct x and y values for this dataset or filtering the bad rows out. And THAT would have an impact on what comes next. Having some holes in our x,y data is only a problem if we are trying to lay the graph out using the x and y positioning; we're not doing spatial analysis here. Network analysis is powerful precisely because it explores _relative_ positioning, and that information (_this_ city connects to _that_ one, in x days) is all present in this data. So let's see what we can see!

(Incidentally, if you spot a long tail of connected cities/settlements in the visualization, you've just spotted Egypt! And if you notice that the whole thing is kinda arranged around two areas of lesser 'holes', you've just spotted the eastern and western halves of the Mediterranean!)
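
If you did want to try the geographic layout mentioned above, the 'filter it out' route might look something like the sketch below. It is only an illustration: the tables are made up, and the column names `x` and `y` are an assumption about how nodes.csv stores coordinates. Notice how dropping one node without coordinates also removes every edge that touched it, which is exactly why filtering would affect the analysis that follows.

```python
import networkx as nx
import pandas as pd

# Toy nodes table (made-up values); node 3 has no coordinates
toy_nodes = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'label': ['A', 'B', 'C', 'D'],
    'x': [10.0, 11.0, None, 12.5],
    'y': [40.0, 41.0, None, 39.0],
})
toy_edges = pd.DataFrame({'source': [1, 2, 3], 'target': [2, 3, 4], 'days': [1.5, 0.7, 2.0]})

toy_G = nx.from_pandas_edgelist(toy_edges, source='source', target='target', edge_attr='days')

# Keep only the rows that have both coordinates
located = toy_nodes.dropna(subset=['x', 'y'])

# A layout is just a dictionary: node id -> (x, y)
geo_pos = {row.id: (row.x, row.y) for row in located.itertuples()}

# Restrict the graph to the nodes we can actually place
G_geo = toy_G.subgraph(geo_pos.keys())
print(f"Kept {G_geo.number_of_nodes()} of {toy_G.number_of_nodes()} nodes")
print(f"Kept {G_geo.number_of_edges()} of {toy_G.number_of_edges()} edges")
```

The `geo_pos` dictionary can be handed to the networkx drawing functions in place of the spring layout, e.g. `nx.draw_networkx(G_geo, pos=geo_pos)`.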

## Measuring a Network 
By looking at how different nodes connect or not, we can begin to examine questions like: which individual cities or settlements are in a position to control the flow of information? This might hold implications for economic or social development. We might ask, are there any subgroups implied by these connections? And so on. It's important to always try to imagine: what does this metric mean in the _context_ of my data?

### Degree
The simplest indication of importance in a network is a metric called ‘degree’ or the number of connections a node has. What might it mean that a city has the most connections to other cities?

We can calculate degree like so:

In [None]:
networkx.degree(G)

...but that's just a list of ids and degree measurements and, unless you've got a really good memory, not all that useful. So let's see if we can make a _new_ dataframe that will have three columns: node, degree, and label.

In [None]:
# Get degrees and labels from the graph
degrees = dict(G.degree()) #we create a dictionary where the calculated 'degree' statistic is written down for each node 
labels = nx.get_node_attributes(G, 'label') # we also get the labels for each node

# then we use the pandas (pd) function 'DataFrame' to collect the node, degree, and label for each node.
degree_df = pd.DataFrame([
    {'node': node, 'degree': degrees[node], 'label': labels.get(node, node)}
    for node in G.nodes()
])

# and let's sort the dataframe so that we can see which settlements have the highest number of connections:
degree_df = degree_df.sort_values(by='degree', ascending=False)

# and then let's take a look!
degree_df
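
To make 'degree' concrete, here is a minimal sketch on a made-up toy network (the place names are hypothetical): one hub linked to three other places, plus one extra link. It also shows `nx.degree_centrality`, which is the same count divided by the number of *other* nodes, so that networks of different sizes can be compared.

```python
import networkx as nx

# A made-up toy network: 'Hub' is linked to three places,
# and there is one extra link between 'A' and 'B'
toy_graph = nx.Graph()
toy_graph.add_edges_from([('Hub', 'A'), ('Hub', 'B'), ('Hub', 'C'), ('A', 'B')])

# Raw degree: how many connections does each node have?
toy_degrees = dict(toy_graph.degree())
print(toy_degrees)

# Degree centrality: degree divided by (number of nodes - 1)
toy_centrality = nx.degree_centrality(toy_graph)
print(toy_centrality)
```

'Hub' touches every other node, so its degree is 3 and its degree centrality is 1.0; 'C' has a single connection.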

If you're a student of ancient history, the top 5 cities/settlements are going to jump right out at you. What do you think this implies about the cultural, economic, or historical importance of those cities? What era does this data represent, anyway?

#### Your Observations

Make some observations here. Also, if you'd like, it is possible to make a very nice horizontal bar chart showing that same information:

In [None]:
num_nodes_to_inspect = 10
degree_df[:num_nodes_to_inspect].plot(x='label', y='degree', kind='barh').invert_yaxis()


### Betweenness Centrality

What does it mean to be 'central' in a graph or network? There are lots of ways of measuring this. Degree can be seen as the simplest measure of being 'central' or important in the overall structure. Another measure is 'closeness', which looks at every node and works out the average length of the shortest path from it to every other node. Thus a node or settlement that has a low average path length to all other nodes is very central! Another measurement is 'betweenness centrality', and I tend to use this metric a lot because it captures something important. With betweenness, we're looking at ALL the shortest paths between EVERY pair of nodes in the graph. [Wikipedia](https://en.wikipedia.org/wiki/Centrality) is quite handy on this and says: "Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes."

Do you see why we might be interested in betweenness centrality in the context of the Roman communications networks?
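
Here is a minimal sketch of the 'bridge' idea on a made-up toy network: two tight triangles of places that can only reach each other through a single go-between, 'X'. The node names are hypothetical.

```python
import networkx as nx

# Made-up toy network: two triangles joined through one go-between node, 'X'
toy_graph = nx.Graph()
toy_graph.add_edges_from([
    ('a', 'b'), ('a', 'c'), ('b', 'c'),   # first triangle
    ('d', 'e'), ('d', 'f'), ('e', 'f'),   # second triangle
    ('c', 'X'), ('X', 'd'),               # 'X' links the two triangles
])

toy_degree = dict(toy_graph.degree())
toy_bw = nx.betweenness_centrality(toy_graph)

# Print each node's degree next to its betweenness, highest betweenness first
for node in sorted(toy_bw, key=toy_bw.get, reverse=True):
    print(f"{node}: degree={toy_degree[node]}, betweenness={toy_bw[node]:.3f}")
```

'X' has only two connections, fewer than 'c' or 'd', yet it scores highest on betweenness because every shortest path from one triangle to the other runs through it.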

We can call networkx to calculate `betweenness_centrality` on our graph like so:

In [None]:
networkx.betweenness_centrality(G)

But that's rather hard to make sense of, eh? So let's make a nice dataframe with that metric:

In [None]:

# Get betweenness scores for the graph, and call that variable 'bw'
bw = nx.betweenness_centrality(G)

# get the labels from the node attributes in our graph
labels = nx.get_node_attributes(G, 'label')

# make a dataframe where we have three columns node, betweenness, and label and make sure we grab the relevant node data each time!
bw_df = pd.DataFrame([
    {'node': node, 'betweenness': bw[node], 'label': labels.get(node, node)}
    for node in G.nodes()
])

# it's always nice to sort your data; here we'll sort it by the middle column 'betweenness'
bw_df = bw_df.sort_values(by='betweenness', ascending=False)

# and let's take a look:
bw_df

In [None]:
# and we can make a nice bar chart from the df:
num_nodes_to_inspect = 10
bw_df[:num_nodes_to_inspect].plot(x='label', y='betweenness', color='green', kind='barh').invert_yaxis()
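
The 'closeness' measure described earlier can be computed in much the same way. The sketch below is an illustration on a made-up toy route, not part of the original analysis. It also shows the optional `distance` argument, which tells networkx to measure path length by an edge attribute (here `days`) rather than by counting steps.

```python
import networkx as nx

# Made-up toy route: four places in a line, with travel times in days
toy_graph = nx.Graph()
toy_graph.add_edge('A', 'B', days=1)
toy_graph.add_edge('B', 'C', days=1)
toy_graph.add_edge('C', 'D', days=5)

# Unweighted: every edge counts as one step
closeness_steps = nx.closeness_centrality(toy_graph)

# Weighted: path length is the sum of 'days' along the route
closeness_days = nx.closeness_centrality(toy_graph, distance='days')

for node in toy_graph.nodes():
    print(f"{node}: by steps={closeness_steps[node]:.3f}, by days={closeness_days[node]:.3f}")
```

Counting steps, the two ends of the line ('A' and 'D') look equally remote; counting days, 'D' is clearly further from everyone else. Which version is more meaningful depends on the question you're asking of the data.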


#### Your observations
Interesting! You might have to look some of those places up. Do you spot any names that appeared when we looked at degree centrality? When you start looking at the graph and you start spotting particular nodes (or relationships) that keep emerging, maybe you're starting to spot something worth investigating further.

### Modularity & Communities

The last metric we'll look at today is called 'modularity'. You'll also see this called 'community detection'. There are a variety of algorithms for doing this, but all of them are trying to answer the question, 'are there natural subgraphs in this data?' That is to say: can we detect groups of nodes that are more densely connected to each other than they are to the rest of the network? If we had a network of friend relationships at a school, modularity might help us tell the jocks from the nerds, the band kids from the theatre kids. Some of these methods take into account the attributes of the edges - strength of the relationship, number of days to travel, number of interactions between the two nodes, whatever - rather than just the presence/absence of a relationship, so you always - always! - have to think hard about what a given metric means in a given context. Also: some of these algorithms involve an element of randomness, and all of them are approximations. They do have a chance of returning results that are wrong.

With our data on Roman connectivity, what do you think `community.greedy_modularity_communities` might imply? (Hint: you'll have to search for networkx and that algorithm, and track back through the documentation to work it out.)

In [None]:
# Get communities and labels from the graph
# other algorithms:
# nx.community.louvain_communities(G) - Louvain method
# nx.community.label_propagation_communities(G) - Label propagation
# nx.community.asyn_lpa_communities(G) - Asynchronous label propagation


# let's run the metric, and grab the labels
communities = nx.community.greedy_modularity_communities(G)
labels = nx.get_node_attributes(G, 'label')

# Create a mapping from node to community so that the result for each node is correctly put together
node_to_community = {}
for i, community in enumerate(communities):
    for node in community:
        node_to_community[node] = i

# then we'll make our dataframe
community_df = pd.DataFrame([
    {'node': node, 'community': node_to_community[node], 'label': labels.get(node, node)}
    for node in G.nodes()
])

# we'll sort it
community_df = community_df.sort_values(by='community')

# now let's have a look!
community_df

As I was saying, there is a chance that a city/settlement/node is assigned to the wrong community. The modularity score gives a rough measure of how well the algorithm has done: it compares the share of edges that fall inside communities with what you'd expect if the edges were wired up at random. The closer to 1, the stronger the evidence that the communities you've found are actually there in your data.

In [None]:
modularity_score = nx.community.modularity(G, communities)
print(f"Modularity score: {modularity_score:.3f}")
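
To get a feel for what the score means, here is a minimal sketch on a made-up toy network where the 'right' answer is obvious: two triangles joined by a single edge. We let the algorithm find communities, then score both the obvious split and a deliberately bad one.

```python
import networkx as nx

# Made-up toy network: two triangles joined by a single bridge edge (2-3)
toy_graph = nx.Graph()
toy_graph.add_edges_from([(0, 1), (0, 2), (1, 2),   # first triangle
                          (3, 4), (3, 5), (4, 5),   # second triangle
                          (2, 3)])                  # the bridge

# Let the algorithm look for communities
toy_communities = nx.community.greedy_modularity_communities(toy_graph)
print([sorted(c) for c in toy_communities])

# Score the 'obvious' split and a deliberately bad split, for comparison
good_split = [{0, 1, 2}, {3, 4, 5}]
bad_split = [{0, 3}, {1, 4}, {2, 5}]
good_score = nx.community.modularity(toy_graph, good_split)
bad_score = nx.community.modularity(toy_graph, bad_split)
print(f"Good split: {good_score:.3f}   Bad split: {bad_score:.3f}")
```

The bad split puts no connected pair in the same group, so it scores below zero. Even the obvious split on a graph this small scores well short of 1, so treat the number as a relative guide rather than a pass/fail mark.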

It'd be nice to know where a particular city is grouped. We can do this by filtering the dataframe according to the label field for a particular place, e.g. 'Roma':

In [None]:
community_df[community_df['label'] == 'Roma']

Ok, great: we know the community number for Rome. We're going to plot the Roman network data again, but this time we're going to colour it by the communities we've found. So first we figure out how many unique colours we'll need (one per community) and then we'll assign a colour palette to those communities. Then we also assign those colours to the individual nodes. Then we'll plot the graph like we did before, but telling it to use those colours. Let's create a legend while we're at it.

In [None]:
# Create the plot
plt.figure(figsize=(12, 8))

# Get community colors - create a color map
communities_list = community_df['community'].unique()
colors = plt.cm.Set3(np.linspace(0, 1, len(communities_list)))
community_colors = dict(zip(communities_list, colors))

# Create node color list based on community
node_colors = [community_colors[node_to_community[node]] for node in G.nodes()]

# Choose layout
pos = nx.spring_layout(G, k=1, iterations=50)

# Draw the network
nx.draw_networkx_edges(G, pos, alpha=0.5, edge_color='gray')
nx.draw_networkx_nodes(G, pos, node_color=node_colors, 
                       node_size=500, alpha=0.8)

# Draw labels
labels = nx.get_node_attributes(G, 'label')
nx.draw_networkx_labels(G, pos, labels, font_size=10)

# Create legend
legend_elements = [plt.Line2D([0], [0], marker='o', color='w', 
                             markerfacecolor=community_colors[comm], 
                             markersize=10, label=f'Community {comm}')
                  for comm in sorted(communities_list)]
plt.legend(handles=legend_elements, loc='upper right', bbox_to_anchor=(1.15, 1))

plt.title("Network Graph - Colored by Community")
plt.axis('off')
plt.tight_layout()
plt.show()

# Print community info
print(f"Rome is in community {node_to_community[community_df[community_df['label'] == 'Roma']['node'].iloc[0]]}")

Since we now know which community Rome is in _at least today, when I ran this code_, we can plot _just_ the nodes in its community, `3`. When you run the notebook, Rome might be assigned a different community number, so you'll need to change the code below appropriately!


In [None]:
# Get nodes in community 3
community_3_nodes = community_df[community_df['community'] == 3]['node'].tolist()

# Create subgraph with only community 3 nodes
G_community_3 = G.subgraph(community_3_nodes)

# Create the plot
plt.figure(figsize=(10, 8))

# Choose layout
#pos = nx.spring_layout(G_community_3, k=1, iterations=500)
pos = nx.fruchterman_reingold_layout(G_community_3, k=1, iterations=500)

# Draw the network
nx.draw_networkx_edges(G_community_3, pos, alpha=0.5, edge_color='gray')
nx.draw_networkx_nodes(G_community_3, pos, node_color='lightcoral', 
                       node_size=500, alpha=0.8)

# Draw labels
labels = nx.get_node_attributes(G_community_3, 'label')
nx.draw_networkx_labels(G_community_3, pos, labels, font_size=12)

plt.title("Community 3 - Rome's Community")
plt.axis('off')
plt.tight_layout()
plt.show()

# Print some info about this community
print(f"Community 3 has {len(community_3_nodes)} nodes:")
print(community_df[community_df['community'] == 3]['label'].tolist())
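
Rather than editing the number `3` by hand each time, you could look up Rome's community and reuse it. This is a minimal sketch using a made-up stand-in for `community_df`; the node ids, community numbers and 'Place' labels are hypothetical.

```python
import pandas as pd

# Made-up stand-in for community_df (the real one comes from the cells above)
toy_community_df = pd.DataFrame({
    'node': [101, 102, 103, 104],
    'community': [0, 2, 2, 1],
    'label': ['Place A', 'Roma', 'Place B', 'Place C'],
})

# Find the row labelled 'Roma' and read off its community number
rome_rows = toy_community_df[toy_community_df['label'] == 'Roma']
rome_community = int(rome_rows['community'].iloc[0])

# Use that number instead of a hard-coded 3
rome_community_nodes = toy_community_df.loc[
    toy_community_df['community'] == rome_community, 'node'
].tolist()

print(f"Roma is in community {rome_community}, with nodes {rome_community_nodes}")
```

With the real `community_df`, the resulting list of nodes could be passed to `G.subgraph(...)` just as `community_3_nodes` is above.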

#### Your Observations

So... what stands out for you? Make some observations.



## Exercise

Take a look at [[network-exercise]] and see how you get on.

For reference, I also show you how to do all of this in R in the [[networks-via-R.ipynb]] notebook.


Moving on to our [[next notebook|abm.ipynb]], we'll create some software agents and let them swarm all over this network. This is a kind of artificial intelligence where we look for emergent properties of behaviour. This [older piece by me](https://core.ac.uk/download/pdf/147823016.pdf) shows you some of what you might encounter...