# 05 Network Analysis 

![network](https://d33wubrfki0l68.cloudfront.net/10d43940cb362319c951f2e6febe0bf711be7181/1f212/img/portfolio/2018/intangible-cultural-heritage/intangible_cultural_heritage_detail_2.png)  
**Source:** Nadieh Bremer ([Link to her post](https://www.visualcinnamon.com/portfolio/intangible-cultural-heritage/))

## Table of Contents

1. What is a Network?
2. Important Concepts of a Network
3. Analysis
4. Summary
5. Resources

## 1. What is a Network?

> "Network analysis (NA) is a set of integrated techniques to depict relations among actors and to analyze the social structures that emerge from the recurrence of these relations. The basic assumption is that better explanations of social phenomena are yielded by analysis of the relations among entities." ~ [A.M. Chiesi](https://www.sciencedirect.com/science/article/pii/B008043076704211X#!)

Networks are usually defined by two main components: nodes and edges. Nodes are the points in a graph, while edges are the connections between pairs of nodes. Both components can also hold metadata, that is, additional information about the network.
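
As a minimal sketch of these two components, the toy graph below stores metadata on both a node and an edge. The names, roles, and weight are invented for illustration and are not part of any dataset used in this session.

```python
import networkx as nx

# Hypothetical toy network; names and attributes are made up for illustration
toy = nx.Graph()
toy.add_node("Ana", role="Analyst")   # a node carrying metadata
toy.add_node("Ben", role="Manager")
toy.add_edge("Ana", "Ben", weight=3)  # an edge carrying metadata

print(toy.nodes(data=True))  # nodes together with their attribute dictionaries
print(toy.edges(data=True))  # edges together with their attribute dictionaries
```

Anything stored this way can be read back later, for example with `toy.nodes["Ana"]["role"]`.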

Networks are everywhere:
- Social media
- Company employees
- Companies in an industry
- Supermarkets represent a vast network
- Many more...

## 2. Important Concepts of a Network

- **Network:** a graph consisting of nodes that are linked to one another by edges
- **Nodes:** Points on a graph/network
- **Edges:** represent the path or link between nodes
- **Directed Graph:** a graph where nodes are connected in a non-reciprocal way. Think about this from the perspective of Twitter or Instagram: you can follow as many people as you'd like, but they don't necessarily have to follow you back
- **Undirected Graph:** a network where connections are reciprocal. For example, you cannot be friends with someone on Facebook or LinkedIn without their approval and agreement to connect with you
- **Degree Centrality:** a measure of how important a node is based on its connections. It is calculated by dividing the number of nodes a particular node is connected to by the number of nodes it could possibly be connected to. Hence, every node has a degree centrality
- **Network Density:** the proportion of the potential connections in a network that are actual connections. A potential connection is one that could exist between two nodes, whether or not it actually does
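
To make the directed versus undirected distinction concrete, here is a small sketch using NetworkX's `nx.DiGraph` and `nx.Graph` classes. The node names are hypothetical and only serve the illustration.

```python
import networkx as nx

# Directed: following someone does not imply being followed back
follows = nx.DiGraph()
follows.add_edge("you", "celebrity")

# Undirected: a friendship exists in both directions at once
friends = nx.Graph()
friends.add_edge("you", "colleague")

print(follows.has_edge("you", "celebrity"))  # True
print(follows.has_edge("celebrity", "you"))  # False
print(friends.has_edge("colleague", "you"))  # True
```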

## 3. Analysis

Let's begin by importing the packages we will be using.

In [None]:
import networkx as nx # network analysis library
import pandas as pd
import matplotlib.pyplot as plt # data visualization library
import numpy as np # numerical computing library

# this command helps us display the plots in the notebook
%matplotlib inline 

# this command increases the quality of the visualization
%config InlineBackend.figure_format = 'retina' 

## The data

We will be using the Hypatia dataset again to explore the connections available between professions/roles at the company. Let's read in our dataframe with pandas.

In [None]:
df = pd.read_csv('data/Edgelist_Hypatia.csv')
df.head()

The first thing we need in order to create our graph is to initialize an empty graph and assign a label/variable to it. We will use the capital letter `G` for graph.

In [None]:
G = nx.Graph() # Create a new empty graph object
G

Notice that the object has been initialized above and it has a reference ID for our session.

The next step is to add the nodes to our graph. In our dataset, these are represented by the `Adminmgr1` column, so we want to pass that entire array to one of the methods of our graph, `our_graph.add_nodes_from()`.

In [None]:
G.add_nodes_from(df['Adminmgr1']) # select the column of interest from the dataset

The nodes are now represented as a dictionary in the graph, where the key is the name of the node (i.e. the role at Hypatia) and the values can be any information (i.e. metadata) we might want to keep or use in our analysis later on. We can examine the nodes we just added with `our_graph.nodes(data=True)`. To see a better representation of the nodes, you can wrap them in the `list()` function.

In [None]:
list(G.nodes(data=True))

The next step is to add the edges. These need to be represented as a list of lists, each made up of at least two elements: the node where the edge starts and the node where it ends. In our case, the representation we need is simply the two columns of our dataframe, so we will pass in our dataframe (`df`) with the `.values` attribute, which returns the rows as a NumPy array.

In [None]:
df.values # this is what that representation would look like

In [None]:
G.add_edges_from(df.values)
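
As an aside, NetworkX also ships a helper, `nx.from_pandas_edgelist()`, that builds the graph from a dataframe in one step. The sketch below compares both routes on a small stand-in dataframe; the column names `source` and `target` and the role names are assumptions made for the example, not the columns of the Hypatia file.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for an edge list; column names are assumptions
toy_df = pd.DataFrame({
    "source": ["Sales1", "Sales1", "Sales2"],
    "target": ["Sales2", "Admin1", "Admin1"],
})

# Route 1: the row-by-row approach used in this notebook
from_rows = nx.Graph()
from_rows.add_edges_from(toy_df.values)

# Route 2: the one-step helper
from_helper = nx.from_pandas_edgelist(toy_df, source="source", target="target")

print(sorted(from_rows.nodes()) == sorted(from_helper.nodes()))
print(from_rows.number_of_edges(), from_helper.number_of_edges())
```

Note that adding edges also creates any nodes that are not in the graph yet, so both routes end up with the same nodes.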

We can inspect our edges in the same way we did with the nodes, by using `our_graph.edges(data=True)`.

In [None]:
list(G.edges(data=True))

Once we have our edges and nodes, we can also get a bit more information from our graph with a few methods.

In [None]:
G.number_of_nodes() # get the number of nodes

In [None]:
G.number_of_edges() # get the number of edges

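Older versions of NetworkX had a convenient `nx.info()` function for basic stats, but it was removed in NetworkX 3.0. Printing the graph gives us the number of nodes and edges, and we can compute the average degree ourselves, which is a fancy way of saying the number of connections an average node in this graph has.

In [None]:
print(G) # summary: graph type, number of nodes and number of edges
print("Average degree:", 2 * G.number_of_edges() / G.number_of_nodes()) # every edge adds 1 to the degree of two nodes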

We can also get the neighbors of a particular node, which means, the other nodes our node of interest is connected to.

In [None]:
print(list(G.neighbors('Sales2'))) # get the connected nodes

In [None]:
len(list(G.neighbors('Sales2')))

Lastly, we can always see an image representation of our graph by calling the function `nx.draw(our_graph)`. The additional function below, `plt.figure(figsize=(10,10))`, helps us customise the size of the figure.

In [None]:
plt.figure(figsize=(10,10))
nx.draw(G);

If we would like to see the names of the nodes we can also use the function `nx.draw_networkx(G)`, or we can pass in the parameter `with_labels=True` to our `nx.draw()` function.

In [None]:
plt.figure(figsize=(10,10))
nx.draw_networkx(G)

## Network Density

**Network Density:** the proportion of the potential connections in a network that are actual connections. A potential connection is one that could exist between two nodes, whether or not it actually does.

Networks can vary a lot in their number of edges relative to their number of nodes, and this variation is what density captures. The density of a network ranges between 0 and 1: from no connections whatsoever between the nodes, to every node being connected to every other node. The value can also be read as the probability that two randomly selected nodes are connected to one another. For an undirected network, we calculate it with the following formula.

$ND = \frac{\text{Number of existing links}}{\text{Number of possible links}} = \frac{L}{N(N - 1)/2} = \frac{2L}{N(N - 1)}$

- ND = Network Density
- N = Number of Nodes
- L = Number of links/edges in the network
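
The formula can be checked by hand on a tiny example. The sketch below uses a hypothetical four-node path graph (not the Hypatia data) and compares the manual calculation with `nx.density()`.

```python
import networkx as nx

# Hypothetical toy graph: a path A - B - C - D
toy = nx.Graph([("A", "B"), ("B", "C"), ("C", "D")])

N = toy.number_of_nodes()        # 4 nodes
L = toy.number_of_edges()        # 3 existing links
possible_links = N * (N - 1) / 2  # 6 possible links in an undirected graph

manual_density = L / possible_links
print(manual_density, nx.density(toy))  # 0.5 0.5
```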

#### Benefits of density

- Improved coordination and communication
- Higher levels of trust
- Higher efficiency at implementation

#### Costs of density

- May suggest large amounts of redundant communication
- May suggest group is inward looking
- Risks of group think and lower innovation rate

In [None]:
density = nx.density(G)
print("Network density:", density)

## Centrality Measures

A considerable part of network analysis is devoted to identifying the most important actors in a network according to their position in it, which is also referred to as centrality. One very simple centrality measure for the relative importance of a node is its degree, i.e., the number of connections it has to other nodes. There are many other measures available, and they each serve similar yet distinct purposes.

- Degree Centrality
- Bonacich's Centrality
- Betweenness Centrality
- Eigenvector Centrality

In essence, Centrality Measures try to answer the question, which nodes are the most important ones?

## Degree centrality

Popular people are usually the ones with the most friends. Degree centrality is a measure of the number of connections a particular node has in the network, based on the idea that important nodes are likely to have many connections. The function `nx.degree_centrality(G)` returns a dictionary with the nodes as the keys and, as the values, each node's degree divided by the number of other nodes in the graph (N - 1), i.e. the fraction of the network it is connected to.

In [None]:
most_connected = pd.Series(nx.degree_centrality(G)).sort_values(ascending=False)
most_connected

Since degree essentially represents the raw number of people you are connected to (e.g. if a node is connected to 3 nodes with a line/edge each, then its degree is 3), we can extract it in a dictionary format and add it to a pandas Series as done above. For this we will use the method `.degree()` and the method `.nodes()` from our graph object.

In [None]:
degree_dict = pd.Series(dict(G.degree(G.nodes()))).sort_values(ascending=False)
degree_dict
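
To see how the raw degree and the degree centrality relate, the sketch below divides each degree by N - 1 on a hypothetical star-shaped toy graph and compares the result with `nx.degree_centrality()`.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph: node 0 in the centre, nodes 1 to 4 as leaves
toy = nx.star_graph(4)
n = toy.number_of_nodes()

raw_degree = pd.Series(dict(toy.degree()))     # number of connections per node
manual = raw_degree / (n - 1)                  # share of the other nodes reached
library = pd.Series(nx.degree_centrality(toy))

print(pd.DataFrame({"degree": raw_degree, "manual": manual, "library": library}))
```

The centre node is connected to all four other nodes, so its degree centrality is 1.0, while each leaf reaches only one of four and scores 0.25.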

In [None]:
def plot_most_centrality(array, num_to_show):
    """
    This function plots the top players given a centrality measure.
    array: your pandas Series with the names in the index and the scores as the values
    num_to_show: how many top nodes in the network you want to display
    """
    plt.figure(figsize=(10,10))
    array[:num_to_show].plot(kind='bar', rot=45 if num_to_show <= 20 else 75)
    plt.title(f"Top {num_to_show} Most Connected Employees")
    plt.xlabel("Position")
    plt.ylabel("Degree of Centrality")
    plt.show()

In [None]:
plot_most_centrality(most_connected, 20)

What this is saying is that the Sales Director is connected to over 30% of the network.

## Brokerage (Betweenness) Centrality

Brokerage is a state or situation in which an actor connects otherwise unconnected actors or fills gaps or network holes in the social structure (Burt, 1992; Gould & Fernandez, 1989).

Let's see what this would look like in a network with a toy example, using a function from NetworkX called `nx.barbell_graph(m1, m2)`. It builds two fully connected groups of `m1` nodes each and joins them with a path of `m2` connector nodes.

In [None]:
nx.draw(nx.barbell_graph(5, 1));

The middle point in the above graph represents the broker, in other words, the connector between the two groups of 5 nodes.

Alongside the **Broker** we have the measure of Betweenness Centrality, which quantifies the number of times a node acts as a bridge (or "broker") along the shortest path between two other nodes. One of the important characteristics to remember is that this measure of centrality represents control, i.e. the role played by specific nodes in the communication/information flow between other unconnected ones within the network graph.

The nodes with high betweenness centrality can have a strategic control and influence on others. A broker at such a strategic position can influence the whole group, by either withholding or propagating the information in transmission. The brokers can also be referred to as information bottlenecks.

#### Formula

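$C_B(n_i) = \sum_{j<k} \frac{g_{jk}(n_i)}{g_{jk}}$

- $C_B(n_i)$ = Betweenness Centrality of node $n_i$
- $n_i$ = the node of interest
- $j$ = any node other than $n_i$
- $k$ = any node other than $n_i$ and $j$
- $g_{jk}(n_i)$ = number of shortest paths between j and k that pass through $n_i$
- $g_{jk}$ = total number of shortest paths between j and k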
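
A rough, brute-force sketch of this calculation on the barbell toy graph is shown below: for every pair of other nodes we list all shortest paths and add up the share that passes through the node of interest. The helper name `betweenness_by_hand` is made up for this example, and the approach is only meant for tiny graphs; NetworkX itself uses a much faster algorithm and, by default, rescales the sum by the number of node pairs.

```python
from itertools import combinations

import networkx as nx

toy = nx.barbell_graph(5, 1)  # two groups of 5 nodes joined through node 5


def betweenness_by_hand(graph, node):
    """Sum, over all pairs (j, k), the share of shortest paths through `node`."""
    total = 0.0
    others = [n for n in graph.nodes() if n != node]
    for j, k in combinations(others, 2):
        paths = list(nx.all_shortest_paths(graph, j, k))
        through = sum(1 for path in paths if node in path)
        total += through / len(paths)
    return total


raw = betweenness_by_hand(toy, 5)
n = toy.number_of_nodes()
normalised = raw / ((n - 1) * (n - 2) / 2)  # divide by the number of pairs

print(raw, normalised, nx.betweenness_centrality(toy)[5])
```

Every one of the 5 x 5 = 25 pairs with one node in each group has to go through the broker, which is why the raw score is 25.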

In [None]:
nx.betweenness_centrality(G)

In [None]:
brokers_scores = pd.Series(nx.betweenness_centrality(G)).sort_values(ascending=False)
brokers_scores.head()

In [None]:
plot_most_centrality(brokers_scores, 15)

The above chart shows the top brokers in Hypatia.

Betweenness centrality and degree centrality do not need to be correlated with one another.

In [None]:
df2 = pd.DataFrame({'dc': most_connected, 'bc': brokers_scores})

# plt.figure()
df2.plot(kind='scatter', x='dc', y='bc', figsize=(10,10))
plt.title("Correlation Between Brokers and Degree Centrality")
plt.xlabel("Degree Centrality")
plt.ylabel("Brokers")
plt.show();
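
The barbell toy graph gives an extreme illustration of this point. In the sketch below (the toy graph and variable names are only for illustration), the broker node has the fewest connections of all nodes and yet the highest betweenness.

```python
import networkx as nx
import pandas as pd

toy = nx.barbell_graph(5, 1)  # node 5 is the single connector

scores = pd.DataFrame({
    "dc": pd.Series(nx.degree_centrality(toy)),
    "bc": pd.Series(nx.betweenness_centrality(toy)),
})

print(scores.sort_values("bc", ascending=False).head(3))
print("Node with the lowest degree centrality:", scores["dc"].idxmin())
print("Node with the highest betweenness:", scores["bc"].idxmax())
print("Pearson correlation:", scores["dc"].corr(scores["bc"]))
```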

## Eigenvector Centrality

It is not just how many individuals one is connected to, but also the type of people one is connected with, that can decide the importance of a node. With that in mind, **Eigenvector Centrality** helps us measure how important a node is (its influence/prestige) by accounting for how well that same node is connected to other well-connected nodes. The score goes from 0 to 1, and the higher the score of a node, the more it is surrounded by other important nodes (i.e. nodes with high centrality).

Google's PageRank algorithm is a variant of the Eigenvector Centrality algorithm; hence, this algorithm helps us get answers to a lot of our queries on a daily basis.

To get Eigenvector Centrality with networkX we use `nx.eigenvector_centrality(our_graph)` while passing our graph as an argument.

In [None]:
eigen_cent = pd.Series(nx.eigenvector_centrality(G)).sort_values(ascending=False)
eigen_cent.head()
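
The name comes from linear algebra: the scores are the entries of the leading eigenvector of the graph's adjacency matrix. The sketch below checks this on a hypothetical four-node toy graph. It uses NumPy's `np.linalg.eigh` as a shortcut, which is not how NetworkX computes the scores internally (it iterates until the values settle), so the two results should only agree up to a small tolerance.

```python
import networkx as nx
import numpy as np

# Hypothetical toy graph: a triangle A-B-C with D hanging off C
toy = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
nodes = list(toy.nodes())

# Adjacency matrix with rows and columns in the order of `nodes`
adjacency = nx.to_numpy_array(toy, nodelist=nodes)

# eigh returns eigenvalues in ascending order, so the last column is the leading eigenvector
eigenvalues, eigenvectors = np.linalg.eigh(adjacency)
leading = np.abs(eigenvectors[:, -1])          # fix the arbitrary sign
manual = dict(zip(nodes, leading / np.linalg.norm(leading)))

library = nx.eigenvector_centrality(toy, max_iter=1000)

for node in nodes:
    print(node, round(manual[node], 4), round(library[node], 4))
```

Node C scores highest: it has the most connections, and two of them are to nodes that are themselves well connected.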

In [None]:
plot_most_centrality(eigen_cent, 15)

In [None]:
def ecdf(data):
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)

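Since degree centrality is just the number of neighbors divided by N - 1, the two should be perfectly correlated. Next we plot the empirical cumulative distribution function (ECDF) of the degree centralities, where:

- x = degree centrality
- y = cumulative fraction of data points less than or equal to a given value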

In [None]:
fig = plt.figure(figsize=(10, 10))
# Get a list of degree centrality scores for all of the 
# nodes in the graph
degree_centralities = list(nx.degree_centrality(G).values())

x, y = ecdf(degree_centralities)

# Plot the ecdf of degree centralities.
plt.scatter(x, y)

# Set the plot title. 
plt.title('Degree Centralities')
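
To make the `ecdf()` helper more concrete, here is what it returns for a handful of made-up scores. The function is repeated so the snippet runs on its own; the input values are purely illustrative.

```python
import numpy as np


def ecdf(data):
    # Same helper as above: sorted values and their cumulative fractions
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)


x, y = ecdf([0.4, 0.1, 0.2, 0.1])
print(x)  # [0.1 0.1 0.2 0.4]
print(y)  # [0.25 0.5  0.75 1.  ]
```

Reading the pairs off: three of the four scores (a fraction of 0.75) are at or below 0.2, and all of them are at or below 0.4.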

## 4. Summary

Here are some of the most important concepts covered in this session:

- A network is a graph composed of nodes and edges
    - Nodes can represent people, locations, companies, etc.
    - Edges are the links connecting different nodes with one another.
- Centrality measures help us answer questions related to the importance of the nodes within our networks. There are plenty of such measures, and some, like the ones we have covered in this course, are used more frequently than others to drive business outcomes.
- Degree centrality is based on the raw number of connections each node has: the more edges a node has, the higher its degree within the network.
- Betweenness centrality picks nodes not based on the number of connections they have but rather on how well they connect otherwise unconnected groups of well-connected nodes.
- Eigenvector Centrality can help us pinpoint prestige, that is, highly connected nodes that are also connected to other highly connected nodes.

## 5. Resources

- Blogpost Tutorials
    - [Exploring and Analyzing Network Data with Python](https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python#centrality)
    - [Network Centrality Measures in a Graph using Networkx | Python](https://www.geeksforgeeks.org/network-centrality-measures-in-a-graph-using-networkx-python/)
- Video Tutorials
    - [Network Analysis Made Simple: Network Fundamentals | SciPy 2018 Tutorial | Eric Ma, Mridul Seth](https://www.youtube.com/watch?v=K5xiFDClgjo&ab_channel=Enthought)
    - [Rob Chew, Peter Baumgartner | Connected: A Social Network Analysis Tutorial with NetworkX](https://www.youtube.com/watch?v=7fsreJMy_pI&t=6s&ab_channel=PyData)
- Books
    - [Zinoviev, D. (2018). _Complex network analysis in Python: Recognize, construct, visualize, analyze, interpret._ Raleigh, NC: The Pragmatic Bookshelf.](https://www.amazon.com.au/Complex-Network-Analysis-Python-Recognize-ebook/dp/B079ZN9K5M/ref=sr_1_1?dchild=1&keywords=Complex+network+analysis+in+Python%3A+Recognize%2C+construct%2C+visualize%2C+analyze%2C+interpret&qid=1613982710&sr=8-1)