<font size = "20"> SoNAR (IDH) - HNA Curriculum </font>

<font size = "5">Notebook 2:  Graph Theory and HNA</font>

This curriculum was created for the SoNAR (IDH) project. At its core, SoNAR (IDH) is a graph-based approach to structuring and linking large amounts of historical data (more on the SoNAR (IDH) project and database can be found in Notebook 3). Therefore, the whole curriculum is mainly about graph theory and network analysis. 

This notebook provides a general introduction to graph theory and a quick dive into historical network analysis (HNA).

# Graph Theory

## Origin

The Swiss mathematician Leonhard Euler is considered the founder of graph theory. In 1735, he presented a paper proposing a solution to the "Königsberg problem", a mathematical problem that was unsolved at the time.
Euler developed a new analysis technique to tackle this problem and was thus able to prove that the Königsberg problem has no solution. This new technique was the origin of the graph theory we know today. 


Let's take a look at the Königsberg problem: 

**Königsberg Problem** 

>Given the city whose geography is depicted in the following image, is there a way to walk across each of the seven bridges of the city once and only once, and return to our starting point? <cite data-cite="7184373/UCHCU3AD"></cite>

<center>
    <img src="../images/notebook2/01-1_kb_base.png" width="238">
    <a href="https://commons.wikimedia.org/wiki/File:Konigsberg_bridges.png">Image Source</a>
</center>

The city is crossed by a river that splits it into multiple parts: two banks (above and below the river) and two islands (the rectangle in the middle and the area on the far right). The banks and islands are connected by a total of seven bridges (highlighted in green). 

Euler's approach was to reduce the problem to the simplest possible drawing (see picture below). Each of the four city parts (two banks and two islands) is treated as a single point (*node*), and each bridge as a link (*edge*) connecting two points.

<p>
<center>
    <img src="../images/notebook2/01-2_kb_nodes.png" width="238">
    <a href="https://commons.wikimedia.org/wiki/File:K%C3%B6nigsberg_graph.svg">Image Source</a> (Labels added by the author)
</center>


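Based on this simplified picture of the problem, Euler noted that whenever you arrive at a point via one link, you need to leave it via another link. On a round trip this also holds for the starting point, which is left at the beginning and re-entered at the end. 
This means that every point needs to be connected to an even number of links. Even if the walk were allowed to end somewhere other than where it started, all points but two (the start and the end) would still need an even number of links. The Königsberg problem has the following connections: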

| City Part | Connection |
|--- |----------|
| A | twice to C; once to D |
| B | twice to C; once to D       |
| C | twice to A; twice to B; once to D |
| D | once to A; once to B; once to C |


As the table above shows, every city part is connected to an odd number of links (A, B and D to three each, C to five). The condition is therefore violated, and hence the Königsberg problem has no solution.
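
Euler's counting argument can be retraced in a few lines of Python. The following is a minimal sketch using only the standard library; the variable names (`bridges`, `degree`, `odd_parts`) are chosen for this illustration and are not part of the SoNAR (IDH) code base.

```python
from collections import Counter

# The seven bridges as pairs of city parts (labels as in the table above).
bridges = [
    ("A", "C"), ("A", "C"),
    ("B", "C"), ("B", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

# The degree of a city part is the number of bridge ends touching it.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# City parts with an odd number of bridges.
odd_parts = sorted(part for part, d in degree.items() if d % 2 == 1)

# Necessary conditions in a connected graph:
# a round trip over every bridge requires no odd city part at all,
# a walk with different start and end requires exactly two.
round_trip_possible = len(odd_parts) == 0
open_walk_possible = len(odd_parts) in (0, 2)

print(dict(degree))
print(odd_parts, round_trip_possible, open_walk_possible)
```

All four city parts turn out to have an odd degree, so neither kind of walk exists.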

This schematic approach has since been developed further and applied to other problems, resulting in the **graph theory** we know today.


## Definitions & Terminology

Graph theory comes with a specific set of definitions and terminology used for talking about the subject. This section will provide an overview of the most important concepts.

### Mathematical Definition

A graph is defined as $G = (V, E)$, where the symbols have the following meanings:

| Symbol | Description |
|--- |----------|
| $G$ | The Graph |
| $V$ | A set of **nodes** (the city parts in the example above)|
| $E$ | A set of **edges** (relationships) connecting nodes <br>(the bridges in the example above)|


Following this definition, the Königsberg problem from above can be defined as:

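$V = \{A, B, C, D\}$

$$
E = \{(A, C),\ (A, C),\ (C, B),\ (C, B),\ (A, D),\ (C, D),\ (B, D)\}
$$

Because the pairs $(A, C)$ and $(C, B)$ each occur twice (two bridges between the same city parts), $E$ is strictly speaking a multiset, and the graph is a *multigraph*.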
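
The definition translates almost directly into plain Python data structures. The sketch below is one possible representation, not a prescribed one: a `set` for $V$, a `list` for $E$ (a list rather than a set, so that the doubled bridges are kept), and a hypothetical helper `neighbours` written for this example.

```python
from collections import Counter

# V: the set of nodes (city parts).
V = {"A", "B", "C", "D"}

# E: the edges (bridges). A list keeps the parallel edges that a set would drop.
E = [
    ("A", "C"), ("A", "C"),
    ("C", "B"), ("C", "B"),
    ("A", "D"), ("C", "D"), ("B", "D"),
]

# Sanity check: every edge must connect two nodes that are in V.
edges_are_valid = all(u in V and v in V for u, v in E)

# Count parallel edges. frozenset ignores order, so (A, C) equals (C, A).
multiplicity = Counter(frozenset(edge) for edge in E)

def neighbours(node):
    """Return the sorted list of nodes that share at least one edge with `node`."""
    return sorted({v if u == node else u for u, v in E if node in (u, v)})

print(edges_are_valid)
print(multiplicity[frozenset(("A", "C"))])
print(neighbours("C"))
```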

### Terminology

Graph theory has its own terminology, and some concepts are known by several names. The table below gives an overview of these terms.


<div class="alert alert-block alert-info">
<b>Hint:</b> This curriculum uses the terms <i>network</i> and <i>graph</i> interchangeably, following the common practice in the scientific literature. However, it is sometimes pointed out that there are subtle differences between the two terms. For a more in-depth discussion of this topic, see chapter two of <a href="http://networksciencebook.com/">Network Science by Albert-László Barabási</a> <cite data-cite="7184373/W9KC4XSJ"></cite>. </div>



| Term | Synonyms | Description |
|--- |----------| ------- |
|**Graph** | Network | Abstract representation of objects (**nodes**) and the <br> relationships (**edges**) between these objects.
| **Node** | Vertex  | Fundamental unit of a **graph**. Nodes can be connected <br> with each other by **edges**. Nodes can have **labels** categorizing the nodes.<br> Nodes can have **properties** providing information about their characteristics.
| **Edge** | Relationship | Describes the connection between **nodes**. Edges are of a specific **type**. <br> Edges can have **properties** providing information <br> about their characteristics.
| **Label** | Type | A label is used to categorize different kinds of **nodes**. <br> A label marks a **node** as part of a group. 
| **Property** | Attribute | **Nodes** and **edges** can have properties. <br> Properties are like meta-information about a node or a relationship <br> and can contain a variety of different data types, like numbers, <br> strings, spatial data or temporal data.
| **Path** | - | Describes a sequence of **nodes** and their connecting relationships.<br> A path is usually a description of how to get from one **node** to another. 

<cite data-cite="7184373/9A4UBKNN"></cite>
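
To make the terms from the table more concrete, here is a minimal sketch of how nodes, edges, labels, properties and a path could be modelled in plain Python. All class names, labels (`Person`, `City`), edge types (`KNOWS`, `LIVES_IN`) and property values are invented placeholders for this illustration; they do not describe the actual SoNAR (IDH) data model.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    label: str                                       # categorizes the node
    properties: dict = field(default_factory=dict)   # meta-information

@dataclass
class Edge:
    source: str                                      # node_id of the start node
    target: str                                      # node_id of the end node
    edge_type: str                                   # the type of relationship
    properties: dict = field(default_factory=dict)

# Placeholder data, invented for this example.
nodes = {
    "p1": Node("p1", "Person", {"name": "Person A"}),
    "p2": Node("p2", "Person", {"name": "Person B"}),
    "c1": Node("c1", "City", {"name": "City X"}),
}
edges = [
    Edge("p1", "p2", "KNOWS", {"since": 1900}),
    Edge("p2", "c1", "LIVES_IN"),
]

def path_nodes(path_edges):
    """Return the node ids visited by a chain of edges, checking it is contiguous."""
    visited = [path_edges[0].source]
    for edge in path_edges:
        if edge.source != visited[-1]:
            raise ValueError("edges do not form a contiguous path")
        visited.append(edge.target)
    return visited

# Labels allow selecting a group of nodes.
people = [n.properties["name"] for n in nodes.values() if n.label == "Person"]

print(people)
print(path_nodes(edges))
```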

## Graph Types & Structures

When analyzing real world data and phenomena, the resulting graphs can take many different forms and structures. There can be many relationships between nodes or even self-referencing relationships. 

Most commonly there is a distinction between three different **graph types**:
<p>
<center>
    <img src="../images/notebook2/02-1_graph_types.png" width="600">
    <cite data-cite="7184373/9A4UBKNN"></cite>
</center>
    
    
However, graphs can be categorized not only by their general type but also by the shape they take. This shape is usually referred to as **structure**. There are three different **structures of networks**:

<p>
<center>
    <img src="../images/notebook2/02-2_graph_structures.png" width="600">
    <cite data-cite="7184373/9A4UBKNN"></cite>
</center>
    

The two most common **network structures** in real-world data are *small-world networks* and *scale-free networks*. <br>
*Small-world networks* are very common when analyzing any kind of social network. Networks of friends, networks of "social-media-bubbles" or cultural networks are often *small-world networks*. The main characteristics of *small-world networks* are:

* The path between two random nodes tends to be rather short
* "Cliques" are highly likely to be present, i.e. sub-networks in which nearly every node has a direct connection to every other node

*Scale-free networks*, on the other hand, differ from *small-world networks* mainly in that their degree distribution (the number of edges per node) follows a *power law*. Hence *scale-free networks* have a small number of highly connected nodes (*hubs*) and a large number of nodes with just a few edges. Furthermore, *scale-free networks* occur especially in contexts where a hub-and-spoke architecture is present. Some common examples of this network structure are the World Wide Web or graphs of software dependencies.
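
The notion of a degree distribution can be illustrated with a toy hub-and-spoke graph. The sketch below only shows how such a distribution is tabulated; a graph with six nodes is far too small to demonstrate a power law, and the node names (`hub`, `s1` ... `s5`) are made up for this example.

```python
from collections import Counter

# Toy graph: one hub linked to five spokes, plus a single spoke-to-spoke edge.
edges = [("hub", f"s{i}") for i in range(1, 6)] + [("s1", "s2")]

# Degree of each node = number of edges attached to it.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution: how many nodes have degree k.
distribution = Counter(degree.values())

# The most highly connected node(s) act as hubs.
max_degree = max(degree.values())
hubs = [node for node, d in degree.items() if d == max_degree]

for k in sorted(distribution):
    print(f"degree {k}: {distribution[k]} node(s)")
print(hubs)
```

Most nodes have a low degree while a single node carries most of the connections, which is the qualitative pattern described above.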

## Flavors of Graphs

Graphs can be distinguished not only by their overall structure and type, but also by the *characteristic attributes* they have. This section covers three very important attributes of networks that play a big role in historical network analyses.

These *network attributes* or *flavors* are highly important when choosing which network algorithm to apply. The overview table below lists the most noteworthy factors and considerations for the three *flavors* covered in this curriculum.

|Graph attribute| Key factor | Algorithm consideration |
|---------------|------------|-------------------------|
| Connected vs. <br> disconnected | Whether there is a path between any two nodes in the graph, irrespective of distance | Islands of nodes can cause unexpected behaviour, such as getting stuck in or failing to process disconnected components. |
| Weighted vs. <br> unweighted | Whether there are (domain-specific) values on relationships or nodes | Many algorithms expect weights, and performance and results can differ significantly when they are ignored. |
| Directed vs. <br> undirected | Whether or not relationships explicitly define a start and end node | Direction adds rich context to infer additional meaning. In some algorithms you can explicitly set the use of one, both, or no direction. |

<cite data-cite="7184373/9A4UBKNN"></cite>



### Connected vs. Disconnected Graphs

A graph is called *connected* when there is a path between every pair of nodes in the network. When a node or a group of nodes is completely detached from the rest (a so-called *island*), the graph is *disconnected*. Each group of nodes that are connected with each other but detached from the rest of the graph is called a *component* or *cluster*.

Some graph algorithms assume the graph to be connected and produce misleading results when it isn't. Hence it's a good idea to check whether you are dealing with a connected or a disconnected graph. 

<p>
<center>
    <img src="../images/notebook2/03-1_connected.png" width="600">
    <cite data-cite="7184373/9A4UBKNN"></cite>
</center>
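
One way to check whether a graph is connected is to collect its components with a breadth-first search. The following is a minimal sketch using only the standard library; the function name `connected_components` and the example graph are assumptions made for this illustration.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return a list of node sets, one per component of an undirected graph."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

# Example graph with two islands: the pair D-E and the isolated node F.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]

components = connected_components(nodes, edges)
is_connected = len(components) == 1

print(components)
print(is_connected)
```

A graph is connected exactly when this procedure yields a single component.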
    

### Weighted vs. Unweighted Graphs

A graph is *weighted* when numeric values are attached to its nodes or edges. These values represent a domain-specific quantity (e.g. cost, distance, capacity).
A graph is *unweighted*, on the other hand, when no such numeric values are attached to its nodes or edges. 

Weight values can represent the strength or intensity of a relationship. They can also be used by path finding algorithms that try to find the shortest or most efficient path between two distant nodes. 

<p>
<center>
    <img src="../images/notebook2/03-2_weighted.png" width="600">
    <cite data-cite="7184373/9A4UBKNN"></cite>
</center>
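
To illustrate how edge weights influence path finding, the sketch below implements Dijkstra's algorithm, a standard shortest-path method for graphs with non-negative edge weights. It is meant as an illustration only: the function name `shortest_path` and the example weights are made up for this example, and the graph is treated as undirected.

```python
import heapq

def shortest_path(weighted_edges, source, target):
    """Dijkstra's algorithm on an undirected graph with non-negative weights.

    Returns (total_weight, list_of_nodes), or (None, []) if no path exists.
    """
    adjacency = {}
    for u, v, weight in weighted_edges:
        adjacency.setdefault(u, []).append((v, weight))
        adjacency.setdefault(v, []).append((u, weight))

    best = {source: 0}        # best known distance to each node
    previous = {}             # predecessor on the best known path
    queue = [(0, source)]     # priority queue ordered by distance
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue          # outdated queue entry
        for neighbour, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))

    if target not in best:
        return None, []
    # Walk back from the target to reconstruct the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path

# Example weights (e.g. distances), invented for this illustration.
weighted_edges = [("A", "B", 4), ("A", "C", 1), ("C", "B", 2), ("B", "D", 5)]
distance, path = shortest_path(weighted_edges, "A", "D")
print(distance, path)
```

In this example the cheapest route from A to D takes the detour via C, although the route A, B, D uses fewer edges. An unweighted analysis would therefore prefer a different path.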
    


### Directed vs. Undirected Graphs

The last attribute of graphs we want to take a look at is whether a graph is directed or not. *Direction* in this context adds another dimension to the *relationship* between nodes. When a relationship between two nodes has a direction, this direction indicates a dependency or a flow. 

Let's say person A sends a letter to person B; this action is directed and carries the information that the letter was sent from A to B and not from B to A. A relationship is undirected, on the other hand, when there is no information about dependency or flow. Let's say person A is the sibling of person B. In this case there is no information about direction, since A is the sibling of B and hence B is also the sibling of A.

When a directed relationship points to a node, the relationship is referred to as an *in-link* of that node. When a relationship originates from a node, it is referred to as an *out-link* instead. 

Whether a graph is directed or undirected can have a big influence on the result of many algorithms. For path finding algorithms, for example, it makes a crucial difference whether a relationship is a "one-way road" (*directed*) or does not point in a specific direction (*undirected*).

<p>
<center>
    <img src="../images/notebook2/03-3_directed.png" width="600">
    <cite data-cite="7184373/9A4UBKNN"></cite>
</center>
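
The letter example can be sketched in code to show both in-links and out-links and the effect of direction on reachability. The data and the helper `reachable` below are hypothetical and written only for this illustration.

```python
from collections import Counter

# Each tuple is one letter: (sender, recipient). Invented example data.
letters = [("A", "B"), ("A", "C"), ("C", "A")]

# Out-links: letters sent by a person. In-links: letters received by a person.
out_links = Counter(sender for sender, _ in letters)
in_links = Counter(recipient for _, recipient in letters)

def reachable(edges, start, directed=True):
    """Return the set of nodes that can be reached from `start`."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        if not directed:
            # Ignoring direction makes every edge usable both ways.
            adjacency.setdefault(v, set()).add(u)
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

print(out_links["A"], in_links["A"])
print(sorted(reachable(letters, "B", directed=True)))
print(sorted(reachable(letters, "B", directed=False)))
```

Person B only receives letters here, so following the direction of the edges nothing can be reached from B. Once direction is ignored, every person becomes reachable.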

# Introduction to Historical Network Analysis (HNA)

Historical Network Analysis is a scientific method that is derived from Social Network Analysis. 
When analyzing historical events or phenomena it is often crucial to join different historical data sources in meaningful ways. Using the principles of Graph Theory helps to reconstruct historical networks and detect relevant information in big historical data sources. 





[CKCC (Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic)](http://ckcc.huygens.knaw.nl/) 

[ORKG (Open Research Knowledge Graph)](https://projects.tib.eu/orkg/)

[HistoGraph](http://histograph.eu/)

## Case Study: Nobel Laureates

# Bibliography

<div class="cite2c-biblio"></div>