# Graphs

We have finally arrived at one of the most widely used data structures for large data sets: the graph.

A graph is basically a tree with fewer restrictions. Recall the structure of a binary tree, where each node can have up to two child nodes (one left, one right); a graph does not have this limitation. The number of nodes and their connections can vary. There is also no root node, as a graph is not hierarchical.
What we have instead is a loose structure that contains `nodes` and `edges` (representing the connections between nodes).

```
      D
    /
   B -- C
       / \
      A   G
```
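One common way to hold such a structure in code is an adjacency list: a mapping from each node to the nodes it is connected to. The sketch below encodes the small graph drawn above. The names `adjacency` and `edge_count` are illustrative assumptions and are not used in the rest of this section.

```python
# Adjacency-list sketch of the diagram above (undirected graph).
# Every connection is stored on both of its end points.
adjacency = {
    "A": {"C"},
    "B": {"C", "D"},
    "C": {"A", "B", "G"},
    "D": {"B"},
    "G": {"C"},
}

def edge_count(adj):
    """Count undirected edges: each edge appears once per end point."""
    return sum(len(neighbours) for neighbours in adj.values()) // 2

print(sorted(adjacency["C"]))  # ['A', 'B', 'G']
print(edge_count(adjacency))   # 4
```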

Unlike in trees, where we are mostly focused on the `nodes`, in graphs we add more meaning to the `edges`. Example: let's model a road network between cities.

```
San Francisco
    |        \
    |         Las Vegas --- New York
Los Angeles                /    |
                    Atlanta     |
                              Miami
```

That looks like a very simplified road map, right?! True! And it's intended to be like that. A graph represents an abstract model of relationships between different entities. This allows us to reduce the overall problem complexity and focus on the actual task. For the road graph we are usually interested in the "shortest" or "most beneficial" path between two cities. Adding "weights" to the `edges` enables exactly that. These weights could represent the distance, the overall effort required to travel between nodes, etc.
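To make the idea of weights concrete, here is a minimal sketch that attaches a distance to each connection and adds the distances up along a route. The three distances are taken from the weighted road map used later in this section; the names `distances` and `route_length` are hypothetical helpers for illustration only.

```python
# Edge weights keyed by the (unordered) pair of cities they connect.
distances = {
    frozenset({"San Francisco", "Los Angeles"}): 400,
    frozenset({"San Francisco", "Las Vegas"}): 600,
    frozenset({"Los Angeles", "Las Vegas"}): 300,
}

def route_length(route):
    """Sum the edge weights along a list of consecutive cities."""
    return sum(distances[frozenset(pair)] for pair in zip(route, route[1:]))

direct = route_length(["San Francisco", "Las Vegas"])
via_la = route_length(["San Francisco", "Los Angeles", "Las Vegas"])
print(direct, via_la)  # 600 700
```

With weights in place, "shortest" has a precise meaning: the direct road to Las Vegas (600) beats the detour via Los Angeles (700).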


Ok, enough theory, let's implement a graph.

Similar to the tree, we will create a `Node` class that holds a value and references to connected nodes. Since a node in a graph can have more than two connections, however, we use a list to hold the references.

In our example we create bi-directional connections, similar to a doubly linked list (making this graph undirected).

In [1]:
class Node:
    def __init__(self, val):
        self.value = val
        self.connected_nodes = [] # list with references to connected nodes
        
    def add_connection(self, node):
        self.connected_nodes.append(node)
    
    def drop_connection(self, node):
        if node in self.connected_nodes:
            self.connected_nodes.remove(node)

class Graph:
    def __init__(self, node_list):
        self.nodes = node_list
        
    def add_edge(self,node1,node2):
        """ Adds bi-directional connection """
        if node1 in self.nodes and node2 in self.nodes:
            node1.add_connection(node2)
            node2.add_connection(node1)
            
    def drop_edge(self,node1,node2):
        """ Removes bi-directional connection """
        if node1 in self.nodes and node2 in self.nodes:
            node1.drop_connection(node2)
            node2.drop_connection(node1)

In [2]:
sf = Node("San Francisco")
la = Node("Los Angeles")
vegas = Node("Las Vegas")
ny = Node("New York")
atl = Node("Atlanta")
miami = Node("Miami")

road_graph = Graph([sf, la, vegas, ny, atl, miami])
road_graph.add_edge(sf, la)
road_graph.add_edge(sf, vegas)
road_graph.add_edge(vegas, ny)
road_graph.add_edge(ny, miami)
road_graph.add_edge(ny, atl)

Ok, and how do we traverse a graph?! 

=> Same way we traverse a tree: 

1. Depth First Search (DFS)
   => look for child nodes first, implemented easily using a stack
2. Breadth First Search (BFS)
   => look for all nodes on same level first, implemented easily using a queue


In a way, implementing both methods for a graph is a bit simpler than for a tree, as we do not need to keep track of the left and right nodes. We will also use a Python list this time (to avoid re-implementing a stack / queue). The resulting code is very compact and easy to understand.

In [3]:
def traverse_graph_dfs(start_node):
    # keep track of already visited nodes
    visited_nodes = set()
    # init stack with start node
    state_stack = [start_node]              
    
    while len(state_stack) != 0:
        # get next node to be traversed
        current_node = state_stack.pop()
        # mark it as visited
        visited_nodes.add(current_node)
        print(current_node.value)
        for node in current_node.connected_nodes:
            # iterate through ALL the connected nodes one by one
            if (node not in visited_nodes) and (node not in state_stack):
                # node has not been visited yet and is not already on the stack
                # add to stack
                state_stack.append(node)

traverse_graph_dfs(sf)

San Francisco
Las Vegas
New York
Atlanta
Miami
Los Angeles


=> we follow a connection until we reach an end. Then we take the next node. 

Hence, the first descent starting from San Francisco is

`San Francisco -> Las Vegas -> New York -> Atlanta`

Atlanta is a dead end, so we step back and take New York's remaining neighbour `Miami`, followed by `Los Angeles`, the remaining connection from San Francisco.
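The same depth-first idea can also be written recursively, letting the call stack take the place of the explicit stack. The following is a minimal sketch on a plain adjacency dictionary that mirrors the road graph built in cell `In [2]`; `roads` and `dfs_recursive` are assumed names, not part of the classes above. Because the recursion explores neighbours in the order they are listed, the visiting order differs from the stack-based version.

```python
# Adjacency lists mirroring the road graph (neighbours in insertion order).
roads = {
    "San Francisco": ["Los Angeles", "Las Vegas"],
    "Los Angeles": ["San Francisco"],
    "Las Vegas": ["San Francisco", "New York"],
    "New York": ["Las Vegas", "Miami", "Atlanta"],
    "Atlanta": ["New York"],
    "Miami": ["New York"],
}

def dfs_recursive(graph, node, seen=None, order=None):
    """Return the nodes reachable from `node` in depth-first order."""
    if seen is None:
        seen, order = set(), []
    seen.add(node)
    order.append(node)
    for neighbour in graph[node]:
        if neighbour not in seen:
            # descend fully into this neighbour before trying the next one
            dfs_recursive(graph, neighbour, seen, order)
    return order

print(dfs_recursive(roads, "San Francisco"))
# ['San Francisco', 'Los Angeles', 'Las Vegas', 'New York', 'Miami', 'Atlanta']
```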


For graphs, Breadth First Search is the better choice when the nodes close to the start node matter most, for example when looking for the connection with the fewest hops.

Here, we first iterate through all neighbour nodes before traversing deeper.

In [4]:
def traverse_graph_bfs(start_node):
    visited_nodes = set()
    # init queue with start node
    state_queue = [start_node]              
    
    while len(state_queue) != 0:
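        # remove the first element instead of the last (the only difference compared to dfs)
        current_node = state_queue.pop(0)
    
        visited_nodes.add(current_node)
        print(current_node.value)
        for node in current_node.connected_nodes:
            # skip nodes that were already visited or are already waiting in the queue
            if (node not in visited_nodes) and (node not in state_queue):
                state_queue.append(node)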

traverse_graph_bfs(sf)

San Francisco
Los Angeles
Las Vegas
New York
Miami
Atlanta
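Removing the first element of a Python list shifts all remaining elements, so for larger graphs a `collections.deque` is the more common queue. The sketch below is one possible variant, not the implementation used above: it works on an assumed adjacency dictionary `roads` mirroring the road graph and also records how many hops each city is away from the start (`bfs_hops` is a hypothetical helper name).

```python
from collections import deque

# Adjacency lists mirroring the road graph (neighbours in insertion order).
roads = {
    "San Francisco": ["Los Angeles", "Las Vegas"],
    "Los Angeles": ["San Francisco"],
    "Las Vegas": ["San Francisco", "New York"],
    "New York": ["Las Vegas", "Miami", "Atlanta"],
    "Atlanta": ["New York"],
    "Miami": ["New York"],
}

def bfs_hops(graph, start):
    """Return {node: number of edges from start}, filled in BFS order."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        for neighbour in graph[node]:
            if neighbour not in hops:  # first time we reach this node
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops

hops = bfs_hops(roads, "San Francisco")
print(hops["New York"], hops["Miami"])  # 2 3
```

Since a node's hop count is set the first time it is reached, the dictionary doubles as the "already seen" check.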


# Dijkstra's Algorithm

When talking about graphs and finding the shortest path between nodes, Dijkstra's Algorithm never seems to be far away from the discussion. Hence, we will implement the algorithm below.

Edsger W. Dijkstra devised this "greedy" approach in 1956 and it is quite simple to summarize:

1. choose a start and an end node
2. initialize a list of distances with infinity (and 0 for the start node)
3. traverse the graph, always continuing with the unprocessed node that has the smallest distance
4. update the distances of its neighbours whenever a shorter route is found
5. return the distance stored for the end node


## Weighted Graph

As mentioned before, edges play a special role when working with graphs. These connections between nodes can hold information such as the "distance" between two nodes. Such graphs are also referred to as weighted graphs.

We did not consider this in our example above. Let's make a weighted graph out of our road map by adding a distance attribute to each connection.

This requires an additional class, the `Edge`.

In [5]:
class Edge:
    def __init__(self, node, distance):
        self.node = node
        self.distance = distance

We also change the `Node` class to store edges instead of nodes. The rest remains almost the same.

In [6]:
class Node:
    def __init__(self, val):
        self.value = val
        self.edges = [] # we will store edges now
        
    def add_connection(self, node, distance):
        self.edges.append(Edge(node, distance))
    
    def drop_connection(self, node):
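        # self.edges holds Edge objects, so compare against the node each edge points to
        self.edges = [edge for edge in self.edges if edge.node is not node]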

class WeightedGraph:
    def __init__(self, node_list):
        self.nodes = node_list
        
    def add_edge(self, node1, node2, distance):
        """ Adds bi-directional connection """
        if(node1 in self.nodes and node2 in self.nodes):
            node1.add_connection(node2, distance)
            node2.add_connection(node1, distance)
            
    def drop_edge(self, node1, node2):
        """ Removes bi-directional connection """
        if(node1 in self.nodes and node2 in self.nodes):
            node1.drop_connection(node2)
            node2.drop_connection(node1)

## Finding the shortest path

Dijkstra's algorithm, being a greedy approach, always continues with the closest node that has not been processed yet. The implementation below traverses the entire graph and returns the length of the shortest path between two nodes.

Let's use our road graph from before, add a few more connections (making it a cyclic graph) and add some weights (making it a weighted cyclic graph).


```
San Francisco
    |       \600
    |        \         2600
    |400      Las Vegas ----- New York
    |        /    \          /     |
    |       /300   \2000    /900   |
Los Angeles         \      /       |1300
                     Atlanta       |
                             \700  |
                              \    |
                                Miami
```

In [7]:
sf = Node("San Francisco")
la = Node("Los Angeles")
vegas = Node("Las Vegas")
ny = Node("New York")
atl = Node("Atlanta")
miami = Node("Miami")


road_graph = WeightedGraph([sf, la, vegas, ny, atl, miami])
road_graph.add_edge(sf, la, 400)
road_graph.add_edge(sf, vegas, 600)
road_graph.add_edge(la, vegas, 300)
road_graph.add_edge(vegas, ny, 2600)
road_graph.add_edge(vegas, atl, 2000)
road_graph.add_edge(atl, ny, 900)
road_graph.add_edge(atl, miami, 700)
road_graph.add_edge(ny, miami, 1300)

In [131]:
import math

def dijkstra(graph, start_node, end_node):
    # create a dictionary with key: node, value: distance
    # and initialize the values with infinity
    distance_dict = {node: math.inf for node in graph.nodes}
    
    # create a dict to store the final shortest distances
    shortest_distance = {}

    # the start node gets distance 0
    distance_dict[start_node] = 0

    # iterate until every node has been processed
    while distance_dict:

        # sort the remaining nodes by their current distance from the start node
        # (this is easily achieved by using a lambda function as sort key)
        sorted_distance_dict = sorted(distance_dict.items(), key=lambda x: x[1])
        
        # take the first item of the sorted list,
        # i.e. the unprocessed node that is currently closest to the start node
        # (for the first iteration this will be the starting node as the remaining items are set to infinity)
        current_node, node_distance = sorted_distance_dict[0]
        
        # print current node and its accumulated distance from the start node (to visualize the traversal)
        print("== current node: " + str(current_node.value) + ", current acc. distance " + str(node_distance) + " ==")

        # the distance of the current node is now final: store it and
        # drop the current node from the distance dict (so we will not pass it again)
        shortest_distance[current_node] = distance_dict.pop(current_node)

        print("\t-- connected nodes")
        for edge in current_node.edges: # iterate through each edge of the current node
            print("\t\t name: " + str(edge.node.value) + ", distance: " + str(edge.distance))
            if edge.node in distance_dict: # if the neighbour node has not been processed yet
                # distance from the start node to the neighbour when going via the current node
                distance_to_neighbour = node_distance + edge.distance
                # if this distance is smaller than the one we have stored,
                # replace the stored value in the distance_dict
                if distance_dict[edge.node] > distance_to_neighbour:
                    print("\t\t\tcurrent distance to node: " + str(distance_dict[edge.node]))
                    print("\t\t\tupdating with " + str(distance_to_neighbour))
                    distance_dict[edge.node] = distance_to_neighbour


    return shortest_distance[end_node]

In [134]:
shortest_distance_sf_miami = dijkstra(road_graph, sf, miami)
print("\nfinal shortest distance: " + str(shortest_distance_sf_miami))

== current node: San Francisco, current acc. distance 0 ==
	-- connected nodes
		 name: Los Angeles, distance: 400
			current distance to node: inf
			updating with 400
		 name: Las Vegas, distance: 600
			current distance to node: inf
			updating with 600
== current node: Los Angeles, current acc. distance 400 ==
	-- connected nodes
		 name: San Francisco, distance: 400
		 name: Las Vegas, distance: 300
== current node: Las Vegas, current acc. distance 600 ==
	-- connected nodes
		 name: San Francisco, distance: 600
		 name: Los Angeles, distance: 300
		 name: New York, distance: 2600
			current distance to node: inf
			updating with 3200
		 name: Atlanta, distance: 2000
			current distance to node: inf
			updating with 2600
== current node: Atlanta, current acc. distance 2600 ==
	-- connected nodes
		 name: Las Vegas, distance: 2000
		 name: New York, distance: 900
		 name: Miami, distance: 700
			current distance to node: inf
			updating with 3300
== current node: New York, current 

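When the next node is picked with a binary-heap priority queue, Dijkstra's algorithm runs in `O( (N + E) log N )`, where `N` represents the number of nodes and `E` the number of edges in a graph. The implementation above re-sorts all remaining nodes in every iteration instead, which costs roughly `O( N² log N + E )`.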
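Reaching a logarithmic factor like this requires a priority queue rather than sorting all nodes in every iteration. The following is a minimal sketch of such a variant using Python's `heapq` module. It is not the implementation from above: it works on an assumed adjacency dictionary `roads` that mirrors the weighted road map, and the names `dijkstra_heap` and `previous` are illustrative. As an extra, it remembers each node's predecessor so the route itself can be rebuilt.

```python
import heapq
import math

# Weighted adjacency dict mirroring the road map above.
roads = {
    "San Francisco": {"Los Angeles": 400, "Las Vegas": 600},
    "Los Angeles": {"San Francisco": 400, "Las Vegas": 300},
    "Las Vegas": {"San Francisco": 600, "Los Angeles": 300,
                  "New York": 2600, "Atlanta": 2000},
    "New York": {"Las Vegas": 2600, "Atlanta": 900, "Miami": 1300},
    "Atlanta": {"Las Vegas": 2000, "New York": 900, "Miami": 700},
    "Miami": {"Atlanta": 700, "New York": 1300},
}

def dijkstra_heap(graph, start, end):
    """Return (shortest distance, route) from start to end."""
    distances = {node: math.inf for node in graph}
    distances[start] = 0
    previous = {}            # node -> predecessor on the best known route
    heap = [(0, start)]      # (distance from start, node)

    while heap:
        dist, node = heapq.heappop(heap)
        if dist > distances[node]:
            continue         # outdated heap entry, a shorter route was found
        if node == end:
            break            # the distance of the end node is final
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < distances[neighbour]:
                distances[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))

    if distances[end] == math.inf:
        return math.inf, []  # end node is not reachable

    # walk the predecessors backwards to rebuild the route
    route = [end]
    while route[-1] != start:
        route.append(previous[route[-1]])
    route.reverse()
    return distances[end], route

distance, route = dijkstra_heap(roads, "San Francisco", "Miami")
print(distance)  # 3300
print(route)     # ['San Francisco', 'Las Vegas', 'Atlanta', 'Miami']
```

Instead of updating entries inside the heap, this sketch pushes a new entry and skips outdated ones when they are popped, which keeps the code short at the cost of a slightly larger heap.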
