# Graphs and Algorithms

A graph $G(V, E)$ is a non-linear data structure that consists of vertices (nodes) $V$ and edges $E$.  
A vertex is a point (or an object) in the graph, and an edge connects two vertices.  
![graph1.png](graph1.png)  
Graph Properties:
- weighted graph: the edges have values.
- connected graph: every vertex can be reached from every other vertex along the edges.
- directed graph: each edge points from one node to another, but not in the opposite direction.
- undirected graph: edges connect two nodes in either direction.
- cyclic graph: a graph that contains at least one cycle, where
  - for a directed graph a cycle is a path from a node back to itself.
  - for an undirected graph a cycle is a path from a node back to itself, but without using the same edge twice.
- a loop, also called a self-loop, is an edge from a vertex to itself. A loop is a cycle.

### Graph Representations

A graph $G(V,E)$ with vertices $V=\{v_1,v_2,\dots,v_n\}$ and edges  
$E=\{(v_i,v_j) \mid v_i \text{ and } v_j \text{ are connected}\}$ can have the following representations:
- Adjacency Matrix Graph Representation:
$$ Ad = (a_{ij})_{i,j=1,\dots,n}, \quad a_{ij} = \begin{cases} 1 & \text{if } (v_i,v_j)\in E \\ 0 & \text{otherwise} \end{cases} $$
  - if the graph is undirected, then $ Ad $ is symmetric, i.e. 
  $a_{ij} = a_{ji}$
  - if the graph is weighted we can represent it by $a_{ij}=w_{ij}$
  - the adjacency matrix takes $\mathcal{O}(n^2)$ space
![graph2.png](graph2.png)
- Adjacency List Graph Representation:
  - if a directed graph has $k$ vertices it can have up to $k(k-1)$ edges (not counting self-loops), but most graphs have far fewer, so the adjacency matrix is mostly empty (sparsity).
  - An adjacency list has an array that contains the vertices, and each vertex has a linked list (or array) with the edges of that vertex  
![graph3.png](graph3.png)
  - if the graph is weighted, the weights can be stored in the linked list
  - the adjacency list takes $\mathcal{O}(|V|+|E|)$ space: one entry per vertex plus one entry per edge (two per edge if the graph is undirected)
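
As a minimal sketch of the two representations, the snippet below builds both from the same edge list. The helper names `build_adjacency_matrix` and `build_adjacency_list` are hypothetical and chosen only for this illustration; vertices are assumed to be numbered `0..n-1` and the graph is unweighted.

```python
from typing import List, Tuple


def build_adjacency_matrix(n: int, edges: List[Tuple[int, int]],
                           directed: bool = False) -> List[List[int]]:
    """Return an n x n matrix with a[i][j] = 1 iff (i, j) is an edge."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        if not directed:
            # an undirected edge is stored in both directions (symmetric matrix)
            matrix[v][u] = 1
    return matrix


def build_adjacency_list(n: int, edges: List[Tuple[int, int]],
                         directed: bool = False) -> List[List[int]]:
    """Return a list where entry u holds the neighbours of vertex u."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj


edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
matrix = build_adjacency_matrix(4, edges)
adj = build_adjacency_list(4, edges)
print(matrix)  # [[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
print(adj)     # [[1, 2, 3], [0, 2], [0, 1], [0]]
```

The matrix always stores $n^2$ entries, while the list stores only the edges that exist, which is why the list is usually preferred for sparse graphs.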

### Graph Traversals
To traverse a graph means to start at one vertex and visit all other vertices (or as many as possible) by following the edges. If the graph is directed or not connected, it might not be possible to reach every node from a given starting node.  
Two common algorithms to traverse a graph are:
- Breadth First Search (BFS): from a starting vertex we first visit all of its adjacent vertices, then the vertices adjacent to those, and so on. Here is a description of the algorithm:
  - initialize all nodes as unvisited, i.e. visited = [False]*n
  - start at a node starting_node and mark it as visited
  - examine all the adjacent nodes of the starting node
  - keep a queue with the discovered nodes whose adjacent vertices have not been examined yet
  - pop the first node of the queue and examine its adjacent nodes; if an adjacent node is not visited, add it to the queue and mark it as visited
  - continue as long as there are nodes in the queue, i.e. discovered nodes that have not yet been checked for undiscovered adjacent nodes
  - additionally, for every discovered node we can keep the time of the discovery and the parent node
  - since we examine each node and its adjacency list once, the time complexity of BFS is $\mathcal{O}(|V|+|E|)$
  - since we keep a list with boolean flags for visited-unvisited nodes, a queue with unexamined nodes, and maybe a list with info about the discovered nodes, all of size $|V|$, the space complexity of BFS is $\mathcal{O}(|V|)$
- Depth First Search (DFS): from a starting vertex we visit an adjacent vertex, then a vertex adjacent to that one, and so on, going as deep as possible before backtracking.  
Here is a description of the algorithm:
  - initialize all nodes as unvisited
  - start at a node and mark it as visited
  - examine its adjacent nodes
  - if an adjacent node is newly discovered (i.e. it was not visited before), recursively call dfs on this node
  - when a node is discovered, increase the global time by one and keep the discovery time and the parent node for that node
  - a node finishes when all its adjacent nodes are visited and finished
  - when a node finishes, increase the global time by one and set its finish time
  - keep a list of the backedges
  - a backedge is an edge from node u to v, such that node v is already visited but has not finished yet
  - a non-trivial backedge is a backedge from u to v such that the parent of u is not v
  - since we examine each node and its adjacency list once, the time complexity of DFS is $\mathcal{O}(|V|+|E|)$
  - since we keep a list with boolean flags for visited-unvisited nodes, a list with info about the discovered nodes (discovery and finish times, parent), and a recursion stack of function calls, all of size $|V|$, the space complexity of DFS is $\mathcal{O}(|V|)$
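
The two descriptions above can be sketched on a plain adjacency list as follows. This is a simplified illustration that only returns the visiting order, without discovery times, parents or backedges; the function names `bfs_order` and `dfs_order` and the small example graph are assumptions made for this sketch.

```python
from collections import deque
from typing import List


def bfs_order(adj: List[List[int]], start: int) -> List[int]:
    """Return the nodes reachable from start in BFS order."""
    visited = [False] * len(adj)
    visited[start] = True
    order = []
    queue = deque([start])  # discovered nodes whose neighbours are not examined yet
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in adj[node]:
            if not visited[nbr]:
                visited[nbr] = True  # mark when enqueued so a node enters the queue once
                queue.append(nbr)
    return order


def dfs_order(adj: List[List[int]], start: int) -> List[int]:
    """Return the nodes reachable from start in DFS (discovery) order."""
    visited = [False] * len(adj)
    order = []

    def visit(node: int) -> None:
        visited[node] = True
        order.append(node)
        for nbr in adj[node]:
            if not visited[nbr]:
                visit(nbr)  # go deeper before looking at the next neighbour

    visit(start)
    return order


# undirected graph: 0-1, 0-2, 1-3, 2-3, 3-4
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(bfs_order(adj, 0))  # [0, 1, 2, 3, 4]
print(dfs_order(adj, 0))  # [0, 1, 3, 2, 4]
```

Because `dfs_order` is recursive, very deep graphs can hit Python's recursion limit; an explicit stack avoids that at the cost of slightly more bookkeeping.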

### DFS Tree, Forest, Cycles
- Running a DFS from a node in a graph creates a DFS tree rooted at that node
- if the graph is not connected, to traverse all nodes in the graph we have to run DFS from every node that has not been visited yet
- the result is a DFS forest
- an edge $u \rightarrow v$ in the graph can be of 4 types ($u.d, u.f$ denote the discovery time and the finish time of node $u$):
  - tree-edge: each edge in a DFS tree is a tree-edge; it goes from a parent node $u$ that has been discovered, but not finished, to a child node $v$ that is just discovered, i.e.
  $$ u.d < v.d < v.f < u.f \quad \text{and} \quad v.parent = u $$ 
  - back-edge: it goes from a descendant node $u$ to an ancestor node $v$, i.e. $v$ has been discovered but not finished yet (with equality for a self-loop, where $u = v$)
  $$ v.d \le u.d < u.f \le v.f $$ 
  - forward-edge: it goes from an ancestor node $u$ to a descendant node $v$ that is not its child and that has been discovered and already finished, i.e.
  $$ u.d < v.d < v.f < u.f \quad \text{and} \quad v.parent \neq u $$ 
  - cross-edge: it goes from a node $u$ to a node $v$ in the same or another DFS tree, where $v$ is already discovered and finished and is not a descendant of $u$, i.e.
  $$ v.d < v.f < u.d < u.f $$ 
- in an undirected graph all edges are either tree-edges or back-edges
- non-trivial back-edges are those back-edges $u\rightarrow v$ where the parent of $u$ is not $v$.
- Theorem: a directed graph has a cycle iff a DFS finds a back-edge; an undirected graph has a cycle iff a DFS finds a non-trivial back-edge
![graph4.png](graph4.png)
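
The four edge types can be told apart during the DFS itself by looking at the discovery and finish times of $v$ at the moment the edge $u \rightarrow v$ is examined. The following is a minimal sketch for a directed graph given as an adjacency list; the function name `classify_edges` and the example graph are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def classify_edges(adj: List[List[int]]) -> Dict[Tuple[int, int], str]:
    """Label every edge of a directed graph as tree, back, forward or cross."""
    n = len(adj)
    disc = [0] * n  # discovery time, 0 means "not discovered yet"
    fin = [0] * n   # finish time, 0 means "not finished yet"
    kinds: Dict[Tuple[int, int], str] = {}
    clock = 0

    def visit(u: int) -> None:
        nonlocal clock
        clock += 1
        disc[u] = clock
        for v in adj[u]:
            if disc[v] == 0:
                kinds[(u, v)] = "tree"     # v is discovered through this edge
                visit(v)
            elif fin[v] == 0:
                kinds[(u, v)] = "back"     # v is an unfinished ancestor of u
            elif disc[u] < disc[v]:
                kinds[(u, v)] = "forward"  # v is a finished descendant of u
            else:
                kinds[(u, v)] = "cross"    # v finished before u was discovered
        clock += 1
        fin[u] = clock

    for start in range(n):
        if disc[start] == 0:
            visit(start)
    return kinds


# edges: 0->1, 0->2, 1->2, 2->0, 3->2
adj = [[1, 2], [2], [0], [2]]
kinds = classify_edges(adj)
print(kinds)
has_cycle = "back" in kinds.values()
print(has_cycle)  # True, because 2->0 is a back-edge
```

For this graph the sketch labels `0->1` and `1->2` as tree-edges, `2->0` as a back-edge, `0->2` as a forward-edge and `3->2` as a cross-edge.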

### Topological Sort
Given tasks $u_1,u_2,\dots$ which form a Directed Acyclic Graph (DAG) $G$, each edge $u \rightarrow v$ means that $v$ should be performed after $u$.
To topologically sort the tasks means to order them so that every task appears after all the tasks it depends on. Sorting the tasks in decreasing order of their finish times in a DFS on $G$ gives such an order.
To topologically sort a DAG we do the following:
- perform a DFS on the graph
- each time a node finishes, store it at the beginning of a linked list
- the resulting linked list will be in topologically sorted order
- time complexity: $\mathcal{O}(|V|+|E|)$ for the DFS
- space complexity: $\mathcal{O}(|V|)$ for the linked list with $|V|$ nodes
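
The steps above can be sketched as follows, using a `deque` in place of the linked list so that a finished node can be pushed to the front in constant time. The function name `topological_sort` and the example DAG are assumptions for this illustration, and the input is assumed to be acyclic (no check is made).

```python
from collections import deque
from typing import List


def topological_sort(adj: List[List[int]]) -> List[int]:
    """Return the nodes of a DAG in decreasing order of DFS finish time."""
    n = len(adj)
    visited = [False] * n
    order: deque = deque()

    def visit(u: int) -> None:
        visited[u] = True
        for v in adj[u]:
            if not visited[v]:
                visit(v)
        # u finishes only after everything reachable from it has finished
        order.appendleft(u)

    for start in range(n):
        if not visited[start]:
            visit(start)
    return list(order)


# edges: 0->1, 0->2, 1->3, 2->3, 4->0
adj = [[1, 2], [3], [3], [], [0]]
order = topological_sort(adj)
print(order)  # [4, 0, 2, 1, 3]
```

A topological order is generally not unique; any order in which every edge points from an earlier to a later position is valid.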

### Strongly Connected Components (SCC)
Given a graph $G(V,E)$:
- a subset $S\subseteq V$ of vertices $S = \{v_1,v_2,\dots, v_k\}$, with $k\le n$, is an SCC if for any $v_i,v_j\in S$ there is a path from $v_i$ to $v_j$ and that path is contained in $S$
- a maximal strongly connected component (MSCC) is a subset $S\subseteq V$ that is an SCC and cannot be enlarged, i.e. for any other subset $\hat{S}\subseteq V$ with $\hat{S} \nsubseteq S$, the set $S\cup \hat{S}$ is not an SCC.
- an undirected connected graph is an MSCC as a whole
- a self-pointing node is an SCC.  
The transpose of a graph is defined as the graph with the same vertices and all edges reversed, i.e.
$$G^\top(V, E^\top) \,\, \text{with} \,\, E^{\top} = \{(v,u) \,\, \text{for all } (u,v)\in E\} $$
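
As a small illustration of this definition on an unweighted adjacency list, every edge $u \rightarrow v$ is stored as $v \rightarrow u$ in the new graph. The function name `transpose` and the example graph are assumptions for this sketch.

```python
from typing import List


def transpose(adj: List[List[int]]) -> List[List[int]]:
    """Return the adjacency list of the graph with all edges reversed."""
    rev: List[List[int]] = [[] for _ in adj]
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            rev[v].append(u)  # edge u -> v becomes v -> u
    return rev


# edges: 0->1, 1->2, 2->0, 2->3
adj = [[1], [2], [0, 3], []]
rev = transpose(adj)
print(rev)             # [[2], [0], [1], [2]]
print(transpose(rev))  # [[1], [2], [0, 3], []]
```

Transposing twice gives back the original edge set, and the whole operation takes $\mathcal{O}(|V|+|E|)$ time.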
#### Properties of MSCC for a graph $G(V,E)$:
- if $S_1,S_2$ are two different MSCCs then $S_1\cap S_2 = \emptyset$
- the vertices of $G$ are partitioned by its MSCCs $S_1,\dots,S_k$, i.e.
$$V = \bigcup\limits_{i=1}^{k}S_i$$
- the supergraph of $G$ is defined by compressing each of its MSCCs to a single node
- the supergraph of $G$ is a directed acyclic graph
- the transpose graph $G^\top$ has the same MSCCs as $G$  
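
To make the supergraph concrete, the sketch below compresses a directed graph once every vertex has been given the label of its MSCC. The function name `condensation_edges` is hypothetical, and the labelling `component_of` is assumed to be already known; for this example graph it corresponds to the MSCCs $\{0,1,4\}$, $\{2,3\}$, $\{5,6\}$ and $\{7\}$.

```python
from typing import List, Set, Tuple


def condensation_edges(adj: List[List[int]],
                       component_of: List[int]) -> Set[Tuple[int, int]]:
    """Return the edges of the supergraph, one node per component label."""
    super_edges: Set[Tuple[int, int]] = set()
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            cu, cv = component_of[u], component_of[v]
            if cu != cv:
                # edges inside a component disappear, the rest link components
                super_edges.add((cu, cv))
    return super_edges


adj = [[1], [2, 4], [3, 6], [2, 7], [0, 5], [6], [5, 7], [7]]
component_of = [0, 0, 1, 1, 0, 2, 2, 3]  # assumed MSCC label of each vertex
super_edges = condensation_edges(adj, component_of)
print(sorted(super_edges))  # [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
```

In this example every supergraph edge goes from a smaller label to a larger one, so no cycle is possible, in line with the property that the supergraph is acyclic.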
#### To find the MSCCs of a directed graph $G$ we perform the following:
- run a DFS on $G$ and keep a list of the vertices in decreasing order of finish time
- compute the transpose graph $G^\top$
- run a DFS on $G^\top$, picking the starting vertices in the order of the list
- each tree of the resulting DFS forest contains exactly one MSCC
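
These steps are commonly known as Kosaraju's algorithm. The following is a compact sketch on a plain adjacency list, independent of any graph class; the function name `strongly_connected_components` is an assumption for this illustration, and recursion is used for brevity, so very deep graphs may need an explicit stack.

```python
from typing import List


def strongly_connected_components(adj: List[List[int]]) -> List[List[int]]:
    """Return the maximal strongly connected components of a directed graph."""
    n = len(adj)

    # pass 1: DFS on G, recording nodes in increasing order of finish time
    visited = [False] * n
    finish_order: List[int] = []

    def visit(u: int) -> None:
        visited[u] = True
        for v in adj[u]:
            if not visited[v]:
                visit(v)
        finish_order.append(u)

    for start in range(n):
        if not visited[start]:
            visit(start)

    # build the transpose graph (all edges reversed)
    rev: List[List[int]] = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            rev[v].append(u)

    # pass 2: DFS on the transpose in decreasing finish time;
    # each DFS tree is one component
    assigned = [False] * n
    components: List[List[int]] = []

    def collect(u: int, component: List[int]) -> None:
        assigned[u] = True
        component.append(u)
        for v in rev[u]:
            if not assigned[v]:
                collect(v, component)

    for start in reversed(finish_order):
        if not assigned[start]:
            component: List[int] = []
            collect(start, component)
            components.append(sorted(component))
    return components


adj = [[1], [2, 4], [3, 6], [2, 7], [0, 5], [6], [5, 7], [7]]
components = strongly_connected_components(adj)
print(components)  # [[0, 1, 4], [2, 3], [5, 6], [7]]
```

Both passes and the transposition are linear, so the total running time is $\mathcal{O}(|V|+|E|)$.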


In [1]:
class GraphAdjMatrix:
    '''Implementation of a graph using an adjacency matrix
    '''

    def __init__(self, size: int, directed: bool = False, weighted: bool = False):
        self.size = size
        self.adj_matrix = [[0]*size for _ in range(size)]
        self.vertex_data = [None]*self.size
        self.directed = directed
        self.weighted = weighted

    def add_edge(self, u: int, v: int, w=1):
        if not self.weighted:
            w = 1
        # valid vertex indices are 0 .. size-1
        if 0 <= u < self.size and 0 <= v < self.size:
            self.adj_matrix[u][v] = w
            if not self.directed:
                self.adj_matrix[v][u] = w

    def add_vertex_data(self, vertex: int, data=None):
        # use the vertex index as its label when no data is given
        if data is None:
            data = vertex

        if 0 <= vertex < self.size:
            self.vertex_data[vertex] = data

    def print_graph(self):
        for vertex, nbrhs in enumerate(self.adj_matrix):
            # str() lets the labels be non-string objects, e.g. the default int index
            print(f'Vertex {self.vertex_data[vertex]} is connected to {",".join(str(self.vertex_data[v]) for v, w in enumerate(nbrhs) if w != 0)}')

In [1]:
from collections import deque

class GraphAdjList:
    '''Implementation of a graph using an adjacency list
    '''

    def __init__(self, size: int, directed: bool = False, weighted: bool = False):
        self.size = size
        self.adj_list = [set() for _ in range(size)]
        self.vertex_data = [None]*self.size
        self.directed = directed
        self.weighted = weighted

    def add_edge(self, u: int, v: int, w=1):
        if not self.weighted:
            w = 1
        # valid vertex indices are 0 .. size-1
        if 0 <= u < self.size and 0 <= v < self.size:
            self.adj_list[u].add((v, w))
            if not self.directed:
                self.adj_list[v].add((u, w))

    def add_vertex_data(self, vertex: int, data=None):
        # use the vertex index as its label when no data is given
        if data is None:
            data = vertex

        if 0 <= vertex < self.size:
            self.vertex_data[vertex] = data

    def breadth_first_search(self, starting_node:int = 0):
        '''Implement the BFS algorithm
        '''
        # keep a list with boolean values that indicate which nodes have been visited
        # initialize all to False
        visited = [False]*self.size
        # keep a list with the time of discovery and the parent node of each discovered node
        node_time_parent = [{'node' : starting_node, 'time' :0, 'parent':None}]
        # keep a FIFO queue with the discovered nodes whose adjacent nodes have not been examined yet
        q = deque()
        # start at the starting node
        visited[starting_node] = True
        q.append(starting_node)
        current_time = 0

        # continue searching adjacent nodes while there are nodes in the queue
        while q:
            # examine the first node in the queue
            node_examined = q.popleft()
            # examine all of its adjacent nodes
            for (adj_node, w) in self.adj_list[node_examined]:
                # check if the adjacent node has been visited before; if not, add it to the queue
                if not visited[adj_node]:
                    current_time+=1
                    q.append(adj_node)
                    visited[adj_node] = True
                    # also record the discovery time and the parent
                    node_time_parent.append({'node':adj_node, 'time': current_time, 'parent': node_examined})
        
        # print the discovered nodes
        for node_info in node_time_parent:
            print(f"node {node_info['node']}, discovered at time {node_info['time']} from the parent node {node_info['parent']}")
    
    def depth_first_search(self, starting_node: int = 0):
        '''Initialize the DFS algorithm. 
        Increase the global time when discovering a new node and when closing a node.
        '''
        global time 
        time = 1
        # keep a list with boolean values to indicate which nodes have been visited
        visited = [False]*self.size
        # when discovering a node keep the node, time of discovery, parent and finished time of that node
        node_time_parent = {starting_node: {'node' : starting_node, 'time_discovered' : time, 'time_finished':0, 'parent':None}}
        print(f"start at node {starting_node} at time {time}")
        # keep a list with the all the backedges in the form [(u,v,w)]
        backedges = []
        self._depth_first_search(starting_node, visited, node_time_parent, backedges)
        
        # print the non-trivial backedges
        non_trivial_backedges = []
        for (u,v,w) in backedges:
            if node_time_parent[u]['parent']!=v:
                non_trivial_backedges.append((u,v,w))
        
        for (u,v,w) in non_trivial_backedges:
            print(f"non-trivial-backedge from node {u} to node {v}")
    
    def _depth_first_search(self, node_examined: int, visited: list, node_time_parent : dict, backedges: list, topological_sort : deque = None):
        global time
    
        visited[node_examined] = True
        for (adj_node,w) in self.adj_list[node_examined]:
            if not visited[adj_node]:
                time+=1
                node_time_parent[adj_node] = {'node':adj_node, 'time_discovered': time,'time_finished': 0, 'parent': node_examined}
                print(f'discover node {adj_node} from parent node {node_examined} at time {time}')
                self._depth_first_search(adj_node, visited, node_time_parent, backedges, topological_sort)
            else:
                # the adjacent node is already visited: if it has not finished yet, this edge is a backedge
                # for undirected graphs this includes all the trivial backedges (child -> parent)
                # only nodes of the current dfs tree are stored in node_time_parent;
                # classifying cross-edges between dfs trees would need the dictionaries of the other trees as well
                if adj_node in node_time_parent and node_time_parent[adj_node]['time_finished']==0:
                    backedges.append((node_examined, adj_node, w))
        
        # set the finished time for the examined node
        time+=1
        node_time_parent[node_examined]['time_finished'] = time
        
        if topological_sort is not None:
            topological_sort.appendleft(node_examined)
            
        print(f'node {node_examined} finished at time {time}')
    
    def traverse_graph(self, vertex_order : list = None, topological_sort: deque = None):
        '''Run DFS for each node in the graph, if it is not visited already.
        '''
        
        global time 
        time = 0
        # keep a list with boolean values to indicate which nodes have been visited
        visited = [False]*self.size
        # keep a list with all the backedges in the form [(u,v,w)]
        backedges = []
        # keep a list with the dfs trees, i.e. for each unvisited starting node that we run dfs from, store the resulting dictionary representing that dfs tree
        dfs_forest = []
        
        if vertex_order is None:
            vertex_order = list(range(self.size))
        
        for starting_node in vertex_order:
            # when discovering a node keep the node, time of discovery, parent and finished time of that node
            if not visited[starting_node]:
                # keep a dictionary with keys the int number of the nodes and values a dict with info about that node
                node_time_parent = {}
                time+=1
                node_time_parent[starting_node] = {'node' : starting_node, 'time_discovered' : time, 'time_finished':0, 'parent':None}
                print(f"start at node {starting_node} at time {time}")
                self._depth_first_search(starting_node, visited, node_time_parent, backedges, topological_sort)
                dfs_forest.append(node_time_parent)
            else:
                print(f'node {starting_node} is already visited')    
        
        # combine all dfs trees dictionaries into one to find all the non-trivial back-edges
        node_time_parent = {}
        for d in dfs_forest:
            node_time_parent |=d
            
        # print the non-trivial backedges
        non_trivial_backedges = []
        for (u,v,w) in backedges:
            if node_time_parent[u]['parent']!=v:
                non_trivial_backedges.append((u,v,w))
        
        for (u,v,w) in non_trivial_backedges:
            print(f"non-trivial-backedge from node {u} to node {v}")
        
        return dfs_forest 
    
    
    def transpose_graph(self):
        '''Return a new graph object with the same vertices and all edges reversed
        '''
        
        transpose_graph = GraphAdjList(size=self.size, directed=self.directed, weighted=self.weighted)
        # copy the labels so that later changes to one graph do not affect the other
        transpose_graph.vertex_data = list(self.vertex_data)
        
        for vertex, vertex_adj_set in enumerate(self.adj_list):
            for adj_vertex, w in vertex_adj_set:
                # the edge vertex -> adj_vertex becomes adj_vertex -> vertex
                transpose_graph.adj_list[adj_vertex].add((vertex, w))
        
        return transpose_graph
    
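    def get_mscc(self):
        '''Return the maximal strongly connected components as a list of tuples
        '''
        g_t = self.transpose_graph()
        # despite its name, this deque holds the nodes in decreasing order of finish time;
        # that is a topological order only when the graph is a DAG
        topological_sort = deque()
        self.traverse_graph(topological_sort = topological_sort)
        print(f'nodes in topological order {list(topological_sort)}')
        # run a DFS on the transpose graph, picking the starting vertices in that order
        transpose_dfs_forest = g_t.traverse_graph(vertex_order=topological_sort)
        
        # each dfs tree of the transpose graph is one mscc; return them as a list of tuples
        return [tuple(tree_dict.keys()) for tree_dict in transpose_dfs_forest]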

    def print_graph(self):
        for vertex, nbrhs in enumerate(self.adj_list):
            # str() lets the labels be non-string objects, e.g. the default int index
            print(f"Vertex {self.vertex_data[vertex]} is connected to {','.join(str(self.vertex_data[v]) for (v, w) in nbrhs)}") 

In [101]:
g = GraphAdjMatrix(4)
g.add_vertex_data(0, 'A')
g.add_vertex_data(1, 'B')
g.add_vertex_data(2, 'C')
g.add_vertex_data(3, 'D')
g.add_edge(0, 1)  # A - B
g.add_edge(0, 2)  # A - C
g.add_edge(0, 3)  # A - D
g.add_edge(1, 2)  # B - C

g.print_graph()

Vertex A is connected to B,C,D
Vertex B is connected to A,C
Vertex C is connected to A,B
Vertex D is connected to A


In [102]:
g = GraphAdjList(9)

g.add_vertex_data(0, 'A')
g.add_vertex_data(1, 'B')
g.add_vertex_data(2, 'C')
g.add_vertex_data(3, 'D')
g.add_vertex_data(4, 'E')
g.add_vertex_data(5, 'F')
g.add_vertex_data(6, 'G')
g.add_vertex_data(7, 'H')
g.add_vertex_data(8, 'I')

g.add_edge(1, 0)  # B - A
g.add_edge(2, 0)  # C - A
g.add_edge(1, 6)  # B - G
g.add_edge(6, 8)  # G - I
g.add_edge(6, 7)  # G - H
g.add_edge(3, 2)  # D - C
g.add_edge(2, 4)  # C - E
g.add_edge(3, 4)  # D - E
g.add_edge(5, 4)  # F - E
g.add_edge(5, 6)  # F - G

print(f'run BFS algorithm starting at node 0')
g.breadth_first_search()
print('\n')
print(f'run BFS algorithm starting at node 6')
g.breadth_first_search(starting_node = 6)
print('\n')
print(f'run DFS algorithm starting at node 0')
g.depth_first_search()
print('\n')
print(f'run dFS algorithm starting at node 6')
g.depth_first_search(starting_node = 6)

run BFS algorithm starting at node 0
node 0, discovered at time 0 from the parent node None
node 1, discovered at time 1 from the parent node 0
node 2, discovered at time 2 from the parent node 0
node 6, discovered at time 3 from the parent node 1
node 4, discovered at time 4 from the parent node 2
node 3, discovered at time 5 from the parent node 2
node 7, discovered at time 6 from the parent node 6
node 5, discovered at time 7 from the parent node 6
node 8, discovered at time 8 from the parent node 6


run BFS algorithm starting at node 6
node 6, discovered at time 0 from the parent node None
node 1, discovered at time 1 from the parent node 6
node 7, discovered at time 2 from the parent node 6
node 5, discovered at time 3 from the parent node 6
node 8, discovered at time 4 from the parent node 6
node 0, discovered at time 5 from the parent node 1
node 4, discovered at time 6 from the parent node 5
node 2, discovered at time 7 from the parent node 0
node 3, discovered at time 8 from 

In [103]:
g = GraphAdjList(8, directed=True)

g.add_vertex_data(0, 'A')
g.add_vertex_data(1, 'B')
g.add_vertex_data(2, 'C')
g.add_vertex_data(3, 'D')
g.add_vertex_data(4, 'E')
g.add_vertex_data(5, 'F')
g.add_vertex_data(6, 'G')
g.add_vertex_data(7, 'H')

g.add_edge(0, 3)  
g.add_edge(2, 0)  
g.add_edge(1, 0) 
g.add_edge(1, 2) 
g.add_edge(3, 2) 
g.add_edge(3, 4) 
g.add_edge(3, 5) 
g.add_edge(4, 5) 
g.add_edge(6, 7)  
g.add_edge(7, 6)  

print(f'run BFS algorithm starting at node 0')
g.breadth_first_search()
print('\n')
print(f'run BFS algorithm starting at node 6')
g.breadth_first_search(starting_node = 6)
print('\n')
print(f'run DFS algorithm starting at node 0')
g.depth_first_search()
print('\n')
print(f'run dFS algorithm starting at node 6')
g.depth_first_search(starting_node = 6)

run BFS algorithm starting at node 0
node 0, discovered at time 0 from the parent node None
node 3, discovered at time 1 from the parent node 0
node 5, discovered at time 2 from the parent node 3
node 4, discovered at time 3 from the parent node 3
node 2, discovered at time 4 from the parent node 3


run BFS algorithm starting at node 6
node 6, discovered at time 0 from the parent node None
node 7, discovered at time 1 from the parent node 6


run DFS algorithm starting at node 0
start at node 0 at time 1
discover node 3 from parent node 0 at time 2
discover node 5 from parent node 3 at time 3
node 5 finished at time 4
discover node 4 from parent node 3 at time 5
node 4 finished at time 6
discover node 2 from parent node 3 at time 7
node 2 finished at time 8
node 3 finished at time 9
node 0 finished at time 10
non-trivial-backedge from node 2 to node 0


run dFS algorithm starting at node 6
start at node 6 at time 1
discover node 7 from parent node 6 at time 2
node 7 finished at time 3

In [104]:
print(f'run DFS on the whole graph')
g.traverse_graph()

run DFS on the whole graph
start at node 0 at time 1
discover node 3 from parent node 0 at time 2
discover node 5 from parent node 3 at time 3
node 5 finished at time 4
discover node 4 from parent node 3 at time 5
node 4 finished at time 6
discover node 2 from parent node 3 at time 7
node 2 finished at time 8
node 3 finished at time 9
node 0 finished at time 10
start at node 1 at time 11
node 1 finished at time 12
node 2 is already visited
node 3 is already visited
node 4 is already visited
node 5 is already visited
start at node 6 at time 13
discover node 7 from parent node 6 at time 14
node 7 finished at time 15
node 6 finished at time 16
node 7 is already visited
non-trivial-backedge from node 2 to node 0


[{0: {'node': 0, 'time_discovered': 1, 'time_finished': 10, 'parent': None},
  3: {'node': 3, 'time_discovered': 2, 'time_finished': 9, 'parent': 0},
  5: {'node': 5, 'time_discovered': 3, 'time_finished': 4, 'parent': 3},
  4: {'node': 4, 'time_discovered': 5, 'time_finished': 6, 'parent': 3},
  2: {'node': 2, 'time_discovered': 7, 'time_finished': 8, 'parent': 3}},
 {1: {'node': 1, 'time_discovered': 11, 'time_finished': 12, 'parent': None}},
 {6: {'node': 6, 'time_discovered': 13, 'time_finished': 16, 'parent': None},
  7: {'node': 7, 'time_discovered': 14, 'time_finished': 15, 'parent': 6}}]

In [2]:
g = GraphAdjList(8, directed=True)

g.add_vertex_data(0, '0')
g.add_vertex_data(1, '1')
g.add_vertex_data(2, '2')
g.add_vertex_data(3, '3')
g.add_vertex_data(4, '4')
g.add_vertex_data(5, '5')
g.add_vertex_data(6, '6')
g.add_vertex_data(7, '7')

g.add_edge(0, 1)  
g.add_edge(1, 2)  
g.add_edge(2, 3) 
g.add_edge(3, 2) 
g.add_edge(2, 6) 
g.add_edge(3, 7) 
g.add_edge(1, 4) 
g.add_edge(4, 0) 
g.add_edge(4, 5)  
g.add_edge(5, 6)
g.add_edge(6, 5)
g.add_edge(6, 7)
g.add_edge(7, 7)  
  
g.print_graph()

Vertex 0 is connected to 1
Vertex 1 is connected to 4,2
Vertex 2 is connected to 3,6
Vertex 3 is connected to 7,2
Vertex 4 is connected to 0,5
Vertex 5 is connected to 6
Vertex 6 is connected to 7,5
Vertex 7 is connected to 7


In [3]:
g_transpose = g.transpose_graph()
g.get_mscc()

start at node 0 at time 1
discover node 1 from parent node 0 at time 2
discover node 4 from parent node 1 at time 3
discover node 5 from parent node 4 at time 4
discover node 6 from parent node 5 at time 5
discover node 7 from parent node 6 at time 6
node 7 finished at time 7
node 6 finished at time 8
node 5 finished at time 9
node 4 finished at time 10
discover node 2 from parent node 1 at time 11
discover node 3 from parent node 2 at time 12
node 3 finished at time 13
node 2 finished at time 14
node 1 finished at time 15
node 0 finished at time 16
node 1 is already visited
node 2 is already visited
node 3 is already visited
node 4 is already visited
node 5 is already visited
node 6 is already visited
node 7 is already visited
non-trivial-backedge from node 4 to node 0
non-trivial-backedge from node 7 to node 7
nodes in topological order [0, 1, 2, 3, 4, 5, 6, 7]
start at node 0 at time 1
discover node 4 from parent node 0 at time 2
discover node 1 from parent node 4 at time 3
node 1 f

[(0, 4, 1), (2, 3), (5, 6), (7,)]