## Introduction
This tutorial introduces the basics of graphs and some basic methods for exploring the nodes in a graph and searching for a specific node. Graphs are a very useful data structure for representing relationships between objects, and they are used daily to solve challenging problems such as crawling web pages, finding the shortest route from point A to point B on a map, representing friendships on Facebook, and finding the cheapest flight.


### Tutorial content

In this tutorial, we will show how to create graphs and walk through some different ways of traversing them.


We will cover the following topics in this tutorial:
- [Basics of Graphs](#Basics-of-Graphs)
- [DFS](#DFS)
- [BFS](#BFS)
- [DFS and BFS Application: Paths](#DFS-and-BFS-Application:-Paths)
- [Example application: LinkedIn Connections](#Example-application:-LinkedIn-Connections)
- [Summary and references](#Summary-and-references)


## Basics of Graphs

First, we need to define the components of a graph. A node represents some sort of object that holds data; a node is often referred to as a vertex. An edge is a direct connection between two vertices, so it represents the relationship between them. There is a path between two vertices if you can get from one vertex to the other by following a series of connected edges. Finally, two vertices are referred to as adjacent if they have an edge between them.

There are generally two categories of graphs: directed and undirected. In a directed graph, every edge has a direction. For example, if there is an edge from A to B, then you can travel from A to B but not from B to A (unless there is also a separate edge from B to A). In an undirected graph, edges have no restriction on direction, so an edge between A and B allows you to travel from A to B and from B to A.

These different types of graphs are useful in different scenarios. For example, undirected edges make sense for representing friendship on Facebook, because if person A and person B are friends, the friendship holds both ways. Directed edges make sense for representing streets between intersections, because some streets are one-way while others are two-way.

In [97]:
class Node:
    def __init__(self, data):
        self.data = data
        self.edges = []
        
    def __repr__(self):
        return str(self.data)
    
    def add_directed_edge(self, node):
        self.edges.append(node)
        
    def add_undirected_edge(self, node):
        self.edges.append(node)
        node.edges.append(self)
    
    def get_edges(self):
        return self.edges

Above we have the implementation of a node class. Each node has some data associated with it along with a list of the nodes that its outgoing edges lead to. The class also includes methods that let you add edges to the node and get all the nodes that this node has an edge to.

In [98]:
# create Node(A)
A = Node("A")
# create Node(B)
B = Node("B")
# create Node(C)
C = Node("C")
# create Node(D)
D = Node("D")
# create Node(E)
E = Node("E")

# create edge from A to D
A.add_directed_edge(D)
# create edge from A to B
A.add_directed_edge(B)
# create edge from B to C
B.add_directed_edge(C)
# create edge from C to D
C.add_directed_edge(D)
# create edge from C to E
C.add_directed_edge(E)
# create edge from D to E
D.add_directed_edge(E)

print "A has direct edge(s) leading to: " + str(A.get_edges())
print "B has direct edge(s) leading to: " + str(B.get_edges())
print "C has direct edge(s) leading to: " + str(C.get_edges())
print "D has direct edge(s) leading to: " + str(D.get_edges())
print "E has direct edge(s) leading to: " + str(E.get_edges())

A has direct edge(s) leading to: [D, B]
B has direct edge(s) leading to: [C]
C has direct edge(s) leading to: [D, E]
D has direct edge(s) leading to: [E]
E has direct edge(s) leading to: []


The code above creates nodes A, B, C, D, and E, each with its own data, and adds some edges between them. The resulting graph is shown below. The edges printed for each node correspond to the nodes that can be reached directly from that node in the graph. For example, node A has direct edges to nodes D and B, and node E has no edges to any other node.

<img src="graphA.png">
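The example above only uses directed edges. As a minimal Python 3 sketch of how the two kinds of edge differ, the block below repeats a trimmed copy of the `Node` class so that it runs on its own. The node names `X`, `Y`, `P`, and `Q` are made up for illustration.

```python
class Node:
    """Trimmed copy of the tutorial's Node class so this sketch runs on its own."""

    def __init__(self, data):
        self.data = data
        self.edges = []

    def __repr__(self):
        return str(self.data)

    def add_directed_edge(self, node):
        # one-way: only the source node records the edge
        self.edges.append(node)

    def add_undirected_edge(self, node):
        # two-way: both endpoints record the edge
        self.edges.append(node)
        node.edges.append(self)


# directed edge X -> Y: Y does not point back to X
X, Y = Node("X"), Node("Y")
X.add_directed_edge(Y)
print(X.edges, Y.edges)  # [Y] []

# undirected edge P - Q: each node lists the other
P, Q = Node("P"), Node("Q")
P.add_undirected_edge(Q)
print(P.edges, Q.edges)  # [Q] [P]
```

With this representation an undirected edge is simply stored as a pair of directed edges, one in each direction.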


## DFS

Often we want to explore all nodes that are reachable from a starting node. The first algorithm we will discuss that allows us to do this is Depth-First Search (DFS). The concept of DFS is that we start at a node and follow each branch as deep as possible before backtracking.

The steps involved in DFS are outlined below:

1.) Pop a node from the top of the stack, which is initialized with the starting node  
2.) Perform some operation on the selected node, depending on the application of the DFS algorithm  
3.) Look at all adjacent nodes and add each one to the stack if it has not been visited yet  
4.) If the stack is not empty, repeat steps 1-3  

Let's take a look at the implementation:

In [99]:
def dfs(starting_node):
    visited = set()
    visited.add(starting_node)
    stack = [starting_node]
    while len(stack) > 0:
        cur_vertex = stack.pop()
        
        # perform some operation on cur_vertex
        print cur_vertex
        
        children = cur_vertex.get_edges()
        for child in children:
            if child not in visited:
                visited.add(child)
                stack.append(child)
                
    return visited

As you can see, we have implemented DFS using a stack, because we want to add nodes to this stack and be able to pop a node off it when we need to backtrack. Also notice the set called visited. It keeps track of all the nodes we have already seen, which ensures that each node is visited only once. This is very important because it prevents the algorithm from getting stuck in an infinite loop, which could occur if there was a cycle (or loop) in the graph.
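To illustrate why the visited set matters, here is a minimal Python 3 sketch of the same stack-based traversal running on a graph that contains a cycle. It stores the graph as a plain adjacency-list dict instead of `Node` objects so that it is self-contained. The function name `dfs_order` and the graph `cyclic_graph` are assumptions made for this example.

```python
def dfs_order(graph, start):
    """Stack-based DFS over an adjacency-list dict; returns nodes in visit order."""
    visited = {start}
    stack = [start]
    order = []
    while stack:
        cur = stack.pop()
        order.append(cur)
        for neighbor in graph[cur]:
            # skipping nodes we have already seen is what breaks the cycle
            if neighbor not in visited:
                visited.add(neighbor)
                stack.append(neighbor)
    return order


# hypothetical graph with a cycle A -> B -> C -> A, plus an extra edge C -> D
cyclic_graph = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}

print(dfs_order(cyclic_graph, "A"))  # ['A', 'B', 'C', 'D']
```

Without the `neighbor not in visited` check, the edge from C back to A would push A onto the stack again and the loop would never end.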



In [107]:
# create nodes numbered from 1-12
Node_1 = Node(1)
Node_2 = Node(2)
Node_3 = Node(3)
Node_4 = Node(4)
Node_5 = Node(5)
Node_6 = Node(6)
Node_7 = Node(7)
Node_8 = Node(8)
Node_9 = Node(9)
Node_10 = Node(10)
Node_11 = Node(11)
Node_12 = Node(12)

# create edges between nodes
Node_9.add_undirected_edge(Node_11)
Node_9.add_undirected_edge(Node_10)
Node_8.add_undirected_edge(Node_12)
Node_8.add_undirected_edge(Node_9)
Node_3.add_undirected_edge(Node_5)
Node_3.add_undirected_edge(Node_4)
Node_2.add_undirected_edge(Node_6)
Node_2.add_undirected_edge(Node_3)
Node_1.add_undirected_edge(Node_8)
Node_1.add_undirected_edge(Node_7)
Node_1.add_undirected_edge(Node_2)

nodes_visited = dfs(Node_1)
print "\n" + "All visited nodes in graph: " + str(nodes_visited)

1
2
3
4
5
6
7
8
9
10
11
12

All visited nodes in graph: set([7, 1, 5, 10, 4, 11, 8, 12, 3, 2, 9, 6])


Above we created a graph with nodes numbered from 1-12. By running dfs and looking at the printed output, we can see the order in which the nodes were visited, which is also depicted in the graph below.

<img src="graph_numbered.png">

Here we can see that the DFS algorithm explored all the way to the bottom of the leftmost branch, and then backtracked to visit the rest of the nodes in the graph. The dfs function also returns a set of all the nodes visited which we can see includes all the nodes in the graph.

## BFS

As with Depth-First Search, we can use Breadth-First Search (BFS) to explore all nodes that are reachable from a starting node. BFS explores the same set of nodes as DFS, but with the guarantee that nodes closer to the starting node are visited before nodes farther away. The concept of BFS is that we start at a node and explore the graph layer by layer.

The steps involved in BFS are outlined below:

1.) Remove a node from the front of the queue, which is initialized with the starting node  
2.) Perform some operation on the selected node, depending on the application of the BFS algorithm  
3.) Look at all adjacent nodes and add each one to the queue if it has not been visited yet  
4.) If the queue is not empty, repeat steps 1-3  

Let's take a look at the implementation:

In [101]:
import Queue

def bfs(starting_node):
    # keep track of visited nodes
    visited = set()
    visited.add(starting_node)
    queue = Queue.Queue()
    queue.put(starting_node)
    while not queue.empty():
        cur_vertex = queue.get()
        # do whatever you want with cur_vertex here
        print cur_vertex
        
        # add children to queue if they have not been visited
        children = cur_vertex.get_edges()
        for child in children:
            if child not in visited:
                visited.add(child)
                queue.put(child)
    return visited

As you can see, the BFS implementation is very similar to the DFS implementation, except that we are now using a queue instead of a stack. This change in data structure changes the order in which we visit the nodes in the graph and allows us to visit nodes closer to the starting node first.
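To make the stack-versus-queue difference concrete, the following Python 3 sketch uses one function for both traversals and switches only the end from which the next node is taken. It swaps in `collections.deque` for the `Queue.Queue` used above, since a deque can act as either a stack or a queue. The function name `traverse` and the small `tree` graph are assumptions for illustration.

```python
from collections import deque


def traverse(graph, start, breadth_first):
    """Visit every node reachable from start; only the removal end differs."""
    visited = {start}
    frontier = deque([start])
    order = []
    while frontier:
        # popleft() gives queue (FIFO) behaviour, pop() gives stack (LIFO) behaviour
        cur = frontier.popleft() if breadth_first else frontier.pop()
        order.append(cur)
        for neighbor in graph[cur]:
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return order


# hypothetical tree: 1 has children 2 and 3, and 2 has children 4 and 5
tree = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}

print(traverse(tree, 1, breadth_first=True))   # [1, 2, 3, 4, 5]
print(traverse(tree, 1, breadth_first=False))  # [1, 3, 2, 5, 4]
```

The queue version finishes nodes 2 and 3 before moving down to 4 and 5, while the stack version runs down one branch before returning to the other.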

<img src="graph_numbered.png">

Using the same graph from the DFS section with nodes numbered 1-12 (pictured above), we can run our BFS algorithm starting from Node_1 to see the order in which the nodes are visited.

In [106]:
nodes_visited = bfs(Node_1)

print "\n" + "All visited nodes in graph: " + str(nodes_visited)

1
8
7
2
12
9
6
3
11
10
5
4

All visited nodes in graph: set([8, 2, 12, 5, 9, 1, 6, 10, 3, 7, 11, 4])


Here we can see that the BFS algorithm explored the nodes in the graph layer by layer, starting with the nodes closest to Node_1. First node 1 was visited, then the second layer (nodes 8, 7, and 2), then the third layer (nodes 12, 9, 6, and 3), and finally the last layer (nodes 11, 10, 5, and 4). The bfs function also returns a set of all the nodes visited, which we can see includes all the nodes in the graph, just as with the DFS algorithm.
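The layers described above can also be computed directly. The Python 3 sketch below records, for each node, how many edges separate it from the starting node. It rebuilds the numbered graph as an adjacency-list dict so that it runs on its own. The names `bfs_layers` and `numbered_graph` are assumptions for this example.

```python
from collections import deque


def bfs_layers(graph, start):
    """Return {node: number of edges from start} for every reachable node."""
    distance = {start: 0}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for neighbor in graph[cur]:
            if neighbor not in distance:
                distance[neighbor] = distance[cur] + 1
                queue.append(neighbor)
    return distance


# the 12-node graph from this tutorial, written as an adjacency-list dict
numbered_graph = {
    1: [8, 7, 2], 2: [6, 3, 1], 3: [5, 4, 2], 4: [3], 5: [3], 6: [2],
    7: [1], 8: [12, 9, 1], 9: [11, 10, 8], 10: [9], 11: [9], 12: [8],
}

distance = bfs_layers(numbered_graph, 1)

# group the nodes by their distance from node 1
layers = {}
for node, d in distance.items():
    layers.setdefault(d, []).append(node)

for d in sorted(layers):
    print("layer %d: %s" % (d, sorted(layers[d])))
# layer 0: [1]
# layer 1: [2, 7, 8]
# layer 2: [3, 6, 9, 12]
# layer 3: [4, 5, 10, 11]
```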

## DFS and BFS Application: Paths

One common application of DFS and BFS is to search for a node in a graph and, if it is present, find a path from the starting node to that node. Below we tweak the dfs and bfs algorithms defined above to search for a node and return a path to it.

In [103]:
def dfs_path_to_goal(starting_node, target_node):
    # the starting node is trivially a path to itself
    if starting_node == target_node:
        return [starting_node]
    visited = set()
    visited.add(starting_node)
    initial_path = [starting_node]
    stack = [(starting_node, initial_path)]
    while len(stack) > 0:
        (cur_vertex, cur_path) = stack.pop()
        
        children = cur_vertex.get_edges()
        for child in children:
            if child not in visited:
                visited.add(child)
                
                new_path = cur_path + [child]
                if child == target_node:
                    return new_path
                else:
                    stack.append((child, new_path))
                
    return False



import Queue
def bfs_path_to_goal(starting_node, target_node):
    # the starting node is trivially a path to itself
    if starting_node == target_node:
        return [starting_node]
    # keep track of visited nodes
    visited = set()
    visited.add(starting_node)
    queue = Queue.Queue()
    initial_path = [starting_node]
    queue.put((starting_node, initial_path))
    while not queue.empty():
        (cur_vertex, cur_path) = queue.get()
        
        # add children to queue if they have not been visited
        children = cur_vertex.get_edges()
        for child in children:
            if child not in visited:
                visited.add(child)
                
                new_path = cur_path + [child]
                if child == target_node:
                    return new_path
                else:
                    queue.put((child, new_path))
                    
    return False


print "Path from Node A to Node E using DFS: " + str(dfs_path_to_goal(A, E))
print "Path from Node A to Node E using BFS: " + str(bfs_path_to_goal(A, E))

Path from Node A to Node E using DFS: [A, B, C, E]
Path from Node A to Node E using BFS: [A, D, E]


As we can see above, the DFS algorithm returns a valid path from A to E going through nodes B and C. But observe how BFS returns a shorter valid path from A to E that only goes through node D. Because the BFS algorithm visits nodes layer by layer, it is guaranteed to find a path with the fewest edges in an unweighted graph like this one.
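The functions above store a full copy of the path with every entry in the stack or queue. As an alternative sketch in Python 3, BFS can instead remember only the parent of each node and rebuild the path at the end. This is a swapped-in variant, not the tutorial's implementation, and the names `bfs_shortest_path` and `graph_a` are assumptions for illustration.

```python
from collections import deque


def bfs_shortest_path(graph, start, target):
    """Return a path with the fewest edges from start to target, or None."""
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            # walk the parent links back to the start, then reverse
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for neighbor in graph[cur]:
            if neighbor not in parent:
                parent[neighbor] = cur
                queue.append(neighbor)
    return None


# same directed edges as the A-E graph above, written as an adjacency-list dict
graph_a = {
    "A": ["D", "B"],
    "B": ["C"],
    "C": ["D", "E"],
    "D": ["E"],
    "E": [],
}

print(bfs_shortest_path(graph_a, "A", "E"))  # ['A', 'D', 'E']
print(bfs_shortest_path(graph_a, "E", "A"))  # None
```

The second call returns `None` because the edges are directed and nothing leads out of E.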

## Example application: LinkedIn Connections

As a final example, let's consider a simplified version of LinkedIn's connections feature. LinkedIn allows you to create a professional profile and connect to people you know in order to maintain and expand your network. An important feature in LinkedIn is finding how many connections away you are from another person. If you are connected to another person on LinkedIn, then that person is considered a first-degree connection. If you are two connections away from someone, then that person is a second-degree connection. The same pattern repeats for third-degree, fourth-degree, and further connections.

Let's take a simple example where we have a bunch of professional profiles connected to each other and we want to write an algorithm to find what degree connection someone has from someone else. 

Let's consider the profiles of 5 people named Peter, Sachin, Rahul, Tom, and Tim where Peter is connected to both Sachin and Rahul, Sachin is connected to Tom and Rahul, and Tim is connected to Tom and Rahul. The graph is implemented below and also shown in the image below. 

In [104]:
# create nodes for each profile
Peter = Node("Peter")
Sachin = Node("Sachin")
Rahul = Node("Rahul")
Tom = Node("Tom")
Tim = Node("Tim")

# create edges between nodes
Peter.add_undirected_edge(Sachin)
Peter.add_undirected_edge(Rahul)
Sachin.add_undirected_edge(Tom)
Sachin.add_undirected_edge(Rahul)
Tim.add_undirected_edge(Tom)
Tim.add_undirected_edge(Rahul)


<img src="linkedin_graph.png" alt="Drawing" style="width: 400px;"/>


We are interested in finding how many connections away one person is from another. In this case, we want the smallest number of connections between the two people, so it makes sense to use BFS because it is guaranteed to return a path with the fewest edges.

Using the bfs_path_to_goal algorithm we defined above, let's find how many connections away Peter is from Tim.

In [108]:
path = bfs_path_to_goal(Peter, Tim)
connections_away = len(path) - 1

print "Peter is " + str(connections_away) + " connections from Tim"
print path

Peter is 2 connections from Tim
[Peter, Rahul, Tim]


As we can see, Peter is only two connections away from Tim, so Tim is a second-degree connection of Peter. Notice that this is the smallest number of connections between Tim and Peter. If the path had gone from Peter to Sachin to Rahul to Tim, we would have gotten 3 connections, or even worse, if it had gone from Peter to Rahul to Sachin to Tom to Tim, we would have gotten 4. Seeing all these different possibilities, it is clear that BFS was the right choice here, because it ensures that we get the shortest path and therefore the smallest number of connections between Peter and Tim.
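One caveat with the approach above is that bfs_path_to_goal returns False when no path exists, so `len(path) - 1` would fail for two people who are not connected at all. The Python 3 sketch below is one hedged way to handle that case: it returns the degree directly and uses `None` for unreachable profiles. The function name `connection_degree` and the extra profile "Ana" are assumptions made for this example.

```python
from collections import deque


def connection_degree(network, person, other):
    """Return the fewest connections between two people, or None if unreachable."""
    degree = {person: 0}  # also serves as the visited set
    queue = deque([person])
    while queue:
        cur = queue.popleft()
        if cur == other:
            return degree[cur]
        for contact in network[cur]:
            if contact not in degree:
                degree[contact] = degree[cur] + 1
                queue.append(contact)
    return None


# the connections from this section, written as an adjacency-list dict
network = {
    "Peter": ["Sachin", "Rahul"],
    "Sachin": ["Peter", "Tom", "Rahul"],
    "Rahul": ["Peter", "Sachin", "Tim"],
    "Tom": ["Sachin", "Tim"],
    "Tim": ["Tom", "Rahul"],
    "Ana": [],  # hypothetical profile with no connections
}

print(connection_degree(network, "Peter", "Tim"))  # 2
print(connection_degree(network, "Peter", "Tom"))  # 2
print(connection_degree(network, "Peter", "Ana"))  # None
```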

## Summary and references

This tutorial highlighted just a few elements of what is possible with graph algorithms. Much more information about graphs and different graph algorithms is available from the following links.

1. Applications of DFS: http://www.geeksforgeeks.org/applications-of-depth-first-search/
2. Applications of BFS: http://www.geeksforgeeks.org/applications-of-breadth-first-traversal/
3. Popular Graph Algorithms: http://www.geeksforgeeks.org/top-algorithms-and-data-structures-for-competitive-programming/
4. Ways to Represent a Graph: https://www.khanacademy.org/computing/computer-science/algorithms/graph-representation/a/representing-graphs
5. Recognizing a Graph Problem: https://www.topcoder.com/community/data-science/data-science-tutorials/introduction-to-graphs-and-their-data-structures-section-1/
6. Graph and Networks Overview: http://www.datasciencecourse.org/graphs.pdf
7. Graph Theory Relevance to Big Data: https://www.wired.com/insights/2014/03/graph-theory-key-understanding-big-data/