# Pathfinding and Graph Search Algorithms

In [1]:
from code.script.path import *

In [2]:
g.vertices.toPandas()

Unnamed: 0,id,latitude,longitude,population
0,Amsterdam,52.379189,4.899431,821752
1,Utrecht,52.092876,5.10448,334176
2,Den Haag,52.078663,4.288788,514861
3,Immingham,53.612389,-0.22219,9642
4,Doncaster,53.52285,-1.13116,302400
5,Hoek van Holland,51.977501,4.13333,9382
6,Felixstowe,51.963749,1.3511,23689
7,Ipswich,52.05917,1.15545,133384
8,Colchester,51.88921,0.90421,104390
9,London,51.509865,-0.118092,8787892


In [3]:
g.edges.toPandas()

Unnamed: 0,src,dst,relationship,cost
0,Amsterdam,Utrecht,EROAD,46
1,Amsterdam,Den Haag,EROAD,59
2,Den Haag,Rotterdam,EROAD,26
3,Amsterdam,Immingham,EROAD,369
4,Immingham,Doncaster,EROAD,74
5,Doncaster,London,EROAD,277
6,Hoek van Holland,Den Haag,EROAD,27
7,Felixstowe,Hoek van Holland,EROAD,207
8,Ipswich,Felixstowe,EROAD,22
9,Colchester,Ipswich,EROAD,32


### Breadth First Search with Apache Spark

BFS is most commonly used as the basis for other, more goal-oriented algorithms. For example, Shortest Path, Connected Components, and Closeness Centrality all use the BFS algorithm. It can also be used on its own to find the shortest path between nodes, where "shortest" means the fewest hops rather than the lowest total weight.

**Find the first medium-sized (by European standards) city that has a population of between 100,000 and 300,000 people:**

In [4]:
(g.vertices
    .filter("population > 100000 and population < 300000")
    .sort("population")
    .toPandas())

Unnamed: 0,id,latitude,longitude,population
0,Colchester,51.88921,0.90421,104390
1,Ipswich,52.05917,1.15545,133384


**Find the shortest path from Den Haag to a medium-sized city**

In [5]:
from_expr = "id='Den Haag'"
to_expr = "population > 100000 and population < 300000 and id <> 'Den Haag'"
result = g.bfs(from_expr, to_expr)

print(result.columns)

['from', 'e0', 'v1', 'e1', 'v2', 'e2', 'to']


Columns beginning with e represent relationships (edges) and columns beginning with v represent nodes (vertices). We’re only interested in the nodes, so let’s filter out any columns that begin with e from the resulting DataFrame:

In [6]:
columns = [column for column in result.columns if not column.startswith("e")]
result.select(columns).toPandas()

Unnamed: 0,from,v1,v2,to
0,"(Den Haag, 52.07866287231445, 4.288787841796875, 514861)","(Hoek van Holland, 51.977500915527344, 4.13332986831665, 9382)","(Felixstowe, 51.963748931884766, 1.351099967956543, 23689)","(Ipswich, 52.05916976928711, 1.1554499864578247, 133384)"
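
To make the traversal itself visible, here is a minimal plain-Python sketch of the same search, independent of Spark. It assumes only the edge and vertex rows displayed above, and it treats every road as traversable in both directions, which is an assumption. The names `EDGES`, `POPULATION`, `build_adjacency`, `bfs_path`, and `is_medium` are illustrative and are not part of GraphFrames.

```python
from collections import deque

# Edge rows copied from the g.edges output shown earlier (cost omitted: BFS ignores weights).
EDGES = [
    ("Amsterdam", "Utrecht"), ("Amsterdam", "Den Haag"), ("Den Haag", "Rotterdam"),
    ("Amsterdam", "Immingham"), ("Immingham", "Doncaster"), ("Doncaster", "London"),
    ("Hoek van Holland", "Den Haag"), ("Felixstowe", "Hoek van Holland"),
    ("Ipswich", "Felixstowe"), ("Colchester", "Ipswich"),
]

# Populations copied from the g.vertices output shown earlier.
# Vertices without a listed population are treated as non-matching.
POPULATION = {
    "Amsterdam": 821752, "Utrecht": 334176, "Den Haag": 514861,
    "Immingham": 9642, "Doncaster": 302400, "Hoek van Holland": 9382,
    "Felixstowe": 23689, "Ipswich": 133384, "Colchester": 104390,
    "London": 8787892,
}


def build_adjacency(edges):
    """Build an undirected adjacency list (assumption: roads go both ways)."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, []).append(src)
    return adjacency


def bfs_path(adjacency, start, is_goal):
    """Return the fewest-hops path from start to the first node satisfying is_goal."""
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node != start and is_goal(node):
            path = []
            while node is not None:  # walk parent links back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # no matching node is reachable


def is_medium(city):
    return 100000 < POPULATION.get(city, 0) < 300000


bfs_result = bfs_path(build_adjacency(EDGES), "Den Haag", is_medium)
print(bfs_result)
```

Because the queue is first-in, first-out, all cities one hop away are examined before any city two hops away, so the first match is also a match with the fewest hops.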


## Shortest Path

The Shortest Path algorithm calculates the shortest (weighted) path between a pair of nodes. It’s useful for user interactions and dynamic workflows because it works in real time.

### Shortest Path (Weighted) with Apache Spark

Use Shortest Path to find optimal routes between a pair of nodes, based on either the number of hops or any weighted relationship value. For example, it can provide real-time answers about degrees of separation, the shortest distance between points, or the least expensive route. You can also use this algorithm to simply explore the connections between particular nodes.

In [7]:
result = shortest_path(g, "Amsterdam", "Colchester", "cost")
result.select("id", "distance", "path").toPandas()

Unnamed: 0,id,distance,path
0,Colchester,347.0,"[Amsterdam, Den Haag, Hoek van Holland, Felixstowe, Ipswich, Colchester]"
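
For comparison, a weighted shortest path can be sketched in plain Python with Dijkstra's algorithm. This is not necessarily how the notebook's `shortest_path` helper is implemented; it is only an illustration, assuming the edge rows displayed earlier and assuming each road can be driven in both directions. `ROADS`, `build_weighted_adjacency`, and `dijkstra_path` are hypothetical names.

```python
import heapq

# Edge rows (src, dst, cost) copied from the g.edges output shown earlier.
ROADS = [
    ("Amsterdam", "Utrecht", 46), ("Amsterdam", "Den Haag", 59),
    ("Den Haag", "Rotterdam", 26), ("Amsterdam", "Immingham", 369),
    ("Immingham", "Doncaster", 74), ("Doncaster", "London", 277),
    ("Hoek van Holland", "Den Haag", 27), ("Felixstowe", "Hoek van Holland", 207),
    ("Ipswich", "Felixstowe", 22), ("Colchester", "Ipswich", 32),
]


def build_weighted_adjacency(roads):
    """Undirected weighted adjacency list (assumption: roads go both ways)."""
    adjacency = {}
    for src, dst, cost in roads:
        adjacency.setdefault(src, []).append((dst, cost))
        adjacency.setdefault(dst, []).append((src, cost))
    return adjacency


def dijkstra_path(adjacency, source, target):
    """Return (total cost, path) for the cheapest route from source to target."""
    best = {source: 0}
    parents = {source: None}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return dist, path[::-1]
        for neighbor, cost in adjacency.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                parents[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf"), []  # target not reachable


sp_distance, sp_path = dijkstra_path(
    build_weighted_adjacency(ROADS), "Amsterdam", "Colchester"
)
print(sp_distance, sp_path)
```

Stopping as soon as the target is popped from the heap is what keeps a single-pair query cheap: nodes farther away than the target are never expanded.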


## All Pairs Shortest Path
The All Pairs Shortest Path (APSP) algorithm calculates the shortest (weighted) path between all pairs of nodes. It’s more efficient than running the Single Source Shortest Path algorithm for every pair of nodes in the graph. APSP optimizes operations by keeping track of the distances calculated so far and running on nodes in parallel. Those known distances can then be reused when calculating the shortest path to an unseen node.
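
One classic way to reuse known distances is the Floyd-Warshall algorithm, sketched below in plain Python. It is a swapped-in technique for illustration, not the method GraphFrames uses. The sketch assumes only the edge rows displayed earlier, treated as two-way roads, so its numbers apply to that partial edge list. `ROADS` and `floyd_warshall` are hypothetical names.

```python
from itertools import product

# Edge rows (src, dst, cost) copied from the g.edges output shown earlier.
ROADS = [
    ("Amsterdam", "Utrecht", 46), ("Amsterdam", "Den Haag", 59),
    ("Den Haag", "Rotterdam", 26), ("Amsterdam", "Immingham", 369),
    ("Immingham", "Doncaster", 74), ("Doncaster", "London", 277),
    ("Hoek van Holland", "Den Haag", 27), ("Felixstowe", "Hoek van Holland", 207),
    ("Ipswich", "Felixstowe", 22), ("Colchester", "Ipswich", 32),
]


def floyd_warshall(roads):
    """Return dist[a][b], the cheapest total cost between every pair of nodes."""
    nodes = sorted({city for src, dst, _ in roads for city in (src, dst)})
    inf = float("inf")
    dist = {a: {b: (0 if a == b else inf) for b in nodes} for a in nodes}
    for src, dst, cost in roads:
        # Assumption: each road can be travelled in both directions.
        dist[src][dst] = min(dist[src][dst], cost)
        dist[dst][src] = min(dist[dst][src], cost)
    # Allow one more intermediate node per outer iteration, reusing
    # the distances already computed for the earlier intermediates.
    for mid in nodes:
        for a, b in product(nodes, repeat=2):
            through_mid = dist[a][mid] + dist[mid][b]
            if through_mid < dist[a][b]:
                dist[a][b] = through_mid
    return dist


apsp = floyd_warshall(ROADS)
print(apsp["Amsterdam"]["Colchester"])
print(apsp["Immingham"]["Hoek van Holland"])
```

The triple loop costs time proportional to the cube of the node count, which is acceptable for a handful of cities but is the reason distributed implementations restrict or parallelize the computation.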

### All Pairs Shortest Path with Apache Spark

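All Pairs Shortest Path is commonly used for understanding alternate routing when the shortest route is blocked or becomes suboptimal. For example, this algorithm is used in logical route planning to find multiple good paths for diversity routing. Use All Pairs Shortest Path when you need to consider all possible routes between all or most of your nodes. Note that the GraphFrames `shortestPaths` method takes a list of landmark vertices and measures distance in hops, ignoring the `cost` column, so the numbers in the output are hop counts to each landmark rather than summed costs.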

In [8]:
result = g.shortestPaths(["Colchester", "Immingham", "Hoek van Holland"])
result.sort(["id"]).select("id", "distances").show(truncate = False)

+----------------+--------------------------------------------------------+
|id              |distances                                               |
+----------------+--------------------------------------------------------+
|Amsterdam       |[Immingham -> 1, Hoek van Holland -> 2, Colchester -> 4]|
|Colchester      |[Hoek van Holland -> 3, Immingham -> 3, Colchester -> 0]|
|Den Haag        |[Hoek van Holland -> 1, Immingham -> 2, Colchester -> 4]|
|Doncaster       |[Hoek van Holland -> 4, Immingham -> 1, Colchester -> 2]|
|Felixstowe      |[Immingham -> 4, Hoek van Holland -> 1, Colchester -> 2]|
|Gouda           |[Hoek van Holland -> 2, Immingham -> 3, Colchester -> 5]|
|Hoek van Holland|[Immingham -> 3, Hoek van Holland -> 0, Colchester -> 3]|
|Immingham       |[Hoek van Holland -> 3, Immingham -> 0, Colchester -> 3]|
|Ipswich         |[Immingham -> 4, Hoek van Holland -> 2, Colchester -> 1]|
|London          |[Hoek van Holland -> 4, Immingham -> 2, Colchester -> 1]|
|Rotterdam  

## Single Source Shortest Path
The SSSP algorithm calculates the shortest (weighted) path from a root node to all other nodes in the graph.

### Single Source Shortest Path with Apache Spark

Use Single Source Shortest Path when you need to evaluate the optimal route from a fixed start point to all other individual nodes. Because the route is chosen based on the total path weight from the root, it’s useful for finding the best path to each node, but not necessarily when all nodes need to be visited in a single trip.

In [9]:
result = sssp(g, "Amsterdam", "cost")
(result
 .withColumn("via", via_udf("path"))
 .select("id", "distance", "via")
 .sort("distance")
 .toPandas())

Unnamed: 0,id,distance,via
0,Amsterdam,0.0,[]
1,Utrecht,46.0,[]
2,Den Haag,59.0,[]
3,Gouda,81.0,[Utrecht]
4,Rotterdam,85.0,[Den Haag]
5,Hoek van Holland,86.0,[Den Haag]
6,Felixstowe,293.0,"[Den Haag, Hoek van Holland]"
7,Ipswich,315.0,"[Den Haag, Hoek van Holland, Felixstowe]"
8,Colchester,347.0,"[Den Haag, Hoek van Holland, Felixstowe, Ipswich]"
9,Immingham,369.0,[]
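
The idea of repeatedly improving each node's best known distance from the root can be sketched in plain Python with Bellman-Ford style edge relaxation. This is an illustration under stated assumptions, not the implementation behind the notebook's `sssp` helper or `via_udf`: it uses only the edge rows displayed earlier and treats roads as two-way. `ROADS`, `single_source`, and `via` are hypothetical names.

```python
# Edge rows (src, dst, cost) copied from the g.edges output shown earlier.
ROADS = [
    ("Amsterdam", "Utrecht", 46), ("Amsterdam", "Den Haag", 59),
    ("Den Haag", "Rotterdam", 26), ("Amsterdam", "Immingham", 369),
    ("Immingham", "Doncaster", 74), ("Doncaster", "London", 277),
    ("Hoek van Holland", "Den Haag", 27), ("Felixstowe", "Hoek van Holland", 207),
    ("Ipswich", "Felixstowe", 22), ("Colchester", "Ipswich", 32),
]


def single_source(roads, root):
    """Return (distance, parent) maps for the cheapest routes from root."""
    nodes = {city for src, dst, _ in roads for city in (src, dst)}
    distance = {city: float("inf") for city in nodes}
    parent = {city: None for city in nodes}
    distance[root] = 0
    # At most len(nodes) - 1 passes are needed when all costs are non-negative.
    for _ in range(len(nodes) - 1):
        changed = False
        for src, dst, cost in roads:
            # Assumption: each road can be travelled in both directions.
            for a, b in ((src, dst), (dst, src)):
                if distance[a] + cost < distance[b]:
                    distance[b] = distance[a] + cost
                    parent[b] = a
                    changed = True
        if not changed:
            break  # every distance is final
    return distance, parent


def via(parent, node):
    """Intermediate stops between the root and node, excluding both ends."""
    stops = []
    current = parent[node]
    while current is not None and parent[current] is not None:
        stops.append(current)
        current = parent[current]
    return stops[::-1]


sssp_distance, sssp_parent = single_source(ROADS, "Amsterdam")
for city in sorted(sssp_distance, key=sssp_distance.get):
    print(city, sssp_distance[city], via(sssp_parent, city))
```

Each pass lets every node learn from its neighbors' current best distances, which loosely resembles the message-passing style used by distributed graph frameworks.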


## Minimum Spanning Tree
The Minimum (Weight) Spanning Tree algorithm starts from a given node and finds all its reachable nodes and the set of relationships that connect the nodes together with the minimum possible weight. It traverses to the next unvisited node with the lowest weight from any visited node, avoiding cycles.
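
The traversal described above matches Prim's algorithm, which can be sketched in a few lines of plain Python. The graph below is a small hypothetical example (nodes `A` to `D` with made-up weights) chosen because it contains cycles; it is not taken from the transport dataset. `TOY_GRAPH` and `prim_mst` are illustrative names.

```python
import heapq

# Hypothetical undirected weighted graph: node -> [(neighbor, weight), ...].
TOY_GRAPH = {
    "A": [("B", 4), ("C", 1)],
    "B": [("A", 4), ("C", 2), ("D", 5)],
    "C": [("A", 1), ("B", 2), ("D", 8)],
    "D": [("B", 5), ("C", 8)],
}


def prim_mst(adjacency, start):
    """Return the tree edges (src, dst, weight) reachable from start."""
    visited = {start}
    # Frontier holds candidate edges leaving the visited set, cheapest first.
    frontier = [(weight, start, neighbor) for neighbor, weight in adjacency[start]]
    heapq.heapify(frontier)
    tree = []
    while frontier:
        weight, src, dst = heapq.heappop(frontier)
        if dst in visited:
            continue  # taking this edge would close a cycle
        visited.add(dst)
        tree.append((src, dst, weight))
        for neighbor, next_weight in adjacency[dst]:
            if neighbor not in visited:
                heapq.heappush(frontier, (next_weight, dst, neighbor))
    return tree


mst_edges = prim_mst(TOY_GRAPH, "A")
print(mst_edges, sum(weight for _, _, weight in mst_edges))
```

The edges `A-B` (weight 4) and `C-D` (weight 8) are skipped because both of their endpoints are already in the tree by the time they reach the front of the heap.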

### When Should I Use Minimum Spanning Tree?
Use Minimum Spanning Tree when you need the best route to visit all nodes. Because the route is chosen based on the cost of each next step, it’s useful when you must visit all nodes in a single walk. You can use this algorithm for optimizing paths for connected systems like water pipes and circuit design. It’s also employed to approximate some problems with unknown compute times, such as the Traveling Salesman Problem and certain types of rounding problems. Although it may not always find the absolute optimal solution, this algorithm makes potentially complicated and compute-intensive analysis much more approachable.

## Random Walk
The Random Walk algorithm provides a set of nodes on a random path in a graph. The term was first mentioned by Karl Pearson in 1905 in a letter to Nature magazine titled “The Problem of the Random Walk”. Although the concept goes back even further, it’s only more recently that random walks have been applied to network science.

### When Should I Use Random Walk?
Use the Random Walk algorithm as part of other algorithms or data pipelines when you need to generate a mostly random set of connected nodes.
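
A random walk is simple enough to sketch directly in plain Python. The example below assumes the edge rows displayed earlier, treated as two-way roads, and chooses the next city uniformly at random among the current city's neighbors; other walk variants weight that choice differently. `EDGES`, `build_adjacency`, and `random_walk` are hypothetical names, and the seed is arbitrary.

```python
import random

# Edge rows copied from the g.edges output shown earlier (cost is not used here).
EDGES = [
    ("Amsterdam", "Utrecht"), ("Amsterdam", "Den Haag"), ("Den Haag", "Rotterdam"),
    ("Amsterdam", "Immingham"), ("Immingham", "Doncaster"), ("Doncaster", "London"),
    ("Hoek van Holland", "Den Haag"), ("Felixstowe", "Hoek van Holland"),
    ("Ipswich", "Felixstowe"), ("Colchester", "Ipswich"),
]


def build_adjacency(edges):
    """Undirected adjacency list (assumption: roads go both ways)."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, []).append(src)
    return adjacency


def random_walk(adjacency, start, steps, rng):
    """Take up to `steps` hops, picking a uniformly random neighbor each time."""
    walk = [start]
    for _ in range(steps):
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break  # dead end: nowhere left to go
        walk.append(rng.choice(neighbors))
    return walk


ADJACENCY = build_adjacency(EDGES)
walk = random_walk(ADJACENCY, "London", 5, random.Random(42))
print(walk)
```

Passing in a seeded `random.Random` instance keeps the walk reproducible between runs, which helps when the walk feeds a larger pipeline that needs to be debugged.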