In this exercise we determine which stop is the closest to the specified position.

We copy the first part of exercise 4 in order to calculate the distance between each stop
and the specified position.

First we initialise PySpark.


In [1]:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()


We first read all the stops. The specified file is a preprocessed version of the JSON "stops.txt", in which 
each line contains one stop in the format 

halte_id;halte_name;lat;long;town_name

This makes it easier to parse, since it only requires a call to str.split().


In [2]:
stops = sc.textFile("./converted_stops.csv").map(lambda stop: tuple([x.strip() for x in stop.split(";")]))
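
The parsing done inside the map above can be tried on a single line outside Spark. The sketch below is only an illustration: the sample line is hypothetical (its id and coordinates are made up), and "parse_stop" is a helper name introduced here, not part of the exercise.

```python
def parse_stop(line):
    """Split one 'halte_id;halte_name;lat;long;town_name' line into a tuple of stripped strings."""
    return tuple(field.strip() for field in line.split(";"))

# Hypothetical sample line; the values are invented for illustration.
sample = "1000 ; Example Stop ; 51.2200 ; 4.4000 ; Antwerpen"
stop = parse_stop(sample)
print(stop)
```

Note that every field is still a string after parsing, so the coordinates have to be converted with float() before any arithmetic.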


In order to determine the distance between each stop and the specified point, we first need to
add the coordinates of that point to the stops.

The user can select a point by setting the variables "lat" and "long".

The coordinates are attached using a simple map.
The result is an RDD with tuples of the form (stop, point).


In [3]:
# these need to be floats!
lat = 51.21989
long = 4.40346

stops_with_geodata = stops.map(lambda stop: (stop, (lat, long)))


Next, we create a function that determines the distance between the specified
point and a set of coordinates. We use this function to map the
(stop, point) tuples to (stop, distance) tuples.

For this we also need the coordinates of the stops themselves, which the function "get_stop_coord" extracts.

Note: since the earth is approximately a sphere, Euclidean distances computed on
latitude/longitude values are not accurate enough, so I have used an online implementation of the haversine formula:
http://evoling.net/code/haversine/
https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653#4913653
There are many sources; I don't know which one is the original.


In [4]:
def haversine(coord1, coord2):
    """
    Calculate the great circle distance in meters between two points
    on the earth (specified in decimal degrees as (lat, lon) tuples)
    """
    from math import radians, cos, sin, asin, sqrt
    
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371.0 * c
    m = km * 1000.0
    return m

def get_stop_coord(stop):
    """Retrieve the geo coordinate of a stop as a (lat, long) tuple of floats."""
    return float(stop[2]), float(stop[3])

stop_with_distance = stops_with_geodata.map(lambda x: (x[0], haversine(get_stop_coord(x[0]), x[1])))
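
A quick way to gain some confidence in a haversine implementation is to compare it with a known value: one degree of latitude along a meridian corresponds to an arc of R * pi / 180, which is about 111.19 km for R = 6371 km. The sketch below restates the formula compactly and performs that check. The names "haversine_m" and "EARTH_RADIUS_M" are introduced here for illustration only.

```python
from math import radians, cos, sin, asin, sqrt, pi

EARTH_RADIUS_M = 6371000.0  # mean earth radius in meters, same value as used above

def haversine_m(coord1, coord2):
    """Great-circle distance in meters between two (lat, lon) points in decimal degrees."""
    lat1, lon1 = map(radians, coord1)
    lat2, lon2 = map(radians, coord2)
    a = sin((lat2 - lat1) / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * asin(sqrt(a))

# One degree of latitude should span R * pi / 180 meters.
one_degree = haversine_m((51.0, 4.0), (52.0, 4.0))
expected = EARTH_RADIUS_M * pi / 180.0
print(one_degree, expected)
```

The two printed values should agree up to floating-point rounding.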


Now that we've obtained the stops and their respective
distances to the user-specified point, we sort the stops
by that distance using the "sortBy" method. To retrieve the closest stop, we simply
retrieve the first element using the "take" method, which returns
the first n elements of the RDD as a list. Note that "take" does not sort anything itself;
it only reads as many elements as requested, so this final step is cheap once the RDD is sorted.

The result is a single tuple (stop, distance).


In [5]:
stops_sorted_by_distance = stop_with_distance.sortBy(lambda x : x[1])
closest = stops_sorted_by_distance.take(1)[0] # take 0-th element of list of size 1
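
Since only the single nearest stop is needed, a full sort does more work than strictly necessary: a minimum can be found in one pass over the data. The sketch below shows that idea in plain Python on hypothetical (stop, distance) pairs, using the built-in min and heapq.nsmallest. The stop tuples and distances are made up for illustration.

```python
import heapq

# Hypothetical (stop, distance_in_meters) pairs shaped like the RDD elements.
pairs = [
    (("1", "Stop A", "51.0", "4.0", "Town X"), 250.0),
    (("2", "Stop B", "51.1", "4.1", "Town Y"), 72.5),
    (("3", "Stop C", "51.2", "4.2", "Town Z"), 1300.0),
]

# Single pass: pick the pair with the smallest distance.
closest_pair = min(pairs, key=lambda pair: pair[1])

# The n nearest pairs, in increasing distance, without sorting everything.
two_nearest = heapq.nsmallest(2, pairs, key=lambda pair: pair[1])

print(closest_pair[0][1], closest_pair[1])
```

PySpark's RDD API has counterparts for both: on the RDD from this exercise, the analogous calls would be stop_with_distance.min(key=lambda x: x[1]) or stop_with_distance.takeOrdered(1, key=lambda x: x[1])[0].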


Finally, we print the result.


In [6]:
print("The closest stop to ({}, {}) is:".format(lat, long))
print("Name={}".format(closest[0][1]))
print("City={}".format(closest[0][4]))
print("Distance={} meters".format(closest[1]))


The closest stop to (51.21989, 4.40346) is:
Name=Melkmarkt
City=Antwerpen
Distance=71.9400173804 meters
