In this exercise we determine, for each stop, the distance between that stop
and the town in which it is located.

Note that I used the district here, since the coordinate data was available
per district and working district by district is more precise.

First we initialise PySpark.


In [13]:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()


We first read all the stops. The input file is a preprocessed version of the JSON file "stops.txt", in which
each line contains one stop in the format

halte_id;halte_name;lat;long;town_name

This makes parsing easier, since it only requires a call to str.split().
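
As a minimal sketch of that parsing step outside Spark: the sample line and the helper name parse_stop below are made up for illustration and are not taken from the data.

```python
# Hypothetical sample line in the halte_id;halte_name;lat;long;town_name format.
sample_line = "1001 ; Gemeenteplein ; 50.79 ; 4.58 ; Huldenberg"

def parse_stop(line):
    """Split a semicolon-separated line and strip whitespace from each field."""
    return tuple(field.strip() for field in line.split(";"))

stop = parse_stop(sample_line)
print(stop)
```

All fields stay strings at this point; the coordinates are only converted to floats when the distance is computed.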


In [17]:
# each stop becomes a tuple (halte_id, halte_name, lat, long, town_name)
stops = (
    sc.textFile("./converted_stops.csv")
      .map(lambda stop: tuple([x.strip() for x in stop.split(";")]))
)


Next we create an RDD that maps town names to town coordinates.
It is built from a CSV file that contains the coordinates of each town.
This CSV file is a preprocessed version of "zipcodes.CSV". The preprocessing
filters out unnecessary data, so that each line has the following format:

town-name;lat;long

Each line is then mapped to a tuple (town, (lat, long)).
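
The mapping to key-value form can be sketched in plain Python as follows. The sample line and the helper name parse_town_coord are hypothetical.

```python
# Hypothetical sample line in the town-name;lat;long format.
sample_line = "Huldenberg; 50.789; 4.583"

def parse_town_coord(line):
    """Turn 'town;lat;long' into (town, (lat, long)), keeping all fields as strings."""
    name, lat, lon = [field.strip() for field in line.split(";")]
    return (name, (lat, lon))

town_coord = parse_town_coord(sample_line)
print(town_coord)
```

Keeping the town name as the first element matters because the join on pair RDDs matches on the first element of each tuple.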


In [18]:

# each town becomes a pair (town-name, (lat, long))
town_coords = (
    sc.textFile("./coord_map.csv")
      .map(lambda towncoord: tuple([x.strip() for x in towncoord.split(";")]))
      .map(lambda x: (x[0], (x[1], x[2])))
)


Next we join the stops with the town coordinates. This creates an
RDD with tuples of the form (town-name, (stop, town-coord)), which are then
mapped to (stop, town-coord).

Note: the town names in the CSV file contain some errors, so 270 towns find no match and are lost in the join.
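
To illustrate why names that do not match are lost, here is a small plain-Python emulation of an inner join on (key, value) pairs. The data, the misspelled name and the helper inner_join are invented for this sketch; it only mimics the behaviour of the RDD join and is not how Spark implements it.

```python
def inner_join(left, right):
    """Emulate an inner join on lists of (key, value) pairs:
    only keys present on both sides survive."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

# Hypothetical data: the second town name is misspelled on purpose.
stops_by_town = [
    ("Huldenberg", ("1001", "Gemeenteplein", "50.79", "4.58", "Huldenberg")),
    ("Huldenbreg", ("1002", "Klooster", "50.80", "4.60", "Huldenbreg")),
]
town_coords = [("Huldenberg", ("50.789", "4.583"))]

joined = inner_join(stops_by_town, town_coords)
# drop the key, keeping (stop, town-coord)
stops_town_coords = [pair for _, pair in joined]
print(stops_town_coords)
```

On the real RDDs, something like stops_by_town.subtractByKey(town_coords) should list the stops whose town name found no match, which may help when cleaning the names.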


In [24]:

# map stop to (stop-town, stop) to make the "join" operation work
stops_by_town = stops.keyBy(lambda stop: stop[4])

# join operation, and map to (stop, town-coord)
stops_town_coords = stops_by_town.join(town_coords).map(lambda x: x[1])


The last step is to determine the distance between each stop
and the town it is located in. For this we also need the coordinates of the
stops themselves, which the function "get_stop_coord" provides.

This map operation results in tuples of the form (stop, distance-to-town), with the distance in meters.

Note: since the earth is (approximately) a sphere, Euclidean distances are not
accurate enough, so I used an online implementation of the haversine formula:
http://evoling.net/code/haversine/
https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points/4913653#4913653
There are many sources; I don't know which one is the original.
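
For reference, the haversine formula can be written as below for two points with latitudes \varphi_1, \varphi_2 and longitudes \lambda_1, \lambda_2 (in radians) on a sphere of radius R. The symbol names are my own choice. R = 6371 km is the mean earth radius used in the code, and treating the earth as a perfect sphere is an approximation.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \,\arcsin\!\left(\sqrt{a}\right)
\end{aligned}
```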


In [25]:
def haversine(coord1, coord2):
    """
    Calculate the great circle distance in meters between two points
    on the earth (specified as (lat, lon) in decimal degrees).
    """
    from math import radians, cos, sin, asin, sqrt

    # coordinates may arrive as strings (town coords) or floats (stop coords)
    lat1, lon1 = map(float, coord1)
    lat2, lon2 = map(float, coord2)

    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a))
    # Radius of earth in kilometers is 6371
    km = 6371.0 * c
    # convert kilometers to meters
    m = km * 1000.0
    return m

def get_stop_coord(stop):
    """Retrieve the geo coordinate (lat, long) of a stop."""
    return float(stop[2]), float(stop[3])

# map each (stop, town-coord) pair to (stop, distance-to-town in meters)
stops_distances = stops_town_coords.map(
    lambda x: (x[0], haversine(get_stop_coord(x[0]), x[1]))
).collect()
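
A quick way to sanity-check a haversine implementation is to compare it with a case whose answer is known. The sketch below uses a compact, self-contained variant (the name haversine_m and the test points are my own) and checks that one degree of latitude along a meridian equals the arc length R * pi / 180 on the assumed sphere.

```python
from math import radians, sin, cos, asin, sqrt, pi

EARTH_RADIUS_M = 6371.0 * 1000.0  # assumed mean earth radius, in meters

def haversine_m(coord1, coord2):
    """Great-circle distance in meters between two (lat, lon) pairs in decimal degrees."""
    lat1, lon1 = [radians(float(v)) for v in coord1]
    lat2, lon2 = [radians(float(v)) for v in coord2]
    a = sin((lat2 - lat1) / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * asin(sqrt(a))

# One degree of latitude along a meridian: the arc length is R * pi / 180.
one_degree = haversine_m((0.0, 0.0), (1.0, 0.0))
expected = EARTH_RADIUS_M * pi / 180.0
print(abs(one_degree - expected) < 1e-6)
```

The same function accepts string coordinates, which mirrors how the town coordinates arrive from the CSV file.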


Finally, we print the result.


In [26]:
print("Stad - Halte naam - Afstand")
print("---------------------------\n")
for stop in stops_distances:
  print("{} - {} - {}".format(stop[0][4].encode('utf-8'), stop[0][1].encode('utf-8'), stop[1]))

Stad - Halte naam - Afstand
---------------------------

Huldenberg - Gemeenteplein - 341.468414858
Huldenberg - Gemeenteplein - 338.622228273
Huldenberg - Gemeenteplein - 354.898682533
Huldenberg - Geroytstraat - 522.172694146
Huldenberg - Geroytstraat - 503.230031342
Huldenberg - Boven Smeysberg - 1344.83916045
Huldenberg - Boven Smeysberg - 1333.14955954
Huldenberg - Kaalheide - 1396.48218846
Huldenberg - Kaalheide - 1424.9105949
Huldenberg - Klooster - 2216.3815558
Huldenberg - Klooster - 2189.84145217
Huldenberg - Koxberg (Huis nr 16) - 613.45092941
Huldenberg - Koxberg (Huis nr 16) - 579.355710667
Huldenberg - Onder Smeysberg - 273.824618218
Huldenberg - Onder Smeysberg - 418.408975411
Huldenberg - Stroobantsstraat 76 - 1843.69030723
Huldenberg - Stroobantsstraat 76 - 1849.51623683
Huldenberg - Stroobantsweg - 853.556137398
Huldenberg - Stroobantsweg - 845.529845225
Huldenberg - Theyssensstraat - 1027.92557787
Huldenberg - Theyssensstraat - 1013.41941891
Huldenberg - Van Kildonck