In this exercise we determine the number of stops per citizen for each town and then sort
the results, which are (town, stops-per-citizen) tuples, by the name of the town.

First we initialise PySpark.


In [1]:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()


We first read all the stops. The specified file is a preprocessed version of the JSON "stops.txt", in which 
each line contains one stop in the format 

halte_id;halte_name;lat;long;town_name

This makes the file easier to parse, since each line only requires a call to str.split().
The result is an RDD of tuples (halte_id, halte_name, lat, long, town_name), in which every field is still a string.


In [2]:
stops = sc.textFile("./converted_stops.csv").map(lambda stop: tuple([x.strip() for x in stop.split(";")]))
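
To make the parsing step concrete, here is a minimal plain-Python sketch of what the lambda above does to a single line. The helper name `parse_stop` and the sample line are made up for illustration; they are not taken from the actual file.

```python
def parse_stop(line):
    """Split one semicolon-separated line and strip whitespace from each field."""
    return tuple(field.strip() for field in line.split(";"))

# Hypothetical sample line in the halte_id;halte_name;lat;long;town_name format.
sample = "12345; Example Stop ;51.0;4.5;Exampletown"
stop = parse_stop(sample)

print(stop)     # all five fields are strings, with surrounding whitespace removed
print(stop[4])  # the town (district) name, used later as the join key
```

Note that no field is converted to a number here, which is why the population is cast with float() later on.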


Next, we create a map that converts districts (deelgemeenten) to towns (gemeenten). It is
read from a preprocessed version of the "flemish_districts.txt" file in
which each line has the format "subtown;town". This results in an RDD
of (district-name, town-name) tuples.

We also create a map from town names to town populations. It is read from
a preprocessed version of the "citizens2.txt" file. This results in an RDD
of (town-name, pop) tuples.


In [3]:
# maps subtowns to towns
townname_map = sc.textFile("./townname_map.csv").map(lambda mapping: tuple([x.strip() for x in mapping.split(";")]))

# maps towns to town populations
townpop_map = sc.textFile("./town_pop_map.csv").map(lambda mapping: tuple([x.strip() for x in mapping.split(";")]))


We first join the stops with the town-name-map to discover which town each stop
belongs to. This results in (district-name, (stop, town-name)) tuples, which
are then mapped to (town-name, stop).

Now that the stops are keyed by their town name, we join those tuples
with the (town-name, pop) tuples of the town-pop-map, resulting in (town-name, (stop, pop)) tuples.

If we did not apply the town-name-map first,
around 1114 of the town names found in the stops would not have a corresponding population and would be removed.
This is because town populations are given per town and not per district, while the stops
are given per district.

However, 274 districts that appear in the stops have no mapping in the
town-name-map and are lost. In addition, 68 towns have no specified population
and are therefore also lost. Both losses are caused by inaccuracies in the original data, which
cannot practically be fixed since that would have to be done by hand. These statistics were obtained during
preprocessing.
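
The way unmatched keys disappear can be sketched in plain Python. The `inner_join` helper below only mimics the inner-join behaviour of RDD.join on small lists; it and all district, town and population values are hypothetical and serve as an illustration, not as the actual data.

```python
def inner_join(left, right):
    """Mimic an inner join on (key, value) pairs: keep only keys present on both sides."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Hypothetical data: stops keyed by district, a district-to-town map, and town populations.
stops_by_district = [("DistrictA", "stop1"), ("DistrictB", "stop2"),
                     ("DistrictC", "stop3"), ("DistrictD", "stop4")]
district_to_town = [("DistrictA", "TownX"), ("DistrictB", "TownX"), ("DistrictD", "TownY")]
town_pop = [("TownX", "1000")]

# Step 1: DistrictC has no mapping, so stop3 is lost here.
stops_by_town = [(town, stop)
                 for _district, (stop, town) in inner_join(stops_by_district, district_to_town)]

# Step 2: TownY has no population, so stop4 is lost here.
stops_with_pop_sketch = inner_join(stops_by_town, town_pop)

print(stops_by_town)
print(stops_with_pop_sketch)
```

Only the two stops in TownX survive both joins, which is the same mechanism by which the 274 unmapped districts and the 68 towns without a population drop out of the real data.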


In [4]:
# first key the stops by their location, then join with the town-name-map to get (subtown, (stop, town))
# lastly we clean up the tuples so that we get (townname, stop)
# this RDD contains data for 306 towns
stops_with_towns = stops.keyBy(lambda stop: stop[4]).join(townname_map).map(lambda x: (x[1][1], x[1][0]))

# we join the list of (townname, stop) tuples with the (townname, population) tuples of the town-pop-map
# we then get a list of (townname, (stop, pop))
# this RDD contains data for 238 towns (=306-68)
# the 68 missing towns account for about 12000 stops 
# ~176 stops / town which is reasonable since big cities such as Antwerp are lost
stops_with_pop = stops_with_towns.join(townpop_map)

# make list of towns specified in stops before and after join
# town_list_pre_join = set([x[0] for x in stops_with_towns.collect()])
# town_list_post_join = set([x[0] for x in stops_with_pop.collect()])
# diff = town_list_pre_join - town_list_post_join
# diff_size = len(diff) # this is equal to 68. This conforms to our predictions so the 12000 lost stops are accounted for


Lastly, we map the (townname, (stop, pop)) tuples to (townname, (1, pop)) tuples,
which are then added together per town so that the first element of the value holds the number of stops for that town.

(town, (x, pop)) + (town, (y, pop)) = (town, (x + y, pop))

To determine the number of stops per citizen for each town, we map as follows:

(town, (stops, pop)) -> (town, stops-per-citizen)

And finally we sort the RDD by the name of the town using "sortBy".
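
The counting, division and sorting steps described above can be traced on a tiny example in plain Python. The function name `stops_per_citizen` and the sample tuples are hypothetical; the sketch only mirrors the logic of the map, reduce-by-key and sort steps and is not the Spark code itself.

```python
def stops_per_citizen(stops_with_pop):
    """Count stops per town, divide by the population, and sort by town name."""
    totals = {}
    for town, (_stop, pop) in stops_with_pop:
        # (town, (x, pop)) + (town, (1, pop)) = (town, (x + 1, pop))
        count, _ = totals.get(town, (0, pop))
        totals[town] = (count + 1, pop)
    # (town, (stops, pop)) -> (town, stops-per-citizen); pop is a string, hence float()
    ratios = [(town, count / float(pop)) for town, (count, pop) in totals.items()]
    return sorted(ratios, key=lambda pair: pair[0])

# Hypothetical (townname, (stop, pop)) tuples.
sample = [("TownX", ("stop1", "1000")),
          ("TownW", ("stop3", "500")),
          ("TownX", ("stop2", "1000"))]
ratios = stops_per_citizen(sample)

for town, ratio in ratios:
    print("{}: {}".format(town, ratio))
```

As in the Spark version, the population is carried along unchanged in every tuple, so the reduction only has to add up the counts.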


In [5]:
towns_stopcount_pop = stops_with_pop.map(lambda x: (x[0], (1, x[1][1]))).reduceByKey(lambda x, y: (x[0] + y[0], x[1]))
towns_stops_per_citizen = towns_stopcount_pop.mapValues(lambda x: float(x[0]) / float(x[1]))
result = towns_stops_per_citizen.sortBy(lambda x: x[0]).collect()

for town in result:
  print("{}: {}".format(town[0], town[1]))


Aalter: 0.00520833333333
Aarschot: 0.00701028174656
Aartselaar: 0.00370681214156
Affligem: 0.00514333257696
Alken: 0.0102040816327
Alveringem: 0.0153332022803
Anzegem: 0.00630180149325
Ardooie: 0.00678682688028
Arendonk: 0.00452011450957
As: 0.0041514041514
Asse: 0.00440194292653
Assenede: 0.00767389467756
Avelgem: 0.00476994931929
Balen: 0.00691192865106
Beernem: 0.00592998788497
Beerse: 0.00345827755466
Beersel: 0.00483323347314
Begijnendijk: 0.00915149706555
Bekkevoort: 0.0119636305631
Beringen: 0.00417146457514
Berlaar: 0.00625760472797
Berlare: 0.00464990902352
Bertem: 0.00722891566265
Beveren: 0.00284456625555
Bierbeek: 0.00907911802853
Bilzen: 0.00727576705161
Blankenberge: 0.0026112233335
Bocholt: 0.00718599495451
Boechout: 0.00442275430837
Bonheiden: 0.00534973920021
Boom: 0.00389863547758
Boortmeerbeek: 0.00582288718156
Bornem: 0.00453300594957
Borsbeek: 0.00196684461928
Boutersem: 0.0104102878138
Brakel: 0.0104244229337
Brasschaat: 0.00362050739958
Brecht: 0.00382824624935
B