# Homework 4 - More trains (Part I & II)

__Hand-in:__

- __Due: 12.05.2020 23:59:59 CET__
- `git push` your final version to your group's Renku repository before the due date
- check if `Dockerfile`, `environment.yml` and `requirements.txt` are properly written
- add the necessary comments and discussion to make your code readable

For this homework, you will be working with the real-time streams of the NS, the train company of the Netherlands. You can see an example webpage that uses the same streams to display the train information on a map: https://spoorkaart.mwnn.nl/ . 

To help you and to avoid having too many connections to the NS streaming servers, we have set up a service that collects the streams and pushes them to our Kafka instance. The related topics are: 

`ndovloketnl-arrivals`: For each arrival of a train in a station, describe the previous and next station, time of arrival (planned and actual), track number,...

`ndovloketnl-departures`: For each departure of a train from a station, describe the previous and next station, time of departure (planned and actual), track number,...

`ndovloketnl-gps`: For each train, describe the current location, speed, bearing.

The events are serialized in JSON (converted from the original XML), with property names in their original language, Dutch. Google Translate can help you understand them, and we provide some useful mappings below.

---
## Create a Kafka client

In [1]:
import os
from pykafka import KafkaClient
from pykafka.common import OffsetType
import time
username = os.environ['JUPYTERHUB_USER']

ZOOKEEPER_QUORUM = 'iccluster044.iccluster.epfl.ch:2181,'\
                   'iccluster054.iccluster.epfl.ch:2181,'\
                   'iccluster059.iccluster.epfl.ch:2181'

client = KafkaClient(zookeeper_hosts=ZOOKEEPER_QUORUM)

---

## Part I - Live Plot (20 points / 50)

The goal of this part is to obtain an interactive plot that uses the train positions from the GPS stream. We encourage you to use the examples from last week to achieve the expected result.

First, let's write a function to decode the messages from the `ndovloketnl-gps` topic.

In [2]:
import json
from pykafka.common import OffsetType

example_gps = client.topics[b'ndovloketnl-gps'].get_simple_consumer(
    auto_offset_reset=OffsetType.EARLIEST,
    reset_offset_on_start=True
).consume()
json.loads(example_gps.value)

{'tns3:ArrayOfTreinLocation': {'@xmlns:tns3': 'http://schemas.datacontract.org/2004/07/Cognos.Infrastructure.Models',
  'tns3:TreinLocation': [{'tns3:TreinNummer': '15867',
    'tns3:TreinMaterieelDelen': [{'tns3:MaterieelDeelNummer': '2451',
      'tns3:Materieelvolgnummer': '1',
      'tns3:GpsDatumTijd': '2020-04-23T17:06:31Z',
      'tns3:Orientatie': '0',
      'tns3:Bron': 'NTT',
      'tns3:Fix': '1',
      'tns3:Berichttype': None,
      'tns3:Longitude': '5.1597604696057',
      'tns3:Latitude': '52.273841159003',
      'tns3:Elevation': '0.0',
      'tns3:Snelheid': '90.0',
      'tns3:Richting': '166.6',
      'tns3:Hdop': '2.38',
      'tns3:AantalSatelieten': '0'},
     {'tns3:MaterieelDeelNummer': '2430',
      'tns3:Materieelvolgnummer': '2',
      'tns3:GpsDatumTijd': '2020-04-23T17:06:34Z',
      'tns3:Orientatie': '0',
      'tns3:Bron': 'NTT',
      'tns3:Fix': '1',
      'tns3:Berichttype': None,
      'tns3:Longitude': '5.1597830062141',
      'tns3:Latitude': '52.

We can see that the message has the following structure:

```
{
  'tns3:ArrayOfTreinLocation': {
    'tns3:TreinLocation': [
      <train_info_1>,
      <train_info_2>,
      ...
    ]
  }
}
```

Each `<train_info_x>` entry contains:
- `tns3:TreinNummer`: the train number. This number is used in passenger information displays.
- `tns3:MaterieelDeelNummer`: the train car number. It identifies the physical train car.
- `tns3:Materieelvolgnummer`: the car position. 1 is the car in front of the train, 2 the next one, etc.
- `tns3:GpsDatumTijd`: the datetime given by the GPS.
- `tns3:Latitude`, `tns3:Longitude`, `tns3:Elevation`: 3D coordinates given by the GPS.
- `tns3:Snelheid`: speed, most likely given by the GPS.
- `tns3:Richting`: heading, most likely given by the GPS.
- `tns3:AantalSatelieten`: number of GPS satellites in view.

We also notice that when a train is composed of multiple cars, the position is given in an array, with the position of all individual cars.

**Question I.a. (5/20)** Write a function `extract_gps_data` which takes the message as input and extracts the train number, train car and GPS data from the source messages. Using this function, you should be able to obtain the example table, or something similar:

| timestamp | train_number | car_number | car_position | longitude | latitude | elevation | heading | speed |
|:---------:|:------------:|:----------:|:------------:|:---------:|:--------:|:---------:|:-------:|:-----:|
|    ...    |      ...     |     ...    |      ...     |    ...    |    ...   |    ...    |   ...   |  ...  |

__Hints:__
- The messages are occasionally empty; for example, `tns3:ArrayOfTreinLocation` or `tns3:TreinLocation` can be empty.
- Not every message shares exactly the same structure; for example, `tns3:TreinMaterieelDelen` may be a list, but not always.
- You may find the Python dictionary method [get(key, default)](https://docs.python.org/3.7/library/stdtypes.html#dict.get) helpful.
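
The hints above can be combined into a small normalization step. The following is a minimal sketch, not the reference solution: the helper names `as_list` and `iter_cars` and the two sample messages are hypothetical, made up for illustration.

```python
def as_list(value):
    """Return `value` as a list: None -> [], dict -> [dict], list -> unchanged."""
    if value is None:
        return []
    if isinstance(value, dict):
        return [value]
    return list(value)


def iter_cars(msg):
    """Yield (train, car) pairs from a decoded GPS message, tolerating empty parts."""
    # `or {}` covers both a missing key and an explicit None value
    array = msg.get('tns3:ArrayOfTreinLocation') or {}
    for train in as_list(array.get('tns3:TreinLocation')):
        for car in as_list(train.get('tns3:TreinMaterieelDelen')):
            yield train, car


# Hypothetical messages: one empty, one with a single car given as a dict
empty_msg = {'tns3:ArrayOfTreinLocation': None}
single_msg = {'tns3:ArrayOfTreinLocation': {'tns3:TreinLocation': {
    'tns3:TreinNummer': '1',
    'tns3:TreinMaterieelDelen': {'tns3:MaterieelDeelNummer': '10'}}}}

print(list(iter_cars(empty_msg)))
print(len(list(iter_cars(single_msg))))
```

Normalizing every level to a list first means the extraction logic only has to handle one shape.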

In [3]:
def extract_form(mat, train):
    return [mat.get('tns3:GpsDatumTijd'), train.get('tns3:TreinNummer'), mat.get('tns3:MaterieelDeelNummer'), 
            mat.get('tns3:Materieelvolgnummer'), mat.get('tns3:Longitude'),
            mat.get('tns3:Latitude'), mat.get('tns3:Elevation'),
            mat.get('tns3:Richting'), mat.get('tns3:Snelheid')]

def extract_gps_data(msg):
    result = []
    # Either level can be missing or None when a message is empty
    trains = (msg.get('tns3:ArrayOfTreinLocation') or {}).get('tns3:TreinLocation') or []
    if isinstance(trains, dict):
        # Handle the case where the trains arrive as a single dict instead of a list
        trains = [trains]
    for train in trains:
        extract_info(train.get('tns3:TreinMaterieelDelen'), train, result)
    return result

def extract_info(info, train, result):
    if info is None:
        # No car information for this train
        return
    if isinstance(info, dict):
        # Handle the case where the cars arrive as a single dict instead of a list
        info = [info]
    for mat in info:
        result.append(extract_form(mat, train))

In [4]:
# Example results from "extract_gps_data"
import numpy as np
import pandas as pd

pd.DataFrame(
    data=extract_gps_data(json.loads(example_gps.value)),
    columns=['timestamp', 'train_number', 'car_number', 'car_position', 
             'longitude', 'latitude', 'elevation', 'heading', 'speed']
).head(n=20)

|   | timestamp | train_number | car_number | car_position | longitude | latitude | elevation | heading | speed |
|:-:|:---------:|:------------:|:----------:|:------------:|:---------:|:--------:|:---------:|:-------:|:-----:|
| 0 | 2020-04-23T17:06:31Z | 15867 | 2451 | 1 | 5.1597604696057 | 52.273841159003 | 0.0 | 166.6 | 90.0 |
| 1 | 2020-04-23T17:06:34Z | 15867 | 2430 | 2 | 5.1597830062141 | 52.273789073293 | 0.0 | 166.6 | 86.4 |
| 2 | 2020-04-23T17:06:36Z | 14667 | 2728 | 1 | 5.322429 | 52.41208 | 0.62 | 61.88 | 133.789 |
| 3 | 2020-04-23T17:06:32Z | 14667 | 2332 | 2 | 5.319623 | 52.411064 | 0.62 | 56.88 | 129.182 |
| 4 | 2020-04-23T17:06:36Z | 14667 | 2740 | 3 | 5.320444 | 52.411377 | 0.58 | 58.44 | 133.808 |
| 5 | 2020-04-23T17:06:31Z | 6662 | 2513 | 1 | 5.79994583333 | 51.8246003333 | 0.0 | 79.17 | 66.0 |
| 6 | 2020-04-23T17:06:35.8Z | 6662 | 2507 | 2 | 5.799830330535919 | 51.82456617400802 | 14.0 | 79.11 | 76.0 |
| 7 | 2020-04-23T17:06:34Z | 6966 | 2618 | 1 | 5.2605806017145 | 51.824128392772 | 0.0 | 353.6 | 104.4 |
| 8 | 2020-04-23T17:06:31Z | 8161 | 8723 | 1 | 6.55242116667 | 52.9407786667 | 0.0 | 12.3 | 125.0 |
| 9 | 2020-04-23T17:06:31Z | 5668 | 2710 | 1 | 5.892624 | 52.408997 | 0.63 | 232.0 | 11.646 |


**Question I.b (15/20)** Make a live plot of the train positions.

You can do so by using `bokeh`; use last week's lab as an example.

See also: https://docs.bokeh.org/en/latest/docs/user_guide/geo.html#tile-provider-maps

You can compare your plot to one of the live services: https://spoorkaart.mwnn.nl/, http://treinenradar.nl/

__Q (1/15)__ To plot points with GPS location information on bokeh's map, we need a transformer. What does the following transformer do? Check `bokeh`'s documentation on [Tile Provider Maps](https://docs.bokeh.org/en/latest/docs/user_guide/geo.html#tile-provider-maps).

In [5]:
from pyproj import Transformer
transformer = Transformer.from_proj("EPSG:4326", "EPSG:3857", always_xy=True)

__Answer:__ `transformer` converts coordinates from `EPSG:4326` (WGS 84 longitude and latitude in degrees, defined on the surface of a reference ellipsoid) to `EPSG:3857` (Web Mercator, a projection of that surface onto a flat plane, with coordinates in meters). The latter is the coordinate system used by bokeh's tile provider maps. With `always_xy=True`, inputs are given in (longitude, latitude) order.

To perform a transformation, use the method `Transformer.transform`; see the [pyproj documentation](https://pyproj4.github.io/pyproj/stable/api/transformer.html?highlight=transformer#pyproj.transformer.Transformer.transform).
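
For intuition about what this projection computes, the spherical Web Mercator forward formulas can be written out by hand. This is a sketch for illustration only, assuming the standard sphere radius of 6378137 m; the function name `lonlat_to_web_mercator` and the sample coordinates are hypothetical, and `pyproj` remains the tool to use in the homework.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (WGS 84 semi-major axis)


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in meters."""
    # x grows linearly with longitude
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    # y stretches with latitude, which is why areas near the poles look enlarged
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Sample point roughly in the Netherlands (values chosen for illustration)
x, y = lonlat_to_web_mercator(4.9, 52.37)
print(x, y)
```

Note the argument order: longitude first, then latitude, matching `always_xy=True` above.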

__Q (14/15)__ Let's make the plot.

**Pay attention to the following points:**
- We expect the train positions to fall on rail tracks on the map. Showing each train as a circle is good enough. Check [Scatter Markers](https://docs.bokeh.org/en/2.0.2/docs/user_guide/plotting.html?highlight=scatter#scatter-markers).
- One train may have many cars. You do not need to show every car on the map; keep only the car whose `car_position` equals '1'.
- Provide an interactive label with the train number (we do not expect train type, as this needs to be recovered from other sources). Check [Hovertool](https://docs.bokeh.org/en/2.0.1/docs/user_guide/tools.html#hovertool).

**You can get bonus points if your plot also does the following:**
- Trains on the map should not appear/disappear when data is absent for a few messages.
- Find a way to show where the train is heading.
- Add any other pieces of information that may be of interest to users.
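
One way to approach the first bonus item is to remember the last known position of each train and to drop a train only after it has been silent for some time. The sketch below assumes rows reduced to `(train_number, x, y)`; the function name `update_positions`, the `max_age` threshold and the sample values are hypothetical.

```python
def update_positions(state, rows, now, max_age=60.0):
    """Merge new rows into `state` and forget trains not seen for `max_age` seconds.

    state: dict mapping train_number -> (x, y, last_seen)
    rows:  iterable of (train_number, x, y) tuples from the latest message
    now:   current time in seconds
    """
    # Overwrite the stored position of every train present in this message
    for train_number, x, y in rows:
        state[train_number] = (x, y, now)
    # Collect stale entries first, then delete, to avoid mutating while iterating
    stale = [tn for tn, (_, _, seen) in state.items() if now - seen > max_age]
    for tn in stale:
        del state[tn]
    return state


state = {}
update_positions(state, [('15867', 1.0, 2.0), ('14667', 3.0, 4.0)], now=0.0)
update_positions(state, [('15867', 1.5, 2.5)], now=30.0)  # 14667 is absent but kept
update_positions(state, [], now=70.0)                     # 14667 is now too old
print(sorted(state))
```

The resulting dictionary could then be used to rebuild the plot's data source on each update, instead of clearing all points whenever a message arrives.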

In [108]:
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Range1d,HoverTool
from bokeh.tile_providers import get_provider, OSM
import time

output_notebook()

In [109]:
x = []
y = []
tn = []
source = ColumnDataSource(data=dict(x=x, y=y, tn=tn))

# TOOLTIPS = [('train NO.', '@tn'),]
custom_hover = HoverTool()

custom_hover.tooltips = """
    <style>
        .bk-tooltip>div:not(:first-child) {display:none;}
    </style>

    <b>train NO.@tn <br>
"""

# create the map
tile_provider = get_provider(OSM)
p = figure(x_axis_type="mercator", y_axis_type="mercator", tools=[custom_hover])
p.add_tile(tile_provider)

# add circle points
p.circle('x', 'y', source=source, size=10, line_color="navy", fill_color="red",alpha=0.8)

# make the plot centered at Amsterdam
x_min, y_min = transformer.transform(4.4, 52.2)
x_max, y_max = transformer.transform(5.4, 52.6)
p.x_range = Range1d(x_min, x_max)
p.y_range = Range1d(y_min, y_max)

t=show(p, notebook_handle=True)
# Note: because the data is streamed, the plot may update slowly.

Create a simple consumer for `ndovloketnl-gps`, which consumes the latest information from the stream.

In [110]:
consumer = client.topics[b'ndovloketnl-gps'].get_simple_consumer(consumer_group = b'ndovloketnl-gps', 
                                                                   auto_offset_reset = OffsetType.LATEST,
                                                                   reset_offset_on_start = True)

Bring the plot to life. You can refer to the week 9 exercise for ideas.

In [None]:
# due to streaming, the update speed of the plot may be slow
try:
    for message in consumer:
        if message is not None:
            source.data = {k: [] for k in source.data}
            df = pd.DataFrame(data = extract_gps_data(json.loads(message.value)),
                              columns=['timestamp', 'train_number', 'car_number', 'car_position', 'longitude', 'latitude', 'elevation', 'heading', 'speed'])
            df = df[df.car_position == '1']
            x = df['longitude'].astype(float).values
            y = df['latitude'].astype(float).values
            trains = list(df['train_number'].astype(int).values)
            newX = []
            newY = []
            for x_,y_ in zip(x,y):
                new_x,new_y = transformer.transform(x_,y_)
                newX.append(new_x)
                newY.append(new_y)

            source.stream({'x': newX, 'y': newY, 'tn': trains})
            push_notebook(handle=t)
#             time.sleep(0.1)
except KeyboardInterrupt:
    print("Plot interrupted.")

---

## Part II - Locate Message (10 points / 50)

After you finish this part, you will be able to locate the message corresponding to a specific timestamp.

You can find below a helper function to read a message at a specific offset from a Kafka topic.

In [37]:
import warnings
warnings.filterwarnings('ignore')

In [47]:
def fetch_message_at(topic, offset):
    if isinstance(topic, str):
        topic = topic.encode('utf-8')
    t = client.topics[topic]
    consumer = t.get_simple_consumer()
    p = list(consumer.partitions.values())[0]
    # Point the consumer just before `offset` so that the next consume() returns that message
    consumer.reset_offsets([(p, int(offset) - 1)])
    return consumer.consume()

In [48]:
msg = fetch_message_at(b'ndovloketnl-gps', 31243)

In [49]:
msg.offset

31243

In [50]:
msg.value

b'{"tns3:ArrayOfTreinLocation": {"@xmlns:tns3": "http://schemas.datacontract.org/2004/07/Cognos.Infrastructure.Models", "tns3:TreinLocation": [{"tns3:TreinNummer": "1685", "tns3:TreinMaterieelDelen": [{"tns3:MaterieelDeelNummer": "4058", "tns3:Materieelvolgnummer": "0", "tns3:GpsDatumTijd": "2020-04-26T04:36:23Z", "tns3:Orientatie": "0", "tns3:Bron": "NTT", "tns3:Fix": "1", "tns3:Berichttype": null, "tns3:Longitude": "5.3565625", "tns3:Latitude": "52.154243", "tns3:Elevation": "0.0", "tns3:Snelheid": "0", "tns3:Richting": "0.0", "tns3:Hdop": "4.3", "tns3:AantalSatelieten": "9"}, {"tns3:MaterieelDeelNummer": "4204", "tns3:Materieelvolgnummer": "0", "tns3:GpsDatumTijd": "2020-04-26T04:36:29Z", "tns3:Orientatie": "0", "tns3:Bron": "NTT", "tns3:Fix": "1", "tns3:Berichttype": null, "tns3:Longitude": "5.35796566667", "tns3:Latitude": "52.1534048333", "tns3:Elevation": "0.0", "tns3:Snelheid": "0", "tns3:Richting": "0.0", "tns3:Hdop": "4.3", "tns3:AantalSatelieten": "10"}]}, {"tns3:TreinNummer

**Question II.a (5/10)** Write a function to extract the median timestamp from a message of the `ndovloketnl-gps` topic. You can reuse the `extract_gps_data` function from part I.

In [51]:
example_gps = client.topics[b'ndovloketnl-gps'].get_simple_consumer(
    auto_offset_reset=OffsetType.EARLIEST,
    reset_offset_on_start=True
).consume()

In [136]:
# Answer II.a
import pandas as pd
import numpy as np

def extract_gps_time_approx(msg):
    data = extract_gps_data(msg)
    result = sorted([np.datetime64(x[0]) for x in data])
    return result[len(result)//2]

In [53]:
extract_gps_time_approx(json.loads(example_gps.value))

numpy.datetime64('2020-04-23T17:06:34')

In [54]:
# Example results from `extract_gps_time_approx`
extract_gps_time_approx(json.loads(example_gps.value))

numpy.datetime64('2020-04-23T17:06:34')
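
Note that for an even number of timestamps there are two middle values, and indexing the sorted list with `len(result)//2` picks the later one. As a sketch of the same idea using only the standard library, `statistics.median_high` makes that choice explicit. The helper names `parse_gps_time` and `median_timestamp` are hypothetical, and the parsing assumes timestamps of the form shown in the example table (a trailing `Z`, optionally with fractional seconds).

```python
from datetime import datetime
from statistics import median_high


def parse_gps_time(text):
    """Parse e.g. '2020-04-23T17:06:31Z' or '2020-04-23T17:06:35.8Z' (assumed formats)."""
    text = text.rstrip('Z')
    fmt = '%Y-%m-%dT%H:%M:%S.%f' if '.' in text else '%Y-%m-%dT%H:%M:%S'
    return datetime.strptime(text, fmt)


def median_timestamp(timestamps):
    """Return the median timestamp; for an even count, the later of the two middle values."""
    return median_high(parse_gps_time(t) for t in timestamps)


sample = ['2020-04-23T17:06:31Z', '2020-04-23T17:06:34Z',
          '2020-04-23T17:06:36Z', '2020-04-23T17:06:35.8Z']
print(median_timestamp(sample))
```

`median_high` raises `statistics.StatisticsError` on empty input, so empty messages still need to be handled by the caller.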

**Question II.b (5/10)** Using `fetch_message_at` and `extract_gps_time_approx`, write a function named `search_gps` to find the "first" offset for a given timestamp in the `ndovloketnl-gps` topic. Your function should use the [Binary Search Algorithm](https://en.wikipedia.org/wiki/Binary_search_algorithm).

More precisely, if we write `offset = search_gps(ts)` where `ts` is a timestamp, then we have:
```
ts <= extract_gps_time_approx(fetch_message_at('ndovloketnl-gps', offset))

extract_gps_time_approx(fetch_message_at('ndovloketnl-gps', offset - 1)) < ts
```
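
The two conditions above describe a lower-bound search: the smallest offset whose timestamp is not before `ts`. A generic sketch on an in-memory list is shown below. It assumes the timestamps are non-decreasing in the offset, and the names `lower_bound` and `key_at` are hypothetical stand-ins (in the homework, `key_at` would wrap `fetch_message_at` and `extract_gps_time_approx`).

```python
def lower_bound(key_at, lo, hi, target):
    """Return the smallest i in [lo, hi) with key_at(i) >= target, or hi if none.

    Assumes key_at is non-decreasing on [lo, hi).
    """
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if key_at(mid) < target:
            lo = mid + 1   # the answer lies strictly to the right of mid
        else:
            hi = mid       # mid itself may be the answer
    return lo


# Stand-in for per-offset timestamps (integers chosen for illustration)
times = [10, 20, 20, 30, 40]
print(lower_bound(times.__getitem__, 0, len(times), 20))
```

This form evaluates one key per iteration and always terminates, because the interval `[lo, hi)` shrinks at every step.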

In [172]:
def search_gps(findTimeStr):
    findTime = findTimeStr.to_datetime64()
    left = 0
    right = 77596
    while True: 
        mid = left + (right - left) //2
        print("mid: {}".format(mid))
        tsright = extract_gps_time_approx(json.loads(fetch_message_at('ndovloketnl-gps', mid).value))
        tsleft = extract_gps_time_approx(json.loads(fetch_message_at('ndovloketnl-gps', mid - 1).value))
        print("tsleft: {0} tsright: {1}".format(tsleft,tsright))
        if tsright < findTime:
            left = mid + 1
        elif tsleft >= findTime:
            right = mid - 1
        elif tsleft < findTime and tsright >= findTime:
            return mid

In [173]:
offset = search_gps(pd.Timestamp('2020-04-30'))
print("\"first\" offset: {}".format(offset))

mid: 38798
tsleft: 2020-04-27T14:38:34 tsright: 2020-04-27T14:38:44
mid: 58197
tsleft: 2020-04-29T23:59:57 tsright: 2020-04-30T00:00:06
"first" offset: 58197


In [14]:
# Example results from `search_gps`
search_gps(pd.Timestamp('2020-04-30'))

58197

In [55]:
# Verify that the message at the offset returned above has a timestamp on or after 2020-04-30
extract_gps_time_approx(json.loads(fetch_message_at('ndovloketnl-gps', 58197).value))

numpy.datetime64('2020-04-30T00:00:06')

In [56]:
# Verify that the message just before the offset returned above has a timestamp before 2020-04-30
extract_gps_time_approx(json.loads(fetch_message_at('ndovloketnl-gps', 58197 - 1).value))

numpy.datetime64('2020-04-29T23:59:57')