## Custom iteration in Python: a use case with Pandas

Python offers convenient ways to programmatically define how to iterate on structures.

Before going further, be sure to understand (know how to explain):
 - what an iterable structure is in Python;
 - how the yield keyword works.
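
As a quick refresher, here is a minimal sketch of a generator function. The `countdown` name is made up for illustration; the point is only that each `yield` hands one value to the caller and pauses the function until the next value is requested.

```python
def countdown(n):
    """Generator function: yields n, n-1, ..., 1."""
    while n > 0:
        yield n  # hand one value to the caller, then pause here
        n -= 1


gen = countdown(3)  # nothing runs yet: we only get a generator object
first = next(gen)   # runs the body until the first `yield`: first is 3
rest = list(gen)    # consumes the remaining values: rest is [2, 1]
```

A generator object is itself iterable, which is why it can be passed to `next`, `list`, `sum` or a `for` loop.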
 
In this notebook, we will work with the following data, representing trajectories of aircraft supervising the Tour de France cycling race:

In [None]:
import pandas as pd

df = pd.read_json("data/tour_de_france.json.gz")
df.shape

Let's have a look at a sample of this data. Two columns will be of interest to us:
- `icao24` is a hexadecimal identifier of the transponder of the aircraft.  
  It is (almost) equivalent to its tail number;
- `callsign` is what appears on the radar screen of the air traffic controller.  
  It corresponds to a mission or, for a commercial flight, to a flight number. It is not enough to identify a flight, as the same callsign may be reused over several days, or sometimes even within the same day.

In [None]:
df.sample(10)

All data for all trajectories has been flattened in one single dataframe.

<div class='alert alert-warning'>
    <b>Exercise:</b> Let's write an <code>iterate_callsign(data)</code> function that will <em>yield</em> one sub-dataframe for each callsign.
</div>

In [None]:
# %load solutions/iterate_callsign.py


Let's check the first and last timestamps recorded in the first sub-dataframe yielded:

In [None]:
elt = next(iterate_callsign(df))
elt.timestamp.min(), elt.timestamp.max()

In [None]:
elt.agg(dict(timestamp=["min", "max"]))

In the end, we wrote a function that splits our dataframe by callsign (= mission code), but this is obviously not enough to separate trajectories flown on different days.
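
For reference, such a function could be sketched as follows. This is only one possible implementation, not necessarily the one in `solutions/`, and the toy dataframe is made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def iterate_callsign(data):
    """Yield one sub-dataframe per distinct callsign (one possible sketch)."""
    # groupby iterates over (key, sub-dataframe) pairs; we only yield the data
    for _, chunk in data.groupby("callsign"):
        yield chunk


# Toy data, made up for illustration
toy = pd.DataFrame(
    {
        "callsign": ["AAA1", "BBB2", "AAA1", "BBB2", "AAA1"],
        "timestamp": pd.to_datetime(
            [
                "2019-07-06 10:00",
                "2019-07-06 10:05",
                "2019-07-06 10:10",
                "2019-07-07 09:00",
                "2019-07-08 11:00",
            ]
        ),
    }
)

chunks = list(iterate_callsign(toy))
```

Note that the `AAA1` chunk spans three different days, which is exactly the limitation described above. Also keep in mind that `groupby` drops rows whose key is missing by default.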

We may count how many trajectories we found:

In [None]:
sum(1 for _ in iterate_callsign(df))

For comparison with other ways to iterate, let's use the built-in `groupby` aggregation: it seems our dataset is heavily unbalanced, with a lot of `ASR172B` flights.

In [None]:
df.groupby("callsign").agg(dict(timestamp=["count", "min", "max"]))

A smarter way to iterate may be to use both `icao24` code and `callsign` for building our iteration method.

<div class='alert alert-warning'>
    <b>Exercise:</b> Let's write an <code>iterate_icao24_callsign(data)</code> function that will <em>yield</em> one sub-dataframe for each icao24/callsign pair. Count how many elements you get.
</div>

In [None]:
# %load solutions/iterate_icao24_callsign.py


Let's compare the groups we managed to produce.

In [None]:
df.groupby(["icao24", "callsign"]).agg(dict(timestamp=["count", "min", "max"]))

With this new method, we managed to separate one `ASR172B` mission that was flown with a different aircraft on July 25th. But the other groups still mix trajectories flown on different days.

<div class='alert alert-warning'>
    <b>Exercise:</b> Let's store in the <code>bigger_chunk</code> variable all data associated with the <code>icao24</code> code <code>3924a4</code>.
</div>

In [None]:
# %load solutions/bigger_chunk.py


<div class='alert alert-warning'>
    <b>Exercise:</b> Suggest a way to plot how timestamps are distributed in July.
</div>

In [None]:
# %load solutions/bigger_chunk_plot.py


Your plot should suggest that this aircraft does not fly continuously: data is recorded throughout the month, but with long breaks in between (most probably at night).

Let's see how much time is left between two consecutive timestamps:

In [None]:
bigger_chunk.timestamp.diff().dt.total_seconds().plot.hist(bins=20)

Of course, for most samples, consecutive timestamps are close to each other (the trajectory is decently sampled), hence the high density near zero.

We may adapt the command to plot the distribution of the larger timestamp differences only:

In [None]:
bigger_chunk.timestamp.diff().dt.total_seconds().loc[lambda x: x > 100].plot.hist(bins=20)

The histogram suggests that distinct trajectories are separated by at least 60000 seconds (about 17 hours). We can use this idea to better iterate on our data. Let's set an arbitrary threshold (passed as a parameter; we could start with 20000, for instance), and yield chunks of the `bigger_chunk` dataset in which consecutive timestamps are closer than this threshold.
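
The gap-splitting idea can be illustrated on a toy dataframe. The sketch below is a minimal illustration under stated assumptions: `split_on_gaps` is a hypothetical helper name, the data is made up, and this is only one of several ways to implement the idea (here with `diff` and `cumsum`).

```python
import pandas as pd


def split_on_gaps(data, threshold):
    """Yield chunks of `data` in which consecutive timestamps are less than
    `threshold` seconds apart (hypothetical helper, for illustration)."""
    data = data.sort_values("timestamp")
    # Seconds elapsed since the previous row (NaN for the first row)
    gaps = data.timestamp.diff().dt.total_seconds()
    # A new segment starts whenever the gap reaches the threshold; the
    # cumulative sum turns these flags into segment labels 0, 1, 2, ...
    labels = (gaps >= threshold).cumsum()
    for _, chunk in data.groupby(labels):
        yield chunk


# Toy data, made up for illustration
toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2019-07-06 10:00:00",
                "2019-07-06 10:00:05",
                "2019-07-06 10:00:10",
                "2019-07-07 09:00:00",  # about 23 hours after the previous row
                "2019-07-07 09:00:05",
            ]
        )
    }
)

sizes = [len(chunk) for chunk in split_on_gaps(toy, 20000)]
```

Here the gap of about 23 hours exceeds the 20000-second threshold, so the toy data splits into one chunk of 3 rows and one chunk of 2 rows.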

<div class='alert alert-warning'>
    <b>Exercise:</b> Write an <code>iterate_time(data, threshold)</code> function that yields pieces of trajectories in which consecutive timestamps are less than <code>threshold</code> seconds apart.
</div>

In [None]:
# %load solutions/iterate_time.py


See how many trajectories you get now (on `bigger_chunk`, i.e. aircraft `3924a4`):

In [None]:
sum(1 for _ in iterate_time(bigger_chunk, 20000))

In [None]:
list(
    (str(chunk.timestamp.min()), str(chunk.timestamp.max()))
    for chunk in iterate_time(bigger_chunk, 20000)
)

<div class='alert alert-warning'>
    <b>Wrap it up!</b> Write an <code>iterate_all(data, threshold)</code> function that combines iteration on aircraft icao24, callsign and timestamp intervals.
</div>
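
Combining iteration levels is a natural fit for `yield from`, which re-yields everything produced by an inner generator. The sketch below shows the pattern on made-up data; the helper names `by_keys` and `by_gaps` are hypothetical, and this is one possible structure rather than the reference solution.

```python
import pandas as pd


def by_keys(data, keys):
    """Yield one sub-dataframe per distinct combination of `keys`."""
    for _, chunk in data.groupby(keys):
        yield chunk


def by_gaps(data, threshold):
    """Yield chunks in which consecutive timestamps are less than
    `threshold` seconds apart."""
    data = data.sort_values("timestamp")
    new_segment = data.timestamp.diff().dt.total_seconds() >= threshold
    for _, chunk in data.groupby(new_segment.cumsum()):
        yield chunk


def iterate_all(data, threshold):
    """Split by (icao24, callsign), then by time gaps (one possible sketch)."""
    for group in by_keys(data, ["icao24", "callsign"]):
        # `yield from` re-yields every chunk produced by the inner generator
        yield from by_gaps(group, threshold)


# Toy data, made up for illustration
toy = pd.DataFrame(
    {
        "icao24": ["aaaaaa", "aaaaaa", "aaaaaa", "bbbbbb"],
        "callsign": ["XYZ1", "XYZ1", "XYZ1", "XYZ1"],
        "timestamp": pd.to_datetime(
            [
                "2019-07-06 10:00:00",
                "2019-07-06 10:00:05",
                "2019-07-07 10:00:00",
                "2019-07-07 10:00:00",
            ]
        ),
    }
)

n_chunks = sum(1 for _ in iterate_all(toy, 20000))
```

On this toy data, the first aircraft contributes two chunks (one per day) and the second aircraft one chunk, so `n_chunks` is 3.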

In [None]:
# %load solutions/iterate_all.py


In [None]:
sum(1 for _ in iterate_all(df, 20000))

You may now build a summary table like the following.

<div class="alert alert-danger">
    <b>Be sure to fully understand this notebook!</b> We will be using these results during the next session.
</div>

In [None]:
pd.DataFrame.from_records(
    [
        {
            "icao24": chunk.icao24.min(),
            "callsign": chunk.callsign.min(),
            "start": chunk.timestamp.min(),
            "stop": chunk.timestamp.max(),
        }
        for chunk in iterate_all(df, 20000)
    ]
)