## Periodic pattern mining on Canadian TV logs
<img src="skmine_series.png" alt="logo" style="width: 60%;"/>

### The problem, informally
Let's take a simple example. Imagine we simply ring a bell at certain moments in time. 

In Python, we can load those "ringings" as event logs and store them in a [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), like so:

In [1]:
import datetime as dt
import pandas as pd

# offsets of each ringing, in seconds
seconds = [0, 30, 61, 99, 120, 150, 181, 210, 400]

S = pd.Series("ring a bell", index=seconds)
now = dt.datetime.now()  # store call to .now()
S.index = S.index.map(lambda e: now + dt.timedelta(seconds=e))
S.index = S.index.round("s")  # seconds as the lowest unit of difference
S

2021-04-08 17:56:10    ring a bell
2021-04-08 17:56:40    ring a bell
2021-04-08 17:57:11    ring a bell
2021-04-08 17:57:49    ring a bell
2021-04-08 17:58:10    ring a bell
2021-04-08 17:58:40    ring a bell
2021-04-08 17:59:11    ring a bell
2021-04-08 17:59:40    ring a bell
2021-04-08 18:02:50    ring a bell
dtype: object

You can see the bell rings at fairly well-defined intervals (mostly every 30 seconds), but some entries are inconsistent with this 30-second interval. **How do we deal with these "outlier" timestamps?**

Now imagine there are not just a few ringings, as above, but thousands.
**How would you detect regularities in the data?**

### Introduction to periodic pattern mining
Periodic pattern mining aims at exploiting regularities not only about `what happens`, by finding coordinated event occurrences, but also about `when it happens` and `how it happens`, by **finding consistent inter-occurrence time intervals**.

Next, we introduce the concept of cycles.

#### The cycle: a building block for periodic pattern mining
Here is an explicit example of a cycle:

<img src="cycle_color.png" alt="cycle" style="width: 60%;"/>

This definition, while relatively simple, is general enough to let us find regularities in different types of logs.

#### Handling noise in our timestamps

Needless to say, it would be too easy if events in our data were equally spaced. As data often comes impure, we have to be fault tolerant and allow small errors to sneak into our cycles. `shift corrections` can be used to recursively rebuild the original events the cycle was drawn from, with the following relation
<img src="shifts.png" alt="shifts" style="width: 60%;"/>
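The recursion can be sketched in a few lines of plain Python. This is only a minimal sketch, not scikit-mine's implementation: the function name `reconstruct_times` is hypothetical, and it assumes the relation pictured above is $t_k = t_{k-1} + p + e_k$, with the first occurrence at the start of the cycle. The numbers are the offsets, in seconds, of the bell example.

```python
def reconstruct_times(start, period, shifts):
    """Rebuild occurrence times from a start, a period and shift corrections.

    Hypothetical helper, assuming t_k = t_{k-1} + period + shift_k.
    """
    times = [start]
    for shift in shifts:
        # each occurrence is the previous one, plus the period, plus its correction
        times.append(times[-1] + period + shift)
    return times


# Bell example: start at 0s, period of 30s, 7 corrections for 8 occurrences
times = reconstruct_times(start=0, period=30, shifts=[0, 1, 8, -9, 0, 1, -1])
print(times)  # [0, 30, 61, 99, 120, 150, 181, 210]
```

Because each occurrence is built from the previous one, a single late event moves the next expected time with it, and the following correction can pull the sequence back on schedule.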

#### A tiny example with scikit-mine
`scikit-mine` offers a `PeriodicCycleMiner` out of the box.
You can use it to **detect regularities, in the form of cycles**, in the input data. These regularities are submitted to an MDL criterion, so that we neither mistakenly include occurrences, nor forget to consider other intervals that would summarize our data in a better way.

MDL offers a framework to find `the best set of cycles`, i.e. the set that gives the most succinct representation of the data. And `as humans, we often like to deal with succinct, well organized data`.

In [2]:
from skmine.periodic import PeriodicCycleMiner
pcm = PeriodicCycleMiner().fit(S)
pcm.discover()

| | | start | length | period |
|---|---|---|---|---|
| ring a bell | 0 | 2021-04-08 17:56:10 | 8 | 0 days 00:00:30 |


You can see one cycle has been extracted for our event `ring a bell`. It has a length of 8 (it covers every entry in the data except the last one) and a period of 30 seconds, as expected.

Also, note that we "lost" some information here. Our period of 30s offers the best summary for this data.
Accessing the little "shifts" as encountered in original data is also possible, with an extra argument in our `.discover` call

In [3]:
pcm.discover(shifts=True)

| | | start | length | period | dE |
|---|---|---|---|---|---|
| ring a bell | 0 | 2021-04-08 17:56:10 | 8 | 0 days 00:00:30 | [0, 1, 8, -9, 0, 1, -1] |


The last column, named `dE`, contains the list of shifts to apply to our cycle if we want to reconstruct the original data. You can see there is:
 * a 0 second shift between the first and second entry (30 seconds exactly)
 * a 1 second shift between the second and third entry (31 seconds)
 * an 8 second shift between the third and fourth entry (38 seconds)
 * a -9 second shift between the fourth and fifth entry (21 seconds)
 * ...
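These shifts can also be checked by hand. The sketch below uses pandas only (it does not call scikit-mine) and assumes the 30-second period reported by the miner: it takes the 8 timestamps covered by the cycle, computes the gap between consecutive entries, and subtracts the period.

```python
import pandas as pd

# The 8 timestamps covered by the cycle, copied from the series above
timestamps = pd.to_datetime([
    "2021-04-08 17:56:10", "2021-04-08 17:56:40", "2021-04-08 17:57:11",
    "2021-04-08 17:57:49", "2021-04-08 17:58:10", "2021-04-08 17:58:40",
    "2021-04-08 17:59:11", "2021-04-08 17:59:40",
])
period = pd.Timedelta(seconds=30)  # period found by the miner

# Gap between consecutive occurrences (the first entry has no predecessor)
gaps = pd.Series(timestamps).diff().dropna()

# One shift per gap: how far the gap deviates from the period, in seconds
shifts = (gaps - period).dt.total_seconds().astype(int).tolist()
print(shifts)  # [0, 1, 8, -9, 0, 1, -1]
```

The result matches the `dE` column: 7 shifts for a cycle of length 8.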
 
We can call `.reconstruct` to get back to the original data, and make sure our shifts are properly aligned

In [4]:
pcm.reconstruct()

2021-04-08 17:56:10    ring a bell
2021-04-08 17:56:40    ring a bell
2021-04-08 17:57:11    ring a bell
2021-04-08 17:57:49    ring a bell
2021-04-08 17:58:10    ring a bell
2021-04-08 17:58:40    ring a bell
2021-04-08 17:59:11    ring a bell
2021-04-08 17:59:40    ring a bell
2021-04-08 18:02:50    ring a bell
dtype: object

### Fetching logs from Canadian TV

In this section we are going to load some event logs of TV programs (the `WHAT`), indexed by their broadcast timestamps (the `WHEN`).

`PeriodicCycleMiner` is here to help us discover regularities (the `HOW`).

In [5]:
from skmine.datasets import fetch_canadian_tv
from skmine.periodic import PeriodicCycleMiner

#### Searching for cycles in TV programs

Remember the definition of a cycle?
Let's apply it to our TV programs.

In our case:

* $\alpha$ is the name of a TV program

* $r$ is the number of broadcasts (repetitions) for this TV program (inside this cycle)

* $p$ is the optimal time delta between broadcasts in this cycle. If a program is meant to go live every day at 14:00, then $p$ is likely to be `1 day`

* $\tau$ is the first broadcast time in this cycle

* $dE$ are the shift corrections between the period $p$ and the actual broadcast time of each event. If a TV program was scheduled at 8:30:00AM and it went on air at 8:30:23AM the same day, then we keep track of a `23 seconds shift`. This way we can summarize our data (via cycles), and reconstruct it (via shift corrections).
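As a small illustration of the last bullet, the 23-second shift can be computed with pandas timestamps. The dates below are hypothetical and chosen only for the example; the sketch assumes a daily period and that the previous broadcast was on time.

```python
import pandas as pd

p = pd.Timedelta(days=1)  # period: one broadcast per day

# Hypothetical broadcast times, for illustration only
previous = pd.Timestamp("2020-08-03 08:30:00")  # previous broadcast, on time
actual = pd.Timestamp("2020-08-04 08:30:23")    # went on air 23 seconds late

expected = previous + p    # when the cycle alone says the program should air
shift = actual - expected  # the correction to store in dE
print(shift.total_seconds())  # 23.0
```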


Finally, we are going to dig a little deeper into these cycles to answer more complex questions about our logs. We will see that cycles contain useful information about our input data.

In [6]:
ctv_logs = fetch_canadian_tv()
ctv_logs.head()

0
2020-08-01 06:00:00            The Moblees
2020-08-01 06:11:00    Big Block Sing Song
2020-08-01 06:13:00    Big Block Sing Song
2020-08-01 06:15:00               CBC Kids
2020-08-01 06:15:00               CBC Kids
Name: canadian_tv, dtype: string

In [7]:
pcm = PeriodicCycleMiner()
pcm.fit(ctv_logs)



<skmine.periodic.cycles.PeriodicCycleMiner at 0x136e9cb70>

In [8]:
cycles = pcm.discover()
cycles

| | | start | length | period |
|---|---|---|---|---|
| Addison | 0 | 2020-08-03 07:11:00 | 5 | 1 days 00:00:00 |
| Addison | 1 | 2020-08-10 07:11:00 | 5 | 1 days 00:00:00 |
| Addison | 2 | 2020-08-17 07:11:00 | 5 | 1 days 00:00:00 |
| Addison | 3 | 2020-08-24 07:11:00 | 5 | 1 days 00:00:00 |
| Arthur Shorts | 0 | 2020-08-17 09:48:00 | 5 | 1 days 00:00:00 |
| ... | ... | ... | ... | ... |
| This Hour Has 22 Minutes | 2 | 2020-08-11 00:30:00 | 7 | 0 days 00:30:00 |
| This Hour Has 22 Minutes | 3 | 2020-08-25 02:00:00 | 4 | 0 days 00:30:00 |
| This Hour Has 22 Minutes | 4 | 2020-08-12 19:00:00 | 4 | 1 days 00:00:00 |
| Thrillusionists | 0 | 2020-08-02 07:36:00 | 5 | 7 days 00:00:00 |


Now that we have our cycles in a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), we can play with the pandas API and answer questions about our logs

#### Did I find cycles for the TV show "Arthur Shorts"?

In [9]:
cycles.loc["Arthur Shorts"]

| | start | length | period |
|---|---|---|---|
| 0 | 2020-08-17 09:48:00 | 5 | 1 days 00:00:00 |
| 1 | 2020-08-24 09:48:00 | 5 | 0 days 23:59:30 |
| 2 | 2020-08-04 09:48:00 | 4 | 1 days 00:00:00 |
| 3 | 2020-08-12 09:47:00 | 3 | 1 days 00:00:30 |


#### What are the top 10 most representative TV programs?
Let's take the 10 longest cycles.

In [10]:
cycles.nlargest(10, ["length"])

| | | start | length | period |
|---|---|---|---|---|
| Grand Designs | 0 | 2020-08-01 05:00:00 | 31 | 1 days 00:00:00 |
| Schitt's Creek | 0 | 2020-08-28 00:00:00 | 8 | 0 days 00:30:00 |
| Kim's Convenience | 0 | 2020-08-05 00:30:00 | 7 | 0 days 00:30:00 |
| Kim's Convenience | 1 | 2020-08-26 00:30:00 | 7 | 0 days 00:30:00 |
| Mr. D | 0 | 2020-08-06 00:30:00 | 7 | 0 days 00:30:00 |
| Schitt's Creek | 1 | 2020-08-07 00:30:00 | 7 | 0 days 00:30:00 |
| This Hour Has 22 Minutes | 0 | 2020-08-18 00:30:00 | 7 | 0 days 00:30:00 |
| This Hour Has 22 Minutes | 1 | 2020-08-04 00:30:00 | 7 | 0 days 00:30:00 |
| This Hour Has 22 Minutes | 2 | 2020-08-11 00:30:00 | 7 | 0 days 00:30:00 |
| Addison | 0 | 2020-08-03 07:11:00 | 5 | 1 days 00:00:00 |


#### What are the 10 most unpunctual TV programs?
For this we are going to:
 1. extract the shift corrections along with the other information about our cycles
 2. compute the sum of the absolute values of the shift corrections, for every cycle
 3. get the 10 biggest sums

In [11]:
full_cycles = pcm.discover(shifts=True)
full_cycles.head()

| | | start | length | period | dE |
|---|---|---|---|---|---|
| Addison | 0 | 2020-08-03 07:11:00 | 5 | 1 days | [0, 0, 0, 0] |
| Addison | 1 | 2020-08-10 07:11:00 | 5 | 1 days | [0, 0, 0, 0] |
| Addison | 2 | 2020-08-17 07:11:00 | 5 | 1 days | [0, 0, 0, 0] |
| Addison | 3 | 2020-08-24 07:11:00 | 5 | 1 days | [0, 0, 0, 0] |
| Arthur Shorts | 0 | 2020-08-17 09:48:00 | 5 | 1 days | [0, 0, 0, 0] |


In [12]:
def absolute_sum(shifts):
    # sum of the absolute values of one cycle's shift corrections
    return sum(map(abs, shifts))

# level 0 is the name of the TV program
shift_sums = full_cycles["dE"].map(absolute_sum).groupby(level=[0]).sum()
shift_sums.nlargest(10)

Rusty Rivets                             120
Arthur Shorts                             48
Kiri & Lou                                24
Daniel Tiger's Neighbourhood              18
PJ Masks                                  18
Daisy & The Gumboot Kids                  12
Holy Baloney                              12
Thrillusionists                           12
Ollie: The Boy Who Became What He Ate      6
The Strange Chores                         6
Name: dE, dtype: int64

#### What TV programs have been broadcast every day for at least 5 days straight?
Let's make use of the [pandas.DataFrame.query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) method to express our question in an SQL-like syntax.

In [13]:
cycles.query('length >= 5 and period >= "1 days"', engine='python')

| | | start | length | period |
|---|---|---|---|---|
| Addison | 0 | 2020-08-03 07:11:00 | 5 | 1 days |
| Addison | 1 | 2020-08-10 07:11:00 | 5 | 1 days |
| Addison | 2 | 2020-08-17 07:11:00 | 5 | 1 days |
| Addison | 3 | 2020-08-24 07:11:00 | 5 | 1 days |
| Arthur Shorts | 0 | 2020-08-17 09:48:00 | 5 | 1 days |
| Beat Bugs | 0 | 2020-08-03 07:30:00 | 5 | 1 days |
| Beat Bugs | 1 | 2020-08-10 07:30:00 | 5 | 1 days |
| Beat Bugs | 2 | 2020-08-17 07:30:00 | 5 | 1 days |
| Beat Bugs | 3 | 2020-08-24 07:30:00 | 5 | 1 days |
| Big Block Sing Song | 0 | 2020-08-03 07:25:00 | 5 | 1 days |
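Note that a condition of the form `period >= "1 days"` also lets longer periods through, for example weekly cycles. If only periods of roughly one day are wanted, one option is to compare against an explicit `pandas.Timedelta` with a tolerance. The sketch below runs on a small hand-made DataFrame: the program names and the one-minute tolerance are assumptions made for the example, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical cycles, mimicking the "length" and "period" columns of .discover()
toy_cycles = pd.DataFrame(
    {
        "length": [5, 5, 7, 3],
        "period": pd.to_timedelta(
            ["1 days", "7 days", "30 min", "1 days 00:00:30"]
        ),
    },
    index=["daily show", "weekly show", "rerun block", "short run"],
)

one_day = pd.Timedelta(days=1)
tolerance = pd.Timedelta(minutes=1)  # assumed tolerance, tune to your data

# Keep periods within the tolerance of one day, in either direction
is_daily = (toy_cycles["period"] - one_day).abs() <= tolerance
daily_runs = toy_cycles[(toy_cycles["length"] >= 5) & is_daily]
print(daily_runs.index.tolist())  # ['daily show']
```

The same boolean mask can be applied to the `cycles` DataFrame, since its `period` column holds timedeltas as well.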


#### What TV programs are broadcast only on business days?
From the previous query we can see that we have a lot of cycles of length 5, with periods of 1 day.
An intuition is that these cycles take place on business days. Let's confirm this by considering cycles with
 1. start timestamps on Mondays
 2. periods of roughly 1 day  

In [14]:
monday_starts = cycles[cycles.start.dt.weekday == 0]  # start on monday
monday_starts.query('length == 5 and period >= "1 days"', engine='python')

| | | start | length | period |
|---|---|---|---|---|
| Addison | 0 | 2020-08-03 07:11:00 | 5 | 1 days |
| Addison | 1 | 2020-08-10 07:11:00 | 5 | 1 days |
| Addison | 2 | 2020-08-17 07:11:00 | 5 | 1 days |
| Addison | 3 | 2020-08-24 07:11:00 | 5 | 1 days |
| Arthur Shorts | 0 | 2020-08-17 09:48:00 | 5 | 1 days |
| Beat Bugs | 0 | 2020-08-03 07:30:00 | 5 | 1 days |
| Beat Bugs | 1 | 2020-08-10 07:30:00 | 5 | 1 days |
| Beat Bugs | 2 | 2020-08-17 07:30:00 | 5 | 1 days |
| Beat Bugs | 3 | 2020-08-24 07:30:00 | 5 | 1 days |
| Big Block Sing Song | 0 | 2020-08-03 07:25:00 | 5 | 1 days |
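To see why these two conditions are enough, we can expand one such cycle and look at the weekdays it covers. The sketch below takes the first `Addison` cycle from the table (start on 2020-08-03, length 5, period of 1 day) and ignores shift corrections, which is an assumption that holds here since its shifts are all zero.

```python
import pandas as pd

# First "Addison" cycle: starts on a Monday, 5 occurrences, 1 day apart
start = pd.Timestamp("2020-08-03 07:11:00")
occurrences = pd.date_range(start, periods=5, freq="D")

# pandas encodes Monday as 0 and Sunday as 6
weekdays = occurrences.weekday.tolist()
print(weekdays)                      # [0, 1, 2, 3, 4]
print(all(d < 5 for d in weekdays))  # True
```

A cycle of length 5 and period 1 day that starts on a Monday ends on the Friday of the same week, so it never touches a weekend.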


References
----------

1.
    Galbrun, E & Cellier, P & Tatti, N & Termier, A & Crémilleux, B
    "Mining Periodic Patterns with a MDL Criterion"

2.
    Galbrun, E
    "The Minimum Description Length Principle for Pattern Mining : A survey"

3. 
    Termier, A
    ["Periodic pattern mining"](http://people.irisa.fr/Alexandre.Termier/dmv/DMV_Periodic_patterns.pdf) 