# Anomaly Detection <a id="top"></a>

**Problem statement**:
Let `df` be a dataframe of geolocation points with the variables `time`, `adid`, `lat` and `lon`. We assume that `df` is already sorted by increasing time. An anomaly is defined as a *short* time interval in which at least one pair of `adid`s is found to be spatially far apart.

<hr>

**Step 1: Find outlying intervals**
- Set `time_thres` to be an upper bound on the length of time interval where anomalies are to be found.
- Find all intervals `[i,j]` such that
    - `0 <= i < j < len(df)` AND 
    - `df.time[j] - df.time[i] <= time_thres` AND 
    - if `df.time[j+1]` exists, `df.time[j+1] - df.time[i] > time_thres` AND
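    - if `i > 0`, `df.time[j] - df.time[i-1] > time_thres`, i.e. the interval cannot be extended to the left either (the expected outputs of the test cases only contain such maximal intervals) AND
    - more than one `adid` is found in the time interval, i.e. `len(df.loc[i:j, "adid"].unique()) > 1`.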
- Time complexity: Let $n$ be `len(df)`
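    - Best case: $O(n)$, when the whole dataset falls within `time_thres`.
    - Worst case: $O(n^2)$ (?), since the two indices only move forward but each uniqueness check scans a whole interval, which can hold up to $n$ rows.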
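The conditions of Step 1 can be written down directly as a brute-force check, which is handy as a reference when testing a faster implementation. This is only a sketch on plain lists (for example `df.time.tolist()` and `df.adid.tolist()`), and the name `maximal_windows` is made up for illustration. It assumes that only intervals which cannot be extended on either side are wanted, which is what the expected outputs of the test cases suggest.

```python
from datetime import datetime, timedelta


def maximal_windows(times, adids, time_thres):
    """Brute-force reference for Step 1 on sorted `times`.

    Returns every (i, j) with i < j such that the window fits in
    `time_thres`, cannot be extended to the left or to the right,
    and contains more than one distinct adid.
    """
    n = len(times)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            if times[j] - times[i] > time_thres:
                break  # times are sorted, so a larger j cannot fit either
            extends_right = j + 1 < n and times[j + 1] - times[i] <= time_thres
            extends_left = i > 0 and times[j] - times[i - 1] <= time_thres
            if extends_right or extends_left:
                continue  # not maximal
            if len(set(adids[i:j + 1])) > 1:
                found.append((i, j))
    return found


# Same shape as Testcase 1: five distinct adids, one minute apart.
start = datetime(2020, 1, 1)
times = [start + timedelta(minutes=k) for k in range(5)]
print(maximal_windows(times, list("ABCDE"), timedelta(minutes=10)))  # [(0, 4)]
```

The double loop makes the quadratic worst case explicit; the two-index algorithm described further down avoids rescanning the start positions.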

**Step 2: After finding all the intervals `[i,j]` from step 1, we further find all pairs of `adid`s in these intervals that are far apart spatially.**
- Set some distance (in metres) `dist_thres` to be the largest distance at which two geolocations still count as close to each other.
- For each `adid` in the interval we define its geolocation as the median of all its geolocation points within the interval.
- For each pair of `adid`s, check whether their geolocations are more than `dist_thres` apart.
- Time complexity: $O(nm^2)$, where $n$ refers to the no. of intervals found from Step 1 and $m$ refers to the no. of unique `adid`s in `df` (an upper bound on the no. of `adid`s in any one interval).
- Keeping `time_thres` in Step 1 small reduces the occurrence of intervals with many `adid`s.
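Step 2 might look roughly like the sketch below for a single interval. Several things here are assumptions rather than part of the notebook: the function names are hypothetical, the distance is computed with the haversine formula on a spherical Earth, and the median is taken separately for latitude and longitude (which ignores the wrap-around at ±180° longitude).

```python
import math
from itertools import combinations
from statistics import median

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed spherical)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def far_apart_pairs(points, dist_thres):
    """points: (adid, lat, lon) rows of one interval; returns pairs of adids."""
    by_adid = {}
    for adid, lat, lon in points:
        by_adid.setdefault(adid, []).append((lat, lon))
    # One representative location per adid: coordinate-wise median.
    centres = {
        adid: (median(p[0] for p in pts), median(p[1] for p in pts))
        for adid, pts in by_adid.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(centres), 2)
        if haversine_m(*centres[a], *centres[b]) > dist_thres
    ]


points = [
    ("A", 1.30, 103.80), ("A", 1.30, 103.80),
    ("B", 1.30, 103.80),
    ("C", 1.40, 103.80),  # about 11 km north of A and B
]
print(far_apart_pairs(points, dist_thres=1000))  # [('A', 'C'), ('B', 'C')]
```

The pairwise comparison is where the $m^2$ factor in the complexity comes from.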

**Step 3 (Optional): Visualise all these anomalies.**
- Set some integer values `prepend` and `append`
- To make the anomalies more obvious visually, prepend and append each interval with `prepend` and `append` amount respectively. These appended and prepended geolocations are considered normal.
- Merge all the overlapping intervals to obtain a set of mutually exclusive intervals.
- Plot the points in the intervals.
- Time complexity: $O(n)$, where $n$ refers to the no. of anomalous intervals found from Step 2.
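The padding and merging in Step 3 could be sketched as follows. The function name is hypothetical, and the sketch assumes that `prepend` and `append` are counted in rows and that intervals are `(start_index, end_index)` pairs with inclusive ends.

```python
def pad_and_merge(intervals, prepend, append, n_rows):
    """Pad each interval, clamp it to the data, then merge overlapping ones."""
    padded = sorted(
        (max(0, start - prepend), min(n_rows - 1, end + append))
        for start, end in intervals
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Shares at least one row with the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(pad_and_merge([(2, 3), (5, 6), (15, 16)], prepend=1, append=1, n_rows=20))
# [(1, 7), (14, 17)]
```

If the intervals already arrive sorted by start index, the `sorted` call does no real work and the pass stays linear.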

<hr>

**Further work:** This algorithm can be made online. Suppose anomalies have already been identified in `df` for 1 year's worth of data; then the anomalies for the next year's worth of data can be found without rerunning the algorithm on the full 2 years' worth of data.

Let the old data be `old_df` and the new data be `new_df` and the anomalies found for the old data be `old_anomalies_df`.
- Let the indices corresponding to the last interval in `old_anomalies_df` be $[i,j]$ and the last row number of `old_df` be $N$.
- Prepend the rows $i+1$ to $N$ of `old_df` to `new_df` and run the algorithm on the modified `new_df`.
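A minimal sketch of the prepending step, assuming 0-based row positions and a default integer index on both frames. The helper name `prepare_online_input` and its return value are made up for illustration.

```python
from datetime import datetime, timedelta

import pandas as pd


def prepare_online_input(old_df, new_df, last_start):
    """Prepend the rows of old_df after position `last_start` to new_df.

    Returns the combined frame and the number of carried-over rows, so that
    indices found in the combined frame can be mapped back afterwards.
    """
    tail = old_df.iloc[last_start + 1:]
    combined = pd.concat([tail, new_df], ignore_index=True)
    return combined, len(tail)


old_df = pd.DataFrame({
    "time": [datetime(2020, 1, 1) + timedelta(minutes=k) for k in range(5)],
    "adid": list("ABCDE"),
})
new_df = pd.DataFrame({
    "time": [datetime(2020, 1, 1, 0, 5) + timedelta(minutes=k) for k in range(3)],
    "adid": list("FGH"),
})
# Suppose the last interval found in old_df started at row 2.
combined, carried = prepare_online_input(old_df, new_df, last_start=2)
print(carried, combined.adid.tolist())
```

Row `k` of the combined frame is row `k + last_start + 1` of `old_df` when `k < carried`, and row `k - carried` of `new_df` otherwise.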

The algorithms for steps 2 and 3 are rather trivial, so I only provide the algorithm for step 1 and some test cases.

<hr>

Pseudocode for Algorithm in Step 1:

- Suppose we have the interval $[i,j], 0\le i<j \le n$, where $n$ is the last index of the dataset.
- Set $i=0, j=1$
- Repeat while $i<j$ AND $j<n$
    - if $t_j - t_i \le t_{thres}$
        - if $t_{j+1} - t_i > t_{thres}$
            - if more than 1 `adid` is present in $[t_i,t_j]$, record $\{i,j,t_i,t_j\}$
            - $j \leftarrow j+1$, $i \leftarrow i+1$
        - else
            - $j \leftarrow j+1$
    - else:
        - if $j = i+1$:
            - $j \leftarrow j+1$, $i \leftarrow i+1$
        - else:
            - $i \leftarrow i+1$
- if $j = n$,
    - if $t_j - t_i \le t_{thres}$: record $[i,j]$ if more than 1 `adid` is present in it. Algorithm ends.
    - else $i \leftarrow i+1$ and repeat this check while $i<j$.
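The same two-index idea can also be phrased as a single loop over start positions, which some readers may find easier to follow. This is an alternative sketch on plain lists, not the notebook's function, and `two_pointer_windows` is a hypothetical name. It records a window only when its right end moves past the previous one, which is equivalent to saying that the window cannot be extended to the left.

```python
from datetime import datetime, timedelta


def two_pointer_windows(times, adids, time_thres):
    """Return maximal (i, j) windows within time_thres holding > 1 adid."""
    n = len(times)
    found = []
    j = 0
    prev_j = -1  # right end reached for the previous start position
    for i in range(n):
        j = max(j, i)
        # Push the right end as far as the threshold allows for this start.
        while j + 1 < n and times[j + 1] - times[i] <= time_thres:
            j += 1
        # A right end that did not move means the window sits inside the
        # previous one, so it is skipped.
        if j > prev_j and len(set(adids[i:j + 1])) > 1:
            found.append((i, j))
        prev_j = j
    return found


# Same timestamps and adids as Testcase 6.
start = datetime(2020, 1, 1)
secs = [0, 60, 120, 125, 180, 240, 332, 539, 746, 953]
times = [start + timedelta(seconds=s) for s in secs]
print(two_pointer_windows(times, list("ABBBBBBBBA"), timedelta(minutes=10)))
# [(0, 7), (7, 9)]
```

Both indices only move forward, so the pointer movement is linear; the `set` built for each candidate window is what can push the total cost towards quadratic.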
    

<hr>

**Testcases**

- [Testcase 1](#testcase1): 5 distinct adid within `time_thres`
- [Testcase 2](#testcase2): 2 sets of 5 distinct adid, each within `time_thres`, but more than `time_thres` apart
- [Testcase 3](#testcase3): 5 non-distinct adid within `time_thres`
- [Testcase 4](#testcase4): 2 sets of 5 non-distinct adid, each within `time_thres`, but more than `time_thres` apart
- [Testcase 5](#testcase5): 2 sets of 5 distinct adid, each within `time_thres`, but at most `time_thres` apart
- [Testcase 6](#testcase6): 2 sets of 2 non-distinct adid, each within `time_thres`, but at most `time_thres` apart
- [Testcase 7](#testcase7): More than 2 sets of overlapping adid, each within `time_thres`
- [Testcase 8](#testcase8): All records are more than `time_thres` apart.
- [Testcase 9](#testcase9): 2 sets of 4 distinct adid, first set within `time_thres`, second set each more than `time_thres`
- [Testcase 10](#testcase10): 2 sets of 4 distinct adid, first set each more than `time_thres`, second set within `time_thres`
- [Testcase 11](#testcase11): 2 points, distinct adid within `time_thres`
- [Testcase 12](#testcase12): 2 points, non-distinct adid within `time_thres`
- [Testcase 13](#testcase13): 2 points, distinct adid more than `time_thres`
- [Testcase 14](#testcase14): 2 points, non-distinct adid more than `time_thres`

In [1]:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

In [2]:
def detect_weird_timeint(data, time_window):
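    '''
    Find maximal time windows [start_index, end_index] of length at most
    `time_window` that contain more than one distinct adid.

    data: dataframe sorted by increasing `time`, with a default integer
          index and the columns `time` and `adid`.
    time_window: timedelta, upper bound on the length of a window.
    Returns a dataframe with one row per window found.
    '''
    def check_multiple_adid(data, anomalies):
        # `data` is the slice for the current window; start_index and
        # end_index are read from the enclosing function at call time.
        adids = data.adid.unique()
        num_adid = len(adids)

        if num_adid > 1:
            print("anomaly found")
            anomalies.loc[len(anomalies)] = [start_index, end_index, data.time[start_index],
                                            data.time[end_index], num_adid, list(adids)]
        return anomalies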
    
    start_index = 0
    end_index = 1
    
    anomalies = pd.DataFrame(columns=["start_index", "end_index", "start_time", "end_time", "num_adid", "adids"])
    
    while (start_index < end_index and end_index+1 < len(data)):
        print("-"*50)
        print(start_index, end_index)
        
        if (data.time[end_index] - data.time[start_index] <= time_window):
            print("current time window not exceeded", str(data.time[end_index] - data.time[start_index]))
            if (data.time[end_index+1] - data.time[start_index] > time_window):
                print("next time window exceeded", str(data.time[end_index] - data.time[start_index]))
                anomalies = check_multiple_adid(data.loc[start_index:end_index], anomalies)

                start_index += 1
                
            end_index += 1
            
        else:
            print("current time window exceeded", str(data.time[end_index] - data.time[start_index]))
            if start_index + 1 == end_index:
                print("timestep = 1")
                start_index += 1
                end_index += 1
                continue
            else:
                print("timestep > 1")
                start_index += 1
    
    print(start_index, end_index)
    
    if end_index == len(data)-1:
        print("end of data")
        if end_index in set(anomalies.end_index):
            return anomalies
        
        while (start_index < end_index):
            print("-"*50)
            print(start_index, end_index)
            if (data.time[end_index] - data.time[start_index] <= time_window):
                print("current time window not exceeded", str(data.time[end_index] - data.time[start_index]))
                anomalies = check_multiple_adid(data.loc[start_index:end_index], anomalies)
                
                return anomalies
            start_index += 1
    
    return anomalies

In [3]:
START_TIME = datetime(2020,1,1)
time_thres = timedelta(minutes=10)
ADIDS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"]

## Testcase 1 <a id="testcase1"></a>

5 distinct adid within `time_thres`

[Back to top](#top)

In [4]:
testcase1_df = pd.DataFrame(columns=["time", "adid"])

for i in range(5):
    testcase1_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[i]]
    
testcase1_df = testcase1_df.sort_values(by="time").reset_index(drop=True)
testcase1_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,C
3,2020-01-01 00:03:00,D
4,2020-01-01 00:04:00,E


In [5]:
correct1_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct1_df.loc[0,:] = [0, 4, testcase1_df.time[0], testcase1_df.time[4], 5, list(testcase1_df.loc[0:4, "adid"].unique())]
correct1_df = correct1_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct1_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01,2020-01-01 00:04:00,5,"[A, B, C, D, E]"


In [6]:
output1_df = detect_weird_timeint(testcase1_df, timedelta(minutes=10))
output1_df = output1_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output1_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:03:00
0 4
end of data
--------------------------------------------------
0 4
current time window not exceeded 0:04:00
anomaly found


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01,2020-01-01 00:04:00,5,"[A, B, C, D, E]"


In [7]:
pd.testing.assert_frame_equal(correct1_df, output1_df)

## Testcase 2 <a id="testcase2"></a>

2 sets of 5 distinct adid, each within `time_thres`, but more than `time_thres` apart

[Back to top](#top)

In [8]:
testcase2_df = pd.DataFrame(columns=["time", "adid"])

for i in range(5):
    testcase2_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[i]]
    
for i in range(5):
    testcase2_df.loc[5+i,:] = [START_TIME + timedelta(minutes=31) + i*time_thres/10, ADIDS[i]]
    
testcase2_df = testcase2_df.sort_values(by="time").reset_index(drop=True)
testcase2_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,C
3,2020-01-01 00:03:00,D
4,2020-01-01 00:04:00,E
5,2020-01-01 00:31:00,A
6,2020-01-01 00:32:00,B
7,2020-01-01 00:33:00,C
8,2020-01-01 00:34:00,D
9,2020-01-01 00:35:00,E


In [9]:
correct2_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct2_df.loc[0,:] = [0, 4, testcase2_df.time[0], testcase2_df.time[4], 5, list(testcase2_df.loc[0:4, "adid"].unique())]
correct2_df.loc[1,:] = [5, 9, testcase2_df.time[5], testcase2_df.time[9], 5, list(testcase2_df.loc[5:9, "adid"].unique())]
correct2_df = correct2_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct2_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01 00:00:00,2020-01-01 00:04:00,5,"[A, B, C, D, E]"
1,5,9,2020-01-01 00:31:00,2020-01-01 00:35:00,5,"[A, B, C, D, E]"


In [10]:
output2_df = detect_weird_timeint(testcase2_df, timedelta(minutes=10))
output2_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:03:00
--------------------------------------------------
0 4
current time window not exceeded 0:04:00
next time window exceeded 0:04:00
anomaly found
--------------------------------------------------
1 5
current time window exceeded 0:30:00
timestep > 1
--------------------------------------------------
2 5
current time window exceeded 0:29:00
timestep > 1
--------------------------------------------------
3 5
current time window exceeded 0:28:00
timestep > 1
--------------------------------------------------
4 5
current time window exceeded 0:27:00
timestep = 1
--------------------------------------------------
5 6
current time window not exceeded 0:01:00
--------------------------------------------------
5

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01 00:00:00,2020-01-01 00:04:00,5,"[A, B, C, D, E]"
1,5,9,2020-01-01 00:31:00,2020-01-01 00:35:00,5,"[A, B, C, D, E]"


In [11]:
pd.testing.assert_frame_equal(correct2_df, output2_df)

## Testcase 3 <a id="testcase3"></a>

5 non-distinct adid within `time_thres`

[Back to top](#top)

In [12]:
testcase3_df = pd.DataFrame(columns=["time", "adid"])

for i in range(5):
    testcase3_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[i%3]]
    
testcase3_df = testcase3_df.sort_values(by="time").reset_index(drop=True)
testcase3_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,C
3,2020-01-01 00:03:00,A
4,2020-01-01 00:04:00,B


In [13]:
correct3_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct3_df.loc[0,:] = [0, 4, testcase3_df.time[0], testcase3_df.time[4], 3, list(testcase3_df.loc[0:4, "adid"].unique())]
correct3_df = correct3_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct3_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01,2020-01-01 00:04:00,3,"[A, B, C]"


In [14]:
output3_df = detect_weird_timeint(testcase3_df, timedelta(minutes=10))
output3_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:03:00
0 4
end of data
--------------------------------------------------
0 4
current time window not exceeded 0:04:00
anomaly found


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01,2020-01-01 00:04:00,3,"[A, B, C]"


In [15]:
pd.testing.assert_frame_equal(correct3_df, output3_df)

## Testcase 4 <a id="testcase4"></a>

2 sets of 5 non-distinct adid, each within `time_thres`, but more than `time_thres` apart

[Back to top](#top)

In [16]:
testcase4_df = pd.DataFrame(columns=["time", "adid"])

for i in range(5):
    testcase4_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[i%3]]
    
for i in range(5):
    testcase4_df.loc[5+i,:] = [START_TIME + timedelta(minutes=31) + i*time_thres/10, ADIDS[i%3]]
    
testcase4_df = testcase4_df.sort_values(by="time").reset_index(drop=True)
testcase4_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,C
3,2020-01-01 00:03:00,A
4,2020-01-01 00:04:00,B
5,2020-01-01 00:31:00,A
6,2020-01-01 00:32:00,B
7,2020-01-01 00:33:00,C
8,2020-01-01 00:34:00,A
9,2020-01-01 00:35:00,B


In [17]:
correct4_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct4_df.loc[0,:] = [0, 4, testcase4_df.time[0], testcase4_df.time[4], 3, list(testcase4_df.loc[0:4, "adid"].unique())]
correct4_df.loc[1,:] = [5, 9, testcase4_df.time[5], testcase4_df.time[9], 3, list(testcase4_df.loc[5:9, "adid"].unique())]
correct4_df = correct4_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct4_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01 00:00:00,2020-01-01 00:04:00,3,"[A, B, C]"
1,5,9,2020-01-01 00:31:00,2020-01-01 00:35:00,3,"[A, B, C]"


In [18]:
output4_df = detect_weird_timeint(testcase4_df, timedelta(minutes=10))
output4_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:03:00
--------------------------------------------------
0 4
current time window not exceeded 0:04:00
next time window exceeded 0:04:00
anomaly found
--------------------------------------------------
1 5
current time window exceeded 0:30:00
timestep > 1
--------------------------------------------------
2 5
current time window exceeded 0:29:00
timestep > 1
--------------------------------------------------
3 5
current time window exceeded 0:28:00
timestep > 1
--------------------------------------------------
4 5
current time window exceeded 0:27:00
timestep = 1
--------------------------------------------------
5 6
current time window not exceeded 0:01:00
--------------------------------------------------
5

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,4,2020-01-01 00:00:00,2020-01-01 00:04:00,3,"[A, B, C]"
1,5,9,2020-01-01 00:31:00,2020-01-01 00:35:00,3,"[A, B, C]"


In [19]:
pd.testing.assert_frame_equal(correct4_df, output4_df)

## Testcase 5 <a id="testcase5"></a>

2 sets of 5 distinct adid, each within `time_thres`, but at most `time_thres` apart

[Back to top](#top)

In [20]:
testcase5_df = pd.DataFrame(columns=["time", "adid"])

for i in range(5):
    testcase5_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[i]]
    
for i in range(5):
    testcase5_df.loc[5+i,:] = [testcase5_df.time[2] + timedelta(seconds=5) + i*timedelta(seconds=177), ADIDS[i+5]]
    
testcase5_df = testcase5_df.sort_values(by="time").reset_index(drop=True)
testcase5_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,C
3,2020-01-01 00:02:05,F
4,2020-01-01 00:03:00,D
5,2020-01-01 00:04:00,E
6,2020-01-01 00:05:02,G
7,2020-01-01 00:07:59,H
8,2020-01-01 00:10:56,I
9,2020-01-01 00:13:53,J


In [21]:
correct5_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct5_df.loc[0,:] = [0, 7, testcase5_df.time[0], testcase5_df.time[7], 8, list(testcase5_df.loc[0:7, "adid"].unique())]
correct5_df.loc[1,:] = [1, 8, testcase5_df.time[1], testcase5_df.time[8], 8, list(testcase5_df.loc[1:8, "adid"].unique())]
correct5_df.loc[2,:] = [5, 9, testcase5_df.time[5], testcase5_df.time[9], 5, list(testcase5_df.loc[5:9, "adid"].unique())]
correct5_df = correct5_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct5_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,7,2020-01-01 00:00:00,2020-01-01 00:07:59,8,"[A, B, C, F, D, E, G, H]"
1,1,8,2020-01-01 00:01:00,2020-01-01 00:10:56,8,"[B, C, F, D, E, G, H, I]"
2,5,9,2020-01-01 00:04:00,2020-01-01 00:13:53,5,"[E, G, H, I, J]"


In [22]:
output5_df = detect_weird_timeint(testcase5_df, timedelta(minutes=10))
output5_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:02:05
--------------------------------------------------
0 4
current time window not exceeded 0:03:00
--------------------------------------------------
0 5
current time window not exceeded 0:04:00
--------------------------------------------------
0 6
current time window not exceeded 0:05:02
--------------------------------------------------
0 7
current time window not exceeded 0:07:59
next time window exceeded 0:07:59
anomaly found
--------------------------------------------------
1 8
current time window not exceeded 0:09:56
next time window exceeded 0:09:56
anomaly found
2 9
end of data
--------------------------------------------------
2 9
--------------------------------------------------
3 9
----------

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,7,2020-01-01 00:00:00,2020-01-01 00:07:59,8,"[A, B, C, F, D, E, G, H]"
1,1,8,2020-01-01 00:01:00,2020-01-01 00:10:56,8,"[B, C, F, D, E, G, H, I]"
2,5,9,2020-01-01 00:04:00,2020-01-01 00:13:53,5,"[E, G, H, I, J]"


In [23]:
pd.testing.assert_frame_equal(correct5_df, output5_df)

## Testcase 6 <a id="testcase6"></a>

2 sets of 2 non-distinct adid, each within `time_thres`, but at most `time_thres` apart

[Back to top](#top)

In [24]:
testcase6_df = pd.DataFrame(columns=["time", "adid"])

testcase6_df.loc[0,:] = [START_TIME, ADIDS[0]]

for i in range(1, 5):
    testcase6_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[1]]
    
for i in range(4):
    testcase6_df.loc[5+i,:] = [testcase6_df.time[2] + timedelta(seconds=5) + i*timedelta(seconds=207), ADIDS[1]]

testcase6_df.loc[9,:] = [testcase6_df.time[2] + timedelta(seconds=5) + 4*timedelta(seconds=207), ADIDS[0]]
    
testcase6_df = testcase6_df.sort_values(by="time").reset_index(drop=True)
testcase6_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,B
3,2020-01-01 00:02:05,B
4,2020-01-01 00:03:00,B
5,2020-01-01 00:04:00,B
6,2020-01-01 00:05:32,B
7,2020-01-01 00:08:59,B
8,2020-01-01 00:12:26,B
9,2020-01-01 00:15:53,A


In [25]:
correct6_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct6_df.loc[0,:] = [0, 7, testcase6_df.time[0], testcase6_df.time[7], 2, list(testcase6_df.loc[0:7, "adid"].unique())]
correct6_df.loc[1,:] = [7, 9, testcase6_df.time[7], testcase6_df.time[9], 2, list(testcase6_df.loc[7:9, "adid"].unique())]
correct6_df = correct6_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct6_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,7,2020-01-01 00:00:00,2020-01-01 00:08:59,2,"[A, B]"
1,7,9,2020-01-01 00:08:59,2020-01-01 00:15:53,2,"[B, A]"


In [26]:
output6_df = detect_weird_timeint(testcase6_df, timedelta(minutes=10))
output6_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:02:05
--------------------------------------------------
0 4
current time window not exceeded 0:03:00
--------------------------------------------------
0 5
current time window not exceeded 0:04:00
--------------------------------------------------
0 6
current time window not exceeded 0:05:32
--------------------------------------------------
0 7
current time window not exceeded 0:08:59
next time window exceeded 0:08:59
anomaly found
--------------------------------------------------
1 8
current time window exceeded 0:11:26
timestep > 1
--------------------------------------------------
2 8
current time window exceeded 0:10:26
timestep > 1
--------------------------------------------------
3 8
current time wi

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,7,2020-01-01 00:00:00,2020-01-01 00:08:59,2,"[A, B]"
1,7,9,2020-01-01 00:08:59,2020-01-01 00:15:53,2,"[B, A]"


In [27]:
pd.testing.assert_frame_equal(correct6_df, output6_df)

## Testcase 7 <a id="testcase7"></a>

More than 2 sets of overlapping adid, each within `time_thres`

[Back to top](#top)

In [28]:
testcase7_df = pd.DataFrame(columns=["time", "adid"])

for i in range(4):
    testcase7_df.loc[i,:] = [START_TIME + i*timedelta(seconds=207), ADIDS[i%3]]
    
for i in range(4):
    testcase7_df.loc[i+4,:] = [testcase7_df.time[1] + timedelta(minutes=3) + i*time_thres/10, ADIDS[i%3]]
    
for i in range(4):
    testcase7_df.loc[i+8,:] = [testcase7_df.time[3] + timedelta(minutes=5) + i*time_thres/4, ADIDS[1]]
    
testcase7_df = testcase7_df.sort_values(by="time").reset_index(drop=True)
testcase7_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:03:27,B
2,2020-01-01 00:06:27,A
3,2020-01-01 00:06:54,C
4,2020-01-01 00:07:27,B
5,2020-01-01 00:08:27,C
6,2020-01-01 00:09:27,A
7,2020-01-01 00:10:21,A
8,2020-01-01 00:15:21,B
9,2020-01-01 00:17:51,B
10,2020-01-01 00:20:21,B
11,2020-01-01 00:22:51,B


In [29]:
correct7_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct7_df.loc[0,:] = [0, 6, testcase7_df.time[0], testcase7_df.time[6], 3, list(testcase7_df.loc[0:6, "adid"].unique())]
correct7_df.loc[1,:] = [1, 7, testcase7_df.time[1], testcase7_df.time[7], 3, list(testcase7_df.loc[1:7, "adid"].unique())]
correct7_df.loc[2,:] = [2, 8, testcase7_df.time[2], testcase7_df.time[8], 3, list(testcase7_df.loc[2:8, "adid"].unique())]
correct7_df.loc[3,:] = [5, 9, testcase7_df.time[5], testcase7_df.time[9], 3, list(testcase7_df.loc[5:9, "adid"].unique())]
correct7_df.loc[4,:] = [7, 10, testcase7_df.time[7], testcase7_df.time[10], 2, list(testcase7_df.loc[7:10, "adid"].unique())]
correct7_df = correct7_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct7_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,6,2020-01-01 00:00:00,2020-01-01 00:09:27,3,"[A, B, C]"
1,1,7,2020-01-01 00:03:27,2020-01-01 00:10:21,3,"[B, A, C]"
2,2,8,2020-01-01 00:06:27,2020-01-01 00:15:21,3,"[A, C, B]"
3,5,9,2020-01-01 00:08:27,2020-01-01 00:17:51,3,"[C, A, B]"
4,7,10,2020-01-01 00:10:21,2020-01-01 00:20:21,2,"[A, B]"


In [30]:
output7_df = detect_weird_timeint(testcase7_df, timedelta(minutes=10))
output7_df

--------------------------------------------------
0 1
current time window not exceeded 0:03:27
--------------------------------------------------
0 2
current time window not exceeded 0:06:27
--------------------------------------------------
0 3
current time window not exceeded 0:06:54
--------------------------------------------------
0 4
current time window not exceeded 0:07:27
--------------------------------------------------
0 5
current time window not exceeded 0:08:27
--------------------------------------------------
0 6
current time window not exceeded 0:09:27
next time window exceeded 0:09:27
anomaly found
--------------------------------------------------
1 7
current time window not exceeded 0:06:54
next time window exceeded 0:06:54
anomaly found
--------------------------------------------------
2 8
current time window not exceeded 0:08:54
next time window exceeded 0:08:54
anomaly found
--------------------------------------------------
3 9
current time window exceeded 0:10

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,6,2020-01-01 00:00:00,2020-01-01 00:09:27,3,"[A, B, C]"
1,1,7,2020-01-01 00:03:27,2020-01-01 00:10:21,3,"[B, A, C]"
2,2,8,2020-01-01 00:06:27,2020-01-01 00:15:21,3,"[A, C, B]"
3,5,9,2020-01-01 00:08:27,2020-01-01 00:17:51,3,"[C, A, B]"
4,7,10,2020-01-01 00:10:21,2020-01-01 00:20:21,2,"[A, B]"


In [31]:
pd.testing.assert_frame_equal(correct7_df, output7_df)

## Testcase 8 <a id="testcase8"></a>

All records are more than `time_thres` apart

[Back to top](#top)

In [39]:
testcase8_df = pd.DataFrame(columns=["time", "adid"])
    
for i in range(10):
    testcase8_df.loc[i,:] = [START_TIME + (i+1)*(time_thres+timedelta(minutes=1)), ADIDS[i%3]]
    
testcase8_df = testcase8_df.sort_values(by="time").reset_index(drop=True)
testcase8_df

Unnamed: 0,time,adid
0,2020-01-01 00:11:00,A
1,2020-01-01 00:22:00,B
2,2020-01-01 00:33:00,C
3,2020-01-01 00:44:00,A
4,2020-01-01 00:55:00,B
5,2020-01-01 01:06:00,C
6,2020-01-01 01:17:00,A
7,2020-01-01 01:28:00,B
8,2020-01-01 01:39:00,C
9,2020-01-01 01:50:00,A


In [43]:
correct8_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct8_df = correct8_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct8_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [46]:
output8_df = detect_weird_timeint(testcase8_df, timedelta(minutes=10))
output8_df = output8_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output8_df

--------------------------------------------------
0 1
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
1 2
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
2 3
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
3 4
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
4 5
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
5 6
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
6 7
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
7 8
current time window exceeded 0:11:00
timestep = 1
8 9
end of data
--------------------------------------------------
8 9


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [47]:
pd.testing.assert_frame_equal(correct8_df, output8_df)

## Testcase 9 <a id="testcase9"></a>

2 sets of 4 distinct adid, first set within `time_thres`, second set each more than `time_thres`

[Back to top](#top)

In [51]:
testcase9_df = pd.DataFrame(columns=["time", "adid"])
    
for i in range(4):
    testcase9_df.loc[i,:] = [START_TIME + i*time_thres/10, ADIDS[i]]
    
for i in range(4):
    testcase9_df.loc[i+4,:] = [testcase9_df.time[len(testcase9_df)-1] + time_thres + timedelta(minutes=1), 
                               ADIDS[i+4]]
    
testcase9_df = testcase9_df.sort_values(by="time").reset_index(drop=True)
testcase9_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:01:00,B
2,2020-01-01 00:02:00,C
3,2020-01-01 00:03:00,D
4,2020-01-01 00:14:00,E
5,2020-01-01 00:25:00,F
6,2020-01-01 00:36:00,G
7,2020-01-01 00:47:00,H


In [57]:
correct9_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct9_df = correct9_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct9_df.loc[0,:] = [0, 3, testcase9_df.time[0], testcase9_df.time[3], len(testcase9_df.loc[0:3, "adid"]), list(testcase9_df.loc[0:3, "adid"])]
correct9_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,3,2020-01-01,2020-01-01 00:03:00,4,"[A, B, C, D]"


In [58]:
output9_df = detect_weird_timeint(testcase9_df, timedelta(minutes=10))
output9_df = output9_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output9_df

--------------------------------------------------
0 1
current time window not exceeded 0:01:00
--------------------------------------------------
0 2
current time window not exceeded 0:02:00
--------------------------------------------------
0 3
current time window not exceeded 0:03:00
next time window exceeded 0:03:00
anomaly found
--------------------------------------------------
1 4
current time window exceeded 0:13:00
timestep > 1
--------------------------------------------------
2 4
current time window exceeded 0:12:00
timestep > 1
--------------------------------------------------
3 4
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
4 5
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
5 6
current time window exceeded 0:11:00
timestep = 1
6 7
end of data
--------------------------------------------------
6 7


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,3,2020-01-01,2020-01-01 00:03:00,4,"[A, B, C, D]"


In [60]:
pd.testing.assert_frame_equal(correct9_df, output9_df)

## Testcase 10 <a id="testcase10"></a>

2 sets of 4 distinct adid, first set each more than `time_thres`, second set within `time_thres`

[Back to top](#top)

In [62]:
testcase10_df = pd.DataFrame(columns=["time", "adid"])
    
for i in range(4):
    testcase10_df.loc[i,:] = [START_TIME + i*(time_thres + timedelta(minutes=1)), ADIDS[i]]

for i in range(4):
    testcase10_df.loc[i+4,:] = [testcase10_df.time[3] + (i+1)*time_thres/10, ADIDS[i+4]]
    
testcase10_df = testcase10_df.sort_values(by="time").reset_index(drop=True)
testcase10_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:11:00,B
2,2020-01-01 00:22:00,C
3,2020-01-01 00:33:00,D
4,2020-01-01 00:34:00,E
5,2020-01-01 00:35:00,F
6,2020-01-01 00:36:00,G
7,2020-01-01 00:37:00,H


In [66]:
correct10_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct10_df = correct10_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct10_df.loc[0,:] = [3, 7, testcase10_df.time[3], testcase10_df.time[7], len(testcase10_df.loc[3:7, "adid"]), list(testcase10_df.loc[3:7, "adid"])]
correct10_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,3,7,2020-01-01 00:33:00,2020-01-01 00:37:00,5,"[D, E, F, G, H]"


In [67]:
output10_df = detect_weird_timeint(testcase10_df, timedelta(minutes=10))
output10_df = output10_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output10_df

--------------------------------------------------
0 1
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
1 2
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
2 3
current time window exceeded 0:11:00
timestep = 1
--------------------------------------------------
3 4
current time window not exceeded 0:01:00
--------------------------------------------------
3 5
current time window not exceeded 0:02:00
--------------------------------------------------
3 6
current time window not exceeded 0:03:00
3 7
end of data
--------------------------------------------------
3 7
current time window not exceeded 0:04:00
anomaly found


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,3,7,2020-01-01 00:33:00,2020-01-01 00:37:00,5,"[D, E, F, G, H]"


In [68]:
pd.testing.assert_frame_equal(correct10_df, output10_df)
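
As a cross-check on the expected frame above, the Step 1 conditions can be enumerated by brute force. This is a minimal sketch, not the notebook's `detect_weird_timeint`: the name `brute_force_intervals` is hypothetical, it takes plain lists rather than a dataframe, and the left-maximality check is an assumption inferred from the expected output (only `[3, 7]` is reported, not the nested `[4, 7]`, `[5, 7]`, `[6, 7]`).

```python
from datetime import datetime, timedelta


def brute_force_intervals(times, adids, time_thres):
    """Hypothetical reference check for Step 1 (assumes times are sorted)."""
    n = len(times)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            if times[j] - times[i] > time_thres:
                break  # sorted input: any later j only widens the gap
            # Step 1: the window cannot be extended to the right.
            right_maximal = j + 1 == n or times[j + 1] - times[i] > time_thres
            # Assumption: the window cannot be extended to the left either,
            # so intervals nested inside a reported one are skipped.
            left_maximal = i == 0 or times[j] - times[i - 1] > time_thres
            # Step 1: more than one unique adid inside the window.
            if right_maximal and left_maximal and len(set(adids[i:j + 1])) > 1:
                found.append((i, j))
    return found


# Same timestamps and adids as testcase10_df, written out by hand.
start = datetime(2020, 1, 1)
demo_times = [start + timedelta(minutes=m) for m in (0, 11, 22, 33, 34, 35, 36, 37)]
demo_adids = list("ABCDEFGH")

result = brute_force_intervals(demo_times, demo_adids, timedelta(minutes=10))
print(result)  # [(3, 7)]
```

With a dataframe, the same sketch could be called as `brute_force_intervals(list(df["time"]), list(df["adid"]), time_thres)`. It runs in quadratic time, so it is only suitable for small test inputs.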

## Testcase 11 <a id="testcase11"></a>

2 points, distinct adid within `time_thres`

[Back to top](#top)

In [70]:
testcase11_df = pd.DataFrame(columns=["time", "adid"])
    
testcase11_df.loc[0,:] = [START_TIME, ADIDS[0]]
testcase11_df.loc[1,:] = [START_TIME + time_thres/2, ADIDS[1]]
    
testcase11_df = testcase11_df.sort_values(by="time").reset_index(drop=True)
testcase11_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:05:00,B


In [71]:
correct11_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct11_df = correct11_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct11_df.loc[0,:] = [0, 1, testcase11_df.time[0], testcase11_df.time[1], len(testcase11_df.loc[0:1, "adid"]), list(testcase11_df.loc[0:1, "adid"])]
correct11_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,1,2020-01-01,2020-01-01 00:05:00,2,"[A, B]"


In [73]:
output11_df = detect_weird_timeint(testcase11_df, timedelta(minutes=10))
output11_df = output11_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output11_df

0 1
end of data
--------------------------------------------------
0 1
current time window not exceeded 0:05:00
anomaly found


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids
0,0,1,2020-01-01,2020-01-01 00:05:00,2,"[A, B]"


In [74]:
pd.testing.assert_frame_equal(correct11_df, output11_df)

## Testcase 12 <a id="testcase12"></a>

2 points, same adid, within `time_thres` (no anomaly expected)

[Back to top](#top)

In [75]:
testcase12_df = pd.DataFrame(columns=["time", "adid"])
    
testcase12_df.loc[0,:] = [START_TIME, ADIDS[0]]
testcase12_df.loc[1,:] = [START_TIME + time_thres/2, ADIDS[0]]
    
testcase12_df = testcase12_df.sort_values(by="time").reset_index(drop=True)
testcase12_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:05:00,A


In [76]:
correct12_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct12_df = correct12_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct12_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [77]:
output12_df = detect_weird_timeint(testcase12_df, timedelta(minutes=10))
output12_df = output12_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output12_df

0 1
end of data
--------------------------------------------------
0 1
current time window not exceeded 0:05:00


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [78]:
pd.testing.assert_frame_equal(correct12_df, output12_df)

## Testcase 13 <a id="testcase13"></a>

2 points, distinct adids, more than `time_thres` apart (no anomaly expected)

[Back to top](#top)

In [81]:
testcase13_df = pd.DataFrame(columns=["time", "adid"])
    
testcase13_df.loc[0,:] = [START_TIME, ADIDS[0]]
testcase13_df.loc[1,:] = [START_TIME + time_thres + timedelta(minutes=1), ADIDS[1]]
    
testcase13_df = testcase13_df.sort_values(by="time").reset_index(drop=True)
testcase13_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:11:00,B


In [82]:
correct13_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct13_df = correct13_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct13_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [83]:
output13_df = detect_weird_timeint(testcase13_df, timedelta(minutes=10))
output13_df = output13_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output13_df

0 1
end of data
--------------------------------------------------
0 1


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [84]:
pd.testing.assert_frame_equal(correct13_df, output13_df)

## Testcase 14 <a id="testcase14"></a>

2 points, same adid, more than `time_thres` apart (no anomaly expected)

[Back to top](#top)

In [85]:
testcase14_df = pd.DataFrame(columns=["time", "adid"])
    
testcase14_df.loc[0,:] = [START_TIME, ADIDS[0]]
testcase14_df.loc[1,:] = [START_TIME + time_thres + timedelta(minutes=1), ADIDS[0]]
    
testcase14_df = testcase14_df.sort_values(by="time").reset_index(drop=True)
testcase14_df

Unnamed: 0,time,adid
0,2020-01-01 00:00:00,A
1,2020-01-01 00:11:00,A


In [86]:
correct14_df = pd.DataFrame(columns=['start_index', 'end_index', 'start_time', 'end_time', 'num_adid', 'adids'])
correct14_df = correct14_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
correct14_df

Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [87]:
output14_df = detect_weird_timeint(testcase14_df, timedelta(minutes=10))
output14_df = output14_df.astype({'start_time': 'datetime64[ns]', 'end_time': 'datetime64[ns]'})
output14_df

0 1
end of data
--------------------------------------------------
0 1


Unnamed: 0,start_index,end_index,start_time,end_time,num_adid,adids


In [88]:
pd.testing.assert_frame_equal(correct14_df, output14_df)
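
Testcases 11 to 14 cover the four combinations of distinct or repeated adid and a gap within or beyond `time_thres`. For exactly two rows, the Step 1 conditions reduce to a single check, sketched below. The helper `two_point_is_anomaly` and the `cases` table are hypothetical and only restate the expected results of these four testcases. They do not call `detect_weird_timeint`.

```python
from datetime import datetime, timedelta


def two_point_is_anomaly(t0, a0, t1, a1, time_thres):
    """Hypothetical helper: Step 1 conditions specialised to two sorted rows."""
    return (t1 - t0 <= time_thres) and (a0 != a1)


START = datetime(2020, 1, 1)
THRES = timedelta(minutes=10)

# testcase number -> (time of second point, first adid, second adid)
cases = {
    11: (START + THRES / 2, "A", "B"),                      # within, distinct
    12: (START + THRES / 2, "A", "A"),                      # within, same
    13: (START + THRES + timedelta(minutes=1), "A", "B"),   # beyond, distinct
    14: (START + THRES + timedelta(minutes=1), "A", "A"),   # beyond, same
}

summary = {
    number: two_point_is_anomaly(START, a0, t1, a1, THRES)
    for number, (t1, a0, a1) in cases.items()
}
print(summary)  # {11: True, 12: False, 13: False, 14: False}
```

Only testcase 11 satisfies both conditions, which matches the single non-empty expected frame among the four.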