## Introduction
The purpose of this algorithm is to find duplicate records that have different UTC time stamps between different uploads. It does NOT consider the case where there is duplicate data within an upload.

This algorithm is also NOT designed to handle the case where duplicate data has the same time stamp, as there are more efficient functions (e.g., the `duplicated` and `drop_duplicates` methods in pandas) that can quickly find and delete duplicated data. In fact, one should get rid of those types of duplicates before running this algorithm.
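
As a rough sketch of that cleanup step, the snippet below drops records that share both a time stamp and a cgm value. The column names `time` and `value` and the sample rows are assumptions made for illustration, not the actual donor data schema.

```python
import pandas as pd

# Hypothetical cgm records; column names and values are illustrative only.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-01-01 00:00:00", "2018-01-01 00:00:00",
        "2018-01-01 00:05:00", "2018-01-01 00:10:00",
    ]),
    "value": [150, 150, 160, 170],
})

# Flag every row that repeats an earlier (time, value) pair.
isDuplicate = df.duplicated(subset=["time", "value"], keep="first")

# Keep only the first occurrence of each record.
deduped = df[~isDuplicate].reset_index(drop=True)

print(isDuplicate.tolist())  # [False, True, False, False]
print(len(deduped))          # 3
```

Which columns belong in `subset` depends on what counts as "the same record" in a given dataset, so treat the choice above as a placeholder.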

Also, there are two key steps in the de-duplication process: 1) identifying duplicates, and 2) getting rid of the bad duplicate(s). This algorithm only addresses (1); however, if we can find all cases of duplicates from step (1) it should allow us to examine and solve (2).

This method is comprehensive in the sense that it looks for duplicates between every unique pairwise combination of uploadIds for a given user. While this may be overkill, as it may compare data that is separated by several years, it is important that we don't make any assumptions about the actual time of the data, as the whole point of the algorithm is to find duplicates where the times are different. Further, we have seen a few cases where the UTC time can be off by months or years.
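
To give a sense of how the number of comparisons grows, here is a small sketch that enumerates the unique pairs for a made-up list of uploadIds. The ids are hypothetical; the count is the usual "n choose 2".

```python
from itertools import combinations
from math import factorial

# Hypothetical uploadIds for one user.
uploadIds = ["upload_a", "upload_b", "upload_c", "upload_d"]

# Every unique unordered pair of uploads that would be compared.
pairs = list(combinations(uploadIds, 2))

# The same count from the n-choose-2 formula: n! / (2! * (n - 2)!)
n = len(uploadIds)
nPairs = factorial(n) // (2 * factorial(n - 2))

print(pairs[0])            # ('upload_a', 'upload_b')
print(len(pairs), nPairs)  # 6 6
```

Because the count grows roughly with the square of the number of uploads, a user with many uploads can require a large number of pairwise comparisons.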

## Algorithm Logic

This algorithm finds sequences of cgm values that are duplicated between two different uploadIds. Here is the basic logic of the algorithm. For one user's data, we first prepare the data from each uploadId as follows:
* First we round all of the cgm data to the nearest 5 minutes.
* Second we create a contiguous time series between the first data point and the last data point in the time series.
* We then merge the rounded cgm data with the contiguous time series, so that the missing cgm points are filled with nans.
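
The three preparation steps above could look something like the following in pandas. This is only a sketch: the column names `time`, `roundedTime`, and `value` and the sample readings are assumptions, and it does not handle two readings that round to the same 5-minute bin.

```python
import pandas as pd

# Hypothetical cgm readings from a single upload.
cgm = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-01-01 00:01:10", "2018-01-01 00:06:05",
        "2018-01-01 00:21:00",
    ]),
    "value": [150, 160, 190],
})

# 1) round each time stamp to the nearest 5 minutes
cgm["roundedTime"] = cgm["time"].dt.round("5min")

# 2) build a contiguous 5-minute series from the first to the last reading
contiguous = pd.DataFrame({
    "roundedTime": pd.date_range(
        cgm["roundedTime"].min(), cgm["roundedTime"].max(), freq="5min"
    )
})

# 3) left-merge so that time slots without a reading become NaN
series = contiguous.merge(
    cgm[["roundedTime", "value"]], on="roundedTime", how="left"
)

print(series["value"].tolist())  # [150.0, 160.0, nan, nan, 190.0]
```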

We then repeat the following for each unique pairwise combination of uploadIds:
* We take the longer/larger time series and call it TL, and the shorter one Ts, which is useful for keeping track of which indices match.
* At this step, there is an optional preprocessing step that orders the shift indices to speed up the algorithm (see details in the real world example below).
* Next we shift Ts over TL, and at each step we calculate the element-by-element difference between cgm values. If there is an exact match, then the difference will be zero.
* At each step we count the number of zeros, and if it meets or exceeds an algorithm-defined threshold, we tag the sequence in both time series as being a duplicate.
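
One way the tagging in the last bullet might be done is sketched below. The function name `tag_duplicates` and the choice to tag only the exactly matching points (leaving nan gaps untagged) are assumptions made for illustration, not necessarily how the production code tags sequences.

```python
import numpy as np

def tag_duplicates(TL, Ts, minThreshold):
    """Return boolean masks marking the points of TL and Ts that match exactly
    at any shift where the number of matches reaches minThreshold."""
    TL = np.asarray(TL, dtype=float)
    Ts = np.asarray(Ts, dtype=float)
    dupL = np.zeros(len(TL), dtype=bool)
    dupS = np.zeros(len(Ts), dtype=bool)
    for i in range(-len(Ts) + 1, len(TL)):
        # overlapping window of TL at this shift
        startL = max(0, i)
        endL = min(len(TL), len(Ts) + i)
        # matching window of Ts (its tail when i < 0, otherwise its head)
        startS = max(0, -i)
        endS = startS + (endL - startL)
        # NaN differences compare as False, so gaps never count as matches
        isZero = (TL[startL:endL] - Ts[startS:endS]) == 0
        if isZero.sum() >= minThreshold:
            dupL[startL:endL] |= isZero
            dupS[startS:endS] |= isZero
    return dupL, dupS

TL = [150, 160, 170, 180, 190, 200, 210, 220, 230, 240]
Ts = [180, np.nan, 200, 210, 220]
dupL, dupS = tag_duplicates(TL, Ts, minThreshold=3)
print(np.nonzero(dupL)[0].tolist())  # [3, 5, 6, 7]
print(np.nonzero(dupS)[0].tolist())  # [0, 2, 3, 4]
```

Returning masks rather than printing keeps track of which indices in each series matched, which is what the later "get rid of the bad duplicate(s)" step would need.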

For the examples below we will focus on cgm time series data, but if there are other data types that are missing deviceTime data and/or have incorrect UTC times, this algorithm can be adapted to those situations too. The following examples are given:
* a very simple (fake) illustrative example
* an example with real Tidepool donor data

In [2]:
# load in the required libraries
import os
import pandas as pd
import numpy as np
from itertools import combinations
from math import factorial

## Simple (Fake) Illustrative Example

In [17]:
# here is the setup for this example
TL = np.array([150, 160, 170, 180, 190, 200, 210, 220, 230, 240])
Ts = np.array([180, np.nan, 200, 210, 220])
minThreshold = 3  # NOTE: a real example should consider a much
# higher threshold, like 48, 96, or 288 (4, 8, or 24 hours of 5-minute readings)

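# slide Ts across TL one index at a time, covering every possible overlap
for i in range(-len(Ts) + 1, len(TL)):
    print("trying index", i)
    # part of TL that overlaps Ts at this shift
    tempTL = TL[max([0, i]):min([len(TL), (len(Ts) + i)])]
    print(tempTL)
    # matching part of Ts: its tail when i < 0, otherwise its head
    # (taking the tail for every shift misaligns Ts once it overhangs the end of TL)
    startTs = max([0, -i])
    tempTs = Ts[startTs:(startTs + len(tempTL))]
    print(tempTs)
    # exact matches give a difference of zero; nans never count as zero
    tempDiff = tempTL - tempTs
    print("difference = ", tempDiff)
    nZeros = sum(tempDiff == 0)
    print("number of zeros = ", nZeros)
    if nZeros >= minThreshold:
        print("FOUND DUPLICATES AT", i)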

trying index -4
[150]
[220.]
difference =  [-70.]
number of zeros =  0
trying index -3
[150 160]
[210. 220.]
difference =  [-60. -60.]
number of zeros =  0
trying index -2
[150 160 170]
[200. 210. 220.]
difference =  [-50. -50. -50.]
number of zeros =  0
trying index -1
[150 160 170 180]
[ nan 200. 210. 220.]
difference =  [ nan -40. -40. -40.]
number of zeros =  0
trying index 0
[150 160 170 180 190]
[180.  nan 200. 210. 220.]
difference =  [-30.  nan -30. -30. -30.]
number of zeros =  0
trying index 1
[160 170 180 190 200]
[180.  nan 200. 210. 220.]
difference =  [-20.  nan -20. -20. -20.]
number of zeros =  0
trying index 2
[170 180 190 200 210]
[180.  nan 200. 210. 220.]
difference =  [-10.  nan -10. -10. -10.]
number of zeros =  0
trying index 3
[180 190 200 210 220]
[180.  nan 200. 210. 220.]
difference =  [ 0. nan  0.  0.  0.]
number of zeros =  4
FOUND DUPLICATES AT 3
trying index 4
[190 200 210 220 230]
[180.  nan 200. 210. 220.]
difference =  [10. nan 10. 10. 10.]
number of zeros =  0

## Real World Example

COMING SOON