More efficient Data Synchronization #6

Open
thesofakillers opened this issue Aug 24, 2021 · 0 comments
Labels
enhancement New feature or request

Comments

@thesofakillers
Owner

The current data synchronisation implementation, in particular with regard to finding overlapping contiguous chunks across data sources, may ultimately require a lot of memory if the time series is long enough or the sampling rate is high enough.

P. Fluxa mentions:

A colleague of mine and I figured out a "compressed" way of synchronising chunks, which requires knowing only the start and end times of every interval. That information is very cheap to obtain and scales as O(n). Then, the operation of finding all relevant intervals (the ones where there is data in all "channels") scales even better, as it depends only on the number of intervals found.
This is a quick-and-dirty implementation showing how it works:

"""
Sample script showing the solution of the following problem:

"given N channels of data with R contiguous ranges each, find all the
ranges where there is data for all N channels"
"""

import random
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# create a set of random ranges per channel; this is just setup for the demo
numChan = 5
nRanges = 10
data = list()
for nch in range(numChan):
    ms = random.randint(0, 5)
    for nr in range(nRanges):
        jitter1 = 0
        jitter2 = 1 #random.randint(2, 6)
        width = 7
        start = ms + jitter1
        end = start + width
        entry = dict()
        entry['start'] = start
        entry['sflag'] = 1
        entry['end'] = end
        entry['eflag'] = -1
        entry['channel'] = nch
        entry['rangeidx'] = nr
        data.append(entry)        
        ms = end + jitter2
rangesdf = pandas.DataFrame(data)  
 
# extract all timestamps from the ranges, keeping track of whether each
# corresponds to the start (+1) or end (-1) of a range
timest = rangesdf['start'].values.tolist()
timest += rangesdf['end'].values.tolist()
flags = rangesdf['sflag'].values.tolist()
flags += rangesdf['eflag'].values.tolist()
# build intermediate dataframe and sort by timestamp
sdf = pandas.DataFrame(dict(st = timest, flag = flags))
sdf.sort_values(by='st', inplace=True)
# running sum of the flags: at each timestamp, its value equals the
# number of channels that currently have data; it reaches numChan
# exactly where all channels overlap
cumsum = sdf.flag.cumsum()
# positions where the sum hits numChan mark the start of a common
# range; the next timestamp marks its end
cr = numpy.where(cumsum.values == numChan)
crlist = cr[0].tolist()
crarr = list()
for e in crlist:
    crarr.append(e)      # start of common range
    crarr.append(e + 1)  # end of common range
crarr = numpy.asarray(crarr)
cmnRanges = sdf.iloc[crarr].st.values.reshape((-1, 2))

# make a figure showing the result
fig, ax = plt.subplots()
# plot all ranges
for idx, entry in rangesdf.iterrows():
    xs = entry['start']
    xe = entry['end']
    ys = entry['channel']
    ax.hlines(ys, xs, xe)
# plot common ranges
for cr in cmnRanges:
    # avoid drawing ranges with no width
    if cr[1] == cr[0]:
        continue
    ax.vlines(cr[0], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
    ax.vlines(cr[1], 0, numChan, 
        color='red', alpha=0.5, linestyle='--', linewidth=0.5)
plt.savefig('ranges.pdf')
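For reference, the same sweep-line idea can be sketched as a small self-contained function without pandas. The function name `common_ranges` and its interface are illustrative (not part of the script above), and it assumes the ranges within each channel are disjoint:

```python
def common_ranges(channels):
    """Return (start, end) pairs where all channels have data.

    channels: one list of (start, end) tuples per channel; the ranges
    within a channel are assumed disjoint.
    """
    # Collect every boundary as an event: +1 when a range opens,
    # -1 when it closes.
    events = []
    for ranges in channels:
        for start, end in ranges:
            events.append((start, 1))
            events.append((end, -1))
    # Sort by time; at ties, -1 sorts before +1, so ranges that merely
    # touch do not merge into a common interval.
    events.sort()
    common = []
    depth = 0
    for t, flag in events:
        prev = depth
        depth += flag
        if depth == len(channels):     # all channels just became active
            open_t = t
        elif prev == len(channels):    # one channel just dropped out
            if t > open_t:             # skip zero-width intervals
                common.append((open_t, t))
    return common
```

Running `common_ranges([[(0, 10), (12, 20)], [(5, 15)]])` yields `[(5, 10), (12, 15)]`: the running sum of flags plays the same role as the `cumsum` column in the pandas version.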

And this is the kind of result you get:

(figure: ranges.pdf — each channel's data ranges drawn as horizontal lines, with the common ranges bounded by red dashed vertical lines)

@thesofakillers thesofakillers added the enhancement New feature or request label Aug 24, 2021