Deduplication, support select_streams on single pass, support headers-only #74
base: main
…retain stream-headers-only for resolve_streams.
* main load func uses shared _read_chunks
* support stream_headers_only
* support select_streams on first pass
@cboulay yes, to implement
I need
@cbrnr, ah, I was operating under the assumption that you only needed headers, so I renamed But, we should discuss a little about whether or not that function is required at all. After the changes in this PR, there remains a difference in stream header formatting between On that same train of thought, if you do build out Please take a look at what I have in place now and let me know how you would like to proceed. I think we should leave fixing up the 'verbose' usage until a later issue.
@cboulay I will need to dig a little bit into my code to see what I really need. This will take some time, and since I'm technically on vacation right now it would be great if you could give me two weeks or so to check on this. For now, I think that streamlining the code base should take precedence over a specific use case I might have. After all, I can always pull out the functions I need into my other projects. But if possible, let's discuss this next year 😄 (i.e. in ~2 weeks).
Of course! We need the headers-only feature sooner rather than later, but we're happy to use the branch for now.
Generally speaking, I am totally OK with rewriting my code to accommodate this PR, and I think it is worthwhile. I just have a few questions. One of the tests for my liesl toolbox failed, for a functionality where I am using
@agricolab - Based on what Clemens said, I think I should rename Let me explain the organization as I see it before we decide what to do. Path 1 - using
Path 2 - using
Note that we should not change the format returned by Path 1, as this would break third-party applications we don't control. Remaining things to change:
After the above changes, the main difference between Path 1 and Path 2 is that Path 2 loads all the chunks before parsing them, whereas Path 1 parses them as they come in and aggregates them after they've all been parsed. Path 1 also has optional timestamp fixes. With
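To make the Path 1 / Path 2 distinction concrete, here is a minimal sketch. It is not the pyxdf code: the text chunk format (`"stream_id:v1,v2"`) and the names `parse_chunk`, `parse_as_read` and `load_then_parse` are hypothetical and exist only for illustration.

```python
def parse_chunk(raw):
    """Parse one toy chunk of the form 'stream_id:v1,v2,...' (hypothetical format)."""
    stream_id, values = raw.split(":")
    return int(stream_id), [float(v) for v in values.split(",")]


def parse_as_read(raw_iter):
    """Path 1 style: parse each chunk as it arrives, aggregate at the end."""
    blocks = {}
    for raw in raw_iter:
        stream_id, values = parse_chunk(raw)
        blocks.setdefault(stream_id, []).append(values)
    # Aggregation step: concatenate the per-chunk blocks of every stream.
    return {sid: [v for block in bl for v in block] for sid, bl in blocks.items()}


def load_then_parse(raw_iter):
    """Path 2 style: load all raw chunks first, then parse them."""
    raw_chunks = list(raw_iter)  # everything is held in memory before parsing
    streams = {}
    for raw in raw_chunks:
        stream_id, values = parse_chunk(raw)
        streams.setdefault(stream_id, []).extend(values)
    return streams


raw = ["1:0.5,0.6", "2:1.0", "1:0.7"]
assert parse_as_read(iter(raw)) == load_then_parse(iter(raw))
```

In this toy model both routes return the same data; they differ only in when parsing happens and in how much raw data is held at once.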
I haven't tested the PR yet, but from what I've read I'm all for it. I have only two things to add:
@cbrnr Have you had a chance to look at this? Can you please take a look and make sure that your use-cases work or suggest how to make it work?
@cboulay sorry, I completely forgot. I will take a look today.
@cboulay can you maybe rename the function back to
I’m unavailable today. Please go ahead and make that change on your end and I’ll resolve later.
Alright, all of my files still work with this PR. In addition to comments already mentioned, I suggest that we make
I'm using @cboulay could you compile a new to-do list summarizing all the points mentioned in this thread? It doesn't mean that you have to implement all of the suggestions, but I'd like to discuss all points with everyone.
Looks good to me!
BTW,
This PR deduplicates code, allows use of select_streams without double traversal of file (#60) and adds the ability to extract headers only.
It was a bit tricky to do while holding onto a couple of top-level functions. I hope I did it to everyone's satisfaction, though it's maybe a little messier than it could have been otherwise.
I think the logger is probably a little too verbose by default, especially when skipping over non-selected streams or grabbing only headers. But I'd like some feedback before I clean that up.
I also think some of the top-level functions (even the _ prefixed ones) can be re-ordered more logically, but I wanted to limit the already-large diff.
There are quite a few changes here so I'm tagging lots of reviewers. Please test on your data files. If we're happy with my previous PR then I'll merge that in here so we can test on data files with problematic clock synchronization.
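For reviewers who want a mental model of the single-pass behaviour described above, here is a rough sketch. It is an assumption-laden toy, not the code in this PR: chunks are modelled as `(tag, stream_id, payload)` tuples, and the tags `HEADER` / `SAMPLES` and the function `read_chunks_sketch` are hypothetical names.

```python
# Hypothetical chunk tags for this toy model (not the real XDF constants).
HEADER, SAMPLES = "header", "samples"


def read_chunks_sketch(chunks, select_streams=None, headers_only=False):
    """Walk the chunk sequence once, keeping only what was asked for.

    chunks         -- iterable of (tag, stream_id, payload) tuples (toy model)
    select_streams -- optional set of stream ids to keep; None keeps all
    headers_only   -- if True, sample chunks are skipped entirely
    """
    streams = {}
    for tag, stream_id, payload in chunks:
        if select_streams is not None and stream_id not in select_streams:
            continue  # non-selected stream: skipped in the same pass
        entry = streams.setdefault(stream_id, {"info": None, "samples": []})
        if tag == HEADER:
            entry["info"] = payload
        elif tag == SAMPLES and not headers_only:
            entry["samples"].extend(payload)
    return streams


chunks = [
    (HEADER, 1, {"name": "EEG"}),
    (HEADER, 2, {"name": "Markers"}),
    (SAMPLES, 1, [0.1, 0.2]),
    (SAMPLES, 2, ["a"]),
    (SAMPLES, 1, [0.3]),
]
selected = read_chunks_sketch(chunks, select_streams={1})
headers = read_chunks_sketch(chunks, headers_only=True)
```

The point of the sketch is that stream selection and the headers-only shortcut are both decided per chunk inside one loop, so no second traversal of the file is needed.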