define .slice and dataframe methods #2

Merged
brandly merged 6 commits into master from slice-dataframe on Mar 29, 2018

brandly commented Mar 28, 2018

here we have an implementation of .slice, a method that takes a cursor_id and a range of time, and it returns all matching events.

we also have a dataframe method that maps each span over .slice.

TODO:

  • .slice makes multiple HTTP requests if the requested span has a large number of events in it. currently, it just concatenates the list of events from each request into one big list. however, the python lib does some restructuring of the data, giving you events on a per-stream basis.

each individual response has a payload like

{
  "streams": {
    "stream_id": { /* details */ },
    "another_id": {}
  },
  "events": [ /* lots of events */ ]
}

there's an entry in streams for every stream involved in the given query. each stream ID maps to an object containing some details about the stream. i have access to those details, but since R parses the JSON into a list, i can't seem to find the stream IDs in the data structure.

  • dataframe currently returns a list. i need to figure out how to construct a proper dataframe, but .slice should be improved first
  • dataframe makes an HTTP request per span via .slice. in python, we do this concurrently via a ThreadPool. i'm under the impression that R is single-threaded, so i'm not sure what to do here. @apclypsr do i have options? it's currently quite slow.
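
on the stream ID question above: a minimal sketch, assuming the JSON parser (for example `jsonlite::fromJSON`) turns JSON objects into named lists, in which case the stream IDs are the names of the `streams` list. the payload below is built by hand so the snippet runs without any packages. the `stream_id` field on each event and the `name` detail are hypothetical, since the real event shape isn't shown here, and parser settings may give `events` a different shape than a plain list.

```r
# Hand-built stand-in for one parsed response.
# Assumption: each event carries a stream_id field (not confirmed by the API).
payload <- list(
  streams = list(
    stream_a = list(name = "first"),
    stream_b = list(name = "second")
  ),
  events = list(
    list(stream_id = "stream_a", time = 1),
    list(stream_id = "stream_b", time = 2),
    list(stream_id = "stream_a", time = 3)
  )
)

# The stream IDs are the names of the parsed `streams` list.
stream_ids <- names(payload$streams)

# Pull the (assumed) stream_id out of every event.
event_stream <- vapply(payload$events, function(e) e$stream_id, character(1))

# Group events per stream; fixing the factor levels keeps streams
# that have no events in the result as empty lists.
events_by_stream <- split(
  payload$events,
  factor(event_stream, levels = stream_ids)
)
```

if that assumption holds, the per-request results could be merged by appending each `events_by_stream` entry to a running list keyed by stream ID, instead of concatenating everything into one big list.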

apclypsr commented Mar 28, 2018

You can use this package for multithreading, but be careful to set the number of cores to the number you have minus one, not the full count. Otherwise R will take up all your CPU and can crash your OS.

http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
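
A rough sketch of the "cores minus one" advice using the `parallel` package. `fetch_span` and the `spans` list are hypothetical stand-ins for the package's `.slice` call and its inputs, so no HTTP requests are made here. It uses `makeCluster` with `parLapply`, which start separate worker processes and so should also run on Windows; this is one possible approach, not necessarily what the PR ended up with.

```r
library(parallel)

# Hypothetical stand-in for .slice: a real version would call the API.
fetch_span <- function(span) {
  list(start = span$start, end = span$end, events = list())
}

# Hypothetical spans to fetch.
spans <- list(
  list(start = 0, end = 10),
  list(start = 10, end = 20),
  list(start = 20, end = 30)
)

# detectCores() can return NA on some platforms, so fall back to 2.
cores <- detectCores()
if (is.na(cores)) cores <- 2

# Leave one core free, never use more workers than spans,
# and never ask for fewer than one worker.
n_workers <- max(1, min(cores - 1, length(spans)))

cl <- makeCluster(n_workers)

# Always shut the workers down, even if a request fails.
results <- tryCatch(
  parLapply(cl, spans, fetch_span),
  finally = stopCluster(cl)
)
```

`parLapply` returns results in the same order as `spans`, so each result can still be matched to the span it came from.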

brandly commented Mar 29, 2018

i'm gonna merge this, since there's a solid number of improvements here. i'll open an issue regarding the return values of .slice and dataframe

thanks @apclypsr for pointing me to parallel. it's working on my end, and i'm under the impression that this implementation will work on Windows as well, although that hasn't been tested yet.

at this point, the API endpoints are mostly wired up. i'm shifting my focus towards query construction and figuring out what a decent DSL looks like in R.

brandly merged commit 65fdd94 into master Mar 29, 2018
brandly deleted the slice-dataframe branch March 29, 2018 20:00