Option to choose or filter data and create multiple files on s3 #119

Open
ranvijayj opened this issue Oct 12, 2017 · 6 comments

Comments

@ranvijayj

So, we are using Snowplow to read data from our Kinesis stream and put it in S3.

There is no option to apply a filter to choose which data is pulled from Kinesis and written to S3.

Also, it would be useful to have an option to pull different data from the same Kinesis stream and push it to different files in S3, as Logstash does.

@BenFradet
Contributor

There is no option to apply a filter to choose which data is pulled from Kinesis and written to S3.

Can you give an example of a filter you'd like to apply?

Also, it would be useful to have an option to pull different data from the same Kinesis stream and push it to different files in S3, as Logstash does.

How would you "route" which record goes to which file?

@ranvijayj
Author

Our JSON records contain a key called "type". Based on the value of that key, we want to create separate files.
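
To make the routing idea concrete, here is a minimal Python sketch of grouping a batch of records by their "type" key before writing one S3 object per group. It is only an illustration under assumptions: `partition_by_type`, `s3_key_for`, the `type=<value>` key layout, and the "unknown" fallback are hypothetical names and choices, not part of the loader or of any existing configuration.

```python
import json
from collections import defaultdict


def partition_by_type(raw_records, key="type", default="unknown"):
    """Group raw JSON strings by the value stored under `key`.

    Returns (groups, malformed): `groups` maps each type to the list of
    raw records of that type, `malformed` holds records that are not
    valid JSON so they can be handled separately.
    """
    groups = defaultdict(list)
    malformed = []
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            malformed.append(raw)
            continue
        # Records that are not JSON objects, or lack the key, fall back to `default`.
        value = record.get(key, default) if isinstance(record, dict) else default
        groups[str(value)].append(raw)
    return dict(groups), malformed


def s3_key_for(record_type, batch_id, prefix="events"):
    # Hypothetical key layout: one S3 "directory" per record type.
    return f"{prefix}/type={record_type}/{batch_id}.json"
```

Each group could then be uploaded under its own key, for example `s3_key_for("click", "0001")` for the records in `groups["click"]`.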

@ranvijayj
Author

@BenFradet I don't think this feature exists yet, right?
Is there a workaround, or some code I could change or contribute?

@BenFradet
Contributor

No, there is no such feature at the moment.

However, this use case seems very specific to your setup, and it would be hard to turn it into a feature everyone can use, since you'd want to inspect every JSON record and act on its content.

How would you do it?

@alexanderdean
Member

alexanderdean commented Oct 13, 2017

This is quite meta, but I think there's a case for us taking the logic of Kinesis Tee over time and making that embeddable in loaders like this, so you can do partitioning and final transformation inside a loader.

@ranvijayj
Author

@BenFradet Maybe, but I think a lot of people would need this. Anyone using a single Kinesis stream to store all their data would need it, and the majority are using one Kinesis stream.

@alexanderdean Something like that.

What we need is a way to add a filter on the input Kinesis stream and choose the type of data we want to push to S3.

For now, we need just one type of data to be put on S3 for later analysis, but the loader also pushes the other three types, which causes ambiguity and produces large files on S3.
Using that data for analysis becomes difficult because the S3 files do not contain consistent data. EMR also fails somehow, and to read a small data set we have to go through files full of unneeded data pulled from Kinesis.
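
As a rough sketch of the kind of filter meant here, the Python snippet below keeps only records whose "type" is in an allowed set and drops everything else before it would be written to S3. The function name `keep_types` and its parameters are hypothetical and only illustrate the idea; nothing like this exists in the loader today.

```python
import json


def keep_types(raw_records, wanted, key="type"):
    """Yield only the raw JSON records whose `key` value is in `wanted`.

    Records that are not valid JSON objects, or whose type is missing
    or not a string, are skipped.
    """
    wanted = set(wanted)
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        value = record.get(key)
        if isinstance(value, str) and value in wanted:
            yield raw
```

For our case, `list(keep_types(batch, {"click"}))` would keep just the one type we want to analyze, where `"click"` stands in for whatever that type is called.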

Elastic's Logstash offers all of these options: it pulls from Kinesis, applies filters, and then pushes to Elasticsearch and S3. The problem with its S3 output is that when it creates an S3 file, the JSON entries are not separated by a \n, which makes analysis difficult.
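
For reference, the output format we are after is newline-delimited JSON, one record per line. A minimal Python sketch of writing and reading such a file body follows; `to_ndjson` and `from_ndjson` are hypothetical helper names used for illustration only.

```python
import json


def to_ndjson(records):
    """Serialize records as one compact JSON document per line.

    json.dumps escapes newlines inside string values, so every record
    occupies exactly one line and the body ends with a trailing "\n".
    """
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


def from_ndjson(body):
    """Parse a newline-delimited JSON body back into a list of records."""
    return [json.loads(line) for line in body.split("\n") if line.strip()]
```

With one record per line, downstream tools can split the file on "\n" and parse each entry independently.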

3 participants