Option to choose or filter data and create multiple files on s3 #119

ranvijayj · 2017-10-12T10:14:56Z

So, we are using Snowplow to read data from our Kinesis stream and put it into S3.

There is no option to apply a filter to choose what data to pull from Kinesis and put into S3.

Also, it would be useful to have an option to pull different data from the same Kinesis stream and push it to different files in S3, as Logstash does.

BenFradet · 2017-10-12T10:17:07Z

> There is no option to apply a filter to choose what data to pull from Kinesis and put into S3.

Can you give an example of a filter you'd like to apply?

> Also, it would be useful to have an option to pull different data from the same Kinesis stream and push it to different files in S3, as Logstash does.

How would you "route" which record goes to which file?

ranvijayj · 2017-10-12T10:29:58Z

Our JSON records contain a key called "type". Based on the type, we want to create separate files.
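To make the routing idea concrete, here is a minimal sketch in Python. It is only an illustration, not something the loader provides: `partition_by_type`, the `unknown` fallback group, and the assumption that each record arrives as a JSON string are all hypothetical.

```python
import json
from collections import defaultdict


def partition_by_type(raw_records, key="type", fallback="unknown"):
    """Group JSON-encoded records by the value of `key`.

    Hypothetical helper: records that are not valid JSON objects, or that
    lack `key`, are grouped under `fallback` instead of being dropped.
    """
    groups = defaultdict(list)
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            groups[fallback].append(raw)
            continue
        if isinstance(record, dict) and key in record:
            # Use the string form of the value as the group name.
            groups[str(record[key])].append(raw)
        else:
            groups[fallback].append(raw)
    return dict(groups)
```

Each group could then be written to its own S3 object, for example under a prefix named after the type.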

ranvijayj · 2017-10-13T06:40:49Z

@BenFradet I don't think this feature is there yet, right?
Can you tell me if there is a workaround? Is there some code I can change or contribute?

BenFradet · 2017-10-13T11:44:39Z

No, there is no such feature at the moment.

However, this use case seems to be very specific to your setup, and it would be hard to translate into a feature everyone can use, since you'd want to inspect every JSON record and act on its content.

How would you do it?

alexanderdean · 2017-10-13T11:48:42Z

This is quite meta, but I think there's a case for us taking the logic of Kinesis Tee over time and making that embeddable in loaders like this, so you can do partitioning and final transformation inside a loader.

ranvijayj · 2017-10-14T20:06:48Z

@BenFradet Maybe, but I think a lot of people would need this. Anyone using a single Kinesis stream to store all their data would need it, and the majority are using one Kinesis stream.

@alexanderdean Something like that.

What we need is a way to add a filter to the input Kinesis stream and choose the type of data we want to push to S3.

For now, we just need one type of data to be put on S3 for later analysis, but it's pushing the other three types of data too, which causes ambiguity and large files on S3.
Using that data for analysis becomes difficult because the S3 files do not contain consistent data. EMR also fails somehow, and to read through a small data set we have to go through files full of unwanted data pulled from Kinesis.
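The kind of filter I have in mind could look roughly like the Python sketch below. This is an assumption about how it might work, not existing loader behaviour: `keep_types` and its parameters are hypothetical names, and records are assumed to be JSON strings.

```python
import json


def keep_types(raw_records, wanted, key="type"):
    """Yield only the records whose `key` value is one of `wanted`.

    Hypothetical helper: records that are not valid JSON, or whose
    `key` value is missing or not a string, are skipped.
    """
    wanted = set(wanted)
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, so it cannot match any type
        if not isinstance(record, dict):
            continue
        value = record.get(key)
        if isinstance(value, str) and value in wanted:
            yield raw
```

With something like `keep_types(records, ["order"])` applied before the write, only the one type we analyse would end up in the S3 files.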

Elastic's Logstash gives all these options: it pulls from Kinesis, applies filters and then pushes to Elasticsearch and S3. The problem with its S3 output is that when it creates an S3 file, the JSON entries are not separated by a \n, which makes analysis difficult.
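For reference, the output layout I mean is newline-delimited JSON, one record per line. A minimal Python sketch of writing and reading that layout follows; `to_ndjson` and `from_ndjson` are hypothetical names used only for illustration.

```python
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON text."""
    # json.dumps escapes newlines inside string values, so every record
    # occupies exactly one line of the output.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


def from_ndjson(text):
    """Parse newline-delimited JSON text back into a list of records."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

A file in this layout can be split on \n and parsed record by record, which is what makes later analysis straightforward.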
