Add file name strategies #20

Open
alexanderdean opened this issue May 22, 2015 · 4 comments

We should support:

  1. The current file name strategy (we need to come up with a sexy name for this)
  2. A new strategy, see below (we need to come up with a sexy name for this)

New strategy

  • On initialization of each sink instance, generate a UUID and create a bigint counter set to 0
  • First file is <UUID>-000000000000000...0 (where 0s are padded out to max length of a bigint)
  • Second file is <UUID>-0000000000...1, i.e. counter is incremented
  • Third file is <UUID>-0000000000...2, i.e. counter is incremented
  • On server restart, generate a new UUID and create a bigint counter set to 0

This is a strategy (sketched in code after this list) designed to:

  • Have minimal moving parts
  • Be extremely idiotproof
  • Minimize the size of any manifest which needs to keep track of which files have been processed
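
To make that concrete, here is a minimal sketch of the scheme (assuming a Scala sink; `FileNamer` and its method are illustrative names, not the sink's actual API):

```scala
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

// Illustrative only: one UUID per sink instance, plus a counter
// zero-padded to 19 digits (the width of a signed 64-bit bigint).
class FileNamer {
  private val instanceId: UUID = UUID.randomUUID() // regenerated on every restart
  private val counter = new AtomicLong(0L)

  // First call: "<UUID>-0000000000000000000", second: "<UUID>-0000000000000000001", ...
  def nextFileName(): String = {
    val n = counter.getAndIncrement()
    f"$instanceId-$n%019d"
  }
}
```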
@alexanderdean alexanderdean added this to the Version 0.3.0 milestone May 22, 2015
@alexanderdean alexanderdean changed the title Add file partitioning Add file name strategies Jul 8, 2015

0xABAB commented May 18, 2016

What do you think about creating a hierarchy too? Currently, when you view the files on S3, there are a lot of them in one place. It would be nicer if there were subdirectories like `<network_id>/<site_id>/<day>/<month>/<year>`.

Having said that, this might go against your:

> Minimize the size of any manifest which needs to keep track of which files have been processed

Quoting a guy from StackOverflow:

> Keep in mind that if you want to enumerate through all your millions of keys in a single folder, it may take a while! So if there are any restrictions, they are on your end - users don't generally like a list of 100,000 files in a single folder.

@alexanderdean

Hey @0xABAB - yes sure, we could try something like:

`YYYY/MM/DD/HH/{{Worker UUID}}/{{Counter}}`

If we structure it so that {{Counter}} is orthogonal to YYYY/MM/DD/HH (i.e. it increments independently of the timestamp), then we can keep the manifest small (because the manifest "reads from the right", i.e. doesn't need to care about the timestamp).
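
For concreteness, a key shaped like that could be built as in the sketch below (purely illustrative; `KeySketch` and its names are not part of the sink):

```scala
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}
import java.util.UUID

// Sketch of a key shaped as YYYY/MM/DD/HH/{{Worker UUID}}/{{Counter}}.
// Because the counter increments independently of the timestamp, a manifest
// only needs to track the highest counter seen per worker UUID, no matter
// how many date/hour prefixes the files end up spread across.
object KeySketch {
  private val hourPrefix =
    DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC)

  def key(now: Instant, workerId: UUID, counter: Long): String =
    s"${hourPrefix.format(now)}/$workerId/$counter"
}

// e.g. KeySketch.key(Instant.now(), UUID.randomUUID(), 42)
//      => "2016/05/18/09/1b4e28ba-2fa1-11d2-883f-0016d3cca427/42"
```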

@sunshineo

Before this gets implemented, does anyone have a tool that can go through the current files and move them into a folder structure like the one proposed here?

@matogertel

> Hey @0xABAB - yes sure, we could try something like:
>
> `YYYY/MM/DD/HH/{{Worker UUID}}/{{Counter}}`
>
> If we structure it so that {{Counter}} is orthogonal to YYYY/MM/DD/HH (i.e. it increments independently of the timestamp), then we can keep the manifest small (because the manifest "reads from the right", i.e. doesn't need to care about the timestamp).

I got to this thread looking for exactly this answer. If we had this file structure, we could point Athena straight at the S3 bucket, since the data would be nicely partitioned. Any update on this?
