Add file name strategies #20

Open
alexanderdean opened this issue May 22, 2015 · 4 comments

We should support:

  1. The current file name strategy (we need to come up with a sexy name for this)
  2. A new strategy, see below (we need to come up with a sexy name for this)

New strategy

  • On initialization of each sink instance, generate a UUID and create a bigint counter set to 0
  • First file is <UUID>-000000000000000...0 (where 0s are padded out to max length of a bigint)
  • Second file is <UUID>-0000000000...1, i.e. counter is incremented
  • Third file is <UUID>-0000000000...2, i.e. counter is incremented
  • On server restart, generate a new UUID and create a bigint counter set to 0

This is a strategy (sketched in code after this list) designed to:

  • Have minimal moving parts
  • Be extremely idiotproof
  • Minimize the size of any manifest which needs to keep track of which files have been processed
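
To make that concrete, here is a minimal sketch of the scheme (assuming a Scala sink; `FileNamer` and its method are illustrative names, not the sink's actual API):

```scala
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

// Illustrative only: one UUID per sink instance, plus a counter
// zero-padded to 19 digits (the width of a signed 64-bit bigint).
class FileNamer {
  private val instanceId: UUID = UUID.randomUUID() // regenerated on every restart
  private val counter = new AtomicLong(0L)

  // First call: "<UUID>-0000000000000000000", second: "<UUID>-0000000000000000001", ...
  def nextFileName(): String = {
    val n = counter.getAndIncrement()
    f"$instanceId-$n%019d"
  }
}
```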
@alexanderdean alexanderdean added this to the Version 0.3.0 milestone May 22, 2015
@alexanderdean alexanderdean changed the title Add file partitioning Add file name strategies Jul 8, 2015

0xABAB commented May 18, 2016

What do you think about creating a hierarchy too? Currently, when you view the files on S3, there are a lot of them in one place. It would be nicer if there were subdirectories like `<network_id>/<site_id>/<day>/<month>/<year>`.

Having said that, this might go against your:

> Minimize the size of any manifest which needs to keep track of which files have been processed

Quoting a guy from StackOverflow:

> Keep in mind that if you want to enumerate through all your millions of keys in a single folder, it may take a while! So if there are any restrictions, they are on your end - users don't generally like a list of 100,000 files in a single folder.

@alexanderdean

Hey @0xABAB - yes sure, we could try something like:

`YYYY/MM/DD/HH/{{Worker UUID}}/{{Counter}}`

If we structure it so that {{Counter}} is orthogonal to YYYY/MM/DD/HH (i.e. it increments independently of the timestamp), then we can keep the manifest small (because the manifest "reads from the right", i.e. doesn't need to care about the timestamp).
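
For concreteness, a key shaped like that could be built as in the sketch below (purely illustrative; `KeySketch` and its names are not part of the sink):

```scala
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}
import java.util.UUID

// Sketch of a key shaped as YYYY/MM/DD/HH/{{Worker UUID}}/{{Counter}}.
// Because the counter increments independently of the timestamp, a manifest
// only needs to track the highest counter seen per worker UUID, no matter
// how many date/hour prefixes the files end up spread across.
object KeySketch {
  private val hourPrefix =
    DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC)

  def key(now: Instant, workerId: UUID, counter: Long): String =
    s"${hourPrefix.format(now)}/$workerId/$counter"
}

// e.g. KeySketch.key(Instant.now(), UUID.randomUUID(), 42)
//      => "2016/05/18/09/1b4e28ba-2fa1-11d2-883f-0016d3cca427/42"
```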

@sunshineo

Before this gets implemented, does anyone have a tool that can go through the current files and move them into a folder structure like the one proposed here?

@matogertel

> Hey @0xABAB - yes sure, we could try something like:
>
> `YYYY/MM/DD/HH/{{Worker UUID}}/{{Counter}}`
>
> If we structure it so that {{Counter}} is orthogonal to YYYY/MM/DD/HH (i.e. it increments independently of the timestamp), then we can keep the manifest small (because the manifest "reads from the right", i.e. doesn't need to care about the timestamp).

I got to this thread looking for exactly this answer. If we had this file structure, we could point Athena straight at the S3 bucket, since the data would be nicely partitioned. Any update on this?
