Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task id hashing #1444

Merged
merged 6 commits into from Dec 16, 2015
Merged

Task id hashing #1444

merged 6 commits into from Dec 16, 2015

Conversation

stephenpascoe
Copy link

Implementing semi-opaque, hashed task_ids as discussed at #1312.

This is the version I will be testing on our infrastructure in the next few weeks. This PR is intended to give my proposal better visibility whilst I test.

The PR replaces task_ids with a value which is reliably unique but you can't extract the full parameter values from it. The actual algorithm is family_pval1_pval2_pval3_hash where:

  • family: The task_family (a.k.a task_name)
  • pval*: A truncated serialisation of the first 3 parameter values, sorted by parameter name
  • hash: A md5hash of the canonical JSON serialisation of the family and parameters, truncated to 10 characters.

str(task) will return the traditional serialisation.

The db_task_history database schema is changed to include tash_id in the tasks table. No further use is made of this column at this stage. A migration script is supplied to add the column to an existing database. This has been tested on MySQL only.

All tests are updated to pass by replacing hard-coded task_ids with reference to Task.task_id.

@erikbern
Copy link
Contributor

erikbern commented Dec 2, 2015

this looks great!

param_str = json.dumps(params, separators=(',', ':'), sort_keys=True)
param_hash = hashlib.md5(param_str.encode('utf-8')).hexdigest()

param_summary = '_'.join(p[:TASK_ID_TRUNCATE_PARAMS]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should probably remove non-alphanumeric characters from param_summary

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, all non-alphanumeric (+ "") converted to "".

@erikbern
Copy link
Contributor

erikbern commented Dec 3, 2015

this is great – would be great if some other people can weigh in. @Tarrasch @econchick @freider ?

@stephenpascoe
Copy link
Author

It's worth saying that we don't use Hadoop or any of the other integrations so they won't get tested by me. Further input would be good.

@erikbern
Copy link
Contributor

erikbern commented Dec 3, 2015

hadoop has tests in travis so should be fine

task_id_parts = []
param_objs = dict(params)
for param_name, param_value in param_values:
if param_objs[param_name].significant:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, since we're not sending this to the scheduler any more. Maybe we should include the insignificant parameters too?

(What I'm suggesting is to remove this one line)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm undecided. Maybe people will want to keep repr() short by marking configuration parameters as insignificant? I don't use this feature so I don't really have an opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__repr__ is only used to assist debugging right? yeah in that case let's include all params

@Tarrasch
Copy link
Contributor

Tarrasch commented Dec 7, 2015

Looks good. :)

I know that @ulzha is usually saying wise things when it comes to design. Do you have any thoughts on this Uldis?

@Tarrasch
Copy link
Contributor

Tarrasch commented Dec 7, 2015

@stephenpascoe You should not need to worry about other modules breaking as everything in luigi is (usually) tested by Travis. :)

@sisidra
Copy link
Contributor

sisidra commented Dec 7, 2015

Sorry for being late for discussion, but I would have loved to see URIfication for task id serialization. Like
taskId?param=value&... and for hash you could taskId#hash as mentioned before.
Everybody cries when sees self-invented formats like taskId(parm=value)... :(

@erikbern
Copy link
Contributor

erikbern commented Dec 7, 2015

@sisidra my concern using # in the task id is now you need to uri encode it whereas if it's only [A-Za-z0-9_] then you don't need to worry

@stephenpascoe
Copy link
Author

@sisidra, taskId(parm=value) will be gone except for str(task). task_id will no longer be a serialization in the sense that it doesn't contain all the information, only the family and some param-value prefixes.

For serialization we have the JSON {'family': ..., 'params': ...} returned by the API. You could URIfy that if you wanted, with correct escaping.

@stephenpascoe stephenpascoe force-pushed the task_id_hashing branch 2 times, most recently from 7c2873c to 1a2ef69 Compare December 15, 2015 14:07
@stephenpascoe stephenpascoe changed the title [WIP] Task id hashing Task id hashing Dec 15, 2015
@stephenpascoe
Copy link
Author

This branch has been running in production for a week without issue. A couple of tests are failing because of a syntax error in the HTTPretty mocking library which has not been fixed in pypi.

AFAIK the only outstanding issue is the version number. I suggest 2.1.1 would be suitable, considering how recently we moved to 2.0.

@erikbern
Copy link
Contributor

sgtm. but users have to run a migration script right?

@stephenpascoe
Copy link
Author

Migration script has been removed. The schema is now automatically upgraded.

@erikbern
Copy link
Contributor

that's great. cool. let's merge this!

@erikbern
Copy link
Contributor

Will merge on a successful build.

Then let's have it sit in master for say two weeks before we publish to PyPI. I'm still a bit scared :)

@stephenpascoe
Copy link
Author

HTTPretty fix requested at gabrielfalcao/HTTPretty#278

@erikbern
Copy link
Contributor

let's just set the version of httpretty in tox.ini to the latest working version

@Tarrasch
Copy link
Contributor

I think the commit history is a bit too dirty for merge. In particular no reverting commits or "fix pep8" commits should be in the commit log. I believe. Having those makes reverting commits much harder for us maintainers.

Other than that. Looks good.

@stephenpascoe
Copy link
Author

Squashed some commits.

@stephenpascoe
Copy link
Author

All tests passing. Ready to merge?

@erikbern
Copy link
Contributor

LET'S MERGE

erikbern added a commit that referenced this pull request Dec 16, 2015
@erikbern erikbern merged commit f3dcb54 into spotify:master Dec 16, 2015
@erikbern
Copy link
Contributor

let's see if this causes any trouble

@stephenpascoe
Copy link
Author

Great. 👍

@Tarrasch
Copy link
Contributor

Wow, this finally happened. Amazing execution of this @stephenpascoe!

@Tarrasch
Copy link
Contributor

@stephenpascoe, do you think this will cause breakage for this use case?

def update_id(self):
"""
This update id will be a unique identifier for this insert on this table.
"""
return self.task_id

I mean people who upgrade notice that all their SomeCopyToTableTask().complete() methods will suddenly return False.

@stephenpascoe
Copy link
Author

@Tarrasch, yes, it looks like it.

You could migrate the database by locating each row via update_id=str(Task) and updating it to self.task_id. However, because we removed the only_significant parameter from to_str_params() that wouldn't work if there are insignificant parameters.

@Tarrasch
Copy link
Contributor

Hmm...

I suppose the renaming of tables also isn't easy either since each db renames tables in different ways.

@stephenpascoe
Copy link
Author

Migration tools like alembic can do it so I guess we could work it out for anything using sqlalchemy. Do you think it is enough to move the table to a backup and start again?

@Tarrasch
Copy link
Contributor

Not sure of exactly what you mean with "move the table to a backup and start again".

But happy with any solution that strikes a reasonable work-for-maintainers/work-for-users balance.

@erikbern
Copy link
Contributor

in retrospect that mechanism wasn't great to start with, we shouldn't have had it in the code.

where is update_id actually used?

@Tarrasch
Copy link
Contributor

Unfortunately I believe it's quite widely used. All CopyToTable jobs would be affected.

@Tarrasch
Copy link
Contributor

in retrospect that mechanism wasn't great to start with, we shouldn't have had it in the code.

I can totally see what tempted the authors to write that, it's so convenient and short! :)

@stephenpascoe
Copy link
Author

luigi.contrib.rdbms.CopyToTable is inherited by 2 task classes:

  • luigi.postgres.CopyToTable
  • luigi.contrib.redshift.S3CopyToTable

There is also luigi.contrib.sqla.CopyToTable which also puts self.task_id in a database.

@erikbern
Copy link
Contributor

a dumb temporary fix would be to put back the old task_id code into luigi.deprecated.old_task_id or something like that. just buys us some time though

@erikbern
Copy link
Contributor

is there some automatic way of converting an old task_id into a new style task_id? in that case we can try to migrate the marker tables

@stephenpascoe
Copy link
Author

We can convert old_task_id to new_task_id in cases where we don't run into the original deserialisation problems (param values containing [ "'=] etc. ) We can convert using an instantiated Task instance if we roll-back the change to to_str_params().

@Tarrasch
Copy link
Contributor

Hmm. I wonder if we can just not care about fixing backward compatibility for the database-target issue. Since luigi is a bit makeish, I think all data will be generated where-ever needed/requested in the dependency chain.

I mean, surely people have added parameters and then had all their database uploads being done again without them really caring. The database ingestions are in many cases small compared to the data crunched to produce the ingestion data.

@Tarrasch
Copy link
Contributor

Again thanks for this contribution @stephenpascoe. But I noticed a minor niceness glitch in the visualiser that I think is because of this PR:

Before:

Before

After:

selection_001

I mean that the parameters are formatted like a python dict now and not in the slightly nicer (k1=v1, k2=v2) format. Can this be fixed? :)

@stephenpascoe
Copy link
Author

Yes, I was aware of this. I can fix it.

Stephen Pascoe from iPhone

On 21 Dec 2015, at 09:14, Arash Rouhani notifications@github.com wrote:

Again thanks for this contribution @stephenpascoe. But I noticed a minor niceness glitch in the visualiser that I think is because of this PR:

Before:

After:

I mean that the parameters are formatted like a python dict now and not in the slightly nicer (k1=v1, k2=v2) format. Can this be fixed? :)


Reply to this email directly or view it on GitHub.

@Tarrasch
Copy link
Contributor

Thanks for verifying! Kudos for fixing :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants