Task id hashing #1444

stephenpascoe · 2015-12-02T12:17:39Z

Implementing semi-opaque, hashed task_ids as discussed at #1312.

This is the version I will be testing on our infrastructure in the next few weeks. This PR is intended to give my proposal better visibility whilst I test.

The PR replaces task_ids with a value which is reliably unique but from which you can't extract the full parameter values. The format is family_pval1_pval2_pval3_hash where:

family: The task_family (a.k.a. task_name)
pval*: A truncated serialisation of the first 3 parameter values, sorted by parameter name
hash: An MD5 hash of the canonical JSON serialisation of the family and parameters, truncated to 10 characters.
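As a rough illustration of the format described above, here is a minimal standalone sketch. It is not the code from this PR: the function name `sketch_task_id` and the 16-character truncation width are assumptions, and it hashes only the parameters, following the `json.dumps(params, ...)` lines quoted in the review diff, whereas the description above mentions the family as well.

```python
import hashlib
import json

# Assumed values for illustration; only the "first 3 parameters" and
# "10 character hash" figures come from the PR description.
TRUNCATE_PARAM_CHARS = 16   # hypothetical truncation width per value
INCLUDE_PARAMS = 3          # first 3 parameter values, sorted by name
TRUNCATE_HASH_CHARS = 10    # hash truncated to 10 characters


def sketch_task_id(family, params):
    """Build an id of the form family_pval1_pval2_pval3_hash.

    `params` maps parameter names to already-serialised string values.
    """
    # Canonical JSON: sorted keys and compact separators give a stable string.
    param_str = json.dumps(params, separators=(',', ':'), sort_keys=True)
    param_hash = hashlib.md5(param_str.encode('utf-8')).hexdigest()

    # Summarise the first few values, ordered by parameter name.
    first_values = [params[name] for name in sorted(params)[:INCLUDE_PARAMS]]
    param_summary = '_'.join(v[:TRUNCATE_PARAM_CHARS] for v in first_values)

    return '{}_{}_{}'.format(family, param_summary,
                             param_hash[:TRUNCATE_HASH_CHARS])
```

Because the keys are sorted before hashing, two dicts holding the same parameters in a different order produce the same id.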

str(task) will return the traditional serialisation.

The db_task_history database schema is changed to include task_id in the tasks table. No further use is made of this column at this stage. A migration script is supplied to add the column to an existing database. This has been tested on MySQL only.

All tests are updated to pass by replacing hard-coded task_ids with reference to Task.task_id.

erikbern · 2015-12-02T15:24:15Z

this looks great!

erikbern · 2015-12-02T15:25:20Z

luigi/task.py

+        param_str = json.dumps(params, separators=(',', ':'), sort_keys=True)
+        param_hash = hashlib.md5(param_str.encode('utf-8')).hexdigest()
+
+        param_summary = '_'.join(p[:TASK_ID_TRUNCATE_PARAMS]


you should probably remove non-alphanumeric characters from param_summary

stephenpascoe · 2015-12-03

Done, all non-alphanumeric characters (other than "_") are converted to "_".
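A minimal sketch of that kind of clean-up, assuming the goal stated later in this thread of keeping the id within [A-Za-z0-9_]. The names `INVALID_CHARS` and `sanitise_summary` are hypothetical, and the underscore replacement is an assumption.

```python
import re

# Anything outside [A-Za-z0-9_] is treated as invalid (assumed character set).
INVALID_CHARS = re.compile(r'[^A-Za-z0-9_]')


def sanitise_summary(param_summary):
    """Replace every invalid character with an underscore."""
    return INVALID_CHARS.sub('_', param_summary)
```

For example, a summary such as `2015-12-02_a b` would come out as `2015_12_02_a_b`, so the id needs no escaping in URLs or file names.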

erikbern · 2015-12-03T15:20:03Z

this is great – would be great if some other people can weigh in. @Tarrasch @econchick @freider ?

stephenpascoe · 2015-12-03T15:22:48Z

It's worth saying that we don't use Hadoop or any of the other integrations so they won't get tested by me. Further input would be good.

erikbern · 2015-12-03T15:38:27Z

hadoop has tests in travis so should be fine

Tarrasch · 2015-12-07T05:59:56Z

luigi/task.py

+        task_id_parts = []
+        param_objs = dict(params)
+        for param_name, param_value in param_values:
+            if param_objs[param_name].significant:


Now that we're not sending this to the scheduler any more, maybe we should include the insignificant parameters too?

(What I'm suggesting is to remove this one line)

stephenpascoe · 2015-12-07

I'm undecided. Maybe people will want to keep repr() short by marking configuration parameters as insignificant? I don't use this feature so I don't really have an opinion.

erikbern · 2015-12-07

__repr__ is only used to assist debugging, right? yeah in that case let's include all params

Tarrasch · 2015-12-07T06:12:16Z

Looks good. :)

I know that @ulzha is usually saying wise things when it comes to design. Do you have any thoughts on this Uldis?

Tarrasch · 2015-12-07T06:13:22Z

@stephenpascoe You should not need to worry about other modules breaking as everything in luigi is (usually) tested by Travis. :)

sisidra · 2015-12-07T07:12:58Z

Sorry for being late to the discussion, but I would have loved to see URIfication for task id serialization, like
taskId?param=value&... and for the hash you could use taskId#hash as mentioned before.
Everybody cries when they see self-invented formats like taskId(parm=value)... :(

erikbern · 2015-12-07T14:40:47Z

@sisidra my concern using # in the task id is now you need to uri encode it whereas if it's only [A-Za-z0-9_] then you don't need to worry

stephenpascoe · 2015-12-07T15:03:54Z

@sisidra, taskId(parm=value) will be gone except for str(task). task_id will no longer be a serialization in the sense that it doesn't contain all the information, only the family and some param-value prefixes.

For serialization we have the JSON {'family': ..., 'params': ...} returned by the API. You could URIfy that if you wanted, with correct escaping.
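A minimal sketch of what such a URI form could look like, using only the standard library. The function names `task_to_uri` and `uri_to_task` are hypothetical and nothing like them is part of this PR. It assumes parameter values are already strings.

```python
from urllib.parse import parse_qsl, quote, unquote, urlencode


def task_to_uri(family, params):
    """Serialise a family and its params as family?name=value&..."""
    # urlencode percent-escapes names and values; sorting keeps it stable.
    return quote(family, safe='') + '?' + urlencode(sorted(params.items()))


def uri_to_task(uri):
    """Inverse of task_to_uri: return (family, params)."""
    family, _, query = uri.partition('?')
    return unquote(family), dict(parse_qsl(query))
```

Values containing characters such as `&` or `=` survive the round trip because they are escaped, which is the property the old taskId(parm=value) format lacked.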

stephenpascoe · 2015-12-15T14:29:13Z

This branch has been running in production for a week without issue. A couple of tests are failing because of a syntax error in the HTTPretty mocking library which has not been fixed in pypi.

AFAIK the only outstanding issue is the version number. I suggest 2.1.1 would be suitable, considering how recently we moved to 2.0.

erikbern · 2015-12-15T14:43:36Z

sgtm. but users have to run a migration script right?

stephenpascoe · 2015-12-15T15:42:25Z

Migration script has been removed. The schema is now automatically upgraded.
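For readers unfamiliar with the idea, an automatic upgrade of this kind can be sketched as "add the column if it is missing". The example below uses the standard-library sqlite3 module purely for illustration; it is not the code from this PR, and the helper name `ensure_column` is hypothetical.

```python
import sqlite3


def ensure_column(conn, table, column, col_type):
    """Add `column` to `table` if it is missing; return True if it was added.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the name.
    rows = conn.execute('PRAGMA table_info({})'.format(table)).fetchall()
    if column in [row[1] for row in rows]:
        return False
    conn.execute('ALTER TABLE {} ADD COLUMN {} {}'.format(
        table, column, col_type))
    conn.commit()
    return True
```

Calling the helper at start-up is safe to repeat, since a second call finds the column and does nothing.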

erikbern · 2015-12-15T16:12:05Z

that's great. cool. let's merge this!

erikbern · 2015-12-15T16:14:38Z

Will merge on a successful build.

Then let's have it sit in master for say two weeks before we publish to PyPI. I'm still a bit scared :)

stephenpascoe · 2015-12-15T16:43:56Z

HTTPretty fix requested at gabrielfalcao/HTTPretty#278

erikbern · 2015-12-15T18:12:20Z

let's just set the version of httpretty in tox.ini to the latest working version

Tarrasch · 2015-12-16T08:00:41Z

I think the commit history is a bit too dirty for merge. In particular, I believe no reverting commits or "fix pep8" commits should be in the commit log. Having those makes reverting commits much harder for us maintainers.

Other than that. Looks good.

stephenpascoe · 2015-12-16T10:03:54Z

Squashed some commits.

stephenpascoe · 2015-12-16T10:47:13Z

All tests passing. Ready to merge?

erikbern · 2015-12-16T14:47:26Z

LET'S MERGE


erikbern · 2015-12-16T14:47:43Z

let's see if this causes any trouble

stephenpascoe · 2015-12-16T14:54:07Z

Great. 👍

Tarrasch · 2015-12-17T02:19:34Z

Wow, this finally happened. Amazing execution of this @stephenpascoe!

Tarrasch · 2015-12-17T07:33:39Z

@stephenpascoe, do you think this will cause breakage for this use case?

luigi/luigi/contrib/rdbms.py

Lines 98 to 102 in 154c283

    def update_id(self):
        """
        This update id will be a unique identifier for this insert on this table.
        """
        return self.task_id

I mean, people who upgrade will notice that all their SomeCopyToTableTask().complete() methods suddenly return False.

stephenpascoe · 2015-12-17T09:22:36Z

@Tarrasch, yes, it looks like it.

You could migrate the database by locating each row via update_id=str(Task) and updating it to self.task_id. However, because we removed the only_significant parameter from to_str_params() that wouldn't work if there are insignificant parameters.
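A minimal sketch of that row-by-row migration, assuming you can produce (old id, new id) pairs by instantiating each task and reading str(task) and task.task_id. It uses sqlite3 for illustration only; the helper name `migrate_update_ids` and the default marker table name `table_updates` are assumptions, so check your own configuration.

```python
import sqlite3


def migrate_update_ids(conn, id_pairs, marker_table='table_updates'):
    """Rewrite update_id values in the marker table.

    `id_pairs` is an iterable of (old_id, new_id) tuples. Returns the
    number of rows that were updated.
    """
    sql = 'UPDATE {} SET update_id = ? WHERE update_id = ?'.format(
        marker_table)
    cur = conn.cursor()
    updated = 0
    for old_id, new_id in id_pairs:
        cur.execute(sql, (new_id, old_id))
        updated += cur.rowcount  # 0 when the old id is not present
    conn.commit()
    return updated
```

As noted above, this only helps when the old id can be reconstructed exactly, which is not the case once insignificant parameters are involved.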

Tarrasch · 2015-12-17T10:02:18Z

Hmm...

I suppose renaming the tables isn't easy either, since each db renames tables in a different way.

stephenpascoe · 2015-12-17T10:10:34Z

Migration tools like alembic can do it so I guess we could work it out for anything using sqlalchemy. Do you think it is enough to move the table to a backup and start again?

Tarrasch · 2015-12-17T10:15:19Z

Not sure of exactly what you mean with "move the table to a backup and start again".

But happy with any solution that strikes a reasonable work-for-maintainers/work-for-users balance.

erikbern · 2015-12-17T14:12:48Z

in retrospect that mechanism wasn't great to start with, we shouldn't have had it in the code.

where is update_id actually used?

Tarrasch · 2015-12-17T14:48:28Z

Unfortunately I believe it's quite widely used. All CopyToTable jobs would be affected.

Tarrasch · 2015-12-17T14:49:14Z

> in retrospect that mechanism wasn't great to start with, we shouldn't have had it in the code.

I can totally see what tempted the authors to write that, it's so convenient and short! :)

stephenpascoe · 2015-12-17T14:53:42Z

luigi.contrib.rdbms.CopyToTable is inherited by 2 task classes:

luigi.postgres.CopyToTable
luigi.contrib.redshift.S3CopyToTable

There is also luigi.contrib.sqla.CopyToTable which also puts self.task_id in a database.

erikbern · 2015-12-17T15:18:57Z

a dumb temporary fix would be to put back the old task_id code into luigi.deprecated.old_task_id or something like that. just buys us some time though

erikbern · 2015-12-17T15:20:38Z

is there some automatic way of converting an old task_id into a new style task_id? in that case we can try to migrate the marker tables

stephenpascoe · 2015-12-17T15:29:32Z

We can convert old_task_id to new_task_id in cases where we don't run into the original deserialisation problems (param values containing [ "'=] etc.). We can convert using an instantiated Task instance if we roll back the change to to_str_params().

Tarrasch · 2015-12-21T08:59:35Z

Hmm. I wonder if we can just not care about fixing backward compatibility for the database-target issue. Since luigi is a bit make-ish, I think all data will be regenerated wherever it is needed/requested in the dependency chain.

I mean, surely people have added parameters and then had all their database uploads being done again without them really caring. The database ingestions are in many cases small compared to the data crunched to produce the ingestion data.

Tarrasch · 2015-12-21T09:13:40Z

Again thanks for this contribution @stephenpascoe. But I noticed a minor niceness glitch in the visualiser that I think is because of this PR:

Before: [screenshot]

After: [screenshot]

I mean that the parameters are formatted like a python dict now and not in the slightly nicer (k1=v1, k2=v2) format. Can this be fixed? :)
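To make the requested format concrete, here is a tiny sketch in Python terms. The visualiser itself is not written in Python, and the function name `format_params` and the sorting by parameter name are assumptions for illustration.

```python
def format_params(params):
    """Render a params dict as "(k1=v1, k2=v2)" instead of a raw dict."""
    pairs = ('{}={}'.format(name, params[name]) for name in sorted(params))
    return '({})'.format(', '.join(pairs))
```

With this, `{'k2': 'v2', 'k1': 'v1'}` is displayed as `(k1=v1, k2=v2)`.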

stephenpascoe · 2015-12-21T10:15:17Z

Yes, I was aware of this. I can fix it.


Tarrasch · 2015-12-21T10:33:41Z

Thanks for verifying! Kudos for fixing :)


erikbern reviewed Dec 2, 2015

Tarrasch reviewed Dec 7, 2015

stephenpascoe force-pushed the task_id_hashing branch 2 times, most recently from 7c2873c to 1a2ef69, on December 15, 2015 14:07

stephenpascoe changed the title from "[WIP] Task id hashing" to "Task id hashing" on Dec 15, 2015

stephenpascoe force-pushed the task_id_hashing branch from 0ff8e98 to a1e4d61 on December 15, 2015 22:17

erikbern added a commit that referenced this pull request Dec 16, 2015

Merge pull request #1444 from stephenpascoe/task_id_hashing

f3dcb54

Task id hashing

erikbern merged commit f3dcb54 into spotify:master Dec 16, 2015

Tarrasch mentioned this pull request Dec 17, 2015

Fixed: Rdbms and Redshift Properties #1393

Closed

dlstadther mentioned this pull request Dec 18, 2015

Fixed: Rdbms and Redshift Properties #1463

Merged

stephenpascoe pushed a commit to stephenpascoe/luigi that referenced this pull request Dec 21, 2015

Merge branch 'task_id_hashing'

889aacd

New hashed task_ids implemented. See spotify#1444.

stephenpascoe mentioned this pull request Dec 22, 2015

Display parameters in visualiser as "k=v, ..." instead of raw JSON. #1466

Merged

oldpa mentioned this pull request Mar 4, 2016

CopyToPostgres inserts data twice because of updated update_id #1578

Closed

joeshaw mentioned this pull request Apr 20, 2016

RedshiftTarget update_id too long for marker table #1003

Closed
