
Feature request: support parallel fork join for remote jobs #1710

Closed · updong opened this issue Aug 6, 2019 · 8 comments

updong commented Aug 6, 2019

What happened:
I found out that I cannot use AND conditions on remote jobs:

"sd@223215:qa-minos-functional-test-us" fails to match the required pattern: /^~(sd@\d+:[\w-]+|(pr|commit|release|tag)(:(.+))?)$/,

and found out that SD doesn't support parallel fork-join for remote jobs.

What you expected to happen:
Something like this:

A job that can wait for multiple remote jobs with an AND condition:

  requires:
      - sd@123:A
      - sd@123:B
      - sd@456:C
      - sd@456:D
      - E

This job will wait for A, B, C, D, and E, which run in parallel, and it gets triggered when they are all successful.

How to reproduce it:
Try adding a remote job in the requires field without the ~ prefix.

@jithine added the "workflow" and "feature" labels Aug 6, 2019
vnugopal commented Aug 8, 2019

@DekusDenial

@jithine added this to Backlog in Active Work Sep 6, 2019
@jithine moved this from Backlog to Doing in Active Work Sep 25, 2019
@d2lam self-assigned this Sep 25, 2019

d2lam commented Sep 27, 2019

This can be implemented only if sd@123:A, sd@123:B, sd@456:C, sd@456:D, E all start from a single point, which will look like this:
[Attached image: "Image from iOS (2)"]

Without this constraint, it will run into race conditions and inconsistencies. For example, say sd@123:A ran a week ago and B, C, D, E run now: should F start or not? It's also nearly impossible to implement, since there is no way for F to track whether A, B, C, D, E finished in the same run, because they can belong to different events, and those events' end times can be minutes or hours apart. How do we know which ones are supposed to be grouped together? This is probably why Jenkins' Join plugin also relies on this constraint.

I assume this was your intention as well, since you mentioned "A B C D and E which runs in parallel" @updong

I can start on a design doc for how we would implement this, and people can review it.

updong commented Sep 27, 2019

@d2lam Yes, I forgot to mention that constraint. I was assuming that constraint would be enforced: they need to be triggered by the same event.

d2lam commented Oct 1, 2019

DESIGN

  • Store any relationship that is external in the Trigger table.
  • For every finished job, check both places (the workflow graph and triggerFactory) to figure out the next jobs.
  • For every next job, figure out whether it can run by building a joinList, which contains the upstream jobs that need to finish before this job can run.
  • To build this list, look at its own workflow graph and triggerFactory.
  • If joinList is empty and the next job is external, set upstreamEventId to the current eventId.
  • If joinList is not empty:
    • If the current job is the FIRST job in the joinList to finish:
      • If the next job belongs to the same pipeline, create a build with the same eventId (similar to before).
      • If the next job is external, create a build with the same eventId as upstreamEventId.
    • If it's the LAST job in the joinList to finish, start the build.
    • Otherwise, do nothing. (A rough sketch of this decision logic follows below.)
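
A rough sketch of that decision in TypeScript. The helper names (graphNext, triggerNext, joinListFor, findBuild, createBuild, startBuild) are hypothetical stand-ins, not the actual Screwdriver API, and external-event handling is simplified to reuse the finished build's eventId:

type JobRef = string; // e.g. 'D' (internal) or 'sd@123:B' (external)

interface BuildInfo {
  job: JobRef;
  eventId: number;
  status: 'CREATED' | 'RUNNING' | 'SUCCESS' | 'FAILURE';
}

interface Sources {
  graphNext(job: JobRef): JobRef[];    // next jobs from the own workflow graph
  triggerNext(job: JobRef): JobRef[];  // next jobs from triggerFactory
  joinListFor(job: JobRef): JobRef[];  // upstream jobs the next job must wait for
  findBuild(job: JobRef, eventId: number): BuildInfo | undefined;
  createBuild(job: JobRef, eventId: number): BuildInfo;
  startBuild(build: BuildInfo): void;
}

function onJobFinished(finished: BuildInfo, src: Sources): void {
  // Check both places to figure out the next jobs.
  const nextJobs = [...src.graphNext(finished.job), ...src.triggerNext(finished.job)];

  for (const next of nextJobs) {
    const joinList = src.joinListFor(next);

    if (joinList.length === 0) {
      // No join: create and start right away (for an external next job, the
      // new event would record upstreamEventId = finished.eventId).
      src.startBuild(src.createBuild(next, finished.eventId));
      continue;
    }

    // Join: the FIRST finisher creates the build, the LAST finisher starts it.
    const build =
      src.findBuild(next, finished.eventId) ?? src.createBuild(next, finished.eventId);
    const allDone = joinList.every(
      (j) => src.findBuild(j, finished.eventId)?.status === 'SUCCESS'
    );

    if (allDone) {
      src.startBuild(build);
    }
    // Otherwise: do nothing and wait for the remaining joinList builds.
  }
}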

Example

Let's simplify the above example and use: A -> [sd@123:B, sd@456:C, D] -> E
A and E belong to sd@789

  • From sd@789's workflow graph it will capture A -> D -> E
  • Trigger table will need to store these relationships, since they are not captured by sd@789's own workflow graph:

| src | dest |
| --- | --- |
| sd@789:A | sd@123:B |
| sd@789:A | sd@456:C |
| sd@123:B | sd@789:E |
| sd@456:C | sd@789:E |

  • When A is done, it figures out the next jobs to run:
    • From its workflow graph: D
    • From triggerFactory: [sd@123:B, sd@456:C]
  • Assume [sd@123:B, sd@456:C, D] don't need any other jobs besides A, so their joinList is empty. They can go ahead and start.
  • Let's say A has eventId = 1; sd@123:B and sd@456:C will have upstreamEventId = 1 since they are external to A.
  • Let's say sd@123:B finishes first. It checks both places to figure out the next job:
    • From its workflow graph: nothing
    • From triggerFactory: sd@789:E
  • Check if sd@789:E requires a join. Build the joinList for sd@789:E:
    • From its workflow graph: D
    • From triggerFactory: sd@123:B, sd@456:C
  • It will then look at the builds in the joinList with upstreamEventId = 1 and see if they have all finished.
  • Since they haven't, it creates sd@789:E with the same eventId as upstreamEventId. So now E and A belong to the same event.
  • When D finishes, it does not do anything, since sd@789:E is already created and D is not the last build in the joinList.
  • When sd@456:C finishes, it is the last build in the joinList to finish, so it starts sd@789:E.

TASKS

  • Data-schema changes the external trigger regex to accept sd@pipelineId:jobName (see the regex sketch below)
  • Config-parser is updated with the new data-schema
  • Workflow-parser needs to handle getNextJobs and getSrcFromJoin accordingly
  • The API's build plugin handles the new join logic
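
For the data-schema item, the error message in the original report shows the current pattern; below is an illustrative relaxation (an assumption, the real data-schema change may differ) that also accepts a bare sd@pipelineId:jobName without the ~ prefix:

// Current trigger pattern, taken from the error message above.
const CURRENT_TRIGGER = /^~(sd@\d+:[\w-]+|(pr|commit|release|tag)(:(.+))?)$/;

// One possible relaxed pattern: remote jobs may appear with or without ~,
// everything else keeps requiring ~.
const RELAXED_TRIGGER = /^(~?sd@\d+:[\w-]+|~(pr|commit|release|tag)(:(.+))?)$/;

console.log(CURRENT_TRIGGER.test('sd@223215:qa-minos-functional-test-us')); // false
console.log(RELAXED_TRIGGER.test('sd@223215:qa-minos-functional-test-us')); // true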

d2lam commented Oct 4, 2019

DESIGN (REVISED 10/4/2019)

The above implementation will work for the backend. However, the UI will need to query the trigger table to show each external node. Even though that's similar to what we are doing now, the current UI turns OFF displaying external triggers by default; with this feature implemented, displaying external triggers would need to be always turned ON, otherwise it will be very confusing to users. Therefore, I have revised the implementation design so that the load on the UI is not too high.

  • workflow-parser now generates the relationships to downstream jobs, in addition to the relationships in the yaml.
  • When each job finishes, it will figure out the next jobs. Since the workflow graph now includes both internal & external edges, we can figure out the next jobs by looking at the workflow graph.
  • Before each job starts, it needs to check the joinList, which is the list of builds that the next job depends on.

Added 10/7:

  • To compute the status of the builds in the joinList, look at the upstream field.
  • When a build starts, it should store an additional field, upstream (roughly { pipelineId: upstreamEventId }), in the following form:

{
    ${pipelineId}: {
        "eventId": ${eventId},
        ${jobName}: ${buildId}
    }
}

  • The field is stored in this format to allow faster lookups (a small sketch of these lookups follows this list):
    • To check if a job in the joinList has started: check if upstream.pipelineId.jobName exists (e.g. if sd@123:deploy is in the joinList, check if upstream.123.deploy exists).
    • To check status: make an API call to check the build status of upstream.pipelineId.jobName (e.g. upstream.123.deploy).
    • To merge back into the original event: check if upstream.currentPipelineId exists. If yes, create the build under the event upstream.currentPipelineId.eventId. If not, create a new event.
      - In other words, the build's eventId can be computed by checking whether the current pipelineId is in the upstream list; if it is, use that eventId.
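
A minimal sketch of those lookups, assuming the upstream shape above; the helper names and values are hypothetical:

type Upstream = Record<string, { eventId: number } & Record<string, number>>;

// Has a joinList member sd@<pipelineId>:<jobName> already started?
function hasStarted(upstream: Upstream, pipelineId: number, jobName: string): boolean {
  return upstream[pipelineId]?.[jobName] !== undefined;
}

// Merging back into the original event: reuse its eventId if the current
// pipeline appears in upstream, otherwise a new event is needed (null).
function eventIdForNextBuild(upstream: Upstream, currentPipelineId: number): number | null {
  return upstream[currentPipelineId]?.eventId ?? null;
}

// Illustrative values only: sd@123:deploy is in the joinList and has build 42.
const upstream: Upstream = { 123: { eventId: 7, deploy: 42 } };
console.log(hasStarted(upstream, 123, 'deploy'));  // true
console.log(eventIdForNextBuild(upstream, 123));   // 7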

EXAMPLE & IMPLEMENTATION:

For example (current pipeline ID = 999):
[Attached image: "Image from iOS (3)"]

Previously, the workflow graph would be:

{
    "nodes": [
        { "name": "~pr" },
        { "name": "~commit" },
        { "name": "A" },
        { "name": "D" },
        { "name": "G" }
    ],
    "edges": [
        { "src": "A", "dest": "D" },
        { "src": "D", "dest": "G" }
    ]
}

Now, it should be:

{
    "nodes": [
        { "name": "~pr" },
        { "name": "~commit" },
        { "name": "A" },
        { "name": "D" },
        { "name": "G" },
        { "name": "sd@111:B" },
        { "name": "sd@222:C" },
        { "name": "sd@333:E" },
        { "name": "sd@444:F" }
    ],
    "edges": [
        { "src": "A", "dest": "sd@111:B" },
        { "src": "A", "dest": "sd@222:C" },
        { "src": "A", "dest": "D" },
        { "src": "sd@111:B", "dest": "sd@333:E" },
        { "src": "sd@222:C", "dest": "sd@444:F" },
        { "src": "sd@333:E", "dest": "G", "join": true },
        { "src": "sd@444:F", "dest": "G", "join": true },
        { "src": "D", "dest": "G", "join": true }
    ]
}

To do this, we need to pass the Trigger Factory from the API -> config-parser -> workflow-parser. Inside workflow-parser, we compute the external nodes & edges by doing a DFS traversal through each job's downstream jobs (looking this information up in the trigger table).
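
A rough sketch of that traversal; the externalDownstream lookup stands in for the trigger-table query and is an assumption, not the actual workflow-parser interface:

interface Graph {
  nodes: { name: string }[];
  edges: { src: string; dest: string; join?: boolean }[];
}

function addExternalTriggers(
  graph: Graph,
  externalDownstream: (job: string) => string[] // e.g. 'A' -> ['sd@111:B', 'sd@222:C']
): Graph {
  const seen = new Set(graph.nodes.map((n) => n.name));
  const stack = [...seen]; // start the DFS from every job already in the graph

  while (stack.length > 0) {
    const job = stack.pop()!;
    for (const dest of externalDownstream(job)) {
      graph.edges.push({ src: job, dest });
      if (!seen.has(dest)) {
        seen.add(dest);
        graph.nodes.push({ name: dest });
        stack.push(dest); // keep following downstream jobs of downstream jobs
      }
    }
  }
  return graph;
}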

getNextJobs will now return both internal and external triggers (see the sketch after these examples):

  • When A finishes, it will trigger [sd@111:B, sd@222:C, D].
  • When sd@111:B finishes, it will trigger [sd@333:E]
    ....
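
Illustrative only (the real workflow-parser getNextJobs signature differs): once the graph carries external edges, the next jobs fall out of a simple edge scan:

const edges = [
  { src: 'A', dest: 'sd@111:B' },
  { src: 'A', dest: 'sd@222:C' },
  { src: 'A', dest: 'D' },
  { src: 'sd@111:B', dest: 'sd@333:E' },
];

const getNextJobs = (job: string): string[] =>
  edges.filter((e) => e.src === job).map((e) => e.dest);

console.log(getNextJobs('A'));        // [ 'sd@111:B', 'sd@222:C', 'D' ]
console.log(getNextJobs('sd@111:B')); // [ 'sd@333:E' ]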

Store an upstream field on each build. It will look like this:

| job | eventID | upstream |
|---| ---| --- |
| A | 1 | {} |
| D | 1 | {999:1} |
| sd@111:B | 2 | {999:1} |
| sd@222:C | 3 | {999:1} |
| sd@333:E | 4 | {999:1, 111:2} |
| sd@444:F | 5 | {999:1, 222:3} |
| G | 1 | {999:1, 111:2, 222:3, 333:4, 444:5} |

Modified (10/7):

| job | eventID | buildID | upstream |
| --- | --- | --- | --- |
| A | 1 | 11 | {} |
| D | 1 | 12 | {999: {eventId: 1, A: 11}} |
| sd@111:B | 2 | 13 | {999: {eventId: 1, A: 11}} |
| sd@222:C | 3 | 14 | {999: {eventId: 1, A: 11}} |
| sd@333:E | 4 | 15 | {999: {eventId: 1, A: 11}, 111: {eventId: 2, B: 13}} |
| sd@444:F | 5 | 16 | {999: {eventId: 1, A: 11}, 222: {eventId: 3, C: 14}} |
| G | 1 | 17 | {999: {eventId: 1, A: 11, D: 12}, 111: {eventId: 2, B: 13}, 222: {eventId: 3, C: 14}, 333: {eventId: 4, E: 15}, 444: {eventId: 5, F: 16}} |

Before starting the next jobs, it should build the joinList. The joinList can now be derived directly from the new workflow graph (instead of fetching the trigger table).

  • Check the status of each build in the joinList. For example, G needs to check the status of upstream.333.E and upstream.444.F.
    • G will have pipelineId = 999, so it will check if upstream.999 exists and set its eventID to upstream.999.eventId.

Notes:
We need to be careful since multiple things could be updating the upstream field. For example, D finishes and updates G's upstream; shortly after, E finishes and also updates G's upstream. I think we will need to implement some locking mechanism to avoid race conditions.
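
One possible way to guard that update (an assumption, not a chosen design): optimistic locking with a version check, retrying the merge when another finisher updated the row first. A sketch with hypothetical db helpers:

interface BuildRow {
  upstream: Record<string, unknown>;
  version: number;
}

interface Db {
  get(buildId: number): Promise<BuildRow>;
  // Persist only if the row's version is unchanged; returns false if someone else won the race.
  updateIfVersion(buildId: number, upstream: Record<string, unknown>, version: number): Promise<boolean>;
}

async function mergeUpstream(db: Db, buildId: number, patch: Record<string, unknown>): Promise<void> {
  for (;;) {
    const { upstream, version } = await db.get(buildId);
    const merged = { ...upstream, ...patch }; // e.g. add { 333: { eventId: 4, E: 15 } }
    if (await db.updateIfVersion(buildId, merged, version)) {
      return; // write accepted
    }
    // Lost the race (D and E updated concurrently); re-read and retry.
  }
}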

d2lam commented Oct 9, 2019

Changes required:

UPDATED 10/18

To support external join, we need to rewrite the whole trigger logic.

  1. The workflow graph now includes both internal & external relationships (logic in https://github.com/screwdriver-cd/workflow-parser/pull/27/files)
  2. We can get the joinList from the workflow graph (logic in https://github.com/screwdriver-cd/workflow-parser/pull/27/files)
  3. After a job is done, find all the next jobs it triggers. For each next job (a sketch of this step follows the list):
    • find its joined builds (for example, the next job is D and it requires [A, sd@111:B, sd@222:C])
    • construct a parentBuilds field = current build's parentBuilds + all jobs in the joinList + current build info. It looks like this: { 999: {eventId: 1, A: 1}, 111: {eventId: 2, B: 2}, 222: {eventId: null, C: null} }
    • if this list is empty (this job only requires OR), create/start build D (with the parentBuilds info)
    • if this list is not empty (this job requires AND):
      • check if build D already exists
        • if internal, check by looking at the builds inside the current event/parent event
        • if external, check the latest build of D and see if its status is CREATED
      • if build D doesn't exist, create build D (with the parentBuilds info), but don't start D yet
      • once build D exists, check if [A, sd@111:B, sd@222:C] are done by looking at its parentBuilds field
      • if there is a failure, delete build D
      • if some build in the joinList is not done, do nothing
      • if all finished successfully, start build D
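
A minimal sketch of step 3, with hypothetical helper names standing in for the API build plugin logic:

type Status = 'SUCCESS' | 'FAILURE' | undefined; // undefined = not finished yet

interface Ops {
  findCreatedBuild(job: string): { id: number } | undefined; // current/parent event, or latest CREATED if external
  createBuild(job: string, parentBuilds: object): { id: number }; // created, but not started
  startBuild(buildId: number): void;
  deleteBuild(buildId: number): void;
  statusOf(joinJob: string): Status; // read via the parentBuilds / upstream build info
}

function handleNextJob(nextJob: string, joinList: string[], parentBuilds: object, ops: Ops): void {
  if (joinList.length === 0) {
    // Plain OR trigger: create and start immediately, carrying the parentBuilds info.
    ops.startBuild(ops.createBuild(nextJob, parentBuilds).id);
    return;
  }

  // AND join: make sure the build exists (CREATED) but do not start it yet.
  const build = ops.findCreatedBuild(nextJob) ?? ops.createBuild(nextJob, parentBuilds);

  const statuses = joinList.map((j) => ops.statusOf(j));
  if (statuses.some((s) => s === 'FAILURE')) {
    ops.deleteBuild(build.id);                     // one upstream failed
  } else if (statuses.every((s) => s === 'SUCCESS')) {
    ops.startBuild(build.id);                      // last finisher starts the build
  }
  // Otherwise some joinList build is still running: do nothing.
}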

jithine commented Feb 28, 2020

tkyi commented Mar 9, 2020
