Skip to content
This repository has been archived by the owner on Jul 14, 2022. It is now read-only.

Commit

Permalink
Merge pull request #53 from uc-cdis/feat/workflow-db
Browse files Browse the repository at this point in the history
Feat/workflow db
  • Loading branch information
UchicagoZchen138 committed Aug 26, 2021
2 parents 044a91c + cd386b4 commit 2100440
Show file tree
Hide file tree
Showing 18 changed files with 1,010 additions and 19 deletions.
12 changes: 5 additions & 7 deletions Docker/server/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,10 @@ FROM quay.io/cdis/golang:1.15-buster as builder
COPY . $GOPATH/src/github.com/uc-cdis/mariner/
WORKDIR $GOPATH/src/github.com/uc-cdis/mariner/

RUN go get -d -v
RUN go build -ldflags "-linkmode external -extldflags -static" -o /mariner
# Install db
RUN apt-get update && apt-get install --no-install-recommends -y jq bash postgresql

FROM scratch
COPY --from=builder /mariner /
# Copy CA certificates to prevent x509: certificate signed by unknown authority errors
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
RUN go get -d -v
RUN go build -ldflags "-linkmode external -extldflags -static" -o bin/mariner

ENTRYPOINT ["/mariner", "listen"]
ENTRYPOINT ["bin/mariner", "listen"]
204 changes: 204 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,207 @@ All documentation can be found in the [docs](docs) folder, and key documents are
* [Technical Design and Architecture](docs/reference/TechnicalDesignProposal.md)
* [Running CWL Conformance Tests](https://github.com/uc-cdis/mariner/tree/master/conformance)
* [CWL User Guide](https://www.commonwl.org/user_guide/02-1st-example/index.html)
Mariner presentations:
- [Mariner pt. 1](https://docs.google.com/presentation/d/1FKlOJeGyimX3MVURNiM9gOtdHB8gu9sx6NJP0WfTtHI/edit#slide=id.p) - gives
context for the service, why it's critical to Gen3, how it fits in with the larger data commons picture
- [Mariner pt. 2](https://docs.google.com/presentation/d/1C52GialV2VYUzVW_KlObQArZi22kGuIhhRnm3mDgMDE/edit#slide=id.g7e9daf6d29_0_0) - gives high level details on the Mariner service itself, API, overview of architectural components

A sketch of the Centralized Gen3 Compute Environment idea can be found [here](https://docs.google.com/document/d/1_-y5Tpw-xeh0Ce1D7DwalLkrdVQ0Osgrd8k7RE-H6tY/edit).

The original technical design proposal for Mariner can be found [here](https://github.com/uc-cdis/mariner/blob/master/TechnicalDesignProposal.md).

## How to deploy Mariner in a Gen3 environment

### Prereq's

1. Mariner depends on the [Workspace Token Service (WTS)](https://github.com/uc-cdis/workspace-token-service)
to access data from the commons.
If WTS is not already running in your environment, deploy the WTS.

2. Add the Mariner pieces to your manifest:
1. Add [version](https://github.com/uc-cdis/gitops-dev/blob/78ce75e69c786bbdda629c6c8d76a17476c2084a/mattgarvin1.planx-pla.net/manifest.json#L19)
2. Add [config](https://github.com/uc-cdis/gitops-dev/blob/78ce75e69c786bbdda629c6c8d76a17476c2084a/mattgarvin1.planx-pla.net/manifest.json#L183-L292)
3. Currently Mariner is not setup with network policies (this will be fixed very very soon),
so for now in your dev or qa environment in order for Mariner to work,
[network policies must be "off"](https://github.com/uc-cdis/gitops-dev/blob/78ce75e69c786bbdda629c6c8d76a17476c2084a/mattgarvin1.planx-pla.net/manifest.json#L161)

### Deployment

3. Deploy the Mariner server by running `gen3 kube-setup-mariner`

### Auth and User YAML

4. Make sure you have the Mariner auth scheme in your User YAML:
1. the [policy](https://github.com/uc-cdis/commons-users/blob/a95edd2d1ac27faed2ab628280cff8923292d073/users/dev/user.yaml#L57-L60)
2. the [resource](https://github.com/uc-cdis/commons-users/blob/a95edd2d1ac27faed2ab628280cff8923292d073/users/dev/user.yaml#L419-L420)
3. the [role](https://github.com/uc-cdis/commons-users/blob/a95edd2d1ac27faed2ab628280cff8923292d073/users/dev/user.yaml#L577-L582)

5. Give the `mariner_admin` policy to those users who need it. ([example](https://github.com/uc-cdis/commons-users/blob/a95edd2d1ac27faed2ab628280cff8923292d073/users/dev/user.yaml#L1433))

#### Auth Note

Right now the Mariner auth scheme is coarse - you
either have access to all the API endpoints or none of them.
In order for a user (intended at this point to be either a CTDS dev or bio)
to interact with Mariner, that user will need to have Mariner admin privileges.

A Mariner admin can do the following:
- run workflows
- fetch run status via runID
- fetch run logs and output via runID
- cancel a run that's in-progress via runID
- query run history (i.e., fetch a list of all your runIDs)

## How to use Mariner

### A Full Example

To demonstrate how to interact with Mariner, here's a step-by-step process
of how to run a (very) small test workflow and otherwise
hit all the Mariner API endpoints.

1. On your machine, move to directory `testdata/no_input_test`

2. Fetch token using API key
```
echo Authorization: bearer $(curl -d '{"api_key": "<replaceme>", "key_id": "<replaceme>"}' -X POST -H "Content-Type: application/json" https://<replaceme>.planx-pla.net/user/credentials/api/access_token | jq .access_token | sed 's/"//g') > auth
```

3. POST the workflow request
```
curl -d "@request_body.json" -X POST -H "$(cat auth)" https://<replaceme>.planx-pla.net/ga4gh/wes/v1/runs
```

4. Check run status
```
curl -H "$(cat auth)" https://<replaceme>.planx-pla.net/ga4gh/wes/v1/runs/<runID>/status
```

5. Fetch run logs (includes output json)
```
curl -H "$(cat auth)" https://<replaceme>.planx-pla.net/ga4gh/wes/v1/runs/<runID>
```

6. Fetch your run history (list of runIDs)
```
curl -H "$(cat auth)" https://<replaceme>.planx-pla.net/ga4gh/wes/v1/runs
```

7. Cancel a run that's currently in-progress
```
curl -d "@request_body.json" -X POST -H "$(cat auth)" https://<replaceme>.planx-pla.net/ga4gh/wes/v1/runs/<runID>/cancel
```

### Writing And Running Your Own Workflows "from scratch"

A workflow request to Mariner consists of the following:
1. A CWL workflow (serialized into JSON)
2. An inputs mapping file (also in the form of JSON)

The workflow specifies the computations to run,
the inputs mapping file specifies the data to run those computations on.

So if you want to write and run your own workflow with Mariner,
the process would go like this:

1. Write your CWL workflow.

2. Use the [Mariner wftool](https://github.com/uc-cdis/mariner/tree/master/wftool)
to serialize your CWL file(s) into a single JSON file.

3. Create your inputs mapping file, which
is a JSON file where the keys are CWL input parameters
and the values are the corresponding input values
for those parameters. Here is an example
of an inputs mapping file with two inputs,
both of which are files. One file is commons data
and is specified by GUID with the prefix `COMMONS/`,
and the other file is a user file, which exists in
the "user data space", and is specified by
the filepath within that user data space
plus the prefix `USER/`:
```
{
"commons_file_1": {
"class": "File",
"location": "COMMONS/8bc9f306-5b5d-4b6b-b34e-f90680824b17"
},
"user_file": {
"class": "File",
"location": "USER/user-data.txt"
}
}
```


4. Now you can construct the Mariner workflow request
JSON body, which looks like this:
```
{
"workflow": <output_from_wftool>,
"input": <inputs_mapping_json>,
"manifest": <manifest_containing_GUIDs_of_all_commons_input_data>,
"tags": {
"author": "matt",
"type": "example",
}
}
```

An example request body can be found [here](https://github.com/uc-cdis/mariner/blob/master/testdata/user_data_test/request_body.json).

5. At this point you're ready to ask Mariner to run your workflow,
and you can do that via the API call demonstrated in step 3 from the "A Full Example" section above.

#### Notes

Notice you can apply tags to your workflow request,
which can be useful for identifying or categorizing your workflow runs.
For example if you are running a certain set of workflows for one study,
and another set of workflows for another,
you could apply a studyID tag to each workflow run.

The `manifest` field will (very) soon be removed from the workflow request body,
since of course Mariner can generate the required manifest
by parsing the inputs mapping file and collecting all the GUIDs it comes across.

#### Learning Resources

A good way to get a handle on CWL in a relatively short period of time
is to explore the [CWL User Guide](https://www.commonwl.org/user_guide/02-1st-example/index.html),
which contains a number of example workflows with explanations
of all the different parts of the syntax - what they mean and how they function -
in the context of each example.

### Browsing and Retrieving Output From A Workflow Run

Mariner implicitly depends on the existence of something like a "user data client",
which is a little API for users to browse/upload/download/delete files
from their "user data space", which is persistent storage
on the Gen3/commons side for data which belongs to a user
and is not commons data.

The user-data-space is where a user can stage files to be input
to a workflow run, and theoretically, also the same place
where users can stage input files for any "app on Gen3", e.g., a Jupyter notebook.

The user-data-space (also could be called an "analysis space") is also
where output files from apps are stored.

Concretely, right now there's an S3 bucket which is a dedicated "user data space",
where keys at the root are userID's, and any file which belongs to user_A
has `user_A/` as a prefix. Per workflow run, there is a "working directory"
created and dedicated to that run, under that user's prefix in that S3 bucket.
All files generated by the workflow run are written to this working directory,
and any files which are not explicitly listed as output files of the top-level workflow
(i.e., all intermediate files) get deleted at the end of the run so that only
the desired output files are kept.

Currently there does not exist a Gen3 user-data-client,
so in order to browse and retrieve your output files from
the workflow's working directory in S3,
you must use the [AWS S3 CLI](https://docs.aws.amazon.com/cli/latest/reference/s3/) directly.

## Running the CWL Conformance Tests against Mariner

See [here](https://github.com/uc-cdis/mariner/tree/master/conformance).
10 changes: 10 additions & 0 deletions database/constants.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
package database

const (
PostgresDB = "psql"
usrTable = "usr"
taskTable = "task"
workflowTable = "workflow"

dbcredentialpath = "/var/www/mariner/dbcreds.json"
)
23 changes: 23 additions & 0 deletions database/database.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
package database

type Dao interface {
GetAllUsers() ([]User, error)
CreateUser(name string, email string) (int64, error)
UpdateUser(user *User) error
DeleteUser(id int64) error
GetUserById(id int64) (*User, error)

GetAllWorkflows() ([]Workflow, error)
CreateWorkflow(userId int64, lastTaskCompleted int64, definition string, hash string, stats string, inputs JsonBytesMap, output string, status string, metadata JsonBytesMap) (int64, error)
UpdateWorkflow(workflow *Workflow) error
DeleteWorkflow(id int64) error
GetWorkflowById(id int64) (*Workflow, error)

GetAllTasks() ([]Task, error)
CreateTask(wf_id int64, name string, hash string, stats string, input JsonBytesMap, output string, status string, taskError string, wf_status string) (int64, error)
UpdateTask(task *Task) error
DeleteTask(id int64) error
GetTaskById(id int64) (*Task, error)

KillDao()
}
14 changes: 14 additions & 0 deletions database/factory.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
package database

import log "github.com/sirupsen/logrus"

func DaoFactory(daoType string) Dao {
switch daoType {
case "psql":
return NewPSQLDao()

default:
log.Errorf("There is no current support for the daotype %s. Please select a different supported daotype", daoType)
return nil
}
}
76 changes: 76 additions & 0 deletions database/models.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
package database

import (
"database/sql/driver"
"encoding/json"
"fmt"
"time"
)

type User struct {
ID int64 `db:"id"`
Name string `db:"name"`
Email string `db:"email"`
CreatedAt time.Time `db:"created_at"`
}

type JsonBytesMap map[string]interface{}

func (p JsonBytesMap) Value() (driver.Value, error) {
j, err := json.Marshal(p)
return j, err
}

func (p *JsonBytesMap) Scan(src interface{}) error {
source, ok := src.([]byte)
if !ok {
return fmt.Errorf("type assertion .([]byte) failed")
}

var i interface{}
err := json.Unmarshal(source, &i)
if err != nil {
return err
}

*p, ok = i.(map[string]interface{})
if !ok {
return fmt.Errorf("type assertion .(map[string]interface{}) failed")
}

return nil
}

type Workflow struct {
WorkFlowID int64 `db:"wf_id"`
UserId int64 `db:"usr_id"`
LastTaskCompleted int64 `db:"last_task_completed"`
Definition string `db:"definition"`
Hash string `db:"hash"`
Stats string `db:"stats"`
Inputs JsonBytesMap `db:"inputs"`
Outputs string `db:"outputs"`
Status string `db:"status"`
StartedAt time.Time `db:"started_at"`
EndedAt time.Time `db:"ended_at"`
CreatedAt time.Time `db:"created_at"`
UpdatedAt time.Time `db:"updated_at"`
Metadata JsonBytesMap `db:"metadata"`
}

type Task struct {
TaskId int64 `db:"task_id"`
WorkFlowID int64 `db:"wf_id"`
Name string `db:"name"`
Hash string `db:"hash"`
Stats string `db:"stats"`
Input JsonBytesMap `db:"input"`
Output string `db:"output"`
Attempt int64 `db:"attempt"`
Status string `db:"status"`
ReturnCode int64 `db:"return_code"`
Error string `db:"error"`
WorkFlowStatus string `db:"wf_status"`
CreatedAt time.Time `db:"created_at"`
UpdatedAt time.Time `db:"updated_at"`
}
Loading

0 comments on commit 2100440

Please sign in to comment.