Merged
23 commits
461b81e
make imgag dropbox raise exception if unknown data types arrive
wow-such-code Nov 2, 2020
ccd3077
Merge pull request #51 from qbicsoftware/hotfix/fail_unknown_data_imgag
sven1103 Nov 3, 2020
1b51c23
Update CHANGELOG.md
sven1103 Nov 3, 2020
468887b
Update README.md
sven1103 Nov 3, 2020
c8aa7a9
Update CL
sven1103 Nov 3, 2020
5d1ab7f
Merge branch 'master' into release/1.5.0
sven1103 Nov 3, 2020
1c67c06
Add new line at end of file
sven1103 Nov 3, 2020
fa9f280
Merge branch 'release/1.5.0' of github.com:qbicsoftware/etl-scripts i…
sven1103 Nov 3, 2020
187bb14
Convert experiment id to string for v3 objects (#55)
wow-such-code Nov 3, 2020
7f578fd
Rename data folder for pooled data (#54)
wow-such-code Nov 3, 2020
ab8997e
Release/1.5.0 (#53)
sven1103 Nov 3, 2020
e962a39
merge changelog
wow-such-code Jan 11, 2021
46b13be
Hotfix/wf sample fetching (#58)
wow-such-code Jan 12, 2021
9f554d7
Hotfix/retry sample tracking (#59)
wow-such-code Jan 22, 2021
413ef9a
Prepare release 1.6.0
sven1103 Jan 22, 2021
9e01d1a
Update CHANGELOG.md
wow-such-code Jan 22, 2021
893b0d5
Correct fastq dropbox docs (#69)
sven1103 Feb 25, 2021
fb8b5a5
Add documentation for NGS data with metadata (#68)
sven1103 Feb 26, 2021
f0ba260
Documentation/attachments (#65)
wow-such-code Feb 26, 2021
414073b
Documentation/convert ms (#72)
wow-such-code Feb 26, 2021
67552ae
Add description for HLA typing data (#70)
jenniferboedker Mar 2, 2021
d4eb322
Release/1.7.0 (#78)
wow-such-code Mar 19, 2021
51822bb
Merge branch 'development' into resolve_conflicts
wow-such-code May 11, 2021
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## 1.7.0 2021-03-19

* Provides fully tested functionality to register generic imaging data, with OMERO server support (v5.4.10). [Link to PR](https://github.com/qbicsoftware/etl-scripts/pull/78)
* Uses an omero-importer-cli (with Bio-formats) for image file registration into an OMERO server instance
* Uses an initial version of the openBIS-OMERO metadata model

## 1.6.0 2021-01-22

@@ -23,6 +28,7 @@
environment for the proper setup for the register-omero-metadata
dropbox
* Register unclassified pooling data of Nanopore experiments directly at the experiment level (no copies are added to sample-based datasets)
* Add description for data of register-hlatyping-dropbox

## 1.3.1

255 changes: 253 additions & 2 deletions README.md
@@ -40,15 +40,29 @@ openBIS.
Formats:

- [NGS single-end / paired-end data](#ngs-single-end--paired-end-data)
- [HLA Typing data](#hla-typing-data)
- [NGS single-end / paired-end data with metadata (deprecated)](#ngs-single-end--paired-end-data-with-metadata)
- [Attachment Data](#attachment-data)
- [Mass Spectrometry mzML conversion and registration](#mass-spectrometry-mzml-conversion-and-registration)
- [Imaging data with an OMERO server instance](#imaging-data-with-an-omero-server-instance)

### NGS single-end / paired-end data

**Responsible dropbox:**
[QBiC-register-fastq-dropbox](drop-boxes/register-fastq-dropbox)

**Resulting data model in openBIS**
Q_TEST_SAMPLE -> Q_NGS_RAW_DATA (with sample code) -> DataSet (directory
with files contained)
Q_TEST_SAMPLE -> Q_NGS_SINGLE_SAMPLE_RUN (with sample code) -> DataSet
of type Q_NGS_RAW_DATA (directory with files contained)

Example sample ids are:

QABCD001AE (Analyte, Q_TEST_SAMPLE)
NGSQABCD001AE (Sequencing result, Q_NGS_SINGLE_SAMPLE_RUN)

If several runs are submitted with the same analyte id, no new id is
generated for the run. Instead, a new dataset is attached to the
existing sequencing result id.
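
The id scheme above can be sketched in a few lines of Python. This is only an illustration, not the dropbox's implementation: the helper name `sequencing_result_id` is hypothetical, and the sample code pattern is borrowed from the HLA typing ETL script in this repository, on the assumption that it applies to FASTQ data as well.

```python
import re

# Sample code pattern as compiled in the HLA typing ETL script;
# reusing it here is an assumption made for this sketch.
SAMPLE_CODE = re.compile(r"Q\w{4}[0-9]{3}[a-zA-Z]\w")


def sequencing_result_id(analyte_code):
    """Return the sequencing result id for an analyte code (hypothetical helper)."""
    if SAMPLE_CODE.fullmatch(analyte_code) is None:
        raise ValueError("Not a valid QBiC sample code: %r" % analyte_code)
    # The sequencing result id is the analyte code prefixed with "NGS".
    return "NGS" + analyte_code
```

Because the id depends only on the analyte code, a second run for the same analyte resolves to the same id, which matches the behaviour described above.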

**Description**
For paired-end sequencing reads in FASTQ format, the file structure
@@ -81,4 +95,241 @@ look like this:
|-- <QBIC sample code>.fastq.gz.sha256sum
```
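
Each data file is accompanied by a `.sha256sum` file. A minimal sketch of how such a pair could be checked before upload is given below. It assumes the checksum file uses the output format of the `sha256sum` tool (hex digest, whitespace, file name) and that the data file sits next to it; the helper name `verify_sha256sum` is hypothetical.

```python
import hashlib
import os


def verify_sha256sum(checksum_path):
    """Check the file named in a sha256sum-style checksum file (hypothetical helper)."""
    with open(checksum_path) as handle:
        expected, name = handle.readline().split(None, 1)
    # sha256sum marks binary mode with a leading '*' before the file name.
    name = name.strip().lstrip("*")
    data_path = os.path.join(os.path.dirname(checksum_path), name)
    digest = hashlib.sha256()
    with open(data_path, "rb") as data:
        # Read in 1 MiB chunks so large FASTQ files are not loaded into memory at once.
        for chunk in iter(lambda: data.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```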

### HLA Typing data
**Responsible dropbox:**
[QBiC-register-hlatyping-dropbox](drop-boxes/register-hlatyping-dropbox)

**Resulting data model in openBIS**
Q_TEST_SAMPLE -> Q_NGS_HLATYPING (with sample code) -> DataSet (directory
with files contained)

or

Q_TEST_SAMPLE -> Q_NGS_SINGLE_SAMPLE_RUN (provided sample code) -> Q_NGS_HLATYPING -> DataSet (directory
with files contained)

Example sample ids are:
QABCD001AE (Analyte, Q_TEST_SAMPLE)
HLA1QABCD001AE (HLA-Typing result, Q_NGS_HLATYPING) for HLA MHC class I
or
HLA2QABCD001AE (HLA-Typing result, Q_NGS_HLATYPING) for HLA MHC class II


**Description**
For HLA typing data in VCF format, the file structure
needs to look like this:

```
<QBIC sample code> // Directory
|-- <QBIC sample code>.txt
|-- <QBIC sample code>.txt.sha256sum
```

### NGS single-end / paired-end data with metadata
(deprecated)

**Disclaimer!**
This data format is targeted for a single use case and should not be
used for general data registration purposes. Please use the
[NGS single-end / paired-end data](#ngs-single-end--paired-end-data)
format for now.

**Responsible dropbox:**
[QBiC-register-imgag-dropbox](drop-boxes/register-imgag-dropbox)

**Resulting data model in openBIS**
Q_TEST_SAMPLE -> Q_NGS_SINGLE_SAMPLE_RUN (with sample code) -> DataSet
of type Q_NGS_RAW_DATA (directory with raw sequencing files contained)

Example sample ids:

QABCD001AE (Analyte, Q_TEST_SAMPLE)
NGS[0-9]{2}QABCD001AE (Sequencing Result, Q_NGS_SINGLE_SAMPLE_RUN), where
the running two-digit number is taken from the suffix of the
`id_genetics` identifier in the metadata file.

**Description**
For paired-end sequencing reads in FASTQ format, the file structure
needs to look like this:

```
<QBIC sample code> // Directory
|-- file1.fastq.gz
|-- file2.fastq.gz
|-- metadata
|- ...
```

**Expected metadata**
This format requires additional metadata, provided as JSON in a file
called `metadata` that follows the
[upload metadata schema](drop-boxes/register-imgag-dropbox/upload-metadata.schema.json).
A valid JSON object can look like this:

```
{
"files": [
"reads.1.fastq.gz",
"reads.2.fastq.gz"
],
"type": "dna_seq",
"sample1": {
"genome": "GRCh37",
"id_genetics": "GS000000_01",
"id_qbic": "QTEST002AE",
"processing_system": "Test system",
"tumor": "no"
}
}
```
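
How the sequencing result id follows from such a metadata file can be sketched as below. The sketch assumes that the two-digit run number is the part of `id_genetics` after the last underscore, as the example suggests. The function name `run_sample_code` is hypothetical, and no schema validation is performed.

```python
import json
import re


def run_sample_code(metadata_text):
    """Derive the NGS[0-9]{2}<QBiC code> id from metadata JSON (illustrative only)."""
    sample = json.loads(metadata_text)["sample1"]
    # Assumption: the run number is the two-digit suffix after the last underscore.
    match = re.search(r"_([0-9]{2})$", sample["id_genetics"])
    if match is None:
        raise ValueError("No two-digit suffix in id_genetics: %r" % sample["id_genetics"])
    return "NGS" + match.group(1) + sample["id_qbic"]


metadata = '{"type": "dna_seq", "sample1": {"id_genetics": "GS000000_01", "id_qbic": "QTEST002AE"}}'
print(run_sample_code(metadata))  # NGS01QTEST002AE
```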

### Attachment Data

**Responsible dropbox:**
[QBiC-register-exp-proj-attachment](drop-boxes/register-attachments-dropbox)

**openBIS structure:**

Attachments are attached to the Q_PROJECT_DETAILS experiment type and its sample type Q_ATTACHMENT_SAMPLE.

**Expected data structure**
The data structure needs to be a root folder, containing a file `metadata.txt`.

Incoming structure overview:

```
|-<anything> (top level folder name, normally a time stamp of upload time)
|
|- metadata.txt
```

**Expected metadata**
Metadata is expected as line-separated key-value pairs, with key and value separated by `=`. The following pairs are expected:

```
user=<the (optional) uploading user name>
info=<short info about the file>
barcode=<the sample code of the attachment sample>
type=<the type of attachment: information or results>
```
The code of the attachment sample is built from the project code followed by three zeroes, conforming to the regular expression "Q[A-Z0-9]{4}000", e.g. QABCD000.

See code examples:
https://github.com/qbicsoftware/attachi-cli/blob/master/attachi/attachi.py#L63
https://github.com/qbicsoftware/projectwizard-portlet/blob/9c86f500b26af4cf2613cfae32e470bf5d50bf78/src/main/java/life/qbic/projectwizard/io/AttachmentMover.java#L145
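
For orientation, reading and checking such a `metadata.txt` could look like the sketch below. It is not the dropbox code: the helper name `parse_attachment_metadata` is hypothetical, and treating a wrong `barcode` or `type` as a hard error is an assumption.

```python
import re

# Regular expression for attachment sample codes, as stated above.
ATTACHMENT_CODE = re.compile(r"Q[A-Z0-9]{4}000")
ALLOWED_TYPES = ("information", "results")


def parse_attachment_metadata(text):
    """Parse the key=value lines of a metadata.txt (hypothetical helper)."""
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        metadata[key.strip()] = value.strip()
    if ATTACHMENT_CODE.fullmatch(metadata.get("barcode", "")) is None:
        raise ValueError("barcode must match Q[A-Z0-9]{4}000")
    if metadata.get("type") not in ALLOWED_TYPES:
        raise ValueError("type must be 'information' or 'results'")
    return metadata
```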


### Mass Spectrometry mzML conversion and registration

**Responsible dropbox:**
[QBiC-convert-register-ms-vendor-format](drop-boxes/register-convert-ms-vendor-format)

**Resulting data model in openBIS**
...Q_TEST_SAMPLE (-> Q_MHC_LIGAND_EXTRACT (Immunomics case)) -> Q_MS_RUN per data file --> 2 DataSets per data file, one for raw data, one converted to mzML

**Expected data structure**
In every use case, the data structure needs to contain a top folder around the respective data in order to accommodate metadata files.

The sample code found in the top folder can be of type `Q_TEST_SAMPLE` or `Q_MS_RUN`. In the former case, a new sample of type `Q_MS_RUN` is created and attached as child to the test sample.

**Valid folder/file types**:
- Thermo Fisher Raw file format
- Waters Raw folder
- Bruker .d folder

**Incoming structure overview for standard case without additional metadata file:**
```
QABCD102A5_20201229145526_20201014_CO_0976StSi_R05_.raw
|-- QABCD102A5_20201229145526_20201014_CO_0976StSi_R05_.raw
|-- QABCD102A5_20201229145526_20201014_CO_0976StSi_R05_.raw.sha256sum
```
In this case, existing mass spectrometry metadata is expected to be already stored and the dataset will be attached.


**Incoming structure overview for the use case of Immunomics data with metadata file:**
```
QABCD090B7
|-- QABCD090B7
| |-- file1.raw
| |-- file2.raw
| |-- file3.raw
| `-- metadata.tsv
|-- QABCD090B7.sha256sum
`-- source_dropbox.txt
```
The source_dropbox.txt currently has to indicate the source as one of the Immunomics data sources.

The `metadata.tsv` columns for the Immunomics case are tab-separated:
```
Filename Q_MS_DEVICE Q_MEASUREMENT_FINISH_DATE Q_EXTRACT_SHARE Q_ADDITIONAL_INFO Q_MS_LCMS_METHODS technical_replicate workflow_type
file1.raw THERMO_QEXACTIVE 171010 10 QEX_TOP07_470MIN DDA_Rep1 DDA
```

Filename - one of the (e.g. raw) file names found in the incoming structure

Q_MS_DEVICE - openBIS code from the vocabulary of Mass Spectrometry devices

Q_MEASUREMENT_FINISH_DATE - Date in YYMMDD format (ISO 8601:2000)

Q_EXTRACT_SHARE - the extract share

Q_ADDITIONAL_INFO - any optional comments

Q_MS_LCMS_METHODS - openBIS code from the vocabulary of LCMS methods

technical_replicate - free text to denote replicates

workflow_type - DDA or DIA
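
A `metadata.tsv` of this shape can be sanity-checked before upload with a short script. The sketch below (Python 3) is an illustration under assumptions: the helper name `read_ms_metadata` is hypothetical, and requiring the exact column order shown above is stricter than the dropbox may actually be.

```python
import csv
import io
from datetime import datetime

EXPECTED_COLUMNS = [
    "Filename", "Q_MS_DEVICE", "Q_MEASUREMENT_FINISH_DATE", "Q_EXTRACT_SHARE",
    "Q_ADDITIONAL_INFO", "Q_MS_LCMS_METHODS", "technical_replicate", "workflow_type",
]


def read_ms_metadata(tsv_text):
    """Parse and sanity-check Immunomics metadata rows (illustrative sketch)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError("Unexpected columns: %r" % reader.fieldnames)
    rows = []
    for row in reader:
        # strptime raises ValueError if the date is not in YYMMDD format.
        datetime.strptime(row["Q_MEASUREMENT_FINISH_DATE"], "%y%m%d")
        if row["workflow_type"] not in ("DDA", "DIA"):
            raise ValueError("workflow_type must be DDA or DIA")
        rows.append(row)
    return rows
```

An empty `Q_ADDITIONAL_INFO` value still needs its tab separator, otherwise the later columns shift to the left and the check on `workflow_type` fails.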


### Imaging data with an OMERO server instance

**Responsible dropbox:**
[QBiC-register-omero-metadata](drop-boxes/register-omero-metadata)

**Resulting data model in openBIS**
Multiple images (the data files) can be created for each tissue sample. Accordingly, multiple Q_BMI_GENERIC_IMAGING_RUN samples are created and attached to that tissue sample:
...Q_BIOLOGICAL_SAMPLE -> one Q_BMI_GENERIC_IMAGING_RUN per data file

**Expected data structure**
In every use case, the data structure needs to contain a top folder around the respective data in order to accommodate metadata files.

The sample code found in the top folder is of type `Q_BIOLOGICAL_SAMPLE` (tissue imaging).

**Valid file types**:
Valid files in the folder are any imaging files that can be handled by the OMERO server.

**Incoming structure overview:**
```
QABCD002A8
|-- QABCD002A8
| |-- Est-B1a.lif
| |-- Image_1.czi
| |-- Image_2.czi
| |-- Image7246.tif
| |-- metadata_3.tsv
| |-- rubisco_avg.mrc
| `-- tomogram_x.mrc
|-- QABCD002A8.sha256sum
`-- source_dropbox.txt
```

The metadata file, ending in `.tsv`, has tab-separated columns:
```
IMAGE_FILE_NAME IMAGING_MODALITY IMAGED_TISSUE INSTRUMENT_MANUFACTURER INSTRUMENT_USER IMAGING_DATE
tomogram_x.mrc NCIT_C18113 cell FEI Dr. Horrible 01.03.2021
rubisco_avg.mrc NCIT_C18113 cell FEI Max Mustermann 01.04.2021
Image7246.tif NCIT_C18216 leaf Zeiss Max Mustermann 23.02.2021
Est-B1a.lif NCIT_C17753 root Zeiss Max Mustermann 01.02.2021
Image_1.czi NCIT_C17753 leaf Zeiss Max Mustermann 11.02.2021
Image_2.czi NCIT_C17753 leaf Zeiss Max Mustermann 01.02.2021
```

column name | description
--------------|----------------
`IMAGE_FILE_NAME`| one of the file names found in the incoming folder per line
`IMAGING_MODALITY`| Ontology Identifier for the imaging modality, currently from the [NCI Thesaurus](https://ncit.nci.nih.gov/ncitbrowser/pages/home.jsf?version=21.02d). **Examples:** NCIT_C18113 (Cryo-Electron Microscopy), NCIT_C18216 (Transmission Electron Microscopy), NCIT_C17753 (Confocal Microscopy)
`IMAGED_TISSUE` | the imaged tissue
`INSTRUMENT_MANUFACTURER` | the imaging instrument manufacturer
`INSTRUMENT_USER` | the person who measured the data file using the imaging instrument
`IMAGING_DATE` | the date of the measurement in dd.mm.yyyy format (days and months with leading zeroes)
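
The two constraints that are easiest to get wrong, file names that must exist in the incoming folder and the `dd.mm.yyyy` date format, can be checked with a small script before upload. The following Python 3 sketch is an illustration only; the helper name `check_imaging_metadata` is hypothetical and the dropbox may validate differently.

```python
import csv
import io
from datetime import datetime


def check_imaging_metadata(tsv_text, file_names):
    """Return a list of problems found in an imaging metadata table (illustrative sketch)."""
    problems = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        name = row.get("IMAGE_FILE_NAME")
        if name not in file_names:
            problems.append("%s: not found in incoming folder" % name)
        date = row.get("IMAGING_DATE") or ""
        try:
            datetime.strptime(date, "%d.%m.%Y")
            # strptime also accepts "1.3.2021", so enforce leading zeroes via the length.
            valid = len(date) == 10
        except ValueError:
            valid = False
        if not valid:
            problems.append("%s: IMAGING_DATE is not in dd.mm.yyyy format" % name)
    return problems
```

Collecting all problems in a list, rather than stopping at the first one, lets a user fix the whole metadata file in a single pass.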
@@ -92,18 +92,6 @@ def process(transaction):
sa.setExperiment(exp)
info = None

#if isProject:
#experiments = search_service.listExperiments("/" + space + "/" + project)
#for e in experiments:
# if project+"_INFO" in e.getExperimentIdentifier():
# info = e
#if not info:
# info = transaction.createNewExperiment('/' + space + '/' + project + '/'+ project+'_INFO', "Q_PROJECT_DETAILS")
#else:
# info = transaction.getExperiment('/' + space + '/' + project + '/' + code)
# register new experiment and sample
#sa.setExperiment(info)
# create new dataset
dataSet = transaction.createNewDataSet("Q_PROJECT_DATA")
dataSet.setMeasuredData(False)
dataSet.setPropertyValue("Q_SECONDARY_NAME", secname)
@@ -31,7 +31,7 @@
### We need this object to update the sample location later
SAMPLE_TRACKER = SampleTracker.createQBiCSampleTracker(SERVICE_REGISTRY_URL, SERVICE_CREDENTIALS, QBIC_LOCATION)

# ETL script for registration of VCF files
# ETL script for registration of HLA Typing
# expected:
# *Q[Project Code]^4[Sample No.]^3[Sample Type][Checksum]*.*
pattern = re.compile('Q\w{4}[0-9]{3}[a-zA-Z]\w')
42 changes: 0 additions & 42 deletions drop-boxes/register-imgag-dropbox/README.md

This file was deleted.
