qbicsoftware · wow-such-code · May 11, 2021 · Nov 2, 2020 · Nov 3, 2020 · Nov 3, 2020
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,10 @@
 # Changelog
 
+## 1.7.0 2021-03-19
+
+* Provides fully tested functionality to register generic imaging data, with OMERO server support (v5.4.10). [Link to PR](https://github.com/qbicsoftware/etl-scripts/pull/78)
+* Uses an omero-importer-cli (with Bio-formats) for image file registration into an OMERO server instance
+* Uses an initial version of the openBIS-OMERO metadata model
 
 ## 1.6.0 2021-01-22
 
@@ -23,6 +28,7 @@
   environment for the proper setup for the register-omero-metadata
   dropbox
 * Register unclassified pooling data of Nanopore experiments directly at the experiment level (no copies are added to sample-based datasets)
+* Add description for data of register-hlatyping-dropbox
 
 ## 1.3.1
 

diff --git a/README.md b/README.md
@@ -40,15 +40,29 @@ openBIS.
 Formats:
 
 - [NGS single-end / paired-end data](#ngs-single-end--paired-end-data)
+- [HLA Typing data](#hla-typing-data)
+- [NGS single-end / paired-end data with metadata (deprecated)](#ngs-single-end--paired-end-data-with-metadata)
+- [Attachment Data](#attachment-data)
+- [Mass Spectrometry mzML conversion and registration](#mass-spectrometry-mzml-conversion-and-registration)
+- [Imaging data with an OMERO server instance](#imaging-data-with-an-omero-server-instance)
 
 ### NGS single-end / paired-end data
 
 **Responsible dropbox:**
 [QBiC-register-fastq-dropbox](drop-boxes/register-fastq-dropbox)
 
 **Resulting data model in openBIS**  
-Q_TEST_SAMPLE -> Q_NGS_RAW_DATA (with sample code) -> DataSet (directory
-with files contained)
+Q_TEST_SAMPLE -> Q_NGS_SINGLE_SAMPLE_RUN (with sample code) -> DataSet
+of type Q_NGS_RAW_DATA (directory with files contained)
+
+Example sample ids are:
+
+QABCD001AE (Analyte, Q_TEST_SAMPLE)  
+NGSQABCD001AE (Sequencing result, Q_NGS_SINGLE_SAMPLE_RUN)
+
+If several runs are submitted with the same analyte id, no new id is
+generated for the run. Instead, a new dataset is attached to the
+existing sequencing result id.
 
 **Description**  
 For paired-end sequencing reads in FASTQ format, the file structure
@@ -81,4 +95,241 @@ look like this:
     |-- <QBIC sample code>.fastq.gz.sha256sum
 ```
 
+### HLA Typing data
+**Responsible dropbox:**
+[QBiC-register-hlatyping-dropbox](drop-boxes/register-hlatyping-dropbox)
+
+**Resulting data model in openBIS**  
+Q_TEST_SAMPLE -> Q_NGS_HLATYPING (with sample code) -> DataSet (directory
+with files contained)
+
+or
+
+Q_TEST_SAMPLE -> Q_NGS_SINGLE_SAMPLE_RUN (provided sample code) -> Q_NGS_HLATYPING -> DataSet (directory
+with files contained)
+
+Example sample ids are:
+QABCD001AE (Analyte, Q_TEST_SAMPLE)  
+HLA1QABCD001AE (HLA-Typing result, Q_NGS_HLATYPING) for HLA MHC class I
+or
+HLA2QABCD001AE (HLA-Typing result, Q_NGS_HLATYPING) for HLA MHC class II
+
+
+**Description**  
+For HLA typing data (delivered as `.txt` files), the file structure
+needs to look like this:
+
+```
+<QBIC sample code> // Directory
+    |-- <QBIC sample code>.txt
+    |-- <QBIC sample code>.txt.sha256sum
+```
+
+### NGS single-end / paired-end data with metadata
+(deprecated)
+
+**Disclaimer!**  
+This data format is targeted for a single use case and should not be
+used for general data registration purposes. Please use the
+[NGS single-end / paired-end data](#ngs-single-end--paired-end-data)
+format for now.
+
+**Responsible dropbox:**
+[QBiC-register-imgag-dropbox](drop-boxes/register-imgag-dropbox)
+
+**Resulting data model in openBIS**  
+Q_TEST_SAMPLE -> Q_NGS_SINGLE_SAMPLE_RUN (with sample code) -> DataSet
+of type Q_NGS_RAW_DATA (directory with raw sequencing files contained)
+
+Example sample ids:
+
+QABCD001AE (Analyte, Q_TEST_SAMPLE)  
+NGS[0-9]{2}QABCD001AE (Sequencing Result, Q_NGS_SINGLE_SAMPLE_RUN) where
+the running two-digit number is taken from the suffix of the
+`id_genetics` identifier in the metadata file (e.g. `01` in `GS000000_01`).
+
+**Description**  
+For paired-end sequencing reads in FASTQ format, the file structure
+needs to look like this:
+
+```
+<QBIC sample code> // Directory
+    |-- file1.fastq.gz
+    |-- file2.fastq.gz
+    |-- metadata
+    |- ...
+```
+
+**Expected metadata**  
+This format requires additional metadata, provided as JSON in a file
+called `metadata` that follows the
+[upload metadata schema](drop-boxes/register-imgag-dropbox/upload-metadata.schema.json).
+A valid JSON object can look like this:
+
+```
+{
+    "files": [
+        "reads.1.fastq.gz",
+        "reads.2.fastq.gz"
+    ],
+    "type": "dna_seq",
+    "sample1": {
+        "genome": "GRCh37",
+        "id_genetics": "GS000000_01",
+        "id_qbic": "QTEST002AE",
+        "processing_system": "Test system",
+        "tumor": "no"
+    }
+}
+```
+
+### Attachment Data
+
+**Responsible dropbox:**
+[QBiC-register-exp-proj-attachment](drop-boxes/register-attachments-dropbox)
+
+**openBIS structure:**
+
+Attachments are attached to the Q_PROJECT_DETAILS experiment type and its sample type Q_ATTACHMENT_SAMPLE.
+
+**Expected data structure**
+The data structure needs to be a root folder, containing a file `metadata.txt`.
+
+Incoming structure overview:
+
+```
+|-<anything> (top level folder name, normally a time stamp of upload time)
+    |
+    |- metadata.txt
+```
+
+**Expected metadata**
+Metadata is expected as line-separated key-value pairs, with key and value separated by `=`. The following pairs are expected:
+
+```
+user=<the (optional) uploading user name>
+info=<short info about the file>
+barcode=<the sample code of the attachment sample>
+type=<the type of attachment: information or results>
+```
+The code of the attachment sample is built from the project code followed by three zeroes, conforming to the regular expression "Q[A-Z0-9]{4}000", e.g. QABCD000.
+
+See code examples:
+https://github.com/qbicsoftware/attachi-cli/blob/master/attachi/attachi.py#L63
+https://github.com/qbicsoftware/projectwizard-portlet/blob/9c86f500b26af4cf2613cfae32e470bf5d50bf78/src/main/java/life/qbic/projectwizard/io/AttachmentMover.java#L145
+
+
+### Mass Spectrometry mzML conversion and registration
+
+**Responsible dropbox:**
+[QBiC-convert-register-ms-vendor-format](drop-boxes/register-convert-ms-vendor-format)
+
+**Resulting data model in openBIS**  
+...Q_TEST_SAMPLE (-> Q_MHC_LIGAND_EXTRACT (Immunomics case)) -> Q_MS_RUN per data file --> 2 DataSets per data file, one for raw data, one converted to mzML
+
+**Expected data structure**
+In every use case, the data structure needs to contain a top folder around the respective data in order to accommodate metadata files.
+
+The sample code found in the top folder can be of type `Q_TEST_SAMPLE` or `Q_MS_RUN`. In the former case, a new sample of type `Q_MS_RUN` is created and attached as child to the test sample.
+
+**Valid folder/file types**:
+- Thermo Fisher Raw file format
+- Waters Raw folder
+- Bruker .d folder
+
+**Incoming structure overview for standard case without additional metadata file:**
+```
+QABCD102A5_20201229145526_20201014_CO_0976StSi_R05_.raw
+|-- QABCD102A5_20201229145526_20201014_CO_0976StSi_R05_.raw
+|-- QABCD102A5_20201229145526_20201014_CO_0976StSi_R05_.raw.sha256sum
+```
+In this case, the mass spectrometry metadata is expected to exist already, and the dataset is attached to it.
+
+
+**Incoming structure overview for the use case of Immunomics data with metadata file:**
+```
+QABCD090B7
+|-- QABCD090B7
+|   |-- file1.raw
+|   |-- file2.raw
+|   |-- file3.raw
+|   `-- metadata.tsv
+|-- QABCD090B7.sha256sum
+`-- source_dropbox.txt
+```
+The `source_dropbox.txt` file currently has to indicate the source as one of the Immunomics data sources.
+
+The `metadata.tsv` columns for the Immunomics case are tab-separated:
+```
+Filename	Q_MS_DEVICE	Q_MEASUREMENT_FINISH_DATE	Q_EXTRACT_SHARE	Q_ADDITIONAL_INFO	Q_MS_LCMS_METHODS	technical_replicate	workflow_type
+file1.raw	THERMO_QEXACTIVE	171010	10		QEX_TOP07_470MIN	DDA_Rep1	DDA
+```
+
+Filename - one of the (e.g. raw) file names found in the incoming structure
+
+Q_MS_DEVICE - openBIS code from the vocabulary of Mass Spectrometry devices
+
+Q_MEASUREMENT_FINISH_DATE - Date in YYMMDD format (ISO 8601:2000)
+
+Q_EXTRACT_SHARE - the extract share
+
+Q_ADDITIONAL_INFO - any optional comments
+
+Q_MS_LCMS_METHODS - openBIS code from the vocabulary of LCMS methods
+
+technical_replicate - free text to denote replicates
+
+workflow_type - DDA or DIA
+
+
+### Imaging data with an OMERO server instance
+
+**Responsible dropbox:**
+[QBiC-register-omero-metadata](drop-boxes/register-omero-metadata)
+
+**Resulting data model in openBIS**  
+For each tissue sample, multiple images (the data files) can be created, so multiple Q_BMI_GENERIC_IMAGING_RUN samples are created and attached to that tissue sample:  
+...Q_BIOLOGICAL_SAMPLE -> one Q_BMI_GENERIC_IMAGING_RUN per data file
+
+**Expected data structure**
+In every use case, the data structure needs to contain a top folder around the respective data in order to accommodate metadata files.
+
+The sample code found in the top folder is of type `Q_BIOLOGICAL_SAMPLE` (tissue imaging).
+
+**Valid file types**:
+Valid files in the folder are any imaging files that can be handled by the OMERO server
+
+**Incoming structure overview:**
+```
+QABCD002A8
+|-- QABCD002A8
+|   |-- Est-B1a.lif
+|   |-- Image_1.czi
+|   |-- Image_2.czi
+|   |-- Image7246.tif
+|   |-- metadata_3.tsv
+|   |-- rubisco_avg.mrc
+|   `-- tomogram_x.mrc
+|-- QABCD002A8.sha256sum
+`-- source_dropbox.txt
+```
+
+The metadata file, ending in `.tsv`, has tab-separated columns:
+```
+IMAGE_FILE_NAME  IMAGING_MODALITY  IMAGED_TISSUE  INSTRUMENT_MANUFACTURER  INSTRUMENT_USER  IMAGING_DATE
+tomogram_x.mrc   NCIT_C18113       cell           FEI                      Dr. Horrible     01.03.2021
+rubisco_avg.mrc  NCIT_C18113       cell           FEI                      Max Mustermann   01.04.2021
+Image7246.tif    NCIT_C18216       leaf           Zeiss                    Max Mustermann   23.02.2021
+Est-B1a.lif      NCIT_C17753       root           Zeiss                    Max Mustermann   01.02.2021
+Image_1.czi      NCIT_C17753       leaf           Zeiss                    Max Mustermann   11.02.2021
+Image_2.czi      NCIT_C17753       leaf           Zeiss                    Max Mustermann   01.02.2021
+```
 
+column name | description
+--------------|----------------
+`IMAGE_FILE_NAME`| one of the file names found in the incoming folder per line
+`IMAGING_MODALITY`| Ontology Identifier for the imaging modality, currently from the [NCI Thesaurus](https://ncit.nci.nih.gov/ncitbrowser/pages/home.jsf?version=21.02d). **Examples:** NCIT_C18113 (Cryo-Electron Microscopy), NCIT_C18216 (Transmission Electron Microscopy), NCIT_C17753 (Confocal Microscopy)
+`IMAGED_TISSUE` | the imaged tissue
+`INSTRUMENT_MANUFACTURER` | the imaging instrument manufacturer
+`INSTRUMENT_USER` | the person who measured the data file using the imaging instrument
+`IMAGING_DATE` | the date of the measurement in dd.mm.yyyy format (days and months with leading zeroes)
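
The imaging metadata rules documented in the README hunk above (six fixed columns, one listed file per line, dates as dd.mm.yyyy with leading zeroes) could be checked before upload with something like the sketch below. It assumes plain Python 3 outside the openBIS/Jython runtime. `validate_imaging_metadata` and `DATE_SHAPE` are hypothetical names and not part of this repository. The ontology identifiers are not checked.

```python
import csv
import io
import re
from datetime import datetime

# Column names as listed in the README table.
EXPECTED_COLUMNS = [
    "IMAGE_FILE_NAME", "IMAGING_MODALITY", "IMAGED_TISSUE",
    "INSTRUMENT_MANUFACTURER", "INSTRUMENT_USER", "IMAGING_DATE",
]

# strptime alone accepts "1.3.2021", so the leading zeroes are enforced separately.
DATE_SHAPE = re.compile(r"\d{2}\.\d{2}\.\d{4}")


def validate_imaging_metadata(tsv_text, files_in_folder):
    """Return a list of problems found; an empty list means no problem was found.

    Hypothetical helper for illustration, not the dropbox implementation.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        return ["unexpected header: %r" % (reader.fieldnames,)]
    for line_no, row in enumerate(reader, start=2):
        file_name = row["IMAGE_FILE_NAME"]
        date = row["IMAGING_DATE"] or ""  # short rows yield None
        if file_name not in files_in_folder:
            problems.append("line %d: file %r not in folder" % (line_no, file_name))
        try:
            if not DATE_SHAPE.fullmatch(date):
                raise ValueError("wrong shape")
            datetime.strptime(date, "%d.%m.%Y")  # rejects dates such as 31.02.2021
        except ValueError:
            problems.append("line %d: bad IMAGING_DATE %r" % (line_no, date))
    return problems
```

Calling it with the text of `metadata_3.tsv` and the set of file names in the incoming folder returns an empty list when every row passes both checks.
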
diff --git a/drop-boxes/register-attachments-dropbox/register-attachment-dropbox.py b/drop-boxes/register-attachments-dropbox/register-attachment-dropbox.py
@@ -92,18 +92,6 @@ def process(transaction):
 		sa.setExperiment(exp)
 	info = None
 
-	#if isProject:
-	#experiments = search_service.listExperiments("/" + space + "/" + project)
-	#for e in experiments:
-	#	if project+"_INFO" in e.getExperimentIdentifier():
-	#		info = e
-	#if not info:
-	#	info = transaction.createNewExperiment('/' + space + '/' + project + '/'+ project+'_INFO', "Q_PROJECT_DETAILS")
-	#else:
-	#	info = transaction.getExperiment('/' + space + '/' + project + '/' + code)
-	# register new experiment and sample
-	#sa.setExperiment(info) 
-	# create new dataset 
 	dataSet = transaction.createNewDataSet("Q_PROJECT_DATA")
 	dataSet.setMeasuredData(False)
 	dataSet.setPropertyValue("Q_SECONDARY_NAME", secname)

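
The `metadata.txt` format that this dropbox consumes (line-separated `key=value` pairs, with an attachment sample code matching `Q[A-Z0-9]{4}000`, as described in the README) can be illustrated with a small parser. This is a sketch in plain Python 3, not the code of `register-attachment-dropbox.py`. `parse_attachment_metadata` is a hypothetical name, and treating `user` as an optional key is an assumption based on the README wording.

```python
import re

# Attachment sample code: project code followed by three zeroes (per the README).
ATTACHMENT_CODE = re.compile(r"Q[A-Z0-9]{4}000")

# Assumption: "user" may be absent, since the README calls it optional.
REQUIRED_KEYS = ("info", "barcode", "type")


def parse_attachment_metadata(text):
    """Parse key=value lines into a dict and check the attachment sample code.

    Hypothetical helper for illustration only.
    """
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a key=value pair: %r" % line)
        metadata[key.strip()] = value.strip()
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        raise ValueError("missing keys: %s" % ", ".join(missing))
    if not ATTACHMENT_CODE.fullmatch(metadata["barcode"]):
        raise ValueError("invalid attachment sample code: %r" % metadata["barcode"])
    return metadata
```

Splitting with `partition` rather than `split("=")` keeps any further `=` characters inside the value, which matters for free-text fields such as `info`.
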
diff --git a/drop-boxes/register-hlatyping-dropbox/register-hlatyping.py b/drop-boxes/register-hlatyping-dropbox/register-hlatyping.py
@@ -31,7 +31,7 @@
 ### We need this object to update the sample location later
 SAMPLE_TRACKER = SampleTracker.createQBiCSampleTracker(SERVICE_REGISTRY_URL, SERVICE_CREDENTIALS, QBIC_LOCATION)
 
-# ETL script for registration of VCF files
+# ETL script for registration of HLA Typing
 # expected:
 # *Q[Project Code]^4[Sample No.]^3[Sample Type][Checksum]*.*
 pattern = re.compile('Q\w{4}[0-9]{3}[a-zA-Z]\w')

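
For readers unfamiliar with the sample code expression in the hunk above, the sketch below shows what it matches. It uses the same expression written as a raw string in plain Python 3. How `register-hlatyping.py` applies `pattern` is not visible in this hunk, so the `search`-based helper `extract_sample_code` is a hypothetical illustration only.

```python
import re

# Same expression as in register-hlatyping.py, written as a raw string:
# Q + 4 word characters (project) + 3 digits (sample no.) + letter (type) + checksum character.
QBIC_CODE = re.compile(r"Q\w{4}[0-9]{3}[a-zA-Z]\w")


def extract_sample_code(name):
    """Return the first QBiC sample code found in name, or None.

    Hypothetical helper; the regex only checks the shape of the code,
    it does not verify the checksum character.
    """
    match = QBIC_CODE.search(name)
    return match.group(0) if match else None
```

Because the expression is unanchored, `search` also finds the analyte code inside a prefixed id such as `HLA1QABCD001AE`.
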
diff --git a/drop-boxes/register-imgag-dropbox/README.md b/drop-boxes/register-imgag-dropbox/README.md