
Merge branch 'CAGI6-version' into 'master'
Ditto pipeline just after CAGI6 project

See merge request center-for-computational-genomics-and-data-science/sciops/ditto!3
tkmamidi committed Oct 25, 2021
2 parents dd06512 + c665340 commit bcf8822
Showing 32 changed files with 2,916 additions and 35 deletions.
19 changes: 16 additions & 3 deletions .gitignore
@@ -7,6 +7,7 @@ __pycache__/

# Distribution / packaging
.Python
#env/
build/
develop-eggs/
dist/
@@ -18,9 +19,18 @@ lib64/
parts/
sdist/
var/
dask-worker-space/
*.egg-info/
.installed.cfg
*.egg
*.err
*.out
*.db
*.py*.sh
*.tsv
*.csv
*.gz*


# PyInstaller
# Usually these files are written by a python script from a template
@@ -41,6 +51,7 @@ htmlcov/
nosetests.xml
coverage.xml
*,cover
*.pdf

# Translations
*.mo
@@ -72,12 +83,13 @@ target/
.ipynb_checkpoints/

# exclude data from source control by default
# data/
variant_annotation/data/
/data/
cagi*/

#snakemake
.snakemake/

# data/
variant_annotation/data/

# exclude test data used for development
to_be_deleted/test_data/data/ref
Expand All @@ -92,3 +104,4 @@ logs/
# .java/fonts dir get created when creating fastqc conda env
.java/

/.vscode/settings.json
139 changes: 138 additions & 1 deletion README.md
@@ -1,3 +1,140 @@
# DITTO

Diagnosis prediction tool using AI
***!!! For research purposes only !!!***

- [DITTO](#ditto)
- [Data](#data)
- [Usage](#usage)
- [Installation](#installation)
- [Requirements](#requirements)
- [Activate conda environment](#activate-conda-environment)
- [Steps to run DITTO predictions](#steps-to-run-ditto-predictions)
- [Run VEP annotation](#run-vep-annotation)
- [Parse VEP annotations](#parse-vep-annotations)
- [Filter variants for Ditto prediction](#filter-variants-for-ditto-prediction)
- [DITTO prediction](#ditto-prediction)
- [Combine with Exomiser scores](#combine-with-exomiser-scores)
- [Cohort level analysis](#cohort-level-analysis)
- [Contact information](#contact-information)

**Aim:** We aim to develop a pipeline for the accurate and rapid prioritization of variants using a patient's genotype (VCF) and/or phenotype (HPO) information.

## Data

Input for this project is a single-sample VCF file, which is annotated using VEP and passed to DITTO for predictions.

## Usage

### Installation

Installation simply requires fetching the source code. The following is required:

- Git

To fetch the source code, change into a directory of your choice and run:

```sh
git clone -b master \
--recurse-submodules \
git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/ditto.git
```

### Requirements

*OS:*

Currently works only on Linux. Docker versions may be explored later to make it usable on macOS (and
potentially Windows).

*Tools:*

- Anaconda3
- Tested with version: 2020.02

### Activate conda environment

Change into the root directory and run the commands below:

```sh
# create conda environment. Needed only the first time.
conda env create --file configs/envs/testing.yaml

# if you need to update existing environment
conda env update --file configs/envs/testing.yaml

# activate conda environment
conda activate testing
```

### Steps to run DITTO predictions

Remove variants with `*` in the `ALT` allele column. These "spanning or overlapping deletions" were introduced in the VCF v4.3 specification; more on them [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-).
The version of VEP we currently use doesn't support these variants; we will address this in a future release. Note that the command below also drops records that are not SNVs (`type!="snp"`).

```sh
bcftools annotate -e'ALT="*" || type!="snp"' path/to/indexed_vcf.gz -Oz -o path/to/indexed_vcf_filtered.vcf.gz
```

#### Run VEP annotation

Please look at the steps to run VEP [here](variant_annotation/README.md).


#### Parse VEP annotations

Please look at the steps to parse VEP annotations [here](annotation_parsing/README.md).


#### Filter variants for Ditto prediction

The filtering step includes imputation and one-hot encoding of columns.

```sh
python src/Ditto/filter.py -i path/to/parsed_vcf_file.tsv -O path/to/output_directory
```

Output from this step includes:

```directory
output_directory/
├── data.csv             <--- used for DITTO predictions
├── Nulls.csv            <--- number of nulls in each column
├── stats_nssnv.csv      <--- variant stats from the VCF
├── correlation_plot.pdf <--- plot to check whether any columns are directly correlated (cutoff >0.95)
└── columns.csv          <--- columns before and after the filtering step
```
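For intuition, the imputation and one-hot encoding performed by `src/Ditto/filter.py` can be sketched in pure Python. This is a hypothetical illustration, not the actual implementation; the function name, column names, and median-imputation choice are all assumptions:

```python
# Hypothetical sketch of the filtering step: median-impute numeric columns
# and one-hot encode categorical ones. Not the actual src/Ditto/filter.py logic.
from statistics import median

def impute_and_encode(rows, numeric_cols, categorical_cols):
    """Median-impute numeric nulls and one-hot encode categorical columns."""
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in numeric_cols
    }
    # Collect observed categories so one-hot columns come out in a stable order.
    categories = {col: sorted({r[col] for r in rows}) for col in categorical_cols}
    encoded = []
    for r in rows:
        out = {c: (r[c] if r[c] is not None else medians[c]) for c in numeric_cols}
        for col in categorical_cols:
            for cat in categories[col]:
                out[f"{col}_{cat}"] = 1 if r[col] == cat else 0
        encoded.append(out)
    return encoded

variants = [
    {"CADD": 24.1, "Consequence": "missense"},
    {"CADD": None, "Consequence": "synonymous"},  # null is filled with the median
    {"CADD": 10.3, "Consequence": "missense"},
]
clean = impute_and_encode(variants, ["CADD"], ["Consequence"])
```

The real pipeline additionally records the null counts (`Nulls.csv`) and the dropped columns (`columns.csv`) listed above.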

#### DITTO prediction

```sh
python src/Ditto/predict.py -i path/to/output_directory/data.csv --sample sample_name -o path/to/output_directory/ditto_predictions.csv -o100 path/to/output_directory/ditto_predictions_100.csv
```

#### Combine with Exomiser scores

If phenotype terms are present for the sample, one could use Exomiser to rank genes and then prioritize DITTO predictions according to the phenotype. Once you have Exomiser scores, please run the following command to combine the Exomiser and DITTO scores:

```sh
python src/Ditto/combine_scores.py --raw path/to/parsed_vcf_file.tsv --sample sample_name --ditto path/to/output_directory/ditto_predictions.csv -ep path/to/exomiser_scores/directory -o path/to/output_directory/predictions_with_exomiser.csv -o100 path/to/output_directory/predictions_with_exomiser_100.csv
```
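As a rough illustration, the sketch below combines a per-variant DITTO score with a per-gene Exomiser score by ranking on their product. The actual combination logic lives in `src/Ditto/combine_scores.py`; the product rule, function name, and data shapes here are assumptions:

```python
# Hypothetical sketch of score combination; the real logic is in
# src/Ditto/combine_scores.py and may differ.
def combine_scores(ditto, exomiser, missing_exomiser=0.0):
    """ditto: {(variant, gene): score}; exomiser: {gene: score}.
    Returns (variant, gene, combined) tuples sorted highest-first."""
    combined = [
        (variant, gene, d_score * exomiser.get(gene, missing_exomiser))
        for (variant, gene), d_score in ditto.items()
    ]
    return sorted(combined, key=lambda rec: rec[2], reverse=True)

# Invented example scores: a strong phenotype match boosts BRCA1.
ditto_scores = {("1-1000-A-G", "BRCA1"): 0.9, ("2-2000-C-T", "TTN"): 0.8}
exomiser_scores = {"BRCA1": 0.95, "TTN": 0.10}
ranked = combine_scores(ditto_scores, exomiser_scores)
```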


### Cohort level analysis

Please refer to the [CAGI6-RGP](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6) project for the filtering and annotation of variants, as done above for a single-sample VCF, along with calculating Exomiser scores.

For predictions, make the necessary directory edits to the Snakemake [workflow](workflow/Snakefile) and run the following command:

```sh
sbatch src/predict_variant_score.sh
```

**Note**: The commit used for the CAGI6 challenge pipeline is [be97cf5d](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/ditto/-/merge_requests/3/diffs?commit_id=be97cf5dbfcb099ac82ef28d5d8b0919f28aed99). It was used along with annotated VCFs and Exomiser scores obtained from the [rgp_cagi6 workflow](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6).


## Contact information

For issues, please send an email with a clear description to

Tarun Mamidi - tmamidi@uab.edu
45 changes: 23 additions & 22 deletions annotation_parsing/parse_annotated_vars.py
@@ -17,10 +17,10 @@ def parse_n_print(vcf, outfile):
output_header = ["Chromosome", "Position", "Reference Allele", "Alternate Allele"] + \
line.replace(" Allele|"," VEP_Allele_Identifier|").split("Format: ")[1].rstrip(">").rstrip('"').split("|")
elif line.startswith("#CHROM"):
vcf_header = line.split("\t")
vcf_header = line.split("\t")
else:
break

for idx, sample in enumerate(vcf_header):
if idx > 8:
output_header.append(f"{sample} allele depth")
@@ -36,7 +36,8 @@ def parse_n_print(vcf, outfile):
line = line.rstrip("\n")
cols = line.split("\t")
csq = parse_csq(next(filter(lambda info: info.startswith("CSQ="),cols[7].split(";"))).replace("CSQ=",""))
var_info = parse_var_info(vcf_header, cols)
#print(line, file=open("var_info.txt", "w"))
#var_info = parse_var_info(vcf_header, cols)
alt_alleles = cols[4].split(",")
alt2csq = format_alts_for_csq_lookup(cols[3], alt_alleles)
for alt_allele in alt_alleles:
@@ -45,14 +46,14 @@ def parse_n_print(vcf, outfile):
possible_alt_allele4lookup = alt_allele
try:
write_parsed_variant(
out,
vcf_header,
cols[0],
cols[1],
cols[3],
alt_allele,
csq[possible_alt_allele4lookup],
var_info[alt_allele]
out,
vcf_header,
cols[0],
cols[1],
cols[3],
alt_allele,
csq[possible_alt_allele4lookup]
#,var_info[alt_allele]
)
except KeyError:
print("Variant annotation matching based on allele failed!")
@@ -62,15 +63,15 @@ def parse_n_print(vcf, outfile):
raise SystemExit(1)


def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots, var_info):
def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots):#, var_info):
var_list = [chr, pos, ref, alt]
for annot_info in annots:
full_fmt_list = var_list + annot_info
for idx, sample in enumerate(vcf_header):
if idx > 8:
full_fmt_list.append(str(var_info[sample]["alt_depth"]))
full_fmt_list.append(str(var_info[sample]["total_depth"]))
full_fmt_list.append(str(var_info[sample]["prct_reads"]))
#for idx, sample in enumerate(vcf_header):
# if idx > 8:
# full_fmt_list.append(str(var_info[sample]["alt_depth"]))
# full_fmt_list.append(str(var_info[sample]["total_depth"]))
# full_fmt_list.append(str(var_info[sample]["prct_reads"]))

out_fp.write("\t".join(full_fmt_list) + "\n")

@@ -103,9 +104,9 @@ def parse_csq(csq):
parsed_annot = annot.split("|")
if parsed_annot[0] not in csq_allele_dict:
csq_allele_dict[parsed_annot[0]] = list()

csq_allele_dict[parsed_annot[0]].append(parsed_annot)

return csq_allele_dict


@@ -129,13 +130,13 @@ def parse_var_info(headers, cols):
alt_depth = int(ad_info[alt_index + 1])
total_depth = sum([int(dp) for dp in ad_info])
prct_reads = (alt_depth / total_depth) * 100

allele_dict[sample] = {
"alt_depth": alt_depth,
"total_depth": total_depth,
"prct_reads": prct_reads
}

parsed_alleles[alt_allele] = allele_dict

return parsed_alleles
@@ -184,5 +185,5 @@ def is_valid_file(p, arg):

inputf = Path(ARGS.input_vcf)
outputf = Path(ARGS.output) if ARGS.output else inputf.parent / inputf.stem.rstrip(".vcf") + ".tsv"

parse_n_print(inputf, outputf)
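The allele-grouping behavior of `parse_csq` in the diff above can be exercised on its own. This standalone sketch reproduces that grouping; the example CSQ records are invented and truncated for brevity:

```python
# Standalone sketch of the CSQ grouping done by parse_csq above: VEP packs
# comma-separated annotation records into the CSQ INFO value, each record a
# pipe-delimited list whose first field is the VEP allele identifier.
def parse_csq(csq):
    """Group pipe-delimited CSQ records by their VEP allele identifier."""
    csq_allele_dict = {}
    for annot in csq.split(","):
        parsed_annot = annot.split("|")
        csq_allele_dict.setdefault(parsed_annot[0], []).append(parsed_annot)
    return csq_allele_dict

# Two annotations for allele A, one for allele T (fields invented/truncated).
info = "A|missense_variant|BRCA1,A|intron_variant|BRCA1,T|synonymous_variant|TTN"
by_allele = parse_csq(info)
```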
16 changes: 16 additions & 0 deletions configs/cluster_config.json
@@ -0,0 +1,16 @@
{
"__default__": {
"ntasks": 1,
"partition": "express",
"cpus-per-task": "{threads}",
"mem-per-cpu": "4G",
"output": "logs/rule_logs/{rule}-%j.log"
},
"ditto_filter": {
"partition": "largemem",
"mem-per-cpu": "200G"
},
"combine_scores": {
"mem-per-cpu": "50G"
}
}
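For context on how these values are consumed: with Snakemake's `--cluster-config`, per-rule entries override the `__default__` block before being substituted into the `sbatch` command line. A minimal sketch of that resolution (the helper function is ours, not part of Snakemake):

```python
# Sketch of cluster-config resolution: per-rule entries win over __default__.
# The JSON below is a trimmed copy of configs/cluster_config.json above.
import json

def resolve_rule_config(cluster_config, rule):
    """Merge a rule's settings over the __default__ block."""
    resolved = dict(cluster_config["__default__"])
    resolved.update(cluster_config.get(rule, {}))
    return resolved

config = json.loads("""{
  "__default__": {"ntasks": 1, "partition": "express", "mem-per-cpu": "4G"},
  "ditto_filter": {"partition": "largemem", "mem-per-cpu": "200G"}
}""")
ditto_filter = resolve_rule_config(config, "ditto_filter")
# ditto_filter lands on the largemem partition with 200G per CPU,
# while inheriting ntasks=1 from __default__.
```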
