
Merge branch 'CAGI6-version' into 'master'
Ditto pipeline just after CAGI6 project

See merge request center-for-computational-genomics-and-data-science/sciops/ditto!3
tkmamidi committed Oct 25, 2021
2 parents dd06512 + c665340 commit bcf8822
Showing 32 changed files with 2,916 additions and 35 deletions.
19 changes: 16 additions & 3 deletions .gitignore
@@ -7,6 +7,7 @@ __pycache__/

# Distribution / packaging
.Python
#env/
build/
develop-eggs/
dist/
@@ -18,9 +19,18 @@ lib64/
parts/
sdist/
var/
dask-worker-space/
*.egg-info/
.installed.cfg
*.egg
*.err
*.out
*.db
*.py*.sh
*.tsv
*.csv
*.gz*


# PyInstaller
# Usually these files are written by a python script from a template
@@ -41,6 +51,7 @@ htmlcov/
nosetests.xml
coverage.xml
*,cover
*.pdf

# Translations
*.mo
@@ -72,12 +83,13 @@ target/
.ipynb_checkpoints/

# exclude data from source control by default
# data/
variant_annotation/data/
/data/
cagi*/

#snakemake
.snakemake/

# data/
variant_annotation/data/

# exclude test data used for development
to_be_deleted/test_data/data/ref
Expand All @@ -92,3 +104,4 @@ logs/
# .java/fonts dir get created when creating fastqc conda env
.java/

/.vscode/settings.json
139 changes: 138 additions & 1 deletion README.md
@@ -1,3 +1,140 @@
# DITTO

Diagnosis prediction tool using AI
***!!! For research purposes only !!!***

- [DITTO](#ditto)
- [Data](#data)
- [Usage](#usage)
- [Installation](#installation)
- [Requirements](#requirements)
- [Activate conda environment](#activate-conda-environment)
- [Steps to run DITTO predictions](#steps-to-run-ditto-predictions)
- [Run VEP annotation](#run-vep-annotation)
- [Parse VEP annotations](#parse-vep-annotations)
- [Filter variants for Ditto prediction](#filter-variants-for-ditto-prediction)
- [DITTO prediction](#ditto-prediction)
- [Combine with Exomiser scores](#combine-with-exomiser-scores)
- [Cohort level analysis](#cohort-level-analysis)
- [Contact information](#contact-information)

**Aim:** We aim to develop a pipeline for the accurate and rapid prioritization of variants using a patient's genotype (VCF) and/or phenotype (HPO) information.

## Data

Input for this project is a single-sample VCF file, which is annotated using VEP and passed to DITTO for predictions.

## Usage

### Installation

Installation simply requires fetching the source code. The following is required:

- Git

To fetch the source code, change into a directory of your choice and run:

```sh
git clone -b master \
--recurse-submodules \
git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/ditto.git
```

### Requirements

*OS:*

Currently works only on Linux. Docker versions may be explored later to make it usable on macOS (and
potentially Windows).

*Tools:*

- Anaconda3
- Tested with version: 2020.02

### Activate conda environment

Change into the root directory and run the commands below:

```sh
# create conda environment. Needed only the first time.
conda env create --file configs/envs/testing.yaml

# if you need to update existing environment
conda env update --file configs/envs/testing.yaml

# activate conda environment
conda activate testing
```

### Steps to run DITTO predictions

Remove variants with `*` in the `ALT` allele column. These "spanning or overlapping deletions" were introduced in the VCF v4.3 specification; more on them [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-).
The version of VEP we currently use doesn't support these variants; we will address this in a future release. Note that the command below also drops records that are not SNVs (`type!="snp"`).

```sh
bcftools annotate -e'ALT="*" || type!="snp"' path/to/indexed_vcf.gz -Oz -o path/to/indexed_vcf_filtered.vcf.gz
```

#### Run VEP annotation

Please look at the steps to run VEP [here](variant_annotation/README.md).


#### Parse VEP annotations

Please look at the steps to parse VEP annotations [here](annotation_parsing/README.md).


#### Filter variants for Ditto prediction

The filtering step includes imputation and one-hot encoding of columns.

```sh
python src/Ditto/filter.py -i path/to/parsed_vcf_file.tsv -O path/to/output_directory
```

Output from this step includes:

```directory
output_directory/
├── data.csv             <--- used for DITTO predictions
├── Nulls.csv            <--- number of nulls in each column
├── stats_nssnv.csv      <--- variant stats from the VCF
├── correlation_plot.pdf <--- plot to check whether any columns are directly correlated (cutoff >0.95)
└── columns.csv          <--- columns before and after the filtering step
```
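For intuition, the imputation and one-hot encoding performed by `src/Ditto/filter.py` can be sketched in pure Python. This is a hypothetical illustration, not the actual implementation; the function name, column names, and median-imputation choice are all assumptions:

```python
# Hypothetical sketch of the filtering step: median-impute numeric columns
# and one-hot encode categorical ones. Not the actual src/Ditto/filter.py logic.
from statistics import median

def impute_and_encode(rows, numeric_cols, categorical_cols):
    """Median-impute numeric nulls and one-hot encode categorical columns."""
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in numeric_cols
    }
    # Collect observed categories so one-hot columns come out in a stable order.
    categories = {col: sorted({r[col] for r in rows}) for col in categorical_cols}
    encoded = []
    for r in rows:
        out = {c: (r[c] if r[c] is not None else medians[c]) for c in numeric_cols}
        for col in categorical_cols:
            for cat in categories[col]:
                out[f"{col}_{cat}"] = 1 if r[col] == cat else 0
        encoded.append(out)
    return encoded

variants = [
    {"CADD": 24.1, "Consequence": "missense"},
    {"CADD": None, "Consequence": "synonymous"},  # null is filled with the median
    {"CADD": 10.3, "Consequence": "missense"},
]
clean = impute_and_encode(variants, ["CADD"], ["Consequence"])
```

The real pipeline additionally records the null counts (`Nulls.csv`) and the dropped columns (`columns.csv`) listed above.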

#### DITTO prediction

```sh
python src/Ditto/predict.py -i path/to/output_directory/data.csv --sample sample_name -o path/to/output_directory/ditto_predictions.csv -o100 path/to/output_directory/ditto_predictions_100.csv
```

#### Combine with Exomiser scores

If phenotype terms are present for the sample, one could use Exomiser to rank genes and then prioritize DITTO predictions according to the phenotype. Once you have Exomiser scores, please run the following command to combine the Exomiser and DITTO scores:

```sh
python src/Ditto/combine_scores.py --raw path/to/parsed_vcf_file.tsv --sample sample_name --ditto path/to/output_directory/ditto_predictions.csv -ep path/to/exomiser_scores/directory -o path/to/output_directory/predictions_with_exomiser.csv -o100 path/to/output_directory/predictions_with_exomiser_100.csv
```
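As a rough illustration, the sketch below combines a per-variant DITTO score with a per-gene Exomiser score by ranking on their product. The actual combination logic lives in `src/Ditto/combine_scores.py`; the product rule, function name, and data shapes here are assumptions:

```python
# Hypothetical sketch of score combination; the real logic is in
# src/Ditto/combine_scores.py and may differ.
def combine_scores(ditto, exomiser, missing_exomiser=0.0):
    """ditto: {(variant, gene): score}; exomiser: {gene: score}.
    Returns (variant, gene, combined) tuples sorted highest-first."""
    combined = [
        (variant, gene, d_score * exomiser.get(gene, missing_exomiser))
        for (variant, gene), d_score in ditto.items()
    ]
    return sorted(combined, key=lambda rec: rec[2], reverse=True)

# Invented example scores: a strong phenotype match boosts BRCA1.
ditto_scores = {("1-1000-A-G", "BRCA1"): 0.9, ("2-2000-C-T", "TTN"): 0.8}
exomiser_scores = {"BRCA1": 0.95, "TTN": 0.10}
ranked = combine_scores(ditto_scores, exomiser_scores)
```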


### Cohort level analysis

Please refer to the [CAGI6-RGP](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6) project for the filtering and annotation of variants, as done above for a single-sample VCF, along with calculating Exomiser scores.

For predictions, make the necessary directory edits to the Snakemake [workflow](workflow/Snakefile) and run the following command:

```sh
sbatch src/predict_variant_score.sh
```

**Note**: The commit used for the CAGI6 challenge pipeline is [be97cf5d](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/ditto/-/merge_requests/3/diffs?commit_id=be97cf5dbfcb099ac82ef28d5d8b0919f28aed99). It was used along with annotated VCFs and Exomiser scores obtained from the [rgp_cagi6 workflow](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6).


## Contact information

For issues, please send an email with a clear description to

Tarun Mamidi - tmamidi@uab.edu
45 changes: 23 additions & 22 deletions annotation_parsing/parse_annotated_vars.py
@@ -17,10 +17,10 @@ def parse_n_print(vcf, outfile):
output_header = ["Chromosome", "Position", "Reference Allele", "Alternate Allele"] + \
line.replace(" Allele|"," VEP_Allele_Identifier|").split("Format: ")[1].rstrip(">").rstrip('"').split("|")
elif line.startswith("#CHROM"):
vcf_header = line.split("\t")
vcf_header = line.split("\t")
else:
break

for idx, sample in enumerate(vcf_header):
if idx > 8:
output_header.append(f"{sample} allele depth")
@@ -36,7 +36,8 @@ def parse_n_print(vcf, outfile):
line = line.rstrip("\n")
cols = line.split("\t")
csq = parse_csq(next(filter(lambda info: info.startswith("CSQ="),cols[7].split(";"))).replace("CSQ=",""))
var_info = parse_var_info(vcf_header, cols)
#print(line, file=open("var_info.txt", "w"))
#var_info = parse_var_info(vcf_header, cols)
alt_alleles = cols[4].split(",")
alt2csq = format_alts_for_csq_lookup(cols[3], alt_alleles)
for alt_allele in alt_alleles:
@@ -45,14 +46,14 @@ def parse_n_print(vcf, outfile):
possible_alt_allele4lookup = alt_allele
try:
write_parsed_variant(
out,
vcf_header,
cols[0],
cols[1],
cols[3],
alt_allele,
csq[possible_alt_allele4lookup],
var_info[alt_allele]
out,
vcf_header,
cols[0],
cols[1],
cols[3],
alt_allele,
csq[possible_alt_allele4lookup]
#,var_info[alt_allele]
)
except KeyError:
print("Variant annotation matching based on allele failed!")
@@ -62,15 +63,15 @@ def parse_n_print(vcf, outfile):
raise SystemExit(1)


def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots, var_info):
def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots):#, var_info):
var_list = [chr, pos, ref, alt]
for annot_info in annots:
full_fmt_list = var_list + annot_info
for idx, sample in enumerate(vcf_header):
if idx > 8:
full_fmt_list.append(str(var_info[sample]["alt_depth"]))
full_fmt_list.append(str(var_info[sample]["total_depth"]))
full_fmt_list.append(str(var_info[sample]["prct_reads"]))
#for idx, sample in enumerate(vcf_header):
# if idx > 8:
# full_fmt_list.append(str(var_info[sample]["alt_depth"]))
# full_fmt_list.append(str(var_info[sample]["total_depth"]))
# full_fmt_list.append(str(var_info[sample]["prct_reads"]))

out_fp.write("\t".join(full_fmt_list) + "\n")

@@ -103,9 +104,9 @@ def parse_csq(csq):
parsed_annot = annot.split("|")
if parsed_annot[0] not in csq_allele_dict:
csq_allele_dict[parsed_annot[0]] = list()

csq_allele_dict[parsed_annot[0]].append(parsed_annot)

return csq_allele_dict


@@ -129,13 +130,13 @@ def parse_var_info(headers, cols):
alt_depth = int(ad_info[alt_index + 1])
total_depth = sum([int(dp) for dp in ad_info])
prct_reads = (alt_depth / total_depth) * 100

allele_dict[sample] = {
"alt_depth": alt_depth,
"total_depth": total_depth,
"prct_reads": prct_reads
}

parsed_alleles[alt_allele] = allele_dict

return parsed_alleles
@@ -184,5 +185,5 @@ def is_valid_file(p, arg):

inputf = Path(ARGS.input_vcf)
outputf = Path(ARGS.output) if ARGS.output else inputf.parent / inputf.stem.rstrip(".vcf") + ".tsv"

parse_n_print(inputf, outputf)
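The allele-grouping behavior of `parse_csq` in the diff above can be exercised on its own. This standalone sketch reproduces that grouping; the example CSQ records are invented and truncated for brevity:

```python
# Standalone sketch of the CSQ grouping done by parse_csq above: VEP packs
# comma-separated annotation records into the CSQ INFO value, each record a
# pipe-delimited list whose first field is the VEP allele identifier.
def parse_csq(csq):
    """Group pipe-delimited CSQ records by their VEP allele identifier."""
    csq_allele_dict = {}
    for annot in csq.split(","):
        parsed_annot = annot.split("|")
        csq_allele_dict.setdefault(parsed_annot[0], []).append(parsed_annot)
    return csq_allele_dict

# Two annotations for allele A, one for allele T (fields invented/truncated).
info = "A|missense_variant|BRCA1,A|intron_variant|BRCA1,T|synonymous_variant|TTN"
by_allele = parse_csq(info)
```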
16 changes: 16 additions & 0 deletions configs/cluster_config.json
@@ -0,0 +1,16 @@
{
"__default__": {
"ntasks": 1,
"partition": "express",
"cpus-per-task": "{threads}",
"mem-per-cpu": "4G",
"output": "logs/rule_logs/{rule}-%j.log"
},
"ditto_filter": {
"partition": "largemem",
"mem-per-cpu": "200G"
},
"combine_scores": {
"mem-per-cpu": "50G"
}
}
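For context on how these values are consumed: with Snakemake's `--cluster-config`, per-rule entries override the `__default__` block before being substituted into the `sbatch` command line. A minimal sketch of that resolution (the helper function is ours, not part of Snakemake):

```python
# Sketch of cluster-config resolution: per-rule entries win over __default__.
# The JSON below is a trimmed copy of configs/cluster_config.json above.
import json

def resolve_rule_config(cluster_config, rule):
    """Merge a rule's settings over the __default__ block."""
    resolved = dict(cluster_config["__default__"])
    resolved.update(cluster_config.get(rule, {}))
    return resolved

config = json.loads("""{
  "__default__": {"ntasks": 1, "partition": "express", "mem-per-cpu": "4G"},
  "ditto_filter": {"partition": "largemem", "mem-per-cpu": "200G"}
}""")
ditto_filter = resolve_rule_config(config, "ditto_filter")
# ditto_filter lands on the largemem partition with 200G per CPU,
# while inheriting ntasks=1 from __default__.
```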
