# CRISPRCasMiner v1.5.0
A CRISPR-Cas systems mining pipeline.

`Input`: Metagenome-assembled genomes/contigs.

`Output`: Known CRISPR-Cas systems & suspicious proteins adjacent to CRISPR arrays.


*Note: before running this notebook, please add the following folders to your Google Drive ("MyDrive" folder):*

- [data](https://drive.google.com/drive/folders/1WcClm_TebUuuRY_b8i6n6EuEXdR_WhT2?usp=sharing)
- [inputs](https://drive.google.com/drive/folders/18GGlIEWYtJVTn2oBXMqbghyYQCLKelLg?usp=sharing)
- [ccminer](https://drive.google.com/drive/folders/1uPjufxWPjEmx2yDota60qVl05dugS0OS?usp=sharing)
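
Once Google Drive is mounted in Step.01, a quick check like the one below can confirm that the three folders are in place. This is a minimal sketch, assuming the `/content/drive/MyDrive` mount point used in this notebook; the helper name `missing_drive_folders` is hypothetical and not part of CRISPRCasMiner.

```python
from pathlib import Path

# Folder names taken from the list above.
REQUIRED_FOLDERS = ("data", "inputs", "ccminer")


def missing_drive_folders(mydrive="/content/drive/MyDrive", required=REQUIRED_FOLDERS):
    """Return the names in `required` that are not directories under `mydrive`."""
    root = Path(mydrive)
    return [name for name in required if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes Drive has already been mounted at /content/drive.
    missing = missing_drive_folders()
    if missing:
        print("Missing from MyDrive:", ", ".join(missing))
    else:
        print("All required folders found.")
```

If a folder is reported as missing, add it to "MyDrive" (for example via a shortcut to the shared folder) and run the check again.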

In [None]:
#@title Step.01 setup **CRISPRCasMiner** (~5m)
%%time
import os, time, signal
import sys, random, string, re
## mount google drive
from google.colab import drive
drive.mount('/content/drive')

## install conda
!pip install -q condacolab
import condacolab
condacolab.install()
!conda update -n base -c conda-forge conda -y

## install cctyper
!conda install -c bioconda hmmer -y
!conda install -c bioconda prodigal -y
!conda install -c bioconda minced -y
!conda install -c bioconda bioawk -y
!python -m pip install 'drawSvg~=1.9'
!python -m pip install 'cctyper~=1.8.0'

## install ccminer
!conda install -c bioconda mafft -y
!conda install -c bioconda viennarna -y
!conda install -c conda-forge webencodings -y
!conda install -c bioconda seqkit -y
!conda install -c bioconda seqtk -y

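## export database
## (`!export` only affects a temporary subshell, so set the variable in the notebook process instead)
os.environ['CCTYPER_DB'] = '/content/drive/MyDrive/data'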

In [None]:
#@title Step.02 run CRISPRCasMiner to generate suspicious Cas proteins (~1m)
%%time

#@markdown **Parameters** settings
#@markdown ---

from google.colab import files
import os
project_name = "test" #@param {type:"string"}
#@markdown - The `name` of your project.
input_fna =  "drive/MyDrive/inputs/input_test.fna" #@param {type:"string"}
if not input_fna:
  input_fna = files.upload()
  input_fna = list(input_fna.keys())[0]
#@markdown - Path to the metagenome-assembled genome/contig file in your Google Drive, e.g. `drive/MyDrive/inputs/input_test.fna`.
#@markdown - Or leave this field empty to upload your own genomic data (.fasta/.fna/.fa file).
prodigal_mode = "meta" #@param ["meta","single"] {type:"raw"}
#@markdown - The mode to run prodigal. `meta` or `single`.
span = 10 #@param {type:"raw"}
#@markdown - The number of proteins to inspect on each side of the CRISPR array (both upstream and downstream).
sample_name, _ = os.path.splitext(os.path.basename(input_fna))
path = project_name
!mkdir -p {path}

## run cctyper
!cctyper $(pwd)/{input_fna} $(pwd)/{path}/01_cctyper --db $(pwd)/drive/MyDrive/data --prodigal {prodigal_mode} --keep_tmp
!prodigal -i $(pwd)/{input_fna} -d $(pwd)/{path}/01_cctyper/genes.cds -p {prodigal_mode} > /dev/null
!rm -rf $(pwd)/{path}/01_cctyper/hmmer $(pwd)/{path}/01_cctyper/spacers $(pwd)/{path}/01_cctyper/*.log $(pwd)/{path}/01_cctyper/Flank*

## run ccminer
!python $(pwd)/drive/MyDrive/ccminer/ccminer.py $(pwd)/{input_fna} $(pwd)/{path}/02_ccminer --db $(pwd)/drive/MyDrive/data --prodigal {prodigal_mode} --span {span} --keep_tmp --cctyper_path $(pwd)/{path}/01_cctyper --database_name {project_name} --name {sample_name}
!awk '$12=="primary"  {printf ">%s\n%s\n",$9,$13} ' $(ls */02_ccminer/out.tab) > suspicious.faa
!rm -rf $(pwd)/{path}/02_ccminer/spacers

In [None]:
## show the suspicious proteins
!cat suspicious.faa

In [None]:
#@title Step.03 Package and download results
from google.colab import files
!zip -r {path}.result.zip ./{path}/01_cctyper ./{path}/02_ccminer ./suspicious.faa
files.download(f"{path}.result.zip")

# **Note**

Two bioinformatics pipelines were executed on your metagenome-assembled data:

- `cctyper`. Reports all known CRISPR-Cas systems identified with certainty.

- `ccminer`. Reports all suspicious proteins located near the CRISPR arrays, excluding the already known systems.


The folder tree of the output directory:

```
YOUR_PROJECT_output
output
├── 01_cctyper              # output of cctyper v1.8.0
├── 02_ccminer              # output of ccminer v1.5.0
│   ├── arguments.tab
│   ├── crisprs_calib.tab
│   ├── genes.tab
│   ├── minced.out
│   ├── nearCRISPR.txt
│  *├── out.tab             # proteins adjacent to CRISPRs
│   └── seqLen.tab
└── suspicious.faa          # suspicious Cas proteins
```
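
Before packaging the results, it can be useful to confirm that the ccminer files shown in the tree were actually written. The sketch below is one hedged way to do that; the file names are copied from the tree above, the helper `missing_ccminer_outputs` is hypothetical, and a missing file is not necessarily an error for every input.

```python
from pathlib import Path

# File names copied from the folder tree above.
EXPECTED_CCMINER_FILES = (
    "arguments.tab",
    "crisprs_calib.tab",
    "genes.tab",
    "minced.out",
    "nearCRISPR.txt",
    "out.tab",
    "seqLen.tab",
)


def missing_ccminer_outputs(project_dir):
    """Return expected ccminer files that are absent from <project_dir>/02_ccminer."""
    out_dir = Path(project_dir) / "02_ccminer"
    return [name for name in EXPECTED_CCMINER_FILES if not (out_dir / name).is_file()]
```

With the default project name used in Step.02, the call would be `missing_ccminer_outputs("test")`; an empty list means every listed file is present.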

If cctyper identified no known CRISPR-Cas systems, pay attention to the suspicious proteins adjacent to the CRISPR array. The ccminer output table, `out.tab`, has the following columns:

- `DB_name`. The name of YOUR_PROJECT.
- `Sample`. Sample name of YOUR_METAGENOMIC_DATA.
- `CRISPR`. CRISPR id.
- `DRrepeat`.
- `Remark`. Whether the protein is upstream or downstream of the CRISPR array.
- `Strand`.
- `Rank`. Distance between the protein and the CRISPR array.
- `Prolen`. Protein size (aa).
- `Protein`. Protein id.
- `HMM`. HMMER annotation of this protein.
- `typetag`. Types of CRISPR-Cas systems identified by cctyper (e.g. `I-*`, `II-*`,...). `Unkown` means that no CRISPR-Cas system was identified with certainty adjacent to the CRISPR array.
- `level`. Additional annotation of the protein:
  - `certain`: this protein was located within or very close to a known type of CRISPR-Cas system.
  - `putative`: this protein was located within or very close to a putative CRISPR-Cas system.
  - `large`: the size of this protein is over 2000 aa.
  - `split`: this protein might be incomplete.
  - `other`: this protein does not fall into any of the scenarios above.
  - `primary`: this protein falls under the 'other' category and has a size over 700 amino acids, indicating its potential to be a suspicious novel Cas protein.
- `pro_seq`. Amino acid sequence of the protein.
- `gene_seq`. CDS of the protein.
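
The column order above is what the `awk` command in Step.02 relies on (`$9` = `Protein`, `$12` = `level`, `$13` = `pro_seq`). For readers who prefer Python, the same filter can be sketched as follows. This is an illustrative sketch rather than part of ccminer: the helper names and the demo rows are hypothetical, and it assumes one row per line with the fields in the order listed.

```python
# Column order as listed above; the 1-based positions match the awk
# command in Step.02 ($9 = Protein, $12 = level, $13 = pro_seq).
COLUMNS = (
    "DB_name", "Sample", "CRISPR", "DRrepeat", "Remark", "Strand", "Rank",
    "Prolen", "Protein", "HMM", "typetag", "level", "pro_seq", "gene_seq",
)
I_PROTEIN = COLUMNS.index("Protein")
I_LEVEL = COLUMNS.index("level")
I_SEQ = COLUMNS.index("pro_seq")


def primary_proteins(lines, sep=None):
    """Yield (protein_id, sequence) for rows whose `level` is 'primary'.

    sep=None splits on runs of whitespace, like awk's default. Pass a tab
    character instead if the table can contain empty fields (an assumption
    to check against real data).
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip short or malformed rows; a header row is skipped as well
        # because its `level` field is not "primary".
        if len(fields) > I_SEQ and fields[I_LEVEL] == "primary":
            yield fields[I_PROTEIN], fields[I_SEQ]


def to_fasta(records):
    """Format (id, sequence) pairs as FASTA text."""
    return "".join(f">{pid}\n{seq}\n" for pid, seq in records)


# Hypothetical rows, invented for illustration only.
demo_rows = [
    "\t".join(["test", "s1", "c1", "GTT", "upstream", "+", "1", "120",
               "prot_1", "NA", "Unkown", "other", "MKT", "ATGAAAACC"]),
    "\t".join(["test", "s1", "c1", "GTT", "downstream", "-", "2", "900",
               "prot_2", "NA", "Unkown", "primary", "MKV", "ATGAAAGTT"]),
]
fasta_text = to_fasta(primary_proteins(demo_rows))
```

On real output, the same functions can be applied to an open file handle for `out.tab` (for the default project, `test/02_ccminer/out.tab`) in place of `demo_rows`.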