# Machines used

### MacBook Pro (macOS Mojave)

Most tools (except antiSMASH, Nerpa and the ones with web servers) were executed on a MacBook Pro with:

```
Processor: 2.4 GHz Intel Core i5, 8 CPUs.
Memory: 8 GB 2133 MHz LPDDR3.
```

### Server

antiSMASH and Nerpa were executed on a server node with:

```
Processor: Intel Xeon X7560 2.27 GHz, 16 CPUs.
Memory: ~13 GB RAM is more than enough for up to 100 genomes; if thousands of genomes are needed, use ~20 GB.
```

# Installation

### Anaconda

We used the command-line install according to `https://docs.anaconda.com/anaconda/install/mac-os/`.

### Docker

We installed the docker app from `https://docs.docker.com/desktop/mac/install/`.

### antiSMASH

antiSMASH can be installed using Anaconda:

`conda install -c bioconda antismash`

### BiG-SCAPE

1. Download the Pfam database:

`wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz`

2. Install software:

`conda create -n bigscape -c bioconda bigscape`
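The Pfam file downloaded in step 1 is gzip-compressed. A minimal sketch for unpacking it is shown below, assuming that the folder later passed to `--pfam_dir` should hold an uncompressed `Pfam-A.hmm` that has also been processed with HMMER's `hmmpress` (please verify this against the BiG-SCAPE documentation for your version). The function name `decompress_gzip` and the example path are illustrative and not part of any tool.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src, dst=None):
    """Decompress a .gz file and return the path of the uncompressed copy."""
    src = Path(src)
    if dst is None:
        if src.suffix != ".gz":
            raise ValueError(f"Expected a .gz file, got: {src}")
        # Default target: same name without the trailing ".gz".
        dst = src.with_suffix("")
    dst = Path(dst)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream the data in chunks so large files do not need to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    # Hypothetical location; use the folder you will pass to --pfam_dir.
    archive = Path("/path/to/pfam_files/Pfam-A.hmm.gz")
    if archive.exists():
        print("Wrote", decompress_gzip(archive))
    else:
        print("Archive not found:", archive)
```

After unpacking, `hmmpress Pfam-A.hmm` would be run inside the same folder if your BiG-SCAPE version requires pressed profiles.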

### SIRIUS/CANOPUS

Download the SIRIUS app from `https://boecker-lab.github.io/docs.sirius.github.io/install/#mac-osx`, then clone the CANOPUS treemap repository:

`git clone https://github.com/kaibioinfo/canopus_treemap.git`

### NPOmix

`1.` Clone the repository with `git clone https://github.com/tiagolbiotech/NPOmix_python`. 

Note: You can also use the Jupyter notebook version, but it is not as up to date as the Python version.

To do so, use `git clone https://github.com/tiagolbiotech/NPOmix.git`

`2.` Download the PoDP database

Please use the notebook `NPOmix_v1.0-notebook1_downloading-210811.ipynb` to download and parse the data. Alternatively, you can download ready-to-run data using the Zenodo links at the end of this document.

`3.` Install the following Python packages:

`conda install -c bioconda pyteomics`

`conda install -c anaconda requests`

`conda install -c anaconda networkx`

`conda install -c trentonoliphant datetime`

`conda install -c conda-forge biopython`

`conda install -c anaconda scikit-learn`
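Once the packages above are installed, a quick check that they can be found in the active environment may save a failed run later. The sketch below is only an illustration: the helper `missing_modules` is hypothetical, and the list uses import names, which differ from the conda package names for biopython (`Bio`) and scikit-learn (`sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages installed above.
# Note: biopython is imported as "Bio" and scikit-learn as "sklearn".
REQUIRED_MODULES = ["pyteomics", "requests", "networkx", "datetime", "Bio", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```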

### Nerpa

`conda install -c bioconda nerpa`

### NPLinker

`docker pull andrewramsay/nplinker:mibigtest`

### NRPminer

NRPminer is not publicly available yet; we were able to use it through our collaboration with Dr Alexey Gurevich, one of the developers.

### Other tools

The following tools can be used online: MetaMiner [(link here)](https://gnps.ucsd.edu/ProteoSAFe/index.jsp?params=%7B%22workflow%22:%22RIPPQUEST%22%7D); DeepRiPP [(link here)](http://deepripp.magarveylab.ca).

The following tools seem to be discontinued: GARLIC; GNP; NRPquest.

# Running the software

### antiSMASH

`antismash XXX`


### BiG-SCAPE


`conda activate bigscape`

`bigscape.py --pfam_dir /path/to/pfam_files/ -i /path/to/antismash/ \
-o /path/to/bigscape_outputs_220603_1791samples/ -c 8 --include_singletons --mibig`
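When BiG-SCAPE is run repeatedly on different folders, it can be convenient to assemble the call from Python. The sketch below only rebuilds the command shown above with the same flags. The helper names `build_bigscape_command` and `run_bigscape` are hypothetical, and the script assumes it is started from within the activated `bigscape` environment so that `bigscape.py` is on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_bigscape_command(pfam_dir, input_dir, output_dir, cores=8):
    """Assemble the argument list for the bigscape.py call shown above."""
    return [
        "bigscape.py",
        "--pfam_dir", str(pfam_dir),
        "-i", str(input_dir),
        "-o", str(output_dir),
        "-c", str(cores),
        "--include_singletons",
        "--mibig",
    ]


def run_bigscape(pfam_dir, input_dir, output_dir, cores=8):
    """Check that the input folders exist, then run BiG-SCAPE."""
    for folder in (pfam_dir, input_dir):
        if not Path(folder).is_dir():
            raise FileNotFoundError(f"Folder not found: {folder}")
    command = build_bigscape_command(pfam_dir, input_dir, output_dir, cores)
    # check=True raises CalledProcessError if BiG-SCAPE exits with an error.
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.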

### SIRIUS/CANOPUS


`cd /path/to/canopus_treemap/`

`./sirius.app/Contents/MacOS/sirius -i /path/to/all_mzml/ -o sirius_npomix --maxmz=800 lcms-align sirius zodiac fingerid canopus`

`./sirius.app/Contents/MacOS/sirius -i sirius_npomix mgf-export --merge-ms2  --quant-table=colabA-quant.csv --output colabA.mgf`
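After the `mgf-export` step, it may be worth confirming that the exported file actually contains spectra before passing it to NPOmix. The sketch below counts spectra by counting `BEGIN IONS` lines, which open each spectrum block in the MGF format. The function name `count_mgf_spectra` is illustrative, and the file name is taken from the command above.

```python
from pathlib import Path


def count_mgf_spectra(path):
    """Count spectra in an MGF file by counting its 'BEGIN IONS' lines."""
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip().upper() == "BEGIN IONS":
                count += 1
    return count


if __name__ == "__main__":
    mgf_file = Path("colabA.mgf")  # output name used in the export command above
    if mgf_file.exists():
        print(f"{mgf_file}: {count_mgf_spectra(mgf_file)} spectra")
    else:
        print("File not found:", mgf_file)
```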

### NPOmix Python version

1. Open the file `/path/to/NPOmix_python/npomix.py` (inside the folder where you cloned the repository).

2. Set the paths: replace `/path/to/` with the actual path to each folder or file.

```
mgf_folder = "/path/to/selected_mgf/" # this folder needs to contain all MGF files (MS/MS data) to be tested; they can be generated by the SIRIUS commands above
LCMS_folder = "/path/to/podp_LCMS_round5/" # this folder needs to contain all mz(X)ML files in the training set
ena_df_file = "/path/to/NPOmix_python/ena_dict-210315.csv" # this file contains the correspondence between ENA codes
input_bigscape_net = "/path/to/bigscape_all_c030_220512_1777samples.txt" # this file contains the BiG-SCAPE scores for all BGCs from all the genomes (in the paper we used 1,040 genomes)
antismash_folder = "/path/to/antismash/" # this folder needs to contain all antiSMASH files (annotated genomes) to be used in the training set
merged_ispec_mat_file = "/path/to/NPOmix_python/mass-affinity_df-NPOmix1.0_rene-%s.txt" % current_date # (OPTIONAL) if you already ran the step that produces merged_ispec_mat, you can skip this time-consuming step by providing this file
results_folder = "/path/to/NPOmix_python/main_code_results/" # folder where the results will be saved
k_value = 3
```

3. Run the tool:

`python npomix.py`
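Because a wrong path in step 2 is only discovered once `npomix.py` is already running, a small pre-flight check can help. The sketch below is an assumption-laden illustration, not part of NPOmix: the helper `find_missing_paths` is hypothetical, and the dictionary simply repeats the placeholder paths from the configuration above.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return {name: path} for every configured path that does not exist."""
    return {name: p for name, p in paths.items() if not Path(p).exists()}


if __name__ == "__main__":
    # Same placeholder values as in the configuration block; replace before use.
    config = {
        "mgf_folder": "/path/to/selected_mgf/",
        "LCMS_folder": "/path/to/podp_LCMS_round5/",
        "ena_df_file": "/path/to/NPOmix_python/ena_dict-210315.csv",
        "input_bigscape_net": "/path/to/bigscape_all_c030_220512_1777samples.txt",
        "antismash_folder": "/path/to/antismash/",
    }
    missing = find_missing_paths(config)
    for name, p in missing.items():
        print(f"{name}: not found at {p}")
    if not missing:
        print("All configured paths exist.")
```

The optional `merged_ispec_mat_file` and the `results_folder` are left out here, since the source does not say whether they must exist beforehand.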

### NPOmix website submission

Go to `https://www.tfleao.com/general-8` and submit your samples. 

This website also contains video tutorials and workshops (in Portuguese and English).

### Nerpa

```
# Run against the default Nerpa database of 8,368 known and putative NRP structures
# from the Nerpa paper supplement at https://zenodo.org/record/5503984
# (the full metadata is also there in pnrpdb_summary.tsv)
wget https://zenodo.org/record/5503984/files/pnrpdb_preprocessed.info   # downloading the preprocessed database
nerpa.py --threads $THREADS -a $ANTISMASH_PROCESSED_INPUT --structures pnrpdb_preprocessed.info --process-hybrids -o $OUTPUT_DIR

# Run against a specific set of structures, e.g., the selected 22 metabolites (in the SMILES format)
nerpa.py --threads $THREADS -a $ANTISMASH_PROCESSED_INPUT --smiles-tsv $SMILES_INPUT --col-smiles "SMILES" --col-id "Accession" --process-hybrids -o $OUTPUT_DIR

# Where:
# $THREADS - the number of threads (int)
# $ANTISMASH_PROCESSED_INPUT - the path to a directory with antiSMASH output or to a directory with multiple antiSMASH outputs, e.g., results for a set of genomes. Supported antiSMASH versions are v.3, v.5, and v.6.
# $SMILES_INPUT - the path to a tab-separated file with structures in the SMILES format; it may also include compound names (in this example, the names are assumed to be in the Accession column).
# $OUTPUT_DIR - the path to the output directory
#
# An example $SMILES_INPUT is attached.
# An example content of an $ANTISMASH_PROCESSED_INPUT is below (the key file Nerpa actually uses is <dirname>.json)

ls ./antiSMASH/GCA_000012265/
CP000076.1.region001.gbk  CP000076.1.region005.gbk  CP000076.1.region009.gbk  CP000076.1.region013.gbk  GCA_000012265.gbk   images             knownclusterblastoutput.txt
CP000076.1.region002.gbk  CP000076.1.region006.gbk  CP000076.1.region010.gbk  CP000076.1.region014.gbk  GCA_000012265.json  index.html         regions.js
CP000076.1.region003.gbk  CP000076.1.region007.gbk  CP000076.1.region011.gbk  CP000076.1.region015.gbk  GCA_000012265.zip   js                 svg
CP000076.1.region004.gbk  CP000076.1.region008.gbk  CP000076.1.region012.gbk  CP000076.1.region016.gbk  css                 knownclusterblast
```
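The two Nerpa inputs described above can be prepared and checked with a few lines of Python. The sketch below writes a tab-separated file with the `Accession` and `SMILES` columns named in the command, and lists antiSMASH sub-directories that lack their `<dirname>.json` file. Both helper names are hypothetical, the example row is a placeholder rather than one of the 22 metabolites, and the column order is an assumption since the source only names the columns.

```python
import csv
from pathlib import Path


def write_smiles_tsv(rows, path):
    """Write (accession, smiles) pairs to a tab-separated file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["Accession", "SMILES"])
        writer.writerows(rows)


def dirs_missing_json(root):
    """List sub-directories of root that lack their <dirname>.json file."""
    root = Path(root)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not (d / f"{d.name}.json").is_file()
    )


if __name__ == "__main__":
    # Placeholder entry for illustration only; replace with real structures.
    write_smiles_tsv([("example_1", "CCO")], "smiles_input.tsv")

    antismash_root = Path("./antiSMASH")  # folder holding one sub-directory per genome
    if antismash_root.is_dir():
        print("Sub-directories without a .json file:", dirs_missing_json(antismash_root))
```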

### NPLinker

We prepared the folder with the data according to `https://github.com/NPLinker/nplinker/wiki/LoadingLocalData`. 

We used the Docker version with the following command:

`docker run --name webapp -p 5006:5006 -v c:/Users/myusername/nplinker_shared:/data:rw andrewramsay/nplinker:mibigtest`

Then we visualized the results in the browser using the url `http://localhost:5006/npapp`.

# Downloaded NPOmix database

antiSMASH folders (BGCs only) are available at: `https://doi.org/10.5281/zenodo.6637083`.

LCMS metabolomes (MGF only) will be available soon; the MGFs first need to be extracted from the mzML files.