# Pharokka + Phold + Phynteny

[pharokka](https://github.com/gbouras13/pharokka) is a rapid standardised annotation tool for bacteriophage genomes and metagenomes. You can read more about pharokka in the [documentation](https://pharokka.readthedocs.io/).

[phold](https://github.com/gbouras13/phold) is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology. You can read more about phold in the [documentation](https://phold.readthedocs.io/).

phold uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a database of 803k protein structures mostly predicted using [Colabfold](https://github.com/sokrypton/ColabFold).

[phynteny](https://github.com/susiegriggo/Phynteny) uses a long short-term memory (LSTM) model trained on phage synteny (the conserved gene order across phages) to assign hypothetical phage proteins to a PHROG category.

**NOTE: Phynteny will only work if your phage has fewer than 120 predicted genes**

**If your phage(s) have 120 or more predicted genes, you should just skip running Phynteny (Cells 5+6)**
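The gene-count limit in the note above can be checked before deciding whether to run Phynteny. The sketch below counts `CDS` features per record in GenBank-formatted text using only the standard library. It assumes that "predicted genes" corresponds to CDS features and that the limit is the 120 quoted in the note; the function names are hypothetical and not part of pharokka, phold or phynteny.

```python
MAX_GENES = 120  # limit quoted in the note above (assumption: applies to CDS count)


def count_cds_per_record(genbank_text):
    """Return {locus_name: number_of_CDS_features} for GenBank-formatted text."""
    counts = {}
    current = None
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            current = parts[1] if len(parts) > 1 else f"record_{len(counts) + 1}"
            counts[current] = 0
        elif current is not None and line.startswith("     CDS "):
            # Feature keys start at column 6 of the GenBank feature table.
            counts[current] += 1
    return counts


def records_too_large(genbank_text, max_genes=MAX_GENES):
    """Return the records whose CDS count reaches or exceeds max_genes."""
    counts = count_cds_per_record(genbank_text)
    return [name for name, n in counts.items() if n >= max_genes]


demo = "\n".join([
    "LOCUS       demo_phage   100 bp    DNA     linear   PHG",
    "FEATURES             Location/Qualifiers",
    "     CDS             1..30",
    "     CDS             complement(40..90)",
    "//",
])
print(count_cds_per_record(demo))  # {'demo_phage': 2}
print(records_too_large(demo))     # []
```

In practice the text would be read from the GenBank file written by Pharokka or Phold, for example with `open(path).read()`.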

The tools are best run sequentially, as Pharokka conducts extra annotation steps like tRNA, tmRNA, CRISPR and INPHARED searches that Phold lacks (for now at least). Pharokka will also (rarely) annotate CDS that Phold can miss. Phynteny can then help annotate remaining hypothetical proteins with a PHROG category.

* **Before you start, please make sure you change the runtime to T4 GPU (or any other kind of GPU if you have $$$), otherwise Phold won't be installed properly**
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator

* To run the cells, press the play button on the left side
* Cells 1 and 2 install pharokka and phold and download the databases/models.
* Once they have been run, you can re-run Cell 3 (to run Pharokka), Cell 4 (to run Phold) and Cell 5+6 (to install and run Phynteny) as many times as you would like
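Since the install depends on a GPU runtime, it may help to confirm that one is visible before running Cell 1. This is a minimal sketch that only looks for the `nvidia-smi` tool and asks it to list devices; `gpu_visible` is a hypothetical helper, and the check says nothing about whether PyTorch will later be able to use the GPU.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    # `nvidia-smi -L` prints one line per device, e.g. "GPU 0: ...".
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_visible():
    print("GPU runtime detected")
else:
    print("No GPU detected: change the runtime type before running Cell 1")
```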



In [None]:
#@title 1. Install pharokka and phold

#@markdown This cell installs pharokka and phold. It will take a few minutes. Please be patient

%%bash

set -e

PYTHON_VERSION="3.10"
PHAROKKA_VERSION="1.7.5"
PHOLD_VERSION="0.2.0"

echo "python version ${PYTHON_VERSION}"

if [ ! -f CONDA_READY ]; then
  echo "installing python"
  wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
  rm Miniconda3-latest-Linux-x86_64.sh
  conda config --set auto_update_conda false
  touch CONDA_READY
fi

if [ ! -f PHAROKKA_PHOLD_READY ]; then
  echo "installing pharokka and phold"
  conda install -y -c conda-forge -c bioconda pip pharokka==${PHAROKKA_VERSION} python=${PYTHON_VERSION} phold==${PHOLD_VERSION} pytorch=*=cuda*
  touch PHAROKKA_PHOLD_READY
fi





python version 3.10
installing python
installing pharokka and phold
Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done
Solving environment: / - \ | / - \ | / - \ | / - done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - pharokka==1.7.5
    - phold==0.2.0
    - pip
    - python=3.10
    - pytorch[build=cuda*]


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _openmp_mutex-4.5          |       3_kmp_llvm           7 KB  conda-forge
    about-time-4.2.1           |     pyhd8ed1ab_1          16 KB  conda-forge
    aiohappyeyeballs-2.6.1     |     pyhd8ed1ab_0          19 KB  conda-forge
    aiohttp-3.11.18            |  py310h89163eb_0     

(installing took 7 minutes)

In [None]:
#@title 2. Download pharokka phold databases

#@markdown This cell downloads the pharokka then the phold database. It will take some time (5-10 minutes probably). Please be patient.


%%time
import os
print("Downloading pharokka database. This will take a few minutes. Please be patient :)")
os.system("install_databases.py -o pharokka_db")
print("Downloading phold database. This will take a few minutes. Please be patient :)")
os.system("phold install -d phold_db")





Downloading pharokka database. This will take a few minutes. Please be patient :)
Downloading phold database. This will take a few minutes. Please be patient :)
CPU times: user 394 ms, sys: 61.8 ms, total: 456 ms
Wall time: 2min 39s


256
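The trailing `256` is the value returned by the last `os.system` call in the cell. On Linux that value is a wait status rather than an exit code, and `256` corresponds to exit code 1, so the `phold install -d phold_db` step most likely did not finish cleanly. The sketch below shows how to decode the value and, as one alternative, how to make a failing command raise instead of passing silently. `exit_code` and `run_checked` are hypothetical helper names.

```python
import os
import subprocess
import sys


def exit_code(wait_status):
    """Convert the value returned by os.system into a plain exit code."""
    return os.waitstatus_to_exitcode(wait_status)  # available in Python 3.9+


def run_checked(argv):
    """Run a command and raise subprocess.CalledProcessError if it fails."""
    subprocess.run(argv, check=True)


print(exit_code(256))  # 1

# Harmless demonstration command; a real call might look like
# run_checked(["phold", "install", "-d", "phold_db"])
run_checked([sys.executable, "-c", "pass"])
```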

In [None]:
%%bash
# one‑time DB download (~8 GB)
DB_DIR="/content/drive/MyDrive/fereshte/phold_db"
mkdir -p "$DB_DIR"

echo "Installing PHold database to $DB_DIR …"
MPLBACKEND=Agg phold install -d "$DB_DIR"

Installing PHold database to /content/drive/MyDrive/fereshte/phold_db …
|████████████████████████████████████████| 2.16G/2.16G [100%] in 11:47.4 (3.05M/s) 


2025-05-18 15:17:19.692 | INFO     | phold:install:1119 - You have specified the /content/drive/MyDrive/fereshte/phold_db directory to store the Phold database and ProstT5 model
2025-05-18 15:17:19.693 | INFO     | phold:install:1131 - Checking that the Rostlab/ProstT5_fp16 ProstT5 model is available in /content/drive/MyDrive/fereshte/phold_db
2025-05-18 15:17:19.693 | INFO     | phold.features.predict_3Di:get_T5_model:121 - Using device: cpu
2025-05-18 15:17:19.693 | INFO     | phold.features.predict_3Di:get_T5_model:127 - Loading T5 from: /content/drive/MyDrive/fereshte/phold_db/Rostlab/ProstT5_fp16
2025-05-18 15:17:19.693 | INFO     | phold.features.predict_3Di:get_T5_model:128 - If /content/drive/MyDrive/fereshte/phold_db/Rostlab/ProstT5_fp16 is not found, it will be downloaded
2025-05-18 15:17:19.694 | INFO     | phold.features.predict_3Di:get_T5_model:138 - ProstT5 not found. Downloading ProstT5 from Hugging Face
You are using the default legacy behaviour of the <class 'transform

In [None]:
%%bash
# change SAMPLE if you prefer another genome
SAMPLE="AB823818.1"
GBK="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/${SAMPLE}.gbk"
OUT="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/phold_test"

# remove any old test folder
rm -rf "$OUT"

echo "⇢ Running PHold on $SAMPLE (MPLBACKEND=Agg)…"
MPLBACKEND=Agg phold run \
  -i "$GBK" \
  -o "$OUT" \
  -p test \
  -t 4 \
  -d /content/drive/MyDrive/fereshte/phold_db \
  -f

⇢ Running PHold on AB823818.1 (MPLBACKEND=Agg)…


.______    __    __    ______    __       _______  
|   _  \  |  |  |  |  /  __  \  |  |     |       \ 
|  |_)  | |  |__|  | |  |  |  | |  |     |  .--.  |
|   ___/  |   __   | |  |  |  | |  |     |  |  |  |
|  |      |  |  |  | |  `--'  | |  `----.|  '--'  |
| _|      |__|  |__|  \______/  |_______||_______/ 
                                                   




2025-05-18 15:32:24.362 | INFO     | phold.utils.validation:instantiate_dirs:70 - Checking the output directory /content/drive/MyDrive/fereshte/pharokka/AB823818.1/phold_test
2025-05-18 15:32:24.362 | INFO     | phold.utils.validation:instantiate_dirs:76 - --force was specified even though the output directory does not already exist. Continuing
2025-05-18 15:32:24.374 | INFO     | phold.utils.util:begin_phold:72 - phold: annotating phage genomes with protein structures
2025-05-18 15:32:24.374 | INFO     | phold.utils.util:begin_phold:74 - You are using phold version 0.2.0
2025-05-18 15:32:24.375 | INFO     | phold.utils.util:begin_phold:75 - Repository homepage is https://github.com/gbouras13/phold
2025-05-18 15:32:24.375 | INFO     | phold.utils.util:begin_phold:76 - You are running phold run
2025-05-18 15:32:24.375 | INFO     | phold.utils.util:begin_phold:77 - Listing parameters
2025-05-18 15:32:24.375 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --input /content/drive/

CalledProcessError: Command 'b'# change SAMPLE if you prefer another genome\nSAMPLE="AB823818.1"\nGBK="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/${SAMPLE}.gbk"\nOUT="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/phold_test"\n\n# remove any old test folder\nrm -rf "$OUT"\n\necho "\xe2\x87\xa2 Running PHold on $SAMPLE (MPLBACKEND=Agg)\xe2\x80\xa6"\nMPLBACKEND=Agg phold run \\\n  -i "$GBK" \\\n  -o "$OUT" \\\n  -p test \\\n  -t 4 \\\n  -d /content/drive/MyDrive/fereshte/phold_db \\\n  -f\n'' returned non-zero exit status 1.

In [None]:
! MPLBACKEND=Agg phold install -d /content/drive/MyDrive/fereshte/phold_db

2025-05-18 15:35:52.112 | INFO     | phold:install:1119 - You have specified the /content/drive/MyDrive/fereshte/phold_db directory to store the Phold database and ProstT5 model
2025-05-18 15:35:52.112 | INFO     | phold:install:1131 - Checking that the Rostlab/ProstT5_fp16 ProstT5 model is available in /content/drive/MyDrive/fereshte/phold_db
2025-05-18 15:35:52.112 | INFO     | phold.features.predict_3Di:get_T5_model:121 - Using device: cpu
2025-05-18 15:35:52.113 | INFO     | phold.features.predict_3Di:get_T5_model:127 - Loading T5 from: /content/drive/MyDrive/fereshte/phold_db/Rostlab/ProstT5_fp16
2025-05-18 15:35:52.113 | INFO     | phold.features.predict_3Di:get_T5_model:128 - If /content/drive/MyDrive/fereshte/phold_db/Rostlab/

In [None]:
%%bash
SAMPLE="AB823818.1"
GBK="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/${SAMPLE}.gbk"
OUT="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/phold_test"
rm -rf "$OUT"

echo "⇢ PHold test after full DB extract…"
MPLBACKEND=Agg phold run \
  -i "$GBK" \
  -o "$OUT" \
  -p test \
  -t 4 \
  -d /content/drive/MyDrive/fereshte/phold_db \
  -f

⇢ PHold test after full DB extract…


.______    __    __    ______    __       _______  
|   _  \  |  |  |  |  /  __  \  |  |     |       \ 
|  |_)  | |  |__|  | |  |  |  | |  |     |  .--.  |
|   ___/  |   __   | |  |  |  | |  |     |  |  |  |
|  |      |  |  |  | |  `--'  | |  `----.|  '--'  |
| _|      |__|  |__|  \______/  |_______||_______/ 
                                                   




2025-05-18 15:36:41.936 | INFO     | phold.utils.validation:instantiate_dirs:70 - Checking the output directory /content/drive/MyDrive/fereshte/pharokka/AB823818.1/phold_test
2025-05-18 15:36:41.936 | INFO     | phold.utils.validation:instantiate_dirs:76 - --force was specified even though the output directory does not already exist. Continuing
2025-05-18 15:36:41.957 | INFO     | phold.utils.util:begin_phold:72 - phold: annotating phage genomes with protein structures
2025-05-18 15:36:41.958 | INFO     | phold.utils.util:begin_phold:74 - You are using phold version 0.2.0
2025-05-18 15:36:41.958 | INFO     | phold.utils.util:begin_phold:75 - Repository homepage is https://github.com/gbouras13/phold
2025-05-18 15:36:41.958 | INFO     | phold.utils.util:begin_phold:76 - You are running phold run
2025-05-18 15:36:41.958 | INFO     | phold.utils.util:begin_phold:77 - Listing parameters
2025-05-18 15:36:41.959 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --input /content/drive/

CalledProcessError: Command 'b'SAMPLE="AB823818.1"\nGBK="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/${SAMPLE}.gbk"\nOUT="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}/phold_test"\nrm -rf "$OUT"\n\necho "\xe2\x87\xa2 PHold test after full DB extract\xe2\x80\xa6"\nMPLBACKEND=Agg phold run \\\n  -i "$GBK" \\\n  -o "$OUT" \\\n  -p test \\\n  -t 4 \\\n  -d /content/drive/MyDrive/fereshte/phold_db \\\n  -f\n'' returned non-zero exit status 1.
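Both failed runs above are cut off before the line that explains the failure. A small wrapper that keeps the end of the combined output can make the real error easier to find. This is a sketch only; `run_and_tail` is a hypothetical helper, and the demonstration uses a throwaway Python command rather than phold.

```python
import subprocess
import sys


def run_and_tail(argv, tail_lines=20):
    """Run argv; on failure print the last lines of combined stdout/stderr."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, keeping order
        text=True,
    )
    if proc.returncode != 0:
        tail = proc.stdout.splitlines()[-tail_lines:]
        print(f"Command failed with exit status {proc.returncode}. "
              f"Last {len(tail)} lines of output:")
        print("\n".join(tail))
    return proc.returncode


# Demonstration with a command that prints a message and then fails.
status = run_and_tail(
    [sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"]
)
```

The same helper could be given the `phold run` arguments from the cell above as a list, with `MPLBACKEND` set through `os.environ` beforehand.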

# Looping Runs

In [None]:
%%time
import os, glob, subprocess, pandas as pd, tqdm

# ── CONFIG ───────────────────────────────────────────────
INPUT_DIR        = "/content/drive/MyDrive/fereshte/genomes"
OUTPUT_ROOT      = "/content/drive/MyDrive/fereshte/pharokka"
PHAROKKA_DB      = "pharokka_db"
THREADS          = 4
FAST_MODE        = True
FORCE_REWRITE    = True
PREDICTORS       = ["phanotate", "prodigal"]   # try these in order
# ─────────────────────────────────────────────────────────

def done(out_dir, sample):
    return os.path.exists(os.path.join(out_dir, f"{sample}_cds_functions.tsv"))

fasta_paths = sorted(glob.glob(os.path.join(INPUT_DIR, "*.f*")))
manifest_rows, failed = [], []

for fasta in tqdm.tqdm(fasta_paths, desc="Pharokka"):
    sample  = os.path.splitext(os.path.basename(fasta))[0]
    out_dir = os.path.join(OUTPUT_ROOT, sample)
    os.makedirs(out_dir, exist_ok=True)

    if not FORCE_REWRITE and done(out_dir, sample):
        manifest_rows.append({"sample": sample,
                              "fasta": fasta,
                              "pharokka_funcs": os.path.join(out_dir, f"{sample}_cds_functions.tsv")})
        continue

    success = False
    for predictor in PREDICTORS:
        try:
            cmd = (
                f"pharokka.py -d {PHAROKKA_DB} -i {fasta} -t {THREADS}"
                f" -o {out_dir} -p {sample} -l {sample} -g {predictor}"
                + (" --fast" if FAST_MODE else "")
                + (" -f"     if FORCE_REWRITE else "")
            )
            subprocess.run(cmd, shell=True, check=True)
            if done(out_dir, sample):
                success = True
                break
        except subprocess.CalledProcessError:
            print(f"⚠ {sample}: {predictor} exited with non‑zero code")
            # fall through to next predictor (or fail)

    if success:
        manifest_rows.append({"sample": sample,
                              "fasta": fasta,
                              "pharokka_funcs": os.path.join(out_dir, f"{sample}_cds_functions.tsv")})
    else:
        print(f"✖ {sample}: all predictors failed")
        failed.append(sample)

# save manifest + failure list
manifest = pd.DataFrame(manifest_rows)
manifest_path = os.path.join(OUTPUT_ROOT, "analysis_manifest.csv")
manifest.to_csv(manifest_path, index=False)

if failed:
    fail_path = os.path.join(OUTPUT_ROOT, "failed_genomes.txt")
    with open(fail_path, "w") as fh:
        fh.write("\n".join(failed))
    print(f"\nFinished with {len(failed)} failures → {fail_path}")

print(f"\n✓ Pharokka done for {len(manifest)} genomes. Manifest → {manifest_path}")

Pharokka: 100%|██████████| 331/331 [7:07:07<00:00, 77.42s/it]


✓ Pharokka done for 331 genomes. Manifest → /content/drive/MyDrive/fereshte/pharokka/analysis_manifest.csv
CPU times: user 41.5 s, sys: 5.45 s, total: 46.9 s
Wall time: 7h 7min 7s
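The loop above assembles the command as one string and runs it with `shell=True`, so a FASTA path or sample name containing a space would be split into several arguments. One way to avoid this is to build an argument list and pass it to `subprocess.run` without a shell. The sketch below uses only the pharokka flags that already appear in the loop; `build_pharokka_argv` is a hypothetical helper name and the example paths are made up.

```python
def build_pharokka_argv(fasta, out_dir, sample, predictor,
                        db="pharokka_db", threads=4, fast=True, force=True):
    """Return the pharokka command as a list, one element per argument."""
    argv = [
        "pharokka.py",
        "-d", db,
        "-i", fasta,
        "-t", str(threads),
        "-o", out_dir,
        "-p", sample,
        "-l", sample,
        "-g", predictor,
    ]
    if fast:
        argv.append("--fast")
    if force:
        argv.append("-f")
    return argv


# Made-up paths containing a space, to show they stay in one piece.
argv = build_pharokka_argv("/data/my phage.fasta", "/out/my phage",
                           "my_phage", "phanotate")
print(argv)
```

The list can then be run with `subprocess.run(argv, check=True)`, leaving out `shell=True`.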





#### test

In [None]:
%%bash
SAMPLE="AP012530.1"
FASTA="/content/drive/MyDrive/fereshte/genomes/${SAMPLE}.fasta"
OUT="/content/drive/MyDrive/fereshte/pharokka/${SAMPLE}"
rm -rf "$OUT"

echo "⇢ Re‑running Pharokka on $SAMPLE with full log:"
set -x                    # echo commands
pharokka.py \
  -d pharokka_db \
  -i "$FASTA" \
  -t 4 \
  -o "$OUT" \
  -p "$SAMPLE" \
  -l "$SAMPLE" \
  -g phanotate \
  --fast

⇢ Re‑running Pharokka on AP012530.1 with full log:


+ pharokka.py -d pharokka_db -i /content/drive/MyDrive/fereshte/genomes/AP012530.1.fasta -t 4 -o /content/drive/MyDrive/fereshte/pharokka/AP012530.1 -p AP012530.1 -l AP012530.1 -g phanotate --fast
  setattr(self, word, getattr(machar, word).flat[0])
  return self._float_to_str(self.smallest_subnormal)
  setattr(self, word, getattr(machar, word).flat[0])
  return self._float_to_str(self.smallest_subnormal)
2025-05-18 15:40:36.942 | INFO     | __main__:main:95 - Starting Pharokka v1.7.5
2025-05-18 15:40:36.943 | INFO     | __main__:main:96 - Command executed: Namespace(infile='/content/drive/MyDrive/fereshte/genomes/AP012530.1.fasta', outdir='/content/drive/MyDrive/fereshte/pharokka/AP012530.1', database='pharokka_db', threads='4', force=False, prefix='AP012530.1', locustag='AP012530.1', gene_predictor='phanotate', meta=False, split=False, coding_table='11', evalue='1E-05', fast=True, mmseqs2_only=False, meta_hmm=False, dnaapler=False, custom_hmm='', genbank=False, terminase=False, termi

# Default Runs

In [None]:
#@title 3. Run Pharokka

#@markdown First, upload your phage(s) as a nucleotide input FASTA file

#@markdown Click on the folder icon to the left and use the file upload button.

#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.

#@markdown Then provide a directory for pharokka's output using PHAROKKA_OUT_DIR.
#@markdown The default is 'output_pharokka'.

#@markdown Then type in a gene prediction tool for pharokka.
#@markdown Please choose either 'phanotate', 'prodigal', or 'prodigal-gv'.

#@markdown You can also provide a prefix for your output files with PHAROKKA_PREFIX.
#@markdown If you provide nothing it will default to 'pharokka'.

#@markdown You can also provide a locus tag for your output files.
#@markdown If you provide nothing it will generate a random locus tag.

#@markdown You can click FAST to turn off --fast.
#@markdown By default it is True so that Pharokka runs faster in the Colab environment.

#@markdown You can click META to turn on --meta if you have multiple phages in your input.

#@markdown You can click META_HMM to turn on --meta_hmm.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier pharokka run has crashed for whatever reason.

#@markdown The results of Pharokka will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHAROKKA_OUT_DIR.zip, where PHAROKKA_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import sys
import subprocess
import zipfile
INPUT_FILE = '/content/drive/MyDrive/fereshte/genomes/AB823818.1.fasta' #@param {type:"string"}

if os.path.exists(INPUT_FILE):
    print(f"Input file {INPUT_FILE} exists")
else:
    print(f"Error: File {INPUT_FILE} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

PHAROKKA_OUT_DIR = '/content/drive/MyDrive/fereshte/output_pharokka'  #@param {type:"string"}
GENE_PREDICTOR = 'phanotate'  #@param {type:"string"}
allowed_gene_predictors = ['phanotate', 'prodigal', 'prodigal-gv']
# Normalise case so the value that is validated is the one passed to pharokka
GENE_PREDICTOR = GENE_PREDICTOR.lower()
# Check if the input parameter is valid
if GENE_PREDICTOR not in allowed_gene_predictors:
    raise ValueError("Invalid GENE_PREDICTOR. Please choose from: 'phanotate', 'prodigal', 'prodigal-gv'.")

PHAROKKA_PREFIX = 'pharokka'  #@param {type:"string"}
LOCUS_TAG = 'Default'  #@param {type:"string"}
FAST = True  #@param {type:"boolean"}
META = False  #@param {type:"boolean"}
META_HMM = False  #@param {type:"boolean"}
FORCE = True  #@param {type:"boolean"}


# Construct the command
command = f"pharokka.py -d pharokka_db -i {INPUT_FILE} -t 4 -o {PHAROKKA_OUT_DIR} -p {PHAROKKA_PREFIX} -l {LOCUS_TAG} -g {GENE_PREDICTOR}"

if FORCE is True:
  command = f"{command} -f"

if FAST is True:
  command = f"{command} --fast"

if META is True:
  command = f"{command} -m"

if META_HMM is True:
  command = f"{command} --meta_hmm"

# Execute the command
try:
    print("Running pharokka")
    subprocess.run(command, shell=True, check=True)
    print("pharokka completed successfully.")
    print(f"Your output is in {PHAROKKA_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHAROKKA_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHAROKKA_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHAROKKA_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Input file /content/drive/MyDrive/fereshte/genomes/AB823818.1.fasta exists
Running pharokka
pharokka completed successfully.
Your output is in /content/drive/MyDrive/fereshte/output_pharokka.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to /content/drive/MyDrive/fereshte/output_pharokka.zip
CPU times: user 161 ms, sys: 16.3 ms, total: 177 ms
Wall time: 48.4 s
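The zipping step in this cell (repeated in Cells 4 and 6) walks the output directory by hand. The standard library's `shutil.make_archive` does much the same in one call, although unlike the manual loop it may also store directory entries in the archive. This is a sketch of that alternative, not the notebook's own method; `zip_directory` is a hypothetical helper and the demonstration runs on a temporary directory.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(out_dir):
    """Create <out_dir>.zip next to out_dir and return the archive path."""
    out_dir = os.path.abspath(out_dir)
    # base_name is the archive path without ".zip"; root_dir is what gets zipped,
    # so paths inside the archive are relative to out_dir.
    return shutil.make_archive(base_name=out_dir, format="zip", root_dir=out_dir)


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "output_demo")
    os.makedirs(os.path.join(demo_dir, "sub"))
    with open(os.path.join(demo_dir, "sub", "a.txt"), "w") as fh:
        fh.write("hello")
    archive = zip_directory(demo_dir)
    with zipfile.ZipFile(archive) as zf:
        print(os.path.basename(archive), zf.namelist())
```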


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#@title 4. Run phold

#@markdown This cell will run phold on the output of cell 3's Pharokka run

#@markdown You do not need to provide any further input files

#@markdown You can now provide a directory for phold's output with PHOLD_OUT_DIR.
#@markdown The default is 'output_phold'.

#@markdown You can also provide a prefix for your output files with PHOLD_PREFIX.
#@markdown If you provide nothing it will default to 'phold'.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier phold run has crashed for whatever reason.

#@markdown If your input has multiple phages, you can click SEPARATE.
#@markdown This will output separate GenBank files in the output directory.

#@markdown The results of Phold will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHOLD_OUT_DIR.zip, where PHOLD_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import subprocess
import zipfile

# phold input is pharokka output
PHOLD_INPUT = f"{PHAROKKA_OUT_DIR}/{PHAROKKA_PREFIX}.gbk"
PHOLD_OUT_DIR = 'output_phold'  #@param {type:"string"}
PHOLD_PREFIX = 'phold'  #@param {type:"string"}
FORCE = True  #@param {type:"boolean"}
SEPARATE = False  #@param {type:"boolean"}

# Construct the command
command = f"phold run -i {PHOLD_INPUT} -t 4 -o {PHOLD_OUT_DIR} -p {PHOLD_PREFIX} -d phold_db"

if FORCE is True:
  command = f"{command} -f"
if SEPARATE is True:
  command = f"{command} --separate"


# Execute the command
try:
    print("Running phold")
    subprocess.run(command, shell=True, check=True)
    print("phold completed successfully.")
    print(f"Your output is in {PHOLD_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHOLD_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHOLD_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHOLD_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Running phold
Error occurred: Command 'phold run -i /content/drive/MyDrive/fereshte/output_pharokka/pharokka.gbk -t 4 -o output_phold -p phold -d phold_db -f' returned non-zero exit status 1.
CPU times: user 2.04 ms, sys: 812 µs, total: 2.86 ms
Wall time: 421 ms
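The run above fails within half a second, which suggests a problem found at start-up rather than during annotation. The captured output does not show the cause, but two things are cheap to rule out: whether the Pharokka GenBank file exists, and whether the directory passed to `-d` (here the relative path `phold_db`, whereas the manual install earlier wrote to a Drive path) exists and contains files. The sketch below is a hypothetical pre-flight check, not a diagnosis, and it does not verify that the database is complete.

```python
import os


def preflight(gbk_path, db_dir):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not os.path.isfile(gbk_path):
        problems.append(f"input GenBank not found: {gbk_path}")
    if not os.path.isdir(db_dir):
        problems.append(f"database directory not found: {db_dir}")
    elif not os.listdir(db_dir):
        problems.append(f"database directory is empty: {db_dir}")
    return problems


# Paths as used by the cell above; adjust to your own run.
for problem in preflight("output_pharokka/pharokka.gbk", "phold_db"):
    print("Problem:", problem)
```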


In [None]:
#@title 5. Install phynteny

#@markdown This cell installs phynteny and downloads the models. It will take a few minutes. Please be patient
%%bash
PHYNTENY_VERSION="0.1.13"
NUMPY_VERSION="1.26.4"

if [ ! -f PHYNTENY_READY ]; then
  echo "installing phynteny"
  pip install phynteny==${PHYNTENY_VERSION} numpy==${NUMPY_VERSION}
  echo "Downloading phynteny models"
  install_models -o phynteny_models
  touch PHYNTENY_READY
fi


installing phynteny
Collecting phynteny==0.1.13
  Downloading phynteny-0.1.13-py3-none-any.whl.metadata (7.9 kB)
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scikit-learn<=1.2.2 (from phynteny==0.1.13)
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting joblib (from phynteny==0.1.13)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting tensorflow==2.9.1 (from phynteny==0.1.13)
  Downloading tensorflow-2.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting absl-py>=1.0.0 (from tensorflow==2.9.1->phynteny==0.1.13)
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow==2.9.1->phynteny==0.1.13)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers<2,>=1.12 (from tensorflow==2.9.1-

In [None]:
#@title 6. Run Phynteny

#@markdown This cell will run phynteny on the output of cell 4's Phold run to predict the function of remaining hypothetical proteins

#@markdown You do not need to provide any further input files

#@markdown You can now provide a directory for phynteny's output with PHYNTENY_OUT_DIR.
#@markdown The default is 'output_phynteny'.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your phynteny run has crashed for whatever reason.

#@markdown The results of Phynteny will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHYNTENY_OUT_DIR.zip, where PHYNTENY_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import subprocess
import zipfile

# phynteny input is phold output
PHYNTENY_INPUT = f"{PHOLD_OUT_DIR}/{PHOLD_PREFIX}.gbk"
PHYNTENY_OUT_DIR = 'output_phynteny'  #@param {type:"string"}
FORCE = False  #@param {type:"boolean"}

# Construct the command
command = f"phynteny {PHYNTENY_INPUT} -m phynteny_models -o {PHYNTENY_OUT_DIR}"

if FORCE is True:
  command = f"{command} -f"


# Execute the command
try:
    print("Running phynteny")
    subprocess.run(command, shell=True, check=True)
    print("phynteny completed successfully.")
    print(f"Your output is in {PHYNTENY_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHYNTENY_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHYNTENY_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHYNTENY_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Running phynteny
phynteny completed successfully.
Your output is in output_phynteny.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to output_phynteny.zip
CPU times: user 160 ms, sys: 31.1 ms, total: 191 ms
Wall time: 42.1 s
