# Process genomes using GTDB-tk

This notebook first extracts the representative genomes from the zip file generated in the previous step, then classifies them using GTDB-tk's `classify_wf`. The output of this step is a summary file that contains the taxonomy of each genome.

- [Extract representative genomes](#extract-representative-genomes)
- [Submit GTDB-tk job](#submit-gtdb-tk-job)

## Extract representative genomes

In [2]:
import boto3
import logging
import shutil
import zipfile

from cloudpathlib import CloudPath, AnyPath

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s\t[%(levelname)s]:\t%(message)s",
)
logging.info("Starting ...")

2023-03-24 10:57:38,978	[INFO]:	Starting ...


In [3]:
input_dir = AnyPath("../data/generated/download_genomes")
output_dir = AnyPath("../data/generated/process_genomes")
genomes_zip_file = input_dir / "genomes.zip"

binqc_s3_basepath = CloudPath("s3://genomics-workflow-core/Results/BinQC")
project = "TransposonLibrary"
prefix = "20210331"
project_s3_path = binqc_s3_basepath / project / prefix / "00_genomes"

2023-03-24 10:57:43,394	[INFO]:	Found credentials in shared credentials file: ~/.aws/credentials


In [4]:
genomes_dir = output_dir / "genomes"
# unzip the genomes.zip file
with zipfile.ZipFile(genomes_zip_file, "r") as zip_ref:
    zip_ref.extractall(genomes_dir)
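
Before extracting, it can be useful to confirm that the archive actually contains FASTA files. The following is a minimal sketch using only the standard library. The helper name `list_fasta_members` is hypothetical and not part of the original notebook, and it assumes the genomes use the `.fna` extension, as the rest of this notebook does.

```python
import zipfile


def list_fasta_members(zip_path, suffix=".fna"):
    """Return the sorted names of archive members that end with `suffix`."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # namelist() returns every member, including nested paths
        return sorted(name for name in zip_ref.namelist() if name.endswith(suffix))
```

Calling `list_fasta_members(genomes_zip_file)` and checking that the result is non-empty gives an early warning if the previous step produced an unexpected archive.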


In [5]:
# get the list of genomes
genomes = list(genomes_dir.rglob("*.fna"))

# upload the genomes to s3
for genome in genomes:
    project_s3_path.joinpath(genome.name).upload_from(genome)
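
Because the upload uses only `genome.name`, two genomes with the same file name in different subdirectories would land on the same S3 key, and the later upload would overwrite the earlier one. A minimal sketch of a pre-upload check follows; `find_duplicate_names` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def find_duplicate_names(paths):
    """Return the sorted file names that occur more than once among `paths`."""
    # Compare base names only, since that is what the upload key is built from
    counts = Counter(Path(str(p)).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)
```

If `find_duplicate_names(genomes)` returns a non-empty list, the affected files should be renamed before uploading.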

## Submit GTDB-tk job

This submits a Nextflow-based job to our AWS Batch environment using the [FischbachLab/nf-binqc](https://github.com/FischbachLab/nf-binqc) GitHub repo. The output contains several QC results, one of which is the GTDB-tk v2.1.1 `classify_wf` result, run against database release `release207_v2`.

Alternatively, download the GTDB-tk database `release207_v2` to a "local" machine and run `classify_wf` there using the `gtdb_classify_wf.sh` script. Note that running `classify_wf` locally with the selected options requires at least 320GB of memory. I recommend using an `r5.12xlarge` AWS EC2 instance or equivalent.

Once the database has been downloaded, update the `gtdb_classify_wf.sh` script to point to the location of the database. Then run the script as follows:

```bash
bash -x gtdb_classify_wf.sh genomes_dir
```
Here, `genomes_dir` is the directory containing the representative genomes. The script expects genome files with the `.fna` extension (this can be changed in the bash script).

In [9]:
def submit_batch_job(
    project: str,
    prefix: str,
    fastas: CloudPath,
    branch: str = "main",
    job_queue: str = "priority-maf-pipelines",
    job_definition: str = "nextflow-production",
    s3_output_base: CloudPath = CloudPath("s3://genomics-workflow-core/Results/BinQC"),
    aws_profile: str = None,
    dry_run: bool = False,
) -> dict:
    """Submit an nf-binqc job to AWS Batch.

    Args:
        project (_str_): name of the project
        prefix (_str_): name of the sample
        fastas (_CloudPath_): s3 path to the fastas to be processed
        branch (_str_, optional): Branch of nf-binqc to use.
            Defaults to "main".
        job_queue (_str_, optional): name of the queue for the head node.
            Defaults to "priority-maf-pipelines".
        job_definition (_str_, optional): nextflow job definition. Doesn't usually change.
            Defaults to "nextflow-production".
        s3_output_base (_CloudPath_, optional): s3 base path under which results are written.
            Defaults to "s3://genomics-workflow-core/Results/BinQC".
        aws_profile (_str_, optional): if a non-default aws profile should be used to submit jobs.
            Defaults to "None".
        dry_run (_bool_, optional): don't submit the job, just print what the submission command would look like.
            Defaults to "False".
    Returns:
        _dict_: a response object that contains details of the job submission from AWS
        (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html#Batch.Client.submit_job),
        or None if `dry_run` is True.
    """
    ## Set AWS Profile
    if aws_profile is None:
        maf = boto3.session.Session()
    else:
        maf = boto3.session.Session(profile_name=aws_profile)

    batch = maf.client("batch")

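    # Validate inputs explicitly; assert statements are skipped under `python -O`
    if not s3_output_base.exists():
        raise FileNotFoundError(f"{s3_output_base} does not exist")
    if not fastas.exists():
        raise FileNotFoundError(f"{fastas} does not exist")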

    outdir = s3_output_base / project

    # Set the pipeline flags for the analysis
    command = [
        "FischbachLab/nf-binqc",
        "-r",
        branch,
        "--project",
        prefix,
        "--fastas",
        fastas.as_uri(),
        "--outdir",
        outdir.as_uri(),
    ]

    if dry_run:
        logging.info("The following command would be run:\n'%s'", " ".join(command))
        return None

    return batch.submit_job(
        jobName=f"nf-bqc-{prefix}",
        jobQueue=job_queue,
        jobDefinition=job_definition,
        containerOverrides={"command": command},
    )


In [10]:
response = submit_batch_job(project=project, prefix=prefix, fastas=project_s3_path)
response

2023-03-24 11:05:49,782	[INFO]:	Found credentials in shared credentials file: ~/.aws/credentials


{'ResponseMetadata': {'RequestId': 'cfb7e0b3-f85c-4071-957f-5931c05a90f5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 24 Mar 2023 18:05:50 GMT',
   'content-type': 'application/json',
   'content-length': '165',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'cfb7e0b3-f85c-4071-957f-5931c05a90f5',
   'access-control-allow-origin': '*',
   'x-amz-apigw-id': 'CTDzwFLGvHcFxZA=',
   'access-control-expose-headers': 'X-amzn-errortype,X-amzn-requestid,X-amzn-errormessage,X-amzn-trace-id,X-amz-apigw-id,date',
   'x-amzn-trace-id': 'Root=1-641de67e-58300fde675e3c1312023b1e'},
  'RetryAttempts': 0},
 'jobArn': 'arn:aws:batch:us-west-2:458432034220:job/db7c6b66-b447-45ee-b13c-d06b97522b34',
 'jobName': 'nf-bqc-20210331',
 'jobId': 'db7c6b66-b447-45ee-b13c-d06b97522b34'}
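
The `jobId` in the response can be used to check on the submitted job. The sketch below wraps the boto3 Batch client's `describe_jobs` call. `get_job_status` is a hypothetical helper, not part of the original notebook, and it reports only the job that was submitted here (the Nextflow head node), not any child jobs the pipeline may launch.

```python
def get_job_status(batch_client, job_id):
    """Return the status string of an AWS Batch job, or None if it is not found."""
    result = batch_client.describe_jobs(jobs=[job_id])
    jobs = result.get("jobs", [])
    # AWS Batch reports statuses such as SUBMITTED, RUNNING, SUCCEEDED or FAILED
    return jobs[0]["status"] if jobs else None


# Example usage (requires AWS credentials, so it is left commented out):
# batch = boto3.session.Session().client("batch")
# get_job_status(batch, response["jobId"])
```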

In [12]:
# remove the genomes directory
shutil.rmtree(genomes_dir)
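
Once the Batch job has finished, the GTDB-tk summary file can be loaded to look up the taxonomy of each genome. The following is a minimal sketch, assuming the summary is a tab-separated file with `user_genome` and `classification` columns, where `classification` is a semicolon-separated lineage. The helper `read_taxonomy` is hypothetical, and the location of the summary file within the nf-binqc output is not specified in this notebook.

```python
import csv


def read_taxonomy(handle):
    """Map each genome name to its list of rank assignments."""
    reader = csv.DictReader(handle, delimiter="\t")
    taxonomy = {}
    for row in reader:
        # e.g. "d__Bacteria;p__..." becomes ["d__Bacteria", "p__...", ...]
        taxonomy[row["user_genome"]] = row["classification"].split(";")
    return taxonomy


# Example usage (the file name below is an assumption):
# with open("gtdbtk.bac120.summary.tsv", newline="") as handle:
#     taxonomy = read_taxonomy(handle)
```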