<picture>
  <source media="(prefers-color-scheme: dark)" srcset="../images/tucca-rna-seq-logo-white.png">
  <img src="../images/tucca-rna-seq-logo.png" alt="tucca-rna-seq logo" width="250">
</picture>

# `tucca-rna-seq` on Google Cloud Batch

This notebook orchestrates the `tucca-rna-seq` workflow on **Google Cloud Batch**, a fully managed service for running batch computing workloads at scale. Unlike `google_colab_runner.ipynb`, which runs the entire workflow on a single Colab instance, this notebook uses Colab as a lightweight client that submits jobs to Batch-managed virtual machines.

### Key Features:
- **Massive Scalability**: Execute hundreds or thousands of jobs in parallel on powerful, custom-configured virtual machines.
- **Cost-Effective**: Pay only for the compute resources you use. Google Batch automatically provisions and shuts down VMs for each job.
- **Centralized Storage**: Utilizes Google Cloud Storage (GCS) for all workflow data, ensuring it's accessible to all jobs.
- **Orchestration from Colab**: Use the familiar Colab interface to manage and monitor your large-scale analyses.

### How to Use:
1.  **Set up Google Cloud**: Ensure you have a Google Cloud project with billing enabled and the [Batch API enabled](https://cloud.google.com/batch/docs/get-started#enable-batch-for-a-project).
2.  **Authenticate**: Grant this notebook access to your Google Cloud account.
3.  **Install Snakemake**: Install Snakemake and the plugins for Google Cloud Batch and Google Cloud Storage.
4.  **Configure Storage**: Specify your GCS bucket and region.
5.  **Upload Data**: Upload your configured workflow files and raw data to your GCS bucket.
6.  **Execute the Workflow**: Perform a dry run, then submit the workflow to Google Batch.
7.  **Monitor**: Keep track of your job's progress in the Google Cloud Console or directly within this notebook.
8.  **Retrieve Results**: Download the results from your GCS bucket.

Let's get started!


## Workflow Overview

<div align="center">
  <img alt="tucca-rna-seq workflow map" src="../images/tucca-rna-seq-workflow.png" width="700">
  <p>Created in <a href="https://BioRender.com">https://BioRender.com</a></p>
</div>

## Rulegraph

<div align="center">
  <img alt="tucca-rna-seq rulegraph" src="../images/rulegraph.png" width="700">
  <p>Created via <code>snakemake --rulegraph</code></p>
</div>


### 1. Authenticate with Google Cloud

This cell authenticates your Google account, allowing the notebook to interact with Google Cloud Batch and Google Cloud Storage on your behalf. You will be prompted to log in and authorize access.
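
The project lookup in the next cell can come back empty when no default project is configured in `gcloud`. As a minimal sketch (the helper name `resolve_project_id` and the example IDs are hypothetical, not part of the workflow), the check below prefers a manually entered ID over the detected one and rejects values that do not match Google Cloud's project ID format: 6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen. Legacy domain-scoped IDs are not handled.

```python
import re

# Google Cloud project IDs: 6-30 characters, lowercase letters, digits and
# hyphens, starting with a letter and not ending with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def resolve_project_id(detected: str, override: str = "") -> str:
    """Return the override if given, else the detected ID; raise if unusable."""
    candidate = (override or detected or "").strip()
    if not candidate or candidate == "(unset)":
        raise ValueError(
            "No project ID configured; set one with "
            "`gcloud config set project PROJECT_ID`."
        )
    if not _PROJECT_ID_RE.fullmatch(candidate):
        raise ValueError(f"{candidate!r} is not a valid project ID.")
    return candidate


# Hypothetical example values, for illustration only.
print(resolve_project_id("my-lab-project"))
print(resolve_project_id("my-lab-project", "other-project-123"))
```

Failing early here is cheaper than discovering a bad project ID when Batch rejects the first job submission.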


In [None]:
from google.colab import auth
import os
import subprocess

# Authenticate with Google Cloud
auth.authenticate_user()

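# Get the GCP project ID from the gcloud config (empty if no project is set)
project_id = subprocess.run(
    ["gcloud", "config", "get-value", "project"],
    capture_output=True,
    text=True
).stdout.strip()

if project_id and project_id != "(unset)":
    print(f"✅ Automatically detected project ID: {project_id}")
else:
    print("⚠️ No default project found in the gcloud config; enter a project ID below.")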

# @markdown ---
# @markdown ### 📝 **Review and Confirm Project ID**
# @markdown The Google Cloud Project ID has been detected automatically. If you wish to use a different project, you can specify it below.
gcp_project_id = project_id # @param {type:"string"}


os.environ['GOOGLE_CLOUD_PROJECT'] = gcp_project_id

print(f"✅ Using project ID: {gcp_project_id}")

# Verify authentication
!gcloud config list


### 2. Install Snakemake and Google Cloud Plugins

This cell installs `Mamba`, `Snakemake`, and the necessary plugins for interacting with Google Cloud Batch and Google Cloud Storage.
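
After installation it can be useful to confirm that the plugins meet the minimum versions pinned in the install command (`>=0.3.0` for the Google Batch executor, `>=0.2.0` for the GCS storage plugin). The sketch below is one possible check, assuming the package list is available as the JSON that `conda list --json` prints (a list of objects with `name` and `version` fields). The helper names and the sample data are hypothetical, and the version comparison is deliberately simplified to the first three numeric components.

```python
import json
import re

# Minimum versions taken from the install command in this notebook.
MINIMUM_VERSIONS = {
    "snakemake-executor-plugin-googlebatch": "0.3.0",
    "snakemake-storage-plugin-gcs": "0.2.0",
}


def version_tuple(version: str, width: int = 3) -> tuple:
    """Turn '0.3.1' into (0, 3, 1), padding short versions with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:width]]
    return tuple(numbers + [0] * (width - len(numbers)))


def find_problems(conda_list_json: str) -> list:
    """Return a message for each required plugin that is missing or too old."""
    installed = {pkg["name"]: pkg["version"] for pkg in json.loads(conda_list_json)}
    problems = []
    for name, minimum in MINIMUM_VERSIONS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed")
        elif version_tuple(found) < version_tuple(minimum):
            problems.append(f"{name}: {found} is older than {minimum}")
    return problems


# Hypothetical sample of `conda list --json` output, trimmed to two fields.
sample = json.dumps([
    {"name": "snakemake-executor-plugin-googlebatch", "version": "0.5.0"},
    {"name": "snakemake-storage-plugin-gcs", "version": "0.1.4"},
])
print(find_problems(sample))
```

In the notebook, the real input would need to come from the Mamba environment rather than the Colab kernel, since the packages are installed under the Mamba prefix and not into Colab's own Python.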


In [None]:
import os

# Define paths for Mamba installation
mamba_prefix = '/usr/local/mamba'

# Download and install Miniforge, which ships with Mamba
# (the separate Mambaforge installer has been discontinued)
if not os.path.exists(os.path.join(mamba_prefix, 'bin', 'mamba')):
    print("Installing Mamba...")
    !wget -q "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" -O Miniforge3.sh
    !bash Miniforge3.sh -b -p {mamba_prefix}
    !rm Miniforge3.sh
else:
    print("Mamba is already installed.")

# Add Mamba to the system's PATH
os.environ['PATH'] = f"{mamba_prefix}/bin:{os.environ['PATH']}"

# Install Snakemake and Google Cloud plugins
print("Installing Snakemake and plugins...")
!mamba install -y -c conda-forge -c bioconda snakemake-minimal 'snakemake-executor-plugin-googlebatch>=0.3.0' 'snakemake-storage-plugin-gcs>=0.2.0'

print("Installation complete.")


### 3. Configure Google Cloud Settings

This workflow requires all data to be stored in a Google Cloud Storage (GCS) bucket.

#### Instructions:
1.  **Create a GCS Bucket**: If you don't have one already, create a GCS bucket in your Google Cloud project, or let the cell below create it in the selected region.
2.  **Set the Bucket Name**: In the code cell below, replace `"your-gcs-bucket-name"` with the name of your bucket.
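
Because a forgotten placeholder or a malformed name only surfaces once `gsutil` fails, the bucket name can be checked up front. The following is a minimal sketch under simplified assumptions: it accepts names of 3 to 63 characters made of lowercase letters, digits, hyphens, and underscores that start and end with a letter or digit, and it rejects the `goog` prefix and any name containing `google`. Dotted, domain-style bucket names are not handled, and the helper name `bucket_url` is hypothetical.

```python
import re

PLACEHOLDER = "your-gcs-bucket-name"

# Simplified naming rule: 3-63 characters, lowercase letters, digits, hyphens
# and underscores, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def bucket_url(name: str) -> str:
    """Validate a bucket name and return it as a gs:// URL."""
    name = name.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    if name == PLACEHOLDER:
        raise ValueError("Replace the placeholder with your own bucket name.")
    if not _BUCKET_RE.fullmatch(name):
        raise ValueError(f"{name!r} does not look like a valid bucket name.")
    if name.startswith("goog") or "google" in name:
        raise ValueError("Bucket names may not start with 'goog' or contain 'google'.")
    return f"gs://{name}"


# Hypothetical bucket name, for illustration only.
print(bucket_url("my-lab-rnaseq"))
```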


In [None]:
# @title Cloud Configuration
gcs_bucket = "your-gcs-bucket-name" # @param {type:"string"}
region = 'us-central1' # @param {type:"string"}

# Set the remote workdir path in GCS
remote_workdir = f"gs://{gcs_bucket}/tucca-rna-seq-workdir"

# Verify that the bucket exists, or create it.
# `ls -b` checks the bucket itself instead of listing its contents.
# The -l flag sets the bucket location to match the VM region for performance.
print(f"Checking for GCS bucket gs://{gcs_bucket}...")
!gsutil ls -b gs://{gcs_bucket} || gsutil mb -p {gcp_project_id} -l {region} gs://{gcs_bucket}


### 4. Upload Data and Configuration to GCS

The Google Batch jobs will need access to your raw data and configuration files. This section provides instructions for uploading them to your GCS bucket.

#### Instructions:
1.  **Clone the Repository Locally**: `git clone https://github.com/tucca-cellag/tucca-rna-seq.git`
2.  **Prepare your files**: Modify your local `config/config.yaml`, `config/samples.tsv`, and `config/units.tsv` files for your analysis. Place your raw data in the `data/raw_data/` directory.
3.  **Upload to GCS**: Use the `gsutil` command to copy your project directory to GCS.

```bash
# Example command, run on your local machine from the directory that contains the tucca-rna-seq clone:
gsutil -m rsync -r ./tucca-rna-seq gs://your-gcs-bucket-name/tucca-rna-seq
```
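
Before uploading, it may help to confirm that the local project directory contains the files listed in the instructions above. The sketch below is a hypothetical pre-upload check (the function name `missing_inputs` is not part of the workflow): it looks for `config/config.yaml`, `config/samples.tsv`, and `config/units.tsv`, and for a non-empty `data/raw_data/` directory. The demo builds a throwaway directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Paths named in the upload instructions, relative to the project directory.
REQUIRED_FILES = ["config/config.yaml", "config/samples.tsv", "config/units.tsv"]
RAW_DATA_DIR = "data/raw_data"


def missing_inputs(project_dir) -> list:
    """Return the required inputs that are absent from project_dir."""
    root = Path(project_dir)
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    raw = root / RAW_DATA_DIR
    if not raw.is_dir() or not any(raw.iterdir()):
        missing.append(f"{RAW_DATA_DIR}/ (missing or empty)")
    return missing


# Demo on a temporary directory that holds only config/config.yaml.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "config").mkdir()
    (demo_root / "config" / "config.yaml").write_text("# placeholder\n")
    print(missing_inputs(demo_root))
```

On a real checkout, calling `missing_inputs("./tucca-rna-seq")` and getting an empty list back suggests the directory is ready for the `gsutil rsync` command.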


### 5. Execute the Workflow on Google Batch

The cell below first performs a dry run to validate the setup. Once the dry run succeeds, the commented-out command submits the Snakemake workflow to Google Batch. Snakemake orchestrates the jobs from this notebook, while the actual computation happens on dedicated VMs in the cloud.
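
The submission can also be assembled as an argument list, so that the dry run and the real run are guaranteed to share the same flags. The sketch below assumes Snakemake 8's `--default-storage-provider` and `--default-storage-prefix` options together with the Google Batch executor's `--googlebatch-project` and `--googlebatch-region` flags. The function name `snakemake_command` and the example values are hypothetical.

```python
import shlex


def snakemake_command(bucket, project, region, jobs=999, dry_run=True):
    """Build the Snakemake argument list for a Google Batch run."""
    cmd = [
        "snakemake",
        "--executor", "googlebatch",
        "--default-storage-provider", "gcs",
        "--default-storage-prefix", f"gs://{bucket}/tucca-rna-seq",
        "--googlebatch-project", project,
        "--googlebatch-region", region,
        "--jobs", str(jobs),
    ]
    if dry_run:
        # Validate the workflow without submitting any Batch jobs.
        cmd.append("--dry-run")
    return cmd


# Hypothetical values, for illustration only.
print(shlex.join(snakemake_command("my-lab-rnaseq", "my-lab-project", "us-central1")))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; switching `dry_run` to `False` is then the only change between validation and the real submission.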


In [None]:
# Perform a dry run first to validate the setup
# --default-storage-prefix points to the workflow directory in GCS.
print("--- Performing a dry run of the workflow on Google Batch ---")
!snakemake \
    --executor googlebatch \
    --default-storage-provider gcs \
    --default-storage-prefix gs://{gcs_bucket}/tucca-rna-seq \
    --googlebatch-project {gcp_project_id} \
    --googlebatch-region {region} \
    --jobs 999 \
    -n

# After verifying the dry run, uncomment the lines below to execute the full workflow
# print("\n--- Running the full workflow on Google Batch ---")
# !snakemake \
#     --executor googlebatch \
#     --default-storage-provider gcs \
#     --default-storage-prefix gs://{gcs_bucket}/tucca-rna-seq \
#     --googlebatch-project {gcp_project_id} \
#     --googlebatch-region {region} \
#     --jobs 999



### 6. Monitor the Workflow

You can monitor the progress of your job in two ways:

1.  **Google Cloud Console**: For detailed logs and job information, navigate to the [**Batch** page](https://console.cloud.google.com/batch/jobs) in the Google Cloud Console.
2.  **In this Notebook**: Run the cell below to get a summary of your recent jobs and their current status.
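
When a run spawns many jobs, a per-state tally is often easier to read than the full listing. The sketch below assumes the JSON produced by `gcloud batch jobs list --location=REGION --format=json`, in which each job carries a `name` of the form `projects/.../locations/.../jobs/JOB_ID` and a `status.state` such as `QUEUED`, `RUNNING`, `SUCCEEDED`, or `FAILED`. The helper names and the sample records are hypothetical.

```python
import json
from collections import Counter


def summarize_jobs(jobs_json: str) -> Counter:
    """Count Batch jobs per state from `gcloud batch jobs list --format=json`."""
    jobs = json.loads(jobs_json)
    return Counter(
        job.get("status", {}).get("state", "STATE_UNSPECIFIED") for job in jobs
    )


def failed_job_ids(jobs_json: str) -> list:
    """Return the short IDs of jobs whose state is FAILED."""
    return [
        job["name"].rsplit("/", 1)[-1]
        for job in json.loads(jobs_json)
        if job.get("status", {}).get("state") == "FAILED"
    ]


# Hypothetical sample output, trimmed to the fields used above.
sample = json.dumps([
    {"name": "projects/p/locations/us-central1/jobs/job-a", "status": {"state": "SUCCEEDED"}},
    {"name": "projects/p/locations/us-central1/jobs/job-b", "status": {"state": "RUNNING"}},
    {"name": "projects/p/locations/us-central1/jobs/job-c", "status": {"state": "FAILED"}},
    {"name": "projects/p/locations/us-central1/jobs/job-d", "status": {"state": "SUCCEEDED"}},
])
print(dict(summarize_jobs(sample)))
print(failed_job_ids(sample))
```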


In [None]:
# @title Check Job Status
# @markdown Run this cell to list your Google Batch jobs and check their status.
print(f"--- Fetching job status from region: {region} ---")
!gcloud batch jobs list --location={region}


### 7. Retrieve Results

Once the workflow is complete, your results will be located in the `results/` directory inside your GCS bucket (`gs://your-gcs-bucket-name/tucca-rna-seq/results/`). You can download them using the `gsutil` command or the Cloud Console UI.
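
As a small illustration of the download step, the sketch below builds a `gsutil -m cp -r` command for the `results/` prefix described above. The function name `results_download_command`, the bucket name, and the destination path are hypothetical. The command is returned as a list rather than executed, so it can be inspected first.

```python
import shlex


def results_download_command(bucket: str, dest: str = ".") -> list:
    """Build a gsutil command that copies the results directory to dest."""
    source = f"gs://{bucket}/tucca-rna-seq/results"
    # -m parallelizes the transfer; -r copies the directory recursively.
    return ["gsutil", "-m", "cp", "-r", source, dest]


# Hypothetical bucket name and destination, for illustration only.
print(shlex.join(results_download_command("my-lab-rnaseq", "./downloads")))
```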
