<picture>
  <source media="(prefers-color-scheme: dark)" srcset="../images/tucca-rna-seq-logo-white.png">
  <img src="../images/tucca-rna-seq-logo.png" alt="tucca-rna-seq logo" width="250">
</picture>

# `tucca-rna-seq` on Google Cloud Batch

This notebook orchestrates the `tucca-rna-seq` workflow on **Google Cloud Batch**, a fully managed service for running batch computing workloads at scale. Unlike `google_colab_runner.ipynb`, which runs the entire workflow on a single Colab instance, this notebook uses Colab as a lightweight client that submits jobs to Google Cloud Batch, where the computation actually runs.

### Key Features:
- **Massive Scalability**: Execute hundreds or thousands of jobs in parallel on powerful, custom-configured virtual machines.
- **Cost-Effective**: Pay only for the compute resources you use. Google Cloud Batch automatically provisions a VM for each job and shuts it down when the job finishes.
- **Centralized Storage**: Utilizes Google Cloud Storage (GCS) for all workflow data, ensuring it's accessible to all jobs.
- **Orchestration from Colab**: Use the familiar Colab interface to manage and monitor your large-scale analyses.

### How to Use:
1.  **Set up Google Cloud**: Ensure you have a Google Cloud project with billing enabled.
2.  **Authenticate**: Run the authentication cell to grant this notebook access to your Google Cloud account.
3.  **Install Dependencies**: Run the installation cell to set up Mamba, Snakemake, and the Google Cloud plugins.
4.  **Configure Storage**: Specify your GCS bucket and upload your data.
5.  **Execute the Workflow**: Run the final cell to submit the workflow to Google Cloud Batch.

Let's get started!


### 1. Authenticate with Google Cloud

This cell authenticates your Google account, allowing the notebook to interact with Google Cloud Batch and Google Cloud Storage on your behalf. You will be prompted to log in and authorize access.
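
A mistyped project ID tends to surface only later, as a confusing permissions or "not found" error. As a minimal sketch, the hypothetical helper below (`is_valid_project_id` is an illustration, not part of the workflow) checks the ID against the documented Google Cloud format: 6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen. It checks the format only; it does not confirm that the project exists, and it does not handle legacy domain-scoped IDs.

```python
import re

# Documented format for Google Cloud project IDs: 6-30 characters, lowercase
# letters, digits and hyphens; must start with a letter and must not end
# with a hyphen. (Legacy domain-scoped IDs are not covered by this pattern.)
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if project_id is syntactically valid (format check only)."""
    return _PROJECT_ID_RE.fullmatch(project_id) is not None


# Example: warn early instead of failing later inside gcloud or Snakemake.
candidate = "your-gcp-project-id"  # placeholder value from the cell below
if not is_valid_project_id(candidate):
    print(f"'{candidate}' does not look like a valid project ID")
```

Calling the helper on `project_id` right after setting it in the cell below is one way to catch typos before any cloud command runs.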


In [None]:
from google.colab import auth
import os

# Authenticate with Google Cloud
auth.authenticate_user()

# Set your Google Cloud project ID
project_id = 'your-gcp-project-id' # @param {type:"string"}
os.environ['GOOGLE_CLOUD_PROJECT'] = project_id

# Point gcloud at the project, then show the active account and project
!gcloud config set project {project_id}
!gcloud config list



### 2. Install Snakemake and Google Cloud Plugins

This cell installs `Mamba`, `Snakemake`, and the necessary plugins for interacting with Google Cloud Batch and Google Cloud Storage.
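
After the installation cell has run, it can be useful to confirm that the command-line tools used later in this notebook are actually on the `PATH`. The sketch below is one way to do that; `missing_tools` is a hypothetical helper name, and the tool list is an assumption based on the commands this notebook calls.

```python
import shutil


def missing_tools(names):
    """Return the subset of names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Tools the later cells call; an empty list means all of them were found.
print(missing_tools(["snakemake", "gcloud", "gsutil"]))
```

Because `shutil.which` reads the current `PATH` from `os.environ`, this check only sees Mamba-installed tools after the installation cell has prepended the Mamba `bin` directory.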


In [None]:
import os

# Define paths for Mamba installation
mamba_prefix = '/usr/local/mamba'

# Download and install Miniforge, which bundles Mamba
if not os.path.exists(os.path.join(mamba_prefix, 'bin', 'mamba')):
    print("Installing Mamba...")
    !wget -q "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" -O Miniforge3.sh
    !bash Miniforge3.sh -b -p {mamba_prefix}
    !rm Miniforge3.sh
else:
    print("Mamba is already installed.")

# Add Mamba to the system's PATH
os.environ['PATH'] = f"{mamba_prefix}/bin:{os.environ['PATH']}"

# Install Snakemake and Google Cloud plugins
print("Installing Snakemake and plugins...")
!mamba install -y -c conda-forge -c bioconda snakemake-minimal 'snakemake-executor-plugin-googlebatch>=0.3.0' 'snakemake-storage-plugin-gcs>=0.2.0'

print("Installation complete.")


### 3. Configure Google Cloud Storage

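This workflow requires all data to be stored in a Google Cloud Storage (GCS) bucket. The code cell below sets the bucket name and the remote working directory, and creates the bucket if it does not exist yet. Uploading your data is a separate step, described in the instructions.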

#### Instructions:
1.  **Create a GCS Bucket**: If you don't have one already, create a GCS bucket in your Google Cloud project.
2.  **Set the Bucket Name**: In the code cell below, replace `"your-gcs-bucket-name"` with the name of your bucket.
3.  **Upload Your Data**: Upload your raw sequencing files to a `raw_data` directory inside your bucket.
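
The upload in step 3 can be done from the Cloud Console or from the command line. As a rough sketch of the command-line route, the hypothetical helper below (`build_upload_cmd` is not part of the workflow) assembles a `gsutil rsync` command that copies the contents of a local directory into `raw_data` in the bucket. The local path `/content/raw_data` is an assumed example location.

```python
def build_upload_cmd(local_dir: str, bucket: str) -> list:
    """Build a gsutil command that syncs local_dir into gs://<bucket>/raw_data."""
    # -m parallelises the transfer; rsync -r recurses into subdirectories and
    # only copies files that are missing or changed at the destination.
    return ["gsutil", "-m", "rsync", "-r", local_dir, f"gs://{bucket}/raw_data"]


# Assumed example values; replace them with your own directory and bucket.
cmd = build_upload_cmd("/content/raw_data", "your-gcs-bucket-name")
print(" ".join(cmd))

# To actually run the upload:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list keeps paths with spaces intact when it is later passed to `subprocess.run`.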


In [None]:
# The GCS bucket where your data and results will be stored
gcs_bucket = "your-gcs-bucket-name" # @param {type:"string"}
remote_workdir = f"gs://{gcs_bucket}/tucca-rna-seq-workdir"

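# Create the bucket if it does not exist yet
# (checks the bucket itself, since the workdir prefix is empty before the first run)
!gsutil ls -b gs://{gcs_bucket} || gsutil mb -p {project_id} gs://{gcs_bucket}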



### 4. Execute the Workflow on Google Batch

This cell submits the Snakemake workflow to Google Batch. Snakemake will orchestrate the jobs from this notebook, but the actual computation will happen on dedicated VMs in the cloud.
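
The dry run and the full run should use exactly the same options, differing only in the dry-run flag. One way to guarantee that is to build the argument list in a single place, as in the sketch below. `build_snakemake_cmd` is a hypothetical helper, and the flag names assume Snakemake 8 with the `googlebatch` executor plugin and the `gcs` storage plugin installed.

```python
def build_snakemake_cmd(project_id, region, remote_workdir, jobs=999, dry_run=True):
    """Assemble the snakemake argument list for Google Cloud Batch."""
    cmd = [
        "snakemake",
        "--executor", "googlebatch",
        # Assumed Snakemake 8 option names for the default storage backend
        "--default-storage-provider", "gcs",
        "--default-storage-prefix", remote_workdir,
        "--googlebatch-project", project_id,
        "--googlebatch-region", region,
        "--jobs", str(jobs),
    ]
    if dry_run:
        cmd.append("--dry-run")  # long form of -n
    return cmd


# Placeholder values matching the defaults used elsewhere in this notebook.
dry = build_snakemake_cmd(
    "your-gcp-project-id", "us-central1",
    "gs://your-gcs-bucket-name/tucca-rna-seq-workdir",
)
print(" ".join(dry))
```

Passing `dry_run=False` yields the same list without the final flag, so the real submission cannot drift from what the dry run validated.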


In [None]:
# Your Google Cloud region
gcp_region = "us-central1" # @param {type:"string"}

# Perform a dry run first to validate the setup
print("--- Performing a dry run of the workflow on Google Batch ---")
!snakemake \
    --executor googlebatch \
    --default-storage-provider gcs \
    --default-storage-prefix {remote_workdir} \
    --googlebatch-project {project_id} \
    --googlebatch-region {gcp_region} \
    --jobs 999 \
    -n

# After verifying the dry run, uncomment the lines below to execute the full workflow
# print("\n--- Running the full workflow on Google Batch ---")
# !snakemake \
#     --executor googlebatch \
#     --default-storage-provider gcs \
#     --default-storage-prefix {remote_workdir} \
#     --googlebatch-project {project_id} \
#     --googlebatch-region {gcp_region} \
#     --jobs 999

