<picture>
  <source media="(prefers-color-scheme: dark)" srcset="../images/tucca-rna-seq-logo-white.png">
  <img src="../images/tucca-rna-seq-logo.png" alt="tucca-rna-seq logo" width="250">
</picture>

# `tucca-rna-seq` on Google Colab

**`tucca-cellag/tucca-rna-seq`** is a modular RNA-Seq workflow developed in the [Kaplan Lab at TUCCA](https://cellularagriculture.tufts.edu/) and is adaptable for most RNA-Seq projects.

Initially, this workflow was tailored for cellular agriculture research, focusing on the analysis of muscle and fat cell transcriptomes. Over time, it has been expanded into a flexible and modular tool suitable for a broad range of RNA-Seq applications. Its adaptable design allows for easy modification to fit various experimental needs beyond its original scope.

> **Colab Pro Tip**: For best performance, consider selecting a high-RAM runtime environment via `Runtime` -> `Change runtime type` in the Colab menu.

This notebook provides a self-contained environment to run the `tucca-rna-seq` workflow on Google Colab, eliminating the need for local installation. It automates the setup of all dependencies, including `mamba` and `Snakemake`, and runs the workflow using Conda. Once the workflow is complete, you can also launch interactive Shiny apps to visualize your results.

> [!WARNING]
> This workflow is still under construction. [Release v0.9.0](https://github.com/tucca-cellag/tucca-rna-seq/releases/tag/v0.9.0) marks our first public release. v0.9.0 contains all logic to process raw paired-end RNA-Seq reads through differential expression. The centerpiece of the v1.0.0 release will be an interactive analysis toolkit that allows you to dynamically explore and visualize your results.

### Key Features:
- **Zero Local Installation**: Runs the entire workflow in the cloud.
- **Persistent Caching**: Connects to your Google Drive to cache `conda` environments and `renv` packages, speeding up subsequent runs.
- **Full Functionality**: Executes the complete, unmodified workflow, ensuring reproducibility.

### How to Use:
1.  **Run the cells sequentially**: Execute each cell in order from top to bottom.
2.  **Authorize Google Drive access**: When prompted, grant permission for this notebook to access your Google Drive. This is required for caching.
3.  **Configure your analysis**: Modify the `config.yaml`, `samples.tsv`, and `units.tsv` files as needed for your specific experiment.
4.  **Execute the workflow**: Run the final cell to start the Snakemake pipeline.

Let's get started!


## Workflow Overview

<div align="center">
  <img alt="tucca-rna-seq workflow map" src="../images/tucca-rna-seq-workflow.png" width="700">
  <p>Created in <a href="https://BioRender.com">https://BioRender.com</a></p>
</div>

## Rulegraph

<div align="center">
  <img alt="tucca-rna-seq rulegraph" src="../images/rulegraph.png" width="700">
  <p>Created via <code>snakemake --rulegraph</code></p>
</div>


### 1. Mount Google Drive and Set Up Caching

This cell mounts your Google Drive to the Colab environment and sets up directories for caching `conda` packages and environments. This allows for persistent storage, so you don't have to reinstall everything every time.


In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create a symlink for easier access to the project folder on Google Drive
# This helps in creating a consistent path for the project.
project_root = '/content/tucca-rna-seq-dev'
gdrive_project_path = '/content/drive/MyDrive/Colab_Workspaces/tucca-rna-seq-dev'

if not os.path.exists(gdrive_project_path):
    os.makedirs(gdrive_project_path, exist_ok=True)
    print(f"Created Google Drive project directory: {gdrive_project_path}")

if not os.path.lexists(project_root):
    os.symlink(gdrive_project_path, project_root, target_is_directory=True)
    print(f"Symlinked {gdrive_project_path} to {project_root}")

# Change the current working directory to the project root
# This ensures that all subsequent commands are run from the correct directory.
os.chdir(project_root)
print(f"Changed working directory to: {os.getcwd()}")
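
If Drive fails to mount, the symlink can still exist while pointing at nothing. A minimal sketch of a sanity check, using only the standard library; the helper name `check_project_link` is hypothetical and not part of the workflow:

```python
import os

def check_project_link(link_path, expected_target):
    """Return True if link_path is a live symlink that resolves to expected_target."""
    # islink() is True even for dangling links; isdir() follows the link,
    # so it is False when the Drive folder behind it is missing.
    if not (os.path.islink(link_path) and os.path.isdir(link_path)):
        return False
    return os.path.realpath(link_path) == os.path.realpath(expected_target)

# Paths copied from the mount cell; outside Colab this simply prints False.
ok = check_project_link(
    '/content/tucca-rna-seq-dev',
    '/content/drive/MyDrive/Colab_Workspaces/tucca-rna-seq-dev',
)
print(f"Project symlink healthy: {ok}")
```

If this prints `False`, re-running the mount cell before continuing is usually the first thing to try.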


### 2. Clone the Workflow Repository

This cell clones the `tucca-rna-seq` workflow from its GitHub repository into your Google Drive, ensuring you have the latest version of the code.


In [None]:
import os

# Define the path to the tucca-rna-seq repository
repo_path = 'tucca-rna-seq'
repo_url = 'https://github.com/tucca-cellag/tucca-rna-seq.git'
gdrive_repo_path = f'/content/drive/MyDrive/Colab_Workspaces/tucca-rna-seq-dev/{repo_path}'

# Check if the repository is already cloned
if not os.path.exists(gdrive_repo_path):
    print("Cloning the tucca-rna-seq repository...")
    # Use a git clone command to download the repository
    !git clone {repo_url} {gdrive_repo_path}
else:
    print("Repository already cloned. Pulling latest changes...")
    # Pull in place with `git -C` so the working directory is left untouched
    !git -C {gdrive_repo_path} pull

# Navigate into the workflow directory for subsequent commands
workflow_dir = gdrive_repo_path
os.chdir(workflow_dir)
print(f"Current working directory: {os.getcwd()}")
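
Because this cell fetches the latest code on every run, it can help to record which commit an analysis actually used. The sketch below is one way to do that; `current_commit` is a hypothetical helper, and it relies only on the standard `git rev-parse` command:

```python
import os
import subprocess

def current_commit(repo_dir):
    """Return the short commit hash checked out in repo_dir, or None if unavailable."""
    # Only look at directories that contain their own .git folder.
    if not os.path.isdir(os.path.join(repo_dir, '.git')):
        return None
    try:
        result = subprocess.run(
            ['git', '-C', repo_dir, 'rev-parse', '--short', 'HEAD'],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the repository has no commits yet.
        return None
    return result.stdout.strip()

# After the clone cell, the working directory is the repository itself.
print(f"Workflow commit: {current_commit(os.getcwd())}")
```

Noting the printed hash alongside your results makes it easier to reproduce a run later.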


### 3. Install Mamba, Snakemake, and Set Up Caching

This cell installs `Mamba` (a fast package manager) and `Snakemake`. It also configures caching to your Google Drive to avoid re-installing dependencies in future sessions.


In [None]:
import os

# Define paths for Conda and renv caching on Google Drive
gdrive_cache_path = '/content/drive/MyDrive/Colab_Workspaces/tucca-rna-seq-dev/.cache'
mamba_prefix = os.path.join(gdrive_cache_path, 'mamba')
renv_cache_home = os.path.join(gdrive_cache_path, 'renv')
conda_pkgs_dirs = os.path.join(mamba_prefix, 'pkgs')
renv_cache_path = os.path.join(renv_cache_home, 'v5')

# Create cache directories if they don't exist
os.makedirs(mamba_prefix, exist_ok=True)
os.makedirs(renv_cache_path, exist_ok=True)
os.makedirs(conda_pkgs_dirs, exist_ok=True)

# Set environment variables for caching
os.environ['CONDA_PKGS_DIRS'] = conda_pkgs_dirs
os.environ['RENV_PATHS_CACHE'] = renv_cache_path

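# Download and install Mamba via the Miniforge installer, which bundles mamba
if not os.path.exists(os.path.join(mamba_prefix, 'bin', 'mamba')):
    print("Installing Mamba...")
    # Download the Miniforge3 installer (the separate Mambaforge installers have been retired)
    !wget -q "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" -O Miniforge3.sh
    # Install to the specified path; -u is needed because the prefix directory
    # was already created above, and the installer refuses an existing path without it
    !bash Miniforge3.sh -b -u -p {mamba_prefix}
    # Clean up the installer
    !rm Miniforge3.sh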
else:
    print("Mamba is already installed.")

# Add Mamba to the system's PATH
os.environ['PATH'] = f"{mamba_prefix}/bin:{os.environ['PATH']}"

# Install Snakemake using Mamba
print("Installing Snakemake...")
!mamba install -y -c conda-forge -c bioconda 'snakemake>=8.14.0'

print("Installation complete.")
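
To confirm the install worked, one option is to check that `snakemake` is on `PATH` and meets the version pin. This is a sketch under the assumption that `snakemake --version` prints a plain `X.Y.Z` string; the helper names are hypothetical:

```python
import re
import shutil
import subprocess

MINIMUM = (8, 14, 0)  # mirrors the 'snakemake>=8.14.0' pin in the install cell

def version_tuple(text):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints, or None."""
    match = re.match(r'\s*(\d+)\.(\d+)\.(\d+)', text)
    return tuple(int(part) for part in match.groups()) if match else None

def snakemake_version():
    """Return the installed Snakemake version as a tuple, or None if not on PATH."""
    exe = shutil.which('snakemake')
    if exe is None:
        return None
    result = subprocess.run([exe, '--version'], capture_output=True, text=True)
    return version_tuple(result.stdout)

found = snakemake_version()
if found is None:
    print("snakemake was not found on PATH (or its version could not be parsed).")
elif found < MINIMUM:
    print(f"snakemake {found} is older than the required {MINIMUM}.")
else:
    print(f"snakemake {found} satisfies the minimum version.")
```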


### 4. Upload Your Data

Before you configure the analysis, you need to upload your raw sequencing data.

#### Instructions:
1.  **Open the File Browser**: Click on the folder icon on the left sidebar.
2.  **Navigate to the data directory**: Go to `tucca-rna-seq` -> `data` -> `raw_data`.
3.  **Upload your files**: Drag and drop your raw sequencing files (e.g., `.fastq.gz`) into this directory.

> **Pro Tip for Large Datasets**: If you have large files, the drag-and-drop method can be slow. A more robust approach is to first upload your data to a Google Cloud Storage (GCS) bucket and then copy it into the Colab environment using the `gsutil` command. You can add a code cell and run `!gsutil -m rsync -r gs://your-bucket-name/path/to/data .` to do this efficiently.

Once your data is uploaded, you can proceed to the next step to configure your analysis.
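
Since the workflow processes paired-end reads, a quick check that every uploaded file has its mate can catch incomplete uploads early. The sketch below assumes a `_R1`/`_R2` naming convention, which may not match your files; `pair_fastqs` is a hypothetical helper:

```python
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1<anything>.fastq.gz and its _R2 mate.
MATE_PATTERN = re.compile(r'^(?P<stem>.+)_R(?P<read>[12])(?P<rest>.*)\.fastq\.gz$')

def pair_fastqs(filenames):
    """Group file names into (R1, R2) pairs; return (pairs, problems)."""
    mates = {}
    problems = []
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match is None:
            problems.append(name)  # not a recognisable FASTQ mate
            continue
        mates.setdefault(match['stem'], {})[match['read']] = name
    pairs = {}
    for stem, reads in mates.items():
        if set(reads) == {'1', '2'}:
            pairs[stem] = (reads['1'], reads['2'])
        else:
            problems.extend(reads.values())  # the other mate is missing
    return pairs, sorted(problems)

raw_dir = Path('data/raw_data')  # relative to the repository root
names = [p.name for p in raw_dir.iterdir()] if raw_dir.is_dir() else []
pairs, problems = pair_fastqs(names)
print(f"{len(pairs)} complete pair(s); {len(problems)} file(s) need attention: {problems}")
```

Adjust `MATE_PATTERN` if your sequencing provider uses a different suffix such as `_1`/`_2`.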


### 5. Configure Your Analysis

Now that the environment is set up, it's time to configure the workflow for your specific analysis. You can edit the necessary configuration files directly in the Colab file browser.

#### Instructions:
1.  **Open the File Browser**: Click on the folder icon on the left sidebar to open the file browser.
2.  **Navigate to the `config` directory**: Go to `tucca-rna-seq` -> `config`.
3.  **Edit the configuration files**:
    *   `config.yaml`: This is the main configuration file where you define parameters for the analysis, such as the reference genome, differential expression settings, and pathway analysis options.
    *   `samples.tsv`: This file defines the biological samples in your experiment.
    *   `units.tsv`: This file defines the sequencing units (e.g., technical replicates) for each sample.

You can double-click on these files to open them in the editor and make your changes. Once you have saved your changes, you can proceed to the next step to run the workflow.
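
A common configuration mistake is a sample listed in one table but not the other. As a rough sketch, the two files can be cross-checked with the standard `csv` module. The shared column name `sample_name` is an assumption here, so check the headers of your own `samples.tsv` and `units.tsv`; both helper functions are hypothetical:

```python
import csv
import io

def column_values(tsv_text, column):
    """Return the set of values found in one column of tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header {reader.fieldnames}")
    return {row[column] for row in reader}

def cross_check(samples_text, units_text, column='sample_name'):
    """Return (samples lacking units, units referring to unknown samples)."""
    samples = column_values(samples_text, column)
    units = column_values(units_text, column)
    return sorted(samples - units), sorted(units - samples)

# Toy tables; the column names here are assumed, not taken from the workflow.
samples_text = "sample_name\tcondition\nA\tcontrol\nB\ttreated\n"
units_text = "sample_name\tunit_name\nA\tlane1\nC\tlane1\n"
print(cross_check(samples_text, units_text))  # prints (['B'], ['C'])
```

To run it on real files, read them with `Path('config/samples.tsv').read_text()` and the matching call for `units.tsv`.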


### 6. Execute the Workflow

This cell runs the Snakemake workflow. It starts with a dry run (`-n`) to display the jobs that will be executed, allowing you to verify the configuration before starting the full analysis.


### Important: Colab Pro and Resource Limits

The free version of Google Colab provides limited computational resources and session durations. For long-running or resource-intensive analyses, it is highly recommended to use **Colab Pro**.
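
To see what the current runtime actually provides before choosing a core count, a small standard-library check along these lines can help. `runtime_resources` is a hypothetical helper, and the memory query is POSIX-only:

```python
import os

def runtime_resources():
    """Return CPU count and total RAM in GiB (RAM is None if it cannot be read)."""
    cpus = os.cpu_count() or 1
    try:
        # POSIX-only query; Colab runtimes are Linux, so it is expected to work there.
        total_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
        ram_gib = round(total_bytes / 1024**3, 1)
    except (AttributeError, ValueError, OSError):
        ram_gib = None
    return {'cpus': cpus, 'ram_gib': ram_gib}

print(runtime_resources())
```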

#### Preventing Session Timeouts
To prevent your Colab session from timing out during a long run, you can run the following JavaScript code in your browser's developer console. It clicks the "Connect" button every 60 seconds to keep the session active.

```javascript
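function ClickConnect() {
  console.log("Working");
  // Guard against the button being absent (e.g. if Colab's UI changes)
  const button = document.querySelector("colab-connect-button");
  if (button) button.click();
}
setInterval(ClickConnect, 60000);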
```


In [None]:
# @title Execute Workflow
# @markdown Use the form below to specify the number of cores for Snakemake.
cores = 2 # @param {type:"integer"}

import os

# Define the path to the Conda environments, using the cache on Google Drive
conda_prefix = os.path.join(os.environ['CONDA_PKGS_DIRS'], 'env')

# Perform a dry run of the workflow to validate the setup
print("--- Performing a dry run of the workflow ---")
!snakemake --use-conda --conda-prefix {conda_prefix} --cores {cores} -n

# After verifying the dry run, execute the full workflow by uncommenting
# the command below (the same command without the `-n` dry-run flag).
print("\n--- To run the full workflow, uncomment and run the following command ---")
# !snakemake --use-conda --conda-prefix {conda_prefix} --cores {cores} --verbose
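
As an alternative to editing comments in and out, the command can be built in Python so that the dry run and the full run differ only by one argument. This is a sketch using the same flags as the cell above; `snakemake_command` is a hypothetical helper and the prefix shown is illustrative:

```python
import shlex

def snakemake_command(conda_prefix, cores, dry_run=True):
    """Build the Snakemake argument list used in the execute cell."""
    command = [
        'snakemake',
        '--use-conda',
        '--conda-prefix', str(conda_prefix),
        '--cores', str(cores),
    ]
    if dry_run:
        command.append('-n')
    return command

# Hypothetical prefix for illustration; the notebook derives its own from CONDA_PKGS_DIRS.
example = snakemake_command('/tmp/conda-envs', cores=2, dry_run=True)
print(shlex.join(example))
```

Passing the returned list to `subprocess.run(command, check=True)` would raise an exception if Snakemake exits with an error, which makes failures harder to miss in a long session.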


### 7. Visualize Results Locally

Once your Snakemake workflow has completed successfully, all output files are saved within the `results/` and `resources/` directories in your project folder. The `tucca-rna-seq` repository includes powerful, interactive R-based playgrounds to explore these results locally using RStudio.

#### Instructions:

1.  **Download Your Project Folder**:
    *   After the workflow completes, use the file browser on the left panel of Colab.
    *   Navigate to the location of your `tucca-rna-seq` repository (typically inside your Google Drive at `drive/MyDrive/Colab_Workspaces/tucca-rna-seq-dev/`).
    *   Right-click on the `tucca-rna-seq` folder and select **Download**. This will zip the entire project and save it to your computer.
    *   Unzip the downloaded file.

2.  **Analyze Locally**:
    *   Open the `tucca-rna-seq` folder you just downloaded using RStudio.
    *   Navigate to the `analysis/` directory.
    *   Open `GeneTonic_playground.Rmd` or `pcaExplorer_playground.Rmd`.
    *   Follow the instructions within the RMarkdown files to run the interactive visualizations.

This approach ensures you have all the necessary data and resources to use the full power of RStudio for an in-depth analysis.
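
If downloading a whole folder through the file browser is slow or unavailable, one alternative is to zip the project in a code cell first. The sketch below uses `shutil.make_archive` from the standard library; `zip_project` is a hypothetical helper and the paths in the comment are illustrative:

```python
import shutil
from pathlib import Path

def zip_project(project_dir, archive_base):
    """Zip project_dir into '<archive_base>.zip' and return the archive path."""
    project_dir = Path(project_dir)
    if not project_dir.is_dir():
        raise NotADirectoryError(f"{project_dir} is not a directory")
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        str(archive_base), 'zip',
        root_dir=str(project_dir.parent), base_dir=project_dir.name,
    )

# Example call in Colab; keep archive_base outside project_dir so the
# archive does not try to include itself:
# zip_path = zip_project('/content/tucca-rna-seq-dev/tucca-rna-seq',
#                        '/content/tucca-rna-seq-results')
```

In Colab, the resulting file can then be downloaded from the file browser or with `files.download(zip_path)` from `google.colab`.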
