# Downstream Analysis

This notebook should be used for downstream analysis of your OPS screen.
Cells marked with <font color='red'>SET PARAMETERS</font> contain crucial variables that need to be set according to your specific experimental setup and data organization.
Please review and modify these variables as needed before proceeding with the analysis.

## <font color='red'>SET PARAMETERS</font>

### Fixed parameters for cluster module

- `CONFIG_FILE_PATH`: Path to the Brieflow config file used during processing. Can be absolute or relative to the directory from which workflows are run.

In [None]:
CONFIG_FILE_PATH = "config/config.yml"

In [None]:
from pathlib import Path
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

import yaml
import pandas as pd
import numpy as np
import os

In [None]:
# load config file and determine root path
with open(CONFIG_FILE_PATH, "r") as config_file:
    config = yaml.safe_load(config_file)
    ROOT_FP = Path(config["all"]["root_fp"])

# load cell classes and channel combos
cluster_combo_fp = config["cluster"]["cluster_combo_fp"]
cluster_combos = pd.read_csv(cluster_combo_fp, sep="\t")

CHANNEL_COMBOS = list(cluster_combos["channel_combo"].unique())
print(f"Channel Combos: {CHANNEL_COMBOS}")

CELL_CLASSES = list(cluster_combos["cell_class"].unique())
print(f"Cell classes: {CELL_CLASSES}")

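# plural name, matching CHANNEL_COMBOS and CELL_CLASSES, so the list of
# available options is not confused with the single LEIDEN_RESOLUTION chosen later
LEIDEN_RESOLUTIONS = list(cluster_combos["leiden_resolution"].unique())
print(f"Leiden resolutions: {LEIDEN_RESOLUTIONS}")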

## <font color='red'>SET PARAMETERS</font>

### Cluster Selection for Analysis

Set these parameters to select the specific cluster data to analyze:
- `CHANNEL_COMBO`: Select from available channel combinations
- `CELL_CLASS`: Select from available cell classes
- `LEIDEN_RESOLUTION`: Select from available Leiden resolutions

These parameters determine which folder of cluster data will be analyzed.

In [None]:
CHANNEL_COMBO = None
CELL_CLASS = None
LEIDEN_RESOLUTION = None

In [None]:
cluster_path = ROOT_FP / "cluster" / CHANNEL_COMBO / CELL_CLASS / str(LEIDEN_RESOLUTION)
print(f"Cluster path: {cluster_path}")

if not cluster_path.exists():
    print(f"Cluster directory does not exist: {cluster_path}")
else:
    print("Cluster directory found")
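
If the selection parameters are left as `None`, the path construction above fails with a `TypeError` that does not say which parameter is missing. The following is a minimal sketch of a check that could run first. The helper names `validate_selection` and `build_cluster_path` and the example values are hypothetical and are not part of Brieflow.

```python
from pathlib import Path


def validate_selection(name, value, available):
    """Raise a clear error if `value` is unset or not among the available options."""
    if value is None:
        raise ValueError(f"{name} is not set; choose one of {available}")
    if value not in available:
        raise ValueError(f"{name}={value!r} is not one of {available}")
    return value


def build_cluster_path(root_fp, channel_combo, cell_class, leiden_resolution):
    """Compose the cluster folder path in the same way as the cell above."""
    return Path(root_fp) / "cluster" / channel_combo / cell_class / str(leiden_resolution)


# Hypothetical example values, for illustration only
example_combos = ["DAPI_TUBULIN", "DAPI_ACTIN"]
validate_selection("CHANNEL_COMBO", "DAPI_TUBULIN", example_combos)
example_path = build_cluster_path("analysis_root", "DAPI_TUBULIN", "interphase", 8)
print(example_path)
```

In the notebook, the same check would be applied to `CHANNEL_COMBO`, `CELL_CLASS`, and `LEIDEN_RESOLUTION` using the lists printed from the cluster combo file.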

# Mozzarellm: LLM-based Gene Cluster Analysis

## Overview
[Mozzarellm](https://github.com/cheeseman-lab/mozzarellm) is a Python package that leverages Large Language Models (LLMs) to analyze gene clusters for pathway identification and novel gene discovery. This notebook guides you through the process of:

1. **Loading and reshaping gene cluster data** from your OPS screen
2. **Analyzing gene clusters with LLMs** to identify biological pathways
3. **Categorizing genes** as established pathway members, uncharacterized, or having novel potential roles
4. **Prioritizing candidates** for experimental validation

## Prerequisites

### Package Installation
You need to install the mozzarellm package in your Brieflow environment:

```bash
pip install git+https://github.com/cheeseman-lab/mozzarellm.git
```

### API Keys
Mozzarellm requires API keys to access LLM services. You need at least one of these keys:

- **OpenAI API Key**: Required for GPT models (gpt-4o, gpt-4.5, etc.)
- **Anthropic API Key**: Required for Claude models (claude-3-7-sonnet, etc.)
- **Google API Key**: Required for Gemini models (gemini-2.0-pro, etc.)

These keys give access to paid API services, and usage incurs costs based on the number of tokens processed. The cost per analysis varies by model but typically ranges from $0.01 to $0.10 per cluster, depending on cluster size and model choice. For this reason, we run these analyses only on one chosen Leiden resolution rather than on every generated resolution.
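
As a rough worked example of the arithmetic, the per-cluster range quoted above can be scaled by the number of clusters at the chosen resolution. The function name `estimate_cost_range` and the cluster count of 50 are illustrative assumptions. Actual costs depend on the provider's current pricing.

```python
def estimate_cost_range(n_clusters, low_per_cluster=0.01, high_per_cluster=0.10):
    """Return (low, high) estimated USD cost for analyzing `n_clusters` clusters."""
    return n_clusters * low_per_cluster, n_clusters * high_per_cluster


# Hypothetical cluster count, for illustration only
low, high = estimate_cost_range(50)
print(f"Estimated cost for 50 clusters: ${low:.2f} to ${high:.2f}")
```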

In [None]:
from mozzarellm import analyze_gene_clusters, reshape_to_clusters

In [None]:
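# Set API keys: replace the placeholders for the provider(s) you plan to use.
# Avoid committing real keys to version control.
os.environ["OPENAI_API_KEY"] = "your_openai_key_here"
os.environ["ANTHROPIC_API_KEY"] = "your_anthropic_key_here"
os.environ["GOOGLE_API_KEY"] = "your_google_key_here"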
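
Before starting a paid run, it can help to confirm which providers actually have a key configured. The sketch below assumes that any value beginning with `your_` is still a placeholder. The helper `configured_providers` is hypothetical and not part of mozzarellm.

```python
import os

# Environment variable names taken from the cell above
KEY_VARS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
}


def configured_providers(env=os.environ):
    """Return providers whose key is set and is not an obvious placeholder."""
    found = []
    for provider, var in KEY_VARS.items():
        value = env.get(var, "")
        if value and not value.startswith("your_"):
            found.append(provider)
    return found


print(f"Providers with keys configured: {configured_providers()}")
```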

In [None]:
RESULTS_DIR = cluster_path / "mozzarellm_analysis"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
print(f"Results will be saved to: {RESULTS_DIR}")

In [None]:
cluster_file = cluster_path / "phate_leiden_clustering.tsv"
cluster_df = pd.read_csv(cluster_file, sep="\t")
display(cluster_df)

## <font color='red'>SET PARAMETERS</font>

### Data Reshaping

Configure gene clustering parameters:
- `GENE_COL`: Column containing gene identifiers
- `CLUSTER_COL`: Column containing cluster assignments
- `UNIPROT_COL`: Column with UniProt annotations

These parameters control how gene-level data is converted to cluster-level data.
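
To illustrate what converting gene-level data to cluster-level data means, here is a minimal pandas sketch that collapses one row per gene into one row per cluster. It is not mozzarellm's implementation, and the output layout of `reshape_to_clusters` may differ. The table, the column name `gene_symbol_0`, and the helper `genes_per_cluster` are hypothetical.

```python
import pandas as pd

# Hypothetical gene-level table; in the notebook the column names
# come from GENE_COL and CLUSTER_COL
gene_df = pd.DataFrame(
    {
        "gene_symbol_0": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "cluster": [0, 0, 1, 1],
    }
)


def genes_per_cluster(df, gene_col, cluster_col):
    """Collapse one row per gene into one row per cluster."""
    grouped = df.groupby(cluster_col)[gene_col].agg(
        genes=lambda s: ";".join(sorted(s.astype(str).unique())),
        n_genes="nunique",
    )
    return grouped.reset_index()


cluster_level = genes_per_cluster(gene_df, "gene_symbol_0", "cluster")
print(cluster_level)
```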

In [None]:
# Set up parameters for reshape and analysis - adjust based on your dataset
GENE_COL = config["aggregate"]["perturbation_name_col"]
CLUSTER_COL = "cluster"
UNIPROT_COL = "uniprot_function"

In [None]:
llm_cluster_df, llm_uniprot_df = reshape_to_clusters(
    input_df=cluster_df, 
    gene_col=GENE_COL,
    cluster_col=CLUSTER_COL,
    uniprot_col=UNIPROT_COL, 
    verbose=True
)
display(llm_cluster_df)
display(llm_uniprot_df)

## <font color='red'>SET PARAMETERS</font>

### LLM Analysis Configuration

- `MODEL_NAME`: LLM to use for analysis. Usable models include:
  - OpenAI: `o4-mini`, `o3-mini`, `gpt-4.1`, `gpt-4o`
  - Anthropic: `claude-3-7-sonnet-20250219`, `claude-3-5-haiku-20241022`
  - Google: `gemini-2.5-pro-preview-03-25`, `gemini-2.5-flash-preview-04-17`
- `CONFIG_DICT`: Configuration dictionary for the chosen LLM provider
- `SCREEN_CONTEXT`: Context for the analysis and how to evaluate _clusters_
- `CLUSTER_ANALYSIS_PROMPT`: Context for the analysis and how to evaluate _genes within clusters_

Mozzarellm includes optimized [configurations](https://github.com/cheeseman-lab/mozzarellm/blob/main/mozzarellm/configs.py) and [prompts](https://github.com/cheeseman-lab/mozzarellm/blob/main/mozzarellm/prompts.py) you can import as shown below.

To use custom text files instead, set the `screen_context_path` and `cluster_analysis_prompt_path` parameters.

In [None]:
from mozzarellm.prompts import ROBUST_SCREEN_CONTEXT, ROBUST_CLUSTER_PROMPT
from mozzarellm.configs import DEFAULT_ANTHROPIC_CONFIG

In [None]:
# Set up model configs
MODEL_NAME = "claude-3-7-sonnet-20250219"
CONFIG_DICT = DEFAULT_ANTHROPIC_CONFIG
SCREEN_CONTEXT = ROBUST_SCREEN_CONTEXT
CLUSTER_ANALYSIS_PROMPT = ROBUST_CLUSTER_PROMPT

In [None]:
# Run LLM analysis with Anthropic
anthropic_results = analyze_gene_clusters(
    # Input data options
    input_df=llm_cluster_df,
    # Model and configuration
    model_name=MODEL_NAME,
    config_dict=CONFIG_DICT,
    # Analysis context and prompts
    screen_context=SCREEN_CONTEXT,
    cluster_analysis_prompt=CLUSTER_ANALYSIS_PROMPT,
    # Gene annotations
    gene_annotations_df=llm_uniprot_df,
    # Processing options
    batch_size=1,
    # Output options
    output_file=f"{RESULTS_DIR}/{MODEL_NAME}",
    save_outputs=True,
    outputs_to_generate=["json", "clusters", "flagged_genes"],
)

# Feature Plot Analysis

Feature plots visualize the phenotypic effects of gene knockdowns in your OPS screen and help identify patterns, correlations, and outliers in your data. While the LLM analysis identifies biological pathways and gene functions at a high level, feature plots reveal the specific phenotypic changes caused by individual gene perturbations. This section demonstrates how to create four types of visualizations for specific genes and features:

### 1. Waterfall Plots
Waterfall plots rank genes by their effect on a single feature, creating a cascade visualization that highlights the genes with the strongest positive or negative effects. They are useful for:
- Identifying top hits for a phenotype of interest
- Comparing the magnitude of effects across genes
- Visualizing the distribution of effects across the entire dataset
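
A minimal sketch of a waterfall plot, assuming matplotlib is available and the data sit in a table with one row per gene. The data are synthetic, and the column names `gene` and `nucleus_area_zscore` and the helper `waterfall_plot` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def waterfall_plot(df, gene_col, feature, n_label=3):
    """Rank genes by `feature` and label the strongest effects at each end."""
    ranked = df.sort_values(feature).reset_index(drop=True)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(ranked.index, ranked[feature], s=10, color="grey")
    # Highlight and label the n_label lowest and n_label highest genes
    extremes = pd.concat([ranked.head(n_label), ranked.tail(n_label)])
    ax.scatter(extremes.index, extremes[feature], s=20, color="red")
    for rank, row in extremes.iterrows():
        ax.annotate(row[gene_col], (rank, row[feature]), fontsize=8)
    ax.set_xlabel("Gene rank")
    ax.set_ylabel(feature)
    return ranked, ax


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
waterfall_df = pd.DataFrame(
    {
        "gene": [f"GENE_{i}" for i in range(200)],
        "nucleus_area_zscore": rng.normal(size=200),
    }
)
ranked_genes, waterfall_ax = waterfall_plot(waterfall_df, "gene", "nucleus_area_zscore")
```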

### 2. Two-Feature Plots
Two-feature plots display the relationship between two different phenotypic measurements across genes. They are useful for:
- Discovering correlations between different cellular phenotypes
- Identifying genes that affect multiple features in interesting ways
- Grouping genes with similar phenotypic profiles
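
A minimal sketch of a two-feature plot, assuming matplotlib is available. It scatters one feature against another, reports the Pearson correlation, and labels selected genes. The data are synthetic, and the column names and the helper `two_feature_plot` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def two_feature_plot(df, gene_col, feature_x, feature_y, highlight=()):
    """Scatter two features across genes and return their Pearson correlation."""
    r = float(np.corrcoef(df[feature_x], df[feature_y])[0, 1])
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(df[feature_x], df[feature_y], s=12, color="grey")
    # Label only the genes the caller asked to highlight
    selected = df[df[gene_col].isin(list(highlight))]
    ax.scatter(selected[feature_x], selected[feature_y], s=24, color="red")
    for _, row in selected.iterrows():
        ax.annotate(row[gene_col], (row[feature_x], row[feature_y]), fontsize=8)
    ax.set_xlabel(feature_x)
    ax.set_ylabel(feature_y)
    ax.set_title(f"Pearson r = {r:.2f}")
    return r, ax


# Synthetic, deliberately correlated data, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=100)
scatter_df = pd.DataFrame(
    {
        "gene": [f"GENE_{i}" for i in range(100)],
        "cell_area_zscore": x,
        "nucleus_area_zscore": 0.8 * x + rng.normal(scale=0.5, size=100),
    }
)
pearson_r, scatter_ax = two_feature_plot(
    scatter_df, "gene", "cell_area_zscore", "nucleus_area_zscore", highlight=["GENE_0"]
)
```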

### 3. Volcano Plots
Volcano plots combine effect size (fold change) and statistical significance (p-value) in a single visualization. They are useful for:
- Distinguishing between statistically significant and biologically relevant effects
- Identifying genes with both strong and reliable phenotypic changes
- Establishing appropriate thresholds for hit selection
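
A minimal sketch of a volcano plot, assuming matplotlib is available. Genes are plotted by effect size against `-log10(p)`, and a gene counts as a hit only if it passes both cutoffs. The data, the column names, the default cutoffs, and the helper `volcano_plot` are hypothetical and should be adapted to your screen.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def volcano_plot(df, effect_col, pval_col, effect_cutoff=1.0, pval_cutoff=0.05):
    """Plot effect size against -log10(p) and return a boolean Series marking hits."""
    # Clip p-values away from zero so the log transform stays finite
    neg_log_p = -np.log10(df[pval_col].clip(lower=1e-300))
    is_hit = (df[effect_col].abs() >= effect_cutoff) & (df[pval_col] <= pval_cutoff)
    fig, ax = plt.subplots(figsize=(5, 4))
    ax.scatter(df.loc[~is_hit, effect_col], neg_log_p[~is_hit], s=10, color="grey")
    ax.scatter(df.loc[is_hit, effect_col], neg_log_p[is_hit], s=14, color="red")
    # Dashed lines mark the significance and effect-size cutoffs
    ax.axhline(-np.log10(pval_cutoff), linestyle="--", linewidth=0.8, color="black")
    ax.axvline(effect_cutoff, linestyle="--", linewidth=0.8, color="black")
    ax.axvline(-effect_cutoff, linestyle="--", linewidth=0.8, color="black")
    ax.set_xlabel(effect_col)
    ax.set_ylabel("-log10(p-value)")
    return is_hit, ax


# Hypothetical values, for illustration only
volcano_df = pd.DataFrame(
    {
        "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "log2_fold_change": [2.1, 0.2, -1.8, 1.5],
        "p_value": [0.001, 0.0005, 0.01, 0.4],
    }
)
hit_mask, volcano_ax = volcano_plot(volcano_df, "log2_fold_change", "p_value")
print(volcano_df.loc[hit_mask, "gene"].tolist())
```

In this example, `GENE_B` is significant but has a small effect, and `GENE_D` has a large effect but is not significant, so neither is selected.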

### 4. Heatmaps
Heatmaps visualize multiple features across multiple genes simultaneously, providing a comprehensive view of phenotypic signatures. They are useful for:
- Revealing patterns across large sets of genes and features
- Identifying clusters of genes with similar phenotypic profiles
- Comparing the effects of gene perturbations across different cell compartments or processes
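
A minimal sketch of a heatmap, assuming matplotlib is available and the feature values are already normalized (for example, z-scores). The color scale is centered on zero so that positive and negative effects are visually symmetric. The data, the column names, and the helper `feature_heatmap` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def feature_heatmap(df, gene_col, feature_cols):
    """Draw a genes-by-features heatmap with a color scale centered on zero."""
    matrix = df.set_index(gene_col)[feature_cols]
    limit = float(np.abs(matrix.to_numpy()).max())
    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(
        matrix.to_numpy(), cmap="RdBu_r", vmin=-limit, vmax=limit, aspect="auto"
    )
    ax.set_xticks(range(len(feature_cols)))
    ax.set_xticklabels(feature_cols, rotation=45, ha="right")
    ax.set_yticks(range(len(matrix.index)))
    ax.set_yticklabels(matrix.index)
    fig.colorbar(image, ax=ax, label="Feature value")
    return matrix, ax


# Hypothetical z-scored values, for illustration only
heatmap_df = pd.DataFrame(
    {
        "gene": ["GENE_A", "GENE_B", "GENE_C"],
        "nucleus_area": [1.2, -0.4, 0.1],
        "cell_area": [0.9, -0.2, 0.3],
        "dapi_intensity": [-1.5, 0.6, 0.0],
    }
)
heatmap_matrix, heatmap_ax = feature_heatmap(
    heatmap_df, "gene", ["nucleus_area", "cell_area", "dapi_intensity"]
)
```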

These visualizations are created interactively in this notebook rather than through the automated Snakemake pipeline because the number of possible plot combinations is enormous (genes × features × plot types). Different analyses require different combinations based on your specific biological questions, and interactive exploration allows you to focus on the most interesting results from your study.