# A1 Agent Testing Notebook
This notebook demonstrates the A1 agent working with several biomedical databases, starting with CRISPR screen planning and continuing with drug and ontology analyses.

In [3]:
# Import required libraries
from biomni.agent import A1


In [4]:
# Initialize the A1 agent
agent = A1(
    path='./data',
    llm='gpt-4.1',
    source='OpenAI',
    load_datalake=False
)

Skipping datalake download (load_datalake=False)
Note: Some tools may require datalake files to function properly.


## Test 1: CRISPR Screen Planning
Test the agent's ability to plan a CRISPR screen for T cell exhaustion genes.

In [7]:
# Execute CRISPR screen planning
agent.go("""
Plan a CRISPR screen to identify genes that regulate T cell exhaustion, 
generate 32 genes that maximize the perturbation effect.
Use the following data sources:
ENCODE for chromatin accessibility in T cells
CellxGENE Census for expression patterns
GO annotations for these genes' functions
""")

Using prompt-based retrieval with the agent's LLM


Plan a CRISPR screen to identify genes that regulate T cell exhaustion, 
generate 32 genes that maximize the perturbation effect.
Use the following data sources:
- ENCODE for chromatin accessibility in T cells
- CellxGENE Census for expression patterns
- GO annotations for these gene's function

parsing error...

Each response must include thinking process followed by either <execute> or <solution> tag. But there are no tags in the current response. Please follow the instruction, fix and regenerate the response again.
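The format rule in the message above can be illustrated with a small check. This is a minimal sketch, not Biomni's own parser: `find_action` and `TAG_PATTERN` are hypothetical names, and the sketch assumes each tag is closed with a matching `</execute>` or `</solution>`.

```python
import re

# Hypothetical helper (not part of Biomni): looks for the first
# <execute>...</execute> or <solution>...</solution> block in a response.
TAG_PATTERN = re.compile(r"<(execute|solution)>(.*?)</\1>", re.DOTALL)


def find_action(response: str):
    """Return (tag, body) for the first tagged block, or None if there is none."""
    match = TAG_PATTERN.search(response)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


# A reply without tags would trigger the "parsing error" retry seen above.
print(find_action("I will query ENCODE first."))
print(find_action("Plan ready.\n<execute>\nprint('hi')\n</execute>"))
```

A reply for which this returns `None` corresponds to the case the agent is asked to regenerate.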

Thank you for pointing out the missing tags. Here is my updated response:

**Thinking Process:**  
To identify 32 genes that regulate T cell exhaustion and maximize perturbation effect, I need to integrate chromatin accessibility data from ENCODE, T cell expression data from CellxGENE Census, and functional annotation from Gene Ontology. The first step is to query ENCODE to obtain a list of genes with accessible chromat

The "stable" release is currently 2025-01-30. Specify 'census_version="2025-01-30"' in future calls to open_soma() to ensure data consistency.



<observation>Error: get_anndata() got an unexpected keyword argument 'obs_query'</observation>

The provided code failed due to an unexpected keyword argument 'obs_query' in the get_anndata function. This indicates either a difference in the API or function signature for cellxgene_census in this environment.

**Reasoning:**  
- I need to check the correct parameters for retrieving AnnData for specific cell types using cellxgene_census.
- If filtering by cell type is not directly supported, I may need to load the data and then subset for the relevant T cell subtypes.

**Plan Update:**
1. [✗] Query ENCODE for chromatin accessibility data specific to T cells (all attempts failed)
2. [✗] Query CellxGENE Census for expression profiles in T cell subtypes, focusing on exhaustion-relevant patterns (function argument error)
3. [ ] Modified Step 2: Retrieve T cell expression data using the correct approach, then subset to exhausted, effector, and memory T cells
4. [ ] Integrate expression and, 


Based on the documentation, the correct argument for subsetting by cell type is obs_value_filter, not obs_query. I will update the code to use obs_value_filter="cell_type in ['Exhausted T cell', 'Effector T cell', 'Memory T cell']".
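The corrected call described above might look like the following sketch. `build_cell_type_filter` and `fetch_expression` are hypothetical helpers introduced for illustration; `open_soma` and `get_anndata` with `obs_value_filter` are the `cellxgene_census` calls named in the trace. The fetch needs network access and is therefore not invoked here.

```python
def build_cell_type_filter(cell_types):
    """Build a value-filter string such as "cell_type in ['a', 'b']"."""
    quoted = ", ".join(repr(name) for name in cell_types)
    return f"cell_type in [{quoted}]"


def fetch_expression(cell_types, census_version="2025-01-30"):
    """Fetch an AnnData slice for the given cell type labels (needs network)."""
    import cellxgene_census  # imported lazily so the helper above runs without it

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=build_cell_type_filter(cell_types),
        )


# Only the filter string is built here; the labels are the ones from the trace
# and may not match the labels actually stored in the Census.
print(build_cell_type_filter(["Exhausted T cell", "Effector T cell", "Memory T cell"]))
```

Pinning `census_version` follows the advice in the warning printed by `open_soma()`.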

**Updated plan:**
1. [✗] Query ENCODE for chromatin accessibility data specific to T cells (all attempts failed)
2. [✗] Query CellxGENE Census for expression profiles in T cell subtypes, focusing on exhaustion-relevant patterns (function argument error)
3. [✓] Modified Step 2: Retrieve T cell expression data using obs_value_filter, then subset for exhausted, effector, and memory T cells
4. [ ] Integrate expression and, if possible, chromatin/gene set data to prioritize genes
5. [ ] Annotate the filtered gene list with Gene Ontology (GO) terms relevant to T cell function and exhaustion
6. [ ] Rank genes by expected perturbation effect
7. [ ] Select the top 32 genes that maximize coverage across relevant pathways and biological processes
8. [ ] Output the f


<observation>Error: Found array with 0 sample(s) (shape=(0, 61888)) while a minimum of 1 is required.</observation>



The attempt to retrieve T cell data from CellxGENE Census using obs_value_filter for the specified cell types resulted in zero samples. This could be due to:
- Mismatch in cell type naming conventions ("Exhausted T cell" etc. may not exactly match what's in the database)
- No data for the queried cell types in the current census version

**Reasoning/Next step:**  
To resolve this, I should first check which exact T cell subtypes (and naming conventions) are available in the dataset. Once I know the correct cell type labels, I can repeat the query.

**Checklist Update:**
1. [✗] Query ENCODE for chromatin accessibility data specific to T cells (all attempts failed)
2. [✗] Query CellxGENE Census for expression profiles in T cell subtypes, focusing on exhaustion-relevant patterns (no samples found)
3. [ ] Modified Step 2: List all available cell types in CellxGENE Census to determine correct labels for exhausted/effector/memory T cells
4. [ ] Proceed with DE analysis using correct labels


KeyboardInterrupt: 
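The run was interrupted before the agent listed the available cell types. The sketch below shows one way that step could be done by hand. It assumes the Census `census_info` / `summary_cell_counts` table exposes `organism`, `category`, and `label` columns as described in the Census schema; `list_census_cell_types` and `t_cell_labels` are hypothetical helpers, and the sample labels are illustrative only.

```python
def t_cell_labels(labels):
    """Return the sorted unique labels that mention 'T cell' (case-insensitive)."""
    return sorted({label for label in labels if "t cell" in label.lower()})


def list_census_cell_types(census_version="2025-01-30"):
    """Read human cell type labels from the Census summary table (needs network)."""
    import cellxgene_census  # imported lazily; not needed for t_cell_labels

    with cellxgene_census.open_soma(census_version=census_version) as census:
        counts = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()
    # Assumed column names and values; verify them against the Census schema.
    human = counts[(counts["organism"] == "Homo sapiens") & (counts["category"] == "cell_type")]
    return human["label"].tolist()


# Illustrative labels only, not taken from the Census.
sample = ["T cell", "B cell", "CD8-positive, alpha-beta T cell", "T cell"]
print(t_cell_labels(sample))
```

Passing the output of `list_census_cell_types()` to `t_cell_labels` would give candidate labels to use in the `obs_value_filter` expression.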

## Test 2: Validation with Literature and Clinical Data
Cross-reference a drug (Aspirin in this run) with drug databases, clinical trials, and safety reports.

In [9]:
# Check Aspirin against drug, trial, and safety databases
agent.go("""
For Aspirin:
Check DrugCentral and ChEMBL for its most critical info.
Search ClinicalTrials.gov for relevant trials of its use in cancer.
Review safety profiles using openFDA.
""")

The FDA adverse event report review for Aspirin reveals:
- Of 100 total reports, 66% were classified as "serious."
- Frequent adverse reactions included: rash, drug interactions, type 2 diabetes mellitus, dizziness, and fatigue.
- The FDA notes these are voluntary reports and do not indicate causality.

Updated checklist:
1. [✓] Query DrugCentral for critical information about Aspirin (meta-info retrieved)
2. [✓] Query ChEMBL for essential bioactivity/drug details for Aspirin (detailed drug info retrieved)
3. [✓] Search ClinicalTrials.gov for trials of Aspirin in cancer (relevant trials found)
4. [✓] Review Aspirin's safety profile using FDA adverse event reports (key safety data retrieved)
5. [ ] Summarize and present the findings.

Now, I will summarize and present all findings clearly.

<solution>
Summary of Key Findings for Aspirin:

1. DrugCentral:
   - DrugCentral offers comprehensive Aspirin data via API and database, including indications, structure, and re
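The "serious" percentage reported above can be reproduced from raw openFDA records. The following is a sketch under stated assumptions: it relies on the openFDA drug adverse event endpoint, where the `serious` field is "1" for serious reports, and on the `patient.drug.medicinalproduct` search field. `serious_fraction` and `fetch_reports` are hypothetical helpers, and the toy records are made up.

```python
def serious_fraction(reports):
    """Fraction of reports whose openFDA `serious` field equals "1"."""
    if not reports:
        return 0.0
    serious = sum(1 for report in reports if str(report.get("serious")) == "1")
    return serious / len(reports)


def fetch_reports(drug="aspirin", limit=100):
    """Fetch adverse event reports from openFDA (needs network access)."""
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    query = urlencode({"search": f'patient.drug.medicinalproduct:"{drug}"', "limit": limit})
    with urlopen(f"https://api.fda.gov/drug/event.json?{query}") as response:
        return json.load(response)["results"]


# Made-up records, used only to exercise the helper without a network call.
toy_reports = [{"serious": "1"}, {"serious": "2"}, {"serious": "1"}]
print(f"{serious_fraction(toy_reports):.0%} serious")
```

As the agent's summary notes, these are voluntary reports, so the fraction describes the sample of reports and not the risk of the drug.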

## Test 3: Ontology and Pathway Analysis
Analyze the selected genes using ontology tools.

In [6]:
# Perform ontology analysis
agent.go("""
For the selected gene set: CD274, ERBB2, EGFR
1. Use QuickGO to get GO annotations
2. Use OLS to map terms and pathways
3. Identify enriched biological processes
""")

Using prompt-based retrieval with the agent's LLM


For the selected gene set:CD274, ERBB2, EGFR
1. Use QuickGO to get GO annotations
2. Use OLS to map terms and pathways
3. Identify enriched biological processes

parsing error...

Each response must include thinking process followed by either <execute> or <solution> tag. But there are no tags in the current response. Please follow the instruction, fix and regenerate the response again.

Thank you for pointing out the missing tag. Here’s my thought process and corrected response:

Thinking and Reasoning:
To address your request, I will:
- Start by using QuickGO to gather GO annotations for the genes CD274, ERBB2, and EGFR.
- This will provide the GO terms associated with each gene, which are essential for the next steps.

Let’s proceed to import the QuickGO function and query for GO annotations for the three genes.

<execute>
from biomni.tool.database import query_quickgo

genes = ["CD274", "ERBB2", "EGFR"]
go_annotations = {}

for gen

Thank you for your feedback. Here’s my thinking process and the corrected response:

Thinking process:
- The previous attempts to run enrichment analysis failed due to a missing module error ("No module named 'gget'"), even though the built-in function should not require this.
- Since the built-in gene_set_enrichment_analysis function cannot be executed successfully in this environment right now, I will summarize the available results and insights using the GO annotations and OLS mapping that were retrieved in steps 1 and 2.
- I will highlight key biological processes and mapped pathways based on the mapped GO terms for CD274, ERBB2, and EGFR.

<solution>
Summary of analysis for CD274, ERBB2, EGFR:

1. QuickGO GO annotation retrieval identified GO terms for each gene, including:
   - CD274: immune response modulation (e.g., GO:0002841), actin cytoskeleton (GO:0015629), positive regulation of transcription (GO:0045893)
   - ERBB2: transmembrane receptor protein kinase ac
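The enrichment step failed on a missing `gget` module. A pre-flight check like the sketch below could surface such gaps before a long agent run. `missing_modules` is a hypothetical helper, and the module list is only an example based on the error in the trace.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "gget" is the module named in the error above; "json" is a stdlib control.
print(missing_modules(["json", "gget"]))
```

An empty list means every named module is importable in the current environment.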

## Comprehensive Drug Analysis: Aspirin Case Study
We'll perform a detailed analysis of Aspirin using multiple databases to understand its properties, interactions, and clinical applications.

In [None]:
# 1. Chemical and Basic Drug Information
agent.go("""
For Aspirin (acetylsalicylic acid):
1. Using PubChem:
   - Get chemical structure and properties
   - List all known synonyms
   - Find compound classification
   
2. Using DrugCentral:
   - Get primary mechanisms of action
   - List approved indications
   - Get ATC classification
   - Extract pharmacological class information
   
3. Using ChEMBL:
   - Find binding targets and their affinities
   - List documented biological activities
   - Get drug-like properties
""")

My thinking:
The ChEMBL query returned detailed information for Aspirin, including:
- Drug-like properties (Lipinski analysis, logP, H-bond donors/acceptors, etc.)
- Synonyms and identifiers
- ATC classifications
- Indication class
- ChEMBL ID

However, the output is lengthy and detailed binding target/affinity and biological activity information may need to be explicitly extracted. I will summarize the main findings and indicate if a follow-up query is needed for specific targets and affinities.

Updated checklist:
1. [✓] PubChem: structure, properties, synonyms, classification
2. [✗] DrugCentral: (not directly accessible via current API)
3. [✓] ChEMBL: drug-like properties, synonyms, ATC codes, indication class (targets/affinity may need further expansion)

Let’s summarize the information gathered so far and clarify if you want a deeper listing of specific targets/affinities from ChEMBL.

<solution>
Summary for Aspirin (acetylsalicylic acid):

1. PubChem:
   -

In [None]:
# 2. Clinical Trials Analysis
agent.go("""
Analyze Aspirin clinical trials:
1. Using ClinicalTrials.gov:
   - Find all cancer-related trials (active and completed)
   - Group trials by cancer types
   - Extract dosage information
   - Identify combination therapy trials
   - List primary and secondary outcomes
   
2. Cross-reference with DrugCentral:
   - Compare approved vs. investigational uses
   - Identify potential repurposing opportunities
""")

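The "group trials by cancer types" step in the prompt above can be sketched independently of the agent. The example assumes study records shaped like ClinicalTrials.gov API v2 results, with `protocolSection.identificationModule.nctId` and `protocolSection.conditionsModule.conditions`. `group_by_condition` is a hypothetical helper, and the records and IDs are placeholders, not real trials.

```python
from collections import defaultdict


def group_by_condition(studies):
    """Map each condition name to the NCT IDs of the studies that list it."""
    groups = defaultdict(list)
    for study in studies:
        section = study.get("protocolSection", {})
        nct_id = section.get("identificationModule", {}).get("nctId")
        for condition in section.get("conditionsModule", {}).get("conditions", []):
            groups[condition].append(nct_id)
    return dict(groups)


# Placeholder records shaped like API v2 results; the IDs are not real trials.
toy_studies = [
    {"protocolSection": {"identificationModule": {"nctId": "NCT-PLACEHOLDER-1"},
                         "conditionsModule": {"conditions": ["Colorectal Cancer"]}}},
    {"protocolSection": {"identificationModule": {"nctId": "NCT-PLACEHOLDER-2"},
                         "conditionsModule": {"conditions": ["Colorectal Cancer", "Breast Cancer"]}}},
]
for condition, ids in group_by_condition(toy_studies).items():
    print(condition, ids)
```

A study that lists several conditions appears under each of them, so the group sizes can sum to more than the number of trials.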
In [5]:
# 3. Safety and Pharmacovigilance
result = agent.go("""
Comprehensive safety analysis of Aspirin:
1. Using openFDA:
   - Extract adverse event reports
   - Calculate reporting odds ratios
   - Identify black box warnings
   - List drug-drug interactions
   - Get patient demographic patterns

2. Using DailyMed:
   - Get current labeling information
   - Extract contraindications
   - List warnings for special populations
   - Compare different formulation safety profiles
""")
print(result)

Using prompt-based retrieval with the agent's LLM


Comprehensive safety analysis of Aspirin:
1. Using openFDA:
   - Extract adverse event reports
   - Calculate reporting odds ratios
   - List drug-drug interactions
   - Get patient demographic patterns

2. Using DailyMed:
   - Get current labeling information
   - Extract contraindications
   - Compare different formulation safety profiles

parsing error...

Each response must include thinking process followed by either <execute> or <solution> tag. But there are no tags in the current response. Please follow the instruction, fix and regenerate the response again.

Thank you for pointing out the formatting requirement. Here’s my revised approach:

Thinking process:  
To begin a comprehensive safety analysis of Aspirin, I need to first extract adverse event reports from openFDA. This will provide an overview of the types and frequencies of adverse events reported in association with Aspirin, which is foundational for the subsequent ste
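The prompt asks for reporting odds ratios, which the truncated output does not show. The sketch below gives the standard 2x2 calculation with a 95% confidence interval on the log scale. `reporting_odds_ratio` is a hypothetical helper, and the counts in the usage line are made up for illustration.

```python
import math


def reporting_odds_ratio(a, b, c, d):
    """Reporting odds ratio and 95% CI from a 2x2 table of report counts.

    a: target drug, target event     b: target drug, other events
    c: other drugs, target event     d: other drugs, other events
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    ror = (a * d) / (b * c)
    # Standard error of ln(ROR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper


# Made-up counts, only to show the call.
ror, lower, upper = reporting_odds_ratio(10, 90, 20, 180)
print(f"ROR={ror:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A signal is conventionally flagged when the lower bound of the interval exceeds 1, though thresholds vary between pharmacovigilance groups.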

In [None]:
# 4. Drug Interactions and Cross-References
agent.go("""
Map Aspirin across databases:
1. Using UniChem:
   - Get all database identifiers
   - Cross-reference with other systems

2. Using DrugCentral and ChEMBL:
   - List all known drug interactions
   - Categorize by severity
   - Identify mechanism-based interactions
   - Find structural analogs
""")

In [None]:
# 5. Molecular and Pathway Analysis
agent.go("""
Analyze molecular aspects:
1. Using QuickGO:
   - Get GO terms for Aspirin targets
   - Analyze biological processes affected

2. Using OLS:
   - Map to relevant pathways
   - Find disease associations
   - Identify molecular functions

3. Combine with ChEMBL data:
   - Analyze target protein families
   - Map to signaling pathways
""")