# Report Generation with Reviewers

This notebook demonstrates how to generate gene analysis reports with an additional review step. The process includes:

1. Setting up the required API keys and environment
2. Providing the disease name and gene list
3. Generating detailed reports
4. Reviewing the generated reports with reviewers (Optional)



## 1. Setting up the required API keys and environment


In [1]:
# Add the parent directory to the import path so that config.py can be imported
import sys
sys.path.append('..')
from config import setup_api_keys

# Configure your API keys in config.py first!
setup_api_keys()
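
A quick sanity check after `setup_api_keys()` can catch a missing key before any model call is made. The sketch below is not part of the repository: `missing_keys` is a hypothetical helper, it assumes the keys end up as environment variables, and `OPENAI_API_KEY` is only an assumed name, since this notebook does not state which keys `config.py` expects.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Assumed key name, for illustration only; check config.py for the real ones.
print(missing_keys(["OPENAI_API_KEY"]))
```

An empty list means every listed key is present; anything else names the keys that still need to be configured.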

## 2. Providing the disease name and gene list

### Input Portal

You need to provide the following information to generate a specific section of gene analysis reports:

1. gene_list: a list of genes
2. disease_name: the name of the disease
3. section_name: the name of the section you want to generate


In [2]:
from Report_Generation_AgentAnalyst.tools import get_gene_list, get_gene_full_name
from Report_Generation_AgentAnalyst.run_section import directory_set_up
import os

In [3]:
# get the path of the TargetSeek_Code directory and use it as the prefix for all paths
prefix = directory_set_up() + '/'

# set the disease name
disease_name = 'atherosclerosis'

# set the gene list, using short gene symbols (for example 'PCSK9')
gene_list = ['PCSK9']

'''
# if you want to generate reports for the five diseases mentioned in the paper
# 'atherosclerosis', 'non_small_cell_lung_cancer', 'rheumatoid_arthritis', 'type2_diabetes', and 'inflammatory_bowel_disease'
# you can use the following code to get the corresponding gene list

disease_name = 'non_small_cell_lung_cancer'

# get the gene list
gene_list = get_gene_list(disease_name, prefix=prefix)
'''

# get the full gene name
gene_list = [gene + '_' + get_gene_full_name(gene).replace(' ', '_').lower() for gene in gene_list]
print(gene_list)

['PCSK9_proprotein_convertase_subtilisin/kexin_type_9']
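
The cell above joins each gene symbol with its lowercased, underscore-separated full name. The sketch below reproduces that transformation without the repository: `FULL_NAMES` is a stand-in for `get_gene_full_name`, and `make_label` and `short_name` are hypothetical helper names. `short_name` mirrors how the short gene name is recovered later in this notebook.

```python
# Stand-in for get_gene_full_name, which needs the repository to run.
FULL_NAMES = {"PCSK9": "proprotein convertase subtilisin/kexin type 9"}

def make_label(symbol, full_name):
    """Build '<SYMBOL>_<full_name>' the same way as the list comprehension above."""
    return symbol + "_" + full_name.replace(" ", "_").lower()

def short_name(label):
    """Recover the lowercase gene symbol from a label."""
    return label.split("_")[0].lower()

label = make_label("PCSK9", FULL_NAMES["PCSK9"])
print(label)
print(short_name(label))
```

Note that the label can contain a `/` taken from the full gene name, while the report paths shown in this notebook use only the short name.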


In [4]:
len(gene_list)

1

## 3. Generating Detailed Reports

The generation step produces one detailed report for each gene in the gene list.


In [5]:
# import only the function used below instead of everything in the module
from Report_Generation_AgentAnalyst.run_section import generate_with_progress
# from joblib import Parallel, delayed
from tqdm import tqdm

In [6]:
for gene in tqdm(gene_list):
    generate_with_progress(gene, disease_name, model_name='chatgpt-4o-latest')

  0%|          | 0/1 [00:00<?, ?it/s]

pcsk9 is processed.


100%|██████████| 1/1 [00:01<00:00,  1.01s/it]
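
The loop above handles genes one at a time. For longer gene lists, one option is to run several genes concurrently. The sketch below uses the standard library rather than the commented-out joblib import; `run_for_genes` is a hypothetical helper, the worker is a stand-in, and whether `generate_with_progress` is safe to call from several threads is an assumption that should be checked first.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_for_genes(genes, worker, max_workers=4):
    """Call worker(gene) for every gene; return ({gene: result}, {gene: error})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, gene): gene for gene in genes}
        for future in as_completed(futures):
            gene = futures[future]
            try:
                results[gene] = future.result()
            except Exception as exc:  # keep going if one gene fails
                errors[gene] = exc
    return results, errors

# Stand-in worker; in the notebook this would wrap generate_with_progress.
done, failed = run_for_genes(["a_gene", "b_gene"], lambda gene: gene.upper())
print(done, failed)
```

Collecting failures per gene, instead of letting the first exception stop the run, makes it easier to retry only the genes that did not finish.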


### Generated Report Location
The complete generated report for each gene will be saved at:
**Report_Generation_AgentAnalyst/reports/{disease_name}/{gene_name}/report_{gene_name}_all.md**

For example, for the PCSK9 gene and atherosclerosis disease, the report will be at:
**Report_Generation_AgentAnalyst/reports/atherosclerosis/pcsk9/report_pcsk9_all.md**
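
The path pattern above can be assembled programmatically. In this minimal sketch, `report_path` is a hypothetical helper (not a function from the repository) and `prefix` stands for the TargetSeek_Code directory:

```python
import os

def report_path(prefix, disease_name, gene_name):
    """Return the expected location of the complete generated report."""
    return os.path.join(
        prefix,
        "Report_Generation_AgentAnalyst",
        "reports",
        disease_name,
        gene_name,
        f"report_{gene_name}_all.md",
    )

path = report_path(".", "atherosclerosis", "pcsk9")
print(path, os.path.isfile(path))
```

Checking `os.path.isfile` on the result is a simple way to confirm that generation finished before starting the review step.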

## 4. Reviewing the generated reports with reviewers (Optional)

The review step examines each section of the report to identify potential issues and improve the report's quality.

The refined reports will be saved in the feedback_repo/summary_feedback_tmp/refine_reports directory.


In [7]:
os.chdir(prefix)  # change the working directory to the TargetSeek_Code directory

In [8]:
from Report_Generation_AgentAnalyst.tools import get_gene_list, get_gene_name_short
from Report_Reviewing_AgentReviewer.reviewer_report import (
    run_parallel_refinement_for_a_gene,
    process_section,
    save_refined_report_all_sections,
)

In [9]:
# set up the sections to be reviewed
sections = ["lof", "gof", "assays", "competitive_edge", "druggability",
            "in_vitro_or_vivo_data", "mechanism_of_action", "safety"]

In [10]:
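os.chdir(prefix)

# directory that holds the generated reports
md_dir = os.path.join(prefix, 'Report_Generation_AgentAnalyst', 'reports')

# output directory for the review results; the empty last component keeps the trailing separator
save_dir = os.path.join(prefix, 'Report_Reviewing_AgentReviewer', 'feedback_repo', 'summary_feedback_tmp', '')
# create the directory (and any missing parents) if it does not exist yet
os.makedirs(save_dir, exist_ok=True)

mistake_pool_file = os.path.join(prefix, 'Report_Reviewing_AgentReviewer', 'mistake_pool', 'mistake_pool_0918.json')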

In [11]:
short_gene_name_list = [gene.split('_')[0].lower() for gene in gene_list]

In [12]:
from tqdm import tqdm

for gene in tqdm(short_gene_name_list):
    print(f'processing {gene}')
    #check if the gene report has been refined
    # if os.path.exists(save_dir + 'refine_reports/' + f'/{disease_name}/{gene}'):
    #     print(f'{gene} has been refined')
    #     continue
    results = run_parallel_refinement_for_a_gene(gene, disease_name, sections, save_dir, md_dir, mistake_pool_file)

  0%|          | 0/1 [00:00<?, ?it/s]

processing pcsk9


  llm_chain = LLMChain(llm=llm, prompt=mistake_pool_prompts)
  mistake_pool_response = llm_chain.run(mistake_pool=mistake_pool, report=report, summarized_feedback=summarize_feedback)


error in saving mistake pools
mistake pool response is ```python
mistakePool = {
    'Included irrelevant content to the target gene': 'The row in the table is not **directly** related to the gene of interest||Action: Delete the content.',
    'Logical error': 'A mutation cannot simultaneously be both gain of function and loss of function||Action: You should remove one of them.',
    'Redundant reference': 'Some references are not used. ||Action: Delete the references.',
    'Incorrect reference': 'The reference provided does not support the claim made in the report||Action: Delete the relevant content and the references, unless you find the correct one that supports the claim.',
    'Incorrect Genetic Loci': 'The genetic loci associated with the amino acid change is incorrect||Action: Update the genetic loci to the correct one.',
    'Incomplete Traits Association': 'The traits associated with the genetic variant are incomplete||Action: Update the traits to include all relevant associ

100%|██████████| 1/1 [01:11<00:00, 71.11s/it]

/Users/yzhe0006/models/TargetSeek_Code/Report_Reviewing_AgentReviewer/feedback_repo/summary_feedback_tmp/refine_reports/atherosclerosis/pcsk9
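
The log above reports `error in saving mistake pools` and prints a response that begins with a Markdown code fence followed by an assignment (`mistakePool = {...}`). If the failure comes from parsing that wrapper, which this notebook does not confirm, a tolerant parser along the following lines could help. `parse_fenced_dict` is a hypothetical helper, not part of the repository, and the sample reply is a toy string.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

def parse_fenced_dict(response):
    """Extract the outermost {...} literal from an LLM reply and parse it safely."""
    # Drop an opening fence with an optional language tag, and any closing fence.
    text = re.sub(FENCE + r"[A-Za-z]*", "", response)
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no dictionary literal found in response")
    # literal_eval only accepts Python literals, so no code is executed.
    parsed = ast.literal_eval(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("response did not contain a dictionary")
    return parsed

reply = FENCE + "python\nmistakePool = {'Logical error': 'Remove one of them.'}\n" + FENCE
print(parse_fenced_dict(reply))
```

A reply that is cut off before the closing brace raises an error here, so truncated responses still need separate handling.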





In [13]:
for gene in tqdm(short_gene_name_list):
    save_refined_report_all_sections(gene, disease_name, save_dir)

100%|██████████| 1/1 [00:00<00:00, 203.41it/s]


### Refined Report Location
The complete refined report for each gene will be saved at:
**Report_Reviewing_AgentReviewer/feedback_repo/summary_feedback_tmp/refine_reports/{disease_name}/{gene_name}/report_{gene_name}_all.md**

For example, for the PCSK9 gene and atherosclerosis disease, the report will be at:
**Report_Reviewing_AgentReviewer/feedback_repo/summary_feedback_tmp/refine_reports/atherosclerosis/pcsk9/report_pcsk9_all.md**
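
To resume an interrupted run, it can be useful to list the genes that do not yet have a refined report at the location above. The sketch below assumes only the path pattern stated in this section; `refined_report_path` and `genes_missing_refined_report` are hypothetical helpers, and `save_dir` is the `summary_feedback_tmp` directory defined earlier.

```python
import os

def refined_report_path(save_dir, disease_name, gene_name):
    """Return the expected location of the complete refined report."""
    return os.path.join(
        save_dir, "refine_reports", disease_name, gene_name,
        f"report_{gene_name}_all.md",
    )

def genes_missing_refined_report(genes, disease_name, save_dir):
    """Return the genes whose refined report file does not exist yet."""
    return [
        gene for gene in genes
        if not os.path.isfile(refined_report_path(save_dir, disease_name, gene))
    ]
```

Passing the returned list to the refinement loop instead of the full `short_gene_name_list` would skip genes that were already processed.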