# Report Scoring

This notebook is used to score and evaluate target genes for specific diseases **after the report is generated**. 

Report scoring consists of two steps:

1. Score the candidate gene for the specific disease
2. Combine and save the scoring results as a CSV file

It uses 13 AgentScorers to score each candidate gene for a specific disease. The 13 scoring criteria, grouped into 4 categories, are as follows:

**Category 1: Causal Inference**

- Genetic Evidence
- Differential Expression
- Mechanism of Action
- In Vitro/In Vivo Experiments

**Category 2: Tractability**

- Small Molecule Tractability
- Antibody Tractability
- siRNA Tractability

**Category 3: Competitiveness**

- Small Molecule Competitiveness
- Antibody/siRNA Competitiveness
- Unmet Medical Needs

**Category 4: Doability**

- Experimental Model Availability (Assayability)
- Biomarker Availability
- Safety

The scoring helps quantify and compare the potential of different gene targets for therapeutic development.
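As a quick reference, the category layout above can be written down as a plain mapping. This is only a sketch: the criterion names are copied from the list, but the dictionary itself (`SCORING_CRITERIA`) is a hypothetical structure for illustration and is not taken from the `Target_Scoring_AgentScorer` package.

```python
# Criterion names are copied from the list above. The dict layout is an
# illustrative assumption, not the package's internal data structure.
SCORING_CRITERIA = {
    "Causal Inference": [
        "Genetic Evidence",
        "Differential Expression",
        "Mechanism of Action",
        "In Vitro/In Vivo Experiments",
    ],
    "Tractability": [
        "Small Molecule Tractability",
        "Antibody Tractability",
        "siRNA Tractability",
    ],
    "Competitiveness": [
        "Small Molecule Competitiveness",
        "Antibody/siRNA Competitiveness",
        "Unmet Medical Needs",
    ],
    "Doability": [
        "Experimental Model Availability (Assayability)",
        "Biomarker Availability",
        "Safety",
    ],
}

# Sanity check: 4 categories holding 13 criteria in total.
total_criteria = sum(len(names) for names in SCORING_CRITERIA.values())
print(len(SCORING_CRITERIA), total_criteria)
```

A mapping like this makes it easy to confirm that every criterion belongs to exactly one category before aggregating scores.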

## 1. Scoring the candidate gene for the specific disease

This step runs the scoring functions on each candidate gene for the chosen disease and then gathers the results into a final scoring report.

In [1]:
from Target_Scoring_AgentScorer.tools import *
from Target_Scoring_AgentScorer.scoring_sections import *
from Report_Generation_AgentAnalyst.tools import get_gene_list, get_gene_full_name
from config import setup_api_keys

In [2]:
setup_api_keys()
llm = testChat() # initialize the LLM

### Input Portal

You need to provide the following information to score candidate genes:

1. gene_list: a list of gene symbols
2. disease_name: the name of the disease
3. preferred_moa and match_moa: mechanism-of-action settings passed to the scoring functions (set to an empty string and `False` in this example)


In [3]:
# set the disease name
disease_name = 'atherosclerosis'

In [4]:
# set the gene list; provide each gene as its short symbol (e.g. 'PCSK9')
gene_list = ['PCSK9'] 

In [5]:
from Report_Generation_AgentAnalyst.run_section import directory_set_up
# get the current directory
prefix = directory_set_up() + '/'

'''
# if you want to generate reports for the five diseases mentioned in the paper
# 'atherosclerosis', 'non_small_cell_lung_cancer', 'rheumatoid_arthritis', 'type2_diabetes', and 'inflammatory_bowel_disease'
# you can use the following code to get the corresponding gene list

disease_name = 'non_small_cell_lung_cancer'

# get the gene list
gene_list = get_gene_list(disease_name, prefix=prefix)
'''

# get the full gene name
gene_list = [gene + '_' + get_gene_full_name(gene).replace(' ', '_').replace('/', '_').lower() for gene in gene_list]
print(gene_list)

['PCSK9_proprotein_convertase_subtilisin_kexin_type_9']
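The name handling in this cell, and its reversal in the scoring cell, can be sketched in isolation. The helper names `to_report_name` and `to_scoring_key` are hypothetical, and the full gene name is hardcoded here instead of being fetched with `get_gene_full_name`, so treat this as an illustration of the string operations only.

```python
def to_report_name(symbol: str, full_name: str) -> str:
    """Join a gene symbol and its full name using the same replacements as the cell above."""
    cleaned = full_name.replace(" ", "_").replace("/", "_").lower()
    return f"{symbol}_{cleaned}"


def to_scoring_key(report_name: str) -> str:
    """Recover the lowercase symbol: everything before the first underscore."""
    return report_name.split("_")[0].lower().replace(" ", "")


# Hardcoded stand-in for the value that get_gene_full_name would return.
full_name = "proprotein convertase subtilisin/kexin type 9"

report_name = to_report_name("PCSK9", full_name)
scoring_key = to_scoring_key(report_name)

print(report_name)  # PCSK9_proprotein_convertase_subtilisin_kexin_type_9
print(scoring_key)  # pcsk9
```

Because the symbol is recovered by splitting on the first underscore, this round trip assumes the gene symbol itself contains no underscore.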


In [6]:
preferred_moa = ''
match_moa = False

### Run the scoring functions

In [7]:
# keep only the gene symbol (the text before the first underscore), lowercased
gene_list = [gene.split('_')[0].lower() for gene in gene_list]
for gene in gene_list:
    gene = gene.lower().replace(' ', '')
    # file_type: 'direct' retrieves reports without review; 'review' retrieves reports with review
    run_all_scoring_functions(llm, gene, disease_name, preferred_moa, match_moa=match_moa, file_type='direct', no_drug_pipeline_info=False)

report_pcsk9_all.md
--------report chunk----------

### human_tissue_distribution
| Tissue Type | Expression Value |
|-------------|------------------|
| liver | 47.6 |
| cerebellum | 10.1 |
| lung | 7.6 |
| esophagus | 5.3 |
| pancreas | 5.2 |
| colon | 4.4 |
| small intestine | 3.3 |
| duodenum | 3.0 |
| rectum | 1.5 |
| vagina | 1.5 |
| stomach | 1.3 |
| skin | 1.3 |
| cervix | 1.0 |
| smooth muscle | 0.7 |
| appendix | 0.7 |
| testis | 0.4 |
| bone marrow | 0.4 |
| fallopian tube | 0.4 |
| salivary gland | 0.4 |
| spleen | 0.3 |
| tonsil | 0.3 |
| urinary bladder | 0.3 |
| adipose tissue | 0.3 |
| endometrium | 0.3 |
| epididymis | 0.3 |
| seminal vesicle | 0.2 |
| adrenal gland | 0.2 |
| basal ganglia | 0.2 |
| parathyroid gland | 0.2 |
| spinal cord | 0.2 |
| kidney | 0.1 |
| breast | 0.1 |
| cerebral cortex | 0.1 |
| gall

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gene_info["Expression Level"] = gene_info["nTPM"].apply(categorize_expression)


--------report chunk----------
Druggability Report:

### druggability
### Collected Information on PCSK9

| Gene name | Subcellular distribution | Tissue specific expression | Protein structure available? | Membrane protein | Ligand binder | Small molecule binder | Pocket | Small Molecule Tool Compound | Reference |
|-----------|--------------------------|----------------------------|------------------------------|------------------|---------------|-----------------------|--------|------------------------------|-----------|
| PCSK9     | Secreted protein           | Highly expressed in liver, detected in multiple tissues | Yes, crystal structure available (2.3 A)  |  No             | Yes           | Yes                   | Yes    | Compound 16, E28362, AZD0780 | [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]  |

#### Detailed Analysis:

1. **Gene Name**: PCSK9

2. **Subcellular Distribution**:
   - PCSK9 is primarily a secreted protein that acts extracellularly to regulate LDL recep

### Gather the scoring results to generate the final scoring report

In [8]:
for gene in gene_list:
    generate_final_report(llm, gene, disease_name,  file_type = 'direct')

llmclient=<openai.resources.chat.completions.completions.Completions object at 0x7fd8aa750940> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x7fd8aa7851f0> root_client=<openai.OpenAI object at 0x7fd888193fa0> root_async_client=<openai.AsyncOpenAI object at 0x7fd8aa7505b0> model_name='gpt-4o-2024-05-13' model_kwargs={} openai_api_key=SecretStr('**********') max_tokens=4096
gene_namepcsk9
disease_nameatherosclerosis
file_typedirect
match_moaFalse


### Generated Scoring Report Location

The file is stored as:
`TargetSeek/Target_Scoring_AgentScorer/scoring_result/direct/markdown/{disease_name}/{gene_name}_{disease_name}_scoring.md`

For example, the scoring report for the gene PCSK9 and the disease atherosclerosis is:
`TargetSeek/Target_Scoring_AgentScorer/scoring_result/direct/markdown/atherosclerosis/pcsk9_atherosclerosis_scoring.md`
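If you want to locate a report programmatically, the path pattern above can be assembled as in the following sketch. The function `scoring_report_path` is hypothetical and not part of the package. It also assumes that the `direct` folder in the path corresponds to the `file_type` argument, which the notebook suggests but does not state.

```python
from pathlib import PurePosixPath


def scoring_report_path(gene_name: str, disease_name: str, file_type: str = "direct") -> PurePosixPath:
    """Build the report path from the pattern shown above (hypothetical helper)."""
    return (
        PurePosixPath("TargetSeek")
        / "Target_Scoring_AgentScorer"
        / "scoring_result"
        / file_type  # assumption: this folder mirrors the file_type argument
        / "markdown"
        / disease_name
        / f"{gene_name}_{disease_name}_scoring.md"
    )


path = scoring_report_path("pcsk9", "atherosclerosis")
print(path)
```

For `pcsk9` and `atherosclerosis` this reproduces the example path given above.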

## 2. Combine & Save the scoring results as a CSV

The next step is to combine all individual gene scoring results into a single CSV file for ranking.

This CSV file will contain the scoring results for all genes analyzed for the disease of interest,
making it easier to compare and rank the targets systematically. Each row represents a gene with its complete set of scores across all evaluation criteria.

In [9]:
disease_names = [disease_name]

retrieve_path = directory_set_up()

for disease_name in disease_names:
    # gene_list = load_gene_list(retrieve_path, disease_name)
    gene_list = [gene.lower().replace(' ', '').split('_')[0] for gene in gene_list]
    ratings_of_genes = {}
    for gene in gene_list:
        gene = gene.lower().replace(' ', '')
        ratings_of_genes[gene] = return_ratings(gene, disease_name, file_type='direct')

    ratings_of_genes = map_ratings_to_categories(ratings_of_genes)

    # use the same file_type as the scoring step above ('direct' or 'review')
    store_final_results(ratings_of_genes, disease_name, file_type='direct')
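To picture the combined table described in this section, here is a minimal standard-library sketch that writes one row per gene. It does not reproduce `store_final_results`: the column layout, the gene names `gene_a` and `gene_b`, and every rating value are placeholders invented purely for illustration.

```python
import csv
import io

# Assumed column layout: one column per scoring category.
CATEGORIES = ["Causal Inference", "Tractability", "Competitiveness", "Doability"]

# Placeholder ratings, invented for illustration only (not real scoring output).
ratings_of_genes = {
    "gene_a": {"Causal Inference": 3, "Tractability": 2, "Competitiveness": 1, "Doability": 2},
    "gene_b": {"Causal Inference": 1, "Tractability": 3, "Competitiveness": 2, "Doability": 1},
}


def ratings_to_csv(ratings: dict) -> str:
    """Serialize {gene: {category: rating}} into CSV text, one row per gene."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["gene"] + CATEGORIES, lineterminator="\n")
    writer.writeheader()
    for gene, scores in ratings.items():
        writer.writerow({"gene": gene, **scores})
    return buffer.getvalue()


csv_text = ratings_to_csv(ratings_of_genes)
print(csv_text)
```

Keeping one row per gene and one column per score means the resulting file can be sorted on any column to rank targets.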