**Remark:** In this notebook we show how to use LLMs to annotate a drug with respect to "Level 1 Reactome pathways". Since we only use putative targets and MOA as strings, we are not restricted to drugs from GDSC and PRISM, i.e. this notebook can be used to <ins>**annotate any drug**</ins>.

# 0. Libraries import

Import all necessary libraries and set up the environment. This includes:
- Standard data manipulation libraries (pandas, numpy)
- File system and path management (os, sys, pathlib)
- The vLLM library for efficient large language model inference

In [1]:
import os
import pandas as pd
import numpy as np
import sys
from pathlib import Path

from vllm import LLM, SamplingParams

We'll also set the current working directory and define important file paths.

In [None]:
#set current working directory to the library path
library_path = Path('/home/fcarli/CellHit/') #replace with yours
os.chdir(library_path)
#sys.path.append(str(library_path)) #use this if the imports below still fail (sys.path entries must be strings)

We are then ready to import dedicated functions from the custom library

In [3]:
from CellHit.LLMs import fetch_abstracts, generate_prompt
from CellHit.data import obtain_drugs_metadata

In [32]:
#specify the data_path
data_path = library_path/'data'
describe_prompt_path = data_path/'prompts'/'drug_description_prompt.txt'
refine_prompt_path = data_path/'prompts'/'drug_refiner_prompt.txt'
model_path = Path('~/mixtral/').expanduser() # note that the model needs to be downloaded from huggingface, the one used can be found at https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ (gptq-4bit-32g-actorder_True)

# 1. Getting a drug's abstracts

Taking advantage of the custom functions provided in CellHit, we can get the metadata related to the two datasets used in the manuscript (i.e. GDSC and PRISM)

In [5]:
dataset = 'gdsc'
metadata = obtain_drugs_metadata(dataset,data_path)
metadata.head()

Unnamed: 0,DrugID,Drug,PUTATIVE_TARGET,MOA,DRUG_SYNONYMS,SMILES,repurposing_target
0,1003,Camptothecin,TOP1,DNA replication,"Camptothecine, (+)-Camptothecin",CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=...,TOP1
1,1004,Vinblastine,Microtubule destabiliser,Mitosis,Velban,CC[C@@]1(C[C@H]2C[C@@](C3=C(CCN(C2)C1)C4=CC=CC...,TUBB
2,1005,Cisplatin,DNA crosslinker,DNA replication,"cis-Diammineplatinum(II) dichloride, Platinol,...",N.N.Cl[Pt]Cl,DNA
3,1006,Cytarabine,Antimetabolite,Other,"Ara-Cytidine, Arabinosyl Cytosine, U-19920",C1=CN(C(=O)N=C1N)[C@H]2[C@H]([C@@H]([C@H](O2)C...,DNA synthesis
4,1007,Docetaxel,Microtubule stabiliser,Mitosis,"RP-56976, Taxotere",CC1=C2[C@H](C(=O)[C@@]3([C@H](C[C@@H]4[C@]([C@...,TUBB


After selecting the drug of interest, we store its name

In [6]:
drug_name = 'Venetoclax'

and then extract the known metadata related to this drug, such as its putative targets and mechanism of action

In [11]:
metadata.loc[metadata['Drug']==drug_name]

Unnamed: 0,DrugID,Drug,PUTATIVE_TARGET,MOA,DRUG_SYNONYMS,SMILES,repurposing_target
213,1909,Venetoclax,BCL2,Apoptosis regulation,"ABT-199, Veneclexta, GDC-0199",CC1(CCC(=C(C1)C2=CC=C(C=C2)Cl)CN3CCN(CC3)C4=CC...,BCL2


In [13]:
mechanism_of_action = metadata.loc[metadata['Drug']==drug_name]['MOA'].values[0]
putative_target = metadata.loc[metadata['Drug']==drug_name]['repurposing_target'].values[0]
mechanism_of_action,putative_target

('Apoptosis regulation', 'BCL2')
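Because only these strings are used downstream, a drug that is absent from both GDSC and PRISM can be annotated by supplying them by hand. The following is a minimal sketch under that assumption: `manual_drug_fields` is a hypothetical helper (it is not part of CellHit), and the keyword names are the ones this notebook passes to `generate_prompt`.

```python
def manual_drug_fields(drug_name, mechanism_of_action, putative_targets):
    """Bundle hand-written drug metadata into prompt keyword arguments.

    Hypothetical helper: it only checks that every field is a non-empty string.
    """
    fields = {
        'drug_name': drug_name,
        'mechanism_of_action': mechanism_of_action,
        'putative_targets': putative_targets,
    }
    for key, value in fields.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{key} must be a non-empty string")
    # strip surrounding whitespace so the prompt stays tidy
    return {key: value.strip() for key, value in fields.items()}


# same values as the ones extracted from the GDSC metadata above
fields = manual_drug_fields('Venetoclax', 'Apoptosis regulation', 'BCL2')
print(fields)
```

The resulting dictionary could then be unpacked with `**fields` wherever the notebook builds a prompt.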

We then use the `fetch_abstracts` function to get the abstracts related to the selected drug

In [8]:
abstracts = fetch_abstracts(drug_name,mail=None,k=10)
abstracts = [str(a) for a in abstracts]
abstracts

['The prognosis of adult acute myeloid leukemia (AML) remains poor, with the long-term survival rate less than 50%. However, the current paradigms of treatment are changing through a better understanding of the disease genetics and pathophysiology. Since 2017, eight new drugs have been approved by the U.S. Food and Drug Administration for the treatment of AML, including the FLT3 inhibitors midostaurin and gilteritinib, the IDH inhibitors ivosidenib and enasidenib, the anti-CD33 monoclonal antibody gemtuzumab ozogamicin, liposomal daunorubicin and cytarabine, the hedgehog pathway inhibitor glasdegib and the BCL-2 inhibitor venetoclax. Preclinical data demonstrated the anti-leukemic efficacy of venetoclax in AML and its synergy when combined with hypomethylating agents or chemotherapy agents. Clinical trials have demonstrated the clinical benefit of venetoclax-based therapies in newly diagnosed AML, leading to the recent FDA approval of venetoclax in combination with hypomethylating agen
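How `fetch_abstracts` queries PubMed is not shown in this notebook. As a rough illustration only, the sketch below builds a search URL for the public NCBI E-utilities `esearch` endpoint. The function name `build_pubmed_search_url`, the relevance sorting and the mapping of `k` and `mail` to `retmax` and `email` are assumptions, not a description of CellHit's implementation. No network request is made.

```python
from urllib.parse import urlencode

# public NCBI E-utilities search endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(drug_name, k=10, mail=None):
    """Return an esearch URL asking PubMed for up to k records about drug_name."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": drug_name,     # free-text query
        "retmax": k,           # maximum number of record IDs to return
        "retmode": "json",
        "sort": "relevance",   # assumption: most relevant records first
    }
    if mail is not None:
        params["email"] = mail  # contact address requested by NCBI for heavy use
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_pubmed_search_url("Venetoclax", k=10)
print(url)
```

The returned record IDs would still have to be resolved to abstract texts in a second request (for example with the `efetch` endpoint), which is omitted here.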

# 2. Generating a drug description

We now make use of the vLLM library to serve the LLM

In [10]:
llm = LLM(
    model=model_path,
    revision="gptq-4bit-32g-actorder_True",
    quantization="gptq",
    dtype='float16',
    gpu_memory_utilization=1,
    enforce_eager=True)
    #**{"attn_implementation":"flash_attention_2"}) Turn on for faster decoding (Ampere and newer GPUs only)

#initialize sampling parameters
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=1024)


INFO 09-01 07:35:25 llm_engine.py:70] Initializing an LLM engine with config: model=PosixPath('/home/fcarli/mixtral'), tokenizer=PosixPath('/home/fcarli/mixtral'), tokenizer_mode=auto, revision=gptq-4bit-32g-actorder_True, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=gptq, enforce_eager=True, seed=0)


[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:43576 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [gaia01.hpc.sns.it]:43576 (errno: 97 - Address family not supported by protocol).


INFO 09-01 07:36:41 llm_engine.py:275] # GPU blocks: 4197, # CPU blocks: 2048


We generate the prompt using a dedicated function that inserts the information recovered about the selected drug into a template

In [33]:
describe_prompt = generate_prompt(describe_prompt_path,**{'drug_name':drug_name, 'mechanism_of_action':mechanism_of_action, 'putative_targets':putative_target})
print(describe_prompt)

<|im_start|>system
**Task**: Utilizing the drug metadata that i provide you with, act as a medicinal chemist and expert molecular biologist to generate an in-depth description of the drug.

**Instructions**:
1. Analyze the provided drug metadata, including its name, synthetic labels of the mechanism of action, and putative targets.
2. Generate a detailed textual description of the drug's mechanism of action. This should include how the drug interacts with biological systems, its effect on specific cellular or molecular targets, and its overall impact on the disease or condition it is designed to treat.
3. If available, provide a detailed textual description of the drug's metabolism. This includes how the drug is metabolized in the body, the enzymes involved, the metabolic pathways, and any known metabolites formed.
4. Ensure that the information is presented in a clear, concise, and scientifically accurate manner, suitable for an audience with a background in medicinal chemistry and mo
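The internals of `generate_prompt` are not shown here. A minimal sketch of the general idea (read a text template and substitute named fields) could look as follows. The helper `fill_template`, the toy template and the `{placeholder}` syntax are assumptions made for illustration, and the real prompt files may use a different convention.

```python
import tempfile
from pathlib import Path


def fill_template(template_path, **fields):
    """Read a text template and substitute {placeholders} with the given fields.

    str.format raises KeyError if the template names a field that was not supplied.
    """
    template = Path(template_path).read_text(encoding="utf-8")
    return template.format(**fields)


# toy template written to a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_prompt.txt"
    toy_path.write_text(
        "Drug: {drug_name}\nMOA: {mechanism_of_action}\nTargets: {putative_targets}\n",
        encoding="utf-8",
    )
    toy_prompt = fill_template(
        toy_path,
        drug_name="Venetoclax",
        mechanism_of_action="Apoptosis regulation",
        putative_targets="BCL2",
    )

print(toy_prompt)
```

Note that `str.format` treats every curly brace as a placeholder delimiter, so a template containing literal braces would need them doubled or a different substitution mechanism.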

We can then query the LLM and obtain its answer

In [17]:
outputs = llm.generate(describe_prompt, sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:19<00:00, 19.60s/it]


In [30]:
drug_description = outputs[0].outputs[0].text
print(drug_description)


**Mechanism of Action Description**:

Venetoclax is a potent and selective inhibitor of the BCL2 protein, a key regulator of apoptosis (programmed cell death). By binding to the BCL2 protein, Venetoclax disrupts the interaction between BCL2 and pro-apoptotic proteins, such as BAX and BAK, thereby promoting apoptosis in cancer cells that overexpress BCL2. This mechanism of action is particularly relevant in the treatment of chronic lymphocytic leukemia (CLL) and acute myeloid leukemia (AML), where BCL2 overexpression is a common feature and contributes to resistance to chemotherapy-induced apoptosis.

**Metabolism Description**:

Venetoclax is primarily metabolized by cytochrome P450 (CYP) 3A4, with a minor contribution from CYP3A5. The major metabolic pathways involve oxidative N-dealkylation, hydroxylation, and glucuronidation. The primary metabolite, M3, retains approximately 20% of the parent compound's activity and is further metabolized by CYP3A4 and UGT1A1. Venetoclax and its me

# 3. Refining the drug description with abstracts

We now repeat the procedure above, this time combining the description generated at the previous step with the abstracts

In [38]:
refine_prompt = generate_prompt(refine_prompt_path,**{'previous_output':drug_description,'formatted_abstracts':"\n".join([f"Abstract {i+1}: {abstract}" for i, abstract in enumerate(abstracts)])})
print(refine_prompt)

<|im_start|>system
**Task**: Review and integrate a drug summary using the provided PubMed abstracts related to the drug. Synthesize both sets of information, offering corrections and updates as necessary.

**Instructions**:
1. Carefully read the output from the previous instruction, which contains a detailed description of the drug's mechanism of action and metabolism.
2. Review the list of PubMed abstracts, disregarding any information that is not relevant to the drug and focusing on key information that pertains to the drug's mechanism of action, metabolism, and any other relevant pharmacological details.
3. Compare and contrast the information from the previous output and the PubMed abstracts. Identify any discrepancies or conflicts between the two sources.
4. In case of conflicting information, give precedence to the information contained in the PubMed abstracts, as they are based on recent scientific research and publications.
5. Integrate the verified information and the new ins
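The `formatted_abstracts` field is built by numbering the abstracts and joining them with newlines. The sketch below wraps the same expression in a function and adds an optional per-abstract character cap, which may help when many long abstracts risk exceeding the model's context window. Both `format_abstracts` and `max_chars` are hypothetical additions, and a character cap is only a crude proxy for a token budget.

```python
def format_abstracts(abstracts, max_chars=None):
    """Number the abstracts and join them, optionally truncating each one."""
    lines = []
    for i, abstract in enumerate(abstracts):
        text = str(abstract).strip()
        if max_chars is not None and len(text) > max_chars:
            # keep the first max_chars characters and mark the cut
            text = text[:max_chars].rstrip() + "..."
        lines.append(f"Abstract {i+1}: {text}")
    return "\n".join(lines)


toy_abstracts = ["First abstract text.", "Second abstract text, somewhat longer."]
print(format_abstracts(toy_abstracts, max_chars=21))
# Abstract 1: First abstract text.
# Abstract 2: Second abstract text,...
```

With `max_chars=None` the function reproduces the join expression used in the cell above.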

In [39]:
outputs = llm.generate(refine_prompt, sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:39<00:00, 39.50s/it]


In [40]:
refined_drug_description = outputs[0].outputs[0].text
print(refined_drug_description)


**Mechanism of Action Description**:

Venetoclax is a potent and selective inhibitor of the BCL2 protein, a key regulator of apoptosis (programmed cell death). By binding to the BCL2 protein, Venetoclax disrupts the interaction between BCL2 and pro-apoptotic proteins, such as BAX and BAK, thereby promoting apoptosis in cancer cells that overexpress BCL2. This mechanism of action is particularly relevant in the treatment of chronic lymphocytic leukemia (CLL) and acute myeloid leukemia (AML), where BCL2 overexpression is a common feature and contributes to resistance to chemotherapy-induced apoptosis. Venetoclax has been shown to enhance T-cell effector function by increasing reactive oxygen species generation through inhibition of respiratory chain supercomplexes formation. In AML, azacytidine induces a viral mimicry response in AML cells by activating the STING/cGAS pathway, thereby rendering the AML cells more susceptible to T cell-mediated cytotoxicity.

**Metabolism Description**:


In [41]:
#dump refined_drug_description to a file
with open('drug_description_refined.txt','w') as f:
    f.write(refined_drug_description)

# 4. Pathway selection

In this part of the notebook we serve the LLM with a different library, [Guidance](https://github.com/guidance-ai/guidance), which lets us perform constrained inference.

*We suggest restarting the notebook before running this part. We are currently working on a new version of the notebook that avoids switching libraries by using [SGLang](https://github.com/sgl-project/sglang).*

In [1]:
import gc
import torch
import guidance
import pandas as pd
from guidance import select,gen,models
from pathlib import Path

from CellHit.data import get_reactome_layers,get_pathway_genes
from CellHit.LLMs import generate_prompt,dictionary_maker,self_consistency

In [2]:
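#specify the data_path (redefined here because the notebook has been restarted)
data_path = Path('/home/fcarli/CellHit/')/'data' #replace with yours
select_prompt_path = data_path/'prompts'/'mixtral_pathway_selector.txt'
model_path = Path('~/mixtral/').expanduser()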

Using the `get_reactome_layers` function from our codebase, we can automatically obtain the Reactome pathways at a given level of depth. For more information, we refer to the Methods section of our manuscript

In [3]:
reactome_pathways = get_reactome_layers(data_path/'reactome',layer_number=1)['PathwayName'].tolist()

In [4]:
reactome_pathways[:10]

['ABC-family proteins mediated transport',
 'Abacavir ADME',
 'Activation of HOX genes during differentiation',
 'Adaptive Immune System',
 'Amyloid fiber formation',
 'Apoptosis',
 'Aquaporin-mediated transport',
 'Aspirin ADME',
 'Atorvastatin ADME',
 'Azathioprine ADME']
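To make the notion of "level of depth" concrete, the sketch below assigns a layer to every pathway of a toy hierarchy with a breadth-first search. It assumes that layer 0 holds the top-level pathways and that a pathway's layer is its shortest distance from a top-level pathway. How `get_reactome_layers` actually computes layers is not shown in this notebook, and `pathway_layers` and the three toy relations are illustrative only.

```python
from collections import defaultdict, deque


def pathway_layers(relations):
    """Map each pathway to its layer given (parent, child) pairs."""
    children = defaultdict(list)
    all_nodes, child_nodes = set(), set()
    for parent, child in relations:
        children[parent].append(child)
        all_nodes.update((parent, child))
        child_nodes.add(child)

    # roots are pathways that never appear as a child
    roots = sorted(all_nodes - child_nodes)
    layers = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in layers:  # first visit gives the shortest distance
                layers[child] = layers[node] + 1
                queue.append(child)
    return layers


toy_relations = [
    ("Programmed Cell Death", "Apoptosis"),
    ("Immune System", "Adaptive Immune System"),
    ("Apoptosis", "Intrinsic Pathway for Apoptosis"),
]
layers = pathway_layers(toy_relations)
layer_1 = sorted(name for name, depth in layers.items() if depth == 1)
print(layer_1)  # ['Adaptive Immune System', 'Apoptosis']
```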

We read the drug description obtained in the previous steps, since we will now use it to associate Reactome pathways with the drug

In [5]:
#read the refined drug description
with open('drug_description_refined.txt','r') as f:
    drug_description = f.read()

We then define the constrained generation logic that we will use in order to associate the pathways to the drug

In [6]:
@guidance
def inference(lm, prompt, pathway_number, pathways_list, temperature=0.7):

    break_line = "\n"

    lm += prompt

    chosen_pathways = []
    #controlled generation: a free-text rationale, then a pathway constrained to the remaining options
    for i in range(pathway_number):
        #exclude already selected pathways so that each one is picked at most once
        feasible_pathways = [p for p in pathways_list if p not in chosen_pathways]
        lm += f"\nRationale {i+1}:{gen(f'rationale_{i+1}',stop=break_line,temperature=temperature)}"
        lm += f"\nPathway {i+1}:{select(options=feasible_pathways,name=f'pathway_{i+1}')}\n"
        chosen_pathways.append(lm[f'pathway_{i+1}'])
    return lm
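The core of the loop above is selection without replacement: at every step the model may only pick among pathways it has not picked yet. The sketch below isolates that logic without any LLM, replacing the constrained `select` call with a plain callable. `select_without_replacement` and `choose` are hypothetical names, and the early exit when no option is left is an addition that the function above does not have.

```python
def select_without_replacement(options, k, choose):
    """Pick up to k distinct options, calling choose() on the remaining ones."""
    chosen = []
    for _ in range(k):
        feasible = [option for option in options if option not in chosen]
        if not feasible:
            break  # nothing left to choose from
        chosen.append(choose(feasible))
    return chosen


toy_options = ['Apoptosis', 'Adaptive Immune System', 'Aspirin ADME']
# stand-in for the model: always take the first remaining option
picked = select_without_replacement(toy_options, k=5, choose=lambda feasible: feasible[0])
print(picked)  # ['Apoptosis', 'Adaptive Immune System', 'Aspirin ADME']
```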

We then load the model

In [7]:
gpu_id = 0
lm = models.Transformers(str(model_path),**{"device_map":f"cuda:{gpu_id}","revision":"gptq-4bit-32g-actorder_True"})

In [11]:
pathway_number = 15
self_k = 5
temperature = 0.7

In [9]:
prompt = generate_prompt(data_path/'prompts'/'mixtral_pathway_selector.txt',**{'drug_description':drug_description,'pathways_list':reactome_pathways,'pathway_number':pathway_number})
print(prompt)

**[INST] [Start of Drug Description]**

**Mechanism of Action Description**:

Venetoclax is a potent and selective inhibitor of the BCL2 protein, a key regulator of apoptosis (programmed cell death). By binding to the BCL2 protein, Venetoclax disrupts the interaction between BCL2 and pro-apoptotic proteins, such as BAX and BAK, thereby promoting apoptosis in cancer cells that overexpress BCL2. This mechanism of action is particularly relevant in the treatment of chronic lymphocytic leukemia (CLL) and acute myeloid leukemia (AML), where BCL2 overexpression is a common feature and contributes to resistance to chemotherapy-induced apoptosis. Venetoclax has been shown to enhance T-cell effector function by increasing reactive oxygen species generation through inhibition of respiratory chain supercomplexes formation. In AML, azacytidine induces a viral mimicry response in AML cells by activating the STING/cGAS pathway, thereby rendering the AML cells more susceptible to T cell-mediated cyto

We then apply the self-consistency procedure, repeating the selection for `self_k` iterations. At the end of each iteration, the results from the LLM are collected as a dictionary `{'selected_pathway': 'rationale'}`

In [12]:
dict_list = []

for _ in range(self_k): #the loop index is unused; `_` avoids shadowing the built-in iter
    lm = lm + inference(prompt,pathway_number=pathway_number, pathways_list=reactome_pathways,temperature=temperature)
    dict_list.append(dictionary_maker(lm))
    lm.reset()
    gc.collect()
    torch.cuda.empty_cache()

We then use a dedicated function to gather the results across iterations and count how many times each pathway was selected

In [13]:
output_dict = self_consistency(dict_list)
output_dict

{'Apoptosis': {'count': 5,
  'rationales': [" The drug, Venetoclax, is a potent and selective inhibitor of the BCL2 protein, which plays a crucial role in apoptosis. The 'Apoptosis' pathway is directly linked to Venetoclax's mechanism of action as it promotes apoptosis in cancer cells that overexpress BCL2.",
   ' The drug, Venetoclax, is a potent and selective inhibitor of the BCL2 protein which plays a crucial role in apoptosis. This mechanism of action suggests that pathways involved in apoptosis and cell cycle regulation would be relevant.',
   " The drug, Venetoclax, is a potent and selective inhibitor of the BCL2 protein, which is a key regulator of apoptosis. The 'Apoptosis' pathway is directly related to Venetoclax's mechanism of action as it promotes apoptosis in cancer cells that overexpress BCL2.",
   " The drug, venetoclax, is a BCL2 inhibitor that promotes apoptosis in cancer cells. The 'Apoptosis' pathway is directly related to the mechanism of action of venetoclax, as it

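The output above maps every pathway to a selection count and the list of rationales collected across iterations. A minimal sketch that produces the same shape from a list of `{'selected_pathway': 'rationale'}` dictionaries is given below. `aggregate_runs` is a hypothetical stand-in written for illustration (the internals of `self_consistency` are not shown here), and the toy rationales are placeholders.

```python
def aggregate_runs(dict_list):
    """Count how often each pathway was selected and keep its rationales."""
    summary = {}
    for run in dict_list:
        for pathway, rationale in run.items():
            entry = summary.setdefault(pathway, {'count': 0, 'rationales': []})
            entry['count'] += 1
            entry['rationales'].append(rationale)
    return summary


# two toy iterations with placeholder rationales
toy_runs = [
    {'Apoptosis': 'rationale A1', 'Macroautophagy': 'rationale M1'},
    {'Apoptosis': 'rationale A2'},
]
toy_summary = aggregate_runs(toy_runs)
print(toy_summary['Apoptosis']['count'])  # 2
```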
We then list pathways that were selected at least twice for the given drug

In [15]:
#print only pathways with count >= 2
for key in output_dict.keys():
    if output_dict[key]['count'] >= 2:
        print(key)

Apoptosis
MAPK family signaling cascades
Biological oxidations
Cell Cycle, Mitotic
Cytokine Signaling in Immune system
Cellular responses to stress
DNA Damage Reversal
MTOR signalling
Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs)
Signaling by Receptor Tyrosine Kinases
Transmission across Chemical Synapses
Adaptive Immune System
Cell Cycle Checkpoints
DNA Double-Strand Break Repair
Metabolism of lipids
Mismatch Repair
Macroautophagy
Transcriptional regulation of pluripotent stem cells


Given the obtained pathways, we can recover the associated genes through the `get_pathway_genes` function

In [18]:
#read the list of pathways together with reactome ids
reactome_pathways_ids = pd.read_csv(data_path/'reactome'/'pathways_l1.csv')

In [20]:
reactome_pathways_ids[reactome_pathways_ids['PathwayName']=='Apoptosis']

Unnamed: 0,PathwayID,PathwayName,Layer
5,R-HSA-109581,Apoptosis,1
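The single lookup above can be generalised to all selected pathways. The sketch below is a hypothetical helper, `genes_for_pathways`, that maps pathway names to Reactome IDs and then to gene lists through a callable passed in by the caller. The toy lookup tables stand in for the real data so that the example runs on its own.

```python
def genes_for_pathways(selected_names, name_to_id, get_genes):
    """Map each selected pathway name to a sorted list of unique genes."""
    genes = {}
    for name in selected_names:
        pathway_id = name_to_id.get(name)
        if pathway_id is None:
            continue  # name not present in the lookup table
        genes[name] = sorted(set(get_genes(pathway_id)))
    return genes


# toy stand-ins for the Reactome table and for the gene lookup
toy_name_to_id = {'Apoptosis': 'R-HSA-109581'}
toy_gene_table = {'R-HSA-109581': ['CASP8', 'CASP3', 'CASP8']}
result = genes_for_pathways(
    ['Apoptosis', 'Unknown pathway'], toy_name_to_id, toy_gene_table.__getitem__
)
print(result)  # {'Apoptosis': ['CASP3', 'CASP8']}
```

In this notebook, `name_to_id` could be built as `dict(zip(reactome_pathways_ids['PathwayName'], reactome_pathways_ids['PathwayID']))`, with `get_pathway_genes` passed as `get_genes`.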


In [22]:
get_pathway_genes('R-HSA-109581')

['CASP8',
 'CASP3',
 'DFFA',
 'DFFB',
 'KPNA1',
 'KPNB1',
 'H1-4',
 'H1-5',
 'H1-2',
 'H1-3',
 'H1-1',
 'H1-0',
 'HMGB2',
 'HMGB1',
 'APC',
 'PRKCQ',
 'CASP7',
 'PTK2',
 'SPTAN1',
 'GAS2',
 'DBNL',
 'GSN',
 'PLEC',
 'VIM',
 'ADD1',
 'CASP6',
 'MAPT',
 'CLSPN',
 'DSG2',
 'OCLN',
 'DSG1',
 'DSP',
 'CDH1',
 'PKP1',
 'CTNNB1',
 'TJP2',
 'DSG3',
 'TJP1',
 'ROCK1',
 'SATB1',
 'LMNB1',
 'PRKCD',
 'FNTA',
 'BMX',
 'BIRC2',
 'BCAP31',
 'STK24',
 'ACIN1',
 'STK26',
 'DNM1L',
 'PAK2',
 'ARHGAP10',
 'PSMD3',
 'PSMA6',
 'PSMD2',
 'PSMA5',
 'PSMD13',
 'PSMA4',
 'PSMD12',
 'PSMA3',
 'PSMD11',
 'PSMA2',
 'SEM1',
 'PSMD10',
 'PSMA1',
 'PSMD1',
 'PSMD14',
 'PSMC6',
 'PSMC5',
 'PSMC4',
 'PSMC3',
 'PSMC2',
 'PSMC1',
 'PSMB9',
 'PSMB8',
 'PSMB11',
 'PSME4',
 'PSMB7',
 'PSMB6',
 'PSMA8',
 'PSMF1',
 'PSME3',
 'PSMB5',
 'PSME2',
 'PSMB4',
 'PSME1',
 'PSMD9',
 'PSMB3',
 'PSMD8',
 'PSMB2',
 'PSMD7',
 'PSMD6',
 'PSMB10',
 'PSMD5',
 'PSMB1',
 'PSMD4',
 'PSMA7',
 'UBC',
 'UBB',
 'UBA52',
 'RPS27A',
 'OMA1',
 'OPA1