In [1]:
import sys
import os
from pathlib import Path
sys.path.append(str(Path.cwd().parent))
from config import OPENAI_API_KEY, NCBI_API_KEY, EMAIL
from src.clinfoai.pubmed_engine import PubMedNeuralRetriever



# Using Clinfo.AI 

In this tutorial, we will go through each step of the Clinfo.AI workflow. Before we start, we need to set up a few things. 


### 1.- Setting up the environment:
1.a.- Install the conda environment using the YAML file provided.

``` conda env create -f environment.yaml ```

1.b.- Select this environment as the kernel when running the notebook. I recommend using VS Code.



### 2.- Creating Accounts

You will need at least one account and at most two (depending on how many calls per hour you plan to make):
* OpenAI account: If you start a free account for the first time, you will get $5 in API credits.
* NCBI_API_KEY: This is only necessary if you plan to make more than 10 calls per hour.


Once you have created your account(s), open the **src\config.py** file and:

* Set OPENAI_API_KEY to your OpenAI API key

If you created an NCBI API key, add your key and email to the following values:
* NCBI_API_KEY
* EMAIL

Otherwise, leave them as None.
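
For reference, a `config.py` along the following lines would satisfy the imports used in this notebook. This is only a sketch: the three variable names come from this tutorial, but reading them from environment variables is an assumption (it keeps keys out of version control), and `NCBI_EMAIL` is a hypothetical variable name. Pasting the values in directly works just as well.

```python
# config.py (illustrative sketch, not the file shipped with the repository)
import os

# Required: your OpenAI API key. Reading it from the environment is an assumption.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

# Optional: both stay None if you did not create an NCBI API key.
NCBI_API_KEY = os.environ.get("NCBI_API_KEY")  # None when the variable is unset
EMAIL = os.environ.get("NCBI_EMAIL")           # "NCBI_EMAIL" is a hypothetical name
```

Note that the next cell assigns `OPENAI_API_KEY` to `os.environ`, which requires a string, so that key must be set before running it.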

In [2]:
# Make sure you have completed steps 1 and 2 before running this cell.
from  config import OPENAI_API_KEY, NCBI_API_KEY, EMAIL
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


### 3.- Defining your own prompts:
We have designed prompts for each step of the Clinfo.AI workflow, leveraging the power of in-context learning. If you want to use your own prompts, you can edit them in **src\clinfoai\prompts**; otherwise, we will use the default prompts:

In [3]:
PROMPS_PATH = os.path.join("..","src","clinfoai","prompts","PubMed","Architecture_1","master.json")
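
Before editing the prompts, it can help to look at what the file contains. The helper below is a minimal sketch that only reports the top-level keys of a JSON file; `list_prompt_tasks` is a hypothetical name, and since the internal layout of `master.json` is not described in this tutorial, the function deliberately assumes nothing beyond valid JSON.

```python
import json
from pathlib import Path


def list_prompt_tasks(path):
    """Return the sorted top-level keys of a JSON prompt file.

    Returns an empty list when the top-level value is not a JSON object,
    because no particular file layout is assumed here.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return sorted(data.keys())
    return []
```

In this notebook you could call it as `list_prompt_tasks(PROMPS_PATH)`.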

### 4.- Define Clinfo.AI LLM Backbone
Clinfo.AI uses a chain of LLMs to summarize information, so we need to define an LLM backbone.

We will start with OpenAI models. However, if you have access to GPUs, it is possible to use Clinfo.AI with vLLM to run open-source LLMs as backbones (check tutorial 3).

In [4]:
MODEL: str = "gpt-3.5-turbo"
# MODEL: str = "Qwen/Qwen2-beta-7B-Chat"

### 5.- Init Clinfo+PubMed Engine
We now have everything we need to start our Clinfo.AI + PubMed instance:

In [5]:
# Initialize the neural retriever from the prompt path.
# Do not change the path if you want to use the base prompts; otherwise, specify your own prompt architecture.

nrpm = PubMedNeuralRetriever(
    architecture_path = PROMPS_PATH,
    model             = MODEL,
    verbose           = False,
    debug             = False,
    open_ai_key       = OPENAI_API_KEY,
    email             = EMAIL)


Task Name: pubmed_query_prompt
------------------------------------------------------------------------

Task Name: relevance_prompt
------------------------------------------------------------------------

Task Name: summarization_prompt
------------------------------------------------------------------------

Task Name: synthesize_prompt
------------------------------------------------------------------------


# Let's start!

In [16]:
### Step 0: Ask a question ###
# QUESTION  = "What is the prevalence of COVID-19 in the United States?"
QUESTION    = "What tests are needed to diagnose Chronic Neutropenia?"


## STEP 1 (Search PubMed): Convert the question into a query using an LLM
# This returns a list of queries (containing MeSH terms).
# These queries are used to retrieve articles from NCBI.
# Once the articles are retrieved, we collect a list of their article IDs (PMIDs).
pubmed_queries, article_ids = nrpm.search_pubmed(
    question=QUESTION,
    num_results=10,
    num_query_attempts=1)

print(f"Articles retrieved: {len(article_ids)}")
print(pubmed_queries)
print(article_ids)

Articles retrieved: 10
['("Chronic Neutropenia" AND "diagnosis" AND "tests")']
['27841775', '30870474', '6602565', '34303547', '10388004', '19305028', '11964321', '3534197', '20301576', '24827415']
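
To make the retrieval step more concrete, here is a minimal sketch of how a query like the one above can be expressed as a request to NCBI's E-utilities ESearch endpoint. It only builds the URL and sends nothing. This is not necessarily how `search_pubmed` works internally; `build_esearch_url` is a hypothetical helper written for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, num_results=10, api_key=None, email=None):
    """Build an ESearch URL for a PubMed query (no request is sent)."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": query,          # the query string produced in Step 1
        "retmax": num_results,  # maximum number of PMIDs to return
        "retmode": "json",
    }
    # The API key and email are optional, matching the config values above.
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url('("Chronic Neutropenia" AND "diagnosis" AND "tests")')
print(url)
```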


In [17]:
## Step 2: Fetch article data
# Previously, we only extracted the PMIDs. Now we will use those PMIDs to retrieve the metadata:
articles = nrpm.fetch_article_data(article_ids)

# Print an example article. Note that index 1 selects the second article (indexing starts at 0):
article_num = 1
print(f"Article {article_num}:\n")

#print(articles[article_num].keys())
#print(articles[article_num]['PubmedData'])
print(articles[article_num]["MedlineCitation"]["Article"]["Abstract"]["AbstractText"])
#print(articles[article_num]["MedlineCitation"]["Article"])


Article 1:

['Myelopoiesis was evaluated in 66 pediatric patients with chronic neutropenia who were positive for anti-neutrophil antibodies (median age at diagnosis: 11 months, median neutrophil count at diagnosis: 419/μl). Other causes of neutropenia were excluded. Bone marrow morphology, clonogenic tests and/or the peripheral blood CD 34+ cell count, and apoptotic rate were evaluated in 61 patients with neutropenia lasting > 12 months or severe infections. The peripheral blood CD 34+ cell count and apoptotic rate were evaluated in five patients with shorter neutropenia. The median follow-up time was 29 months (range 7-180 months). Forty-seven patients (71.2%) had a spontaneous recovery after 7-180 months (median 29 months). The group of patients younger than 24 months at diagnosis (n = 50) had a higher probability of recovery (40/50 vs. 7/16 χ2 p<0.01) with a shorter period of neutropenia (median 26 versus 47 months, Kaplan-Meier analysis p = 0.001). The colony-forming units-granuloc
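
The nested lookup above raises a `KeyError` for any record that has no abstract, which does happen for some PubMed entries. A small defensive accessor avoids that. This is a sketch: `get_abstract` is a hypothetical helper, and it assumes only the key path used in the cell above.

```python
def get_abstract(article):
    """Return the abstract of a fetched record as one string, or "" if absent."""
    try:
        parts = article["MedlineCitation"]["Article"]["Abstract"]["AbstractText"]
    except (KeyError, TypeError):
        return ""
    # AbstractText is printed as a list above; join the parts in case there are several.
    if isinstance(parts, str):
        return parts
    return " ".join(str(part) for part in parts)
```

For example, `get_abstract(articles[article_num])` returns the same text as the cell above, joined into a single string.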

In [18]:
# STEP 3: Summarize each article
# This step is parallelized: although it looks like a single call, it makes one LLM call per article to summarize it.
# A separate LLM call then judges each article's relevance to the original question.

article_summaries, irrelevant_articles = nrpm.summarize_each_article(articles, QUESTION)
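
The pattern described in the comments (one summarization call and one relevance call per article, run in parallel, then split into two lists) can be sketched as follows. This is not Clinfo.AI's implementation: `summarize` and `is_relevant` are hypothetical stand-ins for the LLM calls, and the thread pool is just one way to parallelize them.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(article, question):
    # Hypothetical stand-in for an LLM summarization call.
    return f"Summary of {article['title']}"


def is_relevant(article, question):
    # Hypothetical stand-in for an LLM relevance call: naive keyword matching
    # on the longer words of the question.
    keywords = [w.strip("?.,").lower() for w in question.split()]
    keywords = [w for w in keywords if len(w) > 6]
    abstract = article["abstract"].lower()
    return any(w in abstract for w in keywords)


def process(article, question):
    # One summary and one relevance judgement per article.
    return {
        "title": article["title"],
        "summary": summarize(article, question),
        "relevant": is_relevant(article, question),
    }


def summarize_all(articles, question, max_workers=4):
    # pool.map runs process() concurrently and preserves the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda a: process(a, question), articles))
    relevant = [r for r in results if r["relevant"]]
    irrelevant = [r for r in results if not r["relevant"]]
    return relevant, irrelevant
```

Threads suit this kind of workload because each call spends most of its time waiting on a network response rather than computing.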

In [19]:
# Summaries for relevant articles
article_summaries

[{'title': '[Neutropenia in infancy - sometimes chronic and benign - own experiences].',
  'url': 'https://pubmed.ncbi.nlm.nih.gov/19305028/',
  'abstract': 'Chronic neutropenia is a decrease in circulating neutrophils in the peripheral blood lasting over 6 months. Values need to be refered with the age and race. In children aged 2 weeks to 12 months reffered values are above 1000/03BCL. There are congenital and aquired reasons of neutropenia in infancy. The most common type of chronic neutropenia in infants is chronic, benign neutropenia (AIN). Authors present ten infants between three and six months with chronic, benign neutropenia. The reason of ordering laboratory tests at outpatient clinic were benign upper respiratory tract infections (four cases), pallor (four cases) and on parental demand (one case). In one infant neutropenia was observed during treatment of pneumonia at a district hospital.',
  'citation': 'Kaczorowska-Hać B, Wierzba J, Stefanowicz J, Sielachowicz K, Wlazłowsk

In [14]:
# Articles deemed irrelevant
irrelevant_articles

[{'title': 'Myocarditis With COVID-19 mRNA Vaccines.',
  'url': 'https://pubmed.ncbi.nlm.nih.gov/34281357/',
  'abstract': 'Myocarditis has been recognized as a rare complication of coronavirus disease 2019 (COVID-19) mRNA vaccinations, especially in young adult and adolescent males. According to the US Centers for Disease Control and Prevention, myocarditis/pericarditis rates are ≈12.6 cases per million doses of second-dose mRNA vaccine among individuals 12 to 39 years of age. In reported cases, patients with myocarditis invariably presented with chest pain, usually 2 to 3 days after a second dose of mRNA vaccination, and had elevated cardiac troponin levels. ECG was abnormal with ST elevations in most, and cardiac MRI was suggestive of myocarditis in all tested patients. There was no evidence of acute COVID-19 or other viral infections. In 1 case, a cardiomyopathy gene panel was negative, but autoantibody levels against certain self-antigens and frequency of natural killer cells were
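
Because the raw dictionaries are long, a compact listing of titles and links is often easier to scan. The sketch below relies only on the `title` and `url` keys visible in the outputs above; `format_sources` is a hypothetical helper name.

```python
def format_sources(summaries):
    """Return one numbered 'title (url)' line per summary dictionary."""
    return [
        f"{i}. {s.get('title', 'Untitled')} ({s.get('url', 'no url')})"
        for i, s in enumerate(summaries, start=1)
    ]
```

In this notebook, `print("\n".join(format_sources(article_summaries)))` lists the relevant articles, and the same call works for `irrelevant_articles`.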

In [21]:
# STEP 4: Synthesize all summaries to answer the question
synthesis = nrpm.synthesize_all_articles(article_summaries, QUESTION)
print("synthesis")
print(synthesis)

synthesis
Literature Summary: Studies on chronic neutropenia in children have highlighted various diagnostic approaches. De Alarcon et al. utilized bone marrow aspiration, biopsy, steroid stimulation, and in vitro assays to categorize patients, finding that a normal bone marrow response to steroids indicated a benign course, while CFU-GM and CSA assays helped classify neutropenia. Antineutrophil antibody quantification in children with chronic neutropenia may aid in understanding the clinical course, as demonstrated by Kobayashi et al. In autoimmune neutropenia of infancy, Lalezari et al. emphasized the importance of immunologic tests for diagnosis. Bux et al. found associations between reduced CFU-GM, circulating hematopoietic progenitors, and severe infections in pediatric patients with anti-neutrophil antibodies. Dale and Welte reviewed chronic idiopathic neutropenia (CIN) and autoimmune neutropenia (AIN), suggesting G-CSF treatment for increasing neutrophils in symptomatic cases. W

# Great! We answered our first question using Clinfo.AI!
## Here are all the steps condensed:

In [16]:
PROMPS_PATH  = os.path.join("..","src","clinfoai","prompts","PubMed","Architecture_1","master.json")
# MODEL: str = "gpt-3.5-turbo"
MODEL: str   = "Qwen/Qwen2-beta-7B-Chat"

nrpm = PubMedNeuralRetriever(
    architecture_path = PROMPS_PATH,
    model             = MODEL,
    verbose           = False,
    debug             = False,
    open_ai_key       = OPENAI_API_KEY,
    email             = EMAIL)

### STEP 0: Ask a question ###
question    = "What is the prevalence of COVID-19 in the United States?"

### STEP 1: Search PubMed ###
pubmed_queries, article_ids = nrpm.search_pubmed(question,num_results=10,num_query_attempts=1)

### STEP 2: Fetch article data ###
articles = nrpm.fetch_article_data(article_ids)

### STEP 3: Summarize each article and separate out the irrelevant ones ###
article_summaries, irrelevant_articles = nrpm.summarize_each_article(articles, question)


### STEP 4: Synthesize the results ###
synthesis = nrpm.synthesize_all_articles(article_summaries, question)
print("Result:")
print(synthesis)


Task Name: pubmed_query_prompt
------------------------------------------------------------------------

Task Name: relevance_prompt
------------------------------------------------------------------------

Task Name: summarization_prompt
------------------------------------------------------------------------

Task Name: synthesize_prompt
------------------------------------------------------------------------
Trying to init model via VLM
Result:
Literature Summary: The prevalence of COVID-19 in the United States varies across different populations and occupational groups. A cross-sectional study by Koh (2020) estimated that workers in the healthcare and social assistance industry, as well as healthcare professions and support roles, had elevated risks, with a national prevalence of around 1.7% in a community-based survey in 2020. A meta-analysis of hospitalized patients by Esper et al. (2021) found a prevalence of 6.9% with bacterial co-infection or secondary infection, with higher 
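
If you plan to ask several questions, the four steps can be wrapped in one function. This is a sketch that uses only the retriever methods shown in this tutorial; `answer_question` and the keys of the returned dictionary are hypothetical names chosen for illustration.

```python
def answer_question(retriever, question, num_results=10, num_query_attempts=1):
    """Run the four workflow steps and return the synthesis with intermediate results."""
    # Step 1: turn the question into PubMed queries and collect article IDs.
    queries, article_ids = retriever.search_pubmed(
        question, num_results=num_results, num_query_attempts=num_query_attempts)
    # Step 2: fetch the metadata for those IDs.
    articles = retriever.fetch_article_data(article_ids)
    # Step 3: summarize each article and separate out the irrelevant ones.
    summaries, irrelevant = retriever.summarize_each_article(articles, question)
    # Step 4: synthesize the relevant summaries into one answer.
    synthesis = retriever.synthesize_all_articles(summaries, question)
    return {
        "question": question,
        "queries": queries,
        "summaries": summaries,
        "irrelevant": irrelevant,
        "synthesis": synthesis,
    }
```

With the retriever defined above, `answer_question(nrpm, question)["synthesis"]` gives the same kind of result as the condensed cell.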