# Setup

To set up, run:
```bash
# Install Lotus and Ollama
pip install git+https://github.com/calebwin/lotus.git
bash data/install_ollama.sh

# Pull the Llama and DeepSeek models (use Carina or Colab for a GPU)
ollama pull llama3.2:3b
ollama pull llama3.1:8b
ollama pull deepseek-r1:7b

# Start the ollama server
ollama serve
```

See [full docs here](https://lotus-ai.readthedocs.io/en/latest/core_concepts.html).

In [1]:
import pandas as pd
import lotus
from lotus.models import LM, SentenceTransformersRM
from lotus.types import CascadeArgs

lm_smallest = LM(model="ollama/llama3.2:3b") # llama3.2
lm_medium = LM(model="ollama/llama3.1:8b")
lm_medium_r = LM(model="ollama/deepseek-r1:7b")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")

lotus.settings.configure(rm=rm, lm=lm_medium_r)

2025-02-13 16:20:32,839 - INFO - Use pytorch device_name: cuda
2025-02-13 16:20:32,840 - INFO - Load pretrained SentenceTransformer: intfloat/e5-base-v2


# Basics
Simple example of semantically joining two text columns.

Note that there are 2 "modes" for using Lotus to semantically analyze your data:
1. No explicit reasoning (e.g. `llama3.1:8b`)
2. Reasoning LLMs (e.g. `deepseek-r1:7b`): include `strategy="deepseek"` and `return_explanations=True` to get a column with the reasoning involved in the semantic analysis.

In this tutorial we will use the latter mode. Make sure you have installed Caleb's fork of Lotus (see Setup), which adds support for reasoning LLMs.

In [None]:
# create dataframes with course names and skills
courses_data = {
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
}
skills_data = {"Skill": ["Math", "Computer Science"]}
courses_df = pd.DataFrame(courses_data)
skills_df = pd.DataFrame(skills_data)

# lotus semantic join with reasoning
res = courses_df.sem_join(skills_df, "Taking {Course Name} will help me learn {Skill}", return_explanations=True, strategy="deepseek")
print(res)

Join comparisons: 100%|██████████ 12/12 LM Calls [00:14<00:00,  1.22s/it]

                 Course Name  \
1        Riemannian Geometry   
2          Operating Systems   
4                  Compilers   
4                  Compilers   
5  Intro to computer science   

                                    explanation_join             Skill  
1  Okay, I need to determine whether the claim "T...              Math  
2  Okay, so I need to figure out if the claim is ...  Computer Science  
4  Okay, so the user is asking whether taking a c...              Math  
4  Okay, I need to determine if the claim "Taking...  Computer Science  
5  Okay, so I need to determine if the claim is t...  Computer Science  





# Case Study 1: Clinical Trial Matching
The problem of clinical trial matching is finding patient-trial matches given a dataset of text summaries of patients and text descriptions of trials. In Lotus-speak, this is a "semantic join" between patients and trials. The naive approach would be something like:
```
for P in patients:
    for T in trials:
        compute LLM("is {P} eligible for {T}")
```
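This requires one LLM call per patient-trial pair, so the number of calls grows with the product of the number of patients and the number of trials, which is slow and expensive. A program written in Lotus, as demonstrated below, is automatically optimized to run a more efficient algorithm.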

For example, one algorithm Lotus may generate will (1) predict a trial zero-shot for each patient, in linear time, then (2) use embedding similarity to quickly find a set of probable patient-trial matches (technically, matches between each trial and the trial that was zero-shot predicted for the patient), and finally (3) run the LLM only on the pairs in this subset of probable matches. The Lotus optimizer chooses the parameters and the variant of this semantic join algorithm.
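To make the cost difference concrete, here is a minimal sketch that counts judge calls for the naive join and for a prefiltered join. It is a simplification, not Lotus's implementation: it skips the zero-shot prediction in step (1) and compares patient text directly to trial text, it uses word overlap as a stand-in for embedding similarity, and `stub_judge`, `top_k`, and the toy strings are all hypothetical.

```python
from itertools import product


def tokens(text):
    """Lowercased word set, a crude stand-in for an embedding."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard overlap between word sets (placeholder for embedding similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def naive_join(patients, trials, judge):
    """Call the judge on every pair: len(patients) * len(trials) calls."""
    calls, matches = 0, []
    for p, t in product(patients, trials):
        calls += 1
        if judge(p, t):
            matches.append((p, t))
    return matches, calls


def prefiltered_join(patients, trials, judge, top_k=1):
    """Keep the top_k most similar trials per patient, then call the judge on those only."""
    calls, matches = 0, []
    for p in patients:
        ranked = sorted(trials, key=lambda t: similarity(p, t), reverse=True)
        for t in ranked[:top_k]:
            calls += 1
            if judge(p, t):
                matches.append((p, t))
    return matches, calls


def stub_judge(patient, trial):
    """Hypothetical judge; in practice this would be an LLM call."""
    return similarity(patient, trial) > 0.1


# Toy data, invented for illustration only.
patients = ["gastric adenocarcinoma weight loss", "stress urinary incontinence"]
trials = [
    "chemoradiotherapy for resectable gastric adenocarcinoma",
    "pelvic floor training for urinary incontinence",
    "statin therapy for hypercholesterolemia",
]

naive_matches, naive_calls = naive_join(patients, trials, stub_judge)
fast_matches, fast_calls = prefiltered_join(patients, trials, stub_judge, top_k=1)
print(f"naive: {naive_calls} calls, prefiltered: {fast_calls} calls")
```

On this toy input both joins return the same matches, but the prefiltered version makes one judge call per patient instead of one per pair. The trade-off is that a small `top_k` can miss true matches that the similarity measure ranks low.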

#### Load data

In [4]:
trial_matching_df = pd.read_parquet("/share/pi/nigam/data/med-s1/trialllama").head(5)

#### Define target precision and recall

In [None]:
# The selectivity of this query is so high that even a 50% sample may
# not contain a query result and hence the sample would be useless for
# optimizing for a target recall or precision
# cascade_args = CascadeArgs(recall_target=None, precision_target=None)

#### Execute semantic join

In [31]:
# Prepare the dataframes
patients_df = trial_matching_df[['id', 'topic']].drop_duplicates().rename(columns={'topic': 'patient_note', 'id': 'patient_id'})
trials_df = trial_matching_df[['clinical_trial']].drop_duplicates()

# Define the semantic join instruction
join_instruction = "is patient {patient_note} definitely eligible and relevant for the clinical trial {clinical_trial}"

# Perform the semantic join
result_df = patients_df.sem_join(trials_df, join_instruction, return_explanations=True, strategy="deepseek")


Join comparisons:   0%|           0/25 LM Calls [00:00<?, ?it/s]2025-02-13 01:34:27,619 - INFO - 	 Failed to parse: defaulting to True
2025-02-13 01:34:27,621 - INFO - 	 Failed to parse: defaulting to True
2025-02-13 01:34:27,622 - INFO - 	 Failed to parse: defaulting to True
2025-02-13 01:34:27,623 - INFO - 	 Failed to parse: defaulting to True
2025-02-13 01:34:27,624 - INFO - 	 Failed to parse: defaulting to True
2025-02-13 01:34:27,630 - INFO - 	 Failed to parse: defaulting to True
Join comparisons: 100%|██████████ 25/25 LM Calls [01:09<00:00,  2.78s/it]


In [35]:
result_df = result_df[result_df["explanation_join"].notna()]
result_df

Unnamed: 0,patient_id,patient_note,explanation_join,clinical_trial
0,20996_30-2022_NCT01579747,Here is the patient note:\nA 47-year-old woman...,"Okay, so I'm trying to figure out if this pati...",Here is the clinical trial:\nTitle: DHEA and T...
0,20996_30-2022_NCT01579747,Here is the patient note:\nA 47-year-old woman...,"Okay, so I need to determine if the patient de...",Here is the clinical trial:\nTitle: Associatio...
1,7330_11-2022_NCT00420394,Here is the patient note:\nA 63-year-old man c...,"Alright, I need to determine if the patient de...",Here is the clinical trial:\nTitle: Does Ultra...
1,7330_11-2022_NCT00420394,Here is the patient note:\nA 63-year-old man c...,"Alright, so I need to determine whether the pa...",Here is the clinical trial:\nTitle: Perioperat...
3,25242_36-2022_NCT01944501,Here is the patient note:\nA 47-year-old woman...,"Alright, so I need to determine if the patient...",Here is the clinical trial:\nTitle: DHEA and T...
3,25242_36-2022_NCT01944501,Here is the patient note:\nA 47-year-old woman...,"Alright, so I need to figure out if the patien...",Here is the clinical trial:\nTitle: Associatio...
4,20372_42-2021_NCT04091321,Here is the patient note:\n19 yo Hispanic fema...,"Alright, I need to determine if the patient de...",Here is the clinical trial:\nTitle: Associatio...


#### Evaluate trial matches

In [37]:
merged_df = result_df.merge(trial_matching_df[trial_matching_df["output"] == "A: eligible"][['id', 'clinical_trial', 'output']], 
                            left_on=['patient_id', 'clinical_trial'], 
                            right_on=['id', 'clinical_trial'],
                            how="inner")
merged_df

Unnamed: 0,patient_id,patient_note,explanation_join,clinical_trial,id,output
0,7330_11-2022_NCT00420394,Here is the patient note:\nA 63-year-old man c...,"Alright, so I need to determine whether the pa...",Here is the clinical trial:\nTitle: Perioperat...,7330_11-2022_NCT00420394,A: eligible


In [40]:
# Calculate precision and recall
num_true_matches = trial_matching_df[trial_matching_df['output'] == 'A: eligible'].shape[0]
total_predicted = len(result_df)
# True positives are the predicted pairs that also appear in the gold labels (merged_df)
precision = len(merged_df) / total_predicted if total_predicted > 0 else 0
recall = len(merged_df) / num_true_matches if num_true_matches > 0 else 0

# Print results
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")

# Display a few results
print("\nCorrect matches:")
sample = merged_df.sample(min(5, merged_df.shape[0]))
for _, row in sample.iterrows():
    print(f"Patient ID: {row['patient_id']}")
    print(f"Clinical Trial: {row['clinical_trial'][:100].replace('Here is the clinical trial:', '').replace('Title:', '').strip()}...")  # Truncate for readability
    print("Predicted: Eligible")
    print(f"Actual: {'Eligible' if row['output'] == 'A: eligible' else 'Not Eligible'}")
    print(f"Explanation: {row['explanation_join']}")
    print()

Precision: 14.29%
Recall: 50.00%

Correct matches:
Patient ID: 7330_11-2022_NCT00420394
Clinical Trial: Perioperative Chemoradiotherapy for Potentially Resectable Gastri...
Predicted: Eligible
Actual: Eligible
Explanation: Alright, so I need to determine whether the patient described in the patient note is definitely eligible and relevant for the clinical trial. Let me start by carefully reviewing both pieces of information.

First, looking at the patient note: a 63-year-old man presents with unintentional weight loss and epigastric discomfort after meals. He has no known medical issues or medications. His vital signs are normal except for blood pressure slightly above average (130/75) and a regular pulse rate of 88/min. The upper endoscopy shows a lesion in the stomach with features typical of diffuse-type adenocarcinoma, specifically signet ring cells that don't form glands.

Now, examining the clinical trial: it's about perioperative chemoradiotherapy for potentially resectable gast
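
The evaluation in this case study reduces to set arithmetic over (patient, trial) pairs: a true positive is a predicted pair that also appears in the gold labels, precision divides true positives by the number of predicted pairs, and recall divides them by the number of gold pairs. A minimal sketch, with made-up pair labels and a hypothetical helper name `precision_recall`:

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of (patient_id, trial_id) pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical labels: 4 predicted pairs, 3 gold pairs, 2 in common.
predicted_pairs = [("p0", "t0"), ("p0", "t1"), ("p1", "t2"), ("p2", "t3")]
gold_pairs = [("p0", "t0"), ("p1", "t2"), ("p3", "t4")]

precision, recall = precision_recall(predicted_pairs, gold_pairs)
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
```

Working with sets also guards against counting a duplicated pair twice, which can happen when the joined dataframe repeats a patient-trial row.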

# Case Study 2: Reasoning Trace Curation
Recent work has demonstrated the value of high-quality, difficult, and diverse reasoning traces. Curating such data for a particular domain like biomedicine is challenging because quality, difficulty, and diversity all have domain-specific meanings. Let's use Lotus to help us filter!

First we'll take a look at an example from a large dataset of reasoning traces that was used to train HuaTuoGPT-o1.

In [2]:
train_df = pd.read_parquet("/share/pi/nigam/data/med-s1/huatuogpt-o1")
train_sample = train_df.iloc[0]
print("QUESTION: ", train_sample["Question"])
print("\nREASONING TRACE: ", train_sample["Complex_CoT"])
print("\nANSWER: ", train_sample["Response"])

QUESTION:  A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?

REASONING TRACE:  Okay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her abdominal pressure like coughing or sneezing. This sounds a lot like stress urinary incontinence to me. Now, it's interesting that she doesn't have any issues at night; she isn't experiencing leakage while sleeping. This likely means her bladder's ability to hold urine is fine when she isn't under physical stress. Hmm, that's a clue that we're dealing with something related to pressure rather than a bladder muscle problem. 

The fact that she underwent a Q-tip test is intriguing too. This test 

Recent work shows that a reasoning LLM can be trained efficiently by selecting a small set of samples (n=1000) that score highly on:
1. Quality
2. Difficulty
3. Diversity

In the biomedical domain (if we focus on the task of rare disease diagnosis), there are many ways to define quality, difficulty, and diversity in reasoning. Here are a few examples:
1. **Quality:** meeting standards of care (credit Suhana), logical coherence, more tokens for harder steps, frequent verification steps, references to underlying pathophysiology, acknowledgment of alternate possibilities
2. **Difficulty:** # of reasoning steps, disease rarity, # of diagnoses ruled out
3. **Diversity:** different diseases, relevant physiological systems

In [None]:
train_df_filtered = (
    train_df
    .head(5)
    # (1) Semantic embedding-based deduplication
    .sem_index('Question', 'huatuogpt_o1_train_questions')
    .sem_dedup('Question', threshold=0.9)
    # (2) Filter by Quality
    .sem_filter('{Complex_CoT} meets all standards of care, references underlying pathophysiology, and acknowledges alternate possibilities', strategy="deepseek", return_explanations=True)
)
# (3) Compute columns for Difficulty and Diversity
train_df_labeled = train_df_filtered.sem_extract(['Question', 'Complex_CoT'], {'Domain': 'the specialty referenced by the question, capitalized', 'Difficulty': 'the level of difficulty between 1-10'}, strategy="deepseek")

100%|██████████| 1/1 [00:00<00:00, 51.66it/s]
Filtering: 100%|██████████ 4/4 LM calls [00:07<00:00,  1.79s/it]
Extracting: 100%|██████████ 4/4 LM calls [00:06<00:00,  1.73s/it]
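
Step (1) in the cell above drops near-duplicate questions by embedding similarity. The sketch below is one way to picture threshold-based deduplication. It assumes cosine similarity and a greedy keep-first rule, which may differ from what Lotus does internally, and the two-dimensional vectors and question labels are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup(items, embeddings, threshold=0.9):
    """Greedy keep-first: drop an item if it is too similar to one already kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]


# Made-up embeddings: the second question is a near-copy of the first.
questions = ["q_a", "q_a_paraphrase", "q_b"]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]

unique_questions = dedup(questions, vectors, threshold=0.9)
print(unique_questions)
```

A higher threshold keeps more near-duplicates, and a lower one removes more aggressively, so the value is worth tuning against a hand-checked sample.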


Now that we've filtered reasoning traces based on quality and generated labels for diversity and difficulty, we can use these labels for diversity-based, difficulty-weighted sampling to arrive at an even smaller set of high-quality reasoning traces to train a reasoning LLM on.
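
One possible reading of diversity-based, difficulty-weighted sampling is sketched below: visit domains in rounds so that each domain is represented before any is repeated, and within a domain pick a trace with probability proportional to its difficulty. This is an illustrative assumption, not a method the tutorial prescribes. `sample_diverse` is a hypothetical helper, and it assumes `Difficulty` has already been converted to a positive integer.

```python
import random
from collections import defaultdict


def sample_diverse(records, n, seed=0):
    """Pick up to n records, one domain at a time, weighted by difficulty within a domain."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for rec in records:
        by_domain[rec["Domain"]].append(rec)

    chosen = []
    while len(chosen) < n and any(by_domain.values()):
        # Visit domains in a fresh random order each round.
        domains = [d for d in by_domain if by_domain[d]]
        rng.shuffle(domains)
        for domain in domains:
            if len(chosen) >= n:
                break
            pool = by_domain[domain]
            weights = [rec["Difficulty"] for rec in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)
            chosen.append(pick)
    return chosen


# Labels mirror the Domain and Difficulty columns; ids are hypothetical.
records = [
    {"id": 0, "Domain": "Urology", "Difficulty": 5},
    {"id": 1, "Domain": "Neurology", "Difficulty": 9},
    {"id": 2, "Domain": "Dermatology", "Difficulty": 5},
    {"id": 3, "Domain": "Orthopedics", "Difficulty": 4},
]

selected = sample_diverse(records, n=3)
print([rec["Domain"] for rec in selected])
```

With only one trace per domain the difficulty weights have no effect; they start to matter once a domain holds several candidates.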

In [21]:
train_df_filtered.reset_index(drop=True).merge(train_df_labeled, left_index=True, right_index=True)

Unnamed: 0,Question,Complex_CoT,Response,explanation_filter,Domain,Difficulty
0,A 61-year-old woman with a long history of inv...,"Okay, let's think about this step by step. The...",Cystometry in this case of stress urinary inco...,"Alright, let's break down this claim step by s...",Urology,5
1,A 45-year-old man presents with symptoms inclu...,"Okay, so here's a 45-year-old guy who's experi...",Based on the clinical findings presented—wide-...,"Alright, so I need to determine if the claim i...",Neurology,9
2,A patient with psoriasis was treated with syst...,I'm thinking about this patient with psoriasis...,The development of generalized pustules in a p...,"Okay, so I need to determine whether the claim...",Dermatology,5
3,What is the most likely diagnosis for a 2-year...,"Okay, so we're dealing with a 2-year-old child...",Based on the described symptoms and the unusua...,"Alright, so I need to evaluate whether Complex...",Orthopedics,4


# Case Study 3: Rare Disease Cohort Analysis
When analyzing a rare disease cohort with the intent to build a better diagnostic model, the clinical notes can be a goldmine of information, but they are hard to query because they lack a SQL-like API. In this case study we'll see if we can answer a couple of questions about a cohort (n=40) of patients diagnosed with Familial Hypercholesterolemia (FH):
1. Do the notes for the patients contain a detailed explanation for the FH diagnosis?
2. How expensive was the FH diagnostic odyssey? (e.g. # of tests ordered, specialties visited)

In [2]:
cohort = pd.read_parquet('/share/pi/nigam/data/med-s1/fh')
fh_cohort = cohort[cohort['has_fh'] == 1].drop_duplicates(['person_id', 'note_datetime'])
num_patients = fh_cohort['person_id'].nunique()
num_notes = len(fh_cohort)
print(f"# of patients sampled: {num_patients}\n# of notes: {num_notes}")

# of patients sampled: 40
# of notes: 155


In [3]:
fh_sample = fh_cohort.merge(fh_cohort['person_id'].drop_duplicates().head(5), on='person_id', how='inner')
fh_diagnostic_odysseys = (
  fh_sample
  .sort_values(['person_id', 'note_datetime'])
  .sem_agg(
    'Summarize the steps documented in {note_text} that led to the Familial Hypercholesterolemia (FH) diagnosis; including suspicion, testing, specialty visits.',
    group_by=['person_id'],
    strategy="deepseek",
    return_explanations=True,
  )
)

Aggregating: 100%|██████████ 1/1 LM calls [00:22<00:00, 22.25s/it]
Aggregating: 100%|██████████ 1/1 LM calls [00:23<00:00, 23.16s/it]
Aggregating: 100%|██████████ 1/1 LM calls [00:23<00:00, 23.30s/it]
Aggregating: 100%|██████████ 1/1 LM calls [00:23<00:00, 23.41s/it]
Aggregating: 100%|██████████ 1/1 LM calls [00:28<00:00, 28.63s/it]


The resulting summaries of patient diagnostic odysseys can be manually reviewed to better understand the cost of delayed diagnosis (and, by implication, the utility of an AI tool for diagnosing this disease). They can also be used as training data for reasoning LLMs.
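
The grouped aggregation in this case study can be pictured as: sort the notes by patient and time, collect each patient's notes in chronological order, and summarize each collection once. A minimal sketch of that grouping step follows. `stub_summarize` is a hypothetical placeholder for the LLM call, the note contents are invented, and this is not how Lotus implements `sem_agg` internally.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_by_patient(notes, summarize):
    """Sort by (person_id, note_datetime), then summarize each patient's notes together."""
    ordered = sorted(notes, key=itemgetter("person_id", "note_datetime"))
    summaries = {}
    for person_id, group in groupby(ordered, key=itemgetter("person_id")):
        texts = [note["note_text"] for note in group]
        summaries[person_id] = summarize(texts)
    return summaries


def stub_summarize(texts):
    """Hypothetical stand-in for an LLM summary: join the notes in order."""
    return " -> ".join(texts)


# Invented notes; ISO-formatted dates sort correctly as strings.
notes = [
    {"person_id": 2, "note_datetime": "2020-03-01", "note_text": "genetic test ordered"},
    {"person_id": 1, "note_datetime": "2019-05-10", "note_text": "lipid clinic referral"},
    {"person_id": 1, "note_datetime": "2019-01-02", "note_text": "elevated LDL noted"},
]

summaries = aggregate_by_patient(notes, stub_summarize)
for person_id, summary in summaries.items():
    print(person_id, summary)
```

Sorting before grouping matters here because `itertools.groupby` only merges adjacent items with the same key.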

In [None]:
print(fh_diagnostic_odysseys.iloc[1]['_output'] + "\n")

For some patients, an explanation of the FH diagnosis doesn't exist in the notes. These patients could be manually reviewed or excluded from the dataset to obtain a higher-precision cohort.

In [None]:
print(fh_diagnostic_odysseys[fh_diagnostic_odysseys['explanation_output'].isna()].iloc[0]['_output'])