In [48]:
import base64
import json
import math
import os
import random
from glob import glob
from pathlib import Path

import einops
import matplotlib.pyplot as plt
import numpy as np
import open_clip
import openai
import torch
from huggingface_hub import InferenceClient
from openai import OpenAI
from PIL import Image
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.text import Text
from sklearn.cluster import KMeans

console = Console()


def cosine(a, b):
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`.

    Later cells call `cosine(...)`; it is defined here so that they run as written.
    """
    a = torch.nn.functional.normalize(a.float(), dim=-1)
    b = torch.nn.functional.normalize(b.float(), dim=-1)
    return a @ b.T

# PathGen Tutorial: AI for Pathology

This tutorial demonstrates the use of **PathGen-CLIP**, a specialized CLIP model for histopathology images, and explores the "agentic" pipeline for creating high-quality image-caption datasets in the pathology domain.

## Background: CLIP Models and Vision-Language Alignment

**CLIP (Contrastive Language-Image Pre-training)** models and their derivatives are Vision-Language Models (VLMs) that align vision and text modalities. Introduced by the [seminal work of Radford et al.](https://arxiv.org/abs/2103.00020), these models are trained using:

- **Dataset**: Paired images and captions
- **Objective**: Contrastive learning that brings closer the representations of positive pairs (image and its caption) while pushing apart negative pairs (image and captions of other images in the batch)

This approach mirrors unimodal self-supervised contrastive learning but operates across modalities.
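
To make the objective concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of paired embeddings. The function name `clip_contrastive_loss`, the toy data and the fixed `temperature=0.07` are assumptions made for illustration; an actual CLIP training run works on the encoders' outputs and learns the temperature.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image/text similarity matrix.

    Row i of `image_emb` and row i of `text_emb` form a positive pair;
    every other combination in the batch acts as a negative.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned_images = text + 0.01 * rng.normal(size=(4, 8))  # each image close to its caption
shuffled_images = aligned_images[[1, 2, 3, 0]]           # pairs deliberately mismatched

loss_aligned = clip_contrastive_loss(aligned_images, text)
loss_shuffled = clip_contrastive_loss(shuffled_images, text)
print(loss_aligned, loss_shuffled)
```

The loss is low when each image sits next to its own caption and high when the pairs are shuffled, which is exactly the signal that pulls positive pairs together and pushes negatives apart.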

## Applications in Visual Large Language Models

CLIP models serve as the foundation for **VLLMs** (Visual Large Language Models), systems that can process images and respond with text (like GPT-4o). Many VLLMs are derivatives of the **LLaVA framework**, which combines three components:

- **Vision Encoder**: Processes images into embeddings
- **Linear Connection**: Maps vision embeddings to the language model's token space
- **Language Model**: A decoder-only transformer that performs next-token prediction

For this connection to be effective, the vision encoder must already be aligned with the language space. This is where CLIP's vision-text alignment becomes important.
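
The data flow through these three components can be sketched with plain arrays. Everything below is a stand-in: the sizes are hypothetical, the "encoder outputs" are random numbers, and the projection weights would normally be learned. The sketch only shows how projected image embeddings end up in the same sequence as text token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_patches, vision_dim = 16, 32   # vision encoder output: one embedding per image patch
num_text_tokens, llm_dim = 5, 64   # token-embedding size of the language model

patch_embeddings = rng.normal(size=(num_patches, vision_dim))        # stand-in for the vision encoder
text_token_embeddings = rng.normal(size=(num_text_tokens, llm_dim))  # stand-in for embedded prompt tokens

# The "linear connection": a projection from vision space to the LLM token space.
W = rng.normal(size=(vision_dim, llm_dim)) * 0.02
b = np.zeros(llm_dim)
visual_tokens = patch_embeddings @ W + b  # (num_patches, llm_dim)

# The decoder then reads the projected image tokens followed by the text tokens.
llm_input = np.concatenate([visual_tokens, text_token_embeddings], axis=0)
print(llm_input.shape)  # (21, 64)
```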

In [2]:
import torch
from PIL import Image
import open_clip

# Load a CLIP model - This might take a while as it has to be downloaded.
clip_model, _, preprocess_clip = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
clip_model.eval()

tokenizer = open_clip.get_tokenizer('ViT-L-14')

## Zero-shot classification

You can do lots of fun stuff with CLIP models: once you have a common embedding space for images and text, you can measure distances between them.
Hopefully, this distance correlates with semantic content: the closer two embeddings are in the latent space, the more similar the content they refer to.
You can, for instance, perform zero-shot classification, i.e. classification without fine-tuning any model.

Suppose you have two classes ['A', 'B'] and an image I that you want to classify, and let CLIP(.) denote the operator that computes a CLIP embedding.

Here's how zero-shot classification works:

1. First, you compute the CLIP embedding of your image I: CLIP_img = CLIP(I)

2. Then, you compute the CLIP embeddings of text prompts for each class:
   - CLIP_A = CLIP("This is an image of class A")
   - CLIP_B = CLIP("This is an image of class B")

3. Finally, you compute the similarity (usually cosine similarity) between your image embedding and each class embedding:
   - sim_A = cosine_similarity(CLIP_img, CLIP_A)
   - sim_B = cosine_similarity(CLIP_img, CLIP_B)

4. The class with the highest similarity score is your prediction!

What makes this "zero-shot" is that you never had to train the model on your specific classification task. The model learned general visual-language relationships during its pre-training, and you're leveraging that knowledge to perform a new task without any additional training data.
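
The four steps above can be sketched on toy vectors, without loading any model. The 3-dimensional embeddings below are hypothetical stand-ins for CLIP outputs, and the helper names are illustrative; with the real model, the arrays would come from `encode_image` and `encode_text`.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def zero_shot_classify(image_embeddings, class_text_embeddings, class_names):
    """Assign each image to the class whose prompt embedding is most similar."""
    similarities = cosine_similarity_matrix(image_embeddings, class_text_embeddings)
    predictions = [class_names[i] for i in similarities.argmax(axis=1)]
    return predictions, similarities


# Toy stand-ins for CLIP outputs (hypothetical values).
class_names = ["A", "B"]
class_text_embeddings = np.array([
    [1.0, 0.0, 0.0],  # "This is an image of class A"
    [0.0, 1.0, 0.0],  # "This is an image of class B"
])
image_embeddings = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
])

predictions, similarities = zero_shot_classify(image_embeddings, class_text_embeddings, class_names)
print(predictions)  # ['A', 'B']
```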

In [None]:
brittany = Image.open("assets/relief_map_of_france_bretagne.jpg")
italy = Image.open("assets/relief_map_of_italy.jpg")

plt.subplot(1, 2, 1)
plt.imshow(brittany)
plt.title("Brittany")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(italy)
plt.title("Italy")
plt.axis("off")
plt.show()

# And here is how to encode images and texts:
#Image:
brit_encoded = preprocess_clip(brittany)
it_encoded = preprocess_clip(italy)

texts = ["A sentence to encode"]
texts_tokens = tokenizer(texts)
console.print("Texts tokens: ", texts_tokens.shape)
console.print("preprocessed image: ", brit_encoded.shape)

> **_Question 2.1_** Observe the shapes of the image and of the text "tokens", i.e. the inputs after preprocessing, before they are fed to the CLIP model. What do you observe?

In [None]:
print(preprocess_clip)

In [None]:
import inspect
dir(clip_model)

In [None]:
print(clip_model.positional_embedding.shape)
print(clip_model.token_embedding)

In [None]:
it_encoded = clip_model.encode_image(it_encoded.unsqueeze(0))
brit_encoded = clip_model.encode_image(brit_encoded.unsqueeze(0))
texts_encoded = clip_model.encode_text(texts_tokens)

In [None]:
console.print(f"Image features shape: {brit_encoded.shape}")
console.print(f"Text features shape: {texts_encoded.shape}")

As you can see, embeddings of images and text have the same dimensions: image and text embeddings have been trained to live in the same space!

> **_Question 2.2_** Implement a classification algorithm using these map images.

In [None]:
# Answer

How does that work with pathology, now?
The images are very different, and much of the structure found in natural images is absent here:
- **No polarity** (up/down, right/left)
- **No depth** (no foreground/background)
- Images are **not object-centric**
- ...

Let's have a look at [OpenPath](https://paperswithcode.com/paper/leveraging-medical-twitter-to-build-a-visual), a dataset of histopathology image/caption pairs gathered on Twitter (where a lot of pedagogical content was shared).

In [None]:
with open('assets/captions_openpath.json', 'r') as f:
    captions = json.load(f)

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for idx, (ax, item) in enumerate(zip(axes, captions)):
    img = Image.open(f'assets/{item["filename"]}')
    ax.imshow(img)
    ax.set_title(item['caption'], wrap=True)
    ax.axis('off')

plt.tight_layout()
plt.show()

# Display captions in rich console
console = Console()
for item in captions:
    text = Text(item['caption'])
    panel = Panel(text, title=f"Image: {item['filename']}", border_style="green")
    console.print(panel)


Let's have a look at the alignment between one of these images and text sentences, starting with its own caption.

In [10]:
# Encode an image and its caption
im_id = 0
tokenizer = open_clip.get_tokenizer('ViT-L-14')
img = Image.open(Path('assets') / captions[im_id]['filename'])
text = captions[im_id]['caption']
img_clip = preprocess_clip(img).unsqueeze(0)
text_clip = tokenizer([text])

img_features = clip_model.encode_image(img_clip)
text_features = clip_model.encode_text(text_clip)

cosine_similarity = cosine(img_features, text_features)

In [None]:
# Compute the similarity between the image and text
cosine_similarity = cosine(img_features, text_features)
console.print(f"Cosine similarity between image and text: {cosine_similarity.item()}")

unrelated_text = tokenizer(["I drink the cat"])
unrelated_text_features = clip_model.encode_text(unrelated_text)
unrelated_cosine_similarity = cosine(img_features, unrelated_text_features)
console.print(f"Cosine similarity between image and unrelated text: {unrelated_cosine_similarity.item()}")

somewhat_related_text = tokenizer(["Histopathology image"])
somewhat_related_text_features = clip_model.encode_text(somewhat_related_text)
somewhat_related_cosine_similarity = img_features @ somewhat_related_text_features.T / (torch.norm(img_features, dim=1) * torch.norm(somewhat_related_text_features, dim=1))
console.print(f"Cosine similarity between image and somewhat related text: {somewhat_related_cosine_similarity.item()}")

> **_Question 2.3_** Comment on these results

Let's look at another example: some tiles extracted from [the TCGA](https://portal.gdc.cancer.gov/) (one of the biggest public repositories of slides).
Some contain tumor cells, others don't.

In [None]:
tumor_tiles = glob('assets/tumor_tiles/*.png')
stroma_tiles = glob('assets/stroma/*.png')

def create_image_grid(image_paths, grid_size=(2, 2)):
    """Creates a grid of images using einops."""
    h_grid, w_grid = grid_size
    images_np = [np.array(Image.open(p)) for p in image_paths[:h_grid * w_grid]]
    return Image.fromarray(
        einops.rearrange(
            np.stack(images_np),
            '(h_grid w_grid) h w c -> (h_grid h) (w_grid w) c',
            h_grid=h_grid
        )
    )

# Create the figure
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(create_image_grid(tumor_tiles))
axes[0].set_title('Tumor Tiles')
axes[0].axis('off')

axes[1].imshow(create_image_grid(stroma_tiles))
axes[1].set_title('Stroma Tiles')
axes[1].axis('off')

plt.tight_layout()
plt.show()


> **_Question 2.4_** Implement the classification algorithm to distinguish tiles with healthy tissue from tiles with tumor cells.

In [17]:
# Answer

> **_Question 2.5_**: Experiment a little with changing the framing of the sentences for these classification tasks. What can you conclude regarding evaluation on the zero-shot task?

General-domain CLIP models fail to discriminate pathology images well. The field would therefore benefit from a CLIP model trained on pathology images! However, whereas image/caption pairs fill the internet in the general domain, such data is much harder to find in this specialized domain.

Other research efforts have gathered more of these pairs from YouTube ([QUILT](https://arxiv.org/abs/2306.11207)), medical publications, books ([CONCH](https://www.nature.com/articles/s41591-024-02856-4)) and Twitter ([PLIP](https://www.nature.com/articles/s41591-023-02504-3)), among others.
This led to impressive improvements in the zero-shot capabilities of these models... But can you spot the issue with these datasets?

> **_Question 2.6_** Have a look at the images from OpenPath and list the issues with them.

> What would you propose to do to improve the dataset?
>
> As a reminder, we have at our disposal:
> - A base image/caption dataset, but with low-quality images and non-exhaustive captions
> - An enormous amount of unlabeled tile data
> - General LLM/VLLM models that are very good at text-based tasks but make errors in the pathology space
> - Reasonably well-performing CLIP models
> - WSIs and the accompanying pathology reports describing the pathologists' findings


## One solution: PathGen

[PathGen](https://arxiv.org/abs/2407.00203) is a paper that appeared at ICLR 2025. 

The authors propose to use a hive of VLM/LLM models to build a high-quality dataset of tile/caption pairs.
The hope is that all these models can complement each other and make use of the available *seed* datasets.

The basic idea is to mine a lot of high-quality tiles, at similar magnification, directly from the WSIs, and caption them using VLLMs.
Let's see how they do it!

Tile mining calls for a small digression: in addition to zero-shot classification, CLIP embeddings also let you do dense cross-modal retrieval!

**Dense Retrieval**: Search a bank of images for the one best described by a given text prompt. The same works within a single modality: search for the image closest to a given probe image.
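
As a minimal sketch of the idea, retrieval boils down to ranking the bank by cosine similarity to the query embedding. The 2-dimensional vectors and the helper name `retrieve_top_k` are hypothetical; the query could equally be a text embedding or the embedding of a probe image.

```python
import numpy as np


def retrieve_top_k(query_embedding, bank_embeddings, k=3):
    """Return indices and scores of the `k` bank entries most similar to the query."""
    query = query_embedding / np.linalg.norm(query_embedding)
    bank = bank_embeddings / np.linalg.norm(bank_embeddings, axis=1, keepdims=True)
    scores = bank @ query            # cosine similarity to every bank entry
    top_k = np.argsort(-scores)[:k]  # highest similarity first
    return top_k, scores[top_k]


# Toy bank of 5 "tile" embeddings and one query embedding (hypothetical values).
bank_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.7, 0.7],
])
query_embedding = np.array([1.0, 0.0])

indices, scores = retrieve_top_k(query_embedding, bank_embeddings, k=2)
print(indices, scores)
```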

> **_Question 2.7_** How would you implement that? Write it down using the previous tiles and the prompt "H&E that contain tumorous tissue."
> Try implementing it on the embeddings I made you download in the previous notebook.

In [None]:
tiles_embeddings = torch.vstack([tumor_embeddings, stroma_embeddings])
tiles_path = [Path(p) for p in tumor_tiles + stroma_tiles]
text_probe = tokenizer(["An H&E image with tumor cells"])

# Answer - dense retrieval algorithm

The idea of PathGen is to build captioning models (VLLMs) using the available multimodal datasets,
in order to label a high-quality dataset of tiles extracted directly from WSIs.

### Step 1: Caption Enhancement with Vision LLMs

The first step is to improve the existing image/caption datasets (like PLIP, OpenPath, etc.).
One of their drawbacks is that they are not exhaustive: the captions, coming from an educational source, assume a lot of prior knowledge from the reader
(for instance, they rarely state that the image is a pathology image).
The idea is therefore to complete these captions using a general-domain VLLM like GPT-4o, which can describe low-level features reasonably well.

> **_Note_**:
> Because running an LLM on a personal laptop is often infeasible (and in Google Colab as well), I had to find an alternative way of doing so.
> If you want to do so yourself, you could either:
> - Use the endpoints of LLM providers (MistralAI, OpenAI, Cohere) with a *personal* API key
> - Use the serverless inference services offered by companies like [HuggingFace](https://huggingface.co/learn/cookbook/enterprise_hub_serverless_inference_api) or [TogetherAI](https://www.together.ai/). The service is similar to the one offered by LLM providers, but extends to open-source models. You do not have to manage anything about your calls: the service scales with your needs (constrained by your token-per-minute limit). For personal use cases, I would recommend this option.
> - Use cloud providers that let you deploy a model on a remote machine: Azure, VertexAI (Google), Amazon, and even HuggingFace ([Inference Endpoints](https://endpoints.huggingface.co/)). You do not pay per call but per hour of activity.
>
> For this course, I tried the last solution, as it seemed to be the only one available to *serve* an LLM to a group of people. Without investing a lot of time into it, this proved unstable: the GPU I rented kept getting overloaded and the endpoint crashed after only a few requests.
>
> I therefore put in place an intermediate solution: [this little HuggingFace Space](https://huggingface.co/spaces/trizard/ai4health/tree/main), serving a small Python server. Under the hood, this Python server just calls one of the LLM providers (MistralAI here, cocorico) using my API key, while keeping the key secret (by the way, never ever share such an API key).
>
> I would recommend doing so for any idea/hack using an LLM that you would like to serve to a small group of people, for testing, educational (or fun) purposes.
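
To illustrate what "keeping the key secret" means in such a proxy, here is a rough sketch of the one step that matters: the server, not the client, attaches the API key before forwarding the request. This is not the code of the Space. The upstream URL, the `MISTRAL_API_KEY` variable name, the allow-list and the token cap are all assumptions made for the example, and no request is actually sent.

```python
import json
import os
import urllib.request

# Assumption: the provider exposes an OpenAI-style chat endpoint at this URL.
UPSTREAM_URL = "https://api.mistral.ai/v1/chat/completions"
ALLOWED_MODELS = {"mistral-small-latest"}  # hypothetical allow-list to keep costs bounded
MAX_TOKENS_CAP = 300                       # hypothetical server-side limit


def build_upstream_request(client_payload, api_key):
    """Turn a client payload into an authenticated request for the provider.

    The key is added here, on the server side, so clients never see it.
    """
    if client_payload.get("model") not in ALLOWED_MODELS:
        raise ValueError("model not allowed on this proxy")
    body = json.dumps({
        "model": client_payload["model"],
        "messages": client_payload["messages"],
        "max_tokens": min(int(client_payload.get("max_tokens", 150)), MAX_TOKENS_CAP),
    }).encode("utf-8")
    return urllib.request.Request(
        UPSTREAM_URL,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


# The key would come from a secret or environment variable set on the server, never from the notebook.
api_key = os.environ.get("MISTRAL_API_KEY", "placeholder-key")
request = build_upstream_request(
    {"model": "mistral-small-latest", "messages": [{"role": "user", "content": "Hello"}]},
    api_key,
)
print(request.full_url, request.get_method())
```

A web framework would wrap this function in a route; validating the model name and capping `max_tokens` on the server is what prevents a participant from running up the bill.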

In [None]:
import requests
import base64
from pathlib import Path
from PIL import Image
import io

# A simple client that uses the `requests` package to send requests to the HF Space server
class SimpleChatClient:
    def __init__(self, endpoint, model="mistral-small-latest"):  # Please don't change this to **large**, I am paying for the calls :'(
        self.endpoint = endpoint
        self.model = model

    def encode_image_to_base64(self, image_path, max_size=256, quality=30):
        """
        The image quality is aggressively reduced here, in the hope that everyone can use this endpoint without crashing it.
        In other circumstances, you could simply base64-encode the original image.

        Base64 encoding is a good way to send binary data through HTTP requests: raw bytes
        may contain special characters that would break the request, whereas base64 encodes chunks of
        6 bits at a time, with 64 possible values, all contained in the standard, safe ASCII set.

        Of course, the image is then decoded on the other side, i.e. on MistralAI's servers.
        """
        with Image.open(image_path) as img:
            if img.mode == 'RGBA':
                img = img.convert('RGB')
            ratio = max_size / max(img.size)
            if ratio < 1:
                new_size = tuple(int(dim * ratio) for dim in img.size)
                img = img.resize(new_size, Image.Resampling.LANCZOS)
            buffer = io.BytesIO()
            img.save(buffer, format='JPEG', quality=quality, optimize=True)
            encoded_string = base64.b64encode(buffer.getvalue()).decode('utf-8')
        return f"data:image/jpeg;base64,{encoded_string}"

    def chat(self, text, image_path=None, max_tokens=150):
        # Sends an http request to the proxy server.
        messages = []
        content = []
        if text:
            content.append({"type": "text", "text": text})
        if image_path:
            base64_image = self.encode_image_to_base64(image_path)
            content.append({"type": "image_url", "image_url": base64_image})
        messages.append({"role": "user", "content": content})

        payload = {
            "model": self.model,
            "messages": messages,
            "max_tokens": max_tokens
        }
        headers = {"Content-Type": "application/json"}

        response = requests.post(
            self.endpoint,
            json=payload,
            headers=headers,
            timeout=60
        )
        response.raise_for_status()
        return response.json()
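
The base64 step mentioned in the docstring above can be checked on a few bytes. This small example uses only the standard library; the three input bytes are arbitrary and were chosen to include values that are awkward in text (0xFF and a newline byte).

```python
import base64

raw = bytes([0xFF, 0x00, 0x0A])  # 3 arbitrary bytes, including a newline byte (0x0A)
encoded = base64.b64encode(raw)  # 3 bytes = 24 bits = 4 groups of 6 bits
decoded = base64.b64decode(encoded)

print(encoded)         # b'/wAK'
print(decoded == raw)  # True

# The same string can be embedded in a data URL, as the client does for JPEG images.
data_url = f"data:application/octet-stream;base64,{encoded.decode('ascii')}"
print(data_url)
```

Every 3 input bytes become 4 ASCII characters, so the payload grows by about a third. That overhead is one reason the client shrinks the image before encoding it.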

>   **_Question 2.8_** Here comes a bit of prompt engineering: design a prompt to improve the caption of an image.
>   If you want help creating a good prompt, there are online tools that do just that. 
>   For instance, Anthropic proposes this service (but you would need to have an API key). 

In [None]:
# Answer
def get_description_prompt(caption):
    pass
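
One possible shape for such a prompt is sketched below. The wording is only an illustration, not the prompt used by the PathGen authors, and the example caption is made up. The function has a different name so that it does not replace your own answer.

```python
def build_caption_enhancement_prompt(caption):
    """Build a prompt asking a VLLM to complete an existing caption (illustrative wording)."""
    return (
        "You are assisting a pathologist. The attached image is a histopathology image.\n"
        f'Its original caption is: "{caption.strip()}"\n'
        "Write a more complete description of the image. Keep every fact stated in the "
        "original caption, and add the low-level visual details it leaves implicit "
        "(staining, tissue architecture, cell morphology).\n"
        "Do not introduce a diagnosis that the caption does not support. "
        "Answer only with the new description."
    )


# Made-up caption, for demonstration only.
example_prompt = build_caption_enhancement_prompt("  Invasive ductal carcinoma, grade 2. ")
print(example_prompt)
```

The resulting string can be passed as `text` to `SimpleChatClient.chat`, together with the `image_path` of the corresponding image.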

### Step 2: Fine-tune a **Description VLLM**

With this caption augmentation method, the authors have built a dataset for fine-tuning a VLLM.
The next step was therefore to fine-tune a LLaVA model on it, creating a VLLM able to caption pathology tiles: PathGen-LLaVA-desp.

In the rest of the tutorial, we will pretend that the model we are using has these extended capabilities in pathology.
In practice, we will still use a general-domain VLLM (the one served by the endpoint above, `mistral-small-latest` by default), because PathGen-LLaVA-desp would be too big to run locally on your machines.


### Step 3: Caption Revision Model

PathGen-LLaVA-desp can describe pathology images better than general-domain models. It is not perfect, however.
The model could be improved by improving the dataset, but that is hard to do without additional expertise.

Instead, the authors bet on training a model able to **revise** and **correct** the captions produced by PathGen-LLaVA-desp.

>  **_Question 2.9_** How would you do that?


### Step 4: Create a **revise** dataset and Fine-Tune a **Revise VLLM**

To do that, we need a dataset composed of images paired with two captions: one correct and one incorrect.

The idea here is to degrade the LLaVA descriptions with known perturbations: additions, deletions, modifications. This is a form of self-supervised learning driven by the LLM (just as conventional contrastive learning in image processing perturbs a given image).

We then "just" have to fine-tune a VLLM on this dataset to get a "Revise VLLM".
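
The structure of such a dataset can be sketched with simple string perturbations. Note the substitution: the tutorial describes degradations produced by an LLM, whereas the sketch below uses hand-written rules so that it runs without any endpoint. The distractor sentence, the word swaps and the helper names are all hypothetical.

```python
import random

# Hypothetical distractor sentence and word swaps, for illustration only.
DISTRACTORS = ["Numerous multinucleated giant cells are present."]
SWAPS = {"high": "low", "poor": "well-formed", "dense": "sparse"}


def degrade_description(description, deg_type, rng):
    """Return a corrupted copy of `description` using one known perturbation."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    if deg_type == "addition":
        # Insert a sentence describing something that is not in the image.
        position = rng.randrange(len(sentences) + 1)
        sentences.insert(position, rng.choice(DISTRACTORS).rstrip("."))
    elif deg_type == "deletion" and len(sentences) > 1:
        # Drop one sentence (single-sentence descriptions are left unchanged).
        sentences.pop(rng.randrange(len(sentences)))
    elif deg_type == "modification":
        # Replace the first matching word with a contradictory one.
        for original, replacement in SWAPS.items():
            for i, sentence in enumerate(sentences):
                words = sentence.split()
                if original in words:
                    words[words.index(original)] = replacement
                    sentences[i] = " ".join(words)
                    return ". ".join(sentences) + "."
    return ". ".join(sentences) + "."


def make_revise_example(description, rng):
    """Build one training record: degraded input, perturbation type, and clean target."""
    deg_type = rng.choice(["addition", "deletion", "modification"])
    return {
        "degraded": degrade_description(description, deg_type, rng),
        "deg_type": deg_type,
        "target": description,
    }


rng = random.Random(0)
clean = "The tumor shows poor tubule formation. Mitotic activity is high. The stroma is dense."
example = make_revise_example(clean, rng)
print(example["deg_type"], "->", example["degraded"])
```

Each record pairs a degraded caption with its clean target and the type of perturbation applied, which is the supervision a revise model needs. In the real pipeline, each record would also carry the image.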


In [None]:
# Let's suppose that the model behind this endpoint has been trained to describe histopathological images.
image_path = Path('assets') / captions[0]['filename']
client = SimpleChatClient("https://trizard-ai4health.hf.space/chat")
completion = client.chat(
    "You are an expert in anatomopathology in a hospital. You are given an H&E image of a tissue sample: describe it in detail, focusing on the histological features and any potential pathological findings. You will answer only with your description of the image.",
    image_path=image_path
)
image_description = completion

print(image_description)

> **_Question 2.10_** Implement the pipeline to create the revise dataset.

In [81]:
# Answer


In [None]:
console.print(Panel(
    image_description,
    title="Original Description",
    border_style="green"
))

console.print(Panel(
    degraded_description,
    title=f"Description degraded with [red]{deg_type}[/red]",
    border_style="green"
))



### Step 5: Leveraging the TCGA Dataset

The TCGA dataset is a public resource of Whole Slide Images (WSIs).
It contains not only WSIs but also (and mostly) other modalities, such as genomic, transcriptomic, and even text data.
The authors gathered 7300 WSIs paired with their pathology reports. Below is an example for the slide used in the previous notebook.

We now have a **Description VLLM** and a **Revise VLLM**, and we also assume that we have a **Summarize LLM**, trained to summarize descriptions that are too long.

#### Dataset Construction Pipeline

We will now design a pipeline to create a high quality dataset of image/caption pairs.




> **_Question 2.11_**
> Supposing you have all these new tools, propose a pipeline to create a high-quality dataset of image/caption pairs.

In [86]:
pathology_report = """
Sections show an infiltrating mammary carcinoma characterized by poor tubule formation, intermediate nuclear grade and high mitotic activity. 
The tumor cells infiltrate as sheets, single file, alveolar nests and occasional larger nests.
The tumor has a tendency to infiltrate around existing ductal structures and focally formd targetoid lesions around this. 
Focal early necrosis is noted. 
There is a desmoplastic stromal response."""

from openai import OpenAI

client = SimpleChatClient("https://trizard-ai4health.hf.space/chat")

#### Tile Sampling Strategy

The idea is simply to dig into the WSIs we have, extract the most relevant and diverse tiles,
and then pass them through our different models for description, revision and summarization.

#### Prompt-based extraction: first, using the pathology report!
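
The chaining of the three models can be sketched as below. The `describe`, `revise` and `summarize` callables are placeholders, implemented here as trivial stubs so that the example runs offline; in practice each would wrap a call to a VLLM or LLM. The names and the record layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionRecord:
    tile_id: str
    description: str  # raw output of the description model
    revised: str      # after the revise model
    caption: str      # final, summarized caption


def caption_tiles(
    tile_ids: List[str],
    describe: Callable[[str], str],
    revise: Callable[[str, str], str],
    summarize: Callable[[str], str],
) -> List[CaptionRecord]:
    """Chain the three models over a list of selected tiles."""
    records = []
    for tile_id in tile_ids:
        description = describe(tile_id)
        revised = revise(tile_id, description)  # the reviser sees the tile and the draft
        records.append(CaptionRecord(tile_id, description, revised, summarize(revised)))
    return records


# Stub models so the sketch runs without any endpoint.
def describe_stub(tile_id):
    return f"Draft description of {tile_id} with an error."


def revise_stub(tile_id, text):
    return text.replace(" with an error", "")


def summarize_stub(text):
    return text[:40]


records = caption_tiles(["tile_001", "tile_002"], describe_stub, revise_stub, summarize_stub)
for record in records:
    print(record.tile_id, "->", record.caption)
```

Keeping the intermediate outputs in the record makes it easy to inspect where a bad caption came from.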

In [90]:
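tiles = np.load('assets/embeddings_tiles.npy')

# One retrieval prompt per sentence of the report (drop empty fragments and stray whitespace)
prompts_from_report = [s.strip() for s in pathology_report.split('.') if s.strip()]
tokenized_prompts = tokenizer(prompts_from_report)
with torch.no_grad():
    prompts_report_embeddings = clip_model.encode_text(tokenized_prompts)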

#### Prompt-based extraction: using general model knowledge
You can also try to distill a bit of general knowledge from an LLM into this tile selection phase. 
> **_Question 2.12_** Create prompts to use for dense retrieval using a CLIP model

In [98]:
# Using LLM-generated prompts:
pattern_prompt = (
    "You are an expert in histopathology. You are tasked to describe one pattern that could be "
    "present in a breast cancer slide. Answer exclusively by giving an example pattern."
)
prompts_from_gpt = [client.chat(pattern_prompt, max_tokens=10) for _ in range(3)]
tokenized_prompts = tokenizer(prompts_from_gpt)
prompts_gpt_embeddings = clip_model.encode_text(tokenized_prompts)

prompts = torch.vstack([prompts_report_embeddings, prompts_gpt_embeddings])

In [None]:
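# Similarity of every tile to every prompt, then keep each tile's best-matching prompt
tiles_t = torch.as_tensor(tiles, dtype=torch.float32)
cosine_similarity = cosine(tiles_t, prompts.detach().float())
cosine_similarity = torch.max(cosine_similarity, dim=1).values
# Keep the 128 tiles with the highest best-prompt similarity
first_128_indices = torch.argsort(cosine_similarity, descending=True)[:128]
prompt_extracted_tiles = tiles[first_128_indices.numpy()]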

#### Diversity Considerations

> **_Question 2.13_** What could be one caveat of sampling tiles using only prompt-based retrieval? 
> Implement an alternative approach for tile-sampling.

The goal is for this dataset to contain tiles that are as diverse as possible (and, if possible, that contain as many interesting features as possible).
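
One way to favor diversity is to pick tiles that are far from each other in embedding space. The sketch below uses greedy farthest-point sampling on toy embeddings. It is only one option among several: clustering the embeddings (the notebook already imports `KMeans`) and sampling from each cluster is another. The function name and the toy data are hypothetical.

```python
import numpy as np


def farthest_point_sampling(embeddings, n_samples, seed=0):
    """Greedily pick rows that are as far as possible (cosine distance) from those already picked."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(normed)))]    # random starting tile
    min_dist = 1.0 - normed @ normed[selected[0]]  # distance of every tile to the selection
    while len(selected) < min(n_samples, len(normed)):
        next_index = int(np.argmax(min_dist))      # tile farthest from everything selected
        selected.append(next_index)
        min_dist = np.minimum(min_dist, 1.0 - normed @ normed[next_index])
    return selected


# Toy embeddings: three tight groups of 20 near-duplicate "tiles" (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
toy_embeddings = np.concatenate([c + 0.01 * rng.normal(size=(20, 3)) for c in centers])

picked = farthest_point_sampling(toy_embeddings, n_samples=3)
picked_groups = sorted(i // 20 for i in picked)
print(picked_groups)  # [0, 1, 2]
```

With three samples, the procedure picks one tile from each group instead of three near-duplicates, which is the behavior prompt-based retrieval alone does not guarantee.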

In [None]:
# Answer

## Finally: Caption Generation

The second stage of the process is to use the description/revise/summarize models trained earlier to generate new captions for this whole dataset.
Doing so, the authors gather more than 1.6M high-quality tile/caption pairs.

Finally, they trained a CLIP model on this new dataset: **PathGen-CLIP-L**! The authors report unprecedented accuracy in many settings.
Let's test its capabilities on the tasks we tried to tackle before.

In [120]:
clip_model, _, preprocess_clip = open_clip.create_model_and_transforms('ViT-L-14', pretrained='assets/pathgen-clip-l.pt')
clip_model.eval()

tokenizer = open_clip.get_tokenizer('ViT-L-14')

> **_Question 2.14:_** Try again all the tasks we did before with the general CLIP! You can also try dense retrieval on the WSI from the first tutorial.

You will see that the results are already much better!

**Overall, this paper illustrates the perspectives offered by these weak forms of model distillation. General LLMs/VLMs often make it possible to greatly extend seed datasets with very little guidance and supervision. Here, for instance, the human input consisted of focusing the LLM on undescribed details of the image and choosing the caption degradation types.**

> **_Question 2.15_** However, this strategy has limitations.
> - The authors call this pipeline "agentic". What do you think of this naming? Do you find it appropriate here?
> - Why would they restrict themselves to a mere 1.6M tile/caption pairs, when they could create tens of millions?