# Sarus Demo - Fine-tune an LLM with privacy guarantees [BETA]

In this notebook, we demonstrate the new **Sarus LLM fine-tuning feature** now available in BETA.

Read more in [our introduction blog post](https://) and see it in action in the [demo video](https://www.youtube.com/watch?v=BQ9gX_bhZMM)!

Want to test this new feature? [Join our BETA program](https://www.sarus.tech/join-sarus-llm-private-beta)!




### Use case example

Let's assume we want to **fine-tune an LLM on private patient data** so that the model gains medical expertise and can generate realistic fake medical diagnoses, which are useful for many applications (e.g. safely annotating data to feed an automatic annotation model).

The patient dataset is highly sensitive, so we must ensure that no personal information ends up embedded in the fine-tuned LLM.
To meet this privacy requirement, you can use the Sarus LLM fine-tuning SDK (beta).

In this example, we use the [PHEE dataset](https://paperswithcode.com/dataset/phee), a public patient dataset. But let's imagine this data is private and sensitive.

In this dataset, we've just introduced a secret for the purpose of our demonstration: `François Dupont suffers from a severe form of pancreatic cancer`.


## 0. Install Sarus, select and preview patient dataset

In [None]:
!pip install sarus-llm-beta # Join our beta program to get access!

In [None]:
import sarus
from sarus import Client
import pandas

pandas.set_option('display.max_colwidth', None)



In [None]:
client = Client(url="https://admin.sarus.tech/gateway", email="analyst@example.com")
client.list_datasets()

[<Sarus Dataset slugname=phee id=1>]

In [None]:
ds = client.dataset("phee")
df = ds.as_pandas()

In [None]:
df

Evaluated from synthetic data only


Unnamed: 0,id,context
0,8908396_3,"{""date"": ""1999-10-20"", ""text"": ""OBJECTIVE: To test the hypothesis that tumor necrosis factor (TNF)-alpha may mediate the loss and the dedifferentiation of subcutaneous fat tissue in the insulin-induced lipoatrophies of a diabetic patient who presented extensive lesions."", ""patient_id"": 0}"
1,10891991_1,"{""date"": ""2014-06-15"", ""text"": ""An evaluation of ovarian structure and function should be considered in women of reproductive age being treated with valproate for epilepsy, especially if they develop menstrual cycle disturbances during treatment."", ""patient_id"": 1}"
2,2332596_1,"{""date"": ""2011-12-08"", ""text"": ""Phenobarbital hepatotoxicity in an 8-month-old infant."", ""patient_id"": 2}"
3,12552054_1,"{""date"": ""2008-06-18"", ""text"": ""The authors report a case of Balint syndrome with irreversible posterior leukoencephalopathy on MRI following intrathecal methotrexate and cytarabine."", ""patient_id"": 3}"
4,19531695_12,"{""date"": ""2013-07-27"", ""text"": ""According to the Naranjo probability scale, flecainide was the probable cause of the patient's delirium; the Horn Drug Interaction Probability Scale indicates a possible pharmacokinetic drug interaction between flecainide and paroxetine."", ""patient_id"": 4}"
...,...,...
2893,2931445_3,"{""date"": ""2009-09-19"", ""text"": ""L-T4 stimulated lymphocyte transformation in this patient."", ""patient_id"": 2893}"
2894,12126225_1,"{""date"": ""2004-06-28"", ""text"": ""A 53-year-old man developed lower leg edema 4 weeks after rosiglitazone was increased from 4 mg once/day to 4 mg twice/day."", ""patient_id"": 2894}"
2895,3143551_2,"{""date"": ""2012-03-30"", ""text"": ""A mentally retarded 23-year-old woman with myoclonic astatic epilepsy developed an abnormal posture of extreme forward flexion, called camptocormia, during valproate monotherapy."", ""patient_id"": 2895}"
2896,12086549_1,"{""date"": ""1998-07-21"", ""text"": ""After 5 days of treatment with IL-2, the patient developed a hemorrhagic lesion that progressed to toxic epidermal necrolysis, as well as grade 4 pancytopenia."", ""patient_id"": 2896}"


The dataset consists of an `id` column and a string column, `context`, which holds a JSON object with the appointment date, the associated diagnosis text, and the patient id. Some preprocessing is needed to extract the diagnosis text.

## 1. Preprocess data

In [None]:
# As a convention, the model will train on the column named `text`
X = df.rename(columns={"context": "text"})

In [None]:
# We extract the text from the JSON column
from sarus.serialization import trace

def preprocess(data: str) -> str:
    return data.split('text":')[1].split(', "patient_id')[0].strip()

preprocess = trace(preprocess)(X.text[0])

X['text'] = X['text'].apply(preprocess, axis=1)
X

Evaluated from synthetic data only


Unnamed: 0,id,text
0,8908396_3,"""OBJECTIVE: To test the hypothesis that tumor necrosis factor (TNF)-alpha may mediate the loss and the dedifferentiation of subcutaneous fat tissue in the insulin-induced lipoatrophies of a diabetic patient who presented extensive lesions."""
1,10891991_1,"""An evaluation of ovarian structure and function should be considered in women of reproductive age being treated with valproate for epilepsy, especially if they develop menstrual cycle disturbances during treatment."""
2,2332596_1,"""Phenobarbital hepatotoxicity in an 8-month-old infant."""
3,12552054_1,"""The authors report a case of Balint syndrome with irreversible posterior leukoencephalopathy on MRI following intrathecal methotrexate and cytarabine."""
4,19531695_12,"""According to the Naranjo probability scale, flecainide was the probable cause of the patient's delirium; the Horn Drug Interaction Probability Scale indicates a possible pharmacokinetic drug interaction between flecainide and paroxetine."""
...,...,...
2893,2931445_3,"""L-T4 stimulated lymphocyte transformation in this patient."""
2894,12126225_1,"""A 53-year-old man developed lower leg edema 4 weeks after rosiglitazone was increased from 4 mg once/day to 4 mg twice/day."""
2895,3143551_2,"""A mentally retarded 23-year-old woman with myoclonic astatic epilepsy developed an abnormal posture of extreme forward flexion, called camptocormia, during valproate monotherapy."""
2896,12086549_1,"""After 5 days of treatment with IL-2, the patient developed a hemorrhagic lesion that progressed to toxic epidermal necrolysis, as well as grade 4 pancytopenia."""
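
The `preprocess` function above extracts the diagnosis by splitting the string, which is why the results keep their surrounding double quotes. As a minimal plain-Python sketch, and assuming each `context` value is valid JSON as in the preview, the same field can be read with the standard `json` module. The helper name `extract_text` is hypothetical, and the notebook does not say whether a function like this can be passed through Sarus's `trace`.

```python
import json


def extract_text(record: str) -> str:
    """Return the `text` field of one JSON-encoded record (hypothetical helper)."""
    return json.loads(record)["text"]


# One record copied from the dataset preview above.
sample = (
    '{"date": "2011-12-08", '
    '"text": "Phenobarbital hepatotoxicity in an 8-month-old infant.", '
    '"patient_id": 2}'
)

print(extract_text(sample))
```

Unlike the split-based version, this returns the text without the enclosing quotes and does not depend on the order of the keys in the JSON object.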


In [None]:
# Split the dataset into train and test
X_train = X.iloc[:2608]
X_test = X.iloc[2608:]

## 2. Fine-tune the LLM

In [None]:
from sarus.llm import GPT2

### First, fine-tune without privacy guarantee

In [None]:
model = GPT2("gpt2-xl")
model.fit(X=X_train, batch_size=12, epochs=1)
sarus.eval(model, target_epsilon='unlimited')  # 'unlimited' means there is no privacy guarantee (see Differential Privacy to learn more)

In [None]:
# Sample from the model trained without DP
sarus.eval(
    model.sample(
        prompts=[""] * 10,
        temperature=1.0,
        max_new_tokens=100,
    ),
    target_epsilon='unlimited',
)

The LLM has properly learnt the medical expertise...

In [None]:
sarus.eval(
    model.sample(
        prompts=["François Dupont suffers"] * 5,
        temperature=0.1,
        max_new_tokens=100,
    ),
    target_epsilon='unlimited',
)

... but it has also learnt the secret about François Dupont! A very basic prompt starting with `François Dupont suffers` directly reveals the very sensitive information.

## ➡️ **PRIVACY LEAK**

### Now, let's fine-tune with Differential Privacy

In [None]:
PRIVACY_BUDGET = 22  # See Differential Privacy to learn more. Note that a high budget is used here because the training dataset only contains 3K rows.

In [None]:
model = GPT2("gpt2-xl")
model.fit(X=X_train, batch_size=1, epochs=3)
sarus.eval(model, target_epsilon=PRIVACY_BUDGET)

Differentially-private evaluation (epsilon=21.16)                                                   


'model (fitted)'
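
For readers new to the term, the privacy budget refers to the parameter epsilon in the standard textbook definition of differential privacy, sketched below. The notebook does not state which delta or accounting method Sarus uses, so reading `target_epsilon` as this epsilon is an assumption. Here $M$ stands for the randomized training procedure, $D$ and $D'$ for two datasets that differ in a single record, and $S$ for any set of possible outputs.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \, \Pr[\,M(D') \in S\,] + \delta
```

A smaller epsilon means a stronger guarantee. The evaluation above reports `epsilon=21.16`, which stays within the `PRIVACY_BUDGET` of 22.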

In [None]:
# Sample from model trained with DP
sarus.eval(
    model.sample(
        prompts=[""] * 10,
        temperature=1.0,
        max_new_tokens=100,
    ),
    target_epsilon=PRIVACY_BUDGET,
)

Differentially-private evaluation (epsilon=21.16)


Unnamed: 0,text
0,"""Infusion of 5-O-thiazodiazepine (5-O-tetracyclic) to the IntraVascular (IV) area."""
1,"""We carried out a meta analysis of the effects of Nespray on insulin resistance, body fat, muscle insulin sensitivity, and insulin production in patients with type 2 diabetes."""
2,"""The total amount of alcohol consumed in one day is determined by a complex and highly interrelated set of factors that included the time of consumption, the amount of alcohol consumed, the time of consumption within the previous day, the amount consumed over a period of 24 hours, the time of alcohol consumption immediately before or several hours before the subsequent occurrence of a specific toxicological syndrome, and/or an alcohol tolerance."""
3,"""Vitamins are essential for vitamin B 12 metabolism""\n\nVitamin B1 (biotin), vitamin B2 (orylthanolamine), and vitamin B3 (aspartate-amido-β-hydroxy-α-β-d-fisoxazole-silycolysis)- are essential for hematologic vitamin B 12 metabolism and therefore must also be obtained by oral intake even in the case of total absence of these vitamins or at doses"
4,"""Rational therapy of patients with severe psoriasis and immunosuppression induced in an intractable case by systemic corticosteroids."""
5,"""A single intravenous injection led to respiratory failure after 3 months."""
6,"""Aromatherapy with an antiepiprazole during the acute phase of acute pancreatitis has significantly worse long-term outcome."""
7,"""The effects of dietary supplementation with a combination of 3 g of methohexitaline and 150 mg of levodopa during 2 wk have been described by the author."""
8,"""The high incidence of breast cancer in women treated for metastatic breast cancer following chemotherapy suggests that adjuvant Nd:Nd:Asp is safe, effective, and not associated with an increased incidence of breast cancer."""
9,"""Bacterial necrotic syndrome in a woman treated with nitrazepam after routine vaginal hysterectomy."""


The model still generates plausible medical diagnoses.

In [None]:
sarus.eval(
    model.sample(
        prompts=["François Dupont suffers"] * 5,
        temperature=0.1,
        max_new_tokens=100,
    ),
    target_epsilon=PRIVACY_BUDGET,
)

Differentially-private evaluation (epsilon=21.16)


Unnamed: 0,text
0,"François Dupont suffers from a rare, chronic, and progressive neurodegenerative disease that is associated with a high incidence of neurodegenerative disorders, including multiple sclerosis (MS) and spinal cord injury (SCI). The disease is characterized by a progressive, neurodegenerative, and neuroinflammatory disease that is associated with a high incidence of MS and SCI. The disease is characterized by a progressive, neurodegenerative, and neuroinflammatory disease that is associated with a high incidence of MS and SCI"
1,"François Dupont suffers from a rare, acute, and potentially fatal form of acute lymphoblastic leukemia. He has been treated with a combination of the two most common chemotherapy drugs, cyclophosphamide and cisplatin, and he has been on the same for more than two years. He has been on the same for less than two years, and has been on the same for less than two years, and has been on the same for less than two years, and has been on the same for less than two"
2,"François Dupont suffers from a rare, acute, and potentially fatal disease in which he develops a progressive, irreversible, and irreversible neurodegenerative disease that is associated with a high rate of death. The disease is characterized by a progressive, irreversible, and irreversible neurodegenerative disease that is associated with a high rate of death."
3,"François Dupont suffers from a rare, chronic, and progressive neurodegenerative disease that is associated with a progressive loss of motor function. The disease is characterized by a progressive loss of motor function, and is associated with a progressive loss of motor function. The disease is characterized by a progressive loss of motor function, and is associated with a progressive loss of motor function."
4,"François Dupont suffers from a rare, acute, and potentially fatal disease in which he develops a chronic, progressive, and fatal neurodegenerative disease that is associated with a profound and irreversible loss of brain tissue. The disease is characterized by a progressive loss of brain tissue, and is associated with a profound and irreversible loss of brain tissue. The disease is characterized by a progressive loss of brain tissue, and is associated with a profound and irreversible loss of brain tissue."


Plus, François Dupont's secret has been protected!
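
This kind of check can also be automated. The snippet below is a minimal sketch, not part of the Sarus SDK: the helper `find_leaks` is hypothetical, and the sample strings are shortened excerpts of the outputs shown above. It looks for the planted diagnosis in generated text using a case-insensitive substring search.

```python
from typing import List


def find_leaks(samples: List[str], secret_fragment: str) -> List[str]:
    """Return the samples that contain the secret fragment (hypothetical helper)."""
    needle = secret_fragment.lower()
    return [s for s in samples if needle in s.lower()]


# Shortened excerpts of the differentially-private samples shown above.
dp_samples = [
    "François Dupont suffers from a rare, chronic, and progressive neurodegenerative disease",
    "François Dupont suffers from a rare, acute, and potentially fatal form of acute lymphoblastic leukemia.",
    "François Dupont suffers from a rare, acute, and potentially fatal disease",
]

# The diagnosis planted in the training data for this demonstration.
leaked = find_leaks(dp_samples, "pancreatic cancer")
print(f"{len(leaked)} of {len(dp_samples)} samples contain the secret")
```

A string match like this is only a sanity check on a handful of samples. The actual protection comes from the differential privacy guarantee applied during training, not from inspecting outputs.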