## Idea

There are two types of summarization: "extractive" and "abstractive". `summarization.ipynb` uses extractive summarization, where key sentences in the text are identified and pulled out verbatim to summarize it. Here we will use a text generation model that tries to distill the text into a summary by generating new text.
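For contrast, here is a minimal sketch of the extractive idea: score each sentence by the average frequency of its words and return the top sentences verbatim. This is an illustrative assumption, not necessarily the method used in `summarization.ipynb`; the `extractive_summary` helper and its scoring rule are hypothetical.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences verbatim, in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole text (lowercased)
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


example = "The cat sat on the mat. The cat ate the fish. Dogs bark."
print(extractive_summary(example, num_sentences=1))
```

The output is always copied from the input, which is the defining property of extractive summarization. The abstractive approach used in this notebook has no such guarantee.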

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Specify the device
device = torch.device("cpu")
# device = torch.device("cuda")

In [None]:
# MODEL = "microsoft/phi-2" # This one is mostly intended for coding 
# MODEL = "stabilityai/stablelm-zephyr-3b" # OOM 
# MODEL = "allenai/OLMo-1B" # This has additional allen ai dependencies and python 3.9 or later
# MODEL = "allenai/OLMo-7B" # This has additional allen ai dependencies and python 3.9 or later (This one seems like the third best so far)
# MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# MODEL = "stabilityai/stablelm-2-1_6b"
# MODEL = "mistralai/Mixtral-8x7B-v0.1" # Not enough GPU memory
MODEL = "berkeley-nest/Starling-LM-7B-alpha" # (This is the best one so far!)
# MODEL = "HuggingFaceH4/zephyr-7b-beta" # (This one is very good, tied for second best maybe a bit better than openchat. It is also relatively fast)
# MODEL = "openchat/openchat_3.5" # (This one is the second best so far)
# MODEL = "mistralai/Mistral-7B-v0.1"

In [None]:
# Remove this to pull the model from huggingface rather than local
LOCAL_DIR = '../saved_models'
LOCAL_MODEL = f'{LOCAL_DIR}/{MODEL}'

In [None]:
# For TinyLlama
if MODEL == "TinyLlama/TinyLlama-1.1B-Chat-v1.0":
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_flash_sdp(False)

In [None]:
if MODEL == "TinyLlama/TinyLlama-1.1B-Chat-v1.0":
    torch.set_default_device("cuda")
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", trust_remote_code=True)
elif MODEL == "mistralai/Mixtral-8x7B-v0.1":
    model = AutoModelForCausalLM.from_pretrained(MODEL, load_in_4bit=True).to(device)
else:
    model = AutoModelForCausalLM.from_pretrained(LOCAL_MODEL, trust_remote_code=True)
# model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

In [None]:
MEDICAL_HISTORY_1 = """
Chief Concern: Chest pain for 1 month
HPI: Mr. PH is a 52 y/o accountant with hypercholesterolemia and polycythemia vera who has
been in relatively good health (except for problem #2 below) until one month ago when he
noticed chest tightness with exertion. The patient decided to lose weight through exercising
and began to run. When running greater than six to seven blocks, the patient developed a tight
feeling in his chest that subsided in approximately five minutes after he stopped running.
Initially, the feeling was mild, occurred only with the running and was associated with no other
symptoms. It did not radiate. On the night prior to admission, while watching TV, he had the
same pain, except this time it was of increased intensity (10/10), lasted 20 minutes, and was
associated with shortness of breath and a brief period of profuse diaphoresis. Regarding risk
factors for coronary artery disease, the patient does not smoke, has no high blood pressure or
diabetes, has borderline high cholesterol, and patient's father died suddenly at age 40 from a
presumed heart attack.
The patient is concerned that he has the same problem that his father had and that he has the
same potential to "drop dead". He normally has sexual intercourse with his wife one to two
times per week, but because of the fear of having the pain during intercourse the patient has
avoided any intimate contact with his wife.
Problem #2 Polycythemia Vera: three years ago, during a routine physical for work, the patient
was found to have elevated hemoglobin and was worked-up at LUMC by Dr. Smith. His red
blood cell mass was high and the patient was found to have primary Polycythemia Vera. Initially
he was treated with monthly phlebotomies, but for the last year has received a phlebotomy
only once every six months. He has no symptoms of this illness.
The patient is aware of the possible complications of this illness. Initially, he worried about
them, but for the last year, since he has felt well; he accepts the illness and sees his
Hematologist on a regular basis.
"""

In [None]:
MEDICAL_HISTORY_2 = """
Harley, a nine-year-old spayed female beagle presented to the emergency service for a five-day history of hyporexia, 
which has progressed to a two-day history of anorexia. Harley also started exhibiting lethargy three days ago and had 
one episode of vomiting two days ago. She has typically eaten Purina ProPlan, kibble and rice twice daily. She's still 
drinking water. No episodes of diarrhea are noted, period. The owners noted that she was, had an increase in her 
respiratory rate and effort over the past 48 hours. On presentation, her mucous membranes were pale pink. with a CRT of 
approximately two seconds. Her heart rate was 124. No murmurs were escorted. Few crackles were noted in her right ventral 
thorax. Her abdomen was tense on palpation and moderately distended. Thin abdominal skin was noted. On abdominal palpation, 
the liver palpated large and rounded, and a firm mass was palpable in the right cranial quadrant. No fluid wave was 
appreciated. No peripheral lymph node enlargement was identified. Her pulses were rapid and bounding. Rectal examination was 
within normal. on the limits. We discussed obtaining CBC serum chemistry thoracic and abdominal radiographs. As a baseline, 
thoracic radiographs showed an interstitial pattern.
"""

In [None]:
MEDICAL_DOCUMENT = """ 
duplications of the alimentary tract are well - known but rare congenital malformations that can occur anywhere in the gastrointestinal ( gi ) tract from the tongue to the anus . while midgut duplications are the most common , foregut duplications such as oesophagus , stomach , and parts 1 and 2 of the duodenum account for approximately one - third of cases . 
 they are most commonly seen either in the thorax or abdomen or in both as congenital thoracoabdominal duplications . 
 cystic oesophageal duplication ( ced ) , the most common presentation , is often found in the lower third part ( 60 - 95% ) and on the right side [ 2 , 3 ] . hydatid cyst ( hc ) is still an important health problem throughout the world , particularly in latin america , africa , and mediterranean areas . 
 turkey , located in the mediterranean area , shares this problem , with an estimated incidence of 20/100 000 . 
 most commonly reported effected organ is liver , but in children the lungs are the second most frequent site of involvement [ 4 , 5 ] . in both ced and hc , the presentation depends on the site and the size of the cyst . 
 hydatid cysts are far more common than other cystic intrathoracic lesions , especially in endemic areas , so it is a challenge to differentiate ced from hc in these countries . here , 
 we present a 7-year - old girl with intrathoracic cystic mass lesion , who had been treated for hydatid cyst for 9 months , but who turned out to have oesophageal cystic duplication . 
 a 7-year - old girl was referred to our clinic with coincidentally established cystic intrathoracic lesion during the investigation of aetiology of anaemia . 
 the child was first admitted with loss of vision in another hospital ten months previously . 
 the patient 's complaints had been attributed to pseudotumour cerebri due to severe iron deficiency anaemia ( haemoglobin : 3 g / dl ) . 
 chest radiography and computed tomography ( ct ) images resulted in a diagnosis of cystic intrathoracic lesion ( fig . 
 the cystic mass was accepted as a type 1 hydatid cyst according to world health organization ( who ) classification . 
 after 9 months of medication , no regression was detected in ct images , so the patient was referred to our department . 
 an ondirect haemagglutination test result was again negative . during surgery , after left thoracotomy incision , a semi - mobile cystic lesion , which was almost seven centimetres in diameter , with smooth contour , was found above the diaphragm , below the lung , outside the pleura ( fig . 
 the entire fluid in the cyst was aspirated ; it was brown and bloody ( fig . 
 2 ) . the diagnosis of cystic oesophageal duplication was considered , and so an attachment point was searched for . 
 it was below the hiatus , on the lower third left side of the oesophagus , and it also was excised completely through the hiatus . 
 pathologic analysis of the specimen showed oesophageal mucosa with an underlying proper smooth muscle layer . 
 computed tomography image of the cystic intrathoracic lesion cystic lesion with brownish fluid in the cyst 
 compressible organs facilitate the growth of the cyst , and this has been proposed as a reason for the apparent prevalence of lung involvement in children . diagnosis is often incidental and can be made with serological tests and imaging [ 5 , 7 ] . 
 laboratory investigations include the casoni and weinberg skin tests , indirect haemagglutination test , elisa , and the presence of eosinophilia , but can be falsely negative because children may have a poor serological response to eg . 
 false - positive reactions are related to the antigenic commonality among cestodes and conversely seronegativity can not exclude hydatidosis . 
 false - negative results are observed when cysts are calcified , even if fertile [ 4 , 8 ] . in our patient iha levels were negative twice . 
 due to the relatively non - specific clinical signs , diagnosis can only be made confidently using appropriate imaging . 
 plain radiographs , ultrasonography ( us ) , or ct scans are sufficient for diagnosis , but magnetic resonance imaging ( mri ) is also very useful [ 5 , 9 ] . 
 computed tomography demonstrates cyst wall calcification , infection , peritoneal seeding , bone involvement fluid density of intact cysts , and the characteristic internal structure of both uncomplicated and ruptured cysts [ 5 , 9 ] . 
 the conventional treatment of hydatid cysts in all organs is surgical . in children , small hydatid cysts of the lungs 
 respond favourably to medical treatment with oral administration of certain antihelminthic drugs such as albendazole in certain selected patients . 
 the response to therapy differs according to age , cyst size , cyst structure ( presence of daughter cysts inside the mother cysts and thickness of the pericystic capsule allowing penetration of the drugs ) , and localization of the cyst . in children , small cysts with thin pericystic capsule localised in the brain and lungs respond favourably [ 6 , 11 ] . 
 respiratory symptoms are seen predominantly in cases before two years of age . in our patient , who has vision loss , the asymptomatic duplication cyst was found incidentally . 
 the lesion occupied the left hemithorax although the most common localisation reported in the literature is the lower and right oesophagus . 
 the presentation depends on the site and the size of the malformations , varying from dysphagia and respiratory distress to a lump and perforation or bleeding into the intestine , but cysts are mostly diagnosed incidentally . 
 if a cystic mass is suspected in the chest , the best technique for evaluation is ct . 
 magnetic resonance imaging can be used to detail the intimate nature of the cyst with the spinal canal . 
 duplications should have all three typical signs : first of all , they should be attached to at least one point of the alimentary tract ; second and third are that they should have a well - developed smooth muscle coat , and the epithelial lining of duplication should represent some portions of alimentary tract , respectively [ 2 , 10 , 12 ] . in summary , the cystic appearance of both can cause a misdiagnosis very easily due to the rarity of cystic oesophageal duplications as well as the higher incidence of hydatid cyst , especially in endemic areas . 
"""

## Summary prompt

In [None]:
prompt = f"""
Provide a very short and concise summary, no more than three sentences, for the following medical history:

{MEDICAL_HISTORY_2}

Summary:

"""

if MODEL in ["allenai/OLMo-1B", "allenai/OLMo-7B"]:
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False)
    outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95)
    text = tokenizer.batch_decode(outputs)[0]
else:
    if MODEL == "mistralai/Mixtral-8x7B-v0.1":
        # Mixtral is moved to `device` at load time, so its inputs must go there too
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
    else:
        inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=1600)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    text = text.rstrip()
print(text)

## TL;DR prompt

In [None]:
prompt = f"""
Provide a very short and concise TL;DR for the following medical history:

{MEDICAL_HISTORY_2}

TL;DR:

"""

if MODEL in ["allenai/OLMo-1B", "allenai/OLMo-7B"]:
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False)
    outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95)
    text = tokenizer.batch_decode(outputs)[0]
else:
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=1600)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    text = text.rstrip()
print(text)

## Bullet point prompt

In [None]:
prompt = f"""
Provide a short and concise summary in a few bullet points for the following medical history. Exclude any normal findings from the summary, but include any measurements:

{MEDICAL_HISTORY_2}

Bulletpoints:

"""

if MODEL in ["allenai/OLMo-1B", "allenai/OLMo-7B"]:
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False)
    outputs = model.generate(**inputs, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95)
    text = tokenizer.batch_decode(outputs)[0]
else:
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=1600)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    text = text.rstrip()
print(text)

## Clean up repeating output strings

Generative LLMs often produce repeated text, for example by restating the summary several times until the token limit is reached. We should clean up the output and remove the repetitions.
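One simple cleanup can be sketched as follows, assuming the model repeats the prompt's final marker (here `Summary:`) before each new attempt. The sketch keeps only the text between the first and second occurrence of the marker. The `first_completion` helper and the `generated` string are hypothetical and only for illustration.

```python
def first_completion(text, keyword):
    """Return the text between the first and second occurrence of keyword."""
    start = text.find(keyword)
    if start == -1:
        return None  # Keyword not found in text
    start += len(keyword)  # Skip past the keyword itself
    end = text.find(keyword, start)
    if end == -1:
        end = len(text)  # Only one occurrence: keep everything after it
    return text[start:end].strip()


generated = "Prompt text.\n\nSummary:\n\nFirst summary.\n\nSummary:\n\nSecond summary.\n"
print(first_completion(generated, "Summary:"))
```

This relies on the marker appearing verbatim in the output. If the model repeats itself without restating the marker, sentence-level deduplication is needed instead.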

In [None]:
SAMPLE_TEXT = """
Provide a very short and concise summary, no more than three sentences, for the following medical history:


Harley, a nine-year-old spayed female beagle presented to the emergency service for a five-day history of hyporexia, 
which has progressed to a two-day history of anorexia. Harley also started exhibiting lethargy three days ago and had 
one episode of vomiting two days ago. She has typically eaten Purina ProPlan, kibble and rice twice daily. She's still 
drinking water. No episodes of diarrhea are noted, period. The owners noted that she was, had an increase in her 
respiratory rate and effort over the past 48 hours. On presentation, her mucous membranes were pale pink. with a CRT of 
approximately two seconds. Her heart rate was 124. No murmurs were escorted. Few crackles were noted in her right ventral 
thorax. Her abdomen was tense on palpation and moderately distended. Thin abdominal skin was noted. On abdominal palpation, 
the liver palpated large and rounded, and a firm mass was palpable in the right cranial quadrant. No fluid wave was 
appreciated. No peripheral lymph node enlargement was identified. Her pulses were rapid and bounding. Rectal examination was 
within normal. on the limits. We discussed obtaining CBC serum chemistry thoracic and abdominal radiographs. As a baseline, 
thoracic radiographs showed an interstitial pattern.


Summary:


Harley, a nine-year-old spayed female beagle, presented with a five-day history of hyporexia and a two-day history of anorexia, 
lethargy, and one episode of vomiting. She has a history of eating Purina ProPlan kibble and rice twice daily and is still drinking water. 
Her respiratory rate and effort have increased over the past 48 hours, and her mucous membranes are pale pink with a CRT of 
approximately two seconds. Her heart rate is 124, with no murmurs. Few crackles are noted in her right ventral thorax, and her abdomen 
is tense and moderately distended. Her liver is large and rounded, and a firm mass is palpable in the right cranial quadrant. 
Thoracic radiographs show an interstitial pattern.


Summary:


Nine-year-old spayed female beagle Harley presented with hyporexia, anorexia, lethargy, and vomiting. She has a history of eating 
Purina ProPlan kibble and rice twice daily and is still drinking water. Her respiratory rate and effort have increased, and her mucous 
membranes are pale pink. Her heart rate is 124, with no murmurs. Few crackles are noted in her right ventral thorax, and her abdomen 
is tense and distended. Her liver is large and rounded, and a firm mass is palpable in the right cranial quadrant. Thoracic radiographs 
show an interstitial pattern.


Summary:


Harley, a nine-year-old spayed female beagle, presented with hyporexia, anorexia, lethargy, and vomiting. She has a history of eating 
Purina ProPlan kibble and rice twice daily and is still drinking water. Her respiratory rate and effort have increased, and her mucous 
membranes are pale pink. Her heart rate is 124, with no murmurs. Few crackles are noted in her right ventral thorax, and her abdomen 
is tense and distended. Her liver is large and rounded, and a firm mass is palpable in the right cranial quadrant. Thoracic radiographs 
show an interstitial pattern.
"""

In [None]:
def get_text_after_keyword(text, keyword):
    keyword_index = text.find(keyword)
    if keyword_index == -1:
        return None  # Keyword not found in text
    keyword_index += len(keyword)  # Skip past the keyword itself
    return text[keyword_index:].strip()

# Test the function: prints everything after the first "Summary:" marker,
# which still includes the later repeated summaries
keyword = "Summary:"
print(get_text_after_keyword(SAMPLE_TEXT, keyword))

In [None]:
import re

def find_repeating_substrings(s, min_length):
    # Match a chunk longer than min_length that is immediately repeated.
    # Note: `.` does not match newlines, so repeats spanning lines are not found.
    pattern = r'(.{{{}}}.+?)\1+'.format(min_length)
    matches = re.findall(pattern, s)
    return matches

# Test the function
print(find_repeating_substrings(SAMPLE_TEXT, 10))
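The regular expression only catches a chunk that is repeated immediately and within a single line. As an alternative sketch, the text can be split into sentences and only the first occurrence of each kept, regardless of where the repeat appears. The `drop_repeated_sentences` helper is hypothetical, and its naive sentence split is an assumption: it will also break on abbreviations such as "Dr.".

```python
import re


def drop_repeated_sentences(text):
    """Keep the first occurrence of each sentence, ignoring case and whitespace."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        # Normalize case and whitespace so line wraps do not hide a repeat
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(" ".join(sentence.split()))
    return " ".join(kept)


noisy = "Her heart rate is 124.\nThe liver is large. Her heart rate is 124. The liver\nis large."
print(drop_repeated_sentences(noisy))
```

Only exact repeats (after normalization) are removed. Paraphrased repeats, like the reworded summaries in `SAMPLE_TEXT`, would survive this filter.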

## Save pretrained model to disk

In [None]:
model.save_pretrained(f"{LOCAL_DIR}/{MODEL}")

## Run Mixtral on CPU

- NOTE: Leaving this code here for future use, but I haven't been able to get this to run without OOM errors
- Also test: https://huggingface.co/abacusai/Smaug-72B-v0.1
- For loading larger models into memory see this: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference
- Or for running models on the CPU see this: https://medium.com/@sangeedh20/running-llama-2-chat-model-on-cpu-server-890825584246

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Specify the device
device = torch.device("cpu")

# model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model_id = "abacusai/Smaug-34B-v0.1"
# Tokenizers hold no weights, so no device placement is needed here
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
# model.to(device)  # Move the model to the specified device

In [None]:
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}  # Move the inputs to the specified device

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))