In [7]:
# 3 types of synthetic dataset
# 1. Question and Answer (direct facts from text)
# 2. Insight (synthesize new information with its knowledge and text)
# 3. Metadata and context (what they are, and whether they change)

# This notebook will be for 1. Question and Answer (direct facts from text)

# Dataset

In [8]:
import os

# Dictionary to hold file names and their contents
files_content_dict = {}

# Directory containing the .txt files
directory = './data/en/'

# Check if directory exists
if not os.path.exists(directory):
    print(f"The directory {directory} does not exist.")
else:
    # Loop through all the files in the directory
    for filename in os.listdir(directory):
        # Construct the full file path
        file_path = os.path.join(directory, filename)
        # Check if it is a file and has a .txt extension
        if os.path.isfile(file_path) and filename.endswith('.txt'):
            # Open and read the file content
            with open(file_path, 'r', encoding='utf-8') as file:
                # Read the contents of the file
                content = file.read()
                # Add to dictionary with the file name (without extension) as the key
                files_content_dict[os.path.splitext(filename)[0]] = content

# Printing the content dict just for demonstration purposes
for file_name, file_content in files_content_dict.items():
    print(f"{file_name}: {file_content[:50]}...")  # Print the first 50 characters of each file

# You can now use 'files_content_dict' as needed


Physiology_Levy: We are pleased that the following section authors ...
Cell_Biology_Alberts: The surface of our planet is populated by living t...
Pathoma_Husain: Growth Adaptations, Cellular Injury, and Cell Deat...
Psichiatry_DSM-5: PRESIDENT DILIP V. IESTE, M.D.

PRESIDENT-ELECT JE...
Immunology_Janeway: dendritic cells. 366 9-14 Cell-adhesion molecules ...
Anatomy_Gray: What is anatomy?

Anatomy includes those structure...
Pharmacology_Katzung: (All nonresearch use illegal under federal law.)

...
Surgery_Schwartz: Part IBasic ConsiderationsBrunicardi_Ch01_p0001-p0...
Biochemistry_Lippincott: For additional ancillary materials related to this...
Neurology_Adams: We are very pleased to bring you the 11th edition ...
First_Aid_Step2: Database of High-Yield Facts

The seventh edition ...
Obstentrics_Williams: In the olowingpages I have attempted to set orth, ...
Histology_Ross: OVERVIEW OF METHODS USED IN HISTOLOGY / 1 TISSUE P...
InternalMed_Harrison: xxxviii its related products in 

#### Let's break down the dataset into groups of 3 paragraphs each

In [9]:
def split_into_para_groups(text, n=3, max_length_group=128*7):
    """
    Split the text into groups of n paragraphs each.

    Any paragraph longer than max_length_group characters is first cut into
    pieces of at most max_length_group characters. The length of the joined
    group is NOT checked, so a group can be up to n * max_length_group
    characters long (plus the joining newlines).
    """
    para = text.split("\n")
    para = [p.strip() for p in para if p.strip() != ""]
    para_max_len_opt = []
    for p in para:
        if len(p) > max_length_group:
            for i in range(0, len(p), max_length_group):
                para_max_len_opt.append(p[i:i+max_length_group])
        else:
            para_max_len_opt.append(p)
    para = para_max_len_opt

    para_groups = []
    for i in range(0, len(para), n):
        para_groups.append("\n".join(para[i:i+n]))

    
    return para_groups
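
Because the function above caps each paragraph piece rather than the joined group, a group can end up longer than `max_length_group`. If a hard cap on the whole group is wanted, greedy packing is one option. The following is a minimal sketch of that idea, not the method this notebook uses; `pack_paragraphs` and `max_chars` are hypothetical names introduced only for illustration.

```python
def pack_paragraphs(text, max_chars=640):
    """Greedily pack paragraphs into groups of at most max_chars characters.

    Hypothetical alternative to split_into_para_groups: here the cap applies
    to the joined group, not to the individual paragraph pieces.
    """
    # Split on newlines, drop empty lines, and hard-split oversized paragraphs.
    pieces = []
    for p in (line.strip() for line in text.split("\n")):
        if not p:
            continue
        pieces.extend(p[i:i + max_chars] for i in range(0, len(p), max_chars))

    groups, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + "\n" + piece
        if len(candidate) <= max_chars:
            # The piece still fits into the current group.
            current = candidate
        else:
            # Close the current group and start a new one with this piece.
            groups.append(current)
            current = piece
    if current:
        groups.append(current)
    return groups


demo = "aaaa\nbbbb\n\ncccccccccccc\ndd"
print(pack_paragraphs(demo, max_chars=10))
```

The trade-off is that the number of paragraphs per group is no longer fixed, so short paragraphs are merged until the character budget is used up.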
    


In [10]:

paragraphs = []
file_names = []
for file_name, file_content in files_content_dict.items():
    # join 3 paragraphs together
    paragraphs.extend(split_into_para_groups(file_content, n = 3, max_length_group = 128*5))
    file_names.extend([file_name]*(len(paragraphs)-len(file_names)))

In [11]:
print(paragraphs[0])
len(paragraphs)

We are pleased that the following section authors have continued as members of the seventh edition team: Drs. Kalman Rubinson and Eric Lang (nervous system), Dr. James Watras (muscle), Dr. Achilles Pappano (cardiovascular system), Drs. Michelle Cloutier and Roger Thrall (respiratory system), Drs. Kim Barrett and Helen Raybould (gastrointestinal system), and Dr. Bruce White (endocrine and reproductive systems). We also welcome the following authors: Dr. Withrow Gil Wier (cardiovascular system), and Dr. John Harrison (endocrine and reproduction systems).
As in the previous editions of this textbook, we have attempted to emphasize broad concepts and to minimize the compilation of isolated facts. Each chapter has been written to make the text as lucid, accurate, and current as possible. We have included both clinical and molecular information in each section, as feedback on these features has indicated that this information serves to provide clinical context and new insights into physiolog

94187

In [12]:
print(len(paragraphs[0]))

1243
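
The first group is 1243 characters long even though `max_length_group` was set to 128*5 = 640. That is expected: the cap is applied to each paragraph piece before grouping, and a group joins up to n pieces with newlines. A quick check of the worst case (the helper name `max_group_length` is made up for this illustration):

```python
def max_group_length(n, max_length_group):
    """Upper bound on one group's length: n capped pieces plus n-1 newlines."""
    return n * max_length_group + (n - 1)


bound = max_group_length(n=3, max_length_group=128 * 5)
print(bound)          # 1922
print(1243 <= bound)  # True
```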


In [13]:
import pandas as pd

df = pd.DataFrame(paragraphs, columns=["paragraph"])
# add a column for the length of each paragraph
df["length"] = df["paragraph"].apply(len)

# summary statistics of the paragraph lengths
df["length"].describe()

# add the file name as well
df["file_name"] = file_names



In [14]:
df.head()

Unnamed: 0,paragraph,length,file_name
0,We are pleased that the following section auth...,1243,Physiology_Levy
1,The human body consists of billions of cells t...,1665,Physiology_Levy
2,els can be stored and then mobilized when inge...,787,Physiology_Levy
3,Gastrointestinal tract: Digests and absorbs fu...,270,Physiology_Levy
4,Endocrine system: Maintains the blood levels o...,852,Physiology_Levy


In [8]:
# find the longest paragraph


sample = df[df["length"] == df["length"].max()]["paragraph"].values[0]
print(sample)
print(len(sample))

The length constant can be related to the electrical properties of the axon according to cable theory because nerve fibers have many of the properties of an electrical cable. In a perfect cable, the insulation surrounding the core conductor prevents all loss of current to the surrounding medium, so that a signal is transmitted along the cable with undiminished strength. If an unmyelinated nerve fiber (discussed later) is compared to an electrical cable, the plasma membrane equates to the insulation and the cytoplasm as the core conductor, but the plasma membrane is not a perfect insulator. Thus the spread of signals depends on the r
atio of the membrane resistance to the axial resistance of the axonal cytoplasm (ra). When the ratio of rm to ra is high, less current is lost across the plasma membrane per unit of axonal length, the axon can function better as a cable, and the distance that a signal can be conveyed electrotonically without significant decrement is longer. A useful analogy

In [9]:
# LLM and SamplingParams are the only names this notebook uses
from vllm import LLM, SamplingParams


In [10]:
#base_model_id = "ehartford/dolphin-2.0-mistral-7b"
#base_model_id = "HuggingFaceH4/zephyr-7b-alpha"
#base_model_id = "amazon/MistralLite"
base_model_id = "ehartford/dolphin-2.2.1-mistral-7b"
llm = LLM(model=base_model_id)


INFO 11-08 23:04:20 llm_engine.py:72] Initializing an LLM engine with config: model='ehartford/dolphin-2.2.1-mistral-7b', tokenizer='ehartford/dolphin-2.2.1-mistral-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 11-08 23:04:31 llm_engine.py:207] # GPU blocks: 1998, # CPU blocks: 2048


In [11]:
template = """<|im_start|>system
You will be asked questions which you can only answer with the information below:
## Context ##
{}
#############
You will accurately answer these questions which will be directly about the information in the context. You will only be asked one question at a time. <|im_end|>
<|im_start|>user
"""

sampling_params = SamplingParams(temperature=0.3, max_tokens=512)

In [12]:
def generate_prompt(text):
    return template.format(text)

In [13]:
prompts = [generate_prompt(p) for p in paragraphs]

In [14]:
print(prompts[0])

<|im_start|>system
You will be asked questions which you can only answer with the information below:
## Context ##
We are pleased that the following section authors have continued as members of the seventh edition team: Drs. Kalman Rubinson and Eric Lang (nervous system), Dr. James Watras (muscle), Dr. Achilles Pappano (cardiovascular system), Drs. Michelle Cloutier and Roger Thrall (respiratory system), Drs. Kim Barrett and Helen Raybould (gastrointestinal system), and Dr. Bruce White (endocrine and reproductive systems). We also welcome the following authors: Dr. Withrow Gil Wier (cardiovascular system), and Dr. John Harrison (endocrine and reproduction systems).
As in the previous editions of this textbook, we have attempted to emphasize broad concepts and to minimize the compilation of isolated facts. Each chapter has been written to make the text as lucid, accurate, and current as possible. We have included both clinical and molecular information in each section, as feedback on th

In [15]:
# Model

In [16]:
#outputs = llm.generate(prompts, sampling_params)

In [17]:
#output_text = [output.outputs[0].text for output in outputs]

In [18]:
#print(output_text[789])

In [19]:
# add paragraphs, prompts, and outputs to a dataframe (disabled)

# df["prompt"] = prompts
# df["output"] = output_text
# df.head()

# save the dataframe to a csv file
# df.to_csv("data/qa_dataset1.csv", index=False)

In [20]:
# disabled: append the generated text and open an assistant turn
# new_prompts = [p + o for p, o in zip(prompts, output_text)]
# new_prompts = [p.strip() + "\n<|im_start|>assistant\n" for p in new_prompts]
# print(new_prompts[0])

In [21]:
# Back-and-forth conversation: in each round the model first writes the user's
# question, then the assistant's answer. new_prompts holds the full transcript.

new_prompts = prompts
# use a separate dataframe so that df (paragraph, length, file_name) is not overwritten
conv_df = pd.DataFrame(new_prompts, columns=["prompt"])


for i in range(5):
    # the prompt ends with an open user turn, so the model generates a question
    outputs = llm.generate(new_prompts, sampling_params)
    output_text = [output.outputs[0].text for output in outputs]
    new_prompts = [p + o for p, o in zip(new_prompts, output_text)]
    new_prompts = [p.strip() + "<|im_end|>\n<|im_start|>assistant\n" for p in new_prompts]

    # the prompt now ends with an open assistant turn, so the model generates the answer
    outputs = llm.generate(new_prompts, sampling_params)
    output_text = [output.outputs[0].text for output in outputs]
    new_prompts = [p + o for p, o in zip(new_prompts, output_text)]
    new_prompts = [p.strip() + "<|im_end|>\n<|im_start|>user\n" for p in new_prompts]

    # save the transcript so far together with the latest answer
    # (the earlier mid-round save wrote to the same file and was overwritten here)
    conv_df["prompt"] = new_prompts
    conv_df["output"] = output_text
    conv_df.to_csv(f"data/qa_dataset{i}.csv", index=False)
    print(f"Done with {i} iteration")

Processed prompts: 100%|██████████| 10/10 [00:00<00:00, 12.45it/s]
Processed prompts: 100%|██████████| 10/10 [00:02<00:00,  3.49it/s]


Done with 0 iteration


Processed prompts: 100%|██████████| 10/10 [00:00<00:00, 11.46it/s]
Processed prompts: 100%|██████████| 10/10 [00:02<00:00,  3.85it/s]


Done with 1 iteration


Processed prompts: 100%|██████████| 10/10 [00:00<00:00, 10.68it/s]
Processed prompts: 100%|██████████| 10/10 [00:02<00:00,  4.34it/s]


Done with 2 iteration


Processed prompts: 100%|██████████| 10/10 [00:01<00:00,  8.93it/s]
Processed prompts: 100%|██████████| 10/10 [00:02<00:00,  3.61it/s]


Done with 3 iteration


Processed prompts: 100%|██████████| 10/10 [00:01<00:00,  9.31it/s]
Processed prompts: 100%|██████████| 10/10 [00:03<00:00,  3.02it/s]

Done with 4 iteration
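
Each round above makes two generation calls: the prompt first ends with an open `user` turn, so the model writes a question, and then with an open `assistant` turn, so it writes the answer. To see how the transcript grows without loading a model, here is a small sketch. `fake_generate` is a stand-in for `llm.generate`, and both it and its canned replies are assumptions made purely for illustration.

```python
IM_END = "<|im_end|>\n"


def fake_generate(prompts, reply):
    """Stand-in for llm.generate: returns the same canned reply for every prompt."""
    return [reply for _ in prompts]


def run_rounds(prompts, rounds=2):
    """Alternate user/assistant generations, appending ChatML markers after each."""
    for r in range(rounds):
        # Open user turn: the completion is treated as the question.
        questions = fake_generate(prompts, f" question {r}")
        prompts = [(p + q).rstrip() + IM_END + "<|im_start|>assistant\n"
                   for p, q in zip(prompts, questions)]
        # Open assistant turn: the completion is treated as the answer.
        answers = fake_generate(prompts, f" answer {r}")
        prompts = [(p + a).rstrip() + IM_END + "<|im_start|>user\n"
                   for p, a in zip(prompts, answers)]
    return prompts


start = "<|im_start|>system\ncontext<|im_end|>\n<|im_start|>user\n"
print(run_rounds([start], rounds=2)[0])
```

Note that the transcript always ends with an open `user` turn, so the saved prompt has one dangling role marker after the last answer.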





In [24]:
print(new_prompts[2])

<|im_start|>system
You will be asked questions which you can only answer with the information below:
## Context ##
els can be stored and then mobilized when ingestion of the precursors is not possible. The storage forms of these fuels are triglycerides (stored in adipose tissue), glycogen (stored in the liver and skeletal muscle), and protein. The maintenance of adequate levels of cellular fuels in the blood is a complex process involving the following tissues, organs, and organ systems:
Liver: Converts precursors into fuel storage forms (e.g., glucose → glycogen) when food is ingested, and converts storage forms to cellular fuels during fasting (e.g., glycogen → glucose and amino acids → glucose).
Skeletal muscle: Like the liver, stores fuel (glycogen and protein) and converts glycogen and protein to fuels (e.g., glucose) or fuel intermediates (e.g., protein → amino acids) during fasting.
#############
You will accurately answer these questions which will be directly about the informa

In [15]:
# load qa_dataset3.csv

import pandas as pd

df2 = pd.read_csv("data/qa_dataset3.csv")
df2.head()

Unnamed: 0,prompt,output
0,<|im_start|>system\nYou will be asked question...,The authors of the gastrointestinal system se...
1,<|im_start|>system\nYou will be asked question...,The cellular fuels that are present in the bl...
2,<|im_start|>system\nYou will be asked question...,The process when the liver converts precursor...
3,<|im_start|>system\nYou will be asked question...,The relationship between the gastrointestinal...
4,<|im_start|>system\nYou will be asked question...,"In the endocrine system, insulin and glucagon..."


In [17]:
# print a random prompt

import random

# random.seed(42)

random_index = random.randint(0, len(df2)-1)

print(df2.iloc[random_index]["prompt"])


<|im_start|>system
You will be asked questions which you can only answer with the information below:
## Context ##
Synthesis of anti-sense (–) RNA template RNA replication
Fig. 16.11 Life cycle of hepatitis C. Viral entry, replication, assembly, and budding are shown, emphasizing steps that can be effectively targeted with anti-viral drugs.
Fortunately, recent years have seen dramatic improvements in treatment of HCV infection that stem from development of drugs that specifically target the viral protease, RNA polymerase, and NS5A protein, all of which are required for production of virus (
#############
You will accurately answer these questions which will be directly about the information in the context. You will only be asked one question at a time. <|im_end|>
<|im_start|>user
 What are the three proteins that are required for production of HCV virus?<|im_end|>
<|im_start|>assistant
 The three proteins that are required for production of HCV virus are the viral protease, RNA polymer
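
Since each saved prompt holds the whole ChatML transcript, the individual question and answer pairs have to be parsed back out before they can be used as a Q&A dataset. A minimal sketch of one way to do that with a regular expression follows; `parse_chatml` and `qa_pairs` are hypothetical helper names, and the sketch assumes the context itself never contains the literal `<|im_start|>` or `<|im_end|>` markers.

```python
import re

# Matches one closed turn: <|im_start|>role\ncontent<|im_end|>
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(transcript):
    """Return a list of (role, content) tuples for every closed turn."""
    return [(role, content.strip()) for role, content in TURN_RE.findall(transcript)]


def qa_pairs(transcript):
    """Pair each user turn with the assistant turn that directly follows it."""
    turns = parse_chatml(transcript)
    return [(a[1], b[1]) for a, b in zip(turns, turns[1:])
            if a[0] == "user" and b[0] == "assistant"]


example = (
    "<|im_start|>system\nSome context. <|im_end|>\n"
    "<|im_start|>user\n What is X?<|im_end|>\n"
    "<|im_start|>assistant\n X is Y.<|im_end|>\n"
    "<|im_start|>user\n"
)
print(qa_pairs(example))
```

The trailing open `user` turn is ignored because it has no closing marker.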

In [20]:
# ensure every "paragraph" in df appears in the corresponding "prompt" in df2

# both dataframes must have the same number of rows
assert len(df2) == len(df)

# check row by row that the paragraph is a substring of the prompt
for i in range(len(df)):
    assert df.iloc[i]["paragraph"] in df2.iloc[i]["prompt"]
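
The assertion loop above stops at the first mismatch. When debugging, it can be more helpful to collect every mismatching row index instead. A small sketch under that assumption; `find_mismatches` is a hypothetical helper that accepts any two equal-length sequences of strings, so it should also work when called with `df["paragraph"]` and `df2["prompt"]`.

```python
def find_mismatches(paragraphs, prompts):
    """Return the indices where the paragraph is not a substring of the prompt."""
    if len(paragraphs) != len(prompts):
        raise ValueError("inputs must have the same length")
    return [i for i, (para, prompt) in enumerate(zip(paragraphs, prompts))
            if para not in prompt]


paras = ["alpha", "beta", "gamma"]
prompts_demo = ["ctx: alpha\n", "ctx: BETA\n", "ctx: gamma\n"]
print(find_mismatches(paras, prompts_demo))  # [1]
```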

In [21]:
# merge the two dataframes

df2["paragraph"] = df["paragraph"]
df2["length"] = df["length"]
df2["file_name"] = df["file_name"]

df2.head()

Unnamed: 0,prompt,output,paragraph,length,file_name
0,<|im_start|>system\nYou will be asked question...,The authors of the gastrointestinal system se...,We are pleased that the following section auth...,1243,Physiology_Levy
1,<|im_start|>system\nYou will be asked question...,The cellular fuels that are present in the bl...,The human body consists of billions of cells t...,1665,Physiology_Levy
2,<|im_start|>system\nYou will be asked question...,The process when the liver converts precursor...,els can be stored and then mobilized when inge...,787,Physiology_Levy
3,<|im_start|>system\nYou will be asked question...,The relationship between the gastrointestinal...,Gastrointestinal tract: Digests and absorbs fu...,270,Physiology_Levy
4,<|im_start|>system\nYou will be asked question...,"In the endocrine system, insulin and glucagon...",Endocrine system: Maintains the blood levels o...,852,Physiology_Levy


In [None]:
df2.to_csv("data/qa_dataset4_complete.csv", index=False)