<a href="https://colab.research.google.com/github/rinogrego/Learning-LLM/blob/main/explorations/Load-OpenBioLLM-4-bit-on-HuggingFace-pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q -U peft transformers datasets bitsandbytes wandb tqdm
!pip install git+https://github.com/huggingface/trl.git
!pip install -q -U accelerate

Collecting git+https://github.com/huggingface/trl.git
  Cloning https://github.com/huggingface/trl.git to /tmp/pip-req-build-0oprccu_
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/trl.git /tmp/pip-req-build-0oprccu_
  Resolved https://github.com/huggingface/trl.git to commit 4dce042a3863db1d375358e8c8092b874b02934b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Load Model

In [None]:
import transformers
import torch

from transformers import BitsAndBytesConfig, AutoModelForCausalLM

In [None]:
model_id = "aaditya/OpenBioLLM-Llama3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."},
    {"role": "user", "content": "How can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding; sampling with temperature=0.0 is rejected by transformers
)
print(outputs[0]["generated_text"][len(prompt):])


config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

pytorch_model-00001-of-00004.bin:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00004.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

pytorch_model-00003-of-00004.bin:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

pytorch_model-00004-of-00004.bin:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## Load Model using Quantization for Pipeline

In [None]:
# from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

model_id = "aaditya/OpenBioLLM-Llama3-8B"

# # LoRA
# lora_config = LoraConfig(
#     r=4,
#     lora_alpha=4,
#     lora_dropout=0.1,
#     target_modules = [
#         "q_proj",
#         "k_proj",
#         "v_proj",
#         "o_proj",
#         "gate_proj",
#         "up_proj",
#         "down_proj",
#         "lm_head",
#     ],
#     use_dora=False,
#     init_lora_weights="gaussian",
#     bias = "none",
#     task_type = "CAUSAL_LM"
# )
# Quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype = torch.float16,
    quantization_config = bnb_config,
    low_cpu_mem_usage = True,
    device_map = 'auto',
    use_cache = False
)
# model = prepare_model_for_kbit_training(model)
# model.gradient_checkpointing_enable()
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]
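
As a rough check on what 4-bit loading buys, the weight memory can be estimated from the shard sizes printed during the first download (4.98G, 5.00G, 4.92G, 1.17G). This is only a back-of-the-envelope sketch: it assumes the shards store 16-bit weights, it ignores activations and the KV cache, and it ignores quantization constants and any layers that bitsandbytes leaves unquantized, so the real footprint will be somewhat larger.

```python
# Shard sizes in GB, copied from the download progress output above.
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]
checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores each parameter in 16 bits (2 bytes).
BYTES_PER_PARAM_16BIT = 2
n_params = checkpoint_gb * 1e9 / BYTES_PER_PARAM_16BIT


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


print(f"estimated parameters : {n_params / 1e9:.2f} B")
for label, bits in [("16-bit (fp16/bf16)", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:<21}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the 16-bit footprint.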

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    tokenizer = model_id,
    model = model,
)

messages = [
    {
        "role": "system",
        "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."
    },
    {
        "role": "user",
        "content": "How can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill?"
    },
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




In [None]:
print(outputs[0]["generated_text"][len(prompt):])

To achieve a 2.5mg pill from a 3mg or 4mg waefin pill, you can split it using a pill splitter. A pill splitter is a small device that uses a sharp blade to cut pills into equal halves or quarters. Make sure the pill splitter is clean and in good condition before using it. Place the waefin pill on the splitter and apply gentle pressure to separate it into two equal halves or quarters, depending on the thickness of the pill. Be careful not to exert too much pressure or use excessive force, as this may damage the pill or cause it to break unevenly. If you don't have a pill splitter, you can also ask your pharmacist or healthcare provider for assistance in splitting your medication.
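
The generation cells in this notebook repeat the same steps each time: build the messages, apply the chat template, call the pipeline, and slice the prompt off the result. A small helper could wrap that pattern. This is a sketch, not part of the original notebook: the name `ask` and the `SYSTEM_PROMPT` placeholder are hypothetical, and the timer covers only the pipeline call, whereas the timed cells also include prompt construction.

```python
from time import perf_counter

# Placeholder: substitute the full OpenBioLLM system prompt used in the cells above.
SYSTEM_PROMPT = "You are an expert from the healthcare and biomedical domain."


def ask(pipe, question, system_prompt=SYSTEM_PROMPT, **generate_kwargs):
    """Build the chat prompt, generate, and return (answer, elapsed_seconds)."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    start = perf_counter()
    outputs = pipe(prompt, **generate_kwargs)
    elapsed = perf_counter() - start
    # The pipeline returns prompt + completion, so drop the prompt prefix.
    return outputs[0]["generated_text"][len(prompt):], elapsed


# Example call, assuming `pipeline` and `terminators` are defined as above:
# answer, seconds = ask(
#     pipeline, "Tell me about PRS in bioinformatics.",
#     max_new_tokens=256, eos_token_id=terminators,
#     do_sample=True, temperature=0.8, top_p=0.9,
# )
```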


In [None]:
from datetime import datetime

stime = datetime.now()
messages = [
    {
        "role": "system",
        "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."
    },
    {
        "role": "user",
        "content": "How can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill? Answer in bullet points separated with new line break after each point"
    },
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
etime = datetime.now()
print(etime - stime)

0:00:22.059559


In [None]:
print(outputs[0]["generated_text"][len(prompt):])

To split a 3mg or 4mg waefin pill into a 2.5mg pill, you can follow these steps: 1. Start by finding a suitable pill splitter or cutter. This tool is designed to precisely cut tablets in half without damaging the active ingredients. 2. Wash your hands thoroughly before handling any medications. 3. Place the waefin pill on a clean, flat surface. 4. Use the pill splitter to evenly divide the tablet along the score line. The score line is a small indentation on the tablet that indicates where it should be split. 5. Once the tablet is split, you will have two pieces - one with 2.5mg and the other with 1.5mg. 6. If you have a 4mg waefin pill, you can repeat the same process to create two 2.5mg pills. 7. It is important to note that this method should only be used for specific medications like waefin, under medical supervision, and with precise calculation to ensure accurate dosage. Consult with your healthcare provider for further guidance or if you have any concerns regarding medication sp

In [None]:
outputs[0]["generated_text"][len(prompt):]

'To split a 3mg or 4mg waefin pill into a 2.5mg pill, you can follow these steps: 1. Start by finding a suitable pill splitter or cutter. This tool is designed to precisely cut tablets in half without damaging the active ingredients. 2. Wash your hands thoroughly before handling any medications. 3. Place the waefin pill on a clean, flat surface. 4. Use the pill splitter to evenly divide the tablet along the score line. The score line is a small indentation on the tablet that indicates where it should be split. 5. Once the tablet is split, you will have two pieces - one with 2.5mg and the other with 1.5mg. 6. If you have a 4mg waefin pill, you can repeat the same process to create two 2.5mg pills. 7. It is important to note that this method should only be used for specific medications like waefin, under medical supervision, and with precise calculation to ensure accurate dosage. Consult with your healthcare provider for further guidance or if you have any concerns regarding medication s

In [None]:
from datetime import datetime

stime = datetime.now()
messages = [
    {
        "role": "system",
        "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."
    },
    {
        "role": "user",
        "content": "Tell me about PRS in bioinformatics. First explain it from high-level overview. After that slowly increase the complexity and depth of the subject."
    },
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
etime = datetime.now()
print(etime - stime)

0:00:34.823527


In [None]:
print(outputs[0]["generated_text"][len(prompt):])

Sure, I'd be happy to explain the concept of polygenic risk scores (PRS) in bioinformatics. PRS is a method used to estimate an individual's genetic predisposition to certain traits or diseases based on their genetic information. It takes into account the combined effect of multiple genetic variants that are believed to contribute to a particular trait or disease.  At a high level, PRS works by assigning a score to each genetic variant that is thought to influence the trait or disease in question. These scores are then summed up to give an overall PRS for an individual. The higher the PRS, the greater the likelihood that the individual will exhibit the trait or disease in question.  Now, let's delve into the complexity of PRS a little further. PRS can be calculated using different approaches, including single nucleotide polymorphisms (SNPs) and their associated alleles. SNPs are small variations in the DNA sequence that can occur between individuals. Some SNPs have been identified as h

In [None]:
outputs[0]["generated_text"][len(prompt):]

"Sure, I'd be happy to explain the concept of polygenic risk scores (PRS) in bioinformatics. PRS is a method used to estimate an individual's genetic predisposition to certain traits or diseases based on their genetic information. It takes into account the combined effect of multiple genetic variants that are believed to contribute to a particular trait or disease.  At a high level, PRS works by assigning a score to each genetic variant that is thought to influence the trait or disease in question. These scores are then summed up to give an overall PRS for an individual. The higher the PRS, the greater the likelihood that the individual will exhibit the trait or disease in question.  Now, let's delve into the complexity of PRS a little further. PRS can be calculated using different approaches, including single nucleotide polymorphisms (SNPs) and their associated alleles. SNPs are small variations in the DNA sequence that can occur between individuals. Some SNPs have been identified as 

In [None]:
from datetime import datetime

stime = datetime.now()
messages = [
    {
        "role": "system",
        "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."
    },
    {
        "role": "user",
        "content": "Give me a plan to learn about bioinformatics and personalized medicine in general. Then tell me some possible research directions. My background is mathematics."
    },
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
etime = datetime.now()
print(etime - stime)

0:00:25.124696


In [None]:
print(outputs[0]["generated_text"][len(prompt):])

Sure, here is a learning plan to study bioinformatics and personalized medicine for you:  1. Start by familiarizing yourself with the basic concepts of bioinformatics and personalized medicine. Read introductory texts or online materials that provide an overview of the field.  2. Take courses or attend workshops specifically focused on bioinformatics and personalized medicine. These educational programs will help you acquire the necessary skills and knowledge in these areas.  3. Engage in practical exercises and hands-on training to understand how bioinformatics tools and techniques are applied in real-world research and clinical settings.  4. Seek out research papers, articles, and scientific literature related to bioinformatics and personalized medicine. Read and analyze these papers to deepen your understanding of the latest developments and applications in the field.  5. Collaborate with researchers or professionals already working in bioinformatics and personalized medicine. This 

In [None]:
outputs[0]["generated_text"][len(prompt):]

'Sure, here is a learning plan to study bioinformatics and personalized medicine for you:  1. Start by familiarizing yourself with the basic concepts of bioinformatics and personalized medicine. Read introductory texts or online materials that provide an overview of the field.  2. Take courses or attend workshops specifically focused on bioinformatics and personalized medicine. These educational programs will help you acquire the necessary skills and knowledge in these areas.  3. Engage in practical exercises and hands-on training to understand how bioinformatics tools and techniques are applied in real-world research and clinical settings.  4. Seek out research papers, articles, and scientific literature related to bioinformatics and personalized medicine. Read and analyze these papers to deepen your understanding of the latest developments and applications in the field.  5. Collaborate with researchers or professionals already working in bioinformatics and personalized medicine. This

In [None]:
from datetime import datetime

stime = datetime.now()
messages = [
    {
        "role": "system",
        "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."
    },
    {
        "role": "user",
        "content": "Tell me in-depth about Polygenic Risk Scoring. Give me an example of a complex PRS model. Include mathematics or equation."
    },
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

outputs = pipeline(
    prompt,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
etime = datetime.now()
print(etime - stime)

0:00:28.390530


In [None]:
print(outputs[0]["generated_text"][len(prompt):])

Sure, I'd be happy to explain Polygenic Risk Scoring (PRS) and give an example of a complex PRS model.  Polygenic Risk Scoring is a method used in genetics to assess an individual's risk of developing certain diseases or conditions based on their genetic information. PRS combines the effects of multiple genetic variants, often called alleles, into a single score. This score can then be used to estimate the likelihood of disease occurrence.  An example of a complex PRS model is the Polygenic Risk Score (PRS) for coronary artery disease (CAD). This PRS takes into account multiple genetic variants associated with CAD to estimate an individual's risk of developing the disease. The mathematical equation for this PRS includes terms for different genetic alleles and their corresponding effect sizes, as well as interaction terms to account for potential gene-gene interactions.  To give you a better understanding, let's break down the equation for the PRS-CAD:  PRS-CAD = β0 + β1G1 + β2G2 + β3G3

In [None]:
outputs[0]["generated_text"][len(prompt):]

"Sure, I'd be happy to explain Polygenic Risk Scoring (PRS) and give an example of a complex PRS model.  Polygenic Risk Scoring is a method used in genetics to assess an individual's risk of developing certain diseases or conditions based on their genetic information. PRS combines the effects of multiple genetic variants, often called alleles, into a single score. This score can then be used to estimate the likelihood of disease occurrence.  An example of a complex PRS model is the Polygenic Risk Score (PRS) for coronary artery disease (CAD). This PRS takes into account multiple genetic variants associated with CAD to estimate an individual's risk of developing the disease. The mathematical equation for this PRS includes terms for different genetic alleles and their corresponding effect sizes, as well as interaction terms to account for potential gene-gene interactions.  To give you a better understanding, let's break down the equation for the PRS-CAD:  PRS-CAD = β0 + β1G1 + β2G2 + β3G

## Checking Tokenizer

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# _text = "<|endoftext|> [/INST] <s> </s>"
_text = "What is bioinformatics? tell me about SNP and PRS in bioinformatics and genetics"
inputs = tokenizer([_text], return_tensors="pt", return_attention_mask=False)
# print("inputs               : {}".format(inputs))
print("Batch Decode         : {}".format(tokenizer.batch_decode(inputs['input_ids'])))
print("ids_to_tokens        : {}".format(
    tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
))
print("input_ids length     : {}".format(len(inputs['input_ids'][0])))
print("="*100)

Batch Decode         : ['What is bioinformatics? tell me about SNP and PRS in bioinformatics and genetics']
ids_to_tokens        : ['What', 'Ġis', 'Ġbio', 'informatics', '?', 'Ġtell', 'Ġme', 'Ġabout', 'ĠSNP', 'Ġand', 'ĠPR', 'S', 'Ġin', 'Ġbio', 'informatics', 'Ġand', 'Ġgenetics']
input_ids length     : 17
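
The `Ġ` prefix in the token list is how GPT-2-style byte-level BPE vocabularies mark a leading space. The sketch below uses the tokens printed above to show how they map back to the text and which words were split into more than one piece. The helper names are hypothetical, and the simple string replacement only holds for plain ASCII input like this sentence.

```python
# Token strings copied from the ids_to_tokens output above.
tokens = ['What', 'Ġis', 'Ġbio', 'informatics', '?', 'Ġtell', 'Ġme', 'Ġabout',
          'ĠSNP', 'Ġand', 'ĠPR', 'S', 'Ġin', 'Ġbio', 'informatics', 'Ġand',
          'Ġgenetics']


def to_text(byte_level_tokens):
    """Join byte-level BPE tokens, turning the 'Ġ' marker back into a space."""
    return "".join(byte_level_tokens).replace("Ġ", " ")


def to_words(byte_level_tokens):
    """Group tokens into words: a token starting with 'Ġ' opens a new word."""
    words = []
    for tok in byte_level_tokens:
        if tok.startswith("Ġ") or not words:
            words.append([tok])
        else:
            words[-1].append(tok)
    return words


print(to_text(tokens))
# Words that needed more than one token (trailing punctuation attaches to its word).
print([w for w in to_words(tokens) if len(w) > 1])
```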


In [None]:
_text = "DNA check can determine your ancestry which is good. We probably share some similarity with a cat's DNA"
inputs = tokenizer([_text], return_tensors="pt", return_attention_mask=False)
# print("inputs               : {}".format(inputs))
print("Batch Decode         : {}".format(tokenizer.batch_decode(inputs['input_ids'])))
print("ids_to_tokens        : {}".format(
    tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
))
print("input_ids length     : {}".format(len(inputs['input_ids'][0])))
print("="*100)

Batch Decode         : ["DNA check can determine your ancestry which is good. We probably share some similarity with a cat's DNA"]
ids_to_tokens        : ['DNA', 'Ġcheck', 'Ġcan', 'Ġdetermine', 'Ġyour', 'Ġancestry', 'Ġwhich', 'Ġis', 'Ġgood', '.', 'ĠWe', 'Ġprobably', 'Ġshare', 'Ġsome', 'Ġsimilarity', 'Ġwith', 'Ġa', 'Ġcat', "'s", 'ĠDNA']
input_ids length     : 20


## Compare with Other Tokenizers

In [None]:
!pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [None]:
biogpt_tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
biomistral_tokenizer = AutoTokenizer.from_pretrained("BioMistral/BioMistral-7B")
biomedlm_tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")
biobert_tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
openbiollm_tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
tokenizers = [
    ("biobert", biobert_tokenizer),
    ("biogpt", biogpt_tokenizer),
    ("biomedlm", biomedlm_tokenizer),
    ("biomistral", biomistral_tokenizer),
    ("openbiollm", openbiollm_tokenizer),
]

In [None]:
# Alternative test sentences; only the last, uncommented assignment is used.
# single_sentence = "enzyme Lipid help me in this case"
# single_sentence = "The differential diagnosis of diabetes insipid."
# single_sentence = "Common genetic variants in diabetes and associated complications. Only four SNPs (rs5186, rs1800629, rs1799983, and rs1800795) were found to have association with diabetes, cardiovascular diseases, diabetic nephropathy, diabetic retinopathy, hypertension, inflammation, and kidney diseases"
single_sentence = "chromatography, cytotoxicity, ECG, GATA, Immunohistochemistry, myocardium, nanoparticles, photosynthesis, probiotic, thrombin"

for (name, tokenizer) in tokenizers:
    print("Tokenizer            : {}".format(name))
    inputs = tokenizer([single_sentence], return_tensors="pt", return_attention_mask=False)
    # print("inputs               : {}".format(inputs))
    print("Batch Decode         : {}".format(tokenizer.batch_decode(inputs['input_ids'])))
    # Special tokens are kept so markers such as [CLS] and </s> stay visible.
    print("ids_to_tokens        : {}".format(
        tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    ))
    print("input_ids length     : {}".format(len(inputs['input_ids'][0])))
    print("="*100)

Tokenizer            : biobert
Batch Decode         : ['[CLS] chromatography, cytotoxicity, ECG, GATA, Immunohistochemistry, myocardium, nanoparticles, photosynthesis, probiotic, thrombin [SEP]']
ids_to_tokens        : ['[CLS]', 'ch', '##roma', '##tography', ',', 'c', '##yt', '##oto', '##xi', '##city', ',', 'EC', '##G', ',', 'GA', '##TA', ',', 'I', '##mm', '##uno', '##his', '##to', '##chemistry', ',', 'my', '##oc', '##ard', '##ium', ',', 'na', '##no', '##par', '##tic', '##les', ',', 'photos', '##ynth', '##esis', ',', 'pro', '##biotic', ',', 'th', '##rom', '##bin', '[SEP]']
input_ids length     : 46
Tokenizer            : biogpt
Batch Decode         : ['</s>chromatography, cytotoxicity, ECG, GATA, Immunohistochemistry, myocardium, nanoparticles, photosynthesis, probiotic, thrombin']
ids_to_tokens        : ['</s>', 'chromatography</w>', ',</w>', 'cytotoxicity</w>', ',</w>', 'ECG</w>', ',</w>', 'GATA</w>', ',</w>', 'Immunohistochemistry</w>', ',</w>', 'myocardium</w>', ',</w>', 'nanoparti

OpenBioLLM's Tokenizer:
- chromatography is read as: 'chrom', 'at', 'ography'
- cytotoxicity is read as: 'cyt', 'otoxic', 'ity'
- ECG is read as: 'E', 'CG'
- Immunohistochemistry is read as: 'Immun', 'oh', 'isto', 'chemistry'
- myocardium is read as: 'myocard', 'ium'
- photosynthesis is read as: 'photos', 'ynthesis'
- probiotic is read as: 'prob', 'iotic'
- thrombin is read as: 'throm', 'bin'

BEST TOKENIZERS: BioGPT, BioMedLM
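
One way to put a number on the comparison above is the average count of tokens per whitespace-separated word: the closer to 1, the fewer domain terms are being split. The sketch below is an illustration only. `tokens_per_word` and `char_chunks` are hypothetical names, and `char_chunks` is a toy stand-in so the block runs without downloading any tokenizer.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens per whitespace-separated word in `text`."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def char_chunks(text, size=4):
    """Toy stand-in tokenizer: cut each word into fixed-size character chunks."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


terms = ("chromatography, cytotoxicity, ECG, GATA, Immunohistochemistry, "
         "myocardium, nanoparticles, photosynthesis, probiotic, thrombin")

print("whitespace :", tokens_per_word(str.split, terms))
print("4-char     :", tokens_per_word(char_chunks, terms))

# With the tokenizers loaded above, the same helper could be applied as:
# for name, tok in tokenizers:
#     print(name, tokens_per_word(tok.tokenize, single_sentence))
```

Real tokenizers also emit punctuation as separate tokens, so even a tokenizer that keeps every term whole will score above 1 on this comma-separated list.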

In [None]:
single_sentence = "Common genetic variants in diabetes and associated complications. Only four SNPs (rs5186, rs1800629, rs1799983, and rs1800795) were found to have association with diabetes, cardiovascular diseases, diabetic nephropathy, diabetic retinopathy, hypertension, inflammation, and kidney diseases"

for (name, tokenizer) in tokenizers:
    print("Tokenizer            : {}".format(name))
    inputs = tokenizer([single_sentence], return_tensors="pt", return_attention_mask=False)
    # print("inputs               : {}".format(inputs))
    print("Batch Decode         : {}".format(tokenizer.batch_decode(inputs['input_ids'])))
    # Special tokens are kept so markers such as [CLS] and </s> stay visible.
    print("ids_to_tokens        : {}".format(
        tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    ))
    print("input_ids length     : {}".format(len(inputs['input_ids'][0])))
    print("="*100)

Tokenizer            : biobert
Batch Decode         : ['[CLS] Common genetic variants in diabetes and associated complications. Only four SNPs ( rs5186, rs1800629, rs1799983, and rs1800795 ) were found to have association with diabetes, cardiovascular diseases, diabetic nephropathy, diabetic retinopathy, hypertension, inflammation, and kidney diseases [SEP]']
ids_to_tokens        : ['[CLS]', 'Common', 'genetic', 'variants', 'in', 'diabetes', 'and', 'associated', 'complications', '.', 'Only', 'four', 'S', '##NP', '##s', '(', 'r', '##s', '##51', '##86', ',', 'r', '##s', '##18', '##00', '##6', '##29', ',', 'r', '##s', '##17', '##9', '##9', '##9', '##8', '##3', ',', 'and', 'r', '##s', '##18', '##00', '##7', '##9', '##5', ')', 'were', 'found', 'to', 'have', 'association', 'with', 'diabetes', ',', 'card', '##iovascular', 'diseases', ',', 'di', '##abe', '##tic', 'ne', '##ph', '##rop', '##athy', ',', 'di', '##abe', '##tic', 're', '##tino', '##pathy', ',', 'h', '##yper', '##tens', '##ion', ',',

OpenBioLLM's Tokenizer:
- SNP is not properly recognized
- diabetic retinopathy is read as: 'diabetic', 'ret', 'in', 'opathy'

BEST TOKENIZERS: BioGPT, BioMedLM
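
To see how badly identifiers such as SNP rsIDs fragment, sub-word pieces can be grouped back into the words they came from. The sketch below does this for WordPiece output, where continuation pieces start with `##`, using a slice of the BioBERT tokens printed above rather than OpenBioLLM's. The helper name `group_wordpieces` is hypothetical.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens into words; '##' marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    return words


# Slice of the BioBERT ids_to_tokens output above.
sample = ['S', '##NP', '##s', '(', 'r', '##s', '##51', '##86', ',',
          'r', '##s', '##18', '##00', '##6', '##29', ',']

for pieces in group_wordpieces(sample):
    # Rebuild the surface form by stripping the '##' prefix from continuations.
    word = pieces[0] + "".join(p[2:] for p in pieces[1:])
    print(f"{word:<10} -> {len(pieces)} piece(s)")
```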