# combine BIO tags to spans

save each instance as a jsonl object in gemini 1.5 dataset format for finetuning

In [6]:
system_instruction2 = """You are a highly specialized biomedical Named Entity Recognition (NER) model, fine-tuned to extract and classify biomedical entities in a manner similar to BERN2. Your task is to process an input text and output a single, valid JSON object with annotations for each entity you detect.

**Task:**
Identify and extract mentions of biomedical entities from the input text and classify each one into exactly one of the following eight entity types:
- `cell_line`
- `cell_type`
- `disease`
- `dna`
- `rna`
- `drug_chemical`
- `gene_protein`
- `species`

For every recognized entity, you must capture the following details:
- `"mention"`: The exact text of the entity.
- `"type"`: The entity type (one of the eight types listed above).
- `"begin"`: The starting character index (0-indexed) where the entity begins in the text.
- `"end"`: The ending character index (exclusive), marking the position immediately after the last character of the entity.

**Output Format:**
Your final output must be a single, valid JSON object with a single key `"annotations"`. The value of `"annotations"` is an array (list) of entity objects, each containing only the keys `"mention"`, `"type"`, `"begin"`, and `"end"`. Do not include any additional keys or text.

**Empty Output:**
If no entities are detected, return exactly:
```json
{"annotations": []}
```

**Example:**
For the input:
```
"In this study, we investigated the effects of aspirin on human platelet function, as well as its interactions with the TP53 gene in Homo sapiens cells in cell lines such as NCI-H1975. Tumor necrosis alpha (TNF-alpha) mRNA production was analyzed by polymerase chain reaction amplification."
```
A correct output would be:
```json
{
  "annotations": [
    {
      "mention": "aspirin",
      "type": "drug_chemical",
      "begin": 46,
      "end": 53
    },
    {
      "mention": "human",
      "type": "species",
      "begin": 57,
      "end": 62
    },
    {
      "mention": "human platelet",
      "type": "cell_type",
      "begin": 57,
      "end": 71
    },
    {
      "mention": "TP53 gene",
      "type": "gene_protein",
      "begin": 119,
      "end": 128
    },
    {
      "mention": "Homo sapiens",
      "type": "species",
      "begin": 132,
      "end": 144
    },
    {
      "mention": "NCI-H1975",
      "type": "cell_line",
      "begin": 173,
      "end": 182
    },
    {
      "mention": "Tumor necrosis alpha (TNF-alpha) mRNA",
      "type": "rna",
      "begin": 184,
      "end": 221
    },
    {
      "mention": "Tumor necrosis alpha",
      "type": "gene",
      "begin": 184,
      "end": 204
    },
    {
      "mention": "TNF-alpha",
      "type": "rna",
      "begin": 206,
      "end": 215
    }
  ]
}
```

**Requirements:**
- All output must be strictly valid JSON.
- Only use the specified keys: `"annotations"`, `"mention"`, `"type"`, `"begin"`, and `"end"`.
- The `"begin"` and `"end"` indices must exactly match the positions of the entity in the input text.
- Output only the JSON object without any additional explanations or formatting.
"""

system_instruction = """You are a highly specialized biomedical Named Entity Recognition (NER) model. Your task is to accurately identify, extract mentions of biomedical entities from text, and classify them into one of the eight types:

1. cell_line
2. cell_type
3. disease
4. dna
5. rna
6. drug_chemical
7. gene_protein
8. species

For each identified entity, you must output a JSON object containing these four keys:

1. "entity": The extracted entity text.
2. "type": The entity type from the list above.
3. "begin": The starting character index of the entity in the input text (0-indexed).
4. "end": The ending character index of the entity in the input text (exclusive, i.e., the index of the character immediately following the entity).

Your output should be a JSON list of these objects, where each object represents a single entity extracted from the input text. If no entities are found, return an empty JSON list.

If no entities are found in the text, return an empty json list:
```json
{"entities": []}

Your output must be in a valid JSON format with this structure:
```json
{
  "entities": [
    {
      "entity": "extracted_entity_text",
      "type": "entity_type",
      "begin": start_index,
      "end": end_index
    },
    ...
  ]
}
```

Do not include any additional commentary or text in your output. Ensure that the JSON is well-formed and follows exactly the format specified above.

Example:

Input Text:
"In this study, we investigated the effects of aspirin on human platelet function, as well as its interactions with the TP53 gene in Homo sapiens cells in cell lines such as NCI-H1975. Tumor necrosis alpha (TNF-alpha) mRNA production was analyzed by polymerase chain reaction amplification."

Output JSON:
```json
[
  {
    "text": "aspirin",
    "type": "drug_chemical",
    "begin": 46,
    "end": 53
  },
  {
    "text": "human",
    "type": "species",
    "begin": 57,
    "end": 62
  },
  {
    "text": "human platelet",
    "type": "cell_type",
    "begin": 57,
    "end": 71
  },
  {
    "text": "TP53 gene",
    "type": "gene_protein",
    "begin": 119,
    "end": 128
  },
  {
    "text": "Homo sapiens",
    "type": "species",
    "begin": 132,
    "end": 144
  },
  {
    "text": "NCI-H1975",
    "type": "cell_line",
    "begin": 173,
    "end": 182
  },
  {
    "text": "Tumor necrosis alpha (TNF-alpha) mRNA",
    "type": "rna",
    "begin": 184,
    "end": 221
  },
  {
    "text": "Tumor necrosis alpha",
    "type": "gene",
    "begin": 184,
    "end": 204
  },
  {
    "text": "TNF-alpha",
    "type": "rna",
    "begin": 206,
    "end": 215
  }
]
```

Input Text:
{input_text}
"""
print(system_instruction)

Your goal is to identify and classify entities into one of the following eight types:

- cell_line
- cell_type
- disease
- dna
- rna
- drug_chemical
 gene_protein
 species

For each identified entity, you must output a JSON object containing the following keys:

 "entity": The extracted entity text.
 "type": The entity type from the list above.
 "begin": The starting character index of the entity in the input text (0-indexed).
- "end": The ending character index of the entity in the input text (exclusive, i.e., the index of the character immediately following the entity).

Your output should be a JSON list of these objects, where each object represents a single entity extracted from the input text. If no entities are found, return an empty JSON list.

If no entities are found in the text, return an empty list as follows:
```json
{"entities": []}

Your output must be in a valid JSON format with the following structure:
```json
{
  "entities": [
    {
      "entity": "extracted_entity_te

In [12]:
import re
import json

def detokenize(text):
    # Remove extra space after opening punctuation such as ( [ { <
    text = re.sub(r'([\(\[\{<])\s+', r'\1', text)
    # Remove extra space before closing punctuation such as ) ] } >
    text = re.sub(r'\s+([\)\]\}>])', r'\1', text)
    # Remove extra space before punctuation that should attach to the previous token (e.g., . , ; : ! ?)
    text = re.sub(r'\s+([.,;:!?])', r'\1', text)
    # Remove spaces around hyphens (e.g., "tumour - suppressor" -> "tumour-suppressor")
    text = re.sub(r'\s*-\s*', '-', text)
    # Remove spaces around slashes (e.g., "axin / conductin" -> "axin/conductin")
    text = re.sub(r'\s*/\s*', r'/', text)
    # Remove spaces around percent signs and equals signs
    text = re.sub(r'\s*%\s*', '%', text)
    text = re.sub(r'\s*=\s*', '=', text)
    # Remove spaces between number decimals (e.g., "3 . 14" -> "3.14")
    text = re.sub(r'(\d)\s*\.\s*(\d)', r'\1.\2', text)
    return text

def process_sentence(token_tag_pairs, entity_label):
    """
    Given a list of (token, tag) tuples for a sentence,
    returns the detokenized sentence text and a list of extracted entities.
    Each entity is a dict with keys "text", "label", "begin", and "end".
    """
    # Rebuild the sentence text by joining tokens then detokenize.
    tokens = [tok for tok, tag in token_tag_pairs]
    raw_sentence = " ".join(tokens)
    sentence_text = detokenize(raw_sentence)

    entities = []
    i = 0
    search_start = 0  # pointer for searching entity in sentence_text
    while i < len(token_tag_pairs):
        token, tag = token_tag_pairs[i]
        # Check if token marks beginning of an entity.
        # Support both simple "B" and extended "B-<label>" formats.
        if tag.startswith("B"):
            # Extract label if given (e.g., "B-drug_chemical"); otherwise, default to "entity"
            if '-' in tag:
                _, label = tag.split('-', 1)
            else:
                label = entity_label
            entity_tokens = [token]
            i += 1
            # Collect subsequent tokens with tag starting with "I"
            while i < len(token_tag_pairs) and token_tag_pairs[i][1].startswith("I"):
                entity_tokens.append(token_tag_pairs[i][0])
                i += 1
            # Build the entity text in the same way as the sentence
            raw_entity = " ".join(entity_tokens)
            entity_text = detokenize(raw_entity)
            # Find the entity substring in the sentence text starting from search_start
            pos = sentence_text.find(entity_text, search_start)
            if pos == -1:
                # If not found, you might want to handle this case differently.
                pos = 0
            begin = pos
            end = pos + len(entity_text)
            entities.append({
                "text": entity_text,
                "label": label,
                "begin": begin,
                "end": end
            })
            search_start = end  # update pointer so that subsequent searches start after this entity
        else:
            i += 1
    return sentence_text, entities

def parse_bio_file(file_path):
    """
    Reads a BIO-tagged file with sentences separated by blank lines.
    Returns a list where each element is a list of (token, tag) tuples.
    """
    sentences = []
    current = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            stripped = line.strip()
            if not stripped:
                if current:
                    sentences.append(current)
                    current = []
            else:
                # Expecting each non-blank line to have "token tag"
                parts = stripped.split(maxsplit=1)
                if parts:
                    token = parts[0]
                    tag = parts[1] if len(parts) > 1 else "O"
                    current.append((token, tag))
        if current:
            sentences.append(current)
    return sentences

def create_json_object(sentence_text, entities):
    """
    Creates the JSON object (as a Python dict) for one sentence,
    where the first "parts" contains the sentence text and the second
    contains a string-escaped JSON array of entities.
    """
    # Dump the entities as a pretty JSON string (you can remove indent if desired)
    entities_json_str = json.dumps(entities, indent=2)
    json_obj = {
        "systemInstruction": {
            "role": "system",
            "parts": [
                {
                    "text": system_instruction
                }
            ]
        },
        "contents": [
            {
                "role": "user",
                "parts": [
                    {
                        "text": sentence_text
                    }
                ]
            },
            {
                "role": "model",
                "parts": [
                    {
                        "text": entities_json_str
                    }
                ]
            }
        ]
    }
    return json_obj

def main(input_file, output_file):
    sentences_token_tags = parse_bio_file(input_file)
    with open(output_file, 'w', encoding='utf-8') as out_f:
        for token_tag_pairs in sentences_token_tags:
            sentence_text, entities = process_sentence(token_tag_pairs)
            json_obj = create_json_object(sentence_text, entities)
            # Write each JSON object as one line
            out_f.write(json.dumps(json_obj) + "\n")

# Example usage:
if __name__ == "__main__":
    input_file_path = './NERdata/NCBI-disease/test.txt'  # Replace with your BIO-tagged file path
    output_file_path = './NERdata/NCBI-disease/test_NCBI-disease.jsonl'  # Output JSONL file
    # main(input_file_path, output_file_path)


In [12]:
import json

# print and verify json array containing entity in gemini 1.5 dataset format
def verify_entities(jsonl_file):
    with open(jsonl_file, 'r', encoding='utf-8') as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                # Load the JSON object from the current line
                json_obj = json.loads(line)
                # Navigate to the second "parts" under "contents"
                # The expected path is: json_obj["contents"][1]["parts"][0]["text"]
                entity_json_str = json_obj["contents"][1]["parts"][0]["text"]
                # Parse the entities JSON string into a Python list/dict
                entities = json.loads(entity_json_str)
                print(f"Entities for JSON object on line {line_number}:")
                print(json.dumps(entities, indent=2))
            except Exception as e:
                print(f"Error processing line {line_number}: {e}")

if __name__ == "__main__":
    jsonl_file = "./NERdata/NCBI-disease/test_NCBI-disease.jsonl"  # Replace with the path to your JSONL file if needed.
    verify_entities(jsonl_file)


Entities for JSON object on line 1:
[
  {
    "text": "ataxia-telangiectasia",
    "label": "disease",
    "begin": 40,
    "end": 61
  },
  {
    "text": "sporadic T-cell leukaemia",
    "label": "disease",
    "begin": 72,
    "end": 97
  }
]
Entities for JSON object on line 2:
[
  {
    "text": "Ataxia-telangiectasia",
    "label": "disease",
    "begin": 0,
    "end": 21
  },
  {
    "text": "A-T",
    "label": "disease",
    "begin": 23,
    "end": 26
  },
  {
    "text": "recessive multi-system disorder",
    "label": "disease",
    "begin": 33,
    "end": 64
  }
]
Entities for JSON object on line 3:
[
  {
    "text": "cancer",
    "label": "disease",
    "begin": 12,
    "end": 18
  },
  {
    "text": "lymphoid neoplasias",
    "label": "disease",
    "begin": 31,
    "end": 50
  },
  {
    "text": "A-T",
    "label": "disease",
    "begin": 81,
    "end": 84
  }
]
Entities for JSON object on line 4:
[
  {
    "text": "tumour",
    "label": "disease",
    "begin": 13,
    "end":

In [14]:
import os

# Utility function to process a BIO file and write its contents as JSONL.
def process_bio_file_to_jsonl(input_file, output_file, entity_label, limit=None):
    sentences_token_tags = parse_bio_file(input_file)
    if limit is not None:
        sentences_token_tags = sentences_token_tags[:limit]
    with open(output_file, 'w', encoding='utf-8') as out_f:
        for token_tag_pairs in sentences_token_tags:
            sentence_text, entities = process_sentence(token_tag_pairs, entity_label)
            json_obj = create_json_object(sentence_text, entities)
            out_f.write(json.dumps(json_obj) + "\n")

datasets = {
  "cell_line": {
    "path": "./NERdata/JNLPBA-cl/"
  },
  "cell_type": {
    "path": "./NERdata/JNLPBA-ct/"
  },
  "disease": {
    "path": "./NERdata/NCBI-disease/"
  },
  "dna": {
    "path": "./NERdata/JNLPBA-dna/"
  },
  "rna": {
    "path": "./NERdata/JNLPBA-rna"
  },
  "drug_chemical": {
    "path": "./NERdata/BC4CHEMD/"
  },
  "gene_protein": {
    "path": "./NERdata/BC2GM/"
  },
  "species": {
    "path": "./NERdata/linnaeus/"
  }
}

# Process each dataset
for entity_type, info in datasets.items():
    root_path = info["path"]
    train_file = os.path.join(root_path, "train.txt")
    test_file = os.path.join(root_path, "test.txt")

    # Define output file names
    train_output =  os.path.join(root_path, f"train_{entity_type}.jsonl")
    test_output =  os.path.join(root_path, f"test_{entity_type}.jsonl")
    test256_output =  os.path.join(root_path, f"test256_{entity_type}.jsonl")

    print(f"Processing dataset for entity type: {entity_type}")

    # Process train file
    if os.path.exists(train_file):
        process_bio_file_to_jsonl(train_file, train_output, entity_type)
        print(f"  Written {train_output}")
    else:
        print(f"  Train file not found: {train_file}")

    # Process test file
    if os.path.exists(test_file):
        process_bio_file_to_jsonl(test_file, test_output, entity_type)
        print(f"  Written {test_output}")
        # Process test file with a limit of 256 instances for test256 file
        process_bio_file_to_jsonl(test_file, test256_output, entity_type, limit=256)
        print(f"  Written {test256_output}")
    else:
        print(f"  Test file not found: {test_file}")

Processing dataset for entity type: cell_line
  Written ./NERdata/JNLPBA-cl/train_cell_line.jsonl
  Written ./NERdata/JNLPBA-cl/test_cell_line.jsonl
  Written ./NERdata/JNLPBA-cl/test256_cell_line.jsonl
Processing dataset for entity type: cell_type
  Written ./NERdata/JNLPBA-ct/train_cell_type.jsonl
  Written ./NERdata/JNLPBA-ct/test_cell_type.jsonl
  Written ./NERdata/JNLPBA-ct/test256_cell_type.jsonl
Processing dataset for entity type: disease
  Written ./NERdata/NCBI-disease/train_disease.jsonl
  Written ./NERdata/NCBI-disease/test_disease.jsonl
  Written ./NERdata/NCBI-disease/test256_disease.jsonl
Processing dataset for entity type: dna
  Written ./NERdata/JNLPBA-dna/train_dna.jsonl
  Written ./NERdata/JNLPBA-dna/test_dna.jsonl
  Written ./NERdata/JNLPBA-dna/test256_dna.jsonl
Processing dataset for entity type: rna
  Written ./NERdata/JNLPBA-rna/train_rna.jsonl
  Written ./NERdata/JNLPBA-rna/test_rna.jsonl
  Written ./NERdata/JNLPBA-rna/test256_rna.jsonl
Processing dataset for ent

In [15]:
def combine_files(file_prefix, output_filename):
    combined_lines = []
    for entity_type, info in datasets.items():
        file_path = os.path.join(info["path"], f"{file_prefix}_{entity_type}.jsonl")
        if os.path.exists(file_path):
            print(f"Reading: {file_path}")
            with open(file_path, 'r', encoding='utf-8') as f:
                combined_lines.extend(f.readlines())
        else:
            print(f"Warning: File not found: {file_path}")

    with open(output_filename, 'w', encoding='utf-8') as out_f:
        out_f.writelines(combined_lines)
    print(f"Combined file written: {output_filename}")

def main():
    # Combine train files into one file
    combine_files("train", "./NERdata/train_combined_NERdata.jsonl")
    # Combine test files into one file
    combine_files("test", "./NERdata/test_combined_NERdata.jsonl")

if __name__ == "__main__":
    main()

Reading: ./NERdata/JNLPBA-cl/train_cell_line.jsonl
Reading: ./NERdata/JNLPBA-ct/train_cell_type.jsonl
Reading: ./NERdata/NCBI-disease/train_disease.jsonl
Reading: ./NERdata/JNLPBA-dna/train_dna.jsonl
Reading: ./NERdata/JNLPBA-rna/train_rna.jsonl
Reading: ./NERdata/BC4CHEMD/train_drug_chemical.jsonl
Reading: ./NERdata/BC2GM/train_gene_protein.jsonl
Reading: ./NERdata/linnaeus/train_species.jsonl
Combined file written: ./NERdata/train_combined_NERdata.jsonl
Reading: ./NERdata/JNLPBA-cl/test_cell_line.jsonl
Reading: ./NERdata/JNLPBA-ct/test_cell_type.jsonl
Reading: ./NERdata/NCBI-disease/test_disease.jsonl
Reading: ./NERdata/JNLPBA-dna/test_dna.jsonl
Reading: ./NERdata/JNLPBA-rna/test_rna.jsonl
Reading: ./NERdata/BC4CHEMD/test_drug_chemical.jsonl
Reading: ./NERdata/BC2GM/test_gene_protein.jsonl
Reading: ./NERdata/linnaeus/test_species.jsonl
Combined file written: ./NERdata/test_combined_NERdata.jsonl


In [16]:
import os
import random

def sample_test_files(output_filename, total_samples=256):
    # Calculate how many samples per entity type.
    num_entity_types = len(datasets)
    samples_per_entity = total_samples // num_entity_types
    combined_samples = []

    for entity_type, info in datasets.items():
        test_file = os.path.join(info['path'], f'test_{entity_type}.jsonl')
        if not os.path.exists(test_file):
            print(f'Warning: File {test_file} not found.')
            continue

        # Read all JSON objects (one per line) from the file.
        with open(test_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        # If the file has fewer lines than required, take all; otherwise, sample without replacement.
        if len(lines) < samples_per_entity:
            print(f'Warning: {test_file} has only {len(lines)} lines; using all of them.')
            sampled = lines
        else:
            sampled = random.sample(lines, samples_per_entity)

        combined_samples.extend(sampled)
        print(f'Sampled {len(sampled)} objects from {test_file}')

    # Optionally, you can shuffle the combined samples to mix entities.
    random.shuffle(combined_samples)

    # Write the combined samples to the output file.
    with open(output_filename, 'w', encoding='utf-8') as out_f:
        out_f.writelines(combined_samples)
    print(f'Combined file written: {output_filename}')

if __name__ == '__main__':
    sample_test_files('./NERdata/test256_combined_NERdata.jsonl', total_samples=256)


Sampled 32 objects from ./NERdata/JNLPBA-cl/test_cell_line.jsonl
Sampled 32 objects from ./NERdata/JNLPBA-ct/test_cell_type.jsonl
Sampled 32 objects from ./NERdata/NCBI-disease/test_disease.jsonl
Sampled 32 objects from ./NERdata/JNLPBA-dna/test_dna.jsonl
Sampled 32 objects from ./NERdata/JNLPBA-rna/test_rna.jsonl
Sampled 32 objects from ./NERdata/BC4CHEMD/test_drug_chemical.jsonl
Sampled 32 objects from ./NERdata/BC2GM/test_gene_protein.jsonl
Sampled 32 objects from ./NERdata/linnaeus/test_species.jsonl
Combined file written: ./NERdata/test256_combined_NERdata.jsonl


# [Dataset format](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune_gemini/tune-gemini-learn#dataset_format)


```jsonl
{
  "systemInstruction": {
    "parts": [
      {
        "text": "(see prompt cell)"
      }
    ]
  },
  "contents": [
    {
      "parts": [
        {
          "text": "In this study, we investigated the effects of aspirin on human platelet function, as well as its interactions with the TP53 gene in Homo sapiens cells in cell lines such as NCI-H1975."
        }
      ]
    },
    {
      "parts": [
        {
          "text": "[\n  {\n    \"text\": \"aspirin\",\n    \"label\": \"drug_chemical\",\n    \"begin\": 46,\n    \"end\": 53\n  },\n  {\n    \"text\": \"human\",\n    \"label\": \"species\",\n    \"begin\": 57,\n    \"end\": 62\n  },\n  {\n    \"text\": \"platelet\",\n    \"label\": \"cell_type\",\n    \"begin\": 63,\n    \"end\": 71\n  },\n  {\n    \"text\": \"TP53\",\n    \"label\": \"gene_protein\",\n    \"begin\": 119,\n    \"end\": 123\n  },\n  {\n    \"text\": \"Homo sapiens\",\n    \"label\": \"species\",\n    \"begin\": 132,\n    \"end\": 144\n  },\n  {\n    \"text\": \"NCI-H1975\",\n    \"type\": \"cell_line\",\n    \"begin\": 173,\n    \"end\": 183\n  }\n]"
        }
      ]
    },
  ]
}

# Prompts
You are a highly specialized biomedical Named Entity Recognition (NER) model. Your task is to accurately identify and extract mentions of biomedical entities from text.

Your goal is to identify and classify entities into one of the following eight types:

- cell_line
- cell_type
- disease
- dna
- rna
- drug_chemical
- gene_protein
- species

For each identified entity, you must output a JSON object containing the following keys:

- "entity": The extracted entity text.
- "type": The entity type from the list above.
- "begin": The starting character index of the entity in the input text (0-indexed).
- "end": The ending character index of the entity in the input text (exclusive, i.e., the index of the character immediately following the entity).

Your output should be a JSON list of these objects, where each object represents a single entity extracted from the input text. If no entities are found, return an empty JSON list.

If no entities are found in the text, return an empty list as follows:
```json
{"entities": []}

Your output must be in a valid JSON format with the following structure:
```json
{
  "entities": [
    {
      "entity": "extracted_entity_text",
      "type": "entity_type",
      "begin": start_index,
      "end": end_index
    },
    ...
  ]
}

Do not include any additional commentary or text in your output. Ensure that the JSON is well-formed and follows exactly the format specified above.

Example:

Input Text:
"In this study, we investigated the effects of aspirin on human platelet function, as well as its interactions with the TP53 gene in Homo sapiens cells in cell lines such as NCI-H1975."

Output JSON:
```json
[
  {
    "text": "aspirin",
    "label": "drug_chemical",
    "begin": 46,
    "end": 53
  },
  {
    "text": "human",
    "label": "species",
    "begin": 57,
    "end": 62
  },
  {
    "text": "platelet",
    "label": "cell_type",
    "begin": 63,
    "end": 71
  },
  {
    "text": "TP53",
    "label": "gene_protein",
    "begin": 119,
    "end": 123
  },
  {
    "text": "Homo sapiens",
    "label": "species",
    "begin": 132,
    "end": 144
  },
  {
    "text": "NCI-H1975",
    "type": "cell_line",
    "begin": 173,
    "end": 183
  }
]

# [Create a tuning job](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning#create_a_text_model_supervised_tuning_job)

 - [Gemini API: Tuning Quickstart with Python](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Tuning.ipynb)

[Supervised Fine Tuning for Gemini: A best practices guide](https://cloud.google.com/blog/products/ai-machine-learning/master-gemini-sft)

Gemini 1.5 Flash

Text fine-tuning: with a dataset size of <1000 examples and average context length <500, we recommend setting epochs = default, learning rate multiplier = 10 and adapter size = 4. With a dataset size >= 1000 examples or average context length >= 500, we recommend epochs = default, learning rate multiplier = default and adapter size = 8.

Gemini 1.5 Pro

Text fine-tuning: with a dataset size of <1000 examples and average context length <500, we recommend setting epochs = 20, learning rate multiplier = 10, adapter size = 4. With a dataset size >= 1000 examples or average context length >= 500, we recommend epochs = 10, learning rate multiplier = default or 5, adapter size = 4.

- https://anaconda.org/conda-forge/google-generativeai

In [None]:
from google import genai
from google.genai import types

PROJECT_ID = ""
LOCATION = "us-east4"

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# list base models
for model_info in client.models.list():
    print(model_info.name)
# list tuned models
for model in client.models.list(config={'page_size': 10, 'query_base': False}):
    print(model)

In [None]:
!gsutil cp ./train_combined_NERdata.jsonl gs://bio-ner_finetuning/train_combined_NERdata.jsonl
!gsutil cp ./test256_combined_NERdata.jsonl gs://bio-ner_finetuning/test256_combined_NERdata.jsonl

In [24]:
if client.vertexai:
    training_dataset = types.TuningDataset(
        gcs_uri='gs://bio-ner_finetuning/train_combined_NERdata.jsonl',
    )
    validation_dataset = types.TuningValidationDataset(
        gcs_uri='gs://bio-ner_finetuning/test256_combined_NERdata.jsonl'
    )

    tuning_job = client.tunings.tune(
      base_model='gemini-1.5-flash-002',
      training_dataset=training_dataset,
      config=types.CreateTuningJobConfig(
          # epoch_count= 0, default
          # learning_rate_multiplier=1, default
          description='Fine tune on cell_line, cell_type, disease, dna, rna, drug_chemical, gene_protein, species',
          adapter_size='ADAPTER_SIZE_EIGHT',
          tuned_model_display_name='bioNER-1.5-flash-002_combined',
          validation_dataset=validation_dataset
    )
)

In [26]:
if client.vertexai:
    training_dataset = types.TuningDataset(
        gcs_uri='gs://bio-ner_finetuning/train_combined_NERdata.jsonl',
    )
    validation_dataset = types.TuningValidationDataset(
        gcs_uri='gs://bio-ner_finetuning/test256_combined_NERdata.jsonl'
    )

    tuning_job = client.tunings.tune(
      base_model='gemini-1.5-pro-002',
      training_dataset=training_dataset,
      config=types.CreateTuningJobConfig(
          epoch_count= 10,
          # learning_rate_multiplier=1, default
          description='Fine tune on cell_line, cell_type, disease, dna, rna, drug_chemical, gene_protein, species',
          adapter_size='ADAPTER_SIZE_FOUR',
          tuned_model_display_name='bioNER-1.5-pro-002_combined',
          validation_dataset=validation_dataset
    )
)

# References
1. [BERN2: an advanced neural biomedical named entity recognition and normalization tool](https://arxiv.org/abs/2201.02080)
1. [PromptNER: Prompting For Named Entity Recognition](https://arxiv.org/abs/2305.15444)
1. [GPT-NER: Named Entity Recognition via Large Language Models](https://arxiv.org/abs/2304.10428)

