# Fine-tune a Gemini and a Gemma model

+ Dataset: [Guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
+ Prepare the data: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning-prepare
+ Gemma: https://huggingface.co/google/gemma-2-27b-it-pytorch
  - Also https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma2
  - Also https://www.kaggle.com/models/google/gemma-2
+ Gemini: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning

## Step 0. Install and import libraries

In [1]:
%%writefile requirements.txt
datasets
pandas
torch
google-cloud-aiplatform
google-cloud-storage
jsonschema

Writing requirements.txt


In [None]:
!pip install --upgrade -r requirements.txt

In [None]:
!pip install datasets

In [9]:
!pip install jsonschema



In [54]:
import json
import pandas as pd
import time
import torch
from datasets import load_dataset
from jsonschema import validate
from jsonschema.protocols import Validator

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.tuning import sft

from google.cloud import storage
from google.cloud import aiplatform

In [4]:
!python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"

README.md: 100%|███████████████████████████| 7.62k/7.62k [00:00<00:00, 19.5MB/s]
train-00000-of-00001.parquet: 100%|█████████| 14.5M/14.5M [00:00<00:00, 149MB/s]
validation-00000-of-00001.parquet: 100%|████| 1.82M/1.82M [00:00<00:00, 295MB/s]
Generating train split: 100%|██| 87599/87599 [00:00<00:00, 103689.07 examples/s]
Generating validation split: 100%|█| 10570/10570 [00:00<00:00, 302876.11 example
{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeare

## Step 1. Transform dataset for Vertex

Guanaco dataset shape:

```json
{
    "text": "### Human: blah blah.### Assistant: blah blah.### Human: blah blah blah"
}
```

Vertex tuning dataset shape, taken from the [dataset preparation guide](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning-prepare).
The `systemInstruction` field is optional and not needed for this exercise.

```json
{
  "systemInstruction": {
    "role": string,
    "parts": [
      {
        "text": string
      }
    ]
  },
  "contents": [
    {
      "role": string, // must be "user" or "model"
      "parts": [
        {
          // Union field data can be only one of the following:
          "text": string,
          "fileData": {
            "mimeType": string,
            "fileUri": string
          }
        }
      ]
    }
  ]
}
```
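
To make the shape above concrete, here is a minimal sketch that builds one text-only record and checks it by hand. The helper name `check_record` is hypothetical, it ignores `systemInstruction` and `fileData`, and treating empty lists as errors is an assumption of this sketch rather than a documented rule.

```python
import json

# Roles allowed in "contents", per the shape above.
ALLOWED_ROLES = {"user", "model"}

def check_record(record):
    """Return a list of problems found in one text-only tuning record."""
    contents = record.get("contents") if isinstance(record, dict) else None
    if not isinstance(contents, list) or not contents:
        return ["'contents' must be a non-empty list"]

    problems = []
    for i, turn in enumerate(contents):
        if not isinstance(turn, dict):
            problems.append(f"contents[{i}] must be an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"contents[{i}].role must be 'user' or 'model'")
        parts = turn.get("parts")
        if not isinstance(parts, list) or not parts:
            problems.append(f"contents[{i}].parts must be a non-empty list")
            continue
        for j, part in enumerate(parts):
            if not isinstance(part, dict) or not isinstance(part.get("text"), str):
                problems.append(f"contents[{i}].parts[{j}].text must be a string")
    return problems

example = {
    "contents": [
        {"role": "user", "parts": [{"text": "Hello"}]},
        {"role": "model", "parts": [{"text": "Hi there"}]},
    ]
}
print(json.dumps(example))    # one line of the JSONL tuning file
print(check_record(example))  # an empty list means no problems were found
```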

Here is a pseudocode transform for the dataset:

1. Load a row from the Guanaco dataset.
1. Read the `text` field.
1. Split the `text` field on the `### ` delimiter and skip any empty strings.
1. For each substring, read the speaker label, which is the text before the first `: `. The remainder is the message.
1. If the label is "Human", create a new dictionary like so:

   ```json
   {
      "role": "user",
      "parts": [
          {
              "text": "[REMAINDER OF SPLIT]"
          }
      ]
   }
   ```
1. If the label is "Assistant", create a new dictionary like so:

   ```json
   {
      "role": "model",
      "parts": [
          {
              "text": "[REMAINDER OF SPLIT]"
          }
      ]
   }
   ```
1. Append each dictionary to a list.
1. Create one last new dictionary and set the `contents` field like so:

   ```json
   {
       "contents": [TEXT DICTIONARIES]
   }
   ```
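
The steps above can be sketched as a single function. This is an illustration, not the notebook's final code: the name `guanaco_to_vertex` and the `ROLE_MAP` table are hypothetical, and raising `ValueError` for an unknown speaker label is an assumption about how bad rows should be handled.

```python
# Hypothetical mapping from Guanaco speaker labels to Vertex roles.
ROLE_MAP = {"Human": "user", "Assistant": "model"}

def guanaco_to_vertex(text):
    """Convert one Guanaco `text` value into a Vertex tuning record."""
    contents = []
    for chunk in text.split("### "):
        if not chunk.strip():
            continue  # the text starts with the delimiter, so the first chunk is empty
        # Split on the first ": " only, so colons inside the message survive.
        speaker, sep, message = chunk.partition(": ")
        if not sep or speaker not in ROLE_MAP:
            raise ValueError(f"unrecognized turn: {chunk[:40]!r}")
        contents.append({"role": ROLE_MAP[speaker], "parts": [{"text": message}]})
    return {"contents": contents}

sample = "### Human: What is 2 + 2?### Assistant: 4.### Human: Thanks: that helps"
record = guanaco_to_vertex(sample)
print([turn["role"] for turn in record["contents"]])
```

Using `partition` rather than a plain `split(": ")` keeps a message such as "Thanks: that helps" intact instead of failing on the second colon.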

In [46]:
# Create the schema validation
# NOTE: This is only a partial schema for this data transform purpose.
schema = {
    "type": "object",
    "properties": {
        "contents": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "role": { "type": "string" },
                    "parts": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "text": { "type": "string" }
                            }
                        }
                    }
                }
            }
        }
    }
}
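# jsonschema.protocols.Validator only describes the validator interface, so
# check the schema against a concrete validator class instead.
from jsonschema import Draft202012Validator
Draft202012Validator.check_schema(schema)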

In [None]:
# Pandas
splits = {'train': 'openassistant_best_replies_train.jsonl', 'test': 'openassistant_best_replies_eval.jsonl'}
df = pd.read_json("hf://datasets/timdettmers/openassistant-guanaco/" + splits["train"], lines=True)

In [6]:
# HuggingFace datasets
ds = load_dataset("timdettmers/openassistant-guanaco")

README.md:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


openassistant_best_replies_train.jsonl:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

openassistant_best_replies_eval.jsonl:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

In [8]:
print(ds['train'][0])

{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining po

### Step 1a. Develop transform on single row

In [20]:
test_row = ds['train'][0]

In [24]:
row_text = test_row['text']
parts_text = row_text.split('### ')
print(parts_text)

['', 'Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.', 'Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lea

In [27]:
parts = []
for p in parts_text:
    if p == '':
        continue
    
    # Split on the first ": " only, so colons inside the message are preserved.
    role, content = p.split(": ", 1)
    if role == "Human":
        parts.append({
            "role": "user",
            "parts": [{
                "text": content
            }],
        })
        continue
    parts.append({
        "role": "model",
        "parts": [{
            "text": content
        }],
    })
print(parts)

[{'role': 'user', 'parts': [{'text': 'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.'}]}, {'role': 'model', 'parts': [{'text': '"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wage

In [28]:
clean_row = {
    "contents": parts
}
validate(clean_row, schema)

In [29]:
print(json.dumps(clean_row))

{"contents": [{"role": "user", "parts": [{"text": "Can you write a short introduction about the relevance of the term \"monopsony\" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research."}]}, {"role": "model", "parts": [{"text": "\"Monopsony\" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers oft

### Step 1b. Apply transform to all rows

In [48]:
from jsonschema.exceptions import ValidationError

OUTPUT_FILE = 'guanaco_vertex_tune.jsonl'
REJECTS_FILE = 'guanaco_rejects.jsonl'

# Open both files once in write mode so that re-running the cell does not
# append duplicate rows.
with open(OUTPUT_FILE, 'w') as f, open(REJECTS_FILE, 'w') as rf:
    for r in ds['train']:
        try:
            parts = []
            for p in r['text'].split('### '):
                if p == '':
                    continue

                # Split on the first ": " only; a turn without one raises ValueError.
                role, content = p.split(": ", 1)
                parts.append({
                    "role": "user" if role == "Human" else "model",
                    "parts": [{
                        "text": content
                    }],
                })

            clean_row = {
                "contents": parts
            }
            validate(clean_row, schema)
            f.write(f"{json.dumps(clean_row)}\n")
        except (ValueError, ValidationError):
            # Keep rows that fail to parse or validate for later inspection.
            rf.write(f"{json.dumps(r)}\n")
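
Before uploading, it can help to confirm that the output file parses line by line and has the expected mix of roles. The sketch below is one way to do that; `summarize_jsonl` is a hypothetical helper and the `demo` lines are made-up stand-ins for the real file.

```python
import json
from collections import Counter

def summarize_jsonl(lines):
    """Count records and turns per role in an iterable of JSONL lines."""
    records = 0
    roles = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines, such as a trailing newline
        record = json.loads(line)  # raises ValueError on a malformed line
        records += 1
        roles.update(turn["role"] for turn in record["contents"])
    return records, dict(roles)

# Made-up lines in the same shape as the output file.
demo = [
    '{"contents": [{"role": "user", "parts": [{"text": "Hi"}]}, '
    '{"role": "model", "parts": [{"text": "Hello"}]}]}\n',
    '{"contents": [{"role": "user", "parts": [{"text": "Bye"}]}]}\n',
]
print(summarize_jsonl(demo))
```

Because the function takes any iterable of lines, the real file can be checked with `with open(OUTPUT_FILE) as f: print(summarize_jsonl(f))`.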

## Step 2. Upload JSONL file to GCS


In [None]:
PROJECT_ID = !gcloud config get-value project
PROJECT_ID = PROJECT_ID[0]
print(PROJECT_ID)

In [50]:
gcs_bucket = f"{PROJECT_ID}-bucket"
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.bucket(gcs_bucket)

In [51]:
blob = bucket.blob(OUTPUT_FILE)
blob.upload_from_filename(OUTPUT_FILE)

## Step 3. Create a tuning job from API

Creating the tuning job from the console did not work, so this step uses the API instead.
Hopefully the API will return more helpful error messages.

In [None]:
vertexai.init(project=PROJECT_ID, location="us-west1")

sft_tuning_job = sft.train(
    source_model="gemini-1.5-flash-002",
    train_dataset=f"gs://{gcs_bucket}/{OUTPUT_FILE}",
    epochs=4,
    adapter_size=4,
    learning_rate_multiplier=1.0,
    tuned_model_display_name="tuned_gemini_1_5_flash_guanaco",
)

# Polling for job completion
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()

print(sft_tuning_job.tuned_model_name)
print(sft_tuning_job.tuned_model_endpoint_name)
print(sft_tuning_job.experiment)
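
The polling loop above has no upper bound, so a stuck job would block the notebook indefinitely. A minimal sketch of the same loop with a timeout follows. It relies only on the `has_ended` attribute and `refresh()` method used above; `wait_for_job`, its parameters, the six-hour default, and the `FakeJob` stand-in are all hypothetical.

```python
import time

def wait_for_job(job, poll_seconds=60, timeout_seconds=6 * 60 * 60, sleep=time.sleep):
    """Poll job.refresh() until job.has_ended is true or the timeout elapses."""
    waited = 0
    while not job.has_ended:
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still running after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        job.refresh()
    return job

class FakeJob:
    """Stand-in for a tuning job that ends after a fixed number of refreshes."""

    def __init__(self, refreshes_needed):
        self.refreshes_needed = refreshes_needed
        self.has_ended = refreshes_needed <= 0

    def refresh(self):
        self.refreshes_needed -= 1
        self.has_ended = self.refreshes_needed <= 0

# Demonstrate with a fake job and a no-op sleep so the example runs instantly.
job = wait_for_job(FakeJob(3), poll_seconds=1, sleep=lambda seconds: None)
print(job.has_ended)
```

With the real job, the call would be `wait_for_job(sft_tuning_job)`.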


## Step 4. Get model resource name :/

The Go libraries don't make getting this value easy.

In [60]:
tuned_model = GenerativeModel(f'projects/{PROJECT_ID}/locations/us-west1/endpoints/1926929312049528832')
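
The resource name passed to `GenerativeModel` above follows the pattern `projects/{project}/locations/{location}/endpoints/{endpoint_id}`. The helper below is a small sketch for assembling it without hand-editing the string; `endpoint_resource_name` is a hypothetical name, and the project ID in the example is a placeholder. The value printed by `sft_tuning_job.tuned_model_endpoint_name` in Step 3 may already be in this form, in which case no assembly is needed.

```python
def endpoint_resource_name(project_id, location, endpoint_id):
    """Build a Vertex AI endpoint resource name from its three components."""
    fields = {
        "project_id": project_id,
        "location": location,
        "endpoint_id": endpoint_id,
    }
    for label, value in fields.items():
        # Reject empty values and values that would add extra path segments.
        if not value or "/" in str(value):
            raise ValueError(f"{label} must be a non-empty value without '/'")
    return f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}"

# "my-project" is a placeholder; the endpoint ID is the one used in the cell above.
print(endpoint_resource_name("my-project", "us-west1", "1926929312049528832"))
```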

In [61]:
help(tuned_model)

Help on GenerativeModel in module vertexai.generative_models object:

class GenerativeModel(vertexai.generative_models._generative_models._GenerativeModel)
 |  GenerativeModel(model_name: str, *, generation_config: Union[ForwardRef('GenerationConfig'), Dict[str, Any], NoneType] = None, safety_settings: Union[List[ForwardRef('SafetySetting')], Dict[google.cloud.aiplatform_v1beta1.types.content.HarmCategory, google.cloud.aiplatform_v1beta1.types.content.SafetySetting.HarmBlockThreshold], NoneType] = None, tools: Optional[List[ForwardRef('Tool')]] = None, tool_config: Optional[ForwardRef('ToolConfig')] = None, system_instruction: Union[str, ForwardRef('Image'), ForwardRef('Part'), List[Union[str, ForwardRef('Image'), ForwardRef('Part')]], NoneType] = None)
 |  
 |  Method resolution order:
 |      GenerativeModel
 |      vertexai.generative_models._generative_models._GenerativeModel
 |      builtins.object
 |  
 |  Methods inherited from vertexai.generative_models._generative_models._Gene