# Blog Article Supervised Tuning

The goal here is to demonstrate how to tune a model using a supervised approach.

First we used a few-shot prompting strategy to generate article titles, then fed those titles into a second few-shot prompt to generate article content. This notebook tunes the same model with a supervised approach so that it generates articles similar to the original ones without needing the lengthy prompts.

## Example generated files

You can find the generated files and training/evaluation sets from when this example was run at:
- [data/gen/blog-titles-example](../../data/gen/blog-titles-example)
- [data/gen/blog-articles-example](../../data/gen/blog-articles-example)
- [data/training-sets/blog-generation/training.jsonl](../../data/training-sets/blog-generation/training.jsonl)
- [data/training-sets/blog-generation/evaluation.jsonl](../../data/training-sets/blog-generation/evaluation.jsonl)
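
Before uploading or tuning on files like these, it can help to confirm that every line parses as JSON and carries both expected keys. The following is a minimal sketch using only the standard library; `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the sample rows are made up for illustration rather than taken from the generated dataset.

```python
import json

# Keys that each tuning record in this notebook carries.
REQUIRED_KEYS = {"input_text", "output_text"}


def validate_jsonl(text):
    """Parse JSONL text and return the records, raising on malformed rows."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_number} is missing {sorted(missing)}")
        records.append(record)
    return records


# Made-up sample rows for illustration only.
sample = (
    '{"input_text": "Article subject: A", "output_text": "Body A"}\n'
    '{"input_text": "Article subject: B", "output_text": "Body B"}\n'
)
records = validate_jsonl(sample)
print(f"{len(records)} valid records")
```

To check a real file, the text could be read with `open(path).read()` and passed to the same function.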

References:
- [Tuning text-bison](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/tuning/tuning_text_bison.ipynb)
- [Generative AI Tuning Example](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/6c55ca1a3a6b8865b7409a751a04be93d52c7519/generative_ai/tuning.py)

## GCP Configuration

In [1]:
import os

# ****************** [START] Google Cloud project settings ****************** #
project = os.getenv('GCP_PROJECT')
location = os.environ.get('GCP_REGION', 'us-central1')
# ******************* [END] Google Cloud project settings ******************* #


# ********************** [START] data directory config ********************** #
from helpers.files import get_data_dir
data_dir = get_data_dir()

blog_articles_dir = os.path.join(data_dir, 'gen', 'blog-articles')
# *********************** [END] data directory config *********************** #


# *************** [START] Google Cloud Storage bucket config **************** #
bucket_name = os.getenv('GCP_BUCKET_NAME')

bucket_uri = f"gs://{bucket_name}"

# relative path of the training sets. it is used for the gcloud storage
# bucket path and, further down, for the local copies under data/.
training_sets_dir = 'training-sets/blog-generation'

training_data_filepath = f'{training_sets_dir}/training.jsonl'
training_data_gcs_uri = f'{bucket_uri}/{training_data_filepath}'

evaluation_data_filepath = f'{training_sets_dir}/evaluation.jsonl'
evaluation_data_gcs_uri = f'{bucket_uri}/{evaluation_data_filepath}'
# **************** [END] Google Cloud Storage bucket config ***************** #


# *********************** [START] LLM parameter config ********************** #
# Vertex AI model to use for the LLM
model_name = 'text-bison@002'
# *********************** [END] LLM parameter config ************************ #


# *********************** [START] LLM fine-tuning config ******************** #
from helpers.uuid import generate_uuid
tuned_model_name = f'blog-article-generator-{generate_uuid()}'

training_steps = 100
evaluation_interval = 20

service_account = os.environ.get('GCP_SERVICE_ACCOUNT', f'vertex-ai@{project}.iam.gserviceaccount.com')

# save the tuning dataset JSONL to data/ directory so we can
# review what was generated. this doesn't affect the tuning.
local_training_filepath = os.path.join(data_dir, training_data_filepath)
local_evaluation_filepath = os.path.join(data_dir, evaluation_data_filepath)
# *********************** [END] LLM fine-tuning config ********************** #


# ********************** [START] Configuration Checks *********************** #
if not project:
    raise Exception('GCP_PROJECT environment variable not set')

if not bucket_name:
    raise Exception('GCP_BUCKET_NAME environment variable not set')
# *********************** [END] Configuration Checks ************************ #
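
The checks above stop at the first missing variable, and they run only after `project` and `bucket_name` have already been interpolated into other strings. As a sketch of an alternative, the hypothetical helper below (`missing_env_vars` is not part of this notebook's helpers) collects every unset or empty variable in one pass so they can be reported together before any other configuration is built.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstrated against a plain dict so the example does not depend on the shell.
example_env = {"GCP_PROJECT": "my-gen-ai-example-project", "GCP_BUCKET_NAME": ""}
missing = missing_env_vars(["GCP_PROJECT", "GCP_BUCKET_NAME"], env=example_env)
print(missing)
```

In the notebook, a non-empty result could be turned into a single exception that lists all missing names.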

## Read the data from files and create a dataset


**TODO**: create an evaluation dataset

In [2]:
from helpers.files import read_file, make_dir_if_not_exists
import os
import pandas as pd
from sklearn.model_selection import train_test_split

def create_tuning_dataframe(dir_path):
    """
    Creates a dataframe from the input.txt and output.txt files in each
    of the specified directory's child directories.

    1. loops over the child directories of dir_path
    2. reads in two files per directory, "input.txt" and "output.txt"
    3. strips leading and trailing whitespace from the file content
    4. returns a dataframe with two columns, "input_text" and "output_text"
    """

    row_list = []
    for folder in os.listdir(dir_path):
        input_text = read_file(f'{dir_path}/{folder}/input.txt')
        output_text = read_file(f'{dir_path}/{folder}/output.txt')

        training_row = {'input_text': input_text.strip(), 'output_text': output_text.strip()}

        row_list.append(training_row)

    return pd.DataFrame(row_list, columns=['input_text', 'output_text'])

def save_jsonl_to_local_fs(df, filepath):
    """
    Saves the dataframe as a JSONL file to the local file system.
    """

    print(f"saving JSONL to {filepath}")

    make_dir_if_not_exists(filepath)

    df.to_json(filepath, orient='records', lines=True)

df = create_tuning_dataframe(blog_articles_dir)

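# split is set to 80/20 (40 : 10 for the generated dataset)
train, evaluation = train_test_split(df, test_size=0.2)
# cap the evaluation set at 10 rows. min() avoids a ValueError from
# sample() when the split yields fewer than 10 rows.
evaluation = evaluation.sample(n=min(10, len(evaluation)), random_state=1)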

print(f'training set is {len(train)} rows')
print(f'evaluation set is {len(evaluation)} rows')

save_jsonl_to_local_fs(train, local_training_filepath)
save_jsonl_to_local_fs(evaluation, local_evaluation_filepath)

# convert dataframes to JSONL strings to be uploaded to GCS
train_jsonl = train.to_json(orient='records', lines=True)
eval_jsonl = evaluation.to_json(orient='records', lines=True)

# print up to the first 10 lines of the training data
for line in train_jsonl.splitlines()[:10]:
    print(line)

# print up to the first 10 lines of the evaluation data
for line in eval_jsonl.splitlines()[:10]:
    print(line)

training set is 40 rows
evaluation set is 10 rows
saving JSONL to /home/steven/work/data/training-sets/blog-generation/training.jsonl
saving JSONL to /home/steven/work/data/training-sets/blog-generation/evaluation.jsonl
{"input_text":"Article subject: 5 Tips for Beginners to Learn How to Cook Healthy Meals","output_text":"Article subject: 5 Tips for Beginners to Learn How to Cook Healthy Meals\n\n1. Embrace the Culinary Force: Start with Simple Recipes\n\nTroopers, the path to culinary mastery begins with simplicity. Choose recipes that are easy to follow, much like the basic training of a stormtrooper. Master these fundamental dishes, and you'll lay the foundation for more complex culinary conquests.\n\n2. Gather Your Ingredients: Assemble the Tools of Culinary Domination\n\nIn the kitchen, your ingredients are your weapons. Seek out fresh, high-quality produce, the finest cuts of meat, and the choicest spices. Just as we maintain our starships, ensure your kitchen is well-equipped wi
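
The split in the cell above does not pass a seed to `train_test_split`, so the rows that land in each set can change between runs. The sketch below shows the idea of a seeded 80/20 split using only the standard library. It is a stand-in for illustration, not a replica of scikit-learn's behaviour: `split_rows` is a hypothetical helper and the rows are synthetic.

```python
import random


def split_rows(rows, eval_fraction=0.2, seed=1):
    """Shuffle a copy of `rows` with a fixed seed and split it into two lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    eval_count = round(len(shuffled) * eval_fraction)
    # The first eval_count shuffled rows form the evaluation set.
    return shuffled[eval_count:], shuffled[:eval_count]


# Synthetic rows standing in for the generated articles.
rows = [{"input_text": f"subject {i}", "output_text": f"article {i}"} for i in range(50)]
train_rows, eval_rows = split_rows(rows)
print(len(train_rows), len(eval_rows))  # 40 10
```

Because the shuffle is seeded, calling `split_rows(rows)` again returns the same two lists, which keeps the local JSONL files and the uploaded copies comparable across runs.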

## Copy JSONL files to storage bucket

In [3]:
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client(project=project)

def pandas_write(bucket_name, blob_name, content):
    """Write a string to a GCS object using file-like IO"""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"

    # The ID of your new GCS object
    # blob_name = "storage-object-name"

    # The string of data to write to the GCS object
    # content = '{"some":"json"}\n{"another":"one"}'

    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    with blob.open("w") as f:
        f.write(content)

    print(f"Wrote csv with pandas with name {blob_name} from bucket {bucket.name}.")

pandas_write(bucket_name, training_data_filepath, train_jsonl)
pandas_write(bucket_name, evaluation_data_filepath, eval_jsonl)


Wrote csv with pandas with name training-sets/blog-generation/training.jsonl from bucket my-gen-ai-example-project-bucket.
Wrote csv with pandas with name training-sets/blog-generation/evaluation.jsonl from bucket my-gen-ai-example-project-bucket.
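
The upload takes a bucket name and a blob name, while the pipeline arguments later use full `gs://` URIs built from the same two parts. A small sketch of going the other way, from URI back to bucket and blob name, can be handy when checking that both refer to the same object. `split_gcs_uri` is a hypothetical helper written for this illustration, not part of the Google Cloud client library.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/blob' into (bucket_name, blob_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"URI must include a bucket and an object path: {uri!r}")
    return bucket_name, blob_name


uri = "gs://my-gen-ai-example-project-bucket/training-sets/blog-generation/training.jsonl"
print(split_gcs_uri(uri))
```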


## Configure pipeline arguments

In [4]:
pipeline_arguments = {
    "model_display_name": tuned_model_name,
    "location": location,
    "large_model_reference": model_name,
    "project": project,
    "train_steps": training_steps,
    "dataset_uri": training_data_gcs_uri,
    "evaluation_interval": evaluation_interval,
    "evaluation_data_uri": evaluation_data_gcs_uri,
    # uses the default tensorboard created by google
    "tensorboard_resource_id": None,
}

pipeline_root = f'{bucket_uri}/{tuned_model_name}'
template_path = 'https://us-kfp.pkg.dev/ml-pipeline/large-language-model-pipelines/tune-large-model/v2.0.0'

print("pipeline_arguments:")
print(pipeline_arguments)

print("pipeline_root:")
print(pipeline_root)

print("template_path:")
print(template_path)


pipeline_arguments:
{'model_display_name': 'blog-article-generator-f2xjxtvb', 'location': 'us-central1', 'large_model_reference': 'text-bison@002', 'project': 'my-gen-ai-example-project', 'train_steps': 100, 'dataset_uri': 'gs://my-gen-ai-example-project-bucket/training-sets/blog-generation/training.jsonl', 'evaluation_interval': 20, 'evaluation_data_uri': 'gs://my-gen-ai-example-project-bucket/training-sets/blog-generation/evaluation.jsonl', 'tensorboard_resource_id': None}
pipeline_root:
gs://my-gen-ai-example-project-bucket/blog-article-generator-f2xjxtvb
template_path:
https://us-kfp.pkg.dev/ml-pipeline/large-language-model-pipelines/tune-large-model/v2.0.0
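
Since a tuning run takes a while to fail, a quick local sanity check of the arguments before submitting may save a round trip. The sketch below is an assumption about what is worth checking, not a statement of what the pipeline template itself validates; `check_pipeline_arguments` is a hypothetical helper, and the example values are copied from the printed output above.

```python
def check_pipeline_arguments(args):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in ("dataset_uri", "evaluation_data_uri"):
        value = args.get(key)
        if not isinstance(value, str) or not value.startswith("gs://"):
            problems.append(f"{key} should be a gs:// URI, got {value!r}")
    steps = args.get("train_steps")
    interval = args.get("evaluation_interval")
    if not isinstance(steps, int) or steps <= 0:
        problems.append(f"train_steps should be a positive integer, got {steps!r}")
    elif isinstance(interval, int) and interval > steps:
        # Assumed rule of thumb: an interval longer than the run is likely a mistake.
        problems.append("evaluation_interval is larger than train_steps")
    return problems


example_arguments = {
    "train_steps": 100,
    "evaluation_interval": 20,
    "dataset_uri": "gs://my-gen-ai-example-project-bucket/training-sets/blog-generation/training.jsonl",
    "evaluation_data_uri": "gs://my-gen-ai-example-project-bucket/training-sets/blog-generation/evaluation.jsonl",
}
print(check_pipeline_arguments(example_arguments))  # []
```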


## Create tuning helper function

In [5]:
from google.cloud import aiplatform
from google.cloud.aiplatform import PipelineJob

# Function that creates the tuning job (it is submitted in a later cell)
def tuned_model(
    project_id: str,
    location: str,
    template_path: str,
    model_display_name: str,
    pipeline_arguments: dict,
):
    """Create a PipelineJob that tunes a model on prompt-response data.

    The training data is referenced by the "dataset_uri" entry of
    pipeline_arguments, the GCS URI of a file in JSONL format (for example:
    f'gs://{bucket}/{filename}.jsonl'). Each training example should be a
    JSONL record with two keys, for example:
      {
        "input_text": <input prompt>,
        "output_text": <associated output>
      }

    Args:
      project_id: GCP Project ID, used to initialize aiplatform
      location: GCP Region, used to initialize aiplatform
      template_path: path to the pipeline template
      model_display_name: name used as the pipeline job's display name
      pipeline_arguments: dict of arguments used during pipeline runtime

    Returns:
      The PipelineJob, not yet submitted. Note that pipeline_root is read
      from the notebook's global scope.
    """

    aiplatform.init(project=project_id, location=location)

    job = PipelineJob(
        template_path=template_path,
        display_name=model_display_name,
        parameter_values=pipeline_arguments,
        location=location,
        pipeline_root=pipeline_root,
        enable_caching=True,
    )

    return job


## Create and start the tuning job

In [6]:
job = tuned_model(project, location, template_path, tuned_model_name, pipeline_arguments)


In [7]:
job.submit(service_account=service_account)


Creating PipelineJob
PipelineJob created. Resource name: projects/542548823444/locations/us-central1/pipelineJobs/tune-large-model-20231213081755
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/542548823444/locations/us-central1/pipelineJobs/tune-large-model-20231213081755')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/tune-large-model-20231213081755?project=542548823444


## Check the state of the pipeline

In [8]:
job.state


<PipelineState.PIPELINE_STATE_PENDING: 2>
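
Checking the state by hand works, but a tuning job can stay pending or running for a long time. The following sketch polls until a terminal state is reached. It is written against a plain callable so it runs without Google Cloud access; `wait_for_terminal_state` is a hypothetical helper, and the set of terminal state names is an assumption that follows the `PIPELINE_STATE_*` naming seen in the output above.

```python
import time

# State names assumed to be terminal for a pipeline job.
TERMINAL_STATES = {
    "PIPELINE_STATE_SUCCEEDED",
    "PIPELINE_STATE_FAILED",
    "PIPELINE_STATE_CANCELLED",
}


def wait_for_terminal_state(get_state_name, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_state_name() until it returns a terminal state or polls run out."""
    state_name = get_state_name()
    for _ in range(max_polls - 1):
        if state_name in TERMINAL_STATES:
            break
        sleep(poll_seconds)
        state_name = get_state_name()
    return state_name


# Stand-in for a real job: yields a scripted sequence of state names.
fake_states = iter(["PIPELINE_STATE_PENDING", "PIPELINE_STATE_RUNNING", "PIPELINE_STATE_SUCCEEDED"])
final_state = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)  # PIPELINE_STATE_SUCCEEDED
```

With the real job, the callable would presumably be `lambda: job.state.name`, assuming that reading `job.state` fetches the current state on each access.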

## List tuned models

In [9]:
from vertexai.language_models import TextGenerationModel

def list_tuned_models(project_id, location):
    aiplatform.init(project=project_id, location=location)
    model = TextGenerationModel.from_pretrained(model_name)
    tuned_model_names = model.list_tuned_model_names()
    print(tuned_model_names)



In [10]:
list_tuned_models(project, location)


['projects/542548823444/locations/us-central1/models/5825914092375769088']
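
The list holds full resource names rather than display names. When several tuned models exist, splitting a resource name into its parts makes it easier to log or compare them. A minimal sketch, where `parse_model_resource_name` is a hypothetical helper and the name is the one printed above:

```python
def parse_model_resource_name(resource_name):
    """Split 'projects/P/locations/L/models/M' into a dict of its three IDs."""
    parts = resource_name.split("/")
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "models"]:
        raise ValueError(f"unexpected model resource name: {resource_name!r}")
    return {"project": parts[1], "location": parts[3], "model": parts[5]}


name = "projects/542548823444/locations/us-central1/models/5825914092375769088"
print(parse_model_resource_name(name))
```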


## Run a prediction with the tuned model

In [11]:
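def fetch_model(project_id, location):
    """Returns the resource name of the first tuned model in the list."""
    aiplatform.init(project=project_id, location=location)
    model = TextGenerationModel.from_pretrained(model_name)
    # local name chosen so it doesn't shadow the list_tuned_models()
    # and tuned_model() functions defined earlier
    tuned_model_names = model.list_tuned_model_names()

    # note: this picks the first listed model, which is not necessarily
    # the one tuned by this notebook run
    return tuned_model_names[0]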


In [12]:
tuned_model_resource_name = fetch_model(project, location)
deployed_model = TextGenerationModel.get_tuned_model(tuned_model_resource_name)


### Submit prompt

In [16]:
prompt = """Article subject: 3 ways to get better at soccer"""

# maximum number of model responses generated per prompt
candidate_count = 1

# determines the maximum amount of text output from one prompt.
# a token is approximately four characters.
max_output_tokens = 1024

# temperature controls the degree of randomness in token selection.
# lower temperatures are good for prompts that expect a true or
# correct response, while higher temperatures can lead to more
# diverse or unexpected results. With a temperature of 0 the highest
# probability token is always selected. for most use cases, try
# starting with a temperature of 0.2.
temperature = 0.2

# top-p changes how the model selects tokens for output. Tokens are
# selected from most probable to least until the sum of their
# probabilities equals the top-p value. For example, if tokens A, B, and C
# have a probability of .3, .2, and .1 and the top-p value is .5, then the
# model will select either A or B as the next token (using temperature).
# the default top-p value is .8.
top_p = 0.8

# top-k changes how the model selects tokens for output.
# a top-k of 1 means the selected token is the most probable among
# all tokens in the model’s vocabulary (also called greedy decoding),
# while a top-k of 3 means that the next token is selected from among
# the 3 most probable tokens (using temperature).
top_k = 40

parameters = {
    "candidate_count": candidate_count,
    "max_output_tokens": max_output_tokens,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
}
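
The comments on top-k and top-p describe a filtering step that is easy to try on a toy distribution. The sketch below applies top-k first and then top-p, reusing the A/B/C probabilities from the comment above. It only illustrates the filtering idea under that assumed ordering and is not the model's actual decoding code; `top_k_top_p_candidates` is a hypothetical helper.

```python
def top_k_top_p_candidates(probabilities, top_k, top_p):
    """Return the tokens that survive top-k, then top-p, filtering."""
    # Rank tokens from most to least probable and keep the first top_k.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept = []
    cumulative = 0.0
    for token, probability in ranked:
        # Stop once the kept tokens already cover the top-p mass
        # (small tolerance for floating-point rounding).
        if cumulative >= top_p - 1e-9:
            break
        kept.append(token)
        cumulative += probability
    return kept


toy = {"A": 0.3, "B": 0.2, "C": 0.1}
print(top_k_top_p_candidates(toy, top_k=40, top_p=0.5))  # ['A', 'B']
print(top_k_top_p_candidates(toy, top_k=1, top_p=0.8))   # ['A']
```

The next token would then be sampled from the surviving candidates, with temperature controlling how strongly the most probable one is favoured.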


In [17]:
tuned_prediction = deployed_model.predict(prompt=prompt, **parameters)


In [18]:
print(tuned_prediction.text)


 Article subject: 3 ways to get better at soccer

1. Master the Basics: Build a Solid Foundation

Troopers, soccer mastery begins with a solid foundation. Just as a sturdy fortress requires strong walls, your soccer skills must be built upon a foundation of the basics. Focus on mastering ball control, passing, and shooting. These fundamental skills are the building blocks of soccer success, and without them, your progress will be limited.

2. Study the Game: Learn from the Masters

In the realm of soccer, knowledge is power. Study the game, observe the techniques of the masters, and learn from their strategies. Analyze their movements, their decision-making, and their tactical prowess. By understanding the intricacies of the game, you will gain insights that will elevate your performance to new heights.

3. Embrace the Grind: Train with Discipline and Passion

Troopers, soccer is a sport that demands discipline and passion. Embrace the grind, push yourself to the limit, and never give 