Course details: See https://learn.deeplearning.ai/courses/llmops

Project environment setup:

Load credentials and relevant Python Libraries
If you were running this notebook locally, you would first install the Vertex AI SDK for Python. In this classroom, it is already installed.
!pip install google-cloud-aiplatform
You can download the requirements.txt for this course from the workspace of this lab. File --> Open...

In [None]:
from utils import authenticate
credentials, PROJECT_ID = authenticate() 
REGION = "us-central1"
import vertexai #import the SDK for vertex AI
vertexai.init(project = PROJECT_ID,
              location = REGION,
              credentials = credentials)

from google.cloud import bigquery
bq_client = bigquery.Client(project=PROJECT_ID, 
                            credentials = credentials)

In [None]:
import pandas as pd

#Using stack overflow data
QUERY = """
SELECT
    CONCAT(q.title, q.body) as input_text,
    a.body AS output_text
FROM
    `bigquery-public-data.stackoverflow.posts_questions` q
JOIN
    `bigquery-public-data.stackoverflow.posts_answers` a
ON
    q.accepted_answer_id = a.id
WHERE
    q.accepted_answer_id IS NOT NULL AND
    REGEXP_CONTAINS(q.tags, "python") AND
    a.creation_date >= "2020-01-01"
LIMIT
    10000
"""

query_job = bq_client.query(QUERY)

# Take the results of the query --> convert them to an Arrow table (Apache Arrow is a columnar in-memory format) --> convert that into a Pandas dataframe.
# This puts the data in a format that is easier to read and explore with Pandas
stack_overflow_df = query_job.result()\
                        .to_arrow()\
                        .to_pandas()

stack_overflow_df.head(2)

Adding Instructions
* Instructions for LLMs have been shown to improve model performance and generalization to unseen tasks (Google, 2022).
* Without the instruction, the data is only a question and an answer, so the model might not understand what to do.
* With the instructions, the model gets a guideline as to what task to perform.

In [None]:
INSTRUCTION_TEMPLATE = f"""\
Please answer the following Stackoverflow question on Python. \
Answer it like you are a developer answering Stackoverflow questions.

Stackoverflow question:
"""
# A new column will combine `INSTRUCTION_TEMPLATE` and the question `input_text`.
stack_overflow_df['input_text_instruct'] = INSTRUCTION_TEMPLATE + ' '\
    + stack_overflow_df['input_text']
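
To see what one assembled training input looks like, the same concatenation can be applied to a single string. This is a minimal sketch: `build_prompt` and the sample question are hypothetical and introduced only for illustration, while the template text is copied from the cell above.

```python
# Template text copied from the notebook cell above.
INSTRUCTION_TEMPLATE = """\
Please answer the following Stackoverflow question on Python. \
Answer it like you are a developer answering Stackoverflow questions.

Stackoverflow question:
"""


def build_prompt(question: str) -> str:
    """Prefix one question with the instruction (hypothetical helper)."""
    return INSTRUCTION_TEMPLATE + " " + question


# Made-up question, used only to show the shape of the result.
sample_question = "How do I reverse a list?I tried list.reverse() but it returns None."
prompt = build_prompt(sample_question)
print(prompt)
```

The sample string has no space between title and body on purpose: `CONCAT(q.title, q.body)` in the query joins the two fields without a separator.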

Dataset for Tuning
* Divide the data into a training set and an evaluation set. An 80/20 split is used here (set by test_size=0.2).
* The random_state parameter fixes the seed of the random shuffle, so the same split is reproduced on every run, which allows a fair comparison.

In [None]:
from sklearn.model_selection import train_test_split
train, evaluation = train_test_split(
    stack_overflow_df,
    ### test_size=0.2 means 20% for evaluation
    ### which then makes train set to be of 80%
    test_size=0.2,
    random_state=42
)
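
The effect of a fixed seed can be illustrated without scikit-learn. The sketch below is a hand-rolled shuffle-and-slice, not scikit-learn's implementation; `split_rows` is a hypothetical helper and the integer list only stands in for the query results.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a seeded RNG, then slice off the evaluation part.

    Hypothetical stand-in for illustration; it is not scikit-learn's implementation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * test_size)
    return shuffled[n_eval:], shuffled[:n_eval]


rows = list(range(10_000))  # stand-in for the queried rows
train_a, eval_a = split_rows(rows, seed=42)
train_b, eval_b = split_rows(rows, seed=42)
train_c, eval_c = split_rows(rows, seed=7)

print(len(train_a), len(eval_a))  # sizes of the two parts
print(eval_a == eval_b)           # same seed: identical split
print(eval_a == eval_c)           # different seed: a different split is expected
```

Because the split is identical on every run with the same seed, two tuning runs can be compared on exactly the same evaluation rows.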

Different Datasets and Flow
* Versioning data is important.
* It allows for reproducibility, traceability, and maintainability of machine learning models.
* Get the timestamp.

In [None]:
import datetime
date = datetime.datetime.now().strftime("%H:%d:%m:%Y")

#Using JSON Lines format, but TFRecord (which is binary and fast) and Parquet (suited to large and complex data) can also be used

# Training file
cols = ['input_text_instruct','output_text']
tune_jsonl = train[cols].to_json(orient="records", lines=True)

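# Build the name on one line: a backslash continuation inside the string
# would embed the next line's indentation in the filename.
training_data_filename = f"tune_data_stack_overflow_python_qa-{date}.jsonl"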

#Stored locally, but it is a best practice to store on cloud storage
with open(training_data_filename, "w") as f:
    f.write(tune_jsonl)


# Evaluation file
cols = ['input_text_instruct','output_text']
### you need to use the "evaluation" set now
tune_jsonl = evaluation[cols].to_json(orient="records", lines=True)

### change the file name
### use "tune_eval_data_stack_overflow_python_qa-{date}.jsonl"
evaluation_data_filename = f"tune_eval_data_stack_overflow_python_qa-{date}.jsonl"

### write the file
with open(evaluation_data_filename, "w") as f:
    f.write(tune_jsonl)
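
A quick way to sanity-check a JSON Lines file is to read it back and parse every line on its own. The sketch below uses only the standard library and two made-up records with the same column names as above; the temporary file name is arbitrary.

```python
import json
import os
import tempfile

# Made-up records with the same two columns the notebook exports.
records = [
    {"input_text_instruct": "Instruction and question 1", "output_text": "answer 1"},
    {"input_text_instruct": "Instruction and question 2", "output_text": "line one\nline two"},
]

# JSON Lines: one JSON object per line; newlines inside values are escaped.
jsonl_text = "\n".join(json.dumps(r) for r in records) + "\n"

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(jsonl_text)

# Read the file back and parse each non-empty line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), sorted(loaded[0]))
```

If any line fails to parse, `json.loads` raises an error that points at the broken record, which is easier to debug locally than after a tuning job has been submitted.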

# Automation and Orchestration with pipelines
We will use [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v2/) to orchestrate and automate a workflow. Kubeflow Pipelines is an open source framework. It's like a construction kit for building machine learning pipelines, making it easy to orchestrate and automate complex tasks.

Kubeflow Pipelines (Apache Airflow is an alternative) is built on two key concepts: components and pipelines.
* Components are self-contained sets of code that perform the individual steps of your ML workflow. For example, the first step could be preprocessing data and the second step could be training a model.
* A pipeline is a collection of components chained together in a specific order.
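
The idea of steps chained in a fixed order can be pictured with ordinary functions. The sketch below is a plain-Python analogy only and does not use the Kubeflow Pipelines API; all function names and the sample rows are hypothetical.

```python
# Plain-Python analogy only: in Kubeflow Pipelines these would be
# components and a pipeline definition, not ordinary functions.

def preprocess(raw_rows):
    """Step 1 (hypothetical): drop rows without an answer."""
    return [r for r in raw_rows if r["output_text"]]


def train(clean_rows):
    """Step 2 (hypothetical): stand-in for model training; returns a summary."""
    return {"trained_on": len(clean_rows)}


def pipeline(raw_rows):
    """Chains the steps: the output of one step is the input of the next."""
    return train(preprocess(raw_rows))


raw = [
    {"input_text": "q1", "output_text": "a1"},
    {"input_text": "q2", "output_text": ""},
    {"input_text": "q3", "output_text": "a3"},
]
result = pipeline(raw)
print(result)
```

An orchestration framework adds what this analogy lacks: each step has declared inputs and outputs, runs in its own environment, and can be scheduled, tracked, and re-run separately.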

In [None]:
from kfp import dsl # domain-specific language used to define components and pipelines
from kfp import compiler # Compiles the instructions

# Ignore FutureWarnings in kfp
import warnings
warnings.filterwarnings("ignore", 
                        category=FutureWarning, 
                        module='kfp.*')

TRAINING_DATA_URI = "./tune_data_stack_overflow_python_qa.jsonl" 
EVAUATION_DATA_URI = "./tune_eval_data_stack_overflow_python_qa.jsonl"

### path to the pipeline file to reuse
### the file is provided in your workspace as well
template_path = 'https://us-kfp.pkg.dev/ml-pipeline/\
large-language-model-pipelines/tune-large-model/v2.0.0'

import datetime
date = datetime.datetime.now().strftime("%H:%d:%m:%Y")
MODEL_NAME = f"deep-learning-ai-model-{date}"

# TRAINING_STEPS: Number of training steps to use when tuning the model. For extractive QA you can set it from 100-500. Note that these are steps, not epochs.
TRAINING_STEPS = 200
# EVALUATION_INTERVAL: Determines how frequently the model being tuned is evaluated against the evaluation dataset
EVALUATION_INTERVAL = 20


from utils import authenticate
credentials, PROJECT_ID = authenticate() 
REGION = "us-central1"
pipeline_arguments = {
    "model_display_name": MODEL_NAME,
    "location": REGION,
    "large_model_reference": "text-bison@001",
    "project": PROJECT_ID,
    "train_steps": TRAINING_STEPS,
    "dataset_uri": TRAINING_DATA_URI,
    "evaluation_interval": EVALUATION_INTERVAL,
    "evaluation_data_uri": EVAUATION_DATA_URI,
}

pipeline_root = "./"

# PipelineJob comes from the Vertex AI SDK
from google.cloud.aiplatform import PipelineJob

job = PipelineJob(
        ### path of the yaml file to execute
        template_path=template_path,
        ### name of the pipeline
        display_name=f"deep_learning_ai_pipeline-{date}",
        ### pipeline arguments (inputs)
        parameter_values=pipeline_arguments,
        ### region of execution
        location=REGION,
        ### root is where temporary files are being 
        ### stored by the execution engine
        pipeline_root=pipeline_root,
        ### enable_caching=True will save the outputs 
        ### of components for re-use, and will only re-run those
        ### components for which the code or data has changed.
        enable_caching=True,
)

### submit for execution
job.submit()

### check to see the status of the job
job.state
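
The difference between training steps and epochs, and the effect of the evaluation interval, can be worked out with simple arithmetic. This is a sketch under stated assumptions: the batch size is a hypothetical value that does not come from the pipeline, the query is assumed to have returned the full 10,000 rows, and the evaluation interval is assumed to be counted in training steps.

```python
import math

TRAINING_STEPS = 200        # from the notebook
EVALUATION_INTERVAL = 20    # from the notebook
NUM_TRAIN_EXAMPLES = 8_000  # 80% of 10,000 rows (assumes the query hit its LIMIT)
BATCH_SIZE = 8              # ASSUMPTION: hypothetical value, not taken from the pipeline

# One step processes one batch, so the number of examples seen grows with steps.
examples_seen = TRAINING_STEPS * BATCH_SIZE
epochs = examples_seen / NUM_TRAIN_EXAMPLES

# Steps needed for one full pass over the training data.
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)

# Number of evaluations if one is run every EVALUATION_INTERVAL steps.
num_evaluations = TRAINING_STEPS // EVALUATION_INTERVAL

print(f"examples seen:   {examples_seen}")
print(f"fraction of an epoch: {epochs}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"evaluations:     {num_evaluations}")
```

Under these assumptions, 200 steps cover only part of the training set once, which is why the step count should not be read as a number of epochs.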

# Prediction, Prompts, Safety
The tuned model can be used for batch predictions or served through a REST API.

The code for this part is not included here.
See https://learn.deeplearning.ai/courses/llmops