# LLM FINE TUNING 

https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-finetuning

Fine-tuning is available to paid accounts in the same regions as Cortex Search, namely:

* AWS US West 2 (Oregon)

* AWS US East 1 (N. Virginia)

* AWS Europe Central 1 (Frankfurt)

* Azure East US 2 (Virginia)

Support for inference of fine-tuned models is available to accounts in regions that support the COMPLETE function for the base model. For details, see Availability.

## Fine-tuning is not available to trial accounts.

### Overview

Tasty Bytes is a fictional global food truck enterprise with a presence in 30 cities across 15 countries, operating a network of 450 trucks that offer 15 menu types under various brands. Our mission at Tasty Bytes is to improve the customer experience by leveraging the power of AI with Snowflake Cortex.

In this tutorial, we will build a customer support agent that showcases Cortex Fine-Tuning and helps the Tasty Bytes team respond to customer tickets with accurate automated emails, using minimal resources and time. Fine-tuning directly supports the team's key objective: enhancing the customer experience.

With Cortex Fine-Tuning, Snowflake users can harness the power of parameter-efficient fine-tuning (PEFT) to develop custom adaptors for specialized tasks using pre-trained models. This approach avoids the high cost of training a large model from scratch while achieving improved latency and performance compared to prompt engineering or retrieval-augmented generation (RAG) methods.

We will use the Tasty Bytes customer support email data for this tutorial. When customers reach out to the Tasty Bytes support team over email, the emails are often missing some information. In these cases, the support agent's first reply is a clarification question that asks the customer for the missing information so that the agent can provide more tailored support. To increase efficiency and reduce operational overhead, this reply can be automated based on the completeness of the email.

For example, if a required label has no value in the email, an automated response can be sent to the customer asking for that information. For Tasty Bytes emails, the city, truck, or menu item are values that an agent requires in order to take the next steps.
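
The completeness check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the tutorial's pipeline: the label names in `REQUIRED_LABELS`, the helper names, and the wording of the follow-up question are all hypothetical.

```python
from typing import Optional

# Hypothetical label names mirroring the values the text calls required.
REQUIRED_LABELS = ("city", "truck", "menu_item")


def missing_labels(labels: dict) -> list:
    """Return the required labels whose value is absent, None, or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


def clarification_request(labels: dict) -> Optional[str]:
    """Build a follow-up question, or return None when nothing is missing."""
    missing = missing_labels(labels)
    if not missing:
        return None
    return "Could you please share the following details: " + ", ".join(missing) + "?"


if __name__ == "__main__":
    email_labels = {"city": "Paris", "truck": None}
    print(missing_labels(email_labels))
    print(clarification_request(email_labels))
```

In the tutorial itself, the fine-tuned model takes over this decision: it reads the raw email body and produces both the extracted labels and the response text.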

To begin, let's execute the DDL statements and set up the database objects. We will create a separate role and implement access control to ensure privacy and security. Then set the context:

* Set the Role context to CFT_ROLE
* Set the Warehouse context to CFT_WH
* Set the Database context to CFT_DB
* Set the Schema context to CFT_SCHEMA

In [None]:
CREATE OR REPLACE DATABASE CFT_DB;

CREATE OR REPLACE SCHEMA CFT_DB.CFT_SCHEMA;
CREATE OR REPLACE WAREHOUSE CFT_WH AUTO_SUSPEND = 60;
-- create roles
USE ROLE securityadmin;

-- functional roles

CREATE ROLE IF NOT EXISTS CFT_ROLE COMMENT = 'Fine tuning role';
GRANT ROLE CFT_ROLE TO ROLE SYSADMIN;
GRANT ALL ON WAREHOUSE CFT_WH TO ROLE CFT_ROLE;
--Grants
USE ROLE securityadmin;

GRANT USAGE ON DATABASE CFT_DB TO ROLE CFT_ROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE CFT_DB TO ROLE CFT_ROLE;

GRANT ALL ON SCHEMA CFT_DB.CFT_SCHEMA TO ROLE CFT_ROLE;

GRANT OWNERSHIP ON WAREHOUSE CFT_WH TO ROLE CFT_ROLE COPY CURRENT GRANTS;


-- future grants
GRANT ALL ON FUTURE TABLES IN SCHEMA CFT_DB.CFT_SCHEMA TO ROLE CFT_ROLE;
/*--
 • File format and stage creation
--*/

USE ROLE CFT_ROLE;
USE WAREHOUSE CFT_WH;
USE DATABASE CFT_DB;
USE SCHEMA CFT_SCHEMA;

In [None]:
CREATE OR REPLACE STAGE DATA_s3
COMMENT = 'CFT S3 Stage Connection'
url = 's3://sfquickstarts/frostbyte_tastybytes/fine_tuning/';

--The Support emails and the various fields are added as a CSV. List and view the files in the Public S3 Bucket. 

LIST @CFT_DB.CFT_SCHEMA.DATA_s3;

Once data has been added to a stage, let's create a table called SUPPORT_EMAILS for storing the raw emails from the customers. This dataset will be used in the next phase of data preparation.



In [None]:
CREATE OR REPLACE TABLE SUPPORT_EMAILS (        
            ID INTEGER,
            TIMESTAMP VARIANT,
            SENDER VARCHAR,
            SUBJECT VARCHAR,
            BODY VARCHAR,  
            LABELED_LOCATION VARCHAR,
            LABELED_TRUCK VARCHAR,        
            SUPPORT_RESPONSE VARCHAR
            );
            

-- Load data from stage into table 
COPY INTO SUPPORT_EMAILS
FROM @CFT_DB.CFT_SCHEMA.DATA_s3/CFT_QUICKSTART.csv
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);

--Preview the data
SELECT * FROM SUPPORT_EMAILS limit 20;

The support emails contain a number of fields. We are most interested in the required labels, truck and location, and whether each is present or missing. If a label is missing, the task for the LLM is to construct an email response that asks the customer to clarify the missing value. The goal of the data preparation step is to create a dataset of prompt/completion pairs for training, so that the fine-tuned LLM learns to parse future emails and respond appropriately to both missing and available annotations.

The FINETUNE function expects the training data to come from a Snowflake table or view, and the query result must contain columns named prompt and completion. At this point, the raw customer emails and their annotations have been loaded into the SUPPORT_EMAILS table with the COPY command.

The next step is to build a function that constructs the "completion", a JSON response derived from the raw data. The BUILD_EXAMPLE function does this. Once the completions are available, construct a training dataset and a validation dataset from the base dataset. Applying a modulo operation to the ID field gives a reproducible training/validation split. A Cortex Fine-Tuning job needs only a small number of samples; earlier experiments found that a sample size of 128 works well.

In [None]:
-- Create a function that constructs the completion values
CREATE OR REPLACE FUNCTION BUILD_EXAMPLE(location STRING, truck STRING, response_body STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
CONCAT(
'{ location: ', IFF(location IS NULL, 'null', CONCAT('"', location, '"')), 
', truck: ', IFF(truck IS NULL, 'null', CONCAT('"', truck, '"')), 
', response_body: ', IFF(response_body IS NULL, 'null', CONCAT('"', REPLACE(response_body, '\n', '\\n'), '"')), '
}')
$$
;
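
To make the completion format concrete, the Python sketch below mirrors BUILD_EXAMPLE as we read the SQL above. It also shows a hypothetical `build_example_strict` variant that is not part of the tutorial. The variant is there because the original output uses unquoted keys and does not escape double quotes inside the response body, so a strict parser such as Python's `json` module rejects it. Whether that matters depends on the parser that consumes the model output.

```python
import json


def build_example(location, truck, response_body):
    """Python rendering of the BUILD_EXAMPLE SQL function (same text layout)."""
    def quoted(value, escape_newlines=False):
        if value is None:
            return "null"
        if escape_newlines:
            # Replace real newlines with the two characters backslash + n.
            value = value.replace("\n", "\\n")
        return '"' + value + '"'

    return ("{ location: " + quoted(location)
            + ", truck: " + quoted(truck)
            + ", response_body: " + quoted(response_body, True)
            + "\n}")


def build_example_strict(location, truck, response_body):
    """Hypothetical variant: let json.dumps handle quoting and escaping."""
    return json.dumps({
        "location": location,
        "truck": truck,
        "response_body": response_body,
    })


if __name__ == "__main__":
    print(build_example("Paris", None, "Hi,\nwhich truck did you visit?"))
    print(build_example_strict("Paris", None, 'You ordered the "special"?'))
```

If you adopt a strict variant, use the same format for training and for parsing the model's responses, because the fine-tuned model learns to reproduce whatever completion format it is trained on.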

In [None]:
CREATE OR REPLACE TABLE FINE_TUNING_TRAINING AS (
    SELECT 
        *,
        BUILD_EXAMPLE(
            LABELED_LOCATION, LABELED_TRUCK, SUPPORT_RESPONSE
        ) as GOLDEN_JSON
    FROM SUPPORT_EMAILS
    -- Split: 20% validation data, 80% training data
    WHERE ID % 10 >= 2
);

In [None]:
SELECT * FROM FINE_TUNING_TRAINING LIMIT 10;

In [None]:
CREATE OR REPLACE TABLE FINE_TUNING_VALIDATION AS (
    SELECT 
        *,
        BUILD_EXAMPLE(
            LABELED_LOCATION, LABELED_TRUCK, SUPPORT_RESPONSE
        ) as GOLDEN_JSON
    FROM SUPPORT_EMAILS
    -- Split: 20% validation data, 80% training data
    WHERE ID % 10 < 2
    
);

In [None]:
select * from FINE_TUNING_VALIDATION limit 10;
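
The two `WHERE` clauses above partition the rows by the last digit of the ID. The Python sketch below reproduces that rule on a hypothetical run of 160 sequential IDs (the real dataset size is not stated here) to show that the split is deterministic and that the two sets never overlap.

```python
def split_ids(ids, modulus=10, validation_buckets=2):
    """Split IDs the same way as the SQL: ID % 10 < 2 goes to validation."""
    train = [i for i in ids if i % modulus >= validation_buckets]
    validation = [i for i in ids if i % modulus < validation_buckets]
    return train, validation


if __name__ == "__main__":
    # Hypothetical IDs 1..160; the actual IDs come from SUPPORT_EMAILS.
    train, validation = split_ids(range(1, 161))
    print(len(train), len(validation))
    # No ID lands in both sets, and every ID lands in one of them.
    print(set(train).isdisjoint(validation), len(train) + len(validation) == 160)
```

Because the assignment depends only on the ID, rerunning the cells always yields the same split, unlike random sampling without a fixed seed. The 80/20 ratio holds only approximately when the IDs are not evenly spread across last digits.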

The following code calls the FINETUNE function and uses the SELECT ... AS syntax to set two of the columns in the query result to prompt and completion.

## The following will probably take 3-5 minutes to run. It will run in the background.

The training data must come from a Snowflake table or view and the query result must contain columns named prompt and completion. If your table or view does not contain columns with the required names, use a column alias in your query to name them. This query is given as a parameter to the FINETUNE function. You will get an error if the results do not contain prompt and completion column names.

A prompt is an input to the LLM and completion is the response from the LLM. Your training data should include prompt and completion pairs that show how you want the model to respond to particular prompts.

In [None]:
-- STEP 3 :  Create a fine-tuning job 
-- The call below is commented out so that you don't have to wait for the job to finish; uncomment it to start a fine-tuning job.

/*
SELECT SNOWFLAKE.CORTEX.FINETUNE(
    'CREATE', 
    -- Custom model name
    'SUPPORT_MISTRAL_7B',
    -- Base model name
    'mistral-7b',
    -- Training data query
    'SELECT BODY AS PROMPT, GOLDEN_JSON AS COMPLETION FROM FINE_TUNING_TRAINING',
    -- Validation data query 
    'SELECT BODY AS PROMPT, GOLDEN_JSON AS COMPLETION FROM FINE_TUNING_VALIDATION' 
);
*/

Use DESCRIBE to view the properties of a fine-tuning job, passing the ID of the job (replace the ID in the cell below with your own). If the job completes successfully, additional details about the job are returned, including the final model name. Wait about 5-10 minutes for the status to change to "SUCCESS".

In [None]:
--Describe the properties of a fine-tuning job
Select SNOWFLAKE.CORTEX.FINETUNE(
  'DESCRIBE',
  'ft_a38f43a8-4466-4a5c-b4c6-c9f216cfb665'
);


Once the status changes to "SUCCESS", we can run the fine-tuned model and evaluate the accuracy of its outputs.

In [None]:
--Describe the Models

SHOW MODELS;

The FINE_TUNING_VALIDATION_FINETUNED table stores the output of running the fine-tuned model on the body of each validation email.


In [None]:
-- Create a table that stores the output of the fine-tuned LLM

CREATE OR REPLACE TABLE FINE_TUNING_VALIDATION_FINETUNED AS (
    SELECT
        -- Carry over fields from source for convenience.
        ID, BODY, LABELED_TRUCK, LABELED_LOCATION,
        -- Run the custom fine-tuned LLM.
        SNOWFLAKE.CORTEX.COMPLETE(
            -- Custom model
            'SUPPORT_MISTRAL_7B', 
            body 
        ) AS RESPONSE
    FROM FINE_TUNING_VALIDATION
);


In [None]:
-- STEP 4: Evaluate the Output

SELECT
    B.BODY,
    TO_VARCHAR(GET_PATH(PARSE_JSON(A.RESPONSE), 'response_body')) AS LLM_RESPONSE,
    B.SUPPORT_RESPONSE AS CSR_RESPONSE,
    B.LABELED_TRUCK,
    B.LABELED_LOCATION
FROM FINE_TUNING_VALIDATION_FINETUNED A
INNER JOIN SUPPORT_EMAILS B
    ON A.ID = B.ID
LIMIT 100;
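
The query above supports a side-by-side visual comparison. For a single number, one option is exact-match accuracy on the extracted labels. The Python sketch below assumes the model responses have already been parsed into dictionaries (for example, after exporting the query result). The `label_accuracy` helper and the sample rows are hypothetical, and exact match is our choice of metric, not one the tutorial prescribes.

```python
def label_accuracy(predicted, expected, field):
    """Fraction of rows where the predicted field equals the labeled field."""
    if not expected:
        return 0.0
    matches = sum(
        1 for p, e in zip(predicted, expected) if p.get(field) == e.get(field)
    )
    return matches / len(expected)


if __name__ == "__main__":
    # Hypothetical parsed model outputs and the corresponding human labels.
    predicted = [
        {"truck": "Truck A", "location": "Paris"},
        {"truck": None, "location": "Tokyo"},
        {"truck": "Truck C", "location": None},
    ]
    expected = [
        {"truck": "Truck A", "location": "Paris"},
        {"truck": "Truck B", "location": "Tokyo"},
        {"truck": "Truck C", "location": None},
    ]
    print(label_accuracy(predicted, expected, "truck"))
    print(label_accuracy(predicted, expected, "location"))
```

Exact match treats a correctly predicted missing label (`None` on both sides) as a hit, which fits this task because detecting missing values is part of what the model is trained to do.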

### Cleanup

Run the final cell below if you'd like to clean up the objects created in this tutorial.

Snowflake Cortex Fine-Tuning incurs compute cost based on the number of tokens used in training. To estimate the cost of a fine-tuning job, refer to the consumption table, which lists the cost in credits per million tokens. Normal storage and warehouse costs also apply for storing the customized adaptors and for running SQL commands.
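
The estimate described above is a single multiplication. The Python sketch below shows the arithmetic. The token count and the rate are placeholder values, not actual Snowflake pricing; take the real rate for your base model from the consumption table.

```python
def training_cost_credits(trained_tokens, credits_per_million_tokens):
    """Estimate fine-tuning compute cost in credits."""
    return trained_tokens / 1_000_000 * credits_per_million_tokens


if __name__ == "__main__":
    # Placeholder inputs: 2.5 million trained tokens at a made-up rate of 3.0.
    print(training_cost_credits(2_500_000, 3.0))
```

Storage and warehouse costs are billed separately and are not included in this figure.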

In [None]:
USE ROLE SYSADMIN;
DROP DATABASE CFT_DB;
DROP WAREHOUSE CFT_WH;

USE ROLE securityadmin;
DROP ROLE CFT_ROLE;