# Cortex Finetuning Experiments

This notebook takes you through evaluating a series of fine-tuning experiments for labelling customer support tickets with an LLM.

To get your account details for this class, first go to https://sfedu02-tmb89584.snowflakecomputing.com/

Log in with a chosen username/password from the distributed sign-up sheet, and then reset your password.

Next, enter those details in the connection parameters below.

In [None]:
from snowflake.snowpark import Session

connection_params = {
    "account": "TMB89584",
    "user": "INSTRUCTOR1",
    "password": "...",
    "role": "TRAINING_ROLE",
    "database": "SUPPORT_TICKET_CLASSIFICATION_DB",
    "schema": "SUPPORT_TICKETS_SCHEMA",
    "warehouse": "ANIMAL_TASK_WH",
}

# Create a Snowflake session
snowpark_session = Session.builder.configs(connection_params).create()

# for connecting to the database, we will also use snowflake.connector
import snowflake.connector
snowflake_connection = snowflake.connector.connect(**connection_params)

Next, we want to start a TruLens logging session.

In the first homework, we logged the traces and evaluations of our LLM app locally. In this homework, we will log the traces and evals into our Snowflake database. We can do that with the same connection parameters as above, except using a new schema to hold the data. Append your initials to the name of the schema that holds the logs so that your logs are stored separately from those of the other students.

In [None]:
from trulens.core import TruSession
from trulens.connectors.snowflake import SnowflakeConnector

connection_params['schema'] = 'TRULENS_LOGS_JR'

connector = SnowflakeConnector(**connection_params)

session = TruSession(connector=connector)
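
The naming convention above (a shared prefix plus your initials) can be sketched as a small helper. This is only an illustration: `trulens_logs_schema` is a hypothetical function, not part of TruLens or Snowflake, and the `TRULENS_LOGS` prefix is taken from the example schema name used in the cell above.

```python
import re


def trulens_logs_schema(initials: str, prefix: str = "TRULENS_LOGS") -> str:
    """Build a per-student schema name such as TRULENS_LOGS_JR (hypothetical helper)."""
    cleaned = initials.strip().upper()
    # Keep the name a simple unquoted identifier: letters, digits, underscores.
    if not re.fullmatch(r"[A-Z0-9_]+", cleaned):
        raise ValueError(f"initials must be letters, digits, or underscores: {initials!r}")
    return f"{prefix}_{cleaned}"


print(trulens_logs_schema("jr"))
```

The result could then be assigned to `connection_params['schema']` in place of the hard-coded name.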

## Create our LLM App

The next thing we need to do is draft our instructions for the LLM app. The goal of this app is to automatically label customer support tickets as they come in so they can be properly triaged.

Each support ticket should receive one of the following five labels:
- Roaming fees
- Slow data speed
- Lost phone
- Add new line
- Closing account

Try writing an instruction prompt yourself, or use the one we've written for you below.

In [None]:
instruction_prompt = """
        You are an agent that helps organize requests that come to our support team. 

        The request category is the reason why the customer reached out. These are the possible types of request categories:

        Roaming fees
        Slow data speed
        Lost phone
        Add new line
        Closing account

        Try doing it for this request and return only the request category.
        
        """

Next, we want to use this instruction prompt in our LLM app. This app will first render a full prompt with the instruction and ticket, and then pass the rendered prompt to an LLM.

In the first homework, we used OpenAI as our LLM. Here we'll use models from Mistral, accessed via Snowflake Cortex. First, we'll try using Mistral 7b as it is the cheapest and smallest model available in this model family.
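
The rendering step described above can be sketched in isolation, without calling any model. The classifier in this notebook simply concatenates the instruction and the ticket; the `Request:` and `Category:` markers below are an assumption added for illustration, and `render_prompt` is a hypothetical helper.

```python
def render_prompt(instruction_prompt: str, ticket: str) -> str:
    """Combine the instruction and one ticket into a single prompt string."""
    instruction = instruction_prompt.strip()
    # The "Request:" / "Category:" markers are illustrative assumptions,
    # not something the notebook's classifier uses.
    return f"{instruction}\n\nRequest: {ticket.strip()}\nCategory:"


example = render_prompt(
    "Return only the request category.",
    "I was charged extra while travelling abroad.",
)
print(example)
```

Separating the ticket from the instruction with an explicit marker makes it easier to see in a trace where the instruction ends and the customer text begins.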

Next:

- Create ground truth with distribution missing roaming fees label
- Create train/test set with distribution missing roaming fees label
- Test base model on ground truth, observe failure
- Fine-tune base model with train set (without the roaming fees label)
- Test fine-tuned model against test set missing roaming fees label
- Observe success!
- Simulate drift:
  - Test fine-tuned base model against test set with roaming fees label
  - Notice failure with TruLens
  - Fine-tune with additional examples of roaming fees data
  - Test against full set with TruLens
  - Success!
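
The first two steps of the plan above (building train and test sets that are missing one label) can be sketched as follows. This is a minimal illustration on in-memory records: `split_without_label`, the `request`/`label` keys, and the 20% test fraction are assumptions, not the way the class tables were actually built.

```python
import random


def split_without_label(records, held_out_label, test_fraction=0.2, seed=0):
    """Return (train, test, held_out), where train and test never contain held_out_label."""
    kept = [r for r in records if r["label"] != held_out_label]
    held_out = [r for r in records if r["label"] == held_out_label]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_test = max(1, int(len(kept) * test_fraction))
    return kept[n_test:], kept[:n_test], held_out


tickets = [
    {"request": f"ticket {i}", "label": label}
    for i, label in enumerate(
        ["Lost phone", "Roaming fees", "Add new line", "Slow data speed",
         "Closing account", "Roaming fees", "Lost phone", "Add new line",
         "Slow data speed", "Closing account"]
    )
]
train, test, held_out = split_without_label(tickets, "Roaming fees")
print(len(train), len(test), len(held_out))
```

The held-out records are what later plays the role of the "new" ticket type when drift is simulated.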

In [None]:
from snowflake.cortex import Complete
from trulens.apps.custom import instrument

class Support_Ticket_Classifier:

    @instrument
    def __init__(self, model, instruction_prompt):
        self.model = model
        self.instruction_prompt = instruction_prompt
        
    @instrument
    def classify_ticket(self, ticket):
        rendered_prompt = self.instruction_prompt + ticket
        label = Complete(self.model, rendered_prompt)
        return label, rendered_prompt
    
support_ticket_classifier = Support_Ticket_Classifier("mistral-7b", instruction_prompt)

Now, let's load our test set with ground truth labels for testing.

Along the way, we'll also store it in our TruLens database for future use.

In [None]:
# Create a cursor object
cursor = snowflake_connection.cursor()

# Define the SQL query to fetch the data
query = "SELECT * FROM SUPPORT_TICKET_CLASSIFICATION_DB.SUPPORT_TICKET_SCHEMA.SUPPORT_TICKETS_GROUND_TRUTH_NO_ROAMING_FEES"


In [None]:
cursor.execute(query)
ground_truth = cursor.fetch_pandas_all()

# Close the cursor (the connection stays open for later cells)
cursor.close()

ground_truth.rename(columns={'TICKET_ID': 'query_id', 'REQUEST': 'query', 'LABEL': 'expected_response'},
                    inplace=True)

# persist data in trulens database so we can fetch it from here in the future
session.add_ground_truth_to_dataset(
    dataset_name="support_ticket_eval_groundtruth",
    ground_truth_df=ground_truth,
    dataset_metadata={"split": "eval"},
)

ground_truth = session.get_ground_truth("support_ticket_eval_groundtruth")

ground_truth.head()

Next, let's create an evaluator to test against that ground truth.

In [None]:
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.cortex import Cortex

provider = Cortex(snowflake_connection, model_engine="mistral-large")

f_groundtruth = (
    Feedback(
    GroundTruthAgreement(ground_truth, provider = provider).agreement_measure,
    name="Ground Truth (semantic similarity measurement)"
    )
    .on_input_output()
)
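
Because the label set here is closed, a strict exact-match check can serve as a sanity check alongside the LLM-based agreement measure above. The sketch below is plain Python, not a TruLens feedback function; `normalize` and `exact_match_score` are hypothetical helpers.

```python
def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding quotes and periods."""
    return " ".join(text.lower().split()).strip(" .\"'")


def exact_match_score(response: str, expected: str) -> float:
    """Return 1.0 when the response is exactly the expected label, else 0.0."""
    return 1.0 if normalize(response) == normalize(expected) else 0.0


print(exact_match_score(" Roaming fees.\n", "Roaming fees"))
print(exact_match_score("The category is Roaming fees", "Roaming fees"))
```

A response that wraps the label in extra words scores 0.0 here, which makes failures to "return only the request category" easy to count.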

Now, we need to create our application wrapper to track metadata and package the app with the evaluators.

In [None]:
from trulens.apps.custom import TruCustomApp

tru_support_ticket_classifier = TruCustomApp(
    support_ticket_classifier,
    app_name="Support Ticket Classifier",
    app_version="mistral 7b",
    feedbacks=[f_groundtruth]
)

And run the app!

In [None]:
for query in ground_truth['query']:
    with tru_support_ticket_classifier as recording:
        label, rendered_prompt = support_ticket_classifier.classify_ticket(query)

In [None]:
session.get_leaderboard()

Here, we see bad performance on our test set.

In many cases, the base Mistral 7b model fails to follow instructions and does not return only the support ticket label.

We have already tried prompting to solve this issue. Now, we can turn to fine-tuning.

We can use Cortex fine-tuning to further train the base model on labeled data so that it learns to perform this task.
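
As a rough sketch of what the labeled data for that step might look like, the helper below turns labeled tickets into prompt/completion pairs. Two assumptions to verify against the current Snowflake documentation: that Cortex fine-tuning expects training rows with `prompt` and `completion` columns, and that the prompt should be rendered the same way as in `classify_ticket`. `to_finetune_rows` and the `request`/`label` keys are hypothetical.

```python
def to_finetune_rows(instruction_prompt, tickets):
    """Build prompt/completion pairs from labeled tickets (illustrative only)."""
    rows = []
    for ticket in tickets:
        rows.append(
            {
                # Same rendering as classify_ticket: instruction followed by the ticket text.
                "prompt": instruction_prompt + ticket["request"],
                # The completion is the bare label, which is the behavior we want to teach.
                "completion": ticket["label"],
            }
        )
    return rows


rows = to_finetune_rows(
    "Return only the request category.\n",
    [{"request": "I can't find my phone.", "label": "Lost phone"}],
)
print(rows[0])
```

Keeping the training prompt identical to the inference prompt avoids a mismatch between what the model was tuned on and what it sees in the app.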

In [None]:
# fine-tune the model!

# instantiate the app with the fine-tuned model
# NOTE: "mistral-7b" is still the base model name; replace it with the name of
# your fine-tuned model once the fine-tuning job has finished
support_ticket_classifier_finetuned = Support_Ticket_Classifier("mistral-7b", instruction_prompt)

Now, let's try running our fine-tuned model against the same data.

In [None]:
from trulens.apps.custom import TruCustomApp

# wrap the fine-tuned classifier (not the base one) so its calls are recorded
tru_support_ticket_classifier_finetuned = TruCustomApp(
    support_ticket_classifier_finetuned,
    app_name="Support Ticket Classifier",
    app_version="mistral 7b - finetuned",
    feedbacks=[f_groundtruth]
)

for query in ground_truth['query']:
    with tru_support_ticket_classifier_finetuned as recording:
        label, rendered_prompt = support_ticket_classifier_finetuned.classify_ticket(query)

We can see in the leaderboard that our fine-tuned model performs well against our test data.

In [None]:
session.get_leaderboard()

Now we can move our model to production!

In [None]:
# Create a cursor object
cursor = snowflake_connection.cursor()

# Define the SQL query to fetch the data
query = "SELECT * FROM SUPPORT_TICKET_CLASSIFICATION_DB.SUPPORT_TICKET_SCHEMA.SUPPORT_TICKETS_GROUND_TRUTH"

In [None]:
# load production data

cursor.execute(query)
ground_truth_production = cursor.fetch_pandas_all()

# Close the cursor (the connection stays open for later cells)
cursor.close()

ground_truth_production.rename(columns={'TICKET_ID': 'query_id', 'REQUEST': 'query', 'LABEL': 'expected_response'},
                    inplace=True)

# persist data in trulens database so we can fetch it from here in the future
session.add_ground_truth_to_dataset(
    dataset_name="support_ticket_eval_groundtruth_production",
    ground_truth_df=ground_truth_production,
    dataset_metadata={"split": "production"},
)

ground_truth_production = session.get_ground_truth("support_ticket_eval_groundtruth_production")

ground_truth_production.head()

In [None]:
# move app to production
tru_support_ticket_classifier_finetuned = TruCustomApp(
    support_ticket_classifier_finetuned,
    app_name="Support Ticket Classifier",
    app_version="mistral 7b - finetuned (production)",
    feedbacks=[f_groundtruth]
)

# run app on production data
for query in ground_truth_production['query']:
    with tru_support_ticket_classifier_finetuned as recording:
        label, rendered_prompt = support_ticket_classifier_finetuned.classify_ticket(query)

In production, we've seen a drop in performance. We should examine the TruLens dashboard to learn more.



In [None]:
from trulens.dashboard import run_dashboard

run_dashboard(session=session)

What we find when we look through the dashboard is that a new type of ticket has started to show up in production. This is commonly known as data drift.

To combat this, we should collect labels for this new data and fine-tune the model further.
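
One simple way to confirm this kind of drift outside the dashboard is to compare the labels seen during evaluation with those seen in production. The sketch below is a minimal illustration on plain lists; `label_shares` and `unseen_labels` are hypothetical helpers, and in this notebook the inputs would come from the `expected_response` columns of the two ground truth sets.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def unseen_labels(reference_labels, production_labels):
    """Labels that appear in production but never appeared in the reference set."""
    return sorted(set(production_labels) - set(reference_labels))


eval_labels = ["Lost phone", "Add new line", "Lost phone", "Closing account"]
prod_labels = ["Lost phone", "Roaming fees", "Roaming fees", "Add new line"]

print(unseen_labels(eval_labels, prod_labels))
print(label_shares(prod_labels))
```

Any label returned by `unseen_labels` is a candidate for the additional labeling and fine-tuning described above.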

In [None]:
# more fine-tuning

# new app version with fine-tuned app

# run on same data

Amazing! We can now successfully label the support tickets describing issues with roaming fees.

In [None]:
session.get_leaderboard()