## Setup

To follow this guide, install the following packages:
- fireworks-ai
- numpy
- pandas
- pronouncing
- requests
- sentence-transformers
- transformers

You will also need:

- Fireworks account (https://fireworks.ai/)
- Fireworks API key
- The firectl command-line interface (https://docs.fireworks.ai/tools-sdks/firectl/firectl)

In [3]:
!pip install fireworks-ai numpy pandas pronouncing requests sentence-transformers transformers

In [52]:
import json
import time

from fireworks.client import Fireworks
import numpy as np
import pandas as pd
import pronouncing
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
embeddings_model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True)

In [6]:
# Sign in to your Fireworks account
!firectl signin

In [7]:
# Make sure the FIREWORKS_API_KEY environment variable is set to your account's API key.
# import os; os.environ['FIREWORKS_API_KEY'] = 'XXX'

client = Fireworks()

# Replace the line below with your Fireworks account id
account_id = 'XXX'

## Problem Definition: Poem Generation

*Note: The poem topics used in this example were synthetically generated by Claude 3 Opus*

LLMs are capable of performing creative writing tasks. However, assessing the quality of a creative writing task, such as poetry generation, is highly subjective.

### Task
We will create an evaluation framework to assess the quality of poetry generated by an LLM. We will then fine-tune a model using knowledge distillation, that is, fine-tuning a smaller "student" model on output from a larger "teacher" model, and measure the improvement with our evaluation framework.

#### Data

The data can be found in the week-3 `data` folder.

We will use the following datasets:
- `./data/training_poem_topics.csv`
- `./data/test_poem_topics.csv`

Each of those datasets consists of 100 unique poem topics. Our first step is to generate a poem for each of these topics using a base model.

In [68]:
# Given a CSV file with a list of topics, generates a poem for each topic
system_message = 'You are a professional poet. Write a unique and original contemporary poem about the topic suggested by the user. Your response should contain ONLY the content of the poem.'

def generate_poems(model, csv_file):
    responses = []
    df = pd.read_csv(csv_file)
    # Each row holds one poem topic in the 'topic' column
    for topic in df['topic']:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": topic}
            ],
        )
        responses.append(response.choices[0].message.content)

    return responses

In [11]:
# Generates poems for each of the 100 test topics using the base 8B model
llama_8b_poems = generate_poems('accounts/fireworks/models/llama-v3-8b-instruct', 'data/test_poem_topics.csv')

### Heuristic Evaluation
In class, we discussed using a heuristic-based evaluation approach when the quality of results is subjective. This method involves creating heuristics that align with your desired assessment criteria and evaluating the results based on these metrics.

For this exercise, I've developed the following heuristics to assess our poems:

- Average length (number of characters)
- Rhyming percentage (average percentage of stanzas per poem that contain at least one rhyme)
- Positive sentiment percentage (percentage of poems with positive sentiment)

In [53]:
# Evaluate poems based on their average length (# of characters)
def calculate_avg_length(poems):
    return int(np.mean([len(poem) for poem in poems]))

# Evaluate a poem based on the fraction of its stanzas that contain a rhyme
def calculate_rhyming_fct(poem):
    stanzas = poem.split('\n\n')
    # Only stanzas with at least two lines can contain a rhyme
    stanzas = [stanza for stanza in stanzas if len(stanza.split('\n')) > 1]
    if not stanzas:
        # Avoid dividing by zero when a poem has no multi-line stanzas
        return 0.0

    num_rhyming_stanzas = 0
    for stanza in stanzas:
        lines = stanza.split('\n')
        # Lowercase the end words, since pronouncing.rhymes returns lowercase words
        end_words = [line.split(' ')[-1].strip('.?!"\',').lower() for line in lines]
        found_rhyme = False
        for i in range(len(end_words)):
            rhymes = set(pronouncing.rhymes(end_words[i]))
            # Stop at the first pair of line endings that rhyme
            if any(end_words[j] in rhymes for j in range(i + 1, len(end_words))):
                found_rhyme = True
                break

        if found_rhyme:
            num_rhyming_stanzas += 1

    return num_rhyming_stanzas / len(stanzas)

# Evaluate poems based on how often they have a positive sentiment
def has_positive_sentiment(poem):
    sentiment = sentiment_pipeline(poem)[0]
    return sentiment['label'] == 'POSITIVE'
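
To make the rhyming heuristic easier to follow, here is a minimal sketch of its preprocessing step only: splitting a poem into stanzas and extracting the cleaned last word of each line. The helper name `stanza_end_words` and the sample poem are hypothetical and not part of the notebook. The sketch deliberately leaves out the rhyme lookup itself so that it runs without the `pronouncing` package.

```python
# Hypothetical two-stanza sample, used only to illustrate the preprocessing
sample_poem = 'The sun is high,\nAcross the sky.\n\nA single line'

def stanza_end_words(poem):
    """Return the cleaned last word of each line, grouped by multi-line stanza."""
    result = []
    for stanza in poem.split('\n\n'):
        lines = stanza.split('\n')
        if len(lines) > 1:  # a single-line stanza has no pair of lines to rhyme
            result.append(
                [line.split(' ')[-1].strip('.?!"\',').lower() for line in lines]
            )
    return result

end_words = stanza_end_words(sample_poem)
print(end_words)  # [['high', 'sky']]
```

Each inner list is what the heuristic then checks pairwise for rhymes, so trailing punctuation that is not in the strip set (for example a semicolon) would prevent a match.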

In [54]:
# Calculate heuristics of the poems generated by our base model
print("Heuristic Evaluation")
print(f'Average Length: {calculate_avg_length(llama_8b_poems)}')
print(f"Rhyming Pct: {int(100 * np.mean([calculate_rhyming_fct(poem) for poem in llama_8b_poems]))}%")
print(f"Positive Sentiment: {int(100 * np.mean([has_positive_sentiment(poem) for poem in llama_8b_poems]))}%")

Heuristic Evaluation
Average Length: 974
Rhyming Pct: 93%
Positive Sentiment: 83%


### LLM as a Judge
Another evaluation method we discussed in class is using an LLM as a judge. This approach uses a high-quality LLM to assess the generated results. It can work well because evaluating a piece of content is often an easier task for an LLM than generating it.

To implement this method, you need to create a scoring rubric (i.e., "constitution") to guide the LLM in evaluating the results. The LLM will use this rubric to score each poem on a scale from 0 to 10.

In [67]:
# Now, we evaluate poems using our scoring rubric (i.e. "constitution")
poem_guidelines = """- Is the poem original?
- Does the poem contain beauty, power, education or entertainment?
- Is the message of the poem clear? Is it a good message, or is it of little value to anyone?
- Is the poem clear in its expression? Does it maintain coherence throughout?
- If the poem is written in rhyming verse, then it should be rated according to how well the rhymes fit, not only with each other, but with the flow and the intended nuance of meaning the verse demands.
- What form does the poem take? Is it a sonnet, free verse, haiku, etc.? How does the form contribute to the poem's impact?
- Does the poet use the best possible choice of words in the poem? A person can bawl, cry, sob, whimper, and shed tears, but which term would best fit the mood the poet is trying to convey?"""

poem_evaluation_rubric = f'''You are a professional poet responsible for assessing the quality of AI-generated poems.

Score each poem on a scale of 0 to 10, where 10 represents the best possible poem.

Scoring Guidelines:
{poem_guidelines}

Think through your reasoning step-by-step and explain your reasoning. Steps for judging a poem:
1. Read the Poem Multiple Times: Read it aloud and silently to capture both the meaning and the sound.
2. Take Notes: Jot down initial impressions, notable phrases, and any questions that arise.
3. Analyze the Elements: Break down the poem into its components (content, structure, language, sound).
4. Reflect on Your Experience: Consider your emotional response and personal connection to the poem.

The last line in your response MUST be a json object {{"score": XXX}}, where XXX is the score you are giving the response.'''

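def evaluate_poems(poems, evaluation_model):
    scores = []
    for poem in poems:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": poem_evaluation_rubric},
                {"role": "user", "content": poem}
            ],
            temperature=0,
        )

        try:
            content = response.choices[0].message.content
            # The rubric asks for a JSON object such as {"score": 8} on the last line
            score = int(json.loads(content.strip().split('\n')[-1])['score'])
            scores.append(score)
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Skip responses whose last line is not a parseable score
            continue

    # Average over the poems that received a valid score
    return sum(scores) / len(scores) if scores else float('nan')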
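
Judge models do not always follow the instruction to put the JSON object on the very last line. As a minimal sketch, assuming we are willing to accept a `{"score": ...}` object anywhere in the response, a more tolerant parser could look like the following. The helper name `extract_score` is hypothetical, and the range check simply reflects the 0 to 10 scale stated in the rubric.

```python
import json
import re

def extract_score(response_text):
    """Return the integer score from a judge response, or None if none is found."""
    # Find flat JSON objects that mention a "score" key and use the last one
    matches = re.findall(r'\{[^{}]*"score"[^{}]*\}', response_text)
    if not matches:
        return None
    try:
        score = int(json.loads(matches[-1])['score'])
    except (ValueError, KeyError, TypeError):
        # Covers malformed JSON and non-numeric score values
        return None
    # Reject values outside the 0-10 scale defined by the rubric
    return score if 0 <= score <= 10 else None

print(extract_score('The imagery is vivid.\n{"score": 8}\n'))
```

Returning `None` instead of raising lets the caller count how many responses were skipped, which is worth tracking because skipped poems are silently excluded from the average.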

In [71]:
# We score the poems using our judge model
llm_judge_model = 'accounts/fireworks/models/llama-v3-70b-instruct'
llama_8b_avg_score = evaluate_poems(llama_8b_poems, llm_judge_model)

print(f'Avg LLM Judge Score: {round(llama_8b_avg_score, 2)}')

Avg LLM Judge Score: 8.12


### Knowledge Distillation
One approach to generating data for fine-tuning a model is knowledge distillation. This technique transfers knowledge from a large model to a smaller one: you generate responses relevant to your use case with the larger model, build a fine-tuning dataset from those responses, and fine-tune the smaller model on that dataset. In this example, we will use responses from a 70B model to fine-tune an 8B model. We will then use our evaluation framework to assess the quality of our fine-tuned model.

In [19]:
# Next, we generate poems with the 70B model for 100 training topics, which are different from the topics in our test set.
llama_70b_training_poems = generate_poems('accounts/fireworks/models/llama-v3p1-70b-instruct', 'data/training_poem_topics.csv')

In [25]:
# Format the 70B model's poems as chat messages and write them to a JSONL file for fine-tuning
def format_poem_for_fireworks(topic, poem):
    return {"messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": topic},
        {"role": "assistant", "content": poem}
    ]}

topics = pd.read_csv('data/training_poem_topics.csv')['topic'].tolist()
# Pair each training topic with the poem generated for it
json_objs = [format_poem_for_fireworks(topic, poem)
             for topic, poem in zip(topics, llama_70b_training_poems)]

dataset_file_name = 'poem_training_data.jsonl'
dataset_id = 'poem-data-v1'
with open(dataset_file_name, 'w') as f:
    for obj in json_objs:
        json.dump(obj, f)
        f.write('\n')
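
Before uploading, it can help to read the file back and confirm that every line has the shape written above. The following is a minimal sketch of such a check. The helper name `validate_chat_jsonl` and the example record are hypothetical, and the check only mirrors the system/user/assistant layout this notebook writes; it is not a statement of what Fireworks itself requires.

```python
import json
import os
import tempfile

def validate_chat_jsonl(path):
    """Check that each line is a JSON object with system, user, assistant messages."""
    expected_roles = ['system', 'user', 'assistant']
    num_records = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = [message['role'] for message in record['messages']]
            if roles != expected_roles:
                raise ValueError(f'line {line_number}: unexpected roles {roles}')
            if not all(message['content'].strip() for message in record['messages']):
                raise ValueError(f'line {line_number}: empty message content')
            num_records += 1
    return num_records

# Demonstration on a throwaway file containing one hypothetical record
example = {"messages": [
    {"role": "system", "content": "You are a professional poet."},
    {"role": "user", "content": "autumn rain"},
    {"role": "assistant", "content": "Rain taps the window, patient and gray."},
]}
with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as tmp:
    tmp.write(json.dumps(example) + '\n')
num_valid = validate_chat_jsonl(tmp.name)
os.remove(tmp.name)
print(num_valid)  # 1
```

In the notebook, the same function could be called as `validate_chat_jsonl(dataset_file_name)`, where a return value of 100 would match the number of training topics.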

In [27]:
# Upload our dataset to fireworks
!firectl create dataset {dataset_id} {dataset_file_name}

In [40]:
# Create a fine-tuning job
!firectl create fine-tuning-job --settings-file poem_generation_fine_tuning_config.yaml --display-name poem-generation-v1 --dataset {dataset_id}

In [35]:
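# Replace with the id reported for your own fine-tuning job (used below for both the job and the resulting model)
ft_model_id = 'd6e172f463a44907942f0c885af2e192'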

In [41]:
# Wait until the State of the fine-tuning job is listed as COMPLETED (~10-20 minutes)
!firectl get fine-tuning-job {ft_model_id}

In [45]:
# Deploy our fine-tuned model
!firectl deploy {ft_model_id}

In [44]:
# Wait until the Deployed Model Refs section lists the state of the model as "DEPLOYED" (~5-20 minutes).
!firectl get model {ft_model_id}

In [46]:
# Generate poems on the test set using our fine-tuned model
ft_poems = generate_poems(f'accounts/{account_id}/models/{ft_model_id}', 'data/test_poem_topics.csv')

In [50]:
# Calculate heuristics of our fine-tuned poems
print("Heuristic Evaluation")
print(f'Average Length: {calculate_avg_length(ft_poems)}')
print(f"Rhyming Pct: {int(100 * np.mean([calculate_rhyming_fct(poem) for poem in ft_poems]))}%")
print(f"Positive Sentiment: {int(100 * np.mean([has_positive_sentiment(poem) for poem in ft_poems]))}%")

Heuristic Evaluation
Average Length: 994
Rhyming Pct: 91%
Positive Sentiment: 88%


In [70]:
# Use the same LLM judge to evaluate our fine-tuned model
ft_avg_score = evaluate_poems(ft_poems, llm_judge_model)
print(f"Avg LLM Judge Score: {round(ft_avg_score, 2)}")

Avg LLM Judge Score: 8.21
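
To compare the two models at a glance, the metrics printed in this notebook can be placed side by side. The sketch below hard-codes the values from the outputs above for the 100 test topics; the dictionary keys are hypothetical labels chosen for illustration. It only computes differences and does not test whether differences of this size are meaningful.

```python
# Metrics copied from this notebook's printed outputs (100 test topics)
base = {'avg_length': 974, 'rhyming_pct': 93, 'positive_pct': 83, 'judge_score': 8.12}
fine_tuned = {'avg_length': 994, 'rhyming_pct': 91, 'positive_pct': 88, 'judge_score': 8.21}

# Difference between the fine-tuned model and the base model for each metric
deltas = {name: round(fine_tuned[name] - base[name], 2) for name in base}

for name, delta in deltas.items():
    print(f'{name:>14}: {base[name]:>7} -> {fine_tuned[name]:>7} ({delta:+})')
```

Read this way, the fine-tuned model's poems are slightly longer and more often positive, rhyme in slightly fewer stanzas, and score 0.09 points higher with the judge.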


In [73]:
# Undeploy the fine-tuned model (does not cost anything extra, but Fireworks may limit your number of deployed models).
!firectl undeploy {ft_model_id}