## Setup

To complete this guide, you will need to install the following packages:
- fireworks-ai
- numpy
- pandas
- pronouncing
- requests
- sentence-transformers
- transformers

You will also need:

- Fireworks account (https://fireworks.ai/)
- Fireworks API key
- The firectl command-line interface (https://docs.fireworks.ai/tools-sdks/firectl/firectl)
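
Before running the notebook, it can help to confirm that the packages above are importable. The snippet below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the import names are taken from the import cell of this notebook (`requests` is assumed to import under its own name).

```python
import importlib.util

# pip package name -> top-level import name
REQUIRED = {
    "fireworks-ai": "fireworks",
    "numpy": "numpy",
    "pandas": "pandas",
    "pronouncing": "pronouncing",
    "requests": "requests",
    "sentence-transformers": "sentence_transformers",
    "transformers": "transformers",
}

def missing_packages(required):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```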

In [2]:
import json

from fireworks.client import Fireworks
import numpy as np
import pandas as pd
import pronouncing
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
embeddings_model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True)

In [3]:
# Sign-in to your Fireworks account
!firectl signin

In [4]:
# Make sure you have the FIREWORKS_API_KEY environment variable set to your account's key!
# os.environ['FIREWORKS_API_KEY'] = 'XXX'

client = Fireworks()

# Replace the line below with your Fireworks account id
account_id = 'XXX'
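
Since the client expects `FIREWORKS_API_KEY` to be set in the environment, a small check can fail early with a clear message instead of on the first API call. This is a sketch only; `require_api_key` is a hypothetical helper and not part of the Fireworks SDK.

```python
import os

def require_api_key(env=None, name="FIREWORKS_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before creating the client.")
    return key
```

Calling `require_api_key()` just before `client = Fireworks()` would surface a missing key immediately.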

## Problem Definition: Poem Generation

*Note: The poem topics used in this example were synthetically generated by Claude 3 Opus*

LLMs are capable of performing creative writing tasks. However, assessing the quality of such tasks, like poetry generation, is highly subjective.

### Task
In last week's notebook, we created a framework to quantitatively evaluate LLM-generated poetry. This week, we look at how to further improve the poems that our model generates.

As I am not a professional poet, I am unable to write high-level poetry myself. The ideal approach would be to search the web for high-quality poems that match the style I want the LLM to generate, but this is very time-consuming. A more efficient approach is the "critique and revise" method: the LLM first generates critiques on how each poem can be improved, and we then ask the LLM to rewrite each poem based on its critiques. Finally, we fine-tune the LLM on the revised poems.
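
The critique and revise flow described above can be sketched independently of any particular API. In the sketch below, `llm` is a hypothetical callable that takes a system prompt and a user message and returns text, and `fake_llm` is a stand-in used only to make the example runnable; the prompts are placeholders, not the ones used later in this notebook.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (system_prompt, user_message) -> response text
LLM = Callable[[str, str], str]

def critique_and_revise(poems: List[str], llm: LLM) -> List[Dict[str, str]]:
    """Critique each poem, then revise it using its own critique."""
    records = []
    for poem in poems:
        critique = llm("List ways this poem could be improved.", poem)
        revised = llm(
            "Rewrite the poem to address the critiques.",
            f"poem:\n{poem}\n\ncritiques:\n{critique}",
        )
        records.append({"original": poem, "critique": critique, "revised": revised})
    return records

def fake_llm(system_prompt: str, user_message: str) -> str:
    """Stand-in model so the sketch runs without network access."""
    if system_prompt.startswith("List"):
        return "- Use a more vivid final image."
    # The poem is the second line of the revise message built above
    return "[revised] " + user_message.splitlines()[1]

records = critique_and_revise(["Roses are red"], fake_llm)
print(records[0]["revised"])
```

The revised poems, not the originals, are what would then be used as fine-tuning targets.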

### Data
The data can be found in the week-4 data folder.

We will use the following datasets:
- `./data/training_poem_topics.csv`
- `./data/test_poem_topics.csv`

Each of those datasets consists of 100 unique poem topics. 

In [11]:
training_data = pd.read_csv('data/training_poem_topics.csv')
test_data = pd.read_csv('data/test_poem_topics.csv')

### Foundation Model Baseline
Our first step is to generate a poem for each of the topics in the training data using a foundation model. We will then use the critique and revise method to improve upon these poems.

In [12]:
# Given a DataFrame with a 'topic' column, generate a poem for each topic
system_message = 'You are a professional poet. Write a unique and original contemporary poem about the topic suggested by the user. Your response should contain ONLY the content of the poem.'

def generate_poems(model, df):
    responses = []
    for topic in df['topic']:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": topic},
            ],
        )
        responses.append(response.choices[0].message.content)
    return responses

In [13]:
# We first generate poems for poetry topics in our training set
llama_70b_training_poems = generate_poems('accounts/fireworks/models/llama-v3-70b-instruct', training_data)

### Critique
In the critique step, we create a scoring rubric and ask an LLM to generate improvements to the previously created poems based on the rubric.

In [15]:
# We now use our scoring rubric to generate a list of critiques about each poem
poem_guidelines = """- Is the poem original?
- Does the poem contain beauty, power, education or entertainment?
- Is the message of the poem clear? Is it a good message, or is it of little value to anyone?
- Is the poem clear in its expression? Does it maintain coherence throughout?
- If the poem is written in rhyming verse, then it should be rated according to how well the rhymes fit, not only with each other, but with the flow and the intended nuance of meaning the verse demands.
- What form does the poem take? Is it a sonnet, free verse, haiku, etc.? How does the form contribute to the poem's impact?
- Does the poet use the best possible choice of words in the poem? A person can bawl, cry, sob, whimper, and shed tears, but which term would best fit the mood the poet is trying to convey?"""

poem_critique_rubric = f'''You are a professional poet responsible for assessing the quality of AI-generated poems.

Assessment Guidelines:
{poem_guidelines}

Given the above guidelines, provide a list of ways that the poem could be improved.'''

def critique_poems(poems, evaluation_model):
    critiques = []
    for poem in poems:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": poem_critique_rubric},
                {"role": "user", "content": poem}
            ],
        )
        # Append one critique per poem so that critiques stay aligned with poems
        critiques.append(response.choices[0].message.content)

    return critiques

In [16]:
llama_70b_training_critiques = critique_poems(llama_70b_training_poems, 'accounts/fireworks/models/llama-v3-70b-instruct')

### Revise
In the revise step, we create a new prompt that tells the LLM to generate a revised poem, given the previously generated critiques.

In [17]:
# We now give the LLM both the poem and the critiques, and tell it to improve the poem based on the following critiques.
improvement_sys_message = '''You are a professional poet. Improve the poem, given the following critiques.

Your response must ONLY contain the content of the improved poem. DO NOT TELL ME YOUR CHANGES, JUST GIVE ME THE REVISED POEM!'''

def generate_improved_poems(model, poems, critiques):
    responses = []
    # poems and critiques are expected to be aligned, one critique per poem
    for poem, critique in zip(poems, critiques):
        user_message = f'''poem:
{poem}

critiques:
{critique}'''

        response = client.chat.completions.create(
            model=model,
            messages=[
              {"role": "system", "content": improvement_sys_message},
              {"role": "user", "content": user_message}
            ],
        )
        responses.append(response.choices[0].message.content)

    return responses

In [18]:
llama_70b_training_improved_poems = generate_improved_poems('accounts/fireworks/models/llama-v3-70b-instruct', llama_70b_training_poems, llama_70b_training_critiques)

### Fine-Tuning
We now fine-tune a smaller LLM using the revised poems. This is similar to the knowledge distillation method from last week's notebook, except that we are fine-tuning on the larger model's revised poems rather than on the poems it originally generated.

In [21]:
# Format the improved poems as chat records for our fine-tuning dataset
def format_poem_for_fireworks(topic, poem):
    return {"messages": [
        {"role": "system", "content": system_message}, 
        {"role": "user", "content": topic}, 
        {"role": "assistant", "content": poem}
    ]}

topics = training_data['topic'].tolist()
json_objs = [
    format_poem_for_fireworks(topic, poem)
    for topic, poem in zip(topics, llama_70b_training_improved_poems)
]

dataset_file_name = 'poem_training_data.jsonl'
dataset_id = 'improved-poem-data-v1'
with open(dataset_file_name, 'w') as f:
    for obj in json_objs:
        json.dump(obj, f)
        f.write('\n')
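
Before uploading, it may be worth checking that every line of the JSONL file has the shape written above: a `messages` list with system, user, and assistant entries in that order, each with non-empty text. The sketch below only checks that shape; it is not Fireworks' own validation, and `validate_chat_jsonl` is a hypothetical helper.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means all checks passed."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append((line_number, "missing or malformed 'messages' list"))
            continue
        roles = [m.get("role") for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append((line_number, f"unexpected roles: {roles}"))
        elif not all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages):
            problems.append((line_number, "empty or non-string content"))
    return problems
```

For example, `validate_chat_jsonl(open(dataset_file_name))` would return an empty list if every record passes these checks.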

In [23]:
# Upload our dataset to fireworks
!firectl create dataset {dataset_id} {dataset_file_name}

In [25]:
# Create a fine-tuning job
!firectl create fine-tuning-job --settings-file poem_generation_fine_tuning_config.yaml --display-name improved-poems-v1 --dataset {dataset_id} 

In [26]:
# NOTE THAT THIS ID WILL CHANGE WHEN YOU RUN THE FINE-TUNING JOB ON YOUR ACCOUNT!!!
# The model id is printed in the stdout of the cell above as Name: accounts/{account_id}/fineTuningJobs/{ft_model_id}
ft_model_id = '3dd6bdfb938546d88a7db95673124266' 

In [30]:
# Wait until the State of the fine-tuning job is listed as COMPLETED (~10-20 minutes)
!firectl get fine-tuning-job {ft_model_id}

### Evaluation
Finally, we evaluate the fine-tuned model on our test data. In the previous week's notebook, the knowledge distillation method resulted in an average LLM judge score of 8.21. We expect a higher score now that we are fine-tuning on the revised poems rather than the initial poems that the large model generated.

In [32]:
# Deploy the fine-tuned model
!firectl deploy {ft_model_id}

In [35]:
# Wait until the Deployed Model Refs lists the state of the model as "DEPLOYED" (~5-20 minutes).
!firectl get model {ft_model_id}

In [36]:
# Generate poems on the test set using our fine-tuned model
ft_poems = generate_poems(f'accounts/{account_id}/models/{ft_model_id}', test_data)

In [48]:
# Evaluate poems based on their average length (# of characters)
def calculate_avg_length(poems):
    return int(np.mean([len(poem) for poem in poems]))

# Evaluate poems based on the fraction of stanzas that contain a rhyme
def calculate_rhyming_fct(poem):
    # Drop empty stanzas (e.g. those produced by extra blank lines)
    stanzas = [stanza for stanza in poem.split('\n\n') if stanza.strip()]
    if not stanzas:
        # Avoid dividing by zero when the response contains no text
        print(f'No stanzas found in poem: {poem!r}')
        return 0.0

    num_rhyming_stanzas = 0
    for stanza in stanzas:
        lines = stanza.split('\n')
        end_words = [line.split(' ')[-1].strip('.?!"\',') for line in lines]
        # A stanza counts as rhyming if any pair of its line-ending words rhyme
        found_rhyme = any(
            end_words[j] in pronouncing.rhymes(end_words[i])
            for i in range(len(end_words))
            for j in range(i + 1, len(end_words))
        )
        if found_rhyme:
            num_rhyming_stanzas += 1

    return num_rhyming_stanzas / len(stanzas)

# Evaluate poems based on how often they have a positive sentiment
def has_positive_sentiment(poem):
    try:
        sentiment = sentiment_pipeline(poem)[0]
        return sentiment['label'] == 'POSITIVE'
    except Exception:
        # NOTE: poems that the pipeline fails to score are counted as positive,
        # which can inflate this metric if failures are common
        return True

In [49]:
# Calculate heuristics of our fine-tuned poems
print("Heuristic Evaluation")
print(f'Average Length: {calculate_avg_length(ft_poems)}')
print(f"Rhyming Pct: {int(100 * np.mean([calculate_rhyming_fct(poem) for poem in ft_poems]))}%")
print(f"Positive Sentiment: {int(100 * np.mean([has_positive_sentiment(poem) for poem in ft_poems]))}%")

Heuristic Evaluation
Average Length: 1255
Rhyming Pct: 73%
Positive Sentiment: 92%
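
To make the rhyming heuristic easier to inspect, the sketch below shows which line-ending words each stanza contributes, without the `pronouncing` lookup. `stanza_end_words` is a hypothetical helper, and unlike the function above it lowercases the words and splits on any whitespace.

```python
def stanza_end_words(poem):
    """Split a poem into stanzas and return the last word of each non-empty line."""
    result = []
    for stanza in poem.split("\n\n"):
        lines = [line for line in stanza.split("\n") if line.strip()]
        if lines:
            result.append([line.split()[-1].strip('.?!"\',').lower() for line in lines])
    return result

poem = "The sun is low,\nThe rivers flow.\n\nA quiet night"
print(stanza_end_words(poem))  # [['low', 'flow'], ['night']]
```

Here the first stanza would count as rhyming if "flow" is among the rhymes returned for "low", while the single-line second stanza can never contain a rhyming pair.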


In [51]:
# Evaluate poems using the LLM as a Judge strategy
poem_evaluation_rubric = f'''You are a professional poet responsible for assessing the quality of AI-generated poems.

Score each poem on a scale of 0 to 10, where 10 represents the best possible poem.

Scoring Guidelines:
{poem_guidelines}

Think through your reasoning step-by-step and explain your reasoning. Steps for judging a poem:
1. Read the Poem Multiple Times: Read it aloud and silently to capture both the meaning and the sound.
2. Take Notes: Jot down initial impressions, notable phrases, and any questions that arise.
3. Analyze the Elements: Break down the poem into its components (content, structure, language, sound).
4. Reflect on Your Experience: Consider your emotional response and personal connection to the poem.

The last line in your response MUST be a json object {{"score": XXX}}, where XXX is the score you are giving the response.'''

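def evaluate_poems(poems, evaluation_model):
    scores = []
    for poem in poems:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": poem_evaluation_rubric},
                {"role": "user", "content": poem}
            ],
            temperature=0,
        )
        content = response.choices[0].message.content

        try:
            # The rubric asks for a JSON object such as {"score": 8} on the last line
            score = int(json.loads(content.strip().split('\n')[-1])['score'])
            scores.append(score)
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Skip responses whose last line is not the expected JSON object
            continue

    if not scores:
        raise ValueError('The judge model returned no parseable scores')
    return sum(scores) / len(scores)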

In [52]:
# Use the LLM to evaluate our fine-tuned model
ft_avg_score = evaluate_poems(ft_poems, 'accounts/fireworks/models/llama-v3-70b-instruct')
print(f"Avg LLM Judge Score: {round(ft_avg_score, 2)}")

Avg LLM Judge Score: 8.32
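
Because poems whose judge response cannot be parsed are skipped, the average can be computed over fewer poems than were generated. A more tolerant parser is one way to reduce skipped responses. The sketch below is an assumption about what such a parser could look like, not what produced the score above; `extract_score` is a hypothetical helper that takes the last `{"score": N}` object found anywhere in the response.

```python
import re

# Matches a JSON-style object such as {"score": 8} or {"score": 7.5}
SCORE_PATTERN = re.compile(r'\{\s*"score"\s*:\s*(-?\d+(?:\.\d+)?)\s*\}')

def extract_score(response):
    """Return the last score in the response as a float, or None if absent or out of range."""
    matches = SCORE_PATTERN.findall(response)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 0 <= score <= 10 else None
```

Counting how many responses return `None` alongside the average would show how many poems the reported score actually covers.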


In [54]:
# Undeploy the fine-tuned model (does not cost anything extra, but Fireworks may limit your number of deployed models).
!firectl undeploy {ft_model_id}