<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Blog_Post_Writer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Blog Post Writer
This colab shows how Pi can help build a blog post writer that matches the tone and style of an existing blog. For this demonstration, we use posts from machinelearningmastery.com as inspiration. A condensed version of these posts was scraped and uploaded to Hugging Face for this colab: [withpi/machinelearningmastery_com_blogs_condensed](https://huggingface.co/datasets/withpi/machinelearningmastery_com_blogs_condensed)

Here is the overall flow of the colab:

1.   First, we create a Pi scoring system for the blog post writer.
2.   Then we evaluate a prompted model against this scoring system.
3.   Next, we fine-tune a model, using Pi to pick high-quality blogs from the Hugging Face dataset above.
4.   Finally, we use the Pi scoring system to evaluate the fine-tuned model against the prompted model and check whether it improved.




# Install packages and utility functions
Here we install the Pi SDK and import a few additional helpers for this use case: a dataset utility, plus functions that print scores and side-by-side comparisons more legibly.

In [2]:
# @title Install necessary packages
%%capture
%pip install withpi withpi-utils
%pip install datasets
%pip install litellm
%pip install httpx jinja2 tqdm

In [3]:
# @title Initialize PiClient
import os
from google.colab import files, userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')


client = PiClient()

# Define your scoring system
This is where we define the criteria used to assess and guide the quality of our blog post generation, using the Scoring System functions in the Pi SDK. We'll focus on two categories of quality:


*   **Content structure:** is the post easy to digest and engaging, and does it guide the user to additional resources?
*   **Technical communication:** does the post use effective code examples and call out potential implementation pitfalls or mistakes?



In [4]:
# @title Initialize the Pi scoring system from a JSON description
from withpi.types import ScoringSpec
from withpi_utils.colab import display_scoring_spec

blog_writer_scoring_spec_json = """
{
  "description": "A streamlined rubric for evaluating technical blog post quality",
  "name": "Technical Blog Post Quality Assessment",
  "dimensions": [
    {
      "description": "Evaluates the content structure of the blog post",
      "label": "Content Structure",
      "sub_dimensions": [
        {
          "description": "Are there visual breaks (images, code snippets, bullet points) to break up the text?",
          "label": "Visual breaks",
          "scoring_type": "PI_SCORER",
          "weight": 1
        },
        {
          "description": "Does the blog post address the reader in second person (you, your etc.)?",
          "label": "Second person",
          "scoring_type": "PI_SCORER",
          "weight": 1
        },
        {
          "description": "Does the post include links to additional resources or references?",
          "label": "Additional resources",
          "scoring_type": "PI_SCORER",
          "weight": 1
        },
        {
          "description": "Are there consistent section headings throughout the post?",
          "label": "Section headings",
          "scoring_type": "PI_SCORER",
          "weight": 1
        }
      ],
      "weight": 1
    },
    {
      "description": "Evaluates the technical communication of the blog post",
      "label": "Technical Communication",
      "sub_dimensions": [
        {
          "description": "Are code examples included where relevant?",
          "label": "Code inclusion",
          "scoring_type": "PI_SCORER",
          "weight": 1
        },
        {
          "description": "Does the post explain the code snippets when they are included?",
          "label": "Code explanation",
          "scoring_type": "PI_SCORER",
          "weight": 1
        },
        {
          "description": "Does the post call out potential pitfalls or common mistakes?",
          "label": "Pitfalls",
          "scoring_type": "PI_SCORER",
          "weight": 1
        }
      ],
      "weight": 1
    }
  ]
}
"""
blog_writer_scoring_spec = ScoringSpec.model_validate_json(blog_writer_scoring_spec_json)

display_scoring_spec(blog_writer_scoring_spec)
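
Every dimension and sub-dimension in the spec above carries a `weight` of 1. The sketch below illustrates one plausible way such weights could combine scores: a weighted mean at each level. This is an assumption made for illustration, not a description of how the Pi service actually aggregates scores, and the helper name `weighted_mean` and the sample sub-dimension scores are hypothetical.

```python
def weighted_mean(scores_and_weights):
    """Return the weighted mean of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in scores_and_weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(score * weight for score, weight in scores_and_weights)
    return weighted_sum / total_weight


# Hypothetical sub-dimension scores, each with weight 1 as in the spec above.
content_structure = weighted_mean([(1.0, 1), (0.594, 1), (1.0, 1), (1.0, 1)])
technical_communication = weighted_mean([(1.0, 1), (0.77, 1), (0.412, 1)])

# Assumed roll-up: the total is the weighted mean of the dimension scores.
overall = weighted_mean([(content_structure, 1), (technical_communication, 1)])
print(f"content={content_structure:.4f} technical={technical_communication:.4f} overall={overall:.4f}")
```

Under this reading, raising a sub-dimension's `weight` makes it count proportionally more toward its parent dimension.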

# Try Generating Blog Posts
Now that we have a scoring system, let's assess how well a prompted model generates blog posts. We will:


1. Define a system prompt
2. Prompt a Llama model to generate responses for a set of user prompts
3. Use our scoring system to compare the generated outputs against a set of actual [blog posts from MachineLearningMastery.com that we'd previously scraped and stored in HuggingFace](https://huggingface.co/datasets/withpi/mlmastery_com_blogs_condensed_merged)
4. Manually inspect some of the differences between the two



In [5]:
# @title Define a system prompt for a blog post generator
system_prompt_for_blog_writer = """
You are a specialized blog post writer. Given a topic, write a technical blog post. Here are specific instructions:
- Make sure that the blog is approximately under 500 words
- The blog should be technical in nature with clear instructions
"""

In [6]:
# @title Define a blog post generator
import litellm
import asyncio

async def generate_blogs(topics, system_prompt, model_id, api_base, api_key, concurrency_limit=5):
    """Generate blogs for all topics with TaskGroup and rate limiting"""
    # Create a semaphore to limit concurrent API calls
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def generate_single_blog(topic, index):
        """Process a single blog generation with rate limiting"""
        async with semaphore:
            try:
                response = await litellm.acompletion(
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": topic},
                    ],
                    model=model_id,
                    api_base=api_base,
                    api_key=api_key,
                    temperature=0.2,
                )
                generated_blog = response.choices[0].message.content
                print(f"Generated a blog for topic# {index}: {topic}")
                return generated_blog
            except Exception as e:
                print(f"Error generating blog for topic# {index}: {e}")
                return f"Error: {str(e)}"

    generated_blogs = []

    # Using TaskGroup for cleaner task management
    async with asyncio.TaskGroup() as tg:
        tasks = [
            tg.create_task(generate_single_blog(topic, i + 1))
            for i, topic in enumerate(topics)
        ]

    # Collect results in the same order as topics
    for task in tasks:
        generated_blogs.append(task.result())

    print("Done generating blogs!!")
    return generated_blogs
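
The generator above bounds the number of in-flight API calls with an `asyncio.Semaphore`. The following is a minimal offline sketch of that rate-limiting pattern, with the LLM call replaced by a hypothetical `fake_generate` coroutine so it runs without any API key. It uses `asyncio.gather` instead of `asyncio.TaskGroup` so that it also runs on Python versions older than 3.11; `run_with_limit` and `demo` are illustrative names, not part of the Pi SDK or litellm.

```python
import asyncio


async def run_with_limit(items, worker, concurrency_limit=5):
    """Run worker(item) for every item, at most concurrency_limit at a time."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def guarded(item):
        # Each call waits here until one of the semaphore slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def demo():
    """Return the generated results and the peak number of concurrent calls."""
    active = 0
    peak = 0

    async def fake_generate(topic):
        # Stand-in for an LLM request; tracks how many calls overlap.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return f"blog about {topic}"

    topics = [f"topic {i}" for i in range(12)]
    blogs = await run_with_limit(topics, fake_generate, concurrency_limit=3)
    return blogs, peak
```

In a notebook you would call `await demo()` directly; in a plain script, `asyncio.run(demo())`. The returned peak never exceeds the limit, and the results come back in topic order even though the calls finish at different times.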

In [7]:
# @title [1 min] Generate blogs using the base (not fine-tuned) model for evaluation
from datasets import load_dataset
from google.colab import userdata

ds = load_dataset("withpi/mlmastery_com_blogs_condensed_merged", split="test")
topics = ds["topic"]
actual_blogs = ds["blog"]

# Generate the blogs using the base Llama 3.1 8B Instruct model (no fine-tuning)
loop = asyncio.get_running_loop()
generated_blogs = await loop.create_task(
    generate_blogs(
        topics,
        model_id="fireworks_ai/llama-v3p1-8b-instruct",
        api_key=userdata.get("FIREWORKS_API_KEY"),
        api_base = None,
        system_prompt=system_prompt_for_blog_writer
    )
)

Generated a blog for topic# 4: Evaluating RAG Systems: An Overview of RAGAs and Other Frameworks for Measuring Retrieval Augmented Generation Performance
Generated a blog for topic# 2: Topic: 7 Beginner-Friendly Machine Learning Projects for Hands-On Experience - From Titanic Survival Prediction to Face Detection
Generated a blog for topic# 6: Best practices for efficient machine learning model deployment, including optimization, containerization, CI/CD implementation, performance monitoring, and security considerations
Generated a blog for topic# 1: InterviewAce: 365 Data Science's Free AI-Powered Tool for Data Science Interview Preparation
Generated a blog for topic# 3: A comprehensive guide to Hugging Face's Model Hub and Community platform, including repository management, search functionality, API integration, and community resources for machine learning practitioners.
Generated a blog for topic# 5: 7-Day Mini-Course on Practical Data Science: From Linear Regression to Random Fore

In [20]:
# @title Compare the generated blogs against actual blogs using the Pi scoring system
from tqdm import tqdm
import pandas as pd

scores = []
actual_scores = []
generated_scores = []
for topic, actual_blog, generated_blog in tqdm(zip(topics, actual_blogs, generated_blogs)):
  actual_score = client.scoring_system.score(
      llm_input=topic,
      llm_output=actual_blog,
      scoring_spec=blog_writer_scoring_spec)
  generated_score = client.scoring_system.score(
      llm_input=topic,
      llm_output=generated_blog,
      scoring_spec=blog_writer_scoring_spec)
  actual_scores.append(actual_score)
  generated_scores.append(generated_score)
  score = {'topic': topic, 'actual': actual_score.total_score, 'generated': generated_score.total_score}
  scores.append(score)

df = pd.DataFrame(scores)
df["score_diff"] = df["actual"] - df["generated"]  # Compute score differential
df

78it [00:47,  1.66it/s]


| | topic | actual | generated | score_diff |
|---|---|---|---|---|
| 0 | InterviewAce: 365 Data Science's Free AI-Power... | 0.447750 | 0.223929 | 0.223821 |
| 1 | Topic: 7 Beginner-Friendly Machine Learning Pr... | 0.468502 | 0.250048 | 0.218454 |
| 2 | A comprehensive guide to Hugging Face's Model ... | 0.584229 | 0.251727 | 0.332502 |
| 3 | Evaluating RAG Systems: An Overview of RAGAs a... | 0.510825 | 0.157556 | 0.353270 |
| 4 | 7-Day Mini-Course on Practical Data Science: F... | 0.775716 | 0.454818 | 0.320898 |
| ... | ... | ... | ... | ... |
| 73 | Packaging and Deploying Python Projects: From ... | 0.792928 | 0.371519 | 0.421409 |
| 74 | Creating and Customizing Dataset Classes in Py... | 0.598328 | 0.314063 | 0.284266 |
| 75 | Implementing Dropout Regularization in PyTorch... | 0.805664 | 0.488615 | 0.317049 |
| 76 | Building and Training a Single Layer Neural Ne... | 0.726562 | 0.422554 | 0.304009 |


In [9]:
# @title Manually inspect actual and generated blogs with scores
from withpi_utils.colab import pretty_print_responses

def pretty_print_blog(i):
  pretty_print_responses(
      response1 = actual_blogs[i].strip("\"").replace("\\n", "\n"),
      response2 = generated_blogs[i].strip("\"").replace("\\n", "\n"),
      header="##### Topic: \n" + topics[i].strip("\"").replace("\\n", "\n"),
      left_label="Base (actual)",
      right_label="Test (generated)",
      scores_left=actual_scores[i],
      scores_right=generated_scores[i])

# Find top 3 cases with highest score differential and inspect them manually
for i in (df.nlargest(3, "score_diff").index.to_list()):
  pretty_print_blog(i)
  print("\n\n")

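**Base (actual)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.898 |
| | Visual breaks | 1.0 |
| | Second person | 0.594 |
| | Additional resources | 1.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.727 |
| | Code inclusion | 1.0 |
| | Code explanation | 0.77 |
| | Pitfalls | 0.412 |

**Test (generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.25 |
| | Visual breaks | 0.0 |
| | Second person | 0.0 |
| | Additional resources | 0.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.0 |
| | Code inclusion | 0.0 |
| | Code explanation | 0.0 |
| | Pitfalls | 0.0 |


**Base (actual)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 1.0 |
| | Visual breaks | 1.0 |
| | Second person | 1.0 |
| | Additional resources | 1.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.784 |
| | Code inclusion | 1.0 |
| | Code explanation | 0.746 |
| | Pitfalls | 0.605 |

**Test (generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.25 |
| | Visual breaks | 0.0 |
| | Second person | 0.002 |
| | Additional resources | 0.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.246 |
| | Code inclusion | 0.0 |
| | Code explanation | 0.0 |
| | Pitfalls | 0.738 |


**Base (actual)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.865 |
| | Visual breaks | 0.832 |
| | Second person | 0.82 |
| | Additional resources | 1.0 |
| | Section headings | 0.809 |
| Technical Communication | | 0.832 |
| | Code inclusion | 1.0 |
| | Code explanation | 0.75 |
| | Pitfalls | 0.746 |

**Test (generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.25 |
| | Visual breaks | 0.0 |
| | Second person | 0.0 |
| | Additional resources | 0.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.188 |
| | Code inclusion | 0.0 |
| | Code explanation | 0.0 |
| | Pitfalls | 0.562 |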

# Fine Tune a Better Blog Post Generator
Now that we've seen that the MachineLearningMastery posts still score significantly higher than our prompt-generated posts, let's see if we can capture some of that quality by fine-tuning our own model on examples from the original blog. To do so, we will:

1. Download a [dataset of previously scraped posts from HuggingFace](https://huggingface.co/datasets/withpi/mlmastery_com_blogs_condensed_merged)

2. Filter the dataset to **just the posts that perform really well per our scoring system** (this is the special sauce)

3. Plug that data into our fine tuning SDK endpoint, which will **show us a running log of Pi scores** going up as Fine Tuning improves the model's performance (this changes how you evaluate fine tuning runs - on your score not just validation loss)
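
The filtering step in the list above can be sketched offline as follows. This is a minimal illustration under stated assumptions: the scores are made-up numbers standing in for Pi total scores, and `ScoredExample`, `keep_high_scoring`, and `to_sft_examples` are hypothetical helper names. Only the `llm_input` / `llm_output` keys and the 0.7 threshold come from this colab.

```python
from dataclasses import dataclass


@dataclass
class ScoredExample:
    topic: str
    blog: str
    score: float  # stand-in for a Pi total score between 0 and 1


def keep_high_scoring(examples, threshold=0.7):
    """Keep only examples whose score is strictly above the threshold."""
    return [example for example in examples if example.score > threshold]


def to_sft_examples(examples):
    """Convert scored examples into llm_input / llm_output training pairs."""
    return [{"llm_input": ex.topic, "llm_output": ex.blog} for ex in examples]


# Made-up data for illustration only.
candidates = [
    ScoredExample("Cross-validation basics", "# Cross-validation ...", 0.96),
    ScoredExample("Intro to tensors", "# Tensors ...", 0.70),
    ScoredExample("Feature engineering", "# Features ...", 0.81),
    ScoredExample("Press release", "Some announcement ...", 0.35),
]

training_examples = to_sft_examples(keep_high_scoring(candidates))
print(len(training_examples), "of", len(candidates), "examples kept")
```

Note that the comparison is strict, so an example scoring exactly 0.7 is dropped. Raising the threshold trades training set size for quality under your own rubric.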

In [19]:
# @title [3 mins] Prepare training data for Fine-tuning by filtering low scoring blogs (<0.7)
from datasets import load_dataset
import pandas

def score(topic:str, blog:str):
  return client.scoring_system.score(
      llm_input=topic,
      llm_output=blog,
      scoring_spec=blog_writer_scoring_spec).total_score

ds = load_dataset("withpi/mlmastery_com_blogs_condensed_merged", split = "train")
ds = ds.map(lambda x: {"score": score(x["topic"], x["blog"])})
ds = ds.filter(lambda x: x["score"] > 0.7)
df = ds.to_pandas()
df

| | topic | blog | score |
|---|---|---|---|
| 0 | The topic of this blog post is: "Understanding... | # The Da Vinci Code of Data: Mastering The Dat... | 0.736165 |
| 1 | Cross-validation techniques for comprehensive ... | # From Train-Test to Cross-Validation: Advanci... | 0.960286 |
| 2 | Automated Feature Engineering in PyCaret: Stre... | # Automated Feature Engineering in PyCaret\n\n... | 0.806478 |
| 3 | Strategies and techniques for handling imbalan... | # Tips for Handling Imbalanced Data in Machine... | 0.836202 |
| 4 | Finding the optimal feature subset for linear ... | # The Search for the Sweet Spot in a Linear Re... | 0.910970 |
| ... | ... | ... | ... |
| 132 | Visualizing PyTorch Model Architectures Using ... | # Visualizing a PyTorch Model\n\nBy [Adrian Ta... | 0.849609 |
| 133 | Understanding and Working with One-Dimensional... | # One-Dimensional Tensors in Pytorch\n\nBy [Mu... | 0.723145 |
| 134 | Making Predictions with Keras: A Guide to Clas... | # How to Make Predictions with Keras\n\nBy [Ja... | 0.814941 |
| 135 | Visualizing and Interpreting Model Training Me... | # Understand Model Behavior During Training by... | 0.926432 |


In [20]:
# @title [SLOW - will run for 80+ minutes] Fine tune the model based on the above training data
status = client.training.sft.start_job(
    scoring_spec=blog_writer_scoring_spec,
    examples=[
        {"llm_input": row["topic"], "llm_output": row["blog"]}
        for row in ds
    ],
    base_sft_model="LLAMA_3.1_8B",
    lora_config={"lora_rank": "R_16"},
    system_prompt=system_prompt_for_blog_writer,
    num_train_epochs=10,
)
print(status)

SftStatus(detailed_status=['LAUNCHING'], job_id='sft_jobs:babf2fa45c086088a9e43d648f8ef22e58d7584ce21f17fda4509f2421d84c4c:7d08476e-b8ad-4ecf-b7ef-9dce6bf688d4', state='QUEUED', trained_models=[])


In [10]:
# @title Monitor the fine-tuning job for completion (watch the Eval_Pi_Score increase!)
from withpi_utils.colab import stream_training_response

response = stream_training_response(
    status.job_id,
    client.training.sft,
    additional_columns={"Eval_Pi_Score": "contract_score"},
)
if response.state == "ERROR":
  print("The job failed due to:\n{}".format('\n'.join(response.detailed_status[-5:])))
else:
  print("SFT model = {}".format(response.trained_models[0].model_dump_json(indent=2)))

Training Status for sft_jobs:babf2fa45c086088a9e43d648f8ef22e58d7584ce21f17fda4509f2421d84c4c:7d08476e-b8ad-4ecf-b7ef-9dce6bf688d4


| | Step | Epoch | Learning_Rate | Training_Loss | Eval_Loss | Eval_Pi_Score |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | X | X | 1.078013 | 0.346374 |
| 1 | 3 | 0.387097 | 0.00015 | 1.1943 | X | X |
| 2 | 6 | 0.774194 | 0.000194 | 1.1788 | 1.011224 | 0.438846 |
| 3 | 9 | 1.258065 | 0.000185 | 1.3678 | X | X |
| 4 | 12 | 1.645161 | 0.000176 | 1.055 | 0.969031 | 0.470094 |
| 5 | 15 | 2.129032 | 0.000167 | 1.4425 | X | X |
| 6 | 18 | 2.516129 | 0.000158 | 1.0336 | 0.946876 | 0.610404 |
| 7 | 21 | 2.903226 | 0.000148 | 0.9863 | X | X |
| 8 | 24 | 3.387097 | 0.000139 | 1.2644 | 0.931315 | 0.547641 |
| 9 | 27 | 3.774194 | 0.00013 | 0.9412 | X | X |


SFT model = {
  "contract_score": 0.6730171796821413,
  "epoch": 5.903225806451613,
  "eval_loss": 0.9167624115943909,
  "serving_id": 0,
  "serving_state": "SERVING",
  "step": 42
}
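
The training log above reports both `Eval_Loss` and `Eval_Pi_Score` at each evaluation step. The sketch below, which uses only the evaluation rows visible in that log (steps 0 through 24), shows that the two metrics can point to different checkpoints, which is why tracking your own score alongside validation loss is useful. The `EvalRow` type and the two selector functions are hypothetical helpers, not part of the Pi SDK.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    step: int
    eval_loss: float
    pi_score: float


# Evaluation rows copied from the log above (rows marked X are omitted).
eval_rows = [
    EvalRow(0, 1.078013, 0.346374),
    EvalRow(6, 1.011224, 0.438846),
    EvalRow(12, 0.969031, 0.470094),
    EvalRow(18, 0.946876, 0.610404),
    EvalRow(24, 0.931315, 0.547641),
]


def best_by_loss(rows):
    """Return the row with the lowest evaluation loss."""
    return min(rows, key=lambda row: row.eval_loss)


def best_by_pi_score(rows):
    """Return the row with the highest Pi score."""
    return max(rows, key=lambda row: row.pi_score)


print("lowest eval loss at step", best_by_loss(eval_rows).step)
print("highest Pi score at step", best_by_pi_score(eval_rows).step)
```

Among these rows, the loss keeps falling through step 24 while the Pi score peaks at step 18. The final model reported above (step 42) is not part of this comparison.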


# Test Out & Evaluate Your Fine Tuned Generator

Now our new model is ready to be tested out!

1. First, we'll generate blog posts for the same topics we looked at before

2. Then we'll score all of these blog posts so we can compare them to the posts generated by our prompted model

3. Finally, we'll look at some individual examples and their scores side by side to see how much fine-tuning improved our blog posts

In [13]:
# @title [5 min] Generate blogs using the fine tuned model for evaluation
from datasets import load_dataset
from google.colab import userdata
import time

ds = load_dataset("withpi/mlmastery_com_blogs_condensed_merged", split = "test")
topics = ds["topic"]

# Generate the blogs using fine tuned llama 8B
client.training.sft.load(status.job_id)

# Wait for the model to be loaded
while not (client.training.sft.retrieve(status.job_id).trained_models[0].serving_state == "SERVING"):
    time.sleep(3)

loop = asyncio.get_running_loop()
new_generated_blogs = await loop.create_task(
    generate_blogs(
        topics,
        model_id="fireworks_ai/0",
        api_base=f"https://api.withpi.ai/v1/training/sft/{status.job_id}",
        api_key=os.environ["WITHPI_API_KEY"],
        system_prompt=system_prompt_for_blog_writer
    )
)

Generated a blog for topic# 1: InterviewAce: 365 Data Science's Free AI-Powered Tool for Data Science Interview Preparation
Generated a blog for topic# 4: Evaluating RAG Systems: An Overview of RAGAs and Other Frameworks for Measuring Retrieval Augmented Generation Performance
Generated a blog for topic# 3: A comprehensive guide to Hugging Face's Model Hub and Community platform, including repository management, search functionality, API integration, and community resources for machine learning practitioners.
Generated a blog for topic# 5: 7-Day Mini-Course on Practical Data Science: From Linear Regression to Random Forests with Real-World Country Data Analysis
Generated a blog for topic# 2: Topic: 7 Beginner-Friendly Machine Learning Projects for Hands-On Experience - From Titanic Survival Prediction to Face Detection
Generated a blog for topic# 6: Best practices for efficient machine learning model deployment, including optimization, containerization, CI/CD implementation, performanc
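
The cell above polls until the model's `serving_state` is `"SERVING"`, with no upper bound on the wait. A minimal sketch of a bounded polling loop follows, assuming you would rather give up after a timeout than wait forever. `wait_until` is a hypothetical helper, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time


def wait_until(predicate, timeout_s=600.0, poll_interval_s=3.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Offline demonstration with a fake status sequence and no real sleeping.
fake_states = iter(["LOADING", "LOADING", "SERVING"])
became_ready = wait_until(
    lambda: next(fake_states) == "SERVING",
    timeout_s=5.0,
    poll_interval_s=0.0,
    sleep=lambda _seconds: None,
)
print("model ready:", became_ready)
```

In this colab the predicate would wrap the same check the cell already performs, for example `lambda: client.training.sft.retrieve(status.job_id).trained_models[0].serving_state == "SERVING"`, and a `False` return would signal that the model did not load in time.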

In [16]:
# @title Compare the newly generated blogs against previous ones using the Pi scoring system
from tqdm import tqdm
import pandas as pd

scores = []
generated_scores = []
new_generated_scores = []
for topic, generated_blog, new_generated_blog in tqdm(zip(topics, generated_blogs, new_generated_blogs)):
  generated_score = client.scoring_system.score(
      llm_input=topic,
      llm_output=generated_blog,
      scoring_spec=blog_writer_scoring_spec)
  new_generated_score = client.scoring_system.score(
      llm_input=topic,
      llm_output=new_generated_blog,
      scoring_spec=blog_writer_scoring_spec)
  generated_scores.append(generated_score)
  new_generated_scores.append(new_generated_score)
  score = {'topic': topic, 'generated': generated_score.total_score, 'new generated': new_generated_score.total_score}
  scores.append(score)

df = pd.DataFrame(scores)
df["score_diff"] = df["new generated"] - df["generated"]  # Compute score differential
df

78it [00:42,  1.84it/s]


| | topic | generated | new generated | score_diff |
|---|---|---|---|---|
| 0 | InterviewAce: 365 Data Science's Free AI-Power... | 0.223929 | 0.467300 | 0.243371 |
| 1 | Topic: 7 Beginner-Friendly Machine Learning Pr... | 0.250048 | 0.445219 | 0.195171 |
| 2 | A comprehensive guide to Hugging Face's Model ... | 0.251727 | 0.735026 | 0.483299 |
| 3 | Evaluating RAG Systems: An Overview of RAGAs a... | 0.157556 | 0.700684 | 0.543128 |
| 4 | 7-Day Mini-Course on Practical Data Science: F... | 0.454818 | 0.746023 | 0.291205 |
| ... | ... | ... | ... | ... |
| 73 | Packaging and Deploying Python Projects: From ... | 0.371519 | 0.408065 | 0.036547 |
| 74 | Creating and Customizing Dataset Classes in Py... | 0.314063 | 0.664714 | 0.350651 |
| 75 | Implementing Dropout Regularization in PyTorch... | 0.488615 | 0.576198 | 0.087584 |
| 76 | Building and Training a Single Layer Neural Ne... | 0.422554 | 0.702148 | 0.279595 |


In [18]:
# @title Calculate the average uplift in score from Fine Tuning vs Prompting the model
average_score_uplift_absolute = df['score_diff'].mean()
average_score_uplift_percentage = average_score_uplift_absolute/(df['generated'].mean())
prompt_mean = df['generated'].mean()
sft_mean = df['new generated'].mean()

print(f"Average score uplift: {round(average_score_uplift_percentage * 100, 1)}% uplift in scores from a {round(prompt_mean, 2)} average score to {round(sft_mean, 2)}")


Average score uplift: 58.9% uplift in scores from a 0.37 average score to 0.59
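
The uplift above is the mean per-topic score difference divided by the mean baseline score. Here is a small worked sketch of that arithmetic using only the first five rows displayed in the comparison table, so its result will differ from the 58.9% computed over all 78 topics. `relative_uplift` is a hypothetical helper name.

```python
from statistics import mean


def relative_uplift(baseline_scores, new_scores):
    """Mean per-item score difference divided by the mean baseline score."""
    if len(baseline_scores) != len(new_scores):
        raise ValueError("score lists must have the same length")
    differences = [new - base for base, new in zip(baseline_scores, new_scores)]
    return mean(differences) / mean(baseline_scores)


# First five rows of the table above: prompted model vs fine-tuned model.
generated = [0.223929, 0.250048, 0.251727, 0.157556, 0.454818]
new_generated = [0.467300, 0.445219, 0.735026, 0.700684, 0.746023]

uplift = relative_uplift(generated, new_generated)
print(f"{uplift * 100:.1f}% uplift on these five rows")
```

Because the mean of the differences equals the difference of the means, the same number can be obtained as `(mean(new_generated) - mean(generated)) / mean(generated)`. Topics with very low baseline scores pull this ratio up, so it is worth reading alongside the absolute averages.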


In [19]:
# @title Manually inspect new generated blogs against previous ones with scores
from withpi_utils.colab import pretty_print_responses

def pretty_print_blog(i):
  pretty_print_responses(
      response1 = generated_blogs[i].strip("\"").replace("\\n", "\n"),
      response2 = new_generated_blogs[i].strip("\"").replace("\\n", "\n"),
      header="##### Topic: \n" + topics[i].strip("\"").replace("\\n", "\n"),
      left_label="Base (generated)",
      right_label="Test (new generated)",
      scores_left=generated_scores[i],
      scores_right=new_generated_scores[i])

# Find top 3 cases with highest score differential and inspect them manually
for i in (df.nlargest(3, "score_diff").index.to_list()):
  pretty_print_blog(i)
  print("\n\n")

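**Base (generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.202 |
| | Visual breaks | 0.0 |
| | Second person | 0.0 |
| | Additional resources | 0.0 |
| | Section headings | 0.809 |
| Technical Communication | | 0.104 |
| | Code inclusion | 0.0 |
| | Code explanation | 0.311 |
| | Pitfalls | 0.0 |

**Test (new generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.924 |
| | Visual breaks | 0.98 |
| | Second person | 0.719 |
| | Additional resources | 0.996 |
| | Section headings | 1.0 |
| Technical Communication | | 0.62 |
| | Code inclusion | 1.0 |
| | Code explanation | 0.762 |
| | Pitfalls | 0.1 |


**Base (generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.25 |
| | Visual breaks | 0.0 |
| | Second person | 0.0 |
| | Additional resources | 0.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.065 |
| | Code inclusion | 0.0 |
| | Code explanation | 0.0 |
| | Pitfalls | 0.195 |

**Test (new generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.728 |
| | Visual breaks | 0.875 |
| | Second person | 0.035 |
| | Additional resources | 1.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.674 |
| | Code inclusion | 0.824 |
| | Code explanation | 0.439 |
| | Pitfalls | 0.758 |


**Base (generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.25 |
| | Visual breaks | 0.0 |
| | Second person | 0.0 |
| | Additional resources | 0.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.0 |
| | Code inclusion | 0.0 |
| | Code explanation | 0.0 |
| | Pitfalls | 0.0 |

**Test (new generated)**

| Dimension | Sub-dimension | Score |
|---|---|---|
| Content Structure | | 0.873 |
| | Visual breaks | 0.84 |
| | Second person | 0.652 |
| | Additional resources | 1.0 |
| | Section headings | 1.0 |
| Technical Communication | | 0.424 |
| | Code inclusion | 0.969 |
| | Code explanation | 0.005 |
| | Pitfalls | 0.299 |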





