## Lab 8 Supervised Fine-Tuning

# 1 Prepare the data

In this section, we will generate instruction data and use it for supervised fine-tuning (SFT) of a pre-trained Llama3-8B-instruct model.

First, we will use the THU Chinese Classical Poetry Corpus (THU-CCPC) as the source for our instruction data. THU-CCPC is part of THUNLP-AIPoet, a long-term project on AI-generated Chinese poetry.

The data in THU-CCPC contains only basic information about each poem, so the first step is to preprocess it and extract the fields we need.

Again, let's first set the working directory.

In [2]:
%cd /gfshome

In [19]:
# The dataset comes from
# https://github.com/THUNLP-AIPoet/Datasets.git

# We have already downloaded it to /ssdshare/share/lab8/Datasets,
# so the cells below read it directly from that path.

# Create a directory for the processed output (-p: no error if it already exists)
!mkdir -p ccpc

In [20]:
# let's examine the input file
!head -20 /ssdshare/share/lab8/Datasets/CCPC/ccpc_train_v1.0.json

In [31]:
# Transform the CCPC dataset into a friendlier JSON format,
# keeping only the fields we need.
# We also distinguish 五言诗 (five characters per line) from 七言诗 (seven characters per line).
import json

# Define input and output files
input_file = "/ssdshare/share/lab8/Datasets/CCPC/ccpc_train_v1.0.json"
output_file = "ccpc/ccpc_transformed_with_format.json"

transformed_data = []

# Load and transform data
with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # Skip empty lines
            item = json.loads(line.strip())  # Load each line as a JSON object
            
            # Process the content of the poem
            lines = item["content"].split("|")
            formatted_lines = [f"{line}，" if i % 2 == 0 else f"{line}。" for i, line in enumerate(lines)]
            formatted_content = "\n".join(formatted_lines)

            # Determine the poem type from the per-line character counts (punctuation excluded)
            line_lengths = [len(line.replace("，", "").replace("。", "")) for line in lines]
            if all(length == 5 for length in line_lengths):
                poem_type = "五言诗"
            elif all(length == 7 for length in line_lengths):
                poem_type = "七言诗"
            else:
                poem_type = "杂言诗"

            # Transform the data structure
            transformed_item = {
                "title": item["title"],
                "author": item["author"],
                "content": formatted_content,
                "keywords": item["keywords"].split(),
                "poem_type": poem_type
            }
            transformed_data.append(transformed_item)

# Save the transformed data to output file
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(transformed_data, f, ensure_ascii=False, indent=4)

# Check whether any poem has mixed line lengths (杂言诗)
def check_for_mixed_poems(transformed_data):
    return any(item["poem_type"] == "杂言诗" for item in transformed_data)


if check_for_mixed_poems(transformed_data):
    print("There are mixed poems in the transformed data.")
else:
    print("There are no mixed poems in the transformed data.")

print("Transformation complete! Transformed data saved to:", output_file)
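
To make the per-poem steps easier to follow, here is a minimal sketch that applies the same logic to a single record. The sample record is hypothetical (a well-known quatrain written in the field layout the cell above assumes, not a line copied from the corpus), and `transform_record` is a helper name introduced only for this illustration.

```python
import json

# Hypothetical sample record (not copied from the corpus), in the layout the
# cell above assumes: "|"-separated lines and space-separated keywords.
raw_line = json.dumps({
    "title": "静夜思",
    "author": "李白",
    "content": "床前明月光|疑是地上霜|举头望明月|低头思故乡",
    "keywords": "明月 故乡",
}, ensure_ascii=False)


def transform_record(raw):
    """Apply the same per-poem steps as the cell above to one JSON line."""
    item = json.loads(raw)
    lines = item["content"].split("|")
    # Even-indexed lines end with "，", odd-indexed lines end with "。"
    formatted = [f"{text}，" if i % 2 == 0 else f"{text}。"
                 for i, text in enumerate(lines)]
    lengths = [len(text) for text in lines]
    if all(n == 5 for n in lengths):
        poem_type = "五言诗"
    elif all(n == 7 for n in lengths):
        poem_type = "七言诗"
    else:
        poem_type = "杂言诗"
    return {
        "title": item["title"],
        "author": item["author"],
        "content": "\n".join(formatted),
        "keywords": item["keywords"].split(),
        "poem_type": poem_type,
    }


record = transform_record(raw_line)
print(record["poem_type"])
print(record["content"])
```

Every line of the sample has five characters, so it is labelled 五言诗; a poem whose lines differ in length would fall through to 杂言诗.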

In [32]:
# take a look at the resulting data
!tail -50 ccpc/ccpc_transformed_with_format.json
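
Besides eyeballing the tail of the file, it can help to count how many poems fall into each type. The sketch below is one possible way to do that with `collections.Counter`; `count_poem_types` is a hypothetical helper, and the three-item list stands in for the real data so the snippet runs on its own.

```python
from collections import Counter


def count_poem_types(items):
    """Count how many records carry each poem_type label."""
    return Counter(item["poem_type"] for item in items)


# Stand-in data; in the lab you would pass the list loaded with json.load()
# from ccpc/ccpc_transformed_with_format.json instead.
sample = [
    {"poem_type": "五言诗"},
    {"poem_type": "七言诗"},
    {"poem_type": "七言诗"},
]

counts = count_poem_types(sample)
for poem_type, n in counts.most_common():
    print(poem_type, n)
```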

In [33]:
# Turn the labelled dataset into SFT format,
# i.e. instruction-response pairs.

import json
import random

# Define input and output files
input_file = "ccpc/ccpc_transformed_with_format.json"
output_file = "LLaMA-Factory/data/alpaca_sft_dataset_with_varied_instructions.json"  # Saved under LLaMA-Factory/data so LLaMA-Factory can load it

# Keep only 5-character and 7-character poems
def filter_poems(poem):
    return poem["poem_type"] in ["五言诗", "七言诗"]

# Build an instruction from the poem type and its theme keywords
def generate_instruction(poem_type, theme):
    poem_type_map = {
        "五言诗": "5-character",
        "七言诗": "7-character"
    }
    english_poem_type = poem_type_map.get(poem_type, "unknown")
    themes = ", ".join(theme)
    
    # Define instruction templates in five different styles
    instruction_templates = [
        f"Hi, you are a Chinese poet, can you write a {english_poem_type} 4-line poem about the themes: {themes}?",
        f"Hi, you are a Chinese poet now, can you compose a {english_poem_type} 4-line poem reflecting on the ideas of {themes}?",
        f"Hi, as a Chinese poet, can you help me to create a {english_poem_type} 4-line poem that incorporates the themes of {themes}?",
        f"Hi, please draft a {english_poem_type} 4-line poem based on the themes: {themes}.",
        f"Hi, you are a Chinese poet, please generate a {english_poem_type} 4-line poem exploring the themes of {themes}."
    ]
    return random.choice(instruction_templates)  # Choose a random instruction template

# Transform poem data into Alpaca format
def create_alpaca_data(poem):
    theme = poem["keywords"]
    instruction = generate_instruction(poem["poem_type"], theme)
    return {
        "instruction": instruction,
        "input": "",  # Left empty: the instruction already contains everything the model needs
        "output": poem["content"]
    }

# Load poem data and filter out unsuitable poems
with open(input_file, "r", encoding="utf-8") as f:
    poems = json.load(f)
filtered_poems = [poem for poem in poems if filter_poems(poem)]
alpaca_data = [create_alpaca_data(poem) for poem in filtered_poems]

# Save in Alpaca format
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(alpaca_data, f, ensure_ascii=False, indent=4)

print(f"Alpaca dataset with varied instructions created successfully! Saved to {output_file}")
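
Because the template is picked with `random.choice`, rerunning the cell above produces a different dataset each time. If you want the same instructions on every run, one option is to draw from a seeded `random.Random` instance. The sketch below is only an illustration of that idea, not part of the lab code: `make_example`, the shortened `TEMPLATES` list, and the seed value 42 are all assumptions made for this example.

```python
import random

# Shortened, illustrative template list (the lab cell defines five templates).
TEMPLATES = [
    "Hi, you are a Chinese poet, can you write a {kind} 4-line poem about the themes: {themes}?",
    "Hi, please draft a {kind} 4-line poem based on the themes: {themes}.",
]
POEM_TYPE_MAP = {"五言诗": "5-character", "七言诗": "7-character"}


def make_example(poem, rng):
    """Build one Alpaca-style record, drawing the template from `rng`."""
    template = rng.choice(TEMPLATES)
    instruction = template.format(
        kind=POEM_TYPE_MAP[poem["poem_type"]],
        themes=", ".join(poem["keywords"]),
    )
    return {"instruction": instruction, "input": "", "output": poem["content"]}


poem = {
    "poem_type": "五言诗",
    "keywords": ["明月", "故乡"],
    "content": "床前明月光，\n疑是地上霜。\n举头望明月，\n低头思故乡。",
}

# Two generators created with the same seed pick the same template.
first = make_example(poem, random.Random(42))
second = make_example(poem, random.Random(42))
print(first["instruction"])
print(first == second)
```

Passing the generator in as an argument keeps the global `random` state untouched, so other cells in the notebook are not affected.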


In [34]:
# take a look at the result data

!head -20 LLaMA-Factory/data/alpaca_sft_dataset_with_varied_instructions.json
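
Looking at the first 20 lines only shows a couple of records. As a rough sanity check over the whole file, something like the sketch below could flag records with missing keys or empty text. `find_invalid_records` is a hypothetical helper, and the two-item list is stand-in data so the snippet runs without the generated file.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_invalid_records(records):
    """Return indices of records with missing keys or blank instruction/output."""
    bad = []
    for idx, rec in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        blank = [key for key in ("instruction", "output")
                 if not str(rec.get(key, "")).strip()]
        if missing or blank:
            bad.append(idx)
    return bad


# Stand-in data; in the lab you would pass the list loaded with json.load()
# from the generated Alpaca JSON file instead.
sample_records = [
    {"instruction": "Hi, please draft a 5-character 4-line poem based on the themes: 明月.",
     "input": "", "output": "床前明月光，\n疑是地上霜。"},
    {"instruction": "Hi, please draft a poem.", "input": "", "output": "   "},
]

invalid = find_invalid_records(sample_records)
print(invalid)
```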

In [35]:
# We now have the dataset for SFT, but LLaMA-Factory also requires it to be registered.
# Datasets are registered in LLaMA-Factory/data/dataset_info.json,
# so let's take a look at that file first.
!head -100 LLaMA-Factory/data/dataset_info.json

In [36]:
# Now we add our SFT data to LLaMA-Factory/data/dataset_info.json:
# poet_instructions = {
#     "file_name": "alpaca_sft_dataset_with_varied_instructions.json",
#     "formatting": "alpaca"
# }

import os
import json

# Path to the dataset_info.json file
dataset_info_path = os.path.join("LLaMA-Factory", "data", "dataset_info.json")

# Read the existing dataset_info.json
with open(dataset_info_path, "r", encoding="utf-8") as f:
    dataset_info = json.load(f)

# Add the poet_instructions entry
dataset_info["poet_instructions"] = {
    "file_name": "alpaca_sft_dataset_with_varied_instructions.json",
    "formatting": "alpaca"
}

# Write the updated dataset_info.json
with open(dataset_info_path, "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=4)

print(f"Updated {dataset_info_path} with poet_instructions dataset information.")


In [37]:
# ensure the dataset is registered
!tail -10 LLaMA-Factory/data/dataset_info.json
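
The same check can also be done programmatically: confirm that the entry exists and that the file it points to is actually present in the data directory. The sketch below shows one way this might look. `is_registered` is a hypothetical helper, and the demonstration runs against a temporary directory with made-up names so that nothing under `LLaMA-Factory/data` is modified.

```python
import json
import os
import tempfile


def is_registered(data_dir, name):
    """Return True if `name` is in dataset_info.json and its file exists."""
    info_path = os.path.join(data_dir, "dataset_info.json")
    with open(info_path, "r", encoding="utf-8") as f:
        info = json.load(f)
    entry = info.get(name)
    if entry is None or "file_name" not in entry:
        return False
    return os.path.isfile(os.path.join(data_dir, entry["file_name"]))


# Demonstrate on a throwaway directory with a made-up dataset name.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.json"), "w", encoding="utf-8") as f:
        json.dump([], f)
    with open(os.path.join(tmp, "dataset_info.json"), "w", encoding="utf-8") as f:
        json.dump({"demo": {"file_name": "demo.json", "formatting": "alpaca"}}, f)
    found = is_registered(tmp, "demo")
    not_found = is_registered(tmp, "missing_name")

print(found, not_found)
```

In the lab itself, the equivalent call would be `is_registered("LLaMA-Factory/data", "poet_instructions")`.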

Now that the dataset is ready, let's go back to 01_llama factory.ipynb. :)