
In [1]:
!pip install "kagglehub[pandas-datasets]"



In [2]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "pink_floyd_lyrics.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "joaorobson/pink-floyd-lyrics",
  file_path
)

Using Colab cache for faster access to the 'pink-floyd-lyrics' dataset.


In [3]:
df.head()

,album,song_title,year,lyrics
0,The Piper at the Gates of Dawn,Astronomy Domine,1967-08-05,"""Moon in both [houses]...""...Scorpio, [Arabian..."
1,The Piper at the Gates of Dawn,Lucifer Sam,1967-08-05,"Lucifer Sam, siam cat\nAlways sitting by your ..."
2,The Piper at the Gates of Dawn,Matilda Mother,1967-08-05,There was a king who ruled the land\nHis Majes...
3,The Piper at the Gates of Dawn,Flaming,1967-08-05,Alone in the clouds all blue\nLying on an eide...
4,The Piper at the Gates of Dawn,Pow R. Toc H.,1967-08-05,TCH TCH\nAHH (AHH)\nTCH TCH\nAHH AHH\nDoi doi\...


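We want to convert this dataset to the Alpaca format.

Alpaca requires three fields per example: `instruction`, `input`, and `output`.

The current dataset has `album`, `song_title`, `year`, and `lyrics` columns, so each song must be transformed into Alpaca-style records.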
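
For reference, a single Alpaca-style record could look like the sketch below. The helper name `to_alpaca_record` and all field values are hypothetical placeholders for illustration, not rows from the dataset.

```python
import json

# The three fields an Alpaca-style record carries.
ALPACA_FIELDS = ("instruction", "input", "output")

def to_alpaca_record(instruction: str, style_keywords: str, lyrics: str) -> dict:
    """Build one Alpaca-style record (hypothetical helper, illustration only)."""
    return {
        "instruction": instruction.strip(),
        "input": style_keywords.strip(),
        "output": lyrics.strip(),
    }

# Placeholder values; real records are generated later in the notebook.
record = to_alpaca_record(
    "Write a song about drifting through space.",
    "psychedelic, 1960s",
    "placeholder lyric line one\nplaceholder lyric line two",
)
print(json.dumps(record, indent=2))
```

A dataset in this format is simply a JSON list of such records.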



In [4]:
import pandas as pd
import time
import os
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List

# --- CONFIGURATION ---
# Use Colab secrets if available, or paste directly
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

client = OpenAI()

# --- 1. DATA VALIDATION (Pydantic) ---
class AlpacaEntry(BaseModel):
    instruction: str = Field(..., description="A natural language user request asking for a song about the themes. MUST NOT contain the song title.")
    input: str = Field(..., description="Keywords defining the style, mood, and era.")
    output: str = Field(..., description="The original lyrics.")

class DatasetResponse(BaseModel):
    examples: List[AlpacaEntry]

# --- 2. THE GENERATOR FUNCTION (Responses API) ---
def generate_alpaca_data(row):
    lyrics = row['lyrics']
    title = row['song_title']
    album = row['album']

    # Skip missing or very short lyrics (e.g. instrumentals)
    if not isinstance(lyrics, str) or len(lyrics) < 50:
        return []

    # Truncate long lyrics to keep the prompt within a safe size
    lyrics_excerpt = lyrics[:4000]

    prompt = f"""
    You are an expert creative writing teacher.
    I will give you the lyrics to the Pink Floyd song "{title}" (Album: {album}).

    Your goal is to create training data that teaches an AI how to write *new* lyrics in this style.

    LYRICS:
    {lyrics_excerpt}

    Create 3 diverse 'Instruction/Input/Output' pairs in Alpaca format:
    1. INSTRUCTION: A request for a song about the lyrics' meaning/story (NO song titles).
    2. INPUT: Style/Mood/Theme keywords.
    3. OUTPUT: The original lyrics.
    """

    try:
        # Call the Responses API and parse the output into DatasetResponse
        response = client.responses.parse(
            model="gpt-4o-2024-08-06",
            input=[
                {"role": "system", "content": "You create high-quality fine-tuning datasets."},
                {"role": "user", "content": prompt}
            ],
            text_format=DatasetResponse,
        )

        # Access the parsed Pydantic object directly
        parsed_data = response.output_parsed

        # Convert to list of dicts for the dataset
        return [entry.model_dump() for entry in parsed_data.examples]

    except Exception as e:
        print(f"Error on '{title}': {e}")
        return []

# --- 3. MAIN LOOP ---
final_dataset = []
print(f"Starting processing of {len(df)} songs...")

for index, row in df.iterrows():
    print(f"Processing [{index+1}/{len(df)}]: {row['song_title']}...", end=" ", flush=True)

    new_examples = generate_alpaca_data(row)

    if new_examples:
        final_dataset.extend(new_examples)
        print(f"✅ Generated {len(new_examples)} examples.")
    else:
        print("❌ Skipped.")

    time.sleep(0.5)

# --- 4. SAVE ---
import json

with open("pink_floyd_alpaca_responses_api.json", "w") as f:
    json.dump(final_dataset, f, indent=4)

print(f"\nDONE! Saved {len(final_dataset)} examples.")

Starting processing of 163 songs...
Processing [1/163]: Astronomy Domine... ✅ Generated 3 examples.
Processing [2/163]: Lucifer Sam... ✅ Generated 3 examples.
Processing [3/163]: Matilda Mother... ✅ Generated 3 examples.
Processing [4/163]: Flaming... ✅ Generated 3 examples.
Processing [5/163]: Pow R. Toc H.... ✅ Generated 3 examples.
Processing [6/163]: Take Up Thy Stethoscope and Walk... ✅ Generated 3 examples.
Processing [7/163]: Interstellar Overdrive... ❌ Skipped.
Processing [8/163]: The Gnome... ✅ Generated 3 examples.
Processing [9/163]: Chapter 24... ✅ Generated 3 examples.
Processing [10/163]: The Scarecrow... ✅ Generated 3 examples.
Processing [11/163]: Bike... ✅ Generated 3 examples.
Processing [12/163]: See Emily Play... ✅ Generated 3 examples.
Processing [13/163]: Let There Be More Light... ✅ Generated 3 examples.
Processing [14/163]: Remember a Day... ✅ Generated 3 examples.
Processing [15/163]: Set the Controls for the Heart of the Sun... ✅ Generated 3 examples.
... (output truncated)
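
The schema asks that instructions never contain the song title, but nothing in the loop checks this. A minimal sketch of such a check follows; the helper names `instruction_leaks_title` and `split_by_title_leak` are hypothetical, and the sample records are placeholders rather than real model output.

```python
def instruction_leaks_title(instruction: str, title: str) -> bool:
    """Return True if the title appears in the instruction (case-insensitive substring)."""
    needle = title.strip().lower()
    if not needle:
        return False  # an empty title cannot leak
    return needle in instruction.lower()

def split_by_title_leak(examples, title):
    """Split generated examples into (clean, leaked) lists for one song."""
    clean, leaked = [], []
    for example in examples:
        if instruction_leaks_title(example["instruction"], title):
            leaked.append(example)
        else:
            clean.append(example)
    return clean, leaked

# Placeholder examples for illustration.
examples = [
    {"instruction": "Write a song about a cat named Lucifer Sam.", "input": "", "output": ""},
    {"instruction": "Write a song about a mysterious feline.", "input": "", "output": ""},
]
clean, leaked = split_by_title_leak(examples, "Lucifer Sam")
print(len(clean), len(leaked))
```

Because this is a plain substring match, short one-word titles can produce false positives, so flagged examples are better reviewed than dropped automatically.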

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import shutil

# The file currently in Colab (Temporary)
source_file = "pink_floyd_alpaca_responses_api.json"

# Where you want it in Drive (Permanent)
# You can change 'MyDrive' to a specific folder if you want, e.g. '/content/drive/MyDrive/Datasets/...'
destination_path = "/content/drive/MyDrive/pink_floyd_alpaca_responses_api.json"

try:
    shutil.copy(source_file, destination_path)
    print(f"✅ Success! Dataset saved to: {destination_path}")
except Exception as e:
    print(f"❌ Error saving to Drive: {e}")

✅ Success! Dataset saved to: /content/drive/MyDrive/pink_floyd_alpaca_responses_api.json


In [8]:
import pandas as pd

# Read directly from Drive
drive_path = "/content/drive/MyDrive/pink_floyd_alpaca_responses_api.json"

df_saved = pd.read_json(drive_path)
df_saved.head()

,instruction,input,output
0,Please generate lyrics that explore space expl...,"Psychedelic, space-themed, 1960s","Lime and limpid green, a second scene\nA fight..."
1,Write lyrics that capture the mystery and vast...,"Dreamy, surreal, cosmic","Lime and limpid green, a second scene\nA fight..."
2,Create lyrics about interstellar journeys and ...,"Ethereal, exploratory, galactic","Lime and limpid green, a second scene\nA fight..."
3,Write a song about a mysterious feline compani...,"Psychedelic, enigmatic, 1960s","Lucifer Sam, siam cat\nAlways sitting by your ..."
4,Craft a whimsical song featuring a charismatic...,"Quirky, mystical, 1960s surrealism","Lucifer Sam, siam cat\nAlways sitting by your ..."
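
Before fine-tuning, each record is usually rendered into a single prompt string. The sketch below assumes the Stanford Alpaca prompt template (the variant with an input field); the training setup used with this dataset may expect a different template. `format_alpaca_example` is a hypothetical helper and the sample record is a placeholder.

```python
# Assumption: the Stanford Alpaca prompt template (variant with an input field).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_alpaca_example(record: dict) -> str:
    """Render one instruction/input/output record as a single training string."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"],
        input=record["input"],
        output=record["output"],
    )

# Placeholder record; real ones could come from df_saved.to_dict("records").
sample = {
    "instruction": "Write lyrics about a mysterious feline companion.",
    "input": "Psychedelic, enigmatic, 1960s",
    "output": "placeholder lyric line",
}
text = format_alpaca_example(sample)
print(text)
```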
