# LLM Auto-Prompt Tuning for Code Generation

This notebook demonstrates automatic prompt tuning for Large Language Models (LLMs), with a focus on code generation. It serves as both an educational tool and a practical guide, walking through each stage from basic model usage to running a prompt-tuning job.

## Overview

The notebook is organized into four main stages:

1. **Code Generation with Base Prompt**: Demonstrate basic code completion using a general-purpose model
2. **Data Preparation**: Generate code to prepare custom datasets or download relevant datasets for LLM training
3. **Training Data Preprocessing**: Generate code to preprocess data for LLM training
4. **Auto-Prompt Tuning**: Run an auto-prompt tuning job that takes the base prompt and the prompt-tuning dataset and returns an optimized prompt

Each stage builds upon the previous one, providing a comprehensive understanding of the LLM training process.

## Setup

Let's start by installing the necessary dependencies and setting up our environment.

In [1]:
# Install required packages
!pip install -q transformers datasets accelerate torch evaluate python-dotenv pandas matplotlib


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
# Import necessary libraries
import os
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import time
import json
from IPython.display import display, HTML, Markdown



In [4]:
import os
from dotenv import load_dotenv
from datasets import load_dataset
import pandas as pd

# 1. Load environment variables
load_dotenv()
hf_token = os.getenv("HF_ACCESS_TOKEN")

# 2. Download the dataset
dataset = load_dataset("Jayveersinh-Raj/prompt-tuning-text-to-code", split="train", token=hf_token)

# 3. Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# 4. Convert to DataFrame and save as CSV
df = dataset.to_pandas()
output_path = os.path.join("data", "prompt_tuning_train.csv")
df.to_csv(output_path, index=False)

# 5. Print success message
print(f"Dataset successfully saved to {os.path.abspath(output_path)}")


Dataset successfully saved to /Volumes/workplace/tekulam/clineHandson/data/prompt_tuning_train.csv


In [5]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("data/prompt_tuning_train.csv")

# Display first 5 rows
print(df.head())

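def create_training_prompt(row):
    """Format one dataset row as an instruction-style training example."""
    # Pair the original (unrefined) request with its improved version.
    formatted_prompt = f"""Rewrite the user's initial request into a refined prompt.

### Initial Request:
{row['bad_prompts']}

### Refined Prompt:
{row['improved_prompts']}"""

    return formatted_prompt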


                                         bad_prompts  \
0  Cretae a imag proccesing librry in Pyhton. It ...   
1  Writ a algo for sock traiding. It should look ...   
2  Make a NLP pipelin in Pythn. Do text preproces...   
3  Develope a multiplayr online game with a cllie...   
4  Code a dron navgation systm. Mak it able to go...   

                                    improved_prompts  
0  Develop an image processing library with capab...  
1  Develop an automated stock trading algorithm. ...  
2  Create a natural langauge processin pipeline i...  
3  Design and create a multiplayer online game us...  
4  Develop a drone navigation system capable of n...  
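
The template in `create_training_prompt` always ends with the refined prompt. That suits training rows, but at inference time the model is expected to produce that part itself. The following is a minimal sketch of one way to share the template between both cases. The names `build_prompt` and `extract_refined` are hypothetical helpers, not part of the notebook, and the sketch assumes the generated text echoes the prompt before the completion.

```python
# Hypothetical helpers (not part of the original notebook).
HEADER = "Rewrite the user's initial request into a refined prompt."
RESPONSE_MARKER = "### Refined Prompt:\n"


def build_prompt(initial_request, refined_prompt=None):
    """Return the training text, or the inference prompt if no answer is given."""
    prompt = (
        f"{HEADER}\n\n"
        f"### Initial Request:\n{initial_request}\n\n"
        f"{RESPONSE_MARKER}"
    )
    # Training rows append the target; inference prompts stop at the marker.
    if refined_prompt is not None:
        prompt += refined_prompt
    return prompt


def extract_refined(generated_text):
    """Return the text after the last response marker, stripped of whitespace."""
    # rpartition returns the whole string as the tail if the marker is absent.
    _, _, tail = generated_text.rpartition(RESPONSE_MARKER)
    return tail.strip()
```

With both arguments, `build_prompt(row['bad_prompts'], row['improved_prompts'])` yields the same text as the training template above. With only the first argument it yields a prompt that ends where the model should start writing.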


In [6]:
# Apply the formatting function to create the new column
df['formatted_prompt'] = df.apply(create_training_prompt, axis=1)

# Print the first formatted prompt to verify
print("First formatted prompt:")
print(df['formatted_prompt'].iloc[0])

# Save the processed DataFrame to a JSONL file
df.to_json("data/prompt_tuning_train_processed.jsonl", orient='records', lines=True)


KeyError: 'initial_prompt'
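
A `KeyError` like this is what pandas raises when a row is indexed with a column name the DataFrame does not contain. One way to surface the problem before `df.apply` runs is to compare the available columns against the ones the template needs. The sketch below assumes the template reads `bad_prompts` and `improved_prompts`, as in `create_training_prompt`. The helper `missing_columns` is hypothetical and not part of the notebook.

```python
# Columns the formatting template is assumed to read.
REQUIRED_COLUMNS = ("bad_prompts", "improved_prompts")


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


# Example with plain lists; in the notebook, pass `df.columns` instead.
print(missing_columns(["bad_prompts", "improved_prompts"]))    # []
print(missing_columns(["initial_prompt", "improved_prompts"]))  # ['bad_prompts']
```

In the notebook this could be called as `missing_columns(df.columns)` just before the `df.apply` line, raising a `ValueError` that names the missing columns if the result is non-empty.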

## Next Steps

Now that we have set up the notebook environment and implemented the core utility functions, we can proceed to the next tasks:

1. Show the limitations of basic prompts for code generation
2. Compare with optimized prompts
3. Analyze the differences in performance

These will be implemented in the subsequent sections of this notebook.
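
The comparison described in the steps above could be organized roughly as follows. This is only a sketch under stated assumptions: `compare_prompts`, `generate_fn`, and `score_fn` are hypothetical names, the templates are assumed to contain a single `{task}` placeholder and no other braces, and the scoring function stands in for whatever metric the later sections choose.

```python
from statistics import mean


def compare_prompts(tasks, base_template, tuned_template, generate_fn, score_fn):
    """Score each task under both templates and summarize the means.

    generate_fn(prompt) -> str and score_fn(task, output) -> float are
    caller-supplied placeholders for the model call and the metric.
    """
    rows = []
    for task in tasks:
        base_out = generate_fn(base_template.format(task=task))
        tuned_out = generate_fn(tuned_template.format(task=task))
        rows.append({
            "task": task,
            "base": score_fn(task, base_out),
            "tuned": score_fn(task, tuned_out),
        })
    summary = {
        "base_mean": mean(r["base"] for r in rows),
        "tuned_mean": mean(r["tuned"] for r in rows),
    }
    return rows, summary


# Stub usage: the "model" echoes its prompt and the "metric" is output length.
rows, summary = compare_prompts(
    ["a", "bb"], "{task}", "do {task}",
    generate_fn=lambda prompt: prompt,
    score_fn=lambda task, output: len(output),
)
print(summary)
```

Keeping the model call and the metric as arguments means the same loop can be reused once a real generation function and evaluation metric are available.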