# LLM Auto-Prompt Tuning for Code Generation

This notebook demonstrates automatic prompt tuning for Large Language Models (LLMs), with a focus on code generation. It is both an educational tool and a practical guide, walking you through each stage from basic model usage to running a prompt-tuning job.

## Overview

The notebook is organized into four main stages:

1. **Code Generation with Base Prompt**: Demonstrate basic code completion using a general-purpose model
2. **Data Preparation**: Generate code to download relevant datasets for LLM training
3. **Training Data Preprocessing**: Generate code to preprocess data for LLM training
4. **Auto-Prompt Tuning**: Run an auto-prompt tuning job that derives an optimized prompt from the base prompt and the prompt-tuning dataset

Each stage builds on the previous one, covering the workflow from raw data to an optimized prompt.

## Setup

Let's start by installing the necessary dependencies and setting up our environment.

In [None]:
# Install required packages
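# python-dotenv is included because the Stage 1 prompt loads HF_ACCESS_TOKEN with dotenv
!pip install -q transformers datasets accelerate torch evaluate python-dotenv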

In [None]:
# Import necessary libraries
import os
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import time
import json
from IPython.display import display, HTML, Markdown

## Stage 1: Demonstrate basic code completion using a general-purpose model

***Use the following prompt in the next code cell to generate code***

```text
Create a Python script to download a dataset from Hugging Face and save it as a CSV.
1. Use the dotenv library to load an environment variable named "HF_ACCESS_TOKEN".
2. Download the 'train' split of the "Jayveersinh-Raj/prompt-tuning-text-to-code" dataset. Use the token for authentication.
3. Create a 'data' directory if it doesn't exist.
4. Save the dataset into the 'data' directory as "prompt_tuning_train.csv". Do not include the DataFrame index.
5. Print a success message with the final file path.
```
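
For reference, the kind of script this prompt is asking for might look like the sketch below. It is one possible shape, not the output of any particular model. The function names `build_output_path` and `download_train_split` are hypothetical, and the `token=` argument to `load_dataset` assumes a recent `datasets` release (older releases used `use_auth_token=` instead).

```python
import os
from pathlib import Path


def build_output_path(out_dir="data", filename="prompt_tuning_train.csv"):
    """Create the output directory if it is missing and return the CSV path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / filename


def download_train_split(
    dataset_id="Jayveersinh-Raj/prompt-tuning-text-to-code",
    out_dir="data",
):
    """Download the 'train' split and save it as a CSV without the index."""
    # Third-party imports live inside the function so the helper above can be
    # used even when these packages are not installed.
    from datasets import load_dataset
    from dotenv import load_dotenv

    # Read HF_ACCESS_TOKEN from a local .env file, if one exists.
    load_dotenv()
    token = os.getenv("HF_ACCESS_TOKEN")

    # Assumption: `token=` is accepted by the installed `datasets` version.
    dataset = load_dataset(dataset_id, split="train", token=token)

    csv_path = build_output_path(out_dir)
    dataset.to_pandas().to_csv(csv_path, index=False)
    print(f"Dataset saved to {csv_path}")
    return csv_path
```

Calling `download_train_split()` in a notebook cell requires network access and a valid token, so it is left out of the sketch itself.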

## Stage 2: Generate code to download relevant datasets for LLM training

***Use the following prompt in the next code cell to generate code***

```text
Goal: Format the raw data into a structured training prompt.

1. Import pandas to read the CSV file.
2. Load the "data/prompt_tuning_train.csv" into a pandas DataFrame.
3. Display the first 5 rows of the DataFrame to inspect the columns.

    (The columns should be 'bad_prompts' and 'improved_prompts')

4. Define a function create_training_prompt(row) that takes a row of the DataFrame as input.
5. Inside the function, format the data into a clear structure:
    1. Start with a system instruction: "Rewrite the user's initial request into a refined prompt."
    2. Add a "### Initial Request:" section using the value from the 'bad_prompts' column.
    3. Add a "### Refined Prompt:" section using the value from the 'improved_prompts' column.
6. The function should return the complete formatted string.
```
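
A minimal sketch of the formatting function described in this prompt is shown below. The section headers and the system instruction come from the prompt itself. The exact blank-line layout between sections, the `SYSTEM_INSTRUCTION` constant, and the example record are assumptions made for illustration.

```python
SYSTEM_INSTRUCTION = "Rewrite the user's initial request into a refined prompt."


def create_training_prompt(row):
    """Format one record into a single training prompt string.

    `row` can be a pandas Series or any mapping that has the
    'bad_prompts' and 'improved_prompts' keys.
    """
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"### Initial Request:\n{row['bad_prompts']}\n\n"
        f"### Refined Prompt:\n{row['improved_prompts']}"
    )


# Placeholder record for illustration only; it is not taken from the dataset.
example = {
    "bad_prompts": "sort list",
    "improved_prompts": "Write a Python function that sorts a list of integers in ascending order.",
}
print(create_training_prompt(example))
```

Because the function only indexes `row` by column name, it can be passed directly to `DataFrame.apply(..., axis=1)` once the CSV has been loaded.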

## Stage 3: Generate code to preprocess data for LLM training

***Use the following prompt in the next code cell to generate code***

```text
Now apply the formatting function to the DataFrame.
1. Create a new column 'formatted_prompt' in the DataFrame.
2. Apply the `create_training_prompt` function to each row of the DataFrame to populate the new column.
3. Print the content of the 'formatted_prompt' column for the first row to verify the output.
4. Save the processed DataFrame with the new column to "data/prompt_tuning_train_processed.jsonl" in JSON Lines format, with orient='records' and lines=True.
```
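
The steps in this prompt could be sketched roughly as follows. The helpers `add_formatted_prompt` and `save_jsonl` are hypothetical names, the formatting function is repeated in compact form so the block runs on its own, and the one-row DataFrame is a placeholder standing in for the contents of "data/prompt_tuning_train.csv".

```python
from pathlib import Path

import pandas as pd


def create_training_prompt(row):
    """Compact copy of the Stage 2 formatter so this block is self-contained."""
    return (
        "Rewrite the user's initial request into a refined prompt.\n\n"
        f"### Initial Request:\n{row['bad_prompts']}\n\n"
        f"### Refined Prompt:\n{row['improved_prompts']}"
    )


def add_formatted_prompt(df):
    """Return a copy of `df` with a new 'formatted_prompt' column."""
    out = df.copy()
    out["formatted_prompt"] = out.apply(create_training_prompt, axis=1)
    return out


def save_jsonl(df, path):
    """Write `df` as JSON Lines: one JSON object per row."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_json(path, orient="records", lines=True)
    return path


# Placeholder row for illustration only; in the notebook this DataFrame
# would come from pd.read_csv("data/prompt_tuning_train.csv").
demo = pd.DataFrame(
    {
        "bad_prompts": ["sort list"],
        "improved_prompts": [
            "Write a Python function that sorts a list of integers in ascending order."
        ],
    }
)
processed = add_formatted_prompt(demo)

# Inspect the first formatted prompt to verify the layout.
print(processed["formatted_prompt"].iloc[0])
```

To persist the result as the prompt requests, one would then call `save_jsonl(processed, "data/prompt_tuning_train_processed.jsonl")`.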

## Next Steps

Now that we have set up the notebook environment and implemented the core utility functions, we can proceed to the next tasks:

1. Show the limitations of basic prompts for code generation
2. Compare with optimized prompts
3. Analyze the differences in performance

These will be implemented in the subsequent sections of this notebook.