# **!pip install datasets**

This code cell is required to install the **Hugging Face Datasets** library.

The library provides tools for easily accessing, processing, and managing large-scale datasets that are often used for machine learning (ML), computer vision (CV), and Natural Language Processing (NLP).

* Access to Large Datasets: Provides a hub of ready-to-use datasets for NLP, CV, ML.

* Efficient Loading: Uses memory mapping and streaming to handle large datasets efficiently.

* Dataset Processing Tools: Includes utilities for filtering, tokenizing, and transforming datasets.

**What is !pip install?**

In a notebook, a leading **!** runs the rest of the line as a shell command. **pip install** is necessary because some libraries are not pre-installed in the runtime.

We use this library together with pandas to read the CSV files and extract the data we need to train the model.


In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading xx

### Installing Libraries
📌 Functionality
The Steps:
Install Libraries:
Use pip install to install the required libraries, such as openai, pandas, and tqdm, which help interact with OpenAI's API, handle data, and track progress in your tasks.

* openai: This library allows you to interact with OpenAI's API. It provides functions to perform various tasks, such as uploading files (e.g., .jsonl datasets for fine-tuning), creating fine-tuned models, and making predictions with OpenAI models. It is essential when you are working with the OpenAI ecosystem for tasks like fine-tuning GPT models or querying them.
* pandas: This is a powerful library for data manipulation and analysis. It is widely used for handling structured data (e.g., CSV, JSON, and SQL). pandas allows you to work with data in a tabular format (DataFrames), making it easy to filter, clean, transform, and analyze data.
* tqdm: This is a library for creating progress bars in Python, commonly used when you are processing long-running tasks like iterating over large datasets. It helps visualize the progress of loops and tasks in real time, so you can track how much of the task is complete and how much remains.

In [None]:
!pip install openai pandas tqdm




# API Verification
The code cell below retrieves an API key from a Colab Secret named "OpenAiKey". The key is then used to create an OpenAI client, and the cell checks that the configuration succeeded.

**from openai import OpenAI** imports OpenAI's official Python client library, which allows us to interact with OpenAI's API (Application Programming Interface).

We use this service to generate text for our narratives using our datasets.

The API key must be kept private: store it in Colab Secrets rather than writing it into the notebook.

In [None]:
from openai import OpenAI  # ChatGPT suggested 'import openai' instead
from google.colab import userdata  # ChatGPT suggested 'from google.colab import auth'


# Retrieve API Key from Colab Secrets
api_key = userdata.get('OpenAiKey')
if not api_key:
    raise ValueError("Failed to retrieve the API key. Make sure the secret is configured correctly.")

# Create the client only after a key was found. A client object is always
# truthy, so testing the client itself would not detect a missing key.
client = OpenAI(api_key=api_key)
print("API key retrieved successfully!")


API key retrieved successfully!


# Mounting the Google Drive
📌 Functionality
The Steps:
1. Mount Google Drive:
Use the google.colab library to mount your Google Drive, allowing you to access files stored on your Google Drive directly within the Colab environment.
2. Mount the Drive:
The drive.mount() function will authenticate and mount Google Drive to the specified location (/content/drive), enabling you to read from and write to your Drive files.
* from google.colab import drive: This imports the drive module from the google.colab library, which provides functionality for working with Google Drive in Colab.

* drive.mount('/content/drive'): This command mounts your Google Drive at the specified location (/content/drive) within the Colab environment. When you run this command:

 * Colab will ask you for authorization, so you can connect your Google Drive account to the Colab session.
 * After authorization, the files in your Google Drive will be accessible at /content/drive as if they were part of the local filesystem. You can now read, write, and manipulate files stored in your Google Drive directly in the Colab notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Setup for Fine-Tuning

### import pandas as pd

* **import pandas** imports the Pandas library.
* **as pd** gives the Pandas library the alias pd as an abbreviation.
* **import json** imports Python's built-in **JSON (JavaScript Object Notation) module**, which allows us to work with JSON data, a common format for storing and exchanging data.
    * For our program, it is used to parse the bounding box strings from the dataset and to write the JSONL file used for OpenAI fine-tuning.
    * Converts JSON to Python objects.
    * Converts Python objects to JSON for storage or transmission.

### import ast (not currently used)
* This library allows for safe conversion of a string to a Python object such as a list, dict, or tuple.
* We need to convert the string created from the dataset into a list so it can be manipulated for our needs.
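
As a small illustration of the safe conversion described above, the sketch below parses a bounding box string with ast.literal_eval. The sample string is made up for this example and is not taken from the dataset.

```python
import ast

# Hypothetical bounding box string, as it might appear in a CSV cell
bbox_string = "[120, 45, 310, 90]"

# literal_eval only accepts Python literals (lists, dicts, tuples, numbers,
# strings), so it does not run arbitrary code the way eval() would
bbox = ast.literal_eval(bbox_string)

print(type(bbox).__name__, bbox)
```

Unlike json.loads, literal_eval also accepts Python-style literals such as tuples and single-quoted strings, which is one reason it is sometimes suggested for this kind of column.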

### from tqdm import tqdm (not currently used)
* **tqdm** stands for **taqaddum** (Arabic for "progress") and is a popular Python library used to add **progress bars** to loops or operations, making it easy to visualize how much work has been completed and how much is remaining.

### Loading Dataset
* To load a dataset you need its file path, which points to a file in your Google Drive.
* Then you use the pandas library to read the dataset and store the resulting DataFrame in the variable df.

### Normalization
* Parameters include the bbox variable, image_width, and image_height.
    * bbox holds the bounding box data from our dataset file.
    * image_width & image_height are the pixel dimensions of the images (defaults: 2880 and 1800) used to scale the coordinates.
* If the "Bounding Box" column exists in df, the function is applied to every value with df["Bounding Box"].apply(normalize_bbox) and the result is stored in a new "Normalized BBox" column.
* The function normalize_bbox converts the coordinates of the bounding box into the range 0.0 to 1.0.
* It unpacks x_min, y_min, x_max, and y_max from the bounding box, then returns four ratios: x_min divided by the image width, y_min divided by the image height, the box width (x_max - x_min) divided by the image width, and the box height (y_max - y_min) divided by the image height.
* If a bounding box cannot be parsed, the function prints a message and returns [0, 0, 0, 0].
* The range of 0.0 to 1.0 ensures the bounding boxes are expressed relative to the image size rather than in pixel values.
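
A worked example of the normalization, assuming the default 2880 x 1800 image size used in this notebook. The function name normalize_bbox_example and the sample box are made up for illustration; the arithmetic follows normalize_bbox in the code cell, without the string parsing and error handling.

```python
def normalize_bbox_example(bbox, image_width=2880, image_height=1800):
    """Scale [x_min, y_min, x_max, y_max] pixels to [x, y, width, height] ratios."""
    x_min, y_min, x_max, y_max = bbox
    return [
        x_min / image_width,             # left edge as a fraction of the width
        y_min / image_height,            # top edge as a fraction of the height
        (x_max - x_min) / image_width,   # box width as a fraction of the width
        (y_max - y_min) / image_height,  # box height as a fraction of the height
    ]

# Made-up box: top-left corner at (720, 450), bottom-right corner at (1440, 900)
example = normalize_bbox_example([720, 450, 1440, 900])
print(example)  # [0.25, 0.25, 0.25, 0.25]
```

Note that the last two values are the box width and height, not the normalized x_max and y_max.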

### Cleaning OCR Text
* The function clean_ocr_text is applied to the OCR column with df["OCR TEXT"].apply(clean_ocr_text). If the value is a string, it removes leading and trailing whitespace.
* If it is not a string, it converts it into a string.
* If an error is caught, it prints a message and returns "UNKNOWN".

### Grouping Data by Image Name
* df.groupby("Original Filename"): Groups data by the value in the "Original Filename" column
* .agg({...}): Defines how the columns of each group are combined into a single row per filename.
 * For "OCR TEXT" and "Normalized BBox" it stores all of the group's values as a list.
 * For "Narrative 1" through "Narrative 4" it uses "first", which takes the first value of that column in each group.
* .reset_index(): Converts the grouped result back into a regular DataFrame with "Original Filename" as a column.

### Creating Structured Input
* "\n".join([...]): Combines all lines into a single string, with each item appearing on a new line.
* f"- '{ocr}' appears at coordinates {bbox}.": Formats each line to state that the OCR text appears at a specific coordinate.
* for ocr, bbox in zip(row["OCR TEXT"], row["Normalized BBox"]): Pairs each OCR value with its matching bounding box.
* The function then returns a prompt that lists each OCR text with its bounding box and ends with "Description:" for the model to complete.
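
The pairing step can be tried on its own with a small made-up row. The dictionary below stands in for one row of grouped_data; its values are hypothetical.

```python
# Hypothetical grouped row: two OCR strings with their normalized boxes
row = {
    "OCR TEXT": ["STOP", "ONLY"],
    "Normalized BBox": [[0.1, 0.2, 0.05, 0.04], [0.3, 0.4, 0.02, 0.04]],
}

# Pair each OCR string with its box and format one bullet line per pair
ocr_bbox_pairs = "\n".join(
    f"- '{ocr}' appears at coordinates {bbox}."
    for ocr, bbox in zip(row["OCR TEXT"], row["Normalized BBox"])
)

print(ocr_bbox_pairs)
```

Because zip stops at the shorter sequence, a row with more OCR strings than boxes (or the reverse) would silently drop the extras.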

### Applying the Structured Input
* grouped_data["input_text"] = ...: creates a new column called input_text
* grouped_data.apply(create_input_text, axis=1): Applies the function create_input_text to every row of grouped_data.
 * axis=1: The function is applied to each row individually.

### Merging Narrative Texts into Output Text
* grouped_data["output_text"] =: Creates a new column named output_text in grouped_data.
* grouped_data[["Narrative 1", "Narrative 2", "Narrative 3", "Narrative 4"]]: Selects columns that contain narrative descriptions.
* .astype(str): Ensures all values are treated as strings (prevents errors if values are NaN or other non-string types).
* .agg(...): Aggregates the narrative strings into a single formatted output.
* lambda x:: Applies the lambda function row-wise.
* "\n".join(...): Joins all formatted narratives into a single string, separating them with a newline.
* [f"Variant {i+1}: {narrative}" for i, narrative in enumerate(x)]: Iterates over each narrative in the row, formats it as "Variant X: <narrative>", and adds it to the final output string.
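
The formatting step can be sketched with a plain list in place of a DataFrame row. The narratives below are invented for the example, and str() stands in for .astype(str).

```python
# Hypothetical narratives for one image; the last one is missing (None)
narratives = ["A stop sign at an intersection.", "Road markings say ONLY.", None]

# str() mirrors .astype(str): missing values become text instead of raising an error
output_text = "\n".join(
    f"Variant {i+1}: {str(narrative)}" for i, narrative in enumerate(narratives)
)

print(output_text)
```

One consequence worth checking in the real data: a missing narrative is written out as literal text ("None" here, "nan" for a pandas NaN), so it would end up in the training output unless it is filtered first.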

### Convert to Hugging Face Dataset
* Dataset.from_pandas(...): Converts the grouped_data DataFrame into a Hugging Face dataset.
* grouped_data[["input_text", "output_text"]]: Selects only the relevant columns for fine-tuning.
* .shuffle(seed=42): Randomly shuffles the dataset to ensure a diverse order for training while keeping results reproducible.

### Save as JSONL Format for OpenAI Fine-Tuning
* jsonl_file_path = "/content/dataset_for_finetuning_gpt.jsonl": Defines the file path where the dataset will be saved.
* with open(jsonl_file_path, "w") as jsonl_file: Opens the file in write mode.
* for _, row in grouped_data.iterrows(): Iterates over each row in grouped_data.
* json_record = {...}: Creates a dictionary with input_text and output_text for each row.
* jsonl_file.write(json.dumps(json_record) + "\n"): Converts the dictionary to a JSON string and writes it to the file, ensuring each record is stored as a separate line.
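
A minimal round trip through the JSONL format, using two made-up records and a temporary file (the file name demo_finetuning.jsonl is only for this example):

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the notebook's rows
records = [
    {"input_text": "Detected Text: 'STOP'", "output_text": "Variant 1: A stop sign."},
    {"input_text": "Detected Text: 'ONLY'", "output_text": "Variant 1: A turn lane."},
]

# Write one JSON object per line
demo_path = os.path.join(tempfile.gettempdir(), "demo_finetuning.jsonl")
with open(demo_path, "w") as jsonl_file:
    for record in records:
        jsonl_file.write(json.dumps(record) + "\n")

# Read the file back: each line parses independently with json.loads
with open(demo_path, "r") as jsonl_file:
    loaded = [json.loads(line) for line in jsonl_file]

print(len(loaded), loaded[0]["input_text"])
```

json.dumps escapes any newline inside a string as \n, so a multi-line prompt still occupies exactly one line of the file.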

In [None]:
import pandas as pd
import json
import ast
from tqdm import tqdm

# Load dataset from Google Drive
# file_path = "/content/drive/MyDrive/InstructAware/REU/Data/REU_COPY_Capstone_Final_Dataset_Without_ERROR - Capstone_Final_Dataset_Without_ERROR.csv"  # Update with actual path
file_path = "/content/drive/MyDrive/REU_COPY_Capstone_Final_Dataset_Without_ERROR - Capstone_Final_Dataset_Without_ERROR.csv"  # Path for Brandon
df = pd.read_csv(file_path)

## Function to normalize bounding boxes (if needed)
def normalize_bbox(bbox, image_width=2880, image_height=1800):
    try:
        bbox = json.loads(bbox)  # Convert JSON string to Python list # ChatGPT suggests using ast.literal_eval(bbox)
        x_min, y_min, x_max, y_max = bbox
        return [
            x_min / image_width,
            y_min / image_height,
            (x_max - x_min) / image_width,
            (y_max - y_min) / image_height,
        ]
    except Exception as e:
        print(f"Skipping row due to error in bounding box parsing: {e}")
        return [0, 0, 0, 0]  # Default in case of an error

# Normalize bounding boxes
if "Bounding Box" in df.columns:
    df["Normalized BBox"] = df["Bounding Box"].apply(normalize_bbox)

# Ensure OCR TEXT is treated as a string and cleaned
def clean_ocr_text(ocr_text):
    try:
        # Ensure it's a string
        if isinstance(ocr_text, str):
            return ocr_text.strip()
        return str(ocr_text)
    except Exception as e:
        print(f"Skipping row due to OCR text parsing error: {e}")
        return "UNKNOWN"

df["OCR TEXT"] = df["OCR TEXT"].apply(clean_ocr_text)

# Grouping data by image name
grouped_data = df.groupby("Original Filename").agg({
    "OCR TEXT": list,  # Store as a list
    "Normalized BBox": list,  # Store as a list
    "Narrative 1": "first",
    "Narrative 2": "first",
    "Narrative 3": "first",
    "Narrative 4": "first"
}).reset_index()

# Function to create structured input text
def create_input_text(row):
    ocr_bbox_pairs = "\n".join([
        f"- '{ocr}' appears at coordinates {bbox}." for ocr, bbox in zip(row["OCR TEXT"], row["Normalized BBox"])
    ])
    return f"""Task: Generate a natural language description based on detected text and bounding boxes.

Detected Text & Locations:
{ocr_bbox_pairs}

Description:"""

# Apply function to create structured input text
grouped_data["input_text"] = grouped_data.apply(create_input_text, axis=1)

# Merge all narratives into one output text
grouped_data["output_text"] = grouped_data[["Narrative 1", "Narrative 2", "Narrative 3", "Narrative 4"]].astype(str).agg(
    lambda x: "\n".join([f"Variant {i+1}: {narrative}" for i, narrative in enumerate(x)]), axis=1
)

# Convert to Hugging Face Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(grouped_data[["input_text", "output_text"]]).shuffle(seed=42)

# Save as JSONL format for OpenAI fine-tuning
jsonl_file_path = "/content/dataset_for_finetuning_gpt.jsonl"
with open(jsonl_file_path, "w") as jsonl_file:
    for _, row in grouped_data.iterrows():
        json_record = {
            "input_text": row["input_text"],
            "output_text": row["output_text"]
        }
        jsonl_file.write(json.dumps(json_record) + "\n")

print(f"✅ Dataset saved as JSONL: {jsonl_file_path}")

ModuleNotFoundError: No module named 'datasets'

# Reading JSONL File
* with open(jsonl_file_path, "r") as file:: Opens the JSONL file in read mode.
* for line in file:: Iterates through each line in the file.
* json.loads(line): Converts each line (a JSON string) into a Python dictionary.
* data.append(...): Adds the parsed dictionary to the data list.

### Convert list of JSON into pandas DataFrame
* pd.DataFrame(data): Converts the list of dictionaries into a Pandas DataFrame for easier data manipulation and analysis.

### Display Data
* from IPython.display import display: Imports display() to properly render the DataFrame in Jupyter notebooks.
* display(df): Displays the DataFrame in an interactive format.

In [None]:
import pandas as pd
import json

# Load and read the JSONL file
jsonl_file_path = "/content/fixed_dataset_for_finetuning.jsonl"

# Read JSONL file line by line and store it in a list
data = []
with open(jsonl_file_path, "r") as file:
    for line in file:
        data.append(json.loads(line))

# Convert list of JSON objects into a pandas DataFrame
df = pd.DataFrame(data)

from IPython.display import display
display(df)

FileNotFoundError: [Errno 2] No such file or directory: '/content/fixed_dataset_for_finetuning.jsonl'

# Upgrading openai Library
📌 Functionality
The Steps:
Upgrade the OpenAI Library:

Use pip to upgrade the openai library to the latest version available, ensuring you have the newest features and bug fixes.

The --upgrade flag checks whether there is a newer version of openai and installs it if needed.

* !pip install --upgrade openai: This command upgrades the openai package to the latest version. In the output below, pip replaces openai 1.61.1 with 1.64.0.

In [None]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.64.0-py3-none-any.whl.metadata (27 kB)
Downloading openai-1.64.0-py3-none-any.whl (472 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.61.1
    Uninstalling openai-1.61.1:
      Successfully uninstalled openai-1.61.1
Successfully installed openai-1.64.0


## TODO - Modify code/ add code to do a training-validation-test split

Make sure you define the percentage split as parameter so they can be modified easily.
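
One possible starting point for this TODO is sketched below. It is not part of the notebook's current code: the function name split_dataset, the default fractions, and the seed are all assumptions chosen for illustration.

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records and split them into train/validation/test lists.

    The test split is whatever remains after train_frac and val_frac.
    """
    if train_frac < 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("train_frac and val_frac must be non-negative and sum to at most 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Example with 100 placeholder records
train_data, val_data, test_data = split_dataset(list(range(100)))
print(len(train_data), len(val_data), len(test_data))  # 80 10 10
```

Keeping the fractions as keyword arguments means the split can be changed in one place, for example split_dataset(jsonl_data, train_frac=0.7, val_frac=0.15).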





**Functionality:** This code reads an existing JSONL (JavaScript Object Notation Lines) file, reformats the data in each line, and then saves it to a new JSONL file.



1.  **Imports:**
*   json: To handle reading and writing JSON data
*   random: Imported for randomly selecting lines of data for training, testing, and validation (not yet used in this cell)

2.  **Paths:**
*   Paths for the input (input_jsonl_path) and output (output_jsonl_path) JSONL files are defined.

3.   **Reading the Input JSONL File:**
*   The code opens the input JSONL file and reads it line by line
*   Each line is assumed to contain a JSON object with input_text and output_text keys
4.  **Reformatting the Data:**
*   For each line, the code constructs a new dictionary called jsonl_entry following the format of:

        {
          "messages": [
              {"role": "system", "content": "You generate detailed narratives from text."},
              {"role": "user", "content": data["input_text"]},
              {"role": "assistant", "content": data["output_text"]}
          ]
        }
*   The input_text and output_text from the original entry are added as the user and assistant messages, respectively.
*   A "system" message is also included, with a fixed message: "You generate detailed narratives from text."
5.  **Saving the Reformatted Data:**
*    The code then writes the reformatted entries to the new JSONL file, one per line.
6.  **Final Message:**
*    After saving, the program prints a success message indicating the output file's path.
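
The reformatting step can also be written as a small helper, which makes it easy to try on a single record. The helper name to_chat_entry and the sample record are made up for this sketch; the message layout is the one described above.

```python
import json

SYSTEM_PROMPT = "You generate detailed narratives from text."

def to_chat_entry(record, system_prompt=SYSTEM_PROMPT):
    """Wrap an {'input_text', 'output_text'} record in the chat 'messages' format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["input_text"]},
            {"role": "assistant", "content": record["output_text"]},
        ]
    }

# Hypothetical record in the shape produced by the earlier cell
sample = {"input_text": "Detected Text: 'STOP'", "output_text": "Variant 1: A stop sign."}
entry = to_chat_entry(sample)
line = json.dumps(entry)  # one line of the output JSONL file

print([m["role"] for m in entry["messages"]])  # ['system', 'user', 'assistant']
```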



In [None]:
import json
import random

# Path to your existing JSONL file
input_jsonl_path = "/content/dataset_for_finetuning_gpt.jsonl"  # Update this
output_jsonl_path = "/content/fixed_dataset_for_finetuning.jsonl"

# Read and reformat JSONL data
jsonl_data = []
with open(input_jsonl_path, "r") as f:
    for line in f:
        data = json.loads(line.strip())

        # Assuming current format is {'input_text': ..., 'output_text': ...}
        jsonl_entry = {
            "messages": [
                {"role": "system", "content": "You generate detailed narratives from text."},
                {"role": "user", "content": data["input_text"]},
                {"role": "assistant", "content": data["output_text"]}
            ]
        }
        jsonl_data.append(jsonl_entry)

# Save reformatted JSONL
with open(output_jsonl_path, "w") as f:
    for entry in jsonl_data:
        f.write(json.dumps(entry) + "\n")

print(f"✅ Fixed JSONL file saved at: {output_jsonl_path}")


FileNotFoundError: [Errno 2] No such file or directory: '/content/dataset_for_finetuning_gpt.jsonl'

## TODO - This code takes the array of data (saved as a JSONL file) and uploads it to OpenAI as a file. You will instead have three arrays: one for training, one for testing, and one for validation

You will need to upload training and validation files both to OpenAI.
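
A sketch of how both uploads could be handled with one helper, assuming client is an openai.OpenAI instance and that the training and validation splits were already saved as separate .jsonl files. The helper name upload_for_fine_tuning and the paths in the comment are hypothetical; client.files.create with purpose="fine-tune" is the same call used in the upload cell.

```python
def upload_for_fine_tuning(client, paths):
    """Upload each JSONL file and return a {label: file_id} mapping.

    client: expected to be an openai.OpenAI instance.
    paths:  maps a label such as "train" or "validation" to a local .jsonl path.
    """
    file_ids = {}
    for name, path in paths.items():
        # Open in binary mode and close the handle as soon as the upload returns
        with open(path, "rb") as f:
            response = client.files.create(file=f, purpose="fine-tune")
        file_ids[name] = response.id
    return file_ids

# Example call (paths are hypothetical):
# file_ids = upload_for_fine_tuning(
#     client,
#     {"train": "/content/train.jsonl", "validation": "/content/validation.jsonl"},
# )
```

The returned IDs are what the fine-tuning call expects for training_file and validation_file. The test split stays local, since it is only needed for evaluating the finished model.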

# Uploading a JSONL Dataset to OpenAI for Fine-Tuning

This script uploads a `.jsonl` dataset to OpenAI for fine-tuning.

---

## 📌 Functionality

The Steps:
1. Loads a JSONL dataset from the specified path.
2. Authenticates with OpenAI using an API key.
3. Uploads the dataset to OpenAI's servers.
4. Retrieves and prints the file ID for tracking.

---
**Upload the Dataset**
* file=open(jsonl_path, "rb"): Opens the file in binary mode for upload.
* purpose="fine-tune": Specifies that the file will be used for fine-tuning.

**Retrieve File ID**
* response.id: Extracts the file ID from OpenAI's response.
* print(...): Displays a success message with the file ID.


In [None]:
import openai

# Path to your JSONL file
jsonl_path = "/content/fixed_dataset_for_finetuning.jsonl"
openai.api_key = userdata.get("OpenAiKey")

# Upload the file to OpenAI (new API method)
response = openai.files.create(
    file=open(jsonl_path, "rb"),
    purpose="fine-tune"
)

# Get the file ID
file_id = response.id
print(f"✅ Dataset uploaded successfully! File ID: {file_id}")


FileNotFoundError: [Errno 2] No such file or directory: '/content/fixed_dataset_for_finetuning.jsonl'

## TODO - Modify Code here to adjust the training call to include Validation Data



## **<font color = red> From CHATGPT got this information:</font>**

Here's the updated code with dataset splitting for validation and monitoring included:

# 🚀 Fine-Tuning GPT-3.5/4 with Validation in OpenAI

This script fine-tunes OpenAI's `gpt-3.5-turbo` (or `gpt-4`) using a **training and validation dataset**.

## 📌: Steps:
1. **Upload the training and validation dataset** as `.jsonl` files to OpenAI.
2. **Replace the file IDs** in the code.
3. **Run the script** to start fine-tuning.
4. **Monitor training & validation loss** dynamically.
---
**Functionality**: This Python code uses OpenAI's API to fine-tune a model using a training file and a validation file, and then monitors the fine-tuning job until completion. Below is a detailed explanation of each part of the code:

**Imports:**
* **openai**: The OpenAI Python client, used to interact with the OpenAI API.
* time: The time module is used to create pauses (delays) in the code execution, allowing for periodic status checks.

**Variables**:
* **training_file_id**: This is the ID of the training file (a file containing data used to train the model). You need to replace the placeholder with the actual ID of the file you're using for training.
* **validation_file_id**: This is the ID of the validation file (a file containing data used to evaluate the model's performance during training). Similarly, replace this placeholder with the actual ID of your validation file.
* **base_model**: The base model you want to fine-tune. In this case, it's set to gpt-3.5-turbo, but you can change it to gpt-4 if necessary.

**Fine-tuning Request**:
* The **openai.fine_tuning.jobs.create()** method starts the fine-tuning process. It sends a request to OpenAI’s API to initiate fine-tuning using the provided training and validation files and the selected base model.
  * **training_file**: The ID of the training file.
  * **validation_file**: The ID of the validation file (important for tracking how well the model is performing during training).
  * model: The base model to be fine-tuned (gpt-3.5-turbo in this case).
* The response from the API contains the job ID, which is saved to the job_id variable for future reference.

**Job Monitoring**:
* A loop is used to periodically check the fine-tuning job's status.
* openai.fine_tuning.jobs.retrieve(job_id): This retrieves the current status of the fine-tuning job using the job ID.
* status: The current status of the job (e.g., pending, in_progress, succeeded, failed, etc.).
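* Metrics: The code checks whether the job object has a metrics attribute and, if it does, displays the training and validation loss. These metrics help track the model’s performance during fine-tuning. If the installed openai version does not expose this attribute on the job object, nothing is printed for this step; loss values may instead be available through the job's events (openai.fine_tuning.jobs.list_events).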
  * Training loss: Indicates how well the model is learning from the training data.
  * Validation loss: Indicates how well the model is performing on unseen data (the validation set).

**Checking Job Completion**:
* The code continues to check the job status every 60 seconds (**time.sleep(60)**) until the job reaches one of the final states: succeeded, failed, or cancelled.
* If the job is completed (i.e., status is succeeded), the fine-tuned model’s name is retrieved from the job_status object.
* If the fine-tuning job is successful, the model name is printed; if it failed or was cancelled, an error message is shown.

**Final Output**:
* If the job succeeded, the fine-tuned model's name (fine_tuned_model) is displayed.
* If the job failed or was cancelled, a warning message is displayed instead.

**Key Points**:
* Fine-tuning is an iterative process, and this code monitors the status and tracks progress.
* Using a validation file helps ensure that the model generalizes well and avoids overfitting to the training data.
* The code includes checks to display training and validation losses for more transparency into how the fine-tuning is progressing.
---

```python
import openai
import time

# Define your uploaded file IDs (replace with actual validation file ID)
training_file_id = "file-HijQNGKAUThFKd9sTfA42y"  # Replace with actual training file ID
validation_file_id = "file-VALIDATION_FILE_ID"  # Replace with actual validation file ID

# Choose the base model for fine-tuning
base_model = "gpt-3.5-turbo"  # You can change to "gpt-4" if needed

# Start the fine-tuning job with validation
response = openai.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,  # Added validation file
    model=base_model
)

# Get the job ID
job_id = response.id
print(f"🚀 Fine-tuning started! Job ID: {job_id}")

# Monitor fine-tuning progress with validation tracking
while True:
    job_status = openai.fine_tuning.jobs.retrieve(job_id)
    status = job_status.status
    print(f"🔄 Fine-tuning status: {status}")

    # Retrieve training and validation loss if available
    if hasattr(job_status, "metrics"):
        metrics = job_status.metrics
        training_loss = metrics.get("training_loss", "N/A")
        validation_loss = metrics.get("validation_loss", "N/A")
        print(f"📉 Training Loss: {training_loss} | 📊 Validation Loss: {validation_loss}")

    if status in ["succeeded", "failed", "cancelled"]:
        break  # Stop checking when the job finishes

    time.sleep(60)  # Wait for 1 minute before checking again

# Retrieve the fine-tuned model name
if status == "succeeded":
    fine_tuned_model = job_status.fine_tuned_model
    print(f"🎉 Fine-tuned model is ready! Model name: {fine_tuned_model}")
else:
    print("⚠️ Fine-tuning failed or was cancelled.")
```






# Fine-Tuning GPT-3.5/4 with OpenAI

This script fine-tunes OpenAI's `gpt-3.5-turbo` (or `gpt-4`) using an uploaded dataset.

---
**Choose Base Model**
* base_model: Defines which OpenAI model to fine-tune.
* print(...): Confirms the selected model.

**Fine Tuning**
* openai.fine_tuning.jobs.create(...): Initiates the fine-tuning job.
* training_file=file_id: Uses the specified dataset file.
* model=base_model: Fine-tunes the chosen base model.
* job_id = response.id: Retrieves and stores the fine-tuning job ID.
* print(...): Displays confirmation of the job start.

**Monitor Fine-Tuning Progress**
* openai.fine_tuning.jobs.retrieve(job_id): Retrieves the job’s current status.
* status: Stores the job state (pending, in_progress, succeeded, failed, etc.).
* print(...): Displays the current fine-tuning status.
* if status in ["succeeded", "failed", "cancelled"]:: Stops checking once the job is finished.
* time.sleep(60): Waits 60 seconds before checking again.

**Retrieve Fine-Tuned Model Name**
* if status == "succeeded":: Checks if the fine-tuning process was successful.
* fine_tuned_model = job_status.fine_tuned_model: Retrieves the fine-tuned model’s name.
* print(...): Displays the model name if successful or an error message if failed.
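
The monitoring loop described above runs until the job finishes, however long that takes. A bounded variant is sketched below as one option. The helper name wait_for_job, the max_checks limit, and the simulated statuses are assumptions for illustration; in the notebook, get_status would wrap openai.fine_tuning.jobs.retrieve(job_id).status.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=60, max_checks=120, sleep=time.sleep):
    """Poll get_status() until a terminal state is seen or max_checks is reached.

    get_status: zero-argument callable returning the job's status string.
    Returns the last status seen, which is non-terminal if the limit was hit.
    """
    status = None
    for _ in range(max_checks):
        status = get_status()
        print(f"Fine-tuning status: {status}")
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)  # wait before the next check
    return status

# Simulated run: the statuses are stand-ins, and sleep is replaced so the demo is instant
simulated = iter(["queued", "running", "succeeded"])
final_status = wait_for_job(lambda: next(simulated), sleep=lambda seconds: None)
```

Passing the status lookup in as a callable keeps the helper independent of the API client, so it can be exercised without starting a real fine-tuning job.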

In [None]:
import openai
import time

# Define your uploaded file ID (already uploaded)
file_id = "file-HijQNGKAUThFKd9sTfA42y"  # Replace with your actual file ID if different

# Choose the base model for fine-tuning
base_model = "gpt-3.5-turbo"  # Change to "gpt-4" if needed

# Start the fine-tuning job
response = openai.fine_tuning.jobs.create(
    training_file=file_id,

    model=base_model
)

# Get the job ID
job_id = response.id
print(f"🚀 Fine-tuning started! Job ID: {job_id}")

# Monitor fine-tuning progress
while True:
    job_status = openai.fine_tuning.jobs.retrieve(job_id)
    status = job_status.status
    print(f"🔄 Fine-tuning status: {status}")

    if status in ["succeeded", "failed", "cancelled"]:
        break  # Stop checking when the job finishes

    time.sleep(60)  # Wait for 1 minute before checking again

# Retrieve the fine-tuned model name
if status == "succeeded":
    fine_tuned_model = job_status.fine_tuned_model
    print(f"🎉 Fine-tuned model is ready! Model name: {fine_tuned_model}")
else:
    print("⚠️ Fine-tuning failed or was cancelled.")

BadRequestError: Error code: 400 - {'error': {'message': 'file-HijQNGKAUThFKd9sTfA42y does not exist', 'type': 'invalid_request_error', 'param': None, 'code': None}}

# Retrieve Fine-Tuning Job Status

This script retrieves the status of a fine-tuning job using OpenAI's API.

---
**job_id**
* Specifies the fine-tuning job ID assigned by OpenAI.

**Retrieve Job Details**
* openai.fine_tuning.jobs.retrieve(job_id): Fetches details about the fine-tuning job.
* job_details.status: Retrieves the current status (pending, in_progress, succeeded, or failed).
* print(...): Displays the job's current status.

**Check for Errors (If Job Failed)**
* if job_details.status == "failed":: Checks if the job failed.
* job_details.error: Retrieves the failure reason.
* print(...): Displays an error message if the job failed.

In [None]:


# Your fine-tuning job ID
job_id = "ftjob-0YfsRGz0sbguTpcPKrBj8GSb"  # Replace with the actual job ID

# Retrieve the job details
job_details = openai.fine_tuning.jobs.retrieve(job_id)

# Print the status and error message (if any)
print("🔍 Job Status:", job_details.status)
if job_details.status == "failed":
    print("❌ Error Reason:", job_details.error)  # Shows the failure reason


NotFoundError: Error code: 404 - {'error': {'message': 'Could not find fine tune: ftjob-0YfsRGz0sbguTpcPKrBj8GSb', 'type': 'invalid_request_error', 'param': 'fine_tune_id', 'code': 'fine_tune_not_found'}}

# Generating Text with a Fine-Tuned OpenAI Model

This script uses a fine-tuned `gpt-3.5-turbo` model to generate natural language descriptions based on detected text and bounding box coordinates.

---
**Set API Key**
* openai.api_key: Stores the API key retrieved from Colab Secrets through userdata.

**Define Fine-Tuned Model**
* model_name: Specifies the fine-tuned model to use.
* print(...): Confirms the selected model.

**Define Input Messages**
* messages: Defines the conversation history for the model.
* role: "system": Provides instructions for how the model should behave.
* role: "user": Supplies the user input, which includes detected text and bounding box coordinates.

**Initialize OpenAI Client**
* client = openai.OpenAI(...): Initializes an OpenAI client with authentication.

**Generate a Response**
* client.chat.completions.create(...): Sends the request to OpenAI.
* model=model_name: Specifies the fine-tuned model.
* messages=messages: Provides the input text.
* temperature=0.9: Adjusts randomness (higher values make responses more creative).

**Retrieve and Print Response**
* response.choices[0].message.content: Extracts the generated text.
* print(...): Displays the output.

In [None]:
import openai

openai.api_key = userdata.get("OpenAiKey")


model_name = "ft:gpt-3.5-turbo-0125:rakshit::B3TzUmAj"

messages=[
        {"role": "system", "content": "You generate detailed narratives from text."},
        {"role": "user", "content": """Task: Generate a natural language description based on detected text and bounding boxes.

Detected Text & Locations:
- 'ONLY' appears at coordinates [0.334375, 0.47833333333333333, 0.01875, 0.04055555555555555].
- 'STOP HERE ON RED' appears at coordinates [0.7277777777777777, 0.575, 0.02847222222222222, 0.06333333333333334].
- 'ONLY' appears at coordinates [0.43125, 0.4666666666666667, 0.021527777777777778, 0.042222222222222223].
- 'LIFE LifeCenter>' appears at coordinates [0.15069444444444444, 0.2288888888888889, 0.078125, 0.03944444444444444].

Description:"""}
    ]
# Use the new OpenAI API method
client = openai.OpenAI(api_key=userdata.get("OpenAiKey"))  # Initialize the new client

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.9
)


# Correct way to extract the generated response
print(response.choices[0].message.content)


NotFoundError: Error code: 404 - {'error': {'message': 'The model `ft:gpt-3.5-turbo-0125:rakshit::B3TzUmAj` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}