# **Exploring Export Data**

First things first: I want to actually explore the export data and figure out how to run a pipeline for getting everything I need!


# Setup

The cells below will set up the rest of the notebook.

I'll start by configuring the kernel:


In [None]:
# Change the working directory 
%cd ..

# Enable the autoreload extension, which automatically reloads modules when their code changes
%load_ext autoreload
%autoreload 2

Now I'll import some necessary modules:


In [None]:
# Standard library imports
import json
import zipfile
import os

# Third party imports
import pandas as pd

# Loading Data

First, I'm going to load in the data. Since the export is a `.zip` file, I'll unzip it:


In [None]:
# Declare the path to the export .zip file
export_zip_path = "./data/43f29f1d3bb293681d82b6ed415c1b2893d5ca12a4643f406316a0d1aaa1f3e8-2025-02-15-02-57-48-60b51c09ca924c7299b26640029988d7.zip"

# Declare the path where we'll extract the data
export_path = "./data/export"

# Resolve the extraction folder to an absolute path
extract_path = os.path.abspath(export_path)

# On Windows, the \\?\ prefix enables long paths; it isn't a valid
# prefix on other platforms, so only add it there
if os.name == "nt":
    extract_path = "\\\\?\\" + extract_path

# Extract the data from the export .zip file into a new folder
with zipfile.ZipFile(export_zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_path)

Alright - after taking a look at things, here are a few things I've noticed:

- the `dalle-generations/` folder contains all of the DALL-E pictures I've generated
- the main folder contains a TON of images - seems to be everything I've included in different chats
- there are also a couple of folders in the main folder, each containing audio chats that I've had!
- there's a huge `conversations.json` file that seems to contain... everything!
- there's a `sora.json` file that contains pointers to different Sora videos I've made
- there are `message_feedback.json` and `model_comparisons.json` files, too, which seem to contain some of the RLHF data I've given them. That's _really_ neat

That's way more data than I expected. Since the platform changes so often, I'm sure that I can't rely _too_ heavily on the way things are structured, so... I'll try to write code with that in mind.
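
Since the layout might shift between exports, one option is to take a quick inventory of whatever was actually extracted instead of hard-coding file names. Here's a minimal sketch of that idea; `summarize_export` is a hypothetical helper name (not part of any library), and it only counts files by extension:

```python
import os
from collections import Counter


def summarize_export(root: str) -> Counter:
    """Count the files under `root` by lowercase extension ('' if there is none)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext returns (stem, extension); the extension includes the dot
            extension = os.path.splitext(name)[1].lower()
            counts[extension] += 1
    return counts
```

Calling `summarize_export(extract_path).most_common()` should give a rough picture of how many images, audio files, and `.json` files a given export contains.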


# Understanding `conversations.json`

To start out, I'm going to look into `conversations.json`, since it's what I'm mainly interested in.


In [None]:
# Read the conversations.json file
conversations_path = os.path.join(extract_path, "conversations.json")
with open(conversations_path, "r", encoding="utf-8") as f:
    conversations = json.load(f)

# Create a DataFrame from the conversations data
conversations_df = pd.DataFrame(conversations)

# Print the length of the DataFrame
print(f"Loaded data for {len(conversations_df):,} conversations")

What does this data look like?


In [None]:
# Show a random conversation
sample_conversation = conversations_df.sample(1).iloc[0]
sample_conversation

It seems like the `mapping` field contains the conversation itself. There are a few other fields that look interesting, like `plugin_ids`, `gizmo_type`, `default_model_slug`, and `conversation_origin`. I'll try to figure out what they are below:

In [None]:
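for field_of_interest in [
    "plugin_ids",
    "gizmo_type",
    "default_model_slug",
    "conversation_origin",
]:
    # The export format changes often, so skip any field that isn't present
    if field_of_interest not in conversations_df.columns:
        print(f"{field_of_interest}: not present in this export\n")
        continue
    print(f"{conversations_df[field_of_interest].value_counts()}\n")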

After digging into `mapping` a bit more: it seems like the conversation structure is a tree, which makes sense - ChatGPT, despite *seeming* like a linear chat, supports branching whenever a message is revised.
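
To make the tree idea concrete, here's a tiny hand-made mapping. It uses the same `parent`, `children`, and `message` keys as the export, but the node IDs and message contents are invented for illustration. Every leaf ends exactly one branch, so counting leaves gives the number of branches (`count_branches` is a hypothetical helper name):

```python
# Toy mapping: one question ("q1") with two alternative answers, one of which
# continues with a follow-up question ("q2"). All IDs are made up.
toy_mapping = {
    "root": {"parent": None, "children": ["q1"], "message": None},
    "q1": {"parent": "root", "children": ["a1", "a1-retry"], "message": {"id": "q1"}},
    "a1": {"parent": "q1", "children": [], "message": {"id": "a1"}},
    "a1-retry": {"parent": "q1", "children": ["q2"], "message": {"id": "a1-retry"}},
    "q2": {"parent": "a1-retry", "children": [], "message": {"id": "q2"}},
}


def count_branches(mapping: dict) -> int:
    """Count the leaves (nodes without children); each one ends a single branch."""
    return sum(1 for node in mapping.values() if not node["children"])


print(count_branches(toy_mapping))  # 2
```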

Since I want to embed things with the OpenAI API, I'm going to be a bit lazy and create a "canonical" conversation. The easiest way to do this is to just extract the longest branch!

In [None]:
def extract_longest_conversation(mapping: dict) -> pd.DataFrame:
    """
    Extracts the longest "chain" of messages from a mapping of conversation nodes.

    Args:
        mapping (dict): A dictionary mapping node IDs to node information.

    Returns:
        pd.DataFrame: A DataFrame containing the messages in the longest chain,
            in root-to-leaf order.
    """

    # Find all root nodes (nodes with no parent)
    root_nodes = [
        node_id for node_id, node in mapping.items() if node.get("parent") is None
    ]

    # Walk every root-to-leaf path with an explicit stack (so deep conversations
    # can't hit the recursion limit), keeping the path with the most messages
    longest_chain = []
    stack = [(root_id, []) for root_id in reversed(root_nodes)]
    while stack:
        node_id, chain = stack.pop()
        node = mapping[node_id]
        if node.get("message") is not None:  # Only include non-null messages
            chain = chain + [node["message"]]

        # Ignore child IDs that don't appear in the mapping
        children = [c for c in (node.get("children") or []) if c in mapping]

        # At a leaf, keep this chain if it's strictly longer than the best so far
        if not children and len(chain) > len(longest_chain):
            longest_chain = chain

        # Push children in reverse so the first child is explored first,
        # which means ties are resolved in favor of earlier branches
        stack.extend((child_id, chain) for child_id in reversed(children))

    # Create a DataFrame from the chain; it's already in conversation order,
    # so there's no need to sort by (possibly missing) create_time values
    messages_df = pd.DataFrame(longest_chain)

    # Add the "author_role" column (skipped when the chain is empty)
    if "author" in messages_df.columns:
        messages_df["author_role"] = messages_df["author"].apply(
            lambda x: x["role"] if x else None
        )

    # Return the DataFrame
    return messages_df

What does one of these DataFrames look like?

In [None]:
# Show the longest branch of the sample conversation
conversation_df = extract_longest_conversation(sample_conversation.mapping)
conversation_df
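
As a sanity check on the longest-branch idea, here's a minimal sketch that comes at it from the other direction: find the deepest leaf, then follow `parent` pointers back up to the root. `longest_branch_ids` is a hypothetical helper, not the function above. It assumes every `parent` ID exists in the mapping and that the structure really is a tree (no cycles), and it measures depth in nodes rather than in non-null messages, so the two approaches could disagree when branches contain empty nodes.

```python
def longest_branch_ids(mapping: dict) -> list:
    """Return the node IDs along the deepest root-to-leaf path, root first."""
    depth_cache = {}

    def depth(node_id):
        # Climb until we reach a node with a known depth (or run past the root),
        # then fill in depths for everything we passed on the way up
        path = []
        current = node_id
        while current is not None and current not in depth_cache:
            path.append(current)
            current = mapping[current]["parent"]
        base = depth_cache[current] if current is not None else 0
        for offset, passed_id in enumerate(reversed(path), start=1):
            depth_cache[passed_id] = base + offset
        return depth_cache[node_id]

    # Leaves are the nodes without children
    leaves = [node_id for node_id, node in mapping.items() if not node["children"]]
    if not leaves:
        return []
    deepest_leaf = max(leaves, key=depth)

    # Follow parent pointers from the deepest leaf back to the root
    branch = []
    current = deepest_leaf
    while current is not None:
        branch.append(current)
        current = mapping[current]["parent"]
    return branch[::-1]


# Made-up example: "b" has two children, and only "d" continues further
example_mapping = {
    "a": {"parent": None, "children": ["b"]},
    "b": {"parent": "a", "children": ["c", "d"]},
    "c": {"parent": "b", "children": []},
    "d": {"parent": "b", "children": ["e"]},
    "e": {"parent": "d", "children": []},
}
print(longest_branch_ids(example_mapping))  # ['a', 'b', 'd', 'e']
```

Comparing the length of this ID list against the number of rows in the extracted DataFrame, across a handful of conversations, is a cheap way to spot branches that were missed.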