# Extracting information from a shared chatGPT chat

For background, see this discussion comment: [Accessing OpenAI Chat Data and Shared Conversations](https://github.com/thorwhalen/oa/discussions/11#discussioncomment-11852719).

## Demo

In [64]:
from oa.chats import ChatDacc
from pprint import pprint
import pandas as pd

url = 'https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5'
dacc = ChatDacc(url)

### basic_turns_data

If you're just interested in the text of the conversation, along with minimal metadata (e.g. time, role, id), then `basic_turns_data` is what you want:

In [None]:
pprint(dacc.basic_turns_data)

[{'content': 'This conversation is meant to be used as an example, for '
             'testing, and/or for figuring out how to parse the html and json '
             'of a conversation. \n'
             '\n'
             "As such, we'd like to keep it short. \n"
             '\n'
             'Just say "Hello World!" back to me for now, and then in a second '
             'line write 10 random words. ',
  'id': '3b469b70-b069-4640-98af-5417491bb626',
  'role': 'user',
  'time': 1737020650.866734},
 {'content': 'Hello World!  \n'
             'apple, river, galaxy, chair, melody, shadow, cloud, puzzle, '
             'flame, whisper.',
  'id': '38ee4a3f-8487-4b35-9f92-ddee57c25d0a',
  'role': 'assistant',
  'time': 1737020652.436654},
 {'content': 'Now, so we get code blocks (and a second user query), write a '
             'python code block containing a hello world print.',
  'id': '01808c08-dc33-4932-bc12-64bd7e936760',
  'role': 'user',
  'time': 1737020700.676552},
 {'content': '``

If `pandas` is installed, you can get this directly as a pandas dataframe:

In [167]:
dacc.basic_turns_df

Unnamed: 0,id,role,content,time
0,3b469b70-b069-4640-98af-5417491bb626,user,This conversation is meant to be used as an ex...,2025-01-16 09:44:10
1,38ee4a3f-8487-4b35-9f92-ddee57c25d0a,assistant,"Hello World! \napple, river, galaxy, chair, m...",2025-01-16 09:44:12
2,01808c08-dc33-4932-bc12-64bd7e936760,user,"Now, so we get code blocks (and a second user ...",2025-01-16 09:45:00
3,be4486db-894f-4e6f-bd0a-22d9d2facf69,assistant,"```python\nprint(""Hello, World!"")\n```",2025-01-16 09:45:01
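The `time` values in `basic_turns_data` are Unix timestamps (seconds since epoch), while the dataframe shows them as datetimes. Here is a minimal standard-library sketch of that conversion. It assumes the displayed times are UTC (which matches the sample above); `sample_turns` and `epoch_to_utc_string` are names made up for this illustration, not part of `oa`.

```python
from datetime import datetime, timezone

# Two turns copied from the output above (content omitted for brevity)
sample_turns = [
    {'id': '3b469b70-b069-4640-98af-5417491bb626', 'role': 'user', 'time': 1737020650.866734},
    {'id': '38ee4a3f-8487-4b35-9f92-ddee57c25d0a', 'role': 'assistant', 'time': 1737020652.436654},
]


def epoch_to_utc_string(seconds: float) -> str:
    """Format a Unix timestamp as a UTC 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')


readable_times = [epoch_to_utc_string(turn['time']) for turn in sample_turns]
print(readable_times)
```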


### turns_data: All the turns and all the fields

There are more turns, and more fields per turn, than those shown above. 
The basic turns data only includes turns that have a non-null role and content. 
You'll find the full turns data (all the turns and all their fields) in the 
`turns_data` attribute. It's a dict whose keys are turn ids and whose values are the turn contents.

In [4]:
list(dacc.turns_data)

['adff303b-75cc-493c-b757-605adadb8e56',
 '1473f2d9-ba09-4cd7-90c4-1452898676de',
 '40c5ed53-2b82-4e38-a4ec-dabf8f589553',
 '3b469b70-b069-4640-98af-5417491bb626',
 '94469d89-ff37-48c1-bdab-44b27299d79b',
 '38ee4a3f-8487-4b35-9f92-ddee57c25d0a',
 '01808c08-dc33-4932-bc12-64bd7e936760',
 'be4486db-894f-4e6f-bd0a-22d9d2facf69']
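To make the relationship between the full turns data and the basic turns data concrete, here is a rough sketch of the kind of filtering involved. It is not the `oa` implementation: the turn structure is assumed from the examples in this notebook, and `toy_turns_data` and `to_basic_turn` are hypothetical names.

```python
# Toy stand-in for dacc.turns_data: turn id -> turn dict (structure assumed
# from the examples in this notebook; real turns carry many more fields).
toy_turns_data = {
    'root': {'id': 'root', 'children': ['u1']},  # no 'message' at all
    'u1': {
        'id': 'u1',
        'children': ['a1'],
        'message': {
            'author': {'role': 'user'},
            'content': {'content_type': 'text', 'parts': ['Hi']},
            'create_time': 1737020650.866734,
        },
    },
    'a1': {
        'id': 'a1',
        'children': [],
        'message': {
            'author': {'role': 'assistant'},
            'content': {'content_type': 'text', 'parts': ['']},  # empty content
            'create_time': 1737020652.436654,
        },
    },
}


def to_basic_turn(turn: dict):
    """Return a small {id, role, content, time} dict, or None if role or content is missing."""
    message = turn.get('message') or {}
    role = (message.get('author') or {}).get('role')
    parts = (message.get('content') or {}).get('parts') or []
    content = ''.join(part for part in parts if isinstance(part, str))
    if not role or not content:
        return None
    return {'id': turn['id'], 'role': role, 'content': content, 'time': message.get('create_time')}


basic_turns = [b for b in map(to_basic_turn, toy_turns_data.values()) if b is not None]
print(basic_turns)
```

Only the `u1` turn survives here: the root node has no message, and the assistant turn has empty content.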

To facilitate index-based access to turns, we have a `turns_data_keys` attribute in dacc:

In [178]:
dacc.turns_data_keys

['adff303b-75cc-493c-b757-605adadb8e56',
 '1473f2d9-ba09-4cd7-90c4-1452898676de',
 '40c5ed53-2b82-4e38-a4ec-dabf8f589553',
 '3b469b70-b069-4640-98af-5417491bb626',
 '94469d89-ff37-48c1-bdab-44b27299d79b',
 '38ee4a3f-8487-4b35-9f92-ddee57c25d0a',
 '01808c08-dc33-4932-bc12-64bd7e936760',
 'be4486db-894f-4e6f-bd0a-22d9d2facf69']

... which allows you to do this:

In [5]:
turn_data_for_index = lambda i: dacc.turns_data[dacc.turns_data_keys[i]]
pprint(turn_data_for_index(3))

{'children': ['94469d89-ff37-48c1-bdab-44b27299d79b'],
 'id': '3b469b70-b069-4640-98af-5417491bb626',
 'message': {'author': {'metadata': {}, 'role': 'user'},
             'content': {'content_type': 'text',
                         'parts': ['This conversation is meant to be used as '
                                   'an example, for testing, and/or for '
                                   'figuring out how to parse the html and '
                                   'json of a conversation. \n'
                                   '\n'
                                   "As such, we'd like to keep it short. \n"
                                   '\n'
                                   'Just say "Hello World!" back to me for '
                                   'now, and then in a second line write 10 '
                                   'random words. ']},
             'create_time': 1737020650.866734,
             'id': '3b469b70-b069-4640-98af-5417491bb626',
             'metadata': 

We have a `turns_data_ssot` attribute that lists the name, description, and an example for a number of (but not necessarily all) fields you can find in the turn data.

In [7]:
pd.DataFrame(index=dacc.turns_data_ssot.keys(), data=dacc.turns_data_ssot.values())

Unnamed: 0,description,example
continue_conversation_url,A URL that allows users to continue the conver...,https://chatgpt.com/share/6788d539-0f2c-8013-9...
conversation_id,A unique identifier for the entire conversation.,6788d539-0f2c-8013-9535-889bf344d7d5
create_time,The timestamp (in seconds since epoch) when th...,1737020729.060687
current_node,The unique identifier of the current node in t...,be4486db-894f-4e6f-bd0a-22d9d2facf69
default_model_slug,The identifier for the default model used in t...,gpt-4o
disabled_tool_ids,An array of tool identifiers that have been di...,[]
has_user_editable_context,A boolean indicating if the user can edit the ...,False
is_archived,A boolean indicating whether the conversation ...,False
is_better_metatags_enabled,A boolean indicating if better metatags are en...,True
is_indexable,A boolean indicating if this conversation can ...,False


You have a tool to extract data from the turns data:

In [6]:
t = dacc.extract_turns(
    {
        'id': 'id',
        'role': 'message.author.role',
        'weight': 'message.weight',
    }
)
pd.DataFrame(t)

Unnamed: 0,id,role,weight
0,adff303b-75cc-493c-b757-605adadb8e56,,
1,3b469b70-b069-4640-98af-5417491bb626,user,1.0
2,94469d89-ff37-48c1-bdab-44b27299d79b,assistant,1.0
3,38ee4a3f-8487-4b35-9f92-ddee57c25d0a,assistant,1.0
4,01808c08-dc33-4932-bc12-64bd7e936760,user,1.0
5,be4486db-894f-4e6f-bd0a-22d9d2facf69,assistant,1.0
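The mapping passed to `extract_turns` pairs an output field name with a dotted path into each turn. As a rough sketch of how such dotted-path extraction can work (an illustration only, not the `oa` implementation; `get_by_dotted_path`, `extract_fields` and `toy_turns` are hypothetical):

```python
from functools import reduce


def get_by_dotted_path(d: dict, dotted_path: str, default=None):
    """Follow 'a.b.c' through nested dicts, returning `default` if any step is missing."""

    def step(current, key):
        return current.get(key, default) if isinstance(current, dict) else default

    return reduce(step, dotted_path.split('.'), d)


def extract_fields(turns: dict, field_to_path: dict) -> list:
    """Build one flat row per turn, with one column per requested field."""
    return [
        {field: get_by_dotted_path(turn, path) for field, path in field_to_path.items()}
        for turn in turns.values()
    ]


toy_turns = {
    'n0': {'id': 'n0', 'message': None},  # a turn without a message
    'n1': {'id': 'n1', 'message': {'author': {'role': 'user'}, 'weight': 1.0}},
}
rows = extract_fields(
    toy_turns, {'id': 'id', 'role': 'message.author.role', 'weight': 'message.weight'}
)
print(rows)
```

Turns that lack a message simply get `None` for the missing fields, which is consistent with the empty `role` and `weight` cells in the first row of the table above.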


And there's also `metadata`. In the original json, the object it is taken from also contains the `mapping` and `linear_conversation` fields, which both hold the turns data: `mapping` keys the turns by id, and `linear_conversation` is just a list of the values.

In [194]:
dacc.metadata

{'title': 'Test Chat 1',
 'create_time': 1737020729.060687,
 'update_time': 1737020733.031014,
 'moderation_results': [],
 'current_node': 'be4486db-894f-4e6f-bd0a-22d9d2facf69',
 'conversation_id': '6788d539-0f2c-8013-9535-889bf344d7d5',
 'is_archived': False,
 'safe_urls': [],
 'default_model_slug': 'gpt-4o',
 'disabled_tool_ids': [],
 'is_public': True,
 'has_user_editable_context': False,
 'continue_conversation_url': 'https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5/continue',
 'moderation_state': {'has_been_moderated': False,
  'has_been_blocked': False,
  'has_been_accepted': False,
  'has_been_auto_blocked': False,
  'has_been_auto_moderated': False},
 'is_indexable': False,
 'is_better_metatags_enabled': True}

`metadata` also has a companion dict, `metadata_ssot`, that describes its fields:

In [227]:
pd.DataFrame(index=dacc.metadata_ssot.keys(), data=dacc.metadata_ssot.values())

Unnamed: 0,description,example
title,The title of the conversation.,Test Chat 1
create_time,The timestamp (in seconds since epoch) when th...,1737020729.060687
update_time,The timestamp (in seconds since epoch) when th...,1737020733.031014
moderation_results,An array that holds results from any moderatio...,[]
current_node,The unique identifier of the current node in t...,be4486db-894f-4e6f-bd0a-22d9d2facf69
conversation_id,A unique identifier for the entire conversation.,6788d539-0f2c-8013-9535-889bf344d7d5
is_archived,A boolean indicating whether the conversation ...,False
safe_urls,An array of URLs deemed safe within the contex...,[]
default_model_slug,The identifier for the default model used in t...,gpt-4o
disabled_tool_ids,An array of tool identifiers that have been di...,[]


### full_json_dict: The full json

And then there's where all of this came from: the base json that was extracted from the conversation's html:

In [27]:
list(dacc.full_json_dict)

['basename', 'future', 'isSpaMode', 'state']

This is a large nested json. The following is the path to the metadata (that contains the turns data etc.)

In [197]:
dacc.metadata_path

('state',
 'loaderData',
 'routes/share.$shareId.($action)',
 'serverResponse',
 'data')

In [200]:
from dol import path_get

data = path_get(dacc.full_json_dict, dacc.metadata_path)
list(data)

['title',
 'create_time',
 'update_time',
 'mapping',
 'moderation_results',
 'current_node',
 'conversation_id',
 'is_archived',
 'safe_urls',
 'default_model_slug',
 'disabled_tool_ids',
 'is_public',
 'linear_conversation',
 'has_user_editable_context',
 'continue_conversation_url',
 'moderation_state',
 'is_indexable',
 'is_better_metatags_enabled']
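If you're wondering what following such a key path amounts to, here's a minimal sketch using only the standard library. It's a simplification for illustration (it is not how `dol.path_get` is implemented, which has more options); `get_by_key_path` and `toy_json` are made-up names.

```python
from functools import reduce
from operator import getitem


def get_by_key_path(nested, key_path: tuple):
    """Index into `nested` with each key of `key_path` in turn."""
    return reduce(getitem, key_path, nested)


# A tiny stand-in for dacc.full_json_dict with the same nesting as above
toy_json = {
    'state': {
        'loaderData': {
            'routes/share.$shareId.($action)': {
                'serverResponse': {'data': {'title': 'Test Chat 1'}}
            }
        }
    }
}
toy_path = ('state', 'loaderData', 'routes/share.$shareId.($action)', 'serverResponse', 'data')

print(get_by_key_path(toy_json, toy_path))
```

Note that the path is a tuple of keys rather than a dot-separated string: one of the keys (`routes/share.$shareId.($action)`) itself contains dots, so splitting on `.` would break it.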

## Getting the urls

Sometimes, you want to extract some (or all) of the urls you have in a conversation. 
These urls could be quoted in the messages' text itself, but could also be somewhere else 
in the turns data (for example, when doing a chatGPT search, you have a list of "sources"). 

In [2]:
from oa import ChatDacc

url = 'https://chatgpt.com/share/67877101-6708-8013-ba9e-2e770186db58'  # url with some searches
dacc = ChatDacc(url)

In [3]:
len(dacc.basic_turns_df)

24

### url_paths

The `url_paths` attribute lists the paths to every leaf value in the turns data that is a string containing a url.

In [6]:
# Paths to the URLs identified in the chat (uses find_url_keys())
url_paths = dacc.url_paths
print(f"{len(url_paths)=}")
url_paths[:5]

len(url_paths)=204


['68148933-cbc2-43ef-9b96-6a8ff7a2159c.message.metadata.search_result_groups[0].entries[0].url',
 '68148933-cbc2-43ef-9b96-6a8ff7a2159c.message.metadata.search_result_groups[1].entries[0].url',
 '68148933-cbc2-43ef-9b96-6a8ff7a2159c.message.metadata.search_result_groups[2].entries[0].url',
 '68148933-cbc2-43ef-9b96-6a8ff7a2159c.message.metadata.search_result_groups[3].entries[0].url',
 '68148933-cbc2-43ef-9b96-6a8ff7a2159c.message.metadata.search_result_groups[4].entries[0].url']

### url_data(...)

In [None]:
all_urls = dacc.url_data() 
print(f"{len(all_urls)=}")
unique_urls = sorted(set(all_urls))
print(f"{len(unique_urls)=}")
t[1]

len(all_urls)=204
len(unique_urls)=132


'https://www.rwdigital.ca/blog/how-much-energy-do-google-search-and-chatgpt-use/'

In [42]:
unique_urls[:6]

['([AffMaven](https://affmaven.com/google-search-statistics/))',
 '([DemandSage](https://www.demandsage.com/chatgpt-statistics/))',
 '([DemandSage](https://www.demandsage.com/google-search-statistics/))',
 '([EIA](https://www.eia.gov/tools/faqs/faq.php?id=85&t=1))',
 '([Limited Systems](https://limited.systems/articles/google-search-vs-chatgpt-emissions/))',
 '([RW Digital](https://www.rwdigital.ca/blog/how-much-energy-do-google-search-and-chatgpt-use/))']
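As you can see, some of these values are not bare urls but markdown-style links wrapped in parentheses. If you want the label and the url separately, a small regular expression will do for strings shaped like the ones above. This is only a sketch: `MARKDOWN_LINK` and `markdown_links` are hypothetical names, and the pattern will not handle urls that themselves contain parentheses.

```python
import re

# [label](url) where the url starts with http(s) and has no spaces or ')'
MARKDOWN_LINK = re.compile(r'\[([^\]]+)\]\((https?://[^\s)]+)\)')


def markdown_links(text: str) -> list:
    """Return (label, url) pairs for every [label](url) found in `text`."""
    return MARKDOWN_LINK.findall(text)


sample = '([EIA](https://www.eia.gov/tools/faqs/faq.php?id=85&t=1))'
print(markdown_links(sample))
```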

By default, `dacc.url_data` uses `prior_levels_to_include=0, remove_chatgpt_utm=True`. 
If you want more information about each url, ask for `prior_levels_to_include=1`, or 
even more levels upward from where the url was found. 
Note that doing so may give you fewer items than before (though you won't lose any urls), because 
some levels contain more than one url.

Also, if you want to keep the `utm_source=chatgpt.com` query parameter in the urls, you can say so via 
`remove_chatgpt_utm=False`. 
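For reference, removing the `utm_source=chatgpt.com` tracking parameter from a url can be sketched with the standard library as follows. This is an illustration under my own assumptions, not necessarily what `oa` does internally; `strip_chatgpt_utm` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_chatgpt_utm(url: str) -> str:
    """Drop the utm_source=chatgpt.com query parameter, keeping all other parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if (key, value) != ('utm_source', 'chatgpt.com')
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_chatgpt_utm('https://www.eia.gov/tools/faqs/faq.php?id=85&t=1&utm_source=chatgpt.com'))
```

Parsing the query string (rather than doing a plain text replace) keeps the other parameters intact wherever the `utm_source` parameter appears.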

In [54]:
urls_in_context = dacc.url_data(prior_levels_to_include=1, remove_chatgpt_utm=False)
print(f"{len(urls_in_context)=}")  # note: some levels have several urls, so you'll get less items here than previously
urls_in_context[12]

len(urls_in_context)=199


{'matched_text': '\ue200cite\ue202turn0search7\ue201',
 'start_idx': 125,
 'end_idx': 144,
 'alt': '([Search Engine Land](https://searchengineland.com/calculating-the-carbon-footprint-of-a-google-search-16105?utm_source=chatgpt.com))',
 'prompt_text': None,
 'type': 'grouped_webpages',
 'items': [{'title': 'Calculating The Carbon Footprint Of A Google Search - Search Engine Land',
   'url': 'https://searchengineland.com/calculating-the-carbon-footprint-of-a-google-search-16105?utm_source=chatgpt.com',
   'pub_date': 1231718400,
   'snippet': 'Queries vary in degree of difficulty, but for the average query, the servers it touches each work on it for just a few thousandths of a second. ... this amounts to 0.0003 kWh of energy per search ...',
   'attribution_segments': None,
   'supporting_websites': [],
   'hue': None,
   'attributions': None,
   'attribution': 'Search Engine Land'}],
 'status': 'done',
 'error': None,
 'style': None}

### See what path patterns contain urls

In [63]:
from functools import partial
from oa.chats import find_url_keys
from dol import Pipe
import re 

url_paths = list(find_url_keys(dacc.turns_data))

replace_array_index_with_star = partial(re.sub, r'\[\d+\]', '[*]')
ignore_first_part_of_path = lambda x: '.'.join(x.split('.')[1:])
transform = Pipe(replace_array_index_with_star, ignore_first_part_of_path)
paths_containing_urls = sorted(set(map(transform, url_paths)))
paths_containing_urls

['message.metadata.content_references[*].alt',
 'message.metadata.content_references[*].items[*].thumbnail_url',
 'message.metadata.content_references[*].items[*].url',
 'message.metadata.content_references[*].sources[*].url',
 'message.metadata.search_result_groups[*].entries[*].url']

## Bootstrapping the parsing (and maintaining the parser)

In [None]:
from oa.chats import url_to_html, parse_shared_chatGPT_chat

url = 'https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5'
chat_html = url_to_html(url)
chat_json_dict = parse_shared_chatGPT_chat(chat_html)
list(chat_json_dict)

['basename', 'future', 'isSpaMode', 'state']

If you look at the signature of `parse_shared_chatGPT_chat`, you'll notice it's parametrized by several arguments you didn't have to specify.

```python
def parse_shared_chatGPT_chat(
    html: str,
    is_target_string: Union[str, Callable] = 'data.*mapping.*message',
    *,
    variable_name: str = 'window.__remixContext',
    json_pattern: str = '(\\{.*?\\});',
) -> dict:
    ...
```

In [29]:
from i2 import Sig

print(',\n'.join(str(Sig(parse_shared_chatGPT_chat)).split(',')))

(html: str,
 is_target_string: Union[str,
 Callable] = 'data.*mapping.*message',
 *,
 variable_name: str = 'window.__remixContext',
 json_pattern: str = '(\\{.*?\\});') -> dict


We did this so that we can easily maintain the parser's definition. 

At the time of writing this, we've noticed that the json we want is assigned to a 
variable named `window.__remixContext` and has fields `data`, `mapping`, and `message` in that order. 

But to be able to get that in the first place, we used `parse_shared_chatGPT_chat` with different parameters. 

We tried it with a chat we knew the contents of, and asked `parse_shared_chatGPT_chat` to find the json 
based on a substring we knew should be in there:

In [32]:
from oa.chats import url_to_html, parse_shared_chatGPT_chat

url = 'https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5'
string_contained_in_conversation = 'apple, river, galaxy, chair, melody, shadow, cloud, puzzle, flame, whisper'
chat_html = url_to_html(url)
chat_json_dict = parse_shared_chatGPT_chat(chat_html, is_target_string=string_contained_in_conversation)
list(chat_json_dict)

['basename', 'future', 'isSpaMode', 'state']
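Note that `is_target_string` is annotated as `Union[str, Callable]`: the default is a regex-like pattern describing the structure, whereas the bootstrapping call above passes a plain substring of the conversation. One plausible way to handle both kinds of input is to normalize them into a predicate on strings. The sketch below is my guess at the idea, not the actual `oa` code; `as_target_predicate` is a hypothetical helper, and treating the string as a regular expression is an assumption.

```python
import re
from typing import Callable, Union


def as_target_predicate(is_target_string: Union[str, Callable]) -> Callable:
    """Turn a str (treated as a regex here) or a callable into a predicate on strings."""
    if callable(is_target_string):
        return is_target_string
    pattern = re.compile(is_target_string, flags=re.DOTALL)
    return lambda text: pattern.search(text) is not None


candidate = '{"data": {"mapping": {"x": {"message": {"parts": ["apple, river"]}}}}}'

by_structure = as_target_predicate('data.*mapping.*message')  # when you know the layout
by_content = as_target_predicate('apple, river')  # when you only know some content
by_function = as_target_predicate(lambda text: len(text) > 1000)  # any custom rule

print(by_structure(candidate), by_content(candidate), by_function(candidate))
```

If a plain substring may contain regex special characters, wrapping it in `re.escape` first would be the safer choice under this assumption.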

Let's get a truncated version of this dict, to make sure it's not too big to study...

In [21]:
from lkj import truncate_dict_values
from pprint import pprint
from pyperclip import copy


t = truncate_dict_values(chat_json_dict)
list(t)

['basename', 'future', 'isSpaMode', 'state']

In [22]:
# save this truncated dict to a file
import tempfile
import json
from pathlib import Path

tmp_filepath = tempfile.mktemp()
Path


len(json.dumps(t))

330325

## Parsing out the json

This is probably the most fragile part of the process. 
It's the part that broke previous tools I found to solve my problem.

The approach I'm taking here is to pick a string that I know appears in the conversation, and use it to locate the json I'm looking for.

So what I did here is copy the html of a conversation and used chatGPT (o1) to help me out. 
[Here's the conversation](https://chatgpt.com/share/6788d964-9e60-8013-a2b3-e191d567c8ad).

In [6]:
# Bootstrapping (when you don't know the structure, but you do know of a string that is contained in the conversation)

from oa.chats import parse_shared_chatGPT_chat, url_to_html

url = 'https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5'
string_contained_in_conversation = 'apple, river, galaxy, chair, melody, shadow, cloud, puzzle, flame, whisper'

chat_html = url_to_html(url)
chat_json_dict = parse_shared_chatGPT_chat(chat_html, is_target_string=string_contained_in_conversation)
list(chat_json_dict)

['basename', 'future', 'isSpaMode', 'state']

In [33]:
# get a truncated version of the chat_json_dict (too big to study!)
from lkj import truncate_dict_values
import json 

truncated_json_dict = truncate_dict_values(chat_json_dict, max_list_size=1, max_string_size=90)
truncated_json_dict_string = json.dumps(truncated_json_dict, indent=2)
assert string_contained_in_conversation in truncated_json_dict_string, (
    "The string_contained_in_conversation is not in the truncated_json_dict_string. Increase the max_* parameters."
)
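To give an idea of what this kind of truncation does, here is a simplified sketch that recursively shortens lists and strings in a nested structure. It is only an illustration: `truncate_nested` is a hypothetical function and the real `lkj.truncate_dict_values` may behave differently in its details.

```python
def truncate_nested(obj, max_list_size=2, max_string_size=20):
    """Recursively keep at most `max_list_size` list items and `max_string_size` characters."""
    if isinstance(obj, dict):
        return {k: truncate_nested(v, max_list_size, max_string_size) for k, v in obj.items()}
    if isinstance(obj, list):
        return [truncate_nested(v, max_list_size, max_string_size) for v in obj[:max_list_size]]
    if isinstance(obj, str):
        return obj[:max_string_size]
    return obj  # numbers, booleans, None are left untouched


example = {'a': [1, 2, 3], 'b': 'x' * 50, 'c': {'d': ['yyyy', 'z']}}
print(truncate_nested(example, max_list_size=1, max_string_size=3))
```

This also shows why the assertion above is useful: if the limits are too tight, the string you're searching for may be cut off or dropped along with the list items that contained it.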

In [106]:
# Save to a file to go study it...

import tempfile
from pathlib import Path

tmp_filepath = tempfile.mktemp() + '.json'
Path(tmp_filepath).write_text(truncated_json_dict_string)
print(f"{len(truncated_json_dict_string)} characters written to {tmp_filepath}")

584063 characters written to /var/folders/mc/c070wfh51kxd9lft8dl74q1r0000gn/T/tmph5v0uvws.json


## Parsing out the conversation part of the json

In [36]:
from oa.chats import find_all_matching_paths_in_list_values
from dol import path_get

paths = find_all_matching_paths_in_list_values(truncated_json_dict, target_value=string_contained_in_conversation)
path = next(paths)
path

('state',
 'loaderData',
 'routes/share.$shareId.($action)',
 'serverResponse',
 'data',
 'mapping',
 '38ee4a3f-8487-4b35-9f92-ddee57c25d0a',
 'message',
 'content',
 'parts')

So now we know the path to where our target string can be found:

In [37]:
path_get(truncated_json_dict, path)

['Hello World!  \napple, river, galaxy, chair, melody, shadow, cloud, puzzle, flame, whisper.']
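For the curious, the path search itself can be sketched as a small recursive generator that walks the nested json and yields the key path of every list having a string item that contains the target. This is a simplified illustration, not the `oa` implementation; `find_paths` and `toy` are hypothetical names.

```python
from typing import Iterator


def find_paths(obj, target: str, path: tuple = ()) -> Iterator[tuple]:
    """Yield the key path of every list that has a string item containing `target`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        if any(isinstance(item, str) and target in item for item in obj):
            yield path
        # Lists can also hold dicts or lists, so keep searching inside them
        for index, item in enumerate(obj):
            yield from find_paths(item, target, path + (index,))


toy = {
    'data': {
        'mapping': {
            't1': {'message': {'content': {'parts': ['Hello World!\napple, river']}}},
            't2': {'message': None},
        }
    }
}
print(list(find_paths(toy, 'apple')))
```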

We can also look around it to get a better sense of the relevant json structure

In [38]:
path_get(truncated_json_dict, path[:-1])

{'content_type': 'text',
 'parts': ['Hello World!  \napple, river, galaxy, chair, melody, shadow, cloud, puzzle, flame, whisper.']}

In [39]:
path_get(truncated_json_dict, path[:-2])

{'id': '38ee4a3f-8487-4b35-9f92-ddee57c25d0a',
 'author': {'role': 'assistant', 'metadata': {}},
 'create_time': 1737020652.436654,
 'content': {'content_type': 'text',
  'parts': ['Hello World!  \napple, river, galaxy, chair, melody, shadow, cloud, puzzle, flame, whisper.']},
 'status': 'finished_successfully',
 'end_turn': True,
 'weight': 1,
 'metadata': {'finish_details': {'type': 'stop', 'stop_tokens': [200002]},
  'is_complete': True,
  'citations': [],
  'content_references': [],
  'message_type': None,
  'model_slug': 'gpt-4o',
  'default_model_slug': 'gpt-4o',
  'parent_id': '94469d89-ff37-48c1-bdab-44b27299d79b',
  'request_id': '902d2a5a0ed1e15c-MRS',
  'timestamp_': 'absolute',
  'shared_conversation_id': '6788d539-0f2c-8013-9535-889bf344d7d5'},
 'recipient': 'all'}

In [87]:
print(path[:-4])

('state', 'loaderData', 'routes/share.$shareId.($action)', 'serverResponse', 'data', 'mapping')


A wild guess.
`('state', 'loaderData', 'routes/share.$shareId.($action)', 'serverResponse', 'data', 'mapping')` is where the "turns" of the conversation are.

In [40]:
turns_path = ('state', 'loaderData', 'routes/share.$shareId.($action)', 'serverResponse', 'data', 'mapping')
turns_data = path_get(truncated_json_dict, turns_path)
list(turns_data)

['adff303b-75cc-493c-b757-605adadb8e56',
 '1473f2d9-ba09-4cd7-90c4-1452898676de',
 '40c5ed53-2b82-4e38-a4ec-dabf8f589553',
 '3b469b70-b069-4640-98af-5417491bb626',
 '94469d89-ff37-48c1-bdab-44b27299d79b',
 '38ee4a3f-8487-4b35-9f92-ddee57c25d0a',
 '01808c08-dc33-4932-bc12-64bd7e936760',
 'be4486db-894f-4e6f-bd0a-22d9d2facf69']

In [41]:
chat_data_path = turns_path[:-1]
list(path_get(truncated_json_dict, chat_data_path))

['title',
 'create_time',
 'update_time',
 'mapping',
 'moderation_results',
 'current_node',
 'conversation_id',
 'is_archived',
 'safe_urls',
 'default_model_slug',
 'disabled_tool_ids',
 'is_public',
 'linear_conversation',
 'has_user_editable_context',
 'continue_conversation_url',
 'moderation_state',
 'is_indexable',
 'is_better_metatags_enabled']

In [42]:
metadata = path_get(truncated_json_dict, chat_data_path)
metadata = {k: v for k, v in metadata.items() if k != 'mapping'}
metadata

{'title': 'Test Chat 1',
 'create_time': 1737020729.060687,
 'update_time': 1737020733.031014,
 'moderation_results': [],
 'current_node': 'be4486db-894f-4e6f-bd0a-22d9d2facf69',
 'conversation_id': '6788d539-0f2c-8013-9535-889bf344d7d5',
 'is_archived': False,
 'safe_urls': [],
 'default_model_slug': 'gpt-4o',
 'disabled_tool_ids': [],
 'is_public': True,
 'linear_conversation': [{'id': 'adff303b-75cc-493c-b757-605adadb8e56',
   'children': ['1473f2d9-ba09-4cd7-90c4-1452898676de']}],
 'has_user_editable_context': False,
 'continue_conversation_url': 'https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5/continue',
 'moderation_state': {'has_been_moderated': False,
  'has_been_blocked': False,
  'has_been_accepted': False,
  'has_been_auto_blocked': False,
  'has_been_auto_moderated': False},
 'is_indexable': False,
 'is_better_metatags_enabled': True}

## Make some documentation for this part of the chat json

### Merge several examples of turns data to get a fuller sense of the possible fields

In [2]:
urls = dict(
    simple_url='https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5',
    url_with_searches='https://chatgpt.com/share/67877101-6708-8013-ba9e-2e770186db58',
    url_with_image_gen='https://chatgpt.com/share/678a1339-d14c-8013-bfcb-288d367a9079',
)

# sharing_chat_with_image_upload_not_yet_supported_by_openai

In [13]:
from oa.chats import ChatDacc
from lkj import merge_dicts

# Note, the turns_data is a dict whose keys are "turn ids" and values are the metadata for that turn, so we want to merge those
chat_daccs = {
    k: ChatDacc(url) for k, url in urls.items()
}
turn_datas = {
    k: merge_dicts(*_dacc.turns_data.values()) for k, _dacc in chat_daccs.items()
}
merged_turn_data = merge_dicts(*turn_datas.values())
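The point of merging is that no single turn has every field, so combining many turns gives a fuller example. Here is a rough sketch of one way such a merge can work: nested dicts are merged recursively, and for other values the first non-empty one wins. These merge rules are my assumption for the illustration; `merge_nested` is a hypothetical function and `lkj.merge_dicts` may resolve conflicts differently.

```python
def merge_nested(*dicts: dict) -> dict:
    """Merge dicts recursively; for non-dict values, keep the first non-empty value seen."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_nested(merged[key], value)
            elif key not in merged or merged[key] in (None, [], {}, ''):
                merged[key] = value
    return merged


turn_a = {'id': '1', 'message': None}
turn_b = {'id': '2', 'message': {'author': {'role': 'user'}}}
turn_c = {'id': '3', 'message': {'author': {'name': 'dalle.text2im'}, 'weight': 1.0}}

print(merge_nested(turn_a, turn_b, turn_c))
```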

In [16]:
from pprint import pprint
import json
from lkj import truncate_dict_values
from dol import flatten_dict

truncated_merged_turn_data = truncate_dict_values(merged_turn_data)
flat_truncated_merged_turn_data = flatten_dict(truncated_merged_turn_data)

print(f"Number of first level fields: {len(truncated_merged_turn_data)}")
print(f"Number of leaf values: {len(flat_truncated_merged_turn_data)}")
print(f"Number of json characters: {len(json.dumps(truncated_merged_turn_data, indent=4))}")


Number of first level fields: 4
Number of leaf values: 42
Number of json characters: 4961


In [22]:
first_dacc = next(iter(chat_daccs.values()))
metadata_example = first_dacc.metadata
metadata_fields = list(metadata_example)
assert all(list(_dacc.metadata) == metadata_fields for _dacc in chat_daccs.values())

### Have AI guess what each field is for
(since we can't find any documentation for this)

In [23]:
from oa import prompt_json_function


mk_json_field_doc_ssot = prompt_json_function(
    """
You are a technical writer specialized in documenting JSON fields. 
Below is a JSON object. I'd like you to document each field, giving me a json
whose fields are the names of the fields and the values are the description and example.
Something like this: 
```{"message.author.role": {"description": ..., "example": ...}, "message.author.name": ...}```

                                              
The context is:
{context: just a general context}
                                              
Here's an example json object:

{example_json}
""",
    egress=lambda x: x['result'],
)

mk_json_field_doc_ssot

<function oa.tools.prompt_function.<locals>.ask_oa(example_json, *, context=' just a general context')>

Have AI get parameter descriptions for turn data

In [45]:
turns_data_ssot = mk_json_field_doc_ssot(
    flat_truncated_merged_turn_data, 
    context="The conversation 'turns' of the json object holding a shared chatGPT conversation."
)
# turns_data_ssot = turns_data_ssot['result']
print(turns_data_ssot)

{'id': {'description': 'A unique identifier for the conversation turn.', 'example': '1fc35aa7-6b7a-4dae-9838-ead52c6d4793'}, 'children': {'description': 'An array of child conversation turns, which can hold additional messages in the conversation thread.', 'example': '[]'}, 'message.id': {'description': 'A unique identifier for the message within the conversation turn.', 'example': '1fc35aa7-6b7a-4dae-9838-ead52c6d4793'}, 'message.author.role': {'description': 'The role of the author of the message (e.g., user, assistant).', 'example': 'assistant'}, 'message.author.metadata.real_author': {'description': 'Metadata indicating the real author or source of the message.', 'example': 'tool:web'}, 'message.author.name': {'description': 'The name of the author of the message.', 'example': 'dalle.text2im'}, 'message.content.content_type': {'description': 'The type of content in the message (e.g., text, image, etc.).', 'example': 'text'}, 'message.content.parts': {'description': 'An array contai

Have AI get parameter descriptions for metadata

In [1]:
from oa import ChatDacc

In [24]:
metadata_ssot = mk_json_field_doc_ssot(
    metadata_example,
    context="The metadata of the json object holding a shared chatGPT conversation."
)
print(metadata_ssot)  # then copy this and paste it into the chats.py file

{
  "title": {
    "description": "The title of the chat conversation.",
    "example": "Test Chat 1"
  },
  "create_time": {
    "description": "A timestamp indicating when the chat conversation was created, represented in Unix time format.",
    "example": 1737020729.060687
  },
  "update_time": {
    "description": "A timestamp indicating the last time the chat conversation was updated, represented in Unix time format.",
    "example": 1737020733.031014
  },
  "moderation_results": {
    "description": "An array holding the results of moderation checks applied to the conversation. If no moderation has taken place, this array will be empty.",
    "example": []
  },
  "current_node": {
    "description": "The unique identifier for the current state or node in the conversation flow, typically in UUID format.",
    "example": "be4486db-894f-4e6f-bd0a-22d9d2facf69"
  },
  "conversation_id": {
    "description": "A unique identifier for the conversation as a whole, typically in UUID forma

In [None]:
ChatDacc('https://chatgpt.com/share/6788d539-0f2c-8013-9535-889bf344d7d5').metadata

AttributeError: 'dict' object has no attribute 'metadata'

In [23]:
import pandas as pd

print(pd.DataFrame(index=turns_data_ssot.keys(), data=turns_data_ssot.values()).to_markdown())

|                                                       | description                                                                                                          | example                                                                                                                                                                                                      |
|:------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| id                                                    | A unique identifier for the conversation turn.                                                                       | 1fc35aa7-6b7a-4dae-9838-ead52c6d4793                   

In [110]:
turn_data_docs = mk_json_field_documentation(
    json.dumps(turns_data, indent=2), 
    context="the conversation 'turns' of the json object holding a shared chatGPT conversation"
)
print(turn_data_docs)

Here's the documentation for the provided JSON object, structured in a markdown table format:

| Field Name                                         | Description                                                                                     | Example Value                                                                         |
|---------------------------------------------------|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| `<Conversation ID>`                               | Unique identifier for each turn in the conversation.                                          | `adff303b-75cc-493c-b757-605adadb8e56`                                             |
| `id`                                              | The unique identifier for the specific message/turn.                                         | `adff303b-75cc-493c-b757-605adadb8e56`   

In [99]:
metadata_field_docs = mk_json_field_documentation(
    json.dumps(metadata, indent=2), 
    context="metadata of a shared chatGPT conversation"
)
print(metadata_field_docs)

Here's the documentation for each field in the provided JSON object, formatted as a markdown table:

| Field Name                                | Description                                                                                          | Example Value                                                      |
|-------------------------------------------|------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|
| `title`                                   | The title of the shared chat GPT conversation.                                                      | "Test Chat 1"                                                    |
| `create_time`                             | The timestamp when the conversation was created, represented in seconds since the Unix epoch.      | 1737020729.060687                                                |
| `update_time`                    

In [12]:
import re
import json
from typing import Callable, Optional

def raise_run_time_error(msg: str):
    raise RuntimeError(msg)

def extract_json_dict(
    string: str,
    object_filter: Callable,
    *,
    decoder: json.JSONDecoder = json.JSONDecoder(),
    not_found_callback: Optional[
        Callable
    ] = lambda string, object_filter: raise_run_time_error(
        "Object not found in string"
    ),
) -> dict:
    """
    Searches `string` from the beginning, attempting to decode consecutive
    JSON objects using `decoder.raw_decode`. When an object satisfies
    `object_filter`, returns it as a Python dictionary.
    """
    pos = 0
    while pos < len(string):
        try:
            obj, pos = decoder.raw_decode(string, pos)
            if object_filter(obj):
                return obj
        except json.JSONDecodeError:
            pos += 1
    return not_found_callback(string, object_filter)

def parse_shared_chatGPT_chat(
    html: str, 
    string_contained_in_conversation: str
) -> dict:
    """
    Locates the big JSON structure assigned to window.__remixContext, 
    checks that the conversation portion includes `string_contained_in_conversation`,
    and returns the JSON as a dict.
    """
    # Regex pattern to capture the object assigned to window.__remixContext = {...};
    pattern = r"window\.__remixContext\s*=\s*(\{.*?\});"
    match = re.search(pattern, html, flags=re.DOTALL)
    if not match:
        raise RuntimeError("Could not locate the JSON assigned to window.__remixContext")

    # Extract the raw JSON text (removing the trailing semicolon if needed)
    raw_json_text = match.group(1).strip()

    # We'll define a filter that checks if the conversation data includes the target string.
    # Because the JSON is quite large, we can simply check if the substring is present in the raw text:
    # but if we want to be more precise, we can parse first and only return if there's conversation data.
    def conversation_filter(obj):
        # Convert to string once more, or deeply check "routes/share.$shareId.($action)" for the substring.
        # For simplicity, do a textual check on the entire string representation. 
        # If needed, we can refine to parse a specific location in obj.
        return string_contained_in_conversation in json.dumps(obj, ensure_ascii=False)

    # Now parse the JSON, returning only if the filter passes
    extracted_dict = extract_json_dict(
        raw_json_text,
        object_filter=conversation_filter,
    )

    return extracted_dict


t = parse_shared_chatGPT_chat(chat_html, string_contained_in_conversation)
list(t)

['basename', 'future', 'isSpaMode', 'state']