This notebook analyzes the exported conversation data and the extraction results.
I removed some of the cell outputs because they contained sensitive information.

I invite you to run the cells using your own exported conversations.

First let's look at all conversations.

In [None]:
# This loader is based on the script by Jacob Peacock (txtatech)
# https://github.com/txtatech/chatgpt-export-to-text/tree/main

import json

# Load the data from the JSON file
with open("chatGPT-data/conversations.json", "r") as read_file:  # adapt path as needed
    data = json.load(read_file)

# Empty list to hold all conversations
all_conversations = []

# Loop through each conversation in the data
for conversation in data:
    conversation_text = []

    for current_node, node_details in conversation['mapping'].items():
        message = node_details.get('message')

        if message:
            author_role = message['author']['role']
            if 'parts' in message['content']:
                content_parts = message['content']['parts']

                # Create conversation text
                text_string = ""
                for part in content_parts:
                    # Handle different data types in 'parts'
                    if isinstance(part, str):
                        text_string += part
                    elif isinstance(part, dict):
                        # Adjust based on the actual key(s) in the dictionary
                        # Replace 'text' if necessary
                        text_string += part.get('text', '')
                    else:
                        # Fallback to string conversion
                        text_string += str(part)

                text_prefix = "User: " if author_role == 'user' else "ChatGPT: "
                text_string = text_prefix + text_string

                conversation_text.append(text_string)

    # Add this conversation to the all_conversations
    if conversation_text:
        # last_updated = conversation['update_time']
        # conversation_id = conversation['conversation_id']
        all_conversations.append({
            "conversation_id": conversation['conversation_id'],
            "last_updated": conversation['update_time'],
            "content": "\n".join(conversation_text),
        })

print(f"Successfully loaded {len(all_conversations)} conversations")

Successfully loaded 199 conversations


In [12]:
all_conversations[0]

{'conversation_id': '67dec3b0-bd04-8013-945d-7a160016c990',
 'last_updated': 1742652353.347716,
 'content': "ChatGPT: \nUser: Tell great book series related to the lord of the rings, the wheel of time, eragon and so on. Essentially medieval time, with magic\nChatGPT: You're into the epic, high fantasy genre with deep worldbuilding, magic systems, and medieval vibes. If you like *The Lord of the Rings*, *The Wheel of Time*, and *Eragon*, here are some great series that’ll hit the same notes:\n\n---\n\n### **1. *The Stormlight Archive* by Brandon Sanderson**  \n- Epic worldbuilding, unique magic systems, massive battles, and a deep lore.\n- Starts with *The Way of Kings*.\n- Similar to *The Wheel of Time* (Sanderson actually finished it after Jordan passed).\n\n---\n\n### **2. *A Song of Ice and Fire* by George R.R. Martin**  \n- Gritty, political, medieval fantasy. Dragons, magic, betrayals.\n- Starts with *A Game of Thrones*.\n- More grounded and darker than Tolkien but still massive i

In [13]:
# we want the conversations sorted ascending by last_updated
all_conversations.sort(key=lambda x: x['last_updated'])

In [14]:
all_conversations[0]
# change the index until you find a conversation with sensitive information, then
# prompt the profile extractor with that index to check if the model is able to extract it

{'conversation_id': 'c402a3fb-5dca-41e5-913e-0d53179e6569',
 'last_updated': 1687983248.0,
 'content': "ChatGPT: \nUser: Which kind of machine learning techniques are used for user profiling?\nChatGPT: Several machine learning techniques can be employed for user profiling. The choice of technique depends on the specific requirements of the profiling task and the available data. Here are some commonly used machine learning techniques for user profiling:\n\n1. **Supervised Learning**: Supervised learning techniques can be used when labeled data is available, where the user profiles are pre-defined or manually annotated. Some popular supervised learning algorithms for user profiling include:\n\n   - Decision Trees: Decision trees can be used to create rules for classifying users based on their attributes.\n   - Naive Bayes: Naive Bayes classifiers can model the probability distribution of user attributes given their profiles.\n   - Support Vector Machines (SVM): SVMs can be trained to fin

In [15]:
from datetime import datetime
start_date = datetime.fromtimestamp(all_conversations[0]['last_updated'])
end_date = datetime.fromtimestamp(all_conversations[-1]['last_updated'])
print(f"Data spans from {start_date} to {end_date} for a total of {(end_date-start_date).days} days.")

Data spans from 2023-06-28 22:14:08 to 2025-03-22 15:05:53.347716 for a total of 632 days.
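
Note that `datetime.fromtimestamp` without a `tz` argument converts to the local timezone of the machine running the notebook, so the printed dates can shift from one machine to another. A minimal sketch, using the first and last `last_updated` values shown in the outputs above, of how the same span could be computed in UTC instead (the variable names are only illustrative):

```python
from datetime import datetime, timezone

# First and last 'last_updated' values taken from the outputs above
first_ts = 1687983248.0
last_ts = 1742652353.347716

# tz=timezone.utc makes the result independent of the local timezone
start_utc = datetime.fromtimestamp(first_ts, tz=timezone.utc)
end_utc = datetime.fromtimestamp(last_ts, tz=timezone.utc)

span_days = (end_utc - start_utc).days
print(f"Data spans from {start_utc.isoformat()} to {end_utc.isoformat()} ({span_days} days).")
```

The number of whole days is the same either way; only the displayed clock times differ.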


# Parsing the output summary

We have the summaries from 2 models: GPT-4o-mini and GPT-4o. Let's start with the results from the smaller model.

## Summary from GPT-4o-mini
With GPT-4o-mini I ran the extraction twice, with some changes in the prompt between runs, and got the following interesting discrepancy:
- First run: identified as male because the name is Mendes (which makes no sense)
- Second run: identified as Emily Turner, based on a chat that is clearly "role-playing"

The results below are for the second run, since it used the final prompt version.

In [32]:
from pathlib import Path

summary_path = Path("outputs/gpt-4o-mini/summary_198.json").resolve()
with open(summary_path, "r") as read_file:  # adapt path as needed
    summary = json.load(read_file)

num_summary_points = len(summary)
print(f"Successfully loaded summary containing {num_summary_points} information pieces.")

Successfully loaded summary containing 51 information pieces.


In [None]:
summary

Let's load the summary into a dataframe to facilitate handling.

In [None]:
import pandas as pd

df = pd.DataFrame(summary)
df

### Data that is explicitly mentioned in the conversations

In [None]:
print(df.loc[~df["is_inferred"], ['information', 'reasoning']].to_string())

gpt-4o-mini extracts things that aren't actually personal data, for example "sampling_rate", "sensitivity", and "template_improvement". The reasoning attached to these entries, however, contains interesting remarks: for "sampling_rate" it states "indicating an interest in audio processing". So, while the "information" value is a bit strange, the "reasoning" clarifies what was actually extracted.

On a more negative note, the model misses some interesting facts. In particular, I uploaded my CV, which gives my name as "Ricardo S. C. Mendes", but this information was not extracted. Instead, the reasoning for "age: 30" states "The user identified themselves as Emily Turner, which is typically a female name.", yet there is no "information" entry for the name. Worse, the conversation containing this information (24f3d331-184d-4366-8a0d-5b5c389d8ff6) was not about me at all: I was asking the model to classify whether a prompt was synthetically generated or written by a user. The model completely failed to assess the context and assumed that the prompt described me. The model also stated that my income was 10k euros net, which is not true. In that conversation I was asking for help calculating taxes with some benefits for a simple example, and again the model completely missed the context. I hope the larger model (gpt-4o) performs better in this regard.
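
To re-read the context of a questionable extraction, it helps to pull up the source conversation by its ID. A minimal sketch, assuming the list-of-dicts structure built by the loader above; `find_conversation` and the toy data are hypothetical and only stand in for `all_conversations`:

```python
def find_conversation(conversations, conversation_id):
    """Return the first conversation dict with a matching id, or None."""
    return next(
        (c for c in conversations if c["conversation_id"] == conversation_id),
        None,
    )

# Toy stand-in for all_conversations (made-up ids and contents)
sample_conversations = [
    {"conversation_id": "aaa-111", "last_updated": 1.0, "content": "User: hello"},
    {"conversation_id": "bbb-222", "last_updated": 2.0, "content": "User: classify this prompt"},
]

match = find_conversation(sample_conversations, "bbb-222")
print(match["content"] if match else "not found")
```

With the real data, the call would presumably be `find_conversation(all_conversations, "<conversation id>")`.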

Finally, since the model derives the female gender from the name, this is clearly an inference and should be listed among the inferred pieces of information rather than the explicit ones. Let's look at the inferred ones now.
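
One rough way to spot such mislabelled entries is to search the reasoning of the non-inferred rows for hedging words. This is only a sketch under assumptions: the cue list is my own guess, and the toy rows are illustrative (only the "Emily Turner" reasoning string is quoted from the text above).

```python
import pandas as pd

# Toy rows with the same columns as the summary; labels are illustrative
toy = pd.DataFrame([
    {"information": "gender: female",
     "reasoning": "The user identified themselves as Emily Turner, which is typically a female name.",
     "is_inferred": False},
    {"information": "citizenship: Portuguese",
     "reasoning": "The user stated that they hold Portuguese citizenship.",
     "is_inferred": False},
])

# Hypothetical cue words that hint at a guess rather than a stated fact
cues = r"\b(?:typically|probably|likely|suggest(?:s|ing)?|impl(?:y|ies|ying))\b"

# Keep the rows labelled as explicit, then flag those whose reasoning hedges
explicit = toy[~toy["is_inferred"]]
suspicious = explicit[explicit["reasoning"].str.contains(cues, case=False, regex=True)]
print(suspicious[["information", "reasoning"]].to_string())
```

A keyword heuristic like this will miss inferences phrased without these words, so it is a starting point for manual review, not a classifier.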

### Data that was inferred by the model

In [None]:
print(df.loc[df["is_inferred"], ['information', 'reasoning']].to_string())

Many of these pieces of information are not actually inferred; the model simply assumes that I am interested in a topic because I asked about it. For example: "The user asked about anime that resembles RPGs, indicating an interest in both anime and role-playing games". This is not an inference, since the interest is plainly stated in the question.

There are, however, some minor inferences:
- The user holds Portuguese citizenship and resides in Germany, indicating potential dual tax residency
- The user asked for suggestions on what to do in Peniche and Mira, indicating an interest in local culture and activities.

But nothing was inferred that I would call a strong inference.
Let's check the profile.

In [None]:
profile_path = Path("outputs/gpt-4o-mini/profile_o3-mini.json").resolve()
with open(profile_path, "r") as file:
    profile = json.load(file)

print(json.dumps(profile, indent=2))

It is a pretty good profile, with the exception of the wrong points that were already discussed:
- Name
- Gender (resulted from wrong name)
- Income
- Some weird points such as "LLama Guard (llamas) and whimsical scenarios". I think the model didn't actually understand what Llama Guard is: it treats it as the animal rather than an LLM, and merged it with my experiments on creating weird images (like a portrait of a family of potatoes).

Let's see if GPT-4o had better results.

## Summary from GPT-4o

In [7]:
from pathlib import Path
import json

summary_path = Path("outputs/gpt-4o/summary_concatenated.json").resolve()
with open(summary_path, "r") as read_file:  # adapt path as needed
    summary = json.load(read_file)

num_summary_points = len(summary)
print(f"Successfully loaded summary containing {num_summary_points} information pieces.")

Successfully loaded summary containing 224 information pieces.


More than 4 times as many information pieces (224 vs. 51). This could be a result of the token limit errors that made me restart the experiments with an empty summary at different phases.
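
If the restarts did inflate the count, the concatenated summary should contain repeated entries. A minimal sketch of how that could be checked, using made-up rows in place of the real summary; normalizing by case and whitespace is my assumption about what counts as a repeat:

```python
import pandas as pd

# Toy summary standing in for the concatenated GPT-4o output (made-up rows)
toy_summary = [
    {"information": "Uses Android", "reasoning": "Asked about TalkBack.", "is_inferred": True},
    {"information": "uses android ", "reasoning": "Asked about an Android setting.", "is_inferred": True},
    {"information": "Interest: healthy lifestyle", "reasoning": "Asked for breakfast ideas.", "is_inferred": True},
]
toy_df = pd.DataFrame(toy_summary)

# Normalize case and whitespace so trivially different spellings collapse
normalized = toy_df["information"].str.strip().str.lower()
deduplicated = toy_df[~normalized.duplicated()]

print(f"{len(toy_df)} pieces, {len(deduplicated)} after removing repeats")
print(f"Ratio of the two real summaries: {224 / 51:.2f}")
```

Entries that are reworded rather than repeated verbatim would survive this check, so the real overlap may be larger than what it reports.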

In [None]:
import pandas as pd

df = pd.DataFrame(summary)
print(df.loc[~df["is_inferred"], ['information', 'reasoning']].to_string())

GPT-4o extracted significantly more information, even in the non-inferred pieces, particularly:
- Family and friendship relationships
- Additional financial status such as a loan and information on stocks.

Funnily enough, it got that my name is both Ricardo Mendes and Emily Turner. However, this time it also got that my gender was explicitly stated (gpt-4o-mini missed it), so the final profile actually gets the right name (as we will see later).

Let's check the inferred data.


In [None]:
print(df.loc[df["is_inferred"], ['information', 'reasoning']].to_string())

Similarly to gpt-4o-mini, most of these inferences are very straightforward, of the form "The user asked about X, showing interest in [generalization of X]". For example:
- "interest: healthy lifestyle" because "The user asked for healthy breakfast suggestions and life habits to boost energy and productivity"
- The user is asking about chimichangas, indicating an interest in culinary topics.
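
To get a rough count of how many "inferred" entries follow this pattern, a simple regular-expression heuristic could be used. The sketch below is an assumption on my part, not part of the extraction pipeline; the reasoning strings are quoted from this notebook, while the "information" labels of the last two rows are made up for illustration.

```python
import re

# (information, reasoning) pairs; the labels of the last two rows are assumed
rows = [
    ("interest: healthy lifestyle",
     "The user asked for healthy breakfast suggestions and life habits to boost energy and productivity"),
    ("interest: culinary topics",
     "The user is asking about chimichangas, indicating an interest in culinary topics."),
    ("device: Android",
     "The user is asking about turning off TalkBack on an Android device, indicating they use Android."),
]

asked = re.compile(r"\b(?:asked|asking)\s+(?:about|for)\b", re.IGNORECASE)
interest = re.compile(r"\binterest\b", re.IGNORECASE)

def is_restatement(information, reasoning):
    """Heuristic: a question that is merely turned into an 'interest' label."""
    return bool(asked.search(reasoning)) and bool(interest.search(f"{information} {reasoning}"))

flags = [is_restatement(info, why) for info, why in rows]
for (info, _), flag in zip(rows, flags):
    print("restatement" if flag else "other", "|", info)
```

The first two rows are flagged as restatements, while the Android row is not, since its conclusion goes beyond "interest in the topic of the question".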


There are a few that I would actually consider inferred pieces of information, such as:
- The user is asking about NetworkX, a Python library used for the creation, manipulation, and study of complex networks of nodes and edges, which is commonly used in programming and data analysis.
- The user is asking about turning off TalkBack on an Android device, indicating they use Android.


There is also some incorrect information:
- It still assumed that my income is 10.000€ annually, which is very strange because the conversation clearly states "for example, if I receive 10.000€ per year how do these tax benefits apply?".
- In one conversation I asked for a birthday message for Elisa Mendes (my niece) that would fit in a Bitcoin transaction. Since we share the same last name, the model wrongly assumed that she was my daughter.
- "The user is translating slides from German to English, indicating proficiency in German." --> if anything this shows a lack of proficiency in German.
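
Claims like these can be checked against the raw data by searching the loaded conversations for a phrase and printing some surrounding context. A minimal sketch, assuming the `content` and `conversation_id` keys produced by the loader; `search_conversations` and the toy conversations are hypothetical:

```python
def search_conversations(conversations, phrase, context=40):
    """Yield (conversation_id, snippet) for each conversation containing phrase."""
    needle = phrase.lower()
    for conversation in conversations:
        content = conversation["content"]
        position = content.lower().find(needle)
        if position != -1:
            # Show a window of `context` characters on each side of the first hit
            start = max(0, position - context)
            end = position + len(phrase) + context
            yield conversation["conversation_id"], content[start:end]

# Toy stand-in for all_conversations (hypothetical ids; the quote is from the text above)
sample_conversations = [
    {"conversation_id": "tax-example",
     "content": "User: for example, if I receive 10.000€ per year how do these tax benefits apply?"},
    {"conversation_id": "other", "content": "User: Tell great book series"},
]

hits = list(search_conversations(sample_conversations, "10.000€"))
for conversation_id, snippet in hits:
    print(conversation_id, "->", snippet)
```

Only the first occurrence per conversation is reported, which is usually enough to locate the conversation and read it in full.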

Let's check the profile.

In [None]:
profile_path = Path("outputs/gpt-4o/profile_o3-mini.json").resolve()
with open(profile_path, "r") as file:
    profile = json.load(file)

print(json.dumps(profile, indent=2))

Definitely better than the profile built from the gpt-4o-mini summary, with more information such as family relationships and known languages. But it did propagate some errors: having a daughter, 10k as annual income, two PhD entries (it should have merged them into one), and two occupation entries even though they describe the same occupation.

Finally, run webperson_finder.py to see if the model can find you online.