# **Datamining Conversation Text**
In this notebook, I'll explore ways to organize and mine the text of my conversations.

# Setup
The cells below will set up the rest of the notebook.

I'll start by configuring the kernel: 

In [None]:
# Move up one directory so the `utils` imports and `./data` paths below resolve
%cd ..

# Enable the autoreload extension, which will automatically load in new code as it's written
%load_ext autoreload
%autoreload 2

Now I'll import some necessary modules:

In [None]:
# General imports
from pathlib import Path
from typing import List
import json

# Third-party imports
from IPython.display import display, Markdown
from tqdm import tqdm
import pandas as pd
from pydantic import BaseModel, Field

# Project imports
import utils.openai as openai_utils
import utils.data_parsing as data_utils

# Loading Data
First off, I'll load in the conversations data:

In [None]:
# Declare the path to the export data folder
export_data_folder = "./data/export"

# Read the conversations.json file
conversations_path = f"{export_data_folder}/conversations.json"
with open(conversations_path, "r", encoding="utf-8") as f:
    conversations = json.load(f)

# Create a DataFrame from the conversations data
conversations_df = pd.DataFrame(conversations)

# Print the length of the DataFrame
print(f"Loaded data for {len(conversations_df):,} conversations")

# Summarizing Conversations
For each conversation in `conversations.json`, I'll have ChatGPT generate a 1-2 sentence summary and a few "tags" (the prompt below asks for 3-5).

I'll start by defining the system prompt and the output format:

In [None]:
# Define the system prompt
system_prompt = """
You're an intelligent AI assistant who likes responding in JSON. 

The user will provide you with a conversation between an AI chatbot and a user. 
Your task is to briefly - in 1-2 sentences - summarize the main topics covered within the conversation.
You'll also provide a list of "tags" - these are keywords / short phrases that characterize the conversation.
Tags ought to be lowercase, and relevant to the conversation content. Include anywhere between 3-5 tags. 
"""

# Define a "ConversationSummary" Pydantic model that will describe the output format 
class ConversationSummary(BaseModel):
    summary: str = Field(..., description="A 1-2 sentence summary of the conversation")
    tags: List[str] = Field(
        ..., description="A list of tags that describe the conversation"
    )
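
The system prompt asks for 3-5 lowercase tags, but the `ConversationSummary` model only checks that `tags` is a list of strings. Below is a minimal sketch of a post-hoc cleanup step, assuming I'd want to lowercase, de-duplicate, and cap the tags myself. `normalize_tags` is a hypothetical helper, not part of this project's `utils`:

```python
from typing import List


def normalize_tags(tags: List[str], max_tags: int = 5) -> List[str]:
    """Lowercase, strip, de-duplicate (keeping order), and cap the tag list."""
    seen = set()
    cleaned = []
    for tag in tags:
        tag = tag.strip().lower()
        # Skip empty strings and tags that were already kept
        if tag and tag not in seen:
            seen.add(tag)
            cleaned.append(tag)
    return cleaned[:max_tags]


normalized = normalize_tags(["Python", " pandas ", "python", "", "Data Cleaning"])
print(normalized)
```

This only enforces the upper bound. A summary that comes back with fewer than three tags would still need a retry or a manual check.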

Next, I'm going to generate the prompts for each of the conversations! Each conversation is stored as a tree of messages, so I'll extract the longest chain of messages from each tree (what I've deemed "the canonical conversation") and collect the results in a DataFrame. From there, I can make a prompt for each!

In [None]:
# Iterate through the rows of the conversations DataFrame and extract the longest conversation
simplified_conversation_df_records = []
for convo_row in tqdm(
    iterable=list(conversations_df.itertuples()),
    desc="Extracting conversations from JSON data",
):

    # Extract the longest conversation chain
    longest_convo_chain_df = data_utils.extract_longest_conversation_df(
        convo_row.mapping
    )

    # Filter out messages not from the assistant / user
    filtered_convo_chain_df = longest_convo_chain_df.query(
        "author_role=='assistant' | author_role=='user'"
    )

    # Add a record to the simplified conversation DataFrame
    simplified_conversation_df_records.append(
        {
            "conversation_id": convo_row.conversation_id,
            "title": convo_row.title,
            "raw_message_data": filtered_convo_chain_df.to_dict(orient="records"),
            "conversation_markdown": data_utils.extract_simple_conversation_markdown(
                filtered_convo_chain_df
            ),
        }
    )

# Create a DataFrame from the simplified conversation records
simplified_conversation_df = pd.DataFrame(simplified_conversation_df_records)
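
To make the "longest chain" idea concrete, here is a rough sketch of how it could be found in a conversation tree. This is not the code behind `data_utils.extract_longest_conversation_df`. It assumes each node in `mapping` is a dict with a `parent` id (or `None` for the root) and a list of `children` ids, and `longest_chain` is a hypothetical name:

```python
from typing import Dict, List, Optional


def longest_chain(mapping: Dict[str, dict]) -> List[str]:
    """Return the node ids along the longest root-to-leaf path, root first."""

    def depth(node_id: str) -> int:
        # Count the nodes between this node and the root (inclusive)
        count = 0
        current: Optional[str] = node_id
        while current is not None:
            count += 1
            current = mapping[current].get("parent")
        return count

    # Assumption: a leaf is any node with no children
    leaves = [nid for nid, node in mapping.items() if not node.get("children")]
    if not leaves:
        return []
    deepest = max(leaves, key=depth)

    # Walk back up from the deepest leaf, then reverse to get root-first order
    chain = []
    current: Optional[str] = deepest
    while current is not None:
        chain.append(current)
        current = mapping[current].get("parent")
    return chain[::-1]


toy_mapping = {
    "root": {"parent": None, "children": ["a"]},
    "a": {"parent": "root", "children": ["b1", "b2"]},
    "b1": {"parent": "a", "children": []},
    "b2": {"parent": "a", "children": ["c"]},
    "c": {"parent": "b2", "children": []},
}
print(longest_chain(toy_mapping))  # ['root', 'a', 'b2', 'c']
```

When two branches are equally deep, this sketch keeps whichever leaf appears first in the mapping. The project helper may break ties differently.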

What does this DataFrame look like?

In [None]:
# Show a sample of the DataFrame
simplified_conversation_df.sample(5)

What does the `conversation_markdown` look like?

In [None]:
# Select a random conversation from the DataFrame and show the Markdown representation
display(
    Markdown(simplified_conversation_df.sample(1).iloc[0].conversation_markdown[:2000])
)

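Now that I have a Markdown version of each conversation, I can generate the summaries! To keep the prompts short, I'll only send the first few thousand characters of each conversation.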

In [None]:
# Cap how many characters of each conversation are sent to the model
MAX_CHARS_PER_SUMMARY = 4_000

completions = openai_utils.generate_completions_in_parallel(
    message_format_pairs=[
        (
            [
                {
                    "role": "developer",
                    "content": [{"type": "text", "text": system_prompt}],
                },
                {
                    "role": "user",
                    "content": [
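                        {
                            "type": "text",
                            # Truncate long conversations; only add "..." when text was cut
                            "text": row.conversation_markdown[:MAX_CHARS_PER_SUMMARY]
                            + (
                                "..."
                                if len(row.conversation_markdown) > MAX_CHARS_PER_SUMMARY
                                else ""
                            ),
                        }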
                    ],
                },
            ],
            ConversationSummary,
        )
        for row in simplified_conversation_df.itertuples()
    ],
    max_parallel_requests=24,
    show_progress=True,
)

# Construct a list of conversation summaries
conversation_summaries = []
for completion in completions:
    try:
        conversation_summaries.append(completion.choices[0].message.parsed)
    except Exception as e:
        print(f"Error parsing completion: {e}")
        print(completion)
        conversation_summaries.append(None)

# Make a new DataFrame from the conversation summaries
summarized_conversations_df = simplified_conversation_df.copy()
summarized_conversations_df["conversation_summary"] = conversation_summaries

# Add the summary and tags to the DataFrame
summarized_conversations_df["summary"] = summarized_conversations_df[
    "conversation_summary"
].apply(lambda x: x.summary if x else None)
summarized_conversations_df["tags"] = summarized_conversations_df[
    "conversation_summary"
].apply(lambda x: x.tags if x else None)
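
The cell above assigns the summaries back to the DataFrame by position, so it relies on the completions coming back in the same order as the rows. Here is a minimal sketch of one way to get bounded, order-preserving parallelism with the standard library. It is not the implementation of `openai_utils.generate_completions_in_parallel`; `run_in_parallel` and `fake_summarize` are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_parallel(
    fn: Callable[[T], R], items: List[T], max_parallel_requests: int = 24
) -> List[Optional[R]]:
    """Apply fn to every item with bounded concurrency, preserving input order."""

    def safe_call(item: T) -> Optional[R]:
        # A failed request becomes None instead of aborting the whole batch
        try:
            return fn(item)
        except Exception as e:
            print(f"Request failed: {e}")
            return None

    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(safe_call, items))


def fake_summarize(text: str) -> str:
    # Hypothetical stand-in for a real API call
    if not text:
        raise ValueError("empty conversation")
    return text[:10]


results = run_in_parallel(fake_summarize, ["hello world, this is long", "", "short"])
print(results)
```

Failed items stay in the list as `None`, which keeps the results aligned with the input rows.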

What does this data look like?

In [None]:
# Print the field names & values for a sample conversation
sample_conversation = summarized_conversations_df.sample(1).iloc[0]

display(
    Markdown(data_utils.extract_summarized_conversation_markdown(sample_conversation))
)

### **Saving Data**
Below, I'll save the summarized conversations to a JSON file:

In [None]:
# Declare the folder path for the export data
parsed_export_data = "./data/parsed_export"
Path(parsed_export_data).mkdir(parents=True, exist_ok=True)

# Save a .json file with the summarized conversations
summarized_conversations_df.to_json(
    f"{parsed_export_data}/summarized_conversations.json", orient="records"
)

### **Loading Data**
If I'm reloading this notebook later on, I can run the following cell:

In [None]:
# Declare the folder path for the export data
parsed_export_data = "./data/parsed_export"

# Load the summarized conversations DataFrame
summarized_conversations_df = pd.read_json(
    f"{parsed_export_data}/summarized_conversations.json"
)

# Embedding Conversations
Next up, I'm going to embed each conversation with one of OpenAI's embedding models. Rather than embedding the full text, I'll embed a short string that combines each conversation's title, tags, and summary.

In [None]:
# Create a new DataFrame with the conversation embeddings
conversation_embs_df = summarized_conversations_df.copy()

# Generate embeddings for the conversations
embs = openai_utils.generate_embeddings_for_texts(
    text_list=[
        f"{row.title}\nTags: {', '.join(row.tags)}\nSummary: {row.summary}"
        for row in conversation_embs_df.itertuples()
    ],
)

# Add the embeddings to the DataFrame
conversation_embs_df["embedding"] = embs.tolist()
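
One caveat: rows whose summary failed to parse have `None` for `tags` and `summary`, and `', '.join(None)` raises a `TypeError`. Below is a small sketch of a more forgiving way to build the embedding text, assuming missing values should just become empty strings. `build_embedding_text` is a hypothetical helper, not something in `utils`:

```python
from typing import Any


def build_embedding_text(title: Any, tags: Any, summary: Any) -> str:
    """Combine title, tags, and summary into one string, tolerating missing values."""
    # Treat anything that isn't a list/tuple of tags (None, NaN, ...) as "no tags"
    tag_list = [str(tag) for tag in tags] if isinstance(tags, (list, tuple)) else []
    # Likewise, fall back to an empty string for a non-string title or summary
    title_text = title if isinstance(title, str) else ""
    summary_text = summary if isinstance(summary, str) else ""
    return f"{title_text}\nTags: {', '.join(tag_list)}\nSummary: {summary_text}"


print(build_embedding_text("Pandas help", ["python", "pandas"], "Fixing a merge."))
print(build_embedding_text("Untitled chat", None, None))
```

Another option would be to drop the failed rows before embedding, since a text with no tags or summary carries very little signal.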

### **Saving Data**
Below, I'll save just the `conversation_id` and `embedding` columns to a Parquet file:

In [None]:
# Declare the folder path for the export data
parsed_export_data = "./data/parsed_export"

# Save the conversation embeddings to a .parquet file
conversation_embs_df[["conversation_id", "embedding"]].to_parquet(
    f"{parsed_export_data}/conversation_embeddings.parquet"
)

### **Loading Data**
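If I'm reloading this notebook later on, I can run the following cell. Note that the reloaded DataFrame only has the `conversation_id` and `embedding` columns: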

In [None]:
# Declare the folder path for the export data
parsed_export_data = "./data/parsed_export"

# Load the conversation embeddings DataFrame
conversation_embs_df = pd.read_parquet(
    f"{parsed_export_data}/conversation_embeddings.parquet"
)
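
As an illustration of what the saved `(conversation_id, embedding)` pairs could be used for, here is a minimal sketch of a cosine-similarity lookup in plain Python. This goes beyond what the notebook does so far; `cosine_similarity` and `most_similar` are hypothetical helpers, and the toy vectors are made up for the example:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank (conversation_id, embedding) pairs by similarity to the query."""
    scored = [(cid, cosine_similarity(query, emb)) for cid, emb in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 2-dimensional "embeddings"; real ones have many more dimensions
toy_candidates = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top = most_similar([1.0, 0.0], toy_candidates, top_k=2)
print([cid for cid, _ in top])  # ['a', 'c']
```

With a full set of embeddings, a vectorized NumPy version would be much faster, but the logic is the same.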