# Converting Homestuck Collection data to an agnostic dataset

This notebook has all the steps for taking data from the [Unofficial Homestuck Collection](https://bambosh.dev/unofficial-homestuck-collection/)'s asset pack and converting it into a dataset that can be used for training a machine learning model.

Some planned uses for this are:
- Summarization
- Style transfer/LoRAs
- Chatbots

# Constants and Imports

In [2]:
"""
IMPORTS
Put all at the beginning because I hate notebooks so much
"""
import os
import pandas as pd
import numpy as np
import json
import re
import random
import base64

from openai import OpenAI

from dotenv import load_dotenv

In [5]:

"""
CONSTANTS
This will extract constants from the env variables set in the .env file 
and make them accessible to the notebook
"""
print("Loading variables from .env file\n...")
load_dotenv()

ASSET_PACK_FOLDER = os.getenv("ASSET_PACK_FOLDER")
OUTPUT_FOLDER = os.getenv("OUTPUT_FOLDER")
OPENAI_API_KEY = os.getenv("OPENAI_API")
MODEL_ID = os.getenv("MODEL_ID")
print("Loaded variables successfully")
    
print("Loading constants\n...")
# Relevant folders and files

# Bespoke input files that aren't available elsewhere
INPUT_FOLDER = os.path.join(os.path.dirname(os.path.abspath('')), "input")

# Transcripts and commentary from ReadMSPA, assembled by Bambosh, Makin and Giovanh
MSPA_TRANSCRIPTS_FILE = os.path.join(INPUT_FOLDER, "transcripts.json")
MSPA_COMMENTARY_FILE = os.path.join(INPUT_FOLDER, "commentary.json")
# Panel tags from the Homestuck Search Engine
HSSE_TAGS_FILE = os.path.join(INPUT_FOLDER, "hsse_tags.json")
HSSE_SEARCH_FILE = os.path.join(INPUT_FOLDER, "hsse_search.json")

# POV cam data folder with txt files
POV_CAM_FOLDER = os.path.join(INPUT_FOLDER, "readable_timelines")

# Homestuck Collection's asset pack data folder
COLLECTION_DATA_FOLDER = os.path.join(ASSET_PACK_FOLDER, "archive/data")

# Holds all the text in MS Paint Adventures, including Homestuck
MSPA_TEXT_JSON = os.path.join(COLLECTION_DATA_FOLDER, "mspa.json")
# Holds the text for news posts
NEWS_JSON = os.path.join(COLLECTION_DATA_FOLDER, "news.json")
# Holds the text for social media posts
SOCIAL_JSON = os.path.join(COLLECTION_DATA_FOLDER, "social.json")
# Most of this is irrelevant, but holds images for additional Hussie comics
# such as Team Special Olympics
ADDITIONAL_COMICS_JSON = os.path.join(COLLECTION_DATA_FOLDER, "comics.json")
# Holds panels
PANELS_FOLDER = os.path.join(ASSET_PACK_FOLDER, "storyfiles")
HS_PANELS_FOLDER = os.path.join(PANELS_FOLDER, "hs2")

print("Loaded constants successfully")
      
openai_client = OpenAI(api_key=OPENAI_API_KEY)

print(f"OpenAI client loaded with model {MODEL_ID}")

Loading variables from .env file
...
Loaded variables successfully
Loading constants
...
Loaded constants successfully
OpenAI client loaded with model gpt-4o
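
The constants above are read with `os.getenv`, which returns `None` for anything missing from the .env file; `os.path.join(None, ...)` then fails with a `TypeError` that says nothing about which variable was missing. A minimal sketch of a stricter lookup follows. The helper name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a readable message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value


# Possible usage, after load_dotenv() has run:
# ASSET_PACK_FOLDER = require_env("ASSET_PACK_FOLDER")
# OUTPUT_FOLDER = require_env("OUTPUT_FOLDER")
```

Failing early keeps a missing path from surfacing much later, in the middle of processing.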


# Extract MSPA data from Asset Pack

First of all, we want to open the MSPA_TEXT_JSON and fetch all the text data from it. This will be the main source of text data for our dataset.

In [3]:
"""
MSPA_TEXT_JSON format:
{
  "story": {  # The text in MS Paint Adventures, the comics themselves
    "000006": {
      "title": "Look for keyhole",
      "pageId": "000006",
      "timestamp": "1180921880",
      "flag": [],
      "media": [
        "/advimgs/jb/mspaintadventure04.gif",
        "/advimgs/jb/mspaintadventure04b.gif"
      ],
      "content": "",
      "next": [
        "000008"
      ],
      "previous": "000005",
      "theme": "retro"
    },
    "000009": {
      "title": "Loudly tell that guy to pick up key and try it on the door.",
      "pageId": "000009",
      "timestamp": "1180931172",
      "flag": [],
      "media": [
        "/advimgs/jb/mspaintadventure06.gif"
      ],
      "content": "Despite your bellowing, the man casually opens the door and leaves.",
      "next": [
        "000010"
      ],
      "previous": "000008",
      "theme": "retro"
    },
  },...
  "ryanquest": {...},  # Additional Ryanquest comic
  "psExtras": {...}, # Bonus pages for Problem Sleuth
  "wv": {...}, # "Exile" Homestuck pages, should be processed just like the "story" pages
  "faqs": {
    "general": {
      "title": "General FAQ - MS Paint Adventures",
      "pageId": "general",
      "content": "..." # html
    },
    "new": {...},  # New reader guide
    "science": {...},  # Science FAQ
    "sales": {...}, # This one was probably not Hussie, so ignore
  },
  # Other keys are fully irrelevant
"""

def extract_mspa_data():
    with open(MSPA_TEXT_JSON, 'r', encoding='utf-8') as f:
        mspa_data = json.load(f)
    return mspa_data

mspa_data = extract_mspa_data()

"""
The JSON is structured in a way that makes it easy to extract the text data, but we can make it better.

For each image, an accompanying JSON:
{
    "pageId": "000006",  # The unique identifier for the page
    "order": 0,  # Its position in the page (multipanels will have 0-n...)
    "type": "animated",  # "animated", "static"
    "textDescription": "...",  # To generate this, we can use image models and the text transcripts from ReadMSPA
    "tags": [],  # Tags for the image (characters, locations, etc.); some can be extracted from the character POV extension and the image search
    "author": "Andrew Hussie"  # 99% of these will be Andrew, but very rarely we'll see external art ("Other") or art by known artists ("Adrienne Garcia")
}
This is a better format for the first ML dataset:
{
    "story": "Homestuck",
    "pageId": "001902",
    "title": "Enter name",
    "content": "...",
    "html_content": "...",
    "media": [
        {
            
        }
    ],
    "tags": [], # Character and other tags for the text depending on the type of content 
    "next": "001903",
    "next_title": "Try again.",
}

We'll have other datasets with things like the entirety of Hussie's text in one place, or just the images... we'll think about it
"""



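
As a rough illustration of the target format from the notes in this cell, here is a minimal sketch that maps one raw `story` entry onto the flatter record. Everything in it is an assumption rather than the final pipeline: the function name `to_page_record`, the crude regex-based HTML stripping, the `src`/`order` keys inside `media`, and the choice to keep only the first entry of `next` (the source stores a list, the target shows a single id). The sample pages are shortened or made up for the demonstration.

```python
import html
import re


def to_page_record(page: dict, pages: dict, story: str = "Homestuck") -> dict:
    """Map one raw page from mspa.json onto the flatter record described in this cell."""
    raw_html = page.get("content", "")
    # Crude HTML-to-text: turn <br> into newlines, drop other tags, unescape entities.
    text = re.sub(r"<br\s*/?>", "\n", raw_html, flags=re.IGNORECASE)
    text = html.unescape(re.sub(r"<[^>]+>", "", text)).strip()

    # "next" is a list in the source; this sketch keeps only the first target.
    next_ids = page.get("next") or []
    next_id = next_ids[0] if next_ids else None
    next_title = pages.get(next_id, {}).get("title") if next_id else None

    return {
        "story": story,
        "pageId": page["pageId"],
        "title": page.get("title", ""),
        "content": text,
        "html_content": raw_html,
        "media": [{"src": src, "order": i} for i, src in enumerate(page.get("media", []))],
        "tags": [],  # Filled in later from HSSE and the POV cam
        "next": next_id,
        "next_title": next_title,
    }


# Sample data for illustration only: content shortened, second title made up.
sample_pages = {
    "000009": {
        "title": "Loudly tell that guy to pick up key and try it on the door.",
        "pageId": "000009",
        "media": ["/advimgs/jb/mspaintadventure06.gif"],
        "content": "Despite your bellowing,<br>the man leaves.",
        "next": ["000010"],
    },
    "000010": {"title": "Follow him", "pageId": "000010", "media": [], "content": "", "next": []},
}
record = to_page_record(sample_pages["000009"], sample_pages)
print(record["content"])
```

Pages with several `next` targets would need a different representation than the single id used here.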

# Getting panel transcripts from ReadMSPA plugin

ReadMSPA's data (and its plugin from the collection) comes with transcripts of the text in every image, even if it has no descriptions of the images themselves. We can use that.

In [10]:
"""
The ReadMSPA data is... TODO
"""


# Getting panel tags from HSSE

The Homestuck Search Engine people tagged the panels of the first four acts with characters, locations and more. These tags will be extremely useful for image transcription.

In [11]:
"""
HSSE_TAGS_FILE and HSSE_SEARCH_FILE contain all of the Homestuck Search Engine's tagged data (only the first four acts, up to and including page 1988, and excluding some SWFs) in its own bespoke JSON format.
HSSE_TAGS_FILE is the simpler JSON file, with the tag definitions and which tags contain other tags:
```
 "definitions": {
    "0": {
      "_id": 0,
      "name": "Character",
      "children": [
        1,
        32,
        56,
        60,
        104,
        132,
        148,
        155,
        173,
        184,
        253
      ]
    },
    "1": {
      "_id": 1,
      "name": "Human",
      "children": [
        2,
        15
      ]
    },
    "2": {
      "_id": 2,
      "name": "Kid",
      "children": [
        3,
        10
      ]
    },
    "3": {
      "_id": 3,
      "name": "Beta Kid",
      "children": [
        4,
        5,
        7,
        9
      ]
    },
    "4": {
      "_id": 4,
      "name": "John Egbert",
      "children": []
    },
    ...
}

HSSE_SEARCH_FILE is the more complex json with the actual tags for each panel:
[
  {
    "_id": 0,
    "type": 0,
    "content": "https://www.homestuck.com/images/storyfiles/hs2/00001.gif",
    "thumbnail": "https://www.homestuck.com/images/storyfiles/hs2/00001.gif",
    "url": "https://homestuck.com/story/1",
    "tags": [
      1384,
      1385,
      391,
      321,
      4,
      749,
      801,
      1301,
      602,
      1192,
      711,
      1349
    ],
    "page": 1
  },
  {
    "_id": 1,
    "type": 0,
    "content": "https://www.homestuck.com/images/storyfiles/hs2/00002.gif",
    "thumbnail": "https://www.homestuck.com/images/storyfiles/hs2/00002.gif",
    "url": "https://homestuck.com/story/2",
    "tags": [
      1384,
      1385,
      391,
      321,
      4,
      1349,
      602
    ],
    "page": 2
  },
  ...
]
```
Our objective here is to combine the information so that, for each page, we'll have its human readable tags. 
"""

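
Combining the two files could look roughly like the sketch below: build an id-to-name lookup from the definitions, then collect the names for every panel and merge panels that share a page. The function name `tag_names_by_page` is hypothetical. The sketch also assumes that `definitions` is the object under the `"definitions"` key of HSSE_TAGS_FILE, that several search entries can point at the same page, and that unknown tag ids should be skipped rather than raise. The sample data only uses the two tag names visible in the excerpt in this cell.

```python
def tag_names_by_page(definitions: dict, search_entries: list) -> dict:
    """Return {page number: sorted list of tag names} from the two HSSE structures."""
    # Definition keys are strings ("4"), while panels reference tags as integers (4).
    id_to_name = {int(tag_id): tag["name"] for tag_id, tag in definitions.items()}
    pages: dict = {}
    for entry in search_entries:
        names = pages.setdefault(entry["page"], set())
        for tag_id in entry["tags"]:
            # Skip ids with no definition instead of failing on incomplete data.
            if tag_id in id_to_name:
                names.add(id_to_name[tag_id])
    return {page: sorted(names) for page, names in pages.items()}


# Illustrative sample in the shape shown in this cell; the tag assignments are made up.
definitions = {
    "3": {"_id": 3, "name": "Beta Kid", "children": [4, 5, 7, 9]},
    "4": {"_id": 4, "name": "John Egbert", "children": []},
}
search_entries = [
    {"_id": 0, "tags": [1384, 4], "page": 1},
    {"_id": 1, "tags": [4, 3], "page": 1},
    {"_id": 2, "tags": [1384], "page": 2},
]
page_tags = tag_names_by_page(definitions, search_entries)
print(page_tags)
```

The `children` lists are ignored here; walking them would also let us attach parent categories such as "Character" to each page.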

# Extracting character appearances from POV cam 

The POV cam extension for Homestuck tracks which characters are present on each page, and it covers the whole comic rather than stopping at page 1988. We can use it to extract character tags and partly make up for the lack of tags on the later pages.

In [13]:
"""
The data from the POV cam comes in many files named after each character, like "roxy.txt" and "rufioh.txt". The format is not meant to be easily parsable, but it shouldn't be too hard to extract the data and "invert" it, to get the characters that appear in each page and their "commands".

An example of the data (jade.txt):
```
Name: Jade
Colour: #4AC925
Image: jade.png
Group: Kids

Be created on meteor
3790-3791
3803
3807
3830-3831

Be sent to Earth
3840

Land on factory
3768-3769

Be adopted
3773-3775

Be taken on hunt with grandfather
Wander off with Bec
Find present
3029-3036
```
"""

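
A minimal sketch of the parse-and-invert step, based only on the excerpt in this cell. It assumes the first blank-line-separated block is the `Key: Value` header, and that every later block is one or more command lines followed by page numbers or `start-end` ranges. Real timeline files may contain constructs the excerpt does not show, so treat the names `parse_pov_file` and `characters_by_page` and the whole grammar as assumptions.

```python
import re
from collections import defaultdict

# A page line is either a single number ("3803") or an inclusive range ("3790-3791").
PAGE_LINE = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_pov_file(text: str):
    """Parse one character timeline into (header dict, {page: [commands]})."""
    header = {}
    pages = defaultdict(list)
    blocks = re.split(r"\n\s*\n", text.strip())
    for line in blocks[0].splitlines():
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    for block in blocks[1:]:
        commands, numbers = [], []
        for line in block.splitlines():
            line = line.strip()
            match = PAGE_LINE.match(line)
            if match:
                start = int(match.group(1))
                end = int(match.group(2) or start)
                numbers.extend(range(start, end + 1))
            elif line:
                commands.append(line)
        # Every command in the block applies to every page in the block.
        for number in numbers:
            pages[number].extend(commands)
    return header, dict(pages)


def characters_by_page(timelines: dict) -> dict:
    """Invert {character: {page: [commands]}} into {page: {character: [commands]}}."""
    result = defaultdict(dict)
    for name, pages in timelines.items():
        for page, commands in pages.items():
            result[page][name] = commands
    return dict(result)


SAMPLE = """Name: Jade
Colour: #4AC925
Image: jade.png
Group: Kids

Be created on meteor
3790-3791
3803

Be taken on hunt with grandfather
Wander off with Bec
Find present
3029-3036
"""

header, jade_pages = parse_pov_file(SAMPLE)
by_page = characters_by_page({header["Name"]: jade_pages})
print(by_page[3803])
```

Whether these page numbers line up with the `pageId` values in mspa.json still has to be checked before joining the two sources.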

# Transcribing panels

We have the panel images, their ReadMSPA transcripts, the point-of-view characters from the POV cam, and the title and text that accompany each panel. We also have some partial tagging information from the Homestuck Search Engine. With all of that and a vision model, we might be able to extract information without hallucinations.

In [7]:
# Queries to send to the annotation model
batch_queries = []
system_prompt =  """You are a professional image annotator.
Your current project is annotating the panels of the webcomic Homestuck. You need to write a textual description as well as a list of location and character tags.
Your input will be the image file itself, the source comic, the page title, the current POV characters and the OCR transcript of all the text in the page. For example:
{
    "src": "005624.gif",
    "title": "Jane: Reply",
    "pov_characters": ["Jane Crocker", "Caliborn"],
    "transcript": ["...", "HELP"]
}
This is an example output:
{
    "characters": ["Jane Crocker"],
    "locations": ["Land of Crypts and Helium", "Jane's House"],  # If you don't know the location, just leave it out
    "description": "Jane Crocker stands in the middle of her room, next to her bed. Jane is wearing a gray shirt with a blue monster logo on it, as well as a blue skirt. The room contains posters of movies. Outside the window we can see the Land of Crypts and Helium, a gray planet with multicolored flowers. There's a text bubble with '...' pointing to her head.",
}
You should write verbose descriptions that will be useful for people who can't see the image, as well as for training image models.
No talk; just go.
"""

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
    

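def annotate_panel(panel_data: dict, image_path: str) -> str:
    # Most panels are GIFs, so derive the MIME type from the extension
    # instead of always declaring PNG
    mime_type = "image/gif" if image_path.lower().endswith(".gif") else "image/png"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {
                "type": "text",
                "text": json.dumps(panel_data)
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{mime_type};base64,{encode_image(image_path)}"
                }
            }
        ]}
    ]

    response = openai_client.chat.completions.create(
        model=MODEL_ID,
        messages=messages,
        temperature=0.0,
    )
    # The reply is the raw text of the message (a JSON string), not a parsed dict
    return response.choices[0].message.content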

def get_panel_data(panel_id: str) -> dict:
    # We get metadata from a variety of sources
    # src: the image file name
    # title, page_content: from the json archive
    # pov_characters: from the POV extension
    # transcript: from the ReadMSPA transcripts
    pass
    

# We load a test panel, from HS_PANELS_FOLDER, 01691.gif which should depict a sleeping Rose and an awake John
test_panel = os.path.join(HS_PANELS_FOLDER, "01691.gif")
test_panel_data = {
    "src": "01691.gif",
    "title": "John: Get up.",
    "page_content": """Despite the pandemonium of your entrance, Rose is still sound asleep. She must be really tuckered out!
<br>
<br>It looks like this little guy is awake and ready for action though. He is adorable. You decide to name him Dr. Meowgon Spengler.""",
    "pov_characters": ["John Egbert", "Rose Lalonde"],
    "transcript": ["Z"],
}

# We'll use the OpenAI API to annotate the panel
# This is a test, so we'll just print the output
print(annotate_panel(test_panel_data, test_panel))

{
    "characters": ["John Egbert", "Rose Lalonde", "Dr. Meowgon Spengler"],
    "description": "John Egbert stands in a cluttered room, wearing a green suit with a blue tie. He is smiling and looking at a small black cat, which he has decided to name Dr. Meowgon Spengler. Rose Lalonde is asleep on the floor, with a 'Z' text bubble above her head, indicating she is sleeping. The room is messy, with various objects scattered around, including a laptop on a desk, a purple cube, and a red rocket-like object hanging from the ceiling. The wall in the background has the word 'MEOW' written repeatedly in purple.",
    "locations": ["Rose's House"]
}
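
The annotation comes back as text, so it still has to be parsed before it can go into the dataset. A minimal sketch of that step follows. It assumes the model sometimes wraps its JSON in a Markdown code fence and sometimes omits keys, as with `locations` in the prompt's instructions. The name `parse_annotation` is hypothetical.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_annotation(raw: str) -> dict:
    """Turn the model's text reply into a dict, tolerating a surrounding code fence."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with its optional language tag)...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rstrip()
        # ...and the closing fence.
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    data = json.loads(cleaned)  # Raises json.JSONDecodeError on malformed replies
    # Fill in keys the model left out so downstream code can rely on them.
    for key in ("characters", "locations"):
        data.setdefault(key, [])
    data.setdefault("description", "")
    return data


reply = '{"characters": ["John Egbert", "Rose Lalonde"], "description": "..."}'
annotation = parse_annotation(reply)
print(annotation["locations"])
```

Replies that fail to parse are probably best logged and retried rather than silently dropped, since each one costs an API call.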


# Textual datasets

We output datasets for Homestuck, MS Paint Adventures as a whole, and all of Andrew Hussie's works.

The output format is JSON Lines (.jsonl), with one JSON object per line.
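
A minimal sketch of writing and reading such a file. The helper names `write_jsonl` and `read_jsonl` are hypothetical, and the example writes to a temporary directory instead of OUTPUT_FOLDER so it can run on its own.

```python
import json
import os
import tempfile


def write_jsonl(records, path: str) -> int:
    """Write an iterable of dicts as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> list:
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


example_records = [
    {"story": "Homestuck", "pageId": "001902", "title": "Enter name"},
    {"story": "Homestuck", "pageId": "001903", "title": "Try again."},
]
example_path = os.path.join(tempfile.mkdtemp(), "homestuck.jsonl")
written = write_jsonl(example_records, example_path)
```

Because each line is independent, the file can be streamed or appended to without loading the whole dataset into memory.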