# Converting Homestuck Collection data to an agnostic dataset

This notebook walks through every step of taking data from the [Unofficial Homestuck Collection](https://bambosh.dev/unofficial-homestuck-collection/)'s asset pack and converting it into a dataset that can be used to train machine learning models.

Some planned uses for this are:
- Summarization
- Style transfer/LoRAs
- Chatbots

# Constants and Imports

In [7]:
"""
IMPORTS
Put all at the beginning because I hate notebooks so much
"""
import os
import pandas as pd
import numpy as np
import json
import re
import random
import shutil
from dotenv import load_dotenv

In [28]:
"""
CONSTANTS
This will extract constants from the env variables set in the .env file 
and make them accessible to the notebook
"""
print("Loading variables from .env file\n...")
load_dotenv()

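ASSET_PACK_FOLDER = os.getenv("ASSET_PACK_FOLDER")
OUTPUT_FOLDER = os.getenv("OUTPUT_FOLDER")
# Fail early with a clear message; otherwise os.path.join below raises a TypeError on None
if not ASSET_PACK_FOLDER or not OUTPUT_FOLDER:
    raise RuntimeError("ASSET_PACK_FOLDER and OUTPUT_FOLDER must be set in the .env file")
print("Loaded variables successfully")
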
print("Loading constants\n...")
# Relevant folders and files
DATA_FOLDER = os.path.join(ASSET_PACK_FOLDER, "archive/data")

# Holds all the text in MS Paint Adventures, including Homestuck
MSPA_TEXT_JSON = os.path.join(DATA_FOLDER, "mspa.json")
# Holds the text for news posts
NEWS_JSON = os.path.join(DATA_FOLDER, "news.json")
# Holds the text for social media posts
SOCIAL_JSON = os.path.join(DATA_FOLDER, "social.json")
# Most of this is irrelevant, but holds images for additional Hussie comics
# such as Team Special Olympics
ADDITIONAL_COMICS_JSON = os.path.join(DATA_FOLDER, "comics.json")

# Panel text transcripts from readmspa.org and assembled by GiovanH
TRANSCRIPTS_URL = "https://raw.githubusercontent.com/GiovanH/tuhc-readmspa/master/data/transcripts.json"

# Commentary transcripts by Bambosh, Makin and Drew
COMMENTARY_URL = "https://raw.githubusercontent.com/GiovanH/tuhc-commentary/master/src/commentary.json"

print("Loaded constants successfully")

Loading variables from .env file
...
Loaded variables successfully
Loading constants
...
Loaded constants successfully
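
The two URL constants above (`TRANSCRIPTS_URL` and `COMMENTARY_URL`) are defined but not fetched anywhere in this notebook yet. Below is a minimal sketch of how they could be downloaded using only the standard library. `fetch_json` is a hypothetical helper name, and nothing is assumed about the structure of the returned JSON.

```python
import json
import urllib.request


def fetch_json(url, timeout=30):
    """Download a URL and parse its body as JSON (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs network access, so it is left commented out here):
# transcripts = fetch_json(TRANSCRIPTS_URL)
# commentary = fetch_json(COMMENTARY_URL)
```

Caching the downloaded files inside `OUTPUT_FOLDER` would avoid hitting GitHub on every run, but that is left out of the sketch.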


# Extract MSPA data

First, we open MSPA_TEXT_JSON and load all the text data from it. This will be the main source of text for our dataset.

In [22]:
"""
MSPA_TEXT_JSON format:
{
  "story": {  # The text in MS Paint Adventures, the comics themselves
    "000006": {
      "title": "Look for keyhole",
      "pageId": "000006",
      "timestamp": "1180921880",
      "flag": [],
      "media": [
        "/advimgs/jb/mspaintadventure04.gif",
        "/advimgs/jb/mspaintadventure04b.gif"
      ],
      "content": "",
      "next": [
        "000008"
      ],
      "previous": "000005",
      "theme": "retro"
    },
    "000009": {
      "title": "Loudly tell that guy to pick up key and try it on the door.",
      "pageId": "000009",
      "timestamp": "1180931172",
      "flag": [],
      "media": [
        "/advimgs/jb/mspaintadventure06.gif"
      ],
      "content": "Despite your bellowing, the man casually opens the door and leaves.",
      "next": [
        "000010"
      ],
      "previous": "000008",
      "theme": "retro"
    },
    ...
  },
  "ryanquest": {...},  # Additional Ryanquest comic
  "psExtras": {...}, # Bonus pages for Problem Sleuth
  "wv": {...}, # "Exile" Homestuck pages, should be processed just like the "story" pages
  "faqs": {
    "general": {
      "title": "General FAQ - MS Paint Adventures",
      "pageId": "general",
      "content": "..." # html
    },
    "new": {...},  # New reader guide
    "science": {...},  # Science FAQ
    "sales": {...}, # This one was probably not Hussie, so ignore
  },
  # Other keys are fully irrelevant
}
"""

def extract_mspa_data():
    with open(MSPA_TEXT_JSON, 'r', encoding='utf-8') as f:
        mspa_data = json.load(f)
    return mspa_data

mspa_data = extract_mspa_data()

"""
The JSON is structured in a way that makes it easy to extract the text data, but we can make it better.

Each image gets an accompanying JSON:
{
    "pageId": "000006",  # The unique identifier for the page
    "order": 0,  # Its position in the page (multipanels will have 0-n...)
    "type": "animated",  # "animated", "static"
    "textDescription": "...",  # For generating this, we can use image models and the text transcripts from readmspa
    "tags": [],  # Tags for the image (characters, locations, etc.); we can extract some from the character POV extension and image search
    "author": "Andrew Hussie"  # 99% of these will be Andrew, but very rarely we'll see external art ("Other") or art by known artists ("Adrienne Garcia")
}
This is a better format for the first ML dataset:
{
    "story": "Homestuck",
    "pageId": "001902",
    "title": "Enter name",
    "content": "...",
    "html_content": "...",
    "media": [
        {
            
        }
    ],
    "tags": [], # Character and other tags for the text depending on the type of content 
    "next": "001903",
    "next_title": "Try again.",
}

We'll have other datasets with things like the entirety of Hussie's text in one place, or just the images... we'll think about it
"""



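
As a rough illustration of the target format described in the cell above, here is a minimal sketch that maps one raw page to one record. Several things are assumptions rather than facts from the asset pack: that "content" holds HTML, that the first entry of "next" is the page to link to, that the story name is supplied by the caller, and that each media entry can start as a bare `{"src": ...}` object. `strip_html` and `to_record` are hypothetical helper names.

```python
import html
import re


def strip_html(text):
    """Crude HTML-to-text: <br> becomes a newline, other tags are dropped."""
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    return html.unescape(text).strip()


def to_record(page, pages, story_name):
    """Convert one raw page dict into the flatter ML-friendly record."""
    next_ids = page.get("next") or []
    # Assumption: when a page has several successors, keep only the first
    next_id = next_ids[0] if next_ids else None
    next_page = pages.get(next_id, {}) if next_id else {}
    raw_content = page.get("content", "")
    return {
        "story": story_name,  # Assumption: provided by the caller
        "pageId": page["pageId"],
        "title": page.get("title", ""),
        "content": strip_html(raw_content),
        "html_content": raw_content,
        # Assumption: media entries start with just the source path
        "media": [{"src": src} for src in page.get("media", [])],
        "tags": [],
        "next": next_id,
        "next_title": next_page.get("title"),
    }


# Usage with the data loaded earlier (story name chosen by the caller):
# pages = mspa_data["story"]
# records = [to_record(p, pages, "Homestuck") for p in pages.values()]
```

A regex-based stripper is enough for a first pass. A real HTML parser would be safer if the page text turns out to contain nested or malformed markup.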

# Transcribing panels

We have the panel images, the ReadMSPA transcripts of them, the point of view from the POV cam, and the title and text that accompany each panel. We also have some partial tagging information from the Homestuck Search engine. Combining all of that with a vision model, we might be able to extract descriptions without hallucinated details.

In [27]:
# Queries to send to the annotation model
batch_queries = []
system_prompt = """You are a professional image annotator.
Your current project is annotating the panels of the webcomic Homestuck. You need to write a textual description as well as a list of location and character tags.
Your input will be the image file itself, the source comic, the page title, the current POV characters and the OCR transcript of all the text in the page. For example:
{
    "src": "005624.gif",
    "title": "Jane: Reply",
    "pov_characters": "Jane Crocker",
    "transcript": "..."
}
This is an example output:
{
    "characters": ["Jane Crocker"],
    "locations": ["Land of Crypts and Helium", "Jane's House"],
    "description": "Jane Crocker stands in the middle of her room, next to her bed. Jane is wearing a gray shirt with a blue monster logo on it, as well as a blue skirt. The room contains posters of movies. Outside the window we can see the Land of Crypts and Helium, a gray planet with multicolored flowers. There's a text bubble with '...' pointing to her head."
}
You should write verbose descriptions that will be useful for people who can't see the image, as well as for training image models.
No talk; just go.
"""

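The input described in the system prompt can be assembled per image. The sketch below is one possible way to do it, not the notebook's final implementation: `build_queries` is a hypothetical helper, the POV characters are assumed to arrive as a list of names, and the transcript is assumed to already be a single string for the page. How the image itself is attached depends on the annotation model's API and is not covered here.

```python
import json
import posixpath


def build_queries(page, pov_characters, transcript):
    """Return one JSON-encoded query per media file on the page."""
    queries = []
    for src in page.get("media", []):
        query = {
            # Media paths use forward slashes, so posixpath is used on purpose
            "src": posixpath.basename(src),
            "title": page.get("title", ""),
            # Assumption: several POV characters are joined with commas
            "pov_characters": ", ".join(pov_characters),
            "transcript": transcript,
        }
        queries.append(json.dumps(query, ensure_ascii=False))
    return queries


# Usage: extend the batch defined above, one page at a time
# batch_queries.extend(build_queries(page, ["Jane Crocker"], "..."))
```
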
# Textual datasets

We output datasets for Homestuck, MS Paint Adventures as a whole, and all of Andrew Hussie's works.

The format is .jsonl (one JSON object per line).
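
A minimal sketch of writing and reading such a file with the standard library follows. `write_jsonl` and `read_jsonl` are hypothetical helper names, and the output file name in the usage comment is only an example.

```python
import json
import os


def write_jsonl(records, path):
    """Write an iterable of dicts to path, one JSON object per line."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII characters readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage (file name is an example only):
# write_jsonl(records, os.path.join(OUTPUT_FOLDER, "homestuck.jsonl"))
```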