## DreamBank

Convert DreamBank's raw HTML files into tabular format.

Processing steps for tabular export.

* Remove non-english datasets
* Parse dream number/ID and metadata from beginning of dream reports into their own columns
* Parse dream word counts from end of dream reports into its own column
* Parse dataset metadata into separate columns
* Replace newlines with single spaces in brief descriptions of dataset metadata

In [1]:
from datetime import datetime, timezone
import os
import re
from requests.exceptions import HTTPError
import tarfile

from bs4 import BeautifulSoup
import pandas as pd
import pooch
from tqdm import tqdm

## Load data

In [2]:
# If using a version of DreamBank HTML that has already been attached
# to a GitHub release, use Pooch to download/cache that file and get the filename from there.
# Otherwise it is assumed there is a new development version in `raw/` to work with.
RAW_URL = "https://github.com/remrama/dreambank/releases/download/v1.alpha2/dreambank.tar.xz"
RAW_HASH = "md5:6ab629e9c13251d228db7ec1a93ffeb6"
try:
    archive_fname = pooch.retrieve(RAW_URL, RAW_HASH)
except HTTPError as e:
    if str(e).startswith("404 Client Error: Not Found for url"):
        archive_fname = "./raw/output/dreambank.tar.xz"

In [3]:
# Get a list of all the datasets available in the archive.
datasets = []
with tarfile.open(archive_fname, "r:xz") as tar:
    for member in tar.getmembers():
        if member.isdir() and member.name != ".":
            datasets.append(os.path.basename(member.name))

# Drop non-english datasets.
for ds in datasets[:]:
    if "." in ds:
        datasets.remove(ds)
        print(f"Dropping non-English dataset: {ds}")

Dropping non-English dataset: german-f.de
Dropping non-English dataset: german-m.de
Dropping non-English dataset: vonuslar.de
Dropping non-English dataset: zurich-f.de
Dropping non-English dataset: zurich-m.de


## Define functions for HTML extraction

In [4]:
def extract_file_content(dataset: str, component: str) -> bytes:
    """Extracts the info.html, dreams.html, or moreinfo.html content for a given dataset from the archive."""
    assert component in {"dreams", "info", "moreinfo"}, f"Invalid component: {component}"
    fname = f"./{dataset}/{component}.html"
    with tarfile.open(archive_fname, "r:xz") as tar:
        with tar.extractfile(fname) as f:
            content = f.read()
    return content

In [5]:
def extract_dreams_from_html(dataset: str) -> list[dict[str, str]]:
    """Parse DreamBank HTML dreams page for a given dataset.
    
    * Every dream starts with a #-padded number (e.g., #1) followed by a space.
        It's not always just numbers. Sometimes it has letters (e.g., #111a) or dashes
        (e.g., #B-055 in Jasmine's dreams; I think this is because it's in the Jasmine-all
        dataset and the B-055 indicates it's from her "B" series).
    * Dreams may optionally have a parenthetical between the number and the dream text.
        This parenthetical may contain a date, title, age, or other information.
        There is no standard format for what is in here and it will vary by dataset.
        It's surrounding by single spaces. There might be parenthesis inside the parenthetical.
    * The dream text follows.
    * Every dream ends with a word count in parentheses (e.g., (123 words)).
    * Every dream starts with a number sign (#) followed by the dream number in the whole sequence.
        This is not necessarily a pure integer, as some dreams have letter suffixes (e.g., 111a).
        This is not alway a full range of row numbers. E.g., HVdC datasets go up to 500, but
        there are some missing dreams. The missing numbers are between 1-500.
        Sometimes it has strings in it, e.g., #F21-5 in Peruvian dreams.
        Sometimes it has sex and age in it, e.g., #89 (F, age 18) in West Coast dreams.
        There are sometimes sup-parantheticals within the main parenthetical, but fortunatelly
        they are always at the end of the main parenthetical (e.g., #1027 (2007-01-22 (15))),
        which makes them easier to find.
    * Dreams may optionally have a date in parentheses after the dream number.
        The date, if present, can be in a variety of formats, sometimes separated by slashes or dashes.
        Also a question mark is there sometimes to indicate uncertainty (e.g., 1985?).
        There is sometimes another value in a sub-parenthetical within the date.
        E.g., a number that I think represents age in Izzy's dreams (e.g., #1027 (2007-01-22 (15)),
        or a period title in Madeline's dreams (e.g., #0771 (2003-19-12 (Post-Grad))).
    * Some dreams have a title following the date, in brackets and quotes (e.g., ["Outlaws Hiding"]).
        I've seen this in some Barb Sanders dreams.
    * Barbara baseline always ends with [BL] at end of report.
    """
    content = extract_file_content(dataset, "dreams")
    soup = BeautifulSoup(content, "html.parser", from_encoding="ISO-8859-1")
    # Find all spans that do not have "comment" class labels.
    # Comments will already be present in the regular spans/dreams as bracketed content.
    dreams = []
    dream_spans = soup.find_all("span", style=False, class_=lambda x: x != "comment")
    for span in dream_spans:
        span_text = span.get_text(separator=" ", strip=True)
        # Extract the dream number (and potentially date) from beginning of string
        # Sometimes dream number is a string, like 111a (e.g., Alta)
        # Date is sometimes present if provided by dreamer
        # Dream ID is always present and represents the number of the dream in the whole sequence
        # PREFACE_PATTERN = r"^#(?P<dream_id>\S+) (?P<metadata>\(.+?\){1,2} )?"
        PREFACE_PATTERN = r"^#(?P<dream_id>\S+) (\((?P<metadata>.{1,30}?)\) )?"
        preface_match = re.match(PREFACE_PATTERN, span_text)
        assert preface_match is not None, f"Error parsing dream preface for dataset {dataset}, span text: {span_text}"
        # There is always _supposed_ to be a space bewteen last word and word count paranthetical,
        # but this isn't always the case. Eg, alta #49, bay_area_girls_456 #219-11.
        # It's always supposed to be a return line before the word count paranthetical,
        # but looks like it is also a space sometimes, particularly when there is a bracketed
        # statement right before it (e.g., [BL] (87 words)).
        # So instead of a newline preceding, we will just look for any whitespace _optionally_.
        PROLOGUE_PATTERN = r"(\s+)?\((?P<word_count>[0-9]+) words\)$"
        prologue_match = re.search(PROLOGUE_PATTERN, span_text)
        # prologue_match = re.search(PROLOGUE_PATTERN, span_text)
        assert prologue_match is not None, f"Error parsing dream prologue for dataset {dataset}, span text: {span_text}"
        # n_wc_matches = len(re.findall(r"[ \n]?\([0-9]+ words\)$", dream_and_wc_text))
        
        # Extract the id_match from the span text
        # dream_n = match_.group(1)  # The number of dream in the whole sequence
        # dream_date = match_.group(3)  # will be None if not found
        # # Remove the dream number (and potentially date) from the beginning of string
        # dream_and_wc_text = re.sub(r"^#([0-9]+) ((\(\S*\)) )?", "", span_text)
        # # Remove the word count from end of string
        # n_wc_matches = len(re.findall(r"[ \n]?\([0-9]+ words\)$", dream_and_wc_text))
        # assert n_wc_matches == 1, f"Found {n_wc_matches} WC match for dataset {dataset}, dream {dream_n} (expected 1)."
        # dream_text = re.sub(r"[ \n]?\([0-9]+ words\)$", "", dream_and_wc_text)
        # assert dream_n not in data, f"Unexpected duplicate dream number: {dream_n} in dataset {dataset}."
        
        # Remove extracted preface and prologue from span text to get dream text
        dreams.append(
            {
                "dataset": dataset,
                "dream_id": preface_match.group("dream_id"),
                "metadata": pd.NA if preface_match.group("metadata") is None else preface_match.group("metadata"),
                "word_count": int(prologue_match.group("word_count")),
                "dream_text": re.sub(PROLOGUE_PATTERN, "", re.sub(PREFACE_PATTERN, "", span_text)),
            }
        )
    # Make sure the correct number of dreams were extracted.
    # At the top of each page, DreamBank will say how many dreams are present in the
    # total dataset, as well as how many are displayed on the page. These, and the total
    # amount of dreams extracted, should all be the same.
    n_dreams_statement = soup.find("h4").find_next().get_text()
    n_dreams_total, n_dreams_displayed = map(int, re.findall(r"[0-9]+", n_dreams_statement))
    n_dreams_extracted = len(dreams)
    assert n_dreams_total == n_dreams_displayed == n_dreams_extracted
    return dreams


def extract_info_from_html(dataset: str, process_description: bool = True) -> dict[str, str]:
    """Parse DreamBank HTML info page for a given dataset into a dictionary.

    This is the little window that pops up if you hit "MORE INFO" on the dataset search page.
    The free text is same as what is on the Grid page, but the structured fields are not there.
    The structured fields are also present in moreinfo page.
    ```
    Dream series: Alta: a detailed dreamer
    Number of dreams: 422
    Year: 1985-1997
    Sex of the dreamer(s): female

    Alta is an adult woman who wrote down her dreams in the late 1980s and early 1990s, and added a few in 1997 when she called to offer the dreams to us. This series has not been heavily studied yet.
    ```
    * long_name (str): The dataset title.
    * n_dreams (int): The total number of dreams in the dataset.
    * timeframe (str): Provided year or timeframe of the dataset.
    * sex (str): The provided sex of the dreamer.
    * description (str): A long-form description of the dataset.

    Notes
    -----
    The more info (extended descriptions) are often very detailed with extensive HTML formatting.
    So for now I'm leaving them out. Best way to view them honestly is just to go to the link.
    And the link is the same for everyone so including the link in the dataset CSV is redundant.
    But for now a half solution is to add a column that clarifies if there is more info available online.

    Not every dataset has more info, but they all have a more info _page_. It just
    might say no more info is available.

    This is the little window that pops up whenever you hit "click here" in the initial
    info page. I think it only applies to dream series??
    The structured fields are duplicated here, but the free text from info is different than here.
    That's like a brief description and this is a detailed thing, sometimes with extensive character
    info and tables and stuff.
    Not everyone has a direct link to this page, but if they have no more info and you go
    directly to the link it will still have the structured fields, just say
    "Sorry, no additional info is available for this series."

    """
    content = extract_file_content(dataset, "info")
    soup = BeautifulSoup(content, "html.parser", from_encoding="ISO-8859-1")
    body = soup.find("body")
    long_name = body.find(string="Dream series:").next.get_text(strip=True)
    n_dreams = body.find(string="Number of dreams:").next.get_text(strip=True)
    timeframe = body.find(string="Year:").next.get_text(strip=True)
    sex = body.find(string="Sex of the dreamer(s):").next.get_text(strip=True)
    match_ = re.fullmatch(
        r"^.*Sex of the dreamer\(s\): (?:fe)?male\n\n\n?(?P<description>.*?)\s+(For the further analyses, click here.\n)?\[Back to search form\]\s+$",
        body.get_text(),
        flags=re.DOTALL,
    )
    assert match_ is not None, f"Error parsing info description for dataset {dataset}."
    description = match_.group("description")

    # Just like the info page, we need to extract the description from the rest of the text.
    # The search pattern is very similar, with just a slight difference in the trailing text.
    content = extract_file_content(dataset, "moreinfo")
    soup_ = BeautifulSoup(content, "html.parser", from_encoding="ISO-8859-1")
    body_ = soup_.find("body")
    long_name_ = body_.find(string="Dream series:").next.get_text(strip=True)
    n_dreams_ = body_.find(string="Number of dreams:").next.get_text(strip=True)
    timeframe_ = body_.find(string="Year:").next.get_text(strip=True)
    sex_ = body_.find(string="Sex of the dreamer(s):").next.get_text(strip=True)
    assert long_name == long_name_, f"Mismatch in long_name for dataset {dataset}."
    assert n_dreams == n_dreams_, f"Mismatch in n_dreams for dataset {dataset}."
    assert timeframe == timeframe_, f"Mismatch in timeframe for dataset {dataset}."
    assert sex == sex_, f"Mismatch in sex for dataset {dataset}."
    match__ = re.fullmatch(
        r"^.*Sex of the dreamer\(s\): (?:fe)?male\n\n\n?(?P<extended_description>.*?)\s+$",
        body_.get_text(),
        flags=re.DOTALL,
    )
    assert match__ is not None, f"Error parsing moreinfo for dataset {dataset}."
    extended_description = match__.group("extended_description")
    extended_description_available = extended_description == "Sorry, no additional info is available for this series."
    # Could really just check for "click here" in the info description to see if more info is available.

    if process_description:
        # Clean up whitespace in description.
        description = re.sub(r"\s+", " ", description)
        # Optionally replace click here with markdown style link to url.
        # url = f"https://dreambank.net/more_info.cgi?further=1&series={dataset}"
    # Long names sometimes accidentatly have extra spaces too. I think it's just Izzy
    # (e.g., "Izzy,  age 14"), but apply globally anyways.
    long_name = re.sub(r"\s+", " ", long_name)
    info = {
        "dataset": dataset,
        "sex": sex,
        "timeframe": timeframe,
        "n_dreams": int(n_dreams),
        "extended_description_available": extended_description_available,
        "long_name": long_name,
        "brief_description": description,
    }

    return info


## Process each dataset

In [6]:
extracted_info = []
extracted_dreams = []
for dataset in (pbar := tqdm(datasets, ncols=90)):
    pbar.set_description(f"Processing dataset {dataset}")
    extracted_info.append(extract_info_from_html(dataset))
    extracted_dreams.extend(extract_dreams_from_html(dataset))

Processing dataset west_coast_teens: 100%|████████████████| 89/89 [01:45<00:00,  1.18s/it]


In [7]:
datasets = pd.DataFrame.from_records(extracted_info)
dreams = pd.DataFrame.from_records(extracted_dreams)

In [8]:
# There will be a lot of duplicates because some dreams are subsets of others.
# But there shouldn't be any duplicates within datasets.
assert not dreams.duplicated().any()
assert not dreams.duplicated(subset=["dataset", "dream_id"]).any()
assert not dreams.drop(columns=["metadata"]).isna().any(axis=None)
assert not datasets.isna().any(axis=None)
assert not datasets.duplicated().any()
assert datasets["dataset"].is_unique

## Export

In [9]:
OUTDIR = "./output"
DATASETS_FNAME = "datasets.csv.xz"
DREAMS_FNAME = "dreams.csv.xz"
datasets_outpath = f"{OUTDIR}/{DATASETS_FNAME}"
dreams_outpath = f"{OUTDIR}/{DREAMS_FNAME}"
os.makedirs(OUTDIR, exist_ok=True)

TO_CSV_KWARGS = {
    "index": False,
    "na_rep": "N/A",
    "sep": ",",
    "mode": "x",  # Switch to `w` to overwrite existing file
    "encoding": "utf-8-sig",  # Include sig/BOM for better compatibility with Excel
    "compression": "xz",
    "lineterminator": "\n",
    "quoting": 2,  # 2 = csv.QUOTE_NONNUMERIC
    "quotechar": '"',
    "doublequote": True,
}
datasets.to_csv(datasets_outpath, **TO_CSV_KWARGS)
dreams.to_csv(dreams_outpath, **TO_CSV_KWARGS)

In [10]:
for fn in [datasets_outpath, dreams_outpath]:
    print(f"file: {os.path.basename(fn)}")
    print(f"size: {os.path.getsize(fn) / 1e6} MB")
    print(f"md5: {pooch.file_hash(fn, alg='md5')}")
    print(f"sha256: {pooch.file_hash(fn, alg='sha256')}")
    print(f"timestamp: {datetime.fromtimestamp(os.path.getmtime(fn), tz=timezone.utc).isoformat(timespec='seconds')}")
    print()

file: datasets.csv.xz
size: 0.013464 MB
md5: 1475582e2daa1da53920df50cb9fc98e
sha256: 41ef784f267fe815ecf193b4e0a0902dfbc7d21488d68951bd0e04a5027ba408
timestamp: 2025-12-29T22:33:10+00:00

file: dreams.csv.xz
size: 6.493452 MB
md5: 2dcab92f9d9515df174388babb5c9e5a
sha256: 17366792fcca7eca546660d30040bf49c1bb4139f04ac77fb50b1a67ade3c635
timestamp: 2025-12-29T22:33:27+00:00

