# Compare DreamBank versions

This notebook gives a quick overview of the datasets available in each version of [DreamBank](https://dreambank.net) [archived on Zenodo](https://doi.org/10.5281/zenodo.18131749), and of how the set of datasets differs across versions.

In [1]:
import os
import tarfile
from datetime import datetime, timezone
from itertools import combinations

import pandas as pd
import pooch

In [2]:
CACHE_DIR = pooch.os_cache("pooch").joinpath("krank").joinpath("dreambank")
REGISTRY = {
    "1": {
        "doi": "10.5281/zenodo.18131750",
        "url": "https://zenodo.org/records/18131750/files/dreambank.tar.xz?download=1",
        "hash": "md5:6ab629e9c13251d228db7ec1a93ffeb6",
    },
    "2": {
        "doi": "10.5281/zenodo.18159468",
        "url": "https://zenodo.org/records/18159468/files/dreambank.tar.xz?download=1",
        "hash": "md5:eb83bcb0828f9c8c248a5052b2ffc798",
    },
}

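Download every DreamBank version archived on Zenodo. `pooch.retrieve` caches each archive in `CACHE_DIR` and checks it against the MD5 hash listed in `REGISTRY`.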

In [3]:
fnames = {}
for version, info in REGISTRY.items():
    # pooch.retrieve returns the full local path of the cached archive.
    fnames[version] = pooch.retrieve(
        url=info["url"],
        known_hash=info["hash"],
        fname=f"dreambank_v{version}.tar.xz",
        path=CACHE_DIR,
        progressbar=True,
    )
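To make the role of `known_hash` more concrete, here is a minimal standard-library sketch of the kind of check an `md5:<hex>` string implies. It is not pooch's implementation; `md5_matches` is a hypothetical helper, and the file it checks is a throwaway temporary file rather than a real DreamBank archive.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_matches(path, known_hash, chunk_size=1 << 20):
    """Return True if the file's MD5 digest equals an 'md5:<hex>' string."""
    algorithm, _, expected = known_hash.partition(":")
    if algorithm != "md5":
        raise ValueError(f"unsupported hash algorithm: {algorithm}")
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


# Toy demonstration on a temporary file (assumption: not a real archive).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.bin"
    toy_path.write_bytes(b"hello")
    toy_hash = "md5:" + hashlib.md5(b"hello").hexdigest()
    toy_ok = md5_matches(toy_path, toy_hash)
    toy_bad = md5_matches(toy_path, "md5:" + "0" * 32)
```

The same helper could in principle be pointed at one of the cached paths in `fnames` together with the matching `REGISTRY` hash.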

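Extract the dataset IDs available in each version. The code below treats every directory inside an archive as one dataset and uses the directory name as its ID.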

In [4]:
dataset_ids = {}
for version, fname in fnames.items():
    datasets = set()
    with tarfile.open(fname, "r:xz") as tar:
        for member in tar.getmembers():
            if member.isdir() and member.name != ".":
                datasets.add(os.path.basename(member.name))
    dataset_ids[version] = datasets
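The directory-listing step can be tried without downloading anything. The sketch below builds a tiny `tar.xz` archive in memory and applies the same filter to it; `list_dir_names` and the toy archive layout are assumptions made for illustration, not the actual structure of the DreamBank archives.

```python
import io
import os
import tarfile


def list_dir_names(fileobj):
    """Collect the base names of all directory members in a tar.xz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:xz") as tar:
        return {
            os.path.basename(member.name)
            for member in tar.getmembers()
            if member.isdir() and member.name != "."
        }


# Build a toy archive in memory: two directories and one regular file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:xz") as tar:
    for name in ("alta", "angie"):
        info = tarfile.TarInfo(name)
        info.type = tarfile.DIRTYPE
        tar.addfile(info)
    data = b"dream text"
    info = tarfile.TarInfo("alta/dreams.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buffer.seek(0)
toy_ids = list_dir_names(buffer)  # the regular file is ignored
```

Note that this filter keeps every directory member, so an archive with nested directories would contribute the nested names as well.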

Print which datasets differ between each pair of versions. (The dataframe below gives an alternate view.)

In [5]:
for vi, vj in combinations(sorted(dataset_ids.keys()), 2):
    ids1 = dataset_ids[vi]
    ids2 = dataset_ids[vj]
    only_in_1 = ids1 - ids2
    only_in_2 = ids2 - ids1
    print(f"Comparing v{vi} and v{vj}:")
    print(f"\tIn v{vi} but not v{vj}:", *only_in_1, sep="\n\t\t- ")
    print(f"\tIn v{vj} but not v{vi}:", *only_in_2, sep="\n\t\t- ")

Comparing v1 and v2:
	In v1 but not v2:
		- pregnancy_abortion
	In v2 but not v1:
		- madeline0-childhood
		- cicogna
		- betty


Create a dataframe with all dataset IDs and all versions. Each row is a dataset ID and each column is a version. Cells are True if the dataset is in that version, False otherwise.

In [6]:
all_dataset_ids = set().union(*dataset_ids.values())
# Start with every cell False, then mark the datasets present in each version.
df = pd.DataFrame(
    False, index=sorted(all_dataset_ids), columns=list(REGISTRY), dtype=bool
)
for version, ids in dataset_ids.items():
    df.loc[sorted(ids), version] = True
df = (
    df
    .add_prefix("v")
    .rename_axis("dataset_id")
    .sort_index(axis="columns")
    .sort_index(axis="rows")
)
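As an alternative sketch, the same kind of membership table can be built in one step from a dictionary of boolean lists. The data and names here (`toy_ids`, `toy_table`) are made up for illustration and do not reflect the real DreamBank contents.

```python
import pandas as pd

# Toy data (assumption): version label -> set of dataset IDs.
toy_ids = {
    "1": {"alta", "angie"},
    "2": {"alta", "betty"},
}

# Row labels: every ID seen in any version, in sorted order.
all_ids = sorted(set().union(*toy_ids.values()))

# One boolean column per version: True where the ID is in that version.
toy_table = pd.DataFrame(
    {f"v{v}": [id_ in ids for id_ in all_ids] for v, ids in sorted(toy_ids.items())},
    index=all_ids,
).rename_axis("dataset_id")

# Summing a boolean column counts the datasets in that version.
toy_counts = toy_table.sum().to_dict()
```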

Display the total number of datasets in each version.

In [7]:
display(df.sum().to_frame(name="n_datasets"))

,n_datasets
v1,94
v2,96


Display all datasets and which version(s) they are included in.

In [8]:
with pd.option_context("display.max_rows", None):
    # Show membership as check and cross marks for easier scanning.
    display(df.replace({True: "✅", False: "❌"}))

dataset_id,v1,v2
alta,✅,✅
angie,✅,✅
arlie,✅,✅
b,✅,✅
b-baseline,✅,✅
b2,✅,✅
bay_area_girls_456,✅,✅
bay_area_girls_789,✅,✅
bea1,✅,✅
bea2,✅,✅
