# DreamBank

This notebook scrapes raw HTML dream reports (and their associated metadata) from [DreamBank](https://dreambank.net) and compresses them into an archive file for uploading to Zenodo.

The final archive includes an HTML file for the [grid page](https://dreambank.net/grid.cgi) and then 3 HTML files for each dataset. All dataset IDs are identified from the [grid page](https://dreambank.net/grid.cgi). Taking the Alta dataset as an example, the 3 HTML files will come from the [Alta info page](https://dreambank.net/more_info.cgi?alta), [Alta more info page](https://dreambank.net/more_info.cgi?series=alta&further=1), and [all Alta dreams page](https://dreambank.net/random_sample.cgi?series=alta) files per dataset.

The files are written as raw binary content in order to preserve the data in the rawest form possible. Dream reports and metadata are extracted from the raw HTML content in [prepare.ipynb](../prepare.ipynb), and datasets are curated from that in [krank](https://github.com/remrama/krank). See the [krank documentation](https://remrama.github.io/krank) for more accessible access points.

For output file size, checksums, and date of processing, see printout at the end of this notebook.

* Source: [dreambank.net](https://dreambank.net)
* Output: [Zenodo](https://doi.org/10.5281/zenodo.18131749)

If you use any of this data for publication, cite the original DreamBank paper:

> Domhoff, G. W., & Schneider, A. (2008). Studying dream content using the archive and search engine on DreamBank.net. _Consciousness and Cognition_, 17(4), 1238-1247. doi:[10.1016/j.concog.2008.06.010](https://doi.org/10.1016/j.concog.2008.06.010)

### Output file tree

```shell
dreambank.tar.xz
|
|--/grid.html                   # https://dreambank.net/grid.cgi
|
|--/<dataset_id>/dreams.html    # https://dreambank.net/random_sample.cgi?series=<dataset_id>
|--/<dataset_id>/info.html      # https://dreambank.net/more_info.cgi?series=<dataset_id>
|--/<dataset_id>/moreinfo.html  # https://dreambank.net/more_info.cgi?series=<dataset_id>&further=1
|
|--/<dataset_id>/dreams.html
|--/<dataset_id>/info.html
|--/<dataset_id>/moreinfo.html
|
|--/<dataset_id>/dreams.html
|[...]
```

### Related files

1. This notebook downloads the raw HTML files from [DreamBank](https://dreambank.net).
2. The file is uploaded to a [Zenodo archive](https://doi.org/10.5281/zenodo.18131749).
3. The next [prepare.ipynb](../prepare.ipynb) notebook parses the HTML into two tabular CSV files, `datasets.csv` and `dreams.csv`.
4. Then subdirectories in the [sources](../../../../) folder have individual notebooks that use the tabular files to compile individual corpora into a krank-ready corpus and are uploaded individually to Zenodo archives. These are not necessarily the same datasets that DreamBank provides, but custom groupings.

### Related projects

* [mattbierner/DreamScrape](https://github.com/mattbierner/DreamScrape)
* [josauder/dreambank_visualized](https://github.com/josauder/dreambank_visualized)
* [MigBap/dreambank](https://github.com/MigBap/dreambank)
* [jjcordes/Dreambank](https://github.com/jjcordes/Dreambank)

## Setup

Load all necessary Python packages.

In [1]:
import hashlib
import os
import random
import requests
import shutil
import tempfile
import time
from datetime import datetime, timezone

from bs4 import BeautifulSoup
from tqdm import tqdm

Set constant variables.

In [2]:
OUTPUT_BASENAME = "./output/dreambank"
OUTPUT_COMPRESSION = "xztar"
GRID_URL = "https://dreambank.net/grid.cgi"
GRID_FNAME = "grid.html"
MIN_WAIT_TIME = 5  # seconds
MAX_WAIT_TIME = 9  # seconds

Remove existing file(s) from previous runs and create necessary directories. The temporary directory will be used to store the HTML files that will get compressed into a single output file.

In [3]:
output_dir = os.path.dirname(OUTPUT_BASENAME)
if os.path.isdir(output_dir):
    shutil.rmtree(output_dir)
os.mkdir(output_dir)
temp_dir = tempfile.TemporaryDirectory()

Create a function for compiling DreamBank-specific dataset URLs. URLs are dataset-dependent and can be easily generated by inserting the name of the dataset. There are 3 pages available for each dataset.

In [4]:
def compose_url(dataset: str, component: str) -> str:
    """Compose DreamBank URL for given dataset and component."""
    assert component in {"dreams", "info", "moreinfo"}
    if component == "dreams":
        return f"https://dreambank.net/random_sample.cgi?series={dataset}"
    elif component == "info":
        return f"https://dreambank.net/more_info.cgi?series={dataset}"
    elif component == "moreinfo":
        return f"https://dreambank.net/more_info.cgi?series={dataset}&further=1"

Create a function for printing ISO-formatted timestamps later.

In [5]:
def format_timestamp(unix_timestamp: float) -> str:
    """Convert unix timestamp to UTC-stamped ISO format."""
    dt = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    timestamp = dt.isoformat(timespec="seconds")
    return timestamp

## Grid page

The grid file is a DreamBank page that includes a table of all the datasets available in DreamBank and also longer text descriptions of them. This file is used to identify all the dataset IDs included in the current version of DreamBank and used for subsequent scraping.

Download the grid page.

In [6]:
response = requests.get(GRID_URL, headers={"Accept-Encoding": "gzip, deflate"})

Extract the dataset IDs from the [grid page](https://dreambank.net/grid.cgi) by finding all the checkbox elements.

In [7]:
soup = BeautifulSoup(response.content, "html.parser", from_encoding="ISO-8859-1")
dataset_ids = set(x.get("value") for x in soup.find_all("input", type="checkbox"))

Write the grid content to an HTML file for inclusion in the final output file.

In [8]:
with open(os.path.join(temp_dir.name, GRID_FNAME), "wb") as f:
    f.write(response.content)

## Dream reports and metadata

Loop over all the dataset IDs from the grid page, downloading all 3 component pages for each dataset and writing each one to its own HTML file in the temporary directory.

In [9]:
for dataset_id in (pbar := tqdm(sorted(dataset_ids))):
    pbar.set_description(f"Retrieving content ({dataset_id})")
    dataset_dir = os.path.join(temp_dir.name, dataset_id)
    os.mkdir(dataset_dir)
    for component in ["dreams", "info", "moreinfo"]:
        url = compose_url(dataset_id, component)
        fname = f"{dataset_id}/{component}.html"
        response = requests.get(url, headers={"Accept-Encoding": "gzip, deflate"})
        local_fname = os.path.join(temp_dir.name, fname)
        with open(local_fname, "wb") as f:
            f.write(response.content)
        wait_time = random.uniform(MIN_WAIT_TIME, MAX_WAIT_TIME)
        time.sleep(wait_time)

Retrieving content (zurich-m.de): 100%|██████████| 96/96 [37:05<00:00, 23.18s/it]        


## Export

Zip the cache directory into a single compressed file and delete the temporary directory that held individual files.

In [10]:
outpath = shutil.make_archive(OUTPUT_BASENAME, OUTPUT_COMPRESSION, temp_dir.name)
temp_dir.cleanup()

Print the output file details for future reference.

In [11]:
print(f"{'file':>10}: {os.path.basename(outpath)}")
print(f"{'size':>10}: {os.path.getsize(outpath) / 1e6} MB")
with open(outpath, "rb") as f:
    print(f"{'md5':>10}: {hashlib.md5(f.read()).hexdigest()}")
    print(f"{'sha256':>10}: {hashlib.sha256(f.read()).hexdigest()}")
print(f"{'Created':>10}: {format_timestamp(os.path.getatime(outpath))}")
print(f"{'Modified':>10}: {format_timestamp(os.path.getmtime(outpath))}")
print(f"{'Accessed':>10}: {format_timestamp(os.path.getatime(outpath))}")
print(f"{'Now':>10}: {datetime.now(timezone.utc).isoformat(timespec='seconds')}")

      file: dreambank.tar.xz
      size: 8.319932 MB
       md5: eb83bcb0828f9c8c248a5052b2ffc798
    sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
   Created: 2026-01-05T17:09:52+00:00
  Modified: 2026-01-05T17:09:52+00:00
  Accessed: 2026-01-05T17:09:52+00:00
       Now: 2026-01-05T17:09:52+00:00
