Indie Map

Indie Map is a public IndieWeb social graph and dataset. See the docs for details. This doc focuses on how to develop, run, and maintain Indie Map itself.

The individual sites and pages retain their original copyright. The rest of the dataset and this project are placed into the public domain via the CC0 public domain dedication.


The crawler is basically just xargs wget -r < domains.txt. Details in and

To add a site to the dataset, run crawl/, then crawl/. They're separate because the second step runs ~300GB of BigQuery queries, which costs ~$1.50. If you're adding lots of sites, run the first step once for each site and then the second step once for all the sites together.

Extracting IWS domains, eg 2017

Run this shell command to fetch, extract the attendees' web sites, and compare to 2017/domains_iws.txt to see if there are any new ones.

curl \
  | grep -A2 'class="profile-info"' \
  | grep -o -E 'http[^"]+' \
  | grep -v \
  | cut -d/ -f3 | sort | uniq \
  | diff ~/src/indie-map/crawl/2017/domains_iws.txt -
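
For anyone who would rather do the extract-and-compare step in Python, here is a minimal sketch of the same idea. It is not part of Indie Map: the function names `extract_domains` and `new_domains` are hypothetical, and it assumes the attendee profile links have already been pulled out of the page as a list of URLs. Lowercasing the hostnames is an extra step the shell pipeline does not do.

```python
from urllib.parse import urlsplit


def extract_domains(urls):
    """Return the sorted, de-duplicated hostnames of the given URLs.

    Roughly mirrors `cut -d/ -f3 | sort | uniq` in the pipeline above.
    """
    domains = set()
    for url in urls:
        host = urlsplit(url).netloc.lower()  # lowercasing is an assumption
        if host:
            domains.add(host)
    return sorted(domains)


def new_domains(urls, known_domains):
    """Return domains found in `urls` that are not in `known_domains`.

    `known_domains` is an iterable of lines, e.g. an open domains_iws.txt.
    """
    known = {d.strip().lower() for d in known_domains if d.strip()}
    return [d for d in extract_domains(urls) if d not in known]
```

Passing an open `domains_iws.txt` file as `known_domains` would give roughly the "new" side of the `diff` above.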

Extracting wiki user domains

Run this shell command to fetch a page of IndieWeb wiki users and extract their web site domains:

curl '' \
  | grep -o -E '"User:[^" ]+' | cut -d: -f2 | grep .

...and then follow next page links, e.g. &, &offset=...
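
The `grep | cut | grep` part of that command could be approximated in Python as below. This is only a sketch: `wiki_user_domains` is a hypothetical name, and it assumes the page HTML has already been fetched into a string.

```python
import re

# Same pattern as the grep above: a double quote, "User:", then a run of
# characters that are neither a double quote nor a space.
USER_RE = re.compile(r'"User:([^" ]+)')


def wiki_user_domains(html):
    """Return the user names (which are web site domains) found in html."""
    domains = []
    for match in USER_RE.finditer(html):
        # `cut -d: -f2` keeps only the text up to the next colon.
        name = match.group(1).split(":")[0]
        if name:  # `grep .` drops empty lines
            domains.append(name)
    return domains
```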

Fixing sites.json files in place

When a BigQuery load failed with an error like this:

BigQuery error in load operation: Error processing job
'indie-map:bqjob_r9dcbd6e76f81db0_0000015c29d469ee_1': JSON table encountered too many
errors, giving up. Rows: 57; errors: 1.
Failure details:
- file-00000000: JSON parsing error in row starting at position
3088711: . Only optional fields can be set to NULL. Field: names;
Value: NULL

I deleted the offending null values manually with e.g.:

grep '"names": \[[^]]*null[^]]*\]' sites.json |cut -d, -f1
sed -i '' 's/"names":\ \[null\]/"names":\ []/' sites.json
sed -i '' 's/"names":\ \[null,\ /"names":\ [/' sites.json
sed -i '' 's/"names":\ \["Jeremy Zilar",\ null,/"names":\ ["Jeremy Zilar",/' sites.json
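
A less manual alternative to the `sed` one-liners is to parse each row and drop the nulls programmatically. The sketch below assumes sites.json holds one JSON object per line (it is loaded as NEWLINE_DELIMITED_JSON elsewhere in this doc); `strip_null_names` and `fix_lines` are hypothetical helper names. Note that re-serializing with `json.dumps` may change whitespace compared to the original file.

```python
import json


def strip_null_names(line):
    """Return the JSON row with null entries removed from its "names" list."""
    site = json.loads(line)
    names = site.get("names")
    if isinstance(names, list):
        site["names"] = [n for n in names if n is not None]
    # ensure_ascii=False keeps non-ASCII names readable in the output
    return json.dumps(site, ensure_ascii=False)


def fix_lines(lines):
    """Yield a cleaned row for every non-blank input line."""
    for line in lines:
        if line.strip():
            yield strip_null_names(line)
```

For example, `fix_lines(open('sites.json'))` yields cleaned rows that can be written to a new file, one per line.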

Finding sites that redirected to other domains (to add those too)

# csh/tcsh-style loop; it needs the closing `end`
foreach f (*.warc.gz)
  echo --
  echo $f
  gzcat $f | head -n 100 | grep -m 2 -A20 -E '^HTTP/1\.[01] (30.|403)' | grep -E '^(HTTP/1|Location:)'
end

then massage manually, with e.g. these Emacs regexp replaces:


[Ll]ocation: https://\1/?.*

[Ll]ocation: /.*

HTTP/1.1 403 Forbidden

[Ll]ocation: https?://\([^/]+\)/?.*
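
As a rough alternative to the manual regexp massaging, the filtering could be sketched in Python. This is an illustration rather than what was actually run: `redirect_domains` is a hypothetical name, it takes the `HTTP/1` and `Location:` lines printed for one WARC file, and it assumes that a redirect to the same domain with or without a leading `www.` is not interesting.

```python
import re
from urllib.parse import urlsplit

LOCATION_RE = re.compile(r"^location:\s*(\S+)", re.IGNORECASE)


def _bare(host):
    """Lowercase the host and drop a leading "www." (an assumption)."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def redirect_domains(domain, header_lines):
    """Return other domains that `domain` redirected to, in order seen."""
    targets = []
    for line in header_lines:
        match = LOCATION_RE.match(line.strip())
        if not match:
            continue  # status lines such as "HTTP/1.1 403 Forbidden"
        host = urlsplit(match.group(1)).netloc
        # Relative redirects ("Location: /foo") have no host: same site.
        if not host:
            continue
        target = _bare(host)
        if target != _bare(domain) and target not in targets:
            targets.append(target)
    return targets
```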

Ops and setup

Web site

The Indie Map web site is stored and served on Firebase Hosting. I followed these instructions to set it up:

npm install -g firebase-tools  # or brew install firebase-cli
firebase login
firebase init
firebase deploy

I could then see the site serving on, and I could manage it in the Firebase console. All I had to do then was connect the domain and the www subdomain, and I was all set.

Indie Map used to serve from Google Cloud Storage. Here's what I did originally to set that up, and to store HTTP request logs in gs://indie-map/:

gsutil mb
gsutil cp www/index.html gs://
gsutil cp www/docs.html gs://
gsutil cp www/404.html gs://
gsutil acl ch -u AllUsers:R gs://
gsutil web set -m index.html -e 404.html gs://

gsutil acl ch -g gs://indie-map
gsutil logging set on -b gs://indie-map -o logs/ gs://

gsutil defacl ch -u AllUsers:READER gs://

Copying files to GCS

I ran these commands to copy the initial WARCs, BigQuery input JSON files, and social graph files to GCS and load them into BigQuery:

gsutil -m cp -L cp.log *.warc.gz gs://indie-map/crawl/
gsutil -m cp -L cp.log *.json.gz gs://indie-map/bigquery/

# can gsutil -m rsync ... as final sync. can also sanity check with:
ls -1 *.{json,warc}.gz | wc
gsutil ls gs://indie-map/{crawl,bigquery}/ | wc
du -c -b *.{json,warc}.gz | wc
gsutil du -sc gs://indie-map/{crawl,bigquery}/ | wc

bq load --replace --autodetect --source_format=NEWLINE_DELIMITED_JSON indiemap.pages gs://indie-map/bigquery/\*.json.gz
bq load --replace --autodetect --source_format=NEWLINE_DELIMITED_JSON indiemap.sites sites.json.gz
# then watch job at
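
The `wc` checks above only compare counts. If the counts differ, a small helper like this sketch could show which files are missing. `missing_uploads` is a hypothetical name, and it assumes the remote listing is the plain output of `gsutil ls`, one `gs://` URL per line.

```python
import posixpath


def missing_uploads(local_names, remote_urls):
    """Return local file names with no same-named object in the listing."""
    remote = {
        posixpath.basename(url.strip())
        for url in remote_urls
        if url.strip()  # skip blank lines in the listing
    }
    return sorted(name for name in local_names if name not in remote)
```

`local_names` could come from `os.listdir('.')` and `remote_urls` from the saved output of `gsutil ls gs://indie-map/crawl/`.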

Python snippets for debugging WARC files

import gzip
from bs4 import BeautifulSoup, UnicodeDammit
import mf2py
import warcio

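# open the gzipped WARC; ArchiveIterator yields one record at a time
with gzip.open('FILE.warc.gz', 'rb') as input:
  for i, record in enumerate(warcio.ArchiveIterator(input)):
    if i == 178:  # index of the record to inspect
      # decode the raw payload to unicode, guessing the encoding
      body = UnicodeDammit(record.content_stream().read()).unicode_markup
      break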


Removing a site

I've gotten one request so far to remove a site. Here's what I did.

  • Ran git grep DOMAIN and removed all matches, notably in crawl/[YEAR]/domains*.txt.
  • Ran these commands to remove the WARC, JSON, and other files:
    # in the repo root dir
    rm www/DOMAIN.json
    firebase deploy
    gsutil rm gs://indie-map/crawl/DOMAIN.warc.gz
    gsutil rm gs://indie-map/bigquery/DOMAIN.json.gz
    # check that there's nothing left
    gsutil ls 'gs://indie-map/**/*DOMAIN*'
  • Ran these DELETE statements in BigQuery:
    DELETE FROM indiemap.sites WHERE domain='DOMAIN'
    DELETE FROM indiemap.pages WHERE domain='DOMAIN'
  • Deleted the domain in Kumu manually. Try going to this URL: If that doesn't find it, search for the domain. Then click the Trash icon in the right pane.
  • Added a note to the docs.
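
If another removal request comes in, the command-line steps above could be generated for review with something like the sketch below. It only builds strings and runs nothing; `removal_steps` is a hypothetical helper, and the domain check is an added safety assumption so that odd input cannot end up inside a shell command or SQL statement.

```python
import re


def removal_steps(domain):
    """Return the shell and BigQuery commands for removing `domain`."""
    if not re.fullmatch(r"[A-Za-z0-9.-]+", domain):
        raise ValueError(f"unexpected characters in domain: {domain!r}")
    return {
        "shell": [
            f"git grep {domain}",
            f"rm www/{domain}.json",
            "firebase deploy",
            f"gsutil rm gs://indie-map/crawl/{domain}.warc.gz",
            f"gsutil rm gs://indie-map/bigquery/{domain}.json.gz",
            f"gsutil ls 'gs://indie-map/**/*{domain}*'",
        ],
        "bigquery": [
            f"DELETE FROM indiemap.sites WHERE domain='{domain}'",
            f"DELETE FROM indiemap.pages WHERE domain='{domain}'",
        ],
    }
```

The Kumu deletion and the docs note still have to be done by hand.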