## Bandcamp Collection Tag Scraper + Embed Skimmer

This notebook has **two parts**:

1) **Scrape + filter your Bandcamp collection into a `.txt` list** (by tags)  
2) **Build a skimmable HTML page** that embeds each album/track player + lets you filter by tag

---

# PART 1 — Collection → filtered .txt

This notebook uses the **bandcamp-fetch** Node.js library to read Bandcamp collections and filter releases by tag. It includes an HTML fallback for albums or tracks whose tags are not exposed via metadata.

## CONFIGURATION

### 1) Bandcamp session cookies (optional)

**Most users do NOT need cookies** for `bandcamp-fetch` (public collections work fine without logging in).

Only add cookies if:
- your collection includes private / “fans only” items,
- the collection returns empty even though it has items,
- or you’re seeing errors that go away when logged in.

To add cookies (optional), paste your Bandcamp cookies into the script:

  bcfetch.setCookie('PASTE_COOKIE_STRING_HERE');

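How to get cookies:
- Log into https://bandcamp.com
- Open DevTools → Console
- Run: `document.cookie`
- Copy/paste the full output
- If the collection still behaves as logged-out, note that `document.cookie` cannot see cookies flagged HttpOnly; in that case, copy the `Cookie` request header of a bandcamp.com request from DevTools → Network instead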

Notes:
- Cookies expire (sometimes within hours or a day)
- **Never share or commit cookies** (treat them like a password)
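
Because cookies should never be committed, one option is to read the cookie from an environment variable rather than pasting it into the script. The following is a minimal sketch under assumptions: the variable name `BANDCAMP_COOKIE` and the helper `readCookie` are made up for illustration and are not part of this notebook or of `bandcamp-fetch`.

```javascript
// Hypothetical helper: return the cookie string from the environment, or null if unset.
function readCookie(env = process.env) {
  const value = (env.BANDCAMP_COOKIE || '').trim();
  return value.length > 0 ? value : null;
}

const cookie = readCookie();
if (cookie) {
  // In the scraper, this is where bcfetch.setCookie(cookie) would be called.
  console.log('Cookie found; pass it to bcfetch.setCookie(cookie).');
} else {
  console.log('No cookie set; continuing without login.');
}
```

With this approach the script itself never contains the secret, so it can be shared safely.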

### 2) Bandcamp username

  const USERNAME = 'bandcamp_username';

### 3) Target tags

  const TARGET_TAGS = [
    'breaks',
    'breakbeat',
    'ukg',
    '2-step',
    'dubstep'
  ];

Matching behavior:
- Case-insensitive
- Partial matches are allowed (e.g. "garage" matches "uk garage")
- Album metadata is checked first
- HTML tag scraping is used as a fallback when metadata is missing
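
The matching rules above can be sketched as follows. This mirrors the `normalize` helper used by the scraper; `matchTags` is a hypothetical name used only for this illustration.

```javascript
// Normalize a tag: lowercase, unify dash variants, drop punctuation, collapse spaces.
const normalize = (s) =>
  String(s)
    .toLowerCase()
    .replace(/[\u2010-\u2015\u2212]/g, '-')
    .replace(/[^a-z0-9\s-]/g, '')
    .replace(/\s+/g, ' ')
    .trim();

// Hypothetical helper: return the release tags that contain any target tag as a substring.
function matchTags(releaseTags, targetTags) {
  const targets = targetTags.map(normalize).filter(Boolean);
  const tags = [...new Set(releaseTags.map(normalize).filter(Boolean))];
  return tags.filter((tag) => targets.some((target) => tag.includes(target)));
}

console.log(matchTags(['UK Garage', 'House'], ['garage'])); // [ 'uk garage' ]
console.log(matchTags(['garage'], ['uk garage']));          // []
console.log(matchTags(['2\u2011Step'], ['2-step']));        // [ '2-step' ]
```

Note the direction of the partial match: a short target such as `garage` matches the longer tag `uk garage`, but the target `uk garage` does not match a release tagged only `garage`.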

## OUTPUT

Writes `bandcamp_filtered.txt` as repeated blocks:

Title  
URL  
Tags: tag1, tag2, tag3
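
As a rough illustration of this layout, the sketch below builds the file text for two placeholder entries. `formatBlock` is a hypothetical name, and the titles and URLs are invented, not real releases.

```javascript
// Format one match as three lines: title, URL, tag line.
function formatBlock({ title, url, tags }) {
  return `${title}\n${url}\nTags: ${tags.join(', ')}\n`;
}

// Placeholder entries for illustration only.
const matches = [
  { title: 'First Release', url: 'https://example.bandcamp.com/album/first', tags: ['dubstep', 'uk garage'] },
  { title: 'Second Release', url: 'https://example.bandcamp.com/track/second', tags: ['breaks'] },
];

// Each block already ends with a newline, so joining with '\n' leaves one blank line between blocks.
const fileText = matches.map(formatBlock).join('\n');
console.log(fileText);
```

The blank line between blocks is what lets Part 2 split the file back into title / URL / tags triples.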

## RUN ORDER (GOOGLE COLAB)

1) Install Node.js and dependencies  
2) Edit USERNAME / TARGET_TAGS  
3) Run `bandcamp_scraper.js`  
4) Confirm `bandcamp_filtered.txt` exists

If output is empty:
- If you enabled cookies, check they’re still valid
- Confirm the collection contains releases with relevant tags
- Try adding a broad tag (e.g. "electronic") to sanity-check matching

---

# PART 2 — .txt → Embed Skimmer HTML

This step converts your `.txt` list into a **single HTML page** of embedded Bandcamp players.

## What it does

- Reads `bandcamp_filtered.txt`
- For each URL, extracts the **Bandcamp embed id** (album or track)
- Writes:
  - `bandcamp_items.json` (everything we could build an embed for)
  - `bandcamp_failures.json` (everything that failed, with a reason)
  - `bandcamp_skimmer.html` (the UI you open in your browser)
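
To make the "embed id" step concrete: the builder looks for an `EmbeddedPlayer` URL in the page's metadata and reads the numeric id from it. A minimal sketch of that last step, assuming the URL has already been found; `parseEmbedId` is a hypothetical name and the ids below are made up.

```javascript
// Hypothetical helper: pull the album or track id out of an EmbeddedPlayer URL.
// Album ids are checked first, matching the order used by the builder script.
function parseEmbedId(embedUrl) {
  const album = /album=(\d+)\b/i.exec(embedUrl);
  if (album) return { type: 'album', id: album[1] };

  const track = /track=(\d+)\b/i.exec(embedUrl);
  if (track) return { type: 'track', id: track[1] };

  return null; // no recognizable id in this URL
}

console.log(parseEmbedId('https://bandcamp.com/EmbeddedPlayer/album=1234567890/size=large/'));
console.log(parseEmbedId('https://bandcamp.com/EmbeddedPlayer/track=42/size=large/'));
console.log(parseEmbedId('https://example.com/'));
```

A `null` result corresponds to the "no embed metadata" failures recorded in `bandcamp_failures.json`.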

## Important notes

- The skimmer **does not need cookies**.
- Bandcamp will rate-limit if you fetch lots of pages quickly, so the builder uses:
  - **Network limiter** (slow, steady requests)
  - **Global cooldown** on any HTTP 429 (rate limit)
  - **Retry pass at the end** for rate-limited items (after a bigger cooldown)
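
The spacing and cooldown rules above come down to simple arithmetic on timestamps. The sketch below is a simplified illustration, not the builder's actual code; `delayBeforeNext` and `extendCooldown` are hypothetical names, and the 450 ms / 30 s values are the defaults from the builder's config.

```javascript
// Hypothetical helper: milliseconds to wait before the next request may start.
// It honors both the minimum spacing between requests and any active cooldown.
function delayBeforeNext({ now, lastStart, cooldownUntil, minTimeMs }) {
  const spacingReadyAt = lastStart + minTimeMs;
  return Math.max(0, spacingReadyAt - now, cooldownUntil - now);
}

// Hypothetical helper: new cooldown deadline after an HTTP 429.
// The deadline only ever moves later, never earlier.
function extendCooldown({ now, cooldownUntil, cooldownMs }) {
  return Math.max(cooldownUntil, now + cooldownMs);
}

// Normal case: last request started 100 ms ago, spacing is 450 ms.
console.log(delayBeforeNext({ now: 1000, lastStart: 900, cooldownUntil: 0, minTimeMs: 450 })); // 350

// After a 429 at t=1000 with a 30 s cooldown, everything waits for the cooldown.
const until = extendCooldown({ now: 1000, cooldownUntil: 0, cooldownMs: 30000 });
console.log(delayBeforeNext({ now: 1000, lastStart: 900, cooldownUntil: until, minTimeMs: 450 })); // 30000
```

Keeping the cooldown global (one shared deadline) means a single 429 pauses all pending requests, not just the one that was rejected.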

## Using the skimmer page

- Open `bandcamp_skimmer.html`
- Scroll down: it will **load more players automatically** as you scroll (no “Load more” button).
- Use the **Tag Filter** box to show only items that contain a tag string.
  - Example: type `dubstep` or `uk garage`
- Failures are shown too (title + URL + reason), so nothing “disappears.”
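
The filter box accepts several comma-separated terms and requires all of them to match (AND). The sketch below mirrors the `parseNeedles` and `matchesAllTags` functions in the generated page; the sample items and the `show` helper are made up for illustration.

```javascript
const normalizeTag = (s) => String(s || '').toLowerCase().trim();

// Split the filter text on commas into unique, lowercase search terms.
function parseNeedles(raw) {
  const parts = String(raw || '').split(',').map(normalizeTag).filter(Boolean);
  return [...new Set(parts)];
}

// An item passes only if every term is a substring of at least one of its tags.
function matchesAllTags(item, needles) {
  if (needles.length === 0) return true;
  const tags = (item.tags || []).map(normalizeTag);
  return needles.every((needle) => tags.some((tag) => tag.includes(needle)));
}

// Placeholder items for illustration.
const items = [
  { title: 'Release A', tags: ['dubstep', 'uk garage'] },
  { title: 'Release B', tags: ['dubstep', 'techno'] },
];
const show = (raw) =>
  items.filter((it) => matchesAllTags(it, parseNeedles(raw))).map((it) => it.title);

console.log(show('dubstep'));         // [ 'Release A', 'Release B' ]
console.log(show('Dubstep, garage')); // [ 'Release A' ]
```

An empty filter shows everything, and adding more comma-separated terms can only narrow the list.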

## Troubleshooting

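- “Some embeds won’t load”: a small percentage of Bandcamp pages cannot be embedded or have been removed.
- “Many failures after X items”: you are being rate-limited. Raise `GLOBAL_COOLDOWN_MS` and `NET_MIN_TIME_MS` in the builder config, and keep `PIPELINE_CONCURRENCY` and `NET_MAX_CONCURRENT` at 1.
- “Last items don’t show”: this is usually a batching edge case in the UI (cards render in batches of 60 as you scroll). All links should still appear, and every item is also listed in `bandcamp_items.json`.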
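
For the rate-limiting case above, slower settings might look like the sketch below. The constant names come from the builder's config block; the specific values are illustrative assumptions, not tested recommendations, and `estimateMinutes` is a hypothetical helper.

```javascript
// Illustrative values only; the builder's defaults are 450 ms spacing and a 30 s cooldown.
const PIPELINE_CONCURRENCY = 1;     // keep at 1: one page processed at a time
const NET_MAX_CONCURRENT = 1;       // keep at 1: one request in flight
const NET_MIN_TIME_MS = 1_000;      // was 450: at most about one request per second
const GLOBAL_COOLDOWN_MS = 60_000;  // was 30_000: pause a full minute after any HTTP 429

// Hypothetical helper: rough lower bound on fetch time for a list of a given size.
const estimateMinutes = (itemCount) => (itemCount * NET_MIN_TIME_MS) / 60_000;
console.log(estimateMinutes(300)); // 5
```

Slower spacing costs time up front but tends to avoid the much longer cooldown and retry pauses.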

---

# Quick checklist

✅ Have `bandcamp_filtered.txt`  
✅ Run the embed builder  
✅ Download/open `bandcamp_skimmer.html`  
✅ Use tag filtering

## Step 1: Install Node.js and Dependencies

In [None]:
# Install Node.js 20 in Colab
!curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
!sudo apt-get install -y nodejs
!node --version
!npm --version

In [None]:
# Install bandcamp-fetch, plus node-fetch v2 and cheerio for the HTML tag fallback
!npm install bandcamp-fetch node-fetch@2 cheerio

## Step 2: Create the Scraper Script

In [None]:
%%writefile bandcamp_scraper.js
const bcfetch = require('bandcamp-fetch');
const fs = require('fs');
const fetch = require('node-fetch');
const cheerio = require('cheerio');

// ───────── CONFIG ─────────
const USERNAME = 'nodusudon';
const TARGET_TAGS = ['2-step', '2step', 'dubstep', 'breakbeat', 'breaks', 'minimal'];
const OUTPUT_FILE = 'bandcamp_filtered.txt';

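// OPTIONAL: most public collections work without cookies.
// To enable, uncomment the line below and paste your cookie string.
// bcfetch.setCookie('PASTE_COOKIE_STRING_HERE');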

// ───────── UTILS ─────────
const sleep = ms => new Promise(r => setTimeout(r, ms));

const normalize = s =>
  s.toLowerCase()
   .replace(/[\u2010-\u2015\u2212]/g, '-')
   .replace(/[^a-z0-9\s-]/g, '')
   .replace(/\s+/g, ' ')
   .trim();

async function scrapeTagsFromHTML(url) {
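  const res = await fetch(url);
  // Fail loudly on error pages instead of silently returning no tags
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const html = await res.text();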
  const $ = cheerio.load(html);
  return $('a.tag').map((_, el) => $(el).text().trim()).get();
}

// ───────── MAIN ─────────
async function scrape() {
  let allItems = [];
  let continuation = USERNAME;

  console.log('Loading collection…');

  while (continuation) {
    const res = await bcfetch.fan.getCollection({ target: continuation });
    if (res.items?.length) {
      allItems.push(...res.items);
      console.log(`  loaded ${allItems.length}`);
    }
    continuation = res.continuation;
    if (continuation) await sleep(800);
  }

  console.log(`\nProcessing ${allItems.length} items…\n`);

  const albumSeen = new Set();
  const lines = [];
  let processed = 0;
  const startTime = Date.now();

  for (const item of allItems) {
    processed++;

    console.log(`[${processed}/${allItems.length}] ${item.name || 'Unknown'}`);

    try {
      const trackUrl = item.url;
      const trackInfo = await bcfetch.track.getInfo({ trackUrl });

      const albumUrl = trackInfo?.album?.url;
      let title, urlForTags;

      if (albumUrl) {
        console.log('  ↳ album');
        if (albumSeen.has(albumUrl)) {
          console.log('  ↳ already processed');
          continue;
        }
        albumSeen.add(albumUrl);
        title = trackInfo.album.name;
        urlForTags = albumUrl;
      } else {
        console.log('  ↳ standalone track');
        title = trackInfo.name;
        urlForTags = trackUrl;
      }

      let tags = [];

      if (albumUrl) {
        const albumInfo = await bcfetch.album.getInfo({ albumUrl });
        if (Array.isArray(albumInfo.keywords)) tags.push(...albumInfo.keywords);
        if (Array.isArray(albumInfo.tags)) tags.push(...albumInfo.tags);
      }

      if (tags.length === 0) {
        console.log('  ↳ HTML tag fallback');
        tags.push(...await scrapeTagsFromHTML(urlForTags));
      }

      const normTags = [...new Set(tags.map(normalize).filter(Boolean))];

      const matched = normTags.filter(t =>
        TARGET_TAGS.some(x => t.includes(normalize(x)))
      );

      if (matched.length) {
        console.log('  ✅ MATCH FOUND');
        console.log(`     ${title}`);
        console.log(`     ${urlForTags}`);
        console.log(`     tags: ${matched.join(', ')}`);

        lines.push(
          `${title}\n${urlForTags}\nTags: ${normTags.join(', ')}\n`
        );
      }

      if (processed % 10 === 0) {
        const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
        console.log(`--- progress ${processed}/${allItems.length} | ${elapsed}s elapsed ---`);
      }

      await sleep(700);

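    } catch (e) {
      // Report why the item was skipped, then pause so repeated errors don't cause rapid-fire requests
      console.log(`  ⚠️ error, skipping: ${e?.message || e}`);
      await sleep(700);
    }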
  }

  fs.writeFileSync(OUTPUT_FILE, lines.join('\n'));
  console.log('\nDone.');
  console.log(`Matches written to: ${OUTPUT_FILE}`);
}

scrape().catch((e) => {
  console.error('Fatal:', e);
  process.exit(1);
});

## Step 3: Run the Scraper

In [None]:
!node bandcamp_scraper.js

## Step 4: Download Results (or Continue to Step 5)

The results are saved in `bandcamp_filtered.txt`.

Download the file from the Files panel in the left sidebar, or run the cell below to print it and trigger a download.

In [None]:
# Display results
!cat bandcamp_filtered.txt
from google.colab import files
files.download("bandcamp_filtered.txt")

## Step 5: Create Embedded Player HTML

This step processes `bandcamp_filtered.txt` to create a page of playable Bandcamp widgets for skimming.

In [11]:
%%capture
!npm -s init -y
!npm -s i bottleneck

In [None]:
%%writefile bandcamp_build_skimmer.js
"use strict";

const fs = require("fs");

// ---------- CONFIG ----------
const INPUT_TXT = "bandcamp_filtered.txt";

const OUTPUT_ITEMS_JSON = "bandcamp_items.json";
const OUTPUT_FAIL_JSON  = "bandcamp_failures.json";
const OUTPUT_HTML       = "bandcamp_skimmer.html";

const PIPELINE_CONCURRENCY = 1;

// Network limiter settings
const NET_MAX_CONCURRENT = 1;
const NET_MIN_TIME_MS = 450;

// Cooldowns
const GLOBAL_COOLDOWN_MS = 30_000;
const RETRY_WARMUP_MS    = 120_000;
const RETRY_COOLDOWN_MS  = 60_000;

// Retry sweep
const RETRY_ATTEMPTS = 3;

const FETCH_TIMEOUT_MS = 20_000;
const UA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome Safari";

// ---------- SIMPLE NETWORK LIMITER ----------
class NetLimiter {
  constructor({ maxConcurrent, minTime }) {
    this.maxConcurrent = maxConcurrent;
    this.minTime = minTime;
    this.inflight = 0;
    this.queue = [];
    this.lastStart = 0;
  }
  async schedule(fn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this._drain();
    });
  }
  async _drain() {
    if (this.inflight >= this.maxConcurrent) return;
    if (this.queue.length === 0) return;

    const now = Date.now();
    const wait = Math.max(0, this.minTime - (now - this.lastStart));
    if (wait > 0) {
      setTimeout(() => this._drain(), wait);
      return;
    }

    const job = this.queue.shift();
    this.inflight += 1;
    this.lastStart = Date.now();

    (async () => {
      try {
        const res = await job.fn();
        job.resolve(res);
      } catch (e) {
        job.reject(e);
      } finally {
        this.inflight -= 1;
        this._drain();
      }
    })();
  }
}
const netLimiter = new NetLimiter({
  maxConcurrent: NET_MAX_CONCURRENT,
  minTime: NET_MIN_TIME_MS,
});

// ---------- GLOBAL COOLDOWN GATE ----------
let cooldownUntil = 0;
let cooldownMsActive = GLOBAL_COOLDOWN_MS;

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}

async function maybeCooldownLog() {
  const now = Date.now();
  if (now < cooldownUntil) {
    const remain = cooldownUntil - now;
    if (remain % 5000 < 500) {
      console.log(`⏳ GLOBAL cooldown active (~${Math.ceil(remain / 1000)}s remaining)`);
    }
    await sleep(Math.min(1000, remain));
    return true;
  }
  return false;
}

function triggerCooldown(reason) {
  const until = Date.now() + cooldownMsActive;
  if (until > cooldownUntil) cooldownUntil = until;
  console.log(`⏳ GLOBAL cooldown triggered (${Math.round(cooldownMsActive/1000)}s) - reason: ${reason}`);
}

// ---------- FETCH ----------
async function fetchWithTimeout(url, opts = {}) {
  const controller = new AbortController();
  const t = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS);
  try {
    const res = await fetch(url, {
      ...opts,
      signal: controller.signal,
      headers: {
        "User-Agent": UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        ...(opts.headers || {}),
      },
      redirect: "follow",
    });
    return res;
  } finally {
    clearTimeout(t);
  }
}

async function gatedFetchText(url) {
  while (await maybeCooldownLog()) {}

  return netLimiter.schedule(async () => {
    while (await maybeCooldownLog()) {}

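    let res;
    try {
      res = await fetchWithTimeout(url);
    } catch (e) {
      // Timeout or network error: record a failure (status 0) instead of aborting the whole run
      return { status: 0, text: null };
    }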

    if (res.status === 429) {
      triggerCooldown("HTTP 429");
      return { status: 429, text: null };
    }
    if (!res.ok) {
      return { status: res.status, text: null };
    }
    const text = await res.text();
    return { status: res.status, text };
  });
}

// ---------- PARSE INPUT TXT ----------
function parseFilteredTxt(txt) {
  const lines = txt.split(/\r?\n/);
  const items = [];
  let i = 0;
  while (i < lines.length) {
    const title = (lines[i] || "").trim();
    if (!title) { i++; continue; }
    const url = (lines[i + 1] || "").trim();
    const tagsLine = (lines[i + 2] || "").trim();

    if (!url.startsWith("http")) { i += 1; continue; }

    let tags = [];
    if (/^Tags:\s*/i.test(tagsLine)) {
      tags = tagsLine.replace(/^Tags:\s*/i, "").split(",").map(s => s.trim()).filter(Boolean);
    }

    items.push({ title, url, tags });

    i += 3;
    while (i < lines.length && lines[i].trim() !== "") i++;
    while (i < lines.length && lines[i].trim() === "") i++;
  }
  return items;
}

// ---------- EXTRACT FROM META ----------
function extractFromMeta(html) {
  const metas = [];

  const ogRe = /<meta[^>]+property=["']og:video(?::secure_url)?["'][^>]+content=["']([^"']+)["'][^>]*>/gi;
  let m;
  while ((m = ogRe.exec(html))) metas.push(m[1]);

  const twRe = /<meta[^>]+name=["']twitter:player["'][^>]+content=["']([^"']+)["'][^>]*>/gi;
  while ((m = twRe.exec(html))) metas.push(m[1]);

  const occRe = /(https?:\/\/bandcamp\.com\/EmbeddedPlayer\/[^"'<> ]+)/gi;
  while ((m = occRe.exec(html))) metas.push(m[1]);

  for (const src of metas) {
    const a = /album=(\d+)\b/i.exec(src);
    if (a) return { type: "album", id: a[1], how: "meta" };

    const t = /track=(\d+)\b/i.exec(src);
    if (t) return { type: "track", id: t[1], how: "meta" };
  }
  return null;
}

// ---------- EMBED BUILDER ----------
function escapeHtml(s) {
  return String(s)
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&#39;");
}
function escapeHtmlAttr(s) { return escapeHtml(s); }

function buildIframe({ type, id, pageUrl, title }) {
  if (type === "album") {
    return `<iframe class="bc_iframe" style="border: 0; width: 400px; height: 208px;" src="https://bandcamp.com/EmbeddedPlayer/album=${id}/size=large/bgcol=ffffff/linkcol=0687f5/artwork=small/transparent=true/" seamless><a href="${escapeHtmlAttr(pageUrl)}">${escapeHtml(title)}</a></iframe>`;
  }
  return `<iframe class="bc_iframe" style="border: 0; width: 100%; height: 120px;" src="https://bandcamp.com/EmbeddedPlayer/track=${id}/size=large/bgcol=ffffff/linkcol=0687f5/tracklist=false/artwork=small/transparent=true/" seamless><a href="${escapeHtmlAttr(pageUrl)}">${escapeHtml(title)}</a></iframe>`;
}

// ---------- POOL ----------
async function mapPool(items, concurrency, fn) {
  let idx = 0;
  const workers = Array.from({ length: concurrency }, () => (async () => {
    while (true) {
      const my = idx++;
      if (my >= items.length) break;
      await fn(items[my], my);
    }
  })());
  await Promise.all(workers);
}

// ---------- PROCESS ONE ----------
async function processOne(item) {
  const { status, text } = await gatedFetchText(item.url);

  if (status === 429) {
    return { ok: false, status, how: "http_429", reason: "rate_limited", type: null, id: null };
  }
  if (status !== 200 || !text) {
    return { ok: false, status, how: "http", reason: `http_${status}`, type: null, id: null };
  }

  const extracted = extractFromMeta(text);
  if (!extracted) {
    return { ok: false, status: 200, how: "no_id_found", reason: "no_embed_meta", type: null, id: null };
  }

  return { ok: true, status: 200, how: extracted.how, reason: null, type: extracted.type, id: extracted.id };
}

// ---------- HTML ----------
function buildHtml(allItems) {
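  // Escape "<" so a title or tag containing "</script>" cannot end the inline script early
  const dataJson = JSON.stringify(allItems).replace(/</g, "\\u003c");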
  return `<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<title>Bandcamp Skimmer</title>
<style>
  body { font-family: system-ui, -apple-system, Segoe UI, Roboto, Arial, sans-serif; margin: 0; background: #0b0b0b; color: #eee; }
  header { position: sticky; top: 0; background: rgba(0,0,0,0.92); backdrop-filter: blur(10px); padding: 12px 16px; border-bottom: 1px solid #222; z-index: 5; }
  .meta { display: flex; gap: 12px; align-items: center; flex-wrap: wrap; }
  .brand { font-weight: 700; }
  .controls { display: flex; gap: 10px; align-items: center; flex-wrap: wrap; width: 100%; margin-top: 10px; }
  .input { flex: 1 1 360px; min-width: 260px; display: flex; align-items: center; gap: 10px; }
  input[type="text"] { width: 100%; background: #0f0f0f; color: #eee; border: 1px solid #2a2a2a; border-radius: 10px; padding: 10px 12px; outline: none; }
  input[type="text"]::placeholder { color: #777; }
  .hint { font-size: 12px; color: #9a9a9a; }
  .wrap { padding: 14px 16px 60px; max-width: 980px; margin: 0 auto; }
  .grid { display: grid; grid-template-columns: 1fr; gap: 14px; }
  @media (min-width: 980px) { .grid { grid-template-columns: 1fr 1fr; } }
  .card { background: #111; border: 1px solid #222; border-radius: 12px; padding: 12px; }
  .title { font-weight: 650; margin: 0 0 6px; line-height: 1.25; }
  .sub { font-size: 12px; color: #aaa; display: flex; gap: 10px; flex-wrap: wrap; margin-bottom: 10px; }
  .sub a { color: #9ecbff; text-decoration: none; }
  .sub a:hover { text-decoration: underline; }
  .tags { font-size: 12px; color: #bbb; margin: 8px 0 0; }
  .tags span { display: inline-block; margin: 2px 6px 0 0; padding: 2px 6px; border: 1px solid #2a2a2a; border-radius: 999px; color: #cfcfcf; }
  .framebox { margin-top: 10px; }
  .placeholder { height: 208px; background: #0e0e0e; border: 1px dashed #2a2a2a; border-radius: 10px; display: flex; align-items: center; justify-content: center; color: #777; font-size: 12px; text-align: center; padding: 0 10px; }
  .fail { border-color: #3a1b1b; background: #140b0b; }
  .fail .placeholder { height: 120px; border-style: solid; border-color: #3a1b1b; }
  #sentinel { height: 1px; }
  .footer { margin-top: 14px; color: #777; font-size: 12px; text-align: center; }
</style>
</head>
<body>
<header>
  <div class="meta">
    <div class="brand">Bandcamp Skimmer</div>
  </div>
  <div class="controls">
    <div class="input">
      <input id="tagFilter" type="text" placeholder="Filter tags (AND). Example: dubstep, techno" />
    </div>
    <div class="hint">Comma-separated. Matches are case-insensitive and partial.</div>
  </div>
</header>

<div class="wrap">
  <div class="grid" id="grid"></div>
  <div id="sentinel"></div>
  <div class="footer">Scroll to load more - embeds load lazily near viewport. Failures are shown too.</div>
</div>

<script>
  const ITEMS = ${dataJson};

  const grid = document.getElementById("grid");
  const sentinel = document.getElementById("sentinel");
  const tagFilter = document.getElementById("tagFilter");

  const BATCH = 60;

  // Single-player enforcement
  let activeIframe = null;

  function stopActiveIframe() {
    if (!activeIframe) return;
    try { activeIframe.src = "about:blank"; } catch (e) {}
    activeIframe = null;
  }

  function setActiveIframe(iframe) {
    if (activeIframe && activeIframe !== iframe) {
      try { activeIframe.src = "about:blank"; } catch (e) {}
    }
    activeIframe = iframe;
  }

  function wireSinglePlay(iframe) {
    if (!iframe) return;
    if (iframe.getAttribute("data-wired") === "1") return;
    iframe.setAttribute("data-wired", "1");
    iframe.addEventListener("focusin", () => setActiveIframe(iframe), true);
    iframe.addEventListener("focus", () => setActiveIframe(iframe), true);
  }

  document.addEventListener("pointerdown", (ev) => {
    const iframe = ev.target && ev.target.closest ? ev.target.closest("iframe.bc_iframe") : null;
    if (!iframe) return;
    setActiveIframe(iframe);
  }, true);

  let filtered = ITEMS.slice();
  let cursor = 0;

  const embedIO = new IntersectionObserver((entries) => {
    for (const e of entries) {
      if (!e.isIntersecting) continue;
      const card = e.target;
      embedIO.unobserve(card);
      if (card.getAttribute("data-embed-loaded") === "1") continue;
      const iframeHtml = card.getAttribute("data-iframe");
      const box = card.querySelector(".framebox");
      if (iframeHtml && box) {
        card.setAttribute("data-embed-loaded", "1");
        box.innerHTML = iframeHtml;
        const iframe = box.querySelector("iframe.bc_iframe");
        wireSinglePlay(iframe);
      }
    }
  }, { rootMargin: "900px 0px" });

  function escapeHtml(s) {
    return String(s)
      .replaceAll("&", "&amp;")
      .replaceAll("<", "&lt;")
      .replaceAll(">", "&gt;")
      .replaceAll('"', "&quot;")
      .replaceAll("'", "&#39;");
  }

  function normalizeTag(s) { return String(s || "").toLowerCase().trim(); }

  function parseNeedles(raw) {
    const parts = String(raw || "").split(",").map(x => normalizeTag(x)).filter(Boolean);
    return Array.from(new Set(parts));
  }

  function matchesAllTags(item, needles) {
    if (!needles.length) return true;
    const hay = (item.tags || []).map(normalizeTag);
    return needles.every(n => hay.some(t => t.includes(n)));
  }

  function tagSpans(tags) {
    if (!tags || !tags.length) return "";
    return tags.map(t => "<span>" + escapeHtml(t) + "</span>").join("");
  }

  function renderCard(it) {
    const card = document.createElement("div");
    card.className = "card" + (it.ok ? "" : " fail");

    const link = "<a href=\\"" + escapeHtml(it.url) + "\\" target=\\"_blank\\" rel=\\"noreferrer\\">" + escapeHtml(it.url) + "</a>";

    const status = (it.status != null) ? String(it.status) : "";
    const how = it.how ? it.how : "";
    const reason = it.reason ? it.reason : "";

    const sub = it.ok
      ? ""
      : "<span style=\\"color:#caa;\\">FAILED - status=" + escapeHtml(status) + " how=" + escapeHtml(how) + " reason=" + escapeHtml(reason) + "</span>";

    const placeholder = "<div class=\\"placeholder\\">" + (it.ok ? "Loading embed..." : "No embed data") + "</div>";

    card.innerHTML =
      "<div class=\\"title\\">" + escapeHtml(it.title) + "</div>" +
      "<div class=\\"sub\\">" + link + sub + "</div>" +
      "<div class=\\"framebox\\">" + placeholder + "</div>" +
      "<div class=\\"tags\\">" + tagSpans(it.tags) + "</div>";

    if (it.ok && it.iframe) {
      card.setAttribute("data-iframe", it.iframe);
      card.setAttribute("data-embed-loaded", "0");
      embedIO.observe(card);
    }
    return card;
  }

  function renderNextBatch() {
    const total = filtered.length;
    const end = Math.min(total, cursor + BATCH);
    for (let i = cursor; i < end; i++) {
      grid.appendChild(renderCard(filtered[i]));
    }
    cursor = end;
  }

  const batchIO = new IntersectionObserver((entries) => {
    for (const e of entries) {
      if (!e.isIntersecting) continue;
      if (cursor >= filtered.length) return;
      renderNextBatch();
    }
  }, { rootMargin: "1200px 0px" });

  function applyFilter() {
    stopActiveIframe();
    const needles = parseNeedles(tagFilter.value);
    filtered = ITEMS.filter(it => matchesAllTags(it, needles));
    grid.innerHTML = "";
    cursor = 0;
    renderNextBatch();
  }

  let filterTimer = null;
  tagFilter.addEventListener("input", () => {
    if (filterTimer) clearTimeout(filterTimer);
    filterTimer = setTimeout(applyFilter, 150);
  });

  applyFilter();
  batchIO.observe(sentinel);
</script>
</body>
</html>`;
}

// ---------- MAIN ----------
async function main() {
  if (!fs.existsSync(INPUT_TXT)) {
    console.error("Missing " + INPUT_TXT + ". Put it in the same folder as this script.");
    process.exit(1);
  }

  const raw = fs.readFileSync(INPUT_TXT, "utf8");
  const baseItems = parseFilteredTxt(raw);

  console.log(`Total parsed items: ${baseItems.length}`);
  console.log(`Pass 1 concurrency: ${PIPELINE_CONCURRENCY}`);
  console.log(`Network limiter: maxConcurrent=${NET_MAX_CONCURRENT}, minTime=${NET_MIN_TIME_MS}ms`);
  console.log(`Cooldown: ${Math.round(GLOBAL_COOLDOWN_MS/1000)}s on any 429`);
  console.log("");

  const results = new Array(baseItems.length);

  let done = 0, ok = 0, fail = 0;
  const t0 = Date.now();
  function logProgress() {
    const elapsed = (Date.now() - t0) / 1000;
    const rate = elapsed > 0 ? (done / elapsed) : 0;
    console.log(
      `Progress: ${done}/${baseItems.length} (${(100*done/baseItems.length).toFixed(1)}%) - ${rate.toFixed(1)} items/s - ok: ${ok} - fail: ${fail}`
    );
  }

  // MAIN PASS
  cooldownMsActive = GLOBAL_COOLDOWN_MS;

  await mapPool(baseItems, PIPELINE_CONCURRENCY, async (item, index) => {
    const res = await processOne(item);

    const out = {
      title: item.title,
      url: item.url,
      tags: item.tags,
      ok: !!res.ok,
      status: res.status,
      how: res.how,
      reason: res.reason,
      type: res.type,
      id: res.id,
      iframe: null,
    };

    if (out.ok && out.type && out.id) {
      out.iframe = buildIframe({ type: out.type, id: out.id, pageUrl: out.url, title: out.title });
    }

    results[index] = out;
    done += 1;
    if (out.ok) ok += 1; else fail += 1;
    if (done % 10 === 0) logProgress();
  });

  logProgress();

  // RETRY SWEEP for rate-limited only
  const rlIndexes = [];
  for (let i = 0; i < results.length; i++) {
    const r = results[i];
    if (r && !r.ok && (r.reason === "rate_limited" || r.how === "http_429" || r.status === 429)) {
      rlIndexes.push(i);
    }
  }

  if (rlIndexes.length) {
    console.log("");
    console.log(`Retry sweep for rate-limited items: ${rlIndexes.length}`);
    console.log(`⏳ Warmup cooldown ${Math.round(RETRY_WARMUP_MS/1000)}s before retry sweep...`);
    await sleep(RETRY_WARMUP_MS);

    cooldownMsActive = RETRY_COOLDOWN_MS;

    let rlRecovered = 0;
    for (let k = 0; k < rlIndexes.length; k++) {
      const idx = rlIndexes[k];
      const item = baseItems[idx];

      let final = null;
      for (let attempt = 1; attempt <= RETRY_ATTEMPTS; attempt++) {
        const res = await processOne(item);

        const out = {
          title: item.title,
          url: item.url,
          tags: item.tags,
          ok: !!res.ok,
          status: res.status,
          how: res.how,
          reason: res.reason,
          type: res.type,
          id: res.id,
          iframe: null,
        };
        if (out.ok && out.type && out.id) {
          out.iframe = buildIframe({ type: out.type, id: out.id, pageUrl: out.url, title: out.title });
        }

        final = out;
        if (out.ok) break;

        await sleep(750);
      }

      const wasOk = results[idx].ok;
      results[idx] = final;

      if (!wasOk && final.ok) {
        rlRecovered += 1;
        ok += 1;
        fail -= 1;
      }

      if ((k + 1) % 10 === 0) {
        console.log(`Retry progress: ${k+1}/${rlIndexes.length} - recovered: ${rlRecovered}`);
      }
    }

    console.log(`Retry sweep done. Recovered: ${rlRecovered}`);
  }

  // Write outputs
  const failures = results.filter(x => x && !x.ok);

  fs.writeFileSync(OUTPUT_ITEMS_JSON, JSON.stringify(results, null, 2));
  fs.writeFileSync(OUTPUT_FAIL_JSON, JSON.stringify(failures, null, 2));
  fs.writeFileSync(OUTPUT_HTML, buildHtml(results), "utf8");

  console.log("");
  console.log(`Index done. OK: ${ok} | Failed: ${fail}`);
  console.log(`Wrote: ${OUTPUT_ITEMS_JSON}`);
  console.log(`Wrote: ${OUTPUT_FAIL_JSON}`);
  console.log(`Wrote: ${OUTPUT_HTML}`);
}

main().catch((e) => {
  console.error("Fatal:", e);
  process.exit(1);
});

In [None]:
!node bandcamp_build_skimmer.js
from google.colab import files
files.download("bandcamp_skimmer.html")
files.download("bandcamp_failures.json")
files.download("bandcamp_items.json")