## What You're Doing

You are building a structured index (`vault_index.json`) of the notes in `Data-archive/`, so that an LLM can:
- Efficiently retrieve, summarise, and answer questions about a subgraph of relevant notes
- Work with metadata like titles, tags, links, aliases, and summaries

The index:
- Acts as a metadata snapshot of your vault
- Enables RAG-style querying without repeatedly parsing the markdown files
- Enables LLM reasoning over structure (e.g., graph search, centrality scoring, tag grouping)
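
As a rough illustration of the structural queries mentioned above, the sketch below groups notes by tag and ranks them by inlink count (a crude stand-in for centrality). The `toy_index` entries and the helper names `group_by_tag` and `rank_by_inlinks` are made up for this example; only the `tags` and `inlinks` fields come from the index format.

```python
from collections import defaultdict

# Toy index in the same shape as vault_index.json (hypothetical entries).
toy_index = {
    "bayesian_uncertainty": {"tags": ["uncertainty", "bayes"], "inlinks": ["decision_making"]},
    "decision_making": {"tags": ["uncertainty"], "inlinks": []},
    "confidence_intervals": {
        "tags": ["statistics"],
        "inlinks": ["bayesian_uncertainty", "decision_making"],
    },
}

def group_by_tag(index):
    """Map each tag to the sorted list of note IDs that carry it."""
    groups = defaultdict(list)
    for note_id, entry in index.items():
        for tag in entry.get("tags") or []:
            groups[tag].append(note_id)
    return {tag: sorted(ids) for tag, ids in groups.items()}

def rank_by_inlinks(index):
    """Order note IDs by inlink count, highest first; ties broken alphabetically."""
    return sorted(index, key=lambda nid: (-len(index[nid].get("inlinks") or []), nid))

tag_groups = group_by_tag(toy_index)
ranking = rank_by_inlinks(toy_index)
print(tag_groups["uncertainty"])
print(ranking)
```

Inlink count is only one possible score; anything more elaborate (PageRank, betweenness) would need a proper graph library.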

## `vault_index.json` Format (Example)

```json
{
  "note_id": {
    "title": "Bayesian Uncertainty",
    "path": "notes/probability/bayesian_uncertainty.md",
    "tags": ["uncertainty", "bayes"],
    "aliases": ["bayesian confidence"],
    "outlinks": ["confidence_intervals", "epistemic_uncertainty"],
    "inlinks": ["decision_making"],
    "summary": "Overview of subjective and objective uncertainty in Bayesian analysis."
  }
}
```
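
One way to guard against malformed entries is a small shape check, sketched below. The field names and types are taken from the example above; the `REQUIRED_FIELDS` table and the `validate_entry` helper are hypothetical and not part of the notebook.

```python
# Expected fields and their types, read off the example entry above.
REQUIRED_FIELDS = {
    "title": str,
    "path": str,
    "tags": list,
    "aliases": list,
    "outlinks": list,
    "inlinks": list,
    "summary": str,
}

def validate_entry(note_id, entry):
    """Return a list of problems found in one entry; an empty list means it looks fine."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"{note_id}: missing '{field}'")
        elif not isinstance(entry[field], expected):
            problems.append(f"{note_id}: '{field}' should be a {expected.__name__}")
    return problems

good_entry = {
    "title": "Bayesian Uncertainty",
    "path": "notes/probability/bayesian_uncertainty.md",
    "tags": ["uncertainty", "bayes"],
    "aliases": ["bayesian confidence"],
    "outlinks": ["confidence_intervals", "epistemic_uncertainty"],
    "inlinks": ["decision_making"],
    "summary": "Overview of subjective and objective uncertainty in Bayesian analysis.",
}
bad_entry = {"title": "Untitled", "tags": "bayes"}  # tags is a string, other fields missing

print(validate_entry("bayesian_uncertainty", good_entry))
print(validate_entry("untitled", bad_entry))
```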

## Planned Workflow

1. Parse a vault folder (e.g. `Data-archive/`)
2. For each note:
   - Read title from filename or YAML
   - Extract frontmatter (tags, aliases)
   - Extract links (e.g., `[[note-title]]`)
   - Extract summary (e.g., first 100 words of the note)
3. Track inlinks/outlinks
4. Save as a structured `vault_index.json`
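
Step 3 is the least obvious part of the workflow: inlinks are not stored in any note, so they have to be derived by inverting the outlinks. A minimal sketch on toy data follows, assuming outlinks have already been normalised to note IDs; the `outlinks` dict and the `invert_links` helper are illustrative only.

```python
from collections import defaultdict

# Hypothetical outlinks keyed by note ID, as they might look after step 2.
outlinks = {
    "decision_making": ["bayesian_uncertainty", "confidence_intervals"],
    "bayesian_uncertainty": ["confidence_intervals", "epistemic_uncertainty"],
    "confidence_intervals": [],
}

def invert_links(outlinks):
    """Derive inlinks from outlinks, ignoring targets that have no note in the vault."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            if target in outlinks:  # skip links to notes that do not exist
                inlinks[target].add(source)
    # Every known note gets an entry, even if nothing links to it.
    return {note_id: sorted(inlinks[note_id]) for note_id in outlinks}

inlinks = invert_links(outlinks)
print(inlinks)
```

Here `epistemic_uncertainty` is linked to but has no note of its own, so it is dropped rather than added to the index.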

## Code

In [1]:
# Dependencies: PyYAML, Markdown, beautifulsoup4 (assumed to be installed)

import os
import json
import re
from pathlib import Path
from collections import defaultdict
import yaml
from markdown import markdown
from bs4 import BeautifulSoup

In [None]:

# === CONFIGURATION ===
VAULT_PATH = Path("C:/Users/RhysL/Desktop/Data-Archive/content/standardised")
OUTPUT_PATH = "vault_index.json"

# === HELPERS ===
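def extract_frontmatter(md_text):
    """Split a note into (frontmatter dict, body); notes without YAML frontmatter get {}."""
    match = re.match(r'^---\n(.*?)\n---\n(.*)', md_text, re.DOTALL)
    if match:
        # safe_load returns None for an empty block and may return a non-dict value
        loaded = yaml.safe_load(match.group(1))
        frontmatter = loaded if isinstance(loaded, dict) else {}
        content = match.group(2)
    else:
        frontmatter = {}
        content = md_text
    return frontmatter, content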

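def extract_links(content):
    """Return wikilink targets, dropping any '|alias' or '#heading' suffix."""
    links = []
    for raw in re.findall(r'\[\[([^\]]+)\]\]', content):
        target = raw.split("|", 1)[0].split("#", 1)[0].strip()
        if target:
            links.append(target)
    return links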

def markdown_to_text(md_content):
    html = markdown(md_content)
    soup = BeautifulSoup(html, features="html.parser")
    return soup.get_text()

def get_note_id(note_path):
    return note_path.stem.replace(" ", "_").lower()

def summarise_text(text, word_limit=100):
    words = text.strip().split()
    return " ".join(words[:word_limit]) + ("..." if len(words) > word_limit else "")

# === MAIN FUNCTION ===
def index_vault(vault_path):
    vault_index = {}
    outlink_map = defaultdict(list)

    for note_path in vault_path.rglob("*.md"):
        with open(note_path, 'r', encoding='utf-8') as f:
            raw_md = f.read()

        frontmatter, content = extract_frontmatter(raw_md)
        plain_text = markdown_to_text(content)
        note_id = get_note_id(note_path)

        # "or" guards against keys that are present but empty (YAML null)
        title = frontmatter.get("title") or note_path.stem
        tags = frontmatter.get("tags") or []
        aliases = frontmatter.get("aliases") or []
        outlinks = extract_links(content)

        vault_index[note_id] = {
            "title": title,
            "path": str(note_path),
            "tags": tags,
            "aliases": aliases,
            "outlinks": outlinks,
            "inlinks": [],
            "summary": summarise_text(plain_text, word_limit=100)
        }

        for link in outlinks:
            target_id = link.replace(" ", "_").lower()
            outlink_map[target_id].append(note_id)

    # Fill inlinks from outlink_map
    for target_id, sources in outlink_map.items():
        if target_id in vault_index:
            vault_index[target_id]["inlinks"] = sorted(set(sources))  # de-duplicated, stable order

    return vault_index

# === EXECUTION ===
vault_index = index_vault(VAULT_PATH)

# Save JSON
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(vault_index, f, indent=2)

In [None]:
list(vault_index.keys())[:5]  # Show a sample of indexed note IDs

['1-on-1_template',
 'ab_testing',
 'accessing_gen_ai_generated_content',
 'accuracy',
 'acid_transaction']

In [3]:
# load vault_index.json
with open("vault_index.json", "r", encoding="utf-8") as f:
    vault_index = json.load(f)

In [5]:
# get detail for a specific note
title="1-on-1 Template"
vault_index["1-on-1_template"]

{'title': '1-on-1 Template',
 'path': 'C:\\Users\\RhysL\\Desktop\\Data-Archive\\content\\standardised\\1-on-1 Template.md',
 'tags': [],
 'aliases': [],
 'outlinks': [],
 'inlinks': ['documentation_&_meetings'],
 'summary': "Decisions [Your name] add decisions that need to be made [Other person's name] add decisions that need to be made Action items [Your name] add next steps from the discussion [Other person's name] add next steps from the discussion Topics to discuss (bi-directional) [Your name] add topics or questions to discuss together [Other person's name] add topics or questions to discuss together Updates (uni-directional - no action needed) [Your name] add updates with no action needed [Other person's name] add updates with no action needed"}
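
Once the index is loaded, one plausible way to pull out a subgraph of related notes for an LLM is a breadth-first walk over outlinks and inlinks. The sketch below is an assumption about how that could look, not part of the notebook: `neighbourhood`, `to_note_id`, and `toy_index` are hypothetical names. Outlinks are normalised with the same rule as `get_note_id` (spaces to underscores, lowercase) because the index stores them as link text.

```python
from collections import deque

def to_note_id(name):
    """Mirror the notebook's ID rule: spaces become underscores, then lowercase."""
    return name.replace(" ", "_").lower()

def neighbourhood(index, start_id, max_hops=1):
    """Return sorted note IDs within max_hops of start_id, following links both ways."""
    if start_id not in index:
        return []
    seen = {start_id}
    queue = deque([(start_id, 0)])
    while queue:
        note_id, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        entry = index[note_id]
        neighbours = [to_note_id(link) for link in entry.get("outlinks", [])]
        neighbours += list(entry.get("inlinks", []))
        for neighbour in neighbours:
            if neighbour in index and neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return sorted(seen)

# Toy index (hypothetical): a -> b -> c, plus a link to a note that does not exist.
toy_index = {
    "a": {"outlinks": ["B"], "inlinks": []},
    "b": {"outlinks": ["C", "Missing Note"], "inlinks": ["a"]},
    "c": {"outlinks": [], "inlinks": ["b"]},
}

one_hop = neighbourhood(toy_index, "a", max_hops=1)
two_hops = neighbourhood(toy_index, "a", max_hops=2)
print(one_hop, two_hops)
```

The titles and summaries of the returned IDs can then be concatenated into a prompt, which keeps the context limited to notes that are actually connected to the starting point.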