# 01 – Preprocessing Structural and Graph Metrics

## Motivation

This notebook initiates a structural and relational analysis of the Data Archive. The archive is composed of interconnected Markdown notes. This preprocessing stage is designed to:

- Model the vault as a structured dataframe
- Quantify structural characteristics per note
- Capture relational properties via internal wikilinks
- Enable further exploratory and semantic analysis

In essence, this is a snapshot of my second brain.

## Purpose

To build a comprehensive, structured table (DataFrame) where each row represents a note and each column encodes:
- **Structural features** (length, formatting, richness)
- **Relational features** (links to and from other notes)

## Phase 1 – Extract Per-Note Structural and Categorical Metrics



**Goal**: Traverse the vault to identify all Markdown files and extract per-note structural and categorical features. These features describe each note’s content, formatting, and metadata composition, enabling downstream analysis of complexity, style, and usage patterns.

For each note, extract:

### A. Basic Content Metrics
* **Word count** – total number of words (excluding YAML frontmatter)
* **Line count** – number of non-empty lines (a general proxy for content density)

### B. Structural Hierarchy
* **Section count** – number of Markdown headings (`#`, `##`, etc.), reflecting conceptual segmentation
* **Max heading depth** – maximum nesting level of headings (e.g. `###` → depth 3)

### C. Formatting Elements
* **List item count** – number of list entries (`-`, `*`, `1.`), indicating enumeration or procedural content
* **Code block count** – number of fenced code blocks (\`\`\`)
* **Code block line count** – total number of lines within all code blocks
* **Quote block count** – number of blockquotes (`>`), used for emphasis or excerpts
* **Table count** – number of Markdown-style tables (rows with `|`)
* **Image count** – number of embedded image links (`![](...)`)

### D. Metadata (Frontmatter)
* **Frontmatter present** – boolean indicating presence of a YAML metadata block
* **Frontmatter field count** – number of top-level keys in the YAML block
* **Tag count** – number of tags in the `tags` field, if present

### E. Derived Categorical Flags (Boolean)

These fields are derived from structural metrics and presence checks:

* **has\_code** – true if `code_block_count > 0`
* **has\_images** – true if `image_count > 0`
* **has\_table** – true if `table_count > 0`
* **has\_quotes** – true if `quote_block_count > 0`
* **has\_lists** – true if `list_item_count > 0`
* **has\_frontmatter** – same as `frontmatter_present`
* **is\_empty** – true if `word_count < 10`
* **has\_math** – true if math expressions using `$$...$$` or `$...$` are detected

### F. File-Level Metadata

* **File name** – used as the unique note identifier (`note_id`)
* **File path** – relative or absolute path to the file

### Build

In [1]:
from pathlib import Path
import re
import yaml
import pandas as pd
from markdown import markdown
from bs4 import BeautifulSoup

In [2]:
# === CONFIGURATION ===
VAULT_PATH = Path("C:/Users/RhysL/Desktop/Data-Archive/content/standardised")

In [3]:
# ...existing code...
def extract_frontmatter(md_text):
    # Match YAML frontmatter at the very start of the file
    match = re.match(r'^---\s*\n(.*?)\n---\s*\n?(.*)', md_text, re.DOTALL)
    if match:
        try:
            frontmatter = yaml.safe_load(match.group(1)) or {}
        except Exception:
            frontmatter = {}
        content = match.group(2)
    else:
        frontmatter = {}
        content = md_text
    return frontmatter, content

In [4]:

def markdown_to_text(md_content):
    html = markdown(md_content)
    soup = BeautifulSoup(html, features="html.parser")
    return soup.get_text()

def normalize_title(title):
    return title.lower().replace(" ", "_")

def extract_structural_metrics(note_path, raw_md):
    frontmatter, content = extract_frontmatter(raw_md)
    lines = content.splitlines()
    plain_text = markdown_to_text(content)
    
    # Frontmatter tags field (if exists and is a list)
    tag_count = len(frontmatter.get("tags", [])) if isinstance(frontmatter.get("tags", []), list) else 0
    
    # Derived booleans
    code_block_count = len(re.findall(r'```', content)) // 2
    code_block_line_count = sum(block.count('\n') + 1 for block in re.findall(r'```.*?\n(.*?)```', content, re.DOTALL))
    image_count = len(re.findall(r'!\[.*?\]\(.*?\)|!\[\[.*?\]\]', content))
    table_count = len(re.findall(r'\|.*?\|', content))
    quote_block_count = len(re.findall(r'^>\s', content, re.MULTILINE))
    list_item_count = len(re.findall(r'^[-*+]\s', content, re.MULTILINE))

    output= {"note_id": normalize_title(note_path.stem),
        "title": note_path.stem,
        "word_count": len(plain_text.split()),
        "line_count": len([line for line in lines if line.strip()]),
        "section_count": len(re.findall(r'^(#{1,6})\s+', content, re.MULTILINE)),
        "max_heading_depth": max([len(m) for m in re.findall(r'^(#{1,6})\s+', content, re.MULTILINE)], default=0),
        "list_item_count": list_item_count,
        "code_block_count": code_block_count,
        "code_block_line_count": code_block_line_count,
        "image_count": image_count,
        "table_count": table_count,
        "quote_block_count": quote_block_count,
        "frontmatter_present": bool(frontmatter),
        "frontmatter_field_count": len(frontmatter),
        "tag_count": tag_count,
        "has_code": code_block_count > 0,
        "has_images": image_count > 0,
        "has_table": table_count > 0,
        "has_quotes": quote_block_count > 0,
        "has_lists": list_item_count > 0,
        "has_frontmatter": bool(frontmatter),
        "is_empty": len(plain_text.split()) < 10,
        "has_math": bool(re.search(r'\$\$.*?\$\$', content, re.DOTALL)) or bool(re.search(r'(?<!\\)\$(?!\$).*?(?<!\\)\$', content))}

    return output

# === MAIN FUNCTION ===
def build_structural_df(vault_path):
    records = []
    for note_path in vault_path.rglob("*.md"):
        with open(note_path, 'r', encoding='utf-8') as f:
            raw_md = f.read()
        metrics = extract_structural_metrics(note_path, raw_md)
        records.append(metrics)
    return pd.DataFrame(records)


In [5]:
# === RUN ===
df_structural = build_structural_df(VAULT_PATH)

In [6]:
df_structural.head()

Unnamed: 0,note_id,title,word_count,line_count,section_count,max_heading_depth,list_item_count,code_block_count,code_block_line_count,image_count,...,frontmatter_field_count,tag_count,has_code,has_images,has_table,has_quotes,has_lists,has_frontmatter,is_empty,has_math
0,1-on-1_template,1-on-1 Template,85,12,0,0,4,0,0,0,...,0,0,False,False,False,False,True,False,False,False
1,ab_testing,AB testing,16,1,0,0,0,0,0,0,...,0,0,False,False,False,False,False,False,False,False
2,accessing_gen_ai_generated_content,Accessing Gen AI generated content,272,11,1,3,0,0,0,0,...,5,2,False,False,False,False,False,True,False,False
3,accuracy,Accuracy,289,23,5,3,8,1,5,0,...,1,1,True,False,False,False,True,True,False,True
4,acid_transaction,ACID Transaction,170,6,1,3,0,0,0,0,...,2,2,False,False,False,False,False,True,False,False


In [4]:
# df_structural.to_csv("note_structural_metrics.csv", index=False)
df_structural=pd.read_csv("note_structural_metrics.csv")
# df_structural.shape
# df_structural.info()

## Phase 2 – Extract Link and Graph Metrics

**Goal**: Parse internal links and construct a note-to-note graph, then compute graph-based metrics per note.

- Extract `[[wikilinks]]` from each note
- Build a directed graph of notes using these links (which we will save)
- Compute for each note:
  - **Outlink count** – number of linked notes
  - **Inlink count** – number of incoming links
  - **Total degree** – inlink + outlink
  - **Orphan status** – notes with zero inlinks
- We will store these metrics in a separate CSV file.

In [3]:
import re
import networkx as nx
import pandas as pd

In [8]:

def extract_links_from_note(note_path):
    with open(note_path, 'r', encoding='utf-8') as f:
        content = f.read()
    note_id = note_path.stem.lower().replace(" ", "_")
    links = re.findall(r'\[\[([^\]]+)\]\]', content)
    # Normalize link targets
    linked_ids = [normalize_title(link.split('|')[0]) for link in links]
    return note_id, linked_ids

def build_link_graph(vault_path):
    edges = []
    note_ids = set()

    for note_path in vault_path.rglob("*.md"):
        source_id, targets = extract_links_from_note(note_path)
        note_ids.add(source_id)
        for target in targets:
            edges.append((source_id, target))
            note_ids.add(target)

    G = nx.DiGraph()
    G.add_nodes_from(note_ids)
    G.add_edges_from(edges)
    return G

def compute_graph_metrics(G):
    metrics = []

    for node in G.nodes:
        outlinks = G.out_degree(node)
        inlinks = G.in_degree(node)
        total = outlinks + inlinks
        is_orphan = inlinks == 0

        metrics.append({
            "note_id": node,
            "outlink_count": outlinks,
            "inlink_count": inlinks,
            "total_degree": total,
            "is_orphan": is_orphan
        })

    return pd.DataFrame(metrics)


In [9]:
vault_path = VAULT_PATH
G = build_link_graph(vault_path)
graph_metrics_df = compute_graph_metrics(G)

In [2]:
# graph_metrics_df.to_csv("note_graphmetrics.csv", index=False)

graph_metrics_df=pd.read_csv("note_graphmetrics.csv")


In [None]:
# Save graph
# nx.write_gexf(G, "note_graph.gexf")
# Load graph
# G = nx.read_gexf("note_graph.gexf")

# # show subgraph of G with Nodes from a given list
# def show_subgraph(G, nodes):
#     subgraph = G.subgraph(nodes)
#     return subgraph

# # Plot graph
# nodes=['acid_transaction','transaction','sqlite']
# subgraph = show_subgraph(G, nodes)
# nx.draw(subgraph, with_labels=True)

## Phase 3 – Assemble Unified Note-Level Dataset

**Goal**: Construct a DataFrame with one row per note and all extracted metrics as columns.

- Merge structural and graph-based metrics
- Clean and validate the dataset
- Optionally export the graph as NetworkX or JSON

In [5]:
full_df = df_structural.merge(graph_metrics_df, on='note_id', how='left').fillna({
    "outlink_count": 0,
    "inlink_count": 0,
    "total_degree": 0,
    "is_orphan": True
})


In [6]:
full_df.to_csv("note_full_metrics.csv", index=False)
# df_structural.to_csv("note_structural_metrics.csv", index=False)
# df_structural=pd.read_csv("note_structural_metrics.csv")