1) Given a single note, return the TF-IDF scores for its top words.
2) Integrate these top scores into vault_index.json and produce an enhanced_vault_index.json file.
3) Using the scores within enhanced_vault_index.json, possibly cluster or classify the documents afterward?

### 1) Given a single note, return the TF-IDF scores for its top words.


In [46]:
from pathlib import Path
import re
from markdown import markdown
from bs4 import BeautifulSoup

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [47]:
def normalize_document(doc):
    tokens = tokenizer.tokenize(doc.lower())
    return [
        lemmatizer.lemmatize(t)
        for t in tokens if t not in stop_words and len(t) > 2
    ]

def strip_yaml(text):
    """Remove YAML front matter from a Markdown document."""
    return re.sub(r"^---.*?---\s*", "", text, flags=re.DOTALL)

def read_markdown_file(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
        text = strip_yaml(text)
        html = markdown(text)
        soup = BeautifulSoup(html, features="html.parser")
        return soup.get_text()

In [48]:
# === CONFIGURATION ===
filename = "Views.md"
VAULT_PATH = Path("C:/Users/RhysL/Desktop/Data-Archive/content/standardised")
file_path = VAULT_PATH / filename

# === TEXT PREPROCESSING ===
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')


In [49]:
# === MAIN TF-IDF PIPELINE ===
document = read_markdown_file(file_path)
corpus = [document]  # Single-document corpus

vectorizer = CountVectorizer(tokenizer=normalize_document)
X_counts = vectorizer.fit_transform(corpus)

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)

# === OUTPUT RESULTS ===
feature_names = vectorizer.get_feature_names_out()
scores = X_tfidf[0].T.toarray().flatten()
terms_scores = [(feature_names[i], score) for i, score in enumerate(scores) if score > 0]
sorted_terms = sorted(terms_scores, key=lambda x: -x[1])



In [50]:
print(f"\nTop TF-IDF Terms in {filename}:")
for term, score in sorted_terms[:10]:  # Top 10 terms
    print(f"{term:>15}: {score:.4f}")


Top TF-IDF Terms in Views.md:
           view: 0.5551
           data: 0.4362
          table: 0.1983
         access: 0.1586
          query: 0.1586
           user: 0.1586
        complex: 0.1190
       database: 0.1190
    performance: 0.1190
         result: 0.1190
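
Because the corpus here contains only one document, the IDF factor is the same for every term, so the scores above are effectively L2-normalised term counts. The sketch below re-implements, in plain Python, the weighting that scikit-learn's `TfidfTransformer` applies with its default settings (smoothed IDF, L2 norm) to make this visible. The function name `tfidf_rows` and the toy counts are illustrative assumptions, not part of the notebook.

```python
import math

def tfidf_rows(counts_per_doc):
    """TF-IDF with smoothed IDF and L2 normalisation for a list of {term: count} dicts."""
    n_docs = len(counts_per_doc)
    # Document frequency: in how many documents each term occurs.
    df = {}
    for doc in counts_per_doc:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in counts_per_doc:
        weights = {t: c * idf[t] for t, c in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical counts for a single note. With one document the IDF is 1
# for every term, so each score is count / sqrt(sum of squared counts).
single = tfidf_rows([{"view": 3, "data": 4}])[0]
print(single)
```

The IDF part only becomes informative once terms are ranked against the whole vault, as in the batch step of section 2.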


### 2) Integrate these top scores into the vault_index.json and produce an enhanced_vault_index.json file.

Batch Processing with TF-IDF Integration


In [51]:
import json
import re
from pathlib import Path
from markdown import markdown
from bs4 import BeautifulSoup

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [65]:
# === CONFIGURATION ===
VAULT_PATH = Path("C:/Users/RhysL/Desktop/Data-Archive/content/standardised")
MD_FILES = list(VAULT_PATH.glob("*.md"))
JSON_PATH = "vault_index.json"
OUTPUT_PATH = "enhanced_vault_index.json"

# === TEXT PREPROCESSING ===
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')

Loads vault_index.json first to identify which .md files to process.

Reads only the corresponding Markdown files from disk (not the entire folder).

Extracts main content (excluding YAML).

Computes top 10 TF-IDF scores per file.

Inserts these scores as a new key "TFIDF_Score" in the vault_index object.

Writes an enhanced version to enhanced_vault_index.json.
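
A minimal sketch of the first two steps, assuming index keys are the lowercased filename stems with spaces replaced by underscores (the convention `normalize_title` uses later in this notebook). The helper names `normalize_stem` and `select_indexed_files` are hypothetical. The folder is still listed once to resolve keys to paths, but only files named in the index would be read afterwards.

```python
import json
from pathlib import Path

def normalize_stem(stem):
    """Assumed key convention: lowercase, spaces replaced by underscores."""
    return stem.lower().replace(" ", "_")

def select_indexed_files(vault_path, index_path):
    """Map each index key to its Markdown file; also report keys with no file."""
    with open(index_path, encoding="utf-8") as f:
        vault_index = json.load(f)
    # Resolve every .md file in the folder to its would-be index key.
    by_key = {normalize_stem(p.stem): p for p in Path(vault_path).glob("*.md")}
    selected = {key: by_key[key] for key in vault_index if key in by_key}
    missing = [key for key in vault_index if key not in by_key]
    return selected, missing
```

Note that computing TF-IDF over only the selected files would make the IDF values relative to that subset rather than to the whole folder.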

The titles in vault_index.json are taken from the title/aliases metadata (for example "What is Apache Airflow?") rather than from the filename.

This is an issue with build_vault_index.ipynb.

In [None]:
Prompt:

### 2) Integrate the top 10 scores into the vault_index.json and produce an enhanced_vault_index.json file.

The file vault_index.json has the form:

{
  "1-on-1_template": {
    "title": "1-on-1 Template",
    "tags": [],
    "aliases": [],
    "outlinks": [],
    "inlinks": [
      "documentation_&_meetings"
    ],
    "summary": "Decisions [Your name] add decisions that need to be made [Other person's name] add decisions that need to be made Action items [Your name] add..."
  },similar keys...}

# The key is the filename (as for Views.md above) and the value is a dictionary holding the title, tags, aliases, outlinks, inlinks and summary.
# To add the top scores, add a new key to this dictionary called TFIDF_Score whose value is the
# top 10 scores as a dict, where each key is a term and each value is its score.
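
The target shape described in the prompt can be sketched as follows. The helper name `add_tfidf_scores` is hypothetical, the summary is abbreviated, and the scores are made up purely to show the structure.

```python
def add_tfidf_scores(entry, term_scores, top_n=10):
    """Return a copy of one vault_index entry with a TFIDF_Score dict added."""
    # Keep the top_n highest-scoring terms, highest first.
    top = sorted(term_scores.items(), key=lambda kv: -kv[1])[:top_n]
    enriched = dict(entry)
    enriched["TFIDF_Score"] = {term: float(score) for term, score in top}
    return enriched

entry = {
    "title": "1-on-1 Template",
    "tags": [],
    "aliases": [],
    "outlinks": [],
    "inlinks": ["documentation_&_meetings"],
    "summary": "...",
}
# Scores below are invented for illustration only.
example = add_tfidf_scores(entry, {"decision": 0.61, "action": 0.42, "item": 0.30}, top_n=2)
print(example["TFIDF_Score"])
```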



In [None]:
# === STEP 1: READ ALL DOCUMENTS ===
corpus = []
filenames = []

for md_file in MD_FILES:
    text = read_markdown_file(md_file)
    corpus.append(text)
    filenames.append(md_file.stem)  # filename without ".md"

# === STEP 2: COMPUTE TF-IDF ===
vectorizer = CountVectorizer(tokenizer=normalize_document)
X_counts = vectorizer.fit_transform(corpus)
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)

# === STEP 3: LOAD vault_index.json ===
with open(JSON_PATH, "r", encoding="utf-8") as f:
    vault_index = json.load(f)

def normalize_title(title):
    """
    Normalize title by converting to lowercase and replacing spaces with underscores.
    """
    return title.lower().replace(" ", "_")

# apply normalize_title to all terms in filenames
normalized_filenames = [normalize_title(filename) for filename in filenames]


In [54]:
filenames[:10]

['1-on-1 Template',
 'AB testing',
 'Accessing Gen AI generated content',
 'Accuracy',
 'ACID Transaction',
 'Activation atlases',
 'Activation Function',
 'Active Learning',
 'Ada boosting',
 'Adam Optimizer']

In [None]:
#CHECK:
# normalized_filenames[:10]
# # list(vault_index.keys())==normalized_filenames

# vault_filenames = list(vault_index.keys())

# # show those that are different
# l=[]
# for i in range(len(vault_filenames)):
#     if vault_filenames[i] != normalized_filenames[i]:
#         # print(f"{vault_filenames[i]} != {normalized_filenames[i]}")
#         l.append((vault_filenames[i], normalized_filenames[i]))
# len(l)

0

In [63]:
vault_index_l=list(vault_index.keys())
len(vault_index_l)
vault_index_l[:10]

['1-on-1_template',
 'ab_testing',
 'accessing_gen_ai_generated_content',
 'accuracy',
 'acid_transaction',
 'activation_atlases',
 'activation_function',
 'active_learning',
 'ada_boosting',
 'adam_optimizer']

In [None]:
# Assumed: normalized_filenames and vault_index_l hold the same keys.

# === STEP 4: ADD TF-IDF SCORES ===
vault_index_l=list(vault_index.keys())
feature_names = vectorizer.get_feature_names_out()
skipped = 0
for i, filename in enumerate(normalized_filenames):
    if filename not in vault_index_l:
        # print(f"Skipping: {filename} not in vault_index.json")
        skipped+=1
        continue

    scores = X_tfidf[i].T.toarray().flatten()
    terms_scores = [(feature_names[j], float(scores[j])) for j in range(len(scores)) if scores[j] > 0]
    top_10_scores = dict(sorted(terms_scores, key=lambda x: -x[1])[:10])

    vault_index[filename]["TFIDF_Score"] = top_10_scores

print("number of terms skipped", skipped)


number of terms skipped 770


In [None]:
Only 27 entries have a TFIDF_Score; see dagster.
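
With 770 files skipped, it may help to list exactly which names fail to match rather than only counting them. The sketch below compares the two sets of keys directly; `compare_keys` is a hypothetical helper and the sample data is invented for illustration.

```python
def compare_keys(normalized_filenames, vault_index):
    """Report names present on only one side of the filename/index match."""
    file_keys = set(normalized_filenames)
    index_keys = set(vault_index)
    return {
        "only_in_files": sorted(file_keys - index_keys),
        "only_in_index": sorted(index_keys - file_keys),
        "matched": len(file_keys & index_keys),
    }

# Hypothetical data for illustration.
report = compare_keys(
    ["ab_testing", "views", "dagster"],
    {"ab_testing": {}, "dagster": {}, "accuracy": {}},
)
print(report)
```

Running this on the real `normalized_filenames` and `vault_index` would show whether the mismatch comes from the key convention or from the contents of the loaded index.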

In [None]:

# === STEP 5: SAVE TO enhanced_vault_index.json ===
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(vault_index, f, indent=2, ensure_ascii=False)

print(f"Enhanced vault index written to: {OUTPUT_PATH}")


### 3) Using the scores within the enhanced_vault_index.json, possibly cluster or classify the documents afterward?
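
One possible first step, sketched under the assumption that each entry carries a `TFIDF_Score` dict of term to score: compare notes by the cosine similarity of those dicts. The helper names `cosine_similarity` and `most_similar` are hypothetical. Because only the top 10 terms per note are stored, these similarities are approximations; clustering on the full TF-IDF matrix would likely be more reliable.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: score} dicts."""
    dot = sum(score * b[term] for term, score in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(key, index, top_n=3):
    """Rank other notes by similarity of their TFIDF_Score dicts to the given note."""
    target = index[key].get("TFIDF_Score", {})
    scored = [
        (other, cosine_similarity(target, entry.get("TFIDF_Score", {})))
        for other, entry in index.items()
        if other != key
    ]
    return sorted(scored, key=lambda kv: -kv[1])[:top_n]
```

Notes whose nearest neighbours repeatedly point at each other would be natural candidates for the same cluster.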