#### Notes

- Building on graph.ipynb
- In TFIDF_explore.ipynb we build a vector embedding from TF-IDF keys.
We can then query with non-TF-IDF keys and retrieve documents
that contain similar TF-IDF keys (see Vector Search).


#### Overview

This notebook is designed to explore search functionality:

1. Investigate existing tag usage
2. Assess semantic-structural alignment
3. Suggest or refine tags
4. Visualize the graph interactively

In [None]:
import json
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from itertools import combinations
from sklearn.metrics import pairwise_distances
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import numpy as np

# Load JSON
with open("Data/enhanced_vault_index.json") as f:
    data = json.load(f)

# Inspect str

In [None]:
#### 1. Tag Suggestion (based on TF-IDF)
For notes with empty `tags`, you can suggest tags based on the top terms in the `TFIDF_Score`, matched against:
- A predefined list of possible tags (taxonomy)
    - Get the set of unique tags from all notes
- Tags already present in other notes
- Embeddings from representative tag examples
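
A minimal sketch of the first two matching strategies, assuming each note is a dict with a `tags` list and a `TFIDF_Score` dict as in the vault index. The helper names and the toy notes are hypothetical and only illustrate the idea.

```python
def collect_tag_vocabulary(vault):
    """Return the set of unique tags used across all notes."""
    return {tag for note in vault.values() for tag in (note.get("tags") or [])}


def suggest_tags(note, vocabulary, top_n=10):
    """Suggest tags whose name matches one of the note's top TF-IDF terms."""
    scores = note.get("TFIDF_Score", {})
    top_terms = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [term for term in top_terms if term in vocabulary]


# Toy data for illustration only (not taken from the real vault).
toy_vault = {
    "Note A": {"tags": ["statistics", "inference"], "TFIDF_Score": {"bayes": 0.4}},
    "Note B": {"tags": ["clustering"], "TFIDF_Score": {"kmeans": 0.5}},
    "Note C": {
        "tags": [],
        "TFIDF_Score": {"inference": 0.6, "model": 0.3, "clustering": 0.2},
    },
}

vocabulary = collect_tag_vocabulary(toy_vault)
suggestions = suggest_tags(toy_vault["Note C"], vocabulary)
print(suggestions)
```

Exact string matching is the simplest option; it misses tags that differ from the TF-IDF term by spelling or inflection, which is where the embedding-based matching would help.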

#### 2. Clustering / Thematic Grouping
Cluster notes using:
- Top N TF-IDF terms
- Outlink/inlink patterns (graph-based community detection)
- Embedding similarity (if you add vector representations)
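
As a rough sketch of the first option, the top N TF-IDF terms of each note can be turned into a feature matrix and clustered with k-means. The toy scores, the `top_terms` helper, and the choice of two clusters are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer


def top_terms(scores, n=5):
    """Keep only the n highest-scoring TF-IDF terms of one note."""
    best = sorted(scores, key=scores.get, reverse=True)[:n]
    return {term: scores[term] for term in best}


# Toy TF-IDF scores for illustration only.
toy_scores = {
    "Note A": {"regression": 0.6, "model": 0.5},
    "Note B": {"regression": 0.5, "model": 0.6},
    "Note C": {"graph": 0.7, "node": 0.4},
    "Note D": {"graph": 0.4, "node": 0.7},
}

note_titles = list(toy_scores)
# One column per term; missing terms become 0.
features = DictVectorizer(sparse=False).fit_transform(
    [top_terms(toy_scores[title]) for title in note_titles]
)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
clusters = dict(zip(note_titles, labels.tolist()))
print(clusters)
```

The cluster ids themselves are arbitrary; only which notes share an id is meaningful.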


#### 4. Relevance Scoring
You can assess how relevant a note is to a topic, query, or tag by computing overlap in TF-IDF keywords.
- Given a text, return the top 5 notes that are most relevant to it.
- Given a text, return the tags that are most relevant to it.
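
One possible sketch of keyword-overlap scoring: a note's score is the summed TF-IDF weight of the terms it shares with the input text, and tag scores are accumulated from the notes that carry them. The function names, the simple regex tokenizer, and the toy vault are assumptions; the real TF-IDF keys may be stemmed or lemmatized, in which case the query needs the same preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_notes(text, vault, k=5):
    """Score each note by the summed TF-IDF weight of terms shared with the text."""
    query_terms = tokenize(text)
    scored = []
    for title, note in vault.items():
        weights = note.get("TFIDF_Score", {})
        score = sum(w for term, w in weights.items() if term in query_terms)
        if score > 0:
            scored.append((title, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def rank_tags(text, vault, k=5):
    """Accumulate note scores onto their tags to rank tags for the text."""
    tag_scores = Counter()
    for title, score in rank_notes(text, vault, k=len(vault)):
        for tag in vault[title].get("tags") or []:
            tag_scores[tag] += score
    return tag_scores.most_common(k)


# Toy data for illustration only.
toy_vault = {
    "Note A": {"tags": ["stats"], "TFIDF_Score": {"bayesian": 0.5, "inference": 0.25}},
    "Note B": {"tags": ["ml"], "TFIDF_Score": {"inference": 0.5, "network": 0.25}},
    "Note C": {"tags": ["graphs"], "TFIDF_Score": {"network": 0.5}},
}

top_notes = rank_notes("Bayesian inference", toy_vault)
top_tags = rank_tags("Bayesian inference", toy_vault)
print(top_notes)
print(top_tags)
```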

---

### 🔧 Would You Like Help With?
Let me ask a few exploratory questions to scope the next steps:

1. Do you want to automatically suggest tags for notes with empty `tags`?
   - If so, do you have a controlled vocabulary or master tag list?
     - I have a master tag list that I would like to use.

2. Are you interested in building a graph or UI from this JSON file (e.g., visualizing clusters or navigation)?
   - Yes, if it is easy to do so.

3. Do you plan to incorporate semantic embeddings (e.g., from OpenAI, Sentence-BERT)?
   - This would enhance clustering, grouping, and tag inference.
   - Yes.

4. Should we consider alias/summary/TF-IDF terms in tag suggestions, or only TF-IDF?
   - Use TF-IDF and the tags of other notes.

5. Do you want to track which TF-IDF words are not yet used in your tag set, to enrich your tag vocabulary?

---

Let me know your priority goal (e.g., tag suggestion, clustering, visualization).

- Priority goals: visualization, clustering



In [None]:
I am investigating a folder of documents. I have created a JSON file where each item is the metadata for a document. I use this JSON to generate a graph.

I am interested in exploring the graph relationships between the notes.

I have used TF-IDF to extract the most important words from each document and added them to the JSON file.

I am looking for ways to explore semantic relationships between the notes.

Thank you, that clarifies things. It sounds like you're working with a semantically annotated graph where the tags represent *soft* or *semantic groupings*, and you're interested in:

1. Assessing the quality or relevance of existing tags,  
2. Suggesting better or more consistent tags, and  
3. Possibly inferring new groupings from the graph structure.

Let's break this down into a structured exploratory and modeling approach.

---

## 1. Explore Existing Tags
### A. Tag Frequency & Distribution
- Count how many nodes are associated with each tag.
- Measure tag co-occurrence: which tags often appear together?

Example: Build a tag–tag co-occurrence matrix (rows and columns are tags, cell values are how often they co-occur).
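
A small sketch of how such a matrix could be built with the standard library, assuming a mapping from note title to tag list. The toy tags are made up; the diagonal here holds each tag's own frequency.

```python
from collections import Counter
from itertools import combinations

# Toy tag assignments for illustration only.
toy_tags = {
    "Note A": ["ml", "statistics"],
    "Note B": ["ml", "statistics", "python"],
    "Note C": ["python"],
}

tag_counts = Counter()   # how many notes carry each tag
pair_counts = Counter()  # how many notes carry both tags of a pair
for tags in toy_tags.values():
    unique = sorted(set(tags))
    tag_counts.update(unique)
    pair_counts.update(combinations(unique, 2))

all_tags = sorted(tag_counts)
# Symmetric matrix: rows and columns follow the order of all_tags.
matrix = [
    [tag_counts[a] if a == b else pair_counts[tuple(sorted((a, b)))] for b in all_tags]
    for a in all_tags
]

print(all_tags)
for row in matrix:
    print(row)
```

Wrapping `matrix` in `pd.DataFrame(matrix, index=all_tags, columns=all_tags)` would make it easy to pass to a seaborn heatmap.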

### B. Tag Purity / Consistency
For each tag:
- Do the nodes it labels have similar graph structural properties?
  - E.g., similar in/out-degree, common neighbors?
- Are nodes with the same tag closely connected in the graph?

Metric idea:  
For each tag, compute the average pairwise shortest path between nodes sharing that tag. Lower values suggest structural closeness.
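
A hedged sketch of this metric with `networkx`, assuming an undirected graph and a mapping from tag to the nodes that carry it. The helper name and the toy graph are hypothetical; pairs with no connecting path are skipped rather than counted as infinite.

```python
from itertools import combinations

import networkx as nx


def mean_tag_distance(graph, tagged_nodes):
    """Average shortest-path length over all connected pairs of nodes sharing a tag."""
    lengths = []
    for a, b in combinations(tagged_nodes, 2):
        try:
            lengths.append(nx.shortest_path_length(graph, a, b))
        except nx.NetworkXNoPath:
            continue  # the two nodes sit in different components
    return sum(lengths) / len(lengths) if lengths else float("nan")


# Toy undirected graph: a simple path A - B - C - D.
graph = nx.Graph([("A", "B"), ("B", "C"), ("C", "D")])
nodes_by_tag = {"ml": ["A", "B"], "stats": ["A", "D"]}

distances = {
    tag: mean_tag_distance(graph, nodes) for tag, nodes in nodes_by_tag.items()
}
print(distances)
```

In this toy case the `ml` nodes are adjacent while the `stats` nodes sit at opposite ends of the path, so `ml` is the structurally tighter tag.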

---

## 2. Structural-Semantic Alignment
### A. Community Detection vs Tag Grouping
- Run community detection (e.g., Louvain, label propagation).
- Compare resulting communities with tags using:
  - Adjusted Rand Index (ARI)
  - Normalized Mutual Information (NMI)
  - Purity

This tells you how well the tags reflect structural clusters.
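
Purity is simple enough to sketch directly. The example assumes each node has already been assigned one community id and one primary tag (notes with several tags would first need to be reduced to a single label); both mappings are toy data.

```python
from collections import Counter, defaultdict


def purity(communities, tags):
    """Fraction of nodes whose tag is the majority tag of their community."""
    members = defaultdict(list)
    for node, community in communities.items():
        members[community].append(tags[node])
    majority_total = sum(
        Counter(group).most_common(1)[0][1] for group in members.values()
    )
    return majority_total / len(communities)


# Toy assignments for illustration only.
communities = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
primary_tag = {"A": "ml", "B": "ml", "C": "stats", "D": "graphs", "E": "graphs"}

score = purity(communities, primary_tag)
print(score)
```

For ARI and NMI, the same two label sequences can be passed to scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score`.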

### B. Embedding the Graph
Use `node2vec` or `GraphSAGE` to create embeddings of nodes based on structure.
- Cluster embeddings (e.g., k-means, DBSCAN).
- Compare those clusters to tags.

This captures latent similarity beyond direct connections.

---

## 3. Suggesting / Recommending Tags
You can treat this as a multi-label classification problem:
- Input: Node features (degree, neighbors, tag co-occurrence, etc.)
- Output: Tags

### Feature engineering ideas:
- Node degree
- Embedding vector
- Aggregated tag distribution of neighbors
- Clustering label (from structural embedding)

Train a classifier (e.g., Random Forest, Logistic Regression, or a simple MLP).

Alternatively: Nearest neighbors in embedding space  
- Recommend tags from nearest neighbors based on shared tags.
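
The nearest-neighbor alternative might look like the sketch below: find the `k` most similar tagged notes by cosine similarity and let them vote on tags. The function names, the two-dimensional toy embeddings, and the default `k` are assumptions; in practice the vectors would come from whichever embedding model is used.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend_tags(target, embeddings, tags, k=2, top_n=3):
    """Vote over the tags of the k most similar tagged notes."""
    neighbours = sorted(
        (title for title in embeddings if title != target and tags.get(title)),
        key=lambda title: cosine(embeddings[target], embeddings[title]),
        reverse=True,
    )[:k]
    votes = Counter(tag for title in neighbours for tag in tags[title])
    return [tag for tag, _ in votes.most_common(top_n)]


# Toy embeddings and tags for illustration only.
toy_embeddings = {
    "Note A": [1.0, 0.0],
    "Note B": [0.9, 0.1],
    "Note C": [0.0, 1.0],
    "Untagged": [1.0, 0.1],
}
toy_tags = {
    "Note A": ["ml", "stats"],
    "Note B": ["ml"],
    "Note C": ["graphs"],
    "Untagged": [],
}

recommended = recommend_tags("Untagged", toy_embeddings, toy_tags)
print(recommended)
```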

---

## 4. Visual Exploration
Create an interactive graph where:
- Nodes are colored by tag
- Option to highlight nodes with missing or low-confidence tags
- Use layout (e.g., Force-directed) to show proximity

---

## 5. Next Steps (Optional): Refinement + Feedback Loop
If you can validate tag suggestions (manually or semi-automatically), you can:
- Iteratively retrain your tag recommender
- Promote tag consistency and reduce noise

---

Would you like to:
- See code to construct tag co-occurrence and tag distance metrics?
- Build a node2vec-based embedding pipeline for clustering or tag suggestion?
- Visualise tags on the graph with node color and interactivity?

Feel free to share a small sample of the JSON so I can tailor examples.

## FAISS

### Summary of Workflow

This workflow builds a semantic search system over the Data Archive using a combination of:

1. TF-IDF-derived term selection  
   Each note is represented not by its full text but by a set of high-importance terms extracted from its `TFIDF_Score` dictionary.

2. Sentence embedding of pseudo-texts  
   These sets of terms are joined into a string (pseudo-text) and passed to a pretrained SentenceTransformer model to produce dense vector embeddings. This allows for semantic comparison even though the input is not natural language.

3. Vector indexing with FAISS  
   The resulting embedding vectors are added to a FAISS index, enabling fast nearest-neighbor search for concept similarity. Queries like `"Uncertainty quantification"` can be semantically matched against your notes.

### Purpose

This approach allows you to:
- Search your notes by concept, not exact keyword match.
- Leverage the structure of your vault (via TF-IDF scores) while benefiting from modern NLP embeddings.
- Build a scalable semantic retrieval tool across many notes.


In [None]:
import json
import faiss
from sentence_transformers import SentenceTransformer



In [None]:
# Step 1: Load the vault
with open("../Data/enhanced_vault_index.json", "r", encoding="utf-8") as f:
    vault = json.load(f)

titles = list(vault.keys())

# Step 2: Build one pseudo-text per note by joining its TF-IDF term keys
texts = [
    " ".join(sorted(vault[title].get("TFIDF_Score", {}).keys()))
    for title in titles
]

# Step 3: Load sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 4: Generate embeddings from the TF-IDF term keys
embeddings = model.encode(texts, convert_to_numpy=True).astype("float32")

# Step 5: Build FAISS index (exact L2 search)
dimension = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(embeddings)

In [10]:
# texts
#['action add directional discus discussion name needed person topic update',
#  'app like method performance product testing two version',...]
# it might be better to use LDA topic modeling here

In [18]:
# Step 6: Semantic search with query
query = "patterns of ideas"
query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
k = 10
D, I = faiss_index.search(query_vec, k)

# Step 7: Retrieve matching titles (FAISS pads results with -1 when fewer than k exist)
top_titles = [vault[titles[i]]['title'] for i in I[0] if i != -1]
print(f"Top {k} similar notes to '{query}':\n")
for idx, title in enumerate(top_titles, 1):
    print(f"{idx}. {title}")


Top 10 similar notes to 'patterns of ideas':

1. Knowledge Work
2. Data Science
3. Thinking Systems
4. inference
5. MNIST
6. Data Analysis
7. inference versus prediction
8. Over parameterised models
9. Time Series Identify Trends and Patterns
10. Scientific Method
