# Logseq Full-Text Search with Embeddings

This notebook loads Markdown documents from a Logseq vault into PostgreSQL and searches them with:
- Full-text search (FTS) for keyword matching
- Vector embeddings for semantic search
- Hybrid search combining both approaches

## Setup

Create a `.env` file in the project root with:
```
DB_HOST=your_host
DB_PORT=5432
DB_NAME=your_database
DB_USER=your_user
DB_PASSWORD=your_password

OLLAMA_HOST=http://your_ollama_host:11434
```
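
To illustrate what the file holds once parsed, the sketch below reads `KEY=value` lines using only the standard library. `parse_env_text` is a hypothetical helper, not part of `logseq_searcher`; how `init_db` actually reads the file is not shown in this notebook.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Placeholder values for illustration only
example = """
DB_HOST=localhost
DB_PORT=5432

OLLAMA_HOST=http://localhost:11434
"""
settings = parse_env_text(example)
print(settings["DB_PORT"])
```

Every value comes back as a string, so a port number would still need converting with `int()` before use.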

In [1]:
# Workaround for PyCharm not adding src to PYTHONPATH
import sys
from pathlib import Path

src_path = str(Path.cwd().parent / 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

In [2]:
from logseq_searcher import (
    # Database
    init_db,
    create_schema,
    # Embeddings
    init_ollama,
    # Loading
    load_logseq_vault,
    add_embeddings_to_existing,
    # Search
    search,
    advanced_search,
    get_document,
    get_document_count,
    semantic_search,
    hybrid_search,
)

# Initialize database and Ollama connections from .env file
init_db(Path('../.env'))
init_ollama()  # Uses OLLAMA_HOST from .env
print("Database and Ollama initialized")

Database and Ollama initialized


In [3]:
# Path to logseq vault
LOGSEQ_PATH = Path.home() / 'git' / 'active' / 'logseq-personal'

print(f"Logseq vault: {LOGSEQ_PATH}")
print(f"Pages directory exists: {(LOGSEQ_PATH / 'pages').exists()}")
print(f"Journals directory exists: {(LOGSEQ_PATH / 'journals').exists()}")

Logseq vault: /home/romilly/git/active/logseq-personal
Pages directory exists: True
Journals directory exists: True


## Create Database Schema

Creates a `documents` table with:
- `id`: Auto-incrementing primary key
- `filename`: Original filename
- `doc_type`: Either 'page' or 'journal'
- `title`: Document title (derived from filename)
- `content`: Full markdown content
- `content_tsv`: Full-text search vector (auto-generated)
- `embedding`: 768-dimensional vector for semantic search
- `created_at`: Timestamp
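
A table with these columns could be declared roughly as follows. This is a sketch, not the DDL that `create_schema` actually runs: the column types, the `'english'` text-search configuration, the generated column, and the index name are all assumptions. `vector(768)` is the pgvector type for a 768-dimensional vector.

```python
# Hypothetical DDL held as a Python string; it is not taken from logseq_searcher.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id          SERIAL PRIMARY KEY,
    filename    TEXT NOT NULL,
    doc_type    TEXT NOT NULL CHECK (doc_type IN ('page', 'journal')),
    title       TEXT NOT NULL,
    content     TEXT NOT NULL,
    -- Kept in sync with content automatically by PostgreSQL
    content_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    -- Nullable, so documents can be loaded before embeddings exist
    embedding   vector(768),
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX IF NOT EXISTS documents_tsv_idx ON documents USING GIN (content_tsv);
"""
print(SCHEMA_SQL)
```

Leaving `embedding` nullable matches the workflow below, where documents are loaded first and embeddings are added afterwards.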

In [5]:
create_schema()
print("Schema created successfully (with pgvector support)")

Schema created successfully (with pgvector support)


## Load Documents

You can load documents with or without embeddings:
- Without embeddings: Fast, FTS-only search
- With embeddings: Slower loading, enables semantic and hybrid search
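
Loading amounts to walking the vault's `pages` and `journals` directories and reading each Markdown file. A minimal sketch, assuming every document is a `.md` file and the title is the filename without its extension; `iter_vault_documents` is a hypothetical name, not the library's loader:

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_vault_documents(vault: Path) -> Iterator[dict]:
    """Yield one dict per markdown file under pages/ and journals/."""
    for subdir, doc_type in (("pages", "page"), ("journals", "journal")):
        directory = vault / subdir
        if not directory.is_dir():
            continue
        for path in sorted(directory.glob("*.md")):
            yield {
                "filename": path.name,
                "doc_type": doc_type,
                "title": path.stem,  # assumption: title is the filename stem
                "content": path.read_text(encoding="utf-8"),
            }


# Build a tiny throwaway vault to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "pages").mkdir()
    (vault / "journals").mkdir()
    (vault / "pages" / "memory.md").write_text("- a note", encoding="utf-8")
    (vault / "journals" / "2024_01_01.md").write_text("- an entry", encoding="utf-8")
    docs = list(iter_vault_documents(vault))

print([(d["doc_type"], d["title"]) for d in docs])
```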

In [6]:
# Load WITHOUT embeddings (fast)
result = load_logseq_vault(LOGSEQ_PATH)
print(f"Loaded {result['pages']} pages and {result['journals']} journals")
print(f"Total: {result['total']} documents")

Loaded 3307 pages and 1491 journals
Total: 4798 documents


In [7]:
# Verify the data was loaded
counts = get_document_count()
for doc_type, count in counts.items():
    print(f"{doc_type}: {count} documents")

page: 3307 documents
journal: 1491 documents


## Add Embeddings to Existing Documents

This generates embeddings for documents that don't have them yet.
Generating embeddings for all ~4,800 documents is much slower than the initial load, so the cell below reports progress as it runs.
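
The `batch_size` and `progress_callback` arguments suggest a loop like the one sketched here, which works through items in fixed-size batches and reports progress after each batch. `process_in_batches` and `fake_embed` are hypothetical stand-ins; the real function calls Ollama and writes to the database, neither of which is reproduced here.

```python
from typing import Callable, Optional, Sequence


def process_in_batches(
    items: Sequence[str],
    embed: Callable[[str], list[float]],
    batch_size: int = 50,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> list[list[float]]:
    """Embed items batch by batch, reporting (processed, total) after each batch."""
    total = len(items)
    vectors: list[list[float]] = []
    for start in range(0, total, batch_size):
        batch = items[start:start + batch_size]
        vectors.extend(embed(text) for text in batch)
        if progress_callback is not None:
            progress_callback(len(vectors), total)
    return vectors


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call; returns a one-element 'vector'."""
    return [float(len(text))]


calls = []
vectors = process_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"],
    fake_embed,
    batch_size=2,
    progress_callback=lambda done, total: calls.append((done, total)),
)
print(calls)
```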

In [10]:
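def progress(processed, total):
    # Guard against an empty run (total == 0) to avoid ZeroDivisionError
    percent = 100 * processed // total if total else 100
    print(f"\rProcessed {processed}/{total} documents ({percent}%)", end="")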

add_embeddings_to_existing(batch_size=50, progress_callback=progress)
print("\nEmbeddings complete!")


Embeddings complete!


## Full-Text Search (FTS)

Traditional keyword-based search: fast and precise when you know the terms you are looking for.

In [11]:
def display_fts_results(results: list):
    """Display FTS results."""
    if not results:
        print("No results found.")
        return
    
    print(f"Found {len(results)} result(s):\n")
    for i, r in enumerate(results, 1):
        print(f"{i}. [{r['doc_type']}] {r['title']}")
        print(f"   Rank: {r['rank']:.4f}")
        print(f"   {r['headline']}")
        print()

In [12]:
# Example: FTS for "Feynman"
results = search("Feynman", limit=5)
display_fts_results(results)

Found 5 result(s):

1. [page] book%2FRichard Feynman’s Mental Models
   Rank: 0.7552
   >>>Feynman<<<’s Mental Models
author:: [[Peter Hollins]]
full-title:: "Richard >>>Feynman<<<’s Mental Models"
category:: #books
![](https://m.media-amazon.com/images/I/61xB09TwRiL._SY160.jpg)

- what

2. [page] person%2FRichard Feynman
   Rank: 0.6687
   alias:: >>>Feynman<<<

-

3. [page] The Feynman Technique for Learning
   Rank: 0.6079
   - https://www.colorado.edu/artssciences-advising/resource-library/life-skills/the-feynman-technique-in-academic-coaching
- https://fs.blog/feynman-technique/
- Can I do this using #GPT as my audience? Is there a good prompt

4. [page] 12 Favorite Problems
   Rank: 0.3881
   >>>Feynman<<<]]
- Richard Phillips >>>Feynman<<< was one of the most important scientists of the 20th century.  Born on the outskirts of New York

5. [page] How to Generate Your Own Favorite Problems
   Rank: 0.3040
   >>>Feynman<<<
- In this step-by-step guide, I’ll share the exact process 

In [None]:
# Example: FTS only in journals
results = search("Roam", limit=5, doc_type='journal')
display_fts_results(results)
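
The `>>>` and `<<<` markers in the headlines above delimit the matched terms. They look like custom start and stop delimiters of the kind PostgreSQL's `ts_headline` accepts through its `StartSel` and `StopSel` options, although the query itself is not shown here. A minimal sketch for turning them into Markdown bold; `markers_to_bold` is a hypothetical helper:

```python
import re

# Non-greedy match so several highlights on one line are handled separately
MARKER = re.compile(r">>>(.+?)<<<")


def markers_to_bold(headline: str) -> str:
    """Replace >>>term<<< highlight markers with markdown bold (**term**)."""
    return MARKER.sub(r"**\1**", headline)


sample = 'full-title:: "Richard >>>Feynman<<<’s Mental Models"'
converted = markers_to_bold(sample)
print(converted)
```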

## Semantic Search

Uses vector embeddings to find semantically similar documents.
Can find relevant results even without exact keyword matches.
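
Similarity between embeddings is commonly measured as the cosine of the angle between the two vectors. Whether `semantic_search` uses cosine similarity is an assumption; the pure-Python sketch below only illustrates the measure on toy three-dimensional vectors rather than real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]      # same direction as the query
unrelated = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close))
print(cosine_similarity(query, unrelated))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0, which is why documents can rank highly without sharing any keywords with the query.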

In [13]:
def display_semantic_results(results: list):
    """Display semantic search results."""
    if not results:
        print("No results found.")
        return
    
    print(f"Found {len(results)} result(s):\n")
    for i, r in enumerate(results, 1):
        print(f"{i}. [{r['doc_type']}] {r['title']}")
        print(f"   Similarity: {r['similarity']:.4f}")
        print(f"   {r['snippet'][:100]}...")
        print()

In [14]:
# Example: Semantic search for concepts
results = semantic_search("learning techniques for better memory", limit=5)
display_semantic_results(results)

Found 5 result(s):

1. [page] Cognitive Load
   Similarity: 0.7393
   ---
title: Cognitive Load
---

- [Cognitive Load Theory - Learning Skills From MindTools.com](https:...

2. [page] papers-Computational principles of memory
   Similarity: 0.7247
   ---
title: papers/Computational principles of memory
---

- https://www.nature.com/articles/nn.4237
...

3. [page] memory
   Similarity: 0.7112
   ---
title: memory
---

- Retrieving a [[memory]] is like placing a piece in a jigsaw puzzle. The mor...

4. [page] Spaced Repetition
   Similarity: 0.6502
   ---
title: Spaced Repetition
---

- Based on research by [[person/Ebbinghaus]] and subsequently conf...

5. [page] How to Teach Yourself Anything
   Similarity: 0.6451
   - Spectrum of learning styles
	- Structured <-> Opportunistic
	- Studying <-> Doing
	- Maximise valu...



In [15]:
# Example: Semantic search for related ideas
results = semantic_search("productivity and time management", limit=5)
display_semantic_results(results)

Found 5 result(s):

1. [page] Time Management
   Similarity: 0.6529
   ---
title: Time Management
---

- https://blog.nateliason.com/p/addicted-to-speed [[person/Nat Elias...

2. [page] productivity
   Similarity: 0.6337
   ---
title: productivity
---

- If I spend 30% of my time improving my skills, and as a result I beco...

3. [page] project%2FPimoroni booklet%2FTask05
   Similarity: 0.6186
   ### Task 5: Estimating Completion Time for Remaining Tasks
- **Objective**: Estimate how much time e...

4. [page] Work
   Similarity: 0.6000
   ---
title: Work
---

- Work on a book or video for two hours
	 - Pomodoros
		 - {{POMO  25}}

- Afte...

5. [page] Weekly priorities
   Similarity: 0.5842
   - Work **using** #Ultraworking
	- 1 hour/day on [[project/s2ag-corpus]]
		- 30 mins on README
		- fi...



## Hybrid Search

Combines the full-text rank and the semantic similarity into a single score.
Adjust `fts_weight` and `semantic_weight` to favor keywords or meaning.
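
The scores printed in this section are consistent, to within rounding, with a plain weighted sum of the two component scores. The sketch below reproduces that arithmetic; `combined_score` is a hypothetical function, the 0.5 defaults are inferred from the equal-weights example, and how the FTS rank is normalized before combining is not shown in this notebook.

```python
def combined_score(fts_rank: float, similarity: float,
                   fts_weight: float = 0.5, semantic_weight: float = 0.5) -> float:
    """Weighted sum of a full-text rank and a semantic similarity."""
    return fts_weight * fts_rank + semantic_weight * similarity


# Component scores copied from the top result of two outputs in this section
equal = combined_score(0.9350, 0.6804)
semantic_heavy = combined_score(0.9998, 0.5946, fts_weight=0.3, semantic_weight=0.7)
print(f"{equal:.4f} {semantic_heavy:.4f}")
```

Because the printed component scores are already rounded to four places, the recomputed values can differ from the displayed combined score in the last digit.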

In [16]:
def display_hybrid_results(results: list):
    """Display hybrid search results."""
    if not results:
        print("No results found.")
        return
    
    print(f"Found {len(results)} result(s):\n")
    for i, r in enumerate(results, 1):
        print(f"{i}. [{r['doc_type']}] {r['title']}")
        print(f"   Combined: {r['combined_score']:.4f} (FTS: {r['fts_rank']:.4f}, Semantic: {r['similarity']:.4f})")
        print(f"   {r['headline']}")
        print()

In [17]:
# Example: Hybrid search (equal weights)
results = hybrid_search("Python programming", limit=5)
display_hybrid_results(results)

Found 5 result(s):

1. [page] Python Developers%3A don't confuse OO Programming with OO design!
   Combined: 0.8077 (FTS: 0.9350, Semantic: 0.6804)
   ---
title: >>>Python<<< Developers: don't confuse OO >>>Programming<<< with OO design!
---

- #planned #aa

2. [page] book%2FPractical Python Artificial Intelligence Programming
   Combined: 0.7834 (FTS: 0.9736, Semantic: 0.5931)
   - book
	- version 1.1.2
	- Title:
	- Authors:
	- ISBN:
	- scores: 1 is bad, 5 is excellent
		- overall:
		- readable:
		- breadth:
		- depth:
		- credible:
		- current:
	- tags

3. [page] hls__Online-Python-Tutor-web-based-program-visualization_SIGCSE-2013_1740912753104_0
   Combined: 0.7711 (FTS: 0.9524, Semantic: 0.5898)
   file:: [Online-Python-Tutor-web-based-program-visualization_SIGCSE-2013_1740912753104_0.pdf](../assets/Online-Python-Tutor-web-based-program-visualization_SIGCSE-2013_1740912753104_0.pdf)
file-path:: ../assets/Online-Python-Tutor-web-based-program-visualization_SIGCSE-2013_1740912753104_

In [18]:
# Example: Hybrid search favoring semantic similarity
results = hybrid_search("learning from mistakes", limit=5, fts_weight=0.3, semantic_weight=0.7)
display_hybrid_results(results)

Found 5 result(s):

1. [page] Learning from Mistakes
   Combined: 0.7161 (FTS: 0.9998, Semantic: 0.5946)
   >>>learning<<< #brain #[[Reinforcement >>>Learning<<<]]
- Via [[ChatGPT]]
	- What evidence is there that the brain >>>learns<<< best by making >>>mistakes<<<?
		- There is evidence

2. [page] four stages of competence
   Combined: 0.5128 (FTS: 0.3412, Semantic: 0.5864)
   >>>learn<<<.[[1]](https://en.wikipedia.org/wiki/Four_stages_of_competence#cite_note-In_the_Mush-1)
		- **Conscious incompetence**
			- Though the individual does not understand or know how to do something, they recognize the deficit, as well as the value of a new skill in addressing the deficit. The making of >>>mistakes<<<

3. [page] book%2FThe Not to Do List
   Combined: 0.4782 (FTS: 0.7335, Semantic: 0.3689)
   >>>Learning<<< from your own >>>mistakes<<< is all well and good, but >>>learning<<< from other people’s >>>mistakes<<< is golden. ([Location

4. [page] downloads%2FStartup Company Culture
   Combined:

In [None]:
# Example: Hybrid search favoring keyword matching
results = hybrid_search("Raspberry Pi", limit=5, fts_weight=0.7, semantic_weight=0.3)
display_hybrid_results(results)

## Advanced FTS Search

For more control, use `advanced_search`, which supports:
- `"quoted phrases"`
- `OR` for alternatives
- `-` for exclusion
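
These three operators match the syntax accepted by PostgreSQL's `websearch_to_tsquery`, though whether `advanced_search` relies on it is an assumption. The sketch below composes query strings in that syntax; `build_query` is a hypothetical helper and is not part of `logseq_searcher`.

```python
from typing import Iterable, Optional


def build_query(phrase: Optional[str] = None,
                any_of: Iterable[str] = (),
                exclude: Iterable[str] = ()) -> str:
    """Compose a query string from a quoted phrase, OR alternatives, and exclusions."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')
    alternatives = list(any_of)
    if alternatives:
        # How OR groups with neighbouring terms depends on the query parser
        parts.append(" OR ".join(alternatives))
    parts.extend(f"-{term}" for term in exclude)
    return " ".join(parts)


query = build_query(phrase="favorite problems", exclude=["Roam"])
print(query)
```

The resulting string could then be passed as the first argument to `advanced_search`, in the same way as the quoted-phrase example in the next cell.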

In [None]:
# Example: Search for exact phrase
results = advanced_search('"favorite problems"', limit=5)
display_fts_results(results)

In [None]:
# Get a specific document by ID
# doc = get_document(1)
# print(doc['content'][:500])