Skip to content

shramanb113/ZENITH

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

65 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ZENITH - A Next Genration Search and Recommendation Engine

From lexical matching to true understanding


1. The Taxonomy of Research

  • Search is ofen treated as a single problem. It is not.
  • It is a spectrum of intent, and each layer solves a fundamentally different question

1.1 The Lexical Search - "Does the exact text match?"

  • Defintion - Lexical Searching is about matching a sequence of characters excatly or a slight variant maybe in cases as given by the user.

"Where does this word appear in the given document?"

Example :-

  • Query : Apple
  • Matches :
    • I ate an apple yesterday.
    • Apple watches are great.

How it works :-

  • Tokenization
  • BM25 / TF-ID scoring
  • Inverted Indexes

Strengths :-

  • Extremely Fast
  • Precise for exact terminology
  • Deterministic and explainable

Limitations :-

  • No understanding of meaning
  • Cannot generalize beyond seen words
  • Fails on paraphrase and intent

Mental Model

Lexical Search hears sounds , not ideas.


1.2 Semantic Search - "What does this mean"

  • Definiton - Semantic search is about meaning matching. It represents text as points in a high-dimensional space and answers:

"What concepts are similar to this idea?"

Examples :

  • Query : "Red Crunchy Fruit" -> apple

  • Query : "IPhone Maker" -> Apple Inc

  • Query : "language for fasr backend services" -> Go (😊)

How it works :

  • Natural Embeddings

  • Vector similarity (cosine/dot product)

  • Approximate Mearest Neighbor (ANN) indexes


1.3 Personlized Recommendation - "Who is asking?"

  • Definition - Personalized recommendation incorported context and identity. It answers:

"What is the most relevant result result for this ser right now?"

Examples

  • A backend engineer searching "Apple" -> Apple Silicon performance benchmarks

  • A nutritionist searching "Apple" -> Nutritional breakdown

  • A student searching "Apple" -> Company overview

Signals Used :

  • Past searches
  • Click Behaviors
  • Dwell Time
  • User preferences
  • Similar users (collaborative signals)

Key Shifts

Search is no lnger just document -> query

It becomes:

(user,context,query) -> ranked results

Mental Model

Recommendation is search with memory


2. The Critique of the Giants (ElasticSearch in the AI Era)

  • Elastic is an engineering masterpiece of the lexical era.

  • However . modern AI-driven Search exposes Strucutral cracks.

Below are three concrete problem areas.


2.1 Vector-Lexical Hybridity is Bolted On, Not Native

Problem :

  • ElasticSearch was designed around Inverted indexes

  • Vector earch was added later as a parallel systems

Consequences :

  • Two scoring pipelines that don't naturally compose

  • Awkward hybrid scoring logic

  • Limited control over fusion startegies

Real Impact :

  • Engineers must choose between :

    • Keyword precision or
    • Semantic recall
  • True Hybrid Relevance is dificult to tune and explain

Zenith Insight

Hybrid search should be foundational , not an afterthought.


2.2 Cost of Scale is Disproportionate

Problem :

  • ElasticSearch is memory-hungry
  • Scaling requires :
    • More nodes
    • More Replicas
    • More operational complexity

Why this Hurts

  • Vector indexes multiply memory usage.

  • ANN structures shard cleanly,

  • Query fan-out grows aggressively with data size.

Real Imapct

  • Small teams cannot afford large-scale semantic search

  • Infra cost grows faster than data growth

Zenith Insight

Search engines should scale with data, not against budgets.


2.3 Ease of Use Breaks Down at Advance Use Cases

Problems :

  • ElasticSearch is powerful, but :
    • Configuration heavy
    • Steep learning curve
    • Many "magic numbers"

Examples :

  • Shard counts chosen upfront
  • Reindexing required for schema validation
  • Manual tuning for performance

In the AI Era

  • Team want:
    • Plug-and-play embeddings
    • Automatic relevance tuning
    • Opinionated defaults

Zenith Insight

Advanced systems should feel simple, not fragile


3. The Zenith North Bar

  • Zenith is not "ElasticSearch + AI"
  • It is a re-imagining of search as an intelligent system.

These principles guide every design decision.


Principle 1 — Hybrid-First, Not Vector-Optional

Statement

Lexical and semantic search are peers, not competitors

Implications

  • Single query pipeline
  • Unified scoring model
  • First-class hybrid ranking

Outcome

  • Exact matches stay exact
  • Meaning improves recall
  • no tradeoff required

Principle 2 — Distributed State Without Ceremony

Statement

Distribution should be invisible to the user.

Implications

  • Automatic sharding
  • Replica management without manual testing
  • Stateless quey coordination

Outcome

  • Zero-config cluster bootstrap
  • No "how many shards?" questions
  • Systems scale naturally with load

Principle 3 - Sub-10ms Intelligence

Statement

AI relevance must be fast enough to feel instant

Implications

  • ANN-optimized vector indexes
  • Pre-computed embeddings
  • Cache-aware execution

Outcome

  • Semantic + Lexical scoring under 10ms
  • AI without latency guilt
  • Suitable for real-time products

Closing Thought

  • ElasticSearch solved searching words at scale.

  • Zenith aims to solve finding meaning, intent, and relevance at scale.

  • This project is not about replacing a tool.

  • It is about advancing the dsicipline of search itself.


🏔️ The ZENITH 50-Step Ascent:: The Definitive Roadmap

🧱 Phase 1: The Core Foundation (The Skeleton)

  • 01-03: Inverted Index architecture & Map-based postings lists.
  • 04-05: Standard Tokenization, Stop-word filtering, and Lowercasing.
  • 06-07: gRPC Service definition and Protobuf contract design.
  • 08-09: Thread-safety implementation using sync.RWMutex.
  • 10: Concurrent Indexing via Worker Pools & Load Generation.
  • 11: The Vault – Binary persistence with encoding/gob & Graceful Shutdown. ✅

🧠 Phase 2: Neural Intelligence (The Brain)

  • 12: The Great Split- Distributed Coordinator & Modulo Sharding.
  • 13: Neural Storage - Vector map integration & High-dimensional schema.
  • 14: Linear Algebra - Implementing Dot Product and Magnitude in pure Go.
  • 15: The Similarity Engine - Implementing Cosine Similarity & L2 Distance.
  • 16: Deterministic Embeddings - Creating a "Concept-to-Vector" hash transformer.
  • 17: Hybrid Ranking - Score normalization and Weighted Fusion (Keyword + Vector).
  • 18: Reciprocal Rank Fusion (RRF) - Advanced rank-merging for multi-modal search.

🧪 Phase 3: Linguistic Mastery (The Linguist)

  • 19 : Porter Stemming – Reducing words to their roots (e.g., "jumping" -> "jump").
  • 20: Edge N-Grams – Breaking words into sub-tokens for "Search-as-you-type.
  • 21: Phonetic Matching - Soundex/Metaphone algorithms for "Sounds like" search.
  • 22: Fuzzy Matching - Levenshtein Distance for typo-tolerant queries.
  • 23: Synonyms & Thesaurus - Expanding query intent via mapping layers.
  • [] 24: Finite State Transducers (FST) - Ultra-fast dictionary storage and prefix search.( Future enhancement)

💾 Phase 4: The Storage Revolution (The LSM-Tree)

  • [] 25: Write-Ahead Log (WAL) – Atomic append-only logging for crash recovery.
  • [] 26: MemTables – Designing in-memory sorted buffers (Skip-Lists/B-Trees).
  • [] 27: SSTables – Immutable, disk-backed sorted string tables.
  • [] 28: The Compactor – Background Merging (Leveled Compaction) to prevent bloat.
  • [] 29: Bloom Filters – Probabilistic data structures for $O(1)$ disk-lookup bypass.
  • [] 24: Sparse Indexing – Memory-efficient offset mapping for massive SSTables.

About

A distributed semantic search and recommendation engine

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors