# Introduction to the Search Problem

Search engines are ubiquitous. Google, Bing, Yahoo, you name it, all of these are internet-wide search engines. Even your favorite social media platforms have a search functionality. How do they work? How can we build our own? And how can we build a search engine which is cheap and easy to run?
<br />

This article takes a first-principles approach. Rather than jumping straight into established methods, I'm starting with something familiar - a textbook - and building a search engine from the natural patterns I use to find a topic in it. The goal is to iterate toward a strong, yet simple, search engine.

## Background 🌅

I wanted to add search functionality to this website, but there’s a catch - this site is entirely backend-free by design and I intend to keep it that way. Running a backend adds costs, requires maintenance, and introduces scaling challenges, which I want to avoid.

Additionally, instead of researching existing search engine architectures, I’m taking a first-principles approach. Rather than jumping straight into established methods, I’m starting with something familiar - a textbook. How do I naturally find a topic in a textbook? What patterns do I rely on? By deconstructing my own search process, I aim to build a search engine from scratch, refining it step by step.

This article is a log of that journey, an attempt to understand search from the ground up, experiment with solutions, and iteratively improve them along the way.

## Goals ⛳️

Search engines are everywhere: almost every major social media platform has one, and then there are behemoths such as Google and Bing. How do they work, how can we build our own, and how can we keep it cheap and simple to run?

Some goals for this project:

* Learn how search engines work
* Learn some NLP (natural language processing) techniques
* Achieve decent accuracy

Additional challenges and constraints, since I can't keep things simple 😃:

* Search should have no backend and run 100% in the browser
* Search should support Unicode. Languages other than English exist, oh, and don't forget about emoji!

## Tools Used 🛠️

* Node.js (v22.13.0)
* [Enron Emails Dataset](https://www.cs.cmu.edu/~enron/) - used for large-scale test data
* [Æsop's Fables](https://www.gutenberg.org/ebooks/21) - used for test data in our demos
* Web Browser with JavaScript support

> **Notes**:
> 
> 1. For Æsop's Fables, the text version from the Project Gutenberg archive was transformed into JSON for ease of use. Processing the Project Gutenberg text is outside the scope of this project, and the result may contain some errors.
> 2. The majority of the code will be written in JavaScript, since the goal is to run search 100% in the browser.

## Search, The Problem 👀

Once upon a time, the internet was in its infancy, Google was still under development, and CLRS, the legendary *Introduction to Algorithms* textbook, had just been published. But none of that mattered: you had a biology assignment due which required you to understand *prokaryotic organisms*. While the internet offered little help, your best resource sat in front of you - **a biology textbook**!

Now, take a moment and think... How do you find a topic in a textbook?

You *could* read the entire textbook, every single page, to find out more about *prokaryotic organisms*, but is that your best approach? Do you have the time to read through every single page for your assignment? Probably not. We need to do better and use the resources available to us to find a faster process.

What resources does the textbook provide you to efficiently find what you are looking for?

## Search, The Textbook Method 🗂️

Textbooks normally put two tools at your disposal to help you find information quickly:

1. **Table of Contents (TOC)** at the beginning
2. **Index** at the end

These tools work differently, but together they make searching for topics much faster than flipping through pages.

> The **Table of Contents** is like a roadmap, listing chapters and sections in order, allowing you to navigate through a book top-down.
> 
> For example, if you need to learn about bacteria, the TOC might tell you:
>
> 📖 Chapter 3: Bacteria and Prokaryotes (Page 45)
>
> This tells you where to start reading about bacteria. But if you need something very specific, like "Gram-positive bacteria", you might not find it in the TOC.
>
> <br />
> 
> The **Index** maps topics and terms to their location. Instead of listing topics by chapters, it lists every single important word or term in **alphabetical order**, along with the exact page numbers.
>
> For example, if you need *Gram-positive bacteria*, you can check the index:
>
> 🔍 Gram-positive bacteria – Pages 26, 28, 47
>
> Now, you can go straight to the exact pages without reading everything before it.

Can we model a simple search engine off the same process used to find a topic in a book?
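
As a rough sketch of that idea, the book's index can be modeled as a map from each term to the set of pages it appears on. The `pages` data, the `tokenize` helper, and the function names below are all hypothetical and only serve as illustration. The tokenizer is deliberately naive: it keeps letters and digits from any script but drops punctuation and emoji, which a real implementation would need to revisit.

```javascript
// Hypothetical sample data: each entry stands in for one page of a textbook.
const pages = [
  { page: 26, text: "Gram-positive bacteria retain the crystal violet stain." },
  { page: 45, text: "Bacteria and prokaryotes lack a membrane-bound nucleus." },
  { page: 47, text: "Gram-positive bacteria have thick cell walls." },
];

// Split text into lowercase terms. \p{L} and \p{N} match letters and digits
// in any script, so this is not limited to English.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

// Build the "index at the back of the book": term -> set of page numbers.
function buildIndex(pages) {
  const index = new Map();
  for (const { page, text } of pages) {
    for (const term of tokenize(text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(page);
    }
  }
  return index;
}

// Look up a single term and return its pages in ascending order.
function lookup(index, term) {
  const found = index.get(term.toLowerCase());
  return found ? [...found].sort((a, b) => a - b) : [];
}

const index = buildIndex(pages);
console.log(lookup(index, "bacteria")); // [ 26, 45, 47 ]
console.log(lookup(index, "nucleus"));  // [ 45 ]
```

Just like the printed index, the lookup goes straight to the matching pages without reading any of the text again. The cost is paid once, when the index is built.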

## Attempting to Create a High-Level Overview 🍳

Before diving into implementation, let’s break down the core components of a search system by asking two key questions:

* What components are necessary for search?
* What role does each component play?

### Data Acquisition 🏗️
Before we can search anything, we need to gather and prepare the data. This involves two key steps:

1. 🧺 Gather the data
    1. On the internet: render pages, find and follow links, and download each rendered page
    2. Local: recursively find files in a directory, read them
2. 🧼 Clean up the data - prepare the data so it can be searched
    1. Clean up Unicode code points (for example, normalize equivalent character sequences)
    2. Identify meaningful sections and metadata
    3. Standardize formatting for consistent searchability
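
The local half of these steps can be sketched in Node.js as follows. This is a minimal sketch under assumptions, not the implementation used later in the series: the function names (`findFiles`, `cleanText`, `loadDocuments`) are hypothetical, the cleanup rules are just one plausible choice, and crawling the internet is not covered here.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Gather: recursively collect the paths of all regular files under `dir`.
function findFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findFiles(fullPath));
    } else if (entry.isFile()) {
      found.push(fullPath);
    }
  }
  return found;
}

// Clean: make equivalent strings compare equal (assumed rules, adjust as needed).
function cleanText(raw) {
  return raw
    .normalize("NFC")        // unify composed/decomposed forms, e.g. "e" + U+0301 -> "é"
    .replace(/\r\n?/g, "\n") // standardize line endings
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .trim();
}

// Read every file under `dir` and return { path, text } records.
function loadDocuments(dir) {
  return findFiles(dir).map((file) => ({
    path: file,
    text: cleanText(fs.readFileSync(file, "utf8")),
  }));
}
```

Calling `loadDocuments("./data")` (with `./data` standing in for wherever the raw files live) returns one cleaned record per file. Normalizing to NFC matters for Unicode support: without it, two visually identical strings can fail to match because they are encoded with different code point sequences.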

Once we acquire and clean up the data, we face a design decision: do we preprocess heavily upfront to enable efficient search later, or do we store the raw data and brute-force the search at query time?
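
To make the brute-force side of that trade-off concrete, here is a minimal sketch that does no preprocessing at all and scans every document on every query. The `documents` array is a hypothetical stand-in for cleaned data (the texts are short paraphrases, not the actual fables), and `bruteForceSearch` is an illustrative name.

```javascript
// Hypothetical stand-in for cleaned documents.
const documents = [
  { title: "The Fox and the Grapes", text: "A hungry fox saw some fine grapes hanging from a vine." },
  { title: "The Hare and the Tortoise", text: "A hare one day made fun of a tortoise for being slow." },
  { title: "The Fox and the Crow", text: "A crow sat in a tree holding a piece of cheese." },
];

// Brute force: no index is built; every query scans every document.
// The cost of each query grows with the total amount of text.
function bruteForceSearch(documents, query) {
  const needle = query.toLowerCase();
  return documents
    .filter(({ title, text }) => `${title}\n${text}`.toLowerCase().includes(needle))
    .map(({ title }) => title);
}

console.log(bruteForceSearch(documents, "fox"));
// [ 'The Fox and the Grapes', 'The Fox and the Crow' ]
```

This is trivial to write and needs no storage beyond the raw data, but it repeats the full scan for each query and matches plain substrings, so a query like "fox" would also match "foxglove". Preprocessing moves that work to build time in exchange for extra data to ship to the browser.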

## [Next: We Will Acquire Data and Clean It Up ⏭️](/blog/search-engine-2)

In the next section, we will acquire data, clean it up and prepare it for search.