author = 'Devyash Lodha'
title = 'Building a Search Engine - Part 1'
lastModified = '2025-02-07T20:56:17.277Z'
published = '2025-02-07T20:56:17.277Z'

Search engines are complex beasts. This article is about my process of attempting to build one!

## Multipart Series

1. [Introduction and get started](/blog/search-engine-1)
2. [Reverse Index and Search](/blog/search-engine-2)
3. [Workshop - Build a Search Engine](/blog/search-engine-3)
4. [Workshop - Build a Search Engine (solution)](/blog/search-engine-4)

## Background

I wanted to add search functionality to this website. However, this site is backend-free, and that is by design! A backend is another service which costs money to run and requires (occasional) maintenance. This article serves as a log of the things I tried and how I developed my solution!

## Goals

Search engines are ubiquitous - almost every major social media platform uses one, and, as the elephant in the room, you have behemoths such as Google, Bing and others. But how do they work? And how can we build our own? And how can we keep it cheap and simple to use?

Some goals for this project:

* Learn how search engines work
* Learn some NLP (natural language processing) techniques
* Achieve decent accuracy

Additional challenges and constraints, since I can't keep things simple 😃:

* Search should have no backend, and run 100% in the browser
* Search should support Unicode. Languages other than English exist, oh, and don't forget about emoji!

## Tools Used

* Node.js (v22.13.0)
* [Enron Emails Dataset](https://www.cs.cmu.edu/~enron/) - used for large scale test data
* [Æsop's Fables](https://www.gutenberg.org/ebooks/21) - used for test data in our demos

> **Notes**:
> 
> 1. For Æsop's Fables, the text version of the Project Gutenberg archive was transformed into JSON for ease of use. This processing is outside the scope of this article.
> 2. The majority of the code will be written in JavaScript, since the goal is to run search 100% in the browser.

## The Back-of-the-Envelope Search Process

When I started working on this article, I asked myself: how can I implement search, and what high-level steps does it require? The steps below are my initial idea. Because this article is not updated retroactively, the final form of the search engine may differ slightly from this process.

**Indexing:**

1. **Data acquisition**: we need to acquire data to index and search somehow
2. **Data cleanup and preprocessing**: real-world data is typically dirty. Establishing boundaries on the data now will help later on
3. **Index the words**: unless we want to exhaustively search through all of our content (the Enron email dataset is 1.7 GB, compressed), we need a better way to structure our content so that we can find things efficiently.

**Searching:**

1. **Clean up and preprocess the query**
2. **Look up the search query in the index**
3. **Rank the search results**: some results are better than others
4. **Return the results**
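
The steps above can be sketched as a toy, in-memory pipeline. This is only a rough illustration under simplifying assumptions, not the final design: the names `preprocess`, `tokenize`, `buildIndex` and `search` are hypothetical, documents are plain strings keyed by an id, and ranking simply counts how many query tokens a document contains.

```javascript
// Toy sketch of the indexing and searching steps; all names are hypothetical.
const preprocess = (text) => text.normalize('NFKC').toLowerCase();
const tokenize = (text) => preprocess(text).split(/\s+/u).filter(Boolean);

// Indexing: map each token to the set of document ids that contain it.
function buildIndex(documents) {
  const index = new Map();
  for (const [id, text] of Object.entries(documents)) {
    for (const token of tokenize(text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  }
  return index;
}

// Searching: preprocess the query, look up each token,
// then rank documents by the number of distinct query tokens they match.
function search(index, query) {
  const scores = new Map();
  for (const token of new Set(tokenize(query))) {
    for (const id of index.get(token) ?? []) {
      scores.set(id, (scores.get(id) ?? 0) + 1);
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const index = buildIndex({
  fox: 'The fox and the grapes',
  crow: 'The fox and the crow',
});
console.log(search(index, 'fox grapes')); // [ 'fox', 'crow' ]
```

Both documents mention "fox", but only the first mentions "grapes", so it ranks first.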

## Let's Prototype!

### Data Acquisition

Data acquisition is arguably the hardest part of building a search engine. For this website, it is a bit simpler. However, in this project, we are using the [Enron Emails Dataset](https://www.cs.cmu.edu/~enron/). To get all of the corpus text from the Enron email dataset, we can just recursively read the files.

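```javascript
// Run as an ES module (for example, a .mjs file)
import fs from 'node:fs/promises';
import path from 'node:path';

// Minimal concurrency limiter, a stand-in for a helper such as p-limit:
// at most `max` tasks run at once, the rest wait in a queue.
function limitFunction(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => { active--; next(); });
  };
  return (task) => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    next();
  });
}

// Limit simultaneous file reads to avoid exhausting system resource limits
const limit = limitFunction(100);
async function limitedReadFile(filePath) {
  return limit(() => fs.readFile(filePath, 'utf8'));
}

async function* readFilesRecursively(dir) {
  const entries = await fs.readdir(dir, { withFileTypes: true });

  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      yield* readFilesRecursively(fullPath); // Recurse into subdirectory
    } else if (entry.isFile()) {
      const contents = await limitedReadFile(fullPath);
      yield [fullPath, contents];
    }
  }
}

// Assumption: the dataset root is passed as the first command-line argument
const dirPath = process.argv[2];

(async () => {
  for await (const [filepath, contents] of readFilesRecursively(dirPath)) {
    // index the filepath and contents
  }
})();
```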

### Data Cleanup and Preprocessing

For now, we will keep the processing short: just Unicode normalization, plus a few regular expressions to clean up some egregious data quality issues.

Note that the files in the Enron email dataset are in an email format and contain headers. We will just ignore those for this project.
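
As a rough sketch of what this cleanup could look like (not necessarily the exact rules used on this site), the snippet below drops the email headers and then normalizes the body. The helper names `stripHeaders` and `cleanText` are hypothetical, and it assumes the header block ends at the first blank line, as is conventional for email messages.

```javascript
// Hypothetical helper: drop the header block of a raw email.
// Assumption: headers end at the first blank line.
function stripHeaders(rawEmail) {
  const match = /\r?\n\r?\n/.exec(rawEmail);
  // If there is no blank line, leave the text untouched.
  return match ? rawEmail.slice(match.index + match[0].length) : rawEmail;
}

// Hypothetical helper: normalize Unicode and tidy up whitespace.
function cleanText(text) {
  return text
    .normalize('NFKC')          // compose accents and fold compatibility forms
    .replace(/\p{Cc}+/gu, ' ')  // control characters (tabs, newlines, ...) become a space
    .replace(/\s+/gu, ' ')      // collapse runs of whitespace
    .trim();
}

const raw = 'Message-ID: <1>\nSubject: Hi\n\nCafe\u0301  is\topen\n';
console.log(cleanText(stripHeaders(raw))); // "Café is open"
```

NFKC is one reasonable choice of normalization form here: it composes sequences such as `e` + combining accent into a single code point, so visually identical strings compare equal later on.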

### Next, Let's Tokenize our Cleaned Data

> What is a **Token**? Tokens are chunks of text which we treat as a single unit - a word, for example. If a document is a Lego castle, each brick used to build that castle is a token.

In this article, tokens will be runs of characters separated by whitespace. In layman's terms, words.
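
A whitespace tokenizer along these lines might look like the sketch below. The function name `tokenize` is hypothetical, and it assumes the text has already been cleaned.

```javascript
// Hypothetical helper: split cleaned text into whitespace-separated tokens.
function tokenize(text) {
  // Split on any run of whitespace and drop empty strings
  // (which appear when the text starts or ends with whitespace).
  return text.split(/\s+/u).filter((token) => token.length > 0);
}

console.log(tokenize('The fox 🦊 and the grapes'));
// [ 'The', 'fox', '🦊', 'and', 'the', 'grapes' ]
```

Because the split happens only on whitespace, an emoji stays intact as its own token. The flip side is that punctuation stays attached to its word, so `grapes,` and `grapes` would count as different tokens unless the cleanup step removes punctuation first.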

### Putting it all Together

## [Next: We will implement a reverse index and search it!](/blog/search-engine-2)