author = 'Devyash Lodha'
title = 'Building a Search Engine - Part 1'
lastModified = '2025-02-07T20:56:17.277Z'
published = '2025-02-07T20:56:17.277Z'

Search engines are complex beasts. This article is about my process of attempting to build one!

## Multipart Series

1. [Introduction and getting started](/blog/search-engine-1)
2. [Reverse Index and Search](/blog/search-engine-2)

## Background

I wanted to add search functionality to this website. However, this site is backend-free, and that is by design! A backend is another service which costs money to run and requires (occasional) maintenance. This article serves as a log of the things I tried and how I developed my solution!

## Goals

Search engines are ubiquitous - almost every major social media platform uses one, and, as the elephant in the room, you have behemoths such as Google, Bing and others. But how do they work? And how can we build our own? And how can we keep it cheap and simple to use?

Some of the goals I have while building this search engine:

* Learn about search engines, NLP and non-ML inference
* Achieve decent accuracy

Additional challenges since I can't keep things simple 😃:

* Fully client side, but without costing the client too much
* Strong Unicode support - let's support different scripts and emojis!

## Tools Used

* Node.js (v22.13.0)
* [Enron Emails Dataset](https://www.cs.cmu.edu/~enron/) - used for large scale test data

We are using JavaScript since it is easily portable to run on web browsers. I may port the indexer to C++ or Rust in the future if it needs to handle larger amounts of data quickly and more reliably.

## My Initial Thought Process

I started by breaking the search problem down into multiple steps.

**Indexing:**

1. Data acquisition
2. Data cleanup and preprocessing
3. Determine which words are important and remove filler words
4. Index the words
5. Store the index

**Searching:**

1. Clean up and preprocess the query (with the same rules as the data)
2. Load the index
3. Strip out unimportant words from the query
4. Reverse index lookup and ranking
5. Emit results
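
The two lists above can be sketched as a pair of functions that share one preprocessing step, so the query is treated with the same rules as the data. This is only a minimal illustration: `STOP_WORDS`, `preprocess`, `buildIndex` and `search` are hypothetical names, the filler-word list is made up, ranking is a simple count of matching query words, and storing and loading the index are skipped.

```javascript
// Hypothetical filler-word list; a real one would be much longer.
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'is']);

// Shared by indexing and searching so both sides apply the same rules.
function preprocess(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(word => word.length > 0 && !STOP_WORDS.has(word));
}

// Indexing: map each remaining word to the set of document ids containing it.
function buildIndex(documents) {
  const index = new Map();
  for (const [id, text] of Object.entries(documents)) {
    for (const word of preprocess(text)) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(id);
    }
  }
  return index;
}

// Searching: rank documents by how many query words they contain.
function search(index, query) {
  const scores = new Map();
  for (const word of preprocess(query)) {
    for (const id of index.get(word) ?? []) {
      scores.set(id, (scores.get(id) ?? 0) + 1);
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const index = buildIndex({
  doc1: 'The quarterly report is attached',
  doc2: 'Lunch and a meeting',
  doc3: 'Meeting to review the quarterly report',
});
console.log(search(index, 'the quarterly meeting'));
// → [ 'doc3', 'doc1', 'doc2' ]
```

`doc3` ranks first because it matches both remaining query words, while "the" is dropped from the query before the lookup.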

## Let's Prototype!

### Data Acquisition

Data acquisition is arguably the hardest part of building a search engine. For this website, it is a bit simpler. Since I own the content and the transformation from Markdown and HTML to the rendered content, I can mostly skip this step and just focus on cleaning up the data.

However, as I experiment with search engines, instead of using my own content, I am using the [Enron Emails Dataset](https://www.cs.cmu.edu/~enron/).

For the *data acquisition* step, I am recursively walking through the maildir and reading all the files into the index.

```javascript
import fs from 'node:fs/promises';

// Recursively collect the path of every file under dirPath.
async function getAllFiles(dirPath, arrayOfFiles = []) {
  const files = await fs.readdir(dirPath);

  for (const file of files) {
    const filePath = `${dirPath}/${file}`;
    const stat = await fs.stat(filePath);

    if (stat.isDirectory()) {
      await getAllFiles(filePath, arrayOfFiles);
    } else {
      arrayOfFiles.push(filePath);
    }
  }

  return arrayOfFiles;
}

(async () => {
  const filesList = await getAllFiles('enron-emails');
})();
```

Then I read each file, using `p-limit` to cap the number of reads in flight at once.

```javascript
import limitFunction from 'p-limit';

const limit = limitFunction(100);
async function limitedReadFile(path) {
  return limit(() => fs.readFile(path, 'utf8'));
}
...
(async () => {

  ...
    
  // Promise.all waits for every read; forEach would not await the async callbacks.
  await Promise.all(filesList.map(async path => {
    try {
      const contents = await limitedReadFile(path);
      ...
    } catch (e) {
      console.error('Failed to read file', path, e);
    }
  }));
})();
```

### Data Cleanup and Preprocessing

In this article, we will keep data cleanup and preprocessing simple. We will primarily focus on UTF-8 transformations. Below is a playground with the code I am using to clean up text! Try editing the `input` variable and see how it gets cleaned up!

Additionally, try changing, removing or adding regexes to see how they affect the processing.

Note: you need web workers enabled for the playground to work.
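
As a rough idea of what such a cleanup step could look like, here is a minimal sketch. The function name `cleanText` and the specific rules (NFKC normalization, lowercasing, and one regex that keeps only letters, marks, numbers and emoji) are assumptions for illustration and not necessarily the rules used in the playground.

```javascript
// Hypothetical cleanup rules; the regex is an illustrative assumption.
function cleanText(input) {
  return input
    // Fold compatibility forms (full-width letters, ligatures) into canonical ones.
    .normalize('NFKC')
    .toLowerCase()
    // Replace every run of characters that are not letters, combining marks,
    // numbers or pictographic emoji with a single space.
    .replace(/[^\p{L}\p{M}\p{N}\p{Extended_Pictographic}]+/gu, ' ')
    .trim();
}

const input = 'Ｈｅｌｌｏ,   Wörld!! 🎉 re: Q3-report';
console.log(cleanText(input));
// → hello wörld 🎉 re q3 report
```

The `u` flag is what makes the `\p{...}` Unicode property escapes available. One known limitation of this sketch: zero-width joiners are stripped, so emoji built from joined sequences would be split into their parts.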

### Next, we generate a word list

We split the transformed string on spaces, which leaves us with a list of words we can work with!
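
A minimal sketch of that split, assuming the input has already been cleaned. `toWordList` is a hypothetical name, and splitting on runs of whitespace rather than a single space is a defensive choice of this sketch.

```javascript
// Split an already-cleaned string into words.
function toWordList(cleaned) {
  // Splitting on runs of whitespace avoids empty entries if the cleanup
  // step left double spaces behind; the filter handles leading/trailing ones.
  return cleaned.split(/\s+/u).filter(word => word.length > 0);
}

console.log(toWordList('hello wörld 🎉 re q3 report'));
// → [ 'hello', 'wörld', '🎉', 're', 'q3', 'report' ]
```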

### Putting it all Together
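
One way the steps so far might be wired together is sketched below. `collectWords` is a hypothetical helper, and the reader and word-splitting functions are passed in as parameters: in the real script the reader would be something like `limitedReadFile` from earlier, while here an in-memory stub stands in for the maildir so the sketch runs without the dataset.

```javascript
// Hypothetical helper: reads every path, turns the contents into words, and
// returns a Map of path -> word list.
async function collectWords(paths, readFile, toWords) {
  const wordsByPath = new Map();
  await Promise.all(paths.map(async path => {
    try {
      wordsByPath.set(path, toWords(await readFile(path)));
    } catch (e) {
      // A single unreadable file should not abort the whole run.
      console.error('Failed to process file', path, e);
    }
  }));
  return wordsByPath;
}

// In-memory stand-in for the dataset (made-up paths and contents).
const fakeFiles = { 'inbox/1': 'Hello  World', 'inbox/2': 'Re: hello' };
const fakeReadFile = async path => {
  if (!(path in fakeFiles)) throw new Error(`No such file: ${path}`);
  return fakeFiles[path];
};
// Deliberately simple stand-in for the cleanup and word-list steps.
const simpleToWords = text => text.toLowerCase().split(/\s+/).filter(Boolean);

collectWords(['inbox/1', 'inbox/2', 'missing'], fakeReadFile, simpleToWords)
  .then(words => console.log(words));
```

The `missing` path is logged as a failure and skipped, so the resulting map only contains the two readable files.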

## [Next: We will implement a reverse index and search it!](/blog/search-engine-2)