# Project Report: Searchme

Wikrama W. Wardhana, Illinois Institute of Technology, wwardhana@hawk.illinoistech.edu


## Abstract

## Overview

The objective of this project is to create an information retrieval system akin to a search engine. To qualify as such, the implementation must reproduce the key characteristics of those systems. It does so with three core components: a Scrapy-based web crawler for document collection, a multithreaded indexer that builds tf-idf weighted inverted indices, and a query processor that ranks documents using cosine similarity. The system was evaluated against the Cranfield dataset to measure retrieval effectiveness using standard IR metrics.

### Document Corpus Collection

For an information retrieval system to function, it must first have access to the corpus containing said information. For a search engine, this means a collection of web pages whose content has been stored and preprocessed, then transformed into the data structures the system relies on. Searchme achieves this with [Scrapy's](https://www.scrapy.org/) web crawler, which collects and downloads web pages for indexing. The crawler accepts a seed URL along with preconfigured parameters for maximum crawl depth and page limits, and stores all documents locally in HTML format. This approach allows the system to build a custom corpus tailored to specific topics. The crawler generates both the raw HTML files and a mapping file (`url_map.jsonl`) that associates document identifiers with their original URLs. Implementation details, including crawling strategies and storage organization, are discussed in the [Architecture](#architecture) section.

### Corpus Processing and Inverted Index Creation

Once an information retrieval system has access to its corpus, the next step is to process the documents and their content into the data structures that form the core of the system. In this system, the ["Indexer"](../src/indexer/indexme.py) serves that exact purpose: it transforms raw HTML documents into searchable data structures. It parses HTML content using BeautifulSoup, extracts only the text, and tokenizes it while filtering stopwords and punctuation. The core data structure is an inverted index that maps terms to document identifiers and positional information, from which tf-idf weights are derived. To improve performance, the indexer employs multithreaded processing that splits tokenization into chunks based on the number of CPU cores. Key design decisions regarding tokenization strategies, threading implementation, and index persistence are detailed in the [Architecture](#architecture) section.

### Query Processing and Document Retrieval

The last key component is the user-facing interface that handles input queries and performs the necessary preprocessing before executing the information retrieval pipeline. It accepts natural language queries, applies the same preprocessing pipeline used during indexing (tokenization, stopword removal, case normalization), and generates a query vector using tf-idf weighting. The system then computes cosine similarity scores between the query vector and all candidate documents, ranking them by relevance. To manage computational load and improve user experience, Searchme implements top-K retrieval, returning only the most relevant documents. The system also accommodates user input variations through its bigram index, which supports wildcard matching and the correction of misspelled or partially known terms. The [Design](#design) section outlines the complete information retrieval pipeline from query input to ranked result presentation, while implementation specifics are provided in the [Architecture](#architecture) section.


## Design

This section describes the overarching design of Searchme's information retrieval system: its capabilities, its component interactions, and the key design decisions that shaped the implementation.

### System Capabilities

Searchme provides several core capabilities that enable effective information retrieval from a collection of web documents:

#### Index Construction

The system processes HTML documents to build an inverted index with tf-idf weighted term representations. Each term in the index maintains positional information, recording the exact locations where terms appear within documents to enable future extensions, such as phrase queries or proximity-based ranking.

#### Query Processing

Users can submit queries that are tokenized and processed using the same preprocessing pipeline applied during indexing. The system converts queries into tf-idf weighted vectors and computes cosine similarity scores against all indexed documents to determine relevance.

#### Ranked Retrieval

Rather than returning all matching documents, Searchme implements top-K retrieval to present only the most relevant results, as ranked by their cosine similarity scores.
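A full sort is not strictly required to obtain the top K results. As a minimal sketch, assuming scores are held in a dictionary keyed by document ID, Python's `heapq.nlargest` selects the K best entries directly. The helper name `top_k` is hypothetical and not part of Searchme.

```python
import heapq


def top_k(doc_scores: dict[str, float], k: int = 10) -> list[tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, doc_scores.items(), key=lambda item: item[1])


# Hypothetical scores for three documents.
scores = {"a": 0.1, "b": 0.9, "c": 0.5}
best_two = top_k(scores, k=2)
```

For small result sets the difference from a full sort is negligible; the benefit only shows when the candidate set is much larger than K.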

#### Wildcard Query Support

Through a bigram character index, the system supports wildcard queries that can match terms with partial or uncertain spellings. This accommodates user input variations and helps retrieve relevant documents even when exact term matches are not found.

#### Efficient Processing

The system uses multithreaded document processing to leverage multiple CPU cores during index construction, and caches document magnitude calculations to avoid redundant computations and speed up query evaluation.

### Component Interactions and Data Flow

The web crawler serves as the entry point to the system. Given the preconfigured parameters of a seed URL, maximum depth, and page limit, it traverses the web by following hyperlinks and downloading HTML content. The crawler outputs two artifacts: a directory of HTML files named by unique document identifiers, and a JSON Lines file that preserves the mapping between document IDs and original URLs. This separation allows the indexer to process documents efficiently while maintaining traceability to source locations for later components to use.

The indexer reads HTML files from the corpus directory and transforms them into searchable data structures. Each document passes through a processing pipeline that extracts text content using HTML parsing, converts text to lowercase for case-insensitive matching, removes punctuation through regular expression substitution, and filters common stopwords that provide little informational value. The resulting tokens are organized into an inverted index where each term maps to a posting list containing document IDs and the positions at which the term appears in each document. Simultaneously, a bigram index is constructed by generating character bigrams for each term, enabling wildcard matching capabilities. The complete index structure, along with corpus metadata, is serialized to JSON format for persistence and future loading.

When a user submits a query, the processor applies the same tokenization and preprocessing steps used during indexing to ensure consistent term representation. The processed query tokens are converted into a tf-idf weighted vector by computing term frequencies within the query and multiplying by inverse document frequency values from the index. The system then iterates through the inverted index, computing dot products between the query vector and document vectors for all candidate documents containing at least one query term. These raw dot product scores are normalized by the magnitudes of both the query vector and each document vector to produce cosine similarity scores bounded between zero and one. Finally, documents are sorted by score in descending order, and the top-K results are returned to the user.
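The scoring steps can be illustrated on a toy corpus. The following is a minimal sketch, not Searchme's implementation: the three documents, the helper names (`idf`, `tfidf_vector`, `cosine`, `rank`), and the choice to score every document rather than only candidates are all assumptions made for brevity.

```python
from collections import Counter
from math import log10, sqrt

# Hypothetical three-document corpus, already tokenized.
docs = {
    "d1": ["wing", "flow", "wing"],
    "d2": ["flow", "heat"],
    "d3": ["heat", "shock"],
}


def idf(term: str) -> float:
    """log10(N / df); 0.0 for terms that appear in no document."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return log10(len(docs) / df) if df else 0.0


def tfidf_vector(tokens: list[str]) -> dict[str, float]:
    """Raw term frequency multiplied by idf, one weight per distinct term."""
    return {term: tf * idf(term) for term, tf in Counter(tokens).items()}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0  # guard against zero-length vectors


def rank(query_tokens: list[str], k: int = 2) -> list[tuple[str, float]]:
    query_vector = tfidf_vector(query_tokens)
    scores = {d: cosine(query_vector, tfidf_vector(t)) for d, t in docs.items()}
    matches = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:k]


ranked = rank(["wing", "flow"])
```

Here `d1` ranks first because it contains both query terms and the rarer term `wing` twice, `d2` follows with only the more common term `flow`, and `d3` shares no term with the query and is dropped.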

### Integration

The three components are loosely coupled through file system conventions and data formats. The crawler and indexer communicate through a shared directory structure where HTML files and the URL mapping file reside. The indexer and query processor share access to the serialized index file, which contains all necessary data structures for retrieval. This file-based integration approach provides several advantages: components can be developed and tested independently, the system naturally supports batch processing where corpus collection and indexing occur offline while query processing happens online, and the serialized index serves as a checkpoint that eliminates the need to rebuild indices for repeated query sessions. The indexer implements lazy loading, creating the index only when no existing index file is found. This design allows the system to skip expensive index construction when working with a previously processed corpus, reducing startup time for query processing tasks.


## Architecture

The following section details the code implementation of the three core components, drawing snippets from the source code and noting key design decisions, their implications, and how each component integrates with the rest of the system.

### Web crawler

```python
custom_settings: dict[str, Any] = {
    "DEPTH_LIMIT": 3,
    "CLOSESPIDER_PAGECOUNT": 5000,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1,
    "AUTOTHROTTLE_MAX_DELAY": 5,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.5,
    "CONCURRENT_REQUESTS": 16,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    "ROBOTSTXT_OBEY": True,
}
```

The web crawler is preconfigured with values suited to building a search corpus. Searchme uses a small `DEPTH_LIMIT` of 3 combined with a significantly larger `CLOSESPIDER_PAGECOUNT` of 5000. This combination favors a shallow, wide crawl: the depth limit stops the crawler from following deeply nested link chains, while the generous page limit lets it collect many pages close to the seed. This approach captures diverse content across the crawled site(s) without getting trapped in deep hierarchical structures or infinite link loops. In addition, the AutoThrottle and concurrency settings limit the request rate so that the crawler stays polite and does not overload target websites.
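Scrapy's scheduler uses LIFO queues by default, which yields roughly depth-first ordering; strict breadth-first order requires the FIFO queue settings described in Scrapy's documentation. The fragment below is a sketch of that option only. The name `BREADTH_FIRST_SETTINGS` is hypothetical, and these three keys are not part of the Searchme configuration shown above.

```python
# Hypothetical additions, not present in Searchme's spider settings.
BREADTH_FIRST_SETTINGS = {
    "DEPTH_PRIORITY": 1,
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
}

# Merged with the two limits quoted from the original configuration.
custom_settings = {
    "DEPTH_LIMIT": 3,
    "CLOSESPIDER_PAGECOUNT": 5000,
    **BREADTH_FIRST_SETTINGS,
}
```

With FIFO queues, pages at depth 1 are requested before pages at depth 2, so a page limit reached early still leaves a corpus concentrated near the seed.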

```python
@staticmethod
def only_http_https(url: str):
    scheme = urlparse(url).scheme.lower()
    if scheme in {"", "http", "https"}:
        return url
    return None

link_extractor = LinkExtractor(
    process_value=only_http_https,
    allow_domains=["en.wikipedia.org"],
    deny=[
        r"/user/",
        r"/profile/",
        r"/account/",
        r"/login",
        r"/register",
        r"/signup",
        r"\?action=",
        r"/edit",
        r"/delete",
        r"/admin",
        r"/tag/",
        r"/category/",
        r"/archive/",
        r"/search",
        r"/share",
        r"/print",
        r"/comment",
        r"#comment",
    ],
    deny_extensions=["pdf", "zip", "gz", "tar", "7z", "rar"],
    tags=["a"],
    attrs=["href"],
    canonicalize=True,
    unique=True,
)
```

To better control the behaviour of Searchme's crawler, the spider uses Scrapy's `LinkExtractor` to parse hyperlinks and filter candidate hops between pages. For the purposes of this system, crawling is limited to the `en.wikipedia.org` domain, a promising seed that yields a corpus large enough to test the robustness of the entire system. Moreover, the link extractor applies further URL filtering: it keeps only HTTP/HTTPS links, ignores common page links that offer little to no informational gain, skips downloadable files that the system does not support, and de-duplicates extracted links so that a given URL is downloaded only once.
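The effect of these filters can be approximated with the standard library alone. The sketch below is a simplified stand-in, not Scrapy's actual matching logic: `is_crawlable` is a hypothetical helper, it uses only a subset of the deny patterns, and it handles absolute URLs only (the original `only_http_https` also lets scheme-less relative links through).

```python
import re
from urllib.parse import urlparse

# Subset of the deny patterns from the extractor above, for illustration only.
DENY_PATTERNS = [re.compile(p) for p in (r"/login", r"\?action=", r"/edit", r"/search")]
ALLOWED_DOMAIN = "en.wikipedia.org"
DENIED_EXTENSIONS = (".pdf", ".zip", ".gz", ".tar", ".7z", ".rar")


def is_crawlable(url: str) -> bool:
    """Approximate the scheme, domain, extension, and deny-pattern checks."""
    parsed = urlparse(url)
    if parsed.scheme.lower() not in {"http", "https"}:
        return False
    host = parsed.netloc.lower()
    if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
        return False
    if parsed.path.lower().endswith(DENIED_EXTENSIONS):
        return False
    # A deny pattern matching anywhere in the URL rejects the link.
    return not any(pattern.search(url) for pattern in DENY_PATTERNS)
```

Because the deny patterns are searched anywhere in the URL, a short pattern such as `/edit` also rejects any article whose path happens to contain that substring, which is worth keeping in mind when tuning the list.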

```python
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    output_path_arg = kwargs.get("output_path", cls.default_output_path)
    output_path = Path(output_path_arg)

    feeds = {
        str(output_path / "url_map.jsonl"): {"format": "jsonlines"},
    }
    crawler.settings.set("FEEDS", feeds, priority="spider")

    spider = super().from_crawler(crawler, *args, **kwargs)
    return spider

```

The spider is then configured to keep a trace of all downloaded HTML pages by writing a JSON Lines file that maps each URL to its document ID. The records themselves are produced by the spider's parser.

```python
def parse(self, response):
    self.log(f"Scraped URL @ {response.url}")
    docID = str(uuid.uuid5(uuid.NAMESPACE_URL, response.url))

    # save for mapping (goes into url_map.jsonl)
    yield {"docID": docID, "url": response.url}

    html_output_path = self.output_path / "html"
    filename = f"{docID}.html"

    html_output_path.mkdir(parents=True, exist_ok=True)
    (html_output_path / filename).write_bytes(response.body)
    self.log(f"Saved file {filename}")

    links = self.link_extractor.extract_links(response)
    yield from response.follow_all(links, callback=self.parse)
```

Finally, the parser ties the previous pieces together. Each scraped URL is converted into a deterministic version 5 UUID that serves as its document ID, and the mapping is written to the aforementioned JSON Lines file. Then, the hyperlinks in the downloaded HTML are passed through the previously defined link extractor to filter out bad crawl targets, and the spider recursively follows each remaining link until it reaches the depth or page count limit.
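A useful property of `uuid.uuid5` is that it is derived from a hash of the namespace and the name, so the same URL always yields the same document ID. The short sketch below demonstrates this; `doc_id_for` is a hypothetical wrapper around the call used in `parse`, and the URLs are only examples.

```python
import uuid


def doc_id_for(url: str) -> str:
    """Same derivation as in the spider's parse method."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


first = doc_id_for("https://en.wikipedia.org/wiki/Information_retrieval")
again = doc_id_for("https://en.wikipedia.org/wiki/Information_retrieval")
other = doc_id_for("https://en.wikipedia.org/wiki/Search_engine")
```

Re-crawling the same page therefore overwrites its existing HTML file instead of creating a second copy under a new name.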

### Indexer

```python
def init_invidx_tfdf(doc_idx: dict[str, list[str]]):
    inverted_idx: dict[str, dict[str, list[int]]] = {}

    for doc, term_lst in doc_idx.items():

        # maintain list of positions of term per document
        # note it is always sorted as it assumes list of text is in original ordering
        for pos, term in enumerate(term_lst):
            if term in inverted_idx:
                posting_list = inverted_idx[term].setdefault(doc, [])
                posting_list.append(pos)
            else:
                inverted_idx[term] = {doc: [pos]}

    return inverted_idx
```

The core of Searchme's indexer lies in the inverted index creation function. Given a dictionary of document IDs and their respective lists of tokens, the function iterates through each document and builds an inverted index of unique terms, noting every document that contains each term. Moreover, the positions of each term are recorded per document, which enables future support for positional queries.
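To make the resulting structure concrete, the sketch below builds the same term → document → positions mapping for two toy documents. `build_positional_index` is a compact, hypothetical equivalent of `init_invidx_tfdf`, and the documents are invented for illustration.

```python
def build_positional_index(doc_tokens: dict[str, list[str]]) -> dict[str, dict[str, list[int]]]:
    """Map each term to {doc_id: [positions]} in original token order."""
    index: dict[str, dict[str, list[int]]] = {}
    for doc_id, tokens in doc_tokens.items():
        for position, term in enumerate(tokens):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


toy_docs = {"d1": ["wing", "flow", "wing"], "d2": ["flow", "heat"]}
index = build_positional_index(toy_docs)
# index["wing"] == {"d1": [0, 2]}
# index["flow"] == {"d1": [1], "d2": [0]}
```

From this structure, the term frequency of `wing` in `d1` is the length of its position list (2), and the document frequency of `flow` is the number of documents in its posting dictionary (2).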

```python
# Attr:
# https://stackoverflow.com/questions/312443/how-do-i-split-a-list-into-equally-sized-chunks
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]


def html_tokenizer(html_path: str):
    p = Path(html_path)

    with open(p, "rb") as html:
        soup_obj = BeautifulSoup(html, "html.parser")
        content = soup_obj.get_text(" ", strip=True).lower()

    regex_p = r"[^\w\s]"
    tokens = re.sub(regex_p, "", content).split()
    filtered_tokens = [t for t in tokens if t not in STOPWORDS]

    return filtered_tokens


def ptokenize(html_list: list[Path], doc_idx: dict[str, list[str]]):
    for path in html_list:
        token_list = html_tokenizer(str(path))
        doc_idx[path.stem] = token_list
```

To speed up index creation, the tokenization of documents into the document-token dictionary is split across threads, with the chunk size derived from the number of CPU cores on the host. Each thread tokenizes one chunk of HTML files and writes its results into a shared dictionary keyed by document ID. This reduces the wall-clock time of reading and processing potentially thousands of pages.
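An alternative way to express the same fan-out is `concurrent.futures.ThreadPoolExecutor`, where each task returns its result instead of writing into a shared dictionary. This is a sketch under stated assumptions rather than Searchme's code: it tokenizes in-memory strings instead of HTML files, and `STOPWORDS`, `tokenize_text`, and `tokenize_all` are hypothetical names. Note that in CPython the global interpreter lock means threads mostly help by overlapping file reads; for CPU-bound parsing, `ProcessPoolExecutor` is the usual substitute.

```python
import re
from concurrent.futures import ThreadPoolExecutor

STOPWORDS = {"the", "a", "of"}  # hypothetical, deliberately tiny stopword set


def tokenize_text(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [token for token in cleaned.split() if token not in STOPWORDS]


def tokenize_all(texts: dict[str, str], workers: int = 4) -> dict[str, list[str]]:
    """Tokenize every document in a thread pool, keyed by document ID."""
    doc_ids = list(texts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as its input.
        token_lists = pool.map(tokenize_text, (texts[d] for d in doc_ids))
        return dict(zip(doc_ids, token_lists))


result = tokenize_all({"d1": "The flow of a wing.", "d2": "Heat!"})
```

Returning results from each task keeps the workers free of shared mutable state, and the pool handles uneven chunk sizes without manual slicing.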

```python
class Index:
    def __init__(
        self, idx_file: Path = Path(INV_IDX_FILE), corpus_path: Path = CRWL_OUTPUT
    ):
        self.index_path: Path = INDX_OUTPUT / idx_file
        self._magnitude_cache = {}  # caches document vector magnitudes to speed up cosine search

        if not self.index_path.exists():
            print(f"No inverted index found @ {self.index_path}, creating one...")
            self.create_index(corpus_path)
        else:
            self.load_index()

    def create_index(self, corpus_path: Path):
        if self.index_path.exists():
            raise FileExistsError(
                f"File already exists at {str(self.index_path)}\n     Use load_index() to load file into object instead."
            )

        self.corpus_path: Path = corpus_path
        self.corpus_mapping: Path = self.corpus_path / "url_map.jsonl"

        # the crawler must run first so that url_map.jsonl exists
        if not self.corpus_mapping.exists():
            raise FileNotFoundError(
                f'Expected a "url_map.jsonl" file at @ {self.corpus_mapping}, found none'
            )

        # sort so its not so random for my sake
        docs = sorted((self.corpus_path / "html").iterdir())
        self.corpus_size: int = len(docs)

        # at least 1, so chunks() never gets a zero step when files < cores
        chunk_size = max(1, len(docs) // CPU_COUNT)
        document_chunks = chunks(docs, chunk_size)

        threads = []
        doc_idx: dict[str, list[str]] = {}
        for chunk in document_chunks:
            t = threading.Thread(target=ptokenize, args=(chunk, doc_idx))
            threads.append(t)

        for t in threads:
            t.start()

        for t in threads:
            t.join()

        self.inverted_index: dict[str, dict[str, list[int]]] = init_invidx_tfdf(doc_idx)
        self.bigram_index: dict[str, list[str]] = init_bigram_idx(
            list(self.inverted_index.keys())
        )
        Path(INDX_OUTPUT).mkdir(parents=True, exist_ok=True)

        with open(self.index_path, "w") as idx_file:
            idx_obj = {
                "index_path": str(self.index_path),
                "corpus_path": str(self.corpus_path),
                "corpus_mapping": str(self.corpus_mapping),
                "corpus_size": self.corpus_size,
                "inverted_index": self.inverted_index,
                "bigram_index": self.bigram_index,
            }
            json.dump(idx_obj, idx_file, indent=2)

    def load_index(self):
        if not self.index_path.exists():
            raise FileNotFoundError(
                f"File not found @ {str(self.index_path)}\n Use create_index() to generate a new inverted index file."
            )

        with open(self.index_path, "r") as idx_file:
            idx_obj = json.load(idx_file)

        self.index_path = Path(idx_obj["index_path"])
        self.corpus_path = Path(idx_obj["corpus_path"])
        self.corpus_mapping = Path(idx_obj["corpus_mapping"])
        self.corpus_size = idx_obj["corpus_size"]
        self.inverted_index = idx_obj["inverted_index"]
        self.bigram_index = idx_obj["bigram_index"]
```

To tie these functions together, the `Index` class acts as the central pipeline for inverted index creation. If the given path does not point to an existing index serialized as a JSON file, the index creation pipeline is invoked; otherwise, the existing file is loaded. During creation, the class locates the directory of HTML files downloaded by the crawler and processes each file to build the intermediate document-token dictionary. It then transforms that dictionary into the complete inverted index and its companion bigram index, and writes both to disk.

```python
def get_idf(self, term: str):
    if term not in self.inverted_index:
        return 0

    n = self.corpus_size
    df = len(self.inverted_index[term])

    idf = log10(n / df)
    return idf

def get_tf(self, term: str, doc: str):
    if term not in self.inverted_index:
        return 0

    # a document absent from the posting list has a term frequency of 0
    return len(self.inverted_index[term].get(doc, []))
```

From the same class, the system can retrieve the term frequency (tf) and inverse document frequency (idf) values at query time, which are needed for the cosine similarity calculation that follows.

```python
def cosine_search(self, query_tokens: list[str], k: int = 10):
    query_vector = {}
    for term in query_tokens:
        if term in self.inverted_index:
            tf = query_tokens.count(term)
            idf = self.get_idf(term)
            query_vector[term] = tf * idf

    doc_scores: dict[str, float] = {}
    for term in query_vector:
        for doc_id in self.inverted_index[term]:
            # doc vector component
            doc_tf = self.get_tf(term, doc_id)
            doc_tfidf = doc_tf * self.get_idf(term)

            # dot product component
            doc_scores[doc_id] = (
                doc_scores.get(doc_id, 0) + query_vector[term] * doc_tfidf
            )

    query_magnitude = sum(v**2 for v in query_vector.values()) ** 0.5

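    # normalizing; a zero magnitude (e.g. every term has idf 0) scores 0
    for doc_id in doc_scores:
        doc_magnitude = self._get_doc_magnitude(doc_id)
        denominator = query_magnitude * doc_magnitude
        doc_scores[doc_id] = doc_scores[doc_id] / denominator if denominator else 0.0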

    # best to worst
    ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:k]
```

The final key implementation is the cosine search calculation. It takes a tokenized query and represents it as a tf-idf weighted vector. The same weighting is applied to every candidate document (any document containing at least one query term), and the dot product of each document vector with the query vector is accumulated. The resulting value is then normalized by the magnitudes of the query and document vectors. One concern that became obvious when running this calculation was the time spent computing each document's magnitude for the normalization step. To reduce this cost, the system employs a simple cache, as follows.

```python
def _get_doc_magnitude(self, doc_id: str):
    if doc_id in self._magnitude_cache:
        return self._magnitude_cache[doc_id]

    magnitude_squared = 0

    # for every term in this document
    for term in self.inverted_index:
        if doc_id in self.inverted_index[term]:
            tf = self.get_tf(term, doc_id)
            idf = self.get_idf(term)
            tfidf = tf * idf
            magnitude_squared += tfidf**2

    magnitude = magnitude_squared**0.5  # sqrt

    self._magnitude_cache[doc_id] = magnitude  # put in the cache
    return magnitude
```

The `Index` class keeps a dictionary of previously computed document magnitudes, which is updated whenever `_get_doc_magnitude` is invoked. The function first checks the cache for the requested document before falling back to the expensive calculation, which scans every term in the index. This significantly reduces the time taken for subsequent queries and, as the cache fills, progressively removes the document magnitude bottleneck while the system keeps running. The cache lives in memory only, so it starts empty each time the index is loaded.
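Another option, shown here only as a sketch, is to compute every document magnitude in a single pass over the posting lists, which touches each (term, document) pair once instead of scanning the whole vocabulary per document. The function name `all_doc_magnitudes` and the toy index are hypothetical; the weighting (raw tf times `log10(N / df)`) follows `get_tf` and `get_idf` above.

```python
from math import log10, sqrt


def all_doc_magnitudes(
    inverted_index: dict[str, dict[str, list[int]]], corpus_size: int
) -> dict[str, float]:
    """Compute the tf-idf vector length of every document in one pass."""
    squared: dict[str, float] = {}
    for postings in inverted_index.values():
        idf = log10(corpus_size / len(postings))
        for doc_id, positions in postings.items():
            weight = len(positions) * idf  # tf * idf
            squared[doc_id] = squared.get(doc_id, 0.0) + weight * weight
    return {doc_id: sqrt(total) for doc_id, total in squared.items()}


# Hypothetical index over three documents.
toy_index = {
    "wing": {"d1": [0, 2]},
    "flow": {"d1": [1], "d2": [0]},
    "heat": {"d2": [1], "d3": [0]},
    "shock": {"d3": [1]},
}
magnitudes = all_doc_magnitudes(toy_index, corpus_size=3)
```

The result could be used to pre-fill the cache at load time or be stored alongside the index, at the cost of a longer build step.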

```python
def bigram(term: str):
    gram_set = set()
    togram = term + "$"
    curr = "$"

    for char in togram:
        curr += char

        # Skip bigrams where `$` sits right beside `*`, i.e., the wildcard is at the start or end
        if curr not in ("$*", "*$"):
            gram_set.add(curr)

        curr = curr[1:]

    # Filter out wildcard bigrams
    gram_set = {item for item in gram_set if "*" not in item}

    return gram_set


def init_bigram_idx(terms: list[str]):
    kgram_idx: dict[str, list[str]] = {}

    for t in terms:
        bigrams = bigram(t)

        for b in bigrams:
            if b in kgram_idx:
                kgram_idx[b].append(t)
            else:
                kgram_idx.setdefault(b, [t])

    return kgram_idx
```

The bigram index serves as a supporting data structure for query correction and wildcard matching capabilities in the search interface. Each term in the inverted index is decomposed into character bigrams. These bigrams are then indexed to map back to their source terms, creating a reverse lookup structure. This enables the system to handle misspelled queries or partial term matches by comparing the bigrams of a query term against the bigram index to find the closest matching terms in the vocabulary. When a user submits a query with typos or uses wildcard patterns, the search component can leverage this bigram index to suggest corrections or expand wildcards into concrete terms that exist in the corpus.
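Wildcard expansion is described here but not shown in the excerpts, so the following is a sketch of the standard k-gram approach under stated assumptions: the bigrams of the pattern (ignoring any that contain `*`) are intersected in the index, and the survivors are post-filtered against the full pattern. `bigrams`, `build_bigram_index`, and `expand_wildcard` are hypothetical names, and the index uses sets rather than the lists used by `init_bigram_idx`.

```python
from fnmatch import fnmatchcase


def bigrams(term: str) -> set[str]:
    """Character bigrams of `$term$`, dropping any that contain the wildcard."""
    padded = f"${term}$"
    grams = {padded[i : i + 2] for i in range(len(padded) - 1)}
    return {gram for gram in grams if "*" not in gram}


def build_bigram_index(vocabulary: list[str]) -> dict[str, set[str]]:
    index: dict[str, set[str]] = {}
    for term in vocabulary:
        for gram in bigrams(term):
            index.setdefault(gram, set()).add(term)
    return index


def expand_wildcard(pattern: str, bigram_index: dict[str, set[str]]) -> set[str]:
    """Return vocabulary terms matching a pattern such as `he*t`."""
    grams = bigrams(pattern)
    if not grams:
        return set()  # a bare `*` gives nothing to narrow with in this sketch
    candidates = set.intersection(*(bigram_index.get(g, set()) for g in grams))
    # Sharing bigrams does not guarantee a match, so verify the whole pattern.
    return {term for term in candidates if fnmatchcase(term, pattern)}


vocab_index = build_bigram_index(["heat", "heart", "hat", "shock"])
matches = expand_wildcard("he*t", vocab_index)
```

The post-filter matters because bigram intersection alone can admit terms that contain the right bigrams in the wrong order.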

### Search

The search component serves as the user-facing interface that orchestrates the entire information retrieval pipeline. When a user submits a query, it undergoes several preprocessing steps before retrieval and ranking.

```python
def tokenizer(query: str):
    query_lowercase = query.lower()
    regex_p = r"[^\w\s]"
    tokens = re.sub(regex_p, "", query_lowercase).split()
    filtered_tokens = [t for t in tokens if t not in STOPWORDS]

    return filtered_tokens
```

The tokenizer function applies the same text processing pipeline used during indexing: converting the query to lowercase for case-insensitive matching, removing punctuation through regular expression substitution, splitting on whitespace, and filtering common stopwords. This consistent preprocessing ensures that query terms are represented in the same form as indexed terms, enabling accurate matching in the inverted index.
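A short usage example shows what the pipeline produces for a typical query. The function body is copied from above; the `STOPWORDS` set here is a hypothetical stand-in, since the actual stopword list is not shown in this report.

```python
import re

STOPWORDS = {"what", "is", "the", "of", "a"}  # hypothetical stand-in list


def tokenizer(query: str):
    query_lowercase = query.lower()
    regex_p = r"[^\w\s]"
    tokens = re.sub(regex_p, "", query_lowercase).split()
    filtered_tokens = [t for t in tokens if t not in STOPWORDS]

    return filtered_tokens


tokens = tokenizer("What is the lift-to-drag ratio of a delta wing?")
# tokens == ["lifttodrag", "ratio", "delta", "wing"]
```

Because punctuation is deleted rather than replaced with a space, hyphenated phrases collapse into a single token. The indexer applies the same rule, so the two sides stay consistent.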

```python
def query_correction(query: list[str], bigram_idx: dict[str, list[str]]):
    corrected_query: list[str] = []
    correction_flag = False  # if correction occurred, flip the flag

    for query_term in query:
        bigrams = bigram(query_term)

        # all candidate terms that match the bigrams
        candidate_terms: set[str] = set()
        for bg in bigrams:
            if bg in bigram_idx:
                candidate_terms.update(bigram_idx[bg])

        # no candidates found, keep original term
        if not candidate_terms:
            corrected_query.append(query_term)
            continue

        # best match
        min_distance = float("inf")
        best_match = query_term

        for candidate in candidate_terms:
            lev_dist = edit_distance(candidate, query_term)

            # exact match
            if lev_dist == 0:
                best_match = candidate
                break

            # update best match
            if lev_dist < min_distance:
                min_distance = lev_dist
                best_match = candidate

        if best_match != query_term:
            correction_flag = True

        corrected_query.append(best_match)

    return corrected_query, correction_flag
```

The query correction function leverages the bigram index to handle misspelled or mistyped query terms. For each term in the tokenized query, the function generates its character bigrams and looks up candidate terms in the bigram index that share those bigrams. If no candidates are found, the original term is retained unchanged. When candidates exist, the system computes the Levenshtein edit distance between the query term and each candidate to identify the closest match. An edit distance of zero indicates an exact match, immediately selecting that term. Otherwise, the candidate with the minimum edit distance becomes the corrected term. This approach balances accuracy and efficiency by first narrowing the search space through bigram matching before applying the more expensive edit distance calculation. The function returns both the corrected query and a boolean flag indicating whether any corrections occurred, allowing the system to inform users when their query has been modified.
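The excerpt calls `edit_distance` without showing where it comes from. As a reference for what that distance measures, the sketch below implements the classic dynamic-programming Levenshtein distance with unit costs; `levenshtein` is a hypothetical stand-in and may differ from the function Searchme actually imports.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("serach", "search")
```

Under plain Levenshtein distance, swapping two adjacent letters costs two edits, so transposition typos such as `serach` score the same as two unrelated substitutions.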

```python
def get_url_mapping(url_map_fp: Path) -> dict[str, str]:
    if not url_map_fp.exists():
        raise FileNotFoundError(f"Expected file @ {url_map_fp}, but found nothing")

    with open(url_map_fp, "r") as f:
        # parse each line once and skip blank lines
        records = (json.loads(line) for line in f if line.strip())
        return {rec["docID"]: rec["url"] for rec in records}
```

The URL mapping retrieval function provides the interface between internal document identifiers and their original web addresses. It reads the JSON Lines file produced by the web crawler, parsing each line to construct a dictionary that maps document UUIDs to their source URLs. This mapping is essential for presenting search results to users, as document IDs are meaningless without their corresponding web addresses. The function includes error handling to ensure the mapping file exists before attempting to read it, preventing runtime failures when the corpus has not been properly collected.

```python
def query_pipeline(query: str, index: Index, url_map: dict[str, str]):
    Path(SRCH_LOGS).mkdir(parents=True, exist_ok=True)

    q_tokenized = tokenizer(query)

    if not q_tokenized:
        raise ValueError("Processed query is empty!")

    q_corrected, flag = query_correction(q_tokenized, index.bigram_index)
    q_corrected_str = " ".join(q_corrected)

    documents = index.cosine_search(q_corrected)

    # handle non-existent and empty conditions (if it ever happens)
    file_exist = SRCH_LOGS_FP.exists() and SRCH_LOGS_FP.stat().st_size > 0

    with open(SRCH_LOGS_FP, "a", newline="") as query_log:
        writer = csv.DictWriter(
            query_log,
            fieldnames=[
                "query",
                "corrected_q",
                "is_corrected",
                "docid",
                "url",
                "score",
            ],
            delimiter=";",
        )

        if not file_exist:
            writer.writeheader()

        for doc_id, score in documents:
            writer.writerow(
                {
                    "query": query,
                    "corrected_q": q_corrected_str,
                    "is_corrected": flag,
                    "docid": doc_id,
                    "url": url_map[doc_id],
                    "score": score,
                }
            )

    return documents, q_corrected, flag
```

The query pipeline function integrates all previous components into a complete workflow. It accepts a raw user query, the index object, and the URL mapping, then orchestrates the following sequence: tokenization of the input query, correction of any misspelled terms using the bigram index, execution of the cosine similarity search to retrieve and rank relevant documents, and logging of the complete query session. The logging mechanism records the original query, corrected query, correction status, and all retrieved documents with their scores and URLs. This data is written to a CSV file with semicolon delimiters, maintaining a persistent record of system usage that can be analyzed for performance evaluation or debugging purposes. The function returns the ranked document list along with the corrected query and correction flag, providing the caller with all necessary information to present results to the user and indicate when automatic corrections have been applied. Error handling ensures that empty queries after tokenization raise an informative exception rather than silently failing during retrieval.
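Since the log is plain semicolon-delimited CSV, it can be read back with `csv.DictReader` for later analysis. The sketch below assumes the column names written by `query_pipeline`; the sample rows, URLs, and the `corrected_queries` helper are hypothetical.

```python
import csv
import io

# Hypothetical log contents in the format written by query_pipeline.
sample_log = (
    "query;corrected_q;is_corrected;docid;url;score\n"
    "serach engine;search engine;True;doc-1;https://example.org/a;0.42\n"
    "serach engine;search engine;True;doc-2;https://example.org/b;0.17\n"
    "wing flow;wing flow;False;doc-3;https://example.org/c;0.31\n"
)


def corrected_queries(log_file) -> list[str]:
    """Return the distinct original queries that were auto-corrected."""
    reader = csv.DictReader(log_file, delimiter=";")
    # The boolean flag is stored as the text "True" or "False".
    return sorted({row["query"] for row in reader if row["is_corrected"] == "True"})


result = corrected_queries(io.StringIO(sample_log))
```

The same reader works on the real log by passing an open file handle instead of the in-memory `StringIO` buffer.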


## Operation

Searchme can be deployed locally by cloning the repository at `https://github.com/yddet-www/ir-system.git`.

### Setup & Usage

1. After cloning the repository, change your terminal's working directory to the root of the cloned project.

   ```bash
   cd /path/to/ir-system
   ```

2. With Python installed in the system, run:

   ```bash
   pip install -r requirements.txt
   ```

   This step will install the necessary packages used in Searchme's system. It is advised to first create a Python virtual environment beforehand.

3. Since a fresh project will not contain any document corpus, the system's web crawler must be executed first. Users are free to modify the [`urls.txt`](https://github.com/yddet-www/ir-system/blob/main/src/web-crawler/urls.txt) file to specify the spider's seed URL. To execute the spider:

   ```bash
   python -m scripts.web-crawler.init
   ```

   This executes a predefined script to run the spider with its default settings.

4. Once the spider completes its execution, the system has its corpus of documents ready for the main pipeline. The search interface and inverted index creation/loading are integrated under a single pipeline. To initialize it:

   ```bash
   python -m scripts.start-searchme
   ```

   This script will launch the Flask-based application interface along with the core pipeline, loading the index into the system. The execution takes some time when run for the first time, as it needs to process the corpus and create an inverted index. To access the interface, open your browser of choice and visit `http://127.0.0.1:5000`.


## Test cases

## Conclusion

## Source code

## Bibliography