# Lab 5 - Search Engine
CS 437  
Fall 2025  
Dr. Henderson  
_v1_

---

In this lab you will implement a rudimentary search engine with all the components we have studied so far. We'll start with the user interface and work our way down through the stack.

In [2]:
import importlib

import lab5lib as lab5

In [6]:
importlib.reload(lab5)

<module 'lab5lib' from '/home/ehenders/VOL/CLASSES/CS437/labs/lab5/lab5lib.py'>

### Server

We will use a Python web server framework called Flask to serve your search engine for this lab.

1. Open a **terminal** inside JupyterLab Desktop (File>New>Terminal) and run `pip install flask` to install the Flask framework. Once installed you should be able to run `flask --help` (in the terminal).

By default, Flask looks for a file named `app.py` to run the server. A very simple `app.py` has been provided for you. It serves `form.html` from the `templates` directory, which has a single text input for a search query, and it also serves the documents in the `docs/` directory. When the user presses Submit, `app.py` receives the query. You will need to add a call to your code to process it.

2. In the `submit()` function in `app.py`, call the `search()` function from `lab5lib.py`, passing it the `query` from the form. Pass the returned value to the `make_snippets()` function and assign the result to a variable named `results`. The results will be placed below the form when it is sent back to the user.
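
The data flow in step 2 can be sketched outside Flask as shown here. This is a minimal sketch, not the provided `app.py`: `search()` and `make_snippets()` are stand-ins for the `lab5lib.py` functions, and `handle_query()` is a hypothetical helper name standing in for the body of `submit()`.

```python
# Stand-ins for the lab5lib functions so the sketch runs on its own.
def search(query):
    """Pretend search: returns a ranked list of document ids."""
    return ["running.doc", "learn-python.doc"]


def make_snippets(ranked_docs):
    """Pretend snippet builder: returns one HTML string for all documents."""
    return "\n".join(f'<div class="snippet">{doc}</div>' for doc in ranked_docs)


def handle_query(query):
    """Hypothetical helper showing the two calls that submit() needs to make."""
    ranked_docs = search(query)           # query -> ranked documents
    results = make_snippets(ranked_docs)  # ranked documents -> HTML string
    return results


print(handle_query("running shoes"))
```

In the real `submit()`, the same two calls would go to the functions in `lab5lib.py`, and `results` would be placed below the form in the response.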

To start the server, open a **terminal** in JupyterLab Desktop and type `flask run`. By default the server binds to `localhost:5000`. Open a browser and type in the following address: `http://localhost:5000`. You should see the input form. Try typing something in and clicking `Submit`.

### Query Processing

To be effective, query processing must use the same steps and tools that were used to build the index. You will need to create a function that your search engine can call with the query string and that returns a list of processed query terms.

3. In the `lab5lib.py` file implement the function called `process_query()` so that it returns a list of tuples, each having the original (lowercase) query term and the stemmed term. Your query processing should use these steps/tools:

- Lowercase conversion
- Standard NLTK word tokenizer
- Standard NLTK English stopwords
- Snowball Stemmer   

In [31]:
_ut1_terms = lab5.process_query("The rain in Spain falls mainly on the plain.")
assert _ut1_terms == [("rain","rain"), ("spain","spain"), ("falls","fall"), ("mainly","main"), ("plain","plain") ], f"Incorrect query terms: {_ut1_terms}"
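
The shape of this pipeline can be sketched as follows. It is a sketch under stated assumptions, not the required solution: the function name `process_query_sketch` and the toy tokenizer, stopword set, and stemmer are hypothetical stand-ins so the code runs without NLTK data. In the lab, the three components would instead be `nltk.tokenize.word_tokenize`, `nltk.corpus.stopwords.words("english")`, and the `stem` method of `nltk.stem.snowball.SnowballStemmer("english")`, after fetching the tokenizer and stopword data with `nltk.download()`. The punctuation filter is also an assumption, inferred from the assertion above, whose expected terms contain no `.` token.

```python
import re


def process_query_sketch(query, tokenize, stop_words, stem):
    """Return (original, stemmed) pairs for the non-stopword terms in query."""
    pairs = []
    for token in tokenize(query.lower()):          # lowercase, then tokenize
        if token in stop_words:                    # drop stopwords
            continue
        if not any(ch.isalnum() for ch in token):  # drop pure punctuation (assumption)
            continue
        pairs.append((token, stem(token)))         # keep original term and its stem
    return pairs


# Toy stand-ins so the sketch runs without NLTK or its data files.
def toy_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)


TOY_STOP_WORDS = {"the", "in", "on"}


def toy_stem(token):
    return token[:-1] if token.endswith("s") else token


print(process_query_sketch("The rains in Spain.", toy_tokenize, TOY_STOP_WORDS, toy_stem))
# prints [('rains', 'rain'), ('spain', 'spain')]
```

Passing the tokenizer, stopword list, and stemmer in as parameters makes it easy to reuse exactly the same components for indexing and for queries.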

### Snippets

The `make_snippets()` function in `lab5lib.py` currently returns mock snippets. You will need to generate real snippets for the relevant documents your search engine finds. You can use the snippet function you created for Homework 4 or write a simple one for this lab. Note the HTML formatting in the mock snippets and try to use a similar structure.

4. Implement the `make_snippets()` function in `lab5lib.py` so that it takes a list of ranked documents and calls the `lab5lib.py` function `snippet()` for each, passing in the processed query.

In [7]:
lab5.make_snippets(['running.doc'])

'\n        <div class="snippet">\n        <div class="title">How to Choose the Right <b>Running Shoes</b></div>\n        <div class="url">www.runningworld.com/shoe-guide</div>\n        <div class="description">\n            Find the perfect <b>running shoes</b> for your gait and terrain. Expert advice on cushioning, support, and fit to prevent injuries and improve your <b>running</b> performance.\n        </div>\n    </div>\n    '
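
One way to assemble the HTML for step 4 is sketched here. It makes several assumptions that the lab does not specify: the processed query and the snippet function are passed in as parameters (the names `make_snippets_sketch`, `key_pairs`, and `snippet_fn` are hypothetical), and the document id is reused as both the title and, prefixed with `docs/`, the URL. Only the `snippet`, `title`, `url`, and `description` class names come from the mock output above.

```python
import html


def make_snippets_sketch(ranked_docs, key_pairs, snippet_fn):
    """Build one HTML block per ranked document, in rank order."""
    blocks = []
    for doc_id in ranked_docs:
        safe_id = html.escape(doc_id)                # escape the id before embedding it
        description = snippet_fn(key_pairs, doc_id)  # assumed to return HTML already
        blocks.append(
            '<div class="snippet">\n'
            f'  <div class="title">{safe_id}</div>\n'
            f'  <div class="url">docs/{safe_id}</div>\n'
            f'  <div class="description">{description}</div>\n'
            '</div>'
        )
    return "\n".join(blocks)


# Toy snippet function so the sketch runs on its own.
def fake_snippet(key_pairs, doc_id):
    return f"text about <b>{key_pairs[0][0]}</b> from {doc_id}"


print(make_snippets_sketch(["a.doc", "b.doc"], [("rain", "rain")], fake_snippet))
```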

5. Implement the function in `lab5lib.py` called `snippet()` that takes a processed query (a list of key pairs), a document id, and a max length (default 80), and returns an HTML string representing a relevant snippet from the document (no longer than the max length). The unit test below is a demo using mock data. You will need to try it with a real file and key pairs from the test dataset.

In [9]:
lab5.snippet([ ('python','python'), ('programming', 'program' ) ], 'learn-python.doc')

'Master <b>Python programming</b> with our comprehensive tutorial. Learn variables, functions, and loops in this beginner-friendly <b>Python</b> course with hands-on examples.'
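
A simple snippet strategy is sketched here: start at the first word that matches a query term, take words until the length limit is reached, and wrap matches in `<b>` tags. The assumptions are as follows. The function `snippet_from_text` is hypothetical and takes the document text directly rather than a document id. A word counts as a match if it equals an original term or begins with a stem, which is a rough heuristic that can over-match; stemming each document word would be more faithful to the index. The mock output above is longer than 80 characters, so how the limit is counted is also an assumption: this sketch counts visible characters and excludes the tags.

```python
import html
import re


def snippet_from_text(key_pairs, text, max_length=80):
    """Return an HTML snippet with at most max_length visible characters."""
    originals = {orig for orig, _ in key_pairs}
    stems = {stem for _, stem in key_pairs if stem}

    def is_match(word):
        bare = re.sub(r"\W+", "", word).lower()  # strip punctuation, lowercase
        return bool(bare) and (
            bare in originals or any(bare.startswith(s) for s in stems)
        )

    words = text.split()
    # Start the window at the first matching word, or at the beginning if none match.
    start = next((i for i, w in enumerate(words) if is_match(w)), 0)

    picked, length = [], 0
    for word in words[start:]:
        extra = len(word) + (1 if picked else 0)  # +1 for the joining space
        if length + extra > max_length:
            break
        picked.append(word)
        length += extra

    return " ".join(
        f"<b>{html.escape(w)}</b>" if is_match(w) else html.escape(w) for w in picked
    )


pairs = [("python", "python"), ("programming", "program")]
text = "This course teaches Python programming with hands-on examples."
print(snippet_from_text(pairs, text, max_length=30))
# prints <b>Python</b> <b>programming</b> with
```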

---  
The `search()` function in `lab5lib.py` returns a mocked list of ranked documents (though the documents themselves are real). For now, your search engine's input field will always return the same list of documents, but the query will be used to create the snippets.

6. Try using a random query in your search engine's input field. Inspect the HTML results.

---  

### Text Acquisition, Text Transformation and Index Creation

The first step for any search engine is text acquisition, which involves finding all the documents, giving them identifiers, and ensuring they are accessible. For this lab, text acquisition has been done for you. The documents you will be experimenting with are in the folder labeled `docs`. You will need to process these documents and build an inverted index over them.

7. Copy your code from Homework 3 or 4 to `lab5lib.py` and run the cell below to index the `docs` directory.

In [None]:
lab5.create_db()
lab5.index_dir('docs')
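
As a reminder of what an inverted index holds, here is a minimal in-memory sketch. It is not how `create_db()` or `index_dir()` are implemented, since those come from your homework code and presumably persist to a database. The names `build_inverted_index` and `analyze` are hypothetical, and the toy analyzer only lowercases and splits on whitespace, where the real one would apply the same steps as query processing.

```python
from collections import defaultdict


def build_inverted_index(docs, analyze):
    """Map each term to {doc_id: term frequency}.

    docs:    dict of doc_id -> document text
    analyze: function turning text into a list of index terms
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Toy corpus and analyzer so the sketch runs on its own.
toy_docs = {"a.doc": "Rain in Spain", "b.doc": "rain rain go away"}
toy_index = build_inverted_index(toy_docs, lambda text: text.lower().split())
print(toy_index["rain"])
# prints {'a.doc': 1, 'b.doc': 2}
```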

### Search

The final step is to change your search function so that it finds the ranked list of relevant documents. You can use your code from Homework 4.

8. In the `lab5lib.py` file, replace the mock function called `search()` with your implementation from a previous homework and try running queries from your search engine.
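
If you need a reference point for ranking, here is a minimal sketch that scores documents by summing term frequency times a smoothed inverse document frequency. The scoring formula is an assumption and may differ from the method in your Homework 4. The function name `rank_documents` and the in-memory index layout (`term -> {doc_id: term frequency}`) are also hypothetical.

```python
import math
from collections import defaultdict


def rank_documents(index, query_terms, num_docs):
    """Return doc ids sorted by descending score, ties broken by doc id."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # term not in the index contributes nothing
        # Smoothed IDF so a term found in every document still counts a little.
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy index so the sketch runs on its own.
toy_search_index = {
    "rain": {"a.doc": 1, "b.doc": 2},
    "spain": {"a.doc": 1},
}
print(rank_documents(toy_search_index, ["rain", "spain"], num_docs=2))
# prints ['a.doc', 'b.doc']
```

The query terms passed in should be the stemmed terms from `process_query()`, so that they match the terms stored in the index.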

---

### Submission Instructions

Be sure to ***SAVE YOUR WORK***!  

Next, select Kernel -> Restart Kernel and Run All Cells...

Make sure there are no errors.

Use _File > Save and Export Notebook As > HTML_, then submit your HTML file, **your `lab5lib.py` file**, and **a screenshot of your server after a successful search** to Canvas.