Mini Search Engine

This is our project of building a small document-search engine entirely in C++, which is a part of the CS163 university course. We proudly present a user-friendly console application that can handle a collection of text files and provide users with a variety of search operations with high precision.

Click here for full list of project requirements.

Build

The program is built using TDM-GCC Compiler, GCC 5.1.0, MinGW Windows 32-bit edition. All codes are written in C++11.

Data preprocessing

In order to make searching easily and efficient, we first build a collection out of given raw data.

Filenames are indexed using numbers.
The content of the files are split into words (tokens), then the tokens will be normalized before being inserted into a trie. Each token contains the file index that it occurs, the line containing it, the position in the line where it resides, and a flag that determines whether it appears in the filename or not (yes, filenames can be searched too!).
Thesaurus data is loaded into a hash table for synonym searching (Huge thanks to zaibacu)
Drink coffee ☕

Searching

Query preprocessing

The query string is splitted into terms. Each term can contain one or more words, and be one of the following types: include, exclude, range, synonym, and wild.

To get the occurrences of a term, we first convert it into token.

If a term has one word, it is a token. Thus, occurrences can be fetched directly as follows.
- If include or exclude, occurrences of the token will be fetched
- If range, occurrences of all the numbers in that range will be fetched
- If synonym, occurrences of the token and all of its equivalent ones will be fetched
- If wild, dogs and cats will be fed
If a term has two or more words,
- The occurrences of all words are combined into one list
- Knuth–Morris–Pratt algorithm is used to reduce the list of occurrences of all words to the list of occurrences of the first word of each detected pattern.

Henceforth, operations are conducted based on the tokens' occurrence list.

Operations

In the manner of listing, here is a complete table of supported operators:

Operator	Meaning	Example
`Q1 & Q2`	Get the intersection of results of two queries	`dogs & cats` returns the list of documents that contain both words
`Q1 \| Q2`	Get the union of results of two queries	`dogs \| cats` returns the list of documents that contain either of the two words
`FILETYPE:A`	Only perform search operations on files with type of `A`	`FILETYPE:TXT` reduces the scope of searching to `*.txt` files
`INTITLE:A`	Only perform search operations on files of which title contains `A`	`INTITLE:CATS` reduces the scope of searching to files that have `CATS` in the title
`+A`	The word must be included in the search results	`+dogs` returns the list of documents that contain dogs
`-A`	The word must be excluded from the search results	`-dogs` returns the list of documents that do not contain dogs
`~A`	The word or its synonym must be included in the search results	`~hello` returns the list of documents that contain hello or hi
`A..B`	Pre-requisite: A and B must be numbers Any number in the range must be included in the search results	`2019..2021` returns the list of documents that contain 2019 or 2020 or 2021
`"A B"`	The phrase must be included in the search results	`"dogs and cats"` returns the list of document that contain exactly the phrase dogs and cats
`*`	Match any word	`dogs and *` is equivalent to `dogs and dogs \| dogs and cats \| dogs and parks`

Combining multiple operations in one query is also supported. Try +dogs and -horses ~play "at the park" & year 2010..2020 intitle:June.

Rankings

The occurrences of the querying tokens in all files are encoded into a list A
The order of the tokens in the query are encoded into another list B
Apply Ratcliff/Obershelp algorithm to compute the similarity between A and B
The final score of a file per query is calculated as

To-do list

Search history (planned)
Target more filetypes (unfortunately, only .txt files are accepted at present)
More and more...

Credits

Query processing & Searching: Quyen Dinh (quyenjd), An Pham (iluvyw)
Data processing: Hai Le (pythagore1123)
Interface: Linh Nguyen (linhlrx)

Mentors

Dr. Dinh Ba Tien
M.S. Truong Phuoc Loc
M.S. Ho Tuan Thanh

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
include		include
newData		newData
playground		playground
thesaurus		thesaurus
.gitignore		.gitignore
README.md		README.md
Requirements.pdf		Requirements.pdf
main.cpp		main.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mini Search Engine

Build

Data preprocessing

Searching

Query preprocessing

Operations

Rankings

To-do list

Credits

Mentors

About

Releases 1

Packages

Contributors 4

Languages

quyenjd/Mini-Search-Engine

Folders and files

Latest commit

History

Repository files navigation

Mini Search Engine

Build

Data preprocessing

Searching

Query preprocessing

Operations

Rankings

To-do list

Credits

Mentors

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 4

Languages

Packages