SimpleSearch

A simple search engine and Webcrawler combo project created in C++. The Webcrawler parses out links and search terms and stores them in a MySQL database; the search engine can then search the database and deliver the results to a user. The search engine also includes an HTTP server that uses multiple threads, so it can serve more than one user or instance if needed.

Getting Started

Prerequisites

You will need a MySQL server and database instance. I recommend using XAMPP. Installing via the terminal or another utility works as well.

Configuration

You will need to create a MySQL schema named SimpleSearch with two tables, url and words, which have the following columns:

url

  • name: id_url    datatype: INT(10)    attributes: primary key, not null
  • name: url_data    datatype: VARCHAR(2048)    attributes: not null
  • name: url_title    datatype: VARCHAR(2048)    attributes: default expression ''
  • name: url_desc    datatype: VARCHAR(2048)    attributes: default expression ''

words

  • name: id_words    datatype: INT(10)    attributes: primary key, not null, auto increment
  • name: words_data    datatype: VARCHAR(2048)    attributes: not null
  • name: words_url    datatype: INT(10)    attributes: not null

Make sure to add an initial entry to the url table with a valid link. This gives the Webcrawler somewhere to start from, so feel free to add as many starter links as you want.
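
If you would rather set this up from the MySQL command line than through a GUI tool, the statements below are a rough sketch matching the columns above. The starter link is just a placeholder; swap in whatever URL you want the Webcrawler to begin with.

  CREATE SCHEMA IF NOT EXISTS SimpleSearch;
  USE SimpleSearch;

  -- One row per crawled (or to-be-crawled) page
  CREATE TABLE url (
      id_url    INT(10)       NOT NULL,
      url_data  VARCHAR(2048) NOT NULL,
      url_title VARCHAR(2048) DEFAULT '',
      url_desc  VARCHAR(2048) DEFAULT '',
      PRIMARY KEY (id_url)
  );

  -- One row per word/URL pair found by the Webcrawler
  CREATE TABLE words (
      id_words   INT(10)       NOT NULL AUTO_INCREMENT,
      words_data VARCHAR(2048) NOT NULL,
      words_url  INT(10)       NOT NULL,
      PRIMARY KEY (id_words)
  );

  -- Seed the Webcrawler with at least one starting link (placeholder URL)
  INSERT INTO url (id_url, url_data) VALUES (1, 'https://www.example.com');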

In HTMLParser.h lines 22-26 and httpd.h lines 50-54 you will need to provide your own connection details for your MySQL instance. Additionally, if you would like to change the VARCHAR(2048) columns in the database to a different size, make sure to set line 28 of httpd.h to your new value.
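
For example, shrinking those columns to a hypothetical 1024 characters would look roughly like this on the database side, after which line 28 of httpd.h should be set to the same value:

  -- Example only: 1024 is a placeholder size, not a recommendation
  ALTER TABLE url
      MODIFY url_data  VARCHAR(1024) NOT NULL,
      MODIFY url_title VARCHAR(1024) DEFAULT '',
      MODIFY url_desc  VARCHAR(1024) DEFAULT '';

  ALTER TABLE words
      MODIFY words_data VARCHAR(1024) NOT NULL;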

How to Run

Assuming you are using a Linux environment, start by going to the project directory and typing 'make' to compile the program.

To run the Webcrawler, type './webcrawler <maximum number of links you want parsed>'. Before you run it, make sure there is at least one url present in the database. Also note that the Webcrawler takes some time to parse out links; it works at a rate of about 100 links per minute on my virtual machine from what I've tested so far.
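
An easy way to check for that starting url is a one-line query in the MySQL client (assuming the schema from the Configuration section):

  SELECT COUNT(*) FROM SimpleSearch.url;  -- should be at least 1 before starting the Webcrawler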

To run the Search Engine, type './simple-search -p <port number>'. The -p option allows the server to run with multiple threads, so I recommend always using it. To connect to the server while it is running, go to localhost:<port number>, or to the IP address and port of whatever machine you have it running on.
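
The exact query that simple-search runs isn't shown here, but with the schema above a lookup for a single search term boils down to matching rows in words and joining back to url, roughly like this:

  -- Illustration only: 'example' stands in for a user-supplied search term
  SELECT u.url_data, u.url_title, u.url_desc
  FROM words w
  JOIN url u ON u.id_url = w.words_url
  WHERE w.words_data = 'example';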

Ideas For Future Improvements

  • Make the Webcrawler reliably parse text from webpages to generate descriptions.
  • Look into ways to speed up the Webcrawler and word parsing process. Potentially make it multithreaded.
  • Consider performing more frequent write operations in the Webcrawler, or specifying a starting point for it; I'm afraid it could run out of memory otherwise.
  • Have the Search Engine generate multiple pages of results, instead of just having one continuous page.
  • Add a timer and result counter to display how long the Search Engine takes and how many results it produces.
  • Create nicer-looking pages for the Search Engine.
  • Create a configuration utility to allow users to easily set variables without going into the code.
  • Implement a faster page sort algorithm. Merge Sort would be O(n log n) instead of O(n^2).
  • Look into better ways of ranking search results.

Email me at schippas@purdue.edu if you have any questions, comments, or suggestions.