Distributed search engine

Term project for Distributed Systems (Spring 2017-2018)

A distributed search engine that creates dynamic replicas based on frequencies of search terms and categories from a particular location.

Problem:

A single central server is set up initially (say, in the USA) and receives search queries from users across the world. If many users in India query a certain topic, say soccer, the central server dynamically sets up a replica in India that contains only the data for soccer and related terms. All queries from India that contain soccer or similar terms then go to the replica in India. If that replica does not receive relevant queries for a long time, or has to make room for more indices, the master server deletes the idle indices from it. Furthermore, a backup server should run alongside the master and take over as master if it fails (fault tolerance), so any metadata about the dynamic replicas must be sequentially consistent between the two.
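The replica-creation trigger described above can be sketched as a per-location query counter. This is a minimal illustration, not the project's actual code: the `QueryTracker` class, the `SIMILAR` table, and the threshold of 3 are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical hardcoded similarity table (term -> related terms).
SIMILAR = {
    "soccer": ["football", "fifa"],
    "cricket": ["ipl"],
}

# Assumed number of queries from one location that triggers replication.
REPLICA_THRESHOLD = 3


class QueryTracker:
    """Counts queries per (location, term) on the master."""

    def __init__(self, threshold=REPLICA_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, location, term):
        """Count one query.

        Returns the list of terms to replicate near `location` at the
        moment the threshold is reached, otherwise None.
        """
        key = (location, term)
        self.counts[key] += 1
        if self.counts[key] == self.threshold:
            # Replicate the term together with its similar terms.
            return [term] + SIMILAR.get(term, [])
        return None


tracker = QueryTracker()
decisions = [tracker.record("India", "soccer") for _ in range(3)]
```

Here the third soccer query from India is the one that returns a replication decision; queries from other locations are counted separately and do not contribute to it.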

High-level features implemented:

Replication of frequently accessed search-result data to the server best placed for the client, or set of clients, generating those search queries.
Similar data items must also be replicated. For ease of implementation, the similarity matrix can be hardcoded.
A replica server can store only a fixed number of indices. Hence the least recently accessed replicated data must be replaced when new data becomes relevant.
Metadata is sequentially consistent between master and backup.
The search results must be sound and complete.
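The fixed-capacity replica with least-recently-used replacement can be sketched with an ordered dictionary. This is an illustrative assumption about how such a store might look; the `ReplicaIndexStore` name and its methods are hypothetical and not taken from `replica.py`.

```python
from collections import OrderedDict


class ReplicaIndexStore:
    """Holds at most `capacity` indices; evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._indices = OrderedDict()  # term -> search results, oldest first

    def get(self, term):
        """Return the results for `term`, or None if it is not replicated."""
        if term not in self._indices:
            return None
        self._indices.move_to_end(term)  # mark as most recently used
        return self._indices[term]

    def put(self, term, results):
        """Store an index and return the list of terms evicted to make room."""
        if term in self._indices:
            self._indices.move_to_end(term)
        self._indices[term] = results
        evicted = []
        while len(self._indices) > self.capacity:
            old_term, _ = self._indices.popitem(last=False)
            evicted.append(old_term)
        return evicted

    def terms(self):
        """Terms currently held, from least to most recently used."""
        return list(self._indices)


store = ReplicaIndexStore(capacity=2)
store.put("soccer", ["result-a"])
store.put("cricket", ["result-b"])
store.get("soccer")                      # soccer is now the most recent
evicted = store.put("tennis", ["result-c"])
```

In this run `cricket` is evicted rather than `soccer`, because the lookup refreshed `soccer` before the third index arrived.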

Milestones:

Get client to receive response from the master
Prepare dummy data and similarity matrix
Get the backup system for master ready
Synchronize writes from crawler process to master and backup
Set up dynamic replica based on place and queries, forward new queries there
Shrink replicas if they don't get new queries for a long time
Create a new replica if a dynamic replica fails
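One way to keep replica metadata in the same order on master and backup is to number every update and have the backup accept only the next number in sequence. The sketch below is an in-process stand-in under that assumption: `MetadataStore`, `Master`, and `update_replica` are hypothetical names, and a direct method call takes the place of whatever RPC the project actually uses.

```python
class MetadataStore:
    """Replica metadata plus the sequence number of the last applied update."""

    def __init__(self):
        self.last_seq = 0
        self.replicas = {}  # location -> sorted list of replicated terms

    def apply(self, seq, location, terms):
        """Apply update `seq` only if it is the next one in order."""
        if seq != self.last_seq + 1:
            return False  # gap or duplicate: reject to preserve ordering
        self.replicas[location] = sorted(terms)
        self.last_seq = seq
        return True


class Master:
    """Applies an update locally only after the backup has accepted it."""

    def __init__(self, backup):
        self.store = MetadataStore()
        self.backup = backup

    def update_replica(self, location, terms):
        seq = self.store.last_seq + 1
        # Stand-in for a synchronous call to the backup server.
        if not self.backup.apply(seq, location, terms):
            raise RuntimeError("backup rejected update %d" % seq)
        self.store.apply(seq, location, terms)
        return seq


backup = MetadataStore()
master = Master(backup)
master.update_replica("India", ["soccer", "football"])
master.update_replica("India", ["soccer"])
```

Because the backup sees every update before the master commits it, a backup that takes over holds the same metadata, in the same order, as the failed master.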

zorroblue/distributed-search-engine
