
LiBook: Book Search Engine 🔍

In this repository, you can find the source code for building an inverted-index-based search engine for books obtained both from Project Gutenberg and directly from registered users' accounts. We also implemented both relational and non-relational datamarts to support queries on the available books. This is a microservice-oriented application that consists of the following modules:

  • Crawler: Obtains books directly from the Project Gutenberg platform and stores them in our datalake.
  • Cleaner: Processes the books and prepares them to be indexed.
  • Indexer: Indexes the books into our inverted index structure in Hazelcast (see the sketch after this list).
  • MetadataDatamartBuilder: Creates a metadata datamart for queries.
  • QueryEngine: Offers an API for users to query our inverted index.
  • UserService: Handles users' accounts in MongoDB, and session tokens through a distributed Hazelcast datamart.
  • UserBookProcessor: Processes the books uploaded by users and sends them to the cleaner.
  • ApiGateway: Serves an API that merges all the public APIs of the final application, improving the security of requests.
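
To make the roles of the Indexer and QueryEngine more concrete, here is a minimal, self-contained sketch of how an inverted index can be kept in a Hazelcast MultiMap that maps each term to the ids of the books containing it. It assumes Hazelcast 5.x; the map name, the sample book id, and the sample text are illustrative only, not the project's actual data model.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.multimap.MultiMap;

import java.util.Collection;
import java.util.Locale;

public class InvertedIndexSketch {

    public static void main(String[] args) {
        // Start (or join) a Hazelcast member; the real Indexer would use
        // the cluster configuration shipped with the project.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // term -> ids of the books that contain the term
        MultiMap<String, String> index = hz.getMultiMap("inverted-index");

        // Indexing: tokenize a cleaned book and register each term.
        String bookId = "84"; // e.g. a Project Gutenberg id
        String cleanedText = "frankenstein or the modern prometheus";
        for (String token : cleanedText.toLowerCase(Locale.ROOT).split("\\s+")) {
            index.put(token, bookId);
        }

        // Querying: fetch every book id associated with a term.
        Collection<String> hits = index.get("frankenstein");
        System.out.println("Books containing 'frankenstein': " + hits);

        hz.shutdown();
    }
}
```

Because the MultiMap lives in the Hazelcast cluster rather than in any single module, the QueryEngine can answer lookups against the same structure the Indexer populates.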

Crucially, this project employs three distinct datamart technologies—Hazelcast, MongoDB, and Rqlite. Rqlite, based on SQLite and adapted for clustered usage, is particularly notable for its role in distributed relational database management within the application. The integration of these datamarts enhances the overall scalability, efficiency, and versatility of the search engine, accommodating both centralized and distributed data processing needs.
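
As an illustration of how an Rqlite-backed relational datamart can be reached without any dedicated driver, the sketch below talks to rqlite's HTTP data API (/db/execute for writes, /db/query for reads) using Java's built-in HttpClient. The host and port (rqlite's default 4001) and the books table with its columns are assumptions made for the example, not the project's actual schema.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RqliteDatamartSketch {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://localhost:4001"; // rqlite's default HTTP port

        // Writes go to /db/execute as a JSON array of SQL statements.
        String statements = "["
                + "\"CREATE TABLE IF NOT EXISTS books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)\","
                + "\"INSERT INTO books (id, title, author) VALUES (84, 'Frankenstein', 'Mary Shelley')\""
                + "]";
        HttpRequest write = HttpRequest.newBuilder()
                .uri(URI.create(base + "/db/execute"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(statements))
                .build();
        System.out.println(http.send(write, HttpResponse.BodyHandlers.ofString()).body());

        // Reads go to /db/query; the SQL is passed as a URL-encoded q parameter.
        String q = URLEncoder.encode("SELECT title, author FROM books", StandardCharsets.UTF_8);
        HttpRequest read = HttpRequest.newBuilder()
                .uri(URI.create(base + "/db/query?q=" + q))
                .build();
        System.out.println(http.send(read, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```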


1) How to run (Docker and Docker Compose)

For each module, you should build the corresponding Docker image. Taking the Indexer as a reference, a command like the following should be executed:

docker build -t ricardocardn/indexer path_to_repo/Indexer/.

Alternatively, pull our prebuilt image and run it directly:

docker run -p 8081:8081 --network host ricardocardn/indexer

(*) Specifying the --network host option is crucial; some Hazelcast-related problems may arise if it is omitted. The QueryEngine image can be started in the same way:

docker run -p 8080:8080 --network host susanasrez/queryengine

Other modules, Crawler and Cleaner, are already running on a server whose IP is specified in the Dockerfiles throughout the project, but they can be reconfigured to run locally. If so, take a look at the Docker Compose file and make sure that both modules run on the same machine. Also make sure that ActiveMQ is running before starting the app.


Credits

About

Development of a multi-user book search engine platform built in Java, anchored in an inverted index, encompassing crawling, cleaning, indexing, and efficient querying mechanisms for heightened precision and user experience.
